MiWORD of the Day is… Residual!

Have you ever tried to assemble a Lego set and ended up with mysterious extra pieces? Or perhaps you have cleaned up after a big party and found some confetti hiding in the corners days later? Welcome to the world of “residuals”!

Residuals pop up everywhere. It’s an everyday term but it’s actually fancier than just referring to the leftovers of a meal; it’s also a term used in regression models to describe the difference between observed and predicted values, or in finance to talk about what’s left of an asset. However, nothing I mentioned compares to the role residuals played in machine learning and particularly training deep neural networks.

When you learn an approximation of a function from an input space to an output space using backpropagation, the weights are updated based on the learning rate and gradients that are calculated through chain rule. As a neural network gets deeper, you have to multiply a small value—usually much smaller than 1—multiple times to pass it to the earliest layers, making the neural network excessively hard to optimize. This phenomenon prevalent in deep learning is call the vanishing gradient problem.

However, notice how deep layers of a neural network are usually composed by mappings that are close to identity. This is exactly why residual connections do their magic! Suppose your true mapping from input to output is h(x), and let the forward pass be f(x)+x. It follows that the mapping subject to learning would be h(x)-x, which is close to a zero function. This means f(x) would be way easier to learn under the vanishing gradient problem, since functions that are close to zero functions demand a lower level of sensitivity to each parameter, unlike the identity function.

Now before we dive too deep into the wizardry of residuals, should we use residual in a sentence?

Serious: Neuroscientists wanted to explore if CNNs perform similarly to the human brain in visual tasks, and to this end, they simulated the grasp planning using a computational model called the generative residual convolutional neural network.

Less serious: Mom: “What happened?”
Me: “Sorry Mom, but after my attempt to bake chocolate cookies, the residuals were a smoke-filled kitchen and a cookie-shaped piece of charcoal that even the dog wouldn’t eat”

See you in the blogosphere,
Mason Hu

MiWORD of the Day is… Silhouette Score!

Silhouette score… is that some sort of way to measure whose silhouette looks better? Or how identifiable the silhouettes are? Well… kind of! It turns out that in statistics, silhouette score is a measure for how “good” a clustering algorithm is. It considers two factors: cohesion and separation. Particularly, how compact is the cluster? And how separated is the cluster from other clusters?

Let’s say you asked your friend to group a bunch of cats into 3 clusters based on where they were sitting on the floor, because you wanted to know whether the cats sit in groups or if they just sit randomly. How can we determine how “good” your friend clustered them? Let’s zoom in to one specific cat who happens to be placed in Cluster 1. We first look at intra-cluster distance, which would be the mean distance to all other cats in Cluster 1. We then take the mean nearest-cluster distance, which would be the distance between the cat and the nearest cluster the cat is not a part of, either Cluster 2 or 3, in this case.

To have a “good” clustering algorithm, we want to minimize the intra-cluster distance and maximize the mean nearest-cluster distance. Together, this can be used to calculate our silhouette score for one cat. Then, we can repeat this for each cat and average the score for all cats to get the overall silhouette score. Silhouette score ranges from -1 to +1, and the higher the score, the better! A high score indicates that the cats are generally similar to the other cats in their clusters and distinct from the cats in other clusters. A score of 0 means that clusters are overlapping. So, if it turns out that the cats were sitting in distinct groups and your friend is good at clustering, we’d expect a high silhouette score.

Now, to use it in a sentence!

Serious: I am unsure of how many clusters I should group my data into for k-means clustering… it seems like choosing 3 or 4 will give me the same silhouette score of 0.38!

Less serious (suggested to me by ChatGPT): I tried sorting my sock drawer by color, But it’s a bit tricky with all those shades of grey. I mean, I can’t even tell the difference between dark grey and mid grey. My sock drawer’s silhouette score is so low!

See you in the blogosphere!
Lucie Yang

MiWORD of The Day is … Feature Extraction!

Imagine you have a photo of a cat sitting in a garden. If you want to describe the cat to someone who has never seen it, you might say it has pointy ears, a furry body, and green eyes. These details are the features that make the cat unique and distinguishable.

Similarly, in medical imaging, ML algorithms like CNN are widely used to analyze images like X-rays or MRIs. The CNN works like a set of filters that look for specific features in the image, such as edges, corners, or textures, and then combines these features to create a representation of the image.

For example, when looking at a chest X-ray, a CNN can detect features like the shape of the lungs, blood vessels, and other structures. By analyzing these features, CNN can identify patterns that indicate the presence of a disease like pneumonia or lung cancer. The CNN can also analyze other medical images, like MRIs, to detect tumors, blood clots, or other abnormalities.

To perform feature extraction, CNN applies a series of convolutional filters to the image, each designed to detect a specific pattern or feature. The filters slide over the image, computing the dot product between the filter and the corresponding pixel values in the image to produce a new feature map. These feature maps are then passed through non-linear activation functions to increase the discriminative power of the network. CNN then down-samples the feature map to increase the robustness of the network to translation and rotation. This process is repeated multiple times in a CNN, with each layer learning more complex features based on the previous layers. The final output of the network is a set of high-level features that can be used to classify or diagnose medical conditions.

Now let’s use feature extraction in a sentence!

Serious: “How can we ensure that the features extracted by a model are truly representative of the underlying data and not biased towards certain characteristics or attributes?”

Less Serious:
My sister: “You know, finding the right filter for my selfie is like performing feature extraction on my face.”

Me: “I guess you’re just trying to extract the most Instagram-worthy features right?”

MiWORD of the Day is… Domain Shift!

From looking at the image from two different domains, could you tell what are they?
Hmmm? Is this a trick question or not, aren’t they the same? You might ask.
Yes, you are right. They are all bags. They are generally the same object, and I am sure you can easily tell just at a glimpse. However, unlike human beings, if you let a machine learning model read these images from two different domains, it would easily get confused by them, and eventually, make mistakes in identifying them. This is known as domain shift in Machine Learning.

Domain shift, also known as distribution shift, usually occurs in deep learning models
when the data distribution changes when the model reads the data. For instance, let’s say a deep learning model is trained on a dataset containing the images of backpacks on domain 1 (see the backpack image above). The model itself would then learn the specific features of the backpack image from domain 1 like the size, shape, angle of the picture taken etc. When you take the exact same model to test or retrain on the backpack images from domain 2, due to a slight variation in the background angle, the data distribution of the model encounters shifts a little bit, which would most likely result in a drop in model performance.

Deep learning models, such as a CNN model, are also widely used in the medical
imaging industry. Researchers have been implementing deep learning models in image
classification, segmentation and other tasks. However, because different imaging centers might use different machines, tools, and protocols, the datasets on the exact same image modality across different imaging centers might differ. Therefore, a model might experience a domain shift when it encounters a new unseen dataset which has variation in the data distribution.

Me: “What can we do if a domain shift exists in a model between the source and target dataset?”
Professor Tyrrell: “Try mixing the target dataset with some images from the source dataset! ”

Less serious:
Mom: “I heard that your brother is really good at physics, what is your domain?”
Me: “I used to be an expert on Philosophy, but now due to my emerging interest in AI, I shift my domain to learning Artificial Intelligence.”
Mom: “Oh! A domain shift!”

MiWORD of the Day is… Self-Supervised Learning!

Having just finished my intro to psychology course, I found that a useful way to learn the material (and guarantee a higher score on the test) was giving myself pop quizzes. In particular, I would write out a phrase and erase a keyword, then I will try to fill in the blank. Turns out this is such an amazing study method that even machines are using it! This method of learning, where the learner learns more in-depth by creating a quiz (e.g. fill-in-the-blank) for itself constitutes the essence of self-supervised learning.

Many machine learning models follow the encoder-decoder architecture, where an encoder network first extracts useful representations from the input data and the decoder network then uses the extracted representations to perform some task (e.g. image classification, semantic segmentation). In a typical supervised learning setting, when there is a large amount of data but only little of which is labelled, all the unlabelled data would have to be discarded as supervised learning requires a model be trained using input-label pairs. On the other hand, self-supervised learning utilizes the unlabelled data by introducing a pretext task to first pretrain the encoder such that it extracts richer and more useful representations. The weights of the pretrained encoder can then be transferred to another model, where it is fine-tuned using the labelled data to perform a specified downstream task (e.g. image classification, semantic segmentation). The general idea of this process is that better representations can be learned through mass unlabelled data, which provides the decoder with a better starting point and ultimately improves the model’s performance on the downstream task.

The choice of pretext task is paramount for pretraining the encoder as it decides what kind of representations can be extracted. One powerful pretrain method that has yielded higher downstream performance on image classification tasks is SimCLR. In the SimCLR framework, a batch of  images are first sampled, and each image is applied two different augmentations  and , resulting in  images. Two augmented versions of the same image is called a positive pair, and two augmented versions of different images is called a negative pair. Each pair of the  images is passed to the encoder  to produce representations  and , and these are then passed to a projection layer  to obtain the final representations  and . A contrastive loss defined using cosine similarity operates on the final representations such that the encoder  would produce similar representations for positive pairs and dissimilar representations for negative pairs. After pretraining, the weights of the encoder  could then be transferred to a downstream image classification model.

Although better representations may be extracted by the encoder using self-supervised learning, it does require a large unlabelled dataset (typically >100k images for vision-related tasks).

Now, using self-supervised learning in a sentence:

Serious: Self-supervised learning methods such as SimCLR has shown to improve downstream image classification task performance.

Less Serious: I thought you will be implementing a self-supervised learning pipeline for this project, why are you teaching the machine how to solve a rubik’s cube instead? (see Self-supervised Feature Learning for 3D Medical Images by Playing a Rubik’s Cube)

See you in the blogosphere!

Paul Tang

MiWORD of the Day is… Attention!

In cognitive psychology, attention refers to the process of concentrating mental effort on sensory or mental events. When we attenuate to a certain object over others, our memory associated with that object is often better. Attention, according to William James, also involves “withdrawing from some things in order to effectively deal with others.” There are lots of things that are potential objects of our attention, but we attend to some things and ignore others. This ability helps our brain save processing resources by suppressing irrelevant features.

In image segmentation, attention is the process of highlighting the relevant activations during training. Attention gates can learn to focus on target features automatically through training. Then during testing, they can highlight salient information useful for a specific task. Therefore, just like when we allocate attention to specific tasks our performance would be improved, the attention gates would also improve model sensitivity and accuracy. In addition, models trained with attention gates also learn to suppress irrelevant regions as humans do; hence, reducing the computational resources used on irrelevant activations.

Now let’s use attention in a sentence by the end of the day!

Serious: With the introduction of attention gates in standard U-Net, the global information of the recess distention is obtained, and the irrelevant background noise is suppressed which in turn increases the model’s sensitivity and leads to smoother and more complete segmentation.

Less serious:
Will: That lady said I am a guy worth paying attention to (。≖ˇ∀ˇ≖。)
Nana: Sadly, she said that to the security guard…

Today’s MiWORD of the day is… Agreement!

You know that magical moment where you and your friend finally agree on a place to eat, or a movie to watch, and you wonder what lucky stars had to align to make that happen? When the chance of agreement was so small that you didn’t think you’d ever decide? If you wanted to capture how often you and your friend agree on a restaurant or a movie in such a way that accounted for whether it was due to random chance, Cohen’s Kappa is the choice for you.

Agreement can be calculated just by taking the number of agreed upon observations divided by the total observations; however, Jacob Cohen believed that wasn’t enough. As agreement was typically used for inter-rater reliability, Cohen argued that this measure didn’t account for the fact that sometimes, people just guess–especially if they are uncertain. In 1960, he proposed Cohen’s Kappa as a counter to traditional percent agreement, claiming his measure was more robust as it accounted for random chance agreement.

Cohen’s Kappa is used to calculate agreement between two raters–or in machine learning, it can be used to find the agreement between the prediction sets of two models. It is calculated by subtracting the probability of chance agreement from the probability of observed agreement, all over one minus the probability of chance agreement. Like many correlation metrics, it ranges from -1 to +1. A negative value of Cohen’s Kappa indicates that there is no relationship between the raters, or that they had a tendency to give different ratings. A Cohen’s Kappa of 0 indicates that there is no agreement between the predictors above what would be expected by chance. A Cohen’s Kappa of 1 indicates that the raters are in complete agreement.

As Cohen’s Kappa is calculated using frequencies, it can be unreliable in measuring agreement in situations where an outcome is rare. In such cases, it tends to be overly conservative and underestimates agreement on the rare category. Additionally, some statisticians disagree with the claim that Cohen’s Kappa accounts for random chance, as an explicit model of how chance affected decision making would be necessary to say this decisively. The chance adjustment of Kappa simply assumes that when raters are uncertain, they completely guess an outcome. However, this is highly unlikely in practice–usually people have some reason for their decision.

Let’s use this in a sentence, shall we?
Serious: The Cohen’s Kappa score between the two raters was 0.7. Therefore, there is substantial agreement between the raters’ observations.
Silly: A kappa of 0.7? They must always agree on a place to eat!

Today’s MiWORD of the day is… Artifact!

When the ancient Egyptians built the golden Mask of Tutankhamun or carved a simple message into the now infamous Rosetta Stone, they probably didn’t know that we’d be holding onto them centuries later, considering them incredible reflections of Egyptian history.

Both of these are among the most famous artifacts existing today in museums. An artifact is a man-made object that’s considered to be of high historical significance. However, in radiology, an artifact is a lot less desirable – it refers to parts of an image that appear differently and inaccurately reflect the body structures they are taken of.

Artifacts in radiography can happen to any image. For instance, they can occur from improper handling of machines used to take medical scans, patient movement during imaging, external objects (i.e. jewelry, buttons) and other unwanted occurrences.

Why are artifacts so important? They can lead to misdiagnoses that could be detrimental to a patient. Consider a hypothetical scenario where a patient goes in for imaging for a tumor. The radiologist identifies the tumor as benign, but in reality, due to mishandling of a machine, an image artifact exists on the image that hides the fact that it is in fact malignant. The outcome would be catastrophic in this case!

Of course, this kind of diagnosis is highly unlikely (especially with modern day medical imaging) and there a ton of factors at play with diagnosis. A real diagnosis, especially nowadays, would not be so simple (or we would be wrong not to severely lament the state of medicine today). However, even if artifacts don’t cause a misdiagnosis, they can pose obstacles to both radiologists and researchers working with these images.

One such area of research is the application of machine learning into the field of medical imaging. Whether we’re using convolutional neural networks or vision transformers, all of these machine learning models rely on images produced in some facility. The quality of these images, including the presence and frequency of artifacts, can affect the outcome of any experiments conducted with them. For instance, imagine you build a machine learning model to classify between two different types of ultrasound scans. The performance of the model is certainly a main factor – but the concern that the model might be focusing on artifacts within the image rather than structures of interest would also be a huge consideration.

In any case, the presence of artifacts (whether in medical imaging or in historical museums) certainly gives us a lot more to think about!

Now onto the fun part, using artifact in a sentence by the end of the day:

Serious: My convolutional neural network could possibly be focusing on artifacts resulting from machine handling in the ultrasound images during classification rather than actual body structures of interest. That would be terrible.

Less serious: The Rosetta Stone – a phenomenal, historically significant, hugely investigated Egyptian artifact that happened to be a slab of stone on which I have no idea what was inscribed.

I’ll see you in the blogosphere!

Jeffrey Huang

MiWORD of the day is…Compression!

In physics, compression means that inward forces are evenly applied to an object from different directions. During this process, the atoms in the object change their position. After the forces are removed, the object may be restored depending on the type of materials it is made of. For example, when compression is applied to an elastic material such as a rubber ball, the air molecules inside the ball are compressed with decreased volume. After the compression force is removed, it quickly restores to its original sphere shape. On the other hand, when a compression force is applied to the brick, the solid clay cannot be compressed. Therefore, the compressive forces concentrate on the weakest point, causing the block to break at the middle point.

For images, compression is the process of encoding digital image information by using a specific encoding scheme. After compression, the image will have a smaller size. An image can be compressed because of the degree of redundancy. Since the neighboring pixels of an image are correlated, the information may be considered redundant among the neighboring pixels. During compression, these redundant pixel values (i.e., values close to zero when encoding digital image information) are removed by comparing with the neighboring pixels. The higher the ratio is, the more the small values are removed. The image after compression uses fewer bits than the original unencoded representation, therefore, achieving size reduction purposes.

In the area of medical imaging, there are two different methods of compression that are commonly used, JPEG and JPEG2000. JPEG2000 is a new image coding system that compresses images based on the two-dimensional discrete wavelet transform (WDT). Unlike JPEG compression which decomposes the image based on the frequency content, JPEG decomposes image signals based on scale or resolution. It performs better than JPEG with much better image quality at moderate compression ratios. However, JPEG2000 does not perform well at a high compression ratio with the image appearing blurred.

Now onto the fun part, using compression in a sentence by the end of the day!

Serious: This high compression ratio has caused so much information loss that I cannot even recognize that it is a dog.

Less serious:

Tong: I could do compression with my eyes.

Grace: Really? How?

Tong: It is simple. Just remove the eyeglasses. Everything becomes compressed.

See you in the blogosphere!

Tong Su

MiWORD of the day is…Transformers and “The Age of Extinction” of CNNs?

Having had studied Machine Learning and Neural Networks for a long time, I no longer think of the movie when I hear the word “transformers”. Now, when I hear CNN, I no longer think of the news channel. So, I had a confusing conversation with my Social Sciences friend, when she said that CNN was biased, and I asked if her dataset was imbalanced. Nevertheless, why are we talking about this?

Before I learned about Neural Networks, I always wondered as to how computers could even “look” at images, let alone, tell us something about that image. Then, I learned about Convolutional Neural Networks, or CNNs! They work by sliding a small “window” across an image, while trying to make sense of the pixels that the window sees. As the CNN trains on images, it learns how to pick out edges and shapes that help it make sense of images down the line. For almost a decade, the best performing image models relied on convolutions. They are designed to do very well with images due to their “inductive bias” or “expertise” on images. These sliding window operations make it suitable to detect patterns in images.

Transformers, on the other hand, were designed to work well with sequences of words. They take in a sequence of encoded words and can perform various tasks with them, such as language translation, sentiment analysis etc. However, they are so versatile that, in 2020, they were shown to outperform CNNs on image tasks. How the heck does a model designed for text even work with images you might ask! Well, you might have heard of the saying, “an image is worth a 1000 words.” But in 2020, Dosovitskiy et al. said “An image is worth 16×16 words”. In this paper, they cut up an image into patches of 16×16 pixels. Pixels from each patch were then fed into a transformer model as if each patch were a word from a text. On training this model on millions of images, it was found that it outperformed CNNs, even though it does not have that inductive bias! Essentially, it learns to look at images by looking at a lot of images.

Now, just like the Transformers franchise, a new paper on different flavors of vision transformers drops every year. And just as the movies in the franchise take a lot of money to make, transformers take a lot of data to train. However, once pretrained on enough data, they can smash CNNs out the park when further finetuned on small datasets like those common in medical imaging. 

Now let’s use transformers in a sentence…

Serious: My pretrained vision transformer finetuned to detect infiltration in these chest X-Ray images, outperformed the CNN.

Less Serious: I have 100,000 images, that’s enough data to train my Vision Transformer from scratch! *Famous last words*

See you in the blogosphere!

Manav Shah