## MiWORD of the Day is… Residual!

Have you ever tried to assemble a Lego set and ended up with mysterious extra pieces? Or perhaps you have cleaned up after a big party and found some confetti hiding in the corners days later? Welcome to the world of “residuals”!

Residuals pop up everywhere. It’s an everyday term, but it’s fancier than just the leftovers of a meal: in regression models it describes the difference between observed and predicted values, and in finance it refers to what’s left of an asset. However, none of these compares to the role residuals play in machine learning, particularly in training deep neural networks.

When you learn an approximation of a function from an input space to an output space using backpropagation, the weights are updated based on the learning rate and on gradients calculated through the chain rule. As a neural network gets deeper, the gradient reaching the earliest layers is a product of many small values (usually much smaller than 1), one per layer, so it shrinks toward zero and makes the network excessively hard to optimize. This phenomenon, prevalent in deep learning, is called the vanishing gradient problem.

However, notice that the deep layers of a neural network are usually composed of mappings that are close to the identity. This is exactly why residual connections do their magic! Suppose the true mapping from input to output is h(x), and let the forward pass be f(x)+x. It follows that the mapping left to learn is f(x) = h(x)-x, which is close to the zero function. This makes f(x) much easier to learn under the vanishing gradient problem, since functions close to the zero function demand a lower level of sensitivity to each parameter than the identity function does.
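To make the effect concrete, here is a minimal sketch under a strong simplifying assumption: every layer is a scalar linear map, so a plain layer computes w*x and a residual layer computes x + w*x. The function names and the toy weights are illustrative only; this is not how a real network is trained, just a way to see how the chain-rule product behaves with and without the skip connection.

```python
def plain_gradient(weights):
    """Gradient of the output w.r.t. the input for stacked layers y = w * x."""
    grad = 1.0
    for w in weights:
        grad *= w  # chain rule: each plain layer contributes a factor w
    return grad


def residual_gradient(weights):
    """Gradient of the output w.r.t. the input for stacked layers y = x + w * x."""
    grad = 1.0
    for w in weights:
        grad *= 1.0 + w  # the skip connection adds 1 to every factor
    return grad


# Hypothetical toy network: 20 layers, each with a small weight of 0.1.
toy_weights = [0.1] * 20
print(plain_gradient(toy_weights))     # on the order of 1e-20: the gradient has vanished
print(residual_gradient(toy_weights))  # stays well above 1
```

In this toy setting the residual path keeps every chain-rule factor near 1 even when the learned part f(x) is close to zero, which is the intuition behind the paragraph above.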

Now before we dive too deep into the wizardry of residuals, should we use residual in a sentence?

Serious: Neuroscientists wanted to explore if CNNs perform similarly to the human brain in visual tasks, and to this end, they simulated the grasp planning using a computational model called the generative residual convolutional neural network.

Less serious: Mom: “What happened?”
Me: “Sorry Mom, but after my attempt to bake chocolate cookies, the residuals were a smoke-filled kitchen and a cookie-shaped piece of charcoal that even the dog wouldn’t eat”

See you in the blogosphere,
Mason Hu

## Lucie Yang’s STA299 Journey

Hello! My name is Lucie Yang, and I am excited to share my experience with my ROP project this summer! I’m heading into my second year, pursuing a Data Science specialist. While I have been interested in statistics for a long time, I was not sure exactly what field to pursue. Over the past year, I became fascinated with machine learning and decided to apply to Prof. Tyrrell’s posting, despite being in my first year and not having any previous experience with machine learning or medical imaging. To my surprise, I was accepted and thus began my difficult, yet incredibly rewarding journey at the lab.

I remember Prof. Tyrrell had warned me during my interview that the research process would be challenging for me, but still, I was excited and confident that I could succeed. The first obstacle I encountered was choosing a research project. Despite spending hours scrolling through lessons on Coursera and YouTube and reading relevant papers to build my understanding, I struggled to come up with a topic that was feasible, novel, and interesting. I would go to the weekly ROP meetings thinking I had come up with a brilliant idea, only to realize that there was some problem that I had not even considered. After finally settling on an adequate project, I was met with another major obstacle: actually implementing it.

My project was about accelerating the assessment of heterogeneity on an X-Ray dataset with Fourier-transformed features. Past work done in the lab had shown that cluster analysis of features extracted from CNN models could indicate dataset heterogeneity; therefore, I wanted to explore whether the same would hold for Fourier-transformed features and whether it would be faster to use them. With the help of a previous student’s code, implementing the CNN pipeline was relatively straightforward; however, I struggled to understand how to apply the Fast Fourier Transform to images and extract the features. As deadlines loomed and time ticked away, I was unsure whether my code was even correct and became very frustrated. Prof. Tyrrell and Mauro gave me immense help, refining my methodology with me and answering my many questions. After that, I was able to get back on track and, thankfully, completed the rest of my project in time.

I learned a lot from this journey, far more than I have in any class I’ve taken, from the exciting state-of-the-art technologies being developed to the process of conducting research and writing code for machine learning. Above all, I gained a deeper appreciation of the bumpy road of research, and I am incredibly grateful to have had the opportunity to get a taste of it. I am very thankful to all the helpful lab members, and I look forward to continuing my journey in data science and research in the coming years!

Lucie Yang

## MiWORD of the Day is… Silhouette Score!

Silhouette score… is that some sort of way to measure whose silhouette looks better? Or how identifiable the silhouettes are? Well… kind of! It turns out that in statistics, silhouette score is a measure for how “good” a clustering algorithm is. It considers two factors: cohesion and separation. Particularly, how compact is the cluster? And how separated is the cluster from other clusters?

Let’s say you asked your friend to group a bunch of cats into 3 clusters based on where they were sitting on the floor, because you wanted to know whether the cats sit in groups or if they just sit randomly. How can we determine how well your friend clustered them? Let’s zoom in on one specific cat who happens to be placed in Cluster 1. We first look at the intra-cluster distance, which would be the mean distance to all other cats in Cluster 1. We then take the mean nearest-cluster distance, which would be the mean distance between the cat and all the cats in the nearest cluster the cat is not a part of, either Cluster 2 or 3 in this case.

To have a “good” clustering algorithm, we want to minimize the intra-cluster distance and maximize the mean nearest-cluster distance. Together, this can be used to calculate our silhouette score for one cat. Then, we can repeat this for each cat and average the score for all cats to get the overall silhouette score. Silhouette score ranges from -1 to +1, and the higher the score, the better! A high score indicates that the cats are generally similar to the other cats in their clusters and distinct from the cats in other clusters. A score of 0 means that clusters are overlapping. So, if it turns out that the cats were sitting in distinct groups and your friend is good at clustering, we’d expect a high silhouette score.
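As a rough illustration of the calculation described above, the sketch below computes the score for each cat as (b - a) / max(a, b), where a is the intra-cluster distance and b is the mean nearest-cluster distance, then averages over all cats. The cat positions and labels are made up for the example, and the function assumes Euclidean distance and at least two clusters.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette score over all points (assumes at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for a cluster with a single member
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from the point to itself is 0, so it does not affect the sum)
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical cat positions on the floor: two tight, well-separated groups.
cats = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(cats, [0, 0, 1, 1])  # friend groups neighbours together
bad = silhouette_score(cats, [0, 1, 0, 1])   # friend mixes the two groups
print(good, bad)  # good is close to +1, bad is negative
```

In practice a library routine would normally be used instead; the point of the sketch is only to show where cohesion (a) and separation (b) enter the score.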

Now, to use it in a sentence!

Serious: I am unsure of how many clusters I should group my data into for k-means clustering… it seems like choosing 3 or 4 will give me the same silhouette score of 0.38!

Less serious (suggested to me by ChatGPT): I tried sorting my sock drawer by color, but it’s a bit tricky with all those shades of grey. I mean, I can’t even tell the difference between dark grey and mid grey. My sock drawer’s silhouette score is so low!

See you in the blogosphere!
Lucie Yang

## Christine Wang’s STA299 Journey

Hi! My name is Christine Wang, and I’m finishing my third year at the University of Toronto pursuing a specialist in statistics with a focus on cognitive psychology. The STA299 journey through the whole year has been a really amazing and challenging experience.

My research project involved assessing whether the heterogeneity of medical images affects the clustering of image features extracted from the CNN model. Initially, I found it quite challenging to understand the difference between my research and the previous work done by Mauro, who analyzed the impact of heterogeneity on the generalizability of CNN by testing the overall model performance on the test clusters. Thanks to the discussions in the weekly ROP meetings, I understood that I needed to retrain the CNN model using the images in each of the clusters in the training set to see how heterogeneity could affect the clustering of image features. By checking whether the retrained CNN models from each cluster performed differently, I was able to show that heterogeneity could affect the clustering of image features. However, the most challenging part of the research was not achieving the desired results, but rather interpreting what I could learn from those results. For instance, even though I obtained results that showed the retrained models perform differently, I spent a lot of time trying to understand what the clusters represent and why some retrained models perform better than others. I am very grateful to Professor Pascal Tyrrell for helping me understand my project and providing me with essential advice to check the between-cluster distances. This enabled me to interpret the results and identify a possible pattern: the retrained models with similar performance come from clusters that are also close to each other. However, further research is still required because the two datasets I used were not large enough. Looking back, I realize that it would have been better if I had used the dataset in our lab, as finding an appropriate dataset and code was very challenging. I would like to thank Mauro, Atsuhiro, and Tristal for their generous help in teaching me how to do feature extraction and cluster analysis.

Before starting the project, I was fascinated by the high accuracy and excellent performance of ML techniques. However, during the ROP journey, I realized that achieving high model performance is not the most important thing. As Professor Pascal mentioned, the most crucial aspect of doing research is truly understanding what we are doing and focusing on interpreting what we can learn from the results we obtain. It is not enough to just have tables and figures; we need to go further by choosing appropriate statistical analysis to understand our results.

## MiWORD of the Day is… Feature Extraction!

Imagine you have a photo of a cat sitting in a garden. If you want to describe the cat to someone who has never seen it, you might say it has pointy ears, a furry body, and green eyes. These details are the features that make the cat unique and distinguishable.

Similarly, in medical imaging, ML algorithms like convolutional neural networks (CNNs) are widely used to analyze images like X-rays or MRIs. A CNN works like a set of filters that look for specific features in the image, such as edges, corners, or textures, and then combines these features to create a representation of the image.

For example, when looking at a chest X-ray, a CNN can detect features like the shape of the lungs, blood vessels, and other structures. By analyzing these features, the CNN can identify patterns that indicate the presence of a disease like pneumonia or lung cancer. A CNN can also analyze other medical images, like MRIs, to detect tumors, blood clots, or other abnormalities.

To perform feature extraction, a CNN applies a series of convolutional filters to the image, each designed to detect a specific pattern or feature. The filters slide over the image, computing the dot product between the filter and the corresponding pixel values in the image to produce a new feature map. These feature maps are then passed through non-linear activation functions to increase the discriminative power of the network. The CNN then down-samples the feature map (pooling), which makes the network more robust to small shifts and distortions in the image. This process is repeated multiple times in a CNN, with each layer learning more complex features based on the previous layers. The final output of the network is a set of high-level features that can be used to classify or diagnose medical conditions.
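The three steps described above (sliding a filter, applying a non-linear activation, down-sampling) can be sketched in a few lines of plain Python. The tiny 5x5 "image" and the hand-written vertical-edge filter are assumptions for illustration; in a real CNN the filter values are learned during training rather than chosen by hand.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take the dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Non-linear activation: keep positive responses, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Down-sample by keeping the largest value in each size x size window."""
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# Hypothetical image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written filter that responds to dark-to-bright transitions.
edge_filter = [[-1, 1], [-1, 1]]

features = max_pool(relu(conv2d(image, edge_filter)))
print(features)  # the strongest responses sit where the edge is
```

Stacking several such stages, each with many learned filters, is what lets later layers respond to more complex patterns than simple edges.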

Now let’s use feature extraction in a sentence!

Serious: “How can we ensure that the features extracted by a model are truly representative of the underlying data and not biased towards certain characteristics or attributes?”

Less Serious:
My sister: “You know, finding the right filter for my selfie is like performing feature extraction on my face.”

Me: “I guess you’re just trying to extract the most Instagram-worthy features right?”

## Alice Zhang’s STA299 Journey

Hi friends! My name is Alice Zhang. I am finishing my third year of undergrad pursuing a statistical science specialist with a focus on genetics and biotechnology, as well as a biology minor. It was a blessing to take part in STA299Y ROP with Professor Tyrrell and his MiDATA lab. As this experience comes to an end, I would like to share about my incredible journey.

Coming into the lab, I held great interest but zero research experience and zero knowledge about machine learning. I remember being completely lost and worried in my very first lab meeting. Looking back, I’m actually quite proud of how far I’ve come. My project was to compare multiple-instance classifiers and single-instance classifiers for diagnosing knee recess distension ultrasounds. I also explored factors that may influence multiple-instance model training.

The start of my project was rather smooth compared to others since it was more application-based than theoretical. I was able to grasp key concepts through literature searches and gather usable models and datasets (thanks to Mauro) needed to begin the project. However, with a lack of research experience and weak background in programming, I soon faced obstacles, confusion, panic and doubts. I had the tools in hand, but the hard part was designing, running and interpreting appropriate experiments. How do I modify and apply the code to my ultrasound data? How do I fairly compare two dissimilar algorithms? How do I unbiasedly alter and compare training factors? How do I give rational interpretations of the outcomes and unusual observations?

As the project progressed, I constantly felt that I was falling behind; I was still doubting and modifying my experiments while my peers obtained results, and I was still training my models while others were starting the write-up. To be honest, I panicked in every ROP meeting, but with the support of Professor Tyrrell, lab members and my peers, I was able to power through. I am so grateful for having Professor Tyrrell as my guide through the first doorstep of research. He taught me that research isn’t about finding and reporting a standard answer; it is a process of discovering and then solving problems, and there’s no template for it. I was constantly encouraged to reflect on the “what”, “how” and “why” of the process. I also greatly appreciate the help from Mauro, who prepared the dataset and spent many hours guiding me through programming and model training.

Progressing through the project, I was later able to solve problems and fix bugs independently. I went from zero to completing my very first research project in machine learning. It feels like I’ve raised my first “research baby”! I would like to once again thank Professor Tyrrell and the lab members for their support; I couldn’t have gained this marvellous learning experience without them.

## Diana Escoboza’s ESC499 Journey

Hello there! My name is Diana Escoboza, and I’ve just finished my undergraduate studies at UofT in Machine Intelligence Engineering. I was very fortunate to have Prof. Tyrrell as my supervisor while I worked on my engineering undergraduate thesis project ESC499 during the summer. I believe such an experience is worth sharing!

My project consisted of training an algorithm to identify/detect the anatomical landmarks on ultrasounds for the elbow, knee, and ankle joints. In medical imaging, it is challenging to correctly label large amounts of data since we require experts, and their time is minimal and costly. For this reason, I wanted my project to compare the performance of different machine learning approaches when we have limited labelled data for training.

The approaches I worked on were reinforcement and semi-supervised learning. Reinforcement learning is based on learning optimal behaviour in an environment through decision-making. In this method, the model would ‘see’ a section of the image and choose a direction to move towards the target landmark. In semi-supervised learning, both labelled and unlabelled data are used for training, and it consists of feeding the entire image to the model for it to learn the target’s location. Finally, I analysed the performance of both architectures and the training resources used to determine the optimal architecture.

While working on my project, I sometimes got lost in the enthusiasm and possibilities and overestimated the time I had. Prof. Tyrrell was always very helpful in advising me throughout my progress, keeping me realistic about the limited time and resources I had while still giving me the freedom to work on my interests. The team meetings not only provided help, but they were also a time when we would talk about AI research and have interesting discussions that would excite us for our projects and future possibilities. We also had a lot of support from the grad students in the lab, who provided us with great help when we encountered obstacles. A big shout-out to Mauro for saving me when I was freaking out because my code wasn’t working and time was running out.

Overall, I am very grateful for having the opportunity to work with such a supportive team and for everything I learned along the way. With Prof. Tyrrell, I gained a better understanding of scientific research and advanced my studies in machine learning. I want to thank the MiDATA team for all the help and for providing me with such a welcoming environment.

## MiWORD of the Day is… Domain Shift!

Looking at the images from two different domains, could you tell what they are?
Hmmm? Is this a trick question or not? Aren’t they the same? You might ask.
Yes, you are right. They are all bags. They are generally the same object, and I am sure you can easily tell at a glance. However, unlike human beings, if you let a machine learning model read these images from two different domains, it would easily get confused by them and eventually make mistakes in identifying them. This is known as domain shift in Machine Learning.

Domain shift, also known as distribution shift, usually occurs in deep learning models when the distribution of the data the model reads differs from the distribution it was trained on. For instance, let’s say a deep learning model is trained on a dataset containing the images of backpacks on domain 1 (see the backpack image above). The model itself would then learn the specific features of the backpack image from domain 1, like the size, shape, angle of the picture taken, etc. When you take the exact same model to test or retrain on the backpack images from domain 2, due to a slight variation in the background and angle, the data distribution the model encounters shifts a little bit, which would most likely result in a drop in model performance.

Deep learning models, such as a CNN model, are also widely used in the medical imaging industry. Researchers have been implementing deep learning models in image classification, segmentation and other tasks. However, because different imaging centers might use different machines, tools, and protocols, the datasets on the exact same image modality across different imaging centers might differ. Therefore, a model might experience a domain shift when it encounters a new unseen dataset which has variation in the data distribution.
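A toy sketch can show why such a shift hurts. Assume, purely for illustration, that each image is summarized by a single intensity value, that the "model" is just a threshold halfway between the two class means, and that a second imaging center produces the same images with a constant brightness offset of 1.5. All numbers and names here are hypothetical.

```python
def fit_threshold(values, labels):
    """'Train' by placing the decision boundary midway between the two class means."""
    mean0 = sum(v for v, y in zip(values, labels) if y == 0) / labels.count(0)
    mean1 = sum(v for v, y in zip(values, labels) if y == 1) / labels.count(1)
    return (mean0 + mean1) / 2


def accuracy(values, labels, threshold):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = [1 if v > threshold else 0 for v in values]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Source domain (center A): healthy scans (0) are dark, diseased scans (1) are bright.
labels = [0, 0, 0, 1, 1, 1]
source_x = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
# Target domain (center B): same scans, but the machine adds a brightness offset.
target_x = [v + 1.5 for v in source_x]

threshold = fit_threshold(source_x, labels)
source_acc = accuracy(source_x, labels, threshold)
target_acc = accuracy(target_x, labels, threshold)
print(source_acc, target_acc)  # performance drops on the shifted domain
```

Nothing about the underlying objects changed between the two centers; only the distribution of the measured values did, and that alone is enough to move samples across the learned boundary.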

Serious:
Me: “What can we do if a domain shift exists in a model between the source and target dataset?”
Professor Tyrrell: “Try mixing the target dataset with some images from the source dataset!”

Less serious:
Mom: “I heard that your brother is really good at physics, what is your domain?”
Me: “I used to be an expert on Philosophy, but now, due to my emerging interest in AI, I have shifted my domain to learning Artificial Intelligence.”
Mom: “Oh! A domain shift!”

## Will Wu’s ROP299 Journey

Hey folks! My name is Will Wu. I have just finished my second year at the University of Toronto, where I am pursuing a Statistics Specialist and a Computer Science minor. I recently wrapped up my final paper for the ROP project with Professor Pascal Tyrrell. Looking back on the entire experience, I feel grateful to have had such an opportunity to learn and engage in research activities, so I find it meaningful to share my experience in the lab!

In the first couple of meetings that I attended, I sometimes found it difficult to follow the concepts and projects discussed during the lab meeting, but Professor Tyrrell would usually explain the concepts we were unfamiliar with. As I worked more on the slide deck about Machine Learning, I became familiar with some common AI knowledge, the logic behind neural networks and, most importantly, their significance in medical imaging.

When I was looking for an area of research related to both Machine Learning and medical imaging, Professor Tyrrell introduced us to a few interesting topics, one of which was domain shift. After a bit of literature review on this topic, I learned about catastrophic forgetting, domain adaptation and out-of-distribution shift. Domain shift represents a shift in the data distribution when a deep learning model sees a new, unseen set of data from a different dataset. This often occurs in medical imaging, as different imaging centers have different acquisition tools or rules, which might lead to differences between datasets. Therefore, I found it interesting to study the impact of domain shift on the performance of a CNN model and how to quantify such a shift, especially between regular CT scans and low-dose CT scans.

My project required training and retraining the CNN model to observe this impact on model performance, which often frustrated me as errors and potential risks of overfitting kept showing up. Most of the time, I would look online for a quick fix and adjust the model as well as the dataset to eliminate the problem. Mauro and Atsuhiro also provided tremendous help in sorting out the mistakes I might have made during the experiment. The weekly ROP meeting was super helpful as well, because Professor Tyrrell would listen to our follow-ups and give us valuable suggestions to aid our research.

Throughout the entire research experience, there were frustrations, endeavours and successes. Overall, this was a wonderful experience for me. I not only learned a lot about Statistics, Machine Learning and its implementation in medical imaging, but I also got to know how research is generally conducted and, most importantly, acquired valuable skills along the way. Thank you to the lab members for their kind help in guiding me through such an intriguing experience!

## MiWORD of the Day is… Self-Supervised Learning!

Having just finished my intro to psychology course, I found that a useful way to learn the material (and guarantee a higher score on the test) was giving myself pop quizzes. In particular, I would write out a phrase, erase a keyword, and then try to fill in the blank. Turns out this is such an amazing study method that even machines are using it! This method of learning, where the learner learns more in depth by creating a quiz (e.g. fill-in-the-blank) for itself, constitutes the essence of self-supervised learning.

Many machine learning models follow the encoder-decoder architecture, where an encoder network first extracts useful representations from the input data and the decoder network then uses the extracted representations to perform some task (e.g. image classification, semantic segmentation). In a typical supervised learning setting, when there is a large amount of data but only a small part of it is labelled, all the unlabelled data would have to be discarded, as supervised learning requires that a model be trained using input-label pairs. On the other hand, self-supervised learning utilizes the unlabelled data by introducing a pretext task to first pretrain the encoder such that it extracts richer and more useful representations. The weights of the pretrained encoder can then be transferred to another model, where it is fine-tuned using the labelled data to perform a specified downstream task (e.g. image classification, semantic segmentation). The general idea of this process is that better representations can be learned from large amounts of unlabelled data, which provides the decoder with a better starting point and ultimately improves the model’s performance on the downstream task.

The choice of pretext task is paramount for pretraining the encoder, as it decides what kind of representations can be extracted. One powerful pretraining method that has yielded higher downstream performance on image classification tasks is SimCLR. In the SimCLR framework, a batch of N images is first sampled, and two different augmentations t and t′ are applied to each image, resulting in 2N images. Two augmented versions of the same image are called a positive pair, and two augmented versions of different images are called a negative pair. Each pair of the 2N images is passed to the encoder f(·) to produce representations h_i and h_j, and these are then passed to a projection layer g(·) to obtain the final representations z_i and z_j. A contrastive loss defined using cosine similarity operates on the final representations such that the encoder f(·) would produce similar representations for positive pairs and dissimilar representations for negative pairs. After pretraining, the weights of the encoder f(·) could then be transferred to a downstream image classification model.
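The contrastive loss mentioned above can be sketched in plain Python. This is a minimal, unoptimized illustration of a cosine-similarity contrastive loss in the style SimCLR uses (often called NT-Xent), not the authors' implementation. It assumes the final representations are stored so that entries 2k and 2k+1 are the two augmented views of image k, and the temperature of 0.5 is an arbitrary choice for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(z, temperature=0.5):
    """Average contrastive loss over all views.

    z[2k] and z[2k + 1] are assumed to be the projections of the two
    augmented views of image k (a layout chosen for this sketch).
    """
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the positive partner: 0<->1, 2<->3, ...
        positive = cosine_similarity(z[i], z[j]) / temperature
        # Denominator runs over every other view, positives and negatives alike.
        others = [cosine_similarity(z[i], z[k]) / temperature
                  for k in range(n) if k != i]
        total += math.log(sum(math.exp(s) for s in others)) - positive
    return total / n


# Hypothetical 2-D projections for a batch of two images (four views).
aligned = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (0.1, 1.0)]     # positives agree
misaligned = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.1), (0.1, 1.0)]  # positives disagree
print(contrastive_loss(aligned), contrastive_loss(misaligned))
```

The loss is smaller when the two views of the same image land close together and far from the views of other images, which is the behaviour the pretraining step tries to encourage in the encoder.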

Although better representations may be extracted by the encoder using self-supervised learning, it does require a large unlabelled dataset (typically >100k images for vision-related tasks).

Now, using self-supervised learning in a sentence:

Serious: Self-supervised learning methods such as SimCLR have been shown to improve downstream image classification task performance.

Less Serious: I thought you would be implementing a self-supervised learning pipeline for this project. Why are you teaching the machine how to solve a Rubik’s cube instead? (see Self-supervised Feature Learning for 3D Medical Images by Playing a Rubik’s Cube)

See you in the blogosphere!

Paul Tang