MiWORD of the Day is… Self-Supervised Learning!

Having just finished my intro to psychology course, I found that a useful way to learn the material (and guarantee a higher score on the test) was giving myself pop quizzes. In particular, I would write out a phrase, erase a keyword, and then try to fill in the blank. Turns out this is such an amazing study method that even machines are using it! This method of learning, where the learner learns more in-depth by creating a quiz (e.g. fill-in-the-blank) for itself, constitutes the essence of self-supervised learning.

Many machine learning models follow the encoder-decoder architecture, where an encoder network first extracts useful representations from the input data and the decoder network then uses the extracted representations to perform some task (e.g. image classification, semantic segmentation). In a typical supervised learning setting, when there is a large amount of data but only a small portion of it is labelled, all the unlabelled data would have to be discarded, as supervised learning requires that a model be trained on input-label pairs. Self-supervised learning, on the other hand, utilizes the unlabelled data by introducing a pretext task to first pretrain the encoder so that it extracts richer and more useful representations. The weights of the pretrained encoder can then be transferred to another model, which is fine-tuned on the labelled data to perform a specified downstream task (e.g. image classification, semantic segmentation). The general idea is that better representations can be learned from mass unlabelled data, which gives the downstream model a better starting point and ultimately improves its performance on the downstream task.
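To make the pretrain-then-transfer step concrete, here is a minimal PyTorch sketch; the toy encoder, the file name, and the classifier head are all illustrative stand-ins, not the architecture from any particular paper:

```python
import torch
import torch.nn as nn

# Toy encoder: in practice this would be a ResNet or similar backbone.
encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())

# ...pretrain `encoder` on unlabelled data with a pretext task, then save it:
torch.save(encoder.state_dict(), "pretrained_encoder.pt")

# The downstream model reuses the encoder and adds a task-specific head.
downstream = nn.Sequential(encoder, nn.Linear(16, 2))  # e.g. binary classifier
encoder.load_state_dict(torch.load("pretrained_encoder.pt"))

# Fine-tune on the small labelled dataset (optionally freezing early layers).
optimizer = torch.optim.Adam(downstream.parameters(), lr=1e-4)
```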

The choice of pretext task is paramount for pretraining the encoder, as it decides what kind of representations can be extracted. One powerful pretraining method that has yielded higher downstream performance on image classification tasks is SimCLR. In the SimCLR framework, a batch of N images is first sampled, and two different augmentations t and t' are applied to each image, resulting in 2N augmented images. Two augmented versions of the same image are called a positive pair, and two augmented versions of different images are called a negative pair. Each of the 2N images is passed to the encoder f(·) to produce a representation h, which is then passed to a projection head g(·) to obtain the final representation z. A contrastive loss defined using cosine similarity operates on the final representations, such that the encoder f(·) learns to produce similar representations for positive pairs and dissimilar representations for negative pairs. After pretraining, the weights of the encoder f(·) can then be transferred to a downstream image classification model.
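For the curious, here is a minimal PyTorch sketch of SimCLR's contrastive (NT-Xent) loss; the batch size, embedding dimension, and temperature below are illustrative defaults rather than the exact training configuration:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss used in SimCLR.
    z1, z2: (N, d) projections of the two augmented views of the same N images."""
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d) unit vectors
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    sim = sim.masked_fill(torch.eye(2 * N, dtype=torch.bool), float("-inf"))
    # For row i, the positive is the other augmented view of the same image.
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(0, N)])
    return F.cross_entropy(sim, targets)

# Example: a batch of N=8 images -> 16 augmented views -> 128-d projections.
loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
```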

Although self-supervised learning can help the encoder extract better representations, it does require a large unlabelled dataset (typically >100k images for vision-related tasks).

Now, using self-supervised learning in a sentence:

Serious: Self-supervised learning methods such as SimCLR have been shown to improve downstream image classification performance.

Less Serious: I thought you would be implementing a self-supervised learning pipeline for this project, so why are you teaching the machine how to solve a Rubik's Cube instead? (see Self-supervised Feature Learning for 3D Medical Images by Playing a Rubik's Cube)

See you in the blogosphere!

Paul Tang

Paul Tang’s STA299 Journey

Hi! My name is Paul Tang, and I just finished my second year at UofT, pursuing a computer science specialist and a cognitive science major. This summer, I enrolled in STA299 under the supervision of Prof. Pascal Tyrrell to learn how to conduct research, and I will be sharing my experience in this reflection blog post.

The first phase of my ROP experience concerned formulating a research question. Having a keen interest in machine learning, I got my inspiration for combining it with my research from a weekly lab meeting where Mauro presented his graduate research work (on the generation of synthetic ultrasound image data). I decided to focus on the problem that the amount of annotated data in the field of medical imaging is often too limited for effective supervised training. Eventually, by reading papers and discussing my ideas with Prof. Tyrrell during the first few weeks, the solution I decided on was to use self-supervised learning to pretrain a machine learning model to improve its performance. In particular, I chose DenseCL, a contrastive-learning-based self-supervised learning method. Luckily, I got my data right at the lab, using the ultrasound knee recess distension dataset for semantic segmentation. My ROP project compared the effect of DenseCL pretraining on segmentation performance.

At first, I was doubtful of my research question: after all, many papers I read had already shown that self-supervised pretraining improves task performance, so wouldn't my research be too "obvious"? However, I realized along the way that some interesting gaps still existed (e.g. current self-supervised pretraining methods used in the domain of medical imaging do not extract local image features, which could be helpful for segmentation tasks), and these gave me confidence and excitement for my research.

Getting to work, I first identified the GitHub repositories I would use in my project. Setting up the environment and the repositories to work with my dataset took much longer than expected (in fact, I had to switch to a different GitHub repository due to "false advertising" from the original one), and I learned that checking with lab members (Mauro, Atsuhiro) and asking for ideas before starting to work on anything can save much-needed time. I made several mistakes while training my models. When I first obtained the performance result (mIoU) from my segmentation model, I was relieved that it was consistent with previous results obtained in the lab. However, using this model in another experiment produced highly atypical results, which led me back to debug the model. Eventually the problem was traced to a small batch size. Although this mistake cost me much training time, it did allow me to explore and gain familiarity with the configuration of a machine learning model, which I found very rewarding.

Eventually, I obtained results showing a small performance improvement from using DenseCL pretraining for the segmentation of ultrasound knee distention images. My project still had its limitations: my result was not statistically rigorous, as I didn't account for randomness in the training process. Furthermore, the number of images I used for DenseCL pretraining was much smaller than what would typically be used in a self-supervised learning setting. These limitations serve as great motivation for further research.

This research experience taught me how humbling doing research is: many things I took for granted require careful testing, and many gaps still exist in the current literature upon closer inspection. I am thankful for Prof. Tyrrell's openness in allowing us to choose our own research questions, and for all the help the lab members (especially Mauro and Atsuhiro) provided to me.

Paul Tang

MiWORD of the Day is… Attention!

In cognitive psychology, attention refers to the process of concentrating mental effort on sensory or mental events. When we attend to a certain object over others, our memory associated with that object is often better. Attention, according to William James, also involves "withdrawing from some things in order to effectively deal with others." There are lots of things that are potential objects of our attention, but we attend to some things and ignore others. This ability helps our brain save processing resources by suppressing irrelevant features.

In image segmentation, attention is the process of highlighting the relevant activations during training. Attention gates can learn to focus on target features automatically through training. Then, during testing, they can highlight salient information useful for a specific task. Just as our performance improves when we allocate attention to a specific task, attention gates improve model sensitivity and accuracy. In addition, models trained with attention gates also learn to suppress irrelevant regions as humans do, hence reducing the computational resources wasted on irrelevant activations.
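To make this concrete, below is a minimal PyTorch sketch of an additive attention gate in the style of Attention U-Net (Oktay et al., 2018); the channel sizes are illustrative, and a full implementation would also handle upsampling the gating signal to match the skip connection:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate: skip-connection features `x` are reweighted by
    a mask computed from `x` and the coarser gating signal `g`, suppressing
    irrelevant activations."""
    def __init__(self, channels_x, channels_g, channels_int):
        super().__init__()
        self.wx = nn.Conv2d(channels_x, channels_int, kernel_size=1)
        self.wg = nn.Conv2d(channels_g, channels_int, kernel_size=1)
        self.psi = nn.Conv2d(channels_int, 1, kernel_size=1)

    def forward(self, x, g):
        # g is assumed already upsampled to x's spatial size
        alpha = torch.sigmoid(self.psi(torch.relu(self.wx(x) + self.wg(g))))
        return x * alpha  # attention coefficients in [0, 1] gate the skip features

# Example: gate 64-channel skip features with a 128-channel gating signal.
gate = AttentionGate(64, 128, 32)
out = gate(torch.randn(1, 64, 56, 56), torch.randn(1, 128, 56, 56))
```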

Now let’s use attention in a sentence by the end of the day!

Serious: With the introduction of attention gates into the standard U-Net, the global information of the recess distention is obtained and the irrelevant background noise is suppressed, which in turn increases the model's sensitivity and leads to smoother and more complete segmentation.

Less serious:
Will: That lady said I am a guy worth paying attention to (。≖ˇ∀ˇ≖。)
Nana: Sadly, she said that to the security guard…

Nana Ye’s STA299 Journey

Hi everyone! My name is Nana Ye, and I am finishing my second year at the University of Toronto as a statistical science specialist and cognitive science major. I was grateful to participate in an ROP (Research Opportunities Program) project under the guidance of Professor Tyrrell during the summer of 2022. This project provided me with a valuable opportunity to learn about machine learning and understand scientific research. I would love to share my experiences with you all!

My project analyzed the effect of adding attention gates to U-Net for knee recess distention ultrasound segmentation. The recess distention area detected by the ultrasonic signal is similar to the image background, and ultrasound images often contain a large amount of noise, distortion, and shadow, which causes blurred local details, large dark areas, and no obvious boundaries. Thus, I wanted to see whether adding attention gates to the standard U-Net would improve segmentation accuracy. Prior to this project, I had not learned about machine learning; therefore, being able to implement a machine learning model on real-world patient data was exciting and challenging.

The journey of my ROP had a rocky start. I started off hoping to do a different project that dealt with comparing Vision Transformers and Convolutional Neural Networks on segmentation tasks for objects located in different regions of the image (central and non-central). However, when I was searching for a ViT model, I struggled with its implementation on my dataset, and since ViT is new to the lab I could not get much help with its implementation from others. Thus, I made the decision to change my project. Professor Tyrrell was supportive of my decision and provided me with several articles to read, which led me to my current project. When I was worried about falling behind because others were already training their models, Professor Tyrrell reassured me that understanding what is feasible in a given time frame is also a valuable lesson. Atsuhiro and Mauro also offered me lots of help along the way. When I was having a tough time understanding the technical aspects of image processing, Atsuhiro scheduled a meeting with me to explain the concepts and answer all my questions. With their help, I was able to finish my first research project in machine learning and obtain promising results.

Overall, it was a completely different experience from other courses at the university. Researching as an ROP student in Professor Tyrrell's lab gave me the opportunity to carry out a research project from the very beginning (doing background research and picking a topic) to the very end (analyzing the results and revising the report). In the process, not only did I learn technical knowledge about machine learning and medical imaging, but I also learned to manage a project timeline efficiently, think critically, and problem-solve independently. I feel privileged to have been one of the ROP students in Professor Tyrrell's lab and to have gained such worthwhile experience that will benefit my academic career.

Adele Lauzon’s ROP399 Journey

Hi there! My name is Adele Lauzon, and I've just finished up my 3rd year at UofT with a major in statistics and minors in computer science and psychology. A huge highlight of my year has been my ROP399 with Professor Tyrrell, where I got to do a deep dive into the intersection of statistics, computer science, and biomedical data.

A little bit about my background–I went to high school in Houston, Texas, which is where I first fell in love with statistics. I remember my AP Statistics teacher beginning our first class with a quote by esteemed statistician John Tukey, where he claimed statistics was the best discipline because it meant you got to “play in everyone’s backyard.” As I’ve gotten farther along in my statistics education, I’ve realized how much truth is behind that phrase. Statistics is wonderful because it allows you to understand other fields simply based on the data you use. Through this ROP, I’ve been able to learn a bit more about the field of medicine.

My project was about measures of confidence in binary classification algorithms using biomedical data. Specifically, I investigated error consistency and error agreement–meaning I took a close look at what was happening when the model was making incorrect predictions. I'm not going to lie, probably the hardest part of this project was just getting started. I have a little bit of programming experience due to my computer science minor, but I had a lot of catching up to do compared to my classmates. A word of advice–set yourself up on the GPUs early. Running my code locally made for a frighteningly overheated laptop.

Probably my biggest takeaway from this course was how the process of research actually works. While the scientific method is helpful, it doesn't account for all of the back-and-forth you are guaranteed to be doing. This is where documenting all of your steps really comes in handy. If you reach an obstacle and need to reevaluate, keep a record of what you were doing beforehand in case you need to backtrack. I made this mistake and ended up redoing work I had already done.

All in all, this ROP has been such a valuable experience to me. Many thanks to Professor Tyrrell and the rest of the MiDATA team for their unwavering patience!

Today’s MiWORD of the day is… Agreement!

You know that magical moment where you and your friend finally agree on a place to eat, or a movie to watch, and you wonder what lucky stars had to align to make that happen? When the chance of agreement was so small that you didn’t think you’d ever decide? If you wanted to capture how often you and your friend agree on a restaurant or a movie in such a way that accounted for whether it was due to random chance, Cohen’s Kappa is the choice for you.

Agreement can be calculated simply by taking the number of agreed-upon observations divided by the total observations; however, Jacob Cohen believed that wasn't enough. As percent agreement was typically used for inter-rater reliability, Cohen argued that this measure didn't account for the fact that sometimes people just guess–especially if they are uncertain. In 1960, he proposed Cohen's Kappa as a counter to traditional percent agreement, claiming his measure was more robust as it accounted for random chance agreement.

Cohen's Kappa is used to calculate agreement between two raters–or, in machine learning, between the prediction sets of two models. It is calculated by subtracting the probability of chance agreement from the probability of observed agreement, all over one minus the probability of chance agreement. Like many correlation metrics, it ranges from -1 to +1. A negative Cohen's Kappa indicates agreement worse than would be expected by chance–that is, the raters tended to give different ratings. A Cohen's Kappa of 0 indicates no agreement above what would be expected by chance, and a Cohen's Kappa of 1 indicates that the raters are in complete agreement.
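In symbols, that is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement. Here is a minimal NumPy sketch of the calculation (the rating vectors are made up for illustration; scikit-learn's cohen_kappa_score computes the same quantity):

```python
import numpy as np

def cohens_kappa(a, b):
    """Kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for chance."""
    a, b = np.asarray(a), np.asarray(b)
    p_o = np.mean(a == b)                                # observed agreement
    labels = np.union1d(a, b)
    # Chance agreement: probability both raters independently pick each label.
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in labels)
    return (p_o - p_e) / (1 - p_e)

rater1 = [1, 0, 1, 1, 0, 1, 0, 0]
rater2 = [1, 0, 1, 0, 0, 1, 1, 0]
print(cohens_kappa(rater1, rater2))  # 0.5: p_o = 0.75, p_e = 0.5
```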

As Cohen’s Kappa is calculated using frequencies, it can be unreliable in measuring agreement in situations where an outcome is rare. In such cases, it tends to be overly conservative and underestimates agreement on the rare category. Additionally, some statisticians disagree with the claim that Cohen’s Kappa accounts for random chance, as an explicit model of how chance affected decision making would be necessary to say this decisively. The chance adjustment of Kappa simply assumes that when raters are uncertain, they completely guess an outcome. However, this is highly unlikely in practice–usually people have some reason for their decision.

Let’s use this in a sentence, shall we?
Serious: The Cohen’s Kappa score between the two raters was 0.7. Therefore, there is substantial agreement between the raters’ observations.
Silly: A kappa of 0.7? They must always agree on a place to eat!

Today’s MiWORD of the day is… Artifact!

When the ancient Egyptians built the golden Mask of Tutankhamun or carved a simple message into the now-famous Rosetta Stone, they probably didn't know that we'd be holding onto them centuries later, considering them incredible reflections of Egyptian history.

Both of these are among the most famous artifacts existing today in museums. An artifact is a man-made object considered to be of high historical significance. In radiology, however, an artifact is a lot less desirable – it refers to parts of an image that appear different from reality and inaccurately reflect the body structures being imaged.

Artifacts in radiography can appear in any image. For instance, they can arise from improper handling of the machines used to take medical scans, patient movement during imaging, external objects (e.g. jewelry, buttons), and other unwanted occurrences.

Why are artifacts so important? They can lead to misdiagnoses that could be detrimental to a patient. Consider a hypothetical scenario where a patient goes in for imaging of a tumor. The radiologist identifies the tumor as benign, but due to mishandling of a machine, an artifact on the image hides the fact that it is in fact malignant. The outcome would be catastrophic in this case!

Of course, this kind of misdiagnosis is highly unlikely (especially with modern-day medical imaging), and there are a ton of factors at play in a diagnosis. A real diagnosis, especially nowadays, would not be so simple (or we would be wrong not to severely lament the state of medicine today). However, even if artifacts don't cause a misdiagnosis, they can pose obstacles to both radiologists and researchers working with these images.

One such area of research is the application of machine learning to the field of medical imaging. Whether we're using convolutional neural networks or vision transformers, all of these machine learning models rely on images produced in some facility. The quality of these images, including the presence and frequency of artifacts, can affect the outcome of any experiments conducted with them. For instance, imagine you build a machine learning model to classify between two different types of ultrasound scans. The performance of the model is certainly a main factor – but the concern that the model might be focusing on artifacts within the image rather than the structures of interest would also be a huge consideration.

In any case, the presence of artifacts (whether in medical imaging or in historical museums) certainly gives us a lot more to think about!

Now onto the fun part, using artifact in a sentence by the end of the day:

Serious: My convolutional neural network could possibly be focusing on artifacts resulting from machine handling in the ultrasound images during classification rather than actual body structures of interest. That would be terrible.

Less serious: The Rosetta Stone – a phenomenal, historically significant, hugely investigated Egyptian artifact that happened to be a slab of stone on which I have no idea what was inscribed.

I’ll see you in the blogosphere!

Jeffrey Huang

MiWORD of the day is…Compression!

In physics, compression means that inward forces are applied evenly to an object from different directions. During this process, the atoms in the object change their positions. After the forces are removed, the object may be restored, depending on the type of material it is made of. For example, when compression is applied to an elastic material such as a rubber ball, the air molecules inside the ball are compressed and its volume decreases. After the compressive force is removed, it quickly restores its original spherical shape. On the other hand, when a compressive force is applied to a brick, the solid clay cannot be compressed; the forces instead concentrate on the brick's weakest point, causing it to break.

For images, compression is the process of encoding digital image information using a specific encoding scheme. After compression, the image has a smaller size. An image can be compressed because of the degree of redundancy it contains: since the neighboring pixels of an image are correlated, the information they carry is partly redundant. During compression, these redundant values (i.e., values close to zero when encoding digital image information) are removed by comparison with neighboring pixels. The higher the compression ratio, the more of these small values are removed. The compressed image uses fewer bits than the original unencoded representation, thereby achieving the size reduction.

In the area of medical imaging, two different methods of compression are commonly used: JPEG and JPEG2000. JPEG2000 is a newer image coding system that compresses images based on the two-dimensional discrete wavelet transform (DWT). Unlike JPEG compression, which decomposes the image based on its frequency content, JPEG2000 decomposes the image signal based on scale or resolution. It performs better than JPEG, with much better image quality at moderate compression ratios. However, JPEG2000 does not perform well at very high compression ratios, where the image appears blurred.
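As a quick illustration, here is how one might save an image with either codec using Pillow (this assumes Pillow's OpenJPEG-backed JPEG2000 plugin is available; the file names and ratios are illustrative):

```python
from PIL import Image

img = Image.open("scan.tiff")

# Lossy JPEG2000 at roughly a 20:1 compression ratio. quality_mode="rates"
# interprets quality_layers as compression ratios; irreversible=True selects
# the lossy wavelet transform.
img.save("scan.jp2", quality_mode="rates", quality_layers=[20], irreversible=True)

# Baseline JPEG for comparison; quality runs from 1 (worst) to 95 (best).
img.convert("RGB").save("scan.jpg", quality=30)
```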

Now onto the fun part, using compression in a sentence by the end of the day!

Serious: This high compression ratio has caused so much information loss that I cannot even recognize that it is a dog.

Less serious:

Tong: I could do compression with my eyes.

Grace: Really? How?

Tong: It is simple. Just remove the eyeglasses. Everything becomes compressed.

See you in the blogosphere!

Tong Su

MiWORD of the day is…Transformers and “The Age of Extinction” of CNNs?

Having studied Machine Learning and Neural Networks for a long time, I no longer think of the movie when I hear the word "transformers". Now, when I hear CNN, I no longer think of the news channel. So, I had a confusing conversation with my Social Sciences friend when she said that CNN was biased, and I asked if her dataset was imbalanced. Nevertheless, why are we talking about this?

Before I learned about Neural Networks, I always wondered how computers could even "look" at images, let alone tell us something about them. Then, I learned about Convolutional Neural Networks, or CNNs! They work by sliding a small "window" across an image, trying to make sense of the pixels that the window sees. As the CNN trains on images, it learns to pick out edges and shapes that help it make sense of images down the line. For almost a decade, the best-performing image models relied on convolutions. They are designed to do very well with images due to their "inductive bias", or "expertise", on images: the sliding window operations make them well suited to detecting patterns in images.
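As a tiny illustration of that sliding window, here is a single handcrafted 3×3 convolution filter in PyTorch that responds to vertical edges (in a real CNN these weights are learned during training, not set by hand):

```python
import torch
import torch.nn as nn

# A single 3x3 convolution "window" sliding over a 1-channel image.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)

# Hand-set a Sobel-like vertical-edge detector as the window's weights.
conv.weight.data = torch.tensor([[[[-1., 0., 1.],
                                   [-2., 0., 2.],
                                   [-1., 0., 1.]]]])

image = torch.rand(1, 1, 28, 28)   # a batch of one 28x28 image
feature_map = conv(image)          # (1, 1, 26, 26): strong values mark edges
```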

Transformers, on the other hand, were designed to work well with sequences of words. They take in a sequence of encoded words and can perform various tasks with them, such as language translation, sentiment analysis, etc. However, they are so versatile that, in 2020, they were shown to outperform CNNs on image tasks. How the heck does a model designed for text even work with images, you might ask! Well, you might have heard the saying, "a picture is worth a thousand words." But in 2020, Dosovitskiy et al. said "An Image is Worth 16×16 Words". In this paper, they cut up an image into patches of 16×16 pixels. The pixels from each patch were then fed into a transformer model as if each patch were a word from a text. When this model was trained on millions of images, it was found to outperform CNNs, even though it lacks their inductive bias! Essentially, it learns how to look at images by looking at a lot of images.
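Here is a minimal sketch of that patch-cutting step in PyTorch; the sizes follow the common 224×224 image, 16×16 patch setup, and the linear embedding and transformer layers that would come next are omitted:

```python
import torch

def patchify(images, patch=16):
    """Cut (B, C, H, W) images into (B, num_patches, patch*patch*C) 'words'."""
    B, C, H, W = images.shape
    p = images.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    return p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

tokens = patchify(torch.rand(1, 3, 224, 224))  # (1, 196, 768): 196 "words"
# Each 768-d patch vector is then linearly embedded and fed to the transformer.
```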

Now, just like the Transformers franchise, a new paper on a different flavor of vision transformer drops every year. And just as the movies in the franchise take a lot of money to make, transformers take a lot of data to train. However, once pretrained on enough data, they can knock CNNs out of the park when further finetuned on small datasets like those common in medical imaging.

Now let’s use transformers in a sentence…

Serious: My pretrained vision transformer, finetuned to detect infiltration in these chest X-ray images, outperformed the CNN.

Less Serious: I have 100,000 images, that’s enough data to train my Vision Transformer from scratch! *Famous last words*

See you in the blogosphere!

Manav Shah

Tong Su’s ROP299 Journey

Hi everyone! My name is Tong Su, and I have just wrapped up my ROP299 project in Professor Tyrrell's lab, as well as my second year at the University of Toronto, pursuing a computer science specialist and statistics major. It has been a great pleasure to complete my whole second-year journey along with this research experience. I have learned a lot about both artificial intelligence and the process of scientific research, and I would like to share my experiences with you here.

My ROP project examined the effect of compression and downsampling on the accuracy of a Convolutional Neural Network (CNN)-based histological image binary classification model. Advances in medical imaging systems have made medical images more detailed, but at the cost of increased file size. Compared to other images, medical images are larger and occupy more storage space. Therefore, most medical images are downsampled or compressed before they are stored. While some compression methods are reversible, most are irreversible: once the image is compressed, perceptible information is lost and cannot be restored. When these modified medical images are used for training machine learning algorithms, the information lost during compression may affect the algorithms' accuracy. This study aimed to investigate how the compression and downsampling ratios of medical images affect the accuracy of a CNN.

Similar to other ROP students, I decided on my research topic early by selecting from a set of topics in different areas. However, the focus of my research shifted slightly as I progressed through my project. Initially, my research topic was "Can we compress training data without degrading accuracy?". This topic only addressed the effect of compression on the algorithm's accuracy, and at the end of the research I would propose the compression ratio best suited for medical image storage without much loss of accuracy.

Among all the compression types, I decided to work with JPEG2000, as it is one of the most commonly used compression types in medical imaging. The dataset chosen consisted of 100,000 image patches from histological images of human colorectal cancer (CRC) and normal tissue, organized into nine tissue classes. The next step was to choose the machine learning model. I decided to work on binary classification with a CNN model. Two categories were picked for the binary classification model, which classifies whether a given tissue image is cancer-associated stroma (STR, labelled 1) or normal colon mucosa (NORM, labelled 0).

The next step was compressing the dataset. I used the Python Imaging Library (PIL) to compress the dataset using JPEG2000. However, the binary classification model does not support images in the .j2k format, so I needed an additional step to convert the .j2k images to a format the model supports. I decided to convert the images to TIFF, as that was the dataset's original format.
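A minimal sketch of that conversion step with Pillow might look like this (the folder and file names are illustrative, not the actual paths used in the project):

```python
from PIL import Image
import glob, os

# Convert the compressed .j2k patches back to TIFF so the classifier can read them.
for path in glob.glob("compressed/*.j2k"):
    img = Image.open(path)  # decoded by Pillow's JPEG2000 plugin
    out = os.path.join("tiff", os.path.splitext(os.path.basename(path))[0] + ".tif")
    img.save(out, format="TIFF")
```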

During my research on compression, Professor Tyrrell pointed out another image size reduction method: downsampling. Although both methods reduce image size, there are differences between them. This aroused my interest in which image size reduction method works better for the machine learning algorithm. I therefore added another goal to my project: comparing downsampling and compression, and determining which image size reduction method is more suitable for medical imaging.

Despite all the obstacles I encountered along the way, such as changing the dataset halfway through the project, modifying the model and rerunning everything, and the unexpected 54.39% error at a high compression ratio, I have successfully reached the end and am concluding my excellent ROP experience with this reflection. I now have a greater understanding of the research process and of deep learning algorithms. At the end of this reflection, I want to thank Professor Tyrrell for offering me this opportunity and guiding my research progress through the weekly meetings. I also want to thank Dr. Atsuhiro Hibi for providing me with endless guidance and support throughout the research project, through meetings and frequent email exchanges even when he was busy. Without their help, I would not have had such an excellent research experience.

Tong Su