Adele Lauzon’s ROP399 Journey

Hi there! My name is Adele Lauzon, and I’ve just finished up my 3rd year at UofT with a major in statistics and minors in computer science and psychology. A huge highlight of my year has been my ROP399 with Professor Tyrrell, where I got to do a deep dive into the intersection of statistics, computer science, and biomedical data.

A little bit about my background–I went to high school in Houston, Texas, which is where I first fell in love with statistics. I remember my AP Statistics teacher beginning our first class with a quote by esteemed statistician John Tukey, where he claimed statistics was the best discipline because it meant you got to “play in everyone’s backyard.” As I’ve gotten farther along in my statistics education, I’ve realized how much truth is behind that phrase. Statistics is wonderful because it allows you to understand other fields simply based on the data you use. Through this ROP, I’ve been able to learn a bit more about the field of medicine.

My project was about measures of confidence in binary classification algorithms using biomedical data. Specifically, I investigated error consistency and error agreement–meaning I took a close look at what was happening when the model was making incorrect predictions. I’m not going to lie, probably the hardest part of this project was just getting started. I have a little bit of programming experience thanks to my computer science minor, but I had a lot of catching up to do compared to my classmates. A word of advice–get yourself set up on the GPUs early. Running my code locally made for a frighteningly overheated laptop.

Probably my biggest takeaway from this course was how the process of research actually works. While the scientific method is helpful, it doesn’t account for all of the back-and-forth you are guaranteed to be doing. This is where documenting all of your steps really comes in handy. If you hit an obstacle and need to reevaluate, keep a record of what you were doing beforehand in case you need to backtrack. I made this mistake, and ended up redoing work I had already done.

All in all, this ROP has been such a valuable experience to me. Many thanks to Professor Tyrrell and the rest of the MiDATA team for their unwavering patience!

Today’s MiWORD of the day is… Agreement!

You know that magical moment where you and your friend finally agree on a place to eat, or a movie to watch, and you wonder what lucky stars had to align to make that happen? When the chance of agreement was so small that you didn’t think you’d ever decide? If you wanted to capture how often you and your friend agree on a restaurant or a movie in such a way that accounted for whether it was due to random chance, Cohen’s Kappa is the choice for you.

Agreement can be calculated just by taking the number of agreed upon observations divided by the total observations; however, Jacob Cohen believed that wasn’t enough. As agreement was typically used for inter-rater reliability, Cohen argued that this measure didn’t account for the fact that sometimes, people just guess–especially if they are uncertain. In 1960, he proposed Cohen’s Kappa as a counter to traditional percent agreement, claiming his measure was more robust as it accounted for random chance agreement.

Cohen’s Kappa is used to calculate agreement between two raters–or in machine learning, it can be used to find the agreement between the prediction sets of two models. It is calculated as the probability of observed agreement minus the probability of chance agreement, all over one minus the probability of chance agreement: κ = (p_o − p_e) / (1 − p_e). Like many correlation metrics, it ranges from -1 to +1. A negative value of Cohen’s Kappa indicates agreement worse than would be expected by chance–that is, the raters tended to give different ratings. A Cohen’s Kappa of 0 indicates no agreement beyond what would be expected by chance. A Cohen’s Kappa of 1 indicates that the raters are in complete agreement.
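If you’d rather let a library do the arithmetic, scikit-learn has this built in. A minimal sketch with made-up rating vectors:

```python
# Minimal sketch of Cohen's Kappa: kappa = (p_o - p_e) / (1 - p_e),
# where p_o is observed agreement and p_e is chance agreement.
# The two rating vectors below are made up for illustration.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # e.g., model A's binary predictions
rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # e.g., model B's binary predictions

# Raw percent agreement, for comparison
p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Percent agreement: {p_o:.2f}, Cohen's Kappa: {kappa:.2f}")
```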

As Cohen’s Kappa is calculated using frequencies, it can be unreliable in measuring agreement in situations where an outcome is rare. In such cases, it tends to be overly conservative and underestimates agreement on the rare category. Additionally, some statisticians disagree with the claim that Cohen’s Kappa accounts for random chance, as an explicit model of how chance affected decision making would be necessary to say this decisively. The chance adjustment of Kappa simply assumes that when raters are uncertain, they completely guess an outcome. However, this is highly unlikely in practice–usually people have some reason for their decision.

Let’s use this in a sentence, shall we?
Serious: The Cohen’s Kappa score between the two raters was 0.7. Therefore, there is substantial agreement between the raters’ observations.
Silly: A kappa of 0.7? They must always agree on a place to eat!

Today’s MiWORD of the day is… Artifact!

When the ancient Egyptians built the golden Mask of Tutankhamun or carved a simple message into the now famous Rosetta Stone, they probably didn’t know that we’d be holding onto them centuries later, considering them incredible reflections of Egyptian history.

Both of these are among the most famous artifacts in museums today. An artifact is a man-made object considered to be of high historical significance. In radiology, however, an artifact is a lot less desirable – it refers to a part of an image that does not accurately reflect the body structures being imaged.

Artifacts can appear in any radiographic image. They can result from improper handling of the machines used to take medical scans, patient movement during imaging, external objects (e.g. jewelry, buttons), and other unwanted occurrences.

Why do artifacts matter so much? They can lead to misdiagnoses that could be detrimental to a patient. Consider a hypothetical scenario where a patient goes in for imaging of a tumor. The radiologist identifies the tumor as benign, but due to mishandling of a machine, an artifact on the image hides the fact that it is in fact malignant. The outcome would be catastrophic in this case!

Of course, this kind of misdiagnosis is highly unlikely (especially with modern-day medical imaging), and there are a ton of factors at play in any diagnosis. A real diagnosis, especially nowadays, would not be so simple (or we would be wrong not to severely lament the state of medicine today). However, even when artifacts don’t cause a misdiagnosis, they can pose obstacles to both radiologists and researchers working with these images.

One such area of research is the application of machine learning to the field of medical imaging. Whether we’re using convolutional neural networks or vision transformers, all of these machine learning models rely on images produced in some facility. The quality of those images, including the presence and frequency of artifacts, can affect the outcome of any experiment conducted with them. For instance, imagine you build a machine learning model to classify between two different types of ultrasound scans. The performance of the model is certainly a main factor – but the concern that the model might be focusing on artifacts within the image rather than structures of interest would also be a huge consideration.
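One way to probe that concern (a generic sanity check, not necessarily what we did in the lab) is a saliency map: the gradient of the model’s output with respect to the input pixels. A minimal PyTorch sketch, assuming a trained classifier `model` and a preprocessed `image` tensor:

```python
# Minimal saliency-map sketch in PyTorch (assumes `model` is a trained
# classifier and `image` is a preprocessed tensor of shape (1, C, H, W)).
import torch

def saliency_map(model, image):
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image).max()          # score of the top class
    score.backward()                    # d(score) / d(pixels)
    # Max over channels gives one importance value per pixel.
    return image.grad.abs().max(dim=1)[0].squeeze(0)

# High values far from the anatomy (e.g., on a probe marker or text
# burn-in) suggest the model is keying on artifacts, not body structures.
```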

In any case, the presence of artifacts (whether in medical imaging or in historical museums) certainly gives us a lot more to think about!

Now onto the fun part, using artifact in a sentence by the end of the day:

Serious: My convolutional neural network could possibly be focusing on artifacts resulting from machine handling in the ultrasound images during classification rather than actual body structures of interest. That would be terrible.

Less serious: The Rosetta Stone – a phenomenal, historically significant, hugely investigated Egyptian artifact that happened to be a slab of stone on which I have no idea what was inscribed.

I’ll see you in the blogosphere!

Jeffrey Huang

MiWORD of the day is…Compression!

In physics, compression means that inward forces are applied evenly to an object from different directions. During this process, the atoms in the object change their positions. After the forces are removed, the object may be restored, depending on the material it is made of. For example, when compression is applied to an elastic material such as a rubber ball, the air molecules inside the ball are compressed into a smaller volume. Once the compressive force is removed, the ball quickly springs back to its original spherical shape. On the other hand, when a compressive force is applied to a brick, the solid clay cannot be compressed. Instead, the forces concentrate on the weakest point, causing the brick to break in the middle.

For images, compression is the process of encoding digital image information using a specific encoding scheme, so that the image takes up less space. An image can be compressed because of its redundancy. Since neighboring pixels in an image are correlated, much of the information is redundant across those pixels. During compression, these redundant values (i.e., values close to zero when encoding the image) are removed by comparison with neighboring pixels. The higher the compression ratio, the more of these small values are removed. The compressed image uses fewer bits than the original unencoded representation, thereby achieving the size reduction.

In the area of medical imaging, two compression methods are commonly used: JPEG and JPEG 2000. JPEG 2000 is a newer image coding system that compresses images based on the two-dimensional discrete wavelet transform (DWT). Unlike JPEG compression, which decomposes the image based on frequency content, JPEG 2000 decomposes the image signal based on scale or resolution. It performs better than JPEG, with much better image quality at moderate compression ratios. However, JPEG 2000 does not perform well at very high compression ratios, where the image appears blurred.
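To make this concrete, here is a minimal sketch of JPEG 2000 compression at a few ratios using Pillow (assuming a Pillow build with OpenJPEG support; the file names and ratios are just examples):

```python
# Sketch: saving an image as JPEG 2000 at different compression ratios
# with Pillow. Requires Pillow built with OpenJPEG support; file names
# and ratios here are arbitrary examples.
from PIL import Image

img = Image.open("tissue_patch.tif")

for ratio in (10, 20, 40):
    # quality_mode="rates" interprets quality_layers as compression ratios;
    # irreversible=True selects the lossy (irreversible) wavelet transform.
    img.save(f"patch_ratio_{ratio}.jp2",
             quality_mode="rates",
             quality_layers=[ratio],
             irreversible=True)
```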

Now onto the fun part, using compression in a sentence by the end of the day!

Serious: This high compression ratio has caused so much information loss that I cannot even recognize that it is a dog.

Less serious:

Tong: I could do compression with my eyes.

Grace: Really? How?

Tong: It is simple. Just remove the eyeglasses. Everything becomes compressed.

See you in the blogosphere!

Tong Su

MiWORD of the day is…Transformers and “The Age of Extinction” of CNNs?

Having studied Machine Learning and Neural Networks for a long time, I no longer think of the movie when I hear the word “transformers”. Now, when I hear CNN, I no longer think of the news channel. So, I had a confusing conversation with my Social Sciences friend when she said that CNN was biased, and I asked if her dataset was imbalanced. Nevertheless, why are we talking about this?

Before I learned about Neural Networks, I always wondered how computers could even “look” at images, let alone tell us something about them. Then I learned about Convolutional Neural Networks, or CNNs! They work by sliding a small “window” across an image while trying to make sense of the pixels that the window sees. As a CNN trains on images, it learns to pick out the edges and shapes that help it make sense of images down the line. For almost a decade, the best-performing image models relied on convolutions. They are designed to do very well with images due to their “inductive bias”, or built-in “expertise”, on images: the sliding-window operation makes them well suited to detecting local patterns.
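In code, that sliding window is literally a convolution. A toy PyTorch sketch with a single hand-set 3×3 filter (a real CNN learns its filters from data):

```python
# Toy example of the CNN "sliding window": one 3x3 convolution filter
# slid across a grayscale image. Weights are hand-set for illustration;
# a real CNN learns them from data.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
with torch.no_grad():
    conv.weight[:] = torch.tensor([[[[-1., 0., 1.],
                                     [-2., 0., 2.],
                                     [-1., 0., 1.]]]])  # Sobel-like vertical-edge filter

image = torch.rand(1, 1, 28, 28)   # fake 28x28 grayscale image
edges = conv(image)                # large response wherever the window sees an edge
print(edges.shape)                 # torch.Size([1, 1, 26, 26])
```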

Transformers, on the other hand, were designed to work well with sequences of words. They take in a sequence of encoded words and can perform various tasks with them, such as language translation, sentiment analysis, etc. However, they are so versatile that, in 2020, they were shown to outperform CNNs on image tasks. How the heck does a model designed for text even work with images, you might ask! Well, you might have heard the saying “an image is worth a thousand words.” But in 2020, Dosovitskiy et al. said “an image is worth 16×16 words”. In their paper, they cut up an image into patches of 16×16 pixels. The pixels from each patch were then fed into a transformer model as if each patch were a word from a text. When trained on millions of images, this model was found to outperform CNNs, even though it does not have that inductive bias! Essentially, it learns how to look at images by looking at a lot of images.
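The “words” are simple to produce in code. A minimal sketch of the patch-cutting step (the linear projection and the transformer itself are omitted):

```python
# Sketch: cutting an image into 16x16 patches, the "words" fed to a
# vision transformer. Projection and transformer layers are omitted.
import torch

def patchify(image, patch=16):
    # image: (C, H, W) with H and W divisible by `patch`
    C, H, W = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, C * patch * patch)
    return patches  # (num_patches, patch*patch*C), one row per "word"

img = torch.rand(3, 224, 224)
print(patchify(img).shape)  # torch.Size([196, 768]) -- 14x14 patches of 16x16x3
```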

Now, just like the Transformers franchise, a new paper on a different flavor of vision transformer drops every year. And just as the movies in the franchise take a lot of money to make, transformers take a lot of data to train. However, once pretrained on enough data, they can smash CNNs out of the park when further finetuned on small datasets like those common in medical imaging.

Now let’s use transformers in a sentence…

Serious: My pretrained vision transformer, finetuned to detect infiltration in these chest X-ray images, outperformed the CNN.

Less Serious: I have 100,000 images, that’s enough data to train my Vision Transformer from scratch! *Famous last words*

See you in the blogosphere!

Manav Shah

Tong Su’s ROP299 Journey

Hi everyone! My name is Tong Su, and I have just wrapped up my ROP299 project in Professor Tyrrell’s lab, as well as my second year at the University of Toronto, where I am pursuing a computer science specialist and a statistics major. It has been a great pleasure to complete my second-year journey alongside this research experience. I have learned a lot about both artificial intelligence and the process of scientific research, and I would like to share my experiences with you here.

My ROP project examined the effect of compression and downsampling on the accuracy of a Convolutional Neural Network (CNN)-based binary classification model for histological images. Advances in medical imaging systems have made medical images more detailed, but at the cost of larger file sizes. Compared to other images, medical images are larger and occupy more storage space, so most medical images are downsampled or compressed before they are stored. While some compression methods are reversible, most are irreversible: once the image is compressed, perceptible information is lost and cannot be restored. When these modified medical images are used to train machine learning algorithms, the information lost during compression may affect the algorithms’ accuracy. This study aimed to investigate how the compression and downsampling ratios applied to medical images affect the accuracy of a CNN.

Similar to other ROP students, I decided on my research topic early by selecting from a number of topics in different areas. However, the focus of my research adjusted slightly as I progressed through the project. Initially, my research topic was “Can we compress training data without degrading accuracy?”. This topic only addressed the effect of compression on the accuracy of the algorithm; at the end of the research, I would propose the best compression ratio for storing medical images without much loss of accuracy.

Among all the compression types, I decided to work with JPEG 2000, as it is one of the most commonly used compression methods in medical imaging. The dataset I chose consisted of 100,000 image patches from histological images of human colorectal cancer (CRC) and normal tissue, organized into 9 tissue classes. The next step was to choose the machine learning model. I decided to work on binary classification with a CNN model, and picked two of the categories for a binary classifier that predicts whether a given tissue image is cancer-associated stroma (STR) (1) or normal colon mucosa (NORM) (0).

The next step was compressing the dataset. I used the Python Imaging Library (PIL) to compress the images with JPEG 2000. However, the binary classification model does not support datasets in the .j2k format, so I needed an extra step to convert the .j2k images to a format the model supports. I decided to convert them to TIFF, since that was the dataset’s original format.
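A minimal sketch of that compress-then-convert step (the paths and the compression ratio are illustrative; Pillow needs OpenJPEG support to read and write JPEG 2000):

```python
# Sketch: round-tripping a patch through JPEG 2000 compression and back
# to TIFF so the classifier can read it. Paths and ratio are illustrative.
from PIL import Image

# Compress the original TIFF patch to JPEG 2000 at a 20:1 ratio
Image.open("patch.tif").save("patch.j2k",
                             quality_mode="rates",
                             quality_layers=[20],
                             irreversible=True)

# Decode the compressed image and save it back as TIFF for training
Image.open("patch.j2k").save("patch_compressed.tif")
```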

During my research on compression, Professor Tyrrell pointed out another image size reduction method: downsampling. Although both methods are used to reduce image size, there are important differences between them. This sparked my interest in which image size reduction method works better for the machine learning algorithm. So I added another aim to my project: to compare downsampling and compression and determine which image size reduction method is more suitable for medical imaging.
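For contrast, downsampling reduces file size by lowering the pixel resolution rather than by re-encoding the pixel values. A one-step Pillow sketch (the factor of 2 is just an example):

```python
# Sketch: downsampling halves the resolution instead of re-encoding it.
from PIL import Image

img = Image.open("patch.tif")
smaller = img.resize((img.width // 2, img.height // 2), Image.LANCZOS)
smaller.save("patch_downsampled.tif")
```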

Despite all the obstacles I encountered along the way – changing the dataset halfway through the project, modifying the model and rerunning everything, the unexpected 54.39% error at a high compression ratio, and so on – I have successfully reached the end, and I conclude my excellent ROP experience with this reflection. I now have a much greater understanding of the process of research and of deep learning algorithms. I want to thank Professor Tyrrell for offering me this opportunity and guiding my research through the weekly meetings. I also want to thank Dr. Atsuhiro Hibi for providing endless guidance and support throughout the research project in meetings and frequent email exchanges, even when he was busy. Without their help, I would not have had such an excellent research experience.

Tong Su

Manav Shah’s Journey in ROP399

Hi! My name is Manav Shah, and I am finishing the third year of my Computer Science Specialist and Statistics Minor at UofT. This past academic year, I had the opportunity to do an ROP399 research project under the guidance of Professor Pascal Tyrrell, and I would like to share my experience on this blog.

My ROP project compared the effect of decreasing sample size on Vision Transformers versus Convolutional Neural Networks on a chest X-ray classification task using the NIH Chest X-Ray dataset. Convolutional Neural Networks have been the predominant models for medical imaging tasks, as they are easy to train and perform very well on almost any image modality. In recent years, however, Vision Transformers (ViTs) have been shown to outperform Convolutional Neural Networks – but only when trained or pretrained on extremely large amounts of data. Given that large amounts of labelled data are hard to come by in medical imaging, it is important to set up some baselines for performance and gauge whether future work and research is warranted in this arena. This exploratory aspect made my project very exciting.

I started the project not knowing anything about ViTs, though I had some experience training and using CNNs and ResNets. So I started by reading everything I could about Vision Transformers. However, since they are a relatively new class of models, it was hard to gain an initial intuitive understanding of what was happening in the research papers I read, and I did not know where to start. To avoid wasting time, I began by cleaning my data and preparing a binary classification dataset from the NIH Chest X-Ray dataset, to detect infiltration within the lungs. I trained a small CNN classifier from scratch to see if the results made sense. I was getting an accuracy of around 60%, which I knew was not good enough. Then I spoke to Prof. Tyrrell and Atsuhiro, who pointed out that my dataset might contain noise from the same patients appearing in both the positive and negative classes of images. So I cleaned my data some more and made sure there was little overlap between the negative and positive classes.
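The standard fix for that kind of leakage is to split at the patient level rather than the image level. A hedged sketch with scikit-learn (the file and column names are made up for illustration):

```python
# Sketch: splitting at the patient level so the same patient never
# appears in both train and test sets. File/column names are illustrative.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("labels.csv")  # one row per image, with a patient identifier

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))

train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
# No patient appears on both sides of the split:
assert set(train_df["patient_id"]).isdisjoint(test_df["patient_id"])
```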

I then proceeded to train a small CNN again, with fair results. However, when I tried training a ViT from scratch on my datasets, it would only learn to output “No Infiltration” for all images, as that was the majority class. So I did more research and tried a lot of different techniques, but to no avail. Still, in trying to debug the ViT, I gained an in-depth understanding of concepts like learning rate scheduling, training regimes, transfer learning, and self-attention. I learned a lot from the many failures I encountered in the project. I was close to giving up, had it not been for Prof. Tyrrell’s patience and encouraging words. I also spoke to my Neural Networks professor and some friends for advice and learned a lot. In the end, I decided to use transfer learning, which ended up giving me very fruitful results.
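A minimal sketch of that transfer-learning setup (the timm library is used here purely as an example; the hyperparameters are illustrative, not the ones from my project):

```python
# Sketch: transfer learning with a pretrained vision transformer.
# timm is an example library choice; hyperparameters are illustrative.
import timm
import torch

# Load a ViT pretrained on a large natural-image dataset and replace the
# classification head with a fresh 2-class head (infiltration vs. none).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)

# Fine-tune with a small learning rate so the pretrained weights are
# nudged rather than overwritten.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
criterion = torch.nn.CrossEntropyLoss()
```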

More than technical knowledge, I learned how to stick with tough projects and what to expect when navigating one. I found Prof. Tyrrell’s attitude towards failures in projects very inspiring, and it gave me the confidence to persevere. The experience, in my opinion, teaches you how tough research actually is – and, more importantly, how you can still overcome the challenges and come out better for having gone through them.

Manav Shah

Grace Yu’s STA299 Journey

Hi everyone! My name is Grace Yu and I’m finishing my second year at the University of Toronto, pursuing a computer science specialist and a molecular genetics major. From September 2021 to April 2022, I was fortunate to have the opportunity to do a STA299 project with Professor Tyrrell through the Research Opportunity Program. I am excited to share my experience with you all!

My project was on landmarking with reduced sample sizes in MSK (musculoskeletal) ultrasound images of knees. Like many other ROP students, this was my first research experience. Prior to this project, I had no idea how machine learning worked. However, I have always been interested in the intersection between computer science and the medical field, and that is what drew me to this opportunity.

The start of the project was interesting but not easy. There were many times when I did not know if I was doing the right thing, or if I was putting my effort toward the correct path. Luckily, Professor Tyrrell and the people in the lab were always very patient and helpful. I began by reading research papers on developing new semi-supervised learning models, but found them difficult to comprehend and time-consuming. Mauro kindly suggested which parts to focus on when doing the literature research, and advised me to pay more attention to selecting a model instead of focusing on the technical details of how the model is constructed. In addition, because I spent so much time choosing a model, I fell behind the others. Professor Tyrrell reminded me of my project timeline and the next step I should take on as soon as possible, which was to find a dataset. Fortunately, with the help of the lab, we prepared a dataset together and my project got back on schedule. Looking back, I appreciated that period of exploring and experimenting, and the guidance provided by others. The starting point of a project can be difficult, and sometimes we do not know what we are doing, but really, that’s ok. For me, the time I spent at the beginning paid off with an especially suitable model and a nice comparison. This experience has also helped me get up to speed on new projects and new fields more quickly.

I am very grateful for the opportunity to work in the MiDATA lab this year. Not only did I gain a better understanding of statistical and computer science concepts, but I also learned the methods and process of conducting research. I would like to thank Professor Tyrrell, Majid, Mauro, and Atsuhiro for their guidance and feedback along the way. With this experience, I am more confident and looking forward to applying what I have learned to my future research journey.

Grace Yu

The MiDATA Word of the Day is… “AP”

AP? Average Precision! What is it? And how is it useful?

Imagine you are given a prediction model that can identify common objects, and you want to know how well the model performs. So you prepare a picture that contains 2 people, and you label them yourself with yellow bounding boxes. Then you apply the model to this image, and it boxes the people in red, each with a confidence score. Not bad, right? But how can you tell if this prediction is correct?

That’s where Intersection over Union (IoU) comes in – the first stop on our journey to AP. Looking at the boxes in the picture, you can see that parts of the yellow and red boxes overlap. IoU is the proportion of their overlapping region over their union. For example, the prediction for the person on the left has a smaller IoU than the prediction for the other person.
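In code, IoU is only a few lines. A small sketch with boxes given as (x1, y1, x2, y2) corners:

```python
# Sketch: IoU of two boxes given as (x1, y1, x2, y2) corner coordinates.
def iou(box_a, box_b):
    # Corners of the overlapping region
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.14 -- a modest overlap
```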

If we set the IoU cutoff at 0.8, then the prediction on the left will be classified as a false positive (FP), since it does not reach the threshold, whereas the prediction on the right will be a true positive (TP).

Now for the final piece before calculating AP. In this image of cats, we labeled 5 cats in red, and the model’s predictions are in yellow. We rank the predictions in descending order of confidence score, and calculate the precision and recall at each rank. Precision is the proportion of TPs out of all predictions so far, and recall is the proportion of TPs out of all ground-truth objects.

Here is a summary of the calculations:

| Rank of prediction | Correct (Y/N) | Precision | Recall |
|--------------------|---------------|-----------|--------|
| 1                  | T             | 1         | 0.2    |
| 2                  | T             | 1         | 0.4    |
| 3                  | F             | 0.67      | 0.4    |
| 4                  | T             | 0.75      | 0.6    |
| 5                  | T             | 0.8       | 0.8    |
| 6                  | F             | 0.67      | 0.8    |

Then we plot the precision-recall curve.

Generally, as recall increases, the precision decreases. AP is the area under the precision-recall curve! It ranges from 0 to 1 – the higher, the better.
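Using the values from the table, here is a quick sketch of the computation (this uses a simple trapezoidal area; benchmarks like Pascal VOC and COCO use slightly different interpolation conventions):

```python
# Sketch: AP as the area under the precision-recall curve, using the
# values from the table above. A (recall=0, precision=1) anchor point is
# prepended -- one common convention for the start of the curve.
import numpy as np

recall    = np.array([0.0, 0.2, 0.4, 0.4, 0.6, 0.8, 0.8])
precision = np.array([1.0, 1.0, 1.0, 0.67, 0.75, 0.8, 0.67])

# Trapezoidal area under the precision-recall curve
ap = np.sum((recall[1:] - recall[:-1]) * (precision[1:] + precision[:-1]) / 2)
print(f"AP = {ap:.2f}")  # roughly 0.70 for these values
```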

Whoa! That’s a complicated definition. In practice, AP is often computed for you by the model’s evaluation code. Next time you see AP, you’ll know it represents how good your model is.

Now for the fun part, using AP in a sentence by the end of the day:

Serious: AP is a measurement of accuracy for object detection models.

Less serious:

Child: Hey mom! I need some help with the assignment in boxing all the cars on the road.

Mother: Try this model! It has an AP of 0.8, and it may be better at this than I am.

…I’ll see you in the blogosphere.

Grace Yu

Linxi Chen’s STA399 Journey

Hi everyone! My name is Linxi Chen and I’m finishing my third year at the University of Toronto, pursuing a statistics specialist and a mathematics major. I did an STA399 Research Opportunity Program project with Professor Pascal Tyrrell from May 2021 to August 2021, and I am very grateful to have had this opportunity. This project gave me a window into machine learning and scientific research, and I would love to share my experience with you all.

Initially, I had no experience with machine learning or convolutional neural networks, and integrating machine learning with medical imaging was a brand-new area for me. Therefore, at the beginning of the project, I gathered loads of information on machine learning to build a general picture of the area. The first assignment in the project was to make a slide deck on machine learning in medical imaging; extracting and simplifying the information I had gathered helped me understand the area more deeply.

My research project was to find an objective metric for heterogeneity and to explore how dataset heterogeneity, as measured from the CNN training image features, behaves with sample size. At first, specifically defining the term “heterogeneity” was a big challenge for me, since there are various definitions on Google and very little information directly related to my project. By comparing the information I found and talking with Professor Tyrrell in our weekly meetings, I defined “between-group heterogeneity” as the extent to which the measurements of each group vary within a dataset, considering the mean of each subgroup and the grand mean of the population. Next, designing the experimental setup was also challenging, because I had to ensure the steps were feasible and explainable. The dataset was separated into different groups according to the label of each image, and we introduced new groups into the dataset while keeping the total sample size the same in each case. There were four cases in total, and the between-group heterogeneity was measured using Cochran’s Q, a test statistic based on the chi-squared distribution. The experimental setup was modified several times as problems came up. For example, I planned to use a multi-class CNN model at first, but that would have required a different number of output neurons in different cases, making the results not comparable. Professor Tyrrell and Mauro suggested I use a binary classification model with pseudo-labels, which solved the problem. Luckily, I found some code online and, with Mauro’s help, adapted it into something usable. Then came the hardest part: although my expectation could be explained very well in theory, the results I got were not what I expected. After modifying the model and the sample size several times, I finally managed to get the expected results.
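For the curious, here is a rough sketch of the Cochran’s Q computation (the per-group estimates and variances below are made up, since the exact inputs from my project aren’t described here):

```python
# Sketch of Cochran's Q for between-group heterogeneity. The per-group
# estimates and variances below are illustrative placeholders only.
import numpy as np
from scipy import stats

group_means = np.array([0.52, 0.61, 0.48, 0.70])     # per-group estimates
group_vars  = np.array([0.010, 0.012, 0.009, 0.015])  # their variances

w = 1.0 / group_vars                                  # inverse-variance weights
grand_mean = np.sum(w * group_means) / np.sum(w)
Q = np.sum(w * (group_means - grand_mean) ** 2)

# Under homogeneity, Q follows a chi-squared distribution with k - 1 df.
p_value = stats.chi2.sf(Q, df=len(group_means) - 1)
print(f"Q = {Q:.2f}, p = {p_value:.3f}")
```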

Overall, I have learned a lot from this ROP program. With the guidance of Professor Tyrrell and the help of the students in the lab, I have gained an in-depth understanding of machine learning, neural networks, and the process of scientific research. I have also become more familiar with writing a formal scientific research paper. The most valuable thing I got from this experience is the ability to solve problems and to not get frustrated when things go wrong. I would like to thank Professor Tyrrell for giving me this opportunity to learn about scientific research and for helping me overcome all the challenges I encountered along the way. I’m very grateful to have gained so many valuable skills from this project. I would also like to thank Mauro and all the members of the lab for giving me so much help with my project.

Linxi Chen