Nana Ye’s STA299 Journey

Hi everyone! My name is Nana Ye, and I am finishing my second year at the University of Toronto as a statistical science specialist and cognitive science major. I am grateful to have participated in an ROP (Research Opportunities Program) project under the guidance of Professor Tyrrell during the summer of 2022. This project gave me a valuable opportunity to learn about machine learning and understand scientific research. I would love to share my experiences with you all!

My project analyzes the effect of additional attention gates in U-Net for knee recess distention ultrasound segmentation. The recess distention area detected by the ultrasonic signal resembles the image background, and ultrasound images often contain a large amount of noise, distortion, and shadow, which causes blurred local details, many dark areas, and no obvious boundaries. Thus, I wanted to see whether adding attention gates to the standard U-Net would improve segmentation accuracy. Prior to this project, I had not learned about machine learning, so being able to implement a machine learning model on real-world patient data was both exciting and challenging.

My ROP journey had a rocky start. I started off hoping to do a different project, comparing Vision Transformers (ViTs) and Convolutional Neural Networks on segmentation tasks for objects located in different regions of the image (central and non-central). However, when I was searching for a ViT model, I struggled to implement it on my dataset, and since ViTs were new to the lab, I could not get much help with the implementation from others. Thus, I decided to change my project. Professor Tyrrell was supportive of my decision and provided me with several articles to read, which led me to my current project. When I was worried about falling behind because others were already training their models, Professor Tyrrell reassured me that understanding what is feasible in a given time frame is also a valuable lesson. Atsuhiro and Mauro also offered me lots of help along the way. When I was having a tough time understanding the technical aspects of image processing, Atsuhiro scheduled a meeting with me to explain the concepts and answer all my questions. With their help, I was able to finish my first research project in machine learning and obtain promising results.

Overall, this was a completely different experience from lectures at the university. Researching as an ROP student in Professor Tyrrell’s lab gave me the opportunity to carry out a research project from the very beginning, doing background research and picking a topic, to the very end, analyzing the results and revising the report. Throughout the process, I not only gained technical knowledge about machine learning and medical imaging, but also learned to manage a project timeline efficiently, think critically, and solve problems independently. I feel privileged to have been one of the ROP students in Professor Tyrrell’s lab and to have gained such a worthwhile experience, which will benefit my academic career.

Adele Lauzon’s ROP399 Journey

Hi there! My name is Adele Lauzon, and I’ve just finished up my 3rd year at UofT with a major in statistics and minors in computer science and psychology. A huge highlight of my year has been my ROP399 with Professor Tyrrell, where I got to do a deep dive into the intersection of statistics, computer science, and biomedical data.

A little bit about my background–I went to high school in Houston, Texas, which is where I first fell in love with statistics. I remember my AP Statistics teacher beginning our first class with a quote by esteemed statistician John Tukey, where he claimed statistics was the best discipline because it meant you got to “play in everyone’s backyard.” As I’ve gotten farther along in my statistics education, I’ve realized how much truth is behind that phrase. Statistics is wonderful because it allows you to understand other fields simply based on the data you use. Through this ROP, I’ve been able to learn a bit more about the field of medicine.

My project was about measures of confidence in binary classification algorithms using biomedical data. Specifically, I investigated error consistency and error agreement–meaning I took a close look at what was happening when the model was making incorrect predictions. I’m not going to lie, probably the hardest part of this project was just getting started. I have a little bit of programming experience due to my computer science minor, but I had a lot of catching up to do compared to my classmates. A word of advice–get yourself set up on the GPUs early. Running my code locally made for a frighteningly overheated laptop.
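As a rough illustration of what an error consistency measure can look like, here is a minimal sketch that computes Cohen's kappa over two classifiers' per-sample correctness. This is one common formulation from the literature, not necessarily the definition used in this project; the function name and inputs are hypothetical.

```python
def error_consistency(correct_a, correct_b):
    """Cohen's kappa on two classifiers' per-sample correctness flags.

    correct_a, correct_b: equal-length lists of booleans, True where the
    corresponding classifier predicted that sample correctly.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_a)
    # Observed agreement: both right or both wrong on the same sample.
    c_obs = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    # Agreement expected by chance, given only each classifier's accuracy.
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n
    c_exp = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    if c_exp == 1.0:
        # Kappa is undefined here; returning 0.0 is a convention of this sketch.
        return 0.0
    return (c_obs - c_exp) / (1 - c_exp)
```

A value near 1 means the two classifiers tend to fail on the same samples, a value near 0 means their errors overlap about as much as chance would predict, and negative values mean they fail on different samples.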

Probably my biggest takeaway from this course was how the process of research actually works. While the scientific method is helpful, it doesn’t account for all of the back-and-forth you are guaranteed to be doing. This is where documenting all of your steps really comes in handy. If you reach an obstacle and need to reevaluate, keep a record of what you were doing beforehand in case you need to go back to it. I made the mistake of not doing this, and ended up having to redo some work that I had already done.

All in all, this ROP has been such a valuable experience to me. Many thanks to Professor Tyrrell and the rest of the MiDATA team for their unwavering patience!

Tong Su’s ROP299 Journey

Hi everyone! My name is Tong Su, and I have just wrapped up my ROP299 project in Professor Tyrrell’s Lab, as well as my second year at the University of Toronto, pursuing a computer science specialist and statistics major. It is a great pleasure to complete my whole second-year journey along with this research experience. I have learned a lot of things about both artificial intelligence topics and the process of scientific research. I would like to share my experiences with you here.

My ROP project studied the effect of compression and downsampling on the accuracy of a Convolutional Neural Network (CNN)-based binary classification model for histological images. Advances in medical imaging systems have made medical images more detailed, but at the cost of larger image sizes. Compared to other images, medical images are larger and occupy more storage space. Therefore, most medical images are downsampled or compressed before they are stored. While some compression methods are reversible, most are irreversible: once the image is compressed, perceptible information is lost and cannot be restored. When these modified medical images are used for training machine learning algorithms, the information lost during compression may affect the algorithms’ accuracy. This study aimed to investigate how the compression and downsampling ratios applied to medical images affect the accuracy of a CNN.

Similar to other ROP students, I decided on my research topic early by selecting from a range of topics in different areas. However, the focus of my research shifted slightly as I progressed through my project. Initially, my research topic was “Can we compress training data without degrading accuracy?”. This topic only covered the effect of compression on the accuracy of the algorithm, and at the end of the research, I needed to propose the compression ratio best suited to storing medical images without much loss of accuracy.

Among all the compression types, I decided to work with JPEG2000, as it is one of the most commonly used compression types in medical imaging. The dataset I chose consisted of 100,000 different image patches from histological images of human colorectal cancer (CRC) and normal tissue. Each image was assigned to one of 9 tissue categories. The next step was to choose the machine learning model. I decided to work on binary classification with a CNN model. I picked two categories for the binary classification model, which classifies whether a given tissue image is cancer-associated stroma (STR) (1) or normal colon mucosa (NORM) (0).

The next step was compressing the dataset. I used the Python Imaging Library (PIL) to compress the dataset using JPEG2000. However, the binary classification model did not support images in the j2k format, so I needed to add another step that converted the j2k images to a format the model supports. I decided to convert the images to TIFF, as that was the dataset’s original format.

During my research on compression, Professor Tyrrell pointed out another image size reduction method: downsampling. Although both methods are used to reduce image size, there are some differences between them. This made me curious about which image size reduction method works better for the machine learning algorithm. So, I added another goal to my project: to compare downsampling and compression and determine which image size reduction method is more suitable for medical imaging.
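To make the difference concrete: compression keeps the pixel grid and encodes it more compactly, whereas downsampling reduces the number of pixels itself. Below is a minimal sketch of downsampling by block averaging on a grayscale image stored as a nested list. The function name and the choice of averaging are assumptions for illustration, not the method used in this project.

```python
def downsample(image, factor):
    """Shrink a 2D grayscale image by averaging factor x factor pixel blocks.

    image: list of equal-length rows of numeric pixel values.
    factor: positive integer that divides both the height and the width.
    """
    height, width = len(image), len(image[0])
    if factor <= 0 or height % factor or width % factor:
        raise ValueError("factor must be positive and divide both dimensions")
    result = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            # Collect the pixels of one block and replace them by their mean.
            block = [
                image[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append(sum(block) / len(block))
        result.append(row)
    return result
```

With `factor=2`, a 4x4 image becomes 2x2, so the pixel count drops by a factor of four regardless of the image content.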

Despite all the obstacles I encountered along the way, such as changing the dataset halfway through the project, modifying the model and rerunning everything, and the unexpected 54.39% error at high compression ratios, I have successfully reached the end and am concluding my excellent ROP experience with this reflection. I now have a greater understanding of the research process and of deep learning algorithms. To close this reflection, I want to thank Professor Tyrrell for offering me this opportunity and guiding my research progress through the weekly meetings. I also want to thank Dr. Atsuhiro Hibi for providing me with endless guidance and support throughout the research project through meetings and frequent email exchanges, even when he was busy. Without their help, I would not have been able to have such an excellent research experience.

Tong Su

Manav Shah’s Journey in ROP399

Hi! My name is Manav Shah, and I am finishing my third year as a Computer Science Specialist and Statistics Minor at UofT. This past academic year, I had the opportunity to do an ROP399 research project under the guidance of Professor Pascal Tyrrell, and I would like to share my experience on this blog.

My ROP project compared the effect of decreasing sample size on Vision Transformers (ViTs) against Convolutional Neural Networks on a chest X-ray classification task using the NIH Chest X-Ray dataset. Convolutional Neural Networks have been predominantly used in medical imaging tasks, as they are easy to train and perform very well across image modalities. In recent years, Vision Transformers have been shown to outperform Convolutional Neural Networks, but only when trained or pretrained on extremely large amounts of data. Given that large amounts of labelled data are hard to come by in the field of medical imaging, it is important to set up some baselines for performance and gauge whether future work and research is warranted in this area. This exploratory aspect made my project very exciting.

I started the project not knowing anything about ViTs. I had some experience training and using CNNs or Resnets before. Thus, I started with reading up everything I could about Vision Transformers. However, since it is a relatively new class of models, it was hard to gain an initial intuitive understanding of what was happening in the research papers I read. I did not know where I should start. To not waste time, I started by cleaning my data and preparing a binary classification dataset from the NIH Chest X-Ray dataset, to detect infiltration within the lungs. I trained a small CNN classifier from scratch to see if the results made sense. I was getting an accuracy of around 60%, which I knew was not good enough. Then, I spoke to Prof. Tyrrell and Atsuhiro, who pointed to the fact that my dataset might have some noise relating to the same patients being in the positive and negative class of images. Thus, I cleaned my data some more and made sure there was little correlation between the negative and positive class of images.
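One simple way to remove this kind of patient overlap is to drop every patient whose images appear in both classes. The sketch below assumes each record is a hypothetical `(patient_id, image_name, label)` tuple; it illustrates the idea only and is not the cleaning procedure actually used in this project.

```python
def drop_overlapping_patients(records):
    """Keep only images from patients whose images all share one label.

    records: list of (patient_id, image_name, label) tuples.
    """
    # Collect the set of labels seen for each patient.
    labels_by_patient = {}
    for patient_id, _image_name, label in records:
        labels_by_patient.setdefault(patient_id, set()).add(label)
    # A patient with more than one label appears in both classes.
    return [
        record for record in records
        if len(labels_by_patient[record[0]]) == 1
    ]
```

This trades dataset size for cleaner class separation, so it is worth checking how many images remain in each class afterwards.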

I then proceeded to train a small CNN again, with fair results. However, when I tried training a ViT from scratch on my dataset, it would only learn to output “No Infiltration” for all images, as that was the majority class. So, I did some more research and tried a lot of different techniques, but to no avail. However, in trying to debug the ViT model, I gained an in-depth understanding of concepts like learning rate scheduling, training regimes, transfer learning, and self-attention. I learned a lot from the many failures that I encountered in the project. I would have given up had it not been for Prof. Tyrrell’s patience and encouraging words. I also spoke to my Neural Networks professor and some friends for advice and learned a lot. In the end, I decided to use transfer learning, which ended up giving me very fruitful results.

More than technical knowledge, I learned how to stick with tough projects and what to expect when navigating one. I found Prof. Tyrrell’s attitude towards failures in projects very inspiring, which gave me the confidence to persevere through. The experience, in my opinion, teaches you how tough research actually is, and more importantly, how you can still overcome challenges and only get better having gone through them.

Manav Shah

Grace Yu’s STA299 Journey

Hi everyone! My name is Grace Yu and I’m finishing my second year at the University of Toronto, pursuing a computer science specialist and a molecular genetics major. From September 2021 to April 2022, I was fortunate to have the opportunity to do a STA299 project with Professor Tyrrell through the Research Opportunity Program. I am excited to share my experience with you all!

My project was landmarking with reduced sample size in MSK ultrasound images of knees. Similar to many other ROP students, this was my first research experience. Prior to this project, I had no idea how machine learning works. However, I have always been interested in the intersection between computer science and the medical field, and that’s what drew me to this opportunity.

The start of the project was interesting but not easy. There were many times I did not know if I was doing the right thing, or if my efforts were on the right path. Luckily, Professor Tyrrell and the people in the lab were always very patient and helpful. I began by reading some research papers on developing new semi-supervised learning models, but found them difficult to comprehend and time-consuming. Mauro kindly suggested which parts to focus on when doing the literature research, and advised me to pay more attention to selecting a model instead of focusing on the technical details of how the model is constructed. In addition, as I spent a lot of time choosing a model, I fell behind others. Professor Tyrrell reminded me of the timeline of my project and of the next step I should take as soon as possible, which was to find a dataset. Fortunately, with the help of the lab, we prepared a dataset together and my project got back on schedule. Looking back, I appreciate the period of exploring and experimenting, and the guidance provided by others. The starting point of a project can be difficult and sometimes we do not know what we are doing, but really that’s ok. For me, the time I spent in the beginning paid off: I ended up with an additional suitable model, which led to a nice comparison. This experience will also help me get started on new projects or in new fields more quickly.

I am very grateful to have had the opportunity to work in the MiDATA lab this year. Not only did I gain a better understanding of statistical and computer science concepts, but I also learned the methods and process of conducting research. I would like to thank Professor Tyrrell, Majid, Mauro, and Atsuhiro for their guidance and feedback throughout this project. With this experience, I am more confident and look forward to applying what I have learned to my future research journey.

Grace Yu

Linxi Chen’s STA399 Journey

Hi everyone! My name is Linxi Chen and I’m finishing my third year at the University of Toronto, pursuing a statistics specialist and a mathematics major. I did an STA399 Research Opportunity Program project with Professor Pascal Tyrrell from May 2021 – August 2021, and I am very grateful to have had this opportunity. This project gave me the chance to understand machine learning and scientific research. I would love to share my experience with you all.

Initially, I had no experience with machine learning or convolutional neural networks, and integrating machine learning with medical imaging was a brand-new area for me. Therefore, at the beginning of this project, I searched through loads of information on machine learning to gain a general picture of this area. The first assignment in this project was to make a slide deck on machine learning in medical imaging. Extracting and simplifying the gathered information helped me understand this area more deeply.

My research project was to find an objective metric for heterogeneity and to explore how dataset heterogeneity affects heterogeneity as measured by the CNN training image features, in relation to sample size. At first, how to specifically define the term “heterogeneity” was a big challenge for me, since there are many different definitions on Google and very little information that was directly related to my project. By comparing the information on websites and talking with Professor Tyrrell in the weekly meeting, I managed to define the term “between-group heterogeneity” as the extent to which the measurements of each group vary within a dataset, considering the mean of each subgroup and the grand mean of the population. Next, designing the experiment setup was also challenging, because I had to ensure the experiment steps were applicable and explicable. The datasets were separated into different groups according to the label of each image. We introduced new groups into the dataset while keeping the total sample size the same in each case. There were four cases in total, and the between-group heterogeneity was measured using Cochran’s Q, which is a test statistic based on the chi-squared distribution. The experiment setup was modified several times, because problems came up from time to time. For example, I planned to use a multi-class classification CNN model at first, but it turned out that I would have to use a different number of output neurons for the model in different cases, which made the results not comparable. Professor Tyrrell and Mauro suggested I use a binary classification model with pseudo labels, which successfully solved this problem. Luckily, I found some code online, and with the help of Mauro, I managed to adapt it into code that worked for my project. Next came the hardest part that I encountered. Although in theory my expectations were well justified, the output results I got were still not what I expected. After modifying the model and the sample size several times, I finally managed to get the expected results.
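To show how a statistic of this kind combines subgroup means with a grand mean, here is a minimal sketch of Cochran’s Q in its meta-analysis form, with inverse-variance weights. The weighting by the squared standard error of each group mean and the function name are assumptions for illustration; the exact inputs used in the project are not described here.

```python
from statistics import mean, variance


def cochrans_q(groups):
    """Cochran's Q for between-group heterogeneity.

    groups: list of lists of numeric measurements, one list per group,
    each with at least two values and non-zero variance.
    Returns (Q, degrees_of_freedom).
    """
    means, weights = [], []
    for values in groups:
        # Squared standard error of this group's mean.
        squared_se = variance(values) / len(values)
        if squared_se == 0:
            raise ValueError("each group needs non-zero variance")
        means.append(mean(values))
        weights.append(1.0 / squared_se)  # inverse-variance weight
    # Weighted grand mean across groups.
    grand_mean = sum(w * m for w, m in zip(weights, means)) / sum(weights)
    # Weighted sum of squared deviations of group means from the grand mean.
    q = sum(w * (m - grand_mean) ** 2 for w, m in zip(weights, means))
    return q, len(groups) - 1
```

Under the assumption of no between-group heterogeneity, Q is compared against a chi-squared distribution with the returned degrees of freedom; larger values indicate that the group means differ by more than their within-group variability would explain.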

Overall, I have learned a lot from this ROP program. With the guidance of Professor Tyrrell and the help of students in the lab, I have gained an in-depth understanding of machine learning, neural networks, and the process of scientific research. I have also become more familiar with writing a formal scientific research paper. The most valuable things that I got from this experience are the ability to solve problems and to not get frustrated when things go wrong. I would like to thank Professor Tyrrell for giving me this opportunity to learn about scientific research and helping me overcome all the challenges that I encountered during this process. I’m very grateful that I have gained so many valuable skills in this project. I would also like to thank Mauro and all the members of the lab for giving me so much help with my project.

Linxi Chen

Jacqueline Seal’s Journey in ROP299: Simulating clinical data!

Hi! My name is Jacqueline, and I’m going into my second year at U of T, pursuing a major in Computer Science and a specialist in Bioinformatics. This past summer, I had the opportunity to do a STA299 project with Professor Tyrrell through the Research Opportunity Program, and I’m excited to share my experiences here!

My ROP project for the summer dealt with the simulation of clinical variables relevant to detecting intra-articular hemarthrosis – basically bleeding into the joint – among patients with hemophilia, a disease where patients lack sufficient clotting proteins and are prone to regular, excessive bleeding. Since hemophilia is quite rare, clinical data is often unavailable and so simulation can help us understand what the real data might look like under different sets of plausible assumptions. The ultimate goal was to demonstrate that adding clinical data to Mauro’s existing binary CNN classifier for articular blood detection could boost model performance, as compared to a model trained exclusively on ultrasound images.

Having just completed first year, I went into this ROP with a very limited statistics background and was initially overwhelmed by all the stats jargon being used in lab meetings and in conversations with other lab members. Concepts like “odds ratios,” “ROC,” and “sensitivity analysis” were completely new to me, and I spent many hours just familiarizing myself with these fundamentals. 

After a bit of a slow start, I began my project by identifying physical presentation and clinical history variables to use in my simulation. I was fortunate enough to speak with two distinguished hematologists from Novo Nordisk, Drs. Brand-Staufer and Zak, about the features most relevant to diagnosing a joint bleed. Based on this conversation, I selected two of these variables as a starting point and simulated them according to assumed distributions. Next, I simulated the probability of an articular bleed based on a logistic regression model and used that probability to simulate the “true” presence of a bleed based on a Bernoulli distribution.
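The simulation steps above can be sketched as follows. The two explanatory variables, their distributions, and all coefficient values here are placeholders chosen for illustration (they are not the clinical variables or parameters used in the project); only the structure follows the description: simulate variables, turn them into a bleed probability with a logistic model, then draw the outcome from a Bernoulli distribution.

```python
import math
import random


def simulate_patients(n, intercept, coef_a, coef_b, seed=0):
    """Simulate n patients with two variables and a binary bleed outcome.

    Returns a list of dicts with keys var_a, var_b, probability, bleed.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    patients = []
    for _ in range(n):
        # Hypothetical explanatory variables with assumed distributions.
        var_a = rng.gauss(0.0, 1.0)              # continuous, standard normal
        var_b = 1 if rng.random() < 0.3 else 0   # binary, 30% prevalence
        # Logistic regression model: log-odds are linear in the variables.
        log_odds = intercept + coef_a * var_a + coef_b * var_b
        probability = 1.0 / (1.0 + math.exp(-log_odds))
        # Bernoulli draw for the "true" presence of a bleed.
        bleed = 1 if rng.random() < probability else 0
        patients.append(
            {"var_a": var_a, "var_b": var_b,
             "probability": probability, "bleed": bleed}
        )
    return patients
```

Changing the intercept and coefficients is how different sets of plausible assumptions can be explored, since they control the baseline bleed rate and how strongly each variable shifts it.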

Then, I took a bit of a fun detour: figuring out how to best match simulated data to real-world bleed probabilities output by Mauro’s model. With some guidance from Professor Tyrrell, I developed a matching algorithm that allowed us to control the strength of the positive correlation between clinical simulated probabilities and classifier probabilities. Perhaps the most difficult part of my project was ensuring that the simulated dataset captured the desired relationships between my explanatory variables and between each explanatory variable and the response variables. Thanks to the advice of Guan and Sylvia, however, I was able to verify these relationships and report on them in a statistically sound manner.

Despite all the obstacles I encountered along the way, despite changing the details of my methodology several times, despite making slow progress and occasionally feeling like I was going in circles, I’m very grateful to have had this opportunity. Not only did I gain a greater understanding of important statistical concepts and greater familiarity with machine learning techniques, but I also got first-hand experience navigating the research process, from beginning to end. Ultimately, my experience in the MiDATA lab was simultaneously challenging and rewarding, and I would like to thank Dr. Tyrrell for all his guidance this summer – whether it was setting up impromptu meetings to discuss unexpected issues in my data, providing feedback on my results, or simply sharing humorous anecdotes in our weekly lab meetings. Regardless of where this next year takes me, I’m confident that I’ll carry the lessons I learned this summer with me.

Jacqueline Seal

Jihong Huang’s ROP399 Journey

Hi, my name is Jihong Huang and I have finished my third year in computer science and statistics at the University of Toronto. This summer, I had the great chance to work on my ROP399 project under the guidance of Dr. Pascal Tyrrell. During the pandemic, everything was a bit different from usual, including this program. Still, I would like to share my experience and lessons from this summer with you!

After three years at the university and so many different courses in statistics and computer science, I thought that I was fully prepared to try my hand at research with the knowledge I had learnt in lectures. However, it turned out that I was completely wrong! Everything was different from lectures, where professors teach step by step with detailed notes. I needed to create my own proposal and design the experiments independently, like a scholar instead of a student. Even with Dr. Tyrrell’s help, I struggled to figure out my schedule for the project. The experience was unlike anything I had encountered in lecture assignments.

After all the setup, I began to handle the coding part of my project. I picked YOLOv3 as my application of bounding box regression. YOLOv3 is one of the most popular bounding box regression algorithms and already performs very well in many fields. At the same time, its structure and mechanisms are longer and more complicated than any code that I had ever studied. It looks like a simple combination of classification and localization, each of which is easy to understand on its own, but the combination is much more advanced than my lecture notes! It took me weeks to roughly figure out how it works. Then, I devoted myself to debugging the code. That was difficult, as I was not familiar with most of the packages used. Some issues were caused by mismatched package versions, while others came from subtle mistakes in the code. Tuning the hyperparameters was also frustrating, as I usually could not find the optimal values for them. Thanks to the great help from Mauro, I finally got my code working on the server.

By the end of my project, I had gained a lot of advanced knowledge about bounding box regression and many related packages, which I would probably never have touched before graduation if I had not taken on this project. However, my most precious lessons are not about any specific coding ability. The most important lesson is what scientific research is and how it should be done. I learnt that it is very important to make a clear and specific proposal as the plan at the beginning, as it provides the guidelines for any further coding experiments. Otherwise, it would be easy to go off track and lose sight of the initial goal when overwhelmed by thousands of lines of code. Also, there can always be failures in scientific research. I spent more than half of my time making and fixing mistakes during the project, which frustrated me a lot in the process. My final conclusion suggested that the algorithm selected was not performing well. But all of this is common in scientific research. We learn from failures, so the failures are meaningful, and we can make further progress based on them. I am thankful to Dr. Tyrrell and all the other lab members, who helped me out of my frustration during the project and offered me valuable advice.

After this three-month project, I have learnt a lot from my first venture into the world of scientific research, including coding skills and the scientific spirit. This experience provided me with important guidance on my future direction of study, and I think all the time and effort was worthwhile.

– Jihong Huang

Rui Zhu’s ROP399 Journey

I am Rui Zhu. I’ve just completed my third year in the computer science program. I worked in Dr. Tyrrell’s lab on my ROP399 project this past summer, which was a new and wonderful experience for me.

As I write this reflection, and as I expect will happen at other decision-making moments in the future, I am reminded of my interview with Dr. Tyrrell, where he asked me why I chose his lab and why he should choose me. I had tons of reasons for choosing his lab. However, honestly, it was hard for me to put together a whole sentence to answer why he would choose me. “I haven’t done research before, and everything needs a start,” I remember saying, without much confidence, “so I need this chance to see if I am really interested in it and see how it goes”. Fortunately, I received Dr. Tyrrell’s offer a few days after the interview, and my very first research experience started.

My ROP project was on the imperfect gold standard, which is the consensus of the readers. More specifically, the project was about training models on datasets labelled by readers who make mistakes. At first, I started by reading a lot of papers on robust learning. However, when I had my first meeting with Dr. Tyrrell and Atsuhiro, who kindly helped me with my project, I could not explain what the definition of an imperfect gold standard is or why we need the consensus of the readers. Atsuhiro helped me out. He explained the problem in real-world applications, where multiple readers annotate a huge dataset without looking at each other’s labels, because it is time-consuming and costly. I learnt that doing research starts by thinking about why I am doing this rather than how to do it. I kept getting questions like why my project is meaningful.
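As a small illustration of what a consensus of readers can mean in practice, here is a minimal sketch that takes a majority vote over the labels several readers gave to one image. Majority voting and the handling of ties are assumptions made for this example; the project may have defined consensus differently.

```python
from collections import Counter


def consensus_label(reader_labels):
    """Return the majority label among readers, or None if there is a tie.

    reader_labels: non-empty list of labels, one per reader, for one image.
    """
    if not reader_labels:
        raise ValueError("at least one reader label is required")
    # Labels sorted from most to least frequent.
    ranked = Counter(reader_labels).most_common()
    # A tie for first place means the readers reached no consensus.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

Because each reader can make mistakes, the majority label can itself be wrong, which is what makes this kind of gold standard imperfect.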

After sorting out my thoughts, I began writing my premise, purpose, hypothesis, and objectives. I thought it would be difficult to write a whole page for these things, but once I realized that I should not assume people know why I am doing the project, I explained everything to readers in my introduction. Filling a whole page was easier than I thought. After finishing my premise, purpose, hypothesis, and objectives, I combined them into a full introduction. Everything flowed like water.

When I was writing the actual code for my project, I did not run into many difficulties, as I was getting help from Atsuhiro and Mauro. I want to thank them for their help. Mauro taught me how to use PyTorch Lightning, which structures PyTorch code in a way that is easy to understand. Atsuhiro helped me confirm my experimental methodology and gave me guidance on which robust learning techniques to use for my project. It also helped that I started familiarizing myself with the code very early.

Overall, the journey on my summer ROP research was wonderful. I learnt how to start research from scratch and some knowledge in robust learning, although I am only scratching the surface of it. It was a pleasure for me to work in Dr. Tyrrell’s lab this summer. I look forward to what I can do in the future in the world of research.

– Rui Zhu

Qianyu Fan’s ROP299 Experience

My name is Qianyu Fan, and I have finished my first year at the University of Toronto, pursuing a statistics specialist. This summer I was given the incredible opportunity to work in Dr. Tyrrell’s lab for the ROP299 course. Over these four months, I have gone through pain and suffering, undergone a metamorphosis, and finally reaped the fruits.

I still remember promising Professor Tyrrell during the interview that I would put in twice as much effort as others to complete the research. Even though I had no experience with machine learning and neural networks, and had not even heard of them, the professor was willing to accept me! During the weekly meetings, the terminology was hard for me to understand, though I tried to look it up afterward. And so, I began my research in a daze.

Early on, I floundered to find a focus. The topic “Compare Image Similarity” is huge, and I could have approached it from many angles. For instance, we could use different distance metrics to explore the similarities between synthetic and real images. Whether replacing real images with synthetic ones improves model accuracy during training is also a meaningful topic. With so many interesting ideas for the project, I was lost, and the proposal was constantly being revised. As other ROP students were starting to write up their projects, I was still stuck on the proposal and was anxious about my progress. The professor understood my situation and helped me redefine my direction, because he cared about what the students learned in the course. So, my theme became: Comparison of Two Augmentation Methods in Improving Detection Accuracy of Hemarthrosis. We used data synthesis and traditional augmentation techniques to explore and compare the recognition accuracy with increasing proportions of augmented data.

As the deadline was approaching, I thought about giving up because I had no results. Once, in a private meeting with the professor, I broke down and cried. What a shame! He gave me much support and understood my frustration. Mauro was very helpful in offering the datasets, allowing me to use his code, and answering my questions. Thanks to their help, my thinking became clear, and I was able to complete the project on time.

A tortuous but unforgettable journey is over. I have learned a lot of things in this ROP course, from machine learning to scientific research. This will be an asset in my life. I appreciate that the professor gave me this opportunity and that I was able to complete my project.

Qianyu Fan