Hi! My name is Manav Shah, and I am finishing the third year of the Computer Science Specialist and Statistics Minor at UofT. This past academic year, I had the opportunity to do an ROP399 research project under the guidance of Professor Pascal Tyrrell, and I would like to share my experience on this blog.
My ROP project compared the effect of decreasing sample size on Vision Transformers versus Convolutional Neural Networks for a Chest X-Ray classification task using the NIH Chest X-Ray dataset. Convolutional Neural Networks have predominantly been used in medical imaging tasks, as they are easy to train and perform very well across image modalities. In recent years, however, Vision Transformers (ViTs) have been shown to outperform Convolutional Neural Networks, but only when trained or pretrained on extremely large amounts of data. Given that large amounts of labelled data are hard to come by in the field of Medical Imaging, it is important to set up some baselines for performance and gauge whether future work and research is warranted in this arena. This exploratory aspect made my project very exciting.
I started the project not knowing anything about ViTs, although I had some experience training and using CNNs and ResNets. So I began by reading everything I could about Vision Transformers. However, since they are a relatively new class of models, it was hard to gain an initial intuitive understanding of what was happening in the research papers I read, and I did not know where to start. To avoid wasting time, I started by cleaning my data and preparing a binary classification dataset from the NIH Chest X-Ray dataset to detect infiltration within the lungs. I trained a small CNN classifier from scratch to see if the results made sense. I was getting an accuracy of around 60%, which I knew was not good enough. I then spoke to Prof. Tyrrell and Atsuhiro, who pointed out that my dataset might contain noise from the same patients appearing in both the positive and negative classes of images. So I cleaned my data further and made sure there was as little patient overlap as possible between the negative and positive classes.
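For anyone curious what that kind of cleaning can look like, here is a minimal sketch (not my exact pipeline) that drops patients whose images fall in both classes. It assumes the standard NIH metadata file Data_Entry_2017.csv with its "Image Index", "Finding Labels", and "Patient ID" columns, and uses pandas:

```python
import pandas as pd

# Load the NIH Chest X-Ray metadata (file and column names from the standard release).
df = pd.read_csv("Data_Entry_2017.csv")

# Binary label: does "Infiltration" appear among the findings for this image?
df["label"] = df["Finding Labels"].str.contains("Infiltration").astype(int)

# Find patients whose images fall in both classes; keeping them would put
# near-identical chest X-rays of the same person on opposite sides of the label.
labels_per_patient = df.groupby("Patient ID")["label"].nunique()
mixed_patients = labels_per_patient[labels_per_patient > 1].index

# Keep only patients whose images are consistently positive or consistently negative.
clean_df = df[~df["Patient ID"].isin(mixed_patients)].copy()

print(f"Dropped {df['Patient ID'].nunique() - clean_df['Patient ID'].nunique()} mixed patients")
```

The same idea extends to the train/test split: grouping the split by Patient ID ensures the same person never shows up in both training and evaluation sets.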
I then proceeded to train a small CNN again, with fair results. However, when I tried training a ViT from scratch on my dataset, it would only learn to output "No Infiltration" for every image, since that was the majority class. I did more research and tried many different techniques, but to no avail. In trying to debug the ViT, however, I gained an in-depth understanding of concepts such as learning rate scheduling, training regimes, transfer learning, and self-attention. I learned a great deal from the many failures I encountered in the project, and I might have given up had it not been for Prof. Tyrrell's patience and encouraging words. I also spoke to my Neural Networks professor and some friends for advice and learned a lot. In the end, I decided to use transfer learning, which ended up giving me very fruitful results.
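To make the transfer-learning idea concrete, here is a minimal sketch of that kind of setup: start from an ImageNet-pretrained ViT and swap its head for the two-class infiltration task. It uses torchvision's vit_b_16 and is an illustration rather than my exact training code:

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Start from an ImageNet-pretrained ViT-B/16 instead of training from scratch.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the final linear layer with a 2-class head (Infiltration / No Infiltration).
model.heads.head = nn.Linear(model.heads.head.in_features, 2)

# Optionally freeze the pretrained backbone and train only the new head at first.
for name, param in model.named_parameters():
    if not name.startswith("heads"):
        param.requires_grad = False

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3
)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch; real training would loop
# over a DataLoader of preprocessed 224x224 chest X-ray images.
model.train()
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

For me, transfer learning was what finally produced results beyond the majority-class baseline; a class-weighted loss or a gentler learning rate schedule are other common levers for that failure mode, which is where the concepts I listed above came in.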
More than technical knowledge, I learned how to stick with tough projects and what to expect when navigating one. I found Prof. Tyrrell's attitude towards failures in projects very inspiring, which gave me the confidence to persevere. The experience, in my opinion, teaches you how tough research actually is and, more importantly, how you can still overcome challenges and come out better for having gone through them.
Manav Shah