Today’s MiWORD of the day is… Lasso!

Wait… Lasso? Isn’t a lasso that lariat or loop-like rope that cowboys use? Or perhaps you may be thinking about that tool in Photoshop that’s used for selecting free-form segments!

Well… technically neither is wrong! However, in statistics and machine learning, Lasso stands for something completely different: least absolute shrinkage and selection operator. This term was coined by Dr. Robert Tibshirani in 1996 (who was a UofT professor at that time!).

Okay… that’s cool and all, but what the heck does that actually mean? And what does it do?

Lasso is a type of regression analysis method, meaning it tries to estimate the relationship between predictor variables and outcomes. It’s typically used to perform feature selection or regularization.

Regularization is a way of reducing overfitting of a model, ie. it removes some of the “noise” and randomness of the data. On the other hand, feature selection is a form of dimension reduction. Out of all the predictor variables in a dataset, it will select the few that contribute the most to the outcome variable to include in a predictive model.

Lasso works by applying a fixed upper bound to the sum of absolute values of the coefficient of the predictors in a model. To ensure that this sum is within the upper bound, the algorithm will shrink some of the coefficients, particularly it shrinks the coefficients of predictors that are less important to the outcome. The predictors whose coefficients are shrunk to zero are not included at all in the final predictive model.

Lasso has applications in a variety of different fields! It’s used in finance, economics, physics, mathematics, and if you haven’t guessed already… medical imaging! As the state-of-the-art feature selection technique, Lasso is used a lot in turning large radiomic datasets into easily interpretable predictive models that help researchers study, treat, and diagnose diseases.

Now onto the fun part, using Lasso in a sentence by the end of the day! (see rules here)

Serious: This predictive model I got using Lasso has amazing accuracy for detecting the presence of a tumour!

Less serious: I went to my professor’s office hours for some help on how to use Lasso, but out of nowhere he pulled out a rope!

See you in the blogosphere!

Jessica Xu

MiWord of the Day Is… dimensionality reduction!

Guess what?

You are looking at a real person, not a painting! This is one of the great works by a talented artist Alexa Meade, who paints on 3D objects but creates a 2D painting illusion. Similarly in the world of statistics and machine learning, dimensionality reduction means what it sounds like: reduce the problem to a lower dimension. But only this time, not an illusion.

Imagine a 1x1x1 data point living inside a 2x2x2 feature space. If I ask you to calculate the data density, you will get ½ for 1D, ¼ for 2D and 1/8 for 3D. This simple example illustrates that the data points become sparser in higher dimensional feature space. To address this problem, we need some dimensional reduction tools to eliminate the boring dimensions (dimensions that do not give much information on the characteristics of the data).

There are mainly two approaches when it comes to dimension reduction. One is to select a subset of features (feature selection), the other is to construct some new features to describe the data in fewer dimensions (feature extraction).

Let us consider an example to illustrate the difference. Suppose you are asked to come up features to predict the university acceptance rate of your local high school.

You may discard the “grade in middle school” for its many missing values; discard “date of birth” and “student name” as they are not playing much role in applying university; discard “weight > 50kg” as everyone has the same value; discard “grade in GPA” as it can be calculated. If you have been through a similar process, congratulations! You just performed a dimension reduction by feature selection.

What you have done is removing the features with many missing values, the least correlated features, the features with low variance and one of the highly correlated. The idea behind feature selection is that the data might contain some redundant or irrelevant features and can be removed without losing too much loss information.

Now, instead of selecting a subset of features, you might try to construct some new features from the old ones. For example, you might create a new feature named “school grade” based on the full history of the academic features. If you have been through a thought process like this, you just performed a dimensional reduction by feature extraction

If you would like to do a linear combination, principal component analysis (PCA) is the tool for you. In PCA, variables are linearly combined into a new set of variables, known as the principal components. One way to do so is to give a weighted linear combination of “grade in score”, “grade in middle school” and “recommend letter” …

Now let us use “dimensionality reduction” in a sentence.

Serious: There are too many features in this dataset, and the testing accuracy seems too low. Let us apply dimensional reduction techniques to reduce overfit of our model…

Less serious:

Mom: “How was your trip to Tokyo?”

Me: “Great! Let me just send you a dimensionality reduction version of Tokyo.”

Mom: “A what Tokyo?”

Me: “Well, I mean … photos of Tokyo.”

I’ll see you in the blogosphere…

Jacky Wang