In this post, I will explain the concepts of collinearity and multicollinearity, why it is important to understand them, and how to take appropriate action when preparing data.
Correlation measures the strength and direction of the linear relationship between two columns in a dataset. It is often used to find the relationship between a feature and the target:
For example, if one of the features has a high correlation with the target, it tells us that this particular feature heavily influences the target and should be included when we are training the model.
Collinearity, on the other hand, is a situation where two features are linearly associated (high correlation), and they are used as predictors for the target.
Multicollinearity is a special case of collinearity where a feature exhibits a linear relationship with two or more features.
Recall the formula for multiple linear regression: y = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ
One important assumption of linear regression is that there should exist a linear relationship between each of the predictors (x₁, x₂, and so on) and the outcome y. However, if there is a correlation between the predictors (e.g. x₁ and x₂ are highly correlated), we can no longer determine the effect of one while holding the other constant, since the two predictors change together. The end result is that the coefficients (w₁ and w₂) become less precise and hence less interpretable.
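To make this concrete, here is a minimal sketch (on synthetic data, with made-up variable names) that fits the same regression on two random halves of a dataset in which x₂ is almost a copy of x₁. The individual coefficients swing between fits even though their sum stays close to the true value:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # x2 is almost identical to x1
y = 3 * x1 + rng.normal(scale=0.1, size=n)  # y really only depends on x1

# fit the same model on two different random halves of the data
for _ in range(2):
    idx = rng.permutation(n)[: n // 2]
    X = np.column_stack([x1[idx], x2[idx]])
    model = LinearRegression().fit(X, y[idx])
    print(model.coef_)  # w1 and w2 vary a lot between fits, but their sum stays near 3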
When training a machine learning model, it is important that, during the data preprocessing stage, we identify and remove the features in the dataset that exhibit multicollinearity. We can do so using a metric known as the Variance Inflation Factor (VIF).
VIF allows us to determine the strength of the correlation between the various independent variables. It is calculated by taking a variable and regressing it against all the other variables.
In other words, VIF measures how much the variance of a regression coefficient is inflated because of that variable's linear dependence on the other predictors, hence its name.
Here is how VIF works. Suppose we have four features, x₁, x₂, x₃, and x₄. First, we regress x₁ against the rest of the features:
x₁ ~ x₂ + x₃ + x₄
What we are doing above is a multiple regression: we are using several independent (predictor) variables to explain one dependent (criterion) variable. We then repeat this for each of the other features:
x₂ ~ x₁ + x₃ + x₄ # regress x₂ against the rest of the features
x₃ ~ x₁ + x₂ + x₄ # regress x₃ against the rest of the features
x₄ ~ x₁ + x₂ + x₃ # regress x₄ against the rest of the features
From each of these regressions we take the R² value and compute the VIF of that feature as 1 / (1 - R²). The quantity 1 - R² is also known as the tolerance. For example, if regressing x₁ against the other features yields an R² of 0.9, the VIF of x₁ is 1 / (1 - 0.9) = 10; a feature that is well explained by the other features therefore has a high VIF.
While a correlation matrix and scatter plots can be used to spot multicollinearity, they only show the bivariate relationships between pairs of independent variables. VIF, on the other hand, measures the correlation of a variable with a whole group of other variables.
Now that we know how VIF is calculated, we can implement it using Python, with a little help from sklearn:
import pandas as pd
from sklearn.linear_model import LinearRegression

def calculate_vif(df, features):
    vif, tolerance = {}, {}

    # examine each of the features we want to check
    for feature in features:
        # extract all the other features we will regress against
        X = [f for f in features if f != feature]
        X, y = df[X], df[feature]

        # extract r-squared from the fit
        r2 = LinearRegression().fit(X, y).score(X, y)

        # calculate tolerance
        tolerance[feature] = 1 - r2

        # calculate VIF
        vif[feature] = 1 / tolerance[feature]

    # return VIF DataFrame
    return pd.DataFrame({'VIF': vif, 'Tolerance': tolerance})
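As a quick sanity check, we can run calculate_vif() on a small synthetic DataFrame (the names x1, x2, x3 and df_demo below are made up for illustration) and, if you have statsmodels installed, cross-check the results against its built-in variance_inflation_factor():

import numpy as np

# x3 is (almost) the sum of x1 and x2, so it is multicollinear with them
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)
df_demo = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

print(calculate_vif(df_demo, ['x1', 'x2', 'x3']))

# optional cross-check with statsmodels (results should be very close)
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

X = sm.add_constant(df_demo)  # the auxiliary regressions need an intercept
for i, col in enumerate(df_demo.columns, start=1):  # index 0 is the constant
    print(col, variance_inflation_factor(X.values, i))

Notice that the pairwise correlations in this example stay around 0.7 or lower, yet the VIF values come out very large; this is exactly the kind of multicollinearity that a correlation matrix alone can understate.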
To see VIF in action, let’s use a sample dataset named bloodpressure.csv, with the following content:
Pt,BP,Age,Weight,BSA,Dur,Pulse,Stress,
1,105,47,85.4,1.75,5.1,63,33,
2,115,49,94.2,2.1,3.8,70,14,
3,116,49,95.3,1.98,8.2,72,10,
4,117,50,94.7,2.01,5.8,73,99,
5,112,51,89.4,1.89,7,72,95,
6,121,48,99.5,2.25,9.3,71,10,
7,121,49,99.8,2.25,2.5,69,42,
8,110,47,90.9,1.9,6.2,66,8,
9,110,49,89.2,1.83,7.1,69,62,
10,114,48,92.7,2.07,5.6,64,35,
11,114,47,94.4,2.07,5.3,74,90,
12,115,49,94.1,1.98,5.6,71,21,
13,114,50,91.6,2.05,10.2,68,47,
14,106,45,87.1,1.92,5.6,67,80,
15,125,52,101.3,2.19,10,76,98,
16,114,46,94.5,1.98,7.4,69,95,
17,106,46,87,1.87,3.6,62,18,
18,113,46,94.5,1.9,4.3,70,12,
19,110,48,90.5,1.88,9,71,99,
20,122,56,95.7,2.09,7,75,99,
The dataset consists of the following fields: Pt (patient number), BP (blood pressure, the value we want to predict), Age, Weight, BSA (body surface area), Dur (duration of hypertension), Pulse (pulse rate), and Stress (a stress index).
First, load the dataset into a Pandas DataFrame and drop the redundant columns:
df = pd.read_csv('bloodpressure.csv')

# drop the patient number and the empty column created by the trailing commas
df = df.drop(['Pt', 'Unnamed: 8'], axis=1)
df
Before we remove any features, it is useful to visualize the relationships between the various columns using a pair plot (via the Seaborn module):
import seaborn as sns
sns.pairplot(df)
From the pair plot, I have identified some pairs of columns that appear to be strongly correlated, such as Weight and BSA.
Next, calculate the correlation between the columns using the corr() function:
df.corr()
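If the raw correlation matrix is hard to scan, an annotated heatmap (a small optional sketch using the Seaborn module we already imported) makes the strong correlations easier to spot:

# annotated heatmap of the correlation matrix
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')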
Assuming that we are trying to build a model that predicts BP, we can see that the features that correlate most strongly with BP are Age, Weight, BSA, and Pulse.
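A quick way to confirm this is to sort every column's correlation with BP:

# correlation of each feature with BP, strongest first
df.corr()['BP'].drop('BP').sort_values(ascending=False)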
Now that we have identified the columns that we want to use for training the model, we need to see which of them exhibit multicollinearity. So let's use the calculate_vif() function that we wrote earlier:
calculate_vif(df=df, features=['Age','Weight','BSA','Pulse'])
VIF values start at 1 and have no upper bound. A common rule of thumb for interpreting them is:
- VIF = 1: the feature is not correlated with the other features
- VIF between 1 and 5: the feature is moderately correlated with the other features
- VIF greater than 5: the feature is highly correlated with the other features
From the VIF results above, we can see that Weight and BSA have VIF values greater than 5. This means that Weight and BSA are highly correlated with each other. This is not surprising, as heavier people generally have a larger body surface area.
So the next thing to do is to try removing one of the highly correlated features and see whether the VIF values improve. Let's try removing Weight first, since it has the higher VIF of the two:
calculate_vif(df=df, features=['Age','BSA','Pulse'])
Next, let's instead remove BSA (keeping Weight) and check the VIF values of the remaining features:
calculate_vif(df=df, features=['Age','Weight','Pulse'])
As we observed, removing Weight results in a lower VIF for all other features, compared to removing BSA. So should we remove Weight then? Well, ideally, yes. But for practical reasons, it would make more sense to remove BSA and keep Weight. This is because later on when the model is trained and we use it for prediction, it is easier to get a patient’s weight than his/her body surface area.
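As a quick sketch of where this leads, we could now fit a simple linear regression for BP using the retained features (Age, Weight, and Pulse):

from sklearn.linear_model import LinearRegression

features = ['Age', 'Weight', 'Pulse']
X, y = df[features], df['BP']

# fit a simple model on the retained, low-VIF features
model = LinearRegression().fit(X, y)
print(dict(zip(features, model.coef_)))  # one coefficient per feature
print(model.score(X, y))                 # R² on the training data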
Let’s look at one more example. This time we will use the Breast Cancer dataset that comes with sklearn:
from sklearn import datasets
bc = datasets.load_breast_cancer()
df = pd.DataFrame(bc.data, columns=bc.feature_names)
df
This dataset has 30 columns, so let’s only focus on the first 8 columns:
sns.pairplot(df.iloc[:,:8])
We can immediately observe that some features are highly correlated. Can you spot them?
Let’s calculate the VIF for the first 8 columns:
calculate_vif(df=df, features=df.columns[:8])
We can see that several of the features have very large VIF values, most notably mean radius, mean perimeter, and mean area, which are all measures of the size of the cell nuclei and are therefore almost perfectly correlated with one another.
Let’s try to remove these features one by one and observe their new VIF values. First, remove the mean perimeter:
calculate_vif(df=df, features=['mean radius',
'mean texture',
'mean area',
'mean smoothness',
'mean compactness',
'mean concavity',
'mean concave points'])
Immediately, there is a reduction in the VIF values across the board. Let's now remove mean area as well:
calculate_vif(df=df, features=['mean radius',
'mean texture',
# 'mean area',
'mean smoothness',
'mean compactness',
'mean concavity',
'mean concave points'])
Let's now remove mean concave points, which now has the highest VIF:
calculate_vif(df=df, features=['mean radius',
'mean texture',
# 'mean area',
'mean smoothness',
'mean compactness',
'mean concavity',
# 'mean concave points'
])
Finally, let’s remove mean concavity:
calculate_vif(df=df, features=['mean radius',
'mean texture',
# 'mean area',
'mean smoothness',
'mean compactness',
# 'mean concavity',
# 'mean concave points'
])
And now all the VIF values are under 5.
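For larger feature sets, this manual, one-feature-at-a-time process can be automated. Here is a small sketch that repeatedly drops the feature with the highest VIF until every remaining feature is at or below a chosen threshold (5 here, following the rule of thumb above):

def drop_high_vif_features(df, features, threshold=5.0):
    # repeatedly drop the feature with the highest VIF until all are <= threshold
    features = list(features)
    while len(features) > 1:
        vif_df = calculate_vif(df, features)
        worst = vif_df['VIF'].idxmax()
        if vif_df.loc[worst, 'VIF'] <= threshold:
            break
        features.remove(worst)
    return features

print(drop_high_vif_features(df, df.columns[:8]))

Note that an automated loop may not drop exactly the same features as the manual walkthrough, and it cannot weigh practical considerations such as preferring Weight over BSA, so it is best treated as a starting point.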
In this article, we learned that multicollinearity happens when a feature exhibits a linear relationship with two or more other features. One way to detect multicollinearity is to calculate the Variance Inflation Factor (VIF). As a rule of thumb, any feature with a VIF of more than 5 is a candidate for removal from the training dataset. It is also important to note that VIF only works on continuous variables, not categorical variables.