
In this post, I will explain the concept of collinearity and multicollinearity and why it is important to understand them and take appropriate action when we are preparing data.

## Correlation vs. Collinearity vs. Multicollinearity

*Correlation* measures the strength and direction of the linear relationship between two columns in a dataset. Correlation is often used to find the relationship between a feature and the target.

For example, if one of the features has a high correlation with the target, it tells us that this particular feature heavily influences the target and should be included when we are training the model.

*Collinearity*, on the other hand, is a situation where two features are linearly associated (high *correlation*), and they are used as *predictors* for the target.

*Multicollinearity* is a special case of collinearity where a feature exhibits a linear relationship with two or more features.

## Problem with collinearity and multicollinearity

Recall the formula for multiple linear regression:

**y** = **w**₀ + **w**₁**x**₁ + **w**₂**x**₂ + … + **w**ₙ**x**ₙ

One important assumption of linear regression is that there should exist a linear relationship between each of the predictors (**x**₁, **x**₂, etc) and the outcome **y**. However, if there is a correlation between the predictors (e.g. **x**₁ and **x**₂ are highly correlated), we can no longer determine the effect of one while holding the other constant since the two predictors change together. The end result is that the coefficients (**w**₁ and **w**₂) are now less exact and hence less interpretable.
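This instability is easy to demonstrate with a small sketch (synthetic data and variable names of my own choosing): when two predictors are nearly identical, the individual coefficients swing from fit to fit even though their combined effect stays stable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)        # x2 is almost a copy of x1
y = 3 * x1 + 2 * x2 + rng.normal(scale=0.5, size=n)

# Fit the same model on two random halves of the data.
# w1 and w2 vary wildly between fits, but w1 + w2 stays near 5.
for trial in range(2):
    idx = rng.choice(n, size=n // 2, replace=False)
    X = np.column_stack([x1[idx], x2[idx]])
    model = LinearRegression().fit(X, y[idx])
    print(model.coef_.round(1), round(model.coef_.sum(), 1))
```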

## Fixing Multicollinearity

When training a machine learning model, it is important that during the data preprocessing stage we sieve out the features in the dataset that exhibit multicollinearity. We can do so using a method known as **VIF** — **Variance Inflation Factor**.

**VIF** allows us to determine the strength of the correlation between the various independent variables. *It is calculated by taking a variable and regressing it against all the other variables*.

VIF calculates how much the **variance** of a coefficient is **inflated** because of its linear dependencies with other predictors. Hence its name.

Here is how VIF works:

- Assuming we have a list of features — **x**₁, **x**₂, **x**₃, and **x**₄.
- We first take the first feature, **x**₁, and regress it against the other features:

x₁ ~ x₂ + x₃ + x₄

In fact, we are performing a multiple regression above. Multiple regression generally explains the relationship between multiple independent or predictor variables and one dependent or criterion variable.

- In the multiple regression above, we extract the **R²** value (between 0 and 1). If **R²** is *large*, this means that **x**₁ can be predicted from the three features and is thus highly correlated with them — **x**₂, **x**₃, and **x**₄. If **R²** is *small*, this means that **x**₁ cannot be predicted from the three features and is thus *not* correlated with **x**₂, **x**₃, and **x**₄.
- Based on the **R²** value calculated for **x**₁, we can now calculate its **VIF** using the following formula:

VIF = 1 / (1 − R²)

- A large **R²** value (close to 1) will cause the denominator to be small (1 minus a value close to 1 gives a number close to 0). This results in a large VIF. A large VIF indicates that the feature exhibits multicollinearity with the other features.
- Conversely, a small **R²** value (close to 0) will cause the denominator to be large (1 minus a value close to 0 gives a number close to 1). This results in a small VIF. A small VIF indicates that the feature exhibits low multicollinearity with the other features.
- (1 − R²) is also known as the *tolerance*.

- We repeat the process above for the other features and calculate the VIF for each feature:

x₂ ~ x₁ + x₃ + x₄ # regress x₂ against the rest of the features

x₃ ~ x₁ + x₂ + x₄ # regress x₃ against the rest of the features

x₄ ~ x₁ + x₂ + x₃ # regress x₄ against the rest of the features
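Before implementing the full procedure, here is a quick numeric feel for how R² maps to tolerance and VIF (illustrative values only):

```python
# How R² maps to tolerance (1 - R²) and VIF (1 / tolerance)
for r2 in (0.0, 0.5, 0.8, 0.9, 0.99):
    tolerance = 1 - r2
    vif = 1 / tolerance
    print(f"R² = {r2:<5} tolerance = {tolerance:<5.2f} VIF = {vif:.1f}")
```

Notice how quickly VIF explodes as R² approaches 1: an R² of 0.9 already gives a VIF of 10.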

While a correlation matrix and scatter plots can be used to find multicollinearity, they only show the bivariate relationships between pairs of independent variables. VIF, on the other hand, shows the correlation of a variable with a whole group of other variables.
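This difference is easy to demonstrate with synthetic data (the variable names are my own): a feature built as the sum of two independent features shows only moderate pairwise correlations, yet its VIF is enormous:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)   # near-exact linear combination
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

# No single pairwise correlation looks alarming...
print(df.corr().round(2))

# ...but regressing x3 on x1 and x2 together reveals the multicollinearity
X, y = df[['x1', 'x2']], df['x3']
r2 = LinearRegression().fit(X, y).score(X, y)
print('VIF of x3:', round(1 / (1 - r2), 1))
```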

## Implementing VIF using Python

Now that we know how VIF is calculated, we can implement it using Python, with a little help from **sklearn**:

```
import pandas as pd
from sklearn.linear_model import LinearRegression

def calculate_vif(df, features):
    vif, tolerance = {}, {}
    # examine all the features that we want to check
    for feature in features:
        # extract all the other features we will regress against
        X = [f for f in features if f != feature]
        X, y = df[X], df[feature]
        # extract R² from the fit
        r2 = LinearRegression().fit(X, y).score(X, y)
        # calculate tolerance
        tolerance[feature] = 1 - r2
        # calculate VIF
        vif[feature] = 1 / tolerance[feature]
    # return the VIF and tolerance for each feature as a DataFrame
    return pd.DataFrame({'VIF': vif, 'Tolerance': tolerance})
```

## Let’s Try It Out

To see VIF in action, let’s use a sample dataset named **bloodpressure.csv**, with the following content:

```
Pt,BP,Age,Weight,BSA,Dur,Pulse,Stress,
1,105,47,85.4,1.75,5.1,63,33,
2,115,49,94.2,2.1,3.8,70,14,
3,116,49,95.3,1.98,8.2,72,10,
4,117,50,94.7,2.01,5.8,73,99,
5,112,51,89.4,1.89,7,72,95,
6,121,48,99.5,2.25,9.3,71,10,
7,121,49,99.8,2.25,2.5,69,42,
8,110,47,90.9,1.9,6.2,66,8,
9,110,49,89.2,1.83,7.1,69,62,
10,114,48,92.7,2.07,5.6,64,35,
11,114,47,94.4,2.07,5.3,74,90,
12,115,49,94.1,1.98,5.6,71,21,
13,114,50,91.6,2.05,10.2,68,47,
14,106,45,87.1,1.92,5.6,67,80,
15,125,52,101.3,2.19,10,76,98,
16,114,46,94.5,1.98,7.4,69,95,
17,106,46,87,1.87,3.6,62,18,
18,113,46,94.5,1.9,4.3,70,12,
19,110,48,90.5,1.88,9,71,99,
20,122,56,95.7,2.09,7,75,99,
```

The dataset consists of the following fields:

- Blood pressure (**BP**), in mm Hg
- **Age**, in years
- **Weight**, in kg
- Body surface area (**BSA**), in m²
- Duration of hypertension (**Dur**), in years
- Basal pulse (**Pulse**), in beats per minute
- Stress index (**Stress**)

First, load the dataset into a Pandas DataFrame and drop the redundant columns:

```
df = pd.read_csv('bloodpressure.csv')
df = df.drop(['Pt', 'Unnamed: 8'], axis=1)
df
```
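If you don’t have **bloodpressure.csv** on disk, the snippet listed earlier can be loaded directly from a string (a convenience sketch; only the first five rows are inlined here). The trailing comma on every row is what produces the redundant **Unnamed: 8** column that we drop:

```python
import io
import pandas as pd

# First five rows of bloodpressure.csv, inlined for convenience
csv_data = """Pt,BP,Age,Weight,BSA,Dur,Pulse,Stress,
1,105,47,85.4,1.75,5.1,63,33,
2,115,49,94.2,2.1,3.8,70,14,
3,116,49,95.3,1.98,8.2,72,10,
4,117,50,94.7,2.01,5.8,73,99,
5,112,51,89.4,1.89,7,72,95,
"""

df = pd.read_csv(io.StringIO(csv_data))
# The trailing comma on each row creates an empty 'Unnamed: 8' column
df = df.drop(['Pt', 'Unnamed: 8'], axis=1)
print(df.shape)   # (5, 7)
```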

## Visualizing the relationships between columns

Before we do any cleanup, it would be useful to visualize the relationships between the various columns using a pair plot (using the **Seaborn** module):

```
import seaborn as sns
sns.pairplot(df)
```

From the pair plot, I have identified some columns where there seems to be a strong correlation.

## Calculating Correlation

Next, calculate the correlation between the columns using the **corr()** function:

```
df.corr()
```

Assuming that we are trying to build a model that predicts **BP**, we can see that the features most strongly correlated with **BP** are **Age**, **Weight**, **BSA**, and **Pulse**.

## Calculating VIF

Now that we have identified the columns that we want to use to train the model, we need to see which of these columns have multicollinearity. So let’s use the **calculate_vif()** function that we wrote earlier:

```
calculate_vif(df=df, features=['Age','Weight','BSA','Pulse'])
```

## Interpreting VIF Values

VIF values range from 1 to infinity. A rule of thumb for interpreting them:

- VIF = 1 — features are not correlated
- 1 < VIF < 5 — features are moderately correlated
- VIF > 5 — features are highly correlated
- VIF > 10 — features are very highly correlated, which is cause for concern
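The rule of thumb above can be captured in a small helper (a hypothetical convenience function of my own, not part of any library):

```python
def interpret_vif(vif):
    """Map a VIF value to the rule-of-thumb interpretation above."""
    if vif >= 10:
        return 'high correlation - cause for concern'
    if vif > 5:
        return 'highly correlated'
    if vif > 1:
        return 'moderately correlated'
    return 'not correlated'

print(interpret_vif(1.0))   # not correlated
print(interpret_vif(3.2))   # moderately correlated
print(interpret_vif(7.8))   # highly correlated
print(interpret_vif(12.5))  # high correlation - cause for concern
```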

From the result of calculating the VIF in the previous section, we can see that **Weight** and **BSA** have VIF values greater than 5. This means that **Weight** and **BSA** are highly correlated. This is not surprising, as heavier people tend to have a larger body surface area.

So the next thing to do would be to try removing one of the highly correlated features and see if the result for VIF improves. Let’s try removing **Weight** since it has a higher VIF:

```
calculate_vif(df=df, features=['Age','BSA','Pulse'])
```

Let’s now remove **BSA** and see the VIF of the other features:

```
calculate_vif(df=df, features=['Age','Weight','Pulse'])
```

As we observed, removing **Weight** results in a lower VIF for all other features, compared to removing **BSA**. So should we remove **Weight** then? Well, ideally, yes. But for practical reasons, it would make more sense to remove **BSA** and keep **Weight**. This is because later on when the model is trained and we use it for prediction, it is easier to get a patient’s weight than his/her body surface area.

## One More Example

Let’s look at one more example. This time we will use the Breast Cancer dataset that comes with **sklearn**:

```
from sklearn import datasets
bc = datasets.load_breast_cancer()
df = pd.DataFrame(bc.data, columns=bc.feature_names)
df
```

This dataset has 30 columns, so let’s only focus on the first 8 columns:

```
sns.pairplot(df.iloc[:,:8])
```

We can immediately observe that some features are highly correlated. Can you spot them?

Let’s calculate the VIF for the first 8 columns:

```
calculate_vif(df=df, features=df.columns[:8])
```

We can see that several of these features have large VIF values.

Let’s try to remove these features one by one and observe their new VIF values. First, remove the **mean perimeter**:

```
calculate_vif(df=df, features=['mean radius',
'mean texture',
'mean area',
'mean smoothness',
'mean compactness',
'mean concavity',
'mean concave points'])
```

Immediately there is a reduction in VIFs across the board. Let’s now remove the **mean area**:

```
calculate_vif(df=df, features=['mean radius',
'mean texture',
# 'mean area',
'mean smoothness',
'mean compactness',
'mean concavity',
'mean concave points'])
```

Let’s now remove the **mean concave points**, which now has the highest VIF:

```
calculate_vif(df=df, features=['mean radius',
'mean texture',
# 'mean area',
'mean smoothness',
'mean compactness',
'mean concavity',
# 'mean concave points'
])
```

Finally, let’s remove **mean concavity**:

```
calculate_vif(df=df, features=['mean radius',
'mean texture',
# 'mean area',
'mean smoothness',
'mean compactness',
# 'mean concavity',
# 'mean concave points'
])
```

And now all the VIF values are under 5.
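Dropping features one at a time by hand gets tedious. The process can be automated by repeatedly removing the feature with the highest VIF until everything falls below the threshold — a sketch under the caveat that this greedy order may differ slightly from the manual walk-through above, so the surviving feature set may differ too:

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression

def single_vif(df, feature, others):
    # VIF of one feature: regress it on the others and take 1 / (1 - R²)
    X, y = df[list(others)], df[feature]
    r2 = LinearRegression().fit(X, y).score(X, y)
    return 1 / (1 - r2)

bc = datasets.load_breast_cancer()
df = pd.DataFrame(bc.data, columns=bc.feature_names)
features = list(df.columns[:8])

# Greedily drop the worst offender until all VIFs are at most 5
while len(features) > 2:
    vifs = {f: single_vif(df, f, [g for g in features if g != f]) for f in features}
    worst = max(vifs, key=vifs.get)
    if vifs[worst] <= 5:
        break
    features.remove(worst)

print(features)
```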

## Summary

In this article, we learned that multicollinearity occurs when a feature exhibits a linear relationship with two or more other features. One way to detect multicollinearity is to calculate the **Variance Inflation Factor** (**VIF**). Any feature with a VIF greater than 5 should be considered for removal from the training dataset. It is important to note that VIF only works on continuous variables, not categorical variables.