8 mins read
## Feature selection with Mutual Information

### Use Mutual Information from Scikit-Learn with Python

### Use the feature selector from Scikit-Learn

### The complete code

In probability theory and information theory, the **mutual information** (**MI**) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the “amount of information” (in units such as Shannons, more commonly called bits) obtained about one random variable, through the other random variable. The concept of mutual information is intricately linked to that of entropy of a random variable, a fundamental notion in information theory, that defines the “amount of information” held in a random variable.

Not limited to real-valued random variables like the correlation coefficient, MI is more general and determines how similar the joint distribution *p*(*X*,*Y*) is to the products of factored marginal distribution *p*(*X*)*p*(*Y*). MI is the expected value of the pointwise mutual information (PMI).

Formally, the mutual information of two discrete random variables *X* and *Y* can be defined as:

where *p*(*x*,*y*) is the joint probability distribution function of *X* and *Y*, and *p*(*x*) and *p*(*y*) are the marginal probability distribution functions of *X* and *Y* respectively.

In the case of continuous random variables, the summation is replaced by a definite double integral:

where *p*(*x*,*y*) is now the joint probability *density* function of *X* and *Y*, and *p*(*x*) and *p*(*y*) are the marginal probability density functions of *X* and *Y*, respectively.

If log base 2 is used, the units of mutual information are bits.

Intuitively, mutual information measures the information that *X* and *Y* share: it measures how much knowing one of these variables reduces uncertainty about the other. For example, if *X* and *Y* are independent, then knowing *X* does not give any information about *Y* and vice versa, so their mutual information is zero. At the other extreme, if *X* is a deterministic function of *Y* and *Y* is a deterministic function of *X* then all information conveyed by *X* is shared with *Y*: knowing *X* determines the value of *Y* and vice versa. As a result, in this case, the mutual information is the same as the uncertainty contained in *Y* (or *X*) alone, namely the entropy of *Y* (or *X*). Moreover, in this mutual information is the same as the entropy of *X* and as the entropy of *Y*. (A very special case of this is when *X* and *Y* are the same random variable.)

Mutual information is a measure of the inherent dependence expressed in the joint distribution of *X* and *Y* relative to the joint distribution of *X* and *Y* under the assumption of independence. Mutual information, therefore, measures dependence in the following sense: *I*(*X*; *Y*) = 0 if and only if *X* and *Y* are independent random variables. This is easy to see in one direction: if *X* and *Y* are independent, then *p*(*x*,*y*) = *p*(*x*)*p*(*y*), and therefore:

Moreover, mutual information is nonnegative (i.e. *I*(*X*;*Y*) ≥ 0; see below) and symmetric (i.e. *I*(*X*;*Y*) = *I*(*Y*;*X*)).

Mutual information can be equivalently expressed as:

where Η(*X*) and Η(*Y*) are the marginal entropies, Η(*X*|*Y*) and Η(*Y*|*X*) are the conditional entropies, and Η(*X*,*Y*) is the joint entropy of *X* and *Y*. Note the analogy to the union, difference, and intersection of two sets, as illustrated in the Venn diagram.

After an exhaustive search on the Web, I found some implementations to calculate Mutual Information(MI) between two random variables.

Here I listed some useful Python implementations:

**http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html****https://github.com/rmaestre/Mutual-Information****https://github.com/gregversteeg/NPEET****https://gist.github.com/GaelVaroquaux/ead9898bd3c973c40429****https://gist.github.com/elsonidoq/4230222**

Mutual Information can answer the question: Is there a way to build a measurable connection between a feature and target?

Two benefits to using Mutual Information as a feature selector:

- The MI is model neutral, which means the solution can be applied to various kinds of ML models.
- MI solution is fast.

As I stated, Mutual Information measures the **entropy drops** under the condition of the target value:

MI(feature;target) = Entropy(feature) - Entropy(feature|target)

The MI score will fall in the range of 0 to 1. The higher value, the closer connection between this feature and the target, which suggests that we should put this feature in the training dataset. If the MI score is 0 or very low like 0.01. the low score suggests a weak connection between this feature and the target.

You can write a MI function from scratch on your own, for fun, or use the ready-to-use functions from Scikit-Learn.

I am going to use the Breast Cancer dataset from Scikit-Learn to build a sample ML model with Mutual Information applied. Use the Decision Tree Classifier to train 3 datasets from the cancer data and compare the result to see how the MI score will impact the ML model effectiveness.

- Train dataset 1, use all features.
- Train dataset 2, use only features whose MI scores are larger than 0.2
- Train dataset 3, use only features whose MI scores are less than 0.2

**Step 1. load breast cancer data**

```
from sklearn.datasets import load_breast_cancer as LBC
cancer = LBC()
X = cancer['data']
y = cancer['target']
```

**Step 2. compute the MI score**

```
from sklearn.feature_selection import mutual_info_classif as MIC
mi_score = MIC(X,y)
print(mi_score)
```

You shall see the mi_score array like this:

```
[0.37032947 0.09670886 0.40294198 0.36009957 0.08427789 0.21318114
0.37337734 0.43985571 0.06456878 0.00276314 0.24866738 0.00189163
0.27600984 0.33955538 0.01503326 0.07603828 0.11825812 0.12879402
0.0096701 0.03802394 0.45151801 0.12293047 0.47595645 0.46426102
0.09558928 0.22647456 0.31469449 0.43696443 0.0971793 0.06735096]
```

30 numbers represent the MI score for 30 features.

**Step 3. Build 3 train data and test datasets**

Dataset 1 with no MI score curation, names marked with `_1`

. Note that all three datasets will use the same `y_train`

and `y_test`

. So, no need to separate target data.

```
from sklearn.model_selection import train_test_split as tts
X_train_1,X_test_1,y_train,y_test = tts(
X,y
,random_state=0
,stratify=y
)
```

Dataset 2 with features that have MI scores larger than 0.2

```
import numpy as np
mi_score_selected_index = np.where(mi_scores >0.2)[0]
X_2 = X[:,mi_score_selected_index]
X_train_2,X_test_2,y_train,y_test = tts(
X_2,y
,random_state=0
,stratify=y
)
```

You will see the shape of X_2 is `(569,15)`

instead of `(569,30)`

. That is because 15 features have Mi scores larger than `0.2`

.

Dataset 3 with features that have MI scores less than 0.2

```
mi_score_selected_index = np.where(mi_scores < 0.2)[0]
X_3 = X[:,mi_score_selected_index]
X_train_3,X_test_3,y_train,y_test = tts(
X_3,y
,random_state=0
,stratify=y
)
```

Coincidentally, the 0.2 MI score separates 30 features into 15 and 15.

**Step 4. Compare the 3 datasets with the Decision Tree classifier**

```
from sklearn.tree import DecisionTreeClassifier as DTC
model_1 = DTC().fit(X_train_1,y_train)
model_2 = DTC().fit(X_train_2,y_train)
model_3 = DTC().fit(X_train_3,y_train)
score_1 = model_1.score(X_test_1,y_test)
score_2 = model_2.score(X_test_2,y_test)
score_3 = model_3.score(X_test_3,y_test)
print(f"score_1:{score_1}\n score_2:{score_2}\n score_3:{score_3}")
```

The result:

```
score_1:0.9300699300699301
score_2:0.9370629370629371
score_3:0.8251748251748252
```

See, dataset 2 with 15 features has MI > 0.2 reach to accuracy of 0.93 as good as dataset 1, which includes all features. While score_3 is merely 0.82 which is the result of 15 features that have MI score < 0.2.

From this sample, it is clear that the MI score can be used as a signal for feature selection.

In real ML projects, you may want to use the top n features, or top n percentile features instead of using a specified number 0.2 like the sample above. Scikit-Learn also provides many selectors as convenient tools. So that you don’t have to manually calculate MI scores and take the needed features. Here is a sample to select the top 50% of features, other selectors share similar usage.

```
from sklearn.feature_selection import SelectPercentile as SP
selector = SP(score_func=MIC, percentile=50) # select features with top 50% MI scores
selector.fit(X,y)
X_4 = selector.transform(X)
X_train_4,X_test_4,y_train,y_test = tts(
X_4,y
,random_state=0
,stratify=y
)
model_4 = DTC().fit(X_train_4,y_train)
score_4 = model_4.score(X_test_4,y_test)
print(f"score_4:{score_4}")
```

```
# load cancer data
from sklearn.feature_selection import SelectPercentile as SP
from sklearn.tree import DecisionTreeClassifier as DTC
import numpy as np
from sklearn.model_selection import train_test_split as tts
from sklearn.feature_selection import mutual_info_classif as MIC
from sklearn.datasets import load_breast_cancer as LBC
cancer = LBC()
X = cancer['data']
y = cancer['target']
# compute MI scores
mi_scores = MIC(X, y)
print(mi_scores)
# prepare dataset 1
X_train_1, X_test_1, y_train, y_test = tts(
X, y, random_state=0, stratify=y
)
# prepare dataset 2, MI > 0.2
mi_score_selected_index = np.where(mi_scores > 0.2)[0]
X_2 = X[:, mi_score_selected_index]
X_train_2, X_test_2, y_train, y_test = tts(
X_2, y, random_state=0, stratify=y
)
# prepare dataset 3, MI <0.2
mi_score_selected_index = np.where(mi_scores < 0.2)[0]
X_3 = X[:, mi_score_selected_index]
X_train_3, X_test_3, y_train, y_test = tts(
X_3, y, random_state=0, stratify=y
)
# compare results with Decision Tree Classifier
model_1 = DTC().fit(X_train_1, y_train)
model_2 = DTC().fit(X_train_2, y_train)
model_3 = DTC().fit(X_train_3, y_train)
score_1 = model_1.score(X_test_1, y_test)
score_2 = model_2.score(X_test_2, y_test)
score_3 = model_3.score(X_test_3, y_test)
print(f"score_1:{score_1}\n"
f"score_2:{score_2}\n"
f"score_3:{score_3}")
# use Scikit-learn feature selector
# select features with top 50% MI scores
selector = SP(score_func=MIC, percentile=50)
selector.fit(X, y)
X_4 = selector.transform(X)
X_train_4, X_test_4, y_train, y_test = tts(
X_4, y, random_state=0, stratify=y
)
model_4 = DTC().fit(X_train_4, y_train)
score_4 = model_4.score(X_test_4, y_test)
print(f"score_4:{score_4}")
```

Resources: