In many (business) cases it is equally important not only to have an accurate model, but also an interpretable one. Oftentimes, apart from wanting to know what our model’s house price prediction is, we also wonder why it is this high/low and which features are most important in determining the forecast. Another example might be predicting customer churn: it is very nice to have a model that successfully predicts which customers are prone to churn, but identifying which variables are important can help us with early detection and maybe even with improving the product/service!
Knowing the feature importances indicated by machine learning models can benefit you in multiple ways.
That is why in this article I would like to explore different approaches to interpreting feature importance, using a Random Forest model as an example. Most of them are also applicable to other models, ranging from linear regression to black boxes such as XGBoost.
One thing to note is that the more accurate our model is, the more we can trust feature importance measures and other interpretations. We assume that the model we build is reasonably accurate (as each data scientist will strive to have such a model) and in this article, we focus on the importance measures.
You can find the code used for this article on GitHub.
For this example, I will use the Boston house prices dataset (so a regression problem). But the approaches described in this article work just as well with classification problems, the only difference is the metric used for evaluation.
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
boston = load_boston()
y = boston.target
X = pd.DataFrame(boston.data, columns = boston.feature_names)
np.random.seed(seed = 42)
X['random'] = np.random.random(size = len(X))
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 42)
The only non-standard thing in preparing the data is the addition of a random column to the dataset. Logically, it has no predictive power over the dependent variable (Median value of owner-occupied homes in $1000’s), so it should not be an important feature in the model. Let’s see how it will turn out.
Below I inspect the relationship between the random feature and the target variable. As can be observed, there is no pattern in the scatterplot and the correlation is almost 0. One thing to note here is that there is not much sense in interpreting the correlation for CHAS, as it is a binary variable and different methods should be used for it.
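A minimal sketch of such a check (my own illustration; the exact plotting code in the original notebook may differ) could look like this:
import matplotlib.pyplot as plt
# scatterplot of the random feature against the target
plt.scatter(X['random'], y)
plt.xlabel('random')
plt.ylabel('MEDV')
plt.show()
# correlation between the random feature and the target
print(np.corrcoef(X['random'], y)[0, 1])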
I trained a plain Random Forest model to have a benchmark. I set a random_state to ensure comparability of the results, and used bootstrap = True together with oob_score = True so that I could later use the out-of-bag error.
Briefly, on the subject of out-of-bag error: each tree in the Random Forest is trained on a different dataset, sampled with replacement from the original data. This results in around 2/3 of the distinct observations being present in each training set. The out-of-bag error is calculated on all the observations, but for calculating each row’s error the model only considers trees that have not seen this row during training. This is similar to evaluating the model on a validation set. You can read more here.
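As a quick illustrative check (my own addition, not part of the original analysis), the expected fraction of distinct observations in a bootstrap sample is about 1 - 1/e, roughly 0.632:
n = len(X_train)
# draw a bootstrap sample of row indices (sampling with replacement)
bootstrap_idx = np.random.choice(n, size = n, replace = True)
# fraction of distinct observations in the sample, roughly 1 - 1/e
print(len(np.unique(bootstrap_idx)) / n)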
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 100,
                           n_jobs = -1,
                           oob_score = True,
                           bootstrap = True,
                           random_state = 42)
rf.fit(X_train, y_train)
print('R^2 Training Score: {:.2f} \nOOB Score: {:.2f} \nR^2 Validation Score: {:.2f}'.format(rf.score(X_train, y_train),
                                                                                             rf.oob_score_,
                                                                                             rf.score(X_valid, y_valid)))
R^2 Training Score: 0.93
OOB Score: 0.58
R^2 Validation Score: 0.76
Well, there is some overfitting in the model, as it performs much worse on the OOB sample and worse on the validation set. But let’s say it is good enough and move forward to feature importances (measured on the training set). Some of the approaches can also be used for the validation/OOB sets, to gain further interpretability on unseen data.
By overall feature importances I mean the ones derived at the model level, i.e., saying that in a given model these features are most important in explaining the target variable.
Let’s start with decision trees to build some intuition. In a decision tree, every node is a condition on a single feature, designed to split the dataset so that similar values of the dependent variable end up in the same subset after the split. The condition is based on impurity, which in the case of classification problems is Gini impurity/information gain (entropy), while for regression trees it is variance. When training a tree we can therefore compute how much each feature contributes to decreasing the weighted impurity. feature_importances_ in Scikit-Learn is based on that logic, but in the case of Random Forest the decrease in impurity is averaged over trees (a minimal sketch of extracting these importances follows after the pros and cons below).
Pros:
Cons:
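Below is a minimal sketch of how these default importances can be extracted. The imp_df helper shown here is an assumption on my part (a thin wrapper producing a sorted DataFrame); it is reused in the later snippets of this article:
def imp_df(column_names, importances):
    # wrap the scores in a DataFrame sorted by decreasing importance
    df = pd.DataFrame({'feature': column_names,
                       'feature_importance': importances}) \
           .sort_values('feature_importance', ascending = False) \
           .reset_index(drop = True)
    return df

base_imp = imp_df(X_train.columns, rf.feature_importances_)
print(base_imp)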
It seems that the top 3 most important features are:
What seems surprising though is that a column of random values turned out to be more important than:
Intuitively, this feature should have no importance at all in explaining the target variable. Let’s see how it is evaluated by the different approaches.
This approach directly measures feature importance by observing how random re-shuffling (thus preserving the distribution of the variable) of each predictor influences model performance.
The approach can be described in the following steps:
1. Train the model and record its baseline performance (score).
2. Randomly shuffle the values of a single feature and score the model on the modified dataset.
3. The feature’s importance is the drop in score compared to the baseline.
4. Restore the column and repeat steps 2-3 for every feature.
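To make the steps concrete, here is a minimal from-scratch sketch (my own illustration, not the implementation of the libraries discussed below), assuming the imp_df helper defined earlier and an already fitted model:
def permutation_importance_manual(model, X, y, metric):
    # score of the already fitted model on the intact data (step 1)
    baseline = metric(model, X, y)
    importances = []
    for col in X.columns:
        X_permuted = X.copy()
        # shuffle a single column, keeping its distribution intact (step 2)
        X_permuted[col] = np.random.permutation(X_permuted[col].values)
        # importance = drop in score relative to the baseline (step 3)
        importances.append(baseline - metric(model, X_permuted, y))
    return imp_df(X.columns, importances)

# example usage with the estimator's default score (R^2) on the training set
perm_imp_manual = permutation_importance_manual(rf, X_train, y_train,
                                                lambda m, X, y: m.score(X, y))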
Pros:
Cons:
more computationally expensive than the default feature_importances_; additionally, permutation importance can overestimate the importance of correlated predictors.
As for the second problem with this method, I have already plotted the correlation matrix above. However, I will use a function from one of the libraries I use to visualize Spearman’s correlations. The difference from the standard Pearson’s correlation is that Spearman’s first transforms the variables into ranks and only then runs Pearson’s correlation on the ranks.
Spearman’s correlation:
I found two libraries with this functionality, though it would not be difficult to code it yourself. Let’s go over both of them, as they have some unique features.
rfpimp
One thing to note about this library is that we have to provide a metric as a function of the form metric(model, X, y). This way we can use more advanced approaches, such as using the OOB score of the Random Forest. This library already contains functions for that (oob_regression_r2_score). But to keep the approach uniform, I will calculate the metrics on the training set (losing information about generalization).
from sklearn.metrics import r2_score
from rfpimp import permutation_importances

def r2(rf, X_train, y_train):
    return r2_score(y_train, rf.predict(X_train))

perm_imp_rfpimp = permutation_importances(rf, X_train, y_train, r2)
The plot confirms what we have seen above: 4 variables are less important than a random variable! Surprising… The top 4 stayed the same though. One more nice feature of rfpimp is that it contains functionalities for dealing with the issue of collinear features (that was the idea behind showing the Spearman’s correlation matrix). For brevity, I will not show this case here, but you can read more in this great article by the authors of the library.
eli5
There are a few differences between the basic approach of rfpimp and the one employed in eli5. Some of them are:
- cv and refit – parameters connected to using cross-validation. In this example, I disable them (cv = None, refit = False), as I do not use cross-validation, but they might come in handy in some cases.
- metric – as in rfpimp, it accepts a function in the form of metric(model, X, y). If this parameter is not specified, the function will use the default score method of the estimator.
- n_iter – the number of random shuffle iterations; the end score is the average over them.
import eli5
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(rf, cv = None, refit = False, n_iter = 50).fit(X_train, y_train)
perm_imp_eli5 = imp_df(X_train.columns, perm.feature_importances_)
The results are very similar to the previous ones, even though these came from multiple reshuffles per column. One extra nice thing about eli5 is that it is really easy to use the results of the permutation approach to carry out feature selection, for example by using Scikit-learn’s SelectFromModel or RFE.
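As a hedged sketch (my own addition, not from the original notebook), selecting the features whose permutation importance exceeds the mean could look like this:
from sklearn.feature_selection import SelectFromModel

# keep only the features whose permutation importance is above the mean importance
selector = SelectFromModel(perm, threshold = 'mean', prefit = True)
X_train_selected = selector.transform(X_train)
print(X_train_selected.shape)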
This approach is quite intuitive, as we investigate the importance of a feature by comparing a model trained with all the features versus a model with this feature dropped from the training data. I created a function (based on rfpimp’s implementation) for this approach below, which shows the underlying logic.
Pros:
Cons:
from sklearn.base import clone

def drop_col_feat_imp(model, X_train, y_train, random_state = 42):
    # clone the model to have the exact same specification as the one initially trained
    model_clone = clone(model)
    # set random_state for comparability
    model_clone.random_state = random_state
    # training and scoring the benchmark model
    model_clone.fit(X_train, y_train)
    benchmark_score = model_clone.score(X_train, y_train)
    # list for storing feature importances
    importances = []
    # iterating over all columns and storing feature importance (difference between benchmark and new model)
    for col in X_train.columns:
        model_clone = clone(model)
        model_clone.random_state = random_state
        model_clone.fit(X_train.drop(col, axis = 1), y_train)
        drop_col_score = model_clone.score(X_train.drop(col, axis = 1), y_train)
        importances.append(benchmark_score - drop_col_score)
    importances_df = imp_df(X_train.columns, importances)
    return importances_df
Here it gets interesting. First of all, a negative importance in this case means that removing a given feature from the model actually improves the performance. So it is nice to see this happen for our random variable.
Alternatively, instead of the default score method of the fitted model, we can use the out-of-bag error for evaluating the feature importance. To do so, we need to replace the score method in the function above with the model’s oob_score_ attribute (remember to do it for both the benchmark and the model within the loop).
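A minimal sketch of that OOB-based variant (my own adaptation of the function above, assuming the passed model was created with oob_score = True and bootstrap = True) could look as follows:
def drop_col_feat_imp_oob(model, X_train, y_train, random_state = 42):
    # benchmark model evaluated with the out-of-bag score instead of the training score
    model_clone = clone(model)
    model_clone.random_state = random_state
    model_clone.fit(X_train, y_train)
    benchmark_score = model_clone.oob_score_
    importances = []
    for col in X_train.columns:
        model_clone = clone(model)
        model_clone.random_state = random_state
        model_clone.fit(X_train.drop(col, axis = 1), y_train)
        # difference between the benchmark OOB score and the OOB score without the column
        importances.append(benchmark_score - model_clone.oob_score_)
    return imp_df(X_train.columns, importances)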
By observation level feature importances I mean ones that had the most impact on explaining a particular observation fed to the model. For example, in the case of credit scoring, we would be able to say that these features had the most impact on determining the client’s credit score.
The main idea of treeinterpreter is that it uses the underlying trees in the Random Forest to explain how each feature contributes to the final value. We can observe how the value of the prediction (defined as the sum of each feature’s contribution plus the average given by the initial node, which is based on the entire training set) changes along the prediction path within the decision tree (after every split), together with the information about which features caused the split (and thus the change in prediction).
The formula for the prediction function f(x) can be written down as:
f(x) = c_full + sum_k contrib(x, k), with the sum over k running from 1 to K,
where c_full is the average of the entire dataset (initial node) and K is the total number of features.
This may sound complicated, but take a look at an example from the author of the library:
As Random Forest’s prediction is the average of its trees’ predictions, the formula for the average prediction is the following:
F(x) = (1/J) * sum_j c_full_j + sum_k [ (1/J) * sum_j contrib_j(x, k) ],
where J is the number of trees in the forest, the sums over j run from 1 to J and the sum over k runs from 1 to K.
I start by identifying rows with the lowest and highest absolute prediction error and will try to see what caused the difference.
Index with smallest error: 31
Index with largest error: 85
Using treeinterpreter I obtain 3 objects: predictions, biases (the average value of the dataset) and contributions.
from treeinterpreter import treeinterpreter as ti, utils

selected_rows = [31, 85]
selected_df = X_train.iloc[selected_rows,:].values
prediction, bias, contributions = ti.predict(rf, selected_df)

for i in range(len(selected_rows)):
    print("Row", selected_rows[i])
    print("Prediction:", prediction[i][0], 'Actual Value:', y_train[selected_rows[i]])
    print("Bias (trainset mean)", bias[i])
    print("Feature contributions:")
    for c, feature in sorted(zip(contributions[i],
                                 X_train.columns),
                             key=lambda x: -abs(x[0])):
        print(feature, round(c, 2))
    print("-"*20)
For the observation with the smallest error, the main contributors were LSTAT and RM (which in previous cases turned out to be the most important variables). In the highest error case, the highest contribution came from the DIS variable, outweighing the same two variables that played the most important role in the first case.
Row 31
Prediction: 21.996 Actual Value: 22.0
Bias (trainset mean) 22.544297029702978
Feature contributions:
LSTAT 3.02
RM -3.01
PTRATIO 0.36
AGE -0.29
DIS -0.21
random 0.18
RAD -0.17
NOX -0.16
TAX -0.11
CRIM -0.07
B -0.05
INDUS -0.02
ZN -0.01
CHAS -0.01
--------------------
Row 85
Prediction: 36.816 Actual Value: 50.0
Bias (trainset mean) 22.544297029702978
Feature contributions:
DIS 7.7
LSTAT 3.33
RM -1.88
CRIM 1.87
TAX 1.32
NOX 1.02
B 0.54
CHAS 0.36
PTRATIO -0.25
RAD 0.17
AGE 0.13
INDUS -0.03
random -0.01
ZN 0.0
--------------------
To dive even deeper, we might also be interested in the joint contribution of many variables (as explained in the case of XOR here). I will go right to the example; more information can be found under the link.
prediction1, bias1, contributions1 = ti.predict(rf, np.array([selected_df[0]]), joint_contribution=True)
prediction2, bias2, contributions2 = ti.predict(rf, np.array([selected_df[1]]), joint_contribution=True)

aggregated_contributions1 = utils.aggregated_contribution(contributions1)
aggregated_contributions2 = utils.aggregated_contribution(contributions2)

res = []
for k in set(aggregated_contributions1.keys()).union(
        set(aggregated_contributions2.keys())):
    res.append(([X_train.columns[index] for index in k],
                aggregated_contributions1.get(k, 0) - aggregated_contributions2.get(k, 0)))

for lst, v in (sorted(res, key=lambda x: -abs(x[1])))[:10]:
    print(lst, v)
Most of the difference between the best and worst predicted cases comes from the number of rooms (RM) feature, in conjunction with the weighted distances to five Boston employment centers (DIS).
LIME (Local Interpretable Model-agnostic Explanations) is a technique explaining the predictions of any classifier/regressor in an interpretable and faithful manner. To do so, an explanation is obtained by locally approximating the selected model with an interpretable one (such as linear models with regularisation or decision trees). The interpretable models are trained on small perturbations (adding noise) of the original observation (row in case of tabular data), thus they only provide a good local approximation.
Some drawbacks to be aware of:
Below you can see the output of LIME interpretation.
There are 3 parts of the output:
1. Predicted value
2. Feature importance — in case of regression it shows whether it has a negative or positive impact on the prediction, sorted by absolute impact descending.
3. Actual values of these features for the explained rows.
Note that LIME has discretized the features in the explanation. This is because of setting discretize_continuous = True in the constructor below. The reason for discretization is that it gives continuous features more intuitive explanations.
import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values,
                                                   mode = 'regression',
                                                   feature_names = X_train.columns,
                                                   categorical_features = [3],
                                                   categorical_names = ['CHAS'],
                                                   discretize_continuous = True)

np.random.seed(42)
exp = explainer.explain_instance(X_train.values[31], rf.predict, num_features = 5)
exp.show_in_notebook(show_all=False) # only the features used in the explanation are displayed

exp = explainer.explain_instance(X_train.values[85], rf.predict, num_features = 5)
exp.show_in_notebook(show_all=False)
The LIME interpretation agrees that for these two observations the most important features are RM and LSTAT, which was also indicated by the previous approaches.
Which observation-level approach should we trust, given that their results can differ? This is a difficult question without a clear answer, as the two approaches are conceptually different and thus hard to compare directly. I would refer you to this answer, in which a similar question was tackled and nicely explained.
In this article, I showed a few approaches to deriving feature importances from machine learning models (not limited to Random Forest). I believe that understanding the results is often just as important as having good results, so every data scientist should do their best to understand which variables are the most important for the model and why. Not only can this help to get a better business understanding, but it can also lead to further improvements of the model.