In this article, we’ll look at a major problem with using Random Forest for regression: extrapolation.
Random Forest Regression is quite a robust algorithm. The question is, should you use it for regression?
Why not use linear regression instead? The function in a Linear Regression can easily be written as y=mx + c while a function in a complex Random Forest Regression seems like a black box that can’t easily be represented as a function.
Random Forests often produce better results, work well on large datasets, and can work with missing data by creating estimates for it. However, they pose a major challenge: they can’t extrapolate beyond the range of the data they were trained on. We’ll dive deeper into this challenge in a minute.
Decision Trees are great for capturing non-linear relationships between input features and the target variable. The inner workings of a Decision Tree can be thought of as a bunch of if-else conditions. It starts at the very top with one node. This node then splits into a left and a right node, called decision nodes. These nodes then split into their own left and right nodes. The bottom-most nodes are referred to as leaves or terminal nodes. The value in a leaf is usually the mean of the training observations that fall within that specific region. For instance, in the rightmost leaf node below, 552.889 is the average of the 5 samples.
How far this splitting goes is what is known as the depth of the tree. This is one of the hyperparameters that can be tuned. The maximum depth of the tree is specified so as to prevent the tree from becoming too deep — a scenario that leads to overfitting.
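To make the if-else picture concrete, here is a minimal hand-written sketch of a depth-2 tree. The training rows and the split thresholds (0.5 and 1.0 carat) are made up for illustration; a real tree learns its thresholds from the data.

```python
from statistics import mean

# Toy training rows: (carat, price). These values are invented for illustration.
train = [(0.3, 500.0), (0.4, 700.0), (0.7, 2400.0),
         (0.9, 3600.0), (1.2, 6500.0), (1.5, 9800.0)]


def leaf_mean(rows):
    """Value stored in a leaf: the mean price of the training rows in it."""
    return mean(price for _, price in rows)


# Hypothetical thresholds; each leaf keeps the mean of its own region.
LEAF_SMALL = leaf_mean([r for r in train if r[0] <= 0.5])
LEAF_MEDIUM = leaf_mean([r for r in train if 0.5 < r[0] <= 1.0])
LEAF_LARGE = leaf_mean([r for r in train if r[0] > 1.0])


def predict_price(carat):
    if carat <= 1.0:          # root node
        if carat <= 0.5:      # decision node
            return LEAF_SMALL
        return LEAF_MEDIUM
    return LEAF_LARGE         # leaf reached directly from the root


print(predict_price(0.35), predict_price(0.8), predict_price(1.3))
```

Allowing more levels of nesting corresponds to a deeper tree, which is what the maximum depth hyperparameter limits.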
A Random Forest is an ensemble of Decision Trees. That is, many trees, each constructed in a certain “random” way, form a Random Forest.
A prediction from the Random Forest Regressor is the average of the predictions produced by the trees in the forest.
This averaging usually makes a Random Forest more accurate than a single Decision Tree and reduces overfitting.
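The averaging can be checked directly. The sketch below assumes scikit-learn's `RandomForestRegressor` (the article does not name a library) and uses synthetic data: it averages the per-tree predictions by hand and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

X_new = np.array([[2.5], [7.0]])

# One row of predictions per fitted tree, then the mean over trees.
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

forest_prediction = forest.predict(X_new)
print(np.allclose(manual_average, forest_prediction))
```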
In order to dive in further, let’s look at an example of a Linear Regression and a Random Forest Regression. For this, we’ll apply Linear Regression and a Random Forest Regression to the same dataset and compare the result.
Let’s take an example dataset where the goal is to predict the price of diamonds based on other features such as carat, depth, table, x, y, and z. Looking at the distribution of prices, we can see that the price ranges from 326 to 18823.
Let’s train the Linear Regression model and run predictions on the validation set. The distribution of predicted prices is the following:
Predicted prices are clearly outside the range of values of “price” seen in the training dataset. A Linear Regression model, just as the name suggests, fits a linear model to the data. A simple way to think about it is in the form y = mx + c. Because the fitted line keeps going beyond the training data, the model can produce values outside the range seen during training. In other words, it is able to extrapolate.
Let’s now look at the results obtained from a Random Forest Regressor using the same dataset.
These values all fall between 326 and 18823, the same range as in our training set. There are no values outside that range. Random Forest cannot extrapolate.
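The same contrast can be reproduced on a small synthetic problem. This is a sketch rather than the diamonds experiment itself: it assumes scikit-learn, trains both models on inputs between 0 and 10, and then asks them to predict for inputs between 10 and 20.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic linear data; the real article uses the diamonds dataset instead.
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 10, size=(300, 1))
y_train = 100.0 * X_train[:, 0] + rng.normal(0, 20, size=300)

# Test inputs lie entirely outside the training range.
X_test = np.linspace(10, 20, 50).reshape(-1, 1)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

linear_pred = linear.predict(X_test)
forest_pred = forest.predict(X_test)

print("training target max:", y_train.max())
print("linear regression max prediction:", linear_pred.max())
print("random forest max prediction:", forest_pred.max())
```

The linear model's predictions keep growing with the input, while the forest's predictions stay at or below the largest target it saw during training.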
As you have seen above, when using a Random Forest Regressor, the predicted values are never outside the training set values for the target variable.
If you look at prediction values they will look like this:
Let’s explore that phenomenon here. The data used above has the following columns for predicting the price: carat, depth, table, x, y, and z.
The diagram below shows one decision tree from the Random Forest Regressor.
Let’s zoom in to a smaller section of this tree. For example, there are 4 samples with depth <= 62.75, x <= 5.545, carat <= 0.905, and z <= 3.915. The price predicted for these is 2775.75. This figure represents the mean of all these four samples. Therefore, any value in the test set that falls in this leaf will be predicted as 2775.75.
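The leaf-mean behaviour described above can be verified on a single tree. The sketch below assumes scikit-learn's `DecisionTreeRegressor` and a made-up one-feature dataset standing in for carat and price. It checks that each leaf predicts the mean of its training samples, and that a point far outside the training range simply receives the value of an existing leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up stand-in for (carat, price); not the real diamonds data.
rng = np.random.default_rng(1)
X_train = rng.uniform(0.2, 2.0, size=(100, 1))
y_train = 4000.0 * X_train[:, 0] + rng.normal(0, 300, size=100)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

# apply() returns the index of the leaf each sample ends up in.
leaf_ids = tree.apply(X_train)
train_pred = tree.predict(X_train)

# For every leaf, the predicted value equals the mean target of its samples.
leaf_matches_mean = all(
    np.isclose(train_pred[leaf_ids == leaf][0], y_train[leaf_ids == leaf].mean())
    for leaf in np.unique(leaf_ids)
)

# A value far beyond anything seen in training still lands in an existing leaf.
far_pred = tree.predict(np.array([[10.0]]))[0]

print(leaf_matches_mean, far_pred, y_train.max())
```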
This is to say that when the Random Forest Regressor is tasked with predicting values not previously seen, it will always predict an average of the values seen previously. Obviously, the average of a sample cannot fall outside the highest and lowest values in the sample.
The Random Forest Regressor is unable to discover trends that would enable it to extrapolate to values that fall outside the training set. When faced with such a scenario, its predictions stay close to the maximum value in the training set. Figure 1 above illustrates that.
Ok, so how can you deal with this extrapolation problem?
There are a couple of options:
One such extension is Regression-Enhanced Random Forests (RERFs). The authors of the RERF paper propose a technique that borrows the strengths of penalized parametric regression to give better results in extrapolation problems.
Specifically, there are two steps to the process:
Since Random Forest is a fully nonparametric predictive algorithm, it may not efficiently incorporate known relationships between the response and the predictors. Here, the response values are the observed values Y1, ..., Yn from the training data. RERFs can incorporate such known relationships, which is another benefit of using Regression-Enhanced Random Forests for regression problems.
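As a rough illustration of the idea, here is a minimal sketch that assumes the two steps are (1) fitting a penalized linear model, here scikit-learn's `Lasso`, and (2) fitting a Random Forest to the residuals of that model, with the final prediction being the sum of the two. The data, the `alpha` value, and the variable names are illustrative choices, and this is not the paper's reference implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Synthetic data: a linear trend plus a non-linear wiggle.
rng = np.random.default_rng(7)
X_train = rng.uniform(0, 10, size=(400, 1))
y_train = (100.0 * X_train[:, 0] + 50.0 * np.sin(X_train[:, 0])
           + rng.normal(0, 10, size=400))
X_test = np.linspace(10, 15, 30).reshape(-1, 1)  # outside the training range

# Step 1 (assumed): a penalized linear fit captures the global trend.
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
residuals = y_train - lasso.predict(X_train)

# Step 2 (assumed): a forest models what the linear fit left unexplained.
residual_forest = RandomForestRegressor(n_estimators=50, random_state=0)
residual_forest.fit(X_train, residuals)

# Combined prediction: linear trend plus forest correction.
rerf_pred = lasso.predict(X_test) + residual_forest.predict(X_test)

# Plain forest on the raw targets, for comparison.
plain_forest = RandomForestRegressor(n_estimators=50, random_state=0)
plain_pred = plain_forest.fit(X_train, y_train).predict(X_test)

print("training target max:", y_train.max())
print("plain forest max prediction:", plain_pred.max())
print("lasso + residual forest max prediction:", rerf_pred.max())
```

In this sketch the linear part carries the trend beyond the training range, while the forest only corrects it with residual values it has already seen.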
At this point, I am sure you might be wondering whether or not you should use a Random Forest for regression problems.
Let’s look at that.
Hopefully, this article gave you some background into the inner workings of Random Forest Regression.