In this post, we will see how the Random Forest algorithm works internally. To truly appreciate it, it helps to understand a bit about Decision Tree classifiers, but that is not strictly required. We are not covering the pre-processing or feature-creation steps involved in modeling; we only look at what happens inside the algorithm when we call the .fit() and .predict() methods of sklearn's RandomForestClassifier.
Basically, Random Forest (RF) is a tree-based algorithm: an ensemble of many randomly built trees. The final value of the model is the aggregate (average or majority vote) of the predictions/estimates created by the individual trees. We will be using the scikit-learn package, specifically the following modules:
sklearn.ensemble.RandomForestClassifier
(for the Random Forest classifier algorithm)
sklearn.ensemble.RandomForestRegressor
(for the Random Forest regressor algorithm)
For illustration, we will use a training set whose columns are age, glucose_level, weight, gender, smoking, ..., f98, f99, and Diabetic. The first set are the independent variables (features); Diabetic is the dependent variable (y) that we have to predict. The problem is to predict which patients are likely to be diabetic.
With this basic information, let's get started and understand what happens when we pass this training set to the algorithm.
Once we provide the training data to the RandomForestClassifier model, the algorithm selects a bunch of rows randomly. This process is called bootstrapping (random sampling with replacement). For our example, let's assume it selects m records. The number of rows to be selected can be set by the user through the max_samples hyper-parameter:
from sklearn.ensemble import RandomForestClassifier
my_rf = RandomForestClassifier(max_samples=100)
This only applies if we turn on bootstrapping with the bootstrap hyper-parameter (bootstrap=True); bootstrap is True by default. Because the sampling is done with replacement, one row might get selected more than once.
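To build intuition for this step, here is a minimal sketch of bootstrapping on a made-up pandas DataFrame (train_df, m and the column values are illustrative and not part of sklearn's internals):

import pandas as pd

# Hypothetical training data with a few of the columns from our example
train_df = pd.DataFrame({
    "age": [45, 62, 30, 51, 28],
    "glucose_level": [110, 180, 95, 150, 88],
    "weight": [80, 95, 60, 88, 70],
    "diabetic": [0, 1, 0, 1, 0],
})

# Bootstrapping: draw m rows *with replacement*, so a row can appear twice
m = 5
bootstrap_sample = train_df.sample(n=m, replace=True, random_state=42)
print(bootstrap_sample)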
Now, RF randomly selects a subset of features/columns. For the sake of simplicity, in our example we choose 3 random features. We can control this number with the max_features hyper-parameter:
from sklearn.ensemble import RandomForestClassifier
my_rf = RandomForestClassifier(max_features=3)
Once the 3 random features are selected (in our example), the algorithm tries splitting the m records (from the bootstrapping step) and quickly computes the value of a metric before and after each candidate split. This metric is either Gini impurity or entropy, depending on the criterion hyper-parameter:
from sklearn.ensemble import RandomForestClassifier
my_rf = RandomForestClassifier(max_features=8, criterion='gini')
criterion accepts either 'gini' or 'entropy' and is set to 'gini' by default. Whichever of the random features gives the lowest combined Gini impurity (or entropy) value after the split is selected as the root node, and the records are split at this node at the best splitting point.
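To make the "before and after" calculation concrete, here is a rough sketch of how the Gini impurity of one candidate split could be computed by hand. This is only an illustration, not sklearn's actual implementation; the labels and the split are made up:

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical labels of the m bootstrapped records (1 = diabetic)
y = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# A candidate split (say glucose_level <= 120) partitions the labels
left, right = np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1])

before = gini(y)
after = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
print(f"Gini before: {before:.3f}, weighted Gini after: {after:.3f}")
# The feature/threshold with the lowest weighted impurity wins the node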
The algorithm then repeats the same feature-selection and splitting process for the next node: it selects another set of 3 random features. (3 is the number we specified; you can choose what you like, or leave it to the algorithm to choose the best number.)
Based on the criterion (gini/entropy), it selects which feature goes into the next node/child node, and the records are split further there.
This process of selecting random features and splitting the nodes continues until either of the following conditions is met: the number of samples in a node falls to the limit set by min_samples_leaf, or the tree reaches the maximum depth set by max_depth. We now have the first "mini decision tree".
The algorithm goes back to the data and repeats the whole process above (bootstrapping, random feature selection, and splitting) to create the 2nd "mini-tree".
Once the default number of 100 trees is reached (we now have 100 mini decision trees), the model is said to have completed its fit() process. We can specify the number of trees we want to generate using the n_estimators hyper-parameter:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=300)
Now we have a forest of randomly created mini-trees (hence the name Random Forest).
Now let's predict the values in an unseen dataset (the test dataset). For inferencing (more commonly referred to as predicting/scoring) the test data, the algorithm passes each record through each mini-tree.
The values of the record traverse the mini-tree according to the variables each node represents and ultimately reach a leaf node. Based on the leaf node (whose value was determined during training) where the record ends up, that mini-tree produces one prediction. In the same manner, the record goes through all 100 mini decision trees, and each of the 100 trees produces a prediction for that record.
The final prediction for this record is calculated by taking a simple majority vote across these 100 mini-trees.
Now we have the prediction for a single record.
The algorithm iterates through all the records of the test set following the same process and then computes the overall accuracy.
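Putting the whole walkthrough together, here is a minimal end-to-end sketch. The synthetic dataset from make_classification is only a stand-in for the diabetic data, and the parameter values are arbitrary:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetic dataset: 100 features, binary target
X, y = make_classification(n_samples=1000, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# fit() builds the mini decision trees described above
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# predict() passes each test record through every tree and aggregates the results
y_pred = rf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))

# The individual mini-trees are available for inspection
print("Number of trees:", len(rf.estimators_))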
Now let’s get a deeper understanding of what each of the parameters does in the Random Forest algorithm.
1. n_estimators (default = 100)
Since Random Forest is an ensemble technique, it improves generalization by creating a number of trees of different depths and sizes. n_estimators is the number of trees we want the algorithm to create. Increasing the number of trees in the forest decreases the variance of the overall model and does not contribute to overfitting. From the standpoint of generalization performance, using more trees is therefore better, at the cost of training time.
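As an illustrative (not definitive) check, we can compare cross-validated scores for different numbers of trees on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for n in (10, 100, 300):
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(rf, X, y, cv=5)
    # More trees usually give a more stable (lower-variance) score
    print(f"n_estimators={n}: mean={scores.mean():.3f}, std={scores.std():.3f}")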
2. criterion (default = 'gini')
The measure used to determine where/on what feature a tree has to be split can be computed by two methods: Gini impurity or entropy. For example, suppose there are two candidate features, gender and nationality, on which the tree could be split. The algorithm evaluates the split using both features and chooses the one that results in the lower entropy or lower Gini impurity after the split, discarding the other.
3. max_depth (default = None)
max_depth is the measure of how far the tree can be expanded below each node until we reach a leaf node. Generally, in a tree-based algorithm, the deeper the tree, the greater the chance that it overfits the data. Since Random Forest ensembles several different trees together, it is generally acceptable to grow deep trees.
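For example (a sketch on synthetic data), we can cap the depth and inspect how deep the individual trees actually grew using each fitted sub-tree's get_depth() method:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Unrestricted trees (max_depth=None) versus shallow trees
deep_rf = RandomForestClassifier(n_estimators=50, max_depth=None, random_state=0).fit(X, y)
shallow_rf = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0).fit(X, y)

print("Deepest unrestricted tree:", max(t.get_depth() for t in deep_rf.estimators_))
print("Deepest tree with max_depth=4:", max(t.get_depth() for t in shallow_rf.estimators_))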
4. min_samples_split (default = 2)
We can specify the minimum number of elements/records that has to be present in a node for the algorithm to consider splitting it further. If we set min_samples_split to 60, a node that still has at least 60 records after a few splits remains a candidate to be split further, i.e. the splitting continues as long as a node contains at least 60 records.
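One way to see the effect (again a sketch on synthetic data, not a benchmark) is to compare how large the trees end up being for different values:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for mss in (2, 60):
    rf = RandomForestClassifier(n_estimators=50, min_samples_split=mss, random_state=0).fit(X, y)
    avg_nodes = sum(t.tree_.node_count for t in rf.estimators_) / len(rf.estimators_)
    # A larger min_samples_split stops the splitting earlier, giving smaller trees
    print(f"min_samples_split={mss}: average nodes per tree = {avg_nodes:.0f}")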
5. max_features (default = 'auto')
At every split, the algorithm randomly chooses some features from which the split is determined. max_features controls how many features are considered when searching for the split. Considering more features increases the chance of finding a better split, but it also increases the correlation between trees, which increases the variance of the overall model. Recommended default values are the square root of the total number of features for classification problems and one third of the total number for regression problems. As with tree size, it may be possible to increase performance by tuning this value.
There are multiple options available in scikit-learn for assigning the maximum features. Here are a few of them:
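The following illustrative constructor calls show the kinds of values max_features accepts (in older scikit-learn versions the string 'auto' was also allowed):

from sklearn.ensemble import RandomForestClassifier

RandomForestClassifier(max_features="sqrt")  # square root of the number of features
RandomForestClassifier(max_features="log2")  # log2 of the number of features
RandomForestClassifier(max_features=5)       # an exact number of features
RandomForestClassifier(max_features=0.3)     # a fraction of the features
RandomForestClassifier(max_features=None)    # consider every feature at each split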
How does “max_features” impact performance and speed?
Increasing max_features can improve the performance of the model, as each node now has a higher number of options to consider. However, this is not guaranteed, because it also decreases the diversity of the individual trees, which is the USP of a random forest. And it certainly decreases the speed of the algorithm. Hence, you need to strike the right balance and choose the optimal max_features.
6. bootstrap (default = True)
Once we provide the training data to the RandomForestClassifier model, the algorithm selects a bunch of rows randomly, with replacement, to build the trees. This process is called bootstrapping (random sampling with replacement). If the bootstrap option is set to False, no random selection happens and the whole dataset is used to build every tree.
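A minimal illustration of the two settings (the max_samples value is arbitrary):

from sklearn.ensemble import RandomForestClassifier

# Default behaviour: each tree is built on a bootstrap sample of the rows
rf_bootstrap = RandomForestClassifier(bootstrap=True, max_samples=100)

# Every tree sees the full training set (no row sampling)
rf_full = RandomForestClassifier(bootstrap=False)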
There are a few attributes that have a direct impact on model training speed. The following are the key parameters we can tune for model speed:
n_jobs
This parameter tells the engine how many processors it is allowed to use. A value of -1 means there is no restriction, whereas a value of 1 means it can only use one processor. Here is a simple experiment we can do with Python to check this:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, oob_score=True, n_jobs=1, random_state=1)
%timeit model.fit(X, y)
# Output: 1 loop, best of 3: 1.7 s per loop
model = RandomForestRegressor(n_estimators=100, oob_score=True, n_jobs=-1, random_state=1)
%timeit model.fit(X, y)
# Output: 1 loop, best of 3: 1.1 s per loop
random_state
This parameter makes a solution easy to replicate. A definite value of random_state will always produce the same results if given the same hyper-parameters and training data. I have personally found that an ensemble of multiple models with different random states and otherwise optimal parameters sometimes performs better than any individual random state.
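A quick sketch of the reproducibility point, on synthetic data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Same random_state + same data + same parameters => identical models
rf_a = RandomForestClassifier(random_state=7).fit(X, y)
rf_b = RandomForestClassifier(random_state=7).fit(X, y)
print(np.array_equal(rf_a.predict(X), rf_b.predict(X)))  # True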
oob_score
This is a random forest cross-validation method. It is very similar to the leave-one-out validation technique, but it is much faster. This method simply tags every observation used in the different trees, and then finds a maximum vote score for every observation based only on the trees which did not use that particular observation to train themselves.
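To use it, set oob_score=True when building the forest; the estimate is then available as the oob_score_ attribute after fitting (shown here on synthetic data):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each record is scored only by the trees that did not see it during training
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)
print("Out-of-bag accuracy estimate:", rf.oob_score_)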
Here is a single example of using all these parameters in a single call:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, oob_score=True, n_jobs=-1, random_state=50,
                              max_features="auto", min_samples_leaf=50)
model.fit(X, y)
In this section, we discuss everything you need to know about random forest models and overfitting. We start with a discussion of what overfitting is and how to determine when a model is overfitted. After that, we discuss random forests and the likelihood that random forest models will overfit.
Overfitting is a common phenomenon we should look out for any time we are training a machine learning model. Overfitting happens when a model pays too much attention to the specific details of the dataset that it was trained on. Specifically, the model picks up on patterns that are specific to the observations in the training data but do not generalize to other observations. The model is able to make great predictions on the data it was trained on but is not able to make good predictions on data it did not see during training.
Overfitting is a problem because machine learning models are generally trained with the intention of making predictions on unseen data. Models that have overfit to their training dataset are not able to make good predictions on new data, so they are of little use in practice.
If you plan to use a machine learning model to make predictions on unseen data, you should always check to make sure that your model is not overfitting to the training data. How do we check whether the model is overfitting to the training data?
In order to check whether a model is overfitting to the training data, we should make sure to split the dataset into a training dataset that is used to train the model and a test dataset that is not touched at all during model training. This way we will have a dataset available that the model did not see at all during training that we can use to assess whether our model is overfitting.
We should generally allocate around 70% of the data to the training dataset and 30% to the test dataset. Only after we have trained the model on the training dataset and optimized its hyper-parameters do we evaluate it on the test dataset. At that point, we can use the model to make predictions on both the test data and the training data and then compare the performance metrics on the two.
If a model is overfitting to the training data, we will notice that the performance metrics on the training data are much better than the performance metrics on the test data.
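A minimal sketch of this check, assuming synthetic data in place of a real dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

# Roughly 70% of the data for training, 30% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, rf.predict(X_train))
test_acc = accuracy_score(y_test, rf.predict(X_test))
# A large gap between the two scores is a sign of overfitting
print(f"train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")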
In general, random forests are much less likely to overfit than other models because they are made up of many weak classifiers that are each trained independently on different bootstrap samples of the training data. Random forests are a great option if we want to quickly train a model that is not likely to overfit. That being said, a random forest model can still overfit in some cases, so we should look out for overfitting whenever we train random forest models.
Here are some easy ways to prevent overfitting in random forests.
Random Forests can be regularized by tweaking the following parameters:
max_depth: This parameter controls the maximum depth of the trees. The deeper the trees, the more parameters they have; remember that overfitting happens when there is an excess of parameters being fitted.
min_samples_leaf: Instead of decreasing max_depth, we can increase the minimum number of samples required at a leaf node. This also limits the growth of the trees and prevents leaves with very few samples (overfitting!).
max_features: As previously mentioned, overfitting happens when there is an abundance of parameters being fitted, and the number of parameters is directly related to the number of features used in each tree. Limiting the number of features considered in each tree therefore helps control overfitting.
Resources:
https://towardsdatascience.com/random-forest-hyperparameters-and-how-to-fine-tune-them-17aee785ee0d
https://medium.com/all-things-ai/in-depth-parameter-tuning-for-random-forest-d67bb7e920d
https://www.section.io/engineering-education/hyperparmeter-tuning/