
Hyperparameter optimization techniques in machine learning with Python code


In every Machine Learning project, it is possible and recommended to search the hyperparameter space to get the best performance out of your model. Finding the best hyperparameter combination is a step you wouldn’t want to miss, as it might give your well-conceived model the final boost it needs. Many of us default to the well-established GridSearchCV implemented in Scikit-learn. However, alternative optimization methods might be more suitable depending on the situation. In this article, we go through five options with an in-depth explanation of each and a guide on how to use them in practice. You can find all the Python scripts gathered in one place in this GitHub repository.

Preparing Data

Before starting our quest for the best hyperparameters, we first need a dataset and a model. For the dataset, we will use a package called datasets that allows us to easily download more than 500 datasets:

import pandas as pd
from datasets import load_dataset

# Download the video games subset of the Amazon US Reviews dataset
# and keep only the first 100 reviews to keep the examples fast
dataset = load_dataset("amazon_us_reviews", 'Video_Games_v1_00', split='train')
df = pd.DataFrame(dataset)[:100]

We chose the Amazon US Reviews dataset (video games subset) and keep only the first 100 reviews so the examples run quickly. The goal is to predict the target feature, the star rating, from the text of the customer reviews.

Below, we’re defining the model whose hyperparameters we will try to optimize:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features followed by a random forest classifier
model = Pipeline([('vect', TfidfVectorizer()), 
                  ('clf', RandomForestClassifier())])

# Hold out 30% of the reviews for testing
X_train, X_test, y_train, y_test = train_test_split(df['review_body'], df['star_rating'], test_size=0.3)

How to find my model’s hyperparameters

Before we get to the optimization part, we first need to know what our model’s hyperparameters are. There are two simple ways to find out:

  1. Scikit-learn documentation for your specific model: You can simply look up your Scikit-learn model’s documentation to see the full list of hyperparameters with their names and possible values.
  2. Using one line of code: You can also use the get_params method to find the names and current values of all the parameters of a given estimator:
model.get_params()
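
For the pipeline defined above, the returned dictionary uses the step__parameter naming convention that all the search spaces below rely on. A minimal check (the full key list and default values depend on your scikit-learn version):

# Keys follow the <step>__<parameter> convention, e.g. 'vect__max_features'
# for the vectorizer and 'clf__n_estimators' for the classifier
params = model.get_params()
print('vect__max_features' in params, 'clf__n_estimators' in params)  # True True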

Now that we know how to find our hyperparameters, we can move on to the different optimization options.

Grid Search

What is it?

Grid search is an optimization method based on trying out every possible combination of a finite number of hyperparameter values. In other words, in order to decide which combination of values gives the optimal results, we go through all the possibilities and measure the performance for each resulting model using a certain performance metric. In practice, grid search is usually combined with cross-validation on the training set.

When it comes to grid search, Scikit-learn gives us two options to choose from:

Exhaustive Grid Search (GridSearchCV)

This first version is the classic one that goes through all the possible combinations of hyperparameter values exhaustively. The resulting models are evaluated one by one and the best-performing combination gets picked.

from sklearn.model_selection import GridSearchCV

# Pipeline estimators parameters
param_grid = {"vect__max_features": [1000, 1500],
              "clf__n_estimators": [200, 300, 400],
              "clf__criterion": ["gini", "entropy"]}

# Grid search on the pipeline
grid_search = GridSearchCV(model, param_grid=param_grid, cv=3)
grid_search.fit(X_train, y_train)

To visualize the results of your grid search and to get the best hyperparameters, refer to the paragraph at the end of the article.

Randomized Grid Search (RandomizedSearchCV)

The second variant of grid search is a more selective one. Instead of going through every possible combination of hyperparameters, only a subset of candidates is evaluated: a fixed number of parameter combinations is sampled from the lists or statistical distributions given as arguments.

This method offers the flexibility of choosing the computational cost we can afford. This is done by fixing the number of sampled candidates, i.e. the number of sampling iterations, through the argument n_iter.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Pipeline estimators parameters
distributions = {"vect__max_features": [1000, 1500],
                 "vect__max_df": uniform(),
                 "clf__n_estimators": [200, 300, 400]}

# Randomized grid search on the pipeline
grid_search = RandomizedSearchCV(model, param_distributions=distributions, cv=3)
grid_search.fit(X_train, y_train)

There are certain points to mention here:

  • Lists are sampled uniformly
  • As mentioned in the documentation:

If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used.

  • It is recommended to use continuous distributions for continuous parameters to take full advantage of the randomization. A good example of this is the uniform distribution used above to sample the maximum document frequency chosen for the TF-IDF vectorizer.
  • Use n_iter to control the trade-off between the quality of the results and computational cost. Increasing n_iter generally improves the chance of finding a better combination, especially when continuous distributions are involved, but it also increases the runtime.
  • Any object can be passed as a distribution as long as it implements an rvs method (random variates sampling) for value sampling. A minimal sketch follows this list.
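
As a minimal sketch of these last two points (the n_iter value below is arbitrary), here is how a scipy distribution object exposing rvs and an explicit n_iter can be combined:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# randint implements rvs(), so it can be used like any other distribution
distributions = {"vect__max_features": [1000, 1500],
                 "clf__n_estimators": randint(200, 401)}  # integers in [200, 400]

# n_iter caps the number of sampled candidates, and hence the computational cost
random_search = RandomizedSearchCV(model, param_distributions=distributions,
                                   n_iter=10, cv=3)
random_search.fit(X_train, y_train)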

Successive Halving

The second method of hyperparameter tuning offered by scikit-learn is successive halving. This method consists of iteratively choosing the best-performing candidates on increasingly larger amounts of resources.

In the first iteration, a large number of parameter combinations is evaluated using a small amount of resources. In each subsequent iteration, only the best-performing candidates are kept, and they are compared using increasingly larger amounts of resources.

What can these resources be in practice? Most of the time, the resource is the number of samples in the training set. It is, however, possible to choose another numeric parameter, such as the number of trees in the random forest, by passing it as an argument.

Similar to the first type of grid search, there are two variants: HalvingGridSearchCV and HalvingRandomSearchCV.

Successive Halving Grid Search (HalvingGridSearchCV)

Successive Halving estimators are still in the experimental phase in scikit-learn. Therefore, in order to use them, you need scikit-learn version 0.24.0 or later and have to explicitly enable the experimental feature:

from sklearn.experimental import enable_halving_search_cv

Once this is done, the code is almost identical to GridSearchCV’s:

from sklearn.model_selection import HalvingGridSearchCV

# Pipeline estimators parameters
param_grid = {"vect__max_features": [1000, 1500],
              "clf__n_estimators": [200, 300, 400],
              "clf__criterion": ["gini", "entropy"]}

# Successive halving grid search on the pipeline
halving_gs = HalvingGridSearchCV(model, param_grid=param_grid, cv=3)
halving_gs.fit(X_train, y_train)

To exploit successive halving to the fullest and adapt the computational cost to your needs, there are a number of relevant arguments to play with:

  • resource: You can use this argument to choose which resource is increased with each iteration. For example, in the code above, we can define it to be the number of trees in the random forest:
param_grid = {"vect__max_features": [1000, 1500],
              "clf__criterion": ["gini", "entropy"]
              }

halving_gs = HalvingGridSearchCV(model, param_grid=param_grid, cv=3, resource='clf__n_estimators', max_resources=400)
halving_gs.fit(X_train, y_train)

Or even the number of features in the TF-IDF vectorization:

param_grid = {"clf__n_estimators": [200, 300, 400],
              "clf__criterion": ["gini", "entropy"]
              }

halving_gs = HalvingGridSearchCV(model, param_grid=param_grid, cv=3, resource='vect__max_features', max_resources=1500)
halving_gs.fit(X_train, y_train)

Make sure, however, to remove the parameter used as the resource from the param_grid dictionary.

  • factor: This is the halving parameter. Its value determines the proportion of candidates that are selected and the amount of resources used at each iteration:
n_resources_{i+1} = n_resources_i * factor

n_candidates_{i+1} = n_candidates_i / factor
  • aggressive_elimination: Since the amount of resources is multiplied by factor at each iteration, there can be at most i_max iterations such that

n_resources_{i_max} = n_resources_0 * factor^{i_max} ≤ max_resources

If max_resources isn’t large enough, more than a handful of candidates may still remain at the last iteration. This is where the aggressive_elimination argument comes in: if it is set to True, the first iteration is re-run (using the minimum amount of resources) as many times as needed until the number of candidates is small enough. A rough numeric illustration of the halving schedule follows.
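
To make the two formulas above concrete, here is a rough illustration of how a halving schedule evolves (the starting values are arbitrary, and scikit-learn’s exact rounding and stopping rules are ignored):

import math

n_candidates, n_resources = 12, 20   # arbitrary starting point
factor, max_resources = 2, 320       # arbitrary halving factor and budget

iteration = 0
while n_candidates > 1 and n_resources * factor <= max_resources:
    print(f"iteration {iteration}: {n_candidates} candidates on {n_resources} resources each")
    n_candidates = math.ceil(n_candidates / factor)  # keep roughly the best 1/factor
    n_resources *= factor                            # give the survivors more resources
    iteration += 1
print(f"iteration {iteration}: {n_candidates} candidates on {n_resources} resources each")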

Randomized Successive Halving (HalvingRandomSearchCV)

Just like randomized grid search, randomized successive halving differs from regular successive halving in one respect: a fixed number of candidates is sampled at random from the parameter space. This number is given through the argument n_candidates (see the example after the code block below). Going back to our code, if we wish to apply randomized successive halving, the corresponding code would be:

from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
from scipy.stats import uniform

# Pipeline estimators parameters
distributions = {"vect__max_features": [1000, 1500],
                 "vect__max_df": uniform(0.1),
                 "clf__n_estimators": [200, 300, 400]
                }

# Randomized successive halving on the pipeline
halving_rnd = HalvingRandomSearchCV(model, param_distributions=distributions, cv=3)
halving_rnd.fit(X_train, y_train)
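
If you prefer to cap the number of sampled candidates explicitly instead of letting the search exhaust the budget, you can pass n_candidates (the value below is arbitrary):

# Sample at most 20 candidate combinations from the distributions above
halving_rnd = HalvingRandomSearchCV(model, param_distributions=distributions,
                                    n_candidates=20, cv=3)
halving_rnd.fit(X_train, y_train)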

Bayesian Grid Search

The third and final method we’re going to talk about in this article is Bayesian optimization over hyperparameters. To use it in Python, we rely on a library called scikit-optimize. The corresponding estimator is called BayesSearchCV and, as mentioned in the documentation, it “utilizes Bayesian Optimization where a predictive model referred to as “surrogate” is used to model the search space and utilized to arrive at good parameter values combination as soon as possible”.

What is the difference between randomized grid search and Bayesian grid search?

Compared to the randomized grid search, this method offers the advantage of taking into consideration the structure of the search space to optimize the search time. This is done by keeping in memory past evaluations and using that knowledge to sample new candidates that are most likely to give better results.

Now that we have a clear overall idea of how this method works, let’s move on to the concrete part: the code.

I should mention though:

1- You might have to downgrade your scikit-learn version to 0.23.2 if you’re using the latest release, for scikit-optimize to work properly (I would recommend doing that in a new environment):

pip install scikit-learn==0.23.2

2- Also, to avoid any further errors, make sure to install the newest development version via this command:

pip install git+https://github.com/scikit-optimize/scikit-optimize.git

Now the actual code for our model would be:

from skopt import BayesSearchCV
# parameter ranges are specified by one of below
from skopt.space import Real, Categorical, Integer


# Pipeline parameters search spaces
search_spaces = {"vect__max_features": Integer(1000, 1500),
                 "vect__max_df": Real(0, 1, prior='uniform'),
                 "clf__n_estimators": Integer(200,  400)}

# Bayesian grid search on the pipeline
opt = BayesSearchCV(model, search_spaces, cv=3)
opt.fit(X_train, y_train)

Visualizing hyperparameter optimization results

To get the full report of all candidates’ performance, we just need to use the attribute cv_results_ for all the methods listed above. The resulting dictionary can be converted to a data frame for more readability:

import pandas as pd
results = pd.DataFrame(grid_search.cv_results_)
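
For instance, you can sort the candidates by their cross-validation rank and keep only the most informative columns (the column names below follow scikit-learn’s cv_results_ convention):

# Best candidates first, with their mean and standard deviation across the CV folds
summary = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.head())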

To get other resulting items, you just need these lines of code:

The winning candidate:

best_model = grid_search.best_estimator_

The best combination of hyperparameters:

params = grid_search.best_params_

The best score, i.e. the mean cross-validated score of the best candidate:

score = grid_search.best_score_

Grid Search report

If you wish to get a nicely formatted report of the best candidates found by the search, scikit-learn developers were kind enough to publish a ready-to-use piece of code in their documentation that does exactly that.
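
A simplified sketch of such a report function (adapted from the idea in scikit-learn’s documentation examples, not the exact snippet) could look like this:

import numpy as np

def report(cv_results, n_top=3):
    # Print the top-ranked candidates with their mean and std CV scores
    for rank in range(1, n_top + 1):
        for idx in np.flatnonzero(cv_results["rank_test_score"] == rank):
            print(f"Rank {rank}: mean={cv_results['mean_test_score'][idx]:.3f} "
                  f"(std={cv_results['std_test_score'][idx]:.3f})")
            print(f"  params: {cv_results['params'][idx]}")

report(grid_search.cv_results_)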


Final Thoughts

When it comes to hyperparameter optimization, you have a wide choice of ready-to-use tools in Python. You can pick what works for you and experiment with them according to your needs. The trade-off between model performance and search time is usually the factor that most influences the choice. In any case, it is important not to skip this step, so that your model gets its best chance to perform well.

Source:

https://towardsdatascience.com/5-hyperparameter-optimization-methods-you-should-use-521e47d7feb0#e190
