Model Split & Tuning

Each model usually has one or more hyperparameters to adjust. Changing the hyperparameters during training will often change the model's performance.

Therefore we need to "tune" the models by finding the hyperparameter values that yield the best overall model performance.

In grid search, we specify a list of values for each hyperparameter, and the search evaluates every combination to find the parameters that perform best on a given evaluation metric. Below is an example for a classification problem.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

# candidate values for each hyperparameter to search over
grid_values = {'n_estimators': [150, 175, 200, 225]}
grid = GridSearchCV(model, param_grid=grid_values, scoring="f1", cv=5)
grid.fit(predictor, target)

print(grid.best_params_)
print(grid.best_score_)

Auto Tuners

Grid search and random search, while semi-automated, are still very tedious. The list of values you give might be out of range or far from the optimum, so it can take several rounds of changing the values before you find a good one.
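
Random search, for reference, uses a near-identical scikit-learn API. Below is a minimal sketch reusing the RandomForestClassifier and the predictor/target data from the grid-search example; the distribution bounds are illustrative.

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

# sample n_estimators from a range instead of enumerating every value
param_distributions = {'n_estimators': randint(100, 300)}
search = RandomizedSearchCV(model, param_distributions=param_distributions,
                            n_iter=10, scoring="f1", cv=5, random_state=0)
search.fit(predictor, target)

print(search.best_params_)
print(search.best_score_)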

Bayesian optimization with Gaussian processes is a popular technique to overcome this. We only need to define the lower and upper bound of each hyperparameter, and the optimizer uses Bayesian inference to arrive at the best parameters for the model.

Bayesian Optimization

This package is my favourite, due to its high-level API and, therefore, its ease of use. We just need to define two things:

  • a dictionary mapping each hyperparameter to the range of values to tune within
  • a "black box" function that accepts those hyperparameters as arguments and returns the evaluation score; model fitting, prediction, and scoring all reside in this function

Besides its ease of use:

  • The maximize() call exposes init_points and n_iter, which balance random exploration against Bayesian exploitation of the search space.
  • It automatically prints a table for each iteration, showing the score and the parameter values selected.
import numpy as np
from bayes_opt import BayesianOptimization
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Black-box function: fit, predict, and return the score to maximize
def black_box(n_estimators, max_depth):
    params = {"n_jobs": 5,
              "n_estimators": int(round(n_estimators)),
              "max_depth": int(round(max_depth))}
    model = RandomForestRegressor(**params)
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    score = np.sqrt(mean_squared_error(y_test, y_predict))
    # negate the RMSE because the optimizer maximizes the return value
    return -score

# Parameter search space: lower and upper bound for each hyperparameter
pbounds = {'n_estimators': (1, 5),
           'max_depth': (10, 50)}

optimizer = BayesianOptimization(black_box, pbounds, random_state=2100)
optimizer.maximize(init_points=10, n_iter=5)
best_param = optimizer.max['params']

However, it has a few downsides:

  • The most glaring is that there is no option to declare parameters as integer-only (notice that n_estimators and max_depth above have to be wrapped in int(round(...)), otherwise there will be an error). The best parameters are still returned as floats, and we have to cast them back to integers before retraining our final model; a short conversion sketch follows this list.
  • The optimiser assumes that the best score is the highest value. However, regression models are usually evaluated with loss or error metrics such as RMSE, where lower is better. We have to reverse this manually by negating the score (as in the code above).
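
The cast-and-retrain step might look like the sketch below, reusing best_param, X_train, and y_train from above (the float values in the comment are purely illustrative).

# best_param comes back as floats, e.g. {'n_estimators': 3.7, 'max_depth': 27.4}
final_params = {'n_estimators': int(round(best_param['n_estimators'])),
                'max_depth': int(round(best_param['max_depth'])),
                'n_jobs': 5}

# retrain the final model with the cast parameters
final_model = RandomForestRegressor(**final_params)
final_model.fit(X_train, y_train)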

Bayesian Tuning and Bandits

Bayesian Tuning and Bandits (BTB) is a more recent library developed at MIT. Its main advantage over the previous package, for me, is:

  • It allows each parameter to be declared explicitly as an int or a float.
import numpy as np
from btb.tuning import GP
from btb import HyperParameter, ParamTypes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def auto_tuning(tunables, epoch, X_train, X_test, y_train, y_test):
    """Auto-tuner using the BTB library (GP tuner)"""
    tuner = GP(tunables)
    parameters = tuner.propose()

    score_list = []
    param_list = []

    for i in range(epoch):
        model = RandomForestRegressor(**parameters, n_jobs=10, verbose=3)
        model.fit(X_train, y_train)
        y_predict = model.predict(X_test)
        score = np.sqrt(mean_squared_error(y_test, y_predict))

        # store scores & parameters
        score_list.append(score)
        param_list.append(parameters)

        print('epoch: {}, rmse: {}, param: {}'.format(i+1, score, parameters))

        # GP maximizes, so feed it the negated RMSE
        tuner.add(parameters, -score)

        # get newly proposed parameters for the next epoch
        parameters = tuner.propose()

    # tuner._best_score holds the negated RMSE, so negate it back
    best_s = -tuner._best_score
    best_score_index = score_list.index(best_s)
    best_param = param_list[best_score_index]
    print('\nbest rmse: {}'.format(best_s))
    print('best parameters: {}'.format(best_param))
    return best_param

tunables = [('n_estimators', HyperParameter(ParamTypes.INT, [500, 2000])),
            ('max_depth', HyperParameter(ParamTypes.INT, [3, 20]))]
best_param = auto_tuning(tunables, 5, X_train, X_test, y_train, y_test)

The disadvantages are:

  • Certain options require manual coding, which I thought the authors could easily have packaged into functions. Examples include:
    • the number of iterations; I have to write a for-loop for this
    • printing the results of each iteration
    • retrieving the parameters for the best score; I have to store all the scores and parameters in lists and then look up the index of the best score to get back its parameters.

Scikit-Optimize

scikit-optimize appears to be another popular package that implements Bayesian optimization.
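
As a rough illustration of its interface, the sketch below uses gp_minimize with Integer search dimensions, assuming the same X_train/X_test/y_train/y_test splits as above; the dimension bounds and n_calls are illustrative. Since gp_minimize minimizes the objective, the RMSE can be returned directly, and Integer dimensions keep n_estimators and max_depth as ints.

import numpy as np
from skopt import gp_minimize
from skopt.space import Integer
from skopt.utils import use_named_args
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# integer search dimensions, so no manual int() casting is needed
dimensions = [Integer(100, 500, name='n_estimators'),
              Integer(3, 20, name='max_depth')]

@use_named_args(dimensions)
def objective(n_estimators, max_depth):
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth, n_jobs=5)
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    # gp_minimize minimizes, so return the RMSE as-is
    return np.sqrt(mean_squared_error(y_test, y_predict))

result = gp_minimize(objective, dimensions, n_calls=15, random_state=2100)
print(result.x)    # best [n_estimators, max_depth]
print(result.fun)  # best RMSE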