K-fold Cross-Validation

A traditional train-test split (i.e. a single fold) runs the risk that the split is unwittingly biased towards certain features or labels.

By repeating model training k times, with each iteration using a different training and validation split, we can avoid such bias, though it is k times more computationally expensive.

cross_val_score is a compact function that obtains all the fold scores from k-fold cross-validation in one line.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# features & target from the forest cover type dataframe
X = df[df.columns[1:-1]]
y = df['Cover_Type']

# 5-fold cross validation, reporting the mean accuracy across folds
model = RandomForestClassifier()
cv_scores = cross_val_score(model, X, y,
                            scoring='accuracy', cv=5, n_jobs=-1)
print(np.mean(cv_scores))

For greater control, such as defining our own evaluation metric, we can use KFold to obtain the train and test indexes for each fold iteration. Sklearn's grid and random searches also combine cross-validation with model tuning (see the sketch after the example below).

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def kfold_custom(X, y, model, eval_metric, fold=4):
    kf = KFold(n_splits=fold)
    score_total = []
    for train_index, test_index in kf.split(X):
        # slice features & labels by the fold's row positions
        X_train, y_train = X.iloc[train_index], y.iloc[train_index]
        X_test, y_test = X.iloc[test_index], y.iloc[test_index]
        model.fit(X_train, y_train)
        y_predict = model.predict(X_test)
        score = eval_metric(y_test, y_predict)
        score_total.append(score)
    return np.mean(score_total)

model = RandomForestClassifier()
# Cover_Type is multi-class, so average the F1 score across classes
kfold_custom(X, y, model, lambda yt, yp: f1_score(yt, yp, average='macro'))
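
As mentioned above, sklearn's grid and random searches run cross-validation while tuning hyperparameters. Below is a minimal sketch using GridSearchCV with the same X and y as before; the parameter grid values are illustrative assumptions, not recommendations.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# illustrative parameter grid; values are assumptions for demonstration only
param_grid = {'n_estimators': [100, 200],
              'max_depth': [None, 10]}

search = GridSearchCV(RandomForestClassifier(),
                      param_grid,
                      scoring='accuracy',
                      cv=5,        # 5-fold cross validation per parameter combination
                      n_jobs=-1)
search.fit(X, y)

print(search.best_params_)   # best parameter combination found
print(search.best_score_)    # mean cross-validated score of that combination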

There are many other variants of cross-validation available in sklearn, as shown below.
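
For instance, a stratified splitter can be passed directly to cross_val_score. A brief sketch, assuming the same X and y as before; StratifiedKFold preserves the label proportions of y in every fold, which is useful for imbalanced classes.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# StratifiedKFold keeps the class distribution of y in each fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = RandomForestClassifier()
cv_scores = cross_val_score(model, X, y, scoring='accuracy', cv=skf, n_jobs=-1)
print(np.mean(cv_scores))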