Machine Learning Process
Lifecycle
The machine learning lifecycle is a lengthy one, involving not just building models but also testing, deployment and monitoring. However, those will be covered on another of my websites.

Modelling Process
The key focus of this site is to zoom in on the start of the ML process, which involves problem scoping, data acquisition and model training.
The first two cannot be emphasized enough, as they will make or break your desired model outcome, as I have come to realise after some failed attempts. They are also the most time-consuming, and rightly so. Imagine obtaining a poor model result just because you lack the domain understanding to know which features are important predictors.

Features & Labels
In a supervised model, a dataset usually consists of two components, the features and the labels. These two go by many names in both the ML and statistical worlds, so it is worth listing them out here.

Names | Synonyms |
---|---|
Features | X, independent variable, predictor, explanatory, input |
Labels | y, dependent variable, target, response, output |
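
To make the two terms concrete, here is a minimal sketch using the iris dataset that appears later on this page. Naming the arrays `X` and `y` is only the usual convention, not a requirement.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# features: one row per sample, one column per measurement
X = iris.data
# labels: one class index per sample
y = iris.target

print(X.shape)             # (150, 4)
print(y.shape)             # (150,)
print(iris.feature_names)  # names of the four feature columns
print(iris.target_names)   # names of the classes encoded in y
```

Each row of `X` is matched to the entry of `y` at the same position, which is why the two arrays must always be split and shuffled together.
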
Data Splits
Both the train and validation datasets are used for model training and evaluation. However, as we adjust the model hyperparameters to get the best evaluation metrics on both the train and val sets, we can end up overfitting to just these two data splits.
To overcome this problem, it is usual to hold out another split as a test set, which is also known as the unseen dataset, as the trained model has not seen it before. Evaluating on it further confirms that your model generalizes well.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# load the iris dataset used in this example
iris = load_iris()
# split into train+val & test sets (default test_size is 25%)
X_trainval, X_test, y_trainval, y_test = \
    train_test_split(iris.data, iris.target, random_state=0)
# split train+val set into train and val set
X_train, X_valid, y_train, y_valid = \
    train_test_split(X_trainval, y_trainval, random_state=1)
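
How the three splits are then used can be sketched roughly as follows. This is only an illustration under some assumptions: the k-nearest neighbours classifier and the candidate values in `candidate_k` are my own picks for the example, and any model and hyperparameter could take their place.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, random_state=1)

# hypothetical candidate values for the n_neighbors hyperparameter
candidate_k = [1, 3, 5, 7, 9]

best_k, best_val_score = None, -1.0
for k in candidate_k:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)                # train on the train set only
    val_score = model.score(X_valid, y_valid)  # tune against the val set
    if val_score > best_val_score:
        best_k, best_val_score = k, val_score

# refit on train+val with the chosen hyperparameter,
# then touch the test set exactly once for the final estimate
final_model = KNeighborsClassifier(n_neighbors=best_k)
final_model.fit(X_trainval, y_trainval)
test_score = final_model.score(X_test, y_test)

print(f"best n_neighbors: {best_k}")
print(f"validation accuracy: {best_val_score:.3f}")
print(f"test accuracy: {test_score:.3f}")
```

The important point is that the test set plays no part in choosing `best_k`. If the test score were used to pick hyperparameters, it would stop being an unseen dataset.
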
A more advanced form of data split is k-fold cross-validation, which is elaborated on later. We can see the entire process diagrammatically as shown below.
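
Ahead of the fuller treatment later, k-fold cross-validation can be sketched in a few lines. The choice of 5 folds and the k-nearest neighbours classifier are assumptions made for this example only. The test set is still held out and is not part of the folds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# keep a test set aside, as before
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# 5 folds: each fold takes one turn as the validation set
# while the model is trained on the remaining four
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(model, X_trainval, y_trainval, cv=cv)

print("accuracy per fold:", scores)
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Compared with a single train/val split, every sample in the train+val set gets used for validation exactly once, so the averaged score depends less on how one particular split happened to fall.
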

A good workflow for the model training process that incorporates these data splits is shown below.
