Persistence
Data, models and scalers are examples of objects that benefit greatly from being stored in their native Python format.
For data, this allows loading many times faster than from other sources, since it is saved in a Python format. For models and scalers, there is no other way of saving them, as they are natively Python objects. There are a few ways of doing this.
Pickle
We can save data loaded into a DataFrame as shown below.
import pandas as pd

# save the dataframe to disk
df.to_pickle('df.pkl')
# load it back
df = pd.read_pickle('df.pkl')
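To get a sense of the speed-up, here is a minimal sketch, assuming an illustrative randomly generated DataFrame, that times reading the same data back from CSV and from pickle; exact numbers depend on the machine and the data.
import time
import numpy as np
import pandas as pd

# illustrative dataframe; timings vary by machine and data size
df = pd.DataFrame(np.random.rand(1_000_000, 10))
df.to_csv('df.csv', index=False)
df.to_pickle('df.pkl')

# time reading the same data back from each format
start = time.perf_counter()
pd.read_csv('df.csv')
print('csv   :', time.perf_counter() - start)

start = time.perf_counter()
pd.read_pickle('df.pkl')
print('pickle:', time.perf_counter() - start)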
For models or scalers we need to import pickle directly. Note that the file extension can be anything.
import pickle

# save the fitted model; the file extension can be anything
pickle.dump(model, open('model_rf.pkl', 'wb'))
# load it back
model = pickle.load(open('model_rf.pkl', 'rb'))
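Since scalers were mentioned above, here is a minimal sketch doing the same for a fitted scaler; the StandardScaler and the sample values are illustrative assumptions.
import pickle
from sklearn.preprocessing import StandardScaler

# fit a scaler on some illustrative data
scaler = StandardScaler().fit([[0.0], [5.0], [10.0]])

# persist it and load it back
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# the reloaded scaler transforms new data exactly as before
print(scaler.transform([[5.0]]))  # [[0.]]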
Joblib
When using scikit-learn, it may be better to use joblib’s replacement of pickle (dump and load). It is more efficient on objects that carry large numpy arrays internally, as is often the case for fitted scikit-learn estimators, but it can only pickle to disk and not to a string. More here.
However, note that for smaller files (<4GB), pickle might be faster. Discussion on Stack Overflow.
import joblib

# save the fitted estimator
joblib.dump(clf, 'model.joblib')
# load it back
clf = joblib.load('model.joblib')
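As a rough illustration of the point above, the sketch below persists the same fitted estimator with plain pickle and with joblib's optional compress parameter, then compares file sizes; the RandomForestClassifier, the random data and the file names are illustrative assumptions, and exact numbers will vary.
import os
import pickle

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# fit an estimator that carries large numpy arrays internally
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, 1000)
clf = RandomForestClassifier(n_estimators=100).fit(X, y)

# persist with plain pickle and with compressed joblib
with open('model_rf.pkl', 'wb') as f:
    pickle.dump(clf, f)
joblib.dump(clf, 'model_rf.joblib.gz', compress=3)

# compare the resulting file sizes in bytes
print(os.path.getsize('model_rf.pkl'), os.path.getsize('model_rf.joblib.gz'))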
Numpy
Numpy also has its own built-in serialisation function.
import numpy as np

file_name = "dog_bark.npy"
# save as npy
np.save(file_name, np_array)
# load numpy array
array = np.load(file_name)
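If several related arrays need to be stored together, numpy's savez bundles them into a single .npz archive; the array names and file name below are illustrative assumptions.
import numpy as np

# bundle multiple named arrays into one archive
np.savez("dog_bark_features.npz", waveform=np.random.rand(16000), sample_rate=np.array(16000))

# the loaded archive behaves like a dict of arrays
data = np.load("dog_bark_features.npz")
print(data["waveform"].shape, data["sample_rate"])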