One-Class Classification
These methods require training on a normal state (or states); outliers can then be detected when they fall outside the trained state. A common use is anomaly detection.
One-Class SVM
One-class SVM is an unsupervised algorithm that learns a decision function for outlier detection: classifying new data as similar or different to the training set.
Besides the kernel, two other parameters are important:
nu
: proportion of outliers you expect to observe
gamma
: determines the smoothing of the contour lines
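To build intuition for nu, it helps to fit on synthetic data and vary it: the fraction of training points flagged as outliers roughly tracks the nu you choose. A minimal sketch (the synthetic data and parameter values here are illustrative, not from the example below):
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(500, 2)  # a synthetic 2-D "normal" cluster

# nu upper-bounds the fraction of training points treated as outliers,
# so the flagged fraction should roughly track it
for nu in (0.01, 0.1, 0.3):
    oc = OneClassSVM(kernel='rbf', nu=nu, gamma=0.1).fit(X)
    flagged = (oc.predict(X) == -1).mean()
    print(f"nu={nu}: {flagged:.1%} of training points flagged")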
from sklearn.svm import OneClassSVM
from sklearn.model_selection import train_test_split

# assumes a DataFrame `data` with feature columns and a 0/1 outlier label `y`
train, test = train_test_split(data, test_size=0.2)
train_normal = train[train['y'] == 0]
train_outliers = train[train['y'] == 1]
outlier_prop = len(train_outliers) / len(train_normal)

# fit on normal observations only, using the observed outlier proportion as nu
model = OneClassSVM(kernel='rbf', nu=outlier_prop, gamma=0.000001)
model.fit(train_normal[['col1', 'col2', 'col3']])
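Once fitted, predict labels new observations as 1 (inlier, similar to the training set) or -1 (outlier). A short usage sketch, continuing the example above (column names assumed from the snippet):
# flag outliers in the held-out set: 1 = inlier, -1 = outlier
preds = model.predict(test[['col1', 'col2', 'col3']])
print(f"{(preds == -1).sum()} of {len(preds)} test points flagged as outliers")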
Isolation Forest
One efficient way of performing outlier detection in high-dimensional datasets is to use random forests. Isolation Forest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Because anomalies are few and different, they tend to be isolated in fewer splits, so the average path length over a forest of such trees serves as an anomaly measure: shorter paths indicate more anomalous points.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# behaviour='new' was removed in newer scikit-learn versions, so it is dropped here
clf = IsolationForest(max_samples=100, random_state=rng, contamination='auto')
clf.fit(X_train)  # assumes feature matrices X_train and X_test are already defined
y_pred_test = clf.predict(X_test)
# -1 are outliers
y_pred_test
# array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1])
# count predicted anomalies (-1) vs inliers (1); output below is from a larger run
pd.DataFrame(y_pred_test)[0].value_counts()
# -1 23330
# 1 687
# Name: 0, dtype: int64
We can also get the average anomaly score for each observation (averaged over the trees in the forest). The lower the score, the more abnormal the observation: negative scores represent outliers, positive scores represent inliers.
clf.decision_function(X_test)
array([ 0.14528263, 0.14528263, -0.08450298, 0.14528263, 0.14528263,
0.14528263, 0.14528263, 0.14528263, 0.14528263, -0.14279962,
0.14528263, 0.14528263, -0.05483886, -0.10086102, 0.14528263,
0.14528263])
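If the default cutoff at 0 flags too many or too few points, you can threshold the scores yourself. A minimal sketch (the 5th-percentile cutoff is an arbitrary illustrative choice):
import numpy as np

scores = clf.decision_function(X_test)

# flag the lowest-scoring 5% as anomalies instead of using the default 0 cutoff
threshold = np.percentile(scores, 5)
custom_pred = np.where(scores < threshold, -1, 1)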
