One-Class Classification
These methods require training on a normal state (or states); outliers can then be detected when they fall outside the trained state. A common use is anomaly detection.
One-Class SVM
One-class SVM is an unsupervised algorithm that learns a decision function for outlier detection: classifying new data as similar or different to the training set.
Besides the kernel, two other parameters are important:
nu
: proportion of outliers you expect to observe
gamma
: determines the smoothing of the contour lines
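To build intuition for nu, it helps to fit on synthetic data and vary it: the fraction of training points flagged as outliers roughly tracks the nu you choose. A minimal sketch (the synthetic data and parameter values here are illustrative, not from the example below):
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(500, 2)  # a synthetic 2-D "normal" cluster

# nu upper-bounds the fraction of training points treated as outliers,
# so the flagged fraction should roughly track it
for nu in (0.01, 0.1, 0.3):
    oc = OneClassSVM(kernel='rbf', nu=nu, gamma=0.1).fit(X)
    flagged = (oc.predict(X) == -1).mean()
    print(f"nu={nu}: {flagged:.1%} of training points flagged")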
from sklearn.svm import OneClassSVM
from sklearn.model_selection import train_test_split

# assumes a DataFrame `data` with feature columns and a 0/1 outlier label `y`
train, test = train_test_split(data, test_size=0.2)
train_normal = train[train['y'] == 0]
train_outliers = train[train['y'] == 1]
outlier_prop = len(train_outliers) / len(train_normal)

# fit on normal observations only, using the observed outlier proportion as nu
model = OneClassSVM(kernel='rbf', nu=outlier_prop, gamma=0.000001)
model.fit(train_normal[['col1', 'col2', 'col3']])
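Once fitted, predict labels new observations as 1 (inlier, similar to the training set) or -1 (outlier). A short usage sketch, continuing the example above (column names assumed from the snippet):
# flag outliers in the held-out set: 1 = inlier, -1 = outlier
preds = model.predict(test[['col1', 'col2', 'col3']])
print(f"{(preds == -1).sum()} of {len(preds)} test points flagged as outliers")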
Isolation Forest
One efficient way of performing outlier detection in high-dimensional datasets is to use random forests. Isolation Forest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Because anomalies are few and different, they tend to be isolated in fewer splits, so the average path length over a forest of such trees serves as an anomaly measure: shorter paths indicate more anomalous points.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# behaviour='new' was removed in newer scikit-learn versions, so it is dropped here
clf = IsolationForest(max_samples=100, random_state=rng, contamination='auto')
clf.fit(X_train)  # assumes feature matrices X_train and X_test are already defined
y_pred_test = clf.predict(X_test)
# -1 are outliers
y_pred_test
# array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1])
# count predicted anomalies (-1) vs inliers (1); output below is from a larger run
pd.DataFrame(y_pred_test)[0].value_counts()
# -1 23330
# 1 687
# Name: 0, dtype: int64
We can also get the average anomaly score for each observation (averaged over the trees in the forest). The lower the score, the more abnormal the observation: negative scores represent outliers, positive scores represent inliers.
clf.decision_function(X_test)
array([ 0.14528263, 0.14528263, -0.08450298, 0.14528263, 0.14528263,
0.14528263, 0.14528263, 0.14528263, 0.14528263, -0.14279962,
0.14528263, 0.14528263, -0.05483886, -0.10086102, 0.14528263,
0.14528263])
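If the default cutoff at 0 flags too many or too few points, you can threshold the scores yourself. A minimal sketch (the 5th-percentile cutoff is an arbitrary illustrative choice):
import numpy as np

scores = clf.decision_function(X_test)

# flag the lowest-scoring 5% as anomalies instead of using the default 0 cutoff
threshold = np.percentile(scores, 5)
custom_pred = np.where(scores < threshold, -1, 1)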
