Clustering

Clustering involves finding related groups in the data and assigning every point in the dataset to one of the groups.

To validate that the model is good, a person needs to label the dataset, and we then check the percentage of points whose assigned cluster matches their label.

import pandas as pd

# concat actual & predicted clusters together
y = pd.DataFrame(y.values, columns=['actual'])
cluster = pd.DataFrame(kmeans.labels_, columns=['cluster'])
df = pd.concat([y,cluster], axis=1)

# view absolute numbers
res = df.groupby('actual')['cluster'].value_counts()
print(res)

# view percentages
res2 = df.groupby('actual')['cluster'].value_counts(normalize=True)*100
print(res2)

K-Means

Key Hyperparameters
  • n_clusters: no. of clusters
  1. Specify the number of clusters (e.g. 3)
  2. 3 data points are randomly selected as the initial cluster centers
  3. Each data point is assigned to the cluster center it is closest to
  4. Cluster centers are updated to the mean of the assigned points
  5. Steps 3-4 are repeated until the cluster centers remain unchanged (a NumPy sketch of these steps follows below)
Introduction to Machine Learning with Python (book)
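
A minimal NumPy sketch of these steps, for illustration only (not sklearn's implementation; it assumes a numeric feature matrix X):

import numpy as np

def kmeans_sketch(X, k=3, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: pick k random data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 3: assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: move each center to the mean of its assigned points
        # (empty clusters are not handled in this sketch)
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # step 5: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers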

It is important to scale the features before applying K-Means, unless the fields are not meant to be scaled, like distances. Categorical data is not appropriate since the clusters are computed using Euclidean distances (means). For long distances over lat/long coordinates, the points need to be projected to a flat surface first.

One aspect of k-means is that different random starting points for the cluster centers often result in very different clustering solutions. So typically the k-means algorithm is run in scikit-learn with ten different random initializations, and the solution with the lowest inertia (sum of squared distances to the centers) is kept.

Limitations

  • Very sensitive to outliers. They have to be removed before fitting the model
  • For simple globular shapes, not irregular complex clusters
  • Might need to reduce dimensions if there is a very high no. of features, otherwise the distance separation might not be obvious

Two variants, K-Medians & K-Medoids, are less sensitive to outliers and work with categorical features (a hedged K-Medoids sketch follows after the example below).

from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# scale the features first, then fit k-means on the scaled data
X_scaled = MinMaxScaler().fit_transform(X)
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(X_scaled)
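
Below is a hedged sketch of the K-Medoids variant mentioned above, assuming the optional scikit-learn-extra package is installed (K-Medians has no mainstream scikit-learn implementation); it reuses X_scaled from the example above:

from sklearn_extra.cluster import KMedoids

# medoids are actual data points, which makes them more robust to outliers
kmedoids = KMedoids(n_clusters=4, metric='manhattan', random_state=0)
kmedoids.fit(X_scaled)
print(kmedoids.cluster_centers_)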

To determine the best k, we usually plot an elbow chart. The significant bend in the chart shows where the average distance starts to level off, such that adding more clusters no longer decreases it by much. We choose that value of k.

The silhouette score can also be used to determine the number of clusters k (see the sketch after the elbow example below).

from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pylab as plt
import seaborn as sns
%matplotlib inline

meandist=[]
clusters=range(1,10)
for k in clusters:
    # prepare the model
    model=KMeans(n_clusters=k)
    # fit the model
    model.fit(train_feature)
    # test the model
    clusassign=model.predict(train_feature)
    # gives average distance values for each cluster solution
        # cdist calculates the distance of each point to every centroid
        # take the min distance (the centroid the point is assigned to)
        # get average distance by summing & dividing by total number of samples
    meandist.append(sum(np.min(cdist(train_feature, 
                        model.cluster_centers_, 'euclidean'), axis=1))
                    / train_feature.shape[0])

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
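
For the silhouette approach mentioned earlier, a minimal sketch using sklearn's silhouette_score on the same train_feature data; the k with the highest mean silhouette score is the usual choice:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pylab as plt

sil_scores = []
ks = range(2, 10)   # silhouette needs at least 2 clusters
for k in ks:
    model = KMeans(n_clusters=k, random_state=0)
    labels = model.fit_predict(train_feature)
    sil_scores.append(silhouette_score(train_feature, labels))

plt.plot(ks, sil_scores)
plt.xlabel('Number of clusters')
plt.ylabel('Mean silhouette score')
plt.title('Selecting k with the Silhouette Method')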

Sometimes we need the cluster centres so that we can get an absolute distance measure from the centroids to new data. Each cluster has a defined centre for every feature.

# get cluster centres
centroids = model.cluster_centers_
# for each row, define cluster centre
centroid_labels = [centroids[i] for i in model.labels_]

If we have labels y and want to determine which label belongs to which cluster for an evaluation score, we can use a groupby to find the label that occurs most often in each cluster and manually name the cluster as such.

# count how many points of each true label fall in each cluster
df = concat.groupby(['label','cluster'])['cluster'].count()
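
To go one step further and automatically assign each cluster its majority label, a minimal sketch assuming concat has 'label' and 'cluster' columns as above:

# for each cluster, take the true label that occurs most often
mapping = concat.groupby('cluster')['label'].agg(lambda s: s.value_counts().idxmax())
# map each point's cluster to its majority label and compute the match rate
matched = concat['cluster'].map(mapping) == concat['label']
print(mapping)
print(matched.mean())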

If we want to know the distance from each datapoint to its assigned cluster centroid, we can use fit_transform to get the distances to all cluster centroids and process them from there.

from sklearn.cluster import KMeans
import pandas as pd

kmeans = KMeans(n_clusters=n_clusters, random_state=0)

# get distance from each centroid for each datapoint
dist_each_centroid = kmeans.fit_transform(df)
# get all assigned centroids
y = kmeans.labels_
# get distance of assigned centroid
dist = [distance[label] for label, distance in zip(y, dist_each_centroid)]
# concat label & distance together
label_dist = pd.DataFrame(list(zip(y, dist)), columns=['label','distance'])

Gaussian Mixture Model

GMM is in essence a density estimation model, but it can function as a clustering algorithm. It has a probabilistic model under the hood, so it returns a matrix with, for each data point, the probability of belonging to each cluster (see the predict_proba sketch after the example below). More from here.

Key Hyperparameters
  • n_components: no. of clusters
  • covariance_type: 'spherical' (like k-means), 'diag' (ellipse constrained to the axes), or 'full' (default in sklearn; ellipse without a specific orientation)
from sklearn.mixture import GaussianMixture

# gmm accepts input as array, so have to convert dataframe to numpy
input_gmm = normal.values

gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=42)
gmm.fit(input_gmm)
result = gmm.predict(test_set)
Python Data Science Handbook by Jake VanderPlas
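
Since GMM returns a matrix of cluster probabilities, sklearn exposes it through predict_proba; a minimal sketch reusing the fitted gmm and input_gmm above:

# each row sums to 1: the probability of that point belonging to each component
proba = gmm.predict_proba(input_gmm)
print(proba.shape)        # (n_samples, n_components)
print(proba[:5].round(3))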

BIC or AIC can be used to determine the optimal number of clusters via an elbow diagram; BIC usually recommends a simpler model. Note that the number of components chosen this way measures how well GMM works as a density estimator, not as a clustering algorithm.

from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

input_gmm = normal.values

bic_list = []
aic_list = []
ranges = range(1,30)

for i in ranges:
    gmm = GaussianMixture(n_components=i).fit(input_gmm)
    # BIC
    bic = gmm.bic(input_gmm)
    bic_list.append(bic)
    # AIC
    aic = gmm.aic(input_gmm)
    aic_list.append(aic)

plt.figure(figsize=(10, 5))
plt.plot(ranges, bic_list, label='BIC');
plt.plot(ranges, aic_list, label='AIC');
plt.legend(loc='best');
Python Data Science Handbook by Jake VanderPlas
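
A small follow-up sketch to read off the component count with the lowest BIC/AIC from the lists above (assumes numpy is available):

import numpy as np

# the component count that minimises each criterion
best_k_bic = ranges[int(np.argmin(bic_list))]
best_k_aic = ranges[int(np.argmin(aic_list))]
print(best_k_bic, best_k_aic)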

Agglomerative Clustering

Agglomerative Clustering is a hierarchical clustering technique that builds clusters from the bottom up. Divisive Clustering is the opposite method, building clusters from the top down; it is not available in sklearn.

The most useful part of such a clustering is the ability to draw a dendrogram to view the breakdown of clusters with increasing distance.

Python Data Science Handbook by Jake VanderPlas
Key Hyperparameters
  • linkage: method of linking clusters ('ward', 'average', 'complete')
University of Michigan: Coursera Data Science in Python
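
For reference, a minimal sklearn sketch using the linkage hyperparameter, with an assumed feature matrix X:

from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)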

In essence, we can use a 3-step method to compute agglomerative clustering. First we conduct the agglomerative clustering, then display the dendrogram. After determining at which distance the tree should be cut, we flatten the clusters by indicating the distance at which to slice the dendrogram.

Python Data Science Handbook by Jake VanderPlas
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# 1. clustering
Z = linkage(X, method='ward', metric='euclidean')

# 2. draw dendrogram
plt.figure(figsize=(10,5));
dendrogram(Z, orientation='left', leaf_font_size=8)
plt.show()

# 3. flatten cluster
distance_threshold = 10
y = fcluster(Z, distance_threshold, criterion='distance')
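
Alternatively, fcluster can cut the tree by a target number of clusters instead of a distance, using criterion='maxclust':

# 3b. flatten by the desired number of clusters instead of a distance threshold
max_clusters = 4
y_k = fcluster(Z, max_clusters, criterion='maxclust')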

sklearn's agglomerative clustering is said to be very slow. An alternative, fastcluster, is faster as it is a C++ library with a Python interface.

import fastcluster
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster

# 1. clustering
Z = fastcluster.linkage_vector(X, method='ward', metric='euclidean')
Z_df = pd.DataFrame(data=Z, columns=['clusterOne','clusterTwo',
                                     'distance','newClusterSize'])

# 2. draw dendrogram
plt.figure(figsize=(10, 5))
dendrogram(Z, orientation='left', leaf_font_size=8)
plt.show();

# 3. flatten cluster
distance_threshold = 2000
clusters = fcluster(Z, distance_threshold, criterion='distance')

Raw results from clustering
full dendrogram

The dendrogram can be further enhanced by

  • adding a title and axis labels
  • adding grids
  • trimming the bottom branches based on the max. no. of clusters to display
  • labelling each cluster split distance
  • adding a horizontal line to investigate where an appropriate cutoff point would be
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(8, 5))
plt.title('Agglomerative Clustering Dendrogram')
plt.xlabel('Clusters')
plt.ylabel('Distance')

# cluster
Z = linkage(df, method='ward', metric='euclidean')

# plot dendrogram
ddata = dendrogram(Z, orientation='top',
                    truncate_mode='lastp', p=5,
                    labels=True, get_leaves=True,
                    show_leaf_counts=True,
                    show_contracted=True)

# plot cluster points & distance labels
limit = 4
for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
    x = sum(i[1:3])/2
    y = d[1]
    if y > limit:
        plt.plot(x, y, 'o', c=c, markeredgewidth=0)
        plt.annotate(int(y), (x, y), xytext=(0, -5),
                    textcoords='offset points',
                    va='top', ha='center', fontsize=9)

# plot distance
line = 1500
plt.axhline(y=line, c='black', linestyle='--');

The labels in brackets are the number of datapoints clustered under each branch.

prettified dendrogram

DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN).

Key Hyperparameters
  • eps: epsilon, the max distance between two samples for one to be considered in the neighbourhood of the other
  • min_samples: min. no. of samples within eps of a point for it to be considered a core point
  1. Pick an arbitrary point to start
  2. Find all points with distance eps or less from that point
  3. If there are more than min_samples points within distance eps, the point is labelled as a core sample and assigned a new cluster label
  4. Then all neighbours within eps of that point are visited
  5. If they are core samples, their neighbours are visited in turn, and so on
  6. The cluster thus grows until there are no more core samples within distance eps of the cluster
  7. Then another point that has not yet been visited is picked, and steps 1-6 are repeated
  8. 3 kinds of points are generated in the end: core points, boundary points, and noise (see the sketch after the code below)
  9. Boundary points are within distance eps of a core sample but are not core samples themselves
Introduction to Machine Learning with Python (Book)
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state = 9, n_samples = 20)
dbscan = DBSCAN(eps = 2, min_samples = 2)

cls = dbscan.fit_predict(X)
print(cls)
# [1 0  1  0  2  0  0  0  2  2 -1  1  2  0  0 -1  0  0  1 -1]
# -1 indicates noise or outliers
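
To separate the three kinds of points from the fitted model above, DBSCAN exposes core_sample_indices_; a minimal sketch:

import numpy as np

# core samples are flagged by the fitted model; noise is labelled -1
core_mask = np.zeros_like(cls, dtype=bool)
core_mask[dbscan.core_sample_indices_] = True
noise_mask = cls == -1
# boundary points sit in a cluster but are not core samples themselves
boundary_mask = ~core_mask & ~noise_mask

print(core_mask.sum(), boundary_mask.sum(), noise_mask.sum())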