
Collaborative Filtering

Match a user to other users with similar tastes, and recommend the items those users like.

User-Based CF

Different users may use different parts of the rating scale to express the same level of appreciation for an item, which introduces bias.

We can use normalization methods to overcome this bias. Pearson's correlation coefficient can be used in place of cosine similarity as the similarity metric: it subtracts each user's mean rating from their individual ratings before computing the cosine. This is also referred to as mean centering.

Z-score normalization can also be used when the rating scale has a wide range of discrete values or is continuous; in addition to subtracting each user's mean, it divides by the standard deviation of that user's ratings.
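As a minimal sketch of the z-score step (the ratings below are hypothetical, on an assumed continuous 0-10 scale):

```python
import numpy as np

# Hypothetical ratings from a single user on a continuous 0-10 scale.
user_ratings = np.array([7.0, 9.0, 5.0, 8.0, 6.0])

# Z-score: subtract the user's mean rating and divide by their standard
# deviation, so users with different spreads become comparable.
mu, sigma = user_ratings.mean(), user_ratings.std()
z = (user_ratings - mu) / sigma
```

After this transform every user's ratings have mean 0 and standard deviation 1, regardless of how generous or harsh a rater they are.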

Item-Based CF

First introduced by Amazon, item-based CF was designed to resolve the scalability issues that arise when there are far more users than items. It involves transposing the user-item rating matrix and computing distances between items rather than between users.

The distance metric used is usually adjusted cosine similarity, which subtracts each user's mean rating before computing the cosine between item vectors.
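A minimal NumPy sketch of adjusted cosine similarity between two item columns; the matrix is illustrative toy data. The key difference from plain cosine is that centering is done per user (row), not per item.

```python
import numpy as np

# Toy user-item rating matrix; 0 means unrated. Values are illustrative.
R = np.array([
    [5.0, 3.0, 4.0],
    [3.0, 1.0, 2.0],
    [4.0, 3.0, 4.0],
])

def adjusted_cosine(R, i, j):
    """Adjusted cosine: mean-center each USER's ratings, then take the
    cosine between item columns i and j over co-rating users."""
    mask = (R[:, i] > 0) & (R[:, j] > 0)       # users who rated both items
    user_means = np.nanmean(np.where(R > 0, R, np.nan), axis=1)
    a = R[mask, i] - user_means[mask]
    b = R[mask, j] - user_means[mask]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

sim = adjusted_cosine(R, 0, 2)
```

Because item similarities change slowly, this matrix can be precomputed offline and reused at recommendation time.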

Pros | Cons
---- | ----
Faster, as there are far fewer items than users | -
Less affected by the new-user cold start problem | -
Items change less frequently than users | -

Matrix Factorization

Matrix factorization decomposes the user-item rating matrix into a user latent-preference matrix and an item latent-properties matrix. This usually results in a much smaller model. The number of latent features is chosen by the practitioner, similar to choosing the number of components in PCA.

Some of the popular algorithms to implement this include Alternating Least Squares (ALS), Non-Negative Matrix Factorization (NMF), and Singular Value Decomposition++ (SVD++).

Two good Python libraries for implementing matrix factorization are Implicit and Surprise (note that Surprise does not support implicit ratings or content-based information).
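To make the factorization concrete without depending on either library, here is a minimal ALS sketch in plain NumPy: alternately solve for user factors with item factors fixed, then the reverse. The rating matrix, rank, and regularization strength are illustrative assumptions.

```python
import numpy as np

# Toy rating matrix; 0 means unobserved. Purely illustrative data.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])
mask = R > 0                     # fit only the observed entries
k, lam = 2, 0.1                  # latent factors, L2 regularization
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))   # user latent factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))   # item latent factors

for _ in range(50):
    # Each subproblem is ridge regression with a closed-form solution.
    for u in range(R.shape[0]):
        Vu = V[mask[u]]
        U[u] = np.linalg.solve(Vu.T @ Vu + lam * np.eye(k), Vu.T @ R[u, mask[u]])
    for i in range(R.shape[1]):
        Ui = U[mask[:, i]]
        V[i] = np.linalg.solve(Ui.T @ Ui + lam * np.eye(k), Ui.T @ R[mask[:, i], i])

pred = U @ V.T                   # reconstruction also fills the unobserved cells
err = np.abs((pred - R)[mask]).mean()
```

The zero entries of `pred` are exactly the candidate recommendations: unobserved ratings estimated from the learned latent factors.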

Pros | Cons
---- | ----
Generally more accurate than User-CF & Item-CF | Little explainability (latent features are hard to interpret)
Smaller model size than User-CF & Item-CF | Factorization must be redone to apply to new users / items