Content-Based Recommenders
Match users directly to products and content by recommending items based on what they have bought or viewed in the past.
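As a minimal sketch of the idea, we can score items by how similar their text content is to an item the user last viewed. The product descriptions and scikit-learn usage below are illustrative assumptions, not a prescribed pipeline.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy product descriptions; item 0 is what the user last viewed
descriptions = [
    "wireless bluetooth headphones with noise cancellation",
    "wired over-ear studio headphones",
    "stainless steel kitchen knife set",
]
tfidf = TfidfVectorizer().fit_transform(descriptions)
# cosine similarity of every item to item 0
scores = cosine_similarity(tfidf[0], tfidf).flatten()
# indices of the most similar items, excluding item 0 itself
print(scores.argsort()[::-1][1:])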
Pros & Cons
Pros | Cons |
---|---|
No cold start - can recommend new items even if no users have rated them yet | No serendipity - hard to recommend items not similar to those the user has already seen |
Can explain why recommendations are made by referencing the deciding item features | Getting an adequate set of product descriptor features can be hard |
- | Getting user preferences against the product features can also be hard - users generally do not have the patience to specify all of their preferences |
Product Similarity
Product similarity recommendation is a content-based method which aims to recommend products by finding the most similar products to a query product based on their content. This content may be the product title, description, images, category/subcategory, specifications, etc.
We can use item2vec, a shallow, single-layer neural network, to do this. It is based on Word2Vec, where single words are replaced with item content, and each sentence of words is replaced with a basket of items. More on its architecture in these two articles 1, 2.
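Before training, the purchase or view history has to be grouped into these baskets. A quick sketch, assuming a hypothetical orders dataframe with user_id and product columns:

import pandas as pd

# toy purchase history; the column names are illustrative assumptions
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3],
    "product": ["computer", "mouse", "keyboard",
                "computer", "monitor", "mouse", "keyboard"],
})
# one "sentence" (basket of items) per user
sentences = orders.groupby("user_id")["product"].apply(list).tolist()
# [['computer', 'mouse', 'keyboard'], ['computer', 'monitor'], ['mouse', 'keyboard']]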

from gensim.models import Word2Vec

# train item2vec on the baskets; each basket is treated as a sentence
model = Word2Vec(sentences, min_count=1, vector_size=50, workers=3,
                 window=3, sg=1)
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")
# get the numpy vector of an item
vector = model.wv['computer']
# get the most similar items, ranked by cosine similarity score
sims = model.wv.most_similar('computer', topn=10)
The arguments are as follows.
arg | desc |
---|---|
vector_size | no. of dimensions of the embeddings. default 100 |
window | max distance between a target word and the words around it. default 5 |
min_count | minimum occurrence count of words to consider when training the model; words occurring fewer times than this are ignored. default 5 |
workers | no. of worker threads used to train the model. default 3 |
sg | training algorithm: CBOW (0) or skip-gram (1). default is CBOW |
Evaluation
We can evaluate the ranked recommendations with mAP (Mean Average Precision), MRR (Mean Reciprocal Rank), or NDCG (Normalized Discounted Cumulative Gain).
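As a rough sketch with made-up data, MRR averages the reciprocal rank of the first relevant item in each user's recommendation list:

def mean_reciprocal_rank(recommended, relevant):
    # reciprocal rank of the first relevant item per user, 0 if none is found
    rr = []
    for recs, rel in zip(recommended, relevant):
        score = 0.0
        for rank, item in enumerate(recs, start=1):
            if item in rel:
                score = 1.0 / rank
                break
        rr.append(score)
    return sum(rr) / len(rr)

recommended = [["a", "b", "c"], ["d", "e", "f"]]  # ranked lists per user
relevant = [{"b"}, {"f"}]  # items each user actually interacted with
mean_reciprocal_rank(recommended, relevant)  # (1/2 + 1/3) / 2 ≈ 0.42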
Convert Item Mapping in Model
While we used product names as the embedding keys in the model, a product ID (e.g. SKU) is usually the unique identifier used for querying. We can swap the keys within the model itself, which avoids the need for an additional mapping table that would increase latency.
import os
from gensim import models

def convert_word2vec_mapping(model, mapping, modelpath="newmapping.model", modeltemp="modeltemp.txt"):
    """Replace word2vec model product name mapping with SKU

    Args:
        model (gensim model object): word2vec model
        mapping (df): dataframe with column names 'Product Name' & 'SKU'
        modelpath (str): path to final word2vec model
        modeltemp (str): path to temp model in text format
    Out:
        (gensim KeyedVectors model object)
    Notes:
        Loading a KeyedVectors model has a different syntax,
        and it does not have the neural network weights etc.
        models.KeyedVectors.load("test.model")
        - <https://github.com/RaRe-Technologies/gensim/issues/1936>
        - <https://stackoverflow.com/questions/58393090/how-to-save-as-a-gensim-word2vec-file>
        - <https://stackoverflow.com/questions/40936197/rename-gensim-word2vec-words-with-mapping>
    """
    # get the list of product names & their embedding vectors
    vectors = model.wv.vectors
    vocab = model.wv.index_to_key
    # remove any stale temp model file
    if os.path.isfile(modeltemp):
        os.remove(modeltemp)
    # save the remapped model in word2vec text format
    with open(modeltemp, "a") as file:
        file.write("{} {}\n".format(vectors.shape[0], vectors.shape[1]))
        for vo, ve in zip(vocab, vectors):
            # convert the np array of embeddings into a space-delimited string
            ve = " ".join(str(i) for i in ve)
            # query for the SKU of the product name
            sku = mapping[mapping["Product Name"] == vo]["SKU"].tolist()
            # validate that the product name maps to exactly one SKU
            if len(sku) == 1:
                file.write("{} {}\n".format(sku[0], ve))
            elif len(sku) == 0:
                raise ValueError("There is no SKU for {}".format(vo))
            else:
                raise ValueError("There is more than 1 SKU for {}".format(vo))
    # load the text model file as KeyedVectors & save it in gensim's format
    newmodel = models.KeyedVectors.load_word2vec_format(modeltemp, binary=False)
    newmodel.save(modelpath)
    # delete the temp model file
    os.remove(modeltemp)
    return newmodel

loadedmodel = convert_word2vec_mapping(model, mapping, modelpath="word2vec-map/newmapping.model")
The downside is that the converted model does not include the neural network weights, so an original copy of the full model needs to be kept if retraining is required. The syntax also changes since it is just a KeyedVectors object now. It is, however, much smaller and faster.
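To illustrate why the full model must be kept, continued training only works on the original Word2Vec object; new_sentences below is a hypothetical list of new item baskets.

from gensim.models import Word2Vec

full = Word2Vec.load("word2vec.model")  # the full model keeps the training weights
full.build_vocab(new_sentences, update=True)  # add unseen items to the vocabulary
full.train(new_sentences, total_examples=full.corpus_count, epochs=full.epochs)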
from gensim import models

# load the converted KeyedVectors model & query similar items by SKU
newmapping = models.KeyedVectors.load("word2vec-map/newmapping.model")
newmapping.most_similar("76577", topn=5)
[('12345', 0.9807120561599731),
('12341', 0.9798750281333923),
('32142', 0.9789506196975708),
('54356', 0.97857666015625),
('53463', 0.9785518050193787)]
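In serving, a query SKU may not exist in the training vocabulary, and most_similar would raise a KeyError. A minimal guard, with a hypothetical recommend() wrapper:

def recommend(kv, sku, topn=5):
    # unseen SKUs are not in the vocabulary; fall back to e.g. popular items
    if sku not in kv.key_to_index:
        return []
    return [s for s, _ in kv.most_similar(sku, topn=topn)]

recommend(newmapping, "76577")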