Content-Based Recommenders
Match users directly to products and content by recommending items based on what they have bought or viewed in the past.
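As a minimal sketch of the idea, we can score items by how similar their text content is to an item the user last viewed. The product descriptions and scikit-learn usage below are illustrative assumptions, not a prescribed pipeline.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy product descriptions; item 0 is what the user last viewed
descriptions = [
    "wireless bluetooth headphones with noise cancellation",
    "wired over-ear studio headphones",
    "stainless steel kitchen knife set",
]
tfidf = TfidfVectorizer().fit_transform(descriptions)
# cosine similarity of every item to item 0
scores = cosine_similarity(tfidf[0], tfidf).flatten()
# indices of the most similar items, excluding item 0 itself
print(scores.argsort()[::-1][1:])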
Pros & Cons
Pros | Cons |
---|---|
No cold start - can recommend new items even if no users have rated them yet | No serendipity - hard to recommend items not similar to those the user has already seen |
Can explain why recommendations are made by referencing the deciding item features | Getting an adequate set of product descriptor features can be hard |
- | Getting user preferences against the product features can also be hard - users generally do not have the patience to specify all of their preferences |
Product Similarity
Product similarity recommendation is a content-based method which aims to recommend products by finding the most similar products to a query product based on their content. This content may be the product title, description, images, category/subcategory, specifications, etc.
We can use item2vec, a shallow, single-layer neural network, to do this. It is based on Word2Vec, where single words are replaced with item content, and each sentence of words is replaced with a basket of items. More on its architecture in these two articles 1, 2.
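Before training, the purchase or view history has to be grouped into these baskets. A quick sketch, assuming a hypothetical orders dataframe with user_id and product columns:

import pandas as pd

# toy purchase history; the column names are illustrative assumptions
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3],
    "product": ["computer", "mouse", "keyboard",
                "computer", "monitor", "mouse", "keyboard"],
})
# one "sentence" (basket of items) per user
sentences = orders.groupby("user_id")["product"].apply(list).tolist()
# [['computer', 'mouse', 'keyboard'], ['computer', 'monitor'], ['mouse', 'keyboard']]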

from gensim.models import Word2Vec

# train item2vec on the baskets; each basket is treated as a sentence
model = Word2Vec(sentences, min_count=1, vector_size=50, workers=3,
                 window=3, sg=1)
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")
# get the numpy vector of an item
vector = model.wv['computer']
# get the most similar items, ranked by cosine similarity score
sims = model.wv.most_similar('computer', topn=10)
The arguments are as follows.
arg | desc |
---|---|
vector_size | no. of dimensions of the embeddings. default 100 |
window | max distance between a target word and the words around it. default 5 |
min_count | minimum occurrence count of words to consider when training the model; words occurring fewer times than this are ignored. default 5 |
workers | no. of worker threads used to train the model. default 3 |
sg | training algorithm: CBOW (0) or skip-gram (1). default is CBOW |
Evaluation
We can evaluate the ranked recommendations with mAP (Mean Average Precision), MRR (Mean Reciprocal Rank), or NDCG (Normalized Discounted Cumulative Gain).
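As a rough sketch with made-up data, MRR averages the reciprocal rank of the first relevant item in each user's recommendation list:

def mean_reciprocal_rank(recommended, relevant):
    # reciprocal rank of the first relevant item per user, 0 if none is found
    rr = []
    for recs, rel in zip(recommended, relevant):
        score = 0.0
        for rank, item in enumerate(recs, start=1):
            if item in rel:
                score = 1.0 / rank
                break
        rr.append(score)
    return sum(rr) / len(rr)

recommended = [["a", "b", "c"], ["d", "e", "f"]]  # ranked lists per user
relevant = [{"b"}, {"f"}]  # items each user actually interacted with
mean_reciprocal_rank(recommended, relevant)  # (1/2 + 1/3) / 2 ≈ 0.42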
Convert Item Mapping in Model
While we used product names as the embedding keys in the model, a product ID (e.g. SKU) is usually the unique identifier used for querying. We can swap the keys within the model itself, which avoids the need for an additional mapping table that would increase latency.
import os
from gensim import models

def convert_word2vec_mapping(model, mapping, modelpath="newmapping.model", modeltemp="modeltemp.txt"):
    """Replace word2vec model product name mapping with SKU

    Args:
        model (gensim model object): word2vec model
        mapping (df): dataframe with column names 'Product Name' & 'SKU'
        modelpath (str): path to final word2vec model
        modeltemp (str): path to temp model in text format
    Out:
        (gensim KeyedVectors model object)
    Notes:
        Loading a KeyedVectors model has a different syntax,
        and it does not have the neural network weights etc.
        models.KeyedVectors.load("test.model")
        - <https://github.com/RaRe-Technologies/gensim/issues/1936>
        - <https://stackoverflow.com/questions/58393090/how-to-save-as-a-gensim-word2vec-file>
        - <https://stackoverflow.com/questions/40936197/rename-gensim-word2vec-words-with-mapping>
    """
    # get the list of product names & their embedding vectors
    vectors = model.wv.vectors
    vocab = model.wv.index_to_key
    # remove any stale temp model file
    if os.path.isfile(modeltemp):
        os.remove(modeltemp)
    # save the remapped model in word2vec text format
    with open(modeltemp, "a") as file:
        file.write("{} {}\n".format(vectors.shape[0], vectors.shape[1]))
        for vo, ve in zip(vocab, vectors):
            # convert the np array of embeddings into a space-delimited string
            ve = " ".join(str(i) for i in ve)
            # query for the SKU of the product name
            sku = mapping[mapping["Product Name"] == vo]["SKU"].tolist()
            # validate that the product name maps to exactly one SKU
            if len(sku) == 1:
                file.write("{} {}\n".format(sku[0], ve))
            elif len(sku) == 0:
                raise ValueError("There is no SKU for {}".format(vo))
            else:
                raise ValueError("There is more than 1 SKU for {}".format(vo))
    # load the text model file as KeyedVectors & save it in gensim's format
    newmodel = models.KeyedVectors.load_word2vec_format(modeltemp, binary=False)
    newmodel.save(modelpath)
    # delete the temp model file
    os.remove(modeltemp)
    return newmodel

loadedmodel = convert_word2vec_mapping(model, mapping, modelpath="word2vec-map/newmapping.model")
The downside is that the converted model does not include the neural network weights, so an original copy of the full model needs to be kept if retraining is required. The syntax also changes since it is just a KeyedVectors object now. It is, however, much smaller and faster.
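To illustrate why the full model must be kept, continued training only works on the original Word2Vec object; new_sentences below is a hypothetical list of new item baskets.

from gensim.models import Word2Vec

full = Word2Vec.load("word2vec.model")  # the full model keeps the training weights
full.build_vocab(new_sentences, update=True)  # add unseen items to the vocabulary
full.train(new_sentences, total_examples=full.corpus_count, epochs=full.epochs)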
from gensim import models

# load the converted KeyedVectors model & query similar items by SKU
newmapping = models.KeyedVectors.load("word2vec-map/newmapping.model")
newmapping.most_similar("76577", topn=5)
[('12345', 0.9807120561599731),
('12341', 0.9798750281333923),
('32142', 0.9789506196975708),
('54356', 0.97857666015625),
('53463', 0.9785518050193787)]
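In serving, a query SKU may not exist in the training vocabulary, and most_similar would raise a KeyError. A minimal guard, with a hypothetical recommend() wrapper:

def recommend(kv, sku, topn=5):
    # unseen SKUs are not in the vocabulary; fall back to e.g. popular items
    if sku not in kv.key_to_index:
        return []
    return [s for s, _ in kv.most_similar(sku, topn=topn)]

recommend(newmapping, "76577")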