Association Rule Learning
An informal definition of association rule learning is "customers who bought this also bought...". It is used to discover which items tend to occur together in transactions, and is commonly known as Market Basket Analysis.
Knowing these associations allows one to make recommendations to a customer. Two popular algorithms, Apriori & FP-Growth, are available in the mlxtend library.
Apriori
The Apriori algorithm relies on three important metrics: Support, Confidence and Lift. This Kaggle article explains them well.
Given the transactional data below:
| Transaction | Items |
|---|---|
| T1 | apple, egg, milk |
| T2 | carrot, milk |
| T3 | apple, egg, carrot |
| T4 | apple, egg |
| T5 | apple, carrot |
Support
A metric that measures the popularity of an itemset: the proportion of transactions in which the itemset appears. The support threshold is a key parameter in product-association algorithms, and support ranges from 0 to 1.
support{apple,egg} = 3/5 or 60%
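To make the arithmetic concrete, here is a minimal sketch (the helper function and variable names are just for illustration) that encodes the five toy transactions and computes support directly:

```python
# The five toy transactions from the table above
transactions = [
    {'apple', 'egg', 'milk'},    # T1
    {'carrot', 'milk'},          # T2
    {'apple', 'egg', 'carrot'},  # T3
    {'apple', 'egg'},            # T4
    {'apple', 'carrot'},         # T5
]

def support(itemset):
    """Proportion of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({'apple', 'egg'}))  # 0.6
```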
Confidence
A metric that measures how likely item B is to be purchased given that item A is purchased. This is expressed as confidence{A->B} = support{A,B} / support{A}, and ranges from 0 to 1.
confidence{apple->egg}
= support{apple,egg} / support{apple}
= (3/5) / (4/5)
= 0.75 or 75%
However, the score changes if we look at the association in the opposite direction:
confidence{egg->apple}
= support{apple,egg} / support{egg}
= (3/5) / (3/5)
= 1 or 100%
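Reusing the transactions and support helper from the sketch above, both directions can be verified:

```python
def confidence(a, b):
    # confidence{A->B} = support{A,B} / support{A}
    return support(a | b) / support(a)

print(confidence({'apple'}, {'egg'}))  # 0.75
print(confidence({'egg'}, {'apple'}))  # 1.0
```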
One of the drawbacks of confidence is that it can misrepresent the importance of an association: it accounts only for how popular item A is, not item B, so if B is bought very frequently anyway, confidence{A->B} will be high regardless of any real relationship.
Lift
A metric that measures how likely item B is to be purchased when item A is purchased, while controlling for how popular item B is. Unlike confidence, it accounts for the popularity of both items rather than just one.
Unlike confidence, whose value may vary with direction (e.g. confidence{A->B} may differ from confidence{B->A}), lift has no direction: lift{A,B} is always equal to lift{B,A}. Hence, the formula is lift{A,B} = lift{B,A} = support{A,B} / (support{A} * support{B}), and it ranges from 0 to infinity.
lift{apple,egg}
= lift{egg,apple}
= support{apple,egg} / (support{apple} * support{egg})
= (3/5) / (4/5 * 3/5)
= 1.25
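Continuing the same sketch, lift comes out the same in both directions:

```python
def lift(a, b):
    # lift{A,B} = support{A,B} / (support{A} * support{B})
    return support(a | b) / (support(a) * support(b))

print(lift({'apple'}, {'egg'}))  # 1.25
print(lift({'egg'}, {'apple'}))  # 1.25, since lift is symmetric
```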
The values of lift can be summarised as follows.

| lift | explanation | example |
|---|---|---|
| = 1 | no relationship between A and B | A and B occur together only by chance |
| > 1 | positive relationship between A and B | A and B occur together more often than chance |
| < 1 | negative relationship between A and B | A and B occur together less often than chance |
```python
from apyori import apriori
import pandas as pd

# Each row of the CSV is one transaction, with items spread across columns
df = pd.read_csv('../input/Market_Basket_Optimisation.csv', header=None)

# Transform the dataframe into a list of lists
# so that each transaction can be indexed more easily
transactions = []
for i in range(df.shape[0]):
    transactions.append([str(df.values[i, j]) for j in range(df.shape[1])])

# apyori returns a generator of RelationRecord objects
# (note: apyori supports max_length but has no min_length argument)
rules = apriori(transactions,
                min_support=0.003, min_confidence=0.2, min_lift=3)

results = list(rules)
pd.DataFrame(results).head(5)
```
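Each RelationRecord nests its rule statistics in ordered_statistics; as a small follow-up sketch (reusing the results list from the block above), the records can be flattened into one row per rule:

```python
# Flatten each RelationRecord into one row per rule
# (antecedent -> consequent) with its support, confidence and lift
rows = []
for record in results:
    for stat in record.ordered_statistics:
        rows.append({
            'antecedent': ', '.join(stat.items_base),
            'consequent': ', '.join(stat.items_add),
            'support': record.support,
            'confidence': stat.confidence,
            'lift': stat.lift,
        })

tidy = pd.DataFrame(rows).sort_values('lift', ascending=False)
tidy.head(5)
```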
To analyse the results, we can also plot the mined rules.
FP Growth
Frequent Pattern (FP) Growth is often preferred over Apriori because Apriori repeatedly scans the transaction dataset to mine frequent itemsets, which increases execution time. FP-Growth instead builds a compact tree structure (the FP-tree) and mines frequent itemsets and rules from that tree using a divide-and-conquer approach.
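As a minimal sketch of the mlxtend API mentioned at the start (the min_support value here is arbitrary), the toy transactions can be mined with fpgrowth and turned into rules:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ['apple', 'egg', 'milk'],    # T1
    ['carrot', 'milk'],          # T2
    ['apple', 'egg', 'carrot'],  # T3
    ['apple', 'egg'],            # T4
    ['apple', 'carrot'],         # T5
]

# One-hot encode the transactions into a boolean dataframe
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Mine frequent itemsets with FP-Growth, then derive association rules
frequent = fpgrowth(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric='lift', min_threshold=1.0)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```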
Eclat
Eclat is a simplified version of the Apriori model: only the Support value is used, which shows how frequently a set of items occurs.
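Eclat is commonly implemented on a vertical data layout, where each item maps to the set of transaction ids containing it (a tid-list), and the support of an itemset is the size of the intersection of its members' tid-lists. A minimal hand-rolled sketch (not a library implementation) on the toy data:

```python
from itertools import combinations

transactions = [
    {'apple', 'egg', 'milk'},    # T1
    {'carrot', 'milk'},          # T2
    {'apple', 'egg', 'carrot'},  # T3
    {'apple', 'egg'},            # T4
    {'apple', 'carrot'},         # T5
]

# Vertical layout: item -> set of transaction ids containing it
tidlists = {}
for tid, items in enumerate(transactions):
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

# Support of a pair = size of the intersection of the two tid-lists
min_support = 0.4
n = len(transactions)
for a, b in combinations(sorted(tidlists), 2):
    sup = len(tidlists[a] & tidlists[b]) / n
    if sup >= min_support:
        print(f'{{{a}, {b}}}: support = {sup:.1f}')
```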