What is Association Analysis?
Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an item based on the occurrences of other items in the transaction
Association Rule
An implication expression of the form X -> Y, where X and Y are item sets
Example: {Milk, Diaper} -> {Beer}
Here X is {Milk, Diaper} and Y is {Beer}
TID | Items |
1 | Chips, Milk |
2 | Chips, Diaper, Beer, Cornflakes |
3 | Milk, Diaper, Beer, Pepsi |
4 | Chips, Milk, Diaper, Beer |
5 | Chips, Milk, Diaper, Pepsi |
Association Rule Evaluation Metrics
Support (s) = Fraction of transactions that contain both X and Y i.e. how often Milk, Diaper and Beer occur together in the transactions. Milk, Diaper and Beer occur in 2 out of total 5 transactions, hence support =2/5=0.4
Confidence (c) = Measures how often each item in Y appears in transactions that contain X
C = Support (X + Y) / Support (X)
That is, how often Beer occurs in the transactions that contain Milk and Diaper. Milk and Diaper occur together in 3 transactions (TID = 3, 4 and 5), and Beer is present in 2 of those 3, hence confidence = (No. of transactions with Milk, Diaper and Beer) / (No. of transactions with Milk and Diaper) = 2/3 = 0.67
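The two definitions above can be checked directly against the transaction table. The following is a minimal sketch, not a library API: the helper names `support` and `confidence` are chosen here for illustration, and each transaction is assumed to be a plain Python set.

```python
# Transactions from the table above, one set of items per TID.
transactions = [
    {"Chips", "Milk"},
    {"Chips", "Diaper", "Beer", "Cornflakes"},
    {"Milk", "Diaper", "Beer", "Pepsi"},
    {"Chips", "Milk", "Diaper", "Beer"},
    {"Chips", "Milk", "Diaper", "Pepsi"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Support of X and Y together, divided by the support of X alone."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


x, y = {"Milk", "Diaper"}, {"Beer"}
print(support(x | y, transactions))              # 0.4
print(round(confidence(x, y, transactions), 2))  # 0.67
```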
Lift: The lift of the rule X=>Y is the confidence of the rule divided by the expected confidence, assuming that the item sets are independent.
Interpretation of Lift:
A lift value greater than 1 indicates that X and Y appear more often together than expected; this means that the occurrence of X has a positive effect on the occurrence of Y or that X is positively correlated with Y.
A lift smaller than 1 indicates that X and Y appear less often together than expected; this means that the occurrence of X has a negative effect on the occurrence of Y or that X is negatively correlated with Y.
A lift value near 1 indicates that X and Y appear almost as often together as expected; this means that the occurrence of X has almost no effect on the occurrence of Y or that X and Y have zero correlation. Lift can take any value from 0 to infinity.
For lift values greater than 1, the increase over the expected value is (Lift value - 1), and
% increase in those cases = (Lift value - 1) * 100
Coming back to our example:
Lift (X->Y) = Confidence (X->Y) / Support (Y)
= Support (X+Y) / (Support (X) * Support (Y))
= (2/3) / (3/5) = 0.667/0.60 = 1.11
Now, let us do a bit of math: ((0.667 - 0.60)/0.60) * 100 = 11.1, i.e. the probability of finding Beer in the transactions which have Milk and Diaper is 11.1% greater than the probability of finding Beer in the 5 transactions overall.
How? Let's work it through.
Probability = Favorable number of cases / Total sample space
Probability of finding Beer in the above 5 transactions = 3/5 = 0.60
Probability of finding Beer in the transactions which have Milk and Diaper:
Favorable cases = transactions with Beer, Milk and Diaper = 2
Sample space = transactions with Milk and Diaper = 3
Probability = 2/3 = 0.667
Now 0.667 is 11.1% more than 0.60, i.e. there is a lift, or increase, of 11.1% in the chance of finding Beer in the transactions which have Milk and Diaper.
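The lift arithmetic for {Milk, Diaper} -> {Beer} can be reproduced with exact fractions, which avoids rounding the confidence before dividing. This is an illustrative sketch only; the `support` helper is a name assumed here, and working with exact values gives a lift of 10/9, slightly below the figure obtained by rounding the confidence to 0.67 first.

```python
from fractions import Fraction

# Transactions from the table at the top of the article.
transactions = [
    {"Chips", "Milk"},
    {"Chips", "Diaper", "Beer", "Cornflakes"},
    {"Milk", "Diaper", "Beer", "Pepsi"},
    {"Chips", "Milk", "Diaper", "Beer"},
    {"Chips", "Milk", "Diaper", "Pepsi"},
]


def support(itemset):
    """Exact fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))


x, y = {"Milk", "Diaper"}, {"Beer"}
conf = support(x | y) / support(x)  # P(Beer | Milk and Diaper)
lift = conf / support(y)            # confidence divided by P(Beer)

print(conf, support(y), lift)                    # 2/3 3/5 10/9
print(f"{float(lift):.3f}")                      # 1.111
print(f"{float(lift - 1) * 100:.1f}% increase")  # 11.1% increase
```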
To Summarize:
Support: The support of the rule, that is, the relative frequency of transactions that contain X and Y.
Support(X->Y) = support(X+Y)
Confidence: The confidence of the rule. Confidence(X->Y) = support(X+Y)/ support(X)
Lift: The lift of the rule. Lift (X->Y) = Confidence (X->Y) / Support (Y)
= Support (X+Y) / (Support (X) * Support (Y))
Support of the rule X=>Y is symmetric, i.e. Support (X->Y) = Support (Y->X)
Lift of the rule X->Y is symmetric, i.e. Lift (X->Y) = Lift (Y->X)
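The symmetry of support and lift can be verified on the sample transactions. The sketch below uses the rule {Milk} -> {Beer}, picked here purely for illustration because its two item sets have different supports. It also prints the confidence in both directions to show that confidence, unlike support and lift, generally changes when the rule is reversed. The helper names are assumptions made for this example.

```python
from fractions import Fraction

# Transactions from the table at the top of the article.
transactions = [
    {"Chips", "Milk"},
    {"Chips", "Diaper", "Beer", "Cornflakes"},
    {"Milk", "Diaper", "Beer", "Pepsi"},
    {"Chips", "Milk", "Diaper", "Beer"},
    {"Chips", "Milk", "Diaper", "Pepsi"},
]


def support(itemset):
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(x, y):
    return support(x | y) / support(x)


def lift(x, y):
    return support(x | y) / (support(x) * support(y))


x, y = {"Milk"}, {"Beer"}
print(support(x | y))                      # 2/5, same set in either direction
print(lift(x, y), lift(y, x))              # 5/6 5/6
print(confidence(x, y), confidence(y, x))  # 1/2 2/3
```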
Drawback of Confidence:
Confidence can sometimes be misleading, as shown in the example below
Savings Account | Credit Card: No | Credit Card: Yes | Total |
No | 50 | 350 | 400 |
Yes | 100 | 500 | 600 |
Total | 150 | 850 | 1000 |
Rule: S=>C (People with Savings Account are likely to have a credit card)
The interpretation of implication (=>) in association rules can sometimes be misleading
For the table above: Support (S=>C) = 500/1000 = 50%
Confidence (S=>C) = 500/600 = 83%
Expected Confidence (S=>C) = (350+500)/1000 = 85%
Lift (S=>C) = 0.83/0.85 = 0.98 < 1
Based on the Support and Confidence, it might be considered a strong rule. However, people without a savings account are even more likely to have a credit card (=350/400=87.5%).
Savings Account and Credit Card are in fact found to have a negative correlation. Thus, high confidence and support do not imply cause and effect; the two products at times might not even be correlated.
One has to exercise caution in making any recommendations in such cases and look closely at the lift values.
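The savings account and credit card figures can be recomputed from the four counts in the table. This is a small sketch with variable names chosen here for illustration. It shows that the rule passes on support and confidence, yet its lift falls just below 1.

```python
# Counts taken from the savings account / credit card table above.
n_total = 1000            # all customers
n_savings = 600           # customers with a savings account
n_credit = 350 + 500      # customers with a credit card
n_both = 500              # customers with both products
n_credit_no_savings = 350
n_no_savings = 400

support_sc = n_both / n_total             # 0.50
confidence_sc = n_both / n_savings        # about 0.833
expected_confidence = n_credit / n_total  # 0.85, the baseline rate of C
lift_sc = confidence_sc / expected_confidence

# Credit card rate among customers WITHOUT a savings account.
confidence_not_s = n_credit_no_savings / n_no_savings  # 0.875

print(f"support={support_sc:.2f} confidence={confidence_sc:.3f}")
print(f"expected={expected_confidence:.2f} lift={lift_sc:.3f}")  # lift=0.980
print(f"credit card rate without savings account={confidence_not_s:.3f}")
```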
Possible Recommendations for X=>Y Rule (Where X and Y are 2 separate Products and have high support, high confidence and high positive lift > 1)
- Put X and Y Closer in the Store
- Package X with Y
- Package X and Y with a poorly selling item
- Give Discount on only one of X and Y
- Increase the Price of X and lower the price of Y (or vice versa)
- Advertise only one of X and Y i.e. do not advertise X and Y together
- Example: If X was a toy and Y a form of sweet, then offering sweets in the form of toy X could also be a good option.
Example: Interpretation of Rules for a sample product transaction set:
The thresholds used were 1.5 % support and 20% confidence.
Product1 | ==> | Product2 | Support (%) | Confidence (%) | Lift |
P | ==> | Q | 2.18 | 26.33 | 1.49 |
R | ==> | Q | 1.50 | 23.82 | 1.35 |
S | ==> | Q | 2.42 | 23.45 | 1.33 |
T | ==> | U | 1.79 | 21.06 | 1.23 |
Interpretation of the first Rule:
Products P and Q together appear in 2.18 % of the transactions as indicated by Support.
If there are 100 transactions that contain Product P, then 26 of those also have Q as indicated by the Confidence.
Q is 49% more likely to occur when P is present, as indicated by the Lift of 1.49.
Or: the probability of finding Q in the transactions which have Product P is 49% more than the probability of finding Product Q in all the transactions.
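Because confidence = support(X+Y) / support(X) and lift = confidence / support(Y), the individual supports of each product can be backed out of a rule row. The sketch below does this for the rules in the table. The function name is hypothetical, and the results are only approximate because the published figures are already rounded.

```python
# Rows of the rule table: (X, Y, support %, confidence %, lift).
rules = [
    ("P", "Q", 2.18, 26.33, 1.49),
    ("R", "Q", 1.50, 23.82, 1.35),
    ("S", "Q", 2.42, 23.45, 1.33),
    ("T", "U", 1.79, 21.06, 1.23),
]


def implied_supports(support_pct, confidence_pct, lift):
    """Recover support(X) and support(Y), in percent, from one rule row."""
    support_x = support_pct / confidence_pct * 100  # support(X+Y) / confidence
    support_y = confidence_pct / lift               # confidence / lift
    return support_x, support_y


for x, y, s, c, l in rules:
    sx, sy = implied_supports(s, c, l)
    print(f"{x} ==> {y}: support({x}) ~ {sx:.1f}%, support({y}) ~ {sy:.1f}%, "
          f"{(l - 1) * 100:.0f}% more likely than baseline")
```

The three rules that end in Q should all imply roughly the same baseline support for Q, which is a quick consistency check on a rule table.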
Mathematics behind the Rule (Ex B->C):
Lift = Support (B + C) / (Support (B) * Support (C)) = Confidence (B->C) / Support (C)
The way lift is calculated is shown below.
Say, for example, the total number of transactions is 100.
C is present in 25 => Probability of finding C in the transactions = 25/100 = 1/4 = 0.25
B is present in 50, and C is present with B in 25 of them. So the probability of finding C in the transactions with B = (B and C together) / (B alone) = 25/50 = 0.50, i.e. a confidence of 50%.
Lift = 0.50/0.25 = 2, which means the probability of finding C in the transactions with B is double the probability of finding C in all the transactions.
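The same calculation can be written as a small function that works from raw transaction counts. This is a sketch with an assumed function name and assumed parameter names, applied to the B->C counts used above.

```python
def rule_metrics(n_total, n_x, n_y, n_xy):
    """Support, confidence and lift of X -> Y from transaction counts.

    n_total: all transactions; n_x / n_y: transactions containing X / Y;
    n_xy: transactions containing both X and Y.
    """
    support = n_xy / n_total
    confidence = n_xy / n_x     # P(Y | X)
    baseline = n_y / n_total    # P(Y) over all transactions
    lift = confidence / baseline
    return support, confidence, lift


# B -> C example: 100 transactions, B in 50, C in 25, both in 25.
support, confidence, lift = rule_metrics(n_total=100, n_x=50, n_y=25, n_xy=25)
print(support, confidence, lift)  # 0.25 0.5 2.0
```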
Example: Interpretation of Rules for a sample Product by region transaction set
Summary of association rules: Min: support = 2.0%, confidence = 20.0%
Max. Size of an Item Set = 10
Support: Fraction of transactions that contain both X and Y. The threshold has been kept at 2%, i.e. at least 2% of the transactions must contain both X and Y.
Confidence (c): Measures how often each item in Y appears in transactions that contain X. The threshold has been kept at 20%.
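To show how the three settings (minimum support, minimum confidence, maximum item set size) act as filters, here is a brute-force sketch run on the small five-transaction example from the start of the article. It is not how the rules in this section were produced: real tools typically use algorithms such as Apriori or FP-growth, and the function name `mine_rules` and the thresholds used below (40% support, 60% confidence, item sets of at most 3 items) are assumptions chosen only to suit the toy data.

```python
from itertools import combinations

# Toy transactions from the first example in the article.
transactions = [
    {"Chips", "Milk"},
    {"Chips", "Diaper", "Beer", "Cornflakes"},
    {"Milk", "Diaper", "Beer", "Pepsi"},
    {"Chips", "Milk", "Diaper", "Beer"},
    {"Chips", "Milk", "Diaper", "Pepsi"},
]


def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def mine_rules(min_support, min_confidence, max_size):
    """Enumerate every rule X -> Y that passes both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s_xy = support(itemset)
            if s_xy < min_support:
                continue  # item set too rare, skip all of its rules
            # Try every non-empty proper subset as the left-hand side X.
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    x = frozenset(lhs)
                    y = itemset - x
                    conf = s_xy / support(x)
                    if conf >= min_confidence:
                        rule_lift = conf / support(y)
                        rules.append((sorted(x), sorted(y), s_xy, conf, rule_lift))
    return rules


rules = mine_rules(min_support=0.4, min_confidence=0.6, max_size=3)
for x, y, s, c, l in sorted(rules, key=lambda r: -r[4]):
    print(f"{x} ==> {y}  support={s:.0%}  confidence={c:.0%}  lift={l:.2f}")
```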
Item Set 1 ( X ) | ==> | Item Set 2 ( Y ) | Support (%) | Confidence (%) | Lift |
A1 | ==> | P1 | 3.61 | 88.91 | 19.41 |
A2 | ==> | P2 | 1.99 | 65.89 | 15.11 |
Consider the top rule:
Let X= A1 (Region)
Let Y= P1 (Product)
Why are the Support values the same for a rule and its reverse? It follows directly from the formula:
Support = Transactions that contain both X and Y / Total transactions
X=>Y and Y=>X involve the same items and only the direction differs, so both rules have the same support. Here it is 3.61%, i.e. 3.61% of the transactions contain both X and Y.
Interpretation of the Confidence Value:
X=>Y, confidence = (X Union Y)/X i.e. Support (X+Y)/Support (X)
88.91% of the times, Product P1 occurs in all those transactions which contain A1 as the region.
Say for example there are 100 transactions which contain region- A1, among them 89 transactions contain the Product P1
Interpretation of the Lift value:
Lift (X->Y) = Confidence (X->Y) / Support (Y) = Support (X+Y) / (Support (X) * Support (Y))
For this rule, the lift is 19.41: the probability of finding Product P1 in the transactions where the region is A1 is 19.41 times the probability of finding Product P1 in all the transactions.
Or
In terms of the increase (Lift value - 1), the probability of finding Product P1 goes up by 18.41 times its baseline value when the region is A1.
Mathematics behind the Rule (Ex T->S):
Say for Example if total transactions are 100
S is present in 20=> Probability of finding S in transactions=20/100=0.20
T is present in 50, but S is present with T in 20 of them. So Probability of finding S in all the transactions with T is = T + S together/ T alone = 20/50=0.40
It implies that Probability of finding Product S in all the transactions with region T is double the probability of finding S alone in all the transactions
Market basket (MB) analysis is often used to support co-marketing decisions, but on its own it does not answer the key question: if X->Y holds, are sales higher when X and Y are marketed together than when X and Y are marketed separately? This is called marketing synergy.