Association Rule Basics

Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Lecture outline

What is association rule mining?

Frequent itemsets, support, and confidence

Mining association rules

The “Apriori” algorithm

Rule generation

Example of market-basket transactions

TID  Items
1    Bread, Peanuts, Milk, Fruit, Jam
2    Bread, Jam, Soda, Chips, Milk, Fruit
3    Steak, Jam, Soda, Chips, Bread
4    Jam, Soda, Peanuts, Milk, Fruit
5    Jam, Soda, Chips, Milk, Bread
6    Fruit, Soda, Chips, Milk
7    Fruit, Soda, Peanuts, Milk
8    Fruit, Peanuts, Cheese, Yogurt

What is association mining?

Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories

Applications

Basket data analysis

Cross-marketing

Catalog design

…

What is association rule mining?

Example rules (from the transaction table above): {bread} → {milk}, {soda} → {chips}, {bread} → {jam}

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Definition: Frequent Itemset

Itemset
A collection of one or more items, e.g., {Milk, Bread, Jam}
A k-itemset is an itemset that contains k items

Support count (σ)
Frequency of occurrence of an itemset in the transaction table above
σ({Milk, Bread}) = 3
σ({Soda, Chips}) = 4

Support (s)
Fraction of transactions that contain an itemset
s({Milk, Bread}) = 3/8
s({Soda, Chips}) = 4/8

Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
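A minimal Python sketch of these definitions (illustrative helper names, not part of the original slides), using the transaction table above:

transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support_count(itemset, transactions):
    # sigma(X): number of transactions that contain every item of X
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # s(X): fraction of transactions that contain X
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread"}, transactions))  # 3
print(support({"Soda", "Chips"}, transactions))        # 0.5 = 4/8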

What is an association rule?

An association rule is an implication of the form X → Y, where X and Y are itemsets
Example: {bread} → {milk}

Rule evaluation metrics: support and confidence

Support (s)
Fraction of transactions that contain both X and Y

Confidence (c)
Measures how often items in Y appear in transactions that contain X
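In symbols (standard definitions, with |T| the total number of transactions and σ the support count):

s(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{|T|} \qquad c(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}

For example, for {Milk} → {Bread} on the transaction table above, σ({Milk, Bread}) = 3 and σ({Milk}) = 6, so s = 3/8 and c = 3/6 = 0.5.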

Support and Confidence

Figure: Venn-style illustration of the customers who buy Milk, the customers who buy Bread, and the customers who buy both.

What is the goal?

Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold

Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds

The brute-force approach is computationally prohibitive!
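A minimal Python sketch of this brute-force enumeration (illustrative names; it assumes the transactions list from the earlier sketch and is practical only for tiny item sets):

from itertools import combinations

def brute_force_rules(transactions, minsup, minconf):
    # List every rule X -> Y, compute its support and confidence,
    # and keep only the rules that pass both thresholds
    n = len(transactions)
    support = lambda X: sum(1 for t in transactions if X <= t) / n
    items = sorted(set().union(*transactions))
    kept = []
    for k in range(2, len(items) + 1):                    # every itemset with at least 2 items
        for itemset in map(frozenset, combinations(items, k)):
            for r in range(1, k):                         # every binary split into X -> Y
                for Y in map(frozenset, combinations(sorted(itemset), r)):
                    X = itemset - Y
                    s, sx = support(itemset), support(X)
                    c = s / sx if sx else 0.0
                    if s >= minsup and c >= minconf:
                        kept.append((X, Y, s, c))
    return kept

# e.g. brute_force_rules(transactions, minsup=0.5, minconf=0.7)

With the ten distinct items of the example this already enumerates 3^10 − 2^11 + 1 = 57,002 rules, which is why the approach does not scale.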

Mining Association Rules

{Bread, Jam} → {Milk}    s=0.4  c=0.75
{Milk, Jam} → {Bread}    s=0.4  c=0.75
{Bread} → {Milk, Jam}    s=0.4  c=0.75
{Jam} → {Bread, Milk}    s=0.4  c=0.6
{Milk} → {Bread, Jam}    s=0.4  c=0.5

All the above rules are binary partitions of the same itemset: {Milk, Bread, Jam}
Rules originating from the same itemset have identical support but can have different confidence
We can decouple the support and confidence requirements!

Mining Association Rules: Two-Step Approach

Frequent Itemset Generation
Generate all itemsets whose support ≥ minsup

Rule Generation
Generate high-confidence rules from each frequent itemset
Each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is computationally expensive

Frequent Itemset Generation

Figure: the lattice of all itemsets over the five items A, B, C, D, E, from the null itemset down to ABCDE.

Given d items, there are 2^d possible candidate itemsets

Frequent Itemset Generation

Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database, matching each transaction against every candidate
With N transactions, M candidates, and maximum transaction width w, the complexity is ~O(NMw), which is expensive since M = 2^d

Computational Complexity

Given d unique items:

Total number of itemsets = 2^d

Total number of possible association rules:

R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1

For d=6, there are 602 rules
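As a quick arithmetic check of the closed form for d = 6:

R = 3^{6} - 2^{7} + 1 = 729 - 128 + 1 = 602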

Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
Complete search: M = 2^d
Use pruning techniques to reduce M

Reduce the number of transactions (N)
Reduce the size of N as the size of the itemset increases

Reduce the number of comparisons (NM)
Use efficient data structures to store the candidates or transactions
No need to match every candidate against every transaction

Reducing the Number of Candidates

Apriori principle
If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support
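In symbols, for any two itemsets A and B:

\forall A, B : (A \subseteq B) \Rightarrow s(A) \geq s(B)

Equivalently, once an itemset is found to be infrequent, every superset of it can be discarded without counting its support.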

Illustrating Apriori Principle

Figure: the itemset lattice over the items A, B, C, D, E. The itemset AB is found to be infrequent, so all of its supersets (the pruned part of the lattice) can be discarded.

How does the Apriori principle work?

Minimum Support = 4

Items (1-itemsets)
Item      Count
Bread     4
Peanuts   4
Milk      6
Fruit     6
Jam       5
Soda      6
Chips     4
Steak     1
Cheese    1
Yogurt    1

2-itemsets
2-Itemset        Count
Bread, Jam       4
Peanuts, Fruit   4
Milk, Fruit      5
Milk, Jam        4
Milk, Soda       5
Fruit, Soda      4
Jam, Soda        4
Soda, Chips      4

3-itemsets
3-Itemset           Count
Milk, Fruit, Soda   4

Apriori Algorithm

Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified:
Generate length (k+1) candidate itemsets from length-k frequent itemsets
Prune candidate itemsets containing subsets of length k that are infrequent
Count the support of each candidate by scanning the DB
Eliminate candidates that are infrequent, leaving only those that are frequent

The Apriori Algorithm

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

The Apriori Algorithm

Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
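A compact Python sketch of the pseudocode above (illustrative, not an optimized implementation; the join step here simply merges any two frequent k-itemsets whose union has k+1 items, rather than using the prefix-based join):

from itertools import combinations

def apriori(transactions, minsup):
    # Level-wise generation of all frequent itemsets
    n = len(transactions)
    support = lambda X: sum(1 for t in transactions if X <= t) / n
    items = set().union(*transactions)
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}   # L1
    frequent, k = set(Lk), 1
    while Lk:
        # Join step: build (k+1)-candidates from pairs of frequent k-itemsets
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: a candidate with an infrequent k-subset cannot be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count the support of the surviving candidates by scanning the database
        Lk = {c for c in candidates if support(c) >= minsup}
        frequent |= Lk
        k += 1
    return frequent

Run on the eight example transactions with minsup = 4/8 = 0.5, this reproduces the frequent itemsets of the worked example above, including {Milk, Fruit, Soda}.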

Efficient Implementation of Apriori in SQL

Hard to get good performance out of pure SQL (SQL-92) based approaches alone
Make use of object-relational extensions like UDFs, BLOBs, table functions, etc.
Get orders of magnitude improvement

S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD '98

How to Improve Apriori's Efficiency

Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent
Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Sampling: mine on a subset of the given data with a lower support threshold, plus a method to determine the completeness
Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement

If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
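A small Python sketch of this enumeration (illustrative function name):

from itertools import combinations

def candidate_rules(L):
    # Yield every rule f -> L - f for the non-empty proper subsets f of L
    L = frozenset(L)
    for r in range(1, len(L)):
        for f in map(frozenset, combinations(sorted(L), r)):
            yield f, L - f

print(len(list(candidate_rules({"A", "B", "C", "D"}))))   # 14 = 2^4 - 2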

How to efficiently generate rules from frequent itemsets?

Confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)

But the confidence of rules generated from the same itemset has an anti-monotone property
For example, for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Confidence is anti-monotone with respect to the number of items on the right-hand side of the rule

Rule Generation for Apriori Algorithm

Figure: the lattice of all rules generated from the frequent itemset {A,B,C,D}, from ABCD → ∅ at the top down to A → BCD, B → ACD, C → ABD, D → ABC at the bottom. Once a rule is found to have low confidence, all rules below it in the lattice (those whose consequent is a superset of its consequent) are pruned.

Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
For example, join(CD → AB, BD → AC) produces the candidate rule D → ABC
Prune the rule D → ABC if its subset AD → BC does not have high confidence
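A minimal sketch of this merge step, assuming each rule is an (antecedent, consequent) pair of frozensets over the same frequent itemset (illustrative names, not the textbook rule-generation code):

def join_rules(rule1, rule2):
    # Merge two rules whose consequents differ in exactly one item,
    # e.g. join(CD -> AB, BD -> AC) gives D -> ABC
    (ant1, con1), (ant2, con2) = rule1, rule2
    consequent = con1 | con2
    antecedent = (ant1 | con1) - consequent
    if len(consequent) != len(con1) + 1 or not antecedent:
        return None                      # the two rules cannot be joined
    return antecedent, consequent

cd_ab = (frozenset("CD"), frozenset("AB"))
bd_ac = (frozenset("BD"), frozenset("AC"))
print(join_rules(cd_ab, bd_ac))          # (frozenset({'D'}), frozenset({'A', 'B', 'C'}))

The candidate D → ABC is then kept only if its confidence meets minconf.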

Effect of Support Distribution

Many real data sets have a skewed support distribution

Figure: support distribution of a retail data set

How to set the appropriate minsup?

If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
If minsup is set too low, it is computationally expensive and the number of itemsets is very large

A single minimum support threshold may not be effective
