- A frequent itemset is an itemset whose support is greater than some user-specified minimum support (denoted Lk, where k is the size of the itemset)
- A candidate itemset is a potentially frequent itemset (denoted Ck, where k is the size of the itemset)
Apriori Algorithm
A Java applet which combines DIC, Apriori and Probability Based Objected Interestingness Measures can be found here.
Apriori Algorithm: (by Agrawal et al at IBM Almaden Research Centre) can be used to generate all frequent itemset
Pass 1
- Generate the candidate itemsets in C1
- Save the frequent itemsets in L1
Pass k
- Generate the candidate itemsets in Ck from the frequent
itemsets in Lk-1
- Join Lk-1 p with Lk-1q, as follows:
insert into Ck
select p.item1, p.item2, . . . , p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1q
where p.item1 = q.item1, . . . p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Generate all (k-1)-subsets from the candidate itemsets in Ck
- Prune all candidate itemsets from Ck where some (k-1)-subset of the candidate itemset is not in the frequent itemset Lk-1
- Scan the transaction database to determine the support for each candidate itemset in Ck
- Save the frequent itemsets in Lk
Implementation: A working Apriori Itemset Generation program can be found on the Itemset Implementation page.
Example 1: Assume the user-specified minimum support is 50%
- Given: The transaction database shown below
TID |
A |
B |
C |
D |
E |
F |
T1 |
1 |
0 |
1 |
1 |
0 |
0 |
T2 |
0 |
1 |
0 |
1 |
0 |
0 |
T3 |
1 |
1 |
1 |
0 |
1 |
0 |
T4 |
0 |
1 |
0 |
1 |
0 |
1 |
- The candidate itemsets in C2 are shown below
Itemset X |
supp(X) |
{A,B} |
25% |
{A,C} |
50% |
{A,D} |
25% |
{B,C} |
25% |
{B,D} |
50% |
{C,D} |
25% |
- The frequent itemsets in L2 are shown below
Itemset X |
supp(X) |
{A,C} |
50% |
{B,D} |
50% |
Example 2: Assume the user-specified minimum support is 40%, then generate all frequent itemsets.
Given: The transaction database shown below
TID |
A |
B |
C |
D |
E |
T1 |
1 |
1 |
1 |
0 |
0 |
T2 |
1 |
1 |
1 |
1 |
1 |
T3 |
1 |
0 |
1 |
1 |
0 |
T4 |
1 |
0 |
1 |
1 |
1 |
T5 |
1 |
1 |
1 |
1 |
0 |
Pass 1
C1
Itemset X |
supp(X) |
A |
? |
B |
? |
C |
? |
D |
? |
E |
? |
|
L1
Itemset X |
supp(X) |
A |
100% |
B |
60% |
C |
100% |
D |
80% |
E |
40% |
|
Pass 2
C2
Itemset X |
supp(X) |
A,B |
? |
A,C |
? |
A,D |
? |
A,E |
? |
B,C |
? |
B,D |
? |
B,E |
? |
C,D |
? |
C,E |
? |
D,E |
? |
- Nothing is pruned since all subsets of these itemsets are frequent
L2
Itemset X |
supp(X) |
A,B |
60% |
A,C |
100% |
A,D |
80% |
A,E |
40% |
B,C |
60% |
B,D |
40% |
B,E |
20% |
C,D |
80% |
C,E |
40% |
D,E |
40% |
|
L2 after saving only the frequent itemsets
Itemset X |
supp(X) |
A,B |
60% |
A,C |
100% |
A,D |
80% |
A,E |
40% |
B,C |
60% |
B,D |
40% |
C,D |
80% |
C,E |
40% |
D,E |
40% |
|
Pass 3
- To create C3 only look at items that have the same first item (in pass k, the first k - 2 items must match)
C3
|
Itemset X |
supp(X) |
join AB with AC |
A,B,C |
? |
join AB with AD |
A,B,D |
? |
join AB with AE |
A,B,E |
? |
join AC with AD |
A,C,D |
? |
join AC with AE |
A,C,E |
? |
join AD with AE |
A,D,E |
? |
join BC with BD |
B,C,D |
? |
join CD with CE |
C,D,E |
? |
|
C3 after pruning
Itemset X |
supp(X) |
A,B,C |
? |
A,B,D |
? |
A,C,D |
? |
A,C,E |
? |
A,D,E |
? |
B,C,D |
? |
C,D,E |
? |
|
- Pruning eliminates ABE since BE is not frequent
- Scan transactions in the database
L3
Itemset X |
supp(X) |
A,B,C |
60% |
A,B,D |
40% |
A,C,D |
80% |
A,C,E |
40% |
A,D,E |
40% |
B,C,D |
40% |
C,D,E |
40% |
Pass 4
- First k - 2 = 2 items must match in pass k = 4
C4
|
Itemset X |
supp(X) |
combine ABC with ABD |
A,B,C,D |
? |
combine ACD with ACE |
A,C,D,E |
? |
- Pruning:
- For ABCD we check whether ABC, ABD, ACD, BCD are frequent. They are in all cases, so we do not prune ABCD.
- For ACDE we check whether ACD, ACE, ADE, CDE are frequent. Yes, in all cases, so we do not prune ACDE
L4
Itemset X |
supp(X) |
A,B,C,D |
40% |
A,C,D,E |
40% |
Pass 5: For pass 5 we can't form any candidates because there aren't two frequent 4-itemsets beginning with the same 3 items.