- A frequent itemset is an itemset whose support is greater than some user-specified minimum support (denoted Lk, where k is the size of the itemset)
- A candidate itemset is a potentially frequent itemset (denoted Ck, where k is the size of the itemset)
Apriori Algorithm
A Java applet which combines DIC, Apriori and Probability Based Objected Interestingness Measures can be found here.
Apriori Algorithm: (by Agrawal et al at IBM Almaden Research Centre) can be used to generate all frequent itemset
Pass 1
- Generate the candidate itemsets in C1
- Save the frequent itemsets in L1
Pass k
- Generate the candidate itemsets in Ck from the frequent
itemsets in Lk-1
- Join Lk-1 p with Lk-1q, as follows:
insert into Ck
select p.item1, p.item2, . . . , p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1q
where p.item1 = q.item1, . . . p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Generate all (k-1)-subsets from the candidate itemsets in Ck
- Prune all candidate itemsets from Ck where some (k-1)-subset of the candidate itemset is not in the frequent itemset Lk-1
- Scan the transaction database to determine the support for each candidate itemset in Ck
- Save the frequent itemsets in Lk
Implementation: A working Apriori Itemset Generation program can be found on the Itemset Implementation page.
Example 1: Assume the user-specified minimum support is 50%
- Given: The transaction database shown below
| TID |
A |
B |
C |
D |
E |
F |
| T1 |
1 |
0 |
1 |
1 |
0 |
0 |
| T2 |
0 |
1 |
0 |
1 |
0 |
0 |
| T3 |
1 |
1 |
1 |
0 |
1 |
0 |
| T4 |
0 |
1 |
0 |
1 |
0 |
1 |
- The candidate itemsets in C2 are shown below
| Itemset X |
supp(X) |
| {A,B} |
25% |
| {A,C} |
50% |
| {A,D} |
25% |
| {B,C} |
25% |
| {B,D} |
50% |
| {C,D} |
25% |
- The frequent itemsets in L2 are shown below
| Itemset X |
supp(X) |
| {A,C} |
50% |
| {B,D} |
50% |
Example 2: Assume the user-specified minimum support is 40%, then generate all frequent itemsets.
Given: The transaction database shown below
| TID |
A |
B |
C |
D |
E |
| T1 |
1 |
1 |
1 |
0 |
0 |
| T2 |
1 |
1 |
1 |
1 |
1 |
| T3 |
1 |
0 |
1 |
1 |
0 |
| T4 |
1 |
0 |
1 |
1 |
1 |
| T5 |
1 |
1 |
1 |
1 |
0 |
Pass 1
C1
| Itemset X |
supp(X) |
| A |
? |
| B |
? |
| C |
? |
| D |
? |
| E |
? |
|
L1
| Itemset X |
supp(X) |
| A |
100% |
| B |
60% |
| C |
100% |
| D |
80% |
| E |
40% |
|
Pass 2
C2
| Itemset X |
supp(X) |
| A,B |
? |
| A,C |
? |
| A,D |
? |
| A,E |
? |
| B,C |
? |
| B,D |
? |
| B,E |
? |
| C,D |
? |
| C,E |
? |
| D,E |
? |
- Nothing is pruned since all subsets of these itemsets are frequent
L2
| Itemset X |
supp(X) |
| A,B |
60% |
| A,C |
100% |
| A,D |
80% |
| A,E |
40% |
| B,C |
60% |
| B,D |
40% |
| B,E |
20% |
| C,D |
80% |
| C,E |
40% |
| D,E |
40% |
|
L2 after saving only the frequent itemsets
| Itemset X |
supp(X) |
| A,B |
60% |
| A,C |
100% |
| A,D |
80% |
| A,E |
40% |
| B,C |
60% |
| B,D |
40% |
| C,D |
80% |
| C,E |
40% |
| D,E |
40% |
|
Pass 3
- To create C3 only look at items that have the same first item (in pass k, the first k - 2 items must match)
C3
|
Itemset X |
supp(X) |
| join AB with AC |
A,B,C |
? |
| join AB with AD |
A,B,D |
? |
| join AB with AE |
A,B,E |
? |
| join AC with AD |
A,C,D |
? |
| join AC with AE |
A,C,E |
? |
| join AD with AE |
A,D,E |
? |
| join BC with BD |
B,C,D |
? |
| join CD with CE |
C,D,E |
? |
|
C3 after pruning
| Itemset X |
supp(X) |
| A,B,C |
? |
| A,B,D |
? |
| A,C,D |
? |
| A,C,E |
? |
| A,D,E |
? |
| B,C,D |
? |
| C,D,E |
? |
|
- Pruning eliminates ABE since BE is not frequent
- Scan transactions in the database
L3
| Itemset X |
supp(X) |
| A,B,C |
60% |
| A,B,D |
40% |
| A,C,D |
80% |
| A,C,E |
40% |
| A,D,E |
40% |
| B,C,D |
40% |
| C,D,E |
40% |
Pass 4
- First k - 2 = 2 items must match in pass k = 4
C4
|
Itemset X |
supp(X) |
| combine ABC with ABD |
A,B,C,D |
? |
| combine ACD with ACE |
A,C,D,E |
? |
- Pruning:
- For ABCD we check whether ABC, ABD, ACD, BCD are frequent. They are in all cases, so we do not prune ABCD.
- For ACDE we check whether ACD, ACE, ADE, CDE are frequent. Yes, in all cases, so we do not prune ACDE
L4
| Itemset X |
supp(X) |
| A,B,C,D |
40% |
| A,C,D,E |
40% |
Pass 5: For pass 5 we can't form any candidates because there aren't two frequent 4-itemsets beginning with the same 3 items.