We now summarize itemset methodology formally as follows (Agrawal et al., 1996). Let I = {I1, I2, ..., Im} be a set of literals, called items. Let D = {T1, T2, ..., Tn} be a set of n transactions, where for each transaction T ∈ D, T ⊆ I. A set of items X ⊆ I is called an itemset. A transaction T contains an itemset X if X ⊆ T. Each itemset X is associated with a set of transactions TX = {T ∈ D | T ⊇ X}, the set of transactions that contain X. The support supp(X) of itemset X equals |TX|/|D|.
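As a minimal sketch, the definition of support can be written directly in Python. The function name `support` and the toy database below are illustrative assumptions, not part of the original formulation; the database is assumed to be non-empty.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Return supp(X) = |TX| / |D| for itemset X over a non-empty database D."""
    x = frozenset(itemset)
    # TX: the transactions T in D that contain every item of X (T is a superset of X).
    containing = [t for t in transactions if t >= x]
    return len(containing) / len(transactions)


# Hypothetical toy database of four transactions (invented for illustration).
toy_db = [
    frozenset({"bread", "milk"}),
    frozenset({"bread"}),
    frozenset({"milk", "eggs"}),
    frozenset({"bread", "milk", "eggs"}),
]

print(support({"bread"}, toy_db))          # bread is in 3 of 4 transactions
print(support({"bread", "milk"}, toy_db))  # both items are in 2 of 4 transactions
```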
We illustrate support using the small sample transaction database shown in Table 1. The TID column gives the transaction identifiers. The values beneath each item name give the quantity of that item sold in the transaction. We recognize, of course, that support is defined over the binary domain {0,1}, with a value of 1 indicating the presence of an item in a transaction and a value of 0 indicating its absence. We use the explicit quantities to illustrate limitations of the support measure. In our calculation of support, any non-zero quantity in the table is treated as a 1.
TID | Item A | Item B | Item C | Item D |
T1 | 1 | 0 | 1 | 14 |
T2 | 0 | 0 | 6 | 0 |
T3 | 1 | 0 | 2 | 4 |
T4 | 0 | 0 | 4 | 0 |
T5 | 0 | 0 | 3 | 1 |
T6 | 0 | 0 | 1 | 13 |
T7 | 0 | 0 | 8 | 0 |
T8 | 4 | 0 | 0 | 7 |
T9 | 0 | 1 | 1 | 10 |
T10 | 0 | 0 | 0 | 18 |
Table 1: Example Transaction Database
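The reduction from quantities to the binary domain might be sketched as follows. The names `ITEMS`, `QUANTITIES`, and `binarize` are assumptions introduced for illustration; the numbers are copied from Table 1.

```python
# Table 1 as quantities sold per transaction, in item order A, B, C, D.
ITEMS = ("A", "B", "C", "D")
QUANTITIES = {
    "T1": (1, 0, 1, 14),
    "T2": (0, 0, 6, 0),
    "T3": (1, 0, 2, 4),
    "T4": (0, 0, 4, 0),
    "T5": (0, 0, 3, 1),
    "T6": (0, 0, 1, 13),
    "T7": (0, 0, 8, 0),
    "T8": (4, 0, 0, 7),
    "T9": (0, 1, 1, 10),
    "T10": (0, 0, 0, 18),
}


def binarize(row):
    """Map a row of quantities to the set of items present (quantity > 0)."""
    return frozenset(item for item, qty in zip(ITEMS, row) if qty > 0)


# Each transaction becomes the set of items it contains.
transactions = {tid: binarize(row) for tid, row in QUANTITIES.items()}

for tid, items in transactions.items():
    print(tid, sorted(items))
```

Discarding the quantities at this step is exactly what makes support insensitive to how many units were sold.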
Table 2 shows the support for each possible itemset. For example, item A appears in 3 of the 10 transactions, so supp(A) = 3/10 = 0.3, and the itemset CD appears in 5 of the 10 transactions, so supp(CD) = 5/10 = 0.5.
Itemset | Support |
A | 0.30 |
B | 0.10 |
C | 0.80 |
D | 0.70 |
AB | 0.00 |
AC | 0.20 |
AD | 0.30 |
BC | 0.10 |
BD | 0.10 |
CD | 0.50 |
ABC | 0.00 |
ABD | 0.00 |
ACD | 0.20 |
BCD | 0.10 |
ABCD | 0.00 |
Table 2: Itemset Support
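The values in Table 2 can be reproduced with a short brute-force enumeration. This is only a sketch: the function name `support_table` is hypothetical, and the transactions are written as strings of the items present in each row of Table 1.

```python
from itertools import combinations

ITEMS = "ABCD"
# Rows T1..T10 of Table 1, with non-zero quantities reduced to item presence.
TRANSACTIONS = ["ACD", "C", "ACD", "C", "CD", "CD", "C", "AD", "BCD", "D"]


def support_table(items, transactions):
    """Return {itemset: support} for every non-empty itemset over `items`."""
    db = [frozenset(t) for t in transactions]
    table = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            # Count the transactions that contain every item of the itemset.
            count = sum(1 for t in db if t.issuperset(combo))
            table["".join(combo)] = count / len(db)
    return table


table2 = support_table(ITEMS, TRANSACTIONS)
for itemset, s in table2.items():
    print(f"{itemset:<5}| {s:.2f}")
```

Enumerating all 2^m - 1 itemsets is feasible here only because m = 4.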
Using the support measure, the frequent itemsets are defined to be those whose support is greater than or equal to minsupport, a user-specified threshold. If minsupport = 0.2, then the frequent itemsets in Table 2 are A, C, D, AC, AD, CD, and ACD.
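A sketch of the thresholding step follows, assuming exhaustive enumeration rather than a level-wise search such as Apriori. The function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations

ITEMS = "ABCD"
# Rows T1..T10 of Table 1, with non-zero quantities reduced to item presence.
TRANSACTIONS = ["ACD", "C", "ACD", "C", "CD", "CD", "C", "AD", "BCD", "D"]


def frequent_itemsets(items, transactions, minsupport):
    """Return {itemset: support} for itemsets with support >= minsupport."""
    db = [frozenset(t) for t in transactions]
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            supp = sum(1 for t in db if t.issuperset(combo)) / len(db)
            if supp >= minsupport:  # the threshold is inclusive
                frequent["".join(combo)] = supp
    return frequent


frequent = frequent_itemsets(ITEMS, TRANSACTIONS, minsupport=0.2)
print(sorted(frequent))
```

Because the comparison is inclusive, AC and ACD (support exactly 0.2) qualify as frequent.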
Creating the Association Rules
The following association rules can be created from the frequent itemsets in Table 2:
- C → A
- A → D
- D → A
- C → D
- D → C
- A,C → D
- A,D → C
- C,D → A
Each association rule has the form X → Y, where X is an itemset and Y is an item.
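One way such rules might be enumerated is sketched below. It assumes that every frequent itemset with at least two items is split into a single-item consequent Y and the remaining items as antecedent X; the function name `single_consequent_rules` is hypothetical. Under this assumption the enumeration yields the eight rules listed above and also A -> C, the counterpart of C -> A.

```python
def single_consequent_rules(frequent_itemsets):
    """Return (antecedent, consequent) pairs with a single-item consequent."""
    rules = []
    for itemset in frequent_itemsets:
        if len(itemset) < 2:
            continue  # a 1-itemset leaves no antecedent
        for consequent in itemset:
            antecedent = tuple(i for i in itemset if i != consequent)
            rules.append((antecedent, consequent))
    return rules


# Frequent itemsets at minsupport = 0.2, taken from Table 2.
FREQUENT = ["A", "C", "D", "AC", "AD", "CD", "ACD"]

rules = single_consequent_rules(FREQUENT)
for antecedent, consequent in rules:
    print(f"{','.join(antecedent)} -> {consequent}")
```

No confidence filtering is applied in this sketch; it only enumerates candidate rules.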
Limitations of the Support Measure
As previously indicated, support has been used as a fundamental measure for determining the importance of an itemset. The frequency of occurrence of an itemset certainly provides useful information. In addition, support is measured relative to a stable foundation, the total number of transactions being examined. However, transaction data often contain richer information than whether or not the itemset exists in the transaction, such as quantity sold, unit cost, unit profit, or other numerical attributes. By analyzing this information, we can get a more insightful picture of the relative importance of itemsets.
Consider the information provided by quantity sold. Many products are purchased in multiples, such as frozen concentrated juice or carbonated beverages. Since support does not consider this, frequency information derived from support may be misleading. In the sample database, the 1-itemset C has a higher support than the 1-itemset D because it occurs in one more transaction than D (8 versus 7). However, the total quantity of item D sold is 67, while that of item C is 26, so measured in units sold, D is in fact the more heavily sold item. Similarly, itemsets BC and BD have equal support, since each occurs in a single transaction (T9), but the quantity sold of items B and D in itemset BD (11 units) is higher than the quantity sold of items B and C in itemset BC (2 units). Again, support provides a misleading picture of frequency in terms of the quantity of items sold.
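The contrast between support and quantity sold for items C and D can be checked with a few lines of Python. The helper name `item_summary` is an assumption for illustration; the rows are copied from Table 1.

```python
ITEMS = ("A", "B", "C", "D")
# Quantities sold in T1..T10 (Table 1), in item order A, B, C, D.
ROWS = [
    (1, 0, 1, 14), (0, 0, 6, 0), (1, 0, 2, 4), (0, 0, 4, 0), (0, 0, 3, 1),
    (0, 0, 1, 13), (0, 0, 8, 0), (4, 0, 0, 7), (0, 1, 1, 10), (0, 0, 0, 18),
]


def item_summary(item):
    """Return (support, total quantity sold) for a single item."""
    col = ITEMS.index(item)
    quantities = [row[col] for row in ROWS]
    # Support only counts transactions where the quantity is non-zero.
    supp = sum(1 for q in quantities if q > 0) / len(ROWS)
    return supp, sum(quantities)


for item in ("C", "D"):
    supp, total = item_summary(item)
    print(f"{item}: support = {supp:.2f}, total quantity = {total}")
```

C ranks first by support while D ranks first by quantity, which is the reversal described above.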
The original support measure also does not allow for accurate financial calculations or comparisons. For target marketing, measures should take into account both the frequency of an item contributing to a predictive rule and the value of the items in the prediction (Masland and Piatetsky-Shapiro, 1995). The support measure allows for neither of these, so measures based on specific numbers of items, such as percentage of gross sales, costs, or net profit, cannot be calculated, and business payoff cannot be maximized. Consider itemsets BC and BD again. Assume that each item is sold for $1.00. Using support, there is no reason to consider one itemset more important than the other, even though BD generates $11.00 of revenue and BC generates only $2.00. Support fails as a measure of relative importance whenever the number of items plays a significant role in determining the value of the relevant sales. Various approaches have been suggested for extending support to quantitative measures (Srikant and Agrawal, 1996; Buchter and Wirth, 1998). A simpler and more flexible approach is to use itemset share as a measure.
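The revenue comparison for BC and BD can be sketched as follows, using the $1.00 unit price assumed in the text. The function name `itemset_revenue` is hypothetical, and revenue is taken here to mean the value of the itemset's own items in the transactions that contain the whole itemset.

```python
ITEMS = ("A", "B", "C", "D")
# Quantities sold in T1..T10 (Table 1), in item order A, B, C, D.
ROWS = [
    (1, 0, 1, 14), (0, 0, 6, 0), (1, 0, 2, 4), (0, 0, 4, 0), (0, 0, 3, 1),
    (0, 0, 1, 13), (0, 0, 8, 0), (4, 0, 0, 7), (0, 1, 1, 10), (0, 0, 0, 18),
]
UNIT_PRICE = 1.00  # every item is assumed to sell for $1.00, as in the text


def itemset_revenue(itemset, unit_price=UNIT_PRICE):
    """Sum the value of the itemset's items over transactions containing it."""
    total = 0.0
    for row in ROWS:
        quantities = dict(zip(ITEMS, row))
        # Only transactions that contain every item of the itemset count.
        if all(quantities[i] > 0 for i in itemset):
            total += sum(quantities[i] for i in itemset) * unit_price
    return total


for itemset in ("BC", "BD"):
    print(f"{itemset}: ${itemset_revenue(itemset):.2f}")
```

Both itemsets occur only in T9, so their supports are equal, yet the revenue figures differ by a factor of more than five.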