Introduction to Itemsets
The existence of large amounts of scan code data collected by many businesses represents a potential wealth of information given adequate methods of transforming the data into meaningful information. One class of such data is stored in transaction databases from which all items obtained in a single transaction can be retrieved as a unit. The transactions can then be examined to determine what items typically appear together, e.g., which items customers typically buy together in a database of supermarket transactions. This in turn gives insight into questions such as how to market these products more effectively, how to group them in store layout or product packages, or which items to offer on sale to boost the sale of other items.
Recent research has focused on determining which groups of items, called itemsets, frequently appear together in transactions. From any itemset an association rule may be derived which, given the occurrence of a subset of the items in the itemset, predicts the probability of the occurrence of the remaining items (e.g., Agrawal et al., 1993; Agrawal et al., 1996; Houtsma and Swami, 1995; Masand and Piatetsky-Shapiro, 1996). Several algorithms have also been proposed for finding generalized itemsets from items that are classified by one or more taxonomic hierarchies (Han and Fu, 1995; Srikant and Agrawal, 1995).
A retail organization may offer thousands of products and services. The number of possible combinations of these products and services is potentially huge. In the general case, the examination of all possible combinations is impractical and methods are required to focus effort on those itemsets that are considered important to an organization. The most commonly used measure of the importance of an itemset is its support, the percentage of all transactions that contain the itemset (e.g., Agrawal et al., 1993; Mannila et al., 1994; Han and Fu, 1995; Hipp et al., 1998; Hidber, 1999). An association rule of the form A → B, where A and B are itemsets, is associated with a confidence measure which is the ratio of the support of the itemset A ∪ B to the support of the itemset A. The confidence quantifies the probability that when A is bought, B will also be bought.
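The two measures just defined can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `support` and `confidence` and the `baskets` data are hypothetical, chosen only to show the arithmetic.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Confidence of antecedent -> consequent: supp(A u B) / supp(A)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    supp_a = support(a, transactions)
    if supp_a == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return support(both, transactions) / supp_a


# Hypothetical supermarket baskets (illustrative data, not from the text).
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

supp_bread_milk = support({"bread", "milk"}, baskets)        # 2 of 4 baskets
conf_bread_milk = confidence({"bread"}, {"milk"}, baskets)   # (2/4) / (3/4)
```

Here bread and milk appear together in two of the four baskets, so the support is 0.5. Bread appears in three baskets, so the confidence of bread → milk is 0.5 / 0.75, or about 0.67.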
Itemsets that meet a minimum support threshold are referred to as frequent itemsets. The rationale behind the use of support is that a retail organization is only interested in those itemsets that occur frequently. However, the support of an itemset tells only the number of transactions in which the itemset was purchased. The exact number of items purchased is not analyzed and the precise impact of the purchase of an itemset cannot be measured in terms of stock, cost or profit. This shortcoming of the support measure prompted the development of a measure called itemset share, the fraction of some numerical value, such as total quantity of items sold or total profit, that is contributed by the items when they occur in an itemset (Carter et al., 1997). Overall, share-based measures provide more information whenever items are purchased in multiples.
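The difference between support and share can be made concrete with a small sketch. It assumes one reading of the definition above: the share of an itemset is the total quantity of its items in the transactions that contain the whole itemset, divided by the total quantity of all items sold. The function name `itemset_share` and the `baskets` quantities are hypothetical.

```python
from typing import Dict, Iterable, List

# Each transaction maps an item to the quantity purchased.
QuantityTransaction = Dict[str, int]


def itemset_share(itemset: Iterable[str],
                  transactions: List[QuantityTransaction]) -> float:
    """Share of `itemset` measured by quantity sold (assumed definition)."""
    items = set(itemset)
    total = sum(sum(t.values()) for t in transactions)
    if total == 0:
        return 0.0
    contributed = sum(
        sum(t[i] for i in items)
        for t in transactions
        # Count a transaction only if it contains every item of the itemset.
        if all(t.get(i, 0) > 0 for i in items)
    )
    return contributed / total


# Hypothetical baskets with purchase quantities (illustrative data).
baskets = [
    {"bread": 1, "milk": 6},
    {"bread": 2},
    {"milk": 1, "butter": 1},
]

share_bread_milk = itemset_share({"bread", "milk"}, baskets)  # 7 of 11 units
```

Only one of the three baskets contains both bread and milk, so the support of that itemset is 1/3. That single basket accounts for 7 of the 11 units sold, so the quantity-based share is 7/11, which reflects the multiple purchases that support ignores.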
General Definitions
conf(X → Y) = P(Y | X) = P(X ∩ Y) / P(X)
Example:
Transaction ID | A | B | C | D | E | F |
T_{1} | 1 | 0 | 1 | 1 | 0 | 0 |
T_{2} | 0 | 1 | 0 | 1 | 0 | 0 |
T_{3} | 1 | 1 | 1 | 0 | 1 | 0 |
T_{4} | 0 | 1 | 0 | 1 | 0 | 1 |