**Reference:**

- 13. Findlater, L., and Hamilton, H.J. ``Iceberg Cube Algorithms: An Empirical Evaluation on Synthetic and Real Data,'' Intelligent Data Analysis, 7(2), 2003. Accepted April, 2002.

# Introduction

An Iceberg-Cube contains only those cells of the data cube that meet an aggregate condition. It is called an Iceberg-Cube because it contains only some of the cells of the full cube, like the tip of an iceberg. The aggregate condition could be, for example, minimum support or a lower bound on average, min or max. The purpose of the Iceberg-Cube is to identify and compute only those values that will most likely be required for decision support queries. The aggregate condition specifies which cube values are more meaningful and should therefore be stored. This is one solution to the problem of computing versus storing data cubes.

For a three dimensional data cube, with attributes A, B and C, the Iceberg-Cube problem may be represented as:

FROM

CUBE BY

HAVING

TableName

A,B,C

COUNT(*)>=

*minsup*

where *minsup* is the minimum support. Minimum support
is the minimum number of tuples in which a combination of attribute values must
appear to be considered ** frequent**. It is expressed as a percentage
of the total number of tuples in the input table.

When an Iceberg-Cube is constructed, it may also include those totals from the original cube that satisfy the minimum support requirement. The inclusion of totals makes Iceberg-Cubes more useful for extracting previously unknown relationships from a database.

For example, suppose minimum support is 25% and we want to create an Iceberg-Cube using Table 2 as input. For a combination of attribute values to appear in the Iceberg-Cube, it must be present in at least 25% of tuples in the input table, or 2 tuples. The resulting Iceberg-Cube is shown in Table 3. Those cells in the full cube whose counts are less than 2, such as ({P1, Vancouver, Vance},1), are not present in the Iceberg-Cube.

Part | StoreLocation | Customer |

P1 | Vancouver | Vance |

P1 | Calgary | Bob |

P1 | Toronto | Richard |

P2 | Toronto | Allison |

P2 | Toronto | Allison |

P2 | Toronto | Tom |

P2 | Ottawa | Allison |

P3 | Montreal | Anne |

Combination | Count |

{P1, ANY, ANY} | 3 |

{P2, ANY, ANY} | 4 |

{ANY, Toronto, ANY} | 4 |

{ANY, ANY, Allison} | 3 |

{P2, Toronto, ANY} | 3 |

{P2, ANY, Allison} | 3 |

{ANY, Toronto, Allison} | 2 |

{P2, Toronto, Allison} | 2 |

__Methods for Computing Iceberg-Cubes__

**1. APRIORI**

The APRIORI algorithm uses ** candidate combinations** to avoid counting every possible combination of
attribute values. For a combination of attribute values to satisfy the minimum
support requirement, all subsets of that combination must also satisfy minimum
support. The candidate combinations are found by combining only the frequent
attribute value combinations that are already known. All other possible
combinations are automatically eliminated because not all of their subsets
would satisfy the minimum support requirement.

To do this, the algorithm first counts all single values on one pass of the data, then counts all candidate
combinations of the frequent single values to identify frequent pairs. On the
third pass over the data, it counts candidate combinations based on the
frequent pairs to determine frequent 3-sets, and so on. This method guarantees
that all frequent combinations of *k* values will be found in *k*
passes over the database.

**Example:**Suppose the input table is as shown in Table 4. On the first pass over the data, the APRIORI algorithm determines that the single values shown in Table 5 are frequent. These single values are implicit representations of ordered sets in the ordered set representation of an Iceberg-Cube. For example, {a1} represents the combination {a1,ANY,ANY,ANY}. The candidate combinations that can be derived from these values are shown in Table 6. These combinations are counted on the second pass and the algorithm determines which of them are frequent. The result is shown in Table 7. From the frequent pairs, only one 3-set candidate combination can be derived, as shown in Table 8, and the algorithm counts this combination on the third pass over the data. The result of the third pass over the data is shown in Table 9. Since {a2, c2, d2} is not frequent no more candidate combinations can be derived, and the algorithm finishes execution.

A | B | C | D |

a1 | b1 | c3 | d1 |

a1 | b5 | c1 | d2 |

a1 | b2 | c5 | d2 |

a2 | b2 | c2 | d2 |

a2 | b2 | c2 | d4 |

a2 | b2 | c4 | d2 |

a2 | b3 | c2 | d3 |

a3 | b4 | c6 | d2 |

Table 4: Sample Input Table

Combination | Count |

{a1} | 3 |

{a2} | 4 |

{b2} | 4 |

{c2} | 3 |

{d2} | 5 |

Combination |

{a1,b2} |

{a1,c2} |

{a1,d2} |

{a2, b2} |

{a2,c2} |

{a2,d2} |

{b2,c2} |

{b2,d2} |

{c2,d2} |

Combination | Count |

{a1,d2} | 2 |

{a2,b2} | 3 |

{a2,c2} | 3 |

{b2,c2} | 2 |

{a2,d2} | 2 |

{b2,d2} | 3 |

Combination |

{a2,b2,c2} |

{a2,b2,d2} |

Combination | Count |

{a2,b2,c2} | 2 |

{a2,b2,d2} | 2 |

**2. Top-Down Computation**

Top-down computation (tdC) computes an Iceberg-Cube by traversing down a
multi-dimensional lattice formed from the attributes in an input table. The
lattice represents all combinations of input attributes and the relationships
between those combinations. Figure 10 shows a 4-Dimensional lattice of this
type. The processing path of tdC is shown in Figure 11. The algorithm begins by
computing the frequent attribute value combinations for the attribute set at
the top of the tree, in this case ABCD. On the same pass over the data, tdC
counts value combinations for ABCD, ABC, AB and A, adding the frequent ones to
the Iceberg-Cube. This is facilitated by first ordering the database by the
current attribute combination, ABCD. tdC then continues to the next leaf node in
the tree, ABD, and counts those attribute value combinations. For *n*
attributes, there are 2^{n}^{-1} possible combinations
of those attributes, which are represented as the leaf nodes in the tree. If no
pruning occurs, then tdC examines every leaf node, making 2^{n}^{-1}
passes over the data. Pruning can occur when no attribute value combinations
are found to be frequent for a certain combination of attributes. For example,
if after the combination ABCD is processed, and there are no attribute value
combinations of AB that are frequent, then tdC can prune the ABD node from
the processing tree because all combinations that would be counted would start
with AB, except for the single attribute A, which has already been counted.

Figure 10: 4-Dimensional Lattice

Figure 11: Processing Tree of tdC for Four Attributes

**Example:**Suppose the input to tdC is as shown in Table 4 and minimum support is 25% or 2 tuples. tdC first orders the database by ABCD, resulting in Table 10. After counting all combinations for ABCD, ABC, AB and A, the Iceberg-Cube is as shown in Table 11. tdC then orders the database by ABD, as shown in Table 12, and counts the combinations ABD, AB and A. The result is shown in Table 13. tdC continues with leaf nodes ACD, AD, BCD, BD, CD and D. No pruning occurs for this input table. The final result is shown in Table 14.

A | B | C | D |

a1 | b1 | c3 | d1 |

a1 | b2 | c5 | d2 |

a1 | b5 | c1 | d2 |

a2 | b2 | c2 | d2 |

a2 | b2 | c2 | d4 |

a2 | b2 | c4 | d2 |

a2 | b3 | c2 | d3 |

a3 | b4 | c6 | d2 |

Combination | Count |

{a1} | 3 |

{a2,b2,c2} | 2 |

{a2,b2} | 3 |

{a2} | 4 |

A | B | D |

a1 | b1 | d1 |

a1 | b2 | d2 |

a1 | b5 | d2 |

a2 | b2 | d2 |

a2 | b2 | d2 |

a2 | b2 | d4 |

a2 | b3 | d3 |

a3 | b4 | d2 |

Combination | Count |

{a1} | 3 |

{a2,b2,d2} | 2 |

{a2,b2} | 3 |

{a2} | 4 |

Combination | Count |

{a1} | 3 |

{a2} | 4 |

{a2,b2} | 3 |

{a2,b2,c2} | 2 |

{a2,b2,d2} | 2 |

{a2,c2} | 3 |

{a1,d2} | 2 |

{a2,d2} | 2 |

{b2,c2} | 2 |

{b2} | 4 |

{b2,d2} | 3 |

{c2} | 3 |

{d2} | 5 |

The ** cardinality** of an attribute is the number of distinct values that attribute has. The tdC
algorithm prunes most efficiently when the input attributes are ordered from
highest to lowest cardinality because it is more likely to prune the first
attribute in that case.

**3. Bottom-Up Computation**

The bottom-up computation algorithm(BUC) repeatedly sorts the database as necessary to allow convenient partitioning and counting of the combinations without large main memory requirements. BUC begins by counting the frequency of the first attribute in the input table. The algorithm then partitions the database based on the frequent values of the first attribute, so that only those tuples that contain a frequent value for the first attribute are further examined. BUC then counts combinations of values for the first two attributes and again partitions the databse so only those tuples that contain frequent combinations of the first two attributes are further examined, and so on.

For a database with four attributes, A, B, C and D, the processing tree of the BUC algorithm is shown in Figure 12. As with the tdC processing tree, this is based on the 4-dimensional lattice from Figure 10. The algorithm examines attribute A, then partitions the database by the frequent values of A, sorting each partition by the next attribute, B, for ease of counting AB combinations. Within each partition, BUC counts combinations of attributes A and B. Again, once the frequent combinations are found, the database is partitioned, this time on frequent combinations of AB, and is sorted by attribute C. When no frequent combinations are found, the algorithm traverses back down the tree and ascends to the next node. For example, if there are no frequent combinations of AC, then BUC will examine combinations of AD next. In this way, BUC prunes passes over the data. As with the tdC algorithm, BUC prunes most efficiently when the attributes are ordered from highest to lowest cardinality.

Figure 12: Processing Tree of BUC for Four Attributes

**Example:**Suppose the input table is as shown in Table 4 and minimum support is 25% or 2 tuples. BUC first counts A, finding the values a1 and a2 to be frequent. The ordered sets {a1, ANY, ANY, ANY} and {a2, ANY, ANY, ANY}, implicitly represented as simply {a1} and {a2}, and their frequencies are inserted into the Iceberg-Cube. BUC then partitions the database on the values a1 and a2 and sorts those partitions by attribute B. At AB, no value of B is frequent with a1. Then the a1 partition is sorted by C, but no value of C is frequent with a1. Only the combination {a1,d2} is inserted into the Iceberg-Cube after fully examining the a1 partition. BUC then examines the a2 partition, which is shown in Table 15. The combination {a2,b2} is found to be frequent and is inserted into the Iceberg- Cube. BUC then further partitions the frequent combination {a2,b2}. The new partition, as shown in Table 16, is sorted by attribute C. After counting all combinations of a2, b2 and any value for C, BUC inserts {a2,b2,c2} into the Iceberg-Cube, then creates partitions based on that combination of values. The new partitions, shown in Table 17, are sorted by attribute D. No combinations that include D are found to be frequent, so the algorithm descends the processing tree back to AB and then continues looking for combinations of ABD that include {a2,b2}. After sorting by attribute D, the partition shown in Table 18 results. Since the combination {a2,b2,d2} is frequent, it is inserted in the Iceberg-cube. No other values of D are frequent with {a2,b2}. The algorithm descends the processing tree to AB again. It continues looking for other frequent combinations of AB after {a2,b2}, but there are none, so eventually it returns to A. From A, processing continues at AC, looking for values of C that occur frequently with {a2}. The combination {a2,c2} occurs frequently, so processing ascends the tree to ACD, to look for values of D that occur frequently with {a2,c2}, but none are found. The algorithm returns to AC and then to A. From A processing continues at AD, where {a2,d2} is found to be frequent. Next the data is resorted based on the single attribute B, then combinations of attribute B and C, and so on. The final Iceberg-Cube is the same as for tdC, shown in Table 14.

A | B | C | D |

a2 |
b2 |
c2 |
d2 |

a2 |
b2 |
c2 |
d4 |

a2 |
b2 |
c4 |
d2 |

a2 | b3 | c2 | d3 |

A | B | C | D |

a2 |
b2 |
c2 |
d2 |

a2 |
b2 |
c2 |
d4 |

a2 | b2 | c4 | d2 |

A | B | C | D |

a2 |
b2 |
c2 |
d2 |

a2 | b2 | c2 | d4 |

A | B | D |

a2 |
b2 |
d2 |

a2 |
b2 |
d2 |

a2 | b2 | d4 |

Table 18: Partition of {a2,b2} on D

**4. Other Methods**

For some databases the number of candidate combinations required to fit in main memory by the APRIORI algorithm
becomes too large. Apart from tdC and BUC, other new methods have attempted to
solve this problem. Three of these are FP-Growth, Top-*k* BUC and H-Mine.

FP-growth attempts to avoid candidate generation and minimize main memory use by compressing the database into a specialized data structure called an FP-tree. Unfortunately, the recursively generated conditional databases created by FP-growth may still exceed main memory size for large databases.

Top-*k* BUC is a specialized form of BUC, used for finding combinations with average values exceeding a
specified minimum average. The average is calculated based on the sum of the
values for a related measure attribute divided by the count of values. Top-*k*
BUC also uses an H-tree (hyper-tree) for this problem.

Another recent algorithm is the H-Mine algorithm, based on partitioning the database. This algorithm requires less main memory than APRIORI and FP-Growth but more than BUC.