**Please Reference the Following Paper:**

- Hilderman, R.J., and Hamilton, H.J. "Heuristic Measures of Interestingness." In Third European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD'99), Prague, Czech Republic, Springer-Verlag, September 1999, 232-241.

# Introduction

The sixteen measures described here come from the areas of economics, ecology, and information theory. Collectively, we refer to them as the HMI set (i.e., heuristic measures of interestingness).

The measures can be used for more than just ranking the interestingness of generalized relations using domain generalization graphs. For example, alternative methods could be used to guide the generation of summaries, such as Galois lattices, conceptual graphs, or formal concept analysis. Also, summaries could more generally include views generated from databases or summary tables generated from data cubes.

# Sixteen Characteristics of the HMI Set

**Each measure shares three important properties**

- Depends only on the frequency distribution or probability distribution of the values in the derived Count attribute of the summary to which it is being applied.
- Allows a value to be generated with at most one pass through the summary.
- Is independent of any specific units of measure.

**Variables used in describing the HMI set**

- Let *m* be the total number of tuples in a summary
- Let *n*_{i} be the value contained in the *Count* attribute for tuple *t*_{i}
- Let *N* be the total count
- Let *p* be the observed probability (i.e., relative frequency) distribution of the tuples based on the values *n*_{i}
- Let *p*_{i} = *n*_{i}/*N* be the observed probability for tuple *t*_{i}
- Let *q* be a uniform probability distribution of the tuples
- Let *u* = *N*/*m* be the count for tuple *t*_{i}, *i* = 1,2,...,*m*, according to the uniform distribution *q*
- Let *q-bar* = 1/*m* be the probability for tuple *t*_{i}, *i* = 1,2,...,*m*, according to the uniform distribution *q*
- Let *r* be the probability distribution obtained by combining the values *n*_{i} and *u*
- Let *r*_{i} = (*n*_{i} + *u*)/(2*N*) be the probability for tuple *t*_{i}, for all *i* = 1,2,...,*m*, according to distribution *r*
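
As a minimal sketch, the variables above can be derived from a list of *Count* values as follows. The function name `summary_variables` and the sample counts are hypothetical, and the total count *N* is assumed to be the sum of the *n*_{i}.

```python
def summary_variables(counts):
    """Derive m, N, p, u, q_bar and r from the Count values n_i."""
    m = len(counts)                          # number of tuples in the summary
    N = sum(counts)                          # total count (assumed: sum of the n_i)
    p = [n / N for n in counts]              # observed probabilities p_i = n_i / N
    u = N / m                                # count per tuple under the uniform distribution q
    q_bar = 1 / m                            # probability per tuple under q
    r = [(n + u) / (2 * N) for n in counts]  # combined distribution r_i = (n_i + u) / (2N)
    return m, N, p, u, q_bar, r


# Hypothetical summary with three tuples.
m, N, p, u, q_bar, r = summary_variables([3, 2, 1])
```

Both `p` and `r` sum to one, which is a quick sanity check on the definitions.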

*I*_{Variance}

Based on the sample variance from classical statistics, *I*_{Variance} measures the weighted average of the squared deviations of the probabilities *p*_{i} from the mean probability *q-bar*, where the weight assigned to each squared deviation is 1/(*m* - 1).
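
The description above translates fairly directly into code. The following is a sketch only, with a hypothetical function name, and it assumes a summary of at least two tuples so that the weight 1/(*m* - 1) is defined.

```python
def i_variance(counts):
    """Sum of squared deviations of p_i from q_bar, each weighted by 1/(m - 1)."""
    m = len(counts)
    N = sum(counts)
    q_bar = 1 / m  # mean probability under the uniform distribution
    return sum((n / N - q_bar) ** 2 for n in counts) / (m - 1)


uniform_value = i_variance([2, 2, 2])  # every p_i equals q_bar
skewed_value = i_variance([3, 1])      # p = (0.75, 0.25), q_bar = 0.5
```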

*I*_{Simpson}

A variance-like measure based on the Simpson index, *I*_{Simpson} measures the extent to which the counts are distributed over the tuples in a summary, rather than being concentrated in any single one of them.
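
The formula is not given here, so the sketch below assumes the classical Simpson index, the sum of the squared probabilities *p*_{i}. The function name is hypothetical.

```python
def i_simpson(counts):
    """Classical Simpson index: sum of p_i squared (assumed form)."""
    N = sum(counts)
    return sum((n / N) ** 2 for n in counts)


spread_value = i_simpson([1, 1, 1, 1])   # counts spread evenly over four tuples
concentrated_value = i_simpson([4, 0])   # all counts in a single tuple
```

Under this form the value falls toward 1/*m* as the counts spread evenly and reaches 1 when a single tuple holds every count.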

*I*_{Shannon}

Based on a relative entropy measure from information theory (known as the *Shannon index*), *I*_{Shannon} measures the average information content in the tuples of a summary.
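
A minimal sketch of the Shannon index follows. It assumes base-2 logarithms and treats tuples with a zero count as contributing nothing; both choices, and the function name, are assumptions rather than details stated here.

```python
import math


def i_shannon(counts):
    """Shannon index: -sum(p_i * log2(p_i)), skipping zero probabilities."""
    N = sum(counts)
    probs = [n / N for n in counts]
    return -sum(x * math.log2(x) for x in probs if x > 0)


even_value = i_shannon([1, 1, 1, 1])  # four equally likely tuples
```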

*I*_{Total}

Based on the Shannon index from information theory, *I*_{Total} measures the total information content in a summary.

*I*_{Max}

Based on the Shannon index from information theory, *I*_{Max} measures the maximum possible information content in a summary.
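
Neither *I*_{Total} nor *I*_{Max} is given a formula here. The sketch below assumes that the total is the Shannon index multiplied by the number of tuples *m*, and that the maximum is log2(*m*), the value the Shannon index takes for a uniform distribution. Both forms and all function names are assumptions.

```python
import math


def shannon(counts):
    """Shannon index in bits, skipping zero probabilities."""
    N = sum(counts)
    return -sum((n / N) * math.log2(n / N) for n in counts if n > 0)


def i_total(counts):
    """Assumed form: m times the average information content."""
    return len(counts) * shannon(counts)


def i_max(counts):
    """Assumed form: log2(m), the largest value the Shannon index can take."""
    return math.log2(len(counts))


total_value = i_total([1, 1, 1, 1])
max_value = i_max([1, 1, 1, 1])
```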

*I*_{McIntosh}

Based on a heterogeneity index from ecology, *I*_{McIntosh} views the counts in a summary as the coordinates of a point in a multidimensional space and measures the modified Euclidean distance from this point to the origin.
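
The text does not say how the Euclidean distance is modified. The sketch below assumes the normalization commonly used for the McIntosh index in ecology, (*N* - *U*)/(*N* - sqrt(*N*)), where *U* is the distance from the point (*n*_{1}, ..., *n*_{m}) to the origin. The function name is hypothetical, and the form is undefined when *N* = 1.

```python
import math


def i_mcintosh(counts):
    """Assumed form: (N - U) / (N - sqrt(N)), with U = sqrt(sum of n_i squared)."""
    N = sum(counts)
    U = math.sqrt(sum(n * n for n in counts))  # Euclidean distance to the origin
    return (N - U) / (N - math.sqrt(N))        # requires N > 1


even_value = i_mcintosh([1, 1, 1, 1])
concentrated_value = i_mcintosh([4, 0])
```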

*I*_{Lorenz}

Based on the Lorenz curve from statistics, economics, and social science, *I*_{Lorenz} measures the average value of the Lorenz curve derived from the probabilities *p*_{i} associated with the tuples in a summary. The Lorenz curve is a series of straight lines in a square of unit length, starting from the origin and going successively to points (*p*_{1}, *q*_{1}), (*p*_{1} + *p*_{2}, *q*_{1} + *q*_{2}), . . .. When the *p*_{i}'s are all equal, the Lorenz curve coincides with the diagonal that cuts the unit square into equal halves. When the *p*_{i}'s are not all equal, the Lorenz curve is below the diagonal.
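
One way to read "the average value of the Lorenz curve" is as the mean of the cumulative probabilities at the *m* tuple boundaries. The sketch below follows that reading and assumes the probabilities are sorted in ascending order so that the curve lies on or below the diagonal. The function name and this sampling choice are assumptions.

```python
from itertools import accumulate


def i_lorenz(counts):
    """Mean height of the Lorenz curve sampled at the m tuple boundaries."""
    m = len(counts)
    N = sum(counts)
    p_sorted = sorted(n / N for n in counts)  # ascending order (assumed)
    heights = list(accumulate(p_sorted))      # p_1, p_1 + p_2, ..., 1
    return sum(heights) / m


even_value = i_lorenz([1, 1])
skewed_value = i_lorenz([1, 3])
```

With this reading, a more uneven distribution pulls the curve down and gives a smaller value.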

*I*_{Gini}

Based on the Gini coefficient, which is defined in terms of the Lorenz curve, *I*_{Gini} measures the ratio of the area between the diagonal (i.e., the line of equality) and the Lorenz curve to the total area below the diagonal.
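
The area ratio described above can be sketched by building a piecewise-linear Lorenz curve and integrating it with the trapezoid rule. The function name is hypothetical, the probabilities are assumed to be sorted in ascending order, and the published closed form may be normalized differently.

```python
from itertools import accumulate


def i_gini(counts):
    """Area between the diagonal and the Lorenz curve, over the area below the diagonal."""
    m = len(counts)
    N = sum(counts)
    p_sorted = sorted(n / N for n in counts)
    lorenz = [0.0] + list(accumulate(p_sorted))  # curve heights at x = 0, 1/m, ..., 1
    width = 1 / m
    # Trapezoid rule over each straight-line segment of the curve.
    area_under_curve = sum((a + b) / 2 * width for a, b in zip(lorenz, lorenz[1:]))
    area_under_diagonal = 0.5
    return (area_under_diagonal - area_under_curve) / area_under_diagonal


even_value = i_gini([1, 1])
concentrated_value = i_gini([4, 0])
```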

*I*_{Berger}

Based on a dominance index from ecology, *I*_{Berger} measures the proportional dominance of the tuple in a summary with the highest probability *p*_{i}.
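
Read literally, this is the largest observed probability. A minimal sketch, with a hypothetical function name:

```python
def i_berger(counts):
    """Largest observed probability p_i in the summary."""
    N = sum(counts)
    return max(n / N for n in counts)


dominance = i_berger([3, 1])  # the first tuple holds three quarters of the count
```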

*I*_{Schutz}

Based on an inequality measure from economics and social science, *I*_{Schutz} measures the relative mean deviation of the observed distribution of the counts in a summary from a uniform distribution of the counts.
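
The sketch below assumes the usual relative mean deviation: the sum of the absolute deviations of the counts *n*_{i} from the uniform count *u*, divided by 2*N*. The function name and this normalization are assumptions.

```python
def i_schutz(counts):
    """Assumed form: sum(|n_i - u|) / (2N), with u = N / m."""
    m = len(counts)
    N = sum(counts)
    u = N / m  # count per tuple under the uniform distribution
    return sum(abs(n - u) for n in counts) / (2 * N)


even_value = i_schutz([2, 2])
skewed_value = i_schutz([3, 1])
```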

*I*_{Bray}

Based on a community similarity index from ecology, *I*_{Bray} measures the percentage of similarity between the observed distribution of the counts in a summary and a uniform distribution of the counts.
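
A sketch under the assumption that the similarity is computed in the Bray-Curtis style: the count shared between the observed and uniform distributions, sum(min(*n*_{i}, *u*)), as a fraction of the total count *N*. The function name is hypothetical.

```python
def i_bray(counts):
    """Assumed form: sum(min(n_i, u)) / N, with u = N / m."""
    m = len(counts)
    N = sum(counts)
    u = N / m  # count per tuple under the uniform distribution
    return sum(min(n, u) for n in counts) / N


even_value = i_bray([2, 2])
skewed_value = i_bray([3, 1])
```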

*I*_{Whittaker}

Based on a community similarity index from ecology, *I*_{Whittaker} measures the fraction of similarity between the observed distribution of the counts in a summary and a uniform distribution of the counts.

*I*_{Kullback}

Based on a distance measure from information theory, *I*_{Kullback} measures the distance between the observed distribution of the counts in a summary and a uniform distribution of the counts.

*I*_{MacArthur}

Based on the Shannon index from information theory, *I*_{MacArthur} combines two summaries, and then measures the difference between the amount of information contained in the combined distribution and the amount contained in the average of the two original distributions.
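
A sketch of this comparison follows. It assumes the two summaries are the observed distribution *p* and the uniform distribution *q*, that the combined distribution is *r* with *r*_{i} = (*n*_{i} + *u*)/(2*N*), and that information is measured with the base-2 Shannon index. The function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon index in bits, skipping zero probabilities."""
    return -sum(x * math.log2(x) for x in probs if x > 0)


def i_macarthur(counts):
    """Information in the combined distribution r minus the average for p and q."""
    m = len(counts)
    N = sum(counts)
    u = N / m
    p = [n / N for n in counts]              # observed distribution
    q = [1 / m] * m                          # uniform distribution
    r = [(n + u) / (2 * N) for n in counts]  # combined distribution
    return entropy(r) - (entropy(p) + entropy(q)) / 2


even_value = i_macarthur([1, 1])
concentrated_value = i_macarthur([4, 0])
```

When the observed distribution is already uniform, *p*, *q*, and *r* coincide and the difference is zero.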

*I*_{Theil}

Based on a distance measure from information theory, *I*_{Theil} measures the distance between the observed distribution of the counts in a summary and a uniform distribution of the counts.

*I*_{Atkinson}

Based on a measure of inequality from economics, *I*_{Atkinson} measures the percentage to which the population in a summary would have to be increased to achieve the same level of interestingness as if the counts in the summary were uniformly distributed.
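
No formula is given here. The sketch below assumes the Atkinson inequality index with inequality-aversion parameter 1, that is, one minus the ratio of the geometric mean to the arithmetic mean, written in terms of *p*_{i} and *q-bar*. The function name and the parameter choice are assumptions, and `math.prod` requires Python 3.8 or later.

```python
import math


def i_atkinson(counts):
    """Assumed form: 1 - product((p_i / q_bar) ** q_bar)."""
    m = len(counts)
    N = sum(counts)
    q_bar = 1 / m
    # A tuple with a zero count drives the product to 0 and the measure to 1.
    return 1 - math.prod(((n / N) / q_bar) ** q_bar for n in counts)


even_value = i_atkinson([1, 1])
skewed_value = i_atkinson([3, 1])
```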