Notes 01-3: Primary Tasks of Data Mining

 

Data mining tasks are generally divided into two main categories: predictive and descriptive.

 

Predictive: The objective is to predict the value of a particular attribute based upon the values of other attributes.

 

·        The target (or dependent) variable is the attribute whose value is to be predicted.

 

·        The explanatory (or independent) variables are the attributes used to make the prediction.

 

Descriptive: The objective is to derive patterns (e.g., correlations, trends, clusters, trajectories, and anomalies) that describe the fundamental relationships in the data.

 

Core Data Mining Techniques

 

Predictive modeling: Used to build a model for the target variable as a function of the explanatory variables. There are two types: classification and regression.

 

Classification: Used for discrete target variables.

 

Example - A simple decision tree for mammal classification

 

DIAGRAM = Introduction.F.2.a1 – TO BE DONE
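A decision tree classifies an object by asking a sequence of questions about its attributes until a class label is reached. The sketch below illustrates the idea only; the two tests (body temperature, then whether the animal gives birth), the attribute names, and the sample animals are assumptions made for illustration and are not taken from the diagram referenced above.

```python
def classify(animal):
    """Walk a tiny hand-built decision tree and return a class label."""
    # Root test (assumed): animals that are not warm-blooded are non-mammals.
    if animal["body_temperature"] != "warm":
        return "non-mammal"
    # Second test (assumed): warm-blooded animals that give birth are mammals.
    # This is a simplification; egg-laying mammals would be misclassified.
    if animal["gives_birth"]:
        return "mammal"
    return "non-mammal"


# Hypothetical records, invented for illustration.
examples = {
    "whale": {"body_temperature": "warm", "gives_birth": True},
    "eagle": {"body_temperature": "warm", "gives_birth": False},
    "frog": {"body_temperature": "cold", "gives_birth": False},
}

predictions = {name: classify(attrs) for name, attrs in examples.items()}
print(predictions)
```

The target variable here is the discrete label (mammal or non-mammal); the explanatory variables are the two attributes tested by the tree.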

 

Regression: Used for continuous target variables.

 

Example - Predicting salary based upon years of service

 

Salary   Years of Service
30        3
57        8
64        9
72       13
36        3
43        6
59       11
90       21
20        1
83       16

 

Plot the points on a graph and find a line that best represents the relationship between salary and years of service.
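One common way to find such a line is ordinary least squares, which chooses the slope and intercept that minimize the sum of squared vertical distances between the points and the line. The notes do not say which fitting method is intended, so treat the following as a minimal sketch under that assumption; the function name fit_line is invented for illustration.

```python
# Data from the salary table above.
years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years, salary)
print(f"salary = {intercept:.2f} + {slope:.2f} * years")
```

For this data the fitted slope is about 3.54 and the intercept about 23.2, so each additional year of service corresponds to roughly 3.5 more salary units.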

 

Association analysis: Used to discover patterns that describe strongly associated features in the data.

 

Example - Profiling sales

 

Assume a store sells the following products: milk, cheese, bread, eggs, diapers, and beer.

 

The store records when each sale occurs (i.e., AM or PM).

 

Customers can buy any combination of products.

 

Associations describe products that tend to be purchased together. A rule such as milk → bread means that customers who buy milk also tend to buy bread.

 

milk → bread (AM and PM)

eggs → cheese (AM)

diapers → beer (PM)
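The strength of a rule is commonly measured by its support (the fraction of transactions containing all items in the rule) and its confidence (among transactions containing the left-hand side, the fraction that also contain the right-hand side). The notes do not name these measures, so the sketch below is one conventional way to quantify "strongly associated"; the transaction list is hypothetical and was invented so that the three rules above hold.

```python
# Hypothetical transactions: (time of sale, set of items bought).
transactions = [
    ("AM", {"milk", "bread"}),
    ("AM", {"eggs", "cheese", "milk"}),
    ("AM", {"eggs", "cheese"}),
    ("PM", {"milk", "bread", "diapers", "beer"}),
    ("PM", {"diapers", "beer"}),
    ("PM", {"milk", "bread"}),
]


def baskets(time=None):
    """Return the item sets, optionally restricted to AM or PM sales."""
    return [items for t, items in transactions if time is None or t == time]


def support(itemset, time=None):
    """Fraction of baskets that contain every item in itemset."""
    selected = baskets(time)
    return sum(1 for b in selected if itemset <= b) / len(selected)


def confidence(antecedent, consequent, time=None):
    """Among baskets containing antecedent, the fraction also containing consequent."""
    with_antecedent = [b for b in baskets(time) if antecedent <= b]
    if not with_antecedent:
        return 0.0
    with_both = [b for b in with_antecedent if consequent <= b]
    return len(with_both) / len(with_antecedent)


print(confidence({"milk"}, {"bread"}))              # milk -> bread, all day
print(confidence({"eggs"}, {"cheese"}, time="AM"))  # eggs -> cheese, AM only
print(confidence({"diapers"}, {"beer"}, time="PM")) # diapers -> beer, PM only
```

In practice a rule is reported only when both its support and its confidence exceed thresholds chosen by the analyst.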

 

Cluster analysis: Used to find groups of closely related objects so that objects in the same cluster are more similar to each other than to objects in other clusters.

 

Example - Clustering based upon customer profiles

 

Income   Age   Children   Marital Status   Education
25K      35    y          single           high school
15K      25    n          married          B.Sc.
20K      40    y          divorced         M.Sc.
30K      20    y          married          Ph.D.
20K      25    n          married          high school
70K      60    n          single           B.Sc.
90K      30    y          divorced         B.Sc.

 

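The clusters depend on which attribute (or attributes) is used to measure similarity. For example, clustering on income separates the 70K and 90K customers from the rest, whereas clustering on marital status produces single, married, and divorced groups.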
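To see how the choice of attribute changes the result, the sketch below clusters the seven customers twice, once on income and once on age, using a very small one-dimensional k-means. The notes do not prescribe a clustering algorithm; k-means, the choice of two clusters, and the initialization at the minimum and maximum values are all assumptions made for illustration.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Tiny 1-D k-means (k >= 2); returns a cluster index for each value."""
    # Assumed initialization: centroids spread evenly between min and max.
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid
        # (ties go to the lower-numbered cluster).
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels


# Columns from the customer table above (income in thousands).
income = [25, 15, 20, 30, 20, 70, 90]
age = [35, 25, 40, 20, 25, 60, 30]

by_income = kmeans_1d(income)
by_age = kmeans_1d(age)
print("by income:", by_income)
print("by age:   ", by_age)
```

Clustering on income groups the last two customers (70K and 90K) together, while clustering on age isolates only the 60-year-old, so the last customer changes cluster depending on the attribute used.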

 

Summarization: Used to find a compact description of a subset of the data.
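A simple form of summarization is a handful of descriptive statistics for a single attribute. The sketch below summarizes the salary column from the regression example; the particular statistics chosen are an assumption, not a list given in these notes.

```python
import statistics

# Salary column from the regression example.
salaries = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

# A compact description: ten values reduced to five numbers.
summary = {
    "count": len(salaries),
    "min": min(salaries),
    "max": max(salaries),
    "mean": statistics.mean(salaries),
    "stdev": statistics.stdev(salaries),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```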

 

Dependency modeling: Used to find a model that describes significant dependencies between variables. Dependency models exist at two levels:

 

1.     The structural level of the model specifies (often graphically) which variables are locally dependent on each other, and

 

2.     The quantitative level of the model specifies the strengths of the dependencies using some numerical scale.
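The two levels can be illustrated with the salary data from the regression example. In the sketch below the structural level is a list of variable pairs assumed to be dependent, and the quantitative level attaches a strength to each pair. Using the Pearson correlation coefficient as the numerical scale is an assumption; other measures are possible, and the variable names are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


data = {
    "years_of_service": [3, 8, 9, 13, 3, 6, 11, 21, 1, 16],
    "salary": [30, 57, 64, 72, 36, 43, 59, 90, 20, 83],
}

# Structural level (assumed): which variables depend on each other.
structure = [("years_of_service", "salary")]

# Quantitative level: a numerical strength for each dependency.
strengths = {(a, b): pearson(data[a], data[b]) for a, b in structure}

for (a, b), r in strengths.items():
    print(f"{a} -- {b}: r = {r:.3f}")
```

A coefficient close to 1 indicates a strong positive linear dependency; here it is roughly 0.97.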

 

Change and deviation detection: Used to discover the most significant changes in the data relative to previously measured or normative values.
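One simple way to detect deviations is to compare new measurements against the mean and standard deviation of previously measured values and flag anything that falls too far away. The sketch below assumes a threshold of three standard deviations; the function name, the threshold, and both data lists are hypothetical.

```python
import statistics


def find_deviations(baseline, new_values, threshold=3.0):
    """Return the new values lying more than threshold standard deviations
    from the mean of the previously measured baseline."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return [v for v in new_values if abs(v - mean) > threshold * stdev]


# Hypothetical data, invented for illustration.
baseline = [50, 52, 49, 51, 50, 48, 52, 50]   # previously measured values
new_values = [51, 49, 75, 50, 30]             # latest measurements

flagged = find_deviations(baseline, new_values)
print(flagged)
```

With this data only 75 and 30 are flagged; the other new measurements stay within the normal range of the baseline.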