## Introduction

**Cluster analysis** is the process of grouping objects into subsets that have meaning in the context of a particular problem. The objects are thereby organized into an efficient representation that characterizes the population being sampled. Unlike classification, clustering does not rely on predefined classes. Clustering is referred to as an *unsupervised learning method* because no information is provided about the "right answer" for any of the objects. It can uncover previously undetected relationships in a complex data set. Many applications for cluster analysis exist. For example, in a business application, cluster analysis can be used to discover and characterize customer groups for marketing purposes.

Two types of clustering algorithms are **nonhierarchical** and **hierarchical**. In nonhierarchical clustering, such as the *k*-means algorithm, the relationship between clusters is undetermined. Hierarchical clustering repeatedly links pairs of clusters until every data object is included in the hierarchy. With both of these approaches, an important issue is how to determine the similarity between two objects, so that clusters can be formed from objects with a high similarity to each other. Commonly, *distance functions*, such as the *Manhattan* and *Euclidean* distance functions, are used to determine similarity. A distance function yields a higher value for pairs of objects that are less similar to one another. Sometimes a *similarity function* is used instead, which yields higher values for pairs that are more similar.

## Distance Functions

Given two *p*-dimensional data objects *i* = (*x*_{i1}, *x*_{i2}, ..., *x*_{ip}) and *j* = (*x*_{j1}, *x*_{j2}, ..., *x*_{jp}), the following common distance functions can be defined:

**Euclidean Distance Function:** *dis*(*i*,*j*) = √((*x*_{i1}-*x*_{j1})² + (*x*_{i2}-*x*_{j2})² + ... + (*x*_{ip}-*x*_{jp})²)

**Manhattan Distance Function:** *dis*(*i*,*j*) = |*x*_{i1}-*x*_{j1}| + |*x*_{i2}-*x*_{j2}| + ... + |*x*_{ip}-*x*_{jp}|

When using the Euclidean distance function to compare distances, it is not necessary to calculate the square root, because distances are always positive numbers and as such, for two distances *d*_{1} and *d*_{2}, √*d*_{1} > √*d*_{2} ⇔ *d*_{1} > *d*_{2}. If some of an object's attributes are measured along different scales, then when using the Euclidean distance function, attributes with larger scales of measurement may overwhelm attributes measured on a smaller scale. To prevent this problem, the attribute values are often normalized to lie between 0 and 1.

Other distance functions may be more appropriate for some data.
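As a concrete illustration, the two distance functions above and a simple min-max normalization might be sketched as follows (the function names are our own):

```python
import math

def euclidean(i, j):
    """Euclidean distance between two p-dimensional objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan(i, j):
    """Manhattan distance between two p-dimensional objects."""
    return sum(abs(a - b) for a, b in zip(i, j))

def min_max_normalize(data):
    """Rescale every attribute to [0, 1] so attributes with larger
    measurement scales do not overwhelm the distance calculation."""
    lows = [min(col) for col in zip(*data)]
    highs = [max(col) for col in zip(*data)]
    return [tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                  for v, lo, hi in zip(row, lows, highs))
            for row in data]
```

Note that when only comparing distances, `euclidean` could omit the `math.sqrt` call without changing which of two pairs is closer.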

## *k*-means Algorithm

The *k*-means algorithm is one of a group of algorithms called **partitioning methods**. The problem of partitional clustering can be formally stated as follows: Given *n* objects in a *d*-dimensional metric space, determine a partition of the objects into *k* groups, or clusters, such that the objects in a cluster are more similar to each other than to objects in different clusters. Recall that a partition divides a set into disjoint parts that together include all members of the set. The value of *k* may or may not be specified, and a clustering criterion, typically the *squared-error criterion*, must be adopted.

The solution to this problem is straightforward. Select a clustering criterion, then for each data object select the cluster that optimizes the criterion. The *k*-means algorithm initializes *k* clusters by arbitrarily selecting one object to represent each cluster. Each of the remaining objects is assigned to a cluster, and the clustering criterion is used to calculate the cluster mean. These means are used as the new cluster centers, and each object is reassigned to the cluster that it is most similar to. This continues until there is no longer a change when the cluster centers are recalculated. The algorithm is shown in Figure 1.

*k*-means Algorithm:

1. Select *k* clusters arbitrarily.
2. Initialize cluster centers with those *k* clusters.
3. Do loop

   a) Partition by assigning or reassigning all data objects to their closest cluster center.

   b) Compute new cluster centers as the mean value of the objects in each cluster.

   Until no change in cluster center calculation.

Figure 1: *k*-means Algorithm

Figure 2: Sample Data

**Example:** Suppose we are given the data in Figure 2 as input and we choose *k*=2 and the Manhattan
distance function *dis* = |*x*_{2}-*x*_{1}|+|*y*_{2}-*y*_{1}|.
The detailed computation is as follows:

Step 1. Form *k* partitions. An initial partition can be formed by first specifying a set of *k* seed points. The seed points could be the first *k* objects or *k* objects chosen randomly from the input objects. We choose the first two objects as seed points and initialize the clusters as *C*_{1} = {(0,0)} and *C*_{2} = {(0,1)}.

Step 2. Since there is only one point in each cluster, that point is the cluster center.

Step 3a. Calculate the distance between each object and each cluster center, assigning each object to the closest cluster. For example, for the third object, (1,1), *dis*(1,3) = |1-0| + |1-0| = 2 and *dis*(2,3) = |1-0| + |1-1| = 1, so it is assigned to *C*_{2}. The fifth object, (0.5,0.5), is equidistant from both clusters, so we arbitrarily assign it to *C*_{1}.

After calculating the distance for all points, the clusters contain the following objects:

*C*_{1} = {(0,0),(1,0),(0.5,0.5)} and *C*_{2} = {(0,1),(1,1),(5,5),(5,6),(6,6),(6,5),(5.5,5.5)}.

Step 3b. Compute the new cluster centers as the mean of the objects in each cluster:

New center for *C*_{1} = (0.5,0.16); for example, its *y*-coordinate is (0+0+0.5)/3 ≈ 0.16.

New center for *C*_{2} = (4.1,4.2); for example, its *y*-coordinate is (1+1+5+6+6+5+5.5)/7 ≈ 4.2.

The new centers *C*_{1} = (0.5,0.16) and *C*_{2} = (4.1,4.2) differ from the old centers *C*_{1} = (0,0) and *C*_{2} = (0,1), so the loop is repeated.

Step 3a'. Repartition the objects using the new centers:

*C*_{1} = {(0,0),(0,1),(1,1),(1,0),(0.5,0.5)} and *C*_{2} = {(5,5),(5,6),(6,6),(6,5),(5.5,5.5)}.

Step 3b'. Compute the new centers:

New center for *C*_{1} = (0.5,0.5) and new center for *C*_{2} = (5.5,5.5).

The new centers *C*_{1} = (0.5,0.5) and *C*_{2} = (5.5,5.5) differ from the old centers *C*_{1} = (0.5,0.16) and *C*_{2} = (4.1,4.2), so the loop is repeated.

Algorithm is done: repartitioning and recomputing leave the centers the same as in Step 3b', so the algorithm is finished. The result is the clustering found in Step 3a'.
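The walkthrough above can be reproduced with a short Python sketch of the Figure 1 algorithm. It assumes, as in the example, that ties are broken toward the earlier cluster (the arbitrary assignment of (0.5,0.5) to *C*_{1}) and that no cluster ever becomes empty:

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# The ten objects of Figure 2, in input order; the first two are the seeds.
data = [(0, 0), (0, 1), (1, 1), (1, 0), (0.5, 0.5),
        (5, 5), (5, 6), (6, 6), (6, 5), (5.5, 5.5)]

def kmeans(objects, k, dis):
    centers = [obj for obj in objects[:k]]        # seed with first k objects
    while True:
        # Step a) partition: assign each object to its closest center
        # (min breaks ties toward the lower-numbered cluster).
        clusters = [[] for _ in range(k)]
        for obj in objects:
            best = min(range(k), key=lambda c: dis(obj, centers[c]))
            clusters[best].append(obj)
        # Step b) recompute each center as the mean of its cluster
        # (assumes every cluster is non-empty, as in this example).
        new = [tuple(sum(col) / len(cl) for col in zip(*cl))
               for cl in clusters]
        if new == centers:                        # no change: finished
            return clusters, centers
        centers = new
```

Running `kmeans(data, 2, manhattan)` terminates with centers (0.5,0.5) and (5.5,5.5), matching the result above.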

## Agglomerative Hierarchical Algorithm

Hierarchical algorithms can be either agglomerative (bottom-up) or divisive (top-down). All **agglomerative hierarchical clustering algorithms** begin with each object as a separate group. These groups are successively combined based on similarity until there is only one group remaining or a specified termination condition is satisfied. For *n* objects, *n*-1 mergings are done.

Hierarchical algorithms are rigid in that once a merge has been done, it cannot be undone. Although this keeps the computational costs small, it can also cause problems if an erroneous merge is done. As such, merge points need to be chosen carefully. Here we describe a simple agglomerative clustering algorithm. More complex algorithms, such as BIRCH and CURE, have been developed in an attempt to improve the clustering quality of hierarchical algorithms.

Figure 3: Sample Dendrogram

In the context of hierarchical clustering, the hierarchy graph is called a **dendrogram**. Figure 3 shows a sample dendrogram that could be produced by a hierarchical clustering algorithm. Unlike with the *k*-means algorithm, the number of clusters (*k*) is not specified in hierarchical clustering. After the hierarchy is built, the user can specify the number of clusters required, from 1 to *n*. The top level of the hierarchy represents one cluster, or *k* = 1. To examine more clusters, we simply traverse down the hierarchy.
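As an illustration of this traversal, the sketch below extracts *k* groups from a hierarchy stored as nested pairs. It is a simplification: without recorded merge heights it splits the first internal node it finds rather than undoing merges in exact reverse order, and it assumes leaves are labels rather than 2-tuples:

```python
def cut(hierarchy, k):
    """Undo top-level merges until k groups remain. A group is an
    internal node if it is a 2-tuple of subtrees; anything else is a leaf."""
    groups = [hierarchy]
    while len(groups) < k:
        for idx, g in enumerate(groups):
            if isinstance(g, tuple) and len(g) == 2:
                groups[idx:idx + 1] = [g[0], g[1]]  # replace node by children
                break
        else:
            break  # only leaves remain; cannot split further
    return groups

def members(tree):
    """Flatten a subtree into the set of objects it contains."""
    if isinstance(tree, tuple) and len(tree) == 2:
        return members(tree[0]) | members(tree[1])
    return {tree}
```

For example, cutting the hierarchy `((("x1", "x2"), "x3"), ("x4", "x5"))` at *k* = 2 yields the two top-level subtrees, with members {x1, x2, x3} and {x4, x5}.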

**Agglomerative Hierarchical Algorithm:**

**Given:**
A set *X* of objects {*x*_{1}, ..., *x*_{n}}
A distance function *dis*(*c*_{1}, *c*_{2})

**for** *i* = 1 to *n*

*c*_{i} = {*x*_{i}}

**end for**

*C* = {*c*_{1}, ..., *c*_{n}}

*l* = *n* + 1

**while** *C*.size > 1 **do**

a) (*c*_{min1}, *c*_{min2}) = minimum *dis*(*c*_{i}, *c*_{j}) for all *c*_{i}, *c*_{j} in *C*

b) remove *c*_{min1} and *c*_{min2} from *C*

c) add {*c*_{min1}, *c*_{min2}} to *C*

d) *l* = *l* + 1

**end while**

Figure 4: Agglomerative Hierarchical Algorithm

Figure 4 shows a simple hierarchical algorithm. The distance function in this algorithm can determine the similarity of clusters through many methods, including single link and group-average. **Single link** calculates the distance between two clusters as the shortest distance between any two objects contained in those clusters. **Group-average** first finds the average value of the objects in each group (i.e., cluster) and then calculates the distance between clusters as the distance between those average values.

Each object in *X* is initially used to create a cluster containing a single object. These clusters are successively merged into new clusters, which are added to the set of clusters, *C*. When a pair of clusters is merged, the original clusters are removed from *C*. Thus, the number of clusters in *C* decreases until there is only one cluster remaining, containing all the objects from *X*. The hierarchy of clusters is implicitly represented in the nested sets of *C*.
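A minimal Python sketch of Figure 4 with the single link method follows. For brevity the label counter *l* is omitted, clusters are frozensets, and a merge history stands in for the nested sets (a brute-force search, adequate for small inputs):

```python
def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def single_link(c1, c2, dis=manhattan):
    """Shortest distance between any two objects of the two clusters."""
    return min(dis(a, b) for a in c1 for b in c2)

def agglomerate(X):
    """Start from singleton clusters and repeatedly merge the closest
    pair until one cluster remains; n objects yield n-1 merges."""
    C = [frozenset([x]) for x in X]
    merges = []
    while len(C) > 1:
        # a) pair of clusters at minimum single-link distance
        cmin1, cmin2 = min(
            ((ci, cj) for idx, ci in enumerate(C) for cj in C[idx + 1:]),
            key=lambda pair: single_link(*pair))
        # b) remove the pair from C, then c) add their union
        C.remove(cmin1)
        C.remove(cmin2)
        C.append(cmin1 | cmin2)
        merges.append((cmin1, cmin2))
    return C[0], merges
```

On four points forming two obvious pairs, e.g. `[(0, 0), (0, 1), (5, 5), (5, 6)]`, the first two merges join each nearby pair before the final merge combines everything.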

**Example:** Suppose the input to the simple agglomerative algorithm described above is the set *X* shown in Figure 5, represented in matrix and graph form. We will use the Manhattan distance function and the single link method for calculating the distance between clusters. The set *X* contains *n* = 10 elements, *x*_{1} to *x*_{10}, where *x*_{1} = (0,0).

Figure 5: Sample Data

Step 1. Each element *x*_{i} of *X* is placed in a cluster *c*_{i}, where *c*_{i} is a member of the set of clusters *C*.

Step 2. *C* = {{*x*_{1}},{*x*_{2}},{*x*_{3}},{*x*_{4}},{*x*_{5}},{*x*_{6}},{*x*_{7}},{*x*_{8}},{*x*_{9}},{*x*_{10}}}

*l* = *n* + 1 = 11.

Step 3. (First iteration of while loop) *C*.size = 10

- The minimum single link distance between two clusters is 1. This occurs in two places, between *c*_{2} and *c*_{10} and between *c*_{3} and *c*_{10}. Depending on how our minimum function works, we can choose either pair of clusters. Arbitrarily, we choose the first: (*c*_{min1}, *c*_{min2}) = (*c*_{2}, *c*_{10}).
- Since *l* = 11, *c*_{11} = *c*_{2} ∪ *c*_{10} = {{*x*_{2}},{*x*_{10}}}.
- Remove *c*_{2} and *c*_{10} from *C*.
- Add *c*_{11} to *C*. *C* = {{*x*_{1}},{*x*_{3}},{*x*_{4}},{*x*_{5}},{*x*_{6}},{*x*_{7}},{*x*_{8}},{*x*_{9}},{{*x*_{2}},{*x*_{10}}}}
- Set *l* = *l* + 1 = 12.

*C*.size = 9

- The minimum single link distance between two clusters is 1. This occurs between *c*_{3} and *c*_{11}, because the distance between *x*_{3} and *x*_{10} is 1, where *x*_{10} is in *c*_{11}. (*c*_{min1}, *c*_{min2}) = (*c*_{3}, *c*_{11}).
- *c*_{12} = *c*_{3} ∪ *c*_{11} = {{{*x*_{2}},{*x*_{10}}},{*x*_{3}}}.
- Remove *c*_{3} and *c*_{11} from *C*.
- Add *c*_{12} to *C*. *C* = {{*x*_{1}},{*x*_{4}},{*x*_{5}},{*x*_{6}},{*x*_{7}},{*x*_{8}},{*x*_{9}},{{{*x*_{2}},{*x*_{10}}},{*x*_{3}}}}
- Set *l* = 13.

*C*.size = 8

- (*c*_{min1}, *c*_{min2}) = (*c*_{1}, *c*_{12})
- *C* = {{*x*_{4}},{*x*_{5}},{*x*_{6}},{*x*_{7}},{*x*_{8}},{*x*_{9}},{{{{*x*_{2}},{*x*_{10}}},{*x*_{3}}},{*x*_{1}}}}

*C*.size = 7

- (*c*_{min1}, *c*_{min2}) = (*c*_{4}, *c*_{8})
- *C* = {{*x*_{5}},{*x*_{6}},{*x*_{7}},{*x*_{9}},{{{{*x*_{2}},{*x*_{10}}},{*x*_{3}}},{*x*_{1}}},{{*x*_{4}},{*x*_{8}}}}

*C*.size = 6

- (*c*_{min1}, *c*_{min2}) = (*c*_{5}, *c*_{7})
- *C* = {{*x*_{6}},{*x*_{9}},{{{{*x*_{2}},{*x*_{10}}},{*x*_{3}}},{*x*_{1}}},{{*x*_{4}},{*x*_{8}}},{{*x*_{5}},{*x*_{7}}}}

*C*.size = 5

- (*c*_{min1}, *c*_{min2}) = (*c*_{9}, *c*_{13})
- *C* = {{*x*_{6}},{{*x*_{4}},{*x*_{8}}},{{*x*_{5}},{*x*_{7}}},{{{{{*x*_{2}},{*x*_{10}}},{*x*_{3}}},{*x*_{1}}},{*x*_{9}}}}

*C*.size = 4

- (*c*_{min1}, *c*_{min2}) = (*c*_{6}, *c*_{15})
- *C* = {{{*x*_{4}},{*x*_{8}}},{{{{{*x*_{2}},{*x*_{10}}},{*x*_{3}}},{*x*_{1}}},{*x*_{9}}},{{*x*_{6}},{{*x*_{5}},{*x*_{7}}}}}

*C*.size = 3

- (*c*_{min1}, *c*_{min2}) = (*c*_{14}, *c*_{16})
- *C* = {{{*x*_{6}},{{*x*_{5}},{*x*_{7}}}},{{{*x*_{4}},{*x*_{8}}},{{{{{*x*_{2}},{*x*_{10}}},{*x*_{3}}},{*x*_{1}}},{*x*_{9}}}}}

*C*.size = 2

- (*c*_{min1}, *c*_{min2}) = (*c*_{17}, *c*_{18})
- *C* = {{{{*x*_{4}},{*x*_{8}}},{{{{{*x*_{2}},{*x*_{10}}},{*x*_{3}}},{*x*_{1}}},{*x*_{9}}}},{{*x*_{6}},{{*x*_{5}},{*x*_{7}}}}}

*C*.size = 1. Algorithm done.

The clusters created by this algorithm can be seen in Figure 6. The corresponding dendrogram formed from the hierarchy in *C* is shown in Figure 7. The points which appeared most closely together on the graph of input data in Figure 5 are grouped together more closely in the hierarchy.

Figure 6: Graph with Clusters

Figure 7: Sample Dendrogram