Notes 06-5: Density Methods

 

A density method groups neighbouring objects into clusters based upon density conditions (i.e., such as the number of objects within a given radius of each object in the cluster exceeding some threshold).

 

The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) method grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in databases with noise (i.e., outliers).

 

Example – shape of clusters that DBSCAN can find

 

DIAGRAM = Clustering.G.1.a

 

The basic idea is that for each instance in a cluster, the neighbourhood of a given radius has to contain at least a minimum number of instances (i.e., the density in the neighbourhood has to exceed some threshold).

 

A discussion of DBSCAN relies on a number of concepts:

 

The ε-neighbourhood of an instance (i.e., a point) consists of the instances (i.e., points) within a radius ε of the instance.

 

A core instance is an instance whose ε-neighbourhood contains at least some minimum number of instances, minPts.

 

Example – core instances

 

DIAGRAM = Clustering.G.1.c1

 

Given a set of instances, D, an instance i is directly density-reachable from instances j if i is within the ε-neighbourhood of j, and j is a core instance.

 

Example – directly density-reachable instances

 

DIAGRAM = Clustering.G.1.c2

 

An instance i is density-reachable from instance j with respect to ε and minPts in a set of instances, D, if there is a chain of instances i1, i2, …, in, i1 = j and in = i, such that ik+1 is directly density-reachable from ik with respect to ε and minPts, for 1 ≤ kn, ik D.

 

Example – density-reachable instances

 

DIAGRAM = Clustering.G.1.c3

 

An instance i is density-connected to instance j with respect to ε and minPts in a set of instances, D, if there is an instance p D such that both i and j are density-reachable from p with respect to ε and minPts.

 

Example – density-connected instances

 

DIAGRAM = Clustering.G.1.c4

 

A cluster K with respect to ε and minPts is a non-empty subset of a set of instances, D, that satisfies the following conditions:

 

·        For all instances i and j, if i K and j is density reachable from i with respect to ε and minPts, then j K.

 

·        For all instances i, j K, i is density-connected to j with respect to ε and minPts.

 

If K1, …, Kk are the clusters of a set of instances, D, with respect to εi and minPtsi, i = 1, …, k, then noise is the set of instances in D not belonging to any Ki.

 

The general approach used by DBSCAN

 

DBSCAN searches for clusters by checking the ε-neighbourhood of each instance in the database.

 

If the ε-neighbourhood of an instance i contains more than minPts, a new cluster with i as the core instance is created.

 

Directly density-reachable instances from these core instances are iteratively collected (may involve merging of a few density-reachable clusters).

 

The process terminates when no new instances are added to any cluster.

 

The DBSCAN method

 

Algorithm: DBSCAN

Input: D = a set of n instances of the form (p1, p2, …, pm)

       ε = the radius of the neighbourhood for each instance

       minPts = the minimum number of instances in an ε-neighbourhood required for an instance to be a core instance

Output: K = a set of clusters

Method:

1.    clusterID = 1

2.    for i = 1 to |D|

3.        currentInstance = GetNextInstance (D, i)

4.        if currentInstance.clusterID == UNCLASSIFIED

5.            if ExpandCluster (D, currentInstance, clusterID, ε, minPts)

6.                clusterID ++

7.    for i = 1 to |D|

8.        currentInstance = GetNextInstance (D, i)

9.        if currentInstance.clusterID != NOISE

10.          KcurrentInstance.clusterID = KcurrentInstance.clusterID currentInstance

11.  for i = 1 to clusterID - 1

12.      K = K Ki

 

Algorithm: ExpandCluster

Input: D = a set of n instances of the form (p1, p2, …, pm)

       currentInstance = an instance in D

       clusterID = identifier for the cluster currently being expanded

       ε = the radius of the neighbourhood for each instance

       minPts = the minimum number of instances in an ε-neighbourhood required for an instance to be a core instance

Output: TRUE if the cluster was expanded, otherwise, FALSE

Method:

1.    Dseeds = InstancesIn_ε_Neighbourhood (currentInstance, D, ε)

2.    if minPts > |Dseeds|

3.        ChangeClusterID (D, currentInstance, NOISE)

4.        return FALSE

5.    else

6.        for i = 1 to |Dseeds|

7.            currentSeed = GetNextInstance (Dseeds, i)

8.            ChangeClusterID (D, currentSeed, clusterID)

9.        Delete (currentInstance, Dseeds)

10.      while | Dseeds | > 0

11.          currentInstance = GetFirstInstance (Dseeds)

12.          ε_Neighbourhood = InstancesIn_ε_Neighbourhood (currentInstance, D, ε)

13.          if |ε_Neighbourhood| >= minPts

14.              for i = 1 to |ε_Neighbourhood|

15.                  candidateInstance = GetNextInstance (ε_Neighbourhood, i)

16.                  if candidateInstance.clusterID ε {UNCLASSIFIED, NOISE}

17.                      if candidateInstance.clusterID == UNCLASSIFIED

18.                          Append (Dseeds, candidateInstance)

19.                      ChangeClusterID (D, candidateInstance, clusterID)

20.          Delete (Dseeds, currentInstance)

21.      return TRUE

 

Example – DBSCAN

 

Instance

x

y

clusterID

1

2

3

Unclassified

2

2

5

Unclassified

3

2

6

Unclassified

4

3

2

Unclassified

5

3

4

Unclassified

6

3

6

Unclassified

7

3

7

Unclassified

8

4

1

Unclassified

9

4

4

Unclassified

10

4

5

Unclassified

11

4

7

Unclassified

12

5

2

Unclassified

13

5

6

Unclassified

14

5

9

Unclassified

15

6

2

Unclassified

16

6

5

Unclassified

17

6

7

Unclassified

18

6

9

Unclassified

19

7

6

Unclassified

20

7

7

Unclassified

21

8

6

Unclassified

 

Assume ε = 1 and minPts = 3.

 

Step 1: Initialize clusterID.

 

Step 2: Initialize i = 1. Since i <= |D|, go to Step 3.

 

Step 3: currentInstance = (2,3,U)

 

Step 4: Since currentInstance.clusterID = UNCLASSIFIED, go to Step 5.

 

Step 5: ExpandCluster (D,(2,3,U),1,1,3)

 

Step 5.1: Determine the instances within the ε-neighbourhood of currentInstance = (2,3,U). Thus, Dseeds = {(2,3,U)}.

 

Step 5.2: Since minPts > |Dseeds|, there are not sufficient instances within the ε-neighbourhood, so go to Step 5.3.

 

Step 5.3: At this point, currentInstances is considered to be noise. Thus, currentInstance = (2,3,N).

 

Step 5.4: The cluster could not be expanded around currentInstance, so return FALSE and go back to Step 2.

 

Step 2: Increment i = 2. Since i <= |D|, go to Step 3.

 

Step 3: currentInstance = (2,5,U).

 

Step 4: Since currentInstance.clusterID = UNCLASSIFIED, go to Step 5.

 

Step 5: ExpandCluster (D,(2,5,U),1,1,3).

 

Step 5.1: Determine the instances within the ε-neighbourhood of currentInstance = (2,5,U). Thus, Dseeds = {(2,5,U), (2,6,U)}.

 

Step 5.2: Since minPts > |Dseeds|, there are not sufficient instances within the ε-neighbourhood, so go to Step 5.3.

 

Step 5.3: At this point, currentInstance is considered to be noise. Thus, currentInstance = (2,5,N).

 

Step 5.4: The cluster could not be expanded around currentInstance, so return FALSE and go back to Step 2.

 

Step 2: Increment i = 3. Since i <= |D|, go to Step 3.

 

Step 3: currentInstance = (2,6,U).

 

Step 4: Since currentInstance.clusterID = UNCLASSIFIED, go to Step 5.

 

Step 5: ExpandCluster (D,(2,6,U),1,1,3).

 

Step 5.1: Determine the instances within the ε-neighbourhood of currentInstance = (2,6,U). Thus, Dseeds = {(2,6,U), (2,5,N), (3,6,U)}.

 

Step 5.2: Since minPts = |Dseeds|, there are sufficient instances within the ε-neighbourhood, so go to Step 5.6.

 

Step 5.6: Initialize i = 1. Since i <= |Dseeds|, go to step 5.7.

 

Step 5.7: currentSeed = (2,6,U).

 

Step 5.8: Since clusterID = 1, currentSeed = (2,6,1).

 

Step 5.6: Increment i = 2. Since i <= |Dseeds|, go to step 5.7.

 

Step 5.7: currentSeed = (2,5,N).

 

Step 5.8: Since clusterID = 1, currentSeed = (2,5,1).

 

Step 5.6: Increment i = 3. Since i <= |Dseeds|, go to step 5.7.

 

Step 5.7: currentSeed = (3,6,U).

 

Step 5.8: Since clusterID = 1, currentSeed = (3,6,1).

 

Step 5.9: currentInstance = (2,6,1) is removed from Dseeds. Thus, Dseeds = {(2,5,1), (3,6,1)}.

 

Step 5.10: Since |Dseeds| > 0, go to step 5.11.

 

Step 5.11: currentInstance = (2,5,1).

 

Step 5.12: ε_neighbourhood = {(2,5,1)}.

 

Step 5.13: Since | ε_neighbourhood| < minPts, go to step 5.20.

 

Step 5.20: currentInstance = (2,5,1) is removed from Dseeds. Thus, Dseeds = {(3,6,1)}. Go back to step 5.10.

 

Step 5.10: Since |Dseeds| > 0, go to step 5.11.

 

Step 5.11: currentInstance = (3,6,1).

 

Step 5.12: ε_neighbourhood = {(3,6,1), (2,6,1), (3,7,U)}.

 

Step 5.13: Since |ε_neighbourhood| >= minPts, go to step 5.14.

 

Step 5.14: Initialize i = 1. Since i <= |ε_neighbourhood|, go to step 5.15.

 

Step 5.15: candidateInstance = (3,6,1).

 

Step 5.16: Since candidateInstance.ClusterID = 1, go back to step 5.14.

 

Step 5.14: Increment i = 2. Since i <= |ε_neighbourhood|, go to step 5.15.

 

Step 5.15: candidateInstance = (2,6,1).

 

Step 5.16: Since candidateInstance.ClusterID = 1, go back to step 5.14.

 

Step 5.14: Increment i = 3. Since i <= |ε_neighbourhood|, go to step 5.15.

 

Step 5.15: candidateInstance = (3,7,U).

 

Step 5.16: Since candidateInstance.ClusterID = UNCLASSIFIED, go to step 5.17.

 

Step 5.17: Since candidateInstance.ClusterID = UNCLASSIFIED, go to step 5.18.

 

Step 5.18: candidateInstance has the potential to expand the cluster, so Dseeds = {(3,6,1), (3,7,U)}.

 

Step 5.19: Since clusterID = 1, candidateInstance = (3,7,1).

 

Step 5.20: currentInstance = (3,6,1) is removed from Dseeds. Thus, Dseeds = {(3,7,1)}. Go back to step 5.10.

 

Step 5.10: Since |Dseeds| > 0, go to step 5.11.

 

Step 5.11: currentInstance = (3,7,1).

 

Step 5.12: ε_neighbourhood = {(3,7,1), (3,6,1), (4,7,U)}.

 

Step 5.13: Since |ε_neighbourhood| >= minPts, go to step 5.14 and repeat step 5.14 to 5.16 for (3,7,1) and (3,6,1). As a result of this, there is no change to (3,7,1), (3,6,1) and Dseeds.

 

Step 5.14: Initialize i = 3. Since i <= |ε_neighbourhood|, go to step 5.15.

 

Step 5.15: candidateInstance = (4,7,U).

 

Step 5.16: Since candidateInstance.ClusterID = UNCLASSIFIED, go to step 5.17.

 

Step 5.17: Since candidateInstance.ClusterID = UNCLASSIFIED, go to step 5.18.

 

Step 5.18: candidateInstance has the potential to expand the cluster, so Dseeds = {(3,7,1), (4,7,U)}.

 

Step 5.19: Since clusterID = 1, candidateInstance = (4,7,1).

 

Step 5.20: currentInstance = (3,7,1) is removed from Dseeds. Thus, Dseeds = {(4,7,1)}. Go back to step 5.10.

 

Step 5.10: Since |Dseeds| > 0, go to step 5.11.

 

Step 5.11: currentInstance = (4,7,1).

 

Step 5.12: ε_neighbourhood = {(4,7,1), (3,7,1)}.

 

Step 5.13: Since |ε_neighbourhood| < minPts, go to step 5.20.

 

Step 5.20: currentInstance = (4,7,1) is removed from Dseeds. Thus, Dseeds = Ø. Go back to step 5.10.

 

Step 5.10: Since |Dseeds| = 0, go to step 5.21.

 

Step 5.21: Return TRUE.

 

Step 6: Increment clusterID = 2. Go back to step 2.

 

Step 2: Increment i = 4. Since i <= |D|, go to Step 3.

 

Step 3: currentInstance = (3,2,U).

 

Step 4: Since currentInstance.clusterID = UNCLASSIFIED, go to Step 5.

 

Step 5: ExpandCluster (D,(3,2,U),2,1,3).

 

Step 5.1: Determine the instances within the ε-neighbourhood of currentInstance = (3,2,U). Thus, Dseeds = {(3,2,U)}.

 

Step 5.2: Since minPts > |Dseeds|, there are not sufficient instance within the ε-neighbourhood, so go to Step 5.3.

 

Step 5.3: At this point, currentInstance is considered to be noise. Thus, currentInstance = (3,2,N).

 

Step 5.4: The cluster could not be expanded around currentInstance, so return FALSE and go back to Step 2.

 

Step 2: Increment i = 5. Since i <= |D|, go to Step 3.

 

Step 3: currentInstance = (3,4,U).

 

Step 4: Since currentInstance.clusterID = UNCLASSIFIED, go to Step 5.

 

Step 5: ExpandCluster (D,(3,4,U),2,1,3).

 

Step 5.1: Determine the instances within the ε-neighbourhood of currentInstance = (3,4,U). Thus, Dseeds = {(3,4,U), (4,4,U)}.

 

Step 5.2: Since minPts > |Dseeds|, there are not sufficient instances within the ε-neighbourhood, so go to Step 5.3.

 

Step 5.3: At this point, currentInstance is considered to be noise. Thus, currentInstance = (3,4,N).

 

Step 5.4: The cluster could not be expanded around currentInstance, so return FALSE and go back to Step 2.

 

Step 2: Increment i = 6 and i = 7 and repeat steps 3 and 4 for (3,6,1) and (3,7,1) which are have already been placed in cluster 1.

 

Step 2: Increment i = 8 and do steps 2 to 5.4 for (4,1,N) which has already been classified as noise.

 

Step 2: Increment i = 9. Since i <= |D|, go to Step 3.

 

Step 3: currentInstance = (4,4,U).

 

Step 4: Since currentInstance.clusterID = UNCLASSIFIED, go to Step 5.

 

Step 5: ExpandCluster (D,(4,4,U),2,1,3).

 

Step 5.1: Determine the instances within the ε-neighbourhood of currentInstance = (4,4,U). Thus, Dseeds = {(4,4,U), (3,4,N), (4,5,U)}.

 

Step 5.2: Since minPts = |Dseeds|, there are sufficient instances within the ε-neighbourhood, so go to Step 5.6.

 

Step 5.6: Initialize i = 1. Since i <= |Dseeds|, go to step 5.7.

 

Step 5.7: currentSeed = (4,4,U).

 

Step 5.8: Since clusterID = 2, currentSeed = (4,4,2).

 

Step 5.6: Increment i = 2. Since i <= |Dseeds|, go to step 5.7.

 

Step 5.7: currentSeed = (3,4,N).

 

Step 5.8: Since clusterID = 2, currentSeed = (3,4,2).

 

Step 5.6: Increment i = 3. Since i <= |Dseeds|, go to step 5.7.

 

Step 5.7: currentSeed = (4,5,U).

 

Step 5.8: Since clusterID = 1, currentSeed = (4,5,2).

.

.

.

 

Step 7: Initialize i = 1. Since i <= |D|, go to Step 8.

 

Step 8: currentInstance = (2,3,N).

 

Step 9: Since currentInstance.clusterID = NOISE, go back to step 7.

 

Step 7: Increment i = 2. Since i <= |D|, go to Step 8.

 

Step 8: currentInstance = (2,5,1).

 

Step 9: Since currentInstance.clusterID = 1, go to step 10.

 

Step 10: K1 = {(2,5,1)}. Go back to step 7.

 

Step 7: Increment i = 3. Since i <= |D|, go to Step 8.

 

Step 8: currentInstance = (2,6,1).

 

Step 9: Since currentInstance.clusterID = 1, go to step 10.

 

Step 10: K1 = {(2,5,1), (2,6,1)}. Go back to step 7.

.

.

.

 

Step 7: Increment i = 5. Since i <= |D|, go to Step 8.

 

Step 8: currentInstance = (3,4,2).

 

Step 9: Since currentInstance.clusterID = 2, go to step 10.

 

Step 10: K2 = {(3,4,2)}. Go back to step 7.

.

.

.

 

Step 10: K1 = {(2,5,1), (2,6,1), (3,6,1)}. Go back to step 7.

.

.

.

 

Step 10: K2 = {(3,4,2), (4,4,2)}. Go back to step 7.

.

.

.

 

Step 11: Initialize i = 1. Since i <= clusterID – 1, go to step 12.

 

Step 12: K = {{2,5,1), (2,6,1), (3,6,1), (3,7,1), (4,7,1)}}. Go back to step 11.

 

Step 11: Initialize i = 2. Since i <= clusterID – 1, go to step 12.

 

Step 12: K = {{2,5,1), (2,6,1), (3,6,1), (3,7,1), (4,7,1)}, {(3,4,2), (4,4,2), (4,5,2)}}. Go back to step 11.

.

.

.

 

The average run time complexity of DBSCAN is O(n log n).

 

The experimental results reported in the original paper are all incorrect because the authors were unaware of a serious bug in their program (they weren’t clustering all the points in the dataset).

 

The OPTICS (Ordering Points To Identify the Clustering Structure) method extends the DBSCAN method to consider a set of distance parameter values (i.e., a set of εs) in order to generate a set of clusters whose densities may be different.

 

Example – Clusters with different density parameters

 

DIAGRAM = Clustering.G.2.a

 

Example – Nested clusters

 

DIAGRAM = Clustering.G.2.b