Clustering
Prof. Navneet Goyal, BITS, Pilani
4/14/2017
Other Approaches to Clustering
- Density-based methods
  - Based on connectivity and density functions
  - Filter out noise, find clusters of arbitrary shape
- Grid-based methods
  - Quantize the object space into a grid structure
Density-Based Clustering Methods
- Major features:
  - Discover clusters of arbitrary shape
  - Handle noise
  - One scan
  - Need density parameters as termination condition
- Several interesting studies:
  - DBSCAN: Ester et al. (KDD’96)
  - OPTICS: Ankerst et al. (SIGMOD’99)
  - DENCLUE: Hinneburg & D. Keim (KDD’98)
  - CLIQUE: Agrawal et al. (SIGMOD’98)
Density-Based Method: DBSCAN
- Density-Based Spatial Clustering of Applications with Noise
- Clusters are dense regions of objects separated by regions of low density (noise)
- Outliers do not affect the creation of clusters
- Input:
  - MinPts: minimum number of points in any cluster
  - Eps: for each point in a cluster there must be another point in it less than this distance away
DBSCAN Density Concepts
- Eps-neighborhood: the points within Eps distance of a point.
- Core point: a point whose Eps-neighborhood is dense enough (contains at least MinPts points).
- Directly density-reachable: a point p is directly density-reachable from a point q if the distance between them is small (at most Eps) and q is a core point.
- Density-reachable: a point is density-reachable from another point if there is a path from one to the other consisting only of core points (except possibly the final point).
Density-Based Method: DBSCAN
- Eps-neighborhood: the points within Eps distance of a point: N_Eps(p) = {q in D | dist(p, q) <= Eps}
- Core point: a point whose Eps-neighborhood is dense enough (contains at least MinPts points)
- Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if
  1) p belongs to N_Eps(q)
  2) core point condition: |N_Eps(q)| >= MinPts
- [Figure: points p and q; MinPts = 5, Eps = 1 cm]
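The definitions above translate almost directly into code. The following is a minimal brute-force sketch in plain Python; the function names and the toy `points` list are illustrative assumptions, not part of the slides, and the neighborhood of p is taken to include p itself, as the set definition implies.

```python
import math

def eps_neighborhood(points, p, eps):
    """N_Eps(p): indices of all points within distance eps of point p (p included)."""
    return [q for q in range(len(points)) if math.dist(points[p], points[q]) <= eps]

def is_core(points, p, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """p is directly density-reachable from q if p is in N_Eps(q) and q is a core point."""
    return p in eps_neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)

# Hypothetical toy data: a tight group of four points plus one point near its edge.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.4, 0.5)]

print(eps_neighborhood(points, 0, eps=1.0))                        # [0, 1, 2, 3]
print(directly_density_reachable(points, 4, 3, eps=1.0, min_pts=4))  # True
print(directly_density_reachable(points, 3, 4, eps=1.0, min_pts=4))  # False
```

With Eps = 1 and MinPts = 4, point 4 is directly density-reachable from core point 3 but not the other way round, which shows that the relation is not symmetric.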
Density-Based Method: DBSCAN
- Density-reachable: a point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that p(i+1) is directly density-reachable from p(i) for all i in 1, ..., n-1
- Every point on the chain except possibly p itself must therefore be a core point
- [Figure: chain from q through p1 to p]
Density-Based Method: DBSCAN
- Density-connected: a point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
- [Figure: points p and q, both density-reachable from o]
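Density-reachability and density-connectivity can be checked with a breadth-first search that only expands core points. This is a sketch under the same assumptions as the definitions above (a neighborhood includes the point itself); the helper names and the one-dimensional toy layout are hypothetical.

```python
import math
from collections import deque

def neighbors(points, p, eps):
    """Indices of all points within distance eps of point p (p included)."""
    return [q for q in range(len(points)) if math.dist(points[p], points[q]) <= eps]

def reachable_set(points, o, eps, min_pts):
    """All points density-reachable from o (empty if o is not a core point)."""
    reached = set()
    expanded = {o}
    queue = deque([o])
    while queue:
        cur = queue.popleft()
        nbrs = neighbors(points, cur, eps)
        if len(nbrs) < min_pts:
            continue  # cur is not core, so no chain can continue through it
        for nxt in nbrs:
            reached.add(nxt)
            if nxt not in expanded:
                expanded.add(nxt)
                queue.append(nxt)
    return reached

def density_reachable(points, p, q, eps, min_pts):
    """True if p is density-reachable from q."""
    return p in reachable_set(points, q, eps, min_pts)

def density_connected(points, p, q, eps, min_pts):
    """True if some point o exists from which both p and q are density-reachable."""
    return any({p, q} <= reachable_set(points, o, eps, min_pts)
               for o in range(len(points)))

# Hypothetical toy data: five evenly spaced points on a line and one far away.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]

print(density_reachable(points, 4, 1, eps=1.0, min_pts=3))  # True
print(density_reachable(points, 1, 4, eps=1.0, min_pts=3))  # False
print(density_connected(points, 0, 4, eps=1.0, min_pts=3))  # True
```

The two end points 0 and 4 are border points: neither is density-reachable from the other, yet they are density-connected through any of the core points in between. This is why a cluster is defined through density-connectivity, which is symmetric, rather than density-reachability.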
DBSCAN
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
- [Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5]
DBSCAN: Core, Border, and Noise Points
DBSCAN: The Algorithm
1. Label all points as core, border, or noise points
2. Eliminate noise points
3. Put an edge between all core points that are within ε of each other
4. Make each group of connected core points into a separate cluster
5. Assign each border point to the cluster of one of its associated core points
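The algorithm above can be sketched in a few lines of plain Python. This is a brute-force O(n^2) illustration rather than an indexed implementation; the function name, the label convention (-1 for noise), and the toy data are assumptions made for the example, and a point counts as a member of its own neighborhood.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    labels = [-1] * n  # every point starts as noise

    # Core points within eps of each other are linked; each connected
    # group of core points becomes one cluster (depth-first traversal).
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster_id
        stack = [i]
        while stack:
            cur = stack.pop()
            for j in nbrs[cur]:
                if core[j] and labels[j] == -1:
                    labels[j] = cluster_id
                    stack.append(j)
        cluster_id += 1

    # A border point joins the cluster of one of its core neighbours
    # (here simply the first one found); points with none stay noise.
    for i in range(n):
        if not core[i]:
            for j in nbrs[i]:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels

# Hypothetical toy data: two small groups and one isolated point.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0), (50, 50)]
print(dbscan(points, eps=1.0, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

A border point that lies within ε of core points from two different clusters is assigned to whichever is found first, so border assignments can depend on the order of the data.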
DBSCAN: Core, Border and Noise Points
- [Figure panels: Original Points; Point types: core, border and noise. Eps = 10, MinPts = 4]
- Source of figure: Introduction to Data Mining by Tan et al.
When DBSCAN Works Well
- Resistant to noise
- Can handle clusters of different shapes and sizes
- [Figure panels: Original Points; Clusters]
- Source of figure: Introduction to Data Mining by Tan et al.
When DBSCAN Does NOT Work Well
- Varying densities
- High-dimensional data
- [Figure panels: Original Points; (MinPts = 4, Eps = 9.75); (MinPts = 4, Eps = 9.92)]
- Source of figure: Introduction to Data Mining by Tan et al.
DBSCAN: Determining Eps and MinPts
- The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
- Noise points have their kth nearest neighbor at a farther distance
- So, plot the sorted distance of every point to its kth nearest neighbor and look for a sharp increase
- [Figure annotation: Eps = 10, MinPts = 4]
- Source of figure: Introduction to Data Mining by Tan et al.
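The k-dist heuristic described above can be sketched as follows. The code only computes the sorted k-th nearest neighbor distances; plotting them and reading off the knee is left to the reader. The function name and the toy data are assumptions for illustration.

```python
import math

def sorted_k_distances(points, k):
    """Distance from every point to its k-th nearest neighbour, sorted ascending."""
    k_dists = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sorted(k_dists)

# Hypothetical toy data: four points on a unit square and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
k_dist = sorted_k_distances(points, k=2)
print(k_dist)
```

In this toy case the first four values are all 1.0 and the last one is above 13, so any Eps just above 1 separates the cluster from the outlier. On real data the jump is less clean, and choosing Eps at the knee of the curve remains a judgment call.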
OPTICS: Self Study
- Ordering Points To Identify Clustering Structure
- DBSCAN is sensitive to the choice of input parameters
- Parameter setting is done empirically
- The problem is more pronounced for high-dimensional data
- Clustering structures in high-dimensional data are generally not characterized by global density parameters such as Eps and MinPts
- OPTICS as a solution!
OPTICS
- Computes an augmented cluster ordering
- The ordering represents the density-based clustering structure of the data
- It contains information equivalent to the density-based clusterings obtained from a wide range of parameter settings
- The cluster ordering can be used to extract basic clustering information
OPTICS
- In DBSCAN, for a constant MinPts, clusters with high density (lower Eps) are completely contained in density-connected sets obtained with lower density (higher Eps)
- Extend DBSCAN to process a set of distance parameters Eps at the same time
- For this, the objects need to be processed in a specific order
- This order selects an object that is density-reachable wrt. the lowest Eps, so that clusters of higher density are finished first
OPTICS
- Two values need to be stored for each object:
  - Core distance
  - Reachability distance
- Core distance of p: the smallest Eps that makes p a core object. If p is not a core object, it is undefined.
- Reachability distance of q wrt. p: the greater of the core distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability distance between p and q is undefined.
OPTICS: Some Extension from DBSCAN
- Index-based: k = number of dimensions, N = 20, p = 75%, M = N(1-p) = 5
- Complexity: O(kN^2)
- Core distance
- Reachability distance: max(core-distance(o), d(o, p))
- [Figure: points p1 and p2 around o; MinPts = 5, ε = 3 cm; r(p1, o) = 2.8 cm, r(p2, o) = 4 cm]
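The two OPTICS quantities can be computed directly from their definitions. In this sketch the core distance is the distance to the MinPts-th nearest point, counting the point itself, and it is undefined (returned as `None`) when that distance exceeds the generating Eps. Counting the point itself is an assumption made here for consistency with the Eps-neighborhood definition; the function names and data are hypothetical.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Smallest radius that makes p a core point, or None if p is not core within eps."""
    # Distances from p to every point, p itself included at distance 0.
    dists = sorted(math.dist(points[p], q) for q in points)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]

def reachability_distance(points, q, p, eps, min_pts):
    """max(core-distance(p), d(p, q)), or None if p is not a core point."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[p], points[q]))

# Hypothetical toy data.
points = [(0, 0), (1, 0), (0, 2), (3, 0), (10, 0)]

print(core_distance(points, 0, eps=3.0, min_pts=3))             # 2.0
print(reachability_distance(points, 1, 0, eps=3.0, min_pts=3))  # 2.0
print(reachability_distance(points, 3, 0, eps=3.0, min_pts=3))  # 3.0
print(reachability_distance(points, 0, 4, eps=3.0, min_pts=3))  # None
```

Point 1 lies closer to point 0 than the core distance, so its reachability distance is clamped up to the core distance; point 3 lies farther away, so its reachability distance is the plain Euclidean distance.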
Density-based Clustering Contd…
- Efficiency issues with DBSCAN
- Finding clusters in subspaces
- Modeling density accurately
- We now look at:
  - Grid-based clustering
    - Partitions the data space into grid cells and forms clusters from cells that are dense enough
    - Efficient approach for low-dimensional data
  - Subspace clustering
    - Finds clusters in subsets of all dimensions
    - 2^n - 1 subspaces to be searched for n dimensions!
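The count of 2^n - 1 is the number of non-empty subsets of n dimensions. A short sketch that enumerates them makes the exponential growth concrete; the function name and dimension labels are illustrative assumptions.

```python
from itertools import combinations

def nonempty_subspaces(dims):
    """Every non-empty subset of the given dimensions, smallest subsets first."""
    return [c for r in range(1, len(dims) + 1) for c in combinations(dims, r)]

subspaces = nonempty_subspaces(["x", "y", "z"])
print(len(subspaces))  # 7, i.e. 2**3 - 1
print(subspaces)

# The count doubles with every added dimension.
print(len(nonempty_subspaces(range(10))))  # 1023, i.e. 2**10 - 1
```

Exhaustive enumeration is therefore only feasible for a handful of dimensions, which is why subspace clustering algorithms prune the search rather than visit every subspace.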
Grid-based Clustering
- GRIDCLUS
- STING
- CLIQUE
- WaveCluster
Grid-based Clustering
- Significant reduction in time complexity, especially for large data sets
- Number of cells << number of data points
- Instead of clustering the data points, the neighborhoods surrounding the data points are clustered
Grid-based Clustering
Steps involved:
1. Creating the grid structure
2. Calculating the cell density for each cell
3. Sorting the cells according to their densities
4. Identifying cluster centers
5. Traversal of neighborhood cells
Grid-based Clustering
Algorithm:
1. Define a set of grid cells
2. Assign objects to the appropriate grid cells and compute the density of each cell
3. Eliminate cells having density below a specified threshold
4. Form clusters from contiguous groups of dense cells
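The algorithm above can be sketched with a dictionary of occupied cells and a flood fill over dense neighbors. The sketch assumes equal-width cells, uses the raw point count as the density (valid because all cells have the same volume), and treats only cells that share a face as contiguous; the function name, parameters, and data are hypothetical.

```python
from collections import Counter

def grid_cluster(points, cell_size, threshold):
    """Return a list of clusters, each a set of dense cell coordinates."""
    # Assign every point to an equal-width cell and count points per cell.
    # Only occupied cells ever appear in the counter.
    counts = Counter(tuple(int(c // cell_size) for c in p) for p in points)

    # Drop cells whose count is below the density threshold.
    dense = {cell for cell, n in counts.items() if n >= threshold}

    # Group contiguous dense cells with a flood fill over face-sharing neighbours.
    clusters = []
    unvisited = set(dense)
    while unvisited:
        start = unvisited.pop()
        cluster = {start}
        stack = [start]
        while stack:
            cell = stack.pop()
            for axis in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if nb in unvisited:
                        unvisited.remove(nb)
                        cluster.add(nb)
                        stack.append(nb)
        clusters.append(cluster)
    return clusters

# Hypothetical toy data: two dense regions and one sparse cell.
points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.5),
          (5.2, 5.5), (5.6, 5.1), (3.5, 3.5)]
clusters = grid_cluster(points, cell_size=1.0, threshold=2)
print(sorted(sorted(c) for c in clusters))  # [[(0, 0), (1, 0)], [(5, 5)]]
```

The clusters are returned as sets of cells; mapping each point back to its cell gives the point-level assignment, and points in discarded cells are left unclustered.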
Grid-based Clustering
Defining Grid Cells
- Key step
- Equal-width intervals along all dimensions
  - Each cell has the same volume
  - The density of a cell is defined as the number of points in the cell
- Alternatively, an equi-depth approach can be used
  - Equal number of points in each interval
  - Also called equal-frequency discretization
- MAFIA: a subspace clustering algorithm that initially uses equal-width intervals and then combines intervals of similar density
- The definition of the grid has a strong impact on the clustering results
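The difference between equal-width and equal-depth intervals is easiest to see on a single skewed attribute. The sketch below computes interval edges both ways; the function names and the sample values are assumptions for illustration, and the equal-depth version uses a simple rank-based cut that ignores ties.

```python
def equal_width_edges(values, k):
    """k intervals of identical width spanning the range of the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def equal_depth_edges(values, k):
    """k intervals holding roughly len(values) / k points each."""
    s = sorted(values)
    n = len(s)
    return [s[0]] + [s[(i * n) // k] for i in range(1, k)] + [s[-1]]

# Hypothetical skewed attribute: seven small values and one large one.
values = [1, 2, 3, 4, 5, 6, 7, 100]

print(equal_width_edges(values, 4))  # [1.0, 25.75, 50.5, 75.25, 100.0]
print(equal_depth_edges(values, 4))  # [1, 3, 5, 7, 100]
```

With equal-width edges, seven of the eight values fall into the first interval and two intervals stay empty; with equal-depth edges every interval holds two values. This is one concrete way in which the grid definition changes which cells end up dense.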
Grid-based Clustering
Density of Grid Cells
- Number of points in the cell divided by the volume of the cell
- Analogous examples:
  - Number of road signs per km
  - Number of tigers per sq. km
  - Number of molecules of a gas per cu. cm
- Source of figure: Introduction to Data Mining by Tan et al.
Grid-based Clustering
Forming Clusters from Dense Grid Cells
- Relatively straightforward
- In the example on the previous slide: 2 clusters
- Define adjacency: 4 or 8 adjacent cells in 2-D?
- An efficient technique is needed to find adjacent cells (only occupied cells are stored)
- Partially empty cells on the fringe of clusters are not dense and will be discarded
- 4 parts of the larger cluster will be lost if the threshold is 9
- Source of figure: Introduction to Data Mining by Tan et al.
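The choice between 4 and 8 adjacent cells in 2-D, and the lookup of neighbors when only occupied cells are stored, can be sketched as follows. Storing the dense cells in a set makes each neighbor check a constant-time lookup on average. The function names and example cells are hypothetical.

```python
from itertools import product

def adjacent_cells(cell, connectivity=4):
    """Coordinates of the 4 (edge-sharing) or 8 (edge- or corner-sharing) neighbours."""
    x, y = cell
    if connectivity == 4:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif connectivity == 8:
        offsets = [(dx, dy) for dx, dy in product((-1, 0, 1), repeat=2)
                   if (dx, dy) != (0, 0)]
    else:
        raise ValueError("connectivity must be 4 or 8")
    return [(x + dx, y + dy) for dx, dy in offsets]

def occupied_neighbors(cell, dense_cells, connectivity=4):
    """Neighbours of cell that are present in the set of stored dense cells."""
    return [c for c in adjacent_cells(cell, connectivity) if c in dense_cells]

# Hypothetical set of dense cells; empty cells are simply absent.
dense_cells = {(0, 0), (0, 1), (1, 1)}

print(occupied_neighbors((0, 0), dense_cells, connectivity=4))  # [(0, 1)]
print(occupied_neighbors((0, 0), dense_cells, connectivity=8))  # [(0, 1), (1, 1)]
```

Under 8-adjacency the diagonal cell (1, 1) counts as a neighbor of (0, 0), so cells that touch only at a corner are merged into one cluster; under 4-adjacency they are not.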
Grid-based Clustering
Strengths & Limitations
- Strengths:
  - A single pass is enough to determine the cell and the count of every cell
  - Grid cells are created only for non-empty cells
  - Complexity of O(m) to O(m log m)
- Limitations:
  - Grids are rectangular
  - Curse of dimensionality
  - Grid cells containing just one element
- Source of figure: Introduction to Data Mining by Tan et al.
Subspace Clustering
- Clustering algorithms considered so far take into account all attributes
- Consider only a subspace of the data
- Source of figure: Introduction to Data Mining by Tan et al.
Subspace Clustering
- Source of figure: Introduction to Data Mining by Tan et al.
Some Research Directions
- Ensemble clustering
- Parallelizing clustering algorithms to leverage a cluster
Ensemble Clustering
- Similar to ensemble classification
- Also called consensus clustering
- Obtain different clustering solutions and then reconcile them
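The slides do not say how the solutions are reconciled. One common approach, shown here purely as an illustration, is a co-association matrix: count how often each pair of points lands in the same cluster across the base solutions, then link pairs that agree in a majority of them. The function names, the 0.5 agreement threshold, and the example labelings are all assumptions.

```python
def co_association(labelings):
    """Fraction of base clusterings that put points i and j in the same cluster."""
    n = len(labelings[0])
    m = len(labelings)
    return [[sum(lab[i] == lab[j] for lab in labelings) / m for j in range(n)]
            for i in range(n)]

def consensus_clusters(labelings, min_agreement=0.5):
    """Link points co-clustered in more than min_agreement of the solutions
    and return the connected groups as consensus labels."""
    co = co_association(labelings)
    n = len(co)
    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = cluster_id
        stack = [i]
        while stack:
            cur = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[cur][j] > min_agreement:
                    labels[j] = cluster_id
                    stack.append(j)
        cluster_id += 1
    return labels

# Three hypothetical base clusterings of five points. Label names are arbitrary:
# only which points share a label within one solution matters.
labelings = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
print(consensus_clusters(labelings))  # [0, 0, 1, 1, 1]
```

Point 2 is grouped with points 0 and 1 in only one of the three solutions but with points 3 and 4 in two of them, so the majority vote places it in the second consensus cluster.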
Parallelizing Clustering Algorithms
- Parallelize to leverage a cluster
- Two levels of parallelism:
  - Node level
  - Core level
- The two levels are not necessarily orthogonal
- Hybrid parallelism is non-trivial
- Programming environment:
  - MPI
  - OpenMP