Clustering Prof. Navneet Goyal BITS, Pilani


Clustering. Prof. Navneet Goyal, BITS, Pilani. 4/14/2017

Other Approaches to Clustering
- Density-based methods
  - Based on connectivity and density functions
  - Filter out noise, find clusters of arbitrary shape
- Grid-based methods
  - Quantize the object space into a grid structure

Density-Based Clustering Methods
Major features:
- Discover clusters of arbitrary shape
- Handle noise
- One scan
- Need density parameters as termination condition
Several interesting studies:
- DBSCAN: Ester, et al. (KDD’96)
- OPTICS: Ankerst, et al. (SIGMOD’99)
- DENCLUE: Hinneburg & D. Keim (KDD’98)
- CLIQUE: Agrawal, et al. (SIGMOD’98)

Density-Based Method: DBSCAN
- Density-Based Spatial Clustering of Applications with Noise
- Clusters are dense regions of objects separated by regions of low density (noise)
- Outliers will not affect the creation of clusters
- Input:
  - MinPts: minimum number of points in any cluster
  - Eps: for each point in a cluster there must be another point in it less than this distance away

DBSCAN Density Concepts
- Eps-neighborhood: the points within Eps distance of a point.
- Core point: a point whose Eps-neighborhood is dense enough (contains at least MinPts points).
- Directly density-reachable: a point p is directly density-reachable from a point q if the distance between them is small (at most Eps) and q is a core point.
- Density-reachable: a point is density-reachable from another point if there is a chain of directly density-reachable steps leading from one to the other, so every point on the path except possibly the last is a core point.

Density-Based Method: DBSCAN
- Eps-neighborhood: the points within Eps distance of a point.
  N_Eps(p) = {q in D | dist(p, q) <= Eps}
- Core point: a point whose Eps-neighborhood is dense enough (at least MinPts points).
- Directly density-reachable: a point p is directly density-reachable from a point q if the distance between them is small (at most Eps) and q is a core point. Formally, p is directly density-reachable from q wrt. Eps, MinPts if
  1) p belongs to N_Eps(q)
  2) core point condition: |N_Eps(q)| >= MinPts
[Figure: points p and q; MinPts = 5, Eps = 1 cm]
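The definitions above translate almost directly into code. The following is a minimal sketch, assuming points are coordinate tuples compared with Euclidean distance; the function names and the sample data are illustrative and do not come from the slides.

```python
from math import dist

def eps_neighborhood(points, p, eps):
    """Indices q with dist(points[p], points[q]) <= eps (p itself is included)."""
    return [q for q in range(len(points)) if dist(points[p], points[q]) <= eps]

def is_core(points, p, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    return p in eps_neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)

# Hypothetical sample data: a small dense group, one fringe point, one far point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.2, 0.5), (5.0, 5.0)]
print(directly_density_reachable(points, 4, 3, eps=1.0, min_pts=4))  # True
print(directly_density_reachable(points, 3, 4, eps=1.0, min_pts=4))  # False
```

The two calls show that the relation is not symmetric: point 4 lies in the neighborhood of core point 3, but point 4 is not itself a core point.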

Density-Based Method: DBSCAN
- Density-reachable: a point is density-reachable from another point if there is a chain of directly density-reachable steps leading from one to the other.
- Formally, a point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi for all i in 1, …, n-1.
[Figure: points p, q and intermediate point p1]

Density-Based Method: DBSCAN
- Density-connected: a point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
[Figure: points p, q and o]
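Density-reachability and density-connectivity can be checked with a simple graph traversal. This is a sketch under assumptions: Euclidean distance, brute-force neighborhood search, and illustrative function names that are not from the slides.

```python
from math import dist

def reachable_from(points, q, eps, min_pts):
    """Set of indices density-reachable from q (empty if q is not a core point)."""
    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    if len(neighbors(q)) < min_pts:
        return set()
    seen = {q}
    stack = [q]
    while stack:
        u = stack.pop()
        nb = neighbors(u)
        if len(nb) < min_pts:
            continue  # u was reached but is not core, so the chain cannot continue through it
        for v in nb:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def density_connected(points, p, q, eps, min_pts):
    """True if some point o exists from which both p and q are density-reachable."""
    for o in range(len(points)):
        reached = reachable_from(points, o, eps, min_pts)
        if p in reached and q in reached:
            return True
    return False

# Hypothetical data: five points on a line, unit spacing.
line = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
print(density_connected(line, 0, 4, eps=1.0, min_pts=3))  # True
```

In this example the two end points are border points, so neither is density-reachable from the other, yet both are density-reachable from the middle point and are therefore density-connected.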

DBSCAN
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border and outlier points; Eps = 1 cm, MinPts = 5]

DBSCAN: Core, Border, and Noise Points

DBSCAN: The Algorithm
1. Label all points as core, border, or noise points
2. Eliminate noise points
3. Put an edge between all core points that are within ε (Eps) of each other
4. Make each group of connected core points into a separate cluster
5. Assign each border point to the cluster of one of its associated core points
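The five steps above can be sketched in a few lines. This is an illustrative brute-force version (quadratic neighborhood search, Euclidean distance), not the implementation from the DBSCAN paper; the function name, the label -1 for noise, and the rule of assigning a border point to its first core neighbor are assumptions made for the example.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    n = len(points)
    # Step 1: neighborhoods decide which points are core points.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n  # Step 2: anything never relabeled stays noise.
    cluster = 0
    # Steps 3 and 4: connected components over core points within eps of each other.
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            u = stack.pop()
            for v in neighbors[u]:
                if core[v] and labels[v] == -1:
                    labels[v] = cluster
                    stack.append(v)
        cluster += 1
    # Step 5: a border point joins the cluster of one of its core neighbors.
    for i in range(n):
        if not core[i]:
            for v in neighbors[i]:
                if core[v]:
                    labels[i] = labels[v]
                    break
    return labels

# Hypothetical data: two groups and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
        (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(data, eps=1.1, min_pts=3))  # [0, 0, 0, 0, 0, 1, 1, 1, -1]
```

A border point that has core neighbors in two different clusters could go to either; this sketch simply takes the first one found.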

DBSCAN: Core, Border and Noise Points
[Figure: original points, and point types (core, border and noise) for Eps = 10, MinPts = 4. Source of figure: Introduction to Data Mining by Tan et al.]

When DBSCAN Works Well
- Resistant to noise
- Can handle clusters of different shapes and sizes
[Figure: original points and the clusters found. Source of figure: Introduction to Data Mining by Tan et al.]

When DBSCAN Does NOT Work Well
- Varying densities
- High-dimensional data
[Figure: original points, and the clusterings for (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92). Source of figure: Introduction to Data Mining by Tan et al.]

DBSCAN: Determining EPS and MinPts
- The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
- Noise points have their kth nearest neighbor at a farther distance
- So, plot the sorted distance of every point to its kth nearest neighbor
[Figure: sorted k-distance plot; Eps = 10, MinPts = 4. Source of figure: Introduction to Data Mining by Tan et al.]
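The values behind such a plot are easy to compute. Here is a minimal brute-force sketch, assuming Euclidean distance and k smaller than the number of points; the function name and sample data are illustrative.

```python
from math import dist

def sorted_k_distances(points, k):
    """Distance from every point to its k-th nearest neighbor, sorted ascending."""
    k_dists = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(others[k - 1])
    return sorted(k_dists)

# Hypothetical data: a tight group of four points plus one far-away point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(sorted_k_distances(sample, k=2))
```

Plotting the returned list against its index gives the curve described above; a sharp rise near the end marks points whose kth neighbor is far away, and the distance just before that rise is a candidate for Eps.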

OPTICS: Self Study
Ordering Points To Identify Clustering Structure
- DBSCAN is sensitive to the choice of input parameters
- Parameter setting is done empirically
- The problem is more pronounced for high-dimensional data
- Clustering structures in high-dimensional data are generally not characterized by global density parameters like Eps and MinPts
- OPTICS as a solution!

OPTICS
- Computes an augmented cluster ordering
- The ordering represents the density-based clustering structure of the data
- Contains information equivalent to the density-based clusterings obtained from a wide range of parameter settings
- The cluster ordering can be used to extract basic clustering information

OPTICS
- In DBSCAN, for constant MinPts, clusters with higher density (lower Eps) are completely contained in density-connected sets obtained with lower density (higher Eps)
- Extend DBSCAN to process a set of distance parameters Eps at the same time
- For this, the objects need to be processed in a specific order
- This order selects an object that is density-reachable wrt the lowest Eps, so that clusters of higher density are finished first

OPTICS
Two values need to be stored for each object:
- Core distance: the smallest Eps that makes the object a core object. If p is not a core object, its core distance is undefined.
- Reachability distance: the reachability distance of q wrt p is the greater of the core distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability distance between p and q is undefined.
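Both quantities can be computed directly from their definitions. The sketch below makes two assumptions that the slide does not state: a point counts as a member of its own neighborhood (matching the |N_Eps(q)| >= MinPts condition used for DBSCAN), and "undefined" is represented by `None`. Function names and data are illustrative.

```python
from math import dist

def core_distance(points, p, eps, min_pts):
    """Smallest radius at which p becomes a core object, or None if that exceeds eps."""
    dists = sorted(dist(points[p], q) for q in points)  # includes 0.0 for p itself
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]

def reachability_distance(points, q, p, eps, min_pts):
    """max(core-distance(p), d(p, q)), or None if p is not a core object."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, dist(points[p], points[q]))

# Hypothetical data: four points close together on a line and one far away.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(core_distance(pts, 0, eps=5, min_pts=3))             # 2.0
print(reachability_distance(pts, 1, 0, eps=5, min_pts=3))  # 2.0
print(reachability_distance(pts, 3, 0, eps=5, min_pts=3))  # 3.0
print(core_distance(pts, 4, eps=5, min_pts=3))             # None
```

Point 1 is closer to point 0 than the core distance, so its reachability distance is lifted to the core distance; point 3 is farther away, so its reachability distance is the plain Euclidean distance.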

OPTICS: Some Extension from DBSCAN
- Index-based: k = number of dimensions, N = 20, p = 75%, M = N(1-p) = 5
- Complexity: O(kN^2)
- Core distance
- Reachability distance: max(core-distance(o), d(o, p))
[Figure: object o with points p1 and p2 in D; r(p1, o) = 2.8 cm, r(p2, o) = 4 cm; MinPts = 5, ε = 3 cm]

Density-based Clustering Contd…
- Efficiency issues with DBSCAN
- Finding clusters in subspaces
- Modeling density accurately
We now look at:
- Grid-based clustering
  - Partitions the data space into grid cells and forms clusters from cells that are dense enough
  - Efficient approach for low-dimensional data
- Subspace clustering
  - Finds clusters in subsets of all dimensions
  - 2^n - 1 subspaces to be searched!
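To see why exhaustive subspace search is expensive: n dimensions have 2^n - 1 non-empty subsets. The short sketch below enumerates them with the standard library; the function name and attribute names are illustrative.

```python
from itertools import combinations

def nonempty_subspaces(dims):
    """All non-empty subsets of the given dimensions, smallest first."""
    return [subset
            for size in range(1, len(dims) + 1)
            for subset in combinations(dims, size)]

print(nonempty_subspaces(["x", "y", "z"]))
print(len(nonempty_subspaces(list(range(10)))))  # 1023
```

Three attributes give 7 subspaces and ten attributes already give 1023, which is why subspace clustering algorithms prune the search rather than enumerate it.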

Grid-based Clustering
- GRIDCLUS
- STING
- CLIQUE
- WaveCluster

Grid-based Clustering
- Significant reduction in time complexity, especially for large data sets
- Number of cells << number of data points
- Instead of clustering the data points, the neighborhoods surrounding the data points are clustered

Grid-based Clustering
Steps involved:
1. Creating the grid structure
2. Calculating the cell density for each cell
3. Sorting the cells according to their densities
4. Identifying cluster centers
5. Traversal of neighborhood cells

Grid-based Clustering
Algorithm:
1. Define a set of grid cells
2. Assign objects to the appropriate grid cells and compute the density of each cell
3. Eliminate cells having density below a specified threshold
4. Form clusters from contiguous groups of dense cells
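The four steps can be sketched as follows. Assumptions made for this example and not taken from the slides: equal-width cells of a single width in every dimension, density measured as the point count per cell, two cells being contiguous when they share a face, and the label -1 for points in discarded cells.

```python
from collections import Counter
from math import floor

def grid_cluster(points, cell_width, threshold):
    """Return one cluster label per point; -1 for points in non-dense cells."""
    def cell_of(p):
        return tuple(floor(x / cell_width) for x in p)

    # Steps 1 and 2: assign points to cells and count points per cell.
    counts = Counter(cell_of(p) for p in points)
    # Step 3: keep only cells whose count reaches the threshold.
    dense = {c for c, n in counts.items() if n >= threshold}
    # Step 4: flood fill over face-adjacent dense cells.
    cell_label = {}
    label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = label
        stack = [start]
        while stack:
            cell = stack.pop()
            for axis in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if nb in dense and nb not in cell_label:
                        cell_label[nb] = label
                        stack.append(nb)
        label += 1
    return [cell_label.get(cell_of(p), -1) for p in points]

# Hypothetical data: two adjacent dense cells, one separate dense cell, one sparse cell.
data = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.5),
        (5.1, 5.1), (5.2, 5.9), (3.5, 3.5)]
print(grid_cluster(data, cell_width=1, threshold=2))  # [0, 0, 0, 0, 1, 1, -1]
```

Only occupied cells are stored in the `Counter`, so memory grows with the number of non-empty cells rather than with the size of the full grid.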

Grid-based Clustering
Defining Grid Cells
- Key step
- Equal-width intervals along all dimensions
  - Each cell has the same volume
  - Density of a cell is defined as the number of points in the cell
- Alternatively, an equi-depth approach can be used
  - Equal number of points in each interval
  - Also called equal-frequency discretization
- MAFIA: a subspace clustering algorithm that initially uses equal-width intervals and then combines intervals of similar density
- The definition of the grid has a strong impact on the clustering results
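The difference between the two ways of cutting one dimension into intervals shows up clearly on skewed data. This is a minimal one-dimensional sketch with illustrative function names; it is not MAFIA's adaptive scheme.

```python
def equal_width_edges(values, bins):
    """Interval boundaries that split [min, max] into bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]

def equal_frequency_bins(values, bins):
    """Split the sorted values into bins holding (nearly) equal numbers of points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // bins:(i + 1) * n // bins] for i in range(bins)]

# Hypothetical skewed data: eight small values and one large one.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_edges(values, 3))     # [1.0, 34.0, 67.0, 100.0]
print(equal_frequency_bins(values, 3))  # [[1, 2, 3], [4, 5, 6], [7, 8, 100]]
```

With equal width, eight of the nine values fall into the first interval and the middle one is empty; with equal frequency, every interval holds three values but the intervals have very different widths.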

Grid-based Clustering
Density of Grid Cells
- Number of points in the cell divided by the volume of the cell
- Examples:
  - Number of road signs per km
  - Number of tigers in a sq. km
  - Number of molecules of a gas in a cu. cm
Source of figure: Introduction to Data Mining by Tan et al.

Grid-based Clustering
Forming Clusters from Dense Grid Cells
- Relatively straightforward
- In the example on the previous slide: 2 clusters
- Define adjacency: 4 or 8 adjacent cells in 2-D?
- Need an efficient technique to find adjacent cells (only occupied cells are stored)
- Partially empty cells on the fringe of clusters are not dense and will be discarded
- 4 parts of the larger cluster will be lost if the threshold is 9
Source of figure: Introduction to Data Mining by Tan et al.
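The choice between 4 and 8 adjacent cells changes the result whenever dense cells touch only at a corner. The sketch below counts clusters of 2-D dense cells under either definition; the function names and the sample cells are illustrative.

```python
from itertools import product

def neighbor_offsets(connectivity):
    """Offsets to adjacent cells in 2-D: 4 (shared edge) or 8 (edge or corner)."""
    if connectivity == 4:
        return [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if connectivity == 8:
        return [(dx, dy) for dx, dy in product((-1, 0, 1), repeat=2)
                if (dx, dy) != (0, 0)]
    raise ValueError("connectivity must be 4 or 8")

def count_clusters(dense_cells, connectivity):
    """Number of connected groups of dense cells under the chosen adjacency."""
    offsets = neighbor_offsets(connectivity)
    unseen = set(dense_cells)  # only occupied dense cells are stored
    clusters = 0
    while unseen:
        stack = [unseen.pop()]
        clusters += 1
        while stack:
            x, y = stack.pop()
            for dx, dy in offsets:
                nb = (x + dx, y + dy)
                if nb in unseen:
                    unseen.remove(nb)
                    stack.append(nb)
    return clusters

# Hypothetical dense cells lying on a diagonal.
diagonal = {(0, 0), (1, 1), (2, 2)}
print(count_clusters(diagonal, 4))  # 3
print(count_clusters(diagonal, 8))  # 1
```

Storing the dense cells in a set makes each adjacency lookup a constant-time hash probe, which is one simple way to find neighbors when only occupied cells are kept.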

Grid-based Clustering
Strengths & Limitations
Strengths:
- A single pass is enough to determine the cell and count of every cell
- Grid cells are created only for non-empty cells
- Complexity: O(m), O(m log m)
Limitations:
- Grids are rectangular
- Curse of dimensionality
- Grid cells containing just one element
Source of figure: Introduction to Data Mining by Tan et al.

Subspace Clustering
- Clustering algorithms considered so far take into account all attributes
- Consider only a subspace of the data
Source of figure: Introduction to Data Mining by Tan et al.

Subspace Clustering
[Figure. Source of figure: Introduction to Data Mining by Tan et al.]

Some Research Directions
- Ensemble clustering
- Parallelizing clustering algorithms to leverage a cluster

Ensemble Clustering
- Similar to ensemble classification
- Consensus clustering
- Obtain different clustering solutions and then reconcile them
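One common way to reconcile several clusterings is a co-association matrix: for each pair of objects, record the fraction of clusterings that put them in the same cluster, then group the pairs that agree often enough. The sketch below is one possible scheme, not a method named in the slides; the 0.5 threshold, the function names and the input labelings are assumptions.

```python
def co_association(labelings):
    """Matrix whose entry [i][j] is the fraction of labelings that co-cluster i and j."""
    n = len(labelings[0])
    m = len(labelings)
    return [[sum(lab[i] == lab[j] for lab in labelings) / m for j in range(n)]
            for i in range(n)]

def consensus(labelings, threshold=0.5):
    """Link objects co-clustered in more than `threshold` of the labelings."""
    co = co_association(labelings)
    n = len(co)
    labels = [-1] * n
    current = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and co[u][v] > threshold:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

# Three hypothetical clusterings of the same five objects (label values are arbitrary).
runs = [[0, 0, 1, 1, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 1, 1]]
print(consensus(runs))  # [0, 0, 1, 1, 1]
```

Because only co-membership is compared, the scheme does not care that the individual runs number their clusters differently.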

Parallelizing Clustering Algorithms
- Parallelize to leverage a cluster
- Two levels of parallelism
  - Node level
  - Core level
  - Not necessarily orthogonal
  - Hybrid: non-trivial
- Programming environment:
  - MPI
  - OpenMP