CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. Paper presentation in the data mining class. Presenters: 許明壽, 蘇建仲.

Presentation transcript:

Slide 1: CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. Paper presentation in the data mining class. Presenters: 許明壽, 蘇建仲. Date: 2001/12/18.

Slide 2: About this paper. George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar, Department of Computer Science and Engineering, University of Minnesota. Published in IEEE Computer, August 1999.

Slide 3: Outline
- Problem definition
- Main algorithm
- Key features of CHAMELEON
- Experiments and related work
- Conclusion and discussion

Slide 4: Problem definition
- Clustering: maximize intracluster similarity while minimizing intercluster similarity.
- Problems of existing clustering algorithms: they rely on a static model, so they break down when clusters have diverse shapes, densities, and sizes, and they are susceptible to noise, outliers, and artifacts.

Slide 5: Static model constraints
- Data-space constraint (k-means, PAM, etc.): suitable only for data in metric spaces.
- Cluster-shape constraint (k-means, PAM, CLARANS): assume clusters are ellipsoidal or globular and of similar sizes.
- Cluster-density constraint (DBSCAN): assumes that points within a genuine cluster are density-reachable and that points across different clusters are not.
- Similarity-determination constraint (CURE, ROCK): use a static model to determine the most similar clusters to merge.

Slide 6: Problems with partitional techniques (figure): (a) clusters of widely different sizes; (b) clusters with convex shapes.

Slide 7: Problems with hierarchical techniques (1/2). Clusters {(c), (d)} will be chosen to merge when we consider only closeness.

Slide 8: Problems with hierarchical techniques (2/2). Clusters {(a), (c)} will be chosen to merge when we consider only inter-connectivity.

Slide 9: Main algorithm. CHAMELEON is a two-phase algorithm.
- Phase I: use a graph-partitioning algorithm to cluster the data items into a large number of relatively small sub-clusters.
- Phase II: use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly merging these sub-clusters.

Slide 10: Framework (figure: the overall two-phase framework, from graph construction to partitioning to merging).

Slide 11: Key features of CHAMELEON
- Modeling the data
- Modeling the cluster similarity
- Partition algorithm
- Merge schemes

Slide 12: Terms and arguments needed
- k: for the k-nearest-neighbor graph
- MINSIZE: the minimum size of an initial sub-cluster
- T_RI: threshold on relative interconnectivity
- T_RC: threshold on relative closeness
- α: coefficient weighting the relative importance of RI and RC
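For reference in the sketches on the following slides, these arguments can be bundled into one configuration object. This is purely illustrative: the paper defines the parameters but no such structure, and the field names and defaults are my own.

```python
from dataclasses import dataclass

@dataclass
class ChameleonParams:
    k: int = 10          # neighbors used when building the k-NN graph
    min_size: int = 25   # MINSIZE: keep bisecting until every sub-cluster is smaller
    t_ri: float = 1.0    # T_RI: threshold on relative interconnectivity (scheme 1)
    t_rc: float = 1.0    # T_RC: threshold on relative closeness (scheme 1)
    alpha: float = 2.0   # exponent weighting RC against RI (scheme 2)
```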

Slide 13: Modeling the data with a k-nearest-neighbor graph. Advantages of the graph G_k:
- Data points that are far apart are completely disconnected in G_k.
- G_k captures the concept of neighborhood dynamically.
- The edge weights of dense regions in G_k tend to be large, and the edge weights of sparse regions tend to be small.
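As a concrete illustration, here is a minimal NumPy sketch of building such a graph. It assumes Euclidean distances and uses inverse distance as the edge weight; the paper only requires that closer points get heavier edges, so the exact weighting here is an assumption.

```python
import numpy as np

def knn_graph(points, k):
    """Weighted adjacency matrix of the k-nearest-neighbor graph G_k.

    points : (n, d) array of data points
    k      : number of nearest neighbors per point
    Returns an (n, n) symmetric matrix; w[i, j] > 0 iff j is among
    i's k nearest neighbors (or vice versa), 0 otherwise.
    """
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    w = np.zeros((n, n))
    nearest = np.argsort(dist, axis=1)[:, :k]
    for i in range(n):
        # Closer neighbors get larger weights; 1/(1+d) is one simple choice.
        w[i, nearest[i]] = 1.0 / (1.0 + dist[i, nearest[i]])
    return np.maximum(w, w.T)               # symmetrize the directed k-NN relation
```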

Slide 14: Example of a k-nearest-neighbor graph (figure).

Slide 15: Modeling the cluster similarity (1/2)
- Relative interconnectivity: RI(Ci, Cj) = |EC(Ci, Cj)| / ((|EC(Ci)| + |EC(Cj)|) / 2), i.e., the absolute interconnectivity between Ci and Cj (the sum of the edge weights crossing between them) normalized by the internal interconnectivity of each cluster (the weight of the edge cut that bisects it).
- Relative closeness: RC(Ci, Cj) = S̄_EC(Ci, Cj) / ((|Ci| / (|Ci| + |Cj|)) S̄_EC(Ci) + (|Cj| / (|Ci| + |Cj|)) S̄_EC(Cj)), i.e., the average weight of the crossing edges normalized by the size-weighted average of the internal edge-cut weights.
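The following sketch computes RI and RC from a weighted adjacency matrix. One loud caveat: the paper obtains the internal edge-cut |EC(Ci)| from a balanced min-cut bisection (via hMETIS), whereas this sketch substitutes a crude index-order split, so the numbers are only illustrative; degenerate cases (no crossing edges, empty cuts) are not handled.

```python
import numpy as np

def cut_weights(w, a, b):
    """All edge weights crossing between node sets a and b in adjacency w."""
    return w[np.ix_(a, b)].ravel()

def internal_cut(w, c):
    """Approximate internal edge-cut of cluster c.

    CHAMELEON bisects c with a balanced min-cut partitioner (hMETIS);
    as a cheap stand-in, this sketch splits c in half by index order.
    """
    half = len(c) // 2
    return cut_weights(w, c[:half], c[half:])

def positive_mean(x):
    """Mean over actual edges (zero entries are non-edges)."""
    return x[x > 0].mean()

def ri_rc(w, ci, cj):
    """Relative interconnectivity and relative closeness of clusters ci, cj.

    w      : (n, n) weighted adjacency matrix of the k-NN graph
    ci, cj : arrays (or lists) of node indices in each cluster
    """
    cross = cut_weights(w, ci, cj)
    inner_i, inner_j = internal_cut(w, ci), internal_cut(w, cj)
    # RI: crossing edge-cut normalized by the mean internal edge-cut.
    ri = cross.sum() / (0.5 * (inner_i.sum() + inner_j.sum()))
    # RC: mean crossing-edge weight normalized by the size-weighted
    # mean weights of the internal edge-cuts.
    ni, nj = len(ci), len(cj)
    rc = positive_mean(cross) / (ni / (ni + nj) * positive_mean(inner_i)
                                 + nj / (ni + nj) * positive_mean(inner_j))
    return ri, rc
```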

Slide 16: Modeling the cluster similarity (2/2). If the relative measures are considered, {(c), (d)} will be merged.

Slide 17: Partition algorithm (Phase I)
- What: find the initial sub-clusters.
- Why: RI and RC cannot be accurately calculated for clusters containing only a few data points.
- How: use a multilevel graph-partitioning algorithm (hMETIS), which works in a coarsening phase, a partitioning phase, and an uncoarsening phase.

Slide 18: Partition algorithm (cont.)
- Initially, all points belong to the same cluster.
- Repeat until the size of every cluster is below MINSIZE: select the largest cluster and use hMETIS to bisect it (see the sketch below).
- Balance constraint: split Ci into CiA and CiB such that each sub-cluster contains at least 25% of the nodes of Ci.
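A runnable sketch of this loop, with spectral bisection standing in for hMETIS. That substitution is an assumption for illustration only: it keeps the split roughly balanced but does not enforce the paper's exact 25% constraint or minimize the edge cut as well as hMETIS does.

```python
import numpy as np

def bisect(w, nodes):
    """Split `nodes` into two balanced halves.

    The paper uses hMETIS for a balanced min-cut bisection; this sketch
    substitutes spectral bisection, splitting at the median of the
    Fiedler vector of the subgraph's Laplacian.
    """
    sub = w[np.ix_(nodes, nodes)]
    lap = np.diag(sub.sum(axis=1)) - sub     # graph Laplacian of the subgraph
    _, vecs = np.linalg.eigh(lap)
    fiedler = vecs[:, 1]                     # eigenvector of 2nd-smallest eigenvalue
    order = np.argsort(fiedler)
    half = len(nodes) // 2
    nodes = np.asarray(nodes)
    return nodes[order[:half]], nodes[order[half:]]

def initial_subclusters(w, min_size):
    """Phase I: repeatedly bisect the largest cluster until all are small."""
    clusters = [np.arange(len(w))]           # initially one cluster with all points
    while max(len(c) for c in clusters) >= min_size:
        clusters.sort(key=len)
        largest = clusters.pop()             # take the current largest cluster
        a, b = bisect(w, largest)
        clusters += [a, b]
    return clusters
```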


Slide 20: Merge schemes (Phase II)
- What: merge sub-clusters using a dynamic framework.
- How: find and merge the pair of sub-clusters that are the most similar (a sketch follows below).
- Scheme 1: merge clusters Ci and Cj for which RI(Ci, Cj) ≥ T_RI and RC(Ci, Cj) ≥ T_RC.
- Scheme 2: merge the pair that maximizes RI(Ci, Cj) * RC(Ci, Cj)^α, where α trades interconnectivity against closeness.
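A minimal sketch of scheme 2's selection step. It reuses the ri_rc() helper from the similarity sketch above, so it inherits that sketch's approximations, and it does no guarding for cluster pairs without crossing edges.

```python
def best_merge(w, clusters, alpha):
    """Scheme 2: pick the pair of sub-clusters maximizing RI * RC**alpha.

    Returns the indices of the two clusters to merge, or None if no pair
    scores above zero. Uses ri_rc() from the earlier sketch.
    """
    best, best_score = None, 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            ri, rc = ri_rc(w, clusters[i], clusters[j])
            score = ri * rc ** alpha
            if score > best_score:
                best, best_score = (i, j), score
    return best

# Putting the earlier sketches together (hypothetical driver, not the
# paper's code):
#   w = knn_graph(points, k=10)
#   clusters = initial_subclusters(w, min_size=25)
#   pair = best_merge(w, clusters, alpha=2.0)   # then merge the pair, repeat
```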

Slide 21: Experiments and related work
- Introduction to CURE
- Introduction to DBSCAN
- Results of the experiments
- Performance analysis

Slide 22: Introduction to CURE (1/4): Clustering Using REpresentatives
1. Properties:
- Suited to non-spherical cluster shapes: multiple representative points are chosen for each cluster.
- Shrinking the representative points helps dampen the effects of outliers.
- In each iteration of the merge procedure, the representative points are chosen from scattered points and shrunk by a fixed ratio (see the sketch below).
- Random sampling of the data set makes it suitable for large databases.
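To make the shrinking idea concrete, here is a small NumPy sketch under the usual assumptions (Euclidean data; farthest-point traversal to pick well-scattered points). The function name and defaults are mine, not CURE's actual code.

```python
import numpy as np

def cure_representatives(cluster, c=10, s=0.3):
    """Choose c scattered points of a cluster and shrink them by factor s.

    cluster : (n, d) array of the cluster's points
    c       : number of representative points
    s       : shrink factor (slide 23 reports good results for s in 0.2-0.7)
    """
    centroid = cluster.mean(axis=0)
    # Farthest-point traversal: start from the point farthest from the
    # centroid, then repeatedly add the point farthest from those chosen.
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < min(c, len(cluster)):
        d = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[np.argmax(d)])
    # Shrink each representative toward the centroid to dampen outliers.
    return np.array([r + s * (centroid - r) for r in reps])
```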

Slide 23: Introduction to CURE (2/4)
2. Drawbacks: the partitioning method cannot guarantee that the chosen data points are good, and clustering accuracy depends on the parameters below:
(1) Shrink factor s: CURE always finds the right clusters for s ranging from 0.2 to 0.7.
(2) Number of representative points c: CURE always finds the right clusters for values of c greater than 10.
(3) Number of partitions p: with as many as 50 partitions, CURE still discovers the desired clusters.
(4) Random sample size r: (a) for sample sizes up to 2000, the clusters found were of poor quality; (b) from 2500 sample points upward (about 2.5% of the data set size), CURE always finds the correct clusters.

Slide 24: Introduction to CURE (3/4). 3. Clustering algorithm: representative points (figure).

Slide 25: Introduction to CURE (4/4). Merge procedure (figure).

Slide 26: Introduction to DBSCAN (1/4): Density-Based Spatial Clustering of Applications with Noise
1. Properties:
- Can discover clusters of arbitrary shape.
- Each cluster has a typical density of points that is higher than outside the cluster, and the density within areas of noise is lower than the density in any of the clusters.
- Requires only the parameters Eps and MinPts as input.
- Easy to implement, e.g., in C++ using an R*-tree.
- Runtime scales near-linearly in the number of points, with time complexity O(n log n).

Slide 27: Introduction to DBSCAN (2/4)
2. Drawbacks:
- Cannot be applied to polygons.
- Cannot be applied to high-dimensional feature spaces.
- Cannot process the shape of the k-dist graph with multiple features.
- Not suited to large databases, because no method is applied to reduce the spatial database.
3. Definitions:
- Eps-neighborhood of a point p: N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}.
- Each cluster contains at least MinPts points.

Slide 28: Introduction to DBSCAN (3/4)
4. p is directly density-reachable from q iff (1) p ∈ N_Eps(q) and (2) |N_Eps(q)| ≥ MinPts (the core-point condition). Direct density-reachability is symmetric when p and q are both core points, and asymmetric when one is a core point and the other a border point.
5. p is density-reachable from q if there is a chain of directly density-reachable points between p and q. Density-reachability is transitive but not symmetric in general; it is symmetric for core points.

Slide 29: Introduction to DBSCAN (4/4)
6. A point p is density-connected to a point q if there is a point s such that both p and q are density-reachable from s. Density-connectedness is a symmetric and reflexive relation. A cluster is defined as a maximal set of density-connected points; noise is the set of points not belonging to any cluster.
7. How to find a cluster C?
- Maximality: for all p, q, if p ∈ C and q is density-reachable from p, then q ∈ C.
- Connectivity: for all p, q ∈ C, p is density-connected to q.
8. How to find noise? A point p is a noise point if it does not belong to any cluster. A compact sketch of the algorithm follows below.
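These definitions translate almost directly into code. Below is a brute-force Python sketch of DBSCAN (no R*-tree index, so it runs in O(n²) time rather than the O(n log n) quoted above); it returns one cluster label per point, with -1 marking noise.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN: returns an array of labels, -1 meaning noise."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    labels = np.full(n, -1)          # -1 = noise (border points may be relabeled)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = list(np.flatnonzero(dist[p] <= eps))
        if len(neighbors) < min_pts:
            continue                 # p is not a core point; stays noise for now
        # Grow a new cluster by expanding density-reachability from p.
        labels[p] = cluster_id
        queue = neighbors
        while queue:
            q = queue.pop()
            if labels[q] == -1:      # border or previously-noise point joins
                labels[q] = cluster_id
            if visited[q]:
                continue
            visited[q] = True
            q_neighbors = np.flatnonzero(dist[q] <= eps)
            if len(q_neighbors) >= min_pts:   # q is a core point: keep expanding
                queue.extend(q_neighbors)
        cluster_id += 1
    return labels
```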

Slide 30: Results of the experiments (figures).

Slide 31: Performance analysis (1/2)
- Constructing the k-nearest-neighbor graph: for low-dimensional data sets, k-d trees give an overall complexity of O(n log n); for high-dimensional data sets, k-d trees are not applicable, and the overall complexity is O(n²).
- Finding the initial sub-clusters: obtaining m clusters by repeatedly partitioning successively smaller graphs has overall computational complexity O(n log(n/m)), which is bounded by O(n log n). A faster multilevel m-way partitioning algorithm can obtain the initial m clusters in time O(n + m log m).

Slide 32: Performance analysis (2/2)
- Merging sub-clusters using the dynamic framework: computing the internal interconnectivity and internal closeness of each initial cluster takes O(nm) time, and finding the most similar pair of clusters to merge takes O(m² log m) using a heap-based priority queue.
- So the overall complexity of CHAMELEON is O(n log n + nm + m² log m).

Slide 33: Conclusion and discussion
- CHAMELEON uses a dynamic model based on relative interconnectivity and relative closeness.
- The paper ignores the issue of scaling to large data sets.
- Could other graph representations be used?
- Could other partitioning algorithms be used?