CURE: An Efficient Clustering Algorithm for Large Databases
Sudipto Guha (Stanford University), Rajeev Rastogi (Bell Laboratories), Kyuseok Shim (Bell Laboratories)

Presentation transcript:

CURE: An Efficient Clustering Algorithm for Large Databases
Sudipto Guha (Stanford University), Rajeev Rastogi (Bell Laboratories), Kyuseok Shim (Bell Laboratories)
Presented by Zhirong Tao

Overview of the Paper
- Introduction
- Drawbacks of Traditional Clustering Algorithms
- Contributions of CURE
- CURE Algorithm
  - Hierarchical Clustering Algorithm
  - Random Sampling
  - Partitioning for Speedup
  - Labeling Data on Disk
  - Handling Outliers
- Experimental Results

Drawbacks of Traditional Clustering Algorithms
- The centroid-based approach (using d_mean) considers only one point as representative of a cluster: the cluster centroid.
- The all-points approach (based on d_min) makes the clustering algorithm extremely sensitive to outliers and to slight changes in the position of data points.
- Neither works well for non-spherical or arbitrarily shaped clusters. (The sketch below contrasts the two distance measures.)
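A minimal sketch in Python of the two inter-cluster distance measures, assuming each cluster is a NumPy array of points; the function names d_mean and d_min are just illustrative labels for the centroid-based and all-points distances.

```python
import numpy as np

def d_mean(cluster_a, cluster_b):
    # Centroid-based distance: compares only the two cluster means.
    return np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))

def d_min(cluster_a, cluster_b):
    # All-points distance: distance between the closest pair of points,
    # one from each cluster (hence the sensitivity to outliers).
    diffs = cluster_a[:, None, :] - cluster_b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min()
```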

Contributions of CURE
- CURE can identify both spherical and non-spherical clusters.
- It chooses a number of well-scattered points as representatives of a cluster instead of a single point (the centroid).
- CURE uses random sampling and partitioning to speed up clustering.

Overview of CURE

CURE Algorithm: Hierarchical Clustering Algorithm
- For each cluster, c well-scattered points within the cluster are chosen and then shrunk toward the mean of the cluster by a fraction α.
- The distance between two clusters is then the distance between the closest pair of representative points, one from each cluster.
- The c representative points attempt to capture the physical shape and geometry of the cluster.
- Shrinking the scattered points toward the mean gets rid of surface abnormalities and mitigates the effects of outliers. (A sketch follows.)
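A minimal sketch of the representative-point step, assuming a farthest-point heuristic for picking the c well-scattered points (the slide does not spell out the selection procedure); alpha plays the role of the shrink fraction α.

```python
import numpy as np

def shrunken_representatives(points, c=10, alpha=0.3):
    """Pick up to c well-scattered points from a cluster, then shrink them
    toward the cluster mean by the fraction alpha."""
    mean = points.mean(axis=0)
    # Farthest-point heuristic: start from the point farthest from the mean,
    # then repeatedly add the point farthest from those already chosen.
    chosen = [points[np.argmax(np.linalg.norm(points - mean, axis=1))]]
    while len(chosen) < min(c, len(points)):
        dist_to_chosen = np.min(
            [np.linalg.norm(points - r, axis=1) for r in chosen], axis=0)
        chosen.append(points[np.argmax(dist_to_chosen)])
    reps = np.array(chosen)
    return reps + alpha * (mean - reps)  # shrink toward the mean
```

The inter-cluster distance is then the same closest-pair computation as d_min above, applied to the two sets of shrunken representatives.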

CURE Algorithm: Random Sampling
- To handle large data sets, random sampling is used to reduce the size of the input to CURE's clustering algorithm.
- [Vit85] provides efficient algorithms for drawing a sample randomly in one pass using constant space (sketched below).
- Although random sampling involves a tradeoff between accuracy and efficiency, experiments show that for most data sets, very good clusters can be obtained from moderately sized random samples.
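The one-pass, constant-space sampling attributed to [Vit85] can be illustrated with classic reservoir sampling (Algorithm R); this is a generic sketch, not CURE's exact code.

```python
import random

def reservoir_sample(stream, s):
    """Draw a uniform random sample of size s from a data stream in a single
    pass, keeping only s items in memory at any time."""
    sample = []
    for i, point in enumerate(stream):
        if i < s:
            sample.append(point)
        else:
            j = random.randint(0, i)  # uniform over [0, i], inclusive
            if j < s:
                sample[j] = point
    return sample
```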

CURE Algorithm: Partitioning for Speedup
In the first pass:
- Partition the sample of n points into p partitions, each of size n/p.
- Partially cluster each partition until the number of clusters in it drops to n/(pq), for some q > 1.
The advantages (see the sketch below):
- Reduced execution time.
- Reduced input size that fits in main memory, since only the representative points of each partial cluster are kept as input to the final clustering pass.
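A hypothetical sketch of the pre-clustering pass; hierarchical_cluster stands in for CURE's hierarchical clustering routine (not defined here) and is assumed to stop once the requested number of clusters remains.

```python
def precluster_partitions(sample, p, q, hierarchical_cluster):
    """Split the n-point sample into p partitions and partially cluster each
    one down to n/(p*q) clusters; only these partial clusters (via their
    representative points) feed the final clustering pass."""
    n = len(sample)
    size = n // p
    partial_clusters = []
    for i in range(p):
        start = i * size
        end = start + size if i < p - 1 else n   # last partition takes the remainder
        target = max(1, n // (p * q))            # stop at n/(pq) clusters, q > 1
        partial_clusters.extend(hierarchical_cluster(sample[start:end], target))
    return partial_clusters
```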

CURE Algorithm: Labeling Data on Disk
Each data point is assigned to the cluster containing the representative point closest to it (sketched below).
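A minimal sketch of the labeling pass, assuming rep_points is the array of all final representative points and rep_labels gives the cluster id of each representative.

```python
import numpy as np

def label_data(points, rep_points, rep_labels):
    # Assign each (on-disk) data point to the cluster of its nearest
    # representative point.
    labels = []
    for x in points:
        nearest = np.argmin(np.linalg.norm(rep_points - x, axis=1))
        labels.append(rep_labels[nearest])
    return labels
```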

CURE Algorithm: Handling Outliers
- The number of points in a collection of outliers is typically much smaller than the number in a cluster.
- First phase: proceed with the clustering until the number of clusters drops to about 1/3 of the original, then classify clusters with very few points (e.g., 1 or 2) as outliers.
- Second phase: toward the end of clustering, the remaining small groups (outliers) are easily identified and eliminated. (A sketch of the size check follows.)
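A minimal sketch of the cluster-size check used for outlier elimination, assuming clusters are lists of points; the one-or-two-point threshold comes from the slide.

```python
def eliminate_small_clusters(clusters, max_outlier_size=2):
    # Clusters that still contain only one or two points when the total
    # number of clusters has fallen to about a third of the original are
    # treated as collections of outliers and dropped.
    return [c for c in clusters if len(c) > max_outlier_size]
```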

Experimental Results – Algorithms
- BIRCH
- CURE
  - Partitioning constant q = 3
  - Two-phase outlier handling
  - Random sample size = 2.5% of the initial data set size
- MST (minimum spanning tree)
  - When the shrink factor is 0, CURE reduces to MST.

Experimental Results – Data Sets
Experiments use two-dimensional data sets:
- Data set 1 contains one big circle, two small circles, and two ellipsoids.
- Data set 2 consists of 100 clusters with centers arranged in a grid pattern and data points in each cluster following a normal distribution with mean at the cluster center.

Experimental Results – Quality of Clustering
- BIRCH cannot distinguish between the big and small clusters.
- MST merges the two ellipsoids.
- CURE successfully discovers the clusters in Data set 1.

Experimental Results – Sensitivity to Parameters
Shrink factor α: 0.2–0.7 is a good range of values for α.

Experimental Results – Sensitivity to Parameters (Contd.)
Number of representative points c: for smaller values of c, the quality of clustering suffered; however, for values of c greater than 10, CURE always found the right clusters.

Experimental Results – Comparison of Execution Time with BIRCH
Both BIRCH and CURE were run on Data set 2.

Conclusion
- CURE can adjust well to clusters having non-spherical shapes and wide variance in size.
- CURE can handle large databases efficiently.