CURE: An Efficient Clustering Algorithm for Large Databases
CHAN Siu Lung, Daniel; CHAN Wai Kin, Ken; CHOW Chin Hung, Victor; KOON Ping Yin, Bob

Content
I. Problems with traditional clustering methods
II. Basic idea of CURE clustering
III. Improved CURE
IV. Summary
V. References

I. Problems with traditional clustering methods

Partitional Clustering
– Methods in this category try to partition the data set into k clusters based on some criterion function.
– The most common criterion is the square-error criterion (sketched below).
– These methods favor clusters that are as compact and as well separated as possible.
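For concreteness, here is a minimal sketch of the square-error (SSE) criterion that partitional methods minimize; the function name and array layout are illustrative assumptions, not from the slides:

```python
import numpy as np

def square_error(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for j in range(k):
        members = points[labels == j]          # points assigned to cluster j
        if len(members) == 0:
            continue
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total
```

A partitional method searches for the k-way assignment that minimizes this quantity, which is why it prefers compact, well-separated, roughly equal-sized clusters.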

Partitional Clustering
You may get a wrong result when the square error is reduced by splitting some large cluster in favor of some other group.
Figure: Splitting occurs in a large cluster under a partitional method

Hierarchical Clustering
– Methods in this category merge sequences of disjoint clusters into the target k clusters, based on the minimum distance between two clusters.
– The distance between clusters can be measured in several ways (sketched below):
  - Distance between the cluster means (d_mean)
  - Average distance over all cross-cluster pairs of points (d_ave)
  - Distance between the two nearest points, one from each cluster (d_min)
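As a quick sketch, the three measures can be written as follows for two clusters stored as NumPy arrays of shape (n, d); the names d_mean, d_ave, and d_min follow the slides' notation:

```python
import numpy as np

def d_mean(a, b):
    """Distance between the two cluster means."""
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

def d_ave(a, b):
    """Average distance over all cross-cluster point pairs."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diffs, axis=2).mean()

def d_min(a, b):
    """Distance between the two closest points, one from each cluster."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diffs, axis=2).min()
```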

Hierarchical Clustering
– These methods favor hyper-spherical shapes and uniform data.
– Let's take some elongated data as an example:
– Result of d_mean:

Hierarchical Clustering
– Result of d_min:

Problems summary
1. Traditional clustering mainly favors spherical shapes.
2. Data in a cluster must be compact.
3. Each cluster must be separated far enough from the others.
4. Cluster sizes must be uniform.
5. Outliers greatly disturb the clustering result.

II. Basic idea of CURE clustering

General CURE clustering procedure
1. It is similar to the hierarchical clustering approach, but it uses a set of scattered sample points as the cluster representative rather than every point in the cluster.
2. First set a target sample count c; then try to select c well-scattered sample points from the cluster.
3. The chosen scattered points are shrunk toward the centroid by a fraction α, where 0 < α < 1.

General CURE clustering procedure
4. These points are used as the representatives of the clusters and serve as the points in the d_min cluster-merging approach (see the sketch below).
5. After each merge, c sample points are selected from the original representatives of the previous clusters to represent the new cluster.
6. Cluster merging stops once the target of k clusters is reached.
Figure: the nearest pair of clusters is merged at each step
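A minimal sketch of steps 2–4, picking c well-scattered points and shrinking them toward the centroid by α; the greedy farthest-point selection is an assumption in the spirit of the paper, not code from the slides:

```python
import numpy as np

def representatives(cluster, c, alpha):
    """Pick c well-scattered points from a cluster, then shrink them
    toward the centroid by a fraction alpha (0 < alpha < 1)."""
    centroid = cluster.mean(axis=0)
    # Greedy farthest-point selection: start with the point farthest
    # from the centroid, then repeatedly add the point farthest from
    # the points chosen so far.
    dists = np.linalg.norm(cluster - centroid, axis=1)
    chosen = [int(dists.argmax())]
    while len(chosen) < min(c, len(cluster)):
        d_to_chosen = np.min(
            [np.linalg.norm(cluster - cluster[i], axis=1) for i in chosen],
            axis=0,
        )
        chosen.append(int(d_to_chosen.argmax()))
    scattered = cluster[chosen]
    return scattered + alpha * (centroid - scattered)
```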

Pseudo function of CURE
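The pseudo function itself is shown as a figure on the slide; the following is only a rough Python sketch of the merging loop it describes, assuming the representatives() helper above and using a naive pairwise scan instead of the paper's k-d tree and heap:

```python
import numpy as np

def cure_cluster(points, k, c, alpha):
    """Merge clusters under d_min over representatives until k remain."""
    clusters = [points[i:i + 1] for i in range(len(points))]
    reps = [cl.copy() for cl in clusters]   # each point represents itself

    def rep_dist(i, j):
        # d_min between the representative sets of clusters i and j.
        diffs = reps[i][:, None, :] - reps[j][None, :, :]
        return np.linalg.norm(diffs, axis=2).min()

    while len(clusters) > k:
        # Find the closest pair of clusters (naive O(m^2) scan; the real
        # algorithm keeps a heap keyed by each cluster's nearest neighbor
        # and a k-d tree over the representative points).
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: rep_dist(*ab),
        )
        merged = np.vstack([clusters[i], clusters[j]])
        new_reps = representatives(merged, c, alpha)
        clusters = [cl for t, cl in enumerate(clusters) if t not in (i, j)]
        reps = [r for t, r in enumerate(reps) if t not in (i, j)]
        clusters.append(merged)
        reps.append(new_reps)
    return clusters, reps
```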

CURE efficiency
The worst-case time complexity is O(n² log n).
The space complexity is O(n), due to the use of a k-d tree and a heap.

III. Improved CURE

Random Sampling
When dealing with a large database, we cannot store every data point in memory, and merging over the full data set takes very long. We therefore use random sampling to reduce both the time complexity and the memory usage.
Assume that to detect a cluster u we need to capture at least a fraction f of its points, i.e. f|u| points. You can refer to reference (i) for the proof; here we just want to show that we can determine a minimum sample size s_min such that the probability of getting enough sample points from every cluster u is 1 - δ:

s_min = f·N + (N/|u|)·log(1/δ) + (N/|u|)·√(log²(1/δ) + 2·f·|u|·log(1/δ))

where N is the total number of data points.
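A small helper computing this bound; the closed form follows the sampling lemma in reference iii, so treat the transcription as an assumption to verify against the paper:

```python
import math

def min_sample_size(N, u_size, f, delta):
    """Minimum sample size so that, with probability at least 1 - delta,
    the sample contains at least f * |u| points from a cluster u."""
    ratio = N / u_size
    log_term = math.log(1.0 / delta)
    return (f * N
            + ratio * log_term
            + ratio * math.sqrt(log_term ** 2
                                + 2.0 * f * u_size * log_term))

# Example: N = 100_000 points, smallest interesting cluster |u| = 1_000,
# f = 0.1, delta = 0.001  ->  roughly 14,500 sample points.
```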

Partitioning and two-pass clustering
In addition, we use a two-pass approach to reduce the computation time (sketched below).
First, we divide the n data points into p partitions, each containing n/p data points.
We then pre-cluster each partition until the number of clusters in it drops to n/(pq), for some q > 1.
Each cluster from the first-pass result is then used as input to the second pass, which forms the final clusters.
Each partition's time complexity is O((n/p)² log(n/p)).
Therefore, the first-pass complexity over all p partitions is O((n²/p) log(n/p)).
The second pass clusters the n/q surviving pre-clusters, so its complexity is O((n/q)² log(n/q)).
Overall, the time complexity becomes O((n²/p) log(n/p) + (n/q)² log(n/q)).
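A sketch of the scheme, reusing the cure_cluster() sketch above. One simplification to note: the paper's second pass continues from the pre-clusters and their representatives, while this sketch simply re-clusters the surviving points:

```python
import numpy as np

def two_pass_cure(points, k, c, alpha, p, q):
    """First pass: reduce each of p partitions to n/(p*q) clusters.
    Second pass: cluster the survivors down to the target k."""
    n = len(points)
    partial = []
    for part in np.array_split(points, p):
        clusters, _ = cure_cluster(part, max(1, n // (p * q)), c, alpha)
        partial.extend(clusters)
    # The p * (n/(p*q)) = n/q partial clusters feed the second pass.
    return cure_cluster(np.vstack(partial), k, c, alpha)
```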

Partitioning and two-pass clustering
The overall improvement over single-pass clustering is therefore the ratio n² log n / ((n²/p) log(n/p) + (n/q)² log(n/q)), i.e. roughly a factor of p when the second pass is cheap.
Also, to maintain the quality of the clustering, we must make sure that n/(pq) is 2 to 3 times k.

Outlier elimination
We can eliminate outliers by two methods:
1. Random sampling: with random sampling, most of the outlier points are filtered out.
2. Outlier elimination during merging (see the sketch below): since outliers do not form compact groups, their clusters grow very slowly during the merge stage. We therefore kick in an elimination procedure during merging that removes clusters with only 1 to 2 data points from the cluster list. To prevent these outliers from merging into proper clusters, the procedure must be triggered at the proper stage; in general, we trigger it when the number of clusters has dropped to 1/3 of the total number of data points.
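A sketch of the elimination step, assuming clusters and their representatives are held in parallel lists as in the sketches above; the 1–2 point threshold and the one-third trigger come from the slide:

```python
def remove_outlier_clusters(clusters, reps, min_size=3):
    """Drop clusters that are still tiny (1-2 points) when triggered."""
    kept = [(cl, r) for cl, r in zip(clusters, reps) if len(cl) >= min_size]
    clusters[:] = [cl for cl, _ in kept]
    reps[:] = [r for _, r in kept]

# Inside the merging loop, fire once when the number of clusters first
# drops to one third of the number of input points:
#   if not triggered and len(clusters) <= n_input // 3:
#       remove_outlier_clusters(clusters, reps)
#       triggered = True
```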

Data labeling
Because we clustered only a random sample, we need to label every remaining data point with the proper cluster group.
Each data point is assigned to the cluster group having a representative point nearest to that data point.
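A sketch of the labeling pass, assuming reps is the list of per-cluster representative arrays produced during clustering:

```python
import numpy as np

def label_points(points, reps):
    """Assign each point to the cluster whose nearest representative
    is closest to it."""
    labels = np.empty(len(points), dtype=int)
    for idx, x in enumerate(points):
        labels[idx] = min(
            range(len(reps)),
            key=lambda j: np.linalg.norm(reps[j] - x, axis=1).min(),
        )
    return labels
```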

Final overview of the CURE flow
Data → Draw random sample → Partition sample → Partially cluster partitions → Eliminate outliers → Cluster partial clusters → Label data on disk

Sample results with different parameters
Figure: results for different shrinking factors α

Sample results with different parameters
Figure: results for different numbers of representatives c

Sample results with different parameters
Figure: relation of execution time to the partition count p and the number of sample points s

IV. Summary
CURE can effectively detect clusters of proper (non-spherical) shape with the help of scattered representative points and centroid shrinking.
CURE reduces computation time and memory load with random sampling and two-pass clustering.
CURE can effectively remove outliers.
The quality and effectiveness of CURE can be tuned by varying s, p, c, and α to adapt to different input data sets.

V. References
i. [GRS97] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: A clustering algorithm for large databases. Technical report, Bell Laboratories, Murray Hill, 1997.
ii. [ZRL96] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Record, 25(2), June 1996.
iii. Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: An Efficient Clustering Algorithm for Large Databases. ACM SIGMOD, 1998.