CURE: Clustering Using REpresentatives algorithm Student: Uglješa Milić University of Belgrade School of Electrical Engineering.


CURE: Clustering Using REpresentatives algorithm
Student: Uglješa Milić
University of Belgrade, School of Electrical Engineering, Department of Computer Engineering
Belgrade, December 2011

Agenda
- Introduction
  - About clustering
  - Previous approaches
  - Things to improve
- CURE algorithm
  - Basic ideas
  - Step by step
  - Experimental results
- Conclusion
- Q&A

Introduction: About clustering
- Clustering is the classification of objects into different groups.
- Most algorithms use either partitioning or hierarchical techniques.
- Partitioning: splits the data directly into the desired number of clusters, usually by optimizing a criterion function.
- Hierarchical (agglomerative): starts with each point as its own cluster and merges clusters step by step, bottom-up, until the desired number of clusters is reached.
- This work uses the second technique.

Introduction: Previous approaches
All-points approach
- Any point in the cluster is a representative of the cluster.
- The distance between two clusters is d_min, the minimum distance between two points, one taken from each cluster:
  $d_{\min}(C_a, C_b) = \min_{i,j} \lVert p_{a,i} - p_{b,j} \rVert$
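The all-points distance can be sketched in a few lines of Python. This is only an illustration: the function name `d_min`, the representation of a cluster as a list of coordinate tuples, and the use of Euclidean distance are assumptions, not taken from the slides.

```python
import math

def d_min(cluster_a, cluster_b):
    """All-points distance: the smallest Euclidean distance over
    every pair (p, q) with p in cluster_a and q in cluster_b."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

# Two small example clusters on the x-axis (made-up data).
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (9.0, 0.0)]
print(d_min(a, b))  # closest pair is (1, 0) and (4, 0)
```

Because every point counts as a representative, a single stray point between two clusters is enough to pull them together.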

Introduction: Previous approaches
Centroid-based approach
- Considers a single point, the centroid, as the representative of a cluster.
- The distance between two clusters is d_mean, the distance between their centroids m_a and m_b:
  $d_{\text{mean}}(C_a, C_b) = \lVert m_a - m_b \rVert$
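For comparison, a minimal sketch of the centroid-based distance. The helper names `centroid` and `d_mean`, the tuple-based point format, and the Euclidean norm are illustrative assumptions.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def d_mean(cluster_a, cluster_b):
    """Centroid-based distance: Euclidean distance between the two means."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Made-up example: the centroids are (1, 0) and (4, 4).
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(4.0, 3.0), (4.0, 5.0)]
print(d_mean(a, b))
```

Here each cluster collapses to one point, so information about its shape and extent is lost.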

Introduction: Things to improve
- Hierarchical models are typically fast and efficient.
- As a result, they are also popular.
- Some disadvantages of traditional clustering algorithms:
  - they favor clusters that approximate spherical shapes
  - they favor clusters of similar sizes
  - they are poor at handling outliers

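CURE algorithm: Basic ideas
- Strikes a balance between the centroid-based and all-points techniques.
- The result is a hybrid of the two.
- Each cluster is described by a pre-defined number of representative points.
- The representative points are shrunk toward the cluster mean by a factor α.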

CURE algorithm: Step by step
- For each cluster, c well-scattered points within the cluster are chosen and then shrunk toward the mean of the cluster by a fraction α.
- The distance between two clusters is then the distance between the closest pair of representative points, one from each cluster.
- The c representative points attempt to capture the physical shape and geometry of the cluster.
- Shrinking the scattered points toward the mean smooths out surface abnormalities and decreases the effect of outliers.
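The selection and shrinking of representatives described above can be sketched as follows. The slides do not say how the "well-scattered" points are picked, so this sketch assumes a farthest-point heuristic: start with the point farthest from the mean, then repeatedly add the point farthest from those already chosen. All function names are hypothetical, and c is assumed to be at least 1.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def scattered_points(cluster, c):
    """Pick up to c well-scattered points (assumed farthest-point heuristic)."""
    mean = centroid(cluster)
    chosen = [max(cluster, key=lambda p: math.dist(p, mean))]
    while len(chosen) < c:
        candidates = [p for p in cluster if p not in chosen]
        if not candidates:  # cluster has fewer than c distinct points
            break
        # Next point: the one whose nearest chosen point is farthest away.
        chosen.append(max(candidates,
                          key=lambda p: min(math.dist(p, q) for q in chosen)))
    return chosen

def shrink(points, mean, alpha):
    """Move each point toward the mean by the fraction alpha."""
    return [tuple(x + alpha * (m - x) for x, m in zip(p, mean)) for p in points]

def representatives(cluster, c, alpha):
    """Shrunken, well-scattered representative points of a cluster."""
    return shrink(scattered_points(cluster, c), centroid(cluster), alpha)

# Made-up cluster: four corners of a square plus its center (mean is (2, 2)).
cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0)]
print(representatives(cluster, c=2, alpha=0.5))
```

With α = 0 the representatives stay on the scattered points themselves, and with α = 1 they all collapse onto the centroid, which matches the idea of a hybrid between the all-points and centroid-based approaches.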

CURE algorithm: Step by step
- Shrinking the representative sets increases the distance from each cluster to any outlier, which also eliminates the "chaining" effect.
- Choosing well-scattered points that are representative of the cluster's shape allows more precision than a standard spheroid radius.
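Putting the steps together, a deliberately naive merge loop might look like the sketch below. It assumes Euclidean distance, picks scattered points with a farthest-point heuristic (an assumption, as the slides do not specify the selection rule), and recomputes all pairwise distances in every iteration. It leaves out the random sampling and partitioning that the conclusion credits for good execution time, so it is only suitable for tiny inputs. The names `cure_naive`, `representatives` and `rep_distance` are hypothetical.

```python
import math

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def representatives(cluster, c, alpha):
    """Up to c scattered points (farthest-point heuristic, an assumption),
    each shrunk toward the cluster mean by the fraction alpha."""
    mean = centroid(cluster)
    chosen = [max(cluster, key=lambda p: math.dist(p, mean))]
    while len(chosen) < c:
        candidates = [p for p in cluster if p not in chosen]
        if not candidates:
            break
        chosen.append(max(candidates,
                          key=lambda p: min(math.dist(p, q) for q in chosen)))
    return [tuple(x + alpha * (m - x) for x, m in zip(p, mean)) for p in chosen]

def rep_distance(reps_a, reps_b):
    """Distance between the closest pair of representatives."""
    return min(math.dist(p, q) for p in reps_a for q in reps_b)

def cure_naive(points, k, c=4, alpha=0.5):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    reps = [representatives(cl, c, alpha) for cl in clusters]
    while len(clusters) > k:
        # Find the pair of clusters with the closest representatives.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: rep_distance(reps[ij[0]], reps[ij[1]]))
        merged = clusters[i] + clusters[j]
        del clusters[j], reps[j]  # j > i, so index i stays valid
        clusters[i] = merged
        reps[i] = representatives(merged, c, alpha)
    return clusters

# Made-up data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
result = cure_naive(points, k=2)
print(result)
```

Recomputing the representatives after every merge keeps them tied to the current shape of the merged cluster, at the cost of extra work per iteration.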

CURE algorithm: Experimental results
- The experiments use two-dimensional data sets.
- The data set consists of one big circle, two small circles, and two connected ellipsoid shapes.

CURE algorithm: Experimental results
- Shrink factor α: values between 0.2 and 0.7 are a good range for α.

CURE algorithm: Experimental results
- Number of representative points c:
  - for smaller values of c, the quality of clustering suffered
  - for values of c greater than 10, CURE always found the right clusters

CURE algorithm: Experimental results
- BIRCH cannot distinguish between the big and small clusters.
- MST merges the two ellipsoids.
- CURE successfully discovers the clusters.

Conclusion
- CURE can detect clusters with non-spherical shapes and wide variance in size by using a set of representative points for each cluster.
- It has a good execution time on large databases thanks to random sampling and partitioning.
- It works well when the database contains outliers.

References
Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. CURE: An Efficient Clustering Algorithm for Large Databases. Information Systems, Volume 26, Number 1, March.

Q&A