BHARATH RENGARAJAN PURSUING MY MASTERS IN COMPUTER SCIENCE FALL 2008.


I. Problems in the traditional clustering methods
II. CURE clustering
III. Summary
IV. Drawbacks

Partitional clustering attempts to find k partitions of the data that minimize a certain criterion function. The square-error criterion is the most commonly used criterion function. It works well for compact, well-separated clusters.
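The square-error criterion can be sketched directly; this is a minimal illustration (not from the slides), with the example points, labels, and centroids invented for demonstration.

```python
import numpy as np

def square_error(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(
        np.sum((points[labels == k] - c) ** 2)
        for k, c in enumerate(centroids)
    )

# Two compact, well-separated clusters: the criterion behaves well here.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
print(square_error(points, labels, centroids))  # 1.0
```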

Errors can arise because the square-error is sometimes reduced by splitting a large cluster and assigning its pieces to other groups, producing an incorrect partition.

◦ This category of clustering method merges sequences of disjoint clusters into the target k clusters based on the minimum distance between two clusters.
◦ The distance between clusters can be measured as:
 - Distance between the cluster means (d_mean)
 - Distance between the two nearest points, one from each cluster (d_min)
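The two inter-cluster distances above can be written down directly; a small sketch, with the example clusters made up for illustration:

```python
import numpy as np

def d_mean(a, b):
    """Distance between the means (centroids) of two clusters."""
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

def d_min(a, b):
    """Distance between the two nearest points, one from each cluster."""
    return min(np.linalg.norm(p - q) for p in a for q in b)

a = np.array([[0.0, 0.0], [2.0, 0.0]])
b = np.array([[5.0, 0.0], [9.0, 0.0]])
print(d_mean(a, b))  # 6.0
print(d_min(a, b))   # 3.0
```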

Result of d_mean: (figure)

Result of d_min: (figure)

1. Traditional clustering mainly favors spherical cluster shapes.
2. Data within a cluster must be compactly packed together.
3. Each cluster must be well separated from the others.
4. Outliers greatly disturb the clustering result.

1. CURE is similar to the hierarchical clustering approach, but it uses a sample-point variant as the cluster representative rather than every point in the cluster.
2. First set a target sample number c. Then select c well-scattered sample points from the cluster.
3. The chosen scattered points are shrunk toward the centroid by a fraction α, where 0 < α < 1.
4. These points serve as the representatives of the clusters and are used as the points in the d_min cluster-merging approach.
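Steps 2-3 above (selecting c well-scattered points and shrinking them toward the centroid) can be sketched as follows; the farthest-point selection heuristic, the helper name `representatives`, and the example data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def representatives(cluster, c, alpha):
    """Pick c well-scattered points with a farthest-point heuristic,
    then shrink each toward the centroid by the fraction alpha."""
    centroid = cluster.mean(axis=0)
    # Start from the point farthest from the centroid.
    scattered = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(scattered) < min(c, len(cluster)):
        # Next pick: the point farthest from all points chosen so far.
        dists = np.min(
            [np.linalg.norm(cluster - s, axis=1) for s in scattered], axis=0)
        scattered.append(cluster[np.argmax(dists)])
    # Shrinking toward the centroid dampens the effect of boundary outliers.
    return [p + alpha * (centroid - p) for p in scattered]

reps = representatives(np.array([[0.0, 0.0], [4.0, 0.0]]), c=2, alpha=0.5)
print(reps)  # [array([1., 0.]), array([3., 0.])]
```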

5. After each merge, c sample points are selected from the original representatives of the merged clusters to represent the new cluster.
6. Cluster merging stops once the target of k clusters is reached.
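The merge loop can be sketched with a brute-force closest-pair search; the slides' O(n² log n) bound relies on a heap and k-d tree, which this illustration omits for clarity. The function name `cure_merge`, the `rep_fn` callback, and the example data are assumptions.

```python
import numpy as np

def cure_merge(clusters, k, rep_fn):
    """Merge the closest pair of clusters (minimum distance between
    their representative points) until only k clusters remain."""
    clusters = [np.asarray(cl) for cl in clusters]
    reps = [rep_fn(cl) for cl in clusters]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(p - q)
                        for p in reps[i] for q in reps[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [cl for t, cl in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
        reps = [r for t, r in enumerate(reps) if t not in (i, j)]
        reps.append(rep_fn(merged))  # re-derive representatives after a merge
    return clusters

# With identity representatives this degenerates to plain d_min merging.
out = cure_merge([[[0.0, 0.0]], [[0.0, 1.0]], [[10.0, 10.0]]], k=2,
                 rep_fn=lambda cl: list(cl))
print(sorted(len(cl) for cl in out))  # [1, 2]
```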

The worst-case time complexity is O(n² log n). The space complexity is O(n), due to the use of a k-d tree and a heap.

When dealing with a large database, we cannot store every data point in memory, and merging over the full data set takes very long. Random sampling reduces both the time complexity and the memory usage. With random sampling, there is a trade-off between accuracy and efficiency.

Outliers can be eliminated by two methods:
1. Random sampling: most outlier points are filtered out by the sampling itself.
2. Outlier elimination: since outliers do not form compact groups, their clusters grow very slowly during the merge stage. An elimination procedure is triggered during merging to remove clusters with only 1-2 data points from the cluster list.
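The second elimination method (dropping clusters that remain at 1-2 points) is simple to express; the threshold parameter `min_size` and the example data are assumptions for illustration.

```python
def eliminate_outliers(clusters, min_size=3):
    """Drop clusters that stayed tiny during merging; a cluster that
    grows very slowly is assumed to consist of outliers."""
    return [cl for cl in clusters if len(cl) >= min_size]

clusters = [[(0, 0), (0, 1), (1, 0)],   # real cluster
            [(9, 9)],                   # lone outlier
            [(5, 5), (5, 6), (6, 5)]]   # real cluster
print(len(eliminate_outliers(clusters)))  # 2
```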

Because clustering was performed on a random sample, every remaining data point must be labeled back to its proper cluster group. Each data point is assigned to the cluster with a representative point nearest to that data point.
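The labeling pass can be sketched as a nearest-representative lookup; the representative sets and query points below are invented for the example.

```python
import numpy as np

def label_points(points, cluster_reps):
    """Assign each point to the cluster whose nearest representative
    point is closest to it."""
    labels = []
    for p in points:
        p = np.asarray(p, dtype=float)
        labels.append(min(
            range(len(cluster_reps)),
            key=lambda k: min(np.linalg.norm(p - np.asarray(r))
                              for r in cluster_reps[k])))
    return labels

reps = [[(0.0, 0.0), (1.0, 1.0)], [(10.0, 10.0)]]
print(label_points([(0.2, 0.1), (9.0, 9.5)], reps))  # [0, 1]
```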

Data → Draw random sample → Partition sample → Partially cluster partitions → Eliminate outliers → Cluster partial clusters → Label data in disk

CURE can effectively detect the proper shape of clusters with the help of scattered representative points and centroid shrinking. CURE reduces computation time with random sampling. CURE can effectively remove outliers. The quality and effectiveness of CURE can be tuned by varying s, p, c, and α to adapt to different input data sets.

The clusters found are still limited to somewhat standard shapes. Too many parameters are involved.