Clustering I

The Task

Input: a collection of instances
– No special class label attribute!
Output: clusters (groups) of instances, where members of a cluster are more 'similar' to each other than they are to members of another cluster
– Similarity between instances needs to be defined

Because there are no predefined classes to predict, the task is to find useful groups of instances.

Example – Iris Data
[Slide shows the Iris data set]

Example Clusters?
[Scatter plot of the Iris instances: petal width vs. petal length]

Different Types of Clusters

Clustering is not an end in itself
– It is only a means to an end defined by the goals of the parent application
– A garment manufacturer might want to group people by size and shape (segmentation)
– A biologist might be looking for groups of genes with similar functions (natural groupings)

Application-dependent notions of clusters are possible
– This means you need different clustering techniques to find different types of clusters
– You also need to learn to pick a clustering technique that matches your objective
– Sometimes an ideal match is not possible, so you adapt an existing technique
– Sometimes a 'wrongly' selected technique reveals unexpected clusters – which is what data mining is all about

Cluster Representations

Several representations are possible:
– Partitioning of 2D space showing the instances
– Venn diagram
– Table of cluster probabilities
– Dendrogram

[Slide shows the four representations side by side for instances a–k]

Several Techniques

Partitioning methods
– K-means and k-medoids
Hierarchical methods
– Agglomerative
Density-based methods
– DBSCAN

K-means Clustering

Input: the number of clusters k and the collection of n instances
Output: a set of k clusters that minimizes the squared error criterion

Method:
– Arbitrarily choose k instances as the initial cluster centers
– Repeat
  - (Re)assign each instance to the cluster to which it is most similar, based on the mean value of the instances in the cluster
  - Update the cluster means (compute the mean value of the instances in each cluster)
– Until there is no change in the assignment

Squared error criterion:
E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2
where m_i is the mean of cluster C_i and p ranges over the instances assigned to C_i.
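
To make the procedure concrete, here is a minimal NumPy sketch of the loop above. The function name, its interface, and the optional init parameter (reused in sketches on later slides) are illustrative choices, not part of the slides, and it assumes no cluster ever becomes empty (an issue the Additional Issues slide returns to).

```python
import numpy as np

def kmeans(X, k, init=None, max_iter=100, seed=0):
    """Minimal k-means sketch. X: (n, d) array of instances; k: number of
    clusters; init: optional (k, d) array of starting centers."""
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k instances as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)] if init is None else init
    labels = None
    for _ in range(max_iter):
        # (Re)assign each instance to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no change in the assignment
        labels = new_labels
        # Update cluster means (assumes no cluster becomes empty).
        centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    # Squared error criterion E = sum_i sum_{p in C_i} ||p - m_i||^2.
    E = sum(((X[labels == i] - centers[i]) ** 2).sum() for i in range(k))
    return labels, centers, E
```

For the Iris example, `labels, centers, E = kmeans(X, k=3)` on the petal measurements would produce three groups along with the squared error achieved.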

Interactive Demo

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Choosing the Number of Clusters

The k-means method expects the number of clusters k to be specified as part of the input. How do we choose an appropriate k?

Options:
– Start with k = 1 and work up to a small number, selecting the result with the lowest squared error
  - Warning: taken to the extreme, the result with each instance in a cluster of its own would always be 'best'
– Create two initial clusters and split each cluster independently
  - When splitting a cluster, create two new 'seeds': one seed at a point one standard deviation away from the cluster center, and the second one standard deviation away in the opposite direction
  - Then apply k-means to the points in the cluster with the two new seeds (see the sketch below)
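
A sketch of the seed-creation step, reusing the kmeans function sketched earlier. The slides do not say along which direction the seeds should be offset, so a random direction is assumed here.

```python
import numpy as np

def split_cluster(points, seed=0):
    """Split one cluster's points in two. Seeds are placed one standard
    deviation from the cluster center in opposite directions, along a
    randomly chosen direction (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    center = points.mean(axis=0)
    direction = rng.standard_normal(points.shape[1])
    direction /= np.linalg.norm(direction)
    offset = points.std(axis=0) * direction          # one std-dev step per dimension
    seeds = np.vstack([center + offset, center - offset])
    # Apply k-means (k = 2) to the cluster's points with the two new seeds.
    return kmeans(points, 2, init=seeds)
```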

Choosing Initial Centroids

Randomly chosen centroids may produce poor-quality clusters. How should initial centroids be chosen?

Options:
– Take a sample of the data, apply hierarchical clustering, and extract k cluster centroids
  - Works well, but only when the sample is small
– Take k well-separated points (sketched below)
  - Take a sample of the data and select the first centroid as the centroid of that sample
  - Choose each subsequent centroid to be the point farthest from all of the already selected centroids
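
A sketch of the second option. The sample size of 200 is an arbitrary illustrative choice, not from the slides.

```python
import numpy as np

def farthest_point_init(X, k, sample_size=200, seed=0):
    """Pick k well-separated initial centroids: start from the centroid of a
    sample, then repeatedly take the point farthest from all chosen so far."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(len(X), sample_size), replace=False)]
    centroids = [sample.mean(axis=0)]     # first centroid: the sample's centroid
    while len(centroids) < k:
        # Distance from each instance to its nearest already-chosen centroid.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])   # farthest point becomes the next centroid
    return np.array(centroids)
```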

Additional Issues

K-means might produce empty clusters, in which case an alternative centroid must be chosen. Options:
– Choose the point farthest from the existing centroids
– Choose a point from the cluster with the highest squared error

The squared error may need to be reduced further
– K-means may converge to a local minimum
– It is possible to reduce the squared error without increasing k
– Solution: apply alternating phases of cluster splitting and merging (sketched below)
  - Splitting: split the cluster with the largest squared error
  - Merging: redistribute the elements of a cluster to its neighbours

Outliers in the data cause problems for k-means. Options:
– Remove the outliers
– Use k-medoids instead
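
One split/merge round, building on the kmeans and split_cluster sketches above. The slide does not say which cluster to merge away; dissolving the smallest one is an assumption made here.

```python
import numpy as np

def split_merge_round(X, labels, centers):
    """One splitting/merging phase that keeps the number of clusters fixed."""
    k = len(centers)
    sse = np.array([((X[labels == i] - centers[i]) ** 2).sum() for i in range(k)])
    # Splitting: split the cluster with the largest squared error into two.
    worst = int(sse.argmax())
    _, sub_centers, _ = split_cluster(X[labels == worst])
    centers = np.vstack([np.delete(centers, worst, axis=0), sub_centers])
    # Reassign all instances to the (now k + 1) centers.
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    # Merging: dissolve the smallest cluster and redistribute its elements
    # to the neighbouring clusters, restoring k clusters.
    smallest = int(np.bincount(labels, minlength=k + 1).argmin())
    centers = np.delete(centers, smallest, axis=0)
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    return labels, centers
```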

K-medoids

Input: the number of clusters k and the collection of n instances
Output: a set of k clusters that minimizes the sum of the dissimilarities of all the instances to their nearest medoids

Method:
– Arbitrarily choose k instances as the initial medoids
– Repeat
  - (Re)assign each remaining instance to the cluster with the nearest medoid
  - Randomly select a non-medoid instance O_r
  - Compute the total cost S of swapping a medoid O_j with O_r
  - If S < 0, swap O_j with O_r to form the new set of k medoids
– Until no change
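
A PAM-style sketch of the method above. It tries one random non-medoid O_r per pass; a production implementation would typically consider every medoid/non-medoid pair.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Minimal k-medoids sketch: instances serve as cluster centers, and a
    medoid/non-medoid swap is kept only when it lowers the total cost S."""
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)             # arbitrary initial medoids

    def cost(meds):
        # Sum of dissimilarities of all instances to their nearest medoid.
        return D[:, meds].min(axis=1).sum()

    current = cost(medoids)
    for _ in range(max_iter):
        # Randomly select a non-medoid instance O_r ...
        o_r = rng.choice(np.setdiff1d(np.arange(n), medoids))
        swapped = False
        for j in range(k):                 # ... and try swapping it with each O_j.
            candidate = medoids.copy()
            candidate[j] = o_r
            S = cost(candidate) - current  # total cost of the swap
            if S < 0:                      # keep the swap only if S < 0
                medoids, current, swapped = candidate, current + S, True
        if not swapped:
            break                          # no change: stop
    labels = D[:, medoids].argmin(axis=1)  # assign instances to the nearest medoid
    return medoids, labels
```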

K-means vs. K-medoids

Both require k to be specified in the input
K-medoids is less influenced by outliers in the data
K-medoids is computationally more expensive
Both methods assign each instance to exactly one cluster
– Fuzzy partitioning techniques relax this
Partitioning methods are generally good at finding spherical clusters
They are suitable for small and medium-sized data sets
– Extensions are required to handle large data sets