LOGO Clustering Lecturer: Dr. Bo Yuan


Overview
 Partitioning Methods
   K-Means
   Sequential Leader
 Model Based Methods
 Density Based Methods
 Hierarchical Methods

What is cluster analysis?
 Finding groups of objects
   Objects similar to each other are in the same group.
   Objects are different from those in other groups.
 Unsupervised Learning
   No labels
   Data driven

Clusters (figure: inter-cluster vs. intra-cluster distances)

Clusters

Applications of Clustering
 Marketing: finding groups of customers with similar behaviours.
 Biology: finding groups of animals or plants with similar features.
 Bioinformatics: clustering of microarray data, genes and sequences.
 Earthquake Studies: clustering observed earthquake epicenters to identify dangerous zones.
 WWW: clustering weblog data to discover groups of similar access patterns.
 Social Networks: discovering groups of individuals with close friendships internally.

Earthquakes

Image Segmentation

The Big Picture

Requirements
 Scalability
 Ability to deal with different types of attributes
 Ability to discover clusters with arbitrary shape
 Minimal requirements for domain knowledge
 Ability to deal with noise and outliers
 Insensitivity to the order of input records
 Incorporation of user-defined constraints
 Interpretability and usability

Practical Considerations

Normalization or Not?

Evaluation (figure: two alternative clusterings compared)

Evaluation

The Influence of Outliers (figure labels: outlier, K = 2)

K-Means

K-Means

K-Means

K-Means  Determine the value of K.  Choose K cluster centres randomly.  Each data point is assigned to its closest centroid.  Use the mean of each cluster to update each centroid.  Repeat until no more new assignment.  Return the K centroids.  Reference  J. MacQueen (1967): "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol.1, pp

Comments on K-Means
 Pros
   Simple, and works well for regular disjoint clusters.
   Converges relatively fast.
   Relatively efficient and scalable: O(t · k · n), where t is the number of iterations, k the number of centroids and n the number of data points.
 Cons
   The value of K must be specified in advance, which is difficult; domain knowledge may help.
   May converge to local optima; in practice, try different initial centroids.
   May be sensitive to noisy data and outliers, since the mean of the data points is easily distorted.
   Not suitable for clusters of non-convex shapes.

The Influence of Initial Centroids

The Influence of Initial Centroids

The K-Medoids Method
 The basic idea is to use real data points as cluster centres.
 Determine the value of K in advance.
 Randomly select K points as medoids.
 Assign each data point to the closest medoid.
 Calculate the cost J of the configuration.
 For each medoid m and each non-medoid point o: swap m and o and calculate the cost J′ of the new configuration.
 If the cost J* of the best new configuration is lower than J, make the corresponding swap and repeat the above steps; otherwise, terminate the procedure.
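A compact sketch of this idea (a greedy first-improvement variant: it applies any cost-reducing swap as soon as one is found rather than only the single best swap per round; the function name and signature are illustrative):

```python
def k_medoids(points, dist, medoids):
    """PAM-style K-Medoids sketch: keep swapping a medoid m with a
    non-medoid point o for as long as the total cost J improves."""
    def cost(meds):
        # J: sum over all points of the distance to the closest medoid.
        return sum(min(dist(p, m) for m in meds) for p in points)
    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for o in points:
                if o in medoids:
                    continue
                trial = medoids[:i] + [o] + medoids[i + 1:]  # swap m and o
                j_new = cost(trial)
                if j_new < best:                  # keep the cheaper configuration
                    best, medoids, improved = j_new, trial, True
    return medoids, best
```

On six points in two groups on a line, the search moves the medoids to the middle of each group, where the configuration cost is minimal.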

The K-Medoids Method (figure: two configurations, Cost = 20 vs. Cost = 26)

Sequential Leader Clustering
 A very efficient clustering algorithm.
   No iteration over the data set.
   Time complexity: O(n · k)
 No need to specify K in advance.
 Choose a cluster threshold value.
 For every new data point:
   Compute the distance between the new data point and every cluster centre.
   If the smallest distance is below the threshold, assign the point to the corresponding cluster and re-compute that cluster centre.
   Otherwise, create a new cluster with the new data point as its centre.
 Clustering results may be influenced by the order of the data points.
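The whole algorithm is one pass over the data; a minimal 1-D sketch (the restriction to scalar points is only for brevity, and the function name is illustrative):

```python
def sequential_leader(points, threshold):
    """Single pass, no iteration: each point joins the nearest existing
    cluster if that cluster's centre is within `threshold`; otherwise it
    starts a new cluster with itself as the centre."""
    centres, clusters = [], []
    for p in points:
        if centres:
            i = min(range(len(centres)), key=lambda c: abs(p - centres[c]))
            if abs(p - centres[i]) <= threshold:
                clusters[i].append(p)
                # Re-compute the centre of the cluster that just grew.
                centres[i] = sum(clusters[i]) / len(clusters[i])
                continue
        centres.append(p)        # no centre close enough: new cluster
        clusters.append([p])
    return centres, clusters
```

Note how the result depends on the order of the points: feeding the same values in a different sequence can produce different clusters.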

Silhouette  A method of interpretation and validation of clusters of data.  A succinct graphical representation of how well each data point lies within its cluster compared to other clusters.  a(i): average dissimilarity of i with all other points in the same cluster  b(i): the lowest average dissimilarity of i to other clusters 26

Silhouette

Gaussian Mixture

Clustering by Mixture Models

K-Means Revisited (figure labels: model parameters, latent parameters)

Expectation Maximization

EM: Gaussian Mixture

Density Based Methods
 Generate clusters of arbitrary shapes.
 Robust against noise.
 No K value required in advance.
 Somewhat similar to human vision.

DBSCAN  Density-Based Spatial Clustering of Applications with Noise  Density: number of points within a specified radius  Core Point: points with high density  Border Point: points with low density but in the neighbourhood of a core point  Noise Point: neither a core point nor a border point 35 Core Point Noise Point Border Point

DBSCAN (figure: p and q directly density reachable; p and q density reachable; p and q density connected via o)

DBSCAN  A cluster is defined as the maximal set of density connected points.  Start from a randomly selected unseen point P.  If P is a core point, build a cluster by gradually adding all points that are density reachable to the current point set.  Noise points are discarded (unlabelled). 37

Hierarchical Clustering  Produce a set of nested tree-like clusters.  Can be visualized as a dendrogram.  Clustering is obtained by cutting at desired level.  No need to specify K in advance.  May correspond to meaningful taxonomies. 38

Dinosaur Family Tree

Agglomerative Methods
 Bottom-up approach
   Assign each data point to its own cluster.
   Calculate the proximity matrix.
   Merge the pair of closest clusters.
   Repeat until only a single cluster remains.
 How to calculate the distance between clusters?
   Single Link: minimum distance between points
   Complete Link: maximum distance between points
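The bottom-up loop can be sketched as (a minimal sketch over a precomputed distance matrix; passing `link=min` gives single link and `link=max` complete link, and the function name is illustrative):

```python
def agglomerative(dist, n, k, link=min):
    """Bottom-up clustering: start with one cluster per point and
    repeatedly merge the pair of closest clusters until k remain."""
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        # Proximity of two clusters under the chosen linkage:
        # min over point pairs = single link, max = complete link.
        def d(a, b):
            return link(dist[i][j] for i in clusters[a] for j in clusters[b])
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: d(*ab))
        clusters[a] += clusters.pop(b)   # merge the pair of closest clusters
    return clusters
```

Stopping at k clusters corresponds to cutting the dendrogram at the level where exactly k branches remain.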

Example (Single Link)
 Initial distance matrix over six cities: BA, FI, MI, NA, RM, TO.
 Merge MI and TO, giving clusters BA, FI, MI/TO, NA, RM; then merge NA and RM, giving BA, FI, MI/TO, NA/RM.
 Merge BA with NA/RM, giving BA/NA/RM, FI, MI/TO; then merge FI into it, leaving BA/FI/NA/RM and MI/TO at distance 295.

Min vs. Max (single link vs. complete link)

Reading Materials
 Text Books
   Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons.
   J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann.
 Survey Papers
   A. K. Jain, M. N. Murty and P. J. Flynn (1999) "Data Clustering: A Review", ACM Computing Surveys, Vol. 31(3).
   R. Xu and D. Wunsch (2005) "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16(3).
   A. K. Jain (2010) "Data Clustering: 50 Years Beyond K-Means", Pattern Recognition Letters, Vol. 31.

Review
 What is clustering?
 What are the two categories of clustering methods?
 How does the K-Means algorithm work?
 What are the major issues of K-Means?
 How to control the number of clusters in Sequential Leader Clustering?
 How to use Gaussian mixture models for clustering?
 What are the main advantages of density methods?
 What is the core idea of DBSCAN?
 What is the general procedure of hierarchical clustering?
 Which clustering methods do not require K as the input?

Next Week's Class Talk
 Volunteers are required for next week's class talk.
 Topic: Affinity Propagation
   Science 315, 972–976, 2007
   Clustering by passing messages between points.
 Topic: Clustering by Fast Search and Find of Density Peaks
   Science 344, 1492–1496, 2014
   Cluster centres have higher density than their neighbours and are relatively distant from other points with higher densities.
 Length: 20 minutes plus question time

Assignment
 Topic: Clustering Techniques and Applications
 Techniques
   K-Means
   Another clustering method for comparison
 Task 1: 2D Artificial Datasets
   Demonstrate the influence of data patterns.
   Demonstrate the influence of algorithm factors.
 Task 2: Image Segmentation
   Gray vs. Colour
 Deliverables
   Report (experiment specification, algorithm parameters, in-depth analysis)
   Code (any programming language, with detailed comments)
 Due: Sunday, 28 December
 Credit: 15%