AMCS/CS229: Machine Learning

Presentation transcript:

AMCS/CS229: Machine Learning Clustering 2 Xiangliang Zhang King Abdullah University of Science and Technology

Cluster Analysis
- Partitioning Methods + EM algorithm
- Hierarchical Methods
- Density-Based Methods
- Clustering quality evaluation
- How to decide the number of clusters?
- Summary

The quality of Clustering
For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters.
But “clusters are in the eye of the beholder”! Then why do we want to evaluate them?
- To avoid finding patterns in noise
- To compare clustering algorithms
- To compare two sets of clusters
- To compare two clusters

Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following two types:
- External Index: used to measure the extent to which cluster labels match externally supplied class labels. Examples: purity, Normalized Mutual Information (NMI).
- Internal Index: used to measure the goodness of a clustering structure without respect to external information. Examples: Sum of Squared Error (SSE), cophenetic correlation coefficient, silhouette.
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

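Cluster Validity: External Index
The class labels are externally supplied (q classes).
Purity: larger purity values indicate better clustering solutions.
- Purity of each cluster $C_r$ of size $n_r$: $P(C_r) = \frac{1}{n_r} \max_i n_r^i$, where $n_r^i$ is the number of points of class $i$ assigned to cluster $C_r$.
- Purity of the entire clustering: $\mathrm{Purity} = \sum_{r=1}^{k} \frac{n_r}{n} P(C_r)$, where $k$ is the number of clusters and $n$ the total number of points.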

Cluster Validity: External Index
Purity: [figure]
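
A minimal Python sketch of the purity computation, assuming the usual definition: each cluster is credited with the size of its majority class, and the credits are summed and divided by the total number of points. The function name `purity` and the toy labels are illustrative and not taken from the lecture.

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Purity of a clustering against externally supplied class labels."""
    n = len(cluster_labels)
    # Group the class labels of the points by the cluster they were assigned to.
    members = {}
    for c, y in zip(cluster_labels, class_labels):
        members.setdefault(c, []).append(y)
    # Each cluster contributes the count of its most frequent class.
    majority_total = sum(Counter(ys).most_common(1)[0][1] for ys in members.values())
    return majority_total / n

# Hypothetical toy data: 8 points, 3 clusters, 3 classes.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
classes = ["x", "x", "o", "o", "o", "o", "d", "x"]
print(purity(clusters, classes))  # 0.75
```

Here the majority counts are 2, 3 and 1, so the purity is 6/8. Note that purity tends to grow with the number of clusters (one point per cluster gives purity 1), so it is best used to compare clusterings with the same number of clusters.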

Cluster Validity: External Index
The class labels are externally supplied (q classes).
NMI (Normalized Mutual Information): the mutual information I between the cluster labels and the class labels, normalized by their entropies H.

Cluster Validity: External Index
NMI (Normalized Mutual Information): larger NMI values indicate better clustering solutions.
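
A small Python sketch of NMI. Several normalizations are in use; this one assumes the arithmetic-mean form, NMI = 2 I / (H(clusters) + H(classes)), which is the variant used in the IR book linked earlier. The lecture's own normalization may differ, and all function names here are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same points."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum((nij / n) * math.log(n * nij / (count_a[x] * count_b[y]))
               for (x, y), nij in joint.items())

def nmi(cluster_labels, class_labels):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    h_sum = entropy(cluster_labels) + entropy(class_labels)
    if h_sum == 0:
        return 1.0  # both labelings put everything in one group
    return 2 * mutual_information(cluster_labels, class_labels) / h_sum

print(nmi([0, 0, 1, 1], ["a", "a", "b", "b"]))  # clusters match the classes
print(nmi([0, 1, 0, 1], ["a", "a", "b", "b"]))  # clusters independent of the classes
```

Unlike purity, NMI penalizes splitting the data into many small clusters, because the entropy of the cluster labeling grows with the number of clusters.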

Internal Measures: SSE
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
- SSE is good for comparing two clustering results (average SSE).
- SSE curves w.r.t. various K can also be used to estimate the number of clusters.
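
As an illustration, SSE can be computed as the sum of squared Euclidean distances from each point to the centroid of its cluster. The sketch below assumes that definition; the function name `sse` and the four toy points are hypothetical.

```python
def sse(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [0, 0, 1, 1]  # pairs up points that are close together
bad = [0, 1, 0, 1]   # pairs up points that are far apart

print(sse(points, good))  # 4.0
print(sse(points, bad))   # 100.0
print(sse(points, good) / len(points))  # average SSE per point
```

For a fixed K, the clustering with the lower SSE is the tighter one. SSE always decreases as K grows, which is why the curve of SSE against K, rather than a single value, is used when estimating the number of clusters.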

Internal Measures: Cophenetic correlation coefficient
A measure of how faithfully a dendrogram preserves the pairwise distances between the original data points.
- Can be used to compare two hierarchical clusterings of the data.
- Example figure: single-link (min) dendrogram over points A to F, with merge heights 0.5, 0.71, 1.00, 1.41 and 2.50.
- Compute the correlation coefficient between Dist (the matrix of original pairwise distances) and CP (the matrix of cophenetic distances).
Matlab function: cophenet
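
The computation can be sketched in plain Python, assuming single-link clustering, Euclidean distances, and the Pearson correlation between the original and cophenetic distances. This is a naive illustration on hypothetical points, not a replacement for `cophenet`; every function name below is made up for the example.

```python
import math
from itertools import combinations

def single_link_cophenetic(points):
    """Return (dist, coph): original and cophenetic distances per point pair."""
    n = len(points)
    dist = {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(n), 2)}
    clusters = [[i] for i in range(n)]
    coph = {}
    while len(clusters) > 1:
        # Single link: cluster distance is the minimum pairwise point distance.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = min(dist[tuple(sorted((i, j)))]
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Pairs first joined by this merge get the merge height as cophenetic distance.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[tuple(sorted((i, j)))] = d
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return dist, coph

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cophenetic_correlation(points):
    dist, coph = single_link_cophenetic(points)
    pairs = sorted(dist)
    return pearson([dist[p] for p in pairs], [coph[p] for p in pairs])

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]  # two tight pairs
c = cophenetic_correlation(points)
print(c)
```

A value close to 1 means the dendrogram heights reproduce the original distances well. Running the same computation with a different linkage gives a second coefficient, and the two can be compared.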

Internal Measures: Cohesion and Separation
- Cluster cohesion measures how closely related the objects in a cluster are: SSE, or the sum of the weights of all links within a cluster.
- Cluster separation measures how distinct or well-separated a cluster is from other clusters: the sum of the weights of the links between nodes in the cluster and nodes outside the cluster.
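
The graph-based versions of the two quantities can be sketched as follows. The sketch assumes a symmetric matrix of link weights between points; the function name and the weight matrix are hypothetical.

```python
def cohesion_and_separation(weights, labels, cluster):
    """Graph-based cohesion and separation of one cluster.

    weights[i][j] is the symmetric link weight between points i and j.
    """
    n = len(labels)
    inside = [i for i in range(n) if labels[i] == cluster]
    outside = [i for i in range(n) if labels[i] != cluster]
    # Cohesion: total weight of links with both endpoints inside the cluster.
    cohesion = sum(weights[i][j] for i in inside for j in inside if i < j)
    # Separation: total weight of links from the cluster to all other points.
    separation = sum(weights[i][j] for i in inside for j in outside)
    return cohesion, separation

# Hypothetical similarity weights for 4 points (larger = more similar).
W = [
    [0, 9, 1, 2],
    [9, 0, 3, 1],
    [1, 3, 0, 8],
    [2, 1, 8, 0],
]
labels = [0, 0, 1, 1]
print(cohesion_and_separation(W, labels, cluster=0))  # (9, 7)
```

When the weights are similarities, as assumed here, a good cluster has a large within-cluster total and a small total to the rest of the data. With distances as weights the reading is reversed.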

Internal Measures: Silhouette Coefficient
The silhouette coefficient combines ideas of both cohesion and separation. For an individual point i:
- Calculate a = average distance of i to the other points in its cluster.
- Calculate b = min (average distance of i to the points in another cluster).
- The silhouette coefficient for the point is then $s = \frac{b - a}{\max(a, b)}$.
The value lies between -1 and 1, and is typically between 0 and 1. The closer to 1 the better.
The average silhouette width can be calculated for a cluster or for a whole clustering.
Matlab function: silhouette
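
The per-point computation can be sketched in Python as below, assuming Euclidean distance, at least two clusters, and the common convention that a point alone in its cluster gets a silhouette of 0. The function name and the toy data are illustrative.

```python
import math

def silhouette_values(points, labels):
    """Silhouette coefficient (b - a) / max(a, b) for every point."""
    clusters = {}
    for idx, c in enumerate(labels):
        clusters.setdefault(c, []).append(idx)
    scores = []
    for i, ci in enumerate(labels):
        own = [j for j in clusters[ci] if j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: average distance to the other points of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest average distance to the points of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for c, members in clusters.items() if c != ci
        )
        scores.append((b - a) / max(a, b))
    return scores

points = [(0.0,), (1.0,), (10.0,), (11.0,)]  # two well-separated pairs
labels = [0, 0, 1, 1]
scores = silhouette_values(points, labels)
print(scores)
print(sum(scores) / len(scores))  # average silhouette width of the clustering
```

For the first point, a = 1 and b = 10.5, so its coefficient is 9.5 / 10.5, close to 1 as expected for well-separated clusters.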

Determine number of clusters by Silhouette Coefficient
Compare different clusterings by their average silhouette values:
- K=3: mean(silh) = 0.526
- K=4: mean(silh) = 0.640
- K=5: mean(silh) = 0.527
Here K=4 gives the highest average silhouette value.

Determine the number of clusters
- Select the number K of clusters that maximizes the average silhouette value over all points.
- Optimize an objective criterion: the gap statistic of the decrease of SSE w.r.t. K.
- Model-based method: optimize a global criterion (e.g. the maximum likelihood of the data).
- Use clustering methods that do not need K to be set, e.g. DBSCAN.
- Use prior knowledge.
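
The first option can be sketched end to end on toy data: cluster with several values of K and keep the K with the largest average silhouette. The sketch assumes 1-D data and a deliberately simple k-means with a deterministic, spread-out initialization; both helper functions and the data are hypothetical, and a real analysis would use a library implementation.

```python
def kmeans_1d(xs, k, iters=50):
    """Tiny 1-D k-means; initial centers are spread over the sorted data."""
    s = sorted(xs)
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assignment step: nearest center.
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        # Update step: mean of the assigned points (empty clusters keep their center).
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

def mean_silhouette(xs, labels):
    """Average silhouette over all points; singleton clusters contribute 0."""
    total = 0.0
    for i, x in enumerate(xs):
        groups = {}
        for j, y in enumerate(xs):
            if j != i:
                groups.setdefault(labels[j], []).append(abs(x - y))
        own = groups.pop(labels[i], None)
        if own is None or not groups:
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in groups.values())
        total += (b - a) / max(a, b)
    return total / len(xs)

data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]  # three visible groups
scores = {k: mean_silhouette(data, kmeans_1d(data, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On this data the average silhouette peaks at K=3, the number of visible groups. Splitting a tight group or merging two distant ones lowers the average.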

Clustering vs. Classification

Problems and Challenges
Considerable progress has been made in scalable clustering methods:
- Partitioning: k-means, k-medoids, CLARANS
- Hierarchical: BIRCH, ROCK, CHAMELEON
- Density-based: DBSCAN, OPTICS, DenClue
- Grid-based: STING, WaveCluster, CLIQUE
- Model-based: EM, SOM
- Spectral clustering
- Affinity Propagation
- Frequent pattern-based: Bi-clustering, pCluster
Current clustering techniques do not address all the requirements adequately; clustering is still an active area of research.

Open issues in clustering
- Clustering quality evaluation
- How to decide the number of clusters?

What you should know
- What is clustering?
- How does k-means work?
- What is the difference between k-means and k-medoids?
- What is the EM algorithm? How does it work?
- What is the relationship between k-means and EM?
- How is inter-cluster similarity defined in hierarchical clustering? What options do you have?
- How does DBSCAN work?

What you should know
- What are the advantages and disadvantages of DBSCAN?
- How are clustering results evaluated?
- How is the number of clusters usually decided?
- What are the main differences between clustering and classification?