Lecture 8: Evaluation 2 Clustering evaluation Machine Learning Queens College

Today: Cluster Evaluation
- Internal: we don't know anything about the desired labels
- External: we have some information about the labels

Internal Evaluation Clusters have been identified. How good a partitioning of the data set did the clustering construct? Internal measures have the property that they can be directly optimized.

Intracluster variability How far is each point from its cluster centroid? Intuition: every point assigned to a cluster should be closer to the center of that cluster than to the center of any other cluster. K-means optimizes this measure.
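As a minimal sketch of this measure (the function and variable names below are my own, not from the lecture), the within-cluster sum of squared distances — the quantity k-means minimizes — can be computed directly:

```python
import numpy as np

def intracluster_variability(X, labels, centroids):
    """Within-cluster sum of squared distances to each point's centroid.
    This is the objective that k-means minimizes; lower means tighter clusters."""
    total = 0.0
    for k, c in enumerate(centroids):
        members = X[labels == k]          # points assigned to cluster k
        total += ((members - c) ** 2).sum()
    return total

# Toy data: two well-separated 1-D clusters.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5], [10.5]])
print(intracluster_variability(X, labels, centroids))  # 1.0
```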

Model Likelihood Intuition: the model that fits the data best represents the best clustering. Requires a probabilistic model. The likelihood can be combined with AIC (2k - 2 ln L) or BIC (k ln n - 2 ln L) penalties to limit the number of parameters. GMM style: L = Πₙ Σₖ πₖ N(xₙ | μₖ, Σₖ)
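A sketch of the likelihood-plus-penalty idea for a one-dimensional GMM; the data points and parameter values below are made up for illustration, and the helper names are my own:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, stds):
    """Total log-likelihood of 1-D data under a Gaussian mixture model."""
    x = np.asarray(x, dtype=float)[:, None]
    dens = weights / (stds * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - means) / stds) ** 2)
    return float(np.log(dens.sum(axis=1)).sum())

def bic(log_lik, n_params, n_points):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * np.log(n_points) - 2 * log_lik

# Hypothetical data: two tight clusters near 0 and 5.
x = [0.1, -0.2, 0.0, 5.1, 4.9, 5.0]
ll = gmm_log_likelihood(x, weights=np.array([0.5, 0.5]),
                        means=np.array([0.0, 5.0]), stds=np.array([0.2, 0.2]))
# A 2-component 1-D GMM has 5 free parameters: 2 means, 2 stds, 1 mixing weight.
print(bic(ll, n_params=5, n_points=len(x)))
```

Comparing BIC across different numbers of components is one way to keep the likelihood from always favoring more clusters.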

Point similarity vs. cluster similarity Intuition: two points that are similar should be in the same cluster. Spectral clustering optimizes an objective of this form.

Internal Cluster Measures Which cluster measure is best: centroid distance, model likelihood, or point distance? It depends on the data and the task.

External Cluster Evaluation If you have a small amount of labeled data, unsupervised (clustering) techniques can be evaluated against it. Assume that for a subset of the data points you have class labels. How can we evaluate the success of the clustering?

External Cluster Evaluation Can’t we use Accuracy? or “Why is this hard?” The number of clusters may not equal the number of classes. It may be difficult to assign a class to a cluster.

Some principles:
- Homogeneity: each cluster should include members of as few classes as possible.
- Completeness: each class should be represented in as few clusters as possible.

Some approaches: Purity; F-measure, using cluster definitions of precision and recall combined with the harmonic mean as in the traditional F-measure, where n_ir is the number of points of class i in cluster r.
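A minimal sketch of both measures (function names are my own; purity takes the majority class per cluster, and the F-measure weights each class's best-matching cluster):

```python
from collections import Counter

def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    total = 0
    for r in set(clusters):
        members = [c for c, k in zip(classes, clusters) if k == r]
        total += Counter(members).most_common(1)[0][1]
    return total / len(classes)

def f_measure(clusters, classes):
    """Class-weighted best F-score over clusters, using the cluster
    definitions of precision (n_ir / n_r) and recall (n_ir / n_i)."""
    n = len(classes)
    score = 0.0
    for i in set(classes):
        n_i = classes.count(i)
        best = 0.0
        for r in set(clusters):
            n_r = clusters.count(r)
            n_ir = sum(1 for c, k in zip(classes, clusters) if c == i and k == r)
            if n_ir:
                p, rec = n_ir / n_r, n_ir / n_i
                best = max(best, 2 * p * rec / (p + rec))
        score += (n_i / n) * best
    return score

clusters = [0, 0, 1, 1, 1]
classes = ['a', 'a', 'a', 'b', 'b']
print(purity(clusters, classes))     # 0.8
print(f_measure(clusters, classes))  # ≈ 0.8
```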

The problem of matching: two different clusterings can receive the same score; here both receive F-measure 0.6. (Comparison figure omitted.)

The problem of matching: again, two different clusterings both receive F-measure 0.5. (Comparison figure omitted.)

V-Measure A conditional-entropy-based measure that explicitly calculates homogeneity and completeness.

Contingency Matrix We want to know how much the introduction of clusters improves the information about the class distribution. Example counts (rows = classes, columns = clusters): 3 1 / 4 2

Entropy Entropy measures the amount of "information" in a distribution. Wide distributions have a lot of information; narrow distributions have very little. Based on Shannon's limit on the number of bits required to transmit samples from a distribution. Calculation of entropy: H(X) = -Σₓ p(x) log₂ p(x)
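A minimal sketch of these entropy-based quantities and the V-measure built from them (helper names are my own; V is the harmonic mean of homogeneity and completeness):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    """H(labels | given): entropy of labels within each group, weighted by group size."""
    n = len(labels)
    total = 0.0
    for g in set(given):
        sub = [l for l, v in zip(labels, given) if v == g]
        total += (len(sub) / n) * entropy(sub)
    return total

def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness."""
    hc, hk = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if hc == 0 else 1 - conditional_entropy(classes, clusters) / hc
    completeness = 1.0 if hk == 0 else 1 - conditional_entropy(clusters, classes) / hk
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# A perfect clustering is both homogeneous and complete: V = 1.
print(v_measure(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))  # 1.0
```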

Pair Based Measures Statistics over every pair of items. SS – same cluster, same class. SD – same cluster, different class. DS – different cluster, same class. DD – different cluster, different class. These can be arranged in a contingency matrix similar to the one used when accuracy is computed.

Pair Based Measures
Rand: (SS + DD) / (SS + SD + DS + DD)
Jaccard: SS / (SS + SD + DS)
Fowlkes-Mallows: SS / sqrt((SS + SD)(SS + DS))
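A sketch of the pair-counting machinery and the three measures built on it (function names are my own):

```python
from itertools import combinations
from math import sqrt

def pair_counts(clusters, classes):
    """Count every pair of items by (same cluster?, same class?)."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(classes)), 2):
        same_k = clusters[i] == clusters[j]
        same_c = classes[i] == classes[j]
        if same_k and same_c:
            ss += 1
        elif same_k:
            sd += 1
        elif same_c:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_index(clusters, classes):
    ss, sd, ds, dd = pair_counts(clusters, classes)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard(clusters, classes):
    ss, sd, ds, _ = pair_counts(clusters, classes)
    return ss / (ss + sd + ds)

def fowlkes_mallows(clusters, classes):
    ss, sd, ds, _ = pair_counts(clusters, classes)
    return ss / sqrt((ss + sd) * (ss + ds))

# A perfect clustering scores 1.0 on all three.
print(rand_index([0, 0, 1, 1], ['a', 'a', 'b', 'b']))  # 1.0
```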

B-cubed Similar to pair-based counting systems, B-cubed calculates an element-by-element precision and recall.
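A minimal sketch of B-cubed under the usual per-element definitions (the function name is my own): each element's precision is the fraction of its cluster that shares its class, its recall is the fraction of its class that shares its cluster, and both are averaged over all elements.

```python
def b_cubed(clusters, classes):
    """Element-wise precision and recall, averaged over all elements."""
    n = len(classes)
    precision = recall = 0.0
    for i in range(n):
        in_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        in_class = [j for j in range(n) if classes[j] == classes[i]]
        # Elements that share both element i's cluster and its class.
        correct = sum(1 for j in in_cluster if classes[j] == classes[i])
        precision += correct / len(in_cluster)
        recall += correct / len(in_class)
    return precision / n, recall / n

p, r = b_cubed([0, 0, 1, 1], ['a', 'a', 'a', 'b'])
print(p, r)  # precision 0.75, recall ≈ 0.667
```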

External Evaluation Measures There are many choices. Some should almost certainly never be used: Purity, F-measure. Others can be chosen based on the task or other preferences: V-Measure, VI (Variation of Information), B-Cubed.

Next Time Support Vector Machines