Data Analytics CMIS Short Course, Part II. Day 1, Part 1: Clustering. Sam Buttrey, December 2015.


Clustering
- Techniques for finding structure in a set of measurements
- Group the X's without knowing their y's
- Usually we don't know the number of clusters
- Method 1: Visual
  – Difficult because of the (usually) complicated correlation structure in the data
  – Particularly hard in high dimensions

Clustering as Classification
- Clustering is a classification problem in which the Y values themselves have to be estimated
- Y_i | X_i is multinomial, as before
- Most techniques give a hard assignment, but we can also get a probability vector
- Clustering remains under-developed
  – Model quality? Variable selection? Scaling? Transformations, interactions, etc.? Model fit? Prediction?

Clustering by PCs
- Method 2: Principal Components
- If the PCs capture the spread in a smart way, then "nearby" observations should have similar values on the PCs
- Plot the first 2 or 3 PCs and look (e.g. state.x77)
- We still need a rule for assigning observations to clusters, including future observations
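A minimal sketch (not from the original slides) of the "plot and look" step on state.x77; scaling matters because the columns are measured in very different units.

```r
# Plot the first two principal components of state.x77 and look for groupings
pc <- prcomp(state.x77, scale. = TRUE)
plot(pc$x[, 1:2], type = "n",
     xlab = "PC1", ylab = "PC2", main = "state.x77: first two PCs")
text(pc$x[, 1:2], labels = state.abb, cex = 0.7)
```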

Inter-point Distances
- Most clustering techniques rely on a measure of distance between two points, between a point and a cluster, and between two clusters
- Concerns: how do we...
  1. Evaluate the contribution of a variable to the clustering (selection, weighting)?
  2. Account for correlation among variables?
  3. Incorporate categorical variables?

Distances
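For reference (not from the original slides), the two metrics discussed on the next slide are the standard Euclidean and Manhattan distances between observations x_i and x_j on p variables:

\[
d_{\mathrm{Euclidean}}(x_i, x_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2},
\qquad
d_{\mathrm{Manhattan}}(x_i, x_j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}|
\]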

Distance Measure
- R's daisy() in the cluster package computes inter-point distances (replaces dist())
- Scaling and the choice of metric can matter
- If all variables are numeric, choose "euclidean" or "manhattan"
- We can scale columns differently, but correlation among columns is ignored
- Otherwise daisy() uses the Gower distance
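A minimal sketch (assumed calls; see ?daisy) of daisy() on all-numeric data:

```r
# Inter-point distances on the scaled state.x77 data under two metrics
library(cluster)
d_euc <- daisy(scale(state.x77), metric = "euclidean")
d_man <- daisy(scale(state.x77), metric = "manhattan")
summary(d_euc)
```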

Gower Distance
- If some columns are not numeric, the "dissimilarity" between numeric X_ik and X_jk is scaled to |X_ik – X_jk| / range(X_k)
  – (What happens when one entry in X_k has an outlier, like Age = 999?)
- For binary variables the usual dissimilarity is 0 if X_ik = X_jk, 1 if not
  – What if 1's are very rare (e.g. Native Alaskan heritage, attended the Sorbonne)?
  – Asymmetric binary: treat 0–0 matches as uninformative

Gower Distance (cont'd)
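For reference (not necessarily the slide's own notation), a standard statement of the Gower dissimilarity combines the per-variable dissimilarities d_ijk defined above, with δ_ijk = 1 when variable k is usable for the pair (i, j) (non-missing and, for asymmetric binaries, not a 0–0 match) and 0 otherwise:

\[
d_{\mathrm{Gower}}(i, j) = \frac{\sum_{k=1}^{p} \delta_{ijk}\, d_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}}
\]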

Thoughts on Gower
- Natural adjustment for missing values
  – Euclidean dist: inflate by √[ncol(X) / #non-NA]
- All these choices can matter!
- daisy() computes all the pairwise distances up front
  – There are n(n – 1)/2 of these, which causes trouble in really big data
- Things are different in high dimensions; our intuition is not very good there
- Dimensionality reduction is always good!

Digression: High-Dimensional Data
- High-dimensional data is just different
- Here are the pairwise distances among 1,000 points in p dimensions, where each component is independently U(–0.5, +0.5) and the distances are scaled to (0, 1)
- In high dimensions, everything is "equally far away"
- Hopefully our data lies in a lower-dimensional subspace
[Figure: histograms of the scaled pairwise distances for p = 2, 10, 50, 3000]
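A minimal sketch (not from the slides) reproducing the digression: histograms of scaled pairwise distances among 1,000 uniform points as the dimension p grows.

```r
# As p grows, the scaled pairwise distances concentrate: everything is
# "equally far away" in high dimensions
set.seed(1)
n <- 1000
op <- par(mfrow = c(2, 2))
for (p in c(2, 10, 50, 3000)) {
  X <- matrix(runif(n * p, -0.5, 0.5), nrow = n)
  d <- dist(X)
  hist(d / max(d), main = paste("p =", p), xlab = "scaled distance")
}
par(op)
```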

Distance Between Clusters
- In addition to measuring the distance between two observations...
- ...we also need to measure the distance between a point and a cluster, and between two clusters
- Example: Euclidean distance between the two cluster averages
- Example: Manhattan distance between the two points farthest apart
- These choices may make a difference, and we don't have much guidance

A. Partition Methods
- Given the number of clusters (!), try to find observations or points that serve as cluster means or medians
- Goal: each observation should be closer to its own cluster's center than to the center of any other cluster; this partitions the space
  – As we have seen, measuring "closer" requires some choices to be made
- Classic approach: the k-means algorithm
  – The R implementation predates daisy() and requires all-numeric columns

K-means Algorithm
1. Select k candidate cluster centers at random
2. Assign each observation to the nearest cluster center (with Euclidean distance)
3. Recompute the cluster means
4. Repeat from step 2 until convergence
- Guaranteed to converge, but not necessarily to the optimum; the result depends on step 1; k is assumed known (try many k's)
- A from-scratch sketch of these steps appears below
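A minimal from-scratch sketch of the four steps above, for illustration only (it assumes k >= 2 and that no cluster goes empty; in practice use stats::kmeans, which handles these cases and offers multiple starts).

```r
my_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]        # 1. random starting centers
  for (iter in seq_len(max_iter)) {
    # 2. assign each observation to the nearest center (squared Euclidean distance)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    cl <- max.col(-d2)
    # 3. recompute the cluster means
    new_centers <- t(sapply(seq_len(k),
                            function(j) colMeans(X[cl == j, , drop = FALSE])))
    # 4. stop when the centers no longer move
    if (max(abs(new_centers - centers)) < 1e-8) break
    centers <- new_centers
  }
  list(cluster = cl, centers = centers)
}
```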

K-means (cont'd)
- Only kn (not n(n – 1)/2) distance computations per iteration, which helps with big data
- Well-suited to well-separated spherical clusters; not to narrow ellipses, snakes, linked chains, concentric spheres...
- Susceptible to influence from extreme outliers, which perhaps belong in their own clusters of size 1
- Example: the state.x77 data (a possible call is sketched below)
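A hedged sketch of what the state.x77 example might look like (not the original course code); nstart requests multiple random starts.

```r
set.seed(1)
X <- scale(state.x77)                         # scaling matters: columns differ wildly in units
km <- kmeans(X, centers = 4, nstart = 25)     # try several values of k in practice
table(km$cluster)
pairs(X[, c("Income", "Illiteracy", "Life Exp", "Murder")], col = km$cluster)
```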

Pam and Clara
- pam() (Kaufman & Rousseeuw, 1990) is k-means-like, but operates on medoids
  – A cluster's medoid is the observation with the smallest sum of distances to the other members of that cluster
  – Can use daisy() output, so it handles factors
  – Resistant to outliers
  – Expensive (O(n²) in time and memory)
- clara() is pam()'s big sister
  – Operates on small subsets of the data
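A minimal sketch of pam() and clara() from the cluster package (assumed calls; see ?pam and ?clara). pam() accepts a dissimilarity object from daisy().

```r
library(cluster)
d  <- daisy(scale(state.x77), metric = "manhattan")
pm <- pam(d, k = 4)
pm$id.med                  # row indices of the medoid observations
table(pm$clustering)

cl <- clara(scale(state.x77), k = 4, samples = 50)   # subsampled version for big data
table(cl$clustering)
```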

Cluster Validation
- K-means vs. pam: how do we evaluate how well we're doing?
  – Cluster validation is an open problem
- Goals:
  – Ensure we're not just picking up sets of random fluctuations (if our clustering is "better" on our data than what we see with the same technique on random noise, do we feel better?)
  – Determine which of two clusterings is "better"
  – Determine how many "real" clusters there are

Cluster Validity
- External validity: compare cluster labels to "truth," maybe in a classification context
  – True class labels are often not known; we cluster without knowing classes
  – Classes can span clusters (e.g. the same letter written in several different styles), so in any case...
  – ...the true number of clusters is rarely known, even if we knew how many classes there were

Cluster Validity
- Internal validity: measure something about "inherent" goodness
  – Perhaps an R²-style measure, 1 – SSW/SSB, using the "sum of squares within" and "sum of squares between"
  – Whatever metric the clustering algorithm optimizes will look good in our results
  – "Always" better than using our technique on noise
  – Not obvious how to use a training/test set
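A small illustration (not from the slides) of the sum-of-squares quantities, using the components that stats::kmeans already returns.

```r
km <- kmeans(scale(state.x77), centers = 4, nstart = 25)
km$tot.withinss            # SSW: total within-cluster sum of squares
km$betweenss               # SSB: between-cluster sum of squares
km$betweenss / km$totss    # an R^2-style "proportion of variance explained"
```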

The Silhouette Plot
- For each point, compute the average distance to all other points in its cluster (a), and the average distance to points not in its cluster (b)
- The silhouette coefficient is then 1 – a/b
- Usually in [0, 1]; larger is better
  – Can be averaged over clusters or overall
- Drawn by plot.pam(), plot.clara() (different from the bannerplot!)
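A minimal sketch (assumed calls) of silhouette values and the silhouette plot for a pam() fit from the cluster package.

```r
library(cluster)
pm <- pam(scale(state.x77), k = 4)
summary(silhouette(pm))     # per-cluster and overall average silhouette widths
plot(pm, which.plots = 2)   # the silhouette plot (which.plots = 1 is the clusplot)
```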

Examples
- K-means vs. pam (cont'd): how do we evaluate how well we're doing?
  – For the moment let's measure agreement between the two clusterings
  – One choice: Cramér's V
  – V = √[ χ² / (n (k – 1)) ], where k = min(#rows, #cols) of the cross-tabulation
  – V ∈ [0, 1]; more rows and columns tend to give higher V
  – Rules of thumb: .15 weak, .35 strong, .50+ "essentially measuring the same thing"
- Let's do this thing!
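A small helper (hypothetical, not from the course materials) computing Cramér's V from a cross-tabulation of two clusterings.

```r
cramers_v <- function(tab) {
  # chi-squared statistic without continuity correction
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  k    <- min(nrow(tab), ncol(tab))
  as.numeric(sqrt(chi2 / (sum(tab) * (k - 1))))
}
# e.g. cramers_v(table(km$cluster, pm$clustering)) for the fits above
```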

Hierarchical Clustering
- Techniques that preserve a hierarchy (so that to get from the "best" six clusters to the best five, we join two existing clusters)
  – Advantages: hierarchy is good; nice pictures make it easier to choose the number of clusters
  – Disadvantages: small data sets only
- Typically "agglomerative" or "divisive"
- agnes(): each object starts as its own cluster; keep "merging" the two closest clusters till there's one huge cluster

Agnes (cont'd)
- Each step reduces the number of clusters by 1
- At each stage we need to know every entity's distance to every other
- We merge the two closest objects...
- ...then compute the distances from the new object to all other entities
- As before, we need to be able to measure the distance between two clusters, or between a point and a cluster

Hierarchical Clustering (cont'd)
- Divisive clustering, implemented in diana():
  – Start with all objects in one group
  – At each step, find the largest cluster
  – Remove its "weirdest" observation
  – See if others from that parent want to join the splinter group
  – Repeat until each observation is its own cluster
- Clustering techniques often don't agree!

Dendrogram
- The tree picture (the "dendrogram") shows the merging distance vertically and the observations horizontally
- Any horizontal cut specifies a number of clusters (implemented in cutree())
- Both agnes() and diana() require all n-choose-2 distances up front; they are ill-suited to large samples
- A sketch of the agnes/diana/cutree workflow follows
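A hedged sketch of the hierarchical workflow (assumed calls; see ?agnes, ?diana, and ?cutree).

```r
library(cluster)
d  <- daisy(scale(state.x77))
ag <- agnes(d, method = "average")          # agglomerative
di <- diana(d)                              # divisive
plot(ag, which.plots = 2)                   # the dendrogram
cl3 <- cutree(as.hclust(ag), k = 3)         # cut the tree into 3 clusters
table(cl3, cutree(as.hclust(di), k = 3))    # the two methods often disagree
```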

Clustering Considerations
- Other methods (e.g. mixture models) exist
- Scaling/weighting and transformation are not automatic (although methods are being proposed to do this)
- Hierarchical methods don't scale well
  – Must avoid computing all pairwise distances
- Validation and finding k are hard
  – Clustering is inherently more complicated than, say, linear regression

Shameless Plug
- Remember random forests? "Proximity" is measured by the number of times two observations fall in the same leaf
  – But every tree has the same response variable
- The treeClust() dissimilarity of Buttrey and Whitaker (2015) measures dissimilarity across a set of trees in which each variable serves in turn as the response, contributing 0 or 1 trees
  – Some trees are pruned back to the root and dropped
- Seems to perform well in a lot of cases
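As a rough illustration of the random-forest proximity idea mentioned above (this uses the randomForest package in unsupervised mode, not the treeClust() dissimilarity itself; the choice of 4 clusters is arbitrary):

```r
library(randomForest)
library(cluster)
set.seed(1)
# Unsupervised forest: no y supplied, proximities requested
rf   <- randomForest(x = as.data.frame(state.x77), proximity = TRUE)
d_rf <- as.dist(1 - rf$proximity)       # proximity -> dissimilarity
pam(d_rf, k = 4)$clustering
```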

More Examples
- Hierarchical clustering
  – The state.x77 data, again
- Splice example
  – Gower distance; RF proximity; treeClust
- Visualizing high-dimensional data
  – (Numeric) multidimensional scaling, t-SNE
- Let's do this thing!
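A minimal sketch (assumed calls) of classical multidimensional scaling on a daisy() dissimilarity, as one way to visualize high-dimensional data in two dimensions.

```r
library(cluster)
d  <- daisy(scale(state.x77))
xy <- cmdscale(d, k = 2)                  # classical MDS to 2 dimensions
plot(xy, type = "n", xlab = "MDS 1", ylab = "MDS 2")
text(xy, labels = state.abb, cex = 0.7)
# For t-SNE, the Rtsne package provides Rtsne(), which takes a matrix or dist object
```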