Quality of Clusterings
Two metrics:
– SSE
– Dissimilarity Ratio


Computing SSE
– Save clusters. Two new columns are created: Cluster and Distance.
– Create a new column as a formula. Name it dist-sqr and define it as Distance^2.
– Run Analyze – Distribution for dist-sqr.
– Get the mean and multiply it by N (the number of rows) to obtain SSE, since the mean times N is the sum of the squared distances.
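The same arithmetic can be sketched outside JMP. The following is a minimal Python illustration, not part of the original slides; the function name `sse_from_distances` and the sample distances are made up for the example.

```python
def sse_from_distances(distances):
    """Mirror the slide's steps: square each distance, take the mean, multiply by N."""
    squared = [d ** 2 for d in distances]   # the dist-sqr column
    n = len(squared)                        # N, the number of rows
    mean_sq = sum(squared) / n              # mean shown by the Distribution output
    return mean_sq * n                      # mean * N = sum of squared distances

# Hypothetical Distance column: each point's distance to its own cluster centroid.
distances = [1.0, 2.0, 2.0, 0.5]
sse = sse_from_distances(distances)
print(sse)
```

Multiplying the mean by N simply recovers the sum, so summing the dist-sqr column directly would give the same SSE.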

Computing Dissimilarity Ratio
Dissimilarity ratio = (inter-cluster distance / intra-cluster distance)
Inter-cluster distance is the smallest distance between centroids.
Normalize centroid coordinates:
– Coordinates are given in the cluster output
– Find the mean and std dev for each dimension from the histogram (distribution) output
– Normalize each centroid coordinate: (x - mean) / std dev
– Compute the distance between each pair of centroids; the inter-cluster distance is the smallest of the normalized centroid distances
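The normalization and pairwise comparison above can be sketched in Python. This is an illustration under assumptions: the slides do not say which distance is used, so Euclidean distance is assumed here, and the function names, centroids, means, and standard deviations are all hypothetical.

```python
import math
from itertools import combinations

def normalize(centroid, means, std_devs):
    """Apply (x - mean) / std dev to each coordinate of one centroid."""
    return [(x - m) / s for x, m, s in zip(centroid, means, std_devs)]

def inter_cluster_distance(centroids, means, std_devs):
    """Smallest distance between any pair of normalized centroids.

    Assumption: Euclidean distance (the slides do not name the metric).
    """
    normalized = [normalize(c, means, std_devs) for c in centroids]
    return min(math.dist(a, b) for a, b in combinations(normalized, 2))

# Made-up example: three centroids in two dimensions.
centroids = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
means = [3.0, 4.0]       # per-dimension mean from the distribution output
std_devs = [1.0, 2.0]    # per-dimension std dev from the distribution output

inter = inter_cluster_distance(centroids, means, std_devs)
print(inter)
```

Normalizing first keeps a dimension with a large numeric range from dominating the centroid distances.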

Dissimilarity Ratio – cont.
– Intra-cluster distance is given by the average max dist of the clusters.
– The max dist of each cluster is found in the cluster output in JMP.
– Compute the dissimilarity ratio (DR) for each clustering.
– The higher the DR, the better the clustering.
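As a rough sketch of the final step, the ratio and the comparison between clusterings could be computed as follows. The function name, the clustering labels, and all numbers are hypothetical, and the sketch assumes the max distances are on the same (normalized) scale as the inter-cluster distance.

```python
def dissimilarity_ratio(inter_cluster_distance, max_distances):
    """DR = inter-cluster distance / average of the per-cluster max distances."""
    intra = sum(max_distances) / len(max_distances)   # average max dist
    return inter_cluster_distance / intra

# Made-up values for two candidate clusterings of the same data.
candidates = {
    "k=3": dissimilarity_ratio(3.0, [1.0, 2.0, 1.5]),  # intra = 1.5
    "k=2": dissimilarity_ratio(2.0, [2.0, 2.0]),       # intra = 2.0
}

# The higher the DR, the better the clustering.
best = max(candidates, key=candidates.get)
print(candidates, best)
```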