Math 5364 Notes
Chapter 8: Cluster Analysis
Jesse Crawford
Department of Mathematics
Tarleton State University

Today's Topics
- Overview of cluster analysis
- K-means clustering

What is Cluster Analysis?
- Dividing objects into clusters
- Distances within clusters are small
- Distances between clusters are large
- Training data has no class labels!
- Cluster analysis is also called unsupervised classification

Cluster Centers
- Cluster centers: prototypes, centroids, medoids

Purposes of Cluster Analysis: Understanding
- Biology: Divide organisms into different classes (kingdom, phylum, class, etc.)
- Business: Divide customers into clusters for marketing purposes
- Weather: Identify patterns in the atmosphere and ocean

Purposes of Cluster Analysis: Utility
- Replace data points with cluster centers for summarization/compression

K-Means Clustering
K-Means Algorithm
1. Select K initial centroids.
2. Repeat the following:
   - Form K clusters (assign each point to its closest centroid).
   - Recompute the centroid of each cluster.
3. Stop when the centroids converge.
Notes
- Requires a distance metric (Example: Euclidean distance).
- The definition of the centroid depends on the metric (Example: centroid = mean for Euclidean distance).
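The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `kmeans`, the choice of K random data points as initial centroids, the `max_iter` cap, and the handling of empty clusters are all assumptions made for the example. It uses Euclidean distance, so the centroid is the mean.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """K-means with Euclidean distance on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Select K initial centroids (assumption: K data points chosen at random).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Form K clusters: assign each point to its closest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Recompute the centroid of each cluster as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the two groups end up in separate clusters regardless of which two points are drawn as initial centroids; on less cleanly separated data the result can depend on the initialization.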

Sums of Squares for K-Means
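The formulas for this slide did not survive in the transcript. As a reference point, the usual sums of squares for K-means with Euclidean distance are written out below. The notation is an assumption, not taken from the slides: C_k is the kth cluster, c_k its centroid (mean), n_k its size, and x-bar the mean of all n points.

```latex
\begin{align*}
\mathrm{SSW} &= \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - c_k \rVert^2
  && \text{(within sum of squares)} \\
\mathrm{SSB} &= \sum_{k=1}^{K} n_k \, \lVert c_k - \bar{x} \rVert^2
  && \text{(between sum of squares)} \\
\mathrm{SST} &= \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2
  = \mathrm{SSW} + \mathrm{SSB}
  && \text{(total sum of squares)}
\end{align*}
```

Because SST is fixed by the data, making SSW smaller is the same as making SSB larger.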

A Problem with K-Means
- Different initial centroids can result in different clusterings.
- Some choices of initial centroids lead only to a local minimum of the sum of squares, not the global minimum.
- Possible solution: Repeat with randomly chosen initial centroids and keep the best result.
- Let m = number of repetitions.

Today's Topics
Cluster Evaluation
- Unsupervised evaluation measures: SSW, silhouette coefficient
- Supervised evaluation measures: entropy, purity
- Significance tests

Unsupervised Evaluation Measures
- Do not use class labels
- SSW = Within Sum of Squares
- Silhouette coefficient

Interpreting SSW

Silhouette Coefficient
1. For the ith data object, calculate its average distance to all other objects in its cluster. Call this value a_i.
2. For the ith data object and any cluster not containing that object, calculate the object's average distance to all the objects in the given cluster.
3. The minimum value from Step 2 is called b_i.
4. For the ith object, the silhouette coefficient is s_i = (b_i - a_i) / max(a_i, b_i).
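The four steps can be followed directly in code. The sketch below assumes Euclidean distance, at least two clusters, and the common formula s_i = (b_i - a_i) / max(a_i, b_i). The function name `silhouette_values` and the convention that a point alone in its cluster gets 0 are assumptions made for illustration.

```python
import math


def silhouette_values(points, labels):
    """Return the silhouette coefficient s_i for every point."""
    clusters = sorted(set(labels))
    values = []
    for i, p in enumerate(points):
        # Step 1: a_i = average distance to the other objects in its own cluster.
        own = [q for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            # Assumed convention: a singleton cluster gets s_i = 0.
            values.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        # Steps 2 and 3: b_i = smallest average distance to any other cluster.
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        # Step 4: combine a_i and b_i.
        values.append((b - a) / max(a, b))
    return values


# Hypothetical one-dimensional data with two obvious clusters.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
s = silhouette_values(points, labels)
mean_silhouette = sum(s) / len(s)
```

Values near 1 indicate a point that sits much closer to its own cluster than to any other; averaging the s_i gives a single score for the whole clustering.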

Silhouette Coefficient

Distance Matrix for a Data Set

Statistical Significance of the Silhouette Coefficient

Supervised Evaluation Measures
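The body of this slide is missing from the transcript, but the topic list names entropy and purity. A minimal sketch of both follows, assuming the usual definitions: within each cluster, take the class proportions p, compute entropy as -sum(p * log2(p)) and purity as max(p), then average over clusters weighted by cluster size. The function name and the example labels are hypothetical.

```python
import math
from collections import Counter


def entropy_and_purity(cluster_labels, class_labels):
    """Size-weighted entropy and purity of a clustering against known classes."""
    n = len(cluster_labels)
    total_entropy = 0.0
    total_purity = 0.0
    for cluster in set(cluster_labels):
        # Class counts among the objects assigned to this cluster.
        counts = Counter(
            cls for clu, cls in zip(cluster_labels, class_labels)
            if clu == cluster
        )
        size = sum(counts.values())
        probs = [c / size for c in counts.values()]
        entropy = -sum(p * math.log2(p) for p in probs)
        purity = max(probs)
        # Weight each cluster by its share of the data.
        total_entropy += size / n * entropy
        total_purity += size / n * purity
    return total_entropy, total_purity


# Hypothetical example: cluster 0 is mixed (3 "a", 1 "b"), cluster 1 is pure.
cluster_labels = [0, 0, 0, 0, 1, 1, 1, 1]
class_labels = ["a", "a", "a", "b", "b", "b", "b", "b"]
entropy, purity = entropy_and_purity(cluster_labels, class_labels)
```

Lower entropy and higher purity both indicate clusters that line up more closely with the known classes.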

Today's Topics
- Chi-square test for cluster evaluation
- DBSCAN

Chi-square Test for Independence
              Engineering  Science and Tech  Business  Other  Totals
In State
Out of State
Totals
How can we test independence of these two variables?

Chi-square Test for Independence
          Column 1  Column 2  Column 3  …  Column c  Totals
Row 1     O_11      O_12      O_13      …  O_1c      R_1
Row 2     O_21      O_22      O_23      …  O_2c      R_2
…         …         …         …         …  …         …
Row r     O_r1      O_r2      O_r3      …  O_rc      R_r
Totals    C_1       C_2       C_3       …  C_c       N

Chi-square Test for Independence
Observed      Engineering  Science and Tech  Business  Other  Totals
In State
Out of State
Totals

Expected      Engineering  Science and Tech  Business  Other  Totals
In State
Out of State
Totals
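The counts from the observed and expected tables were not preserved, so the sketch below uses made-up numbers purely to show the arithmetic. It assumes the standard test: expected count E_ij = R_i * C_j / N, statistic equal to the sum of (O_ij - E_ij)^2 / E_ij, and (r - 1)(c - 1) degrees of freedom. The function name is hypothetical.

```python
def chi_square_independence(observed):
    """Expected counts, chi-square statistic, and degrees of freedom."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: E_ij = R_i * C_j / N.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Statistic: sum over all cells of (O - E)^2 / E.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, dof


# Hypothetical counts (NOT the values from the slides).
# Rows: In State, Out of State.
# Columns: Engineering, Science and Tech, Business, Other.
observed = [
    [30, 20, 40, 10],
    [20, 30, 10, 40],
]
expected, statistic, dof = chi_square_independence(observed)
```

With these made-up counts every expected cell is 25, the statistic is 40, and there are 3 degrees of freedom. Since 40 far exceeds 7.815, the 0.05 critical value for 3 degrees of freedom, independence would be rejected for this invented table. The same test can be applied to a table of cluster labels against class labels.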

DBSCAN Clustering Algorithm
DBSCAN = Density-Based Spatial Clustering of Applications with Noise

DBSCAN: Parameters and Types of Points
Requires two parameters:
- Eps (must be chosen)
- MinPts (default value = 5)
Three types of points:
- Core points: points with at least MinPts neighbors within their Eps neighborhood
- Border points: points that are not core points but lie within the Eps neighborhood of a core point
- Noise points: points that are neither core points nor border points
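The three point types can be computed directly from their definitions. The sketch below is an illustration under two assumptions that the slide does not settle: distance is Euclidean, and a point counts as a member of its own Eps neighborhood. The function name and the data are hypothetical.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    # Eps neighborhood of each point (assumption: it includes the point itself).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    kinds = []
    for i, nb in enumerate(neighbors):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in nb):
            # Not core, but within Eps of some core point.
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds


# Hypothetical one-dimensional data.
points = [(0.0,), (1.0,), (2.0,), (3.5,), (10.0,)]
kinds = classify_points(points, eps=1.2, min_pts=3)
# kinds == ['border', 'core', 'border', 'noise', 'noise']
```

Only the point at 1.0 has three points (itself included) within distance 1.2, so it is the single core point; its two neighbors are border points and the rest are noise.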

DBSCAN: Parameters and Types of Points
Example parameter values:
- Eps = 0.2
- MinPts = 5

DBSCAN Algorithm
1. Identify all core points, border points, and noise points.
2. Two core points within Eps of each other are assigned to the same cluster.
3. Each border point is assigned to the cluster of one of its associated core points.
4. Noise points are not assigned to clusters. They are simply classified as noise.
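A compact sketch of these steps is given below. It is not a reference implementation: it assumes Euclidean distance, counts a point as part of its own Eps neighborhood, uses a brute-force neighbor search, and assigns a border point to whichever cluster reaches it first. The function name `dbscan` and the label -1 for noise are choices made for this example.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Brute-force Eps neighborhoods (each includes the point itself).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue
        # Grow a new cluster from a core point that has no cluster yet.
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in neighbors[i]:
                if labels[j] == -1:
                    labels[j] = cluster
                    # Core points within Eps join the cluster and keep expanding it;
                    # border points join but are not expanded.
                    if is_core[j]:
                        stack.append(j)
        cluster += 1
    # Points never reached from a core point keep the label -1 (noise).
    return labels


# Hypothetical one-dimensional data: two dense groups and one isolated point.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (50.0,)]
labels = dbscan(points, eps=1.2, min_pts=3)
# labels == [0, 0, 0, 1, 1, 1, -1]
```

Unlike K-means, the number of clusters is not specified in advance; it falls out of Eps and MinPts.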

Today's Topics
- Agglomerative hierarchical clustering

Hierarchical Clustering
- Taxonomy of living organisms
- Dendrogram

Agglomerative Hierarchical Clustering

Distances Between Clusters

Agglomerative Hierarchical Clustering
Heights = 1.0, 1.4, 3.0, 3.6, 5.6, 8.1, 13.0, 20.3
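The heights listed above are the distances at which successive merges occur in the dendrogram. The data behind them is not in the transcript, so the sketch below uses hypothetical points to show how such a list arises. It assumes Euclidean distance and offers three common choices for the distance between clusters (single, complete, and group-average linkage); the function name and the brute-force search are illustrative only.

```python
import math


def agglomerative_heights(points, linkage="single"):
    """Merge clusters bottom-up and return the height of each merge."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    heights = []

    def cluster_distance(members_a, members_b):
        dists = [math.dist(points[i], points[j])
                 for i in members_a for j in members_b]
        if linkage == "single":
            return min(dists)           # closest pair
        if linkage == "complete":
            return max(dists)           # farthest pair
        return sum(dists) / len(dists)  # group average

    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        height, ia, ib = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        heights.append(height)
        # Merge cluster ib into cluster ia (ib > ia, so deleting ib is safe).
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return heights


# Hypothetical one-dimensional data (not the data set from the slides).
points = [(0.0,), (1.0,), (3.0,), (7.0,)]
single_heights = agglomerative_heights(points, linkage="single")
complete_heights = agglomerative_heights(points, linkage="complete")
# single_heights == [1.0, 2.0, 4.0]; complete_heights == [1.0, 3.0, 7.0]
```

With n points there are n - 1 merges, which is why a list of eight heights corresponds to nine objects. Cutting the dendrogram at a given height yields the clusters that exist just before any merge above that height.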

Today's Topics
- Gaussian mixture EM clustering

Setting for Gaussian Mixture EM Clustering
- p.m.f. for Y
- Prior distribution for Y
- Joint conditional distribution of the X_j's given Y

Setting for Gaussian Mixture EM Clustering
- Prior distribution for Y
- Posterior distribution for Y
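The formulas on these two slides were lost in the transcript. A common way to write the setting is shown below as a sketch. The notation is assumed rather than taken from the slides: Y is the unobserved cluster label with values 1, ..., K, pi_k is its prior probability, and f_k is the Gaussian density of X given Y = k.

```latex
\begin{align*}
P(Y = k) &= \pi_k, \qquad \sum_{k=1}^{K} \pi_k = 1
  && \text{(prior distribution for } Y\text{)} \\
X \mid Y = k &\sim N(\mu_k, \Sigma_k) \text{ with density } f_k
  && \text{(conditional distribution of } X \text{ given } Y\text{)} \\
P(Y = k \mid X = x) &= \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}
  && \text{(posterior distribution for } Y\text{, by Bayes' rule)}
\end{align*}
```

The posterior probabilities act as soft cluster assignments: each point belongs to every cluster with some probability.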

Want to maximize this
Problem: We don't know the Y's
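EM works around the unknown Y's by alternating two steps: an E-step that computes the posterior probability of each cluster for each point under the current parameters, and an M-step that re-estimates the parameters using those probabilities as fractional counts. The sketch below is a one-dimensional illustration only. The function names, the fixed iteration count, the starting values, and the small floor on the standard deviation are assumptions made for the example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_gmm_1d(xs, mus, sigmas, weights, n_iter=50):
    """EM for a one-dimensional Gaussian mixture, from given starting values."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            joint = [weights[j] * normal_pdf(x, mus[j], sigmas[j])
                     for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: weighted estimates, with the posteriors as fractional counts.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            # Assumed safeguard: keep sigma away from zero.
            sigmas[j] = max(math.sqrt(var), 1e-6)
    return mus, sigmas, weights


# Hypothetical data: two groups centered near 0.5 and 9.5.
xs = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
mus, sigmas, weights = em_gmm_1d(
    xs, mus=[0.0, 10.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5]
)
```

On this toy data the estimated means settle near 0.5 and 9.5 with weights near 0.5 each. Like K-means, EM can converge to a local maximum of the likelihood, so the starting values matter on harder data.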

Further Reading
- Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B. 39 (1): 1–38.
- Ledolter, J. (2013). Data Mining and Business Analytics with R.