Math 5364 Notes Chapter 8: Cluster Analysis Jesse Crawford Department of Mathematics Tarleton State University.

1 Math 5364 Notes Chapter 8: Cluster Analysis Jesse Crawford Department of Mathematics Tarleton State University

2 Today's Topics Overview of Cluster Analysis K-means clustering

4 What is Cluster Analysis? Dividing objects into clusters Distances within clusters are small Distances between clusters are large Training data has no class labels! Cluster analysis is also called unsupervised classification

5 Cluster Centers Cluster centers: prototypes, centroids, medoids

6 Purposes of Cluster Analysis: Understanding. Biology: divide organisms into different classes (kingdom, phylum, class, etc.). Business: divide customers into clusters for marketing purposes. Weather: identify patterns in the atmosphere and ocean.

7 Purposes of Cluster Analysis: Utility. Replace data points with cluster centers for summarization/compression.

9 K-Means Clustering. K-Means Algorithm: Select K initial centroids. Repeat the following: form K clusters (assign each point to its closest centroid), then recompute the centroid of each cluster. Stop when the centroids converge. The algorithm requires a distance metric (example: Euclidean distance), and the definition of the centroid depends on the metric (example: centroid = mean for Euclidean distance).
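
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the course's own code: Euclidean distance, centroid = mean, initial centroids supplied by the caller, and an empty cluster simply keeps its old centroid. The function name `kmeans` and the toy data are hypothetical.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Basic K-means with Euclidean distance; returns (centroids, labels)."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = []
    for _ in range(max_iter):
        # Form K clusters: assign each point to its closest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Recompute the centroid of each cluster as the mean of its points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # assumption: keep old centroid
        # Stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])
print(centroids)
print(labels)
```

With one initial centroid taken from each group, the assignment stabilizes after the first pass and the loop stops on the second.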

11 Sums of Squares for K-Means

14 A Problem with K-Means: Different initial centroids can result in different clusterings. Some choices of initial centroids may lead only to a local minimum. Possible solution: repeat with randomly chosen initial centroids. Let m = number of repetitions.
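
One way to act on this suggestion is to run K-means m times from random starts and keep the run with the smallest within sum of squares. The sketch below assumes that selection criterion, Euclidean distance, and initial centroids sampled from the data points; the helper names (`best_of_m`, `ssw`) are hypothetical.

```python
import math
import random

def assign(points, centroids):
    # Label each point with the index of its closest centroid.
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def update(points, labels, centroids):
    # Recompute each centroid as the mean of its members (empty clusters keep theirs).
    new = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        new.append(tuple(sum(c) / len(members) for c in zip(*members))
                   if members else old)
    return new

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        new = update(points, assign(points, centroids), centroids)
        if new == centroids:
            break
        centroids = new
    return centroids, assign(points, centroids)

def ssw(points, centroids, labels):
    # Within sum of squares: squared distance of each point to its own centroid.
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def best_of_m(points, k, m, seed=0):
    """Run K-means m times from random starts; return the best run and all SSW values."""
    rng = random.Random(seed)
    runs = []
    for _ in range(m):
        init = rng.sample(points, k)          # k distinct data points as a random start
        centroids, labels = kmeans(points, init)
        runs.append((ssw(points, centroids, labels), centroids, labels))
    return min(runs, key=lambda run: run[0]), [run[0] for run in runs]

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
(best_ssw, best_centroids, best_labels), all_ssw = best_of_m(points, k=2, m=10)
print(best_ssw, sorted(all_ssw))
```

Printing all m values of SSW shows whether different starts actually ended in different local minima.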

15 Today's Topics: Cluster Evaluation. Unsupervised evaluation measures: SSW, silhouette coefficient. Supervised evaluation measures: entropy, purity. Significance tests.

16 Unsupervised Evaluation Measures: These do not use class labels. SSW = within sum of squares. Silhouette coefficient.
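
The transcript does not preserve the formulas for the sums of squares, so the following sketch uses the standard definitions as an assumption: SSW sums squared distances from each point to its own cluster mean, SSB sums the squared distances from each cluster mean to the overall mean weighted by cluster size, and SST sums squared distances from each point to the overall mean. With these definitions SST = SSW + SSB, which the code checks numerically. The names `sums_of_squares` and `centroid` are hypothetical.

```python
import math

def centroid(pts):
    # Coordinate-wise mean of a list of points.
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def sums_of_squares(points, labels):
    """Return (SSW, SSB, SST) for a clustering given by integer labels."""
    overall = centroid(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    # Within: each point against its own cluster mean.
    ssw = sum(math.dist(p, centroid(members)) ** 2
              for members in clusters.values() for p in members)
    # Between: each cluster mean against the overall mean, weighted by cluster size.
    ssb = sum(len(members) * math.dist(centroid(members), overall) ** 2
              for members in clusters.values())
    # Total: each point against the overall mean.
    sst = sum(math.dist(p, overall) ** 2 for p in points)
    return ssw, ssb, sst

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels = [0, 0, 0, 1, 1, 1]
ssw, ssb, sst = sums_of_squares(points, labels)
print(ssw, ssb, sst)
```

A small SSW relative to SST indicates tight clusters; because SSW can only shrink as K grows, it is usually compared across values of K rather than read in isolation.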

17 Interpreting SSW

18 Silhouette Coefficient 1. For the ith data object, calculate its average distance to all other objects in its cluster. Call this value a_i. 2. For the ith data object and any cluster not containing that object, calculate the object's average distance to all the objects in the given cluster. 3. The minimum value from Step 2 is called b_i. 4. For the ith object, the silhouette coefficient is $s_i = (b_i - a_i) / \max(a_i, b_i)$.
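
The four steps can be followed directly in code. This sketch assumes Euclidean distance, at least two clusters, and the usual form s_i = (b_i - a_i) / max(a_i, b_i); assigning 0 to an object that is alone in its cluster is a convention chosen here, not something stated on the slide.

```python
import math

def silhouette_values(points, labels):
    """Silhouette coefficient of every object (requires at least two clusters)."""
    cluster_ids = sorted(set(labels))
    values = []
    for i, p in enumerate(points):
        # Step 1: average distance to the other objects in the same cluster.
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            values.append(0.0)   # assumption: singleton clusters score 0
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        # Steps 2 and 3: average distance to each other cluster, then the minimum.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in cluster_ids if c != labels[i]
        )
        # Step 4: combine a_i and b_i.
        values.append((b - a) / max(a, b))
    return values

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
values = silhouette_values(points, labels)
average = sum(values) / len(values)
print(values, average)
```

Values near 1 mean an object is much closer to its own cluster than to the nearest other cluster; negative values suggest it may be in the wrong cluster.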

19 Silhouette Coefficient

20 Distance Matrix for a Data Set

21 Statistical Significance of the Silhouette Coefficient

22 Supervised Evaluation Measures
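
The formulas for the supervised measures are not preserved in this transcript. As a hedged sketch, the code below uses the common definitions of the two measures named earlier: within cluster i, p_ij is the fraction of objects belonging to class j; the cluster's entropy is -sum_j p_ij log2 p_ij and its purity is max_j p_ij; the overall values are averages weighted by cluster size. The function name and toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy_and_purity(cluster_labels, class_labels):
    """Size-weighted entropy (base 2) and purity of a clustering against known classes."""
    n = len(cluster_labels)
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, []).append(cls)
    entropy = 0.0
    purity = 0.0
    for members in by_cluster.values():
        counts = Counter(members)
        probs = [count / len(members) for count in counts.values()]
        cluster_entropy = -sum(p * math.log2(p) for p in probs)
        cluster_purity = max(probs)
        weight = len(members) / n          # weight each cluster by its size
        entropy += weight * cluster_entropy
        purity += weight * cluster_purity
    return entropy, purity

clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
entropy, purity = entropy_and_purity(clusters, classes)
print(entropy, purity)
```

Lower entropy and higher purity both indicate clusters that line up well with the known class labels.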

25 Today's Topics Chi-squared Test for Cluster Evaluation DBSCAN

26 Chi-square Test for Independence

                Engineering   Science and Tech   Business   Other   Totals
In State                 16                 14         13      13       56
Out of State             14                  6         10       8       38
Totals                   30                 20         23      21       94

How can we test independence of these two variables?

27 Chi-square Test for Independence

           Column 1   Column 2   Column 3   …   Column c   Totals
Row 1      O_11       O_12       O_13       …   O_1c       R_1
Row 2      O_21       O_22       O_23       …   O_2c       R_2
…          …          …          …          …   …          …
Row r      O_r1       O_r2       O_r3       …   O_rc       R_r
Totals     C_1        C_2        C_3        …   C_c        N
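
The formulas that accompanied this table are not preserved in the transcript. Assuming the slides follow the standard Pearson chi-square test of independence, the expected counts, test statistic, and degrees of freedom in the table's notation would be:

```latex
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\mathrm{df} = (r - 1)(c - 1)
```

Under independence, the statistic is approximately chi-square distributed with (r - 1)(c - 1) degrees of freedom, so large values count as evidence against independence.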

31 Chi-square Test for Independence

Observed        Engineering   Science and Tech   Business   Other   Totals
In State                 16                 14         13      13       56
Out of State             14                  6         10       8       38
Totals                   30                 20         23      21       94

Expected        Engineering   Science and Tech   Business   Other   Totals
In State              17.87              11.91      13.70   12.51       56
Out of State          12.13               8.09       9.30    8.49       38
Totals                   30                 20         23      21       94
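
The expected counts can be reproduced from the row and column totals, and the test statistic follows from them. The sketch below assumes the standard Pearson chi-square test and takes the In State / Other count to be 13, the value implied by the row and column totals. The closed-form tail probability in `chi2_sf_df3` is valid only for 3 degrees of freedom, which is what a 2 x 4 table gives; both function names are hypothetical.

```python
import math

observed = [
    [16, 14, 13, 13],   # In State
    [14, 6, 10, 8],     # Out of State
]

def chi_square_independence(table):
    """Return (expected counts, chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count for each cell: (row total * column total) / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for o_row, e_row in zip(table, expected)
               for o, e in zip(o_row, e_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, stat, df

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

expected, stat, df = chi_square_independence(observed)
p_value = chi2_sf_df3(stat)
print([[round(e, 2) for e in row] for row in expected])
print(round(stat, 3), df, round(p_value, 3))
```

For this table the statistic comes out at roughly 1.52, well below 7.815 (the 5% critical value for 3 degrees of freedom), so these data give no evidence against independence.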

33 DBSCAN Clustering Algorithm Density Based Spatial Clustering of Applications with Noise

34 DBSCAN: Parameters and Types of Points. Requires two parameters: Eps (must be chosen) and MinPts (default value = 5). Three types of points: Core points: points with at least MinPts neighbors within their Eps neighborhood. Border points: points that are not core points but lie within the Eps neighborhood of a core point. Noise points: points that are neither core points nor border points.

35 DBSCAN: Parameters and Types of Points. Requires two parameters: Eps = 0.2, MinPts = 5

39 DBSCAN Algorithm: Identify all core points, border points, and noise points. Any two core points within Eps of each other are assigned to the same cluster. Each border point is assigned to the cluster of one of its associated core points. Noise points are not assigned to clusters; they are simply classified as noise.
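
A compact sketch of this procedure in plain Python follows. It makes two assumptions that the slides do not settle: a point counts as a member of its own Eps neighborhood, and a border point joins whichever cluster reaches it first. The function name `dbscan`, the `NOISE` marker, and the toy data are hypothetical.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts=5):
    """Return (cluster labels, point kinds); noise points get the label NOISE."""
    n = len(points)
    # Eps neighborhood of every point (assumption: a point is its own neighbor).
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != NOISE:
            continue
        # Grow a new cluster outward from an unlabelled core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()                  # p is always a core point
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster      # core or border point joins the cluster
                    if core[q]:
                        stack.append(q)      # only core points extend the cluster
        cluster += 1
    kinds = ["core" if core[i] else "border" if labels[i] != NOISE else "noise"
             for i in range(n)]
    return labels, kinds

points = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05), (-0.05, 0.0), (0.0, -0.05),
          (0.22, 0.0), (5.0, 5.0)]
labels, kinds = dbscan(points, eps=0.2, min_pts=5)
print(labels)
print(kinds)
```

In the toy data the first five points are mutual neighbors and therefore core points, the sixth lies within Eps of only one of them and becomes a border point, and the last is noise.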

41 Today's Topics Agglomerative Hierarchical Clustering

42 Hierarchical Clustering Taxonomy of Living Organisms Dendrogram

43 Agglomerative Hierarchical Clustering

53 Distances Between Clusters

54 Agglomerative Hierarchical Clustering Heights = 1.0, 1.4, 3.0, 3.6, 5.6, 8.1, 13.0, 20.3
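
Merge heights like these are the between-cluster distances at which successive pairs of clusters are joined, so n objects produce n - 1 heights. The data behind the slide's heights is not in the transcript, so the sketch below uses made-up one-dimensional points and assumes Euclidean distance with either single linkage (minimum pairwise distance) or complete linkage (maximum pairwise distance). The function name `agglomerative` is hypothetical.

```python
import math

def agglomerative(points, linkage="single"):
    """Return the merge history as (cluster_a, cluster_b, height) tuples."""
    # Single link uses the closest pair between clusters, complete link the farthest.
    combine = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]   # start with one cluster per object
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine(math.dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Replace the two closest clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges

points = [(0.0,), (1.0,), (3.0,), (7.0,)]
single_heights = [h for _, _, h in agglomerative(points, "single")]
complete_heights = [h for _, _, h in agglomerative(points, "complete")]
print(single_heights)
print(complete_heights)
```

Plotting each merge at its height, with the merged clusters as children, gives the dendrogram.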

55 Today's Topics Gaussian Mixture EM Clustering

56 Setting for Gaussian Mixture EM Clustering: p.m.f. for Y; prior distribution for Y; joint conditional distribution of the X_j's given Y

57 Setting for Gaussian Mixture EM Clustering: prior distribution for Y; posterior distribution for Y

59 Want to maximize this. Problem: we don't know the Y's.
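
EM handles the unknown Y's by alternating two steps: the E-step computes the posterior distribution of Y for each observation under the current parameters, and the M-step re-estimates the parameters using those posteriors as weights. The slides' formulas are not in the transcript, so the sketch below is an illustration under simplifying assumptions: one-dimensional data, a fixed number of iterations, and caller-supplied starting values. All names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(x, mus, sigmas, priors):
    # Posterior probability of each component given x (Bayes' rule).
    weights = [pr * normal_pdf(x, mu, s) for mu, s, pr in zip(mus, sigmas, priors)]
    total = sum(weights)
    return [w / total for w in weights]

def em_gmm_1d(xs, mus, sigmas, priors, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns (means, std devs, priors, posteriors)."""
    mus, sigmas, priors = list(mus), list(sigmas), list(priors)
    for _ in range(n_iter):
        # E-step: posterior distribution of the hidden label Y for every observation.
        post = [posterior(x, mus, sigmas, priors) for x in xs]
        # M-step: weighted estimates (assumes every component keeps some weight).
        for j in range(len(mus)):
            nj = sum(p[j] for p in post)
            priors[j] = nj / len(xs)
            mus[j] = sum(p[j] * x for p, x in zip(post, xs)) / nj
            var = sum(p[j] * (x - mus[j]) ** 2 for p, x in zip(post, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)   # guard against a collapsing component
    post = [posterior(x, mus, sigmas, priors) for x in xs]
    return mus, sigmas, priors, post

xs = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
mus, sigmas, priors, post = em_gmm_1d(xs, mus=[0.0, 10.0], sigmas=[1.0, 1.0],
                                      priors=[0.5, 0.5])
# Hard cluster labels: the component with the largest posterior probability.
labels = [max(range(len(p)), key=lambda j: p[j]) for p in post]
print(mus, sigmas, priors, labels)
```

Unlike K-means, each observation gets a probability of belonging to each cluster; hard labels are obtained only at the end by taking the most probable component.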

62 Further Reading: Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B. 39 (1): 1-38. Ledolter, J. (2013). Data Mining and Business Analytics with R.
