Object Oriented Data Analysis, Last Time: Finished Q-Q Plots (assess variability with Q-Q Envelope Plot). SigClust: When is a cluster “really there”? Statistic: 2-means Cluster Index; Gaussian null distribution; fit to data (for HDLSS data, using invariance); P-values by simulation; Breast Cancer Data.
More on K-Means Clustering Classical Algorithm (from MacQueen, 1967): start with initial means; cluster: assign each data pt. to closest mean; recompute class means; stop when no change. Demo from:
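The four steps of the classical algorithm can be sketched in a few lines. This is a minimal illustration, not the demo referenced on the slide: the function name `kmeans`, the `max_iter` safeguard, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans(data, init_means, max_iter=100):
    """Classical k-means: alternate nearest-mean assignment and mean updates."""
    means = np.asarray(init_means, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every mean: shape (n, k).
        dists = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when no assignment changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each class mean; an empty class keeps its previous mean.
        for j in range(len(means)):
            members = data[labels == j]
            if len(members) > 0:
                means[j] = members.mean(axis=0)
    return labels, means

# Toy usage (hypothetical data): two well-separated groups in 2-d,
# started from one data point in each group.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-5.0, 1.0, size=(50, 2)),
                  rng.normal(5.0, 1.0, size=(50, 2))])
labels, means = kmeans(data, init_means=data[[0, 50]])
```

The result depends on `init_means`, which is the point the following slides explore.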
More on K-Means Clustering Raw Data 2 Starting Centers
More on K-Means Clustering Assign Each Data Point To Nearest Center Recompute Mean Re-assign
More on K-Means Clustering Recompute Mean Re-Assign Data Points To Nearest Center
More on K-Means Clustering Recompute Mean Re-Assign Data Points To Nearest Center
More on K-Means Clustering Recompute Mean Final Assignment
More on K-Means Clustering New Example Raw Data Deliberately Strange Starting Centers
More on K-Means Clustering Assign Clusters To Given Means Note poor clustering
More on K-Means Clustering Recompute Mean Re-assign Shows Improvement
More on K-Means Clustering Recompute Mean Re-assign Shows Improvement Now very good
More on K-Means Clustering Different Example Best 2-means Cluster? Local Minima?
More on K-Means Clustering Assign Recompute Mean Re-assign Note poor clustering
More on K-Means Clustering Recompute Mean Final Assignment Stuck in Local Min
More on K-Means Clustering Same Data But slightly different starting points Impact???
More on K-Means Clustering Assign Recompute Mean Re-assign Note poor clustering
More on K-Means Clustering Recompute Mean Final Assignment Now get Global Min
More on K-Means Clustering ??? Next time: redo above, using my own Matlab calculations. That way can show each step, and get right answers.
More on K-Means Clustering Now explore starting values. Approach: randomly choose 2 data points. Give stable solutions? Explore for different point configurations, and try 100 random choices. Do 2-d examples for easy visualization.
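A sketch of this experiment, under stated assumptions: the data are a hypothetical 2-d Normal mixture, the helper names `two_means` and `cluster_index` are invented for the example, and the Cluster Index is taken to be the within-class sum of squares divided by the total sum of squares (the slides name the statistic but do not spell out the formula).

```python
import numpy as np

def two_means(data, start_idx, max_iter=100):
    """2-means started from two data points given by their row indices."""
    means = data[start_idx].astype(float)
    labels = None
    for _ in range(max_iter):
        dists = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(2):
            if np.any(labels == j):
                means[j] = data[labels == j].mean(axis=0)
    return labels

def cluster_index(data, labels):
    """Assumed definition: within-class sum of squares / total sum of squares."""
    total = ((data - data.mean(axis=0)) ** 2).sum()
    within = sum(((data[labels == j] - data[labels == j].mean(axis=0)) ** 2).sum()
                 for j in np.unique(labels))
    return within / total

# Hypothetical 2-cluster Normal mixture in 2-d.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(-4.0, 1.0, size=(50, 2)),
                  rng.normal(4.0, 1.0, size=(50, 2))])

# 100 random starts, each using 2 distinct randomly chosen data points.
cis = []
for _ in range(100):
    start = rng.choice(len(data), size=2, replace=False)
    cis.append(cluster_index(data, two_means(data, start)))
cis = np.sort(np.array(cis))  # sorted CIs make distinct local minima visible as steps
```

Plotting `cis` against start number, unsorted and sorted, gives the kind of display described on the slides that follow.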
More on K-Means Clustering 2 Clusters: Raw Data (Normal mixture)
More on K-Means Clustering 2 Clusters: Cluster Index, based on 100 Random Starts
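For reference, the Cluster Index plotted here can be written out as follows. This assumes the usual SigClust form of the 2-means Cluster Index (within-class sum of squares over total sum of squares); the symbols $C_1, C_2$ (the two classes), $\bar{x}_k$ (class means) and $\bar{x}$ (overall mean) are notation introduced for this note.

```latex
\mathrm{CI}
  = \frac{\sum_{k=1}^{2} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2}
         {\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2}
```

Under this definition CI lies between 0 and 1, and smaller values indicate a tighter clustering.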
More on K-Means Clustering 2 Clusters: Chosen Clustering
More on K-Means Clustering 2 Clusters Results: all starts end up with good answer; answer is very good (CI = 0.03); no obvious local minima.
More on K-Means Clustering Stretched Gaussian: Raw Data
More on K-Means Clustering Stretched Gaussian: CI, based on 100 Random Starts
More on K-Means Clustering Stretched Gaussian: Chosen Clustering
More on K-Means Clustering Stretched Gaussian Results: all starts end up with same answer; answer is less good (CI = 0.35); no obvious local minima.
More on K-Means Clustering Standard Gaussian: Raw Data
More on K-Means Clustering Standard Gaussian: CI, based on 100 Random Starts
More on K-Means Clustering Standard Gaussian: Chosen Clustering
More on K-Means Clustering Standard Gaussian Results: all starts end up with same answer; answer even less good (CI = 0.62); no obvious local minima. So still stable, despite poor CI.
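A rough population calculation gives a feel for the Gaussian CI values. It rests on two assumptions not stated on the slides: that CI is the within-class sum of squares over the total sum of squares, and that the best 2-means split of a mean-zero Gaussian cuts through the center, perpendicular to the direction of largest variance $\sigma_1^2$ (the remaining variances are $\sigma_2^2, \dots, \sigma_d^2$).

```latex
% One-sided mean and variance of X ~ N(0, \sigma_1^2) along the split direction
E[X \mid X > 0] = \sigma_1 \sqrt{2/\pi}, \qquad
\operatorname{Var}(X \mid X > 0) = \sigma_1^2 \left(1 - \tfrac{2}{\pi}\right)

% Population cluster index for the split
\mathrm{CI}_{\mathrm{pop}}
  = \frac{\sigma_1^2 \left(1 - \tfrac{2}{\pi}\right) + \sum_{j=2}^{d} \sigma_j^2}
         {\sum_{j=1}^{d} \sigma_j^2}
  = 1 - \frac{2}{\pi} \cdot \frac{\sigma_1^2}{\sum_{j=1}^{d} \sigma_j^2}
```

For a standard Gaussian in 2-d this gives $1 - 1/\pi \approx 0.68$, and for a very stretched Gaussian it approaches $1 - 2/\pi \approx 0.36$. These are of the same order as the 0.62 and 0.35 reported above; sample values need not match population values exactly.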
More on K-Means Clustering 4 Balanced Clusters: Raw Data (Normal mixture)
More on K-Means Clustering 4 Balanced Clusters: CI, based on 100 Random Starts
More on K-Means Clustering 4 Balanced Clusters, 100 Random Starts: many different solutions appear, i.e. there are many local minima. Sorting on CI (bottom) shows how many; 2 seem smaller than the others. What are the other local minima? Understand with deeper visualization.
More on K-Means Clustering 4 Balanced Clusters: Class Assignment Image Plot
More on K-Means Clustering 4 Balanced Clusters: Vertically Regroup (better view?)
More on K-Means Clustering 4 Balanced Clusters: Choose cases to “flip” – color cases
More on K-Means Clustering 4 Balanced Clusters: Choose cases to “flip” – color cases
More on K-Means Clustering 4 Balanced Clusters: “flip”, shows local min clusters
More on K-Means Clustering 4 Balanced Clusters: sort columns, for better visualization
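The "flip" step addresses the fact that 2-means class labels are arbitrary: the same clustering can come back with its two labels swapped. A minimal sketch of one way to align them, assuming each row of the matrix is one random start and each column is one data point (the slides do not state the orientation), and using the first run as the reference; the function name `align_labels` is hypothetical.

```python
import numpy as np

def align_labels(assignments):
    """Flip 0/1 labels in each row so every run agrees with row 0 as much as possible."""
    assignments = np.asarray(assignments)
    reference = assignments[0]
    aligned = assignments.copy()
    for i, row in enumerate(assignments):
        # Flipping helps when fewer than half of the entries match the reference.
        if np.mean(row == reference) < 0.5:
            aligned[i] = 1 - row
    return aligned

runs = np.array([[0, 0, 1, 1],
                 [1, 1, 0, 0],   # same clustering as row 0, labels swapped
                 [0, 1, 1, 1]])  # a genuinely different clustering
aligned = align_labels(runs)
```

After alignment, rows that still differ from the reference correspond to different clusterings rather than to relabelings.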
More on K-Means Clustering 4 Balanced Clusters: CI, based on 100 Random Starts
More on K-Means Clustering 4 Balanced Clusters: Color according to local minima
More on K-Means Clustering 4 Balanced Clusters: Chosen Clustering, smallest CI
More on K-Means Clustering 4 Balanced Clusters: Chosen Clustering, 2nd smallest CI
More on K-Means Clustering 4 Balanced Clusters: Chosen Clustering, 3rd (larger) CI
More on K-Means Clustering 4 Balanced Clusters: Chosen Clustering, 4th (larger) CI
More on K-Means Clustering 4 Balanced Clusters: Chosen Clustering, 5th (larger) CI
More on K-Means Clustering 4 Balanced Clusters: Chosen Clustering, 6th (larger) CI
More on K-Means Clustering 4 Balanced Clusters Results: many local minima; two good ones appear often (2-2 splits); 4 worse ones (1-3 splits, less common); 1 with single strange point. Overall very unstable. Raises concern over starting values.
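Tallying how often each local minimum appears can be done by reducing every run to a canonical labeling and counting distinct rows. This is a sketch under assumptions: rows are random starts, columns are data points, and labels are 0/1; the function name `count_solutions` and the toy matrix are invented for the example.

```python
import numpy as np

def count_solutions(assignments):
    """Count distinct 2-class clusterings among runs, ignoring label swaps."""
    assignments = np.asarray(assignments)
    # Canonical form: relabel each run so the first data point is in class 0.
    canonical = np.where(assignments[:, :1] == 1, 1 - assignments, assignments)
    solutions, counts = np.unique(canonical, axis=0, return_counts=True)
    order = np.argsort(-counts)  # most frequent solution first
    return solutions[order], counts[order]

runs = np.array([[0, 0, 1, 1],
                 [1, 1, 0, 0],   # label-swapped copy of the first run
                 [0, 1, 1, 1],
                 [0, 0, 1, 1]])
solutions, counts = count_solutions(runs)
```

Here `len(counts)` is the number of distinct solutions found, and `counts` shows how often each was reached.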
More on K-Means Clustering 4 Unbalanced Clusters: Raw Data (try for stability)
More on K-Means Clustering 4 Unbalanced Clusters: CI, based on 100 Random Starts
More on K-Means Clustering 4 Unbalanced Clusters: Recolor by CI
More on K-Means Clustering 4 Unbalanced Clusters: Chosen Clustering, smallest CI
More on K-Means Clustering 4 Unbalanced Clusters: Chosen Clustering, 2nd smallest CI
More on K-Means Clustering 4 Unbalanced Clusters: Chosen Clustering, 3rd (larger) CI
More on K-Means Clustering 4 Unbalanced Clusters Results: fewer local minima (more stable); two good ones appear often (2-2 splits); single 1-3 split, less common. Previous instability caused by balance? Maybe stability OK after all?
More on K-Means Clustering Data on Circle: Raw Data (maximal instability?)
More on K-Means Clustering Data on Circle: CI, based on 100 Random Starts
More on K-Means Clustering Data on Circle: Recolor by CI
More on K-Means Clustering Data on Circle: Chosen Clustering, smallest CI
More on K-Means Clustering Data on Circle: Chosen Clustering, 2nd smallest CI
More on K-Means Clustering Data on Circle: Chosen Clustering, 3rd smallest CI
More on K-Means Clustering Data on Circle Results: seems there are many local minima. Several are the same? Could be a programming error? But clear this is an unstable example.
K-Means Clustering Caution: this is all a personal view. Others would present different aspects, e.g. replace Euclidean dist. by others; e.g. other types of clustering; e.g. heat-map dendrogram views …
SigClust Breast Cancer Data, K-means Clustering & Starting Values: try 100 random starts. For full data set: study final CIs (shows just two solutions); study changes in data, with image view (shows little difference between these). Overall: typical for clusters that can be split; when the split is clear, it is easily found.
SigClust Random Restarts, Full Data
SigClust Breast Cancer Data. For full Chuck Class (e.g. Luminal B): study final CIs (shows several solutions); study changes in data, with image view (shows multiple, divergent minima). Overall: typical for “terminal” clusters; when there is no clear split, many local optima appear. Could base test on number of local optima???
SigClust Random Restarts, Luminal B
SigClust Breast Cancer Data ??? Next time: show many more of these To better build this case….