Object Orie’d Data Analysis, Last Time
Finished Q-Q Plots
– Assess variability with Q-Q Envelope Plot
SigClust
– When is a cluster “really there”?
– Statistic: 2-means Cluster Index
– Gaussian null distribution
– Fit to data (for HDLSS data, using invariance)
– P-values by simulation
– Breast Cancer Data
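For reference, the 2-means Cluster Index used as the SigClust statistic is, in its usual formulation (notation here is mine), the within-cluster sum of squares relative to the total sum of squares about the overall mean:

\[
CI \;=\; \frac{\sum_{k=1}^{2} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2}{\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2}
\]

where \(C_1, C_2\) is the 2-means partition, \(\bar{x}_k\) is the mean of cluster \(k\), and \(\bar{x}\) is the overall mean. A smaller CI means tighter clusters relative to the overall spread.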

More on K-Means Clustering
Classical Algorithm (from MacQueen, 1967)
– Start with initial means
– Cluster: assign each data point to the closest mean
– Recompute each class mean
– Stop when no change
Demo from:
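A minimal sketch of the iteration just described (assign each point to its nearest mean, recompute the means, stop when the assignment no longer changes). This is written in Python/NumPy rather than the Matlab used later in these notes, and the function name and arguments are chosen here purely for illustration.

```python
import numpy as np

def classical_kmeans(X, init_means, max_iter=100):
    """Batch k-means as in the slide: assign each point to the nearest
    mean, recompute class means, stop when labels no longer change."""
    means = np.asarray(init_means, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # distance from every point to every current mean
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no change: converged
        labels = new_labels
        for k in range(means.shape[0]):
            if np.any(labels == k):          # keep old mean if a cluster empties
                means[k] = X[labels == k].mean(axis=0)
    return labels, means
```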

More on K-Means Clustering Raw Data 2 Starting Centers

More on K-Means Clustering Assign Each Data Point To Nearest Center Recompute Mean Re-assign

More on K-Means Clustering Recompute Mean Re-Assign Data Points To Nearest Center

More on K-Means Clustering Recompute Mean Re-Assign Data Points To Nearest Center

More on K-Means Clustering Recompute Mean Final Assignment

More on K-Means Clustering New Example Raw Data Deliberately Strange Starting Centers

More on K-Means Clustering Assign Clusters To Given Means Note poor clustering

More on K-Means Clustering Recompute Mean Re-assign Shows Improvement

More on K-Means Clustering Recompute Mean Re-assign Shows Improvement Now very good

More on K-Means Clustering Different Example Best 2-means Cluster? Local Minima?

More on K-Means Clustering Assign Recompute Mean Re-assign Note poor clustering

More on K-Means Clustering Recompute Mean Final Assignment Stuck in Local Min

More on K-Means Clustering Same Data But slightly different starting points Impact???

More on K-Means Clustering Assign Recompute Mean Re-assign Note poor clustering

More on K-Means Clustering Recompute Mean Final Assignment Now get Global Min

More on K-Means Clustering
??? Next time: redo the above using my own Matlab calculations
That way each step can be shown, and the right answers obtained.

More on K-Means Clustering
Now explore starting values:
– Approach: randomly choose 2 data points as the starting means
– Does this give stable solutions?
– Explore for different point configurations
– Try 100 random choices
– Do 2-d examples for easy visualization
(a sketch of this restart experiment follows)
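A rough sketch of such a restart experiment, reusing the classical_kmeans function from the earlier sketch and the cluster index defined above. The toy two-component Gaussian mixture and all names here are illustrative, not the data actually used in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster_index(X, labels):
    """2-means cluster index: within-cluster sum of squares divided by
    the total sum of squares about the overall mean (smaller = tighter)."""
    total = ((X - X.mean(axis=0)) ** 2).sum()
    within = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                 for k in np.unique(labels))
    return within / total

# toy 2-d data: mixture of two spherical Gaussian clusters (illustrative)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([5.0, 0.0], 1.0, size=(50, 2))])

# 100 random restarts, each started from 2 randomly chosen data points
results = []
for _ in range(100):
    start = X[rng.choice(len(X), size=2, replace=False)]
    labels, _ = classical_kmeans(X, start)      # from the sketch above
    results.append((cluster_index(X, labels), labels))

ci_sorted = sorted(ci for ci, _ in results)
print("smallest CI:", ci_sorted[0], " largest CI:", ci_sorted[-1])
```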

More on K-Means Clustering 2 Clusters: Raw Data (Normal mixture)

More on K-Means Clustering 2 Clusters: Cluster Index, based on 100 Random Starts

More on K-Means Clustering 2 Clusters: Chosen Clustering

More on K-Means Clustering
2 Clusters Results
– All starts end up with good answer
– Answer is very good (CI = 0.03)
– No obvious local minima

More on K-Means Clustering Stretched Gaussian: Raw Data

More on K-Means Clustering Stretched Gaussian : C. I., based on 100 Random Starts

More on K-Means Clustering Stretched Gaussian : Chosen Clustering

More on K-Means Clustering
Stretched Gaussian Results
– All starts end up with same answer
– Answer is less good (CI = 0.35)
– No obvious local minima

More on K-Means Clustering Standard Gaussian: Raw Data

More on K-Means Clustering Standard Gaussian : C. I., based on 100 Random Starts

More on K-Means Clustering Standard Gaussian: Chosen Clustering

More on K-Means Clustering
Standard Gaussian Results
– All starts end up with same answer
– Answer even less good (CI = 0.62)
– No obvious local minima
– So still stable, despite poor CI

More on K-Means Clustering 4 Balanced Clusters: Raw Data (Normal mixture)

More on K-Means Clustering 4 Balanced Clusters: CI, based on 100 Random Starts

More on K-Means Clustering
4 Balanced Clusters, 100 Random Starts
– Many different solutions appear
– I.e. there are many local minima
– Sorting on CI (bottom) shows how many; 2 seem smaller than the others
– What are the other local minima?
– Understand with deeper visualization (see the image-plot sketch below)

More on K-Means Clustering 4 Balanced Clusters: Class Assignment Image Plot
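A hedged sketch of how a class-assignment image plot of this kind can be built with matplotlib: rows are the 100 random restarts, columns are the data points, and color is the assigned cluster. It reuses the `results` list from the restart sketch above; the variable names and layout are illustrative. Note that the 0/1 cluster labels are arbitrary, which is why the later slides "flip" some cases before identical solutions line up.

```python
import numpy as np
import matplotlib.pyplot as plt

# `results` from the restart sketch above: list of (CI, labels) pairs
ci_vals = np.array([ci for ci, _ in results])
label_mat = np.array([lab for _, lab in results])   # shape (n_starts, n_points)

order = np.argsort(ci_vals)                         # sort restarts by CI
plt.imshow(label_mat[order], aspect='auto', interpolation='nearest')
plt.xlabel('data point index')
plt.ylabel('random restart (sorted by CI)')
plt.title('Class assignments across 100 random starts')
plt.colorbar(label='assigned cluster')
plt.show()
```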

More on K-Means Clustering 4 Balanced Clusters: Vertically Regroup (better view?)

More on K-Means Clustering 4 Balanced Clusters: Choose cases to “flip” – color cases

More on K-Means Clustering 4 Balanced Clusters: Choose cases to “flip” – color cases

More on K-Means Clustering 4 Balanced Clusters: “flip”, shows local min clusters

More on K-Means Clustering 4 Balanced Clusters: sort columns, for better visualization

More on K-Means Clustering 4 Balanced Clusters: CI, based on 100 Random Starts

More on K-Means Clustering 4 Balanced Clusters: Color according to local minima

More on K-Means Clustering 4 Balanced Clusters: Chosen Clustering, smallest CI

More on K-Means Clustering 4 Balanced Clusters: Chosen Clustering, 2nd small CI

More on K-Means Clustering 4 Balanced Clusters: Chosen Clustering, larger 3rd CI

More on K-Means Clustering 4 Balanced Clusters: Chosen Clustering, larger 4th CI

More on K-Means Clustering 4 Balanced Clusters: Chosen Clustering, larger 5th CI

More on K-Means Clustering 4 Balanced Clusters: Chosen Clustering, larger 6th CI

More on K-Means Clustering
4 Balanced Clusters Results
– Many local minima
– Two good ones appear often (2-2 splits)
– 4 worse ones (1-3 splits, less common)
– 1 with single strange point
– Overall very unstable
– Raises concern over starting values

More on K-Means Clustering 4 Unbalanced Clusters: Raw Data (try for stability)

More on K-Means Clustering 4 Unbalanced Clusters: CI, based on 100 Random Starts

More on K-Means Clustering 4 Unbalanced Clusters: Recolor by CI

More on K-Means Clustering 4 Unbalanced Clusters: Chosen Clustering, smallest CI

More on K-Means Clustering 4 Unbalanced Clusters: Chosen Clustering, 2nd small CI

More on K-Means Clustering 4 Unbalanced Clusters: Chosen Clustering, larger 3rd CI

More on K-Means Clustering
4 Unbalanced Clusters Results
– Fewer local minima (more stable)
– Two good ones appear often (2-2 splits)
– Single 1-3 split less common
– Previous instability caused by balance?
– Maybe stability OK after all?

More on K-Means Clustering Data on Circle: Raw Data (maximal instability?)
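For anyone reproducing this kind of example, a toy "data on a circle" set can be generated as below; the radius, noise level, and sample size are assumptions chosen for illustration, not the values used in the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2.0 * np.pi, size=100)    # angles around the circle
radius = 1.0 + rng.normal(0.0, 0.05, size=100)     # small radial noise (assumed)
X_circle = np.column_stack([radius * np.cos(theta),
                            radius * np.sin(theta)])
```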

More on K-Means Clustering Data on Circle: CI, based on 100 Random Starts

More on K-Means Clustering Data on Circle: Recolor by CI

More on K-Means Clustering Data on Circle: Chosen Clustering, smallest CI

More on K-Means Clustering Data on Circle: Chosen Clustering, 2nd small CI

More on K-Means Clustering Data on Circle: Chosen Clustering, 3rd small CI

More on K-Means Clustering
Data on Circle Results
– Seems to be many local minima
– Several are the same?
– Could be a programming error?
– But it is clear this is an unstable example

K-Means Clustering Caution
This is all a personal view; others would present different aspects:
– E.g. replace Euclidean distance by other distances
– E.g. other types of clustering
– E.g. heat-map dendrogram views
…

SigClust Breast Cancer Data
K-means Clustering & Starting Values: try 100 random starts
For the full data set:
– Study final CIs: shows just two solutions
– Study changes in data, with the image view: shows little difference between these
Overall: typical behavior when clusters can split; when the split is clear, it is easily found

SigClust Random Restarts, Full Data

SigClust Breast Cancer Data
For a full Chuck Class (e.g. Luminal B):
– Study final CIs: shows several solutions
– Study changes in data, with the image view: shows multiple, divergent minima
Overall: typical for “terminal” clusters; when there is no clear split, many local optima appear
Could base a test on the number of local optima???

SigClust Random Restarts, Luminal B
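If one did want to count local optima across restarts, as speculated above, a rough way (reusing the `results` list from the restart sketch, and purely a sketch of the idea) is to count distinct 2-means clusterings up to a swap of the two labels:

```python
import numpy as np

def canonical(labels):
    """Canonical form of a 2-class labelling, so that swapping the
    labels 0 and 1 does not count as a different clustering."""
    labels = np.asarray(labels)
    return tuple(labels) if labels[0] == 0 else tuple(1 - labels)

# `results` from the restart sketch: list of (CI, labels) pairs
distinct = {canonical(lab) for _, lab in results}
print("number of distinct local optima found:", len(distinct))
```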

SigClust Breast Cancer Data
??? Next time: show many more of these, to better build this case…