Object Orie’d Data Analysis, Last Time DiProPerm Test –Direction – Projection – Permutation –HDLSS hypothesis testing –NCI 60 Data –Particulate Matter.

Object Orie’d Data Analysis, Last Time
DiProPerm Test
– Direction – Projection – Permutation
– HDLSS hypothesis testing
– NCI 60 Data
– Particulate Matter Data
– Perou 500 Breast Cancer Data
– OK for subpop’ns found by clustering???
Started Investigation of Clustering
– Simple 1-d examples

Clustering
Important References:
– MacQueen (1967)
– Hartigan (1975)
– Gersho and Gray (1992)
– Kaufman and Rousseeuw (2005)

K-means Clustering
Notes on Cluster Index:
– CI = 0 when all data sit at the cluster means
– CI small when the clustering is tight (within SS contains little of the variation)
– CI big when the clustering is poor (within SS contains most of the variation)
– CI = 1 when all cluster means are the same

K-means Clustering
Clustering Goal: Given data $X_1, \dots, X_n$, choose classes $C_1, \dots, C_K$ to minimize the Cluster Index
$$\mathrm{CI} = \frac{\sum_{k=1}^{K} \sum_{i \in C_k} \| X_i - \bar{X}_k \|^2}{\sum_{i=1}^{n} \| X_i - \bar{X} \|^2}$$
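The Cluster Index is simple to compute directly; here is a minimal NumPy sketch (my own illustration, with a hypothetical helper name `cluster_index`, not code from the slides):

```python
import numpy as np

def cluster_index(x, labels):
    """CI = within-class sum of squares / total sum of squares."""
    x = np.asarray(x, dtype=float)
    total_ss = np.sum((x - x.mean(axis=0)) ** 2)
    within_ss = sum(np.sum((x[labels == k] - x[labels == k].mean(axis=0)) ** 2)
                    for k in np.unique(labels))
    return within_ss / total_ss

# Two tight, well-separated groups: CI is near 0
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
print(cluster_index(x, np.array([0, 0, 0, 1, 1, 1])))  # small
# A poor split of the same data: CI is near 1
print(cluster_index(x, np.array([0, 1, 0, 1, 0, 1])))  # large
```

For fixed data, minimizing the within-class sum of squares and minimizing CI pick out the same classes, since the denominator does not depend on the labelling.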

2-means Clustering Study CI, using simple 1-d examples Varying Standard Deviation

2-means Clustering

Study CI, using simple 1-d examples Varying Standard Deviation Varying Mean

2-means Clustering

Study CI, using simple 1-d examples Varying Standard Deviation Varying Mean Varying Proportion

2-means Clustering

Study CI, using simple 1-d examples Over changing Classes (moving b’dry)

2-means Clustering

Study CI, using simple 1-d examples
Over changing Classes (moving b’dry)
Multi-modal data → interesting effects
– Multiple local minima (large number)
– Maybe disconnected
– Optimization (over the class assignments) can be tricky… (even in 1 dimension, with K = 2)

2-means Clustering

Study CI, using simple 1-d examples
Over changing Classes (moving b’dry)
Multi-modal data → interesting effects
– Can have 4 (or more) local mins (even in 1 dimension, with K = 2)

2-means Clustering

Study CI, using simple 1-d examples
Over changing Classes (moving b’dry)
Multi-modal data → interesting effects
– Local mins can be hard to find
– i.e. iterative procedures can “get stuck” (even in 1 dimension, with K = 2)
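The “getting stuck” behaviour is easy to reproduce. Below is a sketch (a plain Lloyd iteration written out by hand, not any code from the course): on trimodal 1-d data, two different starting centers converge to two different local minima.

```python
import numpy as np

def lloyd_1d(x, centers, n_iter=50):
    """Plain Lloyd iteration for 2-means in 1-d; converges to a *local* minimum."""
    c = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Assign each point to its nearest center, then move centers to class means
        labels = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(len(c)):
            if np.any(labels == k):
                c[k] = x[labels == k].mean()
    return c

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(m, 0.1, 30) for m in (0.0, 3.0, 6.0)])  # trimodal data

c_a = lloyd_1d(x, [0.0, 3.0])   # start near the left two modes
c_b = lloyd_1d(x, [3.0, 6.0])   # start near the right two modes
print(np.sort(c_a), np.sort(c_b))  # two different answers: both are local minima
```

Restarting from many random initializations (and keeping the labelling with the smallest CI) is the usual practical workaround.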

2-means Clustering Study CI, using simple 1-d examples Effect of a single outlier?

2-means Clustering

Study CI, using simple 1-d examples Effect of a single outlier? –Can create local minimum –Can also yield a global minimum –This gives a one point class –Can make CI arbitrarily small (really a “good clustering”???)
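A quick numerical illustration of the outlier effect (my own example, with a hypothetical `cluster_index` helper): putting a single extreme point in its own class drives CI toward 0, even though nothing about the clustering is “good”.

```python
import numpy as np

def cluster_index(x, labels):
    """CI = within-class sum of squares / total sum of squares (1-d version)."""
    within = sum(np.sum((x[labels == k] - x[labels == k].mean()) ** 2)
                 for k in np.unique(labels))
    return within / np.sum((x - x.mean()) ** 2)

rng = np.random.default_rng(1)
x = np.append(rng.normal(0.0, 1.0, 100), 1000.0)  # 100 standard normals + one extreme outlier
labels = (x > 500).astype(int)                    # the outlier alone in its own class
print(cluster_index(x, labels))  # tiny: a one-point class makes CI arbitrarily small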

SigClust Statistical Significance of Clusters in HDLSS Data When is a cluster “really there”?

SigClust Co-authors: Andrew Nobel – UNC Statistics & OR C. M. Perou – UNC Genetics D. N. Hayes – UNC Oncology Yufeng Liu – UNC Statistics & OR

Common Microarray Analytic Approach: Clustering
From: Perou, Brown, Botstein (2000), Molecular Medicine Today
d = 1161 genes
Zoomed to “relevant” gene subsets

Interesting Statistical Problem
For HDLSS data: when clusters seem to appear (e.g. found by a clustering method), how do we know they are really there?
Question asked by Neil Hayes
Can we define an appropriate notion of statistical significance? Can we calculate it?

First Approaches: Hypo Testing e.g. Direction, Projection, Permutation Hypothesis test of: Significant difference between sub-populations Effective and Accurate I.e. Sensitive and Specific There exist several such tests But critical point is: What result implies about clusters

Clarifying Simple Example
Why population difference tests cannot indicate clustering
Andrew Nobel Observation: For Gaussian data (clearly 1 cluster!), assign extreme labels (e.g. by clustering); the subpopulations are signif’ly different

Simple Gaussian Example Clearly only 1 Cluster in this Example But Extreme Relabelling looks different Extreme T-stat strongly significant Indicates 2 clusters in data

Simple Gaussian Example Results: Random relabelling T-stat is not significant But extreme T-stat is strongly significant This comes from clustering operation Conclude sub-populations are different Now see that: Not the same as clusters really there Need a new approach to study clusters
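The relabelling argument can be checked numerically. In the sketch below (my own illustration, using a hand-rolled Welch t-statistic rather than any particular package), one Gaussian sample is split at its median, mimicking the extreme labels a clustering method would assign:

```python
import numpy as np

def welch_t(a, b):
    """Two-sample t-statistic (Welch form), written out by hand."""
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) +
                                           b.var(ddof=1) / len(b))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200)  # one Gaussian sample: clearly a single cluster

lo, hi = x[x < np.median(x)], x[x >= np.median(x)]  # "extreme" labels, as clustering would assign
perm = rng.permutation(x)
rand_a, rand_b = perm[:100], perm[100:]             # random labels, for comparison

print(abs(welch_t(lo, hi)))          # very large: subpopulations look "significantly different"
print(abs(welch_t(rand_a, rand_b)))  # modest: consistent with no real difference
```

The huge t-statistic under extreme labels comes entirely from the clustering operation, not from any real second cluster, which is exactly the point of the example.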

Statistical Significance of Clusters
Basis of SigClust Approach:
What defines a Cluster? A Gaussian distribution (Sarle & Kou 1993)
So define SigClust test based on:
– 2-means cluster index (measure) as statistic
– Gaussian null distribution
– Currently compute by simulation
Possible to do this analytically???

SigClust Statistic – 2-Means Cluster Index
Measure of non-Gaussianity: 2-means Cluster Index
– Familiar criterion from k-means clustering: Within Class Sum of Squared Distances to Class Means
– Prefer to divide (normalize) by the Overall Sum of Squared Distances to the Mean
– Puts it on the scale of proportions

SigClust Statistic – 2-Means Cluster Index
Measure of non-Gaussianity, the 2-means Cluster Index: with class index sets $C_1, C_2$ and class means $\bar{X}_1, \bar{X}_2$,
$$\mathrm{CI} = \frac{\sum_{k=1}^{2} \sum_{i \in C_k} \| X_i - \bar{X}_k \|^2}{\sum_{i=1}^{n} \| X_i - \bar{X} \|^2} = \frac{\text{Within Class Var’n}}{\text{Total Var’n}}$$

SigClust Gaussian null distribut’n
Which Gaussian? Standard (sphered) normal?
No, not realistic: rejection would not be strong evidence for clustering, since we could also get that from a non-spherical Gaussian
Need a Gaussian more like the data
Challenge: Parameter Estimation (recall the HDLSS context)

SigClust Gaussian null distribut’n
Estimated Mean (of Gaussian dist’n)?
1st Key Idea: Can ignore this, by appealing to shift invariance of CI
When the data are (rigidly) shifted, CI remains the same
So enough to simulate with mean 0
Other uses of invariance ideas?

SigClust Gaussian null distribut’n
Challenge: how to estimate the cov. matrix?
Number of parameters: $d(d+1)/2$
E.g. Perou 500 data: dimension d far exceeds the sample size n
Impossible in HDLSS settings????
Way around this problem?

SigClust Gaussian null distribut’n
2nd Key Idea: Mod Out Rotations
Replace the full cov. matrix by a diagonal matrix, as done in PCA eigen-analysis
But then “not like data”???
OK, since k-means clustering (i.e. CI) is rotation invariant (assuming e.g. Euclidean distance)
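Both invariances (shift, from the 1st Key Idea, and rotation) are easy to verify numerically. In this sketch (illustration only; `cluster_index` is a hypothetical helper computing within-SS over total-SS), CI is unchanged by a random rotation and by a rigid shift:

```python
import numpy as np

def cluster_index(x, labels):
    """CI = within-class sum of squares / total sum of squares."""
    within = sum(np.sum((x[labels == k] - x[labels == k].mean(axis=0)) ** 2)
                 for k in np.unique(labels))
    return within / np.sum((x - x.mean(axis=0)) ** 2)

rng = np.random.default_rng(2)
x = rng.normal(size=(40, 6))
labels = rng.integers(0, 2, size=40)           # any fixed 2-class labelling

q, _ = np.linalg.qr(rng.normal(size=(6, 6)))   # a random orthogonal (rotation) matrix
shift = rng.normal(size=6)                     # a rigid shift

ci = cluster_index(x, labels)
print(np.isclose(ci, cluster_index(x @ q, labels)))      # rotation invariant
print(np.isclose(ci, cluster_index(x + shift, labels)))  # shift invariant
```

Both hold because Euclidean distances to class means are preserved under rigid shifts and orthogonal rotations, so diagonalizing the covariance costs nothing as far as CI is concerned.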

SigClust Gaussian null distribut’n
2nd Key Idea: Mod Out Rotations
Only need to estimate a diagonal matrix
But still have HDLSS problems? E.g. Perou 500 data: dimension d still exceeds the sample size n
Still need to estimate d param’s

SigClust Gaussian null distribut’n
3rd Key Idea: Factor Analysis Model
Model covariance as Biology + Noise: $\Sigma = B + \sigma_N^2 I$, where $B$ is “fairly low dimensional” and $\sigma_N^2$ is estimated from background noise
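One way to turn this model into a simulable null is to hard-threshold the sample covariance eigenvalues at the noise variance. The sketch below is my own illustration of that idea (function names are hypothetical; this is not the SigClust authors’ implementation):

```python
import numpy as np

def null_eigenvalues(x, sigma2):
    """Eigenvalues for a diagonal Gaussian null: sample covariance
    eigenvalues floored at the background-noise variance sigma2."""
    evals = np.linalg.eigvalsh(np.cov(x, rowvar=False))
    return np.maximum(evals, sigma2)

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=(20, 50))  # n = 20 samples, d = 50 "genes" (HDLSS)
lam = null_eigenvalues(x, sigma2=1.0)    # at most n-1 eigenvalues can exceed the floor
null_sample = rng.normal(0.0, np.sqrt(lam), size=(20, 50))  # one draw from the null
```

Simulating many such mean-zero draws and recomputing the 2-means CI on each gives the null distribution of the statistic, and hence a p-value for the observed CI.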

SigClust Gaussian null distribut’n
Estimation of Background Noise $\sigma_N^2$:
– Reasonable model (for each gene): Expression = Signal + Noise
– “noise” is roughly Gaussian
– “noise” terms essentially independent (across genes)

SigClust Gaussian null distribut’n Estimation of Background Noise : Model OK, since data come from light intensities at colored spots

SigClust Gaussian null distribut’n
Estimation of Background Noise $\sigma_N^2$:
For all expression values (as numbers), use a robust estimate of scale: the Median Absolute Deviation (MAD) from the median
Rescale to put it on the same scale as the s. d.:
$$\hat{\sigma}_N = \frac{\mathrm{MAD}}{\Phi^{-1}(3/4)} \approx 1.4826 \times \mathrm{MAD}$$
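As a sketch of why the MAD is used here (my own illustration, not the SigClust authors’ code): the rescaled MAD barely moves when a small fraction of large “signal” values is mixed in, while the plain standard deviation inflates badly.

```python
import numpy as np

def mad_sigma(values):
    """Robust scale: MAD from the median, rescaled to the Gaussian s.d. scale."""
    v = np.asarray(values, dtype=float).ravel()
    mad = np.median(np.abs(v - np.median(v)))
    return mad / 0.67448975  # Phi^{-1}(3/4): makes mad_sigma match the s.d. for Gaussian data

rng = np.random.default_rng(3)
noise = rng.normal(0.0, 2.0, 100_000)                    # pure Gaussian noise, s.d. = 2
spiked = np.append(noise, rng.normal(0.0, 50.0, 1_000))  # plus ~1% large "signal" values

print(mad_sigma(noise))                   # close to 2
print(np.std(spiked), mad_sigma(spiked))  # plain s.d. inflates; MAD estimate barely moves
```

This robustness is exactly what is wanted: the few strongly expressed (signal) genes should not contaminate the background-noise estimate.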

SigClust Estimation of Background Noise

SigClust Gaussian null distribut’n ??? Next time: Insert QQ plot stuff from about here

SigClust Estimation of Background Noise