Recall Flexibility From Kernel Embedding Idea HDLSS Asymptotics & Kernel Methods.

Interesting Question: Behavior in Very High Dimension? Implications for DWD: – Recall Main Advantage is for High d – So Not Clear Embedding Helps – Thus Not Yet Implemented in DWD
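A minimal sketch of why embedding raises the dimension question. It assumes a polynomial embedding (one common kernel embedding, not necessarily the one used in these slides) and counts the monomial features of total degree 1 up to a chosen degree. The function name `poly_embed_dim` is hypothetical.

```python
from math import comb


def poly_embed_dim(d: int, degree: int) -> int:
    """Count monomials of total degree 1..degree in d variables.

    Assumption: a full polynomial embedding without the constant term.
    """
    return comb(d + degree, degree) - 1


# The embedded dimension grows quickly with the original dimension d.
for d in (2, 10, 100, 1000):
    print(d, poly_embed_dim(d, 2), poly_embed_dim(d, 3))
```

When d is already large, the embedded dimension is far larger still, which is consistent with the remark that it is not clear embedding helps a method whose main advantage is already in high d.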

Batch and Source Adjustment Recall from Class Notes 1/26/16 For Stanford Breast Cancer Data (C. Perou) Analysis in Benito et al. (2004) Adjust for Source Effects –Different sources of mRNA Adjust for Batch Effects –Arrays fabricated at different times

Source Batch Adj: Biological Class Col. & Symbols

Source Batch Adj: Source Colors

Source Batch Adj: PC 1-3 & DWD direction

Source Batch Adj: DWD Source Adjustment

Source Batch Adj: Source Adj’d, PCA view

Source Batch Adj: S. & B Adj’d, Adj’d PCA

UNC, Stat & OR Why not adjust using SVM? Major Problem: Proj’d Distrib’al Shape Triangular Dist’ns (opposite skewed) Does not allow sensible rigid shift

Why not adjust using SVM? Nicely Fixed by DWD Projected Dist’ns near Gaussian Sensible to shift
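The "rigid shift" idea can be sketched as follows: given some direction separating the batches, each batch is translated along that direction until its projected mean is zero, leaving all orthogonal components untouched. This is only an illustration under stated assumptions. The function `shift_along_direction` is a hypothetical name, and the direction is supplied by hand here, whereas the slides obtain it from DWD.

```python
import numpy as np


def shift_along_direction(X, batch, w):
    """Translate each batch along unit direction w so projected batch means are 0."""
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)
    batch = np.asarray(batch)
    X_adj = np.array(X, dtype=float)      # copy, so the input is not modified
    scores = X_adj @ w                    # projections onto the direction
    for b in np.unique(batch):
        mask = batch == b
        X_adj[mask] -= scores[mask].mean() * w   # rigid shift of the whole batch
    return X_adj


# Toy data (assumed): two batches offset along the first coordinate, d = 50.
rng = np.random.default_rng(0)
d = 50
w_true = np.zeros(d)
w_true[0] = 1.0
batch = np.repeat([0, 1], 20)
offset = np.where(batch[:, None] == 1, 3.0, -3.0) * w_true
X = rng.normal(size=(40, d)) + offset

X_adj = shift_along_direction(X, batch, w_true)
gap_before = X[batch == 1, 0].mean() - X[batch == 0, 0].mean()
gap_after = X_adj[batch == 1, 0].mean() - X_adj[batch == 0, 0].mean()
print(gap_before, gap_after)
```

A shift of this kind is only sensible when the projected distributions of the two batches have similar shapes, which is the point of the SVM versus DWD comparison above.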

Why not adjust by means? DWD is complicated: value added? – Because it is “cool” – Recall Improves SVM for HDLSS – Good Empirical Success – Routinely Used in Perou Lab – Many Comparisons Done – Similar Lessons from Wistar – Proven Statistical Power

Why not adjust by means? But Why Not PAM (~Mean Difference)? – Simpler is Better – Why not means, i.e. point cloud centerpoints? Elegant Answer: Xuxin Liu et al. (2009) Drawback to PAM: – Poor Handling of Unbalanced Biological Subtypes – DWD more Resistant to Unbalance

Why not adjust by means? Toy Example: Gaussian Clusters Two batches (denoted: + o) Two subtypes (red and blue) Goal: bring together + and o for one subtype, and also + and o for the other Challenge: unequal biological ratios within batches

Twiddle ratios of subtypes: 2-d Toy Example, Balanced Mixture

Twiddle ratios of subtypes: 2-d Toy Example, Unbalanced Mixture (Through “decimation”)

Twiddle ratios of subtypes: 2-d Toy Example, Unbalanced Mixture (Diminishing Discriminatory Power)

Twiddle ratios of subtypes: 2-d Toy Example, Unbalanced Mixture (Note: Losing Distinction, To Be Studied)
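The effect of unbalanced subtypes on a mean-difference adjustment can be reproduced in a small simulation. This is a sketch under assumptions, not the figures from the slides: the cluster centers, the spread, the sample counts, and the name `mean_diff_angle` are all illustrative choices. Batch effect lies along the x-axis and subtype difference along the y-axis. Only the mean-difference side is shown, since DWD itself is not implemented here.

```python
import numpy as np


def mean_diff_angle(n_red_plus, n_blue_plus, n_red_o, n_blue_o, seed=0):
    """Angle (degrees) between the batch mean-difference direction and the truth."""
    rng = np.random.default_rng(seed)
    batch_shift = np.array([2.0, 0.0])     # assumed true batch direction: x-axis
    subtype_shift = np.array([0.0, 2.0])   # assumed subtype difference: y-axis

    def cloud(n, center):
        # Spherical Gaussian cluster around the given center.
        return center + 0.5 * rng.normal(size=(n, 2))

    plus = np.vstack([cloud(n_red_plus, -batch_shift + subtype_shift),
                      cloud(n_blue_plus, -batch_shift - subtype_shift)])
    o = np.vstack([cloud(n_red_o, batch_shift + subtype_shift),
                   cloud(n_blue_o, batch_shift - subtype_shift)])

    direction = o.mean(axis=0) - plus.mean(axis=0)
    direction /= np.linalg.norm(direction)
    truth = np.array([1.0, 0.0])
    cosine = np.clip(abs(direction @ truth), 0.0, 1.0)
    return np.degrees(np.arccos(cosine))


balanced = mean_diff_angle(100, 100, 100, 100)
unbalanced = mean_diff_angle(180, 20, 20, 180)
print(balanced, unbalanced)
```

With equal subtype ratios the mean difference points along the batch direction. When one batch is mostly red and the other mostly blue, the subtype difference leaks into the batch means and the direction tilts well away from the truth.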

Why not adjust by means? DWD robust against non-proportional subtypes… Mathematical Statistical Question: Is there mathematics behind this?

HDLSS Data Combo Mathematics Asymptotic Results (as d → ∞): Consider the ratio between subgroup sizes

HDLSS Data Combo Mathematics Asymptotic Results (as d → ∞): – For …, PAM Inconsistent, Angle(PAM, Truth) … – For …, PAM Strongly Inconsistent, Angle(PAM, Truth) …

HDLSS Data Combo Mathematics Asymptotic Results (as d → ∞): – For …, DWD Inconsistent, Angle(DWD, Truth) … – For …, DWD Strongly Inconsistent, Angle(DWD, Truth) …

HDLSS Data Combo Mathematics Value of … and …, for sample size ratio …: …, only when … – Otherwise for …, both are Inconsistent

HDLSS Data Combo Mathematics Comparison between PAM and DWD? Shows Strong Difference, Explains Above Empirical Observation

SVM & DWD Tuning Parameter

SVM Tuning Parameter

SVM & DWD Tuning Parameter Possible Approaches: Visually Tuned (Can be Effective, But Takes Time, Requires Expertise)

SVM & DWD Tuning Parameter Possible Approaches: Visually Tuned Simple Defaults DWD: 100 / median pairwise distance (Surprisingly Useful, Simple Answer) SVM: 1000 (Works Well Sometimes, Not Others)
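The simple DWD default can be computed directly from the data. The sketch below follows the slide as written (100 divided by the median pairwise distance). Which pairs enter the median is not stated, so taking distances between points of opposite classes is an assumption, as is the name `default_dwd_penalty`.

```python
import numpy as np


def default_dwd_penalty(X_pos, X_neg, scale=100.0):
    """Return scale / (median distance between points of opposite classes).

    Assumption: the median is taken over between-class pairs only.
    """
    X_pos = np.asarray(X_pos, dtype=float)
    X_neg = np.asarray(X_neg, dtype=float)
    diffs = X_pos[:, None, :] - X_neg[None, :, :]   # all between-class differences
    dists = np.linalg.norm(diffs, axis=2)           # (n_pos, n_neg) distance matrix
    return scale / np.median(dists)


# Tiny check: a single pair at distance 5 gives 100 / 5.
C = default_dwd_penalty([[0.0, 0.0]], [[3.0, 4.0]])
print(C)
```

Tying the default to a distance scale in the data is what lets one rule work across data sets, in contrast to the fixed SVM default of 1000.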

SVM & DWD Tuning Parameter Possible Approaches: Visually Tuned Simple Defaults (Works Well for DWD, Less Effective for SVM)

SVM & DWD Tuning Parameter Possible Approaches: Visually Tuned Simple Defaults Cross Validation (Very Popular – Useful for SVM, But Comes at Computational Cost)

SVM & DWD Tuning Parameter Possible Approaches: Visually Tuned Simple Defaults Cross Validation Scale Space (Work with Full Range of Choices, Will Explore More Soon)

Participant Presentation: Frank Teets, Characterizing Protein Assembly Graphs