Object Orie’d Data Analysis, Last Time

Presentation transcript:

Object Orie’d Data Analysis, Last Time Kernel Embedding Embed data in higher dimensional manifold Gives greater flexibility to linear methods Support Vector Machines Aimed at very non-Gaussian Data E.g. from Kernel Embedding Distance Weighted Discrimination HDLSS Improvement of SVM

Support Vector Machines Graphical View, using Toy Example: Find the separating plane that maximizes the distances from the data to the plane, in particular the smallest such distance. The closest data points are called support vectors; the gap between them is called the margin.
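
A minimal sketch of this picture, assuming Python with NumPy and scikit-learn (none of which appear in the original slides): fit a nearly hard-margin linear SVM to a separable 2-d toy sample, then read off the support vectors and the margin width 2/||w||.

```python
# Minimal sketch (assumes NumPy + scikit-learn); not from the original slides.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated 2-d Gaussian clouds (toy example; parameters are illustrative)
X = np.vstack([rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(20, 2)),
               rng.normal(loc=[+2.0, 0.0], scale=0.5, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# A very large C approximates the hard-margin problem ("maximize smallest distance")
svm = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = svm.coef_[0], svm.intercept_[0]     # separating plane: w . x + b = 0
print("support vectors:\n", svm.support_vectors_)
print("margin (gap between classes):", 2.0 / np.linalg.norm(w))
```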

Support Vector Machines Graphical View, using Toy Example:

Support Vector Machines Forgotten last time, Important Extension: Multi-Class SVMs Hsu & Lin (2002) Lee, Lin, & Wahba (2002) Defined for “implicit” version “Direction Based” variation???

Support Vector Machines Also forgotten last time: toy examples illustrating Explicit vs. Implicit Kernel Embedding, as well as the effect of the window width σ on Gaussian kernel embedding

SVMs, Comput’n & Embedding For an “Embedding Map”, e.g. Explicit Embedding: Maximize: Get classification function: Straightforward application of embedding But loses inner product advantage

SVMs, Comput’n & Embedding Implicit Embedding: Maximize: Get classification function: Still defined only via inner products Retains optimization advantage Thus used very commonly Comparison to explicit embedding? Which is “better”???
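
The "Maximize" criteria and classification functions on these two slides were figures in the original deck. As a reconstruction in standard soft-margin SVM dual notation (my notation, not copied from the slides): explicit embedding works with the embedded data Φ(x_i) directly, while implicit embedding only ever touches inner products, so they can be replaced by a kernel K(x, x') = Φ(x)·Φ(x').

```latex
% Explicit embedding: optimize over the embedded data \Phi(x_i)
\max_{\alpha}\; \sum_i \alpha_i
  - \tfrac12 \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, \Phi(x_i)^\top \Phi(x_j)
\quad \text{s.t. } 0 \le \alpha_i \le C,\ \sum_i \alpha_i y_i = 0,
\qquad
f(x) = \sum_i \alpha_i y_i\, \Phi(x_i)^\top \Phi(x) + b .

% Implicit embedding: the same problem with inner products replaced by a kernel
\max_{\alpha}\; \sum_i \alpha_i
  - \tfrac12 \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
\quad \text{s.t. } 0 \le \alpha_i \le C,\ \sum_i \alpha_i y_i = 0,
\qquad
f(x) = \sum_i \alpha_i y_i\, K(x_i, x) + b .
```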

Support Vector Machines Target Toy Data set:

Support Vector Machines Explicit Embedding, window σ = 0.1:

Support Vector Machines Explicit Embedding, window σ = 1:

Support Vector Machines Explicit Embedding, window σ = 10:

Support Vector Machines Explicit Embedding, window σ = 100:

Support Vector Machines Notes on Explicit Embedding: Too small → poor generalizability; Too big → miss important regions. Classical lessons from kernel smoothing. Surprisingly large "reasonable region", i.e. parameter less critical (sometimes?) Also explore projections (in kernel space)
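
These window-width lessons can be reproduced in a few lines of Python (scikit-learn assumed; not part of the original slides). The explicit Gaussian embedding used here maps each x to the vector of Gaussian bump values exp(-||x - x_j||²/(2σ²)) over the sample points, one common construction chosen for illustration, and a linear SVM is then fit in that embedded space; a two-circle "target"-style toy sample stands in for the slides' data.

```python
# Sketch of the window-width effect for explicit Gaussian embedding
# (assumes scikit-learn; toy data and sigma grid are illustrative choices).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

for sigma in (0.1, 1.0, 10.0, 100.0):
    # Explicit embedding: coordinates are Gaussian bumps centered at the sample
    # points (centers fixed to the full sample, a simplification for the sketch).
    Phi = rbf_kernel(X, X, gamma=1.0 / (2.0 * sigma ** 2))
    acc = cross_val_score(LinearSVC(C=1.0, max_iter=20000), Phi, y, cv=5).mean()
    print(f"sigma = {sigma:6.1f}   5-fold CV accuracy = {acc:.3f}")
```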

Support Vector Machines Kernel space projection, window σ = 0.1:

Support Vector Machines Kernel space projection, window σ = 1:

Support Vector Machines Kernel space projection, window σ = 10:

Support Vector Machines Kernel space projection, window σ = 100:

Support Vector Machines Notes on Kernel space projection: Too small → Great separation (but recall, poor generalizability); Too big → no longer separable. As above: Classical lessons from kernel smoothing. Surprisingly large "reasonable region", i.e. parameter less critical (sometimes?) Also explore projections (in kernel space)
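
A small sketch of those kernel-space projections (scikit-learn assumed; illustrative data): up to the intercept and scaling, an RBF-kernel SVM's decision_function values are the projections of the data onto the normal direction in kernel feature space, so the "separable vs. no longer separable" behavior can be read off directly.

```python
# Sketch: kernel-space projections via the RBF-SVM decision function
# (assumes scikit-learn; data and sigma grid are illustrative).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

for sigma in (0.1, 1.0, 10.0, 100.0):
    clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2)).fit(X, y)
    proj = clf.decision_function(X)                 # projection onto SVM normal (up to offset)
    gap = proj[y == 1].min() - proj[y == 0].max()   # > 0 means the projections separate
    print(f"sigma = {sigma:6.1f}   projected class gap = {gap:+.3f}")
```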

Support Vector Machines Implicit Embedding, window σ = 0.1:

Support Vector Machines Implicit Embedding, window σ = 0.5:

Support Vector Machines Implicit Embedding, window σ = 1:

Support Vector Machines Implicit Embedding, window σ = 10:

Support Vector Machines Notes on Implicit Embedding: Similar Large vs. Small lessons Range of “reasonable results” Seems to be smaller (note different range of windows) Much different “edge” behavior Interesting topic for future work…

Distance Weighted Discrim’n 2-d Visualization: Pushes Plane Away From Data All Points Have Some Influence

Distance Weighted Discrim’n References for more on DWD: Current paper: Marron, Todd and Ahn (2007) Links to more papers: Ahn (2007) JAVA Implementation of DWD: caBIG (2006) SDPT3 Software: Toh (2007)
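
A hedged sketch of the DWD optimization problem itself, written with the generic convex solver cvxpy rather than the SDPT3 or caBIG implementations cited above (the function name, penalty parameter C, and toy data are my own choices): minimize the sum of reciprocal residuals plus a slack penalty, subject to the direction having norm at most one.

```python
# Sketch of Distance Weighted Discrimination via a generic convex solver
# (assumes cvxpy + NumPy; an illustration, not the cited SDPT3/caBIG code).
import cvxpy as cp
import numpy as np

def dwd_direction(X, y, C=100.0):
    """Minimize sum_i 1/r_i + C * sum_i xi_i with r_i = y_i (x_i . w + b) + xi_i."""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)                  # slacks for non-separable data
    r = cp.multiply(y, X @ w + b) + xi                # "residuals": must stay positive
    problem = cp.Problem(cp.Minimize(cp.sum(cp.inv_pos(r)) + C * cp.sum(xi)),
                         [cp.norm(w, 2) <= 1])
    problem.solve()
    return w.value, b.value

# toy usage
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 5)), rng.normal(1.0, 1.0, (20, 5))])
y = np.array([-1.0] * 20 + [1.0] * 20)
w, b = dwd_direction(X, y)
print("DWD direction:", w / np.linalg.norm(w))
```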

Batch and Source Adjustment Recall from Class Notes 8/28/07 For Stanford Breast Cancer Data (C. Perou) Analysis in Benito et al. (2004) Bioinformatics, 20, 105-114. https://genome.unc.edu/pubsup/dwd/ Adjust for Source Effects Different sources of mRNA Adjust for Batch Effects Arrays fabricated at different times
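
A sketch of the adjustment step in the spirit of the Benito et al. (2004) approach (my own simplified formulation, not the genome.unc.edu implementation): given a discriminant direction w separating the two batches or sources, for example the DWD direction from the sketch above, shift every sample in each batch along w so that the batches' projected means coincide.

```python
# Sketch of DWD-style batch/source adjustment (assumes NumPy; simplified).
import numpy as np

def batch_adjust_along(X, batch, w):
    """X: samples x genes; batch: batch/source label per sample; w: direction.

    Shifts each batch along w so its mean projection onto w becomes zero,
    leaving variation orthogonal to w (hopefully the biology) untouched.
    """
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)
    X_adj = np.asarray(X, dtype=float).copy()
    for lab in np.unique(batch):
        idx = batch == lab
        X_adj[idx] -= (X_adj[idx] @ w).mean() * w   # remove this batch's mean along w
    return X_adj
```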

Source Batch Adj: Biological Class Col. & Symbols

Source Batch Adj: Source Colors

Source Batch Adj: PC 1-3 & DWD direction

Source Batch Adj: DWD Source Adjustment

Source Batch Adj: Source Adj’d, PCA view

Source Batch Adj: S. & B Adj’d, Adj’d PCA

Why not adjust using SVM? Major Problem: Proj’d Distrib’al Shape Triangular Dist’ns (opposite skewed) Does not allow sensible rigid shift

Why not adjust using SVM? Nicely Fixed by DWD Projected Dist’ns near Gaussian Sensible to shift

Why not adjust by means? DWD is complicated: value added? Xuxin Liu example… Key is sizes of biological subtypes Differing ratio trips up mean But DWD more robust (although still not perfect)
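
A small numeric illustration (my own construction, in the spirit of the Xuxin Liu example mentioned above) of how differing subtype proportions trip up a plain mean shift, on a single expression coordinate: the subtype means are identical across batches, yet subtracting each batch's own mean leaves the subtypes misaligned.

```python
# One-coordinate illustration of mean adjustment being confounded by
# differing subtype proportions (illustrative numbers, not real data).
mu_A, mu_B = 0.0, 10.0        # true subtype means, identical in both batches
p1, p2 = 0.8, 0.2             # proportion of subtype A in batch 1 vs batch 2

mean1 = p1 * mu_A + (1 - p1) * mu_B     # batch 1 mean = 2.0
mean2 = p2 * mu_A + (1 - p2) * mu_B     # batch 2 mean = 8.0

# Mean adjustment subtracts each batch's own mean:
print("subtype A after adjustment:", mu_A - mean1, "vs", mu_A - mean2)   # -2.0 vs -8.0
print("subtype B after adjustment:", mu_B - mean1, "vs", mu_B - mean2)   #  8.0 vs  2.0
# The same biological subtype ends up 6 units apart across the two batches.
```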

Why not adjust by means? Next time: Work in before and after, slides like 138-141 from DWDnormPreso.ppt In Research/Bioinf/caBIG

Twiddle ratios of subtypes

Why not adjust by means? DWD robust against non-proportional subtypes… Mathematical Statistical Question: Is there mathematics behind this? (will answer next time…)

DWD in Face Recognition Face Images as Data (with M. Benito & D. Peña) Male – Female Difference? Discrimination Rule? Represented as long vector of pixel gray levels Registration is critical

DWD in Face Recognition, (cont.) Registered Data Shifts and scale Manually chosen To align eyes and mouth Still large variation See males vs. females???

DWD in Face Recognition, (cont.) DWD Direction Good separation Images "make sense" Garbage at ends? (extrapolation effects?)

DWD in Face Recognition, (cont.) Unregistered Version Much blurrier Since features don’t properly line up Nonlinear Variation But DWD still works Can see M-F differ’ce?

DWD in Face Recognition, (cont.) Interesting summary: Jump between means (in DWD direction) Clear separation of Maleness vs. Femaleness

DWD in Face Recognition, (cont.) Fun Comparison: Jump between means (in SVM direction) Also distinguishes Maleness vs. Femaleness But not as well as DWD

DWD in Face Recognition, (cont.) Analysis of difference: Project onto normals SVM has "small gap" (feels noise artifacts?) DWD "more informative" (feels real structure?)

DWD in Face Recognition, (cont.) Current Work: Focus on “drivers”: (regions of interest) Relation to Discr’n? Which is “best”? Lessons for human perception?

Outcomes Data Breast Cancer Study (C. M. Perou): Outcome of interest = death or survival Connection with gene expression? Approach: Treat death vs. survival during study as “classes” Find “direction that best separates the classes”

Outcomes Data Find "direction that best separates the classes"

Outcomes Data Find "direction that best separates the classes"

Outcomes Data Find "direction that best separates classes" DWD Projection SVM Projection Notes: SVM is "better separated"? (recall "data piling" problems…) DWD gives "more spread between sub-populations"??? (perhaps "more stable"?)

Outcomes Data Which is "better"? Approach: Find "genes of interest" To maximize loadings of direction vectors (reflects pointing in gene direction) Show intensity plot (of gene expression) Using top 20 genes in each direction

Outcomes Data Which is "better"? Study with gene intensity plot Order cases by DWD score (proj’n) Order genes by DWD loading (vec. entry) Reduce to top & bottom 20 Color map shows gene expression Shows genes that drive classification Gene names also available
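
A hedged sketch of that gene intensity plot construction (NumPy and matplotlib assumed; the expression matrix X, gene_names, and DWD direction w are taken as given, and a red-green-style colormap stands in for the usual microarray coloring):

```python
# Sketch of the intensity plot: cases ordered by DWD score, genes by loading,
# reduced to the top & bottom 20 genes (assumes NumPy + matplotlib).
import numpy as np
import matplotlib.pyplot as plt

def dwd_intensity_plot(X, gene_names, w, n_top=20):
    w = np.asarray(w, dtype=float)
    scores = X @ w                                   # DWD score (projection) per case
    case_order = np.argsort(scores)                  # order cases by DWD score
    gene_order = np.argsort(w)                       # order genes by DWD loading
    keep = np.r_[gene_order[:n_top], gene_order[-n_top:]]   # bottom & top 20 genes

    plt.imshow(X[np.ix_(case_order, keep)].T, aspect="auto", cmap="RdYlGn")
    plt.yticks(range(len(keep)), [gene_names[g] for g in keep], fontsize=6)
    plt.xlabel("cases, ordered by DWD score")
    plt.ylabel("genes, ordered by DWD loading")
    plt.colorbar(label="expression")
    plt.show()
```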

Outcomes Data Which is "better"? DWD direction

Outcomes Data Which is "better"? SVM direction

Outcomes Data Which is "better"? DWD finds genes showing better separation SVM genes are less informative

Outcomes Data How about Centroid (Mean Diff’nce) Method?

Outcomes Data How about Centroid (Mean Diff’nce) Method?

Outcomes Data Compare to DWD direction

Outcomes Data How about Centroid (Mean Diff’nce) Method? Best yet, in terms of red – green plot? Projections unacceptably mixed? These are two different goals… Try for trade-off? Scale space approach??? Interesting philosophical point: Very simple things often "best"

Outcomes Data Weakness of above analysis: Some with "genes prone to disease" have not died yet Perhaps can see in DWD expression plot? Better analysis: More sophisticated survival methods Work in progress w/ Brent Johnson, Danyu Li, Helen Zhang

Distance Weighted Discrim’n 2-d Visualization: Pushes Plane Away From Data All Points Have Some Influence

Distance Weighted Discrim’n, Maximal Data Piling [figure: BatchEffect1fig4Big.ps]

HDLSS Discrim’n Simulations Main idea: Comparison of SVM (Support Vector Machine) DWD (Distance Weighted Discrimination) MD (Mean Difference, a.k.a. Centroid) Linear versions, across dimensions

HDLSS Discrim’n Simulations Overall Approach: Study different known phenomena Spherical Gaussians Outliers Polynomial Embedding Common Sample Sizes But wide range of dimensions

HDLSS Discrim’n Simulations Spherical Gaussians: [figure: DWD1figEBig.ps]

HDLSS Discrim’n Simulations Spherical Gaussians: Same setup as before Means shifted in dim 1 only, All methods pretty good Harder problem for higher dimension SVM noticeably worse MD best (Likelihood method) DWD very close to MD Methods converge for higher dimension??
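
A hedged sketch of this spherical-Gaussian comparison (NumPy and scikit-learn assumed; the mean shift, sample sizes, and dimension grid are my own illustrative choices, and a DWD curve like the figure's could be added via the dwd_direction() sketch given earlier):

```python
# Sketch: Mean Difference (centroid) vs. linear SVM test error across dimensions
# for two spherical Gaussian classes whose means differ in dimension 1 only.
import numpy as np
from sklearn.svm import SVC

def one_run(d, n=25, shift=2.2, n_test=400, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.zeros(d)
    mu[0] = shift / 2.0
    def draw(m):
        X = np.vstack([rng.normal(size=(m, d)) + mu, rng.normal(size=(m, d)) - mu])
        return X, np.r_[np.ones(m), -np.ones(m)]
    Xtr, ytr = draw(n)
    Xte, yte = draw(n_test)

    # Mean Difference rule: project onto the difference of the class means
    w_md = Xtr[ytr == 1].mean(axis=0) - Xtr[ytr == -1].mean(axis=0)
    mid = (Xtr[ytr == 1].mean(axis=0) + Xtr[ytr == -1].mean(axis=0)) / 2.0
    err_md = np.mean(np.sign((Xte - mid) @ w_md) != yte)

    err_svm = np.mean(SVC(kernel="linear", C=100.0).fit(Xtr, ytr).predict(Xte) != yte)
    return err_md, err_svm

for d in (10, 100, 1000):
    print(d, one_run(d))
```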

HDLSS Discrim’n Simulations Outlier Mixture: [figure: DWD1figFBig.ps]

HDLSS Discrim’n Simulations Outlier Mixture: 80% dim. 1 , other dims 0 20% dim. 1 ±100, dim. 2 ±500, others 0 MD is a disaster, driven by outliers SVM & DWD are both very robust SVM is best DWD very close to SVM (insig’t difference) Methods converge for higher dimension?? Ignore RLR (a mistake)

HDLSS Discrim’n Simulations Wobble Mixture: [figure: DWD1figGBig.ps]

HDLSS Discrim’n Simulations Wobble Mixture: 80% dim. 1 , other dims 0 20% dim. 1 ±0.1, rand dim ±100, others 0 MD still very bad, driven by outliers SVM & DWD are both very robust SVM loses (affected by margin push) DWD slightly better (by w’ted influence) Methods converge for higher dimension?? Ignore RLR (a mistake)

HDLSS Discrim’n Simulations Nested Spheres: [figure: DWD1figHBig.ps]

HDLSS Discrim’n Simulations Nested Spheres: 1st d/2 dim’s, Gaussian with var 1 or C 2nd d/2 dim’s, the squares of the 1st dim’s (as for 2nd degree polynomial embedding) Each method best somewhere MD best in highest d (data non-Gaussian) Methods not comparable (realistic) Methods converge for higher dimension?? HDLSS space is a strange place Ignore RLR (a mistake)
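
A sketch of this data-generating setup (NumPy assumed; the variance ratio C and the sample size are illustrative assumptions): the first d/2 coordinates are Gaussian with variance 1 for one class and C for the other, and the second d/2 coordinates are the squares of the first, mimicking a degree-2 polynomial embedding.

```python
# Sketch of the "nested spheres" simulation design (illustrative parameters).
import numpy as np

def nested_spheres(n_per_class=25, d=40, C=4.0, seed=0):
    rng = np.random.default_rng(seed)
    half = d // 2
    Z_neg = rng.normal(0.0, 1.0, (n_per_class, half))           # variance 1 class
    Z_pos = rng.normal(0.0, np.sqrt(C), (n_per_class, half))    # variance C class
    X = np.vstack([np.hstack([Z_neg, Z_neg ** 2]),
                   np.hstack([Z_pos, Z_pos ** 2])])             # append the squares
    y = np.r_[-np.ones(n_per_class), np.ones(n_per_class)]
    return X, y

X, y = nested_spheres()
print(X.shape, y.shape)    # (50, 40) (50,)
```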

HDLSS Discrim’n Simulations Conclusions: Everything (sensible) is best sometimes DWD often very near best MD weak beyond Gaussian Caution about simulations (and examples): Very easy to cherry pick best ones Good practice in Machine Learning “Ignore method proposed, but read paper for useful comparison of others”

HDLSS Discrim’n Simulations Caution: There are additional players E.g. Regularized Logistic Regression also looks very competitive Interesting Phenomenon: All methods come together in very high dimensions???

HDLSS Discrim’n Simulations Can we say more about: All methods come together in very high dimensions??? Mathematical Statistical Question: Mathematics behind this??? (will answer next time)