Object Orie’d Data Analysis, Last Time
Classical Discrimination (aka Classification)
– FLD & GLR very attractive
– MD never better, sometimes worse
HDLSS Discrimination
– FLD & GLR fall apart
– MD much better
Maximal Data Piling
– HDLSS space is a strange place

Kernel Embedding
Aizerman, Braverman and Rozonoer (1964)
Motivating idea: extend the scope of linear discrimination by adding nonlinear components to the data (embedding in a higher dim’al space)
Better name: nonlinear discrimination?

Kernel Embedding
Stronger effects for higher-order polynomial embedding: e.g. for cubic, linear separation can give 4 parts (or fewer)

Kernel Embedding
General view: for the original data matrix, add rows, i.e. embed in a higher dimensional space, then slice with a hyperplane.
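A minimal sketch of "adding rows" (my illustration, not code from the slides). The slides' data matrices keep variables in rows, so in scikit-learn's cases-in-rows convention the added rows appear as added feature columns:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [2.0, 0.0]])                      # 3 cases, 2 original variables
X_emb = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X_emb.shape)                              # (3, 5): columns x1, x2, x1^2, x1*x2, x2^2
# Any linear rule (FLD, GLR, SVM, ...) applied in this 5-dim'al space has a linear
# boundary there, which maps back to a quadratic boundary in the original 2-d space.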

Kernel Embedding
Embedded Fisher Linear Discrimination: choose Class +1, for any data point, when the FLD rule (computed in the embedded space) says so.
– The image of the class boundaries in the original space is nonlinear, allowing more complicated class regions
– Can also do Gaussian Lik. Rat. (or others)
– Compute the image by classifying points from the original space
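For concreteness, the standard FLD rule written in the embedded space (textbook form in my notation; the slide's own formula is not reproduced in this transcript):

\[
\text{choose Class } +1 \quad\text{when}\quad
\Bigl(\Phi(x) - \tfrac{1}{2}\bigl(\bar{\Phi}_{1} + \bar{\Phi}_{2}\bigr)\Bigr)^{\top}
\hat{\Sigma}_{w}^{-1}\,\bigl(\bar{\Phi}_{1} - \bar{\Phi}_{2}\bigr) \;>\; 0,
\]

where \(\Phi\) is the embedding map, \(\bar{\Phi}_{k}\) are the embedded class means, and \(\hat{\Sigma}_{w}\) is the pooled within-class covariance in the embedded space.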

Kernel Embedding
Visualization for toy examples: have a linear discriminator in the embedded space; study its effect in the original data space via the implied nonlinear regions.
Approach:
– Use a test set in the original space (dense, equally spaced grid)
– Apply the embedded discrimination rule
– Color the grid using the result
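A runnable sketch of this recipe (the toy data, grid size, and plotting choices are my own assumptions, not the slides'):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)  # donut-like toy data
rule = make_pipeline(PolynomialFeatures(degree=2), LinearDiscriminantAnalysis()).fit(X, y)

# Dense, equally spaced grid in the ORIGINAL space; classify it and color the result.
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 300), np.linspace(-1.5, 1.5, 300))
regions = rule.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, regions, alpha=0.3)        # implied nonlinear class regions
plt.scatter(X[:, 0], X[:, 1], c=y, s=10)
plt.show()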

Kernel Embedding Polynomial Embedding, Toy Example 1: Parallel Clouds

Kernel Embedding
Polynomial Embedding, Toy Example 1: Parallel Clouds
PC 1: always bad; finds only the “embedded greatest variance” direction
FLD: stays good
GLR: OK discrimination at the data, but overfitting problems

Kernel Embedding Polynomial Embedding, Toy Example 2: Split X

Kernel Embedding
Polynomial Embedding, Toy Example 2: Split X
FLD: rapidly improves with higher degree
GLR: always good, but never an ellipse around the blues …

Kernel Embedding Polynomial Embedding, Toy Example 3: Donut

Kernel Embedding
Polynomial Embedding, Toy Example 3: Donut
FLD: poor fit for low degree, then good, no overfit
GLR: best with no embedding; square shape from overfitting?

Kernel Embedding
Drawbacks to polynomial embedding:
– Too many extra terms create spurious structure, i.e. “overfitting”
– HDLSS problems typically get worse

Kernel Embedding
Hot topic variation: “Kernel Machines”
Idea: replace polynomials by other nonlinear functions
– e.g. 1: sigmoid functions from neural nets
– e.g. 2: radial basis functions, Gaussian kernels
Related to “kernel density estimation” (recall: smoothed histogram)
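For reference, standard textbook forms of the two kinds of nonlinear functions mentioned (notation is mine, not the slides'):

\[
k_{\mathrm{sig}}(x, x') = \tanh\bigl(\kappa\, x^{\top}x' + c\bigr),
\qquad
k_{\mathrm{RBF}}(x, x') = \exp\!\Bigl(-\tfrac{\lVert x - x'\rVert^{2}}{2\sigma^{2}}\Bigr),
\]

the latter being the Gaussian kernel, with the same bump shape used in kernel density estimation.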

Kernel Embedding
Radial Basis Functions: note there are several ways to embed:
– Naïve Embedding (equally spaced grid)
– Explicit Embedding (evaluate at data)
– Implicit Embedding (inner prod. based)
(everybody currently does the latter)

Kernel Embedding
Naïve Embedding, radial basis functions: at some “grid points”, for a “bandwidth” (i.e. standard dev’n), consider the (grid-size dim’al) vector of radial basis functions centered at the grid points, and replace the data matrix with the matrix of those function values.

Kernel Embedding
Naïve Embedding, radial basis functions: for discrimination, work in the radial basis space, with a new data vector represented by its vector of radial basis function values.
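A minimal sketch of this naive embedding, assuming a Gaussian bump form; the grid, bandwidth, and names are my own choices rather than the slides' notation:

import numpy as np

def naive_rbf_embed(X, grid, h):
    """Represent each data vector by its Gaussian bump values at the grid points."""
    d2 = ((X[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)   # squared distances to grid
    return np.exp(-d2 / (2 * h ** 2))                            # shape (n_cases, n_grid)

grid = np.linspace(-3, 3, 11)[:, None]            # 11 equally spaced grid points in 1-d
data = np.array([[-2.0], [0.1], [1.7]])           # toy data
Phi = naive_rbf_embed(data, grid, h=0.5)          # embedded data matrix, shape (3, 11)
# Discrimination now works in this radial basis space; a new data vector is
# represented the same way before applying the linear rule:
phi_new = naive_rbf_embed(np.array([[0.4]]), grid, h=0.5)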

Kernel Embedding
Naïve Embedd’g, Toy E.g. 1 (Parallel Clouds): good at the data, poor outside

Kernel Embedding
Naïve Embedd’g, Toy E.g. 2 (Split X): OK at the data, strange outside

Kernel Embedding
Naïve Embedd’g, Toy E.g. 3 (Donut): mostly good, slight mistake for one kernel

Kernel Embedding
Naïve Embedding, radial basis functions. Toy examples, main lessons:
– Generally good in regions with data
– Unpredictable where data are sparse

Kernel Embedding Toy Example 4: Checkerboard Very Challenging! Linear Method? Polynomial Embedding?

Kernel Embedding
Toy Example 4: Checkerboard
Polynomial Embedding: very poor for linear, slightly better for higher degrees, overall very poor. Polynomials don’t have the needed flexibility.

Kernel Embedding Toy Example 4: Checkerboard Radial Basis Embedding + FLD Is Excellent!

Kernel Embedding
Drawbacks to naïve embedding: an equally spaced grid is too big in high dimension d, not computationally tractable (g^d grid points).
Approach: evaluate only at the data points, not on a full grid, but where the data live.
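A short sketch of "evaluate only at the data points": use the data themselves as the kernel centers, so the embedded dimension is n rather than g^d (my illustration, with an arbitrary bandwidth):

import numpy as np

X = np.random.default_rng(2).normal(size=(6, 3))            # 6 cases, 3 variables
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)     # pairwise squared distances
Phi = np.exp(-d2 / (2 * 0.5 ** 2))                          # (6, 6): n features, not g**d
# Each case is now described by its kernel values at the n data points,
# i.e. the embedding "lives where the data live".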

Kernel Embedding
Other types of embedding: Explicit, Implicit
Will be studied soon, after an introduction to Support Vector Machines …

Kernel Embedding
There are generalizations of this idea to other types of analysis, & some clever computational ideas. E.g. “kernel based, nonlinear Principal Components Analysis”. Ref: Schölkopf, Smola and Müller (1998)

Support Vector Machines
Motivation: find a linear method that “works well” for embedded data.
Note: embedded data are very non-Gaussian, which suggests the value of a really new approach.

Support Vector Machines
Classical references: Vapnik (1982); Boser, Guyon & Vapnik (1992); Vapnik (1995)
Excellent web resource:

Support Vector Machines
Recommended tutorial: Burges (1998)
Recommended monographs: Cristianini & Shawe-Taylor (2000); Schölkopf & Smola (2002)

Support Vector Machines
Graphical view, using a toy example:
– Find the separating plane that maximizes the distances from the data to the plane, in particular the smallest such distance
– The closest data points are called support vectors
– The gap between them is called the margin

SVMs, Optimization Viewpoint
Formulate an optimization problem based on:
– Data (feature) vectors
– Class labels
– Normal vector
– Location (determines intercept)
– Residuals (right side)
– Residuals (wrong side)
Solve (a convex problem) by quadratic programming
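For reference, the standard soft-margin problem that these ingredients enter, in textbook form with labels \(y_i \in \{\pm 1\}\), normal vector \(w\), intercept \(b\), and slack/residual variables \(\xi_i\) (my transcription of the usual formulation, not the slide's image):

\[
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad\text{subject to}\quad
y_{i}\bigl(w^{\top}x_{i} + b\bigr) \ge 1 - \xi_{i},
\qquad \xi_{i} \ge 0 .
\]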

SVMs, Optimization Viewpoint
Lagrange multiplier primal formulation (separable case): minimize the Lagrangian, where the Lagrange multipliers enforce the margin constraints.
Dual Lagrangian version: maximize over the multipliers.
From the solution, get the classification function.
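The formulas referred to above are the standard ones; in the separable case, with Lagrange multipliers \(\alpha_i \ge 0\), they read (textbook form):

\[
L(w, b, \alpha) = \tfrac{1}{2}\lVert w\rVert^{2}
  - \sum_{i=1}^{n} \alpha_{i}\,\bigl[\, y_{i}\bigl(w^{\top}x_{i} + b\bigr) - 1 \,\bigr],
\]

minimized over \(w, b\). The dual maximizes

\[
\sum_{i=1}^{n}\alpha_{i}
  - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}\,y_{i}y_{j}\,x_{i}^{\top}x_{j}
\quad\text{subject to}\quad \sum_{i=1}^{n}\alpha_{i}y_{i} = 0,\;\; \alpha_{i}\ge 0,
\]

giving the classification function

\[
f(x) = \operatorname{sign}\Bigl(\textstyle\sum_{i=1}^{n}\alpha_{i}y_{i}\,x_{i}^{\top}x + b\Bigr).
\]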

SVMs, Computation
Major computational point: the classifier depends on the data only through inner products!
– Thus it is enough to store only the inner products
– Creates big savings in optimization, especially for HDLSS data
– But also creates variations in kernel embedding (interpretation?!?)
– This is almost always done in practice

SVMs, Comput’n & Embedding
For an “embedding map”, e.g. Explicit Embedding: maximize the dual with the embedded data vectors plugged in, and get the classification function the same way.
– Straightforward application of embedding
– But loses the inner product advantage

SVMs, Comput’n & Embedding
Implicit Embedding: maximize the dual with the inner products replaced by kernel evaluations, and get the classification function the same way.
– Still defined only via inner products
– Retains the optimization advantage
– Thus used very commonly
– Comparison to explicit embedding? Which is “better”???
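A sketch of the two routes in scikit-learn (my illustration; the explicit polynomial features and the polynomial kernel are not exactly the same embedding, but they show the computational contrast):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC, LinearSVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Explicit embedding: materialize the polynomial features, then a linear SVM on them.
explicit = make_pipeline(PolynomialFeatures(degree=3), LinearSVC(C=1.0, max_iter=20000))
explicit.fit(X, y)

# Implicit embedding: the polynomial kernel supplies only the inner products
# <Phi(x), Phi(x')> inside the dual optimization; the features are never formed.
implicit = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0)
implicit.fit(X, y)

print(explicit.score(X, y), implicit.score(X, y))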

SVMs & Robustness
Usually not severely affected by outliers, but a possible weakness: can have very influential points.
Toy e.g.: only 2 points drive the SVM.
Notes:
– Huge range of chosen hyperplanes
– But all are “pretty good discriminators”
– Only happens when the whole range is OK??? Good or bad?

SVMs & Robustness
Effect of violators (toy example): depends on distance to the plane
– Weak for violators nearby
– Strong as they move away
– Can have a major impact on the plane
– Also depends on the tuning parameter C

SVMs, Computation
Caution: available algorithms are not created equal.
Toy example: Gunn’s Matlab code vs. Todd’s Matlab code.
Serious errors in Gunn’s version; it does not find the real optimum …

SVMs, Tuning Parameter
Recall the regularization parameter C: controls the penalty for violation, i.e. lying on the wrong side of the plane. Appears in the slack variables, and affects the performance of the SVM.
Toy example: d = 50, spherical Gaussian data

SVMs, Tuning Parameter
Toy example: d = 50, spherical Gaussian data. X-axis: optimal dir’n; other axis: SVM dir’n.
Small C:
– Where is the margin?
– Small angle to optimal (generalizable)
Large C:
– More data piling
– Larger angle (less generalizable)
– Bigger gap (but maybe not better???)
Between: very small range
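A small self-contained sketch in the spirit of this toy example (my own construction of d = 50 spherical Gaussians, not the slides' data or figure), comparing the SVM direction with the optimal direction as C varies:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
d, n = 50, 25                                  # HDLSS-ish: more dims than cases per class
mu = np.zeros(d); mu[0] = 2.0                  # optimal direction is the first coordinate axis
X = np.r_[rng.normal(0, 1, (n, d)) + mu, rng.normal(0, 1, (n, d)) - mu]
y = np.r_[np.ones(n), -np.ones(n)]

opt = np.zeros(d); opt[0] = 1.0                # optimal (generalizable) direction
for C in [1e-3, 1e0, 1e3]:
    w = LinearSVC(C=C, max_iter=50000).fit(X, y).coef_.ravel()
    angle = np.degrees(np.arccos(abs(w @ opt) / np.linalg.norm(w)))
    print(f"C = {C:g}: angle to optimal direction = {angle:.1f} degrees")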

SVMs, Tuning Parameter
Toy example: d = 50, spherical Gaussian data. Careful look at small C: put MD on the horizontal axis.
– Shows SVM and MD are the same for small C (mathematics behind this?)
– Separates for large C
– No data piling for MD

Distance Weighted Discrim’n
Improvement of SVM for HDLSS data. Toy e.g. (similar to earlier movie)

Distance Weighted Discrim’n
Toy e.g.: Maximal Data Piling direction
- Perfect separation
- Gross overfitting
- Large angle
- Poor gen’ability

Distance Weighted Discrim’n
Toy e.g.: Support Vector Machine direction
- Bigger gap
- Smaller angle
- Better gen’ability
- Feels the support vectors too strongly???
- Ugly subpops?
- Improvement?

Distance Weighted Discrim’n
Toy e.g.: Distance Weighted Discrimination
- Addresses these issues
- Smaller angle
- Better gen’ability
- Nice subpops
- Replaces min dist. by avg. dist.

Distance Weighted Discrim’n
Based on an optimization problem; more precisely, work in an appropriate penalty for violations.
Optimization method: Second Order Cone Programming
– “Still convex” gen’n of quad’c program’g
– Allows fast greedy solution
– Can use available fast software (SDPT3, Michael Todd, et al)
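For reference, the published form of the DWD optimization problem (as in Marron, Todd & Ahn (2007); transcribed from memory, so treat the details as a sketch rather than the slide's exact notation):

\[
\min_{w,\,\beta,\,\xi}\;\; \sum_{i=1}^{n}\frac{1}{r_{i}} \;+\; C\sum_{i=1}^{n}\xi_{i}
\quad\text{subject to}\quad
r_{i} = y_{i}\bigl(x_{i}^{\top}w + \beta\bigr) + \xi_{i} \;\ge\; 0,
\qquad \xi_{i}\ge 0,
\qquad \lVert w\rVert \le 1 ,
\]

i.e. a sum of reciprocal (perturbed) distances replaces the SVM's minimum distance, which is what gives all points some influence.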

Distance Weighted Discrim’n
d = 2 visualization: pushes the plane away from the data; all points have some influence

DWD Batch and Source Adjustment
Recall from Class Meeting, 9/6/05: for Perou’s Stanford Breast Cancer Data, analysis in Benito, et al (2004) Bioinformatics.
Use DWD as a useful direction vector to:
– Adjust for source effects (different sources of mRNA)
– Adjust for batch effects (arrays fabricated at different times)

DWD Adj: Biological Class Colors & Symbols

DWD Adj: Source Colors

DWD Adj: Source Adj’d, PCA view

DWD Adj: Source Adj’d, Class Colored

DWD Adj: S. & B. Adj’d, Adj’d PCA

Why not adjust using SVM?
Major problem: proj’d distrib’al shape
– Triangular dist’ns (oppositely skewed)
– Does not allow a sensible rigid shift

Why not adjust using SVM?
Nicely fixed by DWD: projected dist’ns near Gaussian, sensible to shift

Why not adjust by means?
DWD is complicated: value added? Xuxin Liu example…
– Key is the sizes of the biological subtypes
– A differing ratio trips up the mean
– But DWD is more robust (although still not perfect)

Twiddle ratios of subtypes
Link to Movie

DWD in Face Recognition, I
Face images as data (with M. Benito & D. Peña)
– Registered using landmarks
– Male – Female difference?
– Discrimination rule?

DWD in Face Recognition, II
DWD direction:
– Good separation
– Images “make sense”
– Garbage at ends? (extrapolation effects?)

DWD in Face Recognition, III
Interesting summary: jump between means (in the DWD direction); clear separation of maleness vs. femaleness

DWD in Face Recognition, IV
Fun comparison: jump between means (in the SVM direction); also distinguishes maleness vs. femaleness, but not as well as DWD

DWD in Face Recognition, V
Analysis of the difference: project onto the normals
– SVM has a “small gap” (feels noise artifacts?)
– DWD is “more informative” (feels real structure?)

DWD in Face Recognition, VI
Current work: focus on “drivers” (regions of interest)
– Relation to discr’n?
– Which is “best”?
– Lessons for human perception?

Fix links on face movies

Next Topics:
– DWD outcomes, from SAMSI below
– DWD simulations, from SAMSI below
– Windup from FDA doc: general conclusion, validation
– Also SVM overview, SAMSI doc

Multi-Class SVMs
Lee, Y., Lin, Y. and Wahba, G. (2002) “Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data”, U. Wisc. TR.
So far we only have the “implicit” version; a “direction based” variation is unknown.