Introduction to Support Vector Machines (SVM)

Return to Big Picture Main statistical goals of OODA: Understanding population structure –Low dim'al Projections, PCA … Classification (i.e. Discrimination) –Understanding 2+ populations Time Series of Data Objects –Chemical Spectra, Mortality Data "Vertical Integration" of Data Types

Maximal Data Piling Strange FLD effect at HDLSS boundary: Data Piling: For each class, all data project to single value

Maximal Data Piling Visual similarity of & ? Can show (Ahn & Marron 2009), for d < n: I.e. directions are the same! How can this be? Note lengths are different … Study from transformation viewpoint

Kernel Embedding Aizerman, Braverman and Rozonoer (1964) Motivating idea: Extend scope of linear discrimination, by adding nonlinear components to data (embedding in a higher dim'al space) Better use of name: nonlinear discrimination?

Kernel Embedding General View: for original data matrix: add rows: i.e. embed in Higher Dimensional space

Kernel Embedding Polynomial Embedding, Toy Example 1: Parallel Clouds FLD Original Data

Kernel Embedding Polynomial Embedding, Toy Example 1: Parallel Clouds FLD All Stable & Very Good

Kernel Embedding Polynomial Embedding, Toy Example 1: Parallel Clouds GLR Original Data

Kernel Embedding Polynomial Embedding, Toy Example 1: Parallel Clouds GLR Unstable Subject to Overfitting

Kernel Embedding Polynomial Embedding, Toy Example 2: Split X

Kernel Embedding Polynomial Embedding, Toy Example 2: Split X FLD Original Data

Kernel Embedding Polynomial Embedding, Toy Example 2: Split X FLD Slightly Better

Kernel Embedding Polynomial Embedding, Toy Example 2: Split X FLD Very Good Effective Hyperbola

Kernel Embedding Polynomial Embedding, Toy Example 2: Split X FLD Robust Against Overfitting

Kernel Embedding Polynomial Embedding, Toy Example 2: Split X GLR Original Data Looks Very Good

Kernel Embedding Polynomial Embedding, Toy Example 2: Split X GLR Reasonably Stable Never Ellipse Around Blues

Kernel Embedding Polynomial Embedding, Toy Example 3: Donut

Kernel Embedding Polynomial Embedding, Toy Example 3: Donut FLD Original Data Very Bad

Kernel Embedding Polynomial Embedding, Toy Example 3: Donut FLD Somewhat Better (Parabolic Fold)

Kernel Embedding Polynomial Embedding, Toy Example 3: Donut FLD Somewhat Better (Other Fold)

Kernel Embedding Polynomial Embedding, Toy Example 3: Donut FLD Good Performance (Slice of Paraboloid)

Kernel Embedding Polynomial Embedding, Toy Example 3: Donut FLD Robust Against Overfitting

Kernel Embedding Polynomial Embedding, Toy Example 3: Donut GLR Original Data Best With No Embedding?

Kernel Embedding Polynomial Embedding, Toy Example 3: Donut GLR Overfitting Gives Square Shape?

Kernel Embedding Drawbacks to polynomial embedding: Too many extra terms create spurious structure, i.e. have "overfitting" HDLSS problems typically get worse

Kernel Embedding Hot Topic Variation: "Kernel Machines" Idea: replace polynomials by other nonlinear functions e.g. 1: sigmoid functions from neural nets e.g. 2: radial basis functions Gaussian kernels Related to "kernel density estimation" (recall: smoothed histogram)
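Both families of embedding components are easy to write down; a minimal pure-Python sketch (the parameter values below are made up, purely for illustration):

```python
import math

def sigmoid_component(x, w, b):
    """Neural-net style embedding component: a sigmoid of a linear score."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def gaussian_component(x, center, sigma):
    """Radial basis embedding component: a Gaussian bump at `center`,
    the same shape of kernel used in kernel density estimation."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))

# One data value, embedded by a few components of each type
x = 0.7
print([round(sigmoid_component(x, w, 0.0), 3) for w in (1.0, 2.0)])
print([round(gaussian_component(x, c, 1.0), 3) for c in (0.0, 1.0)])
```

Each component becomes one extra coordinate of the embedded data vector, exactly as the added polynomial terms did above.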

Kernel Density Estimation Chondrite Data: Represent points by red bars Where are data “more dense”?

Kernel Density Estimation Chondrite Data: Put probability mass 1/n at each point Smooth piece of “density”

Kernel Density Estimation Chondrite Data: Sum pieces to estimate density Suggests 3 modes (rock sources)
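The three slides above can be summarized in a short pure-Python sketch; the toy numbers below merely stand in for the chondrite data, which are not reproduced in this transcript:

```python
import math

def kde(data, x, h):
    """Kernel density estimate at x: average of Gaussian pieces of
    probability mass 1/n, one centered at each data point, bandwidth h."""
    n = len(data)
    return sum(
        math.exp(-0.5 * ((x - xi) / h) ** 2) / (h * math.sqrt(2 * math.pi))
        for xi in data
    ) / n

# Toy measurements with three clusters (three "modes")
data = [20.1, 20.5, 20.9, 26.0, 26.4, 26.8, 33.2, 33.6, 34.0]

# The estimate integrates to ~1 and is largest where data are dense
density = [kde(data, x / 10.0, 1.0) for x in range(150, 400)]
print(abs(sum(density) * 0.1 - 1.0) < 0.02)         # True
print(kde(data, 26.4, 1.0) > kde(data, 30.0, 1.0))  # True
```

Summing the smooth pieces, rather than stacking bars, is what reveals the modes.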

Kernel Embedding Radial Basis Functions: Note: there are several ways to embed: Naïve Embedding (equally spaced grid) (will take a look at this, for illustration purposes) Explicit Embedding (evaluate at data) Implicit Embedding (inner prod. based) (everybody currently does the latter)

Kernel Embedding Naïve Embedding, Radial basis functions: At some "grid points", for a "bandwidth" (i.e. standard dev'n), consider ( dim'al) functions: Replace data matrix with:

Kernel Embedding Naïve Embedding, Radial basis functions: For discrimination: Work in radial basis space, with new data vector, represented by:
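A pure-Python sketch of the naive embedding; the grid, bandwidth, and data below are made-up illustrative values, and a nearest-centroid rule stands in for FLD (which would additionally weight by the within-class covariance):

```python
import math

def rbf_features(x, grid, sigma):
    """Naive embedding: evaluate one Gaussian bump per grid point at x."""
    return [math.exp(-((x - g) ** 2) / (2 * sigma ** 2)) for g in grid]

# Equally spaced grid points and a chosen bandwidth
grid = [i / 2.0 for i in range(-6, 7)]   # -3.0, -2.5, ..., 3.0
sigma = 0.5

# Two 1-d classes that no single linear threshold separates
# (class 0 surrounds class 1)
class0 = [-2.2, -1.9, 1.9, 2.2]
class1 = [-0.2, 0.0, 0.2]

def centroid(pts):
    """Mean of the embedded vectors of a class."""
    feats = [rbf_features(x, grid, sigma) for x in pts]
    return [sum(f[j] for f in feats) / len(feats) for j in range(len(grid))]

c0, c1 = centroid(class0), centroid(class1)

def classify(x):
    """Nearest centroid in the radial basis space."""
    f = rbf_features(x, grid, sigma)
    d0 = sum((a - b) ** 2 for a, b in zip(f, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(f, c1))
    return 0 if d0 < d1 else 1

print([classify(x) for x in (-2.0, 0.1, 2.0)])  # [0, 1, 0]
```

The embedded space makes the surrounded class linearly accessible even though the original 1-d data are not.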

Kernel Embedding Naïve Embedd'g, Toy E.g. 1: Parallel Clouds Good at data Poor outside

Kernel Embedding Naïve Embedd'g, Toy E.g. 2: Split X OK at data Strange outside

Kernel Embedding Naïve Embedd'g, Toy E.g. 3: Donut Mostly good Slight mistake for one kernel

Kernel Embedding Naïve Embedding, Radial basis functions: Toy Example, Main lessons: Generally good in regions with data, Unpredictable where data are sparse

Kernel Embedding Toy Example 4: Checkerboard Very Challenging! Linear Method? Polynomial Embedding?

Kernel Embedding Toy Example 4: Checkerboard Very Challenging! FLD Linear Is Hopeless

Kernel Embedding Toy Example 4: Checkerboard Very Challenging! FLD Embedding Gets Better But Still Not Great

Kernel Embedding Toy Example 4: Checkerboard Very Challenging! Polynomials Don't Have Needed Flexibility

Kernel Embedding Toy Example 4: Checkerboard Radial Basis Embedding + FLD Is Excellent!

Kernel Embedding Drawbacks to naïve embedding: Equally spaced grid too big in high d Not computationally tractable (g^d grid points) Approach: Evaluate only at data points Not on full grid But where data live
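The g^d blow-up is easy to make concrete (g = 10 is an illustrative choice):

```python
# Number of basis functions for an equally spaced grid:
# g grid points per coordinate, d coordinates -> g**d in total.
g = 10
for d in (2, 10, 100):
    print(f"d = {d:3d}: {g ** d:.1e} grid points")

# Evaluating the basis functions only at the n data points instead
# needs just n of them, regardless of d.
```

Already at d = 100 the grid has 10^100 points, while the data-based approach needs only n basis functions.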

Kernel Embedding Other types of embedding: Explicit Implicit Will be studied soon, after introduction to Support Vector Machines …

Kernel Embedding There are generalizations of this idea to other types of analysis, & some clever computational ideas. E.g. "Kernel based, nonlinear Principal Components Analysis" Ref: Schölkopf, Smola and Müller (1998)

Support Vector Machines Motivation: Find a linear method that "works well" for embedded data Note: Embedded data are very non-Gaussian Classical Statistics: "Use Prob. Dist'n" Looks Hopeless

Support Vector Machines Motivation: Find a linear method that "works well" for embedded data Note: Embedded data are very non-Gaussian Suggests value of really new approach

Support Vector Machines Classical References: Vapnik (1982) Boser, Guyon & Vapnik (1992) Vapnik (1995)

Support Vector Machines Recommended tutorial: Burges (1998) Recommended Monographs: Cristianini & Shawe-Taylor (2000) Schölkopf & Smola (2002)

Support Vector Machines Graphical View, using Toy Example: Find separating plane To maximize distances from data to plane In particular smallest distance Data points closest are called support vectors Gap between is called margin Caution: For some "margin" is different
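In one dimension the picture collapses to something computable by hand; a sketch with made-up separable data, where the "plane" is a single point:

```python
# Two separable 1-d classes (all positives above all negatives)
class_neg = [-3.0, -2.1, -1.4]
class_pos = [0.8, 1.7, 2.5]

# The closest pair of opposite-class points are the support vectors
sv_neg = max(class_neg)
sv_pos = min(class_pos)

# Max-margin separating "plane": midway between the support vectors;
# the margin is the distance from it to either support vector
boundary = (sv_neg + sv_pos) / 2
margin = (sv_pos - sv_neg) / 2

print(round(boundary, 6), round(margin, 6))  # -0.3 1.1
```

Only the two support vectors matter; moving any other point (without crossing the margin) leaves the boundary unchanged, which is the key property of the SVM.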

SVMs, Optimization Viewpoint Formulate Optimization problem, based on: Data (feature) vectors Class Labels Normal Vector Location (determines intercept) Residuals (right side) Residuals (wrong side) Solve (convex problem) by quadratic programming

SVMs, Optimization Viewpoint Lagrange Multipliers primal formulation (separable case): Minimize: Where are Lagrange multipliers Dual Lagrangian version: Maximize: Get classification function:
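The formulas elided from this slide are presumably the standard separable-case expressions; for data vectors $x_i$ with class labels $y_i \in \{\pm 1\}$ they read:

```latex
% Primal Lagrangian (minimize over w, b; the \alpha_i \ge 0 are the
% Lagrange multipliers):
L_P \;=\; \tfrac{1}{2}\|w\|^{2}
      \;-\; \sum_{i} \alpha_i \bigl[\, y_i (x_i \cdot w + b) - 1 \,\bigr]

% Dual Lagrangian version (maximize over the \alpha_i):
L_D \;=\; \sum_{i} \alpha_i
      \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j),
\qquad \alpha_i \ge 0, \quad \sum_{i} \alpha_i y_i = 0

% Resulting classification function:
f(x) \;=\; \operatorname{sign}\Bigl( \sum_{i} \alpha_i y_i\, (x_i \cdot x) + b \Bigr)
```

Note the dual and the classification function involve the data only through the inner products $x_i \cdot x_j$, which is the computational point made next.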

SVMs, Computation Major Computational Point: Classifier only depends on data through inner products! Thus enough to only store inner products Creates big savings in optimization Especially for HDLSS data But also creates variations in kernel embedding (interpretation?!?) This is almost always done in practice

SVMs, Comput'n & Embedding For an "Embedding Map", e.g. Explicit Embedding: Maximize: Get classification function: Straightforward application of embedding But loses inner product advantage

SVMs, Comput'n & Embedding Implicit Embedding: Maximize: Note Difference from Explicit Embedding:

SVMs, Comput'n & Embedding Implicit Embedding: Maximize: Get classification function: Still defined only via inner products Retains optimization advantage Thus used very commonly Comparison to explicit embedding? Which is "better"???
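The difference is concrete enough to compute; a pure-Python sketch with a 1-d Gaussian kernel and made-up data (explicit embedding: map each point to its vector of kernel evaluations at the data, then take ordinary inner products; implicit: use the kernel value itself in place of the inner product):

```python
import math

def gauss(x, y, sigma=1.0):
    """Gaussian (radial basis) kernel on the real line."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

data = [-1.0, 0.0, 2.0]
n = len(data)

# Implicit embedding: wherever the dual uses an inner product
# x_i . x_j, substitute the kernel value K(x_i, x_j) directly.
implicit = [[gauss(a, b) for b in data] for a in data]

# Explicit embedding: embed each point as its vector of kernel
# evaluations at the data points, then take ordinary inner products.
phi = [[gauss(a, b) for b in data] for a in data]
explicit = [
    [sum(u * v for u, v in zip(phi[i], phi[j])) for j in range(n)]
    for i in range(n)
]

# The two Gram matrices differ, so the two approaches train
# genuinely different classifiers on the same data.
print(round(implicit[0][1], 4), round(explicit[0][1], 4))  # 0.6065 1.2146
```

Both give classifiers defined through pairwise quantities, but only the implicit version never forms the embedded vectors, which is where the optimization savings come from.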

Support Vector Machines Target Toy Data set:

Support Vector Machines Explicit Embedding, window σ = 0.1: Gaussian Kernel, i.e. Radial Basis Function

Support Vector Machines Explicit Embedding, window σ = 1: Pretty Big Change (Factor of 10)

Support Vector Machines Explicit Embedding, window σ = 10: Not Quite As Good ???

Support Vector Machines Explicit Embedding, window σ = 100: Note: Lost Center (Over-Smoothed)

Support Vector Machines Notes on Explicit Embedding: Too small → Poor generalizability Too big → miss important regions Classical lessons from kernel smoothing Surprisingly large "reasonable region" I.e. parameter less critical (sometimes?)

Support Vector Machines Interesting Alternative Viewpoint: Study Projections In Kernel Space (Never done in Machine Learning World)

Support Vector Machines Kernel space projection, window σ = 0.1: Note: Data Piling At Margin

Support Vector Machines Kernel space projection, window σ = 1: Excellent Separation (but less than σ = 0.1)

Support Vector Machines Kernel space projection, window σ = 10: Still Good (But Some Overlap)

Support Vector Machines Kernel space projection, window σ = 100: Some Reds On Wrong Side (Missed Center)

Support Vector Machines Notes on Kernel space projection: Too small → –Great separation –But recall, poor generalizability Too big → no longer separable As above: –Classical lessons from kernel smoothing –Surprisingly large "reasonable region" –I.e. parameter less critical (sometimes?) Also explore projections (in kernel space)

Support Vector Machines Implicit Embedding, window σ = 0.1:

Support Vector Machines Implicit Embedding, window σ = 0.5:

Support Vector Machines Implicit Embedding, window σ = 1:

Support Vector Machines Implicit Embedding, window σ = 10:

Support Vector Machines Notes on Implicit Embedding: Similar Large vs. Small lessons Range of “reasonable results” Seems to be smaller (note different range of windows) Much different “edge” behavior Interesting topic for future work…