Introduction to Support Vector Machines (SVM)
Presentation transcript:

Participant Presentations (10 Minute Talks)

Return to Big Picture. Main statistical goals of OODA: understanding population structure (low-dimensional projections, PCA, ...); classification (i.e. discrimination), understanding 2+ populations; time series of data objects (chemical spectra, mortality data); “vertical integration” of data types.

Classical Discrimination. Summary of FLD vs. GLR: Tilted Point Clouds data: FLD good, GLR good. Donut data: FLD bad. X data: GLR OK, not great. Classical conclusion: GLR is generally better (we will see a different answer for HDLSS data).

HDLSS Discrimination. Main HDLSS issues: sample size $n$ < dimension $d$; singular covariance matrix, so we can't use the matrix inverse, i.e. can't standardize (sphere) the data (which requires a root inverse covariance); can't do classical multivariate analysis.

HDLSS Discrimination. Application of the generalized inverse to FLD: direction (normal) vector $n_{FLD} = \Sigma_w^{-} (\bar{X}_{+} - \bar{X}_{-})$; intercept $\mu_{FLD} = \frac{1}{2}\bar{X}_{+} + \frac{1}{2}\bar{X}_{-}$. We have replaced $\Sigma_w^{-1}$ by the generalized inverse $\Sigma_w^{-}$.
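To make this concrete, here is a minimal numpy sketch of FLD with a generalized inverse, using `numpy.linalg.pinv` as the Moore-Penrose inverse; the function name and array layout are illustrative choices, not from the slides.

```python
import numpy as np

def fld_generalized_inverse(X_plus, X_minus):
    """FLD direction and intercept using a generalized inverse of the pooled
    within-class covariance, so it still runs in the HDLSS case (d > n).

    X_plus, X_minus: arrays of shape (n_plus, d) and (n_minus, d); rows are data objects.
    """
    mean_p, mean_m = X_plus.mean(axis=0), X_minus.mean(axis=0)
    # Pooled within-class covariance; singular whenever d > n_plus + n_minus - 2.
    Sw = (np.cov(X_plus, rowvar=False) * (len(X_plus) - 1) +
          np.cov(X_minus, rowvar=False) * (len(X_minus) - 1)) / (len(X_plus) + len(X_minus) - 2)
    direction = np.linalg.pinv(Sw) @ (mean_p - mean_m)   # n_FLD on the slide
    midpoint = 0.5 * (mean_p + mean_m)                   # mu_FLD on the slide
    return direction, midpoint

# Classify a new point x0 into the + class when (x0 - midpoint) @ direction >= 0.
```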

HDLSS Discrimination. FLD in increasing dimensions, far beyond the HDLSS boundary (d = 70-1000): quality degrades, projections look terrible (populations overlap), and generalizability falls apart as well. Asymptotics worked out by Bickel & Levina (2004); the problem is estimation of the $d \times d$ covariance matrix.

HDLSS Discrimination. Mean Difference (Centroid) Method: far more stable over dimensions, because it is the likelihood ratio solution (for known-variance Gaussians), and it doesn't feel the HDLSS boundary. Eventually becomes too good?!? Widening gap between clusters?!? Careful: the angle to the optimal direction grows, so generalizability is lost (since the noise increases). HDLSS data present some odd effects...

Maximal Data Piling. Strange FLD effect at the HDLSS boundary. Data piling: for each class, all data project to a single value.

Maximal Data Piling. How to compute $v_{MDP}$? Can show (Ahn & Marron 2009): $v_{MDP} = \Sigma^{-1} (\bar{X}_{+} - \bar{X}_{-})$. Recall the FLD formula: $v_{FLD} = \Sigma_w^{-1} (\bar{X}_{+} - \bar{X}_{-})$. The only difference is global vs. within-class covariance estimates!
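As a small companion sketch (the helper name is mine, and the Moore-Penrose pseudo-inverse is used so the same code also runs when the global covariance is singular):

```python
import numpy as np

def mdp_direction(X_plus, X_minus):
    """Maximal Data Piling direction: the same form as FLD, but built from the
    generalized inverse of the GLOBAL covariance estimate rather than the
    within-class covariance."""
    X_all = np.vstack([X_plus, X_minus])
    Sigma = np.cov(X_all, rowvar=False)                  # global covariance estimate
    v = np.linalg.pinv(Sigma) @ (X_plus.mean(axis=0) - X_minus.mean(axis=0))
    return v / np.linalg.norm(v)

# For d >= n - 1 the projections X_plus @ v and X_minus @ v each collapse to
# (essentially) a single value per class: data piling.
```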

Maximal Data Piling. Visual similarity of $v_{MDP}$ and $v_{FLD}$? Can show (Ahn & Marron 2009), for $d < n$: $\frac{v_{MDP}}{\| v_{MDP} \|} = \frac{v_{FLD}}{\| v_{FLD} \|}$, i.e. the directions are the same! How can this be? Note the lengths are different... Study from a transformation viewpoint.

Maximal Data Piling. Alternate approach: optimization viewpoint. Find a direction vector $v$ (with $\|v\| = 1$) to maximize $\frac{(\bar{P}X_{+} - \bar{P}X_{-})^2}{V_{+} + V_{-}}$, where for $\odot = +,-$: $\bar{P}X_{\odot} = \mathrm{Avg}(v^t X_{\odot})$ is the average projection and $V_{\odot} = \mathrm{var}(v^t X_{\odot})$ is the variance of the projections. The numerator is the between-class variation, the denominator the within-class variation.

Maximal Data Piling. Alternate approach: optimization viewpoint. Find a direction vector $v$ (with $\|v\| = 1$) to maximize $\frac{(\bar{P}X_{+} - \bar{P}X_{-})^2}{V_{+} + V_{-}}$. Case 1: $d \le n - 2$. The solutions are $v_{FLD} \propto \Sigma_w^{-} (\bar{X}_{+} - \bar{X}_{-})$ and $v_{MDP} \propto \Sigma^{-} (\bar{X}_{+} - \bar{X}_{-})$. Common practice in the literature: assume this case, but not make it very clear.

Maximal Data Piling. Maximize $\frac{(\bar{P}X_{+} - \bar{P}X_{-})^2}{V_{+} + V_{-}}$, Case 2: $d \ge n - 1$. The solution is $v_{MDP} = \Sigma^{-} (\bar{X}_{+} - \bar{X}_{-})$, but not $v_{FLD} = \Sigma_w^{-} (\bar{X}_{+} - \bar{X}_{-})$. Point not made anywhere else???

Maximal Data Piling Recurring, over-arching, issue: HDLSS space is a weird place

Kernel Embedding. Aizerman, Braverman and Rozoner (1964). Motivating idea: extend the scope of linear discrimination by adding nonlinear components to the data (embedding in a higher-dimensional space). A better use of the name: nonlinear discrimination?

Kernel Embedding. But in the “quadratic embedded domain” $\{(x, x^2) : x \in \mathbb{R}\} \subset \mathbb{R}^2$, linear separation can give 3 parts.

Kernel Embedding. But in the quadratic embedded domain $\{(x, x^2) : x \in \mathbb{R}\} \subset \mathbb{R}^2$, linear separation can give 3 parts: the original data space lies in a 1-d manifold, a very sparse region of $\mathbb{R}^2$, and the curvature of the manifold gives better linear separation, with any 2 break points possible (2 points determine a line).

Kernel Embedding. General view: for the original data matrix $\begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{d1} & \cdots & x_{dn} \end{pmatrix}$, add rows, i.e. embed in a higher-dimensional space: $\begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{d1} & \cdots & x_{dn} \\ x_{11}^2 & \cdots & x_{1n}^2 \\ \vdots & & \vdots \\ x_{d1}^2 & \cdots & x_{dn}^2 \\ x_{11} x_{21} & \cdots & x_{1n} x_{2n} \\ \vdots & & \vdots \end{pmatrix}$
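A minimal sketch of this row-adding step, assuming scikit-learn's `PolynomialFeatures` is available; the toy sizes and degree are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 10))             # original d x n data matrix (d = 2, n = 10)

# Degree-2 polynomial embedding: append squares and cross products as new rows.
embed = PolynomialFeatures(degree=2, include_bias=False)
X_embedded = embed.fit_transform(X.T).T  # sklearn expects cases as rows, hence the transposes
print(X.shape, "->", X_embedded.shape)   # (2, 10) -> (5, 10): rows x1, x2, x1^2, x1*x2, x2^2
```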

Kernel Embedding. Embedded Fisher Linear Discrimination: choose class +1, for any $x_0 \in \mathbb{R}^d$, when $x_0^t \Sigma_w^{-1} (\bar{X}_{(+)} - \bar{X}_{(-)}) \ge \frac{1}{2} (\bar{X}_{(+)} + \bar{X}_{(-)})^t \Sigma_w^{-1} (\bar{X}_{(+)} - \bar{X}_{(-)})$ in the embedded space. The image of the class boundaries in the original space is nonlinear, which allows more complicated class regions. Can also do Gaussian likelihood ratio (or others). Compute the image by classifying points from the original space.

Kernel Embedding. Visualization for toy examples: have linear discrimination in the embedded space; study its effect in the original data space via the implied nonlinear regions. Challenge: hard to compute explicitly.

Kernel Embedding. Visualization for toy examples: have linear discrimination in the embedded space; study its effect in the original data space via the implied nonlinear regions. Approach: use a test set in the original space (a dense, equally spaced grid), apply the embedded discrimination rule, and color using the result.
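A rough sketch of that grid-coloring recipe in 2-d, reusing the `fld_generalized_inverse` helper sketched earlier; the grid size and polynomial degree are arbitrary choices.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def embedded_fld_regions(X_plus, X_minus, degree=2, grid_size=200):
    """Fit FLD in the polynomial-embedded space, then classify a dense,
    equally spaced grid in the ORIGINAL 2-d space to reveal the implied
    nonlinear class regions (color +1 yellow, -1 cyan when plotting)."""
    embed = PolynomialFeatures(degree=degree, include_bias=False)
    direction, midpoint = fld_generalized_inverse(embed.fit_transform(X_plus),
                                                  embed.transform(X_minus))
    lo = np.minimum(X_plus.min(axis=0), X_minus.min(axis=0))
    hi = np.maximum(X_plus.max(axis=0), X_minus.max(axis=0))
    g1, g2 = np.meshgrid(np.linspace(lo[0], hi[0], grid_size),
                         np.linspace(lo[1], hi[1], grid_size))
    grid = np.column_stack([g1.ravel(), g2.ravel()])
    labels = np.sign((embed.transform(grid) - midpoint) @ direction)
    return g1, g2, labels.reshape(g1.shape)   # e.g. plt.contourf(g1, g2, labels)
```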

Kernel Embedding. Recall the classifier display device for the Parallel Clouds example: yellow for grid points assigned to the plus class, cyan for grid points assigned to the minus class.

Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds, FLD, original data.

Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds, FLD.

Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds, FLD.

Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds, FLD.

Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds, FLD. All stable and very good, since there is no better separation in the embedded space.

Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds, GLR, original data.

Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds, GLR.

Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds, GLR.

Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds, GLR.

Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds, GLR.

Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds, GLR. Unstable, subject to overfitting: too much flexibility in the embedded space.

Kernel Embedding. Polynomial Embedding, Toy Example 2: Split X. Very challenging for linear approaches.

Kernel Embedding. Polynomial Embedding, Toy Example 2: Split X, FLD, original data.

Kernel Embedding. Polynomial Embedding, Toy Example 2: Split X, FLD. Slightly better.

Kernel Embedding. Polynomial Embedding, Toy Example 2: Split X, FLD.

Kernel Embedding. Polynomial Embedding, Toy Example 2: Split X, FLD. Very good: an effective hyperbola.

Kernel Embedding. Polynomial Embedding, Toy Example 2: Split X, FLD.

Kernel Embedding. Polynomial Embedding, Toy Example 2: Split X, FLD. Robust against overfitting.

Kernel Embedding. Polynomial Embedding, Toy Example 2: Split X, GLR, original data. Looks very good.

Kernel Embedding. Polynomial Embedding, Toy Example 2: Split X, GLR.

Kernel Embedding. Polynomial Embedding, Toy Example 2: Split X, GLR. Reasonably stable; never an ellipse around the blues.

Kernel Embedding. Polynomial Embedding, Toy Example 3: Donut. Very challenging for any linear approach.

Kernel Embedding. Polynomial Embedding, Toy Example 3: Donut, FLD, original data. Very bad.

Kernel Embedding. Polynomial Embedding, Toy Example 3: Donut, FLD. Somewhat better (parabolic fold).

Kernel Embedding. Polynomial Embedding, Toy Example 3: Donut, FLD. Somewhat better (other fold).

Kernel Embedding. Polynomial Embedding, Toy Example 3: Donut, FLD. Good performance (slice of paraboloid).

Kernel Embedding. Polynomial Embedding, Toy Example 3: Donut, FLD. Robust against overfitting.

Kernel Embedding. Polynomial Embedding, Toy Example 3: Donut, GLR, original data. Best with no embedding?

Kernel Embedding. Polynomial Embedding, Toy Example 3: Donut, GLR.

Kernel Embedding. Polynomial Embedding, Toy Example 3: Donut, GLR.

Kernel Embedding. Polynomial Embedding, Toy Example 3: Donut, GLR. Overfitting gives a square shape?

Kernel Embedding. Drawbacks to polynomial embedding: too many extra terms create spurious structure, i.e. “overfitting”; too much flexibility; HDLSS problems typically get worse.

Kernel Embedding. Important variation: “kernel machines”. Idea: replace polynomials by other nonlinear functions, e.g. (1) sigmoid functions from neural nets, (2) radial basis functions (Gaussian kernels). Related to “kernel density estimation” (studied more later).

Kernel Embedding. Radial basis functions. Note: there are several ways to embed: naïve embedding (equally spaced grid), explicit embedding (evaluate at the data), implicit embedding (inner product based). (Everybody currently does the latter.)

Kernel Embedding. Naïve embedding with radial basis functions: at some “grid points” $g_1, \cdots, g_k$, for a “bandwidth” (i.e. standard deviation) $\sigma$, consider the ($d$-dimensional) functions $\varphi_\sigma(x - g_1), \cdots, \varphi_\sigma(x - g_k)$, where $\varphi_\sigma$ is the $N_d(0, \sigma^2 I)$ probability density.

Kernel Embedding. Naïve embedding with radial basis functions: at grid points $g_1, \cdots, g_k$ and bandwidth $\sigma$, consider the functions $\varphi_\sigma(x - g_1), \cdots, \varphi_\sigma(x - g_k)$ and replace the data matrix with $\begin{pmatrix} \varphi_\sigma(X_1 - g_1) & \cdots & \varphi_\sigma(X_n - g_1) \\ \vdots & \ddots & \vdots \\ \varphi_\sigma(X_1 - g_k) & \cdots & \varphi_\sigma(X_n - g_k) \end{pmatrix}$.

Kernel Embedding. Naïve embedding with radial basis functions: for discrimination, work in the radial basis space, with a new data vector $X_0$ represented by $\big(\varphi_\sigma(X_0 - g_1), \ldots, \varphi_\sigma(X_0 - g_k)\big)^t$.
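A small numpy/scipy sketch of this naïve radial basis embedding; the function name and argument layout are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def naive_rbf_embedding(X, grid, sigma):
    """Replace each data vector X_i by its vector of radial basis values at
    fixed grid points: entry (j, i) is phi_sigma(X_i - g_j), where phi_sigma
    is the N_d(0, sigma^2 I) density.

    X: (n, d) data (rows are cases); grid: (k, d) grid points.
    Returns the k x n embedded data matrix from the slide."""
    phi = multivariate_normal(mean=np.zeros(X.shape[1]), cov=sigma ** 2)
    return np.array([[phi.pdf(x - g) for x in X] for g in grid])

# A new data vector X0 (shape (d,)) is represented the same way:
# naive_rbf_embedding(X0[None, :], grid, sigma) gives the k x 1 column.
```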

Kernel Embedding. Naïve Embedding, Toy Example 1: Parallel Clouds. Good at the data, poor outside.

Kernel Embedding. Naïve Embedding, Toy Example 2: Split X. OK at the data, strange outside.

Kernel Embedding. Naïve Embedding, Toy Example 3: Donut. Mostly good; a slight mistake for one kernel.

Kernel Embedding. Naïve Embedding, Radial basis functions, toy examples. Main lessons: generally good in regions with data, unpredictable where data are sparse.

Kernel Embedding. Toy Example 4: Checkerboard. Very challenging! Linear method? Polynomial embedding?

Kernel Embedding. Toy Example 4: Checkerboard, FLD. Linear is hopeless.

Kernel Embedding. Toy Example 4: Checkerboard, FLD.

Kernel Embedding. Toy Example 4: Checkerboard, FLD.

Kernel Embedding. Toy Example 4: Checkerboard, FLD.

Kernel Embedding. Toy Example 4: Checkerboard, FLD.

Kernel Embedding. Toy Example 4: Checkerboard, FLD. Embedding gets better, but still not great.

Kernel Embedding. Toy Example 4: Checkerboard. Polynomials don't have the needed flexibility.

Kernel Embedding. Toy Example 4: Checkerboard. Radial basis embedding + FLD is excellent!

Kernel Embedding. Drawbacks to naïve embedding: an equally spaced grid is too big in high dimension $d$ (with $g$ grid points per axis there are $g^d$ in total), so it is not computationally tractable. Approach: evaluate only at the data points, not on a full grid, but where the data live.

Kernel Embedding Other types of embedding: Explicit Implicit Will be studied soon, after introduction to Support Vector Machines…

Kernel Embedding. There exist generalizations of this idea to other types of analysis, and some clever computational ideas, e.g. “kernel based, nonlinear Principal Components Analysis”. Ref: Schölkopf et al (1998).

Kernel Embedding Important Variation: “Generalized Principal Components Analysis” Ref: Vidal et al (2016)

Support Vector Machines. Motivation: find a linear method that “works well” for embedded data. Note: embedded data are very non-Gaussian, so the classical statistics approach of “use the probability distribution” looks hopeless.

Support Vector Machines Motivation: Find a linear method that “works well” for embedded data Note: Embedded data are very non-Gaussian Suggests value of really new approach

Support Vector Machines Classical References: Vapnik (1982) Boser, Guyon & Vapnik (1992) Vapnik (1995)

Support Vector Machines Recommended tutorial: Burges (1998) Early Monographs: Cristianini & Shawe-Taylor (2000) Schölkopf & Smola (2002) More Recent Monographs: Hastie et al (2005) Bishop (2006)

Support Vector Machines Graphical View, using Toy Example:

Support Vector Machines Graphical View, using Toy Example: Find separating plane To maximize distances from data to plane

Support Vector Machines Graphical View, using Toy Example:

Support Vector Machines Graphical View, using Toy Example:

Support Vector Machines Graphical View, using Toy Example: Find separating plane To maximize distances from data to plane In particular smallest distance

Support Vector Machines Graphical View, using Toy Example:

Support Vector Machines. Graphical view, using the toy example: find a separating plane to maximize the distances from the data to the plane, in particular the smallest distance. The closest data points are called support vectors; the gap between them is called the margin. Caution: for some authors, “margin” is defined differently.

SVMs, Optimization Viewpoint. Formulate an optimization problem based on: data (feature) vectors $x_1, \cdots, x_n$; class labels $y_i = \pm 1$; normal vector $w$; location (determines the intercept) $b$; residuals (right side) $r_i = y_i (x_i^t w + b)$; residuals (wrong side) $\xi_i = -r_i$. Solve (a convex problem) by quadratic programming.

SVMs, Optimization Viewpoint. Lagrange multipliers, primal formulation (separable case): minimize $L_P(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right]$, where $\alpha_1, \cdots, \alpha_n$ are Lagrange multipliers. Notation: the “dot product” is the inner product, written elsewhere as $u \cdot v = \langle u, v \rangle = u^t v$.

SVMs, Optimization Viewpoint. Lagrange multipliers, primal formulation (separable case): minimize $L_P(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right]$, where $\alpha_1, \cdots, \alpha_n$ are Lagrange multipliers. Dual Lagrangian version: maximize $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$. Get the classification function $f(x) = \sum_{i=1}^n \alpha_i y_i (x \cdot x_i) + b$.

SVMs, Optimization Viewpoint. Get the classification function $f(x) = \sum_{i=1}^n \alpha_i y_i (x \cdot x_i) + b$. Choose class +1 when $f(x) > 0$ and class -1 when $f(x) < 0$. Note: this is a linear function of $x$, i.e. we have found a separating hyperplane.
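A quick check of this formula with scikit-learn (assumed available): `SVC` stores $\alpha_i y_i$ for the support vectors in `dual_coef_`, so the classification function can be rebuilt by hand and compared with `decision_function`. The toy data here are my own, not the slide's.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(20, 2)), rng.normal(2, 1, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

svm = SVC(kernel="linear", C=10.0).fit(X, y)

# f(x) = sum_i alpha_i y_i (x . x_i) + b, with the sum over support vectors
# (alpha_i = 0 for all other data points).
alpha_y = svm.dual_coef_[0]                  # alpha_i * y_i for the support vectors
x0 = np.array([0.5, -0.3])
f_manual = alpha_y @ (svm.support_vectors_ @ x0) + svm.intercept_[0]
f_sklearn = svm.decision_function([x0])[0]
print(np.isclose(f_manual, f_sklearn))       # True: choose class +1 when f(x) > 0
```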

SVMs, Computation. Major computational point: the classifier depends on the data only through inner products! $f(x) = \sum_{i=1}^n \alpha_i y_i (x \cdot x_i) + b$

SVMs, Computation. Major computational point: the classifier depends on the data only through inner products! Thus it is enough to store only the inner products. This creates big savings in the optimization, especially for HDLSS data, but it also creates variations in kernel embedding (interpretation?!?). This is almost always done in practice.
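One practical consequence, sketched with scikit-learn's precomputed-kernel interface: the optimizer never needs the raw $d$-dimensional vectors, only the $n \times n$ matrix of inner products. The toy sizes below are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, size=(30, 50)), rng.normal(1, 1, size=(30, 50))])
y = np.array([-1] * 30 + [+1] * 30)

gram_train = X @ X.T                         # only inner products are stored and used
svm = SVC(kernel="precomputed", C=1.0).fit(gram_train, y)

# New points are classified via their inner products with the training data.
X_new = rng.normal(size=(5, 50))
print(svm.predict(X_new @ X.T))
```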

SVMs, Comput’n & Embedding. For an “embedding map” $\Phi(x)$, e.g. $\Phi(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}$, Explicit Embedding: maximize $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \Phi(x_i) \cdot \Phi(x_j)$. Get the classification function $f(x) = \sum_{i=1}^n \alpha_i y_i\, \Phi(x) \cdot \Phi(x_i) + b$. A straightforward application of embedding, but it loses the inner product advantage.

SVMs, Comput’n & Embedding. Implicit Embedding: maximize $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \Phi(x_i \cdot x_j)$.

SVMs, Comput’n & Embedding. Implicit Embedding: maximize $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \Phi(x_i \cdot x_j)$. Note the difference from Explicit Embedding: $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \Phi(x_i) \cdot \Phi(x_j)$.

SVMs, Comput’n & Embedding. Implicit Embedding: maximize $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \Phi(x_i \cdot x_j)$. Get the classification function $f(x) = \sum_{i=1}^n \alpha_i y_i\, \Phi(x \cdot x_i) + b$. Still defined only via inner products, so it retains the optimization advantage and is thus used very commonly. Comparison to explicit embedding? Which is “better”???
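To make the explicit/implicit distinction concrete, here is a hedged sketch that builds both Gram matrices for a Gaussian bump $\varphi_\sigma$, following the slide's formulas literally ($\Phi(x_i) \cdot \Phi(x_j)$ vs. $\varphi_\sigma$ applied to $x_i \cdot x_j$). The choice of $\varphi$ and all sizes are illustrative; either matrix can be handed to `SVC(kernel="precomputed")` as in the previous sketch, with the caveat that the "implicit" matrix as written need not be a positive semidefinite kernel.

```python
import numpy as np

def gaussian_phi(t, sigma):
    """Gaussian bump used as the nonlinear function phi_sigma."""
    return np.exp(-0.5 * (np.asarray(t) / sigma) ** 2)

def explicit_gram(X, sigma):
    # Explicit embedding: form feature vectors Phi(x_i) by evaluating the radial
    # basis functions at the data points, then take ordinary inner products
    # Phi(x_i) . Phi(x_j).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    Phi = gaussian_phi(dists, sigma)      # row i is Phi(x_i)
    return Phi @ Phi.T

def implicit_gram(X, sigma):
    # Implicit embedding, as written on the slide: apply phi_sigma directly to
    # the inner products x_i . x_j; no feature vectors are ever formed.
    return gaussian_phi(X @ X.T, sigma)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
print(np.allclose(explicit_gram(X, 1.0), implicit_gram(X, 1.0)))   # False: different similarities
```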

Support Vector Machines Target Toy Data set:

Support Vector Machines Explicit Embedding, window σ = 0.1: Gaussian Kernel, i.e. Radial Basis Function

Support Vector Machines Explicit Embedding, window σ = 1: Pretty Big Change (Factor of 10)

Support Vector Machines Explicit Embedding, window σ = 10: Not Quite As Good ???

Support Vector Machines Explicit Embedding, window σ = 100: Note: Lost Center (Over- Smoothed)

Support Vector Machines. Notes on Explicit Embedding: too small a window gives poor generalizability; too big a window misses important regions. These are classical lessons from kernel smoothing. There is a surprisingly large “reasonable region”, i.e. the parameter is less critical (sometimes?). Will study later.
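These bandwidth lessons can be reproduced with a quick scikit-learn sweep of a standard RBF-kernel SVM (the usual kernel-trick route, not the explicit embedding shown in the figures); as a stand-in for the target-style toy data I use `make_circles` (my assumption), and the window $\sigma$ is mapped to the kernel's `gamma` $= 1/(2\sigma^2)$.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A rough stand-in for the target toy data: one class inside, one outside.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for sigma in [0.1, 1.0, 10.0, 100.0]:
    svm = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2)).fit(X_tr, y_tr)
    print(f"sigma = {sigma:6.1f}: train acc {svm.score(X_tr, y_tr):.2f}, "
          f"test acc {svm.score(X_te, y_te):.2f}")
```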

Support Vector Machines Interesting Alternative Viewpoint: Study Projections In Kernel Space (Never done in Machine Learning World)

Support Vector Machines Kernel space projection, window σ = 0.1: Note: Data Piling At Margin Will become an issue soon

Support Vector Machines Kernel space projection, window σ = 1: Excellent Separation (but less than σ = 0.1)

Support Vector Machines Kernel space projection, window σ = 10: Still Good (But Some Overlap)

Support Vector Machines Kernel space projection, window σ = 100: Some Reds On Wrong Side (Missed Center)

Support Vector Machines. Notes on the kernel space projection: too small a $\sigma$ gives great separation but, recall, poor generalizability; too big a $\sigma$ and the classes are no longer separable. As above: classical lessons from kernel smoothing, and a surprisingly large “reasonable region”, i.e. the parameter is less critical (sometimes?).

Support Vector Machines Implicit Embedding, window σ = 0.1:

Support Vector Machines Implicit Embedding, window σ = 0.5:

Support Vector Machines Implicit Embedding, window σ = 1:

Support Vector Machines Implicit Embedding, window σ = 10:

Support Vector Machines. Notes on Implicit Embedding: similar large vs. small lessons. The range of “reasonable results” seems to be smaller (a different range). Caution about “relative scales”: for $\Phi(x) \leftrightarrow \Phi(x^t x)$, rescale by $\sigma \leftrightarrow \sigma^2$, i.e. $10 \leftrightarrow 100$. Much different “edge” behavior.

SVMs & Robustness Usually not severely affected by outliers, But a possible weakness: Can have very influential points Toy E.g., only 2 points drive SVM

SVMs & Robustness. Can have very influential points.

SVMs & Robustness. Usually not severely affected by outliers, but a possible weakness: can have very influential points. Toy example: only 2 points drive the SVM. Notes: a huge range of chosen hyperplanes, but all are “pretty good discriminators”. Only happens when the whole range is OK??? Good or bad?

SVMs & Robustness Effect of violators:

SVMs, Tuning Parameter. Recall the regularization parameter C: it controls the penalty for violation, i.e. lying on the wrong side of the plane; it appears in the slack variables; and it affects the performance of the SVM. Toy example: d = 50, spherical Gaussian data.

SVMs, Tuning Parameter Toy Example: d = 50, Sph’l Gaussian data

SVMs, Tuning Parameter. Toy example: d = 50, spherical Gaussian data. X-axis: optimal direction; other axis: SVM direction. Small C: where is the margin? Small angle to optimal (generalizable). Large C: more data piling, larger angle (less generalizable), bigger gap (but maybe not better???). In between: a very small transition range.
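A rough numerical companion to this picture, with my own toy setup approximating the slide's (spherical Gaussians in d = 50 with a mean shift along the first coordinate, so the optimal direction is known): sweep C and watch the angle between the linear-SVM normal vector and the optimal direction.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
d, n = 50, 20
shift = np.zeros(d)
shift[0] = 2.0                             # mean shift along the first coordinate only
X = np.vstack([rng.normal(size=(n, d)) + shift,
               rng.normal(size=(n, d)) - shift])
y = np.array([+1] * n + [-1] * n)
optimal = np.zeros(d)
optimal[0] = 1.0                           # known optimal direction for this toy setup

for C in [1e-3, 1e-2, 1e-1, 1, 10, 100]:
    w = SVC(kernel="linear", C=C).fit(X, y).coef_[0]
    cos_angle = min(1.0, abs(w @ optimal) / np.linalg.norm(w))
    print(f"C = {C:g}: angle to optimal = {np.degrees(np.arccos(cos_angle)):.1f} degrees")
```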

SVMs, Tuning Parameter Toy Example: d = 50, Sph’l Gaussian data Put MD on horizontal axis

SVMs, Tuning Parameter. Toy example: d = 50, spherical Gaussian data. A careful look at small C, with MD on the horizontal axis, shows that SVM and MD are the same for small C. The SVM separates better for large C, but at the cost of data piling (less generalizable). Mathematics behind this: Carmichael & Marron (2017).

SVMs, Tuning Parameter. Toy example: d = 50, spherical Gaussian data. Strange behavior in both views: stable over a range of small C; big changes over a narrow range of C; stable over a range of large C. Mathematics behind this: Carmichael & Marron (2017).

Support Vector Machines Important Extension: Multi-Class SVMs Hsu & Lin (2002) Lee, Lin, & Wahba (2002) Defined for “implicit” version “Direction Based” variation???

Distance Weighted Discrim’n. An improvement of SVM for HDLSS data. Toy example: $d = 50$ independent $N(0,1)$ coordinates, first-coordinate mean $\mu_1 = \pm 2$, $n_+ = n_- = 20$ (similar to the earlier movie).

Distance Weighted Discrim’n Toy e.g.: Maximal Data Piling Direction - Perfect Separation - Gross Overfitting - Large Angle - Poor Gen’ability MDP

Distance Weighted Discrim’n Toy e.g.: Support Vector Machine Direction - Bigger Gap - Smaller Angle - Better Gen’ability - Feels support vectors too strongly??? - Ugly subpops? - Improvement?

Distance Weighted Discrim’n Toy e.g.: Distance Weighted Discrimination - Addresses these issues - Smaller Angle - Better Gen’ability - Nice subpops - Replaces min dist. by avg. dist.

Distance Weighted Discrim’n Based on Optimization Problem: For “Residuals”:

Distance Weighted Discrim’n Based on Optimization Problem: Uses “poles” to push plane away from data

Distance Weighted Discrim’n. Based on an optimization problem; more precisely, work in an appropriate penalty for violations. Optimization method: Second Order Cone Programming, a “still convex” generalization of quadratic programming. Allows a fast greedy solution, and can use available fast software (SDPT3, Michael Todd, et al).
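A rough convex-programming sketch of a DWD-style criterion (sum of reciprocal residuals plus a penalty on violations), written with `cvxpy` rather than the SDPT3 software cited on the slide; the penalty constant and variable names are my own, so treat this as an illustration of the second-order-cone-representable formulation, not the reference implementation.

```python
import numpy as np
import cvxpy as cp

def dwd_sketch(X, y, C=100.0):
    """DWD-style fit: minimize sum_i 1/r_i + C * sum_i xi_i subject to
    r_i = y_i (x_i . w + b) + xi_i, xi_i >= 0, ||w||_2 <= 1.
    (cp.inv_pos implicitly keeps r_i > 0; the problem is convex.)"""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    r, xi = cp.Variable(n), cp.Variable(n)
    constraints = [r == cp.multiply(y, X @ w + b) + xi,
                   xi >= 0,
                   cp.norm(w, 2) <= 1]
    prob = cp.Problem(cp.Minimize(cp.sum(cp.inv_pos(r)) + C * cp.sum(xi)),
                      constraints)
    prob.solve()
    return w.value, b.value

# Usage sketch: w_hat, b_hat = dwd_sketch(X, y); classify by sign(x @ w_hat + b_hat).
```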

Distance Weighted Discrim’n. References for more on DWD: main paper: Marron, Todd and Ahn (2007); links to more papers: Ahn (2007); R implementation of DWD: CRAN (2011); SDPT3 software: Toh et al (1999); Sparse DWD: Wang & Zou (2016).

Distance Weighted Discrim’n 2-d Visualization: Pushes Plane Away From Data All Points Have Some Influence (not just support vectors)

Support Vector Machines Graphical View, using Toy Example:

Support Vector Machines Graphical View, using Toy Example:

Distance Weighted Discrim’n Graphical View, using Toy Example:

DWD in Face Recognition. Face images as data (Benito et al 2017). Male vs. female difference? Discrimination rule? Each image is represented as a long vector of pixel gray levels; registration is critical.

DWD in Face Recognition (cont.). Registered data: shifts and scale manually chosen to align eyes and mouth. Still large variation. See males vs. females???

DWD in Face Recognition (cont.). DWD direction: good separation, and the images “make sense”, but garbage at the ends? (extrapolation effects?)

DWD in Face Recognition (cont.). Unregistered version: much blurrier, since features don't properly line up (nonlinear variation). But DWD still works. Can see the male vs. female difference?