Semidefinite Programming Machines


Thore Graepel and Ralf Herbrich
Microsoft Research Cambridge

Overview
- Invariant Pattern Recognition
- Semidefinite Programming (SDP)
- From Support Vector Machines (SVMs) to Semidefinite Programming Machines (SDPMs)
- Experimental Illustration
- Future Work

Typical Invariances for Images
- Translation
- Shear
- Rotation

Toy Features for Handwritten Digits
[Figure: example digit images with toy feature values φ1 = 0.48, φ2 = 0.58, φ3 = 0.37]

Warning: Highly Non-Linear
[Figure: training data plotted in the (φ1, φ2) feature plane]

Motivation: Classification Learning
Can we learn with infinitely many examples?
[Figure: labelled training points plotted in the (f1(x), f2(x)) feature plane]

Motivation: Version Spaces
[Figure: version space induced by the original patterns vs. the transformed patterns]

Semidefinite Programs (SDPs)
- Linear objective function
- Positive semidefinite (psd) constraints
- Equivalent to infinitely many linear constraints
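In standard form (a sketch matching these bullets; the slide's own equation image is not in the transcript):
  \min_{x \in \mathbb{R}^n} c^\top x \quad \text{s.t.} \quad F(x) := F_0 + \sum_{j=1}^{n} x_j F_j \succeq 0
The single psd constraint F(x) \succeq 0 is equivalent to the infinitely many linear constraints u^\top F(x)\, u \ge 0 for all vectors u.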

SVM as a Quadratic Program
Given a sample ((x1, y1), ..., (xm, ym)), SVMs find the weight vector w that maximises the margin on the sample.
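In the standard hard-margin formulation (reconstructed here; the slide's equation image is missing from the transcript), this is the quadratic program
  \min_{w} \tfrac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i \langle w, x_i \rangle \ge 1, \quad i = 1, \dots, m
whose constraints are linear in w and whose objective is quadratic.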

SVM as a Semidefinite Program (I)
A (block-)diagonal matrix is psd if and only if all its blocks are psd. The m linear margin constraints can therefore be stacked on a diagonal: A_j := diag(g_{1,j}, ..., g_{i,j}, ..., g_{m,j}) and B := diag(1, ..., 1).
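Writing g_{i,j} := y_i x_{i,j} (our gloss, consistent with the quadratic program above), the m constraints collapse into a single linear matrix inequality:
  \sum_{j=1}^{n} w_j A_j - B \succeq 0 \quad\Longleftrightarrow\quad y_i \langle w, x_i \rangle \ge 1 \ \text{ for } i = 1, \dots, m
since a diagonal matrix is psd exactly when all its diagonal entries are non-negative.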

SVM as a Semidefinite Program (II)
- Transform the quadratic objective into a linear one
- Use Schur's complement lemma
- This adds a new (n+1)×(n+1) block to Aj and B
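A sketch of the standard trick (not verbatim from the slides): minimise an auxiliary scalar t instead of \|w\|^2, subject to
  \begin{pmatrix} t & w^\top \\ w & I_n \end{pmatrix} \succeq 0
which by Schur's complement lemma holds if and only if t \ge w^\top w. The objective is now linear in (t, w), at the price of one extra (n+1)×(n+1) psd block.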

Taylor Approximation of Invariance
Let T(x, θ) be an invariance transformation with parameter θ (e.g., the angle of rotation). Taylor expansion about θ = 0 gives a polynomial approximation to the trajectory.
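The r-th order expansion (our reconstruction of the missing formula) reads
  x(\theta) = T(x, \theta) \approx \sum_{k=0}^{r} \frac{\theta^k}{k!} \left.\frac{\partial^k T(x, \theta)}{\partial \theta^k}\right|_{\theta=0} =: \sum_{k=0}^{r} \theta^k x^{(k)}
with x^{(0)} = x, so the trajectory of a transformed pattern is approximated by a polynomial curve in input space.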

Extension to Polynomials
Consider a polynomial trajectory x(θ). A single training example (x^{(0)}, ..., x^{(r)}, y) then induces an infinite number of constraints.
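Explicitly (reconstructed from the notation above), the margin constraint must hold for every value of the transformation parameter:
  \forall \theta: \quad y \Big\langle w, \sum_{k=0}^{r} \theta^k x^{(k)} \Big\rangle \ge 1
i.e. the polynomial p(θ) := y⟨w, x(θ)⟩ - 1, whose coefficients are linear in w, must be non-negative everywhere.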

Non-Negative Polynomials (I)
Theorem (Nesterov, 2000): Let r = 2l and let v(θ) := (1, θ, ..., θ^l)ᵀ denote the vector of monomials. Then:
1. For every psd matrix P, the polynomial p(θ) = v(θ)ᵀ P v(θ) is non-negative everywhere.
2. For every non-negative polynomial p of degree r, there exists a psd matrix P such that p(θ) = v(θ)ᵀ P v(θ).
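The slide's example image is not in the transcript; a simple substitute (ours) for r = 2, l = 1, v(θ) = (1, θ)ᵀ:
  p(\theta) = (1 + \theta)^2 = \begin{pmatrix} 1 & \theta \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ \theta \end{pmatrix}
where the coefficient matrix has eigenvalues 0 and 2 and is therefore psd.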

Non-Negative Polynomials (II)
- (1) follows directly from the definition of psd.
- (2) follows from the sum-of-squares lemma.
- Note that (2) states mere existence: a polynomial of degree r has r+1 parameters, while the coefficient matrix P has (r+2)(r+4)/8 parameters. For r > 2, we have to introduce another r(r-2)/8 auxiliary variables to find P.
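The counts follow from the symmetry of P (a one-line check of the slide's numbers): with l = r/2, P is a symmetric (l+1)×(l+1) matrix, hence has
  \frac{(l+1)(l+2)}{2} = \frac{(r+2)(r+4)}{8}
free entries; matching the r+1 coefficients of p leaves (r+2)(r+4)/8 - (r+1) = r(r-2)/8 remaining degrees of freedom, which enter the SDP as auxiliary variables.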

Semidefinite Programming Machines
Extension of SVMs as a (non-trivial) SDP. Each trajectory (data point + transformation) is represented by an SDP constraint: the scalar entries g_{i,j} of the SVM construction are replaced by constraint matrices G_{i,j}, giving block-diagonal A_j := diag(G_{1,j}, ..., G_{i,j}, ..., G_{m,j}) and a corresponding block-diagonal B.

Example: Second-Order SDPMs
- 2nd order Taylor expansion
- Resulting polynomial in θ
- Set of constraint matrices
(see the sketch after this list)
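A sketch of the three steps in the notation above (the slide's equation images are missing): the expansion is x(θ) = x^{(0)} + θ x^{(1)} + θ² x^{(2)}, and with c_0 = y⟨w, x^{(0)}⟩ - 1, c_1 = y⟨w, x^{(1)}⟩, c_2 = y⟨w, x^{(2)}⟩ the resulting margin polynomial and its coefficient matrix are
  p(\theta) = c_0 + c_1 \theta + c_2 \theta^2, \qquad P(w) = \begin{pmatrix} c_0 & c_1/2 \\ c_1/2 & c_2 \end{pmatrix}
For degree 2 the correspondence is exact: p(θ) ≥ 0 for all θ if and only if P(w) ⪰ 0, and since P(w) is linear in w, each training example contributes one 2×2 psd block.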

Non-Negative on Segment
Given a polynomial p of degree 2l, consider the polynomial q (the slide's defining equation is missing from the transcript; one standard construction is sketched below). Note that q is a polynomial of degree 4l. If q is positive everywhere, then p is positive everywhere in [-τ, +τ].
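One standard construction with these properties (our assumption; the authors' exact formula may differ) maps the real line onto the segment by a rational substitution and clears denominators:
  q(\theta) := (1 + \theta^2)^{2l}\, p\!\left( \tau \frac{1 - \theta^2}{1 + \theta^2} \right)
Each coefficient c_k of p contributes a term of degree 2k + 2(2l - k) = 4l, so q has degree (at most) 4l; and since τ(1-θ²)/(1+θ²) sweeps out (-τ, τ] as θ ranges over the reals, q > 0 everywhere forces p > 0 on (-τ, τ], hence p ≥ 0 on all of [-τ, +τ] by continuity.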

Truly Virtual Support Vectors
Dual complementarity yields an expansion of the weight vector in which the truly virtual support vectors are linear combinations of the derivatives x^{(k)} (sketched below).
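A hedged reconstruction of the missing expansion (the coefficient names α_{i,k} are ours): the dual solution takes the form
  w = \sum_{i=1}^{m} y_i\, \tilde{x}_i, \qquad \tilde{x}_i = \sum_{k=0}^{r} \alpha_{i,k}\, x_i^{(k)}
Each \tilde{x}_i is "virtual" because it was never presented as an explicit training example, and "truly" so because it lies on the continuous transformation trajectory at some optimal parameter value θ_i*, rather than at a pre-generated grid point.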

Truly Virtual Support Vectors
[Figure: truly virtual support vectors in the two-feature plane for USPS digits "1" and "9"]

Visualisation: USPS "1" vs. "9"
[Figure: two-feature plane for USPS digits "1" vs. "9" with τ = 20°]

Results: Experimental Setup
- All 45 USPS classification tasks (1-v-1).
- 20 training images; 250 test images.
- Rotation is applied to all training images with τ = 10°.
- All results are averaged over 50 random training sets.
- Compared to SVM and virtual SVM.

Results: SDPM vs. SVM
[Figure: scatter plot of SDPM error vs. SVM error over all 45 one-vs-one USPS tasks.] m = 20, τ = 10°, images artificially rotated before training, averaged over 50 random training sets, test sets of size 250; all sets are balanced class-wise.

Results: SDPM vs. Virtual SVM
[Figure: scatter plot of SDPM error vs. VSVM error over all 45 one-vs-one USPS tasks.] Same setup: m = 20, τ = 10°, images artificially rotated before training, averaged over 50 random training sets, test sets of size 250; all sets are balanced class-wise.

Results: Curse of Dimensionality
[Figure: error comparison for 1 transformation parameter vs. 2 parameters]

Extensions & Future Work
- Multiple parameters θ1, θ2, ..., θD.
- (Efficient) adaptation to kernel space.
- Semidefinite Perceptrons (NIPS poster with A. Kharechko and J. Shawe-Taylor).
- Sparsification by efficiently finding the example x and transformation θ with maximal information (idea of Neil Lawrence).
- Expectation propagation for BPMs (idea of Tom Minka).

Conclusions & Future Work
- Learning from infinitely many examples.
- Truly virtual support vectors x_i(θ_i*).
- Multiple parameters θ1, θ2, ..., θD.
- (Efficient) adaptation to kernel space.
- Semidefinite Perceptrons (NIPS poster with A. Kharechko and J. Shawe-Taylor).