Introduction to Support Vector Machines (SVM)
Presentation transcript:

Kernel Technique Based on Mercer's Condition (1909)
The value of a kernel function represents the inner product of two training points in feature space. Kernel functions merge two steps:
1. Map the input data from the input space to a feature space (which might be infinite-dimensional).
2. Compute the inner product in that feature space.
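For reference, the identity behind this (a standard fact; φ denotes the feature map and K the kernel, since the slide's own symbols are not preserved):

    K(x, z) = ⟨φ(x), φ(z)⟩ = φ(x)^T φ(z)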

A Simple Example of a Kernel
Polynomial kernel of degree 2: K(x, z) = (x^T z)^2. Let x, z ∈ R^2 and define the nonlinear map φ: R^2 → R^3 by φ(x) = (x_1^2, x_2^2, √2 x_1 x_2). Then φ(x)^T φ(z) = (x^T z)^2 = K(x, z).
There are many other nonlinear maps ψ that satisfy the same relation ψ(x)^T ψ(z) = K(x, z).
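A minimal numerical check of this identity (a sketch, not from the slides; NumPy is assumed):

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x^T z)^2."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both quantities are equal: the kernel computes the inner product in feature space.
print(poly2_kernel(x, z))             # (1*3 + 2*(-1))^2 = 1.0
print(float(np.dot(phi(x), phi(z))))  # 1.0
```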

Power of the Kernel Technique
Consider a nonlinear map φ: R^n → R^p whose features are all the distinct monomials of degree d, so p = C(n + d − 1, d); this dimension blows up quickly as n and d grow. Is it necessary to compute φ(x) explicitly? No: we only need to know the kernel value K(x, z) = φ(x)^T φ(z), and this can be achieved by working with the kernel directly.
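A quick illustration of how fast that dimension grows (a sketch; the (n, d) pairs below are my own examples, not from the slides):

```python
from math import comb

# Number of distinct degree-d monomials in n variables: C(n + d - 1, d).
for n, d in [(10, 2), (100, 3), (100, 5)]:
    print(n, d, comb(n + d - 1, d))
# 10  2  55
# 100 3  171700
# 100 5  91962520
```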

More Examples of Kernels
- Polynomial kernel: K(x, z) = (x^T z + 1)^d, where d is a positive integer (d = 1 gives the linear kernel).
- Gaussian (radial basis) kernel: K(x, z) = exp(−γ ||x − z||^2), γ > 0.
The (i, j)-entry of the kernel matrix represents the "similarity" of data points i and j.
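A small sketch of these kernels and of the full kernel (Gram) matrix; the function names and the γ value are illustrative, not from the slides:

```python
import numpy as np

def polynomial_kernel(X, Z, degree=2):
    """K(x, z) = (x^T z + 1)^d; degree=1 gives the linear kernel."""
    return (X @ Z.T + 1.0) ** degree

def gaussian_kernel(X, Z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Z ** 2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-gamma * sq_dists)

A = np.random.randn(5, 3)      # 5 training points in R^3
K = gaussian_kernel(A, A)      # 5 x 5 kernel matrix
# K[i, j] measures the "similarity" of points A[i] and A[j]; K[i, i] is (approximately) 1.
```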

Nonlinear SVM Motivation
- Linear SVM with linear separator x^T w = γ: maximize the margin by solving a quadratic program (QP).
- By QP "duality", the weight vector w can be expressed through the dual variables, and maximizing the margin in the "dual space" gives an equivalent minimization problem stated entirely in terms of inner products of the training points.
- The dual SSVM, with its separator, is another constrained minimization that touches the data only through those inner products, so it is ready for the kernel substitution on the next slide.
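For reference, the textbook soft-margin linear SVM primal and dual (the standard formulation; the slides use the SSVM variant, whose exact objective is not preserved in this transcript):

    Primal:  min_{w,b,ξ}  (1/2)||w||^2 + C Σ_i ξ_i
             s.t.  y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

    Dual:    max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
             s.t.  Σ_i α_i y_i = 0,  0 ≤ α_i ≤ C

Because the dual touches the data only through the inner products x_i^T x_j, replacing them with a kernel K(x_i, x_j) gives the nonlinear SVM.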

Nonlinear Smooth SVM
- Replace the inner products by a nonlinear kernel K(·,·) in the dual-space formulation.
- Use the Newton-Armijo algorithm to solve the problem.
- Each iteration solves m + 1 linear equations in m + 1 variables.
- The nonlinear classifier depends on the entire dataset. Nonlinear classifier: f(x) = sign( Σ_{i=1}^{m} u_i d_i K(x, A_i) − γ ), where A_i is the i-th training point and d_i its ±1 label.
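A sketch of evaluating such a classifier (illustrative only; u, d and gamma would come from the training optimization — here they are random placeholders — and the Gaussian kernel and its parameter are assumptions):

```python
import numpy as np

def gaussian_kernel(X, Z, gamma_k=0.5):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma_k * sq)

m, n = 500, 4
A = np.random.randn(m, n)          # all m training points
d = np.sign(np.random.randn(m))    # their +/-1 labels
u = np.random.randn(m)             # dual-type coefficients (placeholder values)
gamma = 0.1                        # threshold (placeholder value)

def classify(x):
    # Needs every training point A[i]: the sum runs over the entire dataset.
    k = gaussian_kernel(x[None, :], A).ravel()   # kernel values K(x, A_i)
    return np.sign(np.dot(u * d, k) - gamma)

print(classify(np.random.randn(n)))
```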

Difficulties with Nonlinear SVM for Large Problems
- The nonlinear kernel matrix is fully dense.
- Computational complexity depends on the number of examples m.
- The separating surface depends on almost the entire dataset.
- Complexity of the nonlinear SVM:
  - Runs out of memory while storing the m × m kernel matrix.
  - Long CPU time to compute the dense kernel matrix.
  - Need to generate and store m^2 entries.
  - Need to store the entire dataset even after solving the problem.

Solving the SVM with a Massive Dataset
- These difficulties limit the SVM to datasets of a few thousand points.
- Solution I: SMO (Sequential Minimal Optimization)
  - Standard optimization techniques require that the data be held in memory.
  - SMO instead solves the sub-optimization problem defined by a working set of size 2.
  - The objective function is increased iteratively.
- Solution II: RSVM (Reduced Support Vector Machine)

Reduced Support Vector Machine
(i) Choose a random subset matrix Ā (m̄ rows, m̄ ≪ m) of the entire data matrix A (m rows).
(ii) Solve, by Newton's method, the problem obtained by replacing the full m × m kernel with the rectangular m × m̄ reduced kernel built from A and Ā.
(iii) The nonlinear classifier is defined by the optimal solution of step (ii). Nonlinear classifier: f(x) = sign( Σ_j ū_j K(x, Ā_j) − γ ), a sum over the reduced-set points only.
Note: using only the small square kernel built from Ā alone, i.e., training on just the reduced subset, gives lousy results!
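A minimal sketch of the reduced-kernel idea (illustrative only; the actual RSVM objective and Newton solver from the slides are not reproduced here, and the kernel choice and sizes are my own):

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=0.5):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

m, n, m_bar = 10000, 20, 100
A = np.random.randn(m, n)                       # full data matrix, m points
idx = np.random.choice(m, size=m_bar, replace=False)
A_bar = A[idx]                                  # random reduced set, m_bar << m

# The full m x m kernel would have 10**8 entries -- too big to form.
K_reduced = gaussian_kernel(A, A_bar)           # m x m_bar rectangular reduced kernel
# RSVM fits only m_bar + 1 parameters, but uses all m rows of K_reduced,
# so every training point still constrains the classifier.
```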

A Nonlinear Kernel Application
Checkerboard training set: 1,000 points in R^2. Separate 486 asterisks from 514 dots.

Conventional SVM Result on Checkerboard Using 50 Randomly Selected Points Out of 1000

RSVM Result on Checkerboard Using SAME 50 Random Points Out of 1000

RSVM on Moderate Sized Problems (Best Test Set Correctness %, CPU seconds)
Datasets (points × features): Cleveland Heart 297 × 13, BUPA Liver 345 × 6, Ionosphere 351 × 34, Pima Indians 768 × 8, Tic-Tac-Toe 958 × 9, Mushroom 8124 × 22 (one entry N/A).
[Correctness and timing figures were given in a table that is not preserved in this transcript.]

RSVM on the Large UCI Adult Dataset
Average test set correctness % and standard deviation over 50 runs, for (training size, testing size) splits: (6414, 26148), (11221, 21341), (16101, 16461), (22697, 9865), (32562, 16282).
[Numerical results were given in a table that is not preserved in this transcript.]

The Reduced Set Plays the Most Important Role in RSVM
It is natural to raise two questions:
- Is there a way to choose the reduced set, other than random selection, so that RSVM will have better performance?
- Is there a mechanism to determine the size of the reduced set automatically or dynamically?

Reduced Set Selection According to the Data Scatter in Input Space
- Choose the reduced set randomly, but keep only the points that are more than a certain minimal distance apart from one another.
- These points are expected to be a representative sample of the data.

Data Scatter in Input Space Is NOT Good Enough
An example is given as follows: training data analogous to the XOR problem.

Mapping to Feature Space
- Map the input data via a nonlinear mapping φ.
- This mapping is equivalent to the polynomial kernel of degree 2: φ(x)^T φ(z) = K(x, z).

Data Points in the Feature Space

The Polynomial Kernel Matrix

Experiment Result

Express the Classifier as a Linear Combination of Kernel Functions
- In SSVM, the nonlinear separating surface is a linear combination of the kernel functions centered at all m training points.
- In RSVM, the nonlinear separating surface is a linear combination of the kernel functions centered at the m̄ reduced-set points only.
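Written out (consistent with the classifier forms used earlier; u, ū and γ denote the respective optimal solutions):

    SSVM:  Σ_{i=1}^{m}  u_i K(x, A_i)  = γ    (sum over all m training points A_i)
    RSVM:  Σ_{j=1}^{m̄} ū_j K(x, Ā_j) = γ    (sum over only the m̄ reduced-set points Ā_j)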

Motivation of IRSVM: The Strength of Weak Ties
- "The Strength of Weak Ties," Mark S. Granovetter, The American Journal of Sociology, Vol. 78, May 1973.
- If the kernel functions in the reduced set are very similar, the space spanned by these kernel functions will be very limited.

Incremental Reduced SVMs
- Start with a very small reduced set, then add a new data point only when its kernel vector is dissimilar to the current set of kernel functions.
- Such a point contributes the most extra information for generating the separating surface.
- Repeat until several successive points cannot be added.

How to Measure the Dissimilarity?
Add a point into the reduced set if the distance from its kernel vector to the column space of the current reduced kernel matrix is greater than a threshold.
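In symbols (notation mine, since the slide's formulas are not preserved; K̃ is the current reduced kernel matrix, k the kernel vector of the candidate point, and δ the threshold):

    dist(k, K̃) = min over β of ||K̃ β − k||,   add the candidate point whenever dist(k, K̃) > δ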

Solving Least Squares Problems
- This distance can be determined by solving a least squares problem (LSP): min over β of ||K̃ β − k||^2.
- The LSP has a unique solution if and only if K̃ has full column rank, in which case β* = (K̃^T K̃)^{-1} K̃^T k.
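A minimal NumPy sketch of this distance computation (variable names and sizes are illustrative):

```python
import numpy as np

def distance_to_column_space(K_tilde, k):
    """Distance from kernel vector k to the column space of K_tilde,
    via the least squares problem min_beta ||K_tilde @ beta - k||."""
    beta, *_ = np.linalg.lstsq(K_tilde, k, rcond=None)
    return np.linalg.norm(K_tilde @ beta - k)

K_tilde = np.random.randn(50, 3)   # reduced kernel matrix (50 rows, 3 reduced points)
k = np.random.randn(50)            # kernel vector of a candidate point
print(distance_to_column_space(K_tilde, k))
```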

IRSVM Algorithm Pseudo-code (Sequential Version)
1  Randomly choose two data points from the training data as the initial reduced set
2  Compute the reduced kernel matrix
3  For each data point not in the reduced set
4      Compute its kernel vector
5      Compute the distance from the kernel vector
6          to the column space of the current reduced kernel matrix
7      If its distance exceeds a certain threshold
8          Add this point to the reduced set and form the new reduced kernel matrix
9  Until several successive failures happen in line 7
10 Solve the QP problem of the nonlinear SVM with the obtained reduced kernel
11 A new data point is classified by the separating surface
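A Python sketch of the incremental reduced-set selection loop (my own illustrative code under the same assumptions as the earlier snippets — a Gaussian kernel, a fixed threshold, and a simple failure counter; the final SVM solve in line 10 is not implemented here):

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=0.5):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

def select_reduced_set(A, threshold=0.1, max_failures=50, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(A.shape[0])
    reduced = list(order[:2])                      # line 1: start with two random points
    K_tilde = gaussian_kernel(A, A[reduced])       # line 2: reduced kernel matrix
    failures = 0
    for i in order[2:]:                            # line 3: scan remaining points
        k = gaussian_kernel(A, A[[i]]).ravel()     # line 4: kernel vector of candidate
        beta, *_ = np.linalg.lstsq(K_tilde, k, rcond=None)
        dist = np.linalg.norm(K_tilde @ beta - k)  # lines 5-6: distance to column space
        if dist > threshold:                       # line 7
            reduced.append(i)                      # line 8: grow the reduced set
            K_tilde = gaussian_kernel(A, A[reduced])
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:           # line 9: stop after successive failures
                break
    return reduced, K_tilde                        # lines 10-11 would train and classify with these
```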

Speeding Up IRSVM
- We have to solve the LSP many times, and this dominates the complexity.
- The main cost of each LSP depends on the current reduced kernel matrix, but not on the number of candidate kernel vectors on the right-hand side.
- Taking advantage of this, we examine a batch of data points at the same time.

IRSVM Algorithm Pseudo-code (Batch Version)
1  Randomly choose two data points from the training data as the initial reduced set
2  Compute the reduced kernel matrix
3  For each batch of data points not in the reduced set
4      Compute their kernel vectors
5      Compute the corresponding distances from these kernel vectors
6          to the column space of the current reduced kernel matrix
7      For the points whose distance exceeds a certain threshold
8          Add those points to the reduced set and form the new reduced kernel matrix
9  Until no data points in a batch were added in lines 7-8
10 Solve the QP problem of the nonlinear SVM with the obtained reduced kernel
11 A new data point is classified by the separating surface
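The batch distances can be obtained from a single least-squares solve with multiple right-hand sides, which is where the speedup comes from (an illustrative sketch, same assumptions as above):

```python
import numpy as np

def batch_distances(K_tilde, K_batch):
    """Distances from each column of K_batch (kernel vectors of a batch of
    candidate points) to the column space of K_tilde, via one lstsq call."""
    Beta, *_ = np.linalg.lstsq(K_tilde, K_batch, rcond=None)   # all betas at once
    residuals = K_tilde @ Beta - K_batch
    return np.linalg.norm(residuals, axis=0)                   # one distance per column

K_tilde = np.random.randn(200, 5)     # current reduced kernel matrix
K_batch = np.random.randn(200, 32)    # kernel vectors of a batch of 32 candidates
print(batch_distances(K_tilde, K_batch).shape)                 # (32,)
```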

IRSVM on Four Public Datasets

IRSVM on UCI Adult datasets

Time comparison on Adult datasets

IRSVM 10 Runs Average on 6414 Points Adult Training Set