Reduced Support Vector Machine


Reduced Support Vector Machine Nonlinear Classifier: (i) Choose a random subset matrix Ā ∈ R^(m̄×n) of the entire data matrix A ∈ R^(m×n), with m̄ ≪ m. (ii) Solve the following problem by Newton's method: min over (ū, γ) of (ν/2)‖p(e − D(K(A, Ā′)ū − eγ), α)‖² + (1/2)(ū′ū + γ²). (iii) The nonlinear classifier is defined by the optimal solution (ū, γ) of step (ii): K(x′, Ā′)ū = γ. Using only the small square kernel K(Ā, Ā′) instead of the rectangular kernel K(A, Ā′) gives lousy results!
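Below is a minimal NumPy sketch (not the authors' implementation) of the data-handling pieces in steps (i) and (iii), assuming a Gaussian kernel; the Newton solver of step (ii) is omitted, and all names (gaussian_kernel, rsvm_decision, the sizes) are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=0.1):
    """K(A, B')[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 5))                    # full data matrix, m x n
idx = rng.choice(len(A), size=50, replace=False)  # step (i): random subset
A_bar = A[idx]                                    # reduced set, m_bar x n with m_bar << m

# Step (ii) would be solved over this m x m_bar rectangular kernel
# instead of the fully dense m x m kernel.
K_reduced = gaussian_kernel(A, A_bar)

# Step (iii): given the optimal (u_bar, gamma_off), classify x by sign(K(x', A_bar') u_bar - gamma_off).
def rsvm_decision(x, A_bar, u_bar, gamma_off, gamma=0.1):
    return np.sign(gaussian_kernel(x[None, :], A_bar, gamma) @ u_bar - gamma_off)
```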

Reduced Set: Plays the Most Important Role in RSVM It is natural to raise two questions: Is there a way to choose the reduced set, other than by random selection, so that RSVM will perform better? Is there a mechanism to determine the size of the reduced set automatically or dynamically?

Reduced Set Selection According to the Data Scatter in Input Space Choose the reduced set randomly, but keep only those points that are more than a certain minimal distance apart. These points are expected to be a representative sample.
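A hedged NumPy sketch of this heuristic: draw candidate points in random order and keep only those farther than a minimal distance from every point already kept (the function name and parameters are illustrative, not from the paper).

```python
import numpy as np

def select_scattered_subset(A, size, min_dist, rng=None):
    """Random reduced set whose points are pairwise more than min_dist apart."""
    rng = rng or np.random.default_rng()
    kept = []
    for i in rng.permutation(len(A)):
        if all(np.linalg.norm(A[i] - A[j]) > min_dist for j in kept):
            kept.append(i)
        if len(kept) == size:
            break
    return A[kept]
```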

Data Scatter in Input Space is NOT Good Enough An example is given below: training data analogous to the XOR problem. [Figure of twelve labeled training points omitted.]

Mapping to Feature Space Map the input data via a nonlinear mapping that is equivalent to the polynomial kernel with degree 2; for 2-D input x = (x₁, x₂), one such map is Φ(x) = (x₁², √2·x₁x₂, x₂²), so that ⟨Φ(x), Φ(z)⟩ = (x′z)².
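A quick NumPy check of the stated equivalence; the explicit map below is one standard choice for 2-D inputs and the homogeneous degree-2 kernel K(x, z) = (x′z)², an assumption since the original formulas did not survive extraction.

```python
import numpy as np

def phi(x):
    # (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(z), (x @ z) ** 2)  # inner product in feature space equals kernel value
```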

Data Points in the Feature Space [Figure: the twelve training points of the previous slide mapped into the feature space.]

The Polynomial Kernel Matrix

Experiment Result [Figure showing the twelve labeled training points and the experimental result omitted.]

Express the Classifier as a Linear Combination of Kernel Functions In SSVM, the nonlinear separating surface is K(x′, A′)Du = γ: a linear combination of a set of kernel functions, one per training point. In RSVM, the nonlinear separating surface is K(x′, Ā′)ū = γ: a linear combination of a set of kernel functions, one per point of the reduced set.

Motivation of IRSVM: The Strength of Weak Ties If the kernel functions are very similar, the space spanned by these kernel functions will be very limited. Mark S. Granovetter, "The Strength of Weak Ties," The American Journal of Sociology, Vol. 78, No. 6, pp. 1360-1380, May 1973.

Incremental Reduced SVMs Start with a very small reduced set, then add a new data point only when its kernel vector is dissimilar to the current set of kernel functions; such a point contributes the most extra information for generating the separating surface. Repeat until several successive points cannot be added.

How to Measure the Dissimilarity? Add a point into the reduced set if the distance from its kernel vector to the column space of the current reduced kernel matrix K(A, Ā′) is greater than a threshold.

Solving Least Squares Problems This distance can be determined by solving a least squares problem: min over β of ‖K(A, Ā′)β − k‖², where k is the kernel vector of the candidate point. The LSP has a unique solution if the columns of K(A, Ā′) are linearly independent, and the distance is the norm of the optimal residual ‖K(A, Ā′)β* − k‖.
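A minimal NumPy sketch of this distance computation, assuming K_red is the m × m̄ reduced kernel matrix and k the kernel vector of the candidate point (names are illustrative).

```python
import numpy as np

def distance_to_column_space(K_red, k):
    """Distance from k to the column space of K_red via the least squares residual."""
    beta, *_ = np.linalg.lstsq(K_red, k, rcond=None)
    return np.linalg.norm(K_red @ beta - k)
```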

IRSVM Algorithm pseudo-code (sequential version)
1 Randomly choose two data points from the training data as the initial reduced set
2 Compute the reduced kernel matrix
3 For each data point not in the reduced set
4     Compute its kernel vector
5     Compute the distance from the kernel vector
6         to the column space of the current reduced kernel matrix
7     If its distance exceeds a certain threshold
8         Add this point into the reduced set and form the new reduced kernel matrix
9 Until several successive failures happen in line 7
10 Solve the QP problem of nonlinear SVMs with the obtained reduced kernel
11 A new data point is classified by the separating surface
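A rough Python sketch of the sequential loop above; the QP/SSVM training of line 10 is omitted, kernel is assumed to be a function such as the Gaussian kernel sketched earlier, and threshold/max_failures are illustrative parameters.

```python
import numpy as np

def irsvm_select(A, kernel, threshold, max_failures=10, rng=None):
    rng = rng or np.random.default_rng()
    reduced = list(rng.choice(len(A), size=2, replace=False))  # line 1: two random points
    K_red = kernel(A, A[reduced])                              # line 2: reduced kernel matrix
    failures = 0
    for i in rng.permutation(len(A)):                          # line 3
        if i in reduced:
            continue
        k = kernel(A, A[i:i + 1])[:, 0]                        # line 4: kernel vector of point i
        beta, *_ = np.linalg.lstsq(K_red, k, rcond=None)       # lines 5-6: distance to column space
        if np.linalg.norm(K_red @ beta - k) > threshold:       # line 7
            reduced.append(i)                                  # line 8
            K_red = kernel(A, A[reduced])
            failures = 0
        else:
            failures += 1
        if failures >= max_failures:                           # line 9: several successive failures
            break
    return reduced                                             # line 10 (QP with reduced kernel) omitted
```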

Speed up IRSVM We have to solve the LSP many times. Its main cost depends on the current reduced kernel matrix K(A, Ā′), not on the particular kernel vector on the right-hand side. Taking advantage of this, we examine a batch of data points at the same time.

IRSVM Algorithm pseudo-code (batch version)
1 Randomly choose two data points from the training data as the initial reduced set
2 Compute the reduced kernel matrix
3 For each batch of data points not in the reduced set
4     Compute their kernel vectors
5     Compute the corresponding distances from these kernel vectors
6         to the column space of the current reduced kernel matrix
7     For those points whose distance exceeds a certain threshold
8         Add those points into the reduced set and form the new reduced kernel matrix
9 Until no data point in a batch is added in lines 7-8
10 Solve the QP problem of nonlinear SVMs with the obtained reduced kernel
11 A new data point is classified by the separating surface
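A sketch of the batch idea: one least squares solve with many right-hand sides yields the distances for a whole batch of candidate kernel vectors at once (again an illustrative NumPy fragment, not the authors' code).

```python
import numpy as np

def batch_distances(K_red, K_batch):
    """K_batch holds the kernel vectors of a batch of candidate points as columns."""
    Beta, *_ = np.linalg.lstsq(K_red, K_batch, rcond=None)  # all columns solved together
    return np.linalg.norm(K_red @ Beta - K_batch, axis=0)   # one distance per candidate
```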

IRSVM on Four Public Datasets

IRSVM on UCI Adult datasets

Time comparison on Adult datasets

IRSVM: 10-Run Average on the 6414-Point Adult Training Set

Empirical Risk Minimization (ERM) (the underlying distribution P(x, y) and the expected risk are not needed) Replace the expected risk over P(x, y) by an average over the training examples. The empirical risk: R_emp(h) = (1/ℓ) Σᵢ₌₁..ℓ (1/2)|yᵢ − h(xᵢ)|. Find the hypothesis with the smallest empirical risk. Focusing only on the empirical risk will cause overfitting.

VC Confidence (The Bound between the Expected Risk and the Empirical Risk) The following inequality holds with probability 1 − δ: R(h) ≤ R_emp(h) + √((v(log(2ℓ/v) + 1) − log(δ/4)) / ℓ), where v is the VC dimension of the hypothesis class and ℓ is the number of training examples. C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery 2(2), 1998, pp. 121-167.

Why Do We Maximize the Margin? (Based on Statistical Learning Theory) Structural Risk Minimization (SRM): the expected risk is less than or equal to the empirical risk (training error) plus the VC confidence (error bound).

Bioinformatics Challenge Learning in very high dimensions with very few samples. Colon cancer dataset: 2000 genes vs. 62 samples. Acute leukemia dataset: 7129 genes vs. 72 samples. Feature selection will be needed.

Feature Selection Approaches Filter model: the attribute set is filtered to produce the most promising subset before learning commences (e.g., the weight score approach). Wrapper model: the learning algorithm is wrapped into the selection procedure (e.g., 1-norm SVM, IRSVM).

Feature Selection – Filter Model Using the Weight Score Approach

Filter Model – Weight Score Approach The weight score of feature j is w_j = |μⱼ⁺ − μⱼ⁻| / (σⱼ⁺ + σⱼ⁻), where μⱼ± and σⱼ± are the mean and standard deviation of feature j over the training examples of the positive or negative class.

Filter Model – Weight Score Approach The weight score w_j is defined as the ratio between the difference of the means of the expression levels and the sum of the standard deviations in the two classes. Select the genes with the largest w_j as the top features. The weight score is calculated from a single feature in isolation, so highly linearly correlated features might all be selected by this approach.
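A small NumPy sketch of this filter, assuming X holds samples in rows and genes in columns and y holds labels in {+1, −1} (function names are illustrative).

```python
import numpy as np

def weight_scores(X, y):
    """w_j = |mean difference| / (sum of standard deviations) for each feature j."""
    pos, neg = X[y == 1], X[y == -1]
    return np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))

def top_features(X, y, k):
    return np.argsort(weight_scores(X, y))[::-1][:k]  # indices of the k largest scores
```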

Wrapper Model – IRSVM Find a Linear Classifier: (I) Randomly choose a very small feature subset from the input features as the initial feature reduced set. (II) Select a feature vector not in the current feature reduced set and compute the distance between this vector and the space spanned by the current feature reduced set. (III) If the distance is larger than a given gap, add this feature vector to the feature reduced set. (IV) Repeat steps II and III until no feature can be added to the current feature reduced set. The features in the resulting feature reduced set are the final result of feature selection.
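A rough sketch of the wrapper procedure, treating each feature (a column of X) the way IRSVM treats a kernel column; the linear classifier training itself is not shown, and the gap threshold is an illustrative parameter.

```python
import numpy as np

def irsvm_feature_select(X, gap, n_init=2, rng=None):
    rng = rng or np.random.default_rng()
    selected = list(rng.choice(X.shape[1], size=n_init, replace=False))  # step I
    changed = True
    while changed:                                                       # step IV
        changed = False
        for j in range(X.shape[1]):                                      # steps II-III
            if j in selected:
                continue
            F = X[:, selected]
            beta, *_ = np.linalg.lstsq(F, X[:, j], rcond=None)
            if np.linalg.norm(F @ beta - X[:, j]) > gap:
                selected.append(j)
                changed = True
    return selected                                                      # final feature reduced set
```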