Margin-Sparsity Trade-off for the Set Covering Machine ECML 2005 François Laviolette (Université Laval) Mario Marchand (Université Laval) Mohak Shah (Université d’Ottawa)

PLAN
- Margin-sparsity trade-off for sample-compressed classifiers
- The "classical" Set Covering Machine (Classical-SCM)
  - Definition
  - Tight risk bound and model selection
  - The learning algorithm
- The modified Set Covering Machine (SCM2)
  - Definition
  - A non-trivial margin-sparsity trade-off expressed by the risk bound
  - The learning algorithm
- Empirical results
- Conclusions

The Sample Compression Framework
- In the sample compression setting, each classifier is identified by two different sources of information:
  - The compression set: an (ordered) subset S_i of the training set
  - A message string σ of additional information needed to identify the classifier
- More precisely: in the sample compression setting, there exists a "reconstruction" function R that gives a classifier h = R(σ, S_i) when given a compression set S_i and a message string σ.

The Sample Compression Framework (2)
- The examples are assumed to be i.i.d.
- The risk (or generalization error) of a classifier h, denoted R(h), is the probability that h misclassifies a new example.
- The empirical risk of h on a training set S, denoted R_S(h), is the frequency of errors of h on S.
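In symbols (a standard formalization of the two quantities above, not taken verbatim from the slides; D denotes the unknown example-generating distribution, m = |S|, and I[·] the indicator function):

$$R(h) = \Pr_{(\mathbf{x},y)\sim D}\big[h(\mathbf{x}) \ne y\big],
\qquad
R_S(h) = \frac{1}{m}\sum_{i=1}^{m} I\big[h(\mathbf{x}_i) \ne y_i\big].$$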

Examples of sample-compressed classifiers
- Set Covering Machines (SCM) [Marchand and Shawe-Taylor, JMLR 2002]
- Decision List Machines (DLM) [Marchand and Sokolova, JMLR 2005]
- Support Vector Machines (SVM)
- …

Margin-Sparsity Trade-off
- There is a widespread belief that, in the sample compression setting, learning algorithms should somehow try to find a non-trivial margin-sparsity trade-off.
- SVMs look for a large margin. Some effort has been made to find a sparser SVM (Bennett (1999), Bi et al. (2003)), but this seems a difficult task.
- SCMs look for sparsity. Forcing a classifier which is a conjunction of "geometric" Boolean features to have no training example within a given distance of its decision surface seems a much easier task. Moreover, we will see that in our setting, both sparsity and margin can be considered as different forms of data compression.

The "Classical" Set Covering Machine (Marchand and Shawe-Taylor 2002)
- Construct the "smallest possible" conjunction of (Boolean-valued) features.
- Each feature h is a ball identified by two training examples, the center (x_c, y_c) and the border point (x_b, y_b), and is defined for any input example x by the formula on the slide (a reconstruction is given below).
- Dually, one can construct the "smallest possible" disjunction of features, but we only consider the conjunction case in this talk.
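The defining formula is an equation image on the original slide. A reconstruction of the ball feature, following Marchand and Shawe-Taylor (2002) and using d(·,·) for the underlying (possibly kernel-induced) metric, would be roughly:

$$h_{c,b}(\mathbf{x}) =
\begin{cases}
y_c & \text{if } d(\mathbf{x}, \mathbf{x}_c) \le d(\mathbf{x}_b, \mathbf{x}_c)\\
\bar{y}_c & \text{otherwise,}
\end{cases}$$

i.e. the ball centered at x_c with radius d(x_b, x_c) outputs the label of its center inside and the opposite label outside (the exact treatment of the border point may differ slightly in the paper).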

An Example of a "Classical" SCM (figure: positive and negative training examples covered by balls)

But the SCM is looking for sparsity!

A risk bound (Corollary 1; the bound itself appears as a formula on the original slide)

For "classical" SCMs
- If we choose the following prior (formula on the slide), then Corollary 1 becomes the bound shown on the slide.
- This almost expresses a symmetry between k and d: for "classical" SCMs, the term involving P_{M(Z_i)}(σ) is small compared to the other terms of the bound, and the same holds for ln(d+1).

Model selection by the bound
- Empirical results showed that looking for an SCM that minimizes this risk bound is a slightly better model-selection strategy than the cross-validation approach.
- The reasons are not totally clear:
  - This bound is tight
  - There is a symmetry between d and k ???

A Learning Algorithm for the "Classical" SCM
- Ideally we would like to find an SCM that minimizes the risk bound.
- Unfortunately, this is NP-hard (at least).
- We will therefore use a greedy heuristic based on the following observation.

Adding one ball at a time, a classification error on a "+" example cannot be fixed by adding other balls; but for a "-" example it is possible.

A Learning Algorithm for the "Classical" SCM (Marchand and Shawe-Taylor 2002)
- Define a list p_1, p_2, …, p_l, and for each such p (called the learning parameter) DO STEP 1.
- STEP 1: Suppose i balls (B_{p,0}, B_{p,1}, …, B_{p,i-1}) have already been constructed by the algorithm.
  UNTIL every "-" example is correctly assigned by the SCM (B_{p,0}, B_{p,1}, …, B_{p,i-1}) DO:
  - Choose a new ball B_{p,i} that maximizes q_i - p · r_i, where
    - q_i is the number of "-" examples correctly assigned by B_{p,i} but not by the SCM (B_{p,0}, B_{p,1}, …, B_{p,i-1})
    - r_i is the number of "+" examples not correctly assigned by B_{p,i} but correctly assigned by the SCM (B_{p,0}, B_{p,1}, …, B_{p,i-1})

A Learning Algorithm for the "Classical" SCM (continued)
- Among the resulting SCMs (the candidates are listed on the slide), OUTPUT the one that has the best risk bound. A Python sketch of this greedy procedure is given below.
- Note: the algorithm can be adapted to a cross-validation approach.
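For concreteness, here is a minimal Python sketch of the greedy covering step described above, under stated assumptions: `greedy_scm`, `classifies_negative`, and the candidate-ball interface are hypothetical names, not from the paper or its code, and the candidate balls are assumed to be precomputed (e.g. one per center/border pair).

```python
def greedy_scm(candidate_balls, X_pos, X_neg, p):
    """Greedy covering heuristic for the classical SCM (a sketch, not the authors' code).

    candidate_balls : list of ball objects exposing classifies_negative(x) -> bool
                      (True if the ball assigns label '-' to input x).
    X_pos, X_neg    : lists of positive / negative training examples.
    p               : the learning (penalty) parameter trading newly covered '-'
                      examples against newly misclassified '+' examples.
    """
    chosen = []
    uncovered_neg = set(range(len(X_neg)))   # '-' examples not yet assigned '-' by the conjunction
    alive_pos = set(range(len(X_pos)))       # '+' examples still correctly assigned '+'

    while uncovered_neg:
        best = None
        for ball in candidate_balls:
            # q_i: uncovered '-' examples that this ball would fix
            q = {j for j in uncovered_neg if ball.classifies_negative(X_neg[j])}
            # r_i: currently correct '+' examples that this ball would break
            r = {j for j in alive_pos if ball.classifies_negative(X_pos[j])}
            score = len(q) - p * len(r)
            if best is None or score > best[0]:
                best = (score, ball, q, r)
        if best is None or not best[2]:
            break                            # no ball covers a new '-' example: stop early
        _, ball, q, r = best
        chosen.append(ball)
        uncovered_neg -= q
        alive_pos -= r
    # The resulting SCM classifies x as '+' iff no chosen ball classifies it as '-'.
    return chosen
```

Model selection would then pick, over the values of p (and possibly over prefixes of the sequence of chosen balls), the machine with the smallest risk bound or the best cross-validation score, as on the slide above.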

SCM2: SCM with radii coded by message strings
- In the "classical" SCM, centers and radii are defined by examples of the training set.
- An alternative: code each radius value by a message string (but still use examples of the training set to define the centers).
- Objective: construct the "smallest possible" conjunction of balls, each of which has the "smallest possible" number of bits in the message string that defines its radius.

What kind of radius can be described with l bits?
- Let us choose a scale R.
- With l = 0 bits, we can define the radius R/2.
- With l = 1 bit, we can define the radii R/4 and 3R/4.
- With l = 2 bits, we can define the radii R/8, 3R/8, 5R/8 and 7R/8.
(The original slides illustrate each case with a ball of scale R and the allowed radii marked.)

More precisely
- Under some parameter R (that will be our scale), the radius of any ball of an SCM2 will be coded by a pair (l, s) such that 0 < 2s - 1 < 2^{l+1}.
- The code (l, s) means that the radius of the ball is (2s - 1) R / 2^{l+1}.
- Thus the possible radius values for l = 2 are R/8, 3R/8, 5R/8 and 7R/8.
- Note that l is the number of bits of the radius.
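As a sanity check of the radius formula above (reconstructed from the values listed on the preceding slides), take l = 2:

$$2^{\,l+1} = 8,\qquad 0 < 2s-1 < 8 \;\Rightarrow\; 2s-1 \in \{1,3,5,7\}
\;\Rightarrow\; \text{radius} \in \Big\{\tfrac{R}{8},\,\tfrac{3R}{8},\,\tfrac{5R}{8},\,\tfrac{7R}{8}\Big\},$$

which matches the l = 2 slide; likewise l = 0 gives R/2 and l = 1 gives R/4 and 3R/4.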

Observe that if we have a large margin, then among all the "interesting" balls there will be one whose radius code (l, s) has a small number of bits.

For SCM2
- If we choose the following priors (formulas on the slide), then Corollary 1 becomes the bound shown on the slide, which expresses a non-trivial margin-sparsity trade-off!
- The learning algorithm is similar to the classical one, except that it needs two extra learning parameters: R and the maximum number of bits allowed for the message strings (denoted l*). A sketch of how the candidate radii can be enumerated from R and l* is given below.
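To make the two extra parameters concrete, here is a small sketch (a hypothetical helper, not from the paper) that enumerates every radius expressible with at most l* bits under scale R, following the (l, s) encoding defined earlier; candidate balls for SCM2 would then pair each training-example center with each such radius.

```python
from fractions import Fraction

def candidate_radii(R, l_star):
    """All radii (2s - 1) * R / 2**(l + 1) with 0 < 2s - 1 < 2**(l + 1) and l <= l_star."""
    radii = []
    for l in range(l_star + 1):
        denom = 2 ** (l + 1)
        for s in range(1, denom // 2 + 1):    # 2s - 1 runs over 1, 3, ..., denom - 1
            radii.append(((l, s), Fraction(2 * s - 1, denom) * R))
    return radii

# Example: scale R = 1 and l* = 2 yields R/2 (l=0), R/4, 3R/4 (l=1), R/8, 3R/8, 5R/8, 7R/8 (l=2).
for code, radius in candidate_radii(Fraction(1), 2):
    print(code, radius)
```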

Empirical results
- SVMs and SCMs on UCI data sets (results table on the slide).
- We observe:
  - For SCMs, model selection by the bound is almost always better than by cross-validation.
  - SCM2 is almost always better than SCM1 (the classical SCM).
  - SCM2 tends to produce more balls than SCM1; hence SCM2 sacrifices sparsity to obtain a larger margin.

Conclusion
We have proposed:
- A new representation for the SCM that uses two distinct sources of information:
  - A compression set to represent the centers of the balls
  - A message string to encode the radius value of each ball
- A general data-compression risk bound that depends explicitly on these two information sources and that:
  - exhibits a non-trivial trade-off between sparsity (the inverse of the compression set size) and the margin (the inverse of the message length)
  - seems to be an effective guide for choosing the proper margin-sparsity trade-off of a classifier