Linear Programming Boosting by Column and Row Generation Kohei Hatano and Eiji Takimoto Kyushu University, Japan DS 2009

1. Introduction  2. Preliminaries  3. Our algorithm  4. Experiment  5. Summary

Example: given a web page, predict whether its topic is "DS 2009". The hypothesis set consists of simple word tests ("DS 2009?", "ALT?", "Porto?", each answering yes/no), combined by a weighted majority vote. Modern machine learning approach: find a weighted majority of hypotheses (i.e., a hyperplane) which enlarges the margin.

1-norm soft margin optimization (a.k.a. 1-norm SVM). A popular formulation, like the 2-norm soft margin optimization: "find a hyperplane which separates positive and negative instances well," trading off the margin ρ against the losses ξ_i. Note: it is a linear program, it enjoys a good generalization guarantee [Schapire et al. 98], and the margin loss is normalized with the 1-norm.
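For concreteness, one standard way to write this LP (a sketch following the usual soft-margin LPBoost formulation with parameter ν; the slides' exact normalization may differ):

\[
\max_{\rho,\, w,\, \xi}\;\; \rho - \frac{1}{\nu}\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad
y_i\sum_{j=1}^{n} w_j h_j(x_i) \ge \rho - \xi_i\;\; (i=1,\dots,m),\qquad
\sum_{j=1}^{n} w_j = 1,\;\; w_j \ge 0,\;\; \xi_i \ge 0.
\]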

1-norm soft margin optimization (2). Advantage of the 1-norm soft margin optimization: the solution is likely to be sparse, which is useful for feature selection. For example, the 1-norm version tends to give a sparse hyperplane such as 0.5*(DS 2009?) + 0.3*(ALT?) + 0.2*(Porto?), while the 2-norm version tends to give a non-sparse one such as 0.2*(DS 2009?) + ...*(ALT?) + ...*(Porto?) + 0.1*(wine?) + 0.05*(tasting?) + 0.05*(discovery?) + 0.03*(science?) + 0.02*(…) + ...

Recent Results. 2-norm soft margin optimization is a quadratic program, and there are state-of-the-art solvers: SMO [Platt, 1999], SVM-light [Joachims, 1999], SVM-Perf [Joachims, 2006], Pegasos [Shalev-Shwartz et al., 2007]. 1-norm soft margin optimization is a linear program; existing solvers include LPBoost [Demiriz et al., 2003], Entropy Regularized LPBoost [Warmuth et al., 2008], and others [Mangasarian 2006][Sra 2006], but they are not efficient enough for large data, leaving room for improvement. Our result: a new algorithm for 1-norm soft margin optimization.

1. Introduction  2. Preliminaries  3. Our algorithm  4. Experiment  5. Summary

Boosting
Running example: classify "frog" as +1 and everything else as -1; the hypotheses are simple tests on color and size.
1. d_1: uniform distribution over the instances.
2. For t = 1, ..., T:
   (i) choose the hypothesis h_t maximizing the edge w.r.t. d_t;
   (ii) update the distribution d_t to d_{t+1}.
3. Output a weighting of the chosen hypotheses.

Boosting (2)
Step 2(i): choose the hypothesis h_1 maximizing the edge w.r.t. d_1 (figure: h_1 and its predictions h_1(x_i) on the color-size plane).
Edge of a hypothesis h w.r.t. a distribution d: labels y_i are -1 or +1 and predictions h(x_i) lie in [-1,+1], so y_i h(x_i) > 0 exactly when h is correct on x_i.
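In symbols, the edge is the d-weighted correlation between a hypothesis and the labels (the standard definition used in LPBoost-style analyses):

\[
\mathrm{edge}_{d}(h) \;=\; \sum_{i=1}^{m} d_i\, y_i\, h(x_i) \;\in\; [-1,\,+1].
\]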

Boosting (2, continued)
Step 2(ii): update the distribution d_1 to d_2, putting more weight on the misclassified instances.

Boosting (3)
At t = 2, choose the hypothesis h_2 maximizing the edge w.r.t. d_2 (figure: h_2 on the color-size plane). Note: more weight is now on the "difficult" instances, such as the misclassified frog.

Boosting (4)
Step 3: output a weighting of the chosen hypotheses h_1 and h_2 (figure: the combined classifier on the color-size plane).

Boosting & 1-norm soft margin optimization
Primal view: "find the large-margin hyperplane which separates positive and negative instances as much as possible."
Dual view (equivalent by LP duality): find the distribution which minimizes the maximum edge over hypotheses, i.e., the most "difficult" distribution w.r.t. the hypotheses. This dual problem is (approximately) solvable using boosting.
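Written out, the dual of the LP above (again in the usual soft-margin LPBoost form, with the same caveat on normalization) is:

\[
\min_{\gamma,\, d}\;\; \gamma
\quad\text{s.t.}\quad
\sum_{i=1}^{m} d_i\, y_i\, h_j(x_i) \le \gamma\;\; (j=1,\dots,n),\qquad
\sum_{i=1}^{m} d_i = 1,\;\; 0 \le d_i \le \tfrac{1}{\nu}.
\]

Minimizing the maximum edge over hypotheses under a capped distribution is exactly the "most difficult distribution" problem that the boosting view solves.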

LPBoost [Demiriz et al., 2003]
Update: solve the dual problem restricted to the hypotheses chosen so far, {h_1, ..., h_t}.
Output: the convex combination Σ_t α_t h_t, where α is the solution of the (restricted) primal problem.
Theorem [Demiriz et al.]: given ε > 0, LPBoost outputs an ε-approximation of the optimal solution.
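A minimal sketch of this column-generation loop, using SciPy's LP solver in place of CPLEX. The matrix U with U[i, j] = y_i h_j(x_i), the parameter nu, and the tolerance eps are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.optimize import linprog

def lpboost(U, nu, eps=1e-3, max_iter=200):
    """Column-generation sketch for the soft-margin dual.
    U[i, j] = y_i * h_j(x_i); nu is the soft-margin parameter."""
    m, n = U.shape
    d = np.full(m, 1.0 / m)                  # initial distribution over instances
    gamma, chosen, res = -1.0, [], None
    for _ in range(max_iter):
        edges = d @ U                        # edge of every hypothesis under d
        j = int(np.argmax(edges))            # most violated dual constraint (best column)
        if edges[j] <= gamma + eps:          # no hypothesis beats gamma: eps-optimal
            break
        chosen.append(j)
        # Restricted dual over variables x = (d_1..d_m, gamma): minimize gamma
        # subject to U[:, chosen]^T d - gamma <= 0, sum(d) = 1, 0 <= d_i <= 1/nu.
        k = len(chosen)
        c = np.zeros(m + 1)
        c[-1] = 1.0
        A_ub = np.hstack([U[:, chosen].T, -np.ones((k, 1))])
        A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
        bounds = [(0.0, 1.0 / nu)] * m + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(k), A_eq=A_eq, b_eq=[1.0],
                      bounds=bounds, method="highs")
        d, gamma = res.x[:m], res.x[-1]
    # Primal weights alpha are the dual prices of the edge constraints;
    # the sign convention depends on the solver backend.
    alpha = np.abs(res.ineqlin.marginals) if res is not None else np.array([])
    return chosen, alpha, d, gamma

The loop adds one hypothesis (column) per iteration and re-solves the restricted dual, which is exactly the column-generation pattern LPBoost follows.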

Properties of the optimal solution
(Figure: the optimal margin ρ* and the losses ξ*_i.)
The KKT conditions imply: the optimal solution can be recovered using only the instances with positive weights d_i > 0. For example, with ν = 0.2m, at most 20% of the instances have margin less than ρ*; and since d_i ≤ 1/ν, at least ν instances must carry positive weight.

Properties of the optimal solution (2)
By the KKT conditions, the optimal solution is sparse in two ways:
1. Sparseness w.r.t. the hypothesis weighting.
2. Sparseness w.r.t. the instances.
Note: the optimal solution can be recovered using only the hypotheses with positive coefficients.
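As a sketch, the relevant complementary slackness conditions for the primal-dual pair above are

\[
d_i^*\Bigl(y_i \sum_j w_j^* h_j(x_i) - \rho^* + \xi_i^*\Bigr) = 0
\qquad\text{and}\qquad
w_j^*\Bigl(\gamma^* - \sum_i d_i^* y_i h_j(x_i)\Bigr) = 0,
\]

so only instances whose margin constraint is tight can have d_i^* > 0, and only hypotheses whose edge attains γ* can have w_j^* > 0.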

1. Introduction  2. Preliminaries  3. Our algorithm  4. Experiment  5. Summary

Our idea: Sparse LPBoost
Take advantage of the sparseness w.r.t. both hypotheses and instances; see the sketch below.
2. For t = 1, ...:
   (i) pick up instances with margin < ρ_t;
   (ii) solve the dual problem w.r.t. the instances chosen so far by boosting (ρ_{t+1}: the solution).
3. Output the solution of the primal problem.
Theorem: given ε > 0, Sparse LPBoost outputs an ε-approximation of the optimal solution.
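A rough sketch of the resulting column-and-row generation loop, reusing the lpboost sketch above as the inner solver. The initial row sample, the doubling rule, and the use of the restricted optimum as the margin threshold are illustrative assumptions rather than the paper's exact procedure.

import numpy as np

def sparse_lpboost(U, nu, eps=1e-3, max_outer=50):
    """Row-and-column generation sketch: grow the set of active instances (rows),
    re-solving the restricted dual with lpboost() on the submatrix each round."""
    m, n = U.shape
    # The restricted dual needs at least ceil(nu) rows to be feasible
    # (sum(d) = 1 with d_i <= 1/nu).
    k0 = min(m, max(int(np.ceil(nu)), 16))
    rows = list(np.random.default_rng(0).choice(m, size=k0, replace=False))
    w, rho = np.zeros(n), -1.0
    for t in range(1, max_outer + 1):
        cols, alpha, _, gamma = lpboost(U[rows, :], nu, eps)   # inner column generation
        w = np.zeros(n)
        if len(cols):
            w[cols] = alpha / max(alpha.sum(), 1e-12)          # normalized hypothesis weights
        margins = U @ w                                        # margin of every instance under w
        rho = gamma          # restricted optimum used as the margin threshold (simplification)
        rows_set = set(rows)
        violated = [i for i in np.argsort(margins)
                    if margins[i] < rho - eps and i not in rows_set]
        if not violated:     # every instance meets the margin threshold: done
            break
        rows += violated[: 2 ** t]   # add at most 2^t worst-margin instances per round
    return w, rho, rows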

Our idea (matrix form)
Consider the matrix of inequality constraints of the dual problem: each row i corresponds to instance i and each column j corresponds to hypothesis j. The "effective" constraints for the optimal solution form intersections of a few columns and rows.
Constraints used: LP uses the whole matrix; LPBoost uses columns; Sparse LPBoost uses intersections of columns and rows.

How to choose examples (hypotheses)?
1st attempt: add instances one by one. This is less efficient than a plain LP solver!
Assumptions: the number of hypotheses is constant, and the time complexity of the LP solver is m^k (m: number of instances).
Our method: at iteration t, choose at most 2^t instances with margin < ρ. Suppose the algorithm terminates after it has chosen cm instances (0 < c < 1).
Note: the same argument holds for hypotheses.
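A back-of-the-envelope version of why doubling helps, under the m^k assumption above (a sketch, not the paper's analysis): if the restricted LPs have sizes 2, 4, ..., up to cm, the total solver time is

\[
\sum_{t=1}^{\log_2(cm)} (2^t)^k \;\le\; \frac{2^k}{2^k - 1}\,(cm)^k \;=\; O\bigl(c^k m^k\bigr),
\]

roughly a factor of c^k faster than solving the full LP of size m once, whereas adding instances one at a time would cost on the order of \(\sum_{s=1}^{cm} s^k = \Theta((cm)^{k+1})\).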

1. Introduction  2. Preliminaries  3. Our algorithm  4. Experiment  5. Summary

Experiments (new experiments not in the proceedings)

Data set   # of examples m   # of hypotheses n
Reuters    …,170             30,389
RCV1       20,242            47,237
news20     19,996            1,335,193

Parameters: ν = 0.2m, ε = 0.001. Each algorithm is implemented in C++ with the LP solver CPLEX 11.0.

Experimental Results (running times in seconds)
Sparse LPBoost improves the computation time by a factor of 3 to 100.

1. Introduction  2. Preliminaries  3. Our algorithm  4. Experiment  5. Summary

Summary & Open Problems
Our result: Sparse LPBoost, a provable decomposition algorithm which ε-approximates the 1-norm soft margin optimization, 3 to 100 times faster than an LP solver or LPBoost.
Open problems: a theoretical guarantee on the number of iterations; a better method for choosing instances (hypotheses).