Linear Programming Boosting by Column and Row Generation. Kohei Hatano and Eiji Takimoto, Kyushu University, Japan. DS 2009.
1.Introduction 2.Preliminaries 3.Our algorithm 4.Experiment 5.Summary
Example: given a web page, predict whether its topic is “DS 2009”. Hypothesis set = words: each word (“DS 2009”?, “ALT”?, “Porto”?, ...) acts as a yes/no base classifier, combined by a weighted majority vote. [Figure: weighted majority vote over word hypotheses] Modern machine learning approach: find a weighted majority of hypotheses (i.e. a hyperplane) which enlarges the margin.
1-norm soft margin optimization (aka 1-norm SVM): a popular formulation alongside the 2-norm soft margin optimization. Goal: “find a hyperplane which separates positive and negative instances well”, trading off the margin ρ against the losses ξ_i. [Figure: margin ρ and losses ξ_i around the separating hyperplane] Note: it is a linear program, it has a good generalization guarantee [Schapire et al. 98], and the margin and losses are normalized with the 1-norm.
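Spelled out, the optimization referred to here is the standard 1-norm soft margin LP (a reconstruction in the usual LPBoost notation, with m instances (x_i, y_i), hypotheses h_1, ..., h_n, weighting α, margin ρ, losses ξ, and soft-margin parameter ν):

\[
\begin{aligned}
\max_{\rho,\,\alpha,\,\xi}\quad & \rho - \frac{1}{\nu}\sum_{i=1}^{m}\xi_i\\
\text{s.t.}\quad & y_i \sum_{j=1}^{n} \alpha_j\, h_j(x_i) \;\ge\; \rho - \xi_i \qquad (i = 1,\dots,m),\\
& \sum_{j=1}^{n} \alpha_j = 1,\qquad \alpha \ge 0,\qquad \xi \ge 0.
\end{aligned}
\]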
1-norm soft margin optimization (2). Advantage of the 1-norm soft margin optimization: the solution is likely to be sparse, which is useful for feature selection. Sparse hyperplane (1-norm soft margin opt.): 0.5*(DS 2009?) + …*(ALT?) + 0.2*(Porto?). Non-sparse hyperplane (2-norm soft margin opt.): 0.2*(DS 2009?) + …*(ALT?) + …*(Porto?) + 0.1*(wine?) + 0.05*(tasting?) + 0.05*(discovery?) + 0.03*(science?) + 0.02*(…) + …
Recent results.
2-norm soft margin optimization: Quadratic Programming; there are state-of-the-art solvers: SMO [Platt, 1999], SVM light [Joachims, 1999], SVM Perf [Joachims, 2006], Pegasos [Shalev-Shwartz et al., 2007].
1-norm soft margin optimization: Linear Programming; LPBoost [Demiriz et al., 2003], Entropy Regularized LPBoost [Warmuth et al., 2008], others [Mangasarian, 2006] [Sra, 2006]; not efficient enough for large data, so there is room for improvement.
Our result: a new algorithm for 1-norm soft margin optimization.
1.Introduction 2.Preliminaries 3.Our algorithm 4.Experiment 5.Summary
Boosting. Running example: classify frogs as “+1”, others as “-1”; hypotheses are based on color and size. [Figure: instances plotted by color and size]
1. d_1: uniform distribution over the instances.
2. For t = 1, ..., T:
(i) choose the hypothesis h_t maximizing the edge w.r.t. d_t;
(ii) update the distribution d_t to d_{t+1}.
3. Output a weighting of the chosen hypotheses.
Boosting (2). Step 2(i): choose the hypothesis h_1 maximizing the edge w.r.t. d_1. [Figure: h_1 splits the instances] Edge of a hypothesis h w.r.t. a distribution d: the d-weighted correctness of h, where hypothesis values are -1 or +1, y_i h(x_i) > 0 exactly when h is correct on instance i, and the edge lies in [-1, +1].
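The edge referred to here is the usual boosting quantity (the formula itself was in a figure on the slide, so this is a reconstruction):

\[
\mathrm{edge}_{d}(h) \;=\; \sum_{i=1}^{m} d_i\, y_i\, h(x_i) \;\in\; [-1,\,+1],
\]

which is +1 when h is correct on every instance with positive weight and -1 when it is wrong on all of them.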
Boosting (2, continued). Step 2(ii): update the distribution from d_1 to d_2, putting more weight on the instances misclassified by h_1. [Figure: misclassified instances receive larger weights]
Boosting (3). Second iteration: choose the hypothesis h_2 maximizing the edge w.r.t. the updated distribution d_2. [Figure: h_2 splits the reweighted instances] Note: more weight is put on the “difficult” instances.
Boosting (4). Step 3: output a weighting of the chosen hypotheses (here h_1 and h_2). [Figure: the combined classifier built from h_1 and h_2]
Boosting & 1-norm soft margin optimization. Primal view: “find the large-margin hyperplane which separates positive and negative instances as much as possible.” Dual view: “find the distribution which minimizes the maximum edge of the hypotheses,” i.e. the most “difficult” distribution w.r.t. the hypotheses. The two problems are equivalent by LP duality, and the dual view is (approximately) solvable using boosting.
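For reference, the dual linear program meant here can be written as follows (standard LPBoost notation, reconstructed rather than quoted from the slides):

\[
\begin{aligned}
\min_{\gamma,\, d}\quad & \gamma\\
\text{s.t.}\quad & \sum_{i=1}^{m} d_i\, y_i\, h_j(x_i) \;\le\; \gamma \qquad (j = 1,\dots,n),\\
& \sum_{i=1}^{m} d_i = 1,\qquad 0 \le d_i \le \tfrac{1}{\nu}.
\end{aligned}
\]

Its optimal value satisfies \(\gamma^* = \rho^* - \tfrac{1}{\nu}\sum_i \xi_i^*\), which is the LP duality the slide appeals to.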
LPBoost [Demiriz et al., 2003]. Update: at each iteration, solve the dual problem restricted to the hypotheses chosen so far, {h_1, ..., h_t}. Output: the convex combination f(x) = Σ_t α_t h_t(x), where α is the solution of the corresponding primal problem. Theorem [Demiriz et al.]: given ε > 0, LPBoost outputs an ε-approximation of the optimal solution.
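A minimal sketch of this column generation loop, assuming hypothesis outputs are precomputed in a matrix H with H[i, j] = h_j(x_i); SciPy's linprog stands in for the LP solver used in the experiments (CPLEX), and all names here are illustrative rather than the authors' code:

```python
# Minimal sketch of LPBoost-style column generation (illustrative, not the
# authors' implementation).  Assumes labels y in {-1,+1} and a precomputed
# matrix H with H[i, j] = h_j(x_i) in [-1,+1]; nu is the soft-margin parameter
# (a count, e.g. 0.2 * m as in the experiments).
import numpy as np
from scipy.optimize import linprog

def lpboost(H, y, nu, eps=1e-3, max_iter=200):
    m, n = H.shape
    margins = y[:, None] * H              # margins[i, j] = y_i * h_j(x_i)
    d = np.full(m, 1.0 / m)               # initial distribution: uniform
    gamma = -np.inf
    chosen = []                           # indices of generated columns
    for _ in range(max_iter):
        edges = d @ margins               # edge of every hypothesis w.r.t. d
        j = int(np.argmax(edges))
        if edges[j] <= gamma + eps:       # no column improves the restricted dual
            break
        chosen.append(j)
        # Restricted dual over the chosen columns, variables x = (d_1..d_m, gamma):
        #   min gamma  s.t.  sum_i d_i y_i h_j(x_i) <= gamma  (j chosen),
        #   sum_i d_i = 1,  0 <= d_i <= 1/nu.
        c = np.r_[np.zeros(m), 1.0]
        A_ub = np.c_[margins[:, chosen].T, -np.ones(len(chosen))]
        A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(chosen)),
                      A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1.0 / nu)] * m + [(None, None)],
                      method="highs")
        d, gamma = res.x[:m], res.x[m]
    # Recover (alpha, rho) from the restricted primal, variables (alpha, xi, rho).
    k = len(chosen)
    c = np.r_[np.zeros(k), np.full(m, 1.0 / nu), -1.0]   # min -(rho - sum xi / nu)
    A_ub = np.c_[-margins[:, chosen], -np.eye(m), np.ones(m)]
    A_eq = np.r_[np.ones(k), np.zeros(m), 0.0].reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (k + m) + [(None, None)],
                  method="highs")
    return chosen, res.x[:k], res.x[-1]   # hypotheses, weights alpha, margin rho
```

For the text data of the experiments, H would presumably be the word-occurrence matrix with entries mapped to ±1, one column per word hypothesis.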
Properties of the optimal solution. [Figure: optimal margin ρ* and losses ξ*_i] The KKT conditions imply sparseness w.r.t. instances: the optimal solution can be recovered using only the instances with positive weights d*_i > 0. Example: with ν = 0.2m, at most 20% of the instances have margin less than ρ*; and since each d_i ≤ 1/ν while the weights sum to 1, at least ν instances must receive positive weight.
Properties of the optimal solution (2). By the KKT conditions, the optimal solution is sparse in two ways: 1. sparseness w.r.t. the hypothesis weighting α*; 2. sparseness w.r.t. the instance weights d*. Note: the optimal solution can be recovered using only the hypotheses with positive coefficients.
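The complementary slackness conditions behind both kinds of sparseness, written for the primal/dual pair given earlier (a standard reconstruction, not quoted from the paper):

\[
\alpha_j^* > 0 \;\Rightarrow\; \sum_{i} d_i^*\, y_i\, h_j(x_i) = \gamma^*,\qquad
d_i^* > 0 \;\Rightarrow\; y_i \sum_{j} \alpha_j^*\, h_j(x_i) = \rho^* - \xi_i^*,\qquad
\xi_i^* > 0 \;\Rightarrow\; d_i^* = \tfrac{1}{\nu}.
\]

Equivalently, hypotheses whose edge under d* is strictly below γ* get α*_j = 0, and instances whose margin strictly exceeds ρ* get d*_i = 0, which is exactly the sparseness on both sides.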
1.Introduction 2.Preliminaries 3.Our algorithm 4.Experiment 5.Summary
Our idea: Sparse LPBoost. Take advantage of the sparseness w.r.t. both hypotheses and instances.
2. For t = 1, ...:
(i) pick up the instances with margin < ρ_t;
(ii) solve the dual problem w.r.t. the instances chosen so far by boosting (ρ_{t+1}: the resulting margin).
3. Output the solution of the primal problem.
Theorem: given ε > 0, Sparse LPBoost outputs an ε-approximation of the optimal solution.
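A rough sketch of this row generation loop, reusing the lpboost() sketch above as the inner solver. The selection rule beyond “at most 2^t instances with margin < ρ_t”, the initial value of ρ, and the handling of ν on the restricted problem are assumptions made purely for illustration:

```python
# Rough sketch of the Sparse LPBoost idea: generate rows (instances) in
# geometrically growing batches and solve each restricted problem with the
# lpboost() sketch above.
import numpy as np

def sparse_lpboost(H, y, nu, eps=1e-3, max_rounds=50):
    m, n = H.shape
    active = np.zeros(m, dtype=bool)       # instances (rows) chosen so far
    alpha_full = np.zeros(n)
    rho = 1.0                              # start high so every row can violate
    for t in range(max_rounds):
        inst_margins = y * (H @ alpha_full)            # margin of every instance
        violated = np.where(~active & (inst_margins < rho - eps))[0]
        if violated.size == 0:
            break                          # all rows satisfy the margin: done
        # Add at most 2^t violated instances (here: those with smallest margin).
        take = violated[np.argsort(inst_margins[violated])][: 2 ** t]
        active[take] = True
        idx = np.where(active)[0]
        nu_t = min(nu, len(idx))           # keep the restricted dual feasible
        chosen, alpha, rho = lpboost(H[idx], y[idx], nu_t, eps)
        alpha_full = np.zeros(n)
        alpha_full[np.asarray(chosen, dtype=int)] = alpha
    return alpha_full, rho
```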
Our idea (matrix form). View the inequality constraints of the dual problem as a matrix: each row i corresponds to instance i, each column j corresponds to hypothesis j.
Constraints actually handled: LP: the whole matrix; LPBoost: columns (column generation); Sparse LPBoost: intersections of columns and rows.
The “effective” constraints for the optimal solution are exactly such intersections of columns and rows.
How to choose examples (hypotheses)? First attempt: add instances one by one; this turns out to be less efficient than simply running the LP solver. Assumptions: the number of hypotheses is constant, and the time complexity of the LP solver is m^k (m: number of instances). Our method: at iteration t, choose at most 2^t instances with margin < ρ. If the algorithm terminates after choosing cm instances (0 < c < 1; c is not known in advance), the restricted LPs stay much smaller than the full one. Note: the same argument holds for hypotheses.
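One plausible reading of the accounting behind this doubling schedule (the summary slide lists a rigorous bound on the number of iterations as an open problem, so this is only the intended best case, not a proof):

\[
\underbrace{\sum_{s=1}^{cm} O(s^{k})}_{\text{one instance per round}} = O\big((cm)^{k+1}\big)
\qquad\text{vs.}\qquad
\underbrace{\sum_{t=0}^{T} O\big(2^{(t+1)k}\big)}_{\text{at most } 2^{t}\text{ new instances per round}} = O\big(2^{(T+1)k}\big) = O\big((cm)^{k}\big),
\]

where the right-hand side assumes the batches actually fill up, so the algorithm stops after T = O(log(cm)) rounds and the geometric sum is dominated by its last term; for c < 1 this is smaller than the O(m^k) cost of solving the full LP by roughly a factor of 1/c^k.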
1.Introduction 2.Preliminaries 3.Our algorithm 4.Experiment 5.Summary
Experiments (new experiments, not in the proceedings).
Data sets (m = # of examples, n = # of hypotheses):
Reuters: m = …,170, n = 30,389
RCV1: m = 20,242, n = 47,237
news20: m = 19,996, n = 1,335,193
Parameters: ν = 0.2m, ε = 0.001. Each algorithm was implemented in C++ with the LP solver CPLEX 11.0.
Experimental results (running times in seconds): Sparse LPBoost improves the computation time by a factor of 3 to 100 over the LP solver and LPBoost.
1.Introduction 2.Preliminaries 3.Our algorithm 4.Experiment 5.Summary
Summary & open problems.
Our result: Sparse LPBoost, a provable decomposition algorithm that ε-approximates the 1-norm soft margin optimization and is 3 to 100 times faster than an LP solver or LPBoost.
Open problems: a theoretical guarantee on the number of iterations; better methods for choosing instances (hypotheses).