ICML 2003: Linear Programming Boosting for Uneven Datasets. Jurij Leskovec, Jožef Stefan Institute, Slovenia; John Shawe-Taylor, Royal Holloway, University of London, UK.

Presentation transcript:

Linear Programming Boosting for Uneven Datasets
Jurij Leskovec, Jožef Stefan Institute, Slovenia
John Shawe-Taylor, Royal Holloway, University of London, UK

Motivation
- There are 800 million Europeans, and 2 million of them are Slovenians.
- We want to build a classifier that distinguishes Slovenians from the rest of the Europeans.
- A traditional, unaware classifier (e.g. a politician) would not even notice Slovenia as an entity.
- We don't want that!

Problem setting
- Unbalanced dataset with 2 classes: positive (small) and negative (large).
- Train a binary classifier to separate the highly unbalanced classes.

Our solution framework
- We will use boosting: combine many simple and inaccurate categorization rules (weak learners) into a single, highly accurate categorization rule.
- The simple rules are trained sequentially; each rule is trained on the examples that are most difficult to classify by the preceding rules.

Outline
- Boosting algorithms
- Weak learners
- Experimental setup
- Results
- Conclusions

Related approaches: AdaBoost
- Given training examples (x_1, y_1), ..., (x_m, y_m) with y_i ∈ {+1, -1}, initialize D_0(i) = 1/m.
- For t = 1...T:
  - pass distribution D_t to the weak learner
  - get weak hypothesis h_t: X → R
  - choose α_t (based on the performance of h_t)
  - update D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t
- Final hypothesis: f(x) = Σ_t α_t h_t(x)
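A minimal sketch of this loop in Python (not the authors' code): the weak-learner interface fit_weak and the confidence-rated choice of α_t below are assumptions made only for illustration.

    import numpy as np

    def adaboost(X, y, fit_weak, T):
        """X: training examples; y: labels in {+1, -1};
        fit_weak(X, y, D) -> callable h giving real-valued predictions on X.
        Returns a list of (alpha_t, h_t) pairs."""
        y = np.asarray(y, dtype=float)
        m = len(y)
        D = np.full(m, 1.0 / m)            # D_0(i) = 1/m
        ensemble = []
        for t in range(T):
            h = fit_weak(X, y, D)          # train on the current distribution D_t
            margins = y * h(X)             # y_i * h_t(x_i)
            # One common (assumed) choice of alpha_t for confidence-rated h_t,
            # valid when |h_t(x)| <= 1:
            r = np.sum(D * margins)
            alpha = 0.5 * np.log((1 + r) / (1 - r))
            D = D * np.exp(-alpha * margins)
            D /= D.sum()                   # normalisation plays the role of Z_t
            ensemble.append((alpha, h))
        return ensemble

    def predict(ensemble, X):
        """Final hypothesis: sign of the weighted sum of weak hypotheses."""
        return np.sign(sum(a * h(X) for a, h in ensemble))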

AdaBoost – intuition
- Weak hypothesis h(x): the sign of h(x) is the predicted binary label, and the magnitude |h(x)| is the confidence.
- α_t controls the influence of each h_t(x).

More boosting algorithms
- Algorithms differ in how the weights D_0(i) (misclassification costs) are initialized and how they are updated.
- 4 boosting algorithms:
  - AdaBoost – greedy approach
  - UBoost – uneven loss function + greedy
  - LPBoost – linear programming (optimal solution)
  - LPUBoost – our proposed solution (LP + uneven)

Boosting algorithm differences
- Given training examples (x_1, y_1), ..., (x_m, y_m) with y_i ∈ {+1, -1}, initialize D_0(i) = 1/m.
- For t = 1...T:
  - pass distribution D_t to the weak learner
  - get weak hypothesis h_t: X → R
  - choose α_t
  - update D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t
- Final hypothesis: f(x) = Σ_t α_t h_t(x)
- The boosting algorithms differ only in these 2 lines: how D_0(i) is initialized, and how α_t is chosen / D_{t+1}(i) is updated.

UBoost – uneven loss function
- Set D_0(i) so that D_0(positive) / D_0(negative) = β.
- Update D_{t+1}(i) so that:
  - the weight of false negatives increases more than that of false positives
  - the weight of true positives decreases less than that of true negatives
- Positive examples therefore maintain a higher weight (misclassification cost).
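One possible reading of this initialization, sketched below under the assumption that every positive example starts with β times the weight of every negative one (the exact normalisation is not stated on the slide):

    import numpy as np

    def uneven_init(y, beta):
        """y: labels in {+1, -1}; beta: cost ratio D_0(positive) / D_0(negative)."""
        D = np.where(np.asarray(y) > 0, beta, 1.0)
        return D / D.sum()   # normalise so the weights form a distribution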

LPBoost – linear programming
- Set D_0(i) = 1/m.
- Update D_{t+1} by solving the LP:
  argmin LPBeta
  s.t. Σ_i D(i) y_i h_k(x_i) ≤ LPBeta, k = 1...t
  where 1/A < D(i) < 1/B
- Set α to the Lagrange multipliers of the constraints.
- If Σ_i D(i) y_i h_t(x_i) < LPBeta, the current solution is optimal (stop).
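A minimal sketch of the per-iteration LP using scipy.optimize.linprog (not the authors' implementation). The matrix H with H[k, i] = h_k(x_i) is hypothetical notation, and the Σ_i D(i) = 1 constraint is an assumption that is standard in LPBoost but not spelled out on the slide.

    import numpy as np
    from scipy.optimize import linprog

    def lpboost_step(H, y, lo, hi):
        """Solve: min LPBeta  s.t.  sum_i D(i) y_i h_k(x_i) <= LPBeta for all k,
        sum_i D(i) = 1,  lo <= D(i) <= hi.  Returns (D, LPBeta, alpha)."""
        t, m = H.shape
        # Decision variables: [D(1), ..., D(m), LPBeta]; objective = LPBeta.
        c = np.concatenate([np.zeros(m), [1.0]])
        # One inequality per weak hypothesis: (y_i * h_k(x_i)) . D - LPBeta <= 0.
        A_ub = np.hstack([H * np.asarray(y, float)[None, :], -np.ones((t, 1))])
        b_ub = np.zeros(t)
        # Assumed normalisation: the weights D form a distribution.
        A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]
        b_eq = np.array([1.0])
        bounds = [(lo, hi)] * m + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds, method="highs")
        D, lpbeta = res.x[:m], res.x[m]
        # alpha = Lagrange multipliers of the inequality constraints
        # (sign conventions differ between solvers; check before use).
        alpha = -res.ineqlin.marginals
        return D, lpbeta, alpha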

LPBoost – intuition
- argmin LPBeta
  s.t. Σ_i D(i) y_i h_k(x_i) ≤ LPBeta, k = 1...t
  where 1/A < D(i) < 1/B
- Picture a table with one column per training-example weight D(1), D(2), D(3), ..., D(m) and one row per weak learner h_1, h_2, ..., h_t; each cell holds the sign (+/-) of y_i h_k(x_i), and the weighted sum of every row must stay ≤ LPBeta.

LPBoost – example
- Same LP as above: Σ_i y_i h_k(x_i) D(i) ≤ LPBeta for each weak learner, where 1/A < D(i) < 1/B.
- A small numeric example with three training-example weights D(1), D(2), D(3) and three weak learners: h_1 gives ... + 0.7 D(2) - 0.2 D(3) ≤ LPBeta, h_2 gives ... - 0.4 D(2) - 0.5 D(3) ≤ LPBeta, h_3 gives ... - 0.1 D(2) - 0.3 D(3) ≤ LPBeta.
- The magnitude of each coefficient is the confidence; a positive sign marks a correctly classified example, a negative sign an incorrectly classified one.

LPUBoost – uneven loss + LP
- Set D_0(i) so that D_0(positive) / D_0(negative) = β.
- Update D_{t+1}: solve the LP, minimizing LPBeta, but set different misclassification-cost bounds for D(i) (β times higher for positive examples).
- The rest is as in LPBoost.
- Note: β is an input parameter; LPBeta is the linear-programming optimization variable.
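In terms of the earlier LP sketch, the only change is the box constraint on D(i). A sketch under the assumption that the upper bound is simply scaled by β for positive examples:

    def lpuboost_bounds(y, lo, hi, beta):
        """Per-example bounds for the LP: a beta-times-higher cost cap for positives.
        Pass this list (the last slot keeps LPBeta unbounded) as `bounds` in place
        of the uniform box used in the lpboost_step sketch above."""
        return [(lo, beta * hi) if yi > 0 else (lo, hi) for yi in y] + [(None, None)]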

Summary of boosting algorithms

Algorithm   Uneven loss function   Converges to global optimum
AdaBoost    no                     no
UBoost      yes                    no
LPBoost     no                     yes
LPUBoost    yes                    yes

Weak learners
- One-level decision tree (IF-THEN rule): if word w occurs in document X, return P, else return N.
- P and N are real numbers chosen based on the misclassification-cost weights D_t(i).
- Interpret the sign of P and N as the predicted binary label and the magnitudes |P| and |N| as the confidence.
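A minimal sketch of such a word stump (not the authors' code); the Schapire-Singer-style choice of P and N from weighted label counts is an assumption made here for concreteness.

    import numpy as np

    def word_stump(word, docs, y, D, eps=1e-6):
        """docs: list of token sets; y: labels in {+1, -1}; D: current example weights.
        Returns (P, N): real-valued outputs for documents that do / do not contain `word`."""
        y, D = np.asarray(y, float), np.asarray(D, float)
        has = np.array([word in d for d in docs])
        # Weighted mass of positive / negative examples in each branch.
        w_pos_in = D[has & (y > 0)].sum();   w_neg_in = D[has & (y < 0)].sum()
        w_pos_out = D[~has & (y > 0)].sum(); w_neg_out = D[~has & (y < 0)].sum()
        # Confidence-rated outputs: sign = predicted label, magnitude = confidence.
        P = 0.5 * np.log((w_pos_in + eps) / (w_neg_in + eps))
        N = 0.5 * np.log((w_pos_out + eps) / (w_neg_out + eps))
        return P, N

    def stump_predict(word, P, N, doc):
        return P if word in doc else N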

Experimental setup
- Reuters newswire articles (Reuters-21578), ModApte split: 9603 training and 3299 test documents.
- 16 categories representing all sizes.
- Train a binary classifier, with 5-fold cross-validation.
- Measures: Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 · Precision · Recall / (Precision + Recall).
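For concreteness, the three measures as a small helper (tp, fp, fn are the per-category counts on the test set; the zero-division handling is an assumption):

    def prf1(tp, fp, fn):
        """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1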

Typical situations
- Balanced training dataset: all learning algorithms show similar performance.
- Unbalanced training dataset:
  - AdaBoost overfits
  - LPUBoost does not overfit; it converges fast using only a few weak learners
  - UBoost and LPBoost are somewhere in between

Balanced dataset [chart]: typical behavior.

Unbalanced dataset [chart]: AdaBoost overfits.

Unbalanced dataset [chart]: LPUBoost needs few iterations (10) and stops once no suitable feature is left.

Reuters categories: F1 on the test set [table]. Columns: AdaBoost and LPBoost (even loss), UBoost and LPUBoost (uneven loss), and SVM. Rows, category (size): EARN (2877), ACQ (1650), MONEY-FX (538), INTEREST (347), CORN (181), GNP (101), CARCASS (50), COTTON (39), MEAL-FEED (30), PET-CHEM (20), LEAD (15), SOY-MEAL (13), GROUNDNUT (5), PLATINUM (5), POTATO (3), NAPHTHA (2), and AVERAGE.

LPUBoost vs. UBoost [chart].

Most important features (stemmed words); format: category (category size) – LPU model size (number of features / words): words
- EARN (2877) – 50: ct, net, profit, dividend, shr
- INTEREST (347) – 70: rate, bank, company, year, pct
- CARCASS (50) – 30: beef, pork, meat, dollar, chicago
- SOY-MEAL (13) – 3: meal, soymeal, soybean
- GROUNDNUT (5) – 2: peanut, cotton (F1 = 0.75)
- PLATINUM (5) – 1: platinum (F1 = 1.0)
- POTATO (3) – 1: potato (F1 = 0.86)

Computational efficiency
- AdaBoost and UBoost are the fastest, being the simplest.
- LPBoost and LPUBoost are a little slower: the LP computation takes much of the time, but since LPUBoost chooses fewer weak hypotheses, its running time becomes comparable to that of AdaBoost.

Conclusions
- LPUBoost is suitable for text categorization on highly unbalanced datasets.
- All the expected benefits (a well-defined stopping criterion, an unequal loss function) show up.
- No overfitting: it is able to find both simple (small) and complicated (large) hypotheses.