FilterBoost: Regression and Classification on Large Datasets
Joseph K. Bradley (Carnegie Mellon University) and Robert E. Schapire (Princeton University)

Typical Framework for Boosting: Batch
[Diagram: a dataset, sampled i.i.d. from the target distribution D, is fed to the booster, which outputs a final hypothesis.]
The booster must always have access to the entire dataset!

Motivation for a New Framework
Batch boosters must always have access to the entire dataset. This limits their applications: e.g., to classify all websites on the WWW (a very large dataset), batch boosters must either use a small subset of the data or use lots of time and space. It also limits their efficiency: each round requires computation on the entire dataset.
Ideas: 1) use a data stream instead of a fixed dataset, and 2) train on new subsets of data each round.

Alternate Framework: Filtering
[Diagram: a data oracle, sampling ~D, feeds examples to the booster, which stores only a tiny fraction of the data and outputs a final hypothesis.]
Boost for 1000 rounds → only store ~1/1000 of the data at a time.
Note: the original boosting algorithm [Schapire '90] was for filtering.

Main Results
– FilterBoost, a novel boosting-by-filtering algorithm
– Provable guarantees: fewer assumptions than previous work; better or equivalent bounds
– Applicable to both classification and conditional probability estimation
– Good empirical performance

Batch Boosting to Filtering
Batch Algorithm
Given: fixed dataset S
For t = 1, …, T:
  Choose distribution D_t over S
  Choose hypothesis h_t
  Estimate the error of h_t with D_t, S
  Give h_t weight α_t
Output: final hypothesis sign[H(x)], where H(x) = Σ_t α_t h_t(x)
D_t gives higher weight to misclassified examples, forcing the booster to learn to correctly classify "harder" examples on later rounds.
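To make the batch loop above concrete, here is a minimal sketch in the AdaBoost style. The decision-stump weak learner, the function names, and the toy numpy interface are illustrative assumptions, not the slides' own code.

```python
# Minimal batch boosting sketch (AdaBoost-style weights) over a fixed dataset S.
import numpy as np

def stump_weak_learner(X, y, weights):
    """Return the best threshold stump (feature, threshold, sign) under D_t."""
    best_h, best_err = None, float("inf")
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] <= thresh, s, -s)
                err = weights[pred != y].sum()       # weighted error under D_t
                if err < best_err:
                    best_h, best_err = (j, thresh, s), err
    return best_h, best_err

def predict_stump(h, X):
    j, thresh, s = h
    return np.where(X[:, j] <= thresh, s, -s)

def batch_boost(X, y, T=20):
    """Batch boosting on a fixed dataset S = (X, y), with y in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                          # distribution D_t over S
    ensemble = []                                    # list of (alpha_t, h_t)
    for t in range(T):
        h, err = stump_weak_learner(X, y, D)         # choose h_t under D_t
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # weight alpha_t
        D *= np.exp(-alpha * y * predict_stump(h, X))  # up-weight mistakes
        D /= D.sum()
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, X):
    """Final hypothesis sign[H(x)] with H(x) = sum_t alpha_t h_t(x)."""
    H = sum(alpha * predict_stump(h, X) for alpha, h in ensemble)
    return np.sign(H)
```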

Batch Boosting to Filtering
The batch algorithm above relies on the fixed dataset S at every step: D_t is a distribution over S, and the error of h_t is estimated on S. In filtering, there is no dataset S! We need to think about D_t in a different way.

Batch Boosting to Filtering
[Diagram: the data oracle (~D) feeds examples to the filter, which accepts or rejects them; accepted examples are distributed ~D_t.]
Key idea: simulate D_t using the Filter mechanism (rejection sampling).

FilterBoost: Main Algorithm
Given: Oracle
For t = 1, …, T:
  Filter_t gives access to D_t
  Draw an example set from the Filter → choose hypothesis h_t
  Draw a new example set from the Filter → estimate the error of h_t
  Give h_t weight α_t
Output: final hypothesis sign[H(x)], where H(x) = Σ_t α_t h_t(x)
[Diagram: the data oracle feeds Filter_1; the weak learner trained on filtered examples produces weak hypothesis h_1, whose error is estimated and which is added to H(x) with weight α_1.]
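Below is a minimal sketch of the main loop as listed on this slide. It assumes a `draw_from_filter(oracle, H, weight)` rejection-sampling routine (sketched after the Filter slides below), a weak learner, and a data oracle; the names, batch sizes, and the α_t formula are illustrative assumptions rather than the paper's exact specification.

```python
# Sketch of the FilterBoost main loop: train on one filtered batch, estimate
# the weak hypothesis' error on a fresh filtered batch, then weight it.
import math

def filterboost(oracle, weak_learner, weight, T=100, batch_size=1000):
    ensemble = []                                      # list of (alpha_t, h_t)

    def H(x):                                          # current combined hypothesis
        return sum(alpha * h(x) for alpha, h in ensemble)

    for t in range(T):
        # Draw an example set from Filter_t (~D_t) and train the weak learner on it.
        train = [draw_from_filter(oracle, H, weight) for _ in range(batch_size)]
        h_t = weak_learner(train)

        # Draw a *fresh* example set from the filter to estimate the error of h_t.
        test = [draw_from_filter(oracle, H, weight) for _ in range(batch_size)]
        err = sum(1 for x, y in test if h_t(x) != y) / len(test)
        err = min(max(err, 1e-10), 1 - 1e-10)

        ensemble.append((0.5 * math.log((1 - err) / err), h_t))  # weight alpha_t

    return lambda x: 1 if H(x) >= 0 else -1            # final hypothesis sign[H(x)]
```

Note that only the current batch of examples (plus the ensemble itself) is ever held in memory, which is the point of the filtering framework.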

The Filter
[Diagram: the oracle (~D) feeds examples to the filter, which accepts or rejects them; accepted examples are distributed ~D_t.]
Recall: we want to simulate D_t using rejection sampling, where D_t gives high weight to badly misclassified examples.

The Filter
Label = +1, booster predicts −1 → high weight → high probability of being accepted.
Label = −1, booster predicts −1 → low weight → low probability of being accepted.
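A minimal sketch of the rejection-sampling filter described above: the oracle produces examples ~D, and each example (x, y) is accepted with a probability q(x, y) in [0, 1] measuring how badly the current booster H misclassifies it. Concrete choices of q are discussed on the next slides; the oracle interface and names here are illustrative assumptions.

```python
# Rejection-sampling filter: accepted examples are distributed ~D_t,
# which up-weights badly misclassified examples.
import random

def draw_from_filter(oracle, H, weight):
    """Repeatedly draw (x, y) ~ D from the oracle and accept it with
    probability weight(H, x, y); return the first accepted example."""
    while True:
        x, y = oracle()                         # (x, y) sampled i.i.d. from D
        if random.random() < weight(H, x, y):   # accept with probability q_t(x, y)
            return x, y
```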

The Filter
What should D_t be? D_t must give high weight to misclassified examples.
Idea: the filter accepts (x, y) with probability proportional to the error of the booster's prediction H(x) with respect to y.
AdaBoost [Freund & Schapire '97]: exponential weights put too much weight on a few examples and can't be used for filtering!

The Filter
MadaBoost [Domingo & Watanabe '00]: truncated exponential weights do work for filtering.

The Filter
FilterBoost is based on a variant of AdaBoost for logistic regression [Collins, Schapire & Singer '02]. Minimizing logistic loss leads to logistic weights.
[Plot: the MadaBoost and FilterBoost weight functions.]
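The weight formulas themselves are not reproduced on the slide; below is a sketch of the standard forms of the three schemes named above, written in terms of the current combined hypothesis H(x) and label y ∈ {−1, +1} (the exact normalizations used on the original slides are an assumption).

```latex
% Weighting schemes discussed above (standard forms; exact normalizations assumed):
\begin{align*}
  \text{AdaBoost (exponential):} \quad & q(x,y) \propto e^{-y H(x)} \\
  \text{MadaBoost (truncated):}  \quad & q(x,y) = \min\!\bigl(1,\; e^{-y H(x)}\bigr) \\
  \text{FilterBoost (logistic):} \quad & q(x,y) = \frac{1}{1 + e^{\,y H(x)}}
\end{align*}
```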

FilterBoost: Analysis
(Refer to the main algorithm above: each round, draw an example set from the Filter to choose h_t, draw a new example set to estimate its error, and give h_t weight α_t.)
Step 1: How long does the filter take to produce an example? If the filter takes too long, H(x) is already accurate enough.
Step 2: How many boosting rounds are needed? If the weak hypotheses have errors bounded away from 1/2, we make "significant progress" each round.
Step 3: How can we estimate the weak hypotheses' errors? We use adaptive sampling [Watanabe '00].
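A minimal sketch of Step 3: estimating a weak hypothesis' edge γ_t = 1/2 − err_t from filtered examples, drawing samples in geometric stages with a Hoeffding confidence interval deciding when to stop. This is one standard realization of adaptive sampling; the exact procedure of [Watanabe '00] used by FilterBoost differs in its constants and stopping rule.

```python
# Adaptive sampling sketch: stop once the edge estimate is relatively accurate.
import math

def estimate_edge(draw_example, h, delta=0.05, n0=100, max_stages=20):
    """Estimate the edge of h under D_t, drawing examples from
    `draw_example()` (e.g. the rejection-sampling filter sketched earlier)."""
    correct, total = 0, 0
    edge_hat = 0.0
    for k in range(max_stages):
        for _ in range(n0 * 2 ** k):            # stage k draws n0 * 2^k examples
            x, y = draw_example()
            correct += int(h(x) == y)
            total += 1
        edge_hat = correct / total - 0.5        # empirical edge gamma_hat
        delta_k = delta / (2 ** (k + 1))        # union bound over stages
        half_width = math.sqrt(math.log(2.0 / delta_k) / (2 * total))
        if half_width <= abs(edge_hat) / 2:     # stop once relatively accurate
            return edge_hat
    return edge_hat                             # fall back after max_stages
```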

FilterBoost: Analysis
Theorem: Assume the weak hypotheses have edges (1/2 − error) of at least γ > 0, and let ε be the target error rate. Then FilterBoost produces a final hypothesis H(x) with error ≤ ε within T rounds, where T is bounded as below.
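The bound on T did not survive on this slide; the following is a sketch of its standard form, where the 1/ε dependence matches the comparison table below and the γ dependence is the usual one for boosting (treat the exact expression and constants as an assumption, not the slide's statement).

```latex
% Sketch of the round bound (exact constants are an assumption):
T \;=\; O\!\left(\frac{1}{\varepsilon \, \gamma^{2}}\right)
```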

Previous Work vs. FilterBoost

Algorithm                          | Assumes edges decreasing? | Needs a priori bound on γ? | Infinite weak hypothesis spaces? | # rounds
MadaBoost [Domingo & Watanabe '00] | Y                         | N                          | Y                                | 1/ε
[Bshouty & Gavinsky '02]           | N                         | Y                          | Y                                | log(1/ε)
AdaFlat [Gavinsky '03]             | N                         | N                          | Y                                | 1/ε²
GiniBoost [Hatano '06]             | N                         | N                          | N                                | 1/ε
FilterBoost                        | N                         | N                          | Y                                | 1/ε

FilterBoost's Versatility
FilterBoost is based on logistic regression, so it may be applied directly to conditional probability estimation. FilterBoost may also use confidence-rated predictions (real-valued weak hypotheses).
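Because FilterBoost minimizes the logistic loss, the combined hypothesis H(x) can be converted into a conditional probability estimate through the logistic link; the formula below is the standard form for logistic-loss boosting (an assumption here, since the slide itself shows no formula).

```latex
% Conditional probability estimate from the combined hypothesis H(x):
\hat{P}(y = +1 \mid x) \;=\; \frac{1}{1 + e^{-H(x)}}
```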

Experiments
Tested FilterBoost against other batch and filtering boosters: MadaBoost, AdaBoost, Logistic AdaBoost.
Synthetic and real data; tested on both classification and conditional probability estimation.

Experiments: Classification
Noisy majority vote data, WL = decision stumps, 500,000 examples.
[Plot: test accuracy vs. time (sec) for FilterBoost, AdaBoost, AdaBoost w/ resampling, and AdaBoost w/ confidence-rated predictions.]
FilterBoost achieves optimal accuracy fastest.

Experiments: Conditional Probability Estimation
Noisy majority vote data, WL = decision stumps, 500,000 examples.
[Plot: RMSE vs. time (sec) for FilterBoost, AdaBoost (resampling), and AdaBoost (confidence-rated).]

Summary
– FilterBoost is applicable to learning with large datasets.
– Fewer assumptions and better bounds than previous work.
– Validated empirically on classification and conditional probability estimation; appears much faster than batch boosting without sacrificing accuracy.