A New Boosting Algorithm Using Input-Dependent Regularizer


A New Boosting Algorithm Using Input-Dependent Regularizer
Rong Jin (1), Yan Liu (2), Luo Si (2), Jamie Carbonell (2), Alex G. Hauptmann (2)
(1) Michigan State University, (2) Carnegie Mellon University

Outline
- Introduction to the AdaBoost algorithm
- Problems with AdaBoost
- New boosting algorithm: input-dependent regularizer
- Experiments
- Conclusion and future work

AdaBoost Algorithm (I)
- Goal: boost a weak classifier into a strong classifier by linearly combining an ensemble of weak classifiers.
- Given: a weak learner that produces classifiers h(x) with a large classification error E_{(x,y)~P(x,y)}[h(x) ≠ y].
- Output: H_T(x) = α_1 h_1(x) + α_2 h_2(x) + … + α_T h_T(x) with a low classification error E_{(x,y)~P(x,y)}[H_T(x) ≠ y].
- Theoretically and empirically effective at improving performance.
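For concreteness, the quantities on this slide written out in LaTeX; the convention that the final label is the sign of the combined score is standard AdaBoost usage rather than something stated explicitly above:

```latex
\[
H_T(x) = \sum_{t=1}^{T} \alpha_t\, h_t(x), \qquad
\hat{y}(x) = \operatorname{sign}\big(H_T(x)\big),
\]
\[
\operatorname{err}(H_T) = \mathbb{E}_{(x,y)\sim P(x,y)}
\big[\mathbf{1}\{\operatorname{sign}(H_T(x)) \neq y\}\big].
\]
```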

AdaBoost Algorithm (II)
The detailed algorithm involves two steps: updating the sampling distribution and combining the base classifiers (a minimal sketch of the full loop is given after this slide).
- Sampling distribution D_t(x): initialized to the uniform distribution, then updated so that it focuses on the examples that are misclassified or weakly classified by the previous weak classifiers.
- Combining weak classifiers: the ensemble is a linear combination whose combination constants are computed to minimize the training error.
- Choice of α_t: α_t = (1/2) ln((1 - ε_t) / ε_t), where ε_t is the weighted error of h_t under D_t.
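The two steps above translate into a very short training loop. Below is a minimal AdaBoost sketch in Python; the use of scikit-learn decision stumps as the weak learner is an illustrative choice, not something specified in the slides:

```python
# Minimal AdaBoost sketch (illustration only, not the authors' code).
# Assumes binary labels y in {-1, +1}; weak learners are decision stumps.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    y = np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)               # sampling distribution, initially uniform
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=D)       # train weak classifier on current distribution
        pred = h.predict(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)   # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)                    # choice of alpha_t
        D *= np.exp(-alpha * y * pred)     # misclassified examples gain weight
        D /= D.sum()
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    H = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(H)
```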

Problem 1: Overfitting
Since AdaBoost is a greedy algorithm, its overfitting behaviour has been studied extensively.
- Early studies: AdaBoost seldom overfits; it not only minimizes the training error but also tends to maximize the classification margin (Onoda & Müller, 1998; Friedman et al., 1998).
- More recent studies: AdaBoost does overfit when the data are noisy (Dietterich, 2000; Rätsch & Müller, 2000; Grove & Schuurmans, 1998).
- The sampling distribution D_t(x) can place too much emphasis on noisy patterns, so it is no longer representative of most of the data.
- The cause is the "hard margin" criterion (Rätsch et al., 2000), i.e. the maximal margin on the noisy data patterns: enforcing it can make the margin of the remaining data decrease significantly and force the generalization error bound to increase.
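The "margin" mentioned here is the normalized boosting margin; the definition below is the standard one from the boosting literature and is not spelled out on the slide:

```latex
\[
\operatorname{margin}(x_i, y_i) =
\frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\sum_{t=1}^{T} \alpha_t}
\in [-1, 1].
\]
% A large positive margin means the example is classified correctly with high
% confidence; the hard-margin view maximizes the minimum margin, which lets a
% few noisy examples dominate the solution.
```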

Problem 1: Overfitting (Solutions)
To address overfitting, several strategies have been proposed: change the cost function, add a regularizer (a similar idea to ridge regression), or introduce a soft margin as in SVMs. Rather than only minimizing the training error, typical solutions include (the epsilon-boosting idea is sketched after this list):
- Smoothing the combination constants (Schapire & Singer, 1998)
- Epsilon boosting, which is equivalent to L1 regularization (Friedman & Tibshirani, 1998)
- Boosting with a soft margin (Rätsch et al., 2000)
- BrownBoost, which uses a non-monotonic cost function (Freund, 2001)
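As an illustration of the epsilon-boosting idea, the AdaBoost loop sketched earlier changes in one line: each weak classifier receives a small fixed weight ε instead of the error-minimizing α_t. This is a sketch of the general shrinkage idea under that assumption, not the specific variant evaluated in the paper:

```python
# Epsilon-boosting sketch: as the AdaBoost loop above, but with a small fixed
# step eps instead of the optimal alpha_t. Many tiny steps behave like an
# L1-regularized (shrunken) coefficient path.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def epsilon_boost_fit(X, y, T=500, eps=0.01):
    y = np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=D)
        pred = h.predict(X)
        alpha = eps                       # fixed small step, no line search
        D *= np.exp(-alpha * y * pred)    # much gentler re-weighting
        D /= D.sum()
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas
```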

Problem 2: Why a Linear Combination?
- Each weak classifier h_t(x) is trained on a different sampling distribution D_t(x), so it is only good for particular types of input patterns; {h_t(x)} is therefore a diverse ensemble.
- A linear combination with fixed constants cannot exploit the full strength of the diverse ensemble {h_t(x)}.
- Solution: make the combination constants input dependent (a similar idea to the hierarchical mixture of experts model of Michael Jordan).

Input-Dependent Regularizer
- Solves both problems at once: overfitting and the constant, input-independent combination weights.
- Main idea: use a different combination form, in which an input-dependent regularizer scales the contribution of each weak classifier.
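The combination form is not reproduced on the slide, but the behaviour described here and on the next slide (each h_t counts only where the previous ensemble H_{t-1} is uncertain) corresponds to a form along the following lines, with β a regularization constant; treat the exact expression as a reconstruction rather than a quotation from the paper:

```latex
\[
H_T(x) = \sum_{t=1}^{T} \alpha_t \, e^{-\beta\,|H_{t-1}(x)|} \, h_t(x),
\qquad H_0(x) \equiv 0 .
\]
% The factor exp(-beta*|H_{t-1}(x)|) is the input-dependent regularizer: close
% to 1 where the previous ensemble is uncertain, close to 0 where it is already
% confident.
```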

Role of the Regularizer
- Regularizer: prevents |H_T(x)| from growing too fast, so the effective loss behaves polynomially rather than exponentially.
  - Theorem: if all α_t are bounded by some α_max, then |H_T(x)| ≤ a ln(bT + c).
  - For the linear combination in AdaBoost, |H_T(x)| can grow as O(T).
- Router: an input-dependent combination constant.
  - The prediction of h_t(x) is used only where |H_{t-1}(x)| is small.
  - This is consistent with the training procedure: h_t(x) is trained on the examples on which H_{t-1}(x) is uncertain.
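A quick numerical check of the logarithmic-growth claim, using the reconstructed combination form above (the constants below are arbitrary illustrative choices): feed the recursion a weak classifier that always outputs +1 with a constant weight and compare against the plain linear sum.

```python
# Illustration (not from the paper): with h_t(x) = +1 and constant alpha, the
# input-dependent form grows roughly like (1/beta) * ln(T), whereas the plain
# linear combination grows linearly in T.
import math

alpha, beta, T = 0.5, 1.0, 1000
H_reg, H_lin = 0.0, 0.0
for t in range(1, T + 1):
    H_reg += alpha * math.exp(-beta * abs(H_reg))  # regularized update
    H_lin += alpha                                 # plain AdaBoost-style sum
    if t in (10, 100, 1000):
        print(f"T={t:5d}  regularized |H| = {H_reg:5.2f}   linear |H| = {H_lin:6.1f}")
# With these constants the regularized score stays near ln(0.5*T + 1),
# while the linear sum reaches 0.5*T.
```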

WeightBoost Algorithm (1)
- Similar to AdaBoost: minimize the exponential cost function.
- Training setup:
  - h_i(x): x → {+1, -1}, a basis (weak) classifier
  - H_T(x): a combination of the basic classifiers
- Goal: minimize the training error.
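The exponential cost named here is the same surrogate loss that AdaBoost minimizes over the training set (a standard formulation, spelled out for reference):

```latex
\[
\mathcal{L}(H_T) = \sum_{i=1}^{n} \exp\!\big(-y_i\, H_T(x_i)\big),
\]
% an upper bound on the number of training errors, since
% exp(-y H(x)) >= 1 whenever sign(H(x)) != y.
```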

WeightBoost Algorithm (2)
- The update still emphasizes misclassified data patterns, while the input-dependent factor avoids over-emphasis on noisy data patterns.
- As simple as AdaBoost!
- Choice of α_t: as in AdaBoost, set at each iteration to minimize the cost (a sketch of the resulting loop follows below).
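A rough sketch of how such a training loop could look, built on the reconstructed combination form above. The particular distribution update and the AdaBoost-style formula for α_t are assumptions made for illustration; the paper's exact derivation may differ:

```python
# WeightBoost-style training sketch (illustrative reconstruction, not the
# authors' code). Assumes y in {-1, +1}; beta controls the regularizer.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weightboost_fit(X, y, T=50, beta=0.5):
    y = np.asarray(y)
    n = len(y)
    H = np.zeros(n)                          # running ensemble score H_{t-1}(x_i)
    stumps = []
    for t in range(T):
        gate = np.exp(-beta * np.abs(H))     # input-dependent regularizer
        D = gate * np.exp(-y * H)            # emphasize mistakes, damp confident regions
        D /= D.sum()
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)    # assumption: AdaBoost-style step
        H += alpha * gate * pred                 # h_t counts only where H was uncertain
        stumps.append((h, alpha))
    return stumps, beta

def weightboost_predict(model, X):
    stumps, beta = model
    H = np.zeros(len(X))
    for h, alpha in stumps:
        H += alpha * np.exp(-beta * np.abs(H)) * h.predict(X)
    return np.sign(H)
```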

Empirical Studies
- Datasets: eight UCI datasets with binary class labels.
- Methods compared against WeightBoost:
  - AdaBoost
  - WeightDecay Boost: close to L2 regularization
  - Epsilon Boosting: related to L1 regularization

Experiment 1: Effectiveness
- Comparison with AdaBoost: WeightBoost performs better than AdaBoost, and in many cases substantially better.

Experiment 2: Beyond Regularization
- Comparison with the other regularized boosting methods (WeightDecay Boost and Epsilon Boost): WeightBoost performs slightly better overall, and better than both in several individual cases.

Experiment 3: Resistance to Noise
- Setup: randomly select 10%, 20%, or 30% of the training examples and assign them random labels (the procedure is sketched below).
- Results for 10% noise: WeightBoost is more resistant to the injected label noise than AdaBoost; in several cases where AdaBoost overfits the noise, WeightBoost still performs well.
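The noise-injection step described above is easy to reproduce; a small sketch (the function name and defaults are illustrative):

```python
# Assign random labels to a given fraction of the training examples.
import numpy as np

def inject_label_noise(y, fraction=0.10, seed=None):
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    n_noisy = int(round(fraction * len(y_noisy)))
    idx = rng.choice(len(y_noisy), size=n_noisy, replace=False)  # randomly selected subset
    y_noisy[idx] = rng.choice([-1, 1], size=n_noisy)             # random labels
    return y_noisy
```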

Experiments with Text Categorization
- Reuters-21578 corpus, 10 most popular categories: WeightBoost improves performance on 7 out of 10 categories.

Conclusion and Future Work
- Introduced an input-dependent regularizer into the combination form:
  - prevents |H(x)| from increasing too fast, which makes the algorithm resistant to training noise;
  - 'routes' a test pattern to its appropriate classifiers, which improves classification accuracy beyond standard regularization.
- Future research issues:
  - How should the regularization constant be determined?
  - Are there other useful input-dependent regularizers?