Why does averaging work?


Why does averaging work? Yoav Freund AT&T Labs

Plan of talk: generative vs. non-generative modeling; boosting; boosting and over-fitting; bagging and over-fitting; applications.

Toy example: a computer receives a telephone call, measures the pitch of the caller's voice, and decides whether the caller is male or female.

Generative modeling. [Figure: two class-conditional probability densities over voice pitch, one per gender, each described by its mean and variance (mean1, var1 and mean2, var2).]
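To make the generative picture concrete, here is a minimal sketch (not from the talk) that fits one Gaussian per class to a one-dimensional pitch feature and classifies a new call by comparing class likelihoods; the sample values and names are illustrative assumptions.

```python
import numpy as np

# Illustrative pitch samples (Hz) for the two classes -- made-up numbers.
male_pitch = np.array([110.0, 120.0, 105.0, 130.0, 115.0])
female_pitch = np.array([210.0, 195.0, 220.0, 205.0, 230.0])

def fit_gaussian(samples):
    """Estimate mean and variance of a 1-D Gaussian from samples."""
    return samples.mean(), samples.var()

def log_likelihood(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

params = {"male": fit_gaussian(male_pitch), "female": fit_gaussian(female_pitch)}

def classify(pitch):
    # Equal class priors assumed; pick the class with the higher likelihood.
    return max(params, key=lambda c: log_likelihood(pitch, *params[c]))

print(classify(125.0))  # -> 'male'
print(classify(215.0))  # -> 'female'
```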

Discriminative approach. [Figure: number of mistakes as a function of the voice-pitch decision threshold.]

Ill-behaved data. [Figure: class-conditional densities over voice pitch (with means mean1, mean2) shown alongside the number of mistakes at each threshold.]

Traditional statistics vs. machine learning. [Diagram: Data -> Statistics -> Estimated world state -> Decision Theory -> Predictions / Actions.]

Another example: two-dimensional data (e.g., pitch and volume). Generative model: logistic regression. Discriminative model: a separating line ("perceptron").

Model: a weight vector w. Find w to maximize [two consecutive slides; the model and objective formulas were shown as images and are not in the transcript].
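As one plausible reading of the "find w to maximize" slides, the sketch below fits the logistic model mentioned above by maximizing the log-likelihood with plain gradient ascent. The untranscribed objective is assumed to be the log-likelihood, and all names and data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-dimensional data (e.g., pitch and volume), labels in {0, 1}.
X = np.vstack([rng.normal([-1.0, -1.0], 1.0, (50, 2)),
               rng.normal([+1.0, +1.0], 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Maximize the log-likelihood sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)],
# where p_i = sigmoid(w . x_i + b), by gradient ascent.
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (y - p)   # gradient of the log-likelihood w.r.t. w
    grad_b = np.sum(y - p)   # gradient w.r.t. the bias
    w += lr * grad_w / len(y)
    b += lr * grad_b / len(y)

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(w, b, accuracy)
```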

Comparison of methodologies

                       Generative               Discriminative
Goal                   Probability estimates    Classification rule
Performance measure    Likelihood               Misclassification rate
Mismatch problems      Outliers                 Misclassifications

Boosting

Adaboost as gradient descent
Discriminator class: a linear discriminator in the space of "weak hypotheses".
Original goal: find the hyperplane with the smallest number of mistakes. This is known to be an NP-hard problem (no algorithm is known that runs in time polynomial in d, the dimension of the space).
Computational method: use the exponential loss as a surrogate and perform gradient descent.
Unforeseen benefit: out-of-sample error keeps improving even after in-sample error reaches zero.

Margins view. [Figure: the prediction and the margin defined in terms of the weight vector w; a plot of the cumulative number of examples vs. margin, with mistakes at negative margins and correct predictions at positive margins.]

Adaboost et al. [Figure: loss as a function of margin for Adaboost's exponential loss, Logitboost, Brownboost, and the 0-1 loss; negative margins correspond to mistakes, positive margins to correct predictions.]

One coordinate at a time
Adaboost performs gradient descent on the exponential loss, adding one coordinate ("weak learner") at each iteration (see the sketch below).
Weak learning in binary classification means doing slightly better than random guessing; for regression the right notion is unclear.
Adaboost uses example weights to communicate the gradient direction to the weak learner, which solves a computational problem.
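The following is a minimal sketch (not from the talk) of this coordinate-at-a-time view: decision stumps stand in for the weak learner, the example weights carry the gradient information, and alpha is the step size along the chosen coordinate. The exhaustive stump search and all names are illustrative choices.

```python
import numpy as np

def best_stump(X, y, w):
    """Weak learner: the decision stump (j, thr, s) with smallest weighted error,
    predicting h(x) = s * sign(x[j] - thr)."""
    best = (None, None, 1, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.sign(X[:, j] - thr + 1e-12)
                err = np.sum(w[pred != y])
                if err < best[3]:
                    best = (j, thr, s, err)
    return best

def adaboost(X, y, rounds=20):
    """y must be in {-1, +1}. Returns a list of (alpha, stump) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # example weights: the "gradient direction"
    ensemble = []
    for _ in range(rounds):
        j, thr, s, err = best_stump(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)    # step size along the new coordinate
        pred = s * np.sign(X[:, j] - thr + 1e-12)
        w *= np.exp(-alpha * y * pred)           # reweight: focus on the mistakes
        w /= w.sum()
        ensemble.append((alpha, (j, thr, s)))
    return ensemble

def predict(ensemble, X):
    score = np.zeros(len(X))
    for alpha, (j, thr, s) in ensemble:
        score += alpha * s * np.sign(X[:, j] - thr + 1e-12)
    return np.sign(score)
```

The exhaustive stump search keeps the example self-contained but is slow; any weak learner that accepts example weights can be substituted.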

Boosting and over-fitting

Curious phenomenon: boosting decision trees. Using fewer than 10,000 training examples, we fit more than 2,000,000 parameters.

Explanation using margins. [Figure: the 0-1 loss as a function of margin.]

Explanation using margins. [Figure: the same plot after boosting; the margin distribution has shifted so that no examples have small margins.]

Experimental Evidence

Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics, 1998). For any convex combination of weak rules and any threshold, the probability of a mistake is bounded by the fraction of training examples with small margin plus a term that depends on the size of the training sample and the VC dimension of the weak rules, with no dependence on the number of weak rules that are combined!
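For reference, the bound has the following standard schematic form, reconstructed from the cited paper rather than transcribed from the slide, so the constants and logarithmic factors are only indicative. Here f is the convex combination of weak rules, theta the margin threshold, m the training-sample size, d the VC dimension of the weak-rule class, D the true distribution, and S the empirical distribution over the training sample.

```latex
% Margin bound of Schapire, Freund, Bartlett & Lee (1998), schematic form.
\[
  \Pr_{(x,y)\sim D}\bigl[\, y f(x) \le 0 \,\bigr]
  \;\le\;
  \Pr_{(x,y)\sim S}\bigl[\, y f(x) \le \theta \,\bigr]
  \;+\;
  \tilde{O}\!\left( \sqrt{ \frac{d}{m\,\theta^{2}} } \right)
\]
```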

Suggested optimization problem. [Figure: an optimization criterion stated in terms of the margin; details not transcribed.]

Idea of Proof

Bagging and over-fitting

A metric space of classifiers. [Figure: the classifier space and the example space, with classifiers f and g at distance d.] Distance: d(f, g) = P( f(x) ≠ g(x) ), so neighboring models make similar predictions.

Confidence zones. [Figure: number of mistakes vs. voice pitch; zones where the prediction is certain in either direction, the bagged variants, and their majority vote.]

Bagging
Averaging over all good models increases stability (decreases dependence on the particular sample).
If there is a clear majority, the prediction is very reliable (unlike Florida).
Bagging is a randomized algorithm for sampling the neighborhood of the best model (sketched below).
Ignoring computation, averaging over the best models can be done deterministically and yields provable bounds.
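Below is a minimal sketch (not from the talk) of the randomized sampling view of bagging: each base model is trained on a bootstrap resample and predictions are combined by majority vote. The unweighted stump learner and all names are illustrative.

```python
import numpy as np

def train_stump(X, y):
    """Unweighted decision stump: best (feature, threshold, sign) by training error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                err = np.mean(s * np.sign(X[:, j] - thr + 1e-12) != y)
                if err < best_err:
                    best, best_err = (j, thr, s), err
    return best

def bagging(X, y, n_models=25, seed=0):
    """Train each stump on a bootstrap resample of the data (y in {-1, +1})."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))   # sample with replacement
        models.append(train_stump(X[idx], y[idx]))
    return models

def majority_vote(models, X):
    votes = np.zeros(len(X))
    for j, thr, s in models:
        votes += s * np.sign(X[:, j] - thr + 1e-12)
    return np.sign(votes)   # the magnitude of `votes` indicates how clear the majority is
```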

A pseudo-Bayesian method (Freund, Mansour, Schapire 2000)
Define a prior distribution p(h) over rules.
Get m training examples; for each h, set err(h) = (number of mistakes h makes) / m.
Assign each rule a weight that grows with its prior and shrinks with its training error [formula not transcribed], and let l(x) be the log odds of the weighted vote at x (see the sketch below).
Prediction(x): +1 if l(x) exceeds a positive threshold, -1 if l(x) falls below the negative threshold, and "insufficient data" otherwise.
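The weight formula was an image on the slide; the sketch below assumes the standard exponential-weights form w(h) proportional to p(h) * exp(-eta * m * err(h)) and an abstention threshold delta. Both eta and delta are assumptions, not the talk's exact constants.

```python
import numpy as np

def pseudo_bayes_predict(x, rules, prior, err, m, eta=2.0, delta=0.5):
    """Predict +1 / -1 / 0 ("insufficient data") from the weighted log odds.

    rules      : list of classifiers h with h(x) in {-1, +1}
    prior      : array of prior probabilities p(h)
    err        : array of training error rates err(h)
    m          : number of training examples
    eta, delta : assumed temperature and abstention threshold
    """
    weights = prior * np.exp(-eta * m * err)         # assumed exponential weights
    plus = sum(w for w, h in zip(weights, rules) if h(x) == +1)
    minus = sum(w for w, h in zip(weights, rules) if h(x) == -1)
    l = np.log((plus + 1e-12) / (minus + 1e-12))     # log odds of the weighted vote
    if l > delta:
        return +1
    if l < -delta:
        return -1
    return 0                                          # insufficient data: abstain
```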

Applications

Applications of boosting: academic research, applied research, commercial deployment.

Academic research: % test error rates

Database    Other                   Boosting   Error reduction
Cleveland   27.2 (DT)               16.5       39%
Promoters   22.0 (DT)               11.8       46%
Letter      13.8 (DT)               3.5        74%
Reuters 4   5.8, 6.0, 9.8           2.95       ~60%
Reuters 8   11.3, 12.1, 13.4        7.4        ~40%

Cleveland: heart disease; features include blood pressure, etc.
Promoters: DNA; indicates that an intron is coming.
Letter: standard black-and-white OCR benchmark (Frey & Slate 1991); Roman alphabet, 20 distorted fonts, 16 given attributes (# pixels, moments, etc.).
Reuters-21450: 12,902 documents, standard benchmark; Reuters 4 baselines include Naive Bayes (CMU) and Rocchio.

Applied research: "AT&T, How may I help you?" (Schapire, Singer, Gorin 1998)
Classify voice requests: voice -> text -> category.
Fourteen categories: area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge, time.
AT&T service means signing up for or dropping AT&T service.

Examples
"Yes I'd like to place a collect call long distance please" -> collect
"Operator I need to make a call but I need to bill it to my office" -> third party
"Yes I'd like to place a call on my master card please" -> calling card
"I just called a number in Sioux City and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill" -> billing credit

Weak rules generated by "boostexter". [Table: for the categories collect call, calling card, and third party, each weak rule tests whether a particular word occurs or does not occur in the request.]

Results
7,844 training examples; 1,000 hand-transcribed and 1,000 machine-transcribed test examples.
Accuracy when 20% of the calls are rejected: 75% on machine-transcribed text, 90% on hand-transcribed text.

Commercial deployment (Freund, Mason, Rogers, Pregibon, Cortes 2000)
Distinguish business from residence customers using statistics from call-detail records.
Alternating decision trees: similar to boosted decision trees but more flexible; they combine very simple rules and can over-fit, so cross-validation is used to decide when to stop.
In use for 3 years for fraud detection.

Massive datasets
260M calls per day; 230M telephone numbers; the label is unknown for ~30% of them.
Hancock: software for computing statistical signatures.
100K randomly selected training examples (~10K is enough); training takes about 2 hours.
The generated classifier has to be both accurate and efficient.
Call detail: originating, called, and terminating numbers (e.g., for 800 numbers or forwarding), start and end time, quality, termination code (e.g., for wireless); no rate information.
Statistics: number of calls, distribution within the day and week, number of distinct numbers called, type (800, toll, etc.), incoming vs. outgoing.

Alternating tree for "bizocity" (the business-vs-residence score). [Figure: the learned alternating decision tree.]

Alternating Tree (Detail)

Precision/recall graphs. [Figure: accuracy as a function of score.]

Business impact
Increased coverage from 44% to 56% at ~94% accuracy.
Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.

Summary
Boosting is a computational method for learning accurate classifiers.
Its resistance to over-fitting is explained by margins.
The underlying explanation: large "neighborhoods" of good classifiers.
Boosting has been applied successfully to a variety of classification problems.

Please come talk with me. Questions welcome! Data, even better!! Potential collaborations, yes!!! Yoav@research.att.com, www.predictiontools.com