
Boosting and other Expert Fusion Strategies

References: Chapter 9.5 of Duda, Hart & Stork; Leo Breiman's Boosting/Bagging/Arcing presentation. Presentation adapted from: Rishi Sinha, Robin Dhamankar

Types of Multiple Experts Single expert on full observation space Single expert for sub regions of observation space (Trees) Multiple experts on full observation space Multiple experts on sub regions of observation space

Types of Multiple Experts Training Use full observation space for each expert Use different observation features for each expert Use different observations for each expert Combine the above

Online Experts Selection N strategies (experts). At time t: – Learner A chooses a distribution over the N experts – Let p_t(i) be the probability of the i-th expert – ∑_i p_t(i) = 1, and for a loss vector l_t the loss at time t is ∑_i p_t(i) l_t(i) – Assume bounded losses, l_t(i) ∈ [0,1]

Experts Algorithm: Greedy For each expert i define its cumulative loss L_i^t = ∑_{s ≤ t} l_s(i). Greedy: at time t choose the expert with minimum cumulative loss so far, namely arg min_i L_i^t

Greedy Analysis Theorem: Let L_G^T be the loss of Greedy at time T; then L_G^T is bounded in terms of the loss of the best expert (bound and proof in notes). Weakness: relies on a single expert for every observation

Better Multiple Experts Algorithms Would like to bound the regret relative to the best expert, L_A − min_i L_i. Better bound: the Hedge algorithm, which utilizes all experts for each observation

Multiple Experts Algorithm: Hedge Maintain a weight vector w_t at time t. Probabilities: p_t(k) = w_t(k) / ∑_j w_t(j). Initialization: w_1(i) = 1/N. Update: w_{t+1}(k) = w_t(k) · U_b(l_t(k)), where b ∈ [0,1] and b^r ≤ U_b(r) ≤ 1 − (1−b)·r
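A minimal Python sketch of this update (an illustrative sketch, not code from the lecture): it uses U_b(r) = b^r, which satisfies the stated bounds because b^r is convex in r and equals 1 − (1−b)r at r = 0 and r = 1; the loss_matrix argument is a hypothetical T×N array of per-round expert losses in [0,1].

import numpy as np

def hedge(loss_matrix, b=0.5):
    # loss_matrix: T x N array, entry [t, i] is l_t(i) in [0, 1]
    T, N = loss_matrix.shape
    w = np.ones(N) / N                  # w_1(i) = 1/N
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                 # p_t(i) = w_t(i) / sum_j w_t(j)
        l = loss_matrix[t]
        total_loss += p @ l             # loss at time t: sum_i p_t(i) l_t(i)
        w = w * (b ** l)                # w_{t+1}(i) = w_t(i) * U_b(l_t(i)), with U_b(r) = b^r
    return total_loss, w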

Hedge Analysis Lemma: For any sequence of losses Proof (Mansour’s scribe) Corollary:

Hedge: Properties Bounding the weights Similarly for a subset of experts.

Hedge: Performance Let k be the expert with minimal loss. Therefore:

Hedge: Optimizing b For b=1/2 we have Better selection of b:
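The two formulas referred to here are reconstructed from the standard Hedge analysis (an assumption, since the slide only shows the surrounding text): the general bound is
\[ L_{\mathrm{Hedge}} \;\le\; \frac{\min_i L_i \,\ln(1/b) + \ln N}{1-b}, \]
so for b = 1/2 we get \(L_{\mathrm{Hedge}} \le 2\ln 2\,\min_i L_i + 2\ln N\); choosing \(b = 1/\bigl(1+\sqrt{2\ln N/\tilde L}\bigr)\) for an upper bound \(\tilde L\) on the best expert's loss gives \(L_{\mathrm{Hedge}} \le \min_i L_i + \sqrt{2\tilde L\ln N} + \ln N\).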

Occam Razor Finding the shortest consistent hypothesis. Definition: (α,β)-Occam algorithm – α > 0 and β < 1 – Input: a sample S of size m – Output: hypothesis h – for every (x,b) in S: h(x) = b – size(h) ≤ size(c_t)^α · m^β – Efficiency.

Occam Razor Theorem A: (α,β)-Occam algorithm for C using H. D: distribution over inputs X. c_t ∈ C: the target function. Sample size: with probability 1−δ, A(S) = h has error(h) < ε

Occam Razor Theorem Use the bound for a finite hypothesis class. Effective hypothesis class size: 2^size(h), with size(h) ≤ n^α · m^β. Sample size:
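The missing sample-size expression can be reconstructed (hedged, following the usual Occam-razor argument): plugging \(|H_{\mathrm{eff}}| = 2^{\mathrm{size}(h)} \le 2^{n^{\alpha} m^{\beta}}\) into the finite-class bound \(m \ge \frac{1}{\epsilon}\bigl(\ln|H_{\mathrm{eff}}| + \ln\frac{1}{\delta}\bigr)\) and solving for m shows that
\[ m \;=\; O\!\left(\frac{1}{\epsilon}\ln\frac{1}{\delta} \;+\; \Bigl(\frac{n^{\alpha}}{\epsilon}\Bigr)^{\frac{1}{1-\beta}}\right) \]
examples suffice, since such an m dominates \(\frac{1}{\epsilon}\, n^{\alpha} m^{\beta}\ln 2\) up to constants.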

Weak and Strong Learning

PAC Learning Model (Strong Learning) There exists a distribution D over the domain X. Examples: use c for the target function (rather than c_t). Goal: – with high probability (1−δ) – find h in H such that – error(h,c) < ε – ε arbitrarily small, thus STRONG learning

Weak Learning Model Goal: error(h,c) < ½ − γ (slightly better than chance). The parameter γ is a small constant. Intuitively: a much easier task. Question: – Assume C is weakly learnable – Is C then PAC (strongly) learnable?

Majority Algorithm Hypothesis: h_M(x) = MAJ[ h_1(x), ..., h_T(x) ]. size(h_M) ≤ T · size(h_t). Apply the Occam Razor

Majority: Outline Sample m examples. Start with a distribution of 1/m per example. Modify the distribution and get h_t. The hypothesis is the majority vote. Terminate on perfect classification of the sample

Majority: Algorithm Use the Hedge algorithm. The “experts” are associated with sample points. The loss is incurred on a correct classification: l_t(i) = 1 − | h_t(x_i) − c(x_i) |. Set b = 1 − γ. h_M(x) = MAJORITY( h_i(x) ). Q: How do we set T?

Majority: Analysis Consider the set of errors S = { i | h_M(x_i) ≠ c(x_i) }. For every i in S: L_i / T < ½ (proof!). From the Hedge properties:

MAJORITY: Correctness Error probability: Number of rounds: Terminate when the error is less than 1/m

Bagging Generate a random sample from the training set by selecting elements with replacement. Repeat this sampling procedure to get a sequence of k “independent” training sets. A corresponding sequence of classifiers C1, C2, …, Ck is constructed, one per training set, using the same classification algorithm. To classify an unknown sample X, let each classifier predict; the bagged classifier C* then combines the predictions of the individual classifiers to produce the final outcome (sometimes the combination is simple voting). Taken from lecture slides for Data Mining: Concepts and Techniques by Jiawei Han and M. Kamber
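A minimal Python sketch of the procedure (an illustrative sketch, not from the lecture; it assumes numpy arrays, labels in {-1, +1}, and scikit-learn decision trees as the base classifier):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    # Train k trees, each on a bootstrap resample of (X, y).
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)              # sample n points with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Simple majority vote of the individual classifiers (labels assumed to be -1/+1).
    votes = np.stack([m.predict(X) for m in models])
    return np.sign(votes.sum(axis=0))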

Boosting Also an ensemble method: the final prediction is a combination of the predictions of several predictors. What is different? – It is iterative. – Boosting: each successive classifier depends on its predecessors; in the previous methods the individual classifiers were “independent”. – Training examples may have unequal weights. – Look at the errors from the previous classifier to decide how to focus the next iteration over the data. – Set weights to focus more on “hard” examples (the ones on which we made mistakes in previous iterations)

Boosting W(x) is the distribution of weights over the N training observations, ∑ W(x_i) = 1. Initially assign uniform weights W_0(x) = 1/N for all x; step k = 0. At each iteration k: – Find the best weak classifier C_k(x) using weights W_k(x) – With error rate ε_k, and based on a loss function, compute α_k, the weight of classifier C_k in the final hypothesis – For each x_i, update the weights based on ε_k to get W_{k+1}(x_i) C_FINAL(x) = sign[ ∑_k α_k C_k(x) ]

Boosting (Algorithm)

Boosting As Additive Model The final prediction in boosting f(x) can be expressed as an additive expansion of individual classifiers The process is iterative and can be expressed as follows. Typically we would try to minimize a loss function on the training examples
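Reconstructing the expressions referenced here from the standard additive-model formulation (an assumption, since the slide's own formulas are not shown):
\[ f(x)=\sum_{m=1}^{M}\beta_m\, b(x;\gamma_m), \qquad f_m(x)=f_{m-1}(x)+\beta_m\, b(x;\gamma_m), \]
with each stage chosen to minimize the training loss \(\sum_{i=1}^{N} L\bigl(y_i,\; f_{m-1}(x_i)+\beta\, b(x_i;\gamma)\bigr)\) over \(\beta\) and \(\gamma\).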

Boosting As Additive Model Simple case: squared-error loss. Forward stage-wise modeling amounts to just fitting the residuals from the previous iteration. Squared-error loss is not robust for classification
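In formulas (reconstructed under the same notation as above), the squared-error stage-wise step is
\[ L\bigl(y,\; f_{m-1}(x)+\beta b(x;\gamma)\bigr) \;=\; \bigl(y-f_{m-1}(x)-\beta b(x;\gamma)\bigr)^2 \;=\; \bigl(r_m-\beta b(x;\gamma)\bigr)^2, \]
where \(r_m = y - f_{m-1}(x)\) is the residual from the previous iteration, so the new base learner is simply fit to the residuals.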

Boosting As Additive Model AdaBoost for classification: L(y, f(x)) = exp(−y · f(x)), the exponential loss function

Boosting As Additive Model First assume that β is constant, and minimize w.r.t. G:

Boosting As Additive Model err_m is the training error on the weighted samples. The last equation tells us that in each iteration we must find a classifier that minimizes the training error on the weighted samples.

Boosting As Additive Model Now that we have found G, we minimize w.r.t. β:
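The omitted steps, reconstructed from the standard AdaBoost-as-additive-model derivation (hedged; the slide's own formulas are not shown): for a fixed \(\beta>0\) the optimal classifier is
\[ G_m=\arg\min_G \sum_i w_i^{(m)}\, I\bigl(y_i\ne G(x_i)\bigr), \]
and minimizing over \(\beta\) then gives
\[ \beta_m=\tfrac12\log\frac{1-\mathrm{err}_m}{\mathrm{err}_m}, \qquad \mathrm{err}_m=\frac{\sum_i w_i^{(m)} I\bigl(y_i\ne G_m(x_i)\bigr)}{\sum_i w_i^{(m)}}, \]
with weight update \(w_i^{(m+1)} = w_i^{(m)}\, e^{-\beta_m y_i G_m(x_i)}\), which matches the AdaBoost weighting up to a factor of 2 in the exponent.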

Boosting (Recall) W(x) is the distribution of weights over the N training observations, ∑ W(x_i) = 1. Initially assign uniform weights W_0(x) = 1/N for all x; step k = 0. At each iteration k: – Find the best weak classifier C_k(x) using weights W_k(x) – With error rate ε_k, and based on a loss function, compute α_k, the weight of classifier C_k in the final hypothesis – For each x_i, update the weights based on ε_k to get W_{k+1}(x_i) C_FINAL(x) = sign[ ∑_k α_k C_k(x) ]

AdaBoost W(x) is the distribution of weights over the N training points, ∑ W(x_i) = 1. Initially assign uniform weights W_0(x) = 1/N for all x. At each iteration k: – Find the best weak classifier C_k(x) using weights W_k(x) – Compute the error rate ε_k = [ ∑_i W(x_i) · I(y_i ≠ C_k(x_i)) ] / [ ∑_i W(x_i) ] – Set α_k = log((1 − ε_k)/ε_k), the weight of classifier C_k in the final hypothesis – For each x_i, W_{k+1}(x_i) = W_k(x_i) · exp[α_k · I(y_i ≠ C_k(x_i))] C_FINAL(x) = sign[ ∑_k α_k C_k(x) ]
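A minimal Python sketch of exactly these steps (an illustrative sketch, not the lecture's code; it assumes labels in {-1, +1} and uses scikit-learn depth-1 trees as the weak classifiers C_k):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(X)
    w = np.full(n, 1.0 / n)                        # W_0(x_i) = 1/N
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (stump.predict(X) != y)
        eps = np.dot(w, miss) / w.sum()            # weighted error rate eps_k
        if eps == 0 or eps >= 0.5:                 # perfect or no better than chance: stop
            break
        alpha = np.log((1 - eps) / eps)            # alpha_k = log((1 - eps_k) / eps_k)
        w = w * np.exp(alpha * miss)               # up-weight the misclassified points
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # C_FINAL(x) = sign( sum_k alpha_k C_k(x) )
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)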

AdaBoost (Example) Original training set: equal weights for all training samples. Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire

AdaBoost (Example) ROUND 1

AdaBoost (Example) ROUND 2

AdaBoost (Example) ROUND 3

AdaBoost (Example)

AdaBoost (Characteristics) Why the exponential loss function? – Computational: simple modular re-weighting; the derivative is easy, so determining the optimal parameters is relatively easy. – Statistical: in the two-label case its minimizer is one half the log-odds of P(Y=1|x), so we can use the sign as the classification rule. Accuracy depends on the number of iterations (how sensitive? we will see soon)

Boosting performance Decision stumps are very simple rules of thumb that test a condition on a single attribute. Decision stumps formed the individual classifiers whose predictions were combined to generate the final prediction. The misclassification rate of the boosting algorithm was plotted against the number of iterations performed

Boosting performance Steep decrease in error

Boosting performance How many iterations are sufficient? Observations: – The first few (about 50) iterations increase accuracy substantially, as seen in the steep decrease in the misclassification rate. – As iterations increase, does the training error keep decreasing? Does the generalization error keep decreasing?

Can Boosting do well if…? Limited training data? – Probably not. Many missing values? Noise in the data? Individual classifiers not very accurate? – It could, if the individual classifiers have considerable mutual disagreement

Adaboost “Probably one of the three most influential ideas in machine learning in the last decade, along with kernel methods and variational approximations.” The original idea came from Valiant. Motivation: we want to improve the performance of a weak learning algorithm

Adaboost Algorithm:

Boosting Trees Outline Basics of boosting trees. A numerical optimization problem Control the model complexity, generalization –Size of trees –Number of Iterations –Regularization Interpret the final model –Single variable –Correlation of variables

Boosting Trees: Basics Formally, a tree is defined as below. The parameters are found by minimizing the empirical risk. Finding them: – γ_j given R_j: typically the mean of the y_i in R_j – R_j: difficult, but approximate solutions exist
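The formal definition referred to above is, in the usual boosting-trees notation (reconstructed, hedged):
\[ T(x;\Theta)=\sum_{j=1}^{J}\gamma_j\, I(x\in R_j), \qquad \Theta=\{R_j,\gamma_j\}_{1}^{J}, \qquad \hat\Theta=\arg\min_{\Theta}\sum_{i} L\bigl(y_i,\, T(x_i;\Theta)\bigr). \]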

Basics Continued… An approximate criterion is used for optimizing Θ. The boosted tree model is a sum of such trees, induced in a forward stage-wise manner. In the case of binary classification with the exponential loss, this reduces to AdaBoost

Numerical Optimization Loss Function is So the problem boils down to finding Which in optimization procedures are solved as
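Reconstructing the missing expressions (an assumption, following the usual gradient-boosting setup): the loss is \(L(\mathbf f)=\sum_{i=1}^N L\bigl(y_i, f(x_i)\bigr)\), viewed as a function of the vector of fitted values \(\mathbf f = \bigl(f(x_1),\dots,f(x_N)\bigr)\); the problem is \(\hat{\mathbf f}=\arg\min_{\mathbf f} L(\mathbf f)\), and numerical procedures solve it as a sum of component steps \(\mathbf f_M=\sum_{m=0}^{M}\mathbf h_m\). Steepest descent takes \(\mathbf h_m=-\rho_m \mathbf g_m\) with \(g_{im}=\bigl[\partial L(y_i,f(x_i))/\partial f(x_i)\bigr]_{f=f_{m-1}}\).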

Numerical Optimization Methods Steepest descent: the loss on the training data converges to 0.

Generalization: Gradient Boosting – We want the algorithm to generalize. – The gradient, on the other hand, is defined only at the training data points. – So fit the tree T to the negative gradient values by least squares. MART: multiple additive regression trees
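A minimal Python sketch of this idea for squared-error loss, where the negative gradient is exactly the current residual (an illustrative sketch; the tree size J and shrinkage factor nu are the tuning parameters discussed on the following slides):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, J=6, nu=0.1):
    f0 = y.mean()                                   # initial constant fit
    f = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        neg_grad = y - f                            # negative gradient of 0.5*(y - f)^2
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, neg_grad)
        f = f + nu * tree.predict(X)                # shrunken update
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(f0, trees, X, nu=0.1):
    return f0 + nu * sum(t.predict(X) for t in trees)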

Algorithm

Tuning the Parameters The parameters that can be tuned are –The size of constituent trees J. –The number of boosting iterations M. –Shrinkage –Penalized Regression

Right-sized trees The optimal tree size for one step might not be optimal for the whole algorithm. – Using a very large tree (such as C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion; this usually degrades performance and increases computation. Solution: restrict the value of J to be the same for all trees.

Right-sized trees The higher-order interaction effects captured by large trees tend to suffer inaccuracies. J is the factor that controls the order of interactions, so we would like to keep J low. In practice, values 4 ≤ J ≤ 8 have been seen to work best.

Controlling M (Regularization) Each iteration reduces the training risk L(f_M). As M → ∞, L(f_M) → 0, but this risks overfitting the training data. To avoid this, monitor the prediction risk on a validation sample. Other methods are covered in the next chapter.

Shrinkage Scale the contribution of each tree by a factor 0 < ν < 1 to control the learning rate. ν and M both control the prediction risk on the training data and are not independent of each other.
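In formulas (reconstructed, assuming the usual notation), the shrunken update is
\[ f_m(x)=f_{m-1}(x)+\nu\sum_{j=1}^{J_m}\gamma_{jm}\, I(x\in R_{jm}), \qquad 0<\nu<1, \]
so a smaller ν typically requires a larger M to reach the same training risk.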

Experts: Motivation Given a set of experts –No prior information –No consistent behavior –Goal: Predict as the best expert Model –online model –Input: historical results.

Experts: Goal Match the loss of the best expert. Loss: – L_A (the algorithm’s loss) – L_i (expert i’s loss) Can we hope to do better?

Example: Guessing Letters Setting: – an alphabet Σ of k letters Loss: – 1 for an incorrect guess – 0 for a correct guess Experts: – each expert always guesses a certain letter. Game: guess the most popular letter online.

Example 2: Rock-Paper-Scissors Two-player game. Each player chooses Rock, Paper, or Scissors. Goal: play as well as we can against the given opponent. Loss matrix (rows: our choice, columns: opponent’s choice):
         Rock  Paper  Scissors
Rock     1/2   1      0
Paper    0     1/2    1
Scissors 1     0      1/2

Example 3: Placing a Point Action: choosing a point d. Loss (given the true location y): ||d − y||. Experts: one for each point. Important: the loss is convex. Goal: find a “center”

Adaboost
Line 1: Given input space X, training examples x_1, …, x_m, and label space Y = {-1, 1}
Line 2: Initialize a distribution D_1 over the examples to 1/m, where m is the number of training instances
Line 3: for( int t=0;t<T;t++)
Line 4: Train the weak learning algorithm using D_t
Line 5: Get a weak hypothesis h_t, which maps the input space to the label space; its weighted error is ε_t
Line 6: α_t = (1/2) ln((1 − ε_t)/ε_t)
Line 7: D_{t+1}(i) = (1/Z_t) · D_t(i) · e^{−α_t} if h_t classified instance i correctly, and (1/Z_t) · D_t(i) · e^{α_t} otherwise (Z_t is a normalization factor)

Adaboost Final hypothesis: H(x) = sign( ∑_t α_t h_t(x) ). Main ideas: – Adaboost forces the weak learner to focus on incorrectly classified instances – Training error decreases exponentially – Does boosting overfit? Baum showed generalization error = O(sqrt(Td/m)); Schapire showed error = O(sqrt(d/mθ)). Does the generalization error depend on T or not? The jury is still out. – No explicit overfitting-prevention mechanism

AdaBoost: Dynamic Boosting Better bounds on the error. No need to “know” γ. Each round uses a different b – chosen as a function of that round’s error

AdaBoost: Input A sample of size m. A distribution D over the examples – we will use D(x_i) = 1/m. A weak learning algorithm. A constant T (the number of iterations)

AdaBoost: Algorithm Initialization: w_1(i) = D(x_i). For t = 1 to T do: – p_t(i) = w_t(i) / ∑_j w_t(j) – Call the weak learner with p_t – Receive h_t – Compute the error ε_t of h_t on p_t – Set b_t = ε_t / (1 − ε_t) – w_{t+1}(i) = w_t(i) · (b_t)^e, where e = 1 − |h_t(x_i) − c(x_i)| Output

AdaBoost: Analysis Theorem: – Given ε_1, ..., ε_T – the error ε of h_A is bounded by
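The bound itself is not rendered on the slide; in the standard AdaBoost analysis it reads
\[ \epsilon \;\le\; 2^{T}\prod_{t=1}^{T}\sqrt{\epsilon_t\,(1-\epsilon_t)}. \]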

AdaBoost: Proof Let l_t(i) = 1 − |h_t(x_i) − c(x_i)|. By definition: p_t · l_t = 1 − ε_t. Upper bound the sum of weights – from the Hedge analysis. An error occurs only if

AdaBoost Analysis (cont.) Bounding the weight of a point. Bounding the sum of weights. Final bound as a function of b_t. Optimizing b_t: – b_t = ε_t / (1 − ε_t)

AdaBoost: Fixed bias Assume ε_t = 1/2 − γ. We bound:
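With \(\epsilon_t = 1/2-\gamma\) the previous bound becomes (reconstructed)
\[ \epsilon \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)} \;=\; \bigl(1-4\gamma^2\bigr)^{T/2} \;\le\; e^{-2\gamma^2 T}, \]
so the error of the combined hypothesis drops exponentially in the number of rounds T.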

Learning OR with few attributes Target function: OR of k literals. Goal: learn in time – polynomial in k and log n – with ε and δ constant. ELIM makes “slow” progress – it disqualifies one literal per round – and may remain with O(n) literals

Set Cover – Definition Input: S_1, …, S_t with S_i ⊆ U. Output: S_{i_1}, …, S_{i_k} with ∪_j S_{i_j} = U. Question: are there k sets that cover U? NP-complete

Set Cover Greedy algorithm j = 0; U_j = U; C = ∅. While U_j ≠ ∅: – let S_i be arg max_i |S_i ∩ U_j| – add S_i to C – let U_{j+1} = U_j − S_i – j = j + 1
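A minimal Python sketch of this greedy procedure (illustrative names; it assumes the union of the input sets equals U):

def greedy_set_cover(universe, sets):
    # Repeatedly pick the set covering the most still-uncovered elements.
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(sets, key=lambda s: len(s & uncovered))
        cover.append(best)
        uncovered -= best
    return cover

# tiny example: a 2-set cover exists, and greedy finds one
U = set(range(6))
S = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5}]
print(greedy_set_cover(U, S))   # [{0, 1, 2}, {3, 4, 5}]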

Set Cover: Greedy Analysis At termination, C is a cover. Assume there is a cover C’ of size k. C’ is a cover for every U_j, so some S in C’ covers at least |U_j|/k elements of U_j. Analysis of U_j: |U_{j+1}| ≤ |U_j| − |U_j|/k. Solving the recursion gives: number of sets j ≤ k ln |U|
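Solving that recursion explicitly:
\[ |U_j| \;\le\; |U|\Bigl(1-\tfrac1k\Bigr)^{j} \;\le\; |U|\, e^{-j/k} \;<\; 1 \quad\text{once } j > k\ln|U|, \]
so the greedy cover uses at most about \(k\ln|U|\) sets, as claimed.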

Building an Occam algorithm Given a sample S of size m: – run ELIM on S – let LIT be the resulting set of literals – there exist k literals in LIT that classify all of S correctly. Negative examples: – any subset of LIT classifies them correctly

Building an Occam algorithm Positive examples: – search for a small subset of LIT – which classifies S^+ correctly – for a literal z build T_z = { x | z satisfies x } – there are k sets that cover S^+ – find k ln m sets that cover S^+. Output h = the OR of these k ln m literals. size(h) ≤ k ln m · log(2n). Sample size: m = O( k log n · log(k log n) )

Application: Data mining Challenges in real-world data mining problems: – the data has a large number of observations and a large number of variables per observation – the inputs are a mixture of different kinds of variables – missing values, outliers, and variables with skewed distributions – results must be obtained fast and should be interpretable. So off-the-shelf techniques are hard to come by. Boosted decision trees (AdaBoost or MART) come close to an off-the-shelf technique for data mining.

Boosting Trees Presented by Rishi Sinha

Occam Razor

Occam algorithm and compression A holds the labeled sample S = { (x_i, b_i) }; B holds only the points x_1, …, x_m; A must communicate the labels to B

Compression Option 1: – A sends B the values b_1, …, b_m – m bits of information. Option 2: – A sends B the hypothesis h – Occam: for large enough m, size(h) < m. Option 3 (MDL): – A sends B a hypothesis h and “corrections” – complexity: size(h) + size(errors)
