1
Introduction to Boosting. Slides adapted from Che Wanxiang (车万翔) at HIT and Robin Dhamankar. Many thanks!
2
Ideas
- Boosting is considered to be one of the most significant developments in machine learning.
- Finding many weak rules of thumb is easier than finding a single, highly accurate prediction rule.
- The key is in how to combine the weak rules.
15
Boosting (Algorithm)
- W(x) is the distribution of weights over the N training points, with ∑_i W(x_i) = 1.
- Initially assign uniform weights W_0(x) = 1/N for all x; set step k = 0.
- At each iteration k:
  - Find the best weak classifier C_k(x) using weights W_k(x), with error rate ε_k.
  - Based on a loss function, compute α_k, the weight of classifier C_k in the final hypothesis.
  - For each x_i, update the weights based on ε_k to get W_{k+1}(x_i).
- C_FINAL(x) = sign[ ∑_i α_i C_i(x) ]
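A schematic of this generic loop in Python. The helpers fit_weak_learner, classifier_weight, and reweight are placeholders I have introduced; a concrete method such as AdaBoost supplies them (see the full sketch a few slides below).

```python
# Schematic of the generic boosting loop on this slide; the loss-dependent
# pieces are left as pluggable functions.
import numpy as np

def boost(X, y, fit_weak_learner, classifier_weight, reweight, n_rounds):
    n = len(y)
    w = np.full(n, 1.0 / n)                           # uniform initial weights W_0(x) = 1/N
    ensemble = []
    for k in range(n_rounds):
        C_k = fit_weak_learner(X, y, w)               # best weak classifier under W_k
        pred = C_k(X)
        eps_k = np.sum(w * (pred != y)) / np.sum(w)   # weighted error rate ε_k
        alpha_k = classifier_weight(eps_k)            # loss-function-dependent weight α_k
        w = reweight(w, pred, y, eps_k)               # update to W_{k+1}
        w /= w.sum()                                  # keep ∑ W_{k+1}(x_i) = 1
        ensemble.append((alpha_k, C_k))
    return ensemble

def predict(ensemble, X):
    # C_FINAL(x) = sign( ∑ α_i C_i(x) )
    return np.sign(sum(a * C(X) for a, C in ensemble))
```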
16
Boosting (Algorithm)
19
Boosting As Additive Model
- The final prediction in boosting, f(x), can be expressed as an additive expansion of individual classifiers.
- The process is iterative and can be expressed as follows.
- Typically we would try to minimize a loss function on the training examples.
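The equations this slide points to are not in the transcript; a hedged reconstruction, following the standard forward stagewise additive modeling formulation, is

f(x) = \sum_{m=1}^{M} \beta_m\, b(x; \gamma_m)

with the iterative fit

(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big), \qquad f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m).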
20
Boosting As Additive Model
- Simple case: squared-error loss.
- Forward stage-wise modeling amounts to just fitting the residuals from the previous iteration.
- Squared-error loss is not robust for classification.
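As a sketch of the standard derivation (the slide's own equation is missing from the transcript): with squared-error loss the stagewise objective becomes

L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big) = \big(y_i - f_{m-1}(x_i) - \beta\, b(x_i; \gamma)\big)^2 = \big(r_{im} - \beta\, b(x_i; \gamma)\big)^2,

where r_{im} = y_i - f_{m-1}(x_i) is the residual of the current model on the i-th example, so each round simply fits the residuals.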
21
Boosting As Additive Model
- AdaBoost for classification uses the exponential loss function: L(y, f(x)) = exp(−y · f(x)).
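The stagewise objective under this loss (a standard reconstruction of equations that did not survive the transcript) is

(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\big(-y_i\,(f_{m-1}(x_i) + \beta\, G(x_i))\big) = \arg\min_{\beta, G} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-\beta\, y_i\, G(x_i)\big), \qquad w_i^{(m)} = \exp\big(-y_i\, f_{m-1}(x_i)\big).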
22
Boosting As Additive Model First assume that β is constant, and minimize w.r.t. G:
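The equation referred to here is not in the transcript; in the standard derivation, for any fixed β > 0 the minimizer is the classifier with the smallest weighted training error:

G_m = \arg\min_{G} \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq G(x_i)\big).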
23
Boosting As Additive Model
- err_m is the training error on the weighted samples.
- The last equation tells us that in each iteration we must find a classifier that minimizes the training error on the weighted samples.
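In the usual notation (an assumption about the equation that is missing here), the weighted training error is

\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq G_m(x_i)\big)}{\sum_{i=1}^{N} w_i^{(m)}}.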
24
Boosting As Additive Model Now that we have found G, we minimize w.r.t. β:
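Reconstructing the standard result for the missing equation: setting the derivative of the weighted exponential loss with respect to β to zero gives

\beta_m = \tfrac{1}{2}\,\log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}, \qquad w_i^{(m+1)} = w_i^{(m)}\, \exp\big(-\beta_m\, y_i\, G_m(x_i)\big) = w_i^{(m)}\, \exp\big(\alpha_m\, I(y_i \neq G_m(x_i))\big)\, e^{-\beta_m},

with \alpha_m = 2\beta_m. The factor e^{-\beta_m} multiplies every weight equally and drops out after normalization, which matches the α_k = log((1 − ε_k)/ε_k) update on the next slide.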
25
AdaBoost (Algorithm)
- W(x) is the distribution of weights over the N training points, with ∑_i W(x_i) = 1.
- Initially assign uniform weights W_0(x) = 1/N for all x.
- At each iteration k:
  - Find the best weak classifier C_k(x) using weights W_k(x).
  - Compute the error rate ε_k = ∑_i W(x_i) · I(y_i ≠ C_k(x_i)) / ∑_i W(x_i).
  - Set the classifier C_k's weight in the final hypothesis: α_k = log((1 − ε_k)/ε_k).
  - For each x_i, update W_{k+1}(x_i) = W_k(x_i) · exp[α_k · I(y_i ≠ C_k(x_i))].
- C_FINAL(x) = sign[ ∑_i α_i C_i(x) ]
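A minimal runnable sketch of this algorithm with decision stumps as the weak classifiers. The stump search and all function names are illustrative choices, not taken from the original slides.

```python
# AdaBoost with decision stumps, following the update rules on this slide.
import numpy as np

def train_stump(X, y, w):
    """Return the decision stump (feature, threshold, polarity) with the
    smallest weighted training error under weights w."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best_err, best = err, (j, thr, polarity)
    return best

def stump_predict(stump, X):
    j, thr, polarity = stump
    return np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)

def adaboost(X, y, n_rounds=50):
    """y must take values in {-1, +1}. Returns a list of (alpha_k, stump_k)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # W_0(x_i) = 1/N
    ensemble = []
    for k in range(n_rounds):
        stump = train_stump(X, y, w)               # best weak classifier C_k under W_k
        miss = stump_predict(stump, X) != y
        eps = np.clip(np.sum(w * miss) / np.sum(w), 1e-12, 1 - 1e-12)  # ε_k
        alpha = np.log((1 - eps) / eps)            # α_k = log((1 − ε_k)/ε_k)
        w = w * np.exp(alpha * miss)               # up-weight the misclassified points
        w /= w.sum()                               # keep ∑ W_{k+1}(x_i) = 1
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    """C_FINAL(x) = sign( ∑ α_k C_k(x) )."""
    scores = sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)
    return np.sign(scores)
```

Because both polarities are searched, each stump's weighted error is at most 0.5, so α_k stays non-negative as the algorithm assumes.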
26
AdaBoost (Example)
- Original training set: equal weights for all training samples.
- Taken from "A Tutorial on Boosting" by Yoav Freund and Rob Schapire.
27
AdaBoost(Example) ROUND 1
28
AdaBoost(Example) ROUND 2
29
AdaBoost(Example) ROUND 3
30
AdaBoost(Example)
31
AdaBoost (Characteristics)
- Why the exponential loss function?
  - Computational: simple, modular re-weighting; the derivative is easy, so determining the optimal parameters is relatively easy.
  - Statistical: in the two-label case it estimates one half the log-odds of P(Y=1|x), so we can use the sign as the classification rule.
- Accuracy depends on the number of iterations (how sensitive? we will see soon).
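The "one half the log-odds" remark corresponds to the population minimizer of the exponential loss (a standard result, stated here as a reconstruction since the slide's formula is not in the transcript):

f^*(x) = \arg\min_{f(x)} \mathbb{E}\big[e^{-Y f(x)} \mid x\big] = \tfrac{1}{2}\,\log\frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)},

so the sign of f^*(x) recovers the Bayes classification rule.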
32
Boosting performance
- Decision stumps are very simple rules of thumb that test a condition on a single attribute.
- Decision stumps formed the individual classifiers whose predictions were combined to generate the final prediction.
- The misclassification rate of the boosting algorithm was plotted against the number of iterations performed.
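A small, hedged sketch of how such an error-versus-iterations curve could be produced with scikit-learn; the dataset and parameter values here are illustrative, not from the original experiment.

```python
# Plot the misclassification rate of AdaBoost against the number of boosting
# iterations. scikit-learn's AdaBoostClassifier uses a depth-1 decision tree
# (a decision stump) as its default base learner. Illustrative settings only.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators rounds.
test_err = [(pred != y_te).mean() for pred in clf.staged_predict(X_te)]

plt.plot(range(1, len(test_err) + 1), test_err)
plt.xlabel("Boosting iterations")
plt.ylabel("Misclassification rate")
plt.show()
```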
33
Boosting performance Steep decrease in error
34
Boosting performance
- How many iterations would be sufficient?
- Observations:
  - The first few iterations (about 50) increase the accuracy substantially, as seen by the steep decrease in the misclassification rate.
  - As iterations increase, does the training error keep decreasing? Does the generalization error keep decreasing?
35
Can Boosting do well if…?
- Limited training data? Probably not.
- Many missing values?
- Noise in the data?
- Individual classifiers not very accurate? It could, if the individual classifiers have considerable mutual disagreement.
36
Application: Data mining
- Challenges in real-world data mining problems:
  - Data has a large number of observations and a large number of variables per observation.
  - Inputs are a mixture of different kinds of variables.
  - Missing values, outliers, and variables with skewed distributions.
  - Results must be obtained quickly and should be interpretable.
- So off-the-shelf techniques are hard to come by. Boosted decision trees (AdaBoost or MART) come close to an off-the-shelf technique for data mining.
40
AT&T “May I help you?”