Introduction to Boosting
Aristotelis Tsirigos
email: tsirigos@cs.nyu.edu
SCLT seminar - NYU Computer Science
1. Learning Problem Formulation I
- Unknown target function: $f: X \to Y$, with $y = f(x)$
- Given data sample: $S_N = \{(x_1, y_1), \dots, (x_N, y_N)\}$
- Objective: predict the output $y$ for any given input $x$
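Learning Problem Formulation II
- Loss function: $\lambda(h(x), y)$, e.g. the 0/1 loss $\mathbb{I}[h(x) \neq y]$
- Generalization error: $L(h) = \mathbb{E}_{(x,y) \sim P}\,[\lambda(h(x), y)]$
- Objective: find $h$ with minimum generalization error
- Main boosting idea: minimize the empirical error $\hat{L}(h) = \frac{1}{N} \sum_{n=1}^{N} \lambda(h(x_n), y_n)$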
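The empirical error is simply the average loss over the sample. A minimal sketch, assuming the 0/1 loss and labels in {-1, +1}; the function name `empirical_error`, the rule `sign_rule`, and the toy data are illustrative and not part of the seminar material:

```python
import numpy as np

def empirical_error(h, X, y):
    """Fraction of sample points that h misclassifies (0/1 loss, an assumption)."""
    predictions = np.array([h(x) for x in X])
    return float(np.mean(predictions != y))

# Toy 1-D sample with labels in {-1, +1} (made-up data for illustration).
X = np.array([-2.0, -1.0, 0.5, 1.0, 3.0])
y = np.array([-1, -1, -1, 1, 1])

def sign_rule(x):
    # A deliberately simple hypothesis: predict +1 for positive inputs.
    return 1 if x > 0 else -1

err = empirical_error(sign_rule, X, y)  # one of the five points is misclassified
```

The generalization error cannot be computed this way, since it is an expectation over the unknown distribution; the empirical error is the quantity a learner can actually observe.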
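PAC Learning
- Input: hypothesis space $H$, sample of size $N$, accuracy $\varepsilon$, confidence $1-\delta$
- Objective: find $h \in H$ whose generalization error is at most $\varepsilon$ with probability at least $1-\delta$
- Strong PAC learning: the objective can be met for any given $\varepsilon, \delta$
- Weak PAC learning: the objective is met only for some $\varepsilon, \delta$
- Boosting converts a weak learner into a strong one!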
2. Adaboost - Introduction
Idea:
- Complex hypotheses are hard to design without overfitting
- Simple hypotheses cannot explain all data points
- Combine many simple hypotheses into a complex one
Issues:
- How do we generate simple hypotheses?
- How do we combine them?
Method:
- Apply some weighting scheme on the examples
- Find a simple hypothesis for each weighted version of the examples
- Compute a weight for each hypothesis and combine them linearly
Some early algorithms
Boosting by filtering (Schapire 1990)
- Run weak learner on differently filtered example sets
- Combine weak hypotheses
- Requires knowledge of the performance of the weak learner
Boosting by majority (Freund 1995)
- Run weak learner on weighted example set
- Combine weak hypotheses linearly
Bagging (Breiman 1996)
- Run weak learner on bootstrap replicates of the training set
- Average weak hypotheses
- Reduces variance
Adaboost - Outline
Input:
- N examples $S_N = \{(x_1, y_1), \dots, (x_N, y_N)\}$
- a weak base learner $h = h(d, x)$
Initialize:
- equal example weights $d_n^{(1)} = 1/N$ for all n = 1..N
Iterate for t = 1..T:
- train the base learner on the weighted example set $(d^{(t)}, x)$ and obtain hypothesis $h_t = h(d^{(t)}, x)$
- compute the hypothesis error $\varepsilon_t$
- compute the hypothesis weight $\alpha_t$
- update the example weights for the next iteration, $d^{(t+1)}$
Output:
- final hypothesis as a linear combination of the $h_t$
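The outline above can be sketched in Python as follows. This is a minimal sketch rather than the seminar's own code. It assumes labels in {-1, +1}, decision stumps as the base learner, the weighted error $\varepsilon_t = \sum_n d_n \mathbb{I}[h_t(x_n) \neq y_n]$, and the usual choice $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$. The names `train_stump`, `stump_predict`, `adaboost`, and `predict` are illustrative.

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    """Axis-parallel stump: sign * (+1 if X[:, j] > thr else -1)."""
    return sign * np.where(X[:, j] <= thr, -1, 1)

def train_stump(X, y, d):
    """Exhaustive search for the stump with the smallest weighted error under d."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                err = d[stump_predict(X, j, thr, sign) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best  # (weighted error, feature index, threshold, orientation)

def adaboost(X, y, T):
    N = len(y)
    d = np.full(N, 1.0 / N)                      # equal initial example weights
    ensemble = []
    for _ in range(T):
        err, j, thr, sign = train_stump(X, y, d)
        err = np.clip(err, 1e-12, 1 - 1e-12)     # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)    # hypothesis weight (assumed rule)
        pred = stump_predict(X, j, thr, sign)
        d = d * np.exp(-alpha * y * pred)        # raise weight of misclassified points
        d /= d.sum()                             # renormalize to a distribution
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    """Sign of the linear combination of the weak hypotheses."""
    f = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in ensemble)
    return np.sign(f)

# Toy 1-D data (made up): no single stump separates it, three stumps do.
X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.array([1, 1, -1, -1, 1, 1])
ensemble = adaboost(X, y, T=3)
```

On this toy sample every individual stump misclassifies at least two points, while the weighted vote of three stumps fits all six.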
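Adaboost – Data flow diagram
[Figure: the weak learner A(d, S) is applied to the sample S_N under the successive example weightings d(1), d(2), …, d(T), producing the weighted hypotheses α1h1(x), α2h2(x), …, αThT(x).]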
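Adaboost – Details
- The loss function on the combined hypothesis $f_t(x) = \sum_{s=1}^{t} \alpha_s h_s(x)$ at step t: $L(f_t) = \sum_{n=1}^{N} \exp(-y_n f_t(x_n))$
- In each iteration, L is greedily minimized with respect to $\alpha_t$: $\alpha_t = \arg\min_{\alpha} L(f_{t-1} + \alpha h_t) = \frac{1}{2} \ln \frac{1-\varepsilon_t}{\varepsilon_t}$
- Finally, the example weights are updated: $d_n^{(t+1)} = d_n^{(t)} \exp(-\alpha_t y_n h_t(x_n)) / Z_t$, where $Z_t$ normalizes the weights to sum to one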
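The greedy step can be worked out in closed form. The derivation below is a sketch under stated assumptions: the loss is the exponential loss, $h_t$ takes values in {-1, +1}, the weights $d_n^{(t)}$ sum to one, and $\varepsilon_t = \sum_n d_n^{(t)} \mathbb{I}[h_t(x_n) \neq y_n]$ is the weighted error of $h_t$. Up to a factor that does not depend on $\alpha$, the loss after adding $\alpha h_t$ is:

```latex
\begin{align*}
L(\alpha) &= \sum_{n=1}^{N} d_n^{(t)} \exp\bigl(-\alpha\, y_n h_t(x_n)\bigr)
           = (1-\varepsilon_t)\, e^{-\alpha} + \varepsilon_t\, e^{\alpha}, \\
\frac{\partial L}{\partial \alpha}
          &= -(1-\varepsilon_t)\, e^{-\alpha} + \varepsilon_t\, e^{\alpha} = 0
\;\Longrightarrow\;
\alpha_t = \frac{1}{2} \ln \frac{1-\varepsilon_t}{\varepsilon_t}.
\end{align*}
```

The weight is positive exactly when $\varepsilon_t < 1/2$, i.e. when the weak hypothesis does better than random guessing on the weighted sample.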
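Adaboost – Big picture
- The weak learner A induces a feature space: $\Phi(x) = (h_1(x), h_2(x), \dots)$, with one coordinate per hypothesis that A can produce
- Ideally, we want to find the combined hypothesis with the minimum loss: $\min_{\alpha} L\bigl(\sum_j \alpha_j h_j\bigr)$, optimizing over all weights jointly
- However, Adaboost optimizes α locally: one weight $\alpha_t$ per iteration, with the earlier weights held fixed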
Base learners
Weak learners used in practice:
- Decision stumps (axis parallel splits)
- Decision trees (e.g. C4.5 by Quinlan 1996)
- Multi-layer neural networks
- Radial basis function networks
Can base learners operate on weighted examples?
- In many cases they can be modified to accept weights along with the examples
- In general, we can sample the examples (with replacement) according to the distribution defined by the weights
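The resampling fallback mentioned above can be sketched with NumPy. This is an illustrative sketch: the helper name `resample_by_weight` and the toy data are assumptions, and the weights `d` are taken to be a probability distribution (non-negative, summing to one).

```python
import numpy as np

def resample_by_weight(X, y, d, rng):
    """Draw len(y) examples with replacement; example n is picked with probability d[n]."""
    idx = rng.choice(len(y), size=len(y), replace=True, p=d)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(5, dtype=float).reshape(-1, 1)
y = np.array([1, 1, -1, -1, 1])
d = np.array([0.0, 0.0, 0.5, 0.5, 0.0])   # all weight on examples 2 and 3
X_s, y_s = resample_by_weight(X, y, d, rng)
```

A base learner that knows nothing about weights can then be trained on `(X_s, y_s)`; heavily weighted examples appear more often, so they influence the fit more.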
3. Boosting & Learning Theory
Main results:
- Training error converges to zero
- Bound for the generalization error
- Bound for the margin-based generalization error
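Training error - Definitions
- Function $\varphi_\theta$ on the margin $z = y f(x)$: $\varphi_\theta(z) = 1$ for $z \le 0$, $\varphi_\theta(z) = 1 - z/\theta$ for $0 < z \le \theta$, and $\varphi_\theta(z) = 0$ for $z > \theta$
- [Figure: plot of $\varphi_\theta(z)$ against $z$]
- Empirical margin-based error: $\hat{L}^{\theta}(f) = \frac{1}{N} \sum_{n=1}^{N} \varphi_\theta(y_n f(x_n))$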
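The empirical margin-based error can be computed directly from the margins $y_n f(x_n)$. The sketch below assumes the piecewise-linear form of $\varphi_\theta$ (equal to 1 for $z \le 0$, decreasing linearly to 0 at $z = \theta$); the function names and the numbers are illustrative.

```python
import numpy as np

def phi(z, theta):
    """Assumed margin loss: 1 for z <= 0, 0 for z >= theta, linear in between."""
    return np.clip(1.0 - z / theta, 0.0, 1.0)

def empirical_margin_error(f_values, y, theta):
    """Average of phi over the sample margins y_n * f(x_n)."""
    return float(np.mean(phi(y * f_values, theta)))

# Made-up outputs of a combined hypothesis on four examples.
f_values = np.array([0.9, 0.2, -0.3, 0.05])
y = np.array([1, 1, 1, -1])
err = empirical_margin_error(f_values, y, theta=0.4)
```

Here the margins are 0.9, 0.2, -0.3 and -0.05: the first example is charged nothing, the second is charged 0.5 because its margin is positive but below θ, and the two misclassified examples are charged 1 each.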
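Training error - Theorem
- The empirical margin error of the composite hypothesis $f_T$ (with hypothesis weights normalized so that $\sum_t \alpha_t = 1$) obeys: $\hat{L}^{\theta}(f_T) \le \prod_{t=1}^{T} 2\sqrt{\varepsilon_t^{1-\theta} (1-\varepsilon_t)^{1+\theta}}$
- Therefore, the empirical margin error converges to zero exponentially fast in T, provided θ is small enough: if every $\varepsilon_t \le 1/2 - \gamma$ for some $\gamma > 0$, each factor is below 1 whenever $\theta < \gamma$.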
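Generalization bounds
- Theorem 2: Let F be a class of {-1,+1}-valued functions with VC dimension d. By applying standard VC dimension bounds, we get that with probability at least 1-δ, for every f in F: $L(f) \le \hat{L}(f) + O\Bigl(\sqrt{\frac{d \ln(N/d) + \ln(1/\delta)}{N}}\Bigr)$, where $L$ is the generalization error and $\hat{L}$ the empirical error
- This is a distribution-free bound, i.e. it holds for any probability measure P.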
Luckiness
- Take advantage of data “regularities” in order to get tighter bounds
- Do that without imposing any a-priori conditions on P
- Introduce a luckiness function that is based on the data
- An example of a luckiness function is
The Rademacher complexity
- A notion of complexity related to VC dimension but more general in some sense
- The Rademacher complexity of a class F of [-1,+1]-valued functions is defined as: $R_N(F) = \mathbb{E}\Bigl[\sup_{f \in F} \frac{1}{N} \sum_{n=1}^{N} \sigma_n f(x_n)\Bigr]$
- where: $\sigma_n$ are independently set to -1 or +1 with equal probability; $x_n$ are drawn independently from the underlying distribution P
- $R_N(F)$ is the expected correlation of a random sequence $\sigma_n$ with the best-matching f in F on a given sample $x_n$, n=1..N.
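For a finite class, the correlation with random signs can be estimated by simulation. The sketch below is illustrative only. It estimates the empirical Rademacher complexity on one fixed sample (averaging over the signs $\sigma_n$ but not over fresh samples), takes the supremum without an absolute value, and uses a small made-up class of decision stumps; the function name `empirical_rademacher` is an assumption.

```python
import numpy as np

def empirical_rademacher(H_outputs, n_draws, rng):
    """Monte Carlo estimate of E_sigma[ max_h (1/N) * sum_n sigma_n * h(x_n) ].

    H_outputs has shape (num_hypotheses, N); row j holds h_j(x_1), ..., h_j(x_N).
    """
    N = H_outputs.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, N))   # random sign sequences
    correlations = sigma @ H_outputs.T / N               # (n_draws, num_hypotheses)
    return float(correlations.max(axis=1).mean())        # best hypothesis per draw

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=50))                 # fixed sample of N = 50 points
thresholds = np.linspace(-1, 1, 21)
stumps = np.where(x[None, :] > thresholds[:, None], 1.0, -1.0)
H = np.vstack([stumps, -stumps])                         # both stump orientations

estimate = empirical_rademacher(H, n_draws=2000, rng=rng)
single = empirical_rademacher(H[:1], n_draws=2000, rng=rng)
```

A single fixed hypothesis cannot track random signs, so `single` is close to zero, while the richer class of stumps yields a clearly positive `estimate`: the more a class can fit noise, the larger this quantity.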
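The margin-based bound
- Theorem 3: Let F be a class of [-1,+1]-valued functions. Then, with probability at least 1-δ, for every f in F the generalization error is bounded by the empirical margin-based error plus a complexity term: $L(f) \le \hat{L}^{\theta}(f) + \frac{c\, R_N(F)}{\theta} + O\Bigl(\sqrt{\frac{\ln(1/\delta)}{N}}\Bigr)$, for a small absolute constant c
- Note that: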
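Application to boosting
- In boosting the considered hypothesis space is the convex hull of the base class H: $F = \mathrm{co}(H) = \{ f : f(x) = \sum_t \alpha_t h_t(x),\ \alpha_t \ge 0,\ \sum_t \alpha_t = 1,\ h_t \in H \}$
- The Rademacher complexity of F does not depend on T: $R_N(\mathrm{co}(H)) = R_N(H)$
- Whereas the VC dimension of the thresholded combinations grows with T
- The margin-based generalization bound does not depend on T!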
4. Boosting and large margins
[Figure: mapping from input space X to feature space F]
- Linear separation in feature space F corresponds to a nonlinear separation in the original input space X
- Under what conditions does boosting compute a combined hypothesis with large margin?
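Min-Max theorem
- The edge of a weak learner h under example weights d: $\gamma(h, d) = \sum_{n=1}^{N} d_n y_n h(x_n)$
- The margin of the combined hypothesis with normalized weights α: $\rho(\alpha) = \min_{n} y_n \sum_j \alpha_j h_j(x_n)$
- Theorem 4, the connection between edge and margin: $\gamma^* = \min_{d} \max_{h} \gamma(h, d) = \max_{\alpha} \min_{n} y_n \sum_j \alpha_j h_j(x_n) = \rho^*$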
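Both quantities are easy to compute for a concrete ensemble. The sketch below assumes the definitions commonly used in this setting: the edge of a hypothesis under example weights d is $\sum_n d_n y_n h(x_n)$, and the margin of a combination is $\min_n y_n \sum_j \alpha_j h_j(x_n)$ with the $\alpha_j$ normalized to sum to one. The function names and the three hard-coded stumps are illustrative.

```python
import numpy as np

def edge(d, y, h_values):
    """Edge of one hypothesis under example weights d: sum_n d_n * y_n * h(x_n)."""
    return float(np.sum(d * y * h_values))

def margin(alpha, y, H_values):
    """Smallest margin min_n y_n * sum_j alpha_j h_j(x_n), after normalizing alpha."""
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / alpha.sum()
    return float(np.min(y * (alpha @ H_values)))

# Six examples and the outputs of three stumps on them (made-up values).
y = np.array([1, 1, -1, -1, 1, 1])
H_values = np.array([
    [ 1,  1, -1, -1, -1, -1],
    [-1, -1, -1, -1,  1,  1],
    [ 1,  1,  1,  1,  1,  1],
])
d = np.full(6, 1 / 6)                                   # uniform example weights

gamma_1 = edge(d, y, H_values[0])                       # edge of the first stump
rho = margin(0.5 * np.log([2.0, 3.0, 5.0]), y, H_values)
```

With uniform weights the first stump gets four of six examples right, so its edge is 4/6 - 2/6 = 1/3, which matches the relation edge = 1 - 2ε for error ε = 1/3. The margin of the combination is positive, so it classifies all six examples correctly.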
Adaboost_r algorithm (Breiman 1997)
- Rule for choosing hypothesis weight:
- If γ* > r, it guarantees a margin ρ:
- where γ* is the minimum edge of the hypotheses $h_t$.
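A sketch of the weight rule, assuming the form in which it is commonly stated for this family of algorithms: $\alpha_t = \frac{1}{2}\ln\frac{1+\gamma_t}{1-\gamma_t} - \frac{1}{2}\ln\frac{1+r}{1-r}$, where $\gamma_t$ is the edge of $h_t$ under the current example weights and r is the target margin. Both the formula and the function name `adaboost_r_weight` are assumptions made for illustration.

```python
import math

def adaboost_r_weight(gamma_t, r):
    """Assumed hypothesis-weight rule for edge gamma_t and target margin r, both in (-1, 1)."""
    edge_term = 0.5 * math.log((1 + gamma_t) / (1 - gamma_t))
    target_term = 0.5 * math.log((1 + r) / (1 - r))
    return edge_term - target_term

# With r = 0 and gamma_t = 1 - 2 * eps_t this reduces to 0.5 * ln((1 - eps_t) / eps_t).
alpha_plain = adaboost_r_weight(1 / 3, 0.0)
alpha_at_target = adaboost_r_weight(0.5, 0.5)
```

Under this rule a hypothesis only receives positive weight when its edge exceeds the target r, and setting r = 0 recovers the plain Adaboost weight.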
Achieving the optimal margin bound
Arc-GV
- Choose $r_t$ based on the margin achieved so far by the combined hypothesis:
- Convergence rate to maximum margin is not known
Marginal Adaboost
- Run Adaboost_r and measure the achieved margin ρ
- If ρ < r, then run Adaboost_r with a decreased r
- Otherwise run Adaboost_r with an increased r
- Converges fast to the maximum margin
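The Marginal Adaboost loop above amounts to a search over r. One way to sketch it is as a bisection, which is an assumption about how r is decreased and increased. The callable `run_adaboost_r` is hypothetical: it stands for training with target r and returning the margin ρ that was achieved. The stand-in `fake_run` below exists only to make the sketch runnable.

```python
def marginal_adaboost(run_adaboost_r, lo=0.0, hi=1.0, tol=1e-3):
    """Bisection on the target margin r.

    run_adaboost_r(r) is a hypothetical callable that trains with target r
    and returns the margin rho achieved by the combined hypothesis.
    """
    while hi - lo > tol:
        r = 0.5 * (lo + hi)
        rho = run_adaboost_r(r)
        if rho < r:
            hi = r    # target not reached: decrease r
        else:
            lo = r    # target reached: try a larger r
    return lo         # largest target known to be reachable, up to tol

def fake_run(r):
    # Stand-in for training: pretends targets up to 0.3 are reachable.
    return r if r <= 0.3 else r - 0.1

best_r = marginal_adaboost(fake_run)
```

Each call halves the interval that can contain the maximum margin, so about log2(1/tol) runs are needed under this scheme.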
Summary
- Boosting takes a weak learner and converts it to a strong one
- Works by asymptotically minimizing the empirical error
- Effectively maximizes the margin of the combined hypothesis
- Obeys a “low” generalization error bound under the “luckiness” assumption