
Boosting and Additive Trees (2)


1 Boosting and Additive Trees (2)
Yi Zhang, Kevyn Collins-Thompson. Advanced Statistical Seminar, Oct 29, 2002

2 Recap: Boosting (1)
Background: Ensemble Learning
Boosting: Definitions, Example
AdaBoost
Boosting as an Additive Model
Boosting: Practical Issues
Exponential Loss
Other Loss Functions
Boosting Trees
Boosting as Entropy Projection
Data Mining Methods

3 Outline for This Class
Find the solution based on numerical optimization
Control model complexity and avoid overfitting: right-sized trees for boosting, number of iterations, regularization
Understand the final model (interpretation): single variables, correlation of variables

4 Numerical Optimization
Goal: find the f that minimizes the loss function over the training data.
Gradient descent: search the unconstrained function space to minimize the loss on the training data; the loss on the training data converges to zero.

5 Gradient Search on a Constrained Function Space: Gradient Tree Boosting
At the m-th iteration, introduce a tree whose predictions t_m are as close as possible to the negative gradient.
Advantage over unconstrained gradient search: more robust, less likely to overfit.

6 Algorithm 3: MART
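The algorithm box itself did not survive the transcript. As a stand-in, here is a minimal sketch of MART-style gradient tree boosting for squared-error loss (not the book's Algorithm 3 verbatim): scikit-learn's DecisionTreeRegressor plays the weak learner, and the function names, defaults, and the shrinkage factor nu (anticipating slide 12) are my own.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost(X, y, M=100, J=6, nu=0.1):
    """Fit f(x) = f0 + nu * sum_m tree_m(x) by functional gradient descent (squared loss)."""
    f = np.full(len(y), y.mean())          # f_0: the constant that minimizes squared loss
    trees = []
    for m in range(M):
        residual = y - f                    # negative gradient of 1/2*(y - f)^2 w.r.t. f
        tree = DecisionTreeRegressor(max_leaf_nodes=J)  # J terminal nodes limits interaction level
        tree.fit(X, residual)               # fit the tree to the pointwise negative gradient
        f += nu * tree.predict(X)           # shrunken update of the additive expansion
        trees.append(tree)
    return y.mean(), trees

def boost_predict(f0, trees, X, nu=0.1):
    return f0 + nu * sum(t.predict(X) for t in trees)

For squared-error loss, fitting the tree to the residuals already gives the optimal terminal-node constants; general MART re-estimates the leaf values for the chosen loss.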

7 View Boosting as a Linear Model
Basis expansion: use basis functions T_m (m = 1..M, each T_m a weak learner) to transform the input vector X into the space spanned by the T's, then fit a linear model in this new space.
Special to boosting: the choice of basis function T_m depends on T_1, ..., T_{m-1}.
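In symbols, the resulting additive model is the familiar linear basis expansion (standard notation, not taken verbatim from the slide):

f(x) = \sum_{m=1}^{M} \beta_m T_m(x)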

8 Improve Boosting as a Linear Model
Recap: linear models in Chapter 3
  Bias-variance trade-off
  Subset selection (feature selection, discrete)
  Coefficient shrinkage (smoothing: ridge, lasso)
  Using derived input directions (PCA, PLS)
  Multiple-outcome shrinkage and selection: exploit correlations among the different outcomes
This chapter: improve boosting
  Size of the constituent trees J
  Number of boosting iterations M (subset selection)
  Regularization (shrinkage)

9 Right-Sized Trees for Boosting
The best tree for one step is not the best in the long run.
Using a very large tree (such as C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion; this usually degrades performance and increases computation.
Simple approach: restrict all trees to the same size J.
J limits the interaction level of the tree-based approximation among input features.
In practice low-order interaction effects tend to dominate, and empirically 4 ≤ J ≤ 8 works well.

10

11 Number of Boosting Iterations (Subset Selection)
Boosting will overfit as M → ∞.
Use a validation set to choose M.
Other methods ... (later)
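One concrete way to pick M with a validation set, sketched with scikit-learn (not part of the original slides; X_train, y_train, X_val, y_val are placeholder names assumed to exist):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees,
# so the validation error curve is traced without refitting.
val_errors = [np.mean((y_val - pred) ** 2)
              for pred in gbm.staged_predict(X_val)]
best_M = int(np.argmin(val_errors)) + 1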

12 Shrinkage
Scale the contribution of each tree by a factor 0 < ν < 1 to control the learning rate.
Both ν and M control prediction risk on the training data, and they do not operate independently: smaller ν requires larger M to reach the same training risk (ν ↓, M ↑).
Shrinking each tree's contribution (penalizing each step's effect) means more trees, and hence more features, end up included in the model.
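Written out, the shrinkage-modified update at step m has the standard form (γ_jm are the fitted terminal-node constants, R_jm the terminal-node regions):

f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} \, I(x \in R_{jm}), \qquad 0 < \nu < 1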

13

14 Penalized Regression
Ridge regression or lasso regression
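The slide's formulas are missing from the transcript; the generic penalized-regression objective over a basis of trees {T_k} has the standard form, with ridge using the squared penalty and the lasso the absolute one:

\min_{\alpha} \; \sum_{i=1}^{N}\Big(y_i - \sum_{k} \alpha_k T_k(x_i)\Big)^2 + \lambda \, J(\alpha),
\qquad J(\alpha)=\sum_k \alpha_k^2 \ \text{(ridge)} \quad \text{or} \quad J(\alpha)=\sum_k |\alpha_k| \ \text{(lasso)}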

15 Algorithm 4: Forward stagewise linear
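The algorithm box did not survive the transcript. Below is a minimal sketch of ε-forward-stagewise linear fitting in the spirit of Algorithm 4, assuming the basis functions have already been evaluated into a design matrix T with standardized columns (one column per basis function); the function and variable names are my own.

import numpy as np

def forward_stagewise(T, y, eps=0.01, M=5000):
    """epsilon-forward-stagewise linear regression over a fixed basis matrix T (N x K)."""
    alpha = np.zeros(T.shape[1])
    residual = y - y.mean()               # start from the centered response (intercept handled separately)
    for m in range(M):
        corr = T.T @ residual             # inner product of each basis column with the residual
        k = int(np.argmax(np.abs(corr)))  # basis function that best fits the current residual
        step = eps * np.sign(corr[k])     # tiny step in that coordinate only
        alpha[k] += step
        residual -= step * T[:, k]        # update the residual
    return alpha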

16 If |α̂_k(λ)| is monotone in λ, then Σ_k |α_k| = ε · M, and the solution of Algorithm 4 is identical to the lasso regression result described on page 64: (ε, M) plays the role of the lasso constraint t in s.t. Σ_k |α_k| ≤ t.

17 More About Algorithm 4
Algorithm 4 ≈ Algorithm 3 + shrinkage
L1 norm vs. L2 norm: more details later, in Chapter 12, after learning SVMs

18 Interpretation: Understanding the Final Model
Single decision trees are easy to interpret.
A linear combination of trees is difficult to understand.
Which features are important?
What is the interaction between features?

19 Relative Importance of Individual Variables
For a single tree T, define the importance of x_ℓ as the sum of squared improvements over all internal nodes that split on x_ℓ.
For an additive tree model, average this importance over the M trees.
For K-class classification, treat it as K two-class classification tasks and average the importances.
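The formulas themselves did not survive the transcript; the standard definitions (as in Hastie, Tibshirani and Friedman) are, with î_t² the squared improvement at internal node t and v(t) its splitting variable:

\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^{\,2} \, I\big(v(t) = \ell\big),
\qquad
\mathcal{I}_\ell^2 = \frac{1}{M}\sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m),
\qquad
\mathcal{I}_\ell^2 = \frac{1}{K}\sum_{k=1}^{K} \mathcal{I}_{\ell k}^2 \ \ \text{(K-class)}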

20

21 Partial Dependence Plots
Visualize the dependence of the approximation f(x) on the joint values of a small subset X_S of important features.
Usually the size of the subset is small (1-3).
Define the average (partial) dependence as f_S(X_S) = E_{X_C}[ f(X_S, X_C) ], where X_C are the complementary features.
It can be estimated empirically from the training data as (1/N) Σ_i f(X_S, x_iC).
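A minimal sketch of the empirical estimate (my own helper, not the book's code): for each candidate value of the chosen feature, clamp that column in every training row and average the model's predictions.

import numpy as np

def partial_dependence(model, X, feature, grid):
    """Empirical partial dependence of model.predict on one feature.

    X       : (N, p) array of training inputs
    feature : column index of the feature of interest (the set S, here of size 1)
    grid    : values of that feature at which to evaluate the dependence
    """
    pd = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v                   # hold X_S fixed at v for every row
        pd.append(model.predict(Xv).mean())  # average over the empirical X_C distribution
    return np.array(pd)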

22 10.50 vs. 10.52: the two are the same if the predictor variables are independent.
Why use the marginal average E_{X_C}[ f(X_S, X_C) ] rather than the conditional expectation E[ f(X) | X_S = x_S ] to measure partial dependence?
Example 1: f(X) = h1(X_S) + h2(X_C)
Example 2: f(X) = h1(X_S) · h2(X_C)
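For these two examples the marginal average recovers h1 up to an additive or multiplicative constant (a standard one-line check, not spelled out on the slide):

E_{X_C}\big[h_1(X_S) + h_2(X_C)\big] = h_1(X_S) + E_{X_C}\big[h_2(X_C)\big],
\qquad
E_{X_C}\big[h_1(X_S)\, h_2(X_C)\big] = h_1(X_S) \cdot E_{X_C}\big[h_2(X_C)\big]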

23

24

25 Conclusion
Find the solution based on numerical optimization
Control model complexity and avoid overfitting: right-sized trees for boosting, number of iterations, regularization
Understand the final model (interpretation): single variables, correlation of variables

26

