Additive Models and Trees


1 Additive Models and Trees
Lecture Notes for CMPUT 466/551, Nilanjan Ray
Principal Source: Department of Statistics, CMU

2 Topics to cover
GAM: Generalized Additive Models
CART: Classification and Regression Trees
MARS: Multivariate Adaptive Regression Splines

3 Generalized Additive Models
What is a GAM? A model of the form
E(Y | X1, ..., Xp) = α + f1(X1) + f2(X2) + ... + fp(Xp)
The functions fj are in general smooth functions, such as splines, kernel smoothers, linear functions, and so on.
Each function can be different, e.g., f1 can be linear, f2 a natural spline, etc.
Compare GAM with linear basis expansions (Ch. 5 of [HTF]): what are the similarities? The dissimilarities?
Any similarity (in principle) with the Naïve Bayes model?

4 Smoothing Functions in GAM
Non-parametric functions (linear smoothers): smoothing splines (basis expansion); simple k-nearest neighbor (raw moving average); locally weighted averages using kernel weighting; local linear regression and local polynomial regression
Linear functions
Functions of more than one variable (interaction terms)
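As a concrete illustration of one of the smoothers listed above, here is a minimal sketch (not part of the lecture) of a locally weighted average with Gaussian kernel weights; the function name, bandwidth, and toy data are illustrative choices.

```python
# A locally weighted average (kernel smoother) sketch.
import numpy as np

def kernel_smoother(x, y, x_eval, bandwidth=0.3):
    """Return locally weighted averages of y at the points x_eval."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    fitted = np.empty(len(x_eval))
    for i, x0 in enumerate(x_eval):
        w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)  # Gaussian kernel weights
        fitted[i] = np.sum(w * y) / np.sum(w)           # weighted average
    return fitted

# Tiny usage example on noisy sine data.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 3, 200))
y = np.sin(2 * x) + 0.3 * rng.normal(size=200)
f_hat = kernel_smoother(x, y, x)
```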

5 Learning GAM: Backfitting
Backfitting algorithm:
1. Initialize: α = (1/N) Σi yi, and fj = 0 for all j.
2. Cycle over j = 1, 2, ..., p, 1, 2, ..., p, ... (m cycles):
   fj <- Sj[ {yi - α - Σ(k≠j) fk(xik)} ]   (smooth the partial residuals against xj)
   fj <- fj - (1/N) Σi fj(xij)             (re-center fj to have mean zero)
3. Repeat until the functions fj change less than a prespecified threshold.
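Below is a minimal, illustrative sketch of the backfitting cycle, assuming a simple kernel smoother plays the role of each Sj; the function names, bandwidth, and stopping rule are my own choices, not the course code.

```python
# Backfitting for y = alpha + f_1(x_1) + ... + f_p(x_p) + noise.
import numpy as np

def smooth(xj, r, bandwidth=0.3):
    """Smooth partial residuals r against one predictor xj (kernel average)."""
    w = np.exp(-0.5 * ((xj[:, None] - xj[None, :]) / bandwidth) ** 2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, n_cycles=20, tol=1e-4):
    N, p = X.shape
    alpha = y.mean()                      # step 1: initialize
    f = np.zeros((p, N))                  # f[j] holds f_j evaluated at the data
    for _ in range(n_cycles):             # step 2: cycle over j = 1, ..., p
        max_change = 0.0
        for j in range(p):
            partial = y - alpha - f.sum(axis=0) + f[j]   # remove all but f_j
            new_fj = smooth(X[:, j], partial)
            new_fj -= new_fj.mean()                      # keep f_j centered
            max_change = max(max_change, np.abs(new_fj - f[j]).max())
            f[j] = new_fj
        if max_change < tol:              # stop when the functions stop changing
            break
    return alpha, f

# Usage: two additive components recovered from noisy data.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.2 * rng.normal(size=300)
alpha_hat, f_hat = backfit(X, y)
```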

6 Backfitting: Points to Ponder
Computational Advantage? Convergence? How to choose fitting functions?

7 Example: Generalized Logistic Regression
Model: log[ Pr(Y = 1 | X) / Pr(Y = 0 | X) ] = α + f1(X1) + ... + fp(Xp)
i.e., a logistic regression in which the linear predictor is replaced by an additive one.

8 Additive Logistic Regression: Backfitting
Fitting logistic regression (P99):
1. Initialize the coefficients, e.g. β = 0.
2. Iterate:
   a. Compute pi = exp(xiT β) / (1 + exp(xiT β)).
   b. Compute the working response zi = xiT β + (yi - pi) / (pi (1 - pi)) and weights wi = pi (1 - pi).
   c. Use weighted least squares to fit a linear model to zi with weights wi, giving new estimates of β.
3. Continue step 2 until convergence.

Fitting additive logistic regression (P262):
1. Initialize: α = log[ ȳ / (1 - ȳ) ], where ȳ is the sample proportion of ones, and fj = 0 for all j.
2. Iterate:
   a. Compute ηi = α + Σj fj(xij) and pi = 1 / (1 + exp(-ηi)).
   b. Compute the working response zi = ηi + (yi - pi) / (pi (1 - pi)) and weights wi = pi (1 - pi).
   c. Use the weighted backfitting algorithm to fit an additive model to zi with weights wi, giving new estimates of α and the fj.
3. Continue step 2 until convergence.
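The sketch below illustrates the second column (local scoring for an additive logistic model) under simplifying assumptions: a weighted kernel smoother stands in for the fitting of each fj, and the iteration counts are fixed rather than tested for convergence. All names are illustrative.

```python
# Local scoring sketch for additive logistic regression.
import numpy as np

def wsmooth(xj, r, w, bandwidth=0.3):
    """Weighted kernel smoother: locally weighted average of r against xj."""
    K = np.exp(-0.5 * ((xj[:, None] - xj[None, :]) / bandwidth) ** 2)
    return (K @ (w * r)) / (K @ w)

def additive_logistic(X, y, n_outer=10, n_inner=5):
    N, p = X.shape
    ybar = y.mean()
    alpha = np.log(ybar / (1 - ybar))          # 1. initialize alpha, f_j = 0
    f = np.zeros((p, N))
    for _ in range(n_outer):                   # 2. outer (Newton) iterations
        eta = alpha + f.sum(axis=0)            # a. current additive predictor
        prob = np.clip(1.0 / (1.0 + np.exp(-eta)), 1e-6, 1 - 1e-6)
        w = prob * (1 - prob)                  # b. working weights
        z = eta + (y - prob) / w               #    working response
        for _ in range(n_inner):               # c. weighted backfitting on z
            for j in range(p):
                partial = z - alpha - f.sum(axis=0) + f[j]
                f[j] = wsmooth(X[:, j], partial, w)
                f[j] -= np.average(f[j], weights=w)   # weighted centering
            alpha = np.average(z - f.sum(axis=0), weights=w)
    return alpha, f
```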

9 SPAM Detection via Additive Logistic Regression
Input variables (predictors):
48 quantitative variables: the percentage of words in the e-mail that match a given word, e.g. business, address, internet, etc.
6 quantitative variables: the percentage of characters in the e-mail that match a given character, such as ';', '(', etc.
The average length of uninterrupted sequences of capital letters
The length of the longest uninterrupted sequence of capital letters
The sum of the lengths of uninterrupted sequences of capital letters
Output variable: spam (1) or e-mail (0)
The fj's are taken as cubic smoothing splines

10 SPAM Detection: Results
                    Predicted Class
True Class          e-mail (0)    spam (1)
e-mail (0)          58.5%         2.5%
spam (1)            2.7%          36.2%
Sensitivity: probability of predicting spam given that the true state is spam = 36.2 / (2.7 + 36.2) ≈ 93.1%
Specificity: probability of predicting e-mail given that the true state is e-mail = 58.5 / (58.5 + 2.5) ≈ 95.9%

11 GAM: Summary
Useful, flexible extensions of linear models
The backfitting algorithm is simple and modular
Interpretability of the predictors (input variables) is not obscured
Not suitable for very large data-mining applications (why?)

12 CART Overview
Principle behind it: divide and conquer
Partition the feature space into a set of rectangles; for simplicity, use recursive binary partitions
Fit a simple model (e.g., a constant) in each rectangle
Classification and Regression Trees (CART): regression trees and classification trees
Popular in medical applications

13 CART: an example in the regression case

14 Basic Issues in Tree-based Methods
How to grow a tree? How large should we grow the tree?

15 Regression Trees
Partition the space into M regions R1, R2, …, RM and fit a constant cm in each:
f(x) = Σ(m=1..M) cm I(x ∈ Rm), with cm = the average of the yi for which xi ∈ Rm
Note that this is still an additive model.

16 Regression Trees: Growing the Tree
The best partition minimizes the sum of squared errors Σi (yi - f(xi))^2.
Finding the globally optimal partition is computationally infeasible.
Greedy algorithm: at each step choose the splitting variable j and split point s as
  (j, s) = argmin(j,s) [ min(c1) Σ(xi ∈ R1(j,s)) (yi - c1)^2 + min(c2) Σ(xi ∈ R2(j,s)) (yi - c2)^2 ],
  where R1(j,s) = {x | xj ≤ s}, R2(j,s) = {x | xj > s}, and the inner minima are attained at the region means.
The greedy algorithm makes the tree unstable: an error made at an upper level is propagated to the lower levels.
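A small sketch of the greedy split search, assuming a constant fit in each half: for every predictor j and candidate split point s, the two region means are used and the pair with the smallest total squared error is kept. Names and the toy data are illustrative.

```python
# Greedy search for the best single split (j, s).
import numpy as np

def best_split(X, y):
    N, p = X.shape
    best = (None, None, np.inf)                    # (j, s, error)
    for j in range(p):
        for s in np.unique(X[:, j])[:-1]:          # candidate split values
            left = X[:, j] <= s
            right = ~left
            err = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best

# Usage on a toy one-split signal.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200, 3))
y = np.where(X[:, 0] <= 0.4, 1.0, 3.0) + 0.1 * rng.normal(size=200)
j, s, err = best_split(X, y)   # should recover roughly j = 0, s near 0.4
```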

17 Regression Trees: How Large Should We Grow the Tree?
Trade-off between bias and variance:
A very large tree overfits (low bias, high variance).
A small tree (low variance, high bias) might not capture the structure.
Strategies:
1. Split only when the split decreases the error (usually short-sighted).
2. Cost-complexity pruning (preferred).

18 Regression Trees: Pruning
Cost-complexity pruning. Pruning means collapsing some internal nodes.
Cost complexity: C_α(T) = Σ(m=1..|T|) Nm Qm(T) + α |T|
  The first term is the cost (the sum of squared errors over the terminal nodes); the second is a penalty on the complexity/size of the tree, with |T| the number of terminal nodes.
Choosing the best α: weakest-link pruning (p. 270, [HTF]):
  Each time collapse the internal node that adds the smallest per-node error, producing a nested sequence of subtrees.
  Choose the best tree from this sequence by cross-validation.
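The sketch below illustrates one weakest-link step on a toy tree representation of my own (nested dictionaries with a per-node error R); it computes, for each internal node, the per-leaf increase in error that collapsing it would cause, and collapses the smallest.

```python
# Weakest-link pruning sketch on a dict-based tree.
def leaves_and_error(node):
    """Return (number of leaves, total leaf error) of the subtree at node."""
    if 'left' not in node:                      # terminal node
        return 1, node['R']
    nl, el = leaves_and_error(node['left'])
    nr, er = leaves_and_error(node['right'])
    return nl + nr, el + er

def weakest_link(node, found=None):
    """Find the internal node whose collapse raises the error least per leaf."""
    if 'left' not in node:
        return found
    n_leaves, subtree_err = leaves_and_error(node)
    g = (node['R'] - subtree_err) / (n_leaves - 1)   # per-leaf error increase
    if found is None or g < found[0]:
        found = (g, node)
    found = weakest_link(node['left'], found)
    found = weakest_link(node['right'], found)
    return found

def prune_once(root):
    """Collapse the weakest-link internal node into a leaf (root must be internal)."""
    g, node = weakest_link(root)
    del node['left'], node['right']
    return g          # the alpha at which this collapse becomes worthwhile

# Usage: a stump whose split barely helps gets collapsed.
tree = {'R': 10.0, 'left': {'R': 4.0}, 'right': {'R': 5.5}}
print(prune_once(tree), tree)   # g = (10.0 - 9.5) / 1 = 0.5; tree becomes a leaf
```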

19 Classification Trees
Classify the observations in node m to the majority class in the node: k(m) = argmax_k p̂mk, where p̂mk is the proportion of class-k observations in node m.
Define the impurity of a node:
Misclassification error: 1 - p̂m,k(m)
Entropy: -Σk p̂mk log p̂mk
Gini index: Σk p̂mk (1 - p̂mk)
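The three impurity measures, written out as small functions of a vector of class proportions (a self-contained sketch):

```python
import numpy as np

def misclassification(p):
    return 1.0 - np.max(p)

def entropy(p):
    p = p[p > 0]                         # avoid log(0)
    return -np.sum(p * np.log(p))

def gini(p):
    return np.sum(p * (1.0 - p))

# Example: a two-class node with proportions (0.7, 0.3).
p = np.array([0.7, 0.3])
print(misclassification(p), entropy(p), gini(p))
```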

20 Classification Trees
(Figure: node impurity measures versus class proportion for a two-class problem.)
Entropy and Gini are more sensitive to changes in the node probabilities than the misclassification rate.
To grow the tree: use entropy or Gini.
To prune the tree: use the misclassification rate (or any of the other measures).

21 Tree-based Methods: Discussions
Categorical predictors.
Problem: splitting a node t into tL and tR on a categorical predictor x with q possible values allows 2^(q-1) - 1 distinct partitions!
Remedy (for a two-class outcome): treat the categorical predictor as ordered, say by the proportion of class 1 in each category, and then split it as if it were numeric.
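A short sketch of the ordering trick for a two-class problem: sort the categories by their proportion of class 1 and scan only the q - 1 splits along that ordering instead of all 2^(q-1) - 1 subset partitions. The function name and toy data are illustrative.

```python
import numpy as np

def ordered_category_splits(cat, y):
    """cat: array of category labels, y: 0/1 class labels."""
    cats = np.unique(cat)
    prop1 = np.array([y[cat == c].mean() for c in cats])   # P(class 1 | category)
    order = cats[np.argsort(prop1)]                        # categories ordered by it
    # Each candidate split sends the first k ordered categories left.
    return [set(order[:k].tolist()) for k in range(1, len(order))]

# Usage: q = 4 categories yields only 3 candidate splits to evaluate.
cat = np.array(['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd'])
y   = np.array([ 0 ,  1 ,  1 ,  0 ,  0 ,  1 ,  0 ,  1 ])
print(ordered_category_splits(cat, y))
```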

22 Tree-based Methods: Discussions
Linear combination splits: split the node based on Σj aj Xj ≤ s.
  This can improve the predictive power but hurts interpretability.
Instability of trees: inherited from their hierarchical nature.
  Bagging (Section 8.7 of [HTF]) can reduce the variance.

23 Bootstrap Trees
Construct B trees from B bootstrap samples; these are the bootstrap trees.

24 Bootstrap Trees

25 Bagging the Bootstrap Trees
The bagged estimate is f̂_bag(x) = (1/B) Σ(b=1..B) f̂*b(x), where f̂*b(x) is computed from the bth bootstrap sample, in this case a tree.
Bagging reduces the variance of the original tree by aggregation.
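A short sketch of bagging regression trees; it assumes scikit-learn's DecisionTreeRegressor as the base learner (any tree fitter would do) and uses a plain bootstrap by sampling row indices with replacement.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, B=50, max_depth=3, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    n = len(y)
    for _ in range(B):
        idx = rng.integers(0, n, n)                 # bootstrap sample, with replacement
        tree = DecisionTreeRegressor(max_depth=max_depth)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    # f_bag(x) = average of the B bootstrap-tree predictions
    return np.mean([t.predict(X) for t in trees], axis=0)
```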

26 Bagged Tree Performance
(Figure: performance of the bagged trees, aggregated by majority vote and by averaging.)

27 MARS
In multi-dimensional splines the number of basis functions grows exponentially: the curse of dimensionality.
A partial remedy is a greedy forward search algorithm:
Create a simple basis-construction dictionary.
Construct basis functions on the fly.
Choose the best-fitting basis function at each step.

28 Basis Functions
One-dimensional linear splines, used as a reflected pair (t represents the knot):
  (x - t)+ = x - t if x > t, else 0, and (t - x)+ = t - x if x < t, else 0.
Basis collection C: one reflected pair for each predictor Xj and each observed value xij used as a knot,
  C = { (Xj - t)+, (t - Xj)+ : t ∈ {x1j, ..., xNj}, j = 1, ..., p }, so |C| = 2 * N * p.
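The reflected pair and the candidate collection C, written out as a small sketch (names are illustrative):

```python
import numpy as np

def hinge_pair(x, t):
    """Return the reflected pair (x - t)_+ and (t - x)_+."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

def basis_collection(X):
    """All candidate pairs: one per (predictor j, knot t among the observed x_ij)."""
    N, p = X.shape
    return [(j, t) for j in range(p) for t in np.unique(X[:, j])]

# Usage.
X = np.random.default_rng(3).uniform(size=(5, 2))
pos, neg = hinge_pair(X[:, 0], t=X[2, 0])     # one reflected pair
print(len(basis_collection(X)))               # 10 (j, t) pairs -> up to 2*N*p = 20 basis functions
```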

29 The MARS procedure (1st stage)
1. Initialize the basis set M with the constant function.
2. Form candidate terms: products of functions already in M with reflected pairs from the set C.
3. Add to M the best-fit pair of products (the one that decreases the residual error the most).
4. Repeat from step 2 until, e.g., |M| reaches a preset size.
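A compressed sketch of the forward procedure under simplifying assumptions: it ignores the reuse restrictions discussed later, uses every observed value as a candidate knot, and refits by ordinary least squares at each trial. Names are illustrative, and the triple loop is written for clarity, not speed.

```python
import numpy as np

def rss(B, y):
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return np.sum((y - B @ coef) ** 2)

def mars_forward(X, y, max_terms=9):
    N, p = X.shape
    B = np.ones((N, 1))                                  # M = {constant}
    while B.shape[1] < max_terms:
        best = (np.inf, None)
        for m in range(B.shape[1]):                      # existing basis h_m
            for j in range(p):
                for t in np.unique(X[:, j]):             # candidate knots
                    pos = B[:, m] * np.maximum(X[:, j] - t, 0.0)
                    neg = B[:, m] * np.maximum(t - X[:, j], 0.0)
                    cand = np.column_stack([B, pos, neg])
                    err = rss(cand, y)
                    if err < best[0]:
                        best = (err, cand)
        B = best[1]                                      # add the best-fit product pair
    return B

# Usage (small data; each step fits O(|M| * N * p) least-squares candidates).
rng = np.random.default_rng(4)
X = rng.uniform(size=(60, 2))
y = np.maximum(X[:, 0] - 0.5, 0) + 0.1 * rng.normal(size=60)
B = mars_forward(X, y, max_terms=5)
```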

30 The MARS procedure (2nd stage)
The final model M typically overfits the data, so we need to reduce the model size (the number of terms).
Backward deletion procedure:
1. Remove the term whose deletion causes the smallest increase in residual squared error.
2. Compute the GCV criterion for the reduced model.
3. Repeat from step 1.
Choose the model size with the minimum GCV.

31 Generalized Cross Validation (GCV)
GCV(λ) = Σ(i=1..N) (yi - f̂λ(xi))^2 / (1 - M(λ)/N)^2
M(λ) measures the effective number of parameters: M(λ) = r + c K, where
  r: number of linearly independent basis functions in the model
  K: number of knots selected
  c = 3
Why c = 3? On page 286 of [HTF], c = 3 is motivated by mathematical arguments and simulation results.
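The GCV criterion as a small function (variable names are mine):

```python
import numpy as np

def gcv(y, y_hat, r, K, c=3.0):
    """GCV = RSS / (1 - M/N)^2 with M = r + c*K.

    r: number of linearly independent basis functions in the model
    K: number of knots selected in the forward process
    c: cost per knot (c = 3 on the slide)
    """
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    N = len(y)
    M = r + c * K                          # effective number of parameters
    rss = np.sum((y - y_hat) ** 2)
    return rss / (1.0 - M / N) ** 2
```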

32 Discussion
Piecewise linear reflected basis functions:
They allow operation on a local region.
Fitting the N reflected basis pairs for a predictor takes O(N) rather than O(N^2) operations: as the knot moves from one data value to the next, the left part stays zero and the right part differs only by a constant, so the fit can be updated incrementally.
A reflected basis pair is linearly independent, so (x - t)+ and (t - x)+ can be fit individually.

33 Discussion (continued)
Hierarchical model (reduces the search computation): a higher-order term can enter only if some lower-order "footprint" terms already exist in the model.
Restriction: each input appears at most once in a product, e.g. (Xj - t1) * (Xj - t1) is not considered.
An upper limit can be set on the order of interaction; an upper limit of 1 gives an additive model.
MARS for classification:
  Use a multi-response Y (an N x K indicator matrix).
  The masking problem may occur.
  A better solution is "optimal scoring" (Chapter 12.5 of [HTF]).

34 MARS & CART relationship
IF we
  replace the piecewise linear basis functions by step functions, and
  keep only the newly formed product terms in M (so a term involved in a product is no longer available, mimicking the leaf nodes of a binary tree),
THEN the MARS forward procedure is the same as the CART tree-growing procedure.

