Published by Oliver Haynes. Modified over 9 years ago.
1
Comp 540 Chapter 9: Additive Models, Trees, and Related Methods
Came out of my personal experience with 301 – Fourier analysis and linear systems
Ryan King
2
Overview 9.1 Generalized Additive Models 9.2 Tree-based Methods (CART)
9.4 MARS 9.6 Missing Data 9.7 Computational Considerations
3
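Generalized Additive Models
Generally have the form: $E(Y \mid X_1, \dots, X_p) = \alpha + f_1(X_1) + \dots + f_p(X_p)$, where the $f_j$ are unspecified smooth functions.
Example: logistic regression becomes additive.
Logistic regression: $\log\frac{\mu(X)}{1 - \mu(X)} = \alpha + \beta_1 X_1 + \dots + \beta_p X_p$, with $\mu(X) = \Pr(Y = 1 \mid X)$
Additive logistic regression: $\log\frac{\mu(X)}{1 - \mu(X)} = \alpha + f_1(X_1) + \dots + f_p(X_p)$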
4
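Link Functions
The conditional mean $\mu(X)$ is related to an additive function of the predictors via a link function $g$: $g[\mu(X)] = \alpha + f_1(X_1) + \dots + f_p(X_p)$
Identity: $g(\mu) = \mu$ (Gaussian)
Logit: $g(\mu) = \log\frac{\mu}{1 - \mu}$ (binomial)
Log: $g(\mu) = \log\mu$ (Poisson)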
5
9.1.1 Fitting Additive Models
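Ex: Additive Cubic Splines. Each $f_j$ is a cubic smoothing spline in the component $X_j$.
Penalized Residual Sum of Squares Criterion (PRSS): $\mathrm{PRSS}(\alpha, f_1, \dots, f_p) = \sum_{i=1}^N \big(y_i - \alpha - \sum_{j=1}^p f_j(x_{ij})\big)^2 + \sum_{j=1}^p \lambda_j \int f_j''(t_j)^2 \, dt_j$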
6
9.1 Backfitting Algorithm
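Initialize: $\hat\alpha = \frac{1}{N}\sum_{i=1}^N y_i$, $\hat f_j \equiv 0$ for all $j$
Cycle over $j = 1, 2, \dots, p, 1, 2, \dots$: $\hat f_j \leftarrow S_j\big[\{y_i - \hat\alpha - \sum_{k \ne j} \hat f_k(x_{ik})\}_1^N\big]$, then $\hat f_j \leftarrow \hat f_j - \frac{1}{N}\sum_{i=1}^N \hat f_j(x_{ij})$, where $S_j$ is a smoother applied to the partial residuals
Until the functions $\hat f_j$ change less than a threshold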
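A minimal sketch of the backfitting cycle, assuming a simple running-mean smoother in place of the cubic smoothing spline. The names `knn_smoother` and `backfit`, the window size `k`, and the stopping values are illustrative choices, not part of the slides.

```python
def knn_smoother(x, r, k=7):
    """Running-mean smoother (assumed stand-in for S_j): average r over the k nearest x-values."""
    fitted = []
    for xi in x:
        nearest = sorted(range(len(x)), key=lambda i: abs(x[i] - xi))[:k]
        fitted.append(sum(r[i] for i in nearest) / k)
    return fitted


def backfit(X, y, smoother=knn_smoother, tol=1e-6, max_cycles=50):
    """X is a list of p columns, each a list of N values.

    Returns (alpha, f) where f[j][i] is the fitted value f_j(x_ij).
    """
    n, p = len(y), len(X)
    alpha = sum(y) / n                      # initialize: alpha is the mean response
    f = [[0.0] * n for _ in range(p)]       # initialize: every f_j is zero
    for _ in range(max_cycles):
        max_change = 0.0
        for j in range(p):
            # partial residuals: remove alpha and all other fitted functions
            partial = [y[i] - alpha - sum(f[k][i] for k in range(p) if k != j)
                       for i in range(n)]
            new_fj = smoother(X[j], partial)
            mean_fj = sum(new_fj) / n
            new_fj = [v - mean_fj for v in new_fj]   # center so f_j averages to zero
            max_change = max(max_change,
                             max(abs(a - b) for a, b in zip(new_fj, f[j])))
            f[j] = new_fj
        if max_change < tol:                # stop once the functions barely change
            break
    return alpha, f
```

Any one-dimensional smoother with the same `(x, residuals) -> fitted values` signature can be passed in, which is the modularity the procedure is known for.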
7
9.1.3 Summary Additive models extend linear models
Flexible, but still interpretable Simple, modular, backfitting procedure Limitations for large data-mining applications
8
9.2 Tree-Based Methods Partition the feature space
Fit a simple model (constant) in each partition Simple, but powerful CART: Classification and Regression Trees, Breiman et al, 1984
9
9.2 Binary Recursive Partitions
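[Figure: a binary recursive partition of the (x1, x2) feature space into regions a to f, shown alongside the corresponding binary tree with leaves a to f]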
10
9.2 Regression Trees CART is a top down (divisive) greedy procedure
Partitioning is a local decision for each node
A partition on variable $j$ at value $s$ creates regions: $R_1(j, s) = \{X \mid X_j \le s\}$ and $R_2(j, s) = \{X \mid X_j > s\}$
11
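9.2 Regression Trees
Each node chooses $j, s$ to solve: $\min_{j, s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j, s)} (y_i - c_2)^2 \Big]$
For any choice $j, s$ the inner minimization is solved by: $\hat c_1 = \mathrm{ave}(y_i \mid x_i \in R_1(j, s))$ and $\hat c_2 = \mathrm{ave}(y_i \mid x_i \in R_2(j, s))$
Easy to scan through all choices of $j, s$ to find the optimal split
After the split, recur on $R_1$ and $R_2$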
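The split search can be sketched as an exhaustive scan. This is a deliberately naive version that recomputes the region means for every candidate; the function name `best_split` and the row-major data layout are assumptions made for illustration.

```python
def best_split(X, y):
    """Return (j, s, sse) for the single split X_j <= s with the smallest squared error.

    X is a list of rows (each a list of p feature values); y is the response list.
    """
    def sse(vals):
        if not vals:
            return 0.0
        c = sum(vals) / len(vals)          # inner minimization: c is the region mean
        return sum((v - c) ** 2 for v in vals)

    best = None
    for j in range(len(X[0])):
        # the largest observed value is skipped because it would leave R2 empty
        for s in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best
```

A practical implementation would sort each feature once and update the two sums incrementally rather than rebuilding `left` and `right` for every split point.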
12
9.2 Cost-Complexity Pruning
How large do we grow the tree? Which nodes should we keep? Grow tree out to fixed depth, prune back based on cost-complexity criterion.
13
9.2 Terminology
A subtree: $T \subset T_0$ implies $T$ is a pruned version of $T_0$
Tree $T$ has $M$ leaf nodes, each indexed by $m$
Leaf node $m$ maps to region $R_m$
$|T|$ denotes the number of leaf nodes of $T$
$N_m$ is the number of data points in region $R_m$
14
9.2 Cost-Complexity Pruning
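We define the cost complexity criterion: $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$, where $Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat c_m)^2$ and $\hat c_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i$
For each $\alpha$, find $T_\alpha \subseteq T_0$ to minimize $C_\alpha(T)$
Choose $\alpha$ by cross-validation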
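To make the trade-off concrete, here is a small sketch that evaluates the criterion for a regression tree described only by the responses in each leaf. It assumes the usual form of the criterion (total within-leaf squared error plus alpha times the number of leaves); `cost_complexity` is an illustrative name.

```python
def cost_complexity(leaves, alpha):
    """leaves: list of lists of y-values, one list per leaf region R_m."""
    total = 0.0
    for ys in leaves:
        c_m = sum(ys) / len(ys)                        # leaf prediction: mean response
        total += sum((yi - c_m) ** 2 for yi in ys)     # N_m * Q_m(T)
    return total + alpha * len(leaves)                 # penalty on the number of leaves


full_tree = [[1, 1], [3, 3], [10], [12]]     # four leaves, zero training error
pruned_tree = [[1, 1, 3, 3], [10, 12]]       # two leaves, some training error
```

With these two hypothetical trees, a small alpha favors the four-leaf tree and a large alpha favors the pruned one, which is why alpha is tuned by cross-validation.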
15
9.2 Classification Trees
We define the same cost complexity criterion: $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$
But choose a different measure of node impurity $Q_m(T)$
16
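9.2 Impurity Measures
With $\hat p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$ the proportion of class $k$ in node $m$, and $k(m) = \arg\max_k \hat p_{mk}$ the majority class:
Misclassification Error: $1 - \hat p_{m k(m)}$
Gini index: $\sum_{k=1}^K \hat p_{mk} (1 - \hat p_{mk})$
Cross-entropy: $-\sum_{k=1}^K \hat p_{mk} \log \hat p_{mk}$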
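The three measures can be compared on a single node with a short sketch. It assumes the standard definitions in terms of the class proportions within the node; the function name `impurities` is illustrative.

```python
import math
from collections import Counter


def impurities(labels):
    """Compute the three node impurity measures from the class labels in one node."""
    n = len(labels)
    props = [count / n for count in Counter(labels).values()]   # class proportions
    return {
        "misclassification": 1 - max(props),
        "gini": sum(p * (1 - p) for p in props),
        # classes absent from the node never appear in props, so log(0) cannot occur
        "cross_entropy": -sum(p * math.log(p) for p in props),
    }
```

For a balanced two-class node, `impurities([0, 0, 1, 1])` gives 0.5 for both misclassification error and Gini, and log 2 for cross-entropy; a pure node scores zero on all three.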
17
9.2 Categorical Predictors
How do we handle categorical variables?
In general, there are $2^{q-1} - 1$ possible partitions of $q$ values into two groups
Trick for 0-1 case: sort the predictor classes by proportion falling in outcome class 1, then partition as normal
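The ordering trick for a 0-1 outcome can be sketched as follows. The function name `order_categories` is hypothetical; once the categories are ordered, only the splits between consecutive categories need to be examined.

```python
def order_categories(categories, y):
    """Sort category labels by the proportion of their observations with y == 1."""
    counts, ones = {}, {}
    for c, yi in zip(categories, y):
        counts[c] = counts.get(c, 0) + 1
        ones[c] = ones.get(c, 0) + yi
    return sorted(counts, key=lambda c: ones[c] / counts[c])


cats = ["a", "b", "c", "a", "b", "c", "c"]
outcome = [1, 0, 1, 1, 0, 0, 1]
ordered = order_categories(cats, outcome)   # ['b', 'c', 'a']
```

Treating the ordered categories as an ordinal predictor leaves q - 1 candidate split points instead of the full set of two-group partitions.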
18
9.2 CART Example Examples…
19
9.3 PRIM: Bump Hunting
Partition based, but not tree-based
Seeks boxes where the response average is high Top-down algorithm
20
Patient Rule Induction Method
1. Start with all data, and maximal box.
2. Shrink the box by compressing one face, to peel off a fraction alpha of the observations. Choose the peeling that produces the highest response mean.
3. Repeat step 2 until some minimal number of observations remain.
4. Expand the box along any face, as long as the resulting box mean increases.
5. Steps 1-4 give a sequence of boxes; use cross-validation to choose a member of the sequence, and call that box B1.
6. Remove B1 from the dataset and repeat the process to find another box, as desired.
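A single peeling step might look like the sketch below. It assumes numeric features only and peels a fixed count of points from one face, which can split tied values; `peel_once` and its arguments are illustrative names, and the box must contain more points than are peeled.

```python
def peel_once(X, y, alpha=0.1):
    """Return (kept_indices, box_mean) after the single best peel.

    X is a list of rows of numeric features for the points currently in the box.
    """
    n = len(y)
    n_peel = max(1, int(alpha * n))        # number of observations to peel off
    best_keep, best_mean = None, float("-inf")
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        # candidate peels: compress the low face or the high face of feature j
        for keep in (order[n_peel:], order[:-n_peel]):
            mean = sum(y[i] for i in keep) / len(keep)
            if mean > best_mean:
                best_keep, best_mean = keep, mean
    return sorted(best_keep), best_mean
```

Calling this repeatedly on the surviving points until a minimum count remains yields the nested sequence of boxes described in the procedure.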
21
9.3 PRIM Summary Can handle categorical predictors, as CART does
Designed for regression; two-class classification can be handled by coding the outcome as 0-1
Non-trivial to deal with k>2 classes
More patient than CART
22
9.4 Multivariate Adaptive Regression Splines (MARS)
Generalization of stepwise linear regression Modification of CART to improve regression performance Able to capture additive structure Not tree-based
23
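9.4 MARS Continued
Additive model with an adaptive set of basis functions
Basis built up from simple piecewise linear functions $(x - t)_+$ and $(t - x)_+$
Set “C” represents the candidate set of linear splines, with “knees” (knots) at each observed value $x_{ij}$: $C = \{(X_j - t)_+, (t - X_j)_+\}$ for $t \in \{x_{1j}, \dots, x_{Nj}\}$, $j = 1, \dots, p$
Models are built with elements from C or their products.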
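The piecewise linear building blocks can be written down directly. This sketch assumes the usual reflected pair of hinge functions with a knot at t; `hinge_pair` and `candidate_knots` are illustrative names.

```python
def hinge_pair(t):
    """Return the reflected pair (x - t)_+ and (t - x)_+ for a knot at t."""
    return (lambda x: max(x - t, 0.0)), (lambda x: max(t - x, 0.0))


def candidate_knots(X):
    """List every (feature index, knot) pair, one knot per distinct observed value."""
    p = len(X[0])
    return [(j, t) for j in range(p) for t in sorted({row[j] for row in X})]


pos, neg = hinge_pair(2.0)
values = [pos(3.5), pos(1.0), neg(1.0), neg(3.0)]   # [1.5, 0.0, 1.0, 0.0]
```

Each function is zero on one side of its knot, which is what lets products of these terms act locally.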
24
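9.4 MARS Procedure
Model has form: $f(X) = \beta_0 + \sum_{m=1}^M \beta_m h_m(X)$, where each $h_m$ is a function in C or a product of such functions
Given a choice for the $h_m$, the coefficients $\beta_m$ are chosen by standard linear regression.
Start with $h_0(X) = 1$. All functions in C are candidate functions.
At each stage, consider as a new basis function pair all products of a function $h_\ell$ in the model set $\mathcal{M}$ with one of the reflected pairs in C.
We add to the model terms of the form: $\hat\beta_{M+1} \, h_\ell(X) \, (X_j - t)_+ + \hat\beta_{M+2} \, h_\ell(X) \, (t - X_j)_+$, with $h_\ell \in \mathcal{M}$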
25
9.4 Choosing Number of Terms
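Large models can overfit.
Backward deletion procedure: at each step, delete the term whose removal causes the smallest increase in residual squared error, giving a sequence of models.
Pick the model using Generalized Cross Validation: $\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^N (y_i - \hat f_\lambda(x_i))^2}{(1 - M(\lambda)/N)^2}$
$M(\lambda) = r + cK$ is the effective number of parameters in the model, with $c = 3$, $r$ the number of basis functions, and $K$ the number of knots.
Choose the model which minimizes $\mathrm{GCV}(\lambda)$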
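The model-selection score can be sketched in a few lines. It assumes the GCV form with the residual sum of squares in the numerator and an effective parameter count of r + cK; the function name `gcv` and the argument layout are illustrative.

```python
def gcv(y, fitted, r, K, c=3.0):
    """Generalized cross-validation score for one model in the backward-deletion sequence.

    r: number of basis functions, K: number of knots, c: charge per knot.
    """
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    m_eff = r + c * K                       # effective number of parameters
    return rss / (1.0 - m_eff / n) ** 2     # requires m_eff < n
```

Two models with the same training error are ranked by size: the one with more basis functions or knots receives the larger score and is less likely to be selected.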
26
9.4 MARS Summary Basis functions operate locally
Forward modeling is hierarchical; multiway products are built up only from existing terms
Each input appears only once in each product
A useful option is to set a limit on the order of interaction: a limit of two allows only pairwise products, and a limit of one results in an additive model
27
9.5 Hierarchical Mixture of Experts (HME)
Variant of tree based methods Soft splits, not hard decisions At each node, an observation goes left or right with probability depending on its input values Smooth parameter optimization, instead of discrete split point search
28
9.5 HMEs continued Linear (or logistic) regression model fit at each leaf node (Expert) Splits can be multi-way, instead of binary Splits are probabilistic functions of linear combinations of inputs (gating network), rather than functions of single inputs Formally a mixture model
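A soft split can be illustrated with the smallest possible case: one gating node and two linear experts. This is a sketch under that assumption, not a full HME; the name `soft_split_predict` and the convention that `x[0]` is 1.0 for the intercept are choices made here.

```python
import math


def soft_split_predict(x, gate_w, expert_left, expert_right):
    """Predict with one probabilistic split and two linear experts.

    x and every weight vector are lists of floats, with x[0] == 1.0 for the intercept.
    """
    def dot(w):
        return sum(wi * xi for wi, xi in zip(w, x))

    # gating network: probability of going left is a logistic function
    # of a linear combination of the inputs
    p_left = 1.0 / (1.0 + math.exp(-dot(gate_w)))
    # the prediction is the probability-weighted mixture of the two experts
    return p_left * dot(expert_left) + (1.0 - p_left) * dot(expert_right)
```

Because the output is a smooth function of `gate_w`, the split can be tuned by continuous optimization instead of a discrete search over split points.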
30
9.6 Missing Data
Quite common to have data with missing values for one or more input features
Missing values may or may not distort data
For response vector $y$, let $X_{obs}$ denote the observed entries of $X$, $Z = (X, y)$, and $Z_{obs} = (X_{obs}, y)$; $R$ is an indicator matrix for missing values
31
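9.6 Missing Data
Missing at Random (MAR): $\Pr(R \mid Z, \theta) = \Pr(R \mid Z_{obs}, \theta)$
Missing Completely at Random (MCAR): $\Pr(R \mid Z, \theta) = \Pr(R \mid \theta)$
MCAR is a stronger assumption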
32
9.6 Dealing with Missing Data
Three approaches for handling MCAR data:
1. Discard observations with missing features
2. Rely on the learning algorithm to deal with missing values in its training phase
3. Impute all the missing values before training
33
9.6 Dealing…MCAR If few values are missing, (1) may work
For (2), CART can work well with missing values via surrogate splits. Additive models can assume average values. (3) is necessary for most algorithms. Simplest tactic is to use the mean or median. If features are correlated, can build predictive models for missing features in terms of known features
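The simplest imputation tactic can be sketched as below, assuming missing entries are encoded as `None` and every column has at least one observed value. The function name `impute_mean` is illustrative.

```python
def impute_mean(rows):
    """Replace None entries in each column with the mean of that column's observed values."""
    p = len(rows[0])
    means = []
    for j in range(p):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if row[j] is None else row[j] for j in range(p)]
            for row in rows]


data = [[1.0, None], [3.0, 4.0], [None, 8.0]]
filled = impute_mean(data)   # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

Swapping the mean for the median, or for the prediction of a model fit on the known features, changes only how `means[j]` is computed.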
34
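9.7 Computational Considerations
For N observations, p predictors:
Additive Models: $pN \log N + mpN$ operations, for $m$ backfitting cycles with cubic smoothing splines
Trees: $pN \log N$ for the initial sort, and typically another $pN \log N$ for the split computations
MARS: $Nm^2 + pmN$ to add a basis function to a model with $m$ terms; $NM^3 + pM^2 N$ to build an $M$-term model
HME, at each step: $Np^2$ for the regressions, and $Np^2 K^2$ for a $K$-class logistic regression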