Comp 540 Chapter 9: Additive Models, Trees, and Related Methods

Comp 540 Chapter 9: Additive Models, Trees, and Related Methods Ryan King

Overview
9.1 Generalized Additive Models
9.2 Tree-Based Methods (CART)
9.3 PRIM: Bump Hunting
9.4 MARS
9.5 Hierarchical Mixtures of Experts
9.6 Missing Data
9.7 Computational Considerations

Generalized Additive Models
Generally have the form: E[Y | X1, X2, …, Xp] = α + f1(X1) + f2(X2) + … + fp(Xp), where the fj are smooth, unspecified functions
Example: logistic regression becomes additive
Logistic regression: log( μ(X) / (1 − μ(X)) ) = α + β1 X1 + … + βp Xp, with μ(X) = Pr(Y = 1 | X)
Additive logistic regression: log( μ(X) / (1 − μ(X)) ) = α + f1(X1) + … + fp(Xp)

Link Functions
The conditional mean μ(X) is related to an additive function of the predictors via a link function g: g[μ(X)] = α + f1(X1) + … + fp(Xp)
Identity: g(μ) = μ (Gaussian response)
Logit: g(μ) = log( μ / (1 − μ) ) (binomial)
Log: g(μ) = log(μ) (Poisson)

9.1.1 Fitting Additive Models
Ex: additive cubic splines
Penalized residual sum of squares (PRSS) criterion:
PRSS(α, f1, …, fp) = Σ_{i=1..N} ( yi − α − Σ_{j=1..p} fj(xij) )² + Σ_{j=1..p} λj ∫ fj''(tj)² dtj
where the λj ≥ 0 are tuning parameters; the minimizer is an additive cubic spline model, each fj a cubic smoothing spline in Xj

9.1 Backfitting Algorithm
Initialize: α̂ = (1/N) Σ_i yi, f̂j ≡ 0 for all j
Cycle over j = 1, 2, …, p, 1, 2, …, p, …:
  f̂j ← Sj[ {yi − α̂ − Σ_{k≠j} f̂k(xik)} ] (smooth the partial residuals against xij)
  f̂j ← f̂j − (1/N) Σ_i f̂j(xij) (re-center)
Until the functions f̂j change less than a threshold
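
As a rough illustration of the algorithm, here is a minimal backfitting sketch in Python (NumPy); the function name and the `smoother(x, r)` callable are placeholders of my own, standing in for any one-dimensional scatterplot smoother such as a cubic smoothing spline:

```python
import numpy as np

def backfit_additive_model(X, y, smoother, tol=1e-4, max_iter=100):
    """Backfitting sketch: smoother(x, r) must return fitted values of the
    partial residuals r evaluated at the points x."""
    N, p = X.shape
    alpha = y.mean()                      # initialize the intercept
    f = np.zeros((N, p))                  # fitted values f_j(x_ij), start at 0
    for _ in range(max_iter):
        f_old = f.copy()
        for j in range(p):
            # partial residuals: leave the j-th function out
            r = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = smoother(X[:, j], r)
            f[:, j] -= f[:, j].mean()     # re-center so alpha stays identifiable
        if np.max(np.abs(f - f_old)) < tol:
            break
    return alpha, f
```

For a quick experiment, `smoother` could be as simple as a running mean over the sorted values of x; the structure of the cycling loop is what matters here.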

9.1.3 Summary Additive models extend linear models Flexible, but still interpretable Simple, modular backfitting procedure Can have limitations for large data-mining applications

9.2 Tree-Based Methods Partition the feature space Fit a simple model (a constant) in each partition Simple, but powerful CART: Classification and Regression Trees, Breiman et al., 1984

9.2 Binary Recursive Partitions [Figure: a recursive binary partition of the (X1, X2) feature space into rectangular regions, with the corresponding binary tree]

9.2 Regression Trees
CART is a top-down (divisive), greedy procedure
Partitioning is a local decision at each node
A partition on variable j at value s creates the regions: R1(j, s) = {X | Xj ≤ s} and R2(j, s) = {X | Xj > s}

9.2 Regression Trees
Each node chooses j, s to solve: min_{j,s} [ min_{c1} Σ_{xi ∈ R1(j,s)} (yi − c1)² + min_{c2} Σ_{xi ∈ R2(j,s)} (yi − c2)² ]
For any choice of j, s the inner minimization is solved by the region means: ĉ1 = ave(yi | xi ∈ R1(j, s)), ĉ2 = ave(yi | xi ∈ R2(j, s))
Easy to scan through all choices of j, s to find the optimal split
After the split, recur on R1 and R2
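
A brute-force sketch of this split search for one node, assuming squared-error loss; the function name and return format are illustrative, not from the text:

```python
import numpy as np

def best_split(X, y):
    """Scan all (feature j, threshold s) pairs and return the split that
    minimizes the summed squared error of the two resulting regions."""
    N, p = X.shape
    best = None                               # (sse, j, s, c1, c2)
    for j in range(p):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s
            right = ~left
            if left.all() or right.all():
                continue                      # skip degenerate splits
            c1, c2 = y[left].mean(), y[right].mean()
            sse = ((y[left] - c1) ** 2).sum() + ((y[right] - c2) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, s, c1, c2)
    return best
```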

9.2 Cost-Complexity Pruning How large do we grow the tree? Which nodes should we keep? Grow a large tree T0 (splitting until, say, a minimum node size is reached), then prune it back using a cost-complexity criterion.

9.2 Terminology
A subtree T ⊆ T0: T is a pruned version of T0
Tree T has leaf (terminal) nodes, each indexed by m
Leaf node m maps to region Rm
|T| denotes the number of leaf nodes of T
Nm = #{xi ∈ Rm} is the number of data points in region Rm

9.2 Cost-Complexity Pruning
We define the cost-complexity criterion: Cα(T) = Σ_{m=1..|T|} Nm Qm(T) + α|T|, where for regression Qm(T) = (1/Nm) Σ_{xi ∈ Rm} (yi − ĉm)²
For each α ≥ 0, find the subtree Tα ⊆ T0 that minimizes Cα(T) (weakest-link pruning)
Choose α by cross-validation
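
The criterion itself is trivial to evaluate once each leaf's error is known; a tiny sketch (the names are mine, not the book's):

```python
def cost_complexity(leaf_errors, alpha):
    """C_alpha(T): sum over leaves of N_m * Q_m(T), plus alpha times the
    number of leaves |T|.  leaf_errors[m] holds N_m * Q_m(T), e.g. the
    residual sum of squares of leaf m in a regression tree."""
    return sum(leaf_errors) + alpha * len(leaf_errors)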

9.2 Classification Trees We define the same cost-complexity criterion, but choose a different measure of node impurity Qm(T)

9.2 Impurity Measures
With p̂mk = (1/Nm) Σ_{xi ∈ Rm} I(yi = k), the proportion of class k observations in node m:
Misclassification error: 1 − p̂mk(m), where k(m) = argmax_k p̂mk
Gini index: Σ_k p̂mk (1 − p̂mk)
Cross-entropy (deviance): − Σ_k p̂mk log p̂mk
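
Assuming the class proportions of a node are available as a vector, the three measures can be sketched directly (function names are mine):

```python
import numpy as np

def misclassification_error(p):
    """p: vector of class proportions p_mk in a node."""
    return 1.0 - p.max()

def gini(p):
    return float(np.sum(p * (1.0 - p)))

def cross_entropy(p):
    p = p[p > 0]                       # drop empty classes to avoid log(0)
    return float(-np.sum(p * np.log(p)))
```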

9.2 Categorical Predictors
How do we handle categorical variables?
In general there are 2^(q−1) − 1 possible partitions of q unordered values into two groups
Trick for the 0-1 (two-class) outcome: sort the predictor's categories by the proportion falling in outcome class 1, then split the ordered categories as if the predictor were ordinal; this gives the optimal split
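
A small sketch of the ordering trick, assuming a 0-1 outcome vector `y`; the names are illustrative:

```python
import numpy as np

def order_categories_by_class1(x_cat, y):
    """Order the categories of a categorical predictor by their proportion of
    outcome class 1, so a tree can treat the predictor as ordered and scan
    only q-1 splits instead of 2^(q-1) - 1."""
    cats = np.unique(x_cat)
    props = np.array([(y[x_cat == c] == 1).mean() for c in cats])
    return cats[np.argsort(props)]     # categories in increasing class-1 proportion
```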

9.2 CART Example Examples…

9.3 PRIM: Bump Hunting Partition-based, but not tree-based Seeks boxes in which the response average is high Top-down algorithm

Patient Rule Induction Method
1. Start with all of the data and a maximal box containing it.
2. Shrink the box by compressing one face, peeling off a fraction α of the observations; choose the peel that produces the highest response mean in the remaining box.
3. Repeat step 2 until some minimal number of observations remain in the box.
4. Expand (paste) the box along any face, as long as the resulting box mean increases.
5. Steps 1-4 give a sequence of boxes; use cross-validation to choose a member of the sequence, call that box B1.
6. Remove the data in B1 from the dataset and repeat the process to find another box, as desired.
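
A minimal sketch of the peeling stage only (steps 1-3), assuming a quantile-based peel of fraction alpha per face; the pasting stage and the cross-validated choice of box are omitted, and all names are my own:

```python
import numpy as np

def prim_peel(X, y, alpha=0.05, min_obs=10):
    """Repeatedly remove roughly an alpha fraction of the remaining
    observations from one face of the box, choosing the face whose removal
    leaves the highest response mean. Returns the sequence of (box, mean)."""
    n, p = X.shape
    box = np.array([[X[:, j].min(), X[:, j].max()] for j in range(p)], dtype=float)
    inside = np.ones(n, dtype=bool)
    sequence = [(box.copy(), y.mean())]
    while inside.sum() > min_obs:
        best = None                                   # (mean, j, side, cut, keep)
        for j in range(p):
            xj = X[inside, j]
            lo, hi = np.quantile(xj, alpha), np.quantile(xj, 1 - alpha)
            for side, cut, keep in ((0, lo, X[:, j] >= lo), (1, hi, X[:, j] <= hi)):
                kept = inside & keep
                if not kept.any():
                    continue
                m = y[kept].mean()
                if best is None or m > best[0]:
                    best = (m, j, side, cut, keep)
        _, j, side, cut, keep = best
        if (inside & keep).sum() == inside.sum():
            break                                     # nothing peeled; stop
        box[j, side] = cut
        inside &= keep
        sequence.append((box.copy(), y[inside].mean()))
    return sequence
```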

9.3 PRIM Summary Can handle categorical predictors, as CART does Designed for regression; two-class classification can be handled by coding the outcome as 0-1 Non-trivial to deal with k > 2 classes More patient than CART

9.4 Multivariate Adaptive Regression Splines (MARS) Generalization of stepwise linear regression Modification of CART to improve regression performance Able to capture additive structure Not tree-based

9.4 MARS Continued
Additive model with an adaptive set of basis functions
Basis built up from simple piecewise-linear functions: the reflected pair (Xj − t)+ and (t − Xj)+
The set C is the candidate collection of such linear splines, with knots at every observed value of each input: C = { (Xj − t)+, (t − Xj)+ : t ∈ {x1j, …, xNj}, j = 1, …, p }
Models are built from elements of C or their products

9.4 MARS Procedure
Model has the form: f(X) = β0 + Σ_{m=1..M} βm hm(X), where each hm(X) is a function in C or a product of such functions
Given a choice for the hm, the coefficients βm are estimated by standard linear regression
Start with h0(X) = 1; all functions in C are candidate functions
At each stage, consider as a new basis-function pair all products of a function hℓ already in the model set M with one of the reflected pairs in C; we add to the model terms of the form β_{M+1} hℓ(X)(Xj − t)+ + β_{M+2} hℓ(X)(t − Xj)+, choosing the pair that gives the largest decrease in training error
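
A sketch of the two ingredients named above, the reflected hinge pair and the least-squares fit for a fixed set of basis terms; it handles only first-order (non-product) terms, and the function names and the `terms` encoding are my own:

```python
import numpy as np

def hinge_pair(x, t):
    """Reflected pair of piecewise-linear basis functions with knot t."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

def fit_mars_terms(X, y, terms):
    """Least-squares fit for a chosen set of MARS basis terms.
    terms is a list of (feature index j, knot t, sign) triples; sign +1 picks
    (Xj - t)+ and -1 picks (t - Xj)+."""
    columns = [np.ones(len(y))]                        # constant basis h0(X) = 1
    for j, t, sign in terms:
        plus, minus = hinge_pair(X[:, j], t)
        columns.append(plus if sign > 0 else minus)
    H = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return beta, H @ beta                              # coefficients, fitted values
```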

9.4 Choosing the Number of Terms
Large models can overfit
Backward deletion procedure: delete the term whose removal causes the smallest increase in residual squared error, giving a sequence of models
Pick the model using generalized cross-validation: GCV(λ) = Σ_{i=1..N} (yi − f̂λ(xi))² / (1 − M(λ)/N)²
M(λ) = r + cK is the effective number of parameters in the model: r is the number of linearly independent basis functions, K is the number of knots selected, and c = 3 (c = 2 if the model is restricted to be additive)
Choose the model that minimizes GCV(λ)
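
A direct transcription of the GCV formula, assuming the fitted values and the counts r and K are already known; names are illustrative:

```python
import numpy as np

def gcv(y, y_hat, r, num_knots, c=3.0):
    """Generalized cross-validation score for a MARS model (sketch).
    r: number of linearly independent basis functions, num_knots: number of
    selected knots, c: cost per knot (3 in general, 2 for additive models)."""
    N = len(y)
    M_eff = r + c * num_knots              # effective number of parameters M(lambda)
    rss = float(np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2))
    return rss / (1.0 - M_eff / N) ** 2
```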

9.4 MARS Summary
Basis functions operate locally
Forward modeling is hierarchical: multiway products are built up only from terms already in the model
Each input appears at most once in each product
A useful option is to set a limit on the order of interaction: a limit of two allows only pairwise products; a limit of one results in an additive model

9.5 Hierarchical Mixtures of Experts (HME) Variant of tree-based methods Soft splits, not hard decisions At each node, an observation goes left or right with a probability depending on its input values Smooth parameter optimization, instead of a discrete split-point search

9.5 HMEs continued Linear (or logistic) regression model fit at each leaf node (Expert) Splits can be multi-way, instead of binary Splits are probabilistic functions of linear combinations of inputs (gating network), rather than functions of single inputs Formally a mixture model

9.6 Missing Data
Quite common to have data with missing values for one or more input features
Missing values may or may not distort the data
Notation: y is the response vector, X the N×p input matrix, and Xobs the observed entries of X; let Z = (y, X) and Zobs = (y, Xobs), and let R be an indicator matrix whose ij-th entry is 1 if xij is missing and 0 otherwise

9.6 Missing Data
Missing at Random (MAR): Pr(R | Z, θ) = Pr(R | Zobs, θ), i.e. the missingness mechanism may depend on the observed data but not on the missing values
Missing Completely at Random (MCAR): Pr(R | Z, θ) = Pr(R | θ), i.e. the missingness does not depend on the data at all
MCAR is a stronger assumption than MAR

9.6 Dealing with Missing Data
Three approaches for handling MCAR data:
(1) Discard observations with missing features
(2) Rely on the learning algorithm to deal with missing values in its training phase
(3) Impute all the missing values before training

9.6 Dealing with MCAR Data
If few values are missing, (1) may work
For (2), CART can handle missing values well via surrogate splits; additive models can substitute average values
(3) is necessary for most algorithms; the simplest tactic is to impute the mean or median of the observed values
If features are correlated, one can build predictive models for the missing features in terms of the known features
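
A sketch of the simplest imputation tactic mentioned above, assuming missing entries are coded as NaN; the function name is mine:

```python
import numpy as np

def impute_column_means(X):
    """Replace each missing entry (NaN) with the mean of the observed values
    in its column."""
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        if missing.any() and not missing.all():
            X[missing, j] = X[~missing, j].mean()
    return X
```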

9.7 Computational Considerations
For N observations and p predictors:
Additive models: mp applications of a one-dimensional smoother over m backfitting cycles; with cubic smoothing splines, N log N for an initial sort and N for each spline fit
Trees: pN log N operations for the initial sorts, plus typically another pN log N for the split computations
MARS: Nm² + pmN operations to add a basis function to a model with m terms, so about NM³ + pM²N to build an M-term model
HME, at each EM step: Np² for the regressions and Np²K² for a K-class logistic regression; the EM algorithm itself can take a long time to converge