Chapter 9 Additive Models,Trees,and Related Models

Slides:



Advertisements
Similar presentations
Chapter 7 Classification and Regression Trees
Advertisements

Random Forest Predrag Radenković 3237/10
Brief introduction on Logistic Regression
Model Assessment, Selection and Averaging
Chapter 7 – Classification and Regression Trees
Model assessment and cross-validation - overview
Chapter 7 – Classification and Regression Trees
Data mining and statistical learning - lecture 6
Basis Expansion and Regularization Presenter: Hongliang Fei Brian Quanz Brian Quanz Date: July 03, 2008.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Ensemble Learning: An Introduction
Additive Models and Trees
Tree-based methods, neutral networks
Basis Expansions and Regularization Based on Chapter 5 of Hastie, Tibshirani and Friedman.
End of Chapter 8 Neil Weisenfeld March 28, 2005.
Chapter 6: Multilayer Neural Networks
Maximum Likelihood (ML), Expectation Maximization (EM)
Lecture 11 Multivariate Regression A Case Study. Other topics: Multicollinearity  Assuming that all the regression assumptions hold how good are our.
Data Mining: A Closer Look Chapter Data Mining Strategies (p35) Moh!
Data mining and statistical learning - lecture 12 Neural networks (NN) and Multivariate Adaptive Regression Splines (MARS)  Different types of neural.
Comp 540 Chapter 9: Additive Models, Trees, and Related Methods
R OBERTO B ATTITI, M AURO B RUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Feb 2014.
Ensemble Learning (2), Tree and Forest
Microsoft Enterprise Consortium Data Mining Concepts Introduction to Directed Data Mining: Decision Trees Prepared by David Douglas, University of ArkansasHosted.
CSCI 347 / CS 4206: Data Mining Module 04: Algorithms Topic 06: Regression.
Introduction to Directed Data Mining: Decision Trees
Classification Part 4: Tree-Based Methods
Collaborative Filtering Matrix Factorization Approach
Bump Hunting The objective PRIM algorithm Beam search References: Feelders, A.J. (2002). Rule induction by bump hunting. In J. Meij (Ed.), Dealing with.
沈致远. Test error(generalization error): the expected prediction error over an independent test sample Training error: the average loss over the training.
Chapter 9 – Classification and Regression Trees
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Overview of Supervised Learning Overview of Supervised Learning2 Outline Linear Regression and Nearest Neighbors method Statistical Decision.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.6: Linear Models Rodney Nielsen Many of.
Chapter 4: Introduction to Predictive Modeling: Regressions
Lectures 15,16 – Additive Models, Trees, and Related Methods Rice ECE697 Farinaz Koushanfar Fall 2006.
1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.
Random Forests Ujjwol Subedi. Introduction What is Random Tree? ◦ Is a tree constructed randomly from a set of possible trees having K random features.
1 Chapter 4: Introduction to Predictive Modeling: Regressions 4.1 Introduction 4.2 Selecting Regression Inputs 4.3 Optimizing Regression Complexity 4.4.
Additive Models , Trees , and Related Models Prof. Liqing Zhang Dept. Computer Science & Engineering, Shanghai Jiaotong University.
Classification and Regression Trees
Combining multiple learners Usman Roshan. Decision tree From Alpaydin, 2010.
Giansalvo EXIN Cirrincione unit #4 Single-layer networks They directly compute linear discriminant functions using the TS without need of determining.
Eco 6380 Predictive Analytics For Economists Spring 2016 Professor Tom Fomby Department of Economics SMU.
Hierarchical Mixture of Experts Presented by Qi An Machine learning reading group Duke University 07/15/2005.
Basis Expansions and Generalized Additive Models Basis expansion Piecewise polynomials Splines Generalized Additive Model MARS.
Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.
Bump Hunting The objective PRIM algorithm Beam search References: Feelders, A.J. (2002). Rule induction by bump hunting. In J. Meij (Ed.), Dealing with.
By N.Gopinath AP/CSE.  A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each.
CPH Dr. Charnigo Chap. 9 Notes To begin with, have a look at Figure 9.5 on page 315. One can get an intuitive feel for how a tree works by examining.
Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.
Chapter 7. Classification and Prediction
Deep Feedforward Networks
Introduction to Machine Learning and Tree Based Methods
Trees, bagging, boosting, and stacking
Ch9: Decision Trees 9.1 Introduction A decision tree:
Additive Models,Trees,and Related Models
Boosting and Additive Trees
Additive Models,Trees,and Related Models
Data Mining Lecture 11.
Classification by Decision Tree Induction
Hidden Markov Models Part 2: Algorithms
Roberto Battiti, Mauro Brunato
Collaborative Filtering Matrix Factorization Approach
Statistical Learning Dong Liu Dept. EEIS, USTC.
Generally Discriminant Analysis
Basis Expansions and Generalized Additive Models (2)
STT : Intro. to Statistical Learning
Presentation transcript:

Chapter 9 Additive Models,Trees,and Related Models The Elements of Statistical Learning

Introduction In this chapter we begin our discussion of some specific methods for supervised learning 9.1: Generalized Additive Models 9.2: Tree-Based Methods 9.3:PRIM:Bump Hunting 9.4:MARS: Multivariate Adaptive Regression Splines 9.5:HME:Hieraechical Mixture of Experts

9.1 Generalized Additive Models In the regression setting, a generalized additive models has the form: Here fj’s are unspecified smooth and nonparametric functions. Instead of using LBE in chapter 5, we fit each function using a scatter plot smoother(e.g. a cubic smoothing spline)

GAM(cont.) For two-class classification, the additive logistic regression model is: Here

GAM(cont) In general, the conditional mean U(x) of a response y is related to an additive function of the predictors via a link function g: Examples of classical link functions are the following: Identity: g(u)=u Logit: g(u)=log[u/(1-u)] Probit: g(m) = F-1(m) Log: g(u) = log(u)

Fitting Additive Models The additive model has the form: Here we have Given observations , a criterion like penalized sum squares can be specified for this problem: Where are tuning parameters.

FAM(cont.) Conclusions: The solution to minimize PRSS is cubic splines, however without further restrictions the solution is not unique. If 0 holds, it is easy to see that: If in addition to this restriction, the matrix of input values has full column rank, then (9.7)is a strict convex criterion and has an unique solution. If the matrix is singular, then the linear part of fj cannot be uniquely determined. (Buja 1989)

FAM(cont.)

Additive Logistic Regression

9.2 Tree-Based Method Background: Tree-based models partition the feature space into a set of rectangles, and then fit a simple model(like a constant) in each one. A popular method for tree-based regression and classification is called CART(classification and regression tree)

CART Example: Let’s consider a regression problem with continuous response Y and inputs X1 and X2. The top left panel of figure is one partition. To simplify matters, we consider the partition shown by the top right panel of figure. The corresponding regression model predicts Y with a constant Cm in region Rm: For illustration, we choose C1=-5, C2=-7,C3=0,C4=2, C5=4 in the bottom right panel in figure 9.2.

CART

Regression Tree Suppose we have a partition into M regions: R1 R2 …..RM. We model the response Y with a constant Cm in each region: If we adopt as our criterion minimization of RSS, it is easy to see that:

Regression Tree(cont.) Finding the best binary partition in term of minimum RSS is computationally infeasible A greedy algorithm is used. Here

Regression Tree(cont.) For any choice j and s, the inner minimization is solved by: For each splitting variable Xj the determination of split point s can be done very quickly and hence by scanning through all of the input, determination of the best pair (j,s) is feasible. Having found the best split, we partition the data into two regions and repeat the splitting progress in each of the two regions.

Regression Tree(cont.) We index terminal nodes by m, with node m representing region Rm. Let |T| denotes the number of terminal notes in T. Letting: We define the cost complexity criterion:

Classification Tree : the proportion of class k on mode m. : the majority class on node m Instead of Qm(T) defined in(9.15) in regression, we have different measures Qm(T) of node impurity including the following: Misclassification Error: Gini Index: Cross-entropy (deviance):

Classification Tree(cont.) Example: For two classes, if p is the proportion of in the second class, these three measures are: 1-max(p,1-p) ,2p(1-p), -plogp-(1-p)log(1-p)

9.3 PRIM: Bump Hunting The patient rule induction method(PRIM) finds boxes in the feature space and seeks boxes in which the response average is high. Hence it looks for maxima in the target function, an exercise known as bump hunting.

PRIM(cont.)

PRIM(cont.)

PRIM(cont.)

9.4 MARS: Multivariate Adaptive Regression Splines MARS uses expansions in piecewise linear basis functions of the form: (x-t)+ and (t-x)+. We call the two functions a reflected pair.

MARS(cont.) The idea of MARS is to form reflected pairs for each input Xj with knots at each observed value Xij of that input. Therefore, the collection of basis function is: The model has the form: where each hm(x) is a function in C or a product of two or more such functions.

MARS(cont.) We start with only the constant function h0(x)=1 in the model set M and all functions in the set C are candidate functions. At each stage we add to the model set M the term of the form: that produces the largest decrease in training error. The process is continued until the model set M contains some preset maximum number of terms.

MARS(cont.)

MARS(cont.)

MARS(cont.)

MARS(cont.) This progress typically overfits the data and so a backward deletion procedure is applied. The term whose removal causes the smallest increase in RSS is deleted from the model at each step, producing an estimated best model of each size . Generalized cross validation is applied to estimate the optimal value of

9.5 Hierarchical Mixtures of Experts The HME method can be viewed as a variant of the tree-based methods. Difference: The main difference is that the tree splits are not hard decisions but rather soft probabilistic ones. In an HME a linear(or logistic regression) model is fitted in each terminal node, instead of a constant as in the CART.

HME(cont.) A simple two-level HME model is shown in Figure 9.13. It can be viewed as a tree with soft splits at each non-terminal node. The terminal node is called expert and the non-terminal node is called gating networks. The idea is that each expert provides a prediction about the response Y, and these are combined together by the gating networks. 

HME(cont.)

HME(cont.) Here is how an HME is defined: The top gating network has the output: At the second level, the gating networks have a similar form:

HME(cont.) At the experts, we have a model for the response variable of the form: This differs according to the problem.

HME(cont.)

9.6 Missing Data A definition: Roughly speaking, data are missing at random if the mechanism resulting in its omission is independent of its (unobserved) value. A more precise definition is given by Little and Rubin(2002): Suppose Y is response and X is the N*P matrix of input. Xobs denotes the observed entries in X. Z=(Y,X) and Zobs=(Y,Xobs) Finally R is an indicator matrix with ijth item 1 if Xij is missing and 0 otherwise.

Missing Data(cont.) The data is said to be MAR if the distribution of R depends on Z only through Zobs:   The data is said to be MCAR if the distribution of R does not depend on Z:

Missing Data(cont.)

CHAPTER 9 THE END THANK YOU