
1 Additive Models, Trees, and Related Methods (Part I)
Joy, Jie, and Lucian. October 22nd, 2002

2 Outline
Tree-Based Methods
CART (30 minutes) 1:30pm-2:00pm
HME (10 minutes) 2:00pm-2:20pm
PRIM (20 minutes) 2:20pm-2:40pm
Discussions (10 minutes) 2:40pm-2:50pm

3 Tree-Based Methods Overview
Principle behind: divide and conquer
Partition the feature space into a set of rectangles; for simplicity, use recursive binary partitions
Fit a simple model (e.g. a constant) in each rectangle
This finesses the curse of dimensionality, at the price of possibly mis-specifying the model and increasing the variance
Classification and Regression Trees (CART): regression trees and classification trees
Hierarchical Mixtures of Experts (HME)

4 CART An example (in the regression case):

5 How CART Sees An Elephant
It was six men of Indostan
To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind ....
-- "The Blind Men and the Elephant" by John Godfrey Saxe

6 Basic Issues in Tree-based Methods
How to grow a tree? How large should we grow the tree?

7 Regression Trees
Partition the space into M regions R1, R2, ..., RM and fit a constant c_m in each region:
f(x) = Σ_{m=1}^{M} c_m I(x ∈ R_m), where the best estimate of c_m is the average of the y_i whose x_i fall in R_m
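A minimal sketch (not from the slides) of this piecewise-constant model, with made-up 1-D data and a hand-chosen partition; each c_m is just the mean response in its region:

```python
# Piecewise-constant model f(x) = sum_m c_m * I(x in R_m),
# with c_m the mean response in region R_m.
import numpy as np

# Hypothetical 1-D data and a hand-chosen partition of the x-axis into 3 regions.
x = np.array([0.1, 0.4, 0.6, 0.8, 1.2, 1.5, 1.9, 2.3])
y = np.array([1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0])
boundaries = [0.7, 1.7]            # R1: x < 0.7, R2: 0.7 <= x < 1.7, R3: x >= 1.7

region = np.digitize(x, boundaries)                      # region index m for each point
c = np.array([y[region == m].mean() for m in range(len(boundaries) + 1)])

def f(x_new):
    """Predict with the region-wise constant c_m."""
    return c[np.digitize(x_new, boundaries)]

print(f(np.array([0.5, 1.0, 2.0])))                      # approx [1.03, 3.0, 5.1]
```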

8 Regression Trees – Grow the Tree
The best partition minimizes the sum of squared errors Σ_i (y_i - f(x_i))^2
Finding the globally optimal partition is computationally infeasible
Greedy algorithm: at each node, choose the splitting variable j and split point s to solve
  min_{j,s} [ min_{c1} Σ_{x_i ∈ R1(j,s)} (y_i - c1)^2 + min_{c2} Σ_{x_i ∈ R2(j,s)} (y_i - c2)^2 ]
where R1(j,s) = {x | x_j ≤ s} and R2(j,s) = {x | x_j > s}; the inner minimizations are solved by the region means
The greedy algorithm makes the tree unstable: an error made at an upper level is propagated to all lower levels
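The greedy split search can be sketched in a few lines of Python; this is an illustrative toy on synthetic data, not the book's or the presenters' code:

```python
# Greedy split: for each feature j and split point s, fit constants on both
# sides and keep the (j, s) with the smallest total squared error.
import numpy as np

def best_split(X, y):
    """Return (j, s, error) of the best single binary split of (X, y)."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = np.where(X[:, 0] > 0.5, 2.0, 0.0) + 0.1 * rng.normal(size=50)
print(best_split(X, y))   # should pick j = 0 with s near 0.5
```

Growing a full tree amounts to applying best_split recursively to the two resulting half-spaces until a stopping rule is met.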

9 Regression Trees – How Large Should We Grow the Tree?
Trade-off between accuracy and generalization
Very large tree: overfits
Small tree: might not capture the important structure
Strategies:
1. Split only when the split decreases the error by more than a threshold (short-sighted; e.g. an XOR-like structure needs a seemingly useless split before a very useful one appears)
2. Cost-complexity pruning (preferred)

10 Regression Trees – Pruning
Cost-complexity pruning
Pruning: collapsing some internal nodes of a large tree
Cost-complexity criterion: C_α(T) = Σ_{m=1}^{|T|} N_m Q_m(T) + α|T|
  The first term is the cost (sum of squared errors within the leaves); the term α|T| penalizes the complexity/size of the tree
For each α, find the best subtree by weakest-link pruning: repeatedly collapse the internal node that produces the smallest per-node increase in error
Choose the best tree from this sequence (equivalently, choose α) by cross-validation
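As an illustration only (not the slides' code, and assuming scikit-learn >= 0.22 is available), the weakest-link pruning sequence and the cross-validated choice of alpha can be obtained with scikit-learn's ccp_alpha machinery:

```python
# Cost-complexity pruning via scikit-learn on synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.sin(4 * X[:, 0]) + 0.2 * rng.normal(size=200)

# The pruning path gives the sequence of subtrees (one alpha per subtree).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha whose pruned tree does best under cross-validation.
scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print(best_alpha)
```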

11 Classification Trees
Classify the observations in node m to the majority class in the node: k(m) = argmax_k p_mk, where p_mk is the proportion of class-k observations in node m
Define an impurity measure for a node:
Misclassification error: 1 - p_mk(m)
Cross-entropy (deviance): -Σ_k p_mk log p_mk

12 Classification Trees
Gini index (also a well-known measure of income inequality): Σ_{k≠k'} p_mk p_mk' = Σ_k p_mk (1 - p_mk)
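For concreteness, a small sketch (not from the slides) computing all three impurity measures from a node's class proportions p_mk:

```python
# Node-impurity measures from the class proportions of a single node.
import numpy as np

def impurities(p):
    """p: array of class proportions in a node (sums to 1)."""
    p = np.asarray(p, dtype=float)
    misclass = 1.0 - p.max()                          # misclassification error
    gini = np.sum(p * (1.0 - p))                      # Gini index
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # cross-entropy (base 2 here)
    return misclass, gini, entropy

print(impurities([0.5, 0.5]))   # (0.5, 0.5, 1.0): all maximal for 2 classes
print(impurities([0.9, 0.1]))   # small impurity: node is nearly pure
```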

13 Classification Trees
Cross-entropy and Gini are more sensitive to changes in the node probabilities than the misclassification rate
To grow the tree: use cross-entropy or Gini
To prune the tree: the misclassification rate (or either of the other two measures) can be used

14 Discussions on Tree-based Methods
Categorical predictors
Problem: consider splitting a node t into tL and tR on a categorical predictor x with q possible values: there are 2^(q-1) - 1 possible partitions!
Theorem (Fisher 1958): there is an optimal partition (B1, B2) of the set of levels B such that the mean outcome of every level in B1 is no larger than the mean outcome of every level in B2
So: order the predictor's levels according to the mean of the outcome Y, then search splits as if the predictor were ordered (only q - 1 splits to check)
Intuition: treat the categorical predictor as if it were ordered
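A hypothetical sketch of this ordering trick (category labels and data values are made up):

```python
# Order the q category levels by the mean outcome, then only consider the
# q-1 "prefix vs suffix" cuts of that ordering instead of all 2**(q-1) - 1 subsets.
import numpy as np

cats = np.array(["a", "b", "c", "d", "a", "b", "c", "d"])
y    = np.array([1.0, 5.0, 2.0, 9.0, 1.2, 4.8, 2.2, 8.8])

levels = np.unique(cats)
means = np.array([y[cats == c].mean() for c in levels])
order = levels[np.argsort(means)]           # e.g. ['a', 'c', 'b', 'd']
print(order)

candidate_splits = [(set(order[:i]), set(order[i:])) for i in range(1, len(order))]
print(candidate_splits)
```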

15 Discussions on Tree-based Methods
The loss matrix
The consequences of misclassification depend on the class, so define a loss matrix L, with L_kk' the loss for classifying a class-k observation as class k'
Modify the Gini index to Σ_{k≠k'} L_kk' p_mk p_mk'
In a terminal node m, classify to the class with smallest expected loss: k(m) = argmin_k Σ_k' L_k'k p_mk'
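A toy sketch (loss values and node proportions made up) of the loss-weighted Gini index and the minimum-expected-loss class assignment:

```python
# L[k, kp] = cost of predicting class kp when the truth is class k.
import numpy as np

L = np.array([[0.0, 1.0],    # predicting class 1 when truth is 0 costs 1
              [5.0, 0.0]])   # predicting class 0 when truth is 1 costs 5
p = np.array([0.7, 0.3])     # class proportions p_mk in node m

weighted_gini = sum(L[k, kp] * p[k] * p[kp]
                    for k in range(2) for kp in range(2) if k != kp)
assigned_class = int(np.argmin(L.T @ p))   # expected loss of predicting each class
print(weighted_gini, assigned_class)       # node goes to class 1 despite p = 0.3
```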

16 Discussions on Trees
Missing predictor values
If we have enough training data: discard observations with missing values
Fill in (impute) the missing value, e.g. with the mean of the known values
Create a category called "missing"
Surrogate variables:
Choose the primary predictor and split point
The first surrogate predictor is the one whose split best mimics the primary split, the second does second best, and so on
When sending observations down the tree, use the primary predictor first; if its value is missing, use the first surrogate; if that is also missing, use the second, ...
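A simplified sketch of the surrogate fallback at prediction time; this is not CART's actual implementation, and the split rules here are made up and assumed to be precomputed:

```python
# Route an observation with missing values using the primary split, then
# each surrogate in order.
import numpy as np

# Each rule is (feature_index, threshold); surrogates are ordered by how well
# they mimic the primary split on the training data (assumed precomputed).
primary = (0, 2.5)
surrogates = [(2, 0.7), (1, 10.0)]

def go_left(x, primary, surrogates):
    """Return True/False for 'send x to the left child', skipping missing values."""
    for j, s in [primary] + surrogates:
        if not np.isnan(x[j]):
            return x[j] <= s
    return True   # all candidates missing: fall back to a default direction

x = np.array([np.nan, 3.0, 0.4])        # primary feature missing
print(go_left(x, primary, surrogates))  # first surrogate decides: 0.4 <= 0.7 -> True
```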

17 Discussions on Trees – Binary Splits?
Question (Yan Liu): This question is about the limitation of multiway splits for building trees. It is said on page 273 that the problem with multiway splits is that they fragment the data too quickly, leaving insufficient data at the next level down. Can you give an intuitive explanation of why binary splits are preferred? In my understanding, one of the problems with multiway splits might be that it is hard to find the best attributes and split points; is that right?
Answer: Why are binary splits preferred? They give a more standard framework to train, and "to be or not to be" is easier to decide.

18 Discussions on Trees – Linear Combination Splits, Instability of Trees
Linear combination splits: split the node based on Σ_j a_j X_j ≤ s
  Can improve the predictive power, but hurts interpretability
Instability of trees
  Inherited from the hierarchical nature: errors in the top splits propagate to all splits below them
  Bagging (Section 8.7) can reduce the variance

19 Discussions on Trees

20 Discussions on Trees: combining the predictions of many trees by majority vote (classification) or averaging (regression)

21 Hierarchical Mixtures of Experts
The gating networks provide a nested, "soft" partitioning of the input space
The expert networks provide local regression surfaces within each partition
Both the mixture coefficients and the mixture components are generalized linear models (GLIMs)

22 Hierarchical Mixtures of Experts
Top-level gate output: g_j(x) = exp(γ_j' x) / Σ_k exp(γ_k' x)
Lower-level gate output: g_l|j(x) = exp(γ_jl' x) / Σ_k exp(γ_jk' x)
Expert node output (regression case): μ_jl = β_jl' x
Overall prediction: f(x) = Σ_j g_j(x) Σ_l g_l|j(x) μ_jl
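A minimal forward-pass sketch of a two-level HME with softmax gates and linear experts; all parameter values are random placeholders, not fitted values:

```python
# Two-level HME forward pass: top gate -> lower gates -> linear experts.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
p, K = 3, 2                      # input dim (incl. bias), branching factor
x = np.array([1.0, 0.5, -1.2])   # input vector (first entry = bias term)

gamma_top = rng.normal(size=(K, p))        # top gate parameters
gamma_low = rng.normal(size=(K, K, p))     # one lower gate per top branch
beta = rng.normal(size=(K, K, p))          # linear expert parameters beta_jl

g_top = softmax(gamma_top @ x)                        # g_j(x)
prediction = 0.0
for j in range(K):
    g_low = softmax(gamma_low[j] @ x)                 # g_{l|j}(x)
    mu = beta[j] @ x                                  # expert outputs mu_jl
    prediction += g_top[j] * np.dot(g_low, mu)        # sum_j g_j sum_l g_{l|j} mu_jl
print(prediction)
```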

23 Hierarchical Mixtures of Experts
Likelihood of the training data: Π_i [ Σ_j g_j(x_i) Σ_l g_l|j(x_i) P(y_i | x_i, θ_jl) ]
A gradient-based learning algorithm can be used to update the parameters U_ij directly
Alternatively, apply EM to the HME for training
  Latent variables: indicators z_i recording which branch each observation takes
See Jordan and Jacobs (1994) for details
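For intuition, a sketch of the E-step responsibilities for a single (one-level) gate with Gaussian regression experts; the full HME repeats this posterior computation at every gating node. The numbers are made up:

```python
# E-step responsibilities for a one-level mixture of K Gaussian regression experts.
import numpy as np

def gaussian_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def e_step(y, mu, g, sigma=1.0):
    """Posterior probability h_j that expert j generated y (Bayes' rule)."""
    joint = g * gaussian_pdf(y, mu, sigma)   # g_j(x) * P(y | x, expert j)
    return joint / joint.sum()

g = np.array([0.6, 0.4])       # gate outputs g_j(x) for this observation
mu = np.array([1.0, 3.0])      # expert predictions mu_j(x)
print(e_step(2.8, mu, g))      # posterior mass shifts to expert 2
```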

24 Hierarchical Mixtures of Experts -- EM
Each histogram displays the distribution of posterior probabilities across the training set at each node in the tree

25 Comparison
Architecture    Relative Error    # Epochs
Linear          .31               1
BackProp        .09               5,500
HME (Alg. 1)    .10               35
HME (Alg. 2)    .12               39
CART            .17               N/A
CART (linear)   .13               N/A
MARS            .16               N/A
All methods perform better than Linear
BP has the lowest relative error, but is hard to converge
16 experts in a four-level hierarchy for HME; 16 basis functions for MARS (Jordan 94)

26 Model Selection for HME
Structural parameters need to be decided in advance:
  Number of levels
  Branching factor K of the tree
Unlike CART, there is no established method for finding a good tree topology

27 Questions and Discussions
CART:
Rong Jin: According to Eqn. (9.16), successfully splitting a large subtree is more valuable than splitting a small one. Could you justify this?
Rong Jin: The discussion of regression and classification trees only considers partitioning the feature space in a simple binary way. Has any work been done along the lines of nonlinear partitions of the feature space?
Rong Jin: Does it make any sense to do overlapping splits?
Ben: Could you make clearer the difference between using L_{k,k'} as a loss vs. as weights? (p. 272)

28 Questions and Discussions
Locality:
Rong Jin: Both tree models and kernel methods try to capture locality. For tree models, the locality is created through the partition of the feature space, while kernel methods express locality through a special distance function. Please comment on the ability of these two approaches to express localized functions.
Ben: Could you compare the tree methods introduced here with kNN and kernel methods?

29 Questions and Discussions
Gini and other measures:
Yan: The classification (and regression) trees discussed in this chapter use several criteria to select the attribute and split points, such as misclassification error, the Gini index, and cross-entropy. When should we use each of these criteria? Is entropy preferred over the other two? (Also, a clarification: does the Gini index refer to gain ratio, and cross-entropy to information gain?)
Jian Zhang: From the book we know that the Gini index has many nice properties, e.g. it is a tight upper bound on the error and equals a training error rate under probabilistic class assignment. Should we prefer it in classification tasks for those reasons?
Weng-keen: How is the Gini index equal to the training error?
Yan Jun: For tree-based methods, can you give an intuitive explanation of the Gini index measure?
Ben: What does minimizing node impurity mean? Is it just to decrease overall variance? Is there any implication for the bias? How does the usual bias-variance tradeoff play a role here?

30 Questions and Discussions
HME:
Yan: In my understanding, the HME is like a neural network combined with Gaussian linear regression (or logistic regression), in the sense that the input of the neural network is the output of the Gaussian regression. Is my understanding right?
Ben: For the 2-class case depicted in the HME figure, why do we need two levels of mixtures?
Ben: In the equation, why are the two upper bounds of the summations the same K? -- Yes

31 References
Fisher, W.D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53, 789-798.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth International Group.
Jordan, M. I., and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.

32 Hierarchical Mixtures of Experts
“The softmax function derives naturally from log-linear models and leads to convenient interpretations of the weights in terms of odds ratios. You could, however, use a variety of other nonnegative functions on the real line in place of the exp function. Or you could constrain the net inputs to the output units to be nonnegative, and just divide by the sum--that's called the Bradley-Terry-Luce model”
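A tiny illustration of the point in the quote (purely illustrative numbers): the softmax gate normalizes exponentiated net inputs, but any other nonnegative transform can be normalized by its sum in the same way:

```python
# Softmax gate vs. a generic "nonnegative function, then divide by the sum" gate.
import numpy as np

z = np.array([2.0, 0.5, -1.0])          # net inputs to the gate

softmax = np.exp(z) / np.exp(z).sum()   # standard softmax gate
softplus = np.log1p(np.exp(z))          # another nonnegative transform
alt_gate = softplus / softplus.sum()    # normalize-by-sum alternative
print(softmax, alt_gate)
```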

33 Hierarchical Mixtures of Experts

