Additive Models, Trees, and Related Methods (Part I)

Additive Models, Trees, and Related Methods (Part I) Joy, Jie, Lucian Oct 22nd, 2002

Outline
Tree-Based Methods: CART (30 minutes) 1:30pm - 2:00pm
HME (10 minutes) 2:00pm - 2:20pm
PRIM (20 minutes) 2:20pm - 2:40pm
Discussions (10 minutes) 2:40pm - 2:50pm

Tree-Based Methods Overview
Principle behind: divide and conquer.
Partition the feature space into a set of rectangles; for simplicity, use recursive binary partitions.
Fit a simple model (e.g. a constant) in each rectangle.
This finesses the curse of dimensionality, at the price of possibly mis-specifying the model and increasing the variance.
Methods covered: Classification and Regression Trees (CART), i.e. regression trees and classification trees, and Hierarchical Mixtures of Experts (HME).

CART: an example (in the regression case); a stand-in illustration is sketched below.
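The slide's example figure is not reproduced in the transcript. As a stand-in, here is a minimal sketch (not from the slides) that fits a small regression tree with scikit-learn on simulated data and prints the resulting recursive binary partition; the simulated data and variable names are my own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))                   # two inputs X1, X2
y = np.where(X[:, 0] > 0.5, 2.0, -1.0) + 0.1 * rng.standard_normal(200)

# Recursive binary partitioning, fitting a constant in each rectangle.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["X1", "X2"]))   # the fitted splits and region means
```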

How CART Sees An Elephant
"It was six men of Indostan, To learning much inclined, Who went to see the Elephant (Though all of them were blind), That each by observation Might satisfy his mind ..."
-- "The Blind Men and the Elephant" by John Godfrey Saxe (1816-1887)

Basic Issues in Tree-based Methods How to grow a tree? How large should we grow the tree?

Regression Trees: partition the feature space into M regions R1, R2, ..., RM and fit a constant cm in each region.
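The slide's formula did not survive the transcript; a reconstruction of the standard piecewise-constant regression-tree model, in ESL Chapter 9 notation:

```latex
f(x) = \sum_{m=1}^{M} c_m \, I\{x \in R_m\}, \qquad
\hat{c}_m = \operatorname{ave}\left(y_i \mid x_i \in R_m\right)
```

Under squared-error loss the best constant in region R_m is simply the average of the responses falling in that region.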

Regression Trees - Grow the Tree
The best partition minimizes the sum of squared errors, but finding the global optimum is computationally infeasible.
Greedy algorithm: at each node choose the splitting variable j and split point s that give the largest immediate decrease in squared error.
The greedy algorithm makes the tree unstable: an error made at an upper level is propagated to the levels below.
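The two missing formulas, reconstructed in the standard notation (a sketch, not copied verbatim from the slides): for splitting variable j and split point s, define the half-planes and solve the greedy problem

```latex
R_1(j,s) = \{X \mid X_j \le s\}, \qquad R_2(j,s) = \{X \mid X_j > s\}

\min_{j,\,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2
                 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]
```

For each candidate (j, s) the inner minimizations are solved by the region means, so the scan over split points is fast for each variable.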

Regression Trees - How large should we grow the tree?
Trade-off between accuracy and generalization: a very large tree overfits, while a small tree might not capture the structure.
Strategies:
1. Split only when the split decreases the error by more than a threshold (short-sighted; e.g. it fails on XOR-like structure, where a seemingly worthless split enables good splits below it).
2. Cost-complexity pruning (preferred): grow a large tree, then prune it back.

Regression Trees - Pruning
Cost-complexity pruning: pruning means collapsing some internal nodes of the large tree.
Cost complexity = cost (the sum of squared errors over the terminal nodes) + a penalty on the complexity/size of the tree.
Choosing the best alpha: weakest-link pruning. Repeatedly collapse the internal node that adds the smallest error, producing a sequence of subtrees; choose the best tree from this sequence by cross-validation.
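The cost-complexity criterion referred to above, reconstructed in the usual notation (the "cost" term is the squared error over terminal nodes and the second term is the size penalty):

```latex
C_\alpha(T) = \sum_{m=1}^{|T|} N_m \, Q_m(T) + \alpha \, |T|,
\qquad
Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} \big( y_i - \hat{c}_m \big)^2
```

Here |T| is the number of terminal nodes of T, N_m the number of observations in node m, and alpha >= 0 governs the trade-off between fit and tree size (alpha = 0 gives back the full tree).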

Classification Trees
Classify the observations in node m to the majority class in that node, where p̂mk is the proportion of observations of class k in node m.
Node impurity measures: misclassification error and cross-entropy (deviance).
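The missing definitions, reconstructed in standard notation:

```latex
\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I\{y_i = k\},
\qquad k(m) = \arg\max_k \hat{p}_{mk}

\text{Misclassification error: } 1 - \hat{p}_{m\,k(m)}
\qquad
\text{Cross-entropy: } -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}
```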

Classification Trees
Gini index (a well-known index for measuring income inequality, used here as a node impurity measure):
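The Gini impurity of node m, in the same notation as above:

```latex
\sum_{k \ne k'} \hat{p}_{mk}\,\hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk})
```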

Classification Trees
Cross-entropy and the Gini index are more sensitive than the misclassification rate to changes in the node class probabilities.
To grow the tree: use cross-entropy or Gini. To prune the tree: use the misclassification rate (though any of the three measures can be used).

Discussions on Tree-based Methods - Categorical Predictors
Problem: splitting a node on a categorical predictor x with q possible unordered values means choosing among 2^(q-1) - 1 possible binary partitions of those values.
Theorem (Fisher 1958): for a quantitative outcome there is an optimal partition {B1, B2} obtained by ordering the predictor's classes according to the mean of the outcome Y and then splitting as for an ordered predictor.
Intuition: treat the categorical predictor as if it were ordered.
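A minimal NumPy sketch of this ordering trick for a quantitative outcome (the function and variable names are mine, for illustration only):

```python
import numpy as np

def best_categorical_split(x_cat, y):
    """Best binary split of a categorical predictor for a regression target,
    using Fisher's (1958) trick: order the categories by the mean of y and
    scan only the q-1 ordered cut points instead of all 2^(q-1) - 1 subsets."""
    levels = np.unique(x_cat)
    # Order categories by their mean outcome.
    means = {lev: y[x_cat == lev].mean() for lev in levels}
    ordered = sorted(levels, key=lambda lev: means[lev])

    best_sse, best_left = np.inf, None
    for cut in range(1, len(ordered)):
        left_levels = set(ordered[:cut])
        mask = np.isin(x_cat, list(left_levels))
        yl, yr = y[mask], y[~mask]
        # Sum of squared errors when each side is fit by its mean.
        sse = ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_left = sse, left_levels
    return best_left, best_sse
```

By Fisher's result, restricting the search to these ordered cut points loses nothing under squared-error loss.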

Discussions on Tree-based Methods - The Loss Matrix
The consequences of misclassification depend on the class, so define a loss matrix L, with L_{kk'} the loss for classifying a class-k observation as class k'.
Modify the Gini index to weight each confusion by its loss, and in a terminal node m classify to the class that minimizes the expected loss:
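The two reconstructed formulas, in the same notation as above:

```latex
\text{Gini index with losses: } \sum_{k \ne k'} L_{kk'}\, \hat{p}_{mk}\, \hat{p}_{mk'}
\qquad
k(m) = \arg\min_{k} \sum_{k'} L_{k'k}\, \hat{p}_{mk'}
```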

Discussions on Trees - Missing Predictor Values
If we have enough training data: discard observations with missing values.
Fill in (impute) the missing value, e.g. with the mean of the known values.
For categorical predictors, create a category called "missing".
Surrogate variables: choose the primary predictor and split point; the first surrogate predictor best mimics the split by the primary predictor, the second does second best, and so on. When sending observations down the tree, use the primary predictor first; if its value is missing, use the first surrogate; if that is also missing, use the second, and so on.

Discussions on Trees - Binary Splits?
Question (Yan Liu): This question is about the limitation of multiway splits for building trees. It is said on page 273 that the problem with multiway splits is that they fragment the data too quickly, leaving insufficient data at the next level down. Can you give an intuitive explanation of why binary splits are preferred? In my understanding, one of the problems with multiway splits might be that it is hard to find the best attributes and split points; is that right?
Answer: Why are binary splits preferred? They give a more standard framework for training, and a multiway split can anyway be built up from a sequence of binary splits; a yes/no question ("to be or not to be") is easier to decide.

Discussions on Trees - Linear Combination Splits; Instability of Trees
Linear combination splits: split the node on a linear combination of the inputs, e.g. sum_j a_j X_j <= s. This can improve the predictive power but hurts interpretability.
Instability of trees: inherited from the hierarchical nature of the splitting process; bagging (Section 8.7) can reduce the variance.

Discussions on Trees - Bagging (figure): the bootstrapped trees are aggregated by majority vote for classification and by averaging for regression.

Hierarchical Mixtures of Experts (HME)
The gating networks provide a nested, "soft" partitioning of the input space.
The expert networks provide local regression surfaces within each partition.
Both the mixture coefficients (gates) and the mixture components (experts) are generalized linear models (GLIMs).

Hierarchical Mixtures of Experts - the model combines the top-level gate outputs, the lower-level gate outputs, and the expert node outputs (the slide's formulas are reconstructed below).
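A reconstruction of the two-level HME equations in the notation of ESL Section 9.5 (branching factor K at each level; a sketch, not copied verbatim from the slides):

```latex
g_j(x,\gamma_j) = \frac{e^{\gamma_j^T x}}{\sum_{k=1}^{K} e^{\gamma_k^T x}}
\quad \text{(top-level gate)}
\qquad
g_{\ell \mid j}(x,\gamma_{j\ell}) = \frac{e^{\gamma_{j\ell}^T x}}{\sum_{k=1}^{K} e^{\gamma_{jk}^T x}}
\quad \text{(lower-level gate)}

\Pr(y \mid x, \Psi) = \sum_{j=1}^{K} g_j(x,\gamma_j)
                      \sum_{\ell=1}^{K} g_{\ell \mid j}(x,\gamma_{j\ell})\,
                      \Pr(y \mid x, \theta_{j\ell})
```

Each expert Pr(y | x, theta_{j l}) is itself a GLIM, e.g. a Gaussian linear regression for regression or a linear logistic model for classification.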

Hierarchical Mixtures of Experts - Training
Maximize the likelihood of the training data.
A gradient-based learning algorithm can be used to update the parameters U_ij, or EM can be applied to the HME, with latent indicator variables z_i specifying which branch each observation goes to.
See Jordan and Jacobs (1994) for details.
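A sketch of the EM view, following Jordan and Jacobs (1994): introduce indicator variables for the branch taken at each gate; the E-step computes their posterior probabilities (responsibilities), and the M-step reduces to responsibility-weighted GLIM fits.

```latex
\ell(\Psi) = \sum_i \log \sum_{j=1}^{K} g_j(x_i)
             \sum_{\ell=1}^{K} g_{\ell \mid j}(x_i)\, \Pr(y_i \mid x_i, \theta_{j\ell})

r_{ij\ell} = \frac{g_j(x_i)\, g_{\ell \mid j}(x_i)\, \Pr(y_i \mid x_i, \theta_{j\ell})}
                  {\sum_{j'=1}^{K} \sum_{\ell'=1}^{K} g_{j'}(x_i)\, g_{\ell' \mid j'}(x_i)\,
                   \Pr(y_i \mid x_i, \theta_{j'\ell'})}
\qquad \text{(E-step responsibilities)}
```

In the M-step the experts are refit by responsibility-weighted regressions and the gates by weighted multinomial (softmax) regressions.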

Hierarchical Mixtures of Experts -- EM (figure): each histogram displays the distribution of posterior probabilities across the training set at each node in the tree.

Comparison (Jordan 1994)

Architecture     Relative Error   # Epochs
Linear           .31              1
BackProp         .09              5,500
HME (Alg. 1)     .10              35
HME (Alg. 2)     .12              39
CART             .17              N/A
CART (linear)    .13
MARS             .16

All methods perform better than Linear. BackProp has the lowest relative error but is hard to converge (5,500 epochs). The HME used 16 experts in a four-level hierarchy; MARS used 16 basis functions.

Model Selection for HME
Structural parameters need to be decided: the number of levels and the branching factor K of the tree.
Unlike CART, there are no good methods for finding a suitable tree topology.

Questions and Discussions - CART
Rong Jin: According to Eqn. (9.16), successfully splitting a large subtree is more valuable than splitting a small subtree. Could you justify this?
Rong Jin: In the discussion of general regression and classification trees, only simple binary partitions of the feature space are considered. Has any work been done on nonlinear partitions of the feature space?
Rong Jin: Does it make any sense to use overlapping splits?
Ben: Could you make clearer the difference between using L_{k,k'} as a loss versus as weights? (p. 272)

Questions and Discussions - Locality
Rong Jin: Both tree models and kernel methods try to capture locality. For tree models, the locality is created through the partition of the feature space, while kernel methods express locality through a special distance function. Please comment on the ability of these two approaches to express localized functions.
Ben: Could you compare the tree methods introduced here with kNN and kernel methods?

Questions and Discussions - Gini and other measures
Yan: The classification (and regression) trees discussed in this chapter use several criteria to select the attribute and split points, such as misclassification error, the Gini index, and cross-entropy. When should we use each of these criteria? Is cross-entropy preferred over the other two? (Also a clarification: does the Gini index refer to gain ratio, and cross-entropy to information gain?)
Jian Zhang: From the book we know that the Gini index has many nice properties, e.g. it is a tight upper bound on the error and equals the expected training error rate when observations are classified at random according to the node class probabilities. Should we prefer it for classification for those reasons?
Weng-keen: How is the Gini index equal to the training error?
Yan Jun: For tree-based methods, can you give an intuitive explanation of the Gini index measure?
Ben: What does minimizing node impurity mean? Is it just to decrease the overall variance? Is there any implication for the bias? How does the usual bias-variance tradeoff play a role here?

Questions and Discussions - HME
Yan: In my understanding, the HME is like a neural network combined with Gaussian linear regression (or logistic regression), in the sense that the input to the neural network is the output of the Gaussian regression. Is my understanding right?
Ben: For the 2-class case depicted in Fig. 9.13 (HME), why do we need two layers of mixtures?
Ben: In Eq. 9.30, why are the two upper bounds of the summations the same K? -- Answer: yes, both levels of the tree use the same branching factor K in that formulation.

References
Fisher, W. D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53, 789-798.
Breiman, L. (1984). Classification and Regression Trees. Wadsworth International Group.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.

Hierarchical Mixtures of Experts - a note on the gating function: "The softmax function derives naturally from log-linear models and leads to convenient interpretations of the weights in terms of odds ratios. You could, however, use a variety of other nonnegative functions on the real line in place of the exp function. Or you could constrain the net inputs to the output units to be nonnegative, and just divide by the sum -- that's called the Bradley-Terry-Luce model."
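A tiny sketch of the two normalizations mentioned in the quote (function names are mine; illustrative only):

```python
import numpy as np

def softmax_gate(v):
    """Softmax gating weights: v holds the net inputs (gamma_j^T x) for each branch."""
    e = np.exp(v - v.max())   # subtract the max for numerical stability
    return e / e.sum()

def btl_gate(v):
    """A Bradley-Terry-Luce style alternative: clip the net inputs to be
    nonnegative and normalize by their sum."""
    w = np.maximum(v, 0.0)
    s = w.sum()
    return w / s if s > 0 else np.full_like(w, 1.0 / len(w))
```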
