Additive Models, Trees, etc. Based in part on Chapter 9 of Hastie, Tibshirani, and Friedman David Madigan

Predictive Modeling
Goal: learn a mapping y = f(x; θ)
Need: 1. A model structure, 2. A score function, 3. An optimization strategy
Categorical y ∈ {c_1, …, c_m}: classification; real-valued y: regression
Note: we usually assume {c_1, …, c_m} are mutually exclusive and exhaustive

Generalized Additive Models
A highly flexible form of predictive modeling for regression and classification:
g(E[Y | X_1, …, X_p]) = α + f_1(X_1) + f_2(X_2) + ⋯ + f_p(X_p)
g (the “link function”) could be the identity, the logit, the log, or any other suitable function
The f_j are smooth functions, often fit using natural cubic splines

Basic Backfitting Algorithm
Initialize α = mean(y) and f_j = 0 for all j
Cycle over j = 1, …, p: set f_j ← S_j[ y − α − Σ_{k≠j} f_k(x_k) ], then re-center f_j to have mean zero; repeat until the f_j stop changing
Each S_j is an arbitrary smoother - it could be natural cubic splines
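Below is a minimal R sketch of the backfitting idea, using smooth.spline as the smoother S_j; the function name backfit and its arguments are illustrative, not from the original slides.

backfit <- function(X, y, tol = 1e-6, max.iter = 50) {
  p <- ncol(X)
  alpha <- mean(y)
  f <- matrix(0, nrow(X), p)            # one column per fitted component function
  for (iter in 1:max.iter) {
    f.old <- f
    for (j in 1:p) {
      # partial residual: remove the intercept and all other components
      r <- y - alpha - rowSums(f[, -j, drop = FALSE])
      sm <- smooth.spline(X[, j], r)
      f[, j] <- predict(sm, X[, j])$y
      f[, j] <- f[, j] - mean(f[, j])   # re-center each component
    }
    if (max(abs(f - f.old)) < tol) break
  }
  list(alpha = alpha, f = f)
}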

Example using R’s gam function
library(mgcv)
set.seed(0)
n <- 400
x0 <- runif(n, 0, 1)
x1 <- runif(n, 0, 1)
x2 <- runif(n, 0, 1)
x3 <- runif(n, 0, 1)
pi <- asin(1) * 2
f <- 2 * sin(pi * x0)
f <- f + exp(2 * x1)
f <- f + 0.2 * x2^11 * (10 * (1 - x2))^6 + 10 * (10 * x2)^3 * (1 - x2)^10
e <- rnorm(n, 0, 2)
y <- f + e
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3))
summary(b)
plot(b, pages = 1)

Tree Models
Easy to understand: recursively divide the predictor space into regions where the response variable has small variance
The predicted value is the majority class (classification) or the average value (regression)
Can handle mixed data, missing values, etc.
Usually grow a large tree and prune it back rather than attempt to optimally stop the growing process
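As a concrete illustration of growing a large tree and then pruning it back, here is a short R sketch using the rpart package; the dataset and control values are chosen only for illustration.

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 5))      # deliberately grow a large tree
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]  # cp minimizing cross-validated error
pruned <- prune(fit, cp = best.cp)                                # prune back to that subtree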

Training Dataset
This follows an example from Quinlan’s ID3

Output: A Decision Tree for “buys_computer”
[Tree diagram: the root splits on age; records with age <= 30 are further split on student, records with age > 40 are further split on credit rating (fair vs. excellent), and the remaining age group is classified directly; the leaves carry the class labels “no” / “yes”.]

Confusion matrix

Algorithms for Decision Tree Induction
Basic algorithm (a greedy algorithm):
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf)
– There are no samples left

Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of class N
– The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as
I(p, n) = −(p / (p + n)) log2(p / (p + n)) − (n / (p + n)) log2(n / (p + n))
e.g. I(0.5, 0.5) = 1; I(0.9, 0.1) = 0.47; I(0.99, 0.01) = 0.08

Information Gain in Decision Tree Induction
Assume that using attribute A, a set S will be partitioned into sets {S_1, S_2, …, S_v}
– If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is
E(A) = Σ_{i=1..v} ((p_i + n_i) / (p + n)) I(p_i, n_i)
– The encoding information that would be gained by branching on A is
Gain(A) = I(p, n) − E(A)

Attribute Selection by Information Gain Computation
Class P: buys_computer = “yes”; Class N: buys_computer = “no”
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age: the three age groups contain (2, 3), (4, 0), and (3, 2) examples of (P, N), so
E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694
Hence Gain(age) = I(9, 5) − E(age) = 0.940 − 0.694 = 0.246
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age is selected as the splitting attribute
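The same computation can be written in a few lines of R; the class counts below are assumed to be those of the standard buys_computer example, since the original data table is not reproduced in this transcript.

entropy <- function(p, n) {
  q <- c(p, n) / (p + n)
  q <- q[q > 0]                                  # treat 0 * log2(0) as 0
  -sum(q * log2(q))
}
p <- 9; n <- 5                                   # overall class counts
age.counts <- list(c(2, 3), c(4, 0), c(3, 2))    # (p_i, n_i) within each age group
E.age <- sum(sapply(age.counts, function(cn) sum(cn) / (p + n) * entropy(cn[1], cn[2])))
gain.age <- entropy(p, n) - E.age                # approximately 0.246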

Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as
gini(T) = 1 − Σ_{j=1..n} p_j^2
where p_j is the relative frequency of class j in T
If T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data is defined as
gini_split(T) = (N_1 / N) gini(T_1) + (N_2 / N) gini(T_2)
The attribute that provides the smallest gini_split(T) is chosen to split the node
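A tiny R helper for the two formulas above (the function names are illustrative):

gini <- function(counts) 1 - sum((counts / sum(counts))^2)
gini.split <- function(counts1, counts2) {
  N <- sum(counts1) + sum(counts2)
  (sum(counts1) / N) * gini(counts1) + (sum(counts2) / N) * gini(counts2)
}
gini.split(c(6, 1), c(3, 4))   # splitting 14 records with class counts (9, 5) into (6, 1) and (3, 4)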

Avoid Overfitting in Classification
The generated tree may overfit the training data
– Too many branches, some of which may reflect anomalies due to noise or outliers
– The result is poor accuracy on unseen samples
Two approaches to avoid overfitting
– Prepruning: halt tree construction early, i.e., do not split a node if this would cause the goodness measure to fall below a threshold (it is difficult to choose an appropriate threshold)
– Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the “best pruned tree”

Approaches to Determine the Final Tree Size
– Separate training (2/3) and test (1/3) sets
– Use cross-validation, e.g., 10-fold cross-validation
– Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized

Dietterich (1999) Analysis of 33 UCI datasets

Missing Predictor Values
– For categorical predictors, simply create a new category “missing”
– For continuous predictors, evaluate the split using the complete cases; once a split is chosen, find a first “surrogate predictor” that gives the most similar split
– Then find the second-best surrogate, etc.
– At prediction time, use the surrogates in order
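In R, rpart's CART implementation follows this surrogate-split approach; a sketch of the relevant control parameters (the settings shown are just for illustration):

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(maxsurrogate = 5,    # keep up to 5 surrogate splits per node
                                     usesurrogate = 2))   # use surrogates, then the majority direction, for missing values
summary(fit)   # the per-node summary lists the primary split and its surrogates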

Bagging and Random Forests
Big trees tend to have high variance and low bias
Small trees tend to have low variance and high bias
Is there some way to drive the variance down without increasing the bias?
Bagging can do this to some extent
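A short R sketch contrasting bagging and a random forest via the randomForest package; the dataset and settings are illustrative, and setting mtry to the number of predictors makes the random forest equivalent to bagging.

library(randomForest)
set.seed(1)
bag <- randomForest(Species ~ ., data = iris, mtry = 4, ntree = 500)  # mtry = all 4 predictors: bagged trees
rf  <- randomForest(Species ~ ., data = iris, ntree = 500)            # default mtry (about sqrt(p)): random forest
bag; rf                                                               # compare out-of-bag error estimates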

Naïve Bayes Classification
Recall: p(c_k | x) ∝ p(x | c_k) p(c_k)
Now suppose: p(x | c_k) = p(x_1 | c_k) p(x_2 | c_k) ⋯ p(x_p | c_k)
Then: p(c_k | x) ∝ p(c_k) Π_{j=1..p} p(x_j | c_k)
Equivalently: log( p(c_k | x) / p(c_l | x) ) = log( p(c_k) / p(c_l) ) + Σ_{j=1..p} log( p(x_j | c_k) / p(x_j | c_l) ), the “weights of evidence”
[Graphical model: class node C with arrows to x_1, x_2, …, x_p]

Evidence Balance Sheet

Naïve Bayes (cont.)
Despite the crude conditional independence assumption, naïve Bayes works well in practice (see Friedman, 1997 for a partial explanation)
It can be further enhanced with boosting, bagging, model averaging, etc.
The conditional independence assumptions can be relaxed in myriad ways (“Bayesian networks”)
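For reference, a minimal naïve Bayes fit in R using the e1071 package (the dataset is chosen only for illustration):

library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)         # class-conditional Gaussians for continuous predictors
predict(nb, head(iris[, -5]), type = "raw")        # posterior class probabilities for a few rows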

Patient Rule Induction Method (PRIM)
Looks for regions of predictor space where the response variable has a high average value
An iterative procedure: it starts with a region (box) including all points and, at each step, removes a slice along one dimension
If the slice size α is small, this produces a very “patient” rule induction algorithm

PRIM Algorithm
1. Start with all of the training data, and a maximal box containing all of the data
2. Consider shrinking the box by compressing along one face, so as to peel off the proportion α of observations having either the highest or the lowest values of a predictor X_j. Choose the peeling that produces the highest response mean in the remaining box
3. Repeat step 2 until some minimal number of observations remain in the box
4. Expand the box along any face, so long as the resulting box mean increases
5. Use cross-validation to choose a box from the sequence of boxes constructed above; call that box B_1
6. Remove the data in B_1 from the dataset and repeat steps 2-5
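A rough R sketch of the top-down peeling part of PRIM (steps 1-3 above); this is an illustrative implementation written for these notes, not code from the slides, and it omits the pasting and cross-validation steps.

prim.peel <- function(X, y, alpha = 0.1, min.obs = 10) {
  X <- as.matrix(X)
  box <- rbind(lower = apply(X, 2, min), upper = apply(X, 2, max))  # the maximal box
  inside <- rep(TRUE, nrow(X))
  while (sum(inside) > min.obs) {
    best <- NULL
    for (j in 1:ncol(X)) {
      xj <- X[inside, j]
      # two candidate peels: drop the lowest or the highest alpha fraction of X_j
      for (side in c("lower", "upper")) {
        if (side == "lower") {
          keep <- inside & (X[, j] >= quantile(xj, alpha))
        } else {
          keep <- inside & (X[, j] <= quantile(xj, 1 - alpha))
        }
        if (sum(keep) >= min.obs && sum(keep) < sum(inside)) {
          m <- mean(y[keep])                       # response mean in the remaining box
          if (is.null(best) || m > best$mean)
            best <- list(mean = m, keep = keep, j = j, side = side)
        }
      }
    }
    if (is.null(best)) break                       # no admissible peel remains
    inside <- best$keep
    # tighten the corresponding box boundary
    if (best$side == "lower") {
      box["lower", best$j] <- min(X[inside, best$j])
    } else {
      box["upper", best$j] <- max(X[inside, best$j])
    }
  }
  list(box = box, mean = mean(y[inside]), n.obs = sum(inside))
}

For example, prim.peel(as.matrix(mtcars[, c("wt", "hp")]), mtcars$mpg) peels toward a box of cars with a high average mpg.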