CN700: HST 10.6-10.13 Neil Weisenfeld (notes were recycled and modified from Prof. Cohen and an unnamed student) April 12, 2005.

Robust Loss Functions for Classification: Loss functions that lead to simple boosting solutions (squared error and exponential) are not always the most robust. In classification with a -1/+1 response, the "margin" y·f(x) plays the role that residuals play in regression. An incorrect classification gives a negative margin (y = -1 with f(x) > 0, or y = +1 with f(x) < 0). Loss criteria should penalize negative margins more heavily, since positive margins correspond to correct classifications.

Loss functions for 2-class classification: Exponential loss and binomial deviance are monotone, continuous approximations to misclassification loss. Exponential loss penalizes strongly negative margins much more heavily, while the deviance is more balanced; binomial deviance is therefore more robust in noisy situations where the Bayes error rate is not close to zero. Squared error is a poor choice when classification is the goal: it grows again for margins greater than 1, so it penalizes points that are correctly classified with high confidence.
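As a quick illustration (not from the original slides), the sketch below evaluates these losses as functions of the margin y·f(x) in numpy, so the different penalties on negative margins can be compared side by side; the scaling of the deviance term assumes f is half the log-odds.

```python
import numpy as np

def classification_losses(margin):
    """Two-class losses written as functions of the margin m = y * f(x), with y in {-1, +1}."""
    return {
        "misclassification": (margin < 0).astype(float),
        "exponential":       np.exp(-margin),                      # AdaBoost criterion
        "binomial deviance": np.log(1.0 + np.exp(-2.0 * margin)),  # assumes f = half the log-odds
        "squared error":     (1.0 - margin) ** 2,                  # (y - f(x))^2 rewritten via the margin
    }

margins = np.linspace(-2.0, 2.0, 9)
for name, values in classification_losses(margins).items():
    print(f"{name:>18}: {np.round(values, 2)}")
```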

Loss functions for K classes: The Bayes classifier is G(x) = argmax_k p_k(x). If we are interested in more than just the assignment, the class probabilities themselves are of interest. The logistic model generalizes to K classes via p_k(x) = exp(f_k(x)) / Σ_{l=1}^K exp(f_l(x)), and the binomial deviance extends to the K-class multinomial deviance loss L(y, p(x)) = -Σ_{k=1}^K I(y = k) log p_k(x).
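A minimal numpy sketch of the K-class quantities above (softmax probabilities and the multinomial deviance for a single observation); the variable names and example scores are my own.

```python
import numpy as np

def softmax_probs(f):
    """Class probabilities p_k(x) = exp(f_k) / sum_l exp(f_l) from K score functions."""
    z = np.exp(f - np.max(f))          # shifting by the max is a standard stability trick
    return z / z.sum()

def multinomial_deviance(f, true_class):
    """-sum_k I(y = k) log p_k(x): the negative log-probability of the observed class."""
    return -np.log(softmax_probs(f)[true_class])

f_scores = np.array([0.5, 2.0, -1.0])  # hypothetical f_k(x) for K = 3 classes
print(softmax_probs(f_scores), multinomial_deviance(f_scores, true_class=1))
```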

Robust loss functions for regression: Squared error loss penalizes large absolute residuals |y - f(x)| too heavily and is therefore not robust. Absolute error is a better choice. Huber loss, defined as [y - f(x)]² for |y - f(x)| ≤ δ and 2δ(|y - f(x)| - δ/2) otherwise, deals well with outliers and is nearly as efficient as least squares for Gaussian errors.
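A short numpy sketch of the Huber criterion described above, quadratic for residuals up to δ and linear beyond; δ = 1 is an arbitrary choice for illustration.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear (2*delta*|r| - delta^2) beyond it."""
    r = np.abs(residual)
    return np.where(r <= delta, r ** 2, 2.0 * delta * r - delta ** 2)

residuals = np.array([-5.0, -1.0, -0.2, 0.0, 0.5, 3.0])
print(huber_loss(residuals))   # large residuals are penalized linearly, not quadratically
```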

Inserted for completeness

Boosting Trees: Trees partition the space of joint predictor variable values into disjoint regions R_j, j = 1, ..., J, with a constant prediction γ_j assigned to each region. A tree can be formally expressed as T(x; Θ) = Σ_{j=1}^J γ_j I(x ∈ R_j), with parameters Θ = {R_j, γ_j}_1^J. The parameters are found by minimizing the empirical risk, Θ̂ = argmin_Θ Σ_{j=1}^J Σ_{x_i ∈ R_j} L(y_i, γ_j).
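To make the piecewise-constant view concrete, the hedged sketch below fits a small sklearn regression tree to synthetic data and verifies that its prediction is just the per-region constant γ_j looked up from the leaf an input falls into.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

tree = DecisionTreeRegressor(max_leaf_nodes=6).fit(X, y)   # J = 6 terminal regions
leaf_of = tree.apply(X)                                    # region R_j each point falls into
gammas = {j: y[leaf_of == j].mean() for j in np.unique(leaf_of)}  # per-region constants

# T(x; Theta) = sum_j gamma_j * I(x in R_j): predicting is just a leaf lookup
lookup = np.array([gammas[j] for j in leaf_of])
print(np.allclose(lookup, tree.predict(X)))   # True: the tree is piecewise constant
```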

Boosting Trees: This is a formidable combinatorial optimization problem, so it is divided into two parts: 1. finding γ_j given R_j, which is typically trivial (e.g. the region mean for squared error); 2. finding the regions R_j, which uses a greedy, top-down recursive partitioning algorithm (e.g. the Gini index as a surrogate for misclassification loss when growing the tree). The boosted tree model is the sum of such trees, f_M(x) = Σ_{m=1}^M T(x; Θ_m), obtained from a forward, stagewise algorithm.

Boosting Trees: The boosted tree model is "induced in a forward, stagewise manner." At each step in the procedure one must solve Θ̂_m = argmin_{Θ_m} Σ_{i=1}^N L(y_i, f_{m-1}(x_i) + T(x_i; Θ_m)). Given the regions R_{jm} at each step, the optimal constants are γ̂_{jm} = argmin_{γ} Σ_{x_i ∈ R_{jm}} L(y_i, f_{m-1}(x_i) + γ).

Boosting Trees: For squared error loss this is no harder than for a single tree: at each stage you build the tree that best predicts the current residuals. For two-class classification and exponential loss, the procedure gives AdaBoost. Absolute error or Huber loss for regression, and deviance for classification, would make for robust trees, but they do not give rise to simple, fast boosting algorithms; this motivates gradient boosting.
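A minimal hand-rolled sketch (my own illustration, not the chapter's code) of forward stagewise boosting with squared error loss: each new tree is simply fit to the current residuals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

f = np.full_like(y, y.mean())          # f_0(x): the optimal constant for squared error
trees, M = [], 100
for m in range(M):
    residuals = y - f                  # for squared error, the residual IS the fitting target
    t = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, residuals)
    trees.append(t)
    f += t.predict(X)                  # f_m = f_{m-1} + T(x; Theta_m) on the training points

print("training MSE:", np.mean((y - f) ** 2))
```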

Boosting Trees: Numerical Optimization: A variety of numerical optimization techniques exist for minimizing L(f) = Σ_{i=1}^N L(y_i, f(x_i)). They all work iteratively: start with an initial guess for the function and successively add update functions to it, each computed on the basis of the function from the previous iteration.

Boosting Trees: Steepest Descent: Move down the gradient of L(f), i.e. f_m = f_{m-1} - ρ_m g_m, where g_m is the gradient of L evaluated at f = f_{m-1} and ρ_m is a step length from a line search. It is very greedy, so it can get stuck in local minima; being unconstrained, it can be applied to any system as long as the gradient can be calculated.

Not a tree algorithm, so... (notes from the Professor): calculate the gradient and then fit a regression tree to it by least squares. Advantage: no need to do a linear fit, and since the gradient is taken only with respect to the function values at the training points, each component is effectively one-dimensional.

Boosting Trees: Gradient Boosting: But gradient descent operates solely on the training data. One idea: build boosted trees that approximate the steps down the gradient. Boosting is like gradient descent, but each added tree approximates the negative gradient of the loss evaluated at f_{m-1}. Unlike the true gradient, which is unconstrained and defined only at the training points, each tree is constrained (its predictions must come from a J-terminal-node tree fit on top of the previous model), so the fitted model extends beyond the training data.

MART (Multiple Additive Regression Trees): Generic Gradient Tree Boosting Algorithm
1. Initialize f_0(x) = argmin_γ Σ_{i=1}^N L(y_i, γ).
2. For m = 1, ..., M:
   A) For i = 1, 2, ..., N compute the pseudo-residuals r_{im} = -[∂L(y_i, f(x_i)) / ∂f(x_i)], evaluated at f = f_{m-1}.
   B) Fit a regression tree to the targets r_{im}, giving terminal regions R_{jm}, j = 1, ..., J_m.
   C) For j = 1, 2, ..., J_m compute γ_{jm} = argmin_γ Σ_{x_i ∈ R_{jm}} L(y_i, f_{m-1}(x_i) + γ).
   D) Update f_m(x) = f_{m-1}(x) + Σ_{j=1}^{J_m} γ_{jm} I(x ∈ R_{jm}).
3. Output f̂(x) = f_M(x).
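A compact sketch of the generic algorithm above for absolute error loss (an assumption made to keep the example short): the pseudo-residual is the sign of the residual, and the optimal leaf constant is the median of the residuals in that region. This version updates f only at the training points, in the spirit of the slides; a full implementation would also store the trees and leaf constants for prediction on new data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

f = np.full_like(y, np.median(y))                 # step 1: f_0 = argmin_gamma sum |y_i - gamma|
for m in range(100):                              # step 2: m = 1..M
    r = np.sign(y - f)                            # (A) pseudo-residuals for absolute error
    tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, r)   # (B) fit a tree to the pseudo-residuals
    leaves = tree.apply(X)
    for j in np.unique(leaves):                   # (C) per-region line search: the median residual
        gamma_jm = np.median((y - f)[leaves == j])
        f[leaves == j] += gamma_jm                # (D) update f_m on the training points

print("training MAE:", np.mean(np.abs(y - f)))    # step 3: f_M is the final fit
```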

Right-Sized Trees for Boosting: At issue: for single-tree methods we grow deep trees and then prune them, but how should tree size be handled for these multi-tree methods? A simple strategy is to fix every tree to the same size, expressed as the number of terminal nodes J. The number of terminal nodes determines the degree of coordinate-variable interaction that can be captured. Consider the ANOVA (analysis of variance) expansion of the "target" function, η(X) = Σ_j η_j(X_j) + Σ_{j,k} η_{jk}(X_j, X_k) + Σ_{j,k,l} η_{jkl}(X_j, X_k, X_l) + ... . This yields an approach to boosting trees in which the number of terminal nodes in each individual tree is set to J, where J - 1 is the largest degree of interaction we wish to capture about the data; a short illustration follows.
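In practice the interaction order can be controlled directly. For instance, sklearn's GradientBoostingRegressor exposes the number of terminal nodes via max_leaf_nodes, so J = 2 below gives a purely additive (stump-based) model; this is an illustration with that library, not the slides' own code.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

# J = 2 terminal nodes per tree (stumps): main effects only, no interactions
additive_model = GradientBoostingRegressor(max_leaf_nodes=2, n_estimators=300).fit(X, y)

# J = 6 allows interactions of order up to 5 to be modeled
interaction_model = GradientBoostingRegressor(max_leaf_nodes=6, n_estimators=300).fit(X, y)
```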

Effect of Interaction Order: This just shows how the degree of interaction relates to test error in the simple example of (10.2). The target there is additive, so the ideal is J = 2, and boosting models with J > 2 incur unnecessary variance. Note that J is not the "number of terms."

Regularization: Aside from J, the other meta-parameter of MART is M, the number of boosting iterations. Continued iteration usually reduces training risk, but it can lead to overfitting. One strategy is to estimate M*, the ideal number of iterations, by monitoring prediction risk as a function of M on a validation sample; a sketch follows. Other regularization strategies come next.
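One way to estimate M* in code, shown here as a hedged sklearn example: staged_predict returns the model's predictions after each boosting iteration, so validation error can be tracked as a function of M and the minimizer kept.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1).fit(X_tr, y_tr)

# validation risk as a function of M (the number of boosting iterations)
val_err = [np.mean((y_val - pred) ** 2) for pred in model.staged_predict(X_val)]
M_star = int(np.argmin(val_err)) + 1
print("estimated M* =", M_star, "validation MSE =", val_err[M_star - 1])
```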

Shrinkage: The idea of shrinkage is to weight the contribution of each tree by a factor ν between 0 and 1, so the MART update rule becomes f_m(x) = f_{m-1}(x) + ν Σ_{j=1}^{J_m} γ_{jm} I(x ∈ R_{jm}). There is a clear tradeoff between the shrinkage factor ν and M, the number of iterations: lower values of ν require more iterations and longer computation, but favor better test error. The best strategy seems to be to suck it up and set ν low (< 0.1).
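The shrinkage update needs only one extra factor in the earlier hand-rolled squared-error loop; ν here plays the role of learning_rate in most libraries (a self-contained sketch with arbitrary choices of ν and M).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

nu = 0.05                               # shrinkage factor, 0 < nu < 1 (small values preferred)
f = np.full_like(y, y.mean())
for m in range(500):                    # smaller nu -> more iterations M are needed
    tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, y - f)
    f += nu * tree.predict(X)           # f_m = f_{m-1} + nu * sum_j gamma_jm I(x in R_jm)

print("training MSE with shrinkage:", np.mean((y - f) ** 2))
```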

Shrinkage and Test Error: Again, the example of (10.2). The effect is especially pronounced with the binomial deviance loss, but shrinkage always looks better. HS&T have led us down the primrose path.

Penalized Regression: Taking the set of all possible J-terminal-node trees realizable on the data set as basis functions T_k, k = 1, ..., K, the linear model is f(x) = Σ_{k=1}^K α_k T_k(x). Penalized regression adds a penalty J(α) for large numbers of (or large) coefficients: min_α { Σ_{i=1}^N (y_i - Σ_{k=1}^K α_k T_k(x_i))² + λ·J(α) }.

Penalized Regression: The penalty can be, for example, ridge, J(α) = Σ_k α_k², or lasso, J(α) = Σ_k |α_k|. However, direct implementation of this procedure is computationally infeasible, since it would require enumerating all possible J-terminal-node trees. Forward stagewise linear regression provides a close approximation to the lasso and is similar to boosting and Algorithm 10.2.

Forward Stagewise Linear Regression
1. Initialize α_k = 0, k = 1, ..., K; pick a small ε > 0 and M large.
2. For m = 1 to M: find the basis function and coefficient that best fit the current residuals in a least-squares sense, (β*, k*) = argmin_{β,k} Σ_{i=1}^N (y_i - Σ_l α_l T_l(x_i) - β T_k(x_i))², and update α_{k*} ← α_{k*} + ε·sign(β*).
3. Output f_M(x) = Σ_{k=1}^K α_k T_k(x).
Increasing M is like decreasing λ. Many coefficients will remain at zero; the others will tend to have absolute values smaller than their least-squares defaults. A sketch follows.
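A hedged sketch of the forward stagewise procedure on ordinary standardized predictors rather than trees, in the spirit of the comparison on the next slide; ε and M are arbitrary choices, and with standardized columns the most-correlated predictor is also the one giving the best least-squares fit to the residual.

```python
import numpy as np
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize predictors
y = y - y.mean()

eps, M = 0.01, 2000
alpha = np.zeros(X.shape[1])                      # 1. initialize all coefficients at zero
for m in range(M):                                # 2. repeat M times
    r = y - X @ alpha                             #    current residual
    corr = X.T @ r                                #    predictor most correlated with the residual
    k = int(np.argmax(np.abs(corr)))
    alpha[k] += eps * np.sign(corr[k])            #    nudge its coefficient by eps
print(alpha)                                      # 3. many coefficients stay exactly zero
```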

Lasso vs. Forward-Stagewise (but not on trees) Just as a demonstration, try this out with the original variables, instead of trees, and compare to the Lasso solutions.

Importance of Predictor Variables (IPV): Find out which variable reduces the error most when it is split on, and normalize the other variables' influence with respect to this most important variable.

IPV: Hints: To overcome the effect of greedy splits, average the importances over the many boosted trees. To prevent masking, where important variables are highly correlated with other important ones, use shrinkage.
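With a library implementation the averaged importances are available directly; for example (an sklearn illustration, not the chapter's own code), feature_importances_ on a gradient boosting model already averages the per-split improvements over all trees, and normalizing by the maximum reproduces the "top variable = 100" convention.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05).fit(X, y)

importances = model.feature_importances_              # averaged over all boosted trees
relative = 100.0 * importances / importances.max()    # normalize so the top variable is 100
for idx in np.argsort(relative)[::-1]:
    print(f"x{idx}: {relative[idx]:.1f}")
```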

IPV Classification For K-class classification, fit a function for each class, and see which variable is important within each class. If a few variables are important across all classes: 1.) Laugh your way to the bank, and 2.) Give me some $ for teaching you this.

IPV: Classification: Arrange the variable importances in a matrix with p rows (variables) and K columns (classes). Each column shows which variables are important for separating that class; comparing across a row shows how a variable's relevance differs between classes.

Partial Dependence Plots (PDPs): Problem: after we've determined our important variables, how can we visualize their effects? Solution 1: give up and become another (clueless but rich) manager. Solution 2: just pick a few and keep at it (who likes the Bahamas anyway?)

PDPs: What They Are in the Limit: For a subset of inputs X_S with complement X_C, the partial dependence of f on X_S is the expectation over the complement variables, f_S(X_S) = E_{X_C}[f(X_S, X_C)]. It is estimated by averaging over the training data: f̄_S(X_S) = (1/N) Σ_{i=1}^N f(X_S, x_{iC}).
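The training-data estimate above can be computed by brute force; the sketch below (my own illustration) averages the model's predictions over the data while one variable is clamped to a grid of values.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
model = GradientBoostingRegressor(n_estimators=200).fit(X, y)

def partial_dependence(model, X, var, grid):
    """f_bar_S(v) = (1/N) sum_i f(v, x_iC): clamp variable `var` to v and average the predictions."""
    pd = []
    for v in grid:
        X_clamped = X.copy()
        X_clamped[:, var] = v
        pd.append(model.predict(X_clamped).mean())
    return np.array(pd)

grid = np.linspace(X[:, 3].min(), X[:, 3].max(), 20)
print(partial_dependence(model, X, var=3, grid=grid))   # values to plot against the grid
```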

PDPs: Conditioning: To visualize d (> 3) dimensions, condition on a few of the input variables; this is like looking at slices of the d-dimensional surface (set ranges for the conditioning variables if necessary). It is especially useful when interactions are limited and those variables have additive or multiplicative effects.

PDPs: Finding Interactions: To find interactions, compare the partial dependence plots with the relative importances. If a variable's importance is high yet its plot appears flat, look at it together with another important variable (e.g. a two-variable partial dependence plot), since a flat single-variable plot with high importance suggests an interaction.