Intelligible Models for Classification and Regression

Presentation transcript:

Intelligible Models for Classification and Regression. Authors: Yin Lou, Rich Caruana, Johannes Gehrke (Cornell University), KDD 2012. Presented by: Haotian Jiang, 3/31/2015.

Content: 1. Motivation; 2. Generalized Additive Models (Introduction, Methodology); 3. Experiments; 4. Results; 5. Discussion; 6. Conclusion.

Motivation. Regression and classification models fall into two camps. (a) Complex models such as SVMs, random forests, and deep neural nets: high accuracy, but uninterpretable. (b) Linear models: interpretable, but lower accuracy. The common problem: in many applications, what is learned is just as important as the accuracy of the predictions.

Introduction to GAMs. We fit models of the form $g(E[y]) = \beta_0 + \sum_j f_j(x_j)$, where each $f_j$ is a shape function of a single feature and $g$ is the link function. When the link function is the identity, this is a regression model; when the link function is the logistic function, it is a classification model.
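As a rough illustration of this structure (not code from the paper), the sketch below assembles predictions from per-feature shape functions and applies either link; the shape functions and data are hypothetical.

```python
import numpy as np

def gam_predict(intercept, shape_functions, X, link="identity"):
    """Evaluate an additive model: g(E[y]) = beta_0 + sum_j f_j(x_j)."""
    # Additive score on the link scale.
    score = intercept + sum(f(X[:, j]) for j, f in enumerate(shape_functions))
    if link == "identity":      # regression: predict y directly
        return score
    if link == "logistic":      # classification: return P(y = 1 | x)
        return 1.0 / (1.0 + np.exp(-score))
    raise ValueError(f"unknown link: {link}")

# Hypothetical shape functions, one per feature.
shapes = [np.sin, np.square]
X = np.random.default_rng(0).uniform(-1.0, 1.0, size=(5, 2))
print(gam_predict(0.5, shapes, X))                   # regression predictions
print(gam_predict(0.5, shapes, X, link="logistic"))  # class-1 probabilities
```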

Introduction to GAMs. Because the data in Example 1 was drawn from a model with no interactions between features, a model of the form in Equation 1 is able to fit the data perfectly.

Methodology of GAMs: Shape Functions and Learning Methods. Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ denote a training dataset of size N, where $x_i = (x_{i1}, \ldots, x_{in})$ is a feature vector with n features and $y_i$ is the target. The authors consider both regression problems, where $y_i \in \mathbb{R}$, and binary classification problems, where $y_i \in \{0, 1\}$. Given a model F, let $F(x_i)$ denote the prediction of the model for data point $x_i$. The goal is to minimize the expected value of some loss function $L(y, F(x))$. Fitting a GAM thus involves two design choices: the shape functions and the learning method.
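For concreteness, the loss functions that correspond to these two settings under the usual choices are sketched below (squared error for regression, logistic log-loss for classification with y in {0, 1}); the paper's exact definitions may differ in scaling.

```python
import numpy as np

def squared_loss(y, score):
    """Regression: L(y, F(x)) = (y - F(x))^2."""
    return (y - score) ** 2

def logistic_loss(y, score):
    """Binary classification, y in {0, 1}: negative log-likelihood under the logistic link."""
    p = 1.0 / (1.0 + np.exp(-score))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```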

Shape Functions: Regression Splines. We consider regression splines of the form $f(x) = \sum_{k=1}^{d} \beta_k b_k(x)$, a linear combination of $d$ fixed basis functions $b_k$ with coefficients $\beta_k$. For illustration, assume a model containing only one smooth function of one covariate. Recommended reading: "Generalized Additive Models: An Introduction with R" (Simon Wood).
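As an illustration of that form (not the basis used in the paper), the sketch below builds a truncated-power cubic basis with made-up knots and fits the coefficients by ordinary least squares.

```python
import numpy as np

def cubic_tp_basis(x, knots):
    """Truncated-power cubic basis: 1, x, x^2, x^3, and (x - k)_+^3 for each knot k."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 200)
B = cubic_tp_basis(x, knots=[0.25, 0.5, 0.75])   # design matrix, one column per basis function
beta = np.linalg.lstsq(B, np.sin(2 * np.pi * x), rcond=None)[0]  # unpenalized least-squares fit
f_hat = B @ beta                                  # fitted shape function f(x) = sum_k beta_k b_k(x)
```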

Shape Functions: Trees and Ensembles of Trees. We control tree complexity either by fixing the number of leaves or by disallowing leaves that contain fewer than an α-fraction of the training examples. Single tree: a single regression tree is used as a shape function. Bagged trees: the well-known technique of bagging is used to reduce variance. Boosted trees: gradient boosting, where each successive tree tries to predict the overall residual from all preceding trees. Boosted bagged trees: a bagged ensemble is used in each step of stochastic gradient boosting, resulting in a boosted ensemble of bagged trees.
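A sketch, assuming scikit-learn, of the two basic building blocks on a single feature: a size-limited single tree and a bagged ensemble of such trees. It only illustrates the components, not the paper's training loop; the data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(1000, 1))                 # one feature
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=1000)

# Single tree: complexity controlled by the number of leaves.
single_tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(x, y)

# Bagged trees: average many trees fit on bootstrap samples to reduce variance.
bagged = BaggingRegressor(DecisionTreeRegressor(max_leaf_nodes=8),
                          n_estimators=100, bootstrap=True).fit(x, y)

grid = np.linspace(-3, 3, 50).reshape(-1, 1)
shape_single = single_tree.predict(grid)               # piecewise-constant shape function
shape_bagged = bagged.predict(grid)                    # smoother, lower-variance shape function
```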

Shape Functions over Rectangular Areas. A tree-based shape function is piecewise constant: it splits the feature's range into rectangular regions (intervals) and predicts a constant value in each, and the goal is to approximate the true per-feature contribution with such a step function.

Learning Method: Least Squares. We minimize the penalized least-squares objective $\sum_{i=1}^{N}\big(y_i - F(x_i)\big)^2 + \sum_j \lambda_j \int f_j''(t)^2\,dt$ with smoothing parameter λ. Large values of λ force $f_j$ toward a straight line, while small values of λ allow the spline to fit the data closely. We use thin-plate regression splines from the R package "mgcv", which automatically selects good values for the spline parameters.
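This is not mgcv's thin-plate machinery; it is a minimal numpy sketch of penalized least squares with a simple polynomial basis and a ridge-style penalty standing in for the smoothness penalty, just to make the role of λ visible (small λ: wiggly fit; large λ: nearly a flat line).

```python
import numpy as np

def fit_penalized(B, y, lam):
    """Minimize ||y - B beta||^2 + lam * ||beta||^2 (a crude stand-in for the smoothness penalty)."""
    k = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(k), B.T @ y)

x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + np.random.default_rng(1).normal(scale=0.2, size=x.size)
B = np.vander(x, N=10, increasing=True)        # simple polynomial basis as a placeholder

for lam in (1e-8, 1e-2, 1e2):                  # small lambda: close fit; large lambda: nearly flat
    beta = fit_penalized(B, y, lam)
    f_hat = B @ beta                           # fitted shape function at this smoothing level
```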

Learning Method: Gradient Boosting We use standard gradient boosting with one difference: Since we want to learn shape functions for all features, in each iteration of boosting we cycle sequentially through all features.
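This is only a simplified sketch of that idea, not the paper's implementation: it assumes scikit-learn, squared loss, no bagging and no shrinkage, and in each pass fits one small tree per feature to the current residual and adds it to that feature's shape function.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gam_boosting(X, y, n_passes=50, max_leaves=4):
    """Cyclic gradient boosting of per-feature trees for a squared-loss GAM (simplified sketch)."""
    n, d = X.shape
    shape_trees = [[] for _ in range(d)]      # trees accumulated per feature
    pred = np.zeros(n)                        # current additive prediction
    for _ in range(n_passes):
        for j in range(d):                    # cycle sequentially through all features
            residual = y - pred
            tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
            tree.fit(X[:, [j]], residual)     # one small tree on feature j only
            pred += tree.predict(X[:, [j]])
            shape_trees[j].append(tree)
    return shape_trees

def shape_value(shape_trees_j, xj):
    """Evaluate the learned shape function for one feature on a 1-D grid."""
    xj = np.asarray(xj).reshape(-1, 1)
    return sum(t.predict(xj) for t in shape_trees_j)
```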

Learning Method: Backfitting. The first shape function f1 is learned on the training set with the goal of predicting y; each subsequent shape function is then fit to the residual left by the ones learned so far, and the procedure keeps cycling through the features, refitting each fj on the partial residual obtained by subtracting all other current shape functions, until convergence.
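For contrast with boosting, here is a minimal backfitting sketch under the same assumptions (scikit-learn regression trees as the per-feature smoother, squared loss): each pass replaces the previous fit for a feature rather than adding to it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gam_backfitting(X, y, n_iters=20, min_frac=0.05):
    """Backfitting with one regression tree per feature (simplified sketch)."""
    n, d = X.shape
    min_leaf = max(1, int(min_frac * n))      # alpha-fraction stopping rule from the slides
    trees = [None] * d                        # current shape function (one tree) per feature
    contrib = np.zeros((d, n))                # cached per-feature contributions
    for _ in range(n_iters):
        for j in range(d):
            # Partial residual: what remains after subtracting all other shape functions.
            partial = y - (contrib.sum(axis=0) - contrib[j])
            trees[j] = DecisionTreeRegressor(min_samples_leaf=min_leaf).fit(X[:, [j]], partial)
            contrib[j] = trees[j].predict(X[:, [j]])   # previous fit for feature j is discarded
    return trees
```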

EXPERIMENTAL SETUP Datasets: The "Concrete," "Wine," and "Music" regression datasets are from the UCI repository; "Delta" is the task of controlling the ailerons of an F16 aircraft; "CompAct" is a regression dataset from the Delve repository that describes the state of multiuser computers. The synthetic dataset was described in Example 1.

EXPERIMENTAL SETUP In gradient boosting, we vary the number of leaves in the bagged or boosted trees over 2, 3, 4, 8, 12, and 16; after M iterations the trained model contains M such trees for each shape function. In backfitting, we control tree complexity by adaptively choosing a parameter α that stops splitting nodes smaller than an α-fraction of the training data, with α ∈ {0.00125, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5}. Metrics: for regression problems we report root mean squared error (RMSE); for classification problems we report error rates. Each dataset is split into a training set (0.64N), a validation set (0.16N) used to check convergence and choose M, and a test set (0.2N) used for evaluation and the bias-variance analysis.
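A small sketch of this protocol (the 64/16/20 split and the RMSE metric); the function names and seed are made up, and it omits the repeated runs used for the bias-variance analysis.

```python
import numpy as np

def split_64_16_20(X, y, seed=0):
    """Shuffle and split into train (0.64N), validation (0.16N), and test (0.2N) sets."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    n_tr, n_va = int(0.64 * n), int(0.16 * n)
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

def rmse(y_true, y_pred):
    """Root mean squared error, the metric reported for the regression datasets."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```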

Results. The training curves for backfitting are not monotonic and have distinct peaks on the classification problem; each peak corresponds to a new backfitting iteration in which the pseudo-residuals are updated.

Discussion. Question: why are tree-based methods more accurate for feature shaping than spline-based methods? Bias-variance analysis: define the average prediction over L resampled models for each point $(x_i, y_i)$ in the test set as $\bar{F}(x_i) = \frac{1}{L}\sum_{l=1}^{L} F_l(x_i)$. The squared bias is $\frac{1}{N}\sum_{i=1}^{N}\big(\bar{F}(x_i) - y_i\big)^2$, and the variance is $\frac{1}{NL}\sum_{i=1}^{N}\sum_{l=1}^{L}\big(F_l(x_i) - \bar{F}(x_i)\big)^2$.
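Assuming the L model fits are stored as an L × N matrix of test-set predictions, these quantities can be computed as in the sketch below.

```python
import numpy as np

def bias_variance(preds, y_test):
    """preds: array of shape (L, N), one row per resampled model; y_test: array of shape (N,)."""
    avg = preds.mean(axis=0)                  # average prediction per test point
    bias2 = np.mean((avg - y_test) ** 2)      # squared bias
    variance = np.mean((preds - avg) ** 2)    # variance across the L models, averaged over points
    return bias2, variance
```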

Discussion

Discussion. Computation cost: P-LS and P-IRLS are very fast on small datasets, but on the larger datasets they are slower than BST-TRx. Due to the extra cost of bagging, BST-bagTRx, BF-bagTR, and BF-bbTRx are much slower than P-LS/P-IRLS or BST-TRx. The slowest method tested is backfitting, which is expensive because at each iteration the previous shape functions are discarded and a new fit for each feature must be learned. Limitations and extensions: the approach handles only low-to-medium dimensionality; future work includes incorporating feature selection and allowing a few pairwise interactions (see the follow-up paper "Accurate Intelligible Models with Pairwise Interactions").

Conclusion. The bias-variance analysis shows that spline-based methods tend to underfit and may therefore miss important non-smooth structure in the shape functions. As expected, it also shows that tree-based methods are prone to overfitting and require careful regularization. The proposed GAM learning method, based on gradient boosting of size-limited bagged trees, is significantly more accurate than previous algorithms on both regression and classification problems while retaining the intelligibility of GAM models.

Thank you! Q & A Haotian Jiang