Eco 6380 Predictive Analytics For Economists Spring 2016 Professor Tom Fomby Department of Economics SMU

Presentation 4: The Dangers of Overtraining a Model and What to Do to Avoid It (Chapter 2 in SPB)

Choosing and Maintaining a Best Supervised Learning Model
In the case of a supervised learning problem (either prediction or classification), the standard practice is to partition the data into training, validation, and test subsets (more on the rationale for this partitioning below) and to use these partitions to determine the best individual model or ensemble model for the task at hand (more on how to determine the best model later). In some sense we are “letting the data speak” as to the best model(s) to use for prediction or classification purposes. Only with the advent of the large data sets that arise in data mining tasks has this data-partitioning approach become viable.
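As an illustration (not part of the SPB text), here is a minimal sketch of such a three-way partition in Python with scikit-learn; the data are synthetic and the 60/20/20 proportions are only an assumption:

```python
# Minimal sketch of a train/validation/test partition (synthetic data; the
# 60/20/20 split proportions are an illustrative assumption, not from the text).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=1000)

# Split off the test partition first, then carve the validation partition
# out of the remainder: 60% train, 20% validation, 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)
```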

Two Major Problems Faced by Both the Statistical and Algorithmic (AI) Modeling Approaches
Over-training – the process of fitting an overly complex model to a set of data. The complex model, when applied to an independent test data set, performs poorly in terms of predictive accuracy. It is often said that “the model has fit not only the signal in the data but the noise as well.”
Multiplicity of good models – several models provide equally good fits of the data and, on independent data sets, perform equally well in terms of predictive accuracy.

“Over-Training” Problem (Fomby Diagram)
[Diagram: goodness-of-fit (lower is better) plotted against the complexity of the models, with one curve for the training data and one for the test data.]

“Over-Training” Problem
An overly complex model fit to training data will invariably not perform as well (in terms of predictive accuracy) on independent test data. The overly complex model unfortunately tends to fit not only the signal in the training data but also the noise, and thus gives a false impression of predictive accuracy.
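To make the pattern in the diagram above concrete, here is a minimal sketch (assuming scikit-learn and synthetic data; the tree depths stand in for “model complexity”):

```python
# Training error keeps falling as the tree is allowed to grow more complex,
# while test error eventually stops improving and then worsens (over-training).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 2, 4, 8, 16):                      # increasing model complexity
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth,
          mean_squared_error(y_tr, tree.predict(X_tr)),   # fit on training data
          mean_squared_error(y_te, tree.predict(X_te)))   # fit on independent test data
```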

“Multiplicity of Good Models” Problem (Fomby Diagram)
[Diagram: predictive accuracy (lower is better) for a set of candidate models, Model 1, Model 2, Model 3, …, Model J, Model K, Model L, Model M.]

“Multiplicity of Good Models” Problem
When there is a multiplicity of good models, one can often improve on the predictive performance of any of the individual “good” models by building an ensemble of the best of them.

A Corollary: the “Complexity of Ensemble Model” Problem (Fomby Diagram)
[Diagram: predictive accuracy (lower is better) plotted against the number of models making up the ensemble, with one curve for the training data and one for the test data.]

A Good Approach to Building a Good Ensemble Model: Build a “Trimmed” Ensemble
Examine several model “groups” having distinct architectures (designs), such as MLR, KNN, ANN, CART, SVM, etc. Determine a best (“super”) model for each model group; then, among these “super” models, choose the best 3 or 4, form an ensemble model (usually an equally weighted one), and apply it to an independent data set. It will, with high probability, outperform the individual “super” models that make up the ensemble. In other words, don’t just throw together a bunch of individual models without careful pre-selection (“trimming”). In essence, this is the approach of the Auto Numeric and Auto Classifier nodes in SPSS Modeler. See “property_values_numericpredictor.str” and “pm_binaryclassifier.str” in the Demo Streams directory of SPSS Modeler.
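As a rough illustration of the equal-weighting step only (this is not the SPSS Modeler implementation; the three scikit-learn models below are stand-ins for the pre-selected “super” models):

```python
# Equally weighted ensemble of a few pre-selected ("trimmed") models.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def trimmed_ensemble_predict(models, X_train, y_train, X_new):
    """Fit each pre-selected model and average their predictions with equal weights."""
    preds = [m.fit(X_train, y_train).predict(X_new) for m in models]
    return np.mean(preds, axis=0)

best_models = [LinearRegression(),                                   # stand-in for the best MLR
               KNeighborsRegressor(n_neighbors=5),                   # stand-in for the best KNN
               DecisionTreeRegressor(max_depth=4, random_state=0)]   # stand-in for the best CART
# y_hat = trimmed_ensemble_predict(best_models, X_train, y_train, X_test)
```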

Using Cross-Validation and Data Partitioning to Avoid the Over-training (Over-fitting) of Models
As we shall see, data mining is vulnerable to the danger of “over-fitting,” where a model is fit so closely to the available sample of data (by making it overly complex) that it describes not merely the structural characteristics of the data but its random peculiarities as well. In engineering terms, the model is fitting the noise, not just the signal. Over-fitting can also occur in the sense that, with so many models and architectures to choose from, “one is bound to find a model that works sooner or later.”

Using Cross-Validation and Data Partitioning to Avoid the Over-training (Over-fitting) of Models, continued
To overcome the tendency to over-fit, data mining techniques require that a model be “tuned” on one sample and “validated” on a separate sample. Models that are badly over-fit to one sample due to over-complexity will generally perform poorly on a separate, independent data set. Therefore, validating a data mining model on a data set separate from the one used to formulate it (cross-validation) tends to reduce the incidence of over-fitting and results in an overall better-performing model. Also, if one of many models just “happens” to fit a data set well, its innate inadequacy is likely to be revealed when it is applied to an independent data set.
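A minimal sketch of the resampling version of this idea, k-fold cross-validation (assuming scikit-learn; the synthetic data and the choice of five folds are illustrative):

```python
# Each of the 5 folds serves once as the held-out "validation" sample while the
# model is tuned on the remaining folds; the scores are then averaged.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print(-scores.mean())   # average held-out mean squared error across the folds
```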

Using Cross-Validation and Data Partitioning to Avoid the Over-training (Over-fitting) of Models, continued
The definitions and purposes of the data partitions are as follows:
Training Partition: the portion of the data used to fit the “parameters” of a statistical model or to determine the “architecture” of a data-algorithmic (machine learning) model.
Validation Partition: the portion of the data used to assess whether a proposed model was over-fit to the training data, and thus how good a proposed model really is. For example, is a single hidden-layer artificial neural network to be preferred over a double hidden-layer artificial neural network? Is a best-subset multiple regression determined with a critical p-value of 0.10 to be preferred to one determined with a critical p-value of 0.01? The validation data set helps us answer such questions. Also, as we shall subsequently see, the validation data set helps us build ensemble models that are constructed as “combinations” of the best-performing individual models.
Test Partition: the portion of the data used to assess the “unconditional” (generalized) performance of the “best” models determined from the validation data set. The test data set can also be used to determine the relative efficiencies of competing ensemble models for the purpose of choosing an optimal ensemble to use in practice.
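Putting the three partitions to work, a minimal sketch (assuming scikit-learn and the X_train/X_valid/X_test split from the earlier sketch; the one- versus two-hidden-layer comparison mirrors the example above):

```python
# Candidate models are compared on the validation partition; only the winner is
# then scored on the untouched test partition.
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

candidates = {
    "one hidden layer":  MLPRegressor(hidden_layer_sizes=(10,),    max_iter=2000, random_state=0),
    "two hidden layers": MLPRegressor(hidden_layer_sizes=(10, 10), max_iter=2000, random_state=0),
}
val_mse = {name: mean_squared_error(y_valid, m.fit(X_train, y_train).predict(X_valid))
           for name, m in candidates.items()}
best_name = min(val_mse, key=val_mse.get)           # chosen on the validation partition
print(best_name, mean_squared_error(y_test, candidates[best_name].predict(X_test)))
```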

Binning Continuous Variables and Stratified Sampling
Sometimes, when continuous input variables are highly multi-collinear, it helps to create categorical variables out of the continuous variables by a process called “binning.” That is, the continuous variable is partitioned into class intervals and an indicator (0/1) variable is assigned to each interval. For example, an income variable can be divided into income_30, income_60, income_90, and income_above, these variables representing the income groups (0, 30K), (30K, 60K), (60K, 90K), and above 90K. Binning crucial, highly multi-collinear continuous variables can break down the multicollinearity that might otherwise prevent the construction of a strong prediction or classification model. The XLMINER program supports the binning operation. It is also sometimes the case that models are best built by strata. That is, models built for separate strata, for example by regions north, south, east, and west, can often yield more accurate predictions or classifications than a single model that uses all of the data (strata) at once. The XLMINER program supports stratified sampling.
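A minimal sketch of the binning step in pandas (the column name and cut points mirror the income example above; XLMiner performs the same operation through its menus):

```python
# Partition a continuous income variable into class intervals and create a
# 0/1 indicator column for each interval.
import pandas as pd

df = pd.DataFrame({"income": [12_000, 45_000, 71_000, 95_000, 150_000]})
bins   = [0, 30_000, 60_000, 90_000, float("inf")]
labels = ["income_30", "income_60", "income_90", "income_above"]
df["income_bin"] = pd.cut(df["income"], bins=bins, labels=labels)
df = pd.concat([df, pd.get_dummies(df["income_bin"])], axis=1)   # one indicator per interval
print(df)
```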

Classroom Exercise: Exercise 2