Eco 6380 Predictive Analytics For Economists Spring 2016 Professor Tom Fomby Department of Economics SMU
Presentation 4 The Dangers of Overtraining a Model and What to do to Avoid It Chapter 2 in SPB
Choosing and Maintaining a Best Supervised Learning Model In the case of a Supervised Learning Problem (either prediction or classification) the standard practice is to partition the data into training, validation, and test subsets (more about the rationale of this partitioning below) Use these data partitions to determine a best individual model or Ensemble Model to use for the task at hand (more about how to determine the best model later). In some sense we are “letting the data speak” as to the best model(s) to use for prediction or classification purposes. Only with the advent of the large data sets that arise in data mining tasks has the data partitioning approach become viable.
Two Major Problems faced by both the Statistical and Algorithmic (AI) Modeling Approaches Over-training – the process of fitting an overly complex model to a set of data. The Complex model, when applied to an independent test data set, performs poorly in terms of predictive accuracy. It is often said that, “the model has not only fit the signal in the data but the noise as well.” Multiplicity of Good Models – several models provide equally good fits of the data and in independent data sets perform equally well in terms of predictive accuracy.
“Over-Training” Problem On test data On Training data Complexity of Models (Fomby Diagram) Goodness-of-fit (lower the better)
“Over-Training” Problem An overly complex model fit to training data will invariably not perform as well ( in terms of predictive accuracy ) on independent test data. The overly complex model tends to, unfortunately, fit not only the signal in the training data but also to the noise in the training data and thus gives a false impression of predictive accuracy.
“Multiplicity of Good Models” Problem Models (Fomby Diagram) Predictive Accuracy (lower the better) Model 1 Model 2 Model 3 Model J Model K Model L Model M …
“Multiplicity of Good Models” Problem When there is a multiplicity of good models one can often improve on predictive performance of any of the individual “good” models by building an ensemble of the best of the good models.
A Corollary : “Complexity of Ensemble Model” Problem On test data On Training data Number of Models Making Up Ensemble (Fomby Diagram) Predictive Accuracy (lower the better)
A Good Approach to Building a Good Ensemble Model: Build a “Trimmed” Ensemble Examine several model “groups” having distinct architectures (designs) like MLR, KNN, ANN, CART, SVM, etc. Determine a Best (“super”) Model for each Model Group and then, among these “super” models, choose the best 3 or best 4 of them, form an ensemble model (usually an equally weighted one), and then apply it to an independent data set. It will, with high probability, outperform the individual “super” models that make up the ensemble. In other words don’t just throw together a bunch of individual models without careful pre-selection (“trimming”) In essence, this is the approach of the Auto Numeric and Auto Classifier Nodes in SPSS Modeler. See “property_values_numericpredictor.str” and “pm_binaryclassifier.str” in the Demo Streams directory of SPSS Modeler.
Using Cross-Validation and Data Partitioning to Avoid the Over-training (Over-fitting) of Models As we shall see, data mining is vulnerable to the danger of “over-fitting” where a model is fit so closely to the available sample of data (by making it overly complex) that it describes not merely structural characteristics of the data, but random peculiarities as well. In engineering terms, the model is fitting the noise, not just the signal. Over-fitting can also occur in the sense that with so many models and their architectures to choose from, “one is bound to find a model that works sooner or later.”
Using Cross-Validation and Data Partitioning to Avoid the Over-training (Over-fitting) of Models continued To overcome the tendency to over-fit, data mining techniques require that a model be “tuned” on one sample and “validated” on a separate sample. Models that are badly over-fit to one sample due to over-complexity will generally perform poorly in a separate, independent data set. Therefore, validation of a data mining model on a separate data set from the data set used to formulate the model (cross validation) tends to reduce the incidence of over- fitting and results in an overall better performing model. Also if one of many models just “happens” to fit a data set well, its innate inadequacy is likely to be revealed when applied to an independent data set.
Using Cross-Validation and Data Partitioning to Avoid the Over-training (Over-fitting) of Models continued The definitions and purposes of the data partitions are as follows: Training Partition: That portion of the data used to fit the “parameters” of a statistical model or determine the “architecture” of a data-algorithmic (machine learning) model Validation Partition: That portion of the data used to assess whether a proposed model was over-fit to the training data and thus how good a proposed model really is. For example, is a single hidden-layer artificial neural network to be preferred over a double hidden-layer artificial neural network? Is a best subset multiple regression determined with a critical p-value of 0.10 to be preferred to one determined with a critical p-value of 0.01? The Validation data set helps us answer such questions. Also, as we shall subsequently see, the Validation data set helps us build Ensemble models that are constructed as “combinations” of the best individual performing models. Test Partition: That portion of the data that is used to assess the “unconditional” (generalized) performances of “best” models that may have been determined from the Validation data set. Also the Test data set can be used to determine the relative efficiencies of competing Ensemble models for the purpose of choosing an optimal Ensemble to use in practice.
Binning Continuous Variables and Stratified Sampling Sometimes when continuous input variables are very multi-collinear, it helps to create categorical variables out of the continuous variables by a process called “binning”. That is, the continuous variable is partitioned into class intervals and an indicator (0 and 1) variable is assigned for each interval. For example, an income variable divided into income_30, income_60, income_90, and income_above, these variables representing income groups of (0,30K), (30K, 60K), (60K,90K), and (90K,90K+). Binning crucial highly multi-collinear continuous variables can break down the multicollinearity that might otherwise prevent the construction of a strong prediction or classification model. The XLMINER program supports the binning operation. It is sometimes the case that models are best built by strata. That is, models built by separate strata, for example by regions north, south, east, and west, can often give rise to more accurate predictions and classifications when considering all of the data (strata) as a whole. Thus, predictions or classification built up by strata might yield more accurate predictions or classifications than of a model using all of the data at once. The XLMINER programs supports stratified sampling.
Classroom Exercise: Exercise 2