Cross-validation Brenda Thomson/ Peter Fox Data Analytics
ITWS-4600/ITWS-6600/MATP-4450/CSCI-4960 Group 2 Module 7, October 16, 2018
Contents
Numeric v. non-numeric
Cross-validation Cross-validation is a model-validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice, i.e., predictive and prescriptive analytics…
Cross-validation In a prediction problem, a model is usually given a dataset of known data on which training is run (the training dataset), and a dataset of unknown data (i.e., data seen for the first time) against which the model is tested (the testing dataset). Sound familiar?
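The training/testing split described above can be sketched in a few lines. This is a minimal illustration in Python (the labs may use R instead); the function name `train_test_split` and the 25% test fraction are illustrative choices, not prescribed by the slides.

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle the data, then split it into a training set and a testing set."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = data[:]                 # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

train, test = train_test_split(list(range(100)))
print(len(train), len(test))
```

Every observation lands in exactly one of the two sets; the model never sees the testing data during training.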
Cross-validation The goal of cross-validation is to define a dataset to "test" the model during the training phase (i.e., the validation dataset), in order to limit problems like overfitting, and to give insight into how the model will generalize to an independent data set (i.e., an unknown dataset, for instance from a real problem), etc.
Common types of x-validation
K-fold; 2-fold (do you know this one?); repeated-random-subsample; leave-out-subsample. Lab in a few weeks … to try these out
K-fold The original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (usually) to produce a single estimate.
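The k-fold partitioning above can be sketched as follows. This is an illustrative Python version (the labs may use R); the function names are hypothetical, and for clarity it splits index lists rather than an actual dataset.

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k (nearly) equal-sized folds."""
    indices = list(range(n))
    # the first n % k folds get one extra element when k does not divide n
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds

def k_fold_splits(n, k):
    """Yield (training, validation) index pairs, one per fold."""
    folds = k_fold_indices(n, k)
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_splits(10, 3))
```

Each index appears in the validation set of exactly one fold, which is the property the "Advantage?" slide below relies on. (In practice the indices would be shuffled first; that step is omitted here to keep the partitioning logic visible.)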
Leave-out subsample As the name suggests, leave-one-out cross-validation (LOOCV) involves using a single observation from the original sample as the validation data, and the remaining observations as the training data, i.e., k = n-fold cross-validation. Leaving out > 1 observation relates to bootstrapping and jackknifing.
boot(strapping) Generates replicates of a statistic applied to data (parametric and nonparametric). For the nonparametric bootstrap, possible methods are: the ordinary bootstrap, the balanced bootstrap, antithetic resampling, and permutation. For nonparametric multi-sample problems, stratified resampling is used: this is specified by including a vector of strata in the call to boot. Importance-resampling weights may also be specified.
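The slide describes R's boot package; the core idea of the ordinary nonparametric bootstrap can be sketched in plain Python to show what "replicates of a statistic" means. The function name and the choice of R = 200 replicates are illustrative, not from the slides.

```python
import random
import statistics

def bootstrap_replicates(data, statistic, R=1000, seed=0):
    """Ordinary nonparametric bootstrap: draw R resamples of the same size
    as the data, with replacement, and apply the statistic to each."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(R)]

reps = bootstrap_replicates([2, 4, 4, 5, 7, 9], statistics.mean, R=200)
se = statistics.stdev(reps)   # bootstrap estimate of the standard error of the mean
```

The spread of the replicates estimates the sampling variability of the statistic without any distributional assumption; R's boot adds the balanced, antithetic, permutation, stratified, and importance-weighted variants listed above on top of this basic scheme.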
Jackknifing Systematically recompute the statistic, leaving out one or more observations at a time from the sample set. From this new set of replicates of the statistic, an estimate of its bias and an estimate of its variance can be calculated. Often use log(variance) [instead of variance], especially for non-normal distributions.
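The leave-one-out jackknife described above can be written out directly; a minimal Python sketch (function name illustrative) using the standard jackknife bias and variance formulas:

```python
import statistics

def jackknife(data, statistic):
    """Leave-one-out jackknife: recompute the statistic with each observation
    deleted, then estimate the statistic's bias and variance."""
    n = len(data)
    theta_hat = statistic(data)
    # one replicate per deleted observation
    replicates = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    mean_rep = sum(replicates) / n
    bias = (n - 1) * (mean_rep - theta_hat)
    variance = (n - 1) / n * sum((r - mean_rep) ** 2 for r in replicates)
    return bias, variance

bias, var = jackknife([1, 2, 3, 4, 5], statistics.mean)
```

For the sample mean the jackknife bias is zero and the variance estimate reduces to s²/n, which makes it a handy sanity check before applying the method to a less tractable statistic.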
Repeat-random-subsample
Random split of the dataset into training and validation data. For each such split, the model is fit to the training data, and predictive accuracy is assessed using the validation data. Results are then averaged over the splits. Note: for this method, the results will vary if the analysis is repeated with different random splits.
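The repeated random sub-sampling scheme above amounts to re-running an independent random split several times. A minimal Python sketch (function name and the 30% validation fraction are illustrative):

```python
import random

def repeated_random_splits(n, n_splits=10, validation_fraction=0.3, seed=0):
    """Repeated random sub-sampling: independently re-shuffle and re-split
    the indices n_splits times; accuracy would be averaged over the splits."""
    rng = random.Random(seed)
    n_val = int(n * validation_fraction)
    for _ in range(n_splits):
        indices = list(range(n))
        rng.shuffle(indices)        # fresh shuffle -> each split is independent
        yield indices[n_val:], indices[:n_val]   # (training, validation)

splits = list(repeated_random_splits(10, n_splits=5))
```

Because each split is drawn independently, an observation may land in the validation set several times or never, which is exactly the disadvantage relative to k-fold noted on the next slide.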
Advantage? The advantage of k-fold over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used. The advantage of repeated random sub-sampling over k-fold cross-validation is that the proportion of the training/validation split is not dependent on the number of iterations (folds).
Disadvantage The disadvantage of repeated random sub-sampling is that some observations may never be selected for the validation subsample, whereas others may be selected more than once; i.e., validation subsets may overlap.
Assignment 6 Your term projects should fall within the scope of a data analytics problem of the type you have worked with in class/labs, or know of yourself – the bigger the data the better. This means that the work must go beyond just making lots of figures. You should develop the project to show that you are thinking about and exploring the relationships and distributions within your data. Start with a hypothesis, think of a way to model and test the hypothesis, find or collect the necessary data, and do both preliminary analysis and detailed modeling; summary (interpretation)-level students must develop at least two types of models. Note: you do not have to come up with a positive result; i.e., disproving the hypothesis is just as good. Grading (% may change…): Introduction (2%), Data Description (3%), Analysis (5%), Model Development (12%), Conclusions and Discussion (3%), Oral presentation (5%) (~5 mins)