Model generalization: Brief summary of methods

Model generalization. Outline: brief summary of methods; bias, variance and complexity; brief introduction to stacking.

Brief summary of methods. Shallow classifiers (linear machines): linear partition; need expert input to select representations. Parametric nonlinear models: nonlinear partition; need expert input to select representations. Kernel machines: flexible nonlinear partition; "do not generalize well far from the training examples". Deep neural networks: extract higher-order information; capable of learning highly abstract representations.

Training vs. testing. A model is fit to training data (giving a training error rate) and evaluated on testing data (giving a testing error rate). Good performance on testing data, which is independent of the training data, is what matters most for a model; it serves as the basis for model selection.

Test error. To evaluate prediction accuracy, loss functions are needed. For continuous Y, a typical choice is squared-error loss, L(Y, f̂(X)) = (Y − f̂(X))². In classification, with categorical G (class label), one can use 0–1 loss, L(G, Ĝ(X)) = I(G ≠ Ĝ(X)), or the log-likelihood (deviance) loss, L(G, p̂(X)) = −2 Σ_k I(G = k) log p̂_k(X) = −2 log p̂_G(X), where p̂_k(X) is the estimated probability of class k. Except in rare cases (e.g., 1-nearest neighbor), the trained classifier gives a probabilistic output.
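
As a small illustration, these loss functions can be coded directly (a sketch with toy data; the function names are ours, not from the slides):

```python
import numpy as np

def squared_error_loss(y, y_hat):
    # continuous outcome: (Y - f_hat(X))^2
    return (y - y_hat) ** 2

def zero_one_loss(g, g_hat):
    # categorical outcome: I(G != G_hat(X))
    return (g != g_hat).astype(float)

def deviance_loss(g, p_hat):
    # -2 * log of the probability assigned to the true class
    # g: integer labels in {0, ..., K-1}; p_hat: (n, K) class-probability matrix
    return -2.0 * np.log(p_hat[np.arange(len(g)), g])

# toy check
y, y_hat = np.array([1.0, 2.0]), np.array([0.5, 2.5])
print(squared_error_loss(y, y_hat).mean())            # 0.25

g = np.array([0, 1])
p_hat = np.array([[0.9, 0.1], [0.2, 0.8]])
print(zero_one_loss(g, p_hat.argmax(axis=1)).mean())  # 0.0
print(deviance_loss(g, p_hat).mean())                 # about 0.33
```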

Test error. The log-likelihood can be used as a loss function for general response densities, such as the Poisson, gamma, exponential, log-normal and others. If Pr_θ(X)(Y) is the density of Y, indexed by a parameter θ(X) that depends on the predictor X, then L(Y, θ(X)) = −2 log Pr_θ(X)(Y). The factor 2 makes the log-likelihood loss for the Gaussian distribution match squared-error loss.

Test error. Test error is the expected loss over an INDEPENDENT test set: Err = E[L(Y, f̂(X))], where the expectation is taken with respect to everything that is random, both the training set and the test set. In practice it is more feasible to estimate the test error given a particular training set T: Err_T = E[L(Y, f̂(X)) | T]. Training error is the average loss over just the training set: (1/N) Σᵢ L(yᵢ, f̂(xᵢ)).
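
A minimal sketch of the distinction (toy data and model; Err_T is approximated here by a large independent test sample):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)

# one fixed training set T
x_tr = rng.uniform(0, 1, (50, 1))
y_tr = f(x_tr).ravel() + rng.normal(0, 0.3, 50)
model = KNeighborsRegressor(n_neighbors=3).fit(x_tr, y_tr)

# training error: average squared loss over T itself
train_err = np.mean((y_tr - model.predict(x_tr)) ** 2)

# Err_T: expected loss over independent (X, Y), approximated by a large test sample
x_te = rng.uniform(0, 1, (100_000, 1))
y_te = f(x_te).ravel() + rng.normal(0, 0.3, 100_000)
test_err = np.mean((y_te - model.predict(x_te)) ** 2)

print(f"training error {train_err:.3f}  <  test error {test_err:.3f}")
```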

Test error for a categorical outcome: Err_T = E[L(G, Ĝ(X)) | T], the expected 0–1 (or deviance) loss over an independent test sample given the training set. Training error: (1/N) Σᵢ I(gᵢ ≠ Ĝ(xᵢ)), or with the log-likelihood loss, −(2/N) Σᵢ log p̂_{gᵢ}(xᵢ).

Goals in model building. (1) Model selection: estimate the performance of different models and choose the best one. (2) Model assessment: estimate the prediction error of the chosen model on new data.

Goals in model building. Ideally, we would have enough data to divide into three sets. Training set: to fit the models. Validation set: to estimate the prediction error of the models, for the purpose of model selection. Test set: to assess the generalization error of the final chosen model. A typical split is roughly 50% training, 25% validation and 25% test.
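
A minimal sketch of such a split, applying scikit-learn's train_test_split twice (the 50/25/25 fractions follow the typical split mentioned above; the data are synthetic):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.randn(1000, 20), np.random.randn(1000)

# first split off 50% for training, then split the remainder in half
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 500 250 250
```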

Goals in model building. What is the difference between the validation set and the test set? The validation set is used repeatedly across all candidate models, so model selection can chase the randomness in this set; since our choice of model is based on it, there is, in a sense, over-fitting to this set, and its error rate under-estimates the true error. The test set should be protected and used only once, to obtain an unbiased error rate.

Goals in model building. In reality there is rarely enough data. How do people deal with this? Eliminate the separate validation set and draw the validation role from the training set instead, approximating the generalization-error estimate needed for model selection analytically or by resampling (AIC, BIC, cross-validation, ...). Sometimes even the test set and the final estimate of prediction error are omitted; the result is published and testing is left to later studies.
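
A minimal sketch of drawing the validation role from the training data via K-fold cross-validation, here to choose k for a k-NN classifier (toy data; the candidate grid is illustrative):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))
y = (X[:, 0] > 0.5).astype(int)

cv_err = {}
for k in [1, 3, 5, 11, 25]:
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    cv_err[k] = 1 - acc.mean()          # 5-fold cross-validated error rate

best_k = min(cv_err, key=cv_err.get)
print(cv_err, "-> chosen k =", best_k)
```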

Bias-variance trade-off. In the continuous-outcome case, assume Y = f(X) + ε with E(ε) = 0 and Var(ε) = σ². The expected prediction error of a fit f̂ at a point X = x₀ is then Err(x₀) = E[(Y − f̂(x₀))² | X = x₀] = σ² + [E f̂(x₀) − f(x₀)]² + E[f̂(x₀) − E f̂(x₀)]², i.e. irreducible error + squared bias + variance.
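
A minimal simulation sketch of this decomposition (the target function, noise level, k-NN smoother and sample sizes below are illustrative choices, not taken from the slides): refit the model on many simulated training sets and estimate the squared bias and variance at a single point x₀.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
sigma = 0.3
x0 = np.array([[0.5]])

preds = []
for _ in range(500):                         # 500 independent training sets
    x = rng.uniform(0, 1, (50, 1))
    y = f(x).ravel() + rng.normal(0, sigma, 50)
    m = KNeighborsRegressor(n_neighbors=5).fit(x, y)
    preds.append(m.predict(x0)[0])

preds = np.array(preds)
bias2 = (preds.mean() - f(x0).item()) ** 2   # squared bias at x0
var = preds.var()                            # variance of the fit at x0
print(f"sigma^2={sigma**2:.3f}, bias^2={bias2:.4f}, variance={var:.4f}, "
      f"sum={sigma**2 + bias2 + var:.4f}")
```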

Bias-variance trade-off. Kernel smoother example (figure): the green curve is the truth; the red curves are estimates based on different random samples.

Bias-variance trade-off. For the k-nearest-neighbor fit, Err(x₀) = σ² + [f(x₀) − (1/k) Σ_{l=1}^{k} f(x_(l))]² + σ²/k, where x_(1), …, x_(k) are the k nearest neighbors of x₀. The higher k is, the lower the model complexity: the estimate becomes more global and the space is partitioned into larger patches. Increasing k decreases the variance term σ²/k and increases the bias term. (Here the x's are assumed fixed; the randomness is only in y.)

Bias-variance trade-off. For a linear model with p coefficients fit by least squares, f̂_p(x₀) = x₀ᵀβ̂ = h(x₀)ᵀy with h(x₀) = X(XᵀX)⁻¹x₀, so Err(x₀) = σ² + [f(x₀) − E f̂_p(x₀)]² + ‖h(x₀)‖²σ². The average of ‖h(x₀)‖² over the sample values xᵢ is p/N, so model complexity is directly associated with p.
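
The p/N statement follows from a short trace argument; a sketch in standard least-squares notation (assuming the N×p design matrix X has full rank):

```latex
% Since h(x_0) = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}x_0,
% we have \|h(x_0)\|^2 = x_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}x_0, hence
\frac{1}{N}\sum_{i=1}^{N}\|h(x_i)\|^{2}
  = \frac{1}{N}\sum_{i=1}^{N} x_i^\top(\mathbf{X}^\top\mathbf{X})^{-1}x_i
  = \frac{1}{N}\operatorname{tr}\!\bigl[\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\bigr]
  = \frac{p}{N},
% because the hat matrix \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top has trace p.
```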

Bias-variance trade-off. An example: 50 observations and 20 predictors, uniformly distributed in the hypercube [0, 1]²⁰; Y is 0 if X₁ ≤ 1/2 and 1 if X₁ > 1/2, and k-nearest neighbors is applied. (Figure: red, prediction error; green, squared bias; blue, variance.)

Bias-variance trade-off. The same example: 50 observations and 20 predictors, uniformly distributed in the hypercube [0, 1]²⁰. (Figure: red, prediction error; green, squared bias; blue, variance.)
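
A rough simulation in the spirit of this example (the fixed test point, the grid of k values and the number of replications are illustrative assumptions): because Y is a deterministic function of X₁ here, the prediction error at a point reduces to squared bias plus variance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x0 = np.full((1, 20), 0.5); x0[0, 0] = 0.75   # a test point with true value f(x0) = 1
f0 = 1.0

for k in [1, 3, 7, 15, 25]:
    preds = []
    for _ in range(200):                       # 200 simulated training sets of size 50
        X = rng.uniform(size=(50, 20))
        y = (X[:, 0] > 0.5).astype(float)      # deterministic Y: no irreducible error
        preds.append(KNeighborsRegressor(n_neighbors=k).fit(X, y).predict(x0)[0])
    preds = np.array(preds)
    print(f"k={k:2d}  bias^2={(preds.mean() - f0)**2:.3f}  "
          f"var={preds.var():.3f}  err={np.mean((preds - f0)**2):.3f}")
```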

Increase the model space: ensemble learning. Combining weak learners: bagging, boosting, random forest. Combining strong learners: stacking.
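
A quick illustration of the weak-learner branch (scikit-learn, with illustrative hyperparameters and a synthetic data set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# bagging of trees, random forest, and boosting, compared by 5-fold CV accuracy
for name, clf in [("bagging", BaggingClassifier(n_estimators=100, random_state=0)),
                  ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
                  ("boosting", AdaBoostClassifier(n_estimators=100, random_state=0))]:
    print(name, cross_val_score(clf, X, y, cv=5).mean().round(3))
```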

Strong learner ensembles ("stacking" and beyond): Current Bioinformatics, 5(4):296-308, 2010.

Stacking. Uses cross-validation to assess the individual performance of prediction algorithms, and combines the algorithms to produce an asymptotically optimal combination. For each candidate algorithm, predict each observation in a V-fold cross-validation; find a weight vector that minimizes the cross-validated loss of the weighted combination (e.g., non-negative weights summing to one); then combine the predictions from the individual algorithms using these weights. Stat in Med. 34:106-117.
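
A hedged sketch of this recipe for a regression problem (the candidate learners, the fold count and the non-negative least-squares weighting are illustrative choices made here, in the same spirit as the cited super-learner work):

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 5))
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.normal(size=300)

learners = [LinearRegression(), DecisionTreeRegressor(max_depth=4),
            KNeighborsRegressor(n_neighbors=7)]

# Step 1: V-fold cross-validated prediction of every observation, for each learner
Z = np.column_stack([cross_val_predict(m, X, y, cv=10) for m in learners])

# Step 2: weight vector minimizing squared error over Z, constrained non-negative
w, _ = nnls(Z, y)
w = w / w.sum()                      # normalize so the weights sum to one

# Step 3: refit each learner on all data and combine predictions with the weights
preds = np.column_stack([m.fit(X, y).predict(X) for m in learners])
y_stack = preds @ w
print("weights:", np.round(w, 3))
```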

Stacking: an example application. Lancet Respir Med. 3(1):42-52.