Published by Ἀρτεμίσιος Μαλαξός. Modified over 5 years ago.
1
Model generalization
Brief summary of methods
Bias, variance and complexity
Brief introduction to stacking
2
Brief summary
Shallow classifiers (linear machines): linear partition
Parametric nonlinear models: nonlinear partition
Kernel machines: flexible nonlinear partition
Deep neural networks: extract higher-order information
Annotations on the slide: "Need expert input to select representations"; "Do not generalize well far from the training examples"; "Capability to learn highly abstract representations".
3
Diagram: training data → model → training error rate; testing data → model → testing error rate.
Good performance on testing data, which are independent of the training data, is what matters most for a model. It serves as the basis for model selection.
4
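Test error
To evaluate prediction accuracy, loss functions are needed.
Continuous Y: squared error $L(Y, \hat f(X)) = (Y - \hat f(X))^2$ or absolute error $L(Y, \hat f(X)) = |Y - \hat f(X)|$.
In classification, categorical G (class label): 0-1 loss $L(G, \hat G(X)) = I(G \neq \hat G(X))$, or $-2 \times$ log-likelihood $L(G, \hat p(X)) = -2 \sum_{k=1}^{K} I(G = k) \log \hat p_k(X) = -2 \log \hat p_G(X)$,
where $p_k(X) = \Pr(G = k \mid X)$ and $\hat G(X) = \arg\max_k \hat p_k(X)$.
Except in rare cases (e.g. 1-nearest neighbor), the trained classifier always gives a probabilistic outcome.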
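The loss functions named on this slide can be sketched in a few lines. This is a minimal illustration of common choices (squared error, absolute error, 0-1 loss, and −2 × log-likelihood); the function names and the example probabilities are made up for the sketch.

```python
import math

def squared_error(y, f_hat):
    """Squared-error loss for a continuous response."""
    return (y - f_hat) ** 2

def absolute_error(y, f_hat):
    """Absolute-error loss for a continuous response."""
    return abs(y - f_hat)

def zero_one_loss(g, g_hat):
    """0-1 loss: 1 if the predicted class differs from the true class."""
    return 0 if g == g_hat else 1

def deviance_loss(g, p_hat):
    """-2 * log of the probability the classifier assigns to the true class g.

    p_hat is a dict mapping each class label to its estimated probability.
    """
    return -2.0 * math.log(p_hat[g])

# Hypothetical class probabilities from some trained classifier.
p_hat = {"a": 0.7, "b": 0.2, "c": 0.1}
# The hard class prediction is the class with the largest probability.
g_hat = max(p_hat, key=p_hat.get)
```

The deviance-style loss penalizes a confident wrong probability much more heavily than the 0-1 loss, which only looks at the hard prediction.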
5
Test error
The log-likelihood can be used as a loss function for general response densities, such as the Poisson, gamma, exponential, log-normal and others. If $\Pr_{\theta(X)}(Y)$ is the density of Y, indexed by a parameter $\theta(X)$ that depends on the predictor X, then $L(Y, \theta(X)) = -2 \log \Pr_{\theta(X)}(Y)$.
The factor $-2$ makes the log-likelihood loss for the Gaussian distribution match squared-error loss.
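As a check on the role of the factor 2, here is the Gaussian case worked out, assuming Y given X is normal with mean $\theta(X)$ and a known variance $\sigma^2$ (the symbol $\sigma^2$ is introduced here only for illustration):

```latex
\begin{aligned}
\Pr\nolimits_{\theta(X)}(Y) &= \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(Y-\theta(X))^2}{2\sigma^2}\right) \\
-2\log \Pr\nolimits_{\theta(X)}(Y) &= \frac{(Y-\theta(X))^2}{\sigma^2} + \log(2\pi\sigma^2)
\end{aligned}
```

With $\sigma^2 = 1$ this is the squared error plus a constant that does not depend on $\theta(X)$, so minimizing one is the same as minimizing the other.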
6
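Test error
Test error: the expected loss over an INDEPENDENT test set, $\mathrm{Err} = E[L(Y, \hat f(X))]$. The expectation is taken with regard to everything that is random, both the training set and the test set.
In practice it is more feasible to estimate the testing error given a training set $\mathcal{T}$: $\mathrm{Err}_{\mathcal{T}} = E[L(Y, \hat f(X)) \mid \mathcal{T}]$.
Training error is the average loss over just the training set: $\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat f(x_i))$.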
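The gap between training error and test error for one fixed training set can be seen in a small simulation. The sketch below uses a 1-nearest-neighbor fit on made-up data (a sine curve plus Gaussian noise; the data-generating function, sample sizes, and seed are all assumptions for illustration) and approximates the test error with a large independent sample.

```python
import math
import random

def one_nn_predict(train, x0):
    """Return the response of the training point whose x is closest to x0."""
    return min(train, key=lambda xy: abs(xy[0] - x0))[1]

def average_loss(train, data):
    """Average squared-error loss of the 1-NN fit (built on `train`) over `data`."""
    return sum((y - one_nn_predict(train, x)) ** 2 for x, y in data) / len(data)

def draw(n, rng):
    """Hypothetical data source: y = sin(2*pi*x) + Gaussian noise."""
    sample = []
    for _ in range(n):
        x = rng.random()
        sample.append((x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)))
    return sample

rng = random.Random(0)
train = draw(30, rng)      # the training set
test = draw(2000, rng)     # large independent sample, stands in for the population

train_err = average_loss(train, train)  # each point is its own nearest neighbor
test_err = average_loss(train, test)    # approximates the error given this training set
```

Because every training point is its own nearest neighbor, the training error is zero here, while the test error is not: training error alone says little about generalization.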
7
Test error
8
Test error
9
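Test error
Test error for categorical outcome: $\mathrm{Err}_{\mathcal{T}} = E[L(G, \hat G(X)) \mid \mathcal{T}]$
Training error (for example, with the log-likelihood loss): $\overline{\mathrm{err}} = -\frac{2}{N} \sum_{i=1}^{N} \log \hat p_{g_i}(x_i)$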
10
Goals in model building
(1) Model selection: estimate the performance of different models in order to choose the best one.
(2) Model assessment: estimate the prediction error of the chosen model on new data.
11
Goals in model building
Ideally, we would like to have enough data to divide into three sets:
Training set: to fit the models
Validation set: to estimate the prediction error of the models, for the purpose of model selection
Test set: to assess the generalization error of the final model
A typical split: (shown as a figure on the slide)
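A three-way split can be sketched as follows. The 50% / 25% / 25% proportions are an assumption (a commonly quoted rule of thumb), not necessarily the figure on the slide, and `three_way_split` is a hypothetical helper name.

```python
import random

def three_way_split(items, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle items and cut them into training, validation, and test sets.

    `fractions` gives the training and validation shares; the test set
    receives whatever remains after those two are taken.
    """
    items = list(items)
    random.Random(seed).shuffle(items)   # shuffle so the split is random
    n = len(items)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = three_way_split(range(100))
```

The three sets are disjoint by construction, which is what keeps the test set independent of everything used for fitting and selection.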
12
Goals in model building
What is the difference between the validation set and the test set?
The validation set is used repeatedly, on all models, and our selection of the model is based on it. Model selection can therefore chase the randomness in this set: in a sense the chosen model is over-fitted to the validation set, and its validation error rate under-estimates the true error.
The test set should be protected and used only once, to obtain an unbiased error rate.
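The optimism of the validation error after selection can be illustrated with a deliberately extreme simulation: many "models" that guess at random (so each has a true error rate of 0.5), with the best one picked on the validation set. All sizes and seeds are arbitrary choices for the sketch.

```python
import random

rng = random.Random(1)
n_val, n_test, n_models = 100, 2000, 200

# Pure-noise labels: no classifier can truly beat a 0.5 error rate.
val_labels = [rng.randint(0, 1) for _ in range(n_val)]
test_labels = [rng.randint(0, 1) for _ in range(n_test)]

def error_rate(preds, labels):
    """Fraction of predictions that disagree with the labels."""
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)

results = []
for m in range(n_models):
    model_rng = random.Random(1000 + m)   # each "model" is a coin flipper
    val_preds = [model_rng.randint(0, 1) for _ in range(n_val)]
    test_preds = [model_rng.randint(0, 1) for _ in range(n_test)]
    results.append((error_rate(val_preds, val_labels),
                    error_rate(test_preds, test_labels)))

# Select the model with the smallest validation error, then look at its test error.
best_val_err, best_test_err = min(results)
```

The selected model looks better than chance on the validation set only because it was chosen for that; its error on the untouched test set stays near 0.5.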
13
Goals in model building
In reality, there is often not enough data. How do people deal with the issue?
Eliminate the separate validation set: draw the validation set from the training set, and try to achieve both generalization-error estimation and model selection (AIC, BIC, cross-validation, ...).
Sometimes even omit the test set and the final estimation of prediction error; publish the result and leave testing to later studies.
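Of the approaches listed, cross-validation is the easiest to sketch: the validation set is drawn from the training data in turn, fold by fold. The helper names below are hypothetical, the loss is squared error, and the folds are contiguous, so real data should be shuffled first.

```python
def k_fold_indices(n, n_folds):
    """Split range(n) into n_folds contiguous folds of nearly equal size."""
    folds, start = [], 0
    for j in range(n_folds):
        size = n // n_folds + (1 if j < n % n_folds else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_error(xs, ys, fit, n_folds=5):
    """Average squared error when each fold is predicted by a model fit on the rest.

    `fit(train_xs, train_ys)` must return a function mapping x to a prediction.
    """
    n = len(xs)
    total = 0.0
    for fold in k_fold_indices(n, n_folds):
        held_out = set(fold)
        train_xs = [xs[i] for i in range(n) if i not in held_out]
        train_ys = [ys[i] for i in range(n) if i not in held_out]
        predict = fit(train_xs, train_ys)
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in fold)
    return total / n

def fit_mean(train_xs, train_ys):
    """Toy model: always predict the mean of the training responses."""
    mean = sum(train_ys) / len(train_ys)
    return lambda x: mean

cv_err = cross_val_error(list(range(10)), [2.0] * 10, fit_mean)
```

Running `cross_val_error` for each candidate model and keeping the one with the smallest value is the model-selection use; the same number also serves as a rough estimate of generalization error.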
14
Bias-variance trade-off
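In the continuous outcome case, assume $Y = f(X) + \varepsilon$ with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$.
The expected prediction error in regression is: $\mathrm{Err}(x_0) = E[(Y - \hat f(x_0))^2 \mid X = x_0] = \sigma_\varepsilon^2 + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2$, that is, irreducible error + squared bias + variance.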
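The decomposition into irreducible error, squared bias, and variance follows from splitting the prediction error into three pieces. The derivation below assumes $Y = f(X) + \varepsilon$ with $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$, and that the noise in the new observation is independent of the training set used to build $\hat f$:

```latex
\begin{aligned}
Y - \hat f(x_0) &= \varepsilon
  + \bigl(f(x_0) - E\hat f(x_0)\bigr)
  + \bigl(E\hat f(x_0) - \hat f(x_0)\bigr) \\
E\bigl[(Y - \hat f(x_0))^2 \mid X = x_0\bigr]
  &= \sigma_\varepsilon^2
  + \bigl(E\hat f(x_0) - f(x_0)\bigr)^2
  + E\bigl[(\hat f(x_0) - E\hat f(x_0))^2\bigr]
\end{aligned}
```

The cross terms vanish on taking expectations: $\varepsilon$ has mean zero and is independent of $\hat f(x_0)$, the middle piece is a constant, and the last piece has mean zero by construction.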
15
Bias-variance trade-off
16
Bias-variance trade-off
Kernel smoother. Green curve: truth. Red curves: estimates based on random samples.
17
Bias-variance trade-off
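K-nearest neighbor classifier: with squared-error loss, $\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \bigl[f(x_0) - \frac{1}{k} \sum_{\ell=1}^{k} f(x_{(\ell)})\bigr]^2 + \sigma_\varepsilon^2 / k$, where the middle term is the squared bias and the last term is the variance. (Here the x's are assumed to be fixed; the randomness is only in y.)
The higher the k, the lower the model complexity (estimation becomes more global, and the space is partitioned into larger patches).
As k increases, the variance term decreases and the bias term increases.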
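The opposite movement of the two terms as k grows can be checked numerically. This sketch assumes squared-error loss, fixed design points, a known true function (a sine curve, chosen only for illustration), and that the variance of a k-neighbor average is the noise variance divided by k.

```python
import math

def knn_bias_variance(xs, f, x0, k, sigma2):
    """Squared bias and variance of a k-nearest-neighbor average at x0.

    xs are fixed design points, f is the (normally unknown) true function,
    and sigma2 is the noise variance of the responses.
    """
    neighbors = sorted(xs, key=lambda x: abs(x - x0))[:k]
    mean_f = sum(f(x) for x in neighbors) / k
    bias2 = (f(x0) - mean_f) ** 2       # averaging over neighbors blurs f
    variance = sigma2 / k               # averaging k noisy responses
    return bias2, variance

xs = [i / 20 for i in range(21)]                 # grid on [0, 1]
true_f = lambda x: math.sin(2 * math.pi * x)     # hypothetical truth

b1, v1 = knn_bias_variance(xs, true_f, 0.25, 1, sigma2=1.0)
b5, v5 = knn_bias_variance(xs, true_f, 0.25, 5, sigma2=1.0)
b15, v15 = knn_bias_variance(xs, true_f, 0.25, 15, sigma2=1.0)
```

At the peak of the sine curve, widening the neighborhood pulls the average below the true value, so the squared bias rises while the variance falls.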
18
Bias-variance trade-off
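For a linear model with p coefficients, $\hat f_p(x) = x^T \hat\beta$: $\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + [f(x_0) - E\hat f_p(x_0)]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2$, where $h(x_0) = X (X^T X)^{-1} x_0$.
The average of $\|h(x_i)\|^2$ over the sample values $x_i$ is p/N, so the in-sample error is $\frac{1}{N} \sum_{i=1}^{N} \mathrm{Err}(x_i) = \sigma_\varepsilon^2 + \frac{1}{N} \sum_{i=1}^{N} [f(x_i) - E\hat f(x_i)]^2 + \frac{p}{N} \sigma_\varepsilon^2$.
Model complexity is directly associated with p.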
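The p/N claim can be verified numerically. The sketch below assumes least-squares fitting with a full-rank design matrix, so that the vector of weights for predicting at a sample point $x_i$ is $h(x_i) = X (X^T X)^{-1} x_i$; the dimensions and random design are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 5
X = rng.normal(size=(N, p))          # arbitrary full-rank design matrix

# Column i of H is h(x_i) = X (X^T X)^{-1} x_i, so H is the hat matrix.
H = X @ np.linalg.solve(X.T @ X, X.T)

# Squared norm of each h(x_i), averaged over the N sample points.
avg_sq_norm = np.mean(np.sum(H ** 2, axis=0))
```

The sum of the squared norms equals the trace of the hat matrix, which is p for a full-rank design, so the average is p/N regardless of the particular X drawn.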
19
Bias-variance trade-off
20
Bias-variance trade-off
An example. 50 observations, 20 predictors, uniformly distributed in the hypercube $[0, 1]^{20}$. Y is 0 if $X_1 \le 1/2$ and 1 if $X_1 > 1/2$; k-nearest neighbors is applied.
Red: prediction error. Green: squared bias. Blue: variance.
21
Bias-variance trade-off
An example. 50 observations, 20 predictors, uniformly distributed in the hypercube $[0, 1]^{20}$.
Red: prediction error. Green: squared bias. Blue: variance.
22
Stacking
Increase the model space: ensemble learning
Combining weak learners: bagging, boosting, random forest
Combining strong learners: stacking
23
Strong learner ensembles (“Stacking” and beyond):
Current Bioinformatics, 5, (4): , 2010.
24
Stacking
Uses cross-validation to assess the individual performance of prediction algorithms.
Combines algorithms to produce an asymptotically optimal combination:
For each predictor (algorithm), predict each observation in a V-fold cross-validation.
Find a weight vector over the algorithms from these cross-validated predictions.
Combine the predictions from the individual algorithms using the weights.
Stat in Med. 34:106–117
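The stacking steps above can be sketched as follows. This is a simplified illustration, not the cited paper's procedure: the weights are found by unconstrained least squares on the cross-validated predictions (published variants often constrain them to be non-negative or to sum to one), and the two base learners, the data, and all names are made up.

```python
import numpy as np

def fit_mean(X, y):
    """Base learner 1: predict the training mean everywhere."""
    m = y.mean()
    return lambda Xnew: np.full(len(Xnew), m)

def fit_linear(X, y):
    """Base learner 2: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta

def stack(X, y, learners, n_folds=5, seed=0):
    """Return stacking weights and a combined predictor."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    Z = np.zeros((n, len(learners)))     # cross-validated predictions
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        for k, fit in enumerate(learners):
            # Predict the held-out fold with a model fit on the other folds.
            Z[held_out, k] = fit(X[train], y[train])(X[held_out])
    # Weight vector: least-squares regression of y on the CV predictions.
    weights, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fitted = [fit(X, y) for fit in learners]   # refit each learner on all data
    def predict(Xnew):
        return np.column_stack([f(Xnew) for f in fitted]) @ weights
    return weights, predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
weights, predict = stack(X, y, [fit_mean, fit_linear])
```

Using cross-validated rather than in-sample predictions to find the weights keeps a flexible learner from earning a large weight simply by over-fitting the training data.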
25
Stacking Lancet Respir Med. 3(1):42-52