1
Case Selection and Resampling Lucila Ohno-Machado HST951
2
Topics
Case selection (influence detection)
Regression diagnostics
Sampling procedures
– Bootstrap
– Jackknife
– Cross-validation
3
Unusual Data
Outlier (discrepancy): an unusual observation whose outcome value may change the estimated parameters
Leverage: an observation far from the mean (centroid) of the other observations, i.e., an unusual combination of independent-variable values X that may change the parameters
Influence = discrepancy × leverage
4
Detecting Outliers: Residuals
A measure of error
Studentized residuals are calculated by removing one observation at a time
Note: high-leverage observations may have small residuals
5
Assessing Leverage
Hat values measure the distance of an observation from the mean (centroid) of all observations
The dependent variable is not involved in determining leverage
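For simple linear regression the hat value of observation i has a closed form, h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)². A minimal sketch (the data are made up for illustration):

```python
# Hat values (leverage) for simple linear regression:
# h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
# Only the independent variable X is used; Y plays no role.

def hat_values(x):
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

x = [1.0, 2.0, 3.0, 4.0, 10.0]  # hypothetical data; 10.0 is far from the centroid
h = hat_values(x)
# The hat values sum to the number of parameters (2: intercept + slope),
# and the extreme point carries most of the leverage.
```

Note that the hat values always sum to the number of fitted parameters, a useful sanity check.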
6
Measuring Influence
Impact on coefficients of deleting an observation
– DFBETA
– Cook's D
– DFFITS
Impact on standard errors
– COVRATIO
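The deletion idea behind DFBETA can be illustrated by refitting a simple regression with each observation left out and recording the change in the slope. A sketch with made-up data:

```python
# DFBETA sketch: change in the fitted slope when observation i is deleted.
# Uses ordinary least squares for simple regression (slope = Sxy / Sxx).

def slope(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

def dfbeta_slope(x, y):
    b_full = slope(x, y)
    out = []
    for i in range(len(x)):
        x_i = x[:i] + x[i + 1:]   # leave observation i out
        y_i = y[:i] + y[i + 1:]
        out.append(b_full - slope(x_i, y_i))
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
y = [1.1, 2.0, 2.9, 4.2, 12.0]  # the last point is an influential outlier
d = dfbeta_slope(x, y)
# Deleting the outlier changes the slope far more than deleting any other case.
```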
7
Case selection
Not all cases are created equal
Some influential cases are good; some are bad ("outliers")
Some non-influential cases are redundant
It would be nice to keep a minimal set of good cases in the training set for fast on-line training
8
Classical Diagnostics
Unicase selection: remove one observation at a time and inspect the results
Unicase influence on
– Estimated parameters (coefficients)
– Fitted values (Y-hat)
– Residuals (error)
9
When outcomes are binary
Residuals may reflect calibration rather than discriminatory performance
Remember that a model with good discriminatory performance can be recalibrated
The same rationale applies to the coefficients
10
Influence
The definition of influence is not fixed
If the main reason for building models is prediction,
then evaluating model performance on different subsets of the original sample can point to good, redundant, and bad cases
11
Qualifying a case
Bad cases, when removed, should result in models with better predictions
Redundant cases, when removed, should not affect predictions
Good cases, when removed, should result in models with worse predictions
12
Defining prediction performance
Use, for example, the area under the ROC curve (or mean squared error, or cross-entropy error)
For each set of samples:
– Evaluate performance on training and holdout sets
– Determine which cases to remove
Determine performance on the test (validation) set
13
Sequential Multicase Selection
Sequential procedure:
– remove the most influential case
– remove the second-most influential case (conditioned on the first)
– and so on…
Exhaustive evaluation is infeasible: there are Σ C(n,i) candidate subsets for i = 1 to m, where C(n,i) is the number of subsets of size i that can be built from n cases
Problem: cases are not considered en bloc
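The greedy sequential procedure can be sketched directly. The performance function here is a toy stand-in (it rewards a tight cluster of retained values); in practice it would be model performance on a holdout set:

```python
# Greedy sequential case removal: at each step, delete the single case whose
# removal most improves a performance score. NOTE: perf() is a toy stand-in.

def perf(cases):
    # Toy score: negative variance (rewards homogeneous retained cases).
    n = len(cases)
    m = sum(cases) / n
    return -sum((c - m) ** 2 for c in cases) / n

def sequential_removal(cases, k):
    removed = []
    retained = list(cases)
    for _ in range(k):
        # Try deleting each remaining case; keep the best single deletion,
        # conditioned on the deletions already made.
        best = max(range(len(retained)),
                   key=lambda i: perf(retained[:i] + retained[i + 1:]))
        removed.append(retained.pop(best))
    return removed, retained

removed, retained = sequential_removal([5.0, 5.1, 4.9, 5.2, 40.0, -30.0], k=2)
# The two extreme cases are removed first, one at a time.
```

This illustrates the stated problem: each deletion is chosen one at a time, so cases are never considered en bloc.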
14
Alternatives
Multicase selection that is not sequential, yet not exhaustive (e.g., genetic algorithm search)
Analogous to variable selection
15
Genetic Algorithm
Given a training set C and a selection of cases v, we construct a logistic regression model l_C(v)
We evaluate the model using the AUC, and write this evaluation as a(l_C(v))
For a total of n cases, with m cases in the selection v, we use the following fitness function:
f(v, C) = a(l_C(v)) + (n - m) / n
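The fitness function can be written directly once the AUC is available; here the AUC is computed from ranks (the Mann-Whitney form), and the scores are stand-ins for the predictions of the fitted logistic regression l_C(v):

```python
# Fitness for a case selection v: f(v, C) = AUC + (n - m) / n.
# auc() is the Mann-Whitney form of the area under the ROC curve; the scores
# below are hypothetical stand-ins for logistic regression predictions.

def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fitness(scores, labels, n_total, m_selected):
    # Rewards discrimination (AUC) and parsimony ((n - m) / n) jointly.
    return auc(scores, labels) + (n_total - m_selected) / n_total

scores = [0.9, 0.8, 0.3, 0.2]   # hypothetical model outputs on a holdout set
labels = [1, 1, 0, 0]           # perfect ranking here, so AUC = 1.0
f = fitness(scores, labels, n_total=100, m_selected=40)
# f = 1.0 + 60/100 = 1.6
```

The parsimony term (n - m)/n pushes the search toward smaller selections with equal AUC.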
20
Resampling
21
Bootstrap Motivation
Sometimes it is not possible to collect many samples from a population
Sometimes it is not correct to assume a certain distribution for the population
Goal: assess sampling variation
22
Bootstrap
Efron (Stanford biostatistics), late 1970s
– "Pulling oneself up by one's bootstraps"
A nonparametric approach to statistical inference
Uses computation instead of traditional distributional assumptions and asymptotic results
Can be used to derive standard errors, confidence intervals, and hypothesis tests
23
Example
Adapted from Fox (1997), "Applied Regression Analysis"
Goal: estimate the mean difference between Male and Female on finding X
Four pairs of observations are available:
24
Observ.  Male  Female  Differ.
1         24     18       6
2         14     17      -3
3         40     35       5
4         44     41       3
25
Mean Difference
The sample mean difference is Ȳ = (6 - 3 + 5 + 3)/4 = 2.75
If Y were normally distributed, we could write a 95% CI for the population mean μ directly, but we do not know the population standard deviation
26
Estimates
The estimate of μ is Ȳ = 2.75
The estimate of the standard error is SE(Ȳ) = S/√n = 4.031/2 ≈ 2.015, where S² = Σ(Yi - Ȳ)²/(n - 1) = 16.25
Assuming the population is normally distributed, we can use the t-distribution to form the confidence interval
27
Confidence Interval
Ȳ ± t(.025) SE(Ȳ) = 2.75 ± 4.30 (2.015) = 2.75 ± 8.66
-5.91 < μ < 11.41
HUGE!!!
28
Sample mean and variance
Use the distribution Y* of the sample to estimate the distribution of Y in the population

y*    p*(y*)
 6     .25
-3     .25
 5     .25
 3     .25

E*(Y*) = Σ y* p*(y*) = 2.75
V*(Y*) = Σ [y* - E*(Y*)]² p*(y*) = 12.187
29
Sample with Replacement

Sample  Y1*  Y2*  Y3*  Y4*    Ȳ*
1        6    6    6    6   6.00
2        6    6    6   -3   3.75
3        6    6    6    5   5.75
…
100     -3    5    6    3   2.75
101     -3    5   -3    6   1.25
…
255      3    3    3    5   3.50
256      3    3    3    3   3.00
30
Calculating the CI
The mean of the 256 bootstrap means is 2.75
The bootstrap standard error is SE*(Ȳ*) = √(V*(Y*)/n) = √(12.187/4) ≈ 1.75 (no hat, since this SE is not estimated but known: all 256 bootstrap samples were enumerated)
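Because n = 4, all 4⁴ = 256 bootstrap samples can be enumerated, reproducing the numbers above:

```python
# Enumerate all 4^4 = 256 bootstrap samples of the four differences.
from itertools import product

diffs = [6, -3, 5, 3]
means = [sum(s) / 4 for s in product(diffs, repeat=4)]

boot_mean = sum(means) / len(means)                        # 2.75
boot_var = sum((m - boot_mean) ** 2 for m in means) / len(means)
boot_se = boot_var ** 0.5                                  # sqrt(12.1875 / 4) ~ 1.75
```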
31
So what? We already knew that!
But with the bootstrap:
– Confidence intervals can be more accurate
– It can be used for non-linear statistics without known standard-error formulas
32
The population is to the sample as the sample is to the bootstrap samples
In practice (as opposed to the previous example), not all possible bootstrap samples are enumerated; a fixed number are drawn at random
33
Procedure
1. Specify the data-collection scheme that results in the observed sample:
Collect(population) -> sample
2. Use the sample as if it were the population, sampling with replacement:
Collect(sample) -> bootstrap sample 1, bootstrap sample 2, etc.
34
Cont.
3. For each bootstrap sample, calculate the estimate you are looking for
4. Use the distribution of the bootstrap estimates to estimate the properties of the sample estimate
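When full enumeration is impractical, steps 2-4 are approximated by drawing a fixed number of random bootstrap samples; a sketch using only the standard library:

```python
# Random bootstrap of a statistic (here, the mean), following steps 2-4.
import random

def bootstrap(sample, statistic, n_boot=2000, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    # Steps 2-3: resample with replacement and compute the estimate each time.
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]

sample = [6, -3, 5, 3]
reps = bootstrap(sample, statistic=lambda s: sum(s) / len(s))
# Step 4: the spread of reps estimates the sampling variation of the mean.
center = sum(reps) / len(reps)
```

The center of the replicates should be close to the observed mean of 2.75, and their spread close to the enumerated SE of about 1.75.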
35
Bootstrap Confidence Intervals
Normal theory
Percentile intervals
Example: a 95% percentile interval is calculated by taking
– Lower bound = the 0.025 quantile of the bootstrap replicates
– Upper bound = the 0.975 quantile of the bootstrap replicates
There are corrections (e.g., bias corrections) for bootstrap intervals
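A percentile interval is read directly off the sorted bootstrap replicates; a sketch (the quantile indexing is kept deliberately simple):

```python
# 95% percentile interval: 0.025 and 0.975 quantiles of the bootstrap replicates.
import random

def percentile_ci(sample, statistic, alpha=0.05, n_boot=1999, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    reps = sorted(statistic(rng.choices(sample, k=n)) for _ in range(n_boot))
    lo = reps[int(alpha / 2 * n_boot)]          # ~ 0.025 quantile
    hi = reps[int((1 - alpha / 2) * n_boot)]    # ~ 0.975 quantile
    return lo, hi

lo, hi = percentile_ci([6, -3, 5, 3], lambda s: sum(s) / len(s))
# The interval should bracket the observed mean of 2.75.
```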
36
Bootstrapping Linear Regression
The observed estimate is usually the coefficient(s)
There are (at least) 2 ways of doing this:
– Resample observations (the usual approach) and re-regress (X will vary)
– Resample residuals (X is fixed; Y* = Ŷ + e* is the new dependent variable; re-regress with X fixed)
Assumes errors are identically distributed
The impact of high-leverage outliers may be lost
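Resampling residuals keeps X fixed: each bootstrap response is the fitted value plus a randomly drawn residual. A sketch for simple regression with made-up data:

```python
# Bootstrap a regression slope by resampling residuals (X held fixed):
# Y* = Yhat + e*, then re-regress Y* on the same X.
import random

def fit_slope_intercept(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    return b, ybar - b * xbar

def residual_bootstrap_slopes(x, y, n_boot=500, seed=0):
    rng = random.Random(seed)
    b, a = fit_slope_intercept(x, y)
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        # Assumes identically distributed errors, as noted above.
        e_star = rng.choices(resid, k=len(x))
        y_star = [fi + ei for fi, ei in zip(fitted, e_star)]
        slopes.append(fit_slope_intercept(x, y_star)[0])
    return slopes

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
y = [1.2, 1.9, 3.2, 3.8, 5.1]
slopes = residual_bootstrap_slopes(x, y)
# The bootstrap slopes scatter around the observed slope (0.97 here).
```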
37
Bootstrap for other methods
Used with other classification methods (neural networks, classification trees, etc.)
Usually most useful when the sample size is small and no distributional assumptions can be made
The same principles apply
38
Other resampling methods
The jackknife (leave one out) is a special case of the bootstrap
– Resamples without one case and without replacement (samples have size n - 1)
Cross-validation
– Divides the data into training and test sets
Generally used to estimate confidence intervals on predictions from the "full" model (i.e., the model that used all cases)
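The jackknife computes the statistic on each leave-one-out sample of size n - 1; for the mean, its standard-error estimate matches the usual S/√n, so with the four differences above it reproduces SE ≈ 2.015:

```python
# Jackknife standard error: recompute the statistic on each
# leave-one-out sample of size n - 1.
def jackknife_se(sample, statistic):
    n = len(sample)
    loo = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    return ((n - 1) / n * sum((t - mean_loo) ** 2 for t in loo)) ** 0.5

se = jackknife_se([6, -3, 5, 3], lambda s: sum(s) / len(s))
# For the mean, this equals S / sqrt(n) ~ 2.015
```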