1
Case Selection and Resampling Lucila Ohno-Machado HST951
2
Topics
Case selection (influence detection)
Regression diagnostics
Sampling procedures
– Bootstrap
– Jackknife
– Cross-validation
3
Unusual Data
Outlier (discrepancy): an unusual observation whose outcome value may change the estimated parameters
Leverage: an observation far from the mean (centroid) of the other observations, i.e., an unusual combination of independent-variable values X that may change the parameters
Influence = discrepancy × leverage
4
Detecting Outliers: Residuals
A measure of error
Studentized residuals are calculated by removing one observation at a time
Note: high-leverage observations may have small residuals
5
Assessing Leverage
Hat values measure the distance of an observation from the mean (centroid) of all observations
The dependent variable is not involved in determining leverage
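For simple linear regression the hat value of observation i has a closed form, h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)². A minimal sketch (the data are made up for illustration):

```python
# Hat values (leverage) for simple linear regression:
# h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
# Only the independent variable X is used; Y plays no role.

def hat_values(x):
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

x = [1.0, 2.0, 3.0, 4.0, 10.0]  # hypothetical data; 10.0 is far from the centroid
h = hat_values(x)
# The hat values sum to the number of parameters (2: intercept + slope),
# and the extreme point carries most of the leverage.
```

Note that the hat values always sum to the number of fitted parameters, a useful sanity check.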
6
Measuring Influence
Impact on coefficients of deleting an observation
– DFBETA
– Cook's D
– DFFITS
Impact on standard errors
– COVRATIO
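The deletion idea behind DFBETA can be illustrated by refitting a simple regression with each observation left out and recording the change in the slope. A sketch with made-up data:

```python
# DFBETA sketch: change in the fitted slope when observation i is deleted.
# Uses ordinary least squares for simple regression (slope = Sxy / Sxx).

def slope(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

def dfbeta_slope(x, y):
    b_full = slope(x, y)
    out = []
    for i in range(len(x)):
        x_i = x[:i] + x[i + 1:]   # leave observation i out
        y_i = y[:i] + y[i + 1:]
        out.append(b_full - slope(x_i, y_i))
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
y = [1.1, 2.0, 2.9, 4.2, 12.0]  # the last point is an influential outlier
d = dfbeta_slope(x, y)
# Deleting the outlier changes the slope far more than deleting any other case.
```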
7
Case selection
Not all cases are created equal
Some influential cases are good; some are bad ("outliers")
Some non-influential cases are redundant
It would be nice to keep a minimal set of good cases in the training set for fast on-line training
8
Classical Diagnostics
Unicase selection: remove one observation at a time and inspect the results
Unicase influence on
– Estimated parameters (coefficients)
– Fitted values (Y-hat)
– Residuals (error)
9
When outcomes are binary
Residuals may reflect calibration rather than discriminatory performance
Remember that a model with good discriminatory performance can be recalibrated
The same rationale applies to the coefficients
10
Influence
The definition of influence is not fixed
If the main reason for building models is prediction,
then evaluating model performance on different subsets of the original sample can point to good, redundant, and bad cases
11
Qualifying a case
Bad cases, when removed, should result in models with better predictions
Redundant cases, when removed, should not affect predictions
Good cases, when removed, should result in models with worse predictions
12
Defining prediction performance
Use, for example, the area under the ROC curve (or mean squared error, or cross-entropy error)
For each set of samples:
– Evaluate performance on training and holdout sets
– Determine which cases to remove
Determine performance on the test (validation) set
13
Sequential Multicase Selection
Sequential procedure:
– remove the most influential case
– remove the second-most influential case (conditioned on the first)
– and so on…
Exhaustive evaluation is infeasible: there are Σ C(n,i) candidate subsets for i = 1 to m, where C(n,i) is the number of subsets of size i that can be built from n cases
Problem: cases are not considered en bloc
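The greedy sequential procedure can be sketched directly. The performance function here is a toy stand-in (it rewards a tight cluster of retained values); in practice it would be model performance on a holdout set:

```python
# Greedy sequential case removal: at each step, delete the single case whose
# removal most improves a performance score. NOTE: perf() is a toy stand-in.

def perf(cases):
    # Toy score: negative variance (rewards homogeneous retained cases).
    n = len(cases)
    m = sum(cases) / n
    return -sum((c - m) ** 2 for c in cases) / n

def sequential_removal(cases, k):
    removed = []
    retained = list(cases)
    for _ in range(k):
        # Try deleting each remaining case; keep the best single deletion,
        # conditioned on the deletions already made.
        best = max(range(len(retained)),
                   key=lambda i: perf(retained[:i] + retained[i + 1:]))
        removed.append(retained.pop(best))
    return removed, retained

removed, retained = sequential_removal([5.0, 5.1, 4.9, 5.2, 40.0, -30.0], k=2)
# The two extreme cases are removed first, one at a time.
```

This illustrates the stated problem: each deletion is chosen one at a time, so cases are never considered en bloc.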
14
Alternatives
Multicase selection that is not sequential, yet not exhaustive (e.g., genetic algorithm search)
Analogous to variable selection
15
Genetic Algorithm
Given a training set C and a selection of cases v, we construct a logistic regression model l_C(v)
We evaluate the model using the AUC, and write this evaluation as a(l_C(v))
For a total of n cases, with m cases in the selection v, we use the following fitness function:
f(v, C) = a(l_C(v)) + (n - m) / n
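The fitness function can be written directly once the AUC is available; here the AUC is computed from ranks (the Mann-Whitney form), and the scores are stand-ins for the predictions of the fitted logistic regression l_C(v):

```python
# Fitness for a case selection v: f(v, C) = AUC + (n - m) / n.
# auc() is the Mann-Whitney form of the area under the ROC curve; the scores
# below are hypothetical stand-ins for logistic regression predictions.

def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fitness(scores, labels, n_total, m_selected):
    # Rewards discrimination (AUC) and parsimony ((n - m) / n) jointly.
    return auc(scores, labels) + (n_total - m_selected) / n_total

scores = [0.9, 0.8, 0.3, 0.2]   # hypothetical model outputs on a holdout set
labels = [1, 1, 0, 0]           # perfect ranking here, so AUC = 1.0
f = fitness(scores, labels, n_total=100, m_selected=40)
# f = 1.0 + 60/100 = 1.6
```

The parsimony term (n - m)/n pushes the search toward smaller selections with equal AUC.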
20
Resampling
21
Bootstrap Motivation
Sometimes it is not possible to collect many samples from a population
Sometimes it is not correct to assume a certain distribution for the population
Goal: assess sampling variation
22
Bootstrap
Efron (Stanford biostatistics), late 1970s
– "Pulling oneself up by one's bootstraps"
A nonparametric approach to statistical inference
Uses computation instead of traditional distributional assumptions and asymptotic results
Can be used to derive standard errors, confidence intervals, and hypothesis tests
23
Example
Adapted from Fox (1997), "Applied Regression Analysis"
Goal: estimate the mean difference between Male and Female on finding X
Four pairs of observations are available:
24
Observ.  Male  Female  Differ.
1         24     18       6
2         14     17      -3
3         40     35       5
4         44     41       3
25
Mean Difference
The sample mean difference is Ȳ = (6 - 3 + 5 + 3)/4 = 2.75
If Y were normally distributed, we could write a 95% CI for the population mean μ directly, but we do not know the population standard deviation
26
Estimates
The estimate of μ is Ȳ = 2.75
The estimate of the standard error is SE(Ȳ) = S/√n = 4.031/2 ≈ 2.015, where S² = Σ(Yi - Ȳ)²/(n - 1) = 16.25
Assuming the population is normally distributed, we can use the t-distribution to form the confidence interval
27
Confidence Interval
Ȳ ± t(.025) SE(Ȳ) = 2.75 ± 4.30 (2.015) = 2.75 ± 8.66
-5.91 < μ < 11.41
HUGE!!!
28
Sample mean and variance
Use the distribution Y* of the sample to estimate the distribution of Y in the population

y*    p*(y*)
 6     .25
-3     .25
 5     .25
 3     .25

E*(Y*) = Σ y* p*(y*) = 2.75
V*(Y*) = Σ [y* - E*(Y*)]² p*(y*) = 12.187
29
Sample with Replacement

Sample  Y1*  Y2*  Y3*  Y4*    Ȳ*
1        6    6    6    6   6.00
2        6    6    6   -3   3.75
3        6    6    6    5   5.75
…
100     -3    5    6    3   2.75
101     -3    5   -3    6   1.25
…
255      3    3    3    5   3.50
256      3    3    3    3   3.00
30
Calculating the CI
The mean of the 256 bootstrap means is 2.75
The bootstrap standard error is SE*(Ȳ*) = √(V*(Y*)/n) = √(12.187/4) ≈ 1.75 (no hat, since this SE is not estimated but known: all 256 bootstrap samples were enumerated)
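Because n = 4, all 4⁴ = 256 bootstrap samples can be enumerated, reproducing the numbers above:

```python
# Enumerate all 4^4 = 256 bootstrap samples of the four differences.
from itertools import product

diffs = [6, -3, 5, 3]
means = [sum(s) / 4 for s in product(diffs, repeat=4)]

boot_mean = sum(means) / len(means)                        # 2.75
boot_var = sum((m - boot_mean) ** 2 for m in means) / len(means)
boot_se = boot_var ** 0.5                                  # sqrt(12.1875 / 4) ~ 1.75
```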
31
So what? We already knew that!
But with the bootstrap:
– Confidence intervals can be more accurate
– It can be used for non-linear statistics without known standard-error formulas
32
The population is to the sample as the sample is to the bootstrap samples
In practice (as opposed to the previous example), not all possible bootstrap samples are enumerated; a fixed number are drawn at random
33
Procedure
1. Specify the data-collection scheme that results in the observed sample:
Collect(population) -> sample
2. Use the sample as if it were the population, sampling with replacement:
Collect(sample) -> bootstrap sample 1, bootstrap sample 2, etc.
34
Cont.
3. For each bootstrap sample, calculate the estimate you are looking for
4. Use the distribution of the bootstrap estimates to estimate the properties of the sample estimate
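When full enumeration is impractical, steps 2-4 are approximated by drawing a fixed number of random bootstrap samples; a sketch using only the standard library:

```python
# Random bootstrap of a statistic (here, the mean), following steps 2-4.
import random

def bootstrap(sample, statistic, n_boot=2000, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    # Steps 2-3: resample with replacement and compute the estimate each time.
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]

sample = [6, -3, 5, 3]
reps = bootstrap(sample, statistic=lambda s: sum(s) / len(s))
# Step 4: the spread of reps estimates the sampling variation of the mean.
center = sum(reps) / len(reps)
```

The center of the replicates should be close to the observed mean of 2.75, and their spread close to the enumerated SE of about 1.75.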
35
Bootstrap Confidence Intervals
Normal theory
Percentile intervals
Example: a 95% percentile interval is calculated by taking
– Lower bound = the 0.025 quantile of the bootstrap replicates
– Upper bound = the 0.975 quantile of the bootstrap replicates
There are corrections (e.g., bias corrections) for bootstrap intervals
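A percentile interval is read directly off the sorted bootstrap replicates; a sketch (the quantile indexing is kept deliberately simple):

```python
# 95% percentile interval: 0.025 and 0.975 quantiles of the bootstrap replicates.
import random

def percentile_ci(sample, statistic, alpha=0.05, n_boot=1999, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    reps = sorted(statistic(rng.choices(sample, k=n)) for _ in range(n_boot))
    lo = reps[int(alpha / 2 * n_boot)]          # ~ 0.025 quantile
    hi = reps[int((1 - alpha / 2) * n_boot)]    # ~ 0.975 quantile
    return lo, hi

lo, hi = percentile_ci([6, -3, 5, 3], lambda s: sum(s) / len(s))
# The interval should bracket the observed mean of 2.75.
```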
36
Bootstrapping Linear Regression
The observed estimate is usually the coefficient(s)
There are (at least) 2 ways of doing this:
– Resample observations (the usual approach) and re-regress (X will vary)
– Resample residuals (X is fixed; Y* = Ŷ + e* is the new dependent variable; re-regress with X fixed)
Assumes errors are identically distributed
The impact of high-leverage outliers may be lost
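Resampling residuals keeps X fixed: each bootstrap response is the fitted value plus a randomly drawn residual. A sketch for simple regression with made-up data:

```python
# Bootstrap a regression slope by resampling residuals (X held fixed):
# Y* = Yhat + e*, then re-regress Y* on the same X.
import random

def fit_slope_intercept(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    return b, ybar - b * xbar

def residual_bootstrap_slopes(x, y, n_boot=500, seed=0):
    rng = random.Random(seed)
    b, a = fit_slope_intercept(x, y)
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        # Assumes identically distributed errors, as noted above.
        e_star = rng.choices(resid, k=len(x))
        y_star = [fi + ei for fi, ei in zip(fitted, e_star)]
        slopes.append(fit_slope_intercept(x, y_star)[0])
    return slopes

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
y = [1.2, 1.9, 3.2, 3.8, 5.1]
slopes = residual_bootstrap_slopes(x, y)
# The bootstrap slopes scatter around the observed slope (0.97 here).
```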
37
Bootstrap for other methods
Used with other classification methods (neural networks, classification trees, etc.)
Usually most useful when the sample size is small and no distributional assumptions can be made
The same principles apply
38
Other resampling methods
The jackknife (leave one out) is a special case of the bootstrap
– Resamples without one case and without replacement (samples have size n - 1)
Cross-validation
– Divides the data into training and test sets
Generally used to estimate confidence intervals on predictions from the "full" model (i.e., the model that used all cases)
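The jackknife computes the statistic on each leave-one-out sample of size n - 1; for the mean, its standard-error estimate matches the usual S/√n, so with the four differences above it reproduces SE ≈ 2.015:

```python
# Jackknife standard error: recompute the statistic on each
# leave-one-out sample of size n - 1.
def jackknife_se(sample, statistic):
    n = len(sample)
    loo = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    return ((n - 1) / n * sum((t - mean_loo) ** 2 for t in loo)) ** 0.5

se = jackknife_se([6, -3, 5, 3], lambda s: sum(s) / len(s))
# For the mean, this equals S / sqrt(n) ~ 2.015
```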