Case Selection and Resampling
Lucila Ohno-Machado, HST951

Topics
– Case selection (influence detection)
– Regression diagnostics
– Sampling procedures: bootstrap, jackknife, cross-validation

Unusual Data
– Outlier: a discrepant, unusual observation that may change the parameters
– Leverage: an observation far from the mean (or centroid) of the other observations; an unusual combination of independent-variable values X that may change the parameters
– Influence = discrepancy × leverage

Detecting Outliers: Residuals
– A measure of error
– Studentized residuals can be calculated by removing one observation at a time
– Note: high-leverage observations may have small residuals

Assessing Leverage
– Hat values measure the distance of an observation from the mean (or centroid) of all observations
– The dependent variable is not involved in determining leverage; hat values depend only on the X values

Measuring Influence
– Impact on the coefficients of deleting an observation: DFBETA, Cook's D, DFFITS
– Impact on the standard errors: COVRATIO
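A minimal sketch of these diagnostics, assuming statsmodels and illustrative data (none of the names below come from the slides):

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data: 50 cases, 2 predictors plus an intercept.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(50, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)

infl = sm.OLS(y, X).fit().get_influence()

outliers = infl.resid_studentized_external  # studentized (deleted) residuals
leverage = infl.hat_matrix_diag             # hat values, based on X only
dfbetas = infl.dfbetas                      # per-coefficient influence
cooks_d, _ = infl.cooks_distance            # overall influence (Cook's D)
dffits, _ = infl.dffits                     # influence on the fitted value
cov_ratio = infl.cov_ratio                  # influence on the standard errors
```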

Case Selection
– Not all cases are created equal
– Some influential cases are good; some are bad ("outliers")
– Some non-influential cases are redundant
– It would be nice to keep a "minimal" set of good cases in training sets, for fast on-line training

Classical Diagnostics
– Unicase selection: remove one observation at a time and inspect the results
– Unicase influence on: estimated parameters (coefficients), fitted values (Y-hat), residuals (error)

When Outcomes Are Binary
– Residuals may reflect calibration rather than discriminatory performance
– Remember that a model with good discriminatory performance can be recalibrated
– The same rationale applies to the coefficients

Influence
– The definition of influence is not fixed
– If the main reason for building models is prediction, then evaluating model performance on different subsets of the original sample can point to good, redundant, and bad cases

Qualifying a Case
– Bad cases, when removed, should result in models with better predictions
– Redundant cases, when removed, should not affect predictions
– Good cases, if removed, would result in models with worse predictions

Defining Prediction Performance
– Use, for example, the area under the ROC curve (or mean squared error, or cross-entropy error)
– For each set of samples: evaluate performance on the training and holdout sets, and determine which cases to remove
– Determine performance on the test or validation sets
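A minimal sketch of this idea, assuming scikit-learn, NumPy arrays, and a logistic regression model (all names are illustrative): refit with each training case removed and compare holdout AUC against the full model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def loo_influence_on_auc(X_train, y_train, X_holdout, y_holdout):
    """Change in holdout AUC when each training case is removed in turn."""
    base = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    base_auc = roc_auc_score(y_holdout, base.predict_proba(X_holdout)[:, 1])
    deltas = []
    for i in range(len(y_train)):
        keep = np.arange(len(y_train)) != i
        m = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])
        auc = roc_auc_score(y_holdout, m.predict_proba(X_holdout)[:, 1])
        deltas.append(auc - base_auc)  # > 0: removal helped, a "bad" case
    return np.array(deltas)
```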

Sequential Multicase Selection
– Sequential procedure: remove the most influential case; remove the second-most influential case (conditioned on the first); and so on
– Evaluating all possible removals instead would require Σᵢ C(n,i) subsets, for i = 1 to m, where C(n,m) represents the number of subsets of size m that can be built from n cases
– Problem: cases are not considered en bloc

Alternatives
– Multicase selection that is not sequential, yet not exhaustive (e.g., genetic algorithm search)
– Analogous to variable selection

Genetic Algorithm
– Given a training set C and a selection of cases v, we construct a logistic regression model l_C(v)
– We evaluate the model using the AUC, and represent this evaluation as a(l_C(v))
– For a total number of cases n, with m cases in the selection v, we use the following fitness function: f(v, C) = a(l_C(v)) + λ(n − m)/n
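A minimal sketch of this fitness function, assuming scikit-learn (the weighting symbol before (n − m)/n did not survive transcription and is written here as lam; its value of 0.1 is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def fitness(v, X, y, X_eval, y_eval, lam=0.1):
    """f(v, C) = a(l_C(v)) + lam * (n - m) / n for a boolean case-selection mask v."""
    n, m = len(y), int(v.sum())
    model = LogisticRegression(max_iter=1000).fit(X[v], y[v])       # l_C(v)
    auc = roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])  # a(l_C(v))
    return auc + lam * (n - m) / n  # second term rewards smaller selections
```

A full genetic algorithm would wrap this fitness in selection, crossover, and mutation over candidate masks v.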

Resampling

Bootstrap Motivation
– Sometimes it is not possible to collect many samples from a population
– Sometimes it is not correct to assume a certain distribution for the population
– Goal: assess sampling variation

Bootstrap
– Efron (Stanford biostatistics), late 1970s: "pulling oneself up by one's bootstraps"
– Nonparametric approach to statistical inference
– Uses computation instead of traditional distributional assumptions and asymptotic results
– Can be used to derive standard errors, confidence intervals, and hypothesis tests

Example
– Adapted from Fox (1997), Applied Regression Analysis
– Goal: estimate the mean difference between males and females on finding X
– Four pairs of observations are available:

Observ.  Male  Female  Differ.
(the numeric entries of this four-row table did not survive transcription)

Mean Difference
– The sample mean of the four differences is 2.75
– If Y were normally distributed, the 95% CI would be Ȳ ± 1.96 σ/√n
– But we do not know σ

Estimates
– The estimate of μ is the sample mean, Ȳ = 2.75
– The estimate of its standard error is SE(Ȳ) = S/√n = 2.015
– Assuming the population is normally distributed, we can use the t-distribution, as (Ȳ − μ)/SE(Ȳ) ~ t(n − 1)

Confidence Interval
– μ = 2.75 ± 4.30 × (2.015) = 2.75 ± 8.66
– −5.91 < μ < 11.41
– HUGE!!!

Sample Mean and Variance
– Use the distribution Y* of the sample to estimate the distribution of Y in the population: each observed value y* gets probability p*(y*) = 1/4
– E*(Y*) = Σ y* p*(y*) = 2.75
– V*(Y*) = Σ [y* − E*(Y*)]² p*(y*) = 3.25

Sample with Replacement
(Table: all 4⁴ = 256 equally likely bootstrap samples, each row listing the resampled values Y₁*, Y₂*, Y₃*, Y₄* and the resulting bootstrap mean; the entries were lost in transcription)

Calculating the CI
– The mean of the 256 bootstrap means is 2.75, and their SE is √(V*(Y*)/4) = √(3.25/4) ≈ 0.90
– No hat on this SE: it is not estimated but known, since the full resampling distribution was enumerated
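A minimal sketch of this exhaustive enumeration; the four values below are hypothetical stand-ins, since the slide's data did not survive transcription:

```python
import itertools
import numpy as np

diffs = np.array([6.0, -3.0, 5.0, 3.0])  # hypothetical data, not the slide's
n = len(diffs)

# Enumerate all n**n = 256 equally likely bootstrap samples of size n.
boot_means = np.array([np.mean(s) for s in itertools.product(diffs, repeat=n)])

print(boot_means.mean())       # exactly the sample mean
print(boot_means.std(ddof=0))  # exact bootstrap SE of the mean: sqrt(V*(Y*)/n)
```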

So What? We Already Knew That!
– But with the bootstrap, confidence intervals can be more accurate
– And it can be used for nonlinear statistics without known standard-error formulas

The population is to the sample as the sample is to the bootstrap samples. In practice (as opposed to the previous example), not all possible bootstrap samples are enumerated; a large number are drawn at random.

Procedure
1. Specify the data-collection scheme that results in the observed sample: Collect(population) -> sample
2. Use the sample as if it were the population, sampling with replacement: Collect(sample) -> bootstrap sample 1, bootstrap sample 2, etc.

Cont.
3. For each bootstrap sample, calculate the estimate you are looking for
4. Use the distribution of the bootstrap estimates to estimate the sampling properties of the estimate
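A minimal sketch of steps 1–4 for an arbitrary statistic, assuming NumPy (all names illustrative):

```python
import numpy as np

def bootstrap_se(sample, statistic, n_boot=2000, seed=0):
    """Draw n_boot resamples with replacement; return the SE of the statistic."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    estimates = np.array([
        statistic(rng.choice(sample, size=n, replace=True))  # steps 2-3
        for _ in range(n_boot)
    ])
    return estimates.std(ddof=1)  # step 4: spread of the bootstrap estimates

# Usage: SE of the median, a statistic with no simple closed-form SE.
# se = bootstrap_se(np.array([6.0, -3.0, 5.0, 3.0]), np.median)
```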

Bootstrap Confidence Intervals
– Normal theory
– Percentile intervals. Example: a 95% interval is calculated by taking Lower = the 0.025 × B-th and Upper = the 0.975 × B-th of the B ordered bootstrap replicates
– There are corrections for bootstrap intervals (e.g., bias correction)
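A percentile-interval sketch under the same assumptions:

```python
import numpy as np

def percentile_ci(boot_estimates, alpha=0.05):
    """Percentile bootstrap CI: the alpha/2 and 1 - alpha/2 quantiles of the replicates."""
    lower = np.percentile(boot_estimates, 100 * alpha / 2)        # 0.025 * B-th ordered value
    upper = np.percentile(boot_estimates, 100 * (1 - alpha / 2))  # 0.975 * B-th ordered value
    return lower, upper
```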

Bootstrapping Linear Regression
– The observed estimate is usually the coefficient(s); there are (at least) two ways of doing this
– Resample observations (the usual approach) and re-regress (X will vary)
– Resample residuals (X is fixed): Y* = Ŷ + E* is the new dependent variable; re-regress with X fixed. This assumes the errors are identically distributed, and the impact of a high-leverage outlier may be lost
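A minimal sketch of both schemes with plain NumPy least squares (names illustrative; X is assumed to carry an intercept column):

```python
import numpy as np

def boot_coefs(X, y, n_boot=2000, scheme="cases", seed=0):
    """Bootstrap OLS coefficients by resampling cases or residuals."""
    rng = np.random.default_rng(seed)
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted, resid = X @ beta, y - X @ beta
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        if scheme == "cases":          # X varies with each resample
            idx = rng.integers(0, n, size=n)
            Xb, yb = X[idx], y[idx]
        else:                          # residuals: X fixed, Y* = Y-hat + E*
            Xb = X
            yb = fitted + rng.choice(resid, size=n, replace=True)
        coefs[b] = np.linalg.lstsq(Xb, yb, rcond=None)[0]
    return coefs  # rows: bootstrap replicates of the coefficient vector
```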

Bootstrap for Other Methods
– Used in other classification methods (neural networks, classification trees, etc.)
– Usually useful when the sample size is small and no distributional assumptions can be made
– The same principles apply

Other Resampling Methods
– Jackknife (leave one out) is a special case of the bootstrap: it resamples without one case and without replacement (samples have size n − 1)
– Cross-validation divides the data into training and test sets; it is generally used to estimate confidence intervals on predictions for the "full" model (i.e., the model that utilized all cases)
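A closing sketch of both, assuming NumPy and scikit-learn (names illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def jackknife_estimates(sample, statistic):
    """Leave-one-out replicates: n samples of size n - 1, drawn without replacement."""
    n = len(sample)
    return np.array([statistic(np.delete(sample, i)) for i in range(n)])

def cv_auc(X, y, folds=5):
    """Cross-validation: repeated splits into training and test sets, scored by AUC."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=folds, scoring="roc_auc")
```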