Don't Be Loopy: Re-Sampling and Simulation the SAS® Way
David L. Cassell, Design Pathways, Corvallis, OR


David L. Cassell, Design Pathways

Introduction
  Bootstrapping
  Jackknifing
  Cross-validation
  Simulations
  Monte Carlo
  … and on and on…

First, the BAD WAY
  The typical bootstrap code: a huge macro loop
  Slow
  Awkward
  Very complex code
  Log-filling
  Output-clogging
  Did I mention ‘slow’?

The typical BAD bootstrap code: a huge macro loop

    %macro bootie ( input=, reps= );
      %do i = 1 %to &REPS ;
        %* steps to generate one data set;
        %* the proc to do the analysis;
        %* some way of appending the new results;
      %end;
      %* a proc to compute the bootstrap estimates;
    %mend;

Interlude: What is a bootstrap?
Types of re-sampling:
  Random draws
  Designed subsets
  Exchange labels

Interlude: What is a bootstrap?
  Want to approximate the sampling distribution
  Simple: SRS with replacement from the original sample
  Non-parametric (mostly)
  Want: bias, std error, CI, or …
  Assumptions: exchangeability, …

Interlude: What is a bootstrap?
We’ll start with the simple bootstrap
  Draw a URS (unrestricted random sample, i.e. with replacement) of size N, the size of the original sample
  Compute your statistic
  Repeat B=1000 or 10,000 or … times
  Look at the behavior of your B values

Interlude: What is a bootstrap?
Warning: do not forget exchangeability!
The simple / naïve bootstrap doesn’t work right on:
  Time series data
  Repeated measures data
  Survey sample data
  Data with analytic weights

Interlude: What is a bootstrap?
A common approach is the bootstrap percentile interval:
  Take your B values from before
  Pull the 2.5th and 97.5th percentiles to get a 95% percentile interval as your CI

The typical BAD bootstrap code: a huge macro loop

    %macro bootie ( input=, reps= );
      %do i = 1 %to &REPS ;
        %* steps to generate one data set;
        %* the proc to do the analysis;
        %* some way of appending the new results;
      %end;
      %* a proc to compute the bootstrap estimates;
    %mend;

A Better Bootstrap
  1. Generate ALL of the bootstrap samples as one data set
  2. Use the same proc as before, but use BY-processing
  3. Use the same computations to get the bootstrap estimates

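A Better Bootstrap

    proc surveyselect data=YourData
        out=outboot
        seed=          /* seed value goes here */
        method=urs     /* unrestricted random sampling: with replacement */
        samprate=1     /* each replicate is as large as the original data */
        outhits        /* one output record per hit */
        rep=1000;      /* number of bootstrap replicates */
    run;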

A Better Bootstrap

    proc univariate data=outboot;
      var x;
      by Replicate;
      output out=out1 q1=q1 median=med q3=q3;
    run;

    data out2;
      set out1;
      trimean = (q1 + 2*med + q3) / 4;
    run;

A Better Bootstrap

    proc univariate data=out2;
      var trimean;
      output out=final
        pctlpts=2.5, 97.5
        pctlpre=ci;
    run;

A Better Bootstrap: More

    sasfile YourData load;    /* hold the input data set in memory */
    proc surveyselect data=YourData out=outboot
        seed=                 /* seed value goes here */
        method=urs samprate=1 outhits
        rep=1000;
    run;
    sasfile YourData close;

A Better Bootstrap: More

    ods listing close;
    proc univariate data=outboot;
      var x;
      by Replicate;
      output out=out1 q1=q1 median=med q3=q3;
    run;
    ods listing;

A Better Bootstrap: ODS OUTPUT

    ods output Modes=modal;
    proc univariate data=outboot modes;
      var YourVariable;
      by Replicate;
    run;
    ods output close;

Case Resampling
  Simple bootstrap, as before
  Apply to: PROC REG, PROC LOGISTIC, …
  The approach can be criticized on several grounds

Case Resampling

    data test;
      x=1; y=45; output;
      do x = 2 to 29;
        y = 3*x + 6*rannor(1234);
        output;
      end;
      x=30; y=45; output;
    run;

Case Resampling

    ods listing close;
    proc surveyselect data=temp1 out=boot1 seed=38474
        method=urs samprate=1 outhits rep=1000;
    run;

    proc reg data=boot1 outest=est1(drop=_:);
      model y=x;
      by replicate;
    run;
    ods listing;

Case Resampling

    proc univariate data=est1;
      var x;
      output out=final pctlpts=2.5, 97.5
        pctlpre=ci;
    run;

Case Resampling

    proc robustreg data=temp1 method=MM;
      model y=x;
    run;

Case Resampling

    PROC REG                        (1.74, 2.80)
    bootstrap (case resampling)     (1.65, 2.90)
    PROC ROBUSTREG                  (2.39, 3.13)

Resampling residuals
  Fit the model
  Bootstrap sample for the residuals
  Add the randomly resampled e to Y-hat
  Fit the model for each of the B reps
  Compute bootstrap estimates

Resampling residuals
  1. Perform the regression, get Y-hat and e
  2. Split the data
  3. Copy the FIT data set repeatedly
  4. URS sample of residuals for each replicate
  5. Merge residuals with records
  6. Fit the model on each replicate
  7. Compute bootstrap estimates

Resampling residuals

    proc reg data=test;
      model y=x;
      output out=out1 p=yhat r=res;
    run;

Resampling residuals

    data fit(keep=yhat x order) resid(keep=res);
      set out1;
      order+1;
    run;

Resampling residuals

    proc surveyselect data=fit out=outfit
        method=srs samprate=1 rep=1000;
    run;

Resampling residuals

    data outres2;
      do replicate = 1 to 1000;
        do order = 1 to numrecs;
          p = ceil( numrecs * ranuni( ) );   /* ranuni needs a seed argument */
          set resid nobs=numrecs point=p;    /* read residual number p directly */
          output;
        end;
      end;
      stop;
    run;

Resampling residuals

    data prepped;
      merge outfit outres2;
      by replicate order;
      new_y = yhat + res;
    run;

Resampling residuals

    proc reg data=prepped
        outest=est1( drop=_: );
      model new_y = x;
      by replicate;
    run;

Resampling residuals

    proc univariate data=est1;
      var x;
      output out=final pctlpts=2.5, 97.5
        pctlpre=ci;
    run;

?The? Bootstrap?
  Simple bootstrap
  Residual resampling
  Parametric bootstrap
  Smooth bootstrap
  Wild bootstrap
  Double bootstrap
  Various ‘adjusted’ bootstraps

The Jackknife
  Non-parametric
  N systematic samples of size N-1
  Less general than the bootstrap
  Easier to apply to complex sampling schemes

The Jackknife

    data outb;
      do replicate = 1 to numrecs;
        do rec = 1 to numrecs;
          set test nobs=numrecs point=rec;
          if replicate ^= rec then output;
        end;
      end;
      stop;
    run;

The Jackknife

    ods listing close;
    proc univariate data=outb;
      var y;
      by replicate;
      output out=outall kurtosis=curt;
    run;
    ods listing;

The Jackknife

    proc univariate data=outall;
      var curt;
      output out=final mean=jmean std=jstd;
    run;
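
A hedged aside, not from the slides: the plain standard deviation of the N leave-one-out values is not yet the usual jackknife standard error, which rescales it by (N-1)/sqrt(N). The sketch below shows that rescaling. The data set `outall_demo` and its five `curt` values are made up to stand in for `outall`, and the names `final_se`, `n`, and `jack_se` are hypothetical.

```sas
/* Stand-in for OUTALL: made-up leave-one-out values, illustration only */
data outall_demo;
  input curt @@;
  datalines;
1 2 3 4 5
;
run;

/* Same summary step as on the slide, plus the count N */
proc univariate data=outall_demo noprint;
  var curt;
  output out=final_se n=n mean=jmean std=jstd;
run;

/* Usual jackknife standard error: (N-1)/sqrt(N) times the std deviation */
data final_se;
  set final_se;
  jack_se = (n - 1) / sqrt(n) * jstd;
run;
```

With the real `outall` in place of the toy data, `jack_se` would be the standard error to report for the kurtosis.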

Randomization Tests
  Resampling plan
  Re-label the data points randomly
  Compare against original
  Random subset of full permutation test
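
The slides give no code for this plan, so here is a minimal sketch in the same "no macro loop" style. Everything in it is an assumption for illustration: the toy data set `two`, its values, the seed, and the names `perm`, `obs`, `diffs`, and `pval`. It uses the OUTALL flag from PROC SURVEYSELECT as the random re-labeling, then compares the re-labeled differences in means against the original one.

```sas
/* Toy two-group data: made-up values, for illustration only */
data two;
  input group $ y @@;
  datalines;
A 12 A 15 A 11 A 18 A 14
B 20 B 17 B 22 B 19 B 16
;
run;

/* Each replicate flags a random 5 of the 10 records (Selected=1);
   treat those as the re-labeled group A and the rest as group B */
proc surveyselect data=two out=perm seed=20231 noprint
    method=srs sampsize=5 outall rep=1000;
run;

proc sql;
  /* difference in means (B minus A) for the original labels */
  create table obs as
    select mean(case when group='B' then y else . end)
         - mean(case when group='A' then y else . end) as obs_diff
    from two;

  /* the same difference under each random re-labeling */
  create table diffs as
    select replicate,
           mean(case when selected=0 then y else . end)
         - mean(case when selected=1 then y else . end) as diff
    from perm
    group by replicate;

  /* share of re-labelings at least as extreme as the original */
  create table pval as
    select mean(abs(d.diff) >= abs(o.obs_diff) - 1e-12) as p_two_sided
    from diffs as d, obs as o;
quit;
```

The 1000 replicates are a random subset of the 252 possible re-labelings (sampled with repeats across replicates), so `p_two_sided` is an approximate two-sided p-value.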

Cross-Validation
  Another type of resampling plan
  K replicate samples
  Each sample uses (K-1)/K to model and 1/K for testing

Cross-Validation
  LOOCV: Leave-One-Out Cross-Validation
  K-fold Cross-Validation
  Random K-fold Cross-Validation
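
The slides only show code for the random K-fold case. For LOOCV with an ordinary least-squares model, one possible shortcut (an assumption of this sketch, not something the slides state) is the PRESS= keyword of the PROC REG OUTPUT statement, which returns each observation's residual from the fit that leaves that observation out. The data step repeats the `test` data from the case-resampling slides; the names `loo`, `press_res`, `loocv`, `rmse_loo`, and `mae_loo` are hypothetical.

```sas
/* Same example data as the TEST data set on the case-resampling slide */
data test;
  x=1; y=45; output;
  do x = 2 to 29;
    y = 3*x + 6*rannor(1234);
    output;
  end;
  x=30; y=45; output;
run;

/* One fit; PRESS= gives the leave-one-out prediction error per record */
proc reg data=test noprint;
  model y = x;
  output out=loo press=press_res;
run;
quit;

/* Summarize the leave-one-out errors */
proc sql;
  create table loocv as
    select sqrt(mean(press_res**2)) as rmse_loo,
           mean(abs(press_res))     as mae_loo
    from loo;
quit;
```

This works only because least squares has a closed form for the leave-one-out residual; other models need the replicate-and-BY approach.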

Random K-Fold Cross-Validation

    %let K=10;
    %let rate= %sysevalf( (&K-1) / &K );

    proc surveyselect data=temp1 out=xv seed=   /* seed value goes here */
        samprate=&RATE outall rep=&K ;
    run;

    data xv;
      set xv;
      if selected then new_y=y;   /* held-out records keep new_y missing */
    run;

Random K-Fold Cross-Validation

    proc reg data=xv;
      model new_y=x;
      by replicate;
      output out=out1(where=(new_y=.)) p=yhat;
    run;

Random K-Fold Cross-Validation

    data out2;
      set out1;
      d=y-yhat;
      absd=abs(d);
    run;

    proc summary data=out2;
      var d absd;
      output out=out3 std(d)=rmse mean(absd)=mae;
    run;

Monte Carlo Simulations
  Sample from theoretical distributions
  Sample from population of data points
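
The slides that follow cover sampling from a population of data points. For the first bullet, sampling from a theoretical distribution, a minimal sketch in the same style might look like this. The distribution, sample size, seed, and the names `sim`, `simmeans`, `simsummary`, `xbar`, `mc_mean`, and `mc_se` are all illustrative assumptions.

```sas
/* Generate ALL replicates in one data step: 1000 samples of size 25
   from a standard exponential distribution (mean 1) */
data sim;
  call streaminit(12345);
  do replicate = 1 to 1000;
    do i = 1 to 25;
      x = rand('exponential');
      output;
    end;
  end;
run;

/* One statistic per replicate, via BY-processing */
proc means data=sim noprint;
  by replicate;
  var x;
  output out=simmeans mean=xbar;
run;

/* Behavior of the statistic across replicates */
proc univariate data=simmeans noprint;
  var xbar;
  output out=simsummary mean=mc_mean std=mc_se
    pctlpts=2.5 97.5 pctlpre=p;
run;
```

The structure mirrors the better bootstrap: one step to build every replicate, one BY-processed proc, one summary step.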

Simulations

    proc surveyselect data=largefile out=process_set
        seed=    /* seed value goes here */
        method=srs sampsize=1000;
    run;

    data processor;
      array a{5,5} a1-a25;   /* 5x5 array over a1-a25 */
      set process_set;
    run;

Simulations

    proc plan seed=    /* seed value goes here */ ;
      factors replicate=100 ordered
              SiteNo = 30 of 200 / noprint;
      output out=plan9;
    run;

CONCLUSIONS
Cassell’s “7 Habits of Highly Effective SAS-ers”
  KNOW YOUR PROBLEM
  USE THE RIGHT TOOL
  FEWER STEPS GET YOU FARTHER
  STAY TALL AND THIN
  TOO MUCH OF A GOOD THING IS BAD
  SKIP THE EXPENSIVE STUFF
  SHARPEN THE SAW

CONCLUSIONS
  SAS is great at resampling and simulations.
  You just have to code it in SAS instead of something else!
  Don’t run 5003 steps when 3 steps will do it.
  Don’t assume everything is a macro problem.

CONCLUSIONS
  Resampling methods and simulations do not solve all your problems.
  Use your brain before you use your keyboard.

Contact Information
  David L. Cassell
  Design Pathways
  3115 NW Norwood Pl.
  Corvallis, OR