Computing Simulations in SAS Jordan Elm 7/26/2007 Reference: SAS for Monte Carlo Studies: A Guide for Quantitative Researchers by Xitao Fan, Akos Felsovalyi, Stephen Sivo, and Sean Keenan Copyright(c) 2002 by SAS Institute Inc., Cary, NC, USA
What is meant by “Running Simulations” Simulating Data- Use Random Number Generator. To generate data with certain distribution/shape. Monte Carlo Simulations- Use Random Number Generator, Do Loops, Macros To generate data and compare performance of different methods of analysis.
Monte Carlo Simulations The use of Random Sampling techniques and a computer to obtain approximate solutions to mathematical problems (probability) Can find solutions to mathematical problems (which may have many variables) that cannot easily be solved, for example, by integral calculus, or other numerical methods.integral calculus Learn how a statistic may vary from sample to sample (i.e. obtain the sampling dist for the statistic) by repeatedly drawing random samples from a specific population.
Suitable Questions How does the sample median behave versus the sample mean for a particular distribution. How much variability is there in a sample correlation coefficient for a given sample size. How does non-normality of the data affect the regression coefficients (PROC GLM). Theory is weak or assumptions are violated, so need MC simulations to answer the question.
What is MC simulation Used for? Determining Power/Sample size of a Statistical Method during the planning phase of a study Assess consequences of Violation of Assumptions (homogeneity of variance for t-test, normality of data) Comparing Performance (e.g. Power, Type I error rate) of different Statistical Methods
Example: Rolling the Die Twice What are the chances of obtaining 2 as the sum of rolling a die twice? 1.Roll die twice 10,000 times by hand so can estimate the chance of obtaining 2 as the sum. 2.Apply probability theory (1/6*1/6)= Empirical Approach –Monte Carlo Simulation The outcomes of rolling a die are SIMULATED. Requires a computer and software (SAS, Stata R).
Rolling Die Twice: Prob Sum=2
Basic Programming Steps 1. Generate Random Sample (Data Step) 2. Perform Analysis in Question and Output Statistic to a dataset (Proc) 3. Repeat ( ,000,000 times depending on desired precision) Macro, Do Loop 4. Analyze the Accumulated Statistic of Interest 5. Present Results
Step 1: Generating Data Use functions to generate data with a known distribution E.g. RANUNI, RANEXP, RANNOR, RAND Transform the data to the desired shape x=MU+sqrt(S2)*rannor(seed); ~Norm (MU, S2) x=ranexp(seed)/lambda; ~Exp(lambda) x=ranbin(seed, n, p); ~Binomial(n, p)
Generating Data Transform the data to the desired shape: x=MU+sqrt(S2)*rannor(seed); Seed: is an integer. If seed < 0, the time of day is used to initialize the seed stream, and the stream of random numbers is not replicable. If you use a positive seed, you can always replicate the stream of random numbers by using the same DATA step, but must make your macro program change seed for each replication of the do loop.
Generating Data Multivariate data: %MVN macro Download from Tip for faster program: Generate ALL the data first, then use the BY statement within PROC to analyze each “sample”. E.g. If sample size is 50 and # of reps is set to 1000, then generate data with obs.
Generating Data that Mirror Your Sample Characteristics How well does the t-test do when data is non- normal? Generate non-normal data: Obtain 1 st 4 moments from your sample data (mean, sd, skewness, kurtosis) Obtain inter-variable correlations (PROC CORR) if variables are correlated. Use sample moments and correlations as population parameters, generate data accordingly. Fleishman’s Power Transformation Method Y=a+bZ+cZ 2 +dZ 3, Y non-normal variable, Z~N(0,1), a,b,c,d given by Fleishman (1978) for diff values of kurtosis and skewness.
Example Generating Non-Normal data **** Program 4.3 Fleishman Method for Generating 3 Non-Normal Variables ****; DATA A; DO I = 1 TO 10000; X1 = RANNOR (0); X2 = RANNOR (0); X3 = RANNOR (0); *** Fleishman non-normality transformation; X1 = *X *X1** *X1**3; X2 = *X *X2** *X2**3; X3 = *X *X3** *X3**3; X1 = *X1; ***linear transformation; X2 = *X2; X3 = X3; OUTPUT; END; PROC MEANS N MEAN STD SKEWNESS KURTOSIS; VAR X1 X2 X3; PROC CORR NOSIMPLE; VAR X1 X2 X3; RUN; **************************************************************************;
Example of MC study Assessing the effect of Non-normal data on the Type I error rate of an ANOVA test
Proc IML Matrix Language within SAS Allows for faster programming, however, still slower than Stata, R, S-plus.
Program 6.3: Assessing the effect of Data Non- Normality on the Type I error rate in ANOVA
Bootstrapping & Jackknifing Bootstrapping (Efron 1979) –Drawing a sample from an existing dataset. Sample is same size (or smaller than the original dataset) (Re-sampling with replacement) Purpose- To estimate the dispersion (variance) of poorly understood statistics (nonparametric statistics) Jackknifing-Re-sampling with replacement from an existing dataset. Sample is same size as the original dataset minus 1 observation. used to detect outlier or to make sure that results are repeatable (cross validation).
Examples of Simulation Studies in Epidemiology
Simulation Study of Confounder-Selection Strategies G Maldonado, S Greenland American Journal of Epidemiology Vol. 138, No. 11: In the absence of prior knowledge about population relations, investigators frequently employ a strategy that uses the data to help them decide whether to adjust for a variable. The authors compared the performance of several such strategies for fitting multiplicative Poisson regression models to cohort data: 1) the "change-in-estimate" strategy, in which a variable is controlled if the adjusted and unadjusted estimates differ by some important amount; 2) the "significance-test-of-the-covariate" strategy, in which a variable is controlled if its coefficient is significantly different from zero at some predetermined significance level; 3) the "significance-test-of-the-difference" strategy, which tests the difference between the adjusted and unadjusted exposure coefficients; 4) the "equivalence-test-of-the-difference" strategy, which significance-tests the equivalence of the adjusted and unadjusted exposure coefficients; and 5) a hybrid strategy that takes a weighted average of adjusted and unadjusted estimates. Data were generated from 8,100 population structures at each of several sample sizes. The performance of the different strategies was evaluated by computing bias, mean squared error, and coverage rates of confidence intervals. At least one variation of each strategy that was examined performed acceptably. The change-in-estimate and equivalence-test-of-the-difference strategies performed best when the cut-point for deciding whether crude and adjusted estimates differed by an important amount was set to a low value (10%). The significance test strategies performed best when the alpha level was set to much higher than conventional levels (0.20).
Confidence Intervals for Biomarker-based Human Immunodeficiency Virus Incidence Estimates and Differences using Prevalent Data Cole et al. American J Epid 165 (1): 94. (2007) Common approaches to confidence interval (CI) estimation for this incidence measure have included 1) ignoring the random error in T or 2) employing a Bonferroni adjustment of the box method. The authors present alternative Monte Carlo-based CIs for this incidence measure, as well as CIs for the biomarker-based incidence difference; standard approaches to CIs are typically appropriate for the incidence ratio. Using American Red Cross blood donor data as an example, the authors found that ignoring the random error in T provides a 95% CI for incidence as much as 0.26 times the width of the Monte Carlo CI, while the Bonferroni-box method provides a 95% CI as much as 1.57 times the width of the Monte Carlo CI. Prevalent biologic specimens can be used to estimate human immunodeficiency virus (HIV) incidence using a two-stage immunologic testing algorithm that hinges on the average time, T, between testing HIV-positive on highly sensitive enzyme immunoassays and testing HIV-positive on less sensitive enzyme immunoassays. Common approaches to confidence interval (CI) estimation for this incidence measure have included 1) ignoring the random error in T or 2) employing a Bonferroni adjustment of the box method. The authors present alternative Monte Carlo-based CIs for this incidence measure, as well as CIs for the biomarker-based incidence difference; standard approaches to CIs are typically appropriate for the incidence ratio. Using American Red Cross blood donor data as an example, the authors found that ignoring the random error in T provides a 95% CI for incidence as much as 0.26 times the width of the Monte Carlo CI, while the Bonferroni-box method provides a 95% CI as much as 1.57 times the width of the Monte Carlo CI. Further research is needed to understand under what circumstances the proposed Monte Carlo methods fail to provide valid CIs. The Monte Carlo-based CI may be preferable to competing methods because of the ease of extension to the incidence difference or to exploration of departures from assumptions.
Your Turn to Try
Assess the Effect of Unequal Pop Variance in a 2-sample T-test Design a MC study with to determine: What happens to the type I error rate? What happens to the Power?
Problem Do 1000 replications Let the sample size for the 2 groups (X1 and X2) be 20/group. Alpha=0.05 Mean=50 (under null) Mean=40 (under alternative) SD=10 and 15 Compute a 2-sample t-test
Reference SAS for Monte Carlo Studies: A Guide for Quantitative Researchers by Xitao Fan, Akos Felsovalyi, Stephen Sivo, and Sean Keenan Copyright(c) 2002 by SAS Institute Inc., Cary, NC, USA ISBN