Computing Simulations in SAS Jordan Elm 7/26/2007 Reference: SAS for Monte Carlo Studies: A Guide for Quantitative Researchers by Xitao Fan, Akos Felsovalyi,


Computing Simulations in SAS Jordan Elm 7/26/2007 Reference: SAS for Monte Carlo Studies: A Guide for Quantitative Researchers by Xitao Fan, Akos Felsovalyi, Stephen Sivo, and Sean Keenan Copyright(c) 2002 by SAS Institute Inc., Cary, NC, USA

What Is Meant by “Running Simulations”?
- Simulating data: use a random number generator to generate data with a given distribution or shape.
- Monte Carlo simulations: use a random number generator, DO loops, and macros to generate data and compare the performance of different methods of analysis.

Monte Carlo Simulations
- The use of random sampling techniques and a computer to obtain approximate solutions to mathematical problems (probability).
- Can solve mathematical problems (possibly with many variables) that cannot easily be solved by, for example, integral calculus or other numerical methods.
- Show how a statistic varies from sample to sample (i.e., obtain the sampling distribution of the statistic) by repeatedly drawing random samples from a specified population.

Suitable Questions
- How does the sample median behave versus the sample mean for a particular distribution?
- How much variability is there in a sample correlation coefficient for a given sample size?
- How does non-normality of the data affect the regression coefficients (PROC GLM)?
- In general: theory is weak or assumptions are violated, so MC simulations are needed to answer the question.

What Is MC Simulation Used For?
- Determining the power or sample size of a statistical method during the planning phase of a study.
- Assessing the consequences of violated assumptions (e.g., homogeneity of variance for the t-test, normality of the data).
- Comparing the performance (e.g., power, Type I error rate) of different statistical methods.

Example: Rolling the Die Twice
What are the chances of obtaining 2 as the sum of rolling a die twice?
1. Roll the die twice 10,000 times by hand and estimate the chance of obtaining 2 as the sum.
2. Apply probability theory: (1/6 * 1/6) = 1/36 ≈ 0.028.
Empirical approach (Monte Carlo simulation): the outcomes of rolling a die are SIMULATED. Requires a computer and software (SAS, Stata, R).

Rolling Die Twice: Prob Sum=2
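
The die example can be sketched in a few lines. The block below is written in Python rather than SAS, purely to illustrate the logic; the function name `estimate_prob_sum_two`, the seed, and the number of rolls are assumptions made for this sketch, not part of the original slides.

```python
import random

def estimate_prob_sum_two(n_rolls=10_000, seed=2007):
    """Estimate P(sum of two fair dice = 2) by simulating n_rolls double rolls."""
    rng = random.Random(seed)  # fixed positive seed so the run is repeatable
    hits = 0
    for _ in range(n_rolls):
        total = rng.randint(1, 6) + rng.randint(1, 6)  # two independent rolls
        if total == 2:
            hits += 1
    return hits / n_rolls

estimate = estimate_prob_sum_two()
exact = 1 / 36  # probability theory: (1/6) * (1/6)
print(f"simulated: {estimate:.4f}   exact: {exact:.4f}")
```

The simulated proportion should land close to the theoretical 1/36, and it gets closer as `n_rolls` grows.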

Basic Programming Steps
1. Generate a random sample (DATA step).
2. Perform the analysis in question and output the statistic to a dataset (PROC).
3. Repeat ( ,000,000 times depending on desired precision) using a macro or DO loop.
4. Analyze the accumulated statistic of interest.
5. Present results.
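
The five steps can be sketched as one small study. This is a Python illustration of the workflow, not the SAS DATA step/PROC/macro code the slides refer to; the question studied (spread of the sample mean versus the sample median for normal data), the function name `run_study`, and all parameter values are assumptions chosen for the example.

```python
import random
import statistics

def run_study(n_reps=2000, n_obs=30, seed=1):
    """Compare the sampling variability of the mean and the median."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(n_reps):                                   # step 3: repeat
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]  # step 1: generate a sample
        means.append(statistics.fmean(sample))                # step 2: compute and store
        medians.append(statistics.median(sample))             #         the statistics
    # step 4: analyze the accumulated statistics
    return statistics.stdev(means), statistics.stdev(medians)

sd_mean, sd_median = run_study()
# step 5: present results
print(f"SD of sample mean:   {sd_mean:.3f}")
print(f"SD of sample median: {sd_median:.3f}")
```

For normal data the median should come out more variable than the mean, which is the kind of sampling-distribution result a Monte Carlo study is meant to reveal.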

Step 1: Generating Data
- Use functions to generate data with a known distribution, e.g. RANUNI, RANEXP, RANNOR, RAND.
- Transform the data to the desired shape:
    x = MU + sqrt(S2)*rannor(seed);    ~ Normal(MU, S2)
    x = ranexp(seed)/lambda;           ~ Exponential(lambda)
    x = ranbin(seed, n, p);            ~ Binomial(n, p)
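
The same shift-and-scale idea can be checked outside SAS. The sketch below uses Python's standard `random` module as a stand-in for RANNOR and RANEXP; the parameter values `MU`, `S2`, and `LAMBDA` are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics

rng = random.Random(12345)
MU, S2, LAMBDA = 50.0, 100.0, 2.0  # hypothetical population parameters
N = 20_000

# Normal(MU, S2): shift and scale a standard normal draw,
# mirroring x = MU + sqrt(S2)*rannor(seed)
x_norm = [MU + math.sqrt(S2) * rng.gauss(0.0, 1.0) for _ in range(N)]

# Exponential with rate LAMBDA: divide a unit-rate exponential draw by LAMBDA,
# mirroring x = ranexp(seed)/lambda
x_exp = [rng.expovariate(1.0) / LAMBDA for _ in range(N)]

print(f"normal:      mean={statistics.fmean(x_norm):.2f}  sd={statistics.stdev(x_norm):.2f}")
print(f"exponential: mean={statistics.fmean(x_exp):.3f}  (theory: {1 / LAMBDA:.3f})")
```

Checking the simulated mean and SD against the intended parameters is a cheap way to confirm that a transformation does what was intended before building a full study on it.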

Generating Data
- Transform the data to the desired shape: x = MU + sqrt(S2)*rannor(seed);
- Seed: an integer. If seed ≤ 0, the time of day is used to initialize the seed stream, and the stream of random numbers is not replicable. If you use a positive seed, you can always replicate the stream by rerunning the same DATA step, but your macro program must then change the seed for each replication of the DO loop; otherwise every replication produces the same data.
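
The seed behavior described above is not specific to SAS. As a hedged illustration in Python (the helper `draw` and the `base_seed + r` scheme are assumptions for this sketch, one of several reasonable ways to vary the seed), the same seed reproduces the same stream, and each replication needs its own seed:

```python
import random

def draw(seed, n=5):
    """Return n standard normal draws from a generator started at `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# Same positive seed: the stream is replicated exactly.
same_a = draw(seed=101)
same_b = draw(seed=101)

# Replications inside a loop: vary the seed, otherwise every
# replication would reuse the identical sample.
base_seed = 101
reps = [draw(seed=base_seed + r) for r in range(3)]

print(same_a == same_b)    # identical streams
print(reps[0] == reps[1])  # different seeds, different samples
```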

Generating Data
- Multivariate data: %MVN macro. Download from
- Tip for a faster program: generate ALL the data first, then use the BY statement within the PROC to analyze each “sample”. E.g., if the sample size is 50 and the number of reps is set to 1000, generate a dataset with 50 × 1000 = 50,000 obs.
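
The "generate everything first, then analyze by sample" tip can be sketched as follows. This is a Python analogue, with `itertools.groupby` standing in for the SAS BY statement; the variable names and the choice of the mean as the statistic are assumptions for illustration.

```python
import random
import statistics
from itertools import groupby

N_REPS, N_OBS = 1000, 50
rng = random.Random(7)

# One long table of (rep, x) rows: 50 observations for each of 1000 replications.
data = [(rep, rng.gauss(0.0, 1.0))
        for rep in range(1, N_REPS + 1)
        for _ in range(N_OBS)]

# Analogue of "BY rep": rows are already ordered by rep, so groupby
# yields one block of rows per simulated sample.
rep_means = {
    rep: statistics.fmean([x for _, x in rows])
    for rep, rows in groupby(data, key=lambda row: row[0])
}

print(len(data), "observations in total")
print(len(rep_means), "per-sample means")
```

In SAS the gain comes from calling the procedure once with a BY statement instead of once per replication through a macro loop.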

Generating Data that Mirror Your Sample Characteristics
- How well does the t-test do when the data are non-normal?
- Generate non-normal data:
  - Obtain the first four moments from your sample data (mean, SD, skewness, kurtosis).
  - Obtain inter-variable correlations (PROC CORR) if the variables are correlated.
  - Use the sample moments and correlations as population parameters and generate data accordingly.
- Fleishman’s power transformation method: Y = a + bZ + cZ^2 + dZ^3, where Y is the non-normal variable, Z ~ N(0,1), and a, b, c, d are given by Fleishman (1978) for different values of skewness and kurtosis.
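
To make the shape of the Fleishman transformation concrete, here is a minimal Python sketch. The coefficients `B`, `C`, and `D` below are hypothetical values picked only to produce visible skewness; they are not taken from Fleishman's (1978) table, which should be consulted for coefficients matching a target skewness and kurtosis. The sketch does use the method's constraint a = -c, which keeps the mean of Y at zero.

```python
import random
import statistics

def fleishman(z, b, c, d):
    """Apply Y = a + b*Z + c*Z^2 + d*Z^3 with a = -c (keeps E[Y] = 0)."""
    a = -c
    return a + b * z + c * z**2 + d * z**3

# Hypothetical coefficients for illustration only (NOT from Fleishman's table).
B, C, D = 0.9, 0.2, 0.02

rng = random.Random(1978)
y = [fleishman(rng.gauss(0.0, 1.0), B, C, D) for _ in range(100_000)]

mean_y = statistics.fmean(y)
sd_y = statistics.pstdev(y)
# Sample skewness: average cubed standardized value.
skew_y = statistics.fmean([((v - mean_y) / sd_y) ** 3 for v in y])

print(f"mean={mean_y:.3f}  sd={sd_y:.3f}  skewness={skew_y:.3f}")
```

A positive `C` pushes the distribution to the right, so the resulting variable is clearly skewed even though it is built from standard normal draws.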

Example: Generating Non-Normal Data
**** Program 4.3 Fleishman Method for Generating 3 Non-Normal Variables ****;
DATA A;
  DO I = 1 TO 10000;
    X1 = RANNOR (0); X2 = RANNOR (0); X3 = RANNOR (0);
    *** Fleishman non-normality transformation;
    X1 = *X *X1** *X1**3;
    X2 = *X *X2** *X2**3;
    X3 = *X *X3** *X3**3;
    X1 = *X1; ***linear transformation;
    X2 = *X2;
    X3 = X3;
    OUTPUT;
  END;
PROC MEANS N MEAN STD SKEWNESS KURTOSIS; VAR X1 X2 X3;
PROC CORR NOSIMPLE; VAR X1 X2 X3;
RUN;
**************************************************************************;

Example of MC Study
- Assessing the effect of non-normal data on the Type I error rate of an ANOVA test.

Proc IML
- Matrix language within SAS.
- Allows for faster programming; however, still slower than Stata, R, S-plus.

Program 6.3: Assessing the Effect of Data Non-Normality on the Type I Error Rate in ANOVA

Bootstrapping & Jackknifing
- Bootstrapping (Efron 1979): drawing a sample from an existing dataset by re-sampling with replacement. The sample is the same size as (or smaller than) the original dataset. Purpose: to estimate the dispersion (variance) of poorly understood statistics (nonparametric statistics).
- Jackknifing: re-sampling without replacement from an existing dataset. Each sample is the original dataset minus 1 observation. Used to detect outliers or to make sure that results are repeatable (cross-validation).
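
The two resampling schemes can be contrasted in a short sketch. This is Python, not SAS, and the "existing dataset" is itself simulated here only so the block is self-contained; the dataset, the number of bootstrap samples, and the statistics chosen (median for the bootstrap, mean for the jackknife) are all assumptions for illustration.

```python
import random
import statistics

rng = random.Random(1979)
data = [rng.expovariate(1.0) for _ in range(40)]  # stand-in for an existing dataset

# Bootstrap: draw n observations WITH replacement, many times,
# and use the spread of the statistic as its standard error.
boot_medians = []
for _ in range(2000):
    resample = rng.choices(data, k=len(data))
    boot_medians.append(statistics.median(resample))
boot_se = statistics.stdev(boot_medians)

# Jackknife: leave each observation out once (n samples of size n - 1).
jack_means = [statistics.fmean(data[:i] + data[i + 1:]) for i in range(len(data))]

print(f"bootstrap SE of the median: {boot_se:.3f}")
print(f"jackknife samples: {len(jack_means)}, each of size {len(data) - 1}")
```

A leave-one-out estimate that differs sharply from the rest points to an influential observation, which is how the jackknife helps flag outliers.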

Examples of Simulation Studies in Epidemiology

Simulation Study of Confounder-Selection Strategies G Maldonado, S Greenland American Journal of Epidemiology Vol. 138, No. 11: In the absence of prior knowledge about population relations, investigators frequently employ a strategy that uses the data to help them decide whether to adjust for a variable. The authors compared the performance of several such strategies for fitting multiplicative Poisson regression models to cohort data: 1) the "change-in-estimate" strategy, in which a variable is controlled if the adjusted and unadjusted estimates differ by some important amount; 2) the "significance-test-of-the-covariate" strategy, in which a variable is controlled if its coefficient is significantly different from zero at some predetermined significance level; 3) the "significance-test-of-the-difference" strategy, which tests the difference between the adjusted and unadjusted exposure coefficients; 4) the "equivalence-test-of-the-difference" strategy, which significance-tests the equivalence of the adjusted and unadjusted exposure coefficients; and 5) a hybrid strategy that takes a weighted average of adjusted and unadjusted estimates. Data were generated from 8,100 population structures at each of several sample sizes. The performance of the different strategies was evaluated by computing bias, mean squared error, and coverage rates of confidence intervals. At least one variation of each strategy that was examined performed acceptably. The change-in-estimate and equivalence-test-of-the-difference strategies performed best when the cut-point for deciding whether crude and adjusted estimates differed by an important amount was set to a low value (10%). The significance test strategies performed best when the alpha level was set to much higher than conventional levels (0.20).

Confidence Intervals for Biomarker-based Human Immunodeficiency Virus Incidence Estimates and Differences using Prevalent Data Cole et al. American J Epid 165 (1): 94. (2007)
Prevalent biologic specimens can be used to estimate human immunodeficiency virus (HIV) incidence using a two-stage immunologic testing algorithm that hinges on the average time, T, between testing HIV-positive on highly sensitive enzyme immunoassays and testing HIV-positive on less sensitive enzyme immunoassays. Common approaches to confidence interval (CI) estimation for this incidence measure have included 1) ignoring the random error in T or 2) employing a Bonferroni adjustment of the box method. The authors present alternative Monte Carlo-based CIs for this incidence measure, as well as CIs for the biomarker-based incidence difference; standard approaches to CIs are typically appropriate for the incidence ratio. Using American Red Cross blood donor data as an example, the authors found that ignoring the random error in T provides a 95% CI for incidence as much as 0.26 times the width of the Monte Carlo CI, while the Bonferroni-box method provides a 95% CI as much as 1.57 times the width of the Monte Carlo CI. Further research is needed to understand under what circumstances the proposed Monte Carlo methods fail to provide valid CIs. The Monte Carlo-based CI may be preferable to competing methods because of the ease of extension to the incidence difference or to exploration of departures from assumptions.

Your Turn to Try

Assess the Effect of Unequal Pop Variance in a 2-sample T-test
- Design an MC study to determine:
  - What happens to the Type I error rate?
  - What happens to the power?

Problem
- Do 1000 replications.
- Let the sample size for the 2 groups (X1 and X2) be 20/group.
- Alpha = 0.05
- Mean = 50 (under null); Mean = 40 (under alternative)
- SD = 10 and 15
- Compute a 2-sample t-test.
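
One possible way to set up this exercise is sketched below in Python (the slides intend it to be done in SAS). Several things here are assumptions: the reading that group 1 has SD 10 and group 2 has SD 15, that under the alternative group 2 has mean 40 while group 1 stays at 50, that the pooled-variance t-test is the one of interest, and the critical value 2.024, which is the approximate two-sided 5% point of the t distribution with 38 degrees of freedom taken from standard tables.

```python
import math
import random
import statistics

T_CRIT = 2.024  # assumed: approx. two-sided 5% critical value, t with 38 df

def pooled_t(x1, x2):
    """Two-sample t statistic using the pooled variance estimate."""
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * statistics.variance(x1)
           + (n2 - 1) * statistics.variance(x2)) / (n1 + n2 - 2)
    return (statistics.fmean(x1) - statistics.fmean(x2)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

def rejection_rate(mean1, mean2, sd1, sd2, n=20, n_reps=1000, seed=2007):
    """Proportion of replications in which the t-test rejects at alpha = 0.05."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        x1 = [rng.gauss(mean1, sd1) for _ in range(n)]
        x2 = [rng.gauss(mean2, sd2) for _ in range(n)]
        if abs(pooled_t(x1, x2)) > T_CRIT:
            rejections += 1
    return rejections / n_reps

type1_error = rejection_rate(50, 50, 10, 15)  # null: equal means, unequal SDs
power = rejection_rate(50, 40, 10, 15)        # alternative: means differ by 10

print(f"Type I error rate: {type1_error:.3f}")
print(f"Power:             {power:.3f}")
```

Rerunning with unequal group sizes, or with an unequal-variance t-test, would be natural extensions of the same design.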

Reference
- SAS for Monte Carlo Studies: A Guide for Quantitative Researchers by Xitao Fan, Akos Felsovalyi, Stephen Sivo, and Sean Keenan. Copyright (c) 2002 by SAS Institute Inc., Cary, NC, USA
- ISBN