The Right Way to code simulation studies in Stata


The Right Way to code simulation studies in Stata
Tim Morris, MRC CTU at UCL
Michael Crowther, University of Leicester
25th UK Stata Conference

https://github.com/tpmorris/TheRightWay 

What is a simulation study?
Use of (pseudo) random numbers to produce data from some distribution to help us to study properties of a statistical method.
An example:
1. Generate data from a distribution with parameter $\theta$
2. Apply analysis method to data, producing an estimate $\hat{\theta}$
3. Repeat (1) and (2) nsim times
4. Compare $\theta$ with $E[\hat{\theta}]$ – if we had not generated the data, we would not know $\theta$ and so could not do this.
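The four steps above can be sketched in a few lines of Stata. This is a toy illustration, not code from the talk: the normal-mean setting, the seed, the sample size of 50, and the file name estimates_demo are all assumptions made for the example.

```stata
* Toy sketch: theta is the mean of a normal distribution (an assumed setting)
clear all
set seed 72789
local nsim  1000          // number of repetitions (assumed)
local theta 0.5           // true value, known only because we generate the data

tempname pf
postfile `pf' int(rep_id) double(theta_hat se) using estimates_demo, replace
forvalues i = 1/`nsim' {
    quietly {
        drop _all
        set obs 50
        generate double y = rnormal(`theta', 1)   // step 1: generate data
        mean y                                    // step 2: apply the method
        post `pf' (`i') (_b[y]) (_se[y])          // store one row per repetition
    }
}                                                 // step 3: repeat nsim times
postclose `pf'

* Step 4: compare theta with the average of the estimates
use estimates_demo, clear
summarize theta_hat
display "Estimated bias: " r(mean) - `theta'
```

The comparison in the last line is only possible because `theta' was chosen by us when generating the data.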

Some background
Consistent terminology with definitions
ADEMP (Aims, Data-generating mechanisms, Estimands, Methods, Performance measures)
D, E, M are important in coding simulation studies

Four datasets (possibly)
Simulated: e.g. a simulated hypothetical study
Estimates: some summary of a repetition
States: record of $n_{\text{sim}} + 1$ RNG states – one at the beginning of each repetition and one after the final repetition
Performance: summarises estimates of performance (bias, empirical SE, coverage etc.), and (hopefully) their Monte Carlo SE, for each D, E, M

This talk
This talk focuses on the code that produces a simulated dataset and returns the estimates and states datasets.
I teach simulation studies a lot. Errors in coding occur primarily in generating data in the way you want, and in storing summaries of each rep (estimates data).

A simple simulation study: Aims
Suppose we are interested in the analysis of a randomised trial with a survival outcome and unknown baseline hazard function.
Aim to evaluate the impacts of:
– misspecifying the baseline hazard function on the estimate of the treatment effect
– fitting a more complex model than necessary
– avoiding the issue by using a semiparametric model

Data generating mechanisms
Simulate nobs=100 and then nobs=500 from a Weibull distribution with $X_i \sim \mathrm{Bern}(0.5)$ and $h(t) = \lambda\gamma t^{\gamma-1}\exp(X_i\theta)$, where $\lambda = 0.1$, $\theta = -0.5$ (admin censoring at 5 years)
Study $\gamma = 1$ then $\gamma = 1.5$
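The slides do not say how the survival times are drawn. One standard option, assumed here, is inverse-transform sampling: integrate the hazard to get the survival function, set it equal to a uniform draw $U \sim \mathrm{Uniform}(0,1)$, and solve for the event time $T_i$.

```latex
\begin{align*}
H(t \mid X_i) &= \int_0^t \lambda\gamma s^{\gamma-1}\exp(X_i\theta)\,ds
               = \lambda t^{\gamma}\exp(X_i\theta), \\
S(t \mid X_i) &= \exp\{-H(t \mid X_i)\}
               = \exp\{-\lambda t^{\gamma}\exp(X_i\theta)\}, \\
U = S(T_i \mid X_i) \;&\Longrightarrow\;
T_i = \left(\frac{-\ln U}{\lambda\exp(X_i\theta)}\right)^{1/\gamma}, \\
\text{observed time} &= \min(T_i, 5), \qquad
\text{event indicator} = \mathbf{1}(T_i < 5).
\end{align*}
```

With $\gamma = 1$ this reduces to an exponential distribution with rate $\lambda\exp(X_i\theta)$.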

Estimands and Methods
The estimand is $\theta$, the log hazard ratio for treatment vs. control
Methods:
– Exponential model
– Weibull model
– Cox model
(Don’t need to consider performance measures for this talk; see London Stata Conference 2020!)
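As a rough illustration of one repetition, the sketch below simulates a single trial under one data-generating mechanism (nobs=100, γ=1.5) and fits the three models. The variable names (trt, t, died), the seed, and the use of inverse-transform sampling for the event times are assumptions for this example, not taken from the authors' code.

```stata
* Sketch: one simulated trial, analysed three ways (names and seed are assumed)
clear all
set seed 4716
set obs 100
generate byte trt = rbinomial(1, 0.5)            // X_i ~ Bern(0.5)
* Weibull event times with lambda = 0.1, theta = -0.5, gamma = 1.5
generate double t = (-ln(runiform()) / (0.1 * exp(-0.5 * trt)))^(1 / 1.5)
generate byte died = t < 5                       // event seen before 5 years
replace t = 5 if !died                           // administrative censoring
stset t, failure(died)

streg trt, distribution(exponential) nohr        // Exponential model
display "Exponential: " _b[trt] " (SE " _se[trt] ")"

streg trt, distribution(weibull) nohr            // Weibull model
display "Weibull:     " _b[trt] " (SE " _se[trt] ")"

stcox trt, nohr                                  // Cox model
display "Cox:         " _b[trt] " (SE " _se[trt] ")"
```

The nohr option only changes what is displayed; _b[trt] and _se[trt] are on the log hazard ratio scale either way, which is the scale of $\theta$.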

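Well-structured estimates (empty)
Long–long format
rep_id  n_obs  truegamma  method       theta_hat   se
1       100    γ=1        Exponential  -1.690183   .5477225
1       100    γ=1        Weibull      -1.712495   .54808
1       100    γ=1        Cox          -1.688541   .5481199
1       100    γ=1.5      Exponential  -.5390697   .2495417
1       100    γ=1.5      Weibull      -.6375546   .2504361
1       100    γ=1.5      Cox          -.6162164   .2510851
1       500    γ=1        Exponential  -.5785365   .1548867
1       500    γ=1        Weibull      -.5820988   .1549543
1       500    γ=1        Cox          -.5867053   .1550035
1       500    γ=1.5      Exponential  -.4040936   .1188226
1       500    γ=1.5      Weibull      -.4308287   .1189563
1       500    γ=1.5      Cox          -.4335943   .1190354
Inputs: rep_id, n_obs, truegamma, method. Results: theta_hat, se.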

Well-structured estimates (empty)
Wide–long format
rep_id  n_obs  gamma  theta_exp  se_exp    theta_wei  se_wei    theta_cox  se_cox
1       100    γ=1    -1.690183  .5477225  -1.712495  .54808    -1.688541  .5481199
1       100    γ=1.5  -.5164924  .2589072  -.5594682  .2595417  -.5601631  .2598854
1       500    γ=1    -.6253604  .1511858  -.6269046  .1512856  -.6343831  .1513485
1       500    γ=1.5  -.478514   .1176905  -.5447887  .1179448  -.5460246  .1180312
2       100    γ=1    -.377425   .3562627  -.3859514  .3563656  -.3728753  .3564457
2       100    γ=1.5  -.4841157  .2456835  -.5684879  .2466851  -.5850977  .2472228
2       500    γ=1    -.6477997  .1615617  -.6477113  .161647   -.6452857  .1616655
2       500    γ=1.5  -.3358569  .1222584  -.3609435  .1223288  -.3619137  .1224012
Inputs: rep_id, n_obs, gamma. Results: theta_exp, se_exp, theta_wei, se_wei, theta_cox, se_cox.
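Moving between the two layouts is a reshape. The sketch below types in the first three long–long rows from the slide (repetition 1, nobs=100, γ=1) and converts them to one wide–long row. Coding method as 1, 2, 3 and storing truegamma as a number are assumptions made so that reshape has a numeric j variable.

```stata
* Sketch: long-long to wide-long for one repetition and one DGM
clear
input int rep_id int n_obs double truegamma byte method double theta_hat double se
1 100 1 1 -1.690183 .5477225
1 100 1 2 -1.712495 .54808
1 100 1 3 -1.688541 .5481199
end

* One row per (rep_id, n_obs, truegamma); method moves into the variable names
reshape wide theta_hat se, i(rep_id n_obs truegamma) j(method)

* Replace the numeric suffixes with readable method suffixes
rename (theta_hat1 se1 theta_hat2 se2 theta_hat3 se3) ///
       (theta_exp se_exp theta_wei se_wei theta_cox se_cox)
list
```

Going the other way needs the inverse renaming followed by reshape long, which is the kind of step that is easy to get wrong when the estimates start out ultra-wide.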

The simulate approach
From the help file: ‘simulate eases the programming task of performing Monte Carlo-type simulations’ … ‘questionable’ to ‘no’.

The simulate approach
If you haven’t used it, simulate works as follows:
1. You write a program (rclass or eclass) that follows standard Stata syntax and returns quantities of interest as scalars.
2. Your program will generate ≥1 simulated dataset and return estimates for ≥1 estimands obtained by ≥1 methods.
3. You use simulate to repeatedly call the program.
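A minimal sketch of that workflow follows, using a single data-generating mechanism and the Weibull model only. The program name simweib, its options, the seed, and the number of repetitions are hypothetical choices for this example.

```stata
* Sketch: an rclass program called repeatedly by simulate (names are assumed)
capture program drop simweib
program define simweib, rclass
    version 14
    syntax [, nobs(integer 100) gamma(real 1)]
    drop _all
    set obs `nobs'
    generate byte trt = rbinomial(1, 0.5)
    * Weibull event times with lambda = 0.1 and theta = -0.5
    generate double t = (-ln(runiform()) / (0.1 * exp(-0.5 * trt)))^(1 / `gamma')
    generate byte died = t < 5
    replace t = 5 if !died                  // administrative censoring at 5 years
    stset t, failure(died)
    streg trt, distribution(weibull) nohr
    return scalar theta_wei = _b[trt]       // returned quantities of interest
    return scalar se_wei    = _se[trt]
end

set seed 8521
simulate theta_wei=r(theta_wei) se_wei=r(se_wei), reps(200) nodots: ///
    simweib, nobs(100) gamma(1.5)
summarize theta_wei se_wei
```

The resulting dataset has one row per repetition and one column per returned scalar, so every extra method or data-generating mechanism handled inside the program adds columns rather than rows.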

The simulate approach
I’ve wished-&-grumbled here and on Statalist that simulate:
– Does not allow posting of the repetition number (an oversight?)
– Precludes putting strings into the estimates dataset, meaning non-numerical inputs (D) and contents of c(rngstate) cannot be stored.
– Produces ultra-wide data (if E, M and D vary, the resulting estimates must be stored across a single row!)
Your code is clean; your estimates dataset is a mess.

The post approach
Structure:
    tempname tim
    postfile `tim' int(rep) str5(dgm estimand) ///
        double(theta se) using estimates.dta, replace
    forval i = 1/`nsim' {
        <1st DGM>
        <apply method>
        post `tim' (`i') ("thing") ("theta") (_b[trt]) (_se[trt])
        <2nd DGM>
    }
    postclose `tim'

The post approach
+ No shortcomings of simulate
+ Produces a well-formed estimates dataset
– post commands become entangled in the code for generating and analysing data
– post lines are more error-prone. Suppose you are using different n. An efficient way to code this is to generate a dataset (with n observations) and then use subsets of this data in the analysis for the ‘smaller n’ data-generating mechanisms. The code can get inelegant and you mis-post.
Your estimates dataset is clean; your code is a mess.

The right approach
One can mash up the two!
1. Write a program, as you would with simulate
2. Use postfile
3. Call the program
4. Post inputs and returned results using post
5. Use a second postfile for storing rngstates
Why?
1. Appease Michael: tidy code that is less error-prone.
2. Appease Tim: tidy estimates (and states) dataset that avoids error-prone reshaping & formatting acrobatics.
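The five steps might be put together roughly as below. This is a sketch under assumptions, not the authors' code (which is in the GitHub repository linked earlier): the program name simsurv, the returned scalar names, the numeric method codes 1 to 3, the file names, and the seed are all hypothetical. Splitting c(rngstate) across three str2045 variables is also an assumption, made because a single str# variable cannot hold more than 2045 characters; check that the pieces cover the full state for the RNG you use.

```stata
* Sketch: program + postfile, with a second postfile for RNG states
capture program drop simsurv
program define simsurv, rclass
    version 14
    syntax [, nobs(integer 100) gamma(real 1)]
    drop _all
    set obs `nobs'
    generate byte trt = rbinomial(1, 0.5)
    generate double t = (-ln(runiform()) / (0.1 * exp(-0.5 * trt)))^(1 / `gamma')
    generate byte died = t < 5
    replace t = 5 if !died                        // admin censoring at 5 years
    stset t, failure(died)
    streg trt, distribution(exponential) nohr     // method 1
    return scalar theta_1 = _b[trt]
    return scalar se_1    = _se[trt]
    streg trt, distribution(weibull) nohr         // method 2
    return scalar theta_2 = _b[trt]
    return scalar se_2    = _se[trt]
    stcox trt, nohr                               // method 3
    return scalar theta_3 = _b[trt]
    return scalar se_3    = _se[trt]
end

set seed 20190905
local nsim 10
tempname est sta
postfile `est' int(rep_id n_obs) double(truegamma) byte(method) ///
    double(theta_hat se) using estimates, replace
postfile `sta' int(rep_id) str2045(s1 s2 s3) using states, replace

forvalues i = 1/`nsim' {
    * RNG state at the start of repetition i, stored in three pieces
    post `sta' (`i') (substr(c(rngstate), 1, 2045)) ///
        (substr(c(rngstate), 2046, 2045)) (substr(c(rngstate), 4091, 2045))
    foreach n of numlist 100 500 {
        foreach g of numlist 1 1.5 {
            quietly simsurv, nobs(`n') gamma(`g')
            forvalues m = 1/3 {
                * inputs first, then the returned results: one row per method
                post `est' (`i') (`n') (`g') (`m') (r(theta_`m')) (r(se_`m'))
            }
        }
    }
}
* One further state after the final repetition
post `sta' (`nsim' + 1) (substr(c(rngstate), 1, 2045)) ///
    (substr(c(rngstate), 2046, 2045)) (substr(c(rngstate), 4091, 2045))
postclose `est'
postclose `sta'
```

All post commands sit together outside the program, so the data-generating and analysis code stays clean while the estimates arrive already in long–long format.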

A query (grumble?)
None of the options allow for a well-formatted dataset.
I want to define a (unique) sort order, label variables & values, use chars… (for value labels, order matters; see below)
I believe this stuff has to be done afterwards (?)
To use 1 "Exponential" 2 "Weibull" and 3 "Cox" (I do), I have to open estimates.dta, label define and label values.
Could this be done up-front so you could e.g. fill in DGM codes with "Cox":method_label rather than the number 3?
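For reference, the "afterwards" route described above looks something like the following. To keep the sketch self-contained it types in three rows from the long–long slide as a stand-in for estimates.dta; the variable labels and the sort order are illustrative assumptions.

```stata
* Sketch: label and sort a stand-in for estimates.dta after the simulation
clear
input int rep_id int n_obs double truegamma byte method double theta_hat double se
1 100 1 1 -1.690183 .5477225
1 100 1 2 -1.712495 .54808
1 100 1 3 -1.688541 .5481199
end

label define method_label 1 "Exponential" 2 "Weibull" 3 "Cox"
label values method method_label
label variable theta_hat "Estimated log hazard ratio"
label variable se "Model-based standard error"
sort rep_id n_obs truegamma method

* Once the label is defined, it can stand in for the number in expressions
list rep_id method theta_hat se if method == "Cox":method_label
```

The "string":labelname lookup works here only because method_label already exists in the data in memory; whether it can be made available at the time of posting is the open question raised on the slide.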