Designing longitudinal studies in epidemiology


Designing longitudinal studies in epidemiology. Donna Spiegelman, Professor of Epidemiologic Methods, Departments of Epidemiology and Biostatistics, stdls@channing.harvard.edu. Xavier Basagaña, Doctoral Student, Department of Biostatistics, Harvard School of Public Health.

Background: We develop methods for the design of longitudinal studies for the most common scenarios in epidemiology. Some formulas for power and sample size calculations already exist in this context, but all prior work has been developed for clinical trial applications.

Background: Prior methods are based on clinical trials. Some rely on test statistics (e.g. ANCOVA) that are not valid, or are less efficient, in an observational context.

Background: Prior methods are based on clinical trials. In clinical trials, the time measure of interest is time from randomization, so everyone starts at the same time; exposures are time-invariant; and exposure (treatment) prevalence is 50% by design. We consider situations where, for example, age is the time variable of interest and subjects do not start at the same age.

Xavier Basagaña’s Thesis: Derive study design formulas based on tests that are valid and efficient for observational studies, for two reasonable alternative hypotheses. Comprehensively assess the effect of all parameters on power and sample size. Extend the formulas to a context where not all subjects enter the study at the same time. Extend the formulas to the case of time-varying covariates, and compare them to the time-invariant case.

Xavier Basagaña’s Thesis: Derive the optimal combination of number of subjects (n) and number of repeated measures (r+1) subject to a cost constraint. Create a computer program to perform design computations that has an intuitive parameterization and is easy to use.

Notation and Preliminary Results

We study two alternative hypotheses. The first is Constant Mean Difference (CMD).

Linearly Divergent Differences (LDD)
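The two alternatives can be written as mean models. The sketch below assumes a linear mean model with a binary exposure indicator k_i and a time variable t_ij. The coefficient names β0 to β3 are an assumption of this sketch, chosen to be consistent with the LDD model fitted in the example (exposure, age, and their interaction).

```latex
\begin{align*}
\text{CMD:}\quad E(Y_{ij} \mid k_i) &= \beta_0 + \beta_1 t_{ij} + \beta_2 k_i,
  & H_0 &: \beta_2 = 0, \\
\text{LDD:}\quad E(Y_{ij} \mid k_i) &= \beta_0 + \beta_1 t_{ij} + \beta_2 k_i + \beta_3 k_i t_{ij},
  & H_0 &: \beta_3 = 0.
\end{align*}
```

Under CMD the exposed and unexposed trajectories are parallel; under LDD the gap between them changes linearly with time.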

Intuitive parameterization of the alternative hypothesis: μ00 is the mean response at baseline (or at the mean initial time) in the unexposed group. p1 is the percent difference between exposed and unexposed groups at baseline (or at the mean initial time).

Intuitive parameterization of the alternative hypothesis (2): p2 is the percent change from baseline (or from the mean initial time) to end of follow-up (or to the mean final time) in the unexposed group. When τ is not fixed, p2 is defined at time s instead of at time τ. p3 is the percent difference between the change from baseline (or from the mean initial time) to end of follow-up (or mean final time) in the exposed group and the unexposed group. When p2 = 0, p3 is defined as the percent change from baseline (or from the mean initial time) to the end of follow-up (or to the mean final time) in the exposed group.
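To see how the percent-scale inputs map onto regression coefficients, here is a minimal sketch. It assumes the linear mean model with slope β1 in the unexposed, baseline difference β2 and slope difference β3. The function name, the `horizon` argument (the time span over which the percent change is defined) and the exact conversion formulas are assumptions of this sketch, written from the verbal definitions above. The input values are the ones used in the worked example later in the talk.

```python
def percent_to_coefficients(mu00, p2, p3, horizon, p1=0.0):
    """Convert percent-scale design inputs to coefficients (illustrative only).

    mu00    : mean response at baseline in the unexposed
    p1      : percent difference between exposed and unexposed at baseline
    p2      : percent change over `horizon` in the unexposed
    p3      : percent difference in change, exposed vs. unexposed
    horizon : time span over which p2 is defined (assumed: s or the duration)
    """
    beta2 = p1 * mu00                # baseline difference between groups
    beta1 = p2 * mu00 / horizon      # slope per unit time in the unexposed
    if p2 != 0:
        beta3 = p3 * beta1           # slope difference as a fraction of beta1
    else:
        # With no change in the unexposed, p3 is read as the percent change
        # in the exposed group itself.
        beta3 = p3 * mu00 / horizon
    return beta1, beta2, beta3


# Example inputs: mu00 = 32.7, a -0.6% change over 2 years, 70% difference.
beta1, beta2, beta3 = percent_to_coefficients(mu00=32.7, p2=-0.006, p3=0.7, horizon=2)
print(beta1, beta3)
```

With these inputs the unexposed slope is about -0.098 points per year, which agrees with the roughly 1 point per 10 years of age quoted in the example.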

Notation & Preliminary Results: We consider studies where the interval between visits (s) is fixed but the duration of the study is free (e.g. participants may respond to questionnaires every two years). In this case, increasing r involves increasing the duration of the study. We also consider studies where the duration of the study, τ, is fixed but the interval between visits is free (e.g. the study is 5 years long). In this case, increasing r involves increasing the frequency of the measurements, since τ = s r.

Notation & Preliminary Results: the model; the generalized least squares (GLS) estimator of B; the power formula.
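For reference, the textbook forms of the GLS estimator and of the asymptotic power of a two-sided Wald test are sketched below. The notation (X_i for the design matrix of subject i, Σ_i for the covariance of that subject's responses, σ_β² for the diagonal element of Σ_B belonging to the tested coefficient) is an assumption of this sketch and may differ from the notation used on the slide.

```latex
\begin{align*}
\hat{B} &= \Bigl(\sum_{i=1}^{N} X_i^{\top}\Sigma_i^{-1}X_i\Bigr)^{-1}
           \sum_{i=1}^{N} X_i^{\top}\Sigma_i^{-1}Y_i, \\
\Sigma_B &= \Bigl(E\bigl[X_i^{\top}\Sigma_i^{-1}X_i\bigr]\Bigr)^{-1}, \\
\text{Power} &\approx \Phi\!\left(\frac{\sqrt{N}\,\lvert\beta\rvert}{\sqrt{\sigma_\beta^{2}}} - z_{1-\alpha/2}\right).
\end{align*}
```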

Notation & Preliminary Results: Let σ^{lm} be the (l,m)th element of Σ^{-1}. Assuming that the time distribution is independent of exposure group, the variance of the tested coefficient can be written in closed form under CMD and under LDD.
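One way to obtain such expressions is sketched below. It is a derivation from the block structure of the GLS information matrix under independence of exposure and time, and it is not necessarily the form shown on the slide. Here σ^{lm} denotes the (l,m)th element of the inverse covariance matrix, t_l the time at visit l, p_e the exposure prevalence, and β2, β3 the tested coefficients under CMD and LDD; all of these names are assumptions of the sketch.

```latex
\begin{align*}
a_{11} &= \sum_{l,m} \sigma^{lm}, \qquad
a_{12} = \sum_{l,m} \sigma^{lm}\, E(t_m), \qquad
a_{22} = \sum_{l,m} \sigma^{lm}\, E(t_l t_m), \\
\text{CMD:}\quad \operatorname{Var}(\hat\beta_2) &\approx \frac{1}{N\, p_e (1-p_e)\, a_{11}}, \\
\text{LDD:}\quad \operatorname{Var}(\hat\beta_3) &\approx
  \frac{a_{11}}{N\, p_e (1-p_e)\,\bigl(a_{11} a_{22} - a_{12}^{2}\bigr)}.
\end{align*}
```

As a check, under compound symmetry with r+1 measures, a_{11} = (r+1)/(σ²(1+rρ)), so the CMD variance reduces to the familiar σ²(1+rρ)/((r+1) N p_e(1-p_e)).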

Correlation structures We consider three common correlation structures: Compound symmetry (CS).

Correlation structures: Damped exponential (DEX). θ = 0: CS; θ = 0.3: intermediate between CS and AR(1); θ = 1: AR(1).
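A short sketch of the damped exponential correlation may help here. It assumes the usual DEX form, in which two measures separated by d time units have correlation ρ^(d^θ), with ρ the correlation at one unit of separation and θ the damping coefficient. The function name is hypothetical.

```python
def dex_correlation(times, rho, theta):
    """Damped exponential correlation matrix for the given measurement times.

    Off-diagonal entries are rho ** (|t_j - t_k| ** theta); the diagonal is 1.
    theta = 0 gives compound symmetry, theta = 1 gives AR(1).
    """
    n = len(times)
    return [
        [1.0 if j == k else rho ** (abs(times[j] - times[k]) ** theta)
         for k in range(n)]
        for j in range(n)
    ]


times = [0, 1, 2, 3]
cs = dex_correlation(times, rho=0.3, theta=0.0)       # constant off-diagonal
ar1 = dex_correlation(times, rho=0.3, theta=1.0)      # decays as rho ** lag
damped = dex_correlation([0, 2], rho=0.3, theta=0.1)  # two visits, 2 years apart
```

With θ = 0.1 the correlation decays only slightly with distance, so two visits two years apart remain almost as correlated as under compound symmetry.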

Correlation structures: Random intercepts and slopes (RS). The structure is reparameterized in terms of the reliability coefficient at baseline and the slope reliability at the end of follow-up (slope reliability = 0 is CS; slope reliability = 1 means all variation in slopes is between subjects). With this correlation structure the variance of the response changes with time, i.e. it gives a heteroscedastic model.

Example: The goal is to investigate the effect of indicators of socioeconomic status and post-menopausal hormone use on cognitive function (CMD) and cognitive decline (LDD). The “pilot study” is Lee S, Kawachi I, Berkman LF, Grodstein F. “Education, other socioeconomic indicators, and cognitive function.” Am J Epidemiol 2003; 157: 712-720, denoted here as Grodstein. Design questions include: the power of the published study to detect effects of specified magnitude; the number and timing of additional tests needed to obtain a study with the desired power to detect effects of specified magnitude; and the optimal number of participants and measurements needed in a de novo study of these issues.

Example: At baseline and at one subsequent time, six cognitive tests were administered to 15,654 participants in the Nurses’ Health Study. Outcome: Telephone Interview for Cognitive Status (TICS), μ00 = 32.7 (4). The implied model has a slope of 1 point per 10 years of age.

Example: Exposure: graduate school degree vs. not (GRAD); Corr(GRAD, age) = -0.01. Exposure: post-menopausal hormone use (CURRHORM); Corr(CURRHORM, age) = -0.06. Time: age (years) is the best choice, not questionnaire cycle or calendar year of test. The mean age was 74 and V(t0) = 4.

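Example: SAS code to fit the LDD model with CS covariance:

    proc mixed;
      class id;
      /* exposure, age, and their interaction; /s prints fixed-effect solutions */
      model tics = grad age grad*age / s;
      random id;
    run;

SAS code to fit the LDD model with RS covariance:

    proc mixed;
      class id;
      model tics = grad age grad*age / s ddfm=bw;
      /* unstructured covariance for the random intercept and age slope */
      random intercept age / type=un subject=id;
    run;

The estimated correlation parameter was 0.27 under CS and 0.26 under RS; the two further RS parameters were 0.04 and -0.14.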

Program optitxs.r makes it all possible

http://www.hsph.harvard.edu/faculty/spiegelman/software.html

http://www.hsph.harvard.edu/faculty/spiegelman/optitxs.html

Illustration of use of software optitxs.r: We’ll calculate the power of Grodstein’s published study to detect the observed 70% difference in rates of decline between those with more than high school vs. others. Recall that 6.2% of NHS participants had more than high school, and that cognitive function changed by -0.3% per year.

> long.power()
Press <Esc> to quit
Constant mean difference (CMD) or Linearly divergent difference (LDD)? ldd
The alternative is LDD.
Enter the total sample size (N): 15000
Enter the number of post-baseline measures (r>0): 1
Enter the time between repeated measures (s): 2
Enter the exposure prevalence (pe) (0<=pe<=1): 0.062
Enter the variance of the time variable at baseline, V(t0) (enter 0 if all participants begin at the same time): 4
Enter the correlation between the time variable at baseline and exposure, rho[e,t0] (enter 0 if all participants begin at the same time): -0.01
Will you specify the alternative hypothesis on the absolute (beta coefficient) scale (1) or the relative (percent) scale (2)? 2
The alternative hypothesis will be specified on the relative (percent) change scale.
Enter mean response at baseline among unexposed (mu00): 32.7
Enter the percent change from baseline to end of follow-up among unexposed (p2) (e.g. enter 0.10 for a 10% change): -0.006
Enter the percent difference between the change from baseline to end of follow-up in the exposed group and the unexposed group (p3) (e.g. enter 0.10 for a 10% difference): 0.7
Which covariance matrix are you assuming: compound symmetry (1), damped exponential (2) or random slopes (3)? 2
You are assuming DEX covariance
Enter the residual variance of the response given the assumed model covariates (sigma2): 12
Enter the correlation between two measures of the same subject separated by one unit (rho): 0.3
Enter the damping coefficient (theta): 0.10
Power = 0.4206059
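The calculation behind this session can be approximated with a short standalone sketch. This is not the optitxs.r code. It is an independent approximation that assumes asymptotic GLS theory, a damped exponential covariance, a two-sided 5% Wald test, and exposure independent of the time variable (so the small rho[e,t0] = -0.01 entered above is treated as 0). All function names are hypothetical.

```python
from statistics import NormalDist


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * w for v, w in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]


def ldd_power(n, r, s, pe, v_t0, beta3, sigma2, rho, theta, alpha=0.05):
    """Approximate power to detect the slope difference beta3 under LDD."""
    visits = range(r + 1)
    # DEX covariance: sigma2 * rho ** (|time gap| ** theta) off the diagonal.
    cov = [[sigma2 * (1.0 if j == k else rho ** (abs(s * (j - k)) ** theta))
            for k in visits] for j in visits]
    w = invert(cov)
    # Time is centred at the mean baseline time, so E[t_j] = s*j and
    # E[t_j * t_k] = V(t0) + s^2 * j * k.
    a11 = sum(w[j][k] for j in visits for k in visits)
    a12 = sum(w[j][k] * s * k for j in visits for k in visits)
    a22 = sum(w[j][k] * (v_t0 + s * s * j * k) for j in visits for k in visits)
    # Variance of the interaction estimate when exposure is independent of time.
    var_beta3 = a11 / (a11 * a22 - a12 ** 2) / (n * pe * (1 - pe))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(beta3) / var_beta3 ** 0.5 - z_crit)


beta1 = -0.006 * 32.7 / 2   # slope in the unexposed, points per year
power = ldd_power(n=15000, r=1, s=2, pe=0.062, v_t0=4,
                  beta3=0.7 * beta1, sigma2=12, rho=0.3, theta=0.10)
print(round(power, 3))
```

Under these assumptions the result comes out close to the 0.42 reported by the program. The small remaining difference is expected, because the sketch ignores the correlation between exposure and baseline age.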

Power of current study: To detect the observed 70% difference in cognitive decline by GRAD: CS 44%, RS 35%, DEX 42%. To detect a hypothesized ±10% difference in cognitive decline by current hormone use: CS and DEX 7%, RS 6%.

How many additional measurements are needed when tests are administered every 2 years, i.e. how many more years of follow-up are needed? To detect the observed 70% difference in cognitive decline by GRAD with 90% power: CS, DEX, RS: 3 post-baseline measurements, i.e. 6 years of follow-up, or one more 5-year grant cycle. To detect a hypothesized ±20% difference in cognitive decline by current hormone use with 90% power: CS, DEX: 6 post-baseline measurements, i.e. 12 years of follow-up, or more than two 5-year grant cycles. N = 15,000 for these calculations.
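The search for the number of post-baseline measures can be sketched as a simple loop over r. The sketch below is an independent approximation, not the program's code. It assumes compound symmetry with σ² = 12 and ρ = 0.3, a fixed 2-year interval, exposure independent of age, and a two-sided 5% Wald test, and it uses a closed form for the GLS variance that holds only under compound symmetry. The function names are hypothetical.

```python
from statistics import NormalDist


def ldd_power_cs(n, r, s, pe, v_t0, beta3, sigma2, rho, alpha=0.05):
    """Approximate LDD power under compound symmetry (closed-form sketch)."""
    s1 = sum(range(r + 1))                  # sum of visit indices 0..r
    s2 = sum(j * j for j in range(r + 1))   # sum of squared visit indices
    a11 = (r + 1) / (sigma2 * (1 + r * rho))
    a12 = s * s1 / (sigma2 * (1 + r * rho))
    a22 = v_t0 * a11 + s * s * (s2 - rho * s1 ** 2 / (1 + r * rho)) / (sigma2 * (1 - rho))
    var_beta3 = a11 / (a11 * a22 - a12 ** 2) / (n * pe * (1 - pe))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(beta3) / var_beta3 ** 0.5 - z_crit)


def smallest_r(target, max_r=50, **design):
    """Return the smallest number of post-baseline measures reaching `target`."""
    for r in range(1, max_r + 1):
        if ldd_power_cs(r=r, **design) >= target:
            return r
    return None  # target not reached within max_r measures


# GRAD example: 70% difference in a decline of about -0.098 points per year.
design = dict(n=15000, s=2, pe=0.062, v_t0=4,
              beta3=0.7 * (-0.006 * 32.7 / 2), sigma2=12, rho=0.3)
r_needed = smallest_r(0.90, **design)
print(r_needed)
```

Under these assumptions the loop stops at r = 3, in line with the 3 post-baseline measurements quoted above for the GRAD comparison.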

How many more measurements should be taken in four years (one NIH grant cycle) and eight years (two NIH grant cycles) of follow-up? Values are listed by duration of follow-up (4 years, 8 years). To detect the observed 70% difference in cognitive decline by GRAD with 90% power: CS 8, 1; DEX 10; RS. To detect a hypothesized ±20% difference in cognitive decline by current hormone use with 90% power: CS >50, 11; DEX 17; RS 13.

Optimize (N,r) in a new study of cognitive decline: Assume 4 years of follow-up (one NIH grant cycle), with recruitment and baseline measurements costing twice as much as subsequent measurements. GRAD: (N,r) = (26,795; 1) under CS, (26,930; 1) under DEX, (28,945; 1) under RS. CURRHORM: (N,r) = (97,662; 1) under CS, (98,155; 1) under DEX, (105,470; 1) under RS.

Conclusions Re: Constant Mean Difference (CMD)

Conclusions CMD: If all observations have the same cost, one would not take repeated measures. If subsequent measures are cheaper, one would take no repeated measures or just a small number if the correlation between measures is large. If deviations from CS exist, it is advisable to take more repeated measures. Power increases as Var(t0) goes to 0.

Conclusions LDD: If the follow-up period is not fixed, choose the maximum length of follow-up possible (except when RS is assumed). If the follow-up period is fixed, one would take more than one repeated measure only when the subsequent measures are more than five times cheaper. When there are departures from CS, cost ratios of around 10 or 20 are needed to justify taking 3 or 4 measures. Power increases as slope reliability goes to 0, as Var(t0) increases, and as the correlation between t0 and exposure goes to 0.

Conclusions LDD: The optimal (N,r) and the resulting power can strongly depend on the correlation structure. Combinations that are optimal for one correlation may be bad for another. All these decisions are based on power considerations alone. There might be other reasons to take repeated measures. Sensitivity analysis. Our program.

Future work: Develop formulas for time-varying exposure. Include dropout: for sample size calculations, simply inflate the sample size by a factor of 1/(1-f), where f is the dropout fraction. However, dropout can alter the relationship between N and r.
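The inflation rule is simple enough to show directly. The sketch below assumes f is the expected fraction of participants lost to dropout; the function name and the example numbers are hypothetical.

```python
import math


def inflate_for_dropout(n_required, f):
    """Inflate a required sample size by 1/(1-f) for an expected dropout fraction f."""
    if not 0 <= f < 1:
        raise ValueError("dropout fraction f must satisfy 0 <= f < 1")
    return math.ceil(n_required / (1 - f))


# Example: 15,000 completers needed, 25% dropout expected.
n_inflated = inflate_for_dropout(15000, 0.25)
print(n_inflated)
```

This only rescales N. It does not capture how dropout changes the trade-off between N and r, which is the caveat raised above.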