Clustered or Multilevel Data


1 Clustered or Multilevel Data
What are clustered or multilevel data?
Why are multilevel data common in outcomes research?
What methods of analysis are available?
What are random versus fixed effects?
How does the N at each level affect model choice?
How does the study question affect model choice?

2 What are clustered data?
Gathering individual observations into larger groups does not create clustered data.
Individual observations from a simple random sample are never clustered.
Clustering is a result of the sampling or study design.
It usually arises from the stages or levels used to obtain the individual units of observation.

3 Examples of Clustered Data
Litters of puppies.
Pieces of leaves (several per leaf).
Interventions on institutions (e.g., schools).
TB cases and their contacts.
A survey stratified by county and census tract.
A sample of physicians and their patients.
Repeated measurements on individuals.

4 Clustered or Multilevel Data
Level 2 (cluster): physicians, schools, census tracts, leaves.
Level 1 (individual observation): patients, students, residents, leaf samples.
[Diagram: three level 2 units, each containing level 1 observations labeled (observation, unit). Unit #1 holds 1,1 2,1 3,1; unit #2 holds 1,2 2,2; unit #3 holds 1,3 2,3 3,3 4,3.]

5 “Cluster analysis” is a different topic: finds clusters in data
[Figure: scatter of data points, each marked x, in which cluster analysis would look for groups.]

6 Repeated Measures are also a Type of Clustered or Multilevel Data
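Level 2 (cluster): individual subjects.
Level 1 (individual observation): observations at different times.
[Diagram: three persons, each with observations labeled (time, person), starting at Time 1. Person #1 has 1,1 2,1 3,1; person #2 has 1,2 2,2; person #3 has 1,3 2,3 3,3 4,3.]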

7 Multilevel Data Are Common in Outcomes Research
Secondary data sets are often multilevel.
Patients are clustered within physicians, who are clustered within hospitals or clinics (hospital discharges).
National health surveys (NHIS, NHANES) are stratified probability surveys.
Health interventions often randomize institutions or geographic areas.
Health policy changes are applied at the geographic or institutional level.

8 Characteristics of Clustered Data
Measurements within clusters are correlated (e.g., measures on the same person are more alike than measurements across persons).
Variables can be measured at each level.
The variance of the outcome can be attributed to each level.
Standard statistical models and tests, which assume independent observations, give incorrect results.

9 Effects of Clustered Data
The independence and equal-variance assumptions of standard statistics do not hold.
Standard errors for statistical testing will be incorrect.
Regression models cannot be fit using methods that assume independence of observations.
For example, an ordinary least squares fit of the regression line is incorrect because it treats correlated observations as independent.
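The point about incorrect standard errors can be illustrated with a small simulation. This is a sketch under assumed settings (20 clusters of 20 observations, between- and within-cluster SDs both set to 1), none of which come from the PORT study. It compares the standard error a naive formula reports with the actual spread of the sample mean over repeated samples.

```python
import random
import statistics

def simulate_sample_mean(n_clusters, n_per_cluster, sd_between, sd_within, rng):
    """Draw one clustered sample; return (sample mean, naive SE of the mean)."""
    values = []
    for _ in range(n_clusters):
        cluster_effect = rng.gauss(0.0, sd_between)  # shared by every member of the cluster
        for _ in range(n_per_cluster):
            values.append(cluster_effect + rng.gauss(0.0, sd_within))
    # Naive SE: treats all observations as independent.
    naive_se = statistics.stdev(values) / len(values) ** 0.5
    return statistics.fmean(values), naive_se

rng = random.Random(12345)  # fixed seed so the run is repeatable
results = [simulate_sample_mean(20, 20, 1.0, 1.0, rng) for _ in range(500)]
means = [m for m, _ in results]
naive_ses = [se for _, se in results]

actual_sd_of_mean = statistics.stdev(means)     # how much the mean really varies
average_naive_se = statistics.fmean(naive_ses)  # what the naive formula claims

print(f"actual SD of the sample mean: {actual_sd_of_mean:.3f}")
print(f"average naive SE:             {average_naive_se:.3f}")
```

Under these assumed settings the naive standard error is several times too small, so tests based on it would be far too optimistic.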

10 Example of Multilevel Data with a Linear Outcome Variable
PORT study of type II diabetes patients' satisfaction with medical care.
Outcome: score from 14 questionnaire items.
Sample of 70 physicians (level 2 sample).
Sample of 1492 patients (level 1 sample).
Mean of 21.3 patients per physician, ranging from 5 to 45.
Two levels of covariates were considered: physician years in practice and specialty (level 2); patient age and gender (level 1).

11 Clustered/Multi-level Data Variance Outcome = Patient Satisfaction Score
Level 2: physicians (N=70). Example MD means: MD1 = 81, MD2 = 58, MD3 = 74.
Level 1: patients (N=1492). Example patient scores: 55, 61, 68, 74, 75, 79, 81, 85, 77.
Variance in the patient score divides into two parts: (1) the variance between physicians, $\sigma^2_B$, and (2) the variance within physicians, $\sigma^2_W$.
So the total variance is $\sigma^2_B + \sigma^2_W$.

12 Intraclass Correlation Coefficient
The intraclass correlation coefficient (ICC) is a measure of the correlation among the individual observations within the clusters.
It is calculated as the ratio of the between-cluster variance to the total variance: $\text{ICC} = \sigma^2_B / (\sigma^2_B + \sigma^2_W)$.

13 Intraclass Correlation Coefficient (ICC)
Example: MD1 mean = 81, MD2 mean = 58, MD3 mean = 74, with patient scores 58, 58, 74, 74, 74, 74, 81, 81, 81.
Take the extreme case where each MD's patients all have the same score, so there is no variance within physicians.
Then $\text{ICC} = \sigma^2_B / (\sigma^2_B + \sigma^2_W) = \sigma^2_B / (\sigma^2_B + 0) = 1$, a perfect correlation within the clusters.
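The ICC calculation can be sketched in a few lines of Python. The slides do not say which estimator was used; the code below uses a one-way ANOVA estimator as one common choice. The grouping of the nine example scores into three physicians is an assumption, chosen because it reproduces the stated MD means (81, 58, 74).

```python
from statistics import fmean

def anova_icc(clusters):
    """Estimate the ICC from a one-way ANOVA decomposition.

    clusters: list of lists, one inner list of scores per cluster.
    """
    k = len(clusters)
    sizes = [len(c) for c in clusters]
    n_total = sum(sizes)
    grand_mean = sum(sum(c) for c in clusters) / n_total
    means = [fmean(c) for c in clusters]

    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Effective cluster size when clusters have unequal sizes.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)
    var_between = max((ms_between - ms_within) / n0, 0.0)
    var_within = ms_within
    return var_between / (var_between + var_within)

# Assumed grouping of the example scores (reproduces means 81, 58, 74).
varied_scores = [[81, 85, 77], [55, 61], [68, 74, 75, 79]]
# Extreme case: every patient of an MD has the same score.
identical_scores = [[81, 81, 81], [58, 58], [74, 74, 74, 74]]

icc_varied = anova_icc(varied_scores)
icc_identical = anova_icc(identical_scores)
print(f"ICC with within-MD variation: {icc_varied:.2f}")
print(f"ICC with identical scores:    {icc_identical:.2f}")
```

With identical scores inside each physician, the within-cluster variance is zero and the ratio reduces to 1, matching the extreme case described above.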

14 Methods of Analyzing Multilevel Data
(1) Use a single measure per cluster (e.g., mean satisfaction score) as the outcome variable.
(2) Fit a model with indicator variables for each cluster (minus one).
(3) Fit a regression model with generalized estimating equations (GEE model).
(4) Fit a fixed effects conditional regression model.
(5) Fit a random effects regression model.

15 Choice of Analysis Model: Two Main Considerations
What is the research question?
How many observations are there at level 2, and how many level 1 observations are there per level 2 unit?

16 Choice of Analysis Model: The Research Question
Question 1: What is the relationship of patient age to the MD satisfaction score? (level 1 predictor)
Question 2: What is the relationship between MD years in practice and the score? (level 2 predictor)
Question 3: How much variation is there in the mean satisfaction score between MDs, adjusted for level 1 and level 2 predictors? (level 2 variance)

17 Method (1): Use mean satisfaction score for each physician as outcome
A single measure for each cluster is simple and easy to understand.
It loses information and power (N=70, not 1492).
It ignores the different variances of the cluster means when clusters are different sizes.
Individual-level variables can enter only as cluster means (e.g., mean patient age).
It answers only question 2 (MD years in practice), although mean patient age can be used as a predictor.
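As a rough sketch of method (1), the code below collapses patient rows to one mean per physician and then fits a simple least-squares line of mean score on MD years in practice. The records, including the years-in-practice values, are hypothetical and only illustrate the mechanics.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical records: (md_id, md_years_in_practice, patient_score)
records = [
    ("md1", 5, 81), ("md1", 5, 85), ("md1", 5, 77),
    ("md2", 20, 55), ("md2", 20, 61),
    ("md3", 12, 68), ("md3", 12, 74), ("md3", 12, 75), ("md3", 12, 79),
]

scores_by_md = defaultdict(list)
years_by_md = {}
for md_id, years, score in records:
    scores_by_md[md_id].append(score)
    years_by_md[md_id] = years

# One row per physician: the level 1 data collapse to a single mean.
md_rows = [(years_by_md[md], fmean(scores)) for md, scores in scores_by_md.items()]

def ols_slope(pairs):
    """Slope of a simple least-squares line through (x, y) pairs."""
    x_bar = fmean(x for x, _ in pairs)
    y_bar = fmean(y for _, y in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    return sxy / sxx

slope = ols_slope(md_rows)
print(len(records), "patient rows reduced to", len(md_rows), "physician rows")
print(f"change in mean score per year in practice: {slope:.2f}")
```

The drop from nine rows to three mirrors the loss of information noted above: each physician counts once, whether the mean rests on two patients or four.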

18 Method (2): Use dummy variable for each MD
A dummy variable represents each MD effect.
This treats each MD effect as equally well estimated, but some of the clusters are small (N=5, 7, 8, etc.).
If we had 70 MDs and only 200 patients, 69 dummy variables would use up too many degrees of freedom.
If we had only 10 MDs, it would be a good choice.
It can answer only question 1 (relationship of patient age to satisfaction score).
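The dummy-variable coding and the degrees-of-freedom concern can be sketched as follows. The helper names (`indicator_columns`, `residual_df`) and the MD identifiers are made up for this illustration, and the degrees-of-freedom count assumes a model with an intercept, the MD dummies, and one other predictor such as patient age.

```python
def indicator_columns(cluster_ids):
    """Build 0/1 indicator columns, omitting the first cluster as the reference."""
    levels = sorted(set(cluster_ids))
    kept = levels[1:]  # "minus one": the first level is the reference category
    rows = [[1 if cid == level else 0 for level in kept] for cid in cluster_ids]
    return kept, rows

def residual_df(n_obs, n_clusters, n_other_predictors):
    """Residual df after an intercept, (clusters - 1) dummies, and other predictors."""
    return n_obs - 1 - (n_clusters - 1) - n_other_predictors

md_ids = ["md1", "md1", "md1", "md2", "md2", "md3", "md3", "md3", "md3"]
names, rows = indicator_columns(md_ids)
print(names)                       # dummy columns kept
print(rows[0], rows[3], rows[5])   # one patient from each MD

# 70 MDs need 69 dummies; compare 200 patients with the full 1492.
print(residual_df(200, 70, 1), residual_df(1492, 70, 1))
```

With only 200 patients, more than a third of the available degrees of freedom go to the MD dummies, which is the problem the slide describes.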

19 Method (3): Regression with Generalized Estimating Equations (GEE)
Estimates the regression coefficients and the variance separately to account for clustering.
Gives the population-average effect of age on satisfaction (a "marginal model").
The analyst specifies the correlation structure within the clusters.
Answers questions 1 and 2 but not 3: variation in patient satisfaction between MDs is not modeled separately.

20 Specifying Correlation within Clusters for GEE Model
The most common assumption is a single correlation coefficient for all pairs of observations within a cluster, called a compound symmetry or exchangeable correlation structure.
Other assumptions about the correlation are possible (e.g., correlation that weakens with time or distance).
With robust standard errors, the GEE regression gives good estimates of the predictor coefficients even if the specified correlation structure is incorrect.
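As an illustration, for a cluster of three observations the two kinds of working correlation structure mentioned above could be written as follows. The symbol $\rho$ is notation assumed here rather than taken from the slides, and the first-order autoregressive form is only one example of a correlation that weakens with time.

```latex
R_{\text{exchangeable}} =
\begin{pmatrix}
1 & \rho & \rho \\
\rho & 1 & \rho \\
\rho & \rho & 1
\end{pmatrix},
\qquad
R_{\text{AR(1)}} =
\begin{pmatrix}
1 & \rho & \rho^{2} \\
\rho & 1 & \rho \\
\rho^{2} & \rho & 1
\end{pmatrix}
```

In the exchangeable form every pair of observations in the cluster shares the same correlation, while in the autoregressive form observations further apart in time are less correlated.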

21 Method (4): Use Conditional Regression Model with Fixed Effects
Looks within each MD to model the association between patient age and the score.
There is no coefficient for MD (it is "conditioned out").
A good choice if the number of MDs is large relative to the number of patients (e.g., 70 MDs, 200 patients).
Matched pairs are analyzed with conditional regression.
Answers question 1, but not 2 or 3.
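For a linear outcome, "looking within each MD" can be sketched by centering age and score on each MD's own means before computing the slope. This is a minimal illustration with hypothetical data, not the conditional regression routine of any particular package.

```python
from collections import defaultdict
from statistics import fmean

def within_cluster_slope(rows):
    """Slope of y on x using only within-cluster variation.

    rows: iterable of (cluster_id, x, y). Each x and y is centered on its
    own cluster mean, so anything constant within a cluster drops out.
    """
    by_cluster = defaultdict(list)
    for cid, x, y in rows:
        by_cluster[cid].append((x, y))
    sxy = sxx = 0.0
    for pairs in by_cluster.values():
        x_bar = fmean(x for x, _ in pairs)
        y_bar = fmean(y for _, y in pairs)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in pairs)
        sxx += sum((x - x_bar) ** 2 for x, _ in pairs)
    return sxy / sxx

# Hypothetical data: (md_id, patient_age, satisfaction_score)
patients = [
    ("md1", 50, 77), ("md1", 60, 81), ("md1", 70, 85),
    ("md2", 40, 55), ("md2", 60, 61),
    ("md3", 45, 68), ("md3", 55, 74), ("md3", 65, 75), ("md3", 75, 79),
]
slope = within_cluster_slope(patients)

# Shifting every score of one MD by a constant leaves the slope unchanged,
# which is why MD-level effects cannot be estimated this way.
shifted = [(cid, x, y + (100 if cid == "md2" else 0)) for cid, x, y in patients]
slope_shifted = within_cluster_slope(shifted)
print(f"within-MD slope: {slope:.3f}, after shifting md2: {slope_shifted:.3f}")
```

Because any MD-level constant cancels, this approach can address question 1 but says nothing about MD years in practice or the variation between MDs.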

22 Method (5): Use a Random Effects Regression Model
Includes predictor variables at both the individual and cluster levels.
Models the variance in patient satisfaction associated with MD separately from the variance within the clusters.
Improves the estimate of each MD effect by treating the MD mean scores as a random sample of scores.
The only model that answers all 3 questions.

23 Fixed versus Random Effects
Effects are random when the levels are a sample from a larger population. They vary because they were sampled; another sample would give different data.
Effects are fixed if they represent all possible levels or members of a population, e.g., male/female, treatment groups, or all the regions of the U.S.

24 Fixed versus Random Effects
Effects can often be considered fixed or random depending on the research question.
If you want to generalize from the sample of doctors to other doctors, you would consider the doctors as a random effect.
If the doctors in your sample are the only ones you care about, you could consider doctors as a fixed effect.

25 Random Effects Illustrated from the PORT Diabetes Study
In the MD satisfaction score example, begin by ignoring predictors such as the patients' age and the physicians' number of years in practice.
The overall mean patient satisfaction score for all 1492 patients was 67.7 (SD=23.5).
Separate means calculated for each physician's patients ranged from 53.4 to 87.1.

26 Random Effects: MD Score
Consider the satisfaction score as composed of two parts: the overall mean ($\mu$) plus or minus the difference from that overall mean of the mean score for each physician ($\alpha_j$).
Each MD's difference, $\alpha_j$, is a random effect because the 70 MDs represent a sample of possible MDs. If we sampled another 70, the $\alpha_j$'s would be different.

27 A Simple Random Effects Model
If we add a term for the error associated with each individual patient, the model is $y_{ij} = \mu + \alpha_j + e_{ij}$, where $\mu$ is the overall mean, $\alpha_j$ is the difference for MD $j$, and $e_{ij}$ is the individual error for patient $i$ of MD $j$.
The model says there is random variation from the mean score at the level of MDs (level 2) plus variation at the level of patients (level 1).

28 What does the random effects model do?
Actual MD means vary from 53.4 to 87.1, and the patient N for each MD varies from 5 to 45. Thus, the actual MD means are not very stable.
The random effects model assumes the MD mean scores come from an underlying normal distribution.
It uses the information from all the MDs and the characteristics of a normal distribution to estimate the "true" $\alpha_j$'s.

29 Estimating the Random Effects
In our example from the PORT study, the raw means range from 53.4 to 87.1.
Ordinary least squares estimates (a term for each MD, ANCOVA) range from 54.0 to 87.9.
The random effects estimates of the mean patient scores by MD range from 60.4 to 78.6, with an SD of 4.94.
So the random effects estimates are closer to the overall mean.
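The pull toward the overall mean can be sketched with the standard shrinkage formula for a random-intercept model when the variance components are treated as known. The slides do not state which estimator produced their numbers, so this is an illustration only. The within-MD variance, and the pairing of a raw mean of 87.1 with cluster sizes of 5 and 45, are assumptions.

```python
def shrunken_mean(raw_mean, n, overall_mean, var_between, var_within):
    """Pull a cluster's raw mean toward the overall mean.

    The weight on the raw mean is var_between / (var_between + var_within / n),
    so small clusters (noisy means) are pulled in more than large ones.
    """
    weight = var_between / (var_between + var_within / n)
    return overall_mean + weight * (raw_mean - overall_mean)

OVERALL_MEAN = 67.7  # overall mean reported in the slides
VAR_BETWEEN = 24.4   # assumed: the MD variance reported for the final model
VAR_WITHIN = 528.0   # assumed: roughly 23.5**2 minus 24.4

# Hypothetical MDs with the same raw mean but different numbers of patients.
small = shrunken_mean(87.1, 5, OVERALL_MEAN, VAR_BETWEEN, VAR_WITHIN)
large = shrunken_mean(87.1, 45, OVERALL_MEAN, VAR_BETWEEN, VAR_WITHIN)
print(f"n=5:  raw 87.1 -> shrunken {small:.1f}")
print(f"n=45: raw 87.1 -> shrunken {large:.1f}")
```

An MD with few patients is pulled most of the way to the overall mean, while an MD with many patients keeps an estimate closer to the raw mean.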

30 Adding MD and Patient Predictors to the Simple Model
We want to examine the effect of patient's age (level 1 variable) and MD years in practice (level 2 variable) on the satisfaction score.
Specify a regression model with 2 predictor variables and a random effect for the MD.
The score for each MD is modeled both by adjusting for patient's age and MD years in practice and by modeling the distribution of MD mean scores.
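One way to write the model just described is sketched below. The notation is assumed for illustration: $\mu$ now plays the role of the intercept, $\beta_1$ and $\beta_2$ are the coefficients for patient age and MD years in practice, $\alpha_j$ is the random effect for MD $j$, and $e_{ij}$ is the patient-level error.

```latex
y_{ij} = \mu + \beta_1\,\text{age}_{ij} + \beta_2\,\text{years}_{j} + \alpha_j + e_{ij},
\qquad
\alpha_j \sim N(0, \sigma^2_B), \quad e_{ij} \sim N(0, \sigma^2_W)
```

Patient age carries both indices because it varies within an MD, while years in practice carries only the MD index because it is constant for all of an MD's patients.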

31 Final Random (or Mixed) Effects Regression Model
Positive association with patient age ($\beta$ = 0.15, p=0.003): the satisfaction score goes up with age.
No association with MD years in practice (p=0.69).
Significant variance (24.4) in satisfaction score by MD (random effect).

32 Summary
Clustered data should not be analyzed with standard statistical methods and tests.
Reducing the outcome and predictors to one value per cluster is an option but loses information.
The choice among the remaining methods (dummy variables, conditional regression, GEE, or random effects) depends on the research question and on the number of observations at each level.

33 Summary
Research questions affect the choice of method.
If you care only about the predictors, GEE models are a good alternative.
If the question is about variation between clusters (level 2 variance), a model that produces random effects estimates is needed.
The number of clusters has to be large enough to estimate a random effect (N=30 or more).
A small number of clusters can be handled with dummy variables.

34 Data Set for Homework
CA hospitals CABG registry.
Patients (N=28,555) clustered within hospitals (N=80).
Binary outcome: alive or dead after 30 days.
Patient-level characteristics and hospital characteristics.
Use STATA to answer the questions [syntax for the models supplied].

