Multilevel Analysis Kate Pickett Senior Lecturer in Epidemiology.



Perspective
Health researchers:
- Are interested in answering research questions (not maths)
- Want to be able to apply statistical techniques
- Want to be able to interpret results
- Want to be able to communicate with consumers and statisticians

Aims for this session
- Understand the rationale for multilevel analysis
- Understand common terminology
- Interpret output from multilevel models
- Be able to read and critically appraise studies using multilevel models

Context and composition
Studying populations (groups) and individuals
From: Rose G. Sick individuals and sick populations. Int J Epidemiol 1985;14:32-38.

Levels of analysis
Health researchers may use data collected at the level of:
- Individuals, patients
- Families or other social groupings
- Clinics or hospitals
- Small areas, neighbourhoods
- Large populations

[Figure: Population A and Population B]
How is Population A different from Population B?

Ecological studies
- Data are aggregated and represent a group, rather than an individual, e.g.:
  - incidence rate of an illness
  - prevalence of a particular health service
- We don’t know which particular individuals within the group were ill or received the service
- These group-based outcome measures are analyzed by correlating them with determinants measured for the same groups

Source: Pickett KE, Kelly S, Brunner E, Lobstein T, Wilkinson RG. Wider income gaps, wider waistbands? An ecological study of obesity and income inequality. J Epidemiol Community Health 2005;59:670–674.

The ecological fallacy
- Associations at the group level may not hold at the individual level
  - E.g., we might see that rates of obesity are correlated internationally with per capita calorie intake
  - But we don’t know whether it is the obese individuals who are eating all the calories
- Many group-level variables are correlated, so we may get spurious correlations
  - E.g., obesity rates may also be correlated with the number of zoos per capita or some other completely unrelated factor
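To make the fallacy concrete, here is a minimal sketch with invented numbers (the three populations, the variable names and all values are hypothetical, not data from the lecture). The group means are perfectly positively correlated, while within every group the individual-level relationship runs in the opposite direction.

```python
# Hypothetical data: three populations, three individuals each.
# x = calorie intake, y = body-mass score (both in arbitrary, made-up units).
groups = {
    "A": ([1, 2, 3], [3, 2, 1]),
    "B": ([4, 5, 6], [6, 5, 4]),
    "C": ([7, 8, 9], [9, 8, 7]),
}

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Ecological (group-level) analysis: correlate the group means only.
group_means_x = [mean(x) for x, _ in groups.values()]
group_means_y = [mean(y) for _, y in groups.values()]
ecological_r = pearson(group_means_x, group_means_y)

# Individual-level analysis within each group.
within_r = {name: pearson(x, y) for name, (x, y) in groups.items()}

print("group-level correlation:", ecological_r)
print("within-group correlations:", within_r)
```

Reading only the group-level correlation would suggest that individuals with higher intake have higher scores, which is the opposite of what holds inside each group in this toy data.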

The atomistic fallacy
- The ecological fallacy has a flip side
- Factors that affect outcomes in individuals may not operate in the same way at the population level
  - E.g., teenage births are more common among the poor, but teenage birth rates are very high in some very wealthy countries

Example of teenage births
Source: Pickett KE, Mookherjee S, Wilkinson RG. Adolescent Birth Rates, Total Homicides, and Income Inequality in Rich Countries. AJPH 2005;95:

Ecological variables
- Sometimes ecological studies are done because they are quick and easy
- Sometimes ecological studies are the best design for the research question, because some determinants are “ecological”:
  - Population density
  - Air quality/pollution
  - GNP
  - Income inequality
  - % unemployed
  - Ambient temperature

Context and composition
But what if we are interested in both types of variables (individual and population) simultaneously?
E.g., we might want to know about the effect of population-level unemployment on health, above and beyond the health impact of being unemployed for any given individual

Multilevel models

Introduction to multilevel models
Also called:
- Hierarchical models
- Mixed effects models
- Random effects models

Background
- Developed in education research
- Observations of students in a single class are not independent of one another
- “Standard” statistical models assume that observations are independent
- Two-level hierarchy: students within classes
- Three-level hierarchy: students within classes within schools
- Four-level hierarchy: students within classes within schools within local authority areas

Health research context
- Patients within a medical practice
- Residents within neighbourhoods
- Subjects within trial clusters
- Hospitals within PCTs...

Examples for class
- Some examples are drawn from Twisk JWR, “Applied Multilevel Analysis”, Cambridge University Press, 2006
- Example data are available at:
- Research question: what is the relationship between total cholesterol and age?
- Statistical software: Stata (but note that MLwiN is free to UK academics: ad/index.shtml)

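Simple linear regression
$\text{Total cholesterol} = \beta_0 + \beta_1 \times \text{age} + \varepsilon$

Simple linear regression, adding a categorical variable
$\text{Total cholesterol} = \beta_0 + \beta_1 \times \text{age} + \beta_2 \times \text{gender} + \varepsilon$

Simple linear regression, adding another variable (doctor)
$\text{Total cholesterol} = \beta_0 + \beta_1 \times \text{age} + \beta_2 \times \text{MD}_1 + \beta_3 \times \text{MD}_2 + \beta_4 \times \text{MD}_3 + \beta_5 \times \text{MD}_4 + \dots + \beta_m \times \text{MD}_{m-1} + \varepsilon$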

Multilevel analysis
- Instead of estimating all those separate intercepts, we estimate their variance
- In our example that means estimating 1 additional parameter, rather than 11
- We are allowing the intercept to be random (random effects modelling)
- An efficient way of correcting for a variable with many categories
- Trade-off: assumes that the different intercepts are normally distributed
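One common way to write the random-intercept version of the cholesterol model is sketched below. The subscripts (patient i within doctor j) and the symbols $u_j$, $\sigma_u^2$ and $\sigma_e^2$ are notation assumed here for illustration; they do not appear on the slides.

```latex
\begin{aligned}
\text{Total cholesterol}_{ij} &= \beta_0 + \beta_1 \times \text{age}_{ij} + u_j + \varepsilon_{ij}, \\
u_j &\sim N(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim N(0, \sigma_e^2)
\end{aligned}
```

Here $u_j$ is doctor j's deviation from the overall intercept $\beta_0$. The 11 doctor dummy coefficients are replaced by the single variance parameter $\sigma_u^2$, at the cost of assuming the doctor intercepts are normally distributed.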

Example data
Cholesterol dataset:
- 441 patients
- Age (years)
- Cholesterol (mmol/l)
- 12 doctors

Non-multilevel regression (example using Stata)
. regress cholesterol age
[Stata output: analysis-of-variance table (Model, Residual, Total) with F(1, 439), R-squared, Adj R-squared and Root MSE, followed by the coefficient table (Coef., Std. Err., t, P>|t|, 95% Conf. Interval) for age and _cons; numeric values not preserved]

Multilevel model in Stata
. xtmixed cholesterol age || doctor:, ml var
Mixed-effects ML regression
Number of obs = 441
Group variable: doctor
Number of groups = 12
Obs per group: min = 36, avg = 36.8, max = 39
[Stata output: iteration log, log likelihood, Wald chi2(1), fixed-effects table (Coef., Std. Err., z, P>|z|, 95% Conf. Interval) for age and _cons, random-effects parameters var(_cons) and var(Residual) for doctor: Identity, and the LR test vs. linear regression (chibar2(01)); numeric values not preserved]

Do we need the multilevel model?
Likelihood ratio test:
- Compare -2 log likelihood of the model with a random intercept to -2 log likelihood of the ordinary linear model
- The difference has a chi-square distribution with df = difference in the number of parameters estimated
- Difference = [value not preserved], highly significant
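The comparison can be sketched in a few lines of Python. The function name and the two log-likelihood values in the usage line are hypothetical, since the lecture's numbers did not survive in this transcript. The sketch reports both the plain 1-df chi-square p-value described above and the halved p-value that corresponds to the `chibar2(01)` statistic Stata prints, which accounts for a variance not being able to go below zero.

```python
import math

def lr_test_random_intercept(loglik_ordinary, loglik_multilevel):
    """Likelihood ratio test for adding one random-intercept variance.

    Returns (lr_statistic, p_chi2_1df, p_chibar2_01).
    """
    # Difference in -2 log likelihood between the two nested models.
    lr = max(2.0 * (loglik_multilevel - loglik_ordinary), 0.0)
    # Upper tail of a chi-square with 1 df: P(X > lr) = erfc(sqrt(lr / 2)).
    p_chi2 = math.erfc(math.sqrt(lr / 2.0))
    # 50:50 mixture of chi-square(0) and chi-square(1), as in chibar2(01).
    p_chibar = 1.0 if lr == 0.0 else 0.5 * p_chi2
    return lr, p_chi2, p_chibar

# Hypothetical log likelihoods, for illustration only.
lr, p_chi2, p_chibar = lr_test_random_intercept(-500.0, -490.0)
print(f"LR = {lr:.2f}, p (chi2, 1 df) = {p_chi2:.2g}, p (chibar2) = {p_chibar:.2g}")
```

A small p-value suggests that the between-doctor variance is not zero, so the random intercept is worth keeping.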

Model parameters
Effects of age in each model:
- Coefficient in ordinary model = [value not preserved]
- Coefficient in multilevel model = [value not preserved]
- 95% CI in ordinary model (0.0428, [value not preserved])
- 95% CI in multilevel model (0.0435, 0.0556)
- Age is significant in both models

Intraclass correlation coefficient
- This measures how dependent the observations are within clusters
  - E.g., how correlated are the observations of patients belonging to the same doctor?
- Defined as: variance between clusters / total variance
- The smaller the variance within clusters, the greater the ICC
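In symbols, and assuming the total variance is simply the sum of the between-cluster and within-cluster parts (as in a random-intercept model), the definition can be written as follows. The symbols $\sigma_u^2$ (between-cluster variance) and $\sigma_e^2$ (within-cluster variance) are notation chosen here, not taken from the slides.

```latex
\text{ICC} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
```

For a fixed $\sigma_u^2$, shrinking $\sigma_e^2$ shrinks the denominator, which is why a smaller within-cluster variance gives a larger ICC.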

ICC (a)
Distribution of an outcome variable
Assume that the total variance = 10

ICC (b)
ICC is low because:
- Variance within groups is high (9)
- Variance between groups is low (1)
- Numerator is small, relative to denominator
ICC = 1/10 = 0.1

ICC (c)
The groups are now more spread out, more different, and ICC is bigger because:
- Variance within groups is lower (5)
- Variance between groups is higher (5)
ICC = 5/10 = 0.5

ICC (d)
The groups are now completely different, and ICC is maximised because:
- Variance within groups is minimal (1)
- Variance between groups is maximal (9)
- Numerator is large, relative to denominator
ICC = 9/10 = 0.9
MUCH MORE DEPENDENCE WITHIN CLUSTER: each observation provides less unique information
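The three scenarios can be reproduced with a small helper. This is only a sketch; the function name `icc` is an assumption made for the example, and the variance pairs are the ones quoted in panels (b) to (d).

```python
def icc(var_between, var_within):
    """Intraclass correlation: between-cluster variance over total variance."""
    return var_between / (var_between + var_within)

# (between, within) variance pairs from panels (b), (c) and (d); total is 10 each time.
scenarios = {"b": (1, 9), "c": (5, 5), "d": (9, 1)}
results = {panel: icc(between, within) for panel, (between, within) in scenarios.items()}

for panel, value in results.items():
    print(f"ICC ({panel}) = {value:.1f}")
```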

Impact on significance tests
[Table: alpha values under different conditions of sample size and intraclass correlation coefficient; values not preserved]
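One common approximation for why clustering distorts significance tests is the design effect, 1 + (m - 1) × ICC, where m is the average cluster size. The sketch below uses that approximation, which is not taken from the lecture's table, and the ICC values are hypothetical. The 441 patients and the average of 36.8 patients per doctor come from the example data.

```python
def design_effect(mean_cluster_size, icc):
    """Approximate variance inflation due to clustering (equal-sized clusters assumed)."""
    return 1.0 + (mean_cluster_size - 1.0) * icc

def effective_sample_size(n_total, mean_cluster_size, icc):
    """Number of independent observations carrying the same information."""
    return n_total / design_effect(mean_cluster_size, icc)

# 441 patients, on average 36.8 per doctor; the ICC values are hypothetical.
for icc_value in (0.0, 0.05, 0.1, 0.3):
    n_eff = effective_sample_size(441, 36.8, icc_value)
    print(f"ICC = {icc_value:.2f}: design effect = {design_effect(36.8, icc_value):.2f}, "
          f"effective n = {n_eff:.0f}")
```

Under this approximation, an ICC of 0.1 would make the 441 patients carry roughly as much information as 96 independent ones, so a test that treats them as independent overstates its precision.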

ICC in our example
- ICC = between-doctor variance / total variance
- ICC = [values not preserved]
- [value not preserved]% of the total individual differences in cholesterol are at the doctor level

ICC
When ICC is high:
- Evidence of a contextual effect on the outcome
- Evidence of differences in composition between the clusters
- Explore by including explanatory variables at each level
When ICC is low:
- No need for a multilevel analysis

Back to unemployment example

Data structure
[Figure: Population A and Population B; red = unemployed]

An ordinary regression model
$\text{Health} = b_0 + b_1(\text{unemployed}) + b_2(\%\ \text{unemployed}) + e$
e represents the effect of all omitted variables and measurement error and is assumed to have a random effect (so it gets ignored)

Data structure
[Figure: Population A and Population B]
Aside from unemployment, subjects in A are different from B in other ways: composition (shape, size), context (density)

A multi-level regression model
i = individual, j = context:
$y_{ij} = b x_{ij} + B X_j + E_j + e_{ij}$
$\text{Health}_{ij} = b(\text{unemployed}_{ij}) + B(\%\ \text{unemployed}_j) + E_j + e_{ij}$
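To see what the two error terms do, the sketch below simulates data from a model of this form and then recovers the share of unexplained variance that sits at the context level. Everything here is hypothetical: the coefficient values, the standard deviations, the number of contexts and the function names are chosen for illustration. The fixed part is removed using the known simulated coefficients rather than by fitting a model.

```python
import random

def simulate_two_level(n_contexts=200, n_per_context=30, b=-1.0, B=-0.05,
                       sd_context=1.0, sd_individual=2.0, seed=1):
    """Draw data from y_ij = b*x_ij + B*X_j + E_j + e_ij (all values hypothetical)."""
    rng = random.Random(seed)
    data = []  # one list of (x_ij, X_j, y_ij) tuples per context
    for _ in range(n_contexts):
        pct_unemployed = rng.uniform(2.0, 30.0)  # X_j: context-level variable
        E_j = rng.gauss(0.0, sd_context)         # shared by everyone in context j
        rows = []
        for _ in range(n_per_context):
            unemployed = 1 if rng.random() < pct_unemployed / 100.0 else 0  # x_ij
            e_ij = rng.gauss(0.0, sd_individual)  # individual-level error
            y = b * unemployed + B * pct_unemployed + E_j + e_ij
            rows.append((unemployed, pct_unemployed, y))
        data.append(rows)
    return data

def anova_icc(clusters):
    """One-way ANOVA estimate of the ICC for equally sized clusters."""
    k, n = len(clusters), len(clusters[0])
    cluster_means = [sum(c) / n for c in clusters]
    grand_mean = sum(cluster_means) / k
    ms_between = n * sum((m - grand_mean) ** 2 for m in cluster_means) / (k - 1)
    ms_within = sum((v - m) ** 2
                    for c, m in zip(clusters, cluster_means) for v in c) / (k * (n - 1))
    var_between = max((ms_between - ms_within) / n, 0.0)
    return var_between / (var_between + ms_within)

b, B = -1.0, -0.05
data = simulate_two_level(b=b, B=B)
# Subtract the fixed part with the known coefficients, leaving E_j + e_ij.
random_part = [[y - b * x - B * X for x, X, y in rows] for rows in data]
icc_hat = anova_icc(random_part)
print(f"estimated ICC of the random part: {icc_hat:.3f} (simulated value: 0.2)")
```

With a context-level standard deviation of 1 and an individual-level standard deviation of 2, the simulated ICC is 1 / (1 + 4) = 0.2, and the estimate should land near that value.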

What does this mean for critical appraisal of the health literature?
- When data are hierarchical or multi-level by nature, they should be analysed appropriately
- The coefficients or odds ratios from the models can be interpreted as usual
- The ICC shows how much variance in the outcome occurs between the higher-level contexts
- If appropriate methods are not used, standard errors and significance tests may be wrong and coefficients biased

A summary
Ecological studies
- Appropriate when the research question concerns only ecological effects
- Ecological fallacy may be a problem
Individual-level studies
- Appropriate when the research question concerns only individual-level effects
- Atomistic fallacy may be a problem
Multi-level studies
- Appropriate when the research question concerns both context and composition of populations