Hierarchical models for combining multiple data sources measured at individual and small area levels Chris Jackson With Nicky Best and Sylvia Richardson.

Presentation transcript:

Hierarchical models for combining multiple data sources measured at individual and small area levels Chris Jackson With Nicky Best and Sylvia Richardson Department of Epidemiology and Public Health Imperial College, London BIAS project

Outline
- Infer some individual-level relationship, e.g. the influence of individual socio-economic circumstances on risk of ill health.
- Use a combination of datasets, individual and aggregate, to answer the question.
- Multi-level models on multi-level data.
Examples:
- Hospital admission for cardiovascular disease and socio-demographic factors
- Low birth weight and air pollution

Combining different forms of observational data
Aggregate (census, national registers, environmental monitors)
- Advantages: abundant, routinely collected; covers whole population; can study small-area variations
- Disadvantages: ecological bias; difficulty distinguishing individual from area-level effects; not many variables
Individual (surveys, cohort studies, case-control, census SAR)
- Advantages: direct information on exposure-outcome relationship; more variables available
- Disadvantages: low power; little geographical information (confidentiality)
Combined
- Advantages: reduce confounding and bias; maximise power; separate individual and area-level effects
- Disadvantages: conflicts between information from each

Example 1: Cardiovascular hospitalisation
Question
- Socio-demographic predictors of hospitalisation for heart and circulatory disease for individuals
- Is there any evidence of contextual effects (area-level as well as individual predictors)?
Design
- Data synthesis using
  - Area-level administrative data: hospital episode statistics and census small-area statistics
  - Individual-level survey data: Health Survey for England
Issue
- Reduce ecological bias and improve power, compared to using datasets singly.

Example 2: Low birth weight and pollution
Question
- Influence of traffic-related air pollution (PM10, NO2, CO) on risk of intrauterine growth retardation (→ low birth weight)
Design
- Data synthesis using two individual-level datasets
  - National births register (~600,000 births)
  - Millennium Cohort Study (~20,000 births)
Issue
- Geographical identifiers (→ pollution exposure), and outcome, available for both datasets
- Important confounders (maternal age, smoking, ethnicity...) only available in the small dataset. Combine to increase power.

Multilevel models for individual and area data
Most commonly used to model
- individual-level outcomes $y_{ij}$ (individual $j$, area $i$)
in terms of
- individual-level predictors $x_{ij}$
- group-level (e.g. area-level) predictors $x_i$
- Allow baseline risk (possibly also covariate effects) to vary by area:
  $y_{ij} \sim \alpha_i + \beta x_{ij} + \gamma x_i$
However
- We want to model area-level outcomes $y_i$ as well as individual outcomes $y_{ij}$
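A minimal sketch of the model on this slide, assuming a logistic link for a binary outcome (the slide shows only the linear predictor). The names `alpha_i`, `beta` and `gamma` for the area baseline, the individual-level effect and the area-level effect are illustrative, as are all numeric values.

```python
import math

def expit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def individual_risk(alpha_i, beta, gamma, x_ij, x_i):
    """Risk for individual j in area i under a varying-baseline logistic model.

    alpha_i : area-specific baseline (log odds)
    beta    : effect of the individual-level predictor x_ij
    gamma   : effect of the area-level predictor x_i
    """
    return expit(alpha_i + beta * x_ij + gamma * x_i)

# Illustrative values only (not estimates from the talk).
p_unexposed = individual_risk(alpha_i=-3.0, beta=0.7, gamma=0.2, x_ij=0, x_i=0.3)
p_exposed = individual_risk(alpha_i=-3.0, beta=0.7, gamma=0.2, x_ij=1, x_i=0.3)

# Within an area, the individual-level odds ratio equals exp(beta).
odds_ratio = (p_exposed / (1 - p_exposed)) / (p_unexposed / (1 - p_unexposed))
```

Within a single area the baseline and the area-level term cancel, so the odds ratio between exposed and unexposed individuals depends only on `beta`.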

Modelling the area-level outcome
[Diagram relating individual exposure $x_{ij}$, aggregate exposure $x_i$, individual outcome $y_{ij}$ and aggregate outcome $y_i$.]

Ecological inference
- Determining individual-level exposure-outcome relationships using aggregate data.
- A simple ecological model:
  $Y_i \sim \mathrm{Binomial}(p_i, N_i)$, $\mathrm{logit}(p_i) = \alpha + \beta X_i$
  - $Y_i$ is the number of disease cases in area $i$
  - $N_i$ is the population in area $i$
  - $X_i$ is the proportion of individuals in area $i$ with e.g. low social class
  - $p_i$ is the area-specific disease rate
- $\exp(\beta)$ = odds ratio associated with exposure $X_i$
- This is the group-level association. It is not necessarily equal to the individual-level association → ecological bias
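The simple ecological model can be written down as a log-likelihood. The following is a sketch only: the three areas of data are made up for illustration, and the function names are not from the talk.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def binomial_loglik(y, n, p):
    """Log probability of y cases out of n under Binomial(n, p)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return log_choose + y * math.log(p) + (n - y) * math.log(1.0 - p)

def ecological_loglik(alpha, beta, cases, populations, exposures):
    """Sum over areas of the binomial log-likelihood with logit(p_i) = alpha + beta * X_i."""
    total = 0.0
    for y_i, n_i, x_i in zip(cases, populations, exposures):
        p_i = expit(alpha + beta * x_i)
        total += binomial_loglik(y_i, n_i, p_i)
    return total

# Made-up data: case counts, populations and exposed proportions for three areas.
cases = [12, 30, 55]
populations = [1000, 1200, 1100]
exposures = [0.1, 0.4, 0.8]

ll_slope = ecological_loglik(-4.5, 2.0, cases, populations, exposures)
ll_flat = ecological_loglik(-3.5, 0.0, cases, populations, exposures)
```

In this toy data the disease rate rises with the exposed proportion, so the model with a positive slope has the higher log-likelihood. That slope is still only the group-level association.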

Ecological bias
Bias in ecological studies can be caused by:
- Confounding, as in all observational studies
  - Confounders can be area-level (between-area) or individual-level (within-area).
  - Solution: try to account for confounders.
- Non-linear exposure-response relationship, combined with within-area variability of exposure
  - No bias if exposure is constant in area (contextual effect)
  - Bias increases as within-area variability increases
  - ...unless models are refined to account for this hidden variability
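A small numerical illustration of the second source of bias, assuming a binary individual exposure and a logistic individual-level model. The parameter values and function names are illustrative and do not come from the talk.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def true_area_risk(alpha, beta, prop_exposed):
    """Average the individual logistic risks over a binary exposure within one area."""
    return (1.0 - prop_exposed) * expit(alpha) + prop_exposed * expit(alpha + beta)

def naive_area_risk(alpha, beta, prop_exposed):
    """Plug the exposed proportion straight into the logistic curve."""
    return expit(alpha + beta * prop_exposed)

alpha, beta = -2.0, 1.5  # illustrative values

# Gap between the averaged risk and the naive risk at several exposed proportions.
gaps = {x: true_area_risk(alpha, beta, x) - naive_area_risk(alpha, beta, x)
        for x in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

The gap is zero when everyone in the area has the same exposure (proportion 0 or 1) and is largest when exposure varies most within the area.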

Improving ecological inference
- Alleviate bias associated with within-area exposure variability.
- Get some information on the within-area distribution $f_i(x)$ of exposures, e.g. from individual-level exposure data.
- Use this to form a well-specified model for ecological data by integrating the underlying individual-level model.
  $Y_i \sim \mathrm{Binomial}(p_i, N_i)$, $p_i = \int p_{ik}(x) f_i(x)\,dx$
  - $p_i$ is the average group-level risk
  - $p_{ik}(x)$ is the individual-level model (e.g. logistic regression)
  - $f_i(x)$ is the distribution of exposure $x$ within area $i$ (or joint distribution of multiple exposures)
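A sketch of the integration step for a single continuous exposure. It assumes a logistic individual-level model and a Normal within-area exposure distribution, and approximates the integral with the midpoint rule. These choices are illustrative assumptions, not the authors' implementation.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def area_risk(alpha, beta, mean_i, sd_i, n_steps=2000):
    """Approximate p_i = integral of expit(alpha + beta * x) * f_i(x) dx.

    f_i is assumed Normal(mean_i, sd_i); the integral is truncated at
    mean_i +/- 8 sd_i and evaluated with the midpoint rule.
    """
    lower = mean_i - 8.0 * sd_i
    width = 16.0 * sd_i / n_steps
    total = 0.0
    for step in range(n_steps):
        x = lower + (step + 0.5) * width
        total += expit(alpha + beta * x) * normal_pdf(x, mean_i, sd_i) * width
    return total

# Illustrative values: same mean exposure, different within-area spread.
risk_no_spread = area_risk(alpha=-3.0, beta=1.0, mean_i=0.0, sd_i=1e-6)
risk_with_spread = area_risk(alpha=-3.0, beta=1.0, mean_i=0.0, sd_i=1.0)
```

With almost no within-area spread the averaged risk matches the risk at the mean exposure. With spread it is higher here, because the logistic curve is convex at low risks.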

When ecological inference can work
- Using a well-specified model
- Information on within-area distribution of exposure
  - Information, e.g. from a sample of individual exposures, to estimate the unbiased model that accounts for this distribution.
- High between-area contrasts in exposure
  - Information on the variation in outcome between areas with low exposure rates and high exposure rates
  - E.g. to determine ethnic differences in health, better to study areas in London (more diverse) than areas in a rural region.
When there is insufficient information in ecological data:
- May be able to incorporate individual-level exposure-outcome data...

Hierarchical related regression
Individual-level model
- Logistic regression for individual-level outcome
- Includes individual or area-level predictors
- Use this to
  - model the individual-level data
  - construct correct model for aggregate data
Model for aggregate data
- Based on averaging the individual model over the within-area joint distribution of covariates.
- Alleviates ecological bias.
Combined model
- Individual and aggregate data assumed to be generated by the same baseline and relative risk parameters.
- Estimate these parameters using both datasets simultaneously
- Infer individual-level relationships using both individual and aggregate data
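The combined model can be sketched as a joint log-likelihood in which the individual and aggregate parts share the same parameters. This is a simplified illustration, not the authors' implementation: it assumes one binary exposure, a single common baseline `alpha` rather than area-specific baselines, independence of the two datasets, and made-up data.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def individual_loglik(alpha, beta, individuals):
    """Logistic regression log-likelihood; individuals is a list of (x, y) pairs."""
    total = 0.0
    for x, y in individuals:
        p = expit(alpha + beta * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def aggregate_loglik(alpha, beta, areas):
    """Binomial log-likelihood for areas given as (cases, population, proportion exposed).

    The area risk averages the individual model over the within-area exposure
    distribution. The binomial coefficient is omitted because it does not
    depend on the parameters.
    """
    total = 0.0
    for y_i, n_i, x_i in areas:
        p_i = (1.0 - x_i) * expit(alpha) + x_i * expit(alpha + beta)
        total += y_i * math.log(p_i) + (n_i - y_i) * math.log(1.0 - p_i)
    return total

def combined_loglik(alpha, beta, individuals, areas):
    """Both datasets are assumed to be generated by the same alpha and beta."""
    return individual_loglik(alpha, beta, individuals) + aggregate_loglik(alpha, beta, areas)

# Made-up data, constructed to agree with risks of 0.1 (unexposed) and 0.2 (exposed).
individuals = [(0, 1)] * 10 + [(0, 0)] * 90 + [(1, 1)] * 20 + [(1, 0)] * 80
areas = [(120, 1000, 0.2), (150, 1000, 0.5), (180, 1000, 0.8)]

alpha_hat = math.log(0.1 / 0.9)  # logit of 0.1
beta_hat = math.log(2.25)        # log odds ratio between risks 0.2 and 0.1
best = combined_loglik(alpha_hat, beta_hat, individuals, areas)
```

Any optimiser or sampler could be applied to `combined_loglik`. Here the toy data were built so that both parts peak at the same parameter values.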

Combining ecological and case-control data
- If outcome is rare, individual-level data from surveys or cohorts will usually contain little information.
- Supplement ecological data with case-control data instead.
- Haneuse and Wakefield (2005) describe a hybrid likelihood for combination of ecological and case-control data
  - Even including individual data from the cases only can reduce ecological bias to acceptable levels.

Issues with combining data
- Some variables missing in one dataset
  - e.g. smoking, blood pressure available in survey but not administrative data
- Different but related information in each
  - e.g. self-reported disease versus hospital admission records
- Conflicts between datasets in information on what is nominally the same variable
  - e.g. self-completed and interviewed responses to surveys
- Ideally the individual and aggregate data are from the same source (e.g. census small-area and SAR)

Example: Cardiovascular disease (CVD)
AGGREGATE
- Hospital Episode Statistics: number of CVD admissions in area in 1998, by age group/sex
- Census small area statistics: marginal proportions non-white, social class IV/V, ...
- Census Samples of Anonymised Records (2%): full within-area cross-classification of individuals by age/sex/ethnicity/social class/car ownership; required for correct aggregate model
INDIVIDUAL
- Health Survey for England
  - Self-reported admission to hospital for CVD (1998 only)
  - Self-reported long-term CVD (1997, 1999, 1998, 2000, 2001) → multiple imputation for missing hospital admission
  - Individual age and sex, individual ethnicity, individual social class, individual car access
Baseline and relative risk of CVD admission for an individual

Are aggregate and individual data consistent?
[Figure: Health Survey for England data aggregated over districts, compared with census covariates or Hospital Episode Statistics data.]

Basic illustration of combining individual and aggregate data
[Diagram. Unknowns: area baseline risk $\alpha_i$; relative risk for individuals $\beta$. Data: individual survey data for areas $i$, individuals $j$ (exposure $x_{ij}$, e.g. individual social class; disease $y_{ij}$, CVD admission) and aggregate census data for areas $i$ (exposure $x_i$, e.g. proportion low social class; disease $y_i$, area admissions count).]

More complex models for disease, more confounders, need another data source.
[Diagram. Unknowns: area/stratum baseline risk (indexed by $ik$); relative risk for exposures. Data: individual survey data for areas $i$, individuals $j$ (exposures $x_{ij}$; disease $y_{ij}$, CVD admission); aggregate census data for areas $i$ by social class $r$, employment status $s$ and age/sex strata $k$ (exposures $x_{ir}$, $x_{is}$, $x_{ik}$; disease $y_{ik}$); Census Samples of Anonymised Records for areas $i$, individuals $l$ (exposures $x_{il}$), giving the cross-classification of individuals $x_{irsk}$.]

Imputing missing outcomes in individual data
[Diagram. Extends the previous diagram: survey data (1998) provide observed CVD admissions $y_{ij}$; survey data from the other years provide self-reported CVD, from which CVD admissions including imputed values $y_{ij}^*$ are obtained.]

Estimated coefficients (with 95% CI) for multiple regression model of the risk of hospitalisation
[Figure with panels: individual data only; aggregate data only; models combining individual and aggregated data.]

Individual and area-level predictors
- Area-level covariates in underlying model for hospitalisation risk (Carstairs deprivation index)
  - No significant influence of Carstairs, after accounting for individual-level factors
- Random effects models
- Random area-level baseline risk quantifies remaining variability between areas.
  - After adjusting for covariates, variance partitioned into individual / area-level components
  - 4% of residual variance between wards attributable to unobserved area-level factors (2% for districts)
- Little evidence of contextual effects
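The slide does not say how the residual variance was partitioned. One common convention for logistic random-intercept models, assumed here purely for illustration, is the latent-variable definition: the individual-level variance is fixed at $\pi^2/3$ (the variance of a standard logistic distribution) and the area share is the area variance divided by the total. The function names below are hypothetical.

```python
import math

# Variance of a standard logistic distribution (latent-variable convention).
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3.0

def variance_partition(area_variance):
    """Share of residual variance attributed to the area level."""
    return area_variance / (area_variance + LOGISTIC_RESIDUAL_VARIANCE)

def area_variance_for(share):
    """Area-level random-effect variance implied by a given share (inverse of the above)."""
    return share * LOGISTIC_RESIDUAL_VARIANCE / (1.0 - share)

# Random-effect variance on the log-odds scale implied by a 4% area share,
# under the assumed convention only.
ward_variance = area_variance_for(0.04)
```

Under this convention a 4% share corresponds to a small random-effect variance on the log-odds scale, in line with the slide's conclusion of little evidence of contextual effects.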

Example: Low birth weight and pollution
- Geographically complete individual dataset from national register, with exposure, outcome but not confounders
- Geographically sparse survey dataset with all variables.
→ missing data problem
- Impute missing covariates that are likely to be confounded with the pollution exposure.
- Information for this imputation
  - from aggregate data (e.g. ethnicity, from census)
  - from sparse survey dataset

[Diagram. Regression model of low birth weight on pollution and confounders. National register data (large): pollution, low birth weight, confounders sex and age; socioeconomic confounders unknown. Survey data (small): pollution, low birth weight, confounders sex, age, socioeconomic, smoking, ethnicity, maternal age etc. Aggregate census data: ethnicity.]

Parallel regression models
- Desire unbiased inference on the effect of the primary exposure.
- Available from small dataset with all confounders, but with low power.
- Information for imputation comes from small dataset or ecological data → is resulting uncertainty worth the precision gained?
- Work in progress, currently awaiting some data.

Summary
- Combining datasets can increase power and reduce bias, making use of strengths of each
- Problems may arise when data are incompatible or inconsistent.
- Bayesian hierarchical models useful in cases of conflicts.
  - All our methods can be implemented in WinBUGS
- More applied studies needed to demonstrate the utility of the approach.

Publications
- C. Jackson, N. Best, S. Richardson. Hierarchical related regression for combining aggregate and survey data in studies of socio-economic disease risk factors. Under revision, Journal of the Royal Statistical Society, Series A.
- C. Jackson, N. Best, S. Richardson. Improving ecological inference using individual-level data. Statistics in Medicine (2006) 25(12).
- C. Jackson, S. Richardson, N. Best. Studying place effects on health by synthesising area-level and individual data. Submitted.
- S. Haneuse and J. Wakefield. The combination of ecological and case-control data. Submitted.