Nuoo-Ting (Jassy) Molitor 1 Chris Jackson 2 With Nicky Best, Sylvia Richardson 1 1 Department of Epidemiology and Public Health Imperial College, London.


Nuoo-Ting (Jassy) Molitor (1), Chris Jackson (2), with Nicky Best and Sylvia Richardson (1)
(1) Department of Epidemiology and Public Health, Imperial College, London
(2) MRC Biostatistics Unit, Cambridge
Bayesian graphical models for combining mismatched administrative and survey data: application to low birth weight and water disinfection by-products

BIAS Project: "Bayesian methods for integrated bias modelling and analysis of multiple data sources"
- Observational data in social sciences / epidemiology
- Account for common biases
  - ...especially by using multiple data sources
  - Bayesian graphical models
Outline of talk
- (10 mins) Overview of graphical models for observational data biases (CJ)
- (20 mins) Case study: combining birth register, survey and census data to study effects of water disinfection by-products on risk of low birth weight (Jassy Molitor)

[Diagram: the relationship ("?") between PREDICTOR and OUTCOME, studied in observed individuals drawn from the population of interest. Biases shown: selection bias (by design), non-response (accidental), confounding (by design), missing data (accidental).]

[Diagram: as on the previous slide, with MEASUREMENT ERROR added: the observed PREDICTOR and OUTCOME may differ from their true values.]

Graphical model
- More general than a multilevel model
- As well as hierarchical structures (groups of groups of individuals), it can express any relationship between known or unknown quantities
- Represented by a graph with nodes and links
[Figure: example graph with nodes W, X, Y, Z; example: genotypes of parents linked to genotypes of children]

Advantages of graphical models
- Mathematical: use network structure to build a joint probability distribution for known and unknown quantities.

Joint distributions and graphical models
Use ideas from graph theory to:
- represent the structure of a joint probability distribution...
- ...by encoding conditional independencies
Factorization theorem: the joint distribution is $P(V) = \prod_{v \in V} P(v \mid \mathrm{parents}[v])$
[Figure: directed graph with nodes A, B, C, D, E, F]
$P(A,B,C,D,E,F) = P(A \mid C)\, P(B \mid D,E)\, P(C \mid D,E)\, P(D)\, P(E)\, P(F \mid D,E)$
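The factorization on this slide can be checked numerically. The sketch below is illustrative only: it assumes all six nodes are binary and fills the conditional probability tables with random numbers (the function names and the tables are hypothetical, not from the talk). It builds the joint distribution as the product of P(v | parents[v]) and confirms that the 64 joint probabilities sum to 1.

```python
from itertools import product
import random

# Parent sets read off the slide's factorization:
# P(A,B,C,D,E,F) = P(A|C) P(B|D,E) P(C|D,E) P(D) P(E) P(F|D,E)
PARENTS = {
    "A": ("C",),
    "B": ("D", "E"),
    "C": ("D", "E"),
    "D": (),
    "E": (),
    "F": ("D", "E"),
}


def random_cpts(parents, seed=0):
    """Hypothetical conditional probability tables for binary nodes.

    cpts[v][parent_values] holds P(v = 1 | parents = parent_values).
    """
    rng = random.Random(seed)
    return {
        v: {pv: rng.random() for pv in product((0, 1), repeat=len(ps))}
        for v, ps in parents.items()
    }


def joint_probability(assignment, parents, cpts):
    """Multiply P(v | parents[v]) over all nodes v."""
    prob = 1.0
    for v, ps in parents.items():
        p_one = cpts[v][tuple(assignment[p] for p in ps)]
        prob *= p_one if assignment[v] == 1 else 1.0 - p_one
    return prob


cpts = random_cpts(PARENTS)
nodes = sorted(PARENTS)
# Sum the joint probability over all 2**6 configurations.
total = sum(
    joint_probability(dict(zip(nodes, values)), PARENTS, cpts)
    for values in product((0, 1), repeat=len(nodes))
)
print(total)  # equals 1 up to floating-point rounding
```

Only the local tables are specified; the graph structure is what turns them into a valid joint distribution.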

Advantages of graphical models
- Mathematical: use network structure to build a joint probability distribution for known and unknown quantities.
- Modelling: easy to represent real-world complexity as a fusion of simpler sub-models.

Building complex models
Conditional independence provides the mathematical basis for expressing a large system as a fusion of smaller components.
[Figure: directed graph with nodes A, B, C, D, E, F]

Building complex models
Conditional independence provides the mathematical basis for expressing a large system as a fusion of smaller components.
[Figure: the same graph broken into smaller sub-graphs over the nodes A to F]

Advantages of graphical models
- Mathematical: use network structure to build a joint probability distribution for known and unknown quantities.
- Modelling: easy to represent real-world complexity as a fusion of simpler sub-models.
- Inference: Bayesian; unknown quantities have probability distributions, updated as data arrive. Uncertainties are propagated through the model.
- Computational: allow efficient algorithms for estimating Bayesian posterior distributions.
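The phrase "updated as data arrive" can be illustrated with the simplest conjugate case. This is a minimal sketch, not the model used in the talk: it assumes a single unknown proportion with a Beta prior and binomial data, and the batch counts are made up for illustration.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior plus binomial counts."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical data: (low birth weight, normal weight) counts in two batches.
batches = [(3, 47), (5, 45)]

posterior = (1.0, 1.0)  # uniform Beta(1, 1) prior, an assumption
for lbw, normal in batches:
    posterior = update_beta(*posterior, lbw, normal)
    print(posterior, beta_mean(*posterior))
```

Each batch turns the current posterior into the prior for the next one. The graphical models in the talk apply the same principle to many linked unknowns at once.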

Simple example
[Graph: for each individual, EXPOSURE and OUTCOME are observed data; the Effect of exposure on outcome is unknown.]

Simple example
[Graph: as before, with an observed CONFOUNDER added for each individual.]

Simple example
[Graph: two groups share the unknown Effect. Individuals with complete data have observed EXPOSURE, CONFOUNDER and OUTCOME; individuals with missing data have observed EXPOSURE and OUTCOME but an unknown CONFOUNDER ("???").]

Simple example
[Graph: as before, with two unknown effects: the effect on outcome and the effect on confounder.]

[Graph: final version of the simple example. Individuals with complete data and individuals with missing data (unknown CONFOUNDER) are linked through the shared unknowns: effect on confounder and effect on outcome.]

Building complex models
Key idea: understand a complex system through a global model built from small pieces, each of which is
- comprehensible
- limited to only a few variables
- a representation of a different data source or bias

Case study Combining birth register, survey and census data to study effects of water disinfection by-products on risk of low birth weight

Example of combining different data sources: Chlorination Study
Environmental exposure: chlorine by-products (THMs)
[Diagram: chlorine reacts with natural organic matter and/or the chemical compound bromide to form organic and inorganic by-products: bromate, chlorite, haloacetic acids (HAA5) and total trihalomethanes (THMs).]
Outcome: low birth weight (LBW, birth weight < 2.5 kg), split by gestational age
- LBW: baby's birth weight is less than 2.5 kg
- LBWP (LBW and pre-term): LBW babies born before 37 weeks
- LBWF (LBW and full-term): LBW babies born at 37 weeks or later
Covariates:
- mother's race/ethnicity
- baby's sex
- mother's smoking status
- mother's age during the pregnancy

Available data sources related to the Chlorination Study: why do we need them?
- Administrative data (NBR): deal with the small % of LBW in the population and the inconclusive link between LBW and THMs
- Aggregate data: imputing missing covariates
- Survey data (MCS): adjust for important subject-level covariates; allows us to examine different types of LBW

Summary of data sources
NBR (national birth registry): administrative data (large); power, no selection bias
- Observed postcode
- Missing smoking and race/ethnicity
- Missing baby's gestational age
Aggregate data (UK)
- Observed postcode
- Census: region-level race/ethnicity composition
- Consumer survey (CACI): region-level tobacco expenditure
MCS (Millennium Cohort Study): survey data (subset of NBR); low power, selection bias
- Observed postcode
- Observed smoking and race/ethnicity
- Observed baby's gestational age

Building the sub-model: disease sub-model for MCS
m: subject index for MCS; r: region index
y: birth weight indicator (1: normal, 2: LBWP, 3: LBWF)
THM: THM (chlorine by-product) exposure
C: covariates such as race/ethnicity and smoking, which are missing in the NBR and only observed in the MCS
[Graph: nodes $y_{rm}$ (normal / LBWP / LBWF), $THM_{rm}$, $C_{rm}$ and the disease model parameters; the legend distinguishes known from unknown nodes.]
Multinomial logistic regression for MCS:
$y_{rm} \sim \mathrm{Multinomial}(p_{rm,1:3},\, 1)$
$\log(p_{rm,2} / p_{rm,1}) = b_{10} + b_{11}\, THM_{rm} + b_{12}\, C_{rm}$
$\log(p_{rm,3} / p_{rm,1}) = b_{20} + b_{21}\, THM_{rm} + b_{22}\, C_{rm}$
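The two log-odds equations determine all three category probabilities once category 1 (normal) is taken as the baseline. A minimal sketch of that inversion follows; the function name and the coefficient values are hypothetical and chosen only to make the example run.

```python
import math


def multinomial_logit_probs(thm, c, b1, b2):
    """Return (p1, p2, p3) for normal, LBWP and LBWF.

    b1 = (b10, b11, b12) and b2 = (b20, b21, b22) are the coefficients of
    the two log-odds equations; category 1 (normal) is the baseline.
    """
    eta2 = b1[0] + b1[1] * thm + b1[2] * c  # log(p2 / p1)
    eta3 = b2[0] + b2[1] * thm + b2[2] * c  # log(p3 / p1)
    denom = 1.0 + math.exp(eta2) + math.exp(eta3)
    return 1.0 / denom, math.exp(eta2) / denom, math.exp(eta3) / denom


# Hypothetical coefficients, for illustration only.
b1 = (-3.0, 0.1, 0.5)
b2 = (-3.5, 0.4, 0.9)
p1, p2, p3 = multinomial_logit_probs(thm=1, c=1, b1=b1, b2=b2)
print(p1, p2, p3)
```

Dividing by the common denominator guarantees that the three probabilities sum to 1 while preserving both log-odds relations.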

Building the sub-model: disease sub-model for NBR
n: subject index for NBR; r: region index
C: covariates such as race/ethnicity and smoking (missing in the NBR, but observed in the MCS)
LBWP and LBWF are missing in the NBR because gestational age is missing.
[Graph: nodes $y_{rn}$ (normal / LBWP / LBWF), $THM_{rn}$, $C_{rn}$ and the disease model parameters; the legend distinguishes known from unknown nodes.]
Multinomial logistic regression for NBR:
$y_{rn} \sim \mathrm{Multinomial}(p_{rn,1:3},\, 1)$
$\log(p_{rn,2} / p_{rn,1}) = b_{10} + b_{11}\, THM_{rn} + b_{12}\, C_{rn}$
$\log(p_{rn,3} / p_{rn,1}) = b_{20} + b_{21}\, THM_{rn} + b_{22}\, C_{rn}$

Missing outcome model: impute LBWP and LBWF for NBR
G-age: gestational age
[Graph: the NBR and MCS disease sub-models side by side, each with nodes y, THM, C and disease model parameters. Birth weight (BW) is normal or LBW; LBW splits into LBWP and LBWF according to G-age, which is missing in the NBR. The legend distinguishes known from unknown nodes.]

Building the sub-model: missing covariate model
Impute $C_{rn}$ using the aggregate data and the MCS data.
[Graph: aggregate data $A_r$ feeds both $C_{rn}$ (NBR) and $C_{rm}$ (MCS) through the missing covariate model parameters; the legend distinguishes known from unknown nodes.]
Because the missing covariates (race and smoking) are binary variables, we use a multivariate probit model to account for their correlation.

Multivariate probit model (Chib & Greenberg, 1998)
Race: 1 = non-white (Asian, Black, Other), 0 = white
Smoke: 1 = yes, 0 = no
Define underlying continuous variables (smoke*, race*):
Smoke = I(smoke* > 0) and Race = I(race* > 0)
[Equation image not preserved; its labelled terms are the correlation between smoke* and race*, and S, the sampling stratum, included to adjust for selection bias.]
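The latent-variable construction can be sketched as a simulation: draw correlated normal variables (smoke*, race*) and threshold each at zero. This is a sketch under stated assumptions, not the fitted model: the means, the correlation `rho` and the function name are hypothetical, and the regression on aggregate data and sampling stratum is omitted.

```python
import math
import random


def sample_smoke_race(mu_smoke, mu_race, rho, rng):
    """Draw (Smoke, Race) from a bivariate probit model.

    smoke* and race* are unit-variance normals with correlation rho;
    the binary covariates are the indicators I(smoke* > 0), I(race* > 0).
    """
    e1 = rng.gauss(0.0, 1.0)
    e2 = rng.gauss(0.0, 1.0)
    smoke_star = mu_smoke + e1
    # Mixing e1 into race* gives corr(smoke*, race*) = rho.
    race_star = mu_race + rho * e1 + math.sqrt(1.0 - rho ** 2) * e2
    return int(smoke_star > 0), int(race_star > 0)


rng = random.Random(1)
draws = [sample_smoke_race(-0.5, -1.0, 0.3, rng) for _ in range(5)]
print(draws)
```

The correlation lives on the latent scale, which lets two binary covariates be dependent without modelling a 2x2 table directly.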

Unified model
[Graph: the NBR disease sub-model ($y_{rn}$, $THM_{rn}$, $C_{rn}$, disease model parameters) and the MCS disease sub-model ($y_{rm}$, $THM_{rm}$, $C_{rm}$, disease model parameters), joined by the missing outcome model and by the missing covariate sub-model (aggregate data $A_r$, missing covariate model parameters). The legend distinguishes known from unknown nodes.]

1. Disease model (y = {1, 2, 3})
2. Missing outcome model
3. Missing covariates model (multivariate probit)
Notation:
- i: subject index
- N_m: group of subjects who had a missing outcome (y_miss)
- r: region
- u: index for the category of outcome
- y_obs: observed outcome
- X: observed covariates

Investigating the performance of the unified model
[Diagram: Y (1, 2, 3), C (0/1) and aggregate (census) data, linked by the missing covariate model and the missing outcome model.]
Good performance of the model depends on:
1. How well the aggregate data can inform C (the covariates)
2. How strongly C and Y are linked
The MCS data showed:
1. a strong association between the aggregate data and race and smoking
2. a strong association between race and smoking and Y

Simulation study
Step 1: Create data (N = 1333) under the scenarios: strong C-aggregate association; strong Y-C link.
Step 2: Randomly assign missing values: 50% for Y = 2 and Y = 3, and 80% for C.
Repeat step 2 to generate 10 replicate samples.
Step 3: Compare the prediction from an analysis using fully observed data (no imputation) with that from an analysis using partially observed data (imputation).
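Step 2 and the replication can be sketched as follows. The record layout, the function name and the toy data are assumptions made for illustration; in particular, a masked outcome is stored as `None`, standing for "low birth weight, sub-type unknown".

```python
import random


def assign_missing(records, p_y=0.5, p_c=0.8, seed=0):
    """Mask outcomes and covariates at random.

    records is a list of (y, c) pairs with y in {1, 2, 3}.
    Y is masked with probability p_y only when y is 2 or 3;
    C is masked with probability p_c for every record.
    """
    rng = random.Random(seed)
    masked = []
    for y, c in records:
        y_obs = None if (y in (2, 3) and rng.random() < p_y) else y
        c_obs = None if rng.random() < p_c else c
        masked.append((y_obs, c_obs))
    return masked


# Hypothetical complete data set of the same size as the study sample.
data_rng = random.Random(42)
complete = [
    (data_rng.choice((1, 1, 1, 1, 2, 3)), data_rng.choice((0, 1)))
    for _ in range(1333)
]

# Ten replicate samples, each with a different missingness pattern.
replicates = [assign_missing(complete, seed=s) for s in range(10)]
print(len(replicates), replicates[0][:5])
```

Using a different seed per replicate keeps the complete data fixed and varies only which values are hidden.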

Examining the missing outcome model: imputing Y
In this dataset, missing outcome data are always LBW, either pre-term or full-term (Y = 2 or Y = 3). Therefore, for missing outcome data, we wish to determine the conditional probabilities:
- Pr(Y = 2 | Y = 2 or 3, covariates): conditional probability of LBWP given LBW and covariates
- Pr(Y = 3 | Y = 2 or 3, covariates): conditional probability of LBWF given LBW and covariates
If we are to impute Y accurately, these probabilities must be accurately estimated.
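Given the category probabilities from the disease model, these conditional probabilities are a renormalisation over categories 2 and 3. The sketch below shows that step and a single imputation draw. The function names and the input probabilities are hypothetical, and in a full Bayesian fit the draw would happen inside the sampler, not as a separate step.

```python
import random


def conditional_lbw_probs(p2, p3):
    """Return Pr(Y=2 | Y in {2,3}) and Pr(Y=3 | Y in {2,3})."""
    total = p2 + p3
    return p2 / total, p3 / total


def impute_lbw_type(p2, p3, rng):
    """Draw a missing LBW sub-type: 2 (LBWP) or 3 (LBWF)."""
    q2, _ = conditional_lbw_probs(p2, p3)
    return 2 if rng.random() < q2 else 3


# Hypothetical category probabilities for one subject.
q2, q3 = conditional_lbw_probs(0.03, 0.01)
print(q2, q3)
print(impute_lbw_type(0.03, 0.01, random.Random(0)))
```

The probability of a normal birth weight drops out of the calculation, because a subject with a missing outcome is already known to be LBW.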

Examining the missing outcome model: imputing Y
[Figure: panels for S=0, R=0; S=0, R=1; S=1, R=0; S=1, R=1, under two settings. First setting: Y contains 50% missing values at categories 2 and 3; S and R are fully observed. Second setting (more challenging): Y contains 50% missing values at categories 2 and 3; S and R contain 80% missing values.]

Examining the missing covariate model: imputing C (smoke and race)
- One-level imputation (aggregate data inform C): C contains 80% missing values
- Two-level imputation (aggregate data and Y inform C): C contains 80% missing values; Y contains 50% missing values at categories 2 and 3
[Figure: panels for Y=1, Y=2, Y=3 showing the cell probabilities P00, P01, P10, P11 of the 2x2 table of Smoke (non-smoker / smoker) by Race (white / non-white).]

Real data analysis: United Utilities water company
Data restricted to:
- Singleton births
- Period: Sep 2000 to Aug 2001
Subjects: MCS 1333, NBR = , Total 9278
Missing % in race and smoke: ~85%
Missing % in outcome: ~7%
[Figure legend: completely observed information; missing race; missing smoke; missing outcome at levels 2 (LBWP) and 3 (LBWF).]

Real data analysis: United Utilities water company
Exposure variable: THMs
- Dichotomized into two groups:
  - low-medium exposure group (<= 60 µg/l): %
  - high exposure group (> 60 µg/l): %
- Estimated in a separate model for MCS and NBR (Whitaker et al., 2005)
In addition to race and smoke, we also adjust for the following, observed in both MCS and NBR:
- baby's sex
- mother's age

Models for real data analysis: standard (STATA) vs. Bayesian
a. Multinomial logistic regression model for MCS data: no imputation
b. Bayesian multiple bias model for combined NBR, MCS and aggregate data: impute missing outcome and covariates

Results for the real data analysis (low birth weight full-term vs. normal): OR (95% CI)

Data           | Model                        | Outcome | THMs      | Smoke    | Non-white
MCS (1333)     | Multinomial logistic (STATA) | LBWF    | 1.51 ( )  | 2.4 ( )  | 4.7 (2.2-10)
MCS+NBR (9278) | Bayesian multiple bias       | LBWF    | 2.13 ( )* | 2.6 ( )* | 6.9 ( )*

* 95% Bayesian credible interval
All parameter estimates are adjusted for baby's sex and mother's age.

Conclusion
There is evidence of an association between THM exposure and full-term low birth weight.
Combining the datasets can
- increase the statistical power of the survey data
- alleviate bias due to confounding in the administrative data
We must allow for the selection mechanism of the survey when combining data.

THANKS Mireille Toledano Mark Nieuwenhuijsen James Bennett Peter Hambly Daniela Fecht John Molitor

[Figure: results using one-level imputation, by Y=1, Y=2, Y=3; strong vs. weak C-aggregate association.]

[Figure: panels for S=0, R=0; S=0, R=1; S=1, R=0; S=1, R=1; strong vs. weak Y-C link; Y contains 50% missing values at categories 2 and 3; using one-level imputation.]

[Figure: two-level vs. one-level imputation, by Y=1, Y=2, Y=3, under three scenarios: strong C-aggregate and strong Y-C; weak C-aggregate and strong Y-C; strong C-aggregate and weak Y-C.]

[Figure: without cut function vs. with cut function.]

[Figure: without cut function vs. with cut function, using two-level imputation.]