Graphical models for combining multiple data sources


1 Graphical models for combining multiple data sources
Nicky Best, Sylvia Richardson, Chris Jackson (Imperial College, BIAS node), with thanks to Peter Green

2 Outline
Overview of graphical modelling
Case study 1: Water disinfection byproducts and adverse birth outcomes
  Modelling multiple sources of bias in observational studies
Case study 2: Socioeconomic factors and limiting long term illness
  Combining individual and aggregate level data
  Simulation study
  Application to Census and Health Survey for England

3 Graphical modelling
Mathematics, Modelling, Algorithms, Inference

4 1. Mathematics
Key idea: conditional independence
X and Y are conditionally independent given Z if, knowing Z, discovering Y tells you nothing more about X:
P(X | Y, Z) = P(X | Z)

5 Example: Mendelian inheritance
[Figure: graph with nodes Z, X, Y]
Z = genotype of parents
X, Y = genotypes of 2 children
If we know the genotype of the parents, then the children’s genotypes are conditionally independent
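The Mendelian example can be checked numerically. The sketch below is illustrative only: it assumes a single gene with two alleles (A and a) and parents drawn independently with genotype frequencies 1/4, 1/2, 1/4, none of which is stated on the slide. It builds the joint distribution of (Z, X, Y) and compares P(X | Y, Z) with P(X | Z), and P(X | Y) with P(X).

```python
from fractions import Fraction
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")
# Assumed genotype frequencies for each parent (not from the slides).
PARENT_PRIOR = {"AA": Fraction(1, 4), "Aa": Fraction(1, 2), "aa": Fraction(1, 4)}

def child_dist(mother, father):
    """P(child genotype | parental genotypes) under Mendelian segregation."""
    dist = {g: Fraction(0) for g in GENOTYPES}
    for m_allele, f_allele in product(mother, father):
        # Each parent passes on one of its two alleles with probability 1/2.
        genotype = "".join(sorted(m_allele + f_allele))  # "aA" becomes "Aa"
        dist[genotype] += Fraction(1, 4)
    return dist

def joint_table():
    """Joint P(Z, X, Y) = P(Z) P(X | Z) P(Y | Z), with Z = (mother, father)."""
    table = {}
    for m, f in product(GENOTYPES, repeat=2):
        p_z = PARENT_PRIOR[m] * PARENT_PRIOR[f]
        cd = child_dist(m, f)
        for x, y in product(GENOTYPES, repeat=2):
            table[(m, f), x, y] = p_z * cd[x] * cd[y]
    return table

def prob(table, event):
    """Total probability of all (z, x, y) cells for which event(z, x, y) holds."""
    return sum(p for (z, x, y), p in table.items() if event(z, x, y))

table = joint_table()
z0 = ("Aa", "Aa")

# Given the parents, learning Y does not change the distribution of X.
p_x_given_z = (prob(table, lambda z, x, y: z == z0 and x == "aa")
               / prob(table, lambda z, x, y: z == z0))
p_x_given_yz = (prob(table, lambda z, x, y: z == z0 and x == "aa" and y == "aa")
                / prob(table, lambda z, x, y: z == z0 and y == "aa"))

# Without the parents, the siblings' genotypes are dependent.
p_x = prob(table, lambda z, x, y: x == "aa")
p_x_given_y = (prob(table, lambda z, x, y: x == "aa" and y == "aa")
               / prob(table, lambda z, x, y: y == "aa"))

print(p_x_given_z, p_x_given_yz)  # equal
print(p_x, p_x_given_y)           # different
```

Exact fractions are used so that the equality P(X | Y, Z) = P(X | Z) holds without floating-point tolerance.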

6 Joint distributions and graphical models
Use ideas from graph theory to:
  represent structure of a joint probability distribution
  by encoding conditional independencies
[Figure: example graph on nodes A, B, C, D, E, F]
Factorization theorem: joint distribution $P(V) = \prod_{v \in V} P(v \mid \mathrm{parents}[v])$
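A minimal sketch of the factorization, assuming a small hypothetical DAG on four binary nodes (A -> B, A -> C, (B, C) -> D) with made-up conditional probabilities. The edges of the A to F graph on the slide are not recoverable from the transcript, so that graph is not reproduced here.

```python
from itertools import product

# Hypothetical DAG, not the slide's graph: node -> tuple of parents.
PARENTS = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C")}

# P(node = 1 | parent values); illustrative numbers only.
CPT = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.7},
    "C": {(0,): 0.5, (1,): 0.1},
    "D": {(0, 0): 0.05, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9},
}

def joint_prob(assignment):
    """P(V) as the product over nodes v of P(v | parents[v])."""
    p = 1.0
    for node, parents in PARENTS.items():
        p_one = CPT[node][tuple(assignment[q] for q in parents)]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p

nodes = list(PARENTS)
# Summing the factorized joint over all 2^4 configurations should give 1.
total = sum(joint_prob(dict(zip(nodes, values)))
            for values in product((0, 1), repeat=len(nodes)))
print(total)
```

Each factor involves only a node and its parents, so the full 16-cell joint table never has to be stored; only the small conditional tables do.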

7 Where does the graph come from?
Genetics: pedigree (family tree)
Physical, biological, social systems: supposed causal effects
Contingency tables: hypothesis tests on data
Gaussian case: non-zeros in inverse covariance matrix
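The Gaussian case can be illustrated with a toy example. The sketch below assumes a hypothetical three-variable chain X1 - X2 - X3, encoded as an inverse covariance (precision) matrix with a zero in the (1, 3) position. Inverting it exactly shows that X1 and X3 are still marginally correlated, even though the zero in the precision matrix means they are conditionally independent given X2.

```python
from fractions import Fraction

def invert(matrix):
    """Invert a small square matrix exactly by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment with the identity: [M | I] is reduced to [I | M^-1].
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and swap it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Hypothetical precision matrix for a chain X1 - X2 - X3 (no X1 - X3 edge).
precision = [[2, -1, 0],
             [-1, 2, -1],
             [0, -1, 2]]
covariance = invert(precision)

print(covariance[0][2])  # non-zero: X1 and X3 are marginally dependent
print(precision[0][2])   # zero: X1 and X3 are independent given X2
```

The graph is read off the precision matrix, not the covariance matrix: an edge is drawn wherever the off-diagonal entry is non-zero.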

8 Conditional independence provides a mathematical basis for splitting up a large system into smaller components
[Figure: graph on nodes A, B, C, D, E, F]

9 Conditional independence provides a mathematical basis for splitting up a large system into smaller components
[Figure: graph on nodes A, B, C, D, E, F, with nodes C, D, E also shown separately]

10 2. Modelling
Graphical models provide a framework for building probabilistic models for empirical data

11 Building complex models
Key idea: understand a complex system through a global model built from small pieces that are
  comprehensible
  each with only a few variables
  modular

12 Example: Case study 1
Epidemiological study of birth defects and mothers’ exposure to water disinfection byproducts
Background
  Chlorine added to tap water supply for disinfection
  Reacts with natural organic matter in water to form unwanted byproducts (including trihalomethanes, THMs)
  Some evidence of adverse health effects (cancer, birth defects) associated with exposure to high levels of THM
  We are carrying out a study in Great Britain using routine data, to investigate risk of birth defects associated with exposure to different THM levels

13 Data sources
National postcoded births register
National and local congenital anomalies registers
Routinely monitored THM concentrations in tap water samples for each water supply zone within 14 different water company regions
Census data – area level socioeconomic factors
Millennium Cohort Study (MCS) – individual level outcomes and confounder data on sample of mothers
Literature relating to factors affecting personal exposure (uptake factors, water consumption, etc.)

14 Model for combining data sources
[Figure: model graph with nodes THMzt[tap], s^2, THMztj[raw], THMzk[pers], THMzi[pers], b[T], pzk, b[c], pzi, czk, czi, yzk, qz, yzi]

15 Model for combining data sources
Regression model for national data relating risk of birth defects (pzk) to mother’s THM exposure and other confounders (czk)
[Figure: model graph, nodes as on slide 14 plus f]

16 Model for combining data sources
Regression model for MCS data relating risk of birth defects (pzi) to mother’s THM exposure and other confounders (czi)
[Figure: model graph, nodes as on slide 14]

17 Model for combining data sources
Missing data model to estimate confounders (czk) for mothers in national data, using information on within area distribution of confounders in MCS
[Figure: model graph, nodes as on slide 14 plus f]

18 Model for combining data sources
Model to estimate true tap water THM concentration from raw data
[Figure: model graph, nodes as on slide 14]

19 Model for combining data sources
Model to predict personal exposure using estimated tap water THM level and literature on distribution of factors affecting individual uptake of THM
[Figure: model graph, nodes as on slide 14 plus f]

20 3. Inference

21 Bayesian

22 … or non Bayesian

23 Bayesian Full Probability Modelling
Graphical approach to building complex models lends itself naturally to Bayesian inferential process
Graph defines joint probability distribution on all the ‘nodes’ in the model
Condition on parts of graph that are observed (data)
Update probabilities of remaining nodes using Bayes’ theorem
Automatically propagates all sources of uncertainty

24 4. Algorithms
Many algorithms, including MCMC, are able to exploit graphical structure
MCMC: subgroups of variables updated randomly
Ensemble converges to equilibrium (e.g. posterior) distribution

25 Key idea exploited by WinBUGS software
MCMC: when updating a node, need only look at its neighbours
[Figure: graph with the node being updated marked "?"]
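The "look only at neighbours" idea can be sketched with a single-site Gibbs sampler. This is a toy illustration, not WinBUGS's implementation: it assumes a hypothetical undirected chain of +1/-1 variables with an Ising-type coupling, chosen because each node's full conditional depends only on the nodes adjacent to it.

```python
import math
import random

def prob_up(neighbour_spins, coupling):
    """P(x_i = +1 | neighbours) when P(x) is proportional to exp(coupling * sum of x_i * x_j over edges)."""
    field = coupling * sum(neighbour_spins)
    return 1.0 / (1.0 + math.exp(-2.0 * field))

def gibbs_sweeps(n_nodes, coupling, n_sweeps, rng):
    """Single-site Gibbs sampling on a chain; each update reads only adjacent nodes."""
    state = [rng.choice((-1, 1)) for _ in range(n_nodes)]
    for _ in range(n_sweeps):
        # Visit the nodes in a fresh random order on every sweep.
        for i in rng.sample(range(n_nodes), n_nodes):
            neighbours = [state[j] for j in (i - 1, i + 1) if 0 <= j < n_nodes]
            state[i] = 1 if rng.random() < prob_up(neighbours, coupling) else -1
    return state

final_state = gibbs_sweeps(n_nodes=10, coupling=0.8, n_sweeps=200,
                           rng=random.Random(0))
print(final_state)
```

The cost of one update depends on the number of neighbours, not on the size of the whole graph, which is what makes this kind of sampler practical for large models.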

26 Case study 2
Socioeconomic factors affecting health
Background
  Interested in individual versus contextual effects of socioeconomic determinants of health
  Often investigated using multi-level studies (individuals within areas)
  Ecological studies also widely used in epidemiology and social sciences due to availability of small-area data
    investigate relationships at level of group, rather than individual
    outcome and exposures are available as group-level summaries
    usual aim is to transfer inference to individual level

27 Building the model
Multilevel model for individual data
[Figure: graph with nodes s^2, a_i, x[c]_ik, b[c], x[b]_ik, p_ik, b[b], y_ik]

28 Building the model
Multilevel model for individual data
$y_{ik} \sim \mathrm{Bernoulli}(p_{ik})$, person $k$, area $i$
[Figure: graph as on slide 27]

29 Building the model
Multilevel model for individual data
$y_{ik} \sim \mathrm{Bernoulli}(p_{ik})$, person $k$, area $i$
$\log p_{ik} = a_i + b[c]\,x[c]_{ik} + b[b]\,x[b]_{ik}$
[Figure: graph as on slide 27]

30 Building the model
Multilevel model for individual data
$y_{ik} \sim \mathrm{Bernoulli}(p_{ik})$, person $k$, area $i$
$\log p_{ik} = a_i + b[c]\,x[c]_{ik} + b[b]\,x[b]_{ik}$
$a_i \sim \mathrm{Normal}(0, s^2)$
[Figure: graph as on slide 27]

31 Building the model
Multilevel model for individual data
$y_{ik} \sim \mathrm{Bernoulli}(p_{ik})$, person $k$, area $i$
$\log p_{ik} = a_i + b[c]\,x[c]_{ik} + b[b]\,x[b]_{ik}$
$a_i \sim \mathrm{Normal}(0, s^2)$
Prior distributions on $s^2$, $b[c]$, $b[b]$
[Figure: graph as on slide 27]
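To make the multilevel model concrete, here is a minimal simulation sketch of its generative structure. Several pieces are assumptions not taken from the slides: the covariate distributions, all parameter values, and a baseline log-risk `mu`, which is added (together with a cap at 1) so that the log-link probabilities stay valid.

```python
import math
import random

def simulate_individual_data(n_areas, n_per_area, mu, b_c, b_b, s2, rng):
    """Draw (area, x_c, x_b, p, y) rows from the multilevel log-linear model."""
    rows = []
    for i in range(n_areas):
        a_i = rng.gauss(0.0, math.sqrt(s2))        # a_i ~ Normal(0, s2)
        for _ in range(n_per_area):
            x_c = rng.gauss(0.0, 1.0)              # assumed: standard normal covariate
            x_b = 1 if rng.random() < 0.3 else 0   # assumed: 30% exposed
            # log p_ik = a_i + b[c] x[c]_ik + b[b] x[b]_ik, plus the assumed
            # baseline mu; capped at 1 to keep a valid probability.
            p = min(1.0, math.exp(mu + a_i + b_c * x_c + b_b * x_b))
            y = 1 if rng.random() < p else 0       # y_ik ~ Bernoulli(p_ik)
            rows.append((i, x_c, x_b, p, y))
    return rows

rows = simulate_individual_data(n_areas=20, n_per_area=5, mu=-2.0,
                                b_c=-0.5, b_b=0.4, s2=0.2,
                                rng=random.Random(1))
print(len(rows), sum(r[4] for r in rows))  # number of people, number of cases
```

The nesting of the two loops mirrors the graph: `a_i` is drawn once per area and shared by every person in it, while the covariates and outcome are drawn per person.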

32 Building the model
Ecological model
[Figure: graph with nodes s^2, a_i, V[c]_i, b[c], X[c]_i, b[b], q_i, X[b]_i, Y_i, N_i]

33 Building the model
Ecological model
$Y_i \sim \mathrm{Binomial}(q_i, N_i)$, area $i$
[Figure: graph as on slide 32]

34 Building the model
Ecological model
$Y_i \sim \mathrm{Binomial}(q_i, N_i)$, area $i$
$q_i = \int\!\!\int p_{ik}(x[b], x[c])\, f_i(x[b], x[c])\, dx[b]\, dx[c]$
[Figure: graph as on slide 32]

35 Building the model
Ecological model
$Y_i \sim \mathrm{Binomial}(q_i, N_i)$, area $i$
$q_i = \int\!\!\int p_{ik}(x[b], x[c])\, f_i(x[b], x[c])\, dx[b]\, dx[c]$
Assuming x[b], x[c] independent, with $X[b]_i$ = proportion exposed to ‘b’ in area $i$ and $f_i(x[c]) = \mathrm{Normal}(X[c]_i, V[c]_i)$, then
$q_i = q_{0i}(1 - X[b]_i) + q_{1i} X[b]_i$
where $q_{0i}$ = marginal prob of disease for unexposed $= \exp(a_i + b[c]\,X[c]_i + b[c]^2\,V[c]_i / 2)$
[Figure: graph as on slide 32]

36 Building the model
Ecological model
$Y_i \sim \mathrm{Binomial}(q_i, N_i)$, area $i$
$q_i = \int\!\!\int p_{ik}(x[b], x[c])\, f_i(x[b], x[c])\, dx[b]\, dx[c]$
Assuming x[b], x[c] independent, with $X[b]_i$ = proportion exposed to ‘b’ in area $i$ and $f_i(x[c]) = \mathrm{Normal}(X[c]_i, V[c]_i)$, then
$q_i = q_{0i}(1 - X[b]_i) + q_{1i} X[b]_i$
where $q_{1i}$ = marginal prob of disease for exposed $= \exp(a_i + b[b] + b[c]\,X[c]_i + b[c]^2\,V[c]_i / 2)$
[Figure: graph as on slide 32]

37 Building the model
Ecological model
$Y_i \sim \mathrm{Binomial}(q_i, N_i)$, area $i$
$q_i = \int\!\!\int p_{ik}(x[b], x[c])\, f_i(x[b], x[c])\, dx[b]\, dx[c]$
$a_i \sim \mathrm{Normal}(0, s^2)$
[Figure: graph as on slide 32]

38 Building the model
Ecological model
$Y_i \sim \mathrm{Binomial}(q_i, N_i)$, area $i$
$q_i = \int\!\!\int p_{ik}(x[b], x[c])\, f_i(x[b], x[c])\, dx[b]\, dx[c]$
$a_i \sim \mathrm{Normal}(0, s^2)$
Prior distributions on $s^2$, $b[b]$, $b[c]$
[Figure: graph as on slide 32]
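The closed form for the area-level probability can be checked by simulation. The sketch below uses made-up parameter values and compares the formula from slides 35 and 36 with a Monte Carlo average of the individual-level risk exp(a_i + b[c] x[c] + b[b] x[b]), under the assumptions stated there: x[b] and x[c] independent, x[b] Bernoulli with mean X[b]_i, and x[c] Normal(X[c]_i, V[c]_i).

```python
import math
import random

def marginal_prob(a_i, b_b, b_c, X_b, X_c, V_c):
    """Closed-form q_i: individual risk averaged over the within-area covariates."""
    q0 = math.exp(a_i + b_c * X_c + b_c ** 2 * V_c / 2.0)        # unexposed
    q1 = math.exp(a_i + b_b + b_c * X_c + b_c ** 2 * V_c / 2.0)  # exposed
    return q0 * (1.0 - X_b) + q1 * X_b

def monte_carlo_prob(a_i, b_b, b_c, X_b, X_c, V_c, n_draws, rng):
    """Average exp(a_i + b_c x_c + b_b x_b) over simulated individuals."""
    total = 0.0
    for _ in range(n_draws):
        x_b = 1 if rng.random() < X_b else 0
        x_c = rng.gauss(X_c, math.sqrt(V_c))
        total += math.exp(a_i + b_c * x_c + b_b * x_b)
    return total / n_draws

# Hypothetical values for one area, chosen only for illustration.
params = dict(a_i=-3.0, b_b=0.5, b_c=0.3, X_b=0.4, X_c=1.0, V_c=0.5)
exact = marginal_prob(**params)
approx = monte_carlo_prob(**params, n_draws=100_000, rng=random.Random(42))
print(exact, approx)
```

The b[c]^2 V[c]_i / 2 term is the log-normal mean correction: it is the reason that plugging area means straight into the individual-level model would give a different answer whenever the within-area variance is non-zero.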

39 Combining individual and aggregate data
Individual level survey data often lack power to inform about contextual and/or individual-level effects
Even when correct (integrated) model used, ecological data often contain little information about some or all effects of interest
Can we improve inference by combining both types of model / data?

40 Combining individual and aggregate data
[Figure: two graphs side by side. Multilevel model for individual data: nodes s^2, a_i, x[c]_ik, b[c], x[b]_ik, p_ik, b[b], y_ik. Ecological model: nodes s^2, a_i, V[c]_i, b[c], X[c]_i, b[b], q_i, X[b]_i, Y_i, N_i]

41 Combining individual and aggregate data
Hierarchical Related Regression (HRR) model
[Figure: single graph with nodes s^2, a_i, V[c]_i, x[c]_ik, b[c], X[c]_i, x[b]_ik, p_ik, b[b], q_i, X[b]_i, y_ik, Y_i, N_i]
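One way to read the combined graph is that the individual and aggregate data contribute to a single likelihood through the shared parameters a_i, b[c] and b[b]. The sketch below is an illustration of that reading under stated assumptions, not the authors' implementation: it treats the two data sources as independent given the parameters, uses the log-link forms from the earlier slides, and the data layout and names (`areas`, `hrr_loglik`) are hypothetical.

```python
import math

def individual_loglik(y, x_c, x_b, a_i, b_c, b_b):
    """Bernoulli log-likelihood with log p = a_i + b_c * x_c + b_b * x_b (needs p < 1)."""
    p = math.exp(a_i + b_c * x_c + b_b * x_b)
    return math.log(p) if y == 1 else math.log(1.0 - p)

def aggregate_loglik(Y, N, X_b, X_c, V_c, a_i, b_c, b_b):
    """Binomial log-likelihood for area counts, up to a constant in the parameters."""
    q0 = math.exp(a_i + b_c * X_c + b_c ** 2 * V_c / 2.0)
    q = q0 * (1.0 - X_b) + q0 * math.exp(b_b) * X_b
    return Y * math.log(q) + (N - Y) * math.log(1.0 - q)

def hrr_loglik(areas, b_c, b_b):
    """Sum both contributions over areas; b_c and b_b are shared by both parts."""
    total = 0.0
    for area in areas:
        a_i = area["a"]
        for y, x_c, x_b in area["individuals"]:
            total += individual_loglik(y, x_c, x_b, a_i, b_c, b_b)
        total += aggregate_loglik(*area["aggregate"], a_i, b_c, b_b)
    return total

# Hypothetical single area: two surveyed people plus census-style counts
# given as (Y, N, X_b, X_c, V_c).
example_areas = [{
    "a": math.log(0.1),
    "individuals": [(1, 0.0, 0), (0, 0.0, 0)],
    "aggregate": (2, 10, 0.3, 0.0, 1.0),
}]
value = hrr_loglik(example_areas, b_c=0.0, b_b=0.0)
print(value)
```

In a Bayesian fit this log-likelihood would be combined with the Normal(0, s^2) distribution on the a_i and priors on s^2, b[c], b[b]; those pieces are left out to keep the sketch short.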

42 Simulation Study

43 Simulation Study

44 Simulation Study

45 Comments
Inference from aggregate data can be unbiased provided exposure contrasts between areas are high (and appropriate integrated model used)
Combining aggregate data with small samples of individual data can reduce bias when exposure contrasts are low
Combining individual and aggregate data can reduce MSE of estimates compared to individual data alone
Individual data cannot help if individual-level model is misspecified

46 Application to LLTI
Health outcome: Limiting Long Term Illness (LLTI) in men aged [...] yrs living in London
Exposures: ethnicity (white/non-white), income, area deprivation
Data sources
  Aggregate: 1991 Census aggregated to ward level
  Individual: Health Survey for England (with ward identifier), 1-9 observations per ward (median 1.6)

47 Ward level data
[Figure panels: Deprivation; % non white; Mean income; Prevalence of LLTI]

48 Results
Model                             Non-white             Log income            Deprivation              Between-area variance
Individual                        -0.36 (-0.98, 0.23)   -0.55 (-0.80, -0.32)  -0.022 (-0.032, 0.074)   0.18 (0.052, 0.64)
Ecological                        0.50 (0.27, 0.72)     -0.72 (-0.93, -0.51)  0.063 (0.054, 0.073)     0.19 (0.17, 0.21)
Combined                          0.48 (0.23, 0.72)     -0.70 (-0.91, -0.50)  0.064 (0.054, 0.074)     (0.17, 0.22)
Combined (correlation modelled)   (0.24, 0.73)          -0.71 (-0.91, -0.51)

49 Concluding Remarks
Graphical models are a powerful and flexible tool for building realistic statistical models for complex problems
  Applicable in many domains
  Allow exploiting of subject matter knowledge
  Allow formal combining of multiple data sources
  Built on rigorous mathematics
  Principled inferential methods
Thank you for your attention!

