Graphical models for combining multiple data sources


Graphical models for combining multiple data sources. Nicky Best, Sylvia Richardson, Chris Jackson. Imperial College, BIAS node; with thanks to Peter Green.

Outline. Overview of graphical modelling. Case study 1: Water disinfection byproducts and adverse birth outcomes; modelling multiple sources of bias in observational studies. Case study 2: Socioeconomic factors and limiting long term illness; combining individual and aggregate level data; simulation study; application to Census and Health Survey for England.

Graphical modelling: Mathematics, Modelling, Algorithms, Inference.

1. Mathematics. Key idea: conditional independence. X and Y are conditionally independent given Z if, knowing Z, discovering Y tells you nothing more about X: $P(X \mid Y, Z) = P(X \mid Z)$.

Example: Mendelian inheritance. Z = genotype of parents; X, Y = genotypes of 2 children. If we know the genotype of the parents, then the children’s genotypes are conditionally independent. [Graph: Z with arrows to X and to Y]
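The Mendelian example can be checked numerically. The sketch below is an illustration only: the single-locus alleles A/a and the prior on each parent's genotype are assumptions that do not come from the slides. It builds the joint distribution of (Z, X, Y), confirms that P(X | Y, Z) = P(X | Z), and shows that X and Y are nevertheless dependent once Z is unknown.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")
# Assumed prior for each parent's genotype (illustration only, not from the slides).
PARENT_PRIOR = {"AA": 0.25, "Aa": 0.5, "aa": 0.25}


def child_dist(mother, father):
    """Mendelian distribution of one child's genotype given both parents."""
    dist = {g: 0.0 for g in GENOTYPES}
    for m_allele, f_allele in product(mother, father):
        # Each parent passes on one of its two alleles with probability 1/2.
        genotype = "".join(sorted(m_allele + f_allele))  # "aA" becomes "Aa"
        dist[genotype] += 0.25
    return dist


def build_joint():
    """Joint P(z, x, y) = P(z) P(x | z) P(y | z), with z = (mother, father)."""
    joint = {}
    for z in product(GENOTYPES, repeat=2):
        p_z = PARENT_PRIOR[z[0]] * PARENT_PRIOR[z[1]]
        dist = child_dist(*z)
        for x, y in product(GENOTYPES, repeat=2):
            joint[(z, x, y)] = p_z * dist[x] * dist[y]
    return joint


def max_conditional_gap(joint):
    """Largest |P(x | y, z) - P(x | z)| over configurations with P(y, z) > 0."""
    gap = 0.0
    for z in product(GENOTYPES, repeat=2):
        p_z = sum(p for (zz, _, _), p in joint.items() if zz == z)
        for x0, y0 in product(GENOTYPES, repeat=2):
            p_yz = sum(p for (zz, _, y), p in joint.items() if zz == z and y == y0)
            if p_yz == 0.0:
                continue
            p_xz = sum(p for (zz, x, _), p in joint.items() if zz == z and x == x0)
            gap = max(gap, abs(joint[(z, x0, y0)] / p_yz - p_xz / p_z))
    return gap


def max_marginal_gap(joint):
    """Largest |P(x, y) - P(x) P(y)|; nonzero means X and Y are marginally dependent."""
    def p_x(g):
        return sum(p for (_, x, _), p in joint.items() if x == g)

    def p_y(g):
        return sum(p for (_, _, y), p in joint.items() if y == g)

    def p_xy(gx, gy):
        return sum(p for (_, x, y), p in joint.items() if x == gx and y == gy)

    return max(abs(p_xy(gx, gy) - p_x(gx) * p_y(gy))
               for gx, gy in product(GENOTYPES, repeat=2))


joint = build_joint()
print("conditional gap:", max_conditional_gap(joint))
print("marginal gap:", max_marginal_gap(joint))
```

The conditional gap is zero up to rounding, while the marginal gap is clearly positive: siblings' genotypes are informative about each other only through the parents.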

Joint distributions and graphical models. Use ideas from graph theory to represent the structure of a joint probability distribution by encoding conditional independencies. Factorization theorem: the joint distribution is $P(V) = \prod_{v \in V} P(v \mid \mathrm{parents}[v])$. [Graph on nodes A, B, C, D, E, F]
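A minimal sketch of the factorization theorem follows. The edge structure over the nodes A to F and all the probabilities are hypothetical (the slide's actual graph is not recoverable from the transcript); the point is only that multiplying each node's conditional probability given its parents yields a valid joint distribution.

```python
from itertools import product

# Hypothetical DAG over binary nodes A..F (assumed structure, for illustration).
PARENTS = {
    "A": (),
    "B": (),
    "C": ("A", "B"),
    "D": ("C",),
    "E": ("C",),
    "F": ("D", "E"),
}

# P(v = 1 | parent values); the numbers are made up.
P_ONE = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
    "D": {(0,): 0.2, (1,): 0.7},
    "E": {(0,): 0.3, (1,): 0.8},
    "F": {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95},
}


def joint_prob(assignment):
    """P(V) as the product over nodes v of P(v | parents[v])."""
    prob = 1.0
    for node, parents in PARENTS.items():
        p_one = P_ONE[node][tuple(assignment[p] for p in parents)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob


def total_probability():
    """Sum of the factorized joint over all 2^6 configurations."""
    names = list(PARENTS)
    return sum(joint_prob(dict(zip(names, values)))
               for values in product((0, 1), repeat=len(names)))


print(total_probability())
```

Each factor involves only a node and its parents, so the 64-entry joint table is specified by 15 numbers here.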

Where does the graph come from? Genetics: pedigree (family tree). Physical, biological, social systems: supposed causal effects. Contingency tables: hypothesis tests on data. Gaussian case: non-zeros in inverse covariance matrix.

Conditional independence provides the mathematical basis for splitting up a large system into smaller components. [Graph on nodes A, B, C, D, E, F, then shown split into smaller pieces]

2. Modelling. Graphical models provide a framework for building probabilistic models for empirical data.

Building complex models. Key idea: understand a complex system through a global model built from small pieces that are comprehensible, each with only a few variables, and modular.

Example: Case study 1. Epidemiological study of birth defects and mothers’ exposure to water disinfection byproducts. Background: chlorine is added to the tap water supply for disinfection; it reacts with natural organic matter in the water to form unwanted byproducts (including trihalomethanes, THMs); there is some evidence of adverse health effects (cancer, birth defects) associated with exposure to high levels of THM. We are carrying out a study in Great Britain using routine data to investigate the risk of birth defects associated with exposure to different THM levels.

Data sources: national postcoded births register; national and local congenital anomalies registers; routinely monitored THM concentrations in tap water samples for each water supply zone within 14 different water company regions; Census data (area level socioeconomic factors); Millennium Cohort Study (MCS) (individual level outcomes and confounder data on a sample of mothers); literature relating to factors affecting personal exposure (uptake factors, water consumption, etc.).

Model for combining data sources. The graph is built up from five submodels:
Regression model for national data relating risk of birth defects (pzk) to mother’s THM exposure and other confounders (czk).
Regression model for MCS data relating risk of birth defects (pzi) to mother’s THM exposure and other confounders (czi).
Missing data model to estimate confounders (czk) for mothers in national data, using information on the within-area distribution of confounders in MCS.
Model to estimate true tap water THM concentration from raw data.
Model to predict personal exposure using estimated tap water THM level and literature on the distribution of factors affecting individual uptake of THM.
[Graph nodes: THMzt[tap], THMztj[raw], THMzk[pers], THMzi[pers], s2, f, b[T], b[c], pzk, pzi, czk, czi, yzk, yzi, qz]

3. Inference

Bayesian

… or non Bayesian

Bayesian Full Probability Modelling. The graphical approach to building complex models lends itself naturally to the Bayesian inferential process: the graph defines a joint probability distribution on all the ‘nodes’ in the model; condition on the parts of the graph that are observed (data); update the probabilities of the remaining nodes using Bayes theorem. This automatically propagates all sources of uncertainty.
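The condition-then-update recipe can be sketched on the smallest possible graph: one unobserved node Z with two observed children X and Y. Every name and probability below is hypothetical and chosen only for illustration; the posterior for Z is obtained by multiplying the factors of the joint distribution and normalising, which is Bayes theorem.

```python
# Hypothetical three-node graph: Z -> X, Z -> Y (all numbers are made up).
PRIOR_Z = {"low": 0.7, "high": 0.3}
P_X1_GIVEN_Z = {"low": 0.1, "high": 0.6}  # P(X = 1 | Z)
P_Y1_GIVEN_Z = {"low": 0.2, "high": 0.9}  # P(Y = 1 | Z)


def posterior_z(x, y):
    """P(Z | X = x, Y = y): condition on the observed nodes, then normalise."""
    unnormalised = {}
    for z, p_z in PRIOR_Z.items():
        p_x = P_X1_GIVEN_Z[z] if x == 1 else 1.0 - P_X1_GIVEN_Z[z]
        p_y = P_Y1_GIVEN_Z[z] if y == 1 else 1.0 - P_Y1_GIVEN_Z[z]
        unnormalised[z] = p_z * p_x * p_y  # joint P(z, x, y) from the graph
    total = sum(unnormalised.values())     # P(x, y), the normalising constant
    return {z: w / total for z, w in unnormalised.items()}


print(posterior_z(1, 1))
```

In realistic models this sum over the unobserved nodes is intractable, which is why simulation algorithms such as MCMC are used instead of direct enumeration.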

4. Algorithms. Many algorithms, including MCMC, are able to exploit graphical structure. MCMC: subgroups of variables are updated randomly; the ensemble converges to the equilibrium (e.g. posterior) distribution.

MCMC updating. Key idea exploited by WinBUGS software: to update a node, need only look at its neighbours. [Graph: the node being updated, marked ‘?’, and its neighbours]
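The "need only look at neighbours" idea can be illustrated with a toy Gibbs sampler. This is not WinBUGS code: the three-node chain, the coupling and the field values are assumptions made for the sketch, and it uses a fixed sweep order rather than random subgroups. Each update computes the full conditional of one node from its neighbours alone, and the resulting marginals are compared with exact enumeration.

```python
import math
import random
from itertools import product

# Hypothetical undirected chain X0 - X1 - X2 of binary variables.
NEIGHBOURS = {0: (1,), 1: (0, 2), 2: (1,)}
COUPLING = 1.0                      # log-weight for agreeing with a neighbour
FIELD = {0: 1.5, 1: 0.0, 2: 0.0}    # log-weight for a node taking value 1


def local_score(i, value, state):
    """Log-weight of X_i = value; reads only the neighbours of node i."""
    score = FIELD[i] * value
    for j in NEIGHBOURS[i]:
        score += COUPLING * (1 if state[j] == value else 0)
    return score


def gibbs_marginals(n_sweeps=20000, seed=0):
    """Estimate P(X_i = 1) by Gibbs sampling with a systematic sweep."""
    rng = random.Random(seed)
    state = [0, 0, 0]
    counts = [0, 0, 0]
    for _ in range(n_sweeps):
        for i in range(3):
            w1 = math.exp(local_score(i, 1, state))
            w0 = math.exp(local_score(i, 0, state))
            state[i] = 1 if rng.random() < w1 / (w0 + w1) else 0
        for i in range(3):
            counts[i] += state[i]
    return [c / n_sweeps for c in counts]


def exact_marginals():
    """P(X_i = 1) by summing the joint over all 8 configurations."""
    weights = {}
    for s in product((0, 1), repeat=3):
        log_w = sum(FIELD[i] * s[i] for i in range(3))
        log_w += COUPLING * ((s[0] == s[1]) + (s[1] == s[2]))
        weights[s] = math.exp(log_w)
    total = sum(weights.values())
    return [sum(w for s, w in weights.items() if s[i] == 1) / total
            for i in range(3)]


print(gibbs_marginals())
print(exact_marginals())
```

Updating X0 never touches X2: its full conditional depends only on X1, which is what keeps each update cheap in large graphs.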

Case study 2: Socioeconomic factors affecting health. Background: interested in individual versus contextual effects of socioeconomic determinants of health. These are often investigated using multi-level studies (individuals within areas). Ecological studies are also widely used in epidemiology and social sciences due to the availability of small-area data: they investigate relationships at the level of the group, rather than the individual; outcome and exposures are available as group-level summaries; the usual aim is to transfer inference to the individual level.

Building the model: multilevel model for individual data.
$y_{ik} \sim \mathrm{Bernoulli}(p_{ik})$, person k, area i
$\log p_{ik} = a_i + b_{[c]} x_{[c]ik} + b_{[b]} x_{[b]ik}$
$a_i \sim \mathrm{Normal}(0, s^2)$
Prior distributions on $s^2$, $b_{[c]}$, $b_{[b]}$
[Graph nodes: s2, ai, x[c]ik, x[b]ik, b[c], b[b], pik, yik]

Building the model: ecological model.
$Y_i \sim \mathrm{Binomial}(q_i, N_i)$, area i
$q_i = \iint p_{ik}(x_{[b]}, x_{[c]}) \, f_i(x_{[b]}, x_{[c]}) \, dx_{[b]} \, dx_{[c]}$
Assuming $x_{[b]}$, $x_{[c]}$ independent, with $X_{[b]i}$ = proportion exposed to ‘b’ in area i and $f_i(x_{[c]}) = \mathrm{Normal}(X_{[c]i}, V_{[c]i})$, then $q_i = q_{0i}(1 - X_{[b]i}) + q_{1i} X_{[b]i}$, where
$q_{0i}$ = marginal prob of disease for unexposed = $\exp(a_i + b_{[c]} X_{[c]i} + b_{[c]}^2 V_{[c]i}/2)$
$q_{1i}$ = marginal prob of disease for exposed = $\exp(a_i + b_{[b]} + b_{[c]} X_{[c]i} + b_{[c]}^2 V_{[c]i}/2)$
$a_i \sim \mathrm{Normal}(0, s^2)$
Prior distributions on $s^2$, $b_{[b]}$, $b_{[c]}$
[Graph nodes: s2, ai, V[c]i, X[c]i, X[b]i, b[c], b[b], qi, Yi, Ni]
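The closed form for q_i can be verified numerically. With a log link, averaging exp(b[c] x[c]) over a Normal(X[c]i, V[c]i) distribution gives the lognormal mean exp(b[c] X[c]i + b[c]^2 V[c]i / 2), which is where the variance term comes from. The sketch below compares the closed form with brute-force integration; the function names and parameter values are hypothetical, chosen only for illustration.

```python
import math


def q_closed_form(a, b_b, b_c, X_b, X_c, V_c):
    """q_i = q0 (1 - X_b) + q1 X_b, using the expressions on the slide."""
    q0 = math.exp(a + b_c * X_c + b_c ** 2 * V_c / 2)
    q1 = math.exp(a + b_b + b_c * X_c + b_c ** 2 * V_c / 2)
    return q0 * (1 - X_b) + q1 * X_b


def q_numeric(a, b_b, b_c, X_b, X_c, V_c, n_grid=20001):
    """Average p = exp(a + b_b x_b + b_c x_c) over the within-area distribution.

    x_b is Bernoulli(X_b); x_c is Normal(X_c, V_c), integrated with the
    trapezoid rule on a grid spanning ten standard deviations either side.
    """
    sd = math.sqrt(V_c)
    lo, hi = X_c - 10 * sd, X_c + 10 * sd
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        x_c = lo + k * step
        density = math.exp(-0.5 * ((x_c - X_c) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        weight = 0.5 if k in (0, n_grid - 1) else 1.0  # trapezoid end points
        for x_b, p_b in ((0, 1 - X_b), (1, X_b)):
            total += weight * step * density * p_b * math.exp(a + b_b * x_b + b_c * x_c)
    return total


# Made-up parameter values for one area.
params = dict(a=-3.0, b_b=0.5, b_c=-0.7, X_b=0.2, X_c=1.0, V_c=0.25)
print(q_closed_form(**params), q_numeric(**params))
```

Dropping the variance term (that is, plugging the area mean X[c]i straight into the individual-level model) would give a different q_i, which is the source of ecological bias that the integrated model avoids.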

Combining individual and aggregate data. Individual level survey data often lack power to inform about contextual and/or individual-level effects. Even when the correct (integrated) model is used, ecological data often contain little information about some or all effects of interest. Can we improve inference by combining both types of model / data?

Combining individual and aggregate data. The multilevel model for individual data and the ecological model are joined into a single Hierarchical Related Regression (HRR) model, in which both parts share ai, b[b], b[c] and s2. [Graph nodes: s2, ai, V[c]i, x[c]ik, X[c]i, x[b]ik, X[b]i, b[c], b[b], pik, qi, yik, Yi, Ni]
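A rough sketch of how the two data sources enter one likelihood is given below. It is not the authors' implementation: the function names and data layout are hypothetical, the two sources are treated as independent given the parameters, and the Normal(0, s2) distribution for ai and the priors are left out. What it shows is that the same ai, b[b] and b[c] appear in both the Bernoulli terms for individuals and the Binomial terms for areas.

```python
import math


def individual_loglik(y, x_b, x_c, a, b_b, b_c):
    """Bernoulli log-likelihood for one person, with log p = a + b_b x_b + b_c x_c."""
    p = math.exp(a + b_b * x_b + b_c * x_c)
    return math.log(p) if y == 1 else math.log(1 - p)


def ecological_loglik(Y, N, X_b, X_c, V_c, a, b_b, b_c):
    """Binomial log-likelihood for one area, using the integrated probability q_i."""
    q0 = math.exp(a + b_c * X_c + b_c ** 2 * V_c / 2)
    q1 = q0 * math.exp(b_b)
    q = q0 * (1 - X_b) + q1 * X_b
    return math.log(math.comb(N, Y)) + Y * math.log(q) + (N - Y) * math.log(1 - q)


def hrr_loglik(individuals, areas, a_by_area, b_b, b_c):
    """Sum of both contributions, evaluated at the same shared parameters."""
    total = 0.0
    for area, y, x_b, x_c in individuals:
        total += individual_loglik(y, x_b, x_c, a_by_area[area], b_b, b_c)
    for area, (Y, N, X_b, X_c, V_c) in areas.items():
        total += ecological_loglik(Y, N, X_b, X_c, V_c, a_by_area[area], b_b, b_c)
    return total


# Tiny made-up data set: two surveyed people and one area summary.
individuals = [("w1", 1, 0, 0.0), ("w1", 0, 1, 0.5)]   # (area, y, x_b, x_c)
areas = {"w1": (3, 20, 0.25, 0.2, 0.1)}                # (Y, N, X_b, X_c, V_c)
print(hrr_loglik(individuals, areas, {"w1": -2.0}, b_b=0.4, b_c=-0.5))
```

Because the log link does not bound p below 1, the sketch only works for parameter values that keep every probability inside (0, 1); a full implementation would need to handle that constraint.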

Simulation Study

Comments. Inference from aggregate data can be unbiased provided exposure contrasts between areas are high (and an appropriate integrated model is used). Combining aggregate data with small samples of individual data can reduce bias when exposure contrasts are low. Combining individual and aggregate data can reduce the MSE of estimates compared to individual data alone. Individual data cannot help if the individual-level model is misspecified.

Application to LLTI. Health outcome: Limiting Long Term Illness (LLTI) in men aged 40-59 yrs living in London. Exposures: ethnicity (white/non-white), income, area deprivation. Data sources: aggregate, 1991 Census aggregated to ward level; individual, Health Survey for England (with ward identifier), 1-9 observations per ward (median 1.6).

Ward level data. [Figure panels: deprivation; % non-white; mean income; prevalence of LLTI]

Results
Model                           | Non-white           | Log income           | Deprivation             | Between-area variance
Individual                      | -0.36 (-0.98, 0.23) | -0.55 (-0.80, -0.32) | -0.022 (-0.032, 0.074)  | 0.18 (0.052, 0.64)
Ecological                      | 0.50 (0.27, 0.72)   | -0.72 (-0.93, -0.51) | 0.063 (0.054, 0.073)    | 0.19 (0.17, 0.21)
Combined                        | 0.48 (0.23, 0.72)   | -0.70 (-0.91, -0.50) | 0.064 (0.054, 0.074)    | (0.17, 0.22)
Combined (correlation modelled) | (0.24, 0.73)        | -0.71 (-0.91, -0.51) |                         |

Concluding Remarks. Graphical models are a powerful and flexible tool for building realistic statistical models for complex problems: applicable in many domains; allow exploiting of subject matter knowledge; allow formal combining of multiple data sources; built on rigorous mathematics; principled inferential methods. Thank you for your attention!