Tianjian Zhou U of Chicago/UT Austin


Bayesian Nonparametric Models for Biomedical Data Analysis: Inference for Tumor Heterogeneity and Missing Data
Tianjian Zhou, U of Chicago/UT Austin

Tumor Heterogeneity
[Figure, built up over several slides: the cells in a tumor sample carry different DNA sequences and fall into three subclones. Subclone 1 (Normal), Subclone 2 and Subclone 3 have population frequencies 2/10, 3/10 and 5/10.]

Why Study Tumor Heterogeneity
Mutations are bad; we want to eliminate mutated cells.
Therapy 1: targets the mutation at the 1st locus.
Therapy 2: targets the mutation at the 2nd locus.
[Figure: the subclone sequences before and after therapy.]

Inference for Tumor Heterogeneity
Subclones/DNA sequences are unobserved/latent.
Data: short DNA reads; a mixture of signals from many cells.
Quantities of interest: # of subclones, phylogenetic relationship, genotypes, population frequencies.
[Figure: short reads aligned to the two DNA strands; proximal mutations are grouped into Mutation Pair 1 and Mutation Pair 2.]

Representation of Subclones
Each subclone is coded by mutation pairs (MP): 1 = mutation, 0 = no mutation (reference).
Latent factor matrix Z: rows MP1, MP2; columns S1, S2, S3 (entries not recoverable from the slide text).
Factor loadings w: population frequencies (2/10, 3/10, 5/10) for (S1, S2, S3).
# of subclones C: 3.
Phylogenetic tree T: 1 → 2 → 3.
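The representation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PairClone/TreeClone implementation: the entries of Z are hypothetical (the slide's matrix did not survive in the transcript), each mutation pair is reduced to a single 0/1 indicator, and the function name expected_mutant_fraction is made up for this example. Only w = (2/10, 3/10, 5/10), C = 3 and the row and column labels come from the slide.

```python
from fractions import Fraction

# Population frequencies (factor loadings) w for S1, S2, S3, as on the slide.
w = [Fraction(2, 10), Fraction(3, 10), Fraction(5, 10)]

# Hypothetical latent factor matrix Z: one row per mutation pair,
# one column per subclone (1 = mutation, 0 = reference).
Z = {
    "MP1": [0, 1, 1],
    "MP2": [0, 0, 1],
}

def expected_mutant_fraction(z_row, weights):
    """Fraction of cells carrying the mutation: sum over subclones of w_c * z_c."""
    return sum(wc * zc for wc, zc in zip(weights, z_row))

fractions_by_pair = {mp: expected_mutant_fraction(row, w) for mp, row in Z.items()}
for mp, frac in fractions_by_pair.items():
    print(mp, frac)
```

Under these made-up entries, a mutation present in S2 and S3 is expected in 3/10 + 5/10 of the cells, which is the sense in which w acts as a vector of factor loadings on the columns of Z.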

Prior Model
Latent factor matrix Z | T, C, with T: 1 → 2 → 3.
[Figure, built up over several slides: Z is filled in column by column (S1, then S2, then S3) over rows MP1 to MP4.]
# of new mutations in subclone 2: $A_2 \sim \text{Trunc-Poi}(\lambda)$; in the example, $A_2 = 3$.
# of new mutations in subclone 3: $A_3 \sim \text{Trunc-Poi}(\lambda)$; in the example, $A_3 = 4$.
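A rough simulation of this kind of tree-structured prior might look as follows. Several things here are assumptions and not taken from the slides: the Poisson is truncated to {1, ..., number of still-unmutated loci}, a child subclone inherits all of its parent's mutations, each locus is a single 0/1 indicator, and the names truncated_poisson and sample_Z are hypothetical.

```python
import math
import random

def truncated_poisson(lam, upper, rng):
    """Draw from Poisson(lam) restricted to {1, ..., upper} by rejection."""
    while True:
        # Knuth's multiplication method for one Poisson draw.
        threshold, k, prod = math.exp(-lam), 0, rng.random()
        while prod > threshold:
            k += 1
            prod *= rng.random()
        if 1 <= k <= upper:
            return k

def sample_Z(parents, n_loci, lam, rng):
    """Fill Z column by column; parents[c] is the parent of subclone c (None for the root)."""
    columns = []
    for parent in parents:
        if parent is None:
            columns.append([0] * n_loci)  # normal subclone: no mutations
            continue
        col = list(columns[parent])  # assumption: inherit the parent's mutations
        free = [i for i, z in enumerate(col) if z == 0]
        if not free:
            raise ValueError("no unmutated loci left for new mutations")
        n_new = truncated_poisson(lam, len(free), rng)  # A_c, the # of new mutations
        for i in rng.sample(free, n_new):
            col[i] = 1
        columns.append(col)
    return columns

# Chain tree 1 -> 2 -> 3 (0-based parents), 12 hypothetical loci.
Z_cols = sample_Z([None, 0, 1], n_loci=12, lam=2.0, rng=random.Random(0))
for name, col in zip(["S1", "S2", "S3"], Z_cols):
    print(name, col)
```

Each column is a superset of its parent's mutations, so the phylogeny can be read off the nesting of the columns.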


Sampling Model
[Figure, built up over many slides: short reads are generated one at a time. For each read, one of the three subclones is picked according to the population frequencies (2/10, 3/10, 5/10) and a short segment of its sequence is read. The observed data are the resulting mixture of reads, which is then summarized in the 0/1 coding of the subclone representation.]
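The read-generating mechanism in the sampling model can be mimicked with a short simulation. This is a sketch under assumptions: the three sequences are hypothetical, reads have a fixed length, and there is no sequencing error. Only the population frequencies come from the slides.

```python
import random
from collections import Counter

# Hypothetical subclone sequences (not the ones drawn on the slides).
subclones = {
    "S1": "ACGACG",
    "S2": "AGGACG",
    "S3": "AGGTCG",
}
# Population frequencies from the slides.
weights = {"S1": 0.2, "S2": 0.3, "S3": 0.5}

def draw_read(rng, read_len=3):
    """Pick a subclone by its population frequency, then read a short segment of it."""
    name = rng.choices(list(weights), weights=list(weights.values()))[0]
    seq = subclones[name]
    start = rng.randrange(len(seq) - read_len + 1)
    return name, start, seq[start:start + read_len]

rng = random.Random(1)
reads = [draw_read(rng) for _ in range(10_000)]

# The originating subclone is latent in real data; it is kept here only for checking.
origin_share = {k: n / len(reads) for k, n in Counter(r[0] for r in reads).items()}
print(origin_share)
```

In real data only the read content is observed, which is why the subclone labels, Z and w have to be inferred jointly.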

Inference for Tumor Heterogeneity
Data: short DNA reads (a mixture of signals from many cells), organized into mutation pairs.
Quantities of interest: # of subclones, phylogenetic relationship, genotypes, population frequencies.
Sampling model & prior model → posterior inference.

TCGA Lung Cancer Data
Finding: a malignant hyper-mutated subclone with a small population frequency.

Concluding Remark
The use of mutation pairs strengthens inference for tumor heterogeneity.

Inference for Missing Data
$\mathbf{y}$: longitudinal outcomes after treatment with a test drug.
$s$: dropout time (the figure shows subjects with $s = 4$ and $s = 5$).
Treatment effect: $E[Y_6 - Y_1]$.
[Figure: outcome trajectories against time, visits 1 to 6.]

Inference for Missing Data
Drawbacks of the analysis shown in the figure: biased if not MCAR; inefficient; can't do sensitivity analysis.

Inference for Missing Data
[Figure: outcome trajectories against time, each subject annotated with covariates $\mathbf{v}$.]
Example covariates: $\mathbf{v}$ = (Age: 66, Height: 185, Weight: 78, Gender: M); $\mathbf{v}$ = (Age: 41, Height: 170, Weight: 62, Gender: F); $\mathbf{v}$ = (Age: 29, Height: 166, Weight: 54, Gender: F); $\mathbf{v}$ = (Age: 33, Height: 159, Weight: 49, Gender: F).
Example dropout reasons: lack of efficacy; pregnancy.

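Extrapolation Factorization
Joint model for $\mathbf{y}$, $s$ and $\mathbf{v}$:
$$p(\mathbf{y}, s, \mathbf{v}) = p(\mathbf{y}_{\text{mis}} \mid \mathbf{y}_{\text{obs}}, s, \mathbf{v}) \, p(\mathbf{y}_{\text{obs}}, s, \mathbf{v})$$
Observed data distribution $p(\mathbf{y}_{\text{obs}}, s, \mathbf{v})$: identified, and can be estimated semi/non-parametrically.
Extrapolation distribution $p(\mathbf{y}_{\text{mis}} \mid \mathbf{y}_{\text{obs}}, s, \mathbf{v})$: not identified without uncheckable assumptions (e.g., missing at random (MAR), non-future dependence (NFD)).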

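Observed Data Distribution: Pattern Mixture Modeling
$$p(\mathbf{y}_{\text{obs}}, s, \mathbf{v}) = p(\mathbf{y}_{\text{obs}} \mid s, \mathbf{v}) \, p(s \mid \mathbf{v}) \, p(\mathbf{v})$$
$p(\mathbf{y}_{\text{obs}} \mid s, \mathbf{v})$: Gaussian process (GP), autoregressive (AR) and conditional autoregressive (CAR) priors.
$p(s \mid \mathbf{v})$: Bayesian additive regression trees (BART).
$p(\mathbf{v})$: Bayesian bootstrap.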
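Of the three components, the Bayesian bootstrap for $p(\mathbf{v})$ is the simplest to illustrate. The sketch below uses the standard construction (Dirichlet(1, ..., 1) weights on the observed covariate vectors, drawn as normalized exponentials) and is not taken from the talk's code. It uses only the ages of the four example subjects shown on the slides; the function name is hypothetical.

```python
import random

ages = [66, 41, 29, 33]  # ages of the four example subjects on the slides

def bayesian_bootstrap_weights(n, rng):
    """One Dirichlet(1, ..., 1) draw, obtained by normalizing Exp(1) variables."""
    g = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(g)
    return [x / total for x in g]

rng = random.Random(0)
draws = []
for _ in range(2000):
    wts = bayesian_bootstrap_weights(len(ages), rng)
    # One posterior draw of the mean age under the reweighted empirical distribution.
    draws.append(sum(wt * a for wt, a in zip(wts, ages)))

posterior_mean_age = sum(draws) / len(draws)
print(round(posterior_mean_age, 2))
```

Each set of weights defines one posterior draw of the covariate distribution, so no parametric form for $p(\mathbf{v})$ is needed.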

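Extrapolation Distribution: Identifying Restrictions
Missing at random (MAR): $p(\mathbf{y}_{\text{mis}} \mid \mathbf{y}_{\text{obs}}, s, \mathbf{v})$ is fully identified by $p(\mathbf{y}_{\text{obs}}, s, \mathbf{v})$.
Non-future dependent (NFD): $p(\mathbf{y}_{\text{mis}} \mid \mathbf{y}_{\text{obs}}, s, \mathbf{v})$ is partially identified by $p(\mathbf{y}_{\text{obs}}, s, \mathbf{v})$. Put informative priors on the non-identified parameters.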

Monte Carlo Integration/G-Computation
$$E[t(\mathbf{y})] = \int_{\mathbf{y}} t(\mathbf{y}) \, p(\mathbf{y}) \, d\mathbf{y} = \int_{\mathbf{y}} t(\mathbf{y}) \sum_{s} \int_{\mathbf{v}} p(\mathbf{y}_{\text{mis}} \mid \mathbf{y}_{\text{obs}}, s, \mathbf{v}) \, p(\mathbf{y}_{\text{obs}} \mid s, \mathbf{v}) \, p(s \mid \mathbf{v}) \, p(\mathbf{v}) \, d\mathbf{v} \, d\mathbf{y}$$
[Figure, built up over several slides: for each Monte Carlo draw, a covariate vector $\mathbf{v}$ is drawn, then a dropout time $s$, then the outcome trajectory over visits 1 to 6. The examples shown are $\mathbf{v}$ = (Age: 33, Height: 166, Weight: 54, Gender: F) with $s = 5$; $\mathbf{v}$ = (Age: 37, Height: 163, Weight: 51, Gender: F) with $s = 6$; and $\mathbf{v}$ = (Age: 66, Height: 185, Weight: 78, Gender: M) with $s = 3$. The last slide marks $y_6 - y_1$ on the simulated trajectories.]
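The integral above is evaluated by simulating along the factorization: draw $\mathbf{v}$, then $s$ given $\mathbf{v}$, then $\mathbf{y}_{\text{obs}}$, then $\mathbf{y}_{\text{mis}}$, and average $t(\mathbf{y})$. The sketch below shows only that sampling order. Every component model in it is a hypothetical toy stand-in (not the GP/AR/CAR, BART or Bayesian bootstrap models of the talk), the covariate is reduced to age, and the missing part is extrapolated with the same rule as the observed part, which is a MAR-like simplification.

```python
import random

T = 6  # number of scheduled visits, as on the slides

def draw_v(rng):
    """p(v): toy stand-in, resample an age from the slide examples."""
    return rng.choice([66, 41, 29, 33, 37])

def draw_s(v, rng):
    """p(s | v): toy dropout model in which older subjects drop out earlier."""
    p_stay = 0.95 if v < 50 else 0.8
    s = 1
    while s < T and rng.random() < p_stay:
        s += 1
    return s  # last observed visit; s == T means a completer

def draw_y(s, v, rng):
    """Toy AR(1)-style trajectory with a pattern-specific drift.
    Visits 1..s play the role of y_obs; visits s+1..T are extrapolated
    with the same rule and play the role of y_mis."""
    drift = -0.5 if s == T else -0.2
    y = [rng.gauss(0.01 * (v - 40), 1.0)]
    for _ in range(1, T):
        y.append(0.8 * y[-1] + drift + rng.gauss(0.0, 0.5))
    return y[:s], y[s:]

def g_computation(n_draws, rng):
    """Monte Carlo estimate of E[t(y)] with t(y) = y_6 - y_1."""
    total = 0.0
    for _ in range(n_draws):
        v = draw_v(rng)
        s = draw_s(v, rng)
        y_obs, y_mis = draw_y(s, v, rng)
        y = y_obs + y_mis
        total += y[T - 1] - y[0]
    return total / n_draws

estimate = g_computation(5000, random.Random(0))
print(round(estimate, 3))
```

In a full Bayesian analysis this loop would be repeated for each posterior draw of the model parameters, giving a posterior distribution for the treatment effect.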

Schizophrenia Dataset
Quantity reported: test drug improvement over placebo. A negative value represents an improvement.
Conclusion: no evidence that the test drug performs better than placebo.

Sensitivity Analysis
Vary the uncheckable assumptions and see whether the conclusion differs.

Concluding Remark
The model specifications (GP/AR/CAR/BART) exploit the data structure and lead to improvement over simple parametric approaches.

References
Zhou, T., Müller, P., Sengupta, S. and Ji, Y. (2019) PairClone: A Bayesian subclone caller based on mutation pairs. Journal of the Royal Statistical Society: Series C (Applied Statistics), 68(3), 705-725.
Zhou, T., Sengupta, S., Müller, P. and Ji, Y. (2019) TreeClone: Reconstruction of tumor subclone phylogeny based on mutation pairs using next generation sequencing data. The Annals of Applied Statistics, 13(2), 874-899.
Zhou, T., Daniels, M. J. and Müller, P. (2019) A semiparametric Bayesian approach to dropout in longitudinal studies with auxiliary covariates. Journal of Computational and Graphical Statistics, forthcoming.

Thank you!
Questions & comments: tjzhou@uchicago.edu
Currently on the job market.