Bayesian Nonparametric Models for Biomedical Data Analysis: Inference for Tumor Heterogeneity and Missing Data. Tianjian Zhou, U of Chicago/UT Austin
Tumor Heterogeneity: [Figure: animation build. Cells in a tumor sample are shown with their DNA sequences and grouped into Subclone 1 (normal), Subclone 2, and Subclone 3, with population frequencies 2/10, 3/10, and 5/10.]
Why Study Tumor Heterogeneity: mutations are harmful, and the goal is to eliminate the mutated cells. Therapy 1 targets the mutation at the 1st locus; Therapy 2 targets the mutation at the 2nd locus. [Figure: subclone DNA sequences shown with the two therapies, across two animation frames.]
Inference for Tumor Heterogeneity: subclones and their DNA sequences are unobserved (latent). [Figure: Subclone 1, Subclone 2, and further subclones, each with an unknown sequence; the number of subclones is also unknown.]
Inference for Tumor Heterogeneity: the data are short DNA reads, a mixture of signals from many cells. Quantities of interest: # of subclones, phylogenetic relationship, genotypes, population frequencies. Proximal mutations are grouped into mutation pairs (Mutation Pair 1, Mutation Pair 2). [Figure: two DNA strands and the short reads covering them.]
Representation of Subclones: Subclone 1, Subclone 2, and Subclone 3 have population frequencies 2/10, 3/10, and 5/10. Each locus is coded 1 (mutation) or 0 (no mutation, reference), and loci are grouped into mutation pairs (MP): MP1, MP2. [Figure: the subclone DNA sequences recoded as 0/1 entries by mutation pair.]
Representation of Subclones: latent factor matrix $Z$ (rows MP1, MP2; columns S1, S2, S3); factor loadings $w = (2/10, 3/10, 5/10)$; # of subclones $C = 3$; phylogenetic tree $T$: 1 → 2 → 3. [Figure: the 0/1 entries of $Z$.]
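As a rough illustration of how the latent factor matrix $Z$ and the loadings $w$ combine, the sketch below computes the fraction of cells carrying each mutation as $Zw$. The matrix `Z` here is hypothetical, and treating each row as a single 0/1 mutation is a simplifying assumption rather than the mutation-pair genotypes used in the talk; only $w = (2/10, 3/10, 5/10)$ and the tree 1 → 2 → 3 are taken from the slide.

```python
import numpy as np

# Hypothetical binary latent factor matrix: rows are mutations, columns are
# subclones S1..S3 (1 = mutation, 0 = reference). S1 is the normal subclone.
Z = np.array([
    [0, 1, 1],   # mutation acquired by S2 and inherited by S3
    [0, 0, 1],   # mutation acquired by S3 only
])

# Population frequencies (factor loadings) from the slide.
w = np.array([0.2, 0.3, 0.5])

def mutated_cell_fraction(Z, w):
    """Fraction of cells carrying each mutation: a w-weighted sum of Z's columns."""
    return Z @ w

frac = mutated_cell_fraction(Z, w)
print(frac)
```

Under these assumptions the first mutation is carried by the cells of S2 and S3 together, and the second by the cells of S3 alone.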
Prior Model: latent factor matrix $Z \mid T, C$, with $T$: 1 → 2 → 3. The columns S1, S2, S3 are filled in along the tree: the # of new mutations in subclone 2 is $A_2 \sim \text{Trunc-Poi}(\lambda)$ (in the example, $A_2 = 3$), and the # of new mutations in subclone 3 is $A_3 \sim \text{Trunc-Poi}(\lambda)$ (in the example, $A_3 = 4$). [Figure: animation build of $Z$ with rows MP1 to MP4 and columns S1 to S3.]
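The tree-based prior can be sketched as a small simulation. This is a minimal sketch under several assumptions that are not stated on the slides: each row of `Z` is a single binary site rather than a mutation pair, the truncated Poisson has support {1, ..., upper}, a child inherits all of its parent's mutations, and new sites are chosen uniformly at random. The names `sample_trunc_poisson`, `sample_Z`, `parent`, and `n_sites` are hypothetical.

```python
import math
import numpy as np

def sample_trunc_poisson(rng, lam, upper):
    """Draw from a Poisson(lam) truncated to {1, ..., upper} (assumed support)."""
    support = np.arange(1, upper + 1)
    log_pmf = support * math.log(lam) - np.array([math.lgamma(a + 1) for a in support])
    pmf = np.exp(log_pmf - log_pmf.max())
    return int(rng.choice(support, p=pmf / pmf.sum()))

def sample_Z(rng, parent, n_sites, lam):
    """Sample a binary Z (sites x subclones) given a tree encoded by `parent`.

    parent[c] is the index of the parent of subclone c; subclone 0 is the
    normal root and parent[0] is ignored. Requires n_sites >= len(parent) - 1.
    """
    C = len(parent)
    Z = np.zeros((n_sites, C), dtype=int)
    for c in range(1, C):
        Z[:, c] = Z[:, parent[c]]              # inherit the parent's mutations
        free = np.flatnonzero(Z[:, c] == 0)    # sites not yet mutated
        upper = len(free) - (C - 1 - c)        # leave room for later subclones
        a_c = sample_trunc_poisson(rng, lam, upper)
        new_sites = rng.choice(free, size=a_c, replace=False)
        Z[new_sites, c] = 1
    return Z

rng = np.random.default_rng(0)
Z = sample_Z(rng, parent=[0, 0, 1], n_sites=8, lam=3.0)  # chain 1 -> 2 -> 3
print(Z)
```

The upper truncation point is chosen here only so that every later subclone can still gain at least one new mutation; the actual truncation used in the talk may differ.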
Sampling Model: [Figure: animation build. Short reads are generated one at a time: a subclone is selected according to the population frequencies (2/10, 3/10, 5/10), and a short segment of its DNA sequence is read. In the final frame the accumulated reads are recoded as 0/1 (mutation/reference) entries.]
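The read-generating mechanism can be imitated with a toy simulation. This is only a sketch under loud assumptions: the three sequences in `subclones` are hypothetical (not the slide's), each subclone is treated as a single error-free strand, and read start positions are uniform. Only the population frequencies 2/10, 3/10, 5/10 come from the slides; `simulate_reads` and `base_fraction` are illustrative names.

```python
import numpy as np

# Hypothetical subclone sequences over 4 loci (not the slide's sequences).
subclones = ["ACGA", "ACGT", "AGGT"]
w = np.array([0.2, 0.3, 0.5])   # population frequencies from the slides

def simulate_reads(rng, subclones, w, n_reads, read_len):
    """Each read: pick a subclone with probability w, then copy a random window."""
    reads = []
    for _ in range(n_reads):
        c = rng.choice(len(subclones), p=w)
        seq = subclones[c]
        start = rng.integers(0, len(seq) - read_len + 1)
        reads.append((int(start), seq[start:start + read_len]))
    return reads

def base_fraction(reads, locus, base):
    """Among reads covering `locus`, the fraction showing `base` there."""
    covering = [r[locus - s] for s, r in reads if s <= locus < s + len(r)]
    return sum(b == base for b in covering) / len(covering)

rng = np.random.default_rng(1)
reads = simulate_reads(rng, subclones, w, n_reads=2000, read_len=2)

# Only the third hypothetical subclone carries 'G' at locus 1, so this
# fraction should land near its population frequency.
print(base_fraction(reads, locus=1, base="G"))
```

The point of the sketch is that the observed base fractions are a mixture over subclones, which is why the subclone sequences and frequencies have to be inferred jointly.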
Inference for Tumor Heterogeneity: sampling model & prior model → posterior inference for the # of subclones, phylogenetic relationship, genotypes, and population frequencies.
TCGA Lung Cancer Data: malignant hyper-mutated subclone; small population frequency.
Concluding Remark: the use of mutation pairs strengthens inference for tumor heterogeneity.
Inference for Missing Data: $\boldsymbol{y}$ denotes the longitudinal outcomes after treatment with a test drug, and $s$ the dropout time. The treatment effect is $E(Y_6 - Y_1)$. [Figure: outcome trajectories over times 1 to 6, with example dropouts at $s = 4$ and $s = 5$.]
Inference for Missing Data (drawbacks of the illustrated approach): biased if not MCAR; inefficient; can't do sensitivity analysis.
Inference for Missing Data: auxiliary covariates $\boldsymbol{v}$, for example (Age: 66, Height: 185, Weight: 78, Gender: M), (Age: 41, Height: 170, Weight: 62, Gender: F), (Age: 29, Height: 166, Weight: 54, Gender: F), and (Age: 33, Height: 159, Weight: 49, Gender: F). Dropout reasons shown: lack of efficacy; pregnancy. [Figure: outcome trajectories over times 1 to 6.]
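Extrapolation Factorization: the joint model for $\boldsymbol{y}$, $s$, and $\boldsymbol{v}$ is $p(\boldsymbol{y}, s, \boldsymbol{v}) = p(\boldsymbol{y}_{\text{mis}} \mid \boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v}) \, p(\boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$. Observed data distribution $p(\boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$: identified, and can be estimated semi- or nonparametrically. Extrapolation distribution $p(\boldsymbol{y}_{\text{mis}} \mid \boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$: not identified without uncheckable assumptions (e.g. missing at random, MAR; non-future dependent missingness, NFD).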
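Observed Data Distribution: Pattern Mixture Modeling. $p(\boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v}) = p(\boldsymbol{y}_{\text{obs}} \mid s, \boldsymbol{v}) \, p(s \mid \boldsymbol{v}) \, p(\boldsymbol{v})$. For $p(\boldsymbol{y}_{\text{obs}} \mid s, \boldsymbol{v})$: Gaussian process (GP), autoregressive (AR), and conditional autoregressive (CAR) priors. For $p(s \mid \boldsymbol{v})$: Bayesian additive regression trees (BART). For $p(\boldsymbol{v})$: Bayesian bootstrap.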
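Of the three components, the Bayesian bootstrap for $p(\boldsymbol{v})$ is the easiest to sketch: each posterior draw reweights the observed covariate rows with Dirichlet(1, ..., 1) weights. The example below is a minimal sketch, not the talk's implementation; the covariate rows reuse the numeric example values from the slides (gender omitted), and `bayesian_bootstrap_means` is a hypothetical helper that summarizes each draw by its weighted mean.

```python
import numpy as np

# Covariate rows (age, height, weight); values echo the slides' examples.
v = np.array([
    [66, 185, 78],
    [41, 170, 62],
    [29, 166, 54],
    [33, 159, 49],
], dtype=float)

def bayesian_bootstrap_means(rng, v, n_draws):
    """Posterior draws of E[v] under the Bayesian bootstrap.

    Each draw reweights the observed rows with Dirichlet(1, ..., 1) weights.
    """
    n = v.shape[0]
    weights = rng.dirichlet(np.ones(n), size=n_draws)   # shape (n_draws, n)
    return weights @ v                                  # shape (n_draws, 3)

rng = np.random.default_rng(2)
draws = bayesian_bootstrap_means(rng, v, n_draws=4000)
print(draws.mean(axis=0))   # averages of the draws, one per covariate
```

Because the weights sit on the observed rows only, simulating a new $\boldsymbol{v}$ from one posterior draw amounts to resampling an observed patient with those weights.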
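Extrapolation Distribution: Identifying Restrictions. Missing at random (MAR): $p(\boldsymbol{y}_{\text{mis}} \mid \boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$ is fully identified by $p(\boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$. Non-future dependent (NFD): $p(\boldsymbol{y}_{\text{mis}} \mid \boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$ is partially identified by $p(\boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$; informative priors are placed on the non-identified parameters.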
Monte Carlo Integration/G-Computation: $E[t(\boldsymbol{y})] = \int_{\boldsymbol{y}} t(\boldsymbol{y}) \, p(\boldsymbol{y}) \, d\boldsymbol{y} = \int_{\boldsymbol{y}} t(\boldsymbol{y}) \sum_{s} \int_{\boldsymbol{v}} p(\boldsymbol{y}_{\text{mis}} \mid \boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v}) \, p(\boldsymbol{y}_{\text{obs}} \mid s, \boldsymbol{v}) \, p(s \mid \boldsymbol{v}) \, p(\boldsymbol{v}) \, d\boldsymbol{v} \, d\boldsymbol{y}$. [Figure: animation build over three simulated patients: $\boldsymbol{v}$ = (Age: 33, Height: 166, Weight: 54, Gender: F) with $s = 5$; $\boldsymbol{v}$ = (Age: 37, Height: 163, Weight: 51, Gender: F) with $s = 6$; $\boldsymbol{v}$ = (Age: 66, Height: 185, Weight: 78, Gender: M) with $s = 3$. For each patient, $\boldsymbol{v}$ is shown first, then $s$, then the outcome trajectory over times 1 to 6; the final frame marks $y_6 - y_1$.]
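The integral can be approximated by simulating $\boldsymbol{v}$, then $s$, then $\boldsymbol{y}$, and averaging $t(\boldsymbol{y})$. The sketch below shows only that mechanic. Every component is a toy stand-in and an assumption: `draw_v`, `draw_s`, and `draw_y` are hypothetical replacements for the Bayesian bootstrap, BART, and GP/AR/CAR models; $s$ is read as the last observed visit; the extrapolation step simply continues the same random walk rather than implementing the MAR or NFD restrictions; and `delta` is a hypothetical sensitivity parameter that shifts the unobserved visits.

```python
import numpy as np

J = 6  # number of scheduled visits, as on the slides

def draw_v(rng):
    """Stand-in for p(v): a single covariate (age) drawn from a toy sample."""
    return rng.choice([29.0, 33.0, 41.0, 66.0])

def draw_s(rng, v):
    """Stand-in for p(s | v): last observed visit in {2, ..., J}."""
    p_complete = 0.7 if v < 50 else 0.4          # toy rule: older patients drop out more
    probs = np.full(J - 1, (1 - p_complete) / (J - 2))
    probs[-1] = p_complete                       # s = J means no dropout
    return int(rng.choice(np.arange(2, J + 1), p=probs))

def draw_y(rng, s, v, delta=0.0):
    """Stand-in for p(y_obs | s, v) and p(y_mis | y_obs, s, v).

    Visits 1..s are 'observed'; visits s+1..J continue the same random walk,
    shifted by delta per visit. The toy model ignores v.
    """
    y = np.empty(J)
    y[0] = rng.normal(0.0, 1.0)
    for t in range(1, J):
        shift = delta if t >= s else 0.0         # index t is visit t + 1
        y[t] = y[t - 1] - 0.5 + shift + rng.normal(0.0, 1.0)
    return y

def g_computation(rng, t_fn, n_sims, delta=0.0):
    """Monte Carlo estimate of E[t(y)]: average t over simulated (v, s, y)."""
    total = 0.0
    for _ in range(n_sims):
        v = draw_v(rng)
        s = draw_s(rng, v)
        total += t_fn(draw_y(rng, s, v, delta))
    return total / n_sims

rng = np.random.default_rng(3)
effect = g_computation(rng, lambda y: y[5] - y[0], n_sims=20000)
print(effect)
```

Rerunning `g_computation` with a nonzero `delta` changes only the extrapolated visits, which is one simple way to see how an estimate depends on an uncheckable assumption.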
Schizophrenia Dataset: the quantity reported is the test drug's improvement over placebo, where a negative value represents an improvement. Conclusion: no evidence that the test drug performs better than placebo.
Sensitivity Analysis: vary the uncheckable assumptions and see whether the conclusion changes.
Concluding Remark: the model specifications (GP/AR/CAR/BART) exploit the data structure and lead to improvement over simple parametric approaches.
References
Zhou, T., Müller, P., Sengupta, S. and Ji, Y. (2019). PairClone: A Bayesian subclone caller based on mutation pairs. Journal of the Royal Statistical Society: Series C (Applied Statistics), 68(3), 705-725.
Zhou, T., Sengupta, S., Müller, P. and Ji, Y. (2019). TreeClone: Reconstruction of tumor subclone phylogeny based on mutation pairs using next generation sequencing data. The Annals of Applied Statistics, 13(2), 874-899.
Zhou, T., Daniels, M. J. and Müller, P. (2019). A semiparametric Bayesian approach to dropout in longitudinal studies with auxiliary covariates. Journal of Computational and Graphical Statistics, forthcoming.
Thank you! Questions & comments: tjzhou@uchicago.edu. Currently on the job market.