Bayesian Nonparametric Models for Biomedical Data Analysis: Inference for Tumor Heterogeneity and Missing Data
Tianjian Zhou, U of Chicago/UT Austin
Tumor Heterogeneity
[Figure: animation in which the DNA sequences of the cells in a tumor sample appear one by one and are then grouped into Subclone 1 (Normal), Subclone 2, and Subclone 3, with population frequencies 2/10, 3/10, and 5/10.]
Why Study Tumor Heterogeneity
Mutations are bad; we want to eliminate mutated cells.
Therapy 1: targets the mutation at the 1st locus
Therapy 2: targets the mutation at the 2nd locus
[Figure: subclone DNA sequences shown alongside the two therapies.]
Inference for Tumor Heterogeneity
Subclones/DNA sequences are unobserved (latent).
Data: short DNA reads, a mixture of signals from many cells
Quantities of interest: # of subclones, phylogenetic relationship, genotypes, population frequencies
Proximal mutations are grouped into mutation pairs (Mutation Pair 1, Mutation Pair 2).
[Figure: short reads aligned to the two DNA strands, with the unknown subclones and their frequencies marked "?".]
Representation of Subclones
1: mutation; 0: no mutation (reference)
Mutation pairs (MP): MP1, MP2
Latent factor matrix Z (rows: mutation pairs MP1, MP2; columns: subclones S1, S2, S3)
Factor loadings w = (2/10, 3/10, 5/10): population frequencies of S1, S2, S3
# of subclones C = 3
Phylogenetic tree T: 1 → 2 → 3
[Figure: the subclone sequences recoded as 0/1 entries of Z; the individual entries are not recoverable from this transcript.]
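A minimal sketch of how Z and w combine, assuming a simplified binary encoding with one 0/1 entry per mutation and subclone. The slides' Z is defined on mutation pairs and its entries are not recoverable here, so the matrix values below are made up for illustration; only w = (2/10, 3/10, 5/10), C = 3, and the normal status of Subclone 1 come from the slides.

```python
from fractions import Fraction

# Hypothetical binary genotype matrix Z: rows = mutations, columns = subclones S1..S3.
# Column S1 is the normal subclone (all zeros). Values are illustrative only.
Z = [
    [0, 1, 1],  # mutation present in S2 and S3
    [0, 0, 1],  # mutation present only in S3
]

# Population frequencies (factor loadings) from the slides.
w = [Fraction(2, 10), Fraction(3, 10), Fraction(5, 10)]

def mutated_fraction(Z, w):
    """Expected fraction of cells carrying each mutation: sum over c of Z[k][c] * w[c]."""
    return [sum(z * wc for z, wc in zip(row, w)) for row in Z]

fractions_mutated = mutated_fraction(Z, w)
print(fractions_mutated)
```

Under this toy encoding, a mutation shared by more (or larger) subclones shows up in a larger fraction of cells, which is the signal that the short-read data carry about Z and w.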
Prior Model
Latent factor matrix Z | T, C, with tree T: 1 → 2 → 3
# of new mutations in subclone 2: $A_2 \sim \text{Trunc-Poi}(\lambda)$, e.g. $A_2 = 3$
# of new mutations in subclone 3: $A_3 \sim \text{Trunc-Poi}(\lambda)$, e.g. $A_3 = 4$
[Figure: animation in which Z is filled in one subclone column at a time (S1, S2, S3) over rows MP1 to MP4, following the tree.]
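The construction above can be sketched as a small simulation. This is a simplified illustration, not the slides' exact prior: it uses one binary entry per locus instead of mutation-pair genotypes, and it assumes the Poisson is truncated to {1, ..., number of still-unmutated loci}, since the slides do not give the truncation bounds. The names `sample_trunc_poisson`, `sample_Z`, `n_loci`, and `lam` are hypothetical.

```python
import math
import random

def sample_trunc_poisson(lam, upper, rng):
    """Draw from Poisson(lam) restricted to {1, ..., upper} by weighted sampling
    over the finite support (the truncation bounds are an assumption)."""
    support = range(1, upper + 1)
    weights = [math.exp(-lam) * lam**a / math.factorial(a) for a in support]
    return rng.choices(support, weights=weights, k=1)[0]

def sample_Z(parents, n_loci, lam, rng):
    """parents[c] is the parent index of subclone c (None for the normal root).
    Subclones must be ordered so that every parent precedes its children."""
    columns = []
    for parent in parents:
        if parent is None:
            columns.append([0] * n_loci)        # normal subclone: no mutations
            continue
        child = list(columns[parent])            # inherit the parent's genotype
        free = [k for k, z in enumerate(child) if z == 0]
        if free:
            a = sample_trunc_poisson(lam, len(free), rng)   # number of new mutations
            for k in rng.sample(free, a):
                child[k] = 1
        columns.append(child)
    # Transpose so that rows = loci and columns = subclones.
    return [list(row) for row in zip(*columns)]

rng = random.Random(0)
Z = sample_Z(parents=[None, 0, 1], n_loci=8, lam=2.0, rng=rng)   # tree 1 -> 2 -> 3
for row in Z:
    print(row)
```

Because each child copies its parent before gaining mutations, every row of the sampled Z is non-decreasing along the chain 1 → 2 → 3, which is how the tree constrains the matrix in this toy version.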
Sampling Model
[Figure: step-by-step animation in which short reads are drawn from the three subclones (population frequencies 2/10, 3/10, 5/10). For each read, the frequency of the subclone that generates it is highlighted in turn (5/10, 2/10, 5/10, 3/10, 3/10). The final frames show the reads next to the subclone sequences and their 1 (mutation) / 0 (reference) encoding.]
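The read-generating mechanism in the animation can be imitated with a short simulation. This is a toy sketch under stated assumptions: the three sequences, the read length, and the function name `sample_read` are hypothetical, and only the population frequencies 2/10, 3/10, 5/10 come from the slides.

```python
import random
from collections import Counter

# Hypothetical subclone sequences (illustrative; not taken from the slides).
subclones = {
    "S1": "ACGACGT",   # normal / reference
    "S2": "ACTACGT",   # one mutation relative to S1
    "S3": "ACTACAT",   # inherits S2's mutation and adds another
}
weights = {"S1": 0.2, "S2": 0.3, "S3": 0.5}   # population frequencies from the slides

def sample_read(rng, read_len=4):
    """Pick a subclone in proportion to its frequency, then copy a random window of its sequence."""
    names = list(subclones)
    name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    seq = subclones[name]
    start = rng.randrange(len(seq) - read_len + 1)
    return name, start, seq[start:start + read_len]

rng = random.Random(1)
reads = [sample_read(rng) for _ in range(10_000)]
origin_counts = Counter(name for name, _, _ in reads)
print({name: origin_counts[name] / len(reads) for name in subclones})
```

In real data the originating subclone of each read is not observed; only the read itself is. That is why the subclone sequences and frequencies have to be inferred from the mixture of reads.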
Inference for Tumor Heterogeneity
Sampling model & prior model → posterior inference
TCGA Lung Cancer Data
Malignant hyper-mutated subclone
Small population frequency
Concluding Remark
The use of mutation pairs strengthens inference for tumor heterogeneity.
Inference for Missing Data
$\mathbf{y}$: longitudinal outcomes after treatment with a test drug
$s$: dropout time (e.g. $s = 4$, $s = 5$)
Treatment effect: $E[Y_6 - Y_1]$
Biased if not MCAR
Inefficient
Can't do sensitivity analysis
Covariates $\mathbf{v}$, e.g. (Age: 66, Height: 185, Weight: 78, Gender: M), (Age: 41, Height: 170, Weight: 62, Gender: F), (Age: 29, Height: 166, Weight: 54, Gender: F), (Age: 33, Height: 159, Weight: 49, Gender: F)
Dropout reasons shown: lack of efficacy, pregnancy
[Figure: outcome trajectories plotted against time (visits 1 to 6).]
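Extrapolation Factorization
Joint model for $\mathbf{y}$, $s$ and $\mathbf{v}$:
$$p(\mathbf{y}, s, \mathbf{v}) = p(\mathbf{y}_{\text{mis}} \mid \mathbf{y}_{\text{obs}}, s, \mathbf{v})\, p(\mathbf{y}_{\text{obs}}, s, \mathbf{v})$$
Observed data distribution $p(\mathbf{y}_{\text{obs}}, s, \mathbf{v})$: identified and can be estimated semi/non-parametrically
Extrapolation distribution $p(\mathbf{y}_{\text{mis}} \mid \mathbf{y}_{\text{obs}}, s, \mathbf{v})$: not identified without uncheckable assumptions (e.g. missing at random, MAR; non-future dependent missingness, NFD)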
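Observed Data Distribution: Pattern Mixture Modeling
$$p(\mathbf{y}_{\text{obs}}, s, \mathbf{v}) = p(\mathbf{y}_{\text{obs}} \mid s, \mathbf{v})\, p(s \mid \mathbf{v})\, p(\mathbf{v})$$
$p(\mathbf{y}_{\text{obs}} \mid s, \mathbf{v})$: Gaussian process (GP), autoregressive (AR) and conditional autoregressive (CAR) priors
$p(s \mid \mathbf{v})$: Bayesian additive regression trees (BART)
$p(\mathbf{v})$: Bayesian bootstrap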
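Of the three pieces, the Bayesian bootstrap for $p(\mathbf{v})$ is the simplest to sketch. The version below is the generic textbook construction (Dirichlet(1, ..., 1) weights on the observed covariate vectors), not necessarily the exact implementation behind the slides. The four covariate vectors are the ones displayed on the slides and are used here only as a toy data set; the function names are hypothetical.

```python
import random

# Covariate vectors shown on the slides (Age, Height, Weight, Gender), used as toy data.
observed_v = [
    (66, 185, 78, "M"),
    (41, 170, 62, "F"),
    (29, 166, 54, "F"),
    (33, 159, 49, "F"),
]

def bayesian_bootstrap_weights(n, rng):
    """One posterior draw of the weights on n observed points: Dirichlet(1, ..., 1),
    generated by normalizing independent Exponential(1) variables."""
    g = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(g)
    return [x / total for x in g]

def draw_covariates(rng, size):
    """Draw covariate vectors from one Bayesian-bootstrap realization of p(v)."""
    weights = bayesian_bootstrap_weights(len(observed_v), rng)
    return rng.choices(observed_v, weights=weights, k=size), weights

rng = random.Random(2)
sample, weights = draw_covariates(rng, size=5)
print(weights)
print(sample)
```

The appeal of this choice is that it needs no parametric model for mixed continuous and categorical covariates: each posterior draw of $p(\mathbf{v})$ is just a reweighting of the observed rows.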
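Extrapolation Distribution: Identifying Restrictions
Missing at random (MAR): $p(\mathbf{y}_{\text{mis}} \mid \mathbf{y}_{\text{obs}}, s, \mathbf{v})$ is fully identified by $p(\mathbf{y}_{\text{obs}}, s, \mathbf{v})$
Non-future dependent (NFD): $p(\mathbf{y}_{\text{mis}} \mid \mathbf{y}_{\text{obs}}, s, \mathbf{v})$ is partially identified by $p(\mathbf{y}_{\text{obs}}, s, \mathbf{v})$. Put informative priors on the non-identified parameters.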
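One common way to write these two restrictions out, given here as a sketch under assumptions the slides do not state explicitly: dropout is monotone, there are $J$ scheduled visits, and $s$ is the last visit with an observed outcome. Write $\bar{\mathbf{y}}_{j} = (y_1, \dots, y_j)$, so that $\mathbf{y}_{\text{obs}} = \bar{\mathbf{y}}_{s}$ and the extrapolation distribution factors visit by visit:

$$p(\mathbf{y}_{\text{mis}} \mid \mathbf{y}_{\text{obs}}, s, \mathbf{v}) = \prod_{j=s+1}^{J} p(y_j \mid \bar{\mathbf{y}}_{j-1}, s, \mathbf{v})$$

$$\text{MAR:}\quad p(y_j \mid \bar{\mathbf{y}}_{j-1}, s, \mathbf{v}) = p(y_j \mid \bar{\mathbf{y}}_{j-1}, S \ge j, \mathbf{v}), \qquad j = s+1, \dots, J$$

$$\text{NFD:}\quad p(y_j \mid \bar{\mathbf{y}}_{j-1}, s, \mathbf{v}) = p(y_j \mid \bar{\mathbf{y}}_{j-1}, S \ge j-1, \mathbf{v}), \qquad j = s+2, \dots, J$$

In this form, MAR matches every missing-visit factor to subjects who are still observed at visit $j$, so nothing is left free. NFD constrains only visits from $s+2$ onward and leaves the factor for the first missing visit, $p(y_{s+1} \mid \bar{\mathbf{y}}_{s}, s, \mathbf{v})$, unrestricted, which is where an informative prior on non-identified parameters would enter.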
Monte Carlo Integration/G-Computation
$$E[t(\mathbf{y})] = \int_{\mathbf{y}} t(\mathbf{y})\, p(\mathbf{y})\, d\mathbf{y} = \int_{\mathbf{y}} t(\mathbf{y}) \sum_{s} \int_{\mathbf{v}} p(\mathbf{y}_{\text{mis}} \mid \mathbf{y}_{\text{obs}}, s, \mathbf{v})\, p(\mathbf{y}_{\text{obs}} \mid s, \mathbf{v})\, p(s \mid \mathbf{v})\, p(\mathbf{v})\, d\mathbf{v}\, d\mathbf{y}$$
[Figure: animation of the Monte Carlo draws on an outcome-versus-time plot (visits 1 to 6). Each draw first shows covariates $\mathbf{v}$, then a dropout time $s$, then the outcome trajectory: $\mathbf{v}$ = (Age: 33, Height: 166, Weight: 54, Gender: F) with $s = 5$; $\mathbf{v}$ = (Age: 37, Height: 163, Weight: 51, Gender: F) with $s = 6$; $\mathbf{v}$ = (Age: 66, Height: 185, Weight: 78, Gender: M) with $s = 3$. The final frame marks $y_6 - y_1$.]
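The integral above is evaluated by simulation: draw $\mathbf{v}$, then $s$ given $\mathbf{v}$, then the observed and missing outcomes, and average $t(\mathbf{y})$ over the draws. The sketch below shows that loop with toy stand-ins for every component. None of the four samplers is the GP/AR/CAR, BART, or Bayesian bootstrap model from the slides, and all numeric values (ages, retention probabilities, drift) are hypothetical.

```python
import random

J = 6  # number of scheduled visits, as on the slides' time axis

def draw_v(rng):
    """Toy stand-in for p(v): pick an age uniformly from a small hypothetical sample."""
    return rng.choice([29, 33, 41, 66])

def draw_s(v, rng):
    """Toy stand-in for p(s | v): older subjects are less likely to stay at each visit."""
    p_stay = 0.95 if v < 50 else 0.80      # hypothetical per-visit retention probability
    s = 1
    while s < J and rng.random() < p_stay:
        s += 1
    return s                                # s == J means the subject completed the study

def draw_y(s, v, rng, drift=-0.5):
    """Toy stand-ins for p(y_obs | s, v) and p(y_mis | y_obs, s, v): a Gaussian random
    walk whose missing part simply continues the observed part (v is unused here)."""
    y = [rng.gauss(0.0, 1.0)]
    for _ in range(1, s):                   # observed visits 2..s
        y.append(y[-1] + drift + rng.gauss(0.0, 1.0))
    for _ in range(s, J):                   # extrapolated visits s+1..J
        y.append(y[-1] + drift + rng.gauss(0.0, 1.0))
    return y

def g_computation(t, n_draws, rng):
    """Monte Carlo estimate of E[t(y)]: average t over draws of v, then s, then y."""
    total = 0.0
    for _ in range(n_draws):
        v = draw_v(rng)
        s = draw_s(v, rng)
        total += t(draw_y(s, v, rng))
    return total / n_draws

rng = random.Random(3)
estimate = g_computation(lambda y: y[J - 1] - y[0], n_draws=20_000, rng=rng)
print(round(estimate, 2))
```

In this toy setup the true value of $E[Y_6 - Y_1]$ is five steps of drift, that is $-2.5$, so the estimate can be checked against it. In a full Bayesian analysis the same loop would be repeated for each posterior draw of the model parameters.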
Schizophrenia Dataset
Test drug improvement over placebo
A negative value represents an improvement
Conclusion: no evidence that the test drug performs better than placebo
Sensitivity Analysis
Vary the uncheckable assumptions and see whether the conclusion differs.
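A toy illustration of the idea, assuming a single hypothetical sensitivity parameter `delta` that shifts the unobserved final outcome of dropouts away from what a MAR-type extrapolation would give. The dropout rate, the outcome distribution, and the grid of `delta` values are all made up; this is not the analysis reported on the slides.

```python
import random

def effect_under_delta(delta, n_draws=20_000, seed=4):
    """Toy Monte Carlo estimate of E[Y_6 - Y_1] when the missing final outcome of
    dropouts is shifted by `delta` (delta = 0 corresponds to no departure)."""
    rng = random.Random(seed)               # same seed for every delta: common random numbers
    total = 0.0
    for _ in range(n_draws):
        dropped_out = rng.random() < 0.4    # hypothetical dropout rate
        change = rng.gauss(-2.5, 2.0)       # hypothetical change from visit 1 to visit 6
        if dropped_out:
            change += delta                 # the parameter acts only on missing outcomes
        total += change
    return total / n_draws

effects = {delta: effect_under_delta(delta) for delta in (0.0, 1.0, 2.0, 3.0)}
for delta, effect in effects.items():
    print(delta, round(effect, 2))
```

Reading the output across the grid shows how far the assumption has to move before the estimated effect changes sign or size enough to alter the conclusion. Reusing the seed keeps the comparison across `delta` values free of extra Monte Carlo noise.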
Concluding Remark
The model specifications (GP/AR/CAR/BART) nicely exploit the data structure and lead to improvement over simple parametric approaches.
References
Zhou, T., Müller, P., Sengupta, S. and Ji, Y. (2019) PairClone: A Bayesian subclone caller based on mutation pairs. Journal of the Royal Statistical Society: Series C (Applied Statistics), 68(3).
Zhou, T., Sengupta, S., Müller, P. and Ji, Y. (2019) TreeClone: Reconstruction of tumor subclone phylogeny based on mutation pairs using next generation sequencing data. The Annals of Applied Statistics, 13(2).
Zhou, T., Daniels, M. J. and Müller, P. (2019) A Semiparametric Bayesian Approach to Dropout in Longitudinal Studies with Auxiliary Covariates. Journal of Computational and Graphical Statistics, forthcoming.
Thank you! Questions & comments: tjzhou@uchicago.edu
Currently on job market