Tianjian Zhou U of Chicago/UT Austin

Bayesian Nonparametric Models for Biomedical Data Analysis: Inference for Tumor Heterogeneity and Missing Data

Tumor Heterogeneity
[Figure: DNA sequences of the cells in a tumor sample, grouped into Subclone 1 (normal), Subclone 2, and Subclone 3; animation build frames omitted]
Population frequencies: 2/10, 3/10, 5/10

Why Study Tumor Heterogeneity
Mutations are bad; we want to eliminate mutated cells.
Therapy 1: targets the mutation at the 1st locus
Therapy 2: targets the mutation at the 2nd locus
[Figure: subclone DNA sequences shown alongside the two therapies]

Inference for Tumor Heterogeneity
Subclones/DNA sequences are unobserved (latent).
Data: short DNA reads, a mixture of signals from many cells
Quantities of interest: # of subclones, phylogenetic relationship, genotypes, population frequencies
[Figure: short reads from the two DNA strands; proximal mutations are grouped into Mutation Pair 1 and Mutation Pair 2]

Representation of Subclones
Mutation pairs (MP); 1: mutation; 0: no mutation (reference)
Latent factor matrix Z: rows MP1, MP2; columns S1, S2, S3
Factor loadings w: population frequencies (2/10, 3/10, 5/10)
# of subclones C = 3
Phylogenetic tree T: 1 → 2 → 3
[Figure: entries of Z for each mutation pair and subclone; animation build frames omitted]
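A minimal sketch of how Z and w combine, under simplifying assumptions: Z is reduced to one binary row per mutation (the slides use mutation pairs, which carry more detail), and the entries of Z below are hypothetical, chosen only to respect the tree 1 → 2 → 3. The frequencies are the ones on the slide.

```python
# Hypothetical binary matrix Z: rows are mutations, columns are subclones S1, S2, S3.
# S1 is normal (all zeros); under the linear tree 1 -> 2 -> 3,
# S3 inherits every mutation carried by S2.
Z = [
    [0, 1, 1],  # mutation acquired by S2 and inherited by S3
    [0, 0, 1],  # mutation that is new in S3
]
w = [0.2, 0.3, 0.5]  # population frequencies 2/10, 3/10, 5/10


def mutated_cell_fraction(Z, w):
    """Fraction of cells carrying each mutation: the matrix-vector product Z w."""
    return [sum(z * wc for z, wc in zip(row, w)) for row in Z]


fractions = mutated_cell_fraction(Z, w)
print(fractions)
```

Each entry of `fractions` is the total frequency of the subclones that carry the corresponding mutation, which is the sense in which w acts as factor loadings on Z.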

Prior Model
Latent factor matrix Z | T, C, with T: 1 → 2 → 3
# of new mutations: $A_2 \sim \text{Trunc-Poi}(\lambda)$, shown with $A_2 = 3$
# of new mutations: $A_3 \sim \text{Trunc-Poi}(\lambda)$, shown with $A_3 = 4$
[Figure: Z filled in column by column (S1, S2, S3) over rows MP1 to MP4; animation build frames omitted]
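The column-by-column construction can be sketched as follows. This is an illustration, not the prior used in the papers: it assumes binary sites, a linear tree, a truncation range of 1 to the number of still-unmutated sites, and uniform placement of new mutations. The helper names `sample_trunc_poisson` and `sample_Z` are hypothetical.

```python
import math
import random


def sample_trunc_poisson(lam, lo, hi, rng):
    """Draw from Poisson(lam) restricted to the integers lo..hi (inclusive)."""
    support = list(range(lo, hi + 1))
    weights = [math.exp(-lam) * lam**k / math.factorial(k) for k in support]
    return rng.choices(support, weights=weights, k=1)[0]


def sample_Z(n_sites, n_subclones, lam, rng):
    """Build a binary sites-by-subclones matrix along the tree 1 -> 2 -> ... -> C."""
    Z = [[0] * n_subclones for _ in range(n_sites)]
    for c in range(1, n_subclones):  # column 0 (subclone 1) stays normal
        for row in Z:
            row[c] = row[c - 1]  # the child inherits the parent's mutations
        free = [i for i in range(n_sites) if Z[i][c] == 0]
        if free:
            # Assumed truncation: at least 1 and at most len(free) new mutations.
            a_c = sample_trunc_poisson(lam, 1, len(free), rng)
            for i in rng.sample(free, a_c):
                Z[i][c] = 1
    return Z


rng = random.Random(0)
Z = sample_Z(n_sites=8, n_subclones=3, lam=2.0, rng=rng)
for row in Z:
    print(row)
```

Every column after the first contains its parent's mutations plus a truncated-Poisson number of new ones, so the phylogeny is encoded directly in Z.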


Sampling Model
[Figure: short reads are built up one at a time; each frame highlights one subclone by its population frequency (2/10, 3/10, or 5/10) and a segment of that subclone's sequence is added to the set of reads; animation build frames omitted]
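The read-generating mechanism in the animation can be imitated with a short simulation. This is a toy sketch: the subclone sequences, the read length, and the uniform choice of start position are assumptions made for illustration. Only the population frequencies come from the slide.

```python
import random

# Hypothetical subclone sequences; S1 plays the role of the normal subclone.
subclones = {"S1": "ACGACGAC", "S2": "ACGAGGAC", "S3": "ACTAGGAC"}
w = {"S1": 0.2, "S2": 0.3, "S3": 0.5}  # population frequencies 2/10, 3/10, 5/10


def sample_read(rng, read_len=4):
    """Pick a subclone with probability w, then copy a short segment of its sequence."""
    name = rng.choices(list(w), weights=list(w.values()), k=1)[0]
    seq = subclones[name]
    start = rng.randrange(len(seq) - read_len + 1)
    return name, start, seq[start:start + read_len]


rng = random.Random(1)
reads = [sample_read(rng) for _ in range(1000)]

# In real data only the read itself is observed; the source subclone is latent.
share_s3 = sum(name == "S3" for name, _, _ in reads) / len(reads)
print(reads[:3], share_s3)
```

Because the source label is hidden in practice, the observed reads are a mixture over subclones, which is why Z and w have to be inferred jointly.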

Sampling model & prior model → posterior inference

TCGA Lung Cancer Data
Malignant hyper-mutated subclone
Small population frequency

Concluding Remark
The use of mutation pairs strengthens inference for tumor heterogeneity.

Inference for Missing Data
$\boldsymbol{y}$: longitudinal outcomes after treatment with a test drug
$s$: dropout time
Treatment effect: $E[Y_6 - Y_1]$
[Figure: outcome versus time (visits 1 to 6) for patients with dropout times $s = 4$ and $s = 5$]
Biased if not MCAR; inefficient; can't do sensitivity analysis

[Figure: outcome versus time (visits 1 to 6) for four patients with covariates $\boldsymbol{v}$]
$\boldsymbol{v}$ = (Age: 66, Height: 185, Weight: 78, Gender: M)
$\boldsymbol{v}$ = (Age: 41, Height: 170, Weight: 62, Gender: F)
$\boldsymbol{v}$ = (Age: 29, Height: 166, Weight: 54, Gender: F)
$\boldsymbol{v}$ = (Age: 33, Height: 159, Weight: 49, Gender: F)
Reasons for dropout shown: lack of efficacy, pregnancy

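Extrapolation Factorization
Joint model for $\boldsymbol{y}$, $s$ and $\boldsymbol{v}$:
$$p(\boldsymbol{y}, s, \boldsymbol{v}) = p(\boldsymbol{y}_{\text{mis}} \mid \boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v}) \, p(\boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$$
Observed data distribution $p(\boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$: identified and can be estimated semi/non-parametrically
Extrapolation distribution $p(\boldsymbol{y}_{\text{mis}} \mid \boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$: not identified without uncheckable assumptions (e.g. missing at random, MAR; non-future dependence, NFD)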

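Observed Data Distribution: Pattern Mixture Modeling
$$p(\boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v}) = p(\boldsymbol{y}_{\text{obs}} \mid s, \boldsymbol{v}) \, p(s \mid \boldsymbol{v}) \, p(\boldsymbol{v})$$
$p(\boldsymbol{y}_{\text{obs}} \mid s, \boldsymbol{v})$: Gaussian process (GP), autoregressive (AR), and conditional autoregressive (CAR) priors
$p(s \mid \boldsymbol{v})$: Bayesian additive regression trees (BART)
$p(\boldsymbol{v})$: Bayesian bootstrap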
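Of the three components, the Bayesian bootstrap for the covariate distribution is the simplest to illustrate. The sketch below draws Dirichlet(1, ..., 1) weights over the observed covariate values (the standard Bayesian bootstrap) and uses them to get posterior draws of the mean age. The four ages are taken from the example patients on the earlier slide; using them as a dataset is purely illustrative.

```python
import random

ages = [66, 41, 29, 33]  # ages of the four example patients


def bayesian_bootstrap_weights(n, rng):
    """One Dirichlet(1, ..., 1) draw, built from normalized Exponential(1) variates."""
    g = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(g)
    return [x / total for x in g]


rng = random.Random(0)
draws = []
for _ in range(2000):
    wts = bayesian_bootstrap_weights(len(ages), rng)
    # Each weight vector defines one posterior draw of the distribution of v;
    # here it is summarized by its mean age.
    draws.append(sum(wt * a for wt, a in zip(wts, ages)))

posterior_mean_age = sum(draws) / len(draws)
print(round(posterior_mean_age, 2))
```

No parametric form is imposed on the covariates: all posterior mass stays on the observed covariate vectors, and only the weights vary from draw to draw.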

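Extrapolation Distribution: Identifying Restrictions
Missing at random (MAR): $p(\boldsymbol{y}_{\text{mis}} \mid \boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$ is fully identified by $p(\boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$.
Non-future dependent (NFD): $p(\boldsymbol{y}_{\text{mis}} \mid \boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$ is partially identified by $p(\boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v})$. Put informative priors on the non-identified parameters.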

Monte Carlo Integration/G-Computation
$$E[t(\boldsymbol{y})] = \int t(\boldsymbol{y}) \, p(\boldsymbol{y}) \, d\boldsymbol{y} = \int t(\boldsymbol{y}) \sum_{s} \int p(\boldsymbol{y}_{\text{mis}} \mid \boldsymbol{y}_{\text{obs}}, s, \boldsymbol{v}) \, p(\boldsymbol{y}_{\text{obs}} \mid s, \boldsymbol{v}) \, p(s \mid \boldsymbol{v}) \, p(\boldsymbol{v}) \, d\boldsymbol{v} \, d\boldsymbol{y}$$
[Figure: animation of repeated simulation draws on an outcome versus time plot (visits 1 to 6). Each draw shows a covariate vector followed by a dropout time: $\boldsymbol{v}$ = (Age: 33, Height: 166, Weight: 54, Gender: F) with $s = 5$; $\boldsymbol{v}$ = (Age: 37, Height: 163, Weight: 51, Gender: F) with $s = 6$; $\boldsymbol{v}$ = (Age: 66, Height: 185, Weight: 78, Gender: M) with $s = 3$. The final frame marks $y_6 - y_1$.]
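The integral can be approximated by simulating from each factor in turn, which is what the animation steps through. The sketch below shows only that mechanism: every distribution in it is a toy stand-in (not the GP/AR/CAR, BART, or Bayesian bootstrap models of the talk), and `delta` is a hypothetical sensitivity parameter that shifts the extrapolated outcomes.

```python
import random

J = 6  # number of scheduled visits, as on the time axis of the slides
covariates = [{"age": 66}, {"age": 41}, {"age": 29}, {"age": 33}]  # example ages


def draw_v(rng):
    """Toy stand-in for a draw from p(v)."""
    return rng.choice(covariates)


def draw_s(v, rng):
    """Toy p(s | v): after each visit, drop out with an age-dependent hazard."""
    hazard = 0.05 + 0.002 * v["age"]
    s = 1
    while s < J and rng.random() > hazard:
        s += 1
    return s  # last observed visit; s == J means no dropout


def draw_y_obs(s, v, rng):
    """Toy p(y_obs | s, v): linear trend plus Gaussian noise for visits 1..s."""
    return [50.0 - 2.0 * t + rng.gauss(0.0, 1.0) for t in range(1, s + 1)]


def draw_y_mis(y_obs, s, v, rng, delta):
    """Toy extrapolation for visits s+1..J: same trend, shifted by delta."""
    return [50.0 - 2.0 * t + delta + rng.gauss(0.0, 1.0) for t in range(s + 1, J + 1)]


def g_computation(n_draws, rng, delta=0.0):
    """Monte Carlo estimate of E[t(y)] with t(y) = y_6 - y_1."""
    total = 0.0
    for _ in range(n_draws):
        v = draw_v(rng)
        s = draw_s(v, rng)
        y_obs = draw_y_obs(s, v, rng)
        y = y_obs + draw_y_mis(y_obs, s, v, rng, delta)
        total += y[J - 1] - y[0]
    return total / n_draws


estimate = g_computation(20000, random.Random(0))
shifted = g_computation(5000, random.Random(1), delta=5.0)
print(round(estimate, 2), round(shifted, 2))
```

Rerunning with different values of `delta` changes only the extrapolation step, so the observed-data model is left untouched while the unverifiable part of the model is varied.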

Schizophrenia Dataset
[Figure: test drug improvement over placebo; a negative value represents an improvement]
Conclusion: no evidence that the test drug performs better than placebo

Sensitivity Analysis
Vary the uncheckable assumptions and see whether the conclusion differs.

Concluding Remark
The model specifications (GP/AR/CAR/BART) nicely exploit the data structure and lead to improvement over simple parametric approaches.

References
Zhou, T., Müller, P., Sengupta, S. and Ji, Y. (2019). PairClone: A Bayesian subclone caller based on mutation pairs. Journal of the Royal Statistical Society: Series C (Applied Statistics), 68(3).
Zhou, T., Sengupta, S., Müller, P. and Ji, Y. (2019). TreeClone: Reconstruction of tumor subclone phylogeny based on mutation pairs using next generation sequencing data. The Annals of Applied Statistics, 13(2).
Zhou, T., Daniels, M. J. and Müller, P. (2019). A Semiparametric Bayesian Approach to Dropout in Longitudinal Studies with Auxiliary Covariates. Journal of Computational and Graphical Statistics, forthcoming.

Thank you! Questions & comments: tjzhou@uchicago.edu
Currently on job market
