Presentation transcript:

BAYCLONE: BAYESIAN NONPARAMETRIC INFERENCE OF TUMOR SUBCLONES USING NGS DATA SUBHAJIT SENGUPTA, JIN WANG, JUHEE LEE, PETER MULLER, KAMALAKAR GULUKOTA, ARUNAVA BANERJEE, YUAN JI Takanobu Tsutsumi

Main idea
Most multicellular organisms have two sets of chromosomes; they are diploid. A diploid organism carries one copy of each gene (and therefore one allele) on each chromosome. At each locus, the two alleles are homozygous if they share the same genotype, or heterozygous if they do not. A binary Indian buffet process (IBP) can only record whether an SNV is mutated or not, which implicitly treats every SNV as homozygous: both alleles are either mutated or wild-type. The IBP model is therefore not sufficient to fully describe subclonal genomes, because biologically there are three possible allelic genotypes at an SNV: homozygous wild-type (no mutation on either allele), heterozygous mutant (mutation on exactly one allele), or homozygous mutant (mutation on both alleles).

The Indian buffet process
The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns.

Z = [z_{sc}] is an S × C ternary matrix (S: SNVs, C: subclones).
z_{sc}: allelic variation at SNV site s for subclone c (s = 1, 2, ..., S; c = 1, 2, ..., C), with z_{sc} ∈ {0, 0.5, 1}:
z_{sc} = 0: homozygous wild-type
z_{sc} = 0.5: heterozygous variant
z_{sc} = 1: homozygous variant
The proportions of the C subclones in sample t are denoted w_t = (w_{t0}, w_{t1}, ..., w_{tC}), with 0 < w_{tc} < 1 and Σ_{c=0}^{C} w_{tc} = 1 (the extra component w_{t0} is a background weight introduced below). The contribution of subclone c to the VAF at an SNV is
0 × w_{tc} for homozygous wild-type,
0.5 × w_{tc} for a heterozygous mutant,
1 × w_{tc} for a homozygous mutant.
They develop a latent feature model for the entire matrix Z to uncover the unknown subclones that constitute the tumor cells. Given the data, they aim to infer two quantities, Z and w, by a Bayesian inference scheme. They extend the IBP to a categorical IBP (cIBP) that allows three values, 0, 0.5, and 1, to describe the corresponding genotypes at each SNV.
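A minimal sketch, assuming NumPy, of how the ternary matrix Z and the subclone proportions w_t combine into per-subclone VAF contributions; the toy sizes and randomly drawn values here are purely illustrative, not from the paper:

```python
import numpy as np

S, C = 5, 3                          # toy sizes: 5 SNVs, 3 subclones (illustrative only)
rng = np.random.default_rng(0)

# Ternary genotype matrix Z: entries in {0, 0.5, 1}
Z = rng.choice([0.0, 0.5, 1.0], size=(S, C))

# Subclone proportions for one sample t: w_t = (w_t0, w_t1, ..., w_tC),
# a point on the simplex; w_t0 is reserved for the background/noise term.
w_t = rng.dirichlet(np.ones(C + 1))

# Contribution of subclone c to the VAF at SNV s is z_sc * w_tc
contrib = Z * w_t[1:]                 # shape (S, C)
expected_vaf = contrib.sum(axis=1)    # ignoring the error term for now
print(expected_vaf)
```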

Illustration of the cIBP matrix Z for subclones in a tumor sample. Colored cells: green = 1 (homozygous variant), brown = 0.5 (heterozygous variant), white = 0 (homozygous wild-type). For example, the same mutation at SNV 5 is shared by two subclones (subclones three and four).

Probability Model
Latent feature model with IBP
Each subclone (one column of Z) is a latent feature vector, and a data point is the observed VAF. For each component z_{sc} of the binary matrix Z, assume

z_{sc} | π_c ~ Bern(π_c),   π_c | α ~ Beta(α/C, 1),   c = 1, ..., C,   (1)

where Bern(π_c) is the Bernoulli distribution and π_c ∈ (0, 1) is the prior probability Pr(z_{sc} = 1). To extend the IBP to a categorical setting, each entry of the matrix is not restricted to 0 or 1, but takes an integer value in {0, 1, ..., Q}, where Q is fixed a priori. They call the extended model the categorical IBP (cIBP) and use it as a prior for exploring subclones of tumor samples.
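A minimal sketch of a draw from the finite IBP approximation in (1), assuming NumPy; the sizes and the value of α are illustrative:

```python
import numpy as np

def sample_finite_ibp(S, C, alpha, rng):
    """Draw a binary S x C matrix from the finite-C approximation of the IBP:
    pi_c ~ Beta(alpha/C, 1), then z_sc ~ Bernoulli(pi_c) independently."""
    pi = rng.beta(alpha / C, 1.0, size=C)    # one inclusion probability per column
    Z = rng.binomial(1, pi, size=(S, C))     # pi broadcasts across the S rows
    return Z

rng = np.random.default_rng(1)
Z = sample_finite_ibp(S=100, C=10, alpha=2.0, rng=rng)
print(Z.sum(axis=0))   # columns with larger pi_c collect more 1s
```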

Development of cIBP
A straightforward extension of the IBP in (1) is to replace the underlying beta distribution of π_c with a Dirichlet distribution and to replace the Bernoulli distribution of z_{sc} with a multinomial distribution (2). Integrating out π_c in (2) gives the probability of the (Q + 1)-nary matrix Z, where m_{cq} denotes the number of rows taking value q ∈ {1, ..., Q} in column c.
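A minimal sketch of a draw from a finite categorical extension of the IBP, assuming NumPy. The Dirichlet parameterization used below, which reduces to the Beta(α/C, 1) prior of (1) when Q = 1 and β_1 = 1, is my reading of the construction and should be treated as an assumption rather than the paper's exact prior:

```python
import numpy as np

def sample_finite_cibp(S, C, Q, alpha, beta, rng):
    """Draw an S x C matrix with entries in {0, 1, ..., Q}.

    Assumed construction (hedged): for each column c,
      (pi_c1, ..., pi_cQ, pi_c0) ~ Dirichlet((alpha/C)*beta_1, ..., (alpha/C)*beta_Q, 1),
    then z_sc ~ Categorical with Pr(z_sc = q) = pi_cq."""
    conc = np.concatenate((alpha / C * np.asarray(beta), [1.0]))  # last slot is q = 0
    values = np.r_[np.arange(1, Q + 1), 0]    # value order matches the concentration vector
    Z = np.empty((S, C), dtype=int)
    for c in range(C):
        pi_c = rng.dirichlet(conc)
        Z[:, c] = rng.choice(values, size=S, p=pi_c)
    return Z

rng = np.random.default_rng(2)
Z = sample_finite_cibp(S=100, C=10, Q=2, alpha=2.0, beta=[1.0, 1.0], rng=rng)
# In BayClone, Q = 2 and the levels {0, 1, 2} map to genotypes {0, 0.5, 1}.
print(np.bincount(Z.ravel(), minlength=3))
```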

Sampling model
Suppose there are T tumor samples in which S SNVs are measured for each sample. They assume a binomial sampling model,

n_{st} | N_{st}, p_{st} ~ Bin(N_{st}, p_{st}),

where N_{st} is the total number of reads mapped to SNV s in sample t (s = 1, 2, ..., S; t = 1, 2, ..., T), n_{st} of those N_{st} reads carry the variant sequence at the locus, and p_{st} is the expected proportion of variant reads. The matrix Z follows a finite version of the cIBP in (2), Z ~ cIBP_C(Q = 2, α, [β_1, β_2]). They assume the vector of subclonal weights w_t follows a Dirichlet prior (3).
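A minimal sketch of simulating read counts under this binomial sampling model, assuming NumPy; the read depths and VAFs are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
S, T = 100, 4                               # illustrative sizes
N = rng.integers(100, 241, size=(S, T))     # total read depth N_st
p = rng.uniform(0.05, 0.6, size=(S, T))     # expected variant-allele fraction p_st
n = rng.binomial(N, p)                      # observed variant read counts n_st
vaf_hat = n / N                             # empirical VAF, used later for filtering
```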

Parameters
The proportion p_{st} is modeled as a linear combination of the variant alleles z_{sc} ∈ {0, 0.5, 1}, weighted by the proportions of the subclones bearing those alleles; that is, the expected p_{st} is the result of mixing subclones with different proportions,

p_{st} = ε_{t0} + Σ_{c=1}^{C} w_{tc} z_{sc},   (4)

where ε_{t0} is an error term defined as ε_{t0} = p_0 w_{t0}, with p_0 ~ Beta(α_0, β_0), devised to capture experimental and data-processing noise. Here p_0 is the relative frequency of variant reads produced as error by upstream data processing and takes a small value close to zero, and w_{t0} absorbs the noise left unaccounted for by {w_{t1}, ..., w_{tC}}. Ignoring the error term ε_{t0}, model (4) can also be viewed as a non-negative matrix factorization (NMF).
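A minimal sketch of the mixing model (4) for a single sample, assuming NumPy; the toy Z, w, and p_0 values are illustrative:

```python
import numpy as np

def expected_vaf(Z, w, p0):
    """Expected variant-allele fraction p_st for one sample:
    p_st = p0 * w[0] + sum_c w[c] * Z[s, c]  (equation (4)).
    Z: (S, C) ternary matrix with entries in {0, 0.5, 1};
    w: length C+1 simplex vector, with w[0] the background weight."""
    eps = p0 * w[0]               # error term epsilon_t0
    return eps + Z @ w[1:]        # shape (S,)

rng = np.random.default_rng(4)
S, C = 100, 4
Z = rng.choice([0.0, 0.5, 1.0], size=(S, C))
w = rng.dirichlet(np.ones(C + 1))
p = expected_vaf(Z, w, p0=0.01)
assert np.all((p >= 0) & (p <= 1))   # each p_st is a valid proportion
```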

Model Selection and Posterior Inference
MCMC simulation
To infer the model parameters from the posterior distribution, they use Markov chain Monte Carlo (MCMC) simulation. Gibbs sampling is used to update z_{sc}, whereas Metropolis-Hastings (MH) sampling is used to draw samples of w_{tc} and p_0. Let z_{-s,c} denote the assignments of all SNVs other than SNV s for subclone c, m^-_{cq} the number of SNVs with level q in column c, not counting SNV s, and m^-_{c·} = Σ_{q=1}^{Q} m^-_{cq}. Finally, p'_{st} is the value of p_{st} obtained by plugging in the current MCMC values and setting z_{sc} = q.
Markov chain Monte Carlo (MCMC)
MCMC methods are a class of algorithms for sampling from a probability distribution by constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a number of steps is then used as a sample of the desired distribution, and the quality of the sample improves with the number of steps.
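A minimal sketch of a random-walk Metropolis-Hastings update for p_0 on the logit scale, assuming NumPy and SciPy; the proposal scale, the Beta prior hyperparameters, and the way the log posterior is wired up are illustrative, not the paper's exact sampler:

```python
import numpy as np
from scipy.stats import beta, binom
from scipy.special import expit, logit

def log_post_p0(p0, n, N, Z, W, a0=1.0, b0=99.0):
    """Log posterior of p0 up to a constant: Beta(a0, b0) prior plus the
    binomial likelihood with p_st = p0 * w_t0 + sum_c w_tc * z_sc.
    n, N: (S, T) variant and total read counts; Z: (S, C); W: (T, C+1)."""
    P = p0 * W[:, 0][None, :] + Z @ W[:, 1:].T     # (S, T) matrix of p_st
    return beta.logpdf(p0, a0, b0) + binom.logpmf(n, N, P).sum()

def mh_update_p0(p0, n, N, Z, W, rng, step=0.3):
    """One random-walk MH step for p0 on the logit scale (Jacobian included)."""
    prop = expit(logit(p0) + step * rng.standard_normal())
    log_acc = (log_post_p0(prop, n, N, Z, W) + np.log(prop) + np.log1p(-prop)
               - log_post_p0(p0, n, N, Z, W) - np.log(p0) - np.log1p(-p0))
    return prop if np.log(rng.uniform()) < log_acc else p0
```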

Choice of C
The number of subclones C in the cIBP is unknown and must be estimated. They use predictive densities as a selection criterion. Let n_{-st} denote the data with n_{st} removed and η_C the set of parameters for a given C. The conditional predictive ordinate (CPO) of n_{st} given n_{-st} is

CPO_{st} = p(n_{st} | n_{-st}) = ∫ p(n_{st} | η_C) p(η_C | n_{-st}) dη_C.   (5)

The Monte Carlo estimate of (5) is the harmonic mean of the likelihood values p(n_{st} | η_C^{(l)}) over the MCMC samples l = 1, ..., L,

CPÔ_{st} = [ (1/L) Σ_{l=1}^{L} 1 / p(n_{st} | η_C^{(l)}) ]^{-1}.   (6)

They leave each data point out of n in turn and compute the average log-pseudo-marginal likelihood (LPML) over this set as L_C = (1/(S T)) Σ_{s,t} log CPÔ_{st}. For different values of C, they compare the values of L_C and choose the Ĉ that maximizes L_C.
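A minimal sketch of the harmonic-mean CPO estimate and the resulting LPML from saved per-observation log-likelihoods, assuming NumPy and SciPy; the array shape is an assumption about how the MCMC output might be stored:

```python
import numpy as np
from scipy.special import logsumexp

def lpml(loglik):
    """loglik: array of shape (L, S, T) holding log p(n_st | eta_C^(l)) for each
    MCMC sample l. Returns the average log-pseudo-marginal likelihood L_C,
    using the harmonic-mean CPO estimate computed stably in log space."""
    L = loglik.shape[0]
    # log CPO_st = log L - logsumexp_l( -loglik[l, s, t] )
    log_cpo = np.log(L) - logsumexp(-loglik, axis=0)
    return log_cpo.mean()

# Usage: compute lpml for several candidate C values and pick the maximizer.
# scores = {C: lpml(loglik_by_C[C]) for C in (2, 3, 4, 5)}
# C_hat = max(scores, key=scores.get)
```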

Estimate of Z
The MCMC simulations generate posterior samples of the categorical matrix Z and other parameters. They define a posterior point estimate of Z (7), where Z^{(l)}, l = 1, ..., L, are the MCMC samples and d(Z^{(l)}, Z') is a distance defined as follows. For two matrices Z and Z', a discrepancy is first computed between two columns c and c'; the distance between the matrices is then defined by minimizing the summed column discrepancies over permutations ζ_c, c = 1, ..., C, of {1, ..., C}, so that the distance is invariant to relabeling of the subclones.
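A minimal sketch of a permutation-invariant distance between two ternary matrices and a resulting point estimate, assuming NumPy. The specific column discrepancy (sum of absolute entrywise differences) and the choice of the sample closest to all others as the point estimate are my assumptions for illustration; brute-force enumeration of permutations is only feasible for small C:

```python
import numpy as np
from itertools import permutations

def matrix_distance(Z1, Z2):
    """Distance between two S x C ternary matrices, minimized over
    relabelings (column permutations) of Z2. Column discrepancy is
    assumed to be the sum of absolute entrywise differences."""
    C = Z1.shape[1]
    # pairwise column discrepancies: pair_d[c, c'] = sum_s |Z1[s, c] - Z2[s, c']|
    pair_d = np.abs(Z1[:, :, None] - Z2[:, None, :]).sum(axis=0)
    return min(sum(pair_d[c, perm[c]] for c in range(C))
               for perm in permutations(range(C)))

def point_estimate(samples):
    """Pick the MCMC sample Z^(l) minimizing its total distance to all others."""
    totals = [sum(matrix_distance(Zl, Zm) for Zm in samples) for Zl in samples]
    return samples[int(np.argmin(totals))]
```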

Results: Simulated Data
They take a set of S = 100 SNV locations and consider T = 30 samples. The true number of latent subclones is C = 4 in this experiment, and the model with C = 4 fits the data best.
Figure: true Z and estimate Ẑ; green = homozygous mutation (z_{sc} = 1), brown = heterozygous mutation (z_{sc} = 0.5), white = homozygous wild-type (z_{sc} = 0).
Figure: true w and estimate ŵ across all samples, using Ĉ = 4.
Figure: LPML L_C for different values of C; the simulation truth is C = 4.

Intra-Tumor Lung Cancer Samples
They record whole-exome sequencing data for four surgically dissected tumor samples taken from a single patient diagnosed with lung adenocarcinoma. They restrict attention to SNVs for which (i) the total number of mapped reads N_{st} lies in [100, 240] and (ii) the empirical fraction n_{st} / N_{st} lies in [0.25, 0.75]. This filtering left 12,387 SNVs, from which they randomly select S = 150 for computational purposes. In summary, the data record the read counts (N_{st}) and mutant-allele read counts (n_{st}) of S = 150 SNVs from T = 4 tumor samples.
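A minimal sketch of this filtering step, assuming NumPy arrays N and n of shape (S_all, T) holding depths and variant counts; the function name, the requirement that the ranges hold in every sample, and the subsampling details are assumptions for illustration:

```python
import numpy as np

def filter_snvs(N, n, depth_range=(100, 240), vaf_range=(0.25, 0.75), keep=150, seed=0):
    """Keep SNVs whose depth and empirical VAF fall in the given ranges
    in every sample, then randomly subsample `keep` of them."""
    vaf = n / N
    ok = ((N >= depth_range[0]) & (N <= depth_range[1]) &
          (vaf >= vaf_range[0]) & (vaf <= vaf_range[1])).all(axis=1)
    idx = np.flatnonzero(ok)
    rng = np.random.default_rng(seed)
    chosen = rng.choice(idx, size=min(keep, idx.size), replace=False)
    return N[chosen], n[chosen]
```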

They select the best number of subclones C by LPML (Ĉ = 3). Subclone 1 appears to be the parent, giving birth to two branching child subclones, 2 and 3. They hypothesize that subclones 2 and 3 arise by acquiring additional somatic mutations in the top portion of the SNV regions where subclone 1 shows "white" color, i.e., homozygous wild-type. The large chunk of "brown" bars in panel (a) could be either somatic mutations acquired in the parent subclone 1 or germline mutations. This is expected, since the four tumor samples were dissected from regions that were close by on the original lung tumor.

Bernoulli distribution
The probability distribution of a random variable that takes value 1 with success probability p and value 0 with failure probability 1 − p.
Dirichlet distribution
A family of continuous multivariate probability distributions parameterized by a vector of positive reals. Its probability density function returns the belief that the probabilities of K rival events are x_i, given that each event has been observed α_i − 1 times.
Multinomial distribution
A generalization of the binomial distribution. For n independent trials, each of which leads to a success for exactly one of k categories, with each category having a fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.
Conditional predictive ordinate (CPO)
The density of the posterior predictive distribution evaluated at an observation.
Log-pseudo-marginal likelihood (LPML)
The sum of the logged CPOs. This can serve as an estimator of the logarithm of the marginal likelihood.
Gibbs sampling
An MCMC algorithm for obtaining a sequence of observations approximately drawn from a specified multivariate probability distribution when direct sampling is difficult.
Metropolis-Hastings sampling
An MCMC method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult.
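A minimal sketch, assuming NumPy, of drawing from the first three distributions above as a concrete anchor for the definitions; the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

bern = rng.binomial(1, 0.3, size=10)          # Bernoulli(p = 0.3): ten 0/1 draws
dirich = rng.dirichlet([2.0, 1.0, 1.0])       # Dirichlet over K = 3 rival events; sums to 1
multi = rng.multinomial(20, [0.5, 0.3, 0.2])  # 20 trials over k = 3 categories; counts sum to 20

print(bern, dirich, multi)
```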