1 BayClone: Bayesian Nonparametric Inference of Tumor Subclones Using NGS Data
Subhajit Sengupta, Jin Wang, Juhee Lee, Peter Muller, Kamalakar Gulukota, Arunava Banerjee, Yuan Ji
Takanobu Tsutsumi

2 Most multicellular organisms have two sets of chromosomes (they are diploid). A diploid organism carries one copy of each gene (and therefore one allele) on each chromosome. At each locus, the two alleles are homozygous if they are identical, or heterozygous if they differ.

Main idea
A binary Indian buffet process (IBP) prior can only record whether an SNV is mutated, implicitly treating every SNV as homozygous: both alleles are either mutated or wild-type. The IBP model is therefore not sufficient to fully describe subclonal genomes, because biologically there are three possible allelic genotypes at an SNV: homozygous wild-type (no mutation on either allele), heterozygous mutant (a mutation on only one allele), or homozygous mutant (mutations on both alleles).

The Indian buffet process
The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns.
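As a quick illustration of the IBP itself (background material, not part of the paper's model), a minimal Python sketch of the standard "culinary" construction: customer s takes each existing dish with probability m_c / s and then tries a Poisson(alpha / s) number of new dishes.

```python
import numpy as np

def sample_ibp(S, alpha, seed=None):
    """Draw a sparse binary matrix from the Indian buffet process.

    Rows are customers (here: SNVs), columns are dishes (here: subclones);
    the number of columns is not fixed in advance.
    """
    rng = np.random.default_rng(seed)
    columns = []  # one list of row indices per dish
    for s in range(S):
        # Take each existing dish with probability m_c / (s + 1).
        for col in columns:
            if rng.random() < len(col) / (s + 1):
                col.append(s)
        # Try a Poisson(alpha / (s + 1)) number of brand-new dishes.
        for _ in range(rng.poisson(alpha / (s + 1))):
            columns.append([s])
    Z = np.zeros((S, len(columns)), dtype=int)
    for c, col in enumerate(columns):
        Z[col, c] = 1
    return Z

Z = sample_ibp(S=100, alpha=2.0, seed=42)
print(Z.shape)  # (100, number of dishes actually sampled)
```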

3 $Z = [z_{sc}]$ is an $S \times C$ ternary matrix ($S$: SNVs, $C$: subclones). $z_{sc}$ is the allelic variation at SNV site $s$ for subclone $c$ ($s = 1, \ldots, S$; $c = 1, \ldots, C$), with $z_{sc} \in \{0, 0.5, 1\}$:
$z_{sc} = 0$: homozygous wild-type
$z_{sc} = 0.5$: heterozygous variant
$z_{sc} = 1$: homozygous variant
The proportions of the $C$ subclones in sample $t$ are denoted $w_t = (w_{t0}, w_{t1}, \ldots, w_{tC})$, with $0 < w_{tc} < 1$ and $\sum_{c=0}^{C} w_{tc} = 1$. The contribution of subclone $c$ to the VAF at an SNV is
$0 \times w_{tc}$ for homozygous wild-type,
$0.5 \times w_{tc}$ for heterozygous mutant,
$1 \times w_{tc}$ for homozygous mutant.
They develop a latent feature model for the entire matrix $Z$ to uncover the unknown subclones that constitute the tumor cells; given the data, they aim to infer the two quantities $Z$ and $w$ by a Bayesian inference scheme. They extend the IBP to a categorical IBP (cIBP) that allows the three values 0, 0.5, 1 to describe the corresponding genotypes at each SNV.
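To make the weighting concrete, a toy computation of the expected VAF at one SNV (all numbers are illustrative):

```python
import numpy as np

# One SNV across C = 3 subclones plus the background component.
# z[c] in {0, 0.5, 1} is the genotype of subclone c at this SNV.
z = np.array([0.0, 0.5, 1.0])
# w[0] is the background weight w_t0; w[1:] are the C subclone proportions.
w = np.array([0.05, 0.40, 0.35, 0.20])
assert np.isclose(w.sum(), 1.0)

# Each subclone contributes z_sc * w_tc to the expected VAF.
vaf = float(z @ w[1:])  # 0*0.40 + 0.5*0.35 + 1*0.20 = 0.375
print(vaf)
```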

4 Illustration of the cIBP matrix Z for subclones in a tumor sample. Colored cells: green = 1 (homozygous variant), brown = 0.5 (heterozygous variant), white = 0 (homozygous wild-type). For example, at SNV 5 the same mutation is shared by two subclones (subclones three and four).

5 Probability Model
Latent feature model with IBP
Each subclone (one column of $Z$) is a latent feature vector, and a data point is the observed VAF. For each component $z_{sc}$ of the binary matrix $Z$, assume
$$z_{sc} \mid \pi_c \sim \text{Bern}(\pi_c), \qquad \pi_c \mid \alpha \sim \text{Beta}(\alpha/C, 1), \quad c = 1, \ldots, C, \tag{1}$$
where $\text{Bern}(\pi_c)$ is the Bernoulli distribution and $\pi_c \in (0, 1)$ is the prior probability $\Pr(z_{sc} = 1)$.
To extend the IBP to a categorical setting, each entry of the matrix is no longer restricted to 0 or 1 but takes a value in the set of integers $\{0, 1, \ldots, Q\}$, where $Q$ is fixed a priori. They call the extended model the categorical IBP (cIBP) and use it as a prior in exploring subclones of tumor samples.
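A minimal sketch of drawing $Z$ from the finite beta-Bernoulli prior in (1), translating the two sampling statements directly (parameter values are illustrative):

```python
import numpy as np

def sample_finite_ibp(S, C, alpha, seed=None):
    """Draw Z from model (1): pi_c ~ Beta(alpha/C, 1), z_sc ~ Bern(pi_c)."""
    rng = np.random.default_rng(seed)
    pi = rng.beta(alpha / C, 1.0, size=C)      # inclusion probability per subclone
    Z = (rng.random((S, C)) < pi).astype(int)  # broadcasts pi across the S rows
    return Z

Z = sample_finite_ibp(S=100, C=4, alpha=2.0, seed=0)
print(Z.mean(axis=0))  # per-column fraction of SNVs with z_sc = 1
```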

6 Development of cIBP
A straightforward extension of the IBP in (1) is to replace the underlying beta distribution of $\pi_c$ with a Dirichlet distribution and the Bernoulli distribution of $z_{sc}$ with a multinomial distribution:
$$\pi_c = (\pi_{c0}, \pi_{c1}, \ldots, \pi_{cQ}) \sim \text{Dirichlet}(\gamma_0, \gamma_1, \ldots, \gamma_Q), \qquad z_{sc} \mid \pi_c \sim \text{Multinomial}(1; \pi_c), \tag{2}$$
where the Dirichlet parameters $\gamma = (\gamma_0, \ldots, \gamma_Q)$ are built from $\alpha/C$ and $\beta_1, \ldots, \beta_Q$ (cf. the notation $\text{cIBP}(Q, \alpha, [\beta_1, \beta_2])$ below). Integrating out $\pi_c$ in (2), standard Dirichlet-multinomial conjugacy gives the probability of the $(Q+1)$-nary matrix $Z$ in closed form:
$$p(Z) = \prod_{c=1}^{C} \frac{B(\gamma_0 + m_{c0}, \ldots, \gamma_Q + m_{cQ})}{B(\gamma_0, \ldots, \gamma_Q)},$$
where $B(\cdot)$ is the multivariate beta function, $m_{cq}$ is the number of rows possessing value $q \in \{1, \ldots, Q\}$ in column $c$, and $m_{c0} = S - \sum_{q=1}^{Q} m_{cq}$.
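A sketch of evaluating this marginal in log space; the Dirichlet parameters `gamma` are placeholders for whatever combination of $\alpha/C$ and $\beta_1, \ldots, \beta_Q$ the paper uses:

```python
import numpy as np
from scipy.special import gammaln

def log_prob_Z(Z, gamma):
    """Log Dirichlet-multinomial marginal p(Z), column by column.

    Z     : (S, C) integer matrix with entries in {0, ..., Q}
    gamma : length-(Q+1) Dirichlet parameters (placeholder values)
    """
    S, C = Z.shape
    gamma = np.asarray(gamma, dtype=float)
    Q = len(gamma) - 1
    log_B = gammaln(gamma).sum() - gammaln(gamma.sum())  # log multivariate beta
    total = 0.0
    for c in range(C):
        m = np.bincount(Z[:, c], minlength=Q + 1)  # counts m_c0, ..., m_cQ
        log_B_post = gammaln(gamma + m).sum() - gammaln(gamma.sum() + S)
        total += log_B_post - log_B
    return total

Z = np.array([[0, 1], [2, 0], [0, 0]])  # S = 3 SNVs, C = 2 subclones, Q = 2
print(log_prob_Z(Z, gamma=[1.0, 0.5, 0.5]))
```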

7 Sampling model
Suppose there are $T$ tumor samples in the data, with the same $S$ SNVs measured in each sample. They assume a binomial sampling model
$$n_{st} \mid N_{st}, p_{st} \sim \text{Binomial}(N_{st}, p_{st}),$$
where $N_{st}$ is the total number of reads mapped to SNV $s$ in sample $t$ ($s = 1, \ldots, S$; $t = 1, \ldots, T$), $n_{st}$ of those $N_{st}$ reads are assumed to possess a variant sequence at the locus, and $p_{st}$ is the expected proportion of variant reads. The matrix $Z$ follows the finite version of cIBP in (2), $Z \sim \text{cIBP}_C(Q = 2, \alpha, [\beta_1, \beta_2])$. They assume $w_t$ (the vector of subclonal weights) follows a Dirichlet prior (3) on the $(C+1)$-dimensional vector $(w_{t0}, w_{t1}, \ldots, w_{tC})$.
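A minimal sketch of the sampling model: given an expected-VAF matrix p (however it is built; model (4) on the next slide gives its construction), draw read counts and evaluate the binomial log-likelihood:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
S, T = 100, 4

p = rng.uniform(0.05, 0.70, size=(T, S))  # placeholder for the model's p_st
N = rng.integers(100, 241, size=(T, S))   # total mapped reads N_st
n = rng.binomial(N, p)                    # variant read counts n_st

log_lik = binom.logpmf(n, N, p).sum()     # log p(n | N, p) under the model
print(log_lik)
```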

8 Parameters
The expected proportion $p_{st}$ is modeled as a linear combination of the variant alleles $z_{sc} \in \{0, 0.5, 1\}$, weighted by the proportions of the subclones bearing those alleles; that is, the expected $p_{st}$ results from mixing subclones with different proportions:
$$p_{st} = \varepsilon_{t0} + \sum_{c=1}^{C} z_{sc}\, w_{tc}. \tag{4}$$
Here $\varepsilon_{t0}$ is an error term, defined as $\varepsilon_{t0} = p_0 w_{t0}$ with $p_0 \sim \text{Beta}(\alpha_0, \beta_0)$, devised to capture experimental and data-processing noise: $p_0$ is the relative frequency of variant reads produced as error by upstream data processing and takes a small value close to zero, while $w_{t0}$ absorbs the noise left unaccounted for by $\{w_{t1}, \ldots, w_{tC}\}$. Ignoring the error term $\varepsilon_{t0}$, model (4) can also be read as a non-negative matrix factorization (NMF): the matrix $[p_{st}]$ factors into the non-negative matrices $Z$ and $[w_{tc}]$.
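Since (4) without the error term is exactly a non-negative factorization $[p_{st}] \approx ZW$, a quick numerical illustration (dimensions and values are illustrative; scikit-learn's NMF is used only as a sanity check here and is not the paper's inference method, which recovers the factors by MCMC):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
S, T, C = 100, 30, 4

# Build p_st as in (4), with the error term epsilon_t0 set to zero.
Z_true = rng.choice([0.0, 0.5, 1.0], size=(S, C), p=[0.5, 0.3, 0.2])
W_true = rng.dirichlet(np.ones(C), size=T).T   # (C, T) subclone weights
P = Z_true @ W_true                            # (S, T) matrix of p_st

# NMF recovers non-negative factors only up to permutation and scaling.
model = NMF(n_components=C, init="nndsvda", max_iter=2000)
Z_hat = model.fit_transform(P)
W_hat = model.components_
print(np.abs(P - Z_hat @ W_hat).max())         # reconstruction error
```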

9 Model Selection and Posterior Inference
MCMC simulation
To infer the model parameters from the posterior distribution, they use Markov chain Monte Carlo (MCMC) simulation. Gibbs sampling is used to update $z_{sc}$, while Metropolis-Hastings (MH) sampling is used to draw $w_{tc}$ and $p_0$. Let $z_{-s,c}$ be the set of assignments of all SNVs other than $s$ for subclone $c$, $m^{-}_{cq}$ the number of SNVs with level $q$ not including SNV $s$, and $m^{-}_{c\cdot} = \sum_{q=1}^{Q} m^{-}_{cq}$. Then $p'_{st}$ is the value of $p_{st}$ obtained by plugging in the current MCMC values and setting $z_{sc} = q$.

Markov chain Monte Carlo (MCMC)
MCMC methods are a class of algorithms for sampling from a probability distribution, based on constructing a Markov chain whose equilibrium distribution is the desired distribution. The state of the chain after a number of steps is then used as a sample from that distribution; the quality of the sample improves as a function of the number of steps.
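A sketch of how a Gibbs update for a single $z_{sc}$ might look, combining the Dirichlet-multinomial prior counts $m^{-}_{cq}$ with the binomial likelihood across samples. The placeholder Dirichlet parameters `gamma` and the exact form of the full conditional are assumptions here, not copied from the paper:

```python
import numpy as np
from scipy.stats import binom

LEVELS = np.array([0.0, 0.5, 1.0])  # genotype value of category q = 0, 1, 2

def gibbs_update_zsc(s, c, Z, w, p0, n, N, gamma, rng):
    """Resample z_sc from its (assumed) full conditional.

    Z     : (S, C) genotype matrix with entries in {0, 0.5, 1}
    w     : (T, C+1) subclone weights; column 0 is the background weight w_t0
    n, N  : (T, S) variant and total read counts
    gamma : length-3 placeholder Dirichlet parameters of the cIBP prior
    """
    log_post = np.empty(len(LEVELS))
    for q, level in enumerate(LEVELS):
        # Prior: Dirichlet-multinomial predictive with counts m^-_cq that
        # exclude SNV s (the normalizer is shared by all q and dropped).
        m_minus = np.sum(np.isclose(np.delete(Z[:, c], s), level))
        log_prior = np.log(gamma[q] + m_minus)
        # Likelihood: binomial over the T samples, with p'_st computed by
        # plugging in current values and setting z_sc = level.
        z_row = Z[s, :].copy()
        z_row[c] = level
        p_try = p0 * w[:, 0] + w[:, 1:] @ z_row
        log_post[q] = log_prior + binom.logpmf(n[:, s], N[:, s], p_try).sum()
    probs = np.exp(log_post - log_post.max())
    Z[s, c] = LEVELS[rng.choice(len(LEVELS), p=probs / probs.sum())]
```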

10 Choice of C
The number of subclones $C$ in cIBP is unknown and must be estimated. They use predictive densities as a selection criterion. Let $n_{-st}$ denote the data with $n_{st}$ removed, and $\eta_C$ the set of parameters for a given $C$. The conditional predictive ordinate (CPO) of $n_{st}$ given $n_{-st}$ is
$$\text{CPO}_{st} = p(n_{st} \mid n_{-st}, C). \tag{5}$$
The Monte-Carlo estimate of (5) is the harmonic mean of the likelihood values $p(n_{st} \mid \eta_C^{(l)})$ over the MCMC samples $l = 1, \ldots, L$:
$$\widehat{\text{CPO}}_{st} = \left[ \frac{1}{L} \sum_{l=1}^{L} \frac{1}{p(n_{st} \mid \eta_C^{(l)})} \right]^{-1}. \tag{6}$$
They take each data point out of $n$ in turn and compute the average log-pseudo-marginal likelihood (LPML) over this set as
$$L_C = \frac{1}{ST} \sum_{s=1}^{S} \sum_{t=1}^{T} \log \widehat{\text{CPO}}_{st}.$$
For different values of $C$, they compare the values of $L_C$ and choose the $\hat{C}$ that maximizes $L_C$.
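A sketch of estimating $L_C$ from saved MCMC draws via (6), using a log-sum-exp over negative log-likelihoods for numerical stability; `log_lik` is assumed to hold $\log p(n_{st} \mid \eta_C^{(l)})$ for every draw:

```python
import numpy as np
from scipy.special import logsumexp

def lpml(log_lik):
    """Average LPML from per-draw log-likelihoods.

    log_lik : (L, S, T) array, entry [l, s, t] = log p(n_st | eta^(l))
    """
    L = log_lik.shape[0]
    # log CPO_st = log L - logsumexp_l(-log p(n_st | eta^(l)))   (harmonic mean)
    log_cpo = np.log(L) - logsumexp(-log_lik, axis=0)
    return log_cpo.mean()

# Model selection: evaluate L_C for each candidate C, keep the maximizer.
# lpml_by_C = {C: lpml(log_lik_for(C)) for C in range(2, 7)}
# C_hat = max(lpml_by_C, key=lpml_by_C.get)
```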

11 Estimate of Z
The MCMC simulation generates posterior samples of the categorical matrix $Z$ and the other parameters. They define a posterior point estimate of $Z$ as the sample closest on average to all the others,
$$\hat{Z} = \arg\min_{Z' \in \{Z^{(1)}, \ldots, Z^{(L)}\}} \sum_{l=1}^{L} d(Z^{(l)}, Z'), \tag{7}$$
where $Z^{(l)}$, $l = 1, \ldots, L$, are the MCMC samples and $d(Z^{(l)}, Z')$ is a distance with the following definition. For two matrices $Z$ and $Z'$ and two columns $c$ and $c'$, let
$$d_{cc'} = \sum_{s=1}^{S} \lvert z_{sc} - z'_{sc'} \rvert.$$
They then define the distance
$$d(Z, Z') = \min_{\zeta} \sum_{c=1}^{C} d_{c \zeta_c},$$
where $\zeta = (\zeta_1, \ldots, \zeta_C)$ ranges over the permutations of $\{1, \ldots, C\}$, making the distance invariant to relabeling of the subclones.
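A sketch of the permutation-invariant distance and the resulting point estimate; brute force over permutations is fine for small C (scipy.optimize.linear_sum_assignment would scale better):

```python
import numpy as np
from itertools import permutations

def column_distances(Z, Zp):
    """d_cc' = sum_s |z_sc - z'_sc'| for every column pair; shape (C, C)."""
    return np.abs(Z[:, :, None] - Zp[:, None, :]).sum(axis=0)

def z_distance(Z, Zp):
    """min over column permutations zeta of sum_c d_{c, zeta_c}."""
    D = column_distances(Z, Zp)
    C = D.shape[0]
    return min(sum(D[c, zeta[c]] for c in range(C))
               for zeta in permutations(range(C)))

def point_estimate(samples):
    """Pick the MCMC sample minimizing total distance to all the others."""
    totals = [sum(z_distance(Z, Zp) for Zp in samples) for Z in samples]
    return samples[int(np.argmin(totals))]
```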

12 Results: Simulated Data
They take a set of $S = 100$ SNV locations and consider $T = 30$ samples; the true number of latent subclones in this experiment is $C = 4$. Comparing LPML values $L_C$ across candidate $C$, the model with $C = 4$ fits the data best, matching the simulation truth.
[Figures: true $Z$ and estimate $\hat{Z}$, colored as before (green: homozygous mutation, $z_{sc} = 1$; brown: heterozygous mutation, $z_{sc} = 0.5$; white: homozygous wild type, $z_{sc} = 0$); true $w$ and estimate $\hat{w}$ across all samples using $\hat{C} = 4$; LPML $L_C$ for a range of $C$ values.]

13 Intra-Tumor Lung Cancer Samples
Whole-exome sequencing was recorded for four surgically dissected tumor samples taken from a single patient diagnosed with lung adenocarcinoma. They restrict their attention to the SNVs for which (i) the total number of mapped reads $N_{st}$ lies in $[100, 240]$ and (ii) the empirical fraction $n_{st}/N_{st}$ lies in $[0.25, 0.75]$. This filtering left 12,387 SNVs, from which they randomly selected $S = 150$ for computational purposes (a sketch of the filtering step follows below). In summary, the data record the total read counts ($N_{st}$) and mutant allele read counts ($n_{st}$) of $S = 150$ SNVs from $T = 4$ tumor samples.
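A sketch of the SNV filtering; whether the per-sample criteria must hold in all T samples is my assumption about how they were combined:

```python
import numpy as np

def filter_snvs(N, n, S_keep=150, seed=None):
    """Apply the read-depth and VAF filters, then subsample SNVs.

    N, n : (S_total, T) arrays of total and variant read counts
    """
    rng = np.random.default_rng(seed)
    frac = n / N
    keep = ((N >= 100) & (N <= 240)).all(axis=1) \
         & ((frac >= 0.25) & (frac <= 0.75)).all(axis=1)
    idx = np.flatnonzero(keep)
    idx = rng.choice(idx, size=S_keep, replace=False)  # random subsample
    return N[idx], n[idx]
```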

14 They select the best $C$ (the number of subclones) by LPML, obtaining $\hat{C} = 3$. Subclone 1 appears to be the parent, giving birth to the two branching child subclones 2 and 3. They hypothesize that subclones 2 and 3 arose by acquiring additional somatic mutations in the top portion of the SNV regions, where subclone 1 shows "white" color, i.e., homozygous wild type. The large chunk of "brown" bars in (a) could be either somatic mutations acquired in the parent subclone 1 or germline mutations. This is expected, since the four tumor samples were dissected from regions that were close by on the original lung tumor.

15 Bernoulli distribution: the probability distribution of a random variable that takes the value 1 with success probability $p$ and the value 0 with failure probability $1 - p$.
Dirichlet distribution: a family of continuous multivariate probability distributions parameterized by a vector of positive reals. Its probability density function returns the belief that the probabilities of $K$ rival events are $x_i$, given that each event has been observed $\alpha_i - 1$ times.
Multinomial distribution: a generalization of the binomial distribution. For $n$ independent trials, each of which leads to a success for exactly one of $k$ categories with a fixed success probability per category, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.
Conditional predictive ordinate (CPO): the density of the posterior predictive distribution evaluated at an observation.
Log-pseudo-marginal likelihood (LPML): the sum of the logged CPOs; it can serve as an estimator of the logarithm of the marginal likelihood.
Gibbs sampling: an MCMC algorithm for obtaining a sequence of observations approximately drawn from a specified multivariate probability distribution when direct sampling is difficult.
Metropolis-Hastings sampling: an MCMC method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult.

