1
Bayesian mixture models for analysing gene expression data
Natalia Bochkina
In collaboration with Alex Lewin, Sylvia Richardson, BAIR Consortium
Imperial College London, UK
2
Introduction
We use a fully Bayesian approach to model the data, with MCMC for parameter estimation. The model treats all parameters simultaneously, and prior information can be included. Variances are automatically adjusted, avoiding unstable estimates when the number of observations is small. Inference is based on the posterior distribution of all parameters; we use the posterior mean as the estimate of each parameter.
3
[Diagram: distributions of the expression index for gene g under conditions 1 and 2, and the resulting distribution of the differential expression parameter.]
4
Bayesian Model
(Assume the data are background corrected, log-transformed and normalised.)
Two conditions, with R_1 and R_2 replicates:
y_g1r ~ N(α_g − ½ d_g, σ²_g1), r = 1, …, R_1
y_g2r ~ N(α_g + ½ d_g, σ²_g2), r = 1, …, R_2
where α_g is the mean expression of gene g and d_g is the difference (log fold change).
Prior model: σ²_gk ~ IG(a_k, b_k), k = 1, 2, giving
E(σ²_gk | s²_gk) = [(R_k − 1) s²_gk + 2 b_k] / (R_k − 1 + 2 a_k).
Non-informative priors on α_g, a_k, b_k.
Prior distribution on d_g?
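As a minimal illustration of the posterior-mean formula above (not code from the talk; the function name and example values are assumptions):

```python
def posterior_mean_variance(s2_gk, R_k, a_k, b_k):
    """E(sigma^2_gk | s^2_gk) = [(R_k - 1) s^2_gk + 2 b_k] / (R_k - 1 + 2 a_k)."""
    return ((R_k - 1) * s2_gk + 2 * b_k) / (R_k - 1 + 2 * a_k)

# With few replicates the prior pulls an unstably small sample variance
# towards the prior, stabilising the estimate:
print(posterior_mean_variance(s2_gk=0.001, R_k=3, a_k=1.5, b_k=0.05))  # 0.0204
```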
5
Modelling differential expression
Prior information / assumption: genes are either differentially expressed or not (of interest or not). We can include this in the model by giving the difference a mixture prior:
d_g ~ (1 − p) δ_0(d_g) + p H(d_g | θ_g)
where δ_0 is a point mass at zero (the null component H_0) and H is the alternative component H_1. How should H be chosen?
Advantages:
The threshold is selected automatically, as opposed to specifying constants as in the non-informative prior model for differences.
Interpretable: Bayesian classification can be used to select differentially expressed genes, calling gene g differentially expressed when P{g in H_1 | data} > P{g in H_0 | data}.
False discovery and non-discovery rates can be estimated (Newton et al 2004).
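A short sketch of the classification rule (the posterior probabilities below are placeholders; in the actual model they come from the MCMC output, e.g. as the frequency of the allocation z_g = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
p_g1 = rng.uniform(size=1000)   # placeholder values of P{g in H1 | data}

# Bayes classification: declare g differentially expressed when
# P{g in H1 | data} > P{g in H0 | data}, i.e. p_g1 > 0.5.
de_genes = np.flatnonzero(p_g1 > 0.5)
print(len(de_genes), "genes classified as differentially expressed")
```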
6
Considered mixture models
We consider several distributions for the non-zero component H of the mixture prior on d_g, in a fully Bayesian context: a double gamma, a Student t, the conjugate model of Lonnstedt and Speed (2002), and a uniform distribution.
Gamma model: H is a double (mirrored) gamma distribution.
T model: H is a Student t distribution: d_g ~ (1 − p) δ_0 + p T(ν, μ, τ).
LS model: H is normal with variance proportional to the variance of the data: d_g ~ (1 − p) δ_0 + p N(0, c σ_g²), where σ_g² = σ_g1²/R_1 + σ_g2²/R_2.
Uniform model: H is a uniform distribution: d_g ~ (1 − p) δ_0 + p U(−m_1, m_2), where (−m_1, m_2) is a slightly widened range of the observed differences.
Priors on hyperparameters are either non-informative or weakly informative, G(1,1) for parameters supported on the positive half-line.
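To make the four alternatives concrete, here is one way to draw d_g from each mixture prior; all parameter values (p, the gamma shape and scale, ν, μ, τ, c, m_1, m_2) are illustrative assumptions, not the estimates from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
G, p = 1000, 0.2                      # number of genes, mixture weight (assumed)
z = rng.uniform(size=G) < p           # allocation z_g: gene belongs to H1

# Gamma model: double (mirrored) gamma, assumed shape 2 and scale 0.5
d_gamma = np.where(z, rng.choice([-1, 1], G) * rng.gamma(2.0, 0.5, G), 0.0)
# T model: Student t with nu = 4, mu = 0, scale tau = 0.5 (assumed)
d_t = np.where(z, 0.5 * rng.standard_t(4, G), 0.0)
# LS model: N(0, c * sigma_g^2) with c = 2; gene variances sigma_g^2 ~ IG(1.5, 0.05)
sigma2_g = 1.0 / rng.gamma(1.5, 1.0 / 0.05, G)
d_ls = np.where(z, rng.normal(0.0, np.sqrt(2.0 * sigma2_g)), 0.0)
# Uniform model: U(-m1, m2) with m1 = m2 = 3 (assumed widened data range)
d_unif = np.where(z, rng.uniform(-3.0, 3.0, G), 0.0)
```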
7
Simulated data
We compare the performance of the four models on simulated data. For simplicity we consider a one-group model (or, equivalently, a paired two-group model). We simulate a data set with 1000 variables and 8 replicates. The variance hyperparameters a = 1.5, b = 0.05 are chosen close to the Bayesian estimates obtained from a real data set.
[Plot of the simulated data set: difference vs variance.]
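A possible way to generate such a data set (only G = 1000, R = 8 and a = 1.5, b = 0.05 come from the slide; the proportion of differentially expressed genes and their effect sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
G, R = 1000, 8                           # 1000 variables, 8 replicates (one-group model)
a, b = 1.5, 0.05                         # variance hyperparameters from the slide

sigma2 = 1.0 / rng.gamma(a, 1.0 / b, G)  # sigma^2_g ~ IG(a, b)
z = rng.uniform(size=G) < 0.2            # assumed 20% differentially expressed
d = np.where(z, rng.choice([-1, 1], G) * rng.gamma(2.0, 0.8, G), 0.0)
y = rng.normal(d[:, None], np.sqrt(sigma2)[:, None], size=(G, R))  # y_gr ~ N(d_g, sigma^2_g)
```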
8
Differences
The Gamma, T and LS models estimate the differences well; the uniform model shrinks values towards zero. Compared with empirical Bayes, the posterior estimates in the fully Bayesian approach do not shrink large values of the differences.
[Plot: posterior mean mixture estimates vs true values.]
9
Bayesian estimates of variance
The T and Gamma models have very similar variance estimates. The uniform model produces similar estimates for small values but higher estimates for larger values, compared with the T and Gamma models. The LS model shows more perturbation at both higher and lower values than the T and Gamma models. The mixture estimate of the variance can be larger than the sample variance.
[Plots: E(σ² | y) vs sample variance for the Gamma, T, LS and uniform models; in blue, the variance estimate from the Bayesian model with a non-informative prior on the differences.]
10
Classification
The T, LS and Gamma models perform similarly. The uniform model yields fewer false positives but also fewer true positives: the uniform prior is more conservative.
[Table: classification of 200 truly differentially expressed and 800 non-differentially expressed genes.]
11
Wrongly classified by the mixture model: truly differentially expressed vs truly not differentially expressed. The classification errors lie on the borderline: there is confusion between the size of the fold change and the biological variability.
12
Another simulation
Can we improve the estimation of within-condition biological variability? We simulate 2628 data points, with many points added on the borderline.
[Plot: classification errors shown in red.]
13
DAG for the mixture model
[Graph over genes g = 1:G, with nodes for the hyperparameters (a_1, b_1), (a_2, b_2) and p; the variances σ²_g1, σ²_g2 with sample variances s²_g1, s²_g2; the allocation z_g and the difference d_g; and the data summaries ½(y_g1. + y_g2.) and y_g1. − y_g2.]
The variance estimates are influenced by the mixture parameters. Can we use only partial information from the replicates to estimate the σ²_gs and feed it forward into the mixture?
14
Estimation
Estimation of all parameters combines information from the biological replicates and the between-condition contrasts:
y_gs. = (1/R_s) Σ_r y_gsr — average expression over replicates
s²_gs = (1/R_s) Σ_r (y_gsr − y_gs.)², s = 1, 2 — within-condition biological variability
½(y_g1. + y_g2.) — average expression over the two conditions
½(y_g1. − y_g2.) — between-conditions contrast
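A direct computation of these summaries for two per-condition data matrices (array and function names are assumptions; note the slide defines s²_gs with divisor R_s):

```python
import numpy as np

def summaries(y1, y2):
    """y1, y2: arrays of shape (G, R_1) and (G, R_2), one row per gene."""
    ybar1, ybar2 = y1.mean(axis=1), y2.mean(axis=1)   # y_gs. : averages over replicates
    s2_1 = ((y1 - ybar1[:, None]) ** 2).mean(axis=1)  # s^2_g1, divisor R_1 as on the slide
    s2_2 = ((y2 - ybar2[:, None]) ** 2).mean(axis=1)  # s^2_g2
    overall = 0.5 * (ybar1 + ybar2)                   # average expression over conditions
    contrast = 0.5 * (ybar1 - ybar2)                  # between-conditions contrast
    return ybar1, ybar2, s2_1, s2_2, overall, contrast
```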
15
Mixture, full vs partial
Classification is altered for 57 points: for 46 data points the classification improves when the feedback from the mixture is cut, while for 11 data points the classification changes but the new classification is incorrect. Work in progress.
16
Different classification: truly differentially expressed vs truly not differentially expressed.
[Plots, full vs cut model: difference (cut and no cut), variance, posterior probability, and sample st. dev. vs difference.]
17
Microarray data
Genes classified differently by the full model and by the model with the feedback cut follow a curve.
[Plots, full vs cut model: posterior probability vs sample difference, pooled sample st. dev., variance, and sample st. dev. vs difference.]
18
Since the variance is overestimated in the full mixture model compared with the cut model, the number of false negatives is lower for the cut model than for the full model.
19
LS model: empirical vs fully Bayesian
We compare the Lonnstedt and Speed (LS) model in a fully Bayesian (FB) and an empirical Bayes (EB) setting, looking at the estimated parameters and the classification. If the parameter p is specified correctly, the empirical and fully Bayesian models do not differ. If p is misspecified, the estimate of the parameter c changes, which leads to misclassification.
20
Small p (p = 0.01)
[Plots: cut vs no cut.]
21
Bayesian Estimate of FDR
Step 1: Choose a gene-specific parameter (e.g. δ_g) or a gene statistic.
Step 2: Model its prior distribution using a mixture model, with one component for the unaffected genes (the null hypothesis), e.g. a point mass at 0 for δ_g, and other components to model (flexibly) the alternative.
Step 3: Calculate the posterior probability that each gene belongs to the unmodified component: p_g0 | data.
Step 4: Evaluate the FDR (and FNR) for any list, assuming that the gene classifications are independent (Broët et al 2004):
Bayes FDR(list) | data = (1/card(list)) Σ_{g ∈ list} p_g0
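Step 4 amounts to averaging the null posterior probabilities over the list; a minimal sketch with placeholder values:

```python
import numpy as np

def bayes_fdr(p_g0, gene_list):
    """Bayes FDR(list) = (1 / card(list)) * sum_{g in list} p_g0."""
    return float(np.mean(p_g0[gene_list]))

p_g0 = np.array([0.01, 0.90, 0.05, 0.40])  # placeholder posterior null probabilities
print(bayes_fdr(p_g0, [0, 2]))             # genes 0 and 2 -> estimated FDR = 0.03
```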
22
Multiple Testing Problem
Gene lists can be built by computing a criterion separately for each gene and ranking; thousands of genes are considered simultaneously. How do we assess the performance of such lists?
Statistical challenge: select interesting genes without including too many false positives in the list. A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set-up. We want an evaluation of the expected false discovery rate (FDR).
23
Post Prob(g ∈ H_1) = 1 − p_g0 (Bayes rule).
[Plot: FDR (black) and FNR (blue) as functions of the threshold on 1 − p_g0.]
The observed and estimated FDR/FNR correspond well.
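The curves on this slide can be traced by sweeping the threshold on 1 − p_g0; a sketch under the same independence assumption as above (names are assumptions):

```python
import numpy as np

def fdr_fnr_curves(p_g0, thresholds):
    """Estimated FDR and FNR of the list {g : 1 - p_g0 > t} for each threshold t."""
    fdr, fnr = [], []
    for t in thresholds:
        selected = (1.0 - p_g0) > t
        fdr.append(p_g0[selected].mean() if selected.any() else 0.0)
        fnr.append((1.0 - p_g0)[~selected].mean() if (~selected).any() else 0.0)
    return np.array(fdr), np.array(fnr)
```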
24
Summary
Mixture models estimate the differences and hyperparameters well on simulated data, although the variance is overestimated for some genes. The mixture model with the uniform alternative distribution is more conservative in classifying genes than the structured models. The Lonnstedt and Speed model performs better in the fully Bayesian framework because the parameter p is estimated from the data. Estimates of the false discovery and non-discovery rates are close to the true values.