1 Sylvia Richardson Centre for Biostatistics Imperial College, London Bayesian hierarchical modelling of gene expression data In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Marys) Helen Causton and Tim Aitman (Hammersmith) Graeme Ambler and Peter Green (Bristol) Philippe Broët (INSERM, Paris) BBSRC Exploiting Genomics grant

2 Outline Hierarchical modelling framework A Bayesian gene expression index Modelling differential expression False discovery rate and mixture models

3 Introduction
Gene expression is a hierarchical process
–Substantive question
–Experimental design
–Sample preparation
–Array design & manufacture
–Gene expression matrix
–Probe level data
–Image level data
Interest in using a statistical framework capable of handling multiple sources of variability coherently: interesting variability (signal), obscuring variability (noise) + Bayesian statistics

4 Bayesian hierarchical model framework
Has the flexibility to model various sources of variability: between probes, gene specific, within array, between array, …
Builds all these features into a common model
Avoids the need to systematically use a plug-in approach: uncertainty is propagated
Borrows strength / shares out information according to principle
Allows some model checking

5 Gene expression analysis is a multi-step process Low-level Model (how is the measured expression related to the signal) Multi-arrays processing (how to make appropriate combined inference) Differential Expression Clustering Partition Model We build all these steps in a common statistical framework

6 Integrated modelling of Affymetrix data
[Diagram: for each of Condition 1 and Condition 2, PM/MM probe pairs feed a gene specific variability (probe) level that gives a gene and condition BGX index; a hierarchical model of replicate (biological) variability and array effect sits above each condition; the two conditions are linked by a differential expression parameter.]

7 A fully Bayesian Gene eXpression index for Affymetrix GeneChip arrays
Anne Mette Hein, SR, Helen Causton, Graeme Ambler, Peter Green
[Diagram: PM/MM probe pairs, gene specific variability (probe), gene index BGX.]

8 Single array model: Motivation
Key observation: PMs and MMs both increase with spike-in concentration (MMs slower than PMs). Conclusion: MMs bind fraction of signal.
Key observation: Spread of PMs increases with level. Conclusion: Multiplicative (and additive) error; transformation needed.
Key observation: Considerable variability in PM (and MM) response within a probe set. Conclusion: Varying reliability in gene expression estimation for different genes.
Key observation: Probe effects approximately additive on log-scale. Conclusion: Estimate gene expression measure from PMs and MMs on log scale.

9 Model assumptions and key biological parameters
The intensity for the PM measurement for probe (reporter) j and gene g is due to binding
–of labelled fragments that perfectly match the oligos in the spot: the true signal $S_{gj}$
–of labelled fragments that do not perfectly match these oligos: the non-specific hybridisation $H_{gj}$
The intensity of the corresponding MM measurement is caused
–by a binding fraction $\Phi$ of the true signal $S_{gj}$
–by non-specific hybridisation $H_{gj}$

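10 BGX single array model: $g = 1,\dots,G$ (thousands), $j = 1,\dots,J$ (11-20)
Background noise, additive: $PM_{gj} \sim N(S_{gj} + H_{gj}, \tau^2)$, $MM_{gj} \sim N(\Phi S_{gj} + H_{gj}, \tau^2)$, with signal $S_{gj}$, non-specific hybridisation $H_{gj}$ and fraction $\Phi$
Signal: $\log(S_{gj} + 1) \sim TN(\mu_g, \xi_g^2)$, $j = 1,\dots,J$
Gene specific error terms, exchangeable: $\log(\xi_g^2) \sim N(a, b^2)$, with $a$, $b$ set by Empirical Bayes
Non-specific hybridisation, array-wide distribution: $\log(H_{gj} + 1) \sim TN(\lambda, \eta^2)$
Gene expression index (BGX): $\mathrm{BGX}_g = \mathrm{median}(TN(\mu_g, \xi_g^2))$; pools information over probes $j = 1,\dots,J$
Priors (vague): $\tau^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\Phi \sim B(1,1)$, $\mu_g \sim U(0,15)$, $\eta^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\lambda \sim N(0, 10^3)$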
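The index above is the median of a truncated Normal, which has a closed form. The following is a minimal sketch, not the authors' implementation; it assumes that TN means a Normal truncated below at 0 (natural for log(S + 1) with S ≥ 0), and the function name `truncated_normal_median` is made up for illustration.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard Normal, used for cdf and quantiles


def truncated_normal_median(mu: float, xi: float, lower: float = 0.0) -> float:
    """Median of N(mu, xi^2) truncated to [lower, +inf).

    Assumption: the truncation point is 0, as implied by log(S + 1) >= 0.
    """
    if xi <= 0:
        raise ValueError("xi must be positive")
    # Mass that the untruncated Normal puts below the truncation point.
    p_low = _STD.cdf((lower - mu) / xi)
    # The median splits the remaining mass (1 - p_low) in half.
    return mu + xi * _STD.inv_cdf(p_low + 0.5 * (1.0 - p_low))
```

For a gene with large mu relative to xi the truncation is negligible and the index is essentially mu; for weakly expressed genes the index is pulled above mu.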

11 Implementation
In WinBUGS for ease of model development and C++ for efficiency
Joint estimation of parameters in full Bayesian framework
Base inference on posterior distribution of all unknown quantities, $S_{gj}$, $H_{gj}$, the BGX index (median of $TN(\mu_g, \xi_g^2)$), …, and use appropriate summaries

12 Single array model performance. Data set: varying concentrations (GeneLogic)
14 samples of cRNA from acute myeloid leukemia (AML) tumor cell line
In sample k, each of 11 genes is spiked in at concentration $c_k$:
sample k:          1    2    3     4    5    6    7    8    9     10  11  12  13   14
conc. c_k (pM):    0.0  0.5  0.75  1.0  1.5  2.0  3.0  5.0  12.5  25  50  75  100  150
Each sample hybridised to an array
Consider subset consisting of 500 normal genes + 11 spike-ins

13 Single array model performance
One array: four genes spiked in at concentration 5.0
[Figure: for each gene, PM, MM and PM-MM by probe; o: log(PM-MM); posterior distributions of $\log(S_{gj} + 1)$ with 2.5-97.5% credibility intervals; curve $TN(\mathrm{medPost}(\mu_g), \mathrm{medPost}(\xi_g^2))$; posterior distribution of the BGX index.]
Probes, degree of response / variability over probe set, for the four genes: medium / high, low / low, medium / low, high / low
Probe behaviour: highly variable responses within probe sets and between genes
Posterior distributions reflect variability

14 Single array model performance: signal and expression index
10 arrays: gene 1 spiked-in at increasing concentrations
[Figure: legend as previously; o: log(PM-MM); posterior distributions of $\log(S_{gj} + 1)$ and $\log(H_{gj} + 1)$ with 2.5-97.5% credibility intervals; curve $TN(\mathrm{medPost}(\mu_g), \mathrm{medPost}(\xi_g^2))$.]
`True signal` / expression index BGX increases with concentration

15 Single array model performance: non-specific hybridization
10 arrays: gene 1 spiked-in at increasing concentrations
[Figure: signals and signals/cross-hybridisation; posterior distributions of $\log(H_{gj} + 1)$ with 2.5-97.5% credibility intervals; curve $TN(\mathrm{medPost}(\mu_g), \mathrm{medPost}(\xi_g^2))$.]
Non-specific hybridization does not increase with concentration

16 Single array model performance: 11 genes spiked in at 13 (increasing) concentrations BGX index g increases with concentration ….. … except for gene 7 (spiked-in??) Indication of smooth & sustained increase over a wider range of concentrations Comparison with other expression measures

17 2.5 – 97.5 % credibility intervals for the Bayesian expression index
11 spike-in genes at 13 different concentrations (data set A)
Note how the variability is substantially larger for low expression levels
Each colour corresponds to a different spike-in gene
Gene 7: broken red line

18 What variability is captured? For some genes, there is considerable discrepancy between the information given by the different probes Posterior becomes flat or bimodal Hard to summarise by a single number Less reproducibility of point estimates of expression level Model improvement: -- stratify Φ by CG content ? -- less weight to the MM in some cases? – more robust summary of index distribution or heavy tail distributions?

19 Single array model: examples of posterior distributions of BGX expression indices
Each curve represents a gene
Examples with data: o: $\log(PM_{gj} - MM_{gj})$, $j = 1,\dots,J_g$ (at 0 if not defined)
Mean +- 1SD

20 Differential expression and array effects Alex Lewin SR, Natalia Bochkina, Anne Glazier, Tim Aitman

21 Data Set and Biological question
Previous Work (Tim Aitman, Anne Marie Glazier)
The spontaneously hypertensive rat (SHR): a model of human insulin resistance syndromes.
Deficiency in gene Cd36 found to be associated with insulin resistance in SHR
Following this, several animal models were developed where other relevant genes are knocked out: comparison between knocked out and wildtype (normal) mice or rats.

22 Data Set and Biological question Microarray Data Data set A (MAS 5) ( 12000 genes on each array) 3 SHR compared with 3 transgenic rats Data set B (RMA) ( 22700 genes on each array) 8 wildtype (normal) mice compared with 8 knocked out mice Biological Question Find genes which are expressed differently in wildtype and knockout / transgenic mice

23 [Diagram: PM/MM data under Condition 1 and Condition 2; gene specific error term; hierarchical model of replicate variability and array effect for each condition; differential expression parameter with posterior distribution (flat prior) or mixture modelling for classification.]

24 Model for Differential Expression Expression-level-dependent normalisation Only few replicates per gene, so share information between genes to estimate variability of gene expression between the replicates To select interesting genes: –Use posterior distribution of quantities of interest, function of, ranks …. –Use mixture prior on the differential expression parameter

25 Bayesian hierarchical model for replicate expression data (under one condition)
Data: $y_{gr}$ = log gene expression for gene g, replicate r (for the present, $y_{gr}$ is treated as known data)
$\alpha_g$ = gene effect
$\beta_r(\cdot)$ = array effect (possibly expression-level dependent)
$\sigma_g^2$ = gene specific variance
1st level: $y_{gr} \sim N(\alpha_g + \beta_r(\alpha_g), \sigma_g^2)$, with $\sum_r \beta_r(\alpha_g) = 0$
$\beta_r(\cdot)$ = smooth function of $\alpha_g$: piecewise polynomial with unknown break points

26 Condition 1 (3 replicates) Condition 2 (3 replicates) Needs normalisation Spline curves shown Exploratory analysis of array effect

27 Hierarchical structure for gene specific parameters
2nd level: priors for $\alpha_g$ (flat), coefficients and break points; $\sum_r \beta_r(\alpha_g) = 0$ constraint imposed; $\sigma_g^2 \sim \text{lognormal}(\mu, \tau)$
Hyper-parameters $\mu$ and $\tau$ can be influential. In a full Bayesian analysis, these are not fixed
3rd level: $\mu \sim N(c, d)$, $\tau \sim \text{lognormal}(e, f)$

28 Variances are estimated using information from all G x R measurements (~12000 x 3) rather than just 3 Variances are stabilised and shrunk towards average variance Smoothing of the gene specific variances

29 Bayesian Model Checking
Check assumptions on gene variances, e.g. exchangeable variances, what distribution?
Predict sample variance $S_g^{2\,new}$ (a chosen checking function) from the model specification (not using the data for this)
Compare predicted $S_g^{2\,new}$ with observed $S_g^{2\,obs}$
Bayesian p-value: $\mathrm{Prob}(S_g^{2\,new} > S_g^{2\,obs})$
Distribution of p-values approx Uniform if model is true (Marshall and Spiegelhalter, 2003)
Easily implemented in MCMC algorithm
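The p-value computation can be sketched for a single gene as follows. This is an illustrative sketch under stated assumptions, not the authors' MCMC code: it assumes posterior draws of the hyper-parameters (mu, tau) are available, reads tau as the standard deviation on the log scale, and uses the fact that the sample variance of R Normal replicates is distributed as sigma² · χ²(R-1) / (R-1). The name `bayesian_p_value` and its arguments are hypothetical.

```python
import random


def bayesian_p_value(s2_obs, hyper_draws, n_rep, rng):
    """Monte Carlo estimate of Prob(S2_new > S2_obs) for one gene.

    hyper_draws: iterable of (mu, tau) posterior draws (hypothetical input).
    n_rep: number of replicates R behind the observed sample variance.
    rng: a random.Random instance, passed in for reproducibility.
    """
    exceed = 0
    total = 0
    for mu, tau in hyper_draws:
        # New gene variance predicted from the second-level distribution,
        # assuming tau is the standard deviation of log(sigma2).
        sigma2 = rng.lognormvariate(mu, tau)
        # chi2(k) is Gamma(shape k/2, scale 2); here k = n_rep - 1.
        chi2 = rng.gammavariate((n_rep - 1) / 2.0, 2.0)
        s2_new = sigma2 * chi2 / (n_rep - 1)
        exceed += s2_new > s2_obs
        total += 1
    return exceed / total
```

Repeating this for every gene and plotting a histogram of the resulting p-values gives the approximate uniformity check described on the slide.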

30 Data set A

31 Differential expression model
The quantity of interest is the difference between conditions for each gene: $d_g$, $g = 1,\dots,N$
Joint model for the 2 conditions:
$y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \beta_{1r}(\alpha_g) + \varepsilon_{g1r}$, $r = 1,\dots,R_1$
$y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \beta_{2r}(\alpha_g) + \varepsilon_{g2r}$, $r = 1,\dots,R_2$
$\alpha_g$ is now the overall gene effect over the conditions
The parameter of interest $d_g$ is given a flat prior (for now)
Same assumptions for the distribution of $\sigma^2_{gs}$
Modelling of $\beta_{sr}(\alpha_g)$ as before, $s = 1, 2$; sum to zero constraint imposed within each condition

32 Possible Statistics for Differential Expression
$d_g$: log fold change
$d_g^* = d_g / (\sigma^2_{g1}/R_1 + \sigma^2_{g2}/R_2)^{1/2}$ (standardised difference)
We obtain the posterior distribution of all $\{d_g\}$ and/or $\{d_g^*\}$
Can compute directly posterior probability of genes satisfying criterion X of interest: $p_{g,X} = \mathrm{Prob}(g \text{ of interest} \mid \text{Criterion } X, \text{data})$
Can compute the distributions of ranks
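Given MCMC output, the posterior probability that a gene satisfies a criterion is just the fraction of posterior draws in which the criterion holds. A minimal sketch, assuming the draws for one gene are held in a plain list; the helper `posterior_prob` and the example draws are hypothetical, and the threshold |d_g| > log(2) is used only as an example of a criterion.

```python
import math


def posterior_prob(draws, criterion):
    """Fraction of posterior draws for which criterion(draw) is True."""
    draws = list(draws)
    return sum(1 for d in draws if criterion(d)) / len(draws)


# Hypothetical posterior draws of the log fold change d_g for one gene.
d_g_draws = [0.9, 0.75, 0.4, 1.1, 0.8]
p_gX = posterior_prob(d_g_draws, lambda d: abs(d) > math.log(2))
```

Because the criterion is an arbitrary function of the draw, the same helper covers joint criteria (for instance on fold change and overall expression together) when each draw is a tuple of parameters.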

33 Criterion X: gene is of interest if |log fold change| > log(2) and log(overall expression) > 4
Plot of log fold change versus overall expression level
Data set A: 3 wildtype mice compared to 3 knockout mice (U74A chip), MAS5
The majority of the genes have very small $p_{g,X}$: 90% of genes have $p_{g,X} < 0.2$
Genes with $p_{g,X} > 0.5$ (green): 280; $p_{g,X} > 0.8$ (red): 46
Annotated gene: $p_{g,X} = 0.49$
Genes with low overall expression have a greater range of fold change than those with higher expression

34 Criterion X: gene is of interest if |log fold change| > log(1.5)
Plot of log fold change versus overall expression level
Experiment: 8 wildtype mice compared to 8 knockout mice, RMA
The majority of the genes have very small $p_{g,X}$: 97% of genes have $p_{g,X} < 0.2$
Genes with $p_{g,X} > 0.5$ (green): 292; $p_{g,X} > 0.8$ (red): 139

35 Posterior probabilities and log fold change Data set A : 3 replicates MAS5 Data set B : 8 replicates RMA

36 Credibility intervals for ranks 100 genes with lowest rank (most under/ over expressed) Low rank, high uncertainty Low rank, low uncertainty Data set B
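Credibility intervals for ranks come from ranking the genes within each posterior draw and then summarising each gene's ranks across draws. The sketch below assumes the draws are stored as `draws[s][g]` (draw s, gene g) and uses a simple nearest-order-statistic quantile; both the layout and the name `rank_intervals` are illustrative choices, not taken from the talk.

```python
def rank_intervals(draws, lower=0.025, upper=0.975):
    """Per-gene (lower, median, upper) quantiles of the rank of d_g.

    draws[s][g] is the value of d_g in posterior draw s; rank 1 is the
    smallest d_g within a draw.
    """
    n_genes = len(draws[0])
    ranks = [[] for _ in range(n_genes)]
    for draw in draws:
        order = sorted(range(n_genes), key=lambda g: draw[g])
        for position, g in enumerate(order, start=1):
            ranks[g].append(position)

    def quantile(sorted_vals, p):
        # Nearest order statistic; adequate for a sketch.
        idx = round(p * (len(sorted_vals) - 1))
        return sorted_vals[idx]

    out = []
    for gene_ranks in ranks:
        gene_ranks.sort()
        out.append((quantile(gene_ranks, lower),
                    quantile(gene_ranks, 0.5),
                    quantile(gene_ranks, upper)))
    return out
```

A gene whose interval is narrow and close to 1 (or to the number of genes) is consistently extreme; a wide interval signals that its position in the list is uncertain.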

37 Using the posterior distribution of $d_g^*$ (standardised difference)
Compute $\mathrm{Probability}(|d_g^*| > 2 \mid \text{data})$: Bayesian analogue of a t test!
Order genes
Select genes such that $\mathrm{Probability}(|d_g^*| > 2 \mid \text{data}) >$ cut-off
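As a hedged illustration of this selection rule, the sketch below computes the standardised difference draw by draw and reports the fraction of draws with |d*| > 2. It assumes joint posterior draws of (d_g, sigma²_g1, sigma²_g2) for one gene are available as tuples; the function names are hypothetical.

```python
import math


def standardised_difference(d, s2_1, s2_2, r1, r2):
    """d* = d / sqrt(s2_1 / R1 + s2_2 / R2)."""
    return d / math.sqrt(s2_1 / r1 + s2_2 / r2)


def prob_abs_dstar_exceeds(draws, r1, r2, threshold=2.0):
    """Posterior probability that |d*| > threshold for one gene.

    draws: iterable of (d_g, sigma2_g1, sigma2_g2) joint posterior draws.
    """
    draws = list(draws)
    hits = sum(
        1 for d, s2_1, s2_2 in draws
        if abs(standardised_difference(d, s2_1, s2_2, r1, r2)) > threshold
    )
    return hits / len(draws)
```

Standardising inside each draw, rather than plugging in point estimates of the variances, is what lets the uncertainty in the variances carry through to the probability.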

38 Bayesian T test ( Bayesian estimate) Volcano plots For illustration, cut-offs lines drawn at 0.95

39 Integrated modelling of Affymetrix data
[Diagram: for each of Condition 1 and Condition 2, PM/MM probe pairs feed a gene specific variability (probe) level that gives the distribution of the expression index for gene g in that condition (gene index BGX); a hierarchical model of replicate (biological) variability and array effect sits above each condition; the two conditions are linked by the distribution of the differential expression parameter.]

40 BGX multiple array model: conditions $c = 1,\dots,C$, replicates $r = 1,\dots,R_c$
Background noise, additive, array specific: $PM_{gjcr} \sim N(S_{gjcr} + H_{gjcr}, \tau_{cr}^2)$, $MM_{gjcr} \sim N(\Phi S_{gjcr} + H_{gjcr}, \tau_{cr}^2)$
Signal: $\log(S_{gjcr} + 1) \sim TN(\mu_{gc}, \xi_{gc}^2)$
Gene and condition specific BGX: $\mathrm{BGX}_{gc} = \mathrm{median}(TN(\mu_{gc}, \xi_{gc}^2))$; pools information over replicate probe sets $j = 1,\dots,J$, $r = 1,\dots,R_c$
Array-specific distribution of non-specific hybridisation: $\log(H_{gjcr} + 1) \sim TN(\lambda_{cr}, \eta_{cr}^2)$

41 Posterior distributions of BGX: Single array vs multiple array analyses: Mean +- 1SD Three replicate arrays analysed separately Three replicate arrays analysed together (multiple array model)

42 Subset of AffyU133A spike-in data set (AffyComp)
Consider: six arrays, 1154 genes (every 20th and 42 spike-ins)
Same cRNA hybridised to all arrays EXCEPT for spike-ins:
Spike-in group:                     `1`   `2`   `3`   …  `12`   `13`   `14`
Spike-in genes:                     1-3   4-6   7-9   …  34-36  37-39  40-42
Conc (pM), condition 1 (array 1-3): 0.0   0.25  0.50  …  128    256    512
Conc (pM), condition 2 (array 4-6): 0.25  0.50  1.00  …  256    512    0.00
Fold change:                        -     2     2     …  2      2      -

43 M v A plots
True fold changes: black: zero; red: 2
A: $\tfrac{1}{2}(\mathrm{expr}_{g,1} + \mathrm{expr}_{g,2})$, M: $\mathrm{expr}_{g,1} - \mathrm{expr}_{g,2}$
NB! Point estimates used
MAS5 and RMA: $\mathrm{expr}_{gc}$ = mean over three replicates
BGX: multiple array index
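The M and A coordinates defined above are simple to compute from per-gene point estimates. A small sketch, assuming the log expression values for the two conditions are given as equal-length lists; `ma_values` is an illustrative name.

```python
def ma_values(expr1, expr2):
    """Return a list of (A, M) pairs, one per gene.

    A is the average log expression over the two conditions and
    M is the difference (condition 1 minus condition 2).
    """
    return [(0.5 * (e1 + e2), e1 - e2) for e1, e2 in zip(expr1, expr2)]
```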

44 BGX: measure of uncertainty provided
Posterior mean +- 1SD credibility intervals for $\mathrm{diff}_g = \mathrm{bgx}_{g,1} - \mathrm{bgx}_{g,2}$
Spike-ins 1113-1154 above the blue line
Blue stars show RMA measure

45 Mixture and Bayesian estimation of false discovery rates Natalia Bochkina, Alex Lewin SR, Philippe Broët

46 Multiple Testing Problem
Gene lists can be built by computing separately a criterion for each gene and ranking
Thousands of genes are considered simultaneously
How to assess the performance of such lists?
Statistical Challenge: select interesting genes without including too many false positives in a gene list
A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set up
Want an evaluation of the expected false discovery rate (FDR)

47 Bayesian Estimate of FDR
Step 1: Choose a gene specific parameter (e.g. $d_g$) or a gene statistic (see later)
Step 2: Model its prior (resp. marginal) distribution using a mixture model
-- with one component that models the unaffected genes (null hypothesis), e.g. point mass at 0 for $d_g$
-- other components that model (flexibly) the alternative
Step 3: Calculate the posterior probability for any gene of belonging to the unmodified component: $p_{g0} \mid \text{data}$
Step 4: Evaluate FDR (and FNR) for any list, assuming that all the gene classifications are independent:
$\text{Bayes FDR(list)} \mid \text{data} = \frac{1}{\mathrm{card(list)}} \sum_{g \in \text{list}} p_{g0}$
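Step 4 amounts to averaging posterior null probabilities over the list. The sketch below implements that average; the FNR counterpart (average of 1 - p_g0 over the genes left off the list) is an assumption made by symmetry, since the slide gives only the FDR formula. Genes are identified by their index into `p_null`, and both function names are illustrative.

```python
def bayes_fdr(p_null, selected):
    """Average posterior null probability p_g0 over the selected genes."""
    selected = list(selected)
    if not selected:
        return 0.0
    return sum(p_null[g] for g in selected) / len(selected)


def bayes_fnr(p_null, selected):
    """Average of 1 - p_g0 over genes NOT selected.

    Assumed definition: the source states only the FDR formula.
    """
    chosen = set(selected)
    rest = [g for g in range(len(p_null)) if g not in chosen]
    if not rest:
        return 0.0
    return sum(1.0 - p_null[g] for g in rest) / len(rest)
```

For example, with posterior null probabilities 0.05, 0.10, 0.90 and 0.60 and the first two genes on the list, the estimated FDR is the mean of 0.05 and 0.10.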

48 Mixture prior
To obtain a gene list, a commonly used method (cf Lonnstedt & Speed 2002, Newton 2003, Smyth 2003, …) is to define a mixture prior for $d_g$:
$H_0$: $d_g = 0$, point mass at 0 with probability $p_0$
$H_1$: $d_g \sim$ flexible 2-sided distribution to model differential expression
Classify each gene following its posterior probability of not being in the null: $1 - p_{g0}$
Use Bayes rule or fix the FDR
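The two classification options can be sketched as follows. This is an illustration under assumptions: "Bayes rule" is read as selecting genes with 1 - p_g0 > 0.5 (equal misclassification costs), and "fix the FDR" as taking the longest list, built in order of increasing p_g0, whose estimated Bayes FDR stays at or below a target. The function `select_genes` is hypothetical.

```python
def select_genes(p_null, target_fdr=None):
    """Return indices of selected genes, most convincing first.

    target_fdr is None: Bayes rule, assumed here to mean p_g0 < 0.5.
    Otherwise: longest list whose average p_g0 is <= target_fdr.
    """
    order = sorted(range(len(p_null)), key=lambda g: p_null[g])
    if target_fdr is None:
        return [g for g in order if p_null[g] < 0.5]
    chosen = []
    running = 0.0
    for g in order:
        # Genes arrive in increasing p_g0, so the running mean never
        # decreases; stopping at the first violation gives the longest list.
        if (running + p_null[g]) / (len(chosen) + 1) > target_fdr:
            break
        running += p_null[g]
        chosen.append(g)
    return chosen
```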

49 Classification with mixture prior Joint estimation of all the mixture parameters (including p 0 ) avoids plugging-in of values (e.g. p 0 ) that are influential on the classification Sensitivity to prior settings of the alternative distribution and performance has been tested on simulated data sets Work in progress Poster by Natalia Bochkina

50 Performance of the mixture prior
$y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \varepsilon_{g1r}$, $r = 1,\dots,R_1$
$y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \varepsilon_{g2r}$, $r = 1,\dots,R_2$
(For simplification, we assume that the data has been pre-normalised)
$\sigma^2_g \sim IG(a, b)$
$d_g \sim p_0 \delta_0 + p_1 G(1.5, \lambda_1) + p_2 G(1.5, \lambda_2)$, where the point mass $\delta_0$ is $H_0$ and the two Gamma components form $H_1$
Dirichlet distribution for $(p_0, p_1, p_2)$
Exponential hyper prior for $\lambda_1$ and $\lambda_2$

51 Simulated data
$y_{gr} \sim N(d_g, \sigma^2_g)$ (8 replicates)
$\sigma^2_g \sim IG(1.5, 0.05)$
$d_g \sim (-1)^{\mathrm{Bern}(0.5)}\, G(2, 2)$, $g = 1{:}200$
$d_g = 0$, $g = 201{:}1000$
Choice of simulation parameters inspired by estimates found in analyses of biological data sets
Plot of the true differences
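A data set of this shape can be generated with the standard library alone. The sketch below is not the authors' simulation code, and the parametrisations are assumptions: G(2, 2) is read as Gamma with shape 2 and rate 2, and IG(1.5, 0.05) as inverse Gamma with shape 1.5 and scale 0.05. The function name and the seed argument are illustrative.

```python
import random


def simulate(n_genes=1000, n_de=200, n_rep=8, seed=0):
    """Return (d, y): true differences and replicate data per gene."""
    rng = random.Random(seed)
    d, y = [], []
    for g in range(n_genes):
        if g < n_de:
            # Random sign, i.e. (-1)^Bern(0.5).
            sign = -1.0 if rng.random() < 0.5 else 1.0
            # Assumed shape 2, rate 2; gammavariate takes a scale, so 1/2.
            d_g = sign * rng.gammavariate(2.0, 0.5)
        else:
            d_g = 0.0
        # Assumed IG(shape 1.5, scale 0.05): the reciprocal of a Gamma
        # with shape 1.5 and rate 0.05, i.e. scale 1 / 0.05.
        sigma2 = 1.0 / rng.gammavariate(1.5, 1.0 / 0.05)
        sd = sigma2 ** 0.5
        d.append(d_g)
        y.append([rng.gauss(d_g, sd) for _ in range(n_rep)])
    return d, y
```

Under the other common reading of the second Gamma argument (a scale rather than a rate), the simulated differences would be four times larger on average, so the parametrisation should be checked against the original analysis before reuse.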

52 Posterior estimates of fold change using mixture model

53 Comparison of mixture classification and posterior probabilities for the standardised differences
[Plot: $\mathrm{Probability}(|d_g^*| > 2 \mid \text{data})$ against Post Prob($g \in H_1$)]
In red, 200 genes with $d_g \neq 0$
31 = 4% false negative; 10 = 6% false positive

54 Post Prob($g \in H_1$) $= 1 - p_{g0}$
[Plot: FDR (black) and FNR (blue) as a function of $1 - p_{g0}$, with the Bayes rule cut-off marked]

55 Using mixtures for modelling the marginal distribution of gene statistics
Instead of modelling the prior for $d_g$ as a mixture, an alternative is
–To summarise differential expression by a gene statistic
–To model its marginal distribution as a mixture, such that the distribution is approximately known under $H_0$, and use a flexible distribution for the alternative

56 Mixture modelling of transformed F statistics Gene statistic based on classical F statistic (this was developed to analyse multiclass ( > 2 conditions) experiments) Gives a de-centred asymmetric marginal distribution rather than a two-tailed one Transform F -> approx. standard Normal if no change across conditions (H 0 ). Use a mixture of normals (variable number) for modelling the alternative (following Richardson and Green 1997)

57 Results for Simulated Data (to detect modified profile over 3 conditions) Broet, Lewin, SR 2004 Bayes mixture estimate of FDR is close to true value Case A : well separated null and alternative hypotheses Case B : less separated null and alternative hypotheses For details, see the poster by Alex Lewin

58 Marginal mixture performance for the simulated data (2 conditions, same data as for the prior mixture) Number on list as a function of cut-off prob Expected number of false positive

59 Simulated data, comparison of prior and marginal mixture classification Good agreement between the 2 approaches The marginal mixture has more false positives Transformation to Normality for 2 conditions?? Further comparison in progress

60 Summary
Bayesian gene expression measure (BGX): good range of resolution, provides credibility intervals
Differential expression: expression-level-dependent normalisation; borrow information across genes for variances; joint distribution of ranks, gene lists based on posterior probabilities
False discovery rate: mixture gives good estimate of FDR and classifies well
Future work: mixture prior on BGX index, with uncertainty propagated to mixture parameters; comparison of marginal and prior mixture approaches; clustering for more general experimental set-ups

61 Papers and technical reports:
Hein AM., Richardson S., Causton H., Ambler G. and Green P. (2004) BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data (submitted)
Lewin A., Richardson S., Marshall C., Glazier A. and Aitman T. (2003) Bayesian Modelling of Differential Gene Expression (submitted)
Broët P., Lewin A., Richardson S., Dalmasso C. and Magdelenat H. (2004) A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics (advance access April 29 2004)
Broët P., Richardson S. and Radvanyi F. (2002) Bayesian Hierarchical Model for Identifying Changes in Gene Expression from Microarray Experiments. Journal of Computational Biology 9, 671-683.
Available at http://www.bgx.org.uk/
Thanks
