Identifying differentially expressed genes from RNA-seq data

Many recent algorithms for calling differentially expressed genes:

edgeR: Empirical analysis of digital gene expression data in R
http://www.bioconductor.org/packages/2.10/bioc/html/edgeR.html

DESeq: Differential gene expression analysis based on the negative binomial distribution
http://www-huber.embl.de/users/anders/DESeq/

baySeq: Empirical Bayesian analysis of patterns of differential expression in count data
http://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

Why can’t we use the same software as for microarray data analysis?

1. Microarray data are continuous values; sequence data are discrete read counts (e.g. a signal intensity of 5.342 versus 5 counts).
2. Reproducibility of RNA-seq measurements differs between low-abundance and high-abundance transcripts. This is called overdispersion: the variance across replicates is larger than simple counting noise would predict.

Overdispersion: variance of replicates is higher for high-abundance reads

baySeq: Empirical Bayesian analysis of patterns of differential expression in count data
http://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

Identifies differentially expressed genes between 2 or more samples using replicated RNA-seq data

Bayes’ Theorem

$$P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)}$$

Read in English: the Probability that your Model is correct Given ( | ) the Data is equal to the Probability of your Data Given the Model, times the Probability of your Model, divided by the Probability of the Data.

What the hell does this mean?

Bayes’ Theorem: Wikipedia’s ridiculous example

Your friend had a conversation with someone who happened to have long hair. What is the probability that that person was a woman, given that you know ~50% of people are women, ~75% of women have long hair, and ~30% of men have long hair?

P(W) = Probability that this person was a woman = 0.5
P(L | W) = Probability that the person had long hair IF the person is a woman = 0.75
P(L | M) = Probability that the person had long hair IF that person is a man = 0.3
P(L) = Probability that any random person has long hair
     = P(L | W) * P(W) + P(L | M) * P(M)
     = (75% of 50% of the population) + (30% of 50% of the population)
     = 0.75 * 0.5 + 0.3 * 0.5 = 0.375 + 0.15 = 0.525

So:

$$P(W \mid L) = \frac{P(L \mid W)\, P(W)}{P(L)} = \frac{P(L \mid W)\, P(W)}{P(L \mid W)\, P(W) + P(L \mid M)\, P(M)} = \frac{0.375}{0.525} \approx 0.714$$
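The arithmetic of the long-hair example can be checked with a few lines of Python. This is only a sketch of the calculation on this slide; the function name `posterior` and its argument names are made up for illustration.

```python
def posterior(prior_w, like_w, like_m):
    """Return P(W | L) from P(W), P(L | W) and P(L | M) using Bayes' theorem."""
    prior_m = 1.0 - prior_w                            # P(M): everyone who is not W
    evidence = like_w * prior_w + like_m * prior_m     # P(L), summed over both groups
    return like_w * prior_w / evidence

# Numbers from the slide: P(W) = 0.5, P(L | W) = 0.75, P(L | M) = 0.3
p_w_given_l = posterior(0.5, 0.75, 0.3)
print(round(p_w_given_l, 3))
```

The denominator is the same P(L) = 0.525 worked out by hand above, so the result is 0.375 / 0.525.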

Bayes’ Theorem

$$P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)}$$

P(M | D) is called the ‘Posterior Probability’ (PP) that this Model is the right one to describe your data.

Each gene will have a PP for each model, where Σ PP = 1 (summed over the models for that gene).

The bigger the PP, the more likely this is the right model.

Posterior Probability is NOT a p-value!

Bayes’ Theorem

We don’t know some of these factors (e.g. P(D | M)), but we can describe the data with some parameter set called θ (which includes the mean and standard deviation of count data across replicates of each gene).

Imagine you have three RNA-seq replicates of two samples (WT vs mutant). There are two models for each gene:

M_0 = the model that your gene is NOT differentially expressed across the 2 samples
M_DE = the model that a given gene IS differentially expressed

$$P(M_{DE} \mid D_{geneX}) = \frac{P(D_{geneX} \mid M_{DE})\, P(M_{DE})}{P(D_{geneX})}$$

$$P(M_0 \mid D_{geneX}) = \frac{P(D_{geneX} \mid M_0)\, P(M_0)}{P(D_{geneX})}$$
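To make the two-model setup concrete, here is a small Python sketch that turns per-model likelihoods and priors into posterior probabilities for one gene. The likelihood and prior values are invented purely for illustration, and `model_posteriors` is a hypothetical helper, not part of any package.

```python
def model_posteriors(likelihoods, priors):
    """Apply Bayes' theorem to every model; the results sum to 1 for the gene."""
    joint = {m: likelihoods[m] * priors[m] for m in likelihoods}   # P(D | M) * P(M)
    evidence = sum(joint.values())                                 # P(D)
    return {m: j / evidence for m, j in joint.items()}

# Hypothetical values for geneX (not from real data)
likelihoods = {"M_0": 2e-6, "M_DE": 8e-6}   # P(D_geneX | M)
priors = {"M_0": 0.9, "M_DE": 0.1}          # P(M)

pp = model_posteriors(likelihoods, priors)
for model, value in pp.items():
    print(model, round(value, 3))
```

Because P(D_geneX) is just the sum of the numerators over both models, it never has to be known separately; dividing by it is what forces the posterior probabilities to add up to 1.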

How to model the mean and standard deviation of your replicates:

Poisson Distribution: accounts for random fluctuations

E.g. infecting a cell culture with viral particles:
λ = Multiplicity of Infection (MOI) = # of viral particles/cell in your culture

(Figure source: Wikipedia)
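As a rough illustration of the Poisson model in the MOI example, the sketch below computes the probability that a cell receives exactly k viral particles when the average is λ per cell. The MOI value of 1.0 is an arbitrary choice for the example.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

moi = 1.0  # assumed average number of viral particles per cell
for k in range(4):
    print(k, round(poisson_pmf(k, moi), 4))

# Fraction of cells that receive no particle at all
uninfected = poisson_pmf(0, moi)
```

For a Poisson distribution the variance equals the mean, so a single parameter λ fixes both the centre and the spread.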

Overdispersion: variance of replicates is higher for high-abundance reads

How to model the mean and standard deviation of your replicates:

Negative Binomial Distribution: mean and variance are different (the variance can exceed the mean)

Notice the wider spread on the right side of each distribution, for higher numbers
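The difference between the two distributions can be sketched numerically. The code below assumes the mean-dispersion parameterization of the negative binomial that is commonly used for RNA-seq counts, where variance = μ + φ·μ²; the dispersion value φ = 0.1 is an arbitrary choice for illustration.

```python
def poisson_variance(mu):
    """Poisson: the variance is equal to the mean."""
    return mu

def nb_variance(mu, phi):
    """Negative binomial (mean-dispersion form): variance = mu + phi * mu^2."""
    return mu + phi * mu ** 2

phi = 0.1  # assumed dispersion, for illustration only
for mu in (1, 10, 100, 1000):
    print(mu, poisson_variance(mu), nb_variance(mu, phi))
```

With φ = 0 the two variances coincide; for φ > 0 the extra term grows with the square of the mean, which is why the spread widens for high counts.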

baySeq

Imagine you have replicate measurements of sample A and sample B (two replicates each are shown here). We want to identify genes differentially expressed (DE) in A versus B.

A rep1   A rep2   B rep1   B rep2

We define TWO possible models:

NO DE (NDE): A rep1 = A rep2 = B rep1 = B rep2
We say that the data for all samples were drawn from the same distribution (i.e. same mean and standard deviation, if normally distributed).

DE: A rep1 = A rep2 NOT EQUAL B rep1 = B rep2
A and B replicates are drawn from two different distributions.

tuple: simply the counts per transcription unit

baySeq

Next, we use Bayes’ Rule to try to estimate the probability that the DE model is true for geneX:

$$P(M_{DE} \mid D_{geneX}) = \frac{P(D_{geneX} \mid M_{DE})\, P(M_{DE})}{P(D_{geneX})}$$

Here P(M_DE) is the prior probability.

baySeq tries to use data sharing across the dataset to estimate the components on the right side of the equation:

$$P(D_{geneX} \mid M_{DE}) = \int P(D_{geneX} \mid K, M_{DE})\, P(K \mid M_{DE})\, dK$$

where K is the set of parameters θ (mean, dispersion) for each gene in each replicate.

** θ for each gene is estimated by looking at all data in the replicates.
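One way to get a feel for the integral is to replace it with an average over a handful of sampled parameter values. The sketch below is a toy version only and is not baySeq's implementation: it assumes a Poisson likelihood with a single rate parameter instead of a negative binomial with mean and dispersion, and the counts and the sampled rates are invented.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def likelihood(counts, lam):
    """P(counts | lam): product over replicates drawn from the same distribution."""
    p = 1.0
    for k in counts:
        p *= poisson_pmf(k, lam)
    return p

def marginal_likelihood(counts, sampled_lams):
    """Approximate the integral over the parameter by averaging over sampled values."""
    return sum(likelihood(counts, lam) for lam in sampled_lams) / len(sampled_lams)

# Hypothetical counts for geneX and hypothetical sampled rates
a_reps, b_reps = [5, 6], [50, 55]
sampled_lams = [1.0, 5.0, 10.0, 50.0, 100.0]

# NDE: all four replicates share one distribution
ml_nde = marginal_likelihood(a_reps + b_reps, sampled_lams)
# DE: A and B replicates each get their own distribution
ml_de = marginal_likelihood(a_reps, sampled_lams) * marginal_likelihood(b_reps, sampled_lams)
print(ml_nde, ml_de)
```

For these invented counts the DE model explains the data far better, because no single rate fits both the A and the B replicates.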

baySeq

$$P(M_{DE} \mid D_{geneX}) = \frac{P(D_{geneX} \mid M_{DE})\, P(M_{DE})}{P(D_{geneX})}$$

The prior probability P(M_DE) is the probability of M before any data are considered. Here:

1. Guess at a starting prior probability for M_DE
2. Iteratively test different alternatives
3. Repeat until ‘convergence’ (P(M_DE) does not change with more iterations)

* For data with strong DE signal, the posterior probability is not very dependent on the starting prior probability P(M_DE)
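The guess, iterate, converge loop can be sketched as follows. This assumes one simple update rule, setting the new prior to the average posterior probability of DE across all genes; the slide does not spell out the exact rule, so treat this as an illustration rather than baySeq's actual procedure. The per-gene likelihoods are invented.

```python
def estimate_prior(lik_de, lik_nde, prior=0.5, tol=1e-8, max_iter=1000):
    """Iterate prior -> posteriors -> new prior until the prior stops changing."""
    posteriors = []
    for _ in range(max_iter):
        # Posterior probability of DE for every gene, given the current prior
        posteriors = [
            prior * d / (prior * d + (1.0 - prior) * n)
            for d, n in zip(lik_de, lik_nde)
        ]
        new_prior = sum(posteriors) / len(posteriors)   # assumed update rule
        if abs(new_prior - prior) < tol:                # 'convergence'
            return new_prior, posteriors
        prior = new_prior
    return prior, posteriors

# Hypothetical likelihoods for four genes: one clearly DE, three clearly not
lik_de = [1e-3, 1e-9, 1e-9, 1e-9]
lik_nde = [1e-9, 1e-3, 1e-3, 1e-3]

prior_from_low, _ = estimate_prior(lik_de, lik_nde, prior=0.1)
prior_from_high, pp_de = estimate_prior(lik_de, lik_nde, prior=0.9)
print(prior_from_low, prior_from_high)
```

With a strong signal like this, both starting guesses settle on nearly the same prior (about one gene in four being DE), which matches the note that the result is not very dependent on the starting value.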

Volcano Plot
