Identifying differentially expressed genes from RNA-seq data

Presentation transcript:

Identifying differentially expressed genes from RNA-seq data

Many recent algorithms for calling differentially expressed genes:
edgeR: Empirical analysis of digital gene expression data in R
DESeq: Differential gene expression analysis based on the negative binomial distribution
baySeq: Empirical Bayesian analysis of patterns of differential expression in count data

Identifying differentially expressed genes from RNA-seq data

Why can’t we use the same software as for microarray data analysis?
1. Microarray data are continuous values. Sequence data are discrete values of read counts, e.g. signal intensity versus 5 counts.
2. Reproducibility of RNA-seq measurements is different for low-abundance versus high-abundance transcripts: the variance across replicates changes with abundance and is larger than a simple Poisson counting model predicts. This is called overdispersion.

Overdispersion: variance of replicates is higher for high-abundance reads

baySeq
Empirical Bayesian analysis of patterns of differential expression in count data

Identifies differentially expressed genes between 2 or more samples using replicated RNA-seq data.

Bayes’ Theorem

P(M | D) = P(D | M) * P(M) / P(D)

Read in English: the Probability that your Model is correct Given ( | ) the Data is equal to the Probability of your Data Given the Model, times the Probability of your Model, divided by the Probability of the Data.

What the hell does this mean?

Bayes’ Theorem: Wikipedia’s ridiculous example

Your friend had a conversation with someone who happened to have long hair. What is the probability that that person was a woman, given that you know ~50% of people are women, ~75% of women have long hair, and ~30% of men have long hair?

P(W) = Probability that this person was a woman = 0.5
P(L | W) = Probability that the person had long hair IF the person is a woman = 0.75
P(L | M) = Probability that the person had long hair IF that person is a man = 0.3
P(L) = Probability that any random person has long hair = P(L | W) * P(W) + P(L | M) * P(M)
P(L | W) * P(W) = 75% of 50% of the population = 0.75 * 0.5 = 0.375
P(L | M) * P(M) = 30% of 50% of the population = 0.3 * 0.5 = 0.15

So:
P(W | L) = P(L | W) * P(W) / P(L) = P(L | W) * P(W) / [ P(L | W) * P(W) + P(L | M) * P(M) ]
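The long-hair example can be checked numerically. The short Python sketch below just plugs the numbers from the slide into Bayes’ Theorem; the variable names are made up for illustration.

```python
# Numbers taken from the long-hair example.
p_woman = 0.5                 # P(W)
p_man = 1.0 - p_woman         # P(M)
p_long_given_woman = 0.75     # P(L | W)
p_long_given_man = 0.30       # P(L | M)

# P(L): probability that a random person has long hair (sum over both cases).
p_long = p_long_given_woman * p_woman + p_long_given_man * p_man

# Bayes' Theorem: P(W | L) = P(L | W) * P(W) / P(L)
p_woman_given_long = p_long_given_woman * p_woman / p_long

print(f"P(L)     = {p_long:.3f}")
print(f"P(W | L) = {p_woman_given_long:.3f}")
```

With these inputs P(L) comes out at 0.525 and P(W | L) at about 0.714, so knowing the person had long hair raises the probability of "woman" from 50% to roughly 71%.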

Bayes’ Theorem

P(M | D) = P(D | M) * P(M) / P(D)

Read in English: the Probability that your Model is correct Given ( | ) the Data is equal to the Probability of your Data Given the Model, times the Probability of your Model, divided by the Probability of the Data.

P(M | D) is called the ‘Posterior Probability’ (PP) that this Model is the right one to describe your data. Each gene will have a PP for each model, where Σ PP = 1. The bigger the PP, the more likely this is the right model.

Posterior Probability is NOT a p-value!

Bayes’ Theorem

We don’t know some of these factors (e.g. P(D | M)), but we can describe the data with some parameter set called θ (which includes the mean and standard deviation of count data across replicates of each gene).

Imagine you have three RNA-seq replicates of each of two samples (WT vs mutant). There are two models for each gene:
M_0 = the model that your gene is NOT differentially expressed across the 2 samples
M_DE = the model that a given gene IS differentially expressed

P(M_DE | D_geneX) = P(D_geneX | M_DE) * P(M_DE) / P(D_geneX)
P(M_0 | D_geneX) = P(D_geneX | M_0) * P(M_0) / P(D_geneX)
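To see how the two equations fit together, here is a minimal Python sketch for a single gene. The likelihood and prior values are invented purely for illustration (they do not come from any dataset), and the function name is hypothetical. The denominator P(D_geneX) is obtained by summing the numerators over both models, which is why the two posterior probabilities add up to 1.

```python
def posterior_probabilities(likelihoods, priors):
    """Apply Bayes' Theorem to a set of competing models for one gene.

    likelihoods: dict model name -> P(D | model)
    priors:      dict model name -> P(model)
    """
    # Numerator of Bayes' Theorem for each model.
    joint = {m: likelihoods[m] * priors[m] for m in likelihoods}
    # P(D): sum of the numerators over all models.
    evidence = sum(joint.values())
    return {m: value / evidence for m, value in joint.items()}


# Hypothetical values for one gene (illustration only).
likelihoods = {"M_0": 2.0e-6, "M_DE": 8.0e-6}
priors = {"M_0": 0.9, "M_DE": 0.1}

post = posterior_probabilities(likelihoods, priors)
for model, pp in post.items():
    print(f"PP({model}) = {pp:.4f}")
```

Note that in this made-up case the data favour M_DE four to one, yet the posterior for M_DE stays below 0.5 because the prior says most genes are not differentially expressed.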

How to model the mean and standard deviation of your replicates:

Poisson Distribution: accounts for random fluctuations

E.g. infecting a cell culture with viral particles.
λ = Multiplicity of Infection (MOI) = # of viral particles/cell in your culture

(Figure source: Wikipedia)
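A small Python sketch of the Poisson model for the viral infection example, assuming the number of particles per cell is Poisson distributed with mean equal to the MOI. The MOI value of 2 is an arbitrary choice for illustration. It shows the defining property of the Poisson distribution: the variance equals the mean.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


moi = 2.0  # hypothetical multiplicity of infection (mean particles per cell)

# Probabilities for 0..59 particles per cell; the tail beyond that is negligible.
probs = [poisson_pmf(k, moi) for k in range(60)]

mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
uninfected = poisson_pmf(0, moi)  # fraction of cells that receive no particle

print(f"mean = {mean:.3f}, variance = {variance:.3f}")
print(f"fraction of uninfected cells = {uninfected:.3f}")
```

Because a Poisson model ties the variance to the mean, it has no room for the extra replicate-to-replicate variability seen in RNA-seq counts.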

Overdispersion: variance of replicates is higher for high-abundance reads

How to model the mean and standard deviation of your replicates:

Negative Binomial Distribution: mean and variance are separate parameters, so the variance can be larger than the mean

Notice the wider spread on the right side of each distribution, for higher numbers.
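The difference between the two distributions can be made concrete with the variance formulas. The sketch below assumes the mean/dispersion parameterization of the negative binomial that is commonly used for RNA-seq counts, variance = mean + dispersion * mean^2; the dispersion value 0.1 is a made-up number for illustration.

```python
def poisson_variance(mean):
    """Poisson: the variance is always equal to the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Negative binomial (mean/dispersion form): variance = mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


dispersion = 0.1  # hypothetical value, for illustration only

for mean in (10, 100, 1000):
    pois = poisson_variance(mean)
    nb = negative_binomial_variance(mean, dispersion)
    print(f"mean={mean:5d}  Poisson var={pois:8.1f}  NB var={nb:10.1f}  ratio={nb / pois:6.1f}")
```

The extra term grows with the square of the mean, so the gap between the negative binomial and the Poisson variance widens for high-abundance transcripts. With dispersion set to 0 the two formulas coincide.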

baySeq
Empirical Bayesian analysis of patterns of differential expression in count data

Imagine you have replicate measurements of sample A and sample B (A rep1, A rep2, B rep1, B rep2 in the example shown). We want to identify genes differentially expressed (DE) in A versus B.

We define TWO possible models:
NO DE (NDE): A rep1 = A rep2 = B rep1 = B rep2. We say that the data for all samples were drawn from the same distribution (i.e. same mean and standard deviation, if normally distributed).
DE: A rep1 = A rep2 NOT EQUAL B rep1 = B rep2. A and B replicates are drawn from two different distributions.

tuple: simply counts per transcription unit

baySeq
Empirical Bayesian analysis of patterns of differential expression in count data

Next, we use Bayes’ Rule to try to estimate the probability that the DE model is true for geneX:

P(M_DE | D_geneX) = P(D_geneX | M_DE) * P(M_DE) / P(D_geneX)

P(M_DE) is the prior probability.

baySeq tries to use data sharing across the dataset to estimate the components on the right side of the equation:

P(D_geneX | M_DE) = ∫ P(D_geneX | K, M_DE) * P(K | M_DE) dK

where K is the parameter set of θ (mean, dispersion) for each gene in each replicate.
** θ for each gene is estimated by looking at all data in the replicates.
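One way to picture the integral is as an average of the likelihood over many plausible parameter values. The Python sketch below is an illustration under loud simplifying assumptions, not baySeq's actual implementation: it uses a Poisson likelihood instead of the negative binomial to stay short, the list `prior_means` is a hypothetical stand-in for parameter values learned from the rest of the dataset, and the counts are invented. Under the NDE model all four replicates share one parameter; under the DE model A and B each get their own.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of count k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def marginal_likelihood(counts, prior_means):
    """Approximate P(D | M) = integral of P(D | K, M) * P(K | M) dK by
    averaging the likelihood over parameter values drawn from P(K | M)."""
    total = 0.0
    for lam in prior_means:
        total += math.exp(sum(poisson_logpmf(k, lam) for k in counts))
    return total / len(prior_means)


# Hypothetical parameter values standing in for P(K | M).
prior_means = [5.0, 10.0, 20.0, 50.0, 100.0, 200.0]

# Invented counts for one gene: two replicates of A, two of B.
a_counts = [12, 9]
b_counts = [95, 110]

# NDE: one shared parameter for all replicates.
like_nde = marginal_likelihood(a_counts + b_counts, prior_means)
# DE: A and B have independent parameters, so the marginals multiply.
like_de = marginal_likelihood(a_counts, prior_means) * marginal_likelihood(b_counts, prior_means)

prior_de = 0.5  # hypothetical prior probability of the DE model
posterior_de = like_de * prior_de / (like_de * prior_de + like_nde * (1.0 - prior_de))

print(f"P(D | NDE) = {like_nde:.3e}")
print(f"P(D | DE)  = {like_de:.3e}")
print(f"P(M_DE | D) = {posterior_de:.6f}")
```

For this invented gene no single mean explains both the A and B counts, so the DE model's marginal likelihood is far larger and the posterior probability of DE is close to 1.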

baySeq
Empirical Bayesian analysis of patterns of differential expression in count data

P(M_DE | D_geneX) = P(D_geneX | M_DE) * P(M_DE) / P(D_geneX)

P(M_DE) is the prior probability: the probability of M before any data are considered. Here:
1. Guess at a starting prior probability for M_DE
2. Iteratively test different alternatives
3. Repeat until ‘convergence’ (P(M_DE) does not change with more iterations)

* For data with a strong DE signal, the posterior probability is not very dependent on the starting prior probability P(M_DE).
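The iteration can be sketched in a few lines of Python. This is an assumption about how such a scheme can work rather than a copy of baySeq's code: at each round the posterior probability of DE is computed for every gene with the current prior, and the new prior is set to the average of those posteriors. The per-gene likelihood pairs are invented for illustration, and the function name is hypothetical.

```python
def estimate_prior(likelihoods, prior_de, tol=1e-10, max_iter=1000):
    """Iteratively re-estimate P(M_DE).

    likelihoods: list of (P(D | M_0), P(D | M_DE)) pairs, one per gene
    prior_de:    starting guess for P(M_DE), strictly between 0 and 1
    """
    posteriors = []
    for _ in range(max_iter):
        # Posterior probability of DE for every gene under the current prior.
        posteriors = [
            l_de * prior_de / (l_de * prior_de + l_0 * (1.0 - prior_de))
            for l_0, l_de in likelihoods
        ]
        # New prior: average posterior probability across genes.
        new_prior = sum(posteriors) / len(posteriors)
        if abs(new_prior - prior_de) < tol:
            return new_prior, posteriors
        prior_de = new_prior
    return prior_de, posteriors


# Hypothetical (P(D | M_0), P(D | M_DE)) values for five genes.
likelihoods = [
    (1e-3, 1e-9),  # clearly not DE
    (2e-3, 1e-8),  # clearly not DE
    (1e-9, 1e-3),  # clearly DE
    (5e-4, 4e-4),  # ambiguous
    (1e-8, 2e-3),  # clearly DE
]

prior_from_low, post_low = estimate_prior(likelihoods, prior_de=0.1)
prior_from_high, post_high = estimate_prior(likelihoods, prior_de=0.9)

print(f"converged prior (start 0.1) = {prior_from_low:.4f}")
print(f"converged prior (start 0.9) = {prior_from_high:.4f}")
```

Both starting guesses end up at the same prior, and the genes with a strong signal keep posteriors near 0 or 1 throughout; only the ambiguous gene is noticeably affected by the prior.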


Volcano Plot