Introduction to DESeq and edgeR packages Peter A.C. ’t Hoen.


Poisson distribution
A discrete probability distribution that expresses the probability of a number of events occurring in a fixed period of time, if these events occur with a known average rate and independently of the time since the last event:
$P(k; \lambda) = \lambda^k e^{-\lambda} / k!$
$\lambda$ = expected number of occurrences
$k$ = number of occurrences
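As a small illustration of the definition above, the following sketch evaluates the Poisson probability mass function and checks numerically that its mean and variance both equal the rate. The rate value 4.0 and the truncation at 100 terms are arbitrary choices made for this example, not values from the slides.

```python
import math

def poisson_pmf(k, lam):
    """P(k; lam) = lam**k * exp(-lam) / k!, evaluated in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

lam = 4.0  # hypothetical expected number of events (assumption for this example)
# Truncate the infinite support at 100; the remaining tail mass is negligible for lam = 4.
probs = [poisson_pmf(k, lam) for k in range(100)]

total = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))

print("total probability:", round(total, 6))
print("mean:", round(mean, 6), "variance:", round(variance, 6))
```

The equality of mean and variance is the property that later motivates the negative binomial model for overdispersed counts.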

Count process
Poisson distribution: $Y_t \sim \mathrm{Poisson}(\lambda_t)$ with $\lambda_t = p\,n_t$
t: tag
λ: true expression
Y: observed expression
p: probability
n: total number of RNA molecules
Truncated Poisson distribution: a zero can mean either not expressed or not counted
Count variance ~ $\lambda_t$
Reference: Murray F Freeman and John W Tukey. Ann Math Statist, 21 (1950)

Negative binomial distribution
A discrete probability distribution of the number of successes in a sequence of Bernoulli trials before a specified (non-random) number r of failures occurs.
It also arises as a continuous mixture of Poisson distributions where the mixing distribution of the Poisson rate is a gamma distribution. That is, we can view the negative binomial as a Poisson(λ) distribution, where λ is itself a random variable, distributed according to Gamma(r, p/(1 − p)).
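The gamma-Poisson view can be checked numerically. The sketch below compares the negative binomial probability mass function (k successes before r failures, success probability p) with a numerical integral of Poisson(λ) over a gamma density for λ. It assumes that Gamma(r, p/(1 − p)) is read as shape r and scale p/(1 − p); the values r = 3 and p = 0.4, the integration range, and the step count are arbitrary choices for illustration.

```python
import math

def nb_pmf(k, r, p):
    """P(k successes before r failures), success probability p."""
    log_coef = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coef + k * math.log(p) + r * math.log(1.0 - p))

def poisson_pmf(k, lam):
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def gamma_pdf(lam, shape, scale):
    log_pdf = ((shape - 1.0) * math.log(lam) - lam / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)

def mixture_pmf(k, r, p, upper=100.0, steps=20000):
    """Midpoint-rule integral of Poisson(k; lam) * Gamma(lam; r, p/(1-p)) over lam."""
    scale = p / (1.0 - p)  # assumption: second gamma parameter is a scale
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        total += poisson_pmf(k, lam) * gamma_pdf(lam, r, scale)
    return total * h

r, p = 3.0, 0.4  # hypothetical parameters
for k in range(4):
    print(k, nb_pmf(k, r, p), mixture_pmf(k, r, p))
```

The two columns should agree up to numerical integration error, which is what the mixture statement claims.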

edgeR (1)
Robinson and Smyth (Biostatistics, 2008; Bioinformatics, 2007)
Package available from Bioconductor, with a very informative vignette
$Y_{ij} \sim \mathrm{NB}(\mu_{ij}, \phi)$
$\mathrm{Var}(Y_{ij}) = \mu_{ij}(1 + \mu_{ij}\phi)$
Negative binomial (gamma-Poisson) with mean $\mu$
$\phi$ is the overdispersion parameter (biological variation)
$\phi = 0$ gives the Poisson distribution
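To make the role of the overdispersion parameter concrete, this minimal sketch evaluates the variance formula from the slide, Var(Y) = mu (1 + mu phi), for a few means. The chosen means and the dispersion value 0.1 are hypothetical.

```python
def nb_variance(mu, phi):
    """Negative binomial variance: mu * (1 + mu * phi)."""
    return mu * (1.0 + mu * phi)

for mu in (10.0, 100.0, 1000.0):
    poisson_var = nb_variance(mu, 0.0)  # phi = 0 reduces to the Poisson variance
    nb_var = nb_variance(mu, 0.1)       # hypothetical dispersion of 0.1
    print(mu, poisson_var, nb_var)
```

With phi = 0 the variance equals the mean, and for phi > 0 the extra term grows quadratically with the mean, so overdispersion matters most for highly expressed tags in absolute terms.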

Overdispersion in our data

edgeR (2)
Test per gene
$Y_{gij} \sim \mathrm{NB}(\mu_{gij}, \phi_g)$ where $\mu_{gij} = M_j \times p_{gi}$
$\mathrm{Var}(Y_{gij}) = \mu_{gij}(1 + \mu_{gij}\phi_g)$
$p_{gi}$ is the proportion of tags for tag $g$ in sample $i$
$M_j$ is the library size for library $j$ of sample $i$
$\phi_g$ is the dispersion parameter for tag $g$
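The per-gene model says that the expected count is the library size times the tag proportion, with a tag-specific dispersion inflating the variance. A minimal sketch, with hypothetical library sizes, proportion, and dispersion (none of these numbers come from the slides):

```python
def nb_mean_and_variance(library_size, proportion, dispersion):
    """Mean = M * p; variance = mean * (1 + mean * dispersion)."""
    mu = library_size * proportion
    return mu, mu * (1.0 + mu * dispersion)

library_sizes = {"lib1": 1_000_000, "lib2": 2_500_000}  # hypothetical M values
proportion = 2e-5   # hypothetical proportion of all tags that belong to one tag
dispersion = 0.05   # hypothetical tag-wise dispersion

moments = {lib: nb_mean_and_variance(size, proportion, dispersion)
           for lib, size in library_sizes.items()}
for lib, (mu, var) in moments.items():
    print(lib, mu, var)
```

The same tag proportion gives different expected counts in libraries of different depth, which is why the library size enters the mean explicitly.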

edgeR (3)
Estimation of the common dispersion parameter by conditioning the likelihood for $\phi_g$ on the sum of counts and maximizing the common likelihood
$l_C(\phi) = \sum_g l_g(\phi)$
Common dispersion parameter OR weighted linear combination of common and individual likelihoods
$WL(\phi_g) = l_g(\phi_g) + \alpha\, l_C(\phi_g)$
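The idea of combining individual and common likelihoods can be sketched as follows. This is a simplification and not the edgeR implementation: it uses the plain (unconditional) negative binomial log-likelihood with the mean fixed at the sample mean, assumes equal library sizes, and maximizes over a grid, whereas the slide describes a likelihood conditioned on the sum of counts. The weight name `alpha`, the grid, and the toy counts are assumptions for illustration.

```python
import math

def nb_loglik(counts, mu, phi):
    """NB log-likelihood with mean mu and dispersion phi (Var = mu * (1 + mu * phi))."""
    size = 1.0 / phi
    total = 0.0
    for y in counts:
        total += (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
                  + size * math.log(size / (size + mu))
                  + y * math.log(mu / (size + mu)))
    return total

def gene_loglik(counts, phi):
    """Individual log-likelihood l_g(phi), mean fixed at the sample mean."""
    mu = sum(counts) / len(counts)
    return nb_loglik(counts, mu, phi)

def common_loglik(genes, phi):
    """Common log-likelihood: sum of the individual log-likelihoods at a shared phi."""
    return sum(gene_loglik(counts, phi) for counts in genes)

def weighted_loglik(genes, g, phi, alpha):
    """Individual log-likelihood of gene g plus alpha times the common log-likelihood."""
    return gene_loglik(genes[g], phi) + alpha * common_loglik(genes, phi)

GRID = [0.01 * k for k in range(1, 201)]  # candidate dispersions 0.01 .. 2.00

def estimate_dispersion(genes, g, alpha, grid=GRID):
    """Grid maximizer of the weighted log-likelihood for gene g."""
    return max(grid, key=lambda phi: weighted_loglik(genes, g, phi, alpha))

# Toy counts for three genes in four samples (hypothetical data).
genes = [[12, 30, 7, 22], [100, 95, 110, 90], [3, 0, 9, 1]]

common_phi = max(GRID, key=lambda phi: common_loglik(genes, phi))
print("common:", common_phi)
for g in range(len(genes)):
    print(g, estimate_dispersion(genes, g, 0.0), estimate_dispersion(genes, g, 1.0))
```

With `alpha = 0` each gene gets its own estimate, and larger `alpha` values pull the gene-wise estimates toward the common one.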

edgeR (4)
Exact test replacing hypergeometric probabilities with NB-derived probabilities (qCML), for single-factor experiments
Generalized linear models and the Cox-Reid profile-adjusted likelihood (CR) method, for multifactorial experiments

edgeR: what is new?
The exact test is not able to work with confounders, so it is replaced by a generalized linear model with a log-likelihood ratio test
Abundance trending in dispersion estimates

Dispersion trend (figure: dispersion versus abundance)

Dispersion trending after filtering for low abundance (figure: dispersion versus abundance)

DESeq (1)
Anders and Huber, Genome Biology (2010) 11:R106
Roughly the same principles as edgeR
No multifactorial analysis implemented yet

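DESeq (2)
(1) $Y_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma^2_{ij})$
(2) $\mu_{ij} = s_j\, q_{i,\rho(j)}$
$s_j$: scaling factor for sample $j$
$q_{i,\rho(j)}$: proportional concentration of tag $i$ in condition $\rho$
(3) $\sigma^2_{ij} = \mu_{ij} + s_j^2\, \nu_{i,\rho(j)}$
$\nu_{i,\rho(j)}$ is a smooth function depending on $q_{i,\rho(j)}$ (concentration)
In (3), the first term $\mu_{ij}$ is the count noise and the second term $s_j^2\, \nu_{i,\rho(j)}$ is the extra variance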
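The variance decomposition in (3) can be written as a few lines of arithmetic. In this sketch the extra-variance value `nu` is passed in directly as a number; in DESeq it comes from a smooth function of the concentration, which is not modeled here. All input values are hypothetical.

```python
def deseq_mean_variance(s_j, q, nu):
    """Return (mean, variance) with mean = s_j * q and variance = mean + s_j**2 * nu."""
    mu = s_j * q
    count_noise = mu                 # Poisson-like part of the variance
    extra_variance = s_j ** 2 * nu   # additional variance between replicates
    return mu, count_noise + extra_variance

# Hypothetical scaling factor, concentration, and extra-variance value.
mu, sigma2 = deseq_mean_variance(s_j=2.0, q=50.0, nu=30.0)
print(mu, sigma2)
```

Setting `nu` to zero leaves only the count noise, so the variance equals the mean as in the Poisson case.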

DESeq (3): variance trend with expression
Figure legend: purple: Poisson; dashed orange: edgeR (before trending); orange: DESeq
You can derive: squared CV is $1/\mu + \phi$
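The squared coefficient of variation follows in one step from the negative binomial variance given on the edgeR (1) slide, $\mathrm{Var}(Y) = \mu(1 + \mu\phi)$, taking the CV as the standard deviation divided by the mean:

```latex
\begin{aligned}
\mathrm{CV}^2 &= \frac{\mathrm{Var}(Y)}{\mu^2}
               = \frac{\mu\,(1 + \mu\phi)}{\mu^2}
               = \frac{1}{\mu} + \phi
\end{aligned}
```

The first term is the Poisson contribution, which vanishes for large means, so at high expression the squared CV approaches the dispersion $\phi$.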

DESeq (3)
Differences with edgeR:
Complete shrinkage to trended dispersion; limited tagwise dispersion estimates
Different variance estimates for different sample groups allowed
Deals better with samples with large differences in read depth?

DESeq (4): statistical testing
In analogy to the initial edgeR implementation: exact test on the NB probabilities in the two conditions

Conclusions
edgeR and DESeq are comparable implementations of statistical tests using the NB distribution
edgeR and DESeq produce largely similar results
Implementation of generalized linear models in edgeR allows for testing with confounders
Results are comparable to limma for medium to highly expressed genes; modeling of stochastic effects is particularly important for lowly expressed genes

Comparison to limma (on sqrt scaled data)