edgeR: empirical Bayes analysis

Slides:



Advertisements
Similar presentations
Linear Models for Microarray Data
Advertisements

11-1 Empirical Models Many problems in engineering and science involve exploring the relationships between two or more variables. Regression analysis.
Statistical Techniques I EXST7005 Start here Measures of Dispersion.
Lecture (11,12) Parameter Estimation of PDF and Fitting a Distribution Function.
Differentially expressed genes
Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we.
Analysis of Differential Expression T-test ANOVA Non-parametric methods Correlation Regression.
Lecture 9: One Way ANOVA Between Subjects
ARE OBSERVATIONS OBTAINED DIFFERENT?. ARE OBSERVATIONS OBTAINED DIFFERENT? You use different statistical tests for different problems. We will examine.
Chapter 11 Multiple Regression.
Experimental Evaluation
Multiple Regression Farrokh Alemi, Ph.D. Kashif Haqqi M.D.
Proteomics Informatics – Data Analysis and Visualization (Week 13)
Essential Statistics in Biology: Getting the Numbers Right
Chapter 15 Data Analysis: Testing for Significant Differences.
Introduction to DESeq and edgeR packages Peter A.C. ’t Hoen.
Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.
Fundamentals of Data Analysis Lecture 10 Management of data sets and improving the precision of measurement pt. 2.
Lecture 11. Microarray and RNA-seq II
Random Regressors and Moment Based Estimation Prepared by Vera Tabakova, East Carolina University.
Microarray data analysis David A. McClellan, Ph.D. Introduction to Bioinformatics Brigham Young University Dept. Integrative Biology.
Chapter 10: Analyzing Experimental Data Inferential statistics are used to determine whether the independent variable had an effect on the dependent variance.
A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.
Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman.
MGS3100_04.ppt/Sep 29, 2015/Page 1 Georgia State University - Confidential MGS 3100 Business Analysis Regression Sep 29 and 30, 2015.
Analysis of Variance (ANOVA) Brian Healy, PhD BIO203.
CHEMISTRY ANALYTICAL CHEMISTRY Fall Lecture 6.
Application of Class Discovery and Class Prediction Methods to Microarray Data Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics.
Chapter 22: Building Multiple Regression Models Generalization of univariate linear regression models. One unit of data with a value of dependent variable.
Suppose we have T genes which we measured under two experimental conditions (Ctl and Nic) in n replicated experiments t i * and p i are the t-statistic.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
Sampling Design and Analysis MTH 494 Lecture-21 Ossam Chohan Assistant Professor CIIT Abbottabad.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 6 –Multiple hypothesis testing Marshall University Genomics.
Canadian Bioinformatics Workshops
Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.
Regression Analysis Part A Basic Linear Regression Analysis and Estimation of Parameters Read Chapters 3, 4 and 5 of Forecasting and Time Series, An Applied.
Methods of Presenting and Interpreting Information Class 9.
Statistics Behind Differential Gene Expression
RNA Quantitation from RNAseq Data
Moderní metody analýzy genomu
Gene expression from RNA-Seq
Basic Estimation Techniques
RNA-Seq analysis in R (Bioconductor)
Differential Gene Expression
Gene expression.
GOseq - Part II Tom Chittenden, PhD, DPhil
Gene Ontology Analysis for RNA-seq: Accounting for Selection Bias
12 Inferential Analysis.
Significance Analysis of Microarrays (SAM)
Correlation and Simple Linear Regression
Basic Estimation Techniques
Discrete Event Simulation - 4
Comparing Populations
Correlation and Simple Linear Regression
What is Regression Analysis?
Significance Analysis of Microarrays (SAM)
12 Inferential Analysis.
Elements of a statistical test Statistical null hypotheses
Simple Linear Regression
Simple Linear Regression and Correlation
Statistics II: An Overview of Statistics
Product moment correlation
Volume 3, Issue 1, Pages (July 2016)
Working with RNA-Seq Data
Quantitative analyses using RNA-seq data
Sequence Analysis - RNA-Seq 2
MGS 3100 Business Analysis Regression Feb 18, 2016
Differential Expression of RNA-Seq Data
Correlation and Simple Linear Regression
Correlation and Simple Linear Regression
Presentation transcript:

edgeR: empirical Bayes analysis of digital gene expression data Tom Chittenden, PhD, DPhil, PStat Vice President of Statistical Sciences Founding Director of the Advanced Artificial Intelligence Research Laboratory Lecturer on Pediatrics & Biological Engineering

“It’s tough to make predictions, especially about the future”  Yogi Berra (Niels Bohr) “Essentially, all models are wrong, but some are useful” George Box

HMS Research Computing Course Page https://wiki.med.harvard.edu/Orchestra/RBiostatisticsCourse Course information and materials will be posted two days prior to the start of each class. Kristina Holton of HMS Research Computing will serve as a teaching assistant for the course.

Based Feature Selection Statistical Framework Tumor samples and controls Gene expression SV SNV Clinical information a priori Biological Knowledge Based Feature Selection Clinical Ontologies Genomic Ontologies NLP PubMed Gene signatures INDELS CNV Tumor markers EMR Integrated disease signature Class Prediction Post hoc non-experimental Validation Experimental validation Pathway Databases GO Database PubMed

R Biostatistics Course Part I - Unsupervised Biostatistical Analysis Data Normalization Unsupervised Cluster & Network Analysis Hypothesis Generation Data Platform RNA-Seq Data Weighted Correlation Network Analysis (WGCNA) RPKM FPKM VST TMM Unbiased Data Analysis HMS RC NGS Course R Biostatistics Course Part II - Supervised Biostatistical Analysis LIMMA: Linear Models for Microarray Data RPKM: Reads per kilo base per million RPK= Number of Mapped reads/ length of transcript in kb (transcript length/1000) RPKM = RPK/total number of reads in million (total no of reads/ 1,000,000) Example: Number of mapped reads =3 length of transcript=300 bp Total number of reads =10,000 RPK = 3/(300/1000) = 3/0.3 = 10 RPKM = 10 / (10,000/1,000,000) = 10/ 0.01 = 1000 RPKM =1000 Gene Set Analysis Data Normalization Differential Analysis Hypothesis Generation Data Platform RNA-Seq Data edgeR TMM GOSeq Controls for Biases in Background Distribution & Transcript Length

HMS Research Computing Office Hours Advanced R Programming & Biostatistics Generating Results Wednesdays 1:00 to 3:00pm

TCGA breast cancer dataset TCGA practice breast cancer dataset

Robinson MD, McCarthy DJ and Smyth GK, Bioinformatics 2010 Part II - Supervised Differential Gene Expression and Functional Enrichment Analyses Differential Gene Expression Analysis with edgeR Functional Enrichment of Gene Ontology Terms with GOSeq Analysis Functional Enrichment and Visualization of Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathways with GOSeq Analysis and Pathview LIMMA: Linear Models for Microarray Data RPKM: Reads per kilo base per million RPK= Number of Mapped reads/ length of transcript in kb (transcript length/1000) RPKM = RPK/total number of reads in million (total no of reads/ 1,000,000) Example: Number of mapped reads =3 length of transcript=300 bp Total number of reads =10,000 RPK = 3/(300/1000) = 3/0.3 = 10 RPKM = 10 / (10,000/1,000,000) = 10/ 0.01 = 1000 RPKM =1000 Robinson MD, McCarthy DJ and Smyth GK, Bioinformatics 2010

Robinson MD, McCarthy DJ and Smyth GK, Bioinformatics 2010 edgeR (empirical analysis of DGE in R) edgeR is a class of statistical methods for examining differential expression of replicated count data edgeR is an overdispersed Poisson model used to account for both biological and technical variability edgeR uses Empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of statistical inference The Poisson distribution assumes that the mean and variance are the same. Sometimes, your data show extra variation that is greater than the mean. This situation is called overdispersion and negative binomial regression is more flexible in that regard than Poisson regression (you could still use Poisson regression in that case but the standard errors could be biased). The negative binomial distribution has one parameter more than the Poisson regression that adjusts the variance independently from the mean. In fact, the Poisson distribution is a special case of the negative binomial distribution. Robinson MD, McCarthy DJ and Smyth GK, Bioinformatics 2010

Negative Binomial Model Each sample is sequenced and reads are mapped to the appropriate reference genome The number of reads from sample (i) mapped to gene (g) will be denoted (ygi) The set of genewise counts for sample (i) makes up the expression profile or library for that sample The expected size of each count is the product of the library size and the relative abundance of that gene in that sample Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2014

The data is fit with a negative binomial (NB) model: Ygi ∼ NB(MiPgj ,Φg) For gene (g) and sample (i) Mi = Library size (total number of reads) Φg = Dispersion pgj = Relative abundance of gene (g) in experimental group (j) to which sample (i) belongs Mean = µgi = MiPgj Variance = µgi (1+µgiΦg) Note: NB distribution reduces to Poisson distribution when Φg = 0 √Φg = Biological Coefficient of Variation between samples The Poisson distribution assumes that the mean and variance are the same. Sometimes, your data show extra variation that is greater than the mean. This situation is called overdispersion and negative binomial regression is more flexible in that regard than Poisson regression (you could still use Poisson regression in that case but the standard errors could be biased). The negative binomial distribution has one parameter more than the Poisson regression that adjusts the variance independently from the mean. In fact, the Poisson distribution is a special case of the negative binomial distribution. √Φg = Biological Coefficient of Variation between samples. Coefficient of variation (CV) (standard deviation divided by mean). In some DGE applications, technical variation can be treated as Poisson. Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2014

Mean-Variance Plot >plotMeanVar(y) The Poisson distribution assumes that the mean and variance are the same. Sometimes, your data show extra variation that is greater than the mean. This situation is called overdispersion and negative binomial regression is more flexible in that regard than Poisson regression (you could still use Poisson regression in that case but the standard errors could be biased). The negative binomial distribution has one parameter more than the Poisson regression that adjusts the variance independently from the mean. In fact, the Poisson distribution is a special case of the negative binomial distribution. Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2010

Why edgeR?

Creating the DGEList data class edgeR stores data in a simple list-based data object called a DGEList If the table of counts exists as a data.frame then a DGEList object can be created by >group <- c(rep("TNBC", 10), rep("Normal", 10)) >y <- DGEList(counts=x[,1:20], group=group) Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2014

Trimmed Mean of M-values (TMM) Normalization The ‘calcNormFactors’ function normalizes for RNA composition by determining a set of scaling factors for the library sizes that minimize the log-fold changes between samples TMM is used to compute these factors to scale original library size to the ‘effective library size,’ which for differences in transcriptome sizes are accounted for and thus used for all downstream analyses >y <- calcNormFactors(y) >y$samples Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2014

Data exploration - Multi-dimensional Scaling (MDS) plot The ‘plotMDS’ function produces a multi-dimensional scaling plot of the RNA samples based on leading log-fold-change distances This plot can be viewed as a type of unsupervised Clustering, somewhat similar in principle to the HCL clustering of the WGCNA in Class 1 “Dimension 1 is the direction that best separates the samples, without regard to whether they are treatments or replicates. Dimension 2 is the next best direction, uncorrelated with the first, that separates the samples.” The leading log-fold-change is the average (root-mean-square) of the largest absolute log-fold- changes between each pair of samples. Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2014

Data exploration - Multi-dimensional Scaling (MDS) plot >plotMDS(y) Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2014

WGCNA, Peter Langfelder and Steve Horvath, 2008

Estimating Dispersions edgeR uses the quantile-adjusted conditional maximum likelihood (qCML) method to estimate dispersions for experiments with single factor qCML calculates likelihood by conditioning on the total counts for each tag, and uses pseudo counts after adjusting for library sizes (effective library sizes) A scale-free network is a connected graph or network with the property that the number of links k originating from a given node exhibits a power. law distribution P(k)∼k^(-gamma). where gamma is a parameter whose value is typically in the range 2 < gamma < 3 Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2014

Estimating Dispersions qCML common dispersion is calculated using the ‘estimateCommonDisp’ function >y <- estimateCommonDisp(y, verbose=TRUE) qCML tagwise dispersions are calculated using the ‘estimateTagwiseDisp’ function >y <- estimateTagwiseDisp(y) Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2014

Plotting Estimated Dispersions >plotBCV(y, main = "Plot of Estimated Dispersions") Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2014

Testing for DE Genes Once the negative binomial model is fitted and dispersion estimates are derived, the NB exact test is used to determine differential expression for single factor experimental designs Knowing the conditional distribution for the sum of counts in a group, exact p-values are derived by summing over all sums of counts that have a probability less than the probability under the null hypothesis of the observed sum of counts Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2014

Mark Robinson and Gordon K. Smyth, Biostatistics, 2008 Testing for DE Genes Conditioning on the total pseudo-sum, the probability of observing class totals equal to or more extreme than observed can be calculated, giving an exact method of assessing differential expression The 2-tailed p-value equals the sum of the probabilities of class totals that are no more likely than those observed Mark Robinson and Gordon K. Smyth, Biostatistics, 2008

Testing for DE Genes The exact test for the negative binomial distribution has strong parallels with Fisher's exact test for the hypergeometric distribution Hypothesis testing is performed with the ‘exactTest’ function, and it allows for both common dispersion and tagwise dispersion approaches >et <- exactTest(y) Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2014

Plot the log-fold-changes with a smear plot, highlighting the DE genes >plotSmear(et, de.tags=detags, main = "Smear Plot") M A Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K. Smyth, edgeR User’s Guide, 2014

Multiple Testing Correction Number of genes tested (N) False positives incidence Probability of calling 1 or more false positives by chance (100(1-0.95N)) 1 1/20 5% 2 1/10 10% 20 64% 100 5 99.4%

Westfall and Young Permutation Common Methods Ranked by Stringency Bonferroni Bonferroni Step-Down Westfall and Young Permutation Benjamini and Hochberg False Discovery Rate None More False Negatives More False Positives

Bonferroni correction The p-value of each gene is multiplied by the number of genes in the gene list. If the corrected p-value is still below the error rate, the gene will be significant: Corrected P-value= p-value * n (number of genes in test) <0.05 As a consequence, if testing 1000 genes at a time, the highest accepted individual p-value is 0.00005, making the correction very stringent. With a Family-wise error rate of 0.05 (i.e., the probability of at least one error in the family), the expected number of false positives will be 0.05.

Bonferroni Step-down (Holm) correction 1) The p-value of each gene is ranked from the smallest to the largest. 2) The first p-value is multiplied by the number of genes present in the gene list. Corrected P-value= p-value * n < 0.05 3) The second p-value is multiplied by the number of genes less 1: Corrected P-value= p-value * n-1 < 0.05 4) The third p-value is multiplied by the number of genes less 2: Corrected P-value= p-value * n-2 < 0.05 It follows that sequence until no gene is found to be significant.

Bonferroni Step-down (Holm) correction Example: Let n=1000, error rate=0.05 Gene p-value (S to L) Rank Correction Is gene significant after correction? A 0.00002 1 0.00002*1000 = 0.02 0.02 < 0.05 => Yes B 0.00004 2 0.00004*999 = 0.039 0.039 < 0.05 => Yes C 0.00009 3 0.00009*998 = 0.0898 0.0898 > 0.05 => No

Westfall and Young Permutation Both Bonferroni and Holm methods are called single-step procedures, where each p-value is corrected independently. The Westfall and Young permutation method takes advantage of the dependence structure between genes, by permuting all genes at the same time. Westfall and Young permutation follows a step-down procedure similar to the Holm method, combined with a bootstrapping method to compute the p-value (null) distribution

Westfall and Young Permutation 1) p-values are calculated for each gene based on the original data set and ranked. 2) The permutation method creates a pseudo-data set by dividing the data into artificial treatment and control groups. 3) p-values for all genes are computed on the pseudo-data set. 4) The successive minima of the new p-values are retained and compared to the original ones. 5) This process is repeated a large number of times, and the proportion of resampled data sets where the minimum pseudo-p-value is less than the original p-value is the adjusted p-value.

Benjamini and Hochberg False Discovery Rate This correction is the least stringent of all 4 options, and therefore tolerates more false positives. There will be also less false negatives. 1) The p-values of each gene are ranked from largest to smallest. 2) The largest p-value remains as it is. 3) The second largest p-value is multiplied by the total number of genes in gene list divided by its rank. Corrected p-value = p-value*(n/n-1) < 0.05 4) The third p-value is multiplied as in step 3: Corrected p-value = p-value*(n/n-2) < 0.05

Benjamini and Hochberg False Discovery Rate Example: Let n=1000, error rate=0.05 Gene p-value (L to S) Rank Correction Is gene significant after correction? A 0.1 1000 No correction 0.1 > 0.05 => No B 0.06 999 0.06* (1000/999) = 0.06006 0.06006 > 0.05 => No C 0.04 998 0.04*(1000/998) = 0.04008 0.04008 < 0.05 => Yes LIMMA: Linear Models for Microarray Data RPKM: Reads per kilo base per million RPK= Number of Mapped reads/ length of transcript in kb (transcript length/1000) RPKM = RPK/total number of reads in million (total no of reads/ 1,000,000) Example: Number of mapped reads =3 length of transcript=300 bp Total number of reads =10,000 RPK = 3/(300/1000) = 3/0.3 = 10 RPKM = 10 / (10,000/1,000,000) = 10/ 0.01 = 1000 RPKM =1000

Summary Table Correction Method Type of Error control Genes identified by chance after correction Bonferroni Family-wise error rate If error rate equals 0.05, expects 0.05 genes to be significant by Chance. Bonferroni Step Down Westfall and Young Permutation Benjamini and Hochberg False Discovery Rate If error rate equals 0.05, 5% of genes considered statistically significant (that pass the restriction after correction) will be identified by chance (false positives). LIMMA: Linear Models for Microarray Data RPKM: Reads per kilo base per million RPK= Number of Mapped reads/ length of transcript in kb (transcript length/1000) RPKM = RPK/total number of reads in million (total no of reads/ 1,000,000) Example: Number of mapped reads =3 length of transcript=300 bp Total number of reads =10,000 RPK = 3/(300/1000) = 3/0.3 = 10 RPKM = 10 / (10,000/1,000,000) = 10/ 0.01 = 1000 RPKM =1000

Practicum – Class 1 Example Directory Path edgeR: Differential Expression Analysis of Digital Gene Expression Data Example Directory Path “C:/Users/Owner/Documents/RCCG_HMS/RC_Biostatistics_Courses/Biostats_RNA-Seq/Supervised/Class 1” LIMMA: Linear Models for Microarray Data RPKM: Reads per kilo base per million RPK= Number of Mapped reads/ length of transcript in kb (transcript length/1000) RPKM = RPK/total number of reads in million (total no of reads/ 1,000,000) Example: Number of mapped reads =3 length of transcript=300 bp Total number of reads =10,000 RPK = 3/(300/1000) = 3/0.3 = 10 RPKM = 10 / (10,000/1,000,000) = 10/ 0.01 = 1000 RPKM =1000