Biases in RNA-Seq data October 30, 2013 NBIC Advanced RNA-Seq course

Presentation transcript:

Biases in RNA-Seq data October 30, 2013 NBIC Advanced RNA-Seq course Prof. dr. Antoine van Kampen Bioinformatics Laboratory Academic Medical Center Biosystems Data Analysis Group Swammerdam Institute for Life Sciences a.h.vankampen@amc.uva.nl

Aim: to provide you with a brief (almost up-to-date) overview of literature about biases in RNA-seq data such that you become aware of this potential problem (and solutions)

Example of RNA-seq bias.........

What is the problem? Experimental (and computational) biases affect expression estimates and, therefore, subsequent data analysis: differential expression analysis; study of alternative splicing; transcript assembly; gene set enrichment analysis; other downstream analysis. We must attempt to avoid, detect and correct these biases.

Types of bias: library size; gene length; mappability of reads (lower sequence complexity, repeats, ...); position (fragments are preferentially located towards either the beginning or end of transcripts); sequence-specific (biased likelihood for fragments being selected); %GC.

A few words about microarrays: they are not free of bias either. It has taken a decade to understand these biases and to provide solutions. Recognition of biases (e.g., by the MicroArray Quality Control (MAQC) consortium) has led to the development of quality control standards. For RNA-Seq it will also take some time to "understand the data". Comparison of microarrays and RNA-Seq may help to identify bias. Malone and Oliver (2011) BMC Biology, 9:34

Normalization for gene length and library size: RPKM / FPKM

Within one sample: transcript 1 (size = L), count = 6; transcript 2 (size = 2L), count = 12. You can't conclude that gene 2 has a higher expression than gene 1, because transcript 2 is twice as long!

Comparison of two samples: transcript 1 (sample 1), count = 6, library size = 600; transcript 1 (sample 2), count = 12, library size = 1200. You can't conclude that gene 1 has a higher expression in sample 2 compared to sample 1, because the library of sample 2 is twice as large!

RPKM: reads per kilobase per million mapped reads. As a unit of measurement, RPKM reflects the molar concentration of a transcript in the starting sample by normalizing for (1) RNA length and (2) the total read number in the measurement. This facilitates comparison of transcript levels within and between samples. Mortazavi et al (2008) Nature Methods, 5(7), 621

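Rewriting the formula: $\mathrm{RPKM} = \dfrac{C}{(L/10^{3})\,(N/10^{6})} = \dfrac{10^{9}\,C}{N\,L}$, with C = number of reads mapped to the transcript, N = total number of mapped reads in the measurement, and L = transcript length in bp.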

Example 1: a 2500 bp (2.5 kb) transcript with 900 alignments in a sample of 10 million reads (out of which 8 million reads can be mapped). RPKM = 900 / (2.5 kb × 8 million mapped reads) = 45.

Example 2: given a 40M read measurement, how many reads would we expect for a 1 RPKM measurement for a 2 kb transcript? 1 RPKM × 2 kb × 40 million reads = 80 reads.
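The two examples can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, the transcript in Example 1 is read as 2500 bp (2.5 kb), and the 8 million mapped reads (not the 10 million total) are assumed to be the normalizing library size.

```python
def rpkm(read_count, transcript_length_bp, mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * read_count / (mapped_reads * transcript_length_bp)


def expected_reads(rpkm_value, transcript_length_bp, mapped_reads):
    """Invert the RPKM formula to get the read count for a given RPKM."""
    return rpkm_value * mapped_reads * transcript_length_bp / 1e9


# Example 1: 900 alignments, 2500 bp transcript (assumed), 8 million mapped reads.
example1 = rpkm(900, 2500, 8_000_000)

# Example 2: 1 RPKM for a 2 kb transcript in a 40 million read measurement.
example2 = expected_reads(1.0, 2000, 40_000_000)

print(example1, example2)
```

Under these assumptions the first value is 45 RPKM and the second is 80 reads.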

FPKM: fragments per kilobase per million. What's the difference between FPKM and RPKM? Paired-end RNA-Seq experiments produce two reads per fragment, but that doesn't necessarily mean that both reads will be mappable, for example when the second read is of poor quality. If we were to count reads rather than fragments, we might double-count some fragments but not others, leading to a skewed expression value. Thus, FPKM is calculated by counting fragments, not reads. Trapnell et al (2010) Nature Biotechnology, 28(5), 511

Other normalization methods: spike-ins; housekeeping genes (Bullard et al, 2010); upper-quartile (Bullard et al, 2010), in which counts are divided by the upper quartile (75th percentile) of counts for transcripts with at least one read; TMM, the Trimmed Mean of M-values (Robinson and Oshlack, 2010); quantile normalization (Irizarry et al, 2003; developed for microarrays). For a comparison of normalization methods see Dillies et al (2013). References: Irizarry et al (2003) Biostatistics (Oxford, England), 4(2), 249–64. Bullard et al (2010) BMC Bioinformatics, 11, 94. Robinson & Oshlack (2010) Genome Biology, 11(3), R25. Dillies et al (2013) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing. Briefings in Bioinformatics
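As an illustration of the upper-quartile idea, here is a minimal Python sketch. The function names and the toy counts are hypothetical, and the choice of the "inclusive" quantile method is an assumption; published implementations may compute the quartile slightly differently.

```python
import statistics


def upper_quartile(counts):
    """75th percentile of the counts of transcripts with at least one read."""
    nonzero = sorted(c for c in counts if c > 0)
    # quantiles(n=4) returns the three quartile cut points; index 2 is the upper one.
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_normalize(samples):
    """Divide every count in a sample by that sample's upper quartile."""
    return {
        name: [c / upper_quartile(counts) for c in counts]
        for name, counts in samples.items()
    }


# Toy data: sample "b" is sample "a" sequenced twice as deep.
toy = {"a": [0, 10, 20, 30, 40, 50], "b": [0, 20, 40, 60, 80, 100]}
normalized = upper_quartile_normalize(toy)
print(normalized["a"])
print(normalized["b"])
```

After scaling, the two toy samples become identical, which is the behaviour one would want for a pure difference in sequencing depth.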

Why (not) use spike-ins? Opinions differ on whether this could be made to work. In M.D. Robinson and A. Oshlack “A scaling normalisation method for differential expression analysis of RNA-seq data”, Genome Biology 11, R25 (2010), it is claimed that “In order to use spike-in controls for normalisation, the ratio of the concentration of the spike to the sample must be kept constant throughout the experiment. In practice this is difficult to achieve and small variations will lead to biased estimation of the normalisation factor.”

Quantile normalization: transform intensities / read counts into one standard distribution shape. Developed for Affymetrix chips. In 2003, Benjamin Bolstad, one of Terry Speed's students, proposed cutting through all the complexity by a simple non-parametric normalization procedure, at least for one-color arrays. He proposed to shoe-horn the intensities of all probes on each chip into one standard distribution shape, which he determined by pooling all the individual chip distributions. In practice, the distribution of intensities from any high-quality chip will do. The algorithm mapped every value on any one chip to the corresponding quantile of the standard distribution; hence the method is called quantile normalization. This simple 'between-chip' procedure worked as well as most of the more complex procedures then current, and certainly better than the regression method, which was then the manufacturer's default for Affymetrix chips. This method was also made available as the default in the affy package of Bioconductor, which has become the most widely used suite of freeware tools for microarrays (see www.bioconductor.org). Schematic representation of quantile normalization: the value x, which is the α-th quantile of all probes on chip 1, is mapped to the value y, which is the α-th quantile of the reference distribution F2. Source: http://www.people.vcu.edu/~mreimers/OGMDA/normalize.expression.html
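The mapping of each value to the corresponding quantile of a reference distribution can be sketched as follows. This is a simplified illustration, not the affy implementation: the reference is assumed to be the mean of the sorted values across chips, ties are not given special treatment, and all names and numbers are invented.

```python
def quantile_normalize(samples):
    """Map the k-th smallest value of every sample to the k-th reference value.

    samples: list of equal-length lists (one list per chip or library).
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest values over all samples.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # indices by increasing value
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


chips = [[5, 2, 3, 4], [8, 1, 6, 2]]
print(quantile_normalize(chips))
```

After normalization every chip contains exactly the same set of values; only the assignment of values to probes differs between chips.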

Why do we need other normalization methods when we have RPKM?

Normalization objective Normalization factors/procedures should ensure that a gene with the same expression level in two samples is not detected as differentially expressed (DE).

Thought experiment. Suppose two RNA populations (samples), A and B, with the same 3 genes expressed in both samples. Numbers indicate the number of transcripts per cell: 150, 100 and 50 in condition A, and 150, 100 and 50 in condition B. No differential expression of these genes.

Thought experiment (continued). Now condition A has 3 additional genes, not in B, with equal number and expression (again 150, 100 and 50 transcripts per cell). Still no differential expression of the first three genes. However, RNA production in A is twice the size of B.

Thought experiment (continued). Suppose we sequence both samples with the same depth (1200 reads). These reads get 'distributed' over the expressed genes. Condition A: 300, 200 and 100 reads for the three shared genes (600 reads) and another 600 reads for the three additional genes. Condition B: 600, 400 and 200 reads (1200 reads). A correct normalization factor adjusts condition A by a factor of two (no differential expression). The proportion of reads attributed to a gene in a library depends on the expression properties of the whole sample: if a sample has a larger RNA output (S), then RNA-seq will under-sample many genes (lower number of reads).

RPKM would fail in this example (assume transcript lengths are the same). In this example, the first (red) gene receives 100 of the 1200 reads in condition A and 200 of the 1200 reads in condition B, so its RPKM in B is twice its RPKM in A. RPKM normalization would therefore result in differential expression while this is not the case, because we didn't take into account the total RNA production.
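The thought experiment can be replayed numerically. The sketch below is illustrative only: the gene labels, the common transcript length of 1000 bp and the helper names are assumptions, and the "red" gene is taken to be the one with 50 transcripts per cell.

```python
def distribute_reads(transcripts_per_cell, depth):
    """Spread `depth` reads over genes in proportion to transcript numbers
    (all transcripts are assumed to have the same length)."""
    total = sum(transcripts_per_cell.values())
    return {g: depth * n / total for g, n in transcripts_per_cell.items()}


def rpkm(reads, length_bp, library_size):
    return 1e9 * reads / (library_size * length_bp)


condition_a = {"g150": 150, "g100": 100, "red": 50,
               "extra150": 150, "extra100": 100, "extra50": 50}
condition_b = {"g150": 150, "g100": 100, "red": 50}
DEPTH = 1200
LENGTH_BP = 1000  # hypothetical common transcript length

reads_a = distribute_reads(condition_a, DEPTH)
reads_b = distribute_reads(condition_b, DEPTH)

rpkm_a = rpkm(reads_a["red"], LENGTH_BP, DEPTH)
rpkm_b = rpkm(reads_b["red"], LENGTH_BP, DEPTH)
apparent_fold_change = rpkm_b / rpkm_a

# Correct condition A by the ratio of total RNA production (600 / 300).
production_ratio = sum(condition_a.values()) / sum(condition_b.values())
corrected_fold_change = rpkm_b / (rpkm_a * production_ratio)

print(reads_a["red"], reads_b["red"], apparent_fold_change, corrected_fold_change)
```

RPKM alone reports a two-fold difference for a gene whose expression per cell is unchanged; the difference disappears once condition A is scaled by the factor of two in RNA production.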

When does RPKM fail? (1) If samples have largely different RNA production: many unique genes and/or highly expressed genes, or many genes in one sample with a very high expression compared to the other samples. (2) If the RNA sample is contaminated: reads that represent the contamination will take away reads from the true sample, thus dropping the number of reads of interest. If you assume that your samples are 'comparable', then RPKM is OK, e.g., for technical replicates.

RNA spikes in Illumina: RPKM is OK for measuring relative abundance of transcripts within one experiment (e.g., one lane of an Illumina sequencer). Spikes: 300 and 1500 nt (Arabidopsis) and 10000 nt (λ-phage), at 10^4, 10^5, …, 10^9 transcripts per 100 ng mRNA (data from Mortazavi et al.). From Conrad Burden, Mathematical Sciences Institute, Australian National University, Canberra

Let us take RNA production into consideration. Again, don't consider gene length (assume transcript lengths are of equal size). RNA production: condition A = 600, condition B = 300. Larger RNA production results in fewer reads per gene (assuming the same sequencing depth). Correction (for the 'red' gene): condition A: 100 reads * 600 = 60000; condition B: 200 reads * 300 = 60000.

Closer look: relative gene expression. Symbols: μ = unknown true expression (transcripts/cell); L = unknown true gene length (bp); S_k = total RNA production (bp); g = gene; k = library. The relative expression of gene g in library k is $\mu_{gk} L_g / S_k$, with $S_k = \sum_g \mu_{gk} L_g$. Example: in a cell only two genes are expressed, A: 10 transcripts, L = 10; B: 40 transcripts, L = 5. Total RNA production = 10*10 + 40*5 = 300. Relative expression of A = 10*10/300 = 0.33 (not 10/50 = 0.2). Relative expression of B = 40*5/300 = 0.67 (not 40/50 = 0.8).

Expected number of reads: $E[Y_{gk}] = \dfrac{\mu_{gk} L_g}{S_k}\, N_k$. Symbols: Y = read count; μ = unknown true expression (transcripts/cell); L = unknown true gene length (bp); S_k = total RNA production (bp); N_k = library size; g = gene; k = library. The total RNA production (S_k) cannot be estimated directly, because we do not know the expression levels and true lengths of every gene. Thus, how to correct for RNA production?
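A small sketch of the two-gene example, assuming the expected count takes the form used by Robinson and Oshlack (2010), E[Y] = (μ L / S) N, with S the sum of μ L over all genes. The library size of 300 reads and the function names are arbitrary choices for illustration.

```python
def rna_production(genes):
    """Total RNA production S: sum over genes of expression * length."""
    return sum(mu * length for mu, length in genes.values())


def relative_expression(genes):
    """Share of the total RNA output contributed by each gene."""
    s = rna_production(genes)
    return {g: mu * length / s for g, (mu, length) in genes.items()}


def expected_counts(genes, library_size):
    """Expected read count per gene: (mu * L / S) * N."""
    s = rna_production(genes)
    return {g: mu * length * library_size / s for g, (mu, length) in genes.items()}


# gene -> (transcripts per cell, length)
cell = {"A": (10, 10), "B": (40, 5)}

print(rna_production(cell))
print(relative_expression(cell))
print(expected_counts(cell, library_size=300))
```

Gene B has four times as many transcripts as gene A but only twice the expected read count, because its transcripts are half as long.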

Back to our example. RNA production: condition A = 600, condition B = 300 (factor 2 difference). Larger RNA production results in fewer reads per gene (assuming the same sequencing depth). The correction factor that we applied is the ratio of RNA production. Correction (for the 'red' gene): condition A: 100 reads * 600/300 = 200; condition B: 200 reads.

Normalization factor: essentially a global fold change. The total RNA production (S_k) cannot be estimated directly, but the relative RNA production of two samples (f_k) can more easily be determined. Empirical strategy: compare the expression of genes between the two samples under the assumption that the majority of them are not differentially expressed. How to do this? E.g., Trimmed Mean of M-values (TMM).

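Ratio (M) and average expression (A): $M_g = \log_2 \dfrac{Y_{1g}/N_1}{Y_{2g}/N_2}$ and $A_g = \tfrac{1}{2}\log_2\!\left(\dfrac{Y_{1g}}{N_1}\cdot\dfrac{Y_{2g}}{N_2}\right)$, where Y_ig = read counts for gene g in sample i = 1, 2 and N_i = total read counts for sample i = 1, 2.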
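The ratio and the average expression can be computed as log-scale quantities from the counts and library sizes. This is a minimal sketch assuming the usual definitions of M (log2 ratio of library-size-scaled counts) and A (mean of their log2 values); the function names are hypothetical.

```python
import math


def m_value(y1, n1, y2, n2):
    """Log2 ratio of the library-size-scaled counts of one gene."""
    return math.log2((y1 / n1) / (y2 / n2))


def a_value(y1, n1, y2, n2):
    """Average log2 expression of one gene over the two samples."""
    return 0.5 * math.log2((y1 / n1) * (y2 / n2))


# Counts from the earlier two-sample slide: 6 of 600 reads versus 12 of 1200 reads.
print(m_value(6, 600, 12, 1200))
print(a_value(6, 600, 12, 1200))
```

For this gene M is zero: doubling both the count and the library size leaves the scaled expression unchanged.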

Example. Accounting for total number of reads Technical replicates mean log ratio ~0 Data from Marioni, 2008

Example. Accounting for total number of reads. Plot labels: technical replicates; housekeeping genes. Liver / kidney: the mean log ratio is shifted to higher kidney expression.

A few strongly expressed, differentially expressed genes in liver leave fewer sequence reads available for the bulk of lower expressed liver genes, so the ratio liver/kidney becomes smaller (i.e., shift of the distribution towards kidney). Trim the data: M 30%, A 5%.

Then, from the trimmed subset of genes, calculate a relative scaling factor from a weighted average of M-values. Implemented in edgeR (Bioconductor).
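The trimming and averaging step can be sketched in a few lines. Note the simplifications, which are assumptions of this sketch and not what edgeR does in detail: the mean of the M-values is unweighted (edgeR weights genes by the precision of their M-value), genes with a zero count in either sample are skipped, and the 30% and 5% trims are applied to each tail.

```python
import math


def trimmed_mean_of_m(sample, reference, m_trim=0.30, a_trim=0.05):
    """Simplified (unweighted) TMM scaling factor of `sample` relative to `reference`."""
    n_s, n_r = sum(sample), sum(reference)
    stats = []  # (M, A) per gene with non-zero counts in both samples
    for y_s, y_r in zip(sample, reference):
        if y_s > 0 and y_r > 0:
            p_s, p_r = y_s / n_s, y_r / n_r
            stats.append((math.log2(p_s / p_r), 0.5 * math.log2(p_s * p_r)))
    n = len(stats)

    def kept(values, trim):
        """Indices that survive trimming `trim` of the genes from each tail."""
        order = sorted(range(n), key=lambda i: values[i])
        cut = int(n * trim)
        return set(order[cut:n - cut])

    keep = kept([m for m, _ in stats], m_trim) & kept([a for _, a in stats], a_trim)
    mean_m = sum(stats[i][0] for i in keep) / len(keep)
    return 2 ** mean_m


# Read counts from the thought experiment (condition A has three extra genes).
reads_cond_a = [100, 200, 300, 100, 200, 300]
reads_cond_b = [200, 400, 600, 0, 0, 0]
factor = trimmed_mean_of_m(reads_cond_a, reads_cond_b)
print(factor)
```

For the thought-experiment counts the factor is 0.5: the shared genes look two-fold lower in condition A, which is exactly the difference in RNA production that the normalization has to undo.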

Result after trimming the data (M 30%, A 5%): the offset for HK (housekeeping) genes is similar to the estimated TMM factor, log λ_TMM.

Gene length bias

Gene length bias. This bias (a) affects comparison between genes or isoforms within one sample and (b) results in more power to detect longer transcripts. [Figure panels: 33% of highest expressed genes; 33% of lowest expressed genes.] Data in these plots have not been normalized with RPKM! Current RNA-seq protocols use an mRNA fragmentation approach prior to sequencing to gain sequence coverage of the whole transcript. This means, in simple terms, that the total number of reads for a given transcript is proportional to the expression level of the transcript multiplied by the length of the transcript. In other words, a long transcript will have more reads mapping to it compared to a short transcript of similar expression. Since the power of an experiment is proportional to the sampling size, there is more power to detect differential expression for longer genes. For each platform we first binned all genes into equal gene number bins based on their transcript length. Next we designated genes as differentially expressed (DE) based on a cut-off from the statistical procedure defined in the relevant publication. From figure F: there is more bias for the lower expressed genes. Oshlack and Wakefield (2009) Biology Direct, 4:14

Question: does this bias disappear when we use RPKM?

Mean-variance relationship. Sample variance across lanes in the liver sample from Marioni et al (2008) Genome Research, 18(9), 1509–17. Red line: the one third of shortest genes. Blue line: the longest genes. Black line: line of equality. Plot A: the blue and red lines are close to the line of equality between mean and variance, which is what would be expected from a Poisson process.

Mean-variance relationship (continued). [Plot B axes: log(variance) versus log(mean / length).] Plot B: counts divided by gene length (which you do when using RPKM). Since there is a gene length bias, one may correct for this by using RPKM, which effectively divides by the gene length. However, this division also affects the mean-variance relationship. Before correction it is more or less as expected; after correction the counts are no longer Poisson distributed, and the short genes have a higher variance for a given expression level than the long genes. Because of this change in variance we are still left with a gene length dependency. Thus, RPKM does not fully correct for gene length.
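Why the division by length breaks the Poisson mean-variance relationship can be seen from the moments alone. The sketch below assumes that the raw count of a gene is Poisson with mean c × expression × length; the constant c, the lengths and the expression level are made-up numbers.

```python
def length_scaled_moments(expression, length_bp, c=1.0):
    """Mean and variance of count / length when count ~ Poisson(c * expression * length)."""
    mu = c * expression * length_bp      # Poisson: mean equals variance for the raw count
    mean_scaled = mu / length_bp         # E[X / L] = mu / L
    var_scaled = mu / length_bp ** 2     # Var(X / L) = mu / L^2
    return mean_scaled, var_scaled


short_mean, short_var = length_scaled_moments(expression=2.0, length_bp=500)
long_mean, long_var = length_scaled_moments(expression=2.0, length_bp=5000)

print(short_mean, short_var)
print(long_mean, long_var)
```

Both genes end up with the same length-scaled mean, but the variance of the short gene is ten times larger, so variance divided by mean is 1/L rather than 1.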

Just to refresh your memory The power of a statistical test is the probability that the test will reject the null hypothesis when the alternative hypothesis is true (i.e. the probability of not committing a Type II error).

Power and gene length bias: more power to detect longer differentially expressed transcripts. Symbols: t = t statistic; SE = standard error; L = gene length; c = proportionality constant; δ = effect size (power is related to effect size). In the paper they show that δ is still related to L after accounting for gene length.
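One way to see where the length dependence comes from is the following sketch. It rests on assumptions that are not spelled out on the slide: the read count X_j of a gene in condition j is taken to be Poisson with mean c N_j L, where N_j (a symbol introduced here for illustration) is the expression level of the gene in condition j, and both conditions are sequenced to the same depth.

```latex
\begin{aligned}
\mu_j &= E[X_j] = \operatorname{Var}(X_j) = c\,N_j\,L, \qquad j = 1, 2 \\[4pt]
\delta &= \frac{E[X_1 - X_2]}{SE(X_1 - X_2)}
        = \frac{c\,N_1 L - c\,N_2 L}{\sqrt{c\,N_1 L + c\,N_2 L}}
        = \sqrt{c\,L}\;\frac{N_1 - N_2}{\sqrt{N_1 + N_2}}
        \;\propto\; \sqrt{L} \\[4pt]
\delta_{\text{scaled}} &= \frac{E[X_1/L - X_2/L]}{SE(X_1/L - X_2/L)}
        = \frac{c\,N_1 - c\,N_2}{\sqrt{(c\,N_1 + c\,N_2)/L}}
        = \delta
\end{aligned}
```

Under these assumptions the effect size grows with the square root of the gene length, and dividing the counts by L changes numerator and standard error by the same factor, so the dependence on L remains.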

Gene set enrichment analysis and gene length bias. This bias affects Gene Set Analysis (GSA): in GSA we compare sets of transcripts that are potentially of different length. The correction can be made at gene level or at gene set level. For gene length corrections in this context see: Gao et al (2011) Bioinformatics, 27(5), 662 (R package); Young et al (2010) Genome Biology, 11:R14 (GOseq).

Mappability bias

Mappability bias. Uniquely mapping reads are typically summarized over genomic regions; e.g., regions with lower sequence complexity will tend to end up with lower sequence coverage. Regions with higher/lower mappability may give spurious results in downstream analysis. Test: generate all 32 nt fragments from hg18 and align them back to hg18 (32 nt corresponds to trimmed Illumina reads). Each fragment that cannot be uniquely aligned is unmappable, and its first position is considered an unmappable position. Schwartz et al (2011) PLoS One, 6(1), e16685
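The test can be mimicked on a toy sequence. This sketch deviates from the real procedure in ways that should be kept in mind: it uses exact string matching instead of an aligner, looks at one strand only, and uses a made-up 13 nt "genome" with 4 nt fragments instead of hg18 with 32 nt fragments.

```python
from collections import Counter


def unmappable_positions(genome, k):
    """Start positions of k-mers that occur more than once in `genome`."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    occurrences = Counter(kmers)
    return [i for i, kmer in enumerate(kmers) if occurrences[kmer] > 1]


toy_genome = "ACGTACGTTTGCA"
positions = unmappable_positions(toy_genome, k=4)
print(positions)
```

In the toy sequence the fragment ACGT occurs at positions 0 and 4, so both start positions are flagged as unmappable; every other 4-mer is unique.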

Result of test Unexpected because introns are assumed to have lower sequence complexity in general Since in RNA-seq we align reads prior to further analysis, this step may already introduce a (slight) bias.

Mappability: dependency on transcript length Reads of the same length but corresponding to longer transcripts have a higher mappability

Mappability: evolutionary conservation and expression level Note: expression level in lung fibroblasts

Sequence-specific bias 1

Where do you expect to find reads? mRNA AAAAAAAAAAAAA

Where do you expect to find reads? mRNA Sequence reads AAAAAAAAAAAAA. Of course: uniformly distributed over the transcripts.

RNA-Seq protocol Current sequencers require that cDNA molecules represent partial fragments of the RNA cDNA fragments are obtained by a series of steps (e.g., priming, fragmentation, size selection) Some of these steps are inherently random Therefore, we expect fragments with starting points approximately uniformly distributed over transcript.

Roberts et al (2011). Genome biology, 12(3), R22.

Biases in Illumina RNA-seq data caused by hexamer priming. Generation of double-stranded complementary DNA (dscDNA) involves priming by random hexamers, to generate reads across the entire transcript length. This turns out to give a bias in the nucleotide composition at the start (5'-end) of sequencing reads. This bias influences the uniformity of coverage along the transcript, and these spatial biases hinder comparisons between genomic regions. Hansen et al (2010) NAR, 38(12), e131; but also see: Li et al (2010) Genome Biology, 11(5), R50; Schwartz et al (2011) PLoS One, 6(1), e16685; Roberts et al (2011) Genome Biology, 12:R22

[Figure: a read (35 nt) aligned to transcript A; position 1 in the read corresponds to position x in the transcript.] Determine the nucleotide frequencies (A, C, G, T) at each position in the read (1, 2, 3, 4, 5, ..., 35), considering all reads. What do we expect? We would expect that the frequencies for these nucleotides at the different positions are about equal (25%).
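Tabulating nucleotide frequencies per read position takes only a few lines. A minimal sketch with invented 4 nt reads (real reads would be 35 nt); the function name is hypothetical and every read is assumed to be at least `read_length` long.

```python
from collections import Counter


def position_frequencies(reads, read_length):
    """Frequency of A, C, G and T at each position (0-based) over all reads."""
    table = []
    for pos in range(read_length):
        counts = Counter(read[pos] for read in reads)
        total = sum(counts.values())
        table.append({nt: counts[nt] / total for nt in "ACGT"})
    return table


toy_reads = ["ACGT", "AAGT", "ACGA", "TCGT"]
for pos, freqs in enumerate(position_frequencies(toy_reads, 4), start=1):
    print(pos, freqs)
```

With unbiased priming one would expect each row of such a table to be close to the overall nucleotide composition of the transcriptome, regardless of the position in the read.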

Bias (first hexamer = shaded). Frequencies slightly deviate between experiments but show identical relative behavior. The distribution after position 13 reflects the nucleotide composition of the transcriptome. The effect is independent of study, laboratory, and organism. Apparently, hexamer priming is not completely random. These patterns do not reflect sequencing error and cannot be removed by 5' trimming! Figure 1. Nucleotide frequencies versus position for stringently mapped reads. For each experiment, mapped reads were extended upstream of the 5'-start position, such that the first position of the actual read is 1 and positions 0 to -20 are obtained from the genome. The first hexamer of the read is shaded. (RNA-Seq experiments conducted using priming with random hexamers.) http://www.bio.davidson.edu/courses/genomics/method/randompriming.html Random Hexamer Primer is a mixture of single-stranded random hexanucleotides with 5'- and 3'-hydroxyl ends.

Re-weighting scheme 1. Aim: adjust the biased nucleotide frequencies at the beginning of the reads to make them similar to the distribution at the end of the reads (which is assumed to be representative for the transcriptome). Approach: associate a weight with each read, such that reads starting with a heptamer that is over- or under-represented are down- or up-weighted; determine the expression level of a region by multiplying the counts with the weights.

Re-weighting scheme 2 (assume reads of at least 35 nt). For each heptamer h: $w(h) = \dfrac{\tfrac{1}{6}\sum_{i=24}^{29} \hat{p}_{h:i}}{\tfrac{1}{2}\sum_{i=1}^{2} \hat{p}_{h:i}}$, where $\hat{p}_{h:i}$ is the proportion of reads with heptamer h starting at position i. The denominator uses the heptamers at positions i = 1, 2 of the reads, the numerator the heptamers at positions i = 24..29 of the reads. Weights are determined over all possible 4^7 = 16,384 heptamers. p(h) = proportion of reads that start with heptamer h. [Figure: log(w).]
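A sketch of the re-weighting idea in Python. It assumes the weight of a heptamer is its average proportion at reference positions in the middle of the reads (24 to 29) divided by its average proportion at the start (positions 1 and 2), following the description of Hansen et al (2010); the function names are hypothetical, and the toy demo uses 8 nt reads with shortened position sets because real 35 nt reads would not fit here.

```python
from collections import Counter


def heptamer_weights(reads, start_positions=(1, 2), reference_positions=range(24, 30)):
    """Weight per heptamer: mean proportion at reference positions divided by
    mean proportion at start positions (positions are 1-based)."""

    def mean_proportions(positions):
        total = Counter()
        for pos in positions:
            # Heptamer starting at 1-based position pos covers pos .. pos + 6.
            counts = Counter(r[pos - 1:pos + 6] for r in reads if len(r) >= pos + 6)
            n = sum(counts.values())
            for heptamer, c in counts.items():
                total[heptamer] += c / n / len(positions)
        return total

    start = mean_proportions(start_positions)
    reference = mean_proportions(reference_positions)
    return {h: reference[h] / p for h, p in start.items()}


def weighted_count(reads, weights):
    """Adjusted count of a region: sum of the weights of its reads (by first heptamer)."""
    return sum(weights.get(r[:7], 1.0) for r in reads)


# Toy demo with 8 nt reads: position 1 as 'start', position 2 as 'reference'.
toy_reads = ["AAAAAAAC", "CAAAAAAA"]
toy_weights = heptamer_weights(toy_reads, start_positions=(1,), reference_positions=(2,))
print(toy_weights)
print(weighted_count(toy_reads, toy_weights))
```

A weight above 1 means the heptamer is under-represented at the start of reads and its reads are up-weighted; a weight below 1 means the opposite.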

Application of re-weighting scheme E.g., indicates that this heptamer (TTGGTCG) was under-represented, thus count is up-weighted

Example: gene YOL086C in yeast for the WT experiment. [Figure labels: unmappable bases; base-level counts at each position.] Extreme expression values are removed, but coverage is still far from uniform. Figure 4. Evaluation of the reweighting scheme. Unadjusted and re-weighted base-level counts for reads from the WT experiment mapped to the sense strand of a 1-kb coding region in S. cerevisiae (YOL086C). The gray bars near the x-axis indicate unmappable genomic locations. Stranded coverage plots were made by adding the weights of reads associated with each base in each stranded ROCE (weights of one for unadjusted counts). For such a standard coverage plot, each position of the read is assigned the same weight. ROCE = basically the coding region.

Sequence-specific bias 2

Roberts et al (2011) Genome Biology, 12:R22. Bias inference is non-trivial due to the fact that fragment abundances are proportional to transcript abundances. The expression levels of transcripts from which fragments originate must be taken into account when estimating bias (see next figure). At the same time, expression estimates made without correcting for bias may lead to the over- or under-representation of fragments. The problems of bias estimation and expression estimation are fundamentally linked, and must be solved together. Likelihood-based approaches are well suited to resolving this difficulty: bias and abundance parameters can be estimated jointly by maximizing a likelihood function for the data.

Explanation of next figure. (A) Sequence logos showing the distribution of nucleotides in a 23 bp window surrounding the ends of fragments from an experiment primed with hexamers. The 3'-end sequences are complemented. Counts were taken only from transcripts mapping to single-isoform genes. (B) Sequence logo showing normalized nucleotide frequencies after reweighting by initial (not bias corrected) FPKM in order to account for differences in abundance. (C) The background distribution for the yeast transcriptome, assuming uniform expression of all single-isoform genes. The differences between the 5' and 3' distributions are due to the ends being primed from opposite strands. Comparing (C) to (A) and (B) shows that while the bias is confounded with expression in (A), the abundance normalization reveals the true bias to extend from 5 bp upstream to 5 bp downstream of the fragment end. Taking the ratio of the normalized nucleotide frequencies (B) to the background (C) for the NNSR dataset gives bias weights (D), which further reveal that the bias is partially due to selection for upstream sequences similar to the strand tags, namely TCCGATCTCT in first-strand synthesis (which selects the 5' end) and TCCGATCTGA in second-strand synthesis (which selects the 3' end). Roberts et al (2011) Genome Biology, 12:R22

[Figure: sequence logos for the yeast transcriptome, primed with hexamers. Panels: raw counts; FPKM correction; background distribution. Positions internal to the fragment are marked.]

Raw counts: bias in nucleotide usage in reads. This bias is caused by a mixture of highly expressed transcripts and positional bias.

FPKM correction: after FPKM normalisation the logo better reflects the positional bias.

Bias weights for A, T, C, G.

Bias correction: use of a statistical model that takes expression and nucleotide bias into account. Top panel: non-uniform coverage of raw read counts along the transcript. Bottom panel: bias weights, used to correct the raw read counts; large weights correspond to positions with a high count. Figure 3. Bias correction within transcripts. An example showing the effect of bias correction on the read counts for human transcript NM_004684. The top panel shows raw read counts (number of 3' ends of fragments at each location), and the bottom panel shows the product of the bias parameters (total bias weight defined in the Supplementary methods in Additional file 3) at the same locations. We correctly identify bias at different positions and can therefore correct for the non-uniformity. Note that the bias parameters were learned from the entire dataset excluding reads mapped to this transcript in order to cross-validate our results. The RNA-Seq for the experiment was performed with the NSR protocol, which is why 3' counts were used instead of 5'.

Use of spike-in standards

Synthetic spike-in standards. External RNA Control Consortium (ERCC) RNA standards: a range of GC content and length; minimal sequence homology with endogenous transcripts from sequenced eukaryotes. FPKM normalization. The ERCC consortium synthesized RNAs by in vitro transcription of de novo DNA sequences or of DNA derived from the B. subtilis or the deep-sea vent microbe M. jannaschii genomes. Jiang (2011) Synthetic spike-in standards for RNA-seq experiments. Genome Research

Results suggest systematic bias: there is better agreement between the observed read counts from replicates than between the observed read counts and the expected concentration of ERCCs within a given library. The replicates show that RNA-seq sequencing is very reproducible; we have (slightly) more variation when comparing known spike-in concentrations with read counts. Pool 14: mixture of ERCC RNAs present in different concentrations. 44 human RNA libraries to which 2% of pool 14 was added. [Figure panels: count versus concentration*length (mass) per ERCC, for the pool of 44 H. sapiens libraries with 2% ERCC spike-in; read counts for each ERCC transcript in two different libraries of human RNA-seq with 2% ERCC spike-ins.]

Transcript-specific sources of error. [Figure: fold deviation between observed and expected read count for each ERCC in the 100% ERCC library.] The fold deviation between observed and expected depends on read count, GC content and transcript length. Transcript-specific biases affect comparisons of read counts between different RNAs in one library.

Read coverage biases: single ERCC RNA
The ERCC RNAs have a single isoform with well-defined ends, which makes them ideal for measuring transcript coverage.
Position effect.

Read coverage biases: 96 ERCC RNAs
Position effect: coverage drops at the 3’-end, suggested to be due to the inherently reduced number of priming positions at the end of the transcript.
Figure: average relative coverage along all control RNAs for ERCC spiked into 44 H. sapiens libraries. Dashed lines represent 1 SD around the average across different libraries.

Read coverage biases: sequence-specific heterogeneity
Over- or under-representation of specific sequences could be due to RNA structure (single- vs double-stranded template regions), preparation of the RNA (e.g., nonrandom hydrolysis) and/or cDNA synthesis (e.g., nonrandomness in “random” hexamer priming).

Read coverage biases: account for sequence-specific bias through statistical models
These models result in more even coverage.

GC bias

GC-bias in DNA-seq
Correlation of the Solexa (now Illumina) read coverage and GC content: an approximately linear relationship.
27-mer reads generated from Beta vulgaris BAC ZR-47B15. Each data point corresponds to the number of reads recorded for a 1-kbp window. This genome is GC poor.
Axes: GC content (%) versus reads per kbp.
Dohm et al. (2008) NAR, 36, e105.

GC-bias in DNA-seq (1)
Benjamini & Speed (2012) Nucleic acids research, 40(10), e72.
Models for GC bias:
Fragmentation model: locally, GC counts could be associated with the stability of DNA and modify the probability of a fragmentation point in the genome.
Read model: GC content primarily modifies the base-sequencing process (the GC content of the read explains the read count).
Full-fragment model: the GC content of the full fragment determines which fragments are selected or amplified.
Global model: GC effects on scales larger than the fragment length (e.g., higher-order DNA structure).
These loosely defined models can be realized statistically by counting the GC in a suitable region and comparing that to fragment coverage. This method is implemented in the R package GCcorrect.

GC-bias in DNA-seq (2)
Conclusions from their study:
The relationship is not linear (compare to Dohm et al. and Jiang et al.) but unimodal.
The dependency between count and GC originates from a biased representation of possible DNA fragments, with both high-GC and high-AT fragments being underrepresented.
PCR is the most important cause of GC bias, not the GC content of the reads.
This dependency is consistent, but its exact shape varies considerably across samples, even matched samples.
They argue that models taking GC content and fragment length into account are also important for RNA-seq.

GC-bias in DNA-seq (3)
Single position models: estimate the 'mean fragment count' (rate) for individual locations rather than bins, and link the fragment count to GC content.

Single Position Model (1)
Mappable positions along the genome are randomly sampled (n ≈ 10 million).
Fragment 5'-end count: the number of fragments with their 5'-end at the sampled positions.
Determine the GC count in the corresponding sliding window, e.g. $W_{0,4}$ (this depends on the genome, not on the read).
The sliding window $W_{a,l}$ is characterized by the offset a and length l.
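
The GC count in a sliding window can be sketched as below. This is an illustration only: the function name is hypothetical, the genome is a plain string with 0-based coordinates, and strand is ignored.

```python
def gc_count(genome, x, a, l):
    """Number of G/C bases in the window of length l starting at x + a.

    Returns None when the window runs off either end of the sequence.
    """
    start = x + a
    if start < 0 or start + l > len(genome):
        return None
    window = genome[start:start + l].upper()
    return window.count("G") + window.count("C")


# Toy sequence: the window with a = 0, l = 4 at position 1 is "CGTG".
print(gc_count("ACGTGGCA", 1, 0, 4))
```

Because the window is read from the reference sequence, the GC count is defined for every sampled position, whether or not a fragment starts there.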

Single Position Model (2)
$S_{gc}$ = stratum of sampled positions x with gc = GC(x+a, l) (x = position, a = shift, l = length).
$N_{gc}$ = number of sampled positions assigned to $S_{gc}$.
$F_{gc}$ = number of fragments starting (5'-end) at the x's in $S_{gc}$.
Estimate the rate by $\hat{\lambda}_{gc} = F_{gc} / N_{gc}$.
Dividing by $N_{gc}$ down-weights the rate if many positions are assigned to a specific stratum (= %GC): a stratum should not get a high rate merely because it contains many positions.
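
The stratified rate estimate (fragments in a stratum divided by positions in that stratum) can be sketched as follows. The function and argument names are hypothetical, the sampled positions are passed in as a list rather than drawn at random, and mappability is not checked.

```python
from collections import Counter


def estimate_rates(genome, positions, five_prime_counts, a, l):
    """Estimate the fragment rate per GC stratum for the window (a, l).

    positions: sampled genome positions (0-based).
    five_prime_counts: dict mapping a position to the number of fragments
    whose 5'-end maps there (missing positions count as 0).
    """
    n_gc = Counter()  # number of sampled positions per stratum
    f_gc = Counter()  # number of fragment 5'-ends per stratum
    for x in positions:
        start = x + a
        if start < 0 or start + l > len(genome):
            continue  # skip windows that run off the sequence
        window = genome[start:start + l].upper()
        gc = window.count("G") + window.count("C")
        n_gc[gc] += 1
        f_gc[gc] += five_prime_counts.get(x, 0)
    rates = {gc: f_gc[gc] / n_gc[gc] for gc in n_gc}
    return rates, dict(n_gc), dict(f_gc)


# Toy example: two GC-rich positions and one AT-rich position.
rates, n_gc, f_gc = estimate_rates("GGCCAATTGGCC", [0, 4, 8], {0: 6, 4: 1, 8: 2}, 0, 4)
print(rates)
```

In the toy example the GC-rich stratum has 8 fragments over 2 positions (rate 4) and the AT-rich stratum has 1 fragment over 1 position (rate 1).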

Single Position Model (3)
Figure: the GC curve (fragment rate versus GC) is unimodal.

Model comparison
An estimated model $W_{a,l}$ (i.e., a choice of GC window) can be used to generate predicted counts for any genomic region.
Models (i.e., different a, l) are compared through TV, the normalized total variation distance (0 ≤ TV ≤ 1): the distance between the stratified rates estimated under $W_{a,l}$ and the uniform rate U.
U is the global mean rate $\hat{\lambda} = F/n$ (n = number of sampled positions, F = total number of mapped fragments).
We look for a high TV: counts are then strongly dependent on GC.
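
The TV score can be sketched as below. The formula is not spelled out in the slide text, so this assumes the normalized form $TV = \frac{1}{2\hat{\lambda}} \sum_{gc} \frac{N_{gc}}{n} |\hat{\lambda}_{gc} - \hat{\lambda}|$, which equals half the summed absolute difference between the share of fragments and the share of positions in each stratum and therefore lies between 0 and 1. The function name and inputs are hypothetical.

```python
def tv_score(n_gc, f_gc):
    """Normalized total variation distance between stratified and uniform rates.

    n_gc: dict stratum -> number of sampled positions.
    f_gc: dict stratum -> number of fragment 5'-ends at those positions.
    """
    n = sum(n_gc.values())
    f = sum(f_gc.get(gc, 0) for gc in n_gc)
    if n == 0 or f == 0:
        raise ValueError("need at least one position and one fragment")
    global_rate = f / n
    total = 0.0
    for gc, positions_in_stratum in n_gc.items():
        rate = f_gc.get(gc, 0) / positions_in_stratum
        total += (positions_in_stratum / n) * abs(rate - global_rate)
    return total / (2 * global_rate)


# Rates of 1 and 4 around a global rate of 3.
print(tv_score({0: 1, 4: 2}, {0: 1, 4: 8}))
# Identical rates in every stratum: no GC dependence.
print(tv_score({0: 2, 4: 2}, {0: 4, 4: 4}))
```

Computing this score for a grid of (a, l) values is one way to search for the window that best explains the counts.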

Fragment length models
Used to measure the effect of fragment length.
Fragment length model = single position model, but counting fragments of length s only: for model $W^{s}_{a,l}$, only fragments of length s starting at the x's in $S_{gc}$ are counted.
Model the count of fragments using the GC in the fragment itself (not in a fixed window of length l). To reduce the impact of local biases, a few base pairs are removed from the ends of the fragments (a at the 5'-end, m at the 3'-end): $W^{s}_{a,s-a-m}$. The GC window grows with fragment length.
In this model the parameters are GC and fragment length.

Predicted rates
For example, for $W_{a=0,l=25}$ and the stratum gc = 5: fragment counts of 5, 4 and 3 reads give F = 12 over N = 4 sampled positions, so the mean fragment rate = 12/4 = 3 (we are just averaging). This is the predicted rate for that %GC.
See the corresponding Excel sheet for an example of how to subsequently correct for GC bias on a position-by-position basis.
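
A position-by-position correction could look roughly like this. The Excel sheet is not reproduced here, so this is only one plausible scheme and an assumption: each observed count is divided by the predicted rate of its GC stratum and rescaled by the global mean rate. All names and the global rate of 2.0 are hypothetical; the stratum rate of 3 comes from the example on the slide.

```python
def correct_counts(observed, gc_of_position, rates, global_rate):
    """Rescale observed 5'-end counts by global_rate / predicted rate.

    Positions whose stratum has no usable rate (missing or zero)
    are returned as None instead of being corrected.
    """
    corrected = {}
    for x, count in observed.items():
        rate = rates.get(gc_of_position[x])
        if not rate:
            corrected[x] = None
        else:
            corrected[x] = count * global_rate / rate
    return corrected


# Stratum gc = 5 has a predicted rate of 12 / 4 = 3 (slide example).
rates = {5: 12 / 4}
print(correct_counts({100: 6, 200: 2}, {100: 5, 200: 7}, rates, 2.0))
```

Positions in over-represented strata are scaled down and positions in under-represented strata are scaled up, so that corrected counts are comparable across GC content.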

Evaluation
The success of a model ($W_{a,l}$) is evaluated by comparing its predictions with the observed fragment counts.
For robust evaluation, use the Mean Absolute Deviation (MAD).
B is the set of bins (i.e., non-overlapping windows on the genome); $F_b$ is the count of fragments whose 5'-end is inside bin b.
Normalization: CN = copy number; e = 0.1 accounts for small counts.
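
The evaluation step can be sketched as follows. This is a simplified illustration: the predicted count of a bin is assumed to be the sum of the stratum rates over the positions in that bin, and the copy-number normalization with the small constant e mentioned on the slide is left out because its exact form is not given in the text. Function names are hypothetical.

```python
def predicted_bin_count(gc_per_position, rates):
    """Sum the predicted rate of every position in a bin.

    gc_per_position: GC count of the model window for each position in the bin.
    rates: dict stratum -> estimated fragment rate (unknown strata count as 0).
    """
    return sum(rates.get(gc, 0.0) for gc in gc_per_position)


def mean_absolute_deviation(observed, predicted):
    """Mean of |observed - predicted| over all bins (no normalization)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


rates = {0: 1.0, 4: 4.0}
print(predicted_bin_count([0, 4, 4], rates))
print(mean_absolute_deviation([10, 5], [9.0, 7.0]))
```

A lower value means the chosen GC window predicts the binned counts better.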

Results (1) – Bin counts
Two different libraries from the same starting DNA, 10 kb bins.
Unimodal relation. Both libraries show the same trend, but the curves are not aligned. This makes a case for single-sample normalization.
Figure 2. GC curves (10 kb bins). Observed fragment counts and loess lines plotted against GC of two libraries from the same normal sample. Counts and curves of all libraries are scaled to fit median counts of normal library 1. Bins were randomly sampled from chromosome 1, and counts include fragments from both strands.

Results (2) – Single position models
Compare different GC windows through TV, with a = 0 and different lengths l.
Expectation: the strongest effect occurs after a few bp (fragmentation effect), after 30-75 bp (read effect), or at the fragment length (full-fragment effect).

The strongest effect (TV score) coincides with the fragment length ($W_{0,180}$ and $W_{0,295}$).
The corresponding GC curve is very sharp.
Figure 3. Single position models. (A) The top curves represent TV scores for GC windows of different lengths, all beginning at 0 (a = 0). The horizontal bars on the bottom mark the median fragment lengths (and 0.05, 0.95 quantiles). For each library, the strongest GC windows are those that encompass the full fragment. For library 1, we mark the optimal model ($W_{0,180}$), and show its resulting GC curve on the right panel (B). (We actually show $W_{2,176}$, removing 2 bp from each side of the fragment.) The GC curve measures the fragment rate given the fraction of GC in the window. Vertical lines (blue) represent 1 SD. For comparison, we plot the distribution of GC (dotted line) in our sample from chromosome 1 (scaled).

Results (3) – Single position models
Smaller scales (l = 50 bp) allow a GC window that overlaps with the read to be compared with one that does not: $W_{0,50}$ versus $W_{75,50}$.
The effect of the 'read' model is not as large as that of the 'fragment center' model. This may imply that the bias is not driven by base calling or sequencing effects but by the composition of the full fragment.
Figure 4. Different lags. (A) GC curve of the window before the fragment, $W_{-50,50}$; (B) within the read, $W_{0,50}$; and (C) in the fragment center, not overlapping the read, $W_{75,50}$. (D) A plot of TV scores for 50 bp sliding windows ($W_{a,50}$). The x-axis marks a, the location of the window 5′-end relative to the 5′-end of the fragment. On the bottom, we mark a fragment and its reads in relation to the GC windows from the top panels.

Results (4) – Results of fragment length
The length of the fragment influences the shape of the GC curve: there is an interaction between GC and length.
Long fragments tend to have a higher GC count.

GC-content bias
Fragment rate by length and GC. (A) A heat map describes rates for each (GC, length) pair. Each dotted line represents a single length. In (B), GC curves for fragments of specific lengths are drawn [corresponding to the dotted lines in (A)]. Blue / dark curves represent shorter fragments than red / bright. Here the x-axis is the fraction of GC.

Results (5) – Fragmentation effect
Sequence-specific bias (as also observed in RNA-seq experiments).
Figure 6. Fragmentation effect. Left: relative abundance of nucleotides at fixed positions relative to the fragment 5′-end. A horizontal dotted line marks the relative abundance of the base at mappable positions. Right: fragment rates when stratifying by the dinucleotide (1,0). Dinucleotide counts overlapping the fragment 5′-end.

Results (6) – Read count correction
The authors conclude that the fragment model best explains the counts (see Figure 7 in the paper).
However, this is without including fragment length (thus $W_{a,l}$ instead of $W_{a,s}$).

Some other stuff.......

Technical bias: cDNA library preparation
Normalized read coverage with respect to the relative transcript position, grouped into five transcript length bins, for the C. elegans SRX001872 data set.
Bohnert et al., Nucleic Acids Res (2010).

Technical bias: cDNA library preparation
RNA fragmentation and cDNA fragmentation compared. Fragmentation of oligo-dT primed cDNA (blue line) is more biased towards the 3′ end of the transcript. RNA fragmentation (red line) provides more even coverage along the gene body, but is relatively depleted for both the 5′ and 3′ ends.
Wang et al. (2009) Nature reviews. Genetics, 10(1), 57–63.

RNAseq may detect other RNA species (1) Tarazona et al (2011) Genome research, 21(12), 2213–23

RNAseq may detect other RNA species (2)
Figure: median transcript length in Brain and UHR samples (MAQC) versus read depth, for the categories protein coding, pseudogene, processed transcript and lincRNA.
Tarazona et al (2011) Genome research, 21(12), 2213–23

Incorrect base quality values
Work of DePristo et al. in the context of genotyping (exome/genome sequencing).
Figure 3. Raw (pink) and recalibrated (blue) base quality scores for NGS paired-end read set of Life/SOLiD. The top panel shows reported base quality scores compared to the empirical estimates; the middle panel shows the difference between the average reported and empirical quality score for each machine cycle, with positive and negative cycle values given for the first and second read in the pair, respectively; and the bottom panel shows the difference between reported and empirical quality scores for each of the 16 genomic dinucleotide contexts. For example, the AG context occurs at all sites in a read where G is the current nucleotide and A is the preceding one in the read. Root-mean-square errors (RMSE) are given for the pre- and post-recalibration curves.
DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philippakis, A. A., et al. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics, 43(5), 491–8.
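
The comparison between reported and empirical quality scores rests on the Phred scale, Q = -10 log10(error probability). A minimal sketch of that comparison, with hypothetical function names and example numbers, and without the covariate modelling used in the actual recalibration procedure:

```python
import math


def empirical_quality(mismatches, bases):
    """Phred-scaled empirical quality from observed mismatches.

    Returns infinity when no mismatch was observed.
    """
    if bases <= 0:
        raise ValueError("bases must be positive")
    if mismatches == 0:
        return float("inf")
    return -10.0 * math.log10(mismatches / bases)


def quality_shift(reported_q, mismatches, bases):
    """Reported minus empirical quality; positive means overconfident calls."""
    return reported_q - empirical_quality(mismatches, bases)


# Hypothetical group of bases reported as Q35 with 10 mismatches in 10,000 bases.
print(empirical_quality(10, 10_000))
print(quality_shift(35, 10, 10_000))
```

Grouping bases by machine cycle or dinucleotide context before computing the shift gives the kind of curves described in the middle and bottom panels.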

And more....
Jones, D. C., Ruzzo, W. L., Peng, X., & Katze, M. G. (2012). A new approach to bias correction in RNA-Seq. Bioinformatics (Oxford, England), 28(7), 921–8.
Risso, D., Schwartz, K., Sherlock, G., & Dudoit, S. (2011). GC-content normalization for RNA-Seq data. BMC bioinformatics, 12(1), 480.
Sendler, E., Johnson, G. D., & Krawetz, S. A. (2011). Local and global factors affecting RNA sequencing analysis. Analytical biochemistry, 419(2), 317–22.
Vijaya Satya, R., Zavaljevski, N., & Reifman, J. (2012). A new strategy to reduce allelic bias in RNA-Seq readmapping. Nucleic acids research, 40(16), e127.
Zheng, W., Chung, L. M., & Zhao, H. (2011). Bias detection and correction in RNA-Sequencing data. BMC bioinformatics, 12, 290.
Tools
Wang, L., Wang, S., & Li, W. (2012). RSeQC: quality control of RNA-seq experiments. Bioinformatics (Oxford, England), 28(16), 2184–5.
DeLuca, D. S., Levin, J. Z., Sivachenko, A., Fennell, T., Nazaire, M.-D., Williams, C., Reich, M., et al. (2012). RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics (Oxford, England), 28(11), 1530–2.
Picard tools

Benjamini, Y., & Speed, T. P. (2012). Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic acids research, 40(10), e72. doi:10.1093/nar/gks001
Bohnert, R., & Rätsch, G. (2010). rQuant.web: a tool for RNA-Seq-based transcript quantitation. Nucleic acids research, 38(Web Server issue), W348–51. doi:10.1093/nar/gkq448
Bullard, J. H., Purdom, E., Hansen, K. D., & Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC bioinformatics, 11, 94. doi:10.1186/1471-2105-11-94
DeLuca, D. S., Levin, J. Z., Sivachenko, A., Fennell, T., Nazaire, M.-D., Williams, C., Reich, M., et al. (2012). RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics (Oxford, England), 28(11), 1530–2. doi:10.1093/bioinformatics/bts196
Dohm, J. C., Lottaz, C., Borodina, T., & Himmelbauer, H. (2008). Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic acids research, 36(16), e105. doi:10.1093/nar/gkn425
Gao, L., Fang, Z., Zhang, K., Zhi, D., & Cui, X. (2011). Length bias correction for RNA-seq data in gene set analyses. Bioinformatics (Oxford, England), 27(5), 662–9. doi:10.1093/bioinformatics/btr005
Hansen, K. D., Brenner, S. E., & Dudoit, S. (2010). Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic acids research, 38(12), e131. doi:10.1093/nar/gkq224
Hansen, K. D., Irizarry, R. A., & Wu, Z. (2012). Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics (Oxford, England), 13(2), 204–16. doi:10.1093/biostatistics/kxr054
Jiang, L., Schlesinger, F., Davis, C. A., Zhang, Y., Li, R., Salit, M., Gingeras, T. R., et al. (2011). Synthetic spike-in standards for RNA-seq experiments. Genome research, 21(9), 1543–51. doi:10.1101/gr.121095.111
Jones, D. C., Ruzzo, W. L., Peng, X., & Katze, M. G. (2012). A new approach to bias correction in RNA-Seq. Bioinformatics (Oxford, England), 28(7), 921–8. doi:10.1093/bioinformatics/bts055
Malone, J. H., & Oliver, B. (2011). Microarrays, deep sequencing and the true measure of the transcriptome. BMC biology, 9, 34. doi:10.1186/1741-7007-9-34
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621–8. doi:10.1038/nmeth.1226
Oshlack, A., & Wakefield, M. J. (2009). Transcript length bias in RNA-seq data confounds systems biology. Biology direct, 4, 14. doi:10.1186/1745-6150-4-14
Risso, D., Schwartz, K., Sherlock, G., & Dudoit, S. (2011). GC-content normalization for RNA-Seq data. BMC bioinformatics, 12(1), 480. doi:10.1186/1471-2105-12-480
Roberts, A., Trapnell, C., Donaghey, J., Rinn, J. L., & Pachter, L. (2011). Improving RNA-Seq expression estimates by correcting for fragment bias. Genome biology, 12(3), R22. doi:10.1186/gb-2011-12-3-r22
Robinson, M. D., & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), R25. doi:10.1186/gb-2010-11-3-r25
Schwartz, S., Oren, R., & Ast, G. (2011). Detection and removal of biases in the analysis of next-generation sequencing reads. PloS one, 6(1), e16685. doi:10.1371/journal.pone.0016685
Sendler, E., Johnson, G. D., & Krawetz, S. A. (2011). Local and global factors affecting RNA sequencing analysis. Analytical biochemistry, 419(2), 317–22. doi:10.1016/j.ab.2011.08.013
Tarazona, S., García-Alcalde, F., Dopazo, J., Ferrer, A., & Conesa, A. (2011). Differential expression in RNA-seq: a matter of depth. Genome research, 21(12), 2213–23. doi:10.1101/gr.124321.111
Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology, 28(5), 511–5. doi:10.1038/nbt.1621
Vijaya Satya, R., Zavaljevski, N., & Reifman, J. (2012). A new strategy to reduce allelic bias in RNA-Seq readmapping. Nucleic acids research, 40(16), e127. doi:10.1093/nar/gks425
Wang, L., Wang, S., & Li, W. (2012). RSeQC: quality control of RNA-seq experiments. Bioinformatics (Oxford, England), 28(16), 2184–5. doi:10.1093/bioinformatics/bts356
Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews. Genetics, 10(1), 57–63. doi:10.1038/nrg2484
Young, M. D., Wakefield, M. J., Smyth, G. K., & Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome biology, 11(2), R14. doi:10.1186/gb-2010-11-2-r14
Zheng, W., Chung, L. M., & Zhao, H. (2011). Bias detection and correction in RNA-Sequencing data. BMC bioinformatics, 12, 290. doi:10.1186/1471-2105-12-290

To conclude
Be aware of the different types of bias.
Try to avoid them.
Try to detect them.
Try to correct them.

Questions?