Introduction to RNA-seq

Joel Parker, Ph.D.

Why mRNAseq? Measurement of differential expression
There are at least four compelling reasons for choosing mRNA-seq instead of microarray-based technologies:
- Specificity of what is being measured
- Reduced technical (batch) bias
- Increased dynamic range and log ratio (FC) estimates
- More sensitive detection of genes, transcripts, and differential expression
Other reasons:
- Detection of expressed SNVs
- Detection of fusions and other structural variations
- No transcriptome definition is needed
- No probes need to be designed or manufactured
- Cost (will soon be equivalent on a per-assay basis with microarray)

Why mRNAseq? – Reduced Bias
Cell types separate biologically.
[Figure labels: CD19, CD8, CD14, CD4]

Why mRNAseq? – Reduced Processing Bias
Client’s miRNAseq samples were sequenced on 4 different machines at 2 different sites at different times over several months, with no apparent bias in the top principal components.
[Figure labels: GAIIx, HS-01, HS-02, HS-IL]

Library preparation
mRNA: RNA capture; enrichment via hybridization
Total RNA: depletion of rRNA via hybridization; blood, MT, etc.

Sequencing parameters
Read length. Figure source: Trapnell et al., Nature Biotechnology 31, 46–53 (2013).
Precision = PPV (positive predictive value); Recall = Sensitivity
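The two metrics named on this slide can be written out directly. This is a minimal sketch with hypothetical counts of true positives, false positives, and false negatives; the numbers are illustrative and are not taken from Trapnell et al.

```python
def precision(tp: int, fp: int) -> float:
    """Positive predictive value: fraction of reported items that are correct."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Sensitivity: fraction of true items that were reported."""
    return tp / (tp + fn)


# Hypothetical example: 80 transcripts recovered correctly, 20 spurious, 40 missed.
tp, fp, fn = 80, 20, 40
print(f"precision (PPV)      = {precision(tp, fp):.2f}")
print(f"recall (sensitivity) = {recall(tp, fn):.2f}")
```

A method can raise one metric at the expense of the other, which is why the two are reported together.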

Detection is Dependent on Depth
PMID:

Liu et al., Bioinformatics (2014) 30 (3): 301-304.

Computational Processing
Technical variation (batch effects) from library preparation and sequencing is small, and the sequencing strategy, especially depth, directs the level of repeatability and detection.
The raw results of sequencing require significant computational processing:
- Alignment: maximizing unambiguous alignments; alignment of reads that cross exon junctions. Ex: Bowtie, BWA, TopHat, MapSplice, STAR, ...
- Abundance estimation: gene or transcript; handling alignments that are ambiguous in the transcriptome. Ex: Sailfish, RSEM, Cufflinks, MISO, Salmon, IsoEM, IsoInfer, Rseq, ...
- Normalization of read counts: minimizing bias due to variation in the number of clusters available. Ex: total count (RPM), upper quartile, quantile, density
Different algorithmic and computational strategies, and the reference genome and transcriptome definition, impact performance much more than SE vs. PE or 50 bp vs. 100 bp.

Alignment
[Figure: alignment to the transcriptome. Labels: BWA, Bowtie; Trinity, Trans-Abyss; Transcriptome; Count alignments]

Example Concordant Gene
[Figure labels: V2, V1]

Example Discordant Gene 1
[Figure labels: V2, V1]

Example Discordant Gene 2
[Figure labels: V2, V1]

Alignment
TopHat, MapSplice, STAR; Trinity, Trans-Abyss

Alignment Comparison
Engstrom et al., Nature Methods 10, 1185-1191 (2013)

Alignment Comparison: Splice Junction Accuracy
Engstrom et al., Nature Methods 10, 1185-1191 (2013)

Multireads: Reads Mapping to Multiple Genes/Transcripts
HTSeq << PMID: Wang X, Wu Z, Zhang X. Isoform abundance inference provides a more accurate estimation of gene expression levels in RNA-seq. J Bioinform Comput Biol Dec;8 Suppl 1: PubMed PMID:

Multireads: Reads Mapping to Multiple Genes/Transcripts
[Figure: three genes (1 Long, 2 Medium, 3 Short) covered by unique reads and multireads. Values shown: 200, 350, 150, 100, 300, 50, 200]
Relative abundance for these genes: f1, f2, f3

Approach 1: Ignore Multireads
[Figure: three genes (1 Long, 2 Medium, 3 Short). Values shown: 200, 350, 150, 100, 300, 50, 200]
Relative abundance for these genes: f1, f2, f3
Nagalakshmi et al., Science 2008; Marioni et al., Genome Research 2008

Approach 1: Ignore Multireads
[Figure: three genes (1 Long, 2 Medium, 3 Short). Values shown: 200, 350, 150, 100, 300, 50, 200]
- Over-estimates the abundance of genes with unique reads
- Under-estimates the abundance of genes with multireads
- Not an option at all if interested in isoform expression

Approach 2: Allocate Fraction of Multireads Using Estimates From Uniques
[Figure: three genes (1 Long, 2 Medium, 3 Short). Values shown: 200, 350, 150, 100, 300, 50, 200]
Relative abundance for these genes: f1, f2, f3
Ali Mortazavi et al., Nature Methods 2008
Sailfish, RSEM, Cufflinks
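One way to picture this approach is to split each group of multireads among its candidate genes in proportion to coverage estimated from unique reads alone. The sketch below is an illustration under stated assumptions, not the procedure of any cited tool: the function name, the gene lengths, and all counts are hypothetical and are not the values from the slide figure.

```python
def allocate_multireads(unique, lengths, multi):
    """Split multireads among genes in proportion to unique-read density.

    unique  -- dict gene -> count of uniquely mapped reads
    lengths -- dict gene -> length of the uniquely mappable region (bp)
    multi   -- list of (genes, count): 'count' reads map to every gene in 'genes'
    """
    # Coverage estimated from unique reads only (reads per base).
    density = {g: unique[g] / lengths[g] for g in unique}
    totals = {g: float(unique[g]) for g in unique}
    for genes, count in multi:
        weight = sum(density[g] for g in genes)
        for g in genes:
            # Fall back to an equal split if no candidate gene has unique coverage.
            share = density[g] / weight if weight > 0 else 1 / len(genes)
            totals[g] += count * share
    return totals


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
unique = {"long": 300, "medium": 100, "short": 50}
lengths = {"long": 3000, "medium": 2000, "short": 1000}
multi = [(("long", "medium"), 90), (("medium", "short"), 40)]

totals = allocate_multireads(unique, lengths, multi)
for gene, n in totals.items():
    print(gene, round(n, 1))
```

Every read is counted exactly once, so the allocated totals add up to the number of unique reads plus multireads.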

Cufflinks PMID:

RSEM
Li and Dewey, 2011. PMID:
The model consists of N sets of random variables, one per sequenced RNA-Seq fragment. For fragment n, its parent transcript, length, start position, and orientation are represented by the latent variables Gn, Fn, Sn, and On, respectively. For PE data, the observed variables (shaded circles) are the read lengths, quality scores, and sequences of both reads in the pair. For SE data, the length, quality scores, and sequence of the second read are unobserved. The primary parameters of the model are given by the vector θ, which represents the prior probabilities of a fragment being derived from each transcript; θi represents the probability that a fragment is derived from transcript i.
Figure panels: A) PE isoform; B) PE gene; C) SE isoform; D) SE gene
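The role of θ can be illustrated with a heavily simplified expectation-maximization loop. This is a sketch and not RSEM itself: it assumes that every compatible alignment is equally likely, so read lengths, quality scores, start positions, and orientations are ignored, and the function name and fragment counts are hypothetical.

```python
def em_theta(compat, n_transcripts, n_iter=200):
    """Estimate theta[i], the probability that a fragment comes from transcript i.

    compat -- list of tuples; compat[n] holds the transcripts fragment n aligns to.
    Simplification: all compatible alignments of a fragment are equally likely.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        for hits in compat:
            # E-step: split the fragment among its candidate transcripts.
            norm = sum(theta[i] for i in hits)
            for i in hits:
                expected[i] += theta[i] / norm
        # M-step: theta is the expected fraction of fragments per transcript.
        theta = [e / len(compat) for e in expected]
    return theta


# Hypothetical fragments: 6 unique to transcript 0, 2 unique to transcript 1,
# and 4 that align to both.
compat = [(0,)] * 6 + [(1,)] * 2 + [(0, 1)] * 4
theta = em_theta(compat, n_transcripts=2)
print([round(t, 3) for t in theta])
```

In this toy case the ambiguous fragments end up divided 3:1, matching the ratio of the unique evidence, rather than being discarded or split evenly.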

Salmon
Novelties:
- Streaming variational Bayes (VB) inference combined with batched VB or EM
- Lightweight alignment through maximal exact matches
- Transcript / gene abundance inference is abstracted from the alignment step [RSEM also permits this; sam-xlate in

Repeatability & Detection by Isoform Database
[Figure legend, partly garbled: Ensemb 37612, RefSeq 77608, eaGene]
Larger reference transcriptomes result in reduced repeatability (left) but increased detection (right).
Detection: 73% of RefSeq, 66% of UCSC, and 52% of Ensembl

Normalization
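Two of the normalization schemes named in the processing overview, total count (RPM) and upper quartile, can be sketched in a few lines. The function names and the sample counts are hypothetical, and taking the upper quartile over the non-zero counts of a single sample is a simplifying assumption made for this example.

```python
import statistics


def rpm(counts):
    """Total-count normalization: reads per million mapped reads."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]


def upper_quartile_scale(counts):
    """Divide each count by the 75th percentile of the sample's non-zero counts."""
    nonzero = sorted(c for c in counts if c > 0)
    q3 = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return [c / q3 for c in counts]


# Hypothetical sample in which one highly expressed gene dominates the total.
sample = [0, 10, 20, 30, 40, 1000]
print([round(v, 1) for v in rpm(sample)])
print(upper_quartile_scale(sample))
```

The dominant gene takes up most of the RPM total and so compresses every other value, while the upper-quartile scale factor is unaffected by it.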

Differential Expression
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010 Feb 18;11:94.

Statistical power
Why not use a simple t-test on log normalized counts?
[Figure panels: DESeq2, t-test]

Dispersion parameter
K_{ij} \sim \text{NB}(s_{ij} q_{ij}, \alpha_i)
\text{Var}(K_{ij}) = \mu_{ij} + \alpha_i \mu_{ij}^2, where \mu_{ij} = s_{ij} q_{ij}
K_{ij}: raw count for gene i, sample j
s_{ij}: normalization factor
q_{ij}: quantity of interest
\alpha_i: dispersion, one per gene
The dispersion is constant for a gene, but the variance still depends on the mean value.
[Figure: random variation with increasing mean for a NB with constant dispersion parameter]
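The variance formula can be evaluated for a range of means to see how a constant dispersion behaves. A minimal sketch; the dispersion value 0.1 and the function name are hypothetical choices for illustration.

```python
import math


def nb_variance(mu: float, alpha: float) -> float:
    """Variance of a negative binomial count with mean mu and dispersion alpha."""
    return mu + alpha * mu ** 2


alpha = 0.1  # hypothetical dispersion, held constant across means
for mu in (1, 10, 100, 1000, 10000):
    var = nb_variance(mu, alpha)
    cv = math.sqrt(var) / mu  # coefficient of variation
    print(f"mean={mu:>6}  variance={var:>12.1f}  CV={cv:.3f}")
```

Since CV^2 = 1/mu + alpha, the Poisson term dominates at low counts, and at high counts the coefficient of variation levels off near sqrt(alpha).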

Differences across condition
K_{ij} \sim \text{NB}(s_{ij} q_{ij}, \alpha_i)
\log_2 q_{ij} = \sum_r x_{jr} \beta_{ir}
r indexes coefficients: r=1 intercept, r=2 condition B vs A, etc.
x_{j\cdot} = [1,0,...] if sample j is in A, while x_{j\cdot} = [1,1,...] if sample j is in B
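A short worked example of the log-linear model for a single gene, using two coefficients (intercept and condition B vs A). The coefficient values and the function name are hypothetical.

```python
def expected_q(x_row, beta):
    """q_ij = 2 ** (sum_r x_jr * beta_ir) for one gene and one sample."""
    log2_q = sum(x * b for x, b in zip(x_row, beta))
    return 2.0 ** log2_q


# Hypothetical coefficients for one gene: intercept and condition B vs A.
beta = [5.0, 1.5]
x_a = [1, 0]  # design row for a sample in condition A
x_b = [1, 1]  # design row for a sample in condition B

q_a = expected_q(x_a, beta)
q_b = expected_q(x_b, beta)
print(q_a, q_b, q_b / q_a)
```

The ratio q_b / q_a equals 2 raised to the second coefficient, which is why that coefficient is read as the log2 fold change of B over A.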

Controlling for different batches
Using a design formula, ~ batch + condition, adds terms that control for batch differences.
If batches are unknown, it is possible to detect them with other methods: svaseq, RUVSeq.
    library(ggplot2)
    # simulated counts: 2 batches x 2 conditions, 10 samples per group
    data <- data.frame(
      counts = rnorm(10*4, rep(c(100,200,300,400), each=10), 50),
      batch = factor(rep(1:2, each=2*10)),
      condition = factor(rep(c("A","B","A","B"), each=10)))
    ggplot(data, aes(x=batch, y=counts, fill=condition)) + geom_boxplot()
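The design formula ~ batch + condition corresponds to a design matrix with one column per coefficient. The sketch below assumes treatment coding with batch 1 and condition A as reference levels; the function name and the four-sample layout are hypothetical.

```python
def design_matrix(batch, condition):
    """Columns: intercept, indicator for batch 2, indicator for condition B."""
    return [[1, int(b == 2), int(c == "B")] for b, c in zip(batch, condition)]


# One sample per batch-by-condition combination.
batch = [1, 1, 2, 2]
condition = ["A", "B", "A", "B"]
rows = design_matrix(batch, condition)
for row in rows:
    print(row)
```

Because the batch column absorbs the shift between batches, the condition coefficient is estimated from within-batch differences.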

Complex designs
Treatment effect for enriched samples over baseline, controlling for individual effects:
~ indiv + enrich + cond + enrich:cond
indiv  enrich  cond
1      input   ctrl
1      IP      ctrl
1      input   trt
1      IP      trt
2      input   ctrl
2      IP      ctrl
2      input   trt
2      IP      trt
...

"Wilkinson and Rogers" notation
y ~ ...         'y' modeled by a linear predictor of ...
a + b           'a' and 'b' each put in model
a + b + a:b     'a' and 'b' and their interaction
a*b             equivalent to the above row
0 + a           'a' and no intercept
a + I(a^2)      quadratic function of 'a'
poly(a,2)       orthogonal polynomial of deg 2
ns(a, df=df)    natural cubic spline with 'df' degrees of freedom
Source: Formula vignette

DESeq2 steps
count matrix (from featureCounts, htseq, tximport, etc.)
DESeq():
- size factors (sequencing depth)
- dispersion (biological variance)
- Wald test or likelihood ratio test
results():
- build results table
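The size-factor step can be illustrated with the median-of-ratios idea: each sample is compared with a per-gene geometric mean, and the median ratio is taken as that sample's depth factor. This is a plain-Python sketch of the idea, not the DESeq2 code; the function name and the counts are hypothetical.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors.

    counts -- list of genes, each a list of raw counts across samples.
    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue
        # Log of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(statistics.median(r)) for r in log_ratios]


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
counts = [[10, 20], [100, 200], [50, 100], [0, 3], [7, 14]]
sf = size_factors(counts)
print([round(s, 3) for s in sf])
```

Using a median makes the factor robust to a handful of strongly differentially expressed genes, which would distort a total-count factor.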

Testing against a threshold
"We get too many DEGs..."
Using 'lfcThreshold' in results():
- default null hypothesis: fold change = 1
- with a log2 fold change threshold of 1, null hypothesis: fold change is < 2 and > 1/2
"For well-powered experiments, however, a statistical test against the conventional null hypothesis of zero LFC may report genes with statistically significant changes that are so weak in effect strength that they could be considered irrelevant or distracting."
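The effect of moving the null hypothesis away from zero can be sketched with a Wald-type test on an estimated log2 fold change and its standard error. This is an illustrative formulation under a normal approximation and may differ in detail from what results() computes; the function names and the example numbers are hypothetical.

```python
import math


def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def wald_p(lfc: float, se: float, threshold: float = 0.0) -> float:
    """Two-sided p-value for the null hypothesis |LFC| <= threshold."""
    z = (abs(lfc) - threshold) / se
    return min(1.0, 2 * upper_tail(z))


# Hypothetical gene: a small but very precisely estimated log2 fold change.
lfc, se = 0.4, 0.05
p_zero = wald_p(lfc, se)                     # null: LFC = 0
p_thresh = wald_p(lfc, se, threshold=1.0)    # null: |LFC| <= 1
print(p_zero, p_thresh)
```

The same gene is highly significant against a null of zero change and not significant against a null of less than two-fold change.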

Multiple Test Correction
What is a p-value? How is it distributed when the null is true?

Multiple Test Correction
P-value: probability of an observation at least as extreme as the one seen, when the null hypothesis is true
Q-value: minimum false discovery rate at which the gene is called significant
False discovery rate: expected fraction of false positives among a set of results called significant
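A common way to control the false discovery rate is the Benjamini-Hochberg procedure, whose adjusted p-values are often reported in the role of q-values. The sketch below is a plain-Python version of that procedure; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]  # hypothetical p-values
padj = benjamini_hochberg(pvals)
n_sig = sum(p <= 0.05 for p in padj)
print([round(p, 4) for p in padj], n_sig)
```

Four of these raw p-values fall below 0.05, but only two adjusted values do, which shows how the correction trims the list of calls.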

Count model vs linear model
DESeq2 and edgeR:
- similar approach, similar results
- very sensitive, may sometimes underestimate FDR
limma + voom:
- uses a linear model, with weights determined by variance over mean
- strong control of FDR, may be less sensitive for small counts and small sample size
- recommended when the number of biological replicates per group grows large (e.g. > 20) and small counts are not of interest

Two paths in RNA-seq analysis
Count matrix
Transformations and Exploratory Data Analysis (EDA): clustering, heatmaps, sample-sample distances
- DESeq2: vst(), rlog(), plotPCA()
- edgeR: cpm(), plotMDS()
Differential expression: testing, p-values, FDR
- DESeq2: DESeq(), results()
- edgeR: glmLRT(), topTags()

Regularized logarithm, "rlog"
[Figure: sample 2 vs. sample 1, under log2(x + 1) and under "rlog"]
Poisson noise from low counts, when squared, makes a big contribution to the Euclidean distance between samples.

VST and rlog vs log(x+1)
Essentially provide a similar outcome to filtering at T and/or adding a pseudocount of X, but parameter estimation is data-driven.

Practical guides
