RNA-Seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520 Guest lecture by Wei Li
RNA-seq Protocol Martin and Wang Nat. Rev. Genet. (2011)
RNA-seq https://www.youtube.com/watch?v=V_4n8n5Z6I8 (RNA-Seq using Ion Proton)
Why RNA-seq, not microarray? No need to design microarray probes. Digital representation, higher detection range. Can also detect alternative splicing, gene fusions, and mutations.
RNA-seq Applications Gene expression; differential expression
RNA-seq Applications Alternative splicing, novel isoforms
RNA-seq Applications Novel genes or transcripts, lncRNA
RNA-seq Applications Detect gene fusions Mutations, RNA editing
RNA-seq Experimental Design and Analysis
Experimental Design: Assessing biological variation requires biological replicates (technical replicates are not needed). 3 replicates preferred, 2 OK, 1 only for exploratory assays (not good for publication).
Experimental Design: For differential expression, don’t pool RNA from multiple biological replicates. Batch effects still exist, so keep processing consistent or process all samples at the same time.
Batch effect: A research group’s striking finding in 2014: “Human heart is more similar to human brain than mouse brain” (figure labels: Human Heart, Mouse Brain, Human Brain)
Circles: human tissues; cones: mouse tissues
Batch effect: Another researcher’s response on Twitter
1st batch: human tissues; 2nd batch: human tissues; 3rd batch: mouse tissues; 4th batch: mouse tissues; 5th batch: human/mouse tissues
Batch effect
Batch effect: Before experiments: careful design. After experiments: batch effect removal (ComBat).
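To make "batch effect removal" concrete, here is a minimal sketch of the simplest possible correction: centering each batch on the overall mean for one gene. This is not ComBat, which additionally fits an empirical Bayes model across genes; the function name `center_by_batch` and the numbers are hypothetical.

```python
from collections import defaultdict

def center_by_batch(values, batches):
    """Shift each batch so its mean equals the overall mean (one gene).

    values  : expression values of ONE gene, one per sample (e.g. log2 scale)
    batches : batch label of each sample, in the same order as values
    """
    overall = sum(values) / len(values)
    by_batch = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_mean = {b: sum(vs) / len(vs) for b, vs in by_batch.items()}
    # Remove the batch-specific offset, keep the overall level
    return [v - batch_mean[b] + overall for v, b in zip(values, batches)]

# Hypothetical gene measured in two batches with a shift of about 2 units
gene = [5.0, 5.2, 7.0, 7.2]
batch = ["b1", "b1", "b2", "b2"]
adjusted = center_by_batch(gene, batch)
```

If batch and condition are fully confounded (for example, all human samples in one batch and all mouse samples in another), no correction of this kind can separate the two, which is why careful design comes first.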
Experimental Design: Ribo-minus (remove overly abundant rRNA). PolyA (mRNA, enrich for exons). Strand-specific (anti-sense lncRNA). Sequencing: PE (resolves redundancy) or SE; SE is enough for expression, PE for splicing and novel transcripts. Depth: 30-50M reads for differential expression, deeper for transcript assembly. Read length: longer for transcript assembly.
Alignment: Prefer splice-aware aligners such as TopHat and STAR (not DNASTAR); BWA is also used but is not splice-aware. Sometimes the first bases of the reads need to be trimmed.
Quality Control: RSeQC Read qualities
Quality Control: RSeQC Nucleotide compositions
Quality Control: RSeQC Read count distribution and GC content
Quality Control: RSeQC Read count distributions across genes
Quality Control: RSeQC Insert size distribution and splicing junctions Paired-end read Insert size
Quality Control: RSeQC
Differential Expression
Differential expression: You see that the expression of gene X doubles in condition B compared with condition A. How reliable is it? What is the chance of observing it at random? It all comes down to variation estimation! (figure: expression in A vs. B for three examples, with p=0.001 and p=0.27 shown)
Differential expression: Variation can be estimated if you have many biological replicates, but in practice only 2-3 replicates are available. What to do next? Use proper statistical models.
Sequencing Read Distribution: Poisson distribution: # events within an interval; Mean = Variance. But sequencing data is over-dispersed (Mean < Variance).
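A quick way to see over-dispersion is the variance-to-mean ratio of one gene's counts across replicates: about 1 under Poisson, above 1 when over-dispersed. The sketch below uses made-up counts and a hypothetical helper name; with only a few replicates this ratio is itself very noisy.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by mean; ~1 for Poisson, >1 if over-dispersed."""
    return variance(counts) / mean(counts)

# Hypothetical read counts for one gene in four biological replicates
replicate_counts = [90, 100, 150, 60]
ratio = dispersion_index(replicate_counts)  # mean 100, sample variance 1400
```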
Sequencing Read Distribution: Negative binomial. Definition: the number of successes before r failures occur, when each trial succeeds with probability p.
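Under the definition on this slide, the mean and variance can be written out and rearranged into the mean-dispersion form commonly used for RNA-seq counts. The symbols \mu and \alpha are introduced here for illustration and do not appear on the slide.

```latex
% X = number of successes before r failures, success probability p
P(X = k) = \binom{k + r - 1}{k}\, p^{k} (1 - p)^{r}, \qquad k = 0, 1, 2, \dots

\mu = \mathbb{E}[X] = \frac{p\,r}{1 - p}, \qquad
\operatorname{Var}(X) = \frac{p\,r}{(1 - p)^{2}} = \frac{\mu}{1 - p}

% Since 1/(1-p) = 1 + \mu / r, setting \alpha = 1/r gives
\operatorname{Var}(X) = \mu + \alpha\,\mu^{2}
```

The extra term \alpha\mu^{2} is what lets the variance exceed the mean; as \alpha approaches 0 the variance returns to the Poisson value \mu.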
Differential Expression: Negative binomial for RNA-seq. Variance is estimated by borrowing information from all the genes (hierarchical models). Test whether μi is the same for gene i between samples j. FDR?
Differential expression: edgeR, DESeq/DESeq2
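Because one test is run per gene, the p-values need a multiple-testing correction. The slide does not say which FDR method is meant; the sketch below assumes the Benjamini-Hochberg procedure, with hypothetical p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene p-values
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = benjamini_hochberg(pvals)
```

Genes whose adjusted value falls below the chosen FDR level (for example 0.05) would be called differentially expressed.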
Expression Index: RPKM (reads per kilobase of transcript per million reads of library): corrects for coverage and gene length; 1 RPKM ~ 0.3-1 transcript/cell; comparable between different genes within the same dataset; TopHat / Cufflinks. FPKM (fragments): for PE libraries, RPKM/2. TPM (transcripts per million): normalizes to transcript copies instead of reads, because longer transcripts have more reads; RSEM, HTSeq.
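The two indices can be computed directly from read counts and transcript lengths. This is a minimal sketch with hypothetical counts for two genes of 1 kb and 8 kb; real tools also handle effective length and multi-mapped reads.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rate)
    return [r * 1e6 / total_rate for r in rate]

# Hypothetical library: the 8 kb gene gets 8x the reads at equal copy number
counts = [100, 800]
lengths = [1000, 8000]
rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

Both indices give the two genes the same value here, since the 8-fold difference in reads is explained entirely by length. TPM values always sum to one million within a sample, which RPKM values do not.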
Differential Expression: Should we do differential expression on RPKM/FPKM or TPM? Cufflinks: RPKM/FPKM. LIMMA-VOOM and DESeq: TPM. Power to detect DE is proportional to length. Continued development and updates. (figure: Gene A (1kb), Gene B (8kb))
Alternative Splicing Assign reads to splice isoforms (TopHat)
Alternative Splicing Different AS events
Alternative Splicing MATS: Multivariate Analysis of Transcript Splicing
Transcript Assembly: Reference-based assembly: Cufflinks. De novo assembly: Trinity.
Transcript Assembly (Cufflinks) 1. Read mapping using TopHat 2. Construct a graph of reads. “Incompatible” fragments (reads) are definitely NOT from the same transcript
Transcript Assembly (Cufflinks) Incompatible
Transcript Assembly (Cufflinks) 3. Identify the minimum # of paths that cover all reads (each path is one possible transcript). Dilworth’s theorem: the minimum number of chains needed to partition a partially ordered set P equals the size of a maximum antichain in P (an antichain is a set of mutually incompatible fragments)
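One standard way to compute the minimum number of chains is bipartite matching on the compatibility relation: the answer is the number of fragments minus the size of a maximum matching. The sketch below is an illustration under assumptions, not Cufflinks' actual code; it assumes the relation is given as ordered pairs and is already transitively closed, and the function name is hypothetical.

```python
def min_chain_partition_size(n, compatible):
    """Minimum number of chains covering fragments 0..n-1.

    compatible: iterable of (u, v) pairs meaning fragment u precedes v and
    the two are compatible. Assumed acyclic and transitively closed.
    """
    succ = {u: [] for u in range(n)}
    for u, v in compatible:
        succ[u].append(v)
    match_right = {}  # v -> u currently linked to v

    def try_augment(u, seen):
        # Standard augmenting-path search for bipartite matching
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_right or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    matching = sum(try_augment(u, set()) for u in range(n))
    # Each matched pair merges two chains, so chains = n - matching
    return n - matching

# Hypothetical fragments: 1 and 2 are mutually incompatible
pairs = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
n_transcripts = min_chain_partition_size(4, pairs)
```

In this example the maximum antichain is {1, 2}, so two paths (candidate transcripts) are needed, in agreement with Dilworth's theorem.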
Transcript Assembly (Cufflinks) 4. Transcript abundance estimation
Isoform Inference: Given a known set of isoforms, estimate x (isoform abundances) to maximize the likelihood of observing n (read counts)
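A common way to maximize such a likelihood is expectation-maximization over read-to-isoform assignments. The slide does not name an algorithm, so EM is an assumption here. The sketch treats reads as uniformly sampled along each isoform and ignores fragment-length and positional bias; all names and numbers are hypothetical.

```python
def em_isoform_fractions(read_compat, lengths, n_iter=200):
    """Estimate the fraction of reads originating from each isoform.

    read_compat: list of sets; each set holds the isoform indices one read
                 is compatible with
    lengths    : isoform lengths in bp
    """
    k = len(lengths)
    theta = [1.0 / k] * k  # start from equal read fractions
    for _ in range(n_iter):
        expected = [0.0] * k
        for compat in read_compat:
            # E-step: P(read came from isoform i) is proportional to
            # theta_i / length_i among the compatible isoforms
            weights = {i: theta[i] / lengths[i] for i in compat}
            z = sum(weights.values())
            for i, w in weights.items():
                expected[i] += w / z
        # M-step: new read fractions are the normalized expected counts
        total = sum(expected)
        theta = [e / total for e in expected]
    return theta

# Hypothetical gene with two isoforms (300 bp and 200 bp):
# 2 reads unique to isoform 0, 1 unique to isoform 1, 3 shared
reads = [{0}] * 2 + [{1}] + [{0, 1}] * 3
theta_hat = em_isoform_fractions(reads, [300, 200])
```

Dividing each read fraction by its isoform length and renormalizing would convert these estimates into relative transcript abundances.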
Known Isoform Abundance Inference
Isoform Inference: With a known isoform set, gene-level expression inference is sometimes good even though isoform abundances have large uncertainty (e.g. when the known set is incomplete). De novo isoform inference is a non-identifiable problem if RNA-seq reads are short and the gene is long with many exons. Algorithm: Trinity
De-novo transcriptome assembly
De Bruijn graph (1946): Used in the earliest human genome assemblies. Standard algorithm for genome assembly. A sequence of length k can be represented as an edge between two sequences of length k-1.
De Bruijn graph (1946)
De Bruijn graph: How to do genome assembly? Sequences as nodes -> traverse all nodes in a graph -> Hamiltonian path problem -> NP-complete problem! De Bruijn graph: sequences as edges -> traverse all edges in a graph -> Eulerian path -> polynomial-time algorithm!
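The edge-based formulation can be sketched in a few lines: each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and an Eulerian path is found with Hierholzer's algorithm. This toy version assumes error-free reads, no repeated (k-1)-mers, and that an Eulerian path exists; the reads and function names are hypothetical, and real assemblers such as Trinity do considerably more.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Each distinct k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any
    start = next((u for u in list(graph) if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: add to path
    return path[::-1]

def spell(path):
    """Concatenate the first node with the last base of every later node."""
    return path[0] + "".join(node[-1] for node in path[1:])

# Two hypothetical overlapping reads
reads = ["ATGGCGT", "GCGTGCA"]
contig = spell(eulerian_path(de_bruijn_edges(reads, 4)))
```

Every edge is visited once, so the running time grows linearly with the number of k-mers, in contrast to the Hamiltonian formulation.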
Gene Fusion: Seen more often in cancer samples. Still somewhat hard to call. TopHat-Fusion in TopHat2. Maher et al., Nature 2009
Other Applications: RNA editing: change in RNA sequence after transcription. Most frequent: A to I (behaves like G), C to U. Evolved from mononucleotide deaminases; might be involved in RNA degradation. Circular RNA: mostly arises from splicing; varying length, abundance, and stability. Possible function: sponge for RBPs or miRNAs.
Summary: RNA-seq design considerations. Read mapping: TopHat, BWA, STAR. De novo transcriptome assembly: Trinity. Quality control: RSeQC. Expression index: FPKM and TPM. Differential expression: Cufflinks (versatile); LIMMA-VOOM and DESeq (better variance estimates). Alternative splicing: MATS. Gene fusion, RNA editing, circular RNA.
Acknowledgement Alisha Holloway Simon Andrews Radhika Khetani