mRNA-Seq: methods and applications Jim Noonan GENE 760
Introduction to mRNA-seq
  - Technical methodology
  - Read mapping and normalization
  - Estimating isoform-level gene expression
  - De novo transcript reconstruction
  - Sensitivity and sequencing depth
  - Differential expression analysis
mRNA-seq workflow Wang et al. Nat Rev Genet 10:57 (2009) Martin and Wang Nat Rev Genet 12:671 (2011)
Illumina RNA-seq library preparation
  - Capture poly-A RNA with poly-T oligo-attached beads (100 ng total) (2x)
      - RNA quality must be high; degradation produces 3’ bias
      - Non-poly-A RNAs are not recovered
  - Fragment mRNA
  - Synthesize ds cDNA
  - Ligate adapters
  - Amplify
  - Generate clusters and sequence
Ribosomal RNA subtraction RiboMinus
Mapping RNA-seq reads and quantifying transcripts
RNA-seq reads mapped to a reference genome
  - Normalization: reads per kilobase of feature length per million mapped reads (RPKM)
  - Quantify expression of known genes (counting)
      - Gene-model level: composite of the whole gene vs. constitutive exons
      - Differences that we see could be due to splicing
  - Methods for isoform-level expression values
  - Transcriptome reconstruction: combination of TopHat and paired-end tags
  - What is a “feature”?
  - What about genomes with poor genome annotation?
  - What about species with no sequenced genome?
  - For a detailed comparison of normalization methods, see Bullard et al. BMC Bioinformatics 11:94 (2010) and Robinson and Oshlack, Genome Biol 11:R25 (2010)
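The RPKM definition above can be written out as a short calculation. This is a minimal sketch, not any particular tool's implementation; the function name `rpkm`, the feature names, and all counts and lengths are made up for illustration.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature length per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and library size must be positive")
    kilobases = feature_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions


# Hypothetical library: 20 million mapped reads, three made-up features.
TOTAL_MAPPED = 20_000_000
features = {
    "geneA": (500, 2_000),    # (reads mapped to feature, feature length in bp)
    "geneB": (500, 10_000),   # same read count, five times longer
    "geneC": (50, 2_000),     # same length as geneA, ten times fewer reads
}

rpkm_values = {
    name: rpkm(count, length, TOTAL_MAPPED)
    for name, (count, length) in features.items()
}

for name, value in rpkm_values.items():
    print(f"{name}: {value:.2f} RPKM")
```

In this toy data geneA and geneB have the same raw count, but the longer feature gets a lower RPKM. The value therefore depends on what is chosen as the "feature" and on its length.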
Quantifying gene expression by RNA-seq
  - Use existing gene annotation:
      - Align to genome plus annotated splices
      - Depends on high-quality gene annotation
      - Which annotation to use: RefSeq, GENCODE, UCSC?
      - Isoform quantification?
      - Identifying novel transcripts?
  - Reference-guided alignments:
      - Align to genome sequence
      - Infer splice events from reads
      - Allows transcriptome analyses of genomes with poor gene annotation
  - De novo transcript assembly:
      - Assemble transcripts directly from reads
      - Allows transcriptome analyses of species without reference genomes
Composite gene model approach
  - Map reads to genome
  - Map remaining reads to known splice junctions
  - Requires good gene models
  - Isoforms are ignored
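One way to picture a composite gene model is as the union of the exon intervals from every annotated isoform of a gene. The sketch below assumes that reading. The function name `composite_exons`, the half-open coordinate convention, and the exon coordinates are all illustrative assumptions, not taken from a specific pipeline.

```python
def composite_exons(isoforms):
    """Merge exon intervals from all isoforms into one union model.

    isoforms: list of isoforms, each a list of (start, end) exon intervals
              in half-open coordinates on the same chromosome and strand.
    """
    intervals = sorted(exon for isoform in isoforms for exon in isoform)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Two hypothetical isoforms of one gene.
isoform_1 = [(100, 200), (300, 400), (500, 600)]
isoform_2 = [(100, 200), (350, 450), (500, 650)]

model = composite_exons([isoform_1, isoform_2])
model_length = sum(end - start for start, end in model)

print(model)          # merged exon blocks
print(model_length)   # length usable as the "feature length" for RPKM
```

Reads are then counted against the merged blocks, which is why isoform differences are invisible at this level.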
Which gene annotation to use?
Strategies for transcript assembly Garber et al. Nat Methods 8:469 (2011)
Splice-aware short read aligners Martin and Wang Nat Rev Genet 12:671 (2011)
Reference-based transcript assembly Martin and Wang Nat Rev Genet 12:671 (2011)
Transcript assembly programs Martin and Wang Nat Rev Genet 12:671 (2011)
Cufflinks: ab initio transcript assembly Step 1: map reads to reference genome Trapnell et al. Nat. Biotechnology 28:511 (2010)
Cufflinks: ab initio transcript assembly Isoform abundances estimated by maximum likelihood Trapnell et al. Nat. Biotechnology 28:511 (2010)
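To make "estimated by maximum likelihood" concrete, here is a simplified expectation-maximization sketch for reads that are compatible with more than one isoform. It is not the Cufflinks implementation, which also models fragment length distributions and other effects. The function name, the use of raw isoform length in place of effective length, and the toy read counts are all assumptions for illustration.

```python
def em_isoform_abundance(compatibility, lengths, n_iter=500):
    """Estimate relative isoform abundances by expectation-maximization.

    compatibility: one tuple per read listing the isoform indices the read
                   could have come from.
    lengths:       isoform lengths in bp (stand-in for effective length).
    Returns the estimated fraction of transcripts from each isoform.
    """
    k = len(lengths)
    alpha = [1.0 / k] * k  # fraction of reads attributed to each isoform
    for _ in range(n_iter):
        assigned = [0.0] * k
        # E-step: split each read among its compatible isoforms in
        # proportion to alpha_i / length_i.
        for isoforms in compatibility:
            weights = [alpha[i] / lengths[i] for i in isoforms]
            total = sum(weights)
            if total == 0:
                continue
            for i, w in zip(isoforms, weights):
                assigned[i] += w / total
        # M-step: update read fractions from the soft assignments.
        n_assigned = sum(assigned)
        alpha = [a / n_assigned for a in assigned]
    # Convert read fractions to transcript fractions (remove length bias).
    per_base = [alpha[i] / lengths[i] for i in range(k)]
    norm = sum(per_base)
    return [p / norm for p in per_base]


# Toy data: 60 reads fit both isoforms, 30 only isoform 0, 10 only isoform 1.
reads = [(0, 1)] * 60 + [(0,)] * 30 + [(1,)] * 10
estimate = em_isoform_abundance(reads, lengths=[1000, 2000])
print([round(x, 3) for x in estimate])
```

The ambiguous reads are shared out according to the current abundance estimate, and the estimate is updated until it stops changing.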
Graph-based transcript assembly Martin and Wang Nat Rev Genet 12:671 (2011)
Trinity: de novo transcript assembly Grabherr et al. Nat Biotechnol 29:644 (2011)
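Graph-based de novo assemblers such as Trinity build on de Bruijn graphs of k-mers. The sketch below shows only the basic idea: build the graph from reads, then walk an unambiguous path to recover a contig. It is a toy, not Trinity's algorithm; the function names, the reads, and the choice of k = 4 are made up, and sequencing errors, coverage, and branching from alternative splicing are ignored.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk forward from `start` while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:  # stop on cycles
            break
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig


# Three overlapping made-up reads from one short transcript.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ATG")
print(contig)
```

In real data a node with more than one successor marks a branch point, for example an alternative splice junction, and the assembler has to decide which paths correspond to real transcripts.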
What depth of sequencing is required to characterize a transcriptome? Wang et al. Nat Rev Genet 10:57 (2009)
Considerations
  - Gene length: long genes are detected before short genes
  - Expression level: high expressors are detected before low expressors
  - Complexity of the transcriptome: tissues with many cell types require more sequencing
  - Feature type:
      - Composite gene models
      - Common isoforms
      - Rare isoforms
  - Detection vs. quantification: obtaining confident expression level estimates (e.g., “stable” RPKMs) requires greater coverage
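The length and expression-level effects above follow from a simple sampling argument: the share of reads a gene receives is roughly proportional to its abundance times its length. The sketch below assumes that, plus a Poisson approximation for read counts. The gene names, abundances, lengths, and depths are invented for illustration.

```python
import math


def read_fractions(genes):
    """Expected share of reads per gene, proportional to abundance x length."""
    weights = {
        name: abundance * length
        for name, (abundance, length) in genes.items()
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def prob_detected(expected_reads, min_reads=1):
    """P(X >= min_reads) for X ~ Poisson(expected_reads)."""
    p_below = sum(
        math.exp(-expected_reads) * expected_reads ** k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_below


# Made-up transcriptome: (relative abundance, length in bp).
genes = {
    "long_low": (1, 5_000),
    "short_low": (1, 500),
    "high": (10_000, 2_000),
}
fractions = read_fractions(genes)

for depth in (10_000, 100_000, 1_000_000):
    row = {name: prob_detected(depth * f) for name, f in fractions.items()}
    print(depth, {name: round(p, 3) for name, p in row.items()})
```

Raising `min_reads` gives a rough feel for the difference between detecting a transcript and having enough reads to quantify it.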
Transcript detection is biased in favor of long genes Tarazona et al. Genome Res 21:2213 (2011)
Applications of mRNA-seq
  - Characterizing transcriptome complexity
      - Alternative splicing
  - Differential expression analysis
      - Gene- and isoform-level expression comparisons
  - Novel RNA species
      - lincRNAs and eRNAs
      - Pervasive transcription
  - Translation
      - Ribosome profiling
  - Allele-specific expression
      - Effect of genetic variation on gene expression
      - Imprinting
  - RNA editing
      - Novel events
Alternative isoform regulation in human tissue transcriptomes Wang et al Nature 456:470 (2008)
Diversity of alternative splicing events in human tissues Wang et al. Nature 456:470 (2008)
Differential expression Garber et al. Nat Methods 8:469 (2011)
Programs for identifying DE genes in RNA-seq datasets

Program    Assumed distribution for count data    URL
DESeq      Negative binomial                      www-huber.embl.de/users/anders/DESeq/
DEGseq     Poisson                                www.bioconductor.org/packages/2.6/bioc/html/DEGseq.html
edgeR      -                                      www.bioconductor.org/packages/release/bioc/html/edgeR.html
baySeq     -                                      www.bioconductor.org/packages/release/bioc/html/baySeq.html
Cuffdiff   -                                      cufflinks.cbcb.umd.edu/
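The choice between a Poisson and a negative binomial model comes down to whether replicate counts vary more than their mean. Under a negative binomial model the variance is mu + alpha * mu^2, where alpha is the dispersion, and alpha = 0 recovers the Poisson case. The sketch below estimates alpha for one gene by the method of moments. It is a simplified illustration, not the procedure used by DESeq or edgeR, which share information across genes; the function name and replicate counts are made up.

```python
from statistics import mean, variance


def nb_dispersion(counts):
    """Method-of-moments dispersion: alpha = (var - mean) / mean^2, floored at 0."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance across replicates
    return max(0.0, (var - mu) / mu ** 2)


# Hypothetical counts for one gene across four biological replicates.
replicate_counts = [90, 110, 130, 70]

mu = mean(replicate_counts)
var = variance(replicate_counts)
alpha = nb_dispersion(replicate_counts)

print(f"mean={mu}, variance={var:.1f}, dispersion={alpha:.4f}")
print("Poisson would predict variance ~", mu)
```

When the observed variance is well above the mean, as in this toy example, a Poisson model understates the noise between replicates and will tend to call too many genes differentially expressed.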
Differential expression: Characterizing transcriptome dynamics during brain development
  - Embryonic mouse cortex: RNA-seq, DEX
  - Neuronal functions: synaptic transmission, cell adhesion, neuronal migration
  - “Stemness” functions: cell cycle, M phase, Sox2, Oct4
  Ayoub et al. PNAS 108:14950 (2011)
Differential expression: Characterizing transcriptome dynamics during brain development
  - Embryonic mouse cortex: RNA-seq, differential (DE) isoforms
  Ayoub et al. PNAS 108:14950 (2011)
Novel RNA species: annotating lincRNAs Guttman et al Nat Biotechnol 28:503 (2010)
Enhancer-associated RNAs (eRNAs) Neurons treated with KCl Kim et al. Nature 465:182 (2010)
Enhancer-associated RNAs (eRNAs) Ren B. Nature 465:173 (2010)
How much of the genome is transcribed? van Bakel et al. PLoS Biol. 8:e1000371 (2010)
Exploiting sequence information in RNA-seq reads Majewski and Pastinen. Trends Genet 27:72 (2011)
Detecting variants that affect splicing Pickrell et al. Nature 464:768 (2010)
Summary: mRNA-seq applications
  - Quantify transcriptome complexity and compare across biological states
  - Determine how transcriptomes are translated in different biological contexts
  - Effect of genetic variation on gene expression
  - Imprinting and RNA editing