RNA-Seq with the Tuxedo Suite
Monica Britton, Ph.D.
Sr. Bioinformatics Analyst
September 2015 Workshop
The Basic Tuxedo Suite
References
– Trapnell C, et al. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics.
– Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology.
– Kim D, et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology.
– Roberts A, et al. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology.
– Roberts A, et al. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics.
– Trapnell C, et al. Differential analysis of gene regulation at transcript resolution with RNA-Seq. Nature Biotechnology.
Cufflinks assembles transcripts
Cuffdiff identifies differential expression of genes/transcripts/promoters
Alignment and Differential Expression
Workflow: read set(s) + existing annotation (GTF) → TopHat → bam file(s) → Cuffdiff → toptables, etc.
We followed these steps with the single-end reads.
But, do we have all the genes?
For organisms with sequenced genomes, gene models are stored in GTF files
Assumptions:
– The GTF file contains annotation for ALL transcripts and genes
– All splice sites, start/stop codons, etc. are correct
Are these assumptions correct for every sequenced organism?
RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation
The method used depends on how much sequence information there is for the organism…
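To make the "gene models are stored in GTF files" point concrete, here is a minimal sketch of reading one GTF record in Python. It assumes the standard nine tab-separated GTF columns; the example record (chromosome, coordinates, gene and transcript IDs) is invented for illustration and does not come from any real annotation.

```python
# Minimal GTF record parser (illustrative sketch, not a full validator).
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]

def parse_gtf_line(line):
    """Split one GTF record into a dict; the attribute column becomes a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])   # GTF coordinates are 1-based, inclusive
    record["end"] = int(record["end"])
    attrs = {}
    for item in record["attributes"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip('"')
    record["attributes"] = attrs
    return record

# Hypothetical exon record, invented for this example only.
example = 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";'
rec = parse_gtf_line(example)
print(rec["feature"], rec["start"], rec["end"], rec["attributes"]["gene_id"])
```

Grouping exon records by their transcript_id attribute is one simple way to check whether an annotation actually contains the splice variants you expect.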
Gene Construction (Alignment) vs. Assembly
[Figure: alignment-based gene construction for genome-sequenced organisms vs. assembly (Trinity software) for novel or non-model organisms. Haas and Zody (2010) Nat. Biotech. 28:421-3]
Gene / Transcriptome Construction
Annotation can be improved, even for well-annotated model organisms
– Identify all expressed exons
– Combine expressed exons into genes
– Find all splice variants for a gene
– Discover novel transcripts
For newly sequenced organisms
– Validate ab initio annotation
– Compare different annotation sets
Can assist in finding some types of contamination
– Reconstruction of rRNA genes
– Genomic/mitochondrial DNA in RNA library preps
Reference Annotation Based Transcript (RABT) Assembly
– TopHat: read set(s) + existing annotation (GTF) [optional] → bam file(s)
– Cufflinks: bam file(s) → read-set specific GTF(s)
– Cuffmerge: read-set specific GTF(s) → merged GTF
– Cuffcompare: merged GTF → final assembly (GTF and stats)
– Cuffdiff: bam file(s) + assembly → toptables, etc.
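The RABT workflow can be sketched as a sequence of command lines. The Python below only builds the argument lists and does not run anything. All file names, directory names, and sample labels are hypothetical; the options shown (-p, -o, -G for TopHat, -g for Cufflinks guide annotation, -s/-g for Cuffmerge, -r for Cuffcompare, -L for Cuffdiff labels) are the commonly documented ones, but check them against the manual for your installed versions.

```python
def rabt_commands(samples, genome_index, genome_fasta, annotation_gtf=None, threads=4):
    """Build (not run) command lines for a RABT-style Tuxedo analysis.

    samples maps a condition label to a FASTQ path (one sample per condition here).
    Returns (commands, assemblies); write `assemblies` one path per line to
    assemblies.txt before running cuffmerge.
    """
    commands, assemblies, bams = [], [], []
    for label, fastq in samples.items():
        tophat_dir = f"tophat_{label}"
        cufflinks_dir = f"cufflinks_{label}"
        tophat = ["tophat", "-p", str(threads), "-o", tophat_dir]
        cufflinks = ["cufflinks", "-p", str(threads), "-o", cufflinks_dir]
        if annotation_gtf:
            tophat += ["-G", annotation_gtf]      # use known junctions during alignment
            cufflinks += ["-g", annotation_gtf]   # annotation as a guide (RABT)
        bam = f"{tophat_dir}/accepted_hits.bam"
        commands.append(tophat + [genome_index, fastq])
        commands.append(cufflinks + [bam])
        assemblies.append(f"{cufflinks_dir}/transcripts.gtf")
        bams.append(bam)
    cuffmerge = ["cuffmerge", "-s", genome_fasta, "-o", "merged_asm"]
    if annotation_gtf:
        cuffmerge += ["-g", annotation_gtf]
    commands.append(cuffmerge + ["assemblies.txt"])
    if annotation_gtf:
        commands.append(["cuffcompare", "-r", annotation_gtf, "-o", "cuffcmp",
                         "merged_asm/merged.gtf"])
    commands.append(["cuffdiff", "-p", str(threads), "-o", "diff_out",
                     "-L", ",".join(samples), "merged_asm/merged.gtf"] + bams)
    return commands, assemblies

# Hypothetical inputs, for illustration only.
samples = {"control": "control.fastq", "treated": "treated.fastq"}
commands, assemblies = rabt_commands(samples, "genome_index", "genome.fa", "genes.gtf")
for cmd in commands:
    print(" ".join(cmd))
```

Leaving out annotation_gtf drops the -G/-g/-r options and the Cuffcompare step, which corresponds to the "[optional]" annotation input on the slide.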
TopHat Spliced Alignment to a Genome
Reference Annotation Based Transcript (RABT) Assembly
Cufflinks – Identification of Incompatible Fragments
[Figure label: Incompatible alignment]
Cufflinks – Minimum Paths to Transcripts
Cufflinks – Abundance Estimation
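Cufflinks reports abundance in FPKM (fragments per kilobase of transcript per million mapped fragments). As a rough illustration of the unit only, the sketch below computes the plain FPKM definition. It is not Cufflinks' estimation procedure, which fits a statistical model to assign fragments among isoforms; the input numbers are invented.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Plain FPKM definition: fragments per kilobase per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Invented example: 500 fragments on a 2,000 bp transcript in a library
# of 20 million mapped fragments.
print(fpkm(500, 2000, 20_000_000))  # 12.5
```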
Merging Cufflinks Assemblies
So Now We’ve Explored These Tools…
We’ve Used Other Software in Conjunction
HTSeq-count → raw counts → edgeR
(But HTSeq-count and edgeR are independent)
And Then Came Some Extensions…
Modules Introduced in 2014
Cuffquant
– Improves efficiency of running multiple samples
– Stores data in a compressed “.cxb” format, which can later be analyzed with cuffdiff or cuffnorm
Cuffnorm
– Generates tables of expression values that are normalized for library size
– Tables are used as input to Monocle
Monocle
– Used to analyze single-cell expression data
– Trapnell, et al., 2014, Nat. Biotech. 32:381
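To show what "normalized for library size" means in the simplest case, the sketch below rescales each sample's counts to a common total of one million. This is an assumption-laden stand-in: Cuffnorm has its own normalization methods, and this total-count scaling is only the most basic version of the idea. Sample names, gene names, and counts are invented.

```python
def scale_by_library_size(counts, target=1e6):
    """counts: {sample: {gene: raw_count}} -> values rescaled so each sample sums to `target`."""
    normalized = {}
    for sample, gene_counts in counts.items():
        total = sum(gene_counts.values())
        factor = target / total if total else 0.0
        normalized[sample] = {gene: c * factor for gene, c in gene_counts.items()}
    return normalized

# Invented counts: sampleB was sequenced twice as deeply as sampleA.
raw = {
    "sampleA": {"gene1": 100, "gene2": 900},
    "sampleB": {"gene1": 200, "gene2": 1800},
}
norm = scale_by_library_size(raw)
print(norm["sampleA"]["gene1"], norm["sampleB"]["gene1"])  # 100000.0 100000.0
```

After scaling, the two samples give the same value for gene1, so the two-fold difference in raw counts is attributed to sequencing depth rather than to expression.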
…But Software Continues to Evolve
HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts)
– Kim et al., 2015, Nat. Methods
– Planned to be TopHat3
– Faster than other aligners
– More accurate on simulated reads
…But Software Continues to Evolve
StringTie
– Pertea et al., 2015, Nat. Biotech
– Probable successor to Cufflinks2
– Assembles more transcripts (based on simulated reads)
Ballgown
– Frazee et al., 2015, Nat. Biotech
– Bioconductor R package
– Probable successor to Cuffdiff2
– Includes useful Tablemaker preprocessor
A New Potential Game-Changer (2015)
Kallisto (“Near-Optimal RNA-Seq Quantification”)
– Bray et al.
– Extremely fast; uses pseudo-alignment based on k-mers and de Bruijn graphs
[Figure panels: Speed, Accuracy]
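The idea behind k-mer based pseudo-alignment can be illustrated with a toy: instead of aligning a read base by base, look up each of its k-mers and intersect the sets of transcripts that contain them. The sketch below is a deliberately simplified assumption of that idea. It uses a plain dictionary rather than a de Bruijn graph, ignores strand and sequencing errors, and is not Kallisto's implementation. The transcript names and sequences are invented.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def pseudoalign(read, index, k):
    """Return the transcripts compatible with every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            return set()  # toy rule: an unseen k-mer means no compatible transcript
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()

# Invented transcripts that share a prefix and differ at the 3' end.
transcripts = {"t1": "ACGTACGGA", "t2": "ACGTACCTT"}
index = build_kmer_index(transcripts, k=4)
print(sorted(pseudoalign("ACGTAC", index, k=4)))  # ['t1', 't2']
print(sorted(pseudoalign("GTACGG", index, k=4)))  # ['t1']
```

The first read falls in the shared region and stays ambiguous between both transcripts; the second contains k-mers unique to t1. A quantification step would then apportion ambiguous reads across their compatible transcripts.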
A Few Words About Bacterial RNA-Seq
Eukaryotic and Bacterial Gene Structures are Different
Eukaryotes
– Gene structure includes introns and exons
– Splicing, poly-adenylation
– Each mRNA is a discrete molecule when translated
Bacteria / Prokaryotes
– Individual genes and groups of genes in operons
– Generally, no splicing, no polyA
– One mRNA can contain coding sequences for multiple proteins
Bacterial RNA-Seq Considerations
rRNA depletion strategies may leave considerable amounts of non-coding RNA molecules
Splicing-aware aligners (such as TopHat) may not be useful
Reads from polycistronic mRNA may overlap two genes
– How would HTSeq-count handle this?
Compare alignments to the genome with alignments to the transcriptome
– Some aligners, such as bwa-mem, will report secondary alignments
– Transcriptome alignments can be used to generate a counts table for edgeR
Specialized software, such as Rockhopper (stand-alone,