Published by Chloe Anthony. Modified over 8 years ago.
1
RNA-Seq with the Tuxedo Suite
Monica Britton, Ph.D., Sr. Bioinformatics Analyst
September 2015 Workshop
2
The Basic Tuxedo Suite
References
– Trapnell C, et al. 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics
– Trapnell C, et al. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology
– Kim D, et al. 2011. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology
– Roberts A, et al. 2011. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology
– Roberts A, et al. 2011. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics
– Trapnell C, et al. 2013. Differential analysis of gene regulation at transcript resolution with RNA-Seq. Nature Biotechnology
Cufflinks assembles transcripts
Cuffdiff identifies differential expression of genes/transcripts/promoters
3
Alignment and Differential Expression
Workflow (from the slide diagram):
– TopHat: read set(s) + existing annotation (GTF) → bam file(s)
– Cuffdiff: bam file(s) → Toptables, etc.
We followed these steps with the single-end reads.
4
But, do we have all the genes?
For organisms with genomes, gene models are stored in gtf files
Assumptions:
– The gtf file contains annotation for ALL transcripts and genes
– All splice sites, start/stop codons, etc. are correct
Are these assumptions correct for every sequenced organism?
RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation
Method used depends on how much sequence information there is for the organism…
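One rough way to see what a gtf file actually contains is to tally its feature types and count the distinct gene and transcript IDs. The sketch below assumes only the standard nine-column, tab-separated GTF layout; the sample records and the function name `summarize_gtf` are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical GTF records (tab-separated, nine columns), for illustration only.
SAMPLE_GTF = "\n".join([
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t1";',
    'chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t1";',
    'chr1\tdemo\texon\t100\t250\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t2";',
    'chr1\tdemo\tCDS\t120\t200\t.\t+\t0\tgene_id "g1"; transcript_id "g1.t1";',
    'chr2\tdemo\texon\t50\t90\t.\t-\t.\tgene_id "g2"; transcript_id "g2.t1";',
])

# Matches attribute pairs of the form: key "value";
ATTR = re.compile(r'(\S+) "([^"]*)";')


def summarize_gtf(text):
    """Count feature types and distinct gene/transcript IDs in GTF text."""
    features = Counter()
    genes, transcripts = set(), set()
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GTF record
        features[cols[2]] += 1
        attrs = dict(ATTR.findall(cols[8]))
        if "gene_id" in attrs:
            genes.add(attrs["gene_id"])
        if "transcript_id" in attrs:
            transcripts.add(attrs["transcript_id"])
    return {
        "features": dict(features),
        "genes": len(genes),
        "transcripts": len(transcripts),
    }


summary = summarize_gtf(SAMPLE_GTF)
print(summary)
```

A summary like this says nothing about whether the annotation is complete or correct; it only shows what is there to begin with.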
5
Gene Construction (Alignment) vs. Assembly
[Figure: Haas and Zody (2010) Nat. Biotech. 28:421-3. Labels: Novel or Non-Model Organisms; Genome-Sequenced Organisms; Trinity software]
6
Gene / Transcriptome Construction
Annotation can be improved, even for well-annotated model organisms
– Identify all expressed exons
– Combine expressed exons into genes
– Find all splice variants for a gene
– Discover novel transcripts
For newly sequenced organisms
– Validate ab initio annotation
– Comparison between different annotation sets
Can assist in finding some types of contamination
– Reconstruction of rRNA genes
– Genomic/mitochondrial DNA in RNA library preps
7
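Reference Annotation Based Transcript (RABT) Assembly
Workflow (from the slide diagram):
– TopHat: read set(s) + existing annotation (GTF) [optional] → bam file(s)
– Cufflinks: bam file(s) → read-set specific GTF(s)
– Cuffmerge: read-set specific GTF(s) → merged GTF
– Cuffcompare: merged GTF → final assembly (GTF and stats)
– Cuffdiff: bam file(s) + final assembly → Toptables, etc.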
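The RABT workflow can be sketched as a list of command lines, one per tool. This is a minimal sketch, not the workshop's actual commands: the file names, output directories, and the helper `rabt_commands` are hypothetical, and the flags are given as recalled from the TopHat and Cufflinks manuals, so check them against `--help` for the installed versions. Nothing is executed; the commands are only printed.

```python
def rabt_commands(samples, index, genome_fa, ref_gtf=None, threads=4):
    """Build (but do not run) Tuxedo command lines.

    samples: dict mapping a sample label to a FASTQ path (hypothetical names).
    index: Bowtie index prefix; genome_fa: genome FASTA; ref_gtf: optional GTF.
    """
    cmds, bams = [], []
    for label, fastq in samples.items():
        out_dir = f"{label}_tophat"
        tophat = ["tophat", "-p", str(threads), "-o", out_dir]
        if ref_gtf:
            tophat += ["-G", ref_gtf]  # supply known gene models to the aligner
        cmds.append(tophat + [index, fastq])

        bam = f"{out_dir}/accepted_hits.bam"  # TopHat's alignment output
        bams.append(bam)
        cufflinks = ["cufflinks", "-p", str(threads), "-o", f"{label}_cufflinks"]
        if ref_gtf:
            cufflinks += ["-g", ref_gtf]  # GTF as a guide (RABT), not a restriction
        cmds.append(cufflinks + [bam])

    # assemblies.txt is assumed to list each sample's transcripts.gtf, one per line.
    merge = ["cuffmerge", "-o", "merged_asm", "-s", genome_fa]
    if ref_gtf:
        merge += ["-g", ref_gtf]
    cmds.append(merge + ["assemblies.txt"])

    merged = "merged_asm/merged.gtf"
    if ref_gtf:
        # Compare the merged assembly with the reference annotation.
        cmds.append(["cuffcompare", "-r", ref_gtf, "-o", "cuffcmp", merged])

    cmds.append(
        ["cuffdiff", "-o", "cuffdiff_out", "-L", ",".join(samples), merged] + bams
    )
    return cmds


cmds = rabt_commands(
    {"ctrl": "ctrl.fq", "treat": "treat.fq"}, "genome_index", "genome.fa", "genes.gtf"
)
for cmd in cmds:
    print(" ".join(cmd))
```

Leaving out `ref_gtf` gives the no-annotation variant of the same workflow, with the Cuffcompare step skipped.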
8
TopHat Spliced Alignment to a Genome
9
Reference Annotation Based Transcript (RABT) Assembly
10
Cufflinks – Identification of Incompatible Fragments
[Figure label: Incompatible alignment]
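A simplified way to think about incompatibility: two spliced alignments cannot come from the same transcript if one of them has aligned bases where the other implies an intron. The sketch below illustrates only that idea; it is not Cufflinks' implementation, and the coordinates and function names are hypothetical.

```python
def introns(blocks):
    """Gaps between consecutive aligned blocks, i.e. the implied introns."""
    return [(left[1], right[0]) for left, right in zip(blocks, blocks[1:])]


def overlaps(x, y):
    """True if half-open intervals x and y share at least one base."""
    return x[0] < y[1] and y[0] < x[1]


def compatible(frag_a, frag_b):
    """Fragments are sorted lists of half-open (start, end) aligned blocks.

    They are called incompatible here if an intron implied by one fragment
    overlaps an aligned block of the other.
    """
    for first, second in ((frag_a, frag_b), (frag_b, frag_a)):
        for intron in introns(first):
            if any(overlaps(intron, block) for block in second):
                return False
    return True


spliced = [(100, 150), (300, 350)]       # implies an intron at 150-300
same_junction = [(130, 150), (300, 320)]  # uses the same intron
retained = [(120, 200)]                   # aligned bases inside that intron

print(compatible(spliced, same_junction))  # True
print(compatible(spliced, retained))       # False
```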
11
Cufflinks – Minimum Paths to Transcripts
12
Cufflinks – Abundance Estimation
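Cufflinks reports abundance in FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below shows only the arithmetic behind that unit, using made-up numbers; Cufflinks itself estimates isoform abundances with a statistical model, which this does not attempt to reproduce.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Hypothetical values: 500 fragments on a 2,000 bp transcript,
# in a library with 10 million mapped fragments.
value = fpkm(500, 2_000, 10_000_000)
print(value)  # 25.0
```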
14
Merging Cufflinks Assemblies
15
So Now We’ve Explored These Tools…
16
We’ve Used Other Software in Conjunction
HTSeq-count → raw counts → edgeR
(But HTSeq-count and edgeR are independent)
17
And Then Came Some Extensions…
18
Modules Introduced in 2014
Cuffquant
– Improves efficiency of running multiple samples
– Stores data in compressed “.cxb” format, which can later be analyzed with cuffdiff or cuffnorm
Cuffnorm
– Generates tables of expression values that are normalized for library size
– Tables are used as input to Monocle
Monocle
– Used to analyze single-cell expression data
– Trapnell, et al., 2014, Nat. Biotech. 32:381
19
…But Software Continues to Evolve
HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts)
– Kim et al., 2015, Nat. Methods
– Planned to be TopHat3
– Faster than other aligners
– More accurate on simulated reads
20
…But Software Continues to Evolve
StringTie
– Pertea et al., 2015, Nat. Biotech
– Probable successor to Cufflinks2
– Assembles more transcripts (based on simulated reads)
Ballgown
– Frazee et al., 2015, Nat. Biotech
– Bioconductor R package
– Probable successor to Cuffdiff2
– Includes useful Tablemaker preprocessor
21
A New Potential Game-Changer (2015)
Kallisto (“Near-Optimal RNA-Seq Quantification”)
– Bray et al. (http://arxiv.org/abs/1505.02710)
– Extremely fast; uses pseudo-alignment based on k-mers and de Bruijn graphs
[Figure panels: Speed; Accuracy]
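The core idea of k-mer based pseudo-alignment can be sketched in a few lines: index which transcripts contain each k-mer, then intersect those sets across a read's k-mers to find the transcripts the read is compatible with, without computing a base-level alignment. This is a toy illustration, not Kallisto's implementation: it has no de Bruijn graph, the sequences are made up, and skipping k-mers that are absent from the index is a simplification chosen here.

```python
K = 5

# Hypothetical transcript sequences, for illustration only.
TRANSCRIPTS = {
    "t1": "ACGTACGTTAGC",
    "t2": "ACGTACGAATCC",
    "t3": "GGGTTTCCCAAA",
}


def build_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with all indexed k-mers of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue  # simplification: ignore k-mers not seen in any transcript
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


INDEX = build_index(TRANSCRIPTS, K)
print(sorted(pseudoalign("ACGTACG", INDEX, K)))   # shared prefix of t1 and t2
print(sorted(pseudoalign("TACGTTAG", INDEX, K)))  # only found in t1
```

A read from the shared region stays ambiguous between t1 and t2; resolving that ambiguity is the job of the quantification step, which is not shown.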
22
A Few Words About Bacterial RNA-Seq
23
Eukaryotic and Bacterial Gene Structures are Different
Eukaryotes
– Gene structure includes introns and exons
– Splicing, poly-adenylation
– Each mRNA is a discrete molecule when translated
Bacteria / Prokaryotes
– Individual genes and groups of genes in operons
– Generally, no splicing, no polyA
– One mRNA can contain coding sequences for multiple proteins
24
Bacterial RNA-Seq Considerations
rRNA depletion strategies may leave considerable amounts of non-coding RNA molecules
Splicing-aware aligners (such as TopHat) may not be useful
Reads from polycistronic mRNA may overlap two genes
– How would HTSeq-count handle this?
Compare alignments to the genome with alignments to the transcriptome
– Some aligners, such as bwa-mem, will report secondary alignments
– Transcriptome alignments can be used to generate a counts table for edgeR
Specialized software, such as Rockhopper (stand-alone, http://cs.wellesley.edu/~btjaden/Rockhopper/)
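On the polycistronic question: in htseq-count's default "union" mode, a read that overlaps more than one gene is tallied as `__ambiguous` rather than assigned to either gene. The sketch below mimics only that rule with hypothetical coordinates; it ignores strand and the other overlap modes, so treat it as an illustration rather than a reimplementation.

```python
# Hypothetical adjacent genes in one operon, as half-open (start, end) intervals.
GENES = {"geneA": (100, 400), "geneB": (420, 700)}


def assign_read(read_start, read_end, genes):
    """Assign a read to a gene using a union-style overlap rule."""
    hits = {
        name
        for name, (start, end) in genes.items()
        if read_start < end and start < read_end
    }
    if not hits:
        return "__no_feature"
    if len(hits) > 1:
        return "__ambiguous"  # e.g. a read spanning two genes of an operon
    return next(iter(hits))


print(assign_read(150, 250, GENES))  # geneA
print(assign_read(350, 450, GENES))  # __ambiguous
print(assign_read(800, 900, GENES))  # __no_feature
```

With many operon-spanning reads, a noticeable share of the library can end up in the ambiguous bin, which is one reason to compare against transcriptome alignments.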