RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.

1 RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop

2 The Basic Tuxedo Suite
References
Trapnell C, et al. 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics
Trapnell C, et al. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology
Kim D, et al. 2013. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology
Roberts A, et al. 2011. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology
Roberts A, et al. 2011. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics
Trapnell C, et al. 2013. Differential analysis of gene regulation at transcript resolution with RNA-Seq. Nature Biotechnology
Cufflinks assembles transcripts
Cuffdiff identifies differential expression of genes/transcripts/promoters

3 Alignment and Differential Expression
Workflow diagram: Read set(s) + Existing annotation (GTF) → TopHat → BAM file(s) → Cuffdiff → Toptables, etc.
We followed these steps with the single-end reads.
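The two-step workflow on this slide can be sketched as a pair of command builders. This is a minimal sketch, not the workshop's own script: the file names, sample labels, and thread count are hypothetical, and the commands are only assembled as argument lists, not executed. The flags used (`-p`, `-G`, `-o`, `-L`) are standard TopHat and Cuffdiff options.

```python
from typing import Dict, List, Sequence


def tophat_cmd(index_prefix: str, reads: str, gtf: str, out_dir: str,
               threads: int = 4) -> List[str]:
    """Build a TopHat call for one single-end read set.

    -G supplies the existing annotation, -o the output directory
    (TopHat writes accepted_hits.bam there), -p the thread count.
    """
    return ["tophat", "-p", str(threads), "-G", gtf, "-o", out_dir,
            index_prefix, reads]


def cuffdiff_cmd(gtf: str, groups: Dict[str, Sequence[str]], out_dir: str,
                 threads: int = 4) -> List[str]:
    """Build a Cuffdiff call.

    `groups` maps a condition label to its replicate BAM files.
    Replicates are comma-joined; conditions are separate arguments.
    """
    labels = ",".join(groups)
    bam_lists = [",".join(bams) for bams in groups.values()]
    return ["cuffdiff", "-p", str(threads), "-o", out_dir,
            "-L", labels, gtf] + bam_lists


# Hypothetical file names, for illustration only.
align = tophat_cmd("genome_index", "ctrl_1.fastq", "genes.gtf", "tophat_ctrl_1")
diff = cuffdiff_cmd(
    "genes.gtf",
    {"ctrl": ["tophat_ctrl_1/accepted_hits.bam", "tophat_ctrl_2/accepted_hits.bam"],
     "treat": ["tophat_treat_1/accepted_hits.bam"]},
    "cuffdiff_out",
)
print(" ".join(align))
print(" ".join(diff))
```

Building argument lists first makes it easy to inspect the exact command before handing it to `subprocess.run`.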

4 But, do we have all the genes?
For organisms with genomes, gene models are stored in GTF files
Assumptions:
– The GTF file contains annotation for ALL transcripts and genes
– All splice sites, start/stop codons, etc. are correct
Are these assumptions correct for every sequenced organism?
RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation
Method used depends on how much sequence information there is for the organism…
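A quick way to see what a GTF file actually contains is to count its genes, transcripts, and exons. The sketch below assumes the usual nine tab-separated GTF columns with `gene_id` and `transcript_id` in the attribute column; the example records and their identifiers (`geneA`, `geneB`) are made up for illustration.

```python
import re
from typing import Dict, Iterable, Set

# Matches attribute pairs such as: gene_id "geneA";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field: str) -> Dict[str, str]:
    """Turn the ninth GTF column into a key -> value dictionary."""
    return dict(ATTR_RE.findall(field))


def summarize_gtf(lines: Iterable[str]) -> Dict[str, int]:
    """Count distinct genes and transcripts, and the number of exon records."""
    genes: Set[str] = set()
    transcripts: Set[str] = set()
    exons = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GTF record
        attrs = parse_attributes(cols[8])
        if "gene_id" in attrs:
            genes.add(attrs["gene_id"])
        if attrs.get("transcript_id"):
            transcripts.add(attrs["transcript_id"])
        if cols[2] == "exon":
            exons += 1
    return {"genes": len(genes), "transcripts": len(transcripts), "exons": exons}


# Hypothetical records: two genes, three transcripts, four exons.
EXAMPLE = [
    'chr1\ttest\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr1\ttest\texon\t300\t400\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr1\ttest\texon\t100\t250\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.2";',
    'chr1\ttest\texon\t900\t1000\t.\t-\t.\tgene_id "geneB"; transcript_id "geneB.1";',
]
summary = summarize_gtf(EXAMPLE)
print(summary)
```

Running the same summary on the reference annotation and on an assembled GTF gives a first, rough indication of whether the assembly found transcripts the annotation lacks.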

5 Gene Construction (Alignment) vs. Assembly
Figure: Haas and Zody (2010) Nat. Biotech. 28:421-3
Figure labels: Genome-Sequenced Organisms; Novel or Non-Model Organisms; Trinity software

6 Gene / Transcriptome Construction
Annotation can be improved – even for well-annotated model organisms
– Identify all expressed exons
– Combine expressed exons into genes
– Find all splice variants for a gene
– Discover novel transcripts
For newly sequenced organisms
– Validate ab initio annotation
– Comparison between different annotation sets
Can assist in finding some types of contamination
– Reconstruction of rRNA genes
– Genomic/mitochondrial DNA in RNA library preps

7 Reference Annotation Based Transcript (RABT) Assembly
Tools: TopHat → Cufflinks → Cuffmerge → Cuffcompare → Cuffdiff
Inputs: Read set(s); Existing annotation (GTF) [optional]
Intermediate and final outputs: BAM file(s); Read-set specific GTF(s); Merged GTF; Final assembly (GTF and stats); Toptables, etc.
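The assembly steps of the RABT pipeline (Cufflinks, Cuffmerge, Cuffcompare) can be sketched in the same command-builder style. The file names below are hypothetical and nothing is executed. The reference annotation is treated as optional, matching the "[optional]" note on the slide: when it is given, Cufflinks receives it through `-g` (GTF guide), which is the option that enables RABT assembly.

```python
from typing import List, Optional


def cufflinks_cmd(bam: str, out_dir: str, ref_gtf: Optional[str] = None,
                  threads: int = 4) -> List[str]:
    """Assemble transcripts for one read set; -g uses the annotation as a guide."""
    cmd = ["cufflinks", "-p", str(threads), "-o", out_dir]
    if ref_gtf is not None:
        cmd += ["-g", ref_gtf]
    return cmd + [bam]


def cuffmerge_cmd(assembly_list: str, genome_fasta: str, out_dir: str,
                  ref_gtf: Optional[str] = None, threads: int = 4) -> List[str]:
    """Merge the read-set specific GTFs listed (one path per line) in assembly_list."""
    cmd = ["cuffmerge", "-p", str(threads), "-o", out_dir, "-s", genome_fasta]
    if ref_gtf is not None:
        cmd += ["-g", ref_gtf]
    return cmd + [assembly_list]


def cuffcompare_cmd(merged_gtf: str, ref_gtf: str, out_prefix: str) -> List[str]:
    """Compare the merged assembly against the reference annotation."""
    return ["cuffcompare", "-r", ref_gtf, "-o", out_prefix, merged_gtf]


# Hypothetical paths, for illustration only.
steps = [
    cufflinks_cmd("tophat_ctrl_1/accepted_hits.bam", "cufflinks_ctrl_1", "genes.gtf"),
    cuffmerge_cmd("assemblies.txt", "genome.fa", "merged_asm", "genes.gtf"),
    cuffcompare_cmd("merged_asm/merged.gtf", "genes.gtf", "cuffcmp"),
]
for step in steps:
    print(" ".join(step))
```

Each Cufflinks run writes a `transcripts.gtf` into its output directory; those paths are what would go into the hypothetical `assemblies.txt`.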

8 TopHat Spliced Alignment to a Genome

9 Reference Annotation Based Transcript (RABT) Assembly

10 Cufflinks – Identification of Incompatible Fragments Incompatible alignment

11 Cufflinks – Minimum Paths to Transcripts

12 Cufflinks – Abundance Estimation
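Cufflinks reports abundance in FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below shows only the unit itself, as a simple ratio; Cufflinks derives its estimates from a statistical model that assigns fragments to isoforms, so this is not its actual procedure. The numbers in the example are made up.

```python
def fpkm(fragments: float, transcript_length_bp: float,
         total_mapped_fragments: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments.

    fragments / (length_bp / 1e3) / (total / 1e6)
        == fragments * 1e9 / (length_bp * total)
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Made-up example: 500 fragments on a 2,000 bp transcript,
# in a library with 10 million mapped fragments.
example_value = fpkm(500, 2000, 10_000_000)
print(example_value)
```

Dividing by both transcript length and library size is what makes values comparable between long and short transcripts and between deeply and shallowly sequenced samples.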

13

14 Merging Cufflinks Assemblies

15 So Now We’ve Explored These Tools…

16 We’ve Used Other Software in Conjunction
HTSeq-count → Raw Counts → edgeR
(But HTSeq-count and edgeR are independent)
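Because HTSeq-count and edgeR are independent, the per-sample count files have to be combined into one table before edgeR can read them. A minimal sketch of that merge follows, assuming htseq-count's two-column tab-separated output in which summary rows start with `__` (for example `__no_feature`). The sample names and counts are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def read_htseq(lines: Iterable[str]) -> Dict[str, int]:
    """Parse one htseq-count output, dropping the '__' summary rows."""
    counts: Dict[str, int] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        gene, value = line.split("\t")
        if gene.startswith("__"):
            continue  # e.g. __no_feature, __ambiguous
        counts[gene] = int(value)
    return counts


def merge_counts(per_sample: Dict[str, Iterable[str]]
                 ) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (sample names, sorted gene ids, genes x samples count table)."""
    parsed = {sample: read_htseq(lines) for sample, lines in per_sample.items()}
    samples = list(parsed)
    genes = sorted(set().union(*(c.keys() for c in parsed.values())))
    table = [[parsed[s].get(g, 0) for s in samples] for g in genes]
    return samples, genes, table


def to_tsv(samples: List[str], genes: List[str], table: List[List[int]]) -> str:
    """Render the table as tab-separated text with a header row."""
    rows = ["\t".join(["gene"] + samples)]
    rows += ["\t".join([g] + [str(v) for v in row]) for g, row in zip(genes, table)]
    return "\n".join(rows)


# Hypothetical htseq-count outputs for two samples.
inputs = {
    "ctrl": ["geneA\t10\n", "geneB\t0\n", "__no_feature\t5\n"],
    "treat": ["geneA\t7\n", "geneB\t3\n", "__ambiguous\t2\n"],
}
samples, genes, table = merge_counts(inputs)
print(to_tsv(samples, genes, table))
```

The resulting tab-separated text can be written to a file and loaded in R as the raw counts matrix for edgeR.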

17 And Then Came Some Extensions…

18 Modules Introduced in 2014
Cuffquant
– Improves efficiency of running multiple samples
– Stores data in compressed “.cxb” format that can later be analyzed with cuffdiff or cuffnorm
Cuffnorm
– Generates tables of expression values that are normalized for library size
– Tables are used as input to Monocle
Monocle
– Used to analyze single-cell expression data
– Trapnell, et al., 2014, Nat. Biotech. 32:381
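The Cuffquant and Cuffnorm steps can be sketched as command builders: Cuffquant is run once per sample and writes an `abundances.cxb` file into its output directory, and those files are then passed to Cuffnorm (or Cuffdiff) grouped by condition. The sample names and paths are hypothetical, and the commands are only assembled, not run.

```python
from typing import Dict, List, Sequence


def cuffquant_cmd(gtf: str, bam: str, out_dir: str, threads: int = 4) -> List[str]:
    """Quantify one sample; the result is <out_dir>/abundances.cxb."""
    return ["cuffquant", "-p", str(threads), "-o", out_dir, gtf, bam]


def cuffnorm_cmd(gtf: str, groups: Dict[str, Sequence[str]], out_dir: str,
                 threads: int = 4) -> List[str]:
    """Normalize across samples; `groups` maps condition label -> .cxb files."""
    labels = ",".join(groups)
    cxb_lists = [",".join(files) for files in groups.values()]
    return ["cuffnorm", "-p", str(threads), "-o", out_dir,
            "-L", labels, gtf] + cxb_lists


# Hypothetical samples, for illustration only.
bams = {"ctrl_1": "tophat_ctrl_1/accepted_hits.bam",
        "treat_1": "tophat_treat_1/accepted_hits.bam"}
quant_cmds = [cuffquant_cmd("merged.gtf", bam, f"cuffquant_{name}")
              for name, bam in bams.items()]
norm_cmd = cuffnorm_cmd(
    "merged.gtf",
    {"ctrl": ["cuffquant_ctrl_1/abundances.cxb"],
     "treat": ["cuffquant_treat_1/abundances.cxb"]},
    "cuffnorm_out",
)
for cmd in quant_cmds + [norm_cmd]:
    print(" ".join(cmd))
```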

19 …But Software Continues to Evolve
HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts)
– Kim et al., 2015, Nat. Methods
– Planned to be TopHat3
– Faster than other aligners
– More accurate on simulated reads

20 …But Software Continues to Evolve
StringTie
– Pertea et al., 2015, Nat. Biotech
– Probable successor to Cufflinks2
– Assembles more transcripts (based on simulated reads)
Ballgown
– Frazee et al., 2015, Nat. Biotech
– Bioconductor R package
– Probable successor to Cuffdiff2
– Includes useful Tablemaker preprocessor

21 A New Potential Game-Changer (2015)
Kallisto (“Near-Optimal RNA-Seq Quantification”)
– Bray et al. (http://arxiv.org/abs/1505.02710)
– Extremely fast, uses pseudo-alignment based on k-mers and de Bruijn graphs
Figure panels: Speed; Accuracy
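The idea behind k-mer based pseudo-alignment can be illustrated with a toy example: instead of aligning a read base by base, look up each of its k-mers in an index of transcript k-mers and intersect the resulting transcript sets. This is a deliberately simplified sketch, not Kallisto's implementation (it uses a plain dictionary rather than a de Bruijn graph, ignores strand, and simply skips k-mers that are missing from the index). The transcript sequences are made up.

```python
from collections import defaultdict
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of transcripts that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def pseudoalign(read: str, index: Dict[str, Set[str]], k: int) -> Set[str]:
    """Transcripts compatible with every indexed k-mer of the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer)
        if hits is None:
            continue  # simplification: ignore k-mers absent from the index
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Two made-up transcripts that share a prefix and then diverge.
TRANSCRIPTS = {"t1": "ACGTACGGTTCA", "t2": "ACGTACGCCTGA"}
K = 5
INDEX = build_index(TRANSCRIPTS, K)

shared = pseudoalign("ACGTAC", INDEX, K)   # lies in the shared prefix
unique = pseudoalign("GTACGG", INDEX, K)   # crosses into t1-specific sequence
absent = pseudoalign("TTTTTT", INDEX, K)   # matches nothing
print(shared, unique, absent)
```

The output of this step is a set of compatible transcripts per read rather than coordinates, which is the information quantification needs and part of why the approach is fast.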

22 A Few Words About Bacterial RNA-Seq

23 Eukaryotic and Bacterial Gene Structures are Different
Eukaryotes
– Gene structure includes introns and exons
– Splicing, poly-adenylation
– Each mRNA is a discrete molecule when translated
Bacteria / Prokaryotes
– Individual genes and groups of genes in operons
– Generally, no splicing, no polyA
– One mRNA can contain coding sequences for multiple proteins

24 Bacterial RNA-Seq Considerations
rRNA depletion strategies may leave considerable amounts of non-coding RNA molecules
Splicing-aware aligners (such as TopHat) may not be useful
Reads from polycistronic mRNA may overlap two genes
– How would HTSeq-count handle this?
Compare alignments to the genome with alignments to the transcriptome
– Some aligners, such as bwa-mem, will report secondary alignments
– Transcriptome alignments can be used to generate a counts table for edgeR
Specialized software, such as Rockhopper (stand-alone, http://cs.wellesley.edu/~btjaden/Rockhopper/)
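On the question of reads from polycistronic mRNA: under a union-style overlap rule (union is HTSeq-count's default mode), a read that overlaps two genes is counted for neither and is reported as ambiguous. The sketch below is a simplified illustration of that rule, not HTSeq-count's code; it ignores strand and uses made-up coordinates for two adjacent genes in a hypothetical operon.

```python
from typing import Dict, Tuple


def assign_read(read_start: int, read_end: int,
                genes: Dict[str, Tuple[int, int]]) -> str:
    """Assign a read to a gene using a union-style overlap rule.

    Coordinates are 1-based and inclusive. The read is assigned only
    when exactly one gene overlaps it.
    """
    overlapping = [name for name, (start, end) in genes.items()
                   if read_start <= end and start <= read_end]
    if not overlapping:
        return "__no_feature"
    if len(overlapping) > 1:
        return "__ambiguous"
    return overlapping[0]


# Hypothetical operon: geneA and geneB separated by a short spacer.
OPERON = {"geneA": (100, 500), "geneB": (520, 900)}

inside = assign_read(200, 299, OPERON)     # entirely within geneA
spanning = assign_read(450, 549, OPERON)   # overlaps both genes
spacer = assign_read(505, 515, OPERON)     # falls between the genes
print(inside, spanning, spacer)
```

In a densely packed bacterial genome, the number of reads lost to the ambiguous category is worth checking in the summary rows of the count output.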

