The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Transcriptome assembly with SOAPdenovo Trans
RNA-Seq without a reference genome. Overview of work: Generate Sequence; QC and Processing; Transcriptome Assembly; Characterize Transcript Abundance; Visualization. We will focus mostly on the transcriptome assembly itself, as this is the greatest challenge of working without a reference genome and involves several steps.
Transcriptome assembly. Reminder: you’ll need to do some thinking and reading here. Workflow: good read data; assemble transcriptome; evaluate and refine; on to mapping. The most costly process (in time and compute) is empirically determining which parameters yield the best assembly for the data and organism. We can suggest some good directions, but we can’t promise “right” answers. Get familiar with examples of work and reasoning on approaches relevant to your organism.
Transcriptome assembly from RNA-Seq. Challenges (just a few of them): DNA assembly assumes even sequencing depth, which does not hold for RNA-Seq (e.g. extra reads indicate a repetitive region in DNA but higher expression in RNA); de novo assembly from RNA-Seq also needs higher coverage (30X or more). Sequencing error correction, especially in highly expressed transcripts. Multiple transcript variants (splicing) can confound assembly.
Transcriptome assembly from RNA-Seq. Some reminders: Practically the only agreement on which software is the right one to use is that there is no “right” one (we’ll need to experiment*). Don't: use less than 200 to 500 Million RNA reads, mate-paired, of 100 bp or better length, high quality, and expect to get a complete transcriptome.(1) Despite the challenges, making your own transcriptome is still very useful and perhaps more practical than assembling your own genome. 1. Don Gilbert, How to get Best mRNA Transcript assemblies, 2013 Jan. http://eugenes.org/EvidentialGene/
Transcriptome assembly from RNA-Seq Overview Pre-analysis: Data generation
Experimental Consideration – Library prep. Generate the best data you can! Remove ribosomal RNAs. Two options, each with pros and cons: Poly(A) selection – effective, but will miss ncRNAs and non-polyadenylated transcripts. rRNA depletion – hybridize rRNAs and remove them, but this may introduce different biases (e.g. against highly expressed transcripts).
Experimental Consideration – Library prep. Generate the best data you can! PCR amplification: most protocols have a PCR amplification step, which introduces bias (e.g. against high GC content). Some alternative protocols or technologies (PacBio) can avoid amplification, but they have their own issues.
Experimental Consideration – Library prep. Generate the best data you can! Strand specificity: if possible, using a strand-specific protocol can simplify future analyses.
Experimental Consideration – Library prep. Generate the best data you can! Figure panels: No orientation; Strandedness preserved. Source: http://www.giga.ulg.ac.be/jcms/prod_1025901/en/transcriptome-analysis-with-strand-specific-libraries
Trinity Overview Analysis
Trinity Overview. Trinity: a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly. Reference: Manfred G. Grabherr, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat Biotechnol. 2011 May 15; 29(7): 644–652.
Trinity Overview. Trinity aggregates isolated transcript graphs. Source: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_2014/rnaseq_workshop_slides.pdf
Why Trinity? Some comparisons. Source: Manfred G. Grabherr, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat Biotechnol. 2011 May 15; 29(7): 644–652.
Trinity Overview. Inchworm first finds the node with the highest number of reads and extends it on both sides by picking the highest-intensity nodes. Inchworm will find three segments: A, C, and the blue part of B. Chrysalis takes the three segments determined by Inchworm and clusters them into two groups that correspond to two genes. Butterfly reconstructs the full gene structures, including alternate splice forms A and B. Source: http://www.homolog.us/blogs/blog/2011/08/25/de-novo-transcriptome-assemblers-oases-trinity-etc-iv/
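The greedy extension idea described for Inchworm can be illustrated with a toy sketch. This is not Trinity's implementation: the function names, the example reads, and the choice of k = 4 are assumptions made for illustration, and real data would also need error handling, reverse complements, and repeated seeding.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def greedy_contig(counts, k):
    """Seed with the most abundant k-mer, then extend right and left,
    each time choosing the most abundant unused k-mer that overlaps
    the contig end by k-1 bases (assumes k >= 2)."""
    seed = max(counts, key=counts.get)
    used = {seed}
    contig = seed

    # Extend to the right.
    while True:
        suffix = contig[-(k - 1):]
        options = [suffix + b for b in "ACGT"
                   if suffix + b in counts and suffix + b not in used]
        if not options:
            break
        best = max(options, key=counts.get)
        used.add(best)
        contig += best[-1]

    # Extend to the left.
    while True:
        prefix = contig[:k - 1]
        options = [b + prefix for b in "ACGT"
                   if b + prefix in counts and b + prefix not in used]
        if not options:
            break
        best = max(options, key=counts.get)
        used.add(best)
        contig = best[0] + contig

    return contig

# Hypothetical overlapping reads drawn from one short transcript.
reads = ["ATGGCGT", "GGCGTGC", "GCGTGCA"]
contig = greedy_contig(count_kmers(reads, 4), 4)
print(contig)
```

In this toy case the seed is the k-mer shared by all three reads, and extension in both directions recovers the sequence the reads were drawn from.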
Trinity: k-mers and De Bruijn graphs. Reads are split into k-mers. A De Bruijn graph is constructed from the k-mers. Source: Jeffrey A. Martin and Zhong Wang, Next-generation transcriptome assembly, Nat. Rev. Genet., doi:10.1038/nrg3068, published online 7 September 2011.
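The two steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the code any assembler uses: the function names, the example reads, and k = 4 are hypothetical, and nodes are taken to be (k-1)-mers with one edge per k-mer.

```python
from collections import defaultdict

def kmers(read, k):
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it.

    Every k-mer contributes one edge: prefix (first k-1 bases) to
    suffix (last k-1 bases). Duplicate k-mers collapse into one edge.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Hypothetical reads for illustration.
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because edges are stored in sets, k-mers seen in more than one read appear only once in the graph; a real assembler would also keep the counts.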
Trinity: k-mers and De Bruijn graphs. Redundancies are collapsed. Paths through the graph that explain the observed sequences generate the alignments. Source: Jeffrey A. Martin and Zhong Wang, Next-generation transcriptome assembly, Nat. Rev. Genet., doi:10.1038/nrg3068, published online 7 September 2011.
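One common way to collapse redundancy in a De Bruijn graph is to merge unbranched chains of nodes into single sequences. The sketch below shows that idea under assumptions: the function names, example reads, and k = 4 are hypothetical, the graph uses (k-1)-mer nodes, and isolated cycles are ignored. It is not a description of how Trinity itself simplifies its graphs.

```python
from collections import defaultdict

def build_graph(reads, k):
    """De Bruijn graph: (k-1)-mer node -> set of successor (k-1)-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def compact_paths(graph):
    """Merge unbranched chains of nodes into single sequences.

    A node is 'linear' when it has exactly one predecessor and one
    successor. Each path starts at a non-linear node and runs through
    linear nodes until the next non-linear node. Isolated cycles made
    only of linear nodes are not reported in this sketch.
    """
    indegree = defaultdict(int)
    for successors in graph.values():
        for nxt in successors:
            indegree[nxt] += 1
    nodes = set(graph) | set(indegree)

    def is_linear(node):
        return indegree[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_linear(node):
            continue  # interior node, reached while walking from a path start
        for nxt in sorted(graph.get(node, ())):
            seq = node + nxt[-1]
            while is_linear(nxt):
                nxt = next(iter(graph[nxt]))
                seq += nxt[-1]
            contigs.append(seq)
    return contigs

# Two hypothetical reads that share a start and then diverge,
# loosely mimicking two transcript variants.
reads = ["ATGGCGT", "ATGGCAA"]
contigs = compact_paths(build_graph(reads, k=4))
print(contigs)
```

The shared prefix becomes one sequence and each divergent end becomes its own sequence, overlapping the prefix by one node at the branch point.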
Trinity: Key Results. Trinity.fasta: file containing assembled transcripts. Trinity groups transcripts into clusters based on shared sequence content. Such a transcript cluster is very loosely referred to as a gene. This information is encoded in the Trinity FASTA accession, e.g.:
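Grouping transcripts by their "gene" cluster can be sketched as below. The accession layout is an assumption: the sketch expects identifiers of the form `TRINITY_DN1000_c115_g5_i1`, where everything up to `_g5` names the gene and `_i1` the isoform, which is the convention in recent Trinity releases and may differ from the version used in this workshop. The function name and the example records are hypothetical.

```python
import re
from collections import defaultdict

# Assumed accession layout: <prefix>_g<gene number>_i<isoform number>
ACCESSION = re.compile(r"^(?P<gene>\S+_g\d+)_i(?P<isoform>\d+)$")

def group_by_gene(fasta_lines):
    """Group transcript accessions from FASTA header lines by gene ID."""
    genes = defaultdict(list)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header
        accession = fields[0]
        match = ACCESSION.match(accession)
        if match:  # headers that do not fit the assumed layout are skipped
            genes[match.group("gene")].append(accession)
    return dict(genes)

# Hypothetical records for illustration; with a real assembly one
# would pass an open file handle for Trinity.fasta instead.
example = [
    ">TRINITY_DN1000_c115_g5_i1 len=247",
    "ACGTACGT",
    ">TRINITY_DN1000_c115_g5_i2 len=301",
    "ACGTTTGT",
    ">TRINITY_DN2000_c0_g1_i1 len=512",
    "GGGTACGT",
]
genes = group_by_gene(example)
for gene, isoforms in genes.items():
    print(gene, len(isoforms))
```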
Trinity: Where to go from here. Examine the quality of the assembly: N50 statistic; core gene representation. Transcript annotation: Trinotate.
Assembly quality: N50. Source: http://schatzlab.cshl.edu/
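N50 is the contig length L such that contigs of length L or longer contain at least half of the total assembled bases. A minimal sketch of the calculation follows; the function name and the example lengths are illustrative, not taken from the workshop data.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the
    length of the contig at which the running total first reaches
    half of the total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one contig length")

# Hypothetical contig lengths: total 290, half is 145,
# and 80 + 70 = 150 is the first running total to reach it.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))  # 70
```

For transcriptomes, N50 is best read alongside other checks such as core gene representation, since a handful of long transcripts can raise it without the assembly being more complete.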
CEGMA: How good is assembly coverage? Reference: CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics (2007) 23 (9): 1061-1067. doi: 10.1093/bioinformatics/btm071. First published online: March 1, 2007.
Keep asking: ask.iplantcollaborative.org
The iPlant Collaborative is funded by a grant from the National Science Foundation Plant Cyberinfrastructure Program (#DBI-0735191).