1
The iPlant Collaborative
Community Cyberinfrastructure for Life Science
Tools and Services Workshop
Transcriptome assembly with SOAPdenovo-Trans
2
RNA-Seq without a reference genome
Overview of work
Workflow stages shown on the slide: Transcriptome Assembly; Characterize; Transcript abundance; Visualization; Generate Sequence; QC and Processing.
We will focus mostly on the transcriptome assembly itself, as this is the greatest challenge of working without a reference genome and involves several steps.
3
Transcriptome assembly
Reminder: you’ll need to do some thinking and reading here.
Workflow stages shown on the slide: Assemble Transcriptome; Evaluate and refine; On to mapping; Good read data.
The most costly (time, compute) process is empirically determining which parameters yield the best assembly given the data and organism.
We can suggest some good directions, but we can’t promise “right” answers.
Get familiar with examples of work and reasoning on approaches relevant to your organism.
4
Transcriptome assembly from RNA-Seq
Challenges (just a few of them)
DNA assembly assumes even sequencing depth, which is not true in RNA-Seq (e.g. more reads indicate repetitive regions for DNA, but higher expression for RNA).
Higher coverage is also needed for de novo assembly from RNA-Seq (30X or more).
Sequencing error correction, especially in highly expressed transcripts.
Multiple transcript variants (splicing) can confound assembly.
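As a rough illustration of the "30X or more" guideline, the sketch below computes mean depth as total sequenced bases divided by transcriptome size. The 50 Mb transcriptome size and 100 bp read length are made-up values for illustration, not figures from the workshop. Because expression is uneven, a mean of 30X still leaves rare transcripts far below that depth.

```python
import math


def mean_coverage(num_reads: int, read_length: int, transcriptome_size: int) -> float:
    """Mean per-base depth: total sequenced bases / target size in bases."""
    if transcriptome_size <= 0:
        raise ValueError("transcriptome_size must be positive")
    return num_reads * read_length / transcriptome_size


def reads_needed(target_coverage: float, read_length: int, transcriptome_size: int) -> int:
    """Smallest number of reads whose total bases reach the target mean depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * transcriptome_size / read_length)


if __name__ == "__main__":
    # Hypothetical numbers: 50 Mb of expressed sequence, 100 bp reads.
    size = 50_000_000
    n = reads_needed(30, 100, size)
    print(n, mean_coverage(n, 100, size))
```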
5
Transcriptome assembly from RNA-Seq
Some reminders
Practically the only agreement on which software is the right one to use is that there is no “right” one (we’ll need to experiment*).
Don't use fewer than 200 to 500 million RNA reads (mate-paired, 100 bp or longer, high quality) and expect to get a complete transcriptome. (1)
Despite the challenges, making your own transcriptome is still very useful and perhaps more practical than assembling your own genome.
1. Don Gilbert, "How to get Best mRNA Transcript assemblies", January 2013.
6
Transcriptome assembly from RNA-Seq
Overview Pre-analysis: Data generation
7
Experimental Consideration – Library prep
Generate the best data you can!
Remove ribosomal RNAs: two options, each with pros and cons.
Poly(A) selection: effective, but will miss ncRNAs and non-polyadenylated transcripts.
rRNA depletion: hybridize rRNAs and remove them, but this may introduce different biases (e.g. against highly expressed transcripts).
8
Experimental Consideration – Library prep
Generate the best data you can!
PCR amplification: most protocols have a PCR amplification step, which introduces bias (e.g. against high GC content).
Some alternative protocols or technologies (PacBio) can avoid amplification but have their own issues.
9
Experimental Consideration – Library prep
Generate the best data you can!
Strand specificity: if possible, using a strand-specific protocol can simplify future analyses.
10
Experimental Consideration – Library prep
Generate the best data you can!
Figure panels: No orientation; Strandedness preserved.
11
Trinity Overview Analysis
12
Trinity Overview
Trinity: a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data.
Trinity combines three independent software modules: Inchworm, Chrysalis, Butterfly.
Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data : Manfred G. Grabherr, et al; Nat Biotechnol May 15; 29(7): 644–652.
13
Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data : Manfred G. Grabherr, et al; Nat Biotechnol May 15; 29(7): 644–652.
14
Trinity Overview Trinity aggregates isolated transcript graphs
ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_2014/rnaseq_workshop_slides.pdf
15
Why Trinity? Some comparisons
Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data : Manfred G. Grabherr, et al; Nat Biotechnol May 15; 29(7): 644–652.
16
Trinity Overview
Inchworm first finds the node with the highest number of reads and extends on both sides by picking the highest-intensity nodes. Inchworm will find three segments: A, C and the blue part of B.
Chrysalis takes the three segments determined by Inchworm and clusters them into two groups that are related to two genes.
Butterfly reconstructs the full gene structures, including alternate splice forms A and B.
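The greedy step described for Inchworm can be sketched roughly as follows. This is a toy illustration, not Trinity's actual implementation: it seeds with the most abundant k-mer and extends right and then left, each time taking the most abundant unused k-mer that overlaps by k-1 bases. The function names, the example reads and k=4 are all assumptions made for the sketch, and it ignores reverse complements and sequencing errors.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def _best_unused(candidates, counts, used):
    """Most abundant candidate k-mer that is present and not yet used, else None."""
    present = [c for c in candidates if counts.get(c, 0) > 0 and c not in used]
    return max(present, key=counts.get) if present else None


def greedy_contig(counts, k):
    """Seed with the most abundant k-mer, then extend greedily in both directions."""
    if k < 2:
        raise ValueError("k must be at least 2")
    if not counts:
        return ""
    seed = max(sorted(counts), key=counts.get)  # sorted() makes ties deterministic
    used = {seed}
    contig = seed
    while True:  # extend to the right
        suffix = contig[-(k - 1):]
        best = _best_unused([suffix + b for b in "ACGT"], counts, used)
        if best is None:
            break
        used.add(best)
        contig += best[-1]
    while True:  # extend to the left
        prefix = contig[:k - 1]
        best = _best_unused([b + prefix for b in "ACGT"], counts, used)
        if best is None:
            break
        used.add(best)
        contig = best[0] + contig
    return contig


if __name__ == "__main__":
    reads = ["ATGGCGTGCA", "GGCGTGCAAT", "GCGTGCAATC"]  # made-up overlapping reads
    print(greedy_contig(count_kmers(reads, 4), 4))
```

Marking k-mers as used keeps the extension from looping, and it is also why a greedy pass on its own yields separate segments when transcripts share sequence.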
17
Trinity: k-mers and de Bruijn graphs
Reads are split into k-mers.
A de Bruijn graph is constructed from the k-mers.
Next-generation transcriptome assembly, Jeffrey A. Martin and Zhong Wang. Nat. Rev. Genet. doi: /nrg3068. Published online 7 September 2011
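A minimal sketch of these two steps, assuming one common convention in which nodes are (k-1)-mers and each k-mer contributes an edge from its prefix to its suffix. Some figures draw the k-mers themselves as nodes instead. The function names and the example read are illustrative only.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the sorted list of nodes it points to."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return {node: sorted(successors) for node, successors in graph.items()}


if __name__ == "__main__":
    print(kmers("ATGGC", 3))
    print(build_de_bruijn(["ATGGC"], 3))
```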
18
Trinity: k-mers and de Bruijn graphs
Redundancies are collapsed.
Paths through the graph that explain the observed sequence generate the alignments.
Next-generation transcriptome assembly, Jeffrey A. Martin and Zhong Wang. Nat. Rev. Genet. doi: /nrg3068. Published online 7 September 2011
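One way to picture the collapsing step is to merge every unbranched chain of nodes into a single sequence, so that only branch points remain as decisions. The sketch below does this on a toy graph, again assuming (k-1)-mer nodes. The function names and reads are hypothetical, and isolated cycles with no branch point are deliberately skipped to keep the example short.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Return successor and predecessor sets for a (k-1)-mer de Bruijn graph."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred


def collapse_linear_paths(succ, pred):
    """Merge chains of one-in/one-out nodes into single sequences."""
    def is_simple(node):
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    paths = []
    for node in sorted(set(succ) | set(pred)):
        if is_simple(node):
            continue  # chains start only at branch points or ends
        for nxt in sorted(succ.get(node, ())):
            seq, cur = node + nxt[-1], nxt
            while is_simple(cur):
                cur = next(iter(succ[cur]))
                seq += cur[-1]
            paths.append(seq)
    return paths


if __name__ == "__main__":
    # Two made-up reads sharing a prefix, loosely mimicking two variants.
    succ, pred = build_graph(["ATGGCGT", "ATGGCCA"], 4)
    print(collapse_linear_paths(succ, pred))
```

In this toy case the shared prefix becomes one path and each variant end becomes its own path, so a traversal that joins the prefix to either end spells out one of the two input sequences.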
19
Trinity: key results
Trinity.fasta: file containing assembled transcripts.
Trinity groups transcripts into clusters based on shared sequence content.
Such a transcript cluster is very loosely referred to as a gene.
Information is encoded in the Trinity FASTA accession, e.g.:
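The slide's own accession example is not preserved here. The sketch below therefore assumes a header layout of the form `TRINITY_DN1000_c115_g5_i1`, where everything up to `_g<number>` names the cluster ("gene") and `_i<number>` names the isoform. That layout is an assumption and varies between Trinity versions, so check it against your own Trinity.fasta. The helper names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout: <prefix>_g<gene number>_i<isoform number>
ACCESSION = re.compile(r"^(?P<gene>.+_g\d+)_i(?P<isoform>\d+)$")


def split_accession(header):
    """Split a FASTA header into (gene cluster id, isoform number)."""
    tokens = header.lstrip(">").split()
    if not tokens:
        raise ValueError("empty FASTA header")
    match = ACCESSION.match(tokens[0])
    if match is None:
        raise ValueError(f"unrecognised accession: {tokens[0]!r}")
    return match.group("gene"), int(match.group("isoform"))


def isoforms_by_gene(headers):
    """Group isoform numbers under their gene cluster id."""
    groups = defaultdict(list)
    for header in headers:
        gene, isoform = split_accession(header)
        groups[gene].append(isoform)
    return dict(groups)


if __name__ == "__main__":
    # Made-up headers following the assumed layout.
    headers = [
        ">TRINITY_DN1000_c115_g5_i1 len=247",
        ">TRINITY_DN1000_c115_g5_i2 len=310",
        ">TRINITY_DN7_c0_g1_i1 len=502",
    ]
    print(isoforms_by_gene(headers))
```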
20
Trinity: where to go from here
Examine the quality of the assembly: N50 statistic, core gene representation.
Transcript annotation: Trinotate.
21
Assembly quality N50
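N50 is the contig length such that contigs of that length or longer contain at least half of all assembled bases. A minimal sketch, assuming contig lengths are read from FASTA-formatted lines (the helper names are illustrative):

```python
def fasta_lengths(lines):
    """Sequence length of each record in an iterable of FASTA lines."""
    lengths, current = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length at which the running sum of sorted lengths reaches half the total."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    records = [">a", "ACGTACGT", ">b", "ACGT", ">c", "AC"]  # toy records
    print(n50(fasta_lengths(records)))
```

For transcriptomes, N50 is best read alongside other checks such as core gene representation, because many short but genuine transcripts pull the value down without indicating a poor assembly.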
22
CEGMA How good is assembly coverage?
CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes Bioinformatics (2007) 23 (9): doi: /bioinformatics/btm071 First published online: March 1, 2007
23
Keep asking: ask.iplantcollaborative.org
24
The iPlant Collaborative is funded by a grant from the National Science Foundation Plant Cyberinfrastructure Program (#DBI ).