CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.


1 CyVerse Workshop Transcriptome Assembly

2 Overview of work: RNA-Seq without a reference genome. Generate Sequence; QC and Processing; Transcriptome Assembly; Characterize Transcript abundance; Visualization. We will focus mostly on the transcriptome assembly itself, as this is the greatest challenge of working without a reference genome and involves several steps.

3 Reminder: you’ll need to do some thinking and reading here. Transcriptome assembly. Good read data; Assemble Transcriptome; On to mapping. The most costly (time, compute) process is empirically determining which parameters yield the best assembly given the data and organism. We can suggest some good directions, but we can’t promise “right” answers. Evaluate and refine. Get familiar with examples of work and reasoning on approaches relevant to your organism.

4 Challenges (just a few of them). Transcriptome assembly from RNA-Seq. DNA assembly assumes even sequencing depth, which is not true in RNA-Seq (e.g. more reads indicate a repetitive region in DNA, but higher expression in RNA). Higher coverage is also needed for de novo assembly from RNA-Seq (30X or more). Sequencing error correction is harder, esp. in highly expressed transcripts. Multiple transcript variants (splicing) can confound assembly.
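
To make the "30X or more" figure concrete, here is a rough back-of-the-envelope sketch. It assumes the usual definition of mean depth (total sequenced bases divided by total transcript bases); the function names and the 50 Mb transcriptome size are hypothetical, chosen only for illustration. Because expression is uneven, the mean hides the fact that rare transcripts will sit far below it.

```python
import math


def mean_depth(num_reads: int, read_length: int, transcriptome_bp: int) -> float:
    """Average depth = total sequenced bases / total transcript bases."""
    if transcriptome_bp <= 0:
        raise ValueError("transcriptome_bp must be positive")
    return num_reads * read_length / transcriptome_bp


def reads_needed(target_depth: float, read_length: int, transcriptome_bp: int) -> int:
    """Smallest read count whose mean depth reaches target_depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_depth * transcriptome_bp / read_length)


if __name__ == "__main__":
    # Hypothetical numbers: 50 Mb of transcript sequence, 100 bp reads.
    print(reads_needed(30, 100, 50_000_000))
    print(mean_depth(15_000_000, 100, 50_000_000))
```

Treat the result as a lower bound for planning, not as a guarantee that every transcript is covered 30 times.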

5 Some reminders. Transcriptome assembly from RNA-Seq. Practically the only agreement on which software is the right one to use is that there is no “right” one (we’ll need to experiment*). Don't: use fewer than 200 to 500 million RNA reads, mate-paired, of 100 bp or better length, high quality, and expect to get a complete transcriptome. (1) Despite the challenges, making your own transcriptome is still very useful and perhaps more practical than assembling your own genome. 1. How to get Best mRNA Transcript assemblies, by Don Gilbert, 2013 Jan. http://eugenes.org/EvidentialGene/

6 Overview Transcriptome assembly from RNA-Seq Pre-analysis: Data generation

7 Generate the best data you can! Experimental Consideration – Library prep. Remove ribosomal RNAs: two options, each with pros and cons. Poly(A) selection – effective, but will miss ncRNAs and non-polyadenylated transcripts. rRNA depletion – hybridize rRNAs and remove them, but this may introduce different biases (e.g. against highly expressed transcripts).

8 Generate the best data you can! Experimental Consideration – Library prep. PCR Amplification: most protocols have a PCR amplification step, which introduces bias (e.g. against high GC content). Some alternative protocols or technologies (PacBio) can avoid amplification, but they have their own issues.

9 Generate the best data you can! Experimental Consideration – Library prep. Strand specificity: if possible, using a strand-specific protocol can simplify future analyses.

10 Generate the best data you can! Experimental Consideration – Library prep. Figure panels: "No orientation" and "Strandedness preserved". Source: http://www.giga.ulg.ac.be/jcms/prod_1025901/en/transcriptome-analysis-with-strand-specific-libraries

11 Overview SOAPdenovo-Trans Analysis

12 Overview. SOAPdenovo-Trans. SOAPdenovo-Trans is a de novo transcriptome assembler. It is sensitive to alternative splicing and to different expression levels among transcripts. It constructs full-length transcript sets from RNA-Seq read data.

13 Some comparisons. Why SOAPdenovo-Trans? http://arxiv.org/ftp/arxiv/papers/1305/1305.6760.pdf *Runs more quickly (easier to refine parameters). Lower memory demands. Good quality. (Software changes rapidly, so “clear winners” will always change; you can too when the time comes.)

14 Overview. SOAPdenovo-Trans. De Bruijn graphs are constructed. Error correction. Contigs are constructed, and single/paired reads are mapped to contigs to make scaffold graphs. Transcripts are created from scaffold graphs. Figure from: SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads. Yinlong Xie, Gengxiong Wu, et al. BGI-Shenzhen, Shenzhen, China.

15 Kmers and De Bruijn graphs. SOAPdenovo-Trans. Next-generation transcriptome assembly, Jeffrey A. Martin and Zhong Wang – Nat. Rev. Genet. doi:10.1038/nrg3068. Published online 7 September 2011. Reads are split into k-mers. A De Bruijn graph is constructed from the k-mers.

16 Kmers and De Bruijn graphs. SOAPdenovo-Trans. Next-generation transcriptome assembly, Jeffrey A. Martin and Zhong Wang – Nat. Rev. Genet. doi:10.1038/nrg3068. Published online 7 September 2011. Redundancies are collapsed. Paths through the graph that explain the observed sequence generate the alignments.
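
The two steps on these slides (split reads into k-mers, then build a graph and collapse redundancies) can be sketched in a few lines. This is a toy illustration, not SOAPdenovo-Trans's implementation: it assumes one common formulation in which nodes are (k-1)-mers and each k-mer is an edge, it ignores reverse complements and error correction, and the function names are hypothetical.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list[str]:
    """All overlapping substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it.

    Storing successors in a set collapses redundant k-mers: a k-mer seen
    in many reads still contributes only one edge.
    """
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    # Two overlapping toy reads; real reads are far longer and k far larger.
    toy_reads = ["ACGTAC", "CGTACG"]
    for node, successors in sorted(de_bruijn(toy_reads, 3).items()):
        print(node, "->", sorted(successors))
```

An assembler then walks paths through this graph to spell out candidate transcripts; branches in the graph are where splice variants, repeats, and sequencing errors show up.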

17 Choosing kmers. SOAPdenovo-Trans. De novo assembly and analysis of RNA-seq data. Nature Methods 7, 909–912 (2010) doi:10.1038/nmeth.1517. Transcripts with lower read depth were represented better with lower K values. Transcripts with higher read depth were represented better with higher K values.

18 Kmer rules of thumb. SOAPdenovo-Trans. Larger K – more specific (better for heterozygous genomes). Smaller K – more sensitive (better for repetitive genomes). Use a range of sizes (31, 41, 51, etc.) and merge the results. Recommendations of 1/2 or 2/3 of the read length can be found in various places online. Remember to use an odd number.
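
The rules of thumb above reduce to simple arithmetic. The sketch below is only a starting point for experimentation; the helper names are hypothetical, and rounding down to the nearest odd number is an assumption (the slide only says to use an odd K). The reason for odd values is that a k-mer of odd length can never equal its own reverse complement.

```python
def nearest_odd_at_or_below(n: int) -> int:
    """Return n if it is odd, otherwise n - 1."""
    return n if n % 2 == 1 else n - 1


def suggest_k(read_length: int) -> tuple[int, int]:
    """Odd K values near 1/2 and 2/3 of the read length."""
    if read_length < 3:
        raise ValueError("read_length is too short to suggest a K")
    half = nearest_odd_at_or_below(read_length // 2)
    two_thirds = nearest_odd_at_or_below(2 * read_length // 3)
    return half, two_thirds


def k_series(start: int, stop: int, step: int = 10) -> list[int]:
    """Odd K values from start to stop inclusive, e.g. 31, 41, 51."""
    if start % 2 == 0 or step % 2 != 0:
        raise ValueError("use an odd start and an even step so every K stays odd")
    return list(range(start, stop + 1, step))


if __name__ == "__main__":
    print(suggest_k(100))
    print(k_series(31, 51))
```

Each K in the series would be used for a separate assembly run, with the resulting transcript sets merged afterwards.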

19 Key Results. SOAPdenovo-Trans. .scafSeq – Assembled transcript sets. .scafStatistics – Collection of statistics describing the population of transcript sequences in the assembly. .contig – The assembled contigs used to create the scaffolds. .agp – Describes the assembly of contigs from reads and scaffolds from contigs; may be used with other programs to visualize the assembly in detail.
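
A quick first look at the assembled transcripts is to tabulate their lengths. The sketch below assumes the .scafSeq file is FASTA-formatted text (header lines starting with ">" followed by sequence lines); check your own output before relying on that. The function name and the example records are made up for illustration.

```python
from typing import Iterable


def fasta_lengths(lines: Iterable[str]) -> dict[str, int]:
    """Map each record name (first word after '>') to its sequence length."""
    lengths: dict[str, int] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)  # sequences may wrap over several lines
    return lengths


if __name__ == "__main__":
    # Made-up records standing in for the contents of a .scafSeq file.
    example = ">scaffold1 extra header text\nACGTAC\nGTACGT\n>scaffold2\nACGT\n"
    for record, length in fasta_lengths(example.splitlines()).items():
        print(record, length)
```

For a real file, pass an open file handle instead of the in-memory example, since a file object also yields lines.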

20 Where to go from here. SOAPdenovo-Trans. Examine the quality of the assembly: N50 statistic; core gene representation.

21 N50 Assembly quality http://schatzlab.cshl.edu/
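
The slide gives no formula, so this sketch uses the common definition of N50: sort the sequences from longest to shortest and report the length at which the running total first reaches half of all assembled bases. The function name and the toy lengths are illustrative only.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that sequences of length >= L hold at least half the bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer form of running >= total / 2
            return length
    raise AssertionError("unreachable for a non-empty list of positive lengths")


if __name__ == "__main__":
    # Toy contig lengths summing to 290; 80 + 70 = 150 passes the halfway mark.
    print(n50([80, 70, 50, 40, 30, 20]))
```

For transcriptomes, N50 is a weaker quality signal than for genomes, because real transcripts vary widely in length; it is best read alongside the core gene check on the next slide.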

22 How good is assembly coverage? CEGMA. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics (2007) 23 (9): 1061-1067. doi: 10.1093/bioinformatics/btm071. First published online: March 1, 2007.

23 Asian honeybee transcriptome Sample Data

24 Detailed instructions with videos, manuals, documentation in Keep asking: ask.iplantcollaborative.org

