de Novo Transcriptome Assembly

Slides:



Advertisements
Similar presentations
MCB Lecture #15 Oct 23/14 De novo assemblies using PacBio.
Advertisements

RNAseq.
Peter Tsai Bioinformatics Institute, University of Auckland
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
NGS Transcriptomic Workflows Hugh Shanahan & Jamie al-Nasir Royal Holloway, University of London.
Genome Annotation BCB 660 October 20, From Carson Holt.
De-novo Assembly Day 4.
Li and Dewey BMC Bioinformatics 2011, 12:323
Todd J. Treangen, Steven L. Salzberg
Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING.
Genomics Virtual Lab: analyze your data with a mouse click Igor Makunin School of Agriculture and Food Sciences, UQ, April 8, 2015.
Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.
RNAseq analyses -- methods
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Next Generation Sequencing and its data analysis challenges Background Alignment and Assembly Applications Genome Epigenome Transcriptome.
The iPlant Collaborative
RNA Sequencing I: De novo RNAseq
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
RNA Sequence Assembly WEI Xueliang. Overview Sequence Assembly Current Method My Method RNA Assembly To Do.
Spliced Transcripts Alignment & Reconstruction
Introduction to RNAseq
billion-piece genome puzzle
Alternative Splicing (a review by Liliana Florea, 2005) CS 498 SS Saurabh Sinha 11/30/06.
De novo assembly validation
The iPlant Collaborative
De novo assembly of RNA Steve Kelly
Accessing and visualizing genomics data
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
RNA Sequencing and transcriptome reconstruction Manfred G. Grabherr.
How to design arrays with Next generation sequencing (NGS) data Lecture 2 Christopher Wheat.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Library QA & QC Day 1, Video 3
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.
Simon v RNA-Seq Analysis Simon v
Annotating The data.
Computing challenges in working with genomics-scale data
Canadian Bioinformatics Workshops
Lesson: Sequence processing
RNA Quantitation from RNAseq Data
Simon v1.0 Motif Searching Simon v1.0.
Cancer Genomics Core Lab
Dr. Christoph W. Sensen und Dr. Jung Soh Trieste Course 2017
Gene expression from RNA-Seq
RNA-Seq analysis in R (Bioconductor)
Transcriptomics II De novo assembly
Genome Sequence Annotation Server
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Primer design.
Kallisto: near-optimal RNA seq quantification tool
Transcriptome Assembly
The Web frame for NGS output
How to Build a Horse: Final Report
Maximize read usage through mapping strategies
Simon V Motif Searching Simon V
Basic Local Alignment Search Tool (BLAST)
Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs BMI/CS Spring 2019 Colin Dewey
Overview of Workflows: Why Use Them?
Basic Local Alignment Search Tool
Quantitative analyses using RNA-seq data
Sequence Analysis - RNA-Seq 2
Introduction to RNA-Seq & Transcriptome Analysis

RNA-Seq Data Analysis UND Genomics Core.
Quality Control & Nascent Sequencing
Presentation transcript:

de Novo Transcriptome Assembly Sheri Sanders Bioinformatics Analyst NCGAS @ IU ss93@iu.edu

Scope What I’m not talking about What I am going to talk about PacBio Iso-seq Experimental Design What I am going to talk about How Trinity works, and why you care Considerations What is CDTA and why I do it Some quality control metrics How to get my pipeline and help So… I started writing this Tuesday, I had planned to give this talk (Data Movment and Management) and then this one (free resources available to you)... Bare with me!

Pipeline

Inchworm assembles RNA-seq reads into contigs.

Inchworm assembly of contigs via greedy k-mer extension All reads are broken up into 25bp k-mers, and the abundance of each kmer is calculated. What’s a Kmer? If k=25 (default in Trinity), this means the kmers are 25mers, or 25bp sequence subsections. Kmer is used as a shorthand, because many assemblers use a variety of k values. A T G C 25 4 13 12 CTAGGAATCG…AGCCTTTGA Most common k-mer in read set

Inchworm assembly of contigs via greedy k-mer extension 5 22 4 2 A T G C 14 25 CTAGGAATCG…AGCCTTTGA A Most common k-mer in read set A T G C 6 5

Inchworm assembly of contigs via greedy k-mer extension 1 5 56 22 A T G C CTAGGAATCG…AGCCTTTGA ATT Most common k-mer in read set Rare transcripts? Transcript length? Sequence error rate? This continues until there are no overlapping k-mers. All included k-mers are then removed from the initial list. The next most common k-mer is used and the process restarts. What are some possible biases at this step?

Inchworm assembly of contigs via greedy k-mer extension >a121:len = 5,885 >a122:len = 2,560 >a123:len = 4,443 >a124:len = 48 >a126:len = 66

Inchworm assembles RNA-seq readsinto contigs. Chrysalis clusters these contigs and makes de Bruijn graphs. This step seeks to capture the transcriptional complexity for each gene. Reads are then mapped to the graphs. Inchworm assembles RNA-seq readsinto contigs.

Chrysalis clustering of Inchworm contigs based on shared subsequences (k-mers) and paired reads. Isoforms of the same gene: After inchworm: K-mers that share subsequences or are supported by pairs are grouped Contigs are combined into de Bruijn graphs

Chrysalis construction of de bruijn graphs Rare exons? Transcript length? Paired? What are some possible biases at this step?

Inchworm assembles RNA-seq reads into contigs. Chrysalis clusters these contigs and makes de Bruijn graphs. This maps the full transcriptional complexity for each gene. Reads are then mapped to the graphs. Inchworm assembles RNA-seq reads into contigs. Butterfly then uses read, pairs, and graph information to construct full-length transcripts – including isoforms and paralogous genes.

Butterfly reconstruction of transcripts including alternatively spliced isoforms

Unix Command Despite all of this, it runs easily! Trinity.pl \ --seqType (fq for fastq or fa for fasta) \ --left ~/path/to/reads_1.fq \ --right ~/path/to/reads_2.fq \ (or --single for single reads) \ --CPU 4 \ #this will depend on the amount of ram you need. --bflyHeapSpace 10G \ #1GB for every 1 million sequences! --output ~/path/to/output_dir #make sure you have LOTS of space

Output Main output is a fasta file “Trinity.fasta” usually 100,000s transcripts! Why? names in format of c[0-9]*_g[0-9]*_i[0-9]* Transcripts are grouped as follows: Components (c): the set of all sequences that share at least one k-mer (including paralogs) Contigs (g): transcripts that share a number of k-mers (the set of isoforms of a gene) Sequences (i) (isoforms and allelic variation) Don’t trust the groupings too much…

2. Sequencing error or a SNP or mutation after gene duplication 1+3. alternative transcription start site or hybrid joining or DNA contamination 4. alternative splicing or transposed element Basically, it can be difficult to determine the difference between: Alleles Isoforms Gene duplicates Sequencing Error Contamination 1 alternative transcription start site or hybrid joining or DNA contamination; 2 SND caused by a sequencing error or a SNP or mutation after gene duplication; 3 alternative transcription start site or DNA contamination; 4 alternative exon use; 5 alternative exon use or mutations after recent gene duplication. 5. alternative exon use or mutations after recent gene duplication

Filtering Many options, going to depend on the individual project. Things you can do: Complete proteins Remove contamination (blast) Length requirements Use gene “g” level (want one representative for each gene to map to)

So… what do I do? Who can I trust?

Assembler

CDTA – Combined de novo Transcriptome Assembly Multiple assemblers, multiple parameters (kmers) Best of all worlds Get as much data as possible and look for concordance between the different assemblers. It is less likely that different assembly algorithms will experience the same biases/errors in assembly. (Similar to why using different sequencing technologies can help reduce noise in genome assemblies) Not always needed…

Kmers – why? The structure of an assembly graph is highly dependent on the k-mer size used for assembly. Small k-mers result in shorter contigs with lots of connections, while large k- mers can result in longer contigs with fewer connections. If you have longer reads and/or higher read depth, you can use larger k-mers which are useful in resolving complex areas of the graph (i.e. advantages of PacBio vs Illumina in genome assembly). Conversely, if you have shorter reads and/or lower read depth, you may have to use shorter k-mers. Often we use several and combine to gain information from a range of kmers, because estimating an optimum can be difficult. See https://github.com/rrwick/Bandage/wiki/Effect-of-kmer-size for a great write up on this. https://github.com/rrwick/Bandage/wiki/Effect-of-kmer-size

What I generally run Trinity (k=25) SOAPdenovo (k=35,45,55,65,75,85) Velvet (k=35,45,55,65,75,85) TransAbyss (k=45,55,65,76,85) Combine with Evigenes

Why I generally do this… In several projects (particularly in large or polyploid systems), we were not recovering genes we knew were expressed – we had qPCR to back them up! No one assembler got all the target genes – the CDTA did! We’ve seen quality increases in the transcriptome when we run this pipeline, and it has been published in best practices for RNA-seq to use multiple parameters at least. It’s easier to defend in publication!

What this looks like in practice Project File Trimmed Seq Velvet Oasis Trinity SOAPdenovo TransAbyss Final Assemblies

Structure Each Run file has the following structure #PBS stuff in header to run job… including suggested resource needs #Define variables export left=../trimmed_seq/left.fq export right=../trimmed_seq/right.fq #IMPORTANT PARAMETERS An explanation of all the major parameters to choose, such as kmers, etc. #commands Largely don’t have to be changed, because variables are used above to allow for flexibility. SOAP RunSOAP.sh RunCombine.sh

Workflow Submit all the assemblies in parallel This decreases time to run all 19 assemblies! Not entirely pushbutton – I require you to open every run file (there are four) and LOOK AT WHAT YOU ARE RUNNING. Submit all the combiner scripts in parallel (there are three) These combine all the kmers from SOAP, Velvet, and TransAbyss Also label each contig with kmer and assembler for evaluation if interested Moves all the final groups to the final_assemblies folder Run Evigenes in final_assemblies folder Combines the assemblies Pending queue load and size of data… this can take ~2weeks - ~months

Evigenes Evidential Genes Leverages fastanrdb, CD-hit, Cd-hit-est, and blast Removes perfect redundancy (fastanrdb) Removes perfect fragments (cd-hit-est) Uses blastn to find 98% identity, exon sized aligments (blast) Identifies full length cds and identifies highest quality to identify main (okay), alternative (okalt), and dropped (drop) sets. See eugenes website for more details! NOTE: This is not what I call a totally friendly program to use…

Benefits Pretty much filters for you – usually I end up with the expected 20- 30k transcripts in the okay complete set. You get a separate file with all the alternatives, tagged with which gene in the okay set they are associated with. This is nice if you want this data! Automatically gives you cds, aa, and fa formats Replicability is high for a filtering paradigm You start with working scripts that you can easily change, with documentation.

Other Options If you aren’t aiming for a full transcriptome assembly (only care about target genes) you can do just use one assembler (trinity, velvet/oases, etc.) with multiple parameters. Main benefit is speed. When might you trust this?

Quality Control Quality (quast is painless) Correctness Completeness Transcript length – what are you expectations? Gene/isoform ratio – what are your expectations? Length metrics - > 1000bp, >5000 bp, >25,000bps Correctness Blast to a similar organism GC – does this match expected? Completeness BUSCO – works on transcriptomes too! Blast to a similar organism (how much overlap?)

Downstream Analysis Worth checking out Trinity’s Downstream Analysis page. Transdecoder is built in to translated your transcripts (if you don’t use things like EviGenes). Not entirely trinity specific - i.e. you can use Trinity’s DE pipeline for edgeR, DESeq2, voom, and limma if you so desire. Trinotate is an annotation pipeline for Trinity data, but is a good discussion of tools you can use on your data in general.

How do you get all this? Github (soon) Jetstream (soon) Contact me! (available now ^_~) MY GENERAL GOAL: Make this as painless as possible without losing integrity – aim for anyone with basic unix skills (edit a file and submit a job!) and ability to read documentation (all linked) to be able to get a transcriptome.

HELP!!! Trinity Galaxy is available if you hate the command line Will need to ask for an account Contact us at the National Center for Genome Analysis Support Google everything! Especially error codes! Don’t be shy – ask on BioStars/SEQanswers/etc., ask on github, email your cluster admins Bunches of tutorials on my ugly tutorial repo And on our blog

Slightly Old links of interest: Trinity paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3875132/pdf/nihms-537313.pdf WONDERFUL tutorial: http://trinityrnaseq.sourceforge.net/rnaseq_workshop.html Comparison paper for de-novo assemblers: http://www.biomedcentral.com/1471-2105/12/S14/S2 Great RSEM explanation: http://www.biomedcentral.com/content/pdf/gb-2010-11-3-r25.pd