De novo assembly of RNA Steve Kelly

Slides:

Advertisements

Similar presentations

Advertisements

Simon v2.3 RNA-Seq Analysis Simon v2.3.

Peter Tsai Bioinformatics Institute, University of Auckland

Introduction to Short Read Sequencing Analysis

Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

Transcriptomics Jim Noonan GENE 760.

MCB Lecture #21 Nov 20/14 Prokaryote RNAseq.

High Throughput Sequencing

© Wiley Publishing All Rights Reserved.

De-novo Assembly Day 4.

Li and Dewey BMC Bioinformatics 2011, 12:323

National Center for Genome Analysis Support: Carrie Ganote Ram Podicheti Le-Shin Wu Tom Doak Quality Control and Assessment.

Expression Analysis of RNA-seq Data

CS 394C March 19, 2012 Tandy Warnow.

Todd J. Treangen, Steven L. Salzberg

Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING.

PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne.

Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.

Introduction to Short Read Sequencing Analysis

June 11, 2013 Intro to Bioinformatics – Assembling a Transcriptome Tom Doak Carrie Ganote National Center for Genome Analysis Support.

Next Generation Sequencing and its data analysis challenges Background Alignment and Assembly Applications Genome Epigenome Transcriptome.

Next Generation DNA Sequencing

Schedule change Day 2: AM - Introduction to RNA-Seq (and a touch of miRNA-Seq) Day 2: PM - RNA-Seq practical (Tophat + Cuffdiff pipeline on Galaxy) Day.

RNA-Seq Analysis Simon V4.1.

Transcriptome Analysis

Quick introduction to genomic file types Preliminary quality control (lab)

The iPlant Collaborative

RNA surveillance and degradation: the Yin Yang of RNA RNA Pol II AAAAAAAAAAA AAA production destruction RNA Ribosome.

RNA-Seq Assembly 转录组拼接唐海宝基因组与生物技术研究中心 2013 年 11 月 23 日.

Quality Control Hubert DENISE

IPlant Collaborative Discovery Environment RNA-seq Basic Analysis Log in with your iPlant ID; three orange icons.

RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.

Introduction to RNAseq

SMARTAR: small RNA transcriptome analyzer Geuvadis RNA analysis meeting April 16 th 2012 Esther Lizano and Marc Friedländer Xavier Estivill lab Programme.

Bioinformatics for biologists Dr. Habil Zare, PhD PI of Oncinfo Lab Assistant Professor, Department of Computer Science Texas State University Presented.

The iPlant Collaborative

Construction of Substitution matrices

Denovo Sequencing Practical. Overview Very small dataset from Staphylococcus aureus – 4 million x 75 base-pair, paired end reads Cover basic aspects of.

Comparative transcriptomics of fungi Group Nicotiana Daan van Vliet, Dou Hu, Joost de Jong, Krista Kokki.

__________________________________________________________________________________________________ Fall 2015GCBA 815 __________________________________________________________________________________________________.

Manuel Holtgrewe Algorithmic Bioinformatics, Department of Mathematics and Computer Science PMSB Project: RNA-Seq Read Simulation.

CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.

RNA Sequencing and transcriptome reconstruction Manfred G. Grabherr.

Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI.

Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops

Simon v RNA-Seq Analysis Simon v

de Novo Transcriptome Assembly

RNA Quantitation from RNAseq Data

Biases and their Effect on Biological Interpretation

Quality Control & Preprocessing of Metagenomic Data

Gene expression from RNA-Seq

RNA-Seq analysis in R (Bioconductor)

COMPUTATIONAL GENOMICS GENOME ASSEMBLY

Denovo genome assembly of Moniliophthora roreri

Transcriptomics II De novo assembly

Pre-assembly analyses

S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.

The FASTQ format and quality control

Kallisto: near-optimal RNA seq quantification tool

Design and Analysis of Single-Cell Sequencing Experiments

Transcriptome Assembly

A critical evaluation of HTQC: a fast quality control toolkit for Illumina sequencing data Chandan Pal, PhD student Sahlgrenska Academy Institute of.

Maximize read usage through mapping strategies

Garbage In, Garbage Out: Quality control on sequence data

Sequence Analysis - RNA-Seq 2

BF528 - Sequence Analysis Fundamentals

RNA-Seq Data Analysis UND Genomics Core.

Presentation transcript:

De novo assembly of RNA Steve Kelly www.stevekelly.eu Slides are available at: http://bioinformatics.plants.ox.ac.uk/EBI_2014/EBI_2014.ppt Example data is available at: http://bioinformatics.plants.ox.ac.uk/EBI_2014/reads.zip

What is a de novo transcriptome? Set of nucleotide sequences obtained by sequencing reverse transcribed RNA Poly A selected Ribsomal RNA depleted Random hexamer primed Long reads/short reads Strand specific Sequencing technology: Read size: Throughput per run: Illumina ~120bp ~800Gbp 454 ~1000bp ~700Mbp PacBio ~2700bp ~100Mbp GridION ~100kbp ~100Gbp/day Select for mRNA

Why do de novo transcriptomes Good way of generating new data from non-model species - It can tell you about gene presence - It can tell you about gene expression - It can tell you about evolution Cheaper than you might think! 2Gb of paired end sequence data is sufficient to get 90% of expressed genes! A de novo plant transcriptome from Illumina sequence: ~£300 Typical computing resources (standard PC desktop) Linux desktop: cost £800-1000. at least 1 quad core processor at least 16gb of ram at least 3 TB of disk space

Why do de novo transcriptomes Example data for this talk is available at: http://bioinformatics.plants.ox.ac.uk/EBI_2014/reads.zip Paired end Illumina sequence data from Sorghum bicolor leaves R1.fq is a file containing the first end reads. R2.fq is contains the second end reads. The first read in R2.fq is the pair of the first read in R1.fq so the order of the reads is very important in paired-end read files! All software used is available for Linux - Web addresses for current versions are given - Software takes many hours to run of large datasets All software is command line only - Command line examples for every step are given

Command line examples The name of the software being used Where you can get it for free Example: Using fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/ Command line: ~/fastx_quality_stats –i R1.fq –o R1_output.txt ~/fastx_nucleotide_distribution_graph.sh –i R1_output.txt –o R1_nuc.png How to use software to generate the data shown on the slide Shorthand for the full path to the directory where the program is located

De novo assembly protocol outline 1) Pre-assembly read processing Remove poor quality data before assembly 2) Assembly Using different parameters and merging 3) Post-assembly processing Identifying protein sequences and orthologous protein groups 4) Quantification Mapping raw reads back onto assembly to determine expression level

What do sequence files look like? Each read has a name that starts with an @ symbol @FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1 GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG… + Fefefggggggfggggggggfgggegggggggaggggggggdg…

What do sequence files look like? Each read has a name that starts with an @ symbol If the reads are paired, then the first in pair will end with a /1 and the second with a /2 @FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1 GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG… + Fefefggggggfggggggggfgggegggggggaggggggggdg… @FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2 GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACAAA… gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^]]…

What do sequence files look like? Each read has a name that starts with an @ symbol If the reads are paired, then the first in pair will end with a /1 and the second with a /2 @FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1 GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG… + Fefefggggggfggggggggfgggegggggggaggggggggdg… @FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2 GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACAAA… gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^]]… The read sequence is on the line after the read name

What do sequence files look like? Each read has a name that starts with an @ symbol If the reads are paired, then the first in pair will end with a /1 and the second with a /2 @FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1 GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG… + Fefefggggggfggggggggfgggegggggggaggggggggdg… @FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2 GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACAAA… gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^]]… The sequence read is followed by a + symbol on a new line The read sequence is on the line after the read name

What do sequence files look like? Each read has a name that starts with an @ symbol If the reads are paired, then the first in pair will end with a /1 and the second with a /2 @FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1 GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG… + Fefefggggggfggggggggfgggegggggggaggggggggdg… @FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2 GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACAAA… gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^]]… The sequence read is followed by a + symbol on a new line The read sequence is on the line after the read name The base-by-base qualities are on line #4

Understanding read quality scores Quality scores are the probability that the base is called incorrectly Q = -10 Log10(p) where p is the probability that base is incorrect Quality score Probability of sequencing error Accuracy 10 1 in 10 90% 20 1 in 100 99% 30 1 in 1000 99.9% Read quality is more important for assembly than for read mapping!

Example of read quality scores Reads in Illumina 1.5+ encoding (Quality 2 - 40) What does quality score of B mean? ASCII value for B = 66 (http://www.asciitable.com/) Illumina 1.5+ offset = 64 B = 66 – 64 = 2 2 = -10 Log10(p) p = 10^(-2/10) p = 0.63 i.e. 63% chance that the base call is wrong

Visualising read quality data Important to visualise/summarise the raw data before using it! Most common errors can be spotted in general summary statistics! Use FASTQC for nice visual output: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Use FASTX or Trimmomatic for simple processing tasks: http://hannonlab.cshl.edu/fastx_toolkit/ http://www.usadellab.org/cms/?page=trimmomatic

Pre-assembly read processing Example: Using fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/ Command line: ~/fastx_quality_stats –i R1.fq –o R1_output.txt ~/fastx_nucleotide_distribution_graph.sh –i R1_output.txt –o R1_nuc.png

Pre-assembly read processing Example: Using fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/ Normal 5’ composition bias caused by non-random hexamer used during cDNA synthesis Command line: ~/fastx_quality_stats –i R1.fq –o R1_output.txt ~/fastx_nucleotide_distribution_graph.sh –i R1_output.txt –o R1_nuc.png

Pre-assembly read processing Example: Using fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/ If there is unusual composition bias at either end it is likely to cause problems. It’s best to remove these bases! Command line: ~/fastx_trimmer –f 6 –i R1.fq –o trim_R1.fq

Pre-assembly read processing Example: Using fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/ Non-uniform composition in the reads due to sequencing adaptor sequences. Remove the appropriate adapter sequence using fastx_clipper Command line: ~/fastx_clipper –a GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG –i R1.fq –o filt_R1.fq java -jar trimmomatic-0.30.jar PE -threads 4 -phred64 R1.fq R2.fq trimmed_R1.fq unpaired_R1.fq trimmed_R2.fq unpaired_R2.fq ILLUMINACLIP:all_adaptors.fasta:2:30:10 LEADING:10 TRAILING:10 SLIDINGWINDOW:5:15 MINLEN:50

Pre-assembly read processing Example: Using fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/ Command line: ~/fastq_quality_boxplot_graph.sh –i R1_output.txt –o R1_box_plot.png

Pre-assembly read processing Example: Using fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/ Normal 3’ degradation of read quality scores. Sequence at 3’end has higher proportion of errors Before assembly good idea to remove poor quality sequences! Command line: ~/fastq_quality_boxplot_graph.sh –i R1_output.txt –o R1_box_plot.png

Understanding read quality scores Quality scores are the probability that the base is called incorrectly Q = -10 Log10(p) where p is the probability that base is incorrect Quality score Probability of sequencing error Accuracy 10 1 in 10 90% 20 1 in 100 99% 30 1 in 1000 99.9% Read quality is more important for assembly than for read mapping! Quantification SNP calling or assembly

Pre-assembly read processing Three ways to deal with poor quality reads: Discard reads based on quality threshold Most stringent Lose lots of data 2) Remove poor quality sections from reads Middle stringency Keep most of the read data, just loosing the unreliable bits Try to find and fix the errors in poor quality reads Least stringent Error correction may introduce errors Keep all the data

Pre-assembly read processing Example: Using fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/ Removed reads which have quality < 30 for more than 50% of the read length 22.5 million -> 14.3 million (64%) Command line: ~/fastq_quality_filter –q 30 –p 50 –i R1.fq –o filt_R1.fq ~/fastx_quality_stats –i filt_R1.fq –o filt_R1_output.txt ~/fastq_quality_boxplot_graph.sh –i filt_R1_output.txt –o filt_R1_box_plot.png

Pre-assembly read processing Example: Using fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/ Clip reads from both ends which have quality until quality > 30. Discard if read length < 30 22.5 million -> 22.2 million reads (99%) Command line: ~/fastq_quality_trimmer –t 30 –l 30 –i R1.fq –o trim_R1.fq ~/fastx_quality_stats –i trim_R1.fq –o trim_R1_output.txt ~/fastq_quality_boxplot_graph.sh –i trim_R1_output.txt –o trim_R1_box_plot.png

Pre-assembly read processing Example: Using SortMeRNA http://bioinfo.lifl.fr/RNA/sortmerna/ Even after polyA selection or ribosomal RNA depletion you will have 1-30% ribosomal RNA contamination in your RNAseq. Best to remove it before assembly! You will need to interleave your read files to run SortMeRNA. @FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1 GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAA + Fefefggggggfggggggggfgggegggggggaggggggggd @FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2 GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACA gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^ One file containing all read pairs interleaved consecutively. Command line: ~/sortmerna --I interleaved_trimmed_reads.fq -n 4 --db “the full paths to the four included ribosomal RNA libraries” --paired-in --accept ribosomal_RNA_reads --other non_ribosomal_reads --log rRNA_filter_log -a 4

Pre-assembly read processing Example: Using ALLPATHS-LG http://www.broadinstitute.org/software/allpaths-lg/blog/ - Splits all reads into all possible sub-sequences of length k Makes stacks of near identical sub-sequences Bigger the value of k the more conservative Small value of k can introduce more errors than it fixes!

Pre-assembly read processing Example: Using ALLPATHS-LG http://www.broadinstitute.org/software/allpaths-lg/blog/ - Splits all reads into all possible sub-sequences of length k Makes stacks of near identical sub-sequences Bigger the value of k the more conservative Small value of k can introduce more errors than it fixes!

Pre-assembly read processing Example: Using ALLPATHS-LG http://www.broadinstitute.org/software/allpaths-lg/blog/ - Splits all reads into all possible >25bp sub-sequences (p < 1e-6) - Makes stacks of near identical sub-sequences (no gaps) Command line: ~/ErrorCorrectReads.pl PHRED_ENCODING=64 READS_OUT=corrected_reads PAIRED_READS_A_IN=R1.fq PAIRED_READS_B_IN=R2.fq FILL_FRAGMENTS=True

Pre-assembly read processing Example: Using ALLPATHS-LG http://www.broadinstitute.org/software/allpaths-lg/blog/ - Splits all reads into all possible >25bp sub-sequences (p < 1e-6) Makes stacks of near identical sub-sequences (no gaps) Correct singletons and low frequency bases Keep all 22.5 million reads (100%) Command line: ~/ErrorCorrectReads.pl PHRED_ENCODING=64 READS_OUT=corrected_reads PAIRED_READS_A_IN=R1.fq PAIRED_READS_B_IN=R2.fq FILL_FRAGMENTS=True

Pre-assembly read processing Example: Using ALLPATHS-LG http://www.broadinstitute.org/software/allpaths-lg/blog/ Join paired end reads into long reads prior to assembly. Assembly contains up to 50% less reads Assembly program doesn’t incorrectly guess insert size! Prevents some common assembly artefacts Command line: ~/ErrorCorrectReads.pl PHRED_ENCODING=64 READS_OUT=corrected_reads PAIRED_READS_A_IN=R1.fq PAIRED_READS_B_IN=R2.fq FILL_FRAGMENTS=True

Pre-assembly read processing Example “good practice” pre-assembly processing - Visually inspect nucleotide composition and quality scores Clip off 5’ or 3’ ends containing unexpected/unusual nucleotide bias Filter out any reads containing adaptor sequences Trim back reads from both ends based on quality score Filter out ribosomal RNA reads Error correct and join overlapping remaining reads using k >= 25bp - (Optional step here would be to normalise reads to even out coverage depth for example see program called “khmer”) Assemble!

Which assembler to use? Assembly packages: All use the same underlying idea of assembling using de Bruijn graphs 1) Velvet/Oases - Good all round and very fast 2) Trinity - Good at getting full length transcripts 3) TransAbyss - Good at getting lots of splice variants 4) SOAPdenovo-Trans - Very fast with low memory footprint Results of all assembly programs are quite similar. As each of them comes up with a new idea they tend to be built into the others quickly too so that they don’t get left behind! I recommend you start with velvet/oases as its easy to install and use!

Simplified overview of assembly process

Simplified overview of assembly process

Assembling the transcriptome Example: Using Velvet/Oases http://www.ebi.ac.uk/~zerbino/velvet/ & http://www.ebi.ac.uk/~zerbino/oases/ Assembly only needs two parameters - Need to specify the insert size Need to specify the k-mer size for constructing the de Bruijn graph Larger k more specific Smaller k more sensitive Best to use a range of k-mer sizes and merge the results - For example k = 31, 41, 51 and 61 Set a minimum transcript length Set a minimum coverage cut off Command line example: ~/oases_pipeline.py –m 31 –M 61 –s 10 –g 41 –o merged –d “ –fastq –shortPaired –separate R1.fq R2.fq“ –d –ins_length 170 –cov_cutoff 6 –min_pair_count 4 –min_trans_lgth 200

Assessing the transcriptome Effect of kmer size of transcript length part 1

Assessing the transcriptome Effect of kmer size of transcript length part 2

Assessing the transcriptome Effect of kmer size on number of transcripts

Quantifying your transcriptome Example: Using RSEM & bowtie http://deweylab.biostat.wisc.edu/rsem/ & http://bowtie-bio.sourceforge.net/index.shtml Use processed reads for quantification Cant use programs such as cufflinks to estimate abundance from transcripts Current best method is RSEM gene_id transcript_id(s) length effective_length expected_count TPM FPKM transcript_1 2021 1936.59 519.77 5190.23 6378.74 transcript_100 887 802.59 96 2313.29 2843.01 transcript_101 880 795.59 1856.65 45132.93 55467.89 transcript_102 876 791.59 370 9039.72 11109.72 transcript_103 173.95 4249.96 5223.15 transcript_104 875 790.59 53 1296.52 1593.4 Use expected count for DE testing Command line example: ~/rsem-prepare-reference --bowtie-path “path-to-bowtie” transcripts.fa ~/rsem-calculate-expression --bowtie-path “path-to-bowtie” --paired-end R1.fq R2.fq transcripts.fa

Summary A de novo plant transcriptome from Illumina sequence: ~£300 Important to check the raw data before assembly Correcting and removing errors before assembly improves quality Most assembly methods have similar performance Try it!

Assemble your own mini-transcriptome Download and extract some pre-trimmed read data from: http://bioinformatics.plants.ox.ac.uk/EBI_2014/reads.zip Command line: wget http://bioinformatics.plants.ox.ac.uk/EBI_2014/reads.zip ./ unzip reads.zip These reads come from 5 genes detected in a sorghum RNAseq experiment Sb09g020820, Sb02g034570, Sb07g000980, Sb07g023920, Sb06g016090 Error correct and join overlapping paired end reads: Command line: ErrorCorrectReads.pl PHRED_ENCODING=64 READS_OUT=corrected_reads PAIRED_READS_A_IN=R1.fq PAIRED_READS_B_IN=R2.fq FILL_FRAGMENTS=True

Assemble your own mini-transcriptome Assemble uncorrected (trimmed) reads Command line: velveth kmer_61 61 -fastq -shortPaired -separate 1.fq 2.fq velvetg kmer_61 -read_trkg yes -scaffolding yes -ins_length 170 oases kmer_61 -ins_length 170 -scaffolding yes Assemble corrected reads Command line: velveth kmer_61ec 61 -fastq -shortPaired -separate corrected_reads.paired.A.fastq corrected_reads.paired.B.fastq velvetg kmer_61ec -read_trkg yes -scaffolding yes -ins_length 170 oases kmer_61ec -ins_length 170 -scaffolding yes

Assemble your own mini-transcriptome Assemble corrected “joined” reads (*MUCH* better for larger read datasets) Command line: velveth kmer_61ecj 61 -fastq –short corrected_reads.filled.fastq velvetg kmer_61ecj -read_trkg yes oases kmer_61ecj Look in the “transcripts.fa” output files to see what transcripts you have assembled BLAST transcripts online at NCBI BLAST Are each of the 5 expected genes assembled? Sb09g020820, Sb02g034570, Sb07g000980, Sb07g023920 & Sb06g016090 If you assemble at k = 21 do you generate any chimeric transcripts? Try the oases pipeline script on slide 35.