De Novo Genome Assembly - Introduction Henrik Lantz - BILS/SciLife/Uppsala University.



De Novo Assembly - Scope
- De novo genome assembly of eukaryote genomes
- Bioinformatics in general, programs in particular
- Practical experience
- Ease of entry - not memorization

Schedule - de novo assembly course

Tuesday November
- Welcome to the course
- NGS Sequence technologies (Henrik Lantz)
- Coffee break
- Quality assessment (Henrik Lantz)
- Computer exercise - Quality assessment
- Lunch
- Genome assembly (Henrik Lantz)
- Computer exercise (incl. coffee break) - Genome assembly
- Dinner at Lingon

Wednesday November
- Assembly validation (Francesco Vezzi)
- Coffee break
- Computer exercise - Assembly validation
- Lunch
- Computer exercise - Assembly validation contd. (incl. coffee break)
- Discussion of exercises + evaluation

All lectures and exercises in this room!

Practical info
- Coffee breaks
- Lunch
- Dinner at Lingon, Svartbäcksg. 30
- Cards

De Novo Genome Assembly - Sequence Technologies Henrik Lantz - BILS/SciLife/Uppsala University

De novo genome project workflow
- Extracting DNA (and RNA) - as much DNA as possible! Single individual and haploid tissue if possible!
- Choosing the best sequence technology for the project
- Sequencing
- Quality assessment and other pre-assembly investigations
- Assembly
- Assembly validation
- Assembly comparisons
- Repeat masking?
- Annotation

NGS Sequence technologies
- Illumina
- 454
- Ion Torrent
- Ion Proton
- SOLiD
- Moleculo
- Pacific Biosciences
- Oxford Nanopore

NGS sequencing
- Genomic DNA is fragmented (not for Nanopore) and sequenced, giving millions of small sequences (reads) from random parts of the genome
- Depending on the sequence technology, reads can be from 50 bp up to 15 kb in length

Assembly
[Figure labels: Reads, Overlapping reads, Consensus sequence = genome, Assembly, Coverage 2x, 5x]
- Coverage = number of reads that support a certain position
- Average coverage is often asked for/reported
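The definition of coverage above can be sketched in a few lines of Python. This is only an illustration: the function names and the (start, length) read representation are assumptions made for the example, not the format of any particular assembler.

```python
def coverage_per_position(genome_length, reads):
    """Count how many reads cover each position.

    reads is a list of (start, length) tuples with 0-based start
    positions; this representation is an assumption for the example.
    """
    depth = [0] * genome_length
    for start, length in reads:
        # Clip the read to the genome boundaries before counting.
        for pos in range(max(start, 0), min(start + length, genome_length)):
            depth[pos] += 1
    return depth


def average_coverage(depth):
    """Average coverage = total covered bases / genome length."""
    return sum(depth) / len(depth)


# Three hypothetical reads on a 10 bp "genome".
reads = [(0, 5), (2, 5), (4, 6)]
depth = coverage_per_position(10, reads)
print(depth)                    # [1, 1, 2, 2, 3, 2, 2, 1, 1, 1]
print(average_coverage(depth))  # 1.6
```

The per-position list is what a coverage plot shows; the single average is the number usually quoted for a sequencing run.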

.ace file of assembly

Average Coverage
Example: I know that the genome I am sequencing is 10 Mbases. I want 50x coverage to do a good assembly. I am ordering 125 bp Illumina reads. How many reads do I need?
(125 x N) / 10e+6 = 50
N = (50 x 10e+6) / 125 = 4e+6 (4 million reads)
An Illumina lane gives you 180x2 million reads (PE)
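The same arithmetic as a small Python sketch. The function name is made up for the example, and the lane size simply reuses the 180x2 million reads quoted on the slide.

```python
import math


def reads_needed(genome_size_bp, target_coverage, read_length_bp):
    """Number of reads N such that read_length * N / genome_size >= coverage."""
    return math.ceil(genome_size_bp * target_coverage / read_length_bp)


# Values from the example: 10 Mbases genome, 50x coverage, 125 bp reads.
n = reads_needed(10_000_000, 50, 125)
print(n)  # 4000000

# Fraction of one lane, assuming 180x2 million reads per lane as stated above.
reads_per_lane = 180_000_000 * 2
print(n / reads_per_lane)
```

For a genome this small, the required reads are only about 1% of a lane, which is why several small projects are often run together on one lane.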

Fastq
AGGCACTCCCTGCAGGTGTTGGACCACCTGGCTGAGCCACAGCGTCGCTTCCTGCTGCCAGGGCCTCGGAGAGGGTGGCTGTGGAGACACTGTGGGAGCA
+HWI-ST0866_0110:5:1101:1264:2090#GATCAG/1
TCTTTATTGGCATCAGGCATCACCACACCATGGTTCTTGGCTCCCATGTTGGCCTGGACTCTCTTGCCATTCCGGGATCCTCTCTCATAGATGTACTCGC
+HWI-ST0866_0110:5:1101:1418:2201#GATCAG/1
__P`ccceegge]eghhhhdfhhhhhhhhhfhhefghffffhffhhfheg^eeffgfegf`fghhhffhhggadcX[`bbbbbbbbbcbbbcbR]aabaa

Quality values in increasing order: !"#$%&'()*+,-./0123456789:;

You might get the data in a .sff or .bam format. Fastq reads are easy to extract from both of these binary (compressed) formats!
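A minimal sketch of reading FASTQ records and turning quality characters into numbers. It assumes the standard four-line record layout (@header, sequence, + line, quality) and uses a tiny made-up record. The `offset` parameter is an assumption to check against your own data: current files normally use 33, while older Illumina pipelines used 64, and characters such as `h` in the quality string above suggest that older encoding.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert quality characters to Phred scores (ASCII code minus offset)."""
    return [ord(char) - offset for char in quality]


# Hypothetical record, not taken from the data above.
example = "@read1\nACGT\n+\n!+5I\n"
records = list(read_fastq(io.StringIO(example)))
print(records)                      # [('read1', 'ACGT', '!+5I')]
print(phred_scores(records[0][2]))  # [0, 10, 20, 40]
```

Because the scores are just ASCII codes minus an offset, the "increasing order" of quality characters is simply the ASCII order.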

Fasta format
>asmbl_2719
AGCACCTAGAGCAGGATGGGAGGTCTCTCCTTGCTGTGGCAGAGGCAGATCTCCTTTCCC
AACACCTAGCAGTATGAACTAGTGAGCTCCTGACTGTTTTCCAGTGGTAATGAGGTGTGA
CCCGCTGCAGCTGCACACTGAATTCTCTCAGTTCCCCGAGGCCAGCCCAGCAGTGTGGGC
AATGCTTTGTTTGTGTGCTGTTGACCATTCC
>asmbl_2702
GTCTGCACTGGGAATGCCCCCTGGAGCAGAACCATTGCCATGGATAAGGACACTACATTT
CCTGGTGTTAAGGTGAATATAACCTCCAGGTTAAGGATGACATTAATTTCAATTACAGCT
TGCCTCTTGTAAGCTAAGCAGTTAATCAACAAGCTATACTGTGACTACACCCTTAGATCA
ATAGCTGGGAAAACATCACCTCCCCCAAATACTCCACCTCTTAACTGCACTCTTTGAAAG
AAGTACAGGCCAGAGTTTAGCTGATCCATCCCTGTGGCTAATCGTCCTGCTTACAAGCTG
CAATATTTTTTAAAACCAGACAATTGGTAGAGGTTTAAACATCAGCCAAGCTGTTCAATT
TACAGCAGGTTAAGCATTCCTGAAACTGTGATCACTGATATATTTGGGTCAGTCAGATGT
CTTGTTAGTGCTT
>asmbl_2701
ACAAACAAAACAAAATAAAACAAAGGAAACAAGCAAAAAAAACCATCATACAATCCCATG
TGTCCAAGAGCTTTACTGTGAAATCAACTATGGAGTCAAAACAATAGAAAAGCTTCCAGA
TTTCTGTATTCCAGGCTGAGACAAGTTTGTAAATACTTCCAGAAATTGCCAACAAGCCTG
CAGGGTAACATCTCTAATGCACACCTCCCTGATACGAAATGCAGAGCACCTTAACTTCTT
CAGCCCTCCCCCAGTCACAACCAGCTATAAATCCTGCCCTTCACTTGTTGGAATATCTCA
TCATAAGGGAAGCATTTTTTAGGCTGAGAAATACAAATCCACCTTGACGGAGCCGGTCAG
GCATATACATGGGCTATGCTGCTGATAGGTTTGTACCAAGCACTCCTAGTGTGAGAATAA
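A minimal sketch of parsing FASTA into a dictionary, joining the wrapped sequence lines of each record. The function name and the short records are made up for the example; established libraries handle more edge cases than this does.

```python
def read_fasta(lines):
    """Return {record name: sequence} from an iterable of FASTA lines."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The name is the first word after '>'.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            parts[name].append(line)
    # Join the wrapped lines of each record into one sequence.
    return {record: "".join(chunks) for record, chunks in parts.items()}


# Hypothetical two-record file with a wrapped first sequence.
text = ">asmbl_A\nACGT\nAC\n>asmbl_B some description\nGG\n"
sequences = read_fasta(text.splitlines())
print(sequences)  # {'asmbl_A': 'ACGTAC', 'asmbl_B': 'GG'}
print({record: len(seq) for record, seq in sequences.items()})
```

The sequence lengths obtained this way are the input for assembly statistics such as N50.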

Paired-End

Insert size
[Figure labels: Read 1, Read 2, DNA-fragment, Inner mate distance, Adapter+primer, Insert size]
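The relation between these quantities can be written as a one-line calculation. The sketch below assumes one common convention, where the insert size is the length of the DNA fragment between the adapters and the inner mate distance is what remains between the two reads; some tools define insert size differently, so check the documentation of the program you use.

```python
def inner_mate_distance(insert_size, read1_length, read2_length):
    """Unsequenced distance between the two reads of a pair.

    Assumes insert size = read 1 + inner mate distance + read 2.
    A negative result means the two reads overlap.
    """
    return insert_size - read1_length - read2_length


# Hypothetical library: 500 bp inserts sequenced as 2x125 bp.
print(inner_mate_distance(500, 125, 125))  # 250
# Hypothetical short fragment: the reads overlap by 50 bp.
print(inner_mate_distance(200, 125, 125))  # -50
```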

Mate-pair
- Large amounts of high-quality DNA needed
- Used to get long insert sizes

Contigs and scaffolds
- Contig = a continuous stretch of nucleotides resulting from the assembly of several reads
- Scaffold = several contigs stitched together with NNNs in between
[Figure labels: contig1, contig2, contig3, Paired-end reads, scaffold1, NNN]
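A toy sketch of the contig/scaffold relationship: contigs are joined with runs of N, and a scaffold can be split back into contigs at those runs. The function names and sequences are made up, and real scaffolders estimate gap sizes from paired-end or mate-pair information rather than taking them as given.

```python
import re


def build_scaffold(contigs, gap_sizes):
    """Join contigs with runs of N; gap_sizes[i] is the gap after contigs[i]."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    pieces = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        pieces.append("N" * gap)
        pieces.append(contig)
    return "".join(pieces)


def split_scaffold(scaffold):
    """Recover contigs by splitting at every run of N.

    Note: a contig that itself contains N would also be split here.
    """
    return [piece for piece in re.split(r"N+", scaffold) if piece]


contigs = ["ACGT", "GGCC", "TTAA"]  # hypothetical contigs
scaffold = build_scaffold(contigs, [3, 5])
print(scaffold)                  # ACGTNNNGGCCNNNNNTTAA
print(split_scaffold(scaffold))  # ['ACGT', 'GGCC', 'TTAA']
```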

N50 - contigs of this size or larger include 50 % of the assembly

Contig     Sequence                         Length   Cumulative
>contig1   TTTATGTCCGTAGCATGTAGACATATGGCA   30 bp    30
>contig2   AGTCTTGAGCCGAATTCGTG             20 bp    30+20=50 (>45)
>contig3   GTTGGAGCTATTCAGCGTAC             20 bp
>contig4   ACAAATGATC                       10 bp
>contig5   CGCTTCGAAC                       10 bp

90 bp total, 50% of total = 45

L50 = number of contigs that include 50% of the assembly. Here, L50=2! N50=20!
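The N50 and L50 calculation can be sketched as follows, using the five contig lengths from the example. The function name is made up; the logic is the usual one of sorting contigs from longest to shortest and accumulating lengths until half of the assembly is reached.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk through the contigs from longest to shortest.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:  # at least 50% of the assembly reached
            return length, count
    raise ValueError("no contigs given")


# Contig lengths from the example: 90 bp in total, half is 45.
print(n50_l50([30, 20, 20, 10, 10]))  # (20, 2)
```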

NG50 - compared with genome size rather than assembly size
- N50 - contigs of this size or larger include 50 % of the assembly
- NG50 - contigs of this size or larger include 50 % of the genome
- NG50 is a better approximation of assembly quality, but sometimes cannot be calculated, e.g., when the genome size is unknown
- Can be quite different from N50, e.g., when the genome is 1.5 Gb but the assembly is 1 Gb due to non-assembled repeats
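A sketch of NG50 next to the N50 idea: the only change is that the threshold is half of the (estimated) genome size rather than half of the assembly size. The function name and the genome sizes are made up for the example, and returning `None` when the assembly never reaches half of the genome is a choice made here, not a standard.

```python
def ng50(lengths, genome_size):
    """Return NG50, or None if the contigs cover less than half the genome."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= genome_size:  # at least 50% of the genome reached
            return length
    return None


lengths = [30, 20, 20, 10, 10]  # 90 bp assembly
print(ng50(lengths, 90))   # 20, same as N50 when genome size = assembly size
print(ng50(lengths, 150))  # 10, a larger genome pushes NG50 down
print(ng50(lengths, 200))  # None, assembly is less than half of the genome
```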

Sequencing technology comparison

Sequencing system             Read length             Yield
454 Titanium XLR70            Up to 600 bp            450 Mbp/run
454 Titanium XL+              Up to 1000 bp           700 Mbp/run
Illumina Hi-Seq               100 bp                  37 Gbp/lane, 600 Gbp/run
Illumina Hi-Seq rapid 2x100   100 bp                  30 Gbp/lane, 120 Gbp/run
Illumina Hi-Seq rapid 2x150   150 bp                  45 Gbp/lane, 180 Gbp/run
Illumina MiSeq 2x300          Up to 300 bp            20-25 Gbp/lane, 150 Gbp/run
Ion Proton                    200 bp                  10-18 Gbp/run
Ion Torrent                   400 bp                  1 Gbp/run
PacBio                        1-40 kb                 1.4 Gb/SMRTcell
SOLiD 5500 Wildfire           75x35 PE, 60x60 MP      600 Gbp
Oxford Nanopore               <100k?                  ?

Error rates and types

Sequencing system             Error type      Error rate
454 Titanium XLR70            Indels          1%
454 Titanium XL+              Indels          1%
Illumina Hi-Seq               Substitutions   0.1%
Illumina Hi-Seq rapid 2x100   Substitutions   0.1%
Illumina Hi-Seq rapid 2x150   Substitutions   0.1%
Illumina MiSeq 2x300          Substitutions   0.1%
Ion Proton                    Indels          0.1%
Ion Torrent                   Indels          0.1%
PacBio                        Insertions      % depending on read length
SOLiD 5500 Wildfire           AT-bias         0.01%
Oxford Nanopore               Deletions?      3-15%?
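To get a feeling for what these rates mean per read, the expected number of errors can be estimated as read length times error rate. This is a rough sketch that assumes a uniform error rate along the read, which is a simplification (quality often drops towards the end of a read). The read lengths paired with the rates below are illustrative choices, not values taken from one specific run.

```python
def expected_errors(read_length_bp, error_rate):
    """Expected number of errors in one read, assuming a uniform error rate."""
    return read_length_bp * error_rate


# Illustrative combinations of read length and error rate.
examples = {
    "Illumina, 100 bp at 0.1%": expected_errors(100, 0.001),
    "Ion Torrent, 400 bp at 0.1%": expected_errors(400, 0.001),
    "454, 600 bp at 1%": expected_errors(600, 0.01),
}
for label, errors in examples.items():
    print(f"{label}: about {errors:.1f} errors per read")
```

Longer reads with the same per-base rate carry more errors per read, which is one reason the error type (substitution or indel) matters as much as the rate when choosing an assembler.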

454
Pros: Good length (>400 bp), long insert-sizes
Cons: Homopolymers, long running time, low yield, expensive, now deprecated

Illumina
Pros: Huge yield, cheap, reliable, read length “long enough” ( bp), industry standard = huge amount of available software
Cons: GC-problems, quality-dip at end of reads, long running time for Hi-Seq, short insert-sizes

Ion Proton
Pros: Good length (200 bp), RNA-seq stranded by default, high quality all through the read
Cons: Lower yield -> higher cost per base compared to Illumina, no paired-end/mate-pair

Ion Torrent
Pros: Excellent read length (400 bp), RNA-seq stranded by default, high quality all through the read
Cons: Lower yield -> higher cost per base compared to Illumina, no paired-end/mate-pair

SOLiD
Pros: Stable mate-pair protocols (10 kbp insert sizes), high yield
Cons: Very short sequences, uses specific chemistry that creates problems when using reads together with other technologies, now deprecated

Pacific Biosciences
Pros: Long reads (average 4.5 kbp)
Cons: High error rate on longer fragments (15%), expensive

You need help?
BILS is a VR-financed organization that offers bioinformatics support to all projects in Sweden.
Contact (please ask your PI if necessary) or go to bils.se and use the web
Biosupport.se is perfect for shorter questions.

Biosupport.se