De Novo Genome Assembly - Introduction


De Novo Genome Assembly - Introduction Henrik Lantz - NBIS/SciLife/Uppsala University

De Novo Assembly - Scope
- De novo genome assembly - not reference based
- Bioinformatics course - not biological interpretation
- Practical experience - focus on computer exercises
- Examples of programs - not exhaustive

Schedule - de novo assembly course
Monday November 14
09.00-09.15 Introduction (Henrik Lantz)
09.15-10.00 Lecture: NGS technologies and basic concepts (Henrik Lantz)
10.00-10.15 Coffee break
10.15-10.45 Lecture: Quality control and read trimming (Mahesh Panchal)
10.45-12.00 Exercise: Quality control and read trimming
12.00-13.00 Lunch
13.00-13.45 Lecture: Kmer-analysis, contamination analysis, and mapping-based analysis (Mahesh Panchal)
13.45-15.00 Exercise: Kmer-analysis, contamination analysis, and mapping-based analysis
15.00-15.15 Coffee break
15.15-16.30 Team based exercise on quality control
All lectures and exercises in this room!

Schedule - de novo assembly course
Tuesday November 15
09.00-09.30 Discussion of last day’s exercises (Henrik Lantz, Mahesh Panchal, Martin Norling)
09.30-10.00 Lecture: Assembly basics - Genome properties (Henrik Lantz)
10.00-10.15 Coffee break
10.15-11.00 Lecture: Illumina assembly (Martin Norling)
11.00-12.00 Exercise: Illumina assembly
12.00-13.00 Lunch
13.00-13.30 Exercise: Illumina assembly contd.
13.30-14.30 Lecture: PacBio assembly, assembly polishing, and demonstration of SMRT-portal (Mahesh Panchal)
14.30-17.00 Exercise: PacBio assembly (incl. coffee break)

Schedule - de novo assembly course
Wednesday November 16
09.00-09.30 Discussion of last day’s exercises (Mahesh Panchal, Martin Norling)
09.30-10.00 Lecture: Assembly assessment (Martin Norling)
10.00-10.15 Coffee break
10.15-11.00 Exercise: Assembly assessment
11.00-11.30 Lecture: Assembly validation (Martin Norling)
11.30-12.00 Exercise: Assembly validation
12.00-13.00 Lunch
13.00-13.30 Lecture: Contamination assessment (Martin Norling)
13.30-15.00 Exercise: Assembly validation contd. + contamination assessment
15.00-15.15 Coffee break
15.15-15.30 Exercise discussion (Martin Norling)
15.30-17.00 Wrap-up and project discussion

Practical info
- Coffee breaks
- Lunch
- Dinner at Meza Grill & Bar, Östra Ågatan 11

De Novo Genome Assembly - Assembly basics Henrik Lantz - BILS/SciLife/Uppsala University

De novo genome project workflow
- Extract DNA (and RNA)
- Choose best sequence technology for the project
- Sequencing
- Quality assessment and other pre-assembly investigations
- Assembly
- Assembly validation
- Assembly comparisons
- Repeat masking?
- Annotation


De novo genome project workflow: Extract DNA (and RNA)
- Extract much more DNA than you think you need
- Also remember to extract RNA for the annotation
- Single individual and haploid tissue if possible
- In particular for Illumina mate-pair data and PacBio, a lot of high molecular weight DNA is critical!
- Extracting DNA for de novo assembly is very different from extractions intended for PCR
- Do several extractions if possible, and run them on a gel to get an idea of how fragmented the DNA is
- Try to remove contaminants from the extractions

Causes of DNA degradation
By Olga Vinnere Pettersson, Uppsala Genome Center, SciLifeLab
- Mechanical damage during tissue homogenization.
- Wrong pH and ionic strength of extraction buffer.
- Incomplete removal of / contamination with nucleases.
- Phenol: too old, or inappropriately buffered (should be pH 7.8-8.0); incomplete removal.
- Wrong pH of DNA solvent (acidic water). Recommended: 1:10 TE for short-term storage, or 1xTE for long-term storage.
- Vigorous pipetting (use wide-bore pipette tips).
- Vortexing of DNA in high concentrations.
- Too many freeze-thaw cycles (we tested 5, still OK).
- Debatable: sequence-dependent degradation.

What are the main contaminants?
By Olga Vinnere Pettersson, Uppsala Genome Center, SciLifeLab
(The slide groups these by sample type; the group labels did not survive in the transcript.)
- Polysaccharides
- Lipopolysaccharides
- Growth media residuals
- Chitin
- Proteins
- Fats
- Polyphenols
- Secondary metabolites
- Pigments

What do absorption ratios tell us?
By Olga Vinnere Pettersson, Uppsala Genome Center, SciLifeLab
Pure DNA 260/280: 1.8-2.0
- < 1.8: Too little DNA compared to other components of the solution; presence of organic contaminants (proteins and phenol) or glycogen, which absorb at 280 nm.
- > 2.0: High share of RNA.
Pure DNA 260/230: 2.0-2.2
- < 2.0: Salt contamination, humic acids, peptides, aromatic compounds, polyphenols, urea, guanidine, thiocyanates (the latter three are common kit components), which absorb at 230 nm.
- > 2.2: High share of RNA, very high share of phenol, high turbidity, dirty instrument, wrong blank.
Photometrically active contaminants: phenol, polyphenols, EDTA, thiocyanate, protein, RNA, nucleotides (fragments below 5 bp)

DNA quality requirements
By Olga Vinnere Pettersson, Uppsala Genome Center, SciLifeLab
On a gel:
- Some DNA left in the well
- Sharp band of 20+ kb
- No sign of proteins
- No smear of degraded DNA
- No sign of RNA
NanoDrop:
- 260/280 = 1.8-2.0
- 260/230 = 2.0-2.2
Qubit or Picogreen:
- 10 kb insert libraries: 3-5 µg
- 20 kb insert libraries: 10-20 µg

Example (figure; by Olga Vinnere Pettersson, Uppsala Genome Center, SciLifeLab)

Some general concepts
- Assembly process
- Coverage
- Paired end/mate pair
- Insert size
- File formats
- Contigs/scaffolds
- N50

Next Generation Sequencing
- Genomic DNA is fragmented (not for Nanopore) and sequenced -> millions of small sequences (reads) from random parts of the genome
- Depending on sequencing technology, reads can be from 100 bp up to 100 kb in length

De novo assembly process
Genomic DNA -> Fragmentation + Sequencing -> Sequence reads -> Assembly (connections between reads found) -> Consensus sequence
Modified from “De Novo Genome Assembly” PDF by Torsten Seemann, Melbourne University.

.ace file of assembly

Assembly
(Figure labels: Reads, Overlapping reads, Per base coverage (5x at one position, 2x at another), Assembly)
- Consensus sequence = genome
- Usually it is the haploid genome that is reported
- Per base coverage = number of reads that support a certain position
- Depth of coverage = average coverage (often shortened to “coverage” or “depth”)

Depth of coverage
Coverage = (number of reads × read length) / genome size
Example 1: (10 × 10^6 reads × 100 bp) / (10 × 10^6 bp) = 100x coverage

Depth of coverage
Example 2: I know that the genome I am sequencing is 10 Mbases. I want 50x coverage to do a good assembly. I am ordering 125 bp Illumina reads. How many reads do I need?
(125 × N) / (10 × 10^6) = 50
N = (50 × 10 × 10^6) / 125 = 4 × 10^6 (4 million reads)
An Illumina lane gives you 180 × 2 million reads (PE)
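The two coverage examples can be reproduced with a few lines of Python. This is only a sketch of the arithmetic on the slides; the function names `depth_of_coverage` and `reads_needed` are made up for illustration and do not come from any assembly tool.

```python
import math


def depth_of_coverage(n_reads, read_length, genome_size):
    """Coverage = (number of reads x read length) / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Solve the coverage formula for the number of reads, rounded up."""
    return math.ceil(target_coverage * genome_size / read_length)


# Example 1: 10 million reads of 100 bp on a 10 Mb genome
print(depth_of_coverage(10_000_000, 100, 10_000_000))  # -> 100.0

# Example 2: 50x coverage of a 10 Mb genome with 125 bp reads
print(reads_needed(50, 125, 10_000_000))  # -> 4000000
```

For paired-end data, each read in a pair counts separately, so 4 million reads correspond to 2 million read pairs.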

Insert size
(Figure labels: Insert size, Read 1, Read 2, Inner mate distance, DNA fragment, Adapter+primer)

Paired-End

Mate-pair
Used to get long insert sizes

Orientation of paired reads Paired end (PE) reads Mate pair (MP) reads

Fastq format
@D00118:257:C8672ANXX:2:2302:2055:2109 1:N:0:GAGATTCC+GTACTGAC
CGTAGCCCTGTGCGACGGTGTCCGACTGCACGTCGCCGTCGTAGTTCTTGCACGCCCAGACGTAACCGCCTTCCC
+
3:>@BGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGECGGFGGGGGGGGGGGGGGGGGG
The names follow this format:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>
Quality values in increasing order:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
You might get the data in a .sff or .bam format. Fastq-reads are easy to extract from both of these binary (compressed) formats!
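As an illustration of the header layout and the quality string, here is a minimal Python sketch. The function names are hypothetical, and the quality offset of 33 is an assumption (the common Phred+33 encoding, in which "!" is the lowest score); older data may use a different offset.

```python
def parse_illumina_header(header):
    """Split an Illumina-style FASTQ header into its named fields."""
    left, right = header.lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x_pos, y_pos = left.split(":")
    read, is_filtered, control, index = right.split(":")
    return {
        "instrument": instrument, "run": run, "flowcell": flowcell,
        "lane": lane, "tile": tile, "x_pos": x_pos, "y_pos": y_pos,
        "read": read, "is_filtered": is_filtered,
        "control": control, "index": index,
    }


def phred_scores(quality, offset=33):
    """Convert quality characters to integer scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


fields = parse_illumina_header(
    "@D00118:257:C8672ANXX:2:2302:2055:2109 1:N:0:GAGATTCC+GTACTGAC"
)
print(fields["flowcell"], fields["lane"], fields["read"])  # -> C8672ANXX 2 1
print(phred_scores("3:>@BG"))  # -> [18, 25, 29, 31, 33, 38]
```

The low scores at the start of the example read ("3", ":") followed by a long run of "G" (score 38) show how quality can be read off base by base.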

Fastq format - paired reads
First file:
@D00118:257:C8672ANXX:2:2302:2055:2109 1:N:0:GAGATTCC+GTACTGAC
CGTAGCCCTGTGCGACGGTGTCCGACTGCACGTCGCCGTCGTAGTTCTTGCACGCCCAGACGTAACCGCCTTCCC
+
3:>@BGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGECGGFGGGGGGGGGGGGGGGGGG
Second file:
@D00118:257:C8672ANXX:2:2302:2055:2109 2:N:0:GAGATTCC+GTACTGAC
GCGCATTGTCGCCTATGACCCGAACCTGAGCCCTGAGCAATGGTTCGCCTTCACCCCGCCCCGAGGACGGCGGC
+
CCCCCGGGGFGGGGGGGGGGGGGGDGGGGGGGGEGGGGGGCGEGGGGGGGGGGGGGGGGGGGGFGGGGG
The names follow this format:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>
For paired sequences (paired-end or mate-pairs) you get two files. Every read in the first file has an almost identically named “friend” read in the second file. They differ by one single number.
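The "differ by one single number" rule can be checked programmatically, which is useful after trimming, since trimming tools can leave the two files out of sync. A minimal sketch, assuming the header layout shown above; `is_mate` is a hypothetical helper name.

```python
def is_mate(header1, header2):
    """Return True if two FASTQ headers name the two reads of one pair.

    Assumes the Illumina layout: the part before the space must be
    identical, and the <read> field must be 1 in one header and 2 in
    the other.
    """
    name1, description1 = header1.split(" ", 1)
    name2, description2 = header2.split(" ", 1)
    read1 = description1.split(":", 1)[0]
    read2 = description2.split(":", 1)[0]
    return name1 == name2 and {read1, read2} == {"1", "2"}


first = "@D00118:257:C8672ANXX:2:2302:2055:2109 1:N:0:GAGATTCC+GTACTGAC"
second = "@D00118:257:C8672ANXX:2:2302:2055:2109 2:N:0:GAGATTCC+GTACTGAC"
print(is_mate(first, second))  # -> True
print(is_mate(first, first))   # -> False
```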

Fasta format
>asmbl_2719
AGCACCTAGAGCAGGATGGGAGGTCTCTCCTTGCTGTGGCAGAGGCAGATCTCCTTTCCC
AACACCTAGCAGTATGAACTAGTGAGCTCCTGACTGTTTTCCAGTGGTAATGAGGTGTGA
CCCGCTGCAGCTGCACACTGAATTCTCTCAGTTCCCCGAGGCCAGCCCAGCAGTGTGGGC
AATGCTTTGTTTGTGTGCTGTTGACCATTCC
>asmbl_2702
GTCTGCACTGGGAATGCCCCCTGGAGCAGAACCATTGCCATGGATAAGGACACTACATTT
CCTGGTGTTAAGGTGAATATAACCTCCAGGTTAAGGATGACATTAATTTCAATTACAGCT
TGCCTCTTGTAAGCTAAGCAGTTAATCAACAAGCTATACTGTGACTACACCCTTAGATCA
ATAGCTGGGAAAACATCACCTCCCCCAAATACTCCACCTCTTAACTGCACTCTTTGAAAG
AAGTACAGGCCAGAGTTTAGCTGATCCATCCCTGTGGCTAATCGTCCTGCTTACAAGCTG
CAATATTTTTTAAAACCAGACAATTGGTAGAGGTTTAAACATCAGCCAAGCTGTTCAATT
TACAGCAGGTTAAGCATTCCTGAAACTGTGATCACTGATATATTTGGGTCAGTCAGATGT
CTTGTTAGTGCTT
>asmbl_2701
ACAAACAAAACAAAATAAAACAAAGGAAACAAGCAAAAAAAACCATCATACAATCCCATG
TGTCCAAGAGCTTTACTGTGAAATCAACTATGGAGTCAAAACAATAGAAAAGCTTCCAGA
TTTCTGTATTCCAGGCTGAGACAAGTTTGTAAATACTTCCAGAAATTGCCAACAAGCCTG
CAGGGTAACATCTCTAATGCACACCTCCCTGATACGAAATGCAGAGCACCTTAACTTCTT
CAGCCCTCCCCCAGTCACAACCAGCTATAAATCCTGCCCTTCACTTGTTGGAATATCTCA
TCATAAGGGAAGCATTTTTTAGGCTGAGAAATACAAATCCACCTTGACGGAGCCGGTCAG
GCATATACATGGGCTATGCTGCTGATAGGTTTGTACCAAGCACTCCTAGTGTGAGAATAA
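A FASTA file is simple enough to read without a library. The sketch below is one way to do it, assuming each record starts with a ">" line and the sequence may wrap over several lines; `read_fasta` is a hypothetical name, and the short sequences are toy data rather than the records shown on the slide.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {name: sequence} dictionary."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after ">"
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    # Join wrapped sequence lines into one string per record
    return {name: "".join(chunks) for name, chunks in parts.items()}


toy = """>asmbl_A
AGCACC
TAGAGC
>asmbl_B
GTCTGCAC
"""
records = read_fasta(toy.splitlines())
for name, sequence in records.items():
    print(name, len(sequence))
# -> asmbl_A 12
# -> asmbl_B 8
```

The same function works on an open file handle, since iterating over a file yields its lines.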

Contigs and scaffolds
Contig = a contiguous stretch of nucleotides resulting from the assembly of several reads
Scaffold = several contigs stitched together with NNNs in between
(Figure: paired reads link contig1, contig2, and contig3 into scaffold1, with NNN gaps between the contigs)
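Because gaps in a scaffold are written as runs of N, the contigs can be recovered by splitting the scaffold sequence at those runs. A minimal sketch with toy data; `split_scaffold` is a hypothetical helper, and it treats any run of one or more N as a gap, which is an assumption (some pipelines only count runs above a minimum length).

```python
import re


def split_scaffold(scaffold):
    """Split a scaffold sequence into contigs at runs of N (either case)."""
    return [contig for contig in re.split(r"[Nn]+", scaffold) if contig]


scaffold1 = "ACGTACGTNNNGGCCGGNNNTTAATTAA"
print(split_scaffold(scaffold1))  # -> ['ACGTACGT', 'GGCCGG', 'TTAATTAA']
```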

A scaffold in fasta-format

N50 - a measure of contiguity (at best)
N50 = contigs of this size or larger include 50 % of the assembly
>contig1 TTTATGTCCGTAGCATGTAGACATATGGCA (30 bp; running total 30)
>contig2 AGTCTTGAGCCGAATTCGTG (20 bp; running total 30+20=50, which is >45)
>contig3 GTTGGAGCTATTCAGCGTAC (20 bp)
>contig4 ACAAATGATC (10 bp)
>contig5 CGCTTCGAAC (10 bp)
90 bp total; 50% of total = 45
L50 = number of contigs that include 50% of the assembly.
Here, L50=2! N50=20!
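The worked example translates directly into code: sort the contig lengths from longest to shortest, add them up, and stop when the running total reaches half of the assembly size. This is a sketch of that definition, not the implementation used by any particular assessment tool; `n50_l50` is a made-up name.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths."""
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        # Stop at the first contig where the running total covers half
        if running * 2 >= total:
            return length, count
    raise ValueError("no contigs given")


# Contig lengths from the slide: 30, 20, 20, 10, 10 (total 90, half = 45)
print(n50_l50([30, 20, 20, 10, 10]))  # -> (20, 2)
```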

NG50 - compared with genome size rather than assembly size
- N50 - contigs of this size or larger include 50 % of the assembly
- NG50 - contigs of this size or larger include 50 % of the genome
- NG50 is a better approximation of assembly quality, but sometimes cannot be calculated, e.g., when the genome size is unknown
- Can be quite different from N50, e.g., when the genome is 1.5 Gb but the assembly is 1 Gb due to non-assembled repeats
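NG50 uses the same running-total idea as N50, but the target is half of the (estimated) genome size instead of half of the assembly size. A minimal sketch with toy numbers; `ng50` is a hypothetical name, and returning `None` when the assembly covers less than half of the genome is a design choice made here for illustration.

```python
def ng50(lengths, genome_size):
    """Return NG50, or None if the contigs cover less than half the genome."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= genome_size:
            return length
    return None


contigs = [30, 20, 20, 10, 10]          # assembly size 90, N50 = 20
print(ng50(contigs, genome_size=90))    # -> 20 (same as N50)
print(ng50(contigs, genome_size=150))   # -> 10 (genome larger than assembly)
print(ng50(contigs, genome_size=200))   # -> None (less than half assembled)
```

The middle case mirrors the situation on the slide: when repeats are left unassembled, the assembly is smaller than the genome and NG50 drops below N50.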

NGS Sequence technologies
Deprecated
- 454
- SOLiD
Supported, not used much in genome assembly
- Ion Torrent (Ion PGM)
- Ion Proton
Current workhorses
- Illumina
- Pacific Biosciences
Up and coming?
- Oxford Nanopore

Supporting technologies
- Dovetail Genomics (Chicago libraries)
- BioNano (Irys system)
- 10x Genomics - GemCode

Sequencing technology comparison
Sequencing system | Read length | Yield
Illumina HiSeq 2500 | 2x125 bp | 180 M read pairs/lane, 28 Gbp/lane
Illumina HiSeqX | 2x150 bp | 350 M read pairs/lane, 78 Gbp/lane
Illumina MiSeq | Up to 2x300 bp | 18 M read pairs/lane, 7.4 Gbp/run
PacBio | 1-20 (70) kb | 1.3 Gb/SMRTcell
Oxford Nanopore | 1-100 kb | ?

Error rates and types
Sequencing system | Error type | Error rate
Illumina | Substitutions | 0.1%
PacBio | Insertions | 0.001-15% depending on read length
Oxford Nanopore | Substitutions, indels | 38%

Illumina technology

Illumina
Pros: Huge yield, cheap, reliable, read length “long enough” (100-300 bp), industry standard = huge amount of available software
Cons: GC-problems, quality-dip at end of reads, long running time for HiSeq, short insert sizes

PacBio technology

Pacific Biosciences
Pros: Long reads (average 4.5 kbp), single molecules
Cons: High error rate on longer fragments (15%), expensive

Nanopore technology

Nanopore
Pros: Extremely long sequences, single molecule, portable (MinION)
Cons: Very high error rates (up to 38% reported)

10x Genomics
- Long DNA fragments are separated in gel beads (GEMs) and then sequenced with Illumina HiSeq -> a “cloud” of reads originating from the same (long) DNA fragment
- These reads can then be used to assemble the genome (Supernova) or scaffold/phase the genome (Architect)

BioNano

Dovetail Genomics

You need help?
NBIS is a VR-financed organization that offers bioinformatics support to all projects in Sweden. Please go to http://nbis.se/support/supportform/index.php to apply for support.
Biosupport.se is perfect for shorter questions.

Biosupport.se