Universidad de los Andes, Bogotá, Colombia, September 2015. Sequence and annotation of genomes and metagenomes with Galaxy. Dr. rer. nat. Diego Mauricio.


Universidad de los Andes, Bogotá, Colombia, September 2015
Sequence and annotation of genomes and metagenomes with Galaxy
Dr. rer. nat. Diego Mauricio Riaño Pachón
Brazilian Bioethanol Science and Technology Laboratory (CTBE), Brazilian Center for Research in Energy and Materials (CNPEM)
Genome assembly

Genome sequencing: before and now
Before: industry scale. Lots of equipment and lots of personnel (wet lab and dry lab).
Today: a single technician can produce hundreds or thousands of times more data in a week, and a single bioinformatician (if any) must analyze the data.


The data avalanche: costs
Stein, Genome Biology, 11:207: “in the not too distant future it will cost less to sequence a base of DNA than to store it on a hard disk”
Data for a HiSeq 2500 (v4):
Cost of one flowcell: US$
Yield: 500 Gbp
Cost per bp: US$4x10^-6
Cost to store 1 TB: US$900
Cost to store 1 bp (FastQ format, ~5 bytes): US$4.5x10^-4
There are not enough bioinformaticians to cope with the speed of data generation. Biologists should become savvy in genome assembly and annotation.

But... a lot of bioinformatics analysis looks like this

Galaxy, bridging the gap

Genome assembly

De novo genome assembly: why?
No reference is available. You are the only one studying that bug!
The available references might not be the best ones: pan-genome vs. core genome, species definition.

How do you get the genome sequence of an organism?
Example: imagine a genome of 10 bp. You have three copies, and each copy gets fragmented as follows (you do not know the order of the fragments):
Copy 1: TG, ATG and CCTAC
Copy 2: AT, GCC and TACTG
Copy 3: CTG, CTA and ATGC
Which is the original genome sequence? Overlapping the fragments from the different copies gives ATGCCTACTG.

De novo genome assembly: concepts
Each fragment can be sequenced from either end, preferably from both: paired-end sequencing (paired ends, mate pairs).
Yesterday we talked about types of reads; let's see how they work together to get a genome assembly.

De novo genome assembly: concepts
Typically bp
How? Assemblers!
Scaffolding: use a reference, use mate pairs.


De novo genome assembly: overview (Ekblom & Wolf)


De novo genome assembly: wet-lab effort (Ekblom & Wolf)
How much data to generate? How many reads do I need?

De novo genome assembly: wet-lab effort (Sims et al.)
How much data to generate? How many reads do I need?
The goal: generate robust scientific findings at the lowest sequencing cost.
Coverage! What is coverage?

De novo genome assembly: concepts (Sims et al.)
Coverage! “The expected coverage is the average number of times that each nucleotide is expected to be sequenced given a certain number of reads of a given length and the assumption that reads are randomly distributed across an idealized genome.”
Also called depth or depth of coverage.

De novo genome assembly: coverage, Sanger vs. NGS (Sims et al.; Lander & Waterman)
Figure: theoretical coverage according to the Lander-Waterman formula for human genome and exome sequencing.
Typical coverage: Sanger 8-10; Illumina
Why?
Coverage = (read length × number of reads) / genome size (Lander and Waterman, 1988)
Probability that a base is sequenced Y times, under the Poisson approximation: P(Y = y) = (C^y × e^(-C)) / y!, where C is the coverage.

De novo genome assembly: coverage, NGS (Sims et al.; Lander & Waterman)
Probability that a base is sequenced Y times, under the Poisson approximation: P(Y = y) = (C^y × e^(-C)) / y!, where C is the coverage.
Exercises, for a coverage of 5:
1. Compute the probability that a base is sequenced exactly 4 times.
2. Compute the probability that a base is sequenced at most 4 times.
3. Compute the probability that a base is sequenced at least 4 times.
You can interpret the second probability as the proportion of bases that will be covered by 4 or fewer reads.
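The coverage formula and the three probabilities above can be checked with a few lines of Python. This is a minimal sketch that assumes the Poisson model for sequencing depth; the function names are illustrative and not part of the course material.

```python
import math


def expected_coverage(read_length, n_reads, genome_size):
    """Coverage = read length x number of reads / genome size."""
    return read_length * n_reads / genome_size


def prob_depth(y, coverage):
    """Poisson probability that a base is sequenced exactly y times."""
    return coverage ** y * math.exp(-coverage) / math.factorial(y)


def prob_at_most(y, coverage):
    """Probability that a base is sequenced y times or fewer."""
    return sum(prob_depth(i, coverage) for i in range(y + 1))


def prob_at_least(y, coverage):
    """Probability that a base is sequenced y times or more."""
    return 1.0 - prob_at_most(y - 1, coverage)


if __name__ == "__main__":
    c = 5
    print(f"P(Y = 4)  = {prob_depth(4, c):.4f}")
    print(f"P(Y <= 4) = {prob_at_most(4, c):.4f}")
    print(f"P(Y >= 4) = {prob_at_least(4, c):.4f}")
```

Running the script prints the three probabilities for a coverage of 5; changing `c` shows how quickly the fraction of poorly covered bases shrinks as coverage grows.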

De novo genome assembly: coverage, Illumina
Use the Illumina coverage calculator to compute:
1. The number of bacterial species with a genome size of 5 Mbp that you could sequence at a coverage of 50x on: MiSeq v3 reagents; MiSeq v2 reagents; MiSeq v2 Nano; HiSeq 2500 Rapid Run with cBot.
2. The number of plant genomes with a genome size of 400 Mbp that you could sequence at a coverage of 50x on: MiSeq v3 reagents; MiSeq v2 reagents; MiSeq v2 Nano; HiSeq 2500 Rapid Run with cBot; HiSeq 2500 High Output.

Effect of read length on breadth of coverage
Figure: percentage of the E. coli genome recovered by contigs greater than a threshold length, as a function of read length (Whiteford et al.).

After all that, you have decided how much data you need, paid your sequencing provider and sent your sample, and you should now have some nice clean reads.
Let's see how to do a de novo genome assembly.

De novo contig assembly: overview
From small, randomly located sequence fragments to consensus, contiguous sequences.
“An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target. It groups reads into contigs and contigs into scaffolds.” (Miller et al.)

Methods for de novo genome assembly
Overlap-Layout-Consensus (OLC)
De Bruijn graphs
Both use graph theory to tackle the problem of genome assembly. The difference is in how you build the graph.

Basics of graph theory: a tale of bridges (Compeau et al.)
Seven Bridges of Königsberg (today Kaliningrad, Russia): is there a walk through the city that crosses each bridge once and only once?
Leonhard Euler (Basel, Switzerland). Euler's insights (1735):
The route inside each island is irrelevant.
Only the sequence of bridges crossed is important.
Simplify the problem: each land mass becomes a vertex (or node) and each bridge an edge, giving a graph G = {V, E}.

Basics of graph theory: a tale of bridges
Seven Bridges of Königsberg, Leonhard Euler. A graph G = {V, E}.
Except at the endpoints of the walk, each time one enters a node, one also leaves that node. If each bridge has to be traversed exactly once, it follows that, except for the start and finish, the number of bridges (edges) touching each land mass (node) must be even.
Degree of a node: the number of edges connected to the node.
As all land masses have an odd degree, one cannot possibly traverse each bridge exactly once.
A necessary condition for the walk is that the graph has exactly zero or two nodes of odd degree.
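Euler's degree argument is easy to verify in code. The sketch below encodes the seven bridges as an edge list and counts the nodes of odd degree; the land-mass names (`north`, `south`, `island`, `east`) are labels chosen here for illustration, and the check covers only the necessary degree condition, assuming the graph is connected.

```python
from collections import Counter

# Seven bridges of Königsberg as (land mass, land mass) pairs.
# Node names are illustrative labels, not taken from the slides.
KONIGSBERG_BRIDGES = [
    ("north", "island"), ("north", "island"),
    ("south", "island"), ("south", "island"),
    ("north", "east"), ("south", "east"), ("island", "east"),
]


def node_degrees(edges):
    """Degree of a node = number of edge ends touching it."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


def odd_degree_nodes(edges):
    """Sorted list of nodes whose degree is odd."""
    return sorted(n for n, d in node_degrees(edges).items() if d % 2 == 1)


def satisfies_euler_condition(edges):
    """Necessary condition for a walk using every edge exactly once."""
    return len(odd_degree_nodes(edges)) in (0, 2)


if __name__ == "__main__":
    print(dict(node_degrees(KONIGSBERG_BRIDGES)))
    print(satisfies_euler_condition(KONIGSBERG_BRIDGES))
```

All four land masses come out with an odd degree, so the condition fails, which is Euler's answer to the puzzle.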

Types of graphs
Mention some examples of such graphs!

Graph theory in biology
Regulatory, signal transduction and metabolic networks
Social and epidemiological networks
Phylogenetic trees

Methods for de novo genome assembly: problem abstraction/representation
Overlap-Layout-Consensus (OLC)
De Bruijn graphs
Both use graph theory to tackle the problem of genome assembly. The difference is in how you build the graph.

Overlap-Layout-Consensus (OLC)
Overlap: build the overlap graph.
Layout: build contigs.
Consensus: select the likely nucleotide sequence for each contig.

Overlap-Layout-Consensus (OLC): overlap (El-Metwally et al.)
A very simple task: for each pair of reads, find the overlap. But it is computationally very demanding for a large number of reads.
Reads are nodes; there is an edge between two nodes if there is a suffix-prefix relationship between them.

Overlap-Layout-Consensus (OLC): overlap
How to compute the overlaps? Naïve approach: look for suffix-to-prefix overlaps, and do this for each pair of reads!
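A minimal sketch of the naïve approach follows. It tries every suffix length for every ordered pair of reads and tolerates no mismatches, so it only illustrates why the step is costly; the function names and the `min_length` parameter are assumptions made for this example.

```python
def suffix_prefix_overlap(a, b, min_length=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_length` (>= 1) exists.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_overlaps(reads, min_length=1):
    """Compare every ordered pair of reads: quadratic in the number of reads."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = suffix_prefix_overlap(a, b, min_length)
                if length:
                    overlaps[(a, b)] = length
    return overlaps


if __name__ == "__main__":
    for (a, b), length in all_overlaps(["GACATA", "ATAGAC"]).items():
        print(f"{a} -> {b}: overlap of {length}")
```

With n reads the double loop performs n × (n − 1) comparisons, which is the cost that the suffix-tree approach tries to avoid.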

Overlap-Layout-Consensus (OLC): overlap
How to compute the overlaps? Generalized suffix tree: a more efficient approach in most cases.
Figure: generalized suffix tree for { “GACATA”, “ATAGAC” }, built from S = GACATA$0 ATAGAC$1.
Where is the query GACATA? Check that all of its suffixes are present in the tree.

Overlap-Layout-Consensus (OLC): overlap
Figure: generalized suffix tree for { “GACATA”, “ATAGAC” }, with one edge highlighted in blue.
The blue edge implies that the length-3 suffix of the second string (ATAGAC) equals the length-3 prefix of the query (GACATA).

Overlap-Layout-Consensus (OLC): overlap
Generalized suffix tree for { “GACATA”, “ATAGAC” }: now use ATAGAC as the query. Which are its suffix-prefix alignments with GACATA?

Overlap-Layout-Consensus (OLC): overlap
How to compute the overlaps? Dynamic programming is needed because of sequencing errors, e.g., indels or mismatches.
First use the suffix tree to reduce the number of reads that must be aligned with dynamic programming; this tremendously reduces the size of the problem.


Dynamic programming: global alignment
Eddy SA. What is dynamic programming? Nature Biotech. 22
Gap penalty: λ = -6. Similarity matrix (σ): match = +5, mismatch = -2.
Sequences: ACACTA (columns, index j) and AGCACACA (rows, index i).
Initialize cell (0,0) = 0, then fill in the cells.

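Dynamic programming: global alignment (matrix being filled)
Columns (j): -, A, C, A, C, T, A. Rows (i): -, A, G, C, A, C, A, C, A.
First column: 0, -6 (row A), -12 (row G). Cell (row A, column A) = +5.
Match = 5, mismatch = -2, gap = -6.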
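The matrix-filling step can be sketched as the standard global alignment (Needleman-Wunsch) recurrence, using the scores from the slide: each cell takes the best of the diagonal neighbor plus the match/mismatch score, or the upper or left neighbor plus the gap penalty. This is a generic textbook sketch, not code from the course; the function names are illustrative.

```python
MATCH, MISMATCH, GAP = 5, -2, -6  # scores used on the slide


def global_alignment_matrix(x, y, match=MATCH, mismatch=MISMATCH, gap=GAP):
    """Fill the dynamic programming matrix; x runs along columns, y along rows."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: only gaps
        F[i][0] = i * gap
    for j in range(1, cols):          # first row: only gaps
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if y[i - 1] == x[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align the two bases
                          F[i - 1][j] + gap,     # gap in x
                          F[i][j - 1] + gap)     # gap in y
    return F


def global_alignment_score(x, y, **scores):
    """Optimal global alignment score = bottom-right cell."""
    return global_alignment_matrix(x, y, **scores)[-1][-1]


if __name__ == "__main__":
    for row in global_alignment_matrix("ACACTA", "AGCACACA"):
        print(" ".join(f"{v:4d}" for v in row))
```

Printing the matrix for ACACTA and AGCACACA reproduces the first cells shown on the slide (0, -6, -12 in the first column and +5 for the first A/A pair); a traceback from the bottom-right cell, not shown here, would recover the alignment itself.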

Overlap-Layout-Consensus (OLC): layout
Select the path that visits every node, i.e., look for a Hamiltonian path in the graph.

Overlap-Layout-Consensus (OLC): layout
Select the path that visits every node exactly once, i.e., look for a Hamiltonian path in the graph.
Overlap graph: edges represent overlaps of 2 or more nt. Search for the Hamiltonian path.
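For a handful of reads, the layout step can be illustrated by brute force: try every ordering of the reads and keep the first one in which consecutive reads overlap by at least 2 nt. The three reads below are hypothetical, cut from the 10 bp toy genome ATGCCTACTG used earlier in the talk, and the exhaustive search is only a sketch; real assemblers use heuristics because the number of orderings grows factorially.

```python
from itertools import permutations


def overlap(a, b, min_length):
    """Longest exact suffix(a)/prefix(b) overlap of at least min_length, else 0."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def hamiltonian_path(reads, min_length=2):
    """Return an ordering that visits every read once along overlap edges."""
    for order in permutations(reads):
        if all(overlap(a, b, min_length) for a, b in zip(order, order[1:])):
            return list(order)
    return None


def merge_path(path, min_length=2):
    """Spell the contig by appending the non-overlapping tail of each read."""
    contig = path[0]
    for prev, nxt in zip(path, path[1:]):
        contig += nxt[overlap(prev, nxt, min_length):]
    return contig


if __name__ == "__main__":
    toy_reads = ["CTACTG", "ATGCC", "GCCTA"]  # hypothetical reads, shuffled
    path = hamiltonian_path(toy_reads)
    print(path, merge_path(path))
```

Three reads mean only 6 orderings; 20 reads would already mean about 2.4 × 10^18, which is why the Hamiltonian formulation is considered hard to solve.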

Overlap-Layout-Consensus (OLC): consensus

Overlap-Layout-Consensus: drawbacks
1. The overlap step is very time consuming.
2. The overlap graph is large: you need one node per read (consider sequencing errors), and the number of edges grows faster than the number of nodes.
3. Not practical when you have hundreds of millions of reads, i.e., Illumina. But good for datasets of long reads (e.g., Celera Assembler).

Overlap-Layout-Consensus: software
Software | Year | Reference | Download
ARACHNE | 2002 | Genome Res | http://
PHRAP | | | phrap.html
CAP | 1999 | Genome Res | http://seq.cs.iastate.edu/
TIGR | 1995 | Genome Sci Tech :9-19 | http://
CELERA | 2000 | Science | http://wgs-assembler.sourceforge.net
Newbler | 2005 | Nature | http://

k-mers
k-mers are strings of length k, made of characters from a defined alphabet.
Given the set of reads R = {TACAGT, CAGTC, AGTCA, CAGA}, answer:
1. How many k-mers are in these reads (including duplicates) for k = 3?
2. How many distinct k-mers are in these reads for (a) k = 2, (b) k = 3, (c) k = 5?
3. It appears that these reads come from the toy genome TACAGTCAGA. What is the largest k such that the set of distinct k-mers in the genome is exactly the set of distinct k-mers in the reads?
4. For any value of k, is there a mathematical relationship between N, the number of k-mers (including duplicates) in a sequence, and L, the length of the sequence?
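The exercise can be explored with a short script. This is a minimal sketch with illustrative function names; it slides a window of length k over each read, which also hints at the relationship asked for in question 4.

```python
def kmers(sequence, k):
    """All k-mers of a sequence, in order, duplicates included."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def all_kmers(reads, k):
    """k-mers from every read, duplicates included."""
    return [kmer for read in reads for kmer in kmers(read, k)]


def distinct_kmers(reads, k):
    """Set of distinct k-mers found in the reads."""
    return set(all_kmers(reads, k))


if __name__ == "__main__":
    reads = ["TACAGT", "CAGTC", "AGTCA", "CAGA"]
    genome = "TACAGTCAGA"
    for k in range(2, 6):
        same = distinct_kmers(reads, k) == distinct_kmers([genome], k)
        print(k, len(all_kmers(reads, k)), len(distinct_kmers(reads, k)), same)
```

Each output row lists k, the number of k-mers with duplicates, the number of distinct k-mers, and whether the reads and the genome share exactly the same k-mer set.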

De Bruijn graphs (Compeau et al.)
Nicolaas de Bruijn (1918-2012), Netherlands.
The problem: find a shortest circular “superstring” that contains all possible “substrings” of length k (k-mers) over a given alphabet.
How many k-mers exist over an alphabet of size n?
Build a graph where every possible (k-1)-mer is a node. Draw an edge between two nodes if there is a k-mer whose prefix is the first node and whose suffix is the second node.
Example: find the shortest circular superstring that contains all 4-mers over a binary alphabet.

De Bruijn graphs (Compeau et al.)
Find the shortest circular superstring that contains all 4-mers over a binary alphabet.
How many 4-mers exist over an alphabet of size 2? 2^4 = 16.
1. Create all (k-1)-mer nodes (how many?).
2. Draw edges: one per possible 4-mer, from the node that is its prefix to the node that is its suffix.
The shortest superstring containing all 4-mers corresponds to an Eulerian cycle.
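The construction can be sketched in Python: build the graph of (k-1)-mers, follow an Eulerian cycle with Hierholzer's algorithm, and read off one character per visited node. The algorithm choice and the function names are assumptions made for this illustration (the slides only state that an Eulerian cycle is needed), and the sketch assumes k ≥ 2.

```python
from itertools import product


def de_bruijn_graph(alphabet, k):
    """Nodes are all (k-1)-mers; every k-mer adds an edge prefix -> suffix."""
    graph = {"".join(p): [] for p in product(alphabet, repeat=k - 1)}
    for p in product(alphabet, repeat=k):
        kmer = "".join(p)
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_cycle(graph):
    """Hierholzer's algorithm; assumes a connected graph in which every
    node has as many incoming as outgoing edges."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, cycle = [next(iter(remaining))], []
    while stack:
        node = stack[-1]
        if remaining[node]:
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            cycle.append(stack.pop())            # dead end: node is finished
    return cycle[::-1]


def circular_superstring(alphabet, k):
    """Spell the cycle: first character of each node, closing node dropped."""
    cycle = eulerian_cycle(de_bruijn_graph(alphabet, k))
    return "".join(node[0] for node in cycle[:-1])


if __name__ == "__main__":
    print(circular_superstring("01", 4))
```

The printed string has 16 characters, and reading it circularly yields each of the 16 binary 4-mers exactly once; several different Eulerian cycles exist, so the exact string depends on the order in which edges are followed.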

De Bruijn graphs (Compeau et al.)
The edges in the de Bruijn graph represent all possible k-mers.

Graphs for genome assembly (Compeau et al.)
Overlap graph. But computing read overlaps is very costly.

Graphs for genome assembly (Compeau et al.)
Then, split the reads into k-mers (substrings of length k). Now you have two options:
1. Let the k-mers be the nodes of the graph (k-mer graph), here with k = 3.
Reads: ATGGCGT, GGCGTGC, CGTGCAA, TGCAATG, CAATGGC
Nodes: ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT
Draw edges based on pairwise alignments, then look for a Hamiltonian cycle to spell the genome: visit each vertex once (hard to solve).

De Bruijn graphs for genome assembly (Compeau et al.)
Then, split the reads into k-mers (substrings of length k).
2. Let the (k-1)-mers, i.e., the prefixes and suffixes of the k-mers, be the nodes of the graph ((k-1)-mer graph), here with k = 3.
Reads: ATGGCGT, GGCGTGC, CGTGCAA, TGCAATG, CAATGGC
Nodes: AT, CG, GG, CA, AA, GT, GC, TG
Edges (k-mers): ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT
Each edge represents the k-mer with that particular prefix and suffix. Look for an Eulerian cycle to spell the genome: visit each edge once (easier to solve).
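Building this graph from the reads takes only a few lines. The sketch below uses the five reads from the slide, creates one edge per distinct 3-mer, and checks that every node has as many incoming as outgoing edges, which is the condition for an Eulerian cycle to exist in a connected directed graph. Function names are illustrative.

```python
from collections import defaultdict

READS = ["ATGGCGT", "GGCGTGC", "CGTGCAA", "TGCAATG", "CAATGGC"]


def read_kmers(reads, k):
    """Distinct k-mers found in the reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def de_bruijn_from_reads(reads, k):
    """One edge per distinct k-mer, from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in sorted(read_kmers(reads, k)):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def is_balanced(graph):
    """True when in-degree equals out-degree at every node."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    nodes = set(graph) | set(in_degree)
    return all(len(graph.get(n, [])) == in_degree[n] for n in nodes)


if __name__ == "__main__":
    graph = de_bruijn_from_reads(READS, 3)
    for node, targets in sorted(graph.items()):
        print(node, "->", ", ".join(targets))
    print("balanced:", is_balanced(graph))
```

The graph has 8 nodes and 10 edges and is balanced, so an Eulerian cycle exists. Note that the nodes TG and GC each have two outgoing edges, so more than one Eulerian cycle is possible in this toy example.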

De Bruijn graph: assumptions so far (Compeau et al.)
1. All k-mers present in the genome are available.
2. k-mers are error free.
3. Each k-mer appears at most once in the genome.
4. The genome is a single circular chromosome.
These assumptions do not hold for NGS datasets!
1. Generating (nearly) all k-mers present in the genome
Reads of length k capture only a small fraction of the k-mers in the genome, e.g., due to difficulties in sequencing some genomic regions.
Genome sequence: ATGGCGTGCA
Reads: ATGGCGT, GGCGTGC, CGTGCAA, TGCAATG, CAATGGC
Do the reads represent all the 7-mers from the genome? What happens if you break your reads into 3-mers?
That is why we do not use k = read length. When k is smaller than the read length, the resulting k-mers represent nearly all the k-mers in the genome.

De Bruijn graph: assumptions so far (Compeau et al.)
2. Handling errors in reads
Errors create bulges in the de Bruijn graph; a single sequencing error creates a bulge and increases the size of the graph. The same happens with inexact repeats or polymorphisms.
Different packages deal with the bulges in different ways. As an alternative, error-correct the reads prior to the assembly.

De Bruijn graph: assumptions so far (Compeau et al.)
3. Handling DNA repeats
Take the cyclic genome ATGCATGC and the 3-mer reads ATG, TGC, GCA, CAT.
Obtain the genome sequence from the reads using a de Bruijn graph with k = 3! Check whether all k-mers in the genome are available.

De Bruijn graph: assumptions so far (Compeau et al.)
3. Handling DNA repeats
Take the cyclic genome ATGCATGC and the 3-mer reads ATG, TGC, GCA, CAT.
One solution is to record how many times each k-mer appears (m = k-mer multiplicity), drawing m edges between its prefix and its suffix.
Obtain the genome sequence from the reads using a de Bruijn graph with k = 3, assuming a k-mer multiplicity of 2.
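The multiplicity idea can be sketched as follows: count how often each k-mer occurs in the cyclic genome, then emit one edge per occurrence, so the graph becomes a multigraph. The helper names are illustrative; in practice the multiplicities would have to be estimated from read counts rather than taken from a known genome.

```python
from collections import Counter


def cyclic_kmer_counts(genome, k):
    """Count k-mers of a circular genome, wrapping around the end."""
    extended = genome + genome[:k - 1]
    return Counter(extended[i:i + k] for i in range(len(genome)))


def multigraph_edges(counts):
    """Draw m parallel edges (prefix -> suffix) for a k-mer of multiplicity m."""
    edges = []
    for kmer, multiplicity in sorted(counts.items()):
        edges.extend([(kmer[:-1], kmer[1:])] * multiplicity)
    return edges


if __name__ == "__main__":
    counts = cyclic_kmer_counts("ATGCATGC", 3)
    print(dict(counts))
    print(multigraph_edges(counts))
```

Without multiplicities the four distinct 3-mers would spell only the 4 bp cycle ATGC; with every edge doubled, the Eulerian cycle has 8 edges and can spell the full 8 bp genome.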

De Bruijn graph: assumptions so far (Compeau et al.)
3. Handling DNA repeats
With current data, instead of relying on k-mer multiplicity, the best approach is to exploit paired-end reads. How?

De Bruijn graph: assumptions so far (Compeau et al.)
4. Handling multiple and linear chromosomes
Single linear chromosome: look for an Eulerian path instead of an Eulerian cycle. Visit each edge, but there is no need to finish at the starting node.
Several linear chromosomes: search for multiple Eulerian paths; each would be a “chromosome”.

Review of graph complexity (Miller et al.)
Thickness of edges represents multiplicity.
Low-frequency dead ends: reads with sequencing errors towards the end.
Bulges: due to sequencing errors or polymorphisms towards the middle of the reads.
Collapsed paths: due to near-identical repeats.

Some methods to resolve graph complexity (Miller et al.)
Thickness of edges represents multiplicity.
Collapsed repeat, with repeat length shorter than the read length. Which path to follow? The one supported by a read spanning the repeat.

Some methods to resolve graph complexity (Miller et al.)
Thickness of edges represents multiplicity.
Collapsed repeat, with repeat length shorter than the paired-end distance (insert size): use the read pair (R1, R2) that spans the repeat.

Some methods to resolve graph complexity (Miller et al.)
Thickness of edges represents multiplicity.
Bulge/bubble, due to sequencing errors or polymorphisms: resolve it by following paired-end/mate-pair constraints.

De Bruijn graph assemblers: software
Software | Year | Reference | Download
Euler | 2001 | PNAS | http://cseweb.ucsd.edu/~ppevzner/software.html
Velvet | 2007 | Genome Res | https://
AllPaths | 2010 | PNAS | http://
SPAdes | 2012 | J Comput Biol | http://bioinf.spbau.ru/spades
IDBA | 2010 | RECOMB 2010 | http://i.cs.hku.hk/~alse/idba/
Trinity (transcriptomics) | 2011 | Nat Biotechnol | http://trinityrnaseq.github.io/

Comparing assemblers
Metrics: misassemblies, mismatch error rate, indels, genome fraction.

Selecting the best k-mer for assembly
For de Bruijn graph assemblers, the quality of the assembly strongly depends on the value of k.
The ideal k depends on: sequencing coverage, sequencing error rate, genome complexity.
Too small k: the assembly fragments at repeats longer than k.
Too large k: higher chance that a k-mer contains errors, producing bulges/bubbles.

Selecting the k-mer: Velvet Optimizer
Run Velvet for a collection of k-mer values, k_i < k < k_j, and pick the assembly that is best by some metric, e.g., N50, total length, number of contigs.
A very simple strategy, but very time consuming. We will use a manual version of this in the practical session.
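The selection step can be sketched without running any assembler: given the contig lengths produced for each value of k, compute N50 and keep the k with the highest value. The sketch uses the common definition of N50 (the contig length at which the running sum of lengths, sorted from longest to shortest, reaches half of the total assembly length); the `assemblies` dictionary is hypothetical input, and this is not the Velvet Optimizer's own code.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative sum of lengths
    (longest first) reaches half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def pick_best_k(assemblies):
    """`assemblies` maps k -> list of contig lengths (hypothetical input)."""
    return max(assemblies, key=lambda k: n50(assemblies[k]))


if __name__ == "__main__":
    # Made-up contig lengths for three values of k, for illustration only.
    assemblies = {21: [100, 50, 50], 31: [150, 50], 41: [80, 60, 60]}
    for k, lengths in assemblies.items():
        print(k, n50(lengths))
    print("best k by N50:", pick_best_k(assemblies))
```

N50 alone can reward misassembled contigs, so in practice it is worth looking at total length and number of contigs as well, as the slide suggests.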

Selecting the k-mer: KmerGenie (Chikhi & Medvedev)
A fast and efficient way to compute the best k-mer for a de Bruijn assembly:
1. Compute the multiplicity histogram for various values of k (figure: number of distinct k-mers by multiplicity, with noise and signal regions).
2. Estimate the number of genomic k-mers (signal).
3. The best k for assembly is the one that provides the most distinct genomic k-mers.

Comparing assemblers
Development of software for genome assembly is a very dynamic area, driven by the continuous changes in sequencing technologies.
For your project, it is always advisable to use more than a single assembler and then compare the results, or even merge them.
A good starting point is to check the results of published comparisons of different assemblers: GAGE and Assemblathon.

Estimate genome size
Formula to estimate the genome size: N = (M*L)/(L-K+1) and Genome_size = T/N, where N: depth, M: k-mer peak, K: k-mer size, L: average read length, T: total bases.
Estimate the number of unique k-mers after removing error k-mers (the red line in the graph).
Compute the average coverage of these unique (genomic) k-mers, which is approximately where the peak of the red line is located.
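The two formulas translate directly into code. The sketch below uses the symbols from the slide as parameter names; the numbers in the example are made up to give a round result and do not come from a real dataset.

```python
def base_depth(kmer_peak, read_length, k):
    """N = (M * L) / (L - K + 1): convert the k-mer peak depth to base depth."""
    return kmer_peak * read_length / (read_length - k + 1)


def estimate_genome_size(total_bases, kmer_peak, read_length, k):
    """Genome_size = T / N."""
    return total_bases / base_depth(kmer_peak, read_length, k)


if __name__ == "__main__":
    # Hypothetical values: 100 bp reads, k = 17, k-mer peak at 42x,
    # 250 Mbp of sequence in total.
    depth = base_depth(kmer_peak=42, read_length=100, k=17)
    size = estimate_genome_size(250_000_000, 42, 100, 17)
    print(f"depth N = {depth:.1f}x, genome size = {size:,.0f} bp")
```

With these made-up values the depth is 50x and the estimated genome size is 5 Mbp; the correction factor L/(L-K+1) accounts for each read of length L contributing only L-K+1 k-mers.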
