Bioinformatics Algorithms: Sequence Assembly

Copyright Notice
Most slides in this presentation are adapted from the textbook slides and various other sources.

Genome Sequencing
- Goal: determine the order of nucleotides across a genome.
- Problem: current DNA sequencing methods can handle only short stretches of DNA at once (<1-2 Kbp).
- Solution: sequence the short pieces, then use computers to assemble them.


Sanger Sequencing (timeline, 1980-2000s)
- 1982: lambda virus; DNA stretches up to 30-40 Kbp (Sanger et al.)
- 1994: H. influenzae, 1.8 Mbp (Fleischmann et al.)
- 2001: H. sapiens, D. melanogaster, 3 Gbp (Venter et al.)
- 2007: Global Ocean Sampling Expedition, ~3,000 organisms, 7 Gbp (Venter et al.)

Sanger Sequencing
- Advantages
  ◦ Long reads (~900 bp)
  ◦ Suitable for small projects
- Disadvantages
  ◦ Low throughput
  ◦ Expensive

Sequencing the Human Genome
[Figure: log10(price) plotted against year, 2000-2010]
- 2001: Human Genome Project, $2.7G, 11 years
- 2001: Celera, $100M, 3 years
- 2007: 454, $1M, 3 months
- 2008: ABI SOLiD, $60K, 2 weeks
- 2009: Illumina, Helicos, $40-50K
- 2010: $5K, a few days?
- 2012: $100, <24 hrs?

Next Generation Sequencing
- Alternative sequencing technologies to capillary sequencing, introduced in the mid-2000s.
- Systems by Illumina Solexa and ABI SOLiD.
- Much higher throughput (1-4 Gbp/day).
- Lower cost per base pair.
- Very short fragment lengths.
- High error rate.
- Inherent ability to do paired-end (mate-pair) sequencing.

Technology Summary

Platform  Read length  Sequencing technology  Throughput (per run)  Cost (1 Mbp)*
Sanger    ~800 bp      Sanger                 400 Kbp               $500
454       ~400 bp      Polony                 500 Mbp               $60
Solexa    75 bp        Polony                 20 Gbp                $2
SOLiD     75 bp        Polony                 60 Gbp                $2
Helicos   30-35 bp     Single molecule        25 Gbp                $1

*Source: Shendure & Ji, Nat Biotech, 2008

Assembly
- Cut DNA into larger pieces (2 Kbp, 15 Kbp) and sequence both ends of each piece (Fleischmann et al., 1994).
- Each end read is ~500 bp, so the two mates are separated by ~(length - 1,000) bp.
- Mates help in resolving repeats, give a better assembly of contigs, and allow gap lengths to be estimated.
[Figure: 2 Kbp and 15 Kbp mates spanning contig 1 and contig 2]

Assembly: How Much DNA? (Lander and Waterman, 1988)
- Low coverage
  ◦ Input: a few pieces to assemble
  ◦ Output: many contigs, many gaps
- High coverage
  ◦ Input: many pieces to assemble
  ◦ Output: a few contigs, a few gaps
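The coverage trade-off can be sketched numerically. The following is a minimal sketch, assuming the idealized Lander-Waterman setting in which reads are placed uniformly at random and any overlap, however short, is detected. The function names and the example genome and read sizes are illustrative assumptions, not values from the slides.

```python
import math

def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Average coverage c = N * L / G."""
    return num_reads * read_len / genome_len

def expected_contigs(num_reads: int, read_len: int, genome_len: int) -> float:
    """Expected number of contigs, N * exp(-c), in the idealized model."""
    c = coverage(num_reads, read_len, genome_len)
    return num_reads * math.exp(-c)

def uncovered_fraction(num_reads: int, read_len: int, genome_len: int) -> float:
    """Expected fraction of the genome covered by no read, exp(-c)."""
    return math.exp(-coverage(num_reads, read_len, genome_len))

# Hypothetical 100 Kbp genome sequenced with 500 bp reads.
low = expected_contigs(400, 500, 100_000)    # coverage c = 2
high = expected_contigs(1600, 500, 100_000)  # coverage c = 8
print(f"c=2: ~{low:.1f} contigs, c=8: ~{high:.1f} contigs")
```

Under these assumptions, raising the coverage from 2x to 8x reduces both the expected number of contigs and the expected uncovered fraction, which is the qualitative picture on the slide.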

Assembly paradigms
- Overlap-layout-consensus
  ◦ greedy (TIGR Assembler, phrap, CAP3, ...)
  ◦ graph-based (Celera Assembler, Arachne)
- de Bruijn graph based approaches (especially useful for short-read sequencing)

Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
- Overlap: find potentially overlapping reads.
- Layout: merge reads into contigs and contigs into supercontigs (scaffolds).
- Consensus: derive the DNA sequence and correct read errors (..ACGATTACAATAGGTT..).

OVERLAP GRAPH
- Edge types: regular dovetail, prefix dovetail, suffix dovetail.
- Edges are annotated with the deltas of the overlaps.
[Figure: reads A and B shown in each of the three overlap configurations]

OVERLAP GRAPH
- Find the best match between the suffix of one read and the prefix of another.
- Because of sequencing errors, dynamic programming is needed to find the optimal overlap alignment.
- Apply a fast filtration method first to discard pairs of fragments that do not share a significantly long common substring.

The Maximum Overlap Graph
- Each edge (u, v) is weighted with the length of the maximal overlap between a suffix of u and a prefix of v.
- Example reads: a = TACGA, b = ACCC, c = CTAAAG, d = GACA.
- Edges: a→b (1), a→d (2), b→c (1), c→d (1), d→b (1). 0-weight edges are omitted.
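A minimal sketch of building this graph for the four example reads, assuming exact (error-free) suffix-prefix matches rather than the dynamic programming overlap alignment a real assembler would need. The function names are illustrative.

```python
def overlap(u: str, v: str) -> int:
    """Length of the longest suffix of u that is also a prefix of v."""
    for n in range(min(len(u), len(v)), 0, -1):
        if u[-n:] == v[:n]:
            return n
    return 0

def max_overlap_graph(reads: dict[str, str]) -> dict[tuple[str, str], int]:
    """Weighted directed edges (u, v) -> overlap length; 0-weight edges omitted."""
    edges = {}
    for a, u in reads.items():
        for b, v in reads.items():
            if a != b:
                w = overlap(u, v)
                if w > 0:
                    edges[(a, b)] = w
    return edges

reads = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}
graph = max_overlap_graph(reads)
for (u, v), w in sorted(graph.items()):
    print(f"{u} -> {v}: {w}")
```

The all-pairs loop is quadratic in the number of reads, which is why the filtration step is applied before any pairwise comparison in practice.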

Paths and Layouts
The path dbc (reads GACA, ACCC, CTAAAG) leads to the alignment:
GACA--------
---ACCC-----
------CTAAAG

Superstrings
- Every path that covers every node spells a superstring of the reads.
- Zero-weight edges result in alignments with no overlap, like:
GACA----------
----GCCC------
--------TTAAAG
- Higher weights produce more overlap, and thus shorter strings.
- The shortest common superstring is the highest-weight path that covers every node.

Graph formulation of SCS
- Input: a weighted, directed graph.
- Output: the highest-weight path that touches every node of the graph.
Does this problem sound familiar?

The Greedy Algorithm
Algorithm greedy
  Sort edges in decreasing weight order
  For each edge in this order
    If the edge does not form a cycle
       and no edge already in the set starts at the same node or ends at the same node
    then add the edge to the current set
  End for
End Algorithm
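The greedy procedure can be sketched as follows, assuming exact suffix-prefix overlaps and heaviest-edge-first selection. Zero-weight edges are kept as candidates so that the selected edges always form a single path, ties are broken by read index (an arbitrary choice), and cycles are detected with a small union-find. All names are illustrative.

```python
from itertools import permutations

def overlap(u: str, v: str) -> int:
    """Length of the longest suffix of u that is also a prefix of v."""
    for n in range(min(len(u), len(v)), 0, -1):
        if u[-n:] == v[:n]:
            return n
    return 0

def greedy_superstring(reads: list[str]) -> str:
    n = len(reads)
    # Candidate edges (weight, i, j), heaviest first.
    edges = sorted(
        ((overlap(reads[i], reads[j]), i, j) for i, j in permutations(range(n), 2)),
        key=lambda e: (-e[0], e[1], e[2]),
    )
    succ, pred = {}, {}        # chosen successor / predecessor of each read
    parent = list(range(n))    # union-find over the partial paths

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, i, j in edges:
        if i in succ or j in pred:   # another chosen edge starts at i or ends at j
            continue
        if find(i) == find(j):       # the edge would close a cycle
            continue
        succ[i], pred[j] = j, i
        parent[find(i)] = find(j)

    # Spell the superstring along the single remaining path.
    cur = next(i for i in range(n) if i not in pred)
    out = reads[cur]
    while cur in succ:
        nxt = succ[cur]
        out += reads[nxt][overlap(reads[cur], reads[nxt]):]
        cur = nxt
    return out

print(greedy_superstring(["TACGA", "ACCC", "CTAAAG", "GACA"]))
```

The greedy choice is a heuristic: it always returns a common superstring of the reads, but not necessarily the shortest one.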

Greedy Example
[Figure: example graph with edge weights 7, 6, 5, 4, 3, 2, 1, 2, 2]

Handling repeats
1. Repeat detection
  ◦ pre-assembly: find fragments that belong to repeats
    - statistically (most existing assemblers)
    - repeat database (RepeatMasker)
  ◦ during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
  ◦ post-assembly: find repetitive regions and potential mis-assemblies
    - Reputer, RepeatMasker
    - "unhappy" mate-pairs (too close, too far, mis-oriented)
2. Repeat resolution
  ◦ find DNA fragments belonging to the repeat
  ◦ determine the correct tiling across the repeat

Statistical repeat detection
Significant deviations from average coverage are flagged as repeats.
- Frequent k-mers are ignored.
- The “arrival” rate of reads in contigs is compared with the theoretical value (e.g., with 800 bp reads and 8x coverage, reads "arrive" every 100 bp).
Problem 1: the assumption of a uniform distribution of fragments leads to false positives
  ◦ non-random libraries
  ◦ poor clonability regions
Problem 2: repeats with low copy number are missed, which leads to false negatives
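The arrival-rate arithmetic can be written out as a short sketch. The flagging rule below, with its `factor` threshold, is a hypothetical stand-in for whatever statistic a particular assembler uses; only the 800 bp / 8x / 100 bp figures come from the slide.

```python
def expected_spacing(read_len: float, coverage: float) -> float:
    """Expected distance between consecutive read starts: L / c."""
    return read_len / coverage

def looks_repetitive(contig_len: int, reads_in_contig: int,
                     read_len: float, coverage: float,
                     factor: float = 2.0) -> bool:
    """Flag a contig whose read count exceeds `factor` times the expectation.

    `factor` is an arbitrary illustrative threshold, not a published value.
    """
    expected_reads = contig_len / expected_spacing(read_len, coverage)
    return reads_in_contig > factor * expected_reads

print(expected_spacing(800, 8))                # one read every 100 bp
print(looks_repetitive(10_000, 110, 800, 8))   # close to the expected 100 reads
print(looks_repetitive(10_000, 250, 800, 8))   # 2.5x the expected reads
```

A two-copy repeat collapsed into one contig would show roughly twice the expected read count, so a rule of this shape has little margin for low copy numbers, which is the false-negative problem noted above.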

Consensus
- A consensus sequence is derived from a profile of the assembled fragments.
- A sufficient number of reads is required to ensure a statistically significant consensus.
- Reading errors are corrected.

Derive Consensus Sequence
- Derive a multiple alignment from pairwise read alignments (i.e., progressive alignment):
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
- Derive each consensus base by weighted voting.
- Another approach, based on finding a longest path in a DAG, is used in a popular assembler.
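A minimal sketch of column-wise voting, assuming the reads are already aligned into rows of equal width. It uses a plain majority vote, which is a simplification of the weighted voting named on the slide (a real assembler would weight by base quality). Here `-` marks a gap inside a read and a space marks columns a read does not cover; both conventions and the toy rows are assumptions for this example.

```python
from collections import Counter

def consensus(rows: list[str]) -> str:
    """Majority vote per column; columns won by a gap are dropped."""
    width = max(len(r) for r in rows)
    out = []
    for col in range(width):
        votes = Counter(r[col] for r in rows if col < len(r) and r[col] != " ")
        if not votes:
            continue                      # no read covers this column
        base, _ = votes.most_common(1)[0]
        if base != "-":
            out.append(base)
    return "".join(out)

rows = [
    "TAGATTACA   ",
    "TAGATTACACAG",
    "TAG-TTACATAG",   # one gap and one miscalled base
    "   ATTACACAG",
]
print(consensus(rows))
```

With enough reads per column, the isolated gap and the miscalled base are outvoted, which is the sense in which reading errors are corrected.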

Definitions
- Let $v$ and $w$ be two strings over the alphabet $\Sigma$. The concatenation of $v$ and $w$ is denoted by $v \cdot w$, and $v[i]$ is the $i$-th symbol in $v$, $1 \le i \le |v|$. $v[i,j]$ denotes a substring of $v$, and for any $x \in \Sigma$, $x^k$, $k \ge 1$, is $x$ concatenated with itself $k$ times.
- A string of length $k$ is called a k-mer. The k-spectrum of $v$ is the set of all k-mers that are substrings of $v$. [Example: $v$ = abcd; the 2-spectrum of $v$ is {ab, bc, cd}, the 3-spectrum of $v$ is {abc, bcd}.]
- A DNA strand is a string with alphabet $\Sigma$ = {a, g, c, t}. Characters of a DNA strand are called bases.
- The complement of a base $\alpha[i]$, denoted by $\overline{\alpha[i]}$, is defined by the following bijection of $\Sigma$ onto $\Sigma$: {t → a, c → g, a → t, g → c}.
- The reverse complement of a DNA strand $\alpha$, denoted by $\bar{\alpha}$, is obtained by reversing $\alpha$ and complementing each base: $\bar{\alpha}[i] = \overline{\alpha[|\alpha| - i + 1]}$. Note that $\overline{\overline{\alpha[i]}} = \alpha[i]$ and $\bar{\bar{\alpha}} = \alpha$.
- A DNA molecule is a pair of complementary DNA strands, $m = \{\alpha_m, \bar{\alpha}_m\}$. We denote the length of $m$ as $|m| = |\alpha_m| = h$ and call $m$ an h-molecule.
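These definitions translate directly into a few lines of code. A minimal sketch with illustrative function names, using lower-case bases as on the slide:

```python
COMPLEMENT = str.maketrans("acgt", "tgca")

def k_spectrum(v: str, k: int) -> set[str]:
    """Set of all k-mers that occur as substrings of v."""
    return {v[i:i + k] for i in range(len(v) - k + 1)}

def reverse_complement(strand: str) -> str:
    """Complement every base, then reverse the strand."""
    return strand.translate(COMPLEMENT)[::-1]

print(sorted(k_spectrum("abcd", 2)))   # ['ab', 'bc', 'cd']
print(reverse_complement("aacg"))      # cgtt
```

Applying `reverse_complement` twice returns the original strand, matching the identity noted in the definitions.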

The bi-directed graphs
A bidirected graph is one in which each edge is given an independent orientation (or direction, or arrow; thus, two kinds of arrowheads) at each end. Thus, there are three kinds of bidirected edges:
(1) those where the arrows point outward, towards the vertices, at both ends;
(2) those where both arrows point inward, away from the vertices; and
(3) those in which one arrow points away from its vertex and towards the opposite end, while the other arrow points in the same direction as the first, away from the opposite end and towards its own vertex.

The bi-directed graphs
- We denote a bidirected graph by an incidence matrix $I: V \times E \to \{-2, -1, 0, 1, 2\}$, where $I(x,e)$ is
  ◦ 0 if edge e is not incident to node x,
  ◦ +1 if e is positive-incident to x [denoted by a diamond],
  ◦ -1 if e is negative-incident to x,
  ◦ +2 if e is a self-loop on x (positive-incident),
  ◦ -2 if e is a self-loop on x (negative-incident).
- The in-degree $\deg_n(x)$ and out-degree $\deg_p(x)$ of a vertex are defined as usual. The balance at a node x is $bal(x) = \deg_p(x) - \deg_n(x)$; a graph is balanced if the balance of each vertex is 0.
- A walk is a sequence $x_1 e_1 \dots x_{k-1} e_{k-1} x_k$ where $e_i$ is an edge between nodes $x_i$ and $x_{i+1}$, and $e_i$ and $e_{i-1}$ have opposite orientations at $x_i$.
- Example (nodes W, X, Y, Z; edges A, B, C, D, E): bal(W) = 0, bal(X) = -1, bal(Y) = 1, bal(Z) = -1; the graph is not balanced.
- We can view a loopless directed graph as a special kind of bidirected graph in which every edge is positive-incident to one of its endpoints and negative-incident to the other; the definition of a walk then reduces to its usual meaning in directed graphs.
- However, in a bidirected graph the shortest walk between two vertices may have to repeat a vertex, which is not possible in a directed graph. In the example, there is no walk between W and Z that does not repeat a vertex: the walk from W to Z is ABCBD. Observe that AD is not a walk but BD is.
[Figure: the example bidirected graph on nodes W, X, Y, Z with edges A, B, C, D, E, and its incidence matrix]
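The balance and walk definitions can be checked mechanically. The sketch below uses a hypothetical four-edge incidence table chosen so that ABCBD is a walk and AD is not; it is an assumption made for illustration, not a transcription of the figure's matrix (edge E is left out, so the balance of W differs from the slide's value). Balance is computed as the sum of incidence values at a node, so a positive self-loop contributes +2.

```python
# Hypothetical incidence table: edge -> {node: incidence value}.
INCIDENCE = {
    "A": {"W": 1, "X": -1},
    "B": {"X": 1, "Y": -1},
    "C": {"Y": 2},              # positive self-loop on Y
    "D": {"X": -1, "Z": -1},
}

def sign(value: int) -> int:
    return (value > 0) - (value < 0)

def balance(node: str, incidence=INCIDENCE) -> int:
    """Sum of incidence values at a node (positive minus negative incidences)."""
    return sum(ends.get(node, 0) for ends in incidence.values())

def is_walk(nodes: list[str], edges: list[str], incidence=INCIDENCE) -> bool:
    """Check x1 e1 x2 ... ek x(k+1) against the bidirected walk definition."""
    if len(nodes) != len(edges) + 1:
        return False
    for i, e in enumerate(edges):
        # Each edge must join the two nodes it sits between.
        if {nodes[i], nodes[i + 1]} != set(incidence[e]):
            return False
    for i in range(1, len(edges)):
        # Consecutive edges need opposite orientations at the shared node.
        before = incidence[edges[i - 1]][nodes[i]]
        after = incidence[edges[i]][nodes[i]]
        if sign(before) == sign(after):
            return False
    return True

print(is_walk(["W", "X", "Y", "Y", "X", "Z"], ["A", "B", "C", "B", "D"]))
print(is_walk(["W", "X", "Z"], ["A", "D"]))
```

In this table A and D are both negative-incident to X, so the direct route W, A, X, D, Z is rejected and the only way through is the detour over the self-loop C, which repeats X and Y.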

The bi-directed de Bruijn graphs
A bi-directed de Bruijn graph of a string $S$:
- Nodes: all possible k-mers.
- Edges: $((v, d_v), (u, d_u))$, also written $(v, u, d_v, d_u)$.
- There is an edge between $v$ and $u$ iff suffix(v, k-1) = prefix(u, k-1), where either k-mer may be replaced by its reverse complement, and the resulting (k+1)-mer v[1].suffix(v, k-1).u[k] is a substring of $S$ or of its reverse complement $\bar{S}$.

The bi-directed de Bruijn graphs
- Canonical k-mer: of a k-mer and its reverse complement, the lexicographically smaller (or, alternatively, larger) one is defined as the canonical k-mer, and the other one is the non-canonical k-mer.
- The orientations of the arrowheads on the edges are chosen as follows:
  ◦ If the canonical k-mers of nodes $v_i$ and $v_j$ overlap, then an edge $(v_i, v_j, >, >)$ is introduced.
  ◦ If the canonical k-mer of $v_i$ overlaps with the non-canonical k-mer of $v_j$, then an edge $(v_i, v_j, >, <)$ is introduced.
  ◦ If the non-canonical k-mer of $v_i$ overlaps with the canonical k-mer of $v_j$, then an edge $(v_i, v_j, <, >)$ is introduced.
- A walk $W(v_i, v_j)$ between two nodes $v_i, v_j \in V$ of a bi-directed graph $G(V, E)$ is a sequence $v_i, e_i, v_{i_1}, e_{i_1}, v_{i_2}, \dots, v_{i_m}, e_{i_m}, v_j$ such that for every intermediate vertex $v_{i_l}$, $1 \le l \le m$, the orientation of the arrowhead on the incoming edge adjacent to $v_{i_l}$ matches the orientation of the arrowhead on the outgoing edge.
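The canonical k-mer and the orientation rules can be sketched as below, assuming the lexicographically smaller k-mer is canonical and that k is odd (so no k-mer equals its own reverse complement). When both k-mers of a (k+1)-mer are non-canonical, this sketch reads the same overlap from the opposite strand; that handling, like the function names, is an assumption of the example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s: str) -> str:
    return s.translate(_COMPLEMENT)[::-1]

def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, revcomp(kmer))

def bidirected_edge(kplus1: str) -> tuple[str, str, str, str]:
    """Map a (k+1)-mer to an edge (v_i, v_j, d_i, d_j) between canonical k-mers."""
    x, y = kplus1[:-1], kplus1[1:]          # the two overlapping k-mers
    cx, cy = canonical(x), canonical(y)
    if x != cx and y != cy:
        # Both non-canonical: the opposite strand gives two canonical k-mers.
        return bidirected_edge(revcomp(kplus1))
    return (cx, cy, ">" if x == cx else "<", ">" if y == cy else "<")

for s in ("AACG", "AAGT", "TGAA", "CGTT"):
    print(s, bidirected_edge(s))
```

Note that `CGTT` and its reverse complement `AACG` map to the same edge, so a read and its opposite-strand counterpart contribute to one edge of the graph.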

The bi-directed de Bruijn graphs
[Figure: Kundeti et al. BMC Bioinformatics 2010, 11:560]

NGS Assembly using Bi-directed de Bruijn graph
1. Construct a bi-directed de Bruijn graph.
2. Simplify the de Bruijn graph (compaction).
3. Remove errors.
4. Assemble the sequence through a Chinese Postman Walk (CPW) on the bi-directed de Bruijn graph.

Construct a bi-directed de Bruijn graph
1. Generate edges
[Figure: Kundeti et al. BMC Bioinformatics 2010, 11:560]

Construct a bi-directed de Bruijn graph
2. Reduce multiplicity: sort all bi-directed edges.
  ◦ The sorting takes O(n), where n = Nr, N is the number of reads and r is the average read length.
  ◦ The sorting step is the dominant step in building the bi-directed de Bruijn graph.
3. Collect bi-directed vertices.
4. Generate adjacency lists.
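Steps 2 to 4 can be sketched on an already generated edge list. This is a minimal in-memory sketch, assuming edges are tuples `(v_i, v_j, d_i, d_j)`; the function name and the toy edges are illustrative. Python's `sorted` is a comparison sort, so this sketch runs in O(n log n); a linear-time bound would require a radix-style sort on integer-encoded edges, which is not shown here.

```python
from collections import defaultdict
from itertools import groupby

def reduce_and_index(edges):
    """Collapse duplicate edges, then collect vertices and adjacency lists."""
    # Step 2: sorting brings equal edges together; each run becomes (edge, multiplicity).
    unique = [(e, sum(1 for _ in run)) for e, run in groupby(sorted(edges))]
    # Step 3: every k-mer that appears at either end of an edge is a vertex.
    vertices = sorted({v for (u, w, _, _), _ in unique for v in (u, w)})
    # Step 4: store each edge at both endpoints, seen from that endpoint.
    adjacency = defaultdict(list)
    for (u, w, du, dw), mult in unique:
        adjacency[u].append((w, du, dw, mult))
        if w != u:
            adjacency[w].append((u, dw, du, mult))
    return unique, vertices, dict(adjacency)

edges = [
    ("AAC", "ACG", ">", ">"),
    ("AAG", "ACT", ">", "<"),
    ("AAC", "ACG", ">", ">"),   # same edge seen in a second read
]
unique, vertices, adjacency = reduce_and_index(edges)
print(unique)
print(vertices)
```

The multiplicity kept with each edge records how many times it was observed in the reads.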

Graph Compaction
- Compact chains into single edges.
- Reduction to the familiar list ranking problem.
- List ranking:
  ◦ Distributed linked list with adjacency information
  ◦ Find the distance from each node to the end of the list
- Extend:
  ◦ Multiple linked lists
  ◦ Identify nodes with multiple edges
  ◦ Undirected
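List ranking itself can be sketched with pointer jumping. The version below is a sequential simulation of the synchronous parallel rounds, assuming directed lists given as a successor map with `None` at the end of each list; the undirected and distributed extensions named on the slide are not covered.

```python
def list_rank(succ: dict) -> dict:
    """Distance from every node to the end of its list, by pointer jumping."""
    rank = {v: 0 if nxt is None else 1 for v, nxt in succ.items()}
    nxt = dict(succ)
    # Each round doubles the distance spanned by a pointer: O(log n) rounds.
    while any(n is not None for n in nxt.values()):
        new_rank, new_nxt = {}, {}
        for v, n in nxt.items():
            if n is None:
                new_rank[v], new_nxt[v] = rank[v], None
            else:
                new_rank[v] = rank[v] + rank[n]   # add the successor's span
                new_nxt[v] = nxt[n]               # jump over the successor
        rank, nxt = new_rank, new_nxt
    return rank

# Two independent lists: a -> b -> c -> d and x -> y.
succ = {"a": "b", "b": "c", "c": "d", "d": None, "x": "y", "y": None}
print(list_rank(succ))
```

Because every node is updated independently within a round, the rounds parallelize naturally, and several lists are ranked at once without extra work.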

Graph Compaction
[Figure: Kundeti et al. BMC Bioinformatics 2010, 11:560]

Graph Compaction
- Edges in the graph are nodes for list ranking.
- Sort by edges to assign unique edge labels.
- Sort by nodes to bring edges incident to a node together.
- Identify nodes on a chain.
- Mark adjacent edges on chains.
- Perform undirected list ranking.

Error Detection and Removal
- Assumption: the incidence of errors is random, so an error is unlikely to occur repeatedly at the same base.
- Each base in the genome is sampled on average as many times as the coverage, which is high in NGS.
- Combining the two: errors can be identified by their comparatively low frequency.
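The frequency argument can be sketched as a k-mer count filter. This is a minimal sketch, assuming a fixed count threshold chosen by hand and ignoring reverse complements; the function name, the threshold, and the toy reads are illustrative.

```python
from collections import Counter

def frequent_kmers(reads: list[str], k: int, min_count: int) -> set[str]:
    """k-mers seen at least min_count times; rarer ones are treated as errors."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}

# Three correct copies of a read and one copy with a single miscalled base.
reads = ["ACGTAC"] * 3 + ["ACGTTC"]
print(sorted(frequent_kmers(reads, k=4, min_count=2)))
```

The miscalled base creates k-mers that appear only once, while every correct k-mer is supported by several reads, so the threshold separates the two groups in this toy case.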

Error Detection and Removal
Tips: misreading of one or more bases towards the end of a short read.
[Figure: sequence TCGTTGCGTGCGTGAGCGT with a tip, labelled k]

Error Detection and Removal
Bubbles: misreading of one or more bases in the middle of a short read.
[Figure: sequence TCGTTGCGTGCGTGAGCGT with a bubble, labelled k on each side]

Error Detection and Removal
Spurious links: arise when an erroneous (k+1)-molecule happens to be identical to a legitimate (k+1)-molecule elsewhere in the genome.
[Figure: spurious link]

Euler Circuits
- A path is a connected sequence of edges showing a route on the graph that starts at a vertex and ends at a vertex.
- A path that starts and ends at the same vertex is called a circuit.
- Circuits that cover every edge once and only once are called Euler circuits.

Euler’s Theorem
1. If a graph G is connected and all of its nodes have even degree, then G has an Euler circuit.
2. If G has an Euler circuit, then G must be connected and all of its node degrees must be even.
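Both directions of the theorem can be exercised in code. The sketch below checks the even-degree and connectivity condition and then constructs a circuit with Hierholzer's algorithm, a standard method that the slides do not name. It assumes an undirected multigraph given as a list of `(u, v)` pairs; the function names are illustrative.

```python
from collections import defaultdict

def has_euler_circuit(edges) -> bool:
    """Connected (over nodes that have edges) and every degree even."""
    degree, neighbours = defaultdict(int), defaultdict(set)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        neighbours[u].add(v)
        neighbours[v].add(u)
    if not neighbours:
        return True
    if any(d % 2 for d in degree.values()):
        return False
    seen, stack = set(), [next(iter(neighbours))]
    while stack:                      # depth-first search for connectivity
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(neighbours[u] - seen)
    return len(seen) == len(neighbours)

def euler_circuit(edges, start):
    """Hierholzer's algorithm; assumes has_euler_circuit(edges) is True."""
    adj = defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    used = [False] * len(edges)
    stack, circuit = [start], []
    while stack:
        u = stack[-1]
        while adj[u] and used[adj[u][-1][1]]:
            adj[u].pop()              # discard edges already traversed
        if adj[u]:
            v, idx = adj[u].pop()
            used[idx] = True
            stack.append(v)
        else:
            circuit.append(stack.pop())
    return circuit[::-1]

bowtie = [(1, 2), (2, 3), (3, 1), (1, 4), (4, 5), (5, 1)]
print(has_euler_circuit(bowtie), euler_circuit(bowtie, 1))
```

Each edge is examined a constant number of times, so the construction is linear in the number of edges.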

Chinese Postman Problem
Suppose there is a mailman who needs to deliver mail to a certain neighborhood. The mailman is lazy, so he wants to find the shortest route through the neighborhood that meets the following criteria:
- It is a closed circuit (it ends at the same point it starts).
- It goes through every street at least once.
If the graph traveled has an Eulerian circuit, this circuit is the ideal solution.

Chinese Postman Problem Solution (Edmonds and Johnson)
1. Find the odd-degree nodes in G.
2. Calculate the shortest paths between all pairs of odd nodes.
3. Construct a complete graph F on the odd nodes, where the weight of each edge is the length of the shortest path between its endpoints.
4. Find a minimum weight perfect matching (MWPM) on F.
5. For every edge (u, v) in the MWPM, duplicate the shortest path between u and v in the original graph G to construct a new multigraph G’.
6. Find an Eulerian circuit in G’.
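The steps above can be sketched for a small undirected graph. This sketch computes only the cost of the optimal route, not the route itself. It assumes a connected graph with non-negative weights, uses Floyd-Warshall for the shortest paths, and finds the matching by brute-force recursion, which is only practical for a handful of odd nodes (a real implementation would use a polynomial-time matching algorithm). Names and the example graph are illustrative.

```python
def chinese_postman_cost(nodes, edges) -> float:
    """Total edge weight plus the cheapest pairing of odd-degree nodes."""
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    degree = {u: 0 for u in nodes}
    total = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
        degree[u] += 1
        degree[v] += 1
        total += w
    # Step 2: all-pairs shortest paths (Floyd-Warshall).
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    # Step 1: odd-degree nodes (there is always an even number of them).
    odd = [u for u in nodes if degree[u] % 2 == 1]

    # Steps 3-4: minimum weight perfect matching on the odd nodes, brute force.
    def best_matching(rest):
        if not rest:
            return 0
        first, others = rest[0], rest[1:]
        return min(
            dist[first][p] + best_matching([q for q in others if q != p])
            for p in others
        )

    # Step 5 adds the matched shortest paths; their length is the extra cost.
    return total + best_matching(odd)

square = [("A", "B", 1), ("B", "C", 1), ("C", "D", 1), ("D", "A", 1), ("A", "C", 2)]
print(chinese_postman_cost("ABCD", square))
```

In the example, A and C have odd degree, so the shortest path between them (length 2) is duplicated, and the six units of edge weight grow to a route of cost 8.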

Strategy for Solving Chinese Postman Problem
1. Eulerize the graph.
2. Find an Euler circuit on the new graph.
3. “Squeeze” this Euler circuit from the Eulerized graph onto the original graph by reusing an edge of the original graph each time the circuit on the Eulerized graph uses an added edge.

Chinese Postman Walk (CPW) on bi-directed DBG
- A Chinese Postman walk in a bi-directed graph is a bi-directed walk which visits every edge at least once.
- A cyclic Chinese Postman walk of minimum cost on a weighted bi-directed graph is denoted as CPW.

Chinese Postman Walk (CPW) on bi-directed DBG
- Lemma 1. A connected bi-directed graph is Eulerian if and only if every vertex is balanced.
- Lemma 2. A non-Eulerian bi-directed graph G = (V, E) has a cyclic Chinese Postman walk if and only if there is a corresponding multi-bi-directed graph Gm = (V, Em) which is Eulerian.
- Lemma 3. Finding a cyclic CP walk on a bi-directed graph G(V, E) is equivalent to finding a minimum weight Eulerian multi-bi-directed graph Gm(V, Em) corresponding to G.
- Lemma 4. If a bi-directed graph G(V, E) has a cyclic CP walk, then the cost of that walk is equal to the weight of the corresponding Gm(V, Em).

Chinese Postman Walk (CPW) on bi-directed DBG
- Lemma 5. A non-Eulerian bi-directed graph G(V, E) has a cyclic CP walk if and only if the balancing bi-partite graph B(P, Q, Eb) has a perfect matching.
- Lemma 6. If G(V, E) is a non-Eulerian bi-directed graph that has a cyclic CP walk, then every corresponding Eulerian multi-bi-directed graph Gm(V, Em) belongs to the family F.


