Published by Leslie Dawson. Modified over 8 years ago.
1
Fragment assembly of DNA
A typical approach to sequencing long DNA molecules is to sample fragments from them and then sequence those fragments.
2
® Pei-Jie Wu
Fragment assembly of DNA
– Biological background
– Models
– Algorithms
– Heuristics
3
Biological background
Problem as puzzle
We do not know which letter from the set {A, C, G, T} is written on each card, but we do know that cards in the same position on opposite strands form a complementary pair. Our goal is to obtain the letters using certain hints, which are (approximate) substrings of the rows.
4
Biological background
– Target: the long sequence to reconstruct
– Fragment vs. subsequence
– Shotgun method: based on fragment overlap
– Fragment assembly: a collection of fragments to put together
5
Biological background -- The ideal case
– Case: p.106
– Align the input set, ignoring spaces at the extremities
– Overlaps: the end part of one fragment is similar to the beginning of another
– Consensus sequence based on majority vote
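The majority vote in the ideal case can be sketched as follows. This is a minimal illustration, assuming the layout is already given as equal-length rows in which a space marks a column the fragment does not reach; the function name `consensus` and the sample rows are illustrative and not taken from the slides. Ties are broken arbitrarily.

```python
from collections import Counter

def consensus(rows):
    """Majority-vote consensus over equal-length layout rows.

    A space means the fragment does not reach that column (ignored);
    any other character, including '-', takes part in the vote.
    """
    result = []
    for col in range(len(rows[0])):
        votes = Counter(row[col] for row in rows if row[col] != " ")
        if votes:  # skip columns that no fragment covers
            result.append(votes.most_common(1)[0][0])
    return "".join(result)

# Illustrative error-free layout (an assumption for this sketch).
layout = [
    "  ACCGT  ",
    "    CGTGC",
    "TTAC     ",
    " TACCGT  ",
]
print(consensus(layout))
```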
6
Biological background -- Complications
The main factors that add to the complexity of the problem are:
– Errors
– Unknown orientation
– Repeated regions
– Lack of coverage
7
Biological background -- Complications (Errors)
– When a computer program has to deal with errors, this usually means algorithms that require more time and space.
– The simplest errors are called base call errors and comprise base substitutions, insertions, and deletions in the fragments.
– Base call errors occur in practice at rates varying from 1 to 5 errors every 100 characters.
– Figures 4.2, 4.3, 4.4
8
Biological background -- Complications (Errors)
Two other types of errors: chimeras and contamination.
– Chimeras arise when two regular fragments from distinct parts of the target molecule join end-to-end to form a fragment that is not a contiguous part of the target (Figure 4.5).
– Solution: chimeras must be recognized as such and removed from the fragment set in a preprocessing stage.
– Contamination comes from host or vector DNA.
– Solution: most vectors are well known, so we can screen the data before starting assembly.
9
Biological background -- Complications (Unknown orientation)
– We generally do not know to which strand a particular fragment belongs.
– We treat the input fragments as all being approximate substrings of the consensus sought, either as given or in reverse complement.
– Figure 4.6
– Complexity: 2^n combinations of orientations for n fragments
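A small sketch of the reverse complement and of why trying every orientation costs 2^n, assuming fragments are plain strings over {A, C, G, T}. The names `reverse_complement` and `all_orientations` are illustrative.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def all_orientations(fragments):
    """Yield every way of choosing 'as given' or 'reverse complement'
    for each fragment: 2**n tuples for n fragments."""
    choices = [(f, reverse_complement(f)) for f in fragments]
    return product(*choices)

print(reverse_complement("ACCGT"))
print(sum(1 for _ in all_orientations(["AC", "GGT", "TTA"])))
```

Enumerating all tuples is only feasible for a handful of fragments, which is why practical assemblers decide orientation from pairwise overlaps instead.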
10
Biological background -- Complications (Repeated regions)
– Repeats are sequences that appear two or more times in the target molecule.
– Short repeats
– Longer repeats
– If the level of similarity between two copies of a repeat is high enough, the differences can be mistaken for base call errors.
– Figure 4.7
11
Biological background -- Complications (Repeated regions)
Problems:
– If a fragment is totally contained in a repeat, we may have several places to put it in the final alignment. When the copies are not exactly equal, we may weaken the consensus by placing a fragment in the wrong copy.
– Repeats can be positioned in such a way as to render assembly inherently ambiguous (Figures 4.8 and 4.9).
Direct repeats: repeated copies in the same strand.
Inverted repeats: repeated regions in opposite strands (Figure 4.10).
12
Biological background -- Complications (Lack of coverage)
– Coverage at position i of the target: the number of fragments that cover this position.
– Contigs: the contiguously covered regions
– Figure 4.11
Solutions:
– Sampling more fragments
– Directed sequencing or walking
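Coverage and contigs can be illustrated with a short sketch. It assumes, hypothetically, that the true position of each fragment on the target is known as a half-open interval (start, end); in a real project these positions are exactly what is unknown, so this only illustrates the definitions.

```python
def coverage(target_length, intervals):
    """Number of fragments covering each target position."""
    cov = [0] * target_length
    for start, end in intervals:
        for i in range(start, end):
            cov[i] += 1
    return cov

def contigs(cov):
    """Maximal runs of positions with coverage > 0, as (start, end)."""
    runs, start = [], None
    for i, c in enumerate(cov):
        if c > 0 and start is None:
            start = i
        elif c == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(cov)))
    return runs

cov = coverage(10, [(0, 4), (2, 6), (8, 10)])
print(cov)
print(contigs(cov))
```

Positions 6 and 7 have coverage 0 in this example, so the fragments fall into two contigs separated by a gap.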
13
Biological background -- Alternative methods for DNA sequencing
– Directed sequencing: a method that can be used to cover small remaining gaps in a shotgun project.
– Problems: it is expensive to build special primers, and the process is sequential rather than parallel.
– Sequencing by hybridization (SBH): consists of assembling the target molecule based on many hybridization experiments with very short, fixed-length sequences called probes.
14
Models
– Shortest common superstring (SCS)
– RECONSTRUCTION
– MULTICONTIG
All three assume that the fragment collection is free of contamination and chimeras.
15
Models -- Shortest common superstring
Seeking the shortest superstring of a collection of given strings.
PROBLEM: Shortest common superstring (SCS)
INPUT: a collection F of strings.
OUTPUT: a shortest possible string S such that for every f ∈ F, S is a superstring of f.
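To make the problem statement concrete, here is a brute-force sketch that tries every ordering of the fragments and merges neighbours with the largest possible overlap. It is only meant for tiny inputs (SCS is NP-hard, and this loop is factorial in the number of fragments); the function names and the sample collection are illustrative assumptions.

```python
from itertools import permutations

def merge(s, f):
    """Append f to s, reusing the longest suffix of s that is a prefix of f."""
    if f in s:
        return s
    for t in range(min(len(s), len(f)), 0, -1):
        if s.endswith(f[:t]):
            return s + f[t:]
    return s + f

def shortest_common_superstring(fragments):
    """Exhaustive search over all orderings; keeps the shortest result."""
    best = None
    for order in permutations(fragments):
        s = ""
        for f in order:
            s = merge(s, f)
        if best is None or len(s) < len(best):
            best = s
    return best

F = ["ACT", "CTA", "AGT"]
S = shortest_common_superstring(F)
print(S, all(f in S for f in F))
```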
16
Models -- Shortest common superstring
– Example 4.1
– Example 4.2 (Figures 4.12 and 4.13)
– When the target contains repeats, a shortest superstring may contain only one copy, which will absorb all fragments totally contained in any of the copies.
17
Models -- Reconstruction
– Takes into account both errors and unknown orientation
– Dynamic programming sequence comparison algorithm
– Uses distance rather than similarity
– Expression: p.116
18
Models -- Reconstruction
PROBLEM: RECONSTRUCTION
INPUT: a collection F of strings and an error tolerance ε between 0 and 1.
OUTPUT (p.117): a string S as short as possible such that, for every f ∈ F, either f or its reverse complement is an approximate substring of S at error level ε.
Does not model repeats, lack of coverage, or the size of the target.
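The "approximate substring" test can be sketched with the usual dynamic programming recurrence, where the fragment may start and end anywhere inside S at no cost. This sketch assumes unit costs for substitutions, insertions and deletions, and assumes that "error level ε" means a distance of at most ε·|f|; both are assumptions made for illustration, since the slide only points to pages 116 and 117 for the exact expression.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def substring_distance(f, s):
    """Smallest edit distance between f and any substring of s."""
    prev = [0] * (len(s) + 1)          # empty prefix of f matches anywhere
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)       # f[:i] against an empty substring
        for j in range(1, len(s) + 1):
            cost = 0 if f[i - 1] == s[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # character of f left unmatched
                         cur[j - 1] + 1)      # extra character of s
        prev = cur
    return min(prev)                   # f may end anywhere in s

def is_approximate_substring(f, s, eps):
    """True if f or its reverse complement fits in s at error level eps."""
    limit = eps * len(f)
    return (substring_distance(f, s) <= limit
            or substring_distance(reverse_complement(f), s) <= limit)

print(substring_distance("ACGT", "TTACCTTT"))
print(is_approximate_substring("AACG", "TTCGTTAA", 0.0))
```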
19
Models -- Multicontig
– Involves the internal linkage of the fragments in the layout
– Nonlink: there is a fragment that properly contains the overlap on both sides
– Weakest link: the smallest size of any link
– t-contig: the weakest link of a layout is at least as large as t
– Example 4.4
– Definition: p.119
20
Algorithms
– Greedy algorithm
– Acyclic subgraphs (no errors and known orientation)
21
Algorithms -- Representing overlaps
– The overlap multigraph OM(F) of a collection F is a directed, weighted multigraph.
– The set V of nodes of this structure is just F itself.
– A directed edge from a to a different fragment b with weight t ≥ 0 exists if the suffix of a with t characters is a prefix of b.
– There may be many edges from a to b.
– No self-loops
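The definition of OM(F) translates almost directly into code. A minimal sketch, assuming the fragments are distinct strings and representing each edge as a hypothetical (a, b, t) tuple:

```python
def overlap_multigraph(fragments):
    """All edges (a, b, t) such that the last t characters of a
    equal the first t characters of b, for t >= 0 and a != b."""
    edges = []
    for a in fragments:
        for b in fragments:
            if a == b:
                continue  # no self-loops
            for t in range(min(len(a), len(b)) + 1):
                # a[len(a) - t:] is the suffix of length t (empty for t = 0)
                if a[len(a) - t:] == b[:t]:
                    edges.append((a, b, t))
    return edges

for edge in overlap_multigraph(["ACT", "CTA"]):
    print(edge)
```

Every ordered pair gets at least the weight-0 edge, which is why several edges can connect the same two fragments.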
22
Algorithms -- Paths originating superstrings
– An edge e = (f, g) in the path has a certain weight t, which means that the last t bases of the tail f of e coincide with the first t bases of its head g.
– Figure 4.15 (example on p.121)
– Equation 4.3
– Hamiltonian path: a path that goes through every vertex
– Equation 4.4
– Minimizing |S(P)| is equivalent to maximizing w(P)
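A sketch of how a path spells out a superstring: each edge of weight t contributes only the part of its head fragment beyond the first t bases. For a Hamiltonian path this gives a string whose length is the total length of the fragments minus the path weight, which is why minimizing the length and maximizing the weight coincide. The function name and argument layout are assumptions made for this illustration.

```python
def path_superstring(fragments, weights):
    """Superstring spelled by the path fragments[0] -> fragments[1] -> ...
    where weights[i] is the weight of the edge into fragments[i + 1]."""
    s = fragments[0]
    for f, g, t in zip(fragments, fragments[1:], weights):
        if f[len(f) - t:] != g[:t]:
            raise ValueError(f"no edge of weight {t} from {f} to {g}")
        s += g[t:]  # only the non-overlapping part of g is new
    return s

path = ["ACT", "CTA", "AGT"]
weights = [2, 1]
s = path_superstring(path, weights)
print(s, len(s), sum(map(len, path)) - sum(weights))
```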
23
Algorithms -- Shortest superstrings as paths
– A collection F is said to be substring-free if there are no two distinct strings a and b in F such that a is a substring of b.
– THEOREM 4.1
– COROLLARY 4.1
– LEMMA 4.1
– THEOREM 4.2
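Making a collection substring-free is a simple preprocessing step, sketched below with an illustrative helper name. Dropping a fragment that is contained in another one does not change which strings are common superstrings, because any superstring of b is also a superstring of every substring of b.

```python
def make_substring_free(fragments):
    """Drop duplicates and every fragment contained in another fragment."""
    unique = list(dict.fromkeys(fragments))  # keep first occurrences, in order
    return [a for a in unique
            if not any(a != b and a in b for b in unique)]

print(make_substring_free(["ACT", "CT", "ACTG", "GGA", "ACT"]))
```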
24
Algorithms -- The greedy algorithm
– Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.
– OM(F) → OG(F)
– A “greedy” attempt at computing the heaviest path.
– The basic idea is to repeatedly add the heaviest available edge.
25
Algorithms -- The greedy algorithm
– Three conditions have to be tested before accepting an edge into our Hamiltonian path.
– Edges are processed in nonincreasing order by weight.
– The procedure ends when we have exactly n-1 edges, or when the accepted edges induce a connected subgraph.
– Figure 4.16
– Example 4.5 (Figure 4.17)
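A minimal sketch of the greedy procedure, assuming a substring-free collection of distinct fragments and keeping only the heaviest edge between each ordered pair. The acceptance tests used here (the tail has no outgoing edge yet, the head has no incoming edge yet, and the edge closes no cycle) are the usual way of keeping the accepted edges a set of disjoint paths; they are an assumption of this sketch, since the slide does not spell its conditions out. The result is a common superstring, not necessarily a shortest one.

```python
def max_overlap(a, b):
    """Largest t > 0 with the last t bases of a equal to the first t of b, else 0."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def greedy_superstring(fragments):
    n = len(fragments)
    edges = [(max_overlap(a, b), i, j)
             for i, a in enumerate(fragments)
             for j, b in enumerate(fragments) if i != j]
    edges.sort(key=lambda e: -e[0])        # nonincreasing weight

    parent = list(range(n))                # union-find for cycle detection
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    successor, has_predecessor, accepted = {}, set(), 0
    for w, i, j in edges:
        if accepted == n - 1:
            break
        if i in successor or j in has_predecessor:
            continue                       # degree test failed
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                       # edge would close a cycle
        successor[i] = (j, w)
        has_predecessor.add(j)
        parent[ri] = rj
        accepted += 1

    # Walk the single resulting path from its start vertex.
    cur = next(i for i in range(n) if i not in has_predecessor)
    s = fragments[cur]
    while cur in successor:
        cur, w = successor[cur]
        s += fragments[cur][w:]
    return s

print(greedy_superstring(["ACT", "CTA", "AGT"]))
```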
26
Algorithms -- Acyclic subgraphs
– Assembling fragments without errors and with known orientation, assuming that the fragments have been obtained from a “good sampling” of the target DNA.
– “Good sampling”: the fragments cover the entire target molecule, and the collection as a whole exhibits enough linkage to guarantee a safe assembly.
– Figure 4.18
27
Algorithms -- Acyclic subgraphs
– The presence of repeated regions, or repeated elements, in the target string S is related to the existence of cycles in the overlap graph.
– Cycles in an overlap graph are necessarily due to repeats in S.
– The converse is not necessarily true; that is, we may have repeats but still an acyclic overlap graph.
– THEOREM 4.5
– Algorithm: topological sorting
– Example 4.6 (Figures 4.19, 4.20 and 4.21)
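A sketch of assembly by topological sorting, assuming error-free fragments in known orientation and an overlap graph that keeps only overlaps of at least `min_overlap` bases (a hypothetical threshold parameter). It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+), which raises `CycleError` when the graph has a cycle, the situation the slide associates with repeats. The sample fragments are illustrative.

```python
from graphlib import TopologicalSorter

def max_overlap(a, b):
    """Largest t > 0 with the last t bases of a equal to the first t of b, else 0."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def assemble_acyclic(fragments, min_overlap):
    # Map each fragment to the set of fragments that must precede it.
    predecessors = {f: set() for f in fragments}
    for a in fragments:
        for b in fragments:
            if a != b and max_overlap(a, b) >= min_overlap:
                predecessors[b].add(a)

    # static_order() raises graphlib.CycleError if the graph is cyclic.
    order = list(TopologicalSorter(predecessors).static_order())

    s = order[0]
    for prev, nxt in zip(order, order[1:]):
        t = max_overlap(prev, nxt)
        if t < min_overlap:
            raise ValueError("sorted order is not a Hamiltonian path")
        s += nxt[t:]
    return s

print(assemble_acyclic(["CGGATC", "AGCTTA", "TTACGG"], min_overlap=3))
```

If consecutive fragments in the sorted order are not joined by an edge, the graph has no Hamiltonian path and the sketch reports that instead of guessing.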
28
Heuristics
– None of the formalisms proposed for fragment assembly is entirely adequate.
– Fragment assembly can be viewed as a multiple alignment problem with some additional features:
– Each fragment can participate with either the direct or the reverse-complemented sequence.
– The sequences themselves are usually much shorter than the alignment itself.
29
Heuristics
Three criteria according to the second feature:
– Scoring: entropy is a quantity that is defined on a group of relative frequencies; it is low when one of these frequencies stands out from the others, and high when they are all more or less equal. The lower the entropy, the better.
– Coverage: a fragment covers a column i if it participates in this column either with a character or with an internal space.
– Linkage: the way individual fragments are linked in the layout is another determinant of layout quality.
– Figure 4.22
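The entropy criterion can be illustrated per column. This sketch assumes the Shannon entropy in bits, computed over the relative frequencies of the characters that participate in a column, with a space marking a fragment that does not reach the column; the exact formula and conventions used in the slides are not shown, so treat these as assumptions.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (bits) of the character frequencies in one column."""
    chars = [c for c in column if c != " "]
    total = len(chars)
    return sum((n / total) * log2(total / n)
               for n in Counter(chars).values())

print(column_entropy("AAAA"))  # one frequency stands out: lowest entropy
print(column_entropy("AAAC"))
print(column_entropy("ACGT"))  # all frequencies equal: highest entropy
```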
30
Heuristics -- Assembly in practice
Practical implementations often divide the whole problem into three phases:
– Finding overlaps
– Building a layout
– Computing the consensus
31
Heuristics -- Assembly in practice (Finding overlaps)
– The first step in any assembly problem is fragment overlap detection.
– Determine reverse complements
– Consider fragments entirely contained in other fragments
– Recall Section 3.2.3 (Figure 4.23)
32
Heuristics -- Assembly in practice (Ordering fragments)
– Finding a good ordering of fragments in a contig
– There is no algorithm that is simple and general enough
There are four issues to keep in mind when building paths:
– Every path has a corresponding complement path
– It is not necessary to include contained fragments
– Cycles usually indicate the presence of repeats
– Unbalanced coverage may be related to repeats as well (see Figure 4.13)
33
Heuristics -- Assembly in practice (Alignment and consensus)
Building a layout from a path in an overlap graph. Two techniques related to alignment construction:
– The first one helps in building a good layout from a path in the presence of errors. Example 4.7. Implementation: Figure 4.24
– The second one focuses on locally improving an already constructed layout. Example 4.8 in Figure 4.25. Implementation: sum-of-pairs scoring scheme
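A sum-of-pairs score adds up a pairwise score over every pair of characters in each column of the layout, so a local change to the layout can be judged by whether it raises the total. The sketch below uses hypothetical score values (match 1, mismatch -1, character against gap -2, gap against gap 0) and treats a space as "fragment not present in this column"; the actual values used in the slides' source are not given.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score for one pair of layout characters; '-' is an internal gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(rows):
    """Total pairwise score over all columns of an equal-length layout."""
    total = 0
    for col in range(len(rows[0])):
        present = [row[col] for row in rows if row[col] != " "]
        total += sum(pair_score(a, b) for a, b in combinations(present, 2))
    return total

print(sum_of_pairs(["ACG", "ACG", "AC-"]))
```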