Published by Stanley Henry. Modified over 9 years ago.
A New Approach to Fragment Assembly in DNA Sequencing
Fei Wu, April 24, 2006
Preface
- Introduce the author
- The background of the paper
- The history of DNA sequencing
Traditional DNA Sequencing
- Shear DNA into millions of small fragments
- Read 500–700 nucleotides at a time from the small fragments (Sanger method)
- [Figure label: Shake DNA]
Fragment Assembly
- Computational challenge: assemble individual short fragments (reads) into a single genomic sequence ("superstring")
- Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem
Shortest Superstring Problem
- Problem: Given a set of strings, find a shortest string that contains all of them
- Input: Strings s_1, s_2, ..., s_n
- Output: A string s that contains all strings s_1, s_2, ..., s_n as substrings, such that the length of s is minimized
- Complexity: NP-complete
- Note: this formulation does not take into account sequencing errors
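The problem statement above can be made concrete with a brute-force sketch. It is not an algorithm from the paper: it tries every ordering of the input strings and merges neighbors by their longest overlap, so it is only usable for a handful of short strings. The names `overlap` and `shortest_superstring` are illustrative.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring(strings):
    """Exhaustive search: exponential time, for tiny inputs only."""
    unique = sorted(set(strings))
    # Strings contained in another string add nothing to the superstring.
    reads = [s for s in unique if not any(s != t and s in t for t in unique)]
    if not reads:
        return ""
    best = None
    for order in permutations(reads):
        merged = order[0]
        for nxt in order[1:]:
            # Append only the part of nxt not already covered by the overlap.
            merged += nxt[overlap(merged, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


print(shortest_superstring(["ATG", "TGG", "GGC"]))  # ATGGC
```

The factorial number of orderings is what makes this impractical beyond toy inputs, which is consistent with the NP-completeness noted on the slide.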
Reducing SSP to TSP
- Define overlap(s_i, s_j) as the length of the longest prefix of s_j that matches a suffix of s_i.
- Example string: aaaggcatcaaatctaaaggcatcaaa
- Construct a graph with n vertices representing the n strings s_1, s_2, ..., s_n.
- Insert edges of length overlap(s_i, s_j) between vertices s_i and s_j.
- Find the path that visits every vertex exactly once and has the largest total overlap (the shortest path if each edge length is taken as the negated overlap). This is the Traveling Salesman Problem (TSP), which is also NP-complete.
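A minimal sketch of the overlap graph described above, assuming overlaps are restricted to proper ones (strictly shorter than either string) so that a string can be compared with a second copy of itself. The function names are illustrative, and the path search is brute force rather than a real TSP solver.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Longest proper suffix of a that equals a prefix of b (0 if none)."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_graph(reads):
    """Directed graph as {(i, j): overlap(reads[i], reads[j])} for i != j."""
    n = len(reads)
    return {(i, j): overlap(reads[i], reads[j])
            for i in range(n) for j in range(n) if i != j}


def best_path(reads):
    """Brute-force Hamiltonian path with the largest total overlap."""
    graph = overlap_graph(reads)

    def total(order):
        return sum(graph[(u, v)] for u, v in zip(order, order[1:]))

    return max(permutations(range(len(reads))), key=total)


s = "aaaggcatcaaatctaaaggcatcaaa"
print(overlap(s, s))                      # 12 (the shared block aaaggcatcaaa)
print(best_path(["ATG", "TGG", "GGC"]))   # (0, 1, 2)
```

The length of the superstring spelled by a path equals the sum of the read lengths minus the path's total overlap, which is why maximizing overlap minimizes the superstring.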
De Bruijn graph
- Properties (for the n-dimensional de Bruijn graph on m symbols, whose vertices are all length-n sequences of those symbols):
- If n = 1, the condition for any two vertices forming an edge holds vacuously, so every vertex is connected to every vertex, forming a total of m^2 edges.
- Each vertex has exactly m incoming and m outgoing edges.
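The two properties can be checked on small cases with a short sketch. It assumes the usual edge rule, which the slide does not spell out: there is an edge from u to v when dropping the first symbol of u gives the same sequence as dropping the last symbol of v. The function name is illustrative.

```python
from collections import Counter
from itertools import product


def de_bruijn_graph(m: int, n: int):
    """Vertices are all length-n tuples over m symbols.

    Assumed edge rule: u -> v when u[1:] == v[:-1]. For n = 1 both sides
    are empty tuples, so the condition holds for every pair of vertices.
    """
    vertices = list(product(range(m), repeat=n))
    edges = [(u, v) for u in vertices for v in vertices if u[1:] == v[:-1]]
    return vertices, edges


vertices, edges = de_bruijn_graph(3, 1)
print(len(edges))  # 9, i.e. m^2 for m = 3

vertices, edges = de_bruijn_graph(2, 3)
out_deg = Counter(u for u, _ in edges)
in_deg = Counter(v for _, v in edges)
print(all(out_deg[v] == 2 and in_deg[v] == 2 for v in vertices))  # True
```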
Sequencing by Hybridization
l-mer (l-tuple) composition
- Spectrum(s, l): the unordered multiset of all (n – l + 1) l-mers in a string s of length n
- The order of individual elements in Spectrum(s, l) does not matter
- For s = TATGGTGC, all of the following are equivalent representations of Spectrum(s, 3):
  {TAT, ATG, TGG, GGT, GTG, TGC}
  {ATG, GGT, GTG, TAT, TGC, TGG}
  {TGG, TGC, TAT, GTG, GGT, ATG}
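As a small illustration of the definition, the spectrum can be computed with a sliding window. Using `collections.Counter` as the multiset is a choice made for this sketch; it makes the "order does not matter" point directly testable.

```python
from collections import Counter


def spectrum(s: str, l: int) -> Counter:
    """Multiset of all len(s) - l + 1 l-mers of s."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))


spec = spectrum("TATGGTGC", 3)
print(sum(spec.values()))  # 6, i.e. n - l + 1 = 8 - 3 + 1
# Any listing order describes the same multiset.
print(spec == Counter(["TGG", "TGC", "TAT", "GTG", "GGT", "ATG"]))  # True
```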
SBH: Eulerian Path Approach
- S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
- Vertices correspond to (l – 1)-mers: { AT, TG, GC, GG, GT, CA, CG }
- Edges correspond to l-mers from S
- Find a path that visits every EDGE exactly once (an Eulerian path)
- [Figure: graph on the vertices AT, TG, GC, GG, GT, CA, CG]
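The construction above can be sketched with Hierholzer's algorithm, a standard way to find an Eulerian path; the slide does not say which algorithm the author had in mind. The example assumes S contains the eight 3-mers ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT (TGG is needed for the vertex GG to have an incoming edge), and the function name is illustrative.

```python
from collections import defaultdict


def eulerian_path_sequence(lmers):
    """Spell a string from an Eulerian path; lmers must be non-empty."""
    graph = defaultdict(list)      # (l-1)-mer -> list of successor (l-1)-mers
    balance = defaultdict(int)     # out-degree minus in-degree
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), lmers[0][:-1])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())   # follow an unused edge
        else:
            path.append(stack.pop())       # dead end: vertex is finished
    path.reverse()
    if len(path) != len(lmers) + 1:
        raise ValueError("no Eulerian path uses every l-mer")
    return path[0] + "".join(v[-1] for v in path[1:])


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(eulerian_path_sequence(S))
```

The running time is linear in the number of l-mers, in contrast to the Hamiltonian path formulation of the overlap graph.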
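SBH: Eulerian Path Approach (continued)
- The graph on the vertices { AT, TG, GC, GG, GT, CA, CG } has two different Eulerian paths, so S corresponds to two different sequences:
  ATGGCGTGCA
  ATGCGTGGCA
- [Figure: the two paths drawn on the graph]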
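The ambiguity can be reproduced by enumerating every reconstruction with a backtracking sketch. It assumes the eight 3-mers ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT that both sequences share, assumes l is at least 2, and takes exponential time in the worst case, so it is meant for small examples only. The function name is illustrative.

```python
from collections import Counter


def all_reconstructions(lmers):
    """All strings whose l-mer multiset equals lmers (non-empty, l >= 2)."""
    l = len(lmers[0])
    remaining = Counter(lmers)
    results = set()

    def extend(seq, left):
        if left == 0:
            results.add(seq)
            return
        suffix = seq[-(l - 1):]
        for lmer in list(remaining):
            # Use an l-mer only if a copy is still unused and it continues seq.
            if remaining[lmer] > 0 and lmer.startswith(suffix):
                remaining[lmer] -= 1
                extend(seq + lmer[-1], left - 1)
                remaining[lmer] += 1

    for first in list(remaining):
        remaining[first] -= 1
        extend(first, len(lmers) - 1)
        remaining[first] += 1
    return sorted(results)


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(all_reconstructions(S))  # ['ATGCGTGGCA', 'ATGGCGTGCA']
```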
Error Correction or Data Corruption
- The EULER algorithm sometimes introduces errors.
- It accepts these errors in exchange for reducing the complexity of the de Bruijn graph.
- Reduction of the de Bruijn graph eliminates false edges.
- For example, in the N. meningitidis sequencing project, orphan elimination corrects 234,410 errors and introduces 1,452 errors.
Observations on EULER
Conclusions
- Finishing is a bottleneck in large-scale DNA sequencing.
- EULER has excellent scaling potential.
- The complexity of EULER is mainly defined by the number of tangles rather than the number of repeats or the length of the genome.
RESULTS AND DISCUSSION
- The general performance of SEA on the benchmark
- Prediction ambiguity improves alignment quality
- Alignment quality versus local structure prediction ambiguity
CONCLUSION
Any Questions?