Large Scale Assembly of DNA Strings using Suffix Trees David Rivshin Parallel 2 4/11/2001.


1 Large Scale Assembly of DNA Strings using Suffix Trees David Rivshin Parallel 2 4/11/2001

2 Overview
- The Superstring Problem
- GREEDY heuristic
- Tries and suffix trees
- Suffix tree implementation of GREEDY
- Issues in Parallelization

3 Superstrings
- Definition: if s1, s2, s3, ..., sN are strings of interest, a string S is a superstring of s1-sN if every one of s1-sN is a substring of S
- We are interested primarily in shortest superstrings
- The problem of reassembling slices of a DNA sequence can be described as the problem of finding the shortest superstring of those slices
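The definition can be checked directly. The sketch below is only an illustration: the function name `is_superstring` and the sample fragments are assumptions, not part of the slides.

```python
def is_superstring(candidate, strings):
    """Return True if every string in `strings` occurs as a substring of `candidate`."""
    return all(s in candidate for s in strings)


# Hypothetical fragments cut from the example string "agacttcg".
fragments = ["agac", "actt", "ttcg"]
print(is_superstring("agacttcg", fragments))  # every fragment occurs
print(is_superstring("agacttc", fragments))   # "ttcg" is missing
```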

4 Let's get GREEDY!
- The general problem of finding a shortest superstring is known to be NP-hard
- GREEDY is a well-known heuristic that approximates the shortest superstring "well enough" and can be implemented in linear time
- Many tweaks to GREEDY have been proposed in various papers to make it more accurate or to improve speed for specific problems

5 GREEDY Details
- Let A be the set of all strings of interest
- Remove from A any string which is a substring of another string in A, and keep one copy of any duplicate strings
- Find the two strings in A which have the greatest overlap
- Remove those two strings from A, and place their combination into A
- Repeat until there are no more overlaps
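The steps above can be sketched naively as follows. This is a direct, unoptimized transcription that compares every pair of strings on each round, so it is not the linear-time suffix tree version; the names `overlap` and `greedy_superstring` are assumptions made for illustration.

```python
def overlap(a, b):
    """Length of the longest proper suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    # Keep one copy of duplicates, then drop strings contained in another string.
    pool = list(dict.fromkeys(strings))
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]

    while len(pool) > 1:
        # Find the ordered pair with the greatest overlap.
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no more overlaps
        # Replace the pair with its combination.
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)

    # Any strings left without overlaps are simply concatenated.
    return "".join(pool)


print(greedy_superstring(["agac", "actt", "ttcg", "actt", "ga"]))
```

The final concatenation of non-overlapping leftovers is one possible choice; a DNA assembler would more likely report them as separate pieces.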

6 Tries
- A trie is a generic data structure for storing a list of words, where common prefixes are combined into nodes of the tree
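A minimal trie sketch, assuming a dictionary of children per node; the class and function names are hypothetical. Words that share a prefix share the nodes along that prefix.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if a stored word ends at this node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def contains(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_word


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


trie = build_trie(["gat", "gag", "cat"])
print(contains(trie, "gag"), contains(trie, "ga"))
print(count_nodes(trie))  # "gat" and "gag" share the nodes for "ga"
```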

7 Suffix Trees
- A suffix tree is a trie where the words are all the suffixes of a single string
[Figure: suffix tree for the example string agacttcg, whose suffixes are agacttcg, gacttcg, acttcg, cttcg, ttcg, tcg, cg, g, numbered 1-8]
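To make the definition concrete, the sketch below inserts every suffix of the example string into a plain nested-dictionary trie. It is a naive quadratic construction with no edge compression, shown only for illustration; the function names and the `"$"` leaf marker are assumptions.

```python
def suffixes(text):
    """Return (start position, suffix) pairs, with positions counted from 1."""
    return [(i + 1, text[i:]) for i in range(len(text))]


def build_suffix_trie(text):
    root = {}
    for start, suffix in suffixes(text):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
        node["$"] = start  # marks where the suffix starting at `start` ends
    return root


def occurs(root, pattern):
    """True if `pattern` is a substring of the indexed text.

    Assumes `pattern` does not contain the marker character "$".
    """
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("agacttcg")
for start, suffix in suffixes("agacttcg"):
    print(start, suffix)
print(occurs(trie, "actt"), occurs(trie, "tta"))
```

Every substring of the text is a prefix of some suffix, which is why a single walk from the root answers a substring query.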

8 Suffix Trees
- Suffix trees can be constructed in linear time
- It is straightforward to combine multiple strings into one larger suffix tree if terminal nodes are also labeled by their original string
- Substrings are identified by an (i,1) label on a non-leaf node
- Duplicates are identified by multiple (i,1) labels on the same node
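The labeling rules can be illustrated with an uncompressed trie over the suffixes of several strings. This is a sketch under assumptions: the node layout, the function names, and the reading of a label (i, d) as "the suffix of string i starting at position d" are illustrative, and the construction is naive rather than linear time.

```python
def build_generalized_suffix_trie(strings):
    root = {"children": {}, "labels": []}
    for i, s in enumerate(strings, start=1):
        for d in range(1, len(s) + 1):
            node = root
            for ch in s[d - 1:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "labels": []}
                )
            node["labels"].append((i, d))  # suffix d of string i ends here
    return root


def find_redundant(root):
    """Return (strings that are substrings of others, groups of duplicates)."""
    substrings, duplicates = set(), []
    stack = [root]
    while stack:
        node = stack.pop()
        whole = [i for (i, d) in node["labels"] if d == 1]
        if node["children"]:
            # An (i,1) label on a non-leaf node: string i continues into
            # some longer suffix, so it is a substring of another string.
            substrings.update(whole)
        if len(whole) > 1:
            # Several (i,1) labels on one node: identical strings.
            duplicates.append(sorted(whole))
        stack.extend(node["children"].values())
    return substrings, duplicates


tree = build_generalized_suffix_trie(["agac", "ga", "actt", "actt"])
print(find_redundant(tree))
```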

9 Suffix Tree GREEDY Setup
- Construct suffix tree containing all strings
- Remove substrings/duplicates
- Initialize CHAIN array to CHAIN[n] = 0
- Initialize WRAP array to WRAP[n] = n
- Create a list of nodes sorted by string depth
- Initialize P and S sets for each node
  - P_u contains a string number i for each (i, 1) at u
  - S_u contains a string number i for each (i, d>1) at u

10 ST GREEDY Meat
- Get the node u with the greatest string depth
- Remove any i from S_u for which CHAIN[i] != 0
- For each j from P_u
  - Test j against each i in S_u where WRAP[i] != j
  - If blend(i, j) is possible then
    - set CHAIN[i] = j
    - set WRAP[ WRAP[i] ] = WRAP[j]
    - set WRAP[ WRAP[j] ] = WRAP[i]
    - remove j from P_u and i from S_u

11 ST GREEDY Gristle
- Send remaining values in P_u to u's parent
- Discard remaining values in S_u
- Repeat for node with next-greatest string depth until root node (string depth = 0) is processed
- Inspect CHAIN array to determine the superstring created
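The CHAIN and WRAP bookkeeping can be sketched without the tree. The code below is a simplification under stated assumptions: it replaces the depth-ordered node walk with a list of pairwise overlaps sorted deepest first, stops at overlap length 1 instead of processing the root, assumes substrings and duplicates were already removed, and reads the two WRAP assignments as a simultaneous update. The names `greedy_chain`, `has_pred`, and `shift` are hypothetical.

```python
def longest_overlap(a, b):
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_chain(strings):
    n = len(strings)
    chain = [0] * (n + 1)          # chain[i] = j: string j follows i (0 = none)
    wrap = list(range(n + 1))      # wrap[x] = opposite end of the chain ending at x
    has_pred = [False] * (n + 1)   # stands in for "j was removed from P"
    shift = {}                     # overlap length used for the link out of i

    # Stand-in for visiting nodes by decreasing string depth.
    candidates = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if i != j:
                k = longest_overlap(strings[i - 1], strings[j - 1])
                if k > 0:
                    candidates.append((k, i, j))
    candidates.sort(key=lambda t: -t[0])

    for k, i, j in candidates:
        # Skip if i already has a successor, j already has a predecessor,
        # or j is the head of the chain that ends at i (would close a cycle).
        if chain[i] != 0 or has_pred[j] or wrap[i] == j:
            continue
        chain[i] = j
        has_pred[j] = True
        shift[i] = k
        head, tail = wrap[i], wrap[j]
        wrap[head], wrap[tail] = tail, head  # simultaneous update

    # Inspect CHAIN: follow links from every string that has no predecessor.
    pieces = []
    for start in range(1, n + 1):
        if not has_pred[start]:
            text, i = strings[start - 1], start
            while chain[i] != 0:
                j = chain[i]
                text += strings[j - 1][shift[i]:]
                i = j
            pieces.append(text)
    return chain, pieces


print(greedy_chain(["actt", "ttcg", "agac"]))
```

Using temporaries for the WRAP update matters when a chain consists of a single string, because then WRAP[i] is i itself and applying the two assignments one after the other would overwrite the value the second one needs.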

12 Issues with DNA Reassembly
- Sequence reassembly often requires that overlaps are at least D characters long
  - This can be accomplished easily by stopping the algorithm when the string depth goes below D
- Reassembly often requires handling fuzzy data
  - There is no apparent simple way to generalize this algorithm to support fuzziness without dramatically increasing the size of the tree
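The effect of the minimum overlap D can be illustrated by listing only the candidate overlaps of length at least D, deepest first, which mirrors stopping once the string depth falls below D. This is a naive pairwise sketch, not the tree traversal; `candidate_overlaps`, `min_overlap`, and the sample reads are assumptions.

```python
def candidate_overlaps(strings, min_overlap):
    """Return (k, i, j) triples: the longest overlap k >= min_overlap between
    a suffix of string i and a prefix of string j, sorted deepest first."""
    floor = max(min_overlap, 1)  # an overlap of length 0 is not an overlap
    found = []
    for i, a in enumerate(strings, start=1):
        for j, b in enumerate(strings, start=1):
            if i == j:
                continue
            for k in range(min(len(a), len(b)) - 1, floor - 1, -1):
                if a.endswith(b[:k]):
                    found.append((k, i, j))
                    break
    found.sort(key=lambda t: -t[0])
    return found


reads = ["agacttc", "ttcgga", "gatt"]
print(candidate_overlaps(reads, 1))
print(candidate_overlaps(reads, 3))
```

Raising D discards short, likely coincidental overlaps, at the cost of leaving more strings unmerged.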

13 Parallelization
- Two distinct phases which are mutually serial:
  - setup, especially building the tree
  - algorithm proper
- Construction of the tree can be easily done in parallel by assigning separate pieces of the finished tree to different processing elements
- The algorithm can be done in parallel by assigning different nodes of equal string depth to different processing elements
