Large Scale Assembly of DNA Strings using Suffix Trees David Rivshin Parallel 2 4/11/2001
Overview
- The Superstring Problem
- GREEDY heuristic
- Tries and suffix trees
- Suffix tree implementation of GREEDY
- Issues in Parallelization
Superstrings
- Definition:
  - if s1, s2, s3, ..., sN are the strings of interest,
  - a string S is a superstring of s1..sN if every one of s1..sN is a substring of S
- We are interested primarily in shortest superstrings
- The problem of reassembling slices of a DNA sequence can be described as the problem of finding the shortest superstring of those slices
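The definition above can be checked directly. The following is a minimal sketch; the function name `is_superstring` and the sample fragments are illustrative assumptions, not part of the original slides.

```python
def is_superstring(candidate, strings):
    """Return True if every string in `strings` occurs as a substring of `candidate`."""
    return all(s in candidate for s in strings)


# Hypothetical fragments cut from the example sequence "agacttcg".
fragments = ["agac", "actt", "ttcg"]
print(is_superstring("agacttcg", fragments))  # True
print(is_superstring("agacttc", fragments))   # False: "ttcg" is missing
```

Checking that a candidate is a superstring is easy; the hard part is finding a shortest one.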
Let's get GREEDY!
- The general problem of finding a shortest superstring is known to be NP-hard
- GREEDY is a well-known linear-time algorithm that approximates the shortest superstring "well enough"
- Many tweaks to GREEDY have been proposed in various papers to make it more accurate or to improve speed for specific problems
GREEDY Details
- Let A be the set of all strings of interest
- Remove from A any string that is a substring of another string in A, and keep one copy of any duplicate strings
- Find the two strings in A that have the greatest overlap
- Remove those two strings from A, and place their combination into A
- Repeat until there are no more overlaps
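The steps above can be sketched naively in Python. This version compares every pair of strings on each round, so it is far from linear time; it is meant only to show the control flow. The names `overlap` and `greedy_superstring` are assumptions made for this illustration.

```python
def overlap(a, b):
    """Length of the longest proper suffix of `a` that is also a proper prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    # Keep one copy of any duplicates, then drop strings contained in another string.
    pool = list(dict.fromkeys(strings))
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]

    while True:
        # Find the ordered pair (a, b) with the greatest overlap.
        best_k, best_a, best_b = 0, None, None
        for a in pool:
            for b in pool:
                if a != b:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break  # no more overlaps
        # Replace the pair by their combination.
        pool.remove(best_a)
        pool.remove(best_b)
        pool.append(best_a + best_b[best_k:])
    return pool


print(greedy_superstring(["agac", "actt", "ttcg", "ctt"]))  # ['agacttcg']
```

If several strings remain with no overlap between them, the sketch returns them all; concatenating them in any order yields a superstring.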
Tries
- A trie is a generic data structure for storing a list of words, where common prefixes are combined into nodes of the tree
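As a small illustration of prefix sharing, a trie can be sketched with nested dictionaries. The `"$"` end-of-word marker and the function names `build_trie` and `contains` are assumptions chosen for this example.

```python
def build_trie(words):
    """Build a trie as nested dicts; the key "$" marks the end of a stored word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            # Words with a common prefix reuse the same chain of nodes.
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def contains(trie, word):
    """Return True if `word` was stored in the trie as a complete word."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node


trie = build_trie(["act", "acg", "tag"])
print(contains(trie, "acg"))  # True
print(contains(trie, "ac"))   # False: "ac" is only a prefix
```

Here "act" and "acg" share the nodes for "a" and "c", so the root has only two children.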
Suffix Trees
- A suffix tree is a trie where the words are all the suffixes of a single string
- Example: the suffixes of agacttcg are
  agacttcg
  gacttcg
  acttcg
  cttcg
  ttcg
  tcg
  cg
  g
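The suffix list for the example string can be reproduced in a few lines. The sketch below also inserts the suffixes into an uncompressed trie of nested dictionaries; this naive construction takes quadratic time and space, unlike a true suffix tree built by a linear-time algorithm. The function names are assumptions made for this illustration.

```python
def suffixes(text):
    """All non-empty suffixes of `text`, longest first."""
    return [text[i:] for i in range(len(text))]


def build_suffix_trie(text):
    """Naive, uncompressed trie of all suffixes (quadratic; for illustration only)."""
    root = {}
    for suffix in suffixes(text):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
    return root


def has_substring(trie, pattern):
    """Every substring of the text is a prefix of some suffix, so just walk down."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


print(suffixes("agacttcg"))
# ['agacttcg', 'gacttcg', 'acttcg', 'cttcg', 'ttcg', 'tcg', 'cg', 'g']
trie = build_suffix_trie("agacttcg")
print(has_substring(trie, "ctt"))  # True
print(has_substring(trie, "gg"))   # False
```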
Suffix Trees
- Suffix trees can be constructed in linear time
- It is straightforward to combine multiple strings into one larger suffix tree if terminal nodes are also labeled by their original string
- A label (i, d) marks the suffix of string i that starts at position d, so (i, 1) marks the whole of string i
- Substrings are identified by an (i, 1) label on a non-leaf node
- Duplicates are identified by multiple (i, 1) labels on the same node
Suffix Tree GREEDY Setup
- Construct a suffix tree containing all strings
- Remove substrings/duplicates
- Initialize the CHAIN array to CHAIN[n] = 0
- Initialize the WRAP array to WRAP[n] = n
- Create a list of nodes sorted by string depth
- Initialize P and S sets for each node
  - P_u contains a string number i for each (i, 1) at u
  - S_u contains a string number i for each (i, d>1) at u
ST GREEDY Meat
- Get the node u with the greatest string depth
- Remove any i from S_u for which CHAIN[i] != 0
- For each j from P_u
  - Test j against each i in S_u where WRAP[i] != j
  - If blend(i, j) is possible then
    - set CHAIN[i] = j
    - set WRAP[ WRAP[i] ] = WRAP[j]
    - set WRAP[ WRAP[j] ] = WRAP[i]
    - (both WRAP updates must use the values WRAP[i] and WRAP[j] held before either assignment)
    - remove j from P_u and i from S_u
ST GREEDY Gristle
- Send remaining values in P_u to u's parent
- Discard remaining values in S_u
- Repeat for the node with the next-greatest string depth until the root node (string depth = 0) is processed
- Inspect the CHAIN array to determine the superstring created
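The CHAIN/WRAP bookkeeping can be tried out without a suffix tree. The sketch below is an assumption-laden stand-in: dictionaries keyed by candidate overlap strings play the role of the tree nodes, so it is not linear time, and blend(i, j) is modeled only by the WRAP[i] != j test. The names `st_greedy_sketch`, `min_overlap`, `overlap_len`, and `has_pred` are hypothetical; `has_pred[j]` imitates removing j from P_u so it is never sent to a parent. Strings are numbered from 1 so that CHAIN[i] = 0 can mean "no successor".

```python
from collections import defaultdict


def st_greedy_sketch(strings, min_overlap=1):
    """Greedy chaining with CHAIN/WRAP arrays.

    Assumes the strings are distinct and none is a substring of another.
    Returns (CHAIN, contigs).
    """
    n = len(strings)
    text = [None] + list(strings)   # 1-based numbering; index 0 is unused
    CHAIN = [0] * (n + 1)           # CHAIN[i] = j means string j follows string i
    WRAP = list(range(n + 1))       # WRAP[x] = other end of the chain that x terminates
    overlap_len = [0] * (n + 1)     # overlap used for the link i -> CHAIN[i]
    has_pred = [False] * (n + 1)    # True once string j already follows some string

    # Stand-in for tree nodes: P[w] holds strings with proper prefix w,
    # S[w] holds strings with proper suffix w.
    P, S = defaultdict(list), defaultdict(list)
    for i in range(1, n + 1):
        for d in range(max(1, min_overlap), len(text[i])):
            P[text[i][:d]].append(i)
            S[text[i][-d:]].append(i)

    # Process candidate overlaps from the greatest string depth downward.
    for w in sorted(P, key=len, reverse=True):
        candidates = [i for i in S.get(w, []) if CHAIN[i] == 0]
        for j in P[w]:
            if has_pred[j]:
                continue
            for i in candidates:
                if WRAP[i] != j:  # linking i -> j must not close a cycle
                    CHAIN[i], overlap_len[i], has_pred[j] = j, len(w), True
                    head, tail = WRAP[i], WRAP[j]      # read both old values first
                    WRAP[head], WRAP[tail] = tail, head
                    candidates.remove(i)
                    break

    # Inspect CHAIN: start at each string with no predecessor and follow the links.
    contigs = []
    for start in range(1, n + 1):
        if has_pred[start]:
            continue
        out, cur = text[start], start
        while CHAIN[cur] != 0:
            out += text[CHAIN[cur]][overlap_len[cur]:]
            cur = CHAIN[cur]
        contigs.append(out)
    return CHAIN, contigs


print(st_greedy_sketch(["agac", "actt", "ttcg"]))
# ([0, 2, 3, 0], ['agacttcg'])
```

Reading WRAP[i] and WRAP[j] into temporaries before updating matters: when i or j is still a single-string chain, the first assignment would otherwise overwrite the value the second one needs. Raising `min_overlap` skips every overlap shorter than that threshold, which leaves more, shorter contigs.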
Issues with DNA Reassembly
- Sequence reassembly often requires that overlaps are at least D characters long
  - this can be accomplished easily by stopping the algorithm when the string depth goes below D
- Reassembly often requires handling fuzzy data
  - There is no apparent simple way to generalize this algorithm to support fuzziness without dramatically increasing the size of the tree
Parallelization
- Two distinct phases which are mutually serial:
  - setup, especially building the tree
  - algorithm proper
- Construction of the tree can be easily done in parallel by assigning separate pieces of the finished tree to different processing elements
- The algorithm can be done in parallel by assigning different nodes of equal string depth to different processing elements