Large Scale Assembly of DNA Strings using Suffix Trees David Rivshin Parallel 2 4/11/2001.

Overview
- The Superstring Problem
- GREEDY heuristic
- Tries and suffix trees
- Suffix tree implementation of GREEDY
- Issues in Parallelization

Superstrings
- Definition: if s1, s2, s3, ..., sN are the strings of interest, a string S is a superstring of s1-sN if every one of s1-sN is a substring of S
- We are interested primarily in shortest superstrings
- The problem of reassembling slices of a DNA sequence can be described as the problem of finding the shortest superstring of those slices
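The definition above can be checked directly. The following is a minimal sketch, not part of the original slides; the function name `is_superstring` and the three example fragments are made up for illustration.

```python
def is_superstring(candidate, strings):
    """Return True if every string in `strings` occurs as a substring of `candidate`."""
    return all(s in candidate for s in strings)


# Hypothetical fragments, chosen so that they overlap.
fragments = ["agact", "acttc", "ttcg"]

# Plain concatenation is always a superstring, but rarely a shortest one.
concatenated = "".join(fragments)
print(len(concatenated), is_superstring(concatenated, fragments))

# A much shorter candidate that exploits the overlaps.
print(len("agacttcg"), is_superstring("agacttcg", fragments))
```

Both candidates satisfy the definition; the interesting question is how short a superstring can be made.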

Let's get GREEDY!
- The general problem of finding a shortest superstring is known to be NP-hard
- GREEDY is a well-known linear-time algorithm which approximates the shortest superstring "well enough"
- Many tweaks to GREEDY have been proposed in various papers to make it more accurate or to improve its speed for specific problems

GREEDY Details
- Let A be the set of all strings of interest
- Remove from A any string which is a substring of another string in A, and keep one copy of any duplicate strings
- Find the two strings in A which have the greatest overlap
- Remove those two strings from A, and place their combination into A
- Repeat until there are no more overlaps
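The steps above can be sketched as follows. This is a deliberately naive, quadratic illustration of the heuristic, not the linear-time suffix tree version the slides go on to describe; the names `overlap` and `greedy_superstring` and the example fragments are assumptions made for this sketch.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Naive GREEDY: repeatedly merge the pair with the greatest overlap.

    Returns the list of strings left when no overlaps remain.
    """
    # Keep one copy of duplicates, then drop strings contained in another string.
    unique = list(dict.fromkeys(strings))
    a = [s for s in unique if not any(s != t and s in t for t in unique)]

    while len(a) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, s in enumerate(a):
            for j, t in enumerate(a):
                if i != j:
                    k = overlap(s, t)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no more overlaps
        merged = a[best_i] + a[best_j][best_k:]
        a = [s for idx, s in enumerate(a) if idx not in (best_i, best_j)]
        a.append(merged)
    return a


print(greedy_superstring(["agact", "acttc", "ttcg", "act", "agact"]))
```

In this example "act" is dropped as a substring of "agact" and the duplicate "agact" is kept once before any merging starts.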

Tries
- A trie is a generic data structure for storing a list of words, where common prefixes are combined into nodes of the tree

Suffix Trees
- A suffix tree is a trie where the words are all the suffixes of a single string
- Example: the suffixes of agacttcg are
  agacttcg
  gacttcg
  acttcg
  cttcg
  ttcg
  tcg
  cg
  g
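To make the relationship between tries and suffix trees concrete, here is a minimal sketch that stores all suffixes of a string in an uncompressed trie built from nested dictionaries. It is only an illustration: it uses quadratic space and is not the compact, linear-time suffix tree the slides refer to. The names `suffixes`, `build_trie`, and `contains_substring`, and the `"$"` end marker (assumed not to occur in the input alphabet), are choices made for this example.

```python
def suffixes(s):
    """All suffixes of `s`, longest first."""
    return [s[i:] for i in range(len(s))]


def build_trie(words):
    """Insert each word into a nested-dict trie; shared prefixes share nodes."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # marks the end of a stored word
    return root


def contains_substring(suffix_trie, pattern):
    """Every substring is a prefix of some suffix, so just walk down from the root."""
    node = suffix_trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_trie(suffixes("agacttcg"))
print(suffixes("agacttcg"))
print(contains_substring(trie, "actt"), contains_substring(trie, "gg"))
```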

Suffix Trees
- Suffix trees can be constructed in linear time
- It is straightforward to combine multiple strings into one larger suffix tree if terminal nodes are also labeled with their original string
- Substrings are identified by an (i, 1) label on a non-leaf node
- Duplicates are identified by multiple (i, 1) labels on the same node

Suffix Tree GREEDY Setup
- Construct a suffix tree containing all strings
- Remove substrings/duplicates
- Initialize the CHAIN array to CHAIN[n] = 0
- Initialize the WRAP array to WRAP[n] = n
- Create a list of nodes sorted by string depth
- Initialize the P and S sets for each node
  - P_u contains a string number i for each (i, 1) at u
  - S_u contains a string number i for each (i, d>1) at u

ST GREEDY Meat
- Get the node u with the greatest string depth
- Remove any i from S_u for which CHAIN[i] != 0
- For each j from P_u
  - Test j against each i in S_u where WRAP[i] != j
  - If blend(i, j) is possible then
    - set CHAIN[i] = j
    - set WRAP[ WRAP[i] ] = WRAP[j]
    - set WRAP[ WRAP[j] ] = WRAP[i]
    - remove j from P_u and i from S_u

ST GREEDY Gristle
- Send remaining values in P_u to u's parent
- Discard remaining values in S_u
- Repeat for the node with the next-greatest string depth until the root node (string depth = 0) is processed
- Inspect the CHAIN array to determine the superstring created
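The CHAIN and WRAP bookkeeping can be tried out in isolation. The sketch below is an interpretation, not the presenter's code: it leaves out the suffix tree entirely and applies the blends by hand. It assumes 1-based string numbers (index 0 is unused, matching CHAIN[n] = 0 meaning "no successor"), reads WRAP as pointing from one end of a chain to its other end, and assumes the caller never reuses a j that already has a predecessor (the slides achieve this by removing j from P_u). Both WRAP values are read before either is written, since the two assignments are meant to use the old values. The helper names `can_blend`, `blend`, `assemble`, and `overlap` are made up here, and the final merge recomputes overlaps rather than reading string depths from the tree.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def can_blend(i, j, chain, wrap):
    """i may be followed by j if i has no successor and j is not the head of i's chain."""
    return chain[i] == 0 and wrap[i] != j


def blend(i, j, chain, wrap):
    """Record that string j follows string i and update the chain end pointers."""
    head, tail = wrap[i], wrap[j]  # read both ends before writing either
    chain[i] = j
    wrap[head] = tail
    wrap[tail] = head


def assemble(strings, chain):
    """Follow CHAIN from every string without a predecessor and merge overlaps."""
    n = len(strings) - 1
    has_predecessor = set(chain[1:]) - {0}
    results = []
    for head in range(1, n + 1):
        if head in has_predecessor:
            continue
        text, i = strings[head], head
        while chain[i] != 0:
            j = chain[i]
            text += strings[j][overlap(strings[i], strings[j]):]
            i = j
        results.append(text)
    return results


strings = ["", "agact", "acttc", "ttcg"]  # index 0 unused
chain = [0] * len(strings)                # CHAIN[n] = 0
wrap = list(range(len(strings)))          # WRAP[n] = n

blend(1, 2, chain, wrap)
blend(2, 3, chain, wrap)
print(can_blend(3, 1, chain, wrap))       # would close the chain into a cycle
print(assemble(strings, chain))
```

The WRAP[i] != j test is what rejects the last candidate: string 1 is the head of the chain that ends at string 3, so linking 3 to 1 would form a cycle.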

Issues with DNA Reassembly
- Sequence reassembly often requires that overlaps are at least D characters long
  - This can be accomplished easily by stopping the algorithm when the string depth goes below D
- Reassembly often requires handling fuzzy data
  - There is no apparent simple way to generalize this algorithm to support fuzziness without dramatically increasing the size of the tree

Parallelization
- Two distinct phases which are mutually serial:
  - setup, especially building the tree
  - algorithm proper
- Construction of the tree can be easily done in parallel by assigning separate pieces of the finished tree to different processing elements
- The algorithm can be done in parallel by assigning different nodes of equal string depth to different processing elements
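As a rough illustration of the last point, the sketch below groups nodes into batches of equal string depth, deepest first, and stops once the depth drops below a minimum overlap D. It only shows the grouping; how batches are distributed to processing elements, and how concurrent updates to CHAIN and WRAP would be coordinated, is left open. The `(node_id, string_depth)` tuples and the name `depth_batches` are hypothetical.

```python
from itertools import groupby


def depth_batches(nodes, min_depth=1):
    """Yield (depth, [node ids]) in decreasing depth order, stopping below `min_depth`.

    `nodes` is an iterable of (node_id, string_depth) pairs.
    """
    ordered = sorted(nodes, key=lambda n: n[1], reverse=True)
    for depth, group in groupby(ordered, key=lambda n: n[1]):
        if depth < min_depth:
            break  # overlaps shorter than D are not wanted
        yield depth, [node_id for node_id, _ in group]


# Made-up nodes for illustration only.
nodes = [("u1", 3), ("u2", 5), ("u3", 3), ("root", 0), ("u4", 1)]
for depth, batch in depth_batches(nodes, min_depth=2):
    print(depth, batch)
```

Each batch is a set of nodes that could, per the slide, be handed to different processing elements before moving on to the next depth.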