Database Index to Large Biological Sequences Ela Hunt, Malcolm P. Atkinson, and Robert W. Irving Proceedings of the 27th VLDB Conference,2001 Presented.

Presentation transcript:

Database Index to Large Biological Sequences
Ela Hunt, Malcolm P. Atkinson, and Robert W. Irving
Proceedings of the 27th VLDB Conference, 2001
Presented by Raghav & Balaji

Indexing Large Biological Sequences
• Introduction
• Indexing strategies
• Suffix trees
• New Construction Algorithm
• Query
• Experiment and Results
• Conclusion

Introduction
• What is DNA?
  - A sequence over four bases: A, C, G, T (A pairs with T, C pairs with G)
  - Length is measured in base pairs; Gbp = giga base pairs
  - Mammalian genomes: about 3 Gbp
• What is the challenge in indexing DNA?
  - Large size and no definite pattern
• Searching genetic DNA sequences
  - Sequential scanning and filtering approach (BLAST, FASTA)

Introduction
• The rise in the volume of data and in the demand for searches by researchers has accelerated the need for better, index-based searches.
• New sequences will be revealed as improved sequencing techniques are developed.
• Determining DNA sequences is useful in studying fundamental biological processes, as well as in forensic research.

Indexing Strategies Considered
• Inverted files
  - Not suitable, since DNA cannot be broken into words.
• B-tree
  - Not suitable, for the same reason.
• Q-grams
  - Cannot deliver matches that have low similarity to the query.
• Most of these techniques are infeasible for DNA.

Indexing Strategies Considered
• Suffix trees
  - Ideal choice for this type of indexing.
  - Suffix trees on disk could only be built for small sequences: the "memory bottleneck".
• Suffix tree storage optimization
  - Reduces the RAM required to around 13 bytes per character indexed.
  - Not tested on disk.

Indexing Strategies Considered
• Approach to searching genetic DNA sequences using an adaptation of the suffix tree:
  - Build the suffix tree on disk for arbitrarily large sequences.
  - New query processing strategies.
• Alternative data structures: q-grams, suffix array, String B-tree…

Suffix Trees
• A suffix tree is a compressed digital trie.
• A suffix tree is a rooted directed tree with m leaves, where m is the length of S (the database string).
• For any leaf i, the concatenation of the edge labels on the path from the root to leaf i spells out exactly the suffix of S that starts at position i.

Suffix Trees Suffix tree is a compressed digital (suffix) trie

Suffixes of mississippi:
 1 mississippi
 2 ississippi
 3 ssissippi
 4 sissippi
 5 issippi
 6 ssippi
 7 sippi
 8 ippi
 9 ppi
10 pi
11 i
[Figure: suffix tree building for mississippi]
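The suffix list above can be reproduced with a few lines of Python. This is only an illustrative sketch, not code from the paper; the helper names `suffixes` and `nested_suffixes` are made up for this example. The second helper shows why suffix trees are normally built over the string plus a unique terminator such as `$` (as in the ANA$ example later in the talk): without it, a suffix like "i" is a prefix of another suffix and would not end at its own leaf.

```python
def suffixes(s):
    """Return (start_position, suffix) pairs, numbered from 1 as on the slide."""
    return [(i + 1, s[i:]) for i in range(len(s))]


def nested_suffixes(s):
    """Suffixes that are a proper prefix of some other suffix.

    Such suffixes end inside an edge or at an internal node, so they
    get no leaf of their own unless a unique terminator is appended.
    """
    sufs = [suf for _, suf in suffixes(s)]
    return [a for a in sufs if any(b != a and b.startswith(a) for b in sufs)]


if __name__ == "__main__":
    for pos, suf in suffixes("mississippi"):
        print(pos, suf)
    print("nested without terminator:", nested_suffixes("mississippi"))
    print("nested with terminator:   ", nested_suffixes("mississippi$"))
```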

[Figure: resulting suffix tree for mississippi]

Suffix Trees
• Suffix links:
  - A necessary implementation trick to achieve linear time and space bounds while building the tree.
  - A suffix link is a pointer from an internal node xS to another internal node S, where x is an arbitrary character and S is a possibly empty substring.

Suffix Trees
• Construction
  - Ukkonen's method, based on suffix links
  - Complexity: O(n)

Suffix Trees
• General applications of suffix trees
  - Find all occurrences of q as a substring of S
  - Find the longest substring common to a set T of strings
  - Find the longest palindrome in S
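As a rough illustration of the first application (all occurrences of q in S), the sketch below builds an uncompressed suffix trie from nested dictionaries and walks it. It is a teaching aid under stated assumptions, not the paper's data structure: it uses quadratic space, has no edge compression and no suffix links, and the names `build_suffix_trie` and `find_occurrences` are hypothetical.

```python
def build_suffix_trie(s, terminator="$"):
    """Naive suffix trie: one dict per node, children keyed by character.

    A leaf is marked with the multi-character key "leaf", which cannot
    clash with the single-character edge keys, and stores the 1-based
    start position of the suffix that ends there.
    """
    text = s + terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def find_occurrences(root, q):
    """Return sorted 1-based positions where q occurs in the indexed string."""
    node = root
    for ch in q:              # follow q from the root: O(len(q)) steps
        if ch not in node:
            return []
        node = node[ch]
    positions = []
    stack = [node]            # every leaf below this node is a match
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


if __name__ == "__main__":
    trie = build_suffix_trie("mississippi")
    print(find_occurrences(trie, "ssi"))
```

The search follows the query characters down from the root and then reports every leaf in the subtree, which is the same shape as the O(k+m) query described on the Query slide.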

Suffix Trees
• Analysis of the suffix-link-based algorithm
  - Build the tree incrementally, checkpointing the tree after each portion has been attempted.
  - Two distinct traversal patterns exist, and both are used during construction.
  - Very long construction time.
• These effects combine to limit the size of the tree that can be constructed and stored on disk to the available main memory.

Suffix Trees
• With the suffix-link-based algorithm, it was observed that checkpointing trees indexing more than 21 Mbp was not possible using 1.8 GB of main memory.
• Reason: object header size increases.

New Construction Algorithm
• Difficulties of traditional suffix tree construction:
  - Memory bottleneck
  - Necessity of random access
• New concept:
  - Abandon the use of suffix links.
  - Perform multiple passes over the sequence, constructing the suffix tree for a subrange of suffixes at each pass.

New Construction Algorithm
• Removing suffix links means that constructing a new partition does not modify previously checkpointed partitions of the tree.
• Using multiple passes means that it is not necessary to access or update previously checkpointed partitions.
• That is, the data structures for completed partitions can be evicted from main memory and will not be faulted back in during the rest of the tree's construction.

New Construction Algorithm
• Partition concept:
  - Build multiple suffix trees that each fit in memory (suffixes starting with AC, AT or AG fall into different partitions).
  - Partitions are based on the prefix of each suffix.
• Use a sliding window of length l:
  - Form a string s1 of window length l.
  - Scan the sequence and count the number of occurrences of s1.
  - Use a bin packing technique to pack the (s1, #occurrences) pairs into partitions.
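The counting and packing steps might look roughly like the following. The slide only says "a bin packing technique", so first-fit decreasing is an assumption chosen for illustration, and `count_prefixes`, `pack_partitions` and `capacity` (the number of suffixes one in-memory partition can hold) are hypothetical names. Suffixes shorter than the window at the very end of the sequence are ignored here.

```python
from collections import Counter


def count_prefixes(seq, l):
    """Slide a window of length l over seq and count each distinct window."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))


def pack_partitions(counts, capacity):
    """First-fit decreasing bin packing of (prefix, count) pairs.

    Returns a list of bins, each [total_count, [prefixes...]].
    A prefix whose count alone exceeds capacity still gets its own bin.
    """
    bins = []
    # Largest counts first; ties broken alphabetically for determinism.
    for prefix, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for b in bins:
            if b[0] + n <= capacity:
                b[0] += n
                b[1].append(prefix)
                break
        else:
            bins.append([n, [prefix]])
    return bins


if __name__ == "__main__":
    counts = count_prefixes("ACGTACGTAC", 2)
    for total, prefixes in pack_partitions(counts, capacity=5):
        print(total, prefixes)
```

Each resulting bin corresponds to one pass over the sequence: only suffixes whose prefix is in that bin are inserted into the tree built during that pass.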

New Construction Algorithm
• Partition technology:
  - Assumption: the tree is uniformly populated.
  - Prefix code (P_i): suffixes that are indexed during the j-th pass of the sequence have j·r ≤ P_i < (j+1)·r.
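One way to read the prefix-code rule is as integer arithmetic on a fixed-length prefix. The sketch below assumes that the code is the prefix read as a base-4 number with A=0, C=1, G=2, T=3, and that r is the number of consecutive codes handled per pass; neither detail is stated on the slide, so both are assumptions, and the function names are hypothetical.

```python
# Assumed encoding, for illustration only.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def prefix_code(seq, i, l):
    """Code of the length-l prefix of the suffix starting at index i (0-based)."""
    code = 0
    for ch in seq[i:i + l]:
        code = code * 4 + CODE[ch]
    return code


def in_pass(code, j, r):
    """True if a suffix with this prefix code is indexed during pass j."""
    return j * r <= code < (j + 1) * r


def pass_number(code, r):
    """The unique pass j that satisfies j*r <= code < (j+1)*r."""
    return code // r


if __name__ == "__main__":
    code = prefix_code("GTAC", 0, 2)   # prefix "GT"
    print(code, pass_number(code, r=4))
```

Under the uniform-population assumption every pass receives about the same number of suffixes, which is why equal-width code ranges are enough.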

New Construction Algorithm
• The actual algorithm [Pseudo code]
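The pseudocode from this slide did not survive in the transcript. The following is not a reconstruction of it, only a minimal sketch of the multi-pass idea described on the preceding slides: each pass scans the whole sequence, inserts only the suffixes whose prefix code falls in that pass's range, and then sets the finished partition aside without touching it again. It uses an uncompressed dictionary trie without suffix links, treats `$` as code 0, and appends to a list where the real system would checkpoint to disk; all names are hypothetical.

```python
ALPHABET = "$ACGT"                      # '$' is the terminator
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def prefix_code(text, i, l):
    """Base-5 code of the length-l prefix at index i, padded with '$'."""
    code = 0
    for k in range(l):
        ch = text[i + k] if i + k < len(text) else "$"
        code = code * len(ALPHABET) + RANK[ch]
    return code


def build_partitioned(seq, l, r):
    """Build one trie per pass; pass j takes codes in [j*r, (j+1)*r)."""
    text = seq + "$"
    total_codes = len(ALPHABET) ** l
    partitions = []
    for j in range((total_codes + r - 1) // r):
        root = {}
        for i in range(len(text)):      # one full scan of the sequence per pass
            if j * r <= prefix_code(text, i, l) < (j + 1) * r:
                node = root
                for ch in text[i:]:     # plain insertion, no suffix links
                    node = node.setdefault(ch, {})
                node["leaf"] = i + 1    # 1-based suffix number
        partitions.append(root)         # stands in for a checkpoint to disk
    return partitions


def leaves(node):
    """Sorted suffix numbers stored below a trie node."""
    out = []
    for key, child in node.items():
        if key == "leaf":
            out.append(child)
        else:
            out.extend(leaves(child))
    return sorted(out)


if __name__ == "__main__":
    for j, part in enumerate(build_partitioned("ACGTAC", l=1, r=2)):
        print("pass", j, "suffixes", leaves(part))
```

Because no pass ever reads or writes an earlier partition, the partitions are independent of each other, which is the property the talk relies on for evicting completed partitions from memory.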

New Construction Algorithm
[Figure: tree creation for ANA$. Steps shown: 1 root, 2 ANA$, 3 NA$, 4 A$, 5 $]

New Construction Algorithm
[Figure: node layouts. Original tree (Ukkonen): left index, right index, suffix number, sib, suffix link, child. Modified node: left index, child, sib.]

Query
• Only exact pattern matching.
• One query involves one partial traversal.
• Complexity of suffix tree search: O(k+m), where k is the query length and m is the number of matches in the index.
• Queries of length q bring back a 1/(a^q) fraction of the whole tree, where a is the size of the active alphabet, i.e. 4 (A, C, G, T).
• New query strategies:
  - Short query: serial scan of the sequence
  - Longer query: use the index structure
  - Threshold: 10 to 12 letters
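The 1/(a^q) estimate and the scan-versus-index rule can be made concrete with a little arithmetic. In the sketch below the function names are hypothetical, and the default threshold of 10 is simply one value picked from the 10 to 12 range quoted on the slide.

```python
def fraction_of_tree(q, a=4):
    """Expected fraction of the tree returned by a query of length q,
    assuming a uniformly populated tree over an alphabet of size a."""
    return 1 / a ** q


def choose_strategy(query, threshold=10):
    """Pick a query strategy by length.

    threshold is an assumed cut-off taken from the slide's 10-12 range.
    """
    return "scan" if len(query) < threshold else "index"


if __name__ == "__main__":
    for q in (1, 5, 10, 12):
        print(q, fraction_of_tree(q))
    print(choose_strategy("ACGT"), choose_strategy("ACGTACGTACGT"))
```

Short queries match a large share of the tree, so traversing all of those subtrees on disk costs more than one serial pass over the sequence; past the threshold the matching fraction is small enough for the index to win.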

Experiment and Results
• Development and experiment platform:
  - Software: PJama, Java 1.3, and Solaris 7 OS
  - Hardware: Enterprise 450 with 2 GB RAM
• Test data:
  - 6 single chromosomes of the worm C. elegans (20.5 Mbp max. length)
  - Human chromosomes 21, 22, and 1 (280 Mbp)
• Alphabet: A, C, G, T, $, *

Experiment and Results
• Trees with suffix links (using 20.5 Mbp of DNA):
  - Construction in memory: 7 minutes
  - Construction on disk: 34 hours
• Trees without suffix links (263 Mbp of DNA):
  - 19 hours

Experiment Results
• Exact string matching using 263 Mbp of human DNA
• Queries sent in batches using warm storage

Experiment Results
• Cold storage

Experiment Results

Further Work
• Improvements to the tree representation and incremental construction algorithm.
• Investigation of the interaction between approximate matching algorithms and disk-based suffix trees.
• Investigation of alternative persistent storage solutions.
• Integration of the algorithms with biological research tools and usability studies.

Conclusion
• Presents an approach to searching genetic DNA sequences using an adaptation of the suffix tree data structure.
• Allows suffix trees to be built on disk for arbitrarily large sequences.
• Opens up the prospect of building suffix trees in parallel; the simplicity of this approach can make suffix trees more popular.