Linear Time Suffix Array Construction Using D-Critical Substrings


Linear Time Suffix Array Construction Using D-Critical Substrings
Ge Nong, Sun Yat-sen Univ.
Sen Zhang, SUNY College at Oneonta
Wai Hong Chan, Hong Kong Baptist Univ.

Talk outline: background; existing linear SA algorithms; our linear SA algorithm; performance evaluation.

SA and its applications: The suffix array was proposed by Manber and Myers at SODA'90. Given a string S of size n that ends with a unique, lexicographically smallest sentinel $, the suffix starting at S[i] is the substring S[i..n-1], for i ∈ [0, n-1]. The suffix array (SA) of S is the array of the starting indices of all suffixes, sorted in increasing (or decreasing) lexicographical order.

An example: S = mississippi$

Suffixes by starting index:

Index  Suffix
0      mississippi$
1      ississippi$
2      ssissippi$
3      sissippi$
4      issippi$
5      ssippi$
6      sippi$
7      ippi$
8      ppi$
9      pi$
10     i$
11     $

Suffixes in increasing lexicographical order (the Index column, read top to bottom, is the SA):

Index  Suffix
11     $
10     i$
7      ippi$
4      issippi$
1      ississippi$
0      mississippi$
9      pi$
8      ppi$
6      sippi$
3      sissippi$
5      ssippi$
2      ssissippi$
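The example above can be reproduced with a deliberately naive construction. The Python sketch below is only a checking aid for small inputs: it sorts the suffixes by direct comparison, which is not the linear-time method of this talk, and the function name `naive_suffix_array` is hypothetical.

```python
def naive_suffix_array(s):
    """Return the suffix array of s by sorting its suffixes directly.

    Assumes s ends with a sentinel ('$') that compares smaller than every
    other character, which holds for '$' versus lowercase ASCII letters.
    The cost is far from linear; this is for checking small examples only.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


print(naive_suffix_array("mississippi$"))
# [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```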

Applications: In general, the SA can serve as a space-efficient alternative to the suffix tree, for example for computing the Burrows-Wheeler Transform (BWT) in compression, and for building compact indexes for pattern alignment/matching in bioinformatics.
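As an illustration of the first application, the BWT can be read off the SA: the k-th BWT character is the character that precedes the k-th smallest suffix, wrapping around for the suffix at index 0. This is a minimal Python sketch; `bwt_from_sa` is a hypothetical name, and the SA is built naively here purely to keep the example self-contained.

```python
def bwt_from_sa(s, sa):
    """Return the BWT of s given its suffix array sa.

    s[i - 1] with i == 0 wraps to the last character (the sentinel),
    which is the cyclic predecessor of the whole string.
    """
    return "".join(s[i - 1] for i in sa)


s = "mississippi$"
sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive SA, for illustration only
print(bwt_from_sa(s, sa))  # ipssm$pissii
```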

Existing linear SA algorithms: The current practical linear SA algorithms by other authors are the KS (Karkkainen, Sanders and Burkhardt) and the KA (Ko and Aluru) algorithms. Both adopt the divide-and-conquer methodology. KA performs better, but KS is simpler and more elegant in design.

Motivation: to have a linear algorithm for SA construction that has (1) better time/space performance than the KA algorithm; (2) a simple design comparable to that of the KS algorithm; and (3) the capability to use external memory (e.g., hard disk) for computing huge SAs.

Our algorithm: A recursive divide-and-conquer procedure consisting of two linear-time components. Problem reduction: reduce the problem by sampling fixed-size d-critical substrings, at a reduction ratio of at most ½. Solution induction: induce the SA at each level from the SA of the lower (reduced) level. The total time is linear, O(n).
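The linear total follows from the reduction ratio. As a worked bound, assume the work at a level of size n is at most cn for some constant c (the symbols T and c are introduced here for illustration and do not appear in the slides):

```latex
T(n) \le T\!\left(\tfrac{n}{2}\right) + cn
\;\Longrightarrow\;
T(n) \le cn\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots\right) \le 2cn = O(n).
```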

Sorting in our algorithm: Sorting in the algorithm comprises bucket sorting for problem reduction and induced sorting for solution induction. Both run in linear time.

Problem reduction: (1) Traverse the string once to find all the fixed-size d-critical substrings, where d ≥ 2 and each substring has a length of d+2 characters. (2) Sort all the sampled d-critical substrings. Repeat (1) and (2) on the resulting reduced string until there is only one d-critical substring.

Solution induction: Traverse the array twice, in a total time of O(n). The first traversal induced-sorts all the type-L suffixes from the sorted LMS (leftmost S-type) suffixes; the second induced-sorts all the type-S suffixes from the sorted type-L suffixes.

S-type and L-type characters: S[i] is an S-type character if S[i..n-1] < S[i+1..n-1]; otherwise, S[i] is L-type. The sentinel $ is S-type by convention. S[i] is a leftmost S-type (LMS) character if S[i] is S-type and S[i-1] is L-type.

Example:

S: m i s s i s s i p p i $
t: L S L L S L L S L L L S
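The type row can be computed in one right-to-left pass, because S[i] is L-type exactly when S[i] > S[i+1], or when S[i] = S[i+1] and S[i+1] is L-type. A minimal Python sketch, with hypothetical function names `classify_types` and `lms_positions`:

```python
def classify_types(s):
    """Return a string of 'S'/'L' types for s, which must end with the sentinel."""
    n = len(s)
    t = ["S"] * n  # the sentinel at n-1 is S-type by convention
    for i in range(n - 2, -1, -1):
        if s[i] > s[i + 1] or (s[i] == s[i + 1] and t[i + 1] == "L"):
            t[i] = "L"
    return "".join(t)


def lms_positions(t):
    """Indices i where t[i] is S-type and t[i-1] is L-type."""
    return [i for i in range(1, len(t)) if t[i] == "S" and t[i - 1] == "L"]


t = classify_types("mississippi$")
print(t)                 # LSLLSLLSLLLS
print(lms_positions(t))  # [1, 4, 7, 11]
```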

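Assigning d-critical characters: All leftmost S-type (LMS) characters are d-critical characters. Between any two neighboring d-critical characters there is at least one and at most d characters. Where two neighboring LMS characters are further apart than that, additional d-critical characters are chosen between them so that the upper bound holds.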

An example for 2-critical substrings (DCS = d-critical substring):

S: m i s s i s s i p p i $
t: L S L L S L L S L L L S

Position  DCS
1         issi
4         issi
7         ippi
9         pi$$
11        $$$$
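The selection above can be reproduced with a single left-to-right scan. The slides state only the LMS rule and the spacing bound, so the rule for the extra d-critical characters in this Python sketch is an assumption chosen to be consistent with the example: S[i] is also d-critical if S[i-d] is d-critical, S[i+1] is not LMS, and nothing in S[i-d+1..i-1] is d-critical. The function names are hypothetical, and substrings running past the end are padded with $.

```python
def d_critical_positions(s, d=2):
    """Return positions of d-critical characters (assumed rule, see lead-in)."""
    n = len(s)
    t = ["S"] * n  # sentinel is S-type
    for i in range(n - 2, -1, -1):
        if s[i] > s[i + 1] or (s[i] == s[i + 1] and t[i + 1] == "L"):
            t[i] = "L"
    is_lms = [i > 0 and t[i] == "S" and t[i - 1] == "L" for i in range(n)]
    crit = [False] * n
    for i in range(n):
        if is_lms[i]:
            crit[i] = True  # rule 1: every LMS character is d-critical
        elif (i - d >= 0 and crit[i - d]
              and not (i + 1 < n and is_lms[i + 1])
              and not any(crit[i - d + 1:i])):
            crit[i] = True  # rule 2 (assumption): fill long gaps
    return [i for i in range(n) if crit[i]]


def d_critical_substrings(s, d=2):
    """Return the (d+2)-character substring at each d-critical position."""
    padded = s + "$" * (d + 1)  # pad so every substring has d+2 characters
    return [padded[i:i + d + 2] for i in d_critical_positions(s, d)]


print(d_critical_positions("mississippi$"))   # [1, 4, 7, 9, 11]
print(d_critical_substrings("mississippi$"))  # ['issi', 'issi', 'ippi', 'pi$$', '$$$$']
```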

Key ideas: There are at most 0.5n d-critical characters, and hence at most 0.5n d-critical substrings, because neighboring d-critical characters are at least two positions apart. If we can sort all the d-critical substrings, we can replace each one with its rank in the sorted order (naming), which produces a shorter string whose length is at most ½ of the original.

Key ideas (cont.): From the SA of the shortened string, we can compute the SA of the original string in O(n) time by induction.

Sorting d-critical substrings: Sorting all the d-critical substrings can be split into 3 tasks: (1) bucket sort the substrings according to the omega weights of their last characters; (2) from the result of (1), continue to bucket sort the substrings by their other characters, from the last to the first; (3) name each substring by its rank in the sorted order.

Sorting {issi, issi, ippi, pi$$, $$$$}: omega weight sorting, then character sorting, then naming.

Name  Substring
0     $$$$
1     ippi
2     issi
3     pi$$

Reduced string:

S:   m i s s i s s i p p i $
t:   L S L L S L L S L L L S
DCS: issi, issi, ippi, pi$$, $$$$
S1:  2 2 1 3 0
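The path from the d-critical substrings to S1 can be sketched as a least-significant-digit sort followed by naming. Several things in this Python sketch are assumptions: the slides do not define the omega weight, so it is taken here as twice the character code plus 1 for S-type (0 for L-type); padding characters past the end are treated as S-type $; and Python's stable `sorted` stands in for each linear-time bucket pass. The function name `name_substrings` is hypothetical.

```python
def name_substrings(s, t, positions, d=2):
    """Sort the (d+2)-character substrings at `positions` and return their names."""
    n = len(s)

    def char(i):
        return s[i] if i < n else "$"  # pad with the sentinel

    def omega(i):
        # Assumed weight: an L-type occurrence sorts before an S-type one.
        is_s = t[i] == "S" if i < n else True
        return 2 * ord(char(i)) + (1 if is_s else 0)

    # Task 1: stable pass on the omega weight of the last character.
    order = sorted(positions, key=lambda p: omega(p + d + 1))
    # Task 2: stable passes on the other characters, from the last to the first.
    for k in range(d, -1, -1):
        order = sorted(order, key=lambda p: char(p + k))

    # Task 3: naming. Equal substrings (same key) share a name.
    def key(p):
        return tuple(char(p + k) for k in range(d + 1)) + (omega(p + d + 1),)

    names, rank, prev = {}, -1, None
    for p in order:
        if key(p) != prev:
            rank += 1
            prev = key(p)
        names[p] = rank
    return [names[p] for p in positions]  # reduced string, in text order


s1 = name_substrings("mississippi$", "LSLLSLLSLLLS", [1, 4, 7, 9, 11])
print(s1)  # [2, 2, 1, 3, 0]
```

Because every substring has the same fixed length d+2, the number of passes is constant, which is what keeps the whole sort linear when each pass is a true bucket sort.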

Main results: Theorem 4. Given a string S over a constant or integer alphabet, the time complexity is O(n) and the space complexity is O(n log n) bits.

Performance evaluation

Time and space

Recursion depth and reduction ratio: smaller is better

Summary: The d-critical sorting algorithm was observed to achieve better time and space performance than the linear KA and KS algorithms for SA construction. The whole algorithm is coded in around 100-130 effective lines of C++. Sorting the fixed-size d-critical substrings allows the algorithm to use external memory.

Thank you!