Space-time Tradeoffs for Longest-Common-Prefix Array Construction Simon J. Puglisi and Andrew Turpin

Presentation transcript:


Outline
Refresher on suffix array (SA) and longest-common-prefix array (LCP) basics
–Why are they useful?
Existing algorithms for LCP construction
–Why do we need new ones?
Two algorithms for LCP construction
–Empirical comparison to earlier algorithms
Concise representations of LCP array

The Ubiquitous Suffix Array
i   suffix
0   toy_boat_toy_boat_toy_boat$
1   oy_boat_toy_boat_toy_boat$
2   y_boat_toy_boat_toy_boat$
3   _boat_toy_boat_toy_boat$
4   boat_toy_boat_toy_boat$
5   oat_toy_boat_toy_boat$
6   at_toy_boat_toy_boat$
7   t_toy_boat_toy_boat$
8   _toy_boat_toy_boat$
9   toy_boat_toy_boat$
10  oy_boat_toy_boat$
11  y_boat_toy_boat$
12  _boat_toy_boat$
13  boat_toy_boat$
14  oat_toy_boat$
15  at_toy_boat$
16  t_toy_boat$
17  _toy_boat$
18  toy_boat$
19  oy_boat$
20  y_boat$
21  _boat$
22  boat$
23  oat$
24  at$
25  t$
26  $
Suffix Sort
SA  suffix
26  $
24  at$
15  at_toy_boat$
6   at_toy_boat_toy_boat$
22  boat$
13  boat_toy_boat$
4   boat_toy_boat_toy_boat$
23  oat$
14  oat_toy_boat$
5   oat_toy_boat_toy_boat$
19  oy_boat$
10  oy_boat_toy_boat$
1   oy_boat_toy_boat_toy_boat$
25  t$
18  toy_boat$
9   toy_boat_toy_boat$
0   toy_boat_toy_boat_toy_boat$
16  t_toy_boat$
7   t_toy_boat_toy_boat$
20  y_boat$
11  y_boat_toy_boat$
2   y_boat_toy_boat_toy_boat$
21  _boat$
12  _boat_toy_boat$
3   _boat_toy_boat_toy_boat$
17  _toy_boat$
8   _toy_boat_toy_boat$

The Longest-Common-Prefix (LCP) Array
LCP[i] = the length of the longest common prefix of suffixes SA[i] and SA[i-1]
= |lcp(SA[i-1],SA[i])|, i > 0
LCP  SA  suffix
-    26  $
0    24  at$
2    15  at_toy_boat$
11   6   at_toy_boat_toy_boat$
0    22  boat$
4    13  boat_toy_boat$
13   4   boat_toy_boat_toy_boat$
0    23  oat$
3    14  oat_toy_boat$
12   5   oat_toy_boat_toy_boat$
1    19  oy_boat$
7    10  oy_boat_toy_boat$
16   1   oy_boat_toy_boat_toy_boat$
0    25  t$
1    18  toy_boat$
8    9   toy_boat_toy_boat$
17   0   toy_boat_toy_boat_toy_boat$
1    16  t_toy_boat$
10   7   t_toy_boat_toy_boat$
0    20  y_boat$
6    11  y_boat_toy_boat$
15   2   y_boat_toy_boat_toy_boat$
0    21  _boat$
5    12  _boat_toy_boat$
14   3   _boat_toy_boat_toy_boat$
1    17  _toy_boat$
9    8   _toy_boat_toy_boat$
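The two definitions above can be checked directly on the example string. The following is a minimal sketch, not the construction method discussed in the talk: it sorts suffixes naively and applies the LCP definition character by character. The alphabet order ('$' smallest, '_' largest) is an assumption inferred from the table, and the names `ALPHABET`, `suffix_array`, `lcp` and `lcp_array` are made up for illustration.

```python
# Assumed alphabet order, inferred from the slide's table: '$' < 'a'..'z' < '_'.
ALPHABET = "$abcdefghijklmnopqrstuvwxyz_"
RANK = {c: r for r, c in enumerate(ALPHABET)}


def suffix_array(text):
    """Naive suffix sort; quadratic or worse, fine for a slide-sized example."""
    return sorted(range(len(text)), key=lambda i: [RANK[c] for c in text[i:]])


def lcp(text, i, j):
    """Length of the longest common prefix of the suffixes starting at i and j."""
    k = 0
    while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k


def lcp_array(text, sa):
    """LCP[i] = |lcp(SA[i-1], SA[i])| for i > 0; LCP[0] is undefined (None)."""
    return [None] + [lcp(text, sa[i - 1], sa[i]) for i in range(1, len(sa))]


text = "toy_boat_toy_boat_toy_boat$"
sa = suffix_array(text)
lcps = lcp_array(text, sa)
for value, start in zip(lcps, sa):
    print("-" if value is None else value, start, text[start:])
```

Running it prints one row per suffix in the same LCP / SA / suffix layout as the table.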

Why the Longest-Common-Prefix array?
(SA,LCP,x) == suffix tree
–Any bottom-up and top-down traversal (Abouelhoda et al., JDA 2004)
–Same asymptotic time bounds, just smaller and faster in practice
–E.g., LZ77 factorization (Chen et al., CPM 2007)
Important for disk resident suffix trees
–LOFSA (Sinha et al., SIGMOD 2008)

Previous work
Brute force:
–for each i ∈ 1..n-1 work out LCP[i] by comparing t[SA[i-1]..n] to t[SA[i]..n] until we get a mismatch
–Expensive if string has regularities, O(n^2) in the worst case
Θ(n) time (Kasai et al., CPM 1999)
–13n bytes of space
–x[1..n], SA[1..n], ISA[1..n], LCP[1..n]
Θ(n) time (Manzini, SWAT 2004)
–Two refinements of Kasai et al.’s algorithm
–9n bytes
–6n + 4H_k n bytes (space usage decreases with text entropy)
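As a rough illustration of the first two entries, here is a sketch of the brute-force method next to a linear-time pass in the style of Kasai et al. The second function follows the commonly taught formulation (walk the suffixes in text order and carry over the previous match length minus one), which may differ in detail from the published version. The four arrays it touches (text, SA, ISA, LCP) match the 13n-byte figure if one assumes 1-byte characters and 4-byte integers. The function names and the small test string are illustrative only.

```python
def brute_force_lcp(text, sa):
    """Compare each pair of adjacent suffixes from scratch; O(n^2) worst case."""
    n = len(text)
    lcp = [0] * n                      # lcp[0] is unused
    for r in range(1, n):
        i, j, h = sa[r], sa[r - 1], 0
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[r] = h
    return lcp


def kasai_style_lcp(text, sa):
    """Linear-time LCP construction using the inverse suffix array (ISA)."""
    n = len(text)
    isa = [0] * n                      # isa[sa[r]] = r
    for r, start in enumerate(sa):
        isa[start] = r
    lcp = [0] * n                      # lcp[0] is unused
    h = 0
    for i in range(n):                 # visit suffixes in text order
        r = isa[i]
        if r == 0:
            h = 0
            continue
        j = sa[r - 1]
        # Characters 0..h-1 are already known to match, so resume at offset h.
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1                     # suffix i+1 shares at least h-1 characters
    return lcp


text = "banana$"
sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive suffix sort
print(sa)
print(brute_force_lcp(text, sa))
print(kasai_style_lcp(text, sa))
```

Both functions return the same array; only the amount of repeated character comparison differs.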

The need for new LCP construction algorithms
Prior algorithms use lots of memory
–Try to compute LCP[] for the Human Genome
–DNA has high entropy, so 9n byte alg is best
–27 GB of RAM
Poor locality of memory reference
–Using secondary memory for large inputs implausible
–Even in RAM the algorithms are (relatively) slow
–E.g., slower than the fastest SA construction algorithms

New Alg: choose a (special) sample of suffixes
Choose a sample of the SA and compute lcp’s
L    S    suffix
-    15   at_toy_boat$
0    22   boat$
4    4    boat_toy_boat_toy_boat$
0    23   oat$
1    1    oy_boat_toy_boat_toy_boat$
0    25   t$
1    18   toy_boat$
8    9    toy_boat_toy_boat$
1    16   t_toy_boat$
0    11   y_boat_toy_boat$
15   2    y_boat_toy_boat_toy_boat$
0    8    _toy_boat_toy_boat$
Preprocess L for O(1) time Range Minimum Queries (RMQ)
–requires 2n + o(n) bits, after O(n) time preprocessing (Fischer 2008)
Lcp for two non-adjacent suffixes is the minimum value in L[] between them.
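The last point can be sketched with any range-minimum structure. The block below uses a plain sparse table, which answers queries in O(1) but needs O(n log n) words; it is a simpler stand-in, not the 2n + o(n)-bit structure cited on the slide. The L values are copied from the sample table, with 0 as a placeholder for the undefined first entry; `SparseTableRMQ` and `sample_lcp_between` are hypothetical names.

```python
class SparseTableRMQ:
    """Range-minimum queries in O(1) after O(n log n) preprocessing."""

    def __init__(self, values):
        self.table = [list(values)]          # table[k][i] = min(values[i : i + 2**k])
        span = 1
        while 2 * span <= len(values):
            prev = self.table[-1]
            self.table.append([min(prev[i], prev[i + span])
                               for i in range(len(values) - 2 * span + 1)])
            span *= 2

    def query(self, lo, hi):
        """Minimum of values[lo..hi], both ends inclusive."""
        level = (hi - lo + 1).bit_length() - 1
        row = self.table[level]
        return min(row[lo], row[hi - (1 << level) + 1])


# L from the sample table; L[0] is undefined there, 0 is only a placeholder.
L = [0, 0, 4, 0, 1, 0, 1, 8, 1, 0, 15, 0]
rmq = SparseTableRMQ(L)


def sample_lcp_between(a, b):
    """lcp of the sample suffixes at positions a != b of S: min of L over (lo, hi]."""
    lo, hi = sorted((a, b))
    return rmq.query(lo + 1, hi)


# Rows 6 and 8 of the sample are toy_boat$ and t_toy_boat$.
print(sample_lcp_between(6, 8))
```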

The sample is defined by a difference cover
A difference cover D_v, modulo v, is a set of integers in the range [0..v) such that for all i ∈ [0..v), there exist j, k ∈ D_v such that i = k-j (mod v).
–A tool for linear time suffix sorting (Kärkkäinen et al., JACM 2006)
|D_v| = O(√v)
δ function defined on D_v:
–δ(i,j) = k ∈ [0..v) such that (i+k) mod v and (j+k) mod v are both in D_v, for any i,j
–δ computed in O(1) time and requires O(v) space
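A small sketch may make the definition and the δ function concrete. It checks the cover property by enumerating all differences, and builds δ from a lookup table of size v, which is one common way to get O(1) evaluation with O(v) space (an assumption about the implementation, not something the slide spells out). The cover {1,2,4} modulo 7 is used as the example; `is_difference_cover`, `build_delta` and `shift` are illustrative names.

```python
def is_difference_cover(cover, v):
    """True if every residue in [0, v) is a difference k - j (mod v) of cover elements."""
    diffs = {(k - j) % v for j in cover for k in cover}
    return diffs == set(range(v))


def build_delta(cover, v):
    """Return delta(i, j) -> k in [0, v) with (i+k) % v and (j+k) % v both in the cover."""
    members = set(cover)
    # shift[d] is some a in the cover such that (a + d) % v is also in the cover.
    shift = [next(a for a in sorted(members) if (a + d) % v in members)
             for d in range(v)]

    def delta(i, j):
        # Choose k so that i + k lands on shift[d] (mod v), where d = (j - i) % v;
        # then j + k lands on shift[d] + d (mod v), which is in the cover as well.
        return (shift[(j - i) % v] - i) % v

    return delta


cover, v = {1, 2, 4}, 7
print(is_difference_cover(cover, v))
delta = build_delta(cover, v)
k = delta(6, 15)
print(k, (6 + k) % v, (15 + k) % v)
```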

New Alg: choose a (special) sample of suffixes
In this example, suffixes i such that i mod 7 ∈ D_7 = {1,2,4} have been chosen
–Suffixes 1,2,4, 8,9,11, 15,16,18,…
S has O(n/√v) elements (because |D_v| is O(√v))

Using L to compute values in LCP
lcp(i,j) = l’ + δ(i,j), where l’ = lcp(i+δ(i,j), j+δ(i,j)) = RMQ_L(...)
[Figure: suffixes i and j, their shifted sample suffixes i+δ(i,j) and j+δ(i,j), and the positions rank(i+δ(i,j)) and rank(j+δ(i,j)) in S and L.]

Computing L
To compute L efficiently we exploit the following simple observation:
If lcp for SA[i] is l, then lcp for SA[i]+v ≥ l-v
–The lcp for a given suffix provides a lower bound on the lcp of suffixes which follow it in the string.
Overall O(n√v) time and O(n/√v) space to compute L
[Figure: SA and LCP columns; suffix j has lcp l and suffix j+v has lcp ≥ l-v.]
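One way to read this observation is as the Kasai-style trick applied with stride v: within each residue class of the cover, walk the sample suffixes in text order and resume each comparison at offset l-v instead of 0. The sketch below follows that reading; it is an interpretation, not the authors' code. The sample is sorted naively here, the alphabet order is the one assumed from the tables, and all names are hypothetical.

```python
ALPHABET = "$abcdefghijklmnopqrstuvwxyz_"     # assumed order, as in the slide's tables
RANK = {c: r for r, c in enumerate(ALPHABET)}


def sorted_sample(text, cover, v):
    """Sample suffixes (start mod v in the cover) in lexicographic order, sorted naively."""
    starts = [i for i in range(len(text)) if i % v in cover]
    return sorted(starts, key=lambda i: [RANK[c] for c in text[i:]])


def sample_lcp(text, sample, v):
    """L[r] = lcp of sample[r-1] and sample[r]; L[0] is unused and left at 0."""
    n = len(text)
    rank = {start: r for r, start in enumerate(sample)}
    L = [0] * len(sample)
    for first in sorted(s for s in sample if s < v):   # one pass per residue class
        l = 0
        for p in range(first, n, v):                   # suffixes p, p+v, p+2v, ...
            r = rank[p]
            if r == 0:
                l = 0
                continue
            q = sample[r - 1]
            l = max(l - v, 0)                          # lower bound from suffix p - v
            while p + l < n and q + l < n and text[p + l] == text[q + l]:
                l += 1
            L[r] = l
    return L


text = "toy_boat_toy_boat_toy_boat$"
S = sorted_sample(text, {1, 2, 4}, 7)
L = sample_lcp(text, S, 7)
for value, start in zip(L, S):
    print(value, start, text[start:])
```

Each residue class costs O(n) character comparisons in total, because l drops by at most v per step, which is consistent with the O(n√v) bound when the cover has O(√v) elements.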

Combining things…
Now computing any LCP[k] requires at most v comparisons and an RMQ on L
To compute LCP over top of SA:
–for i = 1 to n do
    if lcp(SA[i],SA[i-1]) < v then
        LCP[i] = lcp(SA[i],SA[i-1])
    else
        LCP[i] = δ(SA[i],SA[i-1]) + RMQ_L(…)
Total time O(nv); extra space O(n/√v)
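Putting the pieces together on the running example gives the following end-to-end sketch. To stay short it cuts corners that the talk does not: suffixes and the sample are sorted naively, L is filled by direct comparison, and the range minimum is taken with `min` over a slice rather than an O(1) RMQ structure. The alphabet order and every function name are assumptions made for illustration.

```python
ALPHABET = "$abcdefghijklmnopqrstuvwxyz_"     # assumed order, as in the slide's tables
RANK = {c: r for r, c in enumerate(ALPHABET)}


def sort_suffixes(text, positions):
    """Naive lexicographic sort of the suffixes starting at the given positions."""
    return sorted(positions, key=lambda i: [RANK[c] for c in text[i:]])


def common_prefix(text, i, j, limit):
    """Compare suffixes i and j, stopping after `limit` matching characters."""
    n, k = len(text), 0
    while k < limit and i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def lcp_from_sample(text, sa, cover, v):
    n = len(text)
    # Sample S and its lcp array L (computed directly here for brevity).
    S = sort_suffixes(text, [i for i in range(n) if i % v in cover])
    rank_s = {start: r for r, start in enumerate(S)}
    L = [0] + [common_prefix(text, S[r - 1], S[r], n) for r in range(1, len(S))]
    # shift[d]: an element a of the cover with (a + d) % v also in the cover.
    shift = [next(a for a in sorted(cover) if (a + d) % v in cover)
             for d in range(v)]
    lcp = [0] * n                                  # lcp[0] is unused
    for r in range(1, n):
        a, b = sa[r], sa[r - 1]
        m = common_prefix(text, a, b, v)           # at most v comparisons
        if m < v:
            lcp[r] = m
        else:
            k = (shift[(b - a) % v] - a) % v       # delta(a, b)
            lo, hi = rank_s[b + k], rank_s[a + k]  # b+k sorts before a+k
            lcp[r] = k + min(L[lo + 1:hi + 1])     # stand-in for an O(1) RMQ on L
    return lcp


text = "toy_boat_toy_boat_toy_boat$"
sa = sort_suffixes(text, range(len(text)))
lcp = lcp_from_sample(text, sa, cover={1, 2, 4}, v=7)
print(lcp[1:])
```

The branch on `m < v` mirrors the pseudocode: short matches are resolved by direct comparison, and long ones are shifted by δ onto two sample suffixes whose lcp is read off L.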

[Figure: Running Time & Memory Required for 200Mb DNA. Axes: Memory (bytes per input character) against Time (sec). Series: 6n, 9n, 13n, Ours (in memory), Ours (on disk).]

[Figure: Running Time & Memory Required for 200Mb English. Axes: Memory (bytes per input character) against Time (sec). Series: 6n, 9n, 13n, Ours (in memory), Ours (on disk).]

An even better algorithm…
In fact it’s possible to use even less space (and do away with the difference cover as well!)
Requires O(vn) time and O(n/v) space
–(Juha Kärkkäinen, last Friday)

Conclusions
O(nv) time, O(n/√v) space algorithm (using DC)
–O(nv) time, O(n/v) space (by rejigging things a bit)
By varying v we have a controlled tradeoff between memory and time
Algorithms are fast and use low memory
Runtime is not greatly affected if the output (and most of the input) resides on disk

Representing LCP in small space
The 2nd algorithm implies a concise representation of the LCP array
–(n log n)/v bits to store sample suffixes
–n log v bits to store “extra part”
Choosing v = log n → n + n log log n bits
Sadakane 2001: 6n + o(n) bits
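The total for v = log n follows by substituting into the two terms above; written out as a short check (logarithms base 2 are assumed, and lower-order terms are ignored):

```latex
\[
\underbrace{\frac{n \log n}{v}}_{\text{sample suffixes}}
+ \underbrace{n \log v}_{\text{extra part}}
\;\xrightarrow{\;v \,=\, \log n\;}\;
\frac{n \log n}{\log n} + n \log\log n
\;=\; n + n \log\log n \ \text{bits}.
\]
```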

Future Work
Can we eliminate the random access to the text so that the algorithm scales unboundedly?
Is there a way to exploit the self-similarity present in the SA (and hence LCP) to further reduce constant factors in the runtime?
What is the concise representation like in practice? Can it be made smaller?