Two equivalent problems

Presentation transcript:

Two equivalent problems
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
RMQ on integer arrays → RMQ on ±1 integer arrays:
  LCA on a Cartesian tree
  Euler tour of the Cartesian tree
  RMQ on the array of node levels
LCA on general trees → RMQ on ±1 integer arrays:
  Euler tour of the tree
  RMQ on the array of node levels

Search for k-mismatches
T = CCGTACGATCAGTACAGTACAGTACTTTTTTAAACCGGAGACTACA
P = CCGAACTATC
Problem: find the longest match between P[i,…] and T[j,…]. If this takes O(1) time, each alignment of P against T can be checked in O(k) time.
Data structure: concatenate P and T into a string X = T$P, and build on X a data structure that quickly retrieves the longest match between any pair of suffixes of X (an LCA or LCP query).
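The O(k) check per alignment can be sketched as follows. This is only an illustration: the function names are made up here, and the longest-match query is a naive character scan standing in for the O(1) LCA/LCP query the slide has in mind.

```python
def longest_match(x, a, b):
    """Length of the longest common prefix of x[a:] and x[b:].

    Naive scan, used here as a stand-in for the O(1) LCA/LCP query.
    """
    n = len(x)
    k = 0
    while a + k < n and b + k < n and x[a + k] == x[b + k]:
        k += 1
    return k


def k_mismatch_occurrences(text, pattern, k):
    """Return all start positions j (0-based) where pattern matches
    text[j:j+len(pattern)] with at most k mismatching characters."""
    x = text + "$" + pattern          # X = T$P, as on the slide
    p0 = len(text) + 1                # where P starts inside X
    m = len(pattern)
    hits = []
    for j in range(len(text) - m + 1):
        i = 0                         # characters of P consumed so far
        mismatches = 0
        while True:
            # Jump over the longest matching stretch in one query.
            i += longest_match(x, j + i, p0 + i)
            if i >= m:
                break                 # reached the end of P
            mismatches += 1           # position i is a mismatch
            if mismatches > k:
                break
            i += 1                    # skip the mismatching character
        if mismatches <= k:
            hits.append(j)
    return hits
```

Each alignment issues at most k + 1 longest-match queries, which is where the O(k) bound comes from once each query costs O(1).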

Suffix Tree & LCA
[Figure: suffix tree of T#P = mississippi#si$ with leaves numbered by suffix position. Longest match(3,13) is read off at the LCA of leaves 3 and 13.]

Suffix Array & LCP → RMQ
[Figure: suffix array SA = 12 11 14 8 5 2 1 10 9 13 7 4 6 3 and its Lcp array for T#P = mississippi#si; the suffixes si, sippi#, sissippi#, ssippi#, ssissippi# are shown with their Lcp values.]
Longest match(3,13) is the minimum of the Lcp entries between the positions of suffixes 13 and 3 in SA, that is, an RMQ on Lcp.
Surprisingly, also LCA → RMQ.
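A minimal sketch of the longest-match query on a suffix array, assuming a naive construction (sorting suffixes, scanning for common prefixes) and a plain `min` in place of a constant-time RMQ. Positions are 0-based here, and all function names are illustrative.

```python
def suffix_array(s):
    """Starting positions of the suffixes of s in lexicographic order (naive)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def common_prefix(s, a, b):
    """Length of the longest common prefix of s[a:] and s[b:]."""
    k = 0
    while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
        k += 1
    return k


def lcp_array(s, sa):
    """lcp[r] = common prefix length of the suffixes at ranks r-1 and r."""
    return [0] + [common_prefix(s, sa[r - 1], sa[r]) for r in range(1, len(sa))]


def longest_match(s, sa, lcp, rank, a, b):
    """Longest common prefix of s[a:] and s[b:] as a range minimum on lcp."""
    if a == b:
        return len(s) - a
    lo, hi = sorted((rank[a], rank[b]))
    return min(lcp[lo + 1:hi + 1])    # a real index would answer this by RMQ


s = "mississippi#si$"
sa = suffix_array(s)
lcp = lcp_array(s, sa)
rank = {pos: r for r, pos in enumerate(sa)}
```

With the string above, the slide's query (3,13) becomes `longest_match(s, sa, lcp, rank, 2, 12)` in 0-based positions.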

The RMQ problem
RMQ_A(i,j) returns the index of the smallest element in the subarray A[i..j].
A = [10, 2, 12, 1, 12, 13, 21, 15, 14, 10] (indices 0 to 9)
RMQ(2,7) = 3
Trivial solution: precompute RMQ for every pair of indices. This takes Θ(n^2) space and O(1) query time.
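The trivial solution can be sketched as a full table filled row by row. The function name is illustrative; entries with j < i are unused.

```python
def build_all_pairs(a):
    """table[i][j] = index of the smallest element of a[i..j], for i <= j."""
    n = len(a)
    table = [[0] * n for _ in range(n)]
    for i in range(n):
        best = i
        for j in range(i, n):
            if a[j] < a[best]:        # strict: keep the leftmost minimum
                best = j
            table[i][j] = best
    return table
```

After the quadratic preprocessing, a query is the single lookup `table[i][j]`.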

RMQ on a general array
A = [10, 25, 22, 34, 7, 19, 9, 12, 26, 16] (indices 0 to 9)
[Figure: Cartesian tree of A, with nodes labeled by array index and root 4 (the position of the minimum).]
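One common way to build a Cartesian tree is a single left-to-right pass with a stack. The sketch below assumes that approach (the slides do not say how the tree is built) and returns a parent array, with -1 marking the root.

```python
def cartesian_tree(a):
    """Parent array of the min-rooted Cartesian tree of a (root has parent -1)."""
    parent = [-1] * len(a)
    stack = []                        # indices on the rightmost path, root first
    for i in range(len(a)):
        last = -1
        # Pop every node larger than a[i]; the last one popped becomes
        # the left child of i.
        while stack and a[stack[-1]] > a[i]:
            last = stack.pop()
        if last != -1:
            parent[last] = i
        if stack:
            parent[i] = stack[-1]     # i is the right child of the new stack top
        stack.append(i)
    return parent
```

Every index is pushed and popped at most once, so the construction takes O(n) time.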

RMQ(i,j) = LCA(i,j) on Cartesian trees
A = [10, 25, 22, 34, 7, 19, 9, 12, 26, 16] (indices 0 to 9)
[Figure: Cartesian tree of A, with nodes labeled by array index.]
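The identity can be checked on the slide's array. In this sketch the parent array of the Cartesian tree is written out by hand (an assumption worth re-deriving if the array changes), and the LCA is found naively by walking up to the root.

```python
A = [10, 25, 22, 34, 7, 19, 9, 12, 26, 16]
# Parent of each index in the Cartesian tree of A; -1 marks the root (index 4).
PARENT = [4, 2, 0, 2, -1, 6, 4, 6, 9, 7]


def ancestors(parent, u):
    """Nodes on the path from u up to the root, u included."""
    path = []
    while u != -1:
        path.append(u)
        u = parent[u]
    return path


def lca(parent, u, v):
    """Lowest common ancestor by intersecting the two root paths (naive)."""
    seen = set(ancestors(parent, u))
    for w in ancestors(parent, v):
        if w in seen:
            return w


def rmq_brute(a, i, j):
    """Index of the smallest element of a[i..j] by direct scan."""
    return min(range(i, j + 1), key=a.__getitem__)
```

For example, `lca(PARENT, 5, 9)` and `rmq_brute(A, 5, 9)` both point at index 6, where A holds 9.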

Generic RMQ → LCA → RMQ ±1
LCA(u,v) = the shallowest node (the node at the lowest level) visited between u and v during a depth-first traversal (Euler tour) of T.
Consecutive entries of the level array differ by exactly 1, so it is a ±1 array.
[Figure: a tree T with its Euler tour, listed as a Node array and a Level array.]
LCA_T(4,6) = 3
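A small sketch of the reduction, using a hypothetical tree (not the one in the slide's figure) given as a dictionary from node to list of children. The range minimum over levels is computed by a plain scan here.

```python
def euler_tour(children, root):
    """Return (nodes, levels, first): the Euler tour, the level of each
    visited node, and the first tour position of every node."""
    nodes, levels, first = [], [], {}

    def visit(u, depth):
        first.setdefault(u, len(nodes))
        nodes.append(u)
        levels.append(depth)
        for c in children.get(u, []):
            visit(c, depth + 1)
            nodes.append(u)           # come back to u after each child
            levels.append(depth)

    visit(root, 0)
    return nodes, levels, first


def lca(nodes, levels, first, u, v):
    """Shallowest node between the first visits of u and v."""
    lo, hi = sorted((first[u], first[v]))
    k = min(range(lo, hi + 1), key=levels.__getitem__)   # RMQ on levels
    return nodes[k]


# Hypothetical tree: 1 is the root, 3 has children 4 and 5, 5 has child 6.
children = {1: [2, 3], 3: [4, 5], 5: [6]}
nodes, levels, first = euler_tour(children, 1)
```

On this tree `lca(nodes, levels, first, 4, 6)` is node 3, and adjacent entries of `levels` always differ by exactly 1.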

We are left with "RMQ on ±1 array"
RMQ_A(i,j) returns the index of the smallest element in the subarray A[i..j].
A = [10, 11, 12, 11, 12, 13, 14, 15, 14, 13] (indices 0 to 9)
RMQ(2,7) = 3
Recall the trivial solution: precompute RMQ for every pair of indices. This takes Θ(n^2) space and O(1) query time.

Sparse Table
Preprocess subarrays of length 2^k, for every k = 0, 1, …, log n.
M(i,j) = index of the minimum value in A[i, i + 2^j - 1]
A = [10, 11, 12, 11, 12, 13, 14, 15, 14, 13] (indices 0 to 9)
M(2,0) = 2, M(2,1) = 3, M(2,2) = 3, M(2,3) = 3
Total space is O(n log n).
RMQ query?

Querying the Sparse Table
[Figure: the range A[i..j] covered by two overlapping subarrays of 2^k elements, one starting at i and one ending at j.]
With k = floor(log2(j - i + 1)), RMQ(i,j) is the better of M(i,k) and M(j - 2^k + 1, k).
Total space is O(n log n); an RMQ query takes O(1) time.
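A minimal sketch of building and querying the sparse table. The table is stored as `table[k][i]`, which corresponds to M(i,k) in the slides' notation; the function names are illustrative.

```python
def build_sparse_table(a):
    """table[k][i] = index of the minimum of a[i .. i + 2**k - 1]."""
    n = len(a)
    table = [list(range(n))]          # k = 0: every element is its own minimum
    k = 1
    while (1 << k) <= n:
        prev = table[-1]
        half = 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        k += 1
    return table


def rmq(a, table, i, j):
    """Index of the minimum of a[i..j], from two overlapping power-of-two ranges."""
    k = (j - i + 1).bit_length() - 1  # floor(log2(j - i + 1))
    left = table[k][i]
    right = table[k][j - (1 << k) + 1]
    return left if a[left] <= a[right] else right
```

The two ranges may overlap, which is harmless for a minimum and is what makes the query constant time.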

Bucketing
Split A into 2n/log n blocks of (log n)/2 elements each.
A'[i] = minimum of the i-th block of A.
B[i] = position (index) of that minimum.
[Figure: A divided into blocks, with arrays A'[0..2n/log n] and B[0..2n/log n] below it.]

Use the Bucketing
Preprocess A' for RMQ using the Sparse Table: space is (2n/log n) * log(2n/log n) = O(n), and RMQ queries on A' take O(1) time.
Preprocess every block of A for border RMQ (queries that start or end at a block boundary): space is O(n), and a border RMQ takes O(1) time.
RMQ(i,j) takes O(1) time if i and j lie in distinct blocks.
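The distinct-blocks case can be sketched as below. The class and attribute names are made up for illustration, and the block size is a free parameter here, whereas the slides fix it to (log n)/2. Border RMQs are stored as per-block prefix and suffix minima, and the sparse table over blocks stores positions in A directly, so it plays the role of A' and B together.

```python
def pick(a, p, q):
    """Position holding the smaller value; p wins ties (p is assumed leftmost)."""
    return p if a[p] <= a[q] else q


class BlockRMQ:
    def __init__(self, a, block):
        self.a, self.block = a, block
        n = len(a)
        # prefix[p]: min position from the start of p's block up to p.
        # suffix[p]: min position from p to the end of p's block.
        self.prefix = list(range(n))
        self.suffix = list(range(n))
        for p in range(1, n):
            if p % block:
                self.prefix[p] = pick(a, self.prefix[p - 1], p)
        for p in range(n - 2, -1, -1):
            if (p + 1) % block:
                self.suffix[p] = pick(a, p, self.suffix[p + 1])
        # Sparse table over blocks; level 0 holds each block's min position.
        level = [self.suffix[s] for s in range(0, n, block)]
        self.table = [level]
        half = 1
        while 2 * half <= len(level):
            prev = self.table[-1]
            self.table.append([pick(a, prev[b], prev[b + half])
                               for b in range(len(prev) - half)])
            half *= 2

    def query(self, i, j):
        """Index of the minimum of a[i..j]; i and j must lie in distinct blocks."""
        bi, bj = i // self.block, j // self.block
        assert bi < bj, "in-block queries need a separate structure"
        best = self.suffix[i]                     # border RMQ in i's block
        if bj - bi > 1:                           # whole blocks in between
            lo, hi = bi + 1, bj - 1
            k = (hi - lo + 1).bit_length() - 1
            mid = pick(self.a, self.table[k][lo],
                       self.table[k][hi - (1 << k) + 1])
            best = pick(self.a, best, mid)
        return pick(self.a, best, self.prefix[j])  # border RMQ in j's block
```

A query combines at most three candidates: the suffix minimum of the first block, the sparse-table answer over the blocks strictly in between, and the prefix minimum of the last block.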

In-block RMQ over ±1 arrays
There are O(√n) normalized blocks, one per sequence of ±1 differences in a block of (log n)/2 elements.
Set Table[BlockEnc, i, j] = RMQ(i,j)
[Figure: two blocks X and Y with the same sequence of +1/-1 differences (ΔX = ΔY), which therefore share the same RMQ answers.]
Table entries: O(√n log^2 n)
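A sketch of the lookup table for in-block queries. The bit encoding chosen here (bit t set when step t is +1) is one possible BlockEnc, not necessarily the one used in the slides, and the example blocks X and Y are made up for illustration.

```python
def encode(block):
    """Pack the ±1 steps of a block into an integer:
    bit t is 1 iff block[t + 1] - block[t] == +1."""
    code = 0
    for t in range(len(block) - 1):
        if block[t + 1] - block[t] == 1:
            code |= 1 << t
    return code


def build_in_block_table(size):
    """table[code][i][j] = offset of the minimum of positions i..j in any
    block of the given size whose steps are encoded by code."""
    table = []
    for code in range(1 << (size - 1)):
        # Rebuild the normalized block (starting at 0) from its encoding.
        norm = [0]
        for t in range(size - 1):
            norm.append(norm[-1] + (1 if (code >> t) & 1 else -1))
        answers = [[0] * size for _ in range(size)]
        for i in range(size):
            best = i
            for j in range(i, size):
                if norm[j] < norm[best]:
                    best = j
                answers[i][j] = best
        table.append(answers)
    return table


# Two blocks with different values but the same sequence of steps.
X = [3, 4, 5, 6, 5, 4]
Y = [13, 14, 15, 16, 15, 14]
TABLE = build_in_block_table(len(X))
```

Because X and Y have the same encoding, they share one table row: `TABLE[encode(X)]` answers in-block queries for both.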

LZ-parsing (gzip)
T = mississippi#
Parsing: <m><i><s><si><ssip><pi>
[Figure: suffix tree of T with leaves numbered by suffix position.]

LZ-parsing (gzip)
<ssip>: the longest repeated prefix of T[6,...] whose repeat is on the left of 6.
It is on the path to leaf 6. By maximality, check only nodes.
Leftmost occ = 3 < 6
[Figure: suffix tree of T = mississippi#, highlighting the path to leaf 6 and the nodes whose leftmost occurrence is 3.]

LZ-parsing (gzip)
min-leaf → leftmost copy
Precompute the min descending leaf at every node in O(n) time.
Parsing: scan T; visit ST and stop when min-leaf ≥ current pos.
T = mississippi#, parsed as <m><i><s><si><ssip><pi>
[Figure: suffix tree of T annotated with the min-leaf of each node.]
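To make the phrase boundaries concrete, here is a naive quadratic sketch of the parsing. It replaces the suffix-tree descent with min-leaf by a direct scan over all earlier start positions, so it illustrates what is computed rather than how the slides compute it. Each phrase is the longest prefix that also starts at an earlier position, extended by one fresh character.

```python
def lz_parse(t):
    """Split t into phrases: longest earlier-occurring prefix plus one character."""
    phrases = []
    pos = 0
    while pos < len(t):
        longest = 0
        for start in range(pos):      # every candidate copy to the left of pos
            length = 0
            while pos + length < len(t) and t[start + length] == t[pos + length]:
                length += 1
            longest = max(longest, length)
        end = min(pos + longest + 1, len(t))   # copy, then one new character
        phrases.append(t[pos:end])
        pos = end
    return phrases
```

On `"mississippi#"` this yields the phrases shown on the slide, followed by a final phrase for the terminator `#`.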