Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.

Slides:



Advertisements
Similar presentations
Paolo Ferragina, Università di Pisa XML Compression and Indexing Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio,
Advertisements

Paolo Ferragina, Università di Pisa On Compression and Indexing: two sides of the same coin Paolo Ferragina Dipartimento di Informatica, Università di.
Algorithms and data structures for big data, whats next? Paolo Ferragina University of Pisa.
Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts Sunho Lee and Kunsoo Park Seoul National Univ.
Algoritmi per IR Dictionary-based compressors. Lempel-Ziv Algorithms Keep a “dictionary” of recently-seen strings. The differences are: How the dictionary.
Lecture #1 From 0-th order entropy compression To k-th order entropy compression.
Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
1 Suffix Arrays: A new method for on-line string searches Udi Manber Gene Myers May 1989 Presented by: Oren Weimann.
Paolo Ferragina, Università di Pisa Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica, Università di Pisa.
Paolo Ferragina, Università di Pisa Compressing and Indexing Strings and (labeled) Trees Paolo Ferragina Dipartimento di Informatica, Università di Pisa.
Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.
Interplay between Stringology and Data Structure Design Roberto Grossi.
What about the trees of the Mississippi? Suffix Trees explained in an algorithm for indexing large biological sequences Jacob Kleerekoper & Marjolijn Elsinga.
1 Suffix tree and suffix array techniques for pattern analysis in strings Esko Ukkonen Univ Helsinki Erice School 30 Oct 2005 Modified Alon Itai 2006.
Suffix Trees and Suffix Arrays
Compressed Compact Suffix Arrays Veli Mäkinen University of Helsinki Gonzalo Navarro University of Chile compact compress.
Suffix Tree and Suffix Array R Brain Chen R Pluto Chang.
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
1 Suffix Trees © Jeff Parker, Outline An introduction to the Suffix Tree Some sample applications How to build a Suffix Tree efficiently.
A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes Meng He, J. Ian Munro, and S. Srinivasa Rao University of Waterloo.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
String Processing II: Compressed Indexes Patrick Nichols Jon Sheffi Dacheng Zhao
Suffix Trees Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors, …
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
1 Omics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html
Modern Information Retrieval Chapter 8 Indexing and Searching.
Fudan University Chen, Yaoliang 1. TTS System A Chinese Text-To-Speech system SafeDB Bug backlog SMemoHelper A small tool that helps learn English.
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Suffix Trees and Their Uses.
Modern Information Retrieval
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Suffix arrays. Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The.
Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler Ben Langmead Based on work by Juha Kärkkäinen.
Web Algorithmics Dictionary-based compressors. LZ77 Algorithm’s step: Output Advance by len + 1 A buffer “window” has fixed length and moves aacaacabcaaaaaa.
Suffix arrays. Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The.
Suffix trees.
Amortized Rigidness in Dynamic Cartesian Trees Iwona Białynicka-Birula and Roberto Grossi Università di Pisa STACS 2006.
Suffix trees and suffix arrays presentation by Haim Kaplan.
Suffix trees. Trie A tree representing a set of strings. a c b c e e f d b f e g { aeef ad bbfe bbfg c }
Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]
Tries. (Compacted) Trie y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo systile syzygetic syzygial syzygy szaibelyite szczecin.
Advanced Algorithms for Massive DataSets Data Compression.
Indexing and Searching
Compressed Index for a Dynamic Collection of Texts H.W. Chan, W.K. Hon, T.W. Lam The University of Hong Kong.
Information Retrieval Space occupancy evaluation.
Data Structures and Algorithms Huffman compression: An Application of Binary Trees and Priority Queues.
Database Index to Large Biological Sequences Ela Hunt, Malcolm P. Atkinson, and Robert W. Irving Proceedings of the 27th VLDB Conference,2001 Presented.
Read Alignment Algorithms. The Problem 2 Given a very long reference sequence of length n and given several short strings.
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
Compressed Suffix Arrays and Suffix Trees Roberto Grossi, Jeffery Scott Vitter.
B ACKWARD S EARCH FM-I NDEX (F ULL - TEXT INDEX IN M INUTE SPACE ) Paper by Ferragina & Manzini Presentation by Yuval Rikover.
BACKWARD SEARCH FM-INDEX (FULL-TEXT INDEX IN MINUTE SPACE)
Joint Advanced Student School Compressed Suffix Arrays Compression of Suffix Arrays to linear size Fabian Pache.
The Burrows-Wheeler Transform: Theory and Practice Article by: Giovanni Manzini Original Algorithm by: M. Burrows and D. J. Wheeler Lecturer: Eran Vered.
ETRI Linear-Time Search in Suffix Arrays July 14, 2003 Jeong Seop Sim, Dong Kyue Kim Heejin Park, Kunsoo Park.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Exact String Matching Algorithms. Copyright notice Many of the images in this power point presentation of other people. The Copyright belong to the original.
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
On searching and extracting strings from compressed textual data Rossano Venturini Dipartimento di Informatica University of Pisa, Italy.
Linear Time Suffix Array Construction Using D-Critical Substrings
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
15-853:Algorithms in the Real World
Advanced Algorithms for Massive DataSets
COMP9319 Web Data Compression and Search
Two equivalent problems
Suffix trees.
Problem with Huffman Coding
Suffix trees and suffix arrays
Suffix Arrays and Suffix Trees
Hard Instances of Compressed Text Indexing
Presentation transcript:

Text Indexing The Suffix Array

Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes of T T = mississippi mississippi 4,7 P = si T[i,N] iff P is a prefix of the i-th suffix of T (ie. T[i,N]) T P i Pattern P occurs at position i of T From substring search To prefix search Reduction

The Suffix Array Prop 1. All suffixes in SUF(T) having prefix P are contiguous. P=si T = mississippi# # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# SUF(T) Suffix Array SA:   N log 2 N) bits Text T: N chars  In practice, a total of 5N bytes  (N 2 ) space SA T = mississippi# suffix pointer 5 Prop 2. Starting position is the lexicographic one of P.

Searching a pattern Indirected binary search on SA: O(p) time per suffix cmp T = mississippi# SA P = si P is larger 2 accesses per step

Searching a pattern Indirected binary search on SA: O(p) time per suffix cmp T = mississippi# SA P = si P is smaller Suffix Array search O (log 2 N) binary-search steps Each step takes O(p) char cmp  overall, O (p log 2 N) time

Listing of the occurrences T = mississippi # 4 7 SA si# occ= si$ Suffix Array search O (p * log 2 N + occ) time can be reduced… where # <  < $ sissippi sippi

SA Lcp Text mining T = mississippi# issippi ississippi Does it exist a repeated substring of length ≥ L ? Search for Lcp[i] ≥ L Does it exist a substring of length ≥ L occurring ≥ C times ? Search for Lcp[i,i+C-2] whose entries are ≥ L How long is the common prefix between T[i,...] and T[j,...] ? Min of the subarray Lcp[h,k] s.t. SA[h]=i and SA[k]=j.

Giovanni Manzini’s page

Compressed Indexing What about accessing or searching compressed data?

i #mississip p p i#mississi p p pi#mississ i s ippi#missi s s issippi#mi s s sippi#miss i s sissippi#m i i ssippi#mis s m ississippi # i ssissippi# m Decompress any substring # mississipp i i ppi#missis s FL unknown k-th occurrence of  in L corresponds to k-th occurrence of  in F Recall that LF-mapping means: You can reconstruct any substring backward IF you know the row of its last character s i m How do we know where to start ? Keep sampled positions T = mississippi# sampling step is Trade-off between space (n log n/S bits) and decompression time of an L- long substring (S+ L time) due to the sampling step S

fr occ=2 [lr-fr+1] Substring search in T ( Count the pattern occurrences ) #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m ipssm#pissiiipssm#pissii L mississippi # 1 i 2 m 6 p 7 s 9 F Available info P = si First step fr lr Inductive step: Given fr,lr for P[j+1,p] Œ Take c=P[j] P[ j ]  Find the first c in L[fr, lr] Ž Find the last c in L[fr, lr]  L-to-F mapping of these chars lr rows prefixed by char “i” s s unknown Rank is enough