Tries

Presentation transcript:

Tries

(Compacted) Trie [Fredkin, CACM 1960]
[Figure: compacted trie over the strings systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo, with edge labels such as s, y, z, stile, zyg, etic, ial, aibelyite, czecin, omo; one edge is annotated ( 2 ; 3,5)]
Performance:
Search ≈ O(|P|) time
Space ≈ O(K + N)

(Compacted) Trie [Fredkin, CACM 1960] (continued)
Performance:
Search ≈ O(|P|) time
Space ≈ O(K + N)
... But in practice…
Search: random memory accesses
Space: len + pointers + strings
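A minimal sketch of a compacted trie, to make the O(|P|) search concrete. The names `Node`, `insert`, `build` and `contains` are illustrative choices, not taken from the slides, and children are kept in a Python dict for brevity. Each edge stores a whole substring, so a search touches each character of P at most once, but every hop to a child is a pointer jump (the "random memory accesses" mentioned above).

```python
class Node:
    def __init__(self):
        self.children = {}    # first char of edge -> (edge label, child node)
        self.is_word = False  # True if a stored string ends exactly here


def insert(root, word):
    node, i = root, 0
    while i < len(word):
        c = word[i]
        if c not in node.children:
            leaf = Node()
            leaf.is_word = True
            node.children[c] = (word[i:], leaf)  # hang the rest as one edge
            return
        label, child = node.children[c]
        # Length of the common prefix of the edge label and the rest of word.
        k = 0
        while k < len(label) and i + k < len(word) and label[k] == word[i + k]:
            k += 1
        if k < len(label):
            # Mismatch inside the edge: split it at offset k.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[c] = (label[:k], mid)
            child = mid
        node, i = child, i + k
    node.is_word = True


def build(words):
    root = Node()
    for w in words:
        insert(root, w)
    return root


def contains(root, pattern):
    node, i = root, 0
    while i < len(pattern):
        entry = node.children.get(pattern[i])
        if entry is None:
            return False
        label, child = entry
        if pattern[i:i + len(label)] != label:
            return False
        node, i = child, i + len(label)
    return node.is_word


words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
root = build(words)
print(contains(root, "syzygy"), contains(root, "syzyg"))
```

On the seven strings of the slide, the root has a single outgoing edge labelled s, which then branches on y and z.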

2-level indexing
Disk: buckets of front-coded sorted strings: …. 0systile 2zygetic 5ial 5y | 0szaibelyite 2czecin 2omo ….
Internal memory: compacted trie (CT) on a sample of the strings (here systile and szaibelyite, the first string of each bucket)
2 advantages:
Search ≈ typically 1 I/O
Space ≈ Front-coding over buckets
2 limitations:
Sampling rate ≈ lengths of sampled strings
Trade-off ≈ speed vs space (because of bucket size)
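The bucketed front coding of the disk level can be sketched as follows. The function names and the bucket size of 4 are assumptions chosen so that the output reproduces the slide's sequence; the slide does not state a bucket size. The first string of each bucket is stored in full (prefix length 0), so a bucket can be decoded on its own after one I/O.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_code(sorted_strings, bucket_size):
    """Encode each string as (shared prefix length, remaining suffix)."""
    out = []
    for idx, s in enumerate(sorted_strings):
        if idx % bucket_size == 0:
            out.append((0, s))  # bucket head: stored explicitly
        else:
            k = lcp_len(sorted_strings[idx - 1], s)
            out.append((k, s[k:]))
    return out


def decode_bucket(bucket):
    """Rebuild the strings of one bucket by a left-to-right scan."""
    prev, words = "", []
    for k, suffix in bucket:
        prev = prev[:k] + suffix
        words.append(prev)
    return words


words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
coded = front_code(words, bucket_size=4)
print(" ".join(f"{k}{s}" for k, s in coded))
# prints: 0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo
```

A larger bucket saves more space but lengthens the scan needed to decode a string, which is the speed vs space trade-off listed among the limitations.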

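An old idea: Patricia Trie [Morrison, J.ACM 1968]
[Figure: the compacted trie over the same strings, with edge labels such as s, y, z, stile, zyg, etic, ial, aibelyite, czecin, omo and numbers such as 1 and 5 stored in the nodes]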

A new search [Ferragina-Grossi, J.ACM 1999]
[Figure: Patricia trie over …. systile syzygetic syzygial syzygy szaibelyite szczecin szomo …., keeping only the first character of each edge label (s, y, z, s, z, e, i, y, a, c, o) and the node numbers 1 and 5]
Three-phase search, Search(P), with P = syzyyea:
Phase 1: tree navigation
Phase 2: Compute LCP
Phase 3: tree navigation
The mismatch g < y determines P's position.
Only 1 string is checked
Trie Space ≈ #strings, NOT their length

2-level indexing
Disk: …. Locality Preserving Front Coding ….
Internal memory: Patricia trie (PT) on all strings
Search: typically 1 I/O
A limitation is n < M. What about n > M?

The String B-tree [Ferragina-Grossi, J.ACM 1999]
[Figure: a B-tree whose nodes hold a Patricia trie (PT); Search(P) descends towards the lexicographic position of P]
Search(P): O((p/B) log_B n) I/Os, plus O(occ/B) I/Os
1 string checked: O(p/B)
O(log_B n) levels
It is dynamic...
Knuth, vol. 3, p. 489: “elegant”

On Front-Coding…
[Figure: compacted trie of a set of DNA strings, shown next to their front-coded sequence]
Front Coding: 0 AGAAGA, 5 G, 3 C, 0 GCGCAGA, 6 G, 4 GGA, 6 GA
Knuth: In-order visit + Path covering
Compacted Trie = FC + tree structure
What about other traversals?
FC +... is searchable

Why pre-order visit
In Front-coding the Lcp information is encoded many times
[Figure: the same compacted trie, shown next to the rear-coded sequence]
Rear Coding: 0 AGAAGA, 1 G, 3 C, 4 GCGCAGA, 1 G, 3 GGA, 1 GA

Text Indexing

What do we mean by “Indexing”?
- Word-based indexes, where a notion of “word” must be devised!
  » Inverted lists, Signature files, Bitmaps.
- Full-text indexes, with no constraint on text and queries!
  » Suffix Array, Suffix tree, String B-tree, ...

Basic notation and facts
Pattern P occurs at position i of T iff P is a prefix of the i-th suffix of T (i.e. T[i,N])
Occurrences of P in T = All suffixes of T having P as a prefix
Example: T = mississippi, P = si occurs at positions 4 and 7
SUF(T) = Sorted set of suffixes of T
Reduction: from substring search to prefix search
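The reduction can be checked directly with a brute-force sketch (the function name `occurrences` is illustrative): P occurs at position i exactly when the i-th suffix starts with P. Positions are 1-based, as in the slides.

```python
def occurrences(T, P):
    """1-based positions i such that P is a prefix of the suffix T[i,N]."""
    return [i + 1 for i in range(len(T)) if T[i:].startswith(P)]


print(occurrences("mississippi", "si"))  # prints: [4, 7]
```

This scan is linear in the text for every query; sorting the suffixes once, as SUF(T) does, is what turns each query into a prefix search.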

The Suffix Tree
T# = mississippi#
[Figure: suffix tree of T#, with edge labels such as #, i, ppi#, ssi, mississippi#, p, i#, pi#, s, si, ssippi#, numbered leaves, and a legend beginning “Label =”]
Space: #nodes
Search pattern P
Maximal repeated substring = node

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
T = mississippi#, P = si

SA   SUF(T)
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#         (prefix P = si)
 4   sissippi#      (prefix P = si)
 6   ssippi#
 3   ssissippi#

Each SA entry is a suffix pointer into T, e.g. the entry 5 points to the suffix issippi#.
Suffix Array SA: Θ(N log_2 N) bits
Text T: N chars
⇒ In practice, a total of 5N bytes (4 bytes per pointer plus 1 byte per character)

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
T = mississippi#, P = si
[Figure: a binary-search step on SA in which P is larger than the probed suffix]
2 accesses per step

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
T = mississippi#, P = si
[Figure: a binary-search step on SA in which P is smaller than the probed suffix]
Suffix Array search [Manber-Myers, ’90]
- O(log_2 N) binary-search steps
- Each step takes O(p) char comparisons
⇒ overall, O(p log_2 N) time

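Locating the occurrences
T = mississippi#, P = si
Binary-search SA for si# and for si$, where # < Σ < $: the suffixes that fall in between (sippi#, sissippi#) are exactly those prefixed by P.
Occurrences: positions 7 and 4 (occ = 2)
Suffix Array search: O(p + log_2 N + occ) time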
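A sketch of the indirect binary search, under two simplifications that are not in the slides: SA is built by naively sorting the suffixes, and instead of searching for the sentinel strings si# and si$ the code compares P against the first p characters of each probed suffix, which delimits the same range. The names `build_sa` and `locate` are illustrative. Each probe makes the two accesses mentioned above (one to SA, one to T) and costs O(p) character comparisons, so this is the O(p log_2 N + occ) variant.

```python
def build_sa(T):
    """Naive suffix array with 1-based positions (sorts whole suffixes)."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])


def locate(T, SA, P):
    """Return the SA range of suffixes prefixed by P, as text positions."""
    p = len(P)

    def prefix(r):
        start = SA[r] - 1            # access 1: SA, access 2: the text
        return T[start:start + p]

    # First rank whose p-char prefix is >= P.
    lo, hi = 0, len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose p-char prefix is > P.
    hi = len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= P:
            lo = mid + 1
        else:
            hi = mid
    return SA[first:lo]


T = "mississippi#"
SA = build_sa(T)
print(SA)                    # prints: [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(locate(T, SA, "si"))   # prints: [7, 4]
```

The occurrences come out in suffix-array order (7 before 4), not in text order.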

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
How long is the common prefix between T[i,...] and T[j,...]?
Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j
T = mississippi# (Lcp[r] refers to rows r and r+1)

 r   SA   Lcp   suffix
 1   12   0     #
 2   11   1     i#
 3    8   1     ippi#
 4    5   4     issippi#
 5    2   0     ississippi#
 6    1   0     mississippi#
 7   10   1     pi#
 8    9   0     ppi#
 9    7   2     sippi#
10    4   1     sissippi#
11    6   3     ssippi#
12    3         ssissippi#

Example: Lcp(7,3) = 1 = min{2,1,3}
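The Lcp array and the range-minimum rule can be sketched as follows. This is a brute-force illustration: the helper names are assumptions, SA is built by naive sorting, the Lcp entries are computed by direct comparison, and the minimum is taken by a linear scan rather than a range-minimum data structure. Python lists are 0-based, so `Lcp[r]` here refers to the suffixes of 0-based ranks r and r+1.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def build_sa(T):
    """Naive suffix array with 1-based text positions."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])


def build_lcp(T, SA):
    """Lcp[r] = common prefix length of the suffixes at ranks r and r+1."""
    return [lcp_len(T[SA[r] - 1:], T[SA[r + 1] - 1:])
            for r in range(len(SA) - 1)]


def lcp_of_suffixes(SA, Lcp, i, j):
    """Common prefix length of T[i,...] and T[j,...], for i != j."""
    h, k = sorted((SA.index(i), SA.index(j)))
    return min(Lcp[h:k])


T = "mississippi#"
SA = build_sa(T)
Lcp = build_lcp(T, SA)
print(Lcp)                              # prints: [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(lcp_of_suffixes(SA, Lcp, 7, 3))   # prints: 1
```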

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Does a repeated substring of length ≥ L exist?
- The maximal Lcp of a suffix is with one of its adjacent suffixes in SA
- Search for Lcp[i] ≥ L
[Figure: the SA and Lcp arrays of T = mississippi#]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Does a substring of length ≥ L occurring ≥ C times exist?
- There exist ≥ C equal substrings of length ≥ L chars
- There exist ≥ C suffixes sharing a prefix of ≥ L chars
- These suffixes may not be contiguous, but their “block” has a common prefix of ≥ L chars
- Search for Lcp[i,i+C-2] whose entries are ≥ L
[Figure: the SA and Lcp arrays of T = mississippi#, with the example L = 1, C = 4]
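Both text-mining queries reduce to scans of the Lcp array, as in this sketch. The function names are illustrative, and the array below is the Lcp array of T = mississippi#, derived from the suffix list of the slide rather than copied from it. The second function assumes C ≥ 2 and looks for C-1 consecutive entries that are all ≥ L.

```python
# Lcp array of T = mississippi# (entry r compares the suffixes of ranks r, r+1)
LCP = [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]


def has_repeat(Lcp, L):
    """Is there a substring of length >= L that occurs at least twice?"""
    return any(v >= L for v in Lcp)


def has_frequent_substring(Lcp, L, C):
    """Is there a substring of length >= L occurring >= C times (C >= 2)?"""
    run = 0  # length of the current run of entries >= L
    for v in Lcp:
        run = run + 1 if v >= L else 0
        if run >= C - 1:
            return True
    return False


print(has_repeat(LCP, 4))                 # prints: True
print(has_frequent_substring(LCP, 1, 4))  # prints: True
```

With L = 1 and C = 4 the answer is positive because i and s each occur four times in the text; with L = 2 and C = 3 it is negative.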

How to construct SA from T?
Input: T = mississippi#
[Figure: the suffixes of T sorted lexicographically, giving SA]
Sorting the suffixes directly is elegant but inefficient.
Obvious inefficiencies:
- Θ(n^2 log n) time in the worst-case
- Θ(n log n) cache misses or I/O faults

The skew algorithm
The key problem: compare two suffixes efficiently
Brute-force = Θ(n) time per comparison, Θ(n^2 log n) total
In order to sort the suffixes of S:
1. Divide the suffixes of S in two groups
   S_{0,2} = suffixes starting at positions 0 mod 3 or 2 mod 3
   S_1 = suffixes starting at positions 1 mod 3
2a. Sort recursively S_{0,2} (they are 2n/3)
2b. Sort S_1: suffix(3i+1) = S[3i+1] · suff(3i+2)
3. Merge the sorted S_{0,2} with the sorted S_1
T(n) = O(split) + T(2n/3) + O(|S_1|) + O(merge) = O(n)
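Why the recurrence solves to O(n) can be spelled out as follows, assuming that the split, the sort of S_1 and the merge together cost at most c·n for some constant c (the constant c is introduced here for illustration). Unrolling the recursion gives a geometric series:

```latex
T(n) \;\le\; T\!\left(\tfrac{2n}{3}\right) + c\,n
     \;\le\; c\,n \sum_{i \ge 0} \left(\tfrac{2}{3}\right)^{i}
     \;=\; 3\,c\,n \;=\; O(n).
```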

Sort recursively S_{0,2}
We turn this problem into the SA-construction of a shorter string of length (2/3)n.
S = AAT GTG AGA TGA $$$
RadixSort all triplets that start at positions 0,2 mod 3:
T = {ATG, TGT, TGA, GAG, GAT, ATG, GA$, A$$}
Sort(T) = (A$$, ATG, GA$, GAG, GAT, TGA, TGT)
Assign lexicographic names (log n bits): A$$=1, ATG=2, GA$=3,…
Build s_{0,2} and encode it: s_{0,2} = ATG TGA GAT GA$ TGT GAG ATG A$$

Sort recursively S_{0,2}
Given S = AAT GTG AGA TGA $$$, we have built:
s_{0,2} = ATG TGA GAT GA$ TGT GAG ATG A$$
enc(s_{0,2}) = 2 6 5 3 7 4 2 1
A suffix of s_{0,2} ⇔ a suffix of enc(s_{0,2}): lex-order is preserved
SA(enc(s_{0,2})) gives SA_{0,2} = [12, 9, 2, 11, 6, 8, 5, 3]
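The naming step can be reproduced on the slide's example with the sketch below. Several things are assumptions made for illustration: the function names, the use of Python's `sorted` in place of RadixSort, and above all the naive sort of the suffixes of the encoded string, which stands in for the recursive call of the real algorithm. Positions are 1-based and `$` is taken to be smaller than every letter, which holds in ASCII. The sketch is only checked on this input; a general implementation needs more care at the boundary between the two groups of triplets.

```python
def triplet_names(S):
    """Triplets at positions 2 mod 3, then 0 mod 3, with their rank names."""
    n = len(S)
    padded = S + "$$$"
    positions = ([i for i in range(1, n + 1) if i % 3 == 2] +
                 [i for i in range(1, n + 1) if i % 3 == 0])
    triplets = [padded[i - 1:i + 2] for i in positions]
    # Equal triplets get the same name; names follow lexicographic order.
    names = {t: r for r, t in enumerate(sorted(set(triplets)), start=1)}
    enc = [names[t] for t in triplets]
    return positions, triplets, enc


def sa_02(S):
    """SA_{0,2}, using a naive suffix sort of enc instead of recursion."""
    positions, _, enc = triplet_names(S)
    order = sorted(range(len(enc)), key=lambda r: enc[r:])
    return [positions[r] for r in order]


S = "AATGTGAGATGA"
positions, triplets, enc = triplet_names(S)
print(" ".join(triplets))  # prints: ATG TGA GAT GA$ TGT GAG ATG A$$
print(enc)                 # prints: [2, 6, 5, 3, 7, 4, 2, 1]
print(sa_02(S))            # prints: [12, 9, 2, 11, 6, 8, 5, 3]
```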

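Sort S_1
We turn this problem into the sort of pairs.
S = AAT GTG AGA TGA $$$
SA_{0,2} = [12, 9, 2, 11, 6, 8, 5, 3]
Key observation: a suffix of S_1 is one character followed by a suffix of S_{0,2}, e.g. suff(1) = A · suff(2) and suff(7) = A · suff(8), and the relative order of suff(2) and suff(8) is already given by SA_{0,2}.
⇒ SA_1 = [1, 7, 4, 10]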
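A sketch of the pair sort on the slide's example, taking SA_{0,2} as given. The function name is illustrative and Python's `sorted` stands in for the linear-time radix sort that the algorithm would use. Each suffix starting at a position i ≡ 1 (mod 3) is represented by the pair (S[i], rank of suff(i+1) in SA_{0,2}); a suffix i+1 that falls past the end of S is given rank -1 so that it sorts first.

```python
def sort_s1(S, sa02):
    """Sort the suffixes at positions 1 mod 3 (1-based) as pairs."""
    n = len(S)
    rank = {pos: r for r, pos in enumerate(sa02)}
    s1 = [i for i in range(1, n + 1) if i % 3 == 1]
    return sorted(s1, key=lambda i: (S[i - 1], rank.get(i + 1, -1)))


S = "AATGTGAGATGA"
SA02 = [12, 9, 2, 11, 6, 8, 5, 3]
print(sort_s1(S, SA02))  # prints: [1, 7, 4, 10]
```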

The merge step
S = AAT GTG AGA TGA $$$
To merge suffix s_i in S_{0,2} with suffix s_k in S_1, note that:
- If (i mod 3) = 2 ⇒ s_{i+1} and s_{k+1} belong to S_{0,2}
- If (i mod 3) = 0 ⇒ s_{i+2} and s_{k+2} belong to S_{0,2}
so their order can be derived from SA_{0,2} in O(1) time.
[Figure: SA_1 and SA_{0,2} merged into SA]
T(n) = T(2n/3) + O(n) + O(merge) = O(n)
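The merge can be sketched on the slide's example as below, taking SA_{0,2} and SA_1 as given. The function and helper names are illustrative. Each comparison looks at one or two characters and then at a rank in SA_{0,2}, following the two cases above; positions past the end of S get rank -1, and `$` padding is assumed smaller than every letter.

```python
def merge(S, sa02, sa1):
    """Merge SA_{0,2} and SA_1 (1-based positions) into the full SA."""
    padded = S + "$$$"
    rank = {pos: r for r, pos in enumerate(sa02)}

    def ch(i):
        return padded[i - 1]

    def rk(i):
        return rank.get(i, -1)  # suffixes past the end sort first

    def s02_is_smaller(i, k):
        if i % 3 == 2:
            # i+1 and k+1 both start suffixes of S_{0,2}
            return (ch(i), rk(i + 1)) < (ch(k), rk(k + 1))
        # i mod 3 == 0: i+2 and k+2 both start suffixes of S_{0,2}
        return (ch(i), ch(i + 1), rk(i + 2)) < (ch(k), ch(k + 1), rk(k + 2))

    out, a, b = [], 0, 0
    while a < len(sa02) and b < len(sa1):
        if s02_is_smaller(sa02[a], sa1[b]):
            out.append(sa02[a])
            a += 1
        else:
            out.append(sa1[b])
            b += 1
    return out + sa02[a:] + sa1[b:]


S = "AATGTGAGATGA"
SA02 = [12, 9, 2, 11, 6, 8, 5, 3]
SA1 = [1, 7, 4, 10]
SA = merge(S, SA02, SA1)
print(SA)  # prints: [12, 1, 7, 9, 2, 11, 6, 8, 4, 10, 5, 3]
```

The result agrees with sorting the twelve suffixes of S directly, which is a convenient sanity check for the two comparison cases.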