Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Slides:



Advertisements
Similar presentations
Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
Advertisements

Paolo Ferragina, Università di Pisa Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica, Università di Pisa.
Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.
B+-trees. Model of Computation Data stored on disk(s) Minimum transfer unit: a page = b bytes or B records (or block) N records -> N/B = n pages I/O complexity:
Two implementation issues Alphabet size Generalizing to multiple strings.
What about the trees of the Mississippi? Suffix Trees explained in an algorithm for indexing large biological sequences Jacob Kleerekoper & Marjolijn Elsinga.
Suffix Trees and Suffix Arrays
Compressed Compact Suffix Arrays Veli Mäkinen University of Helsinki Gonzalo Navarro University of Chile compact compress.
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes Meng He, J. Ian Munro, and S. Srinivasa Rao University of Waterloo.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
Suffix Trees Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors, …
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
Tries Standard Tries Compressed Tries Suffix Tries.
Tries Search for ‘bell’ O(n) by KMP algorithm O(dm) in a trie Tries
Advanced Algorithm Design and Analysis (Lecture 4) SW5 fall 2004 Simonas Šaltenis E1-215b
Modern Information Retrieval Chapter 8 Indexing and Searching.
Modern Information Retrieval
Goodrich, Tamassia String Processing1 Pattern Matching.
On Demand String Sorting over Unbounded Alphabets Carmel Kent Moshe Lewenstein Dafna Sheinwald.
B+-tree and Hashing.
1 Compressed Index for Dictionary Matching WK Hon (NTHU), TW Lam (HKU), R Shah (LSU), SL Tam (HKU), JS Vitter (Purdue)
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Suffix arrays. Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
Suffix trees and suffix arrays presentation by Haim Kaplan.
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
Department of Computer Eng. & IT Amirkabir University of Technology (Tehran Polytechnic) Data Structures Lecturer: Abbas Sarraf Search.
6/26/2015 7:13 PMTries1. 6/26/2015 7:13 PMTries2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3) Huffman encoding.
Tries. (Compacted) Trie y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo systile syzygetic syzygial syzygy szaibelyite szczecin.
1 A Lempel-Ziv text index on secondary storage Diego Arroyuelo and Gonzalo Navarro Combinatorial Pattern Matching 2007.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
CS4432: Database Systems II
Database Index to Large Biological Sequences Ela Hunt, Malcolm P. Atkinson, and Robert W. Irving Proceedings of the 27th VLDB Conference,2001 Presented.
Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.
Database Management 8. course. Query types Equality query – Each field has to be equal to a constant Range query – Not all the fields have to be equal.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
CSC401 – Analysis of Algorithms Chapter 9 Text Processing
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
Lecture 11COMPSCI.220.FS.T Balancing an AVLTree Two mirror-symmetric pairs of cases to rebalance the tree if after the insertion of a new key to.
Sets of Digital Data CSCI 2720 Fall 2005 Kraemer.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
1 i206: Lecture 16: Data Structures for Disk; Advanced Trees Marti Hearst Spring 2012.
1 Lexicographic Search:Tries All of the searching methods we have seen so far compare entire keys during the search Idea: Why not consider a key to be.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
Generic Trees—Trie, Compressed Trie, Suffix Trie (with Analysi
Linear Time Suffix Array Construction Using D-Critical Substrings
Why indexing? For efficient searching of a document
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
15-853:Algorithms in the Real World
Tries 07/28/16 11:04 Text Compression
Tries 5/27/2018 3:08 AM Tries Tries.
Azita Keshmiri CS 157B Ch 12 indexing and hashing
COMP9319 Web Data Compression and Search
Two equivalent problems
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Suffix trees.
String Matching Module-5.
Suffix trees and suffix arrays
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Tries 2/27/2019 5:37 PM Tries Tries.
Suffix Arrays and Suffix Trees
Sequences 5/17/ :43 AM Pattern Matching.
Presentation transcript:

Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Why string data are interesting ? They are ubiquitous: l Digital libraries and product catalogues l Electronic white and yellow pages l Specialized information sources (e.g. Genomic or Patent dbs) l Web pages repositories l Private information dbs l... String collections are growing at a staggering rate: l...more than 10Tb of textual data in the web l...more than 15Gb of base pairs in the genomic dbs

The need for an “index” Brute-force scanning is not a viable approach: –Fast single searches –Multiple simple searches for complex queries The index is a basic block of any IR system. An IR system also encompasses: –IR models –Ranking algorithms –Query languages and operations –... We will concentrate only on “index design” !!

Two families of indexes Types of data Linguistic or tokenizable text Raw sequence of characters or bytes Word-based query Character-based query Types of query Two indexing approaches : Word-based indexes, here a concept of “word” must be devised ! » Inverted files, Signature files or Bitmaps. Full-text indexes, no constraint on text and queries ! » Suffix Array, Suffix tree, Hybrid indexes, or String B-tree. DNA sequences Audio-video files Executables Arbitrary substring Complex matches Exact word Word prefix or suffix Phrase

Word-based: Inverted files (or lists) Now is the time for all good men to come to the aid of their country Doc #1 It was a dark and stormy night in the country manor. The time was past midnight Doc #2  Query answering is a two-phase process: midnight AND time VocabularyPostings 2

Full-text indexes Their need is pervasive: –Raw data: DNA sequences, Audio-Video files,... –Linguistic texts: data mining, statistics,... –Vocabulary for Inverted Lists –Xpath queries on XML documents –Intrusion detection, Anti-viruses,... Classes of indexes:  Suffix array, Suffix tree (variants)  Multi-level indexes: Short Pat array  B-tree based data structures: Prefix B-tree, String B-tree

Terminology l An alphabet, denoted ∑ is a set of (ordered) characters. l A string S is an array of characters, S[1,n] = S[1] S[2] … S[n]. l S[i,j] = S[i] … S[j] is a substring of S. l S[1,j] is a prefix of S; S[i,n] is a suffix of S. l ∑* denotes all strings over alphabet ∑. l Lexicographic order: –Example: For ∑ = { a, b, c, …, z }, where a < b < c < … < z, the lexicographic order is the same as in a dictionary. SUF(T) = Sorted set of suffixes of T

Indexed string matching problem l Let T be a set of K strings in ∑*, where N is the total length of all strings in T. l String matching query on T: Given a pattern P find all occurrences of P in the strings in T. l Static problem: Store T in a data structure that supports string matching queries. Such a data structure is called a full-text index. l Dynamic version: Supports also insertions and deletions of strings in the full-text index.

A simple but crucial observation Pattern P[1,p] occurs at position i of T[1,n] iff P[1,p] is a prefix of the suffix T[i,n] T P i T[i,n] Occurrences of P in T = All suffixes of T having P as a prefix T = This is a visual example This is a visual example 3,6,12 Can transform the string matching problem to a “prefix matching problem” over all the suffixes.

Suffix Array [Manber-Myers, 90] Suffix array: an array of pointers to all the suffixes in the text in their lexicographic order. T = mississippi# # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# SUF(T) Suffix Array SA: array of ints, 4N bytes Text T: N bytes  5N bytes of space occupancy  (N 2 ) space SA T = mississippi# suffix pointer 5

Two key properties [Manber-Myers, 90] Prop 1. All suffixes in SUF(T) having prefix P are contiguous. Prop 2. Starting position is the lexicographic one of P. P=si T = mississippi# # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# SUF(T) SA T = mississippi#

Searching in Suffix Array [Manber-Myers, 90] Indirected binary search on SA: O(p log 2 N) time T = mississippi# SA si P is larger 2 accesses for binary step

Searching in Suffix Array [Manber-Myers, 90] Indirected binary search on SA: O(p log 2 N) time T = mississippi# SA si P is smaller

Listing the occurrences [Manber-Myers, 90] Brute-force comparison: O(p x occ) time T = mississippi# SA si occ=2 Suffix Array search O (p (log 2 N + occ)) time O (log 2 N + occ) in practice External memory Simple disk paging for SA O ((p/B) (log 2 N + occ)) I/Os issippi P is not a prefix P is a prefix sissippi P is a prefix sippi log B N + occ/B

SA Lcp Lcp[1,n-1] stores the longest-common-prefix between suffixes adjacent in SA Output-sensitive retrieval T = mississippi# # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# SUF(T) P=si Suffix Array search O ((p/B) log 2 N + (occ/B)) I/Os 9 N bytes of space Scan Lcp until Lcp[i] < p + : incremental search base B : tricky !! Compare against P occ=2

q Incremental search (case 1) Incremental search using the LCP array: no rescanning of pattern chars SA Range Minima i j The cost: O (1) memory accesses P Min Lcp[i,q-1] > P’s known induct. < P’s > P’s

q Incremental search (case 2) Incremental search using the LCP array: no rescanning of pattern chars SA i j Min Lcp[i,q-1] The cost: O (1) memory accesses Range Minima known induct. < P’s > P’s P

Incremental search (case 3) Incremental search using the LCP array: no rescanning of pattern chars SA i j q Min Lcp[i,q-1] Suffix char > Pattern char Suffix char < Pattern char Suffix Array search O(log 2 N) binary steps O(p) total char-cmp for routing  O((p/B) + log 2 N + (occ/B)) I/Os base B : more tricky (we will not consider this) L The cost: O( L ) char cmp Range Minima < P’s > P’s P

Summary: suffix array [Manber-Myers, 90] l Space: O(N) l String matching queries: O(p + log N + occ) l Can be constructed O(N log N) time. l Static.

SA Hybrid Index: Short Pat array Exploit internal memory : sample the suffix array and copy something in memory M Disk P binary-search inside SA + Supra-index O((p/B) + log 2 (N/s) + (occ/B)) I/Os  Parameter s depends on M and influences both performance and space !! (only a huristic) s Copy a prefix of marked suffixes

Tries l Trie (name from the word “retrieval”): a data structure for storing a set of strings –Let’s assume that all strings end with “$” (not in  ) b s e a r $ i d $ u l k $ $ l u n d a y $ $ Set of strings: { bear, bid, bulk, bull, sun, sunday }

Tries l Properties of a trie: –A multi-way tree. –Each node has from 1 to  1 children. –Each edge of the tree is labeled with a character. –Each leaf node corresponds to the stored string, which is a concatenation of characters on a path from the root to this node.

Analysis of the Trie Given k strings of total length N: l Size: –O(N) in the worst-case l Search, insertion, and deletion (string of length m): –O(m) (assuming ∑ is constant) l Observation: –Having chains of one-child nodes is wasteful

Compact Tries l Compact Trie: –Replace a chain of one-child nodes with an edge labeled with a string –Each non-leaf node (except root) has at least two children b s e a r $ i d $ u l k $ $ l u n d a y $ $ b sun ear$ id$ ul k$ l$ day$ $

Compact Tries l Implementation: –Strings are external to the structure in one array, edges are labeled with indices in the array (from, to) –Improves the space to O(k) b sun ear$ id$ ul k$ l$ day$ $ bear$bid$bulk$bull$sun$sunday$ T:T: 1,1 20,22 2,5 7,9 11,12 13,14 18,19 27,30 5,5

Patricia Tries l Patricia trie: –a compact trie where each edge’s label (from, to) is replaced by (T[from], to – from + 1) bear bid bulk bull sun sunday T:T: b,1 s,3 e,4 i,3 u,2 k,2 l,2 d,4 _,1 1,1 20,22 2,5 7,9 11,12 13,14 18,19 27,30 5,5

Suffix Trees [McCreight, 76] l Suffix tree – a compact trie (or similar structure) of all suffixes of the text –Patricia trie of suffixes is sometimes called a Pat tree carrara$ a r r $ carrara$ rara$ a$ rara$ a $ ra$ (a,1)(r,1) ($,1) (c,8) (r,5) (a,2) (r,5) (a,1) (r,3) ($,1)

Search in suffix trees T = abababbc# c c c b a b b b b b a a b c a b b c a b c b P = ba c b 0 1 (5,8) O(N) space O(p) time What about ST in external memory ? –Unbalanced tree topology –Updates CPAT tree ~ 5N on average No (p/B), possibly no (occ/B), mainly static and space costly 768 c Packing ?!  (p) I/Os  (occ) I/Os??  Search is a path traversal 2 4 and O(occ) time - Large space ~ 15N b a

Summary: compact trie K strings of total length N: l Space: O(K) l String matching queries: O(p + occ) time l Can be constructed O(N) time. l Update(S): O(|S|) time

Summary: suffix tree [McCreight, 76] For a string of length n: l Space: O(n) l String matching queries: O(p + occ) l Can be constructed O(n) time. l Static.

The String B-tree ( An I/O-efficient full-text index !!) [Ferragina-Grossi, 95]

The prologue We are left with many open issues: –Suffix Array: updates –Hybrid: Heuristic tuning of the performance –Suffix tree: difficult packing and  (p) I/Os B-tree is ubiquitous in large-scale applications: –Atomic keys: integers, reals,... –Prefix B-tree: bounded length keys (  255 chars) Suffix trees + B-trees ?String B-tree [Ferragina-Grossi, 95]  Index unbounded length keys  Good worst-case I/O-bounds in search and update

Some considerations Strings have arbitrary length: –Disk page cannot ensure the storage of  (B) strings –M may be unable to store even one single string String storage: –Pointers allow to fit  (B) strings per disk page –String comparison needs disk access and may be expensive String pointers organization seen so far:  Suffix array: simple but static and not optimal  Patricia trie: sophisticated and much efficient (optimal ?) Recall the problem:  is a string collection Search( P[1,p] ): retrieve all occurrences of P in  ’s strings Update( S[1,t] ): insert or delete a text S from 

1º step: B-tree on string pointers AATCAGCGAATGCTGCTT CTGTTGATGA Disk P = AT O((p/B) log 2 B) I/Os O(log B N) levels Search(P) O ((p/B) log 2 N) I/Os O (occ/B) I/Os It is dynamic !! O(s (s/B) log 2 N) I/Os + B

2º step: The Patricia trie 1 AGAAGAAGAAGA AGAAGGAGAAGG AGACAGAC GCGCAGAGCGCAGA GCGCAGGGCGCAGG GCGCGGAGCGCGGA GCGCGGGAGCGCGGGA Disk A G A (1; 1,3)(4; 1,4) (6; 5,6) (5; 5,6) (2; 1,2) (3; 4,4) (1; 6,6) (2; 6,6) (4; 7,7) (5; 7,7) (7; 7,8) (6; 7,7) G

2º step: The Patricia trie 1 AGAAGAAGAAGA AGAAGGAGAAGG AGACAGAC GCGCAGAGCGCAGA GCGCAGGGCGCAGG GCGCGGAGCGCGGA GCGCGGGAGCGCGGGA Disk A 0 G A A A A 4 GA 6 7 GG G C Space PT O(k), not O(N) Two-phase search: P = GCACGCAC G 1 G 57 A Search(P): First phase: no string access Second phase: O(p/B) I/Os mismatch P’s position max LCP with P Just one string is checked !! G C

3º step: B-tree + Patricia tree AATCAGCGAATGCTGCTT CTGTTGATGA Disk PT Search(P) O((p/B) log B N) I/Os O(occ/B) I/Os Insert(S) O( s (s/B) log B N) I/Os O(p/B) I/Os O(log B N) levels P = AT +

PT Level i Search(P) O(log B N) I/Os just to go to the leaf level 4º step: Incremental Search Max_lcp(i) P Level i+1 PT First case Leaf Level adjacent

Level i+1 PT Inductive Step PT Level i Search(P) O(p/B + log B N) I/Os O(occ/B) I/Os 4º step: Incremental Search Max_lcp(i) P Max_lcp(i+1) skip Max_lcp(i) i-th step: O(( lcp i+1 – lcp i )/B + 1) I/Os No rescanning Second case

Summary: String B-tree String B-tree performance: [Ferragina-Grossi, 95] –Search(P) takes O(p/B + log B N + occ/B) I/Os –Update(S) takes O( s log B N ) I/Os –Space is  (N/B) disk pages Using the String B-tree in internal memory: –Search(P) takes O(p + log 2 N + occ) time –Update(S) takes O( s log 2 N ) time –Space is  (N) bytes  It is a sort of dynamic suffix array

Summary l Indexed string matching problem –Word-based –Full-text l Internal memory data structures –Suffix array –Suffix tree l External memory data structures –Patricia trie and Pat tree –Short Pat array –String B-tree