Paolo Ferragina, Università di Pisa XML Compression and Indexing Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio,

Presentation transcript:

Paolo Ferragina, Università di Pisa XML Compression and Indexing Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan] The Future of Web Search Barcelona, May 2006 Under patenting by Pisa-Rutgers Univ.

Compressed Permuterm Index
Paolo Ferragina, Rossano Venturini
Dipartimento di Informatica, Università di Pisa
Under Y! patenting

A basic problem
Given a dictionary D of strings, having variable length, design a compressed data structure that supports
1) string ↔ id
2) Prefix(α): find all strings in D that are prefixed by α
3) Suffix(β): find all strings in D that are suffixed by β
4) Substring(γ): find all strings in D that contain γ
5) PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)
This is the Tolerant Retrieval Problem (wildcards) of the IR book by Manning-Raghavan-Schütze:
Prefix(α) = α*
Suffix(β) = *β
Substring(γ) = *γ*
PrefixSuffix(α,β) = α*β

A basic problem (same dictionary D and operations 1-5)
Candidate solution: Hashing
It does not support non-exact (wildcard) searches

A basic problem (same dictionary D and operations 1-5)
Candidate solution: (Compacted) Trie
Two versions: one for D and one for D^R (the reversed strings), then intersect the answers
No substring search (unless using a Suffix Trie)
Need to store D for resolving edge labels

A basic problem (same dictionary D and operations 1-5)
Candidate solution: Front coding...

Front-coding
Two versions: one for D and one for D^R, then intersect the answers
Need some extra data structures for bucket identification
No substring search
Example on the uk-2002 crawl (250Mb); each entry after the first gives the length of the prefix shared with the previous URL, then the remaining suffix:
Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
% bzip 10%
Be back on this, later on!
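The front-coding idea above can be sketched in Python. This is only an illustration under assumptions: the names `front_encode` and `front_decode` and the four sample words are hypothetical, and the bucketing that the slide mentions (needed to avoid decoding from the very first string) is left out.

```python
def front_encode(sorted_strings):
    """Encode each string as (length of prefix shared with its predecessor, remaining suffix)."""
    encoded = []
    prev = ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded


def front_decode(encoded):
    """Rebuild the strings by copying the shared prefix from the previously decoded string."""
    decoded = []
    prev = ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        decoded.append(prev)
    return decoded


# Hypothetical sample dictionary (not the URLs shown on the slide).
words = sorted(["hat", "hip", "hop", "hot"])
codes = front_encode(words)
print(codes)
print(front_decode(codes) == words)
```

Because every entry depends on its predecessor, a lookup has to decode sequentially from the start of a bucket, which is why front coding answers prefix queries but not substring queries.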

A basic problem (same dictionary D and operations 1-5)
Compress the strings of D in a way that lets us support the operations efficiently
Candidate solution: Permuterm Index (Garfield, 76)
Reduce any query to a prefix query over a larger dictionary

Permuterm Index [Garfield, 1976]
Take a dictionary D = {yahoo, google}
1. Append a special char $ to the end of each string
2. Generate all rotations of these strings
yahoo$ ahoo$y hoo$ya oo$yah o$yaho $yahoo
google$ oogle$g ogle$go gle$goo le$goog e$googl $google
The rotations form the Permuterm Dictionary P[D]; any query on D reduces to a prefix query on P[D]:
Prefix(ya) on D = Prefix($ya) on P[D]
Suffix(oo) on D = Prefix(oo$) on P[D]
Substring(oo) on D = Prefix(oo) on P[D]
PrefixSuffix(y,o) on D = Prefix(o$y) on P[D]
Space problems: P[D] stores every rotation of every string
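The four reductions above can be tried out with a plain (uncompressed) permuterm dictionary. The sketch below is an assumption-laden illustration: the function names are hypothetical, P[D] is kept as a sorted Python list, and the prefix query is a binary search followed by a scan.

```python
from bisect import bisect_left


def build_permuterm(dictionary):
    """Return the sorted list of (rotation, source string) pairs for every s + '$'."""
    entries = []
    for s in dictionary:
        t = s + "$"
        for k in range(len(t)):
            entries.append((t[k:] + t[:k], s))
    entries.sort()
    return entries


def prefix_on_permuterm(entries, pattern):
    """Return the source strings that have a rotation starting with pattern."""
    i = bisect_left(entries, (pattern,))  # first rotation >= pattern
    hits = set()
    while i < len(entries) and entries[i][0].startswith(pattern):
        hits.add(entries[i][1])
        i += 1
    return hits


def prefix(entries, a):
    return prefix_on_permuterm(entries, "$" + a)


def suffix(entries, b):
    return prefix_on_permuterm(entries, b + "$")


def substring(entries, g):
    return prefix_on_permuterm(entries, g)


def prefix_suffix(entries, a, b):
    return prefix_on_permuterm(entries, b + "$" + a)


P = build_permuterm(["yahoo", "google"])
print(len(P))                      # number of stored rotations
print(prefix(P, "ya"))
print(suffix(P, "oo"))
print(sorted(substring(P, "oo")))
print(prefix_suffix(P, "y", "o"))
```

The list holds 13 rotations for two strings of 11 characters in total, which is the space problem the slide points out.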

Compressed Permuterm Index (SIGIR 07)
It deploys two ingredients:
Permuterm index
Compressed full-text index
Theoretically:
Query ops take optimal time: proportional to the pattern length
Space occupancy is |D| H_k(D) + o(|D| log |Σ|) bits
Technically:
A simple reduction step: Permuterm → Compressed index
Re-use known machinery on compressed indexes
Achieve bzip-compression at Front-coding speed

The Burrows-Wheeler Transform (1994)
Take the text T = mississippi#
Write down all its rotations:
mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss ssippi#missi sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi
Sort the rows; F is the first column and L is the last column:
F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i
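The construction above translates directly into a few lines of Python. This is a didactic sketch (the name `bwt` is hypothetical); it materializes all rotations, so it takes quadratic space, and it assumes that `#` sorts before every letter, which holds for Python's default string order.

```python
def bwt(text):
    """Return (F, L): the first and last columns of the sorted rotation matrix of text."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


F, L = bwt("mississippi#")
print(F)  # -> #iiiimppssss
print(L)  # -> ipssm#pissii
```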

Compressing L is effective
Key observation: L is locally homogeneous, hence L is highly compressible
Bzip vs. Gzip: 20% vs. 33%, but bzip is slower in (de)compression!

The FM-index [Ferragina-Manzini, JACM 05]
The result:
Count(P): O(p) time
Locate(P): O(occ * polylog(|T|)) time
Display(T[i,i+L]): O(L + polylog(|T|)) time
Space occupancy: |T| H_k(T) + o(|T| log |Σ|) bits
New concept: the FM-index is an opportunistic data structure
The survey of Navarro-Mäkinen contains many other indexes
The main idea is to reduce substring search to some basic operations over arrays of symbols
The Compressed Permuterm index builds upon the best two features of the FM-index

First ingredient: L → F mapping
[Figure: the sorted rotation matrix of mississippi#, showing only the columns F and L; the text in between is unknown to the index]
How do we map L's chars onto F's chars? We need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward: they keep the same relative order!!

First ingredient: L → F mapping
[Figure: the sorted rotation matrix of mississippi#, showing only the columns F and L; the text in between is unknown to the index]
The oracle: Rank(s, 9) = 3, the number of occurrences of s in L[1,9]
The FM-index is actually a Rank data structure over the BWT: O(1) time and H_k space
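A small Python sketch of the L → F mapping driven by Rank follows. The helper names (`build_c`, `rank`, `lf`) are hypothetical, rows are numbered from 0 here (the slide counts from 1), and `rank` is a naive scan rather than the O(1)-time compressed structure the slide refers to.

```python
from collections import Counter

L = "ipssm#pissii"        # BWT of mississippi#
F = "".join(sorted(L))    # first column: the same symbols in sorted order


def build_c(last):
    """C[c] = number of symbols in last that are strictly smaller than c."""
    counts = Counter(last)
    c, total = {}, 0
    for ch in sorted(counts):
        c[ch] = total
        total += counts[ch]
    return c


def rank(last, ch, i):
    """Occurrences of ch among the first i symbols of last (naive linear scan)."""
    return last[:i].count(ch)


def lf(last, c, row):
    """0-based row of F holding the same text symbol as last[row]."""
    ch = last[row]
    return c[ch] + rank(last, ch, row)


C = build_c(L)
print(rank(L, "s", 9))   # -> 3, as on the slide
print(lf(L, C, 8))       # the 9th row ends with s; its s sits in this row of F
```

Equal symbols keep their relative order between L and F, so the k-th `s` in L is the k-th `s` in F; that is all `lf` computes.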

Second ingredient: Backward step
[Figure: the sorted rotation matrix of mississippi#, showing only the columns F and L; the text in between is unknown to the index]
Backward step(i): return LF[i], in O(1) time
T is scanned backward by using the LF-mapping
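Repeating the backward step recovers T from L alone. The sketch below is a minimal illustration with hypothetical names (`lf_table`, `inverse_bwt`); it assumes T ends with a unique terminator that is smaller than every other symbol, so that row 0 of the sorted matrix starts with it.

```python
def lf_table(last):
    """Precompute LF[i] for every row i (0-based) of the BWT string last."""
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    c, total = {}, 0
    for ch in sorted(counts):          # C[ch] = number of smaller symbols
        c[ch] = total
        total += counts[ch]
    seen, table = {}, []
    for ch in last:                    # rank of this occurrence so far
        table.append(c[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return table


def inverse_bwt(last):
    """Scan T backward, one LF step per symbol, starting from the terminator's row."""
    lf = lf_table(last)
    terminator = min(last)             # assumed unique and smallest
    row, chars = 0, []
    for _ in range(len(last) - 1):
        chars.append(last[row])        # symbol preceding the current suffix
        row = lf[row]
    return "".join(reversed(chars)) + terminator


print(inverse_bwt("ipssm#pissii"))  # -> mississippi#
```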

Third ingredient: substring search
P = si
[Figure: the sorted rotations of mississippi# with L = ipssm#pissii; fr points to the first row prefixed by P (sippi#missis) and lr to the last one (sissippi#mis)]
occ = 2 [lr-fr+1]
Count(P[1,p]): finds fr and lr in O(p) time
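Count can be sketched as a backward search that consumes P from its last symbol to its first, shrinking the row interval with two Rank calls per symbol. The names below are hypothetical, rows are 0-based with an inclusive interval [fr, lr], and Rank is again a naive scan, so this version does not achieve the O(p) bound of the real index.

```python
def count(last, pattern):
    """Number of occurrences of pattern in the text whose BWT is last."""
    # C[ch] = number of symbols in last strictly smaller than ch
    c, total = {}, 0
    for ch in sorted(set(last)):
        c[ch] = total
        total += last.count(ch)

    fr, lr = 0, len(last) - 1          # all rows match the empty pattern
    for ch in reversed(pattern):
        if ch not in c:
            return 0
        fr = c[ch] + last[:fr].count(ch)           # Rank(ch, fr)
        lr = c[ch] + last[:lr + 1].count(ch) - 1   # Rank(ch, lr + 1) - 1
        if fr > lr:
            return 0
    return lr - fr + 1


L = "ipssm#pissii"    # BWT of mississippi#
print(count(L, "si"))   # -> 2
print(count(L, "ssi"))  # -> 2
print(count(L, "xyz"))  # -> 0
```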

The Compressed Permuterm
Z = $hat$hip$hop$hot$# (the strings of D, lexicographically sorted)
Build an FM-index on Z to support substring searches
Some queries are trivial...
Prefix(α) = Substring search($α) within Z
Suffix(β) = Substring search(β$) within Z
Substr(γ) = Substring search(γ) within Z

PrefixSuffix search
Key property: the last char of s_i is at L[i+1]
Cyclic-LF[i]: if (i > #D) return LF[i] else return LF[i+1]
[Figure: LF[3] versus CLF[3] for i = 3 on the rows of Z; the text between F and L is unknown]

PrefixSuffix(ho,p)
PrefixSuffix(P): search the FM-index of Z using Cyclic-LF instead of LF
No change in the time/space bounds of compressed indexes
[Figure: from the rows prefixed by $ho, LF and CLF lead to different rows; the text between F and L is unknown]
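The Cyclic-LF idea can be sketched as follows. Everything here is an illustrative assumption rather than the authors' code: the function names, the naive Rank, the 0-based half-open row ranges, and in particular the symbol order ($ before every letter, # after every letter), chosen so that the first #D rows start with $s_1, ..., $s_#D and the slide's key property holds. The search finds the rows prefixed by $α, shifts that range by one row (the cyclic jump to the last chars of the matching strings), and then continues the backward search with β. Matches in which α and β overlap inside a string are not filtered out.

```python
def sort_key(rotation):
    # Assumed order: '$' < letters < '#'
    return [0x110000 if ch == "#" else ord(ch) for ch in rotation]


def build_index(strings):
    """Return (L, C, m) for Z = $s_1$...$s_m$# built over the sorted strings."""
    z = "$" + "$".join(sorted(strings)) + "$#"
    rows = sorted((z[i:] + z[:i] for i in range(len(z))), key=sort_key)
    last = "".join(r[-1] for r in rows)
    c = {}
    for i, r in enumerate(rows):
        c.setdefault(r[0], i)          # first row starting with each symbol
    return last, c, len(strings)


def step(last, c, lo, hi, ch):
    """One backward-search step on the half-open row range [lo, hi)."""
    if ch not in c:
        return 0, 0
    return c[ch] + last[:lo].count(ch), c[ch] + last[:hi].count(ch)


def prefix_suffix_count(last, c, m, alpha, beta):
    """Count the strings that start with alpha and end with beta."""
    lo, hi = 0, len(last)
    for ch in reversed("$" + alpha):   # rows prefixed by $alpha
        lo, hi = step(last, c, lo, hi, ch)
        if lo >= hi:
            return 0
    hi = min(hi, m)                    # keep only the rows $s_1 ... $s_m
    lo, hi = lo + 1, hi + 1            # cyclic jump: row i -> row i+1
    for ch in reversed(beta):          # continue with beta from the string ends
        lo, hi = step(last, c, lo, hi, ch)
        if lo >= hi:
            return 0
    return hi - lo


last, c, m = build_index(["hat", "hip", "hop", "hot"])
print(prefix_suffix_count(last, c, m, "ho", "p"))  # -> 1 (hop)
print(prefix_suffix_count(last, c, m, "h", "t"))   # -> 2 (hat, hot)
```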

Rank and Select of strings
Z = $hat$hip$hop$hot$#
Other queries...
Rank(s) = row of $s$
Select(i) = backward scan from L[i+1]
[Figure: the rows of Z; the text between F and L is unknown]
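Both operations can be sketched on the BWT of Z. As before, this is a hedged illustration: the function names are hypothetical, string ids and rows are 0-based, Rank is a naive scan, and the symbol order ($ before every letter, # after every letter) is an assumption chosen so that row i starts with $ followed by the i-th string.

```python
def sort_key(rotation):
    # Assumed order: '$' < letters < '#'
    return [0x110000 if ch == "#" else ord(ch) for ch in rotation]


def build_index(strings):
    """Return (L, C) for Z = $s_1$...$s_m$# built over the sorted strings."""
    z = "$" + "$".join(sorted(strings)) + "$#"
    rows = sorted((z[i:] + z[:i] for i in range(len(z))), key=sort_key)
    last = "".join(r[-1] for r in rows)
    c = {}
    for i, r in enumerate(rows):
        c.setdefault(r[0], i)          # first row starting with each symbol
    return last, c


def rank_of_string(last, c, s):
    """0-based id of s (the row prefixed by $s$), or None if s is not in D."""
    lo, hi = 0, len(last)
    for ch in reversed("$" + s + "$"):
        if ch not in c:
            return None
        lo = c[ch] + last[:lo].count(ch)
        hi = c[ch] + last[:hi].count(ch)
        if lo >= hi:
            return None
    return lo


def select_string(last, c, i):
    """Return the string with 0-based id i by scanning backward from L[i+1]."""
    row, chars = i + 1, []
    while last[row] != "$":
        ch = last[row]
        chars.append(ch)
        row = c[ch] + last[:row].count(ch)   # LF step
    return "".join(reversed(chars))


last, c = build_index(["hat", "hip", "hop", "hot"])
print(rank_of_string(last, c, "hop"))  # -> 2
print(select_string(last, c, 2))       # -> hop
```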

Experiments
Three dictionaries:
Term dictionary: Trec WT10G
Host dictionary (reversed): UK-2005
Url dictionary (host reversed): first 190Mb of UK-2005

            Term      Host     Url
size        118 Mb    34 Mb    190 Mb
# strings   10 Mil    2 Mil    3 Mil
FC          40%       45%      30%
bzip        33%       25%      10%

PrefixSuffix search needs *2

A test on URLs
[Figure: Trade-off plot; one axis is % dict-size]
Choose your trade-off:
Time of sec/char, and space close to bzip
Time close to Front-Coding (4 sec/char), but <50% of its space
MRS book says: "one disadvantage of the PI is that its dictionary becomes quite large, including as it does all rotations of each term."
Now, they mention CPI

We proposed an approach for dictionary storage:
+ Theory: optimal time and entropy bounds for space
+ Practice: trades time vs. space, thus fitting user needs