Published by Carlos Moreno. Modified over 10 years ago.
1
Paolo Ferragina, Università di Pisa
XML Compression and Indexing
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
[Joint with F. Luccio, G. Manzini, S. Muthukrishnan]
The Future of Web Search, Barcelona, May 2006
Under patenting by Pisa-Rutgers Univ.
2
Compressed Permuterm Index
Paolo Ferragina, Rossano Venturini
Dipartimento di Informatica, Università di Pisa
Under Y! patenting
3
A basic problem
Given a dictionary D of variable-length strings, design a compressed data structure that supports:
  1) string ↔ id
  2) Prefix(α): find all strings in D that are prefixed by α
  3) Suffix(β): find all strings in D that are suffixed by β
  4) Substring(γ): find all strings in D that contain γ
  5) PrefixSuffix(α, β) = Prefix(α) ∩ Suffix(β)
This is the Tolerant Retrieval Problem (wildcards) in the IR book of Manning-Raghavan-Schutze:
  Prefix(α) = α*
  Suffix(β) = *β
  Substring(γ) = *γ*
  PrefixSuffix(α, β) = α*β
4
A basic problem
Hashing: supports exact searches only, not the prefix, suffix, or substring queries
5
A basic problem
(Compacted) Trie:
  Two versions: one for D and one for D^R (the reversed strings), then intersect the answers
  No substring search (unless using a Suffix Trie)
  Need to store D for resolving edge-labels
6
A basic problem
Front coding...
7
Front-coding
  Two versions: one for D and one for D^R, then intersect the answers
  Need some extra data structures for bucket identification
  No substring search

Original strings:
  http://checkmate.com/All_Natural/
  http://checkmate.com/All_Natural/Applied.html
  http://checkmate.com/All_Natural/Aroma.html
  http://checkmate.com/All_Natural/Aroma1.html
  http://checkmate.com/All_Natural/Aromatic_Art.html
  http://checkmate.com/All_Natural/Ayate.html
  http://checkmate.com/All_Natural/Ayer_Soap.html
  http://checkmate.com/All_Natural/Ayurvedic_Soap.html
  http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
  http://checkmate.com/All_Natural/Bath_Salts.html
  http://checkmate.com/All/Essence_Oils.html
  http://checkmate.com/All/Mineral_Bath_Crystals.html
  http://checkmate.com/All/Mineral_Bath_Salt.html
  http://checkmate.com/All/Mineral_Cream.html
  http://checkmate.com/All/Natural/Washcloth.html...

Front-coded (length of the prefix shared with the previous string, then the remaining suffix; 0 starts a new bucket):
   0 http://checkmate.com/All_Natural/
  33 Applied.html
  34 roma.html
  38 1.html
  38 tic_Art.html
  34 yate.html
  35 er_Soap.html
  35 urvedic_Soap.html
  33 Bath_Salt_Bulk.html
  42 s.html
  24 /Essence_Oils.html
  25 Mineral_Bath_Crystals.html
  38 Salt.html
  33 Cream.html
   0 http://checkmate.com/All/Natural/Washcloth.html...

Space on the uk-2002 crawl (250 MB): front-coding 30-35%, bzip 10%. Be back on this, later on!
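The front-coding idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the function names `front_encode` and `front_decode` are hypothetical, and bucketing (restarting with a 0 entry every so many strings) is left out.

```python
def front_encode(sorted_strings):
    """Encode each string as (length of prefix shared with its predecessor, remaining suffix)."""
    encoded, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded


def front_decode(encoded):
    """Rebuild the strings by copying the shared prefix from the previously decoded string."""
    decoded, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        decoded.append(prev)
    return decoded
```

Decoding is sequential within a bucket, which is why the slide mentions extra data structures for bucket identification.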
8
A basic problem
Permuterm Index (Garfield, 76): reduce any query to a prefix query over a larger dictionary
9
Permuterm Index [Garfield, 1976]
Take a dictionary D = {yahoo, google}
  1. Append a special char $ to the end of each string
  2. Generate all rotations of these strings
       yahoo$ ahoo$y hoo$ya oo$yah o$yaho $yahoo
       google$ oogle$g ogle$go gle$goo le$goog e$googl $google
The rotations form the Permuterm Dictionary P[D].
  Prefix(ya) = Prefix($ya)
  Suffix(oo) = Prefix(oo$)
  Substring(oo) = Prefix(oo)
  PrefixSuffix(y,o) = Prefix(o$y)
Any query on D reduces to a prefix-query on P[D]
Space problems
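The reduction above can be tried out with a small uncompressed Permuterm dictionary. The sketch below assumes a sorted Python list of (rotation, original string) pairs searched by binary search; all function names are hypothetical, and none of this addresses the space problem the slide points out.

```python
from bisect import bisect_left


def build_permuterm(dictionary, end="$"):
    """Sorted list of (rotation, original string) pairs, i.e. P[D]."""
    pairs = []
    for s in dictionary:
        t = s + end
        pairs.extend((t[i:] + t[:i], s) for i in range(len(t)))
    return sorted(pairs)


def prefix_search(index, q):
    """Original strings having at least one rotation prefixed by q."""
    i = bisect_left(index, (q,))  # first pair whose rotation is >= q
    hits = set()
    while i < len(index) and index[i][0].startswith(q):
        hits.add(index[i][1])
        i += 1
    return hits


# The four wildcard queries, each reduced to one prefix search on P[D].
def prefix(index, a):
    return prefix_search(index, "$" + a)


def suffix(index, b):
    return prefix_search(index, b + "$")


def substring(index, g):
    return prefix_search(index, g)


def prefix_suffix(index, a, b):
    return prefix_search(index, b + "$" + a)
```

With D = {yahoo, google}, `prefix_suffix(index, "y", "o")` searches for the rotations prefixed by `o$y`, matching the slide's example.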
10
Compressed Permuterm Index [SIGIR 07]
It deploys two ingredients: Permuterm index + Compressed full-text index
Theoretically:
  Query ops take optimal time: proportional to pattern length
  Space occupancy is |D| H_k(D) + o(|D| log |Σ|) bits
Technically:
  A simple reduction step: Permuterm → Compressed index
  Re-use known machinery on compressed indexes
  Achieve bzip-compression at Front-coding speed
11
The Burrows-Wheeler Transform (1994)
Take the text T = mississippi# and write down all its rotations:
  mississippi#
  ississippi#m
  ssissippi#mi
  sissippi#mis
  issippi#miss
  ssippi#missi
  sippi#missis
  ippi#mississ
  ppi#mississi
  pi#mississip
  i#mississipp
  #mississippi
Sort the rows (F = first column, L = last column):
  F            L
  # mississipp i
  i #mississip p
  i ppi#missis s
  i ssippi#mis s
  i ssissippi# m
  m ississippi #
  p i#mississi p
  p pi#mississ i
  s ippi#missi s
  s issippi#mi s
  s sippi#miss i
  s sissippi#m i
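As a sketch of the construction on this slide, the snippet below sorts all rotations explicitly and reads off the two columns. It assumes plain ASCII order (so `#` sorts before the letters) and is quadratic in space, which is fine for a toy text but not how real BWT construction is done.

```python
def bwt(text):
    """Return (F, L): the first and last columns of the sorted rotation matrix of text."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last
```

For T = mississippi# this yields L = ipssm#pissii, the column used in the following slides.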
12
Compressing L is effective
Key observation: L is locally homogeneous, hence L is highly compressible
Bzip vs. Gzip: 20% vs. 33%, but Bzip is slower in (de)compression!
13
The FM-index [Ferragina-Manzini, JACM 05]
The result (p = |P|, occ = number of occurrences of P in T):
  Count(P): O(p) time
  Locate(P): O(occ * polylog(|T|)) time
  Display(T[i,i+L]): O(L + polylog(|T|)) time
  Space occupancy: |T| H_k(T) + o(|T| log |Σ|) bits
Survey of Navarro-Makinen contains many other indexes
New concept: the FM-index is an opportunistic data structure
The main idea is to reduce substring search to some basic operations over arrays of symbols
Compressed Permuterm index builds upon the best two features of the FM-index
14
First ingredient: L → F mapping
  F            L
  # mississipp i
  i #mississip p
  i ppi#missis s
  i ssippi#mis s
  i ssissippi# m
  m ississippi #
  p i#mississi p
  p pi#mississ i
  s ippi#missi s
  s issippi#mi s
  s sippi#miss i
  s sissippi#m i
(the middle part of each row is unknown)
How do we map L's chars onto F's chars? We need to distinguish equal chars in F.
Take two equal chars of L and rotate their rows rightward: the two rows keep the same relative order!
15
First ingredient: L → F mapping
  F = #iiiimppssss
  L = ipssm#pissii (the middle part of each row is unknown)
The oracle: Rank(s, 9) = 3, the number of occurrences of s in L[1,9]
The FM-index is actually a Rank data structure over the BWT: O(1) time and H_k space
16
Second ingredient: Backward step
  F = #iiiimppssss
  L = ipssm#pissii (the middle part of each row is unknown)
Backward step(i): return LF[i], in O(1) time
T is scanned backward by repeatedly applying the LF-mapping
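The first two ingredients can be sketched together: a Rank oracle, the LF-mapping built from it, and a backward scan that rebuilds T from L alone. This is only an illustration under simplifying assumptions: `rank` scans a Python string in linear time instead of using the O(1)-time compressed structure mentioned above, indices are 0-based, and the function names are hypothetical.

```python
from collections import Counter


def rank(L, c, i):
    """Occurrences of c in the first i characters of L (a plain scan here)."""
    return L[:i].count(c)


def lf_mapping(L):
    """LF[i] = row of F holding the same text character as L[i] (0-based rows)."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):  # C[c] = number of characters in L smaller than c
        C[c] = total
        total += counts[c]
    return [C[c] + rank(L, c, i) for i, c in enumerate(L)]


def invert_bwt(L, end="#"):
    """Rebuild T by backward steps, starting from the row that equals T."""
    LF = lf_mapping(L)
    i = L.index(end)  # this row is T itself, so its last character is `end`
    chars = []
    for _ in range(len(L)):
        chars.append(L[i])  # character preceding the current suffix
        i = LF[i]           # backward step
    return "".join(reversed(chars))
```

Equal characters keep their relative order between L and F, which is why `C[c]` plus the rank of the occurrence is enough to locate it in F.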
17
Third ingredient: substring search
P = si
  sorted rows  L
  #mississipp  i
  i#mississip  p
  ippi#missis  s
  issippi#mis  s
  ississippi#  m
  mississippi  #
  pi#mississi  p
  ppi#mississ  i
  sippi#missi  s   <- fr
  sissippi#mi  s   <- lr
  ssippi#miss  i
  ssissippi#m  i
(only L is stored; the rest of each row is unknown)
occ = 2 [lr-fr+1]
Count(P[1,p]): finds the row range [fr, lr] in O(p) time
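A minimal sketch of Count(P) by backward search follows. It assumes 0-based rows and a half-open range [fr, lr), so the number of occurrences is lr - fr rather than lr - fr + 1, and it replaces the constant-time Rank structure with a linear scan of L; `backward_search` is a hypothetical name.

```python
from collections import Counter


def backward_search(L, P):
    """Return the half-open row range (fr, lr) of rows prefixed by P, or None if P is absent."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):  # C[c] = number of characters in L smaller than c
        C[c] = total
        total += counts[c]
    fr, lr = 0, len(L)
    for c in reversed(P):  # one step per pattern character, right to left
        if c not in C:
            return None
        fr = C[c] + L[:fr].count(c)  # Rank(c, fr)
        lr = C[c] + L[:lr].count(c)  # Rank(c, lr)
        if fr >= lr:
            return None
    return fr, lr


def count(L, P):
    """Number of occurrences of P in the text whose BWT is L."""
    rows = backward_search(L, P)
    return 0 if rows is None else rows[1] - rows[0]
```

For P = si on L = ipssm#pissii the range is rows 8 and 9 in 0-based numbering, the two rows marked fr and lr above.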
18
The Compressed Permuterm
Z = $hat$hip$hop$hot$# (the strings of D, lexicographically sorted)
Build an FM-index on Z to support substring searches
Some queries are trivial...
  Prefix(α) = Substring search($α) within Z
  Suffix(β) = Substring search(β$) within Z
  Substr(γ) = Substring search(γ) within Z
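To see why these three queries are trivial, the sketch below runs them as plain substring searches over Z. It is only an illustration of the reduction: Python's `str.find` stands in for the FM-index, the helper names are hypothetical, and the dictionary strings are assumed not to contain `$` or `#`.

```python
def build_z(dictionary):
    """Concatenate the sorted strings as $s1$s2$...$sm$#."""
    return "$" + "$".join(sorted(dictionary)) + "$#"


def strings_matching(Z, pattern):
    """Dictionary strings touched by an occurrence of pattern in Z (plain scan)."""
    found = set()
    pos = Z.find(pattern)
    while pos != -1:
        # The enclosing string starts right after the closest '$' at or before pos.
        start = pos + 1 if Z[pos] == "$" else Z.rfind("$", 0, pos) + 1
        found.add(Z[start:Z.find("$", start)])
        pos = Z.find(pattern, pos + 1)
    return sorted(found)


def prefix(Z, a):
    return strings_matching(Z, "$" + a)


def suffix(Z, b):
    return strings_matching(Z, b + "$")


def substring(Z, g):
    return strings_matching(Z, g)
```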
19
PrefixSuffix search
Key property: the last char of s_i is at L[i+1]
Cyclic-LF[i] (#D = number of strings in D):
  if (i > #D) return LF[i]
  else return LF[i+1]
[Figure: LF[3] and CLF[3] for row i = 3; the middle part of each row is unknown]
20
PrefixSuffix(ho,p)
PrefixSuffix(P): search the FM-index of Z using Cyclic-LF instead of LF
No change in time/space bounds of compressed indexes
[Figure: the rows prefixed by $ho, with the LF and CLF steps]
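A sketch of the PrefixSuffix search with the Cyclic-LF idea is given below. Several things here are assumptions rather than facts from the slides: `$` is taken to sort before every other character and `#` after every other character (which is what makes the key property "last char of s_i is at L[i+1]" hold), the BWT is built by sorting rotations explicitly, Rank is a linear scan, rows are 0-based with half-open ranges, and matches are reported through a stored suffix array instead of a compressed Locate. The sketch also does not check that the prefix and the suffix do not overlap inside a short string.

```python
def order(c):
    """Assumed character order: '$' first, '#' last, everything else in between."""
    return (0, c) if c == "$" else (2, c) if c == "#" else (1, c)


def build_index(dictionary):
    """Return (Z, sa, L, C, m) for Z = $s1$...$sm$#, with the strings sorted."""
    strings = sorted(set(dictionary))
    Z = "$" + "$".join(strings) + "$#"
    n = len(Z)
    sa = sorted(range(n), key=lambda i: [order(c) for c in Z[i:] + Z[:i]])
    L = [Z[i - 1] for i in sa]  # Z[-1] wraps around to '#'
    C, total = {}, 0
    for c in sorted(set(L), key=order):  # C[c] = number of characters smaller than c
        C[c] = total
        total += L.count(c)
    return Z, sa, L, C, len(strings)


def prefix_suffix(index, alpha, beta):
    """Strings of D prefixed by alpha and suffixed by beta, via backward search of beta$alpha."""
    Z, sa, L, C, m = index

    def jump(k):
        # Cyclic-LF shift: rows 0..m-1 are prefixed by $s_i, and the last char of s_i
        # sits one row below, so range boundaries inside that block move down by one.
        return k + 1 if 1 <= k <= m else k

    fr, lr = 0, len(L)
    for c in reversed(beta + "$" + alpha):
        if c not in C:
            return []
        fr, lr = jump(fr), jump(lr)
        fr = C[c] + L[:fr].count(c)
        lr = C[c] + L[:lr].count(c)
        if fr >= lr:
            return []

    def string_at(p):
        start = p + 1 if Z[p] == "$" else Z.rfind("$", 0, p) + 1
        return Z[start:Z.find("$", start)]

    return sorted(string_at(sa[r]) for r in range(fr, lr))
```

On Z = $hat$hip$hop$hot$#, the query PrefixSuffix(ho,p) first narrows the range to the rows prefixed by `$ho`, then the shift moves it to the rows whose L characters are the last characters of hop and hot, and the final step on `p` keeps only hop.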
21
Rank and Select of strings
Z = $hat$hip$hop$hot$#
Other queries...
  Rank(s) = row of $s$
  Select(i) = backward scan from L[i+1]
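The two string-level operations can be sketched on top of the BWT of Z. As before, this is an illustration under assumptions not stated on the slide: `$` sorts before and `#` after every other character, the BWT is built by explicit rotation sorting, Rank over L is a linear scan, and the function names are hypothetical. String ranks are 1-based, as on the slide.

```python
def build_bwt_of_z(dictionary):
    """Return (L, C) for Z = $s1$...$sm$#, assuming '$' smallest and '#' largest."""
    def order(c):
        return (0, c) if c == "$" else (2, c) if c == "#" else (1, c)

    Z = "$" + "$".join(sorted(set(dictionary))) + "$#"
    sa = sorted(range(len(Z)), key=lambda i: [order(c) for c in Z[i:] + Z[:i]])
    L = [Z[i - 1] for i in sa]
    C, total = {}, 0
    for c in sorted(set(L), key=order):
        C[c] = total
        total += L.count(c)
    return L, C


def rank_of_string(L, C, s):
    """1-based lexicographic rank of s in D (the row prefixed by $s$), or None if absent."""
    fr, lr = 0, len(L)
    for c in reversed("$" + s + "$"):
        if c not in C:
            return None
        fr = C[c] + L[:fr].count(c)
        lr = C[c] + L[:lr].count(c)
        if fr >= lr:
            return None
    return fr + 1


def select_string(L, C, i):
    """The i-th string of D (1-based), read backward starting from its last char."""
    row = i  # 0-based row i is the slide's row i+1, whose L char ends s_i
    chars = []
    while L[row] != "$":
        chars.append(L[row])
        row = C[L[row]] + L[:row].count(L[row])  # one backward (LF) step
    return "".join(reversed(chars))
```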
22
Experiments
Three dictionaries:
  Term dictionary: Trec WT10G
  Host dictionary (reversed): UK-2005
  Url dictionary (host reversed): first 190 MB of UK-2005

              Term         Host        Url
  size        118 MB       34 MB       190 MB
  # strings   10 million   2 million   3 million
  FC          40%          45%         30%
  bzip        33%          25%         10%

PrefixSuffix search needs *2 (front coding keeps two versions, for D and for D^R)
24
A test on URLs
Choose your trade-off:
  Time of 20-60 µsec/char, and space close to bzip
  Time close to Front-Coding (4 µsec/char), but <50% of its space
[Figure: trade-off plot; axis label: % dict-size]
MRS book says: "one disadvantage of the PI is that its dictionary becomes quite large, including as it does all rotations of each term."
Now, they mention CPI
25
We proposed an approach for dictionary storage:
  + Theory: optimal time and entropy-bounds for space
  + Practice: trades time vs space, thus fitting user needs