1
On searching and extracting strings from compressed textual data
Rossano Venturini, Dipartimento di Informatica, University of Pisa, Italy
2
The Memory Hierarchy. Accesses at the higher levels are orders of magnitude faster. Data compression helps twice: it fits more data in the high levels, and it speeds up algorithms whenever decompression time < transfer time. Applications should carefully organize their data to reduce transfers.
3
An example. The Human genome has 3 billion pairs of bases (A, T, C, and G): about 700 MBytes, so it fits in RAM. But biologists need tools for data analysis on it, i.e., searching operations: motif search, approximate pattern matching, etc. Data structures (indexes) are mandatory. Example: find the longest tandem substring. Sequential algorithms take O(n^2) time, and n is 3 billion! A suffix tree solves it in O(n) time, but the suffix tree requires 7-15 GBytes and does not fit in RAM. Compressed indexes offer the same functionalities in compressed space, and they become faster than the suffix tree: RAM vs. disk.
4
What do we need? Many applications on texts require access to them and searching operations on them. We want data structures and algorithms that represent texts in a compressed form while permitting efficient access to substrings. Applications then save space without degradation of performance, and get efficient pattern-matching queries, useful to solve harder problems, e.g., approximate or reg-exp pattern matching. Efficient means: avoid decompressing the whole text! Classical data compression approaches fail at this.
5
Random access to compressed text. Given a (static) string S[1,n] drawn from an alphabet of size σ, we want to represent it in a compressed form C which permits us to extract any m-long substring of S in optimal O(1 + m / log_σ n) time. Applications can then replace S with C and retrieve S's substrings without any degradation of performance. RAM model: CPU + internal memory; any primitive operation among O(1) words takes O(1) time; memory references have equal cost. No cheating: the word size is O(log n) bits.
6
The 0-th order empirical entropy H_0(S). It is the best you can hope for with a memoryless compressor, and Huffman coding is close to this bound. Example: S = abababababac (n = 12); the frequency of each symbol c in S determines its codeword in table T:

symbol  frequency  codeword
a       6/12       0
b       5/12       10
c       1/12       11

C = 010010010010010011. Adding pointers into C gives random access: jump to a codeword, then decode. But H_0 cannot distinguish between a^y b^x and any other random string with the same number of a's and b's. Better compression uses a codeword that depends on the k symbols preceding the one to be compressed (its context).
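To make this concrete, here is a minimal Python sketch (mine, not from the talk) that computes H_0 of the example string and compares n·H_0 with the cost of the code in the table above:

```python
import math
from collections import Counter

def h0(s):
    """0-th order empirical entropy: sum_c (n_c/n) * log2(n/n_c) bits/symbol."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

S = "abababababac"
code = {"a": "0", "b": "10", "c": "11"}   # the table above
C = "".join(code[ch] for ch in S)
print(C, len(C))           # 010010010010010011, 18 bits
print(len(S) * h0(S))      # n*H_0 ~ 15.9 bits: the code is near-optimal
```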
7
The k-th order empirical entropy H_k(S). For any length-k string w, let w_S be the string of symbols that follow the occurrences of w in S. Example: S = abababababac, k = 2, w = ab gives ab_S = aaaaa. Compressing every w_S up to its H_0 yields better compression:

T(ab): symbol a, frequency 5/5, codeword 0 (b and c unused)

but NO random access: to decode a symbol you first need to know its context. Random access together with k-th order empirical entropy? That is our scheme: a blocking approach.
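A similarly minimal sketch (mine) of H_k under this definition, grouping each symbol by its length-k context:

```python
import math
from collections import Counter, defaultdict

def h0(s):
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

def hk(s, k):
    """k-th order empirical entropy: n*H_k(S) = sum over contexts w of |w_S| * H_0(w_S)."""
    ctx = defaultdict(list)
    for i in range(k, len(s)):
        ctx[s[i - k:i]].append(s[i])     # symbol s[i] seen in context s[i-k:i]
    return sum(len(ws) * h0("".join(ws)) for ws in ctx.values()) / len(s)

S = "abababababac"
print(hk(S, 2))   # ~0.30 bits/symbol; context "ab" is followed only by a's (H_0 = 0)
```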
8
Our storage scheme [Ferragina-Venturini SODA '07, TCS '07]. Split S into blocks of b = ½ log_σ n symbols each, so that #blocks = n/b = O(n / log_σ n) and #distinct blocks = O(σ^b) = O(n^½). Each distinct block gets a codeword whose length grows as its frequency decreases (0, 1, 00, 01, 10, 11, 000, ...); C is the concatenation of the blocks' codewords, and a pointer array P marks where each codeword starts (e.g., P = 1, 1, 2, 2, 4, 5, 6, ...). Decoding is easy: follow the pointer, read the codeword, look it up in the table - constant time per block, giving O(1)-time access.
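A toy Python sketch of the idea (my simplification, not the paper's exact construction): frequency-ranked codewords per distinct block, one pointer per block. Since the pointers delimit the codewords, the codes need not be prefix-free.

```python
import math
from collections import Counter

def build(s):
    """Toy blocked storage: b = 1/2 log_sigma n, frequency-ranked block codes,
    one pointer per block."""
    n, sigma = len(s), max(len(set(s)), 2)
    b = max(1, int(0.5 * math.log(n, sigma)))
    blocks = [s[i:i + b] for i in range(0, n, b)]
    ranked = [blk for blk, _ in Counter(blocks).most_common()]
    code = {blk: format(r, "b") for r, blk in enumerate(ranked, start=1)}
    table = {v: k for k, v in code.items()}    # decoding table T
    C, P = "", []
    for blk in blocks:
        P.append(len(C))                       # pointer to this codeword's start
        C += code[blk]
    return C, P, table

def access(C, P, table, i):
    """O(1) access to the i-th block: jump via pointer, decode via table T."""
    end = P[i + 1] if i + 1 < len(P) else len(C)
    return table[C[P[i]:end]]

C, P, T = build("abababababac" * 10)
print(access(C, P, T, 5))    # -> "ac", the 6th block of the toy text
```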
9
Space analysis. The block table T has O(σ^b) = O(n^½) entries; each entry is represented with O(log n) bits, so T requires O(n^½ log n) = o(n) bits. We use a two-level storage scheme for the pointers, taking o(n log σ) bits. The real challenge is to bound |C|. Introduce a statistical encoder E_k(S), which is easy to bound: |E_k(S)| ≤ n H_k(S) + o(n log σ) bits for any k = o(log_σ n). Since |C| is within lower-order terms of |E_k(S)|, we get |C| ≤ n H_k(S) + o(n log σ) bits: always better than storing S plainly in the RAM model.
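Plugging assumed concrete numbers into these bounds (a sanity check of mine, not from the slides): for a DNA-like text with n = 2^30 and σ = 4,

```python
import math

n, sigma = 2**30, 4                        # assumed: a ~1-Gsymbol DNA-like text
b = int(0.5 * math.log(n, sigma))          # b = 7 (= 1/2 log_sigma n)
distinct = sigma ** b                      # sigma^b = 16384 <= sqrt(n) = 32768
table_bits = distinct * math.ceil(math.log2(n))   # O(sqrt(n) log n) = o(n) bits
print(b, distinct, table_bits)             # 7 16384 491520 -- tiny vs. n
```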
10
Open problems: I/O-efficient random access. External Memory model (I/O model): CPU + internal memory + slow disk. The memory has size M (with M << n); disk and memory are divided into blocks of size B; the CPU can only operate on data in memory. An algorithm can make memory-transfer operations: read one block from disk to memory, or write one block from memory to disk. Algorithm cost = number of memory transfers required.
11
Open problems: I/O-efficient random access. An optimal solution would access any O(B log_σ n)-long substring of S with O(1) I/Os. Our scheme might work in the I/O model too, but it has three inefficiencies: its blocking approach reduces overall compression by posing a limit on k; it does not exploit the free internal-memory operations; and table T may not fit in internal memory. A different approach: compress every symbol with a model that fits in memory. Can we find the optimal such model in O(n) time? Note that contexts may have variable length.
12
(Compressed) string indexing. String-matching problem: given a text S[1,n], we wish to devise a (compressed) representation of S that efficiently supports the following operations:
✔ Count(P): how many times does string P[1,p] occur in S as a substring?
✔ Locate(P): list the positions of the occurrences of P[1,p] in S.
✔ Extract(i,j): print S[i,j].
Time-efficient solutions, but not compressed:
✔ suffix arrays, suffix trees, ...
✔ Θ(n log n) bits - in practice 5-20 times |S|.
Space-efficient solutions, but not time-efficient:
✔ zgrep: uncompress-and-scan algorithms.
Compressed indexes give both.
13
Suffix array. S = MISSISSIPPI#. SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3: each entry points to a suffix, and the suffixes appear in lexicographic order: #, I#, IPPI#, ISSIPPI#, ISSISSIPPI#, MISSISSIPPI#, PI#, PPI#, SIPPI#, SISSIPPI#, SSIPPI#, SSISSIPPI#. Property: all suffixes of S having prefix P are contiguous in SA. Basic idea: find the suffixes prefixed by P. Search: O(log n) binary-search steps, O(p) character comparisons per step, thus O(p log n) time. Space: Θ(n log n) bits.
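A toy Python sketch of this slide (naive O(n² log n) construction, for illustration only; the binary searches need Python 3.10+ for bisect's key argument):

```python
from bisect import bisect_left, bisect_right

S = "MISSISSIPPI#"
SA = sorted(range(len(S)), key=lambda i: S[i:])   # suffix starts in lex order

def count(P):
    """Two binary searches, O(p log n): the SA range whose suffixes start with P."""
    lo = bisect_left(SA, P, key=lambda i: S[i:])
    hi = bisect_right(SA, P, key=lambda i: S[i:i + len(P)])
    return hi - lo, SA[lo:hi]

print(count("SI"))   # (2, [6, 3]): "SI" occurs at 0-based positions 6 and 3
```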
14
Space occupancy. SA + S take Θ(n log n) bits. Can we do better? The number of permutations of {1,2,...,n} is n!, but not all permutations are valid SAs: #SAs = #texts of length n. Lower bound from #texts: Ω(n log σ) bits. Lower bound from entropy: Ω(n H_k(S)) bits, and n H_k(S) bits << n log n bits.
15
Compressed indexes. Functionalities of the SA in compressed space. Three families: FM-indexes [Ferragina-Manzini FOCS '00, JACM '05], based on the BWT; CSAs [Grossi-Vitter STOC '00, Sadakane SODA '02], based on the BWT; LZ-index [Navarro SPIRE '02], based on LZ78.
16
The Burrows-Wheeler Transform [Burrows-Wheeler, TR 1994]. Let us be given S = mississippi#. Form all cyclic rotations of S and sort them lexicographically; L is the last column of the sorted matrix:

#mississipp i
i#mississip p
ippi#missis s
issippi#mis s
ississippi# m
mississippi #
pi#mississi p
ppi#mississ i
sippi#missi s
sissippi#mi s
ssippi#miss i
ssissippi#m i

So L = ipssm#pissii.
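A minimal Python sketch of the transform, plus the naive inversion that anticipates the reversibility claim on the next slide:

```python
def bwt(s):
    """BWT: last column of the lexicographically sorted rotation matrix."""
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def ibwt(l):
    """Naive inversion: repeatedly prepend L as a column and re-sort the rows;
    after len(l) rounds the rows are the sorted rotations, and the row ending
    with the terminator '#' is the original string."""
    rows = [""] * len(l)
    for _ in range(len(l)):
        rows = sorted(l[i] + rows[i] for i in range(len(l)))
    return next(r for r in rows if r.endswith("#"))

L = bwt("mississippi#")
print(L)           # ipssm#pissii
print(ibwt(L))     # mississippi#
```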
17
L is highly compressible: e.g., on Shakespeare's Hamlet, L is locally homogeneous and thus highly compressible. Moreover, L is reversible: S can be recovered from L alone.
18
FM-indexes [Ferragina-Manzini FOCS '00, JACM '05]. The i-th symbol of L precedes the i-th suffix in lexicographic order, i.e., L[i] = S[SA[i] - 1]; for example, L[3] = S[SA[3] - 1] = S[7]. Hence L encodes both SA and S. Can we search within L? The FM-index reduces searching operations on S to "counting" operations on L: rank(c,i) = # occurrences of symbol c in L[1,i], e.g., rank(s,9) = 3. Many solutions answer rank in O(1) time within H_k space.
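A toy sketch of the resulting counting procedure (backward search). Here rank is computed by scanning L; a real FM-index answers it in O(1) time with an H_k-space data structure:

```python
def bwt(s):
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def backward_search(L, P):
    """Count occurrences of P in S using only counting ops on L = BWT(S)."""
    C, tot = {}, 0
    for c in sorted(set(L)):        # C[c] = # symbols in L smaller than c
        C[c] = tot
        tot += L.count(c)
    rank = lambda c, i: L[:i].count(c)   # occurrences of c in L[0:i]
    lo, hi = 0, len(L)              # SA range [lo, hi) matching the current
    for c in reversed(P):           # (growing) suffix of P
        if c not in C:
            return 0
        lo, hi = C[c] + rank(c, lo), C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

L = bwt("mississippi#")
print(backward_search(L, "si"))     # 2: "si" occurs twice in mississippi#
```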
19
FM-index vs. suffix array:

              FMI                           SA
Space         n H_k(S) + o(n log σ) bits    Θ(n log n) bits
Count(P[1,p]) O(p)                          O(p log n)
Locate(P)     O(log^{1+ε} n) per occ        O(1) per occ
Extract(i,j)  O((j-i) + log^{1+ε} n)        O(j-i)

These studies were at the theoretical stage; experimental and algorithmic-engineering effort followed.
21
Are they of practical impact? Count(P) takes ≈2 microsecs/char within ≈50% of the original space: about 5 times slower than a suffix array. Locate(P) takes ≈15 microsecs/occ within ≈70% space: about 500 times slower. Extract runs at ≈1 MB/sec. Several time/space trade-offs are available.
22
Open problems. Theoretical: Locate in O(1) time per occurrence? Current solutions pay a polylogarithmic factor per occurrence, i.e., are ~100 times slower than the SA; the LZFM-index achieves O(1) time per occurrence but takes O(n H_k(S) log^ε n) bits of space. Can we compress the SA into H_k space with fast access? I/O-efficient compressed indexes: they may have to operate in external memory. The String B-tree, the self-adjusting suffix tree, and the cache-oblivious String B-tree are optimal in time but not compressed: can their I/O-efficiency be combined with the space efficiency of compressed indexes? Practical: find new applications, e.g., word/symbol predictors. Others?
23
Dictionary of strings. Given a dictionary D of strings of variable length, compress it so that we can efficiently support Id ↔ string mappings, and prefix, suffix, and prefix-suffix searches, e.g., Prefix(GC), Suffix(A), PrefixSuffix(GC*A). Hashing: needs D itself to avoid false positives, supports no prefix/suffix searches, and still needs Ids. (Compact) tries: need node/edge pointers, need Ids, and need D to retrieve the edges' labels; Suffix search needs a (compact) trie on the reversed dictionary D^R, and Prefix-Suffix search needs intersecting the two! A TST requires 4-8 bytes per char.
24
Permuterm index [Garfield, 1976]. Take a dictionary D = {yahoo, google}. Append a special char $ to the end of each string and generate all rotations of these strings:

yahoo$ ahoo$y hoo$ya oo$yah o$yaho $yahoo
google$ oogle$g ogle$go gle$goo le$goog e$googl $google

This is the permuterm dictionary; it reduces all operations to prefix searches:
Prefix(ya) --> Prefix($ya)
Suffix(le) --> Prefix(le$)
PrefixSuffix(ya*oo) --> Prefix(oo$ya)
Substring(oo) --> Prefix(oo)
Drawback: space problems.
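A minimal Python sketch (helper names mine) of this classical, uncompressed permuterm idea: keep the rotations sorted and answer every query with a prefix search:

```python
from bisect import bisect_left

D = ["yahoo", "google"]
# permuterm dictionary: every rotation of every term (with terminator '$')
rot = sorted({(t + "$")[i:] + (t + "$")[:i] for t in D for i in range(len(t) + 1)})

def prefix_search(q):
    """All rotations starting with q, un-rotated back to dictionary terms."""
    i = bisect_left(rot, q)
    out = set()
    while i < len(rot) and rot[i].startswith(q):
        r = rot[i]
        j = r.index("$")
        out.add(r[j + 1:] + r[:j])   # un-rotate: chars after '$' + chars before
        i += 1
    return out

print(prefix_search("$ya"))    # Prefix(ya)          -> {'yahoo'}
print(prefix_search("le$"))    # Suffix(le)          -> {'google'}
print(prefix_search("oo$ya"))  # PrefixSuffix(ya*oo) -> {'yahoo'}
print(prefix_search("oo"))     # Substring(oo)       -> {'yahoo', 'google'}
```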
25
Compressed permuterm index [Ferragina-Venturini SIGIR '07]. Permuterm index approach + compressed indexes + a novel BWT property = queries in optimal time and H_k space. On a URL dictionary: ≈50K chars/sec with space close to bzip, or time close to Front-Coding (≈250K chars/sec) in <50% of its space - a trade-off. The Manning-Raghavan-Schütze IR book says: "one disadvantage of the permuterm index is that its dictionary becomes quite large, including as it does all rotations of each term"; now they mention the CPI. Under Y! patenting. Open: an I/O-efficient compressed permuterm index.
26
What I did: a storage scheme which provides fast access to compressed data; the Pizza&Chili site and an experimental comparison among known compressed indexes; a compressed index which provides fast queries on dictionaries of strings.
27
Thank you