
1 On searching and extracting strings from compressed textual data Rossano Venturini Dipartimento di Informatica University of Pisa, Italy

2 The Memory Hierarchy Accesses are orders of magnitude faster at the higher levels. Data compression: ● more data fits in the high levels ● algorithms speed up whenever decompression time < transfer time. Applications should carefully organize their data to reduce transfers.

3 An example Human genome: 3 billion base pairs (A, T, C, G), about 700 Mbytes. It fits in RAM, but biologists need tools for data analysis on it: searching operations (motif search, approximate pattern matching, etc.). Data structures (indexes) are mandatory. Example: find the longest tandem substring. Sequential algorithms take O(n²) time, and n is 3 billion! A suffix tree takes O(n) time, but it requires 7-15 Gbytes and does not fit in RAM. Compressed indexes: ● same functionalities in compressed space ● they become faster than the suffix tree (RAM vs Disk).

4 What do we need? Many applications on texts require ● access to the text ● searching operations on it. We want data structures and algorithms that represent texts in compressed form while permitting ● efficient access to substrings, so applications save space without degradation of performance ● efficient pattern-matching queries, useful to solve more difficult problems, e.g. approximate or regexp pattern matching. Efficient means: avoid whole decompression! Classical data compression approaches fail here.

5 Random access to compressed text Given a (static) string S[1,n] drawn from an alphabet Σ of size σ, we want to represent it in a compressed form C which permits us to extract any m-long substring of S in O(1 + m / log_σ n) optimal time. Applications can then replace S with C and retrieve S's substrings without any degradation of performance. RAM model: ● CPU + internal memory ● any primitive operation among O(1) words takes O(1) time ● memory references have equal cost ● no cheating! the word size is O(log n) bits.
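As a point of reference (my addition, not part of the talk): even without compression, packing symbols into Θ(log n)-bit machine words already yields the stated extraction time, since each word holds Θ(log_σ n) symbols. A minimal Python sketch of this packed, uncompressed baseline:

```python
import math

def pack(s):
    """Pack string s into 64-bit words, ceil(log2 sigma) bits per symbol."""
    alphabet = sorted(set(s))
    code = {c: i for i, c in enumerate(alphabet)}
    bits = max(1, math.ceil(math.log2(len(alphabet))))
    per_word = 64 // bits                      # Theta(log_sigma n) symbols per word
    words = []
    for i in range(0, len(s), per_word):
        w = 0
        for j, c in enumerate(s[i:i + per_word]):
            w |= code[c] << (j * bits)
        words.append(w)
    return words, alphabet, bits, per_word

def extract(words, alphabet, bits, per_word, i, m):
    """Return the m symbols starting at position i (0-based): touches
    O(1 + m / per_word) words, i.e. O(1 + m / log_sigma n) of them."""
    out = []
    for p in range(i, i + m):
        w = words[p // per_word]
        out.append(alphabet[(w >> ((p % per_word) * bits)) & ((1 << bits) - 1)])
    return "".join(out)

words, alphabet, bits, per_word = pack("mississippi")
print(extract(words, alphabet, bits, per_word, 4, 4))   # 'issi'
```

The compressed representation C of the next slides has to match this time bound while using space close to the entropy of S.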

6 The 0-th order empirical entropy H_0 is the best you can hope for with a memoryless compressor; Huffman coding gets close to this bound. Example: S = abababababac, with the code table symbol a: frequency 6/12, codeword 0; b: 5/12, codeword 10; c: 1/12, codeword 11, giving C = 010010010010010011. Adding pointers into C gives random access: jump, then decode. H_0 cannot distinguish between a^y b^x and any other random string with the same number of a's and b's. Better compression comes from using a codeword that depends on the k symbols preceding the one to be compressed (its context).
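For concreteness, a small Python sketch (my addition) that computes H_0 and encodes the example string with the prefix code of the slide:

```python
from collections import Counter
from math import log2

def h0(s):
    """0-th order empirical entropy: sum over symbols of (n_c/n) * log2(n/n_c)."""
    n, freq = len(s), Counter(s)
    return sum(f / n * log2(n / f) for f in freq.values())

S = "abababababac"
print(h0(S))                              # ~1.32 bits per symbol

code = {"a": "0", "b": "10", "c": "11"}   # the code of the slide
C = "".join(code[c] for c in S)
print(C, len(C))                          # 18 bits vs 12*2 = 24 bits with a fixed 2-bit code
```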

7 The k-th order empirical entropy For any string w, w_S = the string of symbols following the occurrences of substring w in S. Example: S = abababababac, k = 2, w = ab → ab_S = aaaaa. Compressing every w_S up to its H_0 gives better compression but NO random access: for the context ab, the table T(ab) assigns a frequency 5/5 and the empty codeword, so the five symbols following ab cost 0 bits, but when accessing a position of C we do not know which context it was encoded with. Random access together with k-th order empirical entropy? Our scheme: a blocking approach.
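Again as an illustration (my addition, not the author's code), H_k can be computed by grouping each symbol under its length-k context and summing the 0-th order entropies of the context strings:

```python
from collections import Counter, defaultdict
from math import log2

def h0(s):
    n, freq = len(s), Counter(s)
    return sum(f / n * log2(n / f) for f in freq.values())

def hk(s, k):
    """H_k(S) = (1/|S|) * sum over contexts w of |w_S| * H_0(w_S)."""
    ctx = defaultdict(list)
    for i in range(k, len(s)):
        ctx[s[i - k:i]].append(s[i])      # s[i] follows the context s[i-k..i-1]
    return sum(len(ws) * h0(ws) for ws in ctx.values()) / len(s)

S = "abababababac"
# contexts for k=2: 'ab' -> 'aaaaa' (H_0 = 0), 'ba' -> 'bbbbc'
print(h0(S), hk(S, 2))                    # ~1.32 vs ~0.30 bits per symbol
```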

8 Our storage scheme [Ferragina-Venturini SODA-TCS '07] Split S into blocks of b symbols: ● b = ½ log_σ n ● # blocks = n/b = O(n / log_σ n) ● # distinct blocks = O(σ^b) = O(n^½). The distinct blocks are stored in a table T ordered by frequency; in C each block of S is replaced by the codeword of its table entry (frequent blocks get shorter codewords, the most frequent one the empty codeword), and a pointer array P marks where each codeword starts. Decoding is easy: follow the pointer, read the codeword, look it up in T. Access and decode take constant time per block.
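The following Python sketch (my addition, a simplification of the scheme: codewords are plain binary ranks and the pointer array is flat, instead of the two-level structure of the next slide) shows the blocking idea end to end:

```python
from collections import Counter
from math import ceil, log2

def build(S):
    """Split S into blocks of b ~ 0.5*log_sigma(n) symbols; distinct blocks get
    codewords by decreasing frequency (the rank written in binary, empty codeword
    for the most frequent block); pointers delimit the codewords in C."""
    n, sigma = len(S), len(set(S))
    b = max(1, ceil(0.5 * log2(n) / log2(sigma)))
    blocks = [S[i:i + b] for i in range(0, n, b)]
    by_freq = [blk for blk, _ in Counter(blocks).most_common()]
    code = {blk: (format(r, "b") if r else "") for r, blk in enumerate(by_freq)}
    C = "".join(code[blk] for blk in blocks)
    ptr, pos = [0], 0
    for blk in blocks:
        pos += len(code[blk])
        ptr.append(pos)                   # starting bit of each codeword in C
    table = {cw: blk for blk, cw in code.items()}
    return C, ptr, table, b

def access(C, ptr, table, b, i):
    """Return the block containing position i: one pointer read + one table lookup."""
    j = i // b
    return table[C[ptr[j]:ptr[j + 1]]]

C, ptr, table, b = build("abababababac")
print(b, repr(C), access(C, ptr, table, b, 10))   # b=2, C='1', block 'ac'
```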

9 Space analysis Blocks table T: ● O(σ^b) = O(n^½) entries ● each entry is represented with O(log n) bits ● T requires O(n^½ log n) = o(n) bits. We use a two-level storage scheme for the pointers: o(n log σ) bits. The real challenge is to bound |C|. Introduce a statistical encoder E_k(S), which is easy to bound: |E_k(S)| ≤ n H_k(S) + o(n log σ) bits for any k ≤ α log_σ |S| (α < 1 a constant), and show that |C| ≤ |E_k(S)| + o(n log σ) bits. Hence the scheme is always better than storing S plain on the RAM model.

10 Open problems: I/O-efficient random access External Memory Model (I/O model): CPU + internal memory + slow disk ● the memory has size M ● disk and memory are divided into blocks of size B ● the CPU can only operate on memory (M << n) ● the algorithm can make memory-transfer operations: read one block from disk to memory, write one block from memory to disk ● algorithm cost: # memory transfers required.

11 Open problems: I/O-efficient random access An optimal solution should access any Θ(B log_σ n)-long substring of S with O(1) I/Os. Our scheme might work in the I/O model too, but it has three inefficiencies: ● its blocking approach reduces overall compression by putting a limit on k ● it does not exploit the free internal-memory operations ● the table T may not fit in internal memory. A different approach: compress every symbol with a model that fits in memory, where contexts may have variable length. Can we find the optimal such model in O(n) time?

12 (Compressed) string indexing String-matching problem: given a text S[1,n], we wish to devise a (compressed) representation for S that efficiently supports the following operations: ✔ Count(P): how many times does string P[1,p] occur in S as a substring? ✔ Locate(P): list the positions of the occurrences of P[1,p] in S ✔ Extract(i,j): print S[i,j]. Time-efficient solutions, but not compressed: ✔ suffix arrays, suffix trees, ... ✔ Θ(n log n) bits, in practice 5-20 times |S|. Space-efficient solutions, but not time-efficient: ✔ zgrep: uncompress and run a scan-based algorithm. Compressed indexes give both.

13 Suffix array S = MISSISSIPPI#. The suffix array SA lists the starting positions of the suffixes of S in lexicographic order: SA = 12 11 8 5 2 1 10 9 7 4 6 3, corresponding to the suffixes #, I#, IPPI#, ISSIPPI#, ISSISSIPPI#, MISSISSIPPI#, PI#, PPI#, SIPPI#, SISSIPPI#, SSIPPI#, SSISSIPPI#. Property: all suffixes of S having prefix P are contiguous in SA. Basic idea: find the suffixes prefixed by P. Search: O(log n) binary-search steps, O(p) character comparisons per step, thus O(p log n) time. Space: Θ(n log n) bits.
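A toy Python version of this slide (my addition; quadratic construction, only for illustration, since real suffix-array construction runs in O(n) time):

```python
def suffix_array(s):
    """Toy O(n^2 log n) construction: sort the suffixes directly."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])   # 1-based positions

def count(s, sa, p):
    """Suffixes prefixed by p are contiguous in sa: two O(p log n) binary searches."""
    def first(strict):
        # smallest index whose suffix-prefix is > p (strict) or >= p (not strict)
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = s[sa[mid] - 1: sa[mid] - 1 + len(p)]
            if pref < p or (strict and pref == p):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return first(True) - first(False)

S = "MISSISSIPPI#"
sa = suffix_array(S)
print(sa)                   # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(count(S, sa, "SI"))   # 2
```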

14 Space occupancy For S = MISSISSIPPI#, SA = 12 11 8 5 2 1 10 9 7 4 6 3. SA + S take Θ(n log n) bits. Can we do better? ● # permutations of {1,2,...,n} = n! ● not all permutations are valid SAs ● # SAs = # texts of length n over Σ ● lower bound from # texts = Ω(n log σ) bits ● lower bound from entropy = Ω(n H_k(S)) bits, and n H_k(S) bits << n log n bits.

15 Compressed indexes Functionalities of the SA in compressed space. Three families: ● FM-indexes [Ferragina-Manzini FOCS '00, JACM '05], based on the BWT ● CSAs [Grossi-Vitter STOC '00, Sadakane SODA '02], based on the BWT ● LZ-index [Navarro SPIRE '02], based on LZ78.

16 The Burrows-Wheeler Transform [TR 1994] Let S = mississippi#. Form all rotations of S: mississippi#, ississippi#m, ssissippi#mi, sissippi#mis, issippi#miss, ssippi#missi, sippi#missis, ippi#mississ, ppi#mississi, pi#mississip, i#mississipp, #mississippi. Sort the rows: #mississippi, i#mississipp, ippi#mississ, issippi#miss, ississippi#m, mississippi#, pi#mississip, ppi#mississi, sippi#missis, sissippi#mis, ssippi#missi, ssissippi#mi. The last column L = ipssm#pissii is the BWT of S.
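The transform in a few lines of Python (my addition, illustrative only; real implementations never materialize the n×n rotation matrix):

```python
def bwt(s):
    """Burrows-Wheeler Transform: last column of the sorted matrix of rotations."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

print(bwt("mississippi#"))   # 'ipssm#pissii'  ('#' is the smallest symbol in ASCII)
```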

17 L is highly compressible Example: the BWT of Shakespeare's Hamlet. L is locally homogeneous, thus highly compressible, and L is reversible: S can be recovered from L alone.
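A sketch of why L is reversible (my addition): the naive quadratic inversion repeatedly prepends L to the sorted table of columns, rebuilding the sorted rotations; real implementations invert in O(n) time via the LF-mapping.

```python
def ibwt(L):
    """Naive inversion: prepend L and re-sort n times to rebuild all sorted
    rotations, then pick the one ending with the terminator '#'."""
    table = [""] * len(L)
    for _ in range(len(L)):
        table = sorted(L[i] + table[i] for i in range(len(L)))
    return next(row for row in table if row.endswith("#"))

print(ibwt("ipssm#pissii"))   # 'mississippi#'
```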

18 FM-indexes [Ferragina-Manzini FOCS '00, JACM '05] The i-th symbol of L precedes the i-th smallest suffix, i.e. L[i] = S[SA[i]-1]; for S = mississippi#, L[3] = S[SA[3]-1] = S[7] = s. So L encodes both SA and S: can we search within L? The FM-index reduces searching operations in S to counting (rank) operations in L, where rank(c,i) = # occurrences of symbol c in L[1,i]; for example rank(s,9) = 3 in L = ipssm#pissii. Many solutions answer rank in O(1) time within H_k space.
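A toy backward-search counter (my addition; rank is computed by scanning here, whereas a real FM-index answers it in O(1) on a compressed representation of L):

```python
def fm_count(L, P):
    """Count occurrences of P by backward search over the BWT L."""
    C = {c: sum(x < c for x in L) for c in set(L)}   # where c's run starts in column F
    rank = lambda c, i: L[:i].count(c)               # occurrences of c in L[1..i]
    lo, hi = 0, len(L)                               # current suffix range [lo, hi)
    for c in reversed(P):
        if c not in C:
            return 0
        lo, hi = C[c] + rank(c, lo), C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

L = "ipssm#pissii"          # BWT of 'mississippi#'
print(fm_count(L, "si"))    # 2
```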

19 FM-index vs suffix array

                Space                    Count(P[1,p])   Locate(P)                 Extract(i,j)
FM-index        n H_k(S) + o(n log σ)    O(p)            O(log^{1+ε} n) per occ    O((j-i) + log^{1+ε} n)
Suffix array    Θ(n log n) bits          O(p log n)      O(1) per occ              O(j-i)

These results were at the theoretical stage; an experimental and algorithmic-engineering effort was needed.

20

21 Are they of practical impact? Count(P) takes about 2 microsecs/char at ≈ 50% space (about 5 times slower than a suffix array); Extract runs at about 1 Mbyte/sec; Locate(P) takes about 15 microsecs/occ at ≈ 70% space (about 500 times slower). Several time/space trade-offs are available.

22 Open problems Theoretical: ● Locate in O(1) time per occurrence — current solutions pay a polylogarithmic factor per occurrence and are about 100 times slower than the SA; the LZFM-index achieves O(1) time per occurrence but uses O(n H_k(S) log^ε n) bits of space. Can we compress the SA within H_k space with fast access? ● I/O-efficient compressed indexes — they may have to operate in external memory. The String B-tree, self-adjusting suffix tree, and cache-oblivious String B-tree are optimal in time but not compressed. Can we combine their I/O-efficiency with the space efficiency of compressed indexes? Practical: ● find new applications, e.g. word/symbol predictors ● others?

23 Dictionary of strings Given a dictionary D of strings of variable length, compress it so that we can efficiently support the Id ↔ string mapping and prefix, suffix, and prefix-suffix searches, e.g. Prefix(GC), Suffix(A), PrefixSuffix(GC*A). ● Hashing: needs D itself to avoid false positives, gives no prefix/suffix searches, and still needs the Ids. ● (Compact) tries: need node/edge pointers, need the Ids, and need D to retrieve the edge labels; a ternary search tree requires 4-8 bytes per char. Suffix search needs a second (compact) trie on the reversed strings D^R, and prefix-suffix search needs intersecting the two.

24 Permuterm index [Garfield, 1976] Take a dictionary D = {yahoo, google}: ● append a special char $ to the end of each string ● generate all rotations of these strings: yahoo$, ahoo$y, hoo$ya, oo$yah, o$yaho, $yahoo, google$, oogle$g, ogle$go, gle$goo, le$goog, e$googl, $google. This permuterm dictionary reduces all operations to prefix searches: Prefix(ya) → Prefix($ya), Suffix(le) → Prefix(le$), PrefixSuffix(ya*oo) → Prefix(oo$ya), Substring(oo) → Prefix(oo). The drawback is space: every rotation of every term is stored.
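A small Python sketch (my addition, using a plain sorted list instead of the compressed machinery of the next slide) showing how all four queries reduce to a prefix search over the rotations:

```python
from bisect import bisect_left

def build_permuterm(D):
    """All rotations of w + '$' for every w in D, kept sorted for prefix search."""
    rot = []
    for w in D:
        t = w + "$"
        rot += [(t[i:] + t[:i], w) for i in range(len(t))]
    return sorted(rot)

def prefix_query(rot, q):
    """Dictionary strings having some rotation that starts with q."""
    i = bisect_left(rot, (q,))
    out = set()
    while i < len(rot) and rot[i][0].startswith(q):
        out.add(rot[i][1])
        i += 1
    return out

rot = build_permuterm(["yahoo", "google"])
print(prefix_query(rot, "$ya"))     # Prefix(ya)          -> {'yahoo'}
print(prefix_query(rot, "le$"))     # Suffix(le)          -> {'google'}
print(prefix_query(rot, "oo$ya"))   # PrefixSuffix(ya*oo) -> {'yahoo'}
print(prefix_query(rot, "oo"))      # Substring(oo)       -> {'yahoo', 'google'}
```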

25 Compressed permuterm index [Ferragina-Venturini SIGIR '07] Permuterm index approach + compressed indexes + a novel BWT property = queries in optimal time and H_k space. On a URL dictionary it runs at 50K chars/sec with space close to bzip; its time is close to front coding (250K chars/sec) but it uses less than 50% of its space, with trade-offs in between. The IR book by Manning, Raghavan and Schütze says: "one disadvantage of the permuterm index is that its dictionary becomes quite large, including as it does all rotations of each term"; they now mention the compressed permuterm index. The technique is under patenting by Yahoo!. Open: an I/O-efficient compressed permuterm index.

26 What I did A storage scheme which provides fast access to compressed data; the Pizza&Chili site and an experimental comparison among the known compressed indexes; a compressed index which provides fast queries on dictionaries of strings.

27 Thank you

