Download presentation
Presentation is loading. Please wait.
2
Efficient Decodable and Searchable Natural Language Adaptive Compression Salvador de Bahía. Aug 2005 28 th Annual International ACM SIGIR Gonzalo Navarro (DCC – U. Chile) Nieves R. Brisaboa (UDC – Spain) José R. Paramá (UDC – Spain) Antonio Fariña (UDC – Spain)
3
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression2 — Semistatic Compressors Word-based Huffman: Plain & Tagged Huffman End-Tagged Dense Code (ETDC) — Dynamic Compressors Dynamic End-Tagged Dense Code (DETDC) Outline Outline Conclusions Word-based compression IIntroduction Dynamic Lightweight ETDC (DLETDC) — Basic Ideas — Searching DLETDC — Empirical Results
4
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression3 Introduction Why compression? My huge disk is cheap. Compression reduces not only space ! — Disk access time (your huge disk is slow) — Transmission time (networks are even slower) — Search time (less data to process) Compression can be integrated into Text Retrieval Systems, improving their performance in all aspects Word-based semistatic models proved successful Word-based semistatic models proved successful MG system (Witten, Moffat, Bell) Byte-oriented Huffman (Moura, N., Ziviani, Baeza-Yates) End-Tagged Dense Codes (Brisaboa, Fariña, N.)
5
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression4 Introduction But what about sending results to a remote receiver? — Uncompressed form (wastes network bandwidth) — Keep in compressed form (plus the large model of the corpus) — Semistatic recompression (uneffective for small files) — Dynamic recompression — Dynamic recompression (effective, permits a conversation) Gzip (40%), DETDC (32%), arithmetic encoding (25%), … Semistatic compression is preferred in Text Databases Reduces their size: 25%-30% compression ratio Permits direct access and local decompression Permits direct search on the compressed text Searches are up to 8 times faster. Decompression is only needed for presenting results
6
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression5 Introduction Dynamic compression: Receiver Decompression (to present results) Keyword search (to classify documents, alert users, …) Dictionary-based (Ziv-Lempel): Fast decompression Decent direct searching Not so good compression ratios Bad for low bandwidth networks Statistical (arithmetic, PPM): Good compression ratios Costly at sender and receiver side Direct searching impossible Bad for weak receivers
7
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression6 Introduction Our contribution: DLETC — New dynamic compressor for text databases Simple to understand and implement — Statistical Good compression ratio Good for low-bandwidth networks — Easier to handle by the receiver Breaks the usual sender/receiver symmetry Very fast decompression Good for weak receivers — Searchable without decompressing First statistical method permitting direct search
8
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression7 — S— Semistatic Compressors Word-based Huffman: Plain & Tagged Huffman End-Tagged Dense Code (ETDC) — D— Dynamic Compressors Dynamic End-Tagged Dense Code (DETDC) Outline Outline Conclusions WWord-based compression Introduction Dynamic Lightweight ETDC (DLETDC) — Basic Ideas — Searching DLETDC — Empirical Results
9
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression8 Semistatic compression Statistical semistatic compression — 2 passes Gather frequencies of words and encoding Compression (substitution word codeword) — Association between source symbol codeword does not change across the text — Direct search is possible — Most representative method: Huffman.
10
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression9 Word-based Huffman Moffat proposed the use of words instead of characters as the coding alphabet Distribution of words more biased than that of characters Word freq distribution Character freq distribution (Spanish) Compression ratio about 25% (English texts) This idea joins the requirements of compression algorithms and of Information Retrieval systems
11
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression10 Plain Huffman & Tagged Huffman Moura, Navarro, Ziviani and Baeza: — 2 new techniques: Plain Huffman and Tagged Huffman Common features — Huffman-based — Word-based — Byte- rather than bit-oriented (compression ratio ±30%) Plain Huffman = Huffman over bytes (256-ary tree) Tagged Huffman flags the beginning of each codeword First bit is: “1” for 1 st bit of 1 st byte “0” for 1 st bit remaining bytes 1xxxxxxx 0xxxxxxx 0xxxxxxx
12
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression11 Plain Huffman & Tagged Huffman Differences: — Plain Huffman. (tree of arity 2 b =256) — Tagged Huffman. (tree of arity 2 b-1 =128) Loss of compression ratio (3.5 perc. points) Tag marks beginning of codewords Direct search (improved searches) Boyer-Moore Random access (random decompression)
13
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression12 — S— Semi-static Compressors Word-based Huffman: Plain & Tagged Huffman End-Tagged Dense Code (ETDC) — D— Dynamic Compressors Dynamic End-Tagged Dense Code (DETDC) Outline Outline Conclusions Word-based compression Introduction Dynamic Lightweight ETDC (DLETDC) — Basic Ideas — Searching DLETDC — Empirical Results
14
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression13 End-Tagged Dense Code Small change: Flag signal the end of a codeword Prefix code independently of the remaining 7 bits of each byte Huffman tree is not needed: Dense coding. Improving Tagged Huffman compression ratio by 2.5 perc. points Flag bit Same Tagged Huffman searching capabilities First bit is: “1” --> for 1st bit of last byte “0” --> for 1st bit remaining bytes 1xxxxxxx 0xxxxxxx Two-byte codeword 1xxxxxxx0xxxxxxx Three-byte codeword 1xxxxxxx0xxxxxxx
15
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression14 End-Tagged Dense Code Encoding scheme..... Words from 128+ 128 2 +1 to 128 +128 2 +128 3 use three bytes (128 3 = 2 21 codewords) 00000000:00000000:10000000 …… 01111111:01111111:11111111 Words from 128+1 to 128+128 2 are encoded using two bytes (128 2 = 2 14 codewords) 00000000:10000000 ….. 01111111:11111111 First 128 words are encoded using one byte (2 7 codewords) 10000000 10000001 ….. 11111111 Codewords depend on the rank, not on the frequency
16
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression15 — S— Semi-static Compressors Word-based Huffman: Plain & Tagged Huffman End-Tagged Dense Code (ETDC) — D— Dynamic Compressors Dynamic End-Tagged Dense Code (DETDC) Outline Outline Conclusions Word-based compression Introduction Dynamic Lightweight ETDC (DLETDC) — Basic Ideas — Searching DLETDC — Empirical Results
17
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression16 Dynamic codes Statistical dynamic compression — Dynamic modeling Frequencies are computed dynamically as text arrives. — Sender and receiver… Start with an empty model Adapt their model during the process. Both processes are symmetric
18
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression17 Dynamic End-Tagged Dense Code Symmetric sender and receiver. It uses the on-the-fly encoding and decoding algorithms — Requirement: maintaining the vocabulary sorted by freq. speed Sender’s vocabulary 1 the8 2 is4 3 code2 4 tag2 5 Huffman2 6 7 ratio1 8 1 ETDC speed C 8 = encode (8) Example wordfreq speed ETDC21 send codeword increase frequencyexchange with top 1 2 The codeword given to a word s i may change each time s i is processed. DETDC …
19
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression18 rose Dynamic End-Tagged Dense Code: transmission sender receiver rose Example: … a rose rose is a very nice one rose a 126 127 128 129 1 C 127 rose 1 a 126 127 128 129 1 1 vocabulary word freq vocabulary word freq
20
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression19 is 1 126 127 128 129 vocabulary word freq Dynamic End-Tagged Dense Code: transmission 1 rose sender receiver is a 2 1 rose a 2 is 1 is Example: … a rose rose is a very nice one 126 127 128 129 vocabulary word freq C new 0|128
21
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression20 — Semi-static Compressors Word-based Huffman: Plain & Tagged Huffman End-Tagged Dense Code (ETDC) — Dynamic Compressors Dynamic End-Tagged Dense Code (DETDC) Outline Outline Conclusions Word-based compression Introduction DDynamic Lightweight ETDC (DLETDC) — B— Basic Ideas — S— Searching DLETDC — E— Empirical Results
22
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression21 DLETDC – Basic Ideas The idea: — Break the correspondence between position and codeword — Codewords are mantained explicitly by the sender — SENDER: After processing a symbol s i … Exchange s i top(f i ) to keep the vocabulary sorted, Keep original codewords unless codeword length should vary Otherwise swap, and notify the receiver. — 2 fixed slots are reserved: new-symbol, ascii word Swap, C i, CodeIn j. New-symbol swap 0 1 128 129 255 0 0 127 128 129 128 129 127 0 16511 16512 255 0128 0 16513 0129 127 2113662 127254 2113663 127 255 0 2113664 00128 127 2113661 127253 Rank 1-byte 2-byte 3-byte — RECEIVER: does not maintain frequencies Decodes codewords Adds new symbols (C new-symbol ) Only swaps when it decodes a C swap
23
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression22 C 127 very 1 126 127 128 129 vocabulary word DLETDC: transmission sender receiver a 1 1 is a a very a Example: … a rose rose is a very nice one 126 127 128 129 vocabulary word freq 125 rose 125 rose 2 C 125 is C 126 a C 127 C 128 1-byte 2-byte No codeword swap needed since: “a” C 127 and “is” C 126 code 2
24
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression23 128 1127 2 C swap C 128 C 126 1 126 127 128 129 vocabulary word DLETDC: transmission sender receiver very is a very very Example: … a rose rose is a very nice one 126 129 vocabulary word freq 125 rose 125 rose 2 C 125 a C 127 is C 126 very C 128 1-byte 2-byte A codeword swap is needed code 2
25
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression24 128 130 C new 128 2127 2 1 126 127 129 vocabulary word DLETDC: transmission sender receiver nice very a nice is nice Example: … a rose rose is a very nice one 126 129 vocabulary word freq 125 rose 125 rose 2 C 125 a C 127 very C 126 is C 128 1-byte 2-byte A new word case code nice 130 nice 1 C 129 nice
26
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression25 DLETDC – Evolution of swaps. Swaps worsen compression ratio and processing time How many swaps occur in practice? ZIFF corpus 4,6x10 7 words 237,622 diff words Between lengts 1 and 2 Between lengts 2 and 3
27
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression26 — Semi-static Compressors Word-based Huffman: Plain & Tagged Huffman End-Tagged Dense Code (ETDC) — Dynamic Compressors Dynamic End-Tagged Dense Code (DETDC) Outline Outline Conclusions Word-based compression Introduction Dynamic Lightweight ETDC (DLETDC) — T— The code — S— Searching DLETDC — E— Empirical Results
28
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression27 DLETDC – Searches – Keyword filtering Multipattern Horspool — A trie is traversed to recognize text backwards Search process: — Initial phase Search for the ASCII pattern: C_new | (pattern) — When found Replace in the trie by its codeword: C_pat Need also to monitor C_new to know C_pat — Watch the swaps Track C_swap C_pat * and C_swap * C_pat It suffices to search for: C_new C_pat and make some neighborhood checks
29
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression28 — Semi-static Compressors Word-based Huffman: Plain & Tagged Huffman End-Tagged Dense Code (ETDC) — Dynamic Compressors Dynamic End-Tagged Dense Code (DETDC) Outline Outline Conclusions Word-based compression Introduction Dynamic Lightweight ETDC (DLETDC) — T— The code — S— Searching DLETDC — E— Empirical Results
30
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression29 Empirical Results We used some text collections from TREC-2 & TREC- 4, to perform the experiments Results in: — Compression ratio — Compression time — Decompression time — Search speed CORPUSsize (bytes)# wordsDiff. words CALGARY2,131,045528,61130,995 FT9114,749,3553,135,38375,681 CR51,085,54510,230,907117,713 FT92175,449,23536,803,204284,892 ZIFF185,220,21540,866,492237,622 FT93197,586,29442,063,804291,427 FT94203,783,92343,335,126295,018 AP250,714,27153,349,620269,141 ALL FT591,568,807124,971,944577,352 ALL1,080,719,883229,596,845886,190 — Dual Intel Pentium-III 800 Mhz with 768Mb RAM. Debian GNU/Linux (kernel 2.2.19) gcc 3.3.3 20040429 and –O9 optimizations Time represents CPU user-time
31
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression30 Empirical Results. Compression Ratio 35.09 33.69 33.66 27.98 25.98 Comparison with other techniques.
32
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression31 Empirical Results. Compression time 431.5 154.6 150.1 510.0 1342.0
33
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression32 Empirical Results. Decompression time 59.8 49.5 75.8 394.1 432.4
34
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression33 Search speed 3.52 3.464.384.49 5.01 10.14 13.8015.8417.6118.70 Avg time over 10,000 random searches for words of length >= 6
35
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression34 — Semi-static Compressors Word-based Huffman: Plain & Tagged Huffman End-Tagged Dense Code (ETDC) — Dynamic Compressors Dynamic End-Tagged Dense Code (DETDC) Outline Outline CConclusions Word-based compression Introduction Dynamic Lightweight ETDC (DLETDC) — Basic Ideas — Searching DLETDC — Empirical Results
36
August 2005ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression35 Conclusions We have presented DLETDC — A new dynamic “dense” compressor. — Designed for low bandwidth and weak receivers. Competitive compression ratio (around 32-34%, < gzip) Fast at compression (much faster than other adaptive) Very fast at decompression (even faster than gunzip!) Very fast at searching (>3 times faster than plain searching) DLETDC breaks the common symmetry between sender and receiver in adaptive techniques. — Lightweight and fast decompressor/receiver. — A dynamic searchable technique.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.