Published by Leo Miller. Modified over 9 years ago.
1
Compressed Prefix Sums
O’Neil Delpratt, Naila Rahman, Rajeev Raman
2
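Prefix Sums problem (PSDS)
Given: a static sequence of positive integers $x = (x_1, \ldots, x_n)$ such that $\sum_{i=1}^{n} x_i = m$.
Support: $\mathrm{Sum}(j) = \sum_{i=1}^{j} x_i$.
Problem: minimise the space for storing $x$ according to some compressibility criteria, while supporting Sum rapidly.
Trivial solution: explicitly store the Sum values. Requires $n \lceil \log m \rceil$ bits; supports Sum in O(1) time.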
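The trivial solution can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the sample values are assumptions chosen for the example.

```python
from itertools import accumulate


def build_explicit_sums(x):
    """Precompute every prefix sum of a sequence of positive integers."""
    if any(v <= 0 for v in x):
        raise ValueError("all values must be positive")
    return list(accumulate(x))


def prefix_sum(sums, j):
    """Return x_1 + ... + x_j for a 1-based j; Sum(0) is defined as 0."""
    return 0 if j == 0 else sums[j - 1]


x = [3, 2, 1, 4]            # illustrative values: n = 4, m = 10
sums = build_explicit_sums(x)
print(prefix_sum(sums, 3))  # 3 + 2 + 1
```

Each lookup is a single array access, which is where the O(1) time comes from; the cost is that every stored value needs enough bits to hold a number as large as m.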
3
Motivation: Inverted List
Locations of keywords in the main text: a sequence of strictly increasing positive integers.
Example (Bible doc: Moses…… ….Moses… ………………..Moses…): Term = Moses: 650, 687, 696
4
Inverted List
Store differences: significant space saving, a standard technique [Managing Gigabytes, Witten, Moffat, Bell].
We can store the differences in a PSDS => $\mathrm{Sum}(i)$ = $i$-th location of the keyword.
Direct access to individual locations helps to answer conjunction queries.
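A small sketch of the difference encoding, using the Moses locations from the motivation slide. The helper names are hypothetical, and the linear-time `sum` call merely stands in for the Sum operation that a real PSDS would answer quickly.

```python
def to_gaps(locations):
    """Turn a strictly increasing list of locations into positive differences."""
    gaps = []
    previous = 0
    for loc in locations:
        if loc <= previous:
            raise ValueError("locations must be strictly increasing and positive")
        gaps.append(loc - previous)
        previous = loc
    return gaps


def ith_location(gaps, i):
    """Recover the i-th location (1-based) as the prefix sum of the first i gaps."""
    return sum(gaps[:i])  # placeholder for PSDS Sum(i)


gaps = to_gaps([650, 687, 696])
print(gaps)                   # the differences that would be stored
print(ith_location(gaps, 2))  # second occurrence of the term
```

The differences are much smaller than the absolute locations, which is what makes them cheaper to encode.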
5
String Collection
Collection of non-empty strings $s_1, \ldots, s_n$. Store their lengths in a PSDS, where $x_i = |s_i|$.
Concatenate the strings. Store the concatenated string in an array, or in a compressed self-index, e.g. FM-index or CSA.
Get the $i$-th string: Offset = $\mathrm{Sum}(i-1)$, Length = $x_i$.
Text string: “selmanjava3d programming2000”. Offsets: 0, 6, 13, 10, 24
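The offset and length lookup can be illustrated like this. It is a sketch under assumptions: the class name and methods are invented for the example, and an explicit list of prefix sums stands in for whichever PSDS is actually used. The three strings are the ones that make up the slide's text string.

```python
from itertools import accumulate


class StringCollection:
    """Concatenated strings plus prefix sums of their lengths."""

    def __init__(self, strings):
        if any(len(s) == 0 for s in strings):
            raise ValueError("strings must be non-empty")
        self.text = "".join(strings)
        # Explicit prefix sums of the lengths; a compressed PSDS would replace this.
        self.ends = list(accumulate(len(s) for s in strings))

    def _sum(self, j):
        return 0 if j == 0 else self.ends[j - 1]

    def get(self, i):
        """Return the i-th string (1-based): offset Sum(i-1), length x_i."""
        offset = self._sum(i - 1)
        length = self._sum(i) - offset
        return self.text[offset:offset + length]


coll = StringCollection(["selman", "java3d programming", "2000"])
print(coll.get(2))
```

No per-string pointer is stored; the only index is the PSDS over the lengths.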
6
URLs
Web search engines hold large databases of URLs.
URLs are strings, about 60 characters long on average, and compress fairly well.
An explicit pointer for each URL requires 64 bits.
Examples: www.cs.le.ac.uk, www.cs.le.ac.uk/people/ond1, www.le.ac.uk/library, www.star.ac.uk
7
XML Documents
[Figure: an XML document and its tree, with nodes #doc, book, title, author, year and text nodes “selman”, “Java3d programming”, “2000” and “[cr][sp]”.]
Text nodes are 10-12 characters long on average and compress to 3-4 bytes on average.
Naive storage adds a 32-bit pointer overhead for each string.
8
Related Work
[CJM]: Clark, thesis; Jacobson, FOCS 89; Clark, Munro, SODA 96
[Geary et al.]: Geary, Raman, Rahman, CPM 04, TCS 06
[Kim et al.]: Kim, Chae Na, Kim, Park, WEA 05
[Gupta et al. (a)]: Gupta, Hon, Shah, Vitter, DCC 06
[Gupta et al. (b)]: Gupta, Hon, Shah, Vitter, WEA 06
[GV]: Grossi, Vitter, STOC 00, SICOMP 05
[MG]: Witten, Moffat, Bell, Managing Gigabytes
9
Bitvector Representation
Write each $x_i$ in unary, e.g. 4 is “0001”. $|B| = m$ bits.
B: 0 0 1 0 1 1 0 0 0 1
$\mathrm{Sum}(i) = \mathrm{Select}(B, i)$ = position of the $i$-th 1 bit in B.
Select: space usage $m + o(m)$ bits, time O(1) [CJM, Kim et al.]
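To make the unary encoding concrete, the sketch below rebuilds the bit vector B shown on the slide from the values 3, 2, 1, 4 and answers Sum with select. The linear scan in `select1` is only a stand-in: the cited structures answer select in constant time with auxiliary tables, which this example does not attempt.

```python
def encode_unary(x):
    """Write each value v as v - 1 zeros followed by a single 1."""
    bits = []
    for v in x:
        bits.extend([0] * (v - 1) + [1])
    return bits


def select1(bits, i):
    """Return the 1-based position of the i-th 1 bit (naive linear scan)."""
    count = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == 1:
            count += 1
            if count == i:
                return pos
    raise IndexError("fewer than i one-bits in the bit vector")


B = encode_unary([3, 2, 1, 4])
print(B)
print(select1(B, 3))  # Sum(3) = 3 + 2 + 1
```

The bit vector has exactly m bits and n ones, so the position of the i-th 1 is the sum of the first i values.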
10
Succinct Bound
Given that the number of 1s in B is n, there are L different bit sequences.
Lower bound to store all L sequences: $\lceil \log_2 L \rceil$ bits.
This space usage is based on the average. Could we do better?
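As a rough numerical illustration of the bound, the sketch below counts bit strings of length m with exactly n ones, that is, it assumes L is the binomial coefficient C(m, n). The slide's own formula for L did not survive extraction, so this choice and the function name are assumptions.

```python
from math import comb


def succinct_bound_bits(m, n):
    """ceil(log2 L) for L = C(m, n), computed exactly with integers."""
    L = comb(m, n)          # assumed count of m-bit strings with n ones
    return (L - 1).bit_length()


m, n = 10, 4                # the sizes of the small bit vector example
print(comb(m, n))                  # number of candidate sequences
print(succinct_bound_bits(m, n))   # bits needed to distinguish them
```

For L >= 1, `(L - 1).bit_length()` equals the ceiling of log2 L, which avoids floating-point rounding for large m.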
11
Data-aware encoding
Exploit a skewed distribution.
Self-delimiting encodings of the values: concatenate a unary part and a binary part.
The values add up to $m$, so the average value is $m/n$.
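One well-known self-delimiting code of this unary-plus-binary shape is the Elias gamma code. The sketch below assumes that is the kind of code meant here; the slide's own definition was lost in extraction, so both the choice of code and the function names are assumptions.

```python
def gamma_encode(x):
    """Elias gamma code: floor(log2 x) zeros, then x in binary."""
    if x <= 0:
        raise ValueError("gamma codes are defined for positive integers")
    binary = format(x, "b")              # starts with a 1 bit
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values = []
    pos = 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":          # the unary part gives the binary length
            zeros += 1
            pos += 1
        values.append(int(bits[pos:pos + zeros + 1], 2))
        pos += zeros + 1
    return values


stream = "".join(gamma_encode(v) for v in [3, 2, 1, 4])
print(stream)
print(gamma_decode_all(stream))
```

Small values get short codewords (the value 1 costs a single bit), so a skewed sequence is stored in fewer bits than a fixed-width layout would need.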
12
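Data-aware encoding: Golomb(b, x)
Concatenate $q + 1$ in unary and $r$ in binary using $\lfloor \log b \rfloor$ or $\lceil \log b \rceil$ bits, where $q = \lfloor (x-1)/b \rfloor$ and $r = x - qb - 1$.
Example: b = 3, Golomb(3, 9): q = 2, so q + 1 in unary = 001, and r = 2 => 001 11.
Best encoding for inverted lists if - [MG]. - [Gupta et al. (b)] Not achievable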
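The Golomb(3, 9) example can be reproduced with a short sketch. It assumes the usual definition q = floor((x - 1) / b), r = x - q*b - 1, unary codes written as zeros followed by a 1, and a truncated binary code for r; the helper names are hypothetical.

```python
def unary(k):
    """Unary code for k >= 1: k - 1 zeros followed by a 1."""
    return "0" * (k - 1) + "1"


def truncated_binary(r, b):
    """Encode 0 <= r < b using floor(log2 b) or ceil(log2 b) bits."""
    if b == 1:
        return ""
    k = (b - 1).bit_length()       # ceil(log2 b)
    short_codes = (1 << k) - b     # how many remainders get the shorter code
    if r < short_codes:
        return format(r, "b").zfill(k - 1)
    return format(r + short_codes, "b").zfill(k)


def golomb(b, x):
    """Golomb code of a positive integer x with parameter b."""
    if x <= 0 or b <= 0:
        raise ValueError("x and b must be positive")
    q, r = divmod(x - 1, b)
    return unary(q + 1) + truncated_binary(r, b)


print(golomb(3, 9))  # quotient part 001, remainder part 11
```

With b = 3 the remainders 0, 1, 2 map to 0, 10, 11, so the remainder costs one or two bits depending on its value.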
13
Contributions of paper
a) GOLOMB vs SUCCINCT
b) New Select DS
c) Data-aware PSDS (space and time bounds)
d) Implementation and Experimental Evaluation
14
Succinct vs Golomb
If .. - [GV, Elias]
15
Succinct PSDS [GV, Elias]
Given $x$, compute the prefix sums and split each one into upper-order and lower-order bits, e.g. 5 = 00101.
Lower-half: the lower-order bits of each prefix sum, stored in an array B (indices 0 1 2 3 4 5 6 7): 11 01 10 10 11 01 00 10, i.e. 1011011001010100. For 5 = 00101 the lower bits are 01. e.g. get(B,4)=10
Upper-half: the upper-order bits, i.e. 0,1,1,2,2,4,5,6 (for 5 = 00101 => 1), stored as multiplicities in a bit vector V.
Space usage and time bounds follow from select on V [CJM].
Simple to do.
16
New select DS
$\mathrm{Select}(B, i)$ = position of the $i$-th 1 bit in bitstring B of length N.
Extracted string & contracted string [Kim et al.]: remove zero blocks.
[Geary et al.]: fast select: every block has at least a single 1 bit.
[Figure: bitstring A containing a block of zeros (001……..1.. 001……000000000..1..), a marker string P: 0 0 1 0 0 1 0 0 0, and the resulting string A’.]
17
New select DS: Results
Assume a bitstring (BS) of N bits.
Select & rank: O(1) time, space: N + o(N) bits.
Select1 and select0: partitioned BV [Delpratt, Raman, Rahman, WEA 06].
In practice: joint fastest with CJM.
18
New select DS: space usage
            Typical                               Worst-case
            NewDS        CJM         KIM          NewDS    CJM     KIM
Input BS    (1- )N       N           N            N        N       N
Select      (1- )0.94N   (1+ )0.52N  (1+ )0.63N   0.94N    2.77N   1.17N
rank        0.03N        0.5N        0.25N        0.02N    0.5N    0.25N
sum                      ~2N                      ~1.94N   4.27N   2.42N
Reliable space bound.
Speed evaluation on Orders.xml (time per operation): NewDS = 0.101, CJM = 0.105, KIM = 0.178
19
Data-aware tree PSDS: Results
Space usage (bits) and time, compared with the space and time achieved by [Gupta et al. (b)].
20
Data-aware tree PSDS
Delete the larger child of each node. Indicate which nodes were removed: n-1 extra bits.
[Figure: a tree of sums with node values 59; 36, 23; 6, 17, 15, 21, shown before and after deletion.]
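The idea of deleting the larger child can be sketched on a small sum tree. This only shows that the deleted values are recoverable from the parent and the kept sibling; it does not implement the paper's encoding or a fast Sum. The leaf order 6, 17, 15, 21 (giving internal sums 23 and 36) is an assumption read off the figure's numbers, the function names are hypothetical, and the input length is restricted to a power of two for brevity.

```python
def build_pruned(x):
    """Build a tree of sums, then keep only the smaller child of every node."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch needs a power-of-two number of values")
    levels = [list(x)]                       # leaves first, root last
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    kept = []                                # top-down rows of (kept value, flag)
    for level in reversed(levels[:-1]):
        row = []
        for i in range(0, len(level), 2):
            left, right = level[i], level[i + 1]
            dropped_left = left > right      # one flag bit per internal node
            row.append((right if dropped_left else left, dropped_left))
        kept.append(row)
    return levels[-1][0], kept


def rebuild_leaves(root_sum, kept):
    """Recompute each deleted child as parent minus the kept sibling."""
    current = [root_sum]
    for row in kept:
        nxt = []
        for parent, (small, dropped_left) in zip(current, row):
            big = parent - small
            nxt.extend([big, small] if dropped_left else [small, big])
        current = nxt
    return current


root, kept = build_pruned([6, 17, 15, 21])
print(root, kept)
print(rebuild_leaves(root, kept))
```

There is one flag per internal node, which is n - 1 flags for n leaves, matching the extra bits mentioned on the slide.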
21
Implementation and Experimental Evaluation
Implemented: Succinct, Explicit-γ & Succinct-γ PSDS.
Gamma tree PSDS: removing right child nodes vs the largest node = negligible difference. Tree is slow.
Data: lengths of text node strings.
Compressibility measures (per node):
File         #Text nodes   Gap    ?      GOLOMB   Succ.
orders.xml   150K          2.56   4.99   4.87     4.71
Xpath.xml    1.7M          3.26   6.41   4.42     4.37
Succinct measure is close to the GOLOMB measure.
22
Experimental Evaluation: Results
Comparative space usage for data structures.
Linux machine, 8 million random operation calls, 10 repeated runs.
Time: sec per operation
File         Succinct   Explicit-γ   Succinct-γ
orders.xml   0.101      0.235        0.306
Xpath.xml    0.306      0.453        0.564
Succinct PSDS performed best.
23
Conclusions and future work
Compression of prefix sums is important.
Space-efficient data-aware PSDS.
Succinct PSDS was more appropriate in our application.
New select DS.
Future improvements: make Succinct-γ more competitive (a single γ-decode is 20 times faster than a single select); improvements to the data-aware tree PSDS.
24
Thank you!