Published by Jesse Edwards. Modified over 9 years ago.
1
Inverted File Compression in Managing Gigabytes
Course: Information Retrieval
Lecturer: Kwon Hyuk-chul (권혁철), Pusan National University
2
Inverted File Compression
Inverted file entry
– $t$: term; $f_t$: number of documents containing $t$; $d_k$: document number, where $d_k < d_{k+1}$
– => gap $= d_{k+1} - d_k$
Two compression classes
– global methods vs. local methods
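The gap transformation above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names `to_gaps` and `from_gaps` are hypothetical, and the sketch assumes document numbers start at 1 so that the first gap is $d_1$ itself.

```python
def to_gaps(doc_ids):
    """Convert a strictly increasing list of document numbers into gaps."""
    gaps = []
    prev = 0  # assumption: documents are numbered from 1, so the first gap is d_1
    for d in doc_ids:
        if d <= prev:
            raise ValueError("document numbers must be strictly increasing")
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Rebuild the document numbers by taking running sums of the gaps."""
    doc_ids = []
    total = 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids
```

Small gaps dominate for frequent terms, which is why the codes on the following slides give short codewords to small integers.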
3
Summary of coding methods
4
Unary code
Simple method
– fixed-length binary representation of a positive integer
– $\lceil \log N \rceil$ bits
Unary code
– a gap $x$ is represented by $x - 1$ one-bits followed by a single zero-bit
– $l_x = (x - 1) + 1 = x$, $\Pr[x] = 2^{-x}$
– e.g. $x = 9$ => 11111111 0
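As a rough sketch of the unary code described above (bit strings are used instead of packed bits for readability; the names `unary_encode` and `unary_decode` are illustrative only):

```python
def unary_encode(x):
    """Encode a positive integer as x - 1 one-bits followed by a zero-bit."""
    if x < 1:
        raise ValueError("unary code is defined for positive integers only")
    return "1" * (x - 1) + "0"


def unary_decode(bits, pos=0):
    """Decode one unary codeword starting at pos; return (value, next position)."""
    x = 1
    while bits[pos] == "1":
        x += 1
        pos += 1
    return x, pos + 1  # skip the terminating zero-bit
```

For `x = 9` the encoder yields eight one-bits and a zero-bit, matching the slide's example, and the codeword length equals `x`.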
5
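$\gamma$ code
– the value $1 + \lfloor \log x \rfloor$ in unary, followed by a $\lfloor \log x \rfloor$-bit binary code for $x - 2^{\lfloor \log x \rfloor}$
– $l_x = 1 + \lfloor \log x \rfloor + \lfloor \log x \rfloor = 1 + 2\lfloor \log x \rfloor$, $\Pr[x] = 1/(2x^2)$
– e.g. $x = 9$: $\lfloor \log x \rfloor = 3$, $x - 2^{\lfloor \log x \rfloor} = 1$ => 1110 001
– V = … or V = … or …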
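The construction on this slide (the Elias $\gamma$ code) can be sketched as follows. The function names are hypothetical and bit strings stand in for a real bit stream.

```python
def gamma_encode(x):
    """Gamma code: 1 + floor(log2 x) in unary, then the low floor(log2 x) bits of x."""
    if x < 1:
        raise ValueError("gamma code is defined for positive integers only")
    n = x.bit_length() - 1                 # floor(log2 x)
    unary = "1" * n + "0"                  # the value n + 1 in unary
    binary = format(x - (1 << n), "b").zfill(n) if n > 0 else ""
    return unary + binary


def gamma_decode(bits, pos=0):
    """Decode one gamma codeword starting at pos; return (value, next position)."""
    n = 0
    while bits[pos] == "1":
        n += 1
        pos += 1
    pos += 1                               # skip the zero-bit that ends the unary part
    rest = int(bits[pos:pos + n], 2) if n > 0 else 0
    return (1 << n) + rest, pos + n
```

For `x = 9` this produces `1110` followed by `001`, which is 7 bits, in line with $l_x = 1 + 2\lfloor \log x \rfloor$.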
6
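$\delta$ code
– similar in form to the $\gamma$ code
– the value $1 + \lfloor \log x \rfloor$ is written in the $\gamma$ code instead of unary, followed by a $\lfloor \log x \rfloor$-bit binary code for $x - 2^{\lfloor \log x \rfloor}$
– $l_x = 1 + 2\lfloor \log(1 + \lfloor \log x \rfloor) \rfloor + \lfloor \log x \rfloor$, $\Pr[x] = 1/(2x(\log x)^2)$
– e.g. $x = 9$ => 11000 001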
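A self-contained sketch of the $\delta$ code, under the same conventions as the slide (the helper `_gamma` and the name `delta_encode` are illustrative, not taken from the source):

```python
def _gamma(x):
    """Gamma code of a positive integer (helper for the delta code)."""
    n = x.bit_length() - 1                 # floor(log2 x)
    tail = format(x - (1 << n), "b").zfill(n) if n > 0 else ""
    return "1" * n + "0" + tail


def delta_encode(x):
    """Delta code: gamma code of 1 + floor(log2 x), then the low floor(log2 x) bits of x."""
    if x < 1:
        raise ValueError("delta code is defined for positive integers only")
    n = x.bit_length() - 1                 # floor(log2 x)
    tail = format(x - (1 << n), "b").zfill(n) if n > 0 else ""
    return _gamma(n + 1) + tail


def delta_length(x):
    """Codeword length predicted by the formula on the slide."""
    n = x.bit_length() - 1
    return 1 + 2 * ((n + 1).bit_length() - 1) + n
```

For `x = 9` the prefix is the $\gamma$ code of 4 (`11000`) and the suffix is `001`. The $\delta$ code is longer than $\gamma$ for some small values but shorter for large gaps.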
7
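Global Bernoulli model
$\Pr[x] = (1 - p)^{x-1} p$, where $p = f / (N n)$ is the probability that a given document contains a given term ($N$ documents, $n$ distinct terms, $f$ pointers in total)
Golomb code
– the value $q + 1$ in unary, followed by a binary code of $\lfloor \log b \rfloor$ or $\lceil \log b \rceil$ bits for the remainder $r$
– $q = \lfloor (x - 1) / b \rfloor$, $r = x - qb - 1$
– $b^A = \lceil \log(2 - p) / -\log(1 - p) \rceil \approx 0.69\,(N n / f)$
– e.g. $b = 3$: $r = 0$ (0), 1 (10), 2 (11)
– e.g. $b = 6$: $r = 0$ (00), 1 (01), 2 (100), 3 (101), 4 (110), 5 (111)
– for $x = 9$ with $b = 3$: $q = 2$, $r = 2$, hence 110 11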
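The Golomb code on this slide can be sketched as below. The remainder is written with a truncated binary code, which is one way to obtain the "$\lfloor \log b \rfloor$ or $\lceil \log b \rceil$ bits" behaviour and reproduces the remainder tables for $b = 3$ and $b = 6$. Function names are hypothetical.

```python
def golomb_encode(x, b):
    """Golomb code of a positive integer x with parameter b >= 1."""
    if x < 1 or b < 1:
        raise ValueError("x and b must be positive")
    q, r = divmod(x - 1, b)                # q = floor((x-1)/b), r = x - q*b - 1
    unary = "1" * q + "0"                  # the value q + 1 in unary
    k = (b - 1).bit_length()               # ceil(log2 b)
    u = (1 << k) - b                       # number of short (k-1 bit) remainders
    if k == 0:
        return unary                       # b = 1: the remainder needs no bits
    if r < u:
        return unary + format(r, "b").zfill(k - 1)
    return unary + format(r + u, "b").zfill(k)


def golomb_decode(bits, b, pos=0):
    """Decode one Golomb codeword starting at pos; return (value, next position)."""
    q = 0
    while bits[pos] == "1":
        q += 1
        pos += 1
    pos += 1                               # skip the zero-bit that ends the unary part
    k = (b - 1).bit_length()
    u = (1 << k) - b
    r = 0
    if k > 0:
        r = int(bits[pos:pos + k - 1], 2) if k > 1 else 0
        pos += k - 1
        if r >= u:                         # long remainder: read one more bit
            r = 2 * r + int(bits[pos]) - u
            pos += 1
    return q * b + r + 1, pos
```

With `b = 3`, `golomb_encode(9, 3)` gives `110` followed by `11`, as in the slide. With `b = 1` the code degenerates to the unary code.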
8
Global "observed frequency" model
Based on the observed frequency of each gap size
Uses arithmetic or Huffman coding
In theory
– a better compression method
In practice
– only slightly better than the $\gamma$ and $\delta$ codes
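To make the observed-frequency idea concrete, the sketch below builds Huffman codeword lengths from the gap sizes actually seen and reports the resulting bits per pointer. It is only an illustration under simple assumptions (a non-empty gap list held in memory, hypothetical function names); a real index would also have to store the code table.

```python
import heapq
from collections import Counter


def huffman_code_lengths(gaps):
    """Return {gap size: codeword length} for a Huffman code built on observed gaps."""
    freq = Counter(gaps)
    if len(freq) == 1:                     # a single symbol still needs one bit
        return {next(iter(freq)): 1}
    # Heap items: (weight, tiebreak, gap sizes in this subtree)
    heap = [(count, i, [gap]) for i, (gap, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for gap in s1 + s2:                # every merge adds one bit to these codewords
            lengths[gap] += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


def bits_per_pointer(gaps):
    """Average Huffman codeword length over the observed gaps."""
    lengths = huffman_code_lengths(gaps)
    return sum(lengths[g] for g in gaps) / len(gaps)
```

Because the code is fitted to the data, it cannot do worse on average than a fixed prefix code over the same gap sizes, although the slide notes the practical gain is small.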
9
Local Bernoulli model
The frequency of term $t$, $f_t$, is known
– a Bernoulli model can be applied to each individual inverted file entry
Very common words are encoded with $b = 1$
– tantamount to a bitvector
– thus, an inverted file can never be worse than a bitvector
It is necessary to store the parameter $f_t$
– so that $b$ can be derived during decoding
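A possible way to derive the per-term Golomb parameter is sketched here. It assumes, as a labelled assumption rather than something stated on the slide, that the local model sets $p = f_t / N$ for a collection of $N$ documents and then applies the same formula for $b$ as the global Bernoulli model. The function name is hypothetical.

```python
import math


def local_golomb_parameter(f_t, N):
    """Golomb parameter b for one term, assuming p = f_t / N (an assumption)."""
    if not 1 <= f_t <= N:
        raise ValueError("f_t must satisfy 1 <= f_t <= N")
    p = f_t / N
    if p >= 1.0:
        return 1                           # term occurs in every document
    b = math.ceil(math.log(2 - p) / -math.log(1 - p))
    return max(1, b)
```

Because the decoder can recompute `b` from the stored `f_t`, no extra parameter needs to be written. For a term present in half of the documents or more, the sketch returns `b = 1`, the bitvector-like case the slide mentions.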
10
Skewed Bernoulli model
Vector of the Bernoulli model
– $V_G =$ …
– $V_T =$ …
– slightly worse than the Golomb code
Figure: word positions in the Bible for (a) bridegroom; (b) Jezebel; (c) twelfth
11
Local hyperbolic model
$\Pr[x] = \dfrac{1}{x\,(\log_e(m + 1) + 0.5772)}$, $x = 1, 2, \ldots, m$
– $m$ is the largest gap
Better performance
More complex to implement
Requires the use of arithmetic coding
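The hyperbolic distribution can be checked numerically with a short sketch. The names `hyperbolic_probabilities`, `ideal_code_length` and the variable `scale` are illustrative; `scale` stands for the constant $1 / (\log_e(m + 1) + 0.5772)$ from the slide.

```python
import math


def hyperbolic_probabilities(m):
    """Pr[x] for x = 1..m under the hyperbolic model with largest gap m."""
    scale = 1.0 / (math.log(m + 1) + 0.5772)
    return [scale / x for x in range(1, m + 1)]


def ideal_code_length(x, m):
    """Bits an ideal (arithmetic) coder would spend on gap x under this model."""
    scale = 1.0 / (math.log(m + 1) + 0.5772)
    return -math.log2(scale / x)
```

For large `m` the probabilities sum to approximately 1, since 0.5772 approximates Euler's constant in the harmonic-sum estimate. Doubling the gap costs exactly one extra bit, and the non-integer code lengths are the reason an arithmetic coder is needed.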
12
Local "observed frequency" model
The ultimate in local modeling
Batched frequency
Requires more memory space
Best compression method
13
Performance of Index Compression Methods
Bits per pointer

Method                Bible    GNUbib   Comact   TREC
Global methods
  Unary               264      920      490      1719
  Binary              15.00    16.00    18.00    20.00
  Bernoulli           9.67     11.65    10.58    12.61
  $\gamma$            6.55     5.69     4.48     6.43
  $\delta$            6.26     5.08     4.36     6.19
  Observed frequency  5.92     4.83     4.21     5.83
Local methods
  Bernoulli           6.13     6.17     5.40     5.73
  Hyperbolic          5.77     5.17     4.65     5.74
  Skewed Bernoulli    5.68     4.71     4.24     5.28
  Batched frequency   5.61     4.65     4.03     5.27
14
Compression of bitmaps
Bitmaps: compressed using hierarchical bitvector compression
Figure: (a) original bitvector; (b) hierarchical structure; (c) flattened tree as a string of bits
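One way to read the three panels is sketched below. This is an assumption-laden illustration rather than the slide's exact scheme: the bitvector is cut into fixed-size blocks (the block size of 4 is a hypothetical choice), each block is summarised by one bit in the level above, all-zero blocks are dropped, and the levels are written out top-down as a single bit string.

```python
def hier_encode(bits, block=4):
    """Flatten a bitvector into root summary + nonzero blocks, top level first."""
    levels = []
    current = bits
    while len(current) > block:
        current += "0" * (-len(current) % block)          # pad to a whole number of blocks
        blocks = [current[i:i + block] for i in range(0, len(current), block)]
        levels.append("".join(b for b in blocks if "1" in b))   # keep nonzero blocks only
        current = "".join("1" if "1" in b else "0" for b in blocks)
    levels.append(current)                                 # root summary
    return "".join(reversed(levels))


def hier_decode(flat, n, block=4):
    """Rebuild the original n-bit vector from the flattened tree."""
    sizes = [n]
    while sizes[-1] > block:
        sizes.append(-(-sizes[-1] // block))               # ceil division
    pos = sizes[-1]
    current = flat[:pos]                                   # root summary
    for size in reversed(sizes[:-1]):
        out = []
        for bit in current:
            if bit == "1":                                 # block was stored explicitly
                out.append(flat[pos:pos + block])
                pos += block
            else:                                          # block was all zeros
                out.append("0" * block)
        current = "".join(out)[:size]
    return current
```

On a sparse vector such as `0000010000000011` the flattened form is 12 bits instead of 16, and the saving grows as the vector gets sparser. The decoder needs the original length `n` and the block size.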