
Inverted File Compression in Managing Gigabytes. Course: Information Retrieval. Lecture: Hyuk-Chul Kwon (권혁철), Pusan National University (부산대학교).




1 Inverted File Compression in Managing Gigabytes
Course: Information Retrieval
Lecture: Hyuk-Chul Kwon (권혁철), Pusan National University (부산대학교)

2 Inverted File Compression
Inverted file entry
–$t$: term; $f_t$: number of documents containing $t$; $d_k$: document number, where $d_k < d_{k+1}$
–=> gap = $d_{k+1} - d_k$
Two compression classes
–Global methods vs. local methods
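The gap transformation on this slide can be sketched as follows. This is a minimal illustration, assuming document numbers start at 1 so that the first gap is simply $d_1$; the function names `to_gaps` and `from_gaps` are hypothetical and not part of the lecture.

```python
def to_gaps(doc_ids):
    """Convert strictly increasing document numbers into gaps.

    Assumption: document numbers start at 1, so the first gap is d_1 - 0.
    """
    gaps = []
    previous = 0
    for d in doc_ids:
        if d <= previous:
            raise ValueError("document numbers must be strictly increasing")
        gaps.append(d - previous)
        previous = d
    return gaps


def from_gaps(gaps):
    """Rebuild the document numbers by taking a running sum of the gaps."""
    doc_ids = []
    total = 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids
```

For example, `to_gaps([3, 5, 20, 21, 23, 76, 77, 78])` gives `[3, 2, 15, 1, 2, 53, 1, 1]`. The gaps are mostly small numbers, which is what the codes on the following slides exploit.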

3 Summary of coding methods

4 Unary code
Simple method
–fixed-length binary representation of the positive integer
–$\lceil \log_2 N \rceil$ bits, for $N$ documents
Unary code
–a gap $x$ is written as $x - 1$ one bits followed by a single zero bit
–$l_x = (x - 1) + 1 = x$, $\Pr[x] = 2^{-x}$
–e.g. $x = 9$ => 11111111 0
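A minimal sketch of the unary code described on this slide, using strings of "0" and "1" characters in place of a real bit stream (an assumption made for readability). The names `unary_encode` and `unary_decode` are hypothetical.

```python
def unary_encode(x):
    """Encode a positive integer as x - 1 one bits followed by a zero bit."""
    if x < 1:
        raise ValueError("unary code is defined for x >= 1")
    return "1" * (x - 1) + "0"


def unary_decode(bits, pos=0):
    """Decode one unary value starting at pos; return (value, next position)."""
    x = 1
    while bits[pos] == "1":
        x += 1
        pos += 1
    return x, pos + 1  # skip the terminating zero bit
```

Here `unary_encode(9)` yields `"111111110"`, matching the slide's example, and the code length always equals $x$.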

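5 $\gamma$ code
–unary code for $1 + \lfloor \log_2 x \rfloor$, followed by a $\lfloor \log_2 x \rfloor$-bit binary code for $x - 2^{\lfloor \log_2 x \rfloor}$
–$l_x = 1 + \lfloor \log_2 x \rfloor + \lfloor \log_2 x \rfloor$, $\Pr[x] = 1 / (2x^2)$
–e.g. $x = 9$: $\lfloor \log_2 x \rfloor = 3$, $x - 2^{\lfloor \log_2 x \rfloor} = 1$ => 1110 001
–$V$ = … or $V$ = … or …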
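The $\gamma$ code construction above can be sketched like this. Bit strings are used instead of packed bits for clarity, and the function names are hypothetical.

```python
def gamma_encode(x):
    """Gamma code: unary(1 + floor(log2 x)) followed by floor(log2 x) binary bits."""
    if x < 1:
        raise ValueError("gamma code is defined for x >= 1")
    n = x.bit_length() - 1            # floor(log2 x)
    unary = "1" * n + "0"             # unary code for 1 + n
    # Remainder x - 2^n written in exactly n bits (empty when n == 0).
    binary = format(x - (1 << n), "b").zfill(n) if n else ""
    return unary + binary


def gamma_decode(bits, pos=0):
    """Decode one gamma value starting at pos; return (value, next position)."""
    n = 0
    while bits[pos] == "1":
        n += 1
        pos += 1
    pos += 1                          # skip the zero that ends the unary part
    remainder = int(bits[pos:pos + n], 2) if n else 0
    return (1 << n) + remainder, pos + n
```

`gamma_encode(9)` returns `"1110001"`, the slide's `1110 001`, and every code has length $1 + 2\lfloor \log_2 x \rfloor$.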

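6 $\delta$ code
–similar in construction to the $\gamma$ code
–$1 + \lfloor \log_2 x \rfloor$ is written in the $\gamma$ code instead of the unary code, followed by a $\lfloor \log_2 x \rfloor$-bit binary code for $x - 2^{\lfloor \log_2 x \rfloor}$
–$l_x = 1 + 2 \lfloor \log_2 (1 + \lfloor \log_2 x \rfloor) \rfloor + \lfloor \log_2 x \rfloor$, $\Pr[x] = 1 / (2x (\log_2 x)^2)$
–e.g. $x = 9$ => 11000 001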
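A sketch of the $\delta$ code under the same bit-string assumption. The helper `to_binary` and the function names are hypothetical; `gamma_encode` is repeated so the block runs on its own.

```python
def to_binary(value, width):
    """Write value in exactly `width` bits; an empty string when width is 0."""
    return format(value, "b").zfill(width) if width else ""


def gamma_encode(x):
    """Gamma code, used here only for the length prefix."""
    n = x.bit_length() - 1
    return "1" * n + "0" + to_binary(x - (1 << n), n)


def delta_encode(x):
    """Delta code: gamma(1 + floor(log2 x)) followed by floor(log2 x) binary bits."""
    if x < 1:
        raise ValueError("delta code is defined for x >= 1")
    n = x.bit_length() - 1
    return gamma_encode(n + 1) + to_binary(x - (1 << n), n)


def delta_decode(bits, pos=0):
    """Decode one delta value starting at pos; return (value, next position)."""
    # First decode the gamma-coded prefix, which holds 1 + floor(log2 x).
    k = 0
    while bits[pos] == "1":
        k += 1
        pos += 1
    pos += 1
    prefix = (1 << k) + (int(bits[pos:pos + k], 2) if k else 0)
    pos += k
    n = prefix - 1
    remainder = int(bits[pos:pos + n], 2) if n else 0
    return (1 << n) + remainder, pos + n
```

`delta_encode(9)` returns `"11000001"`, the slide's `11000 001`: the prefix `11000` is the $\gamma$ code of 4, and `001` is the 3-bit remainder.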

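7 Global Bernoulli model
$\Pr[x] = (1 - p)^{x-1} p$, where $p$ is the probability that a given document contains a given term, so $\Pr[x]$ is the probability of a gap of size $x$
Golomb code
–unary code for $q + 1$, followed by a binary code for $r$ of $\lfloor \log_2 b \rfloor$ or $\lceil \log_2 b \rceil$ bits
–$q = \lfloor (x - 1) / b \rfloor$, $r = x - q b - 1$
–$b^A = \lceil \log(2 - p) / -\log(1 - p) \rceil \approx 0.69 \, (N \cdot n / f)$, for $N$ documents, $n$ distinct terms, and $f$ pointers in the index
–e.g. $b = 3$: $r = 0$ (0), 1 (10), 2 (11); $b = 6$: $r = 0$ (00), 1 (01), 2 (100), 3 (101), 4 (110), 5 (111)
–with $b = 3$ and $x = 9$: $q = 2$, $r = 2$, so the code is 110 11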
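The Golomb code on this slide can be sketched as below. The remainder is written with a truncated binary code, which is one way to reproduce the slide's tables for $b = 3$ and $b = 6$; bit strings stand in for a packed bit stream, and all function names are hypothetical.

```python
def golomb_encode(x, b):
    """Golomb code with parameter b: unary(q + 1) then a truncated binary remainder."""
    if x < 1 or b < 1:
        raise ValueError("x and b must be positive")
    q, r = divmod(x - 1, b)             # q = floor((x-1)/b), r = x - q*b - 1
    unary = "1" * q + "0"               # unary code for q + 1
    k = (b - 1).bit_length()            # ceil(log2 b)
    short = (1 << k) - b                # number of remainders that get k - 1 bits
    if r < short:
        value, width = r, k - 1
    else:
        value, width = r + short, k
    binary = format(value, "b").zfill(width) if width else ""
    return unary + binary


def golomb_decode(bits, b, pos=0):
    """Decode one Golomb value starting at pos; return (value, next position)."""
    q = 0
    while bits[pos] == "1":
        q += 1
        pos += 1
    pos += 1                            # skip the zero that ends the unary part
    k = (b - 1).bit_length()
    short = (1 << k) - b
    r = 0
    if k > 0:
        r = int(bits[pos:pos + k - 1], 2) if k > 1 else 0
        pos += k - 1
        if r >= short:                  # long codeword: read one more bit
            r = 2 * r + int(bits[pos]) - short
            pos += 1
    return q * b + r + 1, pos
```

`golomb_encode(9, 3)` returns `"11011"`, the slide's `110 11`. With `b = 1` the remainder part is empty and the code reduces to the unary code.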

8 Global "observed frequency" model
Based on the observed frequency of each gap size
Use arithmetic or Huffman coding
In theory
–a better compression method
In practice
–only slightly better than the $\gamma$ and $\delta$ codes
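As an illustration of the observed-frequency idea, the sketch below builds Huffman codeword lengths from the gap sizes actually seen. It is a generic Huffman construction, not code from the lecture; the function name `huffman_code_lengths` is hypothetical, and only code lengths (not the codewords) are computed.

```python
import heapq
from collections import Counter


def huffman_code_lengths(gaps):
    """Return {gap size: codeword length in bits} for a Huffman code on observed gaps."""
    freq = Counter(gaps)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tie breaker, {symbol: depth in the subtree}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]
```

For the gaps `[1, 1, 1, 1, 2, 2, 3]` the lengths are 1 bit for gap 1 and 2 bits each for gaps 2 and 3. The model (the frequency table) must also be stored, which is part of why the practical gain over the $\gamma$ and $\delta$ codes is small.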

9 Local Bernoulli model
The frequency of term $t$, $f_t$, is known
–a Bernoulli model can be used on each individual inverted file entry
Very common words are encoded with $b = 1$
–this is tantamount to a bitvector
–thus, the inverted file can never be worse than a bitvector
It is necessary to store the parameter $f_t$
–so that $b$ can be derived during decoding
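A sketch of how a per-term Golomb parameter might be derived from $f_t$. It applies the formula for $b$ from the global Bernoulli slide, with the assumption (not stated on this slide) that the local probability is $p = f_t / N$ for a collection of $N$ documents. The function name `local_golomb_parameter` is hypothetical.

```python
import math


def local_golomb_parameter(f_t, num_docs):
    """Choose b for one inverted file entry from its term frequency f_t.

    Assumption: p = f_t / num_docs, plugged into
    b = ceil(log(2 - p) / -log(1 - p)).
    """
    if not 0 < f_t <= num_docs:
        raise ValueError("f_t must be between 1 and num_docs")
    p = f_t / num_docs
    if p == 1.0:
        return 1  # term occurs in every document; log(1 - p) is undefined
    return max(1, math.ceil(math.log(2 - p) / -math.log(1 - p)))
```

A term appearing in half of the documents gets `b = 1`, which makes the Golomb code a unary code and the entry equivalent to a bitvector. A term appearing in 1 of 100 documents gets `b = 69`, close to $0.69 / p$. Because the decoder recomputes `b` from the stored $f_t$, no extra parameter needs to be written.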

10 Skewed Bernoulli model
Vectors $V_G$ and $V_T$ of the Bernoulli model
–slightly worse than the Golomb code
Figure: word positions in the Bible: (a) bridegroom; (b) Jezebel; (c) twelfth

11 Local hyperbolic model
$\Pr[x] = \delta / x$, $x = 1, 2, \ldots, m$
–$\delta = 1 / (\log_e (m + 1) + 0.5772)$ is the normalizing constant
–$m$ is the largest gap
Better performance
More complex to implement
Requires the use of arithmetic coding
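The hyperbolic distribution above can be checked numerically. The sketch below computes the probabilities from the slide's formula and the ideal code length $-\log_2 \Pr[x]$ that an arithmetic coder would approach; the symbol name `delta` for the normalizing constant and the function names are assumptions.

```python
import math


def hyperbolic_probabilities(m):
    """Return [Pr[1], ..., Pr[m]] with Pr[x] = delta / x for largest gap m."""
    if m < 1:
        raise ValueError("m must be at least 1")
    delta = 1.0 / (math.log(m + 1) + 0.5772)  # approximate normalizing constant
    return [delta / x for x in range(1, m + 1)]


def ideal_code_length(x, m):
    """Bits an ideal (arithmetic) coder would spend on gap x under this model."""
    return -math.log2(hyperbolic_probabilities(m)[x - 1])
```

Because the constant approximates the reciprocal of the harmonic sum $1 + 1/2 + \cdots + 1/m$, the probabilities add up to nearly 1 for large $m$. The code lengths are not whole numbers of bits, which is why the slide notes that arithmetic coding is required.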

12 Local "observed frequency" model
The ultimate in local modeling
Batched frequency
Requires more memory space
Best compression method

13 Performance of Index Compression Methods
Bits per pointer
Method                  Bible   GNUbib   Comact     TREC
Global methods
  Unary                   264      920      490     1719
  Binary                15.00    16.00    18.00    20.00
  Bernoulli              9.67    11.65    10.58    12.61
  $\gamma$               6.55     5.69     4.48     6.43
  $\delta$               6.26     5.08     4.36     6.19
  Observed frequency     5.92     4.83     4.21     5.83
Local methods
  Bernoulli              6.13     6.17     5.40     5.73
  Hyperbolic             5.77     5.17     4.65     5.74
  Skewed Bernoulli       5.68     4.71     4.24     5.28
  Batched frequency      5.61     4.65     4.03     5.27

14 Compression of bitmaps
Bitmaps are compressed with hierarchical bitvector compression
Figure: (a) original bitvector; (b) hierarchical structure; (c) flattened tree as a string of bits
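The three stages named in the figure can be sketched as follows. This is an illustrative reading of hierarchical bitvector compression, not the lecture's exact scheme: the block size `k = 4`, the top-down flattening order, and all function names are assumptions. Each upper-level bit records whether the block below it contains any 1 bit, and all-zero blocks are omitted.

```python
def hierarchical_encode(bits, k=4):
    """Return levels from root to leaves; each level holds only its nonzero blocks."""
    levels = []
    current = list(bits)
    while len(current) > k:
        if len(current) % k:                       # pad to a whole number of blocks
            current += [0] * (k - len(current) % k)
        blocks = [current[i:i + k] for i in range(0, len(current), k)]
        levels.append([b for b in blocks if any(b)])
        current = [1 if any(b) else 0 for b in blocks]  # one summary bit per block
    levels.append([current])                       # the root block
    levels.reverse()
    return levels


def flatten(levels):
    """Flatten the tree, root first, into a single string of bits."""
    return "".join(str(bit) for level in levels for block in level for bit in block)


def hierarchical_decode(levels, length, k=4):
    """Rebuild the original bitvector of the given length from the levels."""
    current = levels[0][0]
    for level in levels[1:]:
        blocks = iter(level)
        expanded = []
        for bit in current:
            # A 1 bit points at the next stored block; a 0 bit is an all-zero block.
            expanded.extend(next(blocks) if bit else [0] * k)
        current = expanded
    return current[:length]
```

For the 16-bit vector `0000 0100 0000 0011`, the root is `0101` and only the two nonzero blocks are kept, so the flattened form is the 12-bit string `010101000011`. The saving grows as the bitvector gets sparser.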

