Published by Jonas Bailey. Modified over 9 years ago.
1
5. Index Construction (Artificial Intelligence Laboratory)
2
2 Contents
- Memory-based inversion
- Sort-based inversion
- Exploiting index compression
- Compressed in-memory inversion
- Comparison of inversion methods
- Constructing signature files and bitmaps
- Dynamic collections
3
3 The problem is the size of the frequency matrix
                        Bible                     TREC
Terms                   9,020                     538,244
Docs                    31,102                    742,368
Matrix                  4 x 9,020 x 31,102 bytes  4 x 538,244 x 742,368 bytes
                        = over one gigabyte       = 1.4 terabytes
Figure 5.1 (static)     One month                 127 years
Each matrix entry is a 4-byte integer.
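The matrix sizes on this slide can be checked with a few lines of arithmetic. This is only a sketch: the term and document counts and the 4-byte entry size come from the slide, while the function name `matrix_bytes` and the use of binary units (GiB) are assumptions made for illustration.

```python
# Term and document counts are taken from the slide; 4 bytes per matrix entry.
BYTES_PER_ENTRY = 4

collections = {
    "Bible": {"terms": 9_020, "docs": 31_102},
    "TREC": {"terms": 538_244, "docs": 742_368},
}


def matrix_bytes(terms: int, docs: int) -> int:
    """Size in bytes of a dense terms x docs frequency matrix."""
    return BYTES_PER_ENTRY * terms * docs


for name, c in collections.items():
    size = matrix_bytes(c["terms"], c["docs"])
    # Report in binary units (an assumption about how the slide rounds).
    print(f"{name}: {size:,} bytes = {size / 2**30:,.2f} GiB")
```

Measured in binary units, the TREC matrix comes out at roughly 1.45 TiB, consistent with the slide's "1.4 terabytes".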
4
4 Memory-based inversion
- Fig. 5.2 (documents), Fig. 5.3 (inverted file)
- The linked lists are assumed to be sorted
- Dynamic dictionary data structure
- Linked list (reference point)
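A minimal sketch of memory-based inversion, assuming whitespace tokenisation and lower-casing (neither is specified on the slide). A Python `dict` stands in for the dynamic dictionary and a Python `list` stands in for each linked list; the function name `invert_in_memory` is hypothetical.

```python
from collections import defaultdict


def invert_in_memory(docs):
    """Return {term: [(doc_number, within_doc_frequency), ...]}.

    Documents are numbered from 1 in the order given, so every
    postings list ends up sorted by document number.
    """
    index = defaultdict(list)  # dynamic dictionary: term -> postings list
    for d, text in enumerate(docs, start=1):
        counts = defaultdict(int)
        for term in text.lower().split():  # assumed tokenisation
            counts[term] += 1
        for term, f_dt in counts.items():
            index[term].append((d, f_dt))  # append keeps the list in d order
    return dict(index)


example = invert_in_memory(["a b a", "b c"])
print(example)
```

Because documents are processed in increasing order of document number, appending to the end of each list is enough to keep it sorted, which is the property the slide assumes.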
5
5 Problems?
- Requires a large amount of memory
- The best method for small collections (Bible…)
- Cannot handle random data
6
6 Sort-based inversion
- Fig. 5.4, Fig. 5.5
- Compare time and space
- Quicksort
- External mergesort: [log R]
- Initial sorted runs
- Temporary file
- Merged runs (fully sorted)
- K block
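The run-then-merge structure of sort-based inversion can be sketched as follows. This is an in-memory illustration under stated assumptions: the triples are taken to be (t, d, f) tuples, Python lists stand in for the temporary file, Python's built-in sort replaces Quicksort, and runs are merged pairwise so that R initial runs need about log2(R) passes, rounded up. The name `sort_based_inversion` and the run length `k` (in triples) are hypothetical.

```python
import heapq


def sort_based_inversion(triples, k):
    """Sort (t, d, f) triples via sorted runs of k triples plus pairwise merges.

    Returns the fully sorted list and the number of merge passes used.
    """
    # Initial sorted runs (the built-in sort stands in for Quicksort).
    runs = [sorted(triples[i:i + k]) for i in range(0, len(triples), k)]
    passes = 0
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            pair = runs[i:i + 2]  # an odd run at the end is carried over
            merged.append(list(heapq.merge(*pair)))
        runs = merged
        passes += 1
    return (runs[0] if runs else []), passes


triples = [(3, 1, 1), (1, 1, 2), (2, 2, 1), (1, 2, 1), (3, 3, 1)]
result, passes = sort_based_inversion(triples, k=2)
print(result, passes)
```

With five triples and runs of two there are three initial runs, so two merge passes are needed.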
7
7 Problems?
- Two copies of the temporary file are needed
- Suitable for the 10 to 100 Mbyte range
8
8 Exploiting index compression
To reduce resources (space, time):
- Compress the temporary file (sort-based)
- Build the inverted file in main memory, decompressing before the index is written to disk
Topics:
- Compressing the temporary files
- Multiway merging
- In-place multiway merging
9
9 Compressing the temporary files
- Described in Chapters 3 and 4
- Compressing the temporary file: the t component causes a slight loss of compression (e.g., unary + delta code, TREC collection)
- (Assumption) the t-gap, the difference in t from the next triple, is coded in unary
- t-gap = 0 → code 0, t-gap = 1 → 10, t-gap = 2 → 110 (0.6 Mbyte needed)
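The unary coding of t-gaps shown on this slide can be sketched as below. The convention (gap g becomes g ones followed by a zero) is the one on the slide; the function names and the `start` parameter, which gives the term number the first gap is measured from, are assumptions made for illustration.

```python
def unary(gap):
    """Slide's convention: 0 -> '0', 1 -> '10', 2 -> '110', ..."""
    return "1" * gap + "0"


def encode_t_gaps(term_numbers, start=1):
    """Encode a non-decreasing sequence of term numbers as unary t-gaps."""
    bits = []
    prev = start
    for t in term_numbers:
        bits.append(unary(t - prev))
        prev = t
    return "".join(bits)


def decode_t_gaps(bits, start=1):
    """Inverse of encode_t_gaps."""
    terms, t, ones = [], start, 0
    for bit in bits:
        if bit == "1":
            ones += 1
        else:  # a '0' terminates one unary code
            t += ones
            terms.append(t)
            ones = 0
    return terms


ts = [1, 1, 1, 2, 2, 4]
code = encode_t_gaps(ts)
print(code, decode_t_gaps(code) == ts)
```

Since most consecutive triples in a sorted temporary file share the same term number, most t-gaps are 0 and cost a single bit.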
10
10 Multiway merging
- The merge is now more processor-intensive than disk-intensive
- Reduce time with a multiway merge
- Use a priority queue such as a heap
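A heap-based multiway merge might look like the following sketch. It assumes each run is an already sorted Python list of tuples held in memory; a real implementation would read blocks of each run from the temporary file. The function name `multiway_merge` is hypothetical, and the standard library's `heapq` module provides the priority queue.

```python
import heapq


def multiway_merge(runs):
    """Merge any number of sorted runs in a single pass using a heap."""
    iters = [iter(run) for run in runs]
    heap = []
    # Prime the heap with the first item of every non-empty run.
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, i))
    heapq.heapify(heap)

    out = []
    while heap:
        item, i = heapq.heappop(heap)  # smallest item over all runs
        out.append(item)
        nxt = next(iters[i], None)     # refill from the run it came from
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
    return out


runs = [[(1, 1, 2), (3, 1, 1)], [(1, 2, 1), (2, 2, 1)], [(3, 3, 1)]]
print(multiway_merge(runs))
```

Each output item costs one pop and at most one push on a heap whose size equals the number of runs, which is where the processor cost mentioned on the slide comes from.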
11
11 In-place multiway merging (1)
[Figure: a heap and the blocks in memory (one per run); the temporary file on disk, holding output blocks (OUTPUT BLOCK 1, 2, 3) alongside run blocks (e.g. RUN 1 BLOCK 2, RUN 2 BLOCK 3, RUN 3 BLOCK 2); the block table in memory]
12
12 In-place multiway merging (2)
Algorithm
- A b-byte block from each run in memory is moved into the heap
- b bytes are moved from the heap to the in-memory output block
- The output block is written back to the temporary file (block table)
Use of slack
- Because the output process tends to run ahead of the input process, empty blocks are added
- Add slack → permutation → compaction → truncation
- (Second Edition) permutation → truncation
13
13 Compressed in-memory inversion: Large memory inversion (1)
- Large main-memory array: lists of document numbers d and frequencies f_d,t
- Compared with the in-memory technique (Section 5.1), no next-pointer field is needed
- Term t: f_t log N bits for the document numbers and f_t log m_t bits for the frequencies (m_t: maximum within-document frequency)
- A preliminary pass is needed to obtain N, f_t, m_t
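The per-term space estimate on this slide can be worked through numerically. This sketch assumes base-2 logarithms rounded up to whole bits, which the slide does not state explicitly; the function name `term_bits` and the example values of f_t and m_t are hypothetical.

```python
import math


def term_bits(N, f_t, m_t):
    """Bits reserved for term t: f_t*log N for document numbers
    plus f_t*log m_t for within-document frequencies.

    Assumes log base 2, rounded up to a whole number of bits.
    """
    doc_bits = math.ceil(math.log2(N))
    freq_bits = math.ceil(math.log2(m_t))
    return f_t * doc_bits + f_t * freq_bits


# N = 31,102 is the Bible document count from the earlier slide;
# f_t = 100 and m_t = 5 are made-up values for illustration.
print(term_bits(N=31_102, f_t=100, m_t=5))
```

Because N, f_t and m_t must be known before the array is laid out, the preliminary pass mentioned on the slide is unavoidable with this scheme.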
14
14 Compressed in-memory inversion: Large memory inversion (2)
Two-pass Golomb-coded in-memory inversion
First pass
- Count f_t, F_t
- Write f_t, F_t to a lexicon file
Second pass
- Read the lexicon file
- Calculate b_t = 2^⌊log((N − f_t)/f_t)⌋ and B_t
- Build a compressed in-memory inverted file
- Rebuild the in-memory inverted file
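The Golomb parameter calculation in the second pass can be sketched as follows, reading the slide's formula as b_t = 2^floor(log2((N - f_t)/f_t)). That reading, the base-2 logarithm, and the clamp to 1 for very frequent terms are all assumptions; the function name `golomb_parameter` is hypothetical.

```python
import math


def golomb_parameter(N, f_t):
    """Power-of-two Golomb parameter for a term appearing in f_t of N documents.

    Assumed reading of the slide: b_t = 2 ** floor(log2((N - f_t) / f_t)).
    Terms present in at least half of the documents are clamped to b_t = 1
    (an assumption, since the formula would otherwise give a value below 1).
    """
    ratio = (N - f_t) / f_t
    if ratio < 1:
        return 1
    return 2 ** math.floor(math.log2(ratio))


for f_t in (10, 50, 800):
    print(f_t, golomb_parameter(1000, f_t))
```

Rare terms have large gaps between document numbers and so receive a large b_t, while common terms receive a small one.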
15
15 Compressed in-memory inversion: Lexicon-based partitioning
- Subdivide into small tasks
- Lexicon-based, no extra disk: make multiple second passes, each processing one load (e.g., three second passes)
- Lexicon-based, extra disk: saves time, wastes disk space
16
16 Compressed in-memory inversion: Text-based partitioning
Inversion and merge
- Create an in-memory inverted file
- Merge the inverted file on disk
Chunk information file
- Frequency of each term in the chunk
Second temporary disk file
- Current disk pointer
17
17 Constructing signature files and bitmaps
Enough main memory
- Signatures of k documents, k = 8M / W
- W: signature width (bits)
- M: main memory (bytes)
Bitmap
- Build a compressed inverted file
- Decompress it and store it with unary code
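The relation k = 8M / W is simple unit conversion: M bytes is 8M bits, and each signature takes W bits. A tiny sketch, with a hypothetical function name and made-up example values:

```python
def signatures_in_memory(M_bytes, W_bits):
    """Number of W-bit document signatures that fit in M bytes: k = 8M / W."""
    return (8 * M_bytes) // W_bits  # whole signatures only


# Made-up example: 1,000,000 bytes of memory and 1,000-bit signatures.
print(signatures_in_memory(1_000_000, 1_000))
```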
18
18 Dynamic collections (1)
- 'Insert' operation: append a new document to an existing collection
- 'Edit' operation: alter, remove
- Expanding the text
- Expanding the index
19
19 Dynamic collections: Expanding the text
- When a new document is inserted, the text of the collection must be expanded
- Compression: must cope with hitherto unseen symbols
- Decompression: an escape flag marks symbols stored uncompressed
- Periodically, a new compression model is completely rebuilt
20
20 Dynamic collections: Expanding the Index (1)
'Stop-press' file
- Accumulate updates in a stop-press file
- Rebuild when the file becomes too large
- Drawback: the time needed to reindex the data
The inverted file
- A newly inserted document contains many terms
- Variable-length records
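The stop-press idea can be sketched with two in-memory dictionaries. Everything here is an illustrative assumption: the class name `StopPressIndex`, measuring the stop-press size in postings, answering lookups from both structures, and using a merge to stand in for the full reindexing step the slide describes.

```python
class StopPressIndex:
    """Main index plus a small 'stop-press' area for recent insertions."""

    def __init__(self, limit):
        self.main = {}        # term -> sorted list of document numbers
        self.stop_press = {}  # same shape, holds updates since last rebuild
        self.pending = 0      # postings currently held in the stop-press area
        self.limit = limit    # assumed threshold, counted in postings

    def insert(self, d, terms):
        for t in set(terms):
            self.stop_press.setdefault(t, []).append(d)
            self.pending += 1
        if self.pending > self.limit:  # stop-press area too large
            self.rebuild()

    def rebuild(self):
        # A merge stands in for reindexing the whole collection.
        for t, docs in self.stop_press.items():
            self.main[t] = sorted(self.main.get(t, []) + docs)
        self.stop_press.clear()
        self.pending = 0

    def lookup(self, t):
        # Assumed behaviour: consult both the main index and the stop-press area.
        return self.main.get(t, []) + self.stop_press.get(t, [])


idx = StopPressIndex(limit=3)
idx.insert(1, ["a", "b"])
idx.insert(2, ["a", "c"])
print(idx.lookup("a"), idx.main, idx.stop_press)
```

The trade-off is the one on the slide: insertions are cheap until the threshold is hit, at which point the cost of the rebuild is paid all at once.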
21
21 Dynamic collections: Expanding the Index (2)
Issues
- Suitable file structure
- Record extension
- Record insertion
22
22 Dynamic collections: Expanding the Index (2)
Block structure
- Fixed-length blocks: b bytes
- Each block holds a block address table, records, and free space (Figure 5.15)
Main memory
- Record address table: record number, block number
- Free list
- Current last block of the file
23
23 Dynamic collections: Expanding the Index (3)
Accessing a record
1) Take the record number
2) Look up the block address in the record address table
3) Read the block into memory
4) Find the address of the record within the block
5) Read the record
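The access steps above can be sketched with a small in-memory model of the block-structured file. The class and method names (`Block`, `BlockFile`, `insert`, `read`) are hypothetical, a Python list stands in for the disk, and the first-fit `insert` is a simplification used only to populate the file; it is not the insertion procedure from the slides.

```python
class Block:
    """Fixed-length block: block address table, records, free space."""

    def __init__(self, size):
        self.size = size
        self.table = {}          # record number -> (offset, length)
        self.data = bytearray()  # records packed from the start of the block

    def free_space(self):
        return self.size - len(self.data)

    def add(self, rec_no, payload):
        self.table[rec_no] = (len(self.data), len(payload))
        self.data += payload


class BlockFile:
    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = []        # stands in for the file on disk
        self.record_table = {}  # in memory: record number -> block number

    def insert(self, rec_no, payload):
        """Simplified first-fit placement (an assumption, for illustration)."""
        if len(payload) > self.block_size:
            raise ValueError("record larger than a block")
        for block_no, block in enumerate(self.blocks):
            if block.free_space() >= len(payload):
                break
        else:  # no block had room: start a new one
            block_no = len(self.blocks)
            block = Block(self.block_size)
            self.blocks.append(block)
        block.add(rec_no, payload)
        self.record_table[rec_no] = block_no

    def read(self, rec_no):
        block_no = self.record_table[rec_no]              # steps 1-2
        block = self.blocks[block_no]                     # step 3: block read
        offset, length = block.table[rec_no]              # step 4
        return bytes(block.data[offset:offset + length])  # step 5


f = BlockFile(block_size=16)
f.insert(1, b"abcdefgh")
f.insert(2, b"ijkl")
f.insert(3, b"0123456789")
print(f.read(2), f.record_table)
```

Keeping the record address table in memory means a record can be fetched with a single block read.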
24
24 Dynamic collections: Expanding the Index (4)
Expanding a particular record
Sufficient free space
1) Read the block
2) Move records to make space
3) Add the extension
4) Update the block table and write the block
Insufficient free space
- Remove the smallest record, insert the extension
- Remove the extended record, insert it into a new block
25
25 Dynamic collections: Expanding the Index (5)
Inserting a record
- Check the free list
- Choose the block to insert into
- Create a new block
Block reads/writes (disk operations)
- General case: 2
- Worst case: 4
- Reduce the number of disk operations by using an 'update cache'