5. Index Construction (Artificial Intelligence Laboratory)

Presentation transcript:

5. Index Construction. Artificial Intelligence Laboratory

2 Contents: Memory-based inversion; Sort-based inversion; Exploiting index compression; Compressed in-memory inversion; Comparison of inversion methods; Constructing signature files and bitmaps; Dynamic collections

3 The problem is the size of the frequency matrix (4-byte integer for each entry)
                      Bible                        TREC
Terms                 9,020                        538,244
Documents             31,102                       742,368
Matrix                4 x 9,020 x 31,102 bytes     4 x 538,244 x 742,368 bytes
                      = over one gigabyte          = 1.4 terabytes
Figure 5.1, static    One month                    127 years
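The matrix sizes on this slide can be checked with a few lines of arithmetic. The sketch below only uses the term and document counts quoted above; the function name `matrix_bytes` is a hypothetical helper, and the 1.4 terabyte figure is assumed here to be measured in binary units (2^40 bytes).

```python
# Collection statistics as quoted on the slide.
COLLECTIONS = {
    "Bible": {"terms": 9_020, "docs": 31_102},
    "TREC": {"terms": 538_244, "docs": 742_368},
}
BYTES_PER_ENTRY = 4  # one 4-byte integer per matrix entry


def matrix_bytes(terms, docs, entry_bytes=BYTES_PER_ENTRY):
    """Size in bytes of a dense terms x docs frequency matrix."""
    return entry_bytes * terms * docs


for name, stats in COLLECTIONS.items():
    size = matrix_bytes(stats["terms"], stats["docs"])
    # Report the size in binary gigabytes and terabytes.
    print(f"{name}: {size:,} bytes, {size / 2**30:.2f} GiB, {size / 2**40:.2f} TiB")
```

Almost every entry of such a matrix is zero, which is why the following slides store only the non-zero entries as inverted lists.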

4 Memory-based inversion. Fig. 5.2 (documents), Fig. 5.3 (inverted file). The linked lists are assumed to be sorted. Dynamic dictionary data structure. Linked list (reference point).
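A minimal sketch of memory-based inversion, under some assumptions: a Python `dict` stands in for the dynamic dictionary, Python lists stand in for the linked lists, and terms are obtained by lowercasing and splitting on whitespace. The name `invert_in_memory` is hypothetical and not taken from the slides.

```python
from collections import defaultdict


def invert_in_memory(documents):
    """Build term -> [(doc_number, f_dt), ...] entirely in main memory."""
    index = defaultdict(list)  # stands in for the dynamic dictionary
    for d, text in enumerate(documents, start=1):
        # Count within-document frequencies for document d.
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        # Documents are processed in increasing d, so appending keeps
        # every list sorted by document number.
        for term, f_dt in counts.items():
            index[term].append((d, f_dt))
    return dict(sorted(index.items()))


docs = ["pease porridge hot", "pease porridge cold", "pease porridge in the pot"]
for term, postings in invert_in_memory(docs).items():
    print(term, postings)
```

Because documents arrive in increasing document number, the lists come out sorted without an explicit sort step, which matches the assumption stated on the slide.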

5 Problems? Requires a large amount of memory. The best method for small collections (Bible...). Cannot handle random data.

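6 Sort-based inversion. Fig. 5.4, Fig. 5.5: comparison of time and space. Quicksort produces the initial sorted runs of k blocks in the temporary file. External mergesort: ⌈log R⌉ merge passes turn the R initial sorted runs into merged runs (fully sorted).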
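The two phases of sort-based inversion can be sketched as follows. This is an illustration under assumptions, not the book's algorithm verbatim: in-memory lists stand in for the temporary file, Python's built-in `sorted` stands in for Quicksort, run length `k` is counted in triples rather than blocks, and the function names are hypothetical.

```python
from math import ceil, log2


def make_runs(triples, k):
    """Cut the <t, d, f_dt> stream into runs of k triples and sort each run."""
    return [sorted(triples[i:i + k]) for i in range(0, len(triples), k)]


def merge_two(a, b):
    """Standard two-way merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]


def merge_passes(runs):
    """Merge runs pairwise until one remains; return (sorted list, passes)."""
    passes = 0
    while len(runs) > 1:
        runs = [merge_two(runs[i], runs[i + 1]) if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
        passes += 1
    return (runs[0] if runs else []), passes


triples = [(3, 1, 1), (1, 1, 2), (2, 1, 1), (1, 2, 1), (3, 2, 2), (2, 3, 1), (1, 3, 1)]
runs = make_runs(triples, k=2)
merged, passes = merge_passes(runs)
print(len(runs), "runs,", passes, "passes, expected", ceil(log2(len(runs))))
```

Each pairwise pass halves the number of runs, which is where the ⌈log R⌉ pass count comes from.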

7 Problems? Two copies of the temporary file are needed. Suitable for collections in the 10 to 100 Mbyte range.

8 Exploiting index compression. Goal: reduce resource use (space, time). - Compress the temporary file (sort-based). - Build the inverted file in main memory, decompressing before the index is written to disk. Topics: compressing the temporary files; multiway merging; in-place multiway merging.

9 Compressing the temporary files. Explained in Chapters 3 and 4. In the compressed temporary file, the t component causes a slight loss of compression (e.g. unary + delta code, TREC collection). (Assumption) unary code for the t-gap (the difference in t from the following triple): t-gap = 0 → code 0, t-gap = 1 → code 10, t-gap = 2 → code 110 (0.6 Mbyte needed).
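The unary t-gap code listed on this slide is small enough to write out. In the sketch below the helper names `unary` and `t_gap_codes` are hypothetical, and the term number of the very first triple is assumed to be stored separately, so only the differences between consecutive triples are coded.

```python
def unary(gap):
    """Unary code as used on the slide: 0 -> '0', 1 -> '10', 2 -> '110'."""
    return "1" * gap + "0"


def t_gap_codes(term_numbers):
    """Code the t difference between each pair of consecutive triples.

    term_numbers must be non-decreasing, as they are within a sorted run.
    """
    return [unary(b - a) for a, b in zip(term_numbers, term_numbers[1:])]


# Within a sorted run most consecutive triples share the same term,
# so most t-gaps are 0 and cost a single bit.
print(t_gap_codes([1, 1, 1, 2, 2, 4]))
```

Since a sorted run repeats each term number many times, the dominant gap is 0, and the t component then costs close to one bit per triple.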

10 Multiway merging. The merge is now more processor-intensive than disk-intensive. Reduce the time with a multiway merge. Use a priority queue such as a heap.
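A heap-driven multiway merge can be sketched with Python's standard `heapq` module. The runs are plain in-memory lists here, standing in for the sorted runs of the temporary file, and `multiway_merge` is a hypothetical name.

```python
import heapq


def multiway_merge(runs):
    """Merge any number of sorted runs in one pass using a min-heap."""
    iterators = [iter(run) for run in runs]
    heap = []
    # Seed the heap with the first item of every non-empty run.
    for run_id, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heap.append((first, run_id))
    heapq.heapify(heap)

    merged = []
    while heap:
        item, run_id = heapq.heappop(heap)  # smallest item over all runs
        merged.append(item)
        following = next(iterators[run_id], None)
        if following is not None:
            # Refill the heap from the run the item came from.
            heapq.heappush(heap, (following, run_id))
    return merged


runs = [[(1, 1, 2), (3, 1, 1)], [(1, 2, 1), (2, 1, 1)], [(2, 3, 1)]]
print(multiway_merge(runs))
```

With R runs the heap holds at most R entries, so each output item costs O(log R) comparisons and the data is read only once. The standard library's `heapq.merge` offers the same behaviour as a lazy iterator.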

11 In-place multiway merging (1). [Figure: a heap fed by the blocks in memory, one per run; the temporary file on disk, in which output blocks (OUTPUT BLOCK 1 to 3) are interleaved with unread run blocks (e.g. RUN 1, BLOCK 2; RUN 2, BLOCK 3); the block table in memory.]

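12 In-place multiway merging (2). Algorithm: - a b-byte block from each run is read into memory and fed to the heap - b bytes are moved from the heap to the in-memory output block - the output block is written back to the temporary file (recorded in the block table). Use of slack: - the output process tends to run ahead of the input process, so empty blocks are added. Processing: add slack → permutation → compaction → truncation. (Second Edition) permutation → truncation.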

13 Compressed in-memory inversion: large-memory inversion (1). Large main memory. Array: list of document numbers d and frequencies f_{d,t}. Compared with the in-memory technique (Section 5.1), no next-pointer field is needed. For term t: f_t · ⌈log N⌉ bits for the document numbers and f_t · ⌈log m_t⌉ bits for the frequencies (m_t: maximum within-document frequency). A preliminary pass is needed to obtain N, f_t and m_t.
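The per-term space bound on this slide is easy to evaluate. The sketch assumes base-2 logarithms (since the result is in bits); the function name `term_bits` and the example values of f_t and m_t are hypothetical, while N = 31,102 is the Bible document count from slide 3.

```python
from math import ceil, log2


def term_bits(f_t, N, m_t):
    """Bits for term t in the array layout:
    f_t * ceil(log2 N) for document numbers plus
    f_t * ceil(log2 m_t) for within-document frequencies."""
    doc_bits = ceil(log2(N))
    freq_bits = ceil(log2(m_t))
    return f_t * doc_bits + f_t * freq_bits


# A term appearing in 100 of the 31,102 Bible documents,
# at most 5 times in any single document (hypothetical values).
print(term_bits(f_t=100, N=31_102, m_t=5))
```

All three inputs must be known before the array is allocated, which is why the slide calls for a preliminary pass over the text.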

14 Compressed in-memory inversion: large-memory inversion (2). Two-pass, Golomb-coded, in memory. First pass: - count f_t, F_t - write f_t, F_t to a lexicon file. Second pass: - read the lexicon file - calculate b_t ← 2^⌊log((N − f_t)/f_t)⌋ and B_t - build a compressed in-memory inverted file - rebuild the in-memory inverted file.
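A hedged sketch of the Golomb coding step. The formula on the slide is garbled in this transcript, so the code assumes that b_t is a power of two obtained by rounding log2((N − f_t)/f_t) down; that reading, the clamp to a minimum of 1, and all function names are assumptions rather than something the slides confirm. The unary part uses the same convention as slide 9 (q ones followed by a zero).

```python
from math import floor, log2


def golomb_parameter(N, f_t):
    """Assumed reading of the slide: b_t = 2 ** floor(log2((N - f_t) / f_t))."""
    ratio = (N - f_t) / f_t
    if ratio < 1:
        return 1  # very frequent term: fall back to plain unary
    return 2 ** floor(log2(ratio))


def golomb_encode(gap, b):
    """Code a d-gap >= 1 with power-of-two parameter b:
    unary quotient, then a fixed log2(b)-bit remainder."""
    q, r = divmod(gap - 1, b)
    k = b.bit_length() - 1  # log2(b) for a power of two
    remainder = format(r, f"0{k}b") if k else ""
    return "1" * q + "0" + remainder


def golomb_decode(code, b):
    """Inverse of golomb_encode for a single codeword."""
    q = code.index("0")  # number of leading ones
    k = b.bit_length() - 1
    r = int(code[q + 1:q + 1 + k], 2) if k else 0
    return q * b + r + 1


b = golomb_parameter(N=1000, f_t=10)
print(b, [golomb_encode(g, b) for g in (1, 64, 65, 130)])
```

Restricting b_t to a power of two keeps the remainder a fixed-width field, so encoding and decoding need only shifts and masks.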

15 Compressed in-memory inversion: lexicon-based partitioning. Subdivide the work into small tasks. Lexicon-based, no extra disk: make multiple second passes, each processing one load (e.g. three second passes). Lexicon-based, extra disk: saves time but wastes disk space.

16 Compressed in-memory inversion: text-based partitioning. Inversion and merge: build an in-memory inverted file, then merge it with the inverted file on disk. Chunk. Information file: frequency of each term in the chunk. Second temporary disk file: current disk pointer.

17 Constructing signature files and bitmaps. With enough main memory: hold the signatures of k documents, k = ⌊8M / W⌋ (W: signature width in bits; M: main memory in bytes). Bitmap: build a compressed inverted file, then decompress it and store it with unary code.
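As a quick worked example of the formula k = ⌊8M / W⌋: the memory size and signature width below are hypothetical values chosen only for illustration, and `signatures_in_memory` is a hypothetical helper name.

```python
def signatures_in_memory(M_bytes, W_bits):
    """k = floor(8 * M / W): whole signatures of W bits that fit in M bytes."""
    return (8 * M_bytes) // W_bits


# Hypothetical setting: 1 MiB of main memory, 1000-bit signatures.
print(signatures_in_memory(M_bytes=2**20, W_bits=1000))
```

The factor 8 converts M from bytes to bits so that it is comparable with W.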

18 Dynamic collections (1). 'Insert' operation: append a new document to an existing collection. 'Edit' operation: alter, remove. Expanding the text. Expanding the index.

19 Dynamic collections: expanding the text. Inserting a new document: the text of the collection must be expanded. Compression: must cope with hitherto unseen symbols. Decompression: an escape flag marks a symbol that is stored uncompressed. The collection should periodically be completely rebuilt with a new compression model.

20 Dynamic collections: expanding the index (1). 'Stop-press' file: accumulate updates in a stop-press file; rebuild when the file grows too large. Drawback: reindexing (the data) takes time. The inverted file: a newly inserted document contains many terms; variable-length records.
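The stop-press idea can be sketched as a small class. Everything here is an illustrative assumption: the class name `StopPressIndex`, the `limit` threshold measured in postings, the in-memory dictionaries standing in for the on-disk files, and the choice that a lookup consults both the main index and the stop-press file.

```python
class StopPressIndex:
    """Main inverted index plus a small 'stop-press' file of recent updates."""

    def __init__(self, main_index, limit):
        self.main = {t: list(p) for t, p in main_index.items()}
        self.stop_press = {}
        self.limit = limit  # hypothetical threshold: postings held before a rebuild
        self.pending = 0
        self.rebuilds = 0

    def insert(self, doc_number, terms):
        """Accumulate the new document's postings in the stop-press file."""
        for t in set(terms):
            self.stop_press.setdefault(t, []).append(doc_number)
            self.pending += 1
        if self.pending > self.limit:
            self.rebuild()

    def rebuild(self):
        """Fold the stop-press postings into the main index (the costly step)."""
        for t, docs in self.stop_press.items():
            self.main.setdefault(t, []).extend(docs)
        self.stop_press.clear()
        self.pending = 0
        self.rebuilds += 1

    def lookup(self, term):
        """Assumed query behaviour: answer from both structures."""
        return self.main.get(term, []) + self.stop_press.get(term, [])


idx = StopPressIndex({"index": [1]}, limit=2)
idx.insert(2, ["index", "merge"])
idx.insert(3, ["index"])
print(idx.lookup("index"), idx.rebuilds)
```

The threshold trades query overhead (two structures to consult) against how often the expensive rebuild runs.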

21 Dynamic collections: expanding the index (2). Issues: a suitable file structure; record extension; record insertion.

22 Dynamic collections: expanding the index (2). Block structure. Fixed-length blocks of b bytes: block address table, records, free space (Figure 5.15). In main memory: record address table (record number, block number); free list; current last block of the file.

23 Dynamic collections: expanding the index (3). Accessing a record: 1) take the record number; 2) get the block address from the record address table; 3) read the block into memory; 4) find the address of the record within the block; 5) read the record.
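The five access steps can be traced in a short sketch. The layout is an assumption made for illustration: a block is modelled as a pair (block address table, data bytes), the disk as a dictionary from block number to block, and `BLOCK_SIZE`, `build_block` and `read_record` are hypothetical names.

```python
BLOCK_SIZE = 64  # hypothetical block length b, in bytes


def build_block(records):
    """Pack {record_number: bytes} into one block.

    Returns (block_table, data), where block_table maps a record number
    to its (offset, length) inside the block.
    """
    table, data = {}, bytearray()
    for rec_no, payload in records.items():
        table[rec_no] = (len(data), len(payload))
        data += payload
    if len(data) > BLOCK_SIZE:
        raise ValueError("records do not fit in one block")
    return table, bytes(data)


def read_record(rec_no, record_address_table, disk):
    """Follow steps 1 to 5 from the slide for one record number."""
    block_no = record_address_table[rec_no]  # 2) block address from the in-memory table
    block_table, data = disk[block_no]       # 3) read the block into memory
    offset, length = block_table[rec_no]     # 4) address of the record within the block
    return data[offset:offset + length]      # 5) read the record


disk = {0: build_block({1: b"abc", 2: b"de"}), 1: build_block({3: b"xyz"})}
record_address_table = {1: 0, 2: 0, 3: 1}
print(read_record(2, record_address_table, disk))
```

Only step 3 touches the disk, so a record access costs a single block read as long as the record address table stays in main memory.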

24 Dynamic collections: expanding the index (4). Expanding a particular record. Sufficient free space: 1) read the block; 2) move records to make space; 3) add the extension; 4) update the block table and write the block. Insufficient free space: - remove the smallest record and insert the extension - remove the extended record and insert it into a new block.

25 Dynamic collections: expanding the index (5). Inserting a record. Check the free list: - decide which block to insert into - create a new block. Block reads/writes (disk operations): general case 2; worst case 4. Reduce the number of disk operations by using an 'update cache'.