1 Signature Files Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992.

Presentation transcript:

1 Signature Files
Information Retrieval: Data Structures and Algorithms, W.B. Frakes and R. Baeza-Yates (Eds.), Englewood Cliffs, NJ: Prentice Hall, 1992 (Chapter 4)

2 Signature Files
Characteristics
- Word-oriented index structures based on hashing
- Low overhead (10% to 20% over the text size), at the cost of forcing a sequential search over the index
- Suitable for texts that are not very large
- Inverted files outperform signature files for most applications

3 Structure
- Superimposed coding is used to create the signature.
- Each text is divided into logical blocks.
- A block contains n distinct non-common words.
- Each word yields a "word signature".
- A word signature is a B-bit pattern with m bits set to 1.
  - Each word is divided into successive, overlapping triplets, e.g. free --> " fr", "fre", "ree", "ee " (the word is padded with a blank on each side).
  - Each such triplet is hashed to a bit position.
- The word signatures are OR'ed to form the block signature.
- Block signatures are concatenated to form the document signature.
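The construction above can be sketched in Python. This is a minimal illustration, not the book's implementation: the use of SHA-256 as the hash function, the integer-as-bit-vector representation, and the salted fallback used when the triplets alone set fewer than m distinct bits are all assumptions made for the example.

```python
import hashlib


def triplets(word):
    """Split a word into successive, overlapping triplets (blank-padded)."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def _position(key, B):
    """Hash a string to a bit position in [0, B). SHA-256 is an assumption."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % B


def word_signature(word, B=12, m=4):
    """Return a B-bit word signature (as an int) with exactly m bits set."""
    bits = set()
    for t in triplets(word):
        if len(bits) == m:
            break
        bits.add(_position(t, B))
    # Assumed fallback: if the triplets collide or are too few,
    # keep hashing salted copies of the word until m bits are set.
    salt = 0
    while len(bits) < m:
        bits.add(_position(f"{word}#{salt}", B))
        salt += 1
    signature = 0
    for b in bits:
        signature |= 1 << b
    return signature


def block_signature(words, B=12, m=4):
    """OR the word signatures of a logical block together."""
    signature = 0
    for w in words:
        signature |= word_signature(w, B, m)
    return signature
```

Because the block signature is the OR of its word signatures, every bit set by a word in the block is also set in the block signature, which is what the search step relies on.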

4 Example
- Example (n=2, B=12, m=4)
  [Table: the word signatures of "free" and "text" and the resulting block signature]
- Search
  - Use the hash function to determine the m bit positions that the search word sets to 1.
  - Examine each block signature: the block qualifies if it has a 1 in every bit position where the signature of the search word has a 1.
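The search test reduces to a bitwise AND. The sketch below assumes signatures are held as Python integers; the block signatures in the usage note are hypothetical values chosen for illustration.

```python
def qualifies(block_sig: int, query_sig: int) -> bool:
    """A block qualifies if it has a 1 wherever the query signature has a 1."""
    return (block_sig & query_sig) == query_sig


def search(block_sigs, query_sig):
    """Sequentially scan all block signatures; return indices of candidates."""
    return [i for i, sig in enumerate(block_sigs) if qualifies(sig, query_sig)]
```

For instance, with a query signature of `0b001000110010` and the hypothetical block signatures `[0b001010111011, 0b110000000101, 0b011001110110]`, blocks 0 and 2 qualify. A qualifying block is only a candidate: it may be a false drop and still has to be checked against the text.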

5 False Drop
- False alarm (false hit, or false drop) probability $F_d$: the probability that a block signature seems to qualify, given that the block does not actually qualify.
  $F_d = \mathrm{Prob}\{\text{signature qualifies} \mid \text{block does not}\}$
- For a given value of B, the value of m that minimizes the false drop probability is such that each row of the matrix contains 1s with probability 0.5:
  $F_d = 2^{-m}$, where $m = B \ln 2 / n$
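As a quick numerical check of the two formulas, the following sketch evaluates them for the parameters of the earlier example (B=12, n=2). The function names are illustrative only.

```python
import math


def optimal_m(B, n):
    """m = B ln 2 / n: bits per word that minimize the false drop probability."""
    return B * math.log(2) / n


def false_drop_probability(B, n):
    """F_d = 2^(-m) at the optimal m."""
    return 2.0 ** (-optimal_m(B, n))
```

With B=12 and n=2 the optimal m is about 4.16, which is consistent with the choice m=4 in the example, and the corresponding false drop probability is roughly 0.056.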

6 Sequential Signature File (SSF)
[Figure: signature matrix with one row per document]
- Assume that each document spans exactly one logical block.
- Then the size of the document signature F equals the size of the block signature B.

7 Classification of Signature-Based Methods
- Compression: if the signature matrix is deliberately sparse, it can be compressed.
- Vertical partitioning: storing the signature matrix column-wise improves the response time at the expense of insertion time.
- Horizontal partitioning: grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.

8 Classification of Signature-Based Methods
- Sequential storage of the signature matrix
  - without compression: sequential signature files (SSF)
  - with compression: bit-block compression (BC), variable bit-block compression (VBC)
- Vertical partitioning
  - without compression: bit-sliced signature files (BSSF, B’SSF), frame-sliced (FSSF), generalized frame-sliced (GFSSF)

9 Classification of Signature-Based Methods (Continued)
- Vertical partitioning (continued)
  - with compression: compressed bit slices (CBS), doubly compressed bit slices (DCBS), no-false-drop method (NFD)
- Horizontal partitioning
  - data-independent partitioning: Gustafson’s method, partitioned signature files
  - data-dependent partitioning: 2-level signature files, S-trees

10 Criteria
- The storage overhead
- The response time on single-word queries
- The performance on insertion, as well as whether the insertion maintains the "append-only" property

11 Compression
- Idea
  - Create sparse document signatures on purpose.
  - Compress them before storing them sequentially.
- Method
  - Use a B-bit vector, where B is large.
  - Hash each word into one (or k) bit position(s).
  - Use run-length encoding (McIlroy 1982).

12 Compression Using Run-Length Encoding
[Figure: the words "data", "base", "management", and "system" are each hashed to one bit of a sparse block signature; the interval lengths L1 to L5 are stored as [L1] [L2] [L3] [L4] [L5], where [x] is the encoded value of x]
- Search: decode the encoded lengths of all the preceding intervals.
- Example: search for "data"
  (1) hash "data" to obtain its bit position
  (2) decode [L1] = 0000, decode [L2] = 00, decode [L3] = ...
- Disadvantage: search becomes slow.
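The following sketch shows why search requires decoding every preceding interval. It is a simplification under stated assumptions: run lengths are kept as plain Python integers rather than in the variable-length code of McIlroy (1982), bit positions are 0-based, and the 20-bit vector in the usage note is a hypothetical example.

```python
def encode_runs(bits):
    """Return (runs, tail): the zero-run length before each 1, and trailing zeros."""
    runs, run = [], 0
    for bit in bits:
        if bit == "1":
            runs.append(run)
            run = 0
        else:
            run += 1
    return runs, run


def is_set(runs, position):
    """Check whether the bit at `position` is 1 by decoding runs from the left."""
    cursor = -1  # position of the most recently decoded 1
    for length in runs:
        cursor += length + 1
        if cursor == position:
            return True
        if cursor > position:
            return False
    return False
```

For the hypothetical vector `"00001001000000101000"`, `encode_runs` yields the runs `[4, 2, 6, 1]` with 3 trailing zeros. Testing bit 14 has to walk through the first three runs before it can answer, which is the sequential decoding cost the slide points to.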

13 Bit-block Compression (BC)
Data structure:
(1) The sparse vector is divided into groups of consecutive bits (bit-blocks).
(2) Each bit-block is encoded individually.
Algorithm (the signature of a bit-block has up to three parts):
- Part I is one bit long. It indicates whether there are any 1s in the bit-block (1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.
- Part II indicates the number s of 1s in the bit-block. It consists of s-1 1s and a terminating zero.
- Part III contains the offsets of the 1s from the beginning of the bit-block.
Note: with 4-bit blocks, the offsets are 0, 1, 2, 3, encoded as 00, 01, 10, 11.
[Figure: example block signature divided into bit-blocks]

14 Bit-block Compression (BC) (Continued)
Search for "data":
(1) Hash "data" to obtain its bit position.
(2) Check the 4th bit-block of the signature.
(4) OK, there is at least one bit set in the 4th bit-block.
(5) Check further. "0" tells us there is only one bit set in the 4th bit-block. Is it the 3rd bit?
(6) Yes, "10" confirms the result.
Discussion:
(1) Bit-block compression requires less space than the Sequential Signature File for the same false drop probability.
(2) The response time of bit-block compression is slightly less than that of the Sequential Signature File.
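The three-part encoding and the lookup walked through above can be sketched as follows. This is an illustrative sketch, not the book's storage layout: the parts are kept as Python lists of strings instead of packed bit streams, bit positions are 0-based, and the 20-bit vector used in the usage note is a hypothetical example.

```python
def _offset_width(b):
    """Number of bits needed to write an offset within a b-bit block."""
    return max(1, (b - 1).bit_length())


def bc_encode(bits, b=4):
    """Encode a sparse bit string into Part I, Part II, and Part III lists."""
    part1, part2, part3 = [], [], []
    width = _offset_width(b)
    for start in range(0, len(bits), b):
        block = bits[start:start + b]
        offsets = [i for i, bit in enumerate(block) if bit == "1"]
        if not offsets:
            part1.append("0")          # empty bit-block: signature stops here
            continue
        part1.append("1")
        part2.append("1" * (len(offsets) - 1) + "0")   # s-1 ones, then a zero
        part3.append("".join(format(o, f"0{width}b") for o in offsets))
    return part1, part2, part3


def bc_contains(encoded, position, b=4):
    """Test whether the bit at `position` is set, using the three parts."""
    part1, part2, part3 = encoded
    block, offset = divmod(position, b)
    if part1[block] == "0":
        return False
    k = part1[:block + 1].count("1") - 1   # index among non-empty blocks
    s = len(part2[k])                      # number of 1s in this block
    width = _offset_width(b)
    stored = [int(part3[k][i * width:(i + 1) * width], 2) for i in range(s)]
    return offset in stored
```

Encoding the hypothetical vector `"00001001000000101000"` with b=4 gives Part I `0 1 0 1 1`, Part II `10 0 0`, and Part III `0011 10 00`. Looking up position 14 (4th bit-block, 3rd bit) finds Part I set, Part II equal to `0` (one bit set), and the offset `10`, mirroring the steps on the slide.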

15 Vertical Partitioning
- Idea: avoid bringing useless portions of the document signature into main memory.
- Methods
  - Store the signature file in a bit-sliced form or in a frame-sliced form.
  - Store the signature matrix column-wise to improve the response time at the expense of insertion time.

16 Bit-Sliced Signature Files (BSSF)
[Figure: the signature matrix, whose rows represent documents (document signatures), is transposed and stored as a transposed bit matrix]

17 Bit-Sliced Signature Files (BSSF) (Continued)
The transposed matrix is stored as F bit-files.
- Search
  (1) Retrieve m bit-files. For example, if the word signature of "free" has its 3rd, 7th, 8th, and 11th bits set, then any document containing "free" has those bits set, so only the 3rd, 7th, 8th, and 11th bit-files are examined.
  (2) AND these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).
  (3) Retrieve the text file through the pointer file.
- Insertion: requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting.
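A small in-memory sketch of the bit-sliced layout follows. It assumes document signatures are given as strings of "0" and "1", uses 0-based bit positions, and holds each bit-file as a Python integer whose bit i belongs to document i. The document signatures in the usage note are hypothetical.

```python
def build_bit_slices(doc_signatures, F):
    """Transpose the signature matrix: slice j collects bit j of every document."""
    slices = [0] * F
    for doc_id, signature in enumerate(doc_signatures):
        for j in range(F):
            if signature[j] == "1":
                slices[j] |= 1 << doc_id
    return slices


def search(slices, query_positions, n_docs):
    """AND the bit-files named by the query; 1s mark the qualifying documents."""
    result = (1 << n_docs) - 1          # start with every document as a candidate
    for j in query_positions:
        result &= slices[j]
    return [doc_id for doc_id in range(n_docs) if (result >> doc_id) & 1]
```

With the hypothetical documents `["001010111011", "110000000101", "001000110010"]` and the query positions `[2, 6, 7, 10]` (the 3rd, 7th, 8th, and 11th bits, counted from 0), only four of the twelve slices are read and documents 0 and 2 qualify.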

18 Frame-Sliced Signature File (FSSF)
- Ideas
  - Random disk accesses are more expensive than sequential ones.
  - Force each word to hash into bit positions that are close to each other in the document signature.
  - These bit-files are stored together and can be retrieved with a few random accesses.
- Procedure
  - The document signature (F bits long) is divided into k frames of s consecutive bits each.
  - For each word in the document, one of the k frames is chosen by a hash function.
  - Using another hash function, the word sets m bits in that frame.

19 Frame-Sliced Signature File (Cont.)
[Figure: signature matrix with documents as rows, divided column-wise into frames]
Each frame is kept in consecutive disk blocks.

20 FSSF (Continued)
- Example (n=2, B=12, s=6, f=2, m=3), where f is the number of frames
  [Table: the word signatures of "free" and "text" and the resulting document signature]
- Search
  - Only one frame has to be retrieved for a single-word query, i.e., only one random disk access is required.
    E.g., to search for documents that contain the word "free": because the word signature of "free" is placed in the 2nd frame, only the 2nd frame has to be examined.
  - For a multi-word query, at most one frame per query word has to be scanned.
- Insertion
  - Only f frames have to be accessed instead of F bit-slices.
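The two-hash procedure can be sketched as below. The hash functions (salted SHA-256) and the string representation of the signature are assumptions for illustration, so the frame chosen for a given word will generally differ from the one in the slide's example.

```python
import hashlib


def _h(key, modulus):
    """Illustrative hash: SHA-256 reduced modulo `modulus` (an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def frame_of(word, k):
    """First hash function: choose one of the k frames for the word."""
    return _h(f"frame:{word}", k)


def word_signature(word, k=2, s=6, m=3):
    """Second hash function: set m bits inside the chosen s-bit frame."""
    frame = frame_of(word, k)
    bits, salt = set(), 0
    while len(bits) < m:                 # requires m <= s
        bits.add(_h(f"bit:{salt}:{word}", s))
        salt += 1
    signature = ["0"] * (k * s)
    for b in bits:
        signature[frame * s + b] = "1"
    return "".join(signature)
```

Since all m bits of a word fall inside a single frame, a single-word query only needs the bit-files of `frame_of(word, k)`, which are stored contiguously.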

21 Vertical Partitioning with Compression
- Idea
  - Create a very sparse signature matrix.
  - Store it in a bit-sliced form.
  - Compress each bit slice by storing the positions of the 1s in the slice.

22 Compressed Bit Slices (CBS)
- Room for improvement
  - Searching
    - Each search word requires the retrieval of m bit-files.
    - The search time could be improved if m were forced to be 1.
  - Insertion
    - Requires too many disk accesses (equal to F, which is typically ...).

23 Compressed Bit Slices (CBS) (Continued)
- Let m=1. To maintain the same false drop probability, F has to be increased.
- To compress each bit-file, we store only the positions of the 1s.
- Because the number of 1s is unpredictable, we store them in buckets of size $B_p$.
[Figure: sparse bit matrix with documents as rows; the number of columns is the size of a signature]

24 Compressed Bit Slices (CBS) (Continued)
[Figure: h("base")=30. Hash a word to obtain the bucket address, then obtain the pointers to the relevant documents from the buckets.]
- Differences with inversion
  - The directory (hash table) is sparse.
  - The actual word is stored nowhere.
  - Simple structure
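A minimal in-memory sketch of this structure follows. The class name, the default hash (SHA-256 modulo the directory size), and the list-of-lists bucket chain are assumptions for illustration; a real implementation would keep the buckets on disk.

```python
import hashlib


class CompressedBitSlices:
    """Sparse directory of S slots; each slot chains buckets of document ids."""

    def __init__(self, S=64, bucket_size=4, h=None):
        self.S = S
        self.bucket_size = bucket_size
        # Assumed default hash; pass `h` to supply a different one.
        self.h = h or (lambda w: int.from_bytes(
            hashlib.sha256(w.encode("utf-8")).digest()[:8], "big") % S)
        self.directory = [None] * S       # None marks an empty (all-zero) slice

    def insert(self, word, doc_id):
        slot = self.h(word)
        if self.directory[slot] is None:
            self.directory[slot] = [[]]
        chain = self.directory[slot]
        if chain[-1] and chain[-1][-1] == doc_id:
            return                        # this document is already recorded
        if len(chain[-1]) == self.bucket_size:
            chain.append([])              # current bucket is full: chain a new one
        chain[-1].append(doc_id)

    def search(self, word):
        chain = self.directory[self.h(word)] or []
        return [doc_id for bucket in chain for doc_id in bucket]
```

Note that the word itself is never stored: two words that hash to the same slot share one posting chain, so a search for either returns the documents of both. This is the source of false drops in this scheme.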

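25 Doubly Compressed Bit Slices (DCBS)
[Figure: h1("base")=30, h2("base")=011. Follow the pointers of the posting buckets to retrieve the qualifying documents.]
- Idea: compress the sparse directory.
- When S becomes smaller, the chance of collisions grows, so intermediate buckets are used.
- To distinguish true collisions from false ones, a second hash function is added.
- Synonyms are distinguished only partially.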
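The role of the second hash function can be sketched as follows. The class and parameter names are hypothetical, and the default hashes (salted SHA-256) are an assumption; the intermediate bucket is modeled as a list of (h2 code, postings) pairs.

```python
import hashlib


def _hash(word, salt, modulus):
    """Illustrative salted hash (an assumption, not the book's function)."""
    digest = hashlib.sha256(f"{salt}:{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class DoublyCompressedBitSlices:
    def __init__(self, S=16, code_bits=3, h1=None, h2=None):
        self.h1 = h1 or (lambda w: _hash(w, "h1", S))
        self.h2 = h2 or (lambda w: _hash(w, "h2", 1 << code_bits))
        # Each directory slot points to an intermediate bucket of
        # (h2 code, posting list) entries.
        self.directory = [[] for _ in range(S)]

    def insert(self, word, doc_id):
        bucket = self.directory[self.h1(word)]
        code = self.h2(word)
        for entry_code, postings in bucket:
            if entry_code == code:
                if doc_id not in postings:
                    postings.append(doc_id)
                return
        bucket.append((code, [doc_id]))

    def search(self, word):
        code = self.h2(word)
        for entry_code, postings in self.directory[self.h1(word)]:
            if entry_code == code:
                return list(postings)
        return []
```

Words that collide under h1 are separated whenever their h2 codes differ. Words that agree on both hashes still share a posting list, which is why the synonyms are only partially distinguished.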

26 No False Drops Method (NFD)
- Uses a pointer to the word in the text file.
- Distinguishes between synonyms completely.

27 Horizontal Partitioning
[Figure: signature matrix with documents as rows, split into groups of rows]
1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.
2. Grouping criterion

28 Partitioned Signature Files
- A portion of a document signature is used as a signature key to partition the signature file.
- All signatures with the same key are grouped into a so-called "module".
- When a query signature arrives:
  - examine its signature key and look for the corresponding modules
  - scan all the signatures within those modules that have been selected
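The partitioning and the module selection can be sketched as below. The choice of the leading `key_bits` bits as the signature key is an assumption made for the example; the selection rule follows from superimposed coding, since a module can hold a qualifying signature only if its key has a 1 wherever the query's key does. Signatures are Python integers of F bits, and the values in the usage note are hypothetical.

```python
from collections import defaultdict


def key_of(signature, key_bits, F):
    """Signature key: the leading key_bits bits of an F-bit signature (assumed)."""
    return signature >> (F - key_bits)


def build_modules(signatures, key_bits, F):
    """Group (doc_id, signature) pairs into modules by signature key."""
    modules = defaultdict(list)
    for doc_id, signature in enumerate(signatures):
        modules[key_of(signature, key_bits, F)].append((doc_id, signature))
    return modules


def search(modules, query_sig, key_bits, F):
    """Scan only the modules whose key covers the query's key."""
    query_key = key_of(query_sig, key_bits, F)
    hits = []
    for key, members in modules.items():
        if (key & query_key) != query_key:
            continue                      # no signature here can qualify
        hits.extend(doc_id for doc_id, sig in members
                    if (sig & query_sig) == query_sig)
    return sorted(hits)
```

With F=12, a 3-bit key, the hypothetical signatures `[0b001010111011, 0b110000000101, 0b001000110010, 0b101000110010]`, and the query `0b001000110010`, the module with key `110` is skipped entirely and documents 0, 2, and 3 qualify.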