CSE3201/CSE4500 Information Retrieval Systems Signature Files
Signature File for Text Retrieval A “signature” is created as an abstraction of a document. All the signatures that represent the documents in the collection are kept in a file called a “signature file”.
Word Signature (WS) A word signature is a fixed-length bit string that represents a word. It is described by its length (N) and the number of bits set to 1 (k). Example with N=24 and k=7: 1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0
Word Signature Generation Use a hash function to find the position of each bit that will be set to 1. Word signatures can be generated from triplets of characters: divide the word into overlapping triplets, then for each triplet (1) convert the characters to a numeric value (for example, the ASCII codes of the characters) and (2) use that number as the input to the hash function. The hash function produces a number that gives the bit position of the triplet in the word signature.
Word Signature Generation Example: the signature 111000111001 is generated for the word “signature”. Bit positions are read from left to right, starting at 1.
Triplets: -si sig ign gna nat atu tur ure re-
Bit positions set: 12 7 3 2 1 9 8
Signature: 1 1 1 0 0 0 1 1 1 0 0 1
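The triplet procedure above can be sketched in a few lines of Python. The slides do not name a hash function, so this sketch assumes CRC-32 of the triplet reduced modulo N; the function names `triplets` and `word_signature` are also made up for illustration. Because the hash is an assumption, the output will not reproduce the 111000111001 signature of the example.

```python
import zlib

def triplets(word):
    """Overlapping character triplets, with '-' marking the word boundaries."""
    padded = f"-{word}-"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_signature(word, n_bits=12):
    """Return an n_bits-long string of 0s and 1s, one hashed bit per triplet."""
    bits = ["0"] * n_bits
    for triplet in triplets(word):
        # Assumed hash: CRC-32 of the triplet's bytes, reduced modulo n_bits.
        # Index 0 is the leftmost bit (the slides count positions from 1).
        position = zlib.crc32(triplet.encode("utf-8")) % n_bits
        bits[position] = "1"
    return "".join(bits)

print(triplets("signature"))
print(word_signature("signature"))
```

Two triplets may hash to the same position, so the number of bits set can be smaller than the number of triplets.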
Document Signature (DS) A document signature can be created using two methods: concatenation of word signatures, or superimposed coding.
Document Signature – Concatenation of WS The length of a document signature (DS) can vary, so a fixed number of bits may precede the DS to indicate its length. Alternatively, the length of the DS can be fixed: it is set to the signature length of the longest document in the collection, and the signatures of shorter documents are padded with extra “0” bits.
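Both layouts can be illustrated with a short Python sketch. The 8-bit length prefix, the choice to store the length in bits, and the function names are assumptions made for this example; the slides only say that a fixed number of bits may precede the DS.

```python
def concatenated_signature(word_signatures, prefix_bits=8):
    """Variable-length DS: a length prefix followed by the word signatures."""
    body = "".join(word_signatures)
    if len(body) >= 2 ** prefix_bits:
        raise ValueError("signature too long for the length prefix")
    # Assumed convention: the prefix stores the body length in bits.
    return format(len(body), f"0{prefix_bits}b") + body

def fixed_length_signature(word_signatures, fixed_length):
    """Fixed-length DS: the concatenation padded on the right with 0 bits."""
    body = "".join(word_signatures)
    if len(body) > fixed_length:
        raise ValueError("document is longer than the fixed signature length")
    return body.ljust(fixed_length, "0")

words = ["001000110010", "000010101001"]      # two 12-bit word signatures
variable_ds = concatenated_signature(words)   # 8-bit prefix + 24 bits
fixed_ds = fixed_length_signature(words, 36)  # 24 bits + 12 padding zeros
print(variable_ds)
print(fixed_ds)
```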
Document Signature – Superimposed Coding Each document is divided into blocks containing a constant number of distinct words. To create a block signature, perform an OR operation on the word signatures of all the words in the block.
free            001 000 110 010
text            000 010 101 001
Block signature 001 010 111 011
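The OR step can be checked with a small Python sketch that uses the two word signatures from the slide. The helper names `to_bits` and `block_signature` are made up for this example.

```python
def to_bits(text):
    """Parse a signature written as groups of 0s and 1s into an integer."""
    return int(text.replace(" ", ""), 2)

def block_signature(word_signatures):
    """Superimpose (bitwise OR) all word signatures of a block."""
    signature = 0
    for word_signature in word_signatures:
        signature |= word_signature
    return signature

free = to_bits("001 000 110 010")
text = to_bits("000 010 101 001")
block = block_signature([free, text])
print(format(block, "012b"))  # 001010111011
```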
Document Signature – Superimposed Coding To create the document signature, all the block signatures are superimposed.
Query Signature The query is converted to a signature in the same way as a block of a document.
Query: free text
free            001 000 110 010
text            000 010 101 001
Query signature 001 010 111 011
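Query on Signature File To test for a match, perform an AND operation between the query signature and the block signature; if (result - query) = 0, that is, the result equals the query signature, they match.
Query: 001 010 111 011
[Figure: the query compared with a list of block signatures; the match results, in order, are No, Yes, No, No, Yes, No, Yes.]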
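The match test can be written as a one-line Python function. The query signature is the one from the slide; the two block signatures are made up for this example, since the ones in the original figure are not available.

```python
def matches(query, block):
    """A block matches when every 1 bit of the query is also set in the block."""
    return (query & block) == query

query = 0b001010111011

# Hypothetical block signatures, chosen only to show both outcomes.
block_a = 0b101011111011  # contains all query bits plus two extra 1 bits
block_b = 0b001010110011  # one of the query's 1 bits is missing

print(matches(query, block_a))  # True
print(matches(query, block_b))  # False
```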
Signature File Structure Sequential: during searching, every signature is compared with the query signature, which is time consuming. Bit-sliced: the signature file is stored as its matrix transpose.
Matrix Transposed
Bit-Sliced [Figure: sequential versus bit-sliced layout for documents d1 to d4. Sequential: one record of N bits per document. Bit-sliced: N records, one per bit position, each holding that bit for d1 to d4.]
Bit Sliced Signature File Retrieval If the ith bit in the query signature is set to 1, retrieve the ith record (bit slice) of the signature file. If n bits are set to 1 in the query signature, only n records need to be retrieved.
Bit Sliced Signature File Query: 001 010 111 011 [Figure: the records retrieved for the 1 bits of the query.] The 2nd block is a match, because all bits in its column of the retrieved records are set to 1.
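Bit-sliced retrieval can be sketched in Python as follows. The four block signatures are made up for this example, signatures are held as strings of 0s and 1s for readability, and the function names are hypothetical.

```python
def bit_slices(signatures, n_bits):
    """Transpose the file: slice i holds bit i of every signature."""
    return ["".join(sig[i] for sig in signatures) for i in range(n_bits)]

def search(slices, query):
    """Read only the slices where the query has a 1; keep blocks that are all 1s."""
    n_blocks = len(slices[0])
    candidates = [True] * n_blocks
    for i, query_bit in enumerate(query):
        if query_bit == "1":
            for block in range(n_blocks):
                candidates[block] = candidates[block] and slices[i][block] == "1"
    return [block for block, ok in enumerate(candidates) if ok]

# Hypothetical block signatures (one per block), 12 bits each.
signatures = ["001010111011", "101011111011", "001010110011", "111111111111"]
query = "001010111011"

slices = bit_slices(signatures, 12)
print(search(slices, query))  # [0, 1, 3]
print(query.count("1"))       # 7 of the 12 slices are read
```

A block survives only if its column contains a 1 in every retrieved slice, which is the same condition as the AND test on the sequential file.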
Bit Sliced Signature File Advantages: fewer records are retrieved, so retrieval is faster. Disadvantages: an update operation becomes very costly, since adding or changing one signature touches many bit-slice records.
False Drop A false drop occurs when a document’s signature matches a query’s signature but the query’s word does not match any word in the document. It is possible because two distinct blocks may have the same signature, due to the hashing algorithm or superimposed coding. The rate of false drops depends on: the size of the signature (N bits); the number of bits set to 1 (k bits); the number of words per block.
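A false drop caused by superimposed coding can be reproduced with the block from the earlier slide. The query word “data” and its signature are made up for this example; they are chosen so that each of its bits is covered by a different word in the block.

```python
# Word signatures of the block, taken from the superimposed coding slide.
block_words = {"free": 0b001000110010, "text": 0b000010101001}

block_signature = 0
for word_signature in block_words.values():
    block_signature |= word_signature

# Hypothetical query word and signature (not from the slides).
query_word = "data"
query_signature = 0b001010000000

signature_match = (query_signature & block_signature) == query_signature
word_present = query_word in block_words
false_drop = signature_match and not word_present

print(signature_match)  # True
print(word_present)     # False
print(false_drop)       # True
```

This is why blocks that pass the signature test are usually checked against the actual text before being returned.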
Inverted or Signature? Inverted files: slower retrieval; more accurate; easier to maintain. In fact, inverted files are still the most popular storage structure for information retrieval.