CSE3201/CSE4500 Information Retrieval Systems

1 Signature Files
CSE3201/CSE4500 Information Retrieval Systems

2 Signature File for Text Retrieval
A “signature” is created as an abstraction of a document. The signatures of all the documents in the collection are kept in a file called the “signature file”.

3 Word Signature (WS)
A word signature is a fixed-length bit string that represents a word. It is described by: the length (N); the number of bits set to 1 (k). Example: N = 24, k = 7.

4 Word Signature Generation
A hash function determines which bit position(s) are set to 1. Word signatures are generated from character triplets: divide the word into overlapping triplets, then, for each triplet, convert its characters to a numeric value (for example, from their ASCII codes) and use that number as the input to the hash function. The hash function produces a number that gives the bit position for that triplet in the word signature.

5 Word Signature Generation
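Example: a signature is generated for the word “signature”. The word is divided into the overlapping triplets -si, sig, ign, gna, nat, atu, tur, ure, re- (“-” marks a word boundary). Bit positions are read from left to right. [Figure: word signature with bits set at positions 12, 7, 3, 2, 1, 9, 8]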
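The triplet procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name a hash function, so the modulo hash below, the base-256 numeric value, and the names `triplets` and `word_signature` are all hypothetical. With this simple scheme the number of bits set is at most the number of triplets, and may be smaller when two triplets hash to the same position.

```python
def triplets(word):
    """Overlapping character triplets, with '-' marking both word boundaries."""
    padded = f"-{word}-"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def word_signature(word, n_bits=24):
    """Return the signature as an int; position 0 is the leftmost of n_bits bits."""
    sig = 0
    for triplet in triplets(word):
        # Numeric value of the triplet, built from its character codes.
        value = 0
        for ch in triplet:
            value = value * 256 + ord(ch)
        # Assumed hash function: modulo the signature length.
        position = value % n_bits
        sig |= 1 << (n_bits - 1 - position)
    return sig


if __name__ == "__main__":
    print(triplets("signature"))
    print(format(word_signature("signature"), "024b"))
```

Running the script prints the nine triplets of “signature” followed by a 24-bit string; the positions will differ from those on the slide because the hash function is not the same.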

6 Document Signature (DS)
A document signature can be created using one of two methods: concatenation of word signatures, or superimposed coding.

7 Document Signature – Concatenation of WS
The length of a document signature (DS) can vary. A fixed number of bits may precede the DS to indicate its length. Alternatively, the length of the DS can be fixed: it is set to the signature length of the longest document in the collection, and the signatures of shorter documents are padded with extra “0” bits.
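Both variants of concatenation can be illustrated with bit strings. The sketch below is an assumption-laden example rather than a prescribed layout: signatures are represented as Python strings of "0" and "1", and the function names and the `length_bits` parameter are hypothetical.

```python
def concat_signature(word_sigs, length_bits=16):
    """Variable-length DS: a fixed-width length prefix followed by the word signatures."""
    body = "".join(word_sigs)
    if len(body) >= 1 << length_bits:
        raise ValueError("signature too long for the length prefix")
    return format(len(body), f"0{length_bits}b") + body


def fixed_length_signature(word_sigs, total_bits):
    """Fixed-length DS: concatenate, then pad on the right with '0' bits."""
    body = "".join(word_sigs)
    if len(body) > total_bits:
        raise ValueError("document exceeds the fixed signature length")
    return body.ljust(total_bits, "0")


if __name__ == "__main__":
    print(concat_signature(["1010", "0110"], length_bits=8))
    print(fixed_length_signature(["1010"], total_bits=8))
```

The length prefix lets a reader of the file know where one document signature ends and the next begins; the fixed-length variant trades wasted padding bits for simpler record addressing.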

8 Document Signature – Superimposed Coding
Each document is divided into blocks containing a constant number of distinct words. To create a block signature, perform a bitwise OR over the signatures of all the words in the block. [Figure: the word signatures of “free” and “text” ORed together to give the block signature]

9 Document Signature – Superimposed Coding
To create the document signature, all the block signatures are superimposed.
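A minimal sketch of superimposed coding follows. The word-signature function is the same hypothetical triplet-plus-modulo scheme assumed for illustration (the slides do not specify the hash), and splitting the document's distinct words into fixed-size groups is a simplification of the block division described on the slides.

```python
def word_signature(word, n_bits=24):
    """Hypothetical word signature: one bit per overlapping triplet, modulo hash."""
    padded = f"-{word}-"
    sig = 0
    for i in range(len(padded) - 2):
        value = 0
        for ch in padded[i:i + 3]:
            value = value * 256 + ord(ch)
        sig |= 1 << (value % n_bits)
    return sig


def block_signature(words, n_bits=24):
    """Superimpose (bitwise OR) the signatures of all words in one block."""
    sig = 0
    for word in words:
        sig |= word_signature(word, n_bits)
    return sig


def document_signature(words, words_per_block, n_bits=24):
    """Return (block signatures, all block signatures superimposed)."""
    distinct = list(dict.fromkeys(words))  # drop repeats, keep first-seen order
    block_sigs = [
        block_signature(distinct[i:i + words_per_block], n_bits)
        for i in range(0, len(distinct), words_per_block)
    ]
    doc_sig = 0
    for sig in block_sigs:
        doc_sig |= sig
    return block_sigs, doc_sig


if __name__ == "__main__":
    blocks, doc = document_signature(["free", "text", "retrieval", "free"], 2)
    for sig in blocks:
        print(format(sig, "024b"))
    print(format(doc, "024b"))
```

Because OR only ever sets bits, every bit of a word's signature is also set in the signature of any block that contains the word, which is the property the query test relies on.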

10 Query Signature
The query is converted to a block signature in the same way as a block of a document. [Figure: the query “free text” and its block signature]

11 Query on Signature File
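Perform a bitwise AND between the query signature and the block signature. If (result - query) = 0, that is, the result equals the query signature, the block matches. [Figure: block signatures compared against the query signature, with a “Match?” column reading No, Yes, No, No, Yes, No, Yes]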
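The matching rule can be written directly as a bitwise test. The bit patterns below are made-up values chosen for illustration, not the ones from the slide, and the function name `matches` is hypothetical.

```python
def matches(query_sig, block_sig):
    """True when every bit set in the query is also set in the block."""
    result = query_sig & block_sig
    return result - query_sig == 0


if __name__ == "__main__":
    query = 0b01011000
    blocks = [0b11011010, 0b01001000, 0b01011000, 0b10100111]
    for block in blocks:
        print(format(block, "08b"), "Yes" if matches(query, block) else "No")
```

A block may set additional bits and still match; it fails only when at least one query bit is missing from it.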

12 Signature File Structure
Sequential: during searching, each signature is compared with the query signature, which is time consuming. Bit-sliced signature: the signature file is stored as its matrix transpose.

13 Matrix Transposed

14 Bit-Sliced
[Figure: sequential layout, with one N-bit record per document (d1 to d4), compared with the bit-sliced layout, with N records, each holding one bit position for all documents (d1 to d4)]

15 Bit Sliced Signature File
Retrieval: if the i-th bit in the query signature is set to 1, retrieve the i-th bit-slice record. If n bits are set to 1 in the query, only n records need to be retrieved.

16 Bit Slice Signature File
[Figure: a query signature with two bits set to 1, and the two bit-slice records retrieved for it] The 2nd block matches, because all bits in its column are set to 1.
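Bit-sliced retrieval can be sketched as a transpose followed by an AND over the selected slices. This is an illustrative in-memory model only: each slice is held as a Python int with one bit per block, the signatures are made-up 4-bit values, and the names `bit_slices` and `search` are hypothetical.

```python
def bit_slices(signatures, n_bits):
    """Transpose: slice i holds bit i of every signature, one bit per block."""
    slices = []
    for i in range(n_bits):
        column = 0
        for block_no, sig in enumerate(signatures):
            if (sig >> i) & 1:
                column |= 1 << block_no
        slices.append(column)
    return slices


def search(slices, query_sig, n_blocks):
    """Read only the slices whose bit is set in the query and AND them together."""
    candidates = (1 << n_blocks) - 1  # start with every block as a candidate
    for i, column in enumerate(slices):
        if (query_sig >> i) & 1:
            candidates &= column
    return [b for b in range(n_blocks) if (candidates >> b) & 1]


if __name__ == "__main__":
    signatures = [0b0110, 0b1011, 0b0011, 0b1110]
    slices = bit_slices(signatures, n_bits=4)
    print(search(slices, 0b1010, n_blocks=len(signatures)))
```

With the query `0b1010` only two of the four slices are read, and the blocks at indices 1 and 3 survive the AND, which is the same answer a sequential scan of all four signatures would give.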

17 Bit Sliced Signature File
Advantages: fewer records are retrieved, so retrieval is faster. Disadvantages: an update operation becomes very costly.

18 False Drop
A false drop occurs when a document’s signature matches a query’s signature but the query’s word does not match any word in the document. It is possible because two distinct blocks may have the same signature due to: the hashing algorithm; superimposed coding. The rate of false drops depends on: the size of the signature (N bits); the number of bits set to 1 (k); the number of words per block.
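How these three parameters interact can be explored with a rough estimate. The formula below is not from the slides: it is the standard approximation obtained by assuming that each word sets k independently and uniformly chosen bits and that the query is a single word, and the function name is hypothetical.

```python
import math


def false_drop_estimate(n_bits, k, words_per_block):
    """Approximate chance that a block matches a one-word query by accident.

    Assumes each word sets k independent, uniformly random bit positions.
    """
    # Probability that a given bit of the block signature is set to 1.
    p_bit_set = 1.0 - math.exp(-k * words_per_block / n_bits)
    # All k query bits must happen to be set in the block signature.
    return p_bit_set ** k


if __name__ == "__main__":
    for n_bits in (24, 48, 96):
        print(n_bits, false_drop_estimate(n_bits, k=7, words_per_block=4))
```

Under these assumptions the estimate falls as the signature gets longer and rises as more words are packed into each block.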

19 Inverted or Signature?
Inverted files: slower retrieval; more accurate; easier to maintain. In fact, inverted files are still the most popular storage structure for information retrieval.

