Published by Jennifer Melton. Modified over 9 years ago.
1
www.monash.edu.au CSE3201/CSE4500 Information Retrieval Systems Signature Based Text Retrieval Systems
2
Signature File for Text Retrieval
A “signature” is created as an abstraction of a document.
The signatures that represent the documents in the collection are all kept in one file, called the “signature file”.
3
Word Signature (WS)
A word signature
– is a fixed-length bit string that represents a word.
– is described by:
  > the length (N)
  > the number of bits set to 1 (k)
Example with N = 24 and k = 7:
1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0
4
Word Signature Generation
Use a hash function to find the position of the bit(s) that will be set to 1.
Triplets of characters are used to generate the word signature:
– Divide the word into overlapping triplets.
– For each triplet of characters:
  > Convert the characters to numeric values (for example, the ASCII code of each character).
  > Use these numbers as the input to the hash function.
  > The hash function produces a number that represents the bit position of the triplet in the word signature.
5
Signature Generator Algorithm
Set hash_value to 0
for each character in the triplet do
    hash_value := (hash_value * 137 + ASCII value of character) mod 256
[Figure label: K values]
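The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the lecturer's implementation: the function names are hypothetical, the "-" padding character is taken from the worked example for the word “signature”, and reducing the hash value with `% n_bits` to fit a signature of length N is an assumption, since the slides only give `mod 256`.

```python
def triplets(word: str, pad: str = "-") -> list[str]:
    """Split a word into overlapping character triplets, padded at both ends."""
    padded = f"{pad}{word}{pad}"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def triplet_hash(triplet: str) -> int:
    """Hash from the slide: hash_value := (hash_value * 137 + ASCII value) mod 256."""
    hash_value = 0
    for ch in triplet:
        hash_value = (hash_value * 137 + ord(ch)) % 256
    return hash_value


def word_signature(word: str, n_bits: int = 24) -> str:
    """Set one bit per triplet in an n_bits-long bit string."""
    bits = ["0"] * n_bits
    for t in triplets(word):
        # Assumption: fold the 0..255 hash value into the signature length.
        bits[triplet_hash(t) % n_bits] = "1"
    return "".join(bits)


print(triplets("signature"))
print(word_signature("signature"))
```

Because several triplets may hash to the same position, the number of bits actually set can be smaller than the number of triplets.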
6
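Word Signature Generation – simplified example
Example:
– A signature 111000111001 is generated for the word “signature”. The position is read from left to right.
– Triplets: -si, sig, ign, gna, nat, atu, tur, ure, re-
– Each triplet is passed through the hash function, which gives the position of the bit to set to 1. Several triplets can hash to the same position: the nine triplets set seven distinct bits (positions 1, 2, 3, 7, 8, 9 and 12).
– Resulting signature: 1 1 1 0 0 0 1 1 1 0 0 1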
7
Document Signature (DS)
A document signature can be created using two methods:
– concatenation of word signatures.
– superimposed coding.
8
Document Signature – Concatenation of WS
The length of a document signature (DS) can vary.
A fixed number of bits may precede the DS to indicate its length.
It is also possible to fix the length of the DS:
– The length can be set to that of the signature of the longest document in the collection.
– The signatures of shorter documents are padded with extra “0” bits.
9
Document Signature – Superimposed Coding
Each document is divided into blocks containing a constant number of distinct words.
To create a block signature, perform an OR operation on the signatures of all the words in the block.
free             001 000 110 010
text             000 010 101 001
Block signature  001 010 111 011
10
Document Signature – Superimposed Coding
To create the document signature, all the block signatures are superimposed (combined with OR).
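Superimposed coding is a bitwise OR over equal-length bit strings, so it can be sketched in a few lines. The function name `superimpose` is hypothetical; the two word signatures are the ones given for “free” and “text”.

```python
def superimpose(signatures: list[str]) -> str:
    """OR together equal-length bit strings (spaces are ignored)."""
    cleaned = [s.replace(" ", "") for s in signatures]
    width = len(cleaned[0])
    if any(len(s) != width for s in cleaned):
        raise ValueError("all signatures must have the same length")
    combined = 0
    for s in cleaned:
        combined |= int(s, 2)
    return format(combined, f"0{width}b")


free = "001 000 110 010"
text = "000 010 101 001"
block_signature = superimpose([free, text])
print(block_signature)  # 001010111011

# A document signature is built the same way from its block signatures.
document_signature = superimpose([block_signature, "100000000000"])
print(document_signature)
```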
11
Query Signature
The query is converted to a block signature in the same way as the document.
Example:
free         001 000 110 010
text         000 010 101 001
Block/Query  001 010 111 011
12
Matching the Query and Document Signature
Premise:
– The positions of the bits set to 1 represent the existence of particular words in the query or document.
A relevant document is a document whose signature has a bit set to 1 at every position where the query’s signature has a bit set to 1.
The relevant document’s signature does not have to be an exact match of the query’s signature.
Example:
– Query: 0100
– Matching document signatures: 1111, 0111, 0110, 0100.
13
Query on Signature File
Query: 001 010 111 011
Perform an AND operation between the query and each block signature; if (result – query) = 0, they match.
Block signature  Match?
001000111011     No
001111111011     Yes
001010101011     No
001010111010     No
111010111011     Yes
001100111011     No
001010111111     Yes
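The AND test can be sketched as below, applied to the query and the block signatures listed on this slide. The function name `matches` is hypothetical.

```python
def matches(query: str, signature: str) -> bool:
    """True when every 1 bit of the query is also set in the signature."""
    q = int(query.replace(" ", ""), 2)
    s = int(signature.replace(" ", ""), 2)
    return (q & s) == q  # equivalent to (result - query) == 0


query = "001 010 111 011"
block_signatures = [
    "001000111011",
    "001111111011",
    "001010101011",
    "001010111010",
    "111010111011",
    "001100111011",
    "001010111111",
]
for signature in block_signatures:
    print(signature, "Yes" if matches(query, signature) else "No")
```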
14
Signature File Structure
Sequential
– During searching, each signature is compared to the query signature.
– Time consuming because:
  > Memory size is limited, so the signatures cannot all be loaded into memory at once.
  > This may result in many I/O operations.
We need a file structure for the signature file that minimises I/O operations.
Bit-Sliced Signature
– At most, only N records (N being the size of the signature) need to be retrieved.
15
Matrix Transposed
x_ij -> x_ji
16
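Bit-Sliced
Query: 001 010 111 011
Sequential (one record of N bits per document):
d1  001000111011
d2  001111111011
d3  001010101011
d4  001010111010
...
dn
Bit sliced (N records, each holding one bit per document, in the order d1 d2 d3 d4 ... dn):
bit 1   0000
bit 2   0000
bit 3   1111
bit 4   0100
bit 5   0111
bit 6   0100
bit 7   1111
bit 8   1101
bit 9   1111
bit 10  0000
bit 11  1111
bit 12  1110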
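Building the bit-sliced file is a matrix transposition of the sequential file. A minimal sketch, using the four document signatures from this slide and a hypothetical function name `bit_slice`:

```python
def bit_slice(signatures: list[str]) -> list[str]:
    """Transpose the signature matrix: slice i holds bit i of every signature."""
    return ["".join(column) for column in zip(*signatures)]


documents = [
    "001000111011",  # d1
    "001111111011",  # d2
    "001010101011",  # d3
    "001010111010",  # d4
]
slices = bit_slice(documents)
for position, record in enumerate(slices, start=1):
    print(f"bit {position:2d}  {record}")
```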
17
Bit Sliced Signature File
Retrieval
– If the i-th bit in the query signature is set to 1, retrieve the i-th signature block/record.
– If n bits are set to 1 in the query, only n records need to be retrieved.
18
Bit Slice Signature File
Query: 001 010 111 011
Bit slices: 0000 0000 1111 0100 0111 0100 1111 1101 1111 0000 1111 1110
Retrieved records (slices 3, 5, 7, 8, 9, 11 and 12, where the query bit is 1): 1111 0111 1111 1101 1111 1111 1110
Match: the 2nd block, because all bits in its column are set to 1.
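Retrieval on the bit-sliced file can be sketched as an AND over the records selected by the query's 1 bits. The function name `bit_sliced_search` is hypothetical; the slices and the query are the ones shown on this slide.

```python
def bit_sliced_search(slices: list[str], query: str) -> list[int]:
    """Return the 1-based numbers of the documents that match the query."""
    query = query.replace(" ", "")
    n_docs = len(slices[0])
    candidates = (1 << n_docs) - 1  # start with every document as a candidate
    for i, bit in enumerate(query):
        if bit == "1":  # only these records are read
            candidates &= int(slices[i], 2)
    column = format(candidates, f"0{n_docs}b")
    return [j + 1 for j, b in enumerate(column) if b == "1"]


slices = ["0000", "0000", "1111", "0100", "0111", "0100",
          "1111", "1101", "1111", "0000", "1111", "1110"]
print(bit_sliced_search(slices, "001 010 111 011"))  # [2]
```

Only seven of the twelve records are read here, one for each bit set to 1 in the query.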
19
Bit Sliced Signature File
Advantages:
– Fewer records are retrieved -> faster retrieval.
Disadvantages:
– An update operation becomes very costly.
20
False Drop
A false drop occurs when a document’s signature matches a query’s signature but the query’s word does not match any word in the document.
This is possible because two distinct blocks may have the same signature due to:
– the hashing algorithm
– superimposed coding
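A small constructed example of a false drop caused by superimposed coding is sketched below. The signatures for “free” and “text” are the ones used earlier in the slides; the word “data” and its signature are made up for illustration.

```python
word_signatures = {
    "free": "001000110010",
    "text": "000010101001",
    "data": "001010000000",  # hypothetical signature, chosen to collide
}

block_words = ["free", "text"]
block_signature = 0
for word in block_words:
    block_signature |= int(word_signatures[word], 2)

query_word = "data"
query_signature = int(word_signatures[query_word], 2)

signature_match = (query_signature & block_signature) == query_signature
false_drop = signature_match and query_word not in block_words
print(signature_match, false_drop)
```

The block signature has both bits of “data” set, one contributed by “free” and one by “text”, so the signatures match even though the block does not contain the word.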
21
Relation Between the Signature Properties and False Drop
The rate of false drops depends on:
– The size of the signature (N bits)
  > An increase in N decreases false drops.
– The number of bits set to 1 (k bits)
  > An increase in k, up to a certain level, decreases false drops.
– The number of unique words per block
  > A decrease in the number of unique words per block decreases false drops.
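These three trends can be illustrated with a common back-of-the-envelope estimate that is not given in the slides. It assumes each word sets k bits chosen uniformly at random and treats bits as independent: a given bit is set in a block signature with probability 1 - (1 - k/N)^D for D words per block, and a single-word query falsely matches when all k of its bits happen to be set. The function name is hypothetical.

```python
def false_drop_estimate(n_bits: int, k: int, words_per_block: int) -> float:
    """Rough false-drop probability under a uniform, independent-bits assumption."""
    p_bit_set = 1.0 - (1.0 - k / n_bits) ** words_per_block
    return p_bit_set ** k


# Vary k for N = 64 and 10 words per block: the estimate falls, then rises again.
for k in (1, 2, 4, 8, 16):
    print(k, round(false_drop_estimate(64, k, 10), 4))
```

Under this estimate, raising k helps only up to a point: beyond it, block signatures fill up with 1s and the false-drop rate climbs again.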