CSE3201/CSE4500 Information Retrieval Systems

Presentation transcript:

Signature Files

Signature File for Text Retrieval
A “signature” is created as an abstraction of a document. All the signatures that represent the documents in the collection are kept in a file called the “signature file”.

Word Signature (WS)
A word signature is a fixed-length bit string that represents a word. It is described by its length (N) and the number of bits set to 1 (k).
Example with N = 24 and k = 7: 1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0

Word Signature Generation
Use a hash function to find the position of each bit that will be set to 1. One approach uses triplets of characters: divide the word into overlapping triplets and, for each triplet, convert the characters to a numeric value (for example, their ASCII codes) and use that number as the input to the hash function. The hash function produces a number that represents the bit position of the triplet in the word signature.
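The triplet procedure above can be sketched as follows. This is a minimal illustration, not the course's implementation: the "-" padding character follows the slide example, while the hash (sum of character codes modulo N) and the names `triplets`, `triplet_position` and `word_signature` are assumptions made for the example. Because the hash is hypothetical, the bits it sets will generally differ from the ones shown on the slides.

```python
def triplets(word, pad="-"):
    """Return the overlapping character triplets of a word, padded at both ends."""
    padded = pad + word + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def triplet_position(triplet, n_bits):
    """Hypothetical hash: sum of the character codes, modulo the signature length."""
    return sum(ord(ch) for ch in triplet) % n_bits


def word_signature(word, n_bits=12):
    """Set one bit per triplet; triplets that hash to the same position share a bit."""
    bits = [0] * n_bits
    for t in triplets(word):
        bits[triplet_position(t, n_bits)] = 1
    return "".join(str(b) for b in bits)


print(triplets("signature"))
print(word_signature("signature"))
```

A real system would use a hash that spreads triplets more evenly; the sum of character codes is chosen here only because it is easy to follow by hand.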

Word Signature Generation
Example: the signature 111000111001 is generated for the word “signature”. Bit positions are read from left to right.
Triplets: -si sig ign gna nat atu tur ure re-
Bit positions set: 12 7 3 2 1 9 8
Signature: 1 1 1 0 0 0 1 1 1 0 0 1

Document Signature (DS)
A document signature can be created using two methods: concatenation of word signatures, or superimposed coding.

Document Signature – Concatenation of WS
The length of a document signature (DS) can vary, in which case a fixed number of bits may precede the DS to indicate its length. Alternatively, the length of the DS can be fixed, for example to the length of the longest document signature in the collection, with shorter signatures padded with extra “0” bits.
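Both concatenation variants can be sketched in a few lines. The function names, the 16-bit length prefix and the sample word signatures are assumptions chosen for illustration; the slides do not specify a prefix width.

```python
def length_prefixed_ds(word_signatures, prefix_bits=16):
    """Variable-length DS: a fixed-width binary length field followed by the signatures."""
    joined = "".join(word_signatures)
    return format(len(joined), f"0{prefix_bits}b") + joined


def fixed_length_ds(word_signatures, total_bits):
    """Fixed-length DS: concatenate, then pad with 0 bits up to total_bits."""
    joined = "".join(word_signatures)
    if len(joined) > total_bits:
        raise ValueError("document signature exceeds the fixed length")
    return joined.ljust(total_bits, "0")


sigs = ["001000110010", "000010101001"]   # two 12-bit word signatures
print(length_prefixed_ds(sigs))
print(fixed_length_ds(sigs, total_bits=36))
```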

Document Signature – Superimposed Coding
Each document is divided into blocks containing a constant number of distinct words. To create a block signature, perform a bitwise OR on the signatures of all the words in the block.
free             001 000 110 010
text             000 010 101 001
Block signature  001 010 111 011
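The OR step can be checked with a short sketch that uses the two word signatures from the slide. The function name `block_signature` and the choice to hold signatures as bit strings are assumptions made for the example.

```python
from functools import reduce


def block_signature(word_signatures):
    """Superimpose (bitwise OR) equal-length word signatures given as bit strings."""
    n_bits = len(word_signatures[0])
    combined = reduce(lambda acc, sig: acc | int(sig, 2), word_signatures, 0)
    return format(combined, f"0{n_bits}b")


free = "001000110010"
text = "000010101001"
print(block_signature([free, text]))  # 001010111011
```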

Document Signature – Superimposed Coding
To create the document signature, all the block signatures are superimposed.

Query Signature
The query is converted to a block signature in the same way as a document.
Query:
free   001 000 110 010
text   000 010 101 001
Block  001 010 111 011

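Query on Signature File
Match? Perform a bitwise AND between the query signature and each block signature. If (result – query) = 0, that is, the result equals the query signature, the block matches.
Query: 001 010 111 011
[Figure: the block signatures in the file, each marked Yes or No for a match against the query.]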
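The matching rule can be written as a one-line test. The sketch below assumes signatures are held as bit strings of equal length; the function name `matches` is illustrative.

```python
def matches(query_sig, block_sig):
    """A block matches when every bit set in the query is also set in the block."""
    q = int(query_sig, 2)
    b = int(block_sig, 2)
    return (q & b) == q   # same condition as (result - query) == 0


query = "001010111011"                   # signature of the query "free text"
print(matches(query, "001010111011"))    # True: the block holds both words
print(matches(query, "001000110010"))    # False: the block holds only "free"
```

A match only says the block may contain the query words; the text still has to be checked to rule out a false drop.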

Signature File Structure
Sequential: during searching, each signature is compared to the query signature, which is time-consuming.
Bit-sliced: the signature file is stored as its matrix transpose.

Matrix Transposed

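Bit-Sliced
[Figure: sequential versus bit-sliced storage of the signatures of documents d1 to d4. Sequential: one record per document (d1, d2, d3, d4), each N bits long. Bit-sliced: N records, each holding one bit position across d1 to d4.]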
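The transposition from sequential to bit-sliced storage can be sketched as below. The four 3-bit signatures are made up for the example and do not come from the slides.

```python
def bit_slices(signatures):
    """Transpose a sequential signature file: slice i holds bit i of every signature."""
    return ["".join(column) for column in zip(*signatures)]


# Hypothetical signatures for d1..d4, N = 3 bits each.
sequential = ["101", "011", "110", "001"]
print(bit_slices(sequential))  # ['1010', '0110', '1101']
```

The sequential file has four records of N = 3 bits; the bit-sliced file has N = 3 records of four bits, one bit per document.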

Bit Sliced Signature File Retrieval
If the i-th bit in the query signature is set to 1, retrieve the i-th record (bit slice). If n bits are set to 1 in the query signature, only n records need to be retrieved.

Bit Slice Signature File
Query: 001 010 111 011
[Figure: the retrieved records (bit slices). A block matches when every bit in its column of the retrieved records is set to 1; here the 2nd block matches.]
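Retrieval over a bit-sliced file can be sketched as an AND across the slices selected by the query. The slices below are hypothetical (they correspond to four 3-bit signatures 101, 011, 110 and 001), and the function name is an assumption.

```python
def search_bit_sliced(slices, query_sig):
    """Return the indices of signatures that have a 1 wherever the query has a 1."""
    n_records = len(slices[0])
    candidates = (1 << n_records) - 1           # start with every signature
    for i, bit in enumerate(query_sig):
        if bit == "1":                          # only these slices are read
            candidates &= int(slices[i], 2)
    # The j-th character of a slice (from the left) belongs to signature j.
    return [j for j in range(n_records)
            if (candidates >> (n_records - 1 - j)) & 1]


slices = ["1010", "0110", "1101"]        # slice i = bit i of each signature
print(search_bit_sliced(slices, "110"))  # [2]
print(search_bit_sliced(slices, "001"))  # [0, 1, 3]
```

For the query 110 only two of the three slices are read, which is the saving described above.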

Bit Sliced Signature File
Advantages: fewer records are retrieved, so retrieval is faster.
Disadvantages: an update operation becomes very costly.

False Drop
A false drop occurs when a document’s signature matches a query’s signature but the query’s word does not match any word in the document. It is possible because two distinct blocks may have the same signature due to the hashing algorithm or superimposed coding.
The rate of false drops depends on the size of the signature (N bits), the number of bits set to 1 (k bits), and the number of words per block.
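A false drop caused by superimposed coding can be reproduced with a tiny example. The 4-bit word signatures below are invented for the illustration and are not produced by any particular hash function.

```python
def superimpose(signatures):
    """Bitwise OR of equal-length signatures given as bit strings."""
    combined = 0
    for sig in signatures:
        combined |= int(sig, 2)
    return format(combined, f"0{len(signatures[0])}b")


def matches(query_sig, block_sig):
    """True when every bit set in the query is also set in the block."""
    q = int(query_sig, 2)
    return (q & int(block_sig, 2)) == q


# Hypothetical signatures: the block holds "free" and "text", the query asks for "data".
block_words = {"free": "1100", "text": "0011"}
query_word, query_sig = "data", "0110"

block_sig = superimpose(list(block_words.values()))
is_false_drop = matches(query_sig, block_sig) and query_word not in block_words
print(block_sig, is_false_drop)  # 1111 True
```

The bits of "data" are covered by one bit from "free" and one from "text", so the signatures match even though the word is absent from the block.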

Inverted or Signature?
Inverted files: slower retrieval, more accurate, easier to maintain.
In fact, inverted files are still the most popular storage structure for information retrieval.