Chapter 8: Indexing and Searching
Sections: 8.1 Introduction, 8.2 Inverted Files
Dr. Almetwally Mostafa, 9/13/2015



Introduction
• How do we retrieve information?
• A simple alternative is to search the whole text sequentially.
• Another option is to build data structures over the text (called indices) to speed up the search.

Introduction
• Indexing techniques:
  - Inverted files
  - Suffix arrays
  - Signature files

Notation
• n: the size of the text
• m: the length of the pattern
• v: the size of the vocabulary
• M: the amount of main memory available

Inverted Files
• Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.
• Structure of an inverted file:
  - Vocabulary: the set of all distinct words in the text
  - Occurrences: lists containing all the information stored for each word of the vocabulary (text positions, frequency, documents where the word appears, etc.)

Example
• Text (each word is located by its character position):
  That house has a garden. The garden has many flowers. The flowers are beautiful
• Inverted file (the words are converted to lower-case):

  Vocabulary   Occurrences
  beautiful    70
  flowers      45, 58
  garden       18, 29
  house        6
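The example above can be reproduced with a short script. This is a minimal sketch, not the construction algorithm of the chapter: the function name `build_inverted_index` and the stopword list are assumptions, chosen only so that the vocabulary comes out as in the example. Positions are 1-based offsets computed on the plain string, so some of them may differ by one from the numbers shown in the figure.

```python
import re

# Assumed stopword list, chosen only so the vocabulary matches the example.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}

def build_inverted_index(text, stopwords=STOPWORDS):
    """Map each lower-cased, non-stopword word to the 1-based
    character positions at which it starts in the text."""
    index = {}
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word in stopwords:
            continue
        index.setdefault(word, []).append(match.start() + 1)
    return index

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
index = build_inverted_index(text)

# Print the vocabulary in lexicographical order with its occurrences.
for word in sorted(index):
    print(f"{word:<10} {index[word]}")
```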

Space Requirements
• The space required for the vocabulary is rather small. According to Heaps' law, the vocabulary grows as O(n^β), where β is a constant between 0.4 and 0.6 in practice.
• On the other hand, the occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n).
• To reduce space requirements, a technique called block addressing is used.
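To get a feeling for how slowly the vocabulary grows compared with the text, Heaps' law can be evaluated for a few text sizes. This is only an illustrative sketch: the slide gives the growth rate O(n^β) but no constant factor, so the factor `k = 1.0` and the choice `beta = 0.5` are assumptions.

```python
def heaps_vocabulary_size(n, k=1.0, beta=0.5):
    """Estimate the vocabulary size v = k * n**beta (Heaps' law).
    k is an assumed constant factor; beta is taken from the 0.4-0.6 range."""
    return k * n ** beta

for n in (10**6, 10**8, 10**10):
    v = heaps_vocabulary_size(n)
    # v/n shrinks as n grows: the vocabulary becomes negligible next to the text.
    print(f"n = {n:>14,}   v ~ {v:>9,.0f}   v/n = {v / n:.6f}")
```

Under these assumptions the ratio v/n keeps falling as the text grows, which is why the occurrences, not the vocabulary, dominate the index size.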

Block Addressing
• The text is divided into blocks.
• The occurrences point to the blocks where the word appears.
• Advantages:
  - there are fewer blocks than text positions, so the pointers are smaller
  - all the occurrences of a word inside a single block are collapsed to one reference
• Disadvantages:
  - an online (sequential) search over the qualifying blocks is needed if exact positions are required

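Example
• Text (divided into Block 1, Block 2, Block 3 and Block 4):
  That house has a garden. The garden has many flowers. The flowers are beautiful
• Inverted file:
  Vocabulary: beautiful, flowers, garden, house
  Occurrences: the blocks in which each word appears (figure: block boundaries and block numbers not preserved in the transcript)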
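Block addressing can be sketched on the same text. The block boundaries of the figure are not preserved, so the fixed block size of 20 characters used here is an assumption, as are the function names and the stopword list. The sketch shows both effects described earlier: two occurrences in the same block collapse to one reference, and exact positions require scanning the qualifying blocks.

```python
import re

TEXT = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
STOPWORDS = {"that", "has", "a", "the", "many", "are"}  # assumed list

def build_block_index(text, block_size, stopwords=frozenset()):
    """Map each word to the sorted list of 1-based blocks where it starts."""
    index = {}
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word in stopwords:
            continue
        block = match.start() // block_size + 1
        blocks = index.setdefault(word, [])
        if not blocks or blocks[-1] != block:  # collapse repeats in a block
            blocks.append(block)
    return index

def positions_in_blocks(text, word, index, block_size):
    """Recover exact 1-based positions by scanning only the qualifying blocks."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    positions = []
    for block in index.get(word, []):
        start = (block - 1) * block_size
        end = start + block_size
        # Look one word length past the block so a word that starts near the
        # block end is seen whole; keep only matches starting inside the block.
        for m in pattern.finditer(text, start, end + len(word)):
            if m.start() < end:
                positions.append(m.start() + 1)
    return positions

block_index = build_block_index(TEXT, 20, STOPWORDS)
print(block_index)
print(positions_in_blocks(TEXT, "flowers", block_index, 20))
```

With this block size both occurrences of "flowers" fall into block 3, so the index stores a single reference for them.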

Size of an inverted file as an approximate percentage of the size of the whole text collection

  Index                   Small collection (1 Mb)   Medium collection (200 Mb)   Large collection (2 Gb)
  Addressing words        45%    73%                36%    64%                   35%    63%
  Addressing documents    19%    26%                18%    32%                   26%    47%
  Addressing 256 blocks   18%    25%                1.7%   2.4%                  0.5%   0.7%

• For each collection, the left column assumes that stopwords are not indexed, while the right column assumes that all words are indexed.

Searching
• The search algorithm on an inverted index follows three steps:
  - Vocabulary search: the words present in the query are searched in the vocabulary
  - Retrieval of occurrences: the lists of the occurrences of all the words found are retrieved
  - Manipulation of occurrences: the occurrences are processed to solve the query

Searching
• The searching task on an inverted file always starts in the vocabulary (it is better to store the vocabulary in a separate file).
• The structures most used to store the vocabulary are hashing, tries or B-trees.
• An alternative is simply storing the words in lexicographical order (cheaper in space and very competitive, with O(log v) cost).
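The three search steps can be sketched with the vocabulary stored as a sorted list, so that the vocabulary search is a binary search with O(log v) cost. The data are the vocabulary and occurrences of the earlier example; the function names and the choice of a simple "any of these words" query are assumptions made for illustration.

```python
from bisect import bisect_left

# Vocabulary in lexicographical order, with one occurrence list per word.
vocabulary = ["beautiful", "flowers", "garden", "house"]
occurrences = [[70], [45, 58], [18, 29], [6]]

def lookup(word):
    """Steps 1 and 2: binary search in the vocabulary, then fetch the list."""
    i = bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return occurrences[i]
    return []

def search_any(query_words):
    """Step 3: combine the lists; here, positions of any query word, sorted."""
    positions = set()
    for word in query_words:
        positions.update(lookup(word.lower()))
    return sorted(positions)

print(lookup("garden"))
print(search_any(["house", "flowers", "tree"]))
```

Other query types (phrases, proximity, Boolean AND over documents) would change only the third step, the manipulation of the retrieved lists.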

Construction
• All the vocabulary is kept in a suitable data structure, storing for each word a list of its occurrences.
• Each word of the text is read and searched in the vocabulary.
• If it is not found, it is added to the vocabulary with an empty list of occurrences.
• In either case, the new position is then added to the end of the word's list of occurrences.

Construction
• Once the text is exhausted, the vocabulary is written to disk with the lists of occurrences. Two files are created:
  - in the first file, the lists of occurrences are stored contiguously
  - in the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included. This allows the vocabulary to be kept in memory at search time
• The overall process is O(n) worst-case time.
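The two-file layout can be sketched in memory. The slides do not specify an on-disk format, so everything concrete here is an assumption: positions are stored as 32-bit little-endian integers, the "occurrences file" is a bytes object, and each "vocabulary file" entry is a tuple (word, byte offset, number of positions).

```python
import struct

def serialize(index):
    """Return (occurrences_file, vocabulary_file) for an in-memory index."""
    occurrences_file = bytearray()
    vocabulary_file = []
    for word in sorted(index):                     # lexicographical order
        positions = index[word]
        offset = len(occurrences_file)             # pointer into the first file
        vocabulary_file.append((word, offset, len(positions)))
        occurrences_file += struct.pack(f"<{len(positions)}I", *positions)
    return bytes(occurrences_file), vocabulary_file

def read_occurrences(occurrences_file, entry):
    """Follow a vocabulary entry's pointer and decode its occurrence list."""
    _, offset, count = entry
    return list(struct.unpack_from(f"<{count}I", occurrences_file, offset))

index = {"house": [6], "garden": [18, 29], "flowers": [45, 58], "beautiful": [70]}
occ_file, vocab_file = serialize(index)
print(vocab_file)
print(read_occurrences(occ_file, vocab_file[2]))
```

Only the small vocabulary list needs to stay in memory at search time; each occurrence list is read from the first file on demand through its pointer.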

Construction
• An option is to use the previous algorithm until the main memory is exhausted.
• When no more memory is available, the partial index I_i obtained so far is written to disk and erased from main memory before continuing with the rest of the text.
• Once the text is exhausted, a number of partial indices I_i exist on disk.
• The partial indices are merged to obtain the final index.

Example (figure: merging the partial indices): the initial dumps are merged level by level (level 1, level 2, level 3) until the final index is obtained.
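The level-by-level merging of the figure can be sketched as repeated pairwise merging. This is an in-memory illustration under stated assumptions: partial indices are plain dictionaries, they are listed in text order (so the positions in a later partial index are all larger than those in an earlier one), and the function names are hypothetical.

```python
def merge_two(first, second):
    """Merge two partial indices; 'second' covers later text than 'first',
    so concatenating the lists keeps the positions sorted."""
    merged = {}
    for word in sorted(set(first) | set(second)):
        merged[word] = first.get(word, []) + second.get(word, [])
    return merged

def merge_levels(partials):
    """Merge adjacent pairs level by level; return (final index, levels used)."""
    if not partials:
        return {}, 0
    levels = 0
    while len(partials) > 1:
        next_level = []
        for i in range(0, len(partials), 2):
            if i + 1 < len(partials):
                next_level.append(merge_two(partials[i], partials[i + 1]))
            else:
                next_level.append(partials[i])     # odd one out moves up as is
        partials = next_level
        levels += 1
    return partials[0], levels

dumps = [
    {"house": [6], "garden": [18]},
    {"garden": [29]},
    {"flowers": [45, 58]},
    {"beautiful": [70]},
]
final_index, levels = merge_levels(dumps)
print(levels, final_index)
```

Four initial dumps need two merging levels and eight would need three, which matches the three levels named in the figure.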

Construction
• The total time to generate the partial indices is O(n).
• The number of partial indices is O(n/M).
• Merging the O(n/M) partial indices pairwise requires log_2(n/M) merging levels.
• Each merging level is a linear pass over the whole index, costing O(n), so the total cost of this algorithm is O(n log(n/M)).

• The inverted file is probably the most adequate indexing technique for text databases.
• Indices are appropriate when the text collection is large and semi-static.
• Otherwise, if the text collection is volatile, online searching is the only option.
• Some techniques combine online and indexed searching.