CS 430: Information Discovery. Lecture 4: Data Structures for Information Retrieval.

Presentation transcript:

1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval

2 Course Administration
The Wednesday evening classes have been moved to Hollister 110.
Introduction to Perl: Classes will be held on Wednesday evenings, September 19 and October 3. Before the first class, look at the CS 430 web site and attempt Assignment 0. (These classes and Assignment 0 are optional.)

3 Inverted Files: Search for Keywords
Index file: Stores list of terms (keywords). Designed for rapid searching and processing range queries. May be held in memory.
Postings file: Stores list of postings for each term. Designed for rapid evaluation of Boolean operators. May be stored sequentially.
Document file: [Repositories for the storage of document collections are covered in CS 502.]
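The split between index file and postings file can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the collection is a toy in-memory dictionary, the index file is a sorted list searched with `bisect`, and the names `build_inverted_file`, `lookup`, `prefix_range` and `boolean_and` are hypothetical, not taken from the lecture.

```python
from bisect import bisect_left

def build_inverted_file(docs):
    """Return (terms, postings): a sorted term list (the index file) and,
    in parallel, one sorted list of document ids per term (the postings file)."""
    table = {}
    for doc_id, text in sorted(docs.items()):
        for word in set(text.lower().split()):
            table.setdefault(word, []).append(doc_id)
    terms = sorted(table)
    postings = [table[t] for t in terms]
    return terms, postings

def lookup(terms, postings, term):
    """Binary search in the index file, then fetch the postings list."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []

def prefix_range(terms, prefix):
    """Range query on the sorted index file, e.g. all terms matching comp*."""
    lo = bisect_left(terms, prefix)
    hi = bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]

def boolean_and(a, b):
    """Evaluate AND by merging two sorted postings lists sequentially."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical three-document collection.
docs = {1: "ant bee cat", 2: "bee dog", 3: "cat dog elk bee"}
terms, postings = build_inverted_file(docs)
both = boolean_and(lookup(terms, postings, "cat"), lookup(terms, postings, "dog"))
```

Keeping the postings lists sorted is what allows the Boolean merge to run in a single sequential pass, which is why sequential storage suits the postings file.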

4 Index File Structures: Binary Tree
[Figure: binary tree of terms, listed level by level: elk; cat, hog; bee, dog, fox; ant, gnu]

5 Binary Tree
Advantages:
- Can be searched quickly
- Convenient for batch updating
- Easy to add an extra term
- Economical use of storage
Disadvantages:
- Poor for sequential processing, e.g., comp*
- Tree tends to become unbalanced
- If the index is held on disk, important to optimize the number of disk accesses

6 Binary Tree
Calculation of maximum depth of tree. Illustrates importance of balanced trees.
Worst case: depth = n, which is O(n)
Ideal case: depth = log(n + 1)/log 2, which is O(log n)
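The two cases can be checked with a small experiment. The sketch below assumes a plain, unbalanced binary search tree with iterative insertion; `Node`, `insert`, `depth` and `ideal_depth` are illustrative names. Inserting keys in sorted order produces the worst case, while a favourable insertion order reaches the ideal depth.

```python
import math

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into an unbalanced binary search tree; return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                break
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key)
                break
            node = node.right
        else:
            break  # duplicate key: nothing to do
    return root

def depth(root):
    """Number of levels in the tree, counted level by level."""
    levels = 0
    frontier = [root] if root is not None else []
    while frontier:
        levels += 1
        frontier = [c for n in frontier for c in (n.left, n.right) if c is not None]
    return levels

def ideal_depth(n):
    """Depth of a perfectly balanced tree: log(n + 1) / log 2, rounded up."""
    return math.ceil(math.log(n + 1) / math.log(2))

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root

worst = depth(build([1, 2, 3, 4, 5, 6, 7]))      # sorted input: one long chain
balanced = depth(build([4, 2, 6, 1, 3, 5, 7]))   # favourable insertion order
```

With n = 7 the sorted input gives a chain of depth 7, while the second order gives depth 3, matching log(8)/log 2.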

7 Right Threaded Binary Tree
Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor.
Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e. link to the successor, of each node is maintained.
Knuth vol 1, 2.3.1, page 325.
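A right-threaded tree makes sequential (in-order) processing possible without a stack or recursion. The following is a minimal sketch, not the lecture's or Knuth's code: it assumes a boolean flag `is_thread` on each node to distinguish a real right child from a thread, and the names `Node`, `insert` and `in_order` are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None       # right child, or thread to the in-order successor
        self.is_thread = False  # True when `right` is a thread, not a child

def insert(root, key):
    """Insert key into a right-threaded binary search tree; return the root."""
    new = Node(key)
    if root is None:
        return new
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                # The parent is the in-order successor of a new left child.
                new.right = node
                new.is_thread = True
                node.left = new
                return root
            node = node.left
        elif key > node.key:
            if node.is_thread or node.right is None:
                # The new right child inherits the parent's old successor.
                new.right = node.right
                new.is_thread = node.is_thread
                node.right = new
                node.is_thread = False
                return root
            node = node.right
        else:
            return root  # duplicate key: nothing to do

def in_order(root):
    """Yield keys in sorted order by following children and threads only."""
    node = root
    while node is not None and node.left is not None:
        node = node.left
    while node is not None:
        yield node.key
        if node.is_thread:
            node = node.right            # jump straight to the successor
        else:
            node = node.right            # real child: descend to its leftmost node
            while node is not None and node.left is not None:
                node = node.left

root = None
for term in ["elk", "cat", "hog", "bee", "dog", "fox", "ant", "gnu"]:
    root = insert(root, term)
ordered = list(in_order(root))
```

The traversal uses constant extra space, which addresses the "poor for sequential processing" weakness of the plain binary tree.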

8 Right Threaded Binary Tree From: Robert F. Rossa

9 B-trees
B-tree of order m: A balanced, multiway search tree.
- Each node stores many keys.
- Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys.
- If k_i is the i-th key in a given internal node, then all keys in the (i-1)-th child are smaller than k_i, and all keys in the i-th child are bigger than k_i.
- All leaves are at the same depth.
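The key-ordering rule is what drives a B-tree search: within a node, locate the first key not smaller than the search key, then either stop or descend into the child at that position. The sketch below covers search only, on a hand-built tree of order m = 2; insertion and node splitting are omitted, and `BTreeNode` and `search` are illustrative names.

```python
from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys held in this node
        self.children = children or []  # empty for a leaf, else len(keys) + 1 children

def search(node, key):
    """Return (found, nodes_visited) for key, starting from node."""
    visited = 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)   # number of keys smaller than `key`
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        if not node.children:             # reached a leaf without a match
            return False, visited
        node = node.children[i]           # keys here lie between keys[i-1] and keys[i]
    return False, visited

# Hand-built example of order m = 2: every non-root node holds 2 to 4 keys,
# and all leaves are at the same depth.
root = BTreeNode(
    [20, 40],
    [BTreeNode([5, 10, 15]), BTreeNode([25, 30]), BTreeNode([45, 50, 55, 60])],
)
found, visited = search(root, 30)
```

Each node visited corresponds to one disk access when the index is held on disk, so the shallow, wide shape of the tree is the point of the structure.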

10 B+-tree
B+-tree: A B-tree is used as an index. Data is stored in the leaves of the tree, known as buckets.
[Figure: example B+-tree of order 2, bucket size 4]

11 B-tree Discussion
For a discussion of B-trees, see Frakes, Section 2.3.1.
B-trees combine fast retrieval with moderately efficient updating.
Bottom-up updating is usually fast, but may require recursive tree climbing to the root.
The main weakness is poor storage utilization; typically buckets are only about 69% full.
Various algorithmic improvements increase storage utilization at the expense of updating performance.

12 Signature Files: Sequential Search without Inverted File
Inexact filter: A quick test which discards many of the non-qualifying items.
Advantages:
- Much faster than full text scanning (1 or 2 orders of magnitude)
- Modest space overhead (10% to 15% of file)
- Insertion is straightforward
Disadvantages:
- Sequential searching no good for very large files
- Some hits are false hits

13 Signature Files
Signature size: Number of bits in a signature, F.
Word signature: A bit pattern of size F with m bits set to 1 and the others 0. The word signature is calculated by a hash function.
Block: A sequence of text that contains D distinct words.
Block signature: The logical OR of all the word signatures in a block of text.
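These definitions translate directly into bit operations. The sketch below is one possible realization: the lecture does not say which hash function is used, so deriving bit positions from SHA-256 is purely an assumption for illustration, as are the names `word_signature` and `block_signature`. Signatures are held as Python integers of F bits.

```python
import hashlib

F = 12  # signature size in bits
M = 4   # bits set to 1 in each word signature

def word_signature(word, f=F, m=M):
    """Return an f-bit integer with exactly m bits set, chosen by hashing the word.
    The hashing scheme (SHA-256 with a counter) is an illustrative assumption."""
    positions = set()
    counter = 0
    while len(positions) < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        positions.add(int.from_bytes(digest[:4], "big") % f)
        counter += 1
    signature = 0
    for p in positions:
        signature |= 1 << p
    return signature

def block_signature(words, f=F, m=M):
    """Logical OR of the word signatures of all words in the block."""
    signature = 0
    for w in words:
        signature |= word_signature(w, f, m)
    return signature

free_sig = word_signature("free")
text_sig = word_signature("text")
block_sig = block_signature(["free", "text"])  # D = 2 words per block
print(format(free_sig, "012b"), format(text_sig, "012b"), format(block_sig, "012b"))
```

Because the block signature is an OR, it has between m and D·m bits set; the more bits that are set, the more likely an unrelated word appears to match.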

14 Signature Files: Example
[Table: word signatures for "free" and "text", and the resulting block signature]
F = 12 bits in a signature
m = 4 bits per word
D = 2 words per block

15 Signature Files
A query term is processed by matching its signature against the block signature.
(a) If the term is in the block, its word signature will always match the block signature.
(b) A word signature may match the block signature even though the word is not in the block. This is a false hit (false drop).
The design challenge is to minimize the false drop probability, F_d.
Frakes, Section 4.2, page 47 discusses how to minimize F_d. The rest of that chapter discusses enhancements to the basic algorithm.
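Cases (a) and (b) can be seen in a short sequential scan. The sketch below is self-contained and makes several assumptions: the hash scheme is an arbitrary stand-in, blocks are formed from D consecutive words rather than D distinct words, and `build_blocks` and `search` are hypothetical names. A block matches when every bit of the query signature is also set in the block signature; matching blocks are then checked against the actual words to separate true hits from false hits.

```python
import hashlib

def word_signature(word, f=12, m=4):
    """f-bit integer with m bits set; the SHA-256 scheme is an assumption."""
    bits, counter = set(), 0
    while len(bits) < m:
        h = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        bits.add(int.from_bytes(h[:4], "big") % f)
        counter += 1
    return sum(1 << b for b in bits)

def build_blocks(words, d=2, f=12, m=4):
    """Split the text into blocks of d words and store (block signature, words)."""
    blocks = []
    for start in range(0, len(words), d):
        chunk = words[start:start + d]
        sig = 0
        for w in chunk:
            sig |= word_signature(w, f, m)
        blocks.append((sig, chunk))
    return blocks

def search(blocks, term, f=12, m=4):
    """Scan all block signatures; return (true hit blocks, false hit blocks)."""
    q = word_signature(term, f, m)
    hits, false_hits = [], []
    for i, (sig, chunk) in enumerate(blocks):
        if sig & q == q:                  # every query bit is set in the block
            if term in chunk:
                hits.append(i)            # case (a): the term really is there
            else:
                false_hits.append(i)      # case (b): a false hit
    return hits, false_hits

words = "free text retrieval uses signature files".split()
blocks = build_blocks(words)
hits, false_hits = search(blocks, "text")
```

The filter never misses a block that contains the term; it can only report extra blocks, which is why a final check against the text is needed.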

16 Search for Substring In some information retrieval applications, any substring can be a search term. Tries, implemented using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.

17 Tries: Search for Substring
Basic concept:
- The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique.
- The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once.
- Each sistring can be associated with a location within a document where the sistring occurs.
- Subtrees below a certain node represent all occurrences of the substring represented by that node.
- Suffix trees have a size of the same order of magnitude as the input documents.
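The idea that a node stands for a substring, and everything below it for the occurrences of that substring, can be shown with a deliberately naive trie of suffixes. This is a sketch under simplifying assumptions: every suffix is inserted in full, one character per node, so the structure is quadratic in the text length rather than compact like a real suffix tree, and the names `build_suffix_trie` and `find_all` are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a character trie.
    Each node records the starting positions of the suffixes passing through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root

def find_all(trie, pattern):
    """Return the starting positions of every occurrence of a non-empty pattern."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["starts"]

trie = build_suffix_trie("beginning between")
positions = find_all(trie, "be")   # occurrences of the substring "be"
```

A search takes time proportional to the pattern length, independent of the text size. A practical suffix tree or Patricia tree gets the same behaviour in linear space by cutting each sistring off where it becomes unique and collapsing single-descendant nodes.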

18 Tries: Suffix Tree
Example: suffix tree for the following words: begin, beginning, between, bread, break
b
  e
    gin
      _
      ning
    tween
  rea
    d
    k

19 Tries: Sistrings A binary example String: Sistrings:

20 Tries: Lexical Ordering Unique string indicated in blue

21 Trie: Basic Concept

22 Patricia Tree Single-descendant nodes are eliminated. Nodes have bit number.

23 Oxford English Dictionary