CS 430: Information Discovery. Lecture 4: File Structures for Inverted Files.


1 CS 430: Information Discovery Lecture 4 File Structures for Inverted Files

2 Course Administration Assignment 1 has been posted on the web site.

3 Right Threaded Binary Tree
Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor.
Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e. the link to the successor, of each node is maintained.
Knuth vol 1, 2.3.1, page 325.
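A minimal Python sketch of the right-threaded tree defined above. The names `Node`, `is_thread`, `insert`, and `in_order` are illustrative assumptions, not taken from Knuth or the lecture. The point it tries to show is that the right thread lets an in-order traversal run without a stack or recursion.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None       # real right child, or a thread to the successor
        self.is_thread = False  # True when `right` is a thread


def insert(root, key):
    """Insert key into a right-threaded BST and return the root."""
    new = Node(key)
    if root is None:
        return new
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                # A new left child has its parent as in-order successor.
                new.right, new.is_thread = cur, True
                cur.left = new
                return root
            cur = cur.left
        else:
            if cur.right is None or cur.is_thread:
                # A new right child inherits the parent's thread (possibly None).
                new.right, new.is_thread = cur.right, cur.is_thread
                cur.right, cur.is_thread = new, False
                return root
            cur = cur.right


def leftmost(node):
    while node is not None and node.left is not None:
        node = node.left
    return node


def in_order(root):
    """Visit keys in sorted order using threads instead of a stack."""
    keys = []
    cur = leftmost(root)
    while cur is not None:
        keys.append(cur.key)
        if cur.is_thread:
            cur = cur.right            # follow the thread to the successor
        else:
            cur = leftmost(cur.right)  # descend into the real right subtree
    return keys


root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)
print(in_order(root))
```

In this sketch duplicates are sent to the right subtree; a real index would more likely reject them or attach a postings list to the existing key.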

4 Right Threaded Binary Tree From: Robert F. Rossa

5 Definitions
Keyword: A term that is used to describe the subject matter in a document. It is sometimes called an index term. In full text indexing, every word in the text is treated as a keyword (with the exception of stopwords). Keywords can be extracted automatically from a document or assigned by a human cataloguer or indexer.
Controlled vocabulary: A list of words that can be used as keywords. For example, in a retrieval system used for research papers in medicine, the controlled vocabulary might be a list of medical terms.

6 Restrictions in Building Inverted Files
- Underlying character set, e.g., printable ASCII, Unicode, UTF-8.
- Whether to use a controlled vocabulary. If so, what words to include.
- List of stopwords.
- Rules to decide the beginning and end of words, e.g., spaces or punctuation.
- Character sequences not to be indexed, e.g., sequences of numbers.
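The restrictions above can be sketched as a small term extractor. Everything specific here is an assumption for illustration: the stopword list, the rule that any non-alphanumeric character ends a word, and the function name `extract_terms`.

```python
import re

# Assumed stopword list; a real system would use a much longer one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def extract_terms(text, stopwords=STOPWORDS, vocabulary=None):
    """Return the index terms of `text` under the stated restrictions."""
    terms = []
    # Word boundaries: any run of characters outside A-Z, a-z, 0-9.
    for token in re.split(r"[^A-Za-z0-9]+", text):
        word = token.lower()
        if not word:
            continue                  # empty piece produced by the split
        if word.isdigit():
            continue                  # sequences of numbers are not indexed
        if word in stopwords:
            continue
        if vocabulary is not None and word not in vocabulary:
            continue                  # controlled vocabulary in force
        terms.append(word)
    return terms


print(extract_terms("The cost of 42 B-trees in 1990."))
```

Note that this boundary rule splits "B-trees" into "b" and "trees"; whether hyphens end a word is exactly the kind of decision the slide refers to.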

7 Representation of Inverted Files
Index file: Stores list of terms (keywords). Designed for rapid searching and processing range queries. May be held in memory.
Postings file: Stores list of postings for each term. Designed for rapid evaluation of Boolean operators. May be stored sequentially.
Document file: [Repositories for the storage of document collections are covered in CS 502.]
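One possible layout for the index file and postings file, sketched in Python. The representation is an assumption: the index is a sorted list of `(term, offset, count)` entries and the postings file is one flat list of document numbers. A sorted index supports both exact lookup and range queries by binary search, and sorted postings allow Boolean AND by merging.

```python
from bisect import bisect_left


def build_inverted_file(docs):
    """docs maps document number -> list of terms. Returns (index, postings)."""
    inverted = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id]):
            inverted.setdefault(term, []).append(doc_id)
    index, postings = [], []
    for term in sorted(inverted):
        # Each index entry points at a contiguous run in the postings file.
        index.append((term, len(postings), len(inverted[term])))
        postings.extend(inverted[term])
    return index, postings


def lookup(index, postings, term):
    """Binary search the index, then read the term's postings."""
    i = bisect_left(index, (term,))
    if i < len(index) and index[i][0] == term:
        _, offset, count = index[i]
        return postings[offset:offset + count]
    return []


def range_query(index, lo, hi):
    """Terms t with lo <= t < hi, found by two binary searches."""
    return [t for t, _, _ in index[bisect_left(index, (lo,)):bisect_left(index, (hi,))]]


def intersect(a, b):
    """Boolean AND of two sorted postings lists by a linear merge."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


docs = {1: ["tree", "index"], 2: ["tree", "file"], 3: ["index", "file", "tree"]}
index, postings = build_inverted_file(docs)
print(intersect(lookup(index, postings, "index"), lookup(index, postings, "file")))
print(range_query(index, "f", "j"))
```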

8 Sizes of Inverted Files
Set   Records   Unique Terms
A     2,653     5,123
B     38,304    c. 25,000
Set A has an average of 14 postings per term and a maximum of over 2,000 postings per term. Set B has an average of 88 postings per record.
Examples from Harman and Candela, 1990

9 B-trees
B-tree of order m: A balanced, multiway search tree in which each node stores many keys.
- The root has between 2 and 2m keys.
- All other internal nodes have between m and 2m keys.
- If k_i is the i-th key in a given internal node, then all keys in the (i-1)-th child are smaller than k_i and all keys in the i-th child are bigger than k_i.
- All leaves are at the same depth.
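A sketch of searching a B-tree that satisfies the key-ordering property above. `BTreeNode` and `search` are assumed names, and the example tree is built by hand rather than by insertion. The code indexes keys from 0, so `children[i]` holds the keys smaller than `keys[i]`, which corresponds to the (i-1)-th child in the slide's 1-based key numbering.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys held in this node
        self.children = children or []  # empty for a leaf, else len(keys) + 1


def search(node, key):
    """Return True if key is stored in the subtree rooted at node."""
    while node is not None:
        i = bisect_left(node.keys, key)  # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False                 # reached a leaf without finding key
        node = node.children[i]          # child whose keys lie between neighbours
    return False


# Hand-built tree of order m = 2: root with 2 keys, three leaves with 2 keys each.
root = BTreeNode(
    [20, 40],
    [BTreeNode([5, 10]), BTreeNode([25, 30]), BTreeNode([45, 50])],
)
print(search(root, 30), search(root, 35))
```

Each step down the tree is one node read, so the cost of a lookup is proportional to the depth, which the balance condition keeps the same for every leaf.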

10 B+-tree
B+-tree: A B-tree is used as an index. Data is stored in the leaves of the tree, known as buckets.
[Figure: example B+-tree of order 2, bucket size 4]

11 B-tree Discussion
For a discussion of B-trees, see Frakes, Section 2.3.1.
B-trees combine fast retrieval with moderately efficient updating.
Bottom-up updating is usually fast, but may require recursive tree climbing to the root.
The main weakness is poor storage utilization; typically buckets are only about 69% (0.69) full.
Various algorithmic improvements increase storage utilization at the expense of updating performance.

12 Signature Files
Inexact filter: A quick test which discards many of the non-qualifying items.
Advantages
- Much faster than full text scanning (1 or 2 orders of magnitude)
- Modest space overhead (10% to 15% of file)
- Insertion is straightforward
Disadvantages
- Sequential searching is unsuitable for very large files
- Some hits are false hits

13 Signature Files
Signature size: Number of bits in a signature, F.
Word signature: A bit pattern of size F with m bits set to 1 and the others 0. The word signature is calculated by a hash function.
Block: A sequence of text that contains D distinct words.
Block signature: The logical OR of all the word signatures in a block of text.
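A sketch of how word and block signatures might be computed. The choice of SHA-256 with a counter to pick bit positions is an assumption made only so the example is deterministic; the lecture does not say which hash function is used. `word_signature` and `block_signature` are illustrative names.

```python
import hashlib

F = 12  # signature size in bits
M = 4   # bits set to 1 in each word signature (requires M <= F)


def word_signature(word, f=F, m=M):
    """Return an f-bit integer with exactly m bits set, derived by hashing."""
    sig, counter = 0, 0
    while bin(sig).count("1") < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        position = int.from_bytes(digest[:4], "big") % f
        sig |= 1 << position   # repeated positions are retried via counter
        counter += 1
    return sig


def block_signature(words, f=F, m=M):
    """Logical OR of the word signatures of all words in the block."""
    sig = 0
    for word in words:
        sig |= word_signature(word, f, m)
    return sig


block = block_signature(["free", "text"])
print(format(word_signature("free"), "012b"))
print(format(word_signature("text"), "012b"))
print(format(block, "012b"))
```

The bit patterns printed depend on the assumed hash, so they will not match the patterns in the lecture's own example.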

14 Signature Files Example
[Table: word signatures for "free" and "text" and the resulting block signature]
F = 12 bits in a signature
m = 4 bits per word
D = 2 words per block

15 Signature Files
A query term is processed by matching its signature against the block signature.
(a) If the term is in the block, its word signature will always match the block signature.
(b) A word signature may match the block signature even though the word is not in the block. This is a false hit, also called a false drop.
The design challenge is to minimize the false drop probability, F_d.
Frakes, Section 4.2, page 47, discusses how to minimize F_d. The rest of that chapter discusses enhancements to the basic algorithm.
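The matching test and a false hit can be illustrated with hand-picked bit patterns. The four signatures below are hypothetical values chosen for the illustration, not the output of any real hash function. The estimate in `false_drop_estimate` is the standard approximation that treats bit positions as independent and uniformly chosen; it is offered as an assumption, not as the derivation in Frakes.

```python
import math


def matches(word_sig, block_sig):
    """True if every bit set in the word signature is set in the block."""
    return (word_sig & block_sig) == word_sig


def false_drop_estimate(f, m, d):
    """Approximate F_d as (1 - e^(-m*d/f))^m under an independence assumption."""
    return (1.0 - math.exp(-m * d / f)) ** m


# Hypothetical 12-bit signatures with m = 4 bits set each.
SIG_A = 0b000011110000      # word in the block
SIG_B = 0b111100000000      # word in the block
SIG_C = 0b001100110000      # word NOT in the block, bits covered by A OR B
SIG_D = 0b000000001111      # word NOT in the block, bits not covered

block = SIG_A | SIG_B

print(matches(SIG_A, block))   # case (a): a word in the block always matches
print(matches(SIG_C, block))   # case (b): a false hit
print(matches(SIG_D, block))   # correctly filtered out
print(round(false_drop_estimate(12, 4, 2), 3))
```

Because a match only says the block may contain the term, every matching block still has to be scanned to confirm or reject the hit.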

16 Tries: Basic Concept
The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique.
The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once.
Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node.
Suffix trees (and similar suffix arrays) have a size of the same order of magnitude as the input documents.
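A sketch of the suffix array mentioned above, which stores the starting positions of all sistrings in lexical order. It is deliberately naive: sorting whole suffixes is quadratic in the worst case, and real implementations use specialised construction algorithms. The function names are assumptions, and here every character position starts a sistring, whereas a text index might only start them at word boundaries.

```python
def build_suffix_array(text):
    """Starting positions of all sistrings, sorted by the sistring itself."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """All positions where pattern occurs, via two binary searches."""
    n = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + n]

    # Lower bound: first sistring whose n-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first sistring whose n-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

All sistrings that begin with the pattern are adjacent in the sorted order, which is the array counterpart of "the subtree below a node holds all occurrences of that substring".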

17 Tries: Suffix Tree
Example: suffix tree for the following words: begin, beginning, between, bread, break
b
  e
    gin
      _
      ning
    tween
  rea
    d
    k
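The same five words can be stored in a plain character-per-node trie, sketched below. This is simpler than the tree in the example, which merges single-child chains such as "gin" and "tween" into one edge; the names `TrieNode`, `insert`, and `words_with_prefix` are assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a stored word ends at this node


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def words_with_prefix(root, prefix):
    """Return all stored words that start with prefix, in sorted order."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    found, stack = [], [(node, prefix)]
    while stack:
        current, so_far = stack.pop()
        if current.is_word:
            found.append(so_far)
        for ch, child in current.children.items():
            stack.append((child, so_far + ch))
    return sorted(found)


root = TrieNode()
for w in ["begin", "beginning", "between", "bread", "break"]:
    insert(root, w)
print(words_with_prefix(root, "be"))
print(words_with_prefix(root, "brea"))
```

The common prefix "b" is stored once, and everything below the node reached by "be" is exactly the set of words starting with "be".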

18 Tries: Sistrings A binary example String: Sistrings:

19 Tries: Lexical Ordering Unique remaining subtrie indicated in red

20 Trie: Basic Concept

21 Patricia Tree
Single-descendant nodes are eliminated. Each remaining node stores a bit number, i.e. the position of the bit to test at that node.

22 Oxford English Dictionary