CS 430: Information Discovery


CS 430: Information Discovery Lecture 6 Data Structures for Information Retrieval

Course Administration
• Course news group: cornell.courses.cs430
• Sending messages to cs430@cs.cornell.edu
The Teaching Assistants should not do your work for you. Before sending them a question, check the slides and the textbook.
The assignments leave several decisions to your judgment. Make your own decisions and record them in your report.

Indexing Subsystem
Documents
-> assign document IDs (produces text, document numbers and *field numbers)
-> break into words (produces words)
-> stoplist (produces non-stoplist words)
-> stemming* (produces stemmed words)
-> term weighting* (produces terms with weights)
-> Index database
*Indicates optional operation.
From Frakes, page 7
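The stages of the indexing subsystem can be sketched in a few lines of Python. This is only an illustration: the stoplist contents, the crude suffix-stripping `stem` function, and the use of raw term frequency as the weight are all assumptions, since the slide does not specify any of them.

```python
import re
from collections import Counter

STOPLIST = {"the", "a", "of", "and", "is", "in"}  # hypothetical stoplist


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_terms(text, use_stemming=True):
    """Return {term: weight} for one document; weight is the term frequency."""
    words = re.findall(r"[a-z]+", text.lower())      # break into words
    words = [w for w in words if w not in STOPLIST]  # drop stoplist words
    if use_stemming:                                 # optional operation
        words = [stem(w) for w in words]
    return dict(Counter(words))                      # term weighting


# Assign document IDs, then index each document.
documents = ["The cat is chasing the dogs", "Dogs and cats"]
index_db = {doc_id: index_terms(text)
            for doc_id, text in enumerate(documents, start=1)}
```

Here `index_db` maps each document ID to its weighted terms; a real system would then invert this mapping into an index file and postings file.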

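Organization of Inverted Files
Three files: Index file, Postings file, Documents file
Index file: each entry holds a term and a pointer to its postings (terms in the figure: ant, bee, cat, dog, elk, fox, gnu, hog)
Postings file: the inverted lists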
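A minimal sketch of the three-file organization, held in memory for illustration. Representing the pointer as an (offset, length) pair into a flat postings list is an assumption, as are the sample documents.

```python
def build_inverted_file(documents):
    """documents: {doc_id: text}. Returns (index_file, postings_file)."""
    inverted = {}
    for doc_id in sorted(documents):
        for term in set(documents[doc_id].lower().split()):
            inverted.setdefault(term, []).append(doc_id)

    index_file = []      # sorted entries: (term, offset, length)
    postings_file = []   # all inverted lists, stored back to back
    for term in sorted(inverted):
        index_file.append((term, len(postings_file), len(inverted[term])))
        postings_file.extend(inverted[term])
    return index_file, postings_file


def lookup(term, index_file, postings_file):
    """Follow the pointer for a term and return its inverted list."""
    for t, offset, length in index_file:   # linear scan keeps the sketch short
        if t == term:
            return postings_file[offset:offset + length]
    return []


documents_file = {1: "ant bee cat", 2: "bee dog", 3: "cat dog elk"}
index_file, postings_file = build_inverted_file(documents_file)
```

With these sample documents, `lookup("bee", index_file, postings_file)` returns the IDs of documents 1 and 2.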

Documents File for Web Search System
Indexes are built using a web crawler, which retrieves each page on the web (or a subset). After indexing, each page is discarded, unless stored in a cache.
In addition to the usual index file (word list) and postings files, the indexing system stores:
1. List of URLs of pages indexed. This list is used instead of the documents file.
2. Short abstract of each page (optional). Used to describe the page when lists of hits are returned to the user.
3. For each page, list of URLs of pages it links to. This data structure is used for reference pattern ranking, either static (e.g., PageRank) or dynamic.

Index File Structures: Linear Index
Advantages
• Can be searched quickly, e.g., by binary search, O(log n)
• Good for sequential processing, e.g., comp*
• Convenient for batch updating
• Economical use of storage
Disadvantages
• Index must be rebuilt if an extra term is added
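A short sketch of both search styles on a linear index, using Python's standard `bisect` module. The term list is borrowed from the inverted file example; the function names are illustrative.

```python
from bisect import bisect_left

index = ["ant", "bee", "cat", "dog", "elk", "fox", "gnu", "hog"]  # sorted terms


def find(term):
    """Binary search, O(log n). Returns the position of term, or -1."""
    i = bisect_left(index, term)
    return i if i < len(index) and index[i] == term else -1


def prefix_range(prefix):
    """Sequential processing: all terms matching a query such as comp*."""
    start = bisect_left(index, prefix)   # first term >= prefix
    end = start
    while end < len(index) and index[end].startswith(prefix):
        end += 1
    return index[start:end]
```

Because the terms are stored in sorted order, every term sharing a prefix is contiguous, which is why truncation queries are cheap here and awkward in an unthreaded binary tree.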

Index File Structures: Binary Tree
Input: elk, hog, bee, fox, cat, gnu, ant, dog
Resulting binary search tree:
elk
  left: bee
    left: ant
    right: cat
      right: dog
  right: hog
    left: fox
      right: gnu

Binary Tree
Advantages
• Can be searched quickly
• Convenient for batch updating
• Easy to add an extra term
• Economical use of storage
Disadvantages
• Poor for sequential processing, e.g., comp*
• Tree tends to become unbalanced
• If the index is held on disk, important to optimize the number of disk accesses

Binary Tree
Calculation of maximum depth of tree. Illustrates importance of balanced trees.
Worst case: depth = n, which is O(n)
Ideal case: depth = log(n + 1)/log 2, which is O(log n)
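The effect of insertion order on depth can be checked with a small sketch. The `Node`, `insert`, `build`, and `depth` names are illustrative; the input is the term sequence from the binary tree slide.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Standard unbalanced binary search tree insertion."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


def depth(root):
    """Number of nodes on the longest path from the root to a leaf."""
    return 0 if root is None else 1 + max(depth(root.left), depth(root.right))


words = ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]
mixed_depth = depth(build(words))           # insertion order from the slide
sorted_depth = depth(build(sorted(words)))  # worst case: already sorted input
```

With n = 8 terms, the slide's insertion order gives a depth of 4, while inserting the same terms in sorted order degenerates into a chain of depth 8.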

Right Threaded Binary Tree Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor. Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e. link to the successor, of each node is maintained. Knuth vol 1, 2.3.1, page 325.
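A minimal sketch of a right-threaded tree, showing how the successor threads allow an in-order scan with no stack and no recursion. The `is_thread` flag and the function names are assumptions made for illustration; Knuth's presentation uses tag bits for the same purpose.

```python
class TNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None       # real right child, or thread to the successor
        self.is_thread = False  # True when right is a thread


def insert(root, key):
    new = TNode(key)
    if root is None:
        return new
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = new
                new.right, new.is_thread = cur, True  # successor is the parent
                return root
            cur = cur.left
        else:
            if cur.right is None or cur.is_thread:
                # The new node inherits the parent's old thread (if any).
                new.right, new.is_thread = cur.right, cur.is_thread
                cur.right, cur.is_thread = new, False
                return root
            cur = cur.right


def leftmost(node):
    while node is not None and node.left is not None:
        node = node.left
    return node


def inorder(root):
    """Yield keys in sorted order by following right links and threads."""
    cur = leftmost(root)
    while cur is not None:
        yield cur.key
        cur = cur.right if cur.is_thread else leftmost(cur.right)


root = None
for word in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, word)
```

Iterating over `inorder(root)` visits the terms alphabetically, which addresses the binary tree's weakness for sequential processing.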

Right Threaded Binary Tree From: Robert F. Rossa

B-trees
B-tree of order m: A balanced, multiway search tree:
• Each node stores many keys
• Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys.
• If ki is the ith key in a given internal node
  -> all keys in the (i-1)th child are smaller than ki
  -> all keys in the ith child are bigger than ki
• All leaves are at the same depth

B-trees
B-tree example (order 2)
Root: 50 65
Second level: 10 19 35 | 55 59 | 70 90 98
Third level (nodes shown): 1 5 8 9 | 12 14 18 | 21 24 28 | 36 47 | 66 68 | 72 73 | 91 95 97
Every arrow points to a node containing between 2 and 4 keys. A node with k keys has k + 1 pointers.
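Searching a B-tree follows directly from the key ordering rule: within a node, find where the key would fall, then descend into the matching child. The sketch below is a simplification that uses only the top two levels of the example, treating the second-level nodes as leaves; `BTreeNode` and `search` are illustrative names, and children are numbered from 0.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                 # sorted keys in this node
        self.children = children or []   # empty for a leaf, else len(keys) + 1


def search(node, key):
    """Return True if key is stored somewhere in the tree rooted at node."""
    while node is not None:
        i = bisect_left(node.keys, key)  # number of keys smaller than key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:            # reached a leaf without a match
            return False
        node = node.children[i]          # child between keys[i-1] and keys[i]
    return False


root = BTreeNode(
    [50, 65],
    [BTreeNode([10, 19, 35]), BTreeNode([55, 59]), BTreeNode([70, 90, 98])],
)
```

Each step of the loop corresponds to reading one node, so on disk the cost of a lookup is bounded by the height of the tree.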

Example: B+-tree of order 2, bucket size 4
• A B-tree is used as an index
• Data is stored in the leaves of the tree, known as buckets
Index: root 50 65; second level 10 25 | 55 59 | 70 81 90
Buckets: ... D9 | D51 ... D54 | D66 ... D81 ...

B-tree Discussion
For a discussion of B-trees, see Frakes, Section 2.3.1, pages 18-20.
• B-trees combine fast retrieval with moderately efficient updating.
• Bottom-up updating is usually fast, but may require recursive tree climbing to the root.
• The main weakness is poor storage utilization; typically buckets are only about 69% full.
• Various algorithmic improvements increase storage utilization at the expense of updating performance.

Signature Files: Sequential Search without Inverted File
Inexact filter: A quick test which discards many of the non-qualifying items.
Advantages
• Much faster than full text scanning: 1 or 2 orders of magnitude
• Modest space overhead: 10% to 15% of file
• Insertion is straightforward
Disadvantages
• Sequential searching no good for very large files
• Some hits are false hits

Signature Files
Signature size. Number of bits in a signature, F.
Word signature. A bit pattern of size F with m bits set to 1 and the others 0. The word signature is calculated by a hash function.
Block. A sequence of text that contains D distinct words.
Block signature. The logical OR of all the word signatures in a block of text.

Signature Files
Example
Word              Signature
free              001 000 110 010
text              000 010 101 001
block signature   001 010 111 011
F = 12 bits in a signature
m = 4 bits per word
D = 2 words per block

Signature Files
A query term is processed by matching its signature against the block signature.
(a) If the term is in the block, its word signature will always match the block signature.
(b) A word signature may match the block signature even though the word is not in the block. This is a false hit, also called a false drop.
The design challenge is to minimize the false drop probability, Fd. Frakes, Section 4.2, page 47 discusses how to minimize Fd. The rest of that chapter discusses enhancements to the basic algorithm.
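The matching rule can be sketched with integer bit masks. The signatures for "free" and "text" are the ones from the example slide; the `word_signature` hash construction (SHA-256 based) and the `OTHER_SIG` pattern are hypothetical, since the slides do not say which hash function is used.

```python
import hashlib

F = 12  # bits per signature
M = 4   # bits set to 1 per word


def word_signature(word, f=F, m=M):
    """Hypothetical hash: set bits until m distinct positions out of f are 1."""
    sig, counter = 0, 0
    while bin(sig).count("1") < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % f)
        counter += 1
    return sig


def block_signature(word_sigs):
    """Logical OR of all word signatures in the block."""
    sig = 0
    for s in word_sigs:
        sig |= s
    return sig


def may_contain(block_sig, word_sig):
    """True if every 1 bit of the word signature is also set in the block."""
    return (block_sig & word_sig) == word_sig


# Signatures taken from the example slide.
FREE_SIG = 0b001_000_110_010
TEXT_SIG = 0b000_010_101_001
block = block_signature([FREE_SIG, TEXT_SIG])

# Hypothetical signature of a word that is NOT in the block: a false hit.
OTHER_SIG = 0b001_010_000_011
```

`may_contain(block, OTHER_SIG)` is true even though that word is absent, so any match must be confirmed by scanning the block's text; a non-match is always conclusive.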

Search for Substring In some information retrieval applications, any substring can be a search term. Tries, implemented using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.

Tries: Search for Substring Basic concept The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique. The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once. Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node. Suffix trees have a size of the same order of magnitude as the input documents.

Tries: Suffix Tree
Example: suffix tree for the following words: begin, beginning, between, bread, break
b
  e
    gin
      _
      ning
    tween
  rea
    d
    k
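The idea that common parts are stored only once can be sketched with a dictionary-based trie over the five example words. Unlike the figure, this sketch keeps one character per edge and does not merge single-child chains such as "gin" or "tween"; the function names are illustrative, and `_` is used as the end-of-word marker as in the figure.

```python
END = "_"  # end-of-word marker


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # shared prefixes reuse nodes
        node[END] = {}
    return root


def words_with_prefix(root, prefix):
    """All stored words below the node reached by following prefix."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    results, stack = [], [(node, prefix)]
    while stack:
        current, text = stack.pop()
        for ch, child in current.items():
            if ch == END:
                results.append(text)
            else:
                stack.append((child, text + ch))
    return sorted(results)


trie = build_trie(["begin", "beginning", "between", "bread", "break"])
```

The subtree below the node for "beg" holds exactly the words beginning with that string, which mirrors the property that a subtree represents all occurrences of the substring at its root.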

Tries: Sistrings
A binary example
String: 01 100 100 010 111
Sistrings:
1  01 100 100 010 111
2  11 001 000 101 11
3  10 010 001 011 1
4  00 100 010 111
5  01 000 101 11
6  10 001 011 1
7  00 010 111
8  00 101 11

Tries: Lexical Ordering
7  00 010 111
4  00 100 010 111
8  00 101 11
5  01 000 101 11
1  01 100 100 010 111
6  10 001 011 1
3  10 010 001 011 1
2  11 001 000 101 11
(In the original slide, the unique string of each sistring is indicated in blue.)
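The lexical ordering can be reproduced by sorting the sistrings as ordinary strings, and the sorted list then supports substring search by binary search. This sketch uses a sorted list rather than a trie, which is a substitution made for brevity; positions are 1-based and only the first eight sistrings are used, as on the slides.

```python
from bisect import bisect_left

text = "01100100010111"  # the example string with the spaces removed


def sistrings(text, count):
    """Map each 1-based starting position to its semi-infinite string."""
    return {pos: text[pos - 1:] for pos in range(1, count + 1)}


def lexical_order(text, count):
    """Starting positions sorted by the sistring that begins there."""
    s = sistrings(text, count)
    return sorted(s, key=s.get)


def find_substring(text, count, pattern):
    """Positions (among the first count) where pattern occurs."""
    s = sistrings(text, count)
    order = sorted(s, key=s.get)
    sorted_strings = [s[pos] for pos in order]
    i = bisect_left(sorted_strings, pattern)  # first sistring >= pattern
    hits = []
    while i < len(order) and sorted_strings[i].startswith(pattern):
        hits.append(order[i])
        i += 1
    return sorted(hits)


order = lexical_order(text, 8)
```

Here `order` comes out as 7, 4, 8, 5, 1, 6, 3, 2, and `find_substring(text, 8, "0010")` returns positions 4 and 8, because all sistrings sharing a prefix are adjacent in the sorted order.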

Trie: Basic Concept
[Figure: binary trie built from the sistrings above; edges are labeled with bits and leaves with sistring positions 1 to 8.]

Patricia Tree
[Figure: Patricia tree for the same sistrings; internal nodes are labeled with bit numbers and leaves with sistring positions 1 to 8.]
Single-descendant nodes are eliminated. Nodes have bit number.

Oxford English Dictionary