1
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval
2
2 Course Administration The Wednesday evening classes have been moved to Hollister 110. Introduction to Perl: Classes will be held on Wednesday evenings, September 19 and October 3. Before the first class, look at the CS 430 web site and attempt Assignment 0. (These classes and Assignment 0 are optional.)
3
3 Inverted Files: Search for Keywords Index file: Stores list of terms (keywords). Designed for rapid searching and processing range queries. May be held in memory. Postings file: Stores list of postings for each term. Designed for rapid evaluation of Boolean operators. May be stored sequentially. Document file: [Repositories for the storage of document collections are covered in CS 502.]
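The split between an index file and a postings file can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-document collection, the names `docs`, `postings`, `index`, `boolean_and`, and `range_query` are all hypothetical, and everything is held in memory rather than on disk.

```python
from bisect import bisect_left, bisect_right

# Hypothetical toy collection: document id -> text.
docs = {
    1: "ant bee cat",
    2: "bee dog elk",
    3: "cat dog fox",
}

# Postings file: term -> list of document ids, kept in ascending order
# because documents are processed in ascending id order.
postings = {}
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        postings.setdefault(term, []).append(doc_id)

# Index file: sorted list of terms, so it supports binary search
# and range queries.
index = sorted(postings)


def boolean_and(term_a, term_b):
    """Intersect two sorted postings lists with a linear merge."""
    a, b = postings.get(term_a, []), postings.get(term_b, [])
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def range_query(low, high):
    """Return all index terms t with low <= t <= high."""
    return index[bisect_left(index, low):bisect_right(index, high)]
```

Keeping each postings list sorted is what makes the Boolean AND a single sequential pass, which fits the note that the postings file may be stored sequentially.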
4
4 Index File Structures: Binary Tree [Figure: binary tree of terms, by level: elk / cat, hog / bee, dog, fox / ant, gnu]
5
5 Binary Tree Advantages: Can be searched quickly. Convenient for batch updating. Easy to add an extra term. Economical use of storage. Disadvantages: Poor for sequential processing, e.g., comp*. Tree tends to become unbalanced. If the index is held on disk, it is important to minimize the number of disk accesses.
6
6 Binary Tree Calculation of maximum depth of tree. Illustrates importance of balanced trees. For a tree with n nodes: Worst case: depth = n, which is O(n). Ideal case: depth = log(n + 1)/log 2, which is O(log n).
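The two depth figures can be checked with a small experiment. The sketch below is an assumption-laden illustration, not part of the lecture: it uses a plain unbalanced binary search tree, counts depth as the number of nodes on the longest root-to-leaf path, and the names `Node`, `insert`, `depth`, `build`, and `balanced_order` are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert into an unbalanced binary search tree; returns the root."""
    new = Node(key)
    if root is None:
        return new
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = new
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = new
                return root
            node = node.right


def depth(root):
    """Number of nodes on the longest root-to-leaf path."""
    best, stack = 0, [(root, 1)]
    while stack:
        node, d = stack.pop()
        if node is None:
            continue
        best = max(best, d)
        stack.append((node.left, d + 1))
        stack.append((node.right, d + 1))
    return best


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


def balanced_order(keys):
    """Reorder sorted keys so that each middle key is inserted first."""
    if not keys:
        return []
    mid = len(keys) // 2
    return [keys[mid]] + balanced_order(keys[:mid]) + balanced_order(keys[mid + 1:])


keys = list(range(1, 8))                      # n = 7
worst = depth(build(keys))                    # sorted input: a chain
ideal = depth(build(balanced_order(keys)))    # middle-first input
```

With n = 7, inserting keys in sorted order gives a chain of depth 7 (depth = n), while the middle-first order gives depth 3, which equals log(7 + 1)/log 2.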
7
7 Right Threaded Binary Tree Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor. Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e. link to the successor, of each node is maintained. Knuth vol 1, 2.3.1, page 325.
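A right-threaded tree allows in-order traversal without a stack or recursion, which addresses the sequential-processing weakness of a plain binary tree. The following is a minimal sketch, not Knuth's algorithm verbatim: the class `TNode`, the flag `is_thread`, and the functions `insert` and `in_order` are hypothetical names, and the insertion order of the sample terms is an assumption.

```python
class TNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None       # right child, or thread to the in-order successor
        self.is_thread = False  # True when `right` is a thread, not a child


def insert(root, key):
    """Insert a key, maintaining right threads; returns the root."""
    new = TNode(key)
    if root is None:
        return new
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                # A new left child is immediately followed by its parent.
                new.right = node
                new.is_thread = True
                node.left = new
                return root
            node = node.left
        else:
            if node.right is None or node.is_thread:
                # A new right child inherits the parent's successor link.
                new.right = node.right
                new.is_thread = node.is_thread
                node.right = new
                node.is_thread = False
                return root
            node = node.right


def in_order(root):
    """Visit keys in sorted order by following threads (no stack)."""
    keys = []
    node = root
    while node is not None and node.left is not None:
        node = node.left
    while node is not None:
        keys.append(node.key)
        if node.is_thread:
            node = node.right            # jump straight to the successor
        else:
            node = node.right            # real child: descend to its leftmost node
            while node is not None and node.left is not None:
                node = node.left
    return keys


root = None
for term in ["elk", "cat", "hog", "bee", "dog", "fox", "ant", "gnu"]:
    root = insert(root, term)
```

Calling `in_order(root)` walks the terms in alphabetical order using only the right links.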
8
8 Right Threaded Binary Tree From: Robert F. Rossa
9
9 B-trees B-tree of order m: A balanced, multiway search tree. Each node stores many keys. The root has between 2 and 2m keys. All other internal nodes have between m and 2m keys. If k_i is the i-th key in a given internal node, then all keys in the (i-1)-th child are smaller than k_i and all keys in the i-th child are bigger than k_i. All leaves are at the same depth.
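The key-ordering rule is what drives a B-tree lookup: within each node, a binary search either finds the key or selects the one child that could contain it. The sketch below assumes a hand-built tree of order 2 with made-up key values and covers search only (no insertion or splitting); `BNode` and `search` are hypothetical names, and keys and children are numbered from 0.

```python
from bisect import bisect_left


class BNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys in this node
        self.children = children or []    # len(keys) + 1 children, or none in a leaf


def search(node, key):
    """Return True if key is stored in the tree rooted at node."""
    while node is not None:
        # i = number of keys in this node that are smaller than key.
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False                  # reached a leaf without finding the key
        # Child i holds the keys between keys[i-1] and keys[i].
        node = node.children[i]
    return False


# Hand-built example of order m = 2: every node has between 2 and 4 keys
# and all leaves are at the same depth.
root = BNode(
    [50, 65],
    [BNode([10, 25]), BNode([55, 59]), BNode([70, 81, 90])],
)
```

Each loop iteration inspects one node, so on disk the number of accesses is bounded by the depth of the tree.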
10
10 B+-tree B+-tree: A B-tree is used as an index. Data is stored in the leaves of the tree, known as buckets. Example: B+-tree of order 2, bucket size 4. [Figure: index nodes with keys 50, 65 and 10, 25 / 55, 59 / 70, 81, 90; leaf buckets holding D9 ..., D51 ... D54, D66 ..., D81 ...]
11
11 B-tree Discussion For a discussion of B-trees, see Frakes, Section 2.3.1, pages 18-20. B-trees combine fast retrieval with moderately efficient updating. Bottom-up updating is usually fast, but may require recursive tree climbing to the root. The main weakness is poor storage utilization; typically buckets are only about 69% full. Various algorithmic improvements increase storage utilization at the expense of updating performance.
12
12 Signature Files: Sequential Search without Inverted File Inexact filter: A quick test which discards many of the non-qualifying items. Advantages: Much faster than full text scanning (1 or 2 orders of magnitude). Modest space overhead (10% to 15% of file). Insertion is straightforward. Disadvantages: Sequential searching is too slow for very large files. Some hits are false hits.
13
13 Signature Files Signature size. Number of bits in a signature, F. Word signature. A bit pattern of size F with m bits set to 1 and the others 0. The word signature is calculated by a hash function. Block. A sequence of text that contains D distinct words. Block signature. The logical OR of all the word signatures in a block of text.
14
14 Signature Files Example
Word             Signature
free             001 000 110 010
text             000 010 101 001
block signature  001 010 111 011
F = 12 bits in a signature, m = 4 bits per word, D = 2 words per block
15
15 Signature Files A query term is processed by matching its signature against the block signature. (a) If the term is in the block, its word signature will always match the block signature. (b) A word signature may match the block signature even though the word is not in the block. This is a false hit (also called a false drop). The design challenge is to minimize the false drop probability, F_d. Frakes, Section 4.2, page 47, discusses how to minimize F_d. The rest of that chapter discusses enhancements to the basic algorithm.
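The matching rule can be sketched as follows. The hash scheme here is an assumption chosen only to make the example deterministic (the lecture does not specify one), and `word_signature`, `block_signature`, and `may_contain` are hypothetical names. The constants `FREE` and `TEXT` are the bit patterns from the worked example with F = 12 and m = 4.

```python
import hashlib

F = 12  # signature size in bits
M = 4   # bits set to 1 in each word signature


def word_signature(word, f=F, m=M):
    """Set m distinct bit positions chosen by hashing the word.

    Assumed scheme: rehash with a counter until m distinct bits are set.
    """
    sig = 0
    counter = 0
    while bin(sig).count("1") < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % f)
        counter += 1
    return sig


def block_signature(words):
    """Logical OR of the word signatures of all words in the block."""
    sig = 0
    for word in words:
        sig |= word_signature(word)
    return sig


def may_contain(block_sig, word):
    """True if every bit of the word signature is set in the block signature.

    A True result may be a false hit; a False result is always correct.
    """
    ws = word_signature(word)
    return block_sig & ws == ws


# Bit patterns taken from the worked example.
FREE = 0b001000110010
TEXT = 0b000010101001
EXAMPLE_BLOCK = FREE | TEXT

block = block_signature(["free", "text"])
```

Because the block signature is an OR, a word that is in the block can never be missed; the only possible error is a match for a word that is absent, which must then be ruled out by checking the text itself.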
16
16 Search for Substring In some information retrieval applications, any substring can be a search term. Tries, implemented using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.
17
17 Tries: Search for Substring Basic concept The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique. The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once. Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node. Suffix trees have a size of the same order of magnitude as the input documents.
18
18 Tries: Suffix Tree Example: suffix tree for the following words: begin, beginning, between, bread, break
b
  e
    gin
      _
      ning
    tween
  rea
    d
    k
19
19 Tries: Sistrings A binary example
String:  01 100 100 010 111
Sistrings:
1  01 100 100 010 111
2  11 001 000 101 11
3  10 010 001 011 1
4  00 100 010 111
5  01 000 101 11
6  10 001 011 1
7  00 010 111
8  00 101 11
20
20 Tries: Lexical Ordering
7  00 010 111
4  00 100 010 111
8  00 101 11
5  01 000 101 11
1  01 100 100 010 111
6  10 001 011 1
3  10 010 001 011 1
2  11 001 000 101 11
On the slide, the unique string of each sistring is indicated in blue.
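The sistrings and their lexical ordering can be reproduced directly from the binary string. This is a small sketch under assumptions: only the first eight sistrings are considered, as in the example, and `sistrings` and `unique_prefix` are hypothetical helper names.

```python
text = "01100100010111"  # the example string with spaces removed


def sistrings(text, count):
    """Sistring i (numbered from 1) starts at position i and runs to the end."""
    return {i: text[i - 1:] for i in range(1, count + 1)}


def unique_prefix(target, others):
    """Shortest prefix of target that no other sistring starts with."""
    for length in range(1, len(target) + 1):
        prefix = target[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return target


sis = sistrings(text, 8)

# Sistring numbers sorted by the lexical order of their bit strings.
order = sorted(sis, key=sis.get)

# Shortest distinguishing prefix of each sistring among these eight.
prefixes = {
    i: unique_prefix(s, [t for j, t in sis.items() if j != i])
    for i, s in sis.items()
}
```

The resulting `order` lists the sistring numbers in sorted sequence, and `prefixes` gives, for each sistring, the leading bits needed to tell it apart from the other seven.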
21
21 Trie: Basic Concept [Figure: binary trie with branches labeled 0 and 1 and the sistrings 1 to 8 at the leaves]
22
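22 Patricia Tree [Figure: the same trie with branches labeled 0 and 1, the sistrings 1 to 8 at the leaves, and internal nodes labeled with bit numbers 1, 2, 2, 3, 3, 4, 5] Single-descendant nodes are eliminated. Each internal node stores the number of the bit to be tested.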
23
23 Oxford English Dictionary