7CCSMWAL Algorithmic Issues in the WWW Lecture 7
Text searching
To search for a keyword within text, we can:
  Scan the text sequentially, when
    the text is small, e.g., a few MB
    the text collection is very volatile (modified very frequently)
    no extra space is available (for building indices)
  Build a data structure over the text (called an index) to speed up the search, when
    the text collection is large and static or semi-static (can be updated at reasonably regular intervals)
Inverted files
Also called inverted indices
Mainly composed of two elements:
  Dictionary (or vocabulary, or lexicon): the set of all different words (tokens, index terms)
  Posting list (or inverted list): each word has a list of positions where the word appears; used here to refer to terms within documents
The postings file is the set of all posting lists
Postings lists are much larger than the dictionary
The dictionary is commonly kept in memory, and postings lists are normally kept on disk
The structure of postings lists can vary (problem dependent), e.g., each page of this PPT presentation
Example (see Intro to IR) The dictionary sorted alphabetically into terms Each posting list is sorted by document ID The numbers are the documents in which the term occurs (or lines in a page or book or whatever)
Construct an inverted file Input: a list of normalized tokens for each document Example Doc 1 I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me. Doc 2 So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:
Construct an inverted index
The core indexing step is sorting the list of tokens so that the terms are in alphabetical order
Multiple occurrences of the same term from the same document are then merged, and the term frequency (number of occurrences of the term in the document) is recorded
Instances of the same term are then grouped, and the result is split into a dictionary (of terms) and postings (the list of documents containing each term)
The dictionary also records some statistics, such as the number of documents that contain each term (document frequency) and the total number of occurrences
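The sort, merge and group steps can be sketched in a few lines of Python. This is only an illustration: the tokenisation (lower-casing, dropping punctuation) and the names `docs` and `build_inverted_index` are assumptions made for the example, not part of the lecture.

```python
from collections import Counter, defaultdict

# Doc 1 and Doc 2 from the example, already normalized (assumed tokenisation).
docs = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

def build_inverted_index(docs):
    # Step 1: list every (term, docID) pair and sort alphabetically by term.
    pairs = sorted((term, doc_id) for doc_id, text in docs.items()
                   for term in text.split())
    # Step 2: merge repeated pairs, recording the term frequency per document.
    tf = Counter(pairs)
    # Step 3: group by term into postings lists of (docID, term frequency).
    postings = defaultdict(list)
    for (term, doc_id), freq in sorted(tf.items()):
        postings[term].append((doc_id, freq))
    # Dictionary statistics: term -> (document frequency, total occurrences).
    dictionary = {term: (len(plist), sum(freq for _, freq in plist))
                  for term, plist in postings.items()}
    return dictionary, dict(postings)

dictionary, postings = build_inverted_index(docs)
print(postings["caesar"], dictionary["caesar"])
```

Here `caesar` occurs once in Doc 1 and twice in Doc 2, so its document frequency is 2 and its total number of occurrences is 3.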
Example The inverted file
Data structures for postings lists Fixed length arrays Each postings list is a fixed length array The arrays may not be fully utilized Reading a postings list is time-efficient as it is stored in contiguous locations
Data structures for postings lists Singly linked lists Each postings list is a singly linked list No empty entries but there is the overhead of the pointers
Methods to index the terms Various approaches include: External sort and merge In memory indexing based on hashing followed by merging Distributed indexing Google (e.g.) processes documents using Map-Reduce
Hardware constraint Indexing algorithm is governed by hardware constraints Characteristics of computer hardware Access to data in memory is much faster than access to data on disk It takes a few clock cycles to access a byte in memory, but much longer to transfer it from disk Want to keep as much data as possible in memory, especially the data that we need to access frequently
Index construction with disk
The list of tokens may be too large to be stored and sorted in memory
An external sorting algorithm minimizes the number of random disk seeks during sorting
Blocked sort-based indexing algorithm (BSBI):
  segment the collection into parts of equal size (Step 4 of the pseudo code)
  construct the intermediate inverted file for each part in memory (Step 5 of the pseudo code); this is the same construction as when the whole list of tokens fits in memory
  store the intermediate inverted files on disk (Step 6 of the pseudo code)
  merge all intermediate inverted files into the final index (Step 7 of the pseudo code)
BSBI pseudo code
BSBIndexConstruction()
1  n ← 0
2  while (all documents have not been processed)
3  do n ← n+1
4     block ← ParseNextBlock()
5     BSBI-Invert(block)
6     WriteBlockToDisk(block, f_n)
7  MergeBlocks(f_1, ..., f_n; f_merge)
Merging intermediate inverted files
Two blocks (“posting lists to be merged”) are loaded from disk into memory, merged in memory (“merged posting lists”) and written back to disk

Posting lists to be merged, block 1
  brutus  d1,d3
  caesar  d1,d2,d4
  noble   d5
  with    d1,d2,d3,d5
Posting lists to be merged, block 2
  brutus  d6,d7
  caesar  d8,d9
  julius  d10
  killed  d8
Merged posting lists
  brutus  d1,d3,d6,d7
  caesar  d1,d2,d4,d8,d9
  julius  d10
  killed  d8
  noble   d5
  with    d1,d2,d3,d5
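A minimal BSBI sketch in Python, under stated assumptions: the ten tiny documents are made up so that two blocks of five documents reproduce the posting lists in the merging example, the on-disk block format (one tab-separated line per term) is invented for illustration, and all function names are hypothetical. The merge streams the block files rather than loading them whole.

```python
import heapq
import os
import tempfile
from itertools import groupby, islice

def bsbi_invert(block_docs):
    """Invert one block in memory: sort (term, docID) pairs, then group by term."""
    pairs = sorted({(term, doc_id) for doc_id, text in block_docs
                    for term in text.split()})
    return [(term, [d for _, d in grp])
            for term, grp in groupby(pairs, key=lambda p: p[0])]

def write_block_to_disk(block, path):
    with open(path, "w") as f:
        for term, plist in block:
            f.write(term + "\t" + " ".join(map(str, plist)) + "\n")

def read_block(path):
    """Stream (term, postings) entries from a block file, one line at a time."""
    with open(path) as f:
        for line in f:
            term, ids = line.rstrip("\n").split("\t")
            yield term, [int(i) for i in ids.split()]

def merge_blocks(paths):
    # heapq.merge is stable, so postings from earlier blocks come first.
    merged = heapq.merge(*(read_block(p) for p in paths), key=lambda e: e[0])
    for term, grp in groupby(merged, key=lambda e: e[0]):
        yield term, [d for _, plist in grp for d in plist]

def bsbi_index_construction(docs, block_size, directory):
    doc_iter, paths = iter(docs), []
    while True:
        block_docs = list(islice(doc_iter, block_size))      # ParseNextBlock
        if not block_docs:
            break
        path = os.path.join(directory, f"block{len(paths) + 1}.txt")
        write_block_to_disk(bsbi_invert(block_docs), path)   # invert, then write
        paths.append(path)
    return list(merge_blocks(paths))                         # MergeBlocks

# Hypothetical documents d1..d10, chosen to match the merging example.
docs = [(1, "brutus caesar with"), (2, "caesar with"), (3, "brutus with"),
        (4, "caesar"), (5, "noble with"), (6, "brutus"), (7, "brutus"),
        (8, "caesar killed"), (9, "caesar"), (10, "julius")]

with tempfile.TemporaryDirectory() as tmp:
    index = bsbi_index_construction(docs, block_size=5, directory=tmp)
print(index)
```

Because blocks are created in increasing docID order, concatenating the postings of a term block by block keeps each merged list sorted by docID.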
Single-pass in-memory indexing (SPIMI)
No sorting of tokens is required
Tokens are processed one by one in memory
  When a term occurs for the first time, it is added to the dictionary (implemented as a hash table), and a new posting list is created
  Otherwise, the corresponding postings list is looked up
  Then the docID is added to the postings list
The process continues until the memory is full; the dictionary is then sorted and written to disk
Note: the dictionary is much shorter than the complete list of all tokens which occur (or that’s the idea)
Hash table
A data structure that supports the operations Lookup and Insert (and possibly Delete) in expected constant time
Can be considered as a table of data; each term is stored in one of the entries of the table
A hash function determines which data is stored in which table entry
Typically, the hash function maps a string (an index term or a key) to an integer (table entry)
More details in slide p.37
Example (pictures from Wikipedia)
SPIMI pseudo code
SPIMI-Invert(token_stream)
  output_file = NewFile()
  dictionary = NewHash()
  while (free memory available)
  do token ← next(token_stream)
     if term(token) ∉ dictionary
       then postings_list = AddToDictionary(dictionary, term(token))
       else postings_list = GetPostingsList(dictionary, term(token))
     if full(postings_list)
       then postings_list = DoublePostingsList(dictionary, term(token))
     AddToPostingsList(postings_list, docID(token))
  sorted_terms ← SortTerms(dictionary)
  WriteBlockToDisk(sorted_terms, dictionary, output_file)
  return output_file
The pseudo code only shows how an intermediate inverted file is constructed
The final merging of the inverted files is the same as in BSBI
Example
The tokens are read one by one and inserted into the hash table in main memory until the memory is full
The entries in the hash table are then sorted and written to disk as an inverted file

List of tokens: (did, 1) (enact, 1) (julius, 1) (caesar, 1) (so, 2) (did, 2) (it, 2) (the, 3) (you, 3) (hold, 3) ...

Hash table (main memory)     Inverted file (disk)
Terms    Postings            Terms    Postings
caesar   1                   caesar   1
enact    1                   did      1,2
so       2                   enact    1
julius   1                   julius   1
did      1,2                 so       2
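The SPIMI block construction can be sketched in Python as follows. Assumptions for the example: a Python `dict` plays the role of the hash-table dictionary, "memory is full" is simulated by a cap on the number of postings (`max_postings`, a made-up parameter), and Python lists grow on their own, so the list-doubling step of the pseudo code is implicit. The token list is the one from the example.

```python
def spimi_invert(token_stream, max_postings):
    """Build one intermediate inverted file from a stream of (term, docID) tokens."""
    dictionary = {}      # hash table: term -> postings list
    used = 0             # number of postings held in memory
    for term, doc_id in token_stream:
        # New term: create an empty postings list; otherwise fetch the existing one.
        postings_list = dictionary.setdefault(term, [])
        if not postings_list or postings_list[-1] != doc_id:
            postings_list.append(doc_id)
            used += 1
        if used >= max_postings:     # stand-in for "no free memory available"
            break
    # Sort the terms only once, when the block is written out.
    return sorted(dictionary.items())

tokens = [("did", 1), ("enact", 1), ("julius", 1), ("caesar", 1), ("so", 2),
          ("did", 2), ("it", 2), ("the", 3), ("you", 3), ("hold", 3)]

stream = iter(tokens)                 # shared iterator: blocks continue where the last stopped
block1 = spimi_invert(stream, max_postings=6)
block2 = spimi_invert(stream, max_postings=6)
print(block1)
print(block2)
```

With a cap of six postings, the first block holds exactly the five terms shown in the example; the remaining tokens go into a second block, and the blocks would then be merged as in BSBI.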
Distributed indexing
Perform indexing on a large computer cluster
  A computer cluster is a group of tightly coupled computers that work closely together
  The group may consist of hundreds or thousands of nodes (computers)
  Individual nodes can fail at any time
The result of the construction process is a distributed index that is partitioned across several machines
  Either according to term or according to document
  We focus on the term-partitioned index
Distributed indexing MapReduce: a general architecture for distributed computing A master node (computer) directs the process of dividing the work up into small tasks assigning the tasks to individual nodes re-assigning tasks in case of node failure
Distributed indexing
The master node breaks the input documents into splits
  Each split is a subset of documents (corresponding to the partitions of the list of tokens made in BSBI/SPIMI)
Two sets of tasks:
  Parsers
  Inverters
Parsers Master assigns a split to an idle parser node Parser reads one document at a time and produces (term, doc) pairs Parser writes pairs into j partitions for passing on to Inverters Each partition is for a range of terms’ first letters E.g., a-f, g-p, q-z here j=3
Inverters
To complete the index inversion:
  Parsers pass the term-partitions to the inverters, or can send the (term, doc) pairs one at a time
  An inverter collects all (term, doc) pairs (= postings) for its term-partition
  It sorts them and writes them to posting lists
Data flow (figure): the master assigns splits of documents to parsers; each parser writes its (term, doc) pairs into the partitions a-f, g-p and q-z; each inverter is assigned one partition (a-f, g-p or q-z), collects it from all parsers and writes the postings for that partition.
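The parser and inverter roles can be imitated on a single machine. The sketch below is not MapReduce itself and has no master node or failure handling; it only shows the data flow with j=3 term partitions. The function names and the use of Doc 1 and Doc 2 as two splits are assumptions made for the example.

```python
from collections import defaultdict

PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]   # ranges of first letters, j = 3

def partition_of(term):
    for j, (low, high) in enumerate(PARTITIONS):
        if low <= term[0] <= high:
            return j
    raise ValueError(f"no partition for term {term!r}")

def parser(split):
    """Map phase: turn a split of documents into (term, doc) pairs, one list per partition."""
    out = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.split():
            out[partition_of(term)].append((term, doc_id))
    return out

def inverter(pairs):
    """Reduce phase: collect all pairs of one term partition into sorted posting lists."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

def build_index(splits):
    parsed = [parser(split) for split in splits]     # on a cluster these run in parallel
    return [inverter([pair for out in parsed for pair in out[j]])
            for j in range(len(PARTITIONS))]

splits = [
    [(1, "i did enact julius caesar i was killed i' the capitol brutus killed me")],
    [(2, "so let it be with caesar the noble brutus hath told you caesar was ambitious")],
]
index = build_index(splits)
print(index[0]["caesar"], index[1]["julius"], index[2]["was"])
```

The result is a term-partitioned index: one dictionary of posting lists per range of first letters.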
Dynamic indexing Up to now, we have assumed that collections are static They rarely are New Documents come in over time and need to be inserted Documents are deleted and modified This means that the dictionary and the postings have to be modified: Posting updates for terms already in dictionary New terms are added to dictionary
Simplest approach. Block update Maintain “big” main index on disk New documents go into “small” auxiliary index in memory Merge the auxiliary index block and the main index when the auxiliary index is bigger than a threshold value Assume that the threshold value for refreshing the auxiliary index is a large constant n
Example
Each merge combines the auxiliary index with the main index to give the new main index

Merge       Auxiliary index   Main index        New main index after merging
1st merge   n postings        0 postings        n postings
2nd merge   n postings        n postings        2n postings
3rd merge   n postings        2n postings       3n postings
...         ...               ...               ...
k-th merge  n postings        (k-1)n postings   kn postings
Time complexity
Processing T = kn items uses k = T/n merges
Merging two sorted lists of sizes n and Jn takes O(n + Jn) = O(Jn) time
Building a main index with T postings needs the merges J = 1, ..., T/n, so it takes O(1n + 2n + 3n + ... + (T/n)n) = O(T^2/n) time
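For completeness, the sum can be evaluated with the arithmetic series formula (assuming T ≥ n, so that T/n + 1 ≤ 2T/n):

```latex
\sum_{J=1}^{T/n} O(Jn)
  \;=\; O\!\left(n \sum_{J=1}^{T/n} J\right)
  \;=\; O\!\left(n \cdot \frac{(T/n)\,(T/n + 1)}{2}\right)
  \;=\; O\!\left(\frac{T^{2}}{n}\right)
```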
Logarithmic merge
Basic idea: don’t merge the auxiliary and main index directly
Speeds up merging and index construction in dynamic indexing
Maintain a series of indexes, each twice as large as the previous one
  Keep the smallest index (Z0) in memory
  Keep the larger indices (I0, I1, ...) on disk, with sizes doubling: I0 has size n, I1 has size 2n, I2 has size 4n, and so on
The scheme for merging
  If Z0 gets too big (>= n), write it to disk as I0, or merge it with I0 (if I0 already exists) to form Z1
  Then either write Z1 to disk as I1 (if there is no I1) or merge it with I1 to form Z2, and so on
Pseudo code of logarithmic merging
LMergeAddToken(indexes, Z_0, token)
  Z_0 ← Merge(Z_0, {token})
  if |Z_0| = n
    then for i ← 0 to ∞
         do if I_i ∈ indexes
              then Z_{i+1} ← Merge(I_i, Z_i)     (Z_{i+1} is a temporary index on disk)
                   indexes ← indexes − {I_i}
              else I_i ← Z_i                     (Z_i becomes the permanent index I_i)
                   indexes ← indexes ∪ {I_i}
                   Break
         Z_0 ← ∅

LogarithmicMerge()
  Z_0 ← ∅     (Z_0 is the in-memory index)
  indexes ← ∅
  while true
  do LMergeAddToken(indexes, Z_0, GetNextToken())
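Example
Merge(A, B) denotes the merge operation

1st time when |Z0|=n
  Actions taken: I0 ← Z0; Z0 ← ∅
  Indexes: I0
2nd time when |Z0|=n
  Actions taken: Z1 ← Merge(I0, Z0); I1 ← Z1; remove I0; Z0 ← ∅
  Indexes: I1
3rd time when |Z0|=n
  Actions taken: I0 ← Z0; Z0 ← ∅
  Indexes: I0, I1
4th time when |Z0|=n
  Actions taken: Z1 ← Merge(I0, Z0); Z2 ← Merge(I1, Z1); I2 ← Z2; remove I0, I1; Z0 ← ∅
  Indexes: I2
5th time when |Z0|=n
  Indexes: I0, I2
6th time when |Z0|=n
  Indexes: I1, I2
7th time when |Z0|=n
  Indexes: I0, I1, I2
k-th time
  Indexes: given by the 1-bits of k written in binary (e.g., k=7 is 111, giving I0, I1, I2)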
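The merging scheme can be tried out with a small Python sketch. Assumptions for the example: indexes are plain sorted lists of integers standing in for postings, the "disk" is a dictionary from level i to I_i, and the class name `LogarithmicMerger` is made up.

```python
import heapq

class LogarithmicMerger:
    def __init__(self, n):
        self.n = n
        self.z0 = []         # Z0, the in-memory index
        self.indexes = {}    # level i -> I_i (stand-in for the on-disk indexes)

    def add_token(self, token):
        self.z0 = list(heapq.merge(self.z0, [token]))
        if len(self.z0) == self.n:
            z, i = self.z0, 0
            # While I_i exists, merge it into Z_i to form Z_{i+1} and remove I_i.
            while i in self.indexes:
                z = list(heapq.merge(self.indexes.pop(i), z))
                i += 1
            self.indexes[i] = z      # Z_i becomes the permanent index I_i
            self.z0 = []

merger = LogarithmicMerger(n=2)
history = []                          # which I_i exist after each flush of Z0
for token in range(14):               # 14 tokens with n=2 give 7 flushes
    merger.add_token(token)
    if not merger.z0:
        history.append(sorted(merger.indexes))
print(history)
```

The recorded levels follow the binary representation of the flush count, as in the example: I0, then I1, then I0 and I1, then I2, and so on.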
Time complexity
Size doubling: for T postings, the series of indexes consists of at most log T indexes I0, I1, I2, ..., I_(log T)
  Why? Need k = log(T/n) levels for (2^k)n = T items
To build a main index with T postings, the overall construction time is O(T log T)
  Each posting is processed (i.e., merged) only once on each of (at most) log T levels
  Why? If merging occurs, the item moves up a level
So logarithmic merge is more efficient for index construction than block update, as T log T grows more slowly than T^2/n (n is a constant)
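The two bounds can be checked numerically with a small cost model. This is a sketch under a simplifying assumption: merging two lists costs the sum of their sizes, and only merges are counted (not the writes of unmerged blocks). The function names are made up for the example.

```python
def block_update_cost(T, n):
    """Total merge cost when every auxiliary block of n postings is merged into the main index."""
    cost, main = 0, 0
    for _ in range(T // n):
        cost += main + n     # merge the auxiliary index (n) with the main index (main)
        main += n
    return cost

def logarithmic_merge_cost(T, n):
    """Total merge cost of logarithmic merging over T // n flushes of Z0."""
    cost, levels = 0, set()          # levels: the i for which I_i currently exists
    for _ in range(T // n):
        size, i = n, 0               # size of Z_i, starting with Z0
        while i in levels:
            levels.remove(i)
            cost += 2 * size         # I_i and Z_i both hold n * 2^i postings
            size *= 2
            i += 1
        levels.add(i)
    return cost

for T in (8, 1024):
    print(T, block_update_cost(T, 1), logarithmic_merge_cost(T, 1))
```

For T = 1024 and n = 1 the model gives 524800 for block update against 10240 for logarithmic merge, i.e. roughly T^2/(2n) against T log2(T/n).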
Searching the Index
Search structures for dictionaries Given a keyword of a query, determine if the keyword exists in the vocabulary (dictionary). If so, identify the pointer to the corresponding posting If no search structure exists, we have to check the terms of the dictionary one by one until a match is found or all terms are exhausted takes O(n) time, where n is the number of terms of the dictionary Search structures help speed up the vocabulary look-up operation
Search structures for dictionaries
Two main choices:
  Hash table (introduced on slide p.16)
  Search tree
Factors affecting the choice:
  How many terms are we likely to have?
  Is the number likely to remain static, or change a lot?
  Are we likely to only have new terms inserted, or also to have some terms in the dictionary deleted?
  What are the relative frequencies with which various terms will be accessed?
Generally speaking, a hash table is preferable for more static data, and a search tree handles dynamic data more efficiently.
Hash table
Hash table: an array with a hash function and collision management
Mainly operated by a hash function, which determines where to find (or insert) a term
A hash function maps a term to an integer between 0 and N-1, where N is the number of entries of the hash table
Hashing is reproducible randomness: it looks like a term is mapped to a random array index, but every time we map the term we get the same index.
An example of hash function
Suppose the dictionary consists of terms that are composed of lower-case letters or white-space only
A term consists of at most 20 characters
Let f() be a function that maps white-space to 0, ‘a’ to 1, ‘b’ to 2, ..., ‘z’ to 26
Let N be a large prime number
The hash function F(word) can be defined as
  F(word) = [ f(1st character) + f(2nd character)*26 + f(3rd character)*26^2 + f(4th character)*26^3 + ... ] mod N
Hash function
Suppose N=13
For term ‘caesar’
  F(‘caesar’) = (3 + 1*26 + 5*26^2 + 19*26^3 + 1*26^4 + 18*26^5) mod 13 = 214659097 mod 13 = 3
For term ‘enact’
  F(‘enact’) = (5 + 14*26 + 1*26^2 + 3*26^3 + 20*26^4) mod 13 = 9193293 mod 13 = 5
Exercise: the, let, it, best
Powers of 26: 1, 26, 676, 17576, 456976

Entries  Terms
1
2
3        caesar
4        did
5        enact
6        so
7        the
8
9        i
10       julius
11       killed
12
Collision
Collision: two different terms are mapped to the same entry
For example, for term ‘was’:
  F(‘was’) = (23 + 1*26 + 19*26^2) mod 13 = 12893 mod 13 = 10
  ‘was’ is mapped to the same entry as ‘julius’
A collision can be resolved by auxiliary structures, a secondary hash function, or rehashing

Entries  Terms
1
2
3        caesar
4        did
5        enact
6        so
7        the
8
9        i
10       julius
11       killed
12
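The example hash function, and one possible way of handling the collision, can be written out in Python. The functions `f` and `F` follow the definition on the slides; storing colliding terms in a per-entry list (chaining) is an assumption chosen here as one instance of an "auxiliary structure", not necessarily the method intended in the lecture.

```python
def f(ch):
    """Map white-space to 0 and 'a'..'z' to 1..26."""
    return 0 if ch == " " else ord(ch) - ord("a") + 1

def F(word, N=13):
    """[f(c1) + f(c2)*26 + f(c3)*26^2 + ...] mod N"""
    return sum(f(ch) * 26**i for i, ch in enumerate(word)) % N

N = 13
table = [[] for _ in range(N)]        # each entry holds a list (chain) of terms

def insert(term):
    chain = table[F(term, N)]
    if term not in chain:
        chain.append(term)

def lookup(term):
    return term in table[F(term, N)]

for term in ["caesar", "did", "enact", "so", "the", "i", "julius", "killed", "was"]:
    insert(term)

print(F("caesar"), F("enact"), F("was"))
print(table[10])
```

Since 26 is a multiple of 13, every term after the first vanishes modulo 13, so with N=13 the hash value depends only on the first character; this is why ‘was’ (23 mod 13 = 10) lands in the same entry as ‘julius’.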
Search tree Binary search tree B-tree The terms are in sorted order in the in-order traversal of the tree Only practical for in-memory operations Read for interest only
Binary search tree (BST)
Binary tree: a tree in which every node has at most two children
Binary search tree: every node is associated with a key (term) such that
  the term associated with the left child is lexicographically smaller than that of the parent node, and
  the term associated with the parent is lexicographically smaller than that of the right child
E.g., parent ‘did’ with left child ‘caesar’ and right child ‘enact’
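Example (figure): a binary search tree over the vocabulary, with each term pointing to its posting
Note: the posting of a term is the list of documents containing the term

Vocabulary  Postings
caesar      1, 2
did         1
enact       1
i           1
julius      1
killed      1
so          2
the         1, 2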
Searching in BST
Starting from the root node, the search proceeds to one of the two subtrees below by comparing the term being searched for with the term associated with the current node
The search stops when a match is found or a leaf node is reached
The search (or lookup) operation takes O(log T) time, where T is the number of terms, provided that the BST is balanced
Balance criterion, e.g., the numbers of terms under the two subtrees of any node are either equal or differ by 1
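A minimal Python sketch of the lookup, using the three-term tree from the BST slide (did at the root, caesar and enact as children). The `Node` class and `bst_search` are illustrative names, and the postings are those of Doc 1 and Doc 2.

```python
class Node:
    def __init__(self, term, postings, left=None, right=None):
        self.term = term
        self.postings = postings
        self.left = left
        self.right = right

def bst_search(node, term):
    """Return the postings of term, or None if the term is not in the tree."""
    while node is not None:
        if term == node.term:
            return node.postings
        # Smaller terms are in the left subtree, larger terms in the right subtree.
        node = node.left if term < node.term else node.right
    return None

root = Node("did", [1],
            left=Node("caesar", [1, 2]),
            right=Node("enact", [1]))

print(bst_search(root, "caesar"), bst_search(root, "brutus"))
```

Each comparison discards one subtree, so in a balanced tree the loop runs O(log T) times.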
B-tree
The number of subtrees under an internal node varies in a fixed interval [a,b], where a and b are positive integers
The number of terms associated with an internal node, except the root, is between a-1 and b-1
Can be viewed as “collapsing” multiple levels of the binary tree into one
Good for the case where the dictionary is disk resident, in which case this collapsing serves the function of prefetching imminent binary tests
The integers a and b are determined by the sizes of disk blocks
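Example (figure): a B-tree with a=2 and b=4 over the vocabulary, with each term pointing to its postings (document 1, document 2, or both)
Terms as listed in the figure, from the top down: capitol, hath; be, did, i’, julius, let; ambitious, brutus, caesar, enact, I, it, killed, me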
Reminder Doc 1 I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me. Doc 2 So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious: