Modern Information Retrieval
Chapter 9: Parallel and Distributed IR
Section 9.1: Introduction
Section : MIMD Architectures, Inverted Files
November 5, 1999
Summary
- Introduction
- Review of parallel computing and parallel program performance measures
- Exploration of techniques for implementing inverted files on a MIMD parallel architecture
- Conclusion
Introduction
- The volume of electronic text available online today is staggering.
- The WWW contains over 800 million pages of text, comprising nearly 6 terabytes of data (Nature, Vol. 400, 8 July 1999).
- As document collections grow larger, they become more expensive to manage with an information retrieval system.
- To support the demanding requirements of modern search environments, we must turn to alternative architectures and algorithms.
Parallel Computing
- Parallel computing is the simultaneous application of multiple processors to solve a single problem.
- Flynn’s Taxonomy:
  - SISD: single instruction, single data
  - SIMD: single instruction, multiple data
  - MISD: multiple instruction, single data
  - MIMD: multiple instruction, multiple data
Parallel Program Performance Measures
- Speedup:
  $$ S = \frac{\text{running time of best available sequential algorithm}}{\text{running time of parallel algorithm}} $$
- Amdahl’s Law:
  $$ S \le \frac{1}{f + \frac{1 - f}{N}} $$
  where f is the fraction of the problem that must be computed sequentially and N is the number of processors.
Parallel Program Performance Measures
- Efficiency:
  $$ \text{Efficiency} = \frac{S}{N} $$
  where S is the speedup and N is the number of processors.
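As a rough illustration of these measures, the following Python sketch computes speedup, the Amdahl's Law upper bound, and efficiency. The function names and the sample values (f = 0.1, N = 8) are assumptions chosen for illustration, not figures from the slides.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup: best sequential running time divided by parallel running time."""
    return t_sequential / t_parallel


def amdahl_bound(f: float, n: int) -> float:
    """Upper bound on speedup when a fraction f of the work is sequential."""
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(s: float, n: int) -> float:
    """Efficiency: speedup divided by the number of processors."""
    return s / n


if __name__ == "__main__":
    f, n = 0.1, 8  # hypothetical: 10% sequential work, 8 processors
    bound = amdahl_bound(f, n)
    print(f"speedup bound: {bound:.2f}, efficiency at the bound: {efficiency(bound, n):.2f}")
```

Even with only 10% of the work sequential, eight processors cannot yield a speedup of 5, which is why the sequential fraction matters so much when sizing a parallel retrieval system.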
MIMD Architectures
- MIMD architectures offer a great deal of flexibility in how parallelism is defined and exploited to solve a problem.
- There are two ways in which a retrieval system can exploit a MIMD machine:
  - Parallel multitasking;
  - Partitioned parallel processing.
MIMD Architectures
[Figure: Parallel multitasking on a MIMD machine. Elements shown: users exchanging queries and results with a broker, and five search engines.]
MIMD Architectures
[Figure: Partitioned parallel processing on a MIMD machine. Elements shown: a user exchanging a query and result with a broker, which exchanges subqueries and results with five search processes.]
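MIMD Architectures
Basic data elements processed by a search algorithm (columns: indexing items; rows: documents):
$$
\begin{array}{c|cccccc}
       & k_1     & k_2     & \cdots & k_i     & \cdots & k_t     \\ \hline
d_1    & w_{1,1} & w_{2,1} & \cdots & w_{i,1} & \cdots & w_{t,1} \\
d_2    & w_{1,2} & w_{2,2} & \cdots & w_{i,2} & \cdots & w_{t,2} \\
\vdots & \vdots  & \vdots  &        & \vdots  &        & \vdots  \\
d_j    & w_{1,j} & w_{2,j} & \cdots & w_{i,j} & \cdots & w_{t,j} \\
\vdots & \vdots  & \vdots  &        & \vdots  &        & \vdots  \\
d_N    & w_{1,N} & w_{2,N} & \cdots & w_{i,N} & \cdots & w_{t,N}
\end{array}
$$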
MIMD Architectures
- There are two possible methods for partitioning the data:
  - Document partitioning: the N documents are distributed across the P processors; each parallel process evaluates the query on the subcollection of N/P documents assigned to it;
  - Term partitioning: the t indexing items are distributed across the P processors; the evaluation process for each document is spread over multiple processors.
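The two partitioning methods can be sketched as splitting either the rows (documents) or the columns (indexing items) of the data matrix across P processors. The round-robin assignment and the helper name `partition_round_robin` below are assumptions made for illustration; the slides do not prescribe a particular assignment rule.

```python
def partition_round_robin(items, p):
    """Split items into p groups; the item at position k goes to processor k % p."""
    return [items[k::p] for k in range(p)]


documents = ["d1", "d2", "d3", "d4", "d5", "d6"]
terms = ["cold", "hot", "in", "not", "pease", "porridge", "pot", "the"]

# Document partitioning: each processor gets about N/P documents.
doc_partitions = partition_round_robin(documents, 3)

# Term partitioning: each processor gets about t/P indexing items.
term_partitions = partition_round_robin(terms, 3)

if __name__ == "__main__":
    print(doc_partitions)
    print(term_partitions)
```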
Inverted Files: Logical Document Partitioning
- Data Partitioning
  - The data partitioning is done logically, using essentially the same underlying inverted file index as in the original sequential algorithm;
  - The inverted file is extended to give each parallel process direct access to the portion of the index related to that processor’s subcollection of documents.
Inverted Files: Logical Document Partitioning
[Figure: Extended dictionary entry for document partitioning. Elements shown: the dictionary entry for term i, with pointers P1, P2, P3, and P4 into the inverted list for term i.]
Inverted Files: Logical Document Partitioning
- Query Evaluation
  - The broker initiates P parallel processes to evaluate the query;
  - Each process executes the same document scoring algorithm on its document subcollection;
  - The search processes record document scores in a single shared array of document score accumulators;
  - The broker produces the final ranked list of documents.
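A minimal sketch of this query evaluation scheme follows. It assumes a toy inverted file with postings sorted by document number, contiguous document ranges per process, unit weights, and Python threads standing in for the parallel processes. All names (`INVERTED_FILE`, `build_extended_dictionary`, `search_process`, `broker`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy index: term -> postings (doc_id, weight), sorted by doc_id.
INVERTED_FILE = {
    "hot": [(1, 1.0), (4, 1.0), (5, 1.0), (6, 1.0)],
    "cold": [(2, 1.0), (4, 1.0), (5, 1.0)],
}
NUM_DOCS = 6
NUM_PROCS = 3


def build_extended_dictionary(inverted_file, num_docs, num_procs):
    """For each term, record the (start, end) slice of its list owned by each process."""
    per_proc = -(-num_docs // num_procs)  # ceiling division
    extended = {}
    for term, postings in inverted_file.items():
        bounds = []
        for p in range(num_procs):
            lo = p * per_proc + 1          # documents are numbered from 1
            hi = lo + per_proc
            idx = [i for i, (doc, _) in enumerate(postings) if lo <= doc < hi]
            bounds.append((idx[0], idx[-1] + 1) if idx else (0, 0))
        extended[term] = bounds
    return extended


def search_process(p, query, extended, accumulators):
    """Score only the documents of process p, writing into the shared accumulators."""
    for term in query:
        if term in INVERTED_FILE:
            start, end = extended[term][p]
            for doc, weight in INVERTED_FILE[term][start:end]:
                accumulators[doc] += weight  # each process touches only its own documents


def broker(query):
    extended = build_extended_dictionary(INVERTED_FILE, NUM_DOCS, NUM_PROCS)
    accumulators = [0.0] * (NUM_DOCS + 1)  # index 0 is unused
    with ThreadPoolExecutor(max_workers=NUM_PROCS) as pool:
        list(pool.map(lambda p: search_process(p, query, extended, accumulators),
                      range(NUM_PROCS)))
    hits = [(doc, score) for doc, score in enumerate(accumulators) if score > 0]
    return sorted(hits, key=lambda h: (-h[1], h[0]))


if __name__ == "__main__":
    print(broker(["hot", "cold"]))
```

Because the document ranges are disjoint, no two processes write to the same accumulator, so the shared array needs no locking in this sketch.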
Inverted Files: Logical Document Partitioning
- Inverted File Construction
  - The indexer partitions the documents among the processors;
  - Each indexing process generates a batch of inverted lists, sorted by indexing item;
  - A merge step is performed to create the final inverted file.
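The construction steps above can be sketched as follows, using the six-document "Pease porridge" example collection from these slides. The tokenizer, the contiguous partitioning, and the function names are assumptions; the indexing processes run one after another here for simplicity, where a real system would run them in parallel.

```python
import heapq
import re
from collections import defaultdict

COLLECTION = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def index_batch(doc_ids):
    """One indexing process: build inverted lists for its documents, sorted by term."""
    lists = defaultdict(list)
    for doc_id in doc_ids:
        for term in set(re.findall(r"[a-z]+", COLLECTION[doc_id].lower())):
            lists[term].append(doc_id)
    return sorted(lists.items())


def merge_batches(batches):
    """Merge step: combine the term-sorted batches into the final inverted file."""
    inverted_file = {}
    for term, postings in heapq.merge(*batches, key=lambda entry: entry[0]):
        inverted_file.setdefault(term, []).extend(postings)
    for postings in inverted_file.values():
        postings.sort()
    return inverted_file


partitions = [[1, 2], [3, 4], [5, 6]]  # hypothetical assignment to three processors
batches = [index_batch(part) for part in partitions]
inverted_file = merge_batches(batches)

if __name__ == "__main__":
    for term, postings in inverted_file.items():
        print(term, postings)
```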
Inverted Files: Physical Document Partitioning
- Data Partitioning
  - The documents are physically partitioned into separate subcollections, one for each parallel processor;
  - Each subcollection has its own inverted file.
Inverted Files: Physical Document Partitioning
- Query Evaluation
  - The broker distributes the query to all of the parallel search processes;
  - Each parallel search process evaluates the query on its portion of the document collection, producing an intermediate hit-list;
  - The broker collects the intermediate hit-lists from all of the parallel search processes and merges them into a final hit-list.
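The broker's merge of intermediate hit-lists can be sketched as below. The per-processor indexes, the unit weights, and the names `PARTITIONS`, `search_partition`, and `broker` are hypothetical. Each hit-list is sorted by descending score so that the broker can perform a multiway merge.

```python
import heapq
from itertools import islice

# Hypothetical per-processor inverted files: term -> {doc_id: weight}.
PARTITIONS = [
    {"hot": {1: 1.0}, "cold": {2: 1.0}},
    {"hot": {4: 1.0}, "cold": {4: 1.0}},
    {"hot": {5: 1.0, 6: 1.0}, "cold": {5: 1.0}},
]


def rank_key(hit):
    """Order hits by descending score, breaking ties by document number."""
    doc, score = hit
    return (-score, doc)


def search_partition(index, query):
    """One search process: score its own subcollection and return a sorted hit-list."""
    scores = {}
    for term in query:
        for doc, weight in index.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + weight
    return sorted(scores.items(), key=rank_key)


def broker(query, k):
    """Send the query to every partition and merge the hit-lists into the top k."""
    hit_lists = [search_partition(index, query) for index in PARTITIONS]
    return list(islice(heapq.merge(*hit_lists, key=rank_key), k))


if __name__ == "__main__":
    print(broker(["hot", "cold"], 3))
```

Since every document lives in exactly one partition, the broker never has to add partial scores; it only interleaves already complete ones.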
Inverted Files: Physical Document Partitioning
- Inverted File Construction
  - Each processor creates, in parallel, its own complete index corresponding to its document partition;
  - A merge step is performed to accumulate the global statistics for all of the partitions and distribute them to each of the partition dictionaries.
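The merge step for global statistics might look like the sketch below. It assumes the statistic in question is the document frequency of each term, which is one common choice; the slides do not say which statistics are accumulated. The dictionary layout and the name `merge_global_statistics` are likewise hypothetical.

```python
from collections import Counter

# Hypothetical partition dictionaries: term -> local document frequency.
partition_dictionaries = [
    {"hot": {"local_df": 1}, "cold": {"local_df": 1}},
    {"hot": {"local_df": 1}, "cold": {"local_df": 1}, "pot": {"local_df": 1}},
    {"hot": {"local_df": 2}, "cold": {"local_df": 1}, "pot": {"local_df": 1}},
]


def merge_global_statistics(dictionaries):
    """Accumulate document frequencies over all partitions, then push them back."""
    global_df = Counter()
    for dictionary in dictionaries:
        for term, entry in dictionary.items():
            global_df[term] += entry["local_df"]
    for dictionary in dictionaries:
        for term, entry in dictionary.items():
            entry["global_df"] = global_df[term]
    return global_df


global_df = merge_global_statistics(partition_dictionaries)

if __name__ == "__main__":
    print(dict(global_df))
```

With global statistics in every partition dictionary, each search process can compute scores that are comparable across partitions.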
Inverted Files: Term Partitioning
- Data Partitioning
  - Inverted lists are spread across the processors.
Inverted Files: Term Partitioning
- Query Evaluation
  - The query is decomposed into indexing items, and each indexing item is sent to the processor that holds the corresponding inverted list;
  - The processors create hit-lists with partial document scores and return them to the broker;
  - The broker combines the hit-lists.
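A minimal sketch of term-partitioned query evaluation is shown below. The assignment of terms to processors, the unit weights, and the names `TERM_PARTITIONS`, `OWNER`, `search_process`, and `broker` are assumptions for illustration. The broker here combines partial scores by summation, which is one plausible reading of "combines the hit-lists".

```python
from collections import defaultdict

# Hypothetical assignment of inverted lists to three processors.
TERM_PARTITIONS = [
    {"cold": {2: 1.0, 4: 1.0, 5: 1.0}, "not": {4: 1.0, 5: 1.0}},
    {"hot": {1: 1.0, 4: 1.0, 5: 1.0, 6: 1.0}, "pot": {3: 1.0, 6: 1.0}},
    {"in": {3: 1.0, 6: 1.0}, "the": {3: 1.0, 6: 1.0}},
]

# Lookup table used by the broker: term -> processor holding its inverted list.
OWNER = {term: p for p, lists in enumerate(TERM_PARTITIONS) for term in lists}


def search_process(p, terms):
    """Processor p: build a hit-list of partial scores for the terms it holds."""
    partial = defaultdict(float)
    for term in terms:
        for doc, weight in TERM_PARTITIONS[p][term].items():
            partial[doc] += weight
    return dict(partial)


def broker(query):
    # Decompose the query and route each term to the processor that owns it.
    routed = defaultdict(list)
    for term in query:
        if term in OWNER:
            routed[OWNER[term]].append(term)
    # Combine the partial hit-lists by summing scores per document.
    combined = defaultdict(float)
    for p, terms in routed.items():
        for doc, score in search_process(p, terms).items():
            combined[doc] += score
    return sorted(combined.items(), key=lambda h: (-h[1], h[0]))


if __name__ == "__main__":
    print(broker(["hot", "cold"]))
```

Unlike document partitioning, a single document's score may be assembled from several processors, so the broker does real combining work rather than a simple merge.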
Inverted Files: Term Partitioning
- Inverted File Construction
  - The inverted file is created using the parallel construction technique described for logical document partitioning.
Example: Document Collection

Document  Text
1         Pease porridge hot
2         Pease porridge cold
3         Pease porridge in the pot
4         Pease porridge hot, pease porridge not cold
5         Pease porridge cold, pease porridge not hot
6         Pease porridge hot in the pot
Example: Inverted File
[Figure: Inverted file for the example collection. The dictionary (cold, hot, in, not, pease, porridge, pot, the) points to the inverted lists.]
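The sequential inverted file for the example collection can be regenerated with a short Python sketch. Storing each posting as a (document number, within-document frequency) pair is an assumption, as are the tokenizer and the function name `build_inverted_file`.

```python
import re
from collections import Counter, defaultdict

COLLECTION = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def build_inverted_file(collection):
    """Return term -> list of (doc_id, term frequency), with terms in sorted order."""
    lists = defaultdict(list)
    for doc_id in sorted(collection):
        counts = Counter(re.findall(r"[a-z]+", collection[doc_id].lower()))
        for term, tf in counts.items():
            lists[term].append((doc_id, tf))
    return dict(sorted(lists.items()))


inverted = build_inverted_file(COLLECTION)

if __name__ == "__main__":
    for term, postings in inverted.items():
        print(f"{term:9s} {postings}")
```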
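Example: Logical Document Partitioning
[Figure: Logical document partitioning of the example. Elements shown: the dictionary (cold, hot, in, not, pease, porridge, pot, the) and the inverted list for the term “pease”, with pointers P1, P2, and P3 into that list.]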
Example: Physical Document Partitioning
[Figure: Physical document partitioning of the example. Processors P1, P2, and P3 each hold their own dictionary and inverted lists for their subcollection.]
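Example: Term Partitioning
[Figure: Term partitioning of the example. The inverted lists for the dictionary terms (cold, hot, in, not, pease, porridge, pot, the) are spread across processors P1, P2, and P3.]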
Conclusion
- The task of indexing and searching in very large text collections is costly;
- Faster indexing and searching algorithms are always desirable, and the use of parallel hardware is an obvious alternative;
- We discussed two possible organizations for the document collection index on a MIMD parallel architecture:
  - Document partitioning;
  - Term partitioning.
Conclusion
- Document partitioning affords simpler inverted index construction and maintenance than term partitioning;
- When term distributions in the documents and queries are more skewed, document partitioning performs better;
- When terms are uniformly distributed in user queries, term partitioning performs better.
Additional References
- Lawrence, S., Giles, C.L. Accessibility of Information on the Web. Nature. Vol. 400. pp
- Ribeiro-Neto, B.A., Barbosa, R.A. Query Performance for Tightly Coupled Distributed Digital Libraries. Digital Libraries 98. pp
- Ribeiro-Neto, B.A., Moura, E.S., Neubert, M.S., Ziviani, N. Efficient Distributed Algorithms to Build Inverted Files. SIGIR’99. pp