Papers on Parallel IR

Agenda
– Introduction
– Paper 1: Inverted file partitioning schemes in multiple disk systems
– Paper 2: Parallel search using partitioned inverted files
– Comparison
– Conclusion
– Links to the papers
Parallel IR Introduction

Parallelism in query processing involves:
1. Multitasking simultaneous queries
– A thread or process is created for each user query and can execute on a CPU
– The same thread or process completes an entire single query
– Gives the ability to handle multiple concurrent queries
2. Query partitioning
– A single query is broken into subtasks
– Each subtask can run in parallel
– Improves the response time of a single query
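A minimal sketch (not taken from either paper) contrasting the two forms of parallelism; the toy collection, the scoring function, and all names here are assumptions for illustration only.

```python
# Sketch: multitasking (one worker per whole query) vs. query partitioning
# (one query split into parallel subtasks).  Toy data and scoring are assumed.
from concurrent.futures import ThreadPoolExecutor

DOCS = {1: "parallel inverted file", 2: "query processing", 3: "inverted index search"}

def score(query_terms, docs):
    """Toy scoring: count how many query terms appear in each document."""
    return {d: sum(t in text for t in query_terms) for d, text in docs.items()}

def run_query(query_terms):
    """Multitasking: one thread/process completes the entire query by itself."""
    return score(query_terms, DOCS)

def run_query_partitioned(query_terms, n_subtasks=2):
    """Query partitioning: the collection is split and the subtasks run in parallel."""
    doc_ids = sorted(DOCS)
    parts = [dict((d, DOCS[d]) for d in doc_ids[i::n_subtasks]) for i in range(n_subtasks)]
    with ThreadPoolExecutor(max_workers=n_subtasks) as pool:
        partials = pool.map(lambda part: score(query_terms, part), parts)
    merged = {}
    for partial in partials:
        merged.update(partial)  # the document subsets are disjoint, so a plain merge suffices
    return merged

if __name__ == "__main__":
    q = ["inverted", "parallel"]
    print(run_query(q))              # whole query handled by one worker
    print(run_query_partitioned(q))  # same query split into parallel subtasks
```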
Partitioning a Query into Subtasks

IR involves dealing with large amounts of data, so the data set can be partitioned between subtasks:
– Document partitioning: divides the documents over subtasks, so that each subtask processes a subset of the documents
– Term partitioning: divides the index terms among subtasks, so that the processing of each document is spread across subtasks
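A sketch of the two ways an inverted index could be split over subtasks or nodes; the toy postings and the modulo/hash assignment are assumptions, not the papers' code.

```python
# Sketch: document partitioning vs. term partitioning of a small inverted index.
from collections import defaultdict

INVERTED = {            # term -> postings as (doc_id, term_frequency)
    "parallel": [(1, 2), (3, 1)],
    "inverted": [(1, 1), (2, 3), (3, 1)],
    "search":   [(2, 1), (3, 2)],
}

def partition_by_document(index, n_parts):
    """Document partitioning: each part holds the postings of a subset of documents."""
    parts = [defaultdict(list) for _ in range(n_parts)]
    for term, postings in index.items():
        for doc_id, tf in postings:
            parts[doc_id % n_parts][term].append((doc_id, tf))
    return parts

def partition_by_term(index, n_parts):
    """Term partitioning: each part holds the complete posting list of a subset of terms."""
    parts = [{} for _ in range(n_parts)]
    for term, postings in index.items():
        parts[hash(term) % n_parts][term] = list(postings)
    return parts

print(partition_by_document(INVERTED, 2))
print(partition_by_term(INVERTED, 2))
```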
Theme of the Papers Being Presented

Both papers explore the issues and performance implications of parallel IR systems that use inverted indexes when they employ:
– A) Document partitioning
– B) Index term partitioning

Paper 1: Inverted file partitioning schemes in multiple disk systems
Paper 2: Parallel search using partitioned inverted files
P1: Inverted File Systems

An inverted file system consists of:
– Index file: an ordered list of all keywords that have been used to index a collection of documents. Along with each term there are fields giving the location and number of its postings in the posting file.
– Posting file: a group of records, each holding the weight of the term and a pointer into the document file.
– Document file: contains the actual document records of the collection.
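A sketch of the three-file layout described above; the field names and the example values are assumptions, not the paper's exact record formats.

```python
# Sketch: index file -> posting file -> document file, as three record types.
from dataclasses import dataclass

@dataclass
class IndexEntry:          # one row of the index file
    term: str
    posting_offset: int    # where this term's postings start in the posting file
    posting_count: int     # how many postings the term has

@dataclass
class Posting:             # one record of the posting file
    weight: float          # weight of the term in the document
    doc_pointer: int       # pointer to the record in the document file

@dataclass
class DocumentRecord:      # one record of the document file
    doc_id: int
    text: str

# Example: the term "parallel" has 2 postings starting at offset 10 of the posting file.
index_file = [IndexEntry("parallel", posting_offset=10, posting_count=2)]
posting_file = {10: Posting(0.8, doc_pointer=0), 11: Posting(0.3, doc_pointer=2)}
document_file = [DocumentRecord(1, "..."), DocumentRecord(2, "..."), DocumentRecord(3, "...")]
```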
P1: Inverted File Systems (cont.)
P1: Load Balancing

In a multiple-CPU, multiple-disk system we need to:
– Balance the load on the processors, to maximize CPU utilization
– Balance the load on the I/O devices (the disk drives), to avoid I/O bottlenecks that would cause CPUs to go into wait states
P1: Partitioning an Inverted File

The paper explores two schemes:
– Partitioning based on term id
– Partitioning based on document id

With both schemes the partitioning of the index file and the document file is the same: the index file is partitioned by index term id and the document file by document id. The posting file, however, contains both the document id and the index term id, so one scheme partitions the posting file by term id while the other partitions it by document id.
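A sketch of how posting-file entries could be assigned to disks under the two schemes; the modulo placement rule is an assumption, not the paper's exact assignment.

```python
# Sketch: disk assignment of a posting entry (term_id, doc_id, weight)
# under TermId and DocId partitioning.

def disk_for_posting_term_id(term_id, n_disks):
    """TermId scheme: all postings of a term land on one disk, chosen by term id."""
    return term_id % n_disks

def disk_for_posting_doc_id(doc_id, n_disks):
    """DocId scheme: all postings of a document land on one disk, chosen by doc id."""
    return doc_id % n_disks

posting = (42, 7, 0.8)                                   # (term_id, doc_id, weight)
print(disk_for_posting_term_id(posting[0], n_disks=4))   # disk fixed by the term
print(disk_for_posting_doc_id(posting[1], n_disks=4))    # disk fixed by the document
```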
P1: Partitioning an Inverted File (cont.)
P1: Objective of Partitioning the Inverted Index

Objective: to maximize performance.
– Ideal: all I/O channels and disk drives are used equally when the subtasks of a query execute in parallel.
– However, data usage is dynamic from query to query and cannot be predicted, so the ideal cannot be achieved.
– The paper recognizes that I/O is a major cost factor in IR.
A Brief Comparison

Where postings live
– Document id: all posting entries of a document are on the same disk.
– Term id: all posting entries of an index term are on the same disk.

Index file space
– Document id: the index file must store, with each index term, the disks on which its posting entries reside, and hence requires more space.
– Term id: no disk information is needed, since all posting entries of an index term are on one disk; less space is used.

Disk space balance
– Document id: disk space usage across the multiple disks is balanced.
– Term id: since the posting-list size of an index term varies with its frequency of occurrence in the collection, disk space usage may be unbalanced.
A Brief Comparison (cont.)

The most important difference is the I/O characteristic: for a single query index term, the subtask under document id partitioning spreads its disk I/O across multiple disks, while under term id partitioning it is limited to one disk.

Which is better? It is a tradeoff.
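A small sketch illustrating that I/O difference for a single index term; the toy postings and the modulo placement are assumptions.

```python
# Sketch: which disks are touched when reading one term's postings.
postings_for_term = [(doc_id, 0.5) for doc_id in range(1, 9)]  # 8 postings of one term
N_DISKS = 4
TERM_ID = 42

disks_docid = {doc_id % N_DISKS for doc_id, _ in postings_for_term}   # DocId scheme
disks_termid = {TERM_ID % N_DISKS}                                    # TermId scheme

print(sorted(disks_docid))   # DocId: the term's postings are read from several disks in parallel
print(sorted(disks_termid))  # TermId: all of the term's postings come from a single disk
```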
P1: Simulation Model

To compare the two schemes, the paper defines a simulation model with the following factors:
a) Collection database model: follows the natural-language text distribution described by Zipf's law, where 20% of the index terms account for 80% of the posting entries. The model skews these ratios to observe the effect on query performance.
b) User query model: the paper uses two cases. A skewed query model, in which some low-rank (frequent) terms are requested often, and a uniform query model, in which all terms have the same probability of being requested.
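A sketch of those two simulation inputs; the skew exponent, vocabulary size, and query length below are assumptions, not the paper's actual parameters.

```python
# Sketch: Zipfian posting-list sizes plus a skewed and a uniform query generator.
import random

N_TERMS, SKEW = 1000, 1.0
ranks = list(range(1, N_TERMS + 1))
zipf_weights = [1.0 / r**SKEW for r in ranks]   # posting-list sizes follow Zipf's law

def skewed_query(length=3):
    """Skewed query model: low-rank (frequent) terms are requested more often."""
    return random.choices(ranks, weights=zipf_weights, k=length)

def uniform_query(length=3):
    """Uniform query model: every term is equally likely to be requested."""
    return random.choices(ranks, k=length)

print(skewed_query())
print(uniform_query())
```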
P1: Simulation Model (cont.)

c) Queuing model: concurrent I/O requests on the same device are queued by priority; CPU usage requests on the same CPU are also queued.
d) Workload model: vary the number of disks and CPUs.
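A sketch of the queuing idea only: requests contending for the same device wait in a priority queue and are served one at a time. The priority convention and request format are assumptions.

```python
# Sketch: a per-device priority queue for concurrent I/O (or CPU) requests.
import heapq

class DeviceQueue:
    """Requests to one disk (or CPU) are served one at a time, highest priority first."""
    def __init__(self):
        self._heap = []
        self._seq = 0                       # tie-breaker keeps FIFO order within a priority

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, self._seq, request))
        self._seq += 1

    def serve_next(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

disk0 = DeviceQueue()
disk0.submit(1, "read postings of term 42")
disk0.submit(0, "read postings of term 7")   # lower number = higher priority here
print(disk0.serve_next())                    # -> "read postings of term 7"
```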
Simulation Results

– Increasing the number of disks improves performance (decreases response time) up to a threshold.
– When the index term and query term distributions are not skewed, the term id partitioning scheme performs best.
– When the data is skewed, the document id partitioning scheme performs best: with skewed data (80/20) under term id partitioning, the disks holding that 20% of the terms become bottlenecks.
Paper 2 – Positioning w.r.t. Paper 1

The thrust of Paper 1's approach was to partition a user query by its index terms, with each index term becoming a subtask; the objective then became optimizing the individual subtask whose biggest bottleneck is I/O. But what if the user query has only one index term? The disks are optimized, yet the CPUs sit idle. Paper 2 recognizes that most user queries contain only a single term. Why?
P2: Search Topology Framework

Paper 2 proposes a different framework:
[Figure slide: the search topology, with a top node and its leaf nodes.]
P2: Search Topology (cont.)

– Top node: accepts the query from the client, distributes it to all of its child nodes, and awaits the results.
– Leaf node: looks after only ONE partition of the inverted file.
– The top node and each leaf node have a processor of their own.

Within this framework, the paper's objective is to evaluate which type of inverted index partitioning is better: DocId-based or TermId-based.
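A sketch of that top-node/leaf-node topology; the scoring and merge logic here are toy assumptions, not PLIERS code.

```python
# Sketch: a top node broadcasts the query to leaf nodes, each owning one
# partition of the inverted file, then merges their partial results.
from concurrent.futures import ThreadPoolExecutor

class LeafNode:
    """Owns exactly one partition of the inverted file."""
    def __init__(self, partition):
        self.partition = partition          # term -> [(doc_id, weight), ...]

    def search(self, query_terms):
        hits = {}
        for term in query_terms:
            for doc_id, weight in self.partition.get(term, []):
                hits[doc_id] = hits.get(doc_id, 0.0) + weight
        return hits

class TopNode:
    """Accepts the query, distributes it to all leaves, and awaits/merges results."""
    def __init__(self, leaves):
        self.leaves = leaves

    def search(self, query_terms, k=10):
        with ThreadPoolExecutor(max_workers=len(self.leaves)) as pool:
            partials = pool.map(lambda leaf: leaf.search(query_terms), self.leaves)
        merged = {}
        for partial in partials:
            for doc_id, score in partial.items():
                merged[doc_id] = merged.get(doc_id, 0.0) + score
        return sorted(merged.items(), key=lambda kv: -kv[1])[:k]

leaves = [LeafNode({"parallel": [(1, 0.8)]}), LeafNode({"inverted": [(1, 0.5), (2, 0.9)]})]
print(TopNode(leaves).search(["parallel", "inverted"]))
```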
P2: Approach

– The paper uses real web collections rather than simulations for its experiments.
– The PLIERS system runs on an AP3000 machine with 8 to 12 nodes.
– The data comprises the BASE1 (1 GB) to BASE10 (10 GB) subsets of the VLC2 collection.
– Queries are based on topics 351 to 400 of the TREC-7 ad hoc track; both title-only and whole-topic queries are used.
– Both DocId and TermId index partitioning are used.

Bottom line: real data instead of simulation.
P2: Summary of Results

Within the framework of the experiment:
– DocId partitioning is better than TermId partitioning in a multiprocessor environment.
– The TermId approach imposes too much communication overhead between the leaf nodes and the top node, because the final result for a given document depends on the results from every leaf node.
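A sketch of why the TermId scheme forces more merging work at the top node: under DocId partitioning each leaf holds every term of its own documents and can return finished document scores, while under TermId partitioning each leaf knows only part of a document's score. The numbers below are toy assumptions.

```python
# Sketch: merging leaf results under DocId vs. TermId partitioning.
docid_leaf_results = [            # each leaf returns final scores for its own documents
    {1: 1.3, 4: 0.7},
    {2: 0.9, 3: 1.1},
]
termid_leaf_results = [           # each leaf returns partial scores for potentially all documents
    {1: 0.8, 2: 0.4, 3: 0.6},     # contribution of the terms stored on leaf 0
    {1: 0.5, 2: 0.5, 3: 0.5, 4: 0.7},
]

final_docid = {d: s for leaf in docid_leaf_results for d, s in leaf.items()}
final_termid = {}
for leaf in termid_leaf_results:
    for d, s in leaf.items():
        final_termid[d] = final_termid.get(d, 0.0) + s   # per-document accumulation

print(final_docid)    # disjoint results: a simple union, little data shipped per leaf
print(final_termid)   # overlapping results: every leaf ships scores for many documents
```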
Comparison

How queries are split
– Paper 1: breaks a query into subtasks based on its query keywords.
– Paper 2: breaks a query into subtasks based on the number of partitions of the inverted index.

Optimization focus
– Paper 1: optimizing disk I/O access.
– Paper 2: optimizing processor use.

Framework
– Paper 1: assumes a more generic topological framework.
– Paper 2: a very specific framework, in which the total number of CPUs needed depends on the data-driven partitions.

Conclusions
– Paper 1: weighs the pros and cons of the DocId and TermId partitioning schemes based on properties of the document collection.
– Paper 2: because of its specific framework assumptions, concludes that DocId partitioning of the inverted index is best within that framework.
Conclusion

Taken together, these two papers highlight the issues of processor and I/O utilization that arise when partitioning inverted indexes under the DocumentId and TermId schemes.
Links to the Papers

Paper 1: Byeong-Soo Jeong and E. Omiecinski, "Inverted file partitioning schemes in multiple disk systems," IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 2, pp. 142–153, Feb. 1995 (IEEE Xplore article number 342125).

Paper 2: A. MacFarlane, J. A. McCann, and S. E. Robertson, "Parallel search using partitioned inverted files," Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE), pp. 209–220 (IEEE Xplore article number 878197).