Published by Priscilla Sims. Modified over 9 years ago.
1
Parallel and Distributed IR
2
Papers on Parallel and Distributed IR
Agenda:
Introduction
Paper A: Inverted File Partitioning Schemes in Multiple Disk Systems, by Byeong-Soo Jeong and Edward Omiecinski [404]
Paper B: Methodologies for Distributed Information Retrieval, by Alistair Moffat, Justin Zobel, Owen de Kretser, and Tim Shimmin [1998]
Comparison and Conclusion
URLs
3
Introduction
The volume of online electronic text is growing exponentially.
According to surveys, the publicly indexable web contained:
350 million pages around July 1998
800 million pages around July 1999
1 billion pages around January 2000
Managing this size and growth requires a scalable model and multitasking algorithms: parallel and distributed IR.
4
Parallel and Distributed IR: Comparison
The computation models of Distributed IR and Parallel IR are very similar: both divide the main task into sub-tasks and execute the sub-tasks in parallel.
The main difference is that in Distributed IR the sub-tasks run on different processing units, and interprocess communication goes over network protocols rather than shared memory.
Distributed IR uses a selection procedure to send each request to only a subset of processes, whereas Parallel IR broadcasts every request to every process.
Paper A discusses two schemes for implementing Parallel IR.
Paper B gives methodologies for Distributed IR.
5
Paper A: Inverted file partitioning schemes - Objective
Inverted File Partitioning Schemes in Multiple Disk Systems, by Byeong-Soo Jeong and Edward Omiecinski [1995]
The goal of the paper is to reduce average response time by partitioning the inverted file.
The paper identifies I/O time as a major cost factor in an IR system.
It exploits I/O parallelism and balances the I/O workload by partitioning and distributing the files across disks, which improves response time.
The paper discusses two partitioning schemes for inverted file systems.
6
Paper A: Inverted file structure
7
Paper A: Inverted file partitioning schemes
1) Based on term-id
2) Based on document-id
Scheme 1 (term-id): All postings for a term are stored on one disk.
Scheme 2 (document-id): All postings for a document are stored on one disk, so the postings for a single term are distributed across disks.
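The two schemes can be illustrated with a small sketch. This is not the paper's implementation: the toy postings, the disk count, and the modulo placement rule are all assumptions chosen only to make the difference visible.

```python
from collections import defaultdict

# Toy postings as (term_id, doc_id) pairs; all values are illustrative.
POSTINGS = [(0, 10), (0, 11), (0, 12), (1, 10), (1, 13), (2, 11), (2, 12), (2, 13)]
NUM_DISKS = 2  # assumed number of disks


def partition_by_term(postings, num_disks):
    """Scheme 1: every posting of a term lands on the same disk (assumed modulo rule)."""
    disks = defaultdict(list)
    for term_id, doc_id in postings:
        disks[term_id % num_disks].append((term_id, doc_id))
    return dict(disks)


def partition_by_doc(postings, num_disks):
    """Scheme 2: every posting of a document lands on the same disk (assumed modulo rule)."""
    disks = defaultdict(list)
    for term_id, doc_id in postings:
        disks[doc_id % num_disks].append((term_id, doc_id))
    return dict(disks)


def disks_holding_term(disks, term_id):
    """Return the sorted list of disks that store at least one posting of term_id."""
    return sorted(d for d, entries in disks.items() if any(t == term_id for t, _ in entries))


by_term = partition_by_term(POSTINGS, NUM_DISKS)
by_doc = partition_by_doc(POSTINGS, NUM_DISKS)
print(disks_holding_term(by_term, 0))  # term 0 sits on a single disk
print(disks_holding_term(by_doc, 0))   # term 0 is spread over several disks
```

Under the term-id scheme a term's postings are read from one disk; under the document-id scheme the same term touches every disk that holds one of its documents.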
8
Paper A: Partitioning schemes - Pictorial presentation
9
Paper A: Two schemes - comparison
Space usage
  Document-ID based: The index file must record which disks hold the posting entries for each term, so the index file uses more space.
  Term-ID based: All postings for one term are on the same disk, so the index file uses less space.
Number of I/Os
  Document-ID based: Posting entries for one term are spread across disks, so the number of posting-file I/Os equals the number of disks that hold posting entries for the term. More posting-file I/O.
  Term-ID based: A single posting-file I/O per term.
Load distribution
  Document-ID based: Although there are more posting-file I/Os, they can run in parallel, so the I/O load is balanced.
  Term-ID based: Maximum I/O parallelism is limited by the number of terms in the query. [On the web, the average number of words per query was 2.35, per a survey in Sept. 1998.]
I/O time
  Document-ID based: Each I/O is small, so the resulting I/O time can be lower.
  Term-ID based: Each I/O reads the complete posting entry for a term, so the resulting I/O time depends on the size of the longest posting entry.
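The I/O trade-off in this comparison can be made concrete with a rough counting sketch. The posting-list lengths, the term-to-disk mapping, and the assumption that document-ID partitioning spreads each list evenly over all disks are hypothetical; this is not the cost model used in the paper.

```python
from collections import Counter

NUM_DISKS = 4  # assumed number of disks
# Hypothetical posting-list lengths (number of postings) per term.
POSTING_LENGTHS = {"parallel": 8, "retrieval": 4}
# Hypothetical placement of each term's posting list under the term-ID scheme.
TERM_DISK = {"parallel": 0, "retrieval": 1}


def term_scheme_reads(query):
    """Term-ID scheme: one read per term, fetching the whole posting list."""
    return [(TERM_DISK[t], POSTING_LENGTHS[t]) for t in query]


def doc_scheme_reads(query, num_disks):
    """Document-ID scheme: one small read per (term, disk), assuming an even spread."""
    return [(disk, POSTING_LENGTHS[t] // num_disks) for t in query for disk in range(num_disks)]


def summarize(reads):
    """Count the reads, the disks they touch, and the largest single read."""
    per_disk = Counter()
    for disk, size in reads:
        per_disk[disk] += size
    return {
        "io_count": len(reads),
        "disks_used": len(per_disk),
        "largest_read": max(size for _, size in reads),
    }


query = ["parallel", "retrieval"]
term_summary = summarize(term_scheme_reads(query))
doc_summary = summarize(doc_scheme_reads(query, NUM_DISKS))
print(term_summary)
print(doc_summary)
```

For this two-term query the term-ID scheme issues fewer reads but can use at most two disks and must read the longest list in one piece, while the document-ID scheme issues more, smaller reads across all four disks.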
10
Paper A: Two schemes - performance comparison
Performance comparison under different parameters
Query model: Under a skewed query model, partition by document-ID performs better because its I/O load is more balanced. Under a uniform query model, partition by term-ID performs better.
Query length: Under a uniform query model, partition by term-ID is about twice as fast for long queries and 5-10 times as fast for short queries.
Number of disks: Adding disks improves the performance of partition by document-ID at a higher rate, since its I/O load is more evenly distributed.
Conclusion: Partition by term-ID performs better under uniform query models, but its response time fluctuates widely depending on the terms in the query. With partition by document-ID, response time varies little in almost all cases.
11
Paper B: Methodologies for Distributed IR - Objective
Methodologies for Distributed Information Retrieval, by Alistair Moffat, Justin Zobel, Owen de Kretser, and Tim Shimmin [1998]
This paper appears in the proceedings of the 18th International Conference on Distributed Computing Systems (1998).
It discusses three methodologies for Distributed IR and compares their effectiveness, efficiency, and response time.
12
Paper B: Methodologies for Distributed IR
"Parallel Text Search Methods", a 1988 paper by Salton and Buckley [701], makes interesting comments about early implementations of Parallel IR, challenging their effectiveness and efficiency.
In this paper, Moffat and Zobel conclude that Distributed IR can be fast and effective, but agree with Salton and Buckley that it is not efficient. The coming slides show why.
13
Paper B: Distributed IR Model
Librarian: An individual node that holds its own sub-collection, maintains the index for that sub-collection, evaluates queries, and fetches documents.
Receptionist: Provides the user interface, sends user queries to all or a subset of the librarians, merges the librarians' results, and generates the final ranked result list using global information.
After the receptionist's global ranking, many of the documents returned by the librarians may never be presented to the user. The resources spent computing similarity for those documents and transmitting them are wasted, which is why efficiency is low in the distributed model.
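The librarian and receptionist roles, and the wasted work described above, can be sketched as follows. The class names follow the slide, but the methods, the toy similarity scores, and the assumption that scores from different librarians are directly comparable are illustrative only.

```python
import heapq


class Librarian:
    """Holds one sub-collection; scores are precomputed toy similarities to one query."""

    def __init__(self, name, scores):
        self.name = name
        self.scores = scores  # doc_id -> similarity (hypothetical values)

    def top_k(self, k):
        """Return this librarian's k best (score, doc_id) pairs."""
        return heapq.nlargest(k, ((s, d) for d, s in self.scores.items()))


class Receptionist:
    """Sends the query to every librarian and merges the returned lists."""

    def __init__(self, librarians):
        self.librarians = librarians

    def search(self, k):
        """Return the global top k and the number of received results that were discarded."""
        received = [hit for lib in self.librarians for hit in lib.top_k(k)]
        final = heapq.nlargest(k, received)
        return final, len(received) - len(final)


librarians = [
    Librarian("A", {"a1": 0.9, "a2": 0.8, "a3": 0.1}),
    Librarian("B", {"b1": 0.7, "b2": 0.2, "b3": 0.05}),
]
final, wasted = Receptionist(librarians).search(k=2)
print([doc for _, doc in final], wasted)
```

Each librarian scores and transmits its own top k, yet only k results survive the merge, so in this toy run half of the transmitted results are never shown to the user.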
14
Paper B: Distributed IR methodologies
Three methodologies are defined, based on the global information stored at the receptionist.
Central Nothing (CN): The only global information maintained by the receptionist is a list of librarians.
Central Vocabulary (CV): The receptionist stores the vocabularies of the sub-collections.
Central Index (CI): The receptionist has full access to the indexes of the sub-collections.
15
Paper B: Central Nothing - Distributed IR
Global information: List of librarians.
Advantages: Little or no storage space is required for global information at the receptionist. Simple implementation.
Disadvantages: The receptionist has no basis for excluding any sub-collection, so every sub-collection processes the query in full. Final ranking quality is poor: a term might be common in one sub-collection and be assigned a minimal weight there, even though it is rare in the collection as a whole. When results from different sub-collections are merged, there is no basis for ranking them collection-wide.
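The ranking problem with purely local term weights can be seen in a small numeric sketch. The collection statistics are invented, and the weight uses a common inverse document frequency form, log(N / df), which is an assumption and not necessarily the formula used in the paper.

```python
import math

# Hypothetical statistics for one query term in two sub-collections.
SUBCOLLECTIONS = {
    "medical": {"num_docs": 1000, "docs_with_term": 900},
    "news": {"num_docs": 9000, "docs_with_term": 10},
}


def idf(num_docs, docs_with_term):
    """Inverse document frequency in the common log(N / df) form (an assumed weighting)."""
    return math.log(num_docs / docs_with_term)


# Weight each librarian would compute from its own sub-collection only.
local_idf = {
    name: idf(stats["num_docs"], stats["docs_with_term"])
    for name, stats in SUBCOLLECTIONS.items()
}

# Weight computed from collection-wide statistics.
total_docs = sum(s["num_docs"] for s in SUBCOLLECTIONS.values())
total_df = sum(s["docs_with_term"] for s in SUBCOLLECTIONS.values())
global_idf = idf(total_docs, total_df)

print(local_idf, global_idf)
```

The same term gets a near-zero weight in one sub-collection and a very high weight in the other, so scores returned by the two librarians are not on a common scale when the receptionist merges them.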
16
Paper B: Central Vocabulary - Distributed IR
Global information: Vocabularies of all sub-collections.
Advantages: The receptionist can make a better choice of which sub-collections receive a query, and can skip sub-collections entirely if they contain none or few of the query terms. Global ranking is better than in CN because it can use the central vocabulary.
Disadvantage: More storage is required to store the collection-wide vocabulary.
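A minimal sketch of how a receptionist might use a central vocabulary to choose librarians is shown below. The vocabulary layout, the librarian names, and the `min_terms` threshold are hypothetical; the paper's actual selection rule may differ.

```python
# Hypothetical central vocabulary: term -> {librarian: document frequency}.
CENTRAL_VOCABULARY = {
    "inverted": {"lib_a": 40, "lib_b": 2},
    "partition": {"lib_a": 15},
    "librarian": {"lib_c": 30},
}
ALL_LIBRARIANS = ["lib_a", "lib_b", "lib_c"]


def select_librarians(query_terms, min_terms=1):
    """Return librarians whose sub-collection contains at least min_terms query terms."""
    matches = {lib: 0 for lib in ALL_LIBRARIANS}
    for term in query_terms:
        for lib in CENTRAL_VOCABULARY.get(term, {}):
            matches[lib] += 1
    return [lib for lib in ALL_LIBRARIANS if matches[lib] >= min_terms]


def global_document_frequency(term):
    """Collection-wide document frequency, usable for a global term weight."""
    return sum(CENTRAL_VOCABULARY.get(term, {}).values())


query = ["inverted", "partition"]
print(select_librarians(query))               # skips librarians with no query terms
print(select_librarians(query, min_terms=2))  # stricter: needs both terms
print(global_document_frequency("inverted"))
```

The same table serves both purposes named on the slide: it lets the receptionist avoid librarians that cannot contribute, and it supplies collection-wide frequencies for global ranking.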
17
Paper B: Central Index - Distributed IR
Global information: The receptionist has full access to the indexes of the sub-collections.
Advantages: The receptionist can perform all index processing itself and request from the librarians only the documents needed for the final ranking. Better selection of librarians.
Disadvantages: More storage is required to store the collection-wide vocabulary and index. More preprocessing is required at the receptionist before it can request documents from the librarians.
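With a central index, the receptionist can rank first and contact librarians only to fetch documents. The sketch below assumes a hypothetical index layout of (librarian, doc_id, weight) entries and a simple sum-of-weights score; neither is taken from the paper.

```python
from collections import defaultdict

# Hypothetical central index: term -> list of (librarian, doc_id, weight).
CENTRAL_INDEX = {
    "disk": [("lib_a", 1, 0.5), ("lib_b", 7, 0.9)],
    "index": [("lib_a", 1, 0.6), ("lib_c", 3, 0.2)],
}


def rank(query_terms, k):
    """Score documents at the receptionist by summing term weights; return the top k keys."""
    scores = defaultdict(float)
    for term in query_terms:
        for lib, doc_id, weight in CENTRAL_INDEX.get(term, []):
            scores[(lib, doc_id)] += weight
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [key for key, _ in ordered[:k]]


def fetch_plan(ranked):
    """Group the ranked documents by the librarian that must supply them."""
    plan = defaultdict(list)
    for lib, doc_id in ranked:
        plan[lib].append(doc_id)
    return dict(plan)


ranked = rank(["disk", "index"], k=2)
plan = fetch_plan(ranked)
print(ranked)
print(plan)  # librarians absent from the plan are never contacted
```

Only librarians holding a top-ranked document receive a request, which avoids the wasted scoring and transmission seen in the other methodologies at the cost of storing and processing the full index centrally.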
18
Paper A & Paper B comparison - Conclusion
Paper A: Inverted file partitioning schemes. Paper B: Methodologies for distributed IR.
Query processing
  Paper A: Breaks the query into keywords.
  Paper B: Sends the complete query to the librarians.
Document partitioning
  Paper A: May or may not partition the corpus.
  Paper B: The corpus is partitioned.
Optimization
  Paper A: Attempts to optimize I/O.
  Paper B: Attempts to optimize network delays and processing time.
Efficiency
  Paper A: More efficient model.
  Paper B: Less efficient model.
19
Paper A and Paper B - URLs
Paper A: Inverted File Partitioning Schemes in Multiple Disk Systems, by Byeong-Soo Jeong and Edward Omiecinski. IEEE Transactions on Parallel and Distributed Systems, Vol. 6, Feb. 1995. http://csdl.computer.org/comp/trans/td/1995/02/l0142abs.htm
Paper B: Methodologies for Distributed Information Retrieval, by Owen de Kretser, Alistair Moffat, Tim Shimmin, and Justin Zobel. Proceedings of the 18th International Conference on Distributed Computing Systems. http://csdl.computer.org/comp/proceedings/icdcs/1998/8292/00/82920066abs.htm