Parallel and Distributed IR

2 Papers on Parallel and Distributed IR
Agenda
- Introduction
- Paper A: Inverted File Partitioning Schemes in Multiple Disk Systems, by Byeong-Soo Jeong and Edward Omiecinski [404]
- Paper B: Methodologies for Distributed Information Retrieval, by Alistair Moffat, Justin Zobel, Owen de Kretser, Tim Shimmin [1998]
- Comparison and Conclusion
- URLs

3 Introduction
The volume of online electronic text is growing exponentially. According to surveys, the publicly indexable web contained:
- 350 million pages (around July 1998)
- 800 million pages (around July 1999)
- 1 billion pages (around January 2000)
Managing this size and growth requires a scalable model, multitasking algorithms, and parallel and distributed IR.

4 Parallel and Distributed IR Comparison
The computation models of Distributed IR and Parallel IR are very similar: both divide the main task into sub-tasks and execute the sub-tasks in parallel.
The main difference is that in Distributed IR the sub-tasks run on different processing units, and interprocess communication goes over network protocols rather than shared memory.
Distributed IR employs a procedure to select the subset of processes that receive a request, whereas Parallel IR broadcasts every request to every process.
- Paper A discusses two schemes for implementing Parallel IR.
- Paper B gives methodologies for Distributed IR.

5 Paper A: Inverted file partitioning schemes - Objective
Inverted File Partitioning Schemes in Multiple Disk Systems, by Byeong-Soo Jeong and Edward Omiecinski [1995]
The goal of the paper is to reduce average response time by partitioning the inverted file. The paper identifies I/O time as a major cost factor in an IR system. It exploits I/O parallelism and balances the I/O workload by partitioning the inverted file and distributing it across disks, and it discusses two such partitioning schemes.

6 Paper A: Inverted file structure

7 Paper A: Inverted file partitioning schemes
1) Based on term-id: all postings for a term are on one disk.
2) Based on document-id: all postings for a document are on one disk, so the postings for one term are distributed across disks.
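The two schemes can be illustrated with a minimal sketch. The toy inverted file, the three-disk setup, and the modulo placement rule are assumptions made for illustration; they are not taken from Paper A.

```python
from collections import defaultdict

# Toy inverted file: term -> list of document ids (the postings).
# Terms, document ids, and the modulo placement rule are illustrative assumptions.
INVERTED_FILE = {
    "parallel": [1, 2, 3, 4, 5, 6],
    "disk": [2, 4],
    "index": [1, 3, 5],
}

def partition_by_term(inverted_file, num_disks):
    """Scheme 1: every posting of a term is stored on a single disk."""
    disks = [defaultdict(list) for _ in range(num_disks)]
    for term_id, term in enumerate(sorted(inverted_file)):
        disks[term_id % num_disks][term] = list(inverted_file[term])
    return disks

def partition_by_document(inverted_file, num_disks):
    """Scheme 2: every posting of a document is stored on a single disk."""
    disks = [defaultdict(list) for _ in range(num_disks)]
    for term, postings in inverted_file.items():
        for doc_id in postings:
            disks[doc_id % num_disks][term].append(doc_id)
    return disks

def disks_holding(disks, term):
    """Return the indexes of the disks that store at least one posting for term."""
    return [i for i, disk in enumerate(disks) if disk.get(term)]

by_term = partition_by_term(INVERTED_FILE, 3)
by_doc = partition_by_document(INVERTED_FILE, 3)
print(disks_holding(by_term, "parallel"))  # one disk only
print(disks_holding(by_doc, "parallel"))   # spread over several disks
```

With term-id partitioning a term is read from one disk; with document-id partitioning the same term's postings have to be collected from every disk that holds a matching document.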

8 Paper A: Partitioning schemes – Pictorial presentation

9 Paper A: Two schemes - comparison
Space usage
  Document-ID based: The index file needs to store disk information to indicate where the posting entries for a term are stored. More space usage for the index file.
  Term-ID based: All postings for one term are on the same disk; less space usage for the index file.
Number of I/O
  Document-ID based: Posting entries for one term are spread across disks; hence the number of posting-file I/Os equals the number of disks containing posting entries for the given term. More posting-file I/O.
  Term-ID based: For one term, a single posting-file I/O.
Load distribution
  Document-ID based: Though there is more posting-file I/O, it can run in parallel. Hence the I/O load distribution is balanced.
  Term-ID based: Maximum I/O parallelism is limited by the number of terms in the query [on the web, average number of words per query = 2.35 per a survey in Sept. 1998].
I/O time
  Document-ID based: Small I/O size. Hence the resulting I/O time could be less.
  Term-ID based: The I/O size equals the complete posting entry for a term. The resulting I/O time depends on the size of the longest posting entry.
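The "Number of I/O" and "I/O time" rows can be made concrete with a rough counting sketch. It assumes one read per (term, disk) pair, measures read size in posting entries, and places documents on disks by `doc_id % num_disks`; these modelling choices and the sample data are assumptions, not the cost model used in Paper A.

```python
def term_partition_io(postings, query):
    """Term-id scheme: one read per query term, each as long as the term's full posting entry."""
    reads = [len(postings[t]) for t in query if t in postings]
    return len(reads), max(reads, default=0)

def document_partition_io(postings, query, num_disks):
    """Document-id scheme: one read per disk that holds postings for a query term."""
    reads = []
    for term in query:
        per_disk = {}
        for doc_id in postings.get(term, []):
            disk = doc_id % num_disks  # assumed placement rule
            per_disk[disk] = per_disk.get(disk, 0) + 1
        reads.extend(per_disk.values())
    return len(reads), max(reads, default=0)

# Hypothetical postings: "parallel" occurs in 12 documents, "disk" in 2.
postings = {"parallel": list(range(1, 13)), "disk": [2, 4]}
query = ["parallel", "disk"]

print(term_partition_io(postings, query))         # (number of reads, largest read)
print(document_partition_io(postings, query, 4))  # more reads, each one smaller
```

In this toy case the term-id scheme issues 2 reads with a largest read of 12 entries, while the document-id scheme issues 6 reads with a largest read of 3 entries, which mirrors the trade-off in the table: more I/Os, but smaller ones that can proceed in parallel.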

10 Paper A: Two schemes performance comparison
Performance comparison under different parameters
- Query model: Under a skewed query model, partition by document-ID performs better because its I/O load is more balanced. Under a uniform query model, partition by term-ID performs better.
- Query length: Under a uniform query model, partition by term-ID is twice as fast for long queries and 5 to 10 times as fast for short queries.
- Number of disks: Adding disks improves the performance of partition by document-ID at a higher rate, since its I/O load is more evenly distributed.
Conclusion: Partition by term-ID performs better under uniform query models, but its response time fluctuates widely depending on the terms in the query. Partition by document-ID shows little variation in response time in almost all cases.

11 Paper B: Methodologies for Distributed IR - Objective
Methodologies for Distributed Information Retrieval, by Alistair Moffat, Justin Zobel, Owen de Kretser, Tim Shimmin [1998]
This paper appears in the proceedings of the 18th International Conference on Distributed Computing Systems. It discusses three methodologies for Distributed IR and compares their effectiveness, efficiency, and response time.

12 Paper B: Methodologies for Distributed IR
"Parallel Text Search Methods", a 1988 paper by Salton and Buckley [701], comments on early implementations of Parallel IR and challenges their effectiveness and efficiency. Moffat and Zobel conclude in this paper that Distributed IR can be fast and effective, but they agree with Salton and Buckley that it is not efficient. The following slides show why.

13 Paper B: Distributed IR Model
- Librarian: an individual node that holds its own sub-collection, maintains the index for it, evaluates queries, and fetches documents.
- Receptionist: provides the user interface, posts user queries to all or a subset of the librarians, merges their results, and generates the final ranked result list using global information.
After the receptionist's global ranking, many of the documents returned by the librarians may never be presented to the user. The resources spent computing similarity for those unwanted documents and transmitting them are wasted, which is why efficiency is low in the distributed model.
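The wasted work described above can be seen in a small sketch of the librarian/receptionist exchange. The function names, the precomputed scores, and the assumption that scores from different librarians are directly comparable are all hypothetical simplifications, not the protocol from Paper B.

```python
import heapq

def librarian_top_k(scores, k):
    """A librarian ranks its own sub-collection and returns its k best (score, doc) pairs."""
    return heapq.nlargest(k, ((score, doc) for doc, score in scores.items()))

def receptionist_merge(sub_collections, k):
    """Query every librarian, merge the answers, and keep the global top k.

    Also returns how many transmitted results were discarded by the merge.
    """
    candidates = []
    for scores in sub_collections:
        candidates.extend(librarian_top_k(scores, k))
    final = heapq.nlargest(k, candidates)
    wasted = len(candidates) - len(final)
    return final, wasted

# Hypothetical similarity scores for one query, one dict per librarian.
sub_collections = [
    {"a1": 0.9, "a2": 0.8, "a3": 0.1},
    {"b1": 0.7, "b2": 0.2},
    {"c1": 0.3, "c2": 0.25},
]

final, wasted = receptionist_merge(sub_collections, k=2)
print(final)   # the two documents shown to the user
print(wasted)  # results that were scored and transmitted but never shown
```

Here three librarians each send two results, yet only two of the six survive the global ranking, so four results were computed and transmitted for nothing.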

14 Paper B: Distributed IR methodologies
Three methodologies are defined according to the global information stored at the receptionist:
- Central Nothing (CN): the only global information the receptionist maintains is a list of librarians.
- Central Vocabulary (CV): the receptionist stores the vocabularies of the sub-collections.
- Central Index (CI): the receptionist has full access to the indexes of the sub-collections.

15 Paper B: Central Nothing - Distributed IR
Global information: list of librarians.
Advantages:
- Little or no storage space is required for global information at the receptionist.
- Simple implementation.
Disadvantages:
- The receptionist has no basis for excluding any sub-collection, so every sub-collection processes the query in full.
- Final ranking quality is poor. A term might be common in one sub-collection and be assigned a minimal weight there, yet be rare in the collection as a whole. When results from different sub-collections are merged, there is no basis for a collection-wide ranking.
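The ranking problem can be illustrated with a small weight calculation. The sketch assumes a plain inverse document frequency weight, ln(N / df), and invented collection sizes; Paper B's actual weighting formula is not given in these slides.

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency, ln(N / df). Assumed weighting, for illustration only."""
    return math.log(num_docs / doc_freq)

# Hypothetical sub-collections: name -> (number of documents, documents containing the term).
sub_collections = {"A": (100, 90), "B": (900, 10)}

# Weight each librarian would compute from its own statistics.
local_idf = {name: idf(n, df) for name, (n, df) in sub_collections.items()}

# Weight computed from collection-wide statistics, which a CN receptionist does not have.
total_docs = sum(n for n, _ in sub_collections.values())
total_df = sum(df for _, df in sub_collections.values())
global_idf = idf(total_docs, total_df)

for name, weight in local_idf.items():
    print(f"librarian {name}: local idf = {weight:.3f}")
print(f"collection-wide idf = {global_idf:.3f}")
```

Librarian A sees the term in 90 of its 100 documents and gives it a weight near zero, librarian B gives it a very high weight, and neither matches the collection-wide value, so scores from A and B are not on a common scale when the receptionist merges them.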

16 Paper B: Central Vocabulary - Distributed IR
Global information: vocabularies of all sub-collections.
Advantages:
- The receptionist can make a better choice of sub-collections to send the query to; sub-collections that contain none or few of the query terms can be skipped entirely.
- Global ranking is better than with CN, because the receptionist can use the central vocabulary.
Disadvantages:
- More storage is required for storing the collection-wide vocabulary.
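Librarian selection with a central vocabulary might look like the sketch below. The data layout (one set of terms per librarian), the `min_matches` threshold, and the librarian names are assumptions for illustration rather than the selection rule used in Paper B.

```python
def select_librarians(central_vocabulary, query_terms, min_matches=1):
    """Return the librarians whose vocabulary contains at least min_matches query terms."""
    selected = []
    for librarian, vocabulary in central_vocabulary.items():
        matches = sum(1 for term in query_terms if term in vocabulary)
        if matches >= min_matches:
            selected.append(librarian)
    return selected

# Hypothetical central vocabulary held by the receptionist.
central_vocabulary = {
    "lib1": {"parallel", "disk", "index"},
    "lib2": {"cache", "proxy"},
    "lib3": {"index", "query"},
}

query = ["parallel", "index"]
print(select_librarians(central_vocabulary, query))                 # skips lib2
print(select_librarians(central_vocabulary, query, min_matches=2))  # only lib1
```

Raising `min_matches` trades recall for less network traffic: fewer librarians are contacted, at the risk of skipping a sub-collection that holds a relevant document containing only some of the query terms.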

17 Paper B: Central Index - Distributed IR
Global information: the receptionist has full access to the indexes of the sub-collections.
Advantages:
- The receptionist can perform all index processing and request from the librarians only the documents required to make the final ranking.
- Better selection of librarians.
Disadvantages:
- More storage is required for storing the collection-wide vocabulary and index.
- More preprocessing is required at the receptionist before it can request documents from the librarians.

18 Paper A & Paper B comparison - Conclusion
Query processing
  Paper A (inverted file partitioning schemes): Breaks the query into keywords.
  Paper B (methodologies for distributed IR): Sends the complete query to the librarians.
Document partitioning
  Paper A: May or may not partition the corpus.
  Paper B: The corpus is partitioned.
Optimization
  Paper A: Attempts to optimize I/O.
  Paper B: Attempts to optimize network delays and processing time.
Efficiency
  Paper A: More efficient model.
  Paper B: Less efficient model.

19 Paper A and Paper B - URLs
Paper A:
- Inverted File Partitioning Schemes in Multiple Disk Systems, by Byeong-Soo Jeong and Edward Omiecinski (IEEE Transactions on Parallel and Distributed Systems, Vol. 6, Feb. 1995)
Paper B:
- Methodologies for Distributed Information Retrieval, by Owen de Kretser, Alistair Moffat, Tim Shimmin, and Justin Zobel (Proceedings of the 18th International Conference on Distributed Computing Systems)
- abs.htm