Papers on Parallel IR (03/20/2003)


Agenda
– Introduction
– Paper 1: Inverted file partitioning schemes in multiple disk systems
– Paper 2: Parallel search using partitioned inverted files
– Comparison
– Conclusion
– URL links to the papers

Parallel IR Introduction
Parallelism in query processing involves:
1. Multitasking simultaneous queries
   – A thread or process for each user query, which can execute on a CPU
   – The same thread or process completes an entire single query
   – Ability to handle multiple concurrent queries
2. Query partitioning
   – A single query is broken into sub tasks
   – Each sub task can run in parallel
   – Improves the response time of a single query
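
The two forms of parallelism above can be contrasted in a minimal Python sketch. The helpers `run_query` and `run_subtask` are hypothetical stand-ins for real query evaluation, and splitting a query into one sub task per term is only one possible choice.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(query):
    """Hypothetical stand-in for evaluating one whole query."""
    return sorted(set(query.split()))


def run_subtask(term):
    """Hypothetical stand-in for evaluating one sub task of a query."""
    return term.upper()


def multitask(queries, workers=4):
    # Multitasking: one task per user query; a worker completes the entire query.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_query, queries))


def partition_query(query, workers=4):
    # Query partitioning: one query is split into sub tasks (here one per term)
    # that run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_subtask, query.split()))
```

The first function raises throughput across many queries, while the second targets the response time of a single query.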

Partitioning a Query into Sub Tasks
IR involves dealing with large amounts of data, so the data set can be partitioned between sub tasks:
– Document partitioning: divides the documents over the sub tasks, so that each sub task processes a subset of the documents
– Term partitioning: divides the indexing terms among the sub tasks, so that the processing of each document is spread across sub tasks

Theme of the Papers Being Presented
Both papers explore the issues and performance implications in parallel IR systems using inverted indexes when they employ:
– A) Document partitioning
– B) Index term partitioning
Paper 1: Inverted file partitioning schemes in multiple disk systems
Paper 2: Parallel search using partitioned inverted files

P1: Inverted File Systems
An inverted file system consists of:
– Index file: an ordered list of all keywords that have been used to index a collection of documents. Along with each term there are fields that give the location and number of its postings in the posting file
– Posting file: a group of records, each holding the weight of the term and a pointer to the actual document in the document file
– Document file: the actual document records of the collection
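
As a rough illustration of how the three files relate, here is a small in-memory Python sketch. The class names, the use of list positions as pointers, and the weight formula (relative term frequency) are assumptions made for this example, not details from the paper.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    offset: int  # location of the term's first posting in the posting file
    count: int   # number of postings for the term


@dataclass
class Posting:
    weight: float  # weight of the term in the document
    doc_ptr: int   # pointer (here: list position) into the document file


def build_inverted_file(docs):
    """Build index file, posting file, and document file from raw texts."""
    document_file = list(docs)
    by_term = {}
    for doc_ptr, text in enumerate(document_file):
        words = text.lower().split()
        for term in set(words):
            # Assumed weight: relative frequency of the term in the document.
            weight = words.count(term) / len(words)
            by_term.setdefault(term, []).append(Posting(weight, doc_ptr))
    index_file, posting_file = {}, []
    for term in sorted(by_term):  # the index file is an ordered keyword list
        index_file[term] = IndexEntry(len(posting_file), len(by_term[term]))
        posting_file.extend(by_term[term])
    return index_file, posting_file, document_file


def lookup(term, index_file, posting_file, document_file):
    """Follow index entry -> postings -> documents for one term."""
    entry = index_file.get(term)
    if entry is None:
        return []
    postings = posting_file[entry.offset:entry.offset + entry.count]
    return [(p.weight, document_file[p.doc_ptr]) for p in postings]
```

A lookup reads one index entry, then a contiguous run of postings, then the referenced documents, which is why the placement of postings on disk matters for I/O.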

P1: Inverted File Systems (cont.)

P1: Load Balancing
In a multiple-CPU, multiple-disk system we need to:
– Balance the load on the processors: maximize CPU utilization
– Balance the load on the I/O devices, i.e. the disk drives: avoid I/O bottlenecks, which cause CPUs to go into wait states

P1: Partitioning an Inverted File
The paper explores two schemes:
– Based on term id
– Based on document id
With both schemes the partitioning of the index file and the document file is the same: the index file is partitioned by index term id and the document file by document id.
The posting file holds both the document id and the index term id. One scheme partitions the posting file by term id, while the other partitions it by document id.
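
The difference between the two schemes comes down to which id decides the disk for a posting record. A minimal sketch follows, assuming posting records are `(term_id, doc_id, weight)` tuples and using modulo placement, which is an illustrative choice and not necessarily the paper's allocation rule.

```python
from collections import defaultdict


def partition_postings(postings, num_disks, scheme):
    """Assign (term_id, doc_id, weight) posting records to disks.

    scheme="term": all postings of an index term land on one disk.
    scheme="doc":  all postings of a document land on one disk.
    Modulo placement is an assumption made for this sketch.
    """
    if scheme not in ("term", "doc"):
        raise ValueError("scheme must be 'term' or 'doc'")
    disks = defaultdict(list)
    for term_id, doc_id, weight in postings:
        key = term_id if scheme == "term" else doc_id
        disks[key % num_disks].append((term_id, doc_id, weight))
    return dict(disks)
```

With `scheme="term"` a single-term lookup touches one disk; with `scheme="doc"` the same lookup fans out over every disk that holds a matching document.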

P1: Partitioning an Inverted File (cont.)

P1: Objective of Partitioning the Inverted Index
– Objective: to maximize performance
– Ideal: all I/O channels and disk drives are equally used when the sub tasks of a query are executed in parallel
– However, data usage varies from query to query and cannot be predicted, so the ideal cannot be achieved
– The paper recognizes that I/O is a major cost factor in IR

A Brief Comparison

| Document Id | Term Id |
| --- | --- |
| All posting entries of a document are on the same disk | All posting entries of an index term are on the same disk |
| The index file needs to store the disk information with the index term, to indicate where the posting entries are stored. Hence it requires more space | No such information is needed, as all posting entries of an index term are on the same disk. Less space usage |
| Disk space usage over the multiple disks is balanced | Since the posting size of an index term varies with its frequency of occurrence in the collection, disk space usage may be unbalanced |

A Brief Comparison (cont.)
The main difference is the I/O characteristic: a sub task for a single query index term leads to disk I/O distributed across multiple disks with DocumentId partitioning, while with TermId partitioning it is limited to one disk.
Which is better? It is a tradeoff.

P1: Simulation Model
To compare the two schemes, the paper defines a simulation model with the following factors:
a) Collection database model: follows the natural-language text distribution given by Zipf's law, in which 20% of the index terms comprise 80% of the posting entries. The model skews these ratios to observe the effect on query performance
b) User query model: the paper uses two cases, skewed queries, with some terms of low ranks frequently requested, and a uniform query model, with all terms having the same probability
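
To get a feel for how skewed a Zipf-like collection is, the share of postings held by the most frequent terms can be computed directly. This is a sketch under a stated assumption: the posting count of the term at rank r is proportional to 1 / r**exponent. The function names and the `exponent` parameter are illustrative and do not come from the paper's simulator.

```python
def zipf_posting_shares(num_terms, exponent=1.0):
    """Share of all postings held by each term, most frequent first.

    Assumes the posting count of the term at rank r is proportional to
    1 / r**exponent (a common statement of Zipf's law).
    """
    raw = [1.0 / rank ** exponent for rank in range(1, num_terms + 1)]
    total = sum(raw)
    return [x / total for x in raw]


def top_share(shares, fraction=0.2):
    """Fraction of all postings covered by the top `fraction` of terms."""
    k = max(1, int(len(shares) * fraction))
    return sum(shares[:k])
```

Raising `exponent` concentrates more postings in fewer terms, and setting it to 0 gives a uniform collection, which loosely mirrors how the model varies the skew.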

P1: Simulation Model (cont.)
c) Queuing model: concurrent I/O requests on the same device are queued in priority. CPU usage requests on the same CPU are also queued
d) Workload model: vary the number of disks and CPUs

Simulation Results
– Increasing the number of disks up to a threshold improves performance by decreasing the response time
– When the index term and query term distributions are not skewed, the partitioning scheme based on term id performed best
– When the data was skewed, the partitioning scheme based on document id performed best
– With skewed data (80/20) and TermId partitioning, the disks holding the 20% of terms that account for most postings become bottlenecks
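
The bottleneck effect can be illustrated by counting how many postings each disk must read for a batch of query terms. The sketch below is a simplification, not the paper's simulator: it assumes TermId partitioning places term t on disk `t % num_disks`, and that DocId partitioning spreads each term's postings evenly over all disks.

```python
def postings_read_per_disk(posting_counts, query_terms, num_disks, scheme):
    """Count the postings each disk must read to answer the query terms.

    posting_counts maps term_id -> number of postings for that term.
    Placement rules are assumptions: TermId puts term t on disk
    t % num_disks; DocId spreads each term's postings evenly over all disks.
    """
    if scheme not in ("term", "doc"):
        raise ValueError("scheme must be 'term' or 'doc'")
    load = [0.0] * num_disks
    for term in query_terms:
        count = posting_counts[term]
        if scheme == "term":
            load[term % num_disks] += count
        else:
            for disk in range(num_disks):
                load[disk] += count / num_disks
    return load
```

When a few terms dominate both the collection and the queries, the TermId load piles up on the disks that hold those terms, while the DocId load stays even.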

Paper 2: Positioning w.r.t. Paper 1
– The thrust of Paper 1's approach was to partition user queries by index term, with each index term query becoming a sub task. The objective then became to optimize the individual sub task, whose biggest bottleneck is I/O
– What if the user query has only one query index term? Your disks are optimized, but your CPUs are idle
– Paper 2 recognizes that most user queries are single term only. Why?

P2: Search Topology Framework
Paper 2 proposes a different framework:

P2: Search Topology (cont.)
– Top node: accepts a query from the client, distributes it to all of its child nodes, and awaits the results
– Leaf node: looks after only ONE PARTITION of the inverted file
– Each leaf node and the top node has its own processor
– Within this framework, the paper's objective is to evaluate which type of inverted index partitioning is better: DocId based or TermId based
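
The top node / leaf node arrangement is a scatter-gather pattern, sketched below in Python. The class names, the in-memory partitions, and the additive scoring are assumptions for illustration and are not the PLIERS implementation. The sketch assumes DocId partitioning, so each leaf's scores are final for the documents it holds.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


class LeafNode:
    """Holds one partition of the inverted file: term -> {doc_id: weight}."""

    def __init__(self, partition):
        self.partition = partition

    def search(self, terms, k):
        scores = {}
        for term in terms:
            for doc_id, weight in self.partition.get(term, {}).items():
                scores[doc_id] = scores.get(doc_id, 0.0) + weight
        # Local top-k as (score, doc_id) pairs, best first.
        return heapq.nlargest(k, ((s, d) for d, s in scores.items()))


class TopNode:
    """Sends the query to every leaf and merges the leaves' results."""

    def __init__(self, leaves):
        self.leaves = leaves

    def search(self, terms, k=3):
        with ThreadPoolExecutor(max_workers=len(self.leaves)) as pool:
            partial = pool.map(lambda leaf: leaf.search(terms, k), self.leaves)
            candidates = [hit for hits in partial for hit in hits]
        return heapq.nlargest(k, candidates)
```

A possible use: `TopNode([LeafNode(p1), LeafNode(p2)]).search(["ir", "disk"], k=2)` returns the two best `(score, doc_id)` pairs across both partitions.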

P2: Approach
– The paper uses real web collections instead of simulations for its experiments
– The PLIERS system is run on an AP3000 machine with 8 to 12 nodes
– The data comprised BASE1 (1 GB) to BASE10 (10 GB) of the VLC2 collection
– Queries were based on topics 351 to 400 of the TREC-7 ad hoc track. Title-only and whole-topic queries were used
– Both DocId and TermId index partitioning were used
– Bottom line: real data instead of simulation

P2: Summary of Results
Within the framework of the experiment:
– DocId partitioning is better than TermId partitioning in a multiprocessor environment
– The TermId approach imposes too much communication overhead between the leaves and the top node, because the final result for a given document depends on the results from each leaf node
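
One way to see where the communication overhead comes from is to count what each leaf has to send. The sketch below rests on a simplifying assumption that is not taken from the paper: with DocId partitioning a leaf's scores are final, so it can send only its local top k, whereas with TermId partitioning its scores are partial sums, so it sends all of them and the top node combines them.

```python
def messages_to_top_node(leaf_partials, scheme, k):
    """Count the (doc_id, score) pairs each leaf sends to the top node.

    leaf_partials holds one dict per leaf, mapping doc_id -> score computed
    on that leaf. The sending rules are assumptions made for this sketch.
    """
    if scheme == "doc":
        return [min(k, len(partial)) for partial in leaf_partials]
    if scheme == "term":
        return [len(partial) for partial in leaf_partials]
    raise ValueError("scheme must be 'doc' or 'term'")


def combine_partial_scores(leaf_partials):
    """Top-node work under TermId: sum each document's partial scores."""
    totals = {}
    for partial in leaf_partials:
        for doc_id, score in partial.items():
            totals[doc_id] = totals.get(doc_id, 0.0) + score
    return totals
```

Under these assumptions the TermId traffic grows with the number of matching documents, while the DocId traffic is bounded by k per leaf.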

Comparison

| Paper 1 | Paper 2 |
| --- | --- |
| Breaks queries into sub tasks based on query keywords | Breaks a query into sub tasks based on the number of partitions of the inverted index |
| Focuses on optimization of disk I/O access | Focuses on optimization of processor use |
| Assumes a more generic topological framework | Very specific framework. The total number of CPUs needed depends on data-driven partitions |
| Draws its conclusions about the pluses and minuses of the DocId and TermId partitioning schemes from properties of the document collection | Due to its specific framework assumptions, concludes that DocId partitioning of the inverted index is best within that framework |

Conclusion
Taken together, these two papers highlight the issues of processor and I/O utilization, and the factors that affect partitioning inverted indexes under the DocumentId and TermId schemes.

URL Links to Papers
– Paper 1: Inverted file partitioning schemes in multiple disk systems. Byeong-Soo Jeong; Omiecinski, E.; Parallel and Distributed Systems, IEEE Transactions on, Volume: 6, Issue: 2, Feb, pp. 142-153 (arnumber=342125)
– Paper 2: Parallel search using partitioned inverted files. MacFarlane, A.; McCann, J.A.; Robertson, S.E.; String Processing and Information Retrieval, SPIRE Proceedings, Seventh International Symposium on, pp. 209-220 (arnumber=878197)