Modern Information Retrieval Chapter 9: Parallel and Distributed IR Section 9.1: Introduction Section 9.2.2: MIMD Architectures Inverted Files November.

Presentation transcript:

Modern Information Retrieval
Chapter 9: Parallel and Distributed IR
Section 9.1: Introduction
Section 9.2.2: MIMD Architectures (Inverted Files)
November 5, 1999

Summary
- Introduction
- Review of parallel computing and parallel program performance measures
- Exploration of techniques for implementing inverted files on MIMD parallel architectures
- Conclusion

Introduction
- The volume of electronic text available online today is staggering.
- The WWW contains over 800 million pages of text, comprising nearly 6 terabytes of data (Nature, Vol. 400, 8 July 1999).
- As document collections grow larger, they become more expensive to manage with an information retrieval system.
- To support the demanding requirements of modern search environments, we must turn to alternative architectures and algorithms.

Parallel Computing
- Parallel computing is the simultaneous application of multiple processors to solve a single problem.
- Flynn’s Taxonomy:
  - SISD: single instruction, single data
  - SIMD: single instruction, multiple data
  - MISD: multiple instruction, single data
  - MIMD: multiple instruction, multiple data

Parallel Program Performance Measures
- Speedup:

  $$S = \frac{\text{Running time of best available sequential algorithm}}{\text{Running time of parallel algorithm}}$$

- Amdahl’s Law:

  $$S \le \frac{1}{f + \frac{1 - f}{N}}$$

  where f is the fraction of the problem that must be computed sequentially and N is the number of processors.

Parallel Program Performance Measures
- Efficiency:

  $$\text{Efficiency} = \frac{S}{N}$$

  where S is the speedup and N is the number of processors.
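The performance measures can be tried out numerically. The following is a minimal sketch, not part of the original slides; the function names and the sample running times are illustrative assumptions.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup S: best sequential running time divided by parallel running time."""
    return t_sequential / t_parallel


def amdahl_bound(f: float, n: int) -> float:
    """Amdahl's Law: upper bound on speedup with n processors when a
    fraction f of the problem must be computed sequentially."""
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(s: float, n: int) -> float:
    """Efficiency: speedup S divided by the number of processors N."""
    return s / n


# Hypothetical measurements: 120 s sequentially, 20 s on 8 processors.
n = 8
s = speedup(120.0, 20.0)
print("speedup:", s)
print("efficiency:", efficiency(s, n))
print("Amdahl bound for f = 0.1:", amdahl_bound(0.1, n))
```

Even with only 10% sequential work, the bound for 8 processors stays below 5, which is why the sequential fraction matters more than the processor count as N grows.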

MIMD Architectures
- MIMD architectures offer a great deal of flexibility in how parallelism is defined and exploited to solve a problem.
- There are two ways in which a retrieval system can exploit a MIMD machine:
  - Parallel multitasking;
  - Partitioned parallel processing.

MIMD Architectures
[Figure: Parallel multitasking on a MIMD machine. Components shown: users (query in, result out), a broker, and five search engines.]

MIMD Architectures
[Figure: Partitioned parallel processing on a MIMD machine. Components shown: a user (query in, result out), a broker exchanging subqueries and results with five search processes.]

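MIMD Architectures
Basic data elements processed by a search algorithm (rows are documents, columns are indexing items):

$$
\begin{array}{c|cccccc}
 & k_1 & k_2 & \cdots & k_i & \cdots & k_t \\
\hline
d_1 & w_{1,1} & w_{2,1} & \cdots & w_{i,1} & \cdots & w_{t,1} \\
d_2 & w_{1,2} & w_{2,2} & \cdots & w_{i,2} & \cdots & w_{t,2} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
d_j & w_{1,j} & w_{2,j} & \cdots & w_{i,j} & \cdots & w_{t,j} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
d_N & w_{1,N} & w_{2,N} & \cdots & w_{i,N} & \cdots & w_{t,N}
\end{array}
$$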

MIMD Architectures
- There are two possible methods for partitioning the data:
  - Document partitioning: the N documents are distributed across the P processors; each parallel process evaluates the query on the subcollection of N/P documents assigned to it;
  - Term partitioning: the t indexing items are distributed across the P processors; the evaluation process for each document is spread over multiple processors.
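The two methods amount to slicing the document-by-item weight matrix along different axes. The sketch below is illustrative only: the weight values are made up, and the round-robin assignment is an assumption, since the slides do not say how documents or items are assigned to processors.

```python
# weights[j][i] is the weight of indexing item i in document j.
# Toy values (assumed): N = 6 documents, t = 4 indexing items.
weights = [
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [4, 0, 0, 2],
    [0, 1, 5, 0],
    [2, 2, 0, 0],
    [0, 0, 1, 3],
]


def document_partition(matrix, p):
    """Distribute the rows (documents) across p processors, round-robin.
    Each processor sees about N/P complete document vectors."""
    return [matrix[k::p] for k in range(p)]


def term_partition(matrix, p):
    """Distribute the columns (indexing items) across p processors, round-robin.
    Each processor sees all N documents but only about t/P items per document."""
    return [[row[k::p] for row in matrix] for k in range(p)]


doc_parts = document_partition(weights, 3)
term_parts = term_partition(weights, 2)
print([len(part) for part in doc_parts])      # documents per processor
print([len(part[0]) for part in term_parts])  # items per processor
```

Under document partitioning a processor can score its documents on its own; under term partitioning every document's score is assembled from pieces held by several processors.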

Inverted Files: Logical Document Partitioning
- Data Partitioning
  - The data partitioning is done logically using essentially the same basic underlying inverted file index as in the original sequential algorithm;
  - The inverted file is extended to give each parallel process direct access to that portion of the index related to the processor’s subcollection of documents.

Inverted Files: Logical Document Partitioning
[Figure: Extended dictionary entry for document partitioning. The dictionary entry for term i holds one pointer per processor (P1, P2, P3, P4) into the inverted list for item i.]

Inverted Files: Logical Document Partitioning
- Query Evaluation
  - The broker initiates P parallel processes to evaluate the query;
  - Each process executes the same document scoring algorithm on its document subcollection;
  - The search processes record document scores in a single shared array of document score accumulators;
  - The broker produces the final ranked list of documents.
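The query evaluation steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the unit weights, the document-id ranges per processor, the use of threads as stand-ins for parallel processes, and all names (`extended`, `search_process`, `evaluate`) are hypothetical.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

# Toy inverted file: term -> postings (doc_id, weight), sorted by doc_id.
inverted_file = {
    "hot":  [(1, 1.0), (4, 1.0), (5, 1.0), (6, 1.0)],
    "cold": [(2, 1.0), (4, 1.0), (5, 1.0)],
    "pot":  [(3, 1.0), (6, 1.0)],
}
NUM_DOCS = 6
P = 3
# Assumed split: processor p owns doc ids in [boundaries[p], boundaries[p + 1]).
boundaries = [1, 3, 5, 7]

# Extended dictionary: for each term, P + 1 offsets into its inverted list,
# so each process can jump straight to its own portion of the list.
extended = {
    term: [bisect_left(postings, (b,)) for b in boundaries]
    for term, postings in inverted_file.items()
}


def search_process(p, query, accumulators):
    """Score only the postings that belong to processor p's subcollection."""
    for term in query:
        if term not in inverted_file:
            continue
        lo, hi = extended[term][p], extended[term][p + 1]
        for doc_id, weight in inverted_file[term][lo:hi]:
            accumulators[doc_id] += weight  # each doc id is written by one process only


def evaluate(query):
    """Broker: start P search processes, then rank the shared accumulators."""
    accumulators = [0.0] * (NUM_DOCS + 1)  # shared array; index 0 is unused
    with ThreadPoolExecutor(max_workers=P) as pool:
        list(pool.map(lambda p: search_process(p, query, accumulators), range(P)))
    hits = [(doc_id, score) for doc_id, score in enumerate(accumulators) if score > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


print(evaluate(["hot", "cold"]))
```

Because the subcollections are disjoint, no two processes ever update the same accumulator, so the shared array needs no locking in this sketch.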

Inverted Files: Logical Document Partitioning
- Inverted File Construction
  - The indexer partitions the documents among the processors;
  - Each indexing process generates a batch of inverted lists, sorted by indexing item;
  - A merge step is performed to create the final inverted file.
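A minimal sketch of this construction, assuming a round-robin split of the documents and a simple whitespace tokenizer (both assumptions; the slides specify neither). The batches are produced one after another here, whereas the slides have them generated by parallel indexing processes. The function names are hypothetical.

```python
import heapq
from collections import defaultdict

collection = [
    (1, "Pease porridge hot"),
    (2, "Pease porridge cold"),
    (3, "Pease porridge in the pot"),
    (4, "Pease porridge hot, pease porridge not cold"),
    (5, "Pease porridge cold, pease porridge not hot"),
    (6, "Pease porridge hot in the pot"),
]


def index_batch(docs):
    """One indexing process: emit (term, doc_id) pairs for its documents,
    sorted by indexing item and then by document id."""
    pairs = set()
    for doc_id, text in docs:
        for token in text.lower().replace(",", " ").split():
            pairs.add((token, doc_id))
    return sorted(pairs)


def merge_batches(batches):
    """Merge step: combine the sorted batches into the final inverted file."""
    inverted = defaultdict(list)
    for term, doc_id in heapq.merge(*batches):
        inverted[term].append(doc_id)
    return dict(inverted)


P = 3
batches = [index_batch(collection[p::P]) for p in range(P)]  # one batch per processor
inverted_file = merge_batches(batches)
print(inverted_file["hot"])
```

Since every batch is already sorted, the merge is a single linear pass over all batches and never needs to re-sort.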

Inverted Files: Physical Document Partitioning
- Data Partitioning
  - The documents are physically partitioned into separate subcollections, one for each parallel processor;
  - Each subcollection has its own inverted file.

Inverted Files: Physical Document Partitioning
- Query Evaluation
  - The broker distributes the query to all of the parallel search processes;
  - Each parallel search process evaluates the query on its portion of the document collection, producing an intermediate hit-list;
  - The broker collects the intermediate hit-lists from all of the parallel search processes and merges them into a final hit-list.
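The broker's role in this scheme can be sketched as a merge of sorted hit-lists. The local indexes, unit weights, top-k cutoff, and function names below are illustrative assumptions, and the search processes are called sequentially here for simplicity.

```python
import heapq
from itertools import islice

# One separate inverted file per processor (assumed split: docs 1-2, 3-4, 5-6).
partitions = [
    {"hot": [(1, 1.0)], "cold": [(2, 1.0)]},
    {"hot": [(4, 1.0)], "cold": [(4, 1.0)], "pot": [(3, 1.0)]},
    {"hot": [(5, 1.0), (6, 1.0)], "cold": [(5, 1.0)], "pot": [(6, 1.0)]},
]


def rank_key(hit):
    """Order hits by descending score, breaking ties by document id."""
    doc_id, score = hit
    return (-score, doc_id)


def search_partition(local_index, query):
    """One search process: score its subcollection, return a sorted hit-list."""
    scores = {}
    for term in query:
        for doc_id, weight in local_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return sorted(scores.items(), key=rank_key)


def broker(query, k):
    """Send the query to every partition and merge the intermediate hit-lists."""
    hit_lists = [search_partition(index, query) for index in partitions]
    merged = heapq.merge(*hit_lists, key=rank_key)
    return list(islice(merged, k))


print(broker(["hot", "cold"], k=3))
```

Merging is only meaningful when the scores from different partitions are comparable, which is why the construction step for this scheme distributes global statistics to every partition.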

Inverted Files: Physical Document Partitioning
- Inverted File Construction
  - Each processor creates, in parallel, its own complete index corresponding to its document partition;
  - A merge step is performed to accumulate the global statistics for all of the partitions and distribute them to each of the partition dictionaries.
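The slides do not say which global statistics are accumulated. The sketch below assumes document frequency and a plain logarithmic idf as typical examples; the dictionary layout and the function name are hypothetical.

```python
import math
from collections import Counter

# Partition dictionaries after local indexing (assumed split: docs 1-2, 3-4, 5-6).
p1 = {"cold": {"local_df": 1}, "hot": {"local_df": 1}, "pease": {"local_df": 2}}
p2 = {"cold": {"local_df": 1}, "hot": {"local_df": 1}, "pease": {"local_df": 2},
      "pot": {"local_df": 1}}
p3 = {"cold": {"local_df": 1}, "hot": {"local_df": 2}, "pease": {"local_df": 2},
      "pot": {"local_df": 1}}


def merge_global_statistics(partition_dicts, partition_sizes):
    """Accumulate document frequencies over all partitions, then write the
    global values back into every partition dictionary."""
    global_df = Counter()
    for local in partition_dicts:
        for term, entry in local.items():
            global_df[term] += entry["local_df"]

    total_docs = sum(partition_sizes)
    for local in partition_dicts:
        for term, entry in local.items():
            entry["global_df"] = global_df[term]
            entry["idf"] = math.log(total_docs / global_df[term])  # assumed idf form
    return global_df


global_df = merge_global_statistics([p1, p2, p3], [2, 2, 2])
print(global_df["hot"], p1["hot"]["idf"])
```

After this step each partition weights a term the same way, even though it indexed only its own documents.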

Inverted Files: Term Partitioning
- Data Partitioning
  - Inverted lists are spread across the processors.

Inverted Files: Term Partitioning
- Query Evaluation
  - The query is decomposed into indexing items, and each indexing item is sent to the processor that holds the corresponding inverted list;
  - The processors create hit-lists with partial document scores and return them to the broker;
  - The broker combines the hit-lists.
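A minimal sketch of term-partitioned query evaluation. The assignment of terms to processors, the unit weight per posting, and the function names are assumptions made for illustration; the processors are called sequentially here.

```python
from collections import defaultdict

# Each processor holds the complete inverted lists for its own terms
# (assumed assignment of the eight example terms to three processors).
processors = [
    {"cold": [2, 4, 5], "hot": [1, 4, 5, 6], "in": [3, 6]},
    {"not": [4, 5], "pease": [1, 2, 3, 4, 5, 6], "porridge": [1, 2, 3, 4, 5, 6]},
    {"pot": [3, 6], "the": [3, 6]},
]
# Broker-side routing table: which processor holds which indexing item.
term_location = {term: p for p, lists in enumerate(processors) for term in lists}


def processor_hits(p, terms):
    """Processor p: build a hit-list of partial scores for the terms it holds."""
    partial = defaultdict(float)
    for term in terms:
        for doc_id in processors[p][term]:
            partial[doc_id] += 1.0  # unit weight per posting (assumed)
    return dict(partial)


def broker(query):
    """Decompose the query, route each item, and combine the partial hit-lists."""
    routed = defaultdict(list)
    for term in query:
        if term in term_location:
            routed[term_location[term]].append(term)

    combined = defaultdict(float)
    for p, terms in routed.items():
        for doc_id, score in processor_hits(p, terms).items():
            combined[doc_id] += score
    return sorted(combined.items(), key=lambda hit: (-hit[1], hit[0]))


print(broker(["hot", "pot"]))
```

Only the processors that hold a query term do any work, so a two-term query touches at most two processors regardless of P.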

Inverted Files: Term Partitioning
- Inverted File Construction
  - The inverted file is created using the parallel construction technique described for logical document partitioning.

Example: Document collection

Document  Text
1         Pease porridge hot
2         Pease porridge cold
3         Pease porridge in the pot
4         Pease porridge hot, pease porridge not cold
5         Pease porridge cold, pease porridge not hot
6         Pease porridge hot in the pot

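Example: Inverted File
[Figure: Inverted file for the example collection. The dictionary lists the terms cold, hot, in, not, pease, porridge, pot, the; each entry points to its inverted list.]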
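The inverted lists themselves did not survive in this transcript. As a rough reconstruction, the sketch below derives document-level inverted lists from the six example documents; the tokenization (lowercasing, commas dropped) and the choice to store only document ids are assumptions, so the original figure may have held more detail.

```python
collection = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def build_inverted_file(docs):
    """Map each term to the ascending list of documents that contain it."""
    inverted = {}
    for doc_id in sorted(docs):
        terms = set(docs[doc_id].lower().replace(",", " ").split())
        for term in terms:
            inverted.setdefault(term, []).append(doc_id)
    return dict(sorted(inverted.items()))  # dictionary in alphabetical order


example_index = build_inverted_file(collection)
for term, postings in example_index.items():
    print(f"{term:<9} {postings}")
```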

Example: Logical Document Partitioning
[Figure: The dictionary (cold, hot, in, not, pease, porridge, pot, the) with the entry for the term “pease” holding pointers for P1, P2, and P3 into its inverted list.]

Example: Physical Document Partitioning
[Figure: Each of the processors P1, P2, and P3 holds its own dictionary and inverted lists for its subcollection of the example documents.]

Example: Term Partitioning
[Figure: The dictionary terms (cold, hot, in, not, pease, porridge, pot, the) and their inverted lists divided among P1, P2, and P3.]

Conclusion
- The task of indexing and searching in very large text collections is costly;
- Faster indexing and searching algorithms are always desirable, and the use of parallel hardware is an obvious alternative;
- We discussed two possible organizations for the document collection index on a MIMD parallel architecture:
  - Document partitioning;
  - Term partitioning.

Conclusion
- Document partitioning affords simpler inverted index construction and maintenance than term partitioning;
- When term distributions in the documents and queries are more skewed, document partitioning performs better;
- When terms are uniformly distributed in user queries, term partitioning performs better.

Additional References
- Lawrence, S., Giles, C.L. Accessibility of Information on the Web. Nature, Vol. 400, pp.
- Ribeiro-Neto, B.A., Barbosa, R.A. Query Performance for Tightly Coupled Distributed Digital Libraries. Digital Libraries 98, pp.
- Ribeiro-Neto, B.A., Moura, E.S., Neubert, M.S., Ziviani, N. Efficient Distributed Algorithms to Build Inverted Files. SIGIR’99, pp.