Parallel and Distributed IR


Parallel and Distributed IR Eric Brown

Parallel Computing SISD: single instruction stream, single data stream. SIMD: single instruction stream, multiple data streams. MISD: multiple instruction streams, single data stream. MIMD: multiple instruction streams, multiple data streams.

Performance Measures

Speedup:
$$S = \frac{\text{running time of best available sequential algorithm}}{\text{running time of parallel algorithm}}$$

Bound on speedup:
$$S \le \frac{1}{f + (1-f)/N} \le \frac{1}{f}$$

Efficiency:
$$\phi = \frac{S}{N}$$
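The speedup bound on this slide, S <= 1/(f + (1-f)/N) <= 1/f, can be checked numerically. The sketch below assumes that f is the fraction of the computation that must run sequentially and N is the number of processors; the slide does not define either symbol. The function names are illustrative only.

```python
def amdahl_speedup_bound(f: float, n: int) -> float:
    """Upper bound on speedup, assuming f is the sequential fraction
    of the work and n is the number of processors."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(speedup: float, n: int) -> float:
    """Speedup divided by the number of processors."""
    return speedup / n


if __name__ == "__main__":
    # With 10% sequential work, adding processors gives diminishing returns:
    # the bound approaches but never exceeds 1/f = 10.
    for n in (1, 10, 100, 1000):
        s = amdahl_speedup_bound(0.1, n)
        print(n, round(s, 2), round(efficiency(s, n), 3))
```

With f = 0.1 and N = 10, the bound is roughly 5.26, so at best about half of the processing capacity is used productively.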

Parallel IR Introduction: There are two approaches. Develop new retrieval strategies that lend themselves directly to parallel implementation, or adapt existing, well-studied information retrieval algorithms to parallel processing.

MIMD Architecture

MIMD Architecture Inverted Files Logical Document Partitioning: uses essentially the same underlying inverted file index as the original sequential algorithm. Physical Document Partitioning: each subcollection has its own inverted file, and the search processes share nothing during query evaluation.

MIMD Architecture Logical document partitioning requires less communication than physical document partitioning with similar parallelization, and so is likely to provide better overall performance. Physical document partitioning, on the other hand, offers more flexibility, and converting an existing IR system into a parallel IR system is simpler with physical document partitioning.
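A minimal sketch of physical document partitioning, under stated assumptions: documents are assigned to subcollections round-robin, each subcollection gets its own inverted file, and a document's score is the sum of query-term frequencies. The assignment rule, the scoring function, and all function names are illustrative choices, not taken from the slides.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted file: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def partition_documents(docs, n):
    """Split the collection into n disjoint subcollections (round-robin)."""
    parts = [dict() for _ in range(n)]
    for i, (doc_id, text) in enumerate(sorted(docs.items())):
        parts[i % n][doc_id] = text
    return parts


def search_partition(index, query):
    """Score documents of one subcollection using only its own index."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return dict(scores)


def search_all(indexes, query):
    """Evaluate the query on every partition, then merge the results.
    Partitions hold disjoint documents, so merging is a plain union."""
    merged = {}
    for index in indexes:
        merged.update(search_partition(index, query))
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because each call to `search_partition` touches only its own index, the calls share nothing and could run on separate processors. A real ranking function would usually also need collection-wide statistics, which this sketch leaves out.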

MIMD Architectures Term partitioning When term partitioning is used, a single inverted file is created for the document collection and the inverted lists are spread across the processors. Assuming each processor has its own I/O channel and disks: when the term distributions in the documents and the queries are more skewed, document partitioning performs better; when terms are uniformly distributed in user queries, term partitioning performs better.
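Term partitioning can be sketched as follows. The sketch assumes that each inverted list is assigned to a processor by hashing its term with CRC32; the slides do not say how the lists are spread, so the `owner` function is a hypothetical placement rule, and the term-frequency scoring is likewise only illustrative.

```python
import zlib
from collections import defaultdict


def owner(term, n):
    """Hypothetical placement rule: hash the term to one of n processors."""
    return zlib.crc32(term.encode("utf-8")) % n


def build_term_partitioned_index(docs, n):
    """One logical inverted file whose inverted lists are spread over n shards.
    Each shard maps term -> {doc_id: term frequency}."""
    shards = [defaultdict(dict) for _ in range(n)]
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings = shards[owner(term, n)][term]
            postings[doc_id] = postings.get(doc_id, 0) + 1
    return shards


def search(shards, query):
    """Route each query term to the shard holding its inverted list,
    then combine the partial scores per document."""
    n = len(shards)
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        shard = shards[owner(term, n)]
        for doc_id, tf in shard.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Here a query only involves the processors that own its terms, which is why the balance of work depends on how terms are distributed across queries.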

MIMD Architecture

SIMD Architecture Signature Files

SIMD Architecture Signature Files

SIMD Architecture Signature Files

SIMD Architectures Inverted Files

SIMD Architectures

SIMD Architectures Inverted Files

SIMD Architectures

Distributed IR Introduction A distributed computing system can be viewed as a MIMD parallel processor with relatively slow inter-processor communication channels and the freedom to employ a heterogeneous collection of processors in the system.

Distributed IR Introduction The distributed model is very similar to the MIMD parallel processing model. The main difference is that the subtasks run on different computers and communicate using a network protocol such as TCP/IP.

Collection Partitioning The procedure used to assign documents to search servers in a distributed IR system depends on a number of factors. Consider first whether or not the system is centrally administered.

Collection Partitioning When the distributed system is centrally administered, more options are available. The first option is simple replication of the collection across all of the search servers. The second option is random distribution of the documents. The final option is explicit semantic partitioning of the documents.
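The three options for a centrally administered system can be sketched as assignment functions. This is only an illustration: documents are represented by their identifiers, and the topic labels and the `topic_to_server` mapping used for semantic partitioning are hypothetical inputs that an administrator would have to supply.

```python
import random


def replicate(doc_ids, n_servers):
    """Option 1: every search server holds a full copy of the collection."""
    return [list(doc_ids) for _ in range(n_servers)]


def distribute_randomly(doc_ids, n_servers, seed=0):
    """Option 2: each document goes to one server chosen at random."""
    rng = random.Random(seed)
    servers = [[] for _ in range(n_servers)]
    for doc_id in doc_ids:
        servers[rng.randrange(n_servers)].append(doc_id)
    return servers


def partition_by_topic(doc_topics, topic_to_server, n_servers):
    """Option 3: explicit semantic partitioning. doc_topics maps
    doc_id -> topic label; topic_to_server maps topic -> server number.
    Both mappings are assumed to be given."""
    servers = [[] for _ in range(n_servers)]
    for doc_id, topic in doc_topics.items():
        servers[topic_to_server[topic]].append(doc_id)
    return servers
```

With replication any server can answer any query, whereas with random or semantic partitioning each document lives on exactly one server.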

Source Selection Source selection is the process of determining which of the distributed document collections are most likely to contain relevant documents for the current query, and therefore should receive the query for processing. The basic technique is to treat each collection as if it were a single large document, index the collections, and evaluate the query against the collections to produce a ranked listing of collections.
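The basic technique can be sketched by collapsing each collection into one term-frequency profile, as if it were a single large document, and scoring the query against those profiles. The scoring rule used here (sum of relative term frequencies) and the function names are assumptions made for illustration, not the method of any particular system.

```python
from collections import Counter


def collection_profile(docs):
    """Treat a whole collection as one large document: count all its terms."""
    profile = Counter()
    for text in docs:
        profile.update(text.lower().split())
    return profile


def rank_collections(profiles, query):
    """Score each collection profile against the query and return
    (collection name, score) pairs, best first."""
    terms = query.lower().split()
    scored = []
    for name, profile in profiles.items():
        total = sum(profile.values()) or 1  # avoid dividing by zero
        score = sum(profile[t] / total for t in terms)
        scored.append((name, score))
    return sorted(scored, key=lambda kv: (-kv[1], kv[0]))
```

The query would then be sent only to the top-ranked collections, for example those with a non-zero score.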

Query Processing Query processing in a distributed IR system proceeds as follows: (1) select the collections to search; (2) distribute the query to the selected collections; (3) evaluate the query at the distributed collections in parallel; (4) combine the results from the distributed collections into the final result.
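The four steps can be sketched in one process, with a thread pool standing in for the remote search servers. Everything here is a simplification under stated assumptions: collections are in-memory dictionaries, selection keeps any collection containing a query term, scores are raw term counts, and results are merged by assuming those scores are comparable across collections.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(collection, query):
    """Evaluate the query at one collection (doc_id -> text).
    Returns a list of (score, doc_id) for matching documents."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in collection.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            hits.append((score, doc_id))
    return hits


def select_collections(collections, query):
    """Step 1: keep the collections that contain at least one query term."""
    terms = set(query.lower().split())
    return [
        name
        for name, coll in collections.items()
        if any(terms & set(text.lower().split()) for text in coll.values())
    ]


def process_query(collections, query, k=10):
    selected = select_collections(collections, query)
    if not selected:
        return []
    # Steps 2 and 3: send the query to each selected collection and
    # evaluate in parallel (threads stand in for remote servers).
    with ThreadPoolExecutor(max_workers=len(selected)) as pool:
        partial = list(
            pool.map(lambda name: evaluate(collections[name], query), selected)
        )
    # Step 4: combine the partial result lists into one ranked list.
    merged = [hit for hits in partial for hit in hits]
    merged.sort(key=lambda h: (-h[0], h[1]))
    return merged[:k]
```

In a heterogeneous system the scores returned by different servers are generally not directly comparable, so the merge step would need some form of score normalization that this sketch omits.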

Web Issues The parallel and distributed techniques described above can be applied directly, as if the Web were any other large document collection. This is the approach currently taken by most of the popular Web search services.

Trends and Research Issues The trend in parallel hardware is the development of general MIMD machines. Many challenges remain in the area of parallel and distributed text retrieval. The first challenge is measuring retrieval effectiveness on large text collections. The second significant challenge is interoperability, or building distributed IR systems from heterogeneous components.