ODISSEA open distributed search engine architecture A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval Torsten Suel, Chandan.

Presentation transcript:

ODISSEA: Open Distributed Search Engine Architecture
A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval
Torsten Suel, Chandan Mathur, Jo-Wen Wu, Jiangong Zhang, Alex Delis, Mehdi Kharrazi, Xiaohui Long, Kulesh Shanmugasundaram
Daniel Porta

Seminar "Peer-to-peer Information Systems"
Talk Outline
- Motivation
- Design Overview
- System Design Details
- Target Applications
- Implementation Details
- Efficient Query Processing
- Open Questions

Motivation
- Today, most of the web search infrastructure is supplied by only a few large crawl-based search engines.
- There has been strong research in the field of P2P systems over the last few years.
- Computers have become and will continue to become faster, and network bandwidth has increased and will keep growing.
- This raises two issues:
  - The vast amount of data in P2P networks requires the ability to search in these networks.
  - The significant computing resources provided by a P2P system could be used to search content residing inside or outside the system.
- ODISSEA addresses both issues with a "distributed global indexing and query execution service".

Design Overview
- ODISSEA differs from many other approaches to P2P search.
- It assumes a two-layered search engine architecture and a global index structure distributed over the nodes of the system.
- In a global index, in contrast to a local index, a single node holds the entire inverted list for a particular term.

Two-Layer Approach
- The lower layer provides:
  - maintenance of the global index structure under document insertions and updates
  - handling of node joins and failures
  - efficient execution of simple search queries
- The upper layer interacts with the P2P-based lower layer via two classes of clients:
  - update clients (e.g. crawler, web server)
  - query clients (user-implemented, optimized query execution plans)
[Figure: architecture diagram with the labels ODISSEA, queries, crawler, search server, and WWW]

Two-Layer Approach
- The design enables a large variety of client-based search tools that more fully exploit client computing resources.
- These tools can share the same lower-layer web search infrastructure.
- Tools are developed against an open API that accesses the search infrastructure.
- In the most general case (i.e. when no pre-evaluation is done on the server side), processing a query can require large amounts of data to be transferred to the query client.

Global vs. Local Index
- A posting has the form [DocID, Position, additional information].
- An inverted list is a list of postings that represents all occurrences of a term in the document collection.
- The inverted index for a set of terms is the set of the corresponding inverted lists.
- Suppose the query is "chair AND table". It is then processed as follows:
[Figure: two diagrams, each with a search client and nodes A, B, C; in one of them node A holds "chair" and node B holds "table"]
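The definitions above can be made concrete with a minimal sketch. The tokenization (lowercasing and splitting on whitespace), the function names `build_inverted_index` and `boolean_and`, and the three sample documents are assumptions made for illustration; they are not part of ODISSEA. Postings here carry only (DocID, Position) and no additional information.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its inverted list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return dict(index)

def boolean_and(index, term_a, term_b):
    """Return the sorted IDs of documents that contain both terms."""
    docs_a = {doc_id for doc_id, _ in index.get(term_a, [])}
    docs_b = {doc_id for doc_id, _ in index.get(term_b, [])}
    return sorted(docs_a & docs_b)

# Hypothetical sample collection.
docs = {
    1: "wooden chair and table",
    2: "office chair",
    3: "table lamp",
}
index = build_inverted_index(docs)
print(index["chair"])                        # inverted list for "chair"
print(boolean_and(index, "chair", "table"))  # documents matching chair AND table
```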

Global vs. Local Index
- A local index organization is very inefficient in very large networks (e.g. the web) if result quality is the major concern, because the query has to be transmitted to all nodes and all of them have to respond.
- In a global index organization, however, large amounts of data need to be transmitted between nodes when
  - initially building the index
  - evaluating a query, which leads to bad response times
- This can be overcome with smart algorithmic techniques, described later in the talk.
- The choice depends on the types of queries, the frequency of document updates, and how dynamic the system is.
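The trade-off can be illustrated by counting how many nodes a two-term query has to contact under each organization. This is only a sketch: the placement rule `term_to_node` (MD5 of the term modulo the node count) is a hypothetical stand-in for a real DHT lookup, and the node count is arbitrary.

```python
import hashlib

def term_to_node(term, num_nodes):
    """Hypothetical placement rule: hash the term and reduce it modulo the node count."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes

def nodes_for_query_local(query_terms, num_nodes):
    """Local index: documents are spread over all nodes, so every node must be asked."""
    return set(range(num_nodes))

def nodes_for_query_global(query_terms, num_nodes):
    """Global index: only the nodes that own the query terms' inverted lists are asked."""
    return {term_to_node(term, num_nodes) for term in query_terms}

query = ["chair", "table"]
print(len(nodes_for_query_local(query, 1000)))   # all nodes
print(len(nodes_for_query_global(query, 1000)))  # at most one node per term
```

The global organization contacts few nodes, but the inverted lists held on those nodes then have to be combined across the network, which is the cost the query processing slides try to reduce.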

Crawling and Fault Tolerance
- Crawling approach
  - Client-based, non-P2P crawlers have the advantage that they can easily be altered if web site operators complain about the bot.
  - Smart crawling strategies beyond BFS are hard to implement in a P2P environment unless there is a centralized scheduler.
- P2P systems and fault tolerance
  - The system design relies on the assumption of a fairly stable P2P environment, since otherwise administration (insert, update, replication) would be too expensive.

Target Applications
- Full-text search in large document collections located within P2P communities
- Search in large intranet environments
- Web search: a powerful API supports the anticipated shift towards client-based search tools that better exploit the resources of today's desktop machines.
- Search middleware: instead of inserting documents, clients could directly insert index entries. This might speed up query execution, since only certain "strong" keywords need to be inserted for a document. A drawback is that the identification of such keywords is left to the client.

Implementation Details
- A first system is currently being implemented in Java, using Pastry as the P2P substrate (lower layer) and a DHT mapping that maps hashed IDs to the appropriate IP address.
- Each node runs an indexer that stores inverted lists in compressed form in a Berkeley DB (which contains a B+ tree); each document is also stored in a Berkeley DB.
- Using MD5, all documents and term lists are hashed to an 80-bit ID that is used for lookups in the system.
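The ID scheme can be sketched as follows. MD5 produces 128 bits, and the slides do not say how these are reduced to 80 bits; keeping the leading 80 bits is an assumption made here, as is the function name `object_id`. The sketch is in Python for brevity, while the system described above is written in Java.

```python
import hashlib

ID_BITS = 80  # ID length stated in the talk

def object_id(name):
    """Hash a document URL or term to an 80-bit integer ID.

    Assumption: the 128-bit MD5 digest is truncated to its leading 80 bits.
    """
    digest = hashlib.md5(name.encode("utf-8")).digest()  # 16 bytes = 128 bits
    return int.from_bytes(digest[: ID_BITS // 8], "big")

print(format(object_id("chair"), "020x"))  # 80 bits = 20 hex digits
```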

Implementation Details: Parsing and Routing Postings
- New or updated documents are parsed at the node where they reside, as determined by the DHT mapping.
- For each term, the parser generates a posting that is routed via several intermediate nodes, as determined by the topology of the Pastry network, until it reaches its destination node.
- The index structure of a node is split into a small structure residing in main memory that is eventually merged with a bigger structure on disk. This avoids disk accesses during inserts and updates and gives a lower amortized complexity.
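The split between a small in-memory structure and a larger on-disk structure can be sketched like this. The class name `BufferedIndex`, the flush threshold, and the use of a plain dictionary as a stand-in for the on-disk Berkeley DB are all assumptions for illustration; the slides do not describe the merge policy in this detail.

```python
class BufferedIndex:
    """Collect postings in memory and merge them into the 'disk' structure in batches."""

    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self.memory = {}   # term -> postings buffered in main memory
        self.disk = {}     # stand-in for the larger on-disk structure
        self.buffered = 0  # number of postings currently buffered
        self.merges = 0    # number of batch merges performed so far

    def insert(self, term, posting):
        """Insert touches only memory; the disk is updated once per batch."""
        self.memory.setdefault(term, []).append(posting)
        self.buffered += 1
        if self.buffered >= self.flush_threshold:
            self.merge()

    def merge(self):
        """Merge all buffered postings into the disk structure in one pass."""
        for term, postings in self.memory.items():
            self.disk[term] = sorted(self.disk.get(term, []) + postings)
        self.memory.clear()
        self.buffered = 0
        self.merges += 1

    def lookup(self, term):
        """A lookup has to consult both structures."""
        return sorted(self.disk.get(term, []) + self.memory.get(term, []))
```

With a threshold of n, n inserts cause a single pass over the disk structure instead of n separate disk updates, which is where the lower amortized cost comes from.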

Implementation Details: Groups and Splits
- Initially, all objects (documents, indexes) whose first w bits coincide (here w = 16) are placed into a common group identified by this w-bit string.
- Locally, each group maintains a Berkeley DB with all objects it contains.
- When a group (of documents) becomes too large (here > 1 GB), it is split into two groups, each identified by a (w+1)-bit string. A stub structure is left behind that points to the new groups, which are assigned to new nodes.
- If the index structure for a term becomes too large (here > 100 MB), it is split into two lists according to the document IDs it contains.
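Prefix-based grouping and splitting can be sketched as below. The function names and the representation of labels as bit strings are assumptions; size thresholds, stub structures, and node assignment are left out.

```python
def group_label(object_id, id_bits=80, w=16):
    """Return the first w bits of an ID as a bit string; this names the group."""
    return format(object_id, f"0{id_bits}b")[:w]

def split_group(label, object_ids, id_bits=80):
    """Split a group into two children whose labels are one bit longer.

    Every ID in object_ids is expected to start with the given label.
    """
    children = {label + "0": [], label + "1": []}
    for oid in object_ids:
        bits = format(oid, f"0{id_bits}b")
        children[bits[: len(label) + 1]].append(oid)
    return children

# Toy example with 8-bit IDs and w = 2 so the bit strings stay readable.
ids = [0b01000000, 0b01100000]
print(group_label(ids[0], id_bits=8, w=2))  # both IDs fall into group "01"
print(split_group("01", ids, id_bits=8))    # after the split: "010" and "011"
```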

Implementation Details: Replication
- Replication is performed at the group level by attaching "/0", "/1", etc. to the group label (e.g. /2).
- This new label is what is actually presented to Pastry/DHT during lookups.
- All replicas of a group form a "clique" whose members communicate periodically to update their status.
- If a group replica fails, the others are in charge of detecting this and, if necessary, performing repair.
- Each node can contain several distinct group replicas and can therefore participate in several cliques.
- Postings are first routed to only one replica, which is then in charge of forwarding them to the others over a period of a few minutes.
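The labeling scheme can be sketched in a few lines. The function names and the short example label are hypothetical; how the first replica is chosen and how forwarding is scheduled are not specified in the slides and are not modeled here.

```python
def replica_labels(group_label, degree):
    """Labels presented to the DHT for each replica of a group: label/0, label/1, ..."""
    return [f"{group_label}/{i}" for i in range(degree)]

def forwarding_targets(labels, first_replica):
    """Replicas that the first recipient of a posting must forward it to."""
    return [label for label in labels if label != first_replica]

clique = replica_labels("0101", 3)            # "0101" is a made-up group label
print(clique)
print(forwarding_targets(clique, "0101/1"))   # the other members of the clique
```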

Implementation Details: Faults, Unavailability and Synchronization
- When a node leaves the system, its group replicas eventually have to be replaced to maintain the desired degree of replication.
- A node is considered failed if it has been unavailable for an extended period of time.
- New replicas are created for a failed node, or if a certain number of nodes are unavailable.
- Formerly unavailable nodes have to synchronize their index structures using logs of the updates they missed.

Efficient Query Processing: Information-Theoretic Background
- Let d be a document, q = q_0 ... q_{m-1} a query consisting of m terms, and F a function that assigns d (depending on q) a value F(d,q). Such a function is called a ranking function.
- The top-k ranking problem for a query q is to find the k documents with the highest values F(d,q).
- A common form of such a function is the sum of per-term scores: F(d,q) = f(d,q_0) + ... + f(d,q_{m-1}).
- Since queries typically have at most 2 search terms, the following algorithms focus on the top-k ranking problem for queries with exactly 2 search terms (for one-term queries, there is in fact nothing to do).

Efficient Query Processing: Fagin's Algorithm (FA)
- Intuitively, an item that is ranked at the top overall is likely to be ranked very high in at least one of the contributing subcategories.
- Assume a query q = q_0 AND q_1 and postings of the form (d, f(d,q_i)) that are sorted by the second component, with the highest values first.
- Also assume that the inverted lists for q_0 and q_1 are located on the same machine, so that no network communication is required.
- Goal: compute the top k documents as fast as possible.

Efficient Query Processing
1. Scan both lists from the beginning, reading one element from each list in every step, until k documents have been encountered in both lists (here assume k = 2).
2. Compute the scores of these k documents. Also, for each document that was encountered in only one of the lists, perform a lookup into the other list to determine the score of the document.
3. Return the k documents with the highest scores (here d1, d5).
[Figure: example inverted lists A and B]
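The three steps can be sketched for two lists as follows. Two things are assumptions: a document missing from one list contributes 0 for that term, and the input lists are the ones from the DPP example later in this talk (the data of the original figure is not preserved). `Decimal` is used only to keep the score arithmetic exact.

```python
from decimal import Decimal as D

def fagin_top_k(list_a, list_b, k):
    """Fagin's Algorithm for two lists of (doc, score), each sorted by descending score."""
    score_a, score_b = dict(list_a), dict(list_b)  # random-access lookups
    seen_a, seen_b = set(), set()
    depth = 0
    # Step 1: sorted access in lockstep until k documents were seen in both lists.
    while len(seen_a & seen_b) < k and depth < max(len(list_a), len(list_b)):
        if depth < len(list_a):
            seen_a.add(list_a[depth][0])
        if depth < len(list_b):
            seen_b.add(list_b[depth][0])
        depth += 1
    # Step 2: complete the score of every document seen so far via lookups.
    # Assumption: a document absent from a list scores 0 for that term.
    totals = {doc: score_a.get(doc, 0) + score_b.get(doc, 0) for doc in seen_a | seen_b}
    # Step 3: return the k best documents (ties broken by document name).
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:k]

LIST_A = [("d1", D("0.9")), ("d2", D("0.8")), ("d3", D("0.7")),
          ("d4", D("0.69")), ("d5", D("0.67"))]
LIST_B = [("d6", D("0.6")), ("d5", D("0.5")), ("d3", D("0.4")),
          ("d1", D("0.3")), ("d7", D("0.2")), ("d8", D("0.1"))]

print(fagin_top_k(LIST_A, LIST_B, k=2))  # d1 (1.2) and d5 (1.17)
```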

Efficient Query Processing: Threshold Algorithm (TA)
- Scan both lists simultaneously, reading (d, f(d,q_0)) from the first list and (d', f(d',q_1)) from the second.
- Compute t = f(d,q_0) + f(d',q_1).
- For each document read from one of the lists, immediately perform a lookup in the other list to compute its complete score.
- The algorithm terminates when k documents have been found whose scores are higher than the current value of t.
Because it does not make sense to scan two lists simultaneously when they are distributed over a P2P network, the above techniques have to be adapted. This leads to a protocol that aims to minimize the amount of data transferred.
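The Threshold Algorithm described above can be sketched for the two-list, single-machine case as follows. As assumptions, a document missing from one list contributes 0 for that term, an exhausted list contributes 0 to the threshold, and the sample lists are taken from the DPP example later in this talk. `Decimal` keeps the arithmetic exact.

```python
from decimal import Decimal as D

def top_k(totals, k):
    """The k highest-scoring (doc, total) pairs, ties broken by document name."""
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:k]

def threshold_top_k(list_a, list_b, k):
    """Threshold Algorithm for two lists of (doc, score) sorted by descending score."""
    score_a, score_b = dict(list_a), dict(list_b)  # random-access lookups
    totals = {}
    for depth in range(max(len(list_a), len(list_b))):
        threshold = 0  # t = sum of the scores read at the current depth
        for postings in (list_a, list_b):
            if depth < len(postings):
                doc, score = postings[depth]
                threshold += score
                # Immediate lookup in the other list gives the complete score.
                totals[doc] = score_a.get(doc, 0) + score_b.get(doc, 0)
        best = top_k(totals, k)
        # Stop once k documents score higher than t, as stated on the slide.
        if len(best) == k and all(total > threshold for _, total in best):
            return best
    return top_k(totals, k)  # lists exhausted without early termination

LIST_A = [("d1", D("0.9")), ("d2", D("0.8")), ("d3", D("0.7")),
          ("d4", D("0.69")), ("d5", D("0.67"))]
LIST_B = [("d6", D("0.6")), ("d5", D("0.5")), ("d3", D("0.4")),
          ("d1", D("0.3")), ("d7", D("0.2")), ("d8", D("0.1"))]

print(threshold_top_k(LIST_A, LIST_B, k=2))  # d1 (1.2) and d5 (1.17)
```

On this input the scan stops after three postings per list, when t has dropped to 1.1 and both d1 and d5 score above it.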

Efficient Query Processing: A simple distributed pruning protocol (DPP)
1. Node A (holding the shorter list) sends its first x postings to node B. Let r_min be the smallest value f(d,q_0) transmitted.
2. Node B receives the postings from A and performs lookups into its own list to compute the total scores. It retains the k documents with the highest scores. Let r_k be the smallest value among these.
3. Node B now transmits to A all postings among its first x postings with f(d,q_1) > r_k - r_min, together with the total scores of the k documents from the previous step.
4. Node A now performs lookups into its own list for the postings received from B and determines the overall top k documents.
[Figure: nodes A and B]

Efficient Query Processing
DPP example for k = 2 and x = 3:
- A, holding term q_0: (d1, 0.9), (d2, 0.8), (d3, 0.7), (d4, 0.69), (d5, 0.67)
- B, holding term q_1: (d6, 0.6), (d5, 0.5), (d3, 0.4), (d1, 0.3), (d7, 0.2), (d8, 0.1)
1. A to B: (d1, 0.9), (d2, 0.8), (d3, 0.7), so r_min = 0.7
2. B computes the total scores of d1, d2, d3 and retains (d1, 1.2), (d3, 1.1), so r_k = 1.1 and r_k - r_min = 1.1 - 0.7 = 0.4
3. B to A: (d6, 0.6), (d5, 0.5), because f(d6,q_1) and f(d5,q_1) are both > 0.4, together with (d1, 1.2), (d3, 1.1)
4. A computes the total scores of d6 and d5
Top 2 documents: 1. (d1, 1.2) 2. (d5, 1.17)
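The protocol and the example above can be sketched in a single process, with both nodes' lists passed in as arguments and the two messages returned for inspection. As an assumption, a document missing from a node's list contributes 0 for that term; the function and key names are made up for this sketch. `Decimal` is used because binary floating point would not reproduce 1.1 - 0.7 = 0.4 exactly.

```python
from decimal import Decimal as D

def top_k(totals, k):
    """The k highest-scoring (doc, total) pairs, ties broken by document name."""
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:k]

def dpp_top_k(list_a, list_b, k, x):
    """Simulate the distributed pruning protocol; list_a must be non-empty."""
    score_a, score_b = dict(list_a), dict(list_b)
    # Step 1: A sends its first x postings; r_min is the smallest score sent.
    a_to_b = list_a[:x]
    r_min = min(score for _, score in a_to_b)
    # Step 2: B completes the scores by lookup and retains the best k.
    retained = top_k({doc: s + score_b.get(doc, 0) for doc, s in a_to_b}, k)
    r_k = min(total for _, total in retained)
    # Step 3: B sends those of its first x postings that can still matter.
    b_to_a = [(doc, s) for doc, s in list_b[:x] if s > r_k - r_min]
    # Step 4: A completes the scores of B's postings and picks the overall top k.
    totals = dict(retained)
    for doc, s in b_to_a:
        totals[doc] = s + score_a.get(doc, 0)
    return {"a_to_b": a_to_b, "b_to_a": b_to_a, "top_k": top_k(totals, k)}

LIST_A = [("d1", D("0.9")), ("d2", D("0.8")), ("d3", D("0.7")),
          ("d4", D("0.69")), ("d5", D("0.67"))]
LIST_B = [("d6", D("0.6")), ("d5", D("0.5")), ("d3", D("0.4")),
          ("d1", D("0.3")), ("d7", D("0.2")), ("d8", D("0.1"))]

result = dpp_top_k(LIST_A, LIST_B, k=2, x=3)
print(result["b_to_a"])  # postings for d6 and d5
print(result["top_k"])   # d1 (1.2) and d5 (1.17)
```

Only three postings travel from A to B and two from B to A, instead of a full inverted list. The result is only guaranteed to be the true top k if x is chosen large enough.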

Efficient Query Processing: Problems with the DPP
- It works only for queries containing 2 search terms.
- Random lookups can cause disk accesses, since large index structures reside on hard disk, which leads to bad response times.
- How should the value of x be chosen? (x should be the number of postings transmitted from A and B such that DPP works correctly without an extra round trip; it depends on k and on the lengths of the inverted lists.)
  - By deriving appropriate formulae based on extensive testing
  - By sampling-based methods that estimate the number of documents appearing in both lists
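One possible reading of the sampling-based option is sketched below: sample document IDs from the shorter list, check how many also occur in the longer list, and scale up. This is an assumption about what such a method could look like, not the estimator used by the authors, and it does not show how x would be derived from the estimate.

```python
import random

def estimate_common_docs(docs_a, docs_b, sample_size, rng=None):
    """Estimate how many document IDs appear in both lists from a random sample."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    shorter, longer = sorted((list(docs_a), list(docs_b)), key=len)
    if not shorter:
        return 0.0
    longer_set = set(longer)
    sample = rng.sample(shorter, min(sample_size, len(shorter)))
    hits = sum(1 for doc in sample if doc in longer_set)
    # Scale the hit rate in the sample up to the length of the shorter list.
    return hits / len(sample) * len(shorter)

# Hypothetical document ID ranges that overlap in 50 IDs.
print(estimate_common_docs(range(100), range(50, 200), sample_size=20))
```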

Efficient Query Processing: Evaluation of DPP
- 900 two-term queries selected from a set of over 1 million
- Test corpus: 120 million web pages (1.8 TB) crawled by the authors' own crawler
- Value of x determined by experiments on TA
- Computation within nodes is not taken into account
- Communication costs and estimated times of DPP for the top-10 documents and the standard cosine measure:
[Table: communication costs and estimated times; values not preserved in this transcript]

Future Work
- Framework for generating optimized query execution plans for multi-keyword queries
- New algorithmic techniques for the index synchronization problem
- New strategies for load balancing and rebuilding of lost replicas
- More experimental evaluation concerning different types of queries

Questions?
"The general question remains whether the near future will see massive P2P-based systems for challenging applications such as web search and large-scale IR, beyond the current simple applications such as file sharing."