Efficient P2P Searches Using Result-Caching From U. of Maryland. Presented by Lintao Liu 2/24/03.

Motivation & Query Model
Avoid duplicating work and data movement by caching previous query results.
Observation: a_i && a_j && a_k = (a_i && a_j) && a_k, so the result for (a_i && a_j) can be kept as a materialized view.
Query model: (a_i && a_j && a_k) || (b_i && b_j && b_k) || ...
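The observation above can be illustrated with a small sketch. It is not the authors' implementation: the toy inverted index `INDEX`, the cache `VIEWS`, and the helper names `conjunction` and `query` are all hypothetical, and everything runs in one process rather than across peers.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

# Hypothetical inverted index: attribute -> ids of documents that match it.
INDEX: Dict[str, Set[int]] = {
    "a": {1, 2, 3, 5, 8},
    "b": {2, 3, 5, 7},
    "c": {3, 5, 9},
    "d": {4, 9},
}

# Materialized views: set of conjoined attributes -> cached result.
VIEWS: Dict[FrozenSet[str], Set[int]] = {}


def conjunction(attrs: Iterable[str]) -> Set[int]:
    """Evaluate a1 && a2 && ..., reusing the largest cached view that applies."""
    key = frozenset(attrs)
    if not key:
        raise ValueError("a conjunction needs at least one attribute")
    if key in VIEWS:
        return set(VIEWS[key])
    # A cached view is usable if all of its attributes occur in the query.
    usable = [view for view in VIEWS if view <= key]
    best = max(usable, key=len, default=frozenset())
    result = set(VIEWS[best]) if best else None
    # Intersect with the posting lists of the attributes not yet covered.
    for attr in sorted(key - best):
        postings = INDEX.get(attr, set())
        result = set(postings) if result is None else result & postings
    VIEWS[key] = result  # cache the new result as a view
    return set(result)


def query(disjuncts: List[Iterable[str]]) -> Set[int]:
    """Evaluate (a && b && ...) || (c && d && ...) || ... as a union."""
    answer: Set[int] = set()
    for attrs in disjuncts:
        answer |= conjunction(attrs)
    return answer
```

After `conjunction(["a", "b"])` has run once, a later `conjunction(["a", "b", "c"])` starts from the cached view and only needs the posting list for "c".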

View and View Tree
A view is the cached result of a previous query.
Where to store the views: use the underlying P2P system's own mechanism. For example, in Chord, the view “a && b” is stored at the successor of Hash(“a && b”).
But this placement cannot be used to answer queries from views efficiently. Why? For “a_1 && a_2 && ... && a_k” there are 2^k possible views, and you do not know which of them exist.
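To make the 2^k problem concrete, the sketch below enumerates every view name a peer would have to probe if views were placed only by hashing their names. The functions `view_key` and `candidate_views` are hypothetical, and SHA-1 stands in for the DHT's hash function as an assumption.

```python
import hashlib
from itertools import combinations
from typing import Iterable, Iterator, Tuple


def view_key(attrs: Iterable[str]) -> str:
    """Hash of the view name; a DHT would store the view at this key's successor."""
    name = "&&".join(sorted(attrs))
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


def candidate_views(attrs: Iterable[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every non-empty subset of the query attributes, largest first."""
    ordered = sorted(set(attrs))
    for size in range(len(ordered), 0, -1):
        for combo in combinations(ordered, size):
            yield combo


# A query over k attributes has 2^k - 1 non-empty candidate views
# (2^k if the empty view is counted), each needing its own lookup.
k_attrs = ["a", "b", "c", "d"]
probes = [view_key(view) for view in candidate_views(k_attrs)]
```

With four attributes this already gives 15 separate lookups, most of which would find nothing.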

View and View Tree (Cont.)
Another possible way: a centralized, consistent list of views.
  Easy to locate
  Problems: frequent updates; storage requirements
Proposed solution: the View Tree
  Implemented as a trie: a tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
  Scalable, stateless
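A minimal trie sketch follows, for readers unfamiliar with the structure. It is a generic single-process trie, not the distributed view tree itself; the class names are hypothetical, and stored strings are marked with a flag on the node instead of the extra leaf nodes mentioned on the slide.

```python
from typing import Dict


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_stored = False  # True if a stored string ends at this node


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, key: str) -> None:
        """Add a string, creating one node per new prefix."""
        node = self.root
        for symbol in key:
            node = node.children.setdefault(symbol, TrieNode())
        node.is_stored = True

    def longest_prefix(self, key: str) -> str:
        """Return the longest stored string that is a prefix of `key`."""
        node = self.root
        best = ""
        for depth, symbol in enumerate(key, start=1):
            node = node.children.get(symbol)
            if node is None:
                break
            if node.is_stored:
                best = key[:depth]
        return best
```

Strings that share a prefix share the nodes for that prefix, which is what lets a lookup walk from short views to longer ones.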

View Tree
All level-1 nodes are single-attribute views.
A canonical order on the attributes is defined and used to uniquely identify equivalent conjunctive queries: “a && b” and “b && a” are both recorded as “a && b” (alphabetical order).
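The canonical ordering can be sketched in a few lines. The function name `canonical` is hypothetical, and the sketch assumes a query is written as attribute names separated by "&&".

```python
def canonical(query: str) -> str:
    """Rewrite a conjunctive query with its attributes in alphabetical order."""
    attrs = {part.strip() for part in query.split("&&") if part.strip()}
    return " && ".join(sorted(attrs))
```

Both `canonical("b && a")` and `canonical("a && b")` give the same string, so equivalent queries map to one view name.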

Answering Queries
Finding a smallest set of views to evaluate a query is NP-hard. Instead, the following method is used:
  Exact match, if such a match exists
  Forward progress: for each node accessed, at least one attribute must be located which does not occur in the views located so far.

Example of answering a query
Query: “cbagekhilo”
Step 1: match the prefix “cbag”.
Step 2: “cbage” is not found, but “cbagh” exists and is useful for the query (forward progress).
...
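The example can be traced with a simplified greedy lookup. This is a sketch under strong assumptions, not the Search(n, q) algorithm from the paper: attributes are single characters, the view tree is represented as a plain set of path strings, and each descent may only use attributes that are still uncovered, which is a stricter form of the forward-progress rule. The name `greedy_cover` is hypothetical.

```python
from typing import List, Set


def greedy_cover(views: Set[str], query: str) -> List[str]:
    """Pick cached views that together cover every attribute of `query`."""
    uncovered = list(query)  # attributes not yet located, in query order
    chosen: List[str] = []
    while uncovered:
        node = ""  # start each descent at the root of the view tree
        extended = True
        while extended:
            extended = False
            for attr in uncovered:
                # Follow a child only if it adds a still-uncovered attribute.
                if attr not in node and node + attr in views:
                    node += attr
                    extended = True
                    break
        if not node:
            raise KeyError("no view covers attribute %r" % uncovered[0])
        chosen.append(node)
        uncovered = [attr for attr in uncovered if attr not in node]
    return chosen


# Toy view tree: all single-attribute views plus one cached chain.
toy_views = set("cbagekhilo") | {"cb", "cba", "cbag", "cbagh"}
plan = greedy_cover(toy_views, "cbagekhilo")
```

On this toy tree the first descent reaches "cbag", skips the missing "cbage", and continues to "cbagh"; the remaining attributes are then served by their single-attribute views.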

Algorithm: Search(n, q)

Creating a balanced View Tree
For a query “a && b && c && ... && x” there are many equivalent views, each corresponding to a different position in the view tree, and any of them can represent the result of the query.
Which one should be chosen, and how can the tree be kept balanced?
Deterministically pick a permutation P, uniformly at random among all possible permutations.
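One way to read "deterministically pick a permutation uniformly at random" is to seed a pseudo-random generator from the attribute set itself, so every peer derives the same permutation. The sketch below assumes that reading; the function name `view_permutation` and the use of SHA-256 as the seed source are assumptions, not details from the slides.

```python
import hashlib
import random
from typing import Iterable, List


def view_permutation(attrs: Iterable[str]) -> List[str]:
    """Return a pseudo-random but reproducible ordering of the attributes."""
    canon = sorted(set(attrs))  # canonical form, independent of input order
    digest = hashlib.sha256("&&".join(canon).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))  # seed from the set
    perm = list(canon)
    rng.shuffle(perm)
    return perm
```

Because the seed depends only on the set of attributes, “a && b && c” and “c && a && b” land at the same position in the tree, while different attribute sets are spread over different orderings.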

Maintaining the View Tree
The owner of a new node needs to update all attribute indexes corresponding to the attributes of the new node. (Q: Across the whole network? Do all related views need to be updated? Isn’t that too expensive?)
Heartbeats are used to check the presence of the child nodes and the parent node in the view tree.
Inserting a new view is less expensive: some child pointers need to be reassigned.

Example of a new view join

Preliminary Results: Data source and methodology
Documents: TREC-Web data set
  HTML pages with keyword meta-tags
  64K different pages for each experiment
Queries: generated using the statistical characteristics of queries from search.com

Preliminary Results (cont.)
Caching benefit
Query locality benefit

On the Feasibility of P2P Web Indexing and Search From UC Berkeley & MIT

Motivation: Is P2P Web search likely to work?
Two keyword-search techniques:
  Flooding (Gnutella): not discussed in this paper
  Intersection of index lists
This paper presents a feasibility analysis based on resource constraints and workload.

Introduction
Why P2P searching is interesting:
  A good stress test for P2P architectures
  More resistant to censorship and manipulated ranking
  More robust against single-node failure
Scale of the Web: 550 billion documents; Google indexes more than 2 billion of them.
Gnutella and KaZaA: flooding search over 500 million files, typically music files; searches cover file meta-data such as title and artist.
DHT-based keyword searching: good full-text search performance with about 100,000 files (Duke Univ.).

Fundamental Constraints of the real world
The following parameters are assumed:
  Web documents: 3 billion
  Words per file: 1000
  An inverted index would hold 3*10^9 * 1000 = 3*10^12 docID entries.
  docID: 20 bytes (hash of the file content)
  Inverted index size: 6*10^13 bytes
  Queries per second: 1000 (Google)

Fundamental constraints (cont.)
Storage constraints: with 1 GB per PC, a 6*10^13-byte index needs at least 60,000 PCs.
Communication constraints:
  Assume Web search may consume 10% of backbone capacity (chosen by comparison with DNS traffic).
  US Internet backbone capacity in 1999: 100 Gbit/s
  At 1000 queries/sec, about 10 Mbit can be used for each query.
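The budget figures on these two slides follow from simple arithmetic, reproduced below as a check. The constant and function names are hypothetical, 1 GB is taken as 10^9 bytes, and the backbone figure is read as 100 Gbit per second; both readings are assumptions.

```python
# Parameters as stated on the slides.
DOCS = 3 * 10**9              # Web documents
WORDS_PER_DOC = 1000          # words per file
DOCID_BYTES = 20              # size of one docID
STORAGE_PER_PC = 10**9        # assumption: 1 GB = 10^9 bytes
BACKBONE_BITS_PER_SEC = 100 * 10**9  # assumption: 100 Gbit per second
SEARCH_SHARE_PERCENT = 10     # share of the backbone granted to search
QUERIES_PER_SEC = 1000


def index_size_bytes() -> int:
    """One docID entry per (document, word) pair."""
    return DOCS * WORDS_PER_DOC * DOCID_BYTES


def pcs_needed() -> int:
    """Minimum number of PCs needed to hold the index (ceiling division)."""
    return -(-index_size_bytes() // STORAGE_PER_PC)


def bits_per_query() -> int:
    """Communication budget available to a single query."""
    search_bits = BACKBONE_BITS_PER_SEC * SEARCH_SHARE_PERCENT // 100
    return search_bits // QUERIES_PER_SEC
```

Under these assumptions the index is 6*10^13 bytes, it needs 60,000 PCs, and each query may move 10^7 bits (10 Mbit, or 1.25 MB).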

Basic Cost Analysis
Assumptions:
  DHT-based P2P systems
  Two-term queries (each search has 2 keywords)
In the MIT data set (1.7 million Web pages and a query trace), about 300,000 bytes are moved for each query.
Scaled to the Internet (3 billion pages), a query might require about 530 Mbytes. (Q: Is that true?)
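The 530 Mbyte figure is what a linear extrapolation gives, as the short check below shows. Linear scaling with corpus size is the assumption being made here; the function name and parameter names are hypothetical.

```python
def scaled_bytes_per_query(
    measured_bytes: float = 300_000,        # bytes moved per query at MIT
    measured_pages: float = 1_700_000,      # pages in the MIT data set
    target_pages: float = 3_000_000_000,    # pages on the Web
) -> float:
    """Scale per-query traffic linearly with the number of indexed pages."""
    return measured_bytes * target_pages / measured_pages


estimate_mbytes = scaled_bytes_per_query() / 1e6  # roughly 529.4
```

So the slide's number holds if posting-list length, and hence transfer size, grows in proportion to the number of pages.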

Optimizations
All the optimizations were evaluated using the queries from mit.edu.
Caching and precomputation:
  Caching received posting lists reduces the communication cost by 38%.
  Computing and storing the intersections of posting lists in advance: when 3% of all possible term pairs are precomputed, the communication cost is reduced by 50% (Zipf distribution, most popular words).

Compression
Bloom filters:
  Two-round Bloom intersection (one node sends the Bloom filter of its posting list to another node, which returns the result): compression ratio 13
  Four-round Bloom intersection: compression ratio 40
  Compressed Bloom filters: 30% improvement
Gap compression: effective when the gaps between sorted docIDs are small. (Q: so fewer bits are required for each docID?)
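The two-round Bloom intersection and gap compression described above can be sketched as follows. This is an illustrative single-process version, not the paper's code: the class `BloomFilter`, its default sizes, the SHA-256-based hashing, and the function names are all assumptions, and docIDs are plain integers here.

```python
import hashlib
from typing import List


class BloomFilter:
    """Fixed-size Bloom filter; false positives possible, no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: int) -> List[int]:
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, item: int) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: int) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def two_round_intersection(list_a: List[int], list_b: List[int]) -> List[int]:
    # Round 1: node A sends a Bloom filter of its posting list to node B.
    bloom = BloomFilter()
    for doc_id in list_a:
        bloom.add(doc_id)
    # Round 2: node B returns its docIDs that hit the filter (a superset of
    # the true intersection, because of false positives).
    candidates = [doc_id for doc_id in list_b if doc_id in bloom]
    # Node A removes the false positives against its real list.
    in_a = set(list_a)
    return [doc_id for doc_id in candidates if doc_id in in_a]


def to_gaps(sorted_ids: List[int]) -> List[int]:
    """Gap compression: store differences between consecutive sorted docIDs."""
    gaps, previous = [], 0
    for doc_id in sorted_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps: List[int]) -> List[int]:
    """Rebuild the sorted docIDs from their gaps."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids
```

The saving in the Bloom variant comes from sending a small bit array instead of a full list of 20-byte docIDs. In the gap variant, `to_gaps([1000, 1003, 1010, 1011])` yields small numbers after the first entry, which a variable-length code can store in fewer bits.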

Compression (cont.)
Adaptive set intersection: exploit the structure of the posting lists to avoid transferring entire lists.
  Example: {1, 3, 4, 7} && {8, 10, 20, 30} requires one element exchange, since 7 < 8.
Clustering: similar documents are grouped together based on their term occurrences and assigned adjacent docIDs, which improves the compression ratio of adaptive set intersection with gap compression to 75.
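A leapfrog-style sketch of adaptive set intersection is given below. It is a single-process simulation under an assumption: in each round the list with the larger head element supplies the probe value, and the count returned is the number of probe values that would cross the network. The function name is hypothetical, and a real two-node protocol would need its own rule for who sends next.

```python
from bisect import bisect_left
from typing import List, Tuple


def adaptive_intersection(a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """Intersect two sorted lists; return (common elements, probes exchanged)."""
    i = j = 0
    common: List[int] = []
    probes = 0
    while i < len(a) and j < len(b):
        probes += 1
        if a[i] >= b[j]:
            probe = a[i]
            # The other side skips everything smaller than the probe.
            j = bisect_left(b, probe, j)
            if j < len(b) and b[j] == probe:
                common.append(probe)
                j += 1
            i += 1
        else:
            probe = b[j]
            i = bisect_left(a, probe, i)
            if i < len(a) and a[i] == probe:
                common.append(probe)
                i += 1
            j += 1
    return common, probes
```

For the slide's example the single probe 8 lets the first list skip all four of its elements, so the intersection is found to be empty after one exchange.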

Optimization Techniques and Improvements
Even with these optimizations, the cost is still one order of magnitude higher than the budget.

Two Compromises and Conclusion
Compromises:
  Compromising result quality
  Compromising P2P structure
Conclusion:
  Naïve implementations of P2P Web search are not feasible.
  The most effective optimizations bring the problem to within an order of magnitude of feasibility.
  Two possible compromises are proposed.
  All of them combined will bring us within the feasibility range for P2P Web search. (Q: Sure?)