18.01.2005 Efficient Peer-to-Peer Keyword Searching 1 Efficient Peer-to-Peer Keyword Searching Patrick Reynolds and Amin Vahdat presented by Volker Kudelko.

Slides:

Advertisements

Similar presentations

Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT and Berkeley presented by Daniel Figueiredo Chord: A Scalable Peer-to-peer.

Advertisements

Scalable Content-Addressable Network Lintao Liu

Resource Management §A resource can be a logical, such as a shared file, or physical, such as a CPU (a node of the distributed system). One of the functions.

Operating Systems Lecture 10 Issues in Paging and Virtual Memory Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard. Zhiqing.

Pastiche: Making Backup Cheap and Easy. Introduction Backup is cumbersome and expensive Backup is cumbersome and expensive ~$4/GB/Month (now $0.02/GB)

Clayton Sullivan PEER-TO-PEER NETWORKS. INTRODUCTION What is a Peer-To-Peer Network A Peer Application Overlay Network Network Architecture and System.

CHORD – peer to peer lookup protocol Shankar Karthik Vaithianathan & Aravind Sivaraman University of Central Florida.

©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.

Precept 6 Hashing & Partitioning 1 Peng Sun. Server Load Balancing Balance load across servers Normal techniques: Round-robin? 2.

1 1 Chord: A scalable Peer-to-peer Lookup Service for Internet Applications Dariotaki Roula

Denial-of-Service Resilience in Peer-to-Peer Systems D. Dumitriu, E. Knightly, A. Kuzmanovic, I. Stoica and W. Zwaenepoel Presenter: Yan Gao.

Hit or Miss ? !!!.  Cache RAM is high-speed memory (usually SRAM).  The Cache stores frequently requested data.  If the CPU needs data, it will check.

Web Caching Schemes1 A Survey of Web Caching Schemes for the Internet Jia Wang.

Peer-to-Peer Based Multimedia Distribution Service Zhe Xiang, Qian Zhang, Wenwu Zhu, Zhensheng Zhang IEEE Transactions on Multimedia, Vol. 6, No. 2, April.

A Trust Based Assess Control Framework for P2P File-Sharing System Speaker ： Jia-Hui Huang Adviser : Kai-Wei Ke Date ： 2004 / 3 / 15.

A Scalable Content-Addressable Network Authors: S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker University of California, Berkeley Presenter:

Exploiting Content Localities for Efficient Search in P2P Systems Lei Guo 1 Song Jiang 2 Li Xiao 3 and Xiaodong Zhang 1 1 College of William and Mary,

Scaling Personalized Web Search Glen Jeh, Jennfier Widom Stanford University Presented by Li-Tal Mashiach Search Engine Technology course (236620) Technion.

© 2007 Pearson Education Inc., Upper Saddle River, NJ. All rights reserved.1 Computer Networks and Internets with Internet Applications, 4e By Douglas.

Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.

A Distributed Search Service for Peer-to-Peer File Sharing in Mobile Application Presented by Tony Sung On Loy, MC Lab, CUHK IE 1 A Distributed Search.

Chord-over-Chord Overlay Sudhindra Rao Ph.D Qualifier Exam Department of ECECS.

Freenet A Distributed Anonymous Information Storage and Retrieval System I Clarke O Sandberg I Clarke O Sandberg B WileyT W Hong.

Comparing Hybrid Peer-to-Peer Systems Beverly Yang and Hector Garcia-Molina Presented by Marco Barreno November 3, 2003 CS 294-4: Peer-to-peer systems.

1 CS 194: Distributed Systems Distributed Hash Tables Scott Shenker and Ion Stoica Computer Science Division Department of Electrical Engineering and Computer.

Wide-area cooperative storage with CFS

1 The Mystery of Cooperative Web Caching 2 b b Web caching : is a process implemented by a caching proxy to improve the efficiency of the web. It reduces.

ICDE A Peer-to-peer Framework for Caching Range Queries Ozgur D. Sahin Abhishek Gupta Divyakant Agrawal Amr El Abbadi Department of Computer Science.

On-Demand Media Streaming Over the Internet Mohamed M. Hefeeda, Bharat K. Bhargava Presented by Sam Distributed Computing Systems, FTDCS Proceedings.

1CS 6401 Peer-to-Peer Networks Outline Overview Gnutella Structured Overlays BitTorrent.

Hash, Don’t Cache: Fast Packet Forwarding for Enterprise Edge Routers Minlan Yu Princeton University Joint work with Jennifer.

Roger ZimmermannCOMPSAC 2004, September 30 Spatial Data Query Support in Peer-to-Peer Systems Roger Zimmermann, Wei-Shinn Ku, and Haojun Wang Computer.

By Ravi Shankar Dubasi Sivani Kavuri A Popularity-Based Prediction Model for Web Prefetching.

Network Aware Resource Allocation in Distributed Clouds.

Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications Xiaozhou Li COS 461: Computer Networks (precept 04/06/12) Princeton University.

Full-Text Search in P2P Networks Christof Leng Databases and Distributed Systems Group TU Darmstadt.

Scalable Web Server on Heterogeneous Cluster CHEN Ge.

Efficient Peer to Peer Keyword Searching Nathan Gray.

Structuring P2P networks for efficient searching Rishi Kant and Abderrahim Laabid Abderrahim Laabid.

Efficient P2P Searches Using Result-Caching From U. of Maryland. Presented by Lintao Liu 2/24/03.

Introduction to dCache Zhenping (Jane) Liu ATLAS Computing Facility, Physics Department Brookhaven National Lab 09/12 – 09/13, 2005 USATLAS Tier-1 & Tier-2.

G063 - Distributed Databases. Learning Objectives: By the end of this topic you should be able to: explain how databases may be stored in more than one.

AlvisP2P : Scalable Peer-to-Peer Text Retrieval in a Structured P2P Network Toan Luu, Gleb Skobeltsyn, Fabius Klemm, Maroje Puh, Ivana Podnar Zarko, Martin.

The Replica Location Service The Globus Project™ And The DataGrid Project Copyright (c) 2002 University of Chicago and The University of Southern California.

Fast Crash Recovery in RAMCloud. Motivation The role of DRAM has been increasing – Facebook used 150TB of DRAM For 200TB of disk storage However, there.

Efficient P2P Search by Exploiting Localities in Peer Community and Individual Peers A DISC’04 paper Lei Guo 1 Song Jiang 2 Li Xiao 3 and Xiaodong Zhang.

A Low-bandwidth Network File System Athicha Muthitacharoen et al. Presented by Matt Miller September 12, 2002.

Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.

Exact Regenerating Codes on Hierarchical Codes Ernst Biersack Eurecom France Joint work and Zhen Huang.

Peer to Peer Network Design Discovery and Routing algorithms

Bandwidth-Efficient Continuous Query Processing over DHTs Yingwu Zhu.

A Bandwidth Scheduling Algorithm Based on Minimum Interference Traffic in Mesh Mode Xu-Yajing, Li-ZhiTao, Zhong-XiuFang and Xu-HuiMin International Conference.

Peer-to-Peer Video Systems: Storage Management CS587x Lecture Department of Computer Science Iowa State University.

CS 347Notes081 CS 347: Parallel and Distributed Data Management Notes 08: P2P Systems.

P2P Search COP P2P Search Techniques Centralized P2P systems  e.g. Napster, Decentralized & unstructured P2P systems  e.g. Gnutella.

CSCI 599: Beyond Web Browsers Professor Shahram Ghandeharizadeh Computer Science Department Los Angeles, CA

Large Scale Sharing Marco F. Duarte COMP 520: Distributed Systems September 19, 2004.

Geethanjali College Of Engineering and Technology Cheeryal( V), Keesara ( M), Ranga Reddy District. I I Internal Guide Mrs.CH.V.Anupama Assistant Professor.

BUFFALO: Bloom Filter Forwarding Architecture for Large Organizations Minlan Yu Princeton University Joint work with Alex Fabrikant,

Authors: Jiang Xie, Ian F. Akyildiz

Advanced Topics in Concurrency and Reactive Programming: Case Study – Google Cluster Majeed Kassis.

Information Retrieval in Practice

Data Dissemination and Management (2) Lecture 10

CHAPTER 3 Architectures for Distributed Systems

EE 122: Peer-to-Peer (P2P) Networks

Edge computing (1) Content Distribution Networks

How Yahoo! use to serve millions of videos from its video library.

Hash Functions for Network Applications (II)

Data Dissemination and Management (2) Lecture 10

Presentation transcript:

Efficient Peer-to-Peer Keyword Searching 1 Efficient Peer-to-Peer Keyword Searching Patrick Reynolds and Amin Vahdat presented by Volker Kudelko

Efficient Peer-to-Peer Keyword Searching Overview 1. Introduction 2. Some techniques 3. Simulator design 4. Analysing of the techniques in the simulator 5. Conclusion

Efficient Peer-to-Peer Keyword Searching Introduction recent file storage applications built on top of peer-to- peer distributed hash tables do not contain good search capabilities but search is a very important part!  design and analyze a distributed search engine based on a distributed hash table.

Efficient Peer-to-Peer Keyword Searching Introduction what´s important for a good search system?  end user latency! comes mostly from network transfer times  try to reduce the number of bytes sent most queries contain several keywords  minimizing the network traffic for multiple-keyword- search

Efficient Peer-to-Peer Keyword Searching Some techniques partitioning centralized or distributed design bloom filters caches incremental results virtual hosts

Efficient Peer-to-Peer Keyword Searching Partitioning horizontal vs. vertical horizontal: dokuments are assigned to exactly one node, which stores the dokument‘s keywords benefit: to insert or update a dokument witn n keywords requires to contact just one single node (vertikal needs to contact all servers)

Efficient Peer-to-Peer Keyword Searching Partitioning vertical: one ore more keywords are assigned to exactly one node which stores a match-list of the dokuments containing these keywords benefit: a search containing k keywords ensures that not more than k servers must be contacted (horizontal needs to contact all servers)

Efficient Peer-to-Peer Keyword Searching Partitioning so far so good… what to use? because of the design of an archival storage system, we assume that most files change rarely and those which change do this very often so, we believe that queries will outnumber updates  we choose the advantages of a vertically partitioned system for our search engine

Efficient Peer-to-Peer Keyword Searching Centralized or distributed distributed search services provide more benefits than a centralized one: no single point of failure less risk of  network outages,  denial-of-service attacks and  censorship by domestic or foreign authorities more replication in distributed systems  better availability

Efficient Peer-to-Peer Keyword Searching Bloom filters what is the opportunity to use them? first take a look at a simple „AND“ query without bloom filters s A sends the list of documents with keyword A to s B s B sends sends the intersection A∩B to the client

Efficient Peer-to-Peer Keyword Searching Bloom filters now – what is the situation with bloom filters? server s A sends instead of A a bloom filter F(A) server s B calculates B ∩ F(A) server s A performs then the intersection of A and B ∩ F(A) to remove the false positves the results are sent to the client consequence: decrease in the number of excess bits send but addional step needed for romoving the false positives

Efficient Peer-to-Peer Keyword Searching Bloom filters – a closer look a method for representing a set of n elements to support membership queries by using a hash-based data structure allocate a vector v of m bits, initially all set to 0 choose k independent hash functions, with range 1...m bits at positions h 1 (a),...,h k (a) in v are set to 1 given a query for b, check the bits at positions h 1 (b),...,h k (b). If any of them are 0, then b is not in the set A. otherwise we assume that b is in the set, although this could be wrong (false positive) parameters k and m should be chosen so that the probability of a false positive is acceptable.

Efficient Peer-to-Peer Keyword Searching Bloom filters number of excess bit send example:  A and B contain documents in their list  each document identifier has 128 bits without bloom filters: 10,000 ∙ 128 = 1,280,000 bits with bloom filters  probability of false positves: p fp = m/n m= # bits in the boom filter, n = # elements in the set   # of excess bits sent: m + p fp |B| j = m m/|A| |B| j   computing optimal bloom filter size m (derivation set to 0 and solve for m) gives:   # of excess bits send with bloom filters = 106,544   represents a compression of : 1 !

Efficient Peer-to-Peer Keyword Searching Caches server locally store the data they receive like A or F(A) better caching bloom filters   they are smaller   possible to store data for more keywords improvements in the hit rate   reduce the minimum of excess bits

Efficient Peer-to-Peer Keyword Searching Incremental results the number of results is proportional to the number of documents in the network  also the bandwidth costs grow lineraly with the network size  bloom filters and caches provide a substantial cost improvement but cannot eliminate the linear growth in cost solution: truncate the results  devide the set A in chunks and send several bloom filters until a desired number of results is achieved

Efficient Peer-to-Peer Keyword Searching Incremental results advantage:  caches can store several fractional bloom filters  servers can retain or discard partial entries in the cache  storing only a fraction reduces the amount of space disadvantage:  Higher CPU load for processing the search (costs for B ∩ F(A i ) are equal to B ∩ F(A)

Efficient Peer-to-Peer Keyword Searching Virtual hosts a problem of peer-to-peer systems is the heterogeneity, i.e. there are serveral different powerful machines can happen that an individual machine is assigned to an inappropriate number of keywords idea: using of virtual hosts each machine gets a „power“ index proportional to its processing capacity and referring to the baseline host e.g. machine A gets index 10, i.e. it is ten times more powerful than the baseline host and gets 10 virtual IDs power index refers to the CPU power, network bandwidth and memory

Efficient Peer-to-Peer Keyword Searching Simulator Design Java application 1.85 GB HTML data with 1.17 million unique words in 105,953 documents 95,409 searches perfomed with 43,344 unique keywords testing 3 distributions  Modem  Backbone links  Gnutella based network (heterogeneous) storage range from 1 to 100 MB hosts are placed at random in a 2,500 miles square-grid

Efficient Peer-to-Peer Keyword Searching Simulator Design performing a query:  obtaining up to M results for each keyword  each host intersects its set with the data from the previous one  forward to the subsequent host  order given by the number of documents found (few many)  use of a bloom filter depends on the size of the set (threshold)  each node that sent a bloom filter is a potentially candidate to remove „false positives“  if the resulting documents are less than desired, m is increased adaptively

Efficient Peer-to-Peer Keyword Searching Analysing of the techniques in the simulator scalability and Virtual Hosts bloom filters caching putting it all together

Efficient Peer-to-Peer Keyword Searching Scalability and virtual hosts goal: reducing the bytes sent per request  minimizing the end-to-end latency virtual host concentrate popular data mostly powerful machines  higher probability to handle queries entirely locally  only a small effect on bytes send per query, but..

Efficient Peer-to-Peer Keyword Searching Scalability and virtual hosts … a much greater effect on network time per query no bottleneck with virtual hosts better load balancing, faster machines handle a larger fraction of requests

Efficient Peer-to-Peer Keyword Searching Bloom filters problem: using bloom filters always  unnecessary data transfers (eliminate „false positives“) solution: use them only when the time saved out- weigh the time sending clean-up messages what is the optimal threshold for bloom filters?  threshold approximately 300

Efficient Peer-to-Peer Keyword Searching Caching what are the real benefits of caching? allowing only „false positives“ rates of 0%,1% and 10 % in the returning documents the lower the false positives rates (additional removing steps needed), the higher the network traffic  only a small effect on bytes send per query, but..

Efficient Peer-to-Peer Keyword Searching Caching... a much greater effect on network time per query allowing more „false positves“ causes eliminating a number of required message transmissions

Efficient Peer-to-Peer Keyword Searching Putting it all together splitting the end-to-end time in 3 components:  CPU processing time  network transmission time (bytes transfered divided by the network connection speed of the slowest host)  latency (time required for a bit to travel through the line) analysing under 3 network conditions  WAN  heterogeneous  modem

Efficient Peer-to-Peer Keyword Searching Putting it all together heterogeneous  the use of virtual host and caching together has the most powerful effect  using virtual hosts  keywords concentraiting on nodes with high network bandwidth WAN:  transmission time negligible (very high transmission speeds)  caching reduces CPU time by avoiding to calculate and to transmit Bloom filters  virtual hosts also reduce CPU time, requests are concentrated to more powerful nodes

Efficient Peer-to-Peer Keyword Searching Putting it all together modem  end-to-end query time dominated by transmission time  no effect of virtual hosts (homogeneous)

Efficient Peer-to-Peer Keyword Searching Conclusion the use of virtual hosts and caching together has the most pronounced effect on the heterogeneous network,  reducing average query response times by 59% in particular, the use of virtual hosts reduces the network transmission portion of average query response times by 48% by concentrating keywords on the subset of nodes with more network bandwidth caching uniformly reduces all aspects of the average query time, in particular reducing the latency components by 47% in each case by eliminating the need for a significant portion of network communication

Efficient Peer-to-Peer Keyword Searching thank you for your attention! questions?