1 Reading Report 3 Yin Chen 20 Feb 2004 Reference: Efficient Search in Peer-to-Peer Networks, Beverly Yang, Hector Garcia-Molina, In 22 nd Int. Conf. on.

Presentation transcript:

1 Reading Report 3 Yin Chen 20 Feb 2004 Reference: Efficient Search in Peer-to-Peer Networks, Beverly Yang, Hector Garcia-Molina, In 22nd Int. Conf. on Distributed Computing Systems, Vienna, Austria, July Papers/P2P_gnutella_search.pdf

2 P2P Systems
P2P systems:
- Distributed systems in which nodes of equal roles and capabilities exchange information and services directly with each other.
- Have become a popular way to share huge volumes of data, e.g. Morpheus has 470,000 users sharing a total of 0.36 petabytes of data.
Strengths:
- Self-organization
- Load balancing
- Adaptation
- Fault tolerance
Challenge: search and retrieval of data
- Systems that guarantee location, e.g. Chord, Pastry, Tapestry and CAN, tightly control data placement and topology within the network. They currently support search by object identifier only.
- Loose P2P systems, e.g. Gnutella, Freenet, Napster and Morpheus, are scalable and support richer queries, but do not strictly control data placement or topology. Their current search techniques are very inefficient.

3 P2P Systems Examples
Napster
- Not a pure P2P system, but a hybrid one containing some centralized components.
Morpheus
- “Super-peer” nodes act as a centralized resource for a small number of clients; the super-peers connect to each other to form a pure P2P network.
Chord, CAN, Pastry, Tapestry: strong guarantees on availability
- In Chord, Pastry and Tapestry, nodes are assigned a numerical identifier.
- In CAN, nodes are assigned regions in a d-dimensional identifier space.
- A node is responsible for owning objects, or pointers to objects, whose identifiers map to the node’s identifier or region.
- The system can locate an object by its identifier within a bounded number of hops.
Others
- Nodes index the content of other nodes, allowing queries to be processed over a large fraction of all content while actually visiting just a small number of nodes.
- Each node maintains metadata that can provide “hints”. Hints can be formed by building summaries of the content reachable through each connection, or by learning user preferences. Query messages are routed by nodes making local decisions based on these hints.

4 Metrics of Efficiency Measurement
Cost
- Average Aggregate Bandwidth: the average, over a set of representative queries Q_rep, of the aggregate bandwidth consumed (in bytes) over every edge in the network on behalf of each query.
- Average Aggregate Processing Cost: the average, over a set of representative queries Q_rep, of the aggregate processing power consumed at every node in the network on behalf of each query.
Quality of Results
- Number of results: the size of the total result set.
- Satisfaction of the query: rather than returning every result, the system notifies the user of only the first Z results, where Z is some value specified by the user. A query is satisfied if Z or more results are returned. Given a sufficiently large Z, the user can find what she is looking for among the first Z results. E.g., if Z = 50, a query that returns 1000 results performs no better than a query returning 100 results, in terms of satisfaction.
- Time to Satisfaction: how long the user must wait for results to arrive, i.e. the time that elapses from when the user first submits the query to when the user’s client receives the Zth result.
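The two satisfaction metrics above can be illustrated with a minimal sketch. It assumes each result is represented only by its arrival time in seconds since the query was submitted; the function names `is_satisfied` and `time_to_satisfaction` are hypothetical and do not come from the paper.

```python
def is_satisfied(arrival_times, z):
    """A query is satisfied once at least z results have come back."""
    return len(arrival_times) >= z


def time_to_satisfaction(arrival_times, z):
    """Seconds from submission until the z-th result arrives.

    Returns None when fewer than z results ever arrive.
    """
    if len(arrival_times) < z:
        return None
    # The z-th earliest arrival is the moment the query becomes satisfied.
    return sorted(arrival_times)[z - 1]


# With Z = 50, 100 results and 1000 results are equally "satisfying".
print(is_satisfied([0.1] * 100, 50), is_satisfied([0.1] * 1000, 50))
print(time_to_satisfaction([0.5, 0.2, 0.9], 2))  # 0.5
```

Note that only the count and the arrival order matter here, which is why a query returning far more than Z results scores no better on satisfaction.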

5 Metrics of Efficiency Measurement Current Techniques
Gnutella:
- Uses a breadth-first traversal (BFS) over the network with depth limit D, where D is the system-wide maximum time-to-live (TTL) of a message, measured in hops.
- A source node S sends the Query message with a TTL of D to all its neighbors, at depth 1.
- Each node at depth 1 processes the query, sends any results to S, and then forwards the query to all of its neighbors, at depth 2.
- The process continues until depth D is reached (i.e., when the message reaches its maximum TTL), at which point the message is dropped.
Freenet:
- Uses a depth-first traversal (DFS) with depth limit D.
- Each node forwards the query to a single neighbor and waits for a definite response from that neighbor before either forwarding the query to another neighbor (if the query was not satisfied) or forwarding results back to the query source (if the query was satisfied).
Conclusion
- With BFS, the query reaches every possible node as quickly as possible, but wastes bandwidth and processing power, because most queries can be satisfied from the responses of relatively few nodes.
- With DFS, the query is processed sequentially, so a search can be terminated as soon as the query is satisfied and cost is minimized, but response time is poor, with the worst case being exponential in D.
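The depth-limited flooding described for Gnutella can be sketched as follows. This is a simplified model, not the Gnutella protocol itself: the overlay is assumed to be an adjacency dictionary, duplicate deliveries are simply ignored, and the name `flood_bfs` is hypothetical.

```python
from collections import deque


def flood_bfs(graph, source, max_depth):
    """Return {node: depth} for every node a query reaches within max_depth hops.

    graph is an adjacency dict: node -> list of neighbor nodes.
    """
    depth = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        if depth[node] == max_depth:
            continue  # TTL exhausted: the message is dropped here
        for neighbor in graph[node]:
            if neighbor not in depth:  # a node handles each query only once
                depth[neighbor] = depth[node] + 1
                frontier.append(neighbor)
    return depth


overlay = {
    "S": ["A", "B"],
    "A": ["S", "C"],
    "B": ["S", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
print(flood_bfs(overlay, "S", 2))  # D is 3 hops away, so it is not reached
```

Every node within D hops processes the query, whether or not earlier depths already produced enough results, which is the waste the conclusion above refers to.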

6 Efficient Search Techniques
Basic idea
- Reduce the number of nodes that receive and process each query.
Iterative Deepening
- Reduces the number of nodes that are queried by iteratively sending the query to more nodes until the query is answered.
Directed BFS
- Queries a restricted set of nodes, intelligently selected to maximize the chance that the query will be answered.
Local Indices
- Nodes maintain very simple and small indices over other nodes’ data. Queries are processed by a small set of nodes.

7 Efficient Search Techniques Iterative Deepening
Working Mechanisms
- Specify a system-wide policy P that indicates at which depths the iterations are to occur. E.g., P = {a, b, c}: the first iteration searches to depth a, the second to depth b, and the third to depth c.
- Specify a waiting period W, the time between successive iterations in the policy.
- A source node S initiates a BFS of depth a by sending out a Query message with TTL = a. Once the query reaches depth a, it stops propagating and is “frozen” at all nodes a hops from the source S. After waiting for a time period W, S checks whether the query has already been satisfied.
- If the query has not been satisfied, S initiates the next BFS by sending out a Resend message with TTL = b. On receiving a Resend message, a node simply forwards it. Nodes at depth a drop the Resend message and “unfreeze” the corresponding query by forwarding the Query message with a TTL of b – a.
- The process continues in a similar fashion for the other levels in the policy. Since c is the depth of the last iteration in the policy, queries are not frozen at depth c, and S does not initiate another iteration, even if the query is still not satisfied.
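The control flow of iterative deepening can be sketched as below. This is a deliberately simplified model: it recomputes the reached set at each iteration instead of freezing and unfreezing queries with Resend messages, and it ignores the waiting period W. The names `nodes_within`, `iterative_deepening` and `results_at` are hypothetical.

```python
from collections import deque


def nodes_within(graph, source, depth_limit):
    """Set of nodes (excluding source) reachable within depth_limit hops."""
    depth = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        if depth[node] == depth_limit:
            continue
        for neighbor in graph[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                frontier.append(neighbor)
    return set(depth) - {source}


def iterative_deepening(graph, results_at, source, policy, z):
    """Search to each depth in policy until at least z results are found.

    results_at maps a node to the number of results it holds for the query.
    Returns (depth searched, results found, satisfied?).
    """
    total = 0
    for depth in policy:
        reached = nodes_within(graph, source, depth)
        total = sum(results_at.get(node, 0) for node in reached)
        if total >= z:
            return depth, total, True  # satisfied: no further iteration
    # The last depth in the policy was searched without satisfying the query.
    return (policy[-1] if policy else 0), total, False


chain = {"S": ["A"], "A": ["S", "B"], "B": ["A", "C"], "C": ["B"]}
hits = {"A": 1, "B": 2, "C": 5}
print(iterative_deepening(chain, hits, "S", [1, 2, 3], z=3))  # (2, 3, True)
```

In this example the search stops at depth 2, so node C never sees the query; that avoided work is where the cost saving comes from.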

8 Efficient Search Techniques Directed BFS
Disadvantage of Iterative Deepening: the time taken by multiple iterations.
Directed BFS: a node sends queries to only a subset of its neighbors, reducing the number of nodes that receive the query without decreasing the quality of results. A node maintains statistics on its neighbors and selects the best neighbor based on heuristics, e.g.:
- Select the neighbor that has returned the highest number of results for previous queries.
- Select the neighbor whose response messages have taken the lowest average number of hops. A low hop count may suggest that this neighbor is close to nodes containing useful data.
- Select the neighbor that has forwarded the largest number of messages (of all types) since our client first connected to it. A high message count implies that this neighbor is stable, since we have been connected to it for a long time, and that it can handle a large flow of messages.
- Select the neighbor with the shortest message queue. A long message queue implies that the neighbor’s pipe is saturated, or that the neighbor has died.
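The neighbor-selection step can be sketched as a lookup over per-neighbor statistics. The record layout (`NeighborStats`), the heuristic keys and the sample numbers are all hypothetical, chosen only to mirror the four heuristics listed above.

```python
from dataclasses import dataclass


@dataclass
class NeighborStats:
    name: str
    results_returned: int      # results for previous queries
    avg_response_hops: float   # average hops taken by its responses
    messages_forwarded: int    # all message types since connecting
    queue_length: int          # messages currently queued


# Each heuristic is a sort key; the neighbor with the smallest key wins,
# so "largest is best" statistics are negated.
HEURISTICS = {
    "most_results": lambda s: -s.results_returned,
    "fewest_hops": lambda s: s.avg_response_hops,
    "most_messages": lambda s: -s.messages_forwarded,
    "shortest_queue": lambda s: s.queue_length,
}


def pick_neighbor(neighbors, heuristic):
    """Name of the single neighbor to which the query is forwarded."""
    return min(neighbors, key=HEURISTICS[heuristic]).name


stats = [
    NeighborStats("A", 40, 3.5, 900, 12),
    NeighborStats("B", 75, 4.0, 300, 2),
    NeighborStats("C", 10, 2.1, 1500, 30),
]
print(pick_neighbor(stats, "most_results"))    # B
print(pick_neighbor(stats, "shortest_queue"))  # B
```

Keeping each heuristic as a plain key function makes it easy to compare them against a random baseline on the same statistics.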

9 Efficient Search Techniques Local Indices
Local Indices: a node n maintains an index over the data of each node within r hops of itself, where r is a system-wide variable known as the radius of the index. On receiving a Query message, the node can process the query on behalf of every node within r hops of itself. Since users may not want to devote large amounts of storage to other users’ data, r should be small.
Working Mechanism
- A policy specifies the depths at which the query should be processed. All nodes at depths not listed in the policy simply forward the query to the next depth. E.g., the policy is P = {1, 5}. A source S sends the query. Nodes at depth 1 process the query and forward it to all their neighbors at depth 2. Nodes at depth 2 do not process the query, but simply forward it to depth 3. Eventually, nodes at depth 5 process the query, since depth 5 is in the policy.
Extra work when a node joins, leaves or updates
- When a node joins the network, it sends a Join message with TTL = r, containing its metadata. All nodes within r hops receive the message, and each sends back a Join message containing its own metadata. Both nodes add each other’s metadata to their own index.
- When a node leaves the network or dies, other nodes remove its metadata after a timeout. A node may also be assumed dead if it fails to respond to Ping messages several times.
- When a node updates its data, it sends out an Update message with TTL = r, containing metadata. All nodes receiving the message update their index.
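A local index of radius r can be sketched as a map from file name to the nearby nodes that hold it. The sketch assumes the metadata is just a list of file names per node and that the index also covers the node's own files; `build_local_index`, `should_process` and `files_at` are hypothetical names.

```python
from collections import deque


def build_local_index(graph, files_at, node, radius):
    """Index the files of every node within radius hops of node (itself included).

    Returns {file name: set of nodes holding that file}.
    """
    depth = {node: 0}
    frontier = deque([node])
    while frontier:
        current = frontier.popleft()
        if depth[current] == radius:
            continue  # do not index beyond the radius
        for neighbor in graph[current]:
            if neighbor not in depth:
                depth[neighbor] = depth[current] + 1
                frontier.append(neighbor)
    index = {}
    for member in depth:
        for name in files_at.get(member, ()):
            index.setdefault(name, set()).add(member)
    return index


def should_process(depth, policy):
    """Nodes at depths outside the policy only forward the query."""
    return depth in policy


chain = {"S": ["A"], "A": ["S", "B"], "B": ["A", "C"], "C": ["B"]}
files = {"A": ["x.mp3"], "B": ["y.mp3"], "C": ["x.mp3"]}
print(build_local_index(chain, files, "A", 1))
print(should_process(2, {1, 5}), should_process(5, {1, 5}))  # False True
```

With radius 1, node A can answer for S, itself and B without those nodes ever processing the query, which is why the policy can skip depths.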

10 Experimental Setup Data Collection
- Data were gathered from the Gnutella network, the largest open P2P system in operation, with about [...] users.
- For a period of one month, a Gnutella client was run that observed messages as they passed through the network.
- Gnutella Pong messages contain the IP address of the node that originated the message and the number of files stored at that node. Extracting this number from all Pong messages gives the distribution of collection sizes in the network.
Observed results: [figure not reproduced in the transcript]

11 Experimental Setup Calculating Costs -- Bandwidth Cost
Estimating the size of a message:
- A Gnutella header is 22 bytes; TCP/IP and Ethernet headers are 58 bytes; the query string is L(Q) bytes.
- Total message size = 82 + L(Q) bytes. (The two header sizes listed account for 80 of the 82 bytes; the slide does not itemize the remaining 2.)
The aggregate bandwidth for a query under BFS is a sum, over each depth n, of two terms:
- The first term gives the bandwidth consumed by sending the Query message from level n-1 to level n. N(Q, n) is the number of nodes at depth n, C(Q, n) is the number of redundant edges between depths n-1 and n, and a(Q) is the size of the Query message.
- The second term gives the bandwidth consumed by transferring Response messages from n hops away back to the query source. M(Q, n) is the number of Response messages returned from nodes n hops away, and these Response messages contain a total of R(Q, n) result records. The size of a Response message header is d, and the size of a result record is c (on average), hence the total size of all Response messages returned from n hops away is c·R(Q, n) + d·M(Q, n). These messages must take n hops to get back to the source; hence, the bandwidth consumed is n times the size of all responses.
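The formula itself is not reproduced in the transcript. The sketch below is one reading of the two terms described above, and it rests on an assumption: the forwarding term is taken to be a(Q)·(N(Q, n) + C(Q, n)), i.e. one Query message per node reached plus one per redundant edge. The function name `bfs_bandwidth` and the sample sizes are hypothetical.

```python
def bfs_bandwidth(levels, query_size, record_size, header_size):
    """Aggregate bandwidth in bytes for one query under BFS.

    levels[n-1] = (N, C, R, M) for depth n = 1..D, where
      N = nodes at depth n, C = redundant edges between depths n-1 and n,
      R = result records returned from depth n, M = Response messages from depth n.
    query_size = a(Q), record_size = c, header_size = d.
    """
    total = 0
    for n, (nodes, redundant, records, responses) in enumerate(levels, start=1):
        # Assumed form of the first term: one Query per node plus redundant copies.
        forwarding = query_size * (nodes + redundant)
        # Second term: all responses from depth n travel n hops back to the source.
        responding = n * (record_size * records + header_size * responses)
        total += forwarding + responding
    return total


# Two depths with made-up counts: 580 bytes at depth 1, 3140 bytes at depth 2.
print(bfs_bandwidth([(4, 0, 2, 1), (12, 3, 10, 4)], 100, 50, 80))  # 3720
```

The factor n on the response term is what makes results from distant nodes expensive, independent of how the forwarding term is modeled.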

12 Experimental Setup Calculating Costs -- Processing Cost
- The cost of each action is expressed in terms of coarse units.
- Costs were estimated by running each type of action on a 930 MHz processor (Pentium III, running Linux kernel version 2.2) and recording the CPU time spent on each action.
- While CPU time will vary between machines, the relative cost of actions should remain roughly the same.
- The base unit is defined as the cost of transferring a Resend message, which is roughly 2000 cycles.

13 Experimental Setup Calculating Costs -- Averaging Results
The performance of a technique is measured by averaging its performance over every query. E.g., the average aggregate bandwidth for a breadth-first search is the mean, over the representative queries Q_rep, of the aggregate bandwidth of each query.
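The averaging step can be written out as follows. The formula is not reproduced in the transcript, so this is a reconstruction from the definition of average aggregate bandwidth on slide 4; the symbol names BW_BFS and the bar notation are assumptions.

```latex
\[
  \overline{BW}_{\mathrm{BFS}}
  \;=\;
  \frac{1}{|Q_{rep}|} \sum_{Q \in Q_{rep}} BW_{\mathrm{BFS}}(Q)
\]
```

Here BW_BFS(Q) stands for the aggregate bandwidth consumed by a single query Q under BFS, and |Q_rep| is the number of representative queries.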

14 Experiments Results Iterative deepening Cost Comparison
- As d increases, the bandwidth consumption of policy P_d increases.
- The larger d is, the more likely the policy is to waste bandwidth by sending the query out to too many nodes. Sending the query to more nodes than necessary consumes extra bandwidth, both for forwarding the Query message and for transferring Response messages back to the source.
- As W decreases, bandwidth usage increases.
- For W = 1, when d is large, bandwidth usage actually decreases as d increases.

15 Experiments Results Iterative deepening Quality of Results
- As W increases, the time spent on each iteration grows longer. In addition, as d decreases, the number of iterations needed to satisfy a query increases, on average. In both cases, the time to satisfaction increases. (Figure 3)
- In deciding which policy is best to use in practice, one must keep in mind what time to satisfaction the user can tolerate in an interactive system. E.g., suppose a system requires the average time to satisfaction to be no more than 9 seconds. Several combinations of d and W result in this time to satisfaction, e.g., d = 4 and W = 4, or d = 5 and W = 6. But according to Figure 1 and Figure 2, the policy and waiting period that minimize cost are P_5 and W = 6 (with savings of 72% in aggregate bandwidth and 52% in aggregate processing cost over BFS). We would therefore recommend this policy for the system.

16 Experiments Results Iterative deepening Variations
Two factors affect the iterative deepening technique:
- The desired number of results Z
- The number of neighbors
As Z increases, satisfaction decreases, but it does not drop very quickly.
As the number of neighbors decreases, satisfaction decreases, since fewer neighbors translate into fewer results. It also does not drop much, however: in most cases the results were still enough to satisfy the query.
Decreasing the number of neighbors might decrease satisfaction a little, but it decreases the cost significantly:
- 4 neighbors require just 55% to 65% of the bandwidth required with 8 neighbors. (Figure 5)
- 4 neighbors require just 55% of the processing cost required with 8 neighbors. (Figure 6)

17 Experiments Results Directed BFS 8 heuristics for Directed BFS. The RAND heuristic, choosing a neighbor at random, is used as the baseline for comparison with other heuristics.

18 Experiments Results Directed BFS Quality of Results
- Past performance is a good indicator of future performance.
- Figure 7 shows the probability of satisfaction for the different heuristics, for different values of Z. All heuristics except <HOPS perform better than the RAND baseline. >RES has the best performance, followed by <TIME.
- Figure 8 shows the time to satisfaction of the different heuristics. >RES has a time to satisfaction 20% lower than the baseline.
- <HOPS has poor performance. A low average number of hops does NOT imply that the neighbor is close to other nodes with data.

19 Experiments Results Directed BFS Cost
- There is a definite correlation between cost and quality of results. Many of the heuristics return higher quality results, and the more quality results are returned, the more aggregate bandwidth and processing power is consumed.
- Users are more aware of the quality of the results returned than of the cost of a query.
- Recommendation: >RES or <TIME. Both heuristics provide a good time to satisfaction and a good probability of satisfaction.

20 Conclusion
- Compared with iterative deepening, the strength of Directed BFS is time to satisfaction.
- By sacrificing time to satisfaction, iterative deepening can achieve lower cost than any Directed BFS heuristic.
- Satisfaction for all iterative deepening policies is higher than the satisfaction of any Directed BFS heuristic.

21 END …