1 - CS7701 – Fall 2004 Review of: Making Gnutella-like P2P Systems Scalable Paper by: –Yatin Chawathe (AT&T) –Sylvia Ratnasamy (Intel) –Lee Breslau (AT&T) –Nick Lanham (UC Berkeley) –Scott Shenker (ICSI) Published in: –ACM SIGCOMM 2003 Reviewed by: –Todd Sproull Discussion Leader: –Christoph Jechlitschek CS7701: Research Seminar on Networking
2 - CS7701 – Fall 2004 Outline Introduction Problem Description Gia Design Simulation Results Implementation Conclusions
3 - CS7701 – Fall 2004 Introduction Peer to Peer (P2P) Networks –“Systems serving other Systems” –Potential for millions of users –Gained consumer popularity through Napster Napster –Started in 1999 by Shawn Fanning –Enabled music fans to trade songs over a P2P network –Clients connected to centralized Napster servers to locate music –2001: a judge ruled that Napster had to block all copyrighted material –2002: Napster folded; the RIAA went on to pursue Napster clones Gnutella –March 14, 2000: Nullsoft released the first version of the software Created by Justin Frankel and Tom Pepper Nullsoft pulled the software the next day –Software was reverse engineered –Open source clients became available –Built around a decentralized approach
4 - CS7701 – Fall 2004 Gnutella Distributed search and download Unstructured: ad-hoc topology –Peers connect to random nodes Random search –Flood queries across network Scaling problems –As network grows, search overhead increases [Figure: peers P1 through P6; the query who has “madonna” is flooded across the network; P4 has “madonna-american-life.mp3” and P2 has “madonna-ray-of-light.mp3”]
5 - CS7701 – Fall 2004 Problem Gnutella has notoriously poor scaling –Flooding-based Solution –Just using Distributed Hash Tables does not necessarily fix the problem Challenge –Improve scaling while maintaining Gnutella’s simplicity Propose new mechanisms to fix scalability issues Evaluate performance of these individual components and the entire network
6 - CS7701 – Fall 2004 What about DHTs? Distributed Hash Tables (DHTs) –Provide a hash table abstraction over multiple compute nodes How it works –Each DHT node can store data items –Data items are indexed via a lookup key –Overlay routing delivers requests for a given key to the responsible node –O(log N) message hops in a network of N nodes –DHT adjusts the mapping of keys and the neighbor tables when the node set changes
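The key-to-node mapping described on this slide can be sketched as follows. This is a minimal illustration, not the design of any particular DHT: the successor-on-a-ring rule, the SHA-1 hash, and the names `ring_id`, `ToyDHT`, `responsible_node`, and `leave` are all assumptions made for the example. It does not model the O(log N) overlay routing; it only shows how each key has one responsible node and how the mapping shifts when the node set changes.

```python
import bisect
import hashlib


def ring_id(name: str, bits: int = 32) -> int:
    """Hash a node name or lookup key onto a ring of size 2**bits."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class ToyDHT:
    """Hypothetical toy DHT: a key belongs to the first node at or after it on the ring."""

    def __init__(self, node_names):
        self.ring = sorted((ring_id(n), n) for n in node_names)

    def responsible_node(self, key: str) -> str:
        ids = [node_id for node_id, _ in self.ring]
        # Wrap around to the first node when the key hashes past the last node.
        pos = bisect.bisect_left(ids, ring_id(key)) % len(self.ring)
        return self.ring[pos][1]

    def leave(self, name: str) -> None:
        # When a node departs, its keys fall to the next node on the ring.
        self.ring = [(i, n) for i, n in self.ring if n != name]


dht = ToyDHT(["A", "B", "C", "D", "E"])
owner_before = dht.responsible_node("madonna-ray-of-light.mp3")
dht.leave(owner_before)
owner_after = dht.responsible_node("madonna-ray-of-light.mp3")
```

Only the keys owned by the departing node move; every other key keeps its owner, which is the property that lets a DHT adjust incrementally as nodes come and go.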
7 - CS7701 – Fall 2004 Example [Figure: DHT lookup among nodes A, B, C, D and E. B’s routing table: key 7 → C, key 8 → D. D’s routing table: key 6 → E. Message labels: “Key 6?” (three times), “Nope!”, “Key 6!”, “I have key 6”.]
8 - CS7701 – Fall 2004 DHT-only P2P network? Problems –P2P clients are transient Clients join and leave at rates that cause a fair amount of “churn” Route failures require O(log N) repair operations –Keyword searches are more prevalent, and more important, than exact-match queries “Madonna Ray of Light mp3” or “Madona Ray Light mp3” –Queries are for hay, not needles Most requests are for popular content 50% of content requests are for files with more than 100 replicas 80% of content requests are for files with more than 80 replicas
9 - CS7701 – Fall 2004 The Solution Design a new Gnutella-like P2P system, “Gia” –Short for gianduia, the generic name for the hazelnut spread Nutella What’s so great about it? –Dynamic Topology Adaptation Accounts for heterogeneity among nodes –Active Flow Control Scheme Implements token-based allocation for queries –One-hop replication Keep small nodes next to well connected “higher capacity” nodes –Capacity refers to the number of messages a node can process per unit time –Search Protocol based on Random Walks No longer flooding the network with requests
10 - CS7701 – Fall 2004 Example Query Make high-capacity nodes easily reachable –Dynamic topology adaptation Make high-capacity nodes have more answers –One-hop replication Search efficiently –Biased random walks Prevent overloaded nodes –Active flow control
11 - CS7701 – Fall 2004 Dynamic Topology Adaptation Core Component of Gia Goals –Ensure high capacity nodes are ones with high degree –Keep low capacity nodes within short reach of high capacity nodes Accomplished through satisfaction level S –When S=0, node is dissatisfied –As node accumulates more neighbors, satisfaction rises until it reaches a satisfaction level of 1
12 - CS7701 – Fall 2004 Adding new neighbors Adding neighbor Y to X –Add the new neighbor if room exists –If there is no room, check whether an existing neighbor can be replaced –Goal: among existing neighbors with capacity less than or equal to the new neighbor’s, find the one with the highest degree Do not drop an already poorly connected neighbor Assumptions: –Max Neighbors of X = 3 –Capacity of all nodes the same [Figure: node X with neighbors A, B and C; new neighbor Y]
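The replacement rule on this slide can be sketched roughly as below. This follows only what the slide states; the `Peer` record, the function name `consider_neighbor`, the degree values in the example, and the exact form of the "poorly connected" guard (reject when the replacement candidate has no more neighbors than Y) are assumptions for illustration, not the paper's exact procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    name: str
    capacity: float
    degree: int  # number of neighbors this peer currently has


def consider_neighbor(neighbors, candidate, max_neighbors):
    """Decide whether to accept `candidate`; return (accepted, dropped_peer_or_None)."""
    if len(neighbors) < max_neighbors:
        return True, None  # room exists, simply accept
    # Only neighbors with capacity <= the candidate's may be replaced.
    replaceable = [p for p in neighbors if p.capacity <= candidate.capacity]
    if not replaceable:
        return False, None
    # Prefer to drop the best-connected one, since it loses the least.
    victim = max(replaceable, key=lambda p: p.degree)
    # Assumed guard: do not drop a neighbor that is as poorly connected as the candidate.
    if victim.degree <= candidate.degree:
        return False, None
    return True, victim


# Slide assumptions: X may have at most 3 neighbors and all capacities are equal.
x_neighbors = [Peer("A", 1.0, 3), Peer("B", 1.0, 1), Peer("C", 1.0, 4)]
y = Peer("Y", 1.0, 1)
accepted, dropped = consider_neighbor(x_neighbors, y, max_neighbors=3)
```

With these hypothetical degrees, X accepts Y and drops C, the highest-degree neighbor, while the poorly connected B keeps its link.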
13 - CS7701 – Fall 2004 Token Based Flow Control A client may send a query to a neighbor only if that neighbor allows it –Client must hold a token from the neighbor Each client periodically sends tokens to its neighbors –Token allocation rate is based on the node’s ability to process queries
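A minimal sketch of the token idea follows. The class `TokenBucket`, the function `tokens_per_period`, and the even split of tokens across neighbors are assumptions made for illustration; the slide only says that the allocation rate follows the node's processing ability.

```python
class TokenBucket:
    """Tokens that one particular neighbor has granted to this node."""

    def __init__(self):
        self.tokens = 0

    def grant(self, count=1):
        # Called when the neighbor's periodic token message arrives.
        self.tokens += count

    def try_send(self):
        """Spend one token to send a query; refuse if none are held."""
        if self.tokens == 0:
            return False
        self.tokens -= 1
        return True


def tokens_per_period(capacity_qps, period_s, num_neighbors):
    """Assumed policy: split what the node can process in one period evenly."""
    return int(capacity_qps * period_s) // num_neighbors


# A node that can process 10 queries/s hands out tokens once per second to 5 neighbors.
grant = tokens_per_period(capacity_qps=10, period_s=1.0, num_neighbors=5)
bucket = TokenBucket()
bucket.grant(grant)
results = [bucket.try_send() for _ in range(3)]
```

The third send is refused because only two tokens were granted, so the sender has to wait for the next allocation instead of overloading the neighbor.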
14 - CS7701 – Fall 2004 One Hop Replication Gia nodes maintain index of content of neighbors –Improves efficiency of search process –Allows for neighbors to respond to search queries Being “close” to content is useful –Not necessary that you have the requested content, but instead a pointer to it
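One-hop replication can be illustrated with a small sketch in which every node keeps a copy of its neighbors' file lists and answers with pointers. The class `IndexedNode`, its methods, and substring matching on file names are hypothetical choices for the example, not Gia's wire protocol.

```python
class IndexedNode:
    """Hypothetical node that indexes its own files and those of its neighbors."""

    def __init__(self, name, files=()):
        self.name = name
        self.files = set(files)
        self.neighbor_index = {}  # neighbor name -> set of that neighbor's files

    def add_neighbor(self, other):
        # Neighbors exchange their content lists when the link is created.
        self.neighbor_index[other.name] = set(other.files)
        other.neighbor_index[self.name] = set(self.files)

    def drop_neighbor(self, other):
        # The replicated index is discarded together with the link.
        self.neighbor_index.pop(other.name, None)
        other.neighbor_index.pop(self.name, None)

    def lookup(self, keyword):
        """Return (owner, file) pointers for matches held locally or one hop away."""
        hits = [(self.name, f) for f in sorted(self.files) if keyword in f]
        for owner, files in sorted(self.neighbor_index.items()):
            hits += [(owner, f) for f in sorted(files) if keyword in f]
        return hits


hub = IndexedNode("hub")
leaf = IndexedNode("leaf", ["madonna-ray-of-light.mp3"])
hub.add_neighbor(leaf)
pointers = hub.lookup("madonna")
```

Here the hub holds no files of its own, yet it can answer the query with a pointer to the leaf, which is why steering queries toward well-connected nodes pays off.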
15 - CS7701 – Fall 2004 Search Protocol Based on biased random walks –Gia node selects highest capacity neighbor that it has tokens for and sends query –Queues message if no tokens available for any neighbor Uses two mechanisms for control –TTL bounds duration of walks –Maintains MAX_RESPONSES parameter for maximum number of answers it searches for
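The search rule above can be sketched as a short walk over an in-memory graph. Everything here is a simplification under stated assumptions: the function name `biased_walk`, the dictionaries describing the topology, one token per directed link, and stopping (rather than queueing) when no tokens are left are all made up for the example.

```python
def biased_walk(start, neighbors, capacity, tokens, content, keyword, ttl, max_responses):
    """Forward a query to the highest-capacity neighbor for which a token is held."""
    hits, node, path, hops = [], start, [start], 0
    while True:
        hits += [(node, f) for f in content.get(node, []) if keyword in f]
        # Two stop conditions from the slide: enough answers, or TTL exhausted.
        if len(hits) >= max_responses or hops == ttl:
            break
        usable = [n for n in neighbors[node] if tokens.get((node, n), 0) > 0]
        if not usable:
            break  # a real node would queue the query until tokens arrive
        nxt = max(usable, key=lambda n: capacity[n])
        tokens[(node, nxt)] -= 1  # sending a query spends one token
        node = nxt
        path.append(node)
        hops += 1
    return hits[:max_responses], path


# Hypothetical four-node topology with one token per directed link.
neighbors = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
capacity = {"a": 1, "b": 10, "c": 100, "d": 1000}
tokens = {(u, v): 1 for u in neighbors for v in neighbors[u]}
content = {"c": ["madonna-ray-of-light.mp3"], "d": ["madonna-american-life.mp3"]}
hits, path = biased_walk("a", neighbors, capacity, tokens, content,
                         keyword="madonna", ttl=10, max_responses=2)
```

In this toy run the walk visits a, c, a, b, d and stops as soon as it has collected MAX_RESPONSES answers, well before the TTL runs out.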
16 - CS7701 – Fall 2004 Simulations Four basic models –FLOOD Gnutella Model –RWRT Random Walks over Random Topologies Proposed by Lv et al. –SUPER Classifies some nodes as “Super Nodes”, based on Capacity (> 1000x) –GIA Gia protocol suite Capacity –The number of messages (queries or add/drop requests) a node can process per unit time –Derived from measured bandwidth distributions from Saroiu et al. A fair number of clients have dialup connections The majority use cable-modem or DSL Few have “high-speed” connections
17 - CS7701 – Fall 2004 Performance Metrics Collapse Point (CP) –Per-node query rate at the point beyond which the success rate drops below 90% –Referred to as the knee Hop-count before collapse (CP-HC) –Average hop count prior to collapse
18 - CS7701 – Fall 2004 Performance Comparison
19 - CS7701 – Fall 2004 Factor Analysis Effects of individual components –Remove each component from Gia one at a time –Add each component to RWRT –No single component contributes entirely to Gia’s success
20 - CS7701 – Fall 2004 Multiple Searches CP changes with MAX_RESPONSES Replication Factor and MAX_RESPONSES
21 - CS7701 – Fall 2004 Robustness Static SUPER Static RWRT (1% repl)
22 - CS7701 – Fall 2004 Active Replication Allow higher capacity nodes to replicate files –On demand replication when high capacity node receives query and download request Active replication can increase capacity of nodes serving files from a factor of 38 to 50
23 - CS7701 – Fall 2004 Implementation Satisfaction Level –Aggressiveness of Adaptation –Exponential relationship between satisfaction level S and adaptation interval I –Define: I = adaptation interval, S = satisfaction level, T = maximum interval between adaptation iterations, K = aggressiveness of adaptation –Let $I = T \times K^{-(1-S)}$
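As a worked example of the interval formula, the sketch below evaluates I = T × K^(-(1-S)) for a few satisfaction levels. The function name `adaptation_interval` and the values T = 10 and K = 256 are chosen for illustration only.

```python
def adaptation_interval(satisfaction, max_interval, aggressiveness):
    """Return I = T * K ** -(1 - S) for a satisfaction level S in [0, 1]."""
    if not 0.0 <= satisfaction <= 1.0:
        raise ValueError("satisfaction must lie in [0, 1]")
    return max_interval * aggressiveness ** -(1.0 - satisfaction)


# Illustrative parameters (assumed for this example): T = 10 time units, K = 256.
T, K = 10.0, 256.0
fully_satisfied = adaptation_interval(1.0, T, K)  # I = T
half_satisfied = adaptation_interval(0.5, T, K)   # I = T / sqrt(K)
dissatisfied = adaptation_interval(0.0, T, K)     # I = T / K
```

A fully satisfied node waits the maximum interval T between adaptation attempts, while a dissatisfied node retries K times more often, so larger K means more aggressive adaptation.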
24 - CS7701 – Fall 2004 Satisfaction Level Calculating Satisfaction level –S = 0 initially, and whenever the number of neighbors is less than a predefined minimum –Satisfaction Algorithm does the following Adds up the normalized capacity of all neighbors –A high-capacity neighbor with low degree is worth more than a high-capacity neighbor with high degree Divides the total by your own capacity to find S Returns S = 1 if S > 1 or the number of neighbors is greater than a predefined maximum
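The calculation on this slide can be sketched as below. Reading "normalized capacity" as a neighbor's capacity divided by its degree is an interpretation of the sub-bullet, and treating "reached the maximum" the same as "greater than the maximum" is an assumption; the function name and the numbers in the example are hypothetical.

```python
def satisfaction_level(own_capacity, neighbors, min_neighbors, max_neighbors):
    """Compute S in [0, 1]; `neighbors` is a list of (capacity, degree) pairs."""
    if len(neighbors) < min_neighbors:
        return 0.0  # too few neighbors: fully dissatisfied
    if len(neighbors) >= max_neighbors:
        return 1.0  # assumed: a full neighbor table counts as satisfied
    # A neighbor shares its capacity among all of its own neighbors, so a
    # high-capacity, low-degree neighbor contributes the most.
    total = sum(capacity / degree for capacity, degree in neighbors)
    return min(1.0, total / own_capacity)


# Hypothetical node of capacity 100 with two neighbors.
s = satisfaction_level(
    own_capacity=100.0,
    neighbors=[(1000.0, 20), (10.0, 1)],  # contributes 50 + 10
    min_neighbors=1,
    max_neighbors=10,
)
```

Here the neighbors contribute 60 units against the node's own capacity of 100, giving S = 0.6, so the node keeps looking for better neighbors, but less urgently than a node with S = 0.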
25 - CS7701 – Fall 2004 Deployment PlanetLab –Wide-area service deployment testbed spanning North America, Europe, Asia and the South Pacific –Deployed Gia on 83 clients –Measured time to reach “steady state”
26 - CS7701 – Fall 2004 Related Work KaZaA –At time of SIGCOMM little had been published on KaZaA –“Understanding KaZaA” Liang, et al. CAP –Cluster-based approach to handle scaling in Gnutella Based on a central clustering server Clusters act as directory servers PierSearch –Published in SIGCOMM 2004 –PIER + Gnutella PIER uses a DHT for hard-to-find content and Gnutella for the more popular content Gnutella2 –Aimed at fixing many of the problems with Gnutella –Not created by Gnutella founders, causing some controversy in the community
27 - CS7701 – Fall 2004 Conclusion Gia proves to be a scalable Gnutella –3 to 5 orders of magnitude improvement Unstructured system works well for popular content –DHT not necessary in most cases Working implementation on PlanetLab