An IP Address Based Caching Scheme for Peer-to-Peer Networks Ronaldo Alves Ferreira Joint work with Ananth Grama and Suresh Jagannathan Department of Computer Sciences – Purdue University December
Outline Introduction Motivation IP Addresses as Virtual IDs Cache Organization Simulation Results Conclusions
Introduction Peer-to-Peer (P2P) networks are self-organizing distributed systems where participating nodes both provide and receive services from each other in a cooperative manner without distinguished roles as pure clients or pure servers. P2P Internet applications have recently been popularized by file sharing applications like Napster and Gnutella. P2P systems have many interesting technical aspects such as decentralized control, self-organization, adaptation and scalability. One of the key problems in large-scale P2P applications is to provide efficient algorithms for object location and routing within the network.
Location and Routing - DHT Most known proposals take a key as input and, in response, route a message to the node responsible for that key. The keys are strings of digits of some fixed length (generally 128 bits). Nodes have identifiers taken from the same space as the keys (same number of digits). Each node maintains a routing table consisting of a small subset of the nodes in the system. Nodes route queries to neighbor nodes that make the most “progress” towards resolving the query.
Location and Routing - DHT The notion of progress differs from algorithm to algorithm. Plaxton developed the first ideas that could be applied in a scalable manner. Although intended for a static node population, the Plaxton algorithm provides efficient routing of queries. The algorithm works by “correcting” a single digit at a time. Chord, Pastry, and Tapestry are variants of the Plaxton algorithm.
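Location and Routing - DHT [Figure: identifier space partitioned by first digit into 0XXX, 1XXX, 2XXX, 3XXX.] START: node 0112 routes a message to a key. The first hop fixes the first digit (2). The second hop fixes the second digit (20). END: node 2001, the closest live node to 2000.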
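The digit-correcting walk in the example above can be sketched in a few lines. This is a minimal illustration, not the Plaxton algorithm itself: it assumes 4-digit identifiers, takes 2000 as the key (as the END label suggests), and uses a hypothetical global list of live nodes where a real node would consult only its own routing table. The names `shared_prefix_len`, `next_hop`, and `route` are made up for this sketch.

```python
from typing import List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, key: str, nodes: List[str]) -> Optional[str]:
    """Pick a live node that matches the key in more leading digits than `current`.

    Assumption: `nodes` is a global view of live node IDs, used only to keep
    the sketch short.
    """
    have = shared_prefix_len(current, key)
    candidates = [n for n in nodes if shared_prefix_len(n, key) > have]
    if not candidates:
        return None  # no live node makes further progress
    # Prefer the smallest available improvement, i.e. correct as few digits as possible per hop.
    return min(candidates, key=lambda n: (shared_prefix_len(n, key), n))


def route(start: str, key: str, nodes: List[str]) -> List[str]:
    """Return the sequence of nodes visited while routing from `start` towards `key`."""
    path = [start]
    while True:
        hop = next_hop(path[-1], key, nodes)
        if hop is None:
            return path
        path.append(hop)
```

With the hypothetical node set `["0112", "2321", "2032", "2001", "1203"]`, calling `route("0112", "2000", nodes)` visits a node starting with 2, then one starting with 20, and stops at 2001 because no live node matches all four digits.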
Location and Routing - DHT Node 0 Routing Table Leaf Set
Location and Routing - DHT Node 0 Routing Table
Location and Routing - Pastry Computers (nodes) have unique IDs, typically 128 bits long. ID assignment should lead to a uniform distribution in the node ID space, for example by hashing the node’s IP address. Primitive: route(msg, key) delivers msg to the currently alive node whose ID is numerically closest to key. Node state: routing table, neighborhood set, leaf set. Scalable and efficient: O(log(N)) routing table entries per node; routes in O(log(N)) hops.
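As a rough sketch of the two ideas on this slide, the code below derives a 128-bit node ID from a hash of the IP address and picks the live node numerically closest to a key. MD5 is an assumed choice here (it happens to produce 128 bits); treating the ID space as circular and breaking ties towards the smaller ID are also assumptions of this sketch. It shows only the delivery rule, not Pastry's hop-by-hop routing.

```python
import hashlib

ID_BITS = 128
ID_SPACE = 1 << ID_BITS


def node_id(ip: str) -> int:
    """128-bit identifier obtained by hashing the IP address (assumed hash: MD5)."""
    return int.from_bytes(hashlib.md5(ip.encode("ascii")).digest(), "big")


def distance(a: int, b: int) -> int:
    """Numerical distance between two IDs on a circular ID space (assumption)."""
    d = abs(a - b)
    return min(d, ID_SPACE - d)


def responsible_node(key: int, live_ids):
    """ID of the live node numerically closest to `key`; ties go to the smaller ID."""
    return min(live_ids, key=lambda n: (distance(n, key), n))
```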
DHT Performance Issues Virtualization destroys locality. Messages may have to travel around the world to reach a node in the same LAN. Query responses do not contain locality information. Heuristics to minimize the problem: proximity routing, topology-based node ID assignment, and proximity neighbor selection.
Motivation Virtualization destroys locality. Query responses do not contain locality information. Recent studies show that queries for multiple keys in P2P networks follow a Zipf-like distribution. For many wide-area distributed applications, nodes in the same region share common interests, for example in music sharing applications. Network-intensive applications, such as distributed file systems, have been built using P2P networks.
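To see why a Zipf-like query distribution favors caching, the sketch below computes how much of the query load falls on the most popular objects. It assumes the common form in which the object of popularity rank i is requested with probability proportional to 1 / i^alpha; the function names and the object counts in the usage note are illustrative only.

```python
def zipf_probabilities(n_objects: int, alpha: float) -> list:
    """Request probability for each popularity rank 1..n_objects under a Zipf-like law."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_objects + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def top_k_share(n_objects: int, alpha: float, k: int) -> float:
    """Fraction of all queries that target the k most popular objects."""
    return sum(zipf_probabilities(n_objects, alpha)[:k])
```

For instance, `top_k_share(1000, 0.95, 100)` is larger than `top_k_share(1000, 0.75, 100)`, and both exceed the 10% that a uniform distribution would give, so a cache holding only the popular objects can serve a disproportionate share of queries.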
IP Addresses as Virtual IDs A natural way of building locality into an overlay network is to exploit the addressing scheme of the underlying network. In most cases, nodes with IP addresses that are numerically close are also physically close. The Internet is organized into autonomous systems (ASs). By correcting a few bits in each hop, the last hops would be inside an AS.
IP Addresses as Virtual IDs The IP space is not uniformly populated by peers. This causes load imbalance at the peers. The upper bound of O(log n) hops can no longer be guaranteed.
IP Addresses as Virtual IDs How severe would the load imbalance be if we used the IP address of a node as its overlay identifier? Is it possible to find a boundary in the IP address such that the distribution of peers is uniform and some form of locality is captured? Experimental basis: Gnutella traces from June 2002 with 56M messages and 62,000 different IP addresses. Addresses were validated using a whois server and ping.
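One way to probe the "boundary" question is to count peers per address prefix at a candidate prefix length and compare the busiest prefix with the average. The sketch below does this for IPv4 addresses; the metric (maximum count divided by the mean over all possible prefixes) is an assumption chosen for illustration, not necessarily the measure used in the trace study.

```python
import ipaddress
from collections import Counter


def prefix_of(ip: str, bits: int) -> int:
    """Top `bits` bits (1..32) of an IPv4 address, as an integer."""
    return int(ipaddress.IPv4Address(ip)) >> (32 - bits)


def peers_per_prefix(ips, bits: int) -> Counter:
    """How many peers fall under each prefix of the given length."""
    return Counter(prefix_of(ip, bits) for ip in ips)


def imbalance(ips, bits: int) -> float:
    """Most populated prefix relative to the mean over all 2**bits prefixes.

    A value of 1.0 would mean a perfectly uniform population.
    """
    counts = peers_per_prefix(ips, bits)
    mean = len(ips) / (1 << bits)
    return max(counts.values()) / mean
```

Running `imbalance` over a trace for several values of `bits` would show whether any prefix length gives a population close to uniform.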
IP Addresses as Virtual IDs
2,420 nodes. 20 keys per node.
IP Addresses as Virtual IDs The average CIDR prefix length for the addresses is over 19 bits. This is a negative result, but it provides us with an insight to propose a two-level overlay architecture: one global overlay and several local overlays. A local overlay is formed by nodes that share the first 8 bits of their IP addresses.
Cache Organization
Node Arrivals When joining the network, a node first joins the global overlay using the specific DHT protocol. After joining the global overlay, the new node contacts the rendezvous point of its domain to determine which local overlay it will join.
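The two-step arrival can be modeled with a toy class. This is only a sketch under stated assumptions: the DHT-specific global join is reduced to adding the address to a set, the rendezvous point is replaced by a direct computation on the first 8 address bits, and the class and method names are hypothetical.

```python
import ipaddress
from collections import defaultdict


class TwoLevelOverlay:
    """Toy model: one global membership set plus local overlays keyed by the first 8 address bits."""

    def __init__(self):
        self.global_members = set()
        self.local_members = defaultdict(set)

    @staticmethod
    def local_overlay_id(ip: str) -> int:
        """First 8 bits of the IPv4 address, used here as the local overlay identifier."""
        return int(ipaddress.IPv4Address(ip)) >> 24

    def join(self, ip: str) -> int:
        # Step 1: join the global overlay (the actual DHT join protocol is abstracted away).
        self.global_members.add(ip)
        # Step 2: determine the local overlay.
        # Assumption: the rendezvous point's answer is modeled as this direct computation.
        overlay = self.local_overlay_id(ip)
        self.local_members[overlay].add(ip)
        return overlay
```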
Simulation Setup Internet topology generated using the GT-ITM topology generator. 10,000 overlay nodes selected randomly from the hosts. NLANR web proxy trace with 500,254 objects. Zipf distribution parameters: {0.75, 0.80, 0.85, 0.90, 0.95}. Local cache size: 5MB (LRU replacement policy).
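A size-bounded cache with LRU replacement, as used for the local caches in this setup, can be sketched as follows. The class name and interface are illustrative, and interpreting 5MB as 5 * 1024 * 1024 bytes is an assumption; the simulator's actual cache code is not shown in the slides.

```python
from collections import OrderedDict


class LRUCache:
    """Byte-budgeted cache that evicts the least recently used objects first."""

    def __init__(self, capacity_bytes: int = 5 * 1024 * 1024):
        self.capacity = capacity_bytes
        self.used = 0
        self.items = OrderedDict()  # object id -> size in bytes, least recently used first

    def get(self, obj_id) -> bool:
        """Return True on a hit and mark the object as most recently used."""
        if obj_id not in self.items:
            return False
        self.items.move_to_end(obj_id)
        return True

    def put(self, obj_id, size: int) -> None:
        """Insert an object, evicting least recently used entries until it fits."""
        if size > self.capacity:
            return  # the object can never fit, so it is not cached
        if obj_id in self.items:
            self.used -= self.items.pop(obj_id)
        while self.used + size > self.capacity:
            _, evicted_size = self.items.popitem(last=False)
            self.used -= evicted_size
        self.items[obj_id] = size
        self.used += size
```

Replaying a trace then amounts to calling `get` for each request and `put` on every miss; the hit ratio is the fraction of `get` calls that return True.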
IP Addresses as Virtual IDs
Simulation Results [Table with columns: Zipf parameter, Cache Hit Ratio, Gain. Surviving percentage values, in row order: 31.0%, 33.5%, 36.0%, 38.7%, 41.3%.]
Conclusions Using IP addresses as virtual IDs would probably produce overlays with good locality properties, but the non-uniform population of nodes in the IP space leads to severe load imbalances, and there are no guarantees on the number of hops. Two-level overlay architecture: local overlays are created to cluster nodes that are close in the underlying network. The performance gains of a two-level architecture are significant when compared with a single global overlay. The costs of maintaining the two-level architecture are very low.