Peer-to-Peer Networks. Thanks to: Vinod Muthusam (U. of Toronto), Jon Kubiatowicz (UC Berkeley), Don Towsley (U. Mass at Amherst), Mema Roussopoulos (Harvard University)
What is P2P? "P2P is a communications model in which each party has the same capabilities and either party can initiate a communication session." (Whatis.com) "P2P is a class of applications that takes advantage of resources – storage, cycles, content, human presence – available at the edges of the internet." (Clay Shirky, openp2p.com) "A type of network in which each workstation has equivalent capabilities and responsibilities." (Webopedia.com) "A P2P computer network refers to any network that does not have fixed clients and servers, but a number of peer nodes that function as both clients and servers to other nodes on the network." (Wikipedia.org)
P2P is not new! Usenet: newsgroups, the first truly decentralized system. DNS: handles a huge number of clients. IP routing: vastly decentralized, many equivalent routers.
When is an application P2P? We will do an analysis based on the decision tree from "2 P2P or Not 2 P2P" (IPTPS 2004), developed by researchers at Harvard, Stanford, Berkeley, and HP Labs. Exercise for end of class: critique and modify the decision tree.
Recent Explosion of New Large Scale Applications. Applications requiring immense resources: CPU (Grid and grid computing); files and information (music sharing, semantic web); bandwidth (video streaming, content distribution); communication (IP telephony, group collaboration); storage (data archives, massive storage). Thousands to millions of nodes on the edge of the Internet can participate as sources/donors and as receivers/users.
Large scale storage applications: Web indexing (Google). Goal: index the entire Web. Estimate: Google has a 250,000-node cluster! Massively distributed. [Figure components: Crawler, Indexer, Client, Distributed File System; operations: Store(url, page), Index(page, keywords), Find(keywords)] * Partial content from http://project-iris.net/talks/dht-toronto-03.ppt
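The three operations named on this slide can be illustrated with a toy, single-process index. This is only a sketch under loud assumptions: the class name `ToyWebIndex`, the in-memory dictionaries standing in for the distributed file system, and the whitespace tokenizer are all made up for illustration, and `index` takes a URL rather than a page so that results can point back to a location. It says nothing about how Google's system is actually built.

```python
from collections import defaultdict


class ToyWebIndex:
    """Single-process stand-in for the Store / Index / Find operations."""

    def __init__(self):
        self.pages = {}                     # url -> page text
        self.postings = defaultdict(set)    # keyword -> set of urls

    def store(self, url, page):
        """Crawler side: save the fetched page under its URL."""
        self.pages[url] = page

    def index(self, url, keywords):
        """Indexer side: record that each keyword occurs at this URL."""
        for word in keywords:
            self.postings[word.lower()].add(url)

    def find(self, keywords):
        """Client side: return URLs that contain every query keyword."""
        sets = [self.postings.get(w.lower(), set()) for w in keywords]
        return sorted(set.intersection(*sets)) if sets else []


idx = ToyWebIndex()
sample_pages = {
    "http://a.example": "peer to peer networks",
    "http://b.example": "client server networks",
}
for url, text in sample_pages.items():
    idx.store(url, text)
    idx.index(url, text.split())

print(idx.find(["networks"]))
print(idx.find(["peer", "networks"]))
```

In a massively distributed version, both dictionaries would be partitioned across many nodes, which is where the storage and lookup questions in the rest of the lecture come in.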
Large scale storage applications: Web archives. Goal: make and archive a daily check point of the Web. Estimates: Web is about 57 Tbyte, compressed HTML+img; new data per day: 580 Gbyte; 128 Tbyte per year with 5 replicas. Design: 12,810 nodes, 100 Gbyte disk each. [Figure components: Crawler, Client, Distributed File System; operations: Store(url, page, date), Get(url, date)] * Partial content from http://project-iris.net/talks/dht-toronto-03.ppt
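A back-of-the-envelope script makes it easy to play with the inputs on this slide. It is a sketch only: the decimal conversion (1 Tbyte = 1000 Gbyte), the 365-day year, and the simple "growth times replicas" model are assumptions, so the results need not match the slide's 128 Tbyte figure, which may rest on inputs not shown here.

```python
GB_PER_TB = 1000               # assumption: decimal units

web_size_tb = 57               # from the slide
new_data_gb_per_day = 580      # from the slide
replicas = 5                   # from the slide
nodes = 12_810                 # from the slide
disk_gb_per_node = 100         # from the slide

# Yearly growth for a single copy, then with every block replicated.
new_data_tb_per_year = new_data_gb_per_day * 365 / GB_PER_TB
replicated_growth_tb = new_data_tb_per_year * replicas

# Raw disk capacity of the proposed cluster.
cluster_capacity_tb = nodes * disk_gb_per_node / GB_PER_TB

print(f"new data per year, one copy:    {new_data_tb_per_year:.1f} Tbyte")
print(f"new data per year, {replicas} replicas:  {replicated_growth_tb:.1f} Tbyte")
print(f"raw cluster capacity:           {cluster_capacity_tb:.1f} Tbyte")
```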
Large scale storage applications: File storage. OceanStore (UC Berkeley). Untrusted infrastructure: the OceanStore is composed of untrusted components; individual hardware has finite lifetimes; all data is encrypted within the infrastructure. Responsible party: some organization (i.e. service provider) guarantees that your data is consistent and durable; it is not trusted with the content of data, merely its integrity. Mostly well-connected: data producers and consumers are connected to a high-bandwidth network most of the time; exploit multicast for quicker consistency when possible. Promiscuous caching: data may be cached anywhere, anytime.
Utility-based Infrastructure. [Figure labels: Pac Bell, Sprint, IBM, AT&T, Canadian OceanStore] Data service provided by storage federation. Cross-administrative domain. Pay for service.
Large scale scientific applications: SETI@Home and other projects, at 500,000 BOINC volunteers. Featured volunteer: work done per day 6,514 (65 GigaFLOPS).
GRIDs and desktop grids. GRID computing connects supercomputing labs (large parallel machines and databases), primarily for scientific computing. Desktop grids use PCs for cycle sharing (dedicated or on the edge of the Internet). CCOF: Cluster Computing on the Fly.
WaveGrid, Riding the wave of idle cycles
Peer-to-peer for large systems: limitations of client/server architecture; benefits of P2P; history of P2P systems; P2P architectures; P2P issues.
Client/server architecture. A well known, powerful, reliable server is the data source. Clients request data from the server. Very successful model: WWW (HTTP), FTP, Web services, etc. [Figure labels: Server, Client, Internet] * Figure from http://project-iris.net/talks/dht-toronto-03.ppt
Client/server limitations: scalability is expensive; presents a single point of failure; requires administration; unused resources at the network edge. P2P systems try to address these limitations.
P2P vocabulary: P2P application, P2P architecture, P2P computing, P2P network, peer-based vs. P2P. The terms are used interchangeably, sometimes sloppily, but there are subtle differences in meaning.
P2P computing. P2P computing is the sharing of computer resources and services by direct exchange between systems. These resources and services include the exchange of information, processing cycles, cache storage, and disk storage for files. P2P computing takes advantage of existing computing power, computer storage and networking connectivity, allowing users to leverage their collective power to the ‘benefit’ of all. * From http://www-sop.inria.fr/mistral/personnel/Robin.Groenevelt/Publications/Peer-to-Peer_Introduction_Feb.ppt
P2P architecture. All nodes are both clients and servers: they provide and consume data, and any node can initiate a connection. No centralized data source. “The ultimate form of democracy on the Internet.” “The ultimate threat to copyright protection on the Internet.” [Figure labels: Node, Internet] * Content from http://project-iris.net/talks/dht-toronto-03.ppt
P2P benefits. Efficient use of resources: unused bandwidth, storage, and processing power at the edge of the network. Scalability: consumers of resources also donate resources; aggregate resources grow naturally with utilization (organic scaling, infrastructure-less scaling). Reliability (in aggregate): replicas; geographic distribution; no single point of failure. Ease of administration: nodes self-organize; no need to deploy servers to satisfy demand (cf. scalability); built-in fault tolerance, replication, and load balancing.
P2P Challenges. Efficient and fair use of resources: How to allocate? How to deliver? How to prevent selfish behavior? How to provide incentives? Scalability: How to locate resources in such a large system? How to avoid overuse of the network? How to deal with heterogeneity? Reliability and trustworthiness in open systems (fault tolerance and security): How to prevent or recover from malicious behavior? Do we need or want authentication? How to deal with churn? Do we want to guarantee anonymity? Ease of administration and security: What about commercial P2P? How to deal with policy issues?
P2P uses Overlay Networks. [Figure: four peers forming an overlay above the IP network; labels: Traditional Communication (IP network), Tunneling Communication (overlay)]
P2P Architectures (Fig. 5.1). Overlay network: unstructured or structured. Architecture model: Centralized, Pure P2P, and Hybrid (hierarchical) for unstructured overlays; DHT for structured overlays.
P2P Architectures: Tradeoffs. Centralized P2P: simpler design, but can suffer from a bottleneck at the central server. Pure P2P: sometimes the content cannot be found, and searching can generate a lot of network traffic. Hybrid P2P: how to select and locate the supernodes? DHT: fast routing and content discovery, but more complex infrastructure and higher maintenance cost.
Unstructured P2P Systems: File Sharing. Napster, Gnutella, Kazaa, Freenet. Large scale sharing of files: user A makes files (music, video, etc.) on their computer available to others; user B connects to the network, searches for files, and downloads files directly from user A. Issues of copyright infringement.
P2P File sharing Traffic Globally, P2P traffic now represents 55%-80% of Internet traffic [CacheLogic] * Figure from http://www.cachelogic.com/research/slide12.php
Napster. A way to share music files with others. Users upload their list of files to the Napster server. You send queries to the Napster server for files of interest: keyword search (artist, song, album, bitrate, etc.). The Napster server replies with the IP addresses of users with matching files. You connect directly to user A to download the file. * Figure from http://computer.howstuffworks.com/file-sharing.htm
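The split between centralized search and direct transfer can be sketched with a toy directory server. Everything named here (`CentralIndex`, `register`, `search`, the addresses and file names) is hypothetical and chosen for illustration; it is not the Napster protocol, only the idea that the server stores file lists and answers keyword queries with peer addresses.

```python
class CentralIndex:
    """Toy directory server: knows who has which file, stores no file data."""

    def __init__(self):
        self.files_by_peer = {}    # peer address -> set of file names

    def register(self, peer_addr, file_names):
        """A peer uploads its list of shared files when it connects."""
        self.files_by_peer[peer_addr] = set(file_names)

    def unregister(self, peer_addr):
        """Forget a peer that has left the network."""
        self.files_by_peer.pop(peer_addr, None)

    def search(self, keyword):
        """Return (peer address, file name) pairs whose name contains keyword."""
        kw = keyword.lower()
        return sorted(
            (peer, name)
            for peer, names in self.files_by_peer.items()
            for name in names
            if kw in name.lower()
        )


server = CentralIndex()
server.register("10.0.0.1:6699", ["Artist One - Song A.mp3"])
server.register("10.0.0.2:6699", ["Artist Two - Song A.mp3", "Artist Two - Song B.mp3"])

matches = server.search("song a")
print(matches)   # the client would now contact one of these peers directly
```

Note that the download itself never touches `CentralIndex`; only the lookup does, which is why the server is both the source of correct results and the bottleneck.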
Napster. Central Napster server: can ensure correct results; bottleneck for scalability; single point of failure; susceptible to denial of service (malicious users; lawsuits, legislation). Search is centralized; file transfer is direct (peer-to-peer).
Gnutella. Share any type of file (not just music). Decentralized search, unlike Napster: you ask your neighbours for files of interest; neighbours ask their neighbours, and so on; a TTL field quenches messages after a number of hops; users with matching files reply to you. * Figure from http://computer.howstuffworks.com/file-sharing.htm
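TTL-limited flooding can be sketched as a breadth-first traversal of the neighbour graph. This is a simplified model, not the Gnutella wire protocol: the function name `flood_query`, the five-peer topology, and the file names are assumptions, and a peer here forwards to all of its neighbours, including the one it heard the query from.

```python
from collections import deque


def flood_query(neighbours, files, start, wanted, ttl):
    """Flood a query from `start`; each hop uses up one unit of TTL.

    Returns (peers holding `wanted`, number of query messages sent).
    """
    seen = {start}
    hits = []
    messages = 0
    frontier = deque([(start, ttl)])
    while frontier:
        peer, ttl_left = frontier.popleft()
        if ttl_left == 0:
            continue                      # TTL expired: do not forward
        for nxt in neighbours[peer]:
            messages += 1                 # one query message per link used
            if nxt in seen:
                continue                  # duplicate query is dropped
            seen.add(nxt)
            if wanted in files.get(nxt, ()):
                hits.append(nxt)
            frontier.append((nxt, ttl_left - 1))
    return hits, messages


# Hypothetical topology: A reaches E only through B or C, then D.
neighbours = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
files = {"B": {"hit.mp3"}, "E": {"rare.mp3"}}

print(flood_query(neighbours, files, "A", "hit.mp3", ttl=1))
print(flood_query(neighbours, files, "A", "rare.mp3", ttl=2))
print(flood_query(neighbours, files, "A", "rare.mp3", ttl=3))
```

With a TTL of 2 the query for the rare file dies one hop short of peer E, while a TTL of 3 finds it at the cost of more messages, which is the trade-off behind "good at finding popular content, bad at finding rare content".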
Gnutella. Decentralized: no single point of failure; not as susceptible to denial of service; cannot ensure correct results. Flooding queries: search is now distributed but still not scalable; good at finding popular content, bad at finding rare content. A two-level hierarchy reduces traffic: ultrapeers and leaf peers.
Freenet. Data flows in the reverse path of the query: impossible to know if a user is initiating or forwarding a query; impossible to know if a user is consuming or forwarding data. “Smart” queries: requests get routed to the correct peer by incremental discovery. * Figure from “Protecting Freedom of Information Online with Freenet”, Ian Clarke and Scott Miller. IEEE Internet Computing, Jan/Feb 2002
Comparison of file sharing networks. Napster (centralized): bottleneck (scalability, failure, denial of service); correct search results (centralized search). Gnutella (distributed): no bottleneck; no guarantee on search results. Freenet: anonymity; less efficient data transfer.
Anonymity. Napster, Gnutella, Kazaa don’t provide anonymity: users know who they are downloading from; others know who sent a query. Freenet: designed to provide anonymity, among other features.
Unstructured P2P Systems: File download (bandwidth sharing). BitTorrent (35% of Internet traffic), Limewire, eDonkey. A large file is divided into fixed-size blocks. Peers download missing blocks from other peers while uploading blocks they have to requesting peers (tit-for-tat). Blocks arrive out of order, so the peer must reassemble the file. (More details later this term.)
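The block mechanics described above can be sketched in a few lines. This is an illustration only: the helper names `split_into_blocks` and `reassemble`, the 1000-byte block size, and the SHA-1 comparison at the end are assumptions, and nothing here models peers, trackers, or tit-for-tat.

```python
import hashlib
import random


def split_into_blocks(data, block_size):
    """Cut data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def reassemble(received, num_blocks):
    """Rebuild the file from {index: block}, whatever order blocks arrived in."""
    missing = [i for i in range(num_blocks) if i not in received]
    if missing:
        raise ValueError(f"missing blocks: {missing}")
    return b"".join(received[i] for i in range(num_blocks))


original = bytes(range(256)) * 10            # 2560 bytes of sample data
blocks = split_into_blocks(original, 1000)   # blocks of 1000, 1000, 560 bytes

arrival_order = list(range(len(blocks)))
random.Random(7).shuffle(arrival_order)      # simulate out-of-order arrival

received = {}
for index in arrival_order:
    received[index] = blocks[index]          # store each block under its index

rebuilt = reassemble(received, len(blocks))
same = hashlib.sha1(rebuilt).digest() == hashlib.sha1(original).digest()
print("file intact:", same)
```

Keying received blocks by index is what makes arrival order irrelevant; the final hash comparison stands in for the integrity check a real client would perform.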
BitTorrent. “Offline” search: no search built into the protocol. Bartered “Tit for Tat” download bandwidth: download one (random) chunk from a storage peer, slowly; subsequent chunks are bartered with concurrent downloaders, as tracked by the tracker for the file; the more chunks you can upload, the more you can download; download speed starts slow, then goes fast. Great for large files. * Content from Hellerstein’s VLDB2004 P2P Tutorial
Structured P2P. Second generation P2P overlay networks: self-organizing, load balanced, fault-tolerant, scalable. Guarantees on the number of hops to answer a query (the major difference from unstructured P2P systems). Based on a distributed hash table interface.
Distributed hash tables (DHT). A distributed version of the hash table data structure: store and retrieve (key, value) pairs. The key is like a filename; the value can be file contents.
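The (key, value) interface can be sketched with a toy in-process DHT in which each key is stored on the first node at or after the key's position on a hash ring. This is a minimal sketch under assumptions: the names `ToyDHT`, `ring_id`, and `owner` are invented here, the 32-bit ring and SHA-1 hash are arbitrary choices, and `owner` sees the whole ring at once, which a real DHT node cannot do.

```python
import hashlib
from bisect import bisect_left


def ring_id(name, bits=32):
    """Hash a string to a position on a ring of size 2**bits."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class ToyDHT:
    """All nodes live in one process; only the key placement is realistic."""

    def __init__(self, node_names):
        self.ring = sorted((ring_id(n), n) for n in node_names)
        self.storage = {n: {} for n in node_names}   # per-node local tables

    def owner(self, key):
        """First node clockwise from the key's ring position (with wrap-around)."""
        ids = [node_id for node_id, _ in self.ring]
        pos = bisect_left(ids, ring_id(key)) % len(self.ring)
        return self.ring[pos][1]

    def put(self, key, value):
        self.storage[self.owner(key)][key] = value

    def get(self, key):
        return self.storage[self.owner(key)].get(key)


dht = ToyDHT(["node-a", "node-b", "node-c", "node-d"])
dht.put("song.mp3", b"...file contents...")
print(dht.owner("song.mp3"), dht.get("song.mp3"))
```

Callers only ever see `put` and `get`; which node holds the pair is decided by the hash, not by the caller.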
DHT applications. Many services can be built on top of a DHT interface: file sharing; archival storage; databases; naming and service discovery; chat service; rendezvous-based communication; publish/subscribe.
DHT desirable properties: keys are mapped evenly to all nodes in the network; each node maintains information about only a few other nodes; messages can be routed to a node efficiently; node arrivals/departures affect only a few nodes.
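The last property can be checked experimentally on a hash ring where each key belongs to the first node at or after its hashed position. This is a sketch, not any particular DHT: the `ring_id` and `owner` helpers, the node and key names, and the use of SHA-1 are assumptions, and `owner` uses global knowledge of the ring for simplicity.

```python
import hashlib
from bisect import bisect_left
from collections import Counter


def ring_id(name):
    """Position of a node name or key on the hash ring."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def owner(key, node_names):
    """First node at or after the key's position, wrapping around the ring."""
    ring = sorted((ring_id(n), n) for n in node_names)
    ids = [i for i, _ in ring]
    return ring[bisect_left(ids, ring_id(key)) % len(ring)][1]


nodes = [f"node-{i}" for i in range(20)]
keys = [f"file-{i}" for i in range(2000)]

before = {k: owner(k, nodes) for k in keys}
after = {k: owner(k, nodes + ["node-new"]) for k in keys}   # one node joins

moved = [k for k in keys if before[k] != after[k]]
previous_owners = {before[k] for k in moved}
load = Counter(before.values())

print(f"{len(moved)} of {len(keys)} keys moved when node-new joined")
print("nodes that handed keys over:", sorted(previous_owners))
print("keys per node before the join (min, max):", min(load.values()), max(load.values()))
```

Every key that moves goes to the new node, and all of them come from a single existing node, so a join touches one neighbour rather than the whole network. The plain ring used here does not by itself guarantee an even spread of keys, as the min/max line may show.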
DHT routing protocols. DHT is a generic interface, and there are several implementations of it: Chord [MIT]; Pastry [Microsoft Research UK, Rice University]; Tapestry [UC Berkeley]; Content Addressable Network (CAN) [UC Berkeley]; SkipNet [Microsoft Research US, Univ. of Washington]; Kademlia [New York University]; Viceroy [Israel, UC Berkeley]; P-Grid [EPFL Switzerland]; Freenet [Ian Clarke]. Freenet is more concerned with privacy/security than with delivery guarantees. Others, such as Farsite and SALAD (Douceur), are about file systems.
P2P Challenges: Resource Discovery in Unstructured. How to find desired resources in a large scale, open and dynamic P2P network? Think of it as a graph traversal problem: flooding; random walk; expanding ring; advertisement-based; rendezvous-point; history-based; many more. (Resource discovery in structured P2P networks is very different and will be covered later.)
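Two of the listed strategies, expanding ring and random walk, can be sketched as graph traversals. This is a toy model under assumptions: the function names, the eight-peer ring topology, and the seeded random generator are all invented for illustration, and the expanding-ring search is modelled as repeated hop-limited floods using global knowledge of the graph.

```python
import random
from collections import deque


def peers_within(graph, start, radius):
    """All peers reachable from start in at most `radius` hops (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        if dist[peer] == radius:
            continue                      # do not expand past the radius
        for nxt in graph[peer]:
            if nxt not in dist:
                dist[nxt] = dist[peer] + 1
                queue.append(nxt)
    return set(dist)


def expanding_ring(graph, holders, start, max_ttl):
    """Retry a TTL-limited flood with TTL 1, 2, ... until a holder is found."""
    for ttl in range(1, max_ttl + 1):
        found = peers_within(graph, start, ttl) & holders
        if found:
            return ttl, found
    return None, set()


def random_walk(graph, holders, start, max_steps, rng):
    """Forward the query to one randomly chosen neighbour per step."""
    peer = start
    for step in range(1, max_steps + 1):
        peer = rng.choice(graph[peer])
        if peer in holders:
            return step, peer
    return None, None


# Hypothetical overlay: eight peers in a ring, resource held by peer 3.
graph = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
holders = {3}

ring_result = expanding_ring(graph, holders, start=0, max_ttl=4)
walk_result = random_walk(graph, holders, start=0, max_steps=50, rng=random.Random(1))
print("expanding ring (ttl used, peers found):", ring_result)
print("random walk (steps taken, peer found):", walk_result)
```

The expanding ring stops at the smallest TTL that reaches a holder, so nearby resources are cheap to find; the random walk sends only one message per step but gives no bound on how many steps it needs.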
P2P Challenges: Incentives & Fairness. How to provide incentives for peers to participate: goodness of their hearts (seems to work); fame, competitive spirit; credit schemes; game theory. How to ensure fairness: prevent freeriders; accounting mechanisms; game theory!! Don’t even try to??
P2P Challenges: Security Again. Malicious behavior: failure to forward messages/files; faked computational results in cycle sharing; corrupted files/data/code; deliberate delay of messages to gain an advantage (e.g. games); inconsistent information relayed to different peers; faking work to get credit; denial of service attacks; collusion among several peers to do harm; Sybil attack (forging multiple identities from one peer to gain advantage).
P2P Challenges: routing. Peers that are close in the overlay network can be far apart in the physical network. [Figure: overlay nodes N20, N41, N80, N40] * Figure from http://project-iris.net/talks/dht-toronto-03.ppt
For class discussion: 1. Is a sensor network a P2P network? 2. If stores gave away free CDs and DVDs 24-7, what would happen to P2P computing and traffic? 3. Do this in pairs: redesign the "2 P2P or Not 2 P2P" decision tree to (a) include more issues, such as those covered in today’s lecture, and (b) rearrange the decisions in a more natural order. Optional: make it more suitable for a specific application such as Gnutella.