Presentation transcript:

An Overlay Infrastructure for Decentralized Object Location and Routing
Ben Y. Zhao (ravenben@eecs.berkeley.edu)
Computer Science Division, University of California at Berkeley

Peer-based Distributed Computing
A cooperative approach to large-scale applications
- peer-based: available resources scale with the number of participants
- better than client/server, where resources and scalability are limited
Large-scale, cooperative applications are coming:
- content distribution networks (e.g. FastForward)
- large-scale backup / storage utilities that leverage peers' storage for higher resiliency and availability
- cooperative web caching
- application-level multicast
- video on demand, streaming movies

What Are the Technical Challenges?
Example: a file system replicates files for resiliency and performance
- how do you find nearby replicas?
- how does this scale to millions of users? billions of files?

Node Membership Changes
- nodes join and leave the overlay, or fail
- data and control state need to track the available resources
- node membership management is a necessity

A Fickle Internet
- Internet disconnections are not rare (UMichTR98, IMC02)
- TCP retransmission is not enough; we need to route around failures
- IP route repair takes too long: IS-IS roughly 5 s, BGP roughly 3-15 min
- good end-to-end performance requires fast response to faults
Speaker note: TCP is good for retransmitting past congestion, but on a disconnected route it keeps retrying until it times out, and the application is left to handle the error, which is very hard. Under heavy congestion TCP backs off its transmission rate sharply, which looks to the application much like large packet loss.

An Infrastructure Approach
- the first generation of large-scale apps took a vertical approach: each application (e.g. FastForward, Yahoo IM, SETI) solved data location, dynamic membership, and reliable communication for itself on top of the physical Internet
- these are hard problems that are difficult to get right
- instead, solve the common challenges once: build a single overlay infrastructure at the application layer that provides efficient and scalable data location, dynamic node membership algorithms, and reliable communication
[Figure: layered view with applications on top of a shared application-level overlay (data location, dynamic membership, reliable communication) over the transport, network, link, and physical layers of the Internet]

Personal Research Roadmap
Applications built on Tapestry:
- service discovery service and XSet, a lightweight XML DB (Mobicom 99; 5000+ downloads; TSpaces)
- multicast (Bayeux, NOSSDAV 02)
- file system (OceanStore, ASPLOS 99 / FAST 03)
- spam filtering (SpamWatch, Middleware 03)
- rapid mobility (Warp, IPTPS 04)
Underlying infrastructure:
- Tapestry, building on PRR 97, with robust dynamic algorithms and resilient overlay routing
- DOLR and structured overlay APIs (SPAA 02 / TOCS, ICNP 03, IPTPS 03)
- landmark routing (Brocade, IPTPS 02); modeling of non-stationary datasets (JSAC 04)
WAN deployment (1500+ downloads): the project site has seen 4300 visits (8-9/day) and 1300 downloads from 50+ countries (421 academic, 193 industry, 55 labs), including a search engine company, a European bank, an open-source film distribution, a hospital, and a TV station.

Talk Outline
- Motivation
- Decentralized object location and routing
- Resilient routing
- Tapestry deployment performance
- Wrap-up

What should this infrastructure look like? Here is one appealing direction…

Structured Peer-to-Peer Overlays
- node IDs and keys come from a randomized namespace (SHA-1)
- routing proceeds incrementally towards the destination ID
- each node keeps a small set of outgoing routes, e.g. prefix routing
- log(n) neighbors per node, log(n) hops between any node pair
[Figure: prefix-routing example in which a message addressed to ABCD is forwarded through nodes such as A930, AB5F, ABC0, and ABCE, matching a longer prefix of the destination at each hop]
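To make the prefix-routing idea concrete, here is a minimal Java sketch of next-hop selection that resolves one more digit of the destination ID per hop. It is an illustration under assumed names (PrefixRouter, a per-level routing table), not the Tapestry code; surrogate routing for missing entries is omitted.

    import java.util.*;

    // Minimal illustration of prefix routing: at a node whose ID shares p digits
    // with the destination, pick a neighbor sharing at least p+1 digits.
    class PrefixRouter {
        private final String localId;                    // e.g. "A930" (hex digits)
        private final Map<Integer, Map<Character, String>> table = new HashMap<>();
        // table.get(level).get(digit) = neighbor whose ID matches `level` prefix
        // digits of the local ID and has `digit` in position `level`.

        PrefixRouter(String localId) { this.localId = localId; }

        void addNeighbor(String neighborId) {
            int p = sharedPrefix(localId, neighborId);
            table.computeIfAbsent(p, k -> new HashMap<>())
                 .putIfAbsent(neighborId.charAt(p), neighborId);
        }

        // Next hop for a destination: resolve one more digit per hop.
        String nextHop(String dest) {
            int p = sharedPrefix(localId, dest);
            if (p == dest.length()) return localId;      // we are the destination / root
            Map<Character, String> level = table.getOrDefault(p, Map.of());
            return level.get(dest.charAt(p));            // null: no entry (surrogate routing not shown)
        }

        private static int sharedPrefix(String a, String b) {
            int i = 0;
            while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
            return i;
        }

        public static void main(String[] args) {
            PrefixRouter r = new PrefixRouter("A930");
            r.addNeighbor("AB5F");                       // shares 1 digit ("A"), next digit 'B'
            System.out.println(r.nextHop("ABCD"));       // prints AB5F
        }
    }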

Related Work
Unstructured peer-to-peer approaches
- Napster, Gnutella, KaZaA
- probabilistic search (optimized for the hay, not the needle)
- locality-agnostic routing (resulting in high network bandwidth costs)
Structured peer-to-peer overlays
- the first protocols (2001): Tapestry, Pastry, Chord, CAN
- then: Kademlia, SkipNet, Viceroy, Symphony, Koorde, Ulysses, …
- one distinction: how to choose your neighbors (Tapestry and Pastry use a latency-optimized routing mesh)
- another distinction: the application interface (distributed hash table: put(key, data), data = get(key); Tapestry: decentralized object location and routing)

Defining the Requirements
Efficient routing to nodes and data
- low routing stretch (ratio of overlay latency to shortest-path distance)
Flexible data location
- applications want and need to control data placement, which allows application-specific performance optimizations; data can stay in place and remain mutable
- directory interface: publish(ObjID), routeToObj(ObjID, msg)
Resilient and responsive to faults
- more than just retransmission: route around failures
- reduce the negative impact (loss/jitter) on the application
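As a rough illustration of the directory interface named above, a hypothetical Java sketch follows; only publish and routeToObj come from the slide, while the other method names and types are assumptions.

    import java.util.function.Consumer;

    // Hypothetical sketch of a DOLR-style directory interface, following the
    // publish / routeToObj operations on the slide (types are assumptions).
    interface Dolr {
        // Announce that a local replica of the object is available at this node.
        void publish(byte[] objectId);

        // Remove the announcement when the replica goes away.
        void unpublish(byte[] objectId);

        // Route a message towards some (ideally nearby) replica of the object.
        void routeToObject(byte[] objectId, byte[] message);

        // Register an upcall invoked when a message for a published object arrives.
        void onDeliver(byte[] objectId, Consumer<byte[]> handler);
    }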

Decentralized Object Location & Routing
- a server publishes its copy of object k; clients invoke routeObj(k) and the overlay delivers their messages to a nearby copy
- where objects are placed is orthogonal: this is a slightly lower-level abstraction that lets the application place data, and data placement strategy is itself an area of research
- data traffic is redirected using log(n) in-network redirection pointers; average number of pointers per machine: log(n) * average files per machine
- keys to performance: a proximity-enabled routing mesh with routing convergence
[Figure: publish(k) installs pointers along the path from the server towards k's root on the backbone; routeObj(k) requests from clients converge on that path and are redirected to the copy]
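A simplified, in-memory sketch of the pointer mechanism described above, assuming a generic route-path helper: publish drops redirection pointers along the route to an object's root, and a later lookup stops at the first node holding a pointer. This is an illustration, not the Tapestry implementation.

    import java.util.*;

    // Simplified model: each overlay node keeps a pointer map objectId -> serverId.
    // publish() walks the overlay route from the server to the object's root and
    // drops a pointer at every hop; locate() walks from the client towards the root
    // and stops at the first node that already has a pointer.
    class DolrSketch {
        private final Map<String, Map<String, String>> pointers = new HashMap<>(); // nodeId -> (objId -> serverId)
        private final RouteFn route;                                               // overlay path from a node to an object's root

        interface RouteFn { List<String> path(String fromNode, String objectId); }

        DolrSketch(RouteFn route) { this.route = route; }

        void publish(String serverId, String objectId) {
            for (String node : route.path(serverId, objectId))
                pointers.computeIfAbsent(node, k -> new HashMap<>()).put(objectId, serverId);
        }

        // Returns the server holding a copy, found at the first hop with a pointer.
        Optional<String> locate(String clientId, String objectId) {
            for (String node : route.path(clientId, objectId)) {
                String server = pointers.getOrDefault(node, Map.of()).get(objectId);
                if (server != null) return Optional.of(server);
            }
            return Optional.empty();
        }
    }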

Why Proximity Routing?
- fewer and shorter IP hops mean shorter end-to-end latency, less bandwidth use and congestion, and a lower chance of crossing broken or lossy links
[Figure: overlay route to node 01234 with and without proximity routing]

Performance Impact (Proximity)
- simulated Tapestry with and without proximity on a 5000-node transit-stub network
- measured pair-wise routing stretch between 200 random nodes

DOLR vs. Distributed Hash Table
- DHT: content is hashed to a name, which determines replica placement; modifications mean replicating a new version into the DHT
- DOLR: the application places a copy near the requests, and the overlay routes messages to it
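A hypothetical usage contrast, with both interfaces assumed for illustration: updating an object through a DHT re-inserts the whole new version, while a DOLR only routes the update message to the copy the application placed.

    // Hypothetical interfaces (assumptions, not real APIs): a DHT stores bytes
    // under hash(name); a DOLR leaves the bytes where the application put them
    // and only routes messages to that location.
    interface Dht     { void put(byte[] key, byte[] value); byte[] get(byte[] key); }
    interface DolrApi { void publish(byte[] objectId); void routeToObject(byte[] objectId, byte[] msg); }

    class UpdateExample {
        static void updateViaDht(Dht dht, byte[] key, byte[] newVersion) {
            dht.put(key, newVersion);            // whole new version re-inserted into the network
        }
        static void updateViaDolr(DolrApi dolr, byte[] objectId, byte[] delta) {
            dolr.routeToObject(objectId, delta); // object stays in place; only the update message travels
        }
    }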

Performance Impact (DOLR)
- simulated Tapestry with DOLR and DHT interfaces on a 5000-node transit-stub network
- measured route-to-object latency from clients in 2 stub networks
- DHT: 5 object replicas; DOLR: 1 replica placed in each stub network

Talk Outline
- Motivation
- Decentralized object location and routing
- Resilient and responsive routing
- Tapestry deployment performance
- Wrap-up

How do you get fast responses to faults? Response time = fault-detection + alternate path discovery + time to switch

Fast Response via Static Resiliency
Reducing fault-detection time
- monitor paths to neighbors with periodic UDP probes
- only O(log(n)) neighbors, so probing can be frequent at low bandwidth
- exponentially weighted moving average for link-quality estimation avoids route flapping due to short-term loss artifacts: loss rate Ln = (1 - α) · Ln-1 + α · p
Eliminate synchronous backup path discovery
- actively maintain redundant paths, redirect traffic immediately, repair redundancy asynchronously
- create and store backups at node insertion; restore redundancy via random pair-wise queries after failures
End result: fast detection + precomputed paths = increased responsiveness
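A minimal Java sketch of the EWMA link-quality estimator in the formula above; the probe scheduling and thresholding around it are assumptions.

    // EWMA loss estimator for one overlay link, as in Ln = (1 - alpha) * Ln-1 + alpha * p.
    // alpha is the filter constant; p is the instantaneous loss observation (1 = probe lost, 0 = acked).
    class LinkQuality {
        private final double alpha;
        private double lossRate = 0.0;

        LinkQuality(double alpha) { this.alpha = alpha; }

        void recordProbe(boolean probeLost) {
            double p = probeLost ? 1.0 : 0.0;
            lossRate = (1 - alpha) * lossRate + alpha * p;   // smoothed estimate resists short-term artifacts
        }

        double lossRate() { return lossRate; }

        boolean usable(double threshold) { return lossRate <= threshold; }
    }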

Routing Policies
Use the estimated overlay link quality to choose the shortest "usable" link
- i.e. use the shortest overlay link with quality above a minimum threshold T
Alternative policies prioritize low loss over latency
- use the least lossy overlay link
- use the path with the minimal cost function, e.g. cf = x · latency + y · loss rate
Notes: this is not perfect because of possible correlated failures, but we can leverage existing work on failure-independent overlay construction; and if only one link exists, no policy can win.
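The two policies can be sketched as follows; the Link record, its fields, and the weights x and y are illustrative assumptions.

    import java.util.*;

    // Sketch of the two routing policies above. A candidate overlay link carries
    // an estimated latency and loss rate (e.g. from the EWMA estimator).
    class RoutePolicy {
        record Link(String neighborId, double latencyMs, double lossRate) {}

        // Policy 1: shortest overlay link whose loss rate is below the quality threshold.
        static Optional<Link> shortestUsable(List<Link> candidates, double maxLoss) {
            return candidates.stream()
                    .filter(l -> l.lossRate() <= maxLoss)
                    .min(Comparator.comparingDouble(Link::latencyMs));
        }

        // Policy 2: minimize a cost function cf = x * latency + y * lossRate.
        static Optional<Link> minCost(List<Link> candidates, double x, double y) {
            return candidates.stream()
                    .min(Comparator.comparingDouble((Link l) -> x * l.latencyMs() + y * l.lossRate()));
        }
    }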

Talk Outline
- Motivation
- Decentralized object location and routing
- Resilient and responsive routing
- Tapestry deployment performance
- Wrap-up

Tapestry, a DOLR Protocol
- routing based on incremental prefix matching
- latency-optimized routing mesh: nearest-neighbor algorithm (HKRZ02); supports massive failures and large group joins
- built-in redundant overlay links: 2 backup links maintained with each primary
- "objects" serve as rendezvous endpoints: nodes publish names to announce their presence, e.g. a wireless proxy publishes a nearby laptop's ID, or multicast listeners publish the multicast session name to self-organize around a rendezvous point

Weaving a Tapestry
Inserting node 0123 into the network:
1. route to its own ID, find the 012X nodes, and fill the last column of the routing table
2. request backpointers to the 01XX nodes
3. measure distances and add candidates to the routing table
4. prune each entry to the nearest K nodes
5. repeat steps 2-4 for shorter prefixes
[Figure: routing table of node 0123 being filled as it joins the existing Tapestry, with levels XXXX (1XXX, 2XXX, 3XXX), 0XXX (00XX, 02XX, 03XX), 01XX (010X, 011X, 013X), and 012X (0120, 0121, 0122)]
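A schematic Java sketch of the join loop in the steps above, assuming helper calls for prefix lookup, backpointer queries, and distance measurement; it is meant to show the shape of the algorithm, not the actual Tapestry insertion code.

    import java.util.*;

    // Schematic sketch of the join procedure on this slide. The helper interface
    // (prefix lookup, backpointers, distance) and the data structures are
    // assumptions for illustration.
    class JoinSketch {
        interface Overlay {
            List<String> nodesMatchingPrefix(String prefix);   // e.g. "012" -> nodes 012X
            List<String> backpointersTo(String prefix);        // nodes that route via this prefix
            double distance(String from, String to);           // network latency estimate
        }

        // routingTable.get(level) = up to K nearest neighbors sharing `level` prefix digits
        static Map<Integer, List<String>> join(Overlay overlay, String newId, int k) {
            Map<Integer, List<String>> routingTable = new HashMap<>();
            // Step 1: fill the longest-prefix level via a route to the new node's own ID.
            int top = newId.length() - 1;
            routingTable.put(top, nearest(overlay, newId, overlay.nodesMatchingPrefix(newId.substring(0, top)), k));
            // Steps 2-5: walk back through shorter prefixes, asking for backpointers,
            // measuring distance, and pruning to the nearest K at each level.
            for (int level = top - 1; level >= 0; level--) {
                List<String> candidates = overlay.backpointersTo(newId.substring(0, level));
                routingTable.put(level, nearest(overlay, newId, candidates, k));
            }
            return routingTable;
        }

        private static List<String> nearest(Overlay o, String self, List<String> candidates, int k) {
            return candidates.stream()
                    .sorted(Comparator.comparingDouble((String c) -> o.distance(self, c)))
                    .limit(k)
                    .toList();
        }
    }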

Implementation Performance
Java implementation: 35,000+ lines in core Tapestry, 1500+ downloads
Micro-benchmarks
- per-message overhead ~50 µs; most latency comes from byte copying
- performance scales with CPU speedup
- 5 KB messages on a P-IV 2.4 GHz: throughput ~10,000 msgs/sec
Routing stretch
- route to node: < 2
- route to objects/endpoints: < 3, with higher stretch for nearby objects

Responsiveness to Faults (PlanetLab)
- probing bandwidth grows slowly with network size N: N = 300 gives ~7 KB/s per node, N = 10^6 gives ~20 KB/s
- simulation: with link failure rates under 10%, Tapestry can route around 90% of survivable failures
[Figure: PlanetLab responsiveness measurements for filter constants α = 0.2 and α = 0.4 at settings 300 and 660, 20 runs for each point]

Stability Under Membership Changes
- routing operations measured on a 40-node Tapestry cluster
- churn: nodes join/leave every 10 seconds, average lifetime = 2 minutes
[Figure: routing success rate (%) over time during node kills, constant churn, and a large group join]

Talk Outline
- Motivation
- Decentralized object location and routing
- Resilient and responsive routing
- Tapestry deployment performance
- Wrap-up

Lessons and Takeaways
Consider system constraints in algorithm design
- we are limited by finite resources (e.g. file descriptors, bandwidth)
- simplicity wins over small performance gains: easier adoption and faster time to implementation
Wide-area state management (e.g. routing state)
- reactive algorithms for best-effort, fast response
- proactive periodic maintenance for correctness
A naive event programming model is too low-level
- much code complexity comes from managing stack state
- this matters for protocols with asynchronous control algorithms
- explicit thread support is needed for callbacks and stack management

Future Directions
Ongoing work to explore the p2p application space
- resilient anonymous routing, attack resiliency
Intelligent overlay construction
- router-level listeners allow application queries
- efficient meshes, fault-independent backup links, failure notification
Deploying and measuring a lightweight peer-based application
- focus on usability and low overhead
- p2p incentives, security, and deployment meet the real world
A holistic approach to overlay security and control
- p2p is good for self-organization, not for security and management; the interplay between these two goals in a first-class infrastructure
- decouple administration from normal operation
- explicit domains / hierarchy for configuration, analysis, and control

Thanks! Questions, comments? ravenben@eecs.berkeley.edu

Impact of Correlated Events
- historically, events have been largely independent (e.g. web / application server requests), and servers focus on maximizing individual throughput
- event relationships are becoming increasingly prevalent: peer-to-peer control messages, large-scale data aggregation networks
- an action may require A *and* B *and* C for progress: correlated requests A + B + C produce D
- examples: online continuous queries, sensor aggregation, p2p control layers, streaming data mining
[Figure: an event handler blocked until correlated events A, B, and C all arrive from the network, in contrast to servers handling independent requests]

Some Details
Simple fault detection techniques
- periodically probe overlay links to neighbors
- exponentially weighted moving average for link-quality estimation avoids route flapping due to short-term loss artifacts: Ln = (1 - α) · Ln-1 + α · p, where p is the instantaneous loss rate and α is the filter constant
- other techniques are topics of open research
How do we get and repair the backup links?
- each hop has a flexible routing constraint: in prefix routing the 1st hop only requires 1 fixed digit, so backups are always available until the last hop to the destination
- create and store backups at node insertion
- restore redundancy via random pair-wise queries after failures, e.g. to replace a 123X neighbor, talk to local 12XX neighbors

Route Redundancy (Simulator)
- the simulation constructs shortest paths to emulate IP routes, then breaks links and examines reachability
- simulation of Tapestry with 2 backup paths per routing entry
- 2 backups give low maintenance overhead and good resiliency

Another Perspective on Reachability
- portion of all pair-wise paths where no failure-free path remains
- portion where a path exists, but neither IP nor FRLS can locate it
- portion of all paths where IP and FRLS both route successfully
- portion where FRLS finds a path but short-term IP routing fails

Single Node Software Architecture
- applications access the node through an application programming interface
- beneath the API sit the core router, the dynamic Tapestry component, and Patchwork with its distance map, on top of the network layer
- everything runs on the SEDA event-driven framework inside a Java Virtual Machine
[Figure: layered single-node architecture diagram]

Related Work
Unstructured peer-to-peer applications
- Napster, Gnutella, KaZaA: probabilistic search, difficult to scale, inefficient bandwidth use
Structured peer-to-peer overlays
- Chord, CAN, Pastry, Kademlia, SkipNet, Viceroy, Symphony, Koorde, Coral, Ulysses, …
- differ in routing efficiency and application interface
Resilient routing
- traffic redirection layers: Detour, Resilient Overlay Networks (RON), Internet Indirection Infrastructure (i3)
- our goals: scalability and in-network traffic redirection

Node to Node Routing (PlanetLab)
- ratio of end-to-end latency to ping distance between nodes
- all node pairs measured, placed into buckets (median = 31.5, 90th percentile = 135)

Object Location (PlanetLab)
- ratio of end-to-end latency to client-object ping distance (90th percentile = 158)
- local-area stretch improved with additional location state

Micro-benchmark Results (LAN)
- measured on a 100 Mb/s LAN
- per-message overhead ~50 µs; latency dominated by byte copying
- performance scales with CPU speedup
- for 5 KB messages, throughput ~10,000 msgs/sec

Traffic Tunneling
- legacy (non-overlay) nodes A and B are identified by their IP addresses; each registers with a nearby overlay proxy
- the proxy stores a mapping from the end host's IP to the proxy's overlay ID: put(hash(B), P'(B)); a sender's proxy looks it up with get(hash(B))
- traffic from A to B is then tunneled across the structured peer-to-peer overlay between the two proxies
- not a unique engineering approach: similar to Internet Indirection Infrastructure (i3)
[Figure: legacy nodes A and B registering with their proxies, which store and retrieve the hash(IP)-to-proxy-ID mappings through the overlay]
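A sketch of the registration and lookup step, assuming a generic put/get overlay interface; the class and method names are illustrative, not an existing API.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Optional;

    // Sketch of the tunneling registration described above, on top of an assumed
    // put/get overlay interface. A proxy registers the legacy host's mapping;
    // a sending proxy resolves the destination's proxy ID.
    class TunnelRegistry {
        interface Overlay {
            void put(byte[] key, byte[] value);
            Optional<byte[]> get(byte[] key);
        }

        private final Overlay overlay;
        TunnelRegistry(Overlay overlay) { this.overlay = overlay; }

        // Called by B's proxy: store hash(B) -> P'(B), the proxy's overlay ID.
        void register(String legacyIp, byte[] proxyOverlayId) throws Exception {
            overlay.put(sha1(legacyIp), proxyOverlayId);
        }

        // Called by A's proxy when A sends to B: find the proxy responsible for B.
        Optional<byte[]> resolve(String destinationIp) throws Exception {
            return overlay.get(sha1(destinationIp));
        }

        private static byte[] sha1(String s) throws Exception {
            return MessageDigest.getInstance("SHA-1").digest(s.getBytes(StandardCharsets.UTF_8));
        }
    }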

Constrained Multicast
- used only when all paths are below the quality threshold
- send duplicate messages on multiple paths and leverage route convergence
- assign unique message IDs, mark duplicates, keep a moving window of seen IDs, and recognize and drop duplicates
- limitations: assumes loss is not from congestion; ideal for local-area routing
[Figure: duplicate copies of a message fan out from node 1111 across several overlay paths and converge near the destination, where duplicates are dropped]
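A small sketch of the duplicate-suppression step, keeping a moving window of recently seen message IDs; the window-based eviction and the class name are assumptions.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Set;

    // Duplicate suppression with a moving window of recently seen message IDs,
    // as described on the slide (window size is an assumption).
    class DuplicateFilter {
        private final int windowSize;
        private final ArrayDeque<Long> order = new ArrayDeque<>(); // eviction order
        private final Set<Long> seen = new HashSet<>();

        DuplicateFilter(int windowSize) { this.windowSize = windowSize; }

        // Returns true if the message should be delivered, false if it is a duplicate.
        synchronized boolean accept(long messageId) {
            if (seen.contains(messageId)) return false;       // duplicate copy: drop
            seen.add(messageId);
            order.addLast(messageId);
            if (order.size() > windowSize) seen.remove(order.removeFirst());
            return true;
        }
    }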

Link Probing Bandwidth (PlanetLab)
- probing bandwidth increases logarithmically with overlay size
- medium-sized routing overlays incur low probing bandwidth