An Overlay Infrastructure for Decentralized Object Location and Routing
Ben Y. Zhao (ravenben@eecs.berkeley.edu)
University of California at Berkeley, Computer Science Division
Peer-based Distributed Computing
Cooperative approach to large-scale applications
- peer-based: available resources scale with the number of participants
- better than client/server, which has limited resources and scalability
Large-scale, cooperative applications are coming
- content distribution networks (e.g. FastForward)
- large-scale backup / storage utilities: leverage peers' storage for higher resiliency / availability
- cooperative web caching
- application-level multicast
- video on-demand, streaming movies
What Are the Technical Challenges?
File system example: replicate files for resiliency / performance
- how do you find nearby replicas?
- how does this scale to millions of users? billions of files?
Node Membership Changes
- nodes join and leave the overlay, or fail
- data and control state need to know about available resources
- node membership management is a necessity
A Fickle Internet
- Internet disconnections are not rare (UMichTR98, IMC02)
- TCP retransmission is not enough; we need to route around failures
- IP route repair takes too long: IS-IS ~5 s, BGP 3-15 min
- good end-to-end performance requires fast response to faults
Note: TCP retransmission handles congestion, but on a disconnected route TCP simply retries until it times out, leaving the application to deal with the error; under heavy congestion TCP backs off its sending rate so much that the application sees something similar to large packet loss.
An Infrastructure Approach
- first generation of large-scale apps took a vertical approach: every application re-solved the same hard problems, which are difficult to get right
- instead, solve the common challenges once: build a single overlay infrastructure at the application layer
[Figure: applications such as FastForward, Yahoo IM, and SETI each re-implement data location, dynamic membership, and reliable communication on top of the network stack; the overlay infrastructure factors these out as shared layers above the physical Internet]
The shared infrastructure provides:
- efficient, scalable data location
- dynamic node membership algorithms
- reliable communication
Personal Research Roadmap
[Figure: research timeline]
- service discovery service; XSet, a lightweight XML DB (Mobicom 99, 5000+ downloads); TSpaces
- Tapestry, a structured overlay / DOLR building on PRR 97
  - applications: multicast (Bayeux, NOSSDAV 02), file system (OceanStore, ASPLOS 99 / FAST 03), spam filtering (SpamWatch, Middleware 03), rapid mobility (Warp, IPTPS 04)
  - robust dynamic algorithms, resilient overlay routing (JSAC 04); structured overlay APIs (SPAA 02 / TOCS, ICNP 03, IPTPS 03); landmark routing (Brocade, IPTPS 02); WAN deployment (1500+ downloads)
- modeling of non-stationary datasets
Deployment interest: 4300 site visits (8-9/day), 1300 downloads from 50+ countries; 421 academic, 193 industry, 55 labs; users include a search engine company, a European bank, open source film distribution, a hospital, and a TV station
Talk Outline
- Motivation
- Decentralized object location and routing
- Resilient routing
- Tapestry deployment performance
- Wrap-up
What should this infrastructure look like?
Here is one appealing direction…
Structured Peer-to-Peer Overlays
- node IDs and keys drawn from a randomized namespace (SHA-1)
- incremental routing towards the destination ID
- each node has a small set of outgoing routes, e.g. prefix routing (sketched below)
- log(n) neighbors per node, log(n) hops between any node pair
[Figure: example route toward key ABCD: A930 → AB5F → ABC0 → ABCE, matching one additional prefix digit per hop]
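To make prefix routing concrete, here is a minimal sketch of next-hop selection, assuming a digit-based ID space and a routing table keyed by (shared prefix length, next digit); the class and method names are illustrative, not the actual Tapestry API.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of prefix routing: forward to the neighbor that
    // shares one more leading digit with the destination than we do.
    // (Hypothetical names; not the actual Tapestry implementation.)
    public class PrefixRouter {
        private final String localId;   // e.g. "ABCE"
        // routeTable[level] maps the next digit to a neighbor sharing
        // 'level' leading digits with the local node.
        private final Map<Integer, Map<Character, String>> routeTable = new HashMap<>();

        public PrefixRouter(String localId) { this.localId = localId; }

        public void addNeighbor(String neighborId) {
            int level = sharedPrefix(localId, neighborId);
            if (level >= localId.length()) return;   // ignore self
            routeTable.computeIfAbsent(level, k -> new HashMap<>())
                      .put(neighborId.charAt(level), neighborId);
        }

        // Next hop for a destination key: match one more digit than we share.
        public String nextHop(String destKey) {
            int level = sharedPrefix(localId, destKey);
            if (level == destKey.length()) return localId;   // we are the root
            Map<Character, String> row = routeTable.get(level);
            if (row == null) return localId;                  // no closer node known
            String hop = row.get(destKey.charAt(level));
            return hop != null ? hop : localId;   // real protocol would use surrogate routing here
        }

        private static int sharedPrefix(String a, String b) {
            int i = 0;
            while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
            return i;
        }
    }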
Related Work
Unstructured peer-to-peer approaches
- Napster, Gnutella, KaZaa
- probabilistic search (optimized for the hay, not the needle)
- locality-agnostic routing (resulting in high network bandwidth costs)
Structured peer-to-peer overlays
- the first protocols (2001): Tapestry, Pastry, Chord, CAN
- then: Kademlia, SkipNet, Viceroy, Symphony, Koorde, Ulysseus…
- distinction: how to choose your neighbors
  - Tapestry, Pastry: latency-optimized routing mesh
- distinction: application interface
  - distributed hash table: put(key, data); data = get(key)
  - Tapestry: decentralized object location and routing
Defining the Requirements
Efficient routing to nodes and data
- low routing stretch (ratio of overlay latency to shortest-path distance)
Flexible data location
- applications want/need to control data placement, allowing application-specific performance optimizations
- directory interface: publish(ObjID), RouteToObj(ObjID, msg); a minimal interface sketch follows this list
- data can stay in place and be mutable
Resilient and responsive to faults
- more than just retransmission: route around failures
- reduce the negative impact (loss/jitter) on the application
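To make the directory interface concrete, here is a minimal sketch of what a DOLR-style API might look like in Java; the interface and method names are illustrative placeholders rather than the actual Tapestry API.

    import java.nio.ByteBuffer;

    // Illustrative DOLR-style interface: the application keeps the data and
    // chooses where replicas live; the overlay only routes messages to them.
    // (Hypothetical names; not the actual Tapestry API.)
    public interface Dolr {
        // Announce that an object named objId is hosted at this node.
        // The overlay leaves pointers along the publish path toward the ID's root.
        void publish(byte[] objId);

        // Remove a previously published pointer trail.
        void unpublish(byte[] objId);

        // Route msg to some (ideally nearby) node that published objId.
        void routeToObject(byte[] objId, ByteBuffer msg);

        // Route msg toward the node whose ID is closest to nodeId.
        void routeToNode(byte[] nodeId, ByteBuffer msg);
    }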
Decentralized Object Location & Routing
[Figure: a server publishes object k; publish(k) leaves pointers along the path toward k's root; later routeobj(k) requests follow the overlay backbone until they intersect a pointer and are redirected to the object]
- where objects are placed is orthogonal: DOLR is a slightly lower-level abstraction that lets the application place data; placement strategy is its own area of research
- redirect data traffic using log(n) in-network redirection pointers
- average # of pointers per machine: log(n) × average files per machine
- keys to performance: a proximity-enabled routing mesh with routing convergence
Why Proximity Routing?
[Figure: two example overlay routes between the same endpoints, with and without proximity-aware neighbor selection]
- fewer/shorter IP hops: lower end-to-end latency, less bandwidth/congestion, less likely to cross broken/lossy links
Performance Impact (Proximity)
- simulated Tapestry with and without proximity on a 5000-node transit-stub network
- measured pair-wise routing stretch between 200 random nodes
DOLR vs. Distributed Hash Table
- DHT: replica placement is determined by hashing the content name; modifications require replicating a new version into the DHT
- DOLR: the application places a copy near the requests, and the overlay routes messages to it
Performance Impact (DOLR)
- simulated Tapestry with DOLR and DHT interfaces on a 5000-node transit-stub network
- measured route-to-object latency from clients in 2 stub networks
- DHT: 5 object replicas; DOLR: 1 replica placed in each stub network
Talk Outline
- Motivation
- Decentralized object location and routing
- Resilient and responsive routing
- Tapestry deployment performance
- Wrap-up
How do you get fast responses to faults?
Response time = fault detection + alternate path discovery + time to switch
Fast Response via Static Resiliency
Reducing fault-detection time
- monitor paths to neighbors with periodic UDP probes
- O(log(n)) neighbors: higher probe frequency at low bandwidth cost
- exponentially weighted moving average for link quality estimation, avoiding route flapping due to short-term loss artifacts
- loss rate: L_n = (1 − α) · L_(n−1) + α · p   (see the sketch after this list)
Eliminate synchronous backup path discovery
- actively maintain redundant paths, redirect traffic immediately, repair redundancy asynchronously
- create and store backups at node insertion; restore redundancy via random pair-wise queries after failures
End result: fast detection + precomputed paths = increased responsiveness
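A minimal sketch of the EWMA link-quality estimator above, assuming α is the filter constant and p is the loss observed in the latest probe window; the class and method names are illustrative, not the actual Tapestry probing code.

    // EWMA link-quality estimator: L_n = (1 - alpha) * L_(n-1) + alpha * p.
    // Illustrative sketch only.
    public class LinkQuality {
        private final double alpha;      // filter constant, e.g. 0.2 or 0.4
        private double lossRate = 0.0;   // smoothed loss estimate L_n

        public LinkQuality(double alpha) { this.alpha = alpha; }

        // p = instantaneous loss rate observed in the latest probe window (0.0 .. 1.0)
        public void update(double p) {
            lossRate = (1.0 - alpha) * lossRate + alpha * p;
        }

        // A link is "usable" if its smoothed loss stays under the threshold T.
        public boolean usable(double threshold) {
            return lossRate < threshold;
        }

        public double lossRate() { return lossRate; }
    }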
Routing Policies
Use estimated overlay link quality to choose the shortest "usable" link
- i.e. the shortest overlay link whose quality exceeds a threshold T
Alternative policies prioritize low loss over latency
- use the least lossy overlay link
- use the path with minimal "cost function", e.g. cf = a · latency + b · loss rate for weights a and b (see the sketch below)
Note: this is not perfect because of possible correlated failures, but it can leverage existing work on failure-independent overlay construction; if only one link remains, no policy can help.
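A minimal sketch of the policy choices above, assuming each routing-table entry carries a primary link plus backups with a measured latency and a LinkQuality estimator as in the previous sketch; the class names and weight values are illustrative.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // Illustrative route-selection policies over one routing entry's links.
    // (Hypothetical classes; not the actual Tapestry code.)
    public class RoutePolicy {
        public record Link(String neighborId, double latencyMs, LinkQuality quality) {}

        // Default policy: shortest link whose smoothed loss is below threshold T.
        public static Optional<Link> shortestUsable(List<Link> links, double lossThreshold) {
            return links.stream()
                    .filter(l -> l.quality().usable(lossThreshold))
                    .min(Comparator.comparingDouble(Link::latencyMs));
        }

        // Alternative: least lossy link, ignoring latency.
        public static Optional<Link> leastLossy(List<Link> links) {
            return links.stream()
                    .min(Comparator.comparingDouble((Link l) -> l.quality().lossRate()));
        }

        // Alternative: minimize a weighted cost function cf = a*latency + b*loss.
        public static Optional<Link> minCost(List<Link> links, double a, double b) {
            return links.stream()
                    .min(Comparator.comparingDouble(
                            (Link l) -> a * l.latencyMs() + b * l.quality().lossRate()));
        }
    }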
Talk Outline
- Motivation
- Decentralized object location and routing
- Resilient and responsive routing
- Tapestry deployment performance
- Wrap-up
Tapestry, a DOLR Protocol
- routing based on incremental prefix matching
- latency-optimized routing mesh
  - nearest-neighbor algorithm (HKRZ02); supports massive failures and large group joins
- built-in redundant overlay links: 2 backup links maintained with each primary
- use "objects" as endpoints for rendezvous
  - nodes publish names to announce their presence
  - e.g. a wireless proxy publishes a nearby laptop's ID
  - e.g. multicast listeners publish the multicast session name to self-organize around a rendezvous point
Weaving a Tapestry
Inserting node 0123 into the network:
1. route to own ID, find 012X nodes, fill the last routing-table column
2. request backpointers to 01XX nodes
3. measure distance, add to the routing table (rTable)
4. prune to the nearest K nodes
5. repeat steps 2-4 for the remaining prefix levels
[Figure: new node with ID = 0123 joining an existing Tapestry; routing-table levels for prefixes X, 0X, 01X, and 012 filled with neighbors such as 1XXX/2XXX/3XXX, 00XX/02XX/03XX, 010X/011X/013X, and 0120/0121/0122]
A structural sketch of this insertion loop follows below.
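The sketch below shows only the shape of the insertion loop, with all network interactions stubbed out; every class and helper name (findNodesMatching, requestBackpointers, measureRtt) is a hypothetical placeholder, not the actual Tapestry implementation.

    import java.util.*;

    // Structural sketch of node insertion: fill each routing-table level with
    // the nearest K neighbors sharing that prefix, iterating over prefix levels.
    public abstract class NodeInsertion {
        static final int K = 3;                 // nearest neighbors kept per entry (illustrative)

        protected final String localId;         // e.g. "0123"
        protected final Map<Integer, List<String>> rTable = new HashMap<>();

        protected NodeInsertion(String localId) { this.localId = localId; }

        public void join() {
            int levels = localId.length();
            // Step 1: route to our own ID to find nodes matching the longest prefix (012X).
            List<String> candidates =
                    new ArrayList<>(findNodesMatching(localId.substring(0, levels - 1)));
            for (int level = levels - 1; level >= 0; level--) {
                // Steps 3-4: measure distance, keep the nearest K at this level.
                candidates.sort(Comparator.comparingDouble(this::measureRtt));
                rTable.put(level, new ArrayList<>(
                        candidates.subList(0, Math.min(K, candidates.size()))));
                // Step 2 (next iteration): widen the candidate set via backpointers
                // of nodes matching one fewer prefix digit (e.g. 01XX, then 0XXX).
                if (level > 0) {
                    candidates = new ArrayList<>(
                            requestBackpointers(localId.substring(0, level - 1)));
                }
            }
        }

        // Network operations, stubbed out for this sketch.
        protected abstract List<String> findNodesMatching(String prefix);
        protected abstract List<String> requestBackpointers(String prefix);
        protected abstract double measureRtt(String nodeId);
    }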
Implementation Performance
Java implementation: 35,000+ lines in core Tapestry, 1500+ downloads
Micro-benchmarks
- per-message overhead: ~50 μs, most latency from byte copying
- performance scales with CPU speedup
- 5 KB messages on a P-IV 2.4 GHz: throughput ~10,000 msgs/sec
Routing stretch
- route to node: < 2
- route to objects/endpoints: < 3; higher stretch for nearby objects
Responsiveness to Faults (PlanetLab)
[Figure: time to route around faults for parameter settings 300 and 660 with filter constants α = 0.2 and α = 0.4; 20 runs per point]
- probing bandwidth per node grows slowly with network size N: ~7 KB/s per node at N = 300, ~20 KB/s at N = 10^6
- simulation: if the link failure rate is < 10%, Tapestry can route around 90% of survivable failures
Stability Under Membership Changes
[Figure: routing success rate (%) over time under three scenarios: killing nodes, constant churn, and a large group join]
- routing operations on a 40-node Tapestry cluster
- churn: nodes join/leave every 10 seconds, average lifetime = 2 minutes
Talk Outline
- Motivation
- Decentralized object location and routing
- Resilient and responsive routing
- Tapestry deployment performance
- Wrap-up
Lessons and Takeaways
Consider system constraints in algorithm design
- limited by finite resources (e.g. file descriptors, bandwidth)
- simplicity wins over small performance gains: easier adoption and faster time to implementation
Wide-area state management (e.g. routing state)
- reactive algorithm for best-effort, fast response
- proactive periodic maintenance for correctness
Naïve event programming model is too low-level
- much code complexity comes from managing stack state
- important for protocols with asynchronous control algorithms
- need explicit thread support for callbacks / stack management
Future Directions
Ongoing work to explore the p2p application space
- resilient anonymous routing, attack resiliency
Intelligent overlay construction
- router-level listeners allow application queries
- efficient meshes, fault-independent backup links, failure notification
Deploying and measuring a lightweight peer-based application
- focus on usability and low overhead
- p2p incentives, security, and deployment meet the real world
A holistic approach to overlay security and control
- p2p is good for self-organization, not for security / management
- decouple administration from normal operation
- explicit domains / hierarchy for configuration, analysis, control
- interplay between the two goals in a first-class infrastructure
Thanks! Questions, comments? ravenben@eecs.berkeley.edu
Impact of Correlated Events
[Figure: an event handler waiting on messages A, B, and C from the network before it can make progress]
- historically, events were largely independent (e.g. web server requests); web / application servers focus on maximizing individual throughput
- event relationships are becoming increasingly prevalent: peer-to-peer control messages, large-scale data aggregation networks
- correlated requests: an action requires A *and* B *and* C for progress (A + B + C → D)
- e.g. online continuous queries, sensor aggregation, the p2p control layer, streaming data mining
Some Details
Simple fault detection techniques
- periodically probe overlay links to neighbors
- exponentially weighted moving average for link quality estimation, avoiding route flapping due to short-term loss artifacts
- loss rate: L_n = (1 − α) · L_(n−1) + α · p, where p = instantaneous loss rate and α = filter constant
- other techniques are topics of open research
How do we get and repair the backup links?
- each hop has a flexible routing constraint: e.g. in prefix routing, the 1st hop only requires 1 fixed digit, so backups are always available until the last hop to the destination
- create and store backups at node insertion
- restore redundancy via random pair-wise queries after failures: e.g. to replace a 123X neighbor, talk to local 12XX neighbors (a repair sketch follows below)
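A minimal sketch of the backup-repair idea above: when a routing entry loses a link, ask other neighbors that share a shorter prefix whether they know a suitable replacement. All names are illustrative placeholders, not the actual Tapestry repair protocol.

    import java.util.List;
    import java.util.Optional;

    // Illustrative backup-link repair: to replace a lost 123X neighbor, query
    // local 12XX neighbors for any node they know matching the 123 prefix.
    public abstract class BackupRepair {
        // Ask one neighbor for the nodes it knows that match the given prefix.
        protected abstract List<String> queryNeighbor(String neighborId, String prefix);

        // Our own neighbors at the shorter prefix level (e.g. the 12XX entries).
        protected abstract List<String> neighborsAtLevel(int level);

        public Optional<String> findReplacement(String lostPrefix) {
            int shorterLevel = lostPrefix.length() - 1;    // e.g. "123" -> level of "12"
            for (String neighbor : neighborsAtLevel(shorterLevel)) {
                for (String candidate : queryNeighbor(neighbor, lostPrefix)) {
                    if (candidate.startsWith(lostPrefix)) {
                        return Optional.of(candidate);     // found a suitable backup
                    }
                }
            }
            return Optional.empty();   // repair deferred to periodic maintenance
        }
    }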
Route Redundancy (Simulator)
- simulation constructs shortest paths to emulate IP routes, then breaks links and measures reachability
- simulation of Tapestry with 2 backup paths per routing entry
- 2 backups: low maintenance overhead, good resiliency
Another Perspective on Reachability
[Figure: breakdown of all pair-wise paths under failures]
- portion of paths where no failure-free path remains
- portion where a path exists, but neither IP nor FRLS can locate it
- portion where IP and FRLS both route successfully
- portion where FRLS finds a path but short-term IP routing fails
Single Node Software Architecture
[Figure: per-node software stack: applications over an application programming interface; a core router with Dynamic Tapestry, Patchwork, and a distance map; all built on the SEDA event-driven framework in the Java Virtual Machine, over the network]
Related Work
Unstructured peer-to-peer applications
- Napster, Gnutella, KaZaa: probabilistic search, difficult to scale, inefficient bandwidth use
Structured peer-to-peer overlays
- Chord, CAN, Pastry, Kademlia, SkipNet, Viceroy, Symphony, Koorde, Coral, Ulysseus, …
- distinctions: routing efficiency, application interface
Resilient routing
- traffic redirection layers: Detour, Resilient Overlay Networks (RON), Internet Indirection Infrastructure (I3)
- our goals: scalability, in-network traffic redirection
Node to Node Routing (PlanetLab)
[Figure: routing stretch vs. inter-node ping distance; median = 31.5, 90th percentile = 135]
- ratio of end-to-end overlay latency to ping distance between nodes
- all node pairs measured, placed into buckets by distance
Object Location (PlanetLab)
[Figure: object location stretch vs. client-object ping distance; 90th percentile = 158]
- ratio of end-to-end latency to client-object ping distance
- local-area stretch improves with additional location state
Micro-benchmark Results (LAN)
[Figure: per-message latency and throughput on a 100 Mb/s LAN]
- per-message overhead ~50 μs, latency dominated by byte copying
- performance scales with CPU speedup
- for 5 KB messages, throughput ≈ 10,000 msgs/sec
Structured Peer-to-Peer Overlay Traffic Tunneling
[Figure: legacy nodes A and B (IP addresses) each register with a nearby proxy; each proxy stores its mapping into the overlay with put(hash(A), P'(A)) and put(hash(B), P'(B)); to reach B, A's proxy performs get(hash(B)) to find P'(B) and tunnels traffic through the structured overlay]
- store a mapping from each end host's IP address to its proxy's overlay ID (a registration sketch follows below)
- not a unique engineering approach; similar to the Internet Indirection Infrastructure (I3)
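A minimal sketch of the proxy-side registration and lookup above, written against a generic put/get interface standing in for the overlay; the Dht interface and proxy logic are illustrative placeholders rather than the deployed code.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Illustrative proxy-side registration and lookup for traffic tunneling.
    public class TunnelProxy {
        public interface Dht {
            void put(byte[] key, byte[] value);
            byte[] get(byte[] key);
        }

        private final Dht overlay;
        private final byte[] proxyOverlayId;   // P'(X): this proxy's ID in the overlay

        public TunnelProxy(Dht overlay, byte[] proxyOverlayId) {
            this.overlay = overlay;
            this.proxyOverlayId = proxyOverlayId;
        }

        // A legacy host registers with this proxy: store hash(IP) -> proxy overlay ID.
        public void register(String legacyHostIp) {
            overlay.put(hash(legacyHostIp), proxyOverlayId);
        }

        // To reach another legacy host, look up its proxy's overlay ID and tunnel to it.
        public byte[] lookupProxyFor(String legacyHostIp) {
            return overlay.get(hash(legacyHostIp));
        }

        private static byte[] hash(String ip) {
            try {
                return MessageDigest.getInstance("SHA-1")
                        .digest(ip.getBytes(StandardCharsets.UTF_8));
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);   // SHA-1 is always available
            }
        }
    }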
Constrained Multicast
- used only when all paths are below the quality threshold
- send duplicate messages on multiple paths, leveraging route convergence
- assign unique message IDs and mark duplicates
- keep a moving window of IDs; recognize and drop duplicates (a duplicate-filter sketch follows below)
Limitations
- assumes loss is not from congestion
- ideal for local-area routing
[Figure: duplicate copies of a message routed along multiple overlay paths (through nodes such as 2046, 2225, 2274, 2281, 2286, 2299, 2530, 1111) converge, and the extras are dropped]
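A minimal sketch of duplicate suppression using a moving window of recently seen message IDs; the window size and class name are illustrative, not the actual Tapestry code.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative duplicate filter for constrained multicast: remember the
    // last WINDOW message IDs and drop any message already seen.
    public class DuplicateFilter {
        private static final int WINDOW = 1024;   // moving window size (illustrative)

        private final Deque<Long> order = new ArrayDeque<>();
        private final Set<Long> seen = new HashSet<>();

        // Returns true if the message should be delivered, false if it is a duplicate.
        public synchronized boolean accept(long messageId) {
            if (seen.contains(messageId)) {
                return false;                      // duplicate copy: drop it
            }
            seen.add(messageId);
            order.addLast(messageId);
            if (order.size() > WINDOW) {
                seen.remove(order.removeFirst());  // evict the oldest ID
            }
            return true;
        }
    }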
Link Probing Bandwidth (PlanetLab)
- probing bandwidth increases logarithmically with overlay size
- medium-sized routing overlays incur low probing bandwidth