Tapestry Deployment and Fault-tolerant Routing
Ben Y. Zhao, L. Huang, S. Rhea, J. Stribling, A. D. Joseph, J. D. Kubiatowicz
Berkeley Research Retreat, January 2003

1 Tapestry Deployment and Fault-tolerant Routing
Ben Y. Zhao, with L. Huang, S. Rhea, J. Stribling, A. D. Joseph, J. D. Kubiatowicz
Berkeley Research Retreat, January 2003

2 Scaling Network Applications
Complexities of global deployment:
- Network unreliability: BGP converges slowly, and path redundancy goes unexploited
- Lack of administrative control over components: constrains protocol deployment (multicast, congestion control)
- Management of large-scale resources / components: locate and utilize resources despite failures

3 Enabling Technology: DOLR (Decentralized Object Location and Routing)
(figure: objects named by GUIDs, e.g. GUID1 and GUID2, located and reached through the DOLR layer)

4 What is Tapestry?
DOLR driving OceanStore global storage (Zhao, Kubiatowicz, Joseph et al. 2000)
Network structure:
- Nodes are assigned bit-sequence nodeIDs from the namespace 0 to 2^160, interpreted in some radix (e.g. 16)
- Keys are drawn from the same namespace
- Each key dynamically maps to one unique live node, its root
Base API:
- Publish / Unpublish (ObjectID)
- RouteToNode (NodeID)
- RouteToObject (ObjectID)
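A minimal sketch of that base API as a Java interface. This is illustrative only: the interface name and signatures below are assumptions, not the deployed Tapestry classes.

```java
// Hypothetical rendering of the Tapestry base API listed above; the deployed
// Java implementation's actual class names and signatures may differ.
public interface TapestryDolr {
    // Announce that this node holds a replica of the object; location pointers
    // are left along the path toward the object's root node.
    void publish(byte[] objectId);

    // Remove this node's location pointers for the object.
    void unpublish(byte[] objectId);

    // Deliver a message to the live node whose ID matches nodeId
    // (or to the key's root/surrogate when no exact match exists).
    void routeToNode(byte[] nodeId, byte[] payload);

    // Deliver a message to a (typically nearby) replica of the object.
    void routeToObject(byte[] objectId, byte[] payload);
}
```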

5 Tapestry Mesh
(figure: a routing mesh over hex NodeIDs such as 0xEF34, 0xEF31, 0xEF37, 0xE324, 0x0921, ..., with links labeled by routing level 1-4)

6 Object Location

7 Talk Outline
- Introduction
- Architecture
  - Node architecture
  - Node implementation
- Deployment
- Evaluation
- Fault-tolerant Routing

8 Single Node Architecture
(figure: layered node stack. Applications such as decentralized file systems, application-level multicast, and approximate text matching sit on the application interface / upcall API; beneath it are the router with its routing table and object pointer DB, plus dynamic node management; at the bottom, network link management and transport protocols)

9 Single Node Implementation
(figure: staged implementation on the SEDA event-driven framework and the Java Virtual Machine. An application programming interface stage handles API calls and upcalls; Dynamic Tapestry handles enter/leave, state maintenance, and node insert/delete; the Core Router performs routing to nodes and objects; Patchwork does fault detection via heartbeat messages; a Network Stage and Distance Map exchange node insert/delete messages and UDP pings)

10 Deployment Status
C simulator:
- Packet-level simulation
- Scales up to 10,000 nodes
Java implementation:
- 50,000 semicolons of Java, 270 class files
- Deployed on a local-area cluster (40 nodes)
- Deployed on the PlanetLab global network (~100 distributed nodes)

11 Talk Outline
- Introduction
- Architecture
- Deployment
- Evaluation
  - Micro-benchmarks
  - Stable network performance
  - Single and parallel node insertion
- Fault-tolerant Routing

12 Micro-benchmark Methodology
- Experiment run on a LAN with Gbit Ethernet
- Sender sends 60,001 messages at full speed
- Measure inter-arrival times of the last 50,000 messages
  - Discarding the first ~10,000 messages removes cold-start effects
  - Averaging over 50,000 messages smooths out network jitter
(figure: sender and receiver control processes attached to Tapestry nodes over the LAN link)
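A rough receiver-side sketch of that measurement, assuming a hypothetical blocking receiveMessage() call in place of the real Tapestry delivery path:

```java
import java.util.Arrays;

public class InterArrivalBench {
    // Placeholder for the Tapestry receive call used in the real benchmark.
    static void receiveMessage() throws InterruptedException { Thread.sleep(0); }

    public static void main(String[] args) throws InterruptedException {
        long[] gaps = new long[50_000];
        long last = System.nanoTime();
        for (int i = 0; i < 60_001; i++) {
            receiveMessage();
            long now = System.nanoTime();
            if (i > 10_000) gaps[i - 10_001] = now - last;  // skip the warm-up messages
            last = now;
        }
        double meanGapUs = Arrays.stream(gaps).average().orElse(0) / 1_000.0;
        System.out.printf("mean inter-arrival: %.1f us -> %.0f msgs/sec%n",
                meanGapUs, 1_000_000.0 / meanGapUs);
    }
}
```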

13 Micro-benchmark Results
- Constant processing overhead of ~50 μs per message
- Latency dominated by byte copying
- For 5 KB messages, throughput ≈ 10,000 msgs/sec

14 Large Scale Methodology
PlanetLab global network:
- 101 machines at 42 institutions across North America, Europe, and Australia (~60 machines utilized)
- 1.26 GHz PIII (1 GB RAM) and 1.8 GHz P4 (2 GB RAM) machines
- North American machines (2/3 of the total) are on Internet2
Tapestry Java deployment:
- 6-7 nodes on each physical machine
- IBM Java JDK 1.3.0
- Node virtualization inside the JVM and SEDA
- Scheduling between virtual nodes increases latency

15 Node to Node Routing
- Ratio of end-to-end overlay routing latency to the shortest ping distance between the two nodes
- All node pairs measured, placed into buckets
- Median = 31.5, 90th percentile = 135

16 Object Location
- Ratio of end-to-end object-location latency to the shortest ping distance between client and object location
- Each node publishes 10,000 objects; lookups performed on all objects
- 90th percentile = 158

17 Latency to Insert Node
- Latency to dynamically insert a node into an existing Tapestry, as a function of the size of the existing network
- Humps are due to the expected filling of each routing level

18 Bandwidth to Insert Node
- Bandwidth cost of dynamically inserting a node into the Tapestry, amortized over each node in the network
- Per-node bandwidth decreases with the size of the network

19 Parallel Insertion Latency
- Latency to dynamically insert nodes in unison into an existing Tapestry of 200 nodes
- Shown as a function of the ratio of insertion group size to network size
- 90th percentile = 55,042

20 Talk Outline
- Introduction
- Architecture
- Deployment
- Evaluation
- Fault-tolerant Routing
  - Tunneling through scalable overlays
  - Example using Tapestry

21 Adaptive and Resilient Routing
Goals:
- Reachability as a service
- Agility / adaptability in routing
- Scalable deployment
- Useful for all client endpoints

22 Existing Redundancy in DOLR/DHTs
Fault detection via soft-state beacons:
- Periodically sent to each node in the routing table
- Scales logarithmically with the size of the network
- Worst-case overhead: 2^40 nodes, 160-bit IDs, 20 hex digits; 1 beacon/sec at 100 B each ≈ 240 kbps
- Bandwidth can be minimized with better techniques (Hakim, Shelley)
Precomputed backup routes:
- Intermediate hops in the overlay path are flexible
- Keep a list of backups for each outgoing hop (e.g. 3 node pointers per route entry in Tapestry)
- Maintain backups using the node membership algorithms (no additional overhead)
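A small sketch of the backup-route idea: a hypothetical RouteEntry keeps a primary neighbor plus backups and fails over when the beaconing layer declares the primary dead. The class and field names are illustrative, not Tapestry's actual data structures.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class RouteEntry {
    // Primary neighbor first, then backups (e.g. 3 pointers per entry in Tapestry).
    private final Deque<String> neighbors = new ArrayDeque<>();

    void addNeighbor(String nodeId) {
        if (neighbors.size() < 3) neighbors.addLast(nodeId);
    }

    // Called when soft-state beacons stop arriving from the primary neighbor:
    // drop it and fall back to the next precomputed backup.
    String failover() {
        neighbors.pollFirst();
        return neighbors.peekFirst();   // null if no backups remain
    }

    String primary() { return neighbors.peekFirst(); }
}
```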

23 Bootstrapping Non-overlay Endpoints
Goal:
- Allow non-overlay nodes to benefit from the overlay
- Endpoints communicate via overlay proxies
Example with legacy nodes L1 and L2:
- Li registers with a nearby overlay proxy Pi
- Pi assigns Li a proxy name Di such that Di is the closest possible unique name to Pi (e.g. start with Pi's name and increment for each registered node); a rough sketch follows
- L1 and L2 exchange their new proxy names
- Messages are routed to the nodes using the proxy names
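A toy sketch of that proxy-name assignment, assuming hex-string IDs and a hypothetical ProxyNamer class; the actual registration protocol is not shown.

```java
import java.math.BigInteger;
import java.util.HashSet;
import java.util.Set;

class ProxyNamer {
    private final BigInteger proxyId;
    private final Set<BigInteger> taken = new HashSet<>();

    ProxyNamer(String proxyHexId) {
        this.proxyId = new BigInteger(proxyHexId, 16);
        taken.add(proxyId);               // the proxy's own name is already in use
    }

    // Start at the proxy's ID and increment until an unused name is found,
    // giving each legacy node a proxy name Di as close as possible to Pi.
    synchronized String register() {
        BigInteger candidate = proxyId;
        while (!taken.add(candidate)) {
            candidate = candidate.add(BigInteger.ONE);
        }
        return candidate.toString(16).toUpperCase();
    }
}
```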

24 Tunneling through an Overlay
(figure: legacy nodes L1 and L2 attached to overlay proxies P1 and P2 at the edge of the overlay network)
- L1 registers with P1 as document D1
- L2 registers with P2 as document D2
- Traffic tunnels through the overlay via the proxies

25 Failure Avoidance in Tapestry

26 Routing Convergence

27 Bandwidth Overhead for Misroute
Status: under deployment on PlanetLab

28 For more information …
Tapestry and related projects (and these slides): http://www.cs.berkeley.edu/~ravenben/tapestry
OceanStore: http://oceanstore.cs.berkeley.edu
Related papers: http://oceanstore.cs.berkeley.edu/publications and http://www.cs.berkeley.edu/~ravenben/publications
Contact: ravenben@eecs.berkeley.edu

29 Backup Slides Follow…

30 The Naming Problem
Tracking modifiable objects:
- Examples: email, Usenet articles, tagged audio
- Goal: verifiable names that are robust to small changes
Current approaches:
- Content-based hashed naming
- Content-independent naming
ADOLR Project (Feng Zhou, Li Zhuang):
- Approximate names based on feature vectors
- Leverage them to match / search for similar content

31 Approximation Extension to DOLR/DHT
Publication using features:
- Objects are described by a set of features: AO ≡ Feature Vector (FV) = {f1, f2, f3, ..., fn}
- Locating AOs in the DOLR ≡ finding all AOs in the network with |FV* ∩ FV| ≥ Thres, where 0 < Thres ≤ |FV|
Driving application: decentralized spam filter
- Humans are the only fool-proof spam filter
- Users mark spam; spam is published by its text feature vector
- Incoming mail is filtered by an FV query on the P2P overlay
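A minimal sketch of that threshold test, assuming features are hashed to longs; the class and method names are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

final class ApproxMatch {
    // Returns true if |stored ∩ query| >= threshold, with 0 < threshold <= |query|.
    static boolean matches(Set<Long> storedFv, Set<Long> queryFv, int threshold) {
        Set<Long> overlap = new HashSet<>(storedFv);
        overlap.retainAll(queryFv);          // intersection of the two feature vectors
        return overlap.size() >= threshold;
    }
}
```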

32 Evaluation on Real Emails
Accuracy of feature-vector matching on real emails:
- Spam: 29,631 junk emails from www.spamarchive.org; 14,925 unique, 86% of spam ≤ 5 KB
- Normal emails: 9,589 total = 50% newsgroup posts, 50% personal emails
"Similarity" test (3,440 modified copies of 39 emails):
- THRES 3/10: 3,356 detected, 84 failed, 97.56%
- THRES 4/10: 3,172 detected, 268 failed, 92.21%
"False positive" test (9,589 normal × 14,925 spam pairs):
- Match 2/10: 4 false-positive pairs, probability 2.79e-8
- Match >2/10: 0 pairs
Status:
- Prototype implemented as an Outlook plug-in
- Interfaces with the Tapestry overlay
- http://www.cs.berkeley.edu/~zf/spamwatch

33 State of the Art Routing
High-dimensionality and coordinate-based P2P routing:
- Tapestry, Pastry, Chord, CAN, etc.
- Sub-linear storage and number of overlay hops per route
- Properties depend on a random name distribution
- Optimized for uniform mesh-style networks

34 Reality
(figure: source S and receiver R in a transit-stub topology spanning AS-1, AS-2, and AS-3, with the P2P overlay network layered on top)
- Transit-stub topology, disparate resources per node
- Result: inefficient inter-domain routing (bandwidth, latency)

35 Landmark Routing on P2P
Brocade:
- Exploit non-uniformity
- Minimize wide-area routing hops / bandwidth
Secondary overlay on top of Tapestry:
- Select super-nodes by administrative domain
- Divide the network into cover sets
- Super-nodes form a secondary Tapestry
- Advertise each cover set as local objects
- Brocade routes directly into the destination's local network, then resumes P2P routing

36 Brocade Routing
(figure: source S and destination D across AS-1, AS-2, and AS-3; the original P2P route crosses the wide area repeatedly, while the Brocade route goes through the Brocade layer directly into the destination's domain)

37 Overlay Routing Networks
CAN: Ratnasamy et al. (ACIRI / UCB)
- Uses a d-dimensional coordinate space to implement a distributed hash table
- Routes to the neighbor closest to the destination coordinate
- Fast insertion / deletion; constant-sized routing state; unconstrained number of hops; overlay distance not proportional to physical distance
Chord: Stoica, Morris, Karger, et al. (MIT / UCB)
- Linear namespace modeled as a circular address space
- "Finger table" points to a logarithmic number of increasingly remote hosts
- Simplicity in algorithms; fast fault-recovery; log2(N) hops and routing state; overlay distance not proportional to physical distance
Pastry: Rowstron and Druschel (Microsoft / Rice)
- Hypercube routing similar to PRR97
- Objects replicated to servers by name
- Fast fault-recovery; log(N) hops and routing state; data replication required for fault-tolerance

38 Routing in Detail
(figure: routing tables of nodes 2175, 0880, 0123, 0154, and 0157, each with one slot per octal digit 0-7)
Example: octal digits, 2^12 namespace, route 2175 → 0157, resolving one digit per hop: 2175 → 0880 → 0123 → 0154 → 0157
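A toy sketch of the per-hop digit resolution the example shows, written with prefix matching for concreteness; the table layout and class names are assumptions, not the production router.

```java
final class DigitRouter {
    // routingTable[level][digit] -> a neighbor sharing `level` digits with us
    // and having `digit` as its next digit, or null if no such node is known.
    private final String[][] routingTable;
    private final String selfId;

    DigitRouter(String selfId, String[][] routingTable) {
        this.selfId = selfId;
        this.routingTable = routingTable;
    }

    // Number of leading digits selfId shares with dest.
    private int sharedPrefix(String dest) {
        int i = 0;
        while (i < selfId.length() && selfId.charAt(i) == dest.charAt(i)) i++;
        return i;
    }

    // Next hop toward dest, or null if we are the closest match we know of.
    String nextHop(String dest) {
        if (dest.equals(selfId)) return null;
        int level = sharedPrefix(dest);
        int digit = Character.digit(dest.charAt(level), 8);  // octal digits, as in the example
        return routingTable[level][digit];
    }
}
```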

39 Publish / Lookup Details
Publish an object with a given ObjectID:
- Route toward the "virtual root", ID = ObjectID
- For (i = 0; i < Log2(N); i += j)  // defines the hierarchy; j is the number of bits per digit (e.g. j = 4 for hex digits)
  - Insert an entry into the nearest node that matches on the last i bits
  - If no match is found, deterministically choose an alternative
  - The real root node is found when no external routes are left
Lookup an object:
- Traverse the same path toward the root as publish, but search for an entry at each node
- For (i = 0; i < Log2(N); i += j)
  - Search for a cached object location
- Once found, route via IP or Tapestry to the object

40 Dynamic Insertion
1. Build up the new node N's routing map
   - Send messages to each hop along the path from the gateway to the current node N' that best approximates N
   - The i-th hop along the path sends its i-th level route table to N
   - N optimizes those tables where necessary (a rough sketch of this step follows)
2. Notify, via acknowledged multicast, the nodes with null entries for N's ID
3. Each notified node issues republish messages for the relevant objects
4. Notify local neighbors
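A rough sketch of step 1, under the assumption that the new node has collected candidate entries from each hop's level tables and simply keeps the closest candidate per slot; all types and names here are illustrative.

```java
import java.util.List;

final class InsertionSketch {
    interface DistanceOracle { double ping(String nodeId); }  // RTT measured by the new node

    // candidates.get(level).get(digit) holds every neighbor ID the path hops
    // offered for that slot; the new node keeps the closest one it can ping.
    static String[][] buildTable(List<List<List<String>>> candidates, DistanceOracle oracle) {
        String[][] table = new String[candidates.size()][];
        for (int level = 0; level < candidates.size(); level++) {
            List<List<String>> row = candidates.get(level);
            table[level] = new String[row.size()];
            for (int d = 0; d < row.size(); d++) {
                String best = null;
                for (String cand : row.get(d)) {
                    if (best == null || oracle.ping(cand) < oracle.ping(best)) best = cand;
                }
                table[level][d] = best;  // stays null if no hop knew a matching node
            }
        }
        return table;
    }
}
```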

41 Dynamic Insertion Example
(figure: new node 0x143FE joining through gateway 0xD73FF; the insertion path passes existing nodes such as 0x243FE, 0x913FE, 0x0ABFE, and 0x973FE, with links labeled by routing level 1-4)

42 Dynamic Root Mapping
Problem: choosing a root node for every object
- Deterministic over network changes
- Globally consistent
Assumptions:
- All nodes with the same matching suffix contain the same null/non-null pattern in the next level of their routing maps
- Requires consistent knowledge of nodes across the network

43 PRR Solution
Given a desired ID N:
- Find the set S of existing nodes n matching the largest number of suffix digits with N
- Choose Si = the node in S with the highest-valued ID
Issues:
- The mapping must be generated statically using global knowledge
- It must be kept as hard state in order to operate in a changing environment
- The mapping is not well distributed; many nodes get no mappings

44 Tapestry Solution
Globally consistent distributed algorithm:
- Attempt to route to the desired ID N
- Whenever a null entry is encountered, choose the next "higher" non-null pointer entry
- If the current node S is the only non-null pointer in the rest of the route map, terminate the route: f(N) = S
Assumes:
- Routing maps across the network are up to date
- Null/non-null properties are identical at all nodes sharing the same suffix
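A compact sketch of the "next higher non-null entry" rule at a single routing level; the slot layout and wrap-around behavior are assumptions for illustration.

```java
final class SurrogateRouting {
    // routeLevel[d] is the neighbor for digit d at this level, or null.
    // desiredDigit is the digit of the target ID at this level.
    static String chooseSlot(String[] routeLevel, int desiredDigit) {
        int radix = routeLevel.length;
        for (int offset = 0; offset < radix; offset++) {
            String entry = routeLevel[(desiredDigit + offset) % radix];
            if (entry != null) return entry;  // first non-null slot at or "above" the digit
        }
        return null;  // no outgoing entry at this level: the current node terminates the route
    }
}
```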

45 Analysis
Globally consistent deterministic mapping:
- Null entry → no node in the network with that suffix
- Consistent map → identical null entries across the route maps of nodes with the same suffix
Additional hops compared to the PRR solution:
- Reduce to the coupon collector problem, assuming a random distribution
- With n·ln(n) + c·n entries, P(all coupons collected) = 1 - e^-c
- For n = b and c = b - ln(b): with b^2 nodes left, P = 1 - b/e^b ≈ 1 - 1.8×10^-6
- Number of additional hops ≤ log_b(b^2) = 2
Distributed algorithm with minimal additional hops
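The coupon-collector step written out, assuming hex digits (b = 16), which matches the 1.8×10^-6 figure on the slide:

```latex
% Coupon collector bound from the slide, worked out for base b = 16 (hex digits).
\[
  m = n\ln n + c\,n \;\Rightarrow\; P(\text{all } n \text{ coupons collected}) \approx 1 - e^{-c}
\]
\[
  n = b,\quad c = b - \ln b \;\Rightarrow\; m = b\ln b + (b - \ln b)\,b = b^{2}
\]
\[
  P \approx 1 - e^{-(b - \ln b)} = 1 - \frac{b}{e^{b}} \approx 1 - 1.8\times 10^{-6}
  \quad (b = 16),
  \qquad \text{extra hops} \le \log_{b}\!\bigl(b^{2}\bigr) = 2.
\]
```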

46 Dynamic Mapping Border Cases
Node vanishes undetected:
- Routing proceeds on the invalid link and fails
- There is no backup router, so proceed to surrogate routing
Node enters the network undetected, so messages go to the surrogate node instead:
- The new node checks with the surrogate after all such nodes have been notified
- Route info at the surrogate is moved to the new node

47 SPAA slides follow

48 Network Assumption
Finding the nearest neighbor is hard in a general metric space. Assume the following:
- A ball of radius 2r contains only a factor of c more nodes than a ball of radius r
- Also, b > c^2
- (Both assumptions are made by PRR)
Start knowing one node; allow distance queries

49 Algorithm Idea
- Call a node a level-i node if it matches the new node in i digits
- The whole network is contained in a forest of trees rooted at the highest possible level imax
- Let list[imax] contain the roots of all trees
- Then, starting at imax, while i > 1: list[i-1] = getChildren(list[i])
- Certainly, list[i] contains the level-i neighbors

50 We Reach The Whole Network
(figure: the mesh of NodeIDs from the earlier Tapestry Mesh slide, with every node reached through links at levels 1-4)

51 The Real Algorithm
- The simplified version contacts ALL nodes in the network
- But far-away nodes are not likely to have close descendants, so trim the list at each step
- New version: while i > 1:
  - list[i-1] = getChildren(list[i])
  - Trim(list[i-1])
(a schematic version follows)
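A schematic version of that loop, assuming hypothetical Node / children / distanceTo helpers; the trim keeps the k candidates closest to the joining node.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class NeighborSearch {
    interface Node {
        List<Node> children(int level);   // nodes pointed to at this level
        double distanceTo(Node other);    // measured network distance
    }

    static List<Node> search(List<Node> roots, int imax, int k, Node newcomer) {
        List<Node> list = new ArrayList<>(roots);            // list[imax]
        for (int i = imax; i > 1; i--) {
            List<Node> next = new ArrayList<>();
            for (Node n : list) next.addAll(n.children(i));   // list[i-1] = getChildren(list[i])
            next.sort(Comparator.comparingDouble(newcomer::distanceTo));
            list = next.subList(0, Math.min(k, next.size())); // Trim(list[i-1]): keep the k closest
        }
        return list;   // candidate level-1 neighbors for the new node
    }
}
```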

52 How to Trim
- Consider a circle of radius r containing at least one level-i node
- A level-(i-1) node in the little circle must point to a level-i node in the big circle (radius < 2r)
- Want: list[i] has a radius three times that of list[i-1], and list[i-1] contains at least one level-i node

53 Animation
(animated figure showing the new node's neighbor search)

54 True in Expectation
- Want: list[i] has a radius three times that of list[i-1], and list[i-1] contains at least one level-i node
- Suppose list[i-1] has k elements and radius r
  - Expect a ball of radius 4r to contain kc^2/b nodes
  - A ball of radius 3r contains fewer than k nodes, so keeping k elements all along is enough
- To work with high probability, k = O(log n)

55 Steps of Insertion
- Find the node with the closest matching ID (the surrogate) and get a preliminary neighbor table
  - If the surrogate's table is hole-free, so is this one
- Find all nodes that need to put the new node in their routing tables, via multicast
- Optimize the neighbor table
  - With high probability, the nodes contacted while building the table are the only ones that need to update their own tables
- Need: no fillable holes; keep objects reachable

56 Need-to-know Nodes
- Need-to-know = a node with a hole in its neighbor table that is filled by the new node
- If 1234 is the new node and no nodes matching 123? existed, the 12?? nodes must be notified
- Use acknowledged multicast to reach all matching nodes

57 Acknowledged Multicast Algorithm
Locates and contacts all nodes with a given prefix:
- Create a tree based on IDs as we go
- Nodes send acks when all of their children have been reached
- The starting node knows when all nodes have been reached
(figure: node 54345 covering prefix 543?? sends to any 5430?, any 5431?, any 5434?, etc. if possible; 54340 is one of the contacted nodes)
(a schematic sketch follows)
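A schematic sketch of that recursion, with every type and method name invented for illustration; the real protocol sends messages between nodes rather than recursing locally.

```java
import java.util.List;

final class AckedMulticast {
    interface Peer {
        // One reachable node for each one-digit extension of `prefix`
        // (e.g. from 543: some 5430x, some 5431x, ... found in the routing table).
        List<Peer> oneNodePerNextDigit(String prefix);
        String id();
        void deliver(byte[] msg);   // assumed idempotent in this sketch
    }

    // Runs at each contacted node; the boolean return is the "ack": it becomes
    // true only after every child subtree reports that it has been reached.
    static boolean multicast(Peer node, String prefix, byte[] msg) {
        node.deliver(msg);
        boolean allAcked = true;
        for (Peer child : node.oneNodePerNextDigit(prefix)) {
            String longerPrefix = child.id().substring(0, prefix.length() + 1);
            allAcked &= multicast(child, longerPrefix, msg);  // recurse one digit deeper
        }
        return allAcked;   // the parent learns that all descendants were reached
    }
}
```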

