
1 Tapestry: A Highly-available Wide-area Location and Routing Mechanism
Ben Y. Zhao, John Kubiatowicz, Anthony D. Joseph
An investigation into building a truly scalable, highly available wide-area system.

Project Overview
Context: the OceanStore global-scale persistent storage system.
Why is it hard?
– Large-scale system → frequent component faults
– Large amount of data → performance and load bottlenecks
– Dynamic environment → changes in topology and network conditions
– More principals → attacks on the system (e.g. DoS) more likely
Previous efforts:
– The Globe Wide-area Distributed System
– SLP wide-area extension
– The Berkeley Service Discovery Service
Project goals:
– True scalability without centralization
– Exploit locality: local-area performance for local objects
– Availability in the face of multiple node failures and network partitions
– Self-maintenance: repair corrupted data, optimize

Plaxton Trees
Map objects to one of many embedded trees in the network:
– Objects are mapped to "root" nodes identified by string IDs
– For every possible suffix length in the ID, nodes keep pointers to the "nearest" neighbors sharing a suffix of that length
Routing algorithm (a code sketch follows at the end of this section):
– Start at the closest neighbor with the desired ending digit
– At each hop, match the next digit against the nearest-neighbor listing
Operations:
– Insertion: place pointers to the object on each intervening hop to the root, and at the root itself
– Query: route toward the root; stop as soon as a hop holds the desired pointer
[Figure: inserting and searching for Obj #62942 — routes from the inserting node and the search client pass through intermediate nodes (116, 479, 529, 629, 675, 109) to the root node, with object-location pointers along the way.]
Properties:
– Routes have at most log_B(N) hops, where B is the base of the ID space and N is the number of nodes
– No centralization; reroute around failed nodes or links
– Highly scalable; exploits locality → local queries never reach the root node
[Figure: three sibling meshes for one root, drawn as levels from the ground level upward (e.g. suffixes …629, …29, …9), showing a single path to the root, sibling pointers, and single hops to the root. Each mesh represents a single hop on the route to a given root; sibling nodes maintain pointers to each other; each referrer keeps pointers to the desired node's siblings.]
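
The suffix-matching rule and the insert/query operations above can be made concrete with a small sketch. This is an illustrative reconstruction, not the authors' implementation: the Node, next_hop, publish, and lookup names, the base-16 ID space, and the table layout are all assumptions.

```python
# Illustrative sketch of Plaxton-style suffix routing (assumed details: base-16
# IDs of fixed length, a per-node table indexed by (matched suffix length, digit)).
import math

BASE = 16          # B: base of the ID space
ID_DIGITS = 5      # e.g. IDs like "62942"

def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits two IDs have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

class Node:
    def __init__(self, node_id: str, neighbors: dict):
        # neighbors[(level, digit)] -> nearest known Node that already shares
        # `level` trailing digits with us and whose next trailing digit is `digit`.
        self.node_id = node_id
        self.neighbors = neighbors
        self.object_pointers = {}      # object ID -> location record

    def next_hop(self, target_id: str):
        """Resolve one more trailing digit of target_id, or None at the root."""
        level = shared_suffix_len(self.node_id, target_id)
        if level >= len(target_id):
            return None                # this node is the root for target_id
        wanted_digit = target_id[-1 - level]
        return self.neighbors.get((level, wanted_digit))   # None => fall back to an alternate

def route(start: Node, target_id: str) -> list:
    """Nodes visited while routing from `start` toward the root for target_id."""
    path, node = [], start
    while node is not None:
        path.append(node)
        node = node.next_hop(target_id)
    return path

def publish(start: Node, obj_id: str, location) -> None:
    """Insertion: leave a pointer to the object at every hop up to the root."""
    for node in route(start, obj_id):
        node.object_pointers[obj_id] = location

def lookup(start: Node, obj_id: str):
    """Query: route toward the root, stop at the first hop holding a pointer."""
    for node in route(start, obj_id):
        if obj_id in node.object_pointers:
            return node.object_pointers[obj_id]
    return None

def max_hops(num_nodes: int) -> int:
    """Poster's bound: routes take at most log_B(N) hops."""
    return math.ceil(math.log(num_nodes, BASE))
```

As a quick check of the hop bound, with base 16 and a million nodes, max_hops(10**6) returns 5, i.e. log_B(N) hops as stated in the properties above.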

Tapestry Enhancements
Sibling Mesh:
– Logical mesh formed by nodes with a common suffix
– Each node keeps pointers to a small number (n) of nearest siblings
– Each next-hop pointer also keeps 2 alternate nodes
– Result: reduces entry/exit latency and adds redundancy
Referrer List (backpointers):
– TBD: explain the tradeoff between storage and functionality here…
Availability via Replication:
– Plaxton is resilient to intermediate node failures and small partitions
– It is vulnerable to root node failures, overload of router nodes, and large or correlated partitions
– Tapestry solution: node replication
– Replication algorithm:
  1. A new replica with ID "X" searches for the existing node for "X"
  2. It negotiates with "X" to obtain the referrer list
  3. Network distance measurements are used to find an optimal partition
  4. Regular beacons are exchanged within the replica group
  5. A replica fault is detected and responsibility taken over as necessary
– Replicas can be co-located with hotspots for better load distribution
Availability via Hashing (a sketch follows after this slide):
– Incoming IDs are hashed with multiple salt values
– Queries are parallelized for additional redundancy
– Hinders DoS attacks by obscuring the object → node mapping
Self-repair vs. corruption:
– When the next hop fails, use an alternate node to reach the sibling mesh
– Use the mesh to find the new optimal next hop
Self-optimization:
– Running queries store previous hop IDs and distances
– Non-optimal paths are detected during traversal and fixed
Fast fault detection and recovery:
– Occasional soft-state beacons between a node and its referrers
– Active queries serve as heartbeats in high-traffic regions
– On fault: mark the downed node as inactive with a long lease
– Probabilistically send regular query requests to the inactive node:
  - If recovery is detected, switch its status back to active
  - If the lease expires, mark it as failed and actively remove it
Security:
– Use referrer maps and the sibling mesh to isolate attackers

Extensibility
Provide a framework for active "shuttle" messages:
– "Shuttles" are protocol-specific messages tunneled inside Tapestry
– Protocol-specific modules interpret shuttles and generate events
– This allows overlay networks to leverage Tapestry's availability, fault tolerance, and hierarchy management
– Need to verify / trust module code

Simulation Results (fault tolerance and optimality)
Availability results (plotted against Plaxton):
– Node failures: (1) single node failure, (2) multiple node failures
– Network partitions: (1) small/single partition, (2) multiple/correlated partitions
Optimality measure: minimum stretch factor as a function of network size
1. Immediately after insertion
2. After time X, once self-optimization reaches steady state
Fault tolerance:
1. Integration of a recovered node (second chance vs. Plaxton)
2. Recovery time (downed node or link)
3. Query latency degradation under fault conditions

Ongoing work
– Further theoretical analysis of the algorithms
– How do we deal with a highly dynamic system?
  - Can we tolerate a high rate of node entries and exits?
  - Mobile clients: can we push Tapestry out to the edge?
  - Allow faster insertion for "guests" such as mobile nodes
– More security issues:
  - How to prevent flooding of route packets
  - Message authentication
  - DoS via frequent entries to Tapestry
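
The "availability via hashing" enhancement above can be sketched as follows. This is an assumed illustration rather than the project's code: the choice of SHA-1, the three salt values, and the lookup_fn hook are placeholders.

```python
# Sketch of "availability via hashing": hash an object ID with multiple salts
# so the object maps to several independent roots; queries can then be issued
# in parallel and the first successful answer wins. The salt values, hash
# function, and lookup_fn hook are illustrative assumptions.
import hashlib
from concurrent.futures import ThreadPoolExecutor, as_completed

SALTS = [b"salt-0", b"salt-1", b"salt-2"]   # e.g. 1 object -> 3 stored copies

def salted_ids(obj_name: str, digits: int = 5) -> list:
    """Derive one routing ID per salt for the same logical object."""
    ids = []
    for salt in SALTS:
        h = hashlib.sha1(salt + obj_name.encode()).hexdigest()
        ids.append(h[:digits])               # truncate to the ID length in use
    return ids

def parallel_lookup(lookup_fn, obj_name: str):
    """Query every salted ID concurrently; return the first hit."""
    with ThreadPoolExecutor(max_workers=len(SALTS)) as pool:
        futures = [pool.submit(lookup_fn, oid) for oid in salted_ids(obj_name)]
        for fut in as_completed(futures):
            result = fut.result()
            if result is not None:
                return result
    return None
```

Because each salted ID routes to an independent root, a single root failure or partition can be masked by the remaining copies, at the cost of storing the object's pointers a few times over in the same network (the "1 object becomes 3 separate objects" point in the notes below).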

2 Poster Notes
1. Briefly review OceanStore:
   1. OceanStore is a globally distributed, fully secure, persistent storage system. Its focus is on reliability and guaranteed persistence. It uses multi-tiered servers to store the data and propagate updates down to clients. All data is fully encrypted everywhere, and the infrastructure is not trusted. Fragments of documents are encoded using erasure codes and let loose into the system; adaptive mechanisms place them at optimal locations. Wide-area location and communication between nodes on a global scale is crucial.
   2. Any global-scale system will run into the same problems:
      1. frequent faults in the system (large number of components),
      2. any point of centralization will cause bottlenecks.
   3. Redundancy and the associated storage cost are an acceptable tradeoff for reliability and availability.
   4. What is the problem with existing wide-area systems? There is ALWAYS a point of centralization somewhere → a performance and scaling bottleneck.
   5. What's new here: no centralization, and redundancy on top of redundancy.
2. Review the Plaxton work:
   1. It was primarily theory work, missing features desirable in a real system.
   2. It was never implemented or simulated before (AFAIK), and there are no empirical results.
   3. Stress its "limited" resilience against node failures (since every intermediate node is also a root node) and its vulnerability to massive network partitions or bisections.
   4. See the poster for algorithm details and the graph; focus on the key properties of locality, fault tolerance, and true scalability via randomized distribution.
3. Explain the assumptions:
   1. Tapestry is designed to work best over a relatively stable system; it needs time to optimize itself to steady state.
   2. It needs a strong measurement infrastructure in order to learn the network distances between nodes.
4. Discuss the Tapestry enhancements one by one:
   1. Sibling mesh: making explicit what was already implicitly there.
   2. Explain the 4-tiered diagram: siblings with the same-length suffix can be thought of as points on the same 2-D mesh; as you route closer and closer to the root node, you traverse up a 3-D canopy of meshes.
   3. Explain how replication solves node failure problems, routing bottleneck problems, and locality problems.
   4. Availability via hashing: very good redundancy without incurring the overhead of the pointer list again, only the overhead of storing X times more data (1 object becomes 3 separate objects, but all 3 are stored in the same network).
   5. Self-repair: crucial in wide-area systems, since faults are many, distributed, and often hard to reach and fix in time.
   6. Self-optimization: this allows algorithms to not have to do a "perfect" job, just a good enough one, and then let self-optimization take over.
   7. The second-chance algorithm is tuned toward allowing servers that recover within some period of time (say, 1 day) to pick up where they left off without incurring the high exit/entry cost; a probabilistic algorithm proactively probes the node for activity on a regular basis (see the sketch after these notes).
   8. Security: use Stefan Savage's and Dawn Song et al.'s algorithms to do traceback, then isolate and quarantine malicious nodes from the network.
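
A rough sketch of the second-chance behaviour described in note 4.7. The lease length, probe probability, NeighborEntry name, and probe_fn hook are all illustrative assumptions, not values from the poster.

```python
# Sketch of the "second chance" fault handling described above: a downed node
# is marked inactive under a long lease and probed probabilistically with real
# query traffic; recovery flips it back to active, lease expiry removes it.
import random
import time

LEASE_SECONDS = 24 * 3600       # e.g. "recover within some period, say 1 day"
PROBE_PROBABILITY = 0.05        # fraction of queries redirected as probes

class NeighborEntry:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.status = "active"          # active | inactive | failed
        self.lease_expiry = None

    def mark_faulty(self, now=None) -> None:
        """On detected fault: mark inactive and start the long lease."""
        now = time.time() if now is None else now
        self.status = "inactive"
        self.lease_expiry = now + LEASE_SECONDS

    def maybe_probe(self, probe_fn, now=None) -> None:
        """Occasionally send a real query to an inactive neighbor."""
        now = time.time() if now is None else now
        if self.status != "inactive":
            return
        if now >= self.lease_expiry:
            self.status = "failed"      # lease expired: actively remove from tables
            return
        if random.random() < PROBE_PROBABILITY and probe_fn(self.node_id):
            self.status = "active"      # second chance: node resumes where it left off
            self.lease_expiry = None
```

Using live query traffic as probes keeps the monitoring overhead near zero in busy regions, matching the "use active queries as heartbeats in high-traffic regions" point on the poster.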

3 Discussion points
– Explore the use of backpointers further. Is the cost worth the benefit? It is very expensive at the lower nodes of the sibling mesh; maybe use a probabilistic argument to justify an incomplete but smaller subset of referrers.
– Don't overwhelm with details; pick a few items from the enhancement list to discuss.
– Take a close look at the simulation results (which will be ready by the actual conference date):
  - Availability should be quite good.
  - The key is the optimality measure. The minimum stretch factor should be ~3 or ~4; both are acceptable. If it is much larger, then routes are too inefficient. (I suspect it will be < 3; see the definition after these notes.)
– Spend as much time on future work as possible:
  - Get ideas on security: how to identify malicious nodes using fast entry/exit.
  - Get ideas on the storage/availability tradeoff.
  - What about replication consistency?
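
The notes above use the stretch factor without defining it; the usual definition (assumed here, not stated in the notes) is the ratio of the distance a query travels through the Tapestry overlay to the direct network distance between the same two hosts, i.e. stretch(a, b) = d_overlay(a, b) / d_network(a, b). Under that reading, a minimum stretch of ~3-4 means located objects are reached over paths at most a few times longer than direct routes.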

