Slide 1: Availability in Wide-Area Service Composition
Bhaskaran Raman and Randy H. Katz, SAHARA, EECS, U.C. Berkeley
Slide 2: Problem Statement
Poor availability of wide-area (inter-domain) Internet paths [Labovitz, FTCS'99] [Labovitz, SIGCOMM'00]:
–10% of paths have only 95% availability
–BGP recovery can take several tens of seconds
Slide 3: Architecture (diagram)
Three planes:
–Hardware platform: the Internet; service clusters (a service cluster is a compute cluster capable of running services)
–Logical platform: overlay network of service clusters; peering relations (peers exchange performance information)
–Application plane: composed services, carried over service-level paths from source to destination
Functionalities at the cluster manager:
–Liveness detection; performance measurement; at-least-once UDP
–Link-state propagation
–Finding overlay entry/exit; location of service replicas
–Service-level path creation, maintenance, and recovery
Slide 4: "Failure" detection in the wide area
Two important characteristics:
–Distribution of outage periods
–Rate of occurrence
Wide-area traces:
–12 pairs of hosts: Berkeley, Stanford, UIUC, CMU, TU-Berlin, UNSW
–300 ms heartbeat; a timeout period is used for failure detection
Findings:
–A timeout of approximately 2 s works well
–Low rate of occurrence (about once an hour)
–Good for many real-time applications
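The heartbeat-and-timeout scheme above can be sketched as follows. The 300 ms heartbeat interval and ~2 s timeout are the values from the traces; the class and method names are illustrative, not from the slides.

```python
import time

HEARTBEAT_INTERVAL = 0.3   # 300 ms heartbeat period (from the traces)
FAILURE_TIMEOUT = 2.0      # ~2 s timeout found to work well

class FailureDetector:
    """Declares a peer failed if no heartbeat arrives within the timeout."""

    def __init__(self, timeout=FAILURE_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_heartbeat = self.clock()

    def on_heartbeat(self):
        # Called whenever a heartbeat message arrives from the peer.
        self.last_heartbeat = self.clock()

    def is_failed(self):
        # True once the silence has exceeded the timeout period.
        return self.clock() - self.last_heartbeat > self.timeout
```

With a 300 ms heartbeat, a 2 s timeout tolerates roughly six consecutive lost heartbeats before declaring failure, which is what keeps the false-alarm rate down to about once an hour in the traces.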
Slide 5: Key Design Points
Overlay size: how many nodes?
–A comparison: Akamai cache servers, O(10,000) servers for Internet-wide operation
–Probably a smaller number of data-center locations
Link-state floods:
–Two floods for each failure
–For a 1,000-node graph, estimate #edges = 10,000
–Failures (>1.8 s outage): O(once an hour) in the worst case
–Only about 6 floods/second in the entire network!
Graph computation:
–Modified version of Dijkstra's algorithm for service composition
–O(k*E*log(N)) computation time, where k = #services composed
–For a 6,510-node network, this takes 50 ms
–Huge overhead, but path caching helps
–Memory: a few MB
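One common way to realize a "modified Dijkstra" for composing k services, with the O(k*E*log(N)) cost quoted above, is to run ordinary Dijkstra over a layered graph: layer i holds nodes that have already passed through the first i services, and a node hosting service i+1 connects its copy in layer i to its copy in layer i+1. This sketch assumes that construction; the graph representation, function name, and zero-cost layer transitions are illustrative, not taken from the slides.

```python
import heapq

def compose_path_cost(adj, services_at, service_chain, src, dst):
    """Cheapest path from src to dst that passes through the services in
    service_chain, in order, via Dijkstra over a layered graph.

    adj:           {node: [(neighbor, link_cost), ...]}  overlay links
    services_at:   {node: set of services hosted there}
    service_chain: [s1, ..., sk]  services to apply in order
    """
    k = len(service_chain)
    # State is (node, layer); layer = number of services applied so far.
    dist = {(src, 0): 0.0}
    pq = [(0.0, src, 0)]
    while pq:
        d, node, layer = heapq.heappop(pq)
        if d > dist.get((node, layer), float("inf")):
            continue  # stale queue entry
        if node == dst and layer == k:
            return d
        # Apply the next service at this node (zero-cost layer transition).
        if layer < k and service_chain[layer] in services_at.get(node, ()):
            if d < dist.get((node, layer + 1), float("inf")):
                dist[(node, layer + 1)] = d
                heapq.heappush(pq, (d, node, layer + 1))
        # Move along an overlay link within the same layer.
        for nbr, cost in adj.get(node, ()):
            nd = d + cost
            if nd < dist.get((nbr, layer), float("inf")):
                dist[(nbr, layer)] = nd
                heapq.heappush(pq, (nd, nbr, layer))
    return None  # no path traversing all services in order
```

The layered graph has (k+1)*N nodes and roughly (k+1)*E edges, which gives the O(k*E*log(N)) bound for fixed small k.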
Slide 6: Wide-area experiments: setup
8 nodes:
–Berkeley, Stanford, UCSD, CMU
–Cable modem (Berkeley), DSL (San Francisco)
–UNSW (Australia), TU-Berlin (Germany)
Text-to-speech composed sessions:
–Half with destinations at Berkeley, half at CMU
–Half with the recovery algorithm enabled, the other half disabled
–4 paths in the system at any time
–Session duration: 2 min 30 s; run for 4 days
Metric: loss rate measured in 5 s intervals
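The per-interval loss-rate metric above can be computed as sketched below, assuming a known expected packet count per 5 s window; the function name and inputs are illustrative.

```python
def loss_rates(recv_times, expected_per_interval, duration, interval=5.0):
    """Per-interval loss rate: fraction of expected packets not received
    in each consecutive `interval`-second window of the session."""
    n_intervals = int(duration // interval)
    counts = [0] * n_intervals
    for t in recv_times:
        i = int(t // interval)
        if 0 <= i < n_intervals:
            counts[i] += 1
    # Clamp in case more packets than expected land in one window.
    return [1.0 - min(c, expected_per_interval) / expected_per_interval
            for c in counts]
```

A 2 min 30 s session then yields 30 such 5 s samples per path, which is the granularity of the CDFs on the following slides.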
Slide 7: Loss rate for a pair of paths (figure)
Slide 8: CDF of loss rates of all failed paths (figure)
Slide 9: CDF of gaps seen at the client (figure)
Slide 10: Summary
Failure detection makes sense in ~2 s:
–Improvement in availability for real-time applications
–Demonstrated with the text-to-speech composed application
About 3.5-4 s recovery time:
–2,000 ms failure-detection timeout
–1,000 ms recovery signaling
–500-1,000 ms state restoration (re-process the current text sentence)
–Of the 2,872 paths, 18 were recovered (0.63%)
Other issues: stability, scaling, load balancing
–Studied using the Millennium emulation platform
Availability %, based on the number of 5 s periods with >10% outage:

               Day 1          Day 2          Day 3          Day 4
               Berk    CMU    Berk    CMU    Berk    CMU    Berk    CMU
No recovery    99.58   99.59  99.65   99.73  99.65   99.79  99.85   100.0
With recovery  99.63   99.59  99.67   99.96  99.65   99.98  99.91   100.0
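The availability figures in the table count 5 s measurement periods. A minimal sketch of that accounting, assuming availability is the fraction of periods whose loss rate stays at or below the 10% outage threshold (the function name and threshold handling are illustrative):

```python
OUTAGE_THRESHOLD = 0.10  # a 5 s period with >10% loss counts as an outage

def availability_percent(period_loss_rates, threshold=OUTAGE_THRESHOLD):
    """Percentage of 5 s periods that did not suffer >threshold loss."""
    if not period_loss_rates:
        raise ValueError("need at least one measurement period")
    good = sum(1 for r in period_loss_rates if r <= threshold)
    return 100.0 * good / len(period_loss_rates)
```

Under this definition, the roughly 0.05-0.25 percentage-point gains in the table correspond to a handful of 5 s outage periods per day being converted into usable periods by recovery.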