1
An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services Jingyu Zhou*§, Lingkun Chu*, Tao Yang*§ (*Ask Jeeves, §University of California at Santa Barbara)
2
Outline Background & motivation Membership protocol design Implementation Evaluation Related work Conclusion
3
Background Large-scale 24x7 Internet services Thousands of machines connected by many level-2 and level-3 switches (e.g. 10,000 at Ask Jeeves) Multi-tiered architecture with data partitioning and replication Some machines are frequently unavailable due to failures, operational errors, and scheduled service updates
4
Network Topology in Service Clusters Multiple hosting centers across the Internet In a hosting center: thousands of nodes, many level-2 and level-3 switches, complex switch topology
5
Motivation Membership protocol Yellow page directory – discovery of services and their attributes Server aliveness – quick fault detection Challenges Efficiency Scalability Fast detection
6
Fast Failure Detection is Crucial Online auction service, even with replication: failure of one replica, 7s - 12s; service unavailable, 10s - 13s
7
Communication Cost for Fast Detection Communication requirement: propagate to all nodes. Fast detection needs a higher packet rate and higher bandwidth, which means higher hardware cost and more chances of failures
8
Design Requirements of Membership Protocol for Large-scale Clusters Efficient: bandwidth, # of packets Topology-adaptive: localize traffic within switches Scalable: scale to tens of thousands of nodes Fast failure detection and information propagation.
9
Approaches Centralized: easy to implement, but a single point of failure, not scalable, extra delay Distributed All-to-all broadcast [Shen'01]: doesn't scale well Gossip [Renesse'98]: probabilistic guarantee Ring: slow to handle multiple failures None of these approaches considers network topology
10
TAMP: Topology-Adaptive Membership Protocol Topology-awareness: form a hierarchical tree according to the network topology Topology-adaptiveness: network changes (add/remove/move switches) and service changes (add/remove/move nodes) Exploits the TTL field in IP packets
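The TTL mechanism TAMP builds on is an ordinary socket option. The sketch below (not from the paper) shows TTL-scoped multicast in Python; the group address, port, TTL values, and message text are illustrative assumptions.

```python
import socket
import struct

GROUP = "239.1.1.1"   # hypothetical administratively scoped multicast address
PORT = 5007           # hypothetical port

def make_sender(ttl: int) -> socket.socket:
    """UDP socket whose multicast packets are dropped after `ttl` router hops."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", ttl))
    return s

def make_receiver() -> socket.socket:
    """Join the multicast group and listen for membership messages."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT))
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return s

# A TTL of 1 keeps heartbeats inside the local switch; group leaders re-send
# with a larger TTL so their messages reach the next level of the hierarchy.
sender = make_sender(ttl=1)
sender.sendto(b"HEARTBEAT node-42", (GROUP, PORT))
```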
11
Hierarchical Tree Formation Algorithm 1. Form small multicast groups with low TTL values; 2. Each multicast group performs an election; 3. Group leaders form higher-level groups with larger TTL values; 4. Stop when the max TTL value is reached; otherwise, goto Step 2.
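To make the four steps concrete, here is a small in-memory simulation that mirrors the 9-node, 3-switch example on the next slide; the toy topology, the TTL ceiling, and the "lowest id wins" election rule are assumptions for illustration only.

```python
from collections import defaultdict

MAX_TTL = 2   # hypothetical TTL value that covers the whole hosting center

def scope(node, ttl):
    """Which multicast group a node's packets reach with this TTL (toy topology:
    TTL 1 stays within the node's level-3 switch, MAX_TTL reaches everyone)."""
    return node // 3 if ttl < MAX_TTL else 0   # 3 nodes per level-3 switch

def form_hierarchy(nodes):
    level, ttl, members = 0, 1, list(nodes)
    tree = []
    while True:
        groups = defaultdict(list)
        for n in members:                              # 1. form small TTL-scoped groups
            groups[scope(n, ttl)].append(n)
        leaders = [min(g) for g in groups.values()]    # 2. each group elects a leader
        tree.append({"level": level, "groups": dict(groups), "leaders": leaders})
        if ttl >= MAX_TTL or len(leaders) == 1:        # 4. stop at the max TTL value
            return tree
        members, level, ttl = leaders, level + 1, ttl + 1  # 3. leaders form a higher level

print(form_hierarchy(range(9)))   # 9 nodes behind 3 level-3 switches (slide 12)
```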
12
An Example: 3 level-3 switches with 9 nodes
13
Node Joining Procedure Purpose: find/elect a leader; exchange membership information Process: 1. Join a channel and listen; 2. If a leader exists, stop and bootstrap with the leader; 3. Otherwise, elect a leader (bully algorithm); 4. If elected as leader, increase the channel ID & TTL, goto 1.
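A hedged sketch of steps 1-4 as a per-node loop over one channel; the message format, timeouts, and the "higher id wins" bully rule shown here are illustrative assumptions rather than the paper's wire protocol.

```python
import socket
import time

def join_level(sock, group_addr, my_id, timeout=2.0):
    """One round of the joining procedure on a single multicast channel.
    Returns the id of the leader, which may be our own id."""
    sock.settimeout(0.2)
    sock.sendto(f"CANDIDATE {my_id}".encode(), group_addr)   # announce ourselves
    best = my_id
    deadline = time.time() + timeout
    while time.time() < deadline:                            # 1. join the channel and listen
        try:
            msg, _ = sock.recvfrom(1500)
        except socket.timeout:
            continue
        kind, peer = msg.decode().split()
        if kind == "LEADER":
            return peer                                      # 2. a leader exists: bootstrap with it
        if kind == "CANDIDATE" and peer > best:
            best = peer                                      # 3. bully-style election: higher id wins
    if best == my_id:
        sock.sendto(f"LEADER {my_id}".encode(), group_addr)  # claim leadership of this group
    return best

# 4. If join_level returns our own id, the caller moves to channel ID + 1 with a
#    larger TTL and calls join_level again, as in the formation algorithm above.
```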
14
Properties of TAMP Upward propagation guarantee: a node is always aware of its leader; messages can always be propagated to nodes at higher levels Downward propagation guarantee: a node at level i must be a leader at levels i-1, i-2, …, 0; messages can always be propagated to lower-level nodes Eventual convergence: the view of every node converges
15
Update Protocol when the cluster structure changes: heartbeats for failure detection; when a leader receives an update, it multicasts the update up & down
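A sketch of the heartbeat bookkeeping and the "multicast up & down" step a leader performs; the two-channel-per-leader model and the JSON payload are assumptions made for illustration.

```python
import json
import time

class LeaderState:
    """Per-leader bookkeeping: heartbeats from its own group, plus the two
    channels it forwards updates on (low-TTL child group, higher-TTL parent group)."""
    def __init__(self, sock, down_addr, up_addr):
        self.sock = sock
        self.down_addr = down_addr       # own (child) multicast group
        self.up_addr = up_addr           # parent-level multicast group
        self.last_seen = {}              # node id -> time of last heartbeat

    def on_heartbeat(self, node_id):
        self.last_seen[node_id] = time.time()

    def expired(self, timeout):
        """Members whose heartbeats stopped (the paper uses hierarchical timeouts)."""
        now = time.time()
        return [n for n, t in self.last_seen.items() if now - t > timeout]

    def on_update(self, delta, from_below):
        """When a leader receives a membership update, multicast it up and down the tree."""
        payload = json.dumps(delta).encode()
        if from_below:
            self.sock.sendto(payload, self.up_addr)    # propagate toward the root
        self.sock.sendto(payload, self.down_addr)      # re-multicast to the child group
```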
16
Fault Tolerance Techniques Leader failure: backup leader or re-election Network partition failure: time out all nodes managed by a failed leader; hierarchical timeout (longer timeout for higher levels) Packet loss: leaders exchange deltas since the last update; piggyback the last three changes
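The packet-loss defence is easy to picture as a tiny change log; the sequence numbers and the fixed depth of three below are assumptions that only illustrate the "piggyback the last three changes" idea.

```python
from collections import deque

class ChangeLog:
    def __init__(self, depth=3):
        self.seq = 0
        self.recent = deque(maxlen=depth)   # only the last `depth` changes are kept

    def record(self, change):
        self.seq += 1
        self.recent.append((self.seq, change))

    def message(self, change):
        """New update plus the piggybacked recent history."""
        self.record(change)
        return {"latest": self.seq, "changes": list(self.recent)}

    def apply(self, msg, last_seen):
        """Receiver side: replay any piggybacked change it has not seen yet."""
        return [c for s, c in msg["changes"] if s > last_seen]

log = ChangeLog()
log.record(("join", "node-7"))
log.record(("leave", "node-3"))
print(log.message(("join", "node-9")))      # carries the three most recent changes
```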
17
Scalability Analysis Protocols: all-to-all, gossip, and TAMP Basic performance factors: failure detection time (T_fail_detect), view convergence time (T_converge), communication cost in terms of bandwidth (B)
18
Scalability Analysis (Cont.) Two metrics: BDP = B × T_fail_detect (low failure detection time with low bandwidth is desired); BCP = B × T_converge (low convergence time with low bandwidth is desired)
Protocol     BDP             BCP
All-to-all   O(n^2)          O(n^2)
Gossip       O(n^2 log n)    O(n^2 log n)
TAMP         O(n)            O(n) + O(B·log_k n)
n: total # of nodes; k: size of each group, a constant
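Written out, the two metrics simply weight bandwidth by the corresponding latency; the log_k n factor in TAMP's BCP reflects the depth of a k-ary leader tree over n nodes (a reading of the table, not an additional claim from the slide):

```latex
\mathrm{BDP} = B \cdot T_{\text{fail\_detect}}, \qquad
\mathrm{BCP} = B \cdot T_{\text{converge}}, \qquad
\text{tree depth} \approx \log_k n
```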
19
Implementation Inside the Neptune middleware [Shen'01] – programming and runtime support for building cluster-based Internet services Can be easily coupled into other clustering frameworks
20
Evaluation: Objectives & Settings Metrics: bandwidth, failure detection time, view convergence time Hardware settings: 100 dual PIII 1.4GHz nodes; 2 switches connected by a Gigabit switch Protocol-related settings: frequency, 1 packet/s; a node is deemed dead after 5 consecutive losses; gossip mistake probability, 0.1%; # of nodes: 20 – 100 in steps of 20
21
Bandwidth Consumption All-to-All & Gossip: quadratic increase TAMP: close to linear
22
Failure Detection Time Gossip: log(N) increase All-to-All & TAMP: constant
23
View Convergence Time Gossip: log(N) increase All-to-All & TAMP: constant
24
Related Work Membership & failure detection [Chandra’96], [Fetzer’99], [Fetzer’01], [Neiger’96], and [Stok’94] Gossip-style protocols SCAMP, [Kempe’01], and [Renesse’98] High-availability system (e.g., HA-Linux, Linux Heartbeat) Cluster-based network services TACC, Porcupine, Neptune, Ninja Resource monitoring: Ganglia, NWS, MDS2
25
Contributions & Conclusions TAMP is a highly efficient and scalable membership protocol for large clusters Exploits the TTL field in IP packets for a topology-adaptive design Verified through property analysis and experimentation Deployed at Ask Jeeves clusters with thousands of machines
26
Questions?