An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services
Jingyu Zhou*§, Lingkun Chu*, Tao Yang*§
* Ask Jeeves  § University of California at Santa Barbara

Outline
- Background & motivation
- Membership protocol design
- Implementation
- Evaluation
- Related work
- Conclusion

Background
- Large-scale 24x7 Internet services
  - Thousands of machines connected by many level-2 and level-3 switches (e.g., 10,000 machines at Ask Jeeves)
  - Multi-tiered architecture with data partitioning and replication
  - Some machines are frequently unavailable due to failures, operational errors, and scheduled service updates

Network Topology in Service Clusters
- Multiple hosting centers across the Internet
- In a hosting center:
  - Thousands of nodes
  - Many level-2 and level-3 switches
  - Complex switch topology

Motivation
- Membership protocol
  - Yellow-page directory: discovery of services and their attributes
  - Server aliveness: quick fault detection
- Challenges
  - Efficiency
  - Scalability
  - Fast detection

Fast Failure Detection Is Crucial
- Example: an online auction service, even with replication
  - Failure of one replica: 7s - 12s to detect
  - Service unavailable: 10s - 13s

Communication Cost of Fast Detection
- Communication requirements
  - Membership changes must propagate to all nodes
  - Faster detection needs a higher packet rate
- Consequence: high bandwidth consumption
  - Higher hardware cost
  - More chances of failures
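To see why fast detection is costly, a back-of-envelope sketch in Python (the heartbeat rate and cluster sizes below are illustrative assumptions, not figures from the talk): with all-to-all heartbeating, the cluster-wide packet rate grows quadratically with the number of nodes.

```python
# Back-of-envelope packet cost of all-to-all heartbeating.
# The parameters below are illustrative, not numbers from the talk.
def all_to_all_packet_rate(n_nodes: int, heartbeats_per_sec: float) -> float:
    """Each node heartbeats every other node directly."""
    return n_nodes * (n_nodes - 1) * heartbeats_per_sec

for n in (100, 1_000, 10_000):
    rate = all_to_all_packet_rate(n, heartbeats_per_sec=1)
    print(f"{n:>6} nodes -> {rate:>12,.0f} packets/s cluster-wide")
# Halving detection time by doubling the heartbeat rate doubles this cost.
```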

Design Requirements of a Membership Protocol for Large-Scale Clusters
- Efficient: low bandwidth, few packets
- Topology-adaptive: localize traffic within switches
- Scalable: scale to tens of thousands of nodes
- Fast failure detection and information propagation

Approaches
- Centralized
  - Easy to implement
  - Single point of failure, not scalable, extra delay
- Distributed
  - All-to-all broadcast [Shen'01]: doesn't scale well
  - Gossip [Renesse'98]: probabilistic guarantee only
  - Ring: slow to handle multiple failures
- None of these approaches consider network topology

TAMP: Topology-Adaptive Membership Protocol
- Topology-awareness
  - Form a hierarchical tree according to the network topology
- Topology-adaptiveness
  - Network changes: add/remove/move switches
  - Service changes: add/remove/move nodes
- Key mechanism: exploit the TTL field in IP packets
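The TTL mechanism itself is plain IP: a minimal sketch using a standard UDP multicast socket (the group address, port, and payload are placeholders, not TAMP's values):

```python
import socket

# Minimal sketch of TTL-scoped multicast with a standard UDP socket.
GROUP, PORT = "239.1.1.1", 5007  # placeholder group address and port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
# A TTL of 1 keeps the packet within the nearest router hop; raising the
# TTL widens the multicast scope, which is the knob TAMP turns to build
# progressively larger groups around the switch hierarchy.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
sock.sendto(b"heartbeat", (GROUP, PORT))
```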

Hierarchical Tree Formation Algorithm
1. Form small multicast groups with low TTL values;
2. Each multicast group elects a leader;
3. Group leaders form higher-level groups with larger TTL values;
4. Stop when the maximum TTL value is reached; otherwise, go to step 2.
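A runnable in-memory sketch of this loop (the group sizes, level cap, and highest-id election are simplifying assumptions; the real protocol scopes groups with multicast TTLs rather than a partition function):

```python
# In-memory sketch of the level-building loop above.
MAX_LEVELS = 4  # stands in for the maximum TTL being reached

def groups_at_level(members, level):
    """Stand-in for TTL scoping: group size doubles at each level."""
    size = 3 * (2 ** level)
    return [members[i:i + size] for i in range(0, len(members), size)]

def build_hierarchy(nodes):
    members, level, tree = sorted(nodes), 0, []
    while len(members) > 1 and level < MAX_LEVELS:
        groups = groups_at_level(members, level)
        leaders = [max(g) for g in groups]   # step 2: elect per group (bully-style)
        tree.append((level, groups, leaders))
        members = leaders                    # step 3: leaders form the next level
        level += 1                           # (larger TTL in the real protocol)
    return tree

# Mirrors the example slide: 9 nodes form 3 level-0 groups, whose
# leaders form one level-1 group.
for level, groups, leaders in build_hierarchy(list(range(9))):
    print(f"level {level}: groups={groups}, leaders={leaders}")
```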

An Example
- 3 level-3 switches with 9 nodes

Node Joining Procedure
- Purpose
  - Find/elect a leader
  - Exchange membership information
- Process
  1. Join a channel and listen;
  2. If a leader exists, stop and bootstrap with the leader;
  3. Otherwise, elect a leader (bully algorithm);
  4. If elected leader, increase the channel ID & TTL, go to step 1.
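A minimal sketch of this loop (the leader table and node names are hypothetical stand-ins for listening on the per-level multicast channel):

```python
# Sketch of the joining loop. In the real protocol a node listens on a
# TTL-scoped multicast channel per level; here current leaders are faked
# with a dict, and node names/levels are hypothetical.
leaders = {0: "node-7"}   # assume a leader already exists at level 0
MAX_LEVEL = 4

def join(node_id):
    level = 0
    while level < MAX_LEVEL:
        heard = leaders.get(level)        # step 1: join channel and listen
        if heard is not None:             # step 2: leader exists -> bootstrap
            return f"{node_id}: bootstrap membership from {heard} (level {level})"
        leaders[level] = node_id          # step 3: win an uncontested election
        level += 1                        # step 4: next channel, larger TTL
    return f"{node_id}: root leader"

print(join("node-9"))   # -> bootstraps from node-7 at level 0
```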

Properties of TAMP
- Upward propagation guarantee
  - A node is always aware of its leader
  - Messages can always be propagated to nodes at higher levels
- Downward propagation guarantee
  - A node at level i is also the leader of groups at levels i-1, i-2, …, 0
  - Messages can always be propagated to lower-level nodes
- Eventual convergence
  - The view of every node converges

Update Protocol
- Triggered when the cluster structure changes
- Heartbeats for failure detection
- When a leader receives an update, it multicasts the update both up and down the hierarchy
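In sketch form (the per-level channel bookkeeping below is illustrative, not TAMP's wire format):

```python
from collections import defaultdict

# Sketch of update dissemination. Channels stand in for the per-level
# multicast groups; a level-i leader multicasts an update on its own
# level's channel (down) and on the level-(i+1) channel (up).
channels = defaultdict(list)   # level id -> messages delivered

def multicast(level, msg):
    channels[level].append(msg)

def leader_on_update(leader_level, update):
    multicast(leader_level, update)        # down: the group this node leads
    multicast(leader_level + 1, update)    # up: the group of leaders above

leader_on_update(leader_level=1, update="node-3 left the cluster")
print(dict(channels))   # the update appears on both adjacent levels
```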

Fault Tolerance Techniques
- Leader failure: backup leader or re-election
- Network partition failure
  - Time out all nodes managed by a failed leader
  - Hierarchical timeouts: longer timeouts for higher levels
- Packet loss
  - Leaders exchange deltas since the last update
  - Piggyback the last three changes
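The hierarchical-timeout idea reduces to a small function (the base timeout and per-level multiplier are assumptions, not the deployed values):

```python
# Sketch of hierarchical timeouts: higher levels cover more switches and
# nodes, so they get longer timeouts before a leader is declared dead.
BASE_TIMEOUT_S = 5        # e.g., 5 missed heartbeats at 1 packet/s
GROWTH_PER_LEVEL = 2      # assumed multiplier per level

def timeout_for_level(level: int) -> int:
    return BASE_TIMEOUT_S * GROWTH_PER_LEVEL ** level

for lvl in range(4):
    print(f"level {lvl}: declare dead after {timeout_for_level(lvl)}s of silence")
```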

Scalability Analysis
- Protocols compared: all-to-all, gossip, and TAMP
- Basic performance factors
  - Failure detection time (T_fail_detect)
  - View convergence time (T_converge)
  - Communication cost in terms of bandwidth (B)

Scalability Analysis (Cont.)
- Two metrics
  - BDP = B * T_fail_detect: low failure detection time at low bandwidth is desired
  - BCP = B * T_converge: low convergence time at low bandwidth is desired

  Protocol      BDP             BCP
  All-to-all    O(n^2)          O(n^2)
  Gossip        O(n^2 log n)    O(n^2 log n)
  TAMP          O(n)            O(n) + O(B * log_k n)

  (n: total # of nodes; k: group size, a constant)
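To make the asymptotic gap concrete, a quick evaluation of the BDP column (constant factors are dropped, so only the growth rates are meaningful; this is a sketch, not a measurement):

```python
import math

# Evaluate the BDP column of the table above for concrete cluster sizes.
def bdp(protocol: str, n: int) -> float:
    if protocol == "all-to-all":
        return n ** 2
    if protocol == "gossip":
        return n ** 2 * math.log2(n)
    if protocol == "tamp":
        return n
    raise ValueError(protocol)

for n in (100, 1_000, 10_000):
    print(n, {p: f"{bdp(p, n):,.0f}" for p in ("all-to-all", "gossip", "tamp")})
```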

Implementation
- Inside the Neptune middleware [Shen'01]: programming and runtime support for building cluster-based Internet services
- Can be easily integrated into other clustering frameworks

Evaluation: Objectives & Settings
- Metrics
  - Bandwidth consumption
  - Failure detection time
  - View convergence time
- Hardware settings
  - 100 dual-PIII 1.4GHz nodes
  - 2 switches connected by a Gigabit switch
- Protocol settings
  - Heartbeat frequency: 1 packet/s
  - A node is deemed dead after 5 consecutive losses
  - Gossip mistake probability: 0.1%
  - # of nodes: 20 - 100 in steps of 20

Bandwidth Consumption
- All-to-all & gossip: quadratic increase
- TAMP: close to linear

Failure Detection Time
- Gossip: log(N) increase
- All-to-all & TAMP: constant

View Convergence Time
- Gossip: log(N) increase
- All-to-all & TAMP: constant

Related Work
- Membership & failure detection: [Chandra'96], [Fetzer'99], [Fetzer'01], [Neiger'96], and [Stok'94]
- Gossip-style protocols: SCAMP, [Kempe'01], and [Renesse'98]
- High-availability systems (e.g., HA-Linux, Linux Heartbeat)
- Cluster-based network services: TACC, Porcupine, Neptune, Ninja
- Resource monitoring: Ganglia, NWS, MDS2

Contributions & Conclusions
- TAMP is a highly efficient and scalable membership protocol for large-scale clusters
- Exploits the TTL field in IP packets for a topology-adaptive design
- Verified through property analysis and experimentation
- Deployed at Ask Jeeves in clusters with thousands of machines

Questions?