Dr. Multicast for Data Center Communication Scalability Ymir Vigfusson Hussam Abu-Libdeh Mahesh Balakrishnan Ken Birman Cornell University Yoav Tock IBM.

Slides:



Advertisements
Similar presentations
PLATO: Predictive Latency- Aware Total Ordering Mahesh Balakrishnan Ken Birman Amar Phanishayee.
Advertisements

Dr. Multicast for Data Center Communication Scalability Ymir Vigfusson Hussam Abu-Libdeh Mahesh Balakrishnan Ken Birman Cornell University Yoav Tock IBM.
Distributed System Services Prepared By:- Monika Patel.
System Area Network Abhiram Shandilya 12/06/01. Overview Introduction to System Area Networks SAN Design and Examples SAN Applications.
Scalable Content-Addressable Network Lintao Liu
PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric. Presented by: Vinuthna Nalluri Shiva Srivastava.
Dynamo: Amazon's Highly Available Key-value Store Distributed Storage Systems CS presented by: Hussam Abu-Libdeh.
Gossip Algorithms and Implementing a Cluster/Grid Information service MsSys Course Amar Lior and Barak Amnon.
Ýmir Vigfússon IBM Research Haifa Labs Ken Birman Cornell University Qi Huang Cornell University Deepak Nataraj Cornell University.
Reliable Group Communication Quanzeng You & Haoliang Wang.
Ph.D. thesis defense. August 21, Ýmir Vigfússon Joint work with: Hussam Abu-Libdeh, Mahesh Balakrishnan, Ken Birman, Gregory Chockler, Qi Huang,
Using DSVM to Implement a Distributed File System Ramon Lawrence Dept. of Computer Science
An Associative Broadcast Based Coordination Model for Distributed Processes James C. Browne Kevin Kane Hongxia Tian Department of Computer Sciences The.
Approaches to EJB Replication. Overview J2EE architecture –EJB, components, services Replication –Clustering, container, application Conclusions –Advantages.
Distributed Processing, Client/Server, and Clusters
Virtual Synchrony Jared Cantwell. Review Multicast Causal and total ordering Consistent Cuts Synchronized clocks Impossibility of consensus Distributed.
Ken Birman Cornell University. CS5410 Fall
Virtual Synchrony Ki Suh Lee Some slides are borrowed from Ken, Jared (cs ) and Justin (cs )
Reference: Message Passing Fundamentals.
1 Principles of Reliable Distributed Systems Tutorial 12: Frangipani Spring 2009 Alex Shraer.
1 Switching and Forwarding Bridges and Extended LANs.
Group Communications Group communication: one source process sending a message to a group of processes: Destination is a group rather than a single process.
CS 268: Lecture 5 (Project Suggestions) Ion Stoica February 6, 2002.
A Scalable Content-Addressable Network Authors: S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker University of California, Berkeley Presenter:
OSMOSIS Final Presentation. Introduction Osmosis System Scalable, distributed system. Many-to-many publisher-subscriber real time sensor data streams,
CS 268: Project Suggestions Ion Stoica February 6, 2003.
Distributed Systems 2006 Group Membership * *With material adapted from Ken Birman.
1 Switching and Forwarding Bridges and Extended LANs.
The Future of the Internet Jennifer Rexford ’91 Computer Science Department Princeton University
An Active Reliable Multicast Framework for the Grids M. Maimour & C. Pham ICCS 2002, Amsterdam Network Support and Services for Computational Grids Sunday,
Internet Indirection Infrastructure (i3) Ion Stoica, Daniel Adkins, Shelley Zhuang, Scott Shenker, Sonesh Surana UC Berkeley SIGCOMM 2002.
Correctness of Gossip-Based Membership under Message Loss Maxim Gurevich, Idit Keidar Technion.
SMUCSE 8344 Constraint-Based Routing in MPLS. SMUCSE 8344 Constraint Based Routing (CBR) What is CBR –Each link a collection of attributes (performance,
Qi Huang Ýmir Vigfússon Ken Birman Haoyuan Li Cornell University IBM Research Haifa Labs Cornell University.
1 The Google File System Reporter: You-Wei Zhang.
1 Computer Communication & Networks Lecture 22 Network Layer: Delivery, Forwarding, Routing (contd.)
An Efficient Topology-Adaptive Membership Protocol for Large- Scale Cluster-Based Services Jingyu Zhou * §, Lingkun Chu*, Tao Yang* § * Ask Jeeves §University.
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
Jonathan Walpole CSE515 - Distributed Computing Systems 1 Teaching Assistant for CSE515 Rahul Dubey.
Transparency in Distributed Operating Systems Vijay Akkineni.
1 The Design of a Robust Peer-to-Peer System Rodrigo Rodrigues, Barbara Liskov, Liuba Shrira Presented by Yi Chen Some slides are borrowed from the authors’
Content-Based Routing in Mobile Ad Hoc Networks Milenko Petrovic, Vinod Muthusamy, Hans-Arno Jacobsen University of Toronto July 18, 2005 MobiQuitous 2005.
Autonomic SLA-driven Provisioning for Cloud Applications Nicolas Bonvin, Thanasis Papaioannou, Karl Aberer Presented by Ismail Alan.
SPREAD TOOLKIT High performance messaging middleware Presented by Sayantam Dey Vipin Mehta.
A Hybrid Multicast-Unicast Infrastructure for Efficient Publish-Subscribe in Enterprise Networks Danny Bickson, Ezra N. Hoch, Nir Naaman and Yoav Tock.
CCNA 3 Week 2 Link State Protocols OSPF. Copyright © 2005 University of Bolton Distance Vector vs Link State Distance Vector –Copies Routing Table to.
LogP and BSP models. LogP model Common MPP organization: complete machine connected by a network. LogP attempts to capture the characteristics of such.
Toward Fault-tolerant P2P Systems: Constructing a Stable Virtual Peer from Multiple Unstable Peers Kota Abe, Tatsuya Ueda (Presenter), Masanori Shikano,
Dealing with open groups The view of a process is its current knowledge of the membership. It is important that all processes have identical views. Inconsistent.
Distributed System Concepts and Architectures Services
Hwajung Lee. A group is a collection of users sharing some common interest.Group-based activities are steadily increasing. There are many types of groups:
CCNP 3: Chapter 3 Implementing Spanning Tree. Overview Basics of implementing STP Election of Root Bridge and Backup Enhancing STP RSTP MSTP EtherChannels.
Minimizing Churn in Distributed Systems P. Brighten Godfrey, Scott Shenker, and Ion Stoica UC Berkeley SIGCOMM’06.
Slingshot: Time-Critical Multicast for Clustered Applications Mahesh Balakrishnan Stefan Pleisch Ken Birman Cornell University.
CS 6401 Overlay Networks Outline Overlay networks overview Routing overlays Resilient Overlay Networks Content Distribution Networks.
NDDS: The Real-Time Publish Subscribe Middleware Network Data Delivery Service An Efficient Real-Time Application Communications Platform Presented By:
Mobile IP Definition: Mobile IP is a standard communication protocol, defined to allow mobile device users to move from one IP network to another while.
LOOKING UP DATA IN P2P SYSTEMS Hari Balakrishnan M. Frans Kaashoek David Karger Robert Morris Ion Stoica MIT LCS.
Spring 2000CS 4611 Routing Outline Algorithms Scalability.
Security Kim Soo Jin. 2 Contents Background Introduction Secure multicast using clustering Spatial Clustering Simulation Experiment Conclusions.
© 2015 Pittsburgh Supercomputing Center Opening the Black Box Using Web10G to Uncover the Hidden Side of TCP CC PI Meeting Austin, TX September 29, 2015.
Distributed, Self-stabilizing Placement of Replicated Resources in Emerging Networks Bong-Jun Ko, Dan Rubenstein Presented by Jason Waddle.
1 Roie Melamed, Technion AT&T Labs Araneola: A Scalable Reliable Multicast System for Dynamic Wide Area Environments Roie Melamed, Idit Keidar Technion.
Mobile IP THE 12 TH MEETING. Mobile IP  Incorporation of mobile users in the network.  Cellular system (e.g., GSM) started with mobility in mind. 
Reliable multicast Tolerates process crashes. The additional requirements are: Only correct processes will receive multicasts from all correct processes.
Internet Indirection Infrastructure (i3)
Multicast Outline Multicast Introduction and Motivation DVRMP.
Advanced Computer Networks
Alternative system models
EE 122: Lecture 13 (IP Multicast Routing)
Presentation transcript:

Dr. Multicast for Data Center Communication Scalability Ymir Vigfusson Hussam Abu-Libdeh Mahesh Balakrishnan Ken Birman Cornell University Yoav Tock IBM Research Haifa LADIS, September 15, 2008

IP Multicast in Data Centers IPMC is not used in data centers

IP Multicast in Data Centers Why is IP multicast rarely used?

IP Multicast in Data Centers Why is IP multicast rarely used? o Limited IPMC scalability on switches/routers and NICs

IP Multicast in Data Centers Why is IP multicast rarely used? o Limited IPMC scalability on switches/routers and NICs o Broadcast storms: Loss triggers a horde of NACKs, which triggers more loss, etc. o Disruptive even to non-IPMC applications.

IP Multicast in Data Centers IP multicast has a bad reputation

IP Multicast in Data Centers IP multicast has a bad reputation o Works great up to a point, after which it breaks catastrophically

IP Multicast in Data Centers Bottom line: o Administrators have no control over multicast use... o Without control, they opt for never.

Dr. Multicast

Dr. Multicast (MCMD) Policy: Permits data center operators to selectively enable and control IPMC Transparency: Standard IPMC interface, system calls are overloaded. Performance: Uses IPMC when possible, otherwise point-to-point UDP Robustness: Distributed, fault-tolerant service

Terminology Process: Application that joins logical IPMC groups Logical IPMC group: A virtualized abstraction Physical IPMC group: As usual UDP multi-send: New kernel-level system-call Collection: Set of logical IPMC groups with identical membership

Acceptable Use Policy Assume a higher-level network management tool compiles policy into primitives Explicitly allow a process to use IPMC groups o allow-join(process,logical IPMC) o allow-send(process,logical IPMC) UDP multi-send always permitted Additional restraints o max-groups(process,limit) o force-udp(process,logical IPMC)

Overview Library module Mapping module Gossip layer Optimization questions Results

Transparent. Overloads the IPMC functions o setsockopt(), send(), etc. Translation. Logical IPMC map to a set of P-IPMC/unicast addresses. o Two extremes MCMD Library Module

MCMD Agent runs on each machine o Contacted by the library modules o Provides a mapping One agent elected to be a leader: o Allocates IPMC resources according to the current policy MCMD Mapping Role

Allocating IPMC resources: An optimization problem Procs L-IPMC MCMD Mapping Role This box intentionally left BLACK Procs Collections L-IPMC

Runs system-wide Automatic failure detection Group membership fully replicated via gossip o Node reports its own state o Future: Replicate more selectively o Leader runs optimization algorithm on data and reports the mapping MCMD Gossip Layer

But gossip is slow... Implications: o Slow propagation of group membership o Slow propagation of new maps o We assume a low rate of membership churn Remedy: Broadcast module o Leader broadcasts urgent messages o Bounded bandwidth of urgent channel o Trade-off between latency and scalability MCMD Gossip Layer

Overview Library module Mapping module Gossip layer Optimization questions Results

Optimization Questions Procs L-IPMC BLACK Collections Procs L-IPMC First step: compress logical IPMC groups

klk;l Optimization Questions How compressible are subscriptions? o Multi-objective optimization:  Minimize number of collections  Minimize bandwidth overhead on network Ties in with social preferences o How do people's subscriptions overlap?

klk;l Optimization Questions How compressible are subscriptions? o Multi-objective optimization:  Minimize number of groups  Minimize bandwidth overhead on network o Thm: The general problem is NP-complete o Thm: In uniform random allocation, "little" compression opportunity. o Replication (e.g. for load balancing) can generate duplicates (easy case).

klk;l Optimization Questions Which collections get an IPMC address? o Thm: Ordered by decreasing traffic*size, assign P-IPMC addresses greedily, we minimize bandwidth. Tiling heuristic: o Sort L-IPMC by traffic*size o Greedily collapse identical groups o Assign IPMC to collections in reverse order of traffic*size, UDP-multisend to the rest Building tilings incrementally

Insignificant overhead when mapping L-IPMC to P-IPMC. klk;l Overhead

Linux kernel module increases UDP-multisend throughput by 17% (compared to user-space UDP-multisend) klk;l Overhead

A malfunctioning node bombards an existing IPMC group. MCMD policy prevents ill-effects klk;l Policy control

A malfunctioning node bombards an existing IPMC group. MCMD policy prevents ill-effects klk;l Policy control

klk;l Network Overhead MCMD Gossip Layer uses constant background bandwidth Latency of leaves/joins/new tilings bounded by gossip dissemination latency

Conclusion IPMC has been a bad citizen...

Conclusion IPMC has been a bad citizen... Dr. Multicast has the cure! Opportunity for big performance enhancements and policy control.

Thank you!