Presentation transcript:

Sources of Instability in Data Center Multicast
Dmitry Basin, Ken Birman, Idit Keidar, Ymir Vigfusson
LADIS 2010

Multicast is Important
Replication is used in data centers and clouds:
- to provision financial servers for read-mostly requests
- to parallelize computation
- to cache important data
- for fault tolerance
Reliable multicast is a basis for consistent replication:
- transactions that update replicated data
- atomic broadcast

Why Not IP/UDP-Based Multicast?
CTOs say these mechanisms may destabilize the whole data center:
- lack of flow control
- tendency to cause "synchronization" and load oscillations
Anecdotes (eBay, Amazon): all goes well until one day, under heavy load, loss rates spike, triggering throughput collapse.

TCP Tree Overlays
The most common variant of data center multicast:
- "The cost of a cloud: research problems in data center networks." Greenberg, Hamilton, Maltz, and Patel.
- "Toward a cloud computing research agenda." Birman, Chockler, and van Renesse.
Most application-level multicast protocols use trees, at least implicitly (SCRIBE, NICE); mesh solutions behave like trees too as long as there are no node failures, and the time between failures in a data center can be long.
Advantages: flow and congestion control on every TCP link.
Should be stable... ?

Suppose We Had a Perfect Tree
Suppose we had a perfect multicast tree:
- high-throughput, low-latency links
- very rare node failures
Would it work fine? Would the multicast have high throughput?
- Theory and simulation papers say: YES! [Baccelli, Chaintreau, Liu, Riabov 2005]
- Data center operators say: NO! They observed throughput collapse and oscillations when the system became large.

Our Goal: Explain the Gap
Our hypothesis: instability stems from disturbances
- very rare, short events: OS scheduling, network congestion, Java GC stalls, ...
- never modeled or simulated before
- become significant as the system grows
What if there were one pea per mattress?

Multicast Model
Nodes are organized in a complete tree.

Multicast Model
Links are reliable, with congestion and flow control (e.g., TCP links).

Multicast Model
Each node has incoming buffers and outgoing buffers of size B_0.

Multicast Model
A thread at each node forwards packets from the incoming buffer to the outgoing buffers.

Multicast Model
The root forwards packets from the application to its outgoing buffers.

Multicast Model
If any outgoing buffer is full, the thread stalls.
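The model above fits in a few lines of code. The following is a minimal sketch, not the authors' simulator: a tiny complete tree with bounded incoming/outgoing buffers, TCP-like back-pressure on the links, and a forwarding thread that stalls whenever it is stuck or any outgoing buffer is full. All names, buffer sizes, and the disturbance schedule are illustrative.

```python
from collections import deque

BUF_SIZE = 4  # B_0, in packets; purely illustrative

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.incoming = deque()
        # One bounded outgoing buffer per child (empty dict for leaves).
        self.outgoing = {c.name: deque() for c in self.children}
        self.bad = False  # True while this node's thread is stuck (a disturbance)

    def step(self):
        # The forwarding thread moves one packet from the incoming buffer to
        # every outgoing buffer; it stalls if it is Bad, has nothing to
        # forward, or any outgoing buffer is full.
        if self.bad or not self.incoming:
            return
        if any(len(q) >= BUF_SIZE for q in self.outgoing.values()):
            return
        pkt = self.incoming.popleft()
        for q in self.outgoing.values():
            q.append(pkt)

def deliver(parent):
    # Reliable, flow-controlled links: a packet crosses a link only if the
    # child's incoming buffer has room (TCP-like back-pressure).
    for child in parent.children:
        q = parent.outgoing[child.name]
        if q and len(child.incoming) < BUF_SIZE:
            child.incoming.append(q.popleft())
        deliver(child)

# Tiny complete tree: root A with two leaf children B and C.
b, c = Node("B"), Node("C")
root = Node("A", children=[b, c])

sent = 0
for t in range(100):
    b.bad = 20 <= t < 60  # a disturbance: B's thread is stuck for 40 ticks
    # The root injects an application packet only if no outgoing buffer is full.
    if all(len(q) < BUF_SIZE for q in root.outgoing.values()):
        for q in root.outgoing.values():
            q.append(("pkt", sent))
        sent += 1
    deliver(root)
    b.step()
    c.step()

print("packets the root managed to send in 100 ticks:", sent)
```

Running this, the root sends at full rate until B's stall fills the buffers on the path to B, at which point the root blocks; it resumes only after B recovers, which is exactly the Active/Blocking behavior analyzed next.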

Analytic Model: Node States
Each node alternates between two states:
- Good: the node's thread works properly. The state duration is a memoryless process with an exponential distribution.
- Bad: the node's thread is stuck. The state duration has some distribution with finite expectation.
(The slide depicts the two-state chain together with its transition probabilities Pr(Good) and Pr(Bad).)
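A hedged sketch of this node-state process: Good durations are drawn from an exponential (memoryless) distribution, Bad durations from some distribution with finite expectation. The concrete means and the fixed-length Bad stall below are illustrative choices, not parameters from the paper.

```python
import random

MEAN_GOOD = 3600.0  # seconds; illustrative: roughly one disturbance per hour
MEAN_BAD = 1.0      # seconds; illustrative: a short stall (GC pause, scheduling, ...)

def good_duration():
    # Memoryless Good state: exponential distribution.
    return random.expovariate(1.0 / MEAN_GOOD)

def bad_duration():
    # Any distribution with finite expectation fits the model;
    # a fixed 1-second stall is the simplest illustrative choice.
    return MEAN_BAD

def node_timeline(horizon):
    """Yield (state, start_time, duration) tuples for one node up to `horizon` seconds."""
    t, state = 0.0, "Good"
    while t < horizon:
        d = good_duration() if state == "Good" else bad_duration()
        yield state, t, d
        t += d
        state = "Bad" if state == "Good" else "Good"

bad_time = sum(d for s, _, d in node_timeline(24 * 3600) if s == "Bad")
print(f"time one node spends Bad over a day: {bad_time:.1f} s")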

Analytic Model: System States
The system alternates between two states:
- Active → Blocking: when some node u becomes Bad
- Blocking → Active: when node u becomes Good

(Figure: example complete tree with root A, internal nodes B and C, and leaves D, E, F, G; the system is Active.)
B_max = the aggregate buffer size on the path from the root to a leaf.

Node B becomes Bad: its thread stops forwarding packets. The system transitions from Active to Blocking.

Blocking: the root can still fill the buffers on the path to B before it blocks.

Link flow control then prevents any further sending toward B.

The root's thread can no longer forward, so the root blocks.

If another node (G) becomes Bad while the system is Blocking, the system state remains Blocking.

Node B becomes Good, and the system transitions from Blocking back to Active.
During the Blocking state, the number of bits the root sent is at most the aggregate buffer space on the path to B, which is at most B_max.
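To get a feel for this bound, here is a back-of-envelope calculation with illustrative numbers: 64 KB TCP buffers (the default used in the results later in the talk), a hypothetical depth-3 root-to-leaf path with two buffers per link, and a 1 Gbps sender. During a 1-second Blocking state the root can push at most B_max bytes, a tiny fraction of what it could send while Active.

```python
BUF_BYTES = 64 * 1024     # B_0: default TCP buffer size (used in the results later)
DEPTH = 3                 # illustrative root-to-leaf path length, in links
BUFFERS_PER_LINK = 2      # one outgoing + one incoming buffer per link (assumption)
LINK_RATE = 1e9 / 8       # 1 Gbps in bytes/sec (illustrative)
T_BLOCK = 1.0             # a 1-second disturbance, as in the big-tree use case

b_max = BUF_BYTES * BUFFERS_PER_LINK * DEPTH   # aggregate buffers root -> leaf
blocked_send = b_max                           # at most B_max bytes while Blocking
active_send = LINK_RATE * T_BLOCK              # what 1 s of Active sending allows

print(f"B_max ~ {b_max / 1024:.0f} KB")
print(f"sent while Blocking for 1 s: <= {blocked_send / 1024:.0f} KB "
      f"vs ~ {active_send / 2**20:.0f} MB if Active")
```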

(Figure: the system is Active again after B recovers.)

The root unblocks and can send again.

Later, some node E becomes Bad, and the system transitions from Active to Blocking again.

Analysis
A period consists of an Active state and the Blocking state that follows it.
The aggregate throughput over m periods is

    AGGR(m) = Σ_{k=1..m} (A_k + B_k) / Σ_{k=1..m} (t_Ak + t_Bk)

where:
- A_k = data the root sent in the Active state of period k (bits)
- B_k = data the root sent in the Blocking state of period k (bits)
- t_Ak = duration of the Active state in period k (sec)
- t_Bk = duration of the Blocking state in period k (sec)
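The definition translates directly into code. A minimal helper with made-up per-period numbers (the values are illustrative, not measurements from the paper):

```python
def aggregate_throughput(periods):
    """periods: list of (active_bits, blocking_bits, t_active_sec, t_blocking_sec),
    one tuple per period (an Active state plus the Blocking state that follows it)."""
    total_bits = sum(a + b for a, b, _, _ in periods)
    total_time = sum(ta + tb for _, _, ta, tb in periods)
    return total_bits / total_time  # long-run bits per second over the m periods

# Illustrative: ~1 Gbps while Active for 10 s, then ~3 Mbit while Blocking for 1 s.
periods = [(1e10, 3e6, 10.0, 1.0)] * 5
print(f"AGGR(5) ~ {aggregate_throughput(periods) / 1e9:.2f} Gbps")
```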

Analysis: Throughput in Each State
Blocking state, with a best-case assumption (for an upper bound):
- at the beginning of the state the buffers are empty
- hence B_k is bounded by B_max
- the long-term average of t_Bk is determined by the Bad-state duration distribution (see the paper)

Analysis: Throughput in Each State
Active state, with best-case assumptions (for an upper bound):
- the root always sends at the maximal throughput
- flow control is perfect, with no slow start
- the analysis of the state duration t_Ak is complex (see the paper)
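Putting the two best-case assumptions together gives a crude long-run upper bound: per period the root sends at most R * t_Ak + B_max bits, so AGGR is at most (R * E[t_A] + B_max) / (E[t_A] + E[t_B]). This is only a back-of-envelope estimate, not the bound proven in the paper (which handles t_Ak carefully and ignores overlapping disturbances here); the numbers are illustrative.

```python
def throughput_upper_bound(rate_bps, mean_active_sec, mean_blocking_sec, b_max_bits):
    """Crude long-run upper bound under the best-case assumptions above:
    full-rate sending while Active, at most B_max bits while Blocking."""
    bits_per_period = rate_bps * mean_active_sec + b_max_bits
    return bits_per_period / (mean_active_sec + mean_blocking_sec)

# Illustrative numbers for a 10,000-node tree with hourly 1-second disturbances:
# mean time between disturbances anywhere in the tree ~ 3600 s / 10,000 nodes.
bound = throughput_upper_bound(rate_bps=1e9,
                               mean_active_sec=3600 / 10_000,
                               mean_blocking_sec=1.0,
                               b_max_bits=384 * 1024 * 8)
print(f"upper bound ~ {bound / 1e9:.2f} Gbps")
```

With these made-up parameters the estimate lands around a quarter of the link rate, i.e., the same order of degradation as the analytic bound reported in the results below.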

Analysis: Throughput Bound
We prove a bound on AGGR(m). (Figure: the throughput bound as a function of the Good-state duration.)

(Figure: the throughput bound as a function of the Bad-state duration.)

(Figure: the throughput bound as a function of the system size.)

Simulations
We remove the assumption of empty buffers and use real buffer occupancy; at nodes close to the root, buffers are measured to be full half the time.
We still assume perfect flow control with no slow start, so the result is still an upper bound on a real network.
Our simulations cover:
- big trees (10,000s of nodes)
- small trees (10s of nodes)

Use Case 1: Big Trees
A tree spanning an entire data center: 10,000s of nodes, used for control.

Results: Aggregate Throughput Bound
Setup: disturbances every hour, lasting 1 sec; TCP default buffers (64 KB); 10K nodes.
- Analytic bound: ~65% throughput degradation
- Simulations are much worse: ~90% degradation, because buffers are full half the time
- The analytic bound is not so pessimistic when buffers are large, but simulations show we still have a problem

Use Case 2: Small Trees

Average Node in a Data Center
Has many different applications using the network:
- 50% of the time: more than 10 concurrent flows
- 5% of the time: more than 80 concurrent flows
[Greenberg, Hamilton, Jain, Kandula, Kim, Lahiri, Maltz, Patel, Sengupta, 2009]
Consequently we can't use very large buffers, and a switch port might congest.

TCP Time-Outs as Disturbances
- Temporary switch congestion can cause a loss burst on a TCP link
- The ensuing TCP time-out can be modeled as a disturbance
- In default TCP implementations the minimum retransmission time-out is 200 ms, while the network RTT can be ~200 µs
- This is the source of the well-known Incast problem [Nagle, Serenyi, Matthews 2004]
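A quick calculation shows why a 200 ms minimum retransmission time-out is so damaging when the fabric RTT is around 200 µs. The sketch below only uses the figures quoted on this slide and the time-out frequency from the small-tree results that follow; it ignores everything except the dead time itself.

```python
RTO_MIN = 200e-3   # default minimum TCP retransmission time-out (200 ms)
RTT = 200e-6       # typical data-center round-trip time (~200 us)
PERIOD = 5.0       # one time-out every 5 seconds, as in the small-tree results

print(f"one time-out wastes the equivalent of {RTO_MIN / RTT:.0f} RTTs")
idle_fraction = RTO_MIN / PERIOD
print(f"a time-out every {PERIOD:.0f} s idles the link {idle_fraction:.0%} "
      "of the time on its own -- and, through the blocking mechanism above, "
      "it stalls the whole tree, not just one link")
```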

Results: Aggregate Throughput Bound
Setup: time-outs every 5 sec, lasting 200 msec; TCP default buffers (64 KB); 50 nodes.
- Simulations are much worse than the analytic bound
- Again, bigger buffers help only in theory: the analytic bound becomes optimistic for larger buffers, while simulations still show a problem

Conclusions
We explain why supposedly perfect tree-based multicast inevitably collapses in data centers:
- Rare and short disruption events (disturbances) can cause throughput collapse as the system grows
- Frequent disturbances can cause throughput collapse even for small system sizes
Reality is even worse than our analytic bound: disturbances cause buffers to fill up, which is the main reason for the gap between simulation and analysis.

Thank you.