Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat, David Andersen, Greg Ganger, Garth Gibson, Brian Mueller*. Carnegie Mellon University, *Panasas Inc.


Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat, David Andersen, Greg Ganger, Garth Gibson, Brian Mueller* Carnegie Mellon University, *Panasas Inc. Solving TCP Incast (and more) With Aggressive TCP Timeouts PDL Retreat 2009

2 Cluster-based Storage Systems. A client connects to storage servers through a commodity Ethernet switch. Ethernet: 1-10 Gbps; round-trip time (RTT): ~100 µs

3 Cluster-based Storage Systems: Synchronized Reads. The client requests a data block striped across the storage servers; each server returns its share, the Server Request Unit (SRU). Only after all SRUs (1-4) arrive does the client send the next batch of requests.

4 Synchronized Read Setup Test on an Ethernet-based storage cluster Client performs synchronized reads Increase # of servers involved in transfer Data block size is fixed (FS read) TCP used as the data transfer protocol
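The barrier-synchronized read pattern described above can be sketched in a few lines. This is an illustrative simulation only, not code from the talk: the names (`fetch_sru`, `read_block`) and the SRU size are mine, and the network read is a stand-in.

```python
# Illustrative sketch of a barrier-synchronized read: the client must receive
# one Server Request Unit (SRU) from EVERY server before requesting the next
# stripe. Names and the SRU size are hypothetical, not from the talk.
from concurrent.futures import ThreadPoolExecutor

SRU_SIZE = 256 * 1024  # bytes per server per request (assumed value)

def fetch_sru(server_id: int, offset: int) -> bytes:
    """Stand-in for a network read of one SRU from one storage server."""
    return bytes(SRU_SIZE)  # placeholder payload

def read_block(num_servers: int, block_size: int) -> bytes:
    """Read a data block striped across num_servers, one barrier per stripe."""
    data = bytearray()
    offset = 0
    with ThreadPoolExecutor(max_workers=num_servers) as pool:
        while offset < block_size:
            # Issue one request per server, then block until ALL respond:
            # this barrier is why a single timed-out flow stalls the whole
            # client link until its retransmission arrives.
            srus = pool.map(fetch_sru, range(num_servers),
                            [offset] * num_servers)
            for sru in srus:
                data.extend(sru)
            offset += num_servers * SRU_SIZE
    return bytes(data)
```

The barrier is the key property: goodput is gated by the slowest flow in each batch, which is what makes a single coarse-grained timeout so costly.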

5 TCP Throughput Collapse ("TCP Incast"). Cluster setup: 1 Gbps Ethernet, unmodified TCP, S50 switch, 1 MB block size. Cause of throughput collapse: coarse-grained TCP timeouts.

6 Solution: µsecond TCP + no minRTO. As more servers are added, throughput stays high for up to 47 servers; simulation scales to thousands of servers. (Plot: throughput in Mbps, unmodified TCP vs. our solution.)

7 Overview Problem: Coarse-grained TCP timeouts (200ms) too expensive for datacenter applications Solution: microsecond granularity timeouts Improves datacenter app throughput & latency Also safe for use in the wide-area (Internet)

8 Outline. Overview. → Why are TCP timeouts expensive? How do coarse-grained timeouts affect apps? Solution: Microsecond TCP Retransmissions. Is the solution safe?

9 TCP: data-driven loss recovery. Three duplicate ACKs for packet 1 signal that packet 2 is probably lost, so the sender retransmits packet 2 immediately, without waiting for a timeout. In datacenters, data-driven recovery completes within µseconds of a loss.

10 TCP: timeout-driven loss recovery. When too few duplicate ACKs arrive, the sender waits out the Retransmission Timeout (RTO) before retransmitting. Timeouts are expensive (milliseconds to recover after a loss).

11 TCP: loss recovery comparison. Timeout-driven recovery is slow (milliseconds); data-driven recovery is very fast (µseconds) in datacenters.

12 RTO Estimation and Minimum Bound. Jacobson's TCP RTO estimator: RTO_estimated = SRTT + (4 × RTTVAR); actual RTO = max(minRTO, RTO_estimated). The minimum RTO bound (minRTO) is 200 ms, chosen for TCP timer granularity and safety [Allman99]. minRTO (200 ms) >> datacenter RTT (100 µs): one TCP timeout lasts thousands of datacenter RTTs!
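The estimator on this slide can be sketched directly. This is a minimal sketch, assuming the standard RFC 6298 update rules and gains (alpha = 1/8, beta = 1/4) behind the slide's formula; the function names are mine.

```python
# Minimal sketch of the Jacobson/Karels RTO estimator from the slide
# (SRTT/RTTVAR update rules as in RFC 6298), with the minRTO floor applied.

def make_rto_estimator(min_rto: float):
    srtt, rttvar = None, None

    def observe(rtt_sample: float) -> float:
        """Feed one RTT measurement (seconds); return the current RTO."""
        nonlocal srtt, rttvar
        if srtt is None:                         # first sample: SRTT = R, RTTVAR = R/2
            srtt, rttvar = rtt_sample, rtt_sample / 2
        else:
            rttvar = 0.75 * rttvar + 0.25 * abs(srtt - rtt_sample)
            srtt = 0.875 * srtt + 0.125 * rtt_sample
        rto_estimated = srtt + 4 * rttvar        # RTO_estimated = SRTT + 4*RTTVAR
        return max(min_rto, rto_estimated)       # Actual RTO = max(minRTO, est.)

    return observe

# With a 100 microsecond datacenter RTT, the 200 ms minRTO floor dominates:
rto = make_rto_estimator(min_rto=0.200)
print(rto(100e-6))  # floored to 0.2 s even though SRTT + 4*RTTVAR is ~300 us
```

This makes the slide's point concrete: for datacenter RTTs the estimated RTO would be a few hundred microseconds, but the 200 ms floor always wins.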

13 Outline. Overview. Why are TCP timeouts expensive? → How do coarse-grained timeouts affect apps? Solution: Microsecond TCP Retransmissions. Is the solution safe?

14 Single-flow TCP request-response. The client sends a request; if the server's response is dropped, it is resent only after the 200 ms timeout.

15 Apps Sensitive to 200 ms Timeouts. Single-flow request-response (latency-sensitive applications); barrier-synchronized workloads, such as parallel cluster file systems (throughput-intensive) and multi-server search queries (latency-sensitive).

16 Link Idle Time Due To Timeouts. In a synchronized read, suppose the responses from servers 1-3 arrive but server 4's response is dropped: servers 1-3 finish, and the client's link sits idle until server 4's timeout fires and the response is resent. Link idle!

17 Client Link Utilization. (Trace: the client link goes idle for 200 ms while waiting for the timeout-driven retransmission.)

18 200 ms timeouts → throughput collapse. [Nagle04] called this Incast and provided application-level solutions. [FAST08] identified TCP timeouts as the cause of throughput collapse and searched for network-level solutions. (Cluster setup: 1 Gbps Ethernet, 200 ms minRTO, S50 switch, 1 MB block size.)
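With the cluster numbers on this slide (1 Gbps Ethernet, 1 MB blocks, 200 ms minRTO), a back-of-the-envelope sketch shows why a single timeout per block is enough to collapse goodput; the arithmetic below is mine, derived from those stated numbers.

```python
# Back-of-the-envelope arithmetic, using the setup numbers from the talk:
# 1 Gbps Ethernet, 1 MB data blocks, 200 ms minRTO.

LINK_BPS = 1e9        # 1 Gbps link
BLOCK_BYTES = 1e6     # 1 MB synchronized-read block
MIN_RTO = 0.200       # 200 ms minimum retransmission timeout

transfer_time = BLOCK_BYTES * 8 / LINK_BPS          # ~8 ms at line rate
goodput_no_loss = BLOCK_BYTES * 8 / transfer_time   # equals the link rate

# A single timed-out flow stalls the barrier-synchronized read, so the
# block completes only after the extra 200 ms of idle link time:
goodput_one_timeout = BLOCK_BYTES * 8 / (transfer_time + MIN_RTO)

print(round(goodput_one_timeout / 1e6, 1), "Mbps")  # roughly 38.5 Mbps
```

One 200 ms stall on top of an ~8 ms transfer cuts goodput from line rate to a few tens of Mbps, which is the order of magnitude seen in the collapse plots.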

19-22 Results from our previous work [FAST08]. Network-level solutions and their results/conclusions:
Increase switch buffer size → delays throughput collapse, but collapse remains inevitable, and large buffers are expensive.
Alternate TCP implementations (avoiding timeouts, aggressive data-driven recovery, disabling slow start) → throughput collapse is inevitable because timeouts are inevitable (complete window loss is a common case).
Ethernet flow control → limited effectiveness (works only for simple topologies); causes head-of-line blocking.
Reducing minRTO (in simulation) → very effective, but raises implementation concerns (µsecond timers for the OS and TCP) and safety concerns.

23 Outline. Overview. Why are TCP timeouts expensive? How do coarse-grained timeouts affect apps? → Solution: Microsecond TCP Retransmissions and eliminate minRTO. Is the solution safe?

24 µsecond Retransmission Timeouts (RTO). RTO = max(minRTO, f(RTT)). Today minRTO is 200 ms; should it be 200 µs, or 0? RTT is currently tracked in milliseconds; instead, track RTT in µseconds.

25 Lowering minRTO to 1ms Lower minRTO to as low a value as possible without changing timers/TCP impl. Simple one-line change to Linux Uses low-resolution 1ms kernel timers

26 Default minRTO: Throughput Collapse. (Plot: unmodified TCP with 200 ms minRTO.)

27 Lowering minRTO to 1 ms helps, but millisecond retransmissions are not enough. (Plot: unmodified TCP with 200 ms minRTO vs. 1 ms minRTO.)

28 Requirements for µsecond RTO. TCP must track RTT in microseconds: modify internal data structures and reuse the timestamp option. It also needs efficient high-resolution kernel timers: use the HPET for efficient interrupt signaling.

29 Solution: µsecond TCP + no minRTO. (Plot, as servers increase: unmodified TCP with 200 ms minRTO vs. 1 ms minRTO vs. microsecond TCP with no minRTO.) High throughput for up to 47 servers.

30 Simulation: Scaling to thousands Block Size = 80MB, Buffer = 32KB, RTT = 20us

31 Synchronized Retransmissions At Scale. Simultaneous retransmissions → successive timeouts, each doubling the RTO: successive RTO = RTO × 2^backoff.

32 Simulation: Scaling to thousands. Desynchronize retransmissions to scale further: successive RTO = (RTO + rand(0.5) × RTO) × 2^backoff. For use within datacenters only.
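The two backoff rules on these slides can be sketched side by side. A minimal sketch, reading rand(0.5) as a uniform draw from [0, 0.5); the function names are mine:

```python
# Sketch of the two backoff rules from the slides. Standard TCP doubles the
# RTO on each successive timeout; the desynchronized variant adds a random
# factor so flows that timed out together do not retransmit in lockstep.
import random

def successive_rto(base_rto: float, backoff: int) -> float:
    """Standard exponential backoff: RTO * 2^backoff."""
    return base_rto * (2 ** backoff)

def desynchronized_rto(base_rto: float, backoff: int,
                       rng=random.random) -> float:
    """Randomized variant: (RTO + rand(0.5) * RTO) * 2^backoff,
    where rand(0.5) is uniform in [0, 0.5)."""
    return (base_rto + rng() * 0.5 * base_rto) * (2 ** backoff)
```

With the randomized rule, two flows that start from the same base RTO and backoff count fire up to 50% apart, breaking the synchronization that causes repeated collisions at scale.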

33 Outline. Overview. Why are TCP timeouts expensive? The Incast Workload. Solution: Microsecond TCP Retransmissions. → Is the solution safe? Interaction with Delayed-ACK within datacenters; performance in the wide-area.

34 Delayed-ACK (for RTO > 40 ms). Delayed ACK is an optimization to reduce the number of ACKs sent: the receiver ACKs every other packet immediately, but holds the ACK for a lone outstanding packet for up to 40 ms. (Diagrams: a single packet ACKed only after the 40 ms delay; two packets ACKed at once.)

35 µsecond RTO and Delayed-ACK: premature timeouts. If RTO < 40 ms, the sender's RTO fires before the receiver's delayed ACK, so the sender times out and retransmits a packet the receiver already has. If RTO > 40 ms, the delayed ACK (sent after 40 ms) arrives before the timeout.
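The premature-timeout condition on this slide reduces to a single comparison. A tiny sketch; the 40 ms constant is the receiver's delayed-ACK timer from the talk, and the function name is mine:

```python
# Sketch of the premature-timeout condition from the slide: a timeout is
# spurious when the sender's RTO fires while the receiver is still holding
# the delayed ACK for a lone outstanding packet.

DELAYED_ACK_TIMER = 0.040  # 40 ms receiver-side delayed-ACK hold time

def premature_timeout(rto: float) -> bool:
    """True if the RTO fires before the delayed ACK would have been sent."""
    return rto < DELAYED_ACK_TIMER

print(premature_timeout(0.200))   # 200 ms minRTO: ACK arrives first -> False
print(premature_timeout(200e-6))  # ~200 us RTO: spurious retransmit -> True
```

This is why µsecond RTOs interact badly with delayed ACK inside datacenters: every lone-packet exchange trips the condition unless delayed ACK is disabled or the interaction is otherwise handled.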

36 Impact of Delayed-ACK

37 Is it safe for the wide-area? Stability: could we cause congestion collapse? No: wide-area RTOs are in the tens to hundreds of milliseconds, and timeouts make senders rediscover link capacity (slowing the rate of transfer). Performance: do we time out unnecessarily? [Allman99]: reducing minRTO increases the chance of premature timeouts, which slow the transfer rate; today TCP can detect and recover from premature timeouts. Wide-area experiments determine the performance impact.

38 Wide-area Experiment. Do microsecond timeouts harm wide-area throughput? (Setup: BitTorrent seeds and clients, comparing microsecond TCP + no minRTO against standard TCP.)

39 Wide-area Experiment: Results No noticeable difference in throughput

40 Conclusion Microsecond granularity TCP timeouts (with no minRTO) improve datacenter application response time and throughput Safe for wide-area communication Linux patch: Code (simulation, cluster) and scripts: