Efficient Network Protocols for Data-Intensive Worldwide Grids
Seminar at JAIST, Japan, 3 March 2003
T. Kelly, University of Cambridge, UK
S. Ravot, Caltech, USA
J.P. Martin-Flatin, CERN, Switzerland

Outline
- DataTAG project
- Problems with TCP in data-intensive Grids
- Analysis and characterization
- Scalable TCP
- GridDT
- Research directions

The DataTAG Project

Facts About DataTAG
- Budget: EUR ~4M
- Manpower: 24 people funded, 30 people externally funded
- Start date: 1 January 2002
- Duration: 2 years

Three Objectives
- Build a testbed to experiment with massive file transfers across the Atlantic
- Provide high-performance protocols for gigabit networks underlying data-intensive Grids
- Guarantee interoperability between several major Grid projects in Europe and USA

Collaborations
- Testbed: Caltech, Northwestern University, UIC, UMich, StarLight
- Network Research:
  - Europe: GEANT + Dante, University of Cambridge, Forschungszentrum Karlsruhe, VTHD, MB-NG, SURFnet
  - USA: Internet2 + Abilene, SLAC, ANL, FNAL, LBNL, ESnet
  - Canarie
- Grids: DataGrid, GridStart, CrossGrid, iVDGL, PPDG, GriPhyN, GGF

Grids

Grids
[Diagram: interoperability between the DataTAG and iVDGL testbeds. DataTAG side: GIIS edt004.cnaf.infn.it (mds-vo-name=Datatag), a Resource Broker, Computing Element-1 (PBS; worker nodes edt001.cnaf.infn.it and edt002.cnaf.infn.it; gatekeeper grid006f.cnaf.infn.it) and Computing Element-2 (fork/PBS; gatekeeper edt004.cnaf.infn.it). iVDGL side: GIIS giis.ivdgl.org (mds-vo-name=ivdgl-glue and mds-vo-name=glue), with gatekeepers at dc-user.isi.edu, rod.mcs.anl.gov (fork job manager), hamachi.cs.uchicago.edu (Condor), US-CMS (LSF), US-ATLAS (LSF) and the Padova site.]

Grids in DataTAG
- Interoperability between European and U.S. Grids:
  - High Energy Physics (main focus)
  - Bioinformatics
  - Earth Observation
- Grid middleware:
  - DataGrid
  - iVDGL
  - VDT (shared by PPDG and GriPhyN)
- Information modeling (GLUE initiative)
- Software development

Testbed

Objectives
- Provisioning of 2.5 Gbit/s transatlantic circuit between CERN (Geneva) and StarLight (Chicago)
- Dedicated to research (no production traffic)
- Multi-vendor testbed with layer-2 and layer-3 capabilities: Cisco, Juniper, Alcatel, Extreme Networks
- Get hands-on experience with the operation of gigabit networks:
  - Stability and reliability of hardware and software
  - Interoperability

2.5 Gbit/s Transatlantic Circuit
- Operational since 20 August 2002
- Provisioned by Deutsche Telekom
- Circuit initially connected to Cisco 76xx routers (layer 3)
- High-end PC servers at CERN and StarLight:
  - 4x SuperMicro 2.4 GHz dual Xeon, 2 GB memory
  - 8x SuperMicro 2.2 GHz dual Xeon, 1 GB memory
  - 24x SysKonnect SK-9843 GbE cards (2 per PC)
  - Total disk space: 1680 GB
  - Can saturate the circuit with TCP traffic
- Deployment of layer-2 equipment underway
- Upgrade to 10 Gbit/s expected in 2003

R&D Connectivity Between Europe & USA
[Map: European research networks (SURFnet/NL, CERN/CH, SuperJANET4/UK, GARR-B/IT, INRIA and VTHD/FR, ATRIUM) interconnected via GEANT, and North American networks (Abilene, ESnet, MREN, Canarie) reached through New York and StarLight.]

Network Research

Network Research Activities
- Enhance performance of network protocols for massive file transfers (TBytes):
  - Data-transport layer: TCP, UDP, SCTP (the rest of this talk)
- QoS: LBE (Scavenger)
- Bandwidth reservation:
  - AAA-based bandwidth on demand
  - Lightpaths managed as Grid resources
- Monitoring

Problem Statement
- End-user's perspective: using TCP as the data-transport protocol for Grids leads to poor bandwidth utilization in fast WANs (e.g., see demos at iGrid 2002)
- Network protocol designer's perspective: TCP is currently inefficient in high bandwidth*delay networks for two reasons:
  - TCP implementations have not yet been tuned for gigabit WANs
  - TCP was not designed with gigabit WANs in mind

TCP: Implementation Problems
- TCP's current implementation in the Linux kernel is not optimized for gigabit WANs: e.g., the SACK code needs to be rewritten
- Device drivers must be modified: e.g., enable interrupt coalescence to cope with ACK bursts

TCP: Design Problems
- TCP's congestion control algorithm (AIMD) is not suited to gigabit networks
- Due to TCP's limited feedback mechanisms, line errors are interpreted as congestion: bandwidth utilization is reduced when it shouldn't be
- RFC 2581 (which gives the formula for increasing cwnd) "forgot" delayed ACKs
- TCP requires that ACKs be sent at most every second segment, which produces ACK bursts that are difficult for the kernel and NIC to handle

AIMD Algorithm (1/2)
- Van Jacobson, SIGCOMM 1988
- Congestion avoidance algorithm:
  - For each ACK in an RTT without loss, increase: cwnd i+1 = cwnd i + 1/cwnd i (cwnd in segments, i.e., roughly one MSS per RTT)
  - For each window experiencing loss, decrease: cwnd i+1 = cwnd i / 2
- Slow-start algorithm: increase cwnd by 1 MSS per ACK until ssthresh is reached
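To make these dynamics concrete, here is a minimal sketch (not from the original slides; class and parameter names are illustrative) of the per-ACK window updates described above, with cwnd tracked in MSS-sized segments:

```python
# Minimal sketch of TCP's slow-start and AIMD congestion-avoidance updates,
# with cwnd tracked in MSS-sized segments. Illustrative only; real TCP stacks
# work in bytes and add many refinements (SACK, recovery phases, etc.).

class AimdState:
    def __init__(self, ssthresh=64.0):
        self.cwnd = 1.0          # congestion window, in segments
        self.ssthresh = ssthresh # slow-start threshold, in segments

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0               # slow start: +1 MSS per ACK
        else:
            self.cwnd += 1.0 / self.cwnd   # congestion avoidance: ~+1 MSS per RTT

    def on_loss(self):
        self.ssthresh = self.cwnd / 2.0    # multiplicative decrease
        self.cwnd = self.ssthresh

if __name__ == "__main__":
    s = AimdState()
    for _ in range(200):
        s.on_ack()
    print(f"cwnd after 200 ACKs: {s.cwnd:.1f} segments")
    s.on_loss()
    print(f"cwnd after one loss: {s.cwnd:.1f} segments")
```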

AIMD Algorithm (2/2)
- Additive increase: a TCP connection slowly increases its bandwidth utilization in the absence of loss:
  - forever, unless we run out of send/receive buffers or detect a packet loss
  - TCP is greedy: no attempt to reach a stationary state
- Multiplicative decrease: a TCP connection reduces its bandwidth utilization drastically whenever a packet loss is detected:
  - assumption: packet loss means congestion (line errors are negligible)

[Figure: evolution of the congestion window (cwnd) over time, with exponential growth during slow start up to ssthresh, then linear growth during congestion avoidance; curves show the average cwnd over the last 10 samples and the average cwnd over the entire lifetime of the connection (if no loss).]

Disastrous Effect of Packet Loss on TCP in Fast WANs (1/2)
[Figure: TCP NewReno throughput as a function of time; C = 1 Gbit/s, MSS = 1460 bytes.]

Disastrous Effect of Packet Loss on TCP in Fast WANs (2/2)
- Long time to recover from a single loss:
  - TCP should react to congestion rather than packet loss (line errors and transient faults in equipment are no longer negligible)
  - TCP should recover more quickly from a loss
- TCP is more sensitive to packet loss in WANs than in LANs, particularly in fast WANs (where cwnd is large)

Characterization of the Problem (1/2)
- The responsiveness ρ measures how quickly we go back to using the network link at full capacity after experiencing a loss (i.e., the loss recovery time if the loss occurs when bandwidth utilization = network link capacity):
  ρ = C × RTT² / (2 × inc)
  where C is the link capacity, RTT the round-trip time, and inc the window increment per RTT.

Characterization of the Problem (2/2)

Capacity                       | RTT        | # inc   | Responsiveness
9.6 kbit/s (typ. WAN in 1988)  | max: 40 ms |         |
10 Mbit/s (typ. LAN in 1988)   | max: 20 ms | 8       | ~150 ms
100 Mbit/s (typ. LAN in 2003)  | max: 5 ms  | 20      | ~100 ms
622 Mbit/s                     | 120 ms     | ~2,900  | ~6 min
2.5 Gbit/s                     | 120 ms     | ~11,600 | ~23 min
10 Gbit/s                      | 120 ms     | ~46,200 | ~1h 30min

inc size = MSS = 1,460 bytes; # inc = window size in pkts
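A small calculation sketch (mine, not from the slides) applying the responsiveness formula ρ = C × RTT² / (2 × inc) from the previous slide; it reproduces the order of magnitude of the table above, with the exact slide values differing slightly, presumably due to different header/segment-size accounting:

```python
# Responsiveness of standard AIMD: time to climb back to full link utilization
# after one multiplicative decrease, increasing by one MSS-sized increment per RTT.
# rho = C * RTT^2 / (2 * inc), with C in bit/s, RTT in s, inc in bits.

MSS_BITS = 1460 * 8  # increment per RTT = one MSS of 1,460 bytes

def responsiveness(capacity_bps, rtt_s, inc_bits=MSS_BITS):
    return capacity_bps * rtt_s ** 2 / (2 * inc_bits)   # seconds

cases = [
    ("10 Mbit/s LAN", 10e6, 0.020),
    ("100 Mbit/s LAN", 100e6, 0.005),
    ("622 Mbit/s WAN", 622e6, 0.120),
    ("2.5 Gbit/s WAN", 2.5e9, 0.120),
    ("10 Gbit/s WAN", 10e9, 0.120),
]
for name, cap, rtt in cases:
    rho = responsiveness(cap, rtt)
    print(f"{name:16s} recovery ~ {rho:8.1f} s (~{rho/60:.1f} min)")
```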

Congestion vs. Line Errors
[Table: required bit loss rate and required packet loss rate needed to sustain throughputs from 10 Mbit/s up to 10 Gbit/s; RTT = 120 ms, MTU = 1500 bytes, AIMD.]
- At gigabit speed, the loss rate required for packet loss to be ascribed only to congestion is unrealistic with AIMD
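The slide's exact numbers are not reproduced above. As an illustration (my assumption, not the authors' own computation), the standard steady-state AIMD response function, throughput ≈ (MSS/RTT) × sqrt(3/2) / sqrt(p), can be inverted to estimate the packet loss rate p compatible with a given throughput:

```python
# Sketch: invert the standard AIMD steady-state response function
#   throughput ~= (MSS / RTT) * sqrt(3/2) / sqrt(p)
# to estimate the packet loss rate p (and a crude bit loss rate) compatible
# with a target throughput. Illustrative assumption, not the slide's own table.

RTT = 0.120          # seconds
MTU = 1500           # bytes
MSS = MTU - 40       # bytes of TCP payload per packet (no options)

def required_packet_loss_rate(throughput_bps):
    w = throughput_bps * RTT / (MSS * 8)      # average window, in packets
    return 1.5 / (w * w)                      # p = 3/2 / w^2

for bps in (10e6, 100e6, 1e9, 10e9):
    p = required_packet_loss_rate(bps)
    bit_loss = p / (MTU * 8)                  # crude per-bit equivalent
    print(f"{bps/1e6:8.0f} Mbit/s: packet loss <= {p:.1e}, bit loss <= {bit_loss:.1e}")
```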

What Can We Do?
- To achieve higher throughputs over high bandwidth*delay networks, we can:
  - Change AIMD to recover faster in case of packet loss:
    - larger cwnd increment
    - less aggressive decrease algorithm
    - larger MTU (jumbo frames)
  - Set the initial slow-start threshold (ssthresh) to a value better suited to the delay and bandwidth of the TCP connection
  - Avoid losses in end hosts (implementation issue)
- Two proposals: Scalable TCP (Kelly) and GridDT (Ravot)

Scalable TCP: Algorithm
- For cwnd > lwnd, replace AIMD with a new algorithm:
  - for each ACK in an RTT without loss: cwnd i+1 = cwnd i + a
  - for each window experiencing loss: cwnd i+1 = cwnd i - (b × cwnd i)
- Kelly's proposal during internship at CERN: (lwnd, a, b) = (16, 0.01, 0.125)
  - Trade-off between fairness, stability, variance and convergence
- Advantages:
  - Responsiveness improves dramatically for gigabit networks
  - Responsiveness is independent of capacity
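A minimal sketch (illustrative, with hypothetical class and variable names) of the Scalable TCP update rule with the (lwnd, a, b) = (16, 0.01, 0.125) setting above, falling back to standard AIMD below lwnd:

```python
# Sketch of the Scalable TCP window update, cwnd in segments.
# Below lwnd the connection behaves like standard AIMD (legacy behaviour);
# above lwnd it uses the multiplicative-increase / gentler-decrease rule.

LWND, A, B = 16.0, 0.01, 0.125

class ScalableTcp:
    def __init__(self):
        self.cwnd = 1.0

    def on_ack(self):
        if self.cwnd > LWND:
            self.cwnd += A                  # +a per ACK => ~1% growth per RTT
        else:
            self.cwnd += 1.0 / self.cwnd    # legacy AIMD increase

    def on_loss(self):
        if self.cwnd > LWND:
            self.cwnd -= B * self.cwnd      # cut by 12.5% instead of 50%
        else:
            self.cwnd /= 2.0                # legacy AIMD decrease

# Example: number of RTTs to regain the pre-loss window after one loss
# is log(1/(1-b)) / log(1+a), independent of the link capacity.
from math import log
rtts_to_recover = log(1.0 / (1.0 - B)) / log(1.0 + A)
print(f"~{rtts_to_recover:.1f} RTTs to recover, independent of link capacity")
```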

Scalable TCP: lwnd

Scalable TCP: Responsiveness Independent of Capacity

Scalable TCP: Improved Responsiveness
- Responsiveness for RTT = 200 ms and MSS = 1460 bytes:
  - Scalable TCP: 2.7 s
  - TCP NewReno (AIMD): ~3 min at 100 Mbit/s, ~1h 10min at 2.5 Gbit/s, ~4h 45min at 10 Gbit/s
- Patch available for Linux kernel
- For details, see paper and code at:
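The figures above can be checked with a short sketch (mine, not the authors'): NewReno needs about C × RTT / (2 × MSS) round trips to recover, while Scalable TCP needs log(1/(1-b)) / log(1+a) round trips regardless of capacity:

```python
# Check of the responsiveness figures quoted above (RTT = 200 ms, MSS = 1460 bytes).
from math import log

RTT, MSS_BITS = 0.200, 1460 * 8
A, B = 0.01, 0.125

def newreno_recovery(capacity_bps):
    increments = capacity_bps * RTT / (2 * MSS_BITS)   # one MSS regained per RTT
    return increments * RTT                            # seconds

def scalable_recovery():
    return log(1 / (1 - B)) / log(1 + A) * RTT         # seconds, capacity-independent

print(f"Scalable TCP : {scalable_recovery():.1f} s")
for cap in (100e6, 2.5e9, 10e9):
    t = newreno_recovery(cap)
    print(f"NewReno @ {cap/1e9:4.1f} Gbit/s : {t/60:6.1f} min")
```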

Scalable TCP vs. TCP NewReno: Benchmarking
[Chart: bulk throughput vs. number of flows for standard TCP, TCP with a new device driver, and Scalable TCP. C = 2.5 Gbit/s; flows transfer 2 Gbytes and start again, for 1200 s.]

GridDT: Algorithm
- Congestion avoidance algorithm:
  - For each ACK in an RTT without loss, increase: cwnd i+1 = cwnd i + A / cwnd i (an additive increment of A segments per RTT instead of 1)
- By modifying A dynamically according to RTT, guarantee fairness among TCP connections: choose the increments so that A1 / A2 = (RTT1 / RTT2)²
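A small sketch (hypothetical names; my reading of the rule above) of how a GridDT-style sender would pick its additive increment from a measured RTT, using the CERN-StarLight path as the reference; the numbers reproduce the A values used on the next slides:

```python
# Sketch: choose a GridDT-style additive increment A proportional to RTT^2,
# so that connections with different RTTs grow their sending rates comparably.
# Reference values taken from the experiment described on the next slides.

REF_RTT = 0.117   # s, CERN - StarLight path
REF_A   = 3.0     # additive increment (segments per RTT) on the reference path

def griddt_increment(rtt_s):
    return REF_A * (rtt_s / REF_RTT) ** 2

def on_ack(cwnd, rtt_s):
    # additive increase of A segments per RTT, applied per ACK
    return cwnd + griddt_increment(rtt_s) / cwnd

print(f"A for RTT=181 ms: {griddt_increment(0.181):.1f}  (slide uses A1 = 7)")
print(f"A for RTT=117 ms: {griddt_increment(0.117):.1f}  (slide uses A2 = 3)")
```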

TCP NewReno: RTT Bias
[Testbed: Host #1 (CERN to Sunnyvale) and Host #2 (CERN to StarLight), GE-attached hosts behind a GE switch and routers, POS 2.5 Gbit/s and 10 Gbit/s circuits, sharing a 1 Gbit/s bottleneck.]
- Two TCP streams share a 1 Gbit/s bottleneck
- CERN-Sunnyvale: RTT = 181 ms; average throughput over a period of 7000 s = 202 Mbit/s
- CERN-StarLight: RTT = 117 ms; average throughput over a period of 7000 s = 514 Mbit/s
- MTU = 9000 bytes; link utilization = 72%

GridDT Fairer than TCP NewReno
[Same testbed as on the previous slide: Host #1 CERN to Sunnyvale (A1 = 7, RTT = 181 ms) and Host #2 CERN to StarLight (A2 = 3, RTT = 117 ms), sharing a 1 Gbit/s bottleneck.]
- CERN-Sunnyvale: RTT = 181 ms; additive increment A1 = 7; average throughput = 330 Mbit/s
- CERN-StarLight: RTT = 117 ms; additive increment A2 = 3; average throughput = 388 Mbit/s
- MTU = 9000 bytes; link utilization 72%

Measurements with Different MTUs (1/2)
- Mathis advocates the use of larger MTUs
- Experimental environment:
  - Linux
  - Traffic generated by iperf; average throughput over the last 5 seconds
  - Single TCP stream
  - RTT = 119 ms
  - Duration of each test: 2 hours
  - Transfers from Chicago to Geneva
- MTUs:
  - set on the NIC of the PC (ifconfig)
  - POS MTU set to 9180
  - Max MTU with Linux: 9000

Measurements with Different MTUs (2/2)
[Figure: TCP and UDP throughput for different MTUs. TCP max: 990 Mbit/s (MTU = 9000); UDP max: 957 Mbit/s (MTU = 1500).]
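These maxima are close to what Gigabit Ethernet framing overhead allows. A quick sketch of the theoretical ceilings (my assumptions about header sizes: 38 bytes of Ethernet framing per packet, 52-byte TCP/IP headers with timestamps, 28-byte UDP/IP headers):

```python
# Theoretical goodput ceiling on Gigabit Ethernet for a given MTU:
# payload / (MTU + per-frame Ethernet overhead) * 1 Gbit/s.
# Assumed overhead: preamble+SFD 8 + header 14 + FCS 4 + interframe gap 12 = 38 bytes.

LINE_RATE = 1e9          # bit/s
ETH_OVERHEAD = 38        # bytes per frame on the wire

def ceiling(mtu, l3l4_headers):
    payload = mtu - l3l4_headers
    return LINE_RATE * payload / (mtu + ETH_OVERHEAD)

print(f"TCP, MTU 9000 (52-byte headers w/ timestamps): {ceiling(9000, 52)/1e6:.0f} Mbit/s")
print(f"TCP, MTU 1500 (52-byte headers w/ timestamps): {ceiling(1500, 52)/1e6:.0f} Mbit/s")
print(f"UDP, MTU 1500 (28-byte headers):               {ceiling(1500, 28)/1e6:.0f} Mbit/s")
```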

Measurement Tools
- We used several tools to investigate TCP performance issues:
  - Generation of TCP flows: iperf and gensink
  - Capture of packet flows: tcpdump
  - Analysis pipeline: tcpdump, then tcptrace, then xplot
- Some tests performed with SmartBits 2000

Delayed ACKs
- RFC 2581 (the spec defining TCP's AIMD congestion control algorithm) erred: its increase rule, cwnd += SMSS × SMSS / cwnd, is applied on every incoming ACK
- Implicit assumption: one ACK per packet
- Delayed ACKs: one ACK every second packet
- Responsiveness multiplied by two: makes a bad situation worse when RTT and cwnd are large
- Allman is preparing an RFC to fix this
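The fix Allman was drafting became Appropriate Byte Counting (later RFC 3465). As an illustration (not the slide's own content), counting acknowledged bytes rather than ACKs removes the dependence on the receiver's ACK rate:

```python
# Sketch: per-ACK congestion-avoidance increase with and without delayed ACKs.
# The RFC 2581 rule grows cwnd half as fast when the receiver ACKs every
# second segment; byte counting (as in RFC 3465) does not.

SMSS = 1460  # bytes

def rfc2581_increase(cwnd_bytes):
    return cwnd_bytes + SMSS * SMSS / cwnd_bytes      # per ACK, regardless of bytes acked

def byte_counting_increase(cwnd_bytes, bytes_acked):
    return cwnd_bytes + SMSS * bytes_acked / cwnd_bytes

cwnd = 100 * SMSS
# One RTT with delayed ACKs: 50 ACKs, each covering 2 segments.
per_ack = cwnd
byte_cnt = cwnd
for _ in range(50):
    per_ack = rfc2581_increase(per_ack)
    byte_cnt = byte_counting_increase(byte_cnt, 2 * SMSS)
print(f"growth per RTT: RFC 2581 rule ~{(per_ack - cwnd)/SMSS:.2f} MSS, "
      f"byte counting ~{(byte_cnt - cwnd)/SMSS:.2f} MSS")
```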

Related Work
- Sally Floyd, ICIR: Internet-Draft "HighSpeed TCP for Large Congestion Windows"
- Steven Low, Caltech: FAST TCP
- Dina Katabi, MIT: XCP
- Web100 and Net100 projects
- PFLDnet 2003 workshop:

Research Directions
- Compare the performance of different proposals
- More stringent definition of congestion: lose more than 1 packet per RTT
- ACK more than two packets in one go: decrease ACK bursts
- Use SCTP instead of TCP
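As an illustration of the second bullet (my sketch, not a DataTAG design), a sender could count losses within a round trip and back off only when more than one packet is lost, treating an isolated loss as a probable line error:

```python
# Sketch: a stricter congestion signal - only treat a round trip with more than
# one lost packet as congestion; an isolated loss is presumed to be a line error
# and triggers a retransmission but no multiplicative decrease.

class StricterCongestionDetector:
    def __init__(self):
        self.losses_this_rtt = 0

    def on_loss(self):
        self.losses_this_rtt += 1

    def on_rtt_end(self, cwnd):
        congested = self.losses_this_rtt > 1
        self.losses_this_rtt = 0
        return cwnd / 2 if congested else cwnd   # halve only on presumed congestion

d = StricterCongestionDetector()
d.on_loss()                      # single loss in this RTT -> presumed line error
print(d.on_rtt_end(cwnd=1000))   # 1000: window kept
d.on_loss(); d.on_loss()         # two losses in one RTT -> congestion
print(d.on_rtt_end(cwnd=1000))   # 500.0: window halved
```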