10GbE WAN Data Transfers for Science
High Energy/Nuclear Physics (HENP) SIG, Fall 2004 Internet2 Member Meeting
Yang Xia, HEP, Caltech (yxia@caltech.edu)
September 28, 2004, 8:00 AM - 10:00 AM

Agenda
- Introduction
- 10GbE NIC comparisons & contrasts
- Overview of LHCnet
- High TCP performance over wide area networks: problem statement, benchmarks, network architecture and tuning
- Networking enhancements in Linux 2.6 kernels
- Light paths: UltraLight
- FAST TCP protocol development

Introduction
- The High Energy Physics LHC model shows that data at the experiments will be stored at a rate of 100-1500 MBytes/sec throughout the year.
- Many Petabytes per year of stored and processed binary data will be accessed and processed repeatedly by the worldwide collaborations.
- Network backbone capacities are advancing rapidly into the 10 Gbps range, with seamless integration into SONETs.
- Proliferating GbE adapters on commodity desktops create a bottleneck at GbE switch I/O ports.
- More commercial 10GbE adapter products are entering the market, e.g. from Intel, S2io, IBM and Chelsio.

IEEE 802.3ae Port Types

| Port Type   | Wavelength / Medium         | PHY Type | Maximum Reach |
|-------------|-----------------------------|----------|---------------|
| 10GBase-SR  | 850 nm / MMF                | LAN-PHY  | 300 m         |
| 10GBase-LR  | 1310 nm / SMF               | LAN-PHY  | 10 km         |
| 10GBase-ER  | 1550 nm / SMF               | LAN-PHY  | 40 km         |
| 10GBase-SW  | 850 nm / MMF                | WAN-PHY  | 300 m         |
| 10GBase-LW  | 1310 nm / SMF               | WAN-PHY  | 10 km         |
| 10GBase-EW  | 1550 nm / SMF               | WAN-PHY  | 40 km         |
| 10GBase-CX4 | InfiniBand 4x Twinax cables | -        | 15 m          |
| 10GBase-T   | Twisted pair                | -        | 100 m         |

The 10-Gigabit Ethernet distances are defined as 300 meters for short reach (SR), 10 km for long reach (LR), and 40 km for extended reach (ER).

10GbE NICs Comparison (Intel vs. S2io)

Common standard support: 802.3ae, full duplex only; 64-bit/133 MHz PCI-X bus; 1310 nm SMF / 850 nm MMF optics; jumbo frame support.

Major differences in performance features:

| Feature                                  | S2io Adapter           | Intel Adapter                   |
|------------------------------------------|------------------------|---------------------------------|
| PCI-X bus DMA split transaction capacity | 32                     | 2                               |
| Rx frame buffer capacity                 | 64 MB                  | 256 KB                          |
| MTU                                      | 9600 bytes             | 16114 bytes                     |
| IPv4 TCP Large Send Offload              | Max offload size 80 KB | Partial; max offload size 32 KB |

LHCnet Network Setup
- 10 Gbps transatlantic link, extended to Caltech via Abilene and CENIC.
- The NLR wave local loop is a work in progress.
- High-performance end stations (Intel Xeon & Itanium, AMD Opteron) running both Linux and Windows.
- We have added a 64x64 non-SONET all-optical switch from Calient to provision dynamic paths via MonALISA, in the context of UltraLight.

LHCnet Topology: August 2004

[Topology diagram: the Caltech/DoE PoP at StarLight (Chicago) and CERN (Geneva) linked by OC192 circuits (production and R&D); Cisco 7609 and Juniper M10 production routers; Alcatel 7770 and Procket 8801 LHCnet testbed routers; a Glimmerglass photonic switch; Juniper T320s toward American and European partners; internal networks; and Linux farms (20 P4 CPUs, 6 TBytes) attached at 10GE.]

- Services: IPv4 & IPv6; Layer 2 VPN; QoS; scavenger; large MTU (9k); MPLS; link aggregation; monitoring (MonALISA)
- Clean separation of production and R&D traffic based on CCC
- Unique multi-platform / multi-technology optical transatlantic testbed
- Powerful Linux farms equipped with 10 GbE adapters (Intel, S2io)
- Equipment loans and donations; exceptional discounts
- NEW: Photonic switch (Glimmerglass T300) evaluation; circuit ("pure" light path) provisioning

LHCnet Topology: August 2004 (cont'd)
GMPLS-controlled PXCs and IP/MPLS routers can provide dynamic shortest-path setup, as well as path setup based on link priority.
[Images: optical switch matrix; Calient photonic cross-connect switch]

Problem Statement
To get the most bang for the buck on a 10GbE WAN, packet loss is the #1 enemy, because the AIMD algorithm makes TCP respond to it so slowly:
- No loss: cwnd := cwnd + 1/cwnd (per ACK)
- Loss: cwnd := cwnd/2
Fairness: TCP Reno has an MTU and RTT bias, so different MTUs and delays lead to very poor sharing of the bandwidth.
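To put numbers on this slow recovery, here is a minimal sketch of the Reno behaviour implied by the rules above, assuming the 10 Gbps / 180 ms / 9000-byte figures used later in this talk and ignoring slow start and delayed ACKs:

```python
# Minimal sketch of TCP Reno's AIMD recovery after a single loss on a
# 10 Gbps, 180 ms RTT path with a 9000-byte MTU (Geneva-LA figures).
# Congestion avoidance adds roughly one MSS-sized packet to cwnd per RTT.

LINK_BPS = 10e9
RTT = 0.18                                  # round trip time in seconds
MSS = 8960                                  # payload bytes with a 9000-byte MTU

pipe_packets = LINK_BPS * RTT / (8 * MSS)   # cwnd needed to fill the pipe
cwnd = pipe_packets / 2                     # window right after one loss event

rtts = 0
while cwnd < pipe_packets:                  # +1 packet per RTT until full rate
    cwnd += 1
    rtts += 1

print(f"pipe size    : {pipe_packets:,.0f} packets")
print(f"recovery time: {rtts * RTT / 60:.0f} minutes after a single loss")
```

On these numbers a single loss costs roughly 38 minutes of ramp-up, which is why loss avoidance dominates the tuning discussed below.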

Internet2 Land Speed Record (LSR)
- IPv6 record: 4.0 Gbps between Geneva and Phoenix (SC2003)
- IPv4 multi-stream record with Windows: 7.09 Gbps between Caltech and CERN (11k km)
- Single stream: 6.6 Gbps over 16.5k km with Linux
- We have exceeded 100 Petabit-m/sec with both Linux and Windows
- Testing over different WAN distances doesn't seem to change the TCP rate: 7k km (Geneva - Chicago), 11k km (normal Abilene path), 12.5k km (Petit Abilene's Tour), 16.5k km (Grande Abilene's Tour)
[Graph: monitoring of the Abilene traffic in LA]

Internet2 Land Speed Record (cont'd)
[Chart: Single Stream IPv4 category]

Primary Workstation Summary
Sending stations:
- Newisys 4300: 4 x AMD Opteron 248 2.2 GHz, 4 GB PC3200 per processor; up to 5 x 1 GB/s 133 MHz/64-bit PCI-X slots; no FSB bottleneck, since HyperTransport connects the CPUs (up to 19.2 GB/s peak bandwidth per processor); 24-disk SATA RAID system at 1.2 GB/s read/write
- Opteron white box with Tyan S2882 motherboard: 2 x Opteron 2.4 GHz, 2 GB DDR, AMD-8131 chipset; PCI-X bus speed ~940 MB/s
Receiving station:
- HP rx4640: 4 x 1.5 GHz Itanium 2, zx1 chipset, 8 GB memory; SATA disk RAID system

Linux Tuning Parameters
- PCI-X bus parameters (via the setpci command):
  - Maximum Memory Read Byte Count (MMRBC) controls PCI-X transmit burst lengths on the bus; available values are 512 (default), 1024, 2048 and 4096 bytes
  - max_split_trans controls the number of outstanding split transactions; available values are 1, 2, 3, 4
  - latency_timer set to 248
- Interrupt coalescence and interrupt CPU affinity: coalescence batches packet interrupts to cut per-packet overhead, and the affinity setting allows a user to pin the NIC's interrupts to a chosen CPU
- Window size = BW * delay (BDP); too large a window will negatively impact throughput
- 9000-byte MTU and 64 KB TSO
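For reference, the window implied by the BW * delay rule is substantial at these speeds; a quick back-of-the-envelope sketch using the 10 Gbps / 180 ms figures quoted elsewhere in this talk:

```python
# Bandwidth-delay product: the TCP window needed to keep a long fat pipe full.
BW_BPS = 10e9    # 10 Gbps path
RTT_S = 0.18     # 180 ms round trip (Geneva-LA)

bdp_bytes = BW_BPS * RTT_S / 8
print(f"Window needed to fill the pipe: {bdp_bytes / 2**20:.0f} MiB")  # ~215 MiB
```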

Linux Tuning Parameters (cont'd)
Use the sysctl command to modify the /proc parameters that raise the TCP memory limits.
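A minimal sketch of that tuning, writing the standard Linux 2.6 TCP memory tunables directly under /proc/sys (equivalent to sysctl -w); the 256 MB ceiling below is an illustrative value sized for a ~10 Gbps x 180 ms path, not necessarily the setting used in the record runs:

```python
# Sketch: raise Linux 2.6 TCP memory limits for a large bandwidth-delay
# product path by writing the standard /proc/sys tunables (needs root).

TUNABLES = {
    "net/core/rmem_max": "268435456",             # max receive socket buffer
    "net/core/wmem_max": "268435456",             # max send socket buffer
    "net/ipv4/tcp_rmem": "4096 87380 268435456",  # min / default / max (bytes)
    "net/ipv4/tcp_wmem": "4096 65536 268435456",
}

for key, value in TUNABLES.items():
    path = f"/proc/sys/{key}"
    with open(path, "w") as f:
        f.write(value)
    print(f"{path} = {value}")
```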

10GbE Network Testing Tools
In Linux:
- Iperf: version 1.7.0 doesn't work by default on the Itanium 2 machine; workarounds are 1) compile using RedHat's gcc 2.96, or 2) make it single-threaded. The UDP send rate is limited to 2 Gbps because of a 32-bit data type.
- Nttcp: measures the time required to send a preset chunk of data.
- Netperf (v2.1): sends as much data as it can in an interval and collects the results at the end of the test; great for end-to-end latency tests.
- Tcpdump: a challenging task on a 10GbE link.
In Windows:
- NTttcp: uses the Windows APIs
- Microsoft Network Monitoring Tool
- Ethereal
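At heart these tools time bulk memory-to-memory socket transfers; a toy sender/receiver in that spirit is sketched below (illustration only, not a substitute for iperf or nttcp; the port number is a placeholder):

```python
# Toy memory-to-memory throughput test in the spirit of iperf/nttcp (sketch).
# Run with no arguments on the receiving host, or with the receiver's
# hostname as the only argument on the sending host.
import socket
import sys
import time

PORT, CHUNK, SECONDS = 5001, 1 << 20, 10       # placeholder port, 1 MiB writes

def receiver():
    srv = socket.socket()
    srv.bind(("", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    total = 0
    while data := conn.recv(CHUNK):            # drain until the sender closes
        total += len(data)
    print(f"received {total / 1e9:.2f} GB")

def sender(host):
    s = socket.socket()
    s.connect((host, PORT))
    buf, sent, start = b"\0" * CHUNK, 0, time.time()
    while time.time() - start < SECONDS:       # blast data for a fixed interval
        s.sendall(buf)
        sent += len(buf)
    elapsed = time.time() - start
    s.close()
    print(f"{sent * 8 / elapsed / 1e9:.2f} Gbit/s memory-to-memory")

if __name__ == "__main__":
    receiver() if len(sys.argv) == 1 else sender(sys.argv[1])
```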

Networking Enhancements in Linux 2.6
The 2.6.x Linux kernel has made many improvements to overall system performance, scalability and hardware driver support:
- Improved POSIX threading support (NGPT and NPTL)
- Support for AMD 64-bit (x86-64) and improved NUMA support
- TCP Segmentation Offload (TSO)
- Network interrupt mitigation: improved handling of high network loads
- Zero-copy networking and NFS: one system call with sendfile(sd, fd, &offset, nbytes)
- NFS version 4
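The zero-copy sendfile() path mentioned above is also reachable from user space; a minimal sketch of using the same system call (via Python's os.sendfile wrapper) follows, where the host, port and file name are placeholders:

```python
# Sketch: zero-copy file transmit with the sendfile() system call that the
# Linux 2.6 zero-copy networking path is built around.
import os
import socket

HOST, PORT, FILENAME = "receiver.example.org", 5001, "/data/testfile"  # placeholders

sock = socket.socket()
sock.connect((HOST, PORT))

with open(FILENAME, "rb") as f:
    offset, remaining = 0, os.fstat(f.fileno()).st_size
    while remaining > 0:
        # The kernel copies file pages straight to the socket; no user-space buffer.
        sent = os.sendfile(sock.fileno(), f.fileno(), offset, remaining)
        offset += sent
        remaining -= sent

sock.close()
```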

TCP Segmentation Offload
- Must have hardware support in the NIC; it is a sender-only option.
- It allows the TCP layer to hand a larger-than-normal segment of data, e.g. 64 KB, to the driver and then the NIC; the NIC then fragments the large packet into smaller (<= MTU) packets.
- TSO is disabled in multiple places in the TCP functions: when SACKs are received (in tcp_sacktag_write_queue) and when a packet is retransmitted (in tcp_retransmit_skb). However, TSO is never re-enabled in the current 2.6.8 kernel when the TCP state changes back to normal (TCP_CA_Open), so the kernel needs a patch to re-enable TSO.
- Benefits: TSO can reduce CPU overhead by 10-15% and increase TCP responsiveness:

  p = (C * RTT^2) / (2 * MSS)

  where p is the time to recover to full rate, C the capacity of the link, RTT the round-trip time, and MSS the maximum segment size.

Responsiveness with and w/o TSO

| Path                              | BW      | RTT (s) | MTU  | Responsiveness | With Delayed ACK |
|-----------------------------------|---------|---------|------|----------------|------------------|
| Geneva-LA (normal path)           | 10 Gbps | 0.18    | 9000 | 38 min         | 75 min           |
| Geneva-LA (long path)             | 10 Gbps | 0.252   | 9000 | 74 min         | 148 min          |
| Geneva-LA (long path w/ 64KB TSO) | 10 Gbps | 0.252   | 9000 | 10 min         | 20 min           |
| LAN                               | 10 Gbps | 0.001   | 1500 | 428 ms         | 856 ms           |
| Geneva-Chicago                    | 10 Gbps | 0.12    | 1500 | 103 min        | 205 min          |
|                                   | 1 Gbps  |         |      | 23 min         | 46 min           |
|                                   | 1 Gbps  |         |      | 45 min         | 91 min           |
|                                   | 1 Gbps  |         |      | 1 min          | 2 min            |
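The 10 Gbps figures follow from the formula on the previous slide; a quick cross-check (a sketch that assumes MSS = MTU minus 40 bytes of IP/TCP headers, treats the 64 KB TSO size as the effective segment size, and doubles the result for delayed ACKs):

```python
# Recompute the responsiveness table from p = C * RTT^2 / (2 * MSS).
# Assumption: MSS = MTU - 40 bytes of IP/TCP headers; with TSO the 64 KB
# offload size is used as the effective segment size.

def recovery_minutes(capacity_bps, rtt_s, mtu_bytes, tso_bytes=None):
    mss_bits = 8 * (tso_bytes if tso_bytes else mtu_bytes - 40)
    return capacity_bps * rtt_s ** 2 / (2 * mss_bits) / 60

cases = [
    ("Geneva-LA (normal path)",    10e9, 0.180, 9000, None),
    ("Geneva-LA (long path)",      10e9, 0.252, 9000, None),
    ("Geneva-LA (long path, TSO)", 10e9, 0.252, 9000, 64 * 1024),
    ("Geneva-Chicago",             10e9, 0.120, 1500, None),
]
for name, c, rtt, mtu, tso in cases:
    p = recovery_minutes(c, rtt, mtu, tso)
    print(f"{name:30s} {p:6.1f} min  ({2 * p:6.1f} min with delayed ACK)")
# Prints roughly 38, 74, 10 and 103 minutes, matching the rows above.
```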

The Transfer over 10GbE WAN
- With a 9000-byte MTU and a stock Linux 2.6.7 kernel: LAN: 7.5 Gb/s; WAN: 7.4 Gb/s (the receiver is CPU bound). We've reached the PCI-X bus limit with a single NIC.
- Using bonding (802.3ad) of multiple interfaces we could bypass the PCI-X bus limitation, in the multiple-streams case only: LAN: 11.1 Gb/s; WAN: ??? (a.k.a. doomsday for Abilene)

UltraLight: Developing Advanced Network Services for Data Intensive HEP Applications
- UltraLight (funded by NSF ITR): a next-generation hybrid packet- and circuit-switched network infrastructure.
  - Packet-switched: cost-effective solution; requires ultrascale protocols to share 10G links efficiently and fairly
  - Circuit-switched: scheduled or sudden "overflow" demands handled by provisioning additional wavelengths; use path diversity, e.g. across the US, Atlantic, Canada, ...
- Extend and augment existing grid computing infrastructures (currently focused on CPU/storage) to include the network as an integral component
- Using MonALISA to monitor and manage global systems
- Partners: Caltech, UF, FIU, UMich, SLAC, FNAL, MIT/Haystack; CERN, Internet2, NLR, CENIC; TransLight, UKLight, NetherLight; UvA, UCL, KEK, Taiwan
- Strong support from Cisco and Level(3)

“Ultrascale” protocol development: FAST TCP
- Based on TCP Vegas
- Uses end-to-end delay and loss to dynamically adjust the congestion window
- Achieves any desired fairness, expressed by a utility function
- Very high utilization (99% in theory)
- Comparison with other TCP variants (e.g. BIC, Westwood+) on a single flow over an OC-192 path (capacity 9.5 Gbps, 264 ms round-trip latency): Linux TCP 30% bandwidth use, Linux Westwood+ 40%, Linux BIC TCP 50%, FAST 79%
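As a reference for the delay-based adjustment mentioned above, the window update published in the FAST TCP papers scales the window by baseRTT/RTT and adds a target backlog alpha; a minimal sketch follows, where alpha, gamma and the RTT values are illustrative, not the parameters used in deployment:

```python
# Sketch of the FAST TCP window update (per the FAST TCP papers):
#     w <- min( 2w, (1 - gamma) * w + gamma * (baseRTT/RTT * w + alpha) )
# alpha is the target number of the flow's packets queued in the network.

def fast_update(w, base_rtt, rtt, alpha=200, gamma=0.5):
    return min(2 * w, (1 - gamma) * w + gamma * (base_rtt / rtt * w + alpha))

base = 0.180                               # propagation RTT, e.g. Geneva-LA

# No queueing delay observed: the window ramps up by gamma * alpha per update.
w = 1000.0
print(fast_update(w, base, base))          # 1100.0

# Queueing delay corresponding to exactly alpha buffered packets: equilibrium.
w = 5000.0
rtt_eq = base * w / (w - 200)              # RTT at which w * (1 - base/RTT) == alpha
print(round(fast_update(w, base, rtt_eq))) # 5000: window holds steady

# More than alpha packets queued: the window backs off without any loss.
print(round(fast_update(w, base, rtt_eq * 1.05)))   # < 5000
```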

Summary and Future Approaches
- A full TCP offload engine will be available for 10GbE in the near future; there is a trade-off between maximizing CPU utilization and ensuring data integrity.
- Develop and provide the cost-effective transatlantic network infrastructure and services required to meet the HEP community's needs:
  - a highly reliable, high-performance production network, with rapidly increasing capacity and a diverse workload
  - an advanced research backbone for network and Grid developments, including operations and management assisted by agent-based software (MonALISA)
- Concentrate on reliable Terabyte-scale file transfers, to drive development of an effective Grid-based Computing Model for LHC data analysis.