Download presentation
Presentation is loading. Please wait.
1
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 1 Bandwidth Challenges or "How fast can we really drive a Network?" Richard Hughes-Jones The University of Manchester www.hep.man.ac.uk/~rich/ then “Talks” www.hep.man.ac.uk/~rich/
2
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 2 uSCINet uCaltech Booth uThe BWC at the SLAC Booth Collaboration at SC|05 uESLEA Boston Ltd. & Peta-Cache Sun uStorcloud
3
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 3 Bandwidth Challenge wins Hat Trick uThe maximum aggregate bandwidth was >151 Gbits/s 130 DVD movies in a minute serve 10,000 MPEG2 HDTV movies in real-time u22 10Gigabit Ethernet waves Caltech & SLAC/FERMI booths In 2 hours transferred 95.37 TByte 24 hours moved ~ 475 TBytes uShowed real-time particle event analysis uSLAC Fermi UK Booth: 1 10 Gbit Ethernet to UK NLR&UKLight: transatlantic HEP disk to disk VLBI streaming 2 10 Gbit Links to SALC: rootd low-latency file access application for clusters Fibre Channel StorCloud 4 10 Gbit links to Fermi Dcache data transfers SC2004 101 Gbit/s In to booth Out of booth
4
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 4 ESLEA and UKLight u6 * 1 Gbit transatlantic Ethernet layer 2 paths UKLight + NLR uDisk-to-disk transfers with bbcp Seattle to UK Set TCP buffer and application to give ~850Mbit/s One stream of data 840-620 Mbit/s uStream UDP VLBI data UK to Seattle 620 Mbit/s Reverse TCP
5
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 5 SLAC 10 Gigabit Ethernet u2 Lightpaths: Routed over ESnet Layer 2 over Ultra Science Net u6 Sun V20Z systems per λ udcache remote disk data access 100 processes per node Node sends or receives One data stream 20-30 Mbit/s uUsed Netweion NICs & Chelsio TOE uData also sent to StorCloud using fibre channel links uTraffic on the 10 GE link for 2 nodes: 3-4 Gbit per nodes 8.5-9Gbit on Trunk
6
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 6 LightPath Topologies
7
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 7 Switched LightPaths [1] uLightpaths are a fixed point to point path or circuit Optical links (with FEC) have a BER 10 -16 i.e. a packet loss rate 10 -12 or 1 loss in about 160 days In SJ5 LightPaths known as Bandwidth Channels uHost to host Lightpath One Application No congestion Advanced TCP stacks for large Delay Bandwidth Products uLab to Lab Lightpaths Many application share Classic congestion points TCP stream sharing and recovery Advanced TCP stacks
8
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 8 Switched LightPaths [2] uSome applications suffer when using TCP may prefer to use UDP DCCP XCP … uE.g. With e-VLBI the data wave-front gets distorted and correlation fails uUser Controlled Lightpaths Grid Scheduling of CPUs & Network Many Application flows No congestion on each path Lightweight framing possible
9
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 9 Network Transport Layer Issues
10
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 10 Problem #1 Packet Loss Is it important ?
11
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 11 TCP (Reno) – Packet loss and Time uTCP takes packet loss as indication of congestion uTime for TCP to recover its throughput from 1 lost 1500 byte packet given by: u for rtt of ~200 ms @ 1 Gbit/s: UK 6 ms Europe 25 ms USA 150 ms 1.6 s 26 s 28min 2 min
12
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 12 Packet Loss and new TCP Stacks uTCP Response Function Throughput vs Loss Rate – further to right: faster recovery UKLight London-Chicago-London rtt 177 ms Drop Packets in Kernel 2.6.6 Kernel Agreement with theory good Some new stacks good at high loss rates
13
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 13 High Throughput Demonstrations Geneva rtt 128 ms man03lon01 2.5 Gbit SDH MB-NG Core 1 GEth Cisco GSR Cisco 7609 Cisco 7609 Chicago Dual Zeon 2.2 GHz Send data with TCP Drop Packets Monitor TCP with Web100
14
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 14 TCP Throughput – DataTAG uDifferent TCP stacks tested on the DataTAG Network u rtt 128 ms uDrop 1 in 10 6 uHigh-Speed Rapid recovery uScalable Very fast recovery uStandard Recovery would take ~ 20 mins
15
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 15 Problem #2 Is TCP fair? look at Round Trip Times & Max Transfer Unit
16
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 16 MTU and Fairness uTwo TCP streams share a 1 Gb/s bottleneck uRTT=117 ms uMTU = 3000 Bytes ; Avg. throughput over a period of 7000s = 243 Mb/s uMTU = 9000 Bytes; Avg. throughput over a period of 7000s = 464 Mb/s uLink utilization : 70,7 % Starlight (Chi) CERN (GVA) RR GbE Switch Host #1 POS 2.5 Gbps 1 GE Host #2 Host #1 Host #2 1 GE Bottleneck Sylvain Ravot DataTag 2003
17
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 17 RTT and FairnessSunnyvale Starlight (Chi) CERN (GVA) RR GbE Switch Host #1 POS 2.5 Gb/s 1 GE Host #2 Host #1 Host #2 1 GE Bottleneck R POS 10 Gb/s R 10GE uTwo TCP streams share a 1 Gb/s bottleneck uCERN Sunnyvale RTT=181ms ; Avg. throughput over a period of 7000s = 202Mb/s uCERN Starlight RTT=117ms; Avg. throughput over a period of 7000s = 514Mb/s uMTU = 9000 bytes uLink utilization = 71,6 % Sylvain Ravot DataTag 2003
18
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 18 Problem #n Do TCP Flows Share the Bandwidth ?
19
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 19 uChose 3 paths from SLAC (California) Caltech (10ms), Univ Florida (80ms), CERN (180ms) uUsed iperf/TCP and UDT/UDP to generate traffic uEach run was 16 minutes, in 7 regions Test of TCP Sharing: Methodology (1Gbit/s) Ping 1/s Iperf or UDT ICMP/ping traffic TCP/UDP bottleneck iperf SLAC CERN 2 mins 4 mins Les Cottrell & RHJ PFLDnet 2005
20
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 20 uLow performance on fast long distance paths AIMD (add a=1 pkt to cwnd / RTT, decrease cwnd by factor b=0.5 in congestion) Net effect: recovers slowly, does not effectively use available bandwidth, so poor throughput Unequal sharing TCP Reno single stream Congestion has a dramatic effect Recovery is slow Increase recovery rate SLAC to CERN RTT increases when achieves best throughput Les Cottrell & RHJ PFLDnet 2005 Remaining flows do not take up slack when flow removed
21
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 21 Hamilton TCP uOne of the best performers Throughput is high Big effects on RTT when achieves best throughput Flows share equally Appears to need >1 flow to achieve best throughput Two flows share equally SLAC-CERN > 2 flows appears less stable
22
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 22 Problem #n+1 To SACK or not to SACK ?
23
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 23 The SACK Algorithm uSACK Rational Non-continuous blocks of data can be ACKed Sender transmits just lost packets Helps when multiple packets lost in one TCP window uThe SACK Processing is inefficient for large bandwidth delay products Sender write queue (linked list) walked for: Each SACK block To mark lost packets To re-transmit Processing so long input Q becomes full Get Timeouts SACKs updated rtt 150ms Standard SACKs rtt 150ms HS-TCP Dell 1650 2.8 GHz PCI-X 133 MHz Intel Pro/1000 Doug Leith Yee-Ting Li
24
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 24 What does the User & Application make of this? The view from the Application
25
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 25 SC2004: Disk-Disk bbftp ubbftp file transfer program uses TCP/IP uUKLight: Path:- London-Chicago-London; PCs:- Supermicro +3Ware RAID0 uMTU 1500 bytes; Socket size 22 Mbytes; rtt 177ms; SACK off uMove a 2 Gbyte file uWeb100 plots: uStandard TCP uAverage 825 Mbit/s u(bbcp: 670 Mbit/s) uScalable TCP uAverage 875 Mbit/s u(bbcp: 701 Mbit/s ~4.5s of overhead) uDisk-TCP-Disk at 1Gbit/s is here!
26
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 26 SC|05 HEP: Moving data with bbcp uWhat is the end-host doing with your network protocol? uLook at the PCI-X u3Ware 9000 controller RAID0 u1 Gbit Ethernet link u2.4 GHz dual Xeon u~660 Mbit/s PCI-X bus with RAID Controller PCI-X bus with Ethernet NIC Read from disk for 44 ms every 100ms Write to Network for 72 ms uPower needed in the end hosts uCareful Application design
27
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 27 VLBI: TCP Stack & CPU Load uReal User problem! uEnd host TCP flow at 960 Mbit/s with rtt 1 ms falls to 770 Mbit/s when rtt 15 ms u1.2GHz PIII TCP iperf rtt 1 ms 960 Mbit/s 94.7% kernel mode idle 1.5 % TCP iperf rtt 15 ms 777 Mbit/s 96.3% kernel mode idle 0.05 % CPULoad with nice priority Throughput falls as priority increases No Loss No Timeouts uNot enough CPU power u2.8 GHz Xeon rtt 1 ms TCP iperf 916 Mbit/s 43% kernel mode idle 55% CPULoad with nice priority Throughput constant as priority increases No Loss No Timeouts uKernel mode includes TCP stack and Ethernet driver
28
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 28 ATLAS Remote Computing: Application Protocol u Event Request EFD requests an event from SFI SFI replies with the event ~2Mbytes u Processing of event u Return of computation EF asks SFO for buffer space SFO sends OK EF transfers results of the computation u tcpmon - instrumented TCP request-response program emulates the Event Filter EFD to SFI communication. Send OK Send event data Request event ●●● Request Buffer Send processed event Process event Time Request-Response time (Histogram) Event Filter Daemon EFD SFI and SFO
29
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 29 tcpmon: TCP Activity Manc-CERN Req-Resp Web100 hooks for TCP status Round trip time 20 ms 64 byte Request green 1 Mbyte Response blue TCP in slow start 1st event takes 19 rtt or ~ 380 ms TCP Congestion window gets re-set on each Request TCP stack RFC 2581 & RFC 2861 reduction of Cwnd after inactivity Even after 10s, each response takes 13 rtt or ~260 ms Transfer achievable throughput 120 Mbit/s Event rate very low Application not happy!
30
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 30 tcpmon: TCP Activity Manc-cern Req-Resp no cwnd reduction Round trip time 20 ms 64 byte Request green 1 Mbyte Response blue TCP starts in slow start 1 st event takes 19 rtt or ~ 380 ms TCP Congestion window grows nicely Response takes 2 rtt after ~1.5s Rate ~10/s (with 50ms wait) Transfer achievable throughput grows to 800 Mbit/s Data transferred WHEN the application requires the data 3 Round Trips 2 Round Trips
31
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 31 uObjective: demo 1 Gbit/s aggregate bandwidth between RAL and 4 Tier 2 sites uRAL has SuperJANET4 and UKLight links: uRAL Capped firewall traffic at 800 Mbit/s uSuperJANET Sites: Glasgow Manchester Oxford QMWL uUKLight Site: Lancaster uMany concurrent transfers from RAL to each of the Tier 2 sites HEP: Service Challenge 4 ~700 Mbit UKLight Peak 680 Mbit SJ4 uApplications able to sustain high rates. uSuperJANET5, UKLight & new access links very timely
32
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 32 uWell, you CAN fill the Links at 1 and 10 Gbit/s – but its not THAT simple uPacket loss is a killer for TCP Check on campus links & equipment, and access links to backbones Users need to collaborate with the Campus Network Teams Dante Pert uNew stacks are stable and give better response & performance Still need to set the TCP buffer sizes ! Check other kernel settings e.g. window-scale maximum Watch for “TCP Stack implementation Enhancements” uTCP tries to be fair Large MTU has an advantage Short distances, small RTT, have an advantage uTCP does not share bandwidth well with other streams uThe End Hosts themselves Plenty of CPU power is required for the TCP/IP stack as well and the application Packets can be lost in the IP stack due to lack of processing power / CPU scheduling Interaction between HW, protocol processing, and disk sub-system complex uApplication architecture & implementation are also important The TCP protocol dynamics strongly influence the behaviour of the Application. Summary & Conclusions
33
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 33 More Information Some URLs uUKLight web site: http://www.uklight.ac.uk uESLEA web site: http://www.eslea.uklight.ac.uk uMB-NG project web site: http://www.mb-ng.net/ uDataTAG project web site: http://www.datatag.org/ uUDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net uMotherboard and NIC Tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/ “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards” FGCS Special issue 2004 http:// www.hep.man.ac.uk/~rich/ uTCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html uTCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks” Journal of Grid Computing 2004 uPFLDnet http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/ uDante PERT http://www.geant2.net/server/show/nav.00d00h002
34
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 34 Any Questions?
35
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 35 Backup Slides
36
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 36 uLectures, tutorials etc. on TCP/IP: www.nv.cc.va.us/home/joney/tcp_ip.htm www.cs.pdx.edu/~jrb/tcpip.lectures.html www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm www.cis.ohio-state.edu/htbin/rfc/rfc1180.html www.jbmelectronics.com/tcp.htm uEncylopaedia http://www.freesoft.org/CIE/index.htm uTCP/IP Resources www.private.org.il/tcpip_rl.html uUnderstanding IP addresses http://www.3com.com/solutions/en_US/ncs/501302.html uConfiguring TCP (RFC 1122) ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt uAssigned protocols, ports etc (RFC 1010) http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols http://www.es.net/pub/rfcs/rfc1010.txt More Information Some URLs 2
37
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 37 Packet Loss with new TCP Stacks uTCP Response Function Throughput vs Loss Rate – further to right: faster recovery Drop packets in kernel MB-NG rtt 6ms DataTAG rtt 120 ms
38
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 38 uDrop 1 in 25,000 urtt 6.2 ms uRecover in 1.6 s High Performance TCP – MB-NG StandardHighSpeed Scalable
39
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 39 Fast uAs well as packet loss, FAST uses RTT to detect congestion RTT is very stable: σ(RTT) ~ 9ms vs 37±0.14ms for the others SLAC-CERN Big drops in throughput which take several seconds to recover from 2 nd flow never gets equal share of bandwidth
40
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 40 SACK … uLook into what’s happening at the algorithmic level with web100: uStrange hiccups in cwnd only correlation is SACK arrivals Scalable TCP on MB-NG with 200mbit/sec CBR Background Yee-Ting Li
41
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 41 10 Gigabit Ethernet: Tuning PCI-X u16080 byte packets every 200 µs uIntel PRO/10GbE LR Adapter uPCI-X bus occupancy vs mmrbc Measured times Times based on PCI-X times from the logic analyser Expected throughput ~7 Gbit/s Measured 5.7 Gbit/s mmrbc 1024 bytes mmrbc 2048 bytes mmrbc 4096 bytes 5.7Gbit/s mmrbc 512 bytes CSR Access PCI-X Sequence Data Transfer Interrupt & CSR Update
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.