End-user systems: NICs, MotherBoards, Disks, TCP Stacks & Applications. Richard Hughes-Jones, NeSC Edinburgh, 20th-21st June 2005.


Slide 1: End-user systems: NICs, MotherBoards, Disks, TCP Stacks & Applications. Richard Hughes-Jones, 20th-21st June 2005, NeSC Edinburgh. http://grid.ucl.ac.uk/nfnn.html
Work reported is from many networking collaborations. MB-NG

Slide 2: Network Performance Issues
End System Issues:
- Network Interface Card and driver, and their configuration
- Processor speed
- Motherboard configuration, bus speed and capability
- Disk system
- TCP and its configuration
- Operating system and its configuration
Network Infrastructure Issues:
- Obsolete network equipment
- Configured bandwidth restrictions
- Topology
- Security restrictions (e.g., firewalls)
- Sub-optimal routing
- Transport protocols
- Network capacity and the influence of others: congestion (group, campus, access links), many, many TCP connections, mice and elephants on the path

Slide 3: Methodology used in testing NICs & Motherboards

Slide 4: Latency Measurements
- UDP/IP packets sent between back-to-back systems: processed in a similar manner to TCP/IP, but not subject to flow control & congestion avoidance algorithms. Used the UDPmon test program (a request-response sketch follows below).
- Latency: round-trip times measured using request-response UDP frames.
- Latency as a function of frame size: the slope is given by mem-mem copy(s) + PCI + Gigabit Ethernet + PCI + mem-mem copy(s); the intercept indicates processing times + hardware latencies.
- Histograms of 'singleton' measurements.
- Tells us about: behaviour of the IP stack, the way the hardware operates, interrupt coalescence.
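UDPmon itself is not reproduced in the transcript; the request-response idea can be sketched as below. This is a minimal illustration (the peer address, port and repetition count are hypothetical), not the UDPmon code: a real tool also uses finer-grained timers and histograms the results.

/* Minimal sketch of a UDP request-response latency probe (not UDPmon itself).
 * Assumes a UDP echo service is already listening on the peer host and port
 * below; address, port and repetition count are hypothetical example values. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(int argc, char **argv)
{
    const char *host = (argc > 1) ? argv[1] : "192.168.0.2";   /* back-to-back peer  */
    int size = (argc > 2) ? atoi(argv[2]) : 64;                /* request frame size */
    char buf[65536];
    struct sockaddr_in peer;
    struct timeval t1, t2;
    int s, i;

    memset(buf, 0, sizeof(buf));
    s = socket(AF_INET, SOCK_DGRAM, 0);
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(14196);                 /* hypothetical echo port */
    peer.sin_addr.s_addr = inet_addr(host);
    connect(s, (struct sockaddr *)&peer, sizeof(peer));

    for (i = 0; i < 100; i++) {                   /* 100 'singleton' round trips */
        gettimeofday(&t1, NULL);
        send(s, buf, size, 0);                    /* request of 'size' bytes */
        recv(s, buf, sizeof(buf), 0);             /* wait for the matching response */
        gettimeofday(&t2, NULL);
        double us = (t2.tv_sec - t1.tv_sec) * 1e6 + (t2.tv_usec - t1.tv_usec);
        printf("%d %d %.1f\n", i, size, us);      /* frame size and rtt, for histogramming */
    }
    close(s);
    return 0;
}

The latency slope and intercept quoted on the following slides come from repeating this for a range of frame sizes and fitting round-trip time against frame size.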

Slide 5: Throughput Measurements
- UDP throughput: send a controlled stream of UDP frames spaced at regular intervals; the parameters are the frame size (n bytes), the number of packets and the wait time between frames (a paced-sender sketch follows below).
- Protocol between sender and receiver: zero remote statistics (OK/done), send the data frames at regular intervals, signal the end of test (OK/done), then fetch the remote statistics.
- Statistics returned by the receiver: number received, number lost + loss pattern, number out-of-order, CPU load & number of interrupts, 1-way delay.
- The sender records the time to send; the receiver records the time to receive and the inter-packet time (histogram).
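The pacing loop is the heart of the sender; a minimal sketch is below. The control handshake, remote statistics and 1-way delay of the real tool are omitted, and the peer address, port, frame size and spacing are example values. usleep() is far too coarse for spacings of order 10-20 µs, hence the busy-wait.

/* Sketch of the paced UDP stream used for the throughput measurement. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    int n_bytes = 1400;            /* payload per frame                   */
    int n_pkts  = 10000;           /* number of frames in the stream      */
    double wait_us = 15.0;         /* inter-frame spacing in microseconds */
    char buf[1500];
    struct sockaddr_in peer;
    int s, i;

    s = socket(AF_INET, SOCK_DGRAM, 0);
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(14197);                     /* hypothetical port */
    peer.sin_addr.s_addr = inet_addr("192.168.0.2");  /* hypothetical peer */
    connect(s, (struct sockaddr *)&peer, sizeof(peer));

    memset(buf, 0, sizeof(buf));
    double start = now_us();
    for (i = 0; i < n_pkts; i++) {
        memcpy(buf, &i, sizeof(i));   /* sequence number: lets the receiver
                                         count loss and re-ordering        */
        send(s, buf, n_bytes, 0);
        /* busy-wait until the next send slot */
        while (now_us() < start + (i + 1) * wait_us)
            ;
    }
    double elapsed = now_us() - start;
    printf("sent %d x %d byte frames in %.0f us -> offered %.1f Mbit/s\n",
           n_pkts, n_bytes, elapsed, n_pkts * n_bytes * 8.0 / elapsed);
    close(s);
    return 0;
}

The receiver side simply counts sequence numbers and timestamps arrivals; the throughput curves on the following slides come from repeating such a run while scanning the wait time.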

Slide 6: PCI Bus & Gigabit Ethernet Activity
- PCI activity measured with a logic analyser: PCI probe cards in the sending PC, a Gigabit Ethernet fibre probe card, and PCI probe cards in the receiving PC, all feeding the logic analyser display.
- Possible bottlenecks: the CPU, memory, chipset, NIC and PCI bus in each host, plus the Gigabit Ethernet link between them.

Slide 7: "Server Quality" Motherboards
- SuperMicro P4DP8-2G (P4DP6)
- Dual Xeon
- 400/522 MHz front-side bus
- 6 PCI / PCI-X slots
- 4 independent PCI buses: 64 bit 66 MHz PCI, 100 MHz PCI-X, 133 MHz PCI-X
- Dual Gigabit Ethernet
- Adaptec AIC-7899W dual-channel SCSI
- UDMA/100 bus master EIDE channels: data transfer rates of 100 MB/s burst

Slide 8: "Server Quality" Motherboards
- Boston/Supermicro H8DAR
- Two dual-core Opterons
- 200 MHz DDR memory, theoretical bandwidth 6.4 Gbit
- HyperTransport
- 2 independent PCI buses: 133 MHz PCI-X
- 2 Gigabit Ethernet
- SATA
- (PCI-e)

Slide 9: NIC & Motherboard Evaluations

Slide 10: SuperMicro 370DLE: Latency: SysKonnect
- Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; RedHat 7.1, kernel 2.4.14
- PCI 32 bit 33 MHz: latency small (62 µs) & well behaved; latency slope 0.0286 µs/byte; expect 0.0232 µs/byte (PCI 0.00758 + GigE 0.008 + PCI 0.00758, see the check below)
- PCI 64 bit 66 MHz: latency small (56 µs) & well behaved; latency slope 0.0231 µs/byte; expect 0.0118 µs/byte (PCI 0.00188 + GigE 0.008 + PCI 0.00188)
- Possible extra data moves?
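The "expect" figures are just the per-byte transfer costs of the two PCI crossings and the GigE wire time added in series (the memory copies are neglected). As a check: 1 Gbit/s is 0.008 µs/byte, a 32 bit 33 MHz PCI bus (~132 MByte/s) is ~0.0076 µs/byte, and a 64 bit 66 MHz bus (~528 MByte/s) is ~0.0019 µs/byte, so

\[
0.00758 + 0.008 + 0.00758 \approx 0.0232~\mu\mathrm{s/byte},
\qquad
0.00188 + 0.008 + 0.00188 \approx 0.0118~\mu\mathrm{s/byte}.
\]

Note that the measured 64 bit slope (0.0231 µs/byte) is roughly twice the expected value, consistent with the "possible extra data moves" question on the slide.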

Slide 11: SuperMicro 370DLE: Throughput: SysKonnect
- Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; RedHat 7.1, kernel 2.4.14
- PCI 32 bit 33 MHz: max throughput 584 Mbit/s; no packet loss for spacing > 18 µs
- PCI 64 bit 66 MHz: max throughput 720 Mbit/s; no packet loss for spacing > 17 µs; packet loss during the bandwidth drop; 95-100% kernel-mode CPU

Slide 12: SuperMicro 370DLE: PCI: SysKonnect
- Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; PCI 64 bit 66 MHz; RedHat 7.1, kernel 2.4.14
- 1400 bytes sent, wait 100 µs: the trace shows the send setup, the send PCI transfer, the packet on the Ethernet fibre, then the receive PCI transfer.
- ~8 µs for a send or receive on the PCI bus; stack & application overhead ~10 µs per node; ~36 µs overall.

Slide 13: Signals on the PCI bus
- 1472 byte packets every 15 µs, Intel Pro/1000; the traces show send setup, send PCI, data transfers, receive PCI and receive transfers.
- PCI 64 bit 33 MHz: 82% bus usage
- PCI 64 bit 66 MHz: 65% bus usage; the data transfers take half as long

Slide 14: SuperMicro 370DLE: PCI: SysKonnect
- Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; PCI 64 bit 66 MHz; RedHat 7.1, kernel 2.4.14
- 1400 bytes sent, wait 20 µs: frames on the Ethernet fibre at 20 µs spacing.
- 1400 bytes sent, wait 10 µs: frames are back-to-back; the 800 MHz CPU can drive at line speed but cannot go any faster!

Slide 15: SuperMicro 370DLE: Throughput: Intel Pro/1000
- Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; PCI 64 bit 66 MHz; RedHat 7.1, kernel 2.4.14
- Max throughput 910 Mbit/s; no packet loss for spacing > 12 µs; packet loss during the bandwidth drop
- CPU load 65-90% for spacing < 13 µs

Slide 16: SuperMicro 370DLE: PCI: Intel Pro/1000
- Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; PCI 64 bit 66 MHz; RedHat 7.1, kernel 2.4.14
- Request-response traces demonstrate interrupt coalescence: no processing directly after each transfer.

Slide 17: SuperMicro P4DP6: Latency: Intel Pro/1000
- Motherboard: SuperMicro P4DP6; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI 64 bit 66 MHz; RedHat 7.2, kernel 2.4.19
- Some steps in the latency curve: slope 0.009 µs/byte, slope of the flat sections 0.0146 µs/byte, expect 0.0118 µs/byte
- Latency histograms: no variation with packet size, FWHM 1.5 µs; confirms the timing is reliable

Slide 18: SuperMicro P4DP6: Throughput: Intel Pro/1000
- Motherboard: SuperMicro P4DP6; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI 64 bit 66 MHz; RedHat 7.2, kernel 2.4.19
- Max throughput 950 Mbit/s; no packet loss (averages are misleading)
- CPU utilisation on the receiving PC was ~25% for packets larger than 1000 bytes, 30-40% for smaller packets

Slide 19: SuperMicro P4DP6: PCI: Intel Pro/1000
- Motherboard: SuperMicro P4DP6; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI 64 bit 66 MHz; RedHat 7.2, kernel 2.4.19
- 1400 bytes sent, wait 12 µs: ~5.14 µs on the send PCI bus, PCI bus ~68% occupancy; ~3 µs on the PCI bus for the data receive
- CSR access inserts PCI STOPs; the NIC takes ~1 µs per CSR access: the CPU is faster than the NIC! A similar effect is seen with the SysKonnect NIC.

Slide 20: SuperMicro P4DP8-G2: Throughput: Intel onboard
- Motherboard: SuperMicro P4DP8-G2; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.4 GHz; PCI-X 64 bit; RedHat 7.3, kernel 2.4.19
- Max throughput 995 Mbit/s; no packet loss (averages are misleading)
- 20% CPU utilisation on the receiver for packets > 1000 bytes; 30% for smaller packets

Slide 21: Interrupt Coalescence: Throughput: Intel Pro/1000 on 370DLE

Slide 22: Interrupt Coalescence Investigations
- Set kernel parameters for socket buffer size = rtt × BW (a sketch of this sizing follows below)
- TCP mem-mem lon2-man1
- Tx 64, Tx-abs 64; Rx 0, Rx-abs 128: 820-980 Mbit/s ± 50 Mbit/s
- Tx 64, Tx-abs 64; Rx 20, Rx-abs 128: 937-940 Mbit/s ± 1.5 Mbit/s
- Tx 64, Tx-abs 64; Rx 80, Rx-abs 128: 937-939 Mbit/s ± 1 Mbit/s
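The rtt × BW rule is the bandwidth-delay product. A minimal sketch of applying it per socket is below; the 6.2 ms rtt and 1 Gbit/s rate are example values taken from the MB-NG path quoted later in the talk, and the kernel-wide limits (net.core.rmem_max / net.core.wmem_max) must already be at least this large or the request is silently clipped.

/* Sketch: size the TCP socket buffers to the bandwidth-delay product (rtt x BW). */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    double rtt_s    = 0.0062;     /* e.g. 6.2 ms, the MB-NG rtt quoted on slide 34 */
    double rate_bps = 1.0e9;      /* Gigabit Ethernet line rate                    */
    int bdp = (int)(rtt_s * rate_bps / 8.0);   /* bytes: ~775 kbyte for this path  */

    int s = socket(AF_INET, SOCK_STREAM, 0);
    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp));
    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp));

    int granted;
    socklen_t len = sizeof(granted);
    getsockopt(s, SOL_SOCKET, SO_SNDBUF, &granted, &len);
    printf("requested %d bytes per buffer, kernel reports %d\n", bdp, granted);
    return 0;
}

Requesting too small a buffer caps the throughput at roughly buffer/rtt regardless of the link speed.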

Slide 23: Supermarket Motherboards and other "Challenges"

Slide 24: Tyan Tiger S2466N
- Motherboard: Tyan Tiger S2466N; PCI: 1 × 64 bit 66 MHz; CPU: Athlon MP2000+; chipset: AMD-760 MPX
- The 3Ware controller forces the PCI bus to 33 MHz
- Tyan to MB-NG SuperMicro network mem-mem: 619 Mbit/s

Slide 25: IBM das: Throughput: Intel Pro/1000
- Motherboard: IBM das; chipset: ServerWorks CNB20LE; CPU: dual PIII 1 GHz; PCI 64 bit 33 MHz; RedHat 7.1, kernel 2.4.14
- Max throughput 930 Mbit/s; no packet loss for spacing > 12 µs; clean behaviour; packet loss during the bandwidth drop
- 1400 bytes sent, 11 µs spacing: signals clean; ~9.3 µs on the send PCI bus, PCI bus ~82% occupancy; ~5.9 µs on the PCI bus for the data receive

Slide 26: Network switch limits behaviour
- End-to-end UDP packets from udpmon: only 700 Mbit/s throughput and lots of packet loss
- The packet-loss distribution shows that the throughput is being limited

Slide 27: 10 Gigabit Ethernet

Slide 28: 10 Gigabit Ethernet: UDP Throughput
- A 1500 byte MTU gives ~2 Gbit/s; used a 16144 byte MTU, max user length 16080
- DataTAG Supermicro PCs: dual 2.2 GHz Xeon CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes: wire-rate throughput of 2.9 Gbit/s
- CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64 bit Itanium CPU, FSB 400 MHz, PCI-X mmrbc 4096 bytes: wire rate of 5.7 Gbit/s
- SLAC Dell PCs: dual 3.0 GHz Xeon CPU, FSB 533 MHz, PCI-X mmrbc 4096 bytes: wire rate of 5.4 Gbit/s

Slide 29: 10 Gigabit Ethernet: Tuning PCI-X
- 16080 byte packets every 200 µs, Intel PRO/10GbE LR adapter
- PCI-X bus occupancy vs mmrbc: measured times compared with times based on the PCI-X transactions from the logic analyser, for mmrbc 512, 1024, 2048 and 4096 bytes
- At mmrbc 4096 bytes the expected throughput is ~7 Gbit/s; measured 5.7 Gbit/s (see the bus-rate check below)
- Each PCI-X sequence: CSR access, data transfer, interrupt & CSR update
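As a rough cross-check of the ~7 Gbit/s expectation (the precise figure comes from the logic-analyser PCI-X timings, not from this arithmetic): a 64 bit, 133 MHz PCI-X bus can move at most

\[
64~\mathrm{bit} \times 133 \times 10^{6}~\mathrm{s^{-1}} \approx 8.5~\mathrm{Gbit/s},
\]

and even with 4096 byte bursts (mmrbc) the per-burst address phases, CSR accesses and interrupt handling consume part of that, leaving of order 7 Gbit/s usable on the bus and 5.7 Gbit/s measured on the wire.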

Slide 30: Different TCP Stacks

Slide 31: Investigation of new TCP Stacks
- The AIMD algorithm, standard TCP (Reno): for each ACK in an RTT without loss, cwnd -> cwnd + a/cwnd (additive increase, a = 1); for each window experiencing loss, cwnd -> cwnd - b*cwnd (multiplicative decrease, b = 1/2).
- HighSpeed TCP: a and b vary depending on the current cwnd, using a table. a increases more rapidly with larger cwnd, so the sender returns to the 'optimal' cwnd size sooner for the network path; b decreases less aggressively, so the cwnd, and hence the throughput, does not fall as far.
- Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd; a = 1/100 (the increase is greater than TCP Reno) and b = 1/8 (the decrease on loss is less than TCP Reno). Scalable over any link speed. (A toy sketch of these update rules follows below.)
- Fast TCP: uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput.
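A toy sketch of the per-ACK and per-loss updates just described, in units of segments; it illustrates the arithmetic only and is not the kernel code. HighSpeed TCP is omitted because its a(cwnd), b(cwnd) table is long, and the 10,000-segment window is an arbitrary example.

/* Toy illustration of the cwnd updates described above, in segments.
 * Reno:     +a/cwnd per ACK (a = 1, i.e. +1 segment per rtt);  *(1/2) on loss.
 * Scalable: +0.01 per ACK (a = 1/100);                         *(7/8) on loss.
 * HighSpeed TCP uses a table of a(cwnd), b(cwnd) and is omitted here. */
#include <stdio.h>

enum stack { RENO, SCALABLE };

static double on_ack(enum stack s, double cwnd)
{
    if (s == RENO)
        return cwnd + 1.0 / cwnd;          /* additive increase                */
    return cwnd + 0.01;                    /* fixed per-ACK increment          */
}

static double on_loss(enum stack s, double cwnd)
{
    if (s == RENO)
        return cwnd * 0.5;                 /* multiplicative decrease, b = 1/2 */
    return cwnd * (1.0 - 1.0 / 8.0);       /* gentler decrease, b = 1/8        */
}

int main(void)
{
    enum stack stacks[2] = { RENO, SCALABLE };
    const char *name[2]  = { "Reno", "Scalable" };
    double target = 10000.0;               /* segments: an arbitrary large window */

    for (int i = 0; i < 2; i++) {
        double cwnd = on_loss(stacks[i], target);   /* one loss at full window    */
        long acks = 0;
        while (cwnd < target) {                     /* ACKs needed to win it back */
            cwnd = on_ack(stacks[i], cwnd);
            acks++;
        }
        printf("%-8s recovers %.0f segments after %ld ACKs\n", name[i], target, acks);
    }
    return 0;
}

Running it shows the point of the comparison on the next slide: after a single loss at the same window, Reno needs millions of ACKs (thousands of round trips) to climb back, while Scalable needs far fewer.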

Slide 32: Comparison of TCP Stacks
- TCP response function: throughput vs loss rate; the further to the right, the faster the recovery
- Packets are dropped in the kernel; MB-NG rtt 6 ms, DataTAG rtt 120 ms

Slide 33: High Throughput Demonstrations
- Manchester (Geneva), host man03, and London (Chicago), host lon01: dual Xeon 2.2 GHz machines connected by 1 GEth to Cisco 7609 routers at each end, with a Cisco GSR on the 2.5 Gbit SDH MB-NG core between them
- Send data with TCP, drop packets, monitor TCP with Web100

Slide 34: High Performance TCP – MB-NG
- Drop 1 in 25,000; rtt 6.2 ms; recover in 1.6 s
- Stacks shown: Standard, HighSpeed, Scalable

Slide 35: High Performance TCP – DataTAG
- Different TCP stacks tested on the DataTAG network; rtt 128 ms; drop 1 in 10^6
- HighSpeed: rapid recovery. Scalable: very fast recovery. Standard: recovery would take ~20 mins

Slide 36: Disk and RAID Sub-Systems (based on a talk at NFNN 2004 by Andrew Sansum, Tier 1 Manager at RAL)

Slide 37: Reliability and Operation
- The performance of a system that is broken is 0.0 MB/s!
- When staff are fixing broken servers they cannot spend their time optimising performance.
- With a lot of systems, anything that can break will break: the typical ATA failure rate is 2-3% per annum at RAL, CERN and CASPUR; that is 30-45 drives per annum on the RAL Tier1, i.e. one per week. One failure for every 10^15 bits read; MTBF 400K hours.
- RAID arrays may not handle block re-maps satisfactorily, or take so long that the system has given up. RAID5 is not perfect protection: there is a finite chance of 2 disks failing!
- SCSI interconnects corrupt data or give CRC errors; filesystems corrupt for no apparent cause (EXT2 errors); silent data corruption happens (checksums are vital).
- Fixing data corruption can take a huge amount of time (> 1 week), and data loss is possible.

Slide 38: You need to Benchmark – but how?
- Andrew's golden rule: benchmark, benchmark, benchmark.
- Choose a benchmark that matches your application (we use Bonnie++ and IOZONE): single stream of sequential I/O, random disk accesses, multi-stream (30 threads is more typical).
- Watch out for caching effects (use large files and small memory).
- Use a standard protocol that gives reproducible results, and document what you did for future reference.
- Stick with the same benchmark suite/protocol as long as possible to gain familiarity; it is much easier to spot significant changes.

Slide 39: Hard Disk Performance
- Factors affecting performance: seek time (rotation speed is a big component), transfer rate, cache size, retry count (vibration).
- Watch out for unusual disk behaviours: write/verify (the drive verifies writes for first writes), drive spin-down, etc.
- StorageReview is often worth reading: http://www.storagereview.com/

Slide 40: Impact of Drive Transfer Rate
- A slower drive may not cause significant sequential performance degradation in a RAID array: this 5400 rpm drive was 25% slower than the 7200 rpm one (plot of throughput vs number of threads).

Slide 41: Disk Parameters (from www.storagereview.com)
- Distribution of seek time: mean 14.1 ms, versus a published seek time of 9 ms with the rotational latency taken off.
- Head position & transfer rate: the outside of the disk has more blocks per track, so it is faster; 58 MB/s falling to 35 MB/s with position on the disk, a 40% effect.

Slide 42: Drive Interfaces (from http://www.maxtor.com/)

Slide 43: RAID levels
Choose an appropriate RAID level:
- RAID 0 (stripe) is fastest but has no redundancy.
- RAID 1 (mirror) gives single-disk performance.
- RAID 2, 3 and 4 are not very common/supported; sometimes worth testing with your application.
- RAID 5: read is almost as fast as RAID 0, write is substantially slower.
- RAID 6: extra parity info, good for unreliable drives, but can have a considerable impact on performance. Rare but potentially very useful for large IDE disk arrays.
- RAID 50 & RAID 10: RAID 0 across 2 (or more) controllers.

Slide 44: Kernel Versions
- Kernel versions can make an enormous difference; it is not unusual for I/O to be badly broken in Linux.
- In the 2.4 series there were Virtual Memory subsystem problems: only 2.4.5, some 2.4.9 and >2.4.17 are OK.
- RedHat Enterprise 3 seems badly broken (although the most recent RHE kernel may be better).
- The 2.6 kernel contains new functionality and may be better (although first tests show little difference).
- Always check after an upgrade that performance remains good.

Slide 45: Kernel Tuning
- Tunable kernel parameters can improve I/O performance, but it depends what you are trying to achieve.
- IRQ balancing: issues with IRQ handling on some Xeon chipset/kernel combinations (all IRQs handled on CPU zero); hyperthreading.
- I/O schedulers: can tune for latency by minimising seeks. /sbin/elvtune sets the number of sectors of writes before reads are allowed; the choice of scheduler (elevator=xx) orders I/O to minimise seeks and merges adjacent requests.
- VM tuning on 2.4: bdflush and readahead (a single thread helped):
  sysctl -w vm.max-readahead=512
  sysctl -w vm.min-readahead=512
  sysctl -w vm.bdflush="10 500 0 0 3000 10 20 0"
- On 2.6 the readahead command is (default 256; see the arithmetic below):
  blockdev --setra 16384 /dev/sda
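For scale, blockdev --setra counts 512-byte sectors (an assumption worth checking against the man page on your system), so

\[
16384 \times 512~\mathrm{bytes} = 8~\mathrm{MiB}
\quad\text{of readahead, versus}\quad
256 \times 512~\mathrm{bytes} = 128~\mathrm{KiB}
\]

at the default, which is why the setting has such a visible effect on the sequential-read results that follow.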

Slide 46: Example Effect of VM Tune-up
- Plot of read performance (MB/s vs number of threads) on RedHat 7.3, before and after the tuning. (g.prassas@rl.ac.uk)

Slide 47: IOZONE Tests: Read Performance
- File system ext2 vs ext3 (ext3 write and ext3 read curves): a 40% effect due to journal behaviour
- The journal means write data + write journal

Slide 48: RAID Controller Performance (Stephen Dallison)
- RAID5 (striped with redundancy): 3Ware 7506 parallel 66 MHz, 3Ware 7505 parallel 33 MHz, 3Ware 8506 Serial ATA 66 MHz, ICP Serial ATA 33/66 MHz
- Tested on a dual 2.2 GHz Xeon Supermicro P4DP8-G2 motherboard; disks: Maxtor 160 GB 7200 rpm, 8 MB cache
- Read-ahead kernel tuning: /proc/sys/vm/max-readahead = 512
- Rates for the same PC with RAID0 (striped): read 1040 Mbit/s, write 800 Mbit/s
- Plots: disk to memory read speeds, memory to disk write speeds

Slide 49: SC2004 RAID Controller Performance
- Supermicro X5DPE-G2 motherboards loaned from Boston Ltd.
- Dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory
- 3Ware 8506-8 controller on a 133 MHz PCI-X bus, configured as RAID0 with a 64 kbyte stripe size
- Six 74.3 GByte Western Digital Raptor WD740 SATA disks: 75 Mbyte/s disk to buffer, 150 Mbyte/s buffer to memory
- Scientific Linux with the 2.6.6 kernel + altAIMD patch (Yee) + packet loss patch
- Read-ahead kernel tuning: /sbin/blockdev --setra 16384 /dev/sda
- RAID0 (striped), 2 GByte file: read 1500 Mbit/s, write 1725 Mbit/s
- Plots: disk to memory read speeds, memory to disk write speeds

Slide 50: Applications: Throughput for Real Users

Slide 51: Topology of the MB-NG Network
- Diagram: the Manchester domain (man01, man02, man03), the UCL domain (lon01, lon02, lon03) and the RAL domain (ral01, ral02), joined by Cisco 7609 edge and boundary routers across the MB-NG core and the UKERNA development network; HW RAID is marked at the end hosts.
- Key: Gigabit Ethernet, 2.5 Gbit POS access, MPLS, admin domains.

Slide 52: Gridftp Throughput + Web100
- RAID0 disks: 960 Mbit/s read, 800 Mbit/s write
- Throughput (Mbit/s): see alternating 600/800 Mbit/s and zero; data rate 520 Mbit/s
- Cwnd smooth; no duplicate ACKs / send stalls / timeouts

Slide 53: http data transfers: HighSpeed TCP
- Same hardware; bulk data moved by web servers: an Apache web server out of the box, with a prototype client based on the curl http library
- 1 Mbyte TCP buffers; 2 GByte file
- Throughput ~720 Mbit/s; cwnd shows some variation; no duplicate ACKs / send stalls / timeouts

Slide 54: bbftp: What else is going on?
- Scalable TCP; BaBar + SuperJANET and SuperMicro + SuperJANET
- Congestion window and duplicate ACK traces
- Variation not TCP related? Disk speed / bus transfer / application

Slide 55: SC2004 UKLight Topology
- Diagram elements: MB-NG 7600 OSR (Manchester), ULCC UKLight, UCL HEP, UCL network, K2 and Ci switches, Chicago Starlight, Amsterdam, SC2004 Caltech booth (UltraLight IP, Caltech 7600), SLAC booth (Cisco 6509), UKLight 10G with four 1GE channels, Surfnet/EuroLink 10G with two 1GE channels, NLR lambda NLR-PITT-STAR-10GE-16.

Slide 56: SC2004 Disk-Disk bbftp
- The bbftp file transfer program uses TCP/IP
- UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
- MTU 1500 bytes; socket size 22 Mbytes (see the check below); rtt 177 ms; SACK off
- Move a 2 GByte file; Web100 plots
- Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
- Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
- Disk-TCP-disk at 1 Gbit/s
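The 22 Mbyte socket size is the rtt × BW rule of slide 22 applied to this path; for a 1 Gbit/s line rate and an rtt of 177 ms,

\[
\frac{10^{9}~\mathrm{bit/s} \times 0.177~\mathrm{s}}{8~\mathrm{bit/byte}} \approx 22~\mathrm{MByte}.
\]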

Slide 57: Network & Disk Interactions (work in progress)
- Hosts: Supermicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0; six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64 kbyte stripe size
- Measure memory to RAID0 transfer rates with & without UDP traffic:
- Disk write: 1735 Mbit/s
- Disk write + 1500 MTU UDP: 1218 Mbit/s, a drop of 30%
- Disk write + 9000 MTU UDP: 1400 Mbit/s, a drop of 19%
- (Plot: % CPU in kernel mode)

Slide 58: Remote Processing Farms: Manc-CERN
- Round-trip time 20 ms; 64 byte request (green), 1 Mbyte response (blue)
- TCP in slow start: the 1st event takes 19 rtt, or ~380 ms
- The TCP congestion window gets reset on each request: a TCP stack implementation detail that reduces cwnd after inactivity
- Even after 10 s, each response takes 13 rtt, or ~260 ms
- Achievable transfer throughput 120 Mbit/s

Slide 59: Summary, Conclusions & Thanks (MB-NG)
- The host is critical: motherboards, NICs, RAID controllers and disks matter.
- The NICs should be well designed: a NIC should use 64 bit 133 MHz PCI-X (66 MHz PCI can be OK), and NIC/drivers need efficient CSR access, clean buffer management and good interrupt handling.
- Worry about the CPU-memory bandwidth as well as the PCI bandwidth: data crosses the memory bus at least 3 times. Separate the data transfers; use motherboards with multiple 64 bit PCI-X buses.
- Test disk systems with a representative load. Choose a modern high-throughput RAID controller; consider SW RAID0 of RAID5 HW controllers.
- Plenty of CPU power is needed for sustained 1 Gbit/s transfers and disk access.
- Packet loss is a killer: check on campus links & equipment, and on the access links to backbones.
- The new stacks are stable and give better response & performance; you still need to set the TCP buffer sizes & other kernel settings, e.g. window scaling.
- Application architecture & implementation is important.
- The interaction between hardware, protocol processing and the disk sub-system is complex.

Slide 60: More Information: Some URLs
- UKLight web site: http://www.uklight.ac.uk
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit + write-up: http://www.hep.man.ac.uk/~rich/ (Software & Tools)
- Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt and http://datatag.web.cern.ch/datatag/pfldnet2003/ ; "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS special issue 2004, http://www.hep.man.ac.uk/~rich/ (Publications)
- TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html and http://www.psc.edu/networking/perf_tune.html
- TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004, http://www.hep.man.ac.uk/~rich/ (Publications)
- PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
- Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
- Real-Time Remote Farm site: http://csr.phys.ualberta.ca/real-time
- Disk information: http://www.storagereview.com/

Slide 61: Any Questions?

Slide 62: Backup Slides

Slide 63: TCP (Reno) – Details
- The time for TCP to recover its throughput from 1 lost packet is given by the formula on the slide (reconstructed below); for an rtt of ~200 ms: 2 min
- Reference rtts: UK 6 ms, Europe 20 ms, USA 150 ms
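The formula itself did not survive in this transcript. For standard Reno behaviour as defined on slide 31 (halve cwnd on the loss, then regain one segment per rtt), the usual expression, assumed here, is

\[
\tau \;\approx\; \frac{\mathrm{cwnd}}{2}\,\mathrm{rtt}
\;=\; \frac{C\,\mathrm{rtt}^{2}}{2\,\mathrm{MSS}},
\]

where C is the sending rate being recovered. As a cross-check against slide 34: with C = 1 Gbit/s, MSS = 1500 bytes and rtt = 6.2 ms, cwnd is about 517 segments and τ ≈ 258 × 6.2 ms ≈ 1.6 s, the recovery time quoted there; at rtt = 128 ms the same expression gives roughly ten to twenty minutes, as on slide 35.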

Slide 64: Packet Loss and new TCP Stacks
- TCP response function measured on UKLight, London-Chicago-London, rtt 180 ms, 2.6.6 kernel
- Agreement with theory is good

Slide 65: Test of TCP Sharing: Methodology (1 Gbit/s) (Les Cottrell, PFLDnet 2005)
- Chose 3 paths from SLAC (California): Caltech (10 ms), Univ. Florida (80 ms), CERN (180 ms)
- Used iperf/TCP and UDT/UDP to generate traffic across the TCP/UDP bottleneck, with 1/s ICMP/ping traffic alongside
- Each run was 16 minutes, in 7 regions (2 min and 4 min intervals shown in the diagram)

Slide 66: TCP Reno single stream (Les Cottrell, PFLDnet 2005)
- Low performance on fast long-distance paths: AIMD (add a = 1 packet to cwnd per RTT; decrease cwnd by factor b = 0.5 on congestion)
- Net effect: recovers slowly, does not effectively use the available bandwidth, so poor throughput; unequal sharing
- SLAC to CERN: congestion has a dramatic effect and recovery is slow; the RTT increases when the flow achieves its best throughput
- Remaining flows do not take up the slack when a flow is removed. Increase the recovery rate.

Slide 67: Average Transfer Rates (Mbit/s)

App      TCP stack   SuperMicro on MB-NG   SuperMicro on SuperJANET4   BaBar on SuperJANET4   SC2004 on UKLight
Iperf    Standard    940                   350-370                     425                    940
Iperf    HighSpeed   940                   510                         570                    940
Iperf    Scalable    940                   580-650                     605                    940
bbcp     Standard    434                   290-310                     290                    -
bbcp     HighSpeed   435                   385                         360                    -
bbcp     Scalable    432                   400-430                     380                    -
bbftp    Standard    400-410               325                         320                    825
bbftp    HighSpeed   370-390               380                         -                      -
bbftp    Scalable    430                   345-532                     380                    875
apache   Standard    425                   260                         300-360                -
apache   HighSpeed   430                   370                         315                    -
apache   Scalable    428                   400                         317                    -
Gridftp  Standard    405                   240                         -                      -
Gridftp  HighSpeed   320                   -                           -                      -
Gridftp  Scalable    335                   -                           -                      -

Annotations on the slide: "New stacks give more throughput"; "Rate decreases".

Slide 68: iperf Throughput + Web100
- SuperMicro on the MB-NG network, HighSpeed TCP: line speed 940 Mbit/s; DupACKs? < 10 (expect ~400)
- BaBar on the production network, standard TCP: 425 Mbit/s; DupACKs 350-400 and re-transmits

Slide 69: Applications: Throughput Mbit/s
- HighSpeed TCP, 2 GByte file, RAID5, SuperMicro + SuperJANET
- bbcp, bbftp, Apache, Gridftp
- Previous work used RAID0 (not disk limited)

Slide 70: bbcp & GridFTP Throughput
- 2 GByte file transferred, RAID5 (4 disks), Manchester to RAL
- bbcp: mean 710 Mbit/s
- GridFTP: see many zeros; means ~710 and ~620 Mbit/s
- DataTAG altAIMD kernel in BaBar & ATLAS

Slide 71: tcpdump of the T/DAQ dataflow at SFI (1)
- CERN-Manchester, 1.0 Mbyte event; the remote EFD requests an event from the SFI
- Incoming event request, followed by an ACK; the SFI sends the event as N 1448 byte packets, limited by the TCP receive buffer
- Time 115 ms (~4 events/s); as TCP ACKs arrive, more data is sent

