Slide 1: Using TCP/IP on High Bandwidth Long Distance Optical Networks
Real Applications on Real Networks
Richard Hughes-Jones, University of Manchester
Mini-Symposium on Optical Data Networking, August 2005
www.hep.man.ac.uk/~rich/ then “Talks” then look for “Rank”

Slide 2
- SCINet Bandwidth Challenge at SC2004
- Setting up the BW Bunker
- The BW Challenge at the SLAC Booth
- Working with S2io, Sun, Chelsio

Slide 3: The Bandwidth Challenge, SC2004
- The peak aggregate bandwidth from the booths was 101.13 Gbit/s
- That is 3 full-length DVDs per second!
- 4 times greater than SC2003! (with its 4.4 Gbit transatlantic flows)
- Saturated TEN 10 Gigabit Ethernet waves
- SLAC Booth: Sunnyvale to Pittsburgh, LA to Pittsburgh and Chicago to Pittsburgh (with UKLight).

Slide 4
TCP has been around for ages and it just works fine. So what's the problem? The users complain about the network!

Slide 5: TCP provides reliability
- Positive acknowledgement (ACK) of each received segment
  - Sender keeps record of each segment sent
  - Sender awaits an ACK: “I am ready to receive byte 2048 and beyond”
  - Sender starts timer when it sends segment, so it can re-transmit
- Inefficient: sender has to wait
[Diagram: sender/receiver timeline. Segment n (Sequence 1024, Length 1024) is answered one RTT later by the ACK of segment n (Ack 2048); segment n+1 (Sequence 2048, Length 1024) is answered by Ack 3072.]
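The numbering in the diagram can be reproduced with a short sketch. This is an illustrative model only, not real TCP: the function names are hypothetical, the 1024-byte segments come from the diagram, and the 20 ms round-trip time in the throughput line is an assumed example value.

```python
def stop_and_wait(first_seq, segment_len, n_segments):
    """Yield (seq, length, ack) for each segment of a stop-and-wait exchange.

    The receiver's ACK names the next byte it expects: ack = seq + length.
    """
    seq = first_seq
    for _ in range(n_segments):
        ack = seq + segment_len
        yield seq, segment_len, ack
        seq = ack  # the next segment starts where the ACK points


def stop_and_wait_throughput_bps(segment_bytes, rtt_s):
    """One segment per round trip: the sender idles while it waits for the ACK."""
    return segment_bytes * 8 / rtt_s


exchange = list(stop_and_wait(first_seq=1024, segment_len=1024, n_segments=2))
for seq, length, ack in exchange:
    print(f"Sequence {seq} Length {length} -> Ack {ack}")

# Assumed 20 ms RTT: only about 0.4 Mbit/s, whatever the link speed.
print(stop_and_wait_throughput_bps(1024, 0.020))
```

Sending one segment per round trip ties throughput to the RTT rather than to the link speed, which is the inefficiency the slide points at.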

Slide 6: Flow Control: Sender, Congestion Window
- Uses Congestion window, cwnd, a sliding window to control the data flow
  - Byte count giving highest byte that can be sent without an ACK
  - Transmit buffer size and advertised receive buffer size important
  - ACK gives next sequence number to receive AND the available space in the receive buffer
  - Timer kept for each packet
[Diagram: TCP cwnd sliding over the send buffer. Regions: data sent and ACKed; sent data buffered waiting for ACK; unsent data that may be transmitted immediately; data to be sent, waiting for window to open (application writes here). Received ACK advances trailing edge; receiver's advertised window advances leading edge; sending host advances marker as data transmitted.]
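A minimal sketch of the sender-side bookkeeping in the diagram, assuming the usual rule that the sender may have at most min(cwnd, advertised window) bytes unacknowledged. The class and field names are hypothetical and are not taken from any particular TCP stack.

```python
from dataclasses import dataclass


@dataclass
class SendWindow:
    snd_una: int  # oldest byte sent but not yet ACKed (trailing edge)
    snd_nxt: int  # next byte to send (marker advanced as data is transmitted)
    cwnd: int     # congestion window, bytes
    rwnd: int     # receiver's advertised window, bytes

    def usable(self):
        """Bytes that may be transmitted immediately."""
        window = min(self.cwnd, self.rwnd)
        return max(0, self.snd_una + window - self.snd_nxt)

    def on_send(self, nbytes):
        """Send as much of nbytes as the window allows; return bytes sent."""
        sent = min(nbytes, self.usable())
        self.snd_nxt += sent
        return sent

    def on_ack(self, ack, rwnd):
        """An ACK advances the trailing edge and refreshes the advertised window."""
        if ack > self.snd_una:
            self.snd_una = min(ack, self.snd_nxt)
        self.rwnd = rwnd


w = SendWindow(snd_una=0, snd_nxt=0, cwnd=4096, rwnd=8192)
sent = w.on_send(6000)         # only 4096 bytes fit in the window
blocked = w.usable()           # window is full, nothing more may be sent
w.on_ack(ack=2048, rwnd=8192)  # ACK for the first 2048 bytes arrives
reopened = w.usable()          # window slides, 2048 bytes may now be sent
print(sent, blocked, reopened)
```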

Slide 7: How it works: TCP Slowstart
- Probe the network: get a rough estimate of the optimal congestion window size
- The larger the window size, the higher the throughput
  - Throughput = Window size / Round-trip Time
- Exponentially increase the congestion window size until a packet is lost
  - cwnd initially 1 MTU then increased by 1 MTU for each ACK received
  - Send 1st packet, get 1 ACK, increase cwnd to 2
  - Send 2 packets, get 2 ACKs, increase cwnd to 4
  - Time to reach cwnd size W = RTT*log2(W)
  - Rate doubles each RTT
[Plot: CWND vs time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; timeout; retransmit: slow start again.]
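The two formulas on this slide can be checked numerically. The sketch below assumes an idealised slow start (cwnd doubles every RTT, no delayed ACKs, no loss). The function names and the example window and RTT values are illustrative and do not come from the talk.

```python
import math


def throughput_bps(window_bytes, rtt_s):
    """Throughput = window size / round-trip time, in bits per second."""
    return window_bytes * 8 / rtt_s


def slow_start_time_s(target_segments, rtt_s):
    """Time to reach a cwnd of W segments: RTT * log2(W)."""
    return rtt_s * math.log2(target_segments)


def slow_start_rounds(target_segments):
    """Count the doublings needed to grow cwnd from 1 segment to the target."""
    cwnd, rounds = 1, 0
    while cwnd < target_segments:
        cwnd *= 2  # one ACK per segment, each adding one segment
        rounds += 1
    return rounds


# Assumed example: 100 ms RTT, target window of 1024 segments.
print(slow_start_rounds(1024))             # number of RTTs by direct counting
print(slow_start_time_s(1024, 0.100))      # the same thing from the formula
print(throughput_bps(64 * 1024, 0.100))    # a 64 KiB window at 100 ms RTT
```

A fixed 64 KiB window at 100 ms RTT gives only about 5 Mbit/s, which is why window size matters so much on long paths.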

Slide 8: How it works: TCP AIMD Congestion Avoidance
- Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
  - cwnd increased by 1/cwnd for each ACK: linear increase in rate
  - cwnd -> cwnd + a/cwnd (Additive Increase, a = 1)
- TCP takes packet loss as indication of congestion!
- Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
  - Standard TCP reduces cwnd by half
  - cwnd -> cwnd - b*cwnd (Multiplicative Decrease, b = 1/2)
  - Slow start to congestion avoidance transition determined by ssthresh
- Packet loss is a killer
[Plot: CWND vs time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; timeout; retransmit: slow start again.]
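The AIMD rules above can be sketched as two update functions. This is a simplified model with cwnd measured in segments, exactly int(cwnd) ACKs per round trip, and no delayed ACKs. The function names and the example window of 100 segments are assumptions made for illustration.

```python
def on_ack(cwnd, a=1.0):
    """Additive increase: cwnd -> cwnd + a / cwnd for each ACK."""
    return cwnd + a / cwnd


def on_loss(cwnd, b=0.5):
    """Multiplicative decrease: cwnd -> cwnd - b * cwnd."""
    return cwnd - b * cwnd


def one_rtt(cwnd):
    """Apply one round trip's worth of ACKs (about cwnd of them)."""
    for _ in range(int(cwnd)):
        cwnd = on_ack(cwnd)
    return cwnd


def rtts_to_recover(window):
    """Round trips needed to climb back to `window` after a single loss."""
    cwnd, rtts = on_loss(window), 0
    while cwnd < window:
        cwnd = one_rtt(cwnd)
        rtts += 1
    return rtts


print(one_rtt(100.0))          # grows by just under one segment per RTT
print(on_loss(100.0))          # one loss halves the window
print(rtts_to_recover(100.0))  # roughly window/2 round trips to recover
```

Recovery takes about half a window's worth of round trips, so the cost of one lost packet grows with both the window size and the RTT.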

Slide 9: TCP (Reno): Details of problem
- The time for TCP to recover its throughput from 1 lost 1500 byte packet is given by: [formula not preserved in the transcript]
- for rtt of ~200 ms: 2 min
[Plot labels: UK 6 ms, Europe 25 ms, USA 150 ms; 1.6 s, 26 s, 28 min]
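The formula itself did not survive in this transcript. A commonly used estimate, assumed here and not confirmed by the transcript, follows from AIMD: after a loss the window drops from W to W/2 and regains one segment per RTT, so recovery takes W/2 round trips. With W = C*RTT/MSS this gives tau = C*RTT^2 / (2*MSS). The sketch evaluates it for an assumed 1 Gbit/s path; the function name is hypothetical.

```python
def reno_recovery_time_s(capacity_bps, rtt_s, mss_bytes=1500):
    """Assumed estimate: tau = C * RTT^2 / (2 * MSS), with MSS in bits."""
    return capacity_bps * rtt_s ** 2 / (2 * mss_bytes * 8)


GBIT = 1e9  # assumed path capacity; the transcript does not state it
for label, rtt_s in [("6 ms", 0.006), ("25 ms", 0.025),
                     ("150 ms", 0.150), ("200 ms", 0.200)]:
    tau = reno_recovery_time_s(GBIT, rtt_s)
    print(f"rtt {label}: {tau:.1f} s ({tau / 60:.1f} min)")
```

Under these assumptions the estimate gives about 1.5 s at 6 ms and about 26 s at 25 ms, close to the 1.6 s and 26 s in the plot labels. A figure of about 28 min corresponds to an rtt of 200 ms at 1 Gbit/s.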

Slide 10: TCP: Simple Tuning - Filling the Pipe
- Remember, TCP has to hold a copy of data in flight
- Optimal (TCP buffer) window size depends on:
  - Bandwidth end to end, i.e. min(BW links), AKA bottleneck bandwidth
  - Round Trip Time (RTT)
- The number of bytes in flight to fill the entire path:
  - Bandwidth*Delay Product: BDP = RTT*BW
  - Can increase bandwidth by orders of magnitude
- Windows also used for flow control
[Diagram: sender/receiver timeline showing RTT, ACK, and segment time on wire = bits in segment/BW]
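A short worked example of the bandwidth-delay product. The function names and the list of link speeds are illustrative assumptions. The 177 ms round-trip time is the London-Chicago-London figure quoted for the SC2004 bbftp transfers later in this talk.

```python
def bottleneck_bps(link_rates_bps):
    """End-to-end bandwidth is set by the slowest link on the path."""
    return min(link_rates_bps)


def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth*Delay Product: bytes in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8


# Hypothetical path: Gigabit Ethernet at each end, faster links in the core.
path = [1e9, 2.5e9, 10e9, 1e9]
window = bdp_bytes(bottleneck_bps(path), rtt_s=0.177)
print(f"{window / 1e6:.1f} Mbytes")
```

The result, about 22 Mbytes, is in line with the 22 Mbyte socket size quoted for those transfers.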

Slide 11: Investigation of new TCP Stacks
- The AIMD Algorithm: Standard TCP (Reno)
  - For each ack in a RTT without loss: cwnd -> cwnd + a/cwnd (Additive Increase, a = 1)
  - For each window experiencing loss: cwnd -> cwnd - b*cwnd (Multiplicative Decrease, b = 1/2)
- High Speed TCP
  - a and b vary depending on current cwnd using a table
  - a increases more rapidly with larger cwnd, so it returns to the ‘optimal’ cwnd size sooner for the network path
  - b decreases less aggressively and, as a consequence, so does the cwnd. The effect is that there is not such a decrease in throughput.
- Scalable TCP
  - a and b are fixed adjustments for the increase and decrease of cwnd
  - a = 1/100: the increase is greater than TCP Reno
  - b = 1/8: the decrease on loss is less than TCP Reno
  - Scalable over any link speed.
- Fast TCP
  - Uses round trip time as well as packet loss to indicate congestion with rapid convergence to fair equilibrium for throughput.
- HSTCP-LP, Hamilton-TCP, BiC-TCP
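To see why the fixed Scalable TCP adjustments matter, the sketch below compares recovery after a single loss. It assumes the usual per-ACK form of Scalable TCP (cwnd -> cwnd + a for each ACK, so cwnd grows by a factor of about 1 + a per round trip), which the slide does not spell out. The function names and the example window are hypothetical.

```python
import math


def reno_recovery_rtts(window_segments, b=0.5):
    """Reno regains about 1 segment per RTT, so it needs b * W round trips."""
    return b * window_segments


def scalable_recovery_rtts(a=0.01, b=0.125):
    """Scalable TCP grows by a factor (1 + a) per RTT from (1 - b) * W,
    so the number of round trips does not depend on W."""
    return math.log(1 / (1 - b)) / math.log(1 + a)


# Assumed example window: 1 Gbit/s * 177 ms / 1500 bytes, about 14750 segments.
window = 14750
print(reno_recovery_rtts(window))   # thousands of round trips
print(scalable_recovery_rtts())     # a little over 13 round trips
```

The Reno figure scales with the window, while the Scalable figure is constant. This is one way to read the slide's "scalable over any link speed".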

Slide 12
Let's check out this theory about new TCP stacks. Does it matter? Does it work?

Slide 13: Packet Loss with new TCP Stacks
- TCP Response Function
  - Throughput vs Loss Rate: further to right means faster recovery
  - Drop packets in kernel
[Plots: MB-NG rtt 6 ms; DataTAG rtt 120 ms]

Slide 14: High Throughput Demonstration
[Diagram labels: Manchester (Geneva); London (Chicago); man03; lon01; Dual Xeon 2.2 GHz; 1 GEth; Cisco 7609 (two); Cisco GSR; 2.5 Gbit SDH MB-NG Core]
- Send data with TCP
- Drop Packets
- Monitor TCP with Web100

Slide 15: High Performance TCP: DataTAG
- Different TCP stacks tested on the DataTAG Network
- rtt 128 ms
- Drop 1 in 10^6
- High-Speed: rapid recovery
- Scalable: very fast recovery
- Standard: recovery would take ~20 mins

Slide 16: Throughput for real users
Transfers in the UK for BaBar using MB-NG and SuperJANET4

Slide 17: Topology of the MB-NG Network
[Diagram. Key: Gigabit Ethernet; 2.5 Gbit POS Access; MPLS; Admin. Domains. Labels: Manchester Domain; UCL Domain; RAL Domain; UKERNA Development Network; Edge Router Cisco 7609; Boundary Router Cisco 7609 (two); hosts man01, man02, man03, lon01, lon02, lon03, ral01, ral02; HW RAID]

Slide 18: Topology of the Production Network
[Diagram. Key: Gigabit Ethernet; 2.5 Gbit POS Access; 10 Gbit POS. Labels: Manchester Domain; RAL Domain; man01; ral01; HW RAID; routers; switches; 3 routers; 2 switches]

Slide 19: iperf Throughput + Web100
- SuperMicro on MB-NG network
  - HighSpeed TCP
  - Linespeed 940 Mbit/s
  - DupACK? under 10 (expect ~400)
- BaBar on Production network
  - Standard TCP
  - 425 Mbit/s
  - DupACKs 350-400: re-transmits

Slide 20: Applications: Throughput Mbit/s
- HighSpeed TCP
- 2 GByte file, RAID5
- SuperMicro + SuperJANET
- Plots for: bbcp, bbftp, Apache, Gridftp
- Previous work used RAID0 (not disk limited)

Slide 21: bbftp: What else is going on?
- Scalable TCP
- SuperMicro + SuperJANET
  - Instantaneous 220-625 Mbit/s
- Congestion window, duplicate ACK
- Throughput variation not TCP related?
  - Disk speed / bus transfer
  - Application
- BaBar + SuperJANET
  - Instantaneous 200-600 Mbit/s
- Disk-mem ~590 Mbit/s

Slide 22: Average Transfer Rates Mbit/s

App      TCP Stack   SuperMicro   SuperMicro on   BaBar on      SC2004
                     on MB-NG     SuperJANET4     SuperJANET4   on UKLight
Iperf    Standard    940          350-370         425           940
Iperf    HighSpeed   940          510             570           940
Iperf    Scalable    940          580-650         605           940
bbcp     Standard    434          290-310         290
bbcp     HighSpeed   435          385             360
bbcp     Scalable    432          400-430         380
bbftp    Standard    400-410      325             320           825
bbftp    HighSpeed   370-390      380
bbftp    Scalable    430          345-532         380           875
apache   Standard    425          260             300-360
apache   HighSpeed   430          370             315
apache   Scalable    428          400             317
Gridftp  Standard    405          240
Gridftp  HighSpeed   320
Gridftp  Scalable    335

- New stacks give more throughput
- Rate decreases

Slide 23: Transatlantic Disk to Disk Transfers With UKLight
SuperComputing 2004

Slide 24: SC2004 UKLIGHT Overview
[Diagram labels: MB-NG 7600 OSR Manchester; ULCC UKLight; UCL HEP; UCL network; K2; Ci; Chicago Starlight; Amsterdam; SC2004; Caltech Booth; UltraLight IP; SLAC Booth; Cisco 6509; UKLight 10G, four 1GE channels; UKLight 10G; Surfnet/EuroLink 10G, two 1GE channels; NLR Lambda NLR-PITT-STAR-10GE-16; K2; Ci; Caltech 7600]

Slide 25: Transatlantic Ethernet: TCP Throughput Tests
- Supermicro X5DPE-G2 PCs
- Dual 2.9 GHz Xeon CPU, FSB 533 MHz
- 1500 byte MTU
- 2.6.6 Linux Kernel
- Memory-memory TCP throughput
- Standard TCP
- Wire rate throughput of 940 Mbit/s
- First 10 sec
- Work in progress to study:
  - Implementation detail
  - Advanced stacks
  - Effect of packet loss
  - Sharing

Slide 26: SC2004 Disk-Disk bbftp
- bbftp file transfer program uses TCP/IP
- UKLight: Path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
- MTU 1500 bytes; Socket size 22 Mbytes; rtt 177 ms; SACK off
- Move a 2 Gbyte file
- Web100 plots:
- Standard TCP
  - Average 825 Mbit/s
  - (bbcp: 670 Mbit/s)
- Scalable TCP
  - Average 875 Mbit/s
  - (bbcp: 701 Mbit/s, ~4.5 s of overhead)
- Disk-TCP-Disk at 1 Gbit/s

Slide 27: Network & Disk Interactions (work in progress)
- Hosts:
  - Supermicro X5DPE-G2 motherboards
  - dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 M byte memory
  - 3Ware 8506-8 controller on 133 MHz PCI-X bus configured as RAID0
  - six 74.3 GByte Western Digital Raptor WD740 SATA disks
  - 64 kbyte stripe size
- Measure memory to RAID0 transfer rates with & without UDP traffic
  - Disk write: 1735 Mbit/s
  - Disk write + 1500 MTU UDP: 1218 Mbit/s, drop of 30%
  - Disk write + 9000 MTU UDP: 1400 Mbit/s, drop of 19%
[Plot axis: % CPU kernel mode]
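The quoted percentage drops follow directly from the measured rates, as this small check shows. The helper name is hypothetical.

```python
def percent_drop(baseline_mbit_s, loaded_mbit_s):
    """Relative fall in disk write rate when UDP traffic is added."""
    return 100 * (baseline_mbit_s - loaded_mbit_s) / baseline_mbit_s


print(round(percent_drop(1735, 1218)))  # with 1500 byte MTU UDP traffic
print(round(percent_drop(1735, 1400)))  # with 9000 byte MTU UDP traffic
```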

Slide 28: Remote Computing Farms in the ATLAS TDAQ Experiment

Slide 29: Remote Computing Concepts
[Diagram labels: ATLAS Detectors, Level 1 Trigger; ROB; L2PU; Level 2 Trigger; Event Builders; SFI; PF; Local Event Processing Farms; SFOs; Mass storage; Experimental Area; CERN B513; Data Collection Network; Back End Network; Switch; GÉANT; lightpaths; Remote Event Processing Farms (PF) at Copenhagen, Edmonton, Krakow, Manchester; ~PByte/sec; 320 MByte/sec]

Slide 30: ATLAS Application Protocol
- Event Request
  - EFD requests an event from SFI
  - SFI replies with the event, ~2 Mbytes
- Processing of event
- Return of computation
  - EF asks SFO for buffer space
  - SFO sends OK
  - EF transfers results of the computation
- tcpmon: instrumented TCP request-response program emulates the Event Filter EFD to SFI communication.
[Diagram: time sequence between the Event Filter Daemon (EFD) and SFI / SFO: Request event; Send event data; Process event; Request Buffer; Send OK; Send processed event. The Request-Response time is histogrammed.]
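The request-response pattern that tcpmon emulates can be sketched with plain sockets. This is not tcpmon itself: it is a minimal loopback illustration with hypothetical names, a fixed 64-byte request, and an assumed 1,000,000-byte response standing in for an event. On a real long-distance path the measured times would be dominated by the RTT and the TCP window behaviour, which loopback does not show.

```python
import socket
import threading
import time

REQUEST_SIZE = 64           # bytes in each request
RESPONSE_SIZE = 1_000_000   # bytes in each response (assumed event size)


def recv_exact(sock, nbytes):
    """Read exactly nbytes from sock, or raise if the peer closes early."""
    chunks, remaining = [], nbytes
    while remaining:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def serve(listener, n_requests):
    """Responder role: answer each fixed-size request with a full response."""
    conn, _ = listener.accept()
    with conn:
        payload = b"\0" * RESPONSE_SIZE
        for _ in range(n_requests):
            recv_exact(conn, REQUEST_SIZE)
            conn.sendall(payload)


def measure(n_requests=5, wait_s=0.0):
    """Requester role: return the request-response time of each exchange."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
    listener.listen(1)
    server = threading.Thread(target=serve, args=(listener, n_requests))
    server.start()
    latencies = []
    with socket.create_connection(listener.getsockname()) as client:
        for _ in range(n_requests):
            start = time.perf_counter()
            client.sendall(b"\0" * REQUEST_SIZE)
            recv_exact(client, RESPONSE_SIZE)
            latencies.append(time.perf_counter() - start)
            time.sleep(wait_s)  # idle gap between requests
    server.join()
    listener.close()
    return latencies


if __name__ == "__main__":
    for t in measure(n_requests=5, wait_s=0.05):
        print(f"{t * 1e3:.2f} ms")
```

The `wait_s` gap between requests mirrors the 50 ms wait used in the tcpmon tests in this talk.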

Slide 31: tcpmon: TCP Activity Manc-CERN Req-Resp
- Web100 instruments the TCP stack
- Round trip time 20 ms
- 64 byte Request (green), 1 Mbyte Response (blue)
- TCP in slow start
- 1st event takes 19 rtt or ~380 ms
- TCP Congestion window gets re-set on each Request
  - TCP stack implementation detail to reduce Cwnd after inactivity
- Even after 10 s, each response takes 13 rtt or ~260 ms
- Transfer achievable throughput 120 Mbit/s

Slide 32: tcpmon: TCP Activity Manc-CERN Req-Resp, TCP stack tuned
- Round trip time 20 ms
- 64 byte Request (green), 1 Mbyte Response (blue)
- TCP starts in slow start
- 1st event takes 19 rtt or ~380 ms
- TCP Congestion window grows nicely
- Response takes 2 rtt after ~1.5 s
- Rate ~10/s (with 50 ms wait)
- Transfer achievable throughput grows to 800 Mbit/s

Slide 33: tcpmon: TCP Activity Alberta-CERN Req-Resp, TCP stack tuned
- Round trip time 150 ms
- 64 byte Request (green), 1 Mbyte Response (blue)
- TCP starts in slow start
- 1st event takes 11 rtt or ~1.67 s
- TCP Congestion window in slow start to ~1.8 s then congestion avoidance
- Response in 2 rtt after ~2.5 s
- Rate 2.2/s (with 50 ms wait)
- Transfer achievable throughput grows slowly from 250 to 800 Mbit/s

Slide 34: Time Series of Request-Response Latency
- Alberta - CERN
  - Round trip time 150 ms
  - 1 Mbyte of data returned
  - Stable for ~150 s at 300 ms
  - Falls to 160 ms with ~80 μs variation
- Manchester - CERN
  - Round trip time 20 ms
  - 1 Mbyte of data returned
  - Stable for ~18 s at ~42.5 ms
  - Then alternate points 29 & 42.5 ms

Slide 35: Radio Astronomy e-VLBI

Slide 36: e-VLBI at the GÉANT2 Launch Jun 2005
[Map labels: Jodrell Bank UK; Dwingeloo DWDM link; Medicina Italy; Torun Poland]

Slide 37: e-VLBI UDP Data Streams

Slide 38: UDP Performance: 3 Flows on GÉANT
- Throughput: 5 Hour run
- Jodrell: JIVE
  - 2.0 GHz dual Xeon - 2.4 GHz dual Xeon
  - 670-840 Mbit/s
- Medicina (Bologna): JIVE
  - 800 MHz PIII - mark623 1.2 GHz PIII
  - 330 Mbit/s limited by sending PC
- Torun: JIVE
  - 2.4 GHz dual Xeon - mark575 1.2 GHz PIII
  - 245-325 Mbit/s limited by security policing (>400 Mbit/s -> 20 Mbit/s)?
- Throughput: 50 min period
- Period is ~17 min

Slide 39: UDP Performance: 3 Flows on GÉANT
- Packet Loss & Re-ordering
- Jodrell: 2.0 GHz Xeon
  - Loss 0-12%
  - Reordering significant
- Medicina: 800 MHz PIII
  - Loss ~6%
  - Reordering insignificant
- Torun: 2.4 GHz Xeon
  - Loss 6-12%
  - Reordering insignificant

Slide 40: 18 Hour Flows on UKLight, Jodrell - JIVE, 26 June 2005
- Throughput:
- Jodrell: JIVE
  - 2.4 GHz dual Xeon - 2.4 GHz dual Xeon
  - 960-980 Mbit/s
- Traffic through SURFnet
- Packet Loss
  - Only 3 groups with 10-150 lost packets each
  - No packets lost the rest of the time
- Packet re-ordering
  - None

Slide 41: Summary & Conclusions
- The End Hosts themselves
  - The performance of Motherboards, NICs, RAID controllers and Disks matters
  - Plenty of CPU power is required to sustain Gigabit transfers, for the TCP/IP stack as well as the application
  - Packets can be lost in the IP stack due to lack of processing power
- New TCP stacks are stable and give better response & performance
  - Still need to set the tcp buffer sizes!
  - Check other kernel settings, e.g. window-scale
  - Take care on difference between the Protocol and the Implementation
- Packet loss is a killer
  - Check on campus links & equipment, and access links to backbones
- Applications architecture & implementation is also important
- The work is applicable to other areas including:
  - Remote iSCSI
  - Remote database accesses
  - Real-time Grid Computing, e.g. Real-Time Interactive Medical Image processing
- Interaction between HW, protocol processing, and disk sub-system is complex

Slide 42: More Information, Some URLs
- Real-Time Remote Farm site: http://csr.phys.ualberta.ca/real-time
- UKLight web site: http://www.uklight.ac.uk
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/ (Software & Tools)
- Motherboard and NIC Tests:
  - http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt
  - http://datatag.web.cern.ch/datatag/pfldnet2003/
  - “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards”, FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/ (Publications)
- TCP tuning information may be found at:
  - http://www.ncne.nlanr.net/documentation/faq/performance.html
  - http://www.psc.edu/networking/perf_tune.html
- TCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks”, Journal of Grid Computing 2004, http://www.hep.man.ac.uk/~rich/ (Publications)
- PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
- Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

Slide 43: Any Questions?

Slide 44: Backup Slides

Slide 45: Multi-Gigabit flows at SC2003 BW Challenge
- Three Server systems with 10 Gigabit Ethernet NICs
- Used the DataTAG altAIMD stack, 9000 byte MTU
- Send mem-mem iperf TCP streams from SLAC/FNAL booth in Phoenix to:
  - Palo Alto PAIX
    - rtt 17 ms, window 30 MB
    - Shared with Caltech booth
    - 4.37 Gbit HighSpeed TCP I=5%
    - Then 2.87 Gbit I=16%
    - Fall when 10 Gbit on link
    - 3.3 Gbit Scalable TCP I=8%
    - Tested 2 flows, sum 1.9 Gbit I=39%
  - Chicago Starlight
    - rtt 65 ms, window 60 MB
    - Phoenix CPU 2.2 GHz
    - 3.1 Gbit HighSpeed TCP I=1.6%
  - Amsterdam SARA
    - rtt 175 ms, window 200 MB
    - Phoenix CPU 2.2 GHz
    - 4.35 Gbit HighSpeed TCP I=6.9%
    - Very Stable
    - Both used Abilene to Chicago

Slide 46: Latency Measurements
- UDP/IP packets sent between back-to-back systems
  - Processed in a similar manner to TCP/IP
  - Not subject to flow control & congestion avoidance algorithms
  - Used UDPmon test program
- Latency
- Round trip times measured using Request-Response UDP frames
- Latency as a function of frame size
  - Slope is given by: mem-mem copy(s) + pci + Gig Ethernet + pci + mem-mem copy(s)
  - Intercept indicates: processing times + HW latencies
- Histograms of ‘singleton’ measurements
- Tells us about:
  - Behavior of the IP stack
  - The way the HW operates
  - Interrupt coalescence
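The slope and intercept described above can be extracted with an ordinary least-squares line fit of latency against frame size. The sketch below uses made-up, noise-free data (an assumed 20 μs fixed cost plus 0.02 μs per byte) purely to show the calculation. The function name and all of the numbers are hypothetical and are not UDPmon results.

```python
def fit_line(sizes, latencies):
    """Least-squares fit of latency = intercept + slope * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(latencies) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, latencies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic measurements: frame size in bytes, latency in microseconds.
sizes = [64, 256, 512, 1024, 1400]
latencies = [20.0 + 0.02 * s for s in sizes]

slope, intercept = fit_line(sizes, latencies)
print(f"slope {slope:.4f} us/byte, intercept {intercept:.2f} us")
```

The slope (time per byte) reflects the per-byte costs of memory copies, PCI transfers and the Ethernet wire. The intercept reflects the fixed processing and hardware latencies.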

Slide 47: Throughput Measurements
- UDP Throughput
- Send a controlled stream of UDP frames spaced at regular intervals
[Diagram: sender/receiver timeline. Zero stats, answered by OK done. Send data frames at regular intervals (n bytes, number of packets, wait time), recording time to send, time to receive and inter-packet time (histogram). Signal end of test, answered by OK done. Get remote statistics; the receiver sends statistics: no. received, no. lost + loss pattern, no. out-of-order, CPU load & no. int, 1-way delay.]
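Two parts of this test lend themselves to a short sketch: choosing the wait time between frames for a given offered rate, and deriving the receiver statistics from the sequence numbers that arrive. This is an illustrative model, not UDPmon code. The function names are hypothetical, and the rate calculation counts only the frame's own bytes, ignoring protocol headers and Ethernet framing overhead.

```python
def wait_time_us(frame_bytes, target_mbit_s):
    """Spacing between frame starts for the target rate.

    bits / (Mbit/s) gives microseconds directly.
    """
    return frame_bytes * 8 / target_mbit_s


def receive_stats(n_sent, received_seqs):
    """Count received, lost and out-of-order frames from sequence numbers."""
    seen = set()
    out_of_order = 0
    highest = -1
    for seq in received_seqs:
        if seq in seen:
            continue  # ignore duplicates
        seen.add(seq)
        if seq < highest:
            out_of_order += 1  # arrived after a later frame
        else:
            highest = seq
    return {
        "received": len(seen),
        "lost": n_sent - len(seen),
        "out_of_order": out_of_order,
    }


print(wait_time_us(frame_bytes=1000, target_mbit_s=800))
# Frames 0..5 sent; frame 4 never arrives and frame 2 overtaken by frame 3.
stats = receive_stats(n_sent=6, received_seqs=[0, 1, 3, 2, 5])
print(stats)
```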

Slide 48: PCI Bus & Gigabit Ethernet Activity
- PCI Activity
- Logic Analyzer with
  - PCI Probe cards in sending PC
  - Gigabit Ethernet Fiber Probe Card
  - PCI Probe cards in receiving PC
[Diagram labels: CPU, mem, chipset and NIC in each PC; PCI bus; Gigabit Ethernet Probe; Logic Analyser Display; Possible Bottlenecks]

Slide 49: “Server Quality” Motherboards
- SuperMicro P4DP8-2G (P4DP6)
- Dual Xeon
- 400/522 MHz Front side bus
- 6 PCI / PCI-X slots
- 4 independent PCI buses
  - 64 bit 66 MHz PCI
  - 100 MHz PCI-X
  - 133 MHz PCI-X
- Dual Gigabit Ethernet
- Adaptec AIC-7899W dual channel SCSI
- UDMA/100 bus master/EIDE channels
  - data transfer rates of 100 MB/sec burst

Slide 50: “Server Quality” Motherboards
- Boston/Supermicro H8DAR
- Two Dual Core Opterons
- 200 MHz DDR Memory
  - Theory BW: 6.4 Gbit
- HyperTransport
- 2 independent PCI buses
  - 133 MHz PCI-X
- 2 Gigabit Ethernet
- SATA
- (PCI-e)

Slide 51: End Hosts & NICs, CERN-nat-Manc.
- Use UDP packets to characterise Host, NIC & Network
  - SuperMicro P4DP8 motherboard
  - Dual Xeon 2.2 GHz CPU
  - 400 MHz System bus
  - 64 bit 66 MHz PCI / 133 MHz PCI-X bus
- The network can sustain 1 Gbps of UDP traffic
- The average server can lose smaller packets
- Packet loss caused by lack of power in the PC receiving the traffic
- Out of order packets due to WAN routers
- Lightpaths look like extended LANs and have no re-ordering
[Plots: Request-Response Latency; Throughput; Packet Loss; Re-Order]

Slide 52: TCP (Reno): Details
- Time for TCP to recover its throughput from 1 lost packet given by: [formula not preserved in the transcript]
- for rtt of ~200 ms: 2 min
[Plot labels: UK 6 ms, Europe 20 ms, USA 150 ms]

Slide 53: Network & Disk Interactions
- Disk Write
  - mem-disk: 1735 Mbit/s
  - Tends to be in 1 die
- Disk Write + UDP 1500
  - mem-disk: 1218 Mbit/s
  - Both dies at ~80%
- Disk Write + CPU mem
  - mem-disk: 1341 Mbit/s
  - 1 CPU at ~60%, other 20%
  - Large user mode usage
  - Below Cut = hi BW
  - Hi BW = die1 used
- Disk Write + CPUload
  - mem-disk: 1334 Mbit/s
  - 1 CPU at ~60%, other 20%
  - All CPUs saturated in user mode
[Plot axes: Total CPU load; Kernel CPU load]

Slide 54: TCP Fast Retransmit & Recovery
- Duplicate ACKs are due to lost segments or segments out of order.
- Fast Retransmit: if the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected)
  - Transmitting host sends the missing segment
  - Set ssthresh to 0.5*cwnd, so enter congestion avoidance phase
  - Set cwnd = (0.5*cwnd + 3), the 3 dup ACKs
  - Increase cwnd by 1 segment when get duplicate ACKs
  - Keep sending new data if allowed by cwnd
  - Set cwnd to half original value on new ACK
  - No need to go into “slow start” again
- At steady state, CWND oscillates around the optimal window size
- With a retransmission timeout, slow start is triggered again
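The steps above can be sketched as a small state machine with cwnd counted in segments. This is a simplified illustration of the behaviour listed on the slide, not a complete TCP implementation. The class and method names are hypothetical, and halving ssthresh on a timeout is the usual Reno behaviour, assumed here because the slide does not state it.

```python
class RenoFastRecovery:
    """Tracks cwnd and ssthresh through fast retransmit and fast recovery."""

    def __init__(self, cwnd):
        self.cwnd = float(cwnd)
        self.ssthresh = float("inf")
        self.dup_acks = 0
        self.in_recovery = False
        self.retransmits = 0

    def on_dup_ack(self):
        self.dup_acks += 1
        if self.in_recovery:
            self.cwnd += 1                    # each further dup ACK inflates cwnd
        elif self.dup_acks == 3:
            self.ssthresh = 0.5 * self.cwnd   # halve the threshold
            self.cwnd = self.ssthresh + 3     # account for the 3 dup ACKs
            self.in_recovery = True
            self.retransmits += 1             # resend the missing segment

    def on_new_ack(self):
        if self.in_recovery:
            self.cwnd = self.ssthresh         # half the original value, no slow start
            self.in_recovery = False
        self.dup_acks = 0

    def on_timeout(self):
        self.ssthresh = 0.5 * self.cwnd       # assumed standard Reno behaviour
        self.cwnd = 1.0                       # slow start is triggered again
        self.dup_acks = 0
        self.in_recovery = False


tcp = RenoFastRecovery(cwnd=20)
for _ in range(3):
    tcp.on_dup_ack()
after_third_dup = (tcp.cwnd, tcp.ssthresh, tcp.retransmits)
tcp.on_dup_ack()
tcp.on_dup_ack()
inflated = tcp.cwnd
tcp.on_new_ack()
print(after_third_dup, inflated, tcp.cwnd)
```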

Slide 55: Packet Loss and new TCP Stacks
- TCP Response Function
  - UKLight London-Chicago-London, rtt 177 ms
  - 2.6.6 Kernel
- Agreement with theory good
- Some new stacks good at high loss rates

