5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 1 TCP/IP on High Bandwidth Long Distance Paths or So TCP.


1 TCP/IP on High Bandwidth Long Distance Paths
or: So TCP works … but still the users ask: Where is my throughput?
Richard Hughes-Jones, The University of Manchester
5th Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory
Slides: www.hep.man.ac.uk/~rich/ then “Talks” and look for Haystack

2 Layers & IP

3 The Network Layer 3: IP
• IP layer properties:
  - Provides best-effort delivery
  - It is unreliable: packets may be lost, duplicated, or delivered out of order
  - Connectionless
  - Provides logical addresses
  - Provides routing
  - Demultiplexes data on protocol number

4 The Internet datagram
Frame layout: Frame header | IP header | Transport | FCS
IP header (20 bytes without options), bit positions 0-31 in each 32-bit word:
  Word 1: Vers (0-3) | Hlen (4-7) | Type of serv. (8-15) | Total length (16-31)
  Word 2: Identification (0-15) | Flags (16-18) | Fragment offset (19-31)
  Word 3: TTL (0-7) | Protocol (8-15) | Header Checksum (16-31)
  Word 4: Source IP address
  Word 5: Destination IP address
  Then:   IP Options (if any) | Padding

5 IP Datagram Format (cont.)
• Type of Service (TOS): now being used for QoS
• Total length: length of the datagram in bytes, including header and data
• Time to live (TTL): specifies how long a datagram is allowed to remain in the internet
  - Routers decrement it by 1
  - When TTL = 0 the router discards the datagram
  - Prevents infinite loops
• Protocol: specifies the format of the data area
  - Protocol numbers are administered by a central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17 …
• Source & destination IP address (32 bits each): contain the IP address of the sender and the intended recipient
• Options (variable length): mainly used to record a route, or timestamps, or specify routing
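The field descriptions above can be made concrete with a small decoder. This is a minimal sketch, not part of the original talk: it unpacks the fixed 20-byte IPv4 header with Python's standard `struct` module, and the function name `parse_ipv4_header` and the hand-built `sample` header (using documentation addresses) are assumptions for illustration only.

```python
import socket
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (options are ignored)."""
    (ver_hlen, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_hlen >> 4,                   # top 4 bits
        "header_len_bytes": (ver_hlen & 0x0F) * 4,  # Hlen counts 32-bit words
        "tos": tos,
        "total_length": total_len,                  # header + data, in bytes
        "identification": ident,
        "flags": flags_frag >> 13,                  # top 3 bits
        "fragment_offset": flags_frag & 0x1FFF,     # low 13 bits
        "ttl": ttl,
        "protocol": proto,                          # e.g. ICMP=1, TCP=6, UDP=17
        "checksum": checksum,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# Hypothetical header built by hand: IPv4, Hlen=5 words, TTL 64, protocol TCP.
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 64, 6, 0,
                     socket.inet_aton("192.0.2.1"), socket.inet_aton("192.0.2.2"))
header = parse_ipv4_header(sample)
```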

6 The Transport Layer 4: UDP
• UDP provides:
  - Connectionless service over IP: no setup or teardown, one packet at a time
  - Minimal overhead, high performance
  - Best-effort delivery
• It is unreliable: packets may be lost, duplicated, or delivered out of order
• The application is responsible for:
  - Data reliability
  - Flow control
  - Error handling

7 UDP Datagram format
Frame layout: Frame header | IP header | UDP header | Application data | FCS
UDP header (8 bytes), bit positions 0-31 in each 32-bit word:
  Word 1: Source port (0-15) | Destination port (16-31)
  Word 2: UDP message len (0-15) | Checksum (opt.) (16-31)
• Source/destination port: port numbers identify the sending & receiving processes
  - Port number & IP address allow any application on the Internet to be uniquely identified
  - Ports can be static or dynamic
  - Static (< 1024): assigned centrally, known as well-known ports
  - Dynamic
• Message length: in bytes, includes the UDP header and data (min 8, max 65,535)
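As an illustration of the 8-byte UDP header described above, the following sketch packs and unpacks the four 16-bit fields. The helper names `build_udp_header` and `parse_udp_header` are hypothetical, and the checksum is simply left at 0 (the "opt." case for UDP over IPv4) rather than computed.

```python
import struct

UDP_HEADER_LEN = 8  # bytes: four 16-bit fields

def build_udp_header(src_port: int, dst_port: int, payload: bytes,
                     checksum: int = 0) -> bytes:
    """Pack a UDP header; the length field covers header plus data."""
    length = UDP_HEADER_LEN + len(payload)
    if length > 65535:
        raise ValueError("UDP message length must fit in 16 bits")
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)

def parse_udp_header(raw: bytes) -> dict:
    """Unpack the first 8 bytes of a UDP datagram."""
    src, dst, length, checksum = struct.unpack("!HHHH", raw[:UDP_HEADER_LEN])
    return {"src_port": src, "dst_port": dst, "length": length, "checksum": checksum}

# Illustrative values only: two dynamic ports and a 100-byte payload.
hdr = build_udp_header(5000, 5001, b"x" * 100)
fields = parse_udp_header(hdr)
```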

8 The Transport Layer 4: TCP
• TCP (RFC 793, RFC 1122) provides:
  - Connection-orientated service over IP
  - During setup the two ends agree on details
  - Explicit teardown
  - Multiple connections allowed
  - Reliable end-to-end byte stream delivery over an unreliable network
• It takes care of:
  - Lost packets
  - Duplicated packets
  - Out-of-order packets
• TCP provides:
  - Data buffering
  - Flow control
  - Error detection & handling
  - Limits on network congestion

9 The TCP Segment Format
Frame layout: Frame header | IP header | TCP header | Application data | FCS
TCP header (20 bytes without options), bit positions 0-31 in each 32-bit word:
  Word 1: Source port (0-15) | Destination port (16-31)
  Word 2: Sequence number
  Word 3: Acknowledgement number
  Word 4: Hlen (0-3) | Resv (4-9) | Code (10-15) | Window (16-31)
  Word 5: Checksum (0-15) | Urgent ptr (16-31)
  Then:   Options (if any) | Padding

10 TCP Segment Format – cont.
• Source/Dest port: TCP port numbers to identify the applications at both ends of the connection
• Sequence number: first byte in the segment from the sender’s byte stream
• Acknowledgement: identifies the number of the byte the sender of this (ACK) segment expects to receive next
• Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG
• Window: advertises how much data this station is willing to accept. Can depend on buffer space remaining.
• Options: used for window scaling, SACK, timestamps, maximum segment size etc.
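To make the TCP header fields above tangible, here is a hedged sketch that decodes the fixed 20-byte TCP header and names the six code bits. The function `parse_tcp_header` and the sample values are assumptions for illustration; options are not parsed.

```python
import struct

# The six classic code bits, most significant first.
CODE_BITS = [("URG", 0x20), ("ACK", 0x10), ("PSH", 0x08),
             ("RST", 0x04), ("SYN", 0x02), ("FIN", 0x01)]

def parse_tcp_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header."""
    (src, dst, seq, ack, off_code,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", raw[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,                                   # first byte carried in this segment
        "ack": ack,                                   # next byte expected from the peer
        "header_len_bytes": (off_code >> 12) * 4,     # Hlen counts 32-bit words
        "code": [name for name, bit in CODE_BITS if off_code & bit],
        "window": window,                             # advertised receive window
        "checksum": checksum,
        "urgent_ptr": urgent,
    }

# Hypothetical SYN+ACK segment header: Hlen=5 words, code bits 0x12.
sample = struct.pack("!HHIIHHHH", 5001, 80, 1024, 2048,
                     (5 << 12) | 0x12, 65535, 0, 0)
header = parse_tcp_header(sample)
```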

11 TCP – providing reliability
• Positive acknowledgement (ACK) of each received segment
  - Sender keeps a record of each segment sent
  - Sender awaits an ACK: “I am ready to receive byte 2048 and beyond”
  - Sender starts a timer when it sends a segment, so it can re-transmit
• Inefficient: the sender has to wait
[Diagram: Sender/Receiver timeline. Segment n (Sequence 1024, Length 1024) is answered by the ACK of segment n (Ack 2048) one RTT later; segment n+1 (Sequence 2048, Length 1024) is answered by the ACK of segment n+1 (Ack 3072) after a further RTT.]

12 Flow Control: Sender – Congestion Window
• Uses the congestion window, cwnd, a sliding window to control the data flow
  - Byte count giving the highest byte that can be sent without an ACK
  - Transmit buffer size and advertised receive buffer size are important
  - ACK gives the next sequence number to receive AND the available space in the receive buffer
  - Timer kept for each packet
[Diagram: the sender's byte stream, with the TCP cwnd sliding along it. Regions: data sent and ACKed; sent data buffered waiting for an ACK; unsent data that may be transmitted immediately; data to be sent, waiting for the window to open (application writes here). The sending host advances a marker as data is transmitted; a received ACK advances the trailing edge; the receiver’s advertised window advances the leading edge.]

13 Flow Control: Receiver – Lost Data
[Diagram: the receiver's byte stream, with the window sliding along it. Regions: data given to the application (application reads here); ACKed but not given to the user; lost data; received but not ACKed. Markers: last ACK given; next byte expected (expected sequence no.). The receiver’s advertised window advances the leading edge.]
• If new data is received with a sequence number ≠ next byte expected, a duplicate ACK is sent with the expected sequence number
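The duplicate-ACK rule above can be sketched as a toy receiver. This is an illustrative model only, assuming a receiver that buffers out-of-order segments and always replies with the next byte it expects; the function name `receiver_acks` is hypothetical.

```python
def receiver_acks(segments, first_expected):
    """Return the ACK number sent for each arriving (seq, length) segment.

    A cumulative ACK always carries the next byte expected, so a gap in the
    stream makes the receiver repeat the same ACK (duplicate ACKs).
    """
    expected = first_expected
    buffered = {}  # seq -> length of out-of-order segments held back
    acks = []
    for seq, length in segments:
        if seq == expected:
            expected += length
            # Release any buffered segments that are now contiguous.
            while expected in buffered:
                expected += buffered.pop(expected)
        elif seq > expected:
            buffered[seq] = length  # hole in the stream: keep for later
        # seq < expected: an old duplicate, nothing new to accept
        acks.append(expected)
    return acks

# Segment starting at byte 2048 is delayed; three later segments arrive first.
arrivals = [(1024, 1024), (3072, 1024), (4096, 1024), (5120, 1024), (2048, 1024)]
acks = receiver_acks(arrivals, first_expected=1024)
```

In this run the ACK for byte 2048 is sent once for the in-order segment and then repeated three times, after which the late segment fills the hole and the ACK jumps to 6144.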

14 How it works: TCP Slowstart
• Probe the network: get a rough estimate of the optimal congestion window size
• The larger the window size, the higher the throughput
  - Throughput = Window size / Round-trip Time
• Exponentially increase the congestion window size until a packet is lost
  - cwnd is initially 1 MTU, then increased by 1 MTU for each ACK received
  - Send 1st packet, get 1 ACK, increase cwnd to 2
  - Send 2 packets, get 2 ACKs, increase cwnd to 4
  - Time to reach cwnd size W: T_W = RTT * log_2(W) (not exactly slow!)
  - Rate doubles each RTT
[Plot: CWND vs time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit; timeout: slow start again.]
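The slow-start timing estimate T_W = RTT * log_2(W) can be checked numerically. The sketch below assumes an idealised sender that starts at a window of one segment and doubles it every round trip with no loss and no delayed ACKs; both function names are illustrative.

```python
import math

def slow_start_time(rtt_s: float, target_window_segments: float) -> float:
    """Closed form from the slide: T_W = RTT * log2(W)."""
    return rtt_s * math.log2(target_window_segments)

def slow_start_rounds(target_window_segments: int) -> int:
    """Count round trips while cwnd doubles from 1 segment up to the target."""
    cwnd, rounds = 1, 0
    while cwnd < target_window_segments:
        cwnd *= 2      # one extra segment per ACK doubles cwnd each RTT
        rounds += 1
    return rounds

# Reaching a window of 1024 segments takes 10 round trips in this model.
rounds = slow_start_rounds(1024)
seconds = slow_start_time(0.1, 1024)  # with an assumed 100 ms RTT
```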

15 TCP Slowstart Animated
• Growth of CWND is related to RTT
• (Most important in the Congestion Avoidance phase)
[Animation: Source to Sink, CWND = 1, CWND = 2, CWND = 4]
Toby Rodwell, Dante

16 How it works: TCP Congestion Avoidance
• Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
  - cwnd increased by 1 segment per rtt
  - cwnd increased by 1/cwnd for each ACK: linear increase in rate
• TCP takes packet loss as an indication of congestion!
• Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
  - Standard TCP reduces cwnd by a factor of 0.5, i.e. halves it
  - Slow start to congestion avoidance transition determined by ssthresh
[Plot: CWND vs time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit; timeout: slow start again.]
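The sawtooth described above can be reproduced with a few lines of simulation. This is a simplified sketch, not the kernel algorithm: it works in whole round trips, treats every loss as detected without a timeout (so cwnd is halved rather than reset to 1), and the name `aimd_trace` and its parameters are assumptions.

```python
def aimd_trace(rtts, ssthresh, loss_at, a=1.0, b=0.5):
    """Return cwnd (in segments) at the start of each round trip.

    rtts:     number of round trips to simulate
    ssthresh: initial slow-start threshold in segments
    loss_at:  set of round-trip indices in which a loss is detected
    a, b:     additive increase per RTT and multiplicative decrease factor
    """
    cwnd = 1.0
    trace = []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_at:
            cwnd = max(1.0, cwnd * (1 - b))  # multiplicative decrease
            ssthresh = cwnd                  # resume in congestion avoidance
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)   # slow start: double each RTT
        else:
            cwnd += a                        # congestion avoidance: +a per RTT
    return trace

# Slow start up to 8 segments, linear growth, then one loss in round trip 5.
trace = aimd_trace(rtts=8, ssthresh=8, loss_at={5})
```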

17 TCP Fast Retransmit & Recovery
• Duplicate ACKs are due to lost segments or segments out of order.
• Fast Retransmit: if the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected):
  - Sender re-transmits the missing segment
  - Set ssthresh to 0.5*cwnd, so enter the congestion avoidance phase
  - Set cwnd = (0.5*cwnd + 3), the 3 accounting for the 3 dup ACKs
  - Increase cwnd by 1 segment for each further duplicate ACK
  - Keep sending new data if allowed by cwnd
  - Set cwnd to half the original value on a new ACK
  - No need to go into “slow start” again
• At the steady state, cwnd oscillates around the optimal window size
• With a retransmission timeout, slow start is triggered again
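The window arithmetic in the fast retransmit and recovery steps above can be written out directly. This is only a sketch of the bookkeeping listed on the slide, in units of segments; it omits the retransmission itself and any timers, and `fast_recovery` with its event strings is a hypothetical interface.

```python
def fast_recovery(cwnd, events):
    """Apply the slide's rules starting at the third duplicate ACK.

    cwnd:   congestion window (segments) when the third dup ACK arrives
    events: ACKs seen afterwards, each "dup" (another duplicate) or "new"
    Returns (ssthresh, cwnd) after processing the events.
    """
    ssthresh = cwnd / 2       # ssthresh = 0.5 * cwnd
    cwnd = ssthresh + 3       # inflate by the 3 dup ACKs already received
    for ev in events:
        if ev == "dup":
            cwnd += 1         # each further dup ACK means a segment left the network
        elif ev == "new":
            cwnd = ssthresh   # deflate to half the original value, no slow start
            break
    return ssthresh, cwnd

# Window of 20 segments, two more dup ACKs, then the ACK for new data.
ssthresh, cwnd = fast_recovery(20, ["dup", "dup", "new"])
```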

18 TCP: Simple Tuning - Filling the Pipe
• Remember, TCP has to hold a copy of data in flight
• Optimal (TCP buffer) window size depends on:
  - Bandwidth end to end, i.e. min(BW_links), AKA the bottleneck bandwidth
  - Round Trip Time (RTT)
• The number of bytes in flight to fill the entire path:
  - Bandwidth*Delay Product: BDP = RTT*BW
  - Can increase bandwidth by orders of magnitude
• Windows are also used for flow control
[Diagram: Sender/Receiver timeline showing a segment, its ACK and the RTT. Segment time on wire = bits in segment / BW.]
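A short calculation shows what the bandwidth*delay product means for buffer sizing. The sketch below assumes a 1 Gbit/s bottleneck and the 177 ms London-Chicago-London round trip quoted later in the talk; the function names are illustrative.

```python
def bdp_bytes(bandwidth_bit_s: float, rtt_s: float) -> float:
    """Bandwidth*delay product: bytes in flight needed to fill the path."""
    return bandwidth_bit_s * rtt_s / 8

def window_limited_throughput(window_bytes: float, rtt_s: float) -> float:
    """Throughput in bit/s when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_s

# 1 Gbit/s at 177 ms needs roughly 22 MBytes of TCP buffer.
needed = bdp_bytes(1e9, 0.177)

# An untuned 64 KByte window on the same path is limited to about 3 Mbit/s.
untuned = window_limited_throughput(65535, 0.177)
```

The first result is in line with the 22 Mbyte socket size used for the rtt 177 ms bbftp transfers reported later in the slides.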

19 Standard TCP (Reno) – What’s the problem?
• TCP has 2 phases:
  - Slowstart: probe the network to estimate the available BW; exponential growth
  - Congestion Avoidance: main data transfer phase; transfer rate grows “slowly”
• AIMD and High Bandwidth – Long Distance networks
  - Poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm.
  - For each ack in a RTT without loss: cwnd -> cwnd + a/cwnd (Additive Increase, a = 1)
  - For each window experiencing loss: cwnd -> cwnd - b*cwnd (Multiplicative Decrease, b = ½)
• Packet loss is a killer!!

20 TCP (Reno) – Details of problem #1
• Time for TCP to recover its throughput from 1 lost 1500 byte packet given by: [formula shown as an image in the original slide]
• for rtt of ~200 ms @ 1 Gbit/s: 2 min
[Plot: recovery time vs rtt. Labels: UK 6 ms, Europe 25 ms, USA 150 ms. Times marked: 1.6 s, 26 s, 28 min.]
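The recovery time can be estimated from the AIMD rules alone. The sketch below is a derivation under stated assumptions, not necessarily the formula printed on the original slide: a full window is W = C*RTT/MSS segments, a loss halves it, and congestion avoidance adds one segment per RTT, so recovery takes W/2 round trips, i.e. about C*RTT^2 / (2*MSS). The function name and the 1500-byte default are illustrative.

```python
def reno_recovery_time(rate_bit_s: float, rtt_s: float, mss_bytes: int = 1500) -> float:
    """Seconds for standard TCP to climb back to full rate after one loss.

    Assumes cwnd is halved on loss and then grows by one segment per RTT.
    """
    window_segments = rate_bit_s * rtt_s / (mss_bytes * 8)  # segments in flight at full rate
    return (window_segments / 2) * rtt_s                    # W/2 round trips of length RTT

# Estimates at 1 Gbit/s for a few round-trip times (seconds).
for rtt in (0.006, 0.025, 0.200):
    print(f"rtt {rtt * 1000:.0f} ms: {reno_recovery_time(1e9, rtt):.1f} s")
```

Under these assumptions the estimate is about 1.5 s at 6 ms, about 26 s at 25 ms, and roughly 28 minutes at 200 ms; the quadratic dependence on RTT is why long paths suffer so much.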

21 Investigation of new TCP Stacks
• The AIMD Algorithm – Standard TCP (Reno)
  - For each ack in a RTT without loss: cwnd -> cwnd + a/cwnd (Additive Increase, a = 1)
  - For each window experiencing loss: cwnd -> cwnd - b*cwnd (Multiplicative Decrease, b = ½)
• High Speed TCP
  - a and b vary depending on the current cwnd, using a table
  - a increases more rapidly with larger cwnd, so it returns to the ‘optimal’ cwnd size sooner for the network path
  - b decreases less aggressively and, as a consequence, so does the cwnd. The effect is that there is not such a decrease in throughput.
• Scalable TCP
  - a and b are fixed adjustments for the increase and decrease of cwnd
  - a = 1/100: the increase is greater than TCP Reno
  - b = 1/8: the decrease on loss is less than TCP Reno
  - Scalable over any link speed.
• Fast TCP
  - Uses round trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput.
• HSTCP-LP, H-TCP, BiC-TCP
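The difference between Reno and Scalable TCP shows up clearly when recovery is counted in round trips. The sketch below is a back-of-the-envelope model, assuming one ACK per segment so that Scalable's fixed increment a per ACK multiplies cwnd by (1 + a) every RTT; the function names are illustrative.

```python
import math

def reno_recovery_rtts(window_segments: float) -> float:
    """Reno: cwnd is halved, then regains one segment per RTT."""
    return window_segments / 2

def scalable_recovery_rtts(a: float = 0.01, b: float = 0.125) -> float:
    """Scalable TCP: cwnd drops to (1 - b) of its value, then grows by (1 + a) per RTT."""
    return math.log(1 / (1 - b)) / math.log(1 + a)

# Reno's recovery grows with the window; Scalable's does not depend on it.
reno_small = reno_recovery_rtts(500)      # e.g. 1 Gbit/s at 6 ms, 1500-byte packets
reno_large = reno_recovery_rtts(16667)    # e.g. 1 Gbit/s at 200 ms
scalable = scalable_recovery_rtts()       # about 13 round trips at any window size
```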

22 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 22 Lets Check out this theory about new TCP stacks Does it matter ? Does it work?

23 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 23 Problem #1 Packet Loss Is it important ?

24 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 24 Packet Loss with new TCP Stacks uTCP Response Function Throughput vs Loss Rate – further to right: faster recovery Drop packets in kernel MB-NG rtt 6ms DataTAG rtt 120 ms

25 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 25 Packet Loss and new TCP Stacks uTCP Response Function UKLight London-Chicago-London rtt 177 ms 2.6.6 Kernel Agreement with theory good Some new stacks good at high loss rates

26 High Throughput Demonstrations
[Diagram: hosts man03 and lon01 (dual Xeon 2.2 GHz), each attached by 1 GEth to a Cisco 7609, linked across the MB-NG core (Cisco GSR, 2.5 Gbit SDH). Endpoints labelled Manchester (Geneva) and London (Chicago); rtt 6.2 ms (rtt 128 ms).]
• Send data with TCP
• Drop packets
• Monitor TCP with Web100

27 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 27 uDrop 1 in 25,000 urtt 6.2 ms uRecover in 1.6 s High Performance TCP – MB-NG StandardHighSpeed Scalable

28 High Performance TCP – DataTAG
• Different TCP stacks tested on the DataTAG network
• rtt 128 ms
• Drop 1 in 10^6
• High-Speed: rapid recovery
• Scalable: very fast recovery
• Standard: recovery would take ~20 mins

29 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 29 FAST demo via OMNInet and Datatag J. Mambretti, F. Yeh (Northwestern) t OMNInet Nortel Passport 8600 Nortel Passport 8600 Photonic Switch NU-E (Leverone) Workstations 2 x GE StarLight-Chicago CalTech Cisco 7609 2 x GE Photonic Switch Alcatel 1670 10GE Alcatel 1670 2 x GE OC-48 DataTAG 2 x GE Workstations CERN -Geneva San Diego FAST display CERN Cisco 7609 7,000 km A. Adriaanse, C. Jin, D. Wei (Caltech) S. Ravot (Caltech/CERN) FAST Demo Cheng Jin, David Wei Caltech Layer 2 path Layer 2/3 path

30 FAST TCP vs newReno
• Traffic flow, Channel #1: newReno. Utilization: 70%
• Traffic flow, Channel #2: FAST. Utilization: 90%

31 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 31 Problem #2 Is TCP fair? look at Round Trip Times & Max Transfer Unit

32 MTU and Fairness
• Two TCP streams share a 1 Gb/s bottleneck
• RTT = 117 ms
• MTU = 3000 bytes: average throughput over a period of 7000 s = 243 Mb/s
• MTU = 9000 bytes: average throughput over a period of 7000 s = 464 Mb/s
• Link utilization: 70.7%
[Diagram: Host #1 and Host #2 at CERN (GVA) on 1 GE into a GbE switch (1 GE bottleneck), router, POS 2.5 Gbps, router at Starlight (Chi), Host #1 and Host #2.]
Sylvain Ravot, DataTag 2003

33 RTT and Fairness
• Two TCP streams share a 1 Gb/s bottleneck
• CERN to Sunnyvale, RTT = 181 ms: average throughput over a period of 7000 s = 202 Mb/s
• CERN to Starlight, RTT = 117 ms: average throughput over a period of 7000 s = 514 Mb/s
• MTU = 9000 bytes
• Link utilization = 71.6%
[Diagram: Host #1 and Host #2 at CERN (GVA) on 1 GE into a GbE switch (1 GE bottleneck), router, POS 2.5 Gb/s to Starlight (Chi), then POS 10 Gb/s and 10GE on to Sunnyvale; hosts at Starlight and Sunnyvale.]
Sylvain Ravot, DataTag 2003

34 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 34 Problem #n Do TCP Flows Share the Bandwidth ?

35 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 35 uChose 3 paths from SLAC (California) Caltech (10ms), Univ Florida (80ms), CERN (180ms) uUsed iperf/TCP and UDT/UDP to generate traffic uEach run was 16 minutes, in 7 regions Test of TCP Sharing: Methodology (1Gbit/s) Ping 1/s Iperf or UDT ICMP/ping traffic TCP/UDP bottleneck iperf SLAC Caltech/UFL/CERN 2 mins 4 mins Les Cottrell PFLDnet 2005

36 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 36 uLow performance on fast long distance paths AIMD (add a=1 pkt to cwnd / RTT, decrease cwnd by factor b=0.5 in congestion) Net effect: recovers slowly, does not effectively use available bandwidth, so poor throughput Unequal sharing TCP Reno single stream Congestion has a dramatic effect Recovery is slow Increase recovery rate SLAC to CERN RTT increases when achieves best throughput Les Cottrell PFLDnet 2005 Remaining flows do not take up slack when flow removed

37 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 37 Fast uAs well as packet loss, FAST uses RTT to detect congestion RTT is very stable: σ(RTT) ~ 9ms vs 37±0.14ms for the others SLAC-CERN Big drops in throughput which take several seconds to recover from 2 nd flow never gets equal share of bandwidth

38 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 38 Hamilton TCP uOne of the best performers Throughput is high Big effects on RTT when achieves best throughput Flows share equally Appears to need >1 flow to achieve best throughput Two flows share equally SLAC-CERN > 2 flows appears less stable

39 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 39 Problem #n+1 To SACK or not to SACK ?

40 The SACK Algorithm
• SACK rationale
  - Non-contiguous blocks of data can be ACKed
  - Sender transmits just the lost packets
  - Helps when multiple packets are lost in one TCP window
• The SACK processing is inefficient for large bandwidth delay products
  - Sender write queue (linked list) is walked for each SACK block, to mark lost packets and to re-transmit
  - Processing takes so long that the input queue becomes full
  - Get timeouts
[Plots: standard SACKs, rtt 150 ms, HS-TCP; updated SACKs, rtt 150 ms. Host: Dell 1650, 2.8 GHz, PCI-X 133 MHz, Intel Pro/1000.]
Doug Leith, Yee-Ting Li

41 SACK …
• Look into what’s happening at the algorithmic level with web100:
• Strange hiccups in cwnd: the only correlation is SACK arrivals
[Plot: Scalable TCP on MB-NG with 200 Mbit/s CBR background]
Yee-Ting Li

42 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 42 Real Applications on Real Networks Disk-2-disk applications on real networks Memory-2-memory tests Transatlantic disk-2-disk at Gigabit speeds HEP&VLBI at SC|05 Remote Computing Farms The effect of TCP The effect of distance Radio Astronomy e-VLBI Leave for the talk later in the meeting

43 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 43 iperf Throughput + Web100 u SuperMicro on MB-NG network u HighSpeed TCP u Linespeed 940 Mbit/s u DupACK ? <10 (expect ~400) u BaBar on Production network u Standard TCP u 425 Mbit/s u DupACKs 350-400 – re-transmits

44 Applications: Throughput Mbit/s
• HighSpeed TCP
• 2 GByte file, RAID5
• SuperMicro + SuperJANET
• Applications compared: bbcp, bbftp, Apache, Gridftp
• Previous work used RAID0 (not disk limited)

45 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 45 Transatlantic Disk to Disk Transfers With UKLight SuperComputing 2004

46 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 46 bbftp: What else is going on? Scalable TCP u SuperMicro + SuperJANET Instantaneous 0 - 550 Mbit/s u Congestion window – duplicate ACK u Throughput variation not TCP related? Disk speed / bus transfer Application architecture u BaBar + SuperJANET Instantaneous 200 – 600 Mbit/s u Disk-mem ~ 590 Mbit/s remember the end host

47 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 47 SC2004 UKLIGHT Overview MB-NG 7600 OSR Manchester ULCC UKLight UCL HEP UCL network K2 Ci Chicago Starlight Amsterdam SC2004 Caltech Booth UltraLight IP SLAC Booth Cisco 6509 UKLight 10G Four 1GE channels UKLight 10G Surfnet/ EuroLink 10G Two 1GE channels NLR Lambda NLR-PITT-STAR-10GE-16 K2 Ci Caltech 7600

48 Transatlantic Ethernet: TCP Throughput Tests
• Supermicro X5DPE-G2 PCs
• Dual 2.9 GHz Xeon CPU, FSB 533 MHz
• 1500 byte MTU
• 2.6.6 Linux kernel
• Memory-memory TCP throughput
• Standard TCP
• Wire rate throughput of 940 Mbit/s
• [Plot: first 10 sec]
• Work in progress to study:
  - Implementation detail
  - Advanced stacks
  - Effect of packet loss
  - Sharing

49 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 49 SC2004 Disk-Disk bbftp ubbftp file transfer program uses TCP/IP uUKLight: Path:- London-Chicago-London; PCs:- Supermicro +3Ware RAID0 uMTU 1500 bytes; Socket size 22 Mbytes; rtt 177ms; SACK off uMove a 2 GByte file uWeb100 plots: uStandard TCP uAverage 825 Mbit/s u(bbcp: 670 Mbit/s) uScalable TCP uAverage 875 Mbit/s u(bbcp: 701 Mbit/s ~4.5s of overhead) uDisk-TCP-Disk at 1Gbit/s

50 Network & Disk Interactions (work in progress)
• Hosts:
  - Supermicro X5DPE-G2 motherboards
  - Dual 2.8 GHz Xeon CPUs with 512 k byte cache and 1 M byte memory
  - 3Ware 8506-8 controller on 133 MHz PCI-X bus configured as RAID0
  - Six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64k byte stripe size
• Measure memory to RAID0 transfer rates with & without UDP traffic
  - Disk write: 1735 Mbit/s
  - Disk write + 1500 MTU UDP: 1218 Mbit/s, a drop of 30%
  - Disk write + 9000 MTU UDP: 1400 Mbit/s, a drop of 19%
[Plots: % CPU in kernel mode]

51 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 51 Transatlantic Transfers With UKLight SuperComputing 2005

52 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 52 ESLEA and UKLight u6 * 1 Gbit transatlantic Ethernet layer 2 paths UKLight + NLR uDisk-to-disk transfers with bbcp Seattle to UK Set TCP buffer and application to give ~850Mbit/s One stream of data 840-620 Mbit/s uStream UDP VLBI data UK to Seattle 620 Mbit/s Reverse TCP

53 SC|05 – SLAC 10 Gigabit Ethernet
• 2 lightpaths:
  - Routed over ESnet
  - Layer 2 over Ultra Science Net
• 6 Sun V20Z systems per λ: 3 transmit, 3 receive
• dcache remote disk data access
  - 100 processes per node
  - Node sends or receives
  - One data stream 20-30 Mbit/s
• Used Neterion NICs & Chelsio TOE
• Data also sent to StorCloud using fibre channel links
• Traffic on the 10 GE link for 2 nodes: 3-4 Gbit/s per node, 8.5-9 Gbit/s on the trunk

54 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 54 Remote Computing Farms in the ATLAS TDAQ Experiment

55 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 55 ATLAS Remote Farms – Network Connectivity

56 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 56 ATLAS Application Protocol u Event Request EFD requests an event from SFI SFI replies with the event ~2Mbytes u Processing of event u Return of computation EF asks SFO for buffer space SFO sends OK EF transfers results of the computation u tcpmon - instrumented TCP request-response program emulates the Event Filter EFD to SFI communication. Send OK Send event data Request event ●●● Request Buffer Send processed event Process event Time Request-Response time (Histogram) Event Filter Daemon EFD SFI and SFO

57 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 57 tcpmon: TCP Activity Manc-CERN Req-Resp Round trip time 20 ms 64 byte Request green 1 Mbyte Response blue TCP in slow start 1st event takes 19 rtt or ~ 380 ms TCP Congestion window gets re-set on each Request TCP stack RFC 2581 & RFC 2861 reduction of Cwnd after inactivity Even after 10s, each response takes 13 rtt or ~260 ms Transfer achievable throughput 120 Mbit/s

58 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 58 tcpmon: TCP Activity Manc-CERN Req-Resp TCP stack tuned Round trip time 20 ms 64 byte Request green 1 Mbyte Response blue TCP starts in slow start 1 st event takes 19 rtt or ~ 380 ms TCP Congestion window grows nicely Response takes 2 rtt after ~1.5s Rate ~10/s (with 50ms wait) Transfer achievable throughput grows to 800 Mbit/s Data transferred WHEN the application requires the data

59 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 59 Round trip time 150 ms 64 byte Request green 1 Mbyte Response blue TCP starts in slow start 1 st event takes 11 rtt or ~ 1.67 s tcpmon: TCP Activity Alberta-CERN Req-Resp TCP stack tuned TCP Congestion window in slow start to ~1.8s then congestion avoidance Response in 2 rtt after ~2.5s Rate 2.2/s (with 50ms wait) Transfer achievable throughput grows slowly from 250 to 800 Mbit/s

60 Summary & Conclusions
• Standard TCP is not optimum for high throughput long distance links
• Packet loss is a killer for TCP
  - Check on campus links & equipment, and access links to backbones
  - Users need to collaborate with the Campus Network Teams
  - Dante PERT
• New stacks are stable and give better response & performance
  - Still need to set the TCP buffer sizes!
  - Check other kernel settings, e.g. window-scale maximum
  - Watch for “TCP Stack implementation Enhancements”
• TCP tries to be fair
  - Large MTU has an advantage
  - Short distances, small RTT, have an advantage
• TCP does not share bandwidth well with other streams
• The end hosts themselves
  - Plenty of CPU power is required for the TCP/IP stack as well as the application
  - Packets can be lost in the IP stack due to lack of processing power
  - Interaction between HW, protocol processing, and disk sub-system is complex
• Application architecture & implementation are also important
  - The TCP protocol dynamics strongly influence the behaviour of the application.
• Users are now able to perform sustained 1 Gbit/s transfers

61 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 61 More Information Some URLs 1 uUKLight web site: http://www.uklight.ac.uk uMB-NG project web site: http://www.mb-ng.net/ uDataTAG project web site: http://www.datatag.org/ uUDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net uMotherboard and NIC Tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/ “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards” FGCS Special issue 2004 http:// www.hep.man.ac.uk/~rich/ uTCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html uTCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks” Journal of Grid Computing 2004 uPFLDnet http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/ uDante PERT http://www.geant2.net/server/show/nav.00d00h002

62 More Information: Some URLs 2
• Lectures, tutorials etc. on TCP/IP:
  - www.nv.cc.va.us/home/joney/tcp_ip.htm
  - www.cs.pdx.edu/~jrb/tcpip.lectures.html
  - www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
  - www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
  - www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
  - www.jbmelectronics.com/tcp.htm
• Encyclopaedia: http://www.freesoft.org/CIE/index.htm
• TCP/IP Resources: www.private.org.il/tcpip_rl.html
• Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
• Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
• Assigned protocols, ports etc (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

63 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 63 Any Questions?

64 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 64 Backup Slides

65 Latency Measurements
• UDP/IP packets sent between back-to-back systems
  - Processed in a similar manner to TCP/IP
  - Not subject to flow control & congestion avoidance algorithms
  - Used the UDPmon test program
• Latency
• Round trip times measured using Request-Response UDP frames
• Latency as a function of frame size
  - Slope is given by: mem-mem copy(s) + pci + Gig Ethernet + pci + mem-mem copy(s)
  - Intercept indicates: processing times + HW latencies
• Histograms of ‘singleton’ measurements
• Tells us about:
  - Behavior of the IP stack
  - The way the HW operates
  - Interrupt coalescence

66 Throughput Measurements
• UDP throughput
• Send a controlled stream of UDP frames spaced at regular intervals
  - Parameters: n bytes per frame, number of packets, wait time between frames
[Diagram: Sender/Receiver exchange over time.
  1. Zero stats; receiver replies OK done
  2. Send data frames at regular intervals; record time to send, time to receive, and inter-packet time (histogram)
  3. Signal end of test; receiver replies OK done
  4. Get remote statistics; receiver sends statistics: no. received, no. lost + loss pattern, no. out-of-order, CPU load & no. of interrupts, 1-way delay]

67 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 67 PCI Bus & Gigabit Ethernet Activity uPCI Activity uLogic Analyzer with PCI Probe cards in sending PC Gigabit Ethernet Fiber Probe Card PCI Probe cards in receiving PC Gigabit Ethernet Probe CPU mem chipset NIC CPU mem NIC chipset Logic Analyser Display PCI bus Possible Bottlenecks

68 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 68 Network switch limits behaviour uEnd2end UDP packets from udpmon Only 700 Mbit/s throughput Lots of packet loss Packet loss distribution shows throughput limited

69 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 69 u SuperMicro P4DP8-2G (P4DP6) uDual Xeon u 400/522 MHz Front side bus u 6 PCI PCI-X slots u 4 independent PCI buses 64 bit 66 MHz PCI 100 MHz PCI-X 133 MHz PCI-X u Dual Gigabit Ethernet u Adaptec AIC-7899W dual channel SCSI u UDMA/100 bus master/EIDE channels data transfer rates of 100 MB/sec burst “Server Quality” Motherboards

70 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 70 “Server Quality” Motherboards u Boston/Supermicro H8DAR u Two Dual Core Opterons u 200 MHz DDR Memory Theory BW: 6.4Gbit u HyperTransport u 2 independent PCI buses 133 MHz PCI-X u 2 Gigabit Ethernet u SATA u ( PCI-e )

71 10 Gigabit Ethernet: UDP Throughput
• 1500 byte MTU gives ~2 Gbit/s
• Used 16144 byte MTU, max user length 16080
• DataTAG Supermicro PCs
  - Dual 2.2 GHz Xeon CPU, FSB 400 MHz
  - PCI-X mmrbc 512 bytes
  - Wire rate throughput of 2.9 Gbit/s
• CERN OpenLab HP Itanium PCs
  - Dual 1.0 GHz 64 bit Itanium CPU, FSB 400 MHz
  - PCI-X mmrbc 4096 bytes
  - Wire rate of 5.7 Gbit/s
• SLAC Dell PCs
  - Dual 3.0 GHz Xeon CPU, FSB 533 MHz
  - PCI-X mmrbc 4096 bytes
  - Wire rate of 5.4 Gbit/s

72 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 72 10 Gigabit Ethernet: Tuning PCI-X u16080 byte packets every 200 µs uIntel PRO/10GbE LR Adapter uPCI-X bus occupancy vs mmrbc Measured times Times based on PCI-X times from the logic analyser Expected throughput ~7 Gbit/s Measured 5.7 Gbit/s mmrbc 1024 bytes mmrbc 2048 bytes mmrbc 4096 bytes 5.7Gbit/s mmrbc 512 bytes CSR Access PCI-X Sequence Data Transfer Interrupt & CSR Update

73 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 73 Congestion control: ACK clocking

74 End Hosts & NICs: CERN-nat-Manc.
[Plots: Request-Response Latency; Throughput; Packet Loss; Re-Order]
• Use UDP packets to characterise host, NIC & network
  - SuperMicro P4DP8 motherboard
  - Dual Xeon 2.2 GHz CPU
  - 400 MHz system bus
  - 64 bit 66 MHz PCI / 133 MHz PCI-X bus
• The network can sustain 1 Gbps of UDP traffic
• The average server can lose smaller packets
• Packet loss is caused by lack of power in the PC receiving the traffic
• Out-of-order packets are due to WAN routers
• Lightpaths look like extended LANs and have no re-ordering

75 tcpdump / tcptrace
• tcpdump: dump all TCP header information for a specified source/destination
  - ftp://ftp.ee.lbl.gov/
• tcptrace: format tcpdump output for analysis using xplot
  - http://www.tcptrace.org/
  - NLANR TCP Testrig: nice wrapper for the tcpdump and tcptrace tools, http://www.ncne.nlanr.net/TCP/testrig/
• Sample use:
    tcpdump -s 100 -w /tmp/tcpdump.out host hostname
    tcptrace -Sl /tmp/tcpdump.out
    xplot /tmp/a2b_tsg.xpl

76 tcptrace and xplot
• X axis is time
• Y axis is sequence number
• The slope of this curve gives the throughput over time.
• The xplot tool makes it easy to zoom in
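Because the plot is sequence number against time, the slope between any two points is a throughput. A minimal sketch of that calculation follows; the sample format (time in seconds, sequence number in bytes) and the function name are assumptions, not the output format of tcptrace.

```python
def throughput_mbit_s(samples):
    """Average throughput between the first and last (time_s, seq_bytes) sample.

    The slope of sequence number vs time is bytes per second; multiply by 8
    for bits and divide by 1e6 for Mbit/s.
    """
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    return (s1 - s0) * 8 / (t1 - t0) / 1e6

# Hypothetical trace: 125 MBytes acknowledged in one second is 1000 Mbit/s.
rate = throughput_mbit_s([(0.0, 0), (0.5, 60_000_000), (1.0, 125_000_000)])
```

Applying the same function to a short slice of the samples gives the instantaneous rate, which is what zooming in on the xplot curve shows visually.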

77 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 77 Zoomed In View uGreen Line: ACK values received from the receiver uYellow Line tracks the receive window advertised from the receiver uGreen Ticks track the duplicate ACKs received. uYellow Ticks track the window advertisements that were the same as the last advertisement. uWhite Arrows represent segments sent. uRed Arrows (R) represent retransmitted segments

78 5 Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory R. Hughes-Jones Manchester 78 TCP Slow Start

