Experiences Tuning Cluster Hosts: 1GigE and 10GbE
Paul Hyder
Cooperative Institute for Research in Environmental Sciences, CU Boulder
(CIRES at NOAA/ESRL/GSD High Performance Computing)
Paul.Hyder at noaa.gov
Tuning Focus
- Cluster front ends and cron server hosts
- File transfer servers (scponly)
- BWCTL host
- Remote client hosts
- 10GbE testbed (7.2 Gb/sec uses ~49% of one 3 GHz CPU)
How We Apply the Well-Known Rules
- Jumbo frames
  - 8K on hosts
  - 9K on the network
- Tune TCP to match the bandwidth-delay product (BDP)
- Encourage application writers to use large read and write buffers
- Install tuned applications
  - PSC.edu patch to OpenSSH (channels.h):

```
#define CHAN_TCP_PACKET_DEFAULT (32*1024)
#define CHAN_TCP_WINDOW_DEFAULT (4*CHAN_TCP_PACKET_DEFAULT)
```
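As a sketch of the BDP arithmetic behind "tune TCP to match BDP": the link rate and RTT below are illustrative placeholders, and the sysctl names are the standard Linux 2.6 knobs. The script only prints the sysctl commands rather than running them (they need root).

```shell
#!/bin/sh
# Sketch: size TCP buffers to the bandwidth-delay product (BDP).
# Illustrative values -- substitute your own link rate and measured RTT.
RATE_BPS=1000000000   # 1 Gb/s path
RTT_MS=70             # round-trip time (e.g. from ping), in milliseconds

# BDP in bytes = (bits/sec / 8) * (RTT in seconds)
BDP=$(( RATE_BPS / 8 * RTT_MS / 1000 ))
echo "BDP = $BDP bytes"

# Emit the matching sysctl settings; pipe to a root shell to apply.
echo "sysctl -w net.core.rmem_max=$BDP"
echo "sysctl -w net.core.wmem_max=$BDP"
echo "sysctl -w net.ipv4.tcp_rmem=\"4096 87380 $BDP\""
echo "sysctl -w net.ipv4.tcp_wmem=\"4096 65536 $BDP\""
```

The max values only raise the ceiling; TCP autotuning still grows each connection's window on demand.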
Throughput Testing
- iperf (2.0.2) from shell scripts
  - Vary buffer (-l) and window (-w)
  - Modify ifconfig and PCI configuration
  - Full loop takes 3 days
- bwctl with remote hosts
  - Anyone on NLR?
- Use scp/sftp/rsync as the final test
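A minimal sketch of the iperf sweep, assuming iperf 2.x's -l (buffer) and -w (window) flags; the server hostname and the size grids are placeholders, and an "iperf -s" must already be listening on the far end. The commands are printed (a dry run) so the sweep can be inspected; pipe the output to sh to actually run it.

```shell
#!/bin/sh
# Sketch of the iperf throughput sweep: vary buffer (-l) and window (-w).
# SERVER is a hypothetical host; sizes below are illustrative.
SERVER=iperf-server.example.net

sweep() {
    for l in 8K 32K 128K; do              # iperf read/write buffer sizes
        for w in 256K 1M 4M 16M; do       # requested TCP window sizes
            echo "iperf -c $SERVER -t 30 -l $l -w $w"
        done
    done
}

sweep
```

The full test loop in the talk also toggles ifconfig and PCI settings between runs, which is why it takes days rather than minutes.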
I'm Curious
- How much TCP tuning information do you provide to users and admins?
- Are hosts being tuned?
- Does your internal LAN support jumbo frames?
GSD Cluster GigE Defaults
- [wr]mem_default: 2MB
- [wr]mem_max: 16MB
- ipv4/tcp_[wr]mem: 64KB 2MB 16MB
- optmem_max: 512K
- txqueuelen
- netdev_max_backlog: 3000
- ipv4/tcp_sack and ipv4/tcp_timestamps: on
- Don't touch ipv4/tcp_mem
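The defaults above map onto an /etc/sysctl.conf fragment roughly as follows; this is a sketch of that translation, not the site's actual file. txqueuelen is set per interface with ifconfig rather than sysctl (the slide does not give its value), and tcp_mem is deliberately left alone.

```
# Sketch of the GSD GigE defaults as an /etc/sysctl.conf fragment.
net.core.rmem_default = 2097152
net.core.wmem_default = 2097152
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 65536 2097152 16777216
net.ipv4.tcp_wmem = 65536 2097152 16777216
net.core.optmem_max = 524288
net.core.netdev_max_backlog = 3000
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
# net.ipv4.tcp_mem left at the kernel default on purpose
```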
Jumbo Frame Plot
What Doesn't Work
- Jumbo frames
  - Switch fabrics
    - High-density cards
    - Complex VLAN configurations
    - Stand-alone GigE switches
  - Firewalls
  - ICMP for path MTU discovery
    - Disabled completely
    - Network devices don't respond
Linux 2.6 and Jumbos

```
IP hostA > hostB.22: S 544:544(0) win
IP hostB.22 > hostA.52434: S 207:207(0) ack 545 win
IP hostA > hostB.22: . 2255:6599(4344) ack 2293 win
IP hostA > hostB.22: P 6599:10943(4344) ack 2293 win
IP router > hostA: icmp 36: hostB unreachable - need to frag (mtu 1500)
IP hostA > hostB.22: . 2255:3703(1448) ack 2293 win 16304
```
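The trace above shows the healthy case: the router answers the 4344-byte segments with an ICMP "need to frag", and the host falls back to 1448-byte segments. One way to check a path end-to-end before relying on that (the hostname is a placeholder) is a DF-bit ping sized to fill the MTU: payload = MTU - 28, since the IPv4 header takes 20 bytes and the ICMP header 8.

```shell
#!/bin/sh
# Sketch: probe whether a path really carries jumbo frames.
# An ICMP echo payload of (MTU - 28) bytes produces an MTU-sized
# packet: 20-byte IPv4 header + 8-byte ICMP header.
icmp_payload() {
    echo $(( $1 - 28 ))
}

MTU=9000
# hostB is a placeholder; -M do sets DF so nothing may fragment.
echo "ping -c 3 -M do -s $(icmp_payload $MTU) hostB"
```

If the ping gets no reply and no "need to frag" error, something on the path is eating the ICMP, which is exactly the blackhole case from the previous slide.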
Host-Side Checks
- Interrupt aggregation (Linux NAPI)
- Memory to match buffer tuning
- More than one CPU
- Static ARP entries
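A sketch of the static ARP entry idea, so a long transfer never stalls on an ARP cache expiry; the IP, MAC, and interface below are placeholders. The commands are printed rather than run, since both require root; either the classic arp tool or iproute2 works.

```shell
#!/bin/sh
# Sketch: pin a static ARP entry for a frequently-used peer.
# Address, MAC, and interface are illustrative placeholders.
PEER_IP=192.0.2.10
PEER_MAC=00:11:22:33:44:55
IFACE=eth1

# Printed rather than executed (both need root):
echo "arp -s $PEER_IP $PEER_MAC"
echo "ip neigh replace $PEER_IP lladdr $PEER_MAC nud permanent dev $IFACE"
```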
Network Device Settings
- Static ARP entries, or increase the timeout
- Increase FDB timeouts
- Verify jumbo frame configuration
10GbE Quick Notes
- Know your PCI hardware (MMRBC, latency timer, and splits)
- TCP stack latency is ~0.200 ms
- Increase netdev_max_backlog so that throughput ≈ backlog × HZ (100) × average bytes/packet
- Set the *_cong sysctls to the CERN values
- Use ~128KB write buffers in application code
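Rearranging the slide's backlog rule of thumb gives backlog ≥ throughput / (HZ × average packet size). This sketch assumes HZ=100 and an ~8000-byte average jumbo packet; both are illustrative, so substitute your kernel's HZ and your measured packet size.

```shell
#!/bin/sh
# Sketch of sizing netdev_max_backlog from the slide's rule of thumb:
#   backlog >= throughput / (HZ * avg_bytes_per_packet)
# HZ=100 and an 8000-byte average packet are assumptions here.
backlog_needed() {
    rate_bps=$1; hz=$2; avg_bytes=$3
    # rate_bps/8 converts bits/sec to bytes/sec
    echo $(( rate_bps / 8 / (hz * avg_bytes) ))
}

# 10 Gb/s with 8000-byte packets at HZ=100 -- prints 1562:
backlog_needed 10000000000 100 8000
```

The default of 3000 from the GigE slide already clears this for jumbo frames; small-packet workloads need a proportionally larger backlog.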
10G buffer plot
Questions?
Reference URLs