Performance measurement with ZeroMQ and FairMQ Mohammad Al-Turany 20/02/15 CWG13 Meeting
ZeroMQ performance test suite
ZeroMQ delivers some tools to measure the bandwidth and latency of the network. The following executables are built by default and located in the perf subdirectory:
- local_lat
- remote_lat
- local_thr
- remote_thr
ØMQ performance test suite: latency test
The latency test consists of local_lat and remote_lat. These are to be placed on the two boxes between which you wish to measure latency. We have not performed this test so far.
Example:
    $ local_lat tcp://eth0:5555 1 100000
    $ remote_lat tcp://192.168.0.111:5555 1 100000
    message size: 1 [B]
    roundtrip count: 100000
    average latency: 30.915 [us]
The latency reported is the one-way latency.
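The relation between the measured round trips and the reported one-way figure can be sketched as follows. This is a minimal sketch, assuming the tool divides the total elapsed time by twice the round-trip count; the function name and the elapsed time in the usage line are hypothetical and chosen only to reproduce the figure quoted above.

```python
def one_way_latency_us(elapsed_us: float, roundtrip_count: int) -> float:
    """Average one-way latency in microseconds.

    Assumption: every round trip consists of two one-way transfers, so the
    total elapsed time is divided by 2 * roundtrip_count.
    """
    if roundtrip_count <= 0:
        raise ValueError("roundtrip_count must be positive")
    return elapsed_us / (2 * roundtrip_count)


# Hypothetical elapsed time (in microseconds) for 100000 round trips.
print(f"average latency: {one_way_latency_us(6_183_000, 100_000):.3f} [us]")
```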
ØMQ performance test suite: throughput test
The throughput test consists of local_thr and remote_thr. These are to be placed on the two boxes between which you wish to measure throughput.
Example:
    $ local_thr tcp://eth0:5555 1 100000
    $ remote_thr tcp://192.168.0.111:5555 1 100000
    message size: 1 [B]
    message count: 1000000
    mean throughput: 5554568 [msg/s]
    mean throughput: 44.437 [Mb/s]
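The two throughput figures in the example output are related by the message size. A minimal sketch of the conversion, assuming decimal megabits (10^6 bits); the function name is hypothetical:

```python
def megabits_per_second(msgs_per_second: float, message_size_bytes: int) -> float:
    """Convert a message rate into megabits per second (1 Mb = 10**6 bits)."""
    return msgs_per_second * message_size_bytes * 8 / 1_000_000


# Figures from the example output: 5554568 msg/s with 1-byte messages.
print(f"mean throughput: {megabits_per_second(5_554_568, 1):.3f} [Mb/s]")
```

With 1-byte messages the link is far from saturated, so this figure mainly reflects the per-message overhead rather than the available bandwidth.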
Running the ZeroMQ performance test on the DAQ test cluster
[Four slides with this title; their figures are not reproduced here.]
Performance test with FairMQ FLP 2 EPN
Push-Pull pattern, message size = 10 MByte

Nodes                     Throughput
aidrefma02, aidrefma01    2.6 GByte/s
aidrefma02, aidrefma01    3.7 GByte/s
aidrefma03, aidrefma01    4.8 GByte/s
Is a node that uses 3 (4) cores to receive data via Ethernet or IP over InfiniBand at a rate of more than 4 GByte/s still usable for reconstruction?
STREAM: Sustainable Memory Bandwidth in High Performance Computers
A simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
It is specifically designed to work with datasets much larger than the available cache on any given system, so that the results are (presumably) more indicative of the performance of very large, vector-style applications.
http://www.cs.virginia.edu/stream/
STREAM settings
    This system uses 8 bytes per array element.
    -------------------------------------------------------------
    Array size = 200000000 (elements), Offset = 0 (elements)
    Memory per array = 1525.9 MiB (= 1.5 GiB).
    Total memory required = 4577.6 MiB (= 4.5 GiB).
    Each kernel will be executed 10 times.
    The *best* time for each kernel (excluding the first iteration)
    will be used to compute the reported bandwidth.
    Number of Threads requested = 12
    Number of Threads counted = 12
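The memory figures in these settings follow directly from the array size. A minimal sketch, assuming the three work arrays that STREAM allocates (a, b and c) and 8 bytes per element as stated above; the function name is hypothetical:

```python
MIB = 1024 ** 2  # bytes per MiB


def stream_memory_mib(array_elements: int, bytes_per_element: int = 8, arrays: int = 3):
    """Return (MiB per array, total MiB) for the given STREAM array size."""
    per_array = array_elements * bytes_per_element / MIB
    return per_array, per_array * arrays


per_array, total = stream_memory_mib(200_000_000)
print(f"Memory per array = {per_array:.1f} MiB, total = {total:.1f} MiB")
```

At roughly 1.5 GiB per array, the working set is far larger than the CPU caches, which is what makes the result a measure of main-memory bandwidth.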
STREAM is intended to measure the bandwidth from main memory.
Performance and bandwidth test with FairMQ FLP 2 EPN
Node: aidrefma02
CERN DAQ Lab system: 40 G Ethernet, dual-socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90GHz, 2x8 hw cores - 32 threads, 64GB RAM

Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:       15258.3           0.017153    0.010486    0.025462
Scale:      15019.2           0.017180    0.010653    0.025397
Add:        16883.6           0.021488    0.014215    0.036001
Triad:      16831.6           0.021190    0.014259    0.035066

name      kernel                   bytes/iter    FLOPS/iter
COPY:     a(i) = b(i)              16            0
SCALE:    a(i) = q*b(i)            16            1
SUM:      a(i) = b(i) + c(i)       24            1
TRIAD:    a(i) = b(i) + q*c(i)     24            2
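The "Best Rate" column can be related to the bytes/iter and "Min time" columns. The sketch below assumes that the rate is the number of bytes moved per kernel pass divided by the minimum time, with 1 MB = 10^6 bytes, and that the array size is the 10000000 elements shown in the backup STREAM output; the function name is hypothetical. Because the printed times are rounded, the recomputed rates agree with the table only to within a fraction of an MB/s.

```python
# Bytes moved per loop iteration for each kernel (from the table above).
BYTES_PER_ITER = {"Copy": 16, "Scale": 16, "Add": 24, "Triad": 24}


def best_rate_mb_s(kernel: str, array_elements: int, min_time_s: float) -> float:
    """Bandwidth in MB/s (10**6 bytes) for one pass over the arrays."""
    return 1e-6 * BYTES_PER_ITER[kernel] * array_elements / min_time_s


# Minimum times in seconds, taken from the table above.
MIN_TIMES = {"Copy": 0.010486, "Scale": 0.010653, "Add": 0.014215, "Triad": 0.014259}

for name, t_min in MIN_TIMES.items():
    print(f"{name:6s} {best_rate_mb_s(name, 10_000_000, t_min):10.1f} MB/s")
```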
Performance and bandwidth test with FairMQ FLP 2 EPN
Nodes: aidrefma01, aidrefma02 (FLP, EPN)
Transfer: 8 MB messages at 4.7 GByte/s
CERN DAQ Lab system: 40 G Ethernet, dual-socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90GHz, 2x8 hw cores - 32 threads, 64GB RAM

Function    Best Rate MB/s
Copy:       12782.6
Scale:      12319.0
Add:        14210.4
Triad:      14317.3

Changes as listed on the slide: -16 %, -18 %, -15 %
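The relative change of each kernel can be recomputed from the two sets of STREAM results. This is a minimal sketch, assuming the earlier STREAM table (Copy 15258.3, Scale 15019.2, Add 16883.6, Triad 16831.6 MB/s) is the baseline without data transfer; the variable and function names are hypothetical.

```python
# Best rates in MB/s: baseline run versus run during the 4.7 GByte/s transfer.
BASELINE = {"Copy": 15258.3, "Scale": 15019.2, "Add": 16883.6, "Triad": 16831.6}
WITH_TRANSFER = {"Copy": 12782.6, "Scale": 12319.0, "Add": 14210.4, "Triad": 14317.3}


def relative_change_percent(before: float, after: float) -> float:
    """Signed change of `after` relative to `before`, in percent."""
    return (after - before) / before * 100.0


for kernel in BASELINE:
    change = relative_change_percent(BASELINE[kernel], WITH_TRANSFER[kernel])
    print(f"{kernel:6s} {change:6.1f} %")
```

Under that assumption the memory bandwidth available to other work drops by roughly 15 to 18 % on all four kernels while the node receives data.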
Performance and bandwidth test with FairMQ FLP 2 EPN
CPU time in seconds needed to simulate 1000 events, 10 protons, in FairRoot example 3
Nodes: aidrefma01, aidrefma02 (FLP, EPN)
Run: 12 processes (Geant)
Transfer: 4 MB messages at 4.5 GByte/s; 8 MB messages at 4.7 GByte/s
CERN DAQ Lab system: 40 G Ethernet, dual-socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90GHz, 2x8 hw cores - 32 threads, 64GB RAM

Columns: Without MQ | With 4 MB Messages | With 8 MB Messages
Per-process values as extracted (column assignment lost): 54 61 68 58 64 66 62 56 57 55 63 67 60 65
Averages as listed: 57.3, 62.1, 61.2
Differences as listed: 5%, 4%
Performance and bandwidth test with FairMQ FLP 2 EPN
CPU time in seconds needed to simulate 1000 events, 100 protons, in FairRoot example 3
Nodes: aidrefma01, aidrefma02 (FLP, EPN)
Run: 12 processes (Geant)
Transfer: 8 MB messages at 4.7 GByte/s, 2.8 TByte total data transfer
CERN DAQ Lab system: 40 G Ethernet, dual-socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90GHz, 2x8 hw cores - 32 threads, 64GB RAM

Columns: Without MQ | With 8 MB Messages
Per-process values as extracted (column assignment lost): 565 605 573 615 570 598 603 602 563 601 619 576 616 574 606 567 609 577 595
Averages: 570.2 (without MQ), 605.6 (with 8 MB messages)
Difference: 6%
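The overhead and the total transferred volume can be cross-checked with a few lines of arithmetic. This is a rough sketch under two assumptions that the slide does not state: the transfer ran at 4.7 GByte/s for about as long as the average run with messages (605.6 s), and 1 TByte = 1000 GByte. The function names are hypothetical.

```python
def overhead_percent(baseline_s: float, loaded_s: float) -> float:
    """Extra CPU time of the loaded run relative to the baseline, in percent."""
    return (loaded_s - baseline_s) / baseline_s * 100.0


def transferred_tbyte(rate_gbyte_s: float, duration_s: float) -> float:
    """Data volume in TByte, assuming a constant rate and 1 TByte = 1000 GByte."""
    return rate_gbyte_s * duration_s / 1000.0


# Averages from the slide: 570.2 s without MQ, 605.6 s with 8 MB messages.
print(f"overhead: {overhead_percent(570.2, 605.6):.1f} %")
print(f"volume:   {transferred_tbyte(4.7, 605.6):.2f} TByte")
```

Both results are consistent with the 6% and 2.8 TByte quoted on the slide.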
Backup and Discussion
Run on STREAM version $Revision: 5.10 $
    -------------------------------------------------------------
    This system uses 8 bytes per array element.
    Array size = 10000000 (elements), Offset = 0 (elements)
    Memory per array = 76.3 MiB (= 0.1 GiB).
    Total memory required = 228.9 MiB (= 0.2 GiB).
    Each kernel will be executed 10 times.
    The *best* time for each kernel (excluding the first iteration)
    will be used to compute the reported bandwidth.
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 21173 microseconds.
    (= 21173 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.

    Function    Best Rate MB/s    Avg time    Min time    Max time
    Copy:       15258.3           0.017153    0.010486    0.025462
    Scale:      15019.2           0.017180    0.010653    0.025397
    Add:        16883.6           0.021488    0.014215    0.036001
    Triad:      16831.6           0.021190    0.014259    0.035066