Recent Communication Optimizations in Charm++
Nitin Bhat
Software Engineer, Charmworks, Inc.
16th Annual Charm Workshop 2018
Agenda
- Existing Charm++ Messaging API
- Motivation
- Zero-copy Entry Method Send API using RDMA
- Zero-copy Direct API using RDMA
- Results
- Using SHM transport using CMA
- Summary
Charm++ Messaging API
[Diagram: how the copy-based approach handles an entry method with multiple parameters]
forcecalculations.ci: Charm Interface File - Declarations

    module forcecalculations {
      ...
      array [1D] Cell {
        entry forces();
        entry void recv_forces(double forces[size], int size, double value);
      };
      ...
    };

forcecalculations.C: C++ Code File – Entry method

    void recv_forces(double *forces, int size, double value) {
      ...
    }

forcecalculations.C: C++ Code File – Call site

    Cell_Proxy[n].recv_forces(forces, 1000000, 4.0);
Regular Messaging API - What happens under the hood?
[Diagram: on Node 0, the call Cell_Proxy[n].recv_forces(forces, size, value) marshals forces, size, and value into a single message behind a header. The message crosses the network. On Node 1, Charm++ un-marshals the parameters from the message and invokes void recv_forces(double *forces, int size, double value).]
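The marshalling step on this slide can be sketched in plain C++, outside of Charm++. This is a minimal illustration under stated assumptions, not the runtime's actual message layout: `MsgHeader`, `marshal`, `unmarshal`, and `Unmarshalled` are hypothetical names, and a real header carries far more state. The point it shows is that the whole `forces` array is copied into the message buffer on the sender.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical header; a real runtime header carries far more routing state.
struct MsgHeader {
    int entryId;               // which entry method to invoke on the receiver
    std::size_t payloadBytes;  // bytes of marshalled parameters that follow
};

// Sender side: copy every parameter into one contiguous message buffer.
// The array `forces` is copied in full, which is the cost the slides target.
std::vector<char> marshal(int entryId, const double* forces, int size, double value) {
    const std::size_t arrayBytes = static_cast<std::size_t>(size) * sizeof(double);
    MsgHeader hdr{entryId, sizeof(int) + sizeof(double) + arrayBytes};
    std::vector<char> msg(sizeof(MsgHeader) + hdr.payloadBytes);
    char* p = msg.data();
    std::memcpy(p, &hdr, sizeof hdr);      p += sizeof hdr;
    std::memcpy(p, &size, sizeof size);    p += sizeof size;
    std::memcpy(p, &value, sizeof value);  p += sizeof value;
    std::memcpy(p, forces, arrayBytes);    // the large copy
    return msg;
}

// Parameters recovered on the receiver.
struct Unmarshalled {
    int entryId;
    int size;
    double value;
    std::vector<double> forces;
};

// Receiver side: read the header and the parameters back out of the message.
// Copying the array into a vector keeps the sketch simple; a runtime could
// instead hand the entry method a pointer into the message buffer.
Unmarshalled unmarshal(const std::vector<char>& msg) {
    Unmarshalled out{};
    MsgHeader hdr{};
    const char* p = msg.data();
    std::memcpy(&hdr, p, sizeof hdr);              p += sizeof hdr;
    std::memcpy(&out.size, p, sizeof out.size);    p += sizeof out.size;
    std::memcpy(&out.value, p, sizeof out.value);  p += sizeof out.value;
    out.entryId = hdr.entryId;
    out.forces.resize(static_cast<std::size_t>(out.size));
    std::memcpy(out.forces.data(), p, out.forces.size() * sizeof(double));
    return out;
}
```

With the call from the earlier slide, `marshal(id, forces, 1000000, 4.0)` would allocate and fill roughly 8 MB for the array alone before anything reaches the network.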
Motivation
- Memory system is the bottleneck
  - Faster cores and fatter nodes
  - Processor performance has been scaling much better than memory performance over the years
- On RDMA/CMA enabled systems, avoid copies of the large buffer by minor changes in the application logic
- Advantages:
  - Reduce memory footprint
  - Improve performance by reducing memory allocation size and avoiding copy
  - Reduce page faults, data cache misses
Zero-copy Entry Method Send API
forcecalculations.ci: Charm Interface File - Declarations

    module forcecalculations {
      ...
      array [1D] Cell {
        entry forces();
        entry void recv_forces(nocopy double forces[size], int size, double value);
      };
      ...
    };

forcecalculations.C: C++ Code File – Entry method

    void recv_forces(double *forces, int size, double value) {
      ...
    }

forcecalculations.C: C++ Code File – Call site

    // Callback invoked on the sender once the zero-copy transfer completes
    CkCallback cb(CkIndex_Cell::completed(NULL), cellArrayID);
    Cell_Proxy[n].recv_forces(CkSendBuffer(forces, cb), 1000000, 4.0);
Zero-copy Entry Method Send API - What happens under the hood?
[Diagram: on Node 0, the call Cell_Proxy[n].recv_forces(CkSendBuffer(forces, cb), size, value) marshals only size and value behind the header; forces is not copied into the message. The small message crosses the network. On Node 1, Charm++ un-marshals the parameters and fetches forces from Node 0 with an RGET, then invokes void recv_forces(double *forces, int size, double value). The sender's callback cb is invoked on Node 0 once the transfer completes.]
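The difference from the regular path can be sketched in plain C++ as follows. This is an illustration under loud assumptions, not Charm++ code: `BufferInfo`, `SmallMsg`, `packWithoutCopy`, `simulatedRget`, and `receive` are hypothetical names, and the "RGET" is faked with `memcpy`, which only works here because sender and receiver share one address space. On a real system the network hardware performs the remote read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical descriptor that travels in the small message instead of the data.
struct BufferInfo {
    const void* addr;   // where the source buffer lives on the sender
    std::size_t bytes;  // how many bytes the receiver should fetch
};

// Hypothetical small message: scalars by value, the array only by descriptor.
struct SmallMsg {
    BufferInfo forcesInfo;
    int size;
    double value;
};

// Sender: no copy of `forces` is made; only its descriptor is packed.
SmallMsg packWithoutCopy(const double* forces, int size, double value) {
    return SmallMsg{{forces, static_cast<std::size_t>(size) * sizeof(double)}, size, value};
}

// Stand-in for the RGET. Real RDMA hardware reads remote memory; this memcpy
// only works because both sides share one address space in this sketch.
void simulatedRget(const BufferInfo& src, void* dst) {
    std::memcpy(dst, src.addr, src.bytes);
}

// Receiver: allocate space, fetch the array, then run the sender's completion
// callback so the sender knows the transfer is done.
std::vector<double> receive(const SmallMsg& msg, const std::function<void()>& senderDone) {
    std::vector<double> forces(static_cast<std::size_t>(msg.size));
    simulatedRget(msg.forcesInfo, forces.data());
    senderDone();
    return forces;
}
```

The message itself stays a few dozen bytes regardless of how large `forces` is, which is where the sender-side saving comes from.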
Zero-copy Direct API
forcecalculations.ci: Charm Interface File - Declarations

    module forcecalculations {
      ...
      array [1D] Cell {
        entry forces();
        entry void recv_forces(CkNcpySource src, int size, double value);
      };
      ...
    };

forcecalculations.C: C++ Code File – Entry method

    void recv_forces(CkNcpySource src, int size, double value) {
      // Callback invoked on the receiver once the data has arrived in myForces
      CkCallback recv_cb(CkIndex_Cell::recv_completed(NULL), cellArrayID);
      CkNcpyDestination dest(myForces, size*sizeof(double), recv_cb, CK_BUFFER_REG);
      dest.rget(src);
    }

forcecalculations.C: C++ Code File – Call site

    // Callback invoked on the sender once the receiver has fetched the data
    CkCallback send_cb(CkIndex_Cell::send_completed(NULL), cellArrayID);
    CkNcpySource src(forces, size*sizeof(double), send_cb, CK_BUFFER_REG);
    Cell_Proxy[n].recv_forces(src, 1000000, 4.0);
Zero-copy Direct API - What happens under the hood?
[Diagram: on Node 0, the sender creates CkNcpySource src(forces, size*sizeof(double), send_cb) and calls Cell_Proxy[n].recv_forces(src, size, value). Only src, size, and value are marshalled behind the header and sent over the network. On Node 1, Charm++ un-marshals the parameters and invokes void recv_forces(CkNcpySource src, int size, double value), which calls dest.rget(src). The RGET moves forces from Node 0 directly into myForces on Node 1. A sender callback on Node 0 and a receiver callback on Node 1 signal completion.]
Modes of Operation in Direct API to support memory registration (gni, verbs, ofi)
- CK_BUFFER_UNREG - Default mode
  - Unregistered at the beginning
  - Delayed registration if required
- CK_BUFFER_REG
  - Registered by the API
- CK_BUFFER_PREREG
  - Registered before the API call by allocating memory out of a pre-registered mempool
- CK_BUFFER_NOREG
  - No registration
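One way to read the four modes is as a small state model of when a buffer becomes registered. The sketch below is only that reading, written in plain C++: `RegMode`, `BufferState`, `makeBuffer`, and `prepareForTransfer` are hypothetical names, not Charm++ definitions, and the flag `transferNeedsRegistration` is an assumed stand-in for "the network layer requires registered memory for this transfer".

```cpp
#include <cassert>

// Hypothetical mirror of the four modes listed above; not the Charm++ definitions.
enum class RegMode { Unreg, Reg, Prereg, Noreg };

// Hypothetical buffer state tracked by this sketch.
struct BufferState {
    RegMode mode;
    bool registered;
};

// State right after the buffer object is constructed.
BufferState makeBuffer(RegMode mode) {
    // Reg: registered by the API at construction.
    // Prereg: memory came from a pre-registered pool, so it is already registered.
    // Unreg and Noreg: start out unregistered.
    const bool registered = (mode == RegMode::Reg || mode == RegMode::Prereg);
    return BufferState{mode, registered};
}

// Called just before a transfer. Returns true if a registration happened now.
bool prepareForTransfer(BufferState& buf, bool transferNeedsRegistration) {
    if (buf.mode == RegMode::Unreg && !buf.registered && transferNeedsRegistration) {
        buf.registered = true;  // delayed registration, only when required
        return true;
    }
    return false;  // Reg and Prereg are already registered; Noreg never registers
}
```

Under this model, the default mode defers the registration cost until a transfer needs it, while the REG and PREREG modes have already paid it by the time the transfer starts.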
Results – Pingpong
- Regular API vs Zerocopy Entry Method Send API
- Regular Send and Receive API vs Zerocopy Direct API
Results on BG/Q (Vesta) – PAMI interconnect (up to 1.6x)

| Message Size | Regular Send API (us) | Zerocopy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zerocopy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup |
|---|---|---|---|---|---|---|---|---|
| 2 KB | 34.10 | 67.85 | -98.97 | 0.50 | 35.57 | 47.82 | -34.44 | 0.74 |
| 4 KB | 37.00 | 68.02 | -83.86 | 0.54 | 38.27 | 49.02 | -28.10 | 0.78 |
| 8 KB | 40.28 | 70.31 | -74.55 | 0.57 | 42.03 | 51.43 | -22.36 | 0.82 |
| 16 KB | 46.58 | 74.12 | -59.14 | 0.63 | 48.74 | 56.07 | -15.04 | 0.87 |
| 32 KB | 57.11 | 83.43 | -46.08 | 0.68 | 61.49 | 64.66 | -5.15 | 0.95 |
| 64 KB | 78.80 | 101.76 | -29.14 | 0.77 | 86.15 | 83.48 | 3.10 | 1.03 |
| 128 KB | 122.08 | 138.55 | -13.49 | 0.88 | 135.77 | 121.14 | 10.78 | 1.12 |
| 256 KB | 208.94 | 212.58 | -1.74 | 0.98 | 235.53 | 195.19 | 17.13 | 1.21 |
| 512 KB | 381.90 | 359.59 | 5.84 | 1.06 | 434.52 | 341.57 | 21.39 | 1.27 |
| 1 MB | 728.81 | 655.91 | 10.00 | 1.11 | 831.49 | 636.78 | 23.42 | 1.31 |
| 2 MB | 1484.07 | 1245.52 | 16.07 | 1.19 | 1755.24 | 1228.63 | 30.00 | 1.43 |
| 4 MB | 3307.57 | 2676.49 | 19.08 | 1.24 | 3718.02 | 2407.34 | 35.25 | 1.54 |
| 8 MB | 6569.11 | 5282.12 | 19.59 | | 7465.67 | 4767.12 | 36.15 | 1.57 |
| 16 MB | 13771.92 | 10565.15 | 23.28 | 1.30 | 15539.09 | 9560.51 | 38.47 | 1.63 |
| 32 MB | 29246.51 | 23730.00 | 18.86 | 1.23 | 33700.23 | 21573.57 | 35.98 | 1.56 |
| 64 MB | 57096.48 | 44976.32 | 21.23 | | 65988.34 | 40644.15 | 38.41 | 1.62 |
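The two derived columns in these result tables appear to follow the usual definitions: percentage improvement is the fraction of the baseline time saved, and speedup is the baseline time divided by the optimized time. A minimal sketch, checked against the 2 KB BG/Q row (34.10 us vs 67.85 us giving -98.97 and 0.50); the function names are illustrative, not taken from the benchmark code:

```cpp
#include <cassert>
#include <cmath>

// Percentage of the baseline time saved by the optimized path.
// Negative values mean the optimized path was slower.
double percentImprovement(double baselineUs, double optimizedUs) {
    return (baselineUs - optimizedUs) / baselineUs * 100.0;
}

// Ratio of baseline time to optimized time; values above 1 favor the optimized path.
double speedup(double baselineUs, double optimizedUs) {
    return baselineUs / optimizedUs;
}

// Round to two decimals, matching the precision shown in the tables.
double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

A speedup below 1, as in the small-message rows, means the zero-copy path was slower than the regular path at that size.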
Results on Dell/Intel cluster (Golub) – InfiniBand interconnect (up to 4.3x)

| Message Size | Regular Send API (us) | Zerocopy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zerocopy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup |
|---|---|---|---|---|---|---|---|---|
| 2 KB | 4.15 | 114.34 | -2655.45 | 0.04 | 3.99 | 6.15 | -54.26 | 0.65 |
| 4 KB | 4.98 | 115.56 | -2220.20 | | 4.77 | 6.38 | -33.81 | 0.75 |
| 8 KB | 6.52 | 115.48 | -1671.76 | 0.06 | | 7.32 | -18.96 | 0.84 |
| 16 KB | 9.30 | 120.05 | -1190.74 | 0.08 | 8.92 | 8.95 | -0.30 | 1.00 |
| 32 KB | 15.64 | 124.64 | -697.08 | 0.13 | 15.76 | 12.04 | 23.61 | 1.31 |
| 64 KB | 24.67 | 133.03 | -439.24 | 0.19 | 27.82 | 18.13 | 34.85 | 1.53 |
| 128 KB | 43.53 | 150.97 | -246.78 | 0.29 | 53.55 | 30.20 | 43.61 | 1.77 |
| 256 KB | 81.57 | 179.27 | -119.77 | 0.46 | 103.41 | 54.65 | 47.15 | 1.89 |
| 512 KB | 159.56 | 244.22 | -53.06 | | 202.44 | 103.98 | 48.64 | 1.95 |
| 1 MB | 397.62 | 421.63 | -6.04 | 0.94 | 528.40 | 201.00 | 61.96 | 2.63 |
| 2 MB | 760.64 | 726.41 | 4.50 | 1.05 | 970.92 | 396.63 | 59.15 | 2.45 |
| 4 MB | 1456.88 | 1348.92 | 7.41 | 1.08 | 1878.60 | 794.44 | 57.71 | 2.36 |
| 8 MB | 6428.19 | 3835.77 | 40.33 | 1.68 | 7154.38 | 1658.74 | 76.82 | 4.31 |
| 16 MB | 13891.67 | 6287.78 | 54.74 | 2.21 | 15631.23 | 3305.39 | 78.85 | 4.73 |
| 32 MB | 24835.79 | 17905.08 | 27.91 | 1.39 | 28174.30 | 6654.32 | 76.38 | 4.23 |
| 64 MB | 50290.13 | 35370.92 | 29.67 | 1.42 | 56955.59 | 13259.62 | 76.72 | 4.30 |
Results on Cray XC (Edison) – GNI interconnect (up to 8.7x)

| Message Size | Regular Send API (us) | Zerocopy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zerocopy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup |
|---|---|---|---|---|---|---|---|---|
| 2 KB | 4.38 | 589.26 | -13345.67 | 0.01 | 4.50 | 5.91 | -31.38 | 0.76 |
| 4 KB | 5.54 | 576.78 | -10314.84 | | 5.51 | 5.90 | -7.04 | 0.93 |
| 8 KB | 5.86 | 560.79 | -9475.61 | | 5.99 | 6.71 | -12.03 | 0.89 |
| 16 KB | 6.76 | 587.23 | -8585.87 | | 7.82 | 7.70 | 1.56 | 1.02 |
| 32 KB | 10.95 | 568.08 | -5086.38 | 0.02 | 13.16 | 9.43 | 28.33 | 1.40 |
| 64 KB | 15.72 | 602.76 | -3734.09 | 0.03 | 26.59 | 13.28 | 50.06 | 2.00 |
| 128 KB | 30.08 | 626.32 | -1981.94 | 0.05 | 49.09 | 21.20 | 56.82 | 2.32 |
| 256 KB | 56.59 | 649.52 | -1047.70 | 0.09 | 95.08 | 36.40 | 61.71 | 2.61 |
| 512 KB | 108.59 | 698.80 | -543.50 | 0.16 | 205.05 | 67.68 | 66.99 | 3.03 |
| 1 MB | 226.92 | 759.19 | -234.57 | 0.30 | 372.59 | 157.04 | 57.85 | 2.37 |
| 2 MB | 475.25 | 915.49 | -92.63 | 0.52 | 828.88 | 307.46 | 62.91 | 2.70 |
| 4 MB | 913.32 | 1523.03 | -66.76 | 0.60 | 1475.04 | 517.97 | 64.88 | 2.85 |
| 8 MB | 1773.81 | 2738.94 | -54.41 | 0.65 | 3342.99 | 1025.65 | 69.32 | 3.26 |
| 16 MB | 14835.30 | 7263.10 | 51.04 | 2.04 | 18455.18 | 2245.69 | 87.83 | 8.22 |
| 32 MB | 26601.09 | 16218.50 | 39.03 | 1.64 | 38212.89 | 4589.43 | 87.99 | 8.33 |
| 64 MB | 52790.98 | 29718.78 | 43.70 | 1.78 | 81922.82 | 9388.09 | 88.54 | 8.73 |
Results on Intel KNL cluster (Stampede2) – Intel Omni-Path interconnect (up to 10x)

| Message Size | Regular Send API (us) | Zerocopy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zerocopy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup |
|---|---|---|---|---|---|---|---|---|
| 2 KB | 16.79 | 55.91 | -232.96 | 0.30 | 16.96 | 36.18 | -113.31 | 0.47 |
| 4 KB | 18.06 | 59.45 | -229.14 | | 18.61 | 37.95 | -103.94 | 0.49 |
| 8 KB | 21.23 | 68.65 | -223.40 | 0.31 | 21.46 | 42.05 | -95.95 | 0.51 |
| 16 KB | 24.69 | 74.33 | -201.06 | 0.33 | 25.80 | 46.19 | -79.06 | 0.56 |
| 32 KB | 30.39 | 75.55 | -148.57 | 0.40 | 33.41 | 49.97 | -49.58 | 0.67 |
| 64 KB | 137.88 | 147.84 | -7.22 | 0.93 | 154.33 | 57.28 | 62.89 | 2.69 |
| 128 KB | 179.06 | 205.41 | -14.72 | 0.87 | 191.62 | 155.07 | 19.07 | 1.24 |
| 256 KB | 215.92 | 319.49 | -47.97 | 0.68 | 195.90 | 162.64 | 16.97 | 1.20 |
| 512 KB | 207.66 | 336.76 | -62.17 | 0.62 | 323.97 | 154.20 | 52.40 | 2.10 |
| 1 MB | 407.83 | 342.27 | 16.08 | 1.19 | 605.58 | 194.84 | 67.83 | 3.11 |
| 2 MB | 736.41 | 383.23 | 47.96 | 1.92 | 1060.35 | 248.68 | 76.55 | 4.26 |
| 4 MB | 1376.30 | 560.89 | 59.25 | 2.45 | 1901.06 | 453.40 | 76.15 | 4.19 |
| 8 MB | 2811.16 | 831.74 | 70.41 | 3.38 | 6805.73 | 781.65 | 88.51 | 8.71 |
| 16 MB | 6008.41 | 1531.04 | 74.52 | 3.92 | 16454.11 | 1498.92 | 90.89 | 10.98 |
| 32 MB | 23693.12 | 11775.96 | 50.30 | 2.01 | 29109.18 | 2888.36 | 90.08 | 10.08 |
| 64 MB | 45585.29 | 21727.71 | 52.34 | | 55920.52 | 5666.87 | 89.87 | 9.87 |
Using SHM Transport using CMA
- Charm++ within-node communication between processes uses the network
- SHM skips the network
- Cross Memory Attach (CMA): available since Linux 3.2
- Implementation uses a metadata message (sent through the network), followed by a process_vm_readv, and then an ack message (sent through the network)
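The middle step of that protocol, the `process_vm_readv` call, can be sketched as a thin wrapper around the Linux system call. This is a minimal sketch and not the Charm++ implementation: `cmaRead` is a hypothetical helper, and it assumes the receiver already learned the sender's pid, buffer address, and length from the metadata message. The metadata and ack messages themselves are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/types.h>
#include <sys/uio.h>   // process_vm_readv (Linux 3.2+, glibc 2.15+)
#include <unistd.h>
#include <vector>

// Copy `bytes` bytes from address `remoteAddr` in process `pid` into `localBuf`.
// Returns the number of bytes read, or -1 with errno set (for example when the
// caller lacks permission to read the target process).
ssize_t cmaRead(pid_t pid, const void* remoteAddr, void* localBuf, std::size_t bytes) {
    struct iovec local;
    local.iov_base = localBuf;
    local.iov_len = bytes;

    struct iovec remote;
    remote.iov_base = const_cast<void*>(remoteAddr);  // remote side is only read
    remote.iov_len = bytes;

    // One local and one remote segment; the flags argument must be 0.
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```

The data moves with a single kernel-mediated copy from the sender's address space into the receiver's buffer, without an intermediate shared-memory region. A caller should treat a short read or -1 as a failure and fall back to the network path.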
Results – Pingpong
Using the network vs Using SHM Transport over CMA
Results on a lab machine with Ethernet network (up to 4x)

| Size | No CMA one way time (us) | CMA one way time (us) | % improvement | Speedup |
|---|---|---|---|---|
| 1 KB | 5.58 | 10.02 | -79.54 | 0.56 |
| 2 KB | 5.93 | 10.19 | -71.97 | 0.58 |
| 4 KB | 6.27 | 10.36 | -65.25 | 0.61 |
| 8 KB | 7.56 | 10.96 | -45.00 | 0.69 |
| 16 KB | 11.55 | 11.93 | -3.32 | 0.97 |
| 32 KB | 19.87 | 14.22 | 28.42 | 1.40 |
| 64 KB | 36.31 | 18.91 | 47.93 | 1.92 |
| 128 KB | 66.57 | 27.68 | 58.42 | 2.40 |
| 256 KB | 130.52 | 44.50 | 65.91 | 2.93 |
| 512 KB | 254.47 | 75.09 | 70.49 | 3.39 |
| 1 MB | 500.50 | 133.47 | 73.33 | 3.75 |
| 2 MB | 1025.51 | 252.18 | 75.41 | 4.07 |
| 4 MB | 2321.18 | 687.42 | 70.38 | 3.38 |
| 8 MB | 4935.33 | 1850.31 | 62.51 | 2.67 |
| 16 MB | 9703.12 | 3641.47 | 62.47 | 2.66 |
| 32 MB | 21204.47 | 9358.97 | 55.86 | 2.27 |
Results on Edison (GNI) (up to 1.5x)

| Size | No CMA one way time (us) | CMA one way time (us) | % improvement | Speedup |
|---|---|---|---|---|
| 256 Bytes | 1.39 | 2.35 | -68.78 | 0.59 |
| 512 Bytes | 1.43 | 2.41 | -68.68 | |
| 1 KB | 3.56 | 2.33 | 34.74 | 1.53 |
| 2 KB | 3.46 | 2.49 | 28.09 | |
| 4 KB | 3.69 | 2.74 | 25.58 | 1.34 |
| 8 KB | 4.10 | 3.41 | 16.84 | 1.20 |
| 16 KB | 5.16 | 4.37 | 15.40 | 1.18 |
| 32 KB | 7.23 | 6.17 | 14.64 | 1.17 |
| 64 KB | 11.41 | 10.17 | 10.86 | 1.12 |
| 128 KB | 19.80 | 18.06 | 8.77 | 1.10 |
| 256 KB | 36.71 | 33.83 | 7.84 | 1.09 |
| 512 KB | 70.28 | 116.89 | -66.33 | 0.60 |
| 1 MB | 137.14 | 267.55 | -95.08 | 0.51 |
| 2 MB | 270.96 | 528.58 | | |
| 4 MB | 561.46 | 1060.39 | -88.86 | 0.53 |
| 8 MB | 1208.64 | 2109.57 | -74.54 | 0.57 |
| 16 MB | 6156.18 | 6654.44 | -8.09 | 0.93 |
| 32 MB | 10463.20 | 12576.42 | -20.20 | 0.83 |
Results on Stampede2 (OFI) (up to 1.1x)

| Size | No CMA one way time (us) | CMA one way time (us) | % improvement | Speedup |
|---|---|---|---|---|
| 1 KB | 6.05 | 15.02 | -148.14 | 0.40 |
| 2 KB | 6.44 | 15.62 | -142.47 | 0.41 |
| 4 KB | 9.34 | 16.01 | -71.51 | 0.58 |
| 8 KB | 10.12 | 17.28 | -70.82 | 0.59 |
| 16 KB | 18.83 | 19.63 | -4.24 | 0.96 |
| 32 KB | 23.81 | 24.27 | -1.93 | 0.98 |
| 64 KB | 40.18 | 35.81 | 10.89 | 1.12 |
| 128 KB | 55.79 | 52.16 | 6.51 | 1.07 |
| 256 KB | 86.60 | 76.35 | 11.84 | 1.13 |
| 512 KB | 190.23 | 166.52 | 12.46 | 1.14 |
| 1 MB | 353.27 | 336.50 | 4.75 | 1.05 |
| 2 MB | 619.59 | 621.30 | -0.28 | 1.00 |
| 4 MB | 1198.66 | 1187.12 | | 1.01 |
| 8 MB | 2334.56 | 2358.88 | -1.04 | 0.99 |
| 16 MB | 4560.66 | 4639.19 | -1.72 | |
| 32 MB | 18086.00 | 17088.52 | 5.52 | 1.06 |
Results on Bridges (OFI) (up to 1.15x)

| Size | No CMA one way time (us) | CMA one way time (us) | % improvement | Speedup |
|---|---|---|---|---|
| 1 KB | 2.48 | 6.28 | -153.55 | 0.39 |
| 2 KB | 2.70 | 4.70 | -74.10 | 0.57 |
| 4 KB | 4.01 | 5.22 | -30.14 | 0.77 |
| 8 KB | 6.63 | 6.65 | -0.37 | 1.00 |
| 16 KB | 10.22 | 9.33 | 8.70 | 1.10 |
| 32 KB | 17.03 | 16.99 | 0.24 | |
| 64 KB | 28.40 | 24.73 | 12.91 | 1.15 |
| 128 KB | 49.79 | 43.99 | 11.66 | 1.13 |
| 256 KB | 91.54 | 92.98 | -1.57 | 0.98 |
| 512 KB | 169.61 | 167.09 | 1.49 | 1.02 |
| 1 MB | 325.80 | 323.69 | 0.65 | 1.01 |
| 2 MB | 646.17 | 619.66 | 4.10 | 1.04 |
| 4 MB | 1293.16 | 1252.15 | 3.17 | 1.03 |
| 8 MB | 2556.80 | 2559.24 | -0.10 | |
| 16 MB | 5148.79 | 5219.44 | -1.37 | 0.99 |
| 32 MB | 14727.66 | 14711.74 | 0.11 | |
Results on Bridges (MPI) (up to 1.08x)

| Size | No CMA one way time (us) | CMA one way time (us) | % improvement | Speedup |
|---|---|---|---|---|
| 1 KB | 3.96 | 5.25 | -32.63 | 0.75 |
| 2 KB | 4.18 | 5.48 | -31.07 | 0.76 |
| 4 KB | 4.74 | 6.23 | -31.42 | |
| 8 KB | 5.66 | 7.22 | -27.50 | 0.78 |
| 16 KB | 8.08 | 10.84 | -34.11 | |
| 32 KB | 12.02 | 13.77 | -14.51 | 0.87 |
| 64 KB | 23.82 | 22.15 | 7.04 | 1.08 |
| 128 KB | 42.55 | 41.31 | 2.91 | 1.03 |
| 256 KB | 77.81 | 74.63 | 4.09 | 1.04 |
| 512 KB | 145.11 | 140.83 | 2.95 | |
| 1 MB | 277.25 | 273.82 | 1.23 | 1.01 |
| 2 MB | 547.88 | 540.21 | 1.40 | |
| 4 MB | 1086.36 | 1078.66 | 0.71 | |
| 8 MB | 2175.44 | 2188.64 | -0.61 | 0.99 |
| 16 MB | 4378.83 | 4421.36 | -0.97 | |
| 32 MB | 13477.01 | 13336.61 | | |
Summary
- The Zero-copy EM API reduces the sender-side memory footprint and improves performance by avoiding a large memory allocation and the sender-side copy.
- The Zero-copy Direct API reduces both the sender-side and receiver-side memory footprint and improves performance to a larger extent by avoiding the large memory allocation and copy on both the sender and the receiver.
- CMA proves to be a faster alternative for intra-host inter-process communication, since messages avoid the network.
Questions?