Recent Communication Optimizations in Charm++

1 Recent Communication Optimizations in Charm++
Nitin Bhat, Software Engineer, Charmworks, Inc.
16th Annual Charm Workshop 2018

2 Agenda
- Existing Charm++ Messaging API
- Motivation
- Zero-copy Entry Method Send API using RDMA
- Zero-copy Direct API using RDMA
- Results
- Using SHM transport using CMA
- Summary

3 Charm++ Messaging API
[Diagram: how the copy-based approach handles an entry method with multiple parameters]

4 forcecalculations.ci, forcecalculations.C

forcecalculations.ci (Charm Interface File - Declarations):
    module forcecalculations {
      …...
      array [1D] Cell {
        entry forces( );
        entry void recv_forces(double forces[size], int size, double value);
      }
      ….....

forcecalculations.C (C++ Code File – Entry method):
    void recv_forces(double * forces, int size, double value) {
      ….
    }

forcecalculations.C (C++ Code File – Call site):
    Cell_Proxy[n].recv_forces(forces, , 4.0);

5 Regular Messaging API - What happens under the hood?
[Diagram]
Node 0 (sender): Cell_Proxy[n].recv_forces(forces, size, value); Charm++ marshals the parameters forces, size, and value into a message behind a header and hands it to the network.
Node 1 (receiver): Charm++ un-marshals the message (header, forces, size, value) received from the network and invokes void recv_forces(double * forces, int size, double value) { }.
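
The marshalling path on this slide can be illustrated in plain C++, outside Charm++. The sketch below is not Charm++ code: `Header`, `marshal`, and `unmarshal` are hypothetical names chosen for illustration, and a real runtime header carries more fields. It only shows where the copy of `forces` comes from when every parameter is packed into one contiguous message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical message header (illustration only, not the Charm++ layout).
struct Header {
    int size;      // number of doubles in the payload
    double value;  // the scalar parameter
};

// Sender side: copy every parameter into one contiguous message buffer.
std::vector<char> marshal(const double* forces, int size, double value) {
    assert(size >= 0);
    const std::size_t payload = static_cast<std::size_t>(size) * sizeof(double);
    Header h{size, value};
    std::vector<char> msg(sizeof(Header) + payload);
    std::memcpy(msg.data(), &h, sizeof(Header));
    if (payload > 0) {
        // This is the large copy that the zero-copy APIs aim to avoid.
        std::memcpy(msg.data() + sizeof(Header), forces, payload);
    }
    return msg;
}

// Receiver side: recover the parameters from the message buffer.
// The payload is copied out here for simplicity.
Header unmarshal(const std::vector<char>& msg, std::vector<double>& forces_out) {
    assert(msg.size() >= sizeof(Header));
    Header h;
    std::memcpy(&h, msg.data(), sizeof(Header));
    const std::size_t payload = static_cast<std::size_t>(h.size) * sizeof(double);
    forces_out.resize(static_cast<std::size_t>(h.size));
    if (payload > 0) {
        std::memcpy(forces_out.data(), msg.data() + sizeof(Header), payload);
    }
    return h;
}
```

The message grows with the size of `forces`, so both the allocation and the copy scale with the buffer being sent.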

6 Motivation
- Memory system is the bottleneck
  - Faster cores and fatter nodes
  - Processor performance has been scaling much better than memory performance over the years
- On RDMA/CMA-enabled systems, avoid copies of the large buffer by minor changes in the application logic
- Advantages:
  - Reduce memory footprint
  - Improve performance by reducing memory allocation size and avoiding the copy
  - Reduce page faults and data cache misses

7 Zero-copy Entry Method Send API

8 forcecalculations.ci, forcecalculations.C

forcecalculations.ci (Charm Interface File - Declarations):
    module forcecalculations {
      …...
      array [1D] Cell {
        entry forces( );
        entry void recv_forces(nocopy double forces[size], int size, double value);
      }
      ….....

forcecalculations.C (C++ Code File – Entry method):
    void recv_forces(double * forces, int size, double value) {
      ….
    }

forcecalculations.C (C++ Code File – Call site):
    Callback cb = new Callback(CkIndex_Cell::completed, cellArrayID);
    Cell_Proxy[n].recv_forces(CkSendBuffer(forces, cb), , 4.0);

9 Zero-copy Entry Method Send API - What happens under the hood?
[Diagram]
Node 0 (sender): Cell_Proxy[n].recv_forces(CkSendBuffer(forces, cb), size, value); Charm++ marshals only size and value behind the header and hands the message to the network; forces is not copied into the message. The diagram marks a Callback on the sender side.
Node 1 (receiver): Charm++ un-marshals size and value, fetches forces with an RGET, and invokes void recv_forces(double * forces, int size, double value) { }.
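
The sender-side saving on this slide can be mimicked in plain C++. This is a single-process sketch under stated assumptions, not the Charm++ implementation: `SendBufferInfo`, `ZcMessage`, `make_message`, and `simulated_rget` are hypothetical names, a raw pointer stands in for the RDMA handle, and `memcpy` stands in for the RGET. The point is that the message carries only metadata about `forces`, and the sender learns through a callback when the buffer has been read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical metadata that travels in the message in place of the payload.
struct SendBufferInfo {
    const double* ptr;              // sender's buffer (stand-in for an RDMA handle)
    std::size_t bytes;              // payload length in bytes
    std::function<void()> on_done;  // sender-side completion callback
};

// Hypothetical message: small parameters by value, large buffer by reference.
struct ZcMessage {
    SendBufferInfo forces;
    int size;
    double value;
};

// Sender: build the message without copying the large buffer.
ZcMessage make_message(const double* forces, int size, double value,
                       std::function<void()> cb) {
    assert(size >= 0);
    return ZcMessage{
        {forces, static_cast<std::size_t>(size) * sizeof(double), std::move(cb)},
        size, value};
}

// Receiver: pull the payload from the sender's buffer (memcpy simulates the
// RGET), then fire the sender's callback so it knows the buffer is reusable.
std::vector<double> simulated_rget(const ZcMessage& msg) {
    std::vector<double> dest(static_cast<std::size_t>(msg.size));
    if (msg.forces.bytes > 0) {
        std::memcpy(dest.data(), msg.forces.ptr, msg.forces.bytes);
    }
    if (msg.forces.on_done) {
        msg.forces.on_done();
    }
    return dest;
}
```

In this sketch the receiver still allocates a destination vector, which matches the summary slide's claim that the entry method send API saves memory and a copy on the sender side only.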

10 Zero-copy Direct API

11 forcecalculations.ci, forcecalculations.C

forcecalculations.ci (Charm Interface File - Declarations):
    module forcecalculations {
      …...
      array [1D] Cell {
        entry forces( );
        entry void recv_forces(CkNcpySource src, int size, double value);
      }
      ….....

forcecalculations.C (C++ Code File – Entry method):
    void recv_forces(CkNcpySource src, int size, double value) {
      Callback recv_cb = new Callback(CkIndex_Cell::recv_completed, cellArrayID);
      CkNcpyDestination dest(myForces, size*sizeof(double), recv_cb, CK_BUFFER_REG);
      dest.rget(src);
    }

forcecalculations.C (C++ Code File – Call site):
    Callback send_cb = new Callback(CkIndex_Cell::send_completed, cellArrayID);
    CkNcpySource src(forces, size*sizeof(double), send_cb, CK_BUFFER_REG);
    Cell_Proxy[n].recv_forces(src, , 4.0);

12 Zero-copy Direct API - What happens under the hood?
[Diagram]
Node 0 (sender): CkNcpySource src(forces, size*sizeof(double), send_cb); Cell_Proxy[n].recv_forces(src, size, value); Charm++ marshals src, size, and value behind the header and hands the message to the network.
Node 1 (receiver): Charm++ un-marshals src, size, and value and invokes void recv_forces(CkNcpySource src, int size, double value) { dest.rget(src); }. The RGET moves forces directly into myforces. The diagram marks a Sender Callback and a Receiver Callback.
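
The source/destination pairing on this slide can be mocked in plain C++ to show the control flow. `MockSource` and `MockDestination` are hypothetical stand-ins and deliberately not the real `CkNcpySource`/`CkNcpyDestination` classes; `memcpy` again simulates the RDMA get, and the order in which the two callbacks fire is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>

// Hypothetical stand-in for the source object built at the call site.
struct MockSource {
    const void* ptr;                  // sender's buffer
    std::size_t bytes;                // bytes available to read
    std::function<void()> send_done;  // tells the sender its buffer is free
};

// Hypothetical stand-in for the destination object built in the entry method.
struct MockDestination {
    void* ptr;                        // receiver's own buffer (no extra allocation)
    std::size_t bytes;                // capacity of that buffer
    std::function<void()> recv_done;  // tells the receiver the data has landed

    // Simulated get: copy straight from the source buffer into the
    // destination buffer, then notify both sides.
    // Returns false if the destination is too small; no callback fires then.
    bool rget(const MockSource& src) const {
        if (src.bytes > bytes) {
            return false;
        }
        if (src.bytes > 0) {
            std::memcpy(ptr, src.ptr, src.bytes);
        }
        if (recv_done) {
            recv_done();
        }
        if (src.send_done) {
            src.send_done();
        }
        return true;
    }
};
```

Because the receiver supplies its own buffer, this mock needs no intermediate message-sized allocation on either side, which is the difference from the entry method send API that the summary slide describes.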

13 Modes of Operation in Direct API to support memory registration (gni, verbs, ofi)
- CK_BUFFER_UNREG (default mode): unregistered at the beginning; delayed registration if required
- CK_BUFFER_REG: registered by the API
- CK_BUFFER_PREREG: registered before the API call by allocating memory out of a pre-registered mempool
- CK_BUFFER_NOREG: no registration

14 Results – Pingpong
- Regular API vs Zerocopy Entry Method Send API
- Regular Send and Receive API vs Zerocopy Direct API

15 Results on BG/Q (Vesta) – PAMI interconnect (up to 1.6x)
Message Size | Regular Send API (us) | Zerocopy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zerocopy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup
2 KB | 34.10 | 67.85 | -98.97 | 0.50 | 35.57 | 47.82 | -34.44 | 0.74
4 KB | 37.00 | 68.02 | -83.86 | 0.54 | 38.27 | 49.02 | -28.10 | 0.78
8 KB | 40.28 | 70.31 | -74.55 | 0.57 | 42.03 | 51.43 | -22.36 | 0.82
16 KB | 46.58 | 74.12 | -59.14 | 0.63 | 48.74 | 56.07 | -15.04 | 0.87
32 KB | 57.11 | 83.43 | -46.08 | 0.68 | 61.49 | 64.66 | -5.15 | 0.95
64 KB | 78.80 | 101.76 | -29.14 | 0.77 | 86.15 | 83.48 | 3.10 | 1.03
128 KB | 122.08 | 138.55 | -13.49 | 0.88 | 135.77 | 121.14 | 10.78 | 1.12
256 KB | 208.94 | 212.58 | -1.74 | 0.98 | 235.53 | 195.19 | 17.13 | 1.21
512 KB | 381.90 | 359.59 | 5.84 | 1.06 | 434.52 | 341.57 | 21.39 | 1.27
1 MB | 728.81 | 655.91 | 10.00 | 1.11 | 831.49 | 636.78 | 23.42 | 1.31
2 MB |  |  | 16.07 | 1.19 |  |  | 30.00 | 1.43
4 MB |  |  | 19.08 | 1.24 |  |  | 35.25 | 1.54
8 MB |  |  | 19.59 |  |  |  | 36.15 | 1.57
16 MB |  |  | 23.28 | 1.30 |  |  | 38.47 | 1.63
32 MB |  |  | 18.86 | 1.23 |  |  | 35.98 | 1.56
64 MB |  |  | 21.23 |  |  |  | 38.41 | 1.62
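
The slides do not define the "% Improvement" and "Speedup" columns. The sketch below assumes the usual definitions, improvement = (baseline - optimized) / baseline * 100 and speedup = baseline / optimized, which reproduce the 2 KB row of the BG/Q table (34.10 us vs 67.85 us gives -98.97% and 0.50x). The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Percent improvement of the optimized time over the baseline.
// Negative values mean the optimized path is slower.
double percent_improvement(double baseline_us, double optimized_us) {
    assert(baseline_us > 0.0);
    return (baseline_us - optimized_us) / baseline_us * 100.0;
}

// Speedup of the optimized path; values below 1.0 mean a slowdown.
double speedup(double baseline_us, double optimized_us) {
    assert(optimized_us > 0.0);
    return baseline_us / optimized_us;
}
```

Under these definitions a negative improvement and a speedup below 1.0 say the same thing: for small messages the zero-copy path costs more than the copy it avoids.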

16 Results on Dell/Intel cluster (Golub) – InfiniBand interconnect (up to 4.3x)
Message Size | Regular Send API (us) | Zerocopy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zerocopy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup
2 KB | 4.15 | 114.34 |  | 0.04 | 3.99 | 6.15 | -54.26 | 0.65
4 KB | 4.98 | 115.56 |  |  | 4.77 | 6.38 | -33.81 | 0.75
8 KB | 6.52 | 115.48 |  | 0.06 |  | 7.32 | -18.96 | 0.84
16 KB | 9.30 | 120.05 |  | 0.08 | 8.92 | 8.95 | -0.30 | 1.00
32 KB | 15.64 | 124.64 |  | 0.13 | 15.76 | 12.04 | 23.61 | 1.31
64 KB | 24.67 | 133.03 |  | 0.19 | 27.82 | 18.13 | 34.85 | 1.53
128 KB | 43.53 | 150.97 |  | 0.29 | 53.55 | 30.20 | 43.61 | 1.77
256 KB | 81.57 | 179.27 |  | 0.46 | 103.41 | 54.65 | 47.15 | 1.89
512 KB | 159.56 | 244.22 | -53.06 |  | 202.44 | 103.98 | 48.64 | 1.95
1 MB | 397.62 | 421.63 | -6.04 | 0.94 | 528.40 | 201.00 | 61.96 | 2.63
2 MB | 760.64 | 726.41 | 4.50 | 1.05 | 970.92 | 396.63 | 59.15 | 2.45
4 MB |  |  | 7.41 | 1.08 |  | 794.44 | 57.71 | 2.36
8 MB |  |  | 40.33 | 1.68 |  |  | 76.82 | 4.31
16 MB |  |  | 54.74 | 2.21 |  |  | 78.85 | 4.73
32 MB |  |  | 27.91 | 1.39 |  |  | 76.38 | 4.23
64 MB |  |  | 29.67 | 1.42 |  |  | 76.72 | 4.30

17 Results on Cray XC (Edison) – GNI interconnect (up to 8.7x)
Message Size | Regular Send API (us) | Zerocopy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zerocopy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup
2 KB | 4.38 | 589.26 |  | 0.01 | 4.50 | 5.91 | -31.38 | 0.76
4 KB | 5.54 | 576.78 |  |  | 5.51 | 5.90 | -7.04 | 0.93
8 KB | 5.86 | 560.79 |  |  | 5.99 | 6.71 | -12.03 | 0.89
16 KB | 6.76 | 587.23 |  |  | 7.82 | 7.70 | 1.56 | 1.02
32 KB | 10.95 | 568.08 |  | 0.02 | 13.16 | 9.43 | 28.33 | 1.40
64 KB | 15.72 | 602.76 |  | 0.03 | 26.59 | 13.28 | 50.06 | 2.00
128 KB | 30.08 | 626.32 |  | 0.05 | 49.09 | 21.20 | 56.82 | 2.32
256 KB | 56.59 | 649.52 |  | 0.09 | 95.08 | 36.40 | 61.71 | 2.61
512 KB | 108.59 | 698.80 |  | 0.16 | 205.05 | 67.68 | 66.99 | 3.03
1 MB | 226.92 | 759.19 |  | 0.30 | 372.59 | 157.04 | 57.85 | 2.37
2 MB | 475.25 | 915.49 | -92.63 | 0.52 | 828.88 | 307.46 | 62.91 | 2.70
4 MB | 913.32 |  | -66.76 | 0.60 |  | 517.97 | 64.88 | 2.85
8 MB |  |  | -54.41 | 0.65 |  |  | 69.32 | 3.26
16 MB |  |  | 51.04 | 2.04 |  |  | 87.83 | 8.22
32 MB |  |  | 39.03 | 1.64 |  |  | 87.99 | 8.33
64 MB |  |  | 43.70 | 1.78 |  |  | 88.54 | 8.73

18 Results on Intel KNL cluster (Stampede2) – Intel Omni-Path interconnect (up to 10x)
Message Size | Regular Send API (us) | Zerocopy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zerocopy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup
2 KB | 16.79 | 55.91 |  | 0.30 | 16.96 | 36.18 |  | 0.47
4 KB | 18.06 | 59.45 |  |  | 18.61 | 37.95 |  | 0.49
8 KB | 21.23 | 68.65 |  | 0.31 | 21.46 | 42.05 | -95.95 | 0.51
16 KB | 24.69 | 74.33 |  | 0.33 | 25.80 | 46.19 | -79.06 | 0.56
32 KB | 30.39 | 75.55 |  | 0.40 | 33.41 | 49.97 | -49.58 | 0.67
64 KB | 137.88 | 147.84 | -7.22 | 0.93 | 154.33 | 57.28 | 62.89 | 2.69
128 KB | 179.06 | 205.41 | -14.72 | 0.87 | 191.62 | 155.07 | 19.07 | 1.24
256 KB | 215.92 | 319.49 | -47.97 | 0.68 | 195.90 | 162.64 | 16.97 | 1.20
512 KB | 207.66 | 336.76 | -62.17 | 0.62 | 323.97 | 154.20 | 52.40 | 2.10
1 MB | 407.83 | 342.27 | 16.08 | 1.19 | 605.58 | 194.84 | 67.83 | 3.11
2 MB | 736.41 | 383.23 | 47.96 | 1.92 |  | 248.68 | 76.55 | 4.26
4 MB |  | 560.89 | 59.25 | 2.45 |  | 453.40 | 76.15 | 4.19
8 MB |  | 831.74 | 70.41 | 3.38 |  | 781.65 | 88.51 | 8.71
16 MB |  |  | 74.52 | 3.92 |  |  | 90.89 | 10.98
32 MB |  |  | 50.30 | 2.01 |  |  | 90.08 | 10.08
64 MB |  |  | 52.34 |  |  |  | 89.87 | 9.87

19 Using SHM Transport using CMA
- Charm++ within-node communication between processes uses the network
- SHM skips the network
- Cross Memory Attach (CMA) – Linux 3.2
- Implementation uses a metadata message (sent through the network), followed by a process_vm_readv and an ack message (sent through the network)
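
The middle step of that sequence, the `process_vm_readv` call, can be sketched in isolation. This is a minimal Linux-only example and not the Charm++ implementation: `cma_read` is a hypothetical helper, and the assumption is that the peer's pid, buffer address, and element count have already arrived in the metadata message. The metadata and ack messages themselves are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/types.h>
#include <sys/uio.h>   // process_vm_readv (Linux 3.2+, glibc 2.15+)
#include <unistd.h>
#include <vector>

// Copy `count` doubles from address `remote_addr` in process `pid` directly
// into `dst`, without an intermediate shared buffer.
// Returns the number of bytes copied, or -1 on failure (errno is set).
ssize_t cma_read(pid_t pid, const double* remote_addr, std::size_t count,
                 std::vector<double>& dst) {
    dst.resize(count);
    const std::size_t bytes = count * sizeof(double);
    // Local scatter list: where the data lands in this process.
    struct iovec local{dst.data(), bytes};
    // Remote gather list: where the data lives in the peer process.
    struct iovec remote{const_cast<double*>(remote_addr), bytes};
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```

To try it in a single process, pass `getpid()` as the pid and the address of a local array as `remote_addr`. The call can fail with -1 where ptrace-style access is restricted (for example by a sandbox or security policy), so callers should check the return value and fall back to the regular path.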

20 Results – Pingpong
- Using the network vs using SHM transport over CMA

21 Results on a lab machine with Ethernet network (up to 4x)
Size | No CMA one way time (us) | CMA one way time (us) | % improvement | Speedup
1 KB | 5.58 | 10.02 | -79.54 | 0.56
2 KB | 5.93 | 10.19 | -71.97 | 0.58
4 KB | 6.27 | 10.36 | -65.25 | 0.61
8 KB | 7.56 | 10.96 | -45.00 | 0.69
16 KB | 11.55 | 11.93 | -3.32 | 0.97
32 KB | 19.87 | 14.22 | 28.42 | 1.40
64 KB | 36.31 | 18.91 | 47.93 | 1.92
128 KB | 66.57 | 27.68 | 58.42 | 2.40
256 KB | 130.52 | 44.50 | 65.91 | 2.93
512 KB | 254.47 | 75.09 | 70.49 | 3.39
1 MB | 500.50 | 133.47 | 73.33 | 3.75
2 MB |  | 252.18 | 75.41 | 4.07
4 MB |  | 687.42 | 70.38 | 3.38
8 MB |  |  | 62.51 | 2.67
16 MB |  |  | 62.47 | 2.66
32 MB |  |  | 55.86 | 2.27

22 Results on Edison (GNI) (up to 1.5x)
Size | No CMA one way time (us) | CMA one way time (us) | % improvement | Speedup
256 Bytes | 1.39 | 2.35 | -68.78 | 0.59
512 Bytes | 1.43 | 2.41 | -68.68 | 
1 KB | 3.56 | 2.33 | 34.74 | 1.53
2 KB | 3.46 | 2.49 | 28.09 | 
4 KB | 3.69 | 2.74 | 25.58 | 1.34
8 KB | 4.10 | 3.41 | 16.84 | 1.20
16 KB | 5.16 | 4.37 | 15.40 | 1.18
32 KB | 7.23 | 6.17 | 14.64 | 1.17
64 KB | 11.41 | 10.17 | 10.86 | 1.12
128 KB | 19.80 | 18.06 | 8.77 | 1.10
256 KB | 36.71 | 33.83 | 7.84 | 1.09
512 KB | 70.28 | 116.89 | -66.33 | 0.60
1 MB | 137.14 | 267.55 | -95.08 | 0.51
2 MB | 270.96 | 528.58 |  | 
4 MB | 561.46 |  | -88.86 | 0.53
8 MB |  |  | -74.54 | 0.57
16 MB |  |  | -8.09 | 0.93
32 MB |  |  | -20.20 | 0.83

23 Results on Stampede2 (OFI) (up to 1.1x)
Size | No CMA one way time (us) | CMA one way time (us) | % improvement | Speedup
1 KB | 6.05 | 15.02 |  | 0.40
2 KB | 6.44 | 15.62 |  | 0.41
4 KB | 9.34 | 16.01 | -71.51 | 0.58
8 KB | 10.12 | 17.28 | -70.82 | 0.59
16 KB | 18.83 | 19.63 | -4.24 | 0.96
32 KB | 23.81 | 24.27 | -1.93 | 0.98
64 KB | 40.18 | 35.81 | 10.89 | 1.12
128 KB | 55.79 | 52.16 | 6.51 | 1.07
256 KB | 86.60 | 76.35 | 11.84 | 1.13
512 KB | 190.23 | 166.52 | 12.46 | 1.14
1 MB | 353.27 | 336.50 | 4.75 | 1.05
2 MB | 619.59 | 621.30 | -0.28 | 1.00
4 MB |  |  |  | 1.01
8 MB |  |  | -1.04 | 0.99
16 MB |  |  | -1.72 | 
32 MB |  |  | 5.52 | 1.06

24 Results on Bridges (OFI) (up to 1.15x)
Size | No CMA one way time (us) | CMA one way time (us) | % improvement | Speedup
1 KB | 2.48 | 6.28 |  | 0.39
2 KB | 2.70 | 4.70 | -74.10 | 0.57
4 KB | 4.01 | 5.22 | -30.14 | 0.77
8 KB | 6.63 | 6.65 | -0.37 | 1.00
16 KB | 10.22 | 9.33 | 8.70 | 1.10
32 KB | 17.03 | 16.99 | 0.24 | 
64 KB | 28.40 | 24.73 | 12.91 | 1.15
128 KB | 49.79 | 43.99 | 11.66 | 1.13
256 KB | 91.54 | 92.98 | -1.57 | 0.98
512 KB | 169.61 | 167.09 | 1.49 | 1.02
1 MB | 325.80 | 323.69 | 0.65 | 1.01
2 MB | 646.17 | 619.66 | 4.10 | 1.04
4 MB |  |  | 3.17 | 1.03
8 MB |  |  | -0.10 | 
16 MB |  |  | -1.37 | 0.99
32 MB |  |  | 0.11 | 

25 Results on Bridges (MPI) (up to 1.08x)
Size | No CMA one way time (us) | CMA one way time (us) | % improvement | Speedup
1 KB | 3.96 | 5.25 | -32.63 | 0.75
2 KB | 4.18 | 5.48 | -31.07 | 0.76
4 KB | 4.74 | 6.23 | -31.42 | 
8 KB | 5.66 | 7.22 | -27.50 | 0.78
16 KB | 8.08 | 10.84 | -34.11 | 
32 KB | 12.02 | 13.77 | -14.51 | 0.87
64 KB | 23.82 | 22.15 | 7.04 | 1.08
128 KB | 42.55 | 41.31 | 2.91 | 1.03
256 KB | 77.81 | 74.63 | 4.09 | 1.04
512 KB | 145.11 | 140.83 | 2.95 | 
1 MB | 277.25 | 273.82 | 1.23 | 1.01
2 MB | 547.88 | 540.21 | 1.40 | 
4 MB |  |  | 0.71 | 
8 MB |  |  | -0.61 | 0.99
16 MB |  |  | -0.97 | 
32 MB |  |  |  | 

26 Summary
- The zero-copy EM API reduces the sender-side memory footprint and improves performance by avoiding a large memory allocation and the sender-side copy.
- The zero-copy Direct API reduces both the sender-side and receiver-side memory footprint and improves performance to a larger extent by avoiding the large memory allocation and the copy on both the sender and the receiver.
- CMA proves to be a faster alternative for intra-host inter-process communication, since messages avoid the network.

27 Questions?

