Recent Communication Optimizations in Charm++


Recent Communication Optimizations in Charm++
Nitin Bhat, Software Engineer, Charmworks, Inc.
16th Annual Charm Workshop, 2018

Agenda
- Existing Charm++ Messaging API
- Motivation
- Zero-copy Entry Method Send API using RDMA
- Zero-copy Direct API using RDMA
- Results
- Using SHM transport over CMA
- Summary

Charm++ Messaging API

[Diagram: shows how the copy-based approach works when an entry method is invoked with multiple parameters.]

forcecalculations.ci - Charm Interface File (Declarations)

module forcecalculations {
    ...
    array [1D] Cell {
        entry forces();
        entry void recv_forces(double forces[size], int size, double value);
    }
    ...
}

forcecalculations.C - C++ Code File (Entry method)

void recv_forces(double *forces, int size, double value) {
    ...
}

forcecalculations.C - C++ Code File (Call site)

Cell_Proxy[n].recv_forces(forces, 1000000, 4.0);

Regular Messaging API - What happens under the hood?

[Diagram: on Node 0, the call Cell_Proxy[n].recv_forces(forces, size, value) marshals all parameters, including a copy of the forces buffer, into a single message with a header; the message travels over the network to Node 1, where the parameters are un-marshalled (copied again) before recv_forces runs.]

Motivation
- The memory system is the bottleneck: cores keep getting faster and nodes fatter, while processor performance has scaled much better than memory performance over the years.
- On RDMA/CMA-enabled systems, copies of large buffers can be avoided with minor changes to the application logic.
- Advantages:
  - Reduced memory footprint
  - Improved performance, by reducing memory allocation size and avoiding copies
  - Fewer page faults and data-cache misses

Zero-copy Entry Method Send API

forcecalculations.ci - Charm Interface File (Declarations)

module forcecalculations {
    ...
    array [1D] Cell {
        entry forces();
        entry void recv_forces(nocopy double forces[size], int size, double value);
    }
    ...
}

forcecalculations.C - C++ Code File (Entry method)

void recv_forces(double *forces, int size, double value) {
    ...
}

forcecalculations.C - C++ Code File (Call site)

CkCallback cb(CkIndex_Cell::completed(NULL), cellArrayID);
Cell_Proxy[n].recv_forces(CkSendBuffer(forces, cb), 1000000, 4.0);

Zero-copy Entry Method Send API - What happens under the hood?

[Diagram: on Node 0, the call Cell_Proxy[n].recv_forces(CkSendBuffer(forces, cb), size, value) marshals only the small parameters (size, value) and metadata for forces into the message; Node 1 un-marshals them and issues an RGET over the network to pull the forces buffer directly from the sender's memory before recv_forces runs, after which the sender-side callback is triggered.]
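The slides reference a sender-side completion entry method (CkIndex_Cell::completed) but do not show its body. A minimal sketch, not taken from the slides and assuming the callback delivers a CkDataMsg as Charm++ callbacks commonly do, could look like this:

forcecalculations.ci - hypothetical addition

entry void completed(CkDataMsg *msg);

forcecalculations.C - hypothetical sketch

void Cell::completed(CkDataMsg *msg) {
    // Invoked on the sender once the remote get of 'forces' has finished:
    // only now is the buffer handed to CkSendBuffer safe to modify,
    // reuse, or free.
    delete msg;
}

The key contract implied by the diagram above is that the sender must not touch the forces buffer between the send call and this callback, since the receiver pulls the data directly out of the sender's memory.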

Zero-copy Direct API

forcecalculations.ci - Charm Interface File (Declarations)

module forcecalculations {
    ...
    array [1D] Cell {
        entry forces();
        entry void recv_forces(CkNcpySource src, int size, double value);
    }
    ...
}

forcecalculations.C - C++ Code File (Entry method)

void recv_forces(CkNcpySource src, int size, double value) {
    CkCallback recv_cb(CkIndex_Cell::recv_completed(NULL), cellArrayID);
    CkNcpyDestination dest(myForces, size * sizeof(double), recv_cb, CK_BUFFER_REG);
    dest.rget(src);
}

forcecalculations.C - C++ Code File (Call site)

CkCallback send_cb(CkIndex_Cell::send_completed(NULL), cellArrayID);
CkNcpySource src(forces, size * sizeof(double), send_cb, CK_BUFFER_REG);
Cell_Proxy[n].recv_forces(src, 1000000, 4.0);

Zero-copy Direct API - What happens under the hood?

[Diagram: on Node 0, a CkNcpySource is built around the forces buffer and only the source metadata, size, and value are marshalled and sent; on Node 1, recv_forces wraps myForces in a CkNcpyDestination and calls dest.rget(src), which pulls the data directly over the network into the destination buffer. The sender callback fires once the source buffer is safe to reuse, and the receiver callback fires once the data has arrived in the destination buffer.]

Modes of Operation in the Direct API, to support memory registration (gni, verbs, ofi); a usage sketch follows this list:
- CK_BUFFER_UNREG (default mode): the buffer is unregistered at the beginning; registration is delayed until required
- CK_BUFFER_REG: the buffer is registered by the API
- CK_BUFFER_PREREG: the buffer is registered before the API call, by allocating memory out of a pre-registered mempool
- CK_BUFFER_NOREG: no registration
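As an illustration of the modes listed above, a minimal sketch of constructing source buffers with each mode, following the constructor shapes shown on the Direct API slides; the pre-registered-pool allocation call (CkRdmaAlloc) is an assumption and does not appear in the slides:

// Context assumed from the earlier slides: double *forces; int size; CkCallback send_cb;

// CK_BUFFER_UNREG is the default mode (here assumed to apply when no mode
// argument is passed); registration is delayed until it is actually needed.
CkNcpySource src_unreg(forces, size * sizeof(double), send_cb);

// CK_BUFFER_REG: the API registers the buffer when the object is created.
CkNcpySource src_reg(forces, size * sizeof(double), send_cb, CK_BUFFER_REG);

// CK_BUFFER_PREREG: the buffer must come out of a pre-registered mempool
// (the CkRdmaAlloc call below is hypothetical in this sketch).
double *pool_buf = (double *)CkRdmaAlloc(size * sizeof(double));
CkNcpySource src_prereg(pool_buf, size * sizeof(double), send_cb, CK_BUFFER_PREREG);

// CK_BUFFER_NOREG: no registration is performed at all.
CkNcpySource src_noreg(forces, size * sizeof(double), send_cb, CK_BUFFER_NOREG);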

Results – Pingpong
- Regular API vs. Zero-copy Entry Method Send API
- Regular Send and Receive API vs. Zero-copy Direct API

Results on BG/Q (Vesta) – PAMI interconnect (up to 1.6x)

Message Size | Regular Send API (us) | Zero-copy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zero-copy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup
2 KB | 34.10 | 67.85 | -98.97 | 0.50 | 35.57 | 47.82 | -34.44 | 0.74
4 KB | 37.00 | 68.02 | -83.86 | 0.54 | 38.27 | 49.02 | -28.10 | 0.78
8 KB | 40.28 | 70.31 | -74.55 | 0.57 | 42.03 | 51.43 | -22.36 | 0.82
16 KB | 46.58 | 74.12 | -59.14 | 0.63 | 48.74 | 56.07 | -15.04 | 0.87
32 KB | 57.11 | 83.43 | -46.08 | 0.68 | 61.49 | 64.66 | -5.15 | 0.95
64 KB | 78.80 | 101.76 | -29.14 | 0.77 | 86.15 | 83.48 | 3.10 | 1.03
128 KB | 122.08 | 138.55 | -13.49 | 0.88 | 135.77 | 121.14 | 10.78 | 1.12
256 KB | 208.94 | 212.58 | -1.74 | 0.98 | 235.53 | 195.19 | 17.13 | 1.21
512 KB | 381.90 | 359.59 | 5.84 | 1.06 | 434.52 | 341.57 | 21.39 | 1.27
1 MB | 728.81 | 655.91 | 10.00 | 1.11 | 831.49 | 636.78 | 23.42 | 1.31
2 MB | 1484.07 | 1245.52 | 16.07 | 1.19 | 1755.24 | 1228.63 | 30.00 | 1.43
4 MB | 3307.57 | 2676.49 | 19.08 | 1.24 | 3718.02 | 2407.34 | 35.25 | 1.54
8 MB | 6569.11 | 5282.12 | 19.59 | - | 7465.67 | 4767.12 | 36.15 | 1.57
16 MB | 13771.92 | 10565.15 | 23.28 | 1.30 | 15539.09 | 9560.51 | 38.47 | 1.63
32 MB | 29246.51 | 23730.00 | 18.86 | 1.23 | 33700.23 | 21573.57 | 35.98 | 1.56
64 MB | 57096.48 | 44976.32 | 21.23 | - | 65988.34 | 40644.15 | 38.41 | 1.62

Results on Dell/Intel cluster (Golub) – InfiniBand interconnect (up to 4.3x)

Message Size | Regular Send API (us) | Zero-copy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zero-copy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup
2 KB | 4.15 | 114.34 | -2655.45 | 0.04 | 3.99 | 6.15 | -54.26 | 0.65
4 KB | 4.98 | 115.56 | -2220.20 | - | 4.77 | 6.38 | -33.81 | 0.75
8 KB | 6.52 | 115.48 | -1671.76 | 0.06 | - | 7.32 | -18.96 | 0.84
16 KB | 9.30 | 120.05 | -1190.74 | 0.08 | 8.92 | 8.95 | -0.30 | 1.00
32 KB | 15.64 | 124.64 | -697.08 | 0.13 | 15.76 | 12.04 | 23.61 | 1.31
64 KB | 24.67 | 133.03 | -439.24 | 0.19 | 27.82 | 18.13 | 34.85 | 1.53
128 KB | 43.53 | 150.97 | -246.78 | 0.29 | 53.55 | 30.20 | 43.61 | 1.77
256 KB | 81.57 | 179.27 | -119.77 | 0.46 | 103.41 | 54.65 | 47.15 | 1.89
512 KB | 159.56 | 244.22 | -53.06 | - | 202.44 | 103.98 | 48.64 | 1.95
1 MB | 397.62 | 421.63 | -6.04 | 0.94 | 528.40 | 201.00 | 61.96 | 2.63
2 MB | 760.64 | 726.41 | 4.50 | 1.05 | 970.92 | 396.63 | 59.15 | 2.45
4 MB | 1456.88 | 1348.92 | 7.41 | 1.08 | 1878.60 | 794.44 | 57.71 | 2.36
8 MB | 6428.19 | 3835.77 | 40.33 | 1.68 | 7154.38 | 1658.74 | 76.82 | 4.31
16 MB | 13891.67 | 6287.78 | 54.74 | 2.21 | 15631.23 | 3305.39 | 78.85 | 4.73
32 MB | 24835.79 | 17905.08 | 27.91 | 1.39 | 28174.30 | 6654.32 | 76.38 | 4.23
64 MB | 50290.13 | 35370.92 | 29.67 | 1.42 | 56955.59 | 13259.62 | 76.72 | 4.30

Results on Cray XC (Edison) – GNI interconnect (up to 8.7x)

Message Size | Regular Send API (us) | Zero-copy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zero-copy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup
2 KB | 4.38 | 589.26 | -13345.67 | 0.01 | 4.50 | 5.91 | -31.38 | 0.76
4 KB | 5.54 | 576.78 | -10314.84 | - | 5.51 | 5.90 | -7.04 | 0.93
8 KB | 5.86 | 560.79 | -9475.61 | - | 5.99 | 6.71 | -12.03 | 0.89
16 KB | 6.76 | 587.23 | -8585.87 | - | 7.82 | 7.70 | 1.56 | 1.02
32 KB | 10.95 | 568.08 | -5086.38 | 0.02 | 13.16 | 9.43 | 28.33 | 1.40
64 KB | 15.72 | 602.76 | -3734.09 | 0.03 | 26.59 | 13.28 | 50.06 | 2.00
128 KB | 30.08 | 626.32 | -1981.94 | 0.05 | 49.09 | 21.20 | 56.82 | 2.32
256 KB | 56.59 | 649.52 | -1047.70 | 0.09 | 95.08 | 36.40 | 61.71 | 2.61
512 KB | 108.59 | 698.80 | -543.50 | 0.16 | 205.05 | 67.68 | 66.99 | 3.03
1 MB | 226.92 | 759.19 | -234.57 | 0.30 | 372.59 | 157.04 | 57.85 | 2.37
2 MB | 475.25 | 915.49 | -92.63 | 0.52 | 828.88 | 307.46 | 62.91 | 2.70
4 MB | 913.32 | 1523.03 | -66.76 | 0.60 | 1475.04 | 517.97 | 64.88 | 2.85
8 MB | 1773.81 | 2738.94 | -54.41 | 0.65 | 3342.99 | 1025.65 | 69.32 | 3.26
16 MB | 14835.30 | 7263.10 | 51.04 | 2.04 | 18455.18 | 2245.69 | 87.83 | 8.22
32 MB | 26601.09 | 16218.50 | 39.03 | 1.64 | 38212.89 | 4589.43 | 87.99 | 8.33
64 MB | 52790.98 | 29718.78 | 43.70 | 1.78 | 81922.82 | 9388.09 | 88.54 | 8.73

Results on Intel KNL cluster (Stampede2) – Intel Omni-Path interconnect (up to 10x)

Message Size | Regular Send API (us) | Zero-copy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zero-copy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup
2 KB | 16.79 | 55.91 | -232.96 | 0.30 | 16.96 | 36.18 | -113.31 | 0.47
4 KB | 18.06 | 59.45 | -229.14 | - | 18.61 | 37.95 | -103.94 | 0.49
8 KB | 21.23 | 68.65 | -223.40 | 0.31 | 21.46 | 42.05 | -95.95 | 0.51
16 KB | 24.69 | 74.33 | -201.06 | 0.33 | 25.80 | 46.19 | -79.06 | 0.56
32 KB | 30.39 | 75.55 | -148.57 | 0.40 | 33.41 | 49.97 | -49.58 | 0.67
64 KB | 137.88 | 147.84 | -7.22 | 0.93 | 154.33 | 57.28 | 62.89 | 2.69
128 KB | 179.06 | 205.41 | -14.72 | 0.87 | 191.62 | 155.07 | 19.07 | 1.24
256 KB | 215.92 | 319.49 | -47.97 | 0.68 | 195.90 | 162.64 | 16.97 | 1.20
512 KB | 207.66 | 336.76 | -62.17 | 0.62 | 323.97 | 154.20 | 52.40 | 2.10
1 MB | 407.83 | 342.27 | 16.08 | 1.19 | 605.58 | 194.84 | 67.83 | 3.11
2 MB | 736.41 | 383.23 | 47.96 | 1.92 | 1060.35 | 248.68 | 76.55 | 4.26
4 MB | 1376.30 | 560.89 | 59.25 | 2.45 | 1901.06 | 453.40 | 76.15 | 4.19
8 MB | 2811.16 | 831.74 | 70.41 | 3.38 | 6805.73 | 781.65 | 88.51 | 8.71
16 MB | 6008.41 | 1531.04 | 74.52 | 3.92 | 16454.11 | 1498.92 | 90.89 | 10.98
32 MB | 23693.12 | 11775.96 | 50.30 | 2.01 | 29109.18 | 2888.36 | 90.08 | 10.08
64 MB | 45585.29 | 21727.71 | 52.34 | - | 55920.52 | 5666.87 | 89.87 | 9.87

Using SHM Transport over CMA
- By default, Charm++ within-node communication between processes goes through the network; an SHM transport skips the network.
- Cross Memory Attach (CMA) has been available since Linux 3.2.
- Implementation: a metadata message (sent through the network), followed by a process_vm_readv on the receiver, followed by an ack message (sent through the network). A sketch of the process_vm_readv step follows below.
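For reference, a minimal standalone sketch of the process_vm_readv step described above; the function and variable names are hypothetical, and the assumption that the sender's pid, buffer address, and length arrive in the metadata message simply mirrors the description on this slide rather than actual Charm++ code:

#include <sys/uio.h>    // process_vm_readv (CMA, Linux >= 3.2)
#include <sys/types.h>
#include <cstddef>
#include <cstdio>

// Hypothetical receiver-side helper: pull 'size' bytes straight out of the
// sending process's address space, as identified by the metadata message.
void cma_pull(pid_t sender_pid, void *remote_addr, size_t size, void *local_buf) {
    struct iovec local  = { local_buf,   size };  // destination in this process
    struct iovec remote = { remote_addr, size };  // source in the sender process

    ssize_t n = process_vm_readv(sender_pid, &local, 1, &remote, 1, 0);
    if (n < 0) {
        perror("process_vm_readv");
    }
    // The ack message mentioned above would then be sent back through the
    // network layer so the sender knows its buffer can be reused.
}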

Results – Pingpong
- Using the network vs. using SHM transport over CMA

Results on a lab machine with an Ethernet network (up to 4x)

Message Size | No CMA one-way time (us) | CMA one-way time (us) | % Improvement | Speedup
1 KB | 5.58 | 10.02 | -79.54 | 0.56
2 KB | 5.93 | 10.19 | -71.97 | 0.58
4 KB | 6.27 | 10.36 | -65.25 | 0.61
8 KB | 7.56 | 10.96 | -45.00 | 0.69
16 KB | 11.55 | 11.93 | -3.32 | 0.97
32 KB | 19.87 | 14.22 | 28.42 | 1.40
64 KB | 36.31 | 18.91 | 47.93 | 1.92
128 KB | 66.57 | 27.68 | 58.42 | 2.40
256 KB | 130.52 | 44.50 | 65.91 | 2.93
512 KB | 254.47 | 75.09 | 70.49 | 3.39
1 MB | 500.50 | 133.47 | 73.33 | 3.75
2 MB | 1025.51 | 252.18 | 75.41 | 4.07
4 MB | 2321.18 | 687.42 | 70.38 | 3.38
8 MB | 4935.33 | 1850.31 | 62.51 | 2.67
16 MB | 9703.12 | 3641.47 | 62.47 | 2.66
32 MB | 21204.47 | 9358.97 | 55.86 | 2.27

Results on Edison (GNI) (up to 1.5x)

Message Size | No CMA one-way time (us) | CMA one-way time (us) | % Improvement | Speedup
256 Bytes | 1.39 | 2.35 | -68.78 | 0.59
512 Bytes | 1.43 | 2.41 | -68.68 | -
1 KB | 3.56 | 2.33 | 34.74 | 1.53
2 KB | 3.46 | 2.49 | 28.09 | -
4 KB | 3.69 | 2.74 | 25.58 | 1.34
8 KB | 4.10 | 3.41 | 16.84 | 1.20
16 KB | 5.16 | 4.37 | 15.40 | 1.18
32 KB | 7.23 | 6.17 | 14.64 | 1.17
64 KB | 11.41 | 10.17 | 10.86 | 1.12
128 KB | 19.80 | 18.06 | 8.77 | 1.10
256 KB | 36.71 | 33.83 | 7.84 | 1.09
512 KB | 70.28 | 116.89 | -66.33 | 0.60
1 MB | 137.14 | 267.55 | -95.08 | 0.51
2 MB | 270.96 | 528.58 | - | -
4 MB | 561.46 | 1060.39 | -88.86 | 0.53
8 MB | 1208.64 | 2109.57 | -74.54 | 0.57
16 MB | 6156.18 | 6654.44 | -8.09 | 0.93
32 MB | 10463.20 | 12576.42 | -20.20 | 0.83

Results on Stampede2 (OFI) (up to 1.1x)

Message Size | No CMA one-way time (us) | CMA one-way time (us) | % Improvement | Speedup
1 KB | 6.05 | 15.02 | -148.14 | 0.40
2 KB | 6.44 | 15.62 | -142.47 | 0.41
4 KB | 9.34 | 16.01 | -71.51 | 0.58
8 KB | 10.12 | 17.28 | -70.82 | 0.59
16 KB | 18.83 | 19.63 | -4.24 | 0.96
32 KB | 23.81 | 24.27 | -1.93 | 0.98
64 KB | 40.18 | 35.81 | 10.89 | 1.12
128 KB | 55.79 | 52.16 | 6.51 | 1.07
256 KB | 86.60 | 76.35 | 11.84 | 1.13
512 KB | 190.23 | 166.52 | 12.46 | 1.14
1 MB | 353.27 | 336.50 | 4.75 | 1.05
2 MB | 619.59 | 621.30 | -0.28 | 1.00
4 MB | 1198.66 | 1187.12 | - | 1.01
8 MB | 2334.56 | 2358.88 | -1.04 | 0.99
16 MB | 4560.66 | 4639.19 | -1.72 | -
32 MB | 18086.00 | 17088.52 | 5.52 | 1.06

Results on Bridges (OFI) (up to 1.15x)

Message Size | No CMA one-way time (us) | CMA one-way time (us) | % Improvement | Speedup
1 KB | 2.48 | 6.28 | -153.55 | 0.39
2 KB | 2.70 | 4.70 | -74.10 | 0.57
4 KB | 4.01 | 5.22 | -30.14 | 0.77
8 KB | 6.63 | 6.65 | -0.37 | 1.00
16 KB | 10.22 | 9.33 | 8.70 | 1.10
32 KB | 17.03 | 16.99 | 0.24 | -
64 KB | 28.40 | 24.73 | 12.91 | 1.15
128 KB | 49.79 | 43.99 | 11.66 | 1.13
256 KB | 91.54 | 92.98 | -1.57 | 0.98
512 KB | 169.61 | 167.09 | 1.49 | 1.02
1 MB | 325.80 | 323.69 | 0.65 | 1.01
2 MB | 646.17 | 619.66 | 4.10 | 1.04
4 MB | 1293.16 | 1252.15 | 3.17 | 1.03
8 MB | 2556.80 | 2559.24 | -0.10 | -
16 MB | 5148.79 | 5219.44 | -1.37 | 0.99
32 MB | 14727.66 | 14711.74 | 0.11 | -

Results on Bridges (MPI) (up to 1.08x)

Message Size | No CMA one-way time (us) | CMA one-way time (us) | % Improvement | Speedup
1 KB | 3.96 | 5.25 | -32.63 | 0.75
2 KB | 4.18 | 5.48 | -31.07 | 0.76
4 KB | 4.74 | 6.23 | -31.42 | -
8 KB | 5.66 | 7.22 | -27.50 | 0.78
16 KB | 8.08 | 10.84 | -34.11 | -
32 KB | 12.02 | 13.77 | -14.51 | 0.87
64 KB | 23.82 | 22.15 | 7.04 | 1.08
128 KB | 42.55 | 41.31 | 2.91 | 1.03
256 KB | 77.81 | 74.63 | 4.09 | 1.04
512 KB | 145.11 | 140.83 | 2.95 | -
1 MB | 277.25 | 273.82 | 1.23 | 1.01
2 MB | 547.88 | 540.21 | 1.40 | -
4 MB | 1086.36 | 1078.66 | 0.71 | -
8 MB | 2175.44 | 2188.64 | -0.61 | 0.99
16 MB | 4378.83 | 4421.36 | -0.97 | -
32 MB | 13477.01 | 13336.61 | - | -

Summary
- The zero-copy EM Send API reduces the sender-side memory footprint and improves performance by avoiding the large memory allocation and the sender-side copy.
- The zero-copy Direct API reduces both the sender-side and the receiver-side memory footprint and improves performance further, by avoiding the large memory allocation and the copy on both the sender and the receiver.
- CMA proves to be a faster alternative for intra-host, inter-process communication, sending messages without going through the network.

Questions?