Upcoming Improvements and Features in Charm++
Discussion Moderator: Eric Bohm
Charm++ Workshop 2018
One-sided, RDMA, zero-copy
- Direct API integrated into Charm++; see Nitin Bhat's talk for details. Note: uses IPC (i.e., CMA) for cross-process transfers within a node.
- Get vs. Put (a minimal get sketch follows this list)
  - Put semantics: when is it safe to write to remote memory? Message-layer completion notification is weak for put.
  - Get semantics: when is the remote data available? If your application already has that knowledge, get will have lower latency.
- Memory registration
  - Can't access a page that isn't mapped. We have four strategies with different costs to choose from (next slide).
  - We handle this for messages already, but if you want to do zero-copy sends from your own buffers, you have to think about this issue.
- Should the Direct API cover GPU-to-GPU operations?
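For concreteness, the receiver side of a get could look roughly like the sketch below. This is a minimal sketch, not the definitive API usage: the chare, buffer, and entry-method names (Receiver, recvSrcInfo, done) are hypothetical, and the CkNcpyBuffer constructor and callback details should be checked against the current manual.

    // Receiver-side sketch: pull remote data with a one-sided get.
    // Assumes a 1D chare array "Receiver" with entry methods recvSrcInfo()
    // and done() declared in the usual .ci interface file (not shown).
    #include "charm++.h"

    class Receiver : public CBase_Receiver {
      double *localBuf;
      size_t n;
    public:
      Receiver(size_t n_) : localBuf(new double[n_]), n(n_) {}

      // The sender passes a CkNcpyBuffer describing its source buffer.
      void recvSrcInfo(CkNcpyBuffer src) {
        // Fired once the data has actually landed in localBuf.
        CkCallback cb(CkIndex_Receiver::done(NULL), thisProxy[thisIndex]);
        CkNcpyBuffer dest(localBuf, n * sizeof(double), cb, CK_BUFFER_REG);
        dest.get(src);   // one-sided: localBuf is valid only after cb fires
      }

      void done(CkDataMsg *m) {
        // Safe to read localBuf here: this answers "when is the data available?"
        delete m;
      }
    };

The put direction mirrors this; either way, the slide's "when is it safe / when is it available" questions are answered by the completion callbacks rather than by the call returning.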
Memory registration
- CK_BUFFER_PREREG: use CkRdmaAlloc / CkRdmaFree to allocate your buffers; you assert that all your buffers are accessible (i.e., pinned) to get the maximum performance benefit.
- CK_BUFFER_UNREG: you expect the runtime system to handle registration as necessary; may incur registration or copy overhead.
- CK_BUFFER_REG: request that the Direct API register your buffers; may incur per-transaction registration overhead.
- CK_BUFFER_NOREG: no guarantee regarding pinning; RDMA not supported. Generic support for Ncpy operations using the standard message protocols, with the associated copy overheads.
- Is this API sufficient? (An allocation sketch follows this list.)
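As an illustration of how the modes pair with allocation, consider the sketch below. The helper function and buffer names are hypothetical; only CkRdmaAlloc/CkRdmaFree, CkNcpyBuffer, and the CK_BUFFER_* modes come from the slide.

    // Allocation sketch: pre-registered vs. unregistered buffers.
    #include "charm++.h"
    #include <cstdlib>

    void setupBuffers(size_t nbytes, CkCallback &cb) {
      // Pinned, pre-registered allocation: pair with CK_BUFFER_PREREG for the
      // lowest per-transfer overhead.
      void *pinned = CkRdmaAlloc(nbytes);
      CkNcpyBuffer fast(pinned, nbytes, cb, CK_BUFFER_PREREG);

      // Ordinary heap allocation: let the runtime register or copy as needed.
      void *plain = std::malloc(nbytes);
      CkNcpyBuffer lazy(plain, nbytes, cb, CK_BUFFER_UNREG);

      // ... pass these descriptors to entry methods that issue get()/put() ...
      // Eventually: CkRdmaFree(pinned); std::free(plain);
    }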
Command Line Options
- How wide should a run be if the user provides no arguments?
- The default for MPI is 1 process with 1 core and 1 rank: a conservative choice, but is that really what the user intended?
- +autoProvision: everything we can see is ours.
- +processPerSocket +wthPerCore is probably the right answer, unless you need to leave cores free for OS voodoo (+excludecore).
- Are there other common command line issues we should address? (Example launch lines follow this list.)
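Purely as illustration, launch lines exercising these flags might look like the following; the binary name is a placeholder, and the exact spellings and arguments of the provisioning flags should be checked against the charmrun documentation.

    ./charmrun ./myApp +autoProvision
    # or spell the layout out explicitly with +processPerSocket and +wthPerCore,
    # reserving cores for the OS via +excludecore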
Offload API
- Manage the offloading of work to accelerators (CUDA only).
- Support multiple accelerators per host and per process.
- Completion is converted to a Charm++ callback event (a sketch follows this list).
- Allow work to be done on the GPU or the CPU, based on utilization and suitability.
- CUDA only: that is where the platforms have been and are going.
- Are there other aspects of accelerator interaction that we should prioritize?
- How much priority should we place on other accelerator APIs?
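Today this callback-on-completion pattern is available through the GPU Manager (HAPI); a rough sketch is below. The kernel, chare, and entry-method names are hypothetical, and the hapiAddCallback() signature should be verified against hapi.h.

    // Offload sketch: launch a CUDA kernel asynchronously and receive
    // completion as a Charm++ callback instead of blocking the PE.
    #include "hapi.h"
    #include <cuda_runtime.h>

    extern void launchMyKernel(double *d_data, int n, cudaStream_t s); // hypothetical

    class Worker : public CBase_Worker {
      cudaStream_t stream;
      double *d_data;
      int n;
    public:
      void offload() {
        launchMyKernel(d_data, n, stream);   // asynchronous kernel launch
        // Rather than cudaStreamSynchronize(), ask the runtime to invoke an
        // entry method once everything queued on 'stream' has completed.
        CkCallback *cb = new CkCallback(CkIndex_Worker::kernelDone(),
                                        thisProxy[thisIndex]);
        hapiAddCallback(stream, cb);
      }
      void kernelDone() {
        // Device results are ready; continue the computation here.
      }
    };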
C++ Integration
- std::vector directly supported in reductions.
- [inline] supports templated methods.
- R-value references supported.
- PUP supports enums, deque, and forward_list (a sketch follows this list).
- CkLoop supports lambda syntax.
- Which advanced C++ features should be prioritized?
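A small PUP sketch for the newly supported types; the struct and member names are hypothetical, but operator| over enums and STL containers is the standard PUP idiom.

    // PUP sketch: enums, std::deque, and std::forward_list serialize with
    // the usual operator| once pup_stl.h is included.
    #include "pup.h"
    #include "pup_stl.h"
    #include <deque>
    #include <forward_list>

    enum Phase { SETUP, COMPUTE, DONE };

    struct State {
      Phase phase;
      std::deque<int> pending;
      std::forward_list<double> samples;

      void pup(PUP::er &p) {
        p | phase;     // enum
        p | pending;   // std::deque
        p | samples;   // std::forward_list
      }
    };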
CharmPy
- Basic Charm++ support in Python.
- Limited load balancer selection.
- No nodegroup.
- No SMP mode (Python global interpreter lock).
- Which parts of the Python ecosystem should we prioritize for compatibility?
- Are there use cases for CharmPy that sound interesting to you?
Within Node Parallelism
- Support for Boost threads (uFcontext): the default choice for platforms where they don't break other features (not OS X); lowest context-switching cost.
- Integration of the LLVM OpenMP implementation: supports clean interoperability between Charm++ ULTs (CkLoop, [threaded], etc.) and OpenMP, to avoid oversubscription and resource contention (an interop sketch follows this list).
- Finer controls for CkLoop work-stealing strategies.
- Our support for the OpenMP task API is weak. How important is that to you?
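As a sketch of what the OpenMP interoperability buys you (chare and method names hypothetical): with the integrated LLVM OpenMP runtime, a parallel region inside an entry method is serviced by Charm++'s own worker threads, so idle PEs in the process can help without oversubscribing cores.

    // Interop sketch: OpenMP loop parallelism inside a Charm++ entry method.
    #include "charm++.h"
    #include <omp.h>
    #include <vector>

    class Tile : public CBase_Tile {
      std::vector<double> data;
    public:
      void relax() {
        // Iterations run on the Charm++ worker threads of this process; no
        // second thread pool is created, so there is no oversubscription or
        // resource contention with other chares.
        #pragma omp parallel for
        for (long i = 0; i < (long)data.size(); ++i) {
          data[i] *= 0.5;   // placeholder work
        }
        // ... contribute results / send messages as usual ...
      }
    };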
AMPI
- Improved point-to-point communication latency and bandwidth, particularly for messages within a process.
- Updated AMPI_Migrate() with built-in MPI_Info objects, such as AMPI_INFO_LB_SYNC (a usage sketch follows this list).
- Fixes to MPI_Sendrecv_replace, MPI_(I)Alltoall{v,w}, MPI_(I)Scatter(v), MPI_IN_PLACE in gather collectives, MPI_Type_free, MPI_Op_free, and MPI_Comm_free.
- Implemented MPI_Comm_create_group and MPI_Dist_graph support.
- Added support for using -tlsglobals for privatization of global/static variables in shared objects; previously -tlsglobals required static linking.
- AMPI now only renames the user's MPI calls from MPI_* to AMPI_* if Charm++/AMPI is built on top of another MPI implementation for communication.
- Support for compiling mpif.h in both fixed form and free form.
- PMPI profiling interface support added.
- Added an ampirun script that wraps charmrun, enabling easier integration with build and test scripts that already take mpirun/mpiexec as an option.
- Which incomplete aspects of MPI-3 are of highest importance to you?
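For example, the AMPI_Migrate() update mentioned above is typically used inside the timestep loop; the loop structure and compute() are placeholders, while AMPI_INFO_LB_SYNC is the built-in MPI_Info object named on the slide.

    // Migration sketch: periodically give the runtime a chance to rebalance.
    #include <mpi.h>

    void compute(int step);   // hypothetical application work

    void timestep_loop(int nsteps) {
      for (int step = 0; step < nsteps; ++step) {
        compute(step);
        if (step % 10 == 0) {
          // Collective over all virtual ranks; the runtime may migrate ranks
          // for load balance before this call returns.
          AMPI_Migrate(AMPI_INFO_LB_SYNC);
        }
      }
    }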
Charm Next Generation (post 6.9)
- How attached are you to the current load balancing API?
- Should per-PE load balancing be the focus, or per-host?
- Should chares be bound to a PE by default?
- Should entry methods be non-reentrant by default? Should unbound chares be non-reentrant by default?
- How much of a burden is charmxi to your application development?
- Dedicated scheduler, 1:1 with execution stream and hardware thread, vs. a selectable number of schedulers bound to execution streams, with the remainder as drones executing work-stealing queues?
- Should we implement multiple comm threads per process? No dedicated comm threads (a la the PAMI layer)?