Optimizing Charm++ Messaging for the Grid

Gregory A. Koenig (koenig@cs.uiuc.edu)
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign

2005 Charm++ Workshop
Introduction

Goals of this work:
- Optimize message passing, in terms of latency and CPU overhead
- Optimize both single-cluster messages and Grid messages
- Leverage hardware support as much as possible
- Use the NCSA Virtual Machine Interface (VMI) messaging layer; create a solution that is applicable to other layers
- Primary deployment on TeraGrid (Myrinet)

[Figure: two clusters, A and B; intra-cluster latency is measured in microseconds, inter-cluster latency in milliseconds]
Message Passing Primitives

Stream Send
- {Stream Open, Send Fragment, ..., Stream Close}
- Message data must be copied an extra time into the receive buffer (i.e., only good for small messages)
- Easy to use (low management overhead)

RDMA (Remote Direct Memory Access)
- RDMA Put, RDMA Get
- Message data are written/read directly into the receive buffer (i.e., good for large messages)
- Harder to use (requires buffer management)
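The copy vs. zero-copy distinction is easiest to see on the receive side. The sketch below is illustrative only (the helper names and buffer handling are assumptions, not VMI or Charm++ calls): a stream receive assembles fragments into a freshly allocated message, while an RDMA receive hands back a pointer into a buffer the data were already written into.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical receive-side helpers illustrating the copy vs. zero-copy
 * distinction; buffer registration and the network itself are omitted. */

/* Stream path: each arriving fragment is copied into a message buffer the
 * machine layer allocates -- one extra copy per message. */
char *stream_receive(const char **fragments, const size_t *lens, int nfrags,
                     size_t total_len) {
    char *msg = malloc(total_len);          /* destination owned by the runtime */
    size_t off = 0;
    for (int i = 0; i < nfrags; i++) {
        memcpy(msg + off, fragments[i], lens[i]);   /* the extra copy */
        off += lens[i];
    }
    return msg;
}

/* RDMA path: the NIC has already written the payload into a pinned
 * (registered) receive buffer, so the runtime just hands back a pointer. */
char *rdma_receive(char *pinned_buffer) {
    return pinned_buffer;                   /* zero copy, but the buffer must be
                                               managed (pinned, reused, freed) */
}
```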
RDMA Put (Rendezvous)

Processor A sends a message to Processor B via RDMA Put:
1. A sends a short setup message to B indicating an upcoming RDMA Put and specifying the message size
2. B registers (pins into memory) a receive buffer of the specified size and responds to A with its address
3. A does an RDMA Put directly into B's pinned receive buffer ("zero copy")

Notice that Processor A must actively promote the send for a relatively long period of time – time that could be spent computing instead!
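A minimal single-process sketch of the three-step rendezvous, with malloc() standing in for pinning and memcpy() standing in for the NIC's RDMA Put; all type and variable names are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy simulation of the Put rendezvous between "A" and "B". */

typedef struct { size_t size; } setup_msg_t;        /* step 1 payload */
typedef struct { void *addr; } rendezvous_reply_t;  /* step 2 payload */

int main(void) {
    const char *payload = "application message";
    size_t len = strlen(payload) + 1;

    /* Step 1: A tells B how large the upcoming message is. */
    setup_msg_t setup = { len };

    /* Step 2: B pins a receive buffer of that size and replies with its
     * address.  B's CPU must be attentive for this step to happen. */
    rendezvous_reply_t reply = { malloc(setup.size) };

    /* Step 3: A performs the Put directly into B's pinned buffer.
     * Until this completes, A is still driving the send. */
    memcpy(reply.addr, payload, len);

    printf("B received: %s\n", (char *)reply.addr);
    free(reply.addr);
    return 0;
}
```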
RDMA Put Expected Behavior
RDMA Put Unexpected Behavior
Pitfalls of RDMA Put

[Figure: the three-step Put handshake between processors A and B]
RDMA Get

Processor A sends a message to Processor B via RDMA Get:
1. A registers (pins into memory) a send buffer and sends a short setup message to B with its address and size
2. B does an RDMA Get directly from the sender's buffer into a receive buffer ("zero copy")

Notice that once Processor A has initiated the transfer by sending the setup message, it is free to do other things (such as computing) while the hardware promotes the send.
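The same kind of toy simulation for the two-step Get path; again, malloc() and memcpy() stand in for pinning and for the NIC's RDMA Get, and all names are illustrative. The key difference from the Put sketch is that after step 1 the sender has nothing left to do.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy simulation of the Get rendezvous between "A" and "B". */

typedef struct { const void *addr; size_t size; } get_setup_t;

int main(void) {
    const char *payload = "application message";

    /* Step 1: A pins its send buffer and ships its address and size to B.
     * From this point on, A's CPU is free to compute. */
    get_setup_t setup = { payload, strlen(payload) + 1 };

    /* Step 2: B's NIC pulls the data straight out of A's pinned buffer
     * into a local receive buffer. */
    void *recv_buf = malloc(setup.size);
    memcpy(recv_buf, setup.addr, setup.size);

    printf("B received: %s\n", (char *)recv_buf);
    free(recv_buf);
    return 0;
}
```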
Benefits of RDMA Get

- Leverages hardware to allow more overlapping of communication with computation
- Reduces the number of network traversals required to send the message from three to two
- Reduces the chance (by about half) that a busy CPU will not acknowledge an RDMA operation in a timely manner

But...
- If the receiver is busy when the setup message arrives, the Get can still be delayed
- Two network traversals are required to send the message (not so good for Grid computations)

Can we do better??
Eager Communication Channels

It would be nice if each receiver could dedicate a receive buffer to each sender; the sender could then just Put data directly into the buffer assigned to it.

Unfortunately, this does not scale:
- Buffers must be periodically polled (or serviced by an interrupt, which is usually slow)
- Pinned memory is a finite resource

Solution: Have the message layer observe the communication characteristics of the computation and set up dedicated buffers ("eager channels") between pairs of processes that communicate frequently.
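One way the "observe and promote" idea could look in code; the threshold, the table size, and the eager_channel_setup() hook are assumptions rather than details of the vmi-linux layer.

```c
#include <stdbool.h>

/* Sketch: count messages per sender and set up an eager channel once a
 * sender crosses a threshold.  All constants and names are illustrative. */

#define MAX_PEERS       1024
#define EAGER_THRESHOLD 1000    /* messages from one peer before promoting */

static unsigned msg_count[MAX_PEERS];
static bool     has_eager_channel[MAX_PEERS];

/* Called (hypothetically) by the machine layer for every received message. */
void note_message_from(int sender) {
    if (has_eager_channel[sender])
        return;
    if (++msg_count[sender] >= EAGER_THRESHOLD) {
        /* Dedicate a pinned, slotted receive buffer to this sender and tell
         * it where to Put (see the next slide for the buffer layout). */
        /* eager_channel_setup(sender); */
        has_eager_channel[sender] = true;
    }
}
```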
Eager Channel Implementation (Small Messages)

- When a receiver notices that a given process frequently sends to it, the receiver dedicates a buffer to the sender and divides it into "slots"
- The sender does an RDMA Put into slots in order
- The receiver polls a sentinel at the end of the active slot; a changed sentinel indicates that a new message is present in that slot
- When a message is received, the address of the slot is returned to the application (must intercept subsequent CmiFree() calls!) and polling takes place on the next slot in the buffer, round-robin
- Every message send requires a "send credit"; if a sender does not have a credit, it can still send the message via the slow path
- When the receiver frees a message in a slot, a send credit is returned to the sender (frees must happen almost in order, otherwise holes appear)

[Figure: a dedicated receive buffer divided into slot 1 ... slot 4, with a sentinel at the end of each slot]
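A toy, single-process model of the slotted buffer: a memcpy plays the role of the RDMA Put, the sentinel is an ordinary byte the receiver polls, and credits cap the number of outstanding messages. Slot count, slot size, and all names are illustrative.

```c
#include <stdio.h>
#include <string.h>

#define NUM_SLOTS 4
#define SLOT_DATA 60            /* payload bytes per slot */

typedef struct {
    char data[SLOT_DATA];
    volatile char sentinel;     /* receiver polls this byte */
} slot_t;

static slot_t channel[NUM_SLOTS];   /* receive buffer dedicated to one sender */
static int send_credits = NUM_SLOTS;
static int send_slot = 0;           /* next slot the sender will Put into */
static int recv_slot = 0;           /* next slot the receiver polls       */

/* Sender side: consume a credit and "Put" into the next slot in order. */
int eager_send(const char *msg) {
    if (send_credits == 0)
        return -1;                          /* fall back to the slow path */
    send_credits--;
    slot_t *s = &channel[send_slot];
    strncpy(s->data, msg, SLOT_DATA - 1);
    s->sentinel = 1;                        /* written last: message is ready */
    send_slot = (send_slot + 1) % NUM_SLOTS;
    return 0;
}

/* Receiver side: poll the sentinel of the active slot only, round-robin. */
const char *eager_poll(void) {
    slot_t *s = &channel[recv_slot];
    if (!s->sentinel)
        return NULL;                        /* nothing new in this slot yet */
    s->sentinel = 0;
    recv_slot = (recv_slot + 1) % NUM_SLOTS;
    return s->data;                         /* handed to the application; the
                                               free returns a send credit */
}

/* Freeing the message returns a send credit to the sender. */
void eager_free(const char *msg) {
    (void)msg;
    send_credits++;
}

int main(void) {
    eager_send("hello");
    eager_send("world");
    const char *m;
    while ((m = eager_poll()) != NULL) {
        printf("received: %s\n", m);
        eager_free(m);
    }
    return 0;
}
```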
Eager Channel Implementation (Large Messages)

- The maximum message size for the slotted buffer approach is bounded by the size of a slot
- For larger message sizes, dedicate a small number (e.g., three) of larger buffers (e.g., 1 MB) to the sender
- Instead of polling these large-message buffers (which uses CPU cycles), service them via an interrupt; the latency of the interrupt is essentially lost in the latency of the actual data transfer
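A sketch of what the per-sender pool of large buffers might look like; the pool size, buffer size, and function names are assumptions, and the completion handler is meant to be driven by a NIC interrupt rather than a polling loop.

```c
#include <stdbool.h>
#include <stddef.h>

#define LARGE_BUFS     3
#define LARGE_BUF_SIZE (1 << 20)          /* e.g., 1 MB per buffer */

typedef struct {
    char data[LARGE_BUF_SIZE];            /* pinned in the real implementation */
    bool in_use;
} large_buf_t;

static large_buf_t pool[LARGE_BUFS];      /* dedicated to one frequent sender */

/* Pick a free large buffer for an incoming Put (NULL means fall back to the
 * slow path for this message). */
large_buf_t *large_buffer_acquire(void) {
    for (int i = 0; i < LARGE_BUFS; i++)
        if (!pool[i].in_use) { pool[i].in_use = true; return &pool[i]; }
    return NULL;
}

/* Called from the (hypothetical) completion interrupt handler rather than
 * from a polling loop, so no CPU cycles are spent waiting for the transfer. */
void large_message_complete(large_buf_t *buf, size_t nbytes,
                            void (*deliver)(char *, size_t)) {
    deliver(buf->data, nbytes);           /* hand the message up */
    buf->in_use = false;                  /* buffer becomes reusable */
}
```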
Summary of Message Passing Paths in the vmi-linux Machine Layer

Slow path
- Small messages: sent via Stream
- Large messages: sent via RDMA Get

Fast path (Eager)
- Small messages: sent via RDMA Put into slotted buffers, which are polled
- Large messages: sent via RDMA Put into a small number of interrupt-serviced buffers
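The four paths can be summarized as one selection function. The cutoff value and names below are assumptions for illustration; the real machine layer's thresholds (and its fallback when no send credit is available) differ.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { SLOW_STREAM, SLOW_RDMA_GET,
               EAGER_PUT_SLOT, EAGER_PUT_LARGE } send_path_t;

#define SMALL_MSG_CUTOFF 4096   /* assumed slot-sized cutoff, in bytes */

/* Choose a path based on message size and whether an eager channel exists
 * to this destination.  Credit exhaustion would additionally force the
 * slow path even when a channel is established. */
send_path_t choose_path(size_t msg_bytes, bool eager_channel_established) {
    if (eager_channel_established)
        return (msg_bytes <= SMALL_MSG_CUTOFF) ? EAGER_PUT_SLOT
                                               : EAGER_PUT_LARGE;
    return (msg_bytes <= SMALL_MSG_CUTOFF) ? SLOW_STREAM
                                           : SLOW_RDMA_GET;
}
```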
Preliminary Results: One-Way Pingpong Latency

Converse pingpong test running on the NCSA Mercury (TeraGrid) cluster:
- 1.3 GHz IA-64 processors
- Myrinet interconnect

Msg Size (bytes) | Slow Path Latency (us) | Fast Path Latency (us)
              16 |                  11.63 |                   9.20
              64 |                  11.66 |                   9.33
             256 |                  15.70 |                  10.45
           1,024 |                  23.26 |                  18.64
           4,096 |                  61.59 |                  33.57
          16,384 |                 112.96 |                  85.01
          65,536 |                 318.59 |                 285.15
         262,144 |                1125.63 |                1080.32
       1,048,576 |                4345.29 |                4260.45
       4,194,304 |               17132.89 |               16981.12
Persistent Communication API

- Implemented by Gengbin Zheng and Sameer Kumar in the elan-linux and net-linux machine layers
- General implementation/usage:
  - The programmer initializes the API, specifying the maximum size for a message send
  - The programmer switches persistence on/off on a per-send basis
  - In a persistent send, data (up to maxsize bytes) are Put directly into a dedicated buffer on the receiver
- The API exploits the programmer's knowledge of the application's communication patterns; no checks are done to ensure message data are not overwritten
- In the vmi-linux machine layer, the API is simply used as an indicator that an eager channel is highly desirable between a pair of processes
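A rough sketch of the usage pattern described above, written against a hypothetical Converse-style interface; the handle type, function names, signatures, and stub bodies are assumptions for illustration, not the actual API.

```c
#include <stdio.h>

/* Hypothetical persistent-send interface with stub bodies so the sketch
 * compiles and runs; none of these are the real Converse declarations. */

typedef int PersistentHandle;

static PersistentHandle persistent_create(int dest_pe, int max_bytes) {
    printf("reserve %d-byte buffer on PE %d\n", max_bytes, dest_pe);
    return 0;
}
static void persistent_use(PersistentHandle h) { (void)h; } /* next sends are persistent */
static void persistent_stop(void)              { }          /* back to the normal path   */
static void send_message(int dest_pe, const char *msg) {
    printf("send to PE %d: %s\n", dest_pe, msg);
}

int main(void) {
    /* One-time setup: dedicate a receive buffer (up to max_bytes) on the
     * destination.  No overwrite checks: the programmer must know the
     * previous message has been consumed before reusing the channel. */
    PersistentHandle h = persistent_create(1, 65536);

    persistent_use(h);                        /* persistence on for the next send  */
    send_message(1, "persistent payload");    /* Put straight into the dedicated buffer */
    persistent_stop();                        /* later sends take the ordinary path */
    return 0;
}
```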
Future Work

- This is pretty complicated code; a lot more testing is needed, and additional optimization is most likely possible
- Measure performance in the TeraGrid environment
- Measure performance on hardware optimized for RDMA (e.g., PCI Express bus, InfiniBand interconnect)
- Implement the ability to discard eager channels if they are unused for some period of time (this is probably important in conjunction with load balancing and migrating objects based on their communication patterns)