Optimizing Charm++ Messaging for the Grid

Gregory A. Koenig (koenig@cs.uiuc.edu)
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign

2005 Charm++ Workshop
Introduction

Goals of this work:
- Optimize message passing, in terms of latency and CPU overhead
- Optimize both single-cluster messages and Grid messages
- Leverage hardware support as much as possible
- Use the NCSA Virtual Machine Interface (VMI) messaging layer; create a solution that is applicable to other layers
- Primary deployment on TeraGrid (Myrinet)

[Figure: two clusters, A and B; intra-cluster latency is measured in microseconds, inter-cluster latency in milliseconds]
Message Passing Primitives

Stream Send
- {Stream Open, Send Fragment, ..., Stream Close}
- Message data must be copied an extra time into the receive buffer (i.e., only good for small messages)
- Easy to use (low management overhead)

RDMA (Remote Direct Memory Access)
- RDMA Put, RDMA Get
- Message data are written/read directly into the receive buffer (i.e., good for large messages)
- Harder to use (requires buffer management)
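The copy vs. zero-copy distinction is easiest to see on the receive side. The sketch below is illustrative only (the helper names and buffer handling are assumptions, not VMI or Charm++ calls): a stream receive assembles fragments into a freshly allocated message, while an RDMA receive hands back a pointer into a buffer the data were already written into.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical receive-side helpers illustrating the copy vs. zero-copy
 * distinction; buffer registration and the network itself are omitted. */

/* Stream path: each arriving fragment is copied into a message buffer the
 * machine layer allocates -- one extra copy per message. */
char *stream_receive(const char **fragments, const size_t *lens, int nfrags,
                     size_t total_len) {
    char *msg = malloc(total_len);          /* destination owned by the runtime */
    size_t off = 0;
    for (int i = 0; i < nfrags; i++) {
        memcpy(msg + off, fragments[i], lens[i]);   /* the extra copy */
        off += lens[i];
    }
    return msg;
}

/* RDMA path: the NIC has already written the payload into a pinned
 * (registered) receive buffer, so the runtime just hands back a pointer. */
char *rdma_receive(char *pinned_buffer) {
    return pinned_buffer;                   /* zero copy, but the buffer must be
                                               managed (pinned, reused, freed) */
}
```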
RDMA Put (Rendezvous)

Processor A sends a message to Processor B via RDMA Put:
1. A sends a short setup message to B indicating an upcoming RDMA Put and specifying the message size
2. B registers (pins into memory) a receive buffer of the specified size and responds to A with its address
3. A does an RDMA Put directly into B's pinned receive buffer ("zero copy")

Notice that Processor A must actively promote the send for a relatively long period of time – time that could be spent computing instead!
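A minimal single-process sketch of the three-step rendezvous, with malloc() standing in for pinning and memcpy() standing in for the NIC's RDMA Put; all type and variable names are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy simulation of the Put rendezvous between "A" and "B". */

typedef struct { size_t size; } setup_msg_t;        /* step 1 payload */
typedef struct { void *addr; } rendezvous_reply_t;  /* step 2 payload */

int main(void) {
    const char *payload = "application message";
    size_t len = strlen(payload) + 1;

    /* Step 1: A tells B how large the upcoming message is. */
    setup_msg_t setup = { len };

    /* Step 2: B pins a receive buffer of that size and replies with its
     * address.  B's CPU must be attentive for this step to happen. */
    rendezvous_reply_t reply = { malloc(setup.size) };

    /* Step 3: A performs the Put directly into B's pinned buffer.
     * Until this completes, A is still driving the send. */
    memcpy(reply.addr, payload, len);

    printf("B received: %s\n", (char *)reply.addr);
    free(reply.addr);
    return 0;
}
```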
RDMA Put Expected Behavior
RDMA Put Unexpected Behavior
Pitfalls of RDMA Put

[Figure: the three-step Put handshake between processors A and B]
RDMA Get

Processor A sends a message to Processor B via RDMA Get:
1. A registers (pins into memory) a send buffer and sends a short setup message to B with its address and size
2. B does an RDMA Get directly from the sender's buffer into a receive buffer ("zero copy")

Notice that once Processor A has initiated the transfer by sending the setup message, it is free to do other things (such as computing) while the hardware promotes the send.
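The same kind of toy simulation for the two-step Get path; again, malloc() and memcpy() stand in for pinning and for the NIC's RDMA Get, and all names are illustrative. The key difference from the Put sketch is that after step 1 the sender has nothing left to do.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy simulation of the Get rendezvous between "A" and "B". */

typedef struct { const void *addr; size_t size; } get_setup_t;

int main(void) {
    const char *payload = "application message";

    /* Step 1: A pins its send buffer and ships its address and size to B.
     * From this point on, A's CPU is free to compute. */
    get_setup_t setup = { payload, strlen(payload) + 1 };

    /* Step 2: B's NIC pulls the data straight out of A's pinned buffer
     * into a local receive buffer. */
    void *recv_buf = malloc(setup.size);
    memcpy(recv_buf, setup.addr, setup.size);

    printf("B received: %s\n", (char *)recv_buf);
    free(recv_buf);
    return 0;
}
```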
Benefits of RDMA Get

- Leverages hardware to allow more overlapping of communication with computation
- Reduces the number of network traversals required to send the message from three to two
- Reduces the chance (by about half) that a busy CPU will not acknowledge an RDMA operation in a timely manner

But...
- If the receiver is busy when the setup message arrives, the Get can still be delayed
- Two network traversals are required to send the message (not so good for Grid computations)

Can we do better??
Eager Communication Channels

It would be nice if each receiver could dedicate a receive buffer to each sender; the sender could then just Put data directly into the buffer assigned to it.

Unfortunately, this does not scale:
- Buffers must be periodically polled (or serviced by an interrupt, which is usually slow)
- Pinned memory is a finite resource

Solution: Have the message layer observe the communication characteristics of the computation and set up dedicated buffers ("eager channels") between pairs of processes that communicate frequently.
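One way the "observe and promote" idea could look in code; the threshold, the table size, and the eager_channel_setup() hook are assumptions rather than details of the vmi-linux layer.

```c
#include <stdbool.h>

/* Sketch: count messages per sender and set up an eager channel once a
 * sender crosses a threshold.  All constants and names are illustrative. */

#define MAX_PEERS       1024
#define EAGER_THRESHOLD 1000    /* messages from one peer before promoting */

static unsigned msg_count[MAX_PEERS];
static bool     has_eager_channel[MAX_PEERS];

/* Called (hypothetically) by the machine layer for every received message. */
void note_message_from(int sender) {
    if (has_eager_channel[sender])
        return;
    if (++msg_count[sender] >= EAGER_THRESHOLD) {
        /* Dedicate a pinned, slotted receive buffer to this sender and tell
         * it where to Put (see the next slide for the buffer layout). */
        /* eager_channel_setup(sender); */
        has_eager_channel[sender] = true;
    }
}
```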
Eager Channel Implementation (Small Messages)

- When a receiver notices that a given process frequently sends to it, the receiver dedicates a buffer to the sender and divides it into "slots"
- The sender does an RDMA Put into slots in order
- The receiver polls a sentinel at the end of the active slot; a changed sentinel indicates that a new message is present in that slot
- When a message is received, the address of the slot is returned to the application (must intercept subsequent CmiFree() calls!) and polling takes place on the next slot in the buffer, round-robin
- Every message send requires a "send credit"; if a sender does not have a credit, it can still send the message via the slow path
- When the receiver frees a message in a slot, a send credit is returned to the sender (frees must happen almost in order, otherwise holes appear)

[Figure: a dedicated receive buffer divided into slot 1 ... slot 4, with a sentinel at the end of each slot]
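A toy, single-process model of the slotted buffer: a memcpy plays the role of the RDMA Put, the sentinel is an ordinary byte the receiver polls, and credits cap the number of outstanding messages. Slot count, slot size, and all names are illustrative.

```c
#include <stdio.h>
#include <string.h>

#define NUM_SLOTS 4
#define SLOT_DATA 60            /* payload bytes per slot */

typedef struct {
    char data[SLOT_DATA];
    volatile char sentinel;     /* receiver polls this byte */
} slot_t;

static slot_t channel[NUM_SLOTS];   /* receive buffer dedicated to one sender */
static int send_credits = NUM_SLOTS;
static int send_slot = 0;           /* next slot the sender will Put into */
static int recv_slot = 0;           /* next slot the receiver polls       */

/* Sender side: consume a credit and "Put" into the next slot in order. */
int eager_send(const char *msg) {
    if (send_credits == 0)
        return -1;                          /* fall back to the slow path */
    send_credits--;
    slot_t *s = &channel[send_slot];
    strncpy(s->data, msg, SLOT_DATA - 1);
    s->sentinel = 1;                        /* written last: message is ready */
    send_slot = (send_slot + 1) % NUM_SLOTS;
    return 0;
}

/* Receiver side: poll the sentinel of the active slot only, round-robin. */
const char *eager_poll(void) {
    slot_t *s = &channel[recv_slot];
    if (!s->sentinel)
        return NULL;                        /* nothing new in this slot yet */
    s->sentinel = 0;
    recv_slot = (recv_slot + 1) % NUM_SLOTS;
    return s->data;                         /* handed to the application; the
                                               free returns a send credit */
}

/* Freeing the message returns a send credit to the sender. */
void eager_free(const char *msg) {
    (void)msg;
    send_credits++;
}

int main(void) {
    eager_send("hello");
    eager_send("world");
    const char *m;
    while ((m = eager_poll()) != NULL) {
        printf("received: %s\n", m);
        eager_free(m);
    }
    return 0;
}
```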
Eager Channel Implementation (Large Messages)

- The maximum message size for the slotted buffer approach is bounded by the size of a slot
- For larger message sizes, dedicate a small number (e.g., three) of larger buffers (e.g., 1 MB) to the sender
- Instead of polling these large-message buffers (which uses CPU cycles), service them via an interrupt; the latency of the interrupt is essentially lost in the latency of the actual data transfer
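A sketch of what the per-sender pool of large buffers might look like; the pool size, buffer size, and function names are assumptions, and the completion handler is meant to be driven by a NIC interrupt rather than a polling loop.

```c
#include <stdbool.h>
#include <stddef.h>

#define LARGE_BUFS     3
#define LARGE_BUF_SIZE (1 << 20)          /* e.g., 1 MB per buffer */

typedef struct {
    char data[LARGE_BUF_SIZE];            /* pinned in the real implementation */
    bool in_use;
} large_buf_t;

static large_buf_t pool[LARGE_BUFS];      /* dedicated to one frequent sender */

/* Pick a free large buffer for an incoming Put (NULL means fall back to the
 * slow path for this message). */
large_buf_t *large_buffer_acquire(void) {
    for (int i = 0; i < LARGE_BUFS; i++)
        if (!pool[i].in_use) { pool[i].in_use = true; return &pool[i]; }
    return NULL;
}

/* Called from the (hypothetical) completion interrupt handler rather than
 * from a polling loop, so no CPU cycles are spent waiting for the transfer. */
void large_message_complete(large_buf_t *buf, size_t nbytes,
                            void (*deliver)(char *, size_t)) {
    deliver(buf->data, nbytes);           /* hand the message up */
    buf->in_use = false;                  /* buffer becomes reusable */
}
```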
Summary of Message Passing Paths in the vmi-linux Machine Layer

Slow path
- Small messages: sent via Stream
- Large messages: sent via RDMA Get

Fast path (Eager)
- Small messages: sent via RDMA Put into slotted buffers, which are polled
- Large messages: sent via RDMA Put into a small number of interrupt-serviced buffers
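The four paths can be summarized as one selection function. The cutoff value and names below are assumptions for illustration; the real machine layer's thresholds (and its fallback when no send credit is available) differ.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { SLOW_STREAM, SLOW_RDMA_GET,
               EAGER_PUT_SLOT, EAGER_PUT_LARGE } send_path_t;

#define SMALL_MSG_CUTOFF 4096   /* assumed slot-sized cutoff, in bytes */

/* Choose a path based on message size and whether an eager channel exists
 * to this destination.  Credit exhaustion would additionally force the
 * slow path even when a channel is established. */
send_path_t choose_path(size_t msg_bytes, bool eager_channel_established) {
    if (eager_channel_established)
        return (msg_bytes <= SMALL_MSG_CUTOFF) ? EAGER_PUT_SLOT
                                               : EAGER_PUT_LARGE;
    return (msg_bytes <= SMALL_MSG_CUTOFF) ? SLOW_STREAM
                                           : SLOW_RDMA_GET;
}
```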
Preliminary Results: One-Way Pingpong Latency

Converse pingpong test running on the NCSA Mercury (TeraGrid) cluster:
- 1.3 GHz IA-64 processors
- Myrinet interconnect

Msg Size (bytes) | Slow Path Latency (us) | Fast Path Latency (us)
              16 |                  11.63 |                   9.20
              64 |                  11.66 |                   9.33
             256 |                  15.70 |                  10.45
           1,024 |                  23.26 |                  18.64
           4,096 |                  61.59 |                  33.57
          16,384 |                 112.96 |                  85.01
          65,536 |                 318.59 |                 285.15
         262,144 |                1125.63 |                1080.32
       1,048,576 |                4345.29 |                4260.45
       4,194,304 |               17132.89 |               16981.12
Persistent Communication API

- Implemented by Gengbin Zheng and Sameer Kumar in the elan-linux and net-linux machine layers
- General implementation/usage:
  - The programmer initializes the API, specifying the maximum size for a message send
  - The programmer switches persistence on/off on a per-send basis
  - In a persistent send, data (up to maxsize bytes) are Put directly into a dedicated buffer on the receiver
- The API exploits the programmer's knowledge of the application's communication patterns; no checks are done to ensure message data are not overwritten
- In the vmi-linux machine layer, the API is simply used as an indicator that an eager channel is highly desirable between a pair of processes
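A rough sketch of the usage pattern described above, written against a hypothetical Converse-style interface; the handle type, function names, signatures, and stub bodies are assumptions for illustration, not the actual API.

```c
#include <stdio.h>

/* Hypothetical persistent-send interface with stub bodies so the sketch
 * compiles and runs; none of these are the real Converse declarations. */

typedef int PersistentHandle;

static PersistentHandle persistent_create(int dest_pe, int max_bytes) {
    printf("reserve %d-byte buffer on PE %d\n", max_bytes, dest_pe);
    return 0;
}
static void persistent_use(PersistentHandle h) { (void)h; } /* next sends are persistent */
static void persistent_stop(void)              { }          /* back to the normal path   */
static void send_message(int dest_pe, const char *msg) {
    printf("send to PE %d: %s\n", dest_pe, msg);
}

int main(void) {
    /* One-time setup: dedicate a receive buffer (up to max_bytes) on the
     * destination.  No overwrite checks: the programmer must know the
     * previous message has been consumed before reusing the channel. */
    PersistentHandle h = persistent_create(1, 65536);

    persistent_use(h);                        /* persistence on for the next send  */
    send_message(1, "persistent payload");    /* Put straight into the dedicated buffer */
    persistent_stop();                        /* later sends take the ordinary path */
    return 0;
}
```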
Future Work

- This is pretty complicated code; a lot more testing is needed, and additional optimization is most likely possible
- Measure performance in the TeraGrid environment
- Measure performance on hardware optimized for RDMA (e.g., PCI Express bus, InfiniBand interconnect)
- Implement the ability to discard eager channels if they are unused for some period of time (this is probably important in conjunction with load balancing and migrating objects based on their communication patterns)