Optimizing Charm++ Messaging for the Grid
Gregory A. Koenig
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
2005 Charm++ Workshop

Introduction
Goals of this work:
- Optimize message passing, in terms of:
  - Latency
  - CPU overhead
- Optimize both single-cluster messages as well as Grid messages
- Leverage hardware support as much as possible
- Use the NCSA Virtual Machine Interface (VMI) messaging layer; create a solution that is applicable to other layers
- Primary deployment on TeraGrid (Myrinet)
[Figure: two clusters, A and B; intra-cluster latency is measured in microseconds, inter-cluster latency in milliseconds]

Message Passing Primitives
- Stream Send {Stream Open, Send Fragment, …, Stream Close}
  - Message data must be copied an extra time into the receive buffer (i.e., only good for small messages)
  - Easy to use (low management overhead)
- RDMA (Remote Direct Memory Access): RDMA Put, RDMA Get
  - Message data are written/read directly into the receive buffer (i.e., good for large messages)
  - Harder to use (requires buffer management)
Both primitive styles are sketched in code below.
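
As a rough, single-process illustration of the copy-count difference between the two families, the sketch below stages a "stream" payload through an intermediate buffer before it reaches the receive buffer, while the "RDMA" path deposits the payload directly into a pre-registered receive buffer. The function names are invented for illustration; this is not the VMI or Charm++ API.

    // Minimal single-process sketch of the copy-count difference described above.
    // Names are illustrative only; this is not the VMI (or Charm++) API.
    #include <cstring>
    #include <iostream>
    #include <vector>

    // "Stream" path: payload is first landed in a network staging buffer,
    // then copied a second time into the application's receive buffer.
    void stream_send(const std::vector<char> &payload, std::vector<char> &recv_buf) {
        std::vector<char> staging(payload.size());                    // network-owned buffer
        std::memcpy(staging.data(), payload.data(), payload.size());  // copy 1: onto the wire
        recv_buf.resize(staging.size());
        std::memcpy(recv_buf.data(), staging.data(), staging.size()); // copy 2: the extra copy
    }

    // "RDMA" path: the receive buffer was registered (pinned) ahead of time,
    // so the payload is deposited directly into it ("zero copy" for the host).
    void rdma_put(const std::vector<char> &payload, std::vector<char> &pinned_recv_buf) {
        std::memcpy(pinned_recv_buf.data(), payload.data(), payload.size()); // single deposit
    }

    int main() {
        std::vector<char> msg = {'h', 'i', '\0'};
        std::vector<char> recv_a, recv_b(msg.size());   // recv_b plays the pinned buffer
        stream_send(msg, recv_a);
        rdma_put(msg, recv_b);
        std::cout << recv_a.data() << " / " << recv_b.data() << "\n";
        return 0;
    }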

RDMA Put (Rendezvous)
Processor A sends a message to Processor B via RDMA Put:
1. A sends a short setup message to B indicating an upcoming RDMA Put and specifying the message size
2. B registers (pins into memory) a receive buffer of the specified size and responds to A with its address
3. A does an RDMA Put directly into B’s pinned receive buffer (“zero copy”)
[Figure: timeline of the three steps between processors A and B]
Notice that Processor A must actively promote the send for a relatively long period of time – time that could be spent computing instead! The handshake is sketched in code below.
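
To make the three-step handshake concrete, here is a small simulation in which two threads stand in for processors A and B and a shared Mailbox struct stands in for the network. Everything here (the Mailbox fields, the payload) is invented for illustration; note that thread A stays engaged through all three steps, which is exactly the time the slide says could have gone to computation.

    // Simulation of the three-step RDMA Put rendezvous described above.
    // Two threads stand in for processors A and B; all names are illustrative.
    #include <condition_variable>
    #include <cstring>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Mailbox {
        std::mutex m;
        std::condition_variable cv;
        size_t announced_size = 0;        // step 1: A -> B, "a Put of this size is coming"
        char  *remote_buffer  = nullptr;  // step 2: B -> A, address of the pinned receive buffer
        bool   put_complete   = false;    // step 3: A -> B, the data has landed
    };

    int main() {
        Mailbox box;
        const char payload[] = "hello from A";

        std::thread procB([&] {
            std::unique_lock<std::mutex> lk(box.m);
            box.cv.wait(lk, [&] { return box.announced_size > 0; });   // receive the setup message
            std::vector<char> pinned(box.announced_size);              // register ("pin") a buffer
            box.remote_buffer = pinned.data();                         // reply with its address
            box.cv.notify_all();
            box.cv.wait(lk, [&] { return box.put_complete; });         // wait for the Put to land
            std::cout << "B received: " << pinned.data() << "\n";
        });

        std::thread procA([&] {
            std::unique_lock<std::mutex> lk(box.m);
            box.announced_size = sizeof(payload);                      // step 1: setup message
            box.cv.notify_all();
            box.cv.wait(lk, [&] { return box.remote_buffer != nullptr; }); // step 2: address arrives
            std::memcpy(box.remote_buffer, payload, sizeof(payload));  // step 3: the Put itself
            box.put_complete = true;    // A was tied up through all three steps
            box.cv.notify_all();
        });

        procA.join();
        procB.join();
        return 0;
    }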

RDMA Put Expected Behavior

RDMA Put Unexpected Behavior

Pitfalls of RDMA Put
[Figure: processors A and B; steps 1, 2, 3]

RDMA Get
[Figure: processors A and B; steps 1, 2]
Processor A sends a message to Processor B via RDMA Get:
1. A registers (pins into memory) a send buffer and sends a short setup message to B with its address and size
2. B does an RDMA Get directly from the sender’s buffer into a receive buffer (“zero copy”)
Notice that once Processor A initiates the Get operation, it is free to do other things (such as computing) while hardware promotes the send. This path is sketched in code below.
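
The same style of simulation shows how Get shifts the work off the sender: A publishes the address of its pinned send buffer and is then free, while B pulls the data when it gets to it. As before, the Mailbox and its fields are illustrative inventions, not part of any messaging layer.

    // Companion sketch for the two-step RDMA Get: A publishes its pinned send
    // buffer and is immediately free; B pulls the data directly from it.
    #include <condition_variable>
    #include <cstring>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Mailbox {
        std::mutex m;
        std::condition_variable cv;
        const char *send_buffer = nullptr;  // step 1: A -> B, address of the pinned send buffer
        size_t      send_size   = 0;
    };

    int main() {
        Mailbox box;
        const char payload[] = "hello from A";   // plays the role of A's pinned send buffer

        std::thread procA([&] {
            {
                std::lock_guard<std::mutex> lk(box.m);
                box.send_buffer = payload;        // step 1: setup message with address and size
                box.send_size   = sizeof(payload);
            }
            box.cv.notify_all();
            // A is now free to compute; the hardware (played by thread B) promotes the transfer.
        });

        std::thread procB([&] {
            std::unique_lock<std::mutex> lk(box.m);
            box.cv.wait(lk, [&] { return box.send_buffer != nullptr; });
            std::vector<char> recv(box.send_size);
            std::memcpy(recv.data(), box.send_buffer, box.send_size);  // step 2: the Get itself
            std::cout << "B pulled: " << recv.data() << "\n";
        });

        procA.join();
        procB.join();
        return 0;
    }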

Benefits of RDMA Get
- Leverages hardware to allow more overlapping of communication with computation
- Reduces the number of network traversals required to send the message from three to two
- Reduces the chance (by about half) that a busy CPU will not acknowledge an RDMA operation in a timely manner
- But…
  - If the receiver is busy when the setup message arrives, the Get can still be delayed
  - Two network traversals are required to send the message (not so good for Grid computations)
  - Can we do better?

Eager Communication Channels
- It would be nice if each receiver could dedicate a receive buffer to each sender; the sender could then just Put data directly into the buffer assigned to it
- Unfortunately, this does not scale:
  - Buffers must be periodically polled (or serviced by an interrupt, which is usually slow)
  - Pinned memory is a finite resource
- Solution: have the message layer observe the communication characteristics of the computation and set up dedicated buffers (“eager channels”) between pairs of processes that communicate frequently (see the sketch below)
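
A minimal sketch of the "observe, then promote" idea: count how often each peer sends, and dedicate an eager channel once a peer crosses a threshold. The threshold value, the maps, and setup_eager_channel() are all assumptions made for illustration, not the vmi-linux layer's actual bookkeeping.

    // Illustrative sketch: promote a frequently communicating peer to an eager channel.
    #include <iostream>
    #include <unordered_map>

    constexpr int kEagerThreshold = 64;   // assumed cutoff for "communicates frequently"

    std::unordered_map<int, int>  msgs_seen_from;    // sender rank -> message count
    std::unordered_map<int, bool> eager_channel_up;  // sender rank -> channel established?

    void setup_eager_channel(int sender) {
        // In the real layer this would pin a slotted buffer and tell the sender
        // its address; here we simply record the decision.
        eager_channel_up[sender] = true;
        std::cout << "dedicating an eager channel to sender " << sender << "\n";
    }

    void on_message_received(int sender) {
        if (!eager_channel_up[sender] && ++msgs_seen_from[sender] >= kEagerThreshold)
            setup_eager_channel(sender);
    }

    int main() {
        for (int i = 0; i < 100; ++i) on_message_received(7);   // a chatty peer gets a channel
        on_message_received(3);                                 // a quiet peer does not
        return 0;
    }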

Eager Channel Implementation (Small Messages)
- When a receiver notices that a given process frequently sends to it, the receiver dedicates a buffer to the sender and divides it into “slots”
- Receiver polls a sentinel at the end of the active slot; a changed sentinel indicates that a new message is present in that slot
- When a message is received, the address of the slot is returned to the application (must intercept subsequent CmiFree() calls!) and polling moves to the next slot in the buffer, round-robin
[Figure: a dedicated buffer divided into slot 1, slot 2, slot 3, slot 4, each ending in a sentinel]
- Sender does an RDMA Put into slots in order
- Every message send requires a “send credit”; if a sender does not have a credit, it can still send the message via the slow path
- When the receiver frees a message in a slot, a send credit is returned to the sender (frees must happen almost in order, otherwise holes result)
The slot, sentinel, and credit mechanics are sketched below.
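
Here is a single-process sketch of the slotted buffer: fixed-size slots with a sentinel byte at the end of each, round-robin use on both sides, and send credits that flow back when the receiver frees a slot. The slot count, slot size, and function names are assumptions for illustration only.

    // Single-process sketch of the slotted eager buffer: slots, trailing sentinels,
    // round-robin polling, and send credits. Sizes and names are illustrative.
    #include <array>
    #include <cstring>
    #include <iostream>
    #include <string>

    constexpr int    kSlots       = 4;
    constexpr size_t kSlotPayload = 60;                  // payload bytes per slot
    constexpr size_t kSlotSize    = kSlotPayload + 1;    // +1 for the trailing sentinel

    std::array<char, kSlots * kSlotSize> eager_buffer{}; // the receiver's pinned buffer
    int send_credits = kSlots;                           // sender-side credit count
    int next_put     = 0;                                // sender's next slot, round-robin
    int next_poll    = 0;                                // receiver's active slot

    char *slot(int i)     { return eager_buffer.data() + i * kSlotSize; }
    char &sentinel(int i) { return *(slot(i) + kSlotPayload); }

    bool eager_put(const std::string &msg) {             // sender: "RDMA Put" into next slot
        if (send_credits == 0 || msg.size() >= kSlotPayload) return false; // fall back to slow path
        --send_credits;
        std::memcpy(slot(next_put), msg.c_str(), msg.size() + 1);
        sentinel(next_put) = 1;                           // flipping the sentinel publishes the message
        next_put = (next_put + 1) % kSlots;
        return true;
    }

    void poll_and_free() {                                // receiver: poll the active slot's sentinel
        while (sentinel(next_poll) != 0) {
            std::cout << "delivered: " << slot(next_poll) << "\n";
            sentinel(next_poll) = 0;                      // the application "frees" the slot...
            ++send_credits;                               // ...and a credit flows back to the sender
            next_poll = (next_poll + 1) % kSlots;
        }
    }

    int main() {
        for (int i = 0; i < 6; ++i)
            if (!eager_put("msg " + std::to_string(i)))
                std::cout << "msg " << i << " takes the slow path (no credit or too big)\n";
        poll_and_free();               // frees the four slots, returning credits
        eager_put("msg 6");            // credits available again
        poll_and_free();
        return 0;
    }

In the real layer the sender's memcpy is an RDMA Put and the receiver's loop runs inside the machine layer's polling routine; the credit accounting is what keeps a sender from overwriting a slot the application has not yet freed.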

Eager Channel Implementation (Large Messages)
- The maximum message size for the slotted-buffer approach is bounded by the size of a slot
- For larger messages, dedicate a small number (e.g., three) of larger buffers (e.g., 1 MB) to the sender
- Instead of polling these large-message buffers (which uses CPU cycles), service them via an interrupt; the latency of the interrupt is essentially lost in the latency of the actual data transfer (sketched below)
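
A rough sketch of the interrupt-driven alternative, with a notifying call standing in for the NIC's completion interrupt: the receiver blocks on a condition variable instead of burning CPU polling the large buffers, and wakes only when a transfer completes. The buffer count and sizes follow the slide's example, but everything else is an assumption for illustration.

    // Sketch of the large-message eager path: a small pool of big buffers that are
    // serviced by a "completion interrupt" rather than being polled.
    #include <condition_variable>
    #include <cstring>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    constexpr int    kLargeBuffers = 3;         // e.g., three buffers per sender
    constexpr size_t kLargeSize    = 1 << 20;   // e.g., 1 MB each

    struct LargeChannel {
        std::vector<std::vector<char>> bufs;
        std::mutex m;
        std::condition_variable cv;
        int completed = -1;                     // index of the buffer that just filled
    } chan;

    void nic_interrupt(int which) {             // stands in for the RDMA-completion interrupt
        std::lock_guard<std::mutex> lk(chan.m);
        chan.completed = which;
        chan.cv.notify_one();
    }

    int main() {
        chan.bufs.assign(kLargeBuffers, std::vector<char>(kLargeSize));

        std::thread receiver([] {
            std::unique_lock<std::mutex> lk(chan.m);
            chan.cv.wait(lk, [] { return chan.completed >= 0; });   // blocked, not polling
            std::cout << "large message ready in buffer " << chan.completed
                      << ": " << chan.bufs[chan.completed].data() << "\n";
        });

        // The "hardware" deposits a large message and raises the interrupt.
        std::strcpy(chan.bufs[1].data(), "a large payload (pretend)");
        nic_interrupt(1);

        receiver.join();
        return 0;
    }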

Summary of message passing paths in the vmi-linux machine layer
Slow path:
- Small messages: sent via Stream
- Large messages: sent via RDMA Get
Fast path (eager):
- Small messages: sent via RDMA Put into slotted buffers, which are polled
- Large messages: sent via RDMA Put into a small number of interrupt-serviced buffers
The dispatch among these four paths is sketched below.
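
The four-way choice can be written as a small dispatch routine. The size cutoff, the channel-lookup helper, and the send stubs below are invented stand-ins for whatever the vmi-linux layer actually uses; only the decision structure is taken from the slide.

    // Illustrative dispatch over the four paths summarized above.
    #include <cstddef>
    #include <iostream>

    constexpr std::size_t kSmallMsgCutoff = 4096;   // assumed bound tied to the eager slot size

    bool has_eager_channel(int dest) { return dest % 2 == 0; }   // stand-in for real bookkeeping

    void send_stream(int d, std::size_t n)          { std::cout << "Stream -> " << d << " (" << n << " bytes)\n"; }
    void send_rdma_get(int d, std::size_t n)        { std::cout << "RDMA Get -> " << d << " (" << n << " bytes)\n"; }
    void send_eager_slot_put(int d, std::size_t n)  { std::cout << "eager slotted Put -> " << d << " (" << n << " bytes)\n"; }
    void send_eager_large_put(int d, std::size_t n) { std::cout << "eager interrupt-serviced Put -> " << d << " (" << n << " bytes)\n"; }

    void machine_send(int dest, std::size_t len) {
        const bool small = (len <= kSmallMsgCutoff);
        if (has_eager_channel(dest)) {               // fast (eager) path
            if (small) send_eager_slot_put(dest, len);
            else       send_eager_large_put(dest, len);
        } else {                                     // slow path
            if (small) send_stream(dest, len);
            else       send_rdma_get(dest, len);
        }
    }

    int main() {
        machine_send(1, 256);       // no eager channel, small message -> Stream
        machine_send(1, 1 << 20);   // no eager channel, large message -> RDMA Get
        machine_send(2, 256);       // eager channel, small message    -> slotted Put
        machine_send(2, 1 << 20);   // eager channel, large message    -> interrupt-serviced Put
        return 0;
    }

In practice the cutoff would correspond to the slot size of the small-message eager buffers described earlier.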

Preliminary Results: One-way Pingpong Latency
- Converse pingpong test running on the NCSA Mercury (TeraGrid) cluster
  - 1.3 GHz IA-64 processors
  - Myrinet interconnect
- Yellow: small messages; green: large messages
[Table: Msg Size (bytes) vs. Slow Path Latency (us) and Fast Path Latency (us)]

Persistent Communication API
- Implemented by Gengbin Zheng and Sameer Kumar in the elan-linux and net-linux machine layers
- General implementation/usage:
  - Programmer initializes the API, specifying the maximum size for a message send
  - Programmer switches persistence on/off on a per-send basis
  - In a persistent send, data (up to maxsize bytes) are Put directly into a dedicated buffer on the receiver
  - The API exploits the programmer’s knowledge of the application’s communication patterns – no checks are done to ensure message data are not overwritten
- In the vmi-linux machine layer, the API is simply used as an indicator that an eager channel is highly desirable between a pair of processes (a usage sketch follows)
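
A usage sketch in the spirit of the Converse persistent API. The names and signatures below (CmiCreatePersistent, CmiUsePersistentHandle, CmiSyncSend) are written from memory and should be treated as approximate rather than authoritative; the point being illustrated is the per-send on/off switching the slide describes.

    /* Sketch of persistent-send usage in a Converse program.
     * NOTE: the signatures here are approximate recollections of the Converse API,
     * not a verified reference; consult converse.h for the real declarations. */
    #include "converse.h"

    #define MAX_PERSIST_BYTES 65536   /* assumed upper bound for persistent sends */

    void send_iteration(int destPE, char *msg, int size) {
        /* One-time setup: a persistent channel to destPE, bounded by maxsize. */
        static PersistentHandle chan;
        static int initialized = 0;
        if (!initialized) {
            chan = CmiCreatePersistent(destPE, MAX_PERSIST_BYTES);
            initialized = 1;
        }

        if (size <= MAX_PERSIST_BYTES) {
            /* Switch persistence on for just this send: data go straight into the
             * dedicated buffer on the receiver (no overwrite checks are made!). */
            CmiUsePersistentHandle(&chan, 1);
            CmiSyncSend(destPE, size, msg);
            CmiUsePersistentHandle(NULL, 0);   /* switch persistence back off */
        } else {
            CmiSyncSend(destPE, size, msg);    /* regular (non-persistent) send */
        }
    }

In the vmi-linux layer, a call pattern like this would simply signal that an eager channel between the two processes is worth establishing.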

Future Work
- This is pretty complicated code; a lot more testing is needed, and additional optimization is most likely possible
- Measure performance in the TeraGrid environment
- Measure performance on hardware optimized for RDMA (e.g., PCI Express bus, InfiniBand interconnect)
- Implement the ability to discard eager channels if they are unused for some period of time (this is probably important in conjunction with load balancing and migrating objects based on their communication patterns)