Charm Workshop 2008, 5/4/2008. CkDirect: Charm++ RDMA Put. Presented by Eric Bohm. CkDirect Team: Eric Bohm, Sayantan Chakravorty, Pritish Jetley, Abhinav Bhatele.

Presentation transcript:


What is CkDirect?
- One-sided communication
- One-way (put only, so far)
- Memory-to-memory interface
- Uses RDMA for zero copy
- No protocol synchronization
- User notification via callback
- Pair-wise persistent channels

Motivating Example

Messaging Approach
[Diagram: object A sends a message to object B on the destination processor; the data passes through send (S) and receive (R) message copies along the way.]

CkDirect Approach
[Diagram: object A puts its data directly into object B's buffer on the destination processor.]

RDMA Challenges
- Remote Direct Memory Access: minimal overhead => fast
- Put is more intuitive for the message-driven model
  - Get: must know the remote location and that the remote data is ready
  - Put: only needs to know the remote location
- Interfaces for RDMA vary by interconnect
- Put completion notification is lacking: either there is no notification, or the put performs hardly better than a two-sided send
  - Through some trickery, we can do better than that (see the sketch below)
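The slides do not spell out the trickery, so the following is only a plausible illustration, assumed rather than taken from the talk: the receiver pre-fills an out-of-band sentinel word in its buffer (compare the initialValue argument on the API slide later) and polls it to detect that a put has landed, avoiding a separate completion message. armBuffer and putHasLanded are hypothetical helper names.

/* Assumed illustration only: detect put completion by polling an out-of-band
 * sentinel, instead of relying on interconnect-level notification. This also
 * assumes the payload never legitimately contains the sentinel and that the
 * final word of the buffer is written last. */
#define OOB_SENTINEL (-1.0)

/* Receiver side: arm the buffer before exposing it for the next put. */
static void armBuffer(double *buf, int n) {
    buf[n - 1] = OOB_SENTINEL;         /* last word acts as the completion flag */
}

/* Receiver side: called from a polling loop in the runtime's scheduler. */
static int putHasLanded(const double *buf, int n) {
    return buf[n - 1] != OOB_SENTINEL; /* overwritten => the put has arrived */
}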

Where is it useful?
- When data of the same size is transferred between the same partners every iteration, into buffers that are reused
- When the application already enforces iteration boundaries
- Especially when you need to aggregate data from disparate sources into a contiguous buffer before processing

How does it work?
- A user callback is triggered on put completion
- The application must:
  - register the send and receive processor and memory pairs in a handle
  - register a put-completion callback for the handle
  - register an out-of-band pattern for the handle
  - call ready when done using the received put data
  - allow only one transaction per handle at a time
  - trigger a message from the callback for the real computation (a sketch follows)
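A minimal sketch of the last bullet, assuming a chare named Consumer with a compute() entry method (both hypothetical) and the callback signature declared on the API slide later in the talk: the callback only fires off a Charm++ message, and the real computation runs in the entry method.

/* Hedged sketch: keep the put-completion callback tiny and defer real work
 * to an entry method. "Consumer" and compute() are illustrative names. */
class Consumer : public CBase_Consumer {
public:
    /* Registered via CkDirect_createHandle(..., putDone, (void *)this, ...). */
    static void putDone(void *arg) {
        Consumer *self = (Consumer *)arg;
        self->thisProxy.compute();     /* trigger a message; no heavy work here */
    }
    void compute() {
        /* ... operate on the receive buffer that the put just filled ... */
    }
};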


Ping Pong Results
[Chart: ping-pong, CkDirect relative improvement, by message size in thousands of bytes.]

Matrix Multiply Results
[Chart: 2048 x 2048 matrix multiply, average time in milliseconds.]

Jacobi 3D Results
[Charts: Jacobi 3D on a 1024 x 1024 x 512 grid, iteration-time improvement from CkDirect, on Blue Gene/P (Surveyor) and InfiniBand (Abe).]

OpenAtom Results
[Chart: OpenAtom Water256M benchmark, minimization, time per step in seconds.]

Reducing Polling Overhead
- Polling overhead is proportional to the number of ready handles
- To minimize the number of ready handles, the ready call is split in two:
  - CkDirect_readyMark: done with the data, but don't start polling yet
  - CkDirect_readyPoll: the data was already marked, so start checking
    - can detect puts that completed since readyMark

Iteration Boundary
[Timeline diagram: CkDirect_put delivers user data and the completion callback is invoked, sending messages to entry methods; CkDirect_readyMark is called at phase boundaries once a buffer has been consumed, and CkDirect_readyPoll is called at the iteration boundary to start checking for the next round of puts.]
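A small sketch of the split scheme above, under the assumption that an application keeps an array of handles whose buffers are all consumed during a phase. The loop structure and the function names endOfPhase and endOfIteration are illustrative; note that the API slide below declares the polling call as CkDirect_readyPollQ, which is the name used here.

/* Hedged sketch of the mark-then-poll pattern. */
void endOfPhase(struct infiDirectUserHandle *handles, int nHandles) {
    /* Phase boundary: finished reading every receive buffer, but keep the
     * handles off the poll list for now (polling cost grows with the number
     * of ready handles). */
    for (int i = 0; i < nHandles; i++)
        CkDirect_readyMark(&handles[i]);
}

void endOfIteration(struct infiDirectUserHandle *handles, int nHandles) {
    /* Iteration boundary: start checking for the next round of puts.
     * Puts that completed since readyMark are detected as well. */
    for (int i = 0; i < nHandles; i++)
        CkDirect_readyPollQ(&handles[i]);
}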

The CkDirect API

/* Receiver side: create a handle */
struct infiDirectUserHandle CkDirect_createHandle(int senderNode, void *recvBuf,
    int recvBufSize, void (*callbackFnPtr)(void *), void *callbackData,
    double initialValue);

/* Sender side: register memory to the handle */
void CkDirect_assocLocalBuffer(struct infiDirectUserHandle *userHandle,
    void *sendBuf, int sendBufSize);

/* Sender side: actual data transfer */
void CkDirect_put(struct infiDirectUserHandle *userHandle);

/* Receiver side: done with buffer */
void CkDirect_readyMark(struct infiDirectUserHandle *userHandle);

/* Receiver side: start checking for a put */
void CkDirect_readyPollQ(struct infiDirectUserHandle *userHandle);

/* Receiver side: done with buffer, start checking for a put */
void CkDirect_ready(struct infiDirectUserHandle *userHandle);
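To tie the calls together, here is a minimal, hedged usage sketch of one persistent channel, following the "How does it work?" slide. The chare names, entry methods, the shipping of the handle to the sender inside an ordinary marshalled message, and the sentinel value are illustrative assumptions; only the CkDirect_* calls and their signatures come from the declarations above.

/* Hedged sketch of one sender/receiver pair; .ci declarations are omitted and
 * the handle struct is assumed to be marshalled as raw bytes
 * (e.g. PUPbytes(infiDirectUserHandle)). */

class Receiver : public CBase_Receiver {
    double *recvBuf;                        /* reused every iteration */
    struct infiDirectUserHandle handle;
public:
    Receiver(int senderNode, int n, CProxy_Sender sender) {
        recvBuf = new double[n];
        /* Register the sender/receiver pair, the completion callback, and an
         * out-of-band value marking "no put received yet". */
        handle = CkDirect_createHandle(senderNode, recvBuf,
                                       n * (int)sizeof(double),
                                       putDone, (void *)this, -1.0);
        /* Ship the handle to the sender so it can attach its send buffer
         * (assumed mechanism: a plain entry-method invocation). */
        sender.recvHandle(handle, n);
    }
    static void putDone(void *arg) {
        ((Receiver *)arg)->thisProxy.compute();  /* only trigger a message here */
    }
    void compute() {
        /* ... consume recvBuf ... */
        CkDirect_ready(&handle);    /* done with the data; watch for the next put */
    }
};

class Sender : public CBase_Sender {
    double *sendBuf;
    struct infiDirectUserHandle handle;
public:
    Sender(int n) { sendBuf = new double[n]; }
    void recvHandle(struct infiDirectUserHandle h, int n) {
        handle = h;
        CkDirect_assocLocalBuffer(&handle, sendBuf, n * (int)sizeof(double));
    }
    void iterate() {
        /* ... fill sendBuf for this iteration ... */
        CkDirect_put(&handle);      /* at most one outstanding put per handle */
    }
};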

Conclusions
- Availability: the CVS version of Charm++
  - net-linux-amd64-ibverbs
  - bluegenep
- Future work:
  - CkDirect multicasts
  - ports to other architectures
Questions? Feedback?