© 2010 IBM Corporation
Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems
Gabor Dozsa 1, Sameer Kumar 1, Pavan Balaji 2, Darius Buntinas 2, David Goodell 2, William Gropp 3, Joe Ratterman 4, and Rajeev Thakur 2
1 IBM T. J. Watson Research Center, Yorktown Heights, NY
2 Argonne National Laboratory, Argonne, IL
3 University of Illinois, Urbana, IL
4 IBM Systems and Technology Group, Rochester, MN
Joint collaboration between Argonne and IBM

Outline
- Motivation
- MPI semantics for multithreading
- Blue Gene/P overview
  - Deep Computing Messaging Framework (DCMF)
- Optimize MPI thread parallelism
- Performance results
- Summary

Motivation
- Multicore architectures provide many threads per node
- Running an MPI process on every core results in:
  - Less memory per process
  - Higher TLB pressure
  - The problem may not scale to that many processes
- Hybrid programming (sketched below):
  - Use MPI across nodes
  - Use shared memory within a node (POSIX threads, OpenMP)
  - The MPI library is then accessed concurrently from many threads
- Fully concurrent network interfaces permit concurrent access from multiple threads
- A thread-optimized MPI library is critical for hybrid programming
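A minimal sketch of the hybrid model this motivates, assuming one MPI process per node with OpenMP threads inside it, each thread issuing its own MPI calls. The ring-neighbor exchange, per-thread tags, and the assumption of equal thread counts on every rank are illustrative choices, not taken from the talk:

```c
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Threads will call MPI concurrently, so request MPI_THREAD_MULTIPLE. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    if (provided == MPI_THREAD_MULTIPLE) {
        int tid   = omp_get_thread_num();
        int right = (rank + 1) % nranks;           /* illustrative ring neighbors */
        int left  = (rank - 1 + nranks) % nranks;
        int buf   = rank * 1000 + tid;

        /* Every thread communicates on its own, distinguished by its tag
         * (assumes the same thread count on every rank). */
        MPI_Sendrecv_replace(&buf, 1, MPI_INT, right, tid,
                             left, tid, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```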

MPI Semantics for Multithreading
- MPI defines four thread levels (see the example below):
  - Single: only one thread will execute
  - Funneled: the process may be multithreaded, but only the main thread will make MPI calls
  - Serialized: the process may be multithreaded, and multiple threads may make MPI calls, but only one at a time; MPI calls are never made concurrently from two distinct threads
  - Multiple: multiple threads may call MPI, with no restrictions
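As a small illustration of these levels (not from the talk), an application typically requests the level it needs at initialization and checks what the library actually granted:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for the most permissive level; the library may grant a lower one. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    switch (provided) {
    case MPI_THREAD_SINGLE:     /* only one thread may execute            */
    case MPI_THREAD_FUNNELED:   /* only the main thread may call MPI      */
    case MPI_THREAD_SERIALIZED: /* any thread, but one MPI call at a time */
        fprintf(stderr, "MPI_THREAD_MULTIPLE not granted (got %d)\n", provided);
        break;
    case MPI_THREAD_MULTIPLE:   /* any thread, no restrictions            */
        break;
    }

    MPI_Finalize();
    return 0;
}
```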

Blue Gene/P Overview

Blue Gene/P Interconnection Networks
- 3-dimensional torus (DMA responsible for handling packets)
  - Interconnects all compute nodes
  - Virtual cut-through hardware routing
  - 3.4 Gb/s on all 12 node links (5.1 GB/s per node)
  - 0.5 µs latency between nearest neighbors, 5 µs to the farthest (about 100 ns per hop)
  - Communications backbone for computations
- Collective network (processor core responsible for handling packets)
  - One-to-all broadcast functionality
  - Reduction operations functionality
  - 6.8 Gb/s (850 MB/s) of bandwidth per link
  - Latency of one-way network traversal: 1.3 µs
- Low-latency global barrier and interrupt
  - Latency of one way to reach all 72K nodes: 0.65 µs (1.2 µs with MPI)

BG/P Torus Network and the DMA
- The torus network is accessed via a direct memory access (DMA) unit
- The DMA unit sends and receives data using physical addresses
- Messages have to reside in contiguous buffers
- The DMA performs cache injection on both sender and receiver
- DMA resources are managed by software
(Figure: Blue Gene/P compute card)

BG/P DMA
- DMA resources:
  - Injection memory FIFOs
  - Reception memory FIFOs
  - Counters
- A collection of DMA resources forms a group
  - Each BG/P node has four groups
  - Access to a DMA group can be concurrent
  - Typically used to support virtual-node mode with four processes
(Figure: Blue Gene/P node card)

Message Passing on the BG/P DMA
- The sender injects a DMA descriptor
  - The descriptor gives the DMA detailed information on the actions to perform
- The DMA initiates intra-node or inter-node data movement (sketched below):
  - Memory FIFO send: results in packets in the destination's reception memory FIFO
  - Direct put: moves local data to a destination buffer
  - Remote get: pulls data from a source node to a local (or even remote) buffer
- Injection and reception FIFOs in different groups can be accessed in parallel by different processes and threads
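A purely illustrative sketch of the descriptor-driven interface described above. The type names, fields, and the software-modelled injection FIFO are hypothetical stand-ins, not the real BG/P DMA SPI:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: the real BG/P DMA SPI differs in names and layout. */
typedef enum {
    DMA_MEMFIFO_SEND,  /* packets land in the destination's reception memory FIFO     */
    DMA_DIRECT_PUT,    /* local data is written straight into a remote buffer          */
    DMA_REMOTE_GET     /* data is pulled from a remote node into a (possibly remote) buffer */
} dma_op_t;

typedef struct {
    dma_op_t op;           /* which of the three data-movement actions to perform */
    uint32_t dest_rank;    /* destination node (source node for a remote get)     */
    uint64_t local_paddr;  /* physical address of the contiguous local buffer     */
    uint64_t remote_off;   /* offset into the remote buffer or counter base       */
    size_t   nbytes;       /* message length                                      */
    uint32_t group;        /* DMA group / injection FIFO to use                   */
} dma_descriptor_t;

/* Software model of an injection FIFO: the sender appends a descriptor and the
 * DMA engine (not modelled here) consumes entries asynchronously. */
#define INJ_FIFO_DEPTH 256
static dma_descriptor_t inj_fifo[INJ_FIFO_DEPTH];
static unsigned         inj_head;

static void dma_inject(const dma_descriptor_t *desc)
{
    inj_fifo[inj_head % INJ_FIFO_DEPTH] = *desc;  /* flow control omitted */
    inj_head++;
}
```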

Deep Computing Messaging Framework (DCMF)
- Low-level messaging API on BG/P
- Supports multiple programming paradigms on BG/P
- Active message API
  - A good match for LAPI, Charm++, and other active message runtimes
  - MPI is implemented on top of this active message runtime
- Optimized collectives

Extensions to DCMF for Concurrency
- Introduce the notion of channels (see the sketch below)
  - In SMP mode the four injection/reception DMA groups are exposed as four DCMF channels
  - New DCMF API calls: DCMF_Channel_acquire() and DCMF_Channel_release()
    - DCMF progress calls are enhanced to advance only the acquired channel
    - Channel state is stored in thread-private memory
  - The point-to-point API is unchanged
- Channels are similar to the endpoints proposal in the MPI Forum
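A sketch of how a thread might use these calls. Only the acquire/release names come from the slide; the C prototypes below and the DCMF_Messager_advance() progress call are assumptions in this sketch rather than the documented API:

```c
/* Assumed prototypes -- the real dcmf.h declarations may differ. */
extern void DCMF_Channel_acquire(int channel);
extern void DCMF_Channel_release(int channel);
extern int  DCMF_Messager_advance(void);   /* progress call (assumed name/signature) */
extern int  work_remaining(void);          /* application-specific completion test   */

void progress_on_my_channel(int my_channel)
{
    /* Bind this thread to one of the four injection/reception DMA groups;
     * the channel id is kept in thread-private state by the runtime. */
    DCMF_Channel_acquire(my_channel);

    /* Progress advances only the acquired channel, so threads holding
     * different channels make progress concurrently, without a shared lock. */
    while (work_remaining())
        DCMF_Messager_advance();

    DCMF_Channel_release(my_channel);
}
```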

Optimize MPI Thread Concurrency

MPICH Thread Parallelism
- Coarse grained
  - Each MPI call is guarded by an ALLFUNC critical-section macro
  - Blocking MPI calls release the critical section, enabling all threads to make progress
- Fine grained (see the sketch below)
  - Decrease the size of the critical sections; the ALLFUNC macros are disabled
  - Each shared resource is locked individually:
    - A MSGQUEUE critical section can guard a message queue
    - The RECVQUEUE critical section guards the MPI receive queues
    - A HANDLE mutex guards allocation of object handles
  - Eliminate critical sections where possible:
    - Operations such as reference counting can be optimized via scalable atomics (e.g., fetch-and-add)
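A minimal sketch (not the MPICH source) contrasting the two granularities: one global critical section versus per-resource locks plus an atomic fetch-and-add for reference counts:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Coarse grained: every MPI entry point funnels through one global lock. */
static pthread_mutex_t allfunc_cs = PTHREAD_MUTEX_INITIALIZER;

void mpi_entry_coarse(void (*body)(void))
{
    pthread_mutex_lock(&allfunc_cs);    /* serializes all threads in the library */
    body();
    pthread_mutex_unlock(&allfunc_cs);
}

/* Fine grained: lock only the resource that is actually touched. */
static pthread_mutex_t recvqueue_cs = PTHREAD_MUTEX_INITIALIZER;  /* receive queues */

void post_receive_fine(void (*enqueue)(void))
{
    pthread_mutex_lock(&recvqueue_cs);  /* only receive-queue users contend here */
    enqueue();
    pthread_mutex_unlock(&recvqueue_cs);
}

/* No critical section at all: reference counting via a scalable atomic. */
typedef struct { atomic_int ref_count; } request_t;

void request_add_ref(request_t *req)
{
    atomic_fetch_add(&req->ref_count, 1);  /* fetch-and-add instead of a lock */
}
```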

Enabling Concurrent MPI on BG/P
- Messages to the same destination always use the same channel, to preserve MPI ordering
- Messages from different sources arrive on different channels, to improve parallelism
- Each (source, destination) pair is mapped to a channel via a hash function, as shown below
  - E.g., (srcrank + dstrank) % numchannels
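The hash from the slide, written out as a small helper (the function name is ours):

```c
/* Map a (source, destination) pair to a channel.  The same pair always lands
 * on the same channel, preserving MPI ordering, while traffic from different
 * sources spreads across channels (numchannels is 4 on BG/P). */
static inline int map_to_channel(int srcrank, int dstrank, int numchannels)
{
    return (srcrank + dstrank) % numchannels;
}
```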

Parallel Receive Queues
- Standard MPICH has two queues: posted receives and unexpected messages
- We extend MPICH to have:
  - A queue of unexpected messages and posted receives for each channel
  - An additional queue for wild-card receives
- In the absence of wild cards, each process posts receives to the channel queue
  - When there is a wild card, all receives are posted to the wild-card queue
- When a packet arrives (see the sketch below):
  - The wild-card queue is processed first, after acquiring the WC lock
  - If it is empty, the thread that receives the packet acquires the channel RECVQUEUE lock and:
    - Matches the packet with posted channel receives, or
    - If no match is found, creates a new entry in the channel's unexpected queue
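A hedged sketch of that arrival path; the queue types and helper functions below are illustrative stand-ins, not MPICH internals:

```c
#include <pthread.h>

/* Illustrative types -- not the MPICH internals. */
typedef struct packet  packet_t;    /* an arriving packet       */
typedef struct recvreq recvreq_t;   /* a posted receive request */
typedef struct queue { struct qnode *head, *tail; } queue_t;

#define NUM_CHANNELS 4

extern pthread_mutex_t wc_lock;                        /* guards the wild-card queue  */
extern queue_t         wildcard_queue;
extern pthread_mutex_t recvqueue_lock[NUM_CHANNELS];   /* per-channel RECVQUEUE locks */
extern queue_t         posted_queue[NUM_CHANNELS];
extern queue_t         unexpected_queue[NUM_CHANNELS];

/* Assumed helpers, defined elsewhere in this sketch. */
extern int        queue_empty(queue_t *q);
extern recvreq_t *match_and_dequeue(queue_t *q, packet_t *pkt);
extern void       enqueue_unexpected(queue_t *q, packet_t *pkt);
extern void       deliver(recvreq_t *req, packet_t *pkt);

void handle_arriving_packet(packet_t *pkt, int channel)
{
    recvreq_t *req = NULL;

    /* 1. Wild-card receives are checked first, under the WC lock. */
    pthread_mutex_lock(&wc_lock);
    if (!queue_empty(&wildcard_queue))
        req = match_and_dequeue(&wildcard_queue, pkt);
    pthread_mutex_unlock(&wc_lock);

    if (req == NULL) {
        /* 2. Otherwise match against this channel's posted receives; the
         *    per-channel lock lets other channels proceed in parallel.   */
        pthread_mutex_lock(&recvqueue_lock[channel]);
        req = match_and_dequeue(&posted_queue[channel], pkt);
        if (req == NULL)
            enqueue_unexpected(&unexpected_queue[channel], pkt);  /* 3. no match */
        pthread_mutex_unlock(&recvqueue_lock[channel]);
    }

    if (req != NULL)
        deliver(req, pkt);
}
```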

Performance Results

Message Rate Benchmark
- Message-rate benchmark in which each thread exchanges messages with a different node (a sketch of this pattern follows)
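A sketch of the benchmark pattern described by the caption, under assumed parameters: the window size, zero-byte messages, and the thread-to-peer mapping are our illustrative choices. Rank 0 runs the threads, and ranks 1..nthreads act as single-threaded peers:

```c
#include <mpi.h>
#include <omp.h>

#define WINDOW  64   /* messages in flight per iteration (assumption)    */
#define MSGSIZE 0    /* zero-byte payload to measure pure message rate   */

void message_rate_iteration(int rank, int nthreads)
{
    if (rank == 0) {
        /* Multithreaded node: thread t exchanges messages with node t+1. */
        #pragma omp parallel num_threads(nthreads)
        {
            int tid  = omp_get_thread_num();
            int peer = tid + 1;
            char buf[1];
            MPI_Request reqs[2 * WINDOW];

            for (int w = 0; w < WINDOW; w++) {
                MPI_Irecv(buf, MSGSIZE, MPI_CHAR, peer, tid, MPI_COMM_WORLD, &reqs[2*w]);
                MPI_Isend(buf, MSGSIZE, MPI_CHAR, peer, tid, MPI_COMM_WORLD, &reqs[2*w + 1]);
            }
            MPI_Waitall(2 * WINDOW, reqs, MPI_STATUSES_IGNORE);
        }
    } else if (rank <= nthreads) {
        /* Single-threaded peer: mirror the exchange with rank 0. */
        int tag = rank - 1;   /* matches the tag used by the sending thread */
        char buf[1];
        MPI_Request reqs[2 * WINDOW];

        for (int w = 0; w < WINDOW; w++) {
            MPI_Irecv(buf, MSGSIZE, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &reqs[2*w]);
            MPI_Isend(buf, MSGSIZE, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &reqs[2*w + 1]);
        }
        MPI_Waitall(2 * WINDOW, reqs, MPI_STATUSES_IGNORE);
    }
}
```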

Message Rate Performance
- Zero threads = the MPI_THREAD_SINGLE baseline

Thread Scaling of MPI vs. DCMF
- The absence of receive-side matching enables higher concurrency in DCMF than in MPI

Summary and Future Work
- We presented several techniques to improve the throughput of MPI calls on multithreaded architectures
- Performance improves by 3.6x with four threads
- These techniques should be extensible to other architectures whose network interfaces permit concurrent access
- Future work: explore lockless techniques to eliminate the critical sections for handles and other resources
  - Garbage-collect request objects