Deep Computing Messaging Framework
Lightweight Communication for Petascale Supercomputing
DCMF BoF, Supercomputing 2008
Michael Blocksome
© 2008 IBM Corporation

DCMF Open Source Community
• Open source community established January 2008
• Wiki
• Mailing list
• Git source repository
  – helpful git resources on the wiki
  – git clone

Design Goals
• Scalable to millions of tasks
• Efficient on low-frequency embedded cores
  – inlined systems programming interface (SPI)
• Supports many programming paradigms
  – active messages
  – support for multiple contexts
  – multiple levels of application interfaces
• Structured component design
  – extensible to new architectures
  – software architecture for multiple networks
  – open source runtime with external contributions
• Separate library for optimized collectives
  – hardware acceleration
  – software collectives

IBM® Blue Gene®/P Messaging Software Stack
[Stack diagram] Application layer: applications (e.g. QCD codes and direct DCMF applications) and libraries (MPICH2 via the dcmfd ADI, Charm++, Berkeley UPC on GASNet, Global Arrays on ARMCI). Library portability layer: the DCMF public API over CCMI and the DCMF (C++) core. Below these sit the DMA SPI (Systems Programming Interface) and the BG/P network hardware. The legend distinguishes IBM-supported from externally supported software.

Direct DCMF Application Programming
• dcmf.h – core interface
  – point-to-point and utilities
  – all functions implemented
• collectives interface(s)
  – may or may not be implemented
  – check the return value on register! (see the sketch below)
• Collective Component Messaging Interface (CCMI)
  – high-level collectives library
  – uses the multisend interface
  – extensible to new collectives
[Diagram] The application calls DCMF through an adaptor layer over dcmf.h (all point-to-point), dcmf_globalcollectives.h (global collectives), dcmf_multisend.h (multisend collectives), and dcmf_collectives.h (high-level collectives via CCMI). Inside DCMF, protocols run over devices on the messager and sysdep layers above the BG/P hardware.
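
The "check the return value on register" advice matters because only dcmf.h is guaranteed to be implemented; on some messagers (for example the sockets port described later) collective registration returns DCMF_UNIMPL. A minimal sketch of that check follows, assuming a DCMF_Barrier_register()/DCMF_Barrier_Configuration_t pair in the style of dcmf_collectives.h; the names, fields, and signatures here are assumptions, not verbatim API, so consult the header.

  #include <dcmf.h>
  #include <dcmf_collectives.h>
  #include <cstdio>
  #include <cstring>

  // Sketch only: the registration call, configuration struct, and protocol
  // handle below follow the dcmf_collectives.h naming style but are assumed.
  static DCMF_CollectiveProtocol_t barrier;

  bool try_register_barrier()
  {
    DCMF_Barrier_Configuration_t cfg;
    std::memset(&cfg, 0, sizeof(cfg));   // select a barrier protocol/geometry here

    DCMF_Result rc = DCMF_Barrier_register(&barrier, &cfg);
    if (rc == DCMF_UNIMPL) {
      // Collectives are optional; fall back to a point-to-point implementation.
      std::fprintf(stderr, "barrier collective not implemented on this messager\n");
      return false;
    }
    return rc == DCMF_SUCCESS;
  }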

DCMF Blue Gene/P Performance
• MPI achieves 4300 MB/sec (96% of peak) for torus near-neighbor communication on 6 links

Point-to-point:
  Protocol                    Latency (µs)
  DCMF Eager one-way          1.6
  MPI Eager one-way           2.4
  MPI Rendezvous one-way      5.6
  DCMF Put                    0.9
  DCMF Get                    1.6
  ARMCI blocking put          2.0
  ARMCI blocking get          3.3

Collectives on 512 nodes (SMP):
  Collective operation        Performance
  MPI Barrier                 1.3 µs
  MPI Allreduce (int sum)     4.3 µs
  MPI Broadcast               4.3 µs
  MPI Allreduce throughput    817 MB/sec
  MPI Bcast throughput        2.0 GB/sec

• Barriers accelerated via the global interrupt network
• Allreduce and broadcast operations accelerated via the collective network
• Large broadcasts take advantage of the 6 edge-disjoint routes on a 3D torus

Why use DCMF?
• Scales on BG/P to millions of tasks
  – high efficiency, low overhead
• Open source
  – active community support
• Easy to port applications and libraries to the DCMF interface
• Unique features of DCMF
  – see the next chart

Feature Comparison (to the best of our knowledge)

  Feature                                       | MX | VERBS     | LAPI | ELAN     | DCMF
  Multiple contexts                             | N  | Y         | Y    | Y        | Y
  Active messages                               | N  | N¹        | Y    | Y        | Y
  One-sided calls                               | N  | Y         | Y    | Y        | Y
  Strided or vector calls                       | N¹ | N¹        | Y    | Y        | N²
  Multi-send calls                              | N¹ | N¹        | N¹   | N¹       | Y
  Message ordering and consistency              | N  | N         | N    | N        | Y
  Device interface for many different networks  | N  | Y (C-API) | N    | N        | Y³ (C++)
  Topology awareness                            | N  | N         | N    | N        | Y
  Architecture neutral                          | N  | Y         | Y    | N        | Y
  Non-blocking optimized collectives            | N¹ | N¹        | N¹   | Blocking | Y

  ¹ This feature can be implemented in software on top of the provided set of features in this API, possibly at lower efficiency.
  ² A non-contiguous transfer operation is to be added.
  ³ Device-level programming is available at the protocol level, not at the API.

DCMF C API Features
• Multiple context registration
  – supports multiple, concurrent communication paradigms
• Memory consistency
  – one-sided communication APIs such as UPC and ARMCI need optimized support for memory consistency levels
• Active messaging (sketched below)
  – good match for Charm++ and other active-message runtimes
  – MPI can be easily supported
• Multisend protocols
  – amortize startup cost across many messages sent together
• Topology awareness
• Optimized protocols
• See dcmf.h
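
Active messaging is the part of the API that differs most from two-sided MPI: the receiver registers callbacks once, and they fire from inside the messager's advance loop when data arrives. The sketch below (in C++, which the C API admits) shows the registration half of that pattern; the configuration field names, the receive-callback signature, and DCMF_DEFAULT_SEND_PROTOCOL are assumptions reconstructed from published DCMF examples, so verify them against dcmf.h.

  #include <dcmf.h>
  #include <cstring>

  static DCMF_Protocol_t send_protocol;

  // Receiver-side handler for short messages; DCMF invokes it from inside
  // DCMF_Messager_advance() on the target -- the active-message model.
  static void recv_short(void *clientdata, const DCQuad *msginfo, unsigned count,
                         size_t peer, const char *src, size_t bytes)
  {
    // consume 'bytes' of payload at 'src'; 'msginfo' carries the user header
  }

  void register_send_protocol()
  {
    DCMF_Send_Configuration_t cfg;
    std::memset(&cfg, 0, sizeof(cfg));
    cfg.protocol = DCMF_DEFAULT_SEND_PROTOCOL;   // assumed enum value
    cfg.cb_recv_short = recv_short;              // short-message path
    cfg.cb_recv_short_clientdata = 0;
    // a cb_recv handler for long messages would normally be set here as well
    DCMF_Send_register(&send_protocol, &cfg);    // each registration gives an
                                                 // independent context/paradigm
  }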

Extending DCMF to other Architectures
• Copy the "Linux® sockets" messager and build options
  – contains the sockets device and a DCMF_Send() protocol
  – implements the core API; returns DCMF_UNIMPL for collectives
• A new architecture only needs to implement DCMF_Send()
  – the sockets device enables DCMF on Linux clusters
  – the shmem device enables DCMF on multi-core systems
• DCMF provides default "*-over-send" point-to-point implementations
  – DCMF_Put()
  – DCMF_Get()
  – DCMF_Control()
• Selectively implement architecture devices and optimized protocols
  – assign to DCMF_USER0_SEND_PROTOCOL (for example) to test (see the sketch below)
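
One plausible reading of the last bullet, sketched here: register a second send protocol whose configuration selects DCMF_USER0_SEND_PROTOCOL, so the new device's send path can be exercised alongside the default one. This is an assumption about how the "user" protocol slot is selected (via the configuration's protocol field, as in the earlier sketch); check dcmf.h before relying on it.

  #include <dcmf.h>
  #include <cstring>

  // Hypothetical test hook: route one registration through the experimental
  // DCMF_USER0_SEND_PROTOCOL slot while default traffic keeps using the
  // existing device.  Field names are the same assumptions as above.
  static DCMF_Protocol_t user0_send;

  void register_experimental_send()
  {
    DCMF_Send_Configuration_t cfg;
    std::memset(&cfg, 0, sizeof(cfg));
    cfg.protocol = DCMF_USER0_SEND_PROTOCOL;   // exercises the new device's send
    DCMF_Send_register(&user0_send, &cfg);
  }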

Upcoming Features* (nothing promised)
• Common Device Interface (CDI)
  – POSIX shared memory
  – sockets
  – InfiniBand
• Multi-channel advance
  – a thread may advance a "slice" of the messaging devices
  – dedicated threads result in uncontested locks for high-level communication libraries
• Add a blocking advance API (see the polling sketch below)
  – eliminates explicit processor polls on supported hardware
  – may degrade to a regular DCMF_Messager_advance() on unsupported hardware
• Extend the API to access Blue Gene® features in a portable manner
  – network and device structures
  – replace the hardware struct with key-value pairs
• Non-contiguous point-to-point one-sided operations
  – an iterator can be used to implement all other interfaces (strided, vector, etc.)
• One-sided "on the fly" (ad hoc) collectives
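
For context on the blocking advance item: today an application makes progress by polling DCMF_Messager_advance() until its completion callback fires, which occupies a core; a blocking advance would let the caller sleep instead. A sketch of the current polling pattern, assuming the DCMF_Callback_t layout { function, clientdata } and a void(*)(void*) callback signature (check dcmf.h):

  #include <dcmf.h>

  static volatile int done = 0;

  static void mark_done(void *clientdata)
  {
    *static_cast<volatile int *>(clientdata) = 1;   // fired from inside advance()
  }

  void wait_for_completion()
  {
    // cb_done would be passed to DCMF_Send()/DCMF_Put() when the operation is
    // posted; shown here only to tie the flag to the callback.
    DCMF_Callback_t cb_done = { mark_done, (void *)&done };
    (void)cb_done;

    // Explicit processor polling: all protocol work, including mark_done(),
    // runs inside DCMF_Messager_advance() on the calling thread.
    while (!done)
      DCMF_Messager_advance();
  }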

DCMF Device Abstraction
• At the core of DCMF is a "device", with a packet API abstraction and a DMA API abstraction
• In principle the functions are virtual; in practice the methods are inlined for performance
  – Barton-Nackman C++ templates (see the sketch below)
• Common Device Interface (CDI)
  – implement this interface and you get all of DCMF "for free"
  – good for rapid prototypes
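
A minimal, self-contained illustration of the Barton-Nackman/CRTP idiom mentioned above (this is not DCMF source; the class and method names are invented for the example): the base template forwards to the derived device through a static cast, so the call behaves like a virtual packet interface but is resolved, and can be inlined, at compile time.

  #include <cstddef>

  template <class T_Device>
  class PacketDevice {
  public:
    // "Virtual in principle": every device exposes writePacket()...
    inline int writePacket(const void *payload, std::size_t bytes) {
      // ..."inlined in practice": dispatch is resolved at compile time.
      return static_cast<T_Device *>(this)->writePacketImpl(payload, bytes);
    }
  };

  class ShmemPacketDevice : public PacketDevice<ShmemPacketDevice> {
  public:
    inline int writePacketImpl(const void *payload, std::size_t bytes) {
      // a real device would copy 'bytes' of 'payload' into a FIFO slot here
      (void)payload; (void)bytes;
      return 0;
    }
  };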

Current DCMF Devices
• Blue Gene/P
  – DMA / 3-D torus network
  – collective network
  – global interrupt network
  – lockbox / memory atomics
• Generic
  – sockets (hybrid compatible)
  – shared memory (hybrid compatible)
  – InfiniBand (hybrid compatible)

Other DCMF Projects
• IBM
  – Roadrunner
• Argonne National Laboratory
  – MPICH2
  – ZeptoOS
• Pacific Northwest National Laboratory
  – Global Arrays / ARMCI
• Berkeley
  – UPC / GASNet
• University of Illinois at Urbana-Champaign
  – Charm++

Open Source Project Ideas (in no particular order)
• Store-and-forward protocols
• Stream API
• Channel combining and message striping across devices
• Extend to other process managers (Open MPI, etc.)
• Extend to other platforms (OS X, BSD, Windows, ...)
• DCMF functional and performance test suite
• Scalability improvements for sockets and InfiniBand
• Combined shmem/sockets messager
• GPU device (hybrid model?)
• Shared-memory collectives

How Can We Be a More Effective Open Source Project?
• How can we improve the open source experience?
• Specific needs or directions?
• Missing features?

Additional Charts
• DCMF on Linux Clusters
• DCMF on InfiniBand

DCMF on Linux Clusters

DCMF on Linux Clusters
• Build instructions on the wiki
• Test environment for application developers
  – evaluate the DCMF API and runtime
  – port applications to DCMF before reserving time on Blue Gene/P
• Uses MPICH2 PMI for job launch and management
  – needs a pluggable job launch and sysdep extension to remove the MPICH2 dependency
• Implemented devices
  – sockets device
  – shmem device

DCMF Sockets Device
• Standard socket syscalls are implemented on many architectures
• Uses the "packet" CDI
  – a new "stream" CDI may provide better performance
• Current design is not scalable
  – primarily a development and porting platform
• Can be used to initialize other devices that require synchronization

DCMF Shmem Device
• Uses the "packet" CDI
• Point-to-point send only
• Thread safe: multiple threads may post messages to the device
• No collectives

DCMF on InfiniBand

DCMF InfiniBand Motivations
• Optimize for low-power processors and "big fatties"
• InfiniBand project lead: Charles Archer
  – communicate via the dcmf mailing list

DCMF InfiniBand Device
• Implements the "rdma" version of the CDI
  – direct RDMA
  – memory regions
• Implements the "packet" version of the CDI
  – "eager" style sends
• rdma CDI design
  – SRQ (shared receive queue): scalable, but worst latency
• packet CDI design
  – per-destination RDMA with send/recv
  – per-destination RDMA with direct DMA: best latency

DCMF InfiniBand – Future Work
• Remove artificial limits on scalability
  – currently 32 nodes
• Implement memory-region caching
• Multiple adaptor support (?)
• Switch management routines (?)
• Multiple network implementations
  – SRQ and "per destination"
• Asynchronous progress through IB events