Acceleration of an Asynchronous Message Driven Programming Paradigm on IBM Blue Gene/Q

Sameer Kumar*, IBM T. J. Watson Research Center, Yorktown Heights, NY
Yanhua Sun, Laxmikant Kale, Department of Computer Science, University of Illinois at Urbana-Champaign

IPDPS, Boston, May 22, 2013
Overview
Charm++ programming model
Blue Gene/Q machine
– Programming models and messaging libraries on Blue Gene/Q
Optimization of Charm++ on BG/Q
Performance results
Summary
Charm++ Programming Model
Asynchronous message-driven programming
– Users decompose the problem into more work units than processors (over-decomposition)
– Intelligent runtime: task-to-processor mapping, communication and load balancing, fault tolerance
– Computation and communication overlap via asynchronous communication
– Execution is driven by the arrival of message data
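A minimal Charm++ sketch (not taken from the slides) of what message-driven execution looks like at the source level: a main chare creates an over-decomposed chare array and invokes an entry method asynchronously through a proxy; the runtime delivers each invocation as a message and schedules it when the data arrives. File, module, and class names here are illustrative.

```cpp
/* hello.ci (Charm++ interface file, compiled by charmc):
 *
 *   mainmodule hello {
 *     mainchare Main {
 *       entry Main(CkArgMsg *m);
 *     };
 *     array [1D] Hello {
 *       entry Hello();
 *       entry void greet(int step);
 *     };
 *   };
 */

// hello.C
#include "hello.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    delete m;
    // Over-decompose: create many more chares than processors.
    CProxy_Hello workers = CProxy_Hello::ckNew(4 * CkNumPes());
    workers.greet(0);   // asynchronous broadcast; returns immediately
  }
};

class Hello : public CBase_Hello {
public:
  Hello() {}
  Hello(CkMigrateMessage *m) {}
  void greet(int step) {
    // Runs only when the Charm++ scheduler delivers the message carrying 'step'.
    if (thisIndex == 0 && step == 0) CkExit();
  }
};

#include "hello.def.h"
```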
Charm++ Runtime System
Non-SMP mode
– One process per hardware thread
– Each process runs a separate Charm++ scheduler
SMP mode
– One or a few processes per node
– Multiple threads execute Charm++ schedulers in the same address space
– Lower memory overhead, since read-only data structures are not replicated
– Communication threads can drive network progress
– Communication within the node is done via pointer exchange
Blue Gene/Q
Blue Gene/Q Architecture
Integrated scalable 5D torus
– Virtual cut-through routing
– Hardware assists for collective and barrier functions
– Floating-point addition support in the network
– RDMA: integrated on-chip Message Unit, 272 concurrent endpoints
– 2 GB/s raw bandwidth on each of the 10 links in each direction (4 GB/s bidirectional)
– 1.8 GB/s user bandwidth after protocol overhead
– 5D nearest-neighbor exchange measured at 1.76 GB/s per link (98% efficiency)
Processor architecture
– Implements the 64-bit Power ISA v2.06
– 4-way simultaneous multithreading
– Quad FPU, 2-way concurrent issue
– In-order execution with dynamic branch prediction
Node architecture
– Large multi-core SMP with 64 threads per node
– Relatively little memory per thread: 16 GB per node shared by 64 threads
New Hardware Features
Scalable L2 atomics
– Atomic operations can be invoked on 64-bit words in DDR
– Several operations supported, including load-increment, store-add, store-XOR
– Bounded atomics supported
Wait on pin
– A thread can arm a wakeup unit and go to wait
– Core resources such as load/store pipeline slots and arithmetic units are not used while waiting
– The thread is awakened by a network packet, a store to a memory location that results in an L2 invalidate, or an inter-process interrupt (IPI)
PAMI Messaging Library on BG/Q
PAMI: Parallel Active Messaging Interface
[Software stack diagram: middleware such as MPICH2 2.x, IBM MPI 2.x/MPCI, the APGAS runtimes (X10, UPC, CAF), GA/ARMCI, GASNet, and Charm++ sit on the common PAMI API, which is backed by platform-specific messaging implementations over the BG/Q MU SPI, the PERCS HAL API, and Intel x86 system software.]
Point-to-point Operations
Active messages
– A registered handler is called on the remote node
– PAMI_Send_immediate for short transfers
– PAMI_Send for longer transfers
One-sided remote DMA
– PAMI_Get, PAMI_Put: the application initiates RDMA with a remote virtual address
– PAMI_Rget, PAMI_Rput: the application first exchanges memory regions before starting the RDMA transfer
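A hedged sketch of the short active-message path with PAMI_Send_immediate. It assumes a PAMI client and context have already been created and a dispatch id has been registered with PAMI_Dispatch_set; the struct fields follow the usual pami.h definitions and should be checked against the installed header.

```cpp
#include <pami.h>
#include <cstdint>
#include <cstring>

// Send a small payload as an active message; the handler registered under
// dispatch_id runs on the destination when the packet arrives.
void send_short(pami_context_t context, pami_endpoint_t dest,
                size_t dispatch_id, const void *payload, size_t bytes) {
  uint32_t hdr = 42;                       // small application header (illustrative)
  pami_send_immediate_t send;
  std::memset(&send, 0, sizeof(send));
  send.dispatch        = dispatch_id;      // selects the registered remote handler
  send.dest            = dest;             // a network endpoint, not a process or thread
  send.header.iov_base = &hdr;
  send.header.iov_len  = sizeof(hdr);
  send.data.iov_base   = const_cast<void *>(payload);
  send.data.iov_len    = bytes;            // must fit the immediate-size limit
  PAMI_Send_immediate(context, &send);     // completes locally before returning
}
```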
Multi-threading in PAMI
Multi-context communication
– Gives several threads on a multi-core architecture concurrent access to the network
– Eliminates contention for shared resources
– Enables parallel send and receive operations on different contexts via different BG/Q injection and reception FIFOs
Endpoint addressing scheme
– Communication is between network endpoints, not processes, threads, or tasks
Multiple contexts progressed by multiple communication threads
– Communication threads on BG/Q wait on a pin: L2 writes or network packets can wake them with very low overhead
Post work to PAMI contexts via PAMI_Context_post
– Work is posted to a concurrent L2 atomic queue
– Work functions are advanced by the main or communication threads
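A hedged sketch of handing work to the thread that advances a context via PAMI_Context_post. The PostedSend wrapper, and the assumption that the pami_work_t storage may be reclaimed inside the work function, are illustrative choices rather than the Charm++ source.

```cpp
#include <pami.h>

struct PostedSend {
  pami_work_t           work;     // opaque storage used by the runtime's work queue
  pami_send_immediate_t send;     // send parameters prepared by the worker thread
};

static pami_result_t do_send(pami_context_t context, void *cookie) {
  PostedSend *p = static_cast<PostedSend *>(cookie);
  PAMI_Send_immediate(context, &p->send);   // executed by the advancing thread
  delete p;                                 // assumes the work item was dequeued first
  return PAMI_SUCCESS;                      // done; nothing left to re-run
}

void post_send(pami_context_t context, PostedSend *p) {
  // Enqueues onto the context's (L2-atomic-backed) work queue; the main or
  // communication thread that advances this context will later call do_send().
  PAMI_Context_post(context, &p->work, do_send, p);
}
```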
Charm++ Port over PAMI on BG/Q
Charm++ Port and Optimizations
Ported the Converse machine interface to make PAMI API calls
Explored various optimizations:
– Lockless queues
– Scalable memory allocation
– Concurrent communication: allocate multiple PAMI contexts, with multiple communication threads driving them
– Optimized short messages: many-to-many interface
Lockless Queues
Concurrent producer-consumer array-based queues built on L2 atomic increments
An overflow queue is used when the L2 queue is full
Threads in the same process can send messages via concurrent enqueues
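The Charm++ queue itself relies on BG/Q's bounded L2 fetch-and-increment; the sketch below is a portable analogue using C++11 atomics that shows the same structure. Producers reserve array slots with an atomic increment, a single consumer (the owning scheduler thread) drains them, and a failed push tells the caller to fall back to the overflow queue.

```cpp
#include <atomic>
#include <cstddef>

template <typename T, size_t N>            // N must be a power of two
class BoundedMPSCQueue {
  std::atomic<T *>    slot_[N];
  std::atomic<size_t> tail_{0};            // shared among producers
  std::atomic<size_t> head_{0};            // advanced only by the consumer
public:
  BoundedMPSCQueue() { for (auto &s : slot_) s.store(nullptr); }

  // Called by any producer thread. On BG/Q the CAS loop is replaced by a
  // bounded L2 fetch-and-increment that fails without burning a slot.
  bool push(T *msg) {
    size_t t = tail_.load(std::memory_order_relaxed);
    for (;;) {
      if (t - head_.load(std::memory_order_acquire) >= N)
        return false;                      // full: caller uses the overflow queue
      if (tail_.compare_exchange_weak(t, t + 1, std::memory_order_relaxed))
        break;                             // slot t reserved for this producer
    }
    slot_[t & (N - 1)].store(msg, std::memory_order_release);
    return true;
  }

  // Called only by the consuming scheduler thread.
  T *pop() {
    size_t h = head_.load(std::memory_order_relaxed);
    T *msg = slot_[h & (N - 1)].exchange(nullptr, std::memory_order_acquire);
    if (msg) head_.store(h + 1, std::memory_order_release);
    return msg;                            // nullptr if the next slot is still empty
  }
};
```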
Scalable Memory Allocation
System software on BG/Q uses the glibc shared-arena allocator
– malloc: find an available arena and lock it; allocate and return the buffer; release the lock
– free: find the arena the buffer was allocated from; lock that arena, free the buffer, and unlock
– free can result in thread contention
– This can slow down the short malloc/free calls typically used in Charm++ applications such as NAMD
Scalable Memory Allocation (2)
Optimize via memory pools of short buffers
L2 atomic queues give fast concurrent access from multiple threads
Allocate
– Dequeue from the Charm++ thread's local memory pool if a buffer is available
– If the pool is empty, allocate via glibc malloc
Deallocate
– Enqueue to the owner thread's pool via a lockless enqueue
– Release via glibc free if the owner thread's pool is full
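A hedged sketch of the pooled allocator described above, reusing the BoundedMPSCQueue analogue from the lockless-queue sketch. The names (Pool, PoolBuffer, BUFSZ) and sizes are illustrative, not the Charm++ implementation: allocation pops from the calling thread's own pool, a free from any thread is a lockless enqueue back to the owner's pool, and glibc malloc/free is the fallback.

```cpp
#include <cstddef>
#include <cstdlib>

constexpr size_t BUFSZ = 1024;             // the pool handles "short" buffers only

struct PoolBuffer {
  int owner;                               // index of the thread that owns this buffer
  alignas(16) char payload[BUFSZ];
};

struct Pool {
  BoundedMPSCQueue<PoolBuffer, 512> free_list;   // concurrent returns, local grabs
};

Pool pools[64];                            // one per Charm++ thread (64 threads/node)
thread_local int my_thread = 0;            // set during thread startup (illustrative)

void *pool_alloc(size_t bytes) {
  if (bytes <= BUFSZ) {
    if (PoolBuffer *b = pools[my_thread].free_list.pop())
      return b->payload;                   // reuse a pooled buffer: no lock, no glibc
    PoolBuffer *b = static_cast<PoolBuffer *>(std::malloc(sizeof(PoolBuffer)));
    b->owner = my_thread;                  // pool miss: fall back to glibc malloc
    return b->payload;
  }
  PoolBuffer *b = static_cast<PoolBuffer *>(
      std::malloc(offsetof(PoolBuffer, payload) + bytes));
  b->owner = -1;                           // too large to pool; always glibc-freed
  return b->payload;
}

void pool_free(void *p) {
  PoolBuffer *b = reinterpret_cast<PoolBuffer *>(
      static_cast<char *>(p) - offsetof(PoolBuffer, payload));
  if (b->owner < 0 || !pools[b->owner].free_list.push(b))
    std::free(b);                          // oversized, or the owner's pool is full
}
```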
Multiple Contexts and Communication Threads
Maximize concurrency in sends and receives
Charm++ SMP mode creates multiple PAMI contexts
– Subgroups of Charm++ worker threads are associated with each PAMI context
– For example, at 64 threads/node we use 16 PAMI contexts, so subgroups of 4 threads share a PAMI context
– PAMI library calls are protected via critical sections
Worker threads advance PAMI contexts when idle
– This mode is suitable for compute-bound applications
SMP mode with communication threads
– Each PAMI context is advanced by a different communication thread
– Charm++ worker threads post work via PAMI_Context_post
– Charm++ worker threads do not advance PAMI contexts
– This mode is suitable for communication-bound applications
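A small sketch of the first mode (idle workers advancing shared contexts), with illustrative names: worker w maps to context w / 4, and a short critical section (shown here with std::mutex rather than Charm++'s own locks) protects the shared context while it is advanced. In the second mode the worker would instead hand the operation to the context's communication thread with PAMI_Context_post, as in the earlier sketch.

```cpp
#include <pami.h>
#include <mutex>

constexpr int NUM_CONTEXTS        = 16;    // 16 contexts per node on BG/Q
constexpr int WORKERS_PER_CONTEXT = 4;     // 64 worker threads / 16 contexts

pami_context_t contexts[NUM_CONTEXTS];     // created at startup with PAMI_Context_createv
std::mutex     context_locks[NUM_CONTEXTS];

// Compute-bound mode: an idle worker briefly drives its subgroup's context.
void advance_my_context(int worker_id) {
  int c = worker_id / WORKERS_PER_CONTEXT;
  std::lock_guard<std::mutex> guard(context_locks[c]);  // critical section
  PAMI_Context_advance(contexts[c], 1);                 // poll the network once
}
```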
Optimize Short Messages
CmiDirectManytomany
– Charm++ interface to optimize a burst of short messages
– Message buffer addresses and sizes are registered ahead of time
– Communication operations are kicked off via a start call
– A completion callback notifies the Charm++ scheduler when data has been fully sent and received
– Charm++ scheduling and header overheads are eliminated
Burst sends of several short messages are parallelized by posting work to multiple communication threads
– Worker threads call PAMI_Context_post with a work function
– Work functions execute PAMI_Send_immediate calls to move data on the network
– On the receiver, data is moved directly into the registered destination buffers
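A hedged sketch of how a registered burst can be parallelized across contexts, as described in the second bullet group. The BurstSlice structure and start_burst function are illustrative, not the CmiDirectManytomany implementation: the starting worker posts one work item per context, and each posted function sends its slice of the registered messages with PAMI_Send_immediate.

```cpp
#include <pami.h>
#include <cstring>

struct Msg { pami_endpoint_t dest; void *buf; size_t bytes; };

struct BurstSlice {
  pami_work_t work;       // storage for PAMI_Context_post
  Msg        *msgs;       // this context's share of the registered messages
  int         count;
  size_t      dispatch;   // dispatch id registered for the many-to-many handler
};

static pami_result_t send_slice(pami_context_t ctx, void *cookie) {
  BurstSlice *s = static_cast<BurstSlice *>(cookie);
  for (int i = 0; i < s->count; ++i) {
    pami_send_immediate_t p;
    std::memset(&p, 0, sizeof(p));
    p.dispatch      = s->dispatch;
    p.dest          = s->msgs[i].dest;
    p.data.iov_base = s->msgs[i].buf;
    p.data.iov_len  = s->msgs[i].bytes;
    PAMI_Send_immediate(ctx, &p);          // short send, no per-message scheduling
  }
  return PAMI_SUCCESS;
}

// Called by the worker thread that starts the burst: one post per context,
// so the sends proceed in parallel on the communication threads.
void start_burst(pami_context_t contexts[], BurstSlice slices[], int ncontexts) {
  for (int c = 0; c < ncontexts; ++c)
    PAMI_Context_post(contexts[c], &slices[c].work, send_slice, &slices[c]);
}
```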
Performance Results
Converse Internode Ping-Pong Latency
Converse Intranode Ping-Pong Latency
Scalable Memory Allocation
Benchmark: 64 threads on a node allocate and free 100 buffers in each iteration
Performance Impact of L2 Atomic Queues
NAMD APoA1 benchmark: speedups of 2.7x and 1.5x (see plot)
NAMD Application on 512 Nodes
Time profile with 32 worker threads and 8 communication threads per node vs. time profile with 48 worker threads and no communication threads per node
PME Optimization with CmiDirectManytomany (1024 nodes)
3D Complex-to-Complex FFT
[Table: forward + backward complex-to-complex 3D FFT time in microseconds vs. node count, comparing point-to-point (p2p) and many-to-many (m2m) implementations for 128x128x128, 64x64x64, and 32x32x32 grids.]
NAMD APoA1 Benchmark Performance Results
Best BG/Q time step: 0.68 ms/step
Summary
Presented several optimizations for the Charm++ runtime on the Blue Gene/Q machine
SMP mode outperforms non-SMP mode
– Best performance on BG/Q with 1 to 4 processes per node and 16 to 64 threads per process
Best time step of 0.68 ms/step for the NAMD application on the APoA1 benchmark
Thank You
Questions?