Scaling NWChem with Efficient and Portable Asynchronous Communication in MPI RMA
Min Si [1][2], Antonio J. Peña [1], Jeff Hammond [3], Pavan Balaji [1], Yutaka Ishikawa [4]
[1] Argonne National Laboratory, USA  [2] University of Tokyo, Japan  [3] Intel Labs, USA  [4] RIKEN AICS, Japan

Large Chemical & Biological Applications
- NWChem: quantum chemistry application
- SWAP-Assembler: bioinformatics application
- Molecular dynamics: simulation of the physical movements of atoms and molecules for materials, chemistry, and biology
Application characteristics: large memory requirement (cannot fit in a single node) and irregular data movement.

NWChem [1]
- High-performance computational chemistry application suite
- Composed of many types of simulation capabilities:
  - Molecular electronic structure
  - Quantum mechanics / molecular mechanics
  - Pseudopotential plane-wave electronic structure
  - Molecular dynamics
[Figure: example molecules, water (H2O)21, pyrene C16H10, carbon C20.]
[1] M. Valiev, E. J. Bylaska, N. Govind, K. Kowalski, T. P. Straatsma, H. J. J. van Dam, D. Wang, J. Nieplocha, E. Apra, T. L. Windus, and W. A. de Jong, "NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations," Comput. Phys. Commun. 181, 1477 (2010).

Communication Runtime
- Software stack: Applications run over Global Arrays [2], which runs over ARMCI [3], the communication interface for RMA.
- Global Arrays provides abstractions for distributed arrays: a global address space, physically distributed across processes and hidden from the user.
- ARMCI native ports (e.g., Cray DMAPP, InfiniBand, ...): limited platforms and a long development cycle for supporting each new platform.
- ARMCI-MPI: a portable implementation of ARMCI on top of MPI RMA that supports most platforms (Tianhe-2, K, Cray, InfiniBand, ...); a minimal sketch of this layering follows below.
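As a concrete illustration of what "portable implementation on top of MPI" means, here is a minimal sketch (my own illustration, not ARMCI-MPI source) of how an ARMCI-style blocking get can be expressed with MPI-3 passive-target RMA. The function name armci_style_get is hypothetical, and the window is assumed to have been created with MPI_Win_allocate and locked by the caller with MPI_Win_lock_all.

    #include <mpi.h>

    /* Sketch only: an ARMCI_Get-like blocking get layered on MPI RMA.
     * Assumes 'win' is already locked (MPI_Win_lock_all) by the caller. */
    void armci_style_get(double *local_buf, int nelems, int target_rank,
                         MPI_Aint target_disp, MPI_Win win)
    {
        MPI_Get(local_buf, nelems, MPI_DOUBLE,
                target_rank, target_disp, nelems, MPI_DOUBLE, win);
        /* Wait for local completion so the data is usable, mimicking the
         * blocking semantics of ARMCI_Get. */
        MPI_Win_flush_local(target_rank, win);
    }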

Get-Compute-Update
Typical get-compute-update mode in GA programming: GET blocks a and b, perform a DGEMM in a local buffer, then ACCUMULATE block c (an MPI RMA version of this pattern is sketched below the pseudocode).

Pseudocode:
    for i in I blocks:
      for j in J blocks:
        for k in K blocks:
          GET block a from A
          GET block b from B
          c += a * b        /* computing */
        end do
        ACC block c to C
      end do
    end do
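The same pattern can be written directly against MPI RMA, which is roughly the layering that ARMCI-MPI provides underneath Global Arrays. Below is a minimal sketch under several assumptions of mine: square bs-by-bs blocks of doubles, windows already locked with MPI_Win_lock_all, and hypothetical helpers block_rank(), block_disp(), and a local DGEMM kernel that are declared but not shown.

    #include <mpi.h>
    #include <string.h>

    /* Hypothetical helpers: map a block index to its owning rank and window
     * displacement, and a local DGEMM kernel (e.g., BLAS dgemm). Not shown. */
    int      block_rank(int i, int j);
    MPI_Aint block_disp(int i, int j);
    void     local_dgemm(const double *a, const double *b, double *c, int bs);

    /* Sketch of get-compute-update over MPI RMA (not the GA/NWChem source). */
    void get_compute_update(int nblocks, int bs, double *a, double *b, double *c,
                            MPI_Win win_a, MPI_Win win_b, MPI_Win win_c)
    {
        int n = bs * bs;
        for (int i = 0; i < nblocks; i++) {
            for (int j = 0; j < nblocks; j++) {
                memset(c, 0, (size_t)n * sizeof(double));
                for (int k = 0; k < nblocks; k++) {
                    /* GET block a from A and block b from B. */
                    MPI_Get(a, n, MPI_DOUBLE, block_rank(i, k),
                            block_disp(i, k), n, MPI_DOUBLE, win_a);
                    MPI_Get(b, n, MPI_DOUBLE, block_rank(k, j),
                            block_disp(k, j), n, MPI_DOUBLE, win_b);
                    MPI_Win_flush_local(block_rank(i, k), win_a);
                    MPI_Win_flush_local(block_rank(k, j), win_b);
                    local_dgemm(a, b, c, bs);        /* c += a * b locally */
                }
                /* ACC block c to C; the accumulate is the operation that most
                 * often needs software handling, and hence asynchronous
                 * progress, on the target. */
                MPI_Accumulate(c, n, MPI_DOUBLE, block_rank(i, j),
                               block_disp(i, j), n, MPI_DOUBLE, MPI_SUM, win_c);
                MPI_Win_flush(block_rank(i, j), win_c);
            }
        }
    }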

Outline  Problem Statement  Solution  Evaluation Experimental Environment NERSC*'s newest supercomputer Cray XC Petaflops/s peak performance 133,824 compute cores… *National Energy Research Scientific Computing Center 6Min Si, CCGrid Scale Challenge

NWChem CCSD(T) Simulation
- "Gold standard" CCSD(T):
  - Pareto-optimal point of high accuracy relative to computational cost
  - Top of the three-tiered pyramid of methods used for ab initio calculations: SCF scales as O(N^3), MP2 as O(N^5), and CCSD(T) as O(N^7); more accuracy means more computation. We are here.
- Internal steps in a CCSD(T) task: self-consistent field (SCF), four-index transformation (4-index), CCSD iteration, and the (T) portion.
[Figure: CCSD(T) internal steps for varying water problems; the (T) portion consistently dominates the total cost, at close to 80%.]
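To put these exponents in perspective (a back-of-the-envelope illustration, assuming N tracks the system size): doubling the system multiplies the CCSD(T) cost by roughly 2^7 = 128, versus 2^5 = 32 for MP2 and 2^3 = 8 for SCF, which is why accuracy at this level is so computationally expensive.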

How to Determine SCALABILITY?
- Parallel efficiency
  - Uses the execution time on the minimal number of cores α as the base, T_α.
  - Problem: what if the base execution is not efficient, i.e., suffers from inefficient communication? A high base T_α yields an artificially high PE(N).
- Computational efficiency (formalized below)
  - Focuses on the overhead of inefficient communication.
  - Uses the computation time on the minimal number of cores α as the base, T_comp.
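One plausible way to write these two metrics down (my notation, inferred from the slide; T_N is the execution time on N cores):

    PE(N) = (α * T_α) / (N * T_N)          (parallel efficiency; base = total time on α cores)
    CE(N) = (α * T_comp) / (N * T_N)       (computational efficiency; base = computation-only time on α cores)

Since T_comp <= T_α, an inflated communication-heavy baseline raises PE(N) but not CE(N), so CE(N) exposes the communication overhead that PE(N) can hide.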

Is the (T) Portion Efficient?
[Figures: (T) portion strong scaling for W21, comparing parallel efficiency and computational efficiency, and (T) portion profiling for W21.]
The (T) portion is not efficient!

WHY Is the (T) Portion Not Efficient?
- One-sided operations are not truly one-sided:
  - On most platforms, some operations (e.g., 3D accumulates of double-precision data) still have to be done in software.
  - Cray MPI (default mode) implements all operations in software.
- Software implementation of one-sided operations means that the target process has to make an MPI call for the operation to make progress, causing extreme communication delay in the computation-intensive (T) portion for large-scale problems.
[Figure: timeline of Process 0 issuing an RMA operation to Process 1, which is busy computing; the data transfer is delayed until Process 1 makes an MPI call.]
Challenge: how do we improve asynchronous progress in communication with minimal impact on computation?

Outline  Problem Statement  Solution  Evaluation 11Min Si, CCGrid Scale Challenge

Traditional Approaches to ASYNC Progress
- Thread-based approach (see the sketch below):
  - Every MPI process has a dedicated background communication thread.
  - The background thread polls for MPI progress.
  - Cons: wastes 50% of the computing cores or oversubscribes cores; adds the overhead of MPI multithreading safety.
- Interrupt-based approach:
  - Assumes all hardware resources are busy with user computation on the target processes.
  - Utilizes hardware interrupts to awaken a kernel thread.
  - Cons: overhead of frequent interrupts.
[Figures: processes P0-P3 each paired with a background thread T0-T3; DMAPP-based ASYNC progress overhead on Cray XC30.]
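For concreteness, here is a minimal sketch of the thread-based idea (my own illustration of the general mechanism, not MPICH's or Cray MPI's implementation; in practice this mode is enabled inside the library, e.g., via MPICH's MPICH_ASYNC_PROGRESS environment variable). Each process spawns a helper thread that blocks inside MPI so the progress engine keeps turning while the main thread computes; the cost is a dedicated or oversubscribed core plus MPI_THREAD_MULTIPLE overhead.

    #include <mpi.h>
    #include <pthread.h>

    static MPI_Comm progress_comm;

    /* In MPICH-derived libraries, a thread blocked in an MPI call polls the
     * shared progress engine, which also advances software-emulated RMA
     * operations targeting this process. The matching send arrives only at
     * shutdown. */
    static void *progress_fn(void *arg)
    {
        (void)arg;
        MPI_Recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE, 0, progress_comm,
                 MPI_STATUS_IGNORE);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided, rank;
        pthread_t tid;

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            MPI_Abort(MPI_COMM_WORLD, 1);
        MPI_Comm_dup(MPI_COMM_WORLD, &progress_comm);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        pthread_create(&tid, NULL, progress_fn, NULL);

        /* ... application computation and RMA communication go here ... */

        /* Shut the helper thread down by satisfying its receive. */
        MPI_Send(NULL, 0, MPI_BYTE, rank, 0, progress_comm);
        pthread_join(tid, NULL);

        MPI_Comm_free(&progress_comm);
        MPI_Finalize();
        return 0;
    }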

Our Solution: Process-Based ASYNC Progress
- Multi- and many-core architectures: a rapidly growing number of cores, and not all of the cores are always kept busy.
- Casper [4]:
  - Dedicates an arbitrary number of cores to "ghost processes".
  - A ghost process intercepts all RMA operations targeted at the user processes it serves.
  - No multithreading or interrupt overhead, flexible core deployment, and portable PMPI redirection (illustrated in the sketch below).
[Figure: original communication, where Process 1 must make an MPI call before the accumulate from Process 0 completes, versus communication with Casper, where a ghost process applies the accumulate while Process 1 keeps computing; ghost processes G0, G1 serve user processes P0 ... PN.]
[4] M. Si, A. J. Pena, J. Hammond, P. Balaji, M. Takagi, and Y. Ishikawa, "Casper: An asynchronous progress model for MPI RMA on many-core architectures," in Parallel and Distributed Processing (IPDPS), 2015.
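To illustrate the "portable PMPI redirection" bullet, here is a minimal sketch (my own illustration of the interposition mechanism, not Casper's actual code) of how a library linked in front of MPI can transparently intercept an RMA call and hand it to a ghost process. ghost_rank_of() and ghost_disp_of() are hypothetical helpers standing in for the library's rank and displacement translation.

    #include <mpi.h>

    /* Hypothetical translation helpers (not part of MPI or of Casper's API). */
    int      ghost_rank_of(int target_rank, MPI_Win win);
    MPI_Aint ghost_disp_of(int target_rank, MPI_Aint target_disp, MPI_Win win);

    /* Sketch: the interposition library defines MPI_Accumulate itself, retargets
     * the operation to the ghost process that exposes the target's memory, and
     * forwards to the real routine through the PMPI entry point. The ghost's
     * MPI library then makes progress on the accumulate while the original
     * target process keeps computing. */
    int MPI_Accumulate(const void *origin_addr, int origin_count,
                       MPI_Datatype origin_datatype, int target_rank,
                       MPI_Aint target_disp, int target_count,
                       MPI_Datatype target_datatype, MPI_Op op, MPI_Win win)
    {
        int g_rank = ghost_rank_of(target_rank, win);
        MPI_Aint g_disp = ghost_disp_of(target_rank, target_disp, win);
        return PMPI_Accumulate(origin_addr, origin_count, origin_datatype,
                               g_rank, g_disp, target_count, target_datatype,
                               op, win);
    }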

Outline  Problem Statement  Solution  Evaluation Experimental Environment 12-core Intel Ivy Bridge * 2 (24 cores) per node Cray MPI v Min Si, CCGrid Scale Challenge

[DEMO] Core Utilization in the CCSD(T) Simulation
[Figures: core utilization (%) versus task processing (%), split into COMM and COMP, for four configurations: original MPI (no ASYNC), Casper ASYNC, thread-based ASYNC on dedicated cores, and thread-based ASYNC on oversubscribed cores.]
Observations: oversubscribed ASYNC cores spend their time polling for MPI progress, whereas Casper achieves concurrent COMM and COMP. Higher computation utilization is better!

Strong Scaling of the (T) Portion for the W21 Problem
Core deployment per 24-core node (# COMP cores / # ASYNC cores or threads): original MPI 24/0; Casper 23/1; Thread (O, with oversubscribed cores) 24/24; Thread (D, with dedicated cores) 12/12.
[Figures: execution time (reduced with Casper) and computational efficiency (improved with Casper) for the computation-intensive (T) portion of the (H2O)21 CCSD(T) simulation.]

WHY Do We Achieve ~100% Efficiency?
[Figures: (T) portion results for W21 on 1,704 cores and on 6,144 cores.]
Core deployment per 24-core node (# COMP / # ASYNC): original MPI 24/0; Casper 23/1 (loses only 1 COMP core, about 4%); Thread (O, with oversubscribed cores) 24/24 (core oversubscription); Thread (D, with dedicated cores) 12/12 (loses 50% of the COMP cores).

Summary
- We scale NWChem simulations using portable and scalable MPI asynchronous progress (Casper).
- We scale water-molecule problems at ~100% parallel and computational efficiency on ~12,288 cores and reduce execution time by ~45%!
[Figures: core utilization (COMM vs. COMP) versus task processing for NWChem CCSD(T) with the original MPI (no ASYNC), reaching only 50% computational efficiency, and with Casper, reaching close to 100% computational efficiency.]