Multilevel Parallelism using Processor Groups Bruce Palmer Jarek Nieplocha, Manoj Kumar Krishnan, Vinod Tipparaju Pacific Northwest National Laboratory.



Processor Groups
Many parallel applications require the execution of a large number of independent tasks. Examples include:
- Numerical evaluation of gradients
- Monte Carlo sampling over initial conditions or uncertain parameter sets
- Free energy perturbation calculations (chemistry)
- Nudged elastic band calculations (chemistry and materials science)
- Sparse matrix-vector operations (NAS CG benchmark)

Processor Groups
If the individual calculations are small enough, each processor can execute one of the tasks (an embarrassingly parallel algorithm). If the individual tasks are large enough that they must be distributed among several processors, then usually the only option is to run the tasks sequentially, each on multiple processors. This limits the total number of processors that can be applied to the problem, since parallel efficiency degrades as the number of processors increases. [Figure: speedup vs. number of processors]
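The efficiency degradation noted above can be made concrete with Amdahl's law. A minimal sketch in plain Python; the 5% serial fraction is an illustrative assumption, not a number from the slides:

```python
def speedup(p, serial_fraction):
    """Amdahl's law: speedup on p processors given a fixed serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

def efficiency(p, serial_fraction):
    """Parallel efficiency = speedup / processor count."""
    return speedup(p, serial_fraction) / p

# With a 5% serial fraction, efficiency falls off as processors are added,
# which is why running one large task on ever more processors stops paying off.
for p in (8, 64, 512):
    print(p, round(efficiency(p, 0.05), 3))
```

This is exactly the scaling limit that motivates splitting the machine into groups instead of growing a single task's processor count.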

Processor Groups
Alternatively, the collection of processors can be decomposed into processor groups. These processor groups can be used to execute parallel algorithms independently of one another. This requires:
- global operations that are restricted in scope to a particular group instead of the entire domain of processors (the world group)
- distributed data structures that are restricted to a particular group
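In MPI this decomposition is typically expressed with `MPI_Comm_split` (color = rank // group_size); Global Arrays provides analogous processor-group calls such as `GA_Pgroup_create`. A minimal pure-Python sketch of the rank-to-group mapping, with the group size as an assumed parameter:

```python
def split_into_groups(world_size, group_size):
    """Partition world ranks 0..world_size-1 into contiguous groups,
    mirroring MPI_Comm_split(world, color=rank // group_size, key=rank).
    Returns one list of ranks per group; the last group may be smaller."""
    groups = []
    for start in range(0, world_size, group_size):
        groups.append(list(range(start, min(start + group_size, world_size))))
    return groups

# 8 world ranks split into groups of 3 -> [[0, 1, 2], [3, 4, 5], [6, 7]]
print(split_into_groups(8, 3))
```

Each resulting rank list is what a real code would hand to the group-creation call to obtain a group-scoped communicator or GA group handle.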

Processor Groups (Schematic): [diagram: the World Group partitioned into Group A and Group B]

Communication between Groups
Copying to and from data structures in nested groups is well defined; the world group can be used as a mechanism for communicating between groups (a hierarchical data model). Copying directly between groups is more complicated: what is the programming model? [Diagram: World Group containing Group A and Group B]

Sequential Parallel Tasks: [diagram: tasks executed one after another on the full set of processors, producing results]

Embarrassingly Parallel: [diagram: tasks mapped one per processor, producing results]

Concurrent Execution on Groups: [diagram: tasks distributed across processor groups, producing results]
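The scheduling idea behind this slide can be sketched in plain Python: independent tasks are farmed out to processor groups rather than to single processors. The round-robin hand-off below is an illustrative assumption; a real driver would hand the next task to whichever group becomes idle first:

```python
from collections import deque

def run_tasks_on_groups(tasks, num_groups):
    """Farm independent tasks out to processor groups: each group pulls
    the next task from a shared queue when it is free (here the groups
    simply take turns, so the assignment is deterministic)."""
    queue = deque(tasks)
    done = {g: [] for g in range(num_groups)}
    while queue:
        for g in range(num_groups):
            if not queue:
                break
            done[g].append(queue.popleft())
    return done

# 5 tasks on 2 groups: group 0 runs t0, t2, t4; group 1 runs t1, t3
print(run_tasks_on_groups(["t0", "t1", "t2", "t3", "t4"], 2))
```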

MD Example (processors P0 through P7)
Spatial Decomposition Algorithm:
- Partition particles among processors
- Update coordinates at every step
- Update partitioning after a fixed number of steps
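The particle-partitioning step above can be sketched as follows (pure Python, a 1-D slab decomposition for brevity; real MD codes such as the one in this example partition the box in 2-D or 3-D):

```python
def partition_particles(positions, num_procs, box_length):
    """Spatial decomposition: assign each particle to the processor that
    owns the slab of the simulation box containing its coordinate."""
    slab = box_length / num_procs
    owners = {p: [] for p in range(num_procs)}
    for i, x in enumerate(positions):
        owner = min(int(x // slab), num_procs - 1)  # clamp x == box_length
        owners[owner].append(i)
    return owners

# 4 particles in a box of length 10.0 split between 2 processors:
# particles 0 and 2 land in [0, 5), particles 1 and 3 in [5, 10)
print(partition_particles([1.0, 6.0, 4.9, 9.9], 2, 10.0))
```

Repartitioning "after a fixed number of steps" is then just a matter of calling this again with the updated coordinates.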

MD Parallel Scaling

MD Performance on Groups

Task Completion Times

Multilevel Parallelism Using CCA and Global Arrays (GA)
- Combining SPMD and MPMD paradigms
- MCMD – Multi Component Multiple Data (MPMD + Components)
- The MCMD driver launches multiple instances of NWChem components on subsets of processors (CCA)
- Each NWChem QM component does multiple energy computations on subgroups (GA)
[Diagram: the MCMD Hessian Driver connected through Go, cProps, and ModelFactory ports to components NWChem_QM_0 through NWChem_QM_n, each providing ModelFactory, cProps, ParamPort, and Energy ports]
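The two levels of grouping described here (the driver's gradient-level groups, each further split into energy-level subgroups) can be sketched as nested splits of the world rank list; in Global Arrays this would correspond to nested `GA_Pgroup_create` calls. Pure-Python sketch with illustrative sizes:

```python
def nested_groups(world_size, gradient_group_size, energy_group_size):
    """Two-level decomposition: split the world into gradient-level groups,
    then split each gradient group into energy-level subgroups."""
    gradient_groups = [list(range(s, min(s + gradient_group_size, world_size)))
                       for s in range(0, world_size, gradient_group_size)]
    energy_groups = []
    for grp in gradient_groups:
        energy_groups.append([grp[i:i + energy_group_size]
                              for i in range(0, len(grp), energy_group_size)])
    return gradient_groups, energy_groups

grad, energy = nested_groups(8, 4, 2)
print(grad)    # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(energy)  # [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
```

Each inner rank list is a subgroup on which one energy computation runs, while its enclosing list is the group a gradient task occupies.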

Numerical Hessian Scalability - I
Single-level parallelism: native parallel code

Numerical Hessian Scalability - II
Two-level parallelism (single energy calculations):
- Native parallel code – energy level
- Group-based single energy calculations – gradient level, using GA groups

Numerical Hessian Scalability - III
Three-level parallelism:
- Native parallel code – energy level
- Group-based single energy calculations – gradient level, using GA groups
- Task-based gradient calculations – using CCA
Application efficiency improved 10x on 256 CPUs

New Opportunities
- Overlapping functionality (I/O, visualization): assign functions such as I/O or visualization to separate processor groups so that these tasks do not block the main computation
- Multiphysics simulations: separate physics modules run on independent processor groups and are loosely coupled on the world group

Overlapping Functionality
- Minimize effects of latency and blocking
- Achieve optimal bandwidth to other resources (e.g. file system or graphics device) by minimizing competition between processors
- Simplify programming (less need to schedule access to limited resources)
- Potential applications to the Global Cloud Resolving Model SciDAC program and other applications that create large amounts of data for permanent storage
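One way to realize this split, sketched in plain Python: reserve a few ranks as an auxiliary I/O group and map each compute rank to the auxiliary rank that services it. The group sizes and the round-robin mapping policy are illustrative assumptions, not taken from the slides:

```python
def split_compute_io(world_size, num_io_ranks):
    """Reserve the last num_io_ranks ranks as an auxiliary I/O group; the
    remaining ranks form the computation group. Each compute rank is
    mapped round-robin to the I/O rank that will receive its output, so
    writes can proceed without blocking the main computation."""
    compute = list(range(world_size - num_io_ranks))
    io = list(range(world_size - num_io_ranks, world_size))
    mapping = {r: io[r % num_io_ranks] for r in compute}
    return compute, io, mapping

comp, io, mapping = split_compute_io(8, 2)
print(comp)     # [0, 1, 2, 3, 4, 5]
print(io)       # [6, 7]
print(mapping)  # {0: 6, 1: 7, 2: 6, 3: 7, 4: 6, 5: 7}
```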

Overlapping Functionality: [diagram: World Group split into a Computation Group and an Auxiliary Group, with the auxiliary group connected to disk, graphics device, etc.]

Multiphysics Simulations
- Different physical models are run as separate parallel simulations on independent groups
- Loose coupling of models through the world group
- Applications to subsurface modeling through the Hybrid Multiscale Modeling of Subsurface Simulations SciDAC program, as well as to climate modeling (combined ocean, atmospheric, surface water models, etc.)

Multiphysics Simulations: [diagram: World Group containing Microscopic Model, Pore Scale Model, and Darcy Flow Model subgroups]