Application Performance Analysis on Blue Gene/L Jim Pool, P.I. Maciej Brodowicz, Sharon Brunett, Tom Gottschalk, Dan Meiron, Paul Springer, Thomas Sterling, Ed Upchurch

Caltech's Role in the Blue Gene/L Project
Understand the implications of the BG/L network architecture and drive results from real-world ASCI applications
Develop statistical models of the applications, of processors as message generators, and of the network
Focus on
– Application communications distribution
– Network contention as a function of load, size, and adaptive routing
Represent all 64K nodes explicitly in the statistical model
Create trace analysis tools to characterize applications
– Extensible Trace Facility (ETF)

Blue Gene/L Node

Blue Gene/L Network

ETF Built-in Trace Options
MPI events
– All point-to-point communications (MPI-1)
– All collective communications (MPI-1)
– Non-blocking request tracking
– Communicator creation and destruction
– MPI datatype decoding (requires MPI-2)
– Languages: C, Fortran
– Easy instrumentation of applications
Memory reference and program execution tracing
– Tracking of statically and dynamically allocated arrays (identifiers, element sizes, dimensions)
– Tracking of scalar variables
– Read and write accesses to individual scalars and array elements, as well as contiguous vectors of elements
– Function calls
– Program execution phases
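The ETF internals are not reproduced in these slides; as an illustration of how MPI point-to-point events can be captured without touching application source, the sketch below uses the standard PMPI profiling interface to wrap MPI_Send and log a trace record. The etf_log_event helper and its record format are hypothetical, not part of ETF.

    /* Minimal sketch of MPI event tracing via the standard PMPI profiling
     * interface. The mechanism (user-defined MPI_ wrappers calling PMPI_
     * aliases) is standard MPI; the etf_log_event helper is hypothetical. */
    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical trace record: operation, peer rank, bytes, and timestamps. */
    static void etf_log_event(const char *op, int peer, int bytes,
                              double t0, double t1)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("ETF %s rank=%d peer=%d bytes=%d start=%.6f end=%.6f\n",
               op, rank, peer, bytes, t0, t1);
    }

    /* The profiling interface lets this wrapper shadow MPI_Send; the real
     * implementation is reached through its PMPI_ alias. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        int size, rc;
        double t0, t1;

        PMPI_Type_size(datatype, &size);
        t0 = PMPI_Wtime();
        rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        t1 = PMPI_Wtime();
        etf_log_event("send", dest, count * size, t0, t1);
        return rc;
    }

Collective calls, non-blocking requests, and communicator management can be traced the same way by wrapping the corresponding MPI entry points.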

ETF Tracing Example for a Magnetohydrodynamic (MHD) Code with Adaptive Mesh Refinement (AMR)
Parallel MHD fluid code solves the equations of hydrodynamics and the resistive Maxwell equations
– Part of a larger application that computes dynamic responses to strong shock waves impinging on target materials
– Fortran 90 + MPI
– MPI Cartesian communicators
– Nearest-neighbor communications use non-blocking send/recv
– MPI_Allreduce for calculating stable time steps
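The MHD code itself is Fortran 90; purely to illustrate the communication pattern listed above (a Cartesian communicator, non-blocking nearest-neighbor exchange, and an MPI_Allreduce for the stable time step), here is a minimal C sketch. The 2-D decomposition, buffer names, and sizes are assumptions made for the example, not details of the actual application.

    /* Hedged sketch: Cartesian communicator, non-blocking nearest-neighbor
     * halo exchange, and a global MIN reduction for the stable time step. */
    #include <mpi.h>

    #define N 128  /* local cells per side (illustrative) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int dims[2] = {0, 0}, periods[2] = {0, 0}, nprocs;
        MPI_Comm cart;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 2, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

        double send_halo[4][N] = {{0.0}};   /* placeholder boundary data */
        double recv_halo[4][N];
        MPI_Request reqs[8];
        int nreq = 0;

        /* Exchange halos with the four nearest neighbors (non-blocking). */
        for (int dim = 0; dim < 2; dim++) {
            int lo, hi;
            MPI_Cart_shift(cart, dim, 1, &lo, &hi);
            int neighbors[2] = {lo, hi};
            for (int side = 0; side < 2; side++) {
                int idx = 2 * dim + side;
                MPI_Irecv(recv_halo[idx], N, MPI_DOUBLE, neighbors[side], 0,
                          cart, &reqs[nreq++]);
                MPI_Isend(send_halo[idx], N, MPI_DOUBLE, neighbors[side], 0,
                          cart, &reqs[nreq++]);
            }
        }
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

        /* Stable time step: global minimum of the locally computed limit. */
        double dt_local = 1.0e-3, dt;  /* placeholder local CFL limit */
        MPI_Allreduce(&dt_local, &dt, 1, MPI_DOUBLE, MPI_MIN, cart);

        MPI_Finalize();
        return 0;
    }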

AMR MHD: Communication Profile (charts: 20 time steps on 32 processors, 128x128 cells; one panel for max. refinement level = 1 and one for max. level = 2)

Lennard-Jones Molecular Dynamics
Short-range molecular dynamics application simulating Newtonian interactions in large groups of atoms
– Production code from Sandia National Laboratories
Simulations are large in two dimensions: number of atoms and number of time steps
Spatial decomposition case selected
– Each processing node keeps track of the positions and movement of the atoms in a 3-D box
Computations carried out in a single time step correspond to femtoseconds of real time; a meaningful simulation of the evolution of the system's state typically requires thousands of time steps
Point-to-point MPI messages are exchanged across each of the 6 sides of the box per time step
Code is written in Fortran and MPI
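The production code is Fortran + MPI; the C sketch below only illustrates the pattern described above, one point-to-point exchange per face of each node's box per time step, using a 3-D Cartesian communicator. The buffer sizes and the choice of MPI_Sendrecv are assumptions for the sketch, not what the real code necessarily does.

    /* Hedged sketch of the per-time-step exchange across the six faces of
     * each node's spatial-decomposition box. Buffer contents and sizes are
     * placeholders. */
    #include <mpi.h>

    #define MAX_BORDER 1024  /* illustrative cap on atoms per face buffer */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1}, nprocs;
        MPI_Comm cart;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 3, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);

        /* x, y, z coordinates of atoms near each of the 6 faces. */
        double sendbuf[6][3 * MAX_BORDER] = {{0.0}};
        double recvbuf[6][3 * MAX_BORDER];

        /* One exchange per face per time step: send "up" while receiving
         * from "down", then the reverse, in each of the three dimensions. */
        for (int dim = 0; dim < 3; dim++) {
            int down, up;
            MPI_Cart_shift(cart, dim, 1, &down, &up);

            MPI_Sendrecv(sendbuf[2 * dim], 3 * MAX_BORDER, MPI_DOUBLE, up, 0,
                         recvbuf[2 * dim], 3 * MAX_BORDER, MPI_DOUBLE, down, 0,
                         cart, MPI_STATUS_IGNORE);
            MPI_Sendrecv(sendbuf[2 * dim + 1], 3 * MAX_BORDER, MPI_DOUBLE, down, 1,
                         recvbuf[2 * dim + 1], 3 * MAX_BORDER, MPI_DOUBLE, up, 1,
                         cart, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }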

Lennard-Jones Molecular Dynamics Model (figures: typical grid cell and cutoff radius; communication steps; computational cycle)

LJS Single Processor BG/L Performance: Original Code vs. Tuned for BG/L (chart: improvement (%) vs. number of atoms per BG/L CPU, up to 500,000; annotation: good cache reuse)

LJS Molecular Dynamics Performance, Fixed Problem Size of 1 Billion Atoms (chart: time per single iteration (ms), split into compute time [ms] and communications time [ms], vs. number of BG/L CPUs from 2k to 64k)

LJS Speedup: BG/L vs. ASCI Red (3200 Nodes), 1 Billion Atom Problem (chart: speedup vs. number of BlueGene/L nodes, up to 64k)

LJS Communications Time, 500,000 Atoms per BG/L Node (chart: communications time per iteration (msecs) vs. BG/L configuration: 4x4x4 (64 nodes), 8x8x8 (512 nodes), 16x16x16 (4096 nodes); compares physical nearest-neighbor mapping with random mapping)

What is QMC and Why is it a Good Fit for BG/L?
QMC is a finite all-electron Quantum Monte Carlo code used to determine quantum properties of materials with extremely high accuracy
Developed at Caltech by Bill Goddard's ASCI Material Properties group
Interesting characteristics
– Low memory requirements
– After initialization, highly parallel and scalable
– Minimal set of MPI calls required: non-blocking point-to-point, reduction, probe, communicator, and collective calls
– No communications during QMC working steps
– Communicating convergence statistics takes 7200 bytes regardless of problem size and node count
– Code already ported to many platforms (Linux, AIX, IRIX, etc.); C++ and MPI sources

Iterative QMC Algorithm
For each processor do:
    Steps = Total Steps / number of processors
    Generate walkers
    Equilibrate walkers
    For each step:
        Generate QMC statistics
    Send QMC statistics to master node
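A hedged C rendering of this driver is sketched below. The actual sources are C++; walker generation and equilibration are stubbed out, the reduction to the master node mirrors the "reduce using the torus" measurement on the next slide, and the 900-double statistics buffer is only an assumption chosen to match the 7200-byte figure (900 x 8 bytes).

    /* Hedged sketch of the iterative QMC driver described above. The real
     * code is C++; the physics routines are placeholders. */
    #include <mpi.h>
    #include <string.h>

    #define STATS_DOUBLES 900            /* 900 * 8 bytes = 7200 bytes (assumed) */

    static void generate_walkers(void)    { /* placeholder */ }
    static void equilibrate_walkers(void) { /* placeholder */ }
    static void qmc_step(double *stats)   { (void)stats; /* placeholder */ }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        long total_steps = 100000L * nprocs;   /* illustrative total */
        long steps = total_steps / nprocs;     /* Steps = Total Steps / number of processors */

        generate_walkers();
        equilibrate_walkers();

        double stats[STATS_DOUBLES], global_stats[STATS_DOUBLES];
        memset(stats, 0, sizeof(stats));

        for (long s = 0; s < steps; s++)
            qmc_step(stats);                   /* no communication during working steps */

        /* Send QMC statistics to the master node; a reduction keeps the
         * message a fixed 7200 bytes regardless of problem size. */
        MPI_Reduce(stats, global_stats, STATS_DOUBLES, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }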

QMC Communications Time for 100,000 Steps per Node, Reduce Using the Torus (chart: time (seconds) vs. BG/L configuration: 8x8x8 (512), 16x16x16 (4K), 32x16x16 (8K), 32x32x16 (16K), 32x32x32 (32K), 64x32x32 (64K))

Future Application Porting and Analysis for BG/L
ASCI solid dynamics code simulating the mechanical response of polycrystalline materials, such as tantalum
Address memory constraints, grain load imbalance, and MPI_Waitall() efficiency as we port and tune for BG/L
– Good stress test for BG/L robustness
Scalable simulation of polycrystalline response with an assumed grain shape: the space-filling polyhedron given by the Wigner-Seitz cell of a BCC crystal. The 390-grain example shown here was run on LLNL's IBM SP3, Frost.