1
Application Performance Analysis on Blue Gene/L
Jim Pool, P.I.
Maciej Brodowicz, Sharon Brunett, Tom Gottschalk, Dan Meiron, Paul Springer, Thomas Sterling, Ed Upchurch
2
Caltech’s Role in Blue Gene/L Project
Understand implications of BG/L network architecture & drive results from real world ASCI applications
Develop statistical models of applications, processors as message generators, and the network
Focus on
–Application communications distribution
–Network contention as function of load, size and adaptive routing
Represent 64K Nodes Explicitly in Statistical Model
Create trace analysis tools to characterize applications
–Extensible Trace Facility (ETF)
3
Blue Gene / L Node
4
Blue Gene / L Network
5
ETF Built-in Trace Options
MPI events
–All point-to-point communications (MPI-1)
–All collective communications (MPI-1)
–Non-blocking request tracking
–Communicator creation and destruction
–MPI datatype decoding (requires MPI-2)
–Languages: C, Fortran
–Easy instrumentation of applications
Memory reference and program execution tracing
–Tracking of statically and dynamically allocated arrays (identifiers, element sizes, dimensions)
–Tracking of scalar variables
–Read and write accesses to individual scalars and array elements as well as contiguous vectors of elements
–Function calls
–Program execution phases
6
ETF Tracing Example for Magnetohydrodynamic (MHD) Code with Adaptive Mesh Refinement (AMR)
Parallel MHD fluid code solves equations of hydrodynamics and resistive Maxwell’s equations
–Part of larger application which computes dynamic responses to strong shock waves impinging on target materials
–Fortran 90 + MPI
–MPI Cartesian communicators
–Nearest neighbor communications use non-blocking send/recv
–MPI_Allreduce for calculating stable time steps
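The last bullet can be illustrated with a minimal sketch. The slide does not say which reduction operation or stability criterion the MHD code uses; the sketch assumes a CFL-style limit per rank and a minimum reduction, and it replaces MPI with a plain Python function so it runs without an MPI installation. The names `local_stable_dt` and `allreduce_min` and all numeric values are hypothetical.

```python
def local_stable_dt(dx, max_wave_speed, cfl=0.5):
    """CFL-limited time step for one rank's cells (assumed criterion)."""
    return cfl * dx / max_wave_speed


def allreduce_min(values):
    """Stand-in for MPI_Allreduce with MPI_MIN: every rank gets the global minimum."""
    global_min = min(values)
    return [global_min for _ in values]


# Four hypothetical ranks as (cell size, fastest local wave speed).
# Ranks holding refined AMR patches have a smaller cell size.
rank_state = [(1.0, 2.0), (0.5, 2.0), (0.5, 4.0), (1.0, 1.0)]

local_dts = [local_stable_dt(dx, speed) for dx, speed in rank_state]
global_dts = allreduce_min(local_dts)

print(local_dts)   # each rank's own limit
print(global_dts)  # the step every rank actually takes
```

Because every rank must advance with the same step, the most restrictive rank sets the pace, which is why this collective appears once per time step in the communication profile.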
7
AMR MHD: Communication Profile
20 time steps on 32 processors, 128x128 cells
[Figure: two panels, Max. level = 1 and Max. level = 2]
8
Lennard-Jones Molecular Dynamics
Short range molecular dynamics application simulating Newtonian interactions in large groups of atoms
–production code from Sandia National Lab
Simulations are large in two dimensions: number of atoms and number of time steps
Spatial decomposition case selected
–each processing node keeps track of the positions and movement of the atoms in a 3-D box
Computations carried out in a single time step correspond to femtoseconds of real time; a meaningful simulation of the evolution of the system’s state typically requires thousands of time steps
Point-to-point MPI messages are exchanged across each of the 6 sides of the box per time step
Code is written in Fortran and MPI
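The six-sided exchange can be sketched by computing which ranks own the boxes adjacent to a given box. This is an illustration only: the rank ordering (row-major over x, y, z) and the periodic wrap-around at the domain edges are assumptions, not details taken from the Sandia code, and `face_neighbors` is a hypothetical helper.

```python
def face_neighbors(rank, dims):
    """Ranks owning the 6 face-adjacent boxes of `rank` in a periodic 3-D grid."""
    nx, ny, nz = dims
    # Recover (x, y, z) from a row-major rank number (assumed ordering).
    x, rem = divmod(rank, ny * nz)
    y, z = divmod(rem, nz)

    def to_rank(i, j, k):
        # Wrap each coordinate so boxes on the edge pair with the far side.
        return ((i % nx) * ny + (j % ny)) * nz + (k % nz)

    return [
        to_rank(x - 1, y, z), to_rank(x + 1, y, z),  # -x, +x faces
        to_rank(x, y - 1, z), to_rank(x, y + 1, z),  # -y, +y faces
        to_rank(x, y, z - 1), to_rank(x, y, z + 1),  # -z, +z faces
    ]


neighbors_of_0 = face_neighbors(0, (4, 4, 4))
print(neighbors_of_0)  # [48, 16, 12, 4, 3, 1]
```

Each rank would post one send and one receive per entry in this list every time step, so the message count per rank stays at six regardless of machine size.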
9
Lennard-Jones Molecular Dynamics
[Figure panels: Typical Grid Cell and Cutoff Radius; Communication Steps; Computational Cycle Model]
10
LJS Single Processor BG/L Performance
Original Code vs. Tuned for BG/L
[Figure: Improvement (%), 0 to 12, vs. Number of Atoms per BG/L CPU (15,625; 31,250; 62,500; 125,000; 250,000; 500,000). Annotation: good cache reuse]
11
LJS Molecular Dynamics Performance
Fixed Problem Size of 1 Billion Atoms
[Figure: Time per single iteration (ms) vs. Number of BG/L CPUs (2k, 4k, 8k, 16k, 32k, 64k). Series: Compute Time [ms], Communications Time [ms]]
12
LJS Speedup BG/L vs. ASCI Red 3200 Nodes
1 Billion Atom Problem
[Figure: Speedup, 0 to 80, vs. Number of BlueGene/L Nodes (2k, 4k, 8k, 16k, 32k, 64k)]
13
LJS Communications Time
500,000 Atoms per BG/L Node
[Figure: Communications Time Per Iteration (msecs), 0 to 60, vs. BG/L Configuration: 4x4x4 (64 BGL Nodes), 8x8x8 (512 BGL Nodes), 16x16x16 (4096 BGL Nodes). Series: Physical Nearest Neighbor Mapping, Random Mapping]
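Why the two mappings differ can be sketched with hop counts on a 3-D torus. This is a rough proxy only: it counts the shortest torus distance between communicating boxes and ignores contention, routing, and message size, so it does not reproduce the measured times in the figure. The helper names and the fixed shuffle seed are hypothetical.

```python
import random


def torus_hops(a, b, dims):
    """Shortest hop count between node coordinates a and b on a 3-D torus."""
    return sum(min(abs(p - q), n - abs(p - q)) for p, q, n in zip(a, b, dims))


def mean_neighbor_hops(dims, placement):
    """Average torus distance between each logical box and its +x, +y, +z neighbor."""
    nx, ny, nz = dims
    total = count = 0
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                here = placement[(x, y, z)]
                for nbr in (((x + 1) % nx, y, z),
                            (x, (y + 1) % ny, z),
                            (x, y, (z + 1) % nz)):
                    total += torus_hops(here, placement[nbr], dims)
                    count += 1
    return total / count


dims = (4, 4, 4)
boxes = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]

# Physical nearest neighbor mapping: logical box (x, y, z) sits on node (x, y, z).
nearest_map = {box: box for box in boxes}

# Random mapping: the same boxes scattered over the nodes.
shuffled = boxes[:]
random.Random(0).shuffle(shuffled)
random_map = dict(zip(boxes, shuffled))

nearest_hops = mean_neighbor_hops(dims, nearest_map)
random_hops = mean_neighbor_hops(dims, random_map)
print(nearest_hops, random_hops)
```

Under the nearest neighbor mapping every exchange travels exactly one hop; under a random mapping the average distance grows with the torus dimensions, which is consistent with the gap between the two series widening at larger configurations.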
14
What is QMC and Why is it a Good Fit for BG/L?
QMC is a finite all-electron Quantum Monte Carlo code used to determine quantum properties of materials with extremely high accuracy
Developed at Caltech by Bill Goddard’s ASCI Material Properties group
Interesting Characteristics
–Low memory requirements
–After initialization, highly parallel and scalable
–Minimal set of MPI calls required: non-blocking p2p, reduction, probe, communicator, collective calls
–No communications during QMC working steps
–Convergence statistics message is 7200 bytes regardless of problem size and node count
–Code already ported to many platforms (Linux, AIX, IRIX, etc.); C++ and MPI sources
15
Iterative QMC Algorithm
For each processor do:
    Steps = Total Steps / number of processors
    Generate walkers
    Equilibrate walkers
    for each step
        generate QMC statistics
    send QMC statistics to master node
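The structure of this loop can be sketched in a few lines of Python. This is a toy stand-in, not the Caltech QMC code: the walkers are plain random-walk numbers, the "local energy" is simply the square of a walker's position, processors are simulated by a loop, and the statistics are assumed to be sent once after the working steps finish. All names and parameter values are hypothetical.

```python
import random


def run_worker(rank, steps, n_walkers=8, equil_steps=50):
    """One processor's share of the work; returns fixed-size statistics."""
    rng = random.Random(rank)  # independent random stream per processor

    # Generate walkers.
    walkers = [rng.uniform(-1.0, 1.0) for _ in range(n_walkers)]

    def move(ws):
        return [w + rng.gauss(0.0, 0.1) for w in ws]

    # Equilibrate walkers: advance them without recording statistics.
    for _ in range(equil_steps):
        walkers = move(walkers)

    # Working steps: purely local, no communication.
    count, total, total_sq = 0, 0.0, 0.0
    for _ in range(steps):
        walkers = move(walkers)
        for w in walkers:
            energy = w * w  # toy quantity standing in for a QMC estimator
            count += 1
            total += energy
            total_sq += energy * energy

    # The message size does not depend on steps or walker count.
    return (count, total, total_sq)


def reduce_on_master(parts):
    """Combine per-processor statistics into a global mean and variance."""
    count = sum(p[0] for p in parts)
    total = sum(p[1] for p in parts)
    total_sq = sum(p[2] for p in parts)
    mean = total / count
    variance = total_sq / count - mean * mean
    return count, mean, variance


total_steps, n_procs = 1000, 4
steps_per_proc = total_steps // n_procs
parts = [run_worker(rank, steps_per_proc) for rank in range(n_procs)]
count, mean, variance = reduce_on_master(parts)
print(count, mean, variance)
```

Each worker returns the same three numbers however long it runs, which mirrors the fixed-size statistics message described on the previous slide and is the reason communication cost depends mainly on the reduction across nodes.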
16
QMC Communications Time For 100,000 Steps Per Node (Reduce Using the Torus)
[Figure: Time (seconds), log scale from 0.001 to 1, vs. BG/L Configuration: 8x8x8 (512), 16x16x16 (4K), 32x16x16 (8K), 32x32x16 (16K), 32x32x32 (32K), 64x32x32 (64K)]
17
Future Application Porting and Analysis for BG/L
ASCI solid dynamics code simulating the mechanical response of polycrystalline materials, such as tantalum
Address memory constraints, grain load imbalance and MPI_Waitall() efficiency as we port/tune to BG/L
–good stress test for BG/L robustness
[Figure caption: Scalable simulation of polycrystalline response with assumed grain shape. The grain shape is the space-filling polyhedron given by the Wigner-Seitz cell of a BCC crystal. The 390-grain example shown here was run on LLNL’s IBM SP3, frost.]