Scientific Computations on Modern Parallel Vector Systems

Leonid Oliker
Julian Borrill, Jonathan Carter, Andrew Canning, John Shalf, David Skinner
Lawrence Berkeley National Laboratory

Stephane Ethier
Princeton Plasma Physics Laboratory

Architectural Comparison

Platforms compared (metrics in the full table: CPUs/node, clock MHz, peak GFlop/s, memory bandwidth GB/s, peak bytes/flop, network bandwidth GB/s per processor, bisection bandwidth bytes/flop, MPI latency usec, network topology):

  Node Type   Where   Network Topology
  Power3      NERSC   Fat-tree
  Power4      ORNL    Fat-tree
  Altix       ORNL    Fat-tree
  ES          ESC     Crossbar
  X1          ORNL    2D-torus

- Custom vector architectures have:
    High memory bandwidth relative to peak
    Superior interconnect: latency, point-to-point, and bisection bandwidth
- Overall the ES appears to be the most balanced architecture, while the Altix shows the best balance among the superscalar architectures
- A key "balance point" for vector systems is the scalar:vector ratio
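For reference, the "peak bytes/flop" balance figure is simply memory bandwidth divided by peak floating-point rate; the numbers below are hypothetical, chosen only to illustrate the metric:

  balance (bytes/flop) = memory bandwidth (GB/s) / peak rate (GFlop/s)
  e.g.  25.6 GB/s / 6.4 GFlop/s = 4 bytes/flop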

Applications studied

  LBMHD      Plasma Physics     1,500 lines     grid based            Lattice Boltzmann approach for magnetohydrodynamics
  CACTUS     Astrophysics       100,000 lines   grid based            Solves Einstein's equations of general relativity
  PARATEC    Material Science   50,000 lines    Fourier space/grid    Density Functional Theory electronic structure code
  GTC        Magnetic Fusion    5,000 lines     particle based        Particle-in-cell method for the gyrokinetic Vlasov-Poisson equation
  MADbench   Cosmology          2,000 lines     dense linear algebra  Maximum-likelihood two-point angular correlation, I/O intensive

- Applications chosen with the potential to run at ultrascale
    Computations contain abundant data parallelism
    ES runs require minimal parallelization and vectorization hurdles
- Codes originally designed for superscalar systems
    Ported onto a single node of the SX6; first multi-node experiments performed at the ESC

Plasma Physics: LBMHD

- LBMHD uses a Lattice Boltzmann method to model magnetohydrodynamics (MHD)
- Performs 2D simulations of high-temperature plasma
- Evolves from initial conditions, decaying to form current sheets
- 2D spatial grid is coupled to an octagonal streaming lattice
- Block-distributed over a 2D processor grid
- Main computational components (sketched below):
    Collision: requires coefficients for the local gridpoint only, no communication
    Stream: values at gridpoints are streamed to neighbors; at cell boundaries information is exchanged via MPI
    Interpolation: step required between the spatial and streaming lattices
- Developed by George Vahala's group at the College of William and Mary; ported by Jonathan Carter

[Figure: current density decay of two cross-shaped structures]
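A minimal sketch of the collision/stream structure described above (not the actual LBMHD source; the lattice size NV, the array names, and the BGK-style relaxation formula are placeholders):

#include <stddef.h>

#define NV 9   /* streaming directions on the octagonal lattice, plus rest */

/* One time step over a local nx x ny block: f is the distribution function,
   feq a precomputed equilibrium, tau the relaxation time, (cx,cy) the
   lattice velocities. */
void lbm_step(size_t nx, size_t ny, double tau,
              double f[NV][nx][ny], const double feq[NV][nx][ny],
              const int cx[NV], const int cy[NV],
              double fnew[NV][nx][ny])
{
    /* Collision: purely local, no communication */
    for (int v = 0; v < NV; v++)
        for (size_t i = 0; i < nx; i++)
            for (size_t j = 0; j < ny; j++)
                f[v][i][j] -= (f[v][i][j] - feq[v][i][j]) / tau;

    /* Stream: shift each velocity component to its neighboring gridpoint;
       off-block neighbors would be filled from MPI ghost cells (not shown) */
    for (int v = 0; v < NV; v++)
        for (size_t i = 0; i < nx; i++)
            for (size_t j = 0; j < ny; j++) {
                size_t ii = (i + nx + cx[v]) % nx;  /* periodic within block for brevity */
                size_t jj = (j + ny + cy[v]) % ny;
                fnew[v][ii][jj] = f[v][i][j];
            }
}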

LBMHD: Porting Details

- Collision routine rewritten:
    For the ES, loop ordering switched so the gridpoint loop (~1000 iterations) is innermost rather than the velocity or magnetic-field loops (~10 iterations); see the sketch below
    The X1 compiler made this transformation automatically: multistreaming the outer loop and vectorizing (via strip mining) the inner loop
    Temporary arrays padded to reduce bank conflicts
- Stream routine performs well:
    Array shift operations, block copies, 3rd-degree polynomial evaluation
- Boundary value exchange:
    MPI_Isend, MPI_Irecv pairs
    Further work: plan to use ES "global memory" to remove message copies

[Figure: (left) octagonal streaming lattice coupled with the square spatial grid; (right) example of a diagonal streaming vector updating three spatial cells]
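The loop interchange on the collision routine, schematically (hypothetical array shapes and names; the real kernel updates many quantities per gridpoint):

#define NGRID 1000   /* gridpoints in the local block        */
#define NVEL  10     /* velocity / magnetic-field components */

/* Before: short velocity loop innermost, so vector length is only ~10 */
void collision_orig(double f[NVEL][NGRID], const double feq[NVEL][NGRID], double tau)
{
    for (int g = 0; g < NGRID; g++)       /* ~1000 iterations */
        for (int v = 0; v < NVEL; v++)    /* ~10 iterations   */
            f[v][g] -= (f[v][g] - feq[v][g]) / tau;
}

/* After (ES version): long gridpoint loop innermost, vector length ~1000 */
void collision_es(double f[NVEL][NGRID], const double feq[NVEL][NGRID], double tau)
{
    for (int v = 0; v < NVEL; v++)        /* ~10 iterations   */
        for (int g = 0; g < NGRID; g++)   /* ~1000 iterations, vectorizes well */
            f[v][g] -= (f[v][g] - feq[v][g]) / tau;
}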

Material Science: PARATEC

- PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane wave basis set
- Uses Density Functional Theory (DFT) to calculate the structure and electronic properties of new materials
- DFT calculations are among the largest consumers of supercomputer cycles in the world
- Uses an all-band conjugate gradient (CG) approach to obtain the wavefunctions of the electrons
- Roughly 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
- Part of the calculation is in real space, the rest in Fourier space
    Uses a specialized 3D FFT to transform the wavefunctions
- Computationally intensive: generally obtains a high percentage of peak
- Developed by Andrew Canning with Louie's and Cohen's groups (UCB, LBNL)

[Figure: induced current and charge density in crystallized glycine]
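As background (standard plane-wave DFT notation, not taken from the PARATEC source), each wavefunction is stored as Fourier coefficients over a sphere of G-vectors,

\psi_{n,\mathbf{k}}(\mathbf{r}) = \sum_{\mathbf{G}} c_{n,\mathbf{k}}(\mathbf{G})\, e^{\,i(\mathbf{k}+\mathbf{G})\cdot\mathbf{r}}

The kinetic energy is diagonal in Fourier space while the local potential is applied on the real-space grid, so the specialized 3D FFT shuttles each wavefunction between the two representations on every iteration.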

PARATEC: Wavefunction Transpose

- Transpose from Fourier to real space
- 3D FFT done via 3 sets of 1D FFTs and 2 transposes
- Most communication is in the global transpose, (b) to (c); little communication from (d) to (e)
- Many FFTs are done at the same time to avoid latency issues
- Only non-zero elements are communicated/calculated
- Much faster than the vendor 3D FFT

[Figure: panels (a)-(f) showing the data layout at each stage of the transpose]
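A rough sketch of the pattern described above (batched 1D FFTs followed by a global transpose), using generic FFTW and MPI calls rather than PARATEC's own specialized FFT; the counts/displacements and the packing of non-zero columns are assumed to be set up elsewhere:

#include <fftw3.h>
#include <mpi.h>

/* One stage of a distributed 3D FFT: do many 1D FFTs along the locally
   contiguous dimension, then globally transpose so the next dimension
   becomes local. PARATEC's hand-written FFT additionally skips the zero
   elements outside the G-vector sphere; that optimization is omitted here. */
void fft_stage(fftw_complex *data, fftw_complex *work,
               int nfft, int len,               /* nfft transforms of length len */
               const int *sendcnt, const int *sdisp,
               const int *recvcnt, const int *rdisp,
               MPI_Comm comm)
{
    /* Batch all 1D FFTs into one plan to avoid per-transform startup cost */
    fftw_plan p = fftw_plan_many_dft(1, &len, nfft,
                                     data, NULL, 1, len,
                                     data, NULL, 1, len,
                                     FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);

    /* Global transpose: every process exchanges blocks with every other one.
       In the real code only the non-zero columns are packed and sent, which
       is where much of the speedup over a vendor 3D FFT comes from.         */
    MPI_Alltoallv(data, (int *)sendcnt, (int *)sdisp, MPI_C_DOUBLE_COMPLEX,
                  work, (int *)recvcnt, (int *)rdisp, MPI_C_DOUBLE_COMPLEX,
                  comm);
}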

Astrophysics: CACTUS

- Numerical solution of Einstein's equations from the theory of general relativity
- Among the most complex equations in physics: a set of coupled nonlinear hyperbolic and elliptic systems with thousands of terms
- CACTUS evolves these equations to simulate high gravitational fluxes, such as the collision of two black holes
- Evolves the PDEs on a regular grid using finite differences
- Uses the ADM formulation: the domain is decomposed into 3D hypersurfaces for different slices of space along the time dimension
- Communication at boundaries; expect high parallel efficiency
- Exciting new field about to be born: gravitational wave astronomy, giving fundamentally new information about the Universe
    Gravitational waves: ripples in spacetime curvature, caused by the motion of matter, causing distances to change
- Developed at the Max Planck Institute; vectorized by John Shalf

[Figure: visualization of a grazing collision of two black holes]
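Not CACTUS itself, but a minimal illustration of the communication/compute pattern: a finite-difference sweep on a block-decomposed regular grid, exchanging one layer of ghost cells with neighbors before updating the interior (the 7-point averaging stencil and all names are placeholders; the real evolution equations have thousands of terms per point):

#include <mpi.h>
#include <stddef.h>

void stencil_sweep(double *u, double *unew, int nx, int ny, int nz,
                   int left, int right, MPI_Comm comm)
{
    #define IDX(i,j,k) ((size_t)(i)*ny*nz + (size_t)(j)*nz + (size_t)(k))
    MPI_Request req[4];

    /* Ghost exchange along x: planes 1 and nx-2 are the interior boundary planes */
    MPI_Irecv(&u[IDX(0,0,0)],    ny*nz, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&u[IDX(nx-1,0,0)], ny*nz, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(&u[IDX(1,0,0)],    ny*nz, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(&u[IDX(nx-2,0,0)], ny*nz, MPI_DOUBLE, right, 0, comm, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* Interior update: each point depends only on its six nearest neighbors,
       so the sweep proceeds with no further communication */
    for (int i = 1; i < nx-1; i++)
        for (int j = 1; j < ny-1; j++)
            for (int k = 1; k < nz-1; k++)
                unew[IDX(i,j,k)] = (u[IDX(i-1,j,k)] + u[IDX(i+1,j,k)] +
                                    u[IDX(i,j-1,k)] + u[IDX(i,j+1,k)] +
                                    u[IDX(i,j,k-1)] + u[IDX(i,j,k+1)]) / 6.0;
    #undef IDX
}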

Magnetic Fusion: GTC

- Gyrokinetic Toroidal Code: models the transport of thermal energy (plasma microturbulence)
- The goal of magnetic fusion is a burning-plasma power plant producing cleaner energy
- GTC solves the 3D gyroaveraged gyrokinetic system with a particle-in-cell (PIC) approach
- PIC scales as N instead of N^2: particles interact with the electromagnetic field on a grid
- Allows solving the equations of particle motion as ODEs (instead of nonlinear PDEs)
- Main computational tasks:
    Scatter: deposit particle charge onto the nearest grid points
    Solve: Poisson equation to obtain the potential at each grid point
    Gather: calculate the force on each particle from the neighboring potentials
    Move: advance particles by solving the equations of motion
    Shift: migrate particles that have moved outside the local domain
- Developed at the Princeton Plasma Physics Laboratory; vectorized by Stephane Ethier

[Figure: 3D visualization of the electrostatic potential in a magnetic fusion device]

GTC: Scatter operation

- Particle charge is deposited onto the nearest grid points
- The force on each particle is then calculated from the neighboring potentials, and the particle is moved accordingly
- Several particles can contribute to the same grid points, resulting in memory conflicts (dependencies) that prevent vectorization
- Solution: VLEN copies of the charge-deposition array, with a reduction after the main loop (sketched below)
    However, this greatly increases the memory footprint (8x)
- Since particles are randomly localized, the scatter step also hinders cache reuse
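A minimal sketch of the replication trick (1D grid, nearest-point deposition; VLEN, array names, and sizes are illustrative, not GTC's):

#define VLEN  256      /* hardware vector length (256 on the ES/SX-6) */
#define NGRID 4096     /* grid points in this toy example             */
#define NPART 100000   /* particles                                   */

/* Writing rho[x[i]] directly creates a dependency whenever two particles in
   the same vector share a grid point; VLEN private copies remove it, at the
   cost of VLEN times the memory for the deposition array.                  */
void scatter(const int x[NPART], const double q[NPART], double rho[NGRID])
{
    static double rho_v[VLEN][NGRID];   /* replicated deposition arrays */

    for (int v = 0; v < VLEN; v++)
        for (int g = 0; g < NGRID; g++)
            rho_v[v][g] = 0.0;

    /* Main loop: particle i updates copy i % VLEN, so no two iterations in a
       vector of VLEN consecutive particles write to the same array          */
    for (int i = 0; i < NPART; i++)
        rho_v[i % VLEN][x[i]] += q[i];

    /* Reduction after the main loop */
    for (int g = 0; g < NGRID; g++) {
        rho[g] = 0.0;
        for (int v = 0; v < VLEN; v++)
            rho[g] += rho_v[v][g];
    }
}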

Cosmology: MADbench

- Microwave Anisotropy Dataset Computational Analysis Package
- Optimal general algorithm for extracting key cosmological data from the Cosmic Microwave Background (CMB)
- The CMB encodes fundamental parameters of cosmology: the geometry of the Universe, its expansion rate, the number of neutrino species
- Preserves the full complexity of the underlying scientific problem
- Calculates the maximum-likelihood two-point angular correlation function
- Recasts the problem as dense linear algebra: ScaLAPACK
    Steps include: matrix-matrix and matrix-vector multiplication, Cholesky decomposition, redistribution
- High I/O requirement, due to the out-of-core nature of the calculation
- Developed at NERSC/CRD by Julian Borrill
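For context, the quantity being maximized is the standard Gaussian CMB likelihood (textbook notation, not lifted from the MADbench source):

-2 \ln \mathcal{L}(C_\ell \mid \mathbf{d}) = \mathbf{d}^{T}\mathbf{C}^{-1}\mathbf{d} + \ln \det \mathbf{C} + \mathrm{const},
\qquad \mathbf{C} = \mathbf{S}(C_\ell) + \mathbf{N}

Here d is the pixelized map, S the signal covariance implied by the angular power spectrum C_l, and N the pixel noise covariance; evaluating this over dense N_pix x N_pix matrices is what drives the Cholesky, matrix-matrix, and out-of-core I/O steps listed above.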

CMB Data Analysis

CMB analysis moves
    from the time domain - observations - O(10^12)
    to the pixel domain - maps - O(10^8)
    to the multipole domain - power spectra - O(10^4)
calculating the compressed data and their reduced error bars at each step.
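The first compression step, from time-ordered data to a map, is conventionally the maximum-likelihood map-making problem (standard notation, not specific to this package): with d = A s + n, where A is the pointing matrix and N the time-domain noise covariance,

\hat{\mathbf{s}} = \left(\mathbf{A}^{T}\mathbf{N}^{-1}\mathbf{A}\right)^{-1}\mathbf{A}^{T}\mathbf{N}^{-1}\mathbf{d}

compresses O(10^12) time samples into an O(10^8)-pixel map while carrying along the noise correlations needed for the power-spectrum step.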

MADbench: Performance Characterization

- In-depth analysis shows the performance contribution of each component on the evaluated architectures
- Identifies system-specific balance and opportunities for optimization
- Results show that I/O has a larger effect on the ES than on Seaborg, due to the ratio between I/O performance and peak ALU speed
- Demonstrated IPM's ability to measure MPI overhead on a variety of architectures without the need to recompile, at a trivial runtime overhead (<1%)

Overview

- Tremendous potential of vector architectures: 5 codes running faster than ever before
- Vector systems allow resolution not possible with scalar architectures (regardless of # processors)
- Opportunity to perform scientific runs at unprecedented scale
- ES shows high raw performance and much higher sustained performance compared with the X1
    Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
- Evaluation codes contain sufficient regularity in computation for high vector performance
    GTC is an example of a code partially at odds with data parallelism
- Important to characterize the full application, including I/O effects
- Much more difficult to evaluate codes poorly suited to vectorization
- Vectors are potentially at odds with emerging techniques (irregular, multi-physics, multi-scale)
- Plan to expand the scope of application domains/methods and examine the latest HPC architectures

Percentage of peak achieved on each platform:

  Code       Pwr3   Pwr4   Altix   ES    X1
  LBMHD       7%     5%    11%    58%   37%
  CACTUS      6%    11%     7%    34%    6%
  GTC         9%     6%     5%    20%   11%
  PARATEC    57%    33%    54%    58%   20%
  MADbench   49%    ---    19%    37%   17%