Scalable Spectral Transforms at Petascale
Dmitry Pekurovsky
San Diego Supercomputer Center, UC San Diego
Presented at XSEDE'13, July 22-25, 2013, San Diego

Introduction: Fast Fourier Transforms and related spectral transforms
Project scope: algorithms operating on structured grids in three dimensions that are computationally demanding and process data in each dimension independently of the others. Examples: Fourier, Chebyshev, and finite-difference high-order compact schemes.
- Heavily used in many areas of computational science
- Computationally demanding
- Typically not a cache-friendly algorithm; memory bandwidth is stressed
- Communication intense: the all-to-all exchange is an expensive operation, stressing the bisection bandwidth of the host's network

1D decomposition vs. 2D decomposition (figure: slab vs. pencil partitioning of the x, y, z grid)

Algorithm scalability
- 1D decomposition: concurrency is limited to N (the linear grid size). Not enough parallelism for O(10^4)-O(10^5) cores. This is the approach of most libraries to date (FFTW 3.3, PESSL).
- 2D decomposition: concurrency is up to N^2. Scaling to ultra-large core counts is possible. The answer to the petascale challenge.
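As a rough illustration (the 4096^3 grid size is assumed purely for the example), the concurrency limits of the two decompositions work out as:

```latex
P_{\max}^{\,1D} = N \quad (N = 4096 \;\Rightarrow\; P \le 4096),
\qquad
P_{\max}^{\,2D} = N^{2} \quad (N = 4096 \;\Rightarrow\; P \le 16{,}777{,}216).
```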

Need for a general-purpose scalable library for spectral transforms
Requirements for the library:
1. Scalable to large core counts on significantly large problem sizes (implies 2D decomposition)
2. Achieves performance and scalability reasonably close to the hardware's upper capability
3. Has a simple user interface
4. Is sufficiently versatile to be of use to many research groups

P3DFFT
- Open source library for efficient, highly scalable spectral transforms on parallel platforms
- Uses 2D decomposition; includes a 1D option
- Available at:
- Historically grew out of a TeraGrid Advanced User Support project (now called ECSS)

P3DFFT 2.6.1: features
Currently implements:
1. Real-to-complex (R2C) and complex-to-real (C2R) 3D FFT transforms
2. Complex-to-complex 3D FFT transforms
3. Cosine/sine/Chebyshev transforms in the third dimension (FFT in the first two dimensions)
4. Empty transform in the third dimension (the user can substitute their own custom algorithm)
Fortran and C interfaces
Single or double precision

P3DFFT features (cont'd)
- Arbitrary dimensions; handles many uneven cases (N_i does not have to be divisible by M_j)
- Can do either in-place or out-of-place transforms
- Can do pruned input/output (when only a subset of output or input modes is needed). This can save substantial time, as shown later.
- Includes installation instructions, extensive documentation, and example programs in Fortran and C

3D FFT algorithm with 2D decomposition (figure):
1. Perform 1D FFT in X
2. X-Y plane exchange in row subgroups
3. Perform 1D FFT in Y
4. Y-Z plane exchange in column subgroups
5. Perform 1D FFT in Z

P3DFFT implementation
- Baseline version implemented in Fortran 90 with MPI
- 1D FFT: call FFTW or ESSL
- Transpose implementation in 2D decomposition:
  - Set up 2D Cartesian subcommunicators using MPI_COMM_SPLIT (rows and columns)
  - Two transposes are needed: 1. in rows, 2. in columns
  - Baseline version: exchange data using MPI_Alltoall or MPI_Alltoallv (see the sketch below)
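A minimal C sketch of this setup, assuming a row-major mapping of ranks onto an M1 x M2 grid; the grid shape and buffer sizes are illustrative, not P3DFFT's actual code:

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch: P = M1 x M2 tasks in a 2D virtual grid (M2 and blk are assumed). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int M2  = 4;               /* columns of the processor grid (assumed)  */
    int row = rank / M2;       /* this task's row in the grid              */
    int col = rank % M2;       /* this task's column                       */

    /* Tasks sharing a row form one subcommunicator; likewise for columns. */
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    /* First transpose: equal-sized blocks exchanged within the row subgroup. */
    int blk = 1024;            /* doubles sent to each destination (assumed) */
    double *sendbuf = malloc((size_t)blk * M2 * sizeof *sendbuf);
    double *recvbuf = malloc((size_t)blk * M2 * sizeof *recvbuf);
    /* ... pack sendbuf from the local pencil ... */
    MPI_Alltoall(sendbuf, blk, MPI_DOUBLE, recvbuf, blk, MPI_DOUBLE, row_comm);
    /* ... unpack recvbuf, run 1D FFTs, then repeat the exchange in col_comm ... */

    free(sendbuf); free(recvbuf);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```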

Computation performance
- 1D FFT, done three times: 1. stride-1, 2. small stride, 3. large stride (out of cache)
- Strategy:
  - Use an established library (ESSL, FFTW)
  - An option to keep data in the original layout, or to transpose so that the stride is always 1; the results are then laid out as (Z,Y,X) instead of (X,Y,Z)
  - Use loop blocking to optimize cache use (see the sketch below)
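For illustration, a cache-blocked local transpose (a sketch; the block size B and array names are assumptions, not P3DFFT internals) that rearranges a local 2D slice so that the next round of 1D FFTs can run at stride 1:

```c
#include <stddef.h>

#define B 32  /* blocking factor; tune to the cache size (illustrative value) */

/* Transpose an n x m row-major array 'in' into an m x n row-major array 'out',
   working on B x B tiles so that both the reads and the writes of a tile
   stay cache-resident. */
void blocked_transpose(const double *in, double *out, int n, int m)
{
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < m; jj += B)
            for (int i = ii; i < ii + B && i < n; i++)
                for (int j = jj; j < jj + B && j < m; j++)
                    out[(size_t)j * n + i] = in[(size_t)i * m + j];
}
```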

Communication performance
- A large portion of total time (up to 80%) is all-to-all
- Highly dependent on an optimal implementation of MPI_Alltoall (varies with vendor)
- Buffers for the exchange are close in size: good load balance, predictable pattern
- Performance can be sensitive to the choice of the 2D virtual processor grid (M1, M2)
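A rough message-size estimate (a sketch, assuming an N^3 grid distributed over P = M1*M2 tasks) shows why the grid shape matters: in an exchange within a subgroup of m tasks, each task splits its N^3/P local elements into m messages, so

```latex
\text{message size} \;=\; \frac{N^3}{P\,m}, \qquad m \in \{M_1,\, M_2\}, \quad P = M_1 M_2 ,
```

and shrinking one grid dimension enlarges the messages of that exchange (and can keep it within a node), at the cost of a larger subgroup and smaller messages in the other exchange.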

Performance dependence on processor grid shape M1 x M2 (figure)

Communication scaling and networks
- All-to-all exchanges are directly affected by the bisection bandwidth of the interconnect
- Increasing P decreases the buffer size
- Expect 1/P scaling on fat-trees and other networks with full bisection bandwidth (until the buffer size drops below the latency threshold)
- On torus topologies (Cray XT5, XE6) the bisection bandwidth scales as P^(2/3), so expect P^(-2/3) scaling
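A back-of-the-envelope model behind these expectations (a sketch, not a measurement): for a fixed N^3 grid, each transpose moves on the order of N^3 words across the bisection, so

```latex
T_{\text{all-to-all}} \;\sim\; \frac{N^3}{B_{\text{bisection}}},\qquad
B_{\text{fat-tree}} \propto P \;\Rightarrow\; T \propto \frac{1}{P},\qquad
B_{\text{3D torus}} \propto P^{2/3} \;\Rightarrow\; T \propto P^{-2/3}.
```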

Strong scaling on Cray XT5 (Kraken) at NICS/ORNL (figure): fixed grid size, double precision, best M1/M2 combination

Weak scaling on Kraken (figure): N^3 grid, double precision

2D vs. 1D decomposition (figure)

Applications of P3DFFT
P3DFFT has already been applied in a number of codes, in science fields including:
- Turbulence
- Astrophysics
- Oceanography
Other potential areas include materials science, chemistry, aerospace engineering, X-ray crystallography, medicine, and atmospheric science.

DNS turbulence
- Direct Numerical Simulation (DNS) code from Georgia Tech (P. K. Yeung et al.) simulating isotropic turbulence on a cubic periodic domain
- Characterized by disorderly, nonlinear fluctuations in 3D space and time that span a wide range of interacting scales
- DNS is an important tool for a first-principles understanding of turbulence in great detail; vital for new concepts and models as well as improved engineering devices
- Areas of application include aeronautics, environment, combustion, meteorology, oceanography
- One of three Model Problems for NSF's Track 1 solicitation

DNS algorithm
- It is crucial to simulate grids with high resolution to minimize discretization effects and to study a wide range of length scales
- Uses 2nd- or 4th-order Runge-Kutta for time stepping
- Uses a pseudospectral method to solve the Navier-Stokes equations; the 3D FFT is the most time-consuming part
- A 2D decomposition based on the P3DFFT framework has been implemented
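For context, the pseudospectral idea in brief (a standard sketch, not taken from the slides): spatial derivatives are evaluated in Fourier space,

```latex
u(\mathbf{x},t) \;=\; \sum_{\mathbf{k}} \hat{u}(\mathbf{k},t)\, e^{\,i\mathbf{k}\cdot\mathbf{x}},
\qquad
\widehat{\frac{\partial u}{\partial x_j}} \;=\; i\,k_j\,\hat{u}(\mathbf{k},t),
```

while the nonlinear term (u . grad)u is formed in physical space to avoid convolution sums; this is what forces repeated forward and inverse 3D FFTs at every Runge-Kutta stage.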

DNS performance on Cray XT5 (figure; plotted vs. number of cores)

P3DFFT Development: Motivation
- 3D FFT is a very common algorithm. In pseudospectral algorithms simulating turbulence it is mostly used for isotropic conditions in homogeneous domains with periodic boundary conditions; in this case only real-to-complex and complex-to-real 3D FFTs are needed.
- For many researchers it is more interesting to study inhomogeneous systems, for example wall-bounded flows (Dirichlet or other non-periodic boundary conditions in one dimension). There, Chebyshev transforms or finite-difference high-order compact schemes are more appropriate.
- In simulating compressible turbulence, again high-order compact schemes are used.
- In other applications, a complex-to-complex transform may be needed, or a custom user transform.

P3DFFT Development: Motivation (cont'd)
- Many CFD/turbulence codes use the 2/3 dealiasing technique, where only 2/3 of the modes in each dimension are kept after the forward Fourier transform. This is a potential time- and memory-saving opportunity.
- Many codes have several independent arrays (variables) that need to be transformed. This can be implemented in a staggered fashion so as to overlap communication with computation.
- Some codes employ 3D rather than 2D domain decomposition; they need utilities to go between 2D and 3D.
- In some cases the usage scenario does not fit the common pattern, and the user may need access to the isolated transposes.
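To quantify the dealiasing saving (simple arithmetic under the 2/3 rule, which keeps 2/3 of the modes in each of the three dimensions):

```latex
\left(\tfrac{2}{3}\right)^{3} \;=\; \tfrac{8}{27} \;\approx\; 0.30,
```

so with pruned output only about 30% of the modes need to be computed, communicated, and stored.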

P3DFFT - Ongoing and planned work, Part 1: Interface and Flexibility
1. Added other types of transform (e.g. complex-to-complex, Chebyshev, empty) - DONE in P3DFFT 2.6.1
2. Added pruned input/output feature (allows implementing 2/3 dealiasing) - DONE in P3DFFT 2.6.1
3. Expanding the memory layout options, including 3D decomposition utilities
4. Adding the ability to isolate transposes so the user can design their own transform

P3DFFT performance for a large problem (figure)

P3DFFT - Ongoing and planned work, Part 2: Performance improvements
1. One-sided/nonblocking communication: MPI-2, MPI-3, OpenSHMEM, Co-Array Fortran
2. Communication/computation overlap - requires RDMA
3. Hybrid MPI/OpenMP implementation (see the threading sketch after this list)
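A minimal sketch of the hybrid idea, assuming FFTW for the 1D transforms (the function and variable names are illustrative, not P3DFFT's actual code): OpenMP threads split the independent 1D FFT lines of the local pencil among themselves, while MPI handles the inter-task transposes.

```c
#include <stddef.h>
#include <fftw3.h>

/* One MPI task owns 'nlines' independent lines of length 'nx' (e.g. x-pencils).
   The plan must have been created for an in-place 1D transform of length nx;
   fftw_execute_dft (the "new-array execute" interface) is thread-safe as long
   as each call operates on distinct data. */
void fft_lines_threaded(fftw_plan plan_1d, fftw_complex *data,
                        int nlines, int nx)
{
    #pragma omp parallel for schedule(static)
    for (int line = 0; line < nlines; line++) {
        fftw_complex *ptr = data + (size_t)line * nx;
        fftw_execute_dft(plan_1d, ptr, ptr);   /* reuse one plan for every line */
    }
}
```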

Timeline diagram (figure) comparing the default scheme, in which communication phases (Comm. 1-4) and computation phases (Comp. 1-4) alternate, with the overlap scheme, in which the communication of one variable proceeds concurrently with the computation of another; idle time is marked.

Coarse-grain overlap
- Suitable for computing several FFTs at once: independent variables, e.g. velocity components
- Overlap the communication stage of one variable with the computation stage of another variable
- Uses large send buffers due to message aggregation
- Uses a pairwise exchange algorithm, implemented through either MPI-2, SHMEM or Co-Array Fortran
- Alternatively, as of recently, MPI-3 nonblocking collectives have become available (see the sketch below)
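A simplified sketch (C; buffer names and block sizes are illustrative) of overlapping via the MPI-3 nonblocking collective: start the all-to-all for one variable, compute on another, then wait.

```c
#include <mpi.h>

/* Sketch: overlap the all-to-all of variable A with the 1D FFTs of variable B.
   Buffer names, 'blk', and the commented-out FFT helper are placeholders. */
void overlapped_step(double *a_send, double *a_recv, int blk,
                     MPI_Comm row_comm)
{
    MPI_Request req;

    /* Start redistributing A within the row subgroup (MPI-3 nonblocking). */
    MPI_Ialltoall(a_send, blk, MPI_DOUBLE,
                  a_recv, blk, MPI_DOUBLE, row_comm, &req);

    /* Meanwhile, compute on variable B, e.g.:
       fft_lines(b_data, b_lines, nx);   (hypothetical helper)            */

    /* A's blocks must have arrived before A's own FFT stage can proceed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```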

Coarse-grain overlap: results on a Mellanox ConnectX-2 cluster, 64 and 128 cores (figure)
K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur, D. Panda, "High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT", ISC'11, Germany; Computer Science - Research and Development, v. 26, i. 3 (2011).

Hybrid MPI/OpenMP preliminary results on Kraken (figure)

Conclusions
- P3DFFT is an efficient, scalable and versatile library, available as open source
- Performance consistent with hardware capability is achieved on leading platforms
- Great potential for enabling petascale science
- An excellent tool for testing future platforms' capabilities: bisection bandwidth, MPI implementation, one-sided protocol implementations, MPI/OpenMP hybrid performance

Conclusions (cont'd)
- An example of a project that came out of an Advanced User Support collaboration, now benefiting a wider community
- Incorporated into a number of codes (~25 citations as of today, hundreds of downloads)
- A future XSEDE community code
- Work is under way to expand capability and improve parallel performance even further
- WHAT ARE YOUR PETASCALE ALGORITHMIC NEEDS? Send me an email.

Acknowledgements
P. K. Yeung, D. A. Donzis, G. Chukkappalli, J. Goebbert, G. Brethouser, N. Prigozhina, K. Tomko, K. Kandalla, H. Subramoni, S. Sur, D. Panda
Work supported by XSEDE and by NSF grants (OCI and CCF).
Benchmarks were run on the TeraGrid resources Ranger (TACC) and Kraken (NICS), the DOE resources Jaguar (NCCS/ORNL) and Hopper (NERSC), and Blue Waters (NCSA).