Three-dimensional Fast Fourier Transform (3D FFT): the algorithm


Three-dimensional Fast Fourier Transform (3D FFT): the algorithm
- A 1D FFT is applied three times (once each for X, Y, and Z).
- Easily parallelized and load-balanced.
- Use the transpose approach: call the FFT on local data only, and transpose where necessary so that the data for the current transform direction are local (see the sketch below).
- It is more efficient to transpose the data once than to exchange data multiple times during a distributed 1D FFT.
- At each stage there are many 1D FFTs to do; divide the work evenly.
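A rough structural sketch of the transpose approach. The helpers fft_1d_local, transpose_xy, and transpose_yz are hypothetical names for illustration (the 1D FFT acts on the local stride-1 dimension; each transpose is a distributed all-to-all exchange); this shows the pattern, not P3DFFT's actual internals.

```fortran
! Sketch of a transpose-based parallel 3D FFT (structure only).
subroutine fft3d_forward(u)
   complex, intent(inout) :: u(:,:,:)   ! local portion of the global array

   call fft_1d_local(u)    ! X is local: many independent 1D FFTs
   call transpose_xy(u)    ! all-to-all exchange so Y becomes local
   call fft_1d_local(u)    ! 1D FFTs along Y
   call transpose_yz(u)    ! all-to-all exchange so Z becomes local
   call fft_1d_local(u)    ! 1D FFTs along Z
end subroutine fft3d_forward
```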

Algorithm scalability
- 1D decomposition: concurrency is limited to N (the linear grid size). This is not enough parallelism for O(10^4)-O(10^5) cores, yet it is the approach of most libraries to date (FFTW 3.2, PESSL). For example, a 2048^3 grid can occupy at most 2048 cores under a 1D decomposition.
- 2D decomposition: concurrency is up to N^2 (over four million tasks for the same 2048^3 grid). Scaling to ultra-large core counts is possible, though not guaranteed.
- This is the answer to the petascale challenge.

[Figure: 1D decomposition vs. 2D decomposition of the 3D grid; axis labels x, y, z.]

2D decomposition
[Figure: the P = P1 x P2 processor grid; data are transposed first within rows, then within columns.]

P3DFFT
- Open-source library for efficient, highly scalable 3D FFT on parallel platforms.
- Built on top of an optimized 1D FFT library (currently ESSL or FFTW; more libraries in the future).
- Uses 2D decomposition; includes a 1D option.
- Available at http://www.sdsc.edu/us/resources/p3dfft.php
- Includes example programs in Fortran and C.

P3DFFT: features
- Implements real-to-complex (R2C) and complex-to-real (C2R) 3D transforms.
- Implemented as an F90 module, with Fortran and C interfaces.
- Performance-optimized.
- Single or double precision.
- Arbitrary dimensions; handles many uneven cases (Ni does not have to be divisible by Pj).
- Can do either in-place or out-of-place transforms.

P3DFFT: storage
- R2C transform input:
  - Global: (Nx,Ny,Nz) real array.
  - Local: (Nx,Ny/P1,Nz/P2) real array.
- R2C output:
  - Global: (Nx/2+1,Ny,Nz) complex array. By conjugate symmetry, F(N-i) = F*(i), so F(N/2+2) through F(N) are redundant (e.g., for N = 8, only F(1) through F(5) need be stored).
  - Local: ((Nx/2+1)/P1,Ny/P2,Nz) complex array.
- C2R: input and output are interchanged relative to R2C.

Using P3DFFT
- Fortran: 'use p3dfft'; C: #include "p3dfft.h"
- Initialization: p3dfft_setup(proc_dims,nx,ny,nz,overwrite)
  - integer proc_dims(2): sets up the 2D grid of MPI communicators in rows and columns and initializes the local array dimensions. For a 1D decomposition, set P1 = 1.
  - overwrite: logical (boolean) variable: is it OK to overwrite the input array in btran?
  - Defines mytype (4- or 8-byte reals).

Using P3DFFT (2)
- Local dimensions: get_dims(istart,iend,isize,Q)
  - integer istart(3), iend(3), isize(3)
  - Set Q=1 for R2C input, Q=2 for R2C output.
- Now allocate your arrays.
- When ready, call the Fourier transform:
  - Forward transform exp(-ikx/N): p3dfft_ftran_r2c(IN,OUT)
  - Backward (inverse) transform exp(ikx/N): p3dfft_btran_c2r(IN,OUT)
- A combined usage sketch follows.
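Putting the two slides together, a minimal end-to-end sketch in Fortran. It uses only the calls named on these slides (p3dfft_setup, get_dims, p3dfft_ftran_r2c, p3dfft_btran_c2r); the grid size and processor split are illustrative, mytype is assumed to be the real-kind parameter the module defines, and error checking is omitted.

```fortran
program p3dfft_demo
   use p3dfft
   implicit none
   include 'mpif.h'

   integer :: proc_dims(2), istart(3), iend(3), isize(3)
   integer :: fstart(3), fend(3), fsize(3), ierr
   real(mytype),    allocatable :: u(:,:,:)     ! R2C input (real)
   complex(mytype), allocatable :: uhat(:,:,:)  ! R2C output (complex)

   call MPI_Init(ierr)

   ! 2D processor grid P = P1 x P2 = 32 tasks here
   ! (set P1 = 1 for a 1D decomposition);
   ! .true. allows the backward transform to overwrite its input.
   proc_dims = (/ 4, 8 /)
   call p3dfft_setup(proc_dims, 128, 128, 128, .true.)

   ! Local dimensions: Q=1 for R2C input layout, Q=2 for R2C output layout
   call get_dims(istart, iend, isize, 1)
   call get_dims(fstart, fend, fsize, 2)

   allocate(u(isize(1), isize(2), isize(3)))
   allocate(uhat(fsize(1), fsize(2), fsize(3)))
   u = 1.0_mytype                   ! fill with some data

   call p3dfft_ftran_r2c(u, uhat)   ! forward transform, exp(-ikx/N)
   call p3dfft_btran_c2r(uhat, u)   ! backward transform, exp(+ikx/N)

   call MPI_Finalize(ierr)
end program p3dfft_demo
```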

P3DFFT implementation
- Implemented in Fortran90 with MPI.
- 1D FFT: calls FFTW or ESSL.
- Transpose implementation in the 2D decomposition (see the sketch below):
  - Set up 2D Cartesian subcommunicators (rows and columns) using MPI_COMM_SPLIT.
  - Exchange data using MPI_Alltoall or MPI_Alltoallv.
  - Two transposes are needed: 1. within rows, 2. within columns.
- MPI composite datatypes are not used; data are assembled into and out of the send/receive buffers manually.
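A sketch of the communicator setup and one row-wise exchange. It assumes a row-major placement of ranks on the P1 x P2 grid and equal-sized, already-packed buffers; this illustrates the pattern, not P3DFFT's exact code.

```fortran
! Sketch: split MPI_COMM_WORLD into row and column subcommunicators
! for a P1 x P2 process grid, then exchange within rows.
subroutine setup_and_transpose(p1, p2, n, sendbuf, recvbuf)
   implicit none
   include 'mpif.h'
   integer, intent(in) :: p1, p2, n      ! grid shape, items per partner
   real,    intent(in)  :: sendbuf(*)    ! manually packed send buffer
   real,    intent(out) :: recvbuf(*)
   integer :: rank, row, col, row_comm, col_comm, ierr

   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   row = rank / p1          ! each row holds p1 consecutive ranks (assumed)
   col = mod(rank, p1)

   ! Ranks sharing a row get one communicator; likewise for columns.
   call MPI_Comm_split(MPI_COMM_WORLD, row, col, row_comm, ierr)
   call MPI_Comm_split(MPI_COMM_WORLD, col, row, col_comm, ierr)

   ! First transpose: all-to-all within the row (n items per partner).
   call MPI_Alltoall(sendbuf, n, MPI_REAL, recvbuf, n, MPI_REAL, &
                     row_comm, ierr)
   ! (The second transpose does the same within col_comm.)
end subroutine setup_and_transpose
```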

Communication performance
- Communication is a big part of the total time (~50%).
- Message sizes are medium to large, so performance is mostly sensitive to network bandwidth, not latency.
- All-to-all exchanges are directly affected by the bisection bandwidth of the interconnect.
- Exchange buffers are close in size, giving good load balance.
- Performance can be sensitive to the choice of P1 and P2. It is often best when P1 < #cores/node: in that case rows are contained inside a node, and only one of the two transposes goes out of the node.

Communication scaling and networks
- Increasing P decreases the buffer size.
- Expect 1/P scaling on fat trees and other networks with full bisection bandwidth (unless the buffer size drops below the latency threshold).
- On a torus topology, bisection bandwidth scales only as P^(2/3); some of the deficit may be made up by very high link bandwidth (Cray XT3/4).
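A back-of-the-envelope model consistent with this slide, though not taken from it (b_task and b_link are assumed effective per-task and per-link bandwidths):

```latex
% Each of the P tasks holds N^3/P points and sends nearly all of them
% in each all-to-all transpose:
V_{\text{task}} \sim \frac{N^3}{P}

% Full bisection bandwidth (fat tree): per-task bandwidth is roughly
% constant, so communication time falls as 1/P:
T_{\text{comm}} \sim \frac{N^3}{P \, b_{\text{task}}}

% 3D torus: the total volume ~ N^3 must cross a bisection whose
% bandwidth grows only as P^{2/3} times the link bandwidth:
T_{\text{comm}} \sim \frac{N^3}{P^{2/3} \, b_{\text{link}}}
```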

Communication scaling and networks (cont.)
- Process mapping on BG/L did not make a difference up to 32K cores; routing within 2D communicators on a 3D torus is not trivial.
- MPI performance on Cray XT3/4: MPI_Alltoallv is significantly slower than MPI_Alltoall. P3DFFT therefore has an option to use MPI_Alltoall and pad the buffers to make them equal size (see the sketch below).
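A sketch of that padding trick: pack each partner's (possibly uneven) chunk into a slot of the worst-case size, then use the fixed-count MPI_Alltoall. The routine and the packed layout of `work` are illustrative assumptions, not P3DFFT's actual code.

```fortran
! Sketch: replace an uneven MPI_Alltoallv with a padded MPI_Alltoall.
! counts(i) is the true number of reals for partner i.
subroutine padded_alltoall(nprocs, comm, counts, work, sendbuf, recvbuf)
   implicit none
   include 'mpif.h'
   integer, intent(in) :: nprocs, comm
   integer, intent(in) :: counts(nprocs)
   real, intent(in)  :: work(*)                  ! contiguous packed data
   real, intent(out) :: sendbuf(*), recvbuf(*)   ! size >= nprocs*nmax
   integer :: i, nmax, nmax_local, off, slot, ierr

   ! Agree on the worst-case chunk size across all ranks.
   nmax_local = maxval(counts)
   call MPI_Allreduce(nmax_local, nmax, 1, MPI_INTEGER, MPI_MAX, &
                      comm, ierr)

   ! Copy each partner's chunk into an nmax-sized slot (tail is padding).
   off = 1
   do i = 1, nprocs
      slot = (i - 1) * nmax + 1
      sendbuf(slot : slot + counts(i) - 1) = work(off : off + counts(i) - 1)
      off = off + counts(i)
   end do

   ! Equal-sized exchange; receivers ignore the padded tails.
   call MPI_Alltoall(sendbuf, nmax, MPI_REAL, recvbuf, nmax, MPI_REAL, &
                     comm, ierr)
end subroutine padded_alltoall
```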

Computation performance
- 1D FFT, three times: 1. stride-1, 2. small stride, 3. large stride (out of cache).
- Strategy:
  - Use an established library (ESSL, FFTW).
  - It can help to reorder the data so that the stride is 1; the results are then laid out as (Z,X,Y) instead of (X,Y,Z).
  - Use loop blocking to optimize cache use (see the sketch below). This is especially important with FFTW when doing an out-of-place transform; it is not a significant improvement with ESSL or when doing in-place transforms.
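A sketch of the cache-blocked reordering idea: copy the array in small tiles so that both the read side and the write side stay in cache. The tile size nb is illustrative and would be tuned to the cache.

```fortran
! Sketch: cache-blocked reordering of u(x,y,z) into v(z,x,y) so the
! next round of 1D FFTs sees stride-1 data.
subroutine reorder_blocked(u, v, nx, ny, nz)
   implicit none
   integer, intent(in) :: nx, ny, nz
   real, intent(in)  :: u(nx, ny, nz)
   real, intent(out) :: v(nz, nx, ny)
   integer, parameter :: nb = 32        ! tile size (illustrative)
   integer :: x, y, z, x0, z0

   do y = 1, ny
      do x0 = 1, nx, nb                 ! tile over the large-stride dims
         do z0 = 1, nz, nb
            do x = x0, min(x0 + nb - 1, nx)
               do z = z0, min(z0 + nb - 1, nz)
                  v(z, x, y) = u(x, y, z)
               end do
            end do
         end do
      end do
   end do
end subroutine reorder_blocked
```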

Performance on Cray XT4 (Kraken) at NICS/ORNL
[Figure: speedup vs. number of cores (P).]

Applications of P3DFFT
- P3DFFT has already been applied in a number of codes, in science fields including turbulence, astrophysics, and oceanography.
- Other potential areas include molecular dynamics, X-ray crystallography, and atmospheric science.

DNS turbulence
- Code from Georgia Tech (P.K. Yeung et al.) simulates isotropic turbulence on a cubic periodic domain.
- Uses a pseudospectral method to solve the Navier-Stokes equations; the 3D FFT is the most time-consuming part.
- High-resolution grids are essential to minimize discretization effects and to study a wide range of length scales, so efficient use of state-of-the-art compute resources, for both total memory and speed, is crucial.
- A 2D decomposition based on the P3DFFT framework has been implemented.

DNS performance
[Plot adapted from D.A. Donzis et al., "Turbulence simulations on O(10^4) processors", presented at TeraGrid '08, June 2008.]

P3DFFT - future work, part 1: interface and flexibility
- Add the option to break the 3D FFT into stages: three transforms and two transposes.
- Expand the memory layout options: currently (X,Y,Z) on input and (X,Y,Z) or (Z,X,Y) on output.
- Expand the decomposition options: currently 1D or 2D, with Y and Z divided on input; in the future, more general processor layouts, including 3D.
- Add other types of transform: cosine, sine, etc.

P3DFFT - future work, part 2: communication/computation overlap
- Approach 1: coarse-grain overlap. Stagger different stages of the 3D FFT algorithm across different fields, overlapping computation with the all-to-all transposes (see the sketch below). Use non-blocking MPI, one-sided MPI-2, or something else. Buffer sizes are still large (insensitive to latency).
- Approach 2: fine-grain overlap. Overlap FFT stages with transposes for parts of the array (2D planes?) within each field. Use one-sided MPI-2, CAF, or something else. Buffer sizes are small, but if the overlap is implemented efficiently, latency may be hidden. Combine with a threaded version?
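A sketch of the coarse-grain staggering idea across two fields. It uses the MPI-3 non-blocking collective MPI_Ialltoall, which postdates the MPI-2 options listed on the slide, purely to make the overlap concrete; fft_stage and pack_sendbuf are hypothetical helpers.

```fortran
! Sketch: overlap field A's all-to-all transpose with field B's FFT
! stage, and vice versa (coarse-grain overlap across fields).
subroutine overlapped_transforms(a, b, sa, ra, sb, rb, n, comm)
   implicit none
   include 'mpif.h'
   integer, intent(in) :: n, comm
   real, intent(inout) :: a(*), b(*), sa(*), ra(*), sb(*), rb(*)
   integer :: req_a, req_b, ierr

   call fft_stage(a)                 ! stage 1 of field A
   call pack_sendbuf(a, sa)
   call MPI_Ialltoall(sa, n, MPI_REAL, ra, n, MPI_REAL, comm, req_a, ierr)

   call fft_stage(b)                 ! compute field B while A is in flight
   call pack_sendbuf(b, sb)
   call MPI_Ialltoall(sb, n, MPI_REAL, rb, n, MPI_REAL, comm, req_b, ierr)

   call MPI_Wait(req_a, MPI_STATUS_IGNORE, ierr)
   call fft_stage(ra)                ! next stage of A overlaps B's exchange
   call MPI_Wait(req_b, MPI_STATUS_IGNORE, ierr)
   call fft_stage(rb)
end subroutine overlapped_transforms
```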

Conclusions
- An efficient, scalable parallel 3D FFT library is being developed (open-source download available at http://www.sdsc.edu/us/resources/p3dfft.php).
- Good potential for enabling petascale science.
- Future work is planned for expanded interface flexibility and better ultra-scale performance.
- Questions or comments: dmitry@sdsc.edu

Acknowledgements
- P.K. Yeung, D.A. Donzis, G. Chukkappalli, J. Goebbert, G. Brethouser, N. Prigozhina.
- Supported by NSF/OCI.