Communication Analysis of Parallel 3D FFT for Flat Cartesian Meshes on Large Blue Gene Systems
A. Chan, P. Balaji, W. Gropp, R. Thakur
Mathematics and Computer Science Division, Argonne National Laboratory
University of Illinois, Urbana-Champaign
Presenter: Pavan Balaji, Argonne National Laboratory (HiPC, 12/19/2008)

Fast Fourier Transform
- One of the most popular and widely used numerical methods in scientific computing
- Forms a core building block for applications in many fields, e.g., molecular dynamics, many-body simulations, Monte Carlo simulations, and partial differential equation solvers
- 1D, 2D, and 3D data grid FFTs are all used
  – The data grid represents the dimensionality of the data being operated on
- 2D process grids are popular
  – The process grid represents the logical layout of the processes
  – E.g., used by P3DFFT
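
To make the "core building block" point concrete, here is a minimal sketch of a sequential 1D complex-to-complex FFT using the FFTW3 library. The choice of FFTW and all variable names are my own illustration; the slides do not say which sequential FFT library is used.

    #include <fftw3.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 16;                         /* 1D transform length */
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* Plan a forward 1D DFT; FFTW_ESTIMATE avoids expensive planning. */
        fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

        for (int i = 0; i < n; i++) {             /* fill the input with a simple signal */
            in[i][0] = (double)i;                 /* real part */
            in[i][1] = 0.0;                       /* imaginary part */
        }

        fftw_execute(plan);                       /* compute the transform */
        printf("out[0] = %g + %gi\n", out[0][0], out[0][1]);

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }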

Parallel 3D FFT with P3DFFT
- Relatively new implementation of 3D FFT from SDSC
- Designed for massively parallel systems
  – Reduces synchronization overheads compared to other 3D FFT implementations
  – Communicates along rows and columns of the 2D process grid
  – Internally utilizes sequential 1D FFT libraries and performs data-grid transposes to collect the required data
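
The following is a generic MPI sketch of the kind of 2D process grid and row/column sub-communicators that P3DFFT communicates over; it illustrates the idea only and is not P3DFFT's actual internal code.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Build a P_row x P_col logical process grid. */
        int dims[2] = {0, 0}, periods[2] = {0, 0};
        MPI_Dims_create(nprocs, 2, dims);
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

        /* Row sub-communicator: keep the second grid dimension, drop the first. */
        int keep_row[2] = {0, 1};
        MPI_Comm row_comm;
        MPI_Cart_sub(cart, keep_row, &row_comm);

        /* Column sub-communicator: keep the first grid dimension, drop the second. */
        int keep_col[2] = {1, 0};
        MPI_Comm col_comm;
        MPI_Cart_sub(cart, keep_col, &col_comm);

        /* P3DFFT-style transposes would issue MPI_Alltoallv on row_comm and col_comm. */

        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }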

P3DFFT for Flat Cartesian Meshes
- A lot of prior work exists on improving 3D FFT performance
- Most of it focuses on regular 3D Cartesian meshes
  – All sides of the mesh are of (almost) equal size
- Flat 3D Cartesian meshes are becoming popular
  – A good tool for studying quasi-2D systems that occur during the transition of 3D systems to 2D systems
  – E.g., superconducting condensates, the quantum Hall effect, and turbulence theory in geophysical studies
  – Failure of P3DFFT for such systems is a known problem
- Objective: understand the communication characteristics of P3DFFT, especially with respect to flat 3D meshes

Presentation Layout
- Introduction
- Communication overheads in P3DFFT
- Experimental Results and Analysis
- Concluding Remarks and Future Work

BG/L Network Overview
- BG/L has five different networks
  – Two of them (1G Ethernet and 100M Ethernet with a JTAG interface) are used for file I/O and system management
  – 3D Torus: used for point-to-point MPI communication (as well as for collectives with large message sizes)
  – Global Collective Network: used for collectives with small messages and regular communication patterns
  – Global Interrupt Network: used for barriers and other process-synchronization routines
- For Alltoallv (in P3DFFT), the 3D torus network is used
  – 175 MB/s bandwidth per link per direction (1.05 GB/s total across a node's six torus links)

Mapping 2D Process Grid to BG/L
- A 512-process system:
  – By default broken into a 32x16 logical process grid (provided by MPI_Dims_create)
  – Forms an 8x8x8 physical process grid on the BG/L
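
A quick check of the default factorization mentioned above, using standard MPI: with both entries of dims left at zero, MPI_Dims_create chooses dimensions as close to each other as possible, which for 512 processes is the 32x16 grid the slide describes.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Zero entries mean "let MPI choose"; the dimensions are picked
         * as close to each other as possible: 32 x 16 for 512 processes. */
        int dims[2] = {0, 0};
        MPI_Dims_create(512, 2, dims);
        printf("default grid: %d x %d\n", dims[0], dims[1]);

        MPI_Finalize();
        return 0;
    }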

Communication Characterization of P3DFFT
- Consider a process grid of P = P_row x P_col and a data grid of N = n_x x n_y x n_z
- P3DFFT performs a two-step process (forward transform and reverse transform)
  – The first step requires n_z / P_col Alltoallv's over the row sub-communicator, with message size m_row = N / (n_z x P_row^2)
  – The second step requires one Alltoallv over the column sub-communicator, with message size m_col = N x P_row / P^2
  – Total time = the sum of the Alltoallv costs of the two steps
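
A small sketch that plugs concrete numbers into the message-size expressions above. The formulas are the ones from the slide; the specific data-grid and process-grid sizes, the variable names, and treating the sizes in data elements are my own illustrative assumptions.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical example: 512 processes as a 32 x 16 grid and a
         * flat Cartesian data grid of 2048 x 2048 x 64. */
        long P_row = 32, P_col = 16;
        long n_x = 2048, n_y = 2048, n_z = 64;

        long P = P_row * P_col;
        long N = n_x * n_y * n_z;

        /* Step 1: n_z / P_col Alltoallv's over the row sub-communicator. */
        long num_row_alltoallv = n_z / P_col;
        double m_row = (double)N / ((double)n_z * P_row * P_row);

        /* Step 2: one Alltoallv over the column sub-communicator. */
        double m_col = (double)N * P_row / ((double)P * P);

        printf("row step:    %ld Alltoallv's, message size %.0f elements\n",
               num_row_alltoallv, m_row);
        printf("column step: 1 Alltoallv,    message size %.0f elements\n", m_col);
        return 0;
    }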

Trends in P3DFFT Performance
- Total communication time is impacted by three variables:
  – Message size: too small a message means the network bandwidth is not fully utilized; too large a message is "OK", but it implies the other communicator's message size will be too small
  – Communicator size: the smaller the better
  – Communicator topology (and the corresponding congestion): congestion increases quadratically with communicator size, so it will have a large impact on large-scale systems

Presentation Layout
- Introduction
- Communication overheads in P3DFFT
- Experimental Results and Analysis
- Concluding Remarks and Future Work

Alltoallv Bandwidth on Small Systems

Alltoallv Bandwidth on Large Systems

Communication Analysis on Small Systems
- A small P_row and a small n_z provide the best performance on small-scale systems
  – This is the exact opposite of MPI's default behavior! MPI_Dims_create tries to keep P_row and P_col as close as possible; we need them to be as far apart as possible
  – Difference of up to 10%
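
A hedged sketch of how one could override MPI's default and force an elongated process grid: MPI_Dims_create leaves nonzero entries unchanged, so presetting one dimension is enough. Whether P3DFFT accepts the resulting grid this way is an assumption on my part, since it has its own setup interface.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Force a small P_row: MPI_Dims_create does not modify nonzero
         * entries, so dims[0] stays 4 and dims[1] becomes nprocs / 4.
         * For 512 processes this yields a 4 x 128 grid instead of the
         * default 32 x 16 (nprocs must be divisible by 4 here). */
        int dims[2] = {4, 0};
        MPI_Dims_create(nprocs, 2, dims);
        printf("chosen grid: %d x %d\n", dims[0], dims[1]);

        MPI_Finalize();
        return 0;
    }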

Evaluation on Large Systems (16 racks)
- A small P_row still performs the best
- Unlike on small systems, a large n_z is better on large systems
  – Increasing congestion plays an important role
  – Difference of as much as 48%

Presentation Layout
- Introduction
- Communication overheads in P3DFFT
- Experimental Results and Analysis
- Concluding Remarks and Future Work

Concluding Remarks and Future Work
- We analyzed the communication in P3DFFT on BG/L and identified the parameters that impact performance
  – Evaluated the impact of the different parameters and identified trends in performance
  – Found that while uniform process grid topologies are ideal for uniform 3D data grids, non-uniform process grid topologies are ideal for flat Cartesian data grids
  – Showed up to 48% improvement in performance by using this understanding to tweak the parameters
- Future work: we intend to repeat this study on Blue Gene/P (its performance counters make the analysis a lot more interesting)

Thank You!

Contacts: {chan, balaji, ...}