1 Scalable Distributed Fast Multipole Methods Qi Hu, Nail A. Gumerov, Ramani Duraiswami Institute for Advanced Computer Studies Department of Computer Science University of Maryland, College Park, MD

2 Previous work
FMM algorithms for GPUs
-Gumerov and Duraiswami (2008) explored the FMM algorithm on the GPU
-Yokota, et al. (2009) presented the FMM on a GPU cluster
FMM algorithms for distributed systems
-Greengard and Gropp (1990) discussed parallelizing the FMM
-Ying, et al. (2003): a parallel version of the kernel-independent FMM
-Lashuk, et al. (2009) presented a kernel-independent adaptive FMM on heterogeneous architectures
-Cruz, et al. (2010): the PetFMM library
-Hu, et al. (2011): the heterogeneous FMM algorithm
-Yokota, et al. (2011): used the FMM to simulate turbulence on 4096 GPUs

3 Issues with previous results
Communication data structures: the local essential tree
-A bottleneck
-No implementation details provided
In particular, in our previous work, Hu, et al. (2011):
-The L|L translations are not fully distributed
-Very large data transfer overheads

4 Contributions
Efficient, scalable communication management data structures
-Classify the spatial boxes
-Determine boxes for internode exchange on the GPU
-Communicate only the necessary data
-Small amount of communication data
Fully distribute all parts of the FMM among all the nodes
Extremely fast parallel algorithms for the FMM data structures
-O(N) complexity, much faster than the evaluation steps
-Suitable for dynamic problems

5 Motivation: Brownout
Complicated phenomena involving interactions between the rotorcraft wake, the ground, and dust particles
Causes accidents due to poor visibility and damage to helicopters
Understanding can lead to mitigation strategies
Lagrangian (vortex element) methods to compute the flow
Fast evaluation of the fields at particle locations
Need for fast evaluation of all pairwise 3D interactions

6 Motivation
Many other applications require fast evaluation of pairwise interactions with the 3D Laplacian kernel and its derivatives
-Astrophysics (gravity potential and forces) wissrech.ins.uni-bonn.de
-Molecular dynamics (Coulomb potential and forces)
-Micro- and nanofluidics (complex channel Stokes flows)
-Imaging and graphics (high-quality RBF interpolation)
-Much more!

7 Introduction to fast multipole methods
Problem: compute the matrix-vector product for some kernel matrix
Linear computation and memory cost, O(N+M), at any prescribed accuracy
Split the sum into far-field and near-field terms
Direct kernel evaluations for the near field
Approximation of the far-field sum via multipole expansions of the kernel function and spatial data structures (an octree in 3D)
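In formula form, the split reads as below (the notation with sources x_i, strengths q_i, and targets y_j is an illustrative assumption, not copied from the slides):

```latex
% Near/far split assumed by the FMM (illustrative notation):
\phi(y_j) \;=\; \sum_{i=1}^{N} q_i\, K(y_j, x_i)
\;=\; \underbrace{\sum_{x_i \in \mathrm{Near}(y_j)} q_i\, K(y_j, x_i)}_{\text{direct evaluation}}
\;+\; \underbrace{\sum_{x_i \in \mathrm{Far}(y_j)} q_i\, K(y_j, x_i)}_{\text{approximated via expansions}}
```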

8 Introduction to the fast multipole method
The local and multipole expansions of the Laplace kernel about a box center, with truncation number p
Expansion regions are valid for well-separated pairs, realized using the spatial boxes of an octree (hierarchical data structure)
Translations of expansion coefficients
-Multipole-to-multipole translations (M|M)
-Multipole-to-local translations (M|L)
-Local-to-local translations (L|L)
r^n Y_nm: local spherical basis functions
r^-(n+1) Y_nm: multipole spherical basis functions
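For reference, a standard form of the truncated expansions about a box center x_* (the normalization of Y_nm varies between references; this is an assumed illustration, not the slides' exact formulas):

```latex
% Local (regular) expansion, valid for y inside a ball around x_*:
\phi(y) \approx \sum_{n=0}^{p-1} \sum_{m=-n}^{n} c_{nm}\, r^{n}\, Y_{nm}(\theta,\varphi),
\qquad (r,\theta,\varphi) = \text{spherical coordinates of } y - x_* .
% Multipole (singular) expansion, valid for y outside a ball around x_*:
\phi(y) \approx \sum_{n=0}^{p-1} \sum_{m=-n}^{n} d_{nm}\, r^{-(n+1)}\, Y_{nm}(\theta,\varphi).
% Each truncated expansion carries p^2 coefficients.
```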

9 FMM flow chart
1. Build data structures
2. Initial M-expansions
3. Upward M|M translations
4. Downward M|L and L|L translations
5. L-expansions
6. Local direct sum (P2P) and final summation
From a Java animation of the FMM by Y. Wang, M.S. thesis, UMD, 2005
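The six steps map directly onto a per-node code path. A minimal sketch follows, assuming an octree object `tree` with the interface shown and taking the expansion/translation operators (p2m, m2m, m2l, l2l, l2p, p2p) as callables; all of these names are illustrative placeholders, not the authors' implementation.

```python
# Illustrative skeleton of steps 2-6 of the FMM flow chart.  The octree
# `tree` and the operator callables are assumed interfaces (placeholders),
# not the authors' actual data structures or kernels.

def fmm_pass(tree, sources, charges, targets, p2m, m2m, m2l, l2l, l2p, p2p):
    # 2. Initial multipole expansions at the leaf boxes (P2M).
    for box in tree.leaves():
        box.M = p2m(sources[box.src_ids], charges[box.src_ids], box.center)

    # 3. Upward pass: consolidate children into parents (M|M).
    for level in range(tree.max_level - 1, 1, -1):
        for box in tree.boxes(level):
            box.M = sum(m2m(c.M, c.center, box.center) for c in box.children)

    # 4. Downward pass: interaction lists (M|L) plus inheritance (L|L).
    for level in range(2, tree.max_level + 1):
        for box in tree.boxes(level):
            box.L = l2l(box.parent.L, box.parent.center, box.center) if level > 2 else 0
            for src in box.interaction_list():
                box.L = box.L + m2l(src.M, src.center, box.center)

    # 5.-6. Evaluate local expansions and add the near-field direct sum (P2P).
    phi = {}
    for box in tree.leaves():
        for t in box.trg_ids:
            phi[t] = l2p(box.L, box.center, targets[t])
            for nbr in list(box.neighbors()) + [box]:
                phi[t] += p2p(targets[t], sources[nbr.src_ids], charges[nbr.src_ids])
    return phi
```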

10 Issues with distributed algorithms Halo regions Incomplete translations

11 Solutions
Distribute the entire computational domain based on the global workload balance
Classify the spatial boxes into 5 categories
-Need to calculate box types efficiently
-No interruptions of kernel evaluations
A single communication to exchange data of the halo regions (a sketch follows below)
-Master-slave model
-Small overheads
All other computations are performed independently
-Fully distributed
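A minimal sketch of such a single master-slave halo exchange using mpi4py (an assumption for illustration, not the authors' code; the export contents and the routing rule here are toy placeholders):

```python
# Single gather+scatter halo exchange: each node sends only its export
# boxes, the master routes them, and each node receives only its imports.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Placeholder export data: in the real algorithm this would be the box data
# classified as "export" on the GPU (boxes other nodes need for their halos).
export_data = {("box", rank, k): f"coeffs from rank {rank}" for k in range(2)}

# One gather + one scatter: the only internode communication step.
all_exports = comm.gather(export_data, root=0)

if rank == 0:
    # Toy routing rule for illustration: every node imports the exports of
    # rank+1 (mod size).  The real routing is given by the halo data structures.
    per_node_imports = [all_exports[(r + 1) % size] for r in range(size)]
else:
    per_node_imports = None

import_data = comm.scatter(per_node_imports, root=0)
print(f"rank {rank} imported {len(import_data)} halo boxes")
# All remaining translations and local sums proceed independently per node.
```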

12 Heterogeneous architecture

13 Mapping the FMM onto the CPU/GPU architecture
The GPU is a highly parallel, multithreaded, many-core processor
-Good for repetitive operations on multiple data (SIMD)
CPUs are good for complex tasks with
-Complicated data structures, such as the FMM M|L translation stencils, with complicated patterns of memory access
CPU-GPU communication is expensive
Profile the FMM and determine which parts of it go where
[Figure: CPU with a few cores, large cache and control logic, and DRAM vs. GPU with hundreds of cores and DRAM]

14 FMM on the GPU
Look at the implementation of Gumerov & Duraiswami (2008)
-M|L translation cost: 29%; GPU speedup 1.6x
-Local direct sum: 66.1%; GPU speedup 90.1x
The profiling data suggest
-Perform the translations on the CPU: multicore parallelization and the large cache provide comparable or better performance
-The GPU computes the local direct sum (P2P) and particle-related work: SIMD
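The natural consequence of this split is to overlap the two: the CPU runs translations while the GPU runs P2P. A runnable toy sketch of that overlap (the two work functions are timed stand-ins, not the authors' kernels):

```python
# Overlap of CPU translations with the GPU local direct sum, as suggested
# by the profiling numbers above.  Both work functions are stand-ins.
import time
from concurrent.futures import ThreadPoolExecutor

def gpu_local_direct_sum():
    # Stand-in for launching the P2P kernel on the GPU (SIMD-friendly work).
    time.sleep(0.5)
    return "P2P done on GPU"

def cpu_translations():
    # Stand-in for multicore M|M, M|L, L|L translations on the CPU, which
    # involve complicated stencils and irregular memory access.
    time.sleep(0.4)
    return "translations done on CPU"

with ThreadPoolExecutor(max_workers=1) as pool:
    gpu_future = pool.submit(gpu_local_direct_sum)  # GPU work in the background
    cpu_result = cpu_translations()                 # CPU works while GPU is busy
    gpu_result = gpu_future.result()

print(cpu_result, "|", gpu_result)
```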

15 Workload distribution A forest of trees Workload balance

16 The distributed FMM algorithm on a single node

17 The distributed algorithm on multiple nodes

18 A 2D example

19 Box types
Five types: root, import, export, domestic, other
Each node computes its box types on the GPU along with the other FMM data structures
Very small overhead
A 2D example (a toy classification sketch follows below)
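To make the five categories concrete, here is a runnable toy classification over a 1D row of boxes (an assumed simplification: the real algorithm classifies octree boxes on the GPU using the FMM neighbor and interaction lists):

```python
# Toy illustration of the five box types over a 1D row of boxes.
from enum import Enum

class BoxType(Enum):
    ROOT = 0       # the root of the tree, shared by all nodes
    DOMESTIC = 1   # owned; all interacting boxes are also owned locally
    EXPORT = 2     # owned; some other node needs its data (their halo)
    IMPORT = 3     # not owned; this node needs its data (our halo)
    OTHER = 4      # neither owned nor needed by this node

def classify(num_boxes, owner, my_rank, neighborhood=1):
    """Classify each box for node `my_rank`.

    owner(i) -> rank owning box i; boxes interact with neighbors within
    `neighborhood` positions (stand-in for the FMM neighbor lists).
    """
    types = {}
    for i in range(num_boxes):
        nbrs = [j for j in range(i - neighborhood, i + neighborhood + 1)
                if 0 <= j < num_boxes and j != i]
        if owner(i) == my_rank:
            types[i] = (BoxType.EXPORT
                        if any(owner(j) != my_rank for j in nbrs)
                        else BoxType.DOMESTIC)
        else:
            types[i] = (BoxType.IMPORT
                        if any(owner(j) == my_rank for j in nbrs)
                        else BoxType.OTHER)
    return types

# 16 boxes split contiguously over 4 nodes; classify from node 1's viewpoint.
types = classify(16, owner=lambda i: i // 4, my_rank=1)
print({i: t.name for i, t in types.items()})
```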

20 Communication overhead
Depends on the input data distribution
It is proportional to
-the number of computing nodes P
-the number of boxes in the boundary (halo) regions
Assume a uniform distribution
-Cost:
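As a rough illustration only (an assumption based on a standard surface-to-volume argument, not the slide's cost expression), for a uniform distribution:

```latex
% Illustrative estimate, not the slide's formula: with octree depth l_max
% and P nodes, each node owns about B = 8^{l_max}/P finest-level boxes, and
% its halo is roughly the surface layer of that block.
B = \frac{8^{\,l_{\max}}}{P}, \qquad
\#\{\text{halo boxes per node}\} = O\!\big(B^{2/3}\big), \qquad
\text{data exchanged per node} = O\!\big(p^{2}\, B^{2/3}\big)\ \text{coefficients}.
```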

21 Comments on implementations
Other FMM data structures (interaction lists) and heterogeneous implementations
-Use the heterogeneous algorithms of Hu, et al. (2011)
All data structures passed to the kernel evaluation engine are compact, i.e. no structures for empty boxes
The data manager resides on the master node and is implemented on the CPU
-Its workload is small
-No noticeable gain from a GPU implementation

22 Weak scalability
Compare with Hu et al. (2011)
Reduce both the overheads and the kernel evaluations
Fix 8M particles per node and run tests on 1 ~ 16 nodes
The depth of the octree determines the overhead
The particle density determines the parallel-region timing

23 Weak scalability

24 Weak scalability

25 Strong scalability
Fix the problem size at 8M particles
Run tests on 1 ~ 16 nodes
The direct sum dominates the computation cost
-Unless the GPU is fully occupied, the algorithm does not achieve strong scalability
-Can choose the number of GPUs per node according to the problem size
Compare with Hu et al. (2011)

26 Strong scalability

27 Strong scalability

28 The billion-size test case
Using all 32 Chimera nodes and 64 GPUs
2^30 ≈ 1.07 billion particles: potential computation in 12.2 s vs 21.6 s in Hu et al. (2011)
-32M particles per node
Each node:
-Dual quad-core Intel Nehalem processors
-24 GB of RAM
-Two Tesla C1060 GPUs

29 Performance count

| | HPCC 2012 | SC'11 | SC'10 |
|---|---|---|---|
| Paper | Hu et al. 2012 | Hu et al. 2011 | Hamada and Nitadori, 2010 |
| Algorithm | FMM | FMM | Tree code |
| Problem size | 1,073,741,824 | 1,073,741,824 | 3,278,982,596 |
| Flops count | 39.7 TFlops on 64 GPUs, 32 nodes | 38 TFlops on 64 GPUs, 32 nodes | 190 TFlops on 576 GPUs, 144 nodes |
| GPU | Tesla C1060, 240 cores | Tesla C1060, 240 cores | GTX, 2 x 240 cores |
| Per GPU | 620 GFlops/GPU | 593 GFlops/GPU | 329 GFlops/GPU |

30 Conclusion
Fast communication management data structures
-Handle non-uniform distributions
-Parallel granularity: spatial boxes
-Small overheads
Much improved scalability and performance
The capability of computing million- or billion-size problems on a single workstation or a mid-size cluster
The developed code will be used in solvers for many large-scale problems in aeromechanics, astrophysics, molecular dynamics, etc.

31 Questions? Acknowledgments