1 Scalable Fast Multipole Accelerated Vortex Methods
Qi Hu (a), Nail A. Gumerov (a), Rio Yokota (b), Lorena Barba (c), Ramani Duraiswami (a)
(a) Department of Computer Science, University of Maryland, College Park
(b) King Abdullah University of Science and Technology
(c) Mechanical and Aerospace Engineering, George Washington University
1 of 23

2 Task 3.5: Computational Considerations in Brownout Simulations
Motivation: N-body Problems in Brownout
- Complicated phenomena involving interactions between the rotorcraft wake, the ground, and dust particles
- Causes accidents due to poor visibility and damage to helicopters
- Understanding can lead to mitigation strategies
- Requires fast evaluation of the fields at particle locations, i.e., fast evaluation of all pairwise N-body interactions
2 of 23
Image courtesy of Professor Leishman

3 Motivation: Equations to Solve
3 of 23
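The equations on this slide are images and did not survive transcription. For a vortex element method, the fields in question are presumably the Biot-Savart velocity induced by the N vortex elements and the vortex stretching term; a standard statement (an assumption, not recovered from the slide) is

$$\mathbf{u}(\mathbf{x}_j) = \frac{1}{4\pi} \sum_{i=1}^{N} \frac{\boldsymbol{\gamma}_i \times (\mathbf{x}_j - \mathbf{x}_i)}{|\mathbf{x}_j - \mathbf{x}_i|^3}, \qquad \frac{d\boldsymbol{\gamma}_j}{dt} = (\boldsymbol{\gamma}_j \cdot \nabla)\,\mathbf{u}(\mathbf{x}_j),$$

where $\mathbf{x}_i$ are the vortex element positions and $\boldsymbol{\gamma}_i$ their vector strengths.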

4 Task 3.5: Computational Considerations in Brownout Simulations
Motivation
Many other applications require fast evaluation of pairwise interactions with the 3D Laplacian kernel and its derivatives:
- Astrophysics (gravity potential and forces) [image: wissrech.ins.uni-bonn.de]
- Molecular dynamics (Coulomb potential and forces)
- Micro- and nanofluidics (complex channel Stokes flows)
- Imaging and graphics (high-quality RBF interpolation)
- Much more!
4 of 23

5 Task 3.5: Computational Considerations in Brownout Simulations
Previous Work
FMM on distributed systems:
- Greengard & Gropp 1990 discussed parallelizing the FMM
- Ying et al. 2003: a parallel version of the kernel-independent FMM
FMM on GPUs:
- Gumerov & Duraiswami 2008 explored the FMM algorithm on the GPU
- Yokota et al. 2009, 2012 presented the FMM on GPU clusters
- Hu et al. 2011, 2012 developed a heterogeneous FMM algorithm
Other impressive results exploit architecture tuning on networks of multi-core processors or GPUs:
- Hamada et al. 2009, 2010: the Gordon Bell Prize at SC'09
- Lashuk et al. presented a kernel-independent adaptive FMM on heterogeneous architectures
- Chandramowlishwaran et al. 2010: optimizations for multi-core clusters
- Cruz et al. 2010: the PetFMM library
5 of 23

6 Previous Work
Treecode-based vortex methods:
- Modeling rotor wakes with a hybrid OVERFLOW-vortex method on a GPU cluster (Stock & Stone 2010)
- Isotropic turbulence simulation (Hamada et al. 2009)
FMM-based vortex methods:
- The soundness of the FMM-based VEM on GPUs was validated by the Yokota & Barba group in a series of publications
6 of 23

7 Task 3.5: Computational Considerations in Brownout Simulations
Contributions
- Highly optimized heterogeneous FMM algorithms for dynamic problems
- Lamb-Helmholtz decomposition
- Gaussian blob approximation
7 of 23

8 Task 3.5: Computational Considerations in Brownout Simulations
N-body Problem
- Source points: $x_i \in \mathbb{R}^3$ with strengths $q_i$, $i = 1, \dots, N$
- Receiver points: $y_j \in \mathbb{R}^3$, $j = 1, \dots, M$
- The matrix-vector product: $\phi(y_j) = \sum_{i=1}^{N} q_i \, \Phi(y_j, x_i)$, $j = 1, \dots, M$
- Complexity of the direct method: $O(NM)$ (a sketch follows below)
8 of 23
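As a point of reference, here is a minimal sketch of the O(NM) direct method for the Laplace kernel $\Phi(y, x) = 1/(4\pi|y - x|)$; it is an illustration under stated assumptions, not the authors' code, and all names in it are hypothetical.

```cuda
// Direct O(N*M) evaluation of phi(y_j) = sum_i q_i / (4*pi*|y_j - x_i|):
// one thread per receiver, each looping over all N sources.
__global__ void directSum(int N, int M,
                          const float4* src,   // xyz = source position, w = strength q_i
                          const float3* recv,  // receiver positions y_j
                          float* phi)          // output potentials
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= M) return;
    const float inv4pi = 0.0795774715f;  // 1/(4*pi)
    float3 y = recv[j];
    float sum = 0.0f;
    for (int i = 0; i < N; ++i) {
        float dx = y.x - src[i].x;
        float dy = y.y - src[i].y;
        float dz = y.z - src[i].z;
        float r2 = dx*dx + dy*dy + dz*dz;
        if (r2 > 0.0f) sum += src[i].w * rsqrtf(r2);  // skip self-interaction
    }
    phi[j] = inv4pi * sum;
}
```

The cost is quadratic when M ~ N, which is what makes the FMM of the next slide necessary at brownout problem sizes.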

9 Task 3.5: Computational Considerations in Brownout Simulations
Introduction to the FMM
- FMM introduced by Greengard and Rokhlin (1987)
- Achieves the dense N×M matrix-vector multiplication for special kernels in O(N+M) time and memory
- Based on the idea that the far field of a group of singularities (vortices) can be represented compactly via multipole expansions
9 of 23
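For the 3D Laplace kernel, the compact far-field representation referred to here is the truncated multipole expansion about a source-box center (a standard identity, written up to spherical-harmonic normalization conventions, not transcribed from the slide):

$$\sum_{i} \frac{q_i}{|y - x_i|} \approx \sum_{n=0}^{p-1} \sum_{m=-n}^{n} M_n^m \, \frac{Y_n^m(\theta, \varphi)}{r^{n+1}}, \qquad M_n^m = \sum_i q_i \, r_i^{\,n} \, \bar{Y}_n^m(\theta_i, \varphi_i),$$

where $(r, \theta, \varphi)$ are the receiver's spherical coordinates about the box center and $p$ is the truncation number controlling accuracy.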

10 Task 3.5: Computational Considerations in Brownout Simulations
Factorization and Well-Separated Pairs
- Factorization of the far-field interaction (equation shown as an image on the slide)
- Well-separated pairs via hierarchical data structures
10 of 23
Image courtesy of Professor Mount
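Schematically, such a factorization separates sources from receivers, so the far-field part of the matrix-vector product reduces to coefficients computed once per source box (a generic sketch; $a_l$, $b_l$, and the center $x_*$ are generic placeholders, not symbols from the slide):

$$\Phi(y_j, x_i) \approx \sum_{l=1}^{p^2} a_l(x_i - x_*) \, b_l(y_j - x_*) \;\Rightarrow\; \sum_i q_i \, \Phi(y_j, x_i) \approx \sum_{l=1}^{p^2} b_l(y_j - x_*) \underbrace{\sum_i q_i \, a_l(x_i - x_*)}_{\text{coefficients, once per box}},$$

which is valid only when $y_j$ is well separated from the source box; guaranteeing that separation is exactly what the hierarchical data structure is for.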

11 Task 3.5: Computational Considerations in Brownout Simulations
FMM Flow Chart
1. Build data structures
2. Initial expansions
3. Upward translations
4. Downward translations
5. Local expansions
6. Near-field direct sum and final summation
From a Java animation of the FMM by Y. Wang, M.S. Thesis, UMD
11 of 23
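A host-side sketch of how these six stages compose into one evaluation pass; the types and function names are hypothetical placeholders (stubs so the sketch compiles), not the authors' API.

```cuda
#include <vector>

// Minimal stand-in types; a real FMM would hold expansions, box and
// neighbor lists, and GPU buffers here.
struct Particles { std::vector<float> x, y, z, q; };
struct Octree    { int maxLevel; };

// Hypothetical stage functions (empty stubs for illustration).
Octree buildOctree(const Particles&, const Particles&) { return {3}; }
void computeMultipoleExpansions(Octree&, const Particles&) {}          // P2M
void upwardPass(Octree&) {}                                            // M|M, leaves to root
void downwardPass(Octree&) {}                                          // M|L, then L|L, root to leaves
void evaluateLocalExpansions(const Octree&, const Particles&, std::vector<float>&) {}  // L2P
void nearFieldDirectSum(const Octree&, const Particles&, const Particles&, std::vector<float>&) {}

// One evaluation pass composed of the six stages listed on the slide.
void fmmEvaluate(const Particles& src, const Particles& recv, std::vector<float>& phi) {
    Octree tree = buildOctree(src, recv);          // 1. data structures
    computeMultipoleExpansions(tree, src);         // 2. initial expansions
    upwardPass(tree);                              // 3. upward translations
    downwardPass(tree);                            // 4. downward translations
    evaluateLocalExpansions(tree, recv, phi);      // 5. local expansions
    nearFieldDirectSum(tree, src, recv, phi);      // 6. near-field direct sum + final summation
}
```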

12 Task 3.5: Computational Considerations in Brownout Simulations Heterogeneous Architecture 12 of 23

13 Task 3.5: Computational Considerations in Brownout Simulations
Single Node Heterogeneous Algorithm
(flow chart, flattened in the transcript; the recoverable structure:)
- Input per time step: particle positions, source strengths
- GPU work: data structures (octree and neighbors), source M-expansions, local direct sum, receiver L-expansions, final sum of far-field and near-field interactions
- CPU work: translation stencils, upward M|M, downward M|L and L|L translations
- ODE solver: source/receiver update, closing the time loop (see the overlap sketch below)
13 of 23
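A sketch of how the CPU translation pass can run concurrently with the GPU expansion and direct-sum passes within one time step; the function names are hypothetical stubs, and the OpenMP-sections split is one plausible realization of the overlap, not necessarily the authors' implementation.

```cuda
#include <omp.h>

// Hypothetical per-step halves of the work split (empty stubs here):
void gpuExpansionsAndDirectSum() {}  // P2M, P2P, L2P launched on the GPU
void cpuTranslations() {}            // M|M, M|L, L|L on the CPU cores

// One FMM time step: the GPU-feeding thread and the CPU translation
// thread run concurrently, so the CPU is never idle while the GPU computes.
void heterogeneousStep() {
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        { gpuExpansionsAndDirectSum(); }  // async kernel launches + final sync
        #pragma omp section
        { cpuTranslations(); }            // can itself use the remaining cores
    }
    // Both halves are done here; far- and near-field results are combined.
}
```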

14 Task 3.5: Computational Considerations in Brownout Simulations
Advantages
- CPU and GPU are each tasked with the jobs they execute most efficiently:
  - Faster translations on the CPU: the translations require complex stencil data structures, for which CPU code can be better optimized
  - Faster local direct sum on the GPU: many cores evaluating the same kernel on multiple data (SIMD)
- The CPU is not idle during the GPU-based computations
- High-accuracy translation without much cost penalty on the CPU
- Easy visualization: all particle data reside in GPU RAM
14 of 23

15 Task 3.5: Computational Considerations in Brownout Simulations
Issues with Distributed Algorithms
- Halo regions
- Incomplete translations
15 of 23

16 Task 3.5: Computational Considerations in Brownout Simulations
Box Types to Manage Halo Regions
- Five types: root, import, export, domestic, other
- Each node computes its box types on the GPU along with the other FMM data structures
- Very small overhead
- (A 2D example is shown on the slide; see the sketch below)
16 of 23
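A sketch of what such a classification might look like in code; the five type names come from the slide, but the ownership test, flags, and function are illustrative assumptions, not the authors' definitions.

```cuda
// Hypothetical classification of octree boxes on each node for halo
// management, using the five types named on the slide.
enum class BoxType {
    Root,      // top levels shared by all nodes
    Domestic,  // owned by this node; no other node needs it
    Export,    // owned by this node; lies in another node's halo region
    Import,    // owned by another node; lies in this node's halo region
    Other      // not involved in this node's computation
};

// Sketch: classify one box from precomputed ownership/halo flags.
BoxType classify(bool ownedHere, bool inRemoteHalo, bool inLocalHalo, bool isRootRegion) {
    if (isRootRegion) return BoxType::Root;
    if (ownedHere)    return inRemoteHalo ? BoxType::Export : BoxType::Domestic;
    return inLocalHalo ? BoxType::Import : BoxType::Other;
}
```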

17 Task 3.5: Computational Considerations in Brownout Simulations The Fully Distributed Heterogeneous Algorithm 17 of 23

18 Task 3.5: Computational Considerations in Brownout Simulations
Lamb-Helmholtz Decomposition
- Mathematical reformulation (Gumerov & Duraiswami 2013)
- The velocity field can be described by only two harmonic scalar potentials
- Perform the FMM translations for the velocity + stretching calculation at the cost of two Laplace potential kernels instead of six
18 of 23
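In the form given by Gumerov & Duraiswami (2013), with the notation paraphrased here, the incompressible velocity field is written as

$$\mathbf{v}(\mathbf{r}) = \nabla \varphi + \nabla \times (\chi \, \mathbf{r}), \qquad \nabla^2 \varphi = \nabla^2 \chi = 0,$$

so all FMM translations act on the two harmonic scalars $\varphi$ and $\chi$ rather than on six scalar expansions for the velocity and stretching components.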

19 Task 3.5: Computational Considerations in Brownout Simulations
FMM for Vortex Element Methods
- Gaussian smoothing kernels evaluated on the GPU: fast approximations with guaranteed accuracy (a kernel sketch follows below)
- Combines algorithmic speedup, fast data-structure construction, and highly optimized CUDA, OpenMP, and MPI implementations
- (Slide images compare results with no smoothing vs. with smoothing)
19 of 23
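One common Gaussian-blob regularization of the Biot-Savart kernel, to illustrate the idea; the cutoff function below is the standard Gaussian-smoothing choice from the vortex-method literature and is assumed here, not extracted from the talk.

```cuda
#include <cmath>

// Gaussian cutoff g(rho), rho = r/sigma, for core radius sigma:
//   g(rho) = erf(rho/sqrt(2)) - sqrt(2/pi) * rho * exp(-rho^2/2).
// g -> 1 far from the blob (recovering the singular kernel) and
// g -> 0 at its center (removing the singularity).
__host__ __device__ inline float gaussianCutoff(float rho) {
    const float sqrt2byPi = 0.79788456f;  // sqrt(2/pi)
    return erff(rho * 0.70710678f) - sqrt2byPi * rho * expf(-0.5f * rho * rho);
}

// Smoothed velocity induced at y by one vortex element at x with vector
// strength gamma: u = g(r/sigma) * (gamma x (y - x)) / (4*pi*r^3).
__host__ __device__ inline float3 blobVelocity(float3 y, float3 x, float3 gamma, float sigma) {
    float dx = y.x - x.x, dy = y.y - x.y, dz = y.z - x.z;
    float r2 = dx*dx + dy*dy + dz*dz;
    if (r2 == 0.0f) return make_float3(0.f, 0.f, 0.f);
    float r = sqrtf(r2);
    float w = gaussianCutoff(r / sigma) / (4.0f * 3.14159265f * r2 * r);
    return make_float3(w * (gamma.y*dz - gamma.z*dy),
                       w * (gamma.z*dx - gamma.x*dz),
                       w * (gamma.x*dy - gamma.y*dx));
}
```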

20 Task 3.5: Computational Considerations in Brownout Simulations Strong Scalability 20 of 23

21 Task 3.5: Computational Considerations in Brownout Simulations
The Billion-Size Test Case
- Truncation number p =
- ~1.07 billion particles
- 32 computing nodes; each node: dual quad-core Intel Nehalem CPUs, 24 GB of RAM, two Tesla C1060 GPUs
- Stretching + velocity: 55.9 s, Tflop/s with 64 C1060 GPUs (933 Gflop/s peak per GPU, as reported by NVIDIA); equivalent to 82% efficiency
- Velocity only: 39.9 s
21 of 23
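The sustained-throughput figure appears to have been lost in transcription, but it can presumably be reconstructed from the quoted numbers (an inference, not a number from the slide): 64 GPUs at a 933 Gflop/s peak each give

$$64 \times 0.933\ \text{Tflop/s} \approx 59.7\ \text{Tflop/s (aggregate peak)}, \qquad 0.82 \times 59.7\ \text{Tflop/s} \approx 49\ \text{Tflop/s},$$

i.e., roughly 49 Tflop/s sustained at the stated 82% efficiency.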

22 Task 3.5: Computational Considerations in Brownout Simulations
Conclusions
- Developed scalable vortex methods using the FMM:
  - Algorithmic acceleration of the translations via the Lamb-Helmholtz decomposition
  - Incorporated our distributed heterogeneous FMM algorithm
  - Analyzed the trade-off between accuracy and speed of the expensive vortex core evaluations
  - Validated the acceptable accuracy of the FMM-based vortex method
- The FMM algorithms developed can be used in solvers for many large-scale problems in aeromechanics, astrophysics, molecular dynamics, etc.
22 of 23

23 Task 3.5: Computational Considerations in Brownout Simulations
Future Work
- Fast data structures without using histograms
- Distributed communication management
- Automatic GPU/CPU work-balance algorithms
- Partition methods and dynamic data-structure updates
23 of 23