1 Toward Improved Aeromechanics Simulations Using Recent Advancements in Scientific Computing
Qi Hu, Nail A. Gumerov, Ramani Duraiswami, Institute for Advanced Computer Studies and Department of Computer Science
Monica Syal, J. Gordon Leishman, Alfred Gessow Rotorcraft Center and Department of Aerospace Engineering
University of Maryland, College Park, MD
Sponsored by AFOSR, Flow Interactions & Control Program; Contract Monitor: Douglas Smith
Presented at the 67th Annual Forum of the American Helicopter Society, Virginia Beach, VA, 3–5 May 2011

2 Task 3.5: Computational Considerations in Brownout Simulations 100x+ “faster” is “fundamentally different” David B. Kirk, Chief Scientist, NVIDIA

3 Task 3.5: Computational Considerations in Brownout Simulations
Outline
Motivation
−Vortex element method
−Particle motion simulations
Brute force algorithm accelerations
−Graphics processing units (GPU)
−Performance
Algorithmic accelerations
−Fast multipole methods (FMM)
Fast algorithms on GPUs
−FMM on GPU
−Fast data structures
−Performance and error analysis
Conclusion

4 Task 3.5: Computational Considerations in Brownout Simulations Motivation

5 Motivation – Aeromechanical Simulations
High-fidelity comprehensive analysis is required for aeromechanics:
−Aeroacoustics
−Aeroelasticity
−Vibrations
−Complex turbulent flows
−Many more
In particular, we are interested in rotorcraft brownout simulations, which include:
−Flow simulations using the free-vortex method
−Dust cloud dynamics in vortical flows via Lagrangian methods
These simulations are very time consuming, and we are looking for accelerations using high-performance computing and algorithmic advances

6 Motivation – Problem of Brownout
Brownout is a safety-of-flight issue and the cause of many mishaps:
−Loss of ground visibility for the pilot, as well as vection illusions
Modeling the dust cloud helps understand the scope of the problem and possible means of mitigation:
−By rotor design
−By flight-path management
Video courtesy OADS

7 Challenges in Dust Cloud Modeling
−The flow field is complicated, and many vortex elements are needed to model the flow correctly
−The physics of two-phase particulate flows is complex, and different mechanisms of particle–flow interaction can be important
−A large number of particles is needed for Lagrangian methods
−Many time steps are needed to provide reliable computations

8 Free-Vortex Method
[Figure: real flow and its image flow below the ground plane; equations for the induced velocity field, the smoothing kernel ("viscous core"), and the vortex center dynamics]
N² interactions (all to all)
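The equations on this slide were lost in transcription; for reference, a standard smoothed (regularized) Biot–Savart form of the free-vortex method that matches the labels above is sketched below. The specific smoothing function q_ε and the constants are assumptions, not necessarily the exact expressions used in the talk.

```latex
% Velocity induced at x by N vortex elements with (vector) circulations \Gamma_j,
% regularized by a "viscous core" smoothing kernel q_\varepsilon (assumed form):
\mathbf{u}(\mathbf{x}) \;=\; -\frac{1}{4\pi}\sum_{j=1}^{N}
   q_\varepsilon\!\big(|\mathbf{x}-\mathbf{x}_j|\big)\,
   \frac{(\mathbf{x}-\mathbf{x}_j)\times\boldsymbol{\Gamma}_j}{|\mathbf{x}-\mathbf{x}_j|^{3}}
% Vortex centers are convected with the local velocity (all-to-all, O(N^2) per step):
\qquad
\frac{d\mathbf{x}_i}{dt} \;=\; \mathbf{u}(\mathbf{x}_i).
```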

9 Particle Dynamics
[Figure: equations for the force on a particle, the particle position, the fluid velocity field, and the particle velocity]
N vortex elements act on M particles: total number of interactions is NM
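Here too the equations did not survive; a minimal sketch of a typical Lagrangian particle model consistent with the labels (a drag force driven by the slip between the fluid velocity u and the particle velocity v) is given below. The drag law f_d and the gravity term are assumptions.

```latex
% Particle position x_p and velocity v_p; fluid velocity u(x_p,t) from the vortex elements.
\frac{d\mathbf{x}_p}{dt} = \mathbf{v}_p,
\qquad
m_p\,\frac{d\mathbf{v}_p}{dt} = \mathbf{F}_p
  \;=\; f_d\big(\mathbf{u}(\mathbf{x}_p,t)-\mathbf{v}_p\big) \;+\; m_p\,\mathbf{g}.
% Evaluating u at M particle positions due to N vortex elements costs O(NM) per time step.
```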

10 Task 3.5: Computational Considerations in Brownout Simulations
Technical Barriers and Solutions
Computation is expensive for real simulations:
−Millions of particles and vortex elements involved, with O(N² + NM) cost per time step
−Many time steps for realistic simulations
Ways to achieve efficiency:
A. Acceleration of brute-force computations
−Multiple CPU cores
−Distributed CPU clusters
−Graphics processors (GPUs)
−Heterogeneous CPU/GPU architectures
B. Algorithmic acceleration
−Fast multipole methods (FMM)
C. Use both

11 Task 3.5: Computational Considerations in Brownout Simulations Brute Force Acceleration

12 A Quick Introduction to the GPU
The graphics processing unit (GPU) is a highly parallel, multithreaded, many-core processor with high computational power and memory bandwidth
The GPU is designed for single-instruction, multiple-data (SIMD) computation; more transistors are devoted to processing rather than to data caching and flow control
NVIDIA Tesla C2050: 1.25 Tflops single precision, 0.52 Tflops double precision, 448 cores
[Figure: CPU vs. GPU die layouts (DRAM, cache, control, ALUs); a few cores vs. hundreds of cores]

13 Is It Expensive?
−Any PC has a GPU, which probably performs faster than the CPU
−GPUs with teraflops performance are used in game consoles
−Tens of millions of GPUs are produced each year
−Price for one good GPU is in the range $
−Prices for the most advanced NVIDIA GPUs for general-purpose computing (e.g., Tesla C2050) are in the range $1K–$2K
−A modern research supercomputer with several GPUs can be purchased for a few thousand dollars
−GPUs provide the best Gflops/$ ratio; they also provide the best Gflops/watt

14 Floating-Point Operations for CPU and GPU

15 Task 3.5: Computational Considerations in Brownout Simulations
Is It Easy to Program a GPU?
For inexperienced GPU programmers:
−Matlab Parallel Computing Toolbox
For FORTRAN programmers: FLAGON
−Middleware to program the GPU from FORTRAN
−Relatively easy to incorporate into existing codes
−Developed by the authors at UMD
−Free (available online)
For advanced users:
−CUDA: a C-like programming language
−Math libraries are available
−Custom functions can be implemented
−Requires careful memory management (a minimal example is sketched below)
−Free (available online)
[Figure: memory hierarchy: local memory ~50 kB, GPU global memory ~1–4 GB, host memory ~4–128 GB]
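To make the memory-management point concrete, here is a minimal, self-contained CUDA sketch of the typical host memory → GPU global memory → kernel → host round trip. It is a generic illustration, not code from the brownout solver or from FLAGON; the array size and the kernel itself are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: each thread scales one element resident in GPU global memory.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;                      // ~1M floats (4 MB)
    size_t bytes = n * sizeof(float);

    float* h_x = (float*)malloc(bytes);         // host memory (GBs available)
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float* d_x = nullptr;
    cudaMalloc(&d_x, bytes);                    // GPU global memory (~1-4 GB on a C2050-era card)
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // host -> device transfer

    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);          // launch: 256 threads per block
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);    // device -> host transfer

    printf("h_x[0] = %f\n", h_x[0]);            // expect 2.0
    cudaFree(d_x);
    free(h_x);
    return 0;
}
```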

16 Task 3.5: Computational Considerations in Brownout Simulations
University of Maryland
UMD is one of the NVIDIA world excellence centers for GPU programming:
−Courses on GPU programming
−PCs equipped with GPUs
−CPU/GPU heterogeneous cluster at the Institute for Advanced Computer Studies (UMIACS)

17 Task 3.5: Computational Considerations in Brownout Simulations
Acceleration via GPUs
Existing brute-force brownout simulations:
−At least 20 times speedup for double precision
−At least 250 times speedup for single precision
−Total time for a landing simulation: CPU (8 cores): 45.1 hours; GPU: 4.1 hours

18 Task 3.5: Computational Considerations in Brownout Simulations
Direct Parallelism for Simulations
Wake-induced velocities:
−Computation is expensive (quadratic)
−Easy to parallelize the brute-force calculations (a kernel sketch follows this slide)
−CUDA code is incorporated into the current FORTRAN codes via FLAGON
For a small number of particles, the GPU implementation is not efficient because of the computational overheads involved
For a large number of particles, single precision is about 10 times faster than double precision
[Chart: acceleration (×) vs. problem size, single precision and double precision]
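To illustrate how naturally the quadratic sum maps onto the GPU, below is a minimal CUDA sketch of a brute-force, all-pairs evaluation of the regularized Biot–Savart velocities (one thread per evaluation point; shared-memory tiling omitted for brevity). It is a generic sketch, not the FLAGON-based production code; the core function `coreFactor` is an assumed placeholder for the actual viscous-core model.

```cuda
#include <cuda_runtime.h>

// Assumed placeholder for the viscous-core smoothing; the real model may differ.
__device__ float coreFactor(float r2, float eps2) {
    float r2s = r2 + eps2;                     // simple algebraic smoothing
    return rsqrtf(r2s * r2s * r2s);            // ~ 1 / (r^2 + eps^2)^(3/2)
}

// One thread computes the velocity induced at pos[i] by all n vortex elements
// located at vpos[] with vector strengths gamma[].  O(n*m) total work.
__global__ void biotSavartDirect(const float3* __restrict__ pos,   int m,
                                 const float3* __restrict__ vpos,
                                 const float3* __restrict__ gamma, int n,
                                 float eps2, float3* __restrict__ vel) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;

    float3 xi = pos[i];
    float3 u = make_float3(0.f, 0.f, 0.f);
    const float c = -1.f / (4.f * 3.14159265f);

    for (int j = 0; j < n; ++j) {              // all-to-all loop over vortex elements
        float3 d = make_float3(xi.x - vpos[j].x, xi.y - vpos[j].y, xi.z - vpos[j].z);
        float r2 = d.x * d.x + d.y * d.y + d.z * d.z;
        float w = c * coreFactor(r2, eps2);
        float3 g = gamma[j];
        // (x - x_j) x Gamma_j, scaled by the smoothed 1/r^3 factor
        u.x += w * (d.y * g.z - d.z * g.y);
        u.y += w * (d.z * g.x - d.x * g.z);
        u.z += w * (d.x * g.y - d.y * g.x);
    }
    vel[i] = u;
}

// Typical launch:
// biotSavartDirect<<<(m + 255)/256, 256>>>(d_pos, m, d_vpos, d_gamma, n, eps*eps, d_vel);
```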

19 Task 3.5: Computational Considerations in Brownout Simulations Algorithmic Acceleration

20 Task 3.5: Computational Considerations in Brownout Simulations
Fast Multipole Method
−FMM introduced by Rokhlin and Greengard (1987); hundreds of publications since then
−Achieves dense N×M matrix–vector multiplication for special kernels in O(N+M) time and memory cost
−Based on the idea that the far field of a group of singularities (vortices) can be represented compactly via multipole expansions (a standard illustration follows this slide)
−Uses hierarchical data structures
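As a reminder of the underlying idea (standard textbook material, not specific to this talk), the far field of a cluster of sources admits a compact expansion; for the scalar kernel 1/r used as the FMM baseline here, one classical form is:

```latex
% For an evaluation point x outside a cluster of sources y_j with strengths q_j
% (|y_j| < a < |x|), the Legendre/multipole expansion truncated at p terms:
\sum_{j} \frac{q_j}{|\mathbf{x}-\mathbf{y}_j|}
  \;=\; \sum_{n=0}^{p-1} \frac{M_n(\hat{\mathbf{x}})}{|\mathbf{x}|^{\,n+1}}
        \;+\; O\!\Big(\big(a/|\mathbf{x}|\big)^{p}\Big),
\qquad
M_n(\hat{\mathbf{x}}) \;=\; \sum_{j} q_j\,|\mathbf{y}_j|^{\,n}\,
   P_n\!\big(\hat{\mathbf{x}}\cdot\hat{\mathbf{y}}_j\big).
% P_n are Legendre polynomials; in practice the 2n+1 spherical-harmonic coefficients
% per degree n are stored, giving p^2 coefficients per truncated expansion.
```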

21 Task 3.5: Computational Considerations in Brownout Simulations Algorithmic and Hardware Acceleration

22 Task 3.5: Computational Considerations in Brownout Simulations
FMM on GPU
Pioneering work by Gumerov and Duraiswami (2007), with many papers since:
−Showed that the peculiarities of the GPU architecture affect the FMM algorithm
−1 million N-body interactions computed in 1 second in single precision
−Bottleneck: building the FMM data structures is relatively slow and takes time exceeding the FMM run time
−Did not implement the vortex element method
Our new results:
−Fast data structures on the GPU (very important for dynamic problems)
−Vector kernels for the vortex element method
−Problem sizes on a single GPU extended to tens of millions of particles
−Double precision computations

23 Task 3.5: Computational Considerations in Brownout Simulations
Acceleration of the FMM Data Structure on GPU
[Chart: data-structure construction time vs. depth of the FMM octree (levels)]
Our new algorithm constructs the FMM data structures on the GPU for millions of particles in times of the order of 0.1 s, as opposed to the 2–10 s required on the CPU. This provides very substantial computational savings for dynamic problems, where particle positions change and the data structure must be regenerated every time step.
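The slide does not spell out how the structures are built on the GPU; the sketch below shows one common approach (assign each particle a Morton/Z-order key at the finest octree level, then sort by key with Thrust so particles of the same box become contiguous), offered as an assumed illustration rather than the authors' actual algorithm. All function names and constants here are hypothetical; the Thrust calls themselves are standard library routines.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <cstdint>

// Spread the low 10 bits of v so they occupy every third bit (for 3D interleaving).
__host__ __device__ inline uint32_t expandBits(uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// One thread per particle: compute its box index at the given octree level
// and interleave the three indices into a 30-bit Morton key.
__global__ void mortonKeys(const float3* pos, uint32_t* keys, int n,
                           float3 lo, float invSide, int level) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    uint32_t nside = 1u << level;
    uint32_t ix = min((uint32_t)((pos[i].x - lo.x) * invSide * nside), nside - 1);
    uint32_t iy = min((uint32_t)((pos[i].y - lo.y) * invSide * nside), nside - 1);
    uint32_t iz = min((uint32_t)((pos[i].z - lo.z) * invSide * nside), nside - 1);
    keys[i] = (expandBits(ix) << 2) | (expandBits(iy) << 1) | expandBits(iz);
}

// After the sort, particles sharing a key (box) are contiguous; box ranges can be
// extracted with a parallel scan/search, keeping the whole rebuild on the GPU.
void buildKeysAndSort(thrust::device_vector<float3>& pos,
                      thrust::device_vector<uint32_t>& keys,
                      thrust::device_vector<int>& perm,
                      float3 lo, float side, int level) {
    int n = (int)pos.size();
    thrust::sequence(perm.begin(), perm.end());            // permutation 0..n-1
    int threads = 256, blocks = (n + threads - 1) / threads;
    mortonKeys<<<blocks, threads>>>(thrust::raw_pointer_cast(pos.data()),
                                    thrust::raw_pointer_cast(keys.data()),
                                    n, lo, 1.0f / side, level);
    thrust::sort_by_key(keys.begin(), keys.end(), perm.begin());
}
```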

24 Task 3.5: Computational Considerations in Brownout Simulations
FMM for 3D Vector Kernel (Vortex Elements)
The baseline GPU FMM in the previous implementation computes the scalar kernel (1/r)
To obtain the Biot-Savart 3D vector kernel, we need to apply the baseline FMM three times and compute the gradients; q_ε is the smoothing kernel (viscous core) with support ε
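Spelling out why three scalar FMM calls suffice (a standard identity; outside the viscous core, where q_ε ≈ 1, the regularized and singular kernels coincide):

```latex
% Write the velocity as the curl of a vector potential whose three Cartesian
% components are scalar 1/r sums:
\mathbf{u}(\mathbf{x}) = \nabla\times\boldsymbol{\psi}(\mathbf{x}),\qquad
\psi_k(\mathbf{x}) = \frac{1}{4\pi}\sum_{j}\frac{\Gamma_{j,k}}{|\mathbf{x}-\mathbf{x}_j|},\quad k=1,2,3.
% Each \psi_k (with its gradient) is one application of the baseline scalar FMM,
% and the curl is assembled from the three gradients:
u_1 = \partial_2\psi_3 - \partial_3\psi_2,\quad
u_2 = \partial_3\psi_1 - \partial_1\psi_3,\quad
u_3 = \partial_1\psi_2 - \partial_2\psi_1.
```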

25 Task 3.5: Computational Considerations in Brownout Simulations
FMM for Biot–Savart Vector Kernel
Our algorithm demonstrates that the full vector-kernel FMM computation time is less than twice the baseline FMM running time, not three times
[Chart: time (s) vs. number of vortex elements]

26 Task 3.5: Computational Considerations in Brownout Simulations
Overall Performance Test
Double-precision computation of a 10 million particle interaction takes about 16 seconds per time step, and single precision takes 7 seconds
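To put these timings in perspective, a rough back-of-the-envelope estimate (assuming on the order of 10 million vortex elements acting on 10 million particles and a few tens of flops per pairwise interaction) shows why direct summation is out of reach on a single GPU:

```latex
% Direct all-pairs cost per time step (order-of-magnitude estimate):
N M \approx 10^{7}\times 10^{7} = 10^{14}\ \text{interactions},\qquad
\sim 20\ \text{flops/interaction} \;\Rightarrow\; \sim 2\times10^{15}\ \text{flops}.
% At the Tesla C2050 double-precision peak of ~0.5 Tflops:
t_{\text{direct}} \sim \frac{2\times10^{15}}{0.5\times10^{12}\ \text{flops/s}}
  \approx 4000\ \text{s} \approx 1\ \text{hour per step},
% versus the observed ~16 s per step with the FMM-accelerated evaluation.
```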

27 Task 3.5: Computational Considerations in Brownout Simulations
Error Analysis
[Charts: error vs. number of vortex elements, single-precision and double-precision panels]
Relative error in the L₂-norm for different multipole expansion truncation numbers and problem sizes
The total number of multipoles in a single expansion is p² for truncation number p
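For completeness, the coefficient count follows from summing over degrees and orders (the usual convention for spherical-harmonic-based FMM, which may differ from the talk's by ±1 in the definition of p):

```latex
% Truncating the expansion at degree n = p-1, with 2n+1 orders per degree:
\sum_{n=0}^{p-1}(2n+1) \;=\; p^{2}\ \text{coefficients per expansion}.
```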

28 Task 3.5: Computational Considerations in Brownout Simulations
Conclusions
−The capability for improved, high-fidelity aeromechanics simulations at very large scale was demonstrated
−Vortex particle computations were accelerated on GPUs
−GPU-based FMM data structures with very small cost enable the application of the FMM to dynamic problems
−Acceptable accuracy of the FMM on the GPU was shown in both single and double precision
−The ability to run very large simulations in acceptable time has been demonstrated

29 Task 3.5: Computational Considerations in Brownout Simulations 100x+ “faster” is “fundamentally different” David B. Kirk, Chief Scientist, NVIDIA Questions?

30 Task 3.5: Computational Considerations in Brownout Simulations Backup slides

31 Task 3.5: Computational Considerations in Brownout Simulations
Two vortex rings interaction demo
−Two vortex rings moving in the same direction
−Two vortex rings collision

32 Task 3.5: Computational Considerations in Brownout Simulations
FMM testing
Run a single vortex ring movement to test the FMM; the ring is discretized into ring elements and particles

33 Task 3.5: Computational Considerations in Brownout Simulations
FMM testing
−Compute relative errors by comparing with CPU results for every time step
−Run for 500 time steps with acceptable error of 10⁻⁶

34 Task 3.5: Computational Considerations in Brownout Simulations
Extending the algorithm to clusters
−Practical simulations may require billions of particles/vortices
−Recently we developed a heterogeneous algorithm that scales well on a cluster of CPU/GPU nodes
−Our current result: one billion vortices in 30 s on a cluster of 30 nodes; expected to be significantly improved both in terms of number of particles and computation time

35 Toward Improved Aeromechanics Simulations Using Recent Advancements in Scientific Computing
Qi Hu, Nail A. Gumerov, Ramani Duraiswami, Institute for Advanced Computer Studies and Department of Computer Science
Monica Syal, J. Gordon Leishman, Alfred Gessow Rotorcraft Center and Department of Aerospace Engineering
University of Maryland, College Park, MD
Sponsored by AFOSR; Contract Monitor: Douglas Smith

36 Task 3.5: Computational Considerations in Brownout Simulations
Overall Performance Test
[Draft note on the chart: larger fonts for titles, legend and labels; x-axis title: Number of vortex elements; put time in seconds, not milliseconds]
Double-precision computation: full interaction of 10 million particles in about 16 seconds (single precision in 7 seconds)

37 Task 3.5: Computational Considerations in Brownout Simulations
Algorithmic Acceleration - FMM
[Chart; legend: 4 cores of CPU via OMP]

38 Task 3.5: Computational Considerations in Brownout Simulations
Acceleration via GPUs
Existing brute-force brownout simulations:
−At least 20 times speedup for double precision
−At least 250 times speedup for single precision
−Total time for a landing simulation: CPU (8 cores): 45.1 hours; GPU: 4.1 hours