GPU Acceleration of the Generalized Interpolation Material Point Method

Presentation transcript:

GPU Acceleration of the Generalized Interpolation Material Point Method
Wei-Fan Chiang, Michael DeLisi, Todd Hummel, Tyler Prete, Kevin Tew, Mary Hall, Phil Wallstedt, and James Guilkey
Sponsored in part by NSF awards CSR and OCI and by hardware donations from NVIDIA Corporation.

Outline
– What is the Material Point Method (MPM) and the Generalized Interpolation Material Point Method (GIMP)?
– Suitability for GPU acceleration
– Implementation challenges
  – Inverse mapping from grids to particles (global synchronization)
  – I/O in the sequential implementation
– Experimental results
– Looking to the future: programming tools and auto-tuning

Rigid, Soft Body and Fluid Simulations
– Example simulations (figures on the original slide): a tungsten particle impacting sandstone; compaction of a foam microstructure
– Breadth of applications: fluids and smoke in games, astrophysics simulation, oil exploration, and molecular dynamics
– MPM is part of the Center for the Simulation of Accidental Fires and Explosions (C-SAFE) software environment

The Material Point Method (MPM)
1. Lagrangian material points carry all state data (position, velocity, stress, etc.)
2. Overlying mesh defined
3. Particle state projected to mesh, e.g.:
4. Conservation of momentum solved on mesh, giving updated mesh velocity and (in principle) position; stress at particles computed based on the gradient of the mesh velocity
5. Particle positions/velocities updated from mesh solution
6. Discard deformed mesh; define new mesh and repeat
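The projection in step 3 is shown on the slide as an equation. A standard mass-weighted particle-to-grid projection (an assumption about its exact form, in the usual MPM notation) is

m_i = \sum_p S_{ip}\, m_p, \qquad \mathbf{v}_i = \frac{1}{m_i} \sum_p S_{ip}\, m_p\, \mathbf{v}_p,

where S_{ip} is the interpolation (shape-function) weight between grid node i and particle p, m_p and \mathbf{v}_p are particle mass and velocity, and m_i, \mathbf{v}_i are the projected nodal mass and velocity.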

Approach
– Start with a sequential library implementation of MPM and GIMP, along with descriptions of parallel OpenMP and MPI implementations
– Profiling pinpointed the key computations (updateContribList and advance, >99% of execution time)
– Two independent implementations (2-3 person teams)
– Other aspects of the mapping: the code makes heavy use of C++ templates; Gnuplot is used for visualization

Key Features of MPM and GIMP Computation
– Large amounts of data parallelism
– Particles mapped to a discretized grid
  – Compute contribution of particles to grid nodes (updateContribList)
  – Compute operations on grid nodes (advance)
– Each time step, the particles are moving: compute stresses and recompute the mapping (the overall loop structure is sketched below)
– Periodically, visualize or store results
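An illustrative sketch of that per-time-step structure, with placeholder kernels. The kernel names mirror the slides (updateContribList, advance); the signatures, data structures, and launch configuration are assumptions, not the authors' code.

__global__ void updateContribList(float4* particles, float* grid, int* contrib, int n) { /* ... */ }
__global__ void advance(float* grid, float4* particles, int* contrib, float dt, int n) { /* ... */ }

void runSimulation(float4* d_particles, float* d_grid, int* d_contrib,
                   int numParticles, int numNodes, int numSteps, float dt)
{
    int threads = 128;                                    // thread count used in the experiments
    int pBlocks = (numParticles + threads - 1) / threads; // one thread per particle
    int gBlocks = (numNodes + threads - 1) / threads;     // one thread per grid node

    for (int step = 0; step < numSteps; ++step) {
        // Particles moved last step, so rebuild their contributions to grid nodes.
        updateContribList<<<pBlocks, threads>>>(d_particles, d_grid, d_contrib, numParticles);
        // Solve on the grid nodes and update particle state from the grid solution.
        advance<<<gBlocks, threads>>>(d_grid, d_particles, d_contrib, dt, numNodes);
    }
    cudaDeviceSynchronize();
}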

Overview of Strategy for CUDA Implementation
– Partition the particle data structure and its mapping to the grid across threads
– Build an inverse map from grid nodes to particles (requires global synchronization)
– A later phase partitions the grid across threads
– The two implementations differ in their strategy for this inverse map:
  – V1: Sort grid nodes after every time step (a sort-based sketch follows below)
  – V2: Replicate the inverse map, using extra storage to avoid hotspots in memory (the focus here)
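The slides do not show V1's code; a minimal sketch of a sort-based inverse map, patterned after the public CUDA particles sample rather than the authors' implementation (array names are assumptions): each particle computes its grid-cell hash, particle indices are sorted by that hash, and particles belonging to the same node end up contiguous.

#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Sort particle indices by grid-cell hash so particles in the same cell become
// contiguous; a follow-up pass (not shown) finds each cell's start offset by
// comparing neighboring hashes.
void buildInverseMapBySort(unsigned int* d_gridHash, unsigned int* d_particleIndex,
                           int numParticles)
{
    thrust::device_ptr<unsigned int> keys(d_gridHash);
    thrust::device_ptr<unsigned int> vals(d_particleIndex);
    thrust::sort_by_key(keys, keys + numParticles, vals);
}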

Global Synchronization for Inverse Map (CUDA Particle Project)

__device__ void addParticleToCell(int3 gridPos, uint index,
                                  uint* gridCounters, uint* gridCells)
{
    // calculate grid hash
    uint gridHash = calcGridHash(gridPos);

    // increment cell counter using atomics
    int counter = atomicAdd(&gridCounters[gridHash], 1);
    counter = min(counter, params.maxParticlesPerCell - 1);

    // write particle index into this cell (uncoalesced!)
    gridCells[gridHash * params.maxParticlesPerCell + counter] = index;
}

– index refers to the index of the particle
– gridPos represents the grid cell in 3-D space
– gridCells is the data structure in global memory for the inverse mapping
– What this does: builds up gridCells as an array limited by the max # of particles per grid cell
– atomicAdd gives how many particles have already been added to this cell
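calcGridHash is referenced but not defined on the slide. A plausible definition, assuming the params struct also carries the grid dimensions (only maxParticlesPerCell actually appears above, so this is an assumption), is a simple flattening of the 3-D cell coordinate:

// Assumes params.gridSize holds the grid dimensions (an assumption).
// Flattens a 3-D cell coordinate into a 1-D index into gridCounters/gridCells.
__device__ uint calcGridHash(int3 gridPos)
{
    return (gridPos.z * params.gridSize.y + gridPos.y) * params.gridSize.x + gridPos.x;
}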

Optimized Version: Replicate gridCounters to Avoid Contention
– Baseline (left diagram on the original slide): gridCounter has one element per grid node in global memory, and all threads computing the inverse mapping issue atomicAdd operations against those shared elements
– Optimized (right diagram): the gridCounter array is replicated, so concurrent atomicAdd operations on the same grid node are spread across the copies (a sketch follows below)
– Result of this optimization: 2x speedup on updateContribList
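A minimal sketch of the replicated-counter idea, assuming the counter and cell arrays are replicated numCopies times, laid out contiguously, and that threads choose a copy round-robin by thread id; names and layout are illustrative, not the authors'.

// gridCountersRep and gridCellsRep each hold numCopies replicas, so concurrent
// atomicAdds on a popular cell are spread over numCopies memory locations.
__device__ void addParticleToCellReplicated(int3 gridPos, uint index,
                                            uint* gridCountersRep, uint* gridCellsRep,
                                            uint numCells, uint numCopies)
{
    uint gridHash = calcGridHash(gridPos);

    // Threads hitting the same cell mostly land in different copies.
    uint copy = (blockIdx.x * blockDim.x + threadIdx.x) % numCopies;
    uint slot = copy * numCells + gridHash;

    int counter = atomicAdd(&gridCountersRep[slot], 1);
    counter = min(counter, params.maxParticlesPerCell - 1);

    // Each copy owns its own slice of the per-cell particle list; the copies are
    // merged (or read jointly) when the inverse map is consumed.
    gridCellsRep[slot * params.maxParticlesPerCell + counter] = index;
}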

Summary of Other Optimizations
– Global memory coalescing: gridHash and gridCounters organization; use of float2 and float4 data types (the CUDA Visual Profiler pinpointed these!)
– Maintain data on the GPU across time steps
– Fuse multiple functions from the sequential code into a single, coarser-grained GPU kernel
– Replace divides by multiplies by the inverse, and cache the inverse (see the fragment below)
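Two of these are easy to show in a small, hypothetical fragment (not taken from the authors' code): a coalesced float4 load of particle positions, and a division replaced by a cached reciprocal.

// Converts particle positions to grid coordinates. The float4 read is one
// coalesced 16-byte transaction per thread; division by cellSize is replaced
// by one cached reciprocal and three multiplies.
__global__ void toGridCoords(const float4* positions, float3* gridCoords,
                             float cellSize, int numParticles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float4 p = positions[i];
    float invCellSize = 1.0f / cellSize;      // instead of p.x / cellSize, etc.
    gridCoords[i] = make_float3(p.x * invCellSize,
                                p.y * invCellSize,
                                p.z * invCellSize);
}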

Experiment Details
– Architectures:
  – Original = Intel Core 2 Duo E8400 (3.00 GHz)
  – CUDA = NVIDIA GeForce 9600 GT (8 SMs)
– Input data sets (Cell | Grid Nodes | Particles):
  32 | 1,352 | 2,…
  … | …,356 | 9,…
  … | …,012 | 19,897

Results on Key Computations
– All results use 128 threads
– Speedups of 12.5x and 6.6x over the sequential implementation on the two key computations (updateContribList and advance, respectively)

Overall Speedup Results
– With no output, speedup of 10.7x
– With output, speedup of only 3.3x
– Obvious future work: OpenGL for visualization

Shifting Gears: Programmability and Auto-tuning
– Midterm extra credit question: “If you could invest in tool research for GPUs, in what areas would you like to see progress?”
– Tools:
  – Assistance with partitioning across threads/blocks
  – Assistance with selecting numbers of threads/blocks
  – Assistance with calculating indexing relative to the thread/block partitioning

Auto-Tuning “Compiler”
– Traditional view: a batch compiler takes code and input data and performs code translation
– (Semi-)autotuning compiler: takes code and input data characteristics, and is driven by search script(s) and transformation script(s) feeding an experiments engine

Current Research Activity
– Automatically generate CUDA from sequential code and a transformation script, e.g. CUDAize(loop, TI, TJ, kernnm)
– Advantages of auto-tuning:
  – Trade off a large number of threads to hide latency against a smaller number to increase reuse of data in registers
  – Detect ordering sensitivities that impact coalescing, bank conflicts, etc.
  – Evaluate alternative memory hierarchy optimizations
– Addresses the challenges from the earlier slide:
  – Correct code generation, including indexing
  – Auto-tuning to select the best thread/block partitioning (the simplest form of that search is sketched below)
  – Memory hierarchy optimizations and data movement
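A hedged sketch of the simplest form of that search: exhaustively timing candidate thread counts with CUDA events. The kernel and its arguments are placeholders carried over from the earlier sketch; a real auto-tuner would also vary tiling and data placement.

// Try each candidate block size on the dominant kernel and return the fastest.
int autotuneThreads(float4* d_particles, float* d_grid, int* d_contrib, int numParticles)
{
    int candidates[] = {64, 128, 192, 256, 384, 512};
    int bestThreads = candidates[0];
    float bestMs = 1e30f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int c = 0; c < 6; ++c) {
        int threads = candidates[c];
        int blocks = (numParticles + threads - 1) / threads;

        cudaEventRecord(start);
        updateContribList<<<blocks, threads>>>(d_particles, d_grid, d_contrib, numParticles);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; bestThreads = threads; }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return bestThreads;
}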

Summary
– Three areas for improvement in MPM/GIMP:
  – Used single precision, which may not always be sufficiently precise
  – Wanted more threads, but were constrained by register limits
  – OpenGL visualization of results
– Newer GPUs and straightforward extensions ameliorate these challenges
– Future work on programmability and auto-tuning