Using GPUs to Compute the Multilevel Summation of Electrostatic Forces

David J. Hardy
Theoretical and Computational Biophysics Group
NIH Resource for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois at Urbana-Champaign

Multiscale Molecular Modelling, Edinburgh, June 30 - July 3, 2010

Outline
- Multilevel summation method (MSM)
- GPU architecture and kernel design considerations
- GPU kernel design alternatives for short-range non-bonded interactions
- GPU kernel for the MSM long-range part
- Initial performance results for speeding up MD

Multilevel Summation Method
Fast algorithm for N-body electrostatics: calculates the sum of smoothed pairwise potentials interpolated from a hierarchical nesting of grids.
Advantages over PME (particle-mesh Ewald) and FMM (fast multipole method):
- Algorithm has linear time complexity
- Allows non-periodic or periodic boundaries
- Produces continuous forces for dynamics (advantage over FMM)
- Avoids 3D FFTs for better parallel scaling (advantage over PME)
- Permits polynomial splittings (no erfc() evaluation, as used by PME)
- Spatial separation allows use of multiple time steps
- Can be extended to other types of pairwise interactions
Skeel et al., J. Comp. Chem. 23, 2002.

MSM Main Ideas
- Split the 1/r potential into a short-range cutoff part plus smoothed parts that are successively more slowly varying. All but the top-level potential are cut off.
- Smoothed potentials are interpolated from successively coarser grids (h-grid, 2h-grid, ...).
- The finest grid spacing h and the smallest cutoff distance a are doubled at each successive level.
[Figure: splitting of the 1/r potential (left) and interpolation of the smoothed potentials (right) across the atom, h-grid, and 2h-grid levels, with cutoffs a and 2a.]
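To make the splitting concrete, here is a hedged sketch of a two-level form following Skeel et al.; the particular softening function gamma shown (the quadratic Taylor polynomial of s^{-1/2} about s = 1) is one common choice and is an assumption, not something stated on the slide:

\[
\frac{1}{r} =
\underbrace{\frac{1}{r} - \frac{1}{a}\gamma\!\left(\frac{r}{a}\right)}_{\text{short range, cutoff } a}
+ \underbrace{\frac{1}{a}\gamma\!\left(\frac{r}{a}\right) - \frac{1}{2a}\gamma\!\left(\frac{r}{2a}\right)}_{h\text{-grid, cutoff } 2a}
+ \underbrace{\frac{1}{2a}\gamma\!\left(\frac{r}{2a}\right)}_{2h\text{-grid, no cutoff}},
\qquad
\gamma(\rho) =
\begin{cases}
\frac{15}{8} - \frac{5}{4}\rho^{2} + \frac{3}{8}\rho^{4}, & \rho \le 1,\\
1/\rho, & \rho \ge 1.
\end{cases}
\]

Because gamma(rho) = 1/rho beyond the cutoff of its argument, each bracketed difference vanishes at large r, which is what makes every level except the top one short-ranged.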

MSM Calculation
force = exact short-range part + interpolated long-range part
Computational steps (a control-flow sketch follows):
- Short-range cutoff part: computed directly from the atom positions and charges
- Anterpolation: spread the atomic charges onto the h-grid
- Restriction: transfer grid charges from the h-grid to the 2h-grid, 4h-grid, ...
- Grid cutoff calculation at each level (h-grid cutoff, 2h-grid cutoff, 4h-grid, ...)
- Prolongation: transfer grid potentials back down from coarser to finer grids
- Interpolation: evaluate the long-range forces on the atoms from the h-grid potentials
[Diagram: positions and charges flow through anterpolation, restriction, the per-level grid cutoffs, prolongation, and interpolation to give the long-range parts, which combine with the short-range cutoff part to produce the forces.]
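A minimal host-side sketch of this control flow; the function names are hypothetical stubs standing in for the real routines, not the names used in NAMD or NAMD-Lite:

    /* Hypothetical stubs for the MSM steps named on the slide. */
    static void anterpolate(float *qh, const float *pos, const float *q, int n) {}
    static void restrict_grid(float *coarse, const float *fine) {}
    static void grid_cutoff(float *e, const float *q, int level) {}
    static void prolongate(float *fine, const float *coarse) {}
    static void interpolate_forces(float *f, const float *eh, const float *pos, int n) {}
    static void short_range(float *f, const float *pos, const float *q, int n) {}

    /* MSM force calculation, mirroring the slide's data flow. */
    void msm_forces(int nlevels, float **q_grid, float **e_grid,
                    const float *pos, const float *charge, float *force, int natoms)
    {
        anterpolate(q_grid[0], pos, charge, natoms);        /* atoms -> h-grid charges      */
        for (int k = 0; k < nlevels - 1; k++)
            restrict_grid(q_grid[k + 1], q_grid[k]);        /* h-grid -> 2h-grid -> ...     */
        for (int k = 0; k < nlevels; k++)
            grid_cutoff(e_grid[k], q_grid[k], k);           /* cutoff stencil at each level */
        for (int k = nlevels - 2; k >= 0; k--)
            prolongate(e_grid[k], e_grid[k + 1]);           /* coarse potentials -> fine    */
        interpolate_forces(force, e_grid[0], pos, natoms);  /* h-grid potentials -> forces  */
        short_range(force, pos, charge, natoms);            /* exact short-range part added */
    }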

GPU Computing Concepts
- Heterogeneous computing model: the CPU host manages GPU devices and memories and invokes "kernels" (a minimal CUDA sketch follows this list)
- GPU hardware supports standard integer and floating-point types and math operations, and is designed for fine-grained data parallelism
  - Hundreds of "threads" are grouped together into a "block" performing SIMD execution
  - Hundreds of blocks (tens of thousands of threads) are needed to fully utilize the hardware
- Great speedups are attainable with commodity hardware
  - 10x - 20x over one CPU core is common
  - 100x or more is possible in some cases
- Programming toolkits (CUDA and OpenCL) in C/C++ dialects allow GPUs to be integrated into legacy software
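As a concrete illustration of the host-manages-device model (not code from the talk), a minimal CUDA sketch: the host allocates device memory, launches a grid of thread blocks, and copies the result back.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Each thread scales one array element: fine-grained data parallelism.
    __global__ void scale(float *x, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main() {
        const int n = 1 << 20;
        float *h = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; i++) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        // Hundreds of threads per block, thousands of blocks to keep the GPU busy.
        int block = 256, grid = (n + block - 1) / block;
        scale<<<grid, block>>>(d, 2.0f, n);

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h[0] = %f\n", h[0]);   // expect 2.0
        cudaFree(d);
        free(h);
        return 0;
    }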

[Diagram: GPU hardware block diagram. SP = Streaming Processor, SFU = Special Function Unit, Tex = Texture Unit.]

NVIDIA Fermi New Features
- ECC memory support
- L1 (64 KB) and L2 (768 KB) caches
- 512 cores, 2x faster arithmetic ("madd")
  - HOWEVER: memory bandwidth increases only modestly
- Support for additional functions in the SFU: erfc() for PME!
  - HOWEVER: no commensurate improvement in SFU performance
- Improved double-precision performance and accuracy
- Improved integer performance
- Allows multiple kernel execution
  - Could prove extremely important for keeping the GPU fully utilized

GPU Kernel Design (CUDA)
- The problem must offer substantial data parallelism
- Access data using gather memory patterns (reads) rather than scatter (writes)
- Increase arithmetic intensity through effective use of the memory subsystems (illustrated in the sketch below):
  - registers: the fastest memory, but smallest in size
  - constant memory: near register speed when all threads together read a single location (fast broadcast)
  - shared memory: near register speed when all threads access it without bank conflicts
  - texture cache: can improve performance for irregular memory access
  - global memory: large but slow; coalesce accesses for best performance
- May benefit from trading memory access for more arithmetic
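A hedged illustration of how these memory spaces appear in CUDA source; the kernel does nothing meaningful, and all names are placeholders:

    // Read-only table cached on chip; fast broadcast when all threads read the same slot.
    __constant__ float c_table[64];

    // Assumes a launch with 256 threads per block.
    __global__ void example_kernel(const float *g_in, float *g_out) {
        // Shared memory: visible to the whole thread block, near register speed
        // when accessed without bank conflicts.
        __shared__ float s_tile[256];

        // Local scalars live in registers: the fastest storage, private to each thread.
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        s_tile[threadIdx.x] = g_in[i];      // coalesced read from global memory
        __syncthreads();

        float acc = s_tile[threadIdx.x] * c_table[threadIdx.x % 64];
        g_out[i] = acc;                     // coalesced write to global memory
    }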

Additional GPU Considerations
- Coalescing reads and writes from global memory
- Avoiding bank conflicts when accessing shared memory
- Avoiding branch divergence
- Synchronizing threads within thread blocks
- Atomic memory operations can provide synchronization across thread blocks
- "Stream" management for invoking kernels and transferring memory asynchronously (see the sketch after this list)
- Page-locked host memory to speed up memory transfers between CPU and GPU
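For the last two points, a minimal sketch (not from the talk) combining page-locked host memory with asynchronous copies queued on a CUDA stream:

    #include <cuda_runtime.h>

    int main() {
        const int n = 1 << 20;
        float *h_buf, *d_buf;

        // Page-locked (pinned) host memory enables faster, truly asynchronous transfers.
        cudaMallocHost(&h_buf, n * sizeof(float));
        cudaMalloc(&d_buf, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Copies and kernel launches queued on the same stream run in order on the GPU
        // while the CPU remains free to do other work.
        cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        // some_kernel<<<grid, block, 0, stream>>>(d_buf, n);   // hypothetical kernel launch
        cudaMemcpyAsync(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

        cudaStreamSynchronize(stream);   // wait for all work queued on this stream

        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }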

Designing GPU Kernels for Short-range Non-bonded Forces
- Calculate both electrostatics and van der Waals interactions (need atom coordinates and parameters)
- Spatial hashing of atoms into bins (best done on the CPU; a sketch follows this list)
- Should we use pairlists?
  - Reduces computation, but increases and delocalizes memory access
- Should we make use of Newton's 3rd law to reduce work?
- Is single precision enough? Do we need double precision?
- How might we handle non-bonded exclusions?
  - Detect and omit excluded pairs (use bit masks)
  - Ignore, then fix with the CPU (use force clamping)
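A hedged host-side sketch of the spatial hashing step: each atom is assigned to a uniform bin whose side length is at least the cutoff, so all interaction partners of an atom lie within the surrounding 3x3x3 block of bins. The layout (vector of index lists, coordinates assumed to lie in [0, box)) is illustrative, not the NAMD data structure:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Atom { float x, y, z, q; };

    // Assign atom indices to uniform spatial bins with side length >= cutoff.
    std::vector<std::vector<int>> bin_atoms(const std::vector<Atom> &atoms,
                                            const float box[3], float cutoff,
                                            int nbins[3])
    {
        for (int d = 0; d < 3; d++)
            nbins[d] = std::max(1, (int)std::floor(box[d] / cutoff));  // bin side >= cutoff

        std::vector<std::vector<int>> bins(nbins[0] * nbins[1] * nbins[2]);
        for (int i = 0; i < (int)atoms.size(); i++) {
            int bx = std::min((int)(atoms[i].x / box[0] * nbins[0]), nbins[0] - 1);
            int by = std::min((int)(atoms[i].y / box[1] * nbins[1]), nbins[1] - 1);
            int bz = std::min((int)(atoms[i].z / box[2] * nbins[2]), nbins[2] - 1);
            bins[(bz * nbins[1] + by) * nbins[0] + bx].push_back(i);
        }
        return bins;   // packed into fixed-depth arrays before copying to the GPU
    }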

Designing GPU Kernels for Short-range Non-bonded Forces (continued)
- How do we map work to the GPU threads?
  - Fine-grained: assign threads to sum the forces on atoms
  - Extremely fine-grained: assign threads to individual pairwise interactions
- How do we decompose work into thread blocks?
  - Non-uniform: assign thread blocks to bins
  - Uniform: assign thread blocks to entries of the force matrix
- How do we compute potential energies or the virial?
- How do we calculate expensive functional forms?
  - PME requires erfc(): is it faster to use an interpolation table?
- Other issues: supporting NBFix parameters

GPU Kernel for Short-range MSM
- The CPU sorts atoms into bins and copies the bins to GPU global memory
- Each bin is assigned to a thread block
- Threads are assigned to individual atoms
- Loop over the surrounding neighborhood of bins, summing forces and energies from their atoms
- The MSM calculation involves rsqrt() plus several multiplies and adds
- The CPU copies forces and energies back from GPU global memory
[Diagram: the array of bin index offsets lives in constant memory; each thread block finds the index offset for a neighbor bin and copies that bin of atoms from global memory into shared memory; each thread keeps its atom's forces in registers.]

GPU Kernel for Short-range MSM (continued)
- Each thread accumulates its atom's force and energies in registers
- Bin neighborhood index offsets are stored in constant memory
- Atom bin data are loaded into shared memory; the atom data and bin "depth" are carefully chosen to permit coalesced reads from global memory
- Check for and omit excluded pairs
- The thread block performs a sum reduction of the energies
- Coalesced writing of forces and energies (with padding) to GPU global memory
- The CPU sums the energies from the bins
A simplified kernel sketch follows.
[Diagram: same constant / shared / register / global memory layout as on the previous slide.]
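A heavily simplified, hedged sketch of a kernel organized this way. The fixed bin depth of 8, the 3x3x3 neighborhood, the plain Coulomb interaction (in place of the MSM short-range splitting), and the omission of exclusions, energies, and boundary handling are all assumptions made for brevity; this is not the NAMD-Lite kernel:

    #define BIN_DEPTH 8    // atom slots per bin, padded with zero-charge entries
    #define NBR_BINS 27    // 3 x 3 x 3 neighborhood of bins

    // Neighbor-bin index offsets, written by the host with cudaMemcpyToSymbol().
    __constant__ int c_bin_offset[NBR_BINS];

    // One thread block per bin, one thread per atom slot (blockDim.x == BIN_DEPTH).
    // Atoms are stored as float4: (x, y, z, charge).
    __global__ void short_range_bins(const float4 *bins, float4 *forces, float cutoff2)
    {
        __shared__ float4 s_nbr[BIN_DEPTH];                 // one neighbor bin at a time

        int my_bin = blockIdx.x;
        float4 a = bins[my_bin * BIN_DEPTH + threadIdx.x];  // my atom, held in registers
        float fx = 0.f, fy = 0.f, fz = 0.f;                 // force accumulated in registers

        for (int n = 0; n < NBR_BINS; n++) {
            int nbr_bin = my_bin + c_bin_offset[n];         // broadcast from constant memory
            s_nbr[threadIdx.x] = bins[nbr_bin * BIN_DEPTH + threadIdx.x];  // coalesced load
            __syncthreads();

            for (int j = 0; j < BIN_DEPTH; j++) {
                float dx = a.x - s_nbr[j].x;
                float dy = a.y - s_nbr[j].y;
                float dz = a.z - s_nbr[j].z;
                float r2 = dx * dx + dy * dy + dz * dz;
                if (r2 < cutoff2 && r2 > 0.f) {
                    // Plain Coulomb shown for brevity; the real kernel evaluates the
                    // MSM short-range split: rsqrtf() plus several multiply-adds.
                    float inv_r = rsqrtf(r2);
                    float s = a.w * s_nbr[j].w * inv_r * inv_r * inv_r;
                    fx += s * dx;  fy += s * dy;  fz += s * dz;
                }
            }
            __syncthreads();
        }
        forces[my_bin * BIN_DEPTH + threadIdx.x] = make_float4(fx, fy, fz, 0.f);
    }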

NAMD Hybrid Decomposition
- Spatially decompose data and communication
- Separate but related work decomposition
- "Compute objects" facilitate an iterative, measurement-based load balancing system
Kale et al., J. Comp. Phys. 151, 1999.

NAMD Non-bonded Forces on GPU
- Decompose work into pairs of "patches" (bins), identical to the NAMD structure
- Each patch pair is calculated by an SM (thread block)
[Diagram: force computation on a single multiprocessor (the GeForce 8800 GTX has 16). Patch A coordinates and parameters sit in 16 kB of shared memory; Patch B coordinates, parameters, and forces in 32 kB of registers; the force table interpolation uses the texture unit, and constants and exclusions use an 8 kB cache; a 32-way SIMD multiprocessor runs multiplexed threads over 768 MB of uncached main memory with 300+ cycle latency.]
Stone et al., J. Comp. Chem. 28, 2007.

MSM Grid Interactions
- The potential is summed from the grid-point charges within a cutoff radius
- Uniform spacing enables the distance-based interactions to be precomputed as a stencil of "weights"
- The weights at each level are identical up to a scaling factor (!)
- Calculate as a 3D convolution of weights (restated in symbols below)
  - Stencil size up to 23x23x23
[Figure: a sphere of grid-point charges within the cutoff radius accumulates potential at a grid point.]
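In symbols, a hedged restatement: with q and e the grid charges and potentials at level k, and w the precomputed weight stencil,

\[
e^{(k)}_{\mathbf{i}} \;=\; \sum_{\mathbf{j}} w^{(k)}_{\mathbf{i}-\mathbf{j}}\, q^{(k)}_{\mathbf{j}},
\qquad
w^{(k)}_{\mathbf{n}} \;=\; \frac{1}{2^{k}}\, w^{(0)}_{\mathbf{n}},
\]

where the stencil has bounded support (up to 23x23x23 points). The 1/2^k scaling factor is inferred here from the doubling of the grid spacing and cutoff at each level; the slide states only that the weights are identical up to a scaling factor.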

MSM Grid Interactions on GPU
- Store the weights in constant memory (padded up to the next multiple of 4)
- Each thread block calculates a 4x4x4 region of potentials (stored contiguously)
- Pack all regions over all levels into a 1D array (each level padded with a zero-charge region)
- Store a map of level array offsets in constant memory
- The kernel has each thread block loop over the surrounding regions of charge (loading them into shared memory)
- All grid levels are calculated concurrently, scaled by the level factor (this keeps the GPU from running out of work at the upper grid levels)
A simplified kernel sketch follows.
[Diagram: each thread block cooperatively loads regions of grid charge from global memory into shared memory and multiplies them by the stencil of weights held in constant memory, accumulating its region of grid potentials.]
Hardy et al., Parallel Computing 35, 2009.
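A heavily simplified sketch of the grid-cutoff calculation: one thread per output grid point over a single level, with the weight stencil in constant memory as on the slide, but without the 4x4x4-region packing, the multi-level concurrency, or the sliding-window optimization described next. Dimensions and names are illustrative assumptions:

    #define R 11    // stencil half-width; full stencil is up to 23 x 23 x 23

    // Precomputed weights, copied by the host with cudaMemcpyToSymbol().
    __constant__ float c_w[(2*R+1) * (2*R+1) * (2*R+1)];

    // One thread per output grid point; 'scale' is the per-level factor.
    __global__ void grid_cutoff_naive(float *e, const float *q,
                                      int nx, int ny, int nz, float scale)
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;
        int iy = blockIdx.y * blockDim.y + threadIdx.y;
        int iz = blockIdx.z * blockDim.z + threadIdx.z;
        if (ix >= nx || iy >= ny || iz >= nz) return;

        float sum = 0.f;
        for (int dz = -R; dz <= R; dz++)
        for (int dy = -R; dy <= R; dy++)
        for (int dx = -R; dx <= R; dx++) {
            int jx = ix + dx, jy = iy + dy, jz = iz + dz;
            if (jx < 0 || jx >= nx || jy < 0 || jy >= ny || jz < 0 || jz >= nz)
                continue;                                   // non-periodic boundary
            float w = c_w[((dz + R) * (2*R+1) + (dy + R)) * (2*R+1) + (dx + R)];
            sum += w * q[(jz * ny + jy) * nx + jx];
        }
        e[(iz * ny + iy) * nx + ix] = scale * sum;
    }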

Apply Weights Using a Sliding Window
- The thread block must collectively use the same value from constant memory
- Read 8x8x8 grid charges (8 regions) into shared memory
- A window of size 4x4x4 maintains the same relative distances
- Slide the window by 4 shifts along each dimension

Initial Results
Benchmark: box of flexible waters, 12 Å cutoff, 1 ps of dynamics

                        CPU only    With GPU                        Speedup vs. NAMD/CPU
  NAMD with PME                     210.5 s                         5.7x
  NAMD-Lite with MSM                (93.9 s short, 63.1 s long)     6.8x (19% over NAMD/GPU)

(GPU: NVIDIA GTX 285, using CUDA 3.0; CPU: 2.4 GHz Intel Core 2 Q6600 quad core)

Concluding Remarks
Redesign of MD algorithms for GPUs is important:
- Commodity GPUs offer great speedups for well-constructed codes
- CPU single-core performance is not improving
- Multicore CPUs benefit from the redesign efforts (e.g., OpenCL can target multicore CPUs)
MSM offers advantages that benefit GPUs:
- Short-range splitting using a polynomial of r^2
- Calculation on uniform grids
More investigation is needed into alternative designs for MD algorithms.

Acknowledgments
- Bob Skeel (Purdue University)
- John Stone (University of Illinois at Urbana-Champaign)
- Jim Phillips (University of Illinois at Urbana-Champaign)
- Klaus Schulten (University of Illinois at Urbana-Champaign)
Funding from NSF and NIH