Faster, Cheaper, and Better Science: Molecular Modeling on GPUs

John Stone
Theoretical and Computational Biophysics Group
Beckman Institute for Advanced Science and Technology
University of Illinois at Urbana-Champaign

Fall National Meeting of the American Chemical Society, Boston, MA, August 22, 2010

Potential Impact of GPU Computing
– State-of-the-art GPUs achieve over 1.0 TFLOPS single-precision and 0.5 TFLOPS double-precision performance
– GPUs are an inexpensive commodity technology that is already ubiquitous, and are well supported by existing operating systems, unlike previous accelerators
– End users do not need to become HPC gurus to run GPU-accelerated software

GPU Hardware Platforms
– GPUs are equally applicable to accelerating applications on laptops, desktops, and supercomputers
– GPUs are available in HPC- and data-center-friendly rack mounts, with ECC-protected memory and the hardware monitoring features of interest to cluster builders
– 2 of the top 10 supercomputers in the latest Top500 list are GPU-accelerated systems
– 2 of the top 10 supercomputers in the Green500 list are GPU-accelerated, and are over 3x more efficient than the average of all Green500 systems!

Programmable Graphics Hardware
Groundbreaking research systems:
– AT&T Pixel Machine (1989): 82 x DSP32 processors
– UNC PixelFlow: 64 x (PA-8000 + 8,192 bit-serial SIMD processors)
– SGI RealityEngine (1990s): up to 12 i860-XP processors perform vertex operations (microcode); fixed-function fragment hardware
All mainstream GPUs now incorporate fully programmable processors.
[Images: SGI RealityEngine i860 vertex processors; UNC PixelFlow rack]

GLSL Sphere Fragment Shader
– Written in the OpenGL Shading Language (GLSL)
– High-level C-like language with vector types and operations
– Compiled dynamically by the graphics driver at runtime
– Compiled machine code executes on the GPU

GPU Computing
– Commodity devices, omnipresent in modern computers
– Massively parallel hardware, hundreds of processing units, throughput-oriented design
– Support all standard integer and floating point types
– CUDA and OpenCL allow GPU software to be written in dialects of familiar C/C++ and integrated into legacy software (see the sketch below)
– GPU algorithms are often multicore-friendly due to the attention paid to data locality and work decomposition, and can be executed successfully on multi-core CPUs as well using special runtime systems (e.g. MCUDA)
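
To make the "dialect of C" point concrete, here is a minimal, hypothetical CUDA sketch (not from the original slides): a kernel that scales an array in place, launched from ordinary host C code.

  #include <cuda_runtime.h>

  // Each GPU thread scales one array element in place.
  __global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the final partial block
      x[i] *= a;
  }

  void scale_on_gpu(float *d_x, float a, int n) {
    // 256 threads per block; enough blocks to cover all n elements
    scale<<<(n + 255) / 256, 256>>>(d_x, a, n);
    cudaDeviceSynchronize();                        // wait for the kernel to finish
  }

Host code allocates d_x with cudaMalloc() and moves data with cudaMemcpy(); the kernel body itself is plain C with a few extra keywords and built-in index variables.
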
What Speedups Can GPUs Achieve?
– Single-GPU speedups of 10x to 30x vs. one CPU core are common
– The best speedups can reach 100x or more, attained on codes dominated by floating point arithmetic, especially native GPU machine instructions, e.g. expf(), rsqrtf(), …
– Amdahl's Law can prevent legacy codes from achieving peak speedups with shallow GPU acceleration efforts (see the worked example below)
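
As a worked illustration of that last point (assuming a code in which a fraction p of the runtime is accelerated by a factor s), Amdahl's Law bounds the overall speedup:

  S = \frac{1}{(1-p) + p/s}

For example, if 90% of the runtime moves to a GPU kernel that runs 30x faster, then S = 1 / (0.1 + 0.9/30) = 1/0.13 ≈ 7.7x overall, despite the 30x kernel speedup.
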
GPU Peak Single-Precision Performance: Exponential Trend

GPU Peak Memory Bandwidth: Linear Trend
[Chart: peak memory bandwidth by GPU generation, through GT200]

Comparison of CPU and GPU Hardware Architecture
– CPU: cache-heavy, focused on individual thread performance
– GPU: ALU-heavy, massively parallel, throughput-oriented

NVIDIA Fermi GPU Architecture
[Diagram: Graphics Processor Clusters (GPCs) of Streaming Multiprocessors (SMs); each SM contains SPs, SFUs, LD/ST units, and texture units with a texture cache, plus 64 KB of L1 cache / shared memory; 64 KB constant cache; 768 KB Level 2 cache; ~3-6 GB DRAM memory w/ ECC]

GPUs Require Lots of Parallelism, Favor Large Problem Sizes
[Chart: runtime vs. problem size; lower is better. GPU underutilized at small problem sizes; GPU fully utilized at large sizes, ~40x faster than the CPU. Cold-start GPU initialization time: ~110 ms]
Accelerating molecular modeling applications with graphics processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28:2618-2640, 2007.

NAMD Molecular Dynamics on GPUs
[Images: NVIDIA Tesla; NCSA Lincoln GPU cluster]

NAMD Performance on NCSA "Lincoln" GPU Cluster (8 cores and 2 GPUs per node)
[Chart: STMV (1M atoms), seconds/timestep for 2, 4, 8, and 16 GPUs vs. CPU cores; 2 GPUs match ~24 CPU cores, 8 GPUs match ~96 CPU cores]

NAMD 2.7b3 w/ CUDA Available
– CUDA-enabled NAMD binaries for 64-bit Linux are available on the NAMD web site now!
– Direct download link:

GPU-Accelerated NAMD Plans
– Serial performance
  – Target NVIDIA Fermi architecture
  – Revisit GPU kernel design decisions made in 2007
  – Improve performance of remaining CPU code
– Parallel scaling
  – Target NSF Track 2D Keeneland cluster at ORNL
  – Finer-grained work units on the GPU (a Fermi feature)
  – One process per GPU, one thread per CPU core
  – Dynamic load balancing of GPU work
– Wider range of simulation options and features

Science Benefits from GPU-Accelerated Molecular Dynamics
– Run simulations of larger molecular complexes
– Run longer-timescale simulations
– Run simulations on a desktop or departmental-sized cluster at speeds previously accessible only on larger machines
– Future: enable higher-fidelity modeling
  – Polarizable force fields
  – Better sampling

VMD – "Visual Molecular Dynamics"
– Visualization and analysis of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry calculations, particle systems, …
– User-extensible with scripting and plugins

Motivation for GPU Acceleration in VMD
– Increases in supercomputing resources at NSF centers such as NCSA enable increased simulation complexity, fidelity, and longer time scales…
– This drives the need for more visualization and analysis capability at the desktop and on clusters
– Desktop use is the most compute-limited scenario, where GPUs can make a big impact…
– GPU acceleration provides an opportunity to make some slow, batch-mode calculations fast enough to run interactively, or on demand…

GPU Algorithms in VMD (GPU: massively parallel co-processor)
– Electrostatic field calculation, multilevel summation method: 31x to 44x faster (a simplified sketch of the underlying direct-summation idea follows this list)
– Ion placement: 20x to 44x faster
– Imaging of gas migration pathways in proteins with implicit ligand sampling: 20x to 30x faster
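
For orientation, here is a simplified, hypothetical sketch of the direct Coulomb summation idea underlying such electrostatic field kernels (VMD's production kernels and the multilevel summation method are substantially more sophisticated; names and parameters here are illustrative only):

  // One thread per lattice point of a 2-D potential map slice at height z;
  // each thread sums q/r contributions over all atoms (x,y,z,q in a float4).
  __global__ void direct_coulomb(float *potmap, const float4 *atoms, int natoms,
                                 int nx, int ny, float gridspacing, float z) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= nx || iy >= ny) return;           // skip threads outside the map
    float x = ix * gridspacing, y = iy * gridspacing;
    float e = 0.0f;
    for (int i = 0; i < natoms; i++) {          // loop over all atoms
      float dx = x - atoms[i].x;
      float dy = y - atoms[i].y;
      float dz = z - atoms[i].z;
      e += atoms[i].w * rsqrtf(dx*dx + dy*dy + dz*dz);  // q * (1/r), fast rsqrtf
    }
    potmap[iy * nx + ix] += e;
  }

The heavy use of floating point arithmetic and the native rsqrtf() instruction is what makes this class of kernel such a good fit for the speedups quoted above.
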
GPU Algorithms in VMD (GPU: massively parallel co-processor)
– Molecular orbital calculation and display: 100x to 120x faster
– Radial distribution functions: 30x to 70x faster

Ongoing VMD GPU Development
– Development of new CUDA kernels for common molecular dynamics trajectory analysis tasks, faster surface renderings, and more…
– Support for CUDA in MPI-enabled builds of VMD for analysis runs on GPU clusters
– Updating existing CUDA kernels to take advantage of new hardware features on the latest NVIDIA "Fermi" GPUs
– Adaptation of CUDA kernels to OpenCL, evaluation of JIT techniques with OpenCL

Trajectory Analysis on NCSA GPU Cluster with MPI-Enabled VMD
– Short time-averaged electrostatic field test case (a few hundred frames, 700,000 atoms)
– 1:1 CPU/GPU ratio
– Power measured on a single node with NCSA monitoring tools (Tweet-a-watt power monitoring device)
– CPUs only: 1465 sec, 299 watts
– CPUs + GPUs: 57 sec, 742 watts
– Speedup: 25.5x
– Power efficiency gain: 10.5x
The gains follow directly from the measurements, as the arithmetic below shows.
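
Checking the arithmetic behind the two figures (energy to solution = runtime x average power):

  \text{speedup} = \frac{1465\ \mathrm{s}}{57\ \mathrm{s}} \approx 25.7,
  \qquad
  \frac{E_{\mathrm{CPU}}}{E_{\mathrm{CPU+GPU}}} = \frac{1465 \times 299\ \mathrm{J}}{57 \times 742\ \mathrm{J}} \approx \frac{438\ \mathrm{kJ}}{42.3\ \mathrm{kJ}} \approx 10.4

both consistent with the reported 25.5x and 10.5x, allowing for rounding of the underlying measurements.
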
Latest GPUs Bring Higher Performance and Easier Programming
NVIDIA's latest "Fermi" GPUs bring:
– Greatly increased peak single- and double-precision arithmetic rates
– Moderately increased global memory bandwidth
– Increased-capacity on-chip memory, partitioned into shared memory and an L1 cache for global memory
– Concurrent kernel execution
– Bidirectional asynchronous host-device I/O (see the streams sketch below)
– ECC memory, faster atomic ops, many others…
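
A minimal sketch of how bidirectional asynchronous I/O is typically expressed with CUDA streams (buffer names and the kernel are hypothetical; async copies require page-locked host memory allocated with cudaHostAlloc() or cudaMallocHost()):

  cudaStream_t s0, s1;
  cudaStreamCreate(&s0);
  cudaStreamCreate(&s1);

  // Work queued in stream s1 can overlap the copies and kernel in stream s0.
  cudaMemcpyAsync(d_in0, h_in0, bytes, cudaMemcpyHostToDevice, s0);
  kernel<<<grid, block, 0, s0>>>(d_in0, d_out0);
  cudaMemcpyAsync(h_out0, d_out0, bytes, cudaMemcpyDeviceToHost, s0);

  cudaMemcpyAsync(d_in1, h_in1, bytes, cudaMemcpyHostToDevice, s1);
  kernel<<<grid, block, 0, s1>>>(d_in1, d_out1);
  cudaMemcpyAsync(h_out1, d_out1, bytes, cudaMemcpyDeviceToHost, s1);

  cudaStreamSynchronize(s0);  // wait for each stream's work to complete
  cudaStreamSynchronize(s1);
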
Early Experiences with Fermi
– The 2x single-precision and up to 8x double-precision arithmetic performance increases vs. GT200 cause more kernels to become memory-bandwidth bound…
– …unless they make effective use of the larger on-chip shared memory and L1 global memory cache to improve performance
– Arithmetic is cheap, memory references are costly (a trend that is certain to continue and intensify…)
– Register consumption and GPU "occupancy" are a bigger concern with Fermi than with GT200

MO GPU Parallel Decomposition
– The 3-D MO lattice decomposes into 2-D slices (CUDA grids); each thread computes one MO lattice point
– Small 8x8 thread blocks afford a large per-thread register count and shared memory
– Padding optimizes global memory performance, guaranteeing coalesced global memory accesses; threads in the padded region produce results that are discarded, while the rest produce results that are used
– The lattice can be computed using multiple GPUs (e.g. GPU 0, GPU 1, GPU 2 each taking a set of slices)
A sketch of the indexing implied by this decomposition follows.
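
This is a hypothetical sketch, not VMD's actual code: a grid of 8x8 thread blocks covers one padded 2-D slice, one lattice point per thread, with the row length padded to a multiple of the block width so that global memory writes coalesce.

  // Each thread computes one MO lattice point of one 2-D slice at height z.
  __global__ void mo_slice(float *orbital, int padded_nx,
                           float gridspacing, float z) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    float x = ix * gridspacing, y = iy * gridspacing;
    // … evaluate the basis functions at (x, y, z), as in the kernels below …
    // orbital[iy * padded_nx + ix] = value;  // coalesced: row length is padded
  }

  // Host-side launch geometry: small 8x8 blocks tile the padded slice exactly,
  // so no bounds check is needed; threads in the padding compute results that
  // are simply discarded.
  dim3 block(8, 8);
  dim3 grid(padded_nx / block.x, padded_ny / block.y);
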
VMD MO GPU Kernel Snippet: Loading Tiles Into Shared Memory On-Demand

[… outer loop over atoms …]
// check whether the next basis function tile lies outside the current
// shared memory window; if so, cooperatively load a new aligned tile
if ((prim_counter + (maxprim << 1)) - sblock_prim_counter >= SHAREDSIZE) {
  prim_counter += sblock_prim_counter;
  sblock_prim_counter = prim_counter & MEMCOAMASK;
  s_basis_array[sidx      ] = basis_array[sblock_prim_counter + sidx      ];
  s_basis_array[sidx +  64] = basis_array[sblock_prim_counter + sidx +  64];
  s_basis_array[sidx + 128] = basis_array[sblock_prim_counter + sidx + 128];
  s_basis_array[sidx + 192] = basis_array[sblock_prim_counter + sidx + 192];
  prim_counter -= sblock_prim_counter;
  __syncthreads();
}
for (prim = 0; prim < maxprim; prim++) {
  float exponent       = s_basis_array[prim_counter    ];
  float contract_coeff = s_basis_array[prim_counter + 1];
  contracted_gto += contract_coeff * __expf(-exponent * dist2);
  prim_counter += 2;
}
[… continue on to angular momenta loop …]

Shared memory tiles:
– Tiles are checked and loaded, if necessary, immediately prior to entering key arithmetic loops
– Adds additional control overhead to loops, even with an optimized implementation

VMD MO GPU Kernel Snippet: Fermi Kernel Based on L1 Cache

[… outer loop over atoms …]
// loop over the shells belonging to this atom (or basis function)
for (shell = 0; shell < maxshell; shell++) {
  float contracted_gto = 0.0f;
  int maxprim    = shellinfo[(shell_counter << 4)    ];
  int shell_type = shellinfo[(shell_counter << 4) + 1];
  for (prim = 0; prim < maxprim; prim++) {
    float exponent       = basis_array[prim_counter    ];
    float contract_coeff = basis_array[prim_counter + 1];
    contracted_gto += contract_coeff * __expf(-exponent * dist2);
    prim_counter += 2;
  }
[… continue on to angular momenta loop …]

L1 cache:
– Simplifies code!
– Reduces control overhead
– Gracefully handles arbitrary-sized problems
– Matches performance of constant memory

VMD MO Performance Results for C60: 2.66 GHz Intel X5550 vs. NVIDIA GTX 480

Kernel                        Cores/GPUs    Runtime (s)    Speedup
CPU ICC-SSE
CPU ICC-SSE
CUDA-tiled-shared
CUDA-Fermi-L1-cache (16kB)
CUDA-const-cache

– C60 test case, 6-31G(d) basis set. A high-resolution MO grid was used for accurate timings; a more typical calculation has 1/8th the grid points.
– The Fermi L1 cache supports arbitrary-sized problems at near-peak performance, with a much simpler kernel design…

Computing Radial Distribution Functions (Histogramming) on GPUs
– Fermi: larger shared memory and faster atomic ops increase histogramming performance
– ~3x faster than GT200
– Up to 70x faster than a 4-core Intel X5550 CPU
A sketch of the shared-memory histogramming pattern follows.
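
This is a hedged sketch of the pattern these speedups rely on, not VMD's actual RDF code (the bin count, input layout, and names are assumptions): each block accumulates a private histogram in shared memory with fast atomics, then merges it into the global result.

  #define NBINS 1024   // assumed bin count; 4 KB of shared memory

  __global__ void histogram(unsigned int *global_hist, const float *dist,
                            int n, float binwidth) {
    __shared__ unsigned int local_hist[NBINS];
    for (int i = threadIdx.x; i < NBINS; i += blockDim.x)
      local_hist[i] = 0;                       // zero this block's histogram
    __syncthreads();

    // Grid-stride loop: bin every pair distance using shared-memory atomics,
    // which are much faster on Fermi than global memory atomics.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
      int bin = min((int)(dist[i] / binwidth), NBINS - 1);
      atomicAdd(&local_hist[bin], 1u);
    }
    __syncthreads();

    for (int i = threadIdx.x; i < NBINS; i += blockDim.x)
      atomicAdd(&global_hist[i], local_hist[i]);  // merge into global histogram
  }
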
Acknowledgements
– Theoretical and Computational Biophysics Group, University of Illinois at Urbana-Champaign
– Wen-mei Hwu and the IMPACT group at the University of Illinois at Urbana-Champaign
– NVIDIA CUDA Center of Excellence, University of Illinois at Urbana-Champaign
– Ben Levine and Axel Kohlmeyer at Temple University
– NCSA Innovative Systems Lab
– The CUDA team at NVIDIA
– NIH support: P41-RR05969