NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Programming in CUDA:

Slides:



Advertisements
Similar presentations
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
Advertisements

A Complete GPU Compute Architecture by NVIDIA Tamal Saha, Abhishek Rawat, Minh Le {ts4rq, ar8eb,
Optimization on Kepler Zehuan Wang
HPCC Mid-Morning Break High Performance Computing on a GPU cluster Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery.
GPUs. An enlarging peak performance advantage: –Calculation: 1 TFLOPS vs. 100 GFLOPS –Memory Bandwidth: GB/s vs GB/s –GPU in every PC and.
1 Threading Hardware in G80. 2 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.
Accelerating Machine Learning Applications on Graphics Processors Narayanan Sundaram and Bryan Catanzaro Presented by Narayanan Sundaram.
Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU Presented by: Ahmad Lashgar ECE Department, University of Tehran.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
Emergence of GPU systems for general purpose high performance computing ITCS 4145/5145 April 4, 2013 © Barry Wilkinson CUDAIntro.ppt.
BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC 1 Analysis and Visualization Algorithms in VMD David.
1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Dec 31, 2012 Emergence of GPU systems and clusters for general purpose High Performance Computing.
National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Direct Self-Consistent Field Computations on GPU Clusters Guochun.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 7: Threading Hardware in G80.
NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Visualization and Analysis.
David Luebke NVIDIA Research GPU Computing: The Democratization of Parallel Computing.
Shared memory systems. What is a shared memory system Single memory space accessible to the programmer Processor communicate through the network to the.
NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Petascale Molecular.
NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC High Performance Molecular Simulation, Visualization,
Extracted directly from:
NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign S5246—Innovations in.
NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC S04: High Performance Computing with CUDA Case.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 1 Programming Massively Parallel Processors Lecture Slides for Chapter 1: Introduction.
NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Interactive Molecular.
General Purpose Computing on Graphics Processing Units: Optimization Strategy Henry Au Space and Naval Warfare Center Pacific 09/12/12.
NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPU-Accelerated Analysis.
Programming Concepts in GPU Computing Dušan Gajić, University of Niš Programming Concepts in GPU Computing Dušan B. Gajić CIITLab, Dept. of Computer Science.
NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC OpenCL: Molecular Modeling on Heterogeneous.
NVIDIA Tesla GPU Zhuting Xue EE126. GPU Graphics Processing Unit The "brain" of graphics, which determines the quality of performance of the graphics.
NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC Faster, Cheaper, and Better Science: Molecular.
NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS Spring 2011.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395 Winter 2014 Lecture 17 Introduction to Accelerator.
© Copyright John E. Stone, Page 1 OpenCL for Molecular Modeling Applications: Early Experiences John Stone University of Illinois at Urbana-Champaign.
Programming for Hybrid Architectures
Emergence of GPU systems and clusters for general purpose high performance computing ITCS 4145/5145 April 3, 2012 © Barry Wilkinson.
NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC Accelerating Molecular Modeling Applications.
NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPU-Accelerated Molecular.
Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.
NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC S03: High Performance Computing with CUDA Heterogeneous.
Hardware Acceleration Using GPUs M Anirudh Guide: Prof. Sachin Patkar VLSI Consortium April 4, 2008.
NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC Using GPUs to compute the multilevel summation.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Threading Hardware in G80.
Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC High Performance Molecular Simulation, Visualization,
NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Early Experiences Scaling.
NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC In-Situ Visualization and Analysis of Petascale.
NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC Multilevel Summation of Electrostatic Potentials.
NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC High Performance Molecular Visualization and.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 Graphic Processing Processors (GPUs) Parallel.
Computer Architecture Lecture 24 Parallel Processing Ralph Grishman November 2015 NYU.
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.
My Coordinates Office EM G.27 contact time:
Fast and parallel implementation of Image Processing Algorithm using CUDA Technology On GPU Hardware Neha Patil Badrinath Roysam Department of Electrical.
GPU Computing for GIS James Mower Department of Geography and Planning University at Albany.
NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC Molecular Visualization and Analysis on GPUs.
NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign S6261—VMD+OptiX: Streaming.
Emergence of GPU systems for general purpose high performance computing ITCS 4145/5145 July 12, 2012 © Barry Wilkinson CUDAIntro.ppt.
High Performance Molecular Visualization and Analysis on GPUs
John Stone Theoretical and Computational Biophysics Group
NVIDIA Fermi Architecture
Mattan Erez The University of Texas at Austin
Graphics Processing Unit
6- General Purpose GPU Programming
Multicore and GPU Programming
CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Presentation transcript:

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Programming in CUDA: the Essentials, Part 1 John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign Cape Town GPU Workshop Cape Town, South Africa, April 29, 2013

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Evolution of Graphics Hardware Towards Programmability As graphics accelerators became more powerful, an increasing fraction of the graphics processing pipeline was implemented in hardware For performance reasons, this hardware was highly optimized and task-specific Over time, with ongoing increases in circuit density and the need for flexibility in lighting and texturing, graphics pipelines gradually incorporated programmability in specific pipeline stages Modern graphics accelerators are now complete processors in their own right (thus the new term “GPU”), and are composed of large arrays of programmable processing units

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Origins of Computing on GPUs Widespread support for programmable shading led researchers to begin experimenting with the use of GPUs for general purpose computation, “GPGPU” Early GPGPU efforts used existing graphics APIs to express computation in terms of drawing As expected, expressing general computation problems in terms of triangles and pixels and “drawing the answer” is obfuscating and painful to debug… Soon researchers began creating dedicated GPU programming tools, starting with Brook and Sh, and ultimately leading to a variety of commercial tools such as RapidMind, CUDA, OpenCL, and others...

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPU Computing Commodity devices, omnipresent in modern computers (over a million sold per week) Massively parallel hardware, hundreds of processing units, throughput oriented architecture Standard integer and floating point types supported Programming tools allow software to be written in dialects of familiar C/C++ and integrated into legacy software GPU algorithms are often multicore friendly due to attention paid to data locality and data-parallel work decomposition

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Benefits of GPUs vs. Other Parallel Computing Approaches Increased compute power per unit volume Increased FLOPS/watt power efficiency Desktop/laptop computers easily incorporate GPUs, no need to teach non- technical users how to use a remote cluster or supercomputer GPU can be upgraded without new OS license fees, low cost hardware

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign What Speedups Can GPUs Achieve? Single-GPU speedups of 10x to 30x vs. one CPU core are very common Best speedups can reach 100x or more, attained on codes dominated by floating point arithmetic, especially native GPU machine instructions, e.g. expf(), rsqrtf(), … Amdahl’s Law can prevent legacy codes from achieving peak speedups with shallow GPU acceleration efforts

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPU Solution: Time-Averaged Electrostatics Thousands of trajectory frames 1.5 hour job reduced to 3 min GPU Speedup: 25.5x Per-node power consumption on NCSA GPU cluster: –CPUs-only: 448 Watt-hours –CPUs+GPUs: 43 Watt-hours Power efficiency gain: 10x

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPU Solution: Radial Distribution Function Histogramming 4.7 million atoms 4-core Intel X5550 CPU: 15 hours 4 NVIDIA C2050 GPUs: 10 minutes Fermi GPUs ~3x faster than GT200 GPUs: larger on-chip shared memory Precipitate Liquid

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Science 5: Quantum Chemistry Visualization Chemistry is the result of atoms sharing electrons Electrons occupy “clouds” in the space around atoms Calculations for visualizing these “clouds” are costly: tens to hundreds of seconds on CPUs – non- interactive GPUs enable the dynamics of electronic structures to be animated interactively for the first time VMD enables interactive display of QM simulations, e.g. Terachem, GAMESS Taxol: cancer drug

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPU Solution: Computing C 60 Molecular Orbitals DeviceCPUs, GPUs Runtime (s) Speedup Intel X5550-SSE Intel X5550-SSE GeForce GTX GeForce GTX D CUDA grid on one GPU 3-D orbital lattice: millions of points Lattice slices computed on multiple GPUs GPU threads each compute one point. CUDA thread blocks

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Molecular Orbital Inner Loop, Hand-Coded x86 SSE Hard to Read, Isn’t It? (And this is the “pretty” version!) for (shell=0; shell < maxshell; shell++) { __m128 Cgto = _mm_setzero_ps(); for (prim=0; prim<num_prim_per_shell[shell_counter]; prim++) { float exponent = -basis_array[prim_counter ]; float contract_coeff = basis_array[prim_counter + 1]; __m128 expval = _mm_mul_ps(_mm_load_ps1(&exponent), dist2); __m128 ctmp = _mm_mul_ps(_mm_load_ps1(&contract_coeff), exp_ps(expval)); Cgto = _mm_add_ps(contracted_gto, ctmp); prim_counter += 2; } __m128 tshell = _mm_setzero_ps(); switch (shell_types[shell_counter]) { case S_SHELL: value = _mm_add_ps(value, _mm_mul_ps(_mm_load_ps1(&wave_f[ifunc++]), Cgto)); break; case P_SHELL: tshell = _mm_add_ps(tshell, _mm_mul_ps(_mm_load_ps1(&wave_f[ifunc++]), xdist)); tshell = _mm_add_ps(tshell, _mm_mul_ps(_mm_load_ps1(&wave_f[ifunc++]), ydist)); tshell = _mm_add_ps(tshell, _mm_mul_ps(_mm_load_ps1(&wave_f[ifunc++]), zdist)); value = _mm_add_ps(value, _mm_mul_ps(tshell, Cgto)); break; Writing SSE kernels for CPUs requires assembly language, compiler intrinsics, various libraries, or a really smart autovectorizing compiler and lots of luck...

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign for (shell=0; shell < maxshell; shell++) { float contracted_gto = 0.0f; for (prim=0; prim<num_prim_per_shell[shell_counter]; prim++) { float exponent = const_basis_array[prim_counter ]; float contract_coeff = const_basis_array[prim_counter + 1]; contracted_gto += contract_coeff * exp2f(-exponent*dist2); prim_counter += 2; } float tmpshell=0; switch (const_shell_symmetry[shell_counter]) { case S_SHELL: value += const_wave_f[ifunc++] * contracted_gto; break; case P_SHELL: tmpshell += const_wave_f[ifunc++] * xdist; tmpshell += const_wave_f[ifunc++] * ydist tmpshell += const_wave_f[ifunc++] * zdist; value += tmpshell * contracted_gto; break; Molecular Orbital Inner Loop in CUDA Aaaaahhhh…. Data-parallel CUDA kernel looks like normal C code for the most part….

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Peak Arithmetic Performance Trend

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Peak Memory Bandwidth Trend

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign What Runs on a GPU? GPUs run data-parallel programs called “kernels” GPUs are managed by a host CPU thread: –Create a CUDA context –Allocate/deallocate GPU memory –Copy data between host and GPU memory –Launch GPU kernels –Query GPU status –Handle runtime errors

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign CUDA Stream of Execution Host CPU thread launches a CUDA “kernel”, a memory copy, etc. on the GPU GPU action runs to completion Host synchronizes with completed GPU action CPUGPU CPU code running CPU waits for GPU, ideally doing something productive CPU code running

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Comparison of CPU and GPU Hardware Architecture CPU: Cache heavy, focused on individual thread performance GPU: ALU heavy, massively parallel, throughput oriented

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPU: Throughput-Oriented Hardware Architecture GPUs have very small on-chip caches Main memory latency (several hundred clock cycles!) is tolerated through hardware multithreading – overlap memory transfer latency with execution of other work When a GPU thread stalls on a memory operation, the hardware immediately switches context to a ready thread Effective latency hiding requires saturating the GPU with lots of work – tens of thousands of independent work items

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPU Memory Systems GPU arithmetic rates dwarf memory bandwidth For Kepler K20 hardware: –~2 TFLOPS vs. ~250 GB/sec –The ratio is roughly 40 FLOPS per memory reference for single-precision floating point GPUs include multiple fast on-chip memories to help narrow the gap: –Registers –Constant memory (64KB) –Shared memory (48KB / 16KB) –Read-only data cache / Texture cache (48KB)

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPUs Require ~20,000 Independent Threads for Full Utilization, Latency Hidding GPU underutilized GPU fully utilized, ~40x faster than CPU Accelerating molecular modeling applications with graphics processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28: , Lower is better Host thread GPU Cold Start: context init, device binding, kernel PTX JIT: ~110ms

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign SP SFU SP SFU Instruction Fetch/Dispatch Instruction L1Data L1 Texture Processor Cluster SM Shared Memory Streaming Processor Array Streaming Multiprocessor Texture Unit Streaming Processor ADD, SUB MAD, Etc… Special Function Unit SIN, EXP, RSQRT, Etc… TPC SM Constant Cache Read-only, 8kB spatial cache, 1/2/3-D interpolation 64kB, read-only FP64 Unit FP64 Unit (double precision) NVIDIA GT200

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Graphics Processor Cluster NVIDIA Fermi GPU Streaming Multiprocessor GPC 768KB Level 2 Cache SP SFU SM LDST SM Tex Texture Cache 64 KB L1 Cache / Shared Memory ~3-6 GB DRAM Memory w/ ECC 64KB Constant Cache

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign NVIDIA Kepler GPU Streaming Multiprocessor - SMX GPC 1536KB Level 2 Cache SMX Tex Unit 48 KB Tex + Read-only Data Cache 64 KB L1 Cache / Shared Memory ~3-6 GB DRAM Memory w/ ECC 64 KB Constant Cache SP DP SFU LDST SP DP 16 × Execution block = 192 SP, 64 DP, 32 SFU, 32 LDST SP DP SFU LDST SP DP Graphics Processor Cluster GPC

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign Acknowledgements Theoretical and Computational Biophysics Group, University of Illinois at Urbana- Champaign NCSA Blue Waters Team NCSA Innovative Systems Lab NVIDIA CUDA Center of Excellence, University of Illinois at Urbana-Champaign The CUDA team at NVIDIA NIH support: P41-RR005969

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPU Computing Publications Lattice Microbes: High ‐ performance stochastic simulation method for the reaction ‐ diffusion master equation. E. Roberts, J. E. Stone, and Z. Luthey ‐ Schulten. J. Computational Chemistry 34 (3), , Fast Visualization of Gaussian Density Surfaces for Molecular Dynamics and Particle System Trajectories. M. Krone, J. E. Stone, T. Ertl, and K. Schulten. EuroVis Short Papers, pp , Immersive Out-of-Core Visualization of Large-Size and Long- Timescale Molecular Dynamics Trajectories. J. Stone, K. Vandivort, and K. Schulten. G. Bebis et al. (Eds.): 7th International Symposium on Visual Computing (ISVC 2011), LNCS 6939, pp. 1-12, Fast Analysis of Molecular Dynamics Trajectories with Graphics Processing Units – Radial Distribution Functions. B. Levine, J. Stone, and A. Kohlmeyer. J. Comp. Physics, 230(9): , 2011.

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPU Computing Publications Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters. J. Enos, C. Steffen, J. Fullop, M. Showerman, G. Shi, K. Esler, V. Kindratenko, J. Stone, J Phillips. International Conference on Green Computing, pp , GPU-accelerated molecular modeling coming of age. J. Stone, D. Hardy, I. Ufimtsev, K. Schulten. J. Molecular Graphics and Modeling, 29: , OpenCL: A Parallel Programming Standard for Heterogeneous Computing. J. Stone, D. Gohara, G. Shi. Computing in Science and Engineering, 12(3):66- 73, An Asymmetric Distributed Shared Memory Model for Heterogeneous Computing Systems. I. Gelado, J. Stone, J. Cabezas, S. Patel, N. Navarro, W. Hwu. ASPLOS ’10: Proceedings of the 15 th International Conference on Architectural Support for Programming Languages and Operating Systems, pp , 2010.

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPU Computing Publications GPU Clusters for High Performance Computing. V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, W. Hwu. Workshop on Parallel Programming on Accelerator Clusters (PPAC), In Proceedings IEEE Cluster 2009, pp. 1-8, Aug Long time-scale simulations of in vivo diffusion using GPU hardware. E. Roberts, J. Stone, L. Sepulveda, W. Hwu, Z. Luthey-Schulten. In IPDPS’09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Computing, pp. 1-8, High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs. J. Stone, J. Saam, D. Hardy, K. Vandivort, W. Hwu, K. Schulten, 2nd Workshop on General-Purpose Computation on Graphics Pricessing Units (GPGPU-2), ACM International Conference Proceeding Series, volume 383, pp. 9-18, Probing Biomolecular Machines with Graphics Processors. J. Phillips, J. Stone. Communications of the ACM, 52(10):34-41, Multilevel summation of electrostatic potentials using graphics processing units. D. Hardy, J. Stone, K. Schulten. J. Parallel Computing, 35: , 2009.

NIH BTRC for Macromolecular Modeling and Bioinformatics Beckman Institute, U. Illinois at Urbana-Champaign GPU Computing Publications Adapting a message-driven parallel application to GPU-accelerated clusters. J. Phillips, J. Stone, K. Schulten. Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, GPU acceleration of cutoff pair potentials for molecular modeling applications. C. Rodrigues, D. Hardy, J. Stone, K. Schulten, and W. Hwu. Proceedings of the 2008 Conference On Computing Frontiers, pp , GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 96: , Accelerating molecular modeling applications with graphics processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28: , Continuous fluorescence microphotolysis and correlation spectroscopy. A. Arkhipov, J. Hüve, M. Kahms, R. Peters, K. Schulten. Biophysical Journal, 93: , 2007.