Programming for Hybrid Architectures


Programming for Hybrid Architectures
John E. Stone
Theoretical and Computational Biophysics Group
Beckman Institute for Advanced Science and Technology
University of Illinois at Urbana-Champaign
http://www.ks.uiuc.edu/Research/gpu/
GPGPU 2015: Advanced Methods for Computing with CUDA, University of Cape Town, April 2015

Solid Growth of GPU-Accelerated Apps

Top HPC applications with GPU acceleration (accelerated or in development), by domain:
- Molecular Dynamics: AMBER, CHARMM, DESMOND, GROMACS, LAMMPS, NAMD
- Quantum Chemistry: Abinit, Gaussian, GAMESS, NWChem
- Material Science: CP2K, QMCPACK, Quantum Espresso, VASP
- Weather & Climate: COSMO, GEOS-5, HOMME, CAM-SE, NEMO, NIM, WRF
- Lattice QCD: Chroma, MILC
- Plasma Physics: GTC, GTS
- Structural Mechanics: ANSYS Mechanical, LS-DYNA Implicit, MSC Nastran, OptiStruct, Abaqus/Standard
- Fluid Dynamics: ANSYS Fluent, Culises (OpenFOAM)

[Chart: number of GPU-accelerated apps, 2011-2013. Courtesy NVIDIA]

Major Approaches for Programming Hybrid Architectures
- Use drop-in libraries in place of CPU-based libraries: little or no code development, but speedups are limited by Amdahl's Law and by the overheads of data movement between CPUs and GPU accelerators. Examples: MAGMA, BLAS variants, FFT libraries, etc.
- Generate accelerator code as a variant of the CPU source, e.g. using OpenMP with OpenACC or similar directive-based methods.
- Write lower-level accelerator-specific code, e.g. using CUDA, OpenCL, or other approaches.

GPU-Accelerated Libraries: “Drop-in” Acceleration for Your Applications

GPU-accelerated libraries provide “drop-in” acceleration for your applications, and a wide variety of computational libraries are available. You are probably familiar with the FFTW Fast Fourier Transform library, the various BLAS linear algebra libraries, the Intel Math Kernel Library, or the Intel Performance Primitives library, to name a few. Compatible GPU-accelerated libraries are available as drop-in replacements for all of these, and many more:
- cuBLAS: NVIDIA's GPU-accelerated BLAS linear algebra library, which includes very fast dense matrix and vector computations.
- cuFFT: NVIDIA's GPU-accelerated Fast Fourier Transform library.
- cuRAND: NVIDIA's GPU-accelerated random number generation library, useful in Monte Carlo algorithms and financial applications, among others.
- cuSPARSE: NVIDIA's sparse matrix library, which includes sparse matrix-vector multiplication, fast sparse linear equation solvers, and tridiagonal solvers.
- NVIDIA Performance Primitives (NPP): literally thousands of signal and image processing functions.
- Thrust: an open-source C++ template library of parallel algorithms such as sorting, reductions, searching, and set operations.
Many more libraries are available from third-party developers and open-source projects, covering areas such as GPU AI for board games and path finding, and video encoding. GPU-accelerated libraries are the most straightforward way to add GPU acceleration to applications, and in many cases their performance cannot be beat. Courtesy NVIDIA.
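
As a concrete illustration of the drop-in idea, here is a hedged sketch (not from the slides; gemm_gpu is a hypothetical wrapper, and error checking is omitted) in which a CPU SGEMM call is replaced by its cuBLAS counterpart:

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* C = A*B for NxN matrices already resident in GPU memory.
       The handle comes from an earlier cublasCreate() call. */
    void gemm_gpu(cublasHandle_t handle, int n,
                  const float *d_A, const float *d_B, float *d_C)
    {
        const float alpha = 1.0f, beta = 0.0f;
        /* cuBLAS uses column-major storage, like Fortran BLAS */
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    }

The inputs must first be moved to GPU memory (e.g. with cudaMemcpy or cublasSetMatrix), which is exactly the data-movement overhead cited in the previous slide.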

OpenACC: Open, Simple, Portable

An open standard with an easy, compiler-driven approach, portable across GPUs and Xeon Phi. The programmer annotates serial code with compiler hints:

    main() {
        ...
        <serial code>
        #pragma acc kernels
        {
            <compute intensive code>
        }
    }

Example result: in the CAM-SE climate code, the top kernel (50% of runtime) runs 6x faster on the GPU.

1. Open standard: supported by PGI, CAPS, Cray, Oak Ridge, Univ. of Tennessee, Univ. of Houston, etc. When you code in OpenACC, it is guaranteed to work across compilers and hardware; it is absolutely portable.
2. Easy: the standard was born out of a subcommittee of OpenMP. It works just like OpenMP but targets even more parallelism: add some compiler hints, and the compiler parallelizes the code for you.
3. Truly portable: code written in OpenACC runs on all x86 + accelerator platforms, including MIC, multicore CPUs, etc. Peace of mind for customers who want to write their code once and run it everywhere.

Courtesy NVIDIA
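
To make the directive approach concrete, here is a minimal, self-contained SAXPY sketch (illustrative, not from the slides; the data clauses and names are assumptions):

    /* y = a*x + y, offloaded with a single OpenACC directive */
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

Built with a directive-aware compiler (e.g. pgcc -acc) the loop is offloaded; a compiler that ignores the pragma still produces a correct serial program.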

Using the CPU to Optimize GPU Performance
- The GPU performs best when the work divides evenly among its threads/processing units.
- Optimization strategy: use the CPU to “regularize” the GPU workload:
  - Use fixed-size bin data structures, with “empty” slots skipped or producing zeroed-out results.
  - Handle exceptional or irregular work units on the CPU while the GPU processes the bulk of the work concurrently.
- On average, the GPU is kept highly occupied, attaining a high fraction of peak performance. A sketch of the fixed-size binning idea follows below.
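
A minimal sketch of the fixed-size binning idea (illustrative only; BIN_SIZE, the bin layout, and the kernel are assumptions, not code from the talk):

    #define BIN_SIZE 8        /* fixed bin capacity chosen for the GPU */
    #define EMPTY    (-1)     /* sentinel marking an unused slot       */

    __device__ float compute(int item)   /* placeholder per-item work */
    {
        return 2.0f * (float)item;
    }

    /* CPU side: scatter items into fixed-size bins, padding with EMPTY;
       items that overflow a bin stay on a host-side list and are
       processed by the CPU concurrently with this kernel. */
    __global__ void process_bins(const int *bins, float *out, int nbins)
    {
        int bin  = blockIdx.x;    /* one block per bin                  */
        int slot = threadIdx.x;   /* one thread per slot: uniform work  */
        if (bin < nbins && slot < BIN_SIZE) {
            int item = bins[bin * BIN_SIZE + slot];
            out[bin * BIN_SIZE + slot] =
                (item == EMPTY) ? 0.0f : compute(item);
        }
    }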

CUDA Grid/Block/Thread Decomposition
- 1-D, 2-D, or 3-D computational domain.
- 1-D, 2-D, or 3-D (on SM 2.x and later) grid of thread blocks.
- 1-D, 2-D, or 3-D thread blocks.
- Padding arrays out to full blocks optimizes global memory performance by guaranteeing memory coalescing. The sketch below illustrates the padded launch.
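
As a minimal illustration of the padded launch (an assumed example, not code from the slides; scale and launch_scale are hypothetical names), the grid is rounded up to whole blocks and out-of-range threads do nothing:

    __global__ void scale(float *data, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                /* threads in the padding do nothing */
            data[i] *= s;
    }

    void launch_scale(float *d_data, int n, float s)
    {
        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);  /* round up to full blocks */
        scale<<<grid, block>>>(d_data, n, s);
    }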

Avoiding Shared Memory Bank Conflicts: Array of Structures (AOS) vs. Structure of Arrays (SOA)

AOS:

    typedef struct {
        float x;
        float y;
        float z;
    } myvec;
    myvec aos[1024];

    aos[threadIdx.x].x = 0;
    aos[threadIdx.x].y = 0;

SOA:

    typedef struct {
        float x[1024];
        float y[1024];
        float z[1024];
    } myvecs;
    myvecs soa;

    soa.x[threadIdx.x] = 0;
    soa.y[threadIdx.x] = 0;

With AOS, consecutive threads access words strided by the structure size, which causes shared memory bank conflicts whenever that stride shares a factor with the number of banks; with SOA, consecutive threads access consecutive words, which never conflicts.

Time-Averaged Electrostatics Analysis on an Energy-Efficient GPU Cluster
- 1.5-hour job (CPUs) reduced to 3 minutes (CPUs+GPU).
- Electrostatics of thousands of trajectory frames averaged.
- Per-node power consumption on the NCSA “AC” GPU cluster:
  - CPUs only: 448 watt-hours
  - CPUs+GPUs: 43 watt-hours
- GPU speedup: 25.5x; power efficiency gain: 10.5x.

Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters. J. Enos, C. Steffen, J. Fullop, M. Showerman, G. Shi, K. Esler, V. Kindratenko, J. Stone, J. Phillips. The Work in Progress in Green Computing, pp. 317-324, 2010.

AC Cluster GPU Performance and Power Efficiency Results

    Application   GPU speedup   Host watts   Host+GPU watts   Perf/watt gain
    NAMD               6            316            681              2.8
    VMD               25            299            742             10.5
    MILC              20            225            555              8.1
    QMCPACK           61            314            853             22.6

Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters. J. Enos, C. Steffen, J. Fullop, M. Showerman, G. Shi, K. Esler, V. Kindratenko, J. Stone, J. Phillips. The Work in Progress in Green Computing, pp. 317-324, 2010.

Optimizing GPU Algorithms for Power Consumption
- NVIDIA “Carma”, “Kayla”, and “Jetson” single-board computers
- Tegra+GPU energy efficiency testbed

Time-Averaged Electrostatics Analysis on NCSA Blue Waters

    NCSA Blue Waters node type                            Seconds per trajectory frame
                                                          for one compute node
    Cray XE6 compute node:
      32 CPU cores (2x AMD 6200 CPUs)                     9.33
    Cray XK6 GPU-accelerated compute node:
      16 CPU cores + NVIDIA X2090 (Fermi) GPU             2.25

- Speedup for GPU XK6 nodes vs. CPU XE6 nodes: XK6 nodes are 4.15x faster overall.
- Tests on XK7 nodes indicate that MSM is CPU-bound with the Kepler K20X GPU: performance is not much faster (yet) than the Fermi X2090. We need to move spatial hashing, prolongation, and interpolation onto the GPU; this is in progress. XK7 nodes are 4.3x faster overall.

Preliminary performance for VMD time-averaged electrostatics with the Multilevel Summation Method on the NCSA Blue Waters Early Science System.

Multilevel Summation on the GPU

Accelerate the short-range cutoff and lattice cutoff parts. Performance profile for a 0.5 Å map of the potential for 1.5 M atoms; hardware platform is an Intel QX6700 CPU and an NVIDIA GTX 280.

    Computational steps        CPU (s)   w/ GPU (s)   Speedup
    Short-range cutoff          480.07      14.87       32.3
    Long-range:
      anterpolation               0.18
      restriction                 0.16
      lattice cutoff             49.47       1.36       36.4
      prolongation                0.17
      interpolation               3.47
    Total                       533.52      20.21       26.4

The short-range cutoff and lattice cutoff calculations are the most computationally demanding parts of multilevel summation and have the most data parallelism to exploit. The short-range part is just cutoff summation on the GPU, using a switching function particular to multilevel summation. For the lattice cutoff part, the pairwise distances between lattice points are precomputed, so the calculation simplifies to accumulating at each lattice point a sum of the “weighted” surrounding charges. We accelerate this by streaming blocks of lattice charges from global memory into shared memory and reading the fixed weights from GPU constant memory.

Multilevel summation of electrostatic potentials using graphics processing units. D. Hardy, J. Stone, K. Schulten. J. Parallel Computing, 35:164-177, 2009.
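
As a rough illustration of the shared/constant-memory pattern described above, here is a heavily simplified 1-D sketch (the real MSM lattice-cutoff kernel is 3-D and considerably more involved; RADIUS, TILE, and the kernel body are assumptions, not the VMD/MSM code):

    #define RADIUS 4      /* cutoff radius in lattice points (assumed) */
    #define TILE   128    /* launch with blockDim.x == TILE            */

    __constant__ float c_weight[2 * RADIUS + 1];  /* precomputed weights */

    __global__ void lattice_cutoff_1d(const float *q, float *pot, int n)
    {
        __shared__ float s_q[TILE + 2 * RADIUS];   /* tile plus halo */
        int gi = blockIdx.x * TILE + threadIdx.x;  /* this thread's point */

        /* stage tile + halo of lattice charges into shared memory */
        int lo = blockIdx.x * TILE - RADIUS;
        for (int i = threadIdx.x; i < TILE + 2 * RADIUS; i += blockDim.x) {
            int src = lo + i;
            s_q[i] = (src >= 0 && src < n) ? q[src] : 0.0f;
        }
        __syncthreads();

        if (gi < n) {
            float sum = 0.0f;
            for (int j = -RADIUS; j <= RADIUS; j++)  /* weighted neighbor sum */
                sum += c_weight[j + RADIUS] * s_q[threadIdx.x + RADIUS + j];
            pot[gi] += sum;
        }
    }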

Multi-GPU NUMA Architectures: Example of a “Balanced” PCIe Topology
- NUMA: host threads should be pinned to the CPU that is “closest” to their target GPU.
- GPUs on the same PCIe I/O hub (IOH) can use the CUDA peer-to-peer transfer APIs.
- On Intel platforms, GPUs on different IOHs can't use peer-to-peer transfers.

Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten. Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009
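
A minimal sketch of the peer-to-peer API usage mentioned above (device numbers and the peer_copy helper are illustrative; error checking omitted):

    #include <cuda_runtime.h>

    /* copy 'bytes' from a buffer on GPU 1 to a buffer on GPU 0,
       using direct peer DMA when the PCIe topology allows it */
    void peer_copy(void *d_dst0, const void *d_src1, size_t bytes)
    {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1); /* can GPU 0 reach GPU 1? */
        if (canAccess) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);      /* once per device pair */
        }
        /* falls back to staging through host memory if peer access is off */
        cudaMemcpyPeer(d_dst0, 0, d_src1, 1, bytes);
    }

Pinning the host threads themselves is done outside CUDA, e.g. with numactl or sched_setaffinity, so that each thread runs on the socket closest to its GPU.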

Multi-GPU NUMA Architectures: Example of a Very “Unbalanced” PCIe Topology
- CPU 2 will overwhelm its QPI/HT link with host-GPU DMAs.
- Poor scalability compared to a balanced PCIe topology.

Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten. Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009

Multi-GPU NUMA Architectures
- GPU-to-GPU peer DMA operations are much more performant than other approaches, particularly for moderately sized transfers.
- They are likely to perform even better in future multi-GPU cards with direct GPU links, e.g. the announced “NVLink”.

Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten. Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009

GPU PCI-Express DMA

Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten. Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009
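
The slide's bandwidth figure can't be reproduced in transcript form; the key software-side requirement it reflects (a hedged sketch, with alloc_pinned being an illustrative helper name) is that host buffers be page-locked so the GPU's DMA engines can access them directly:

    #include <cuda_runtime.h>

    /* allocate a page-locked (pinned) host buffer so the GPU's DMA
       engines can access it directly, enabling full PCIe bandwidth
       and asynchronous transfers */
    float *alloc_pinned(size_t n)
    {
        float *h_buf = NULL;
        cudaMallocHost((void **)&h_buf, n * sizeof(float));
        return h_buf;   /* release later with cudaFreeHost(h_buf) */
    }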

Single CUDA Execution “Stream”
- The host CPU thread launches a CUDA “kernel”, a memory copy, etc. on the GPU.
- The GPU action runs to completion.
- The host synchronizes with the completed GPU action.

Timeline: CPU code runs, then the GPU action runs while the CPU waits for the GPU (ideally doing something productive instead), then CPU code runs again. A minimal code sketch follows below.
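
A minimal sketch of this single-stream sequence (an assumed example; kernel and run_once are illustrative names):

    #include <cuda_runtime.h>

    __global__ void kernel(const float *in, float *out, int n);  /* assumed */

    void run_once(const float *h_in, float *h_out,
                  float *d_in, float *d_out, int n)
    {
        size_t bytes = n * sizeof(float);
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   /* host -> GPU  */
        kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);        /* async launch */
        /* the CPU could do productive work here while the GPU computes */
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); /* waits, copies */
    }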

Multiple CUDA Streams: Overlapping Compute and DMA Operations

Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten. Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009
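
The slide's figure isn't reproducible here, but the pattern it illustrates can be sketched as follows (a minimal assumed example: pinned host buffers plus two streams, so one chunk's DMA overlaps another chunk's kernel):

    #include <cuda_runtime.h>

    __global__ void kernel(const float *in, float *out, int n);  /* assumed */

    void pipelined(float *h_in, float *h_out, float *d_in, float *d_out,
                   int n, int chunk)
    {
        /* h_in/h_out must be pinned (cudaMallocHost) for async copies */
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int off = 0, k = 0; off < n; off += chunk, k++) {
            cudaStream_t st = s[k % 2];       /* alternate between streams */
            int m = (off + chunk < n) ? chunk : n - off;
            size_t b = m * sizeof(float);
            cudaMemcpyAsync(d_in + off, h_in + off, b,
                            cudaMemcpyHostToDevice, st);
            kernel<<<(m + 255) / 256, 256, 0, st>>>(d_in + off, d_out + off, m);
            cudaMemcpyAsync(h_out + off, d_out + off, b,
                            cudaMemcpyDeviceToHost, st);
        }
        cudaStreamSynchronize(s[0]);
        cudaStreamSynchronize(s[1]);
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }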

Acknowledgements
- Theoretical and Computational Biophysics Group, University of Illinois at Urbana-Champaign
- NVIDIA CUDA Center of Excellence, University of Illinois at Urbana-Champaign
- NVIDIA CUDA team
- NCSA Blue Waters Team
- Funding: NSF OCI 07-25070; NSF PRAC “The Computational Microscope”; NIH support: 9P41GM104601, 5R01GM098243-02

GPU Computing Publications
http://www.ks.uiuc.edu/Research/gpu/

- Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications. Javier Cabezas, Isaac Gelado, John E. Stone, Nacho Navarro, David B. Kirk, and Wen-mei Hwu. IEEE Transactions on Parallel and Distributed Systems, 26(5):1405-1418, 2015.
- Unlocking the Full Potential of the Cray XK7 Accelerator. Mark Klein and John E. Stone. Cray Users Group, Lugano, Switzerland, May 2014.
- Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten. Journal of Parallel Computing, 40:86-99, 2014.
- GPU-Accelerated Analysis and Visualization of Large Structures Solved by Molecular Dynamics Flexible Fitting. John E. Stone, Ryan McGreevy, Barry Isralewitz, and Klaus Schulten. Faraday Discussions, 169:265-283, 2014.
- GPU-Accelerated Molecular Visualization on Petascale Supercomputing Platforms. J. Stone, K. L. Vandivort, and K. Schulten. UltraVis'13: Proceedings of the 8th International Workshop on Ultrascale Visualization, pp. 6:1-6:8, 2013.
- Early Experiences Scaling VMD Molecular Visualization and Analysis Jobs on Blue Waters. J. E. Stone, B. Isralewitz, and K. Schulten. Extreme Scaling Workshop (XSW), pp. 43-50, 2013.
- Lattice Microbes: High-performance stochastic simulation method for the reaction-diffusion master equation. E. Roberts, J. E. Stone, and Z. Luthey-Schulten. J. Computational Chemistry, 34(3):245-255, 2013.

GPU Computing Publications
http://www.ks.uiuc.edu/Research/gpu/

- Fast Visualization of Gaussian Density Surfaces for Molecular Dynamics and Particle System Trajectories. M. Krone, J. E. Stone, T. Ertl, and K. Schulten. EuroVis Short Papers, pp. 67-71, 2012.
- Fast Analysis of Molecular Dynamics Trajectories with Graphics Processing Units – Radial Distribution Functions. B. Levine, J. Stone, and A. Kohlmeyer. J. Comp. Physics, 230(9):3556-3569, 2011.
- Immersive Out-of-Core Visualization of Large-Size and Long-Timescale Molecular Dynamics Trajectories. J. Stone, K. Vandivort, and K. Schulten. G. Bebis et al. (Eds.): 7th International Symposium on Visual Computing (ISVC 2011), LNCS 6939, pp. 1-12, 2011.
- Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters. J. Enos, C. Steffen, J. Fullop, M. Showerman, G. Shi, K. Esler, V. Kindratenko, J. Stone, J. Phillips. International Conference on Green Computing, pp. 317-324, 2010.
- GPU-accelerated molecular modeling coming of age. J. Stone, D. Hardy, I. Ufimtsev, K. Schulten. J. Molecular Graphics and Modeling, 29:116-125, 2010.
- OpenCL: A Parallel Programming Standard for Heterogeneous Computing. J. Stone, D. Gohara, G. Shi. Computing in Science and Engineering, 12(3):66-73, 2010.

GPU Computing Publications
http://www.ks.uiuc.edu/Research/gpu/

- An Asymmetric Distributed Shared Memory Model for Heterogeneous Computing Systems. I. Gelado, J. Stone, J. Cabezas, S. Patel, N. Navarro, W. Hwu. ASPLOS '10: Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 347-358, 2010.
- GPU Clusters for High Performance Computing. V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, W. Hwu. Workshop on Parallel Programming on Accelerator Clusters (PPAC), in Proceedings of IEEE Cluster 2009, pp. 1-8, Aug. 2009.
- Long time-scale simulations of in vivo diffusion using GPU hardware. E. Roberts, J. Stone, L. Sepulveda, W. Hwu, Z. Luthey-Schulten. IPDPS'09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Computing, pp. 1-8, 2009.
- High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs. J. Stone, J. Saam, D. Hardy, K. Vandivort, W. Hwu, K. Schulten. 2nd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-2), ACM International Conference Proceeding Series, volume 383, pp. 9-18, 2009.
- Probing Biomolecular Machines with Graphics Processors. J. Phillips, J. Stone. Communications of the ACM, 52(10):34-41, 2009.
- Multilevel summation of electrostatic potentials using graphics processing units. D. Hardy, J. Stone, K. Schulten. J. Parallel Computing, 35:164-177, 2009.

GPU Computing Publications
http://www.ks.uiuc.edu/Research/gpu/

- Adapting a message-driven parallel application to GPU-accelerated clusters. J. Phillips, J. Stone, K. Schulten. Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, 2008.
- GPU acceleration of cutoff pair potentials for molecular modeling applications. C. Rodrigues, D. Hardy, J. Stone, K. Schulten, and W. Hwu. Proceedings of the 2008 Conference on Computing Frontiers, pp. 273-282, 2008.
- GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 96:879-899, 2008.
- Accelerating molecular modeling applications with graphics processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28:2618-2640, 2007.
- Continuous fluorescence microphotolysis and correlation spectroscopy. A. Arkhipov, J. Hüve, M. Kahms, R. Peters, K. Schulten. Biophysical Journal, 93:4006-4017, 2007.