OpenCL for Molecular Modeling Applications: Early Experiences
John Stone, University of Illinois at Urbana-Champaign
SC2009 OpenCL BOF, November 18, 2009
© John E. Stone, 2009

Overview
- Molecular modeling applications: the need for higher performance and increased energy efficiency
- Potential benefits of OpenCL for molecular modeling applications
- Early experiences with OpenCL 1.0 beta implementations
- Some crude performance results using existing OpenCL toolkits, both production and alpha/beta
- Detailed comparison of some OpenCL kernels targeting CPUs, GPUs, Cell, etc.
- OpenCL middleware opportunities and existing efforts

Computational Biology's Insatiable Demand for Computational Power
- Many simulations still fall short of biological timescales
- Large simulations are extremely difficult to prepare and analyze
- Performance increases allow the use of more sophisticated models

CUDA+OpenCL Acceleration in VMD
- Molecular orbital calculation and display: 100x to 120x faster
- Imaging of gas migration pathways in proteins with implicit ligand sampling: 20x to 30x faster
- Electrostatic field calculation and ion placement: 20x to 44x faster

Supporting Diverse Accelerator Hardware in Production Codes
- Development of HPC-oriented scientific software is already challenging
- Maintaining unique code paths for each accelerator type is costly and impractical beyond a certain point
- The diversity and rapid evolution of accelerators exacerbate these issues
- OpenCL ameliorates several key problems:
  - Targets CPUs, GPUs, and other accelerator devices
  - Common language for writing computational "kernels"
  - Common API for managing execution on the target device

Strengths and Weaknesses of Current OpenCL Implementations
- Code is portable to multiple OpenCL device types
  - Correctly written OpenCL code will run and yield correct results on multiple devices
- Performance is not necessarily portable
  - Some OpenCL API/language constructs naturally map better to some target devices than others
  - OpenCL can't hide significant differences in hardware architecture
- Room for improvement in existing OpenCL compilers:
  - Sophisticated batch-mode compilers have long provided autovectorization and high-end optimization techniques
  - Current alpha/beta OpenCL implementations aren't quite there yet
  - Some compiler optimization techniques might be too slow to be made available in the typical runtime-compilation usage of OpenCL

Strengths and Weaknesses of Current OpenCL Implementations (2)
- OpenCL is a low-level API
  - Freedom to wire up your code in a variety of ways
  - Developers are responsible for a lot of plumbing, with many objects/handles to keep track of
  - A basic OpenCL "hello world" is much more code to write than the same thing in the CUDA runtime API (a minimal setup sketch follows)
- Developers are responsible for enforcing thread safety in many cases
  - Some multi-accelerator codes are much more difficult to write than in the CUDA runtime API
- Great need for OpenCL middleware and/or libraries
  - Simplified device management, integration of large numbers of kernels into legacy apps, auto-selection of the best kernels for a device...
  - Tools to better support OpenCL apps in large HPC environments, e.g. clusters (more on this at the end of my talk)
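To make the "plumbing" point concrete, here is a minimal sketch of the host-side setup a trivial OpenCL program needs before a kernel can run. It is an illustration, not code from the talk: error handling is omitted, and the kernel source and names are placeholders.

/* Minimal OpenCL host-side plumbing sketch (error checks omitted). */
#include <CL/cl.h>

const char *hello_src =
  "__kernel void hello(__global float *out) {"
  "  out[get_global_id(0)] = 1.0f;"
  "}";

int main(void) {
  cl_platform_id platform;
  cl_device_id dev;
  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

  cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
  cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

  /* OpenCL typically compiles kernels at runtime, per target device */
  cl_program prog = clCreateProgramWithSource(ctx, 1, &hello_src, NULL, NULL);
  clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
  cl_kernel k = clCreateKernel(prog, "hello", NULL);

  cl_mem buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, 64 * sizeof(float), NULL, NULL);
  clSetKernelArg(k, 0, sizeof(cl_mem), &buf);

  size_t global = 64;
  clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
  clFinish(q);

  /* ...read results back, then release every object created above */
  clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
  clReleaseCommandQueue(q); clReleaseContext(ctx);
  return 0;
}

The equivalent CUDA runtime API program needs only a kernel launch and a few memory management calls, which is the asymmetry the slide is pointing at.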

Electrostatic Potential Maps
- Electrostatic potentials evaluated on a 3-D lattice (equation below)
- Applications include:
  - Ion placement for structure building
  - Time-averaged potentials for simulation
  - Visualization and analysis
[Slide image: isoleucine tRNA synthetase]
Accelerating molecular modeling applications with graphics processors. Stone et al., J. Comp. Chem., 28:2618-2640, 2007.
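The quantity evaluated at each lattice point is the direct Coulomb sum, reconstructed here in its standard form (the slide showed the formula as an image):

V_i = \sum_j \frac{q_j}{4 \pi \varepsilon_0 \, r_{ij}}

where q_j is the partial charge of atom j and r_{ij} is the distance between lattice point i and atom j.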

Direct Coulomb Summation in OpenCL, Building Block for Better Algorithms
[Slide diagram: the host supplies atomic coordinates and charges to the OpenCL device; the device view shows constant memory holding the atom data, plus global memory, texture memory, and one local memory per work group.]
- Work items compute up to 8 potentials each, skipping by the coalesced memory width (see the launch sketch below)
- Work groups: blocks of work items executed together
- NDRange contains all work items, decomposed into work groups
- Lattice padding rounds the NDRange up to work-group boundaries
Stone et al., J. Comp. Chem., 28:2618-2640, 2007.
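A sketch of how the padded NDRange described above might be set up on the host. This is an assumption-laden fragment for illustration: UNROLLX, the work-group shape, and the variable names are not from the original code.

/* Hypothetical host-side launch for the DCS kernel (fragment). */
#define UNROLLX 8                 /* each work item computes 8 lattice points */

size_t local[2]  = { 16, 8 };     /* assumed work-group shape */
size_t global[2];

/* Pad each global dimension up to a multiple of the work-group size;
   the kernel strides by the coalesced memory width across its 8 points. */
global[0] = ((volsize_x / UNROLLX + local[0] - 1) / local[0]) * local[0];
global[1] = ((volsize_y + local[1] - 1) / local[1]) * local[1];

clEnqueueNDRangeKernel(cmdq, dcskernel, 2, NULL, global, local, 0, NULL, NULL);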

DCS Inner Loop, Scalar OpenCL

...
for (atomid=0; atomid<numatoms; atomid++) {
  float dy = coory - atominfo[atomid].y;
  /* atominfo[atomid].z holds a precomputed z-distance term for this slice */
  float dyz2 = (dy * dy) + atominfo[atomid].z;
  float dx1 = coorx - atominfo[atomid].x;
  float dx2 = dx1 + gridspacing_coalesce;
  float dx3 = dx2 + gridspacing_coalesce;
  float dx4 = dx3 + gridspacing_coalesce;
  float charge = atominfo[atomid].w;
  energyvalx1 += charge * native_rsqrt(dx1*dx1 + dyz2);
  energyvalx2 += charge * native_rsqrt(dx2*dx2 + dyz2);
  energyvalx3 += charge * native_rsqrt(dx3*dx3 + dyz2);
  energyvalx4 += charge * native_rsqrt(dx4*dx4 + dyz2);
}

Well-written CUDA code can often be easily ported to OpenCL if C++ features and pointer arithmetic aren't used in kernels.

DCS Inner Loop, Vectorized OpenCL

float4 gridspacing_u4 = { 0.f, 1.f, 2.f, 3.f };
gridspacing_u4 *= gridspacing_coalesce;
float4 energyvalx = 0.0f;
...
for (atomid=0; atomid<numatoms; atomid++) {
  float dy = coory - atominfo[atomid].y;
  float dyz2 = (dy * dy) + atominfo[atomid].z;
  float4 dx = gridspacing_u4 + (coorx - atominfo[atomid].x);
  float charge = atominfo[atomid].w;
  energyvalx += charge * native_rsqrt(dx*dx + dyz2);
}

CPUs, AMD GPUs, and Cell often perform better with vectorized kernels. Use of vector types may increase register pressure; sometimes a delicate balance...

Apples to Oranges Performance Results: OpenCL Direct Coulomb Summation Kernels

Scalar kernel: ported from the original CUDA kernel. 4-vector kernel: manually unrolled loop iterations replaced with float4 vector ops. FLOP accounting: MADD, RSQRT = 2 FLOPS; all other FP instructions = 1 FLOP. Values lost from the transcript are marked "...".

OpenCL Target Device (OpenCL "cores")                       | Scalar Kernel                     | 4-Vector Kernel
AMD 2.2GHz Opteron 148 CPU, a very old Linux test box (...) | ... Bevals/sec, 2.19 GFLOPS       | 0.49 Bevals/sec, 3.59 GFLOPS
Intel 2.2GHz Core2 Duo, Apple MacBook Pro (...)             | ... Bevals/sec, 6.55 GFLOPS       | 2.38 Bevals/sec, ... GFLOPS
IBM QS22 CellBE *** (...)                                   | 2.33 Bevals/sec, ... GFLOPS ****  | 6.21 Bevals/sec, ... GFLOPS ****
AMD Radeon 4870 GPU (...)                                   | ... Bevals/sec, ... GFLOPS        | ... Bevals/sec, ... GFLOPS
NVIDIA GeForce GTX 285 GPU (...)                            | ... Bevals/sec, ... GFLOPS        | ... Bevals/sec, ... GFLOPS

*** __constant not implemented yet

Getting More Performance: Adapting the DCS Kernel to OpenCL on Cell

OpenCL Target Device: IBM QS22 CellBE *** (__constant not implemented)
- Scalar kernel (ported directly from the original CUDA kernel): 2.33 Bevals/sec, ... GFLOPS ****
- 4-vector kernel (manually unrolled loop iterations replaced with float4 vector ops): 6.21 Bevals/sec, ... GFLOPS ****
- Async copy kernel (__constant accesses replaced with async_work_group_copy(), float16 vector ops): ... Bevals/sec, ... GFLOPS

Replacing the use of constant memory with loads of atom data into __local memory via async_work_group_copy() increases performance significantly, since Cell doesn't implement __constant memory yet (see the kernel sketch below). Tests show that the speed of native_rsqrt() is currently a performance limiter on Cell; replacing native_rsqrt() with a multiply results in a ~3x increase in execution rate.
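A sketch of the async-copy staging pattern described above. The tile size, argument list, and names are illustrative assumptions, not the original Cell kernel.

/* Illustrative OpenCL C fragment: stage atom data into __local memory
   with async_work_group_copy() instead of reading __constant memory. */
#define TILE 256

__kernel void dcs_async(__global float4 *atoms, int numatoms,
                        __global float *potentials) {
  __local float4 atomtile[TILE];
  for (int base = 0; base < numatoms; base += TILE) {
    int n = min(TILE, numatoms - base);
    /* the whole work group cooperatively copies one tile of atoms */
    event_t ev = async_work_group_copy(atomtile, atoms + base, n, 0);
    wait_group_events(1, &ev);
    for (int i = 0; i < n; i++) {
      /* ...accumulate potential contributions from atomtile[i],
         as in the scalar/vector DCS inner loops shown earlier... */
    }
    barrier(CLK_LOCAL_MEM_FENCE);  /* don't overwrite the tile while in use */
  }
  /* ...write accumulated potentials to the output lattice... */
}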

Computing Molecular Orbitals
- Visualization of MOs aids in understanding the chemistry of a molecular system
- MO spatial distribution is correlated with electron probability density (see the equation below)
- Calculation of high-resolution MO grids can require tens to hundreds of seconds on CPUs
- A >100x speedup allows interactive animation at 10 FPS (example shown on the slide: C60)
High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs. Stone et al., GPGPU-2, ACM International Conference Proceeding Series, volume 383, pp. 9-18, 2009.
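For reference, the MO amplitude evaluated at each lattice point has the standard contracted-Gaussian form; this is a reconstruction, not a formula copied from the slide:

\Psi(\mathbf{r}) = \sum_i c_i \, \phi_i(\mathbf{r}), \qquad \phi_i(\mathbf{r}) = P_i(x,y,z) \sum_p w_p \, e^{-\alpha_p |\mathbf{r} - \mathbf{R}_i|^2}

Here the c_i are the wavefunction coefficients (wave_f in the kernels on the following slides), P_i is the angular polynomial selected by the shell type (the S_SHELL/P_SHELL cases), and w_p, \alpha_p are the contraction coefficients and exponents stored in the basis array, with R_i the atom center.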

Molecular Orbital Inner Loop, Hand-Coded SSE
Hard to read, isn't it? (And this is the "pretty" version!)

for (shell=0; shell < maxshell; shell++) {
  __m128 Cgto = _mm_setzero_ps();
  for (prim=0; prim < num_prim_per_shell[shell_counter]; prim++) {
    float exponent = -basis_array[prim_counter];
    float contract_coeff = basis_array[prim_counter + 1];
    __m128 expval = _mm_mul_ps(_mm_load_ps1(&exponent), dist2);
    __m128 ctmp = _mm_mul_ps(_mm_load_ps1(&contract_coeff), exp_ps(expval));
    Cgto = _mm_add_ps(Cgto, ctmp);
    prim_counter += 2;
  }
  __m128 tshell = _mm_setzero_ps();
  switch (shell_types[shell_counter]) {
    case S_SHELL:
      value = _mm_add_ps(value, _mm_mul_ps(_mm_load_ps1(&wave_f[ifunc++]), Cgto));
      break;
    case P_SHELL:
      tshell = _mm_add_ps(tshell, _mm_mul_ps(_mm_load_ps1(&wave_f[ifunc++]), xdist));
      tshell = _mm_add_ps(tshell, _mm_mul_ps(_mm_load_ps1(&wave_f[ifunc++]), ydist));
      tshell = _mm_add_ps(tshell, _mm_mul_ps(_mm_load_ps1(&wave_f[ifunc++]), zdist));
      value = _mm_add_ps(value, _mm_mul_ps(tshell, Cgto));
      break;
    /* ...higher shell types omitted on the slide... */
  }
}

Until now, writing SSE kernels for CPUs required assembly language, compiler intrinsics, various libraries, or a really smart autovectorizing compiler and lots of luck...

Molecular Orbital Inner Loop, OpenCL Vec4
Ahhh, much easier to read!!!

for (shell=0; shell < maxshell; shell++) {
  float4 contracted_gto = 0.0f;
  for (prim=0; prim < const_num_prim_per_shell[shell_counter]; prim++) {
    float exponent = const_basis_array[prim_counter];
    float contract_coeff = const_basis_array[prim_counter + 1];
    contracted_gto += contract_coeff * native_exp2(-exponent*dist2);
    prim_counter += 2;
  }
  float4 tmpshell = 0.0f;
  switch (const_shell_symmetry[shell_counter]) {
    case S_SHELL:
      value += const_wave_f[ifunc++] * contracted_gto;
      break;
    case P_SHELL:
      tmpshell += const_wave_f[ifunc++] * xdist;
      tmpshell += const_wave_f[ifunc++] * ydist;
      tmpshell += const_wave_f[ifunc++] * zdist;
      value += tmpshell * contracted_gto;
      break;
    /* ...higher shell types omitted on the slide... */
  }
}

OpenCL's C-like kernel language is easy to read; even 4-way vectorized kernels can look similar to scalar CPU code. (On the original slide all 4-way vector quantities were shown in green; here they are the float4 variables: contracted_gto, tmpshell, value, and the distance vectors.)

Apples to Oranges Performance Results: OpenCL Molecular Orbital Kernels

Kernel                                      | Cores | Runtime (s) | Speedup
Intel QX6700 CPU, ICC-SSE (SSE intrinsics)  | ...   | ...         | ...
Intel Core2 Duo CPU, OpenCL scalar          | ...   | ...         | ...
Intel QX6700 CPU, ICC-SSE (SSE intrinsics)  | ...   | ...         | ...
Intel Core2 Duo CPU, OpenCL vec4            | ...   | ...         | ...
Cell, OpenCL vec4 *** (no __constant)       | ...   | ...         | ...
Radeon 4870, OpenCL scalar                  | ...   | ...         | ...
Radeon 4870, OpenCL vec4                    | ...   | ...         | ...
GeForce GTX 285, OpenCL vec4                | ...   | ...         | ...
GeForce GTX 285, CUDA 2.1 scalar            | ...   | ...         | ...
GeForce GTX 285, OpenCL scalar              | ...   | ...         | ...
GeForce GTX 285, CUDA 2.0 scalar            | ...   | ...         | ...

Minor variations in compiler quality can have a strong effect on "tight" kernels. The two results shown for CUDA demonstrate performance variability across compiler revisions, and show that, with vendor effort, OpenCL has the potential to match the performance of other APIs.

MCUDA for OpenCL
- MCUDA ("Multicore CUDA") is a compilation framework originally developed to retarget CUDA kernels to multicore CPUs
  - "MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores", J. Stratton, S. Stone, W. Hwu. Technical report, University of Illinois at Urbana-Champaign, IMPACT-08-01, March 2008.
- Potential extensions to MCUDA for OpenCL:
  - Make OpenCL performance portable: generate vectorized OpenCL kernels from scalar CUDA or OpenCL kernels, as needed by specific target devices (a hypothetical before/after pair follows)
  - Translate CUDA kernels to OpenCL
- Availability:
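To illustrate the kind of scalar-to-vector rewrite such a tool would automate, here is a hypothetical before/after pair. This is invented for illustration, not MCUDA output:

/* Input: scalar OpenCL kernel, one element per work item */
__kernel void scale_scalar(__global float *a, float s) {
  int i = get_global_id(0);
  a[i] = s * a[i];
}

/* Generated: float4-vectorized kernel, four elements per work item;
   the host shrinks the NDRange by 4x to match */
__kernel void scale_vec4(__global float4 *a, float s) {
  int i = get_global_id(0);
  a[i] = s * a[i];
}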

OpenCL on GPU Clusters at NCSA
- Lincoln: ... CPUs, 384 Tesla GPUs
  - Production system available via NCSA/TeraGrid HPC allocation
- AC: ... CPUs, 128 Tesla GPUs
  - Available for GPU development & experimentation

NCSA CUDA/OpenCL Wrapper Library
- Virtualizes accelerator devices
- Workload-manager control of device visibility, access, and resource allocation
- Transparent monitoring of application and accelerator activity:
  - Measure accelerator utilization and performance by individual HPC codes
  - Track accelerator and API usage by the user community
- Rapid implementation and evaluation of HPC-relevant features not available in the standard CUDA/OpenCL APIs
- Hope for eventual uptake of proven features by vendors and standards organizations
[Slide diagram: an HPC application using CUDA or OpenCL calls into the CUDA+OpenCL wrapper libraries, which forward to the CUDA driver+runtime and OpenCL runtime libraries; the workload manager and cluster monitoring hook into the wrapper layer.]
Kindratenko et al., Workshop on Parallel Programming on Accelerator Clusters (PPAC), IEEE Cluster 2009.

NCSA CUDA/OpenCL Wrapper Library
- Principle of operation:
  - Use /etc/ld.so.preload to overload (intercept) a subset of the CUDA/OpenCL device management APIs, e.g. cudaSetDevice(), clGetDeviceIDs(), etc. (an interposer sketch follows)
- Features:
  - NUMA affinity mapping:
    - Sets application thread affinity to the CPU core nearest the target device
    - Maximizes host-device transfer bandwidth, particularly on multi-GPU hosts
  - Shared-host, multi-GPU device fencing:
    - Only GPUs allocated by the batch scheduler are visible / accessible to the application
    - GPU device IDs are virtualized, with a fixed mapping to a physical device per user environment
    - The user always sees allocated GPU devices indexed from 0
  - Device ID rotation:
    - The virtual-to-physical device mapping is rotated for each process accessing a GPU device
    - Allows common execution parameters (e.g., four processes all targeting gpu0 each get a separate physical GPU, assuming four GPUs are available)
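A minimal sketch of the interception mechanism described above, using the standard dlsym(RTLD_NEXT, ...) interposition pattern. The remapping policy shown is a placeholder, not NCSA's actual implementation.

/* Build as a shared library and list it in /etc/ld.so.preload.
   Placeholder policy: offset the device ID by an environment variable. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>

typedef int (*cudaSetDevice_t)(int);

int cudaSetDevice(int virtual_dev) {
  static cudaSetDevice_t real_cudaSetDevice = NULL;
  if (!real_cudaSetDevice)
    real_cudaSetDevice = (cudaSetDevice_t) dlsym(RTLD_NEXT, "cudaSetDevice");

  /* Map the virtual device ID seen by the user to the physical device
     assigned by the batch scheduler. */
  const char *base = getenv("WRAPPER_GPU_BASE");   /* hypothetical variable */
  int physical_dev = virtual_dev + (base ? atoi(base) : 0);

  return real_cudaSetDevice(physical_dev);
}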

NCSA CUDA/OpenCL Wrapper Library
- Memory Scrubber Utility:
  - The Linux kernel does no management of GPU device memory
  - Must run between user jobs to ensure security between users (a sketch of the idea follows)
  - An independent utility from the wrapper, but packaged with it
- CUDA/OpenCL Wrapper authors: Jeremy Enos, Guochun Shi, Volodymyr Kindratenko
- Wrapper software availability: NCSA / UIUC Open Source License
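To illustrate the idea (this is not the NCSA utility), a scrubber can allocate as much device memory as it can get and zero it, so the next job cannot read data left behind by the previous user. A sketch using the CUDA runtime API:

/* Illustrative GPU memory scrubber sketch. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
  size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);

  /* Allocating the full free size may fail due to fragmentation;
     back off in 1 MB steps until the allocation succeeds. */
  size_t chunk = free_bytes;
  void *ptr = NULL;
  while (chunk > (1 << 20) && cudaMalloc(&ptr, chunk) != cudaSuccess)
    chunk -= (1 << 20);

  if (ptr) {
    cudaMemset(ptr, 0, chunk);   /* overwrite leftover data */
    cudaDeviceSynchronize();
    cudaFree(ptr);
    printf("scrubbed %zu of %zu total bytes\n", chunk, total_bytes);
  }
  return 0;
}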

Acknowledgements
- University of Illinois at Urbana-Champaign:
  - Theoretical and Computational Biophysics Group, NIH Resource for Macromolecular Modeling and Bioinformatics
  - NVIDIA CUDA Center of Excellence
  - Wen-mei Hwu and the IMPACT Group
  - NCSA Innovative Systems Laboratory: Jeremy Enos, Guochun Shi, Volodymyr Kindratenko
- AMD, IBM, NVIDIA:
  - Access to alpha and beta OpenCL toolkits and drivers
  - Many timely bug fixes
  - Consultation and guidance
- NIH support: P41-RR05969

Publications
- Probing Biomolecular Machines with Graphics Processors. J. Phillips, J. Stone. Communications of the ACM, 52(10):34-41, 2009.
- GPU Clusters for High Performance Computing. V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, W. Hwu. Workshop on Parallel Programming on Accelerator Clusters (PPAC), IEEE Cluster 2009. In press.
- Long time-scale simulations of in vivo diffusion using GPU hardware. E. Roberts, J. Stone, L. Sepulveda, W. Hwu, Z. Luthey-Schulten. IPDPS'09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Computing, pp. 1-8, 2009.
- High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs. J. Stone, J. Saam, D. Hardy, K. Vandivort, W. Hwu, K. Schulten. 2nd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-2), ACM International Conference Proceeding Series, volume 383, pp. 9-18, 2009.
- Multilevel summation of electrostatic potentials using graphics processing units. D. Hardy, J. Stone, K. Schulten. Parallel Computing, 35:164-177, 2009.

Publications (cont.)
- Adapting a message-driven parallel application to GPU-accelerated clusters. J. Phillips, J. Stone, K. Schulten. Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, 2008.
- GPU acceleration of cutoff pair potentials for molecular modeling applications. C. Rodrigues, D. Hardy, J. Stone, K. Schulten, W. Hwu. Proceedings of the 2008 Conference on Computing Frontiers, pp. 273-282, 2008.
- GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 96:879-899, 2008.
- Accelerating molecular modeling applications with graphics processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28:2618-2640, 2007.
- Continuous fluorescence microphotolysis and correlation spectroscopy. A. Arkhipov, J. Hüve, M. Kahms, R. Peters, K. Schulten. Biophysical Journal, 93, 2007.