GPU Computational Screening of Carbon Capture Materials
J Kim 1, A Koniges 1, R Martin 1, M Haranczyk 1, J Swisher 2 and B Smit 1,2
1 Berkeley Lab (USA), 2 Department of Chemical Engineering, University of California, Berkeley (USA)


ARCHITECTURE: NERSC DIRAC GPU CLUSTER
- New GPU cluster Dirac at NERSC (44 Fermi Tesla C2050 GPU cards)
- Each C2050: 448 CUDA cores, 3 GB GDDR5 memory, PCIe x16 Gen2, 515 (1030) GFLOPS peak DP (SP) performance, 144 GB/sec memory bandwidth
- Dirac node: 2 quad-core Intel Nehalem processors (8 MB cache, 5.86 GT/sec QPI), 24 GB DDR3 Reg ECC memory
[Figure: GPU racks of the NERSC Dirac cluster]

GPU vs. CPU
- GPU: more than 500 cores, optimized for SIMD (same-instruction-multiple-data) problems
- CPU: fewer than 20 cores, designed for general-purpose programming
[Figure: CPU vs. GPU die layout - control logic, ALU, cache, DRAM]

APPLICATION: CARBON CAPTURE AND STORAGE
- Project goal: reduce the cost of separating CO2 molecules from power plant flue gases (one of 46 Energy Frontier Research Centers established by the DOE)
- Candidates for carbon capture: zeolites, metal-organic frameworks
- Over a million hypothetical zeolite structures: how to determine the optimal structure?
- Develop GPU code to accelerate screening of a large database of carbon capture materials
- Henry coefficients (K_H) characterize the selectivity of a material at low pressure and are used as an initial screening quantity for zeolites
[Figure: LTA zeolite and MFI zeolite structures]

ALGORITHM: CHARACTERIZE A LARGE DATABASE OF CARBON CAPTURE MATERIALS
The screening code proceeds in three steps; minimal, illustrative CUDA sketches of the steps follow the step descriptions below.

STEP 1: ENERGY GRID CONSTRUCTION
- Test-insert a gas molecule at each grid point and calculate its energy
- Sub-Angstrom grid spacing (10 million+ grid points, stored in GPU DRAM)
- Framework atoms (< 2000): keep their data in fast GPU memory
- Number of GPU threads = number of grid points
- Lennard-Jones + Coulomb potentials with periodic boundary conditions
[Figure: grid points assigned one per GPU thread (Thread 0, Thread 1, Thread 2, Thread 3, ...); X marks framework atoms]

STEP 2: POCKET BLOCKING
- Motivation: inaccessible regions (pockets) within the framework must be blocked
- Set a threshold energy such that a grid point i is accessible if exp(-E_i / k_B T) > exp(-15), i.e. E_i < 15 k_B T
- Flood fill algorithm to detect pockets
- In the periodic unit cell figure, regions (1) and (2) are disconnected and thus inaccessible (blocked with spheres), while region (3) forms a channel (accessible)
[Figure: blocking spheres (a), (b) in a periodic unit cell; periodic, non-orthogonal unit cell]

STEP 3: MONTE CARLO WIDOM INSERTION
- Test-insert a gas molecule at a random position in the simulation box (CH4: one insertion, CO2: three insertions)
- Check whether the position is (a) out of the boundary (redo) or (b) inside a pocket blocking sphere
- Interpolate the energy value from the surrounding grid points
- Accumulate the Boltzmann factor and repeat
- Use the CURAND library to generate random numbers
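The following is a minimal sketch of how the Step 1 energy grid could be computed in CUDA, with one thread per grid point and the framework atoms cached in constant memory, as the poster describes. It assumes an orthorhombic cell, minimum-image periodic boundaries, Lorentz-Berthelot mixing, and a bare 1/r Coulomb term with unit-conversion constants omitted; all names and parameters are illustrative, not the authors' code.

// Step 1 sketch (illustrative): one CUDA thread per grid point; framework atoms
// (< 2000, per the poster) live in constant memory, copied there on the host
// with cudaMemcpyToSymbol before the kernel launch.
#include <cuda_runtime.h>
#include <math.h>

#define MAX_FRAMEWORK_ATOMS 2000

struct Atom { float x, y, z, eps, sig, q; };          // LJ epsilon/sigma, charge
__constant__ Atom d_framework[MAX_FRAMEWORK_ATOMS];   // fast, read-only GPU memory

__global__ void energy_grid(float *grid, int nx, int ny, int nz,
                            float lx, float ly, float lz, int n_atoms,
                            float probe_eps, float probe_sig, float probe_q)
{
    long idx = blockIdx.x * (long)blockDim.x + threadIdx.x;
    long n_pts = (long)nx * ny * nz;
    if (idx >= n_pts) return;                          // number of threads = number of grid points

    // Convert the linear thread index to a grid point position in the cell
    int ix = idx % nx, iy = (idx / nx) % ny, iz = idx / ((long)nx * ny);
    float px = ix * lx / nx, py = iy * ly / ny, pz = iz * lz / nz;

    float energy = 0.0f;
    for (int a = 0; a < n_atoms; ++a) {
        float dx = px - d_framework[a].x;
        float dy = py - d_framework[a].y;
        float dz = pz - d_framework[a].z;
        dx -= lx * rintf(dx / lx);                     // minimum-image convention
        dy -= ly * rintf(dy / ly);
        dz -= lz * rintf(dz / lz);
        float r2 = fmaxf(dx*dx + dy*dy + dz*dz, 1e-6f);

        float eps = sqrtf(probe_eps * d_framework[a].eps);   // Lorentz-Berthelot mixing
        float sig = 0.5f * (probe_sig + d_framework[a].sig);
        float s6  = powf(sig * sig / r2, 3.0f);
        energy += 4.0f * eps * (s6 * s6 - s6);               // Lennard-Jones
        energy += probe_q * d_framework[a].q * rsqrtf(r2);   // Coulomb (1/r)
    }
    grid[idx] = energy;                                // one energy value per grid point
}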
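Since the poster reports that pocket blocking runs on the CPU and dominates the wall time, the next sketch is host-side code for the flood fill. It labels connected components of low-energy grid points under periodic boundary conditions; how a component is classified as a blocked pocket versus an accessible channel (e.g. whether it wraps around the cell) and the placement of blocking spheres are omitted, and all identifiers are hypothetical.

// Step 2 sketch (illustrative): CPU-side flood fill over the energy grid. A grid
// point is "open" if its energy is below the threshold (E < 15 kBT per the poster);
// open points are grouped into connected components with a BFS that wraps across
// the periodic boundaries.
#include <array>
#include <queue>
#include <vector>

std::vector<int> label_components(const std::vector<float> &energy,
                                  int nx, int ny, int nz, float e_max)
{
    auto id = [=](int x, int y, int z) { return (z * ny + y) * nx + x; };
    std::vector<int> label(energy.size(), -1);        // -1 = high-energy or unvisited
    const int dx[6] = {1,-1,0,0,0,0}, dy[6] = {0,0,1,-1,0,0}, dz[6] = {0,0,0,0,1,-1};
    int next_label = 0;

    for (int z = 0; z < nz; ++z)
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                int start = id(x, y, z);
                if (energy[start] >= e_max || label[start] != -1) continue;
                label[start] = next_label;            // seed a new component
                std::queue<std::array<int, 3>> q;
                q.push({x, y, z});
                while (!q.empty()) {
                    auto [cx, cy, cz] = q.front();
                    q.pop();
                    for (int d = 0; d < 6; ++d) {
                        // wrap neighbours across the periodic boundary
                        int nxp = (cx + dx[d] + nx) % nx;
                        int nyp = (cy + dy[d] + ny) % ny;
                        int nzp = (cz + dz[d] + nz) % nz;
                        int j = id(nxp, nyp, nzp);
                        if (energy[j] < e_max && label[j] == -1) {
                            label[j] = next_label;
                            q.push({nxp, nyp, nzp});
                        }
                    }
                }
                ++next_label;
            }
    return label;   // component id per low-energy grid point (-1 elsewhere)
}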
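Finally, a sketch of the Step 3 Widom insertion kernel, using CURAND for per-thread random numbers as the poster describes. For brevity it looks up the nearest grid point rather than interpolating, treats a single-site probe (CH4-like), and replaces the poster's blocking-sphere test with a precomputed per-grid-point blocked flag; out-of-boundary redo checks are unnecessary here because positions are drawn directly in fractional coordinates. Per-thread partial sums would be reduced on the host, and the Henry coefficient is then proportional to the average Boltzmann factor <exp(-E/kBT)> over the trial insertions. Function and parameter names are illustrative.

// Step 3 sketch (illustrative): Widom test-particle insertion with one CURAND
// state per thread; each thread accumulates its own Boltzmann-factor sum.
#include <curand_kernel.h>

__global__ void widom_insertion(const float *grid, const int *blocked,
                                int nx, int ny, int nz,
                                int trials_per_thread, float beta,
                                unsigned long long seed,
                                double *partial_sums)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState rng;
    curand_init(seed, tid, 0, &rng);                   // independent stream per thread

    double local_sum = 0.0;
    for (int t = 0; t < trials_per_thread; ++t) {
        // Random fractional coordinates in [0, 1); curand_uniform returns (0, 1]
        float fx = 1.0f - curand_uniform(&rng);
        float fy = 1.0f - curand_uniform(&rng);
        float fz = 1.0f - curand_uniform(&rng);
        int ix = min((int)(fx * nx), nx - 1);
        int iy = min((int)(fy * ny), ny - 1);
        int iz = min((int)(fz * nz), nz - 1);
        long idx = ((long)iz * ny + iy) * nx + ix;

        if (blocked[idx]) continue;                    // inside a blocked pocket: skip
        local_sum += exp(-beta * grid[idx]);           // accumulate Boltzmann factor
    }
    // Per-thread partial sum; reduce on the host (or with a second kernel).
    partial_sums[tid] = local_sum;
}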
PERFORMANCE RESULTS
- Simulations of IZA structures: 190+ experimentally known zeolites
- CH4: 2.2 seconds/zeolite; CO2: 31.8 seconds/zeolite
- 64% (72%) of wall time is spent in the CPU pocket-blocking step
- The code is compute bound (50x speedup over a single-core CPU implementation)
- Successfully computed 120,000+ Henry coefficients for CH4 inside hypothetical zeolites using 5 GPUs in less than 1 day of wall time
- A local Henry coefficient color map indicates the regions within the zeolite that contribute most to the overall Henry coefficient
[Figure: Henry coefficients for the IZA structures; local Henry coefficients for MFI]

FUTURE WORK
- Adsorption isotherm calculations on the GPU for CO2
- Determine a good parallelization strategy for the adsorption isotherms
- Henry coefficient calculations for ZIFs and metal-organic frameworks
[Figure: GPU adsorption isotherm strategy - independent GCMC simulations (P = 1 atm, P = 100 atm, ...) mapped onto the Tesla C2050 streaming multiprocessors SM1, SM2, ..., SM14]

ACKNOWLEDGMENT
- This work was supported by the Director, Office of Science, Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.