Parallelization of Tau-Leap Coarse-Grained Monte Carlo Simulations on GPUs. Lifan Xu, Michela Taufer, Stuart Collins, Dionisios G. Vlachos, Global Computing Lab.

Presentation transcript:

Parallelization of Tau-Leap Coarse-Grained Monte Carlo Simulations on GPUs
Lifan Xu, Michela Taufer, Stuart Collins, Dionisios G. Vlachos
Global Computing Lab, University of Delaware

Multiscale Modeling: Challenges
[Figure: multiscale modeling diagram, by courtesy of Dionisios G. Vlachos. Length scale (m) vs. time scale (s), spanning Quantum Mechanics and TST, KMC and DSMC, Mesoscopic Theory/CGMC, LB, DPD, MD and extended trajectory calculations, and CFD up to lab scale, with downscaling (top-down) and upscaling (bottom-up) information traffic.]
Our goal: increase the time and length scales of CGMC by using GPUs.

Outline
- Background: CGMC, GPUs
- CGMC implementation on a single GPU: optimizing memory use, optimizing the algorithm
- CGMC implementation on multiple GPUs: combining OpenMP and CUDA
- Accuracy
- Conclusions
- Future work

Coarse-Grained Kinetic Monte Carlo (CGMC)
- Studies phenomena such as catalysis, crystal growth, and surface diffusion
- Groups neighboring microscopic sites (molecules) together into "coarse-grained" cells
[Figure: mapping of 8 x 8 microscopic sites onto 2 x 2 coarse-grained cells]
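To make the mapping concrete, here is a small illustrative helper (not from the talk) that computes which coarse cell owns a microscopic site, assuming the slide's 8 x 8 lattice grouped into 2 x 2 cells of 4 x 4 sites each:

    /* Illustrative sketch (names are hypothetical, not the authors' code). */
    #include <stdio.h>

    #define SITES_PER_SIDE 8   /* 8 x 8 microscopic lattice          */
    #define CELLS_PER_SIDE 2   /* grouped into a 2 x 2 grid of cells */
    #define SITES_PER_CELL (SITES_PER_SIDE / CELLS_PER_SIDE)  /* 4x4 */

    /* Row-major index of the coarse cell that owns site (x, y). */
    static int site_to_cell(int x, int y)
    {
        int cx = x / SITES_PER_CELL;
        int cy = y / SITES_PER_CELL;
        return cy * CELLS_PER_SIDE + cx;
    }

    int main(void)
    {
        /* Site (5, 2) falls in cell column 1, row 0 -> cell index 1. */
        printf("site (5,2) -> cell %d\n", site_to_cell(5, 2));
        return 0;
    }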

Terminology
- Coverage: a 2D matrix holding the number of molecules of each species in every cell
- Events: reactions and diffusions
- Probability: a list of the probabilities of all possible events in every cell
  - A probability is calculated from the coverage of neighboring cells, so it must be updated whenever a neighbor's coverage changes
  - Events are selected based on these probabilities
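As a rough sketch of how these structures might be laid out in memory (array names and sizes are assumptions, not the authors' code):

    /* Hypothetical flat layout for the coverage matrix and probability list. */
    #define N_CELLS    4   /* e.g., the 2 x 2 example above  */
    #define N_SPECIES  3   /* e.g., species A, B, C          */
    #define N_EVENTS   8   /* possible events per cell       */

    /* coverage[c * N_SPECIES + s] = molecules of species s in cell c      */
    int   coverage[N_CELLS * N_SPECIES];

    /* prob[c * N_EVENTS + e] = probability of event e firing in cell c;
     * recomputed whenever the coverage of a neighboring cell changes.     */
    float prob[N_CELLS * N_EVENTS];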

Reaction
[Figure: illustration of a reaction event]

Diffusion
[Figure: illustration of a diffusion event]

CGMC Algorithm
[Figure: the tau-leap CGMC algorithm]
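The algorithm itself is in the slide figure; below is a hedged host-side sketch of how one tau-leap iteration could be driven, with hypothetical kernel names (and stub bodies) standing in for the select, execute, and update phases:

    // Hedged sketch of the tau-leap driver loop; names, sizes, and phase
    // boundaries are assumptions, not the authors' code.
    #include <cuda_runtime.h>

    #define N_CELLS  1024
    #define N_EVENTS 8

    __global__ void select_events(const float *prob, const float *randoms,
                                  int *counts) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= N_CELLS) return;
        for (int e = 0; e < N_EVENTS; ++e)   // stub: tau-leaping would draw
            counts[c * N_EVENTS + e] = 0;    // each event's firing count here
        (void)prob; (void)randoms;
    }

    __global__ void execute_events(const int *counts, int *coverage) { /* stub */ }
    __global__ void update_probabilities(const int *coverage, float *prob) { /* stub */ }

    int main(void) {
        float *d_prob, *d_randoms; int *d_counts, *d_coverage;
        cudaMalloc(&d_prob,     N_CELLS * N_EVENTS * sizeof(float));
        cudaMalloc(&d_randoms,  N_CELLS * N_EVENTS * sizeof(float));
        cudaMalloc(&d_counts,   N_CELLS * N_EVENTS * sizeof(int));
        cudaMalloc(&d_coverage, N_CELLS * sizeof(int));

        dim3 threads(256), blocks((N_CELLS + 255) / 256);
        for (int leap = 0; leap < 100; ++leap) {
            select_events<<<blocks, threads>>>(d_prob, d_randoms, d_counts);
            execute_events<<<blocks, threads>>>(d_counts, d_coverage);
            update_probabilities<<<blocks, threads>>>(d_coverage, d_prob);
        }
        cudaDeviceSynchronize();
        return 0;
    }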

GPU Overview (I)
NVIDIA Tesla C1060:
- 30 streaming multiprocessors (SMs)
- 8 scalar processors per SM
- 30 8-way SIMD cores = 240 processing elements
Massively parallel and multithreaded:
- Tens of thousands of active threads handled by the thread execution manager
Processing power:
- 933 GFLOPS (single precision)
- 78 GFLOPS (double precision)
From: CUDA Programming Guide, NVIDIA

GPU Overview (II)
Memory types:
- Read/write per thread: registers, local memory
- Read/write per block: shared memory
- Read/write per grid: global memory
- Read-only per grid: constant memory, texture memory
Communication among devices and with the CPU goes through the PCI Express bus.
From: CUDA Programming Guide, NVIDIA
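For reference, a generic CUDA fragment (not from the talk) showing how the memory spaces in the list above are declared and used:

    __constant__ float c_rates[8];        // read-only per grid: constant memory

    // Texture memory is shown in the random-number sketch further below.
    __global__ void memory_spaces(const float *g_in, float *g_out) {
        int tid = threadIdx.x;            // register: private to each thread
        __shared__ float s_buf[128];      // shared memory: read/write per block
                                          // (assumes <= 128 threads per block)
        s_buf[tid] = g_in[tid];           // global memory: read/write per grid
        __syncthreads();                  // make the shared writes visible
        g_out[tid] = s_buf[tid] * c_rates[0];
    }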

Multi-thread Approach
One GPU thread per cell. Within one τ leap, each thread:
1. Reads coverage from global memory
2. Reads probabilities from global memory
3. Selects events
4. Updates coverage in global memory
5. Updates probabilities in global memory
Thread synchronization separates the phases.
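A minimal kernel sketch of this approach, assuming one thread per cell and illustrative names; every access goes to global memory:

    __global__ void tau_leap_global(const float *prob, int *coverage,
                                    const float *randoms,
                                    int n_cells, int n_events) {
        int cell = blockIdx.x * blockDim.x + threadIdx.x;
        if (cell >= n_cells) return;

        // Read this cell's probabilities from global memory and select
        // events (real event selection is omitted; placeholder rule).
        int fired = (randoms[cell] < prob[cell * n_events]) ? 1 : 0;

        // Update coverage in global memory.
        atomicAdd(&coverage[cell], fired);

        // Updating probabilities needs the new coverage of neighboring
        // cells, so it happens after a synchronization; when cells span
        // several blocks, that means a separate kernel launch.
    }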

Memory Optimization
Same structure, but coverage is staged in shared memory. Within one τ leap, each thread (still one thread per cell):
1. Reads probabilities from global memory
2. Reads coverage from shared memory
3. Selects and executes events
4. Updates coverage in shared memory
5. Updates probabilities in global memory
6. Writes coverage back to global memory
Thread synchronization separates the phases.
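The corresponding shared-memory variant might look like this (again an illustrative sketch, assuming one block covers a tile of cells):

    #define CELLS_PER_BLOCK 128

    __global__ void tau_leap_shared(const float *prob, int *coverage,
                                    const float *randoms, int n_events) {
        __shared__ int s_cov[CELLS_PER_BLOCK];
        int cell = blockIdx.x * blockDim.x + threadIdx.x;

        // Stage coverage in shared memory once per leap
        // (assumes the cell count is a multiple of CELLS_PER_BLOCK).
        s_cov[threadIdx.x] = coverage[cell];
        __syncthreads();

        // Select and execute events against the shared copy
        // (placeholder rule; the real selection logic is omitted).
        if (randoms[cell] < prob[cell * n_events])
            s_cov[threadIdx.x] += 1;
        __syncthreads();

        // Write coverage back to global memory once at the end;
        // probabilities are still updated in global memory, as above.
        coverage[cell] = s_cov[threadIdx.x];
    }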

Performance
Platforms:
- GPU: NVIDIA Tesla C1060, 240 cores, 4 GB global memory, 1.44 GHz clock rate
- CPU: Intel Xeon X5450, 4 cores, 4 GB memory, 3 GHz clock rate

Performance
Significant increase in performance: on the time scale, GPU CGMC simulations can be 100X faster than a CPU implementation.

Rethinking the CGMC Algorithm for GPUs
Redesign of cell structures:
- Macro-cell = set of cells
- The number of cells per macro-cell is flexible
Redesign of event execution:
- One thread per macro-cell (coarse-grained parallelism) or one block per macro-cell (fine-grained parallelism)
- Events replicated across macro-cells
[Figure: 1-layer macro-cell and 2-layer macro-cells over the cells of a 4x4 system]

Synchronization Frequency
[Figure: synchronization points across leaps 1-4 for the first, second, and third layers]

Event Replication
E1: A[cell 1] -> A[cell 2], where A is a molecular species.
[Table: event E1 replicated across macro-cell 1 (block 0) and macro-cell 2 (block 1). In each leap, E1 is executed as A-- by the thread owning cell 1 and as A++ by the thread owning cell 2, in both macro-cells.]

Random Number Generator
Options considered:
- Generate the random numbers on the CPU and transfer them to the GPU: simple and easy, uniform distribution
- Have a dedicated kernel generate numbers and store them in global memory: many global memory accesses
- Have each thread generate random numbers as needed (e.g., a Wallace Gaussian generator): high shared memory requirement

Random Number Generator
Chosen scheme (see the sketch after this list):
1. Generate as many random numbers as there are possible events on the CPU
2. Store the random numbers in texture memory as read-only data
3. Run one leap of the simulation:
   a. Each event fetches its own random number from texture memory
   b. Select events
   c. Execute and update
4. At the end of the leap, go back to step 1
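A hedged sketch of steps 2 and 3a, using the texture reference API that matched the Tesla C1060 era (since deprecated); buffer and kernel names are assumptions:

    // Pre-generated random numbers bound to a 1D texture; each event
    // fetches its own number inside the kernel.
    texture<float, 1, cudaReadModeElementType> texRN;

    __global__ void consume_randoms(int *selected, int n_events) {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= n_events) return;
        float r = tex1Dfetch(texRN, e);  // step 3a: fetch this event's number
        selected[e] = (r < 0.5f);        // placeholder selection rule
    }

    // Host side, once per leap (steps 1 and 2):
    //   cudaMemcpy(d_rn, h_rn, n * sizeof(float), cudaMemcpyHostToDevice);
    //   cudaBindTexture(NULL, texRN, d_rn, n * sizeof(float));
    //   consume_randoms<<<blocks, threads>>>(d_selected, n);
    //   cudaUnbindTexture(texRN);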

Coarse-Grained vs. Fine-Grained
2-layer coarse-grained:
- Each thread is in charge of one macro-cell
- Each thread simulates five cells in every first leap and one central cell in every second leap
2-layer fine-grained:
- Each block is in charge of one macro-cell and has five threads
- Each thread simulates one cell in every first leap
- The five threads simulate the central cell together in every second leap
A sketch of the two launch configurations follows.
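Illustrative launch configurations for the two granularities (kernel names and signatures are hypothetical stubs):

    __global__ void coarse_kernel(int *cov) { /* one thread = one macro-cell */ }
    __global__ void fine_kernel(int *cov)   { /* one block  = one macro-cell */ }

    void launch_both(int *d_cov, int M) {   // M = number of 2-layer macro-cells
        // Coarse-grained: one thread per macro-cell; each thread simulates
        // five cells in every first leap and the central cell in the second.
        coarse_kernel<<<(M + 127) / 128, 128>>>(d_cov);

        // Fine-grained: one five-thread block per macro-cell; in the second
        // leap the five threads cooperate on the single central cell.
        fine_kernel<<<M, 5>>>(d_cov);
    }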

Performance
[Figure: events/ms vs. number of cells, for small and large molecular systems]

Multi-GPU Implementation
- A single-GPU implementation limits the number of cells that can be processed in parallel, so we use multiple GPUs
- Synchronization between multiple GPUs is costly, so we take advantage of the 2-layer fine-grained parallelism
- We combine OpenMP with CUDA for multi-GPU programming:
  - Portable pinned memory for communication between CPU threads
  - Mapped pinned memory for data copies between CPU and GPU (zero copy)
  - Write-combined memory for fast data access

Pinned Memory
Portable pinned memory (PPM):
- Available to all host threads
- Can be freed by any host thread
Mapped pinned memory (MPM):
- Allocated on the host side but mapped into the device's address space
- Two addresses: one in host memory and one in device memory
- No explicit memory copy between CPU and GPU
Write-combined pinned memory (WCM):
- Fast transfers across the PCI Express bus because the buffer is not snooped
- High transfer performance, but no data coherency guarantee
A sketch of the matching allocation flags follows.
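The CUDA runtime exposes each flavor through a cudaHostAlloc flag; a short sketch (buffer names are illustrative):

    #include <cuda_runtime.h>

    void alloc_pinned_buffers(size_t bytes) {
        float *h_portable, *h_mapped, *d_mapped, *h_wc;

        // Portable pinned memory: visible to, and freeable by, every
        // host thread.
        cudaHostAlloc((void **)&h_portable, bytes, cudaHostAllocPortable);

        // Mapped pinned memory: also mapped into the device address space,
        // so kernels dereference it directly with no explicit memcpy
        // (zero copy). Requires cudaSetDeviceFlags(cudaDeviceMapHost)
        // before the context is created.
        cudaHostAlloc((void **)&h_mapped, bytes, cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_mapped, h_mapped, 0);

        // Write-combined memory: fast across PCI Express (not snooped),
        // but slow for the CPU to read back, with no coherency guarantee.
        cudaHostAlloc((void **)&h_wc, bytes,
                      cudaHostAllocPortable | cudaHostAllocWriteCombined);
        (void)h_portable; (void)d_mapped; (void)h_wc;
    }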

Combine OpenMP and CUDA
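A minimal sketch of this pattern, assuming one OpenMP thread per GPU and a portable pinned buffer for the host-generated random numbers (names are illustrative, not the authors' code):

    #include <cuda_runtime.h>
    #include <omp.h>

    int main(void) {
        int n_gpus = 0;
        cudaGetDeviceCount(&n_gpus);

        float *h_randoms;  // portable pinned buffer filled by the host RNG
        cudaHostAlloc((void **)&h_randoms, 1 << 20, cudaHostAllocPortable);

        #pragma omp parallel num_threads(n_gpus)
        {
            int id = omp_get_thread_num();
            cudaSetDevice(id);  // bind this CPU thread to GPU "id"
            // ... copy (or map) h_randoms, then launch this GPU's leaps ...
        }

        cudaFreeHost(h_randoms);
        return 0;
    }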

Multi-GPU Implementation
[Figure: 2-layer implementation. The CPU's RNG writes random numbers, along with the CGMC parameters, coverage, and probabilities, into pinned memory; each GPU then runs leap 1 and leap 2, each consisting of select events, execute events, update coverage, and update probability.]

Performance with Different Numbers of GPUs
[Figure: events/ms vs. number of cells, for different numbers of GPUs]

Performance with Different Layers on 4 GPUs
[Figure: events/ms vs. number of cells, for different numbers of layers]

Accuracy
Simulations of three different species of molecules, A, B, and C:
- A changes to B and vice versa
- 2 B molecules change to C and vice versa
- A, B, and C can diffuse
[Figure: number of molecules in the system while reaching equilibrium and at equilibrium, for the GPU, CPU-32, and CPU-64 runs]

Conclusion
- We present a parallel implementation of the tau-leap method for CGMC simulations on single and multiple GPUs
- We identify the most efficient parallelism granularity for a molecular system of a given size:
  - 42X speedup for small systems using 2-layer fine-grained thread parallelism
  - 100X speedup for average systems using 1-layer parallelism
  - 120X speedup for large systems using 2-layer fine-grained thread parallelism across multiple GPUs
- We increase both the time scale (up to 120 times) and the length scale (up to 100,000)

Future Work
- Study molecular systems with heterogeneous distributions
- Combine MPI, OpenMP, and CUDA to use multiple GPUs across different machines
- Use MPI to implement CGMC parallelization on a CPU cluster, and compare the performance of the GPU cluster and the CPU cluster
- Link phenomena and models across scales, i.e., from QM to MD and to CGMC

Acknowledgements
Sponsors:
Collaborators: Dionisios G. Vlachos and Stuart Collins (UDel)
GCL members in Spring 2010: Trilce Estrada, Boyu Zhang, Abel Licon, Narayan Ganesan, Lifan Xu, Philip Saponaro, Maria Ruiz, Michela Taufer
More questions: