1 Parallelization of Tau-Leap Coarse-Grained Monte Carlo Simulations on GPUs
Lifan Xu, Michela Taufer, Stuart Collins, Dionisios G. Vlachos
Global Computing Lab, University of Delaware

2 Multiscale Modeling: Challenges
[Figure: the multiscale modeling hierarchy, from Quantum Mechanics/TST up through KMC/DSMC, LB/DPD/MD and extended trajectory calculations, Mesoscopic Theory/CGMC, and CFD at the lab scale; length scales from 10^-10 to 10^-2 m and time scales from 10^-12 to 10^3 s, with downscaling (top-down) and upscaling (bottom-up) information traffic between levels. By courtesy of Dionisios G. Vlachos.]
Our goal: increase the time and length scales of CGMC by using GPUs.

3 Outline
Background
   CGMC
   GPU
CGMC implementation on a single GPU
   Optimize memory use
   Optimize the algorithm
CGMC implementation on multiple GPUs
   Combine OpenMP and CUDA
Accuracy
Conclusions
Future work

4 Coarse-Grained Kinetic Monte Carlo (CGMC)
Studies phenomena such as catalysis, crystal growth, and surface diffusion.
Groups neighboring microscopic sites (molecules) together into "coarse-grained" cells.
[Figure: mapping of an 8 x 8 grid of microscopic sites onto 2 x 2 coarse-grained cells.]

5 Terminology
Coverage
   A 2D matrix holding the number of molecules of each species in every cell
Events
   Reaction
   Diffusion
Probability
   A list of the probabilities of all possible events in every cell
   Each probability is calculated from the coverage of the cell's neighbors
   A probability must be updated whenever the coverage of a neighbor changes
   Events are selected based on these probabilities

6 Reaction
[Figure: illustration of a reaction event inside a cell.]

7 Diffusion
[Figure: illustration of a diffusion event between neighboring cells.]

8 CGMC Algorithm
[Figure: the CGMC algorithm, shown as a diagram on the slide.]
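The slide presents the algorithm only as a diagram. As a minimal sketch of what one tau-leap iteration might look like as a CUDA host driver, under assumed kernel names, arguments, and launch shape (none of these appear in the talk):

    #include <cuda_runtime.h>

    // Placeholder kernels for the per-leap steps named elsewhere in the talk.
    __global__ void update_probabilities(const int *coverage, float *probability) { /* ... */ }
    __global__ void select_events(const float *probability, const float *randoms,
                                  int *eventCounts, float tau) { /* ... */ }
    __global__ void execute_events(const int *eventCounts, int *coverage) { /* ... */ }

    // One tau-leap per iteration: recompute probabilities from the coverage,
    // sample which events fire during the leap, then apply them.
    void cgmc_run(int *d_coverage, float *d_probability, const float *d_randoms,
                  int *d_eventCounts, int nCells, int nLeaps, float tau) {
        int threads = 256;
        int blocks = (nCells + threads - 1) / threads;
        for (int leap = 0; leap < nLeaps; ++leap) {
            update_probabilities<<<blocks, threads>>>(d_coverage, d_probability);
            select_events<<<blocks, threads>>>(d_probability, d_randoms, d_eventCounts, tau);
            execute_events<<<blocks, threads>>>(d_eventCounts, d_coverage);
            cudaDeviceSynchronize();   // all cells finish before the next leap begins
        }
    }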

9 GPU Overview (I)
NVIDIA Tesla C1060:
   30 streaming multiprocessors (SMs)
   8 scalar processors per SM
   30 8-way SIMD cores = 240 processing elements
Massively parallel, multithreaded:
   Up to 30,720 active threads handled by the thread execution manager
Processing power:
   933 GFLOPS (single precision)
   78 GFLOPS (double precision)
From: CUDA Programming Guide, NVIDIA

10 GPU Overview (II)
Memory types:
   Read/write per thread: registers, local memory
   Read/write per block: shared memory
   Read/write per grid: global memory
   Read-only per grid: constant memory, texture memory
Communication among devices and with the CPU goes through the PCI Express bus.
From: CUDA Programming Guide, NVIDIA

11 Multi-thread Approach
One GPU thread per cell. In each τ leap, every thread:
   Reads its cell's coverage from global memory
   Reads the event probabilities from global memory
   Selects events
   Updates the coverage in global memory
   Updates the probabilities in global memory
Threads synchronize at the end of each leap.
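As a hedged sketch of the per-thread work under this mapping, where the event "execution" is a toy stand-in (a real event modifies several species counts) and all names are assumptions:

    // One thread = one cell; coverage and probabilities live in global memory.
    __global__ void tau_leap_global(int *coverage, const float *probability,
                                    const float *randoms, int nCells, int nEvents) {
        int cell = blockIdx.x * blockDim.x + threadIdx.x;
        if (cell >= nCells) return;
        for (int e = 0; e < nEvents; ++e) {
            float p = probability[cell * nEvents + e];  // read probability (global)
            float r = randoms[cell * nEvents + e];      // pre-generated random number
            if (r < p)
                atomicAdd(&coverage[cell], 1);          // toy "execute event" on the
        }                                               // coverage in global memory
        // The probabilities of this cell and its neighbors are then recomputed
        // from the updated coverage (elided); threads synchronize between leaps.
    }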

12 Memory Optimization
Same mapping (one GPU thread per cell), but the coverage is kept in shared memory during the leap. In each τ leap, every thread:
   Reads its cell's coverage from shared memory
   Reads the event probabilities from global memory
   Selects and executes events
   Updates the coverage in shared memory
   Updates the probabilities in global memory
   Writes the coverage back to global memory at the end of the leap
Threads synchronize at the end of each leap.
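A sketch of the same kernel with the coverage staged in shared memory; it assumes each block's threads cover a contiguous run of cells and that the cell count is a multiple of the block size:

    // Coverage is read into shared memory once, updated there during the leap,
    // and written back to global memory once at the end.
    __global__ void tau_leap_shared(int *coverage, const float *probability,
                                    const float *randoms, int nEvents) {
        extern __shared__ int sCoverage[];              // one entry per cell in the block
        int cell = blockIdx.x * blockDim.x + threadIdx.x;
        sCoverage[threadIdx.x] = coverage[cell];        // global -> shared, once
        __syncthreads();
        for (int e = 0; e < nEvents; ++e) {
            float p = probability[cell * nEvents + e];  // probabilities stay in global
            if (randoms[cell * nEvents + e] < p)
                sCoverage[threadIdx.x] += 1;            // toy update on the shared copy
        }
        __syncthreads();
        coverage[cell] = sCoverage[threadIdx.x];        // shared -> global, once
    }
    // Launch with: tau_leap_shared<<<blocks, threads, threads * sizeof(int)>>>(...);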

13 Performance
Platforms:

                     GPU: Tesla C1060     CPU: Intel Xeon X5450
   Number of cores   240                  4
   Memory            4 GB (global)        4 GB
   Clock rate        1.44 GHz             3 GHz

14 Performance
Significant increase in performance: on the time scale, GPU CGMC simulations can be 100X faster than the CPU implementation.

15 Rethinking the CGMC Algorithm for GPUs
Redesign of cell structures:
   Macro-cell = a set of cells
   The number of cells per macro-cell is flexible
Redesign of event execution:
   One thread per macro-cell (coarse-grained parallelism) or one block per macro-cell (fine-grained parallelism)
   Events replicated across macro-cells
[Figure: cells in a 4x4 system grouped into 2-layer and 3-layer macro-cells.]

16 Synchronization Frequency
[Figure: leaps 1-4 with a single synchronization point, annotated with the first, second, and third layers of a macro-cell; the number of layers determines how many leaps can run between synchronizations.]

17 Event Replication
Example event E1: A[cell 1] -> A[cell 2], where A is a molecular species. Because boundary cells are replicated across macro-cells, the same event must be executed in every macro-cell that holds a replica of the affected cells:

            Macro-cell 1 (block 0)                 Macro-cell 2 (block 1)
            Cell 1 (thr 0)   Cell 2 (thr 1) ...    Cell 2 (thr 5)   Cell 1 (thr 6)   Cell 3 (thr 7) ...
   Leap 1   E1(A--)          E1(A++)               E1(A++)          E1(A--)
   Leap 2   E1(A--)          E1(A++)

18 Random Number Generator
Option 1: generate the random numbers on the CPU and transfer them to the GPU
   Simple and easy
   Uniform distribution
Option 2: have a dedicated kernel generate the numbers and store them in global memory
   Many global-memory accesses
Option 3: have each thread generate random numbers as needed
   e.g., the Wallace Gaussian generator
   High shared-memory requirement

19 Random Number Generator
1. Generate as many random numbers as there are possible events on the CPU
2. Store the random numbers in texture memory (read-only)
3. Run one leap of the simulation:
   a. Each event fetches its own random number from texture memory
   b. Select events
   c. Execute events and update
4. End of the leap
5. Go to step 1
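A sketch of steps 1-3 using the legacy (Tesla-era) CUDA texture API; the kernel body and all names are illustrative assumptions:

    // 1D texture over the pre-generated random numbers (read-only on the GPU).
    texture<float, 1, cudaReadModeElementType> rnTex;

    __global__ void leap_with_texture_rn(int nEvents) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= nEvents) return;
        float r = tex1Dfetch(rnTex, idx);   // each event fetches its own number
        // ... use r to select/execute this event (elided) ...
        (void)r;
    }

    // Host side: copy the CPU-generated numbers to the GPU, bind the texture.
    void upload_randoms(const float *h_rn, float *d_rn, int n) {
        cudaMemcpy(d_rn, h_rn, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaBindTexture(0, rnTex, d_rn, n * sizeof(float));
    }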

20 Coarse-Grained vs. Fine-Grained
2-layer coarse-grained:
   Each thread is in charge of one macro-cell
   Each thread simulates five cells in every first leap and one central cell in every second leap
2-layer fine-grained:
   Each block is in charge of one macro-cell and has five threads
   Each thread simulates one cell in every first leap
   The five threads simulate the central cell together in every second leap
(See the launch-configuration sketch below.)
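At its simplest, the two granularities differ in the launch configuration; a minimal sketch (in practice the two kernels also differ internally, and all names here are assumptions):

    __global__ void leap_macro(int *coverage) { /* per-cell work for the leap */ }

    void launch_both_ways(int *d_coverage, int nMacroCells) {
        // Coarse-grained: one thread simulates a whole 5-cell macro-cell.
        leap_macro<<<nMacroCells, 1>>>(d_coverage);
        // Fine-grained: one block per macro-cell, one thread per cell (5 threads).
        leap_macro<<<nMacroCells, 5>>>(d_coverage);
    }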

21 Performance
[Figure: events/ms vs. number of cells, comparing small and large molecular systems.]

22 Multi-GPU Implementation
A single GPU limits the number of cells that can be processed in parallel, so we use multiple GPUs.
Synchronization between multiple GPUs is costly, so we take advantage of the 2-layer fine-grained parallelism.
We combine OpenMP with CUDA for multi-GPU programming:
   Portable pinned memory for communication between CPU threads
   Mapped pinned memory for data copies between CPU and GPU (zero copy)
   Write-combined memory for fast data access

23 Pinned Memory
Portable pinned memory (PPM)
   Available to all host threads
   Can be freed by any host thread
Mapped pinned memory (MPM)
   Allocated on the host side
   Can be mapped into the device's address space
   Two addresses: one in host memory and one in device memory
   No explicit memory copy between CPU and GPU
Write-combined pinned memory (WCM)
   Transfers across the PCI Express bus are not snooped
   High transfer performance, but no data-coherency guarantee
(A small allocation sketch follows.)
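The three flavors map onto cudaHostAlloc() flags in the CUDA runtime; the flags and calls below are the real API, while the buffer names and sizes are illustrative:

    #include <cuda_runtime.h>

    void allocate_pinned_examples(size_t bytes) {
        float *ppm, *mpm, *wcm, *d_mpm;

        // Portable pinned memory: usable and freeable by any host thread.
        cudaHostAlloc((void**)&ppm, bytes, cudaHostAllocPortable);

        // Mapped pinned memory: also mapped into the device address space
        // (zero copy). Requires cudaSetDeviceFlags(cudaDeviceMapHost) before
        // the CUDA context is created.
        cudaHostAlloc((void**)&mpm, bytes, cudaHostAllocPortable | cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&d_mpm, mpm, 0);  // device-side alias

        // Write-combined pinned memory: fast PCIe transfers (not snooped), but
        // uncached on the CPU, so the host should write it and never read it.
        cudaHostAlloc((void**)&wcm, bytes, cudaHostAllocPortable | cudaHostAllocWriteCombined);

        cudaFreeHost(ppm); cudaFreeHost(mpm); cudaFreeHost(wcm);
    }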

24 Combine OpenMP and CUDA
[Figure: OpenMP host threads driving the GPUs, sharing pinned memory.]
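The slide shows this as a diagram; a minimal sketch of the pattern, with one OpenMP thread per GPU sharing a portable pinned buffer (buffer name and size are assumptions):

    #include <omp.h>
    #include <cuda_runtime.h>

    int main() {
        int nGpus = 0;
        cudaGetDeviceCount(&nGpus);

        // Portable pinned buffer: the CPU-side RNG fills it, all threads see it.
        float *h_randoms;
        size_t bytes = 1 << 20;  // illustrative size
        cudaHostAlloc((void**)&h_randoms, bytes, cudaHostAllocPortable);
        // ... fill h_randoms with the CPU random number generator ...

        #pragma omp parallel num_threads(nGpus)
        {
            cudaSetDevice(omp_get_thread_num());  // one host thread drives one GPU
            // Each thread copies its slice of h_randoms to its device and runs
            // the 2-layer fine-grained leap kernels on its share of the cells.
        }

        cudaFreeHost(h_randoms);
        return 0;
    }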

25 Multi-GPU Implementation
For the 2-layer implementation: the CGMC parameters, random numbers, coverage, and probabilities live in CPU pinned memory, and the CPU-side RNG fills the random numbers. Each GPU then runs its leaps independently; in every leap it selects events, executes events, updates the coverage, and updates the probabilities.
[Figure: CPU pinned memory (RNG, random numbers, CGMC parameters) feeding GPU 1 and GPU 2, each executing leaps 1 and 2.]

26 Performance with Different Numbers of GPUs
[Figure: events/ms vs. number of cells for different numbers of GPUs.]

27 Performance with Different Layers on 4 GPUs
[Figure: events/ms vs. number of cells for different macro-cell layer counts on 4 GPUs.]

28 Accuracy
Simulations with three species of molecules, A, B, and C:
   A changes to B, and vice versa
   Two B molecules change to C, and vice versa
   A, B, and C can diffuse
We track the number of molecules in the system while it approaches equilibrium and at equilibrium.
[Figure: molecule counts approaching and at equilibrium for the GPU, CPU-32, and CPU-64 runs.]

29 Conclusions
We present a parallel implementation of the tau-leap method for CGMC simulations on single and multiple GPUs.
We identify the most efficient parallelism granularity for a molecular system of a given size:
   42X speedup for small systems using 2-layer fine-grained thread parallelism
   100X speedup for average systems using 1-layer parallelism
   120X speedup for large systems using 2-layer fine-grained thread parallelism across multiple GPUs
We increase both the time scale (up to 120 times) and the length scale (up to 100,000).

30 Future Work
Study molecular systems with heterogeneous distributions.
Combine MPI, OpenMP, and CUDA to use multiple GPUs across different machines.
Use MPI to implement the CGMC parallelization on a CPU cluster:
   Compare the performance of GPU and CPU clusters
Link phenomena and models across scales, i.e., from QM to MD to CGMC.

31 Acknowledgements
Sponsors: [logos on slide]
Collaborators: Dionisios G. Vlachos and Stuart Collins (UDel)
GCL members (Spring 2010): Trilce Estrada, Boyu Zhang, Abel Licon, Narayan Ganesan, Lifan Xu, Philip Saponaro, Maria Ruiz, Michela Taufer
More questions: xulifan@udel.edu

