1 Parallelization of Tau-Leap Coarse-Grained Monte Carlo Simulations on GPUs
Lifan Xu, Michela Taufer, Stuart Collins, Dionisios G. Vlachos
Global Computing Lab, University of Delaware

2 Multiscale Modeling: Challenges
[Figure: the multiscale modeling hierarchy, from Quantum Mechanics/TST up through KMC/DSMC, LB/DPD/MD and extended trajectory calculations, Mesoscopic Theory/CGMC, and CFD at the lab scale; length scales from 10^-10 to 10^-2 m and time scales from 10^-12 to 10^3 s, with downscaling (top-down) and upscaling (bottom-up) information traffic between levels. By courtesy of Dionisios G. Vlachos.]
Our goal: increase the time and length scales of CGMC by using GPUs.

3 Outline
Background
   CGMC
   GPU
CGMC implementation on a single GPU
   Optimize memory use
   Optimize the algorithm
CGMC implementation on multiple GPUs
   Combine OpenMP and CUDA
Accuracy
Conclusions
Future work

4 Coarse-Grained Kinetic Monte Carlo (CGMC)
Studies phenomena such as catalysis, crystal growth, and surface diffusion.
Groups neighboring microscopic sites (molecules) together into "coarse-grained" cells.
[Figure: mapping of an 8 x 8 grid of microscopic sites onto 2 x 2 coarse-grained cells.]

5 Terminology
Coverage
   A 2D matrix holding the number of molecules of each species in every cell
Events
   Reaction
   Diffusion
Probability
   A list of the probabilities of all possible events in every cell
   Each probability is calculated from the coverage of the cell's neighbors
   A probability must be updated whenever the coverage of a neighbor changes
   Events are selected based on these probabilities

6 Reaction
[Figure: illustration of a reaction event inside a cell.]

7 Diffusion
[Figure: illustration of a diffusion event between neighboring cells.]

8 CGMC Algorithm
[Figure: the CGMC algorithm, shown as a diagram on the slide.]
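The slide presents the algorithm only as a diagram. As a minimal sketch of what one tau-leap iteration might look like as a CUDA host driver, under assumed kernel names, arguments, and launch shape (none of these appear in the talk):

    #include <cuda_runtime.h>

    // Placeholder kernels for the per-leap steps named elsewhere in the talk.
    __global__ void update_probabilities(const int *coverage, float *probability) { /* ... */ }
    __global__ void select_events(const float *probability, const float *randoms,
                                  int *eventCounts, float tau) { /* ... */ }
    __global__ void execute_events(const int *eventCounts, int *coverage) { /* ... */ }

    // One tau-leap per iteration: recompute probabilities from the coverage,
    // sample which events fire during the leap, then apply them.
    void cgmc_run(int *d_coverage, float *d_probability, const float *d_randoms,
                  int *d_eventCounts, int nCells, int nLeaps, float tau) {
        int threads = 256;
        int blocks = (nCells + threads - 1) / threads;
        for (int leap = 0; leap < nLeaps; ++leap) {
            update_probabilities<<<blocks, threads>>>(d_coverage, d_probability);
            select_events<<<blocks, threads>>>(d_probability, d_randoms, d_eventCounts, tau);
            execute_events<<<blocks, threads>>>(d_eventCounts, d_coverage);
            cudaDeviceSynchronize();   // all cells finish before the next leap begins
        }
    }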

9 GPU Overview (I)
NVIDIA Tesla C1060:
   30 streaming multiprocessors (SMs)
   8 scalar processors per SM
   30 8-way SIMD cores = 240 processing elements
Massively parallel, multithreaded:
   Up to 30,720 active threads handled by the thread execution manager
Processing power:
   933 GFLOPS (single precision)
   78 GFLOPS (double precision)
From: CUDA Programming Guide, NVIDIA

10 GPU Overview (II)
Memory types:
   Read/write per thread: registers, local memory
   Read/write per block: shared memory
   Read/write per grid: global memory
   Read-only per grid: constant memory, texture memory
Communication among devices and with the CPU goes through the PCI Express bus.
From: CUDA Programming Guide, NVIDIA

11 Multi-thread Approach
One GPU thread per cell. In each τ leap, every thread:
   Reads its cell's coverage from global memory
   Reads the event probabilities from global memory
   Selects events
   Updates the coverage in global memory
   Updates the probabilities in global memory
Threads synchronize at the end of each leap.
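As a hedged sketch of the per-thread work under this mapping, where the event "execution" is a toy stand-in (a real event modifies several species counts) and all names are assumptions:

    // One thread = one cell; coverage and probabilities live in global memory.
    __global__ void tau_leap_global(int *coverage, const float *probability,
                                    const float *randoms, int nCells, int nEvents) {
        int cell = blockIdx.x * blockDim.x + threadIdx.x;
        if (cell >= nCells) return;
        for (int e = 0; e < nEvents; ++e) {
            float p = probability[cell * nEvents + e];  // read probability (global)
            float r = randoms[cell * nEvents + e];      // pre-generated random number
            if (r < p)
                atomicAdd(&coverage[cell], 1);          // toy "execute event" on the
        }                                               // coverage in global memory
        // The probabilities of this cell and its neighbors are then recomputed
        // from the updated coverage (elided); threads synchronize between leaps.
    }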

12 Memory Optimization
Same mapping (one GPU thread per cell), but the coverage is kept in shared memory during the leap. In each τ leap, every thread:
   Reads its cell's coverage from shared memory
   Reads the event probabilities from global memory
   Selects and executes events
   Updates the coverage in shared memory
   Updates the probabilities in global memory
   Writes the coverage back to global memory at the end of the leap
Threads synchronize at the end of each leap.
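A sketch of the same kernel with the coverage staged in shared memory; it assumes each block's threads cover a contiguous run of cells and that the cell count is a multiple of the block size:

    // Coverage is read into shared memory once, updated there during the leap,
    // and written back to global memory once at the end.
    __global__ void tau_leap_shared(int *coverage, const float *probability,
                                    const float *randoms, int nEvents) {
        extern __shared__ int sCoverage[];              // one entry per cell in the block
        int cell = blockIdx.x * blockDim.x + threadIdx.x;
        sCoverage[threadIdx.x] = coverage[cell];        // global -> shared, once
        __syncthreads();
        for (int e = 0; e < nEvents; ++e) {
            float p = probability[cell * nEvents + e];  // probabilities stay in global
            if (randoms[cell * nEvents + e] < p)
                sCoverage[threadIdx.x] += 1;            // toy update on the shared copy
        }
        __syncthreads();
        coverage[cell] = sCoverage[threadIdx.x];        // shared -> global, once
    }
    // Launch with: tau_leap_shared<<<blocks, threads, threads * sizeof(int)>>>(...);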

13 Performance
Platforms:

                     GPU: Tesla C1060     CPU: Intel Xeon X5450
   Number of cores   240                  4
   Memory            4 GB (global)        4 GB
   Clock rate        1.44 GHz             3 GHz

14 Performance
Significant increase in performance: on the time scale, GPU CGMC simulations can be 100X faster than the CPU implementation.

15 Rethinking the CGMC Algorithm for GPUs
Redesign of cell structures:
   Macro-cell = a set of cells
   The number of cells per macro-cell is flexible
Redesign of event execution:
   One thread per macro-cell (coarse-grained parallelism) or one block per macro-cell (fine-grained parallelism)
   Events replicated across macro-cells
[Figure: cells in a 4x4 system grouped into 2-layer and 3-layer macro-cells.]

16 Synchronization Frequency
[Figure: leaps 1-4 with a single synchronization point, annotated with the first, second, and third layers of a macro-cell; the number of layers determines how many leaps can run between synchronizations.]

17 Event Replication
Example event E1: A[cell 1] -> A[cell 2], where A is a molecular species. Because boundary cells are replicated across macro-cells, the same event must be executed in every macro-cell that holds a replica of the affected cells:

            Macro-cell 1 (block 0)                 Macro-cell 2 (block 1)
            Cell 1 (thr 0)   Cell 2 (thr 1) ...    Cell 2 (thr 5)   Cell 1 (thr 6)   Cell 3 (thr 7) ...
   Leap 1   E1(A--)          E1(A++)               E1(A++)          E1(A--)
   Leap 2   E1(A--)          E1(A++)

18 Random Number Generator
Option 1: generate the random numbers on the CPU and transfer them to the GPU
   Simple and easy
   Uniform distribution
Option 2: have a dedicated kernel generate the numbers and store them in global memory
   Many global-memory accesses
Option 3: have each thread generate random numbers as needed
   e.g., the Wallace Gaussian generator
   High shared-memory requirement

19 Random Number Generator
1. Generate as many random numbers as there are possible events on the CPU
2. Store the random numbers in texture memory (read-only)
3. Run one leap of the simulation:
   a. Each event fetches its own random number from texture memory
   b. Select events
   c. Execute events and update
4. End of the leap
5. Go to step 1
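A sketch of steps 1-3 using the legacy (Tesla-era) CUDA texture API; the kernel body and all names are illustrative assumptions:

    // 1D texture over the pre-generated random numbers (read-only on the GPU).
    texture<float, 1, cudaReadModeElementType> rnTex;

    __global__ void leap_with_texture_rn(int nEvents) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= nEvents) return;
        float r = tex1Dfetch(rnTex, idx);   // each event fetches its own number
        // ... use r to select/execute this event (elided) ...
        (void)r;
    }

    // Host side: copy the CPU-generated numbers to the GPU, bind the texture.
    void upload_randoms(const float *h_rn, float *d_rn, int n) {
        cudaMemcpy(d_rn, h_rn, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaBindTexture(0, rnTex, d_rn, n * sizeof(float));
    }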

20 Coarse-Grained vs. Fine-Grained
2-layer coarse-grained:
   Each thread is in charge of one macro-cell
   Each thread simulates five cells in every first leap and one central cell in every second leap
2-layer fine-grained:
   Each block is in charge of one macro-cell and has five threads
   Each thread simulates one cell in every first leap
   The five threads simulate the central cell together in every second leap
(See the launch-configuration sketch below.)
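At its simplest, the two granularities differ in the launch configuration; a minimal sketch (in practice the two kernels also differ internally, and all names here are assumptions):

    __global__ void leap_macro(int *coverage) { /* per-cell work for the leap */ }

    void launch_both_ways(int *d_coverage, int nMacroCells) {
        // Coarse-grained: one thread simulates a whole 5-cell macro-cell.
        leap_macro<<<nMacroCells, 1>>>(d_coverage);
        // Fine-grained: one block per macro-cell, one thread per cell (5 threads).
        leap_macro<<<nMacroCells, 5>>>(d_coverage);
    }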

21 Performance
[Figure: events/ms vs. number of cells, comparing small and large molecular systems.]

22 Multi-GPU Implementation
A single GPU limits the number of cells that can be processed in parallel, so we use multiple GPUs.
Synchronization between multiple GPUs is costly, so we take advantage of the 2-layer fine-grained parallelism.
We combine OpenMP with CUDA for multi-GPU programming:
   Portable pinned memory for communication between CPU threads
   Mapped pinned memory for data copies between CPU and GPU (zero copy)
   Write-combined memory for fast data access

23 Pinned Memory
Portable pinned memory (PPM)
   Available to all host threads
   Can be freed by any host thread
Mapped pinned memory (MPM)
   Allocated on the host side
   Can be mapped into the device's address space
   Two addresses: one in host memory and one in device memory
   No explicit memory copy between CPU and GPU
Write-combined pinned memory (WCM)
   Transfers across the PCI Express bus are not snooped
   High transfer performance, but no data-coherency guarantee
(A small allocation sketch follows.)
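The three flavors map onto cudaHostAlloc() flags in the CUDA runtime; the flags and calls below are the real API, while the buffer names and sizes are illustrative:

    #include <cuda_runtime.h>

    void allocate_pinned_examples(size_t bytes) {
        float *ppm, *mpm, *wcm, *d_mpm;

        // Portable pinned memory: usable and freeable by any host thread.
        cudaHostAlloc((void**)&ppm, bytes, cudaHostAllocPortable);

        // Mapped pinned memory: also mapped into the device address space
        // (zero copy). Requires cudaSetDeviceFlags(cudaDeviceMapHost) before
        // the CUDA context is created.
        cudaHostAlloc((void**)&mpm, bytes, cudaHostAllocPortable | cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&d_mpm, mpm, 0);  // device-side alias

        // Write-combined pinned memory: fast PCIe transfers (not snooped), but
        // uncached on the CPU, so the host should write it and never read it.
        cudaHostAlloc((void**)&wcm, bytes, cudaHostAllocPortable | cudaHostAllocWriteCombined);

        cudaFreeHost(ppm); cudaFreeHost(mpm); cudaFreeHost(wcm);
    }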

24 Combine OpenMP and CUDA
[Figure: OpenMP host threads driving the GPUs, sharing pinned memory.]
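The slide shows this as a diagram; a minimal sketch of the pattern, with one OpenMP thread per GPU sharing a portable pinned buffer (buffer name and size are assumptions):

    #include <omp.h>
    #include <cuda_runtime.h>

    int main() {
        int nGpus = 0;
        cudaGetDeviceCount(&nGpus);

        // Portable pinned buffer: the CPU-side RNG fills it, all threads see it.
        float *h_randoms;
        size_t bytes = 1 << 20;  // illustrative size
        cudaHostAlloc((void**)&h_randoms, bytes, cudaHostAllocPortable);
        // ... fill h_randoms with the CPU random number generator ...

        #pragma omp parallel num_threads(nGpus)
        {
            cudaSetDevice(omp_get_thread_num());  // one host thread drives one GPU
            // Each thread copies its slice of h_randoms to its device and runs
            // the 2-layer fine-grained leap kernels on its share of the cells.
        }

        cudaFreeHost(h_randoms);
        return 0;
    }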

25 Multi-GPU Implementation
For the 2-layer implementation: the CGMC parameters, random numbers, coverage, and probabilities live in CPU pinned memory, and the CPU-side RNG fills the random numbers. Each GPU then runs its leaps independently; in every leap it selects events, executes events, updates the coverage, and updates the probabilities.
[Figure: CPU pinned memory (RNG, random numbers, CGMC parameters) feeding GPU 1 and GPU 2, each executing leaps 1 and 2.]

26 Performance with Different Numbers of GPUs
[Figure: events/ms vs. number of cells for different numbers of GPUs.]

27 Performance with Different Layers on 4 GPUs
[Figure: events/ms vs. number of cells for different macro-cell layer counts on 4 GPUs.]

28 Accuracy
Simulations with three species of molecules, A, B, and C:
   A changes to B, and vice versa
   Two B molecules change to C, and vice versa
   A, B, and C can diffuse
We track the number of molecules in the system while it approaches equilibrium and at equilibrium.
[Figure: molecule counts approaching and at equilibrium for the GPU, CPU-32, and CPU-64 runs.]

29 Conclusions
We present a parallel implementation of the tau-leap method for CGMC simulations on single and multiple GPUs.
We identify the most efficient parallelism granularity for a molecular system of a given size:
   42X speedup for small systems using 2-layer fine-grained thread parallelism
   100X speedup for average systems using 1-layer parallelism
   120X speedup for large systems using 2-layer fine-grained thread parallelism across multiple GPUs
We increase both the time scale (up to 120 times) and the length scale (up to 100,000).

30 Future Work
Study molecular systems with heterogeneous distributions.
Combine MPI, OpenMP, and CUDA to use multiple GPUs across different machines.
Use MPI to implement the CGMC parallelization on a CPU cluster:
   Compare the performance of GPU and CPU clusters
Link phenomena and models across scales, i.e., from QM to MD to CGMC.

31 Acknowledgements
Sponsors: [logos on slide]
Collaborators: Dionisios G. Vlachos and Stuart Collins (UDel)
GCL members (Spring 2010): Trilce Estrada, Boyu Zhang, Abel Licon, Narayan Ganesan, Lifan Xu, Philip Saponaro, Maria Ruiz, Michela Taufer
More questions: xulifan@udel.edu

