Efficient Pseudo-Random Number Generation for Monte-Carlo Simulations Using GPU Siddhant Mohanty, Subho Shankar Banerjee, Dushyant Goyal, Ajit Mohanty.


Efficient Pseudo-Random Number Generation for Monte-Carlo Simulations Using GPU Siddhant Mohanty, Subho Shankar Banerjee, Dushyant Goyal, Ajit Mohanty and Federico Carminati

Device: NVIDIA GeForce GTX 480
Compute Capability: 2.0 (suitable for double precision)
Total Global Memory:
Total Constant Memory:
Multiprocessor count: 15
Shared memory per MP:
Registers per MP:
Threads per warp: 32
Maximum threads per block: 1024
Maximum number of blocks:

Two important aspects of this presentation: Fast and efficient random number generation on the GPU. Multivariate correlated sampling using both CPU and GPU.

Random Number Generation The Graphics Processing Unit (GPU) is a class of device designed with parallelism in mind: a highly parallel, multithreaded computing device. One application where massive parallelism comes naturally is Monte-Carlo simulation, where a large number of independent events must be simulated. At the core of a Monte-Carlo simulation lies the random number generator.

For GPU programming, the random number generator should have: Good statistical properties. High computational speed. Low memory use. Large period.

Mersenne Twister It is one of the most respected methods (used in ROOT as class TRandom3). Has a large period of 2^19937 - 1. Very good statistical properties.

However, it is not suitable for implementation on the GPU, as it has a large state that must be updated serially. Each GPU thread would need an individual state in global RAM, with multiple accesses per generated number. The relatively large amount of computation per generated number makes the generator too slow for GPU programming, except in cases where the ultimate in quality is needed.

Hybrid Approach: Tausworthe and LCG
LCG (Linear Congruential Generator):
x_{n+1} = (a * x_n + c) mod m
with m = 2^32 and fixed constants a and c. Here the mod operation is not explicitly required, because unsigned 32-bit overflow performs it implicitly.
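As a minimal sketch of the LCG step in C: the slide leaves a and c unspecified, so the widely used Numerical Recipes constants a = 1664525 and c = 1013904223 are assumed here for illustration. The modulus 2^32 comes for free from unsigned 32-bit wraparound.

```c
#include <stdint.h>

/* One LCG step: z <- (a*z + c) mod 2^32.
   The mod is implicit: uint32_t arithmetic wraps at 2^32.
   The constants a and c are the Numerical Recipes choices,
   assumed here because the slide leaves them blank. */
static uint32_t lcg_step(uint32_t *z)
{
    *z = 1664525u * (*z) + 1013904223u;
    return *z;
}
```

For example, starting from z = 1, one step yields 1664525 * 1 + 1013904223 = 1015568748.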

Tausworthe Generator
unsigned TausStep (unsigned &z, int s1, int s2, int s3, unsigned M)
{
    unsigned b = ( ( ( z << s1 ) ^ z ) >> s2 );
    return z = ( ( ( z & M ) << s3 ) ^ b );
}
where s1, s2, s3, M are all constants.

Combined Generator
float HybridTaus()
{
    // Combined period is LCM(p1, p2, p3, p4)
    return 2.3283064365387e-10 * (              // 2^-32, maps to [0,1)
        TausStep(seed1, 13, 19, 12, 4294967294UL) ^  // p1 = 2^31 - 1
        TausStep(seed2,  2, 25,  4, 4294967288UL) ^  // p2 = 2^30 - 1
        TausStep(seed3,  3, 11, 17, 4294967280UL) ^  // p3 = 2^28 - 1
        LCGStep(seed4, 1664525, 1013904223UL)        // p4 = 2^32
    );
}
Hence, combined period = 2^121.

Seed generation For the initial seed we use
unsigned seed (int idx) { return idx * M; }
where idx is the thread index and M is a fixed multiplier constant. This is known as the quick-and-dirty LCG, and the sequence is random enough for seed generation.

Auto-Correlation
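The slide shows an auto-correlation plot of the generator's output. As an illustration (not the authors' code), the lag-k sample autocorrelation plotted there can be estimated as follows; for a good uniform generator it should be close to 0 for every lag k > 0.

```c
#include <stddef.h>

/* Sample autocorrelation of x[0..n-1] at the given lag:
   r(k) = sum_i (x[i]-mean)(x[i+k]-mean) / sum_i (x[i]-mean)^2. */
static double autocorr(const double *x, size_t n, size_t lag)
{
    double mean = 0.0, num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) mean += x[i];
    mean /= (double)n;
    for (size_t i = 0; i + lag < n; i++)
        num += (x[i] - mean) * (x[i + lag] - mean);
    for (size_t i = 0; i < n; i++)
        den += (x[i] - mean) * (x[i] - mean);
    return num / den;
}
```

As a check on the estimator itself, a strictly alternating sequence has lag-1 autocorrelation near -1 and lag-2 autocorrelation near +1.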

Multivariate Alias Sampling It is carried out: Using CPU. Using GPU (both CUDA and OpenCL).

Example: We sample using Walker's alias technique, which requires 4 random numbers per sample. The details of the alias implementation are beyond the scope of this presentation.

Plots for the original and generated distributions. Left: circles represent the original points and squares the generated points. Right: circles represent the original points and the continuous curve the generated points.


Time comparison for events

Conclusions We have implemented a hybrid generator that is quite fast and efficient for GPU programming. Using this hybrid generator we have carried out multivariate correlated Monte-Carlo sampling. The kernel execution on the GPU is about 1000 times faster than on the CPU. The overall execution is only about 10 times faster, because the memory copy between device and host is extremely slow.