Pseudorandom Number Generation on the GPU
Myles Sussman, William Crutchfield, Matthew Papakipos

Outline
- Motivation & Constraints (Why and What)
- Review: CPU-based Linear RNGs
- Parallelization strategy
- Why Linear RNGs are impractical on GPUs (now)
- Nonlinear RNGs
- Gotchas
- Performance on real GPUs
- Conclusions & Suggestions for the Hardware

Motivation & Constraints
Why?
- Use the GPU for Monte Carlo integration
- Ideal for GPGPU: compute a lot, output a little (mean, median; uncertainty ~ O(1/√N))
- Generating the random numbers on the CPU implies lots of CPU-GPU traffic
What?
- Don't reinvent the RNG wheel! Lots of existing theory on RNGs
- "Industry standards": MKL (Intel), ACML (AMD), others

Randomness
- Diehard and TestU01: is it random enough?
- Like repeated poker games: ensure the house isn't cheating (p-value)

Linear RNGs
- Modulus m, multiplier a
- Sequence x_{n+1} = a*x_n mod m; period is m
- Output u = x/m in [0,1)
- Many types: LCG, MCG, MRG
- Combined generators have a larger period (e.g. m1 * m2)
- Data dependency: each step needs the "seed" or previous value (see the sketch below)
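
To make the data dependency concrete, here is a minimal one-stream MCG step. This is an illustrative sketch only: the small prime modulus and the multiplier are assumptions chosen so that every product stays exact in 24-bit float math, not the paper's parameters.

static const float MC_M = 16381.0;   // small prime modulus (assumption)
static const float MC_A = 665.0;     // multiplier (assumption)

// x_{n+1} = a * x_n mod m; each output depends on the previous state
float mcg_next(inout float state)
{
    float p = MC_A * state;                // 665 * 16380 < 2^24, so exact
    state = p - floor(p / MC_M) * MC_M;    // mod m (ulp-safe version under "Gotchas")
    return state / MC_M;                   // output u in [0,1)
}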

Parallelizing
- Each pixel is a separate (virtual) thread
- Required: an independent sequence at each pixel
[Figure: 2D texture with pixels (0,0), (0,1), ..., (M,N), laid out over the processors in x and y]

Parallelizing
- Each pixel is a separate (virtual) thread
- Required: an independent sequence at each pixel
- How to achieve independence, two options:
  - A different method at each pixel: Wichmann-Hill family (273 methods)
  - One long sequence with each pixel assigned a different "block": MRG32k3a

Blocking
- Each pixel (substream) outputs one block from the long sequence (see the sketch below)
- Easy to get burned! Linear RNGs have long-range correlations
- MRG32k3a was painstakingly optimized to minimize correlations between blocks
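
For a one-term generator, the essence of blocking is that a substream can jump straight to its block start by modular exponentiation, x_{n+k} = (a^k mod m) * x_n mod m. A sketch under an assumed small prime modulus (so products stay exact in 24-bit floats); MRG32k3a does the analogue with 3x3 matrix powers mod m, which is not shown:

static const float PM = 4093.0;      // small prime modulus (assumption)

// (a*b) mod PM for a, b < PM; products stay below 4093^2 < 2^24, so exact
float mul_mod(float a, float b)
{
    float p = a * b;
    float r = p - floor(p / PM) * PM;
    r = (r < 0.0) ? r + PM : r;      // guard the +/-1 ulp division cases
    r = (r >= PM) ? r - PM : r;      //   (see "Gotchas" and mod_div below)
    return r;
}

// x^e mod PM by square-and-multiply: O(log e) multiplies
float pow_mod(float x, float e)
{
    float r = 1.0;
    for (int i = 0; i < 24; ++i)     // 24 bits cover any e < 2^24
    {
        if (fmod(e, 2.0) >= 1.0) r = mul_mod(r, x);
        x = mul_mod(x, x);
        e = floor(e / 2.0);
    }
    return r;
}

// jump a substream ahead k steps: x_{n+k} = (a^k mod m) * x_n mod m
float skip_ahead(float x, float a, float k)
{
    return mul_mod(x, pow_mod(a, k));
}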

How Much Seed Data?
- Each thread can only write 16 floats (four float4 render targets)
  - At least one is your result; the others are needed to update the seed
- MRG32k3a: state is 6 doubles = 12 floats, which leaves room for only 4 results
- A 4096 x 4096 x 4 buffer of results = 192 MB of seed!
- Seed updates from the CPU are slow
- What about Wichmann-Hill? With 273 methods, each thread needs to write 240K results!
- Linear RNGs aren't practical today

Nonlinear RNGs
- Explicit Inverse Congruential Generator (EICG)
- No data dependency: the n-th value is computed directly as x_n = inverse(a*(n + n0) + b) mod m
- Period is m
- May be combined; period is m1 * m2
- Fewer correlation "troubles"
- Compute cost ~ O(log(m)) = more expensive, but GPUs are faster... (sketch below)
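
A sketch of one EICG output, reusing mul_mod/pow_mod and the assumed small prime PM from the blocking sketch above (the paper instead combines three full 24-bit generators, with splitting tricks not shown here). For prime m, Fermat's little theorem gives inverse(x) = x^(m-2) mod m, which is where the O(log m) cost comes from.

static const float EI_A = 17.0, EI_B = 1.0;   // generator parameters (assumption)

// n-th output u_n = inverse(EI_A*(n + n0) + EI_B) mod PM, scaled to [0,1).
// n is this pixel's block start plus its in-block offset; n0 is the shared
// seed advanced between launches (next two slides). n + n0 < 2^24 assumed
// here so the float sum is exact.
float eicg(float n, float n0)
{
    float x = mul_mod(EI_A, fmod(n + n0, PM));
    x += EI_B;
    x = (x >= PM) ? x - PM : x;
    float inv = (x == 0.0) ? 0.0 : pow_mod(x, PM - 2.0);   // Fermat inverse
    return inv / PM;
}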

Parallelizing Made Simple
- Pixel at texture coordinate (x,y) gets its own block: 4096 x 4096 independent blocks of length B
- Floating point math = m is 24 bits; tricks must be played to keep intermediates within 24 bits
- Seed data n + n0 is the same for all pixels!
- Can be managed on the CPU or GPU or both (~100 bytes)

Managing Seed Data

“Ultimate” Architecture?
[Figure: cluster of CPU 1, CPU 2, ..., CPU N, each driving a GPU]
- Blocking = independent substreams
- Seeds for the GPUs are advanced by the "cluster sub-block size"
- Many cluster architectures are possible

Gotchas
Some things are different on the GPU...
- Integer division is inexact
  - N/N doesn't always equal 1
  - The remainder can be off by ±1 (= error in mod)
  - Special tricks are needed (see the paper, and the accurate-mod HLSL sample at the end)
- Floating point math = 24 bits
  - MRG32k3a was designed for 53 bits (doubles): requires three floats to store intermediates
  - Nonlinear RNG: combine three 24-bit generators for a long period

Performance of RNGs

RNG type      | Usable sequence length at each pixel | ATI Radeon X… MHz | Intel Xeon 3.6 GHz
Nonlinear     | ~ …                                  | … million/sec     | 0.3 million/sec
MRG32k3a      | > …                                  | kernel only       | 110 million/sec (MKL)
Wichmann-Hill | > …                                  | kernel only       | 79 million/sec (MKL)

Unlimited Outputs Per Thread
- Wichmann-Hill ops are 10x faster than the CPU, but we need >240K outputs per thread
- MRG32k3a ops are the same speed as on the CPU
- Anticipate a large speedup with integers (DirectX 10), or if we get doubles
- But we still need many more outputs per thread

Performance of RNGs

RNG type      | Usable sequence length at each pixel | ATI Radeon X… MHz | Intel Xeon 3.6 GHz
Nonlinear     | ~ …                                  | … million/sec     | 0.3 million/sec
MRG32k3a      | > …                                  | kernel only       | 110 million/sec (MKL)
Wichmann-Hill | > …                                  | kernel only       | 79 million/sec (MKL)
New nonlinear | ~ …                                  |                   |

Conclusions & Suggestions
- We can do RNGs / Monte Carlo on the GPU!
- Nonlinear RNGs are a good solution today
- Linear RNGs would be better if these desired hardware features arrive:
  - Unlimited (or many more) outputs per thread
  - Integers (DirectX 10) & doubles
  - More instructions in each shader program

HLSL sample : Accurate mod

float4 mod_div(float4 a, float4 b, out float4 d)
{
    d = floor(a/b);
    float4 r = a - d*b;
    // if a/b rounded up past an integer, floor is 1 too big and r < 0
    d = (r < 0) ? d - 1 : d;
    r = (r < 0) ? r + b : r;
    // if a/b rounded down past an integer, floor is 1 too small and r >= b
    d = (r < b) ? d : d + 1;
    r = (r < b) ? r : r - b;
    return r;
}
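
A quick usage example (added for illustration, with made-up values): mod_div returns the exact remainder and quotient on integer-valued inputs.

float4 d;
float4 r = mod_div(float4(7, 8, 9, 10), float4(3, 3, 3, 3), d);
// r = (1, 2, 0, 1) and d = (2, 2, 3, 3), even if a/b rounds across an integer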

HLSL sample : Pixel Shader

/* seed data for all components, used by ceicg_gpu_4 */
sampler seed_data;

/* generate 4 random numbers at each pixel position */
float4 ceicg_gpu_4( float2 pixel_pos )
{
    /* depends only on pixel position and seed data */
}

struct PS_OUTPUT
{
    float4 color0 : COLOR0;
};

/* main pixel shader program for nonlinear RNG */
PS_OUTPUT ps_main(float2 pos : VPOS)
{
    PS_OUTPUT po;
    po.color0 = ceicg_gpu_4(pos);
    return po;
}