Parallelisation of Random Number Generation in PLACET
Approaches of parallelisation in PLACET
Martin Blaha, University of Vienna (AT) / CERN, 25.09.2013

Additions to the centralised RNG
Tcl command RandomReset:
● sets the seeds of all streams individually: RandomReset -stream Misalignments -seed 1234
● sets the default seeds (= reset) if called without arguments
● sets the generators
● replaces the redundant seed-setting Tcl commands (e.g. Groundmotion_init)
● a help option lists all streams
Benchmarks on the non-parallelised code:
● GSL causes a slowdown of at most 3%, depending on the generator

Motivation for parallel execution
Runtimes of the simulations are slow! The "low-performance" functions are:
● SBEND
● QUADRUPOLE
● MULTIPOLE
● ELEMENT
→ they all call the RNGs through the synchrotron radiation emission (profiling by Yngve Levinsen, Feb.)

Parallel Random Number Generation
Problems:
● requesting random numbers from a sequential stream for parallel use is uncontrollable, but it has to be controllable and reproducible
● the GSL random number generators do not support parallel generation by themselves

Methods for parallel random number generation
● centralised generation
● replicated generation
● distributed generation
● existing libraries

Centralised RNG
One generator produces all the numbers.
Advantages:
● only one RNG with a good sequence
● easy implementation
Disadvantages:
● race conditions occur: fair play is not guaranteed, or the program crashes (not stable)
● slow if the accesses queue up (even slower than a single thread)
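A minimal sketch of the centralised approach, assuming GSL and OpenMP (illustrative only, not the PLACET code): guarding the one shared generator with a critical section prevents crashes, but it serialises every draw, and the order in which threads reach the generator stays nondeterministic.

    #include <gsl/gsl_rng.h>
    #include <omp.h>

    int main() {
        gsl_rng *shared = gsl_rng_alloc(gsl_rng_mt19937);  // one generator serves all threads
        gsl_rng_set(shared, 1234);

        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000000; ++i) {
            double u;
            #pragma omp critical   // safe, but serialises every draw: queueing can make
            {                      // this slower than a single thread
                u = gsl_rng_uniform(shared);
            }
            sum += u;
        }
        gsl_rng_free(shared);
        return 0;
    }

Without the critical section the threads race on the generator state, which is exactly the instability named above.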

Replicated RNG
The initial RNG is copied for each thread.
Advantages:
● more efficient
● easy implementation
Disadvantage:
● can suffer from correlations between the threads
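A sketch of replication with GSL (again illustrative): gsl_rng_clone copies the complete generator state, so every thread starts from an identical stream, the extreme case of the correlation problem, unless the copies are re-seeded or advanced differently.

    #include <cstdio>
    #include <gsl/gsl_rng.h>
    #include <omp.h>

    int main() {
        gsl_rng *master = gsl_rng_alloc(gsl_rng_mt19937);
        gsl_rng_set(master, 1234);

        #pragma omp parallel
        {
            gsl_rng *local = gsl_rng_clone(master);  // exact copy of the master state
            // every thread now draws the *same* sequence: maximally correlated streams
            std::printf("thread %d drew %f\n", omp_get_thread_num(), gsl_rng_uniform(local));
            gsl_rng_free(local);
        }
        gsl_rng_free(master);
        return 0;
    }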

Distributed RNG
Each thread has its own generator.
Advantages:
● efficient: each thread can work standalone
● thread-safe
● reproducible
Disadvantage:
● can suffer from correlations
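A distributed-generation sketch (illustrative, assuming a naive per-thread seeding scheme): every thread owns its generator, so the draws need no locking, and a rerun with the same thread count reproduces the result; the risk is that seeds derived this simply can still produce correlated streams.

    #include <gsl/gsl_rng.h>
    #include <omp.h>

    int main() {
        double sum = 0.0;
        #pragma omp parallel reduction(+:sum)
        {
            gsl_rng *local = gsl_rng_alloc(gsl_rng_mt19937);
            gsl_rng_set(local, 1234UL + omp_get_thread_num());  // naive per-thread seed
            #pragma omp for
            for (int i = 0; i < 1000000; ++i)
                sum += gsl_rng_uniform(local);                  // lock-free draws
            gsl_rng_free(local);
        }
        return 0;
    }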

Existing libraries
● SPRNG (Florida State University): it is hard to find good documentation on how to combine it with parallel code, e.g. OpenMP
● PRAND: for the CUDA environment, on GPU and CPU; good documentation on RNGs in general
Disadvantage: yet another library

Distributed RNG
Summary: distributed generation is considered to fit our needs best.
Common methods that are known to produce satisfactory outcomes:
1. Random Tree Method
2. Block Splitting
3. Leapfrog Method

Random Tree Method
● a global RNG is used for seeding
● one standalone RNG per thread
● reproducible for a known number of threads
● new Tcl command to set the number of threads
→ only runs fair for the same number of threads, not for dynamic thread assignment
(figure: seed tree)
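A random-tree sketch (illustrative, assuming GSL and OpenMP): a global generator draws one seed per thread, and each thread then runs its own generator from that seed; as long as the thread count stays the same, the per-thread streams are reproducible.

    #include <gsl/gsl_rng.h>
    #include <omp.h>
    #include <vector>

    int main() {
        const int nthreads = omp_get_max_threads();
        std::vector<unsigned long> seeds(nthreads);

        gsl_rng *global = gsl_rng_alloc(gsl_rng_mt19937);  // root of the seed tree
        gsl_rng_set(global, 1234);
        for (int t = 0; t < nthreads; ++t)
            seeds[t] = gsl_rng_get(global);                // one branch seed per thread

        #pragma omp parallel
        {
            gsl_rng *local = gsl_rng_alloc(gsl_rng_mt19937);
            gsl_rng_set(local, seeds[omp_get_thread_num()]);
            // ... draw from 'local' without locking; rerunning with the same
            //     number of threads reproduces the same per-thread streams ...
            gsl_rng_free(local);
        }
        gsl_rng_free(global);
        return 0;
    }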

Block Splitting
Split a sequence of random numbers into blocks.
Advantages:
● no overlap in the random numbers
● plays fair
Disadvantages:
● allocates a huge array of numbers
● the number of random numbers has to be known in advance
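A block-splitting sketch (illustrative; GSL has no native jump-ahead, so the simplest form pre-generates the whole sequence and hands each thread one contiguous block, which is exactly where the huge array and the need to know the count in advance come from):

    #include <gsl/gsl_rng.h>
    #include <omp.h>
    #include <vector>

    int main() {
        const long n = 1000000;              // total number of draws: known in advance
        std::vector<double> pool(n);         // the "huge array" of pre-generated numbers

        gsl_rng *rng = gsl_rng_alloc(gsl_rng_mt19937);
        gsl_rng_set(rng, 1234);
        for (long i = 0; i < n; ++i)
            pool[i] = gsl_rng_uniform(rng);  // one sequential stream: no overlap by construction

        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (long i = 0; i < n; ++i)         // schedule(static): each thread consumes
            sum += pool[i];                  // one contiguous block of the sequence
        gsl_rng_free(rng);
        return 0;
    }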

Leapfrog Method
Distributes a sequence of random numbers over several threads, one by one.
Advantages:
● the number of random numbers does not have to be known in advance
● guarantees no overlap of the random numbers
● plays fair, up to permutations in the calls
Disadvantage:
● drawing the random numbers is costly
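A leapfrog sketch (illustrative): with T threads, thread t consumes the elements t, t+T, t+2T, ... of one logical sequence; without jump-ahead each thread has to draw and discard T-1 values for every number it keeps, which is the costly part named above.

    #include <gsl/gsl_rng.h>
    #include <omp.h>

    int main() {
        #pragma omp parallel
        {
            const int t = omp_get_thread_num();
            const int T = omp_get_num_threads();

            gsl_rng *local = gsl_rng_alloc(gsl_rng_mt19937);
            gsl_rng_set(local, 1234);              // every thread replays the same base stream
            for (int k = 0; k < t; ++k)
                gsl_rng_uniform(local);            // advance to this thread's first element

            for (int i = 0; i < 1000; ++i) {
                double u = gsl_rng_uniform(local); // element t + i*T of the global sequence
                (void)u;                           // ... use u ...
                for (int k = 0; k < T - 1; ++k)
                    gsl_rng_uniform(local);        // discard T-1 draws: the leapfrog cost
            }
            gsl_rng_free(local);
        }
        return 0;
    }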

Block splitting vs. Leapfrog
● both block splitting and leapfrog run fair with dynamic thread assignment
● problem: implementing them in a distributed, non-centralised way
● the period per thread is the period of the RNG divided by the number of threads

Testing the parallel RNG methods
● SPEEDUP of up to -33.3% in runtime for the random tree method; only overheads remain for nosynrad and a small number of particles
● SLOWDOWN of up to +120% in runtime for the leapfrog method, due to drawing more numbers than needed
Testing via test-bds-track for particles, with quadrupoles and multipoles

Preparation
Tool for parallelisation: OpenMP
● easy implementation
● control of variable scope, assignment schedule, critical sections
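A minimal OpenMP sketch of the three controls named above (illustrative): explicit variable scoping with private/shared, loop assignment with schedule, and a critical section around a shared update.

    #include <omp.h>

    int main() {
        double total = 0.0;
        double x;
        #pragma omp parallel for private(x) shared(total) schedule(dynamic, 64)
        for (int i = 0; i < 100000; ++i) {
            x = 0.5 * i;             // private: each thread has its own copy
            #pragma omp critical     // serialises the update of the shared variable
            total += x;
        }
        return 0;
    }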

Preparation: centralising the synrad functions
● two functions calculate the synchrotron radiation emission: synrad.cc and photon_spectrum.cc
● centralised for an easier and reproducible use of the parallel RNG
● synrad.cc has been removed
● tested via test-bds-track for 3e5 particles: same outcome

Implementation of a new class
● new class PARALLEL_RNG
● inherits all methods from RANDOM_NEW
● initialises the parallel RNG, always on the maximum number of available threads
● new Tcl command ParallelThreads -num val to choose the number of threads
● the RNG stream Radiation now runs completely parallel by default
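A hypothetical sketch of what such a class could look like (only the names PARALLEL_RNG and RANDOM_NEW come from the slides; the base-class interface and the per-thread seeding scheme are assumptions):

    #include <gsl/gsl_rng.h>
    #include <omp.h>
    #include <vector>

    // Hypothetical stand-in for PLACET's RANDOM_NEW base class.
    class RANDOM_NEW { /* ... existing single-stream interface ... */ };

    class PARALLEL_RNG : public RANDOM_NEW {
        std::vector<gsl_rng *> rngs;               // one generator per thread
    public:
        explicit PARALLEL_RNG(unsigned long seed) {
            const int n = omp_get_max_threads();   // always sized for all available threads
            for (int t = 0; t < n; ++t) {
                gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
                gsl_rng_set(r, seed + t);          // assumed per-thread seeding scheme
                rngs.push_back(r);
            }
        }
        double Uniform() {                         // lock-free: each thread uses its own RNG
            return gsl_rng_uniform(rngs[omp_get_thread_num()]);
        }
        ~PARALLEL_RNG() { for (gsl_rng *r : rngs) gsl_rng_free(r); }
    };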

Testing – BDS tracking
(figure: covariance matrix of test-bds-track)
Testing via test-bds-track for particles, with quadrupoles and multipoles

Testing – CLIC beam tracking
(figures: beam tracking with no correction; beam tracking with simple correction)
Testing test-clic-3 for 3500 machines

Time Profile
● total runtime on 32 cores: 27 s
● total runtime on 1 core: 1 min 21 s
● total runtime of the current PLACET: 58 s
BDS tracking: PLACET 39 s, PLACET-NEW ~9 s → BDS tracking 3.5 times faster
(figure: time profile of the BDS tracking for particles)
Hardware: 2x Intel Xeon E 8-core (16 with hyper-threading) (95 W, 20 MB, 2.8 GHz Turbo, Sandy Bridge EP)

Profiling
● BDS: SBEND, ELEMENT, MULTIPOLE and QUADRUPOLE are still the most time-consuming functions
● Linac: the OMP library causes a slowdown in the simple-correction routines (e.g. test-clic-4); 76% of the time consumption is caused by OpenMP in wait_sleep
It was necessary to find a compromise!

Conclusion
● BDS runs ~30% faster (total runtime)
● CLIC 4 runs ~13% faster
(compared to the current PLACET in the trunk)
● OpenMP is a quick and easy way to parallelise existing functions

Future Plan
● understand the overhead while running sequentially
● benchmark the performance of the quick functions, e.g. dipoles, drifts, step-in, BPMs
● adjust automatically to the current configuration
● write technical/user documentation
● merge into the trunk