Status of Vectorized Physics Soon Yung Jun (FNAL).

Slides:



Advertisements
Similar presentations
The MSC Process in Geant4
Advertisements

POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
Starting Parallel Algorithm Design David Monismith Based on notes from Introduction to Parallel Programming 2 nd Edition by Grama, Gupta, Karypis, and.
Chapter 11: File System Implementation
Prof. Reinisch, EEAS / Simple Collision Parameters (1) There are many different types of collisions taking place in a gas. They can be grouped.
1 Stratified sampling This method involves reducing variance by forcing more order onto the random number stream used as input As the simplest example,
Cache intro CSE 471 Autumn 011 Principle of Locality: Memory Hierarchies Text and data are not accessed randomly Temporal locality –Recently accessed items.
Mapping Computational Concepts to GPUs Mark Harris NVIDIA Developer Technology.
Cg Programming Mapping Computational Concepts to GPUs.
Lecture 7 – Data Reorganization Pattern Data Reorganization Pattern Parallel Computing CIS 410/510 Department of Computer and Information Science.
Radiation Shielding and Reactor Criticality Fall 2012 By Yaohang Li, Ph.D.
Detector Simulation on Modern Processors Vectorization of Physics Models Philippe Canal, Soon Yung Jun (FNAL) John Apostolakis, Mihaly Novak, Sandro Wenzel.
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 1 Compiler Techniques for ILP  So far we have explored dynamic hardware techniques for.
Lesson 4: Computer method overview
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
Gamma ray interaction with matter A) Primary interactions 1) Coherent scattering (Rayleigh scattering) 2) Incoherent scattering (Compton scattering) 3)
Progress on Component-Based Subsurface Simulation I: Smooth Particle Hydrodynamics Bruce Palmer Pacific Northwest National Laboratory Richland, WA.
Inclusive Measurements of inelastic electron/positron scattering on unpolarized H and D targets at Lara De Nardo for the HERMES COLLABORATION.
Gamma Photon Transport on the GPU for PET László Szirmay-Kalos, Balázs Tóth, Milán Magdics, Dávid Légrády, Anton Penzov Budapest University of Technology.
A parallel High Level Trigger benchmark (using multithreading and/or SSE)‏ Håvard Bjerke.
Geant4 on GPU prototype Nicholas Henderson (Stanford Univ. / ICME)
Andrei Gheata (CERN) for the GeantV development team G.Amadio (UNESP), A.Ananya (CERN), J.Apostolakis (CERN), A.Arora (CERN), M.Bandieramonte (CERN), A.Bhattacharyya.
GeantV fast simulation ideas and perspectives Andrei Gheata for the GeantV collaboration CERN, May 25-26, 2016.
Dollan, Laihem, Lohse, Schälicke, Stahl 1 Monte Carlo based studies of polarized positrons source for the International Linear Collider (ILC)
Interaction of Radiation with Matter
Vector Physics Models Soon Yung Jun US January 30, 2015.
Parallel Computing Chapter 3 - Patterns R. HALVERSON MIDWESTERN STATE UNIVERSITY 1.
16 th Geant4 Collaboration Meeting SLAC, September 2011 P. Mato, CERN.
Chapter 2 Radiation Interactions with Matter East China Institute of Technology School of Nuclear Engineering and Technology LIU Yi-Bao Wang Ling.
File-System Implementation
GeantV Philippe Canal (FNAL) for the GeantV development team
Vector Electromagnetic Physics Models & Field Propagation
Chapter 11: File System Implementation
William Stallings Computer Organization and Architecture
Task Scheduling for Multicore CPUs and NUMA Systems
Hash-Based Indexes Chapter 11
Cache Memory Presentation I
Morgan Kaufmann Publishers
Vector Processing => Multimedia
Spatial Online Sampling and Aggregation
Pipelining and Vector Processing
Dycore Rewrite Tobias Gysi.
Chapter 11: File System Implementation
Unconventional Fixed-Radix Number Systems
Hashing.
CS222: Principles of Data Management Notes #8 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Hash-Based Indexes Chapter 10
Multivector and SIMD Computers
UNIVERSITY OF MASSACHUSETTS Dept
CS222P: Principles of Data Management Notes #8 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Hash-Based Indexes Chapter 11
Fast Reroute in P4: Keep Calm and Enjoy Programmability
Monte Carlo I Previous lecture Analytical illumination formula
Monte Carlo Path Tracing
Chapter 11: File System Implementation
The Transport Equation
Geant4 Low Energy Compton Profile
Math review - scalars, vectors, and matrices
Loop-Level Parallelism
Chapter 11 Instructor: Xin Zhang
Principle of Locality: Memory Hierarchies
The University of Adelaide, School of Computer Science
PHYS 3446 – Lecture #17 Particle Detection Particle Accelerators
Suman Das, Sridhar Rajagopal, Chaitali Sengupta and Joseph R.Cavallaro
Lesson 4: Application to transport distributions
Rohan Yadav and Charles Yuan (rohany) (chenhuiy)
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #07 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Pipelining.
Lesson 3. Controlling program flow. Loops. Methods. Arrays.
Presentation transcript:

Status of Vectorized Physics Soon Yung Jun (FNAL)

25C Ago: Death of Hector Hector Achilles GeantV FNAL meeting2

2 Decades Ago: Death of Vector Microprocessor GeantV FNAL meeting3

Today: Vector is still on HPC Battle Field GPGPU MIC HPC GeantV FNAL meeting4

Allies of GeantV GeantV FNAL meeting5 Federico Andrei Sandro Rene Philippe S. JohnA. Mihaly G.Lima

Specialized Applications Vector: no good for convention physics model! – pipelined instruction/data throughput – locality (spatial, temporal) GPU : no good for convention physics model! – arithmetic intensity – substantial parallelism – throughput over latency Common strategies or potential solutions? – decompose tasks – minimize branches and data dependencies – share or reuse data GeantV FNAL meeting6

High Performance Physics Models Physics processes and models cannot be easily vectorized or parallelized: Why? – stochastic process (random-ness kills parallelism) – decision trees (branches and switches) Paradigm swift – task level parallelism – algorithmic changes (performance over …) – tabulated physics – alias method (replace loops with conditional breaks) GeantV FNAL meeting7

Task Decomposition Task by the particle type (e -,  h) or geometry Regroup particles (tracks) by the same task GeantV FNAL meeting8

Alias Method: Sampling Recast p.d.f f(x) to N equal probable events, each with likelihood 1/N=c Alias table – a[recipient]=donor – q[N] = non-alias probability Sampling x j : random u 1, u 2 – bin index: N x u 1 = i +  – If (q [i] < u 2 ) : take j=i else take j = a[i] – x j = [  j + (1-  ) (j+1)]  x GeantV FNAL meeting9

Particle Tracking Process – mean free path analysis (step length) – sampling physics process Associated – multiple scattering (true path length) – transport (geometry limited step) Interaction – sampling secondary particles – other collision analysis GeantV FNAL meeting10

Applications of Alias Methods Replace non-vectorizable conditional branches – composition and rejection methods (do-while) no bias compared with Geant4 in secondary particle sampling (see the next page) – conditional exit to select (if-break) physics process (relative weight of cross sections) random element of composite material (can have pre-built alias tables from xsec data table?) To vectorize alias method effectively – coalesced random array – gather scatter memory accesses GeantV FNAL meeting11

Alias Method: Validation Angles of scatter particles GeantV FNAL meeting12

Gather Gather operations may be very expensive if associated arithmetic intensity is light (overhead > gain) GeantV FNAL meeting13

Sampling Physics Process Inclusive sampling vs. explicit sampling InclusiveExplicit step length (d) physics process d=min(d t,d g ) d t =-log(u)/  t sampling by xsec weights d=min(d i,d g ) d i =-log(u)/  i Process for the smallest d i N int leftprocess independentmaterial independent N random N log(rand) less than nv + (np-1) nv nv nv x np Consideration for vectorization conditional exit to select a process  wgt[i] < u if test to select d and a physics process implementationholistic approach with a fixed number of processes modularized (OO) design GeantV FNAL meeting14

Remarks Statically, explicit sampling = inclusive sampling – validation for passage through inhomogeneous materials (Geant4 vs. GeantV inclusive sampling?) – Independent of process history? Cons and Pros – performance-wise: implicit – flexibility-wise: explicit – vectorization-wise: explicit Performance, accuracy, portability GeantV FNAL meeting15

Status Implemented alternative secondary particle sampling (alias, inverse CDF) for – electron models Bremstraahlung (Seltzer Berger) Ionization (Moller Bhabha) – photon models Compton (Klein Nishina) Working on vectorization schema for the mean free path analysis: need to converge – tracking strategies (inclusive vs. explicit) – simd layers (outer vs. inner loop) GeantV FNAL meeting16

Toward HP Physics Library Efficient random number handling and data layout More task decomposition Adopt the template approach (portability) Need more troops GeantV FNAL meeting17

Backup GeantV FNAL meeting18

What to vectorize? Ex1: do-while loop Alternative: alias method for sampling from p.d.f (differential cross sections) xo = f(Z); // common data for(i = 0; i < ntracks ; ++i) { // for the problem-size yo = g((track.E)[i],xo); // for a given constraint \end{verbatim} {\color{magenta} \begin{verbatim} do { // sample x and y x[i] = h(r1); y[i] = s(x[i],r2); } while (y[i] < yo); } GeantV FNAL meeting19

What to vectorize? Ex2: conditional exit to select a physics process Alternative: alias method with the normalized weight of cross sections (physics processes) // np = number of process // w[np] = relative weight of cross sections int ip = np; double w = 0; double fu = rand() for (int i=0 ; i < np-1 ; ++i) { w += w[i]; if (w < fu) { ip = i; break; } } GeantV FNAL meeting20

What to vectorize? Ex3: conditional exit to select a physics process Alternative: alias method with the normalized weight of cross sections (physics processes) // np = number of process // w[np] = relative weight of cross sections int ip = np; double w = 0; double fu = rand() for (int i=0 ; i < np-1 ; ++i) { w += w[i]; if (w < fu) { ip = i; break; } } GeantV FNAL meeting21

What to vectorize? Ex4: if-test to select a random atom Alternative: mask // np = number of processes // s[np] = proposed step lengths int ip; double s = 0; double tmp = MAX; for(int i=0 ; i < np ; ++i) { if(s[i] < tmp) { s = s[i]; tmp = s; } } GeantV FNAL meeting22