Geant4 on GPU prototype Nicholas Henderson (Stanford Univ. / ICME)

Slides:



Advertisements
Similar presentations
Intermediate GPGPU Programming in CUDA
Advertisements

Geant4 v9.2p02 Speed up Makoto Asai (SLAC) Geant4 Tutorial Course.
Optimization on Kepler Zehuan Wang
Christopher McCabe, Derek Causon and Clive Mingham Centre for Mathematical Modelling & Flow Analysis Manchester Metropolitan University MANCHESTER M1 5GD.
P HI T S Exercise ( II ) : How to stop , ,  -rays and neutrons? Multi-Purpose Particle and Heavy Ion Transport code System title1 Feb revised.
1 ITCS 5/4145 Parallel computing, B. Wilkinson, April 11, CUDAMultiDimBlocks.ppt CUDA Grids, Blocks, and Threads These notes will introduce: One.
GAMOS tutorial Histogram and Scorers Exercises
Acceleration of the Smith– Waterman algorithm using single and multiple graphics processors Author : Ali Khajeh-Saeed, Stephen Poole, J. Blair Perot. Publisher:
Pedro Arce Point detector scoring 1 Point detector scoring in GEANT4 Pedro Arce, (CIEMAT) Miguel Embid (CIEMAT) Juan Ignacio Lagares (CIEMAT) Geant4 Event.
Electromagnetic Physics I Tatsumi KOI SLAC National Accelerator Laboratory (strongly based on Michel Maire's slides) Geant4 v9.2.p02 Standard EM package.
DCABES 2009 China University Of Geosciences 1 The Parallel Models of Coronal Polarization Brightness Calculation Jiang Wenqian.
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
Electromagnetic Physics I Joseph Perl SLAC National Accelerator Laboratory (strongly based on Michel Maire's slides) Geant4 v9.3.p01 Standard EM package.
Data Storage Technology
University of Michigan Electrical Engineering and Computer Science Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke Sponge: Portable.
Shekoofeh Azizi Spring  CUDA is a parallel computing platform and programming model invented by NVIDIA  With CUDA, you can send C, C++ and Fortran.
Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.
An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.
Geant4: Electromagnetic Processes 2 V.Ivanchenko, BINP & CERN
Accelerating Knowledge-based Energy Evaluation in Protein Structure Modeling with Graphics Processing Units 1 A. Yaseen, Yaohang Li, “Accelerating Knowledge-based.
Chapter 3: Operating-System Structures System Components Operating System Services System Calls System Programs System Structure Virtual Machines System.
St. Petersburg State University. Department of Physics. Division of Computational Physics. COMPUTER SIMULATION OF CURRENT PRODUCED BY PULSE OF HARD RADIATION.
Automated Electron Step Size Optimization in EGS5 Scott Wilderman Department of Nuclear Engineering and Radiological Sciences, University of Michigan.
Parallel Applications Parallel Hardware Parallel Software IT industry (Silicon Valley) Users Efficient Parallel CKY Parsing on GPUs Youngmin Yi (University.
Validation and TestEm series Michel Maire for the Standard EM group LAPP (Annecy) July 2006.
Author: Haoyu Song, Fang Hao, Murali Kodialam, T.V. Lakshman Publisher: IEEE INFOCOM 2009 Presenter: Chin-Chung Pan Date: 2009/12/09.
+ CUDA Antonyus Pyetro do Amaral Ferreira. + The problem The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now.
ICS 145B -- L. Bic1 Project: Main Memory Management Textbook: pages ICS 145B L. Bic.
Physics Modern Lab1 Electromagnetic interactions Energy loss due to collisions –An important fact: electron mass = 511 keV /c2, proton mass = 940.
HPCLatAm 2013 HPCLatAm 2013 Permutation Index and GPU to Solve efficiently Many Queries AUTORES  Mariela Lopresti  Natalia Miranda  Fabiana Piccoli.
Evaluating FERMI features for Data Mining Applications Masters Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal.
Medical Accelerator F. Foppiano, M.G. Pia, M. Piergentili
Detector Simulation on Modern Processors Vectorization of Physics Models Philippe Canal, Soon Yung Jun (FNAL) John Apostolakis, Mihaly Novak, Sandro Wenzel.
Geant4 based simulation of radiotherapy in CUDA
CUDA - 2.
- DHRUVA TIRUMALA BUKKAPATNAM Geant4 Geometry on a GPU.
Calorimeters Chapter 4 Chapter 4 Electromagnetic Showers.
Simulation of the energy response of  rays in CsI crystal arrays Thomas ZERGUERRAS EXL-R3B Collaboration Meeting, Orsay (France), 02/02/ /03/2006.
Chapter 10 Designing the Files and Databases. SAD/CHAPTER 102 Learning Objectives Discuss the conversion from a logical data model to a physical database.
Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.
Jie Chen. 30 Multi-Processors each contains 8 cores at 1.4 GHz 4GB GDDR3 memory offers ~100GB/s memory bandwidth.
QCAdesigner – CUDA HPPS project
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
 Genetic Algorithms  A class of evolutionary algorithms  Efficiently solves optimization tasks  Potential Applications in many fields  Challenges.
4th Workshop on Geant4 Bio-medical Developments and Geant4 Physics Validation Riccardo Capra 1 Physics processes Software process and OOAD.
GPU Accelerated MRI Reconstruction Professor Kevin Skadron Computer Science, School of Engineering and Applied Science University of Virginia, Charlottesville,
The High Performance Simulation Project Status and short term plans 17 th April 2013 Federico Carminati.
An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-body Algorithm By Martin Burtscher and Keshav Pingali Jason Wengert.
CS/EE 217 GPU Architecture and Parallel Programming Midterm Review
Gamma Photon Transport on the GPU for PET László Szirmay-Kalos, Balázs Tóth, Milán Magdics, Dávid Légrády, Anton Penzov Budapest University of Technology.
MSc in High Performance Computing Computational Chemistry Module Parallel Molecular Dynamics (i) Bill Smith CCLRC Daresbury Laboratory
Pedro Arce Introducción a GEANT4 1 GAMOS tutorial RadioTherapy Exercises Pedro Arce Dubois CIEMAT
Update on G5 prototype Andrei Gheata Computing Upgrade Weekly Meeting 26 June 2012.
AUTO-GC: Automatic Translation of Data Mining Applications to GPU Clusters Wenjing Ma Gagan Agrawal The Ohio State University.
CUDA Compute Unified Device Architecture. Agent Based Modeling in CUDA Implementation of basic agent based modeling on the GPU using the CUDA framework.
GPU Performance Optimisation Alan Gray EPCC The University of Edinburgh.
3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.
My Coordinates Office EM G.27 contact time:
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1.
Institute of Software,Chinese Academy of Sciences An Insightful and Quantitative Performance Optimization Chain for GPUs Jia Haipeng.
S. Pardi Frascati, 2012 March GPGPU Evaluation – First experiences in Napoli Silvio Pardi.
Dollan, Laihem, Lohse, Schälicke, Stahl 1 Monte Carlo based studies of polarized positrons source for the International Linear Collider (ILC)
Heterogeneous Processing KYLE ADAMSKI. Overview What is heterogeneous processing? Why it is necessary Issues with heterogeneity CPU’s vs. GPU’s Heterogeneous.
GPU Acceleration of Particle-In-Cell Methods B. M. Cowan, J. R. Cary, S. W. Sides Tech-X Corporation.
CS427 Multicore Architecture and Parallel Computing
Leiming Yu, Fanny Nina-Paravecino, David Kaeli, Qianqian Fang
NVIDIA Profiler’s Guide
CS/EE 217 – GPU Architecture and Parallel Programming
Lecture 3: Main Memory.
6- General Purpose GPU Programming
Presentation transcript:

Geant4 on GPU prototype Nicholas Henderson (Stanford Univ. / ICME) Koichi Murakami (KEK / CRC) Stanford ICME, SLAC, G4-Japan Collaboration supported by NVIDIA Sep/10/2012 17th Geant4 Collaboration meeting

17th Geant4 Collaboration meeting Outline project goal CUDA basics algorithm and implementation prototype and performance Sep/10/2012 17th Geant4 Collaboration meeting

17th Geant4 Collaboration meeting Project Goal Dose calculation for radiation therapy GPU-powered parallel processing with CUDA boost-up calculation speed voxel geometry including DICOM interface material : water with variable densities limited Geant4 EM physic processes electron/positron/gamma medical energy rage ( < 10-100 MeV) scoring dose in each voxel DCMTK GPU processing Dose DICOMRT-Dose gMocren Sep/10/2012 17th Geant4 Collaboration meeting

17th Geant4 Collaboration meeting CUDA basics I “SIMD” architecture : Single Instruction, Multiple Data CUDA is a data parallel language wants to run same instruction on multiple pieces of data Think parallel! Coalesced memory access NVIDIA GPUs read from memory in 128 byte blocks To maximize memory throughput, we want a single read to satisfy as many threads as possible We use a “struct of arrays” data structure to maximize opportunities for coalesced memory reads Sep/10/2012 17th Geant4 Collaboration meeting

17th Geant4 Collaboration meeting CUDA Basics II Memory hierarchy CUDA provides access to several device memory types: global, shared, constant, texture We currently use global memory for all thread and track data and constant memory for parameters We will use shared memory, which is on-chip, at a later phase in the project Race conditions arise when multiple CUDA threads attempt to write to same location in global memory we avoid race conditions (or using atomic operations) by maintaining independent track and dose stacks for each thread Sep/10/2012 17th Geant4 Collaboration meeting

Parallelization strategy Each GPU thread processes a single track until the track exits the geometry GPU runs ≈ 32k CUDA threads under the current configuration Each thread has two stacks : one for storing secondary particles one for recording the energy dose in a voxel After a number of steps : energy dose in the stack is moved to main dose array secondary stacks may be redistributed for performance Sep/10/2012 17th Geant4 Collaboration meeting

17th Geant4 Collaboration meeting G4CU Basics Each thread stores data for: thread state {running, stopped} PIL(-left) for the step the limiting physics process for the step Each thread processes a track, which stores data for: particle spices position direction energy Other data associated with each thread: random number generator state, primary generation state, track stack, dose stack, physical process data Sep/10/2012 17th Geant4 Collaboration meeting

17th Geant4 Collaboration meeting Geometry Focused on voxel navigation taking advantage of GPU power Implementation currently handles a single box with uniform discretizations for each dimension planning for a hierarchical voxel model to allow higher resolution in certain regions The material of each voxel is water with different density cross section, energy loss, etc are proportional to density not necessary to preparing thousands of tables Sep/10/2012 17th Geant4 Collaboration meeting

17th Geant4 Collaboration meeting Physics processes particles : electron, positron, gamma energy range : < 10-100 MeV material: water (and air) with variable densities processes: electron / positron energy loss (ionization, bremsstrahlung) multiple scattering (different models will be tried) positron annihilation gamma Compton scattering photo electric effect gamma conversion physics tables cross section, dE/dx, range, etc are retrieved from Geant4 prepared for "standard" water Sep/10/2012 17th Geant4 Collaboration meeting

Major algorithm phases 1. initialization allocate memory, initialize RNG (Random Number Generator) 2. main loop always take a step sometimes check termination conditions sometimes generate primary particles sometimes pop a secondary particle from track stack sometimes balance track stacks sometimes distribute dose stacks to main dose array 3. clean up output dose free all memory Sep/10/2012 17th Geant4 Collaboration meeting

17th Geant4 Collaboration meeting Algorithm pieces termination check: algorithm stops if all tracks are stopped, all stacks are empty, and all primary budgets are exhausted primary generation: The generation procedure generates primaries and pushes them onto stack until stack size reaches a fill_level If the stack size for a given thread is equal to or larger than fill_level, then nothing is done stack pop: if possible pop a track from the stack and compute initial PILs for all processes Sep/10/2012 17th Geant4 Collaboration meeting

17th Geant4 Collaboration meeting A single step 1. select process with the shortest PIL 2. apply all continuous processes processes have access to the stack 3. decrease PILs of all processes by step path length 4. apply discrete method of the limiting process process has access to the stack resample PIL for the limiting process update transportation PIL if particle properties have changed Sep/10/2012 17th Geant4 Collaboration meeting

17th Geant4 Collaboration meeting Dose accumulation Notes: should avoid race condition due to memory access each thread has a dose stack to record deposit energy accumulation a thread will push "the sum of energy deposition in a voxel" to stack when a track to a different voxel periodically, dose stacks will fill up and need to be distributed to the main dose array Dose distribution procedure: 1. sort all dose stacks together by voxel index 2. reduce (sum-up) by voxel index 3. store the result in main dose array no race conditions because voxel indices are now unique Sep/10/2012 17th Geant4 Collaboration meeting

Example configuration GPU: Tesla C2070 (Fermi) 448 cores, 1.15 GHz, 6GB GDDR5 (ECC) SDK: CUDA 4.2 (5-RC) : CURAND, Thrust, SCons (CMake) Application generate 200k primaries 128 blocks, with 256 threads each total 32,768 CUDA threads run takes about 1.2 GB on device (mostly voxel arrays) generate and pop every 25 steps check termination conditions every 1,000 steps Sep/10/2012 17th Geant4 Collaboration meeting

Current fake “physics” processes Compute PIL with: -logf(curand_uniform(&data.rng.state[id])); Chdir: perturbs the direction of the particle Gensec: generates a secondary particle with less energy Sep/10/2012 17th Geant4 Collaboration meeting

NVIDA Visual Profile (nvvp) Output from nvvp zoomed to a single iteration Sep/10/2012 17th Geant4 Collaboration meeting

17th Geant4 Collaboration meeting Run profile Run profile with current fake physics processes and no stack balancing stack balance: performance should be improved. Sep/10/2012 17th Geant4 Collaboration meeting

17th Geant4 Collaboration meeting Summary Collaborative activity on Geant4-GPU between Stanford ICME, SLAC, and G4-Japan (KEK), supported by NVIDIA Focused on medical application dose calculation in voxel domain Geant4 EM physics processes GPU prototype parallel tracking on GPU thread multiple data structure for parallel processing efficient stack management Working on porting physics processes from Geant4 optimization Sep/10/2012 17th Geant4 Collaboration meeting