CUDA in FairRoot
Mohammad Al-Turany, GSI Darmstadt


GSI Today: GSI + FAIR
(Belle-II PXD DAQ/Trigger Workshop, 26 September 2010)

An International Accelerator Facility for Research with Ions and Antiprotons

FAIR programs and experiments:
- Heavy-ion physics: CBM, HADES
- Hadron physics: PANDA, PAX, ASSIA
- Plasma and applied physics: HEDgeHOB, WDM, BIOMAT
- Atomic physics: SPARC, FLAIR
- Nuclear structure and astrophysics (NUSTAR): Super-FRS, HISPEC/DESPEC, MATS, LASPEC, R3B, ILIMA, AIC, ELISe, EXL

FairRoot (Fairroot.gsi.de)
Experiments: PANDA, CBM, MPD, R3B
Developers: Mohammad Al-Turany, Denis Bertini, Florian Uhlig, Radek Karabowicz

~400 physicists, 50 institutes, 15 countries

PANDA: Anti-Proton ANnihilation at DArmstadt
Data rate: ~10^7 events/s
Data volume: ~10 PB/yr
~450 physicists, 52 institutes, 17 countries

Data flow: CMS at the LHC ~100 GB/s; CBM at FAIR ~1 TB/s

10^7 events/s with full reconstruction online?
Complete reconstruction of one event takes 1-10 ms, so sustaining 10^7 events/s needs on the order of 10^4-10^5 CPU cores.
The challenge:
- At which level to parallelize? Event, track, or cluster
- Which hardware to use? CPUs, GPUs, FPGAs, DSPs

Design
- Run Manager: ties together event generator, magnetic field, detector base, IO manager, tasks, and the runtime database
- Event generators: ASCII, UrQMD, Pluto (CBM); ASCII, EVT, DPM (PANDA)
- Magnetic field: constant field, dipole map, solenoid map
- ROOT Virtual MC: Geant3 (G3VMC), Geant4 (G4VMC), FLUKA (FlukaVMC); application cuts and processes
- Geometry: STS, TRD, TOF, RICH, ECAL, MVD, ZDC, MUCH (CBM code); STT, MUO, TOF, DCH, EMC, MVD, TPC, DIRC (PANDA code)
- IO Manager: Root files with hits, digits, tracks
- Tasks: track finding, digitizers, hit producers; event display, track propagation
- RTDataBase: Oracle or Root files (Conf, Par, Geo)
- Common development shared between the experiment codes

CUDA: Features
- Standard C language for parallel application development on the GPU
- Standard numerical libraries:
  - FFT (Fast Fourier Transform)
  - BLAS (Basic Linear Algebra Subroutines)
  - CUSPARSE (handling of sparse matrices)
  - CURAND (generation of high-quality pseudorandom and quasirandom numbers)
- Dedicated CUDA driver for computing, with a fast data-transfer path between GPU and CPU

Why CUDA?
- CUDA development tools work alongside the conventional C/C++ compiler, so one can mix GPU code with general-purpose code for the host CPU.
- CUDA manages threads automatically: it does not require explicit thread management in the conventional sense, which greatly simplifies the programming model.
- Stable, freely available, documented, and supported on Windows, Linux, and Mac OS.
- Low learning curve: just a few extensions to C; no knowledge of graphics is required.

CUDA in FairRoot
- FindCuda.cmake (Abe Stephens, SCI Institute) integrates CUDA into FairRoot very smoothly
- CMake creates shared libraries for the CUDA part
- FairCuda is a class that wraps the CUDA-implemented functions so that they can be used directly from ROOT CINT or from compiled code

Reconstruction chain (PANDA)
Hits -> Track Finder -> Track candidates -> Track Fitter -> Tracks
(each stage runs as a task, either on the CPU or on the GPU)

CUDA Toolkit
- NVCC C compiler
- CUDA FFT and BLAS libraries for the GPU
- CUDA-GDB hardware debugger
- CUDA Visual Profiler
- CUDA runtime driver (also available in the standard NVIDIA GPU driver)
- CUDA programming manual

CUDA-GDB: The NVIDIA CUDA Debugger
cuda-gdb is an extension of the standard i386/AMD64 port of gdb, the GNU Project debugger (version 6.6). Standard debugging features are inherently supported for host code, and additional features are provided for debugging GPU (device) code. There is no difference between cuda-gdb and gdb when debugging host code.

Debugging CUDA programs
- Device emulation mode + printf (not supported anymore)
- CUDA-GDB

CUDA-GDB
- GPU memory is treated as an extension of host memory
- CUDA threads and blocks are treated as extensions of host threads
- Memory checker: detects global-memory violations and misaligned global-memory accesses

CUDA-GDB limitations
- It only runs on UNIX-based systems
- X11 cannot be running on the GPU that is used for debugging

NVIDIA Compute Visual Profiler

Examples

Track propagation in a magnetic field
Runge-Kutta propagator (see Handbook Nat. Bur. of Standards). The algorithm itself is hardly parallelizable, but one can propagate all tracks in an event in parallel.

HADES: speedup by a factor of 40
(figure: propagation time per event, CPU vs. Tesla GPU)

Track propagation (RK4), PANDA
Speedup: up to a factor of 175

Track + vertex fitting (PANDA)
The same program code on the same hardware, just using pinned memory instead of mem-copy.
(figure: CPU time / GPU time for the copy-data and execute phases)

Parallelization on CPU and GPU
Several CPU processes run in parallel; each reads an event, produces track candidates, and hands them to a GPU task that returns the fitted tracks.

Throughput (tracks/s):

                         50 tracks/event   2000 tracks/event
1 CPU                    1.7 E4            9.1 E2
1 CPU + GPU (Tesla)      5.0 E4            6.3 E5
4 CPU + GPU (Tesla)      1.2 E5            2.2 E6

CUDA vs. C program

CPU program:

    void inc_cpu(int *a, int N) {
        int idx;
        for (idx = 0; idx < N; idx++)
            a[idx] = a[idx] + 1;
    }

    int main() {
        ...
        inc_cpu(a, N);
    }

CUDA program:

    __global__ void inc_gpu(int *a, int N) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] + 1;
    }

    int main() {
        ...
        dim3 dimBlock(blocksize);
        dim3 dimGrid(ceil(N / (float)blocksize));
        inc_gpu<<<dimGrid, dimBlock>>>(a, N);
    }

CPU vs. GPU code (Runge-Kutta algorithm): declarations

CPU version:

    float h2, h4, f[4];
    float xyzt[3], a, b, c, ph, ph2;
    float secxs[4], secys[4], seczs[4], hxp[3];
    float g1, g2, g3, g4, g5, g6, ang2, dxt, dyt, dzt;
    float est, at, bt, ct, cba;
    float f1, f2, f3, f4, rho, tet, hnorm, hp, rho1, sint, cost;
    float x, y, z, xt, yt, zt;
    float maxit = 10;
    float maxcut = 11;
    const float hmin = 1e-4;
    const float kdlt = 1e-3;
    ...

GPU version (same locals; the field value sits in shared memory and the cut constants in constant memory):

    __shared__ float4 field;
    /* ... same local declarations as the CPU version ... */
    __constant__ float hmin = 1e-4;
    __constant__ float kdlt = 1e-3;
    ...

CPU vs. GPU code (Runge-Kutta algorithm): stepping loop

CPU version:

    do {
        rest = step - tl;
        if (TMath::Abs(h) > TMath::Abs(rest)) h = rest;
        fMagField->GetFieldValue(vout, f);
        f[0] = -f[0]; f[1] = -f[1]; f[2] = -f[2];
        ...
        if (step < 0.) rest = -rest;
        if (rest < 1.e-5 * TMath::Abs(step)) return;
    } while (1);

GPU version:

    do {
        rest = step - tl;
        if (fabs(h) > fabs(rest)) h = rest;
        field = GetField(vout[0], vout[1], vout[2]);
        f[0] = -field.x; f[1] = -field.y; f[2] = -field.z;
        ...
        if (step < 0.) rest = -rest;
        if (rest < 1.e-5 * fabs(step)) return;
    } while (1);

Summary
- CUDA is an easy tool to learn and to use.
- CUDA allows heterogeneous programming.
- Depending on the use case, one can gain large factors in performance compared to a CPU.
- Texture memory can be used to solve problems that require lookup tables effectively.
- Pinned memory simplifies some problems and also gives better performance.
- With Fermi we are getting towards the end of the distinction between CPUs and GPUs: the GPU is increasingly taking on the form of a massively parallel co-processor.

Next steps related to online
In collaboration with the GSI EE department, build a prototype for an online system:
- Use the PEXOR card to get data into the PC
- The PEXOR driver allocates a buffer in PC memory and writes the data to it
- The GPU uses zero-copy to access the data, analyzes it, and writes back the results

PEXOR
The GSI PEXOR is a PCI Express card that provides a complete development platform for designing and verifying applications based on the Lattice SCM FPGA family. Serial gigabit transceiver interfaces (SERDES) provide connections to PCI Express x1 or x4 and to four 2-Gbit SFP optical transceivers.

Configuration for the test planned at GSI

CPU and GPU

                            Intel Core 2        NVIDIA Tesla   NVIDIA
                            Extreme QX9650      C1060          Fermi
Transistors                 820 million         1.4 billion    3.0 billion
Processor clock             3 GHz               1.3 GHz        1.15 GHz
Cores                       4                   240            512
Cache / shared memory       6 MB x 2            16 KB x 30     16 or 48 KB (configurable)
Threads executed per clock  4                   240            512
Hardware threads in flight  4                   30,720         24,576
Memory controllers          off-die             8 x 64-bit     6 x 64-bit
Memory bandwidth            12.8 GB/s           102 GB/s       144 GB/s

Comparison of NVIDIA's three CUDA-capable GPU architectures

Emulation mode
When running an application in device-emulation mode, the programming model is emulated by the runtime: for each thread in a thread block, the runtime creates a thread on the host. The programmer needs to make sure that:
- The host is able to run up to the maximum number of threads per block, plus one for the master thread.
- Enough memory is available to run all threads, knowing that each thread gets 256 KB of stack.

Pinned memory
- On discrete GPUs, mapped pinned memory is advantageous only in certain cases. Because the data is not cached on the GPU, mapped pinned memory should be read or written only once, and the global loads and stores that read and write the memory should be coalesced.
- On integrated GPUs, mapped pinned memory is always a performance gain, because it avoids superfluous copies: integrated-GPU and CPU memory are physically the same.
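For reference, the host-side pattern for mapped pinned ("zero-copy") memory looks roughly like this. This is a hedged CUDA-runtime sketch, not code from the talk; `my_kernel`, `grid`, `block`, and `N` are placeholders, and error checking is omitted.

```cuda
// Mapped pinned memory instead of explicit cudaMemcpy:
// cudaHostAllocMapped pins the host buffer and makes it GPU-visible.
float* h_buf = nullptr;
cudaSetDeviceFlags(cudaDeviceMapHost);           // must precede context creation
cudaHostAlloc(&h_buf, N * sizeof(float), cudaHostAllocMapped);

float* d_buf = nullptr;                          // device alias of the same memory
cudaHostGetDevicePointer(&d_buf, h_buf, 0);

my_kernel<<<grid, block>>>(d_buf, N);            // reads/writes host memory directly
cudaDeviceSynchronize();                         // results visible in h_buf afterwards

cudaFreeHost(h_buf);
```

This is the mechanism behind both the PANDA fitting result above (pinned memory instead of mem-copy) and the planned PEXOR zero-copy online chain.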