
Slide 1: CUDA in FairRoot. Mohammad Al-Turany, GSI Darmstadt.

Slide 2: GSI today; GSI + FAIR.

Slide 3: An International Accelerator Facility for Research with Ions and Antiprotons (FAIR).

Slide 4: FAIR programs and experiments:
- Heavy-ion physics: CBM, HADES
- Hadron physics: PANDA, PAX, ASSIA
- Plasma and applied physics: HEDgeHOB, WDM, BIOMAT
- Atomic physics: SPARC, FLAIR
- Nuclear structure and astrophysics (NUSTAR): Super-FRS, HISPEC/DESPEC, MATS, LASPEC, R3B, ILIMA, AIC, ELISe, EXL

Slide 5: FairRoot (fairroot.gsi.de), used by PANDA, CBM, MPD, and R3B. Developers: Mohammad Al-Turany, Denis Bertini, Florian Uhlig, Radek Karabowicz.

Slide 6: ~400 physicists, 50 institutes, 15 countries.

Slide 7: PANDA (anti-Proton ANnihilation at DArmstadt). Data rate ~10^7 events/s, data volume ~10 PB/yr; ~450 physicists, 52 institutes, 17 countries.

Slide 8: Data flow: CMS at the LHC (~100 GB/s) vs. CBM at FAIR (~1 TB/s).

Slide 9: The challenge: can 10,000,000 events/s be fully reconstructed online? Complete reconstruction of one event takes 1-10 ms, so sustaining 10^7 events/s requires roughly 10,000-100,000 CPU cores. At which level to parallelize: event, track, or cluster? Which hardware to use: CPUs, GPUs, FPGAs, or DSPs?

Slide 10: FairRoot design. A Run Manager ties together the event generators (ASCII, UrQMD, Pluto, EVT, DPM), the magnetic field (constant field, dipole and solenoid maps), the detector base classes, the IO manager (ROOT files with hits, digits, tracks), the tasks (digitizers, hit producers, track finding), and the runtime database (RTDataBase, with configuration, parameters, and geometry kept in Oracle or ROOT files). Simulation runs through the ROOT Virtual MC interface (Geant3, Geant4, FLUKA), with application-level cuts and processes. CBM detector code (STS, TRD, TOF, RICH, ECAL, MVD, ZDC, MUCH) and PANDA detector code (STT, MUO, TOF, DCH, EMC, MVD, TPC, DIRC) build on this common development, which also provides the event display and track propagation.

Slide 11: CUDA features:
- Standard C language for parallel application development on the GPU.
- Standard numerical libraries: FFT (Fast Fourier Transform), BLAS (Basic Linear Algebra Subroutines), CUSPARSE (handling of sparse matrices), CURAND (generation of high-quality pseudorandom and quasirandom numbers; a usage sketch follows below).
- Dedicated CUDA driver for computing, with a fast data-transfer path between GPU and CPU.
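
As an illustration of the library support, here is a minimal sketch of generating uniform random numbers on the GPU with the CURAND host API. It is not from the talk; the buffer size and seed are arbitrary choices.

    #include <cuda_runtime.h>
    #include <curand.h>
    #include <cstdio>

    int main() {
        const size_t n = 1 << 20;                  // number of random floats (arbitrary)
        float* dData = 0;
        cudaMalloc((void**)&dData, n * sizeof(float));   // device buffer

        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
        curandGenerateUniform(gen, dData, n);      // fill the buffer with uniform floats

        float h[4];                                // copy a few values back to the host
        cudaMemcpy(h, dData, sizeof(h), cudaMemcpyDeviceToHost);
        printf("%f %f %f %f\n", h[0], h[1], h[2], h[3]);

        curandDestroyGenerator(gen);
        cudaFree(dData);
        return 0;
    }

Such a program would typically be compiled with nvcc and linked against the CURAND library (-lcurand).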

Slide 12: Why CUDA?
- The CUDA development tools work alongside the conventional C/C++ compiler, so GPU code can be mixed with general-purpose code for the host CPU.
- CUDA manages threads automatically: it does not require explicit thread management in the conventional sense, which greatly simplifies the programming model.
- Stable, freely available, documented, and supported on Windows, Linux, and Mac OS.
- Low learning curve: just a few extensions to C; no knowledge of graphics is required.

Slide 13: CUDA in FairRoot:
- FindCuda.cmake (Abe Stephens, SCI Institute) integrates CUDA into FairRoot very smoothly.
- CMake creates shared libraries for the CUDA part.
- FairCuda is a class that wraps the CUDA-implemented functions so that they can be used directly from ROOT CINT or from compiled code (a sketch of such a wrapper follows below).
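
The actual FairCuda interface is not shown in the talk; the following is only a minimal sketch of the wrapping idea, with hypothetical names (TrackPropagatorWrapper, propagate_tracks_gpu): a plain C++ class whose method forwards to a function compiled by nvcc into the CUDA shared library, so that ROOT CINT or compiled code never has to see CUDA syntax.

    // propagator_gpu.cu -- compiled by nvcc into the CUDA shared library
    #include <cuda_runtime.h>

    __global__ void propagate_kernel(float* params, int nTracks) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nTracks) {
            params[i] += 0.0f;   // placeholder for the real per-track work
        }
    }

    // plain C entry point, callable from code built with an ordinary C++ compiler
    extern "C" void propagate_tracks_gpu(float* hostParams, int nTracks) {
        float* d = 0;
        size_t bytes = nTracks * sizeof(float);
        cudaMalloc((void**)&d, bytes);
        cudaMemcpy(d, hostParams, bytes, cudaMemcpyHostToDevice);
        int block = 256;
        int grid  = (nTracks + block - 1) / block;
        propagate_kernel<<<grid, block>>>(d, nTracks);
        cudaMemcpy(hostParams, d, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d);
    }

    // TrackPropagatorWrapper.h -- ordinary C++, visible to ROOT
    extern "C" void propagate_tracks_gpu(float* hostParams, int nTracks);

    class TrackPropagatorWrapper {       // hypothetical stand-in for FairCuda
    public:
        void Propagate(float* params, int n) { propagate_tracks_gpu(params, n); }
    };

In FairRoot such a wrapper class would typically get a ROOT dictionary through the usual LinkDef mechanism, making it callable from CINT macros as well as from compiled tasks.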

Slide 14: Reconstruction chain (PANDA): hits -> track finder -> track candidates -> track fitter -> tracks. Each step is a task that can run either on the CPU or on the GPU; a sketch of such a mixed chain follows below.
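
A minimal sketch of one event passing through such a chain, with the track finder kept on the CPU and the fitter offloaded to the GPU. None of the names or the placeholder math below are from the talk.

    #include <cuda_runtime.h>
    #include <vector>

    struct Hit   { float x, y, z; };
    struct Track { float par[5]; };

    // GPU task: fit all track candidates of one event in parallel (placeholder math)
    __global__ void fitTracks(Track* trk, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) trk[i].par[4] = trk[i].par[0] + trk[i].par[1];   // stand-in for the real fit
    }

    // CPU task: group hits into track candidates (trivial placeholder)
    std::vector<Track> findTracks(const std::vector<Hit>& hits) {
        std::vector<Track> cand;
        if (!hits.empty()) cand.push_back(Track());
        return cand;
    }

    // one event through the chain: track finder on the CPU, track fitter on the GPU
    std::vector<Track> reconstructEvent(const std::vector<Hit>& hits) {
        std::vector<Track> cand = findTracks(hits);
        if (cand.empty()) return cand;
        Track* d = 0;
        size_t bytes = cand.size() * sizeof(Track);
        cudaMalloc((void**)&d, bytes);
        cudaMemcpy(d, cand.data(), bytes, cudaMemcpyHostToDevice);
        fitTracks<<<((int)cand.size() + 255) / 256, 256>>>(d, (int)cand.size());
        cudaMemcpy(cand.data(), d, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d);
        return cand;   // fitted tracks
    }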

Slide 15: CUDA Toolkit:
- nvcc C compiler
- CUDA FFT and BLAS libraries for the GPU
- cuda-gdb hardware debugger
- CUDA Visual Profiler
- CUDA runtime driver (also available in the standard NVIDIA GPU driver)
- CUDA programming manual

Slide 16: cuda-gdb, the NVIDIA CUDA debugger. cuda-gdb is an extension of the standard i386/AMD64 port of gdb, the GNU Project debugger, version 6.6. Standard debugging features are inherently supported for host code, and additional features are provided for debugging GPU (device) code. When debugging host code there is no difference between cuda-gdb and gdb.

Slide 17: Debugging CUDA programs: device emulation mode plus printf (no longer supported), and cuda-gdb.

Slide 18: cuda-gdb:
- GPU memory is treated as an extension of host memory.
- CUDA threads and blocks are treated as extensions of host threads.
- Memory checker: detects global memory violations and misaligned global memory accesses.

Slide 19: cuda-gdb limitations: it only runs on UNIX-based systems, and X11 cannot be running on the GPU that is used for debugging.

Slide 20: NVIDIA Compute Visual Profiler.

Slide 21: Examples.

Slide 22: Track propagation in a magnetic field. A Runge-Kutta propagator is used (see Handbook of the National Bureau of Standards, procedure 25.5.20). The algorithm itself is hardly parallelizable, but all tracks in one event can be propagated in parallel, one track per GPU thread, as sketched below.
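
A minimal sketch of the track-per-thread idea, not the talk's actual propagator: each thread advances one track by fixed fourth-order Runge-Kutta steps in a uniform field. The state layout, the kappa factor (charge over momentum), the uniform field, and all names are assumptions of this sketch; the field-map lookup, adaptive step control, and the full 25.5.20 scheme are omitted.

    #include <cuda_runtime.h>

    // derivative along the path length s: dr/ds = t, dt/ds = kappa * (t x B)
    __device__ void deriv(const float* s, float kappa, float3 B, float* d) {
        d[0] = s[3]; d[1] = s[4]; d[2] = s[5];
        d[3] = kappa * (s[4]*B.z - s[5]*B.y);
        d[4] = kappa * (s[5]*B.x - s[3]*B.z);
        d[5] = kappa * (s[3]*B.y - s[4]*B.x);
    }

    // one thread propagates one track by nSteps RK4 steps of length h in a uniform field B
    __global__ void propagateTracks(float* states, const float* kappas, int nTracks,
                                    float3 B, float h, int nSteps) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nTracks) return;
        float s[6];                                       // {x, y, z, tx, ty, tz}
        for (int j = 0; j < 6; ++j) s[j] = states[6*i + j];
        float kappa = kappas[i];
        for (int n = 0; n < nSteps; ++n) {
            float k1[6], k2[6], k3[6], k4[6], tmp[6];
            deriv(s, kappa, B, k1);
            for (int j = 0; j < 6; ++j) tmp[j] = s[j] + 0.5f*h*k1[j];
            deriv(tmp, kappa, B, k2);
            for (int j = 0; j < 6; ++j) tmp[j] = s[j] + 0.5f*h*k2[j];
            deriv(tmp, kappa, B, k3);
            for (int j = 0; j < 6; ++j) tmp[j] = s[j] + h*k3[j];
            deriv(tmp, kappa, B, k4);
            for (int j = 0; j < 6; ++j)
                s[j] += h/6.0f * (k1[j] + 2.0f*k2[j] + 2.0f*k3[j] + k4[j]);
        }
        for (int j = 0; j < 6; ++j) states[6*i + j] = s[j];
    }

    // launch example: propagateTracks<<<(nTracks + 255)/256, 256>>>(dStates, dKappas, nTracks, B, 0.1f, 100);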

Slide 23: HADES: speedup by a factor of 40.

    Propagations/event    Speedup on Tesla (CPU time / GPU time)
    10                    11
    50                    15
    100                   15
    200                   24
    500                   34
    700                   41

Slide 24: Track propagation (RK4) in PANDA: speedup of up to a factor of 175.

Slide 25: Track + vertex fitting (PANDA). Same program code, same hardware, just using pinned memory instead of mem-copy. The plot shows the CPU-time/GPU-time ratio, split into the copy-data and execute phases; a sketch of the two transfer strategies follows below.
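
For illustration only (not the talk's fitter code), a sketch of the two approaches being compared: an explicit cudaMemcpy round trip versus mapped pinned host memory that the kernel reads and writes directly through a device pointer. The kernel and buffer contents are placeholders.

    #include <cuda_runtime.h>

    __global__ void fitKernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;                  // placeholder for the real fit
    }

    void runWithMemcpy(float* hostBuf, int n) {      // variant 1: explicit copies
        float* d = 0;
        cudaMalloc((void**)&d, n * sizeof(float));
        cudaMemcpy(d, hostBuf, n * sizeof(float), cudaMemcpyHostToDevice);
        fitKernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaMemcpy(hostBuf, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);
    }

    void runWithMappedPinned(int n) {                // variant 2: mapped pinned memory
        cudaSetDeviceFlags(cudaDeviceMapHost);       // must happen before the context is created
        float* h = 0;
        cudaHostAlloc((void**)&h, n * sizeof(float), cudaHostAllocMapped);
        float* d = 0;
        cudaHostGetDevicePointer((void**)&d, h, 0);  // device-side alias of the host buffer
        fitKernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();                     // kernel reads/writes go straight to host memory
        cudaFreeHost(h);
    }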

Slide 26: Parallelization on CPU and GPU. Several CPU processes (one event each) produce track candidates and hand them to a GPU task, which returns the fitted tracks.

    No. of processes        50 tracks/event     2000 tracks/event
    1 CPU                   1.7e4 tracks/s      9.1e2 tracks/s
    1 CPU + GPU (Tesla)     5.0e4 tracks/s      6.3e5 tracks/s
    4 CPU + GPU (Tesla)     1.2e5 tracks/s      2.2e6 tracks/s

Slide 27: CUDA vs. C program.

CPU program:

    void inc_cpu(int* a, int N)
    {
        int idx;
        for (idx = 0; idx < N; idx++)
            a[idx] = a[idx] + 1;
    }

    int main()
    {
        ...
        inc_cpu(a, N);
    }

CUDA program:

    __global__ void inc_gpu(int* a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] + 1;
    }

    int main()
    {
        ...
        dim3 dimBlock(blocksize);
        dim3 dimGrid(ceil(N / (float)blocksize));
        inc_gpu<<<dimGrid, dimBlock>>>(a, N);
    }

Slide 28: CPU vs. GPU code (Runge-Kutta algorithm), local declarations.

CPU version:

    float h2, h4, f[4];
    float xyzt[3], a, b, c, ph, ph2;
    float secxs[4], secys[4], seczs[4], hxp[3];
    float g1, g2, g3, g4, g5, g6, ang2, dxt, dyt, dzt;
    float est, at, bt, ct, cba;
    float f1, f2, f3, f4, rho, tet, hnorm, hp, rho1, sint, cost;
    float x, y, z, xt, yt, zt;
    float maxit = 10;
    float maxcut = 11;
    const float hmin = 1e-4;
    const float kdlt = 1e-3;
    ...

GPU version:

    __shared__ float4 field;
    float h2, h4, f[4];
    float xyzt[3], a, b, c, ph, ph2;
    float secxs[4], secys[4], seczs[4], hxp[3];
    float g1, g2, g3, g4, g5, g6, ang2, dxt, dyt, dzt;
    float est, at, bt, ct, cba;
    float f1, f2, f3, f4, rho, tet, hnorm, hp, rho1, sint, cost;
    float x, y, z, xt, yt, zt;
    float maxit = 10;
    float maxcut = 11;
    __constant__ float hmin = 1e-4;
    __constant__ float kdlt = 1e-3;
    ...

Slide 29: CPU vs. GPU code (Runge-Kutta algorithm), field lookup in the stepping loop.

CPU version:

    do {
        rest = step - tl;
        if (TMath::Abs(h) > TMath::Abs(rest)) h = rest;
        fMagField->GetFieldValue(vout, f);
        f[0] = -f[0];
        f[1] = -f[1];
        f[2] = -f[2];
        ...
        if (step < 0.) rest = -rest;
        if (rest < 1.e-5 * TMath::Abs(step)) return;
    } while (1);

GPU version:

    do {
        rest = step - tl;
        if (fabs(h) > fabs(rest)) h = rest;
        field = GetField(vout[0], vout[1], vout[2]);
        f[0] = -field.x;
        f[1] = -field.y;
        f[2] = -field.z;
        ...
        if (step < 0.) rest = -rest;
        if (rest < 1.e-5 * fabs(step)) return;
    } while (1);

Slide 30: Summary:
- CUDA is an easy tool to learn and to use.
- CUDA allows heterogeneous programming.
- Depending on the use case, one can gain large factors in performance compared to the CPU.
- Texture memory can be used to solve problems that require lookup tables effectively (a sketch follows below).
- Pinned memory simplifies some problems and also gives better performance.
- With Fermi we are approaching the end of the distinction between CPUs and GPUs; the GPU is increasingly taking on the form of a massively parallel co-processor.
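
A minimal sketch of the lookup-table point, using the legacy texture-reference API that was current for this toolkit generation (texture references have since been deprecated and removed in recent CUDA releases). The table contents and names are placeholders, not from the talk.

    #include <cuda_runtime.h>

    // 1D lookup table bound to a texture reference (cached, read-only access)
    texture<float, 1, cudaReadModeElementType> lutTex;

    __global__ void applyLut(const int* indices, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tex1Dfetch(lutTex, indices[i]);   // cached table lookup
    }

    void setupLut(const float* hostTable, int entries) {
        float* dTable = 0;
        size_t bytes = entries * sizeof(float);
        cudaMalloc((void**)&dTable, bytes);
        cudaMemcpy(dTable, hostTable, bytes, cudaMemcpyHostToDevice);
        cudaBindTexture(0, lutTex, dTable, bytes);            // reads now go through the texture cache
    }

setupLut would be called once on the host; applyLut then performs cached table lookups from device code.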

Slide 31: Next steps related to online:
- In collaboration with GSI EE, build a prototype for an online system.
- Use the PEXOR card to get data into the PC.
- The PEXOR driver allocates a buffer in PC memory and writes the data to it.
- The GPU uses zero copy to access the data, analyzes it, and writes the results.

Slide 32: PEXOR. The GSI PEXOR is a PCI Express card that provides a complete development platform for designing and verifying applications based on the Lattice SCM FPGA family. Serial gigabit transceiver interfaces (SERDES) provide the connection to PCI Express x1 or x4 and to four 2 Gbit SFP optical transceivers.

Slide 33: Configuration for the test planned at GSI.

Slide 34: CPU and GPU comparison.

    Property                     Intel Core 2 Extreme QX9650   NVIDIA Tesla C1060   NVIDIA Fermi
    Transistors                  820 million                   1.4 billion          3.0 billion
    Processor clock              3 GHz                         1.3 GHz              1.15 GHz
    Cores (threads)              4                             240                  512
    Cache / shared memory        6 MB x 2                      16 KB x 30           16 or 48 KB (configurable)
    Threads executed per clock   4                             240                  512
    Hardware threads in flight   4                             30,720               24,576
    Memory controllers           off-die                       8 x 64-bit           6 x 64-bit
    Memory bandwidth             12.8 GB/s                     102 GB/s             144 GB/s

Slide 35: Comparison of NVIDIA's three CUDA-capable GPU architectures (source: http://www.in-stat.com).

Slide 36: Emulation mode. When an application runs in device emulation mode, the programming model is emulated by the runtime: for each thread in a thread block, the runtime creates a thread on the host. The programmer needs to make sure that:
- the host is able to run up to the maximum number of threads per block, plus one for the master thread;
- enough memory is available to run all threads, knowing that each thread gets 256 KB of stack.

Slide 37: Pinned memory. On discrete GPUs, mapped pinned memory is advantageous only in certain cases: because the data is not cached on the GPU, mapped pinned memory should be read or written only once, and the global loads and stores that access it should be coalesced. On integrated GPUs, mapped pinned memory is always a performance gain, because it avoids superfluous copies: integrated GPU and CPU memory are physically the same.

