1
GPUs for event reconstruction in the FairRoot framework
Mohammad Al-Turany (GSI-IT), Florian Uhlig (GSI-IT), Radoslaw Karabowicz (GSI-IT)
2
03/26/09 CHEP 09, Prague 2
Overview
FairRoot: introduction
GPUs and CUDA
PANDA track fitting on the GPU
Summary
3
03/26/09 CHEP 09, Prague 3
FairRoot
[Diagram: experiments built on FairRoot — PANDA, CBM, MPD, R3B]
http://fairroot.gsi.de
4
03/26/09 CHEP 09, Prague 4
Features
No executable (only ROOT CINT):
– Compiled tasks for reconstruction, analysis, etc.
– ROOT macros for steering simulation or reconstruction
– ROOT macros for configuration (G3, G4, Fluka and analysis)
VMC and VGM for simulation
Reconstruction can run directly with the simulation or as a separate step
RHO package for analysis
TGeoManager in simulation and reconstruction
Grid: we use AliEn!
CMake: Makefiles, dependencies, QM
Doxygen for class documentation
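To make the macro-driven approach concrete, here is a minimal steering-macro sketch in the style of the FairRoot tutorials; the geometry file names, detector setup and generator settings are illustrative assumptions, not taken from the slides.

// run_sim.C -- illustrative FairRoot steering macro (a sketch, not from the talk).
void run_sim()
{
  FairRunSim* run = new FairRunSim();
  run->SetName("TGeant3");                 // pick the transport engine via the VMC
  run->SetOutputFile("sim.root");          // simulation output tree
  run->SetMaterials("media.geo");          // material definitions (assumed file name)

  FairModule* cave = new FairCave("CAVE"); // world volume, as in the FairRoot tutorials
  cave->SetGeometryFileName("cave.geo");   // assumed geometry file
  run->AddModule(cave);

  FairPrimaryGenerator* primGen = new FairPrimaryGenerator();
  FairBoxGenerator* boxGen = new FairBoxGenerator(13, 50); // 50 muons per event
  boxGen->SetPRange(1.0, 1.0);             // 1 GeV, matching the test setup later on
  primGen->AddGenerator(boxGen);
  run->SetGenerator(primGen);

  run->Init();
  run->Run(50);                            // simulate 50 events
}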
5
03/26/09 CHEP 09, Prague 5
FairRoot delivers:
Detector base classes that handle initialization, geometry construction, hit processing (stepping action), etc.
IO manager based on ROOT TFolder and TTree (TChain)
Geometry readers: ASCII, ROOT, CAD2ROOT
Radiation length manager
Generic track propagation based on GEANE
Generic event display based on EVE and GEANE
Oracle interface for geometry and parameter handling
Fast-simulation base services based on VMC and ROOT TTasks (full and fast simulations can be mixed in one run)
Interfaces for several event generators: Pythia, UrQMD, EvtGen, Pluto, DPM, ...
6
03/26/09 CHEP 09, Prague 6
Nvidia's Compute Unified Device Architecture (CUDA)
7
03/26/09 CHEP 09, Prague 7
CUDA Toolkit
nvcc C compiler
CUDA FFT and BLAS libraries for the GPU
Profiler
gdb debugger for the GPU
CUDA runtime driver (also available in the standard NVIDIA GPU driver)
CUDA programming manual
8
03/26/09 CHEP 09, Prague 8
Why CUDA?
CUDA development tools work alongside the conventional C/C++ compiler, so one can mix GPU code with general-purpose code for the host CPU.
CUDA manages threads automatically:
– It does not require explicit thread management in the conventional sense, which greatly simplifies the programming model.
Stable, freely available, documented and supported on Windows, Linux and Mac.
Low learning curve:
– Just a few extensions to C (illustrated below)
– No knowledge of graphics is required
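To show how small those extensions are, consider this minimal (hypothetical) example: the __global__ qualifier marks a function that executes on the device, and the <<<grid, block>>> triple-angle-bracket syntax launches it from ordinary host code; everything else is plain C.

// Kernel: runs on the GPU, one thread per array element.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in thread indices
    if (i < n)
        data[i] *= factor;
}

// Host side: launch enough 256-thread blocks to cover all n elements.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);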
9
03/26/09 CHEP 09, Prague 9
CUDA and GPGPU programming
GPGPU usually means writing software for a GPU in the language of the GPU.
CUDA lets one work with familiar programming concepts while developing software that runs on a GPU.
CUDA compiles the code directly to the hardware (GPU assembly language, for instance), thereby providing great performance.
10
03/26/09 CHEP 09, Prague 10
GPU threads and CPU threads
GPU threads are extremely lightweight.
CPUs can execute 1-2 threads per core, while GPUs can maintain up to 1024 threads per multiprocessor (8 cores).
CPUs use SIMD (a single instruction performed over multiple data) vector units; GPUs use SIMT (single instruction, multiple threads) for scalar thread processing. SIMT does not require developers to convert data to vectors and allows arbitrary branching in threads (see the sketch below).
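A sketch of what SIMT means for the programmer (an illustrative kernel, not code from the fitter): each scalar thread handles one element and may take its own branch; no vector types or manual data packing are needed.

__global__ void clampOrSqrt(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Arbitrary per-thread branching is legal under SIMT; the hardware
    // serializes divergent paths within a warp transparently.
    if (x[i] < 0.0f)
        x[i] = 0.0f;
    else
        x[i] = sqrtf(x[i]);
}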
11
03/26/09 CHEP 09, Prague 11
Threads and kernels
Parallel parts of the application are executed on the device as kernels:
– One kernel is executed at a time
– Many threads execute each kernel
– A kernel launch creates a grid of thread blocks
– Threads within a block cooperate via shared memory
– Threads within a block can synchronize
– Threads in different blocks cannot cooperate
[Diagram: thread blocks, each with its own shared memory]
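These cooperation rules can be made concrete with a standard block-local reduction (a generic textbook example, not from the talk): threads of one block share data through __shared__ memory and meet at __syncthreads() barriers, while blocks remain independent.

// Each block sums 256 consecutive elements; assumes blockDim.x == 256.
__global__ void blockSum(const float* in, float* partialSums)
{
    __shared__ float buf[256];              // visible to THIS block only
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                        // barrier for this block's threads

    // Tree reduction entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partialSums[blockIdx.x] = buf[0];   // one result per block; combining them
                                            // needs a second kernel or the host
}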
12
03/26/09 CHEP 09, Prague 12
Heterogeneous programming
[Diagram: serial code runs on the host; parallel kernels run on the device. Kernel_0<<<...>>>() launches Grid 0, serial host code runs again, then Kernel_1<<<...>>>() launches Grid 1.]
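The host/device split in the diagram corresponds to a host program of roughly this shape (a minimal sketch; the two kernels merely stand in for Kernel_0 and Kernel_1):

#include <cuda_runtime.h>

__global__ void Kernel_0(float* d, int n)
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }

__global__ void Kernel_1(float* d, int n)
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }

void process(float* h_data, int n)
{
    float* d_data = 0;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    Kernel_0<<<(n + 255) / 256, 256>>>(d_data, n);  // Grid 0
    // ... serial host code may run here ...
    Kernel_1<<<(n + 255) / 256, 256>>>(d_data, n);  // Grid 1

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}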
13
03/26/09 CHEP 09, Prague 13
CUDA in FairRoot
FindCuda.cmake (Abe Stephens, SCI Institute):
– Integrates CUDA into FairRoot very smoothly
CMake creates shared libraries for the CUDA part.
FairCuda is a class which wraps the CUDA-implemented functions so that they can be used directly from ROOT CINT or from compiled code.
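The wrapper pattern might look like the following sketch; the actual FairCuda interface is not shown in the slides, so the function and method names here are assumptions. The point is that nvcc compiles the device code into the shared library with C linkage, and an ordinary C++ class with a ROOT dictionary forwards to it.

// trackfit.cu -- compiled by nvcc into the cuda_imp library (hypothetical entry point).
extern "C" void FitTracksOnGpu(const float* hits, int nTracks, float* params);

// FairCuda.h -- plain C++ seen by rootcint, so it is callable from CINT macros.
class FairCuda {
public:
    void FitTracks(const float* hits, int nTracks, float* params)
    {
        FitTracksOnGpu(hits, nTracks, params);  // forward to the GPU implementation
    }
    // ClassDef(FairCuda, 1)  // ROOT dictionary hook, as usual in FairRoot classes
};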
14
03/26/09 CHEP 09, Prague 14
Adding CUDA files to CMake

# Add current directory to the nvcc include line.
CUDA_INCLUDE_DIRECTORIES(
  ${CMAKE_CURRENT_SOURCE_DIR}
  ${CUDA_CUT_INCLUDE}
  ${ROOT_INCLUDE_DIR}
  ${CUDA_INCLUDE}
)

set(CUDA_SRCS trackfit.cu trackfit_kernel.cu)
set(LINK_DIRECTORIES ${ROOT_LIBRARY_DIR})

CUDA_ADD_LIBRARY(cuda_imp ${CUDA_SRCS})

# Specify the dependency.
TARGET_LINK_LIBRARIES(Trackfit
  cuda_imp
  ${CUDA_TARGET_LINK}
  ${CUDA_CUT_TARGET_LINK}
  ${ROOT_LIBRARIES}
)
15
03/26/09 CHEP 09, Prague 15
Testing hardware

GPU (Tesla C1060):
Streaming processor cores: 240
Frequency of processor cores: 1.3 GHz
Peak performance, single-precision floating point: 933 GFLOPS
Peak performance, double-precision floating point: 78 GFLOPS
Floating-point precision: IEEE 754 single & double
Total dedicated memory: 4 GB GDDR3
Memory speed: 800 MHz
Memory interface: 512-bit
Memory bandwidth: 102 GB/s
Max power consumption: 200 W peak, 160 W typical
System interface: PCIe x16

Host: Intel Xeon quad-core, 2.5 GHz, 16 GB memory, 875 W power supply, SUSE Enterprise 10.3
16
03/26/09 CHEP 09, Prague 16
Test configuration:
MVD, TPC, EMC and MDT detectors
1 GeV muons
50 events for each sample
Four samples with 50, 100, 1000 and 2000 primary tracks per event
Ideal track finder (simply copies the Geant tracks to the track-candidate objects)
17
03/26/09 CHEP 09, Prague 17
Reconstruction chain (PANDA)
[Diagram: Hits → Track finder (task on CPU) → Track candidates → Track fitter (task on GPU) → Tracks → ...]
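In FairRoot terms, each box in the chain is a task; a GPU-backed fitter task might be structured like this sketch (class, branch and type names beyond the FairTask/FairRootManager basics are hypothetical):

// Hypothetical fitter task; FairTask is the FairRoot task base class.
class GpuTrackFitter : public FairTask {
public:
    virtual InitStatus Init() {
        FairRootManager* io = FairRootManager::Instance();
        fCandidates = (TClonesArray*)io->GetObject("TrackCand");   // input branch
        fTracks = new TClonesArray("PndTrack");                    // output objects
        io->Register("Track", "Tracks", fTracks, kTRUE);
        return kSUCCESS;
    }
    virtual void Exec(Option_t*) {
        // Flatten the candidates' hits into plain arrays, hand them to the
        // GPU through the FairCuda wrapper, then fill fTracks with the results.
    }
private:
    TClonesArray* fCandidates;
    TClonesArray* fTracks;
};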
18
03/26/09 CHEP 09, Prague 18
Track fitting: implementation in CUDA
[Diagram: on the host (PC), a TClonesArray of track candidates holds the hit coordinates (X, Y, Z) of each track. The candidates are copied to global memory on the device (Tesla); 32 threads cooperate on each track via shared memory, with one thread producing the fitted track. The results are copied back into a TClonesArray of tracks on the host.]
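An illustrative kernel-side arrangement matching the diagram (a sketch under stated assumptions, not the actual PANDA fitter): one 32-thread block per track candidate, hits staged into shared memory, one thread writing the fitted result.

// One thread block (32 threads) fits one track candidate.
// 'hits' holds the x,y,z hit coordinates of all candidates, flattened on the
// host from the TClonesArray before the copy into device global memory.
// Assumes at most 64 hits per candidate.
__global__ void fitTracks(const float3* hits, const int* firstHit,
                          const int* nHits, float* fitParams)
{
    __shared__ float3 trackHits[64];   // this candidate's hits
    int track = blockIdx.x;            // one block per track candidate
    int tid   = threadIdx.x;

    // The 32 threads cooperatively stage the hits from global to shared memory.
    for (int h = tid; h < nHits[track]; h += blockDim.x)
        trackHits[h] = hits[firstHit[track] + h];
    __syncthreads();

    if (tid == 0) {
        // ... fit the track from trackHits (e.g. a helix or straight-line fit)
        // and write the parameters to fitParams for the copy back to the host ...
    }
}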
19
03/26/09 CHEP 09, Prague 19
Track fitting on CPU and GPU: time (ms)

Tracks/event:       50     100    1000   2000
GPU (emulation):    6.0    15.0   180    370
CPU:                3.0    5.0    120    220
GPU (double):       1.2    1.5    3.2    5.0
GPU (float):        1.0    1.2    1.8    3.2
20
03/26/09 CHEP 09, Prague 20
What do we gain? Speedup (CPU time / GPU time)

Tracks/event:    50    100    1000   2000
GPU (double):    2.5   3.3    37.5   44.0
GPU (float):     3.0   4.2    66.7   68.8
21
03/26/09 CHEP 09, Prague 21
CUDA GPU occupancy calculator
Tesla C1060: 30 multiprocessors, 2048 bytes of shared memory per block

In this test (32 threads per block):
Active threads per multiprocessor: 256
Active thread blocks per multiprocessor: 8
Occupancy of each multiprocessor: 25%

To get the most out of this card we should use 256 threads per block:
Active threads per multiprocessor: 1024
Active thread blocks per multiprocessor: 4
Occupancy of each multiprocessor: 100%
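The arithmetic behind these figures, using the limit of 1024 threads per multiprocessor quoted on slide 10 and the hardware cap of 8 resident blocks per multiprocessor:

32 threads/block × 8 blocks = 256 active threads → 256 / 1024 = 25% occupancy
256 threads/block: at most 1024 / 256 = 4 resident blocks → 4 × 256 = 1024 active threads → 100% occupancy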
22
03/26/09 CHEP 09, Prague 22
Next steps
Tesla supports concurrent access, which means that different CPU threads can start different kernels on the device. One could test PROOF and CUDA together for problems which cannot use the full capacity of the cards!
Port the track-finding routines to the GPU (PANDA).
23
03/26/09 CHEP 09, Prague 23
Summary
CUDA is an easy tool to learn and to use.
CUDA allows heterogeneous programming.
Depending on the use case, one can gain factors in performance compared to the CPU.
Emulation mode is slower than native CPU code, but this could improve with OpenCL.