GPUs for event reconstruction in the FairRoot Framework
Mohammad Al-Turany (GSI-IT), Florian Uhlig (GSI-IT), Radoslaw Karabowicz (GSI-IT)

03/26/09 CHEP 09, Prague 2: Overview
– FairRoot: introduction
– GPUs and CUDA
– Panda track fitting on the GPUs
– Summary

03/26/09 CHEP 09, Prague 3: FairRoot
[Diagram: experiments using FairRoot: Panda, Cbm, MPD, R3B, AXL]

03/26/09 CHEP 09, Prague 4: Features
– No executable (only ROOT CINT):
  – Compiled tasks for reconstruction, analysis, etc.
  – ROOT macros for steering simulation or reconstruction
  – ROOT macros for configuration (G3, G4, Fluka and analysis)
– VMC and VGM for simulation
– Reconstruction can be done directly with the simulation or as a separate step
– RHO package for analysis
– TGeoManager in simulation and reconstruction
– Grid: we use AliEn!
– CMake: Makefiles, dependencies, QM
– Doxygen for class documentation

03/26/09 CHEP 09, Prague 5: FairRoot delivers
– Detector base classes that handle initialization, geometry construction, hit processing (stepping action), etc.
– IO manager based on ROOT TFolder and TTree (TChain)
– Geometry readers: ASCII, ROOT, CAD2ROOT
– Radiation length manager
– Generic track propagation based on Geane
– Generic event display based on EVE and Geane
– Oracle interface for geometry and parameter handling
– Fast-simulation base services based on VMC and ROOT TTasks (full and fast simulations can be mixed in one run)
– Interfaces for some event generators: Pythia, Urqmd, Evtgen, Pluto, Dpm, ...

03/26/09 CHEP 09, Prague 6: Nvidia’s Compute Unified Device Architecture (CUDA)

03/26/09 CHEP 09, Prague 7: CUDA Toolkit
– nvcc C compiler
– CUDA FFT and BLAS libraries for the GPU
– Profiler
– gdb debugger for the GPU
– CUDA runtime driver (also available in the standard NVIDIA GPU driver)
– CUDA programming manual
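As an illustration of the bundled libraries, here is a minimal sketch of a 1D complex-to-complex transform with the CUDA FFT library (the function fft_example, the in-place transform and the data handling are our assumptions, not from the talk):

#include <cufft.h>
#include <cuda_runtime.h>

void fft_example(cufftComplex *d_data, int n) {
    // d_data: device array of n complex samples, already filled
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);               // one 1D complex-to-complex transform
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); // in-place forward FFT
    cufftDestroy(plan);
}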

03/26/09 CHEP 09, Prague 8: Why CUDA?
– CUDA development tools work alongside the conventional C/C++ compiler, so one can mix GPU code with general-purpose code for the host CPU.
– CUDA manages threads automatically: it does not require explicit thread management in the conventional sense, which greatly simplifies the programming model.
– Stable, available for free, documented and supported on Windows, Linux and Mac.
– Low learning curve: just a few extensions to C (see the sketch below), and no knowledge of graphics is required.
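To show how small the set of C extensions really is, a minimal hypothetical kernel and its launch (the kernel name and sizes are ours): __global__ marks device code, threadIdx/blockIdx/blockDim are built in, and <<<grid, block>>> is the launch syntax.

#include <cuda_runtime.h>

// __global__ marks a function that runs on the GPU
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in thread coordinates
    if (i < n) c[i] = a[i] + b[i];
}

// Host side: the only new syntax is the launch configuration, e.g.
//   add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);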

03/26/09 CHEP 09, Prague 9: CUDA and GPGPU programming
– GPGPU usually means writing software for a GPU in the language of the GPU.
– CUDA permits working with familiar programming concepts while developing software that can run on a GPU.
– CUDA compiles the code directly for the hardware (to GPU assembly language, for instance), thereby providing great performance.

03/26/09 CHEP 09, Prague 10: GPU threads and CPU threads
– GPU threads are extremely lightweight.
– CPUs can execute 1-2 threads per core, while GPUs can maintain up to 1024 threads per multiprocessor (8 cores).
– CPUs use SIMD (a single instruction performed over multiple data) vector units, whereas GPUs use SIMT (single instruction, multiple threads) for scalar thread processing.
– SIMT does not require developers to convert data to vectors, and it allows arbitrary branching in threads (illustrated below).
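A small sketch of the SIMT point (kernel name and logic are ours): each thread processes one scalar element and may branch independently, with no explicit vectorization by the programmer.

__global__ void fold(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // per-thread bounds check
        if (v[i] < 0.0f)
            v[i] = -v[i];        // some threads take this path...
        else
            v[i] *= 2.0f;        // ...others take this one, independently
    }
}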

03/26/09 CHEP 09, Prague 11: Threads and Kernels
Parallel parts of the application are executed on the device as kernels:
– One kernel is executed at a time
– Many threads execute each kernel
– A kernel launch creates a grid of thread blocks
– Threads within a block cooperate via shared memory
– Threads within a block can synchronize
– Threads in different blocks cannot cooperate
[Diagram: grid of thread blocks, each block with its own shared memory]
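A minimal sketch of block-level cooperation (kernel name and sizes are assumptions; it assumes 256 threads per block and an input length that is a multiple of 256): the threads of one block share a buffer in shared memory and synchronize with __syncthreads(), while blocks stay independent.

__global__ void block_sum(const float *in, float *out) {
    __shared__ float buf[256];               // shared memory: visible to this block only
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                         // barrier for all threads of this block

    // tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];  // one partial sum per block; combining
                                             // blocks needs a second kernel launch
}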

03/26/09 CHEP 09, Prague 12: Heterogeneous Programming
[Diagram: serial code runs on the host; each parallel kernel launch, Kernel_0<<<...>>>() and Kernel_1<<<...>>>(), executes as a grid of thread blocks (Grid 0, Grid 1) on the device, with serial host code in between.]
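A minimal sketch of the pattern in the diagram. The kernel names Kernel_0 and Kernel_1 mirror the slide; the bodies, sizes and the use of cudaThreadSynchronize() (the CUDA 2.x-era API) are our assumptions.

#include <cuda_runtime.h>

__global__ void Kernel_0(float *d) { /* ... work of Grid 0 ... */ }
__global__ void Kernel_1(float *d) { /* ... work of Grid 1 ... */ }

int main(void) {
    float *d_data;
    cudaMalloc((void **)&d_data, 1024 * sizeof(float));

    /* serial host code ... */
    Kernel_0<<<32, 256>>>(d_data);   // parallel kernel executes as Grid 0 on the device
    cudaThreadSynchronize();         // host waits for the device

    /* more serial host code ... */
    Kernel_1<<<64, 128>>>(d_data);   // Grid 1
    cudaThreadSynchronize();

    cudaFree(d_data);
    return 0;
}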

03/26/09 CHEP 09, Prague 13: CUDA in FairRoot
– FindCuda.cmake (Abe Stephens, SCI Institute) integrates CUDA into FairRoot very smoothly
– CMake creates shared libraries for the CUDA part
– FairCuda is a class which wraps the CUDA-implemented functions so that they can be used directly from ROOT CINT or from compiled code (sketched below)
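The slides do not show FairCuda itself, so here is a hypothetical sketch of the wrapping idea (every name except FairCuda is ours): the CUDA code is compiled by nvcc into a shared library and exposed through a plain C entry point, which an ordinary ROOT-visible C++ class then forwards to.

// trackfit.cu -- compiled by nvcc into the CUDA shared library
// (hypothetical entry point; the real FairCuda interface is not shown in the talk)
extern "C" void FitTracksOnGPU(const float *xyz, int nTracks, float *params);

// FairCuda.h -- plain C++ class, visible to ROOT CINT through its dictionary
class FairCuda {
public:
    void FitTracks(const float *xyz, int nTracks, float *params) {
        FitTracksOnGPU(xyz, nTracks, params);  // forward to the CUDA implementation
    }
};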

03/26/09 CHEP 09, Prague 14: Adding CUDA files to CMake

# Add current directory to the nvcc include line.
CUDA_INCLUDE_DIRECTORIES(
  ${CMAKE_CURRENT_SOURCE_DIR}
  ${CUDA_CUT_INCLUDE}
  ${ROOT_INCLUDE_DIR}
  ${CUDA_INCLUDE}
)
set(CUDA_SRCS
  trackfit.cu
  trackfit_kernel.cu
)
set(LINK_DIRECTORIES
  ${ROOT_LIBRARY_DIR}
)
CUDA_ADD_LIBRARY(cuda_imp ${CUDA_SRCS})
# Specify the dependency.
TARGET_LINK_LIBRARIES(Trackfit
  cuda_imp
  ${CUDA_TARGET_LINK}
  ${CUDA_CUT_TARGET_LINK}
  ${ROOT_LIBRARIES}
)

03/26/09 CHEP 09, Prague 15: Testing Hardware
NVIDIA Tesla C1060:
– Streaming processor cores: 240
– Frequency of processor cores: 1.3 GHz
– Peak single-precision floating-point performance: 933 GFLOPS
– Peak double-precision floating-point performance: 78 GFLOPS
– Floating-point precision: IEEE 754 single & double
– Total dedicated memory: 4 GB GDDR3
– Memory speed: 800 MHz
– Memory interface: 512-bit
– Memory bandwidth: 102 GB/s
– Max power consumption: 200 W peak, 160 W typical
– System interface: PCIe x16
Host: Intel Xeon quad-core, 2.5 GHz, 16 GB memory, 875 W power supply, SUSE Enterprise 10.3

03/26/09 CHEP 09, Prague 16: Test configuration
– MVD, TPC, EMC and MDT detectors
– 1 GeV muons
– 50 events for each sample
– Four samples with 50, 100, 1000 and 2000 primary tracks per event
– Ideal track finder (simply copies the Geant tracks to the track-candidate objects)

03/26/09 CHEP 09, Prague 17: Reconstruction chain (PANDA)
Hits → Track Finder → Track candidates → Track Fitter → Tracks
The track finder runs as a task on the CPU; the track fitter runs as a task on the GPU.

03/26/09 CHEP 09, Prague 18: Track fitting: implementation in CUDA
[Diagram: on the host (PC), the track candidates are read from a TClonesArray and their hit coordinates (X, Y, Z) are copied to global memory on the device (Tesla). Each track is fitted by one thread block of 32 threads using shared memory; a single thread per block writes the result, which is copied back into a TClonesArray of tracks on the host.]
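A hedged sketch of the layout the diagram suggests: the hits are flattened into plain x/y/z arrays in device global memory, and one 32-thread block processes each track candidate, accumulating its fit sums in shared memory. The actual fit is only indicated by a placeholder; all names and the data layout are our assumptions.

// one block per track candidate, 32 threads per block (as in the diagram)
__global__ void fit_kernel(const float *x, const float *y, const float *z,
                           const int *firstHit, const int *nHits,
                           float *trackParams) {
    __shared__ float sums[32];              // per-block shared memory
    int trk = blockIdx.x;
    int tid = threadIdx.x;

    // each of the 32 threads accumulates a partial sum over this track's hits
    float s = 0.0f;
    for (int h = tid; h < nHits[trk]; h += 32) {
        int i = firstHit[trk] + h;
        s += x[i] * x[i] + y[i] * y[i];     // placeholder for the real fit sums
    }
    sums[tid] = s;
    __syncthreads();

    if (tid == 0) {                         // a single thread combines and stores
        float total = 0.0f;
        for (int t = 0; t < 32; ++t) total += sums[t];
        trackParams[trk] = total;           // later copied back to the host TClonesArray
    }
}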

03/26/09 CHEP 09, Prague 19: Track fitting on CPU and GPU
[Plot: fitting time for the CPU, the GPU in emulation mode (Emu), and the GPU in double (D) and single (F) precision.]

03/26/09 CHEP 09, Prague 20: What do we gain?
[Plots: time (ms) versus tracks/event for the GPU in double and float precision, and the ratio of CPU time to GPU time versus tracks/event.]

03/26/09 CHEP 09, Prague 21: CUDA GPU occupancy calculator
In this test, 32 threads per block were used:
– Active threads per multiprocessor: 256 (8 blocks × 32 threads)
– Active thread blocks per multiprocessor: 8
– Occupancy of each multiprocessor: 25% (256 of the 1024 possible threads)
With 256 threads per block:
– Active threads per multiprocessor: 1024
– Active thread blocks per multiprocessor: 4
– Occupancy of each multiprocessor: 100%
To get the most out of this card, 256 threads per block should be used.
(Tesla C1060: 30 multiprocessors; 2048 bytes of shared memory per block in this test)

03/26/09 CHEP 09, Prague 22: Next Steps
– Tesla supports concurrent access, which means that different CPU threads can start different kernels on the device. One could test PROOF and CUDA together for problems which cannot use the full capacity of the cards!
– Port the track-finding routines to the GPU (PANDA)

03/26/09 CHEP 09, Prague 23: Summary
– CUDA is an easy tool to learn and to use.
– CUDA allows heterogeneous programming.
– Depending on the use case, one can gain factors in performance compared to the CPU.
– Emulation mode is slower than native CPU code, but this could improve with OpenCL.