Track finding and fitting with GPUs
Oleksiy Rybalchenko and Mohammad Al-Turany, GSI-IT
PANDA FEE/DAQ/Trigger Workshop, Grünberg, 4/29/11



Track finding and fitting for PANDA. Central tracker: MVD + STT.

Work on offline and online tracking is already ongoing in PANDA.

Secondary vertex: re-iterate with each hit point as the reference point.

Lines in conformal space can be found using the Hough transform.
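The slides do not spell out the transformation; the standard conformal mapping used for this kind of circular-track finding (the exact convention used in the Giessen code is an assumption) sends a hit (x, y) to:

```latex
u = \frac{x}{x^2 + y^2}, \qquad v = \frac{y}{x^2 + y^2}
% A circle through the origin, x^2 + y^2 = 2ax + 2by,
% becomes a straight line in (u, v) space:
2au + 2bv = 1
% which the Hough transform finds by voting in (\theta, r) bins:
r = u\cos\theta + v\sin\theta
```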

David Münchow and Soeren Lange kindly shared their C++ code with us.

Our group would like to:
- Port the code developed at Giessen to the GPU
- Compare the performance, scalability, re-usability, etc. with other hardware and/or software techniques on the market
- Build a GPU-based computing node prototype that can do the tracking on the fly (online) without I/O

The graphics processing units: GPUs. Nowadays, GPUs have evolved into high-performance co-processors. They can be easily programmed with common high-level languages such as C, C++, and Fortran. Today's GPUs greatly outpace CPUs in arithmetic performance and memory bandwidth, making them ideal co-processors for accelerating a variety of data-parallel applications.

Floating-point operations per second for CPU and GPU.

Memory bandwidth for the CPU and GPU.

Why NVIDIA?
- Supports a true cache hierarchy in combination with on-chip shared memory
- First GPU architecture to support ECC: it detects and corrects errors before the system is affected, and protects register files, shared memories, L1 and L2 caches, and DRAM
- Limited, but increasing, support for C++
- The Tesla product family is designed from the ground up for parallel computing and offers exclusive computing features

Why CUDA?
- CUDA development tools work alongside the conventional C/C++ compiler, so one can mix GPU code with general-purpose code for the host CPU.
- CUDA manages threads automatically: it does NOT require explicit thread management in the conventional sense, which greatly simplifies the programming model.
- Stable, freely available, documented, and supported on Windows, Linux, and Mac OS.
- Low learning curve: just a few extensions to C (illustrated below), and no knowledge of graphics is required.
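A minimal sketch (not from the slides) of those few extensions: a GPU function is marked __global__ and launched with the <<<blocks, threads>>> syntax; everything else is plain C.

```cuda
// Minimal sketch of the C extensions CUDA adds: __global__ marks a
// function that runs on the GPU, <<<blocks, threads>>> launches it,
// and blockIdx/blockDim/threadIdx are built-in thread indices.
#include <cuda_runtime.h>

__global__ void addVectors(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}

int main() {
  const int n = 1024;
  float *a, *b, *c;
  cudaMalloc((void**)&a, n * sizeof(float));
  cudaMalloc((void**)&b, n * sizeof(float));
  cudaMalloc((void**)&c, n * sizeof(float));
  // ... copy input data into a and b with cudaMemcpy ...
  addVectors<<<(n + 255) / 256, 256>>>(a, b, c, n);
  cudaDeviceSynchronize();
  // ... copy results back from c with cudaMemcpy ...
  cudaFree(a); cudaFree(b); cudaFree(c);
  return 0;
}
```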

CUDA features:
- Standard C/C++ language for parallel application development on the GPU
- Standard numerical libraries (see the example below):
o FFT (Fast Fourier Transform)
o BLAS (Basic Linear Algebra Subprograms)
o CUSPARSE (for handling sparse matrices)
o CURAND (generation of high-quality pseudorandom and quasirandom numbers)
- Dedicated CUDA driver for computing, with a fast data-transfer path between GPU and CPU
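As a hedged sketch (not from the slides), the BLAS library ships as CUBLAS; a SAXPY call on device data looks like this with the v2 API:

```cpp
// Sketch of calling a standard CUDA library (CUBLAS, v2 API):
// computes y = alpha*x + y entirely on the GPU.
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
  const int n = 1 << 20;
  const float alpha = 2.0f;
  float *x, *y;
  cudaMalloc((void**)&x, n * sizeof(float));
  cudaMalloc((void**)&y, n * sizeof(float));
  // ... fill x and y, e.g. with cudaMemcpy from host arrays ...

  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasSaxpy(handle, n, &alpha, x, 1, y, 1);  // runs on the GPU
  cublasDestroy(handle);

  cudaFree(x);
  cudaFree(y);
  return 0;
}
```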

Automatic scalability: the thread blocks of a kernel are distributed automatically across however many multiprocessors the GPU has.

CUDA Toolkit:
- C/C++ compiler
- Visual Profiler
- cuda-gdb debugger
- GPU-accelerated BLAS library
- GPU-accelerated FFT library
- GPU-accelerated sparse matrix library
- GPU-accelerated RNG library
- Additional tools and documentation

GPU usage examples in FairRoot.

GPU usage in FairRoot:
- FindCuda.cmake (Abe Stephens, SCI Institute) integrates CUDA into FairRoot very smoothly
- CMake creates shared libraries for the CUDA part
- FairCuda is a class that wraps the CUDA-implemented functions so that they can be used directly from ROOT CINT or from compiled code (a sketch of the wrapper pattern follows below)
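FairCuda's actual interface is not shown in the slides; this is a hypothetical sketch of the wrapping pattern only, with illustrative names (FairCudaWrapper, PropagateTracksOnGpu are not the real API):

```cpp
// Hypothetical sketch of the wrapper pattern: the class exposes only
// plain C++ types, so ROOT CINT can generate a dictionary for it,
// while the launcher it forwards to would live in a .cu file compiled
// by nvcc into the CUDA shared library.

extern "C" void PropagateTracksOnGpu(float* tracks, int n);  // defined in .cu

class FairCudaWrapper {
 public:
  // Callable from ROOT CINT or compiled code; forwards to the GPU.
  void PropagateTracks(float* tracks, int n) {
    PropagateTracksOnGpu(tracks, n);
  }
};
```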

Track propagation (RK4) in a magnetic field on the GPU: speedup by a factor of 40. (Plot: propagation time per event, CPU/GPU, on a Tesla card.)

Track propagation (RK4) using the PANDA field map: speedup of up to a factor of 175.
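The slides do not include the kernel itself; below is a minimal sketch of RK4 propagation on the GPU, assuming a uniform field B, one thread per track, and a simplified equation of motion (the real code uses the full PANDA field map):

```cuda
// Illustrative sketch: 4th-order Runge-Kutta propagation of a track
// state (x, y, z, px, py, pz), one CUDA thread per track.
#include <cuda_runtime.h>

__device__ void deriv(const float* s, float qOverC, float3 B, float* d) {
  float p = sqrtf(s[3]*s[3] + s[4]*s[4] + s[5]*s[5]);
  d[0] = s[3]/p; d[1] = s[4]/p; d[2] = s[5]/p;   // dr/ds = unit momentum
  d[3] = qOverC * (d[1]*B.z - d[2]*B.y);         // dp/ds ~ q (p_hat x B)
  d[4] = qOverC * (d[2]*B.x - d[0]*B.z);
  d[5] = qOverC * (d[0]*B.y - d[1]*B.x);
}

__global__ void propagateRK4(float* states, int nTracks, float qOverC,
                             float3 B, float h, int nSteps) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nTracks) return;
  float s[6];
  for (int c = 0; c < 6; ++c) s[c] = states[6*i + c];

  for (int step = 0; step < nSteps; ++step) {
    float k1[6], k2[6], k3[6], k4[6], t[6];
    deriv(s, qOverC, B, k1);
    for (int c = 0; c < 6; ++c) t[c] = s[c] + 0.5f*h*k1[c];
    deriv(t, qOverC, B, k2);
    for (int c = 0; c < 6; ++c) t[c] = s[c] + 0.5f*h*k2[c];
    deriv(t, qOverC, B, k3);
    for (int c = 0; c < 6; ++c) t[c] = s[c] + h*k3[c];
    deriv(t, qOverC, B, k4);
    for (int c = 0; c < 6; ++c)                   // RK4 combination step
      s[c] += h/6.0f * (k1[c] + 2.0f*k2[c] + 2.0f*k3[c] + k4[c]);
  }
  for (int c = 0; c < 6; ++c) states[6*i + c] = s[c];
}
```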

Track + vertex fitting (PANDA): the same program code on the same hardware, just using pinned memory instead of explicit memory copies. (Plot: CPU time/GPU time, split into data-copy and kernel-execution phases.)
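The slides do not show the pinned-memory code; a minimal sketch of the technique, assuming a device that supports mapped host memory (the scale kernel is a stand-in for the fitter):

```cuda
// Pinned (page-locked) memory mapped into the device address space:
// the kernel accesses the host buffer directly over PCIe, so no
// cudaMemcpy is needed in either direction (zero copy).
#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {       // stand-in for the fitter
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main() {
  const int n = 1 << 20;
  cudaSetDeviceFlags(cudaDeviceMapHost);          // enable mapped memory

  float *hostBuf, *devView;
  cudaHostAlloc((void**)&hostBuf, n * sizeof(float), cudaHostAllocMapped);
  cudaHostGetDevicePointer((void**)&devView, hostBuf, 0);  // GPU view

  for (int i = 0; i < n; ++i) hostBuf[i] = 1.0f;  // fill on the host

  scale<<<(n + 255) / 256, 256>>>(devView, n);    // no explicit copies
  cudaDeviceSynchronize();                        // results are in hostBuf

  cudaFreeHost(hostBuf);
  return 0;
}
```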

Porting the track finder/fitter to CUDA.

First issues:
- The original code is not optimized for parallel architectures
- Lookup tables are used for the mathematical functions (the code is designed to run on an FPGA)
Approach:
- Redesign the code into many functions (kernels) instead of one main
- Use the standard mathematical libraries delivered by NVIDIA

Redesigning the code into kernels: the code became more modular, and different kernels (functions) have different levels of parallelization (a sketch of one such kernel follows below).
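The slides name the kernels but not their bodies; this is a hedged sketch of what a conformalAndHough-style kernel could look like, with NVIDIA's device math (sincosf) replacing the FPGA lookup tables. The accumulator layout and binning are assumptions:

```cuda
// Hedged sketch (binning and layout are illustrative): each thread
// handles one (hit, angle-bin) pair, maps the hit to conformal space,
// and votes into the Hough accumulator.
#include <cuda_runtime.h>

#define N_THETA 256   // angle bins (assumption)
#define N_R     256   // distance bins (assumption)

__global__ void conformalAndHough(const float* x, const float* y,
                                  int nHits, unsigned int* hough,
                                  float rMax) {
  int hit = blockIdx.x;                 // one block per hit
  int it  = threadIdx.x;                // one thread per theta bin
  if (hit >= nHits || it >= N_THETA) return;

  // Conformal mapping: circles through the origin become straight lines.
  float d2 = x[hit]*x[hit] + y[hit]*y[hit];
  float u = x[hit] / d2;
  float v = y[hit] / d2;

  // Hough vote: r = u cos(theta) + v sin(theta)
  float theta = it * (3.14159265f / N_THETA);
  float s, c;
  sincosf(theta, &s, &c);               // device math instead of tables
  float r = u * c + v * s;

  int ir = (int)((r / rMax + 1.0f) * 0.5f * N_R);
  if (ir >= 0 && ir < N_R)
    atomicAdd(&hough[it * N_R + ir], 1u);  // concurrent votes need atomics
}
// Launched as: conformalAndHough<<<nHits, N_THETA>>>(x, y, nHits, hough, rMax);
```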

A total improvement of a factor of 200 compared to the original code on an Intel Xeon 2.53 GHz CPU. The per-kernel breakdown (CPU time in ms, GPU time in ms, improvement, occupancy) covered:
- total runtime (without Z-analysis)
- startUp(), setOrigin(), clearing of Hough space and peaks (memset on GPU), conformalAndHough(), findPeaksInHoughSpace(), findDoublePointPeaksInHoughSpace(), collectPeaks(), sortPeaks(), resetOrigin(): run num_points times each
- countPointsCloseToTrackAndTrackParams(), collectSimilarTracks(), collectSimilarTracks2(): run once each
- getPointsOnTrack(), nullifyPointsOfThisTrack(), clearing of Hough space (memset on GPU), secondHough(), findPeaksInHoughSpaceAgain(), collectTracks(): run num_tracks times each

Profiler output: GPU time for each kernel.

Up to 130 GB/s of global memory write throughput is reached with the current implementation. (Table: GPU time %, global memory write and read throughput in GB/s, occupancy %, and total number of threads launched, for the kernels conformalAndHough, findPeaksInHoughSpace, collectPeaks, findDoublePointPeaksInHoughSpace, and memset32_aligned1D.)

The results are comparable:
CPU results: tracks found in the first step: 1615; after collecting similar tracks: 158; after recalculating tracks: 14
GPU results: tracks found in the first step: 1609; after collecting similar tracks: 151; after recalculating tracks: 14

Planned prototype for online tracking: the data buffer is allocated by the GPU (pinned/mapped memory).

Plans for online: online track finding and fitting with the GPU. In collaboration with GSI EE, build a prototype for an online system:
o Use the PEXOR card to get data into the PC
o The PEXOR driver allocates a buffer in PC memory and writes the data to it
o The GPU uses zero copy to access the data, analyzes it, and writes out the results (see the sketch below)
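The slides do not show this code; a hedged sketch of how a driver-allocated buffer could be exposed to the GPU with zero copy, using cudaHostRegister (the buffer origin and the analysis kernel are placeholders):

```cuda
// Hedged sketch: page-lock a buffer allocated outside CUDA (e.g. one
// filled by the PEXOR driver) and map it into the GPU address space,
// so a kernel can read it directly without an intermediate cudaMemcpy.
// Assumes cudaSetDeviceFlags(cudaDeviceMapHost) was called at startup.
#include <cuda_runtime.h>

__global__ void analyzeEvents(const char* data, size_t size) {
  // ... placeholder for online track finding/fitting on the raw data ...
}

void processDriverBuffer(void* driverBuf, size_t size) {
  // Pin the existing allocation and map it for device access.
  cudaHostRegister(driverBuf, size, cudaHostRegisterMapped);

  char* devView;
  cudaHostGetDevicePointer((void**)&devView, driverBuf, 0);

  analyzeEvents<<<128, 256>>>(devView, size);  // reads host memory directly
  cudaDeviceSynchronize();

  cudaHostUnregister(driverBuf);               // undo the pinning when done
}
```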

PEXOR: The GSI PEXOR is a PCI Express card that provides a complete development platform for designing and verifying applications based on the Lattice SCM FPGA family. Serial gigabit transceiver interfaces (SERDES) provide connections to PCI Express x1 or x4 and to four 2 Gbit SFP optical transceivers.

Summary and outlook:
- Porting the code is ongoing; for the ported part we already gain a factor of 200 compared to single-threaded CPU code.
- We need to use the fast Hough transform reported by David and Soeren.
- Code optimization should follow.
- For the GPU-PEXOR computing node prototype we need:
o To rewrite the PEXOR driver so that it accepts an externally allocated data buffer
o To use one PEXOR card as a transmitter to send Geant events (and/or time streams) for testing and debugging