Introducing collaboration members – Korea University (KU)
ALICE TPC Online Tracking Algorithm on GPU Computing Platforms
Joohyung Sun, Prof. Hyeonjoong Cho
ALICE Collaboration, Korea University, Sejong
4th ALICE ITS Upgrade, MFT and O2 Asian Workshop, Pusan

Introduction
♦ Collaboration institute: Korea University
♦ Research goal
♦ ALICE TPC online tracking algorithm on a GPU
♦ Specification of benchmark platform

Introducing Korea University
Prof. Hyeonjoong Cho, Embedded Systems and Real-time Computing Laboratory
 Meeting of June 19th, 2014 at KISTI
♦ Proposal of a contribution by KISTI and Korea University to the ALICE O2 project
♦ Participants from KISTI, Korea University, and CERN
♦ One of the suggested possible collaborations: benchmarking detector-specific algorithms on agreed hardware platforms (multi-core CPUs, many-core CPUs, GPGPUs, etc.)

Our Research Goal
Prof. H. Cho and J. Sun, Korea University, Republic of Korea
 Collaboration institute
♦ Prof. H. Cho, Institute Team Leader, Korea University, Sejong, Republic of Korea
♦ J. Sun, Deputy, Korea University, Sejong, Republic of Korea
 Application benchmark on a modern GPU
♦ Benchmarking different types of processors: Kepler- and Maxwell-based GPUs (Maxwell is Kepler's successor and the latest NVIDIA architecture this year, 2014)
♦ Reengineering detector data processing algorithms (the GPU tracker) to apply NVIDIA Kepler's technologies: Hyper-Q and dynamic parallelism

ALICE TPC Online Tracking Algorithm on a GPU
Detector-specific algorithms with parallel frameworks
 The online event reconstruction
♦ Performed by the High-Level Trigger (HLT)
♦ The most complicated algorithm
♦ Already adapted to GPUs: the GPU has evolved into a general-purpose, massively parallel processor (NVIDIA Fermi with CUDA, AMD with OpenCL)
(Figure: HLT reconstruction scheme; reference: David Rohr, CHEP 2012)

Our Research Goal
Benchmarking platform
 Specification of benchmark platform
♦ CPU: Intel Core i7, 4 cores (8 logical cores with Hyper-Threading)
♦ GPU: NVIDIA Tesla K20c
  Kepler-based architecture
  13 multiprocessors, 192 CUDA cores per multiprocessor
  GPU clock rate: 706 MHz (0.71 GHz)
  Memory clock rate: 2600 MHz
  Memory bus width: 320-bit
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Concurrent copy and kernel execution: yes, with 2 copy engines
(A device-query sketch follows below.)
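These properties can be read back at runtime with cudaGetDeviceProperties from the CUDA runtime API. A minimal sketch (not from the slides; the values in the comments are the K20c figures listed above):

    // Query the device properties listed on this slide (CUDA runtime API).
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                                // device 0
        printf("Name: %s\n", prop.name);                                  // Tesla K20c
        printf("Multiprocessors: %d\n", prop.multiProcessorCount);        // 13
        printf("GPU clock: %.2f GHz\n", prop.clockRate / 1e6);            // 0.71
        printf("Memory clock: %d MHz\n", prop.memoryClockRate / 1000);    // 2600
        printf("Memory bus width: %d-bit\n", prop.memoryBusWidth);        // 320
        printf("Max threads/SM: %d\n", prop.maxThreadsPerMultiProcessor); // 2048
        printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);       // 1024
        printf("Copy engines: %d\n", prop.asyncEngineCount);              // 2
        return 0;
    }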

Fermi and Previous-Generation GPUs
Low usage of GPU resources
 Only one work queue
♦ The GPU can execute only one piece of work at a time
♦ CPU cores are not able to fully utilize GPU resources
 The result is low usage of GPU resources, even though the GPU has plenty of computational resources

Hyper-Q
Maximizing the usage of GPU resources
 Enables multiple CPU cores to launch work on a single GPU simultaneously
♦ Increases GPU utilization
♦ Slashes CPU idle time
 32 work queues, fully scheduled, synchronized, and managed by the GPU itself
 The GPU receives work from all queues at the same time, and all of the work is done concurrently (a minimal launch-pattern sketch follows below)
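A minimal sketch of the launch pattern Hyper-Q accelerates (not from the slides; busyKernel is a hypothetical stand-in for real tracker work): several independent CUDA streams each enqueue a kernel. On Fermi these launches funnel through a single work queue and serialize; on Kepler, with 32 hardware queues, they can run concurrently:

    #include <cuda_runtime.h>

    __global__ void busyKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // trivial per-element work
    }

    int main() {
        const int kStreams = 8, n = 1 << 20;
        cudaStream_t streams[kStreams];
        float *buf[kStreams];
        for (int s = 0; s < kStreams; ++s) {
            cudaStreamCreate(&streams[s]);
            cudaMalloc(&buf[s], n * sizeof(float));
            cudaMemset(buf[s], 0, n * sizeof(float));
            // Independent launches; with Hyper-Q these can execute concurrently.
            busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
        }
        cudaDeviceSynchronize();                      // wait for all streams
        for (int s = 0; s < kStreams; ++s) {
            cudaFree(buf[s]);
            cudaStreamDestroy(streams[s]);
        }
        return 0;
    }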

Previous CUDA Programming Model
Communication between host and device
 Communications between CPU and GPU
♦ Can affect the application's performance
♦ The time cost of each transfer is not negligible

Dynamic Parallelism
Creating work on-the-fly
 Enables the GPU to dynamically spawn new threads
♦ By adapting to the data
♦ Without going back to the host CPU
(Figure: CUDA programming model in Kepler)
 Effectively allows work to be generated and run directly on the GPU, saving the time spent on CPU–GPU communications (a minimal sketch follows below)
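A minimal, hypothetical sketch of this Kepler feature (parentKernel and childKernel are illustrative names, not tracker code): the parent kernel inspects its data on the device and spawns a child grid without returning to the host. It assumes compute capability 3.5+ and separate compilation (nvcc -arch=sm_35 -rdc=true ... -lcudadevrt):

    #include <cuda_runtime.h>

    __global__ void childKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 0.5f;                   // refine the flagged region
    }

    __global__ void parentKernel(float *data, int n) {
        // One thread decides, from the data itself, whether more work is needed,
        // and launches it directly from the device.
        if (threadIdx.x == 0 && blockIdx.x == 0 && data[0] > 1.0f) {
            childKernel<<<(n + 255) / 256, 256>>>(data, n);  // on-device launch
            cudaDeviceSynchronize();                  // wait for the child grid
        }
    }

    int main() {
        const int n = 1 << 16;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        parentKernel<<<1, 32>>>(d, n);                // single host-side launch
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }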

Progress
♦ Previous works
♦ Current progress
♦ Optimization with NVIDIA Visual Profiler

Previous Works
Benchmarking the HLT tracker
 Some results of benchmarking the HLT tracker on each GPU
♦ NVIDIA Fermi (current version): 174 ms
♦ NVIDIA GTX 780 (Kepler): 155 ms
♦ NVIDIA Titan (Kepler): 146 ms
♦ AMD GCN: 160 ms
Reference: P. Buncic et al., “O2 Project”, ALICE LHCC Referee Meeting

Our Current Progress
ALICE TPC online tracking algorithm on a GPU
 Application benchmark
♦ Tested on a Kepler-based GPU; a Maxwell-based GPU will be benchmarked next
♦ Optimizing to fully utilize the compute and data-movement capabilities of the GPU
  Hyper-Q is applied to enable concurrent copy and kernel execution
  Dynamic parallelism will be applied to reduce the number of communications between CPU and GPU

Comparison of Hyper-Q between GTX 650 and K20c
NVIDIA Visual Profiler (nvvp)
 Profiling a GeForce GTX 650 with 2 streams
♦ All work is managed by one work queue
♦ Every copy and kernel execution waits for the previous execution to finish
(Figure: nvvp timeline showing the algorithm's kernel executions and CPU-to-GPU memory copies)

Comparison of Hyper-Q between GTX 650 and K20c
NVIDIA Visual Profiler (nvvp)
 Profiling a Tesla K20c with 2 streams
♦ 32 work queues allow concurrent executions
♦ Copies and kernels are executed concurrently
(Figure: nvvp timeline showing the algorithm's kernel executions and CPU-to-GPU memory copies)
 If the number of streams is increased, how does it behave?

More Concurrent Executions
Tesla K20c, 8 streams
 Tesla K20c with 8 streams
♦ Copies and kernels in more than two of the streams are executed concurrently (a minimal sketch of this overlap pattern follows below)
 How about using even more streams?
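A minimal sketch of the copy/kernel overlap visible in this timeline (not from the slides; scaleKernel and the buffer names are hypothetical): pinned host buffers and cudaMemcpyAsync let a copy engine move data for one stream while another stream's kernel runs:

    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        const int kStreams = 8, n = 1 << 20;
        cudaStream_t streams[kStreams];
        float *hBuf[kStreams], *dBuf[kStreams];
        for (int s = 0; s < kStreams; ++s) {
            cudaStreamCreate(&streams[s]);
            cudaMallocHost(&hBuf[s], n * sizeof(float));  // pinned host memory
            cudaMalloc(&dBuf[s], n * sizeof(float));
        }
        for (int s = 0; s < kStreams; ++s) {
            // Copy in, compute, copy out: independent per stream, so copies in
            // one stream can overlap kernels in another.
            cudaMemcpyAsync(dBuf[s], hBuf[s], n * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            scaleKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(dBuf[s], n);
            cudaMemcpyAsync(hBuf[s], dBuf[s], n * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();
        for (int s = 0; s < kStreams; ++s) {
            cudaFreeHost(hBuf[s]);
            cudaFree(dBuf[s]);
            cudaStreamDestroy(streams[s]);
        }
        return 0;
    }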

Observation from Multiple Streams
The number of streams
 Measuring specific compute kernels' times against the number of streams
♦ Number of streams: 2 to 36
♦ Kernels measured (launched per stream): PreInitRowBlocks, cudaMemcpyAsync(..., cudaMemcpyHostToDevice, ...), AliHLTTPCCANeighboursFinder, AliHLTTPCCANeighboursCleaner, AliHLTTPCCAStartHitsFinder, AliHLTTPCCAStartHitsSorter
 Increasing the stream count significantly reduces the tracking time compared with using 2 streams (a measurement sketch follows below)
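A hedged sketch of how such times can be measured while sweeping the stream count, using CUDA event timers (dummyKernel is a stand-in for tracker kernels such as AliHLTTPCCANeighboursFinder):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummyKernel(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        for (int nStreams = 2; nStreams <= 36; nStreams += 2) {
            cudaStream_t *streams = new cudaStream_t[nStreams];
            for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);
            cudaEventRecord(start);
            for (int s = 0; s < nStreams; ++s)   // one launch per stream
                dummyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("%2d streams: %.3f ms\n", nStreams, ms);

            for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
            delete[] streams;
            cudaEventDestroy(start);
            cudaEventDestroy(stop);
        }
        cudaFree(d);
        return 0;
    }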

Possible Reasons for the Observation
The key for optimization
 The number of copy engines in the GPU
♦ E.g., the Tesla K20c has only 2 copy engines
♦ This limits the number of copies that can be executed concurrently
 Very short kernel execution times
♦ A kernel can finish before the next kernel execution arrives
♦ The longest kernel execution time was only about 2 ms during this test
 This observation will be a key for optimizing Hyper-Q

Summary
Next research plans
 Korea University
♦ Prof. Hyeonjoong Cho, Institute Team Leader, Korea University, Sejong, Republic of Korea
♦ Joohyung Sun, Deputy, Korea University, Sejong, Republic of Korea
 Next research plans
♦ Benchmarking a Maxwell-based GPU (GeForce GTX 980, about $549)
♦ Efficiently applying the GPU's technologies: Hyper-Q with stream scheduling, and dynamic parallelism with device memory management

Appendix. Actual Code for Dynamic Parallelism
Creating work on-the-fly
 LU decomposition (Fermi): the CPU drives every step, launching each kernel and copying data back between launches

CPU code:
    dgetrf(N,N) {
      for j = 1 to N
        for i = 1 to 64
          idamax<<<...>>>()
          memcpy
          dswap<<<...>>>()
          memcpy
          dscal<<<...>>>()
          dger<<<...>>>()
        next i
        memcpy
        dlaswap<<<...>>>()
        dtrsm<<<...>>>()
        dgemm<<<...>>>()
      next j
    }

GPU code (kernels only):
    idamax(); dswap(); dscal(); dger(); dlaswap(); dtrsm(); dgemm(); ...

Appendix. Actual Code for Dynamic Parallelism
Creating work on-the-fly
 LU decomposition (Kepler): the factorization loop runs entirely on the GPU; the CPU launches one kernel and is then free

CPU code:
    dgetrf(N,N) {
      dgetrf<<<...>>>(N,N)
      synchronize();
    }

GPU code:
    dgetrf(N,N) {
      for j = 1 to N
        for i = 1 to 64
          idamax<<<...>>>()
          dswap<<<...>>>()
          dscal<<<...>>>()
          dger<<<...>>>()
        next i
        dlaswap<<<...>>>()
        dtrsm<<<...>>>()
        dgemm<<<...>>>()
      next j
    }

 The CPU is free while the GPU generates and executes its own work

Appendix. Example of LU Decomposition
Profiling LU decomposition using NVIDIA Visual Profiler (nvvp)
(Figure: nvvp timeline on a Tesla K20c, Context 1 (CUDA), showing Memcpy operations and launches of the cgetrf_cdpentry kernel)