S. Pardi Frascati, 2012 March 19-23 GPGPU Evaluation – First experiences in Napoli Silvio Pardi.

Slides:



Advertisements
Similar presentations
Intermediate GPGPU Programming in CUDA
Advertisements

Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
Lecture 6: Multicore Systems
Monte-Carlo method and Parallel computing  An introduction to GPU programming Mr. Fang-An Kuo, Dr. Matthew R. Smith NCHC Applied Scientific Computing.
Utilization of GPU’s for General Computing Presenter: Charlene DiMeglio Paper: Aspects of GPU for General Purpose High Performance Computing Suda, Reiji,
Exploiting Graphics Processors for High- performance IP Lookup in Software Routers Author: Jin Zhao, Xinya Zhang, Xin Wang, Yangdong Deng, Xiaoming Fu.
GPGPU Introduction Alan Gray EPCC The University of Edinburgh.
HPCC Mid-Morning Break High Performance Computing on a GPU cluster Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery.
GPU Programming and CUDA Sathish Vadhiyar Parallel Programming.
PARALLEL PROCESSING COMPARATIVE STUDY 1. CONTEXT How to finish a work in short time???? Solution To use quicker worker. Inconvenient: The speed of worker.
Acceleration of the Smith– Waterman algorithm using single and multiple graphics processors Author : Ali Khajeh-Saeed, Stephen Poole, J. Blair Perot. Publisher:
2009/04/07 Yun-Yang Ma.  Overview  What is CUDA ◦ Architecture ◦ Programming Model ◦ Memory Model  H.264 Motion Estimation on CUDA ◦ Method ◦ Experimental.
Chapter Hardwired vs Microprogrammed Control Multithreading
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU Presented by: Ahmad Lashgar ECE Department, University of Tehran.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering,
HPCC Mid-Morning Break Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery Introduction to the new GPU (GFX) cluster.
Shekoofeh Azizi Spring  CUDA is a parallel computing platform and programming model invented by NVIDIA  With CUDA, you can send C, C++ and Fortran.
Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.
2012/06/22 Contents  GPU (Graphic Processing Unit)  CUDA Programming  Target: Clustering with Kmeans  How to use.
Training Program on GPU Programming with CUDA 31 st July, 7 th Aug, 14 th Aug 2011 CUDA Teaching UoM.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
GPU Programming David Monismith Based on notes taken from the Udacity Parallel Programming Course.
Shared memory systems. What is a shared memory system Single memory space accessible to the programmer Processor communicate through the network to the.
BY: ALI AJORIAN ISFAHAN UNIVERSITY OF TECHNOLOGY 2012 GPU Architecture 1.
University of Central Florida Understanding Software Approaches for GPGPU Reliability Martin Dimitrov* Mike Mantor† Huiyang Zhou* *University of Central.
Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.
Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.
MS Thesis Defense “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” by Deepthi Gummadi CoE EECS Department April 21, 2014.
General Purpose Computing on Graphics Processing Units: Optimization Strategy Henry Au Space and Naval Warfare Center Pacific 09/12/12.
Algorithm Engineering „GPGPU“ Stefan Edelkamp. Graphics Processing Units  GPGPU = (GP)²U General Purpose Programming on the GPU  „Parallelism for the.
GPU in HPC Scott A. Friedman ATS Research Computing Technologies.
YOU LI SUPERVISOR: DR. CHU XIAOWEN CO-SUPERVISOR: PROF. LIU JIMING THURSDAY, MARCH 11, 2010 Speeding up k-Means by GPUs 1.
Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering.
GPU Architecture and Programming
Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.
CUDA - 2.
GPU Programming and CUDA Sathish Vadhiyar Parallel Programming.
JPEG-GPU: A GPGPU IMPLEMENTATION OF JPEG CORE CODING SYSTEMS Ang Li University of Wisconsin-Madison.
Multi-Core Development Kyle Anderson. Overview History Pollack’s Law Moore’s Law CPU GPU OpenCL CUDA Parallelism.
Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.
Jie Chen. 30 Multi-Processors each contains 8 cores at 1.4 GHz 4GB GDDR3 memory offers ~100GB/s memory bandwidth.
Some key aspects of NVIDIA GPUs and CUDA. Silicon Usage.
Adam Wagner Kevin Forbes. Motivation  Take advantage of GPU architecture for highly parallel data-intensive application  Enhance image segmentation.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS Spring 2011 * Where n is 2 or 3.
Lecture 8 : Manycore GPU Programming with CUDA Courtesy : SUNY-Stony Brook Prof. Chowdhury’s course note slides are used in this lecture note.
Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.
Sunpyo Hong, Hyesoon Kim
Heterogeneous Computing With GPGPUs Matthew Piehl Overview Introduction to CUDA Project Overview Issues faced nvcc Implementation Performance Metrics Conclusions.
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
GPU Programming Contest. Contents Target: Clustering with Kmeans How to use toolkit1.0 Towards the fastest program.
Fast and parallel implementation of Image Processing Algorithm using CUDA Technology On GPU Hardware Neha Patil Badrinath Roysam Department of Electrical.
G. Russo, D. Del Prete, S. Pardi Frascati, 2011 april 4th-7th The Naples' testbed for the SuperB computing model: first tests G. Russo, D. Del Prete, S.
Jun Doi IBM Research – Tokyo Early Performance Evaluation of Lattice QCD on POWER+GPU Cluster 17 July 2015.
Parallel Programming Models
Computer Engg, IIT(BHU)
Computer Graphics Graphics Hardware
Introduction to CUDA Li Sung-Chi Taiwan Evolutionary Intelligence Laboratory 2016/12/14 Group Meeting Presentation.
M. Bellato INFN Padova and U. Marconi INFN Bologna
NFV Compute Acceleration APIs and Evaluation
Gwangsun Kim, Jiyun Jeong, John Kim
CS427 Multicore Architecture and Parallel Computing
CS 179 Lecture 14.
Faster File matching using GPGPU’s Deephan Mohan Professor: Dr
Multicore / Multiprocessor Architectures
A Comparison-FREE SORTING ALGORITHM ON CPUs
Computer Graphics Graphics Hardware
6- General Purpose GPU Programming
Presentation transcript:

S. Pardi Frascati, 2012 March GPGPU Evaluation – First experiences in Napoli Silvio Pardi

S. Pardi Frascati, 2012 March Goal of our preliminary tests Achieve know-how on GPGPU architectures in order to test the versatility and investigate the possible adoption for some specific tasks interesting for SuperB. multi-core Technology High speed and complex processing unit General Purpose many-core technology Hundred of simple Processing Units Designed to match the SIMD paradigm (Single Instruction Multiple Data)

S. Pardi Frascati, 2012 March The Hardware Available in Napoli 1U rack NVIDIA Tesla S2050  4 GPU Fermi  Memory for GPU: 3.0 GB  Core for GPU: 448  Processor core clock: 1.15 GHz 2U rack Dell PowerEdge R510  Intel Xeon E5506 eight GHz  32.0 GB DDR3  8 hard disk SATA (7200 rpm), 500 GB

S. Pardi Frascati, 2012 March Cuda compilation tools, release 4.0, V NVIDIA-Linux-x86_ run cudatoolkit_4.0.17_linux_64_rhel5.5.run gpucomputingsdk_4.0.17_linux.run ENVIRONMENT

S. Pardi Frascati, 2012 March Tuning for lazily modality removal The firsts test on the Tesla S2050 GPU with CUDA C show a starting overhead ~2 second due the “context” creation needed by the CUDA toolkit. The context is create on demand (lazily) and de- allocated when is not used. SOLUTIONS (Suggested by Davide Rossetti): 1.Before the CUDA 4, Create a dummy process always active that keeps alive the CUDA context “nvidia-smi -l 60 -f /tmp/bu.log” 2.Since CUDA ver. 4 use the -pm (persistence-mode ) option from the nvidia-smi command to activate the GPU context.

S. Pardi Frascati, 2012 March ➢ GPU Memory Allocation (CudaMalloc) ➢ Data transfer between CPU and GPU (CudaMemCopy H2D) ➢ CUDA Kernel execution ➢ Data transfer between GPU and CPU (CudaMemCopy D2H) IBRID code: combination of standard C sequential code with parallel kernel in CUDA C. Each code needs the following main steps: CUDA C CUDA C

S. Pardi Frascati, 2012 March CUDAMALLOC Benchmark BYTE TIME (ms) 10,139 1KB0, KB0,139 1MB0,147 10MB0, MB24,781 1GB170, GB243,123

S. Pardi Frascati, 2012 March CUDAMEMCPY Benchmark BYTE TEMPO (ms) 10, KB0,145 1MB0,756 10MB4, MB42,283 1GB482, GB1.136, GB1.332,238 Max bandwidth achieved : 16Gbit/s

S. Pardi Frascati, 2012 March Exercise: B-meson reconstruction like algorithm Combinatorial problem Problem Modellization: given N quadrivectors (spatial components and energy), combine all the couple without Repetition. Then calculate the mass of the new particle and check if the mass is in a range given by input. GOAL: Understand the impact, benefits and limits of using the GPGPU architecture for this use case, through the help of a toy- model, in order to isolate part of the computation. The algorithm has been implemented in C CUDA

S. Pardi Frascati, 2012 March Parallel implementation Start Read Input Thread 0: Sum of quadrivector: 0, 1 m 2 =E 2 -px 2 -py 2 -pz 2 Thread 1: Sum of quadrivector: 0, 2 m 2 =E 2 -px 2 -py 2 -pz 2 Thread k: Sum of quadrivector N-2, N-1 m 2 =E 2 -px 2 -py 2 -pz 2 … m0-Δ<m<m0+Δ π 0 found SI m0-Δ<m<m0+Δ π 0 found SI m0-Δ<m<m0+Δ π 0 found SI NO Sequential code(CPU) Parallel code (GPU) CUDA MALLOC GPU CUDA MEMCOPY CPU->GPU MEMCOPY GPU->CPU

S. Pardi Frascati, 2012 March Cuda implementation (1/2) GPU Function Memory Allocation on GPU Data transfer from CPU to GPU Run GPU Kernel GPU Memory Free Data transfer back from GPU to CPU

S. Pardi Frascati, 2012 March Cuda Kernel (2/2) Thread ID Computation Quandrivector Composition

S. Pardi Frascati, 2012 March Testing Algorithm Configuration parameters 512 threads for block. Advanced tuning can improve these resultsParticlesSeqParallel 200,013 ms0,351 ms 1000,159 ms0,457 ms ,211 ms0,621 ms ,742 ms19,699 ms ,217 ms170,23 ms ,341 ms301,756 ms ,988 ms380,989 ms Sequential Parallel

S. Pardi Frascati, 2012 March Some Ideas The first experience suggest to continue the investigation expecially in the following ways: Tuning the memory management: Investigate the possibility to Overlapping Data Transfers and Computation through Async and Stream APIs. Tuning the Grid/Block/Threads topology Consider to rearrange the algorithms in term of operations per threads.

S. Pardi Frascati, 2012 March Conclusion In Napoli we started to test the NVIDIA GPGPU architecture and we are investigating how to port HEP code on these architectures. A first experience using a toy algorithm has show several aspects to take in account in GPU programming: Overhead management Memory management Algorithm re-engineerization Work distribution for each Thread Block and Thread topology definition A lot of work is still due in order to achieve a full understanding of the architecture and the real benefits achievable. There are several work in progress.

S. Pardi Frascati, 2012 March END

S. Pardi Frascati, 2012 March Comparison malloc e cudaMalloc

S. Pardi Frascati, 2012 March Algorithm for delegate a couple of vector to each Thread

S. Pardi Frascati, 2012 March 19-23

S. Pardi Frascati, 2012 March 19-23