
SixTrack for GPU
R. De Maria

SixTrack Status
- SixTrack: Single Particle Tracking Code [cern.ch/sixtrack].
- 70K lines written in Fortran 77/90 (with a few pre-processing steps).
- Numerically portable across operating systems and compilers.
- Used in the LHC@Home volunteer computing project, with 200k registered users and about 20k CPUs running simultaneously.
- Example of an LHC simulation: 30k particles; 10^7 turns; 20k beam-line elements.
- Code speed: on average 100 ns per particle per beam element, i.e. about 500 turns/(particle·s) for serial code on recent hardware (particles in the LHC make about 11k turns per second).
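As a back-of-the-envelope check (my arithmetic, derived from the figures above), the per-element cost and the LHC example are consistent:

    20k elements/turn × 100 ns/element ≈ 2 ms per particle per turn ≈ 500 turns/(particle·s)
    30k particles × 10^7 turns ÷ 500 turns/(particle·s) ≈ 6×10^8 core-seconds ≈ 19 core-years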

SixTrack GPU Status
- GPU porting is being explored in the context of the volunteer computing project, in order to use volunteer GPUs: heterogeneous hardware and software, hard to test and fully deploy; many low-end GPUs expected (low FP64 FLOPS count).
- D. Mikushin (Applied Parallel Computing LLC) [indico/event/450856] demonstrated a deployment with CUDA + additional compilation stages + code annotations + special compiler software (numerically OK without FMAC instructions; no benchmark available).
- Standalone tracking library (SixTrackLib) to be used with other codes (including SixTrack itself):
  - lightweight code being written in C/OpenCL for flexibility/portability (CERN & GSoC '14-'15);
  - speed-up of 250x w.r.t. a single i7 core with an AMD 280X (~1 TFLOPS FP64, ~300 CHF) in first tests driven by pyopencl;
  - ongoing development: OpenCL port not completed yet; pure Python version benchmarked on the components done for the LHC.
- Hardware requirements for single-particle simulations: high FP64 FLOPS counts; memory bandwidth and memory size less important.

Recent Hardware for FP64

    Name                 TFLOPS (SP/DP)   Mem (GB)   Price
    AMD Radeon …         …/…              …          …CHF/175$
    AMD Radeon 280X      4.1/1.0          3          no stock/220$
    AMD FirePro W8100    4.2/2.1          8, ECC     1100CHF/1000$
    AMD FirePro W9100    5.2/2.6          16, ECC    3200CHF/3000$
    Nvidia Titan Black   4/1.3            6          ~1000$ (not available)
    Nvidia Titan Z       8/2.7            12         ~1500$ (ebay)
    Nvidia Tesla K40     4.2/1.4          12, ECC    4000CHF/3200$
    Nvidia Tesla K80     8.7/2.9          24, ECC    5500CHF/5000$

~10x cost difference for the same (nominal) performance!
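Where the ~10x figure comes from (my arithmetic, from the table above, in price per nominal FP64 TFLOPS):

    AMD Radeon 280X:   220$ / 1.0 TFLOPS ≈  220 $/TFLOPS
    Nvidia Tesla K40: 3200$ / 1.4 TFLOPS ≈ 2300 $/TFLOPS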

SixTrack: Model

Tracking: propagate p particles through m elements for n times.

Single Particle Loop:

    for z in particles
        for n times
            for elem in elements
                f = funset[elem.type]
                z = f(elem, z)

    funset = list of functions
    elements = list of arrays
    particles = array of arrays

Multi Particle Loop:

    for n times
        for elem in elements
            g = funset[elem.type]
            if g is multiparticle
                particles = g(elem, particles)
            elif g is singleparticle_block
                for z in particles
                    for elem in elements
                        f = funset[elem.type]
                        z = f(elem, z)

SixTrackLib: kernel implemented in OpenCL:

    z = particles[thread_id]
    for elem in elements
        f = funset[elem.type]
        z = f(elem, z)
    particles[thread_id] = z
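A minimal OpenCL C sketch of this thread-per-particle pattern, assuming for simplicity that every element is a drift (the Particle struct, function names and physics below are illustrative placeholders, not SixTrackLib's actual code):

    /* Thread-per-particle tracking: each work-item loads one particle,
       applies all elements for all turns, and writes it back once. */
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable  /* FP64 is optional in OpenCL 1.2 */

    typedef struct { double x, px, y, py; } Particle;  /* illustrative coordinates */

    Particle drift(double length, Particle z)
    {
        z.x += length * z.px;  /* simplified drift map */
        z.y += length * z.py;
        return z;
    }

    __kernel void track(__global Particle *particles,
                        __global const double *lengths,  /* one drift length per element */
                        const int n_elements,
                        const int n_turns)
    {
        const size_t tid = get_global_id(0);
        Particle z = particles[tid];           /* private copy: stays in registers */
        for (int n = 0; n < n_turns; ++n)
            for (int i = 0; i < n_elements; ++i)
                z = drift(lengths[i], z);
        particles[tid] = z;                    /* single write-back to global memory */
    }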

SixTrackLib: implementation details

SixTrackLib kernel implemented in OpenCL:

    z = particles[thread_id]
    for elem_id in sequence
        elem = elements[elem_id]
        f = funset[elem.type]
        z = f(elem, z)
    particles[thread_id] = z

- funset = list of functions inlined in the kernel (or in the C particle loop for the serial version) => switch-case statements (no function pointers available in OpenCL 1.2).
- elements = list of arrays (prepared on the CPU and copied once to global GPU memory) => flat array of double/integer unions with structure indices.
- particles = array of arrays (prepared on the CPU and copied once to global GPU memory) => flat array of double/integer unions (fixed row size).
- sequence = array of ints (indices into the elements array).

The code contains other complications (e.g. dynamic element manipulation, recursion and particle loss) not covered here.
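A hedged sketch of the switch-based dispatch over the flat union buffer (the type ids, slot layout and maps below are hypothetical, chosen only to illustrate the technique; the real SixTrackLib layout differs):

    /* Elements packed as 8-byte double/integer union slots; slot 0 of each
       element holds its type id, used by a switch in place of function
       pointers (which OpenCL 1.2 does not provide). */
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    typedef union { double f64; long i64; } slot_t;  /* long is 64-bit in OpenCL C */

    enum elem_type { DRIFT = 1, KICK = 2 };          /* illustrative type ids */

    __kernel void track(__global double *particles,  /* flat array, fixed row size */
                        __global const slot_t *elements,
                        __global const int *sequence, /* indices into elements */
                        const int n_elem,
                        const int row_size)
    {
        const size_t tid = get_global_id(0);
        __global double *z = particles + tid * row_size;  /* this thread's particle row */

        for (int i = 0; i < n_elem; ++i) {
            __global const slot_t *e = elements + sequence[i];
            switch ((int)e[0].i64) {          /* slot 0: element type id */
            case DRIFT:
                z[0] += e[1].f64 * z[1];      /* x  += L * px (simplified) */
                z[2] += e[1].f64 * z[3];      /* y  += L * py */
                break;
            case KICK:
                z[1] += e[1].f64;             /* px += kick strength */
                break;
            /* ... one case per element type */
            }
        }
    }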