Some GPU activities at the CMS experiment
Felice Pantaleo, EP-CMG-CO

Outline
- Physics and Technological Motivations
- Tracking
- HGCAL clustering
- CUDA Translation
- Conclusion

Physics and Technological Motivations

Physics Motivation
- The time needed to process LHC events does not scale linearly with luminosity: event complexity dominates, with roughly combinatorial (factorial) growth
- The line separating trigger electronics and software is becoming thinner, allowing improved triggers and hence reduced rates
- Software development is making continuous big strides

Trends in HEP computing…
- Distributed computing is here to stay
- The ideal of general-purpose computing on x86 + Linux may be close to its end
- It is more effective to specialize:
  - Specialized GPU farms
  - HPC platforms
  - High-efficiency platforms (ARM, Jetson TX1, …)
- Platforms used for different purposes: you lose flexibility but may gain significantly in cost

…and at the embedded frontier
- Heterogeneous HPC platforms seem to represent a good opportunity, not only for analysis and simulation applications but also for more “hardware” jobs
- Fast test and deployment phases
- Possibility to change the trigger on the fly and to run multiple triggers at the same time
- Hardware development is driven by the computer-graphics industry

Tracking

PATATRACK
- A hybrid software demonstrator running on heterogeneous HPC platforms, emulating a GPU-based track trigger together with its data transfers and synchronization
  - Preliminary studies; still a first demonstrator
- Tracker data partitioning
  - Fast simulation on a fast geometry with a uniform magnetic field
  - The information produced by the whole tracker cannot be processed by one GPU in a trigger environment; this is, however, possible at the HLT and reconstruction stages
- Low-latency data transfers between network interfaces and multiple GPUs (GPUDirect)
- The Cellular Automaton executes in-cache for lowest latency
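The Cellular Automaton step can be illustrated with a minimal sketch (plain C++; all names hypothetical, not the actual PATATRACK code): hits on adjacent layers are paired into “cells” (doublets), cells sharing a hit are linked, and repeated state sweeps grow the chains that become track-seed candidates.

```cpp
#include <cstddef>
#include <vector>

// A "cell" is a doublet of hits on adjacent layers (hypothetical, simplified).
struct Cell {
    int innerHit = -1, outerHit = -1;     // hit indices (unused by the sweep itself)
    int state = 0;                        // CA state: length of the chain ending here
    std::vector<int> innerNeighbours;     // cells whose outer hit equals our inner hit
};

// One CA evolution sweep: a cell's state grows if some inner neighbour currently
// has the same state (the two cells extend each other's chain). States are
// updated synchronously, so a single sweep advances every chain by one layer.
bool evolve(std::vector<Cell>& cells) {
    std::vector<int> next(cells.size());
    bool changed = false;
    for (std::size_t i = 0; i < cells.size(); ++i) {
        next[i] = cells[i].state;
        for (int n : cells[i].innerNeighbours)
            if (cells[n].state == cells[i].state) { ++next[i]; changed = true; break; }
    }
    for (std::size_t i = 0; i < cells.size(); ++i) cells[i].state = next[i];
    return changed;
}
```

After (layers − 2) sweeps the cells with the highest state are the outer ends of the longest chains; on a GPU each cell is handled by one thread and the neighbour lists are kept small enough to stay in cache.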

PATATRACK (ctd.)

- System tested on the Wilkes supercomputer at the University of Cambridge
- GPUDirect is very promising: data transmitted between nodes with the lowest latency
- Track reconstruction is highly dependent on the combinatorics
- Ping times are included (t ≈ 3 µs)
- Full-scale tests on Microsoft Azure early access coming soon

CMS – Vectorised Track Building on Xeon Phi
- First version of vectorised and parallelised track building implemented
  - Significant speedup achieved on both Xeon and Xeon Phi
  - 2x from vectorisation
  - 5x on Xeon and 10x on Xeon Phi from parallelisation
  - Ideal scaling indicates a large margin for further improvement
G. Cerati et al.
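The 2x from vectorisation typically hinges on data layout. A hedged sketch (not the actual CMS code; `HitsSoA` and `radii` are hypothetical names): with a structure-of-arrays layout, each coordinate lives in its own contiguous array, so the compiler can auto-vectorise the inner loop instead of gathering fields from an array of hit objects.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Structure-of-Arrays layout for track hits (illustrative only).
struct HitsSoA {
    std::vector<float> x, y, z;
};

// Contiguous per-coordinate arrays make this loop a straightforward
// auto-vectorisation target (unit-stride loads, no pointer chasing).
std::vector<float> radii(const HitsSoA& h) {
    std::vector<float> r(h.x.size());
    for (std::size_t i = 0; i < h.x.size(); ++i)
        r[i] = std::sqrt(h.x[i] * h.x[i] + h.y[i] * h.y[i]);
    return r;
}
```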

Clustering at HGCAL

Clustering at HGCAL
CMS is investigating a silicon-based calorimeter for the forward region of the detector

Clustering at HGCAL (ctd.)

Clustering at HGCAL (ctd.)
- Clustering in conditions of high pile-up becomes challenging, even more so if you want to be ambitious and run it at the HLT stage
- PandoraPFA out of the box takes 1; this can be reduced by factors by using more suitable data structures
- The problem is perfectly suited to running on GPUs, but a rethinking of the data structures is needed
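As an illustration of “more suitable data structures”, a uniform spatial grid turns the neighbour search at the core of clustering from a scan over all hits into a lookup of a few bins. This is a hypothetical 2D simplification of one sensor layer, not the actual HGCAL geometry or PandoraPFA code:

```cpp
#include <cmath>
#include <unordered_map>
#include <vector>

// Uniform 2D grid over one layer; bin size is chosen ~ the clustering radius.
class HitGrid {
public:
    explicit HitGrid(float binSize) : cell(binSize) {}

    void add(float x, float y) {
        xs.push_back(x); ys.push_back(y);
        bins[key(binOf(x), binOf(y))].push_back(static_cast<int>(xs.size()) - 1);
    }

    // Hits within radius r of (x, y); assuming r <= bin size, only the
    // 3x3 block of bins around the query point needs to be scanned.
    std::vector<int> neighbours(float x, float y, float r) const {
        std::vector<int> out;
        const int ix = binOf(x), iy = binOf(y);
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy) {
                auto it = bins.find(key(ix + dx, iy + dy));
                if (it == bins.end()) continue;
                for (int i : it->second)
                    if (std::hypot(xs[i] - x, ys[i] - y) <= r) out.push_back(i);
            }
        return out;
    }

private:
    int binOf(float v) const { return static_cast<int>(std::floor(v / cell)); }
    // Simplistic bin hash; collisions are harmless (only cost extra checks).
    static long long key(int ix, int iy) { return 1000003LL * ix + iy; }

    float cell;
    std::vector<float> xs, ys;
    std::unordered_map<long long, std::vector<int>> bins;
};
```

On a GPU the same idea maps naturally to a histogram of hit indices per bin, which is also friendly to coalesced memory access.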

Translating CUDA

CUDA Translation
- What if somebody wants to run the very same CUDA algorithms on a machine that does not come with a GPU?
- Translate CUDA to TBB using Clang
  - Translate the CUDA program such that the mapping of programming constructs preserves the locality expressed in the programming model, using existing operating-system and hardware features

Mapping of constructs:

  CUDA      C++
  block     std::thread / task (asynchronous)
  thread    sequential unrolled for loop, can be vectorized (synchronous, barriers)

Measurements:

  Used source code      Time (ms)   Slowdown wrt CUDA
  CUDA¹
  Translated TBB²
  Native sequential³
  Native TBB²

L. Atzori
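The mapping above can be sketched on a SAXPY-style kernel (illustrative plain C++, not the actual output of the Clang-based translator): each CUDA thread becomes one iteration of a sequential, vectorisable loop, so intra-block locality is preserved, and the block loop is where a std::thread or TBB task would be spawned.

```cpp
#include <vector>

// CUDA-style SAXPY, translated per the mapping above. Each CUDA thread becomes
// one iteration of a sequential loop; __syncthreads() barriers would collapse
// into plain program order between loop nests.
void saxpyBlock(int blockIdx, int blockDim, float a,
                const std::vector<float>& x, std::vector<float>& y) {
    for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) { // was: one GPU thread
        int i = blockIdx * blockDim + threadIdx;                 // same index math as CUDA
        if (i < static_cast<int>(y.size()))
            y[i] += a * x[i];
    }
}

void saxpyLaunch(int gridDim, int blockDim, float a,
                 const std::vector<float>& x, std::vector<float>& y) {
    // In the real translation each block maps to a std::thread / TBB task and
    // runs asynchronously; a plain loop keeps this sketch dependency-free.
    for (int b = 0; b < gridDim; ++b)
        saxpyBlock(b, blockDim, a, x, y);
}
```

Because blocks are independent in CUDA, running them as tasks (or, as here, sequentially) cannot change the result, and the inner per-thread loop is a natural target for compiler vectorisation.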

Conclusion
- Heterogeneous computing is going to become the standard; outside HEP it already is
  - Better catch the train: there will be no plug-and-accelerate solution
- The current solution consists of throwing more events at the problem
  - Fine for increasing throughput, but it is not enough; we may run out of memory
- HL-LHC luminosity will pose a real challenge for hardware, software engineering, algorithms and parallelism
- A careful design of heterogeneous frameworks needs to:
  - Choose the best device for a job
  - Move the data near the execution
  - Move the execution near the data
- For trigger levels:
  - Best possible code on best possible hardware
  - Translation for legacy hardware