Future farm technologies & architectures
John Baines

Introduction

What will the HLT farm look like in 2020?
When & how do we narrow the options?
– Choice affects software design as well as farm infrastructure
How do we evaluate costs and benefits?
When & how do we make the final choices for farm purchases?
How do we design software now to ensure we can fully exploit the capability of future farm hardware?
What do we need in the way of demonstrators for specific technologies?
We can't evaluate all options – what should be the priorities?

Timescales: Framework, Steering & New Technologies (draft version for discussion)

[Timeline chart spanning 2014 (Q3, Q4), LS1, commissioning and run periods through to Run 3, with rows for Framework, New Tech. and Algs & Menus. Milestones and activities include: framework requirements capture complete; design of framework & HLT components complete; framework core functionality complete (incl. HLT components & new-tech support); implement core functionality; extend to full functionality; evaluate; implement infrastructure; exploit new tech. in algorithms; narrow hardware choices (e.g. use GPU or not); fix PC architecture; speed up code, thread-safety, investigate possibilities for internal parallelisation; implement algorithms in the new framework; prototype with 1 or 2 chains; simple menu; full menu complete; HLT software commissioning complete; final software complete (framework & algos).]

Technologies

CPU: increased core counts
– currently 12 cores (24 threads), e.g. Xeon E5-2600 v2 series, ~0.5 TFLOPS
– 18 cores (36 threads) coming soon (Xeon E5-2600 v3 series)
– possible trend to many cores with less memory per core => cannot continue to run one job per core (see the sketch below)

GPU: much bigger core count, e.g. Nvidia K40: 15 SMX, 2880 cores, 12 GB memory, 4.3 (1.4) TFLOPS SP (DP)

Coprocessor: e.g. Intel Xeon Phi, up to 61 cores, 244 threads, 1.2 TFLOPS
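
To illustrate the last point: the natural response to many cores with less memory per core is one multi-threaded process per node that shares large read-only data (geometry, conditions, calibrations) between worker threads, rather than one full job per core. The sketch below only illustrates that pattern; Event, ConditionsData and process_event are invented placeholders, not framework classes.

    // Sketch only: N worker threads in one process pull events from a shared
    // counter and share a single copy of the conditions data, instead of
    // running one full job (with its own copy of that data) per core.
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    struct Event { int id; };
    struct ConditionsData { /* geometry, calibrations, ... shared by all threads */ };

    static void process_event(const Event& ev, const ConditionsData& cond) {
        (void)cond;                                  // placeholder for reconstruction / selection
        std::printf("processed event %d\n", ev.id);
    }

    int main() {
        const int n_events  = 1000;
        const int n_workers = static_cast<int>(std::thread::hardware_concurrency());

        ConditionsData cond;                         // one copy for the whole node
        std::atomic<int> next{0};                    // shared event counter

        std::vector<std::thread> workers;
        for (int w = 0; w < n_workers; ++w) {
            workers.emplace_back([&] {
                for (int i = next.fetch_add(1); i < n_events; i = next.fetch_add(1)) {
                    process_event(Event{i}, cond);
                }
            });
        }
        for (auto& t : workers) t.join();
        return 0;
    }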

GPU: towards a cost-benefit analysis

Will need to assess:
– Effort needed to port code to the GPU, to maintain it (bug fixing, new hardware, ...) and to support MC simulation on the Grid
– Speed-up for individual components & the full chain
– What can be offloaded to the GPU and what must be done on the CPU
– Integration with Athena (APE)
– Balance of CPU cores to GPUs, i.e. sharing of a GPU resource between several jobs
– Farm integration issues: packaging, power consumption, ...
– Financial cost: hardware, installation, commissioning, maintenance, ...

As an exercise, see what we can learn from studies to date, i.e. the cost-benefit if we were to purchase today.

Demonstrators

ID (RAL, Edinburgh, Oxford):
– Complete L2 ID chain ported to CUDA for NVIDIA GPUs
– ID data preparation (bytestream conversion, clustering, space-point formation) additionally ported to OpenCL
Muon (CERN, Rome):
– Muon calorimeter isolation implemented in CUDA
Jet (Lisbon):
– Just starting

See Twiki: TriggerSoftwareUpgrade

Effort needed to port code? Porting L2 ID tracking to CUDA: ~2 x 0.5 FTE => 1 staff-year (for a very experienced expert!)
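
For a flavour of what porting a data-preparation step to the GPU involves, the sketch below shows the general pattern (one thread per cluster, local-to-global coordinate transform into space-points) rather than the actual demonstrator code; Cluster, ModuleTransform, SpacePoint and formSpacePoints are invented names.

    // Illustrative sketch only (not the ATLAS demonstrator code): one thread per
    // cluster, transforming local cluster coordinates into global space-point
    // coordinates using a per-module transform.
    #include <cuda_runtime.h>

    struct Cluster         { int module; float locX, locY; };
    struct ModuleTransform { float r[9]; float t[3]; };     // rotation + translation
    struct SpacePoint      { float x, y, z; };

    __global__ void formSpacePoints(const Cluster* clusters,
                                    const ModuleTransform* transforms,
                                    SpacePoint* points, int nClusters)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nClusters) return;

        const Cluster&         c = clusters[i];
        const ModuleTransform& m = transforms[c.module];

        // local (locX, locY, 0) -> global, independently for every cluster
        points[i].x = m.r[0]*c.locX + m.r[1]*c.locY + m.t[0];
        points[i].y = m.r[3]*c.locX + m.r[4]*c.locY + m.t[1];
        points[i].z = m.r[6]*c.locX + m.r[7]*c.locY + m.t[2];
    }

    // Host side: copy clusters/transforms to the GPU, launch, copy points back.
    void runSpacePointFormation(const Cluster* h_clusters, int nClusters,
                                const ModuleTransform* h_transforms, int nModules,
                                SpacePoint* h_points)
    {
        Cluster*         d_clusters;   cudaMalloc(&d_clusters,   nClusters * sizeof(Cluster));
        ModuleTransform* d_transforms; cudaMalloc(&d_transforms, nModules  * sizeof(ModuleTransform));
        SpacePoint*      d_points;     cudaMalloc(&d_points,     nClusters * sizeof(SpacePoint));

        cudaMemcpy(d_clusters,   h_clusters,   nClusters * sizeof(Cluster),         cudaMemcpyHostToDevice);
        cudaMemcpy(d_transforms, h_transforms, nModules  * sizeof(ModuleTransform), cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks  = (nClusters + threads - 1) / threads;
        formSpacePoints<<<blocks, threads>>>(d_clusters, d_transforms, d_points, nClusters);

        cudaMemcpy(h_points, d_points, nClusters * sizeof(SpacePoint), cudaMemcpyDeviceToHost);
        cudaFree(d_clusters); cudaFree(d_transforms); cudaFree(d_points);
    }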

GPUs

Example of the complete L2 ID chain implemented on GPU (Dmitry Emeliyanov): time in ms per Tau RoI (0.6 x 0.6), ttbar events at 2 x 10^34, comparing C++ on a 2.4 GHz CPU with CUDA on a Tesla C2050.

[Table: time per step and CPU/GPU speed-up for data preparation, seeding, seed extension, triplet merging, clone removal, CPU->GPU transfer (n/a on CPU, 0.1 ms on GPU) and total.]
[Charts: Data Prep. and L2 Tracking, labelled x2.4 and x5.]

Max. speed-up: x26
Overall speed-up t(CPU)/t(GPU): 12
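
The difference between the best per-step speed-up (x26) and the overall speed-up (x12) is simply how the component times combine once the CPU-to-GPU transfer is included; schematically,

    S_{\text{overall}} \;=\; \frac{\sum_i t_i^{\mathrm{CPU}}}{t_{\mathrm{xfer}} + \sum_i t_i^{\mathrm{GPU}}}
    \;\le\; \max_i \frac{t_i^{\mathrm{CPU}}}{t_i^{\mathrm{GPU}}}

so the least-accelerated steps and the transfer overhead pull the overall figure below the best per-step gain.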

Sharing of GPU resource

With a balanced load on CPU and GPU, several CPU cores can share a GPU, e.g. a test with L2 ID tracking with 8 CPU cores sharing one Tesla C2050 GPU.

[Plot: Blue: tracking running on the CPU; Red: most tracking steps on the GPU, final ambiguity solving on the CPU; x2.4]
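
One way to realise this sharing (a sketch only; in Athena the offloading is expected to go through APE, as noted earlier) is to let each CPU worker submit its GPU work on its own CUDA stream on the shared device:

    // Sketch only: several host worker threads sharing one GPU, each with its
    // own CUDA stream so that work from different workers can overlap.
    // dummyWork stands in for a real tracking kernel.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <thread>
    #include <vector>

    __global__ void dummyWork(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // stand-in for real work
    }

    static void worker(int id, int nElements)
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        float* d_data;
        cudaMalloc(&d_data, nElements * sizeof(float));
        cudaMemsetAsync(d_data, 0, nElements * sizeof(float), stream);

        int threads = 256;
        int blocks  = (nElements + threads - 1) / threads;
        dummyWork<<<blocks, threads, 0, stream>>>(d_data, nElements);

        cudaStreamSynchronize(stream);   // each worker waits only for its own stream
        std::printf("worker %d done\n", id);

        cudaFree(d_data);
        cudaStreamDestroy(stream);
    }

    int main()
    {
        const int nWorkers = 8;          // cf. 8 CPU cores sharing one C2050 in the test
        std::vector<std::thread> pool;
        for (int w = 0; w < nWorkers; ++w)
            pool.emplace_back(worker, w, 1 << 20);
        for (auto& t : pool) t.join();
        return 0;
    }

How many workers one GPU can absorb then depends on how well the CPU and GPU parts of each job are balanced, which is exactly what the 8-cores-per-C2050 test probes.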

Packaging

1U: 2x E5-2600 or E5-2600 v2 + 3x GPU
2U: 2x E5-2600 or E5-2600 v2 + 4x GPU

CPU: Intel E5-2697 v2, 12 cores, ~0.5 TFLOPS, ~2.3k CHF
GPU: Nvidia K20, 2496 cores, 13 SMX, 192 cores per SMX, 3.5 (1.1) TFLOPS SP (DP), ~2.4k CHF

Examples:
Total for a 1027 or 2027 with 2 K20 GPUs: ~15k CHF => 12 CPU cores/GPU
Total for a 2027 with 4 K20 GPUs: ~20k CHF => 6 CPU cores/GPU
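
As a cross-check, the quoted per-node totals are roughly the sum of the component prices; for the 4-GPU 2U box, for example,

    2 \times 2.3\,\mathrm{kCHF}\ (\text{CPU}) + 4 \times 2.4\,\mathrm{kCHF}\ (\text{GPU}) \approx 14.2\,\mathrm{kCHF}

leaving the remainder of the ~20 kCHF for chassis, memory and disks, and giving 2 x 12 / 4 = 6 CPU cores per GPU.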

Power & Cooling

SDX racks: max. power 12 kW, usable space 47 U.
Current power ~300 W per motherboard => max. 40 motherboards per rack.
Compare a 2U unit:
a) 4 motherboards, 8 CPUs: 1.1 kW
b) 1 motherboard, 2 CPUs with 2 GPUs (750 W) or 4 GPUs (1.2 kW)
Based on max. power: K20 GPU 225 W c.f. E5-2697 v2 CPU 130 W (need to measure typical power).

Illustrative farm configurations (50 racks):

40 nodes per rack, ~300 W/node:
– 2,000 nodes, 4,000 CPUs, 48,000 cores (96,000 threads), 0 GPUs
– required throughput per node: 50 Hz (2.1 Hz per CPU core)
10 nodes per rack, 4 GPUs per node, ~1200 W/node:
– 500 nodes, 1,000 CPUs, 12,000 cores (24,000 threads), 2,000 GPUs (26,000 SMX)
– required throughput per node: 200 Hz (8.3 Hz per CPU core), i.e. x4 the CPU-only case
16 nodes per rack, 2 GPUs per node, ~750 W/node:
– 800 nodes, 1,600 CPUs, 19,200 cores (38,400 threads), 1,600 GPUs (20,800 SMX)
– required throughput per node: 125 Hz (5.2 Hz per CPU core), i.e. x2.5 the CPU-only case
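
All three configurations correspond to the same total farm throughput, since 2,000 x 50 Hz = 500 x 200 Hz = 800 x 125 Hz = 100 kHz; the per-node and per-core requirements then follow directly:

    \text{rate per node} = \frac{100\,\mathrm{kHz}}{N_{\mathrm{nodes}}},\qquad
    \text{rate per core} = \frac{\text{rate per node}}{24\ \text{cores/node}}
    \quad (\text{e.g. } 200\,\mathrm{Hz}/24 \approx 8.3\,\mathrm{Hz})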

Summary

The current limiting factor is cooling (12 kW/rack) => adding GPUs means removing CPUs.
A factor fewer CPUs requires a corresponding factor increase in throughput per CPU.
Financial cost per motherboard (2U box with 8 CPUs versus 2 CPUs + 4 GPUs): the CPU+GPU node is a factor ~2 more expensive
=> the CPU+GPU solution wins when the throughput per CPU is increased by more than a factor 8
=> ~90% of the work (by CPU time) transferred to the GPU.
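
The 90% figure is just the break-even condition if one assumes, as in the throughput model later in this deck, that the CPU waits for the GPU and the GPU time is small compared with the remaining CPU work: an eight-fold gain in throughput per CPU requires the residual CPU work to shrink to one eighth,

    1 - f = \tfrac{1}{8} \;\Rightarrow\; f = 0.875 \approx 90\%\ \text{of the CPU time moved to the GPU}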

Discussion

Benefits:
– If we can manage to port the bulk of the time-consuming code to the GPU, the benefit is potentially much better scaling with pile-up, i.e.:
no combinatorial code left on the CPU => CPU execution times will scale slowly with pile-up;
code on the GPU is parallel and will also scale slowly with pile-up.

Costs:
– Significant effort needed to port code
– Need to support different GPU generations with rolling replacements
– Potential divergence from offline
– Need to support a CPU version of the code for simulation
– Possibly more expensive than a CPU-only farm

=> A CPU+GPU solution is attractive IF a CPU-based farm cannot provide enough processing power.
=> However, it currently looks like a CPU-only farm is the least-cost solution.
Discuss!

CPU

Coming: Xeon E5 v3, 18 cores and 36 threads, 3,960 EUR ($5,…)

GPU

[Table: list prices in US $ for K40, K20X, K20, M2090 and C2050 GPUs.]

Increase in throughput per CPU when a GPU is added

[Plot: increase in throughput per CPU versus speed-up t(CPU)/t(GPU), for different fractions of work offloaded.]
Assumptions: the CPU code is serial and waits for GPU completion; the offloaded fraction is defined in terms of execution time on the CPU.
If the CPU count is reduced by a factor 4, a factor 4 increase in throughput per CPU is needed to break even, i.e. 75% of the work moved to the GPU.
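
Under these assumptions the throughput gain per CPU core, for an offloaded fraction f (measured in CPU time) and a GPU speed-up s = t(CPU)/t(GPU), is

    G(f, s) = \frac{1}{(1-f) + f/s}, \qquad G(0.75,\ s \to \infty) = \frac{1}{0.25} = 4

which reproduces the 75% / factor-4 break-even example quoted above.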

Speed-up factors

HLT: 60% tracking, 20% calo, 10% muon, 10% other.
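
Treating these fractions with the same model (anything not offloaded stays serial on the CPU) gives a quick Amdahl-style bound on what offloading can buy:

    \text{tracking only } (f=0.6):\ G_{\max} = \frac{1}{1-0.6} = 2.5, \qquad
    \text{tracking + calo } (f=0.8):\ G_{\max} = \frac{1}{1-0.8} = 5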

Cost of GPU - cost of CPU
Cost of effort for the online version
Cost for simulation

[Timing diagram: CPU#1 (12 CPU cores, 12/24 CPU threads) driving GPU#1 and GPU#2 (each 15 SMX, 2880 cores); labelled intervals of 120 ms, 240 ms, 360 ms, 10 ms, 240 ms and 250 ms; 69% CPU; throughput x0.69.]

6 jobs per GPU

[Data Preparation Code]

Power & Cooling

SDX racks:
Upper level: 27? XPU racks
– each 47 U usable; 9.5 kW
– 1U servers, 31, 32 or 40 per rack (=> max 300 W per 1U)
– current power consumption 6-9 kW per rack
Lower level: partially equipped with 10 racks (+6 pre-series racks)
– each 47 U (could be 52 U with additional reinforcing); 15 kW
– 2U 4-blade servers, 1100 W, 8 or 10 per rack (9-11 kW)

GPU: C2050 < 238 W; K20 < 225 W; K40 < 235 W, c.f. CPU: 130 W (12 cores, 2.7 GHz)
=> GPU has ~80% higher max. power consumption than the CPU.
=> Adding 1 GPU roughly doubles the power consumption of a node.

50 racks:
40 nodes per rack, ~300 W/node: 2,000 nodes (motherboards), 4,000 CPUs, 48,000 cores (96,000 threads), 0 GPUs; throughput per node 50 Hz (2.08 Hz per CPU core)
10 nodes per rack, 4 GPUs per node, ~1200 W/node: 500 nodes, 1,000 CPUs, 12,000 cores (24,000 threads), 2,000 GPUs (30,000 SMX); 200 Hz (8.33 Hz per CPU core)
15 nodes per rack, 2 GPUs per node, ~800 W/node: 750 nodes, 1,500 CPUs, 18,000 cores (36,000 threads), 1,500 GPUs (22,500 SMX); 133 Hz (5.55 Hz per CPU core)

Packaging

1U: 2x E5-2600 or E5-2600 v2 + 3x GPU
– e.g. 2 x 12 = 24 CPU cores, 3 GPUs => 8 CPU cores/GPU
2U: 2x E5-2600 or E5-2600 v2 + 4x GPU
– e.g. 2 x 12 = 24 CPU cores, 4 GPUs => 6 CPU cores/GPU

GPU: e.g. K40: 2880 cores, 15 SMX, 192 cores per SMX, 4.3 (1.4) TFLOPS SP (DP): $4,400