Trigger Software Upgrades
John Baines & Tomasz Bold

Introduction

High Level Trigger challenges:
– Faster-than-linear scaling of execution times with luminosity, e.g. tracking
– Some HLT rejection power moved to L1: addition of L1Topo in Phase-I, addition of L1Track in Phase-II
– Need to maintain current levels of rejection (otherwise a problem for offline) => the HLT needs to move closer to offline
In Phase-II the L1 rate increases to 400 kHz => more computing power is needed at the HLT.
But rack space & cooling are limited => need to use computing technologies efficiently.
[Figure: EF ID tracking in a muon RoI]

Technologies

CPU: increased core counts - currently 18 cores (36 threads), e.g. the Xeon E5 v3 series
– Trend to more cores, possibly with less memory per core
– Run-2: one job per thread (AthenaMP saves memory), but this may not be sustainable long-term
=> Develop a new framework supporting concurrent execution
=> Ensure algorithms support concurrent execution (thread-safe or can be cloned) - a minimal sketch follows below
Accelerators:
– Rapid increase in the power of GPGPUs, e.g. Nvidia K40: 2880 cores, 12 GB memory
– Increased power & ease of programming of FPGAs
=> Need to monitor & evaluate key technologies
=> Ensure ATLAS code doesn't preclude the use of accelerators
=> Integrate accelerator support into the framework, e.g. an OffLoadSvc
=> Ensure the EDM doesn't impose big overheads => flattening of the EDM (xAOD helps)
Software tools:
– New compilers & language standards, e.g. support for multi-threading, accelerators etc.
– Faster libraries (also: existing libraries becoming unsupported)
– New code-optimisation and profiling tools
=> Assess new tools
=> Recommendations, documentation, core help for migration
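To make the thread-safety point concrete, here is a minimal standalone sketch. It deliberately does not use the real Gaudi/Athena interfaces: ReentrantCalibAlg, StatefulSmoothingAlg and EventContext are hypothetical names for illustration only. It contrasts (a) a re-entrant algorithm whose const execute() can be shared by all threads with (b) a stateful algorithm that must instead be cloned once per worker.

```cpp
// Illustrative only: hypothetical algorithm classes (not the real Gaudi/Athena API)
// showing the two patterns mentioned above: (a) a re-entrant, thread-safe algorithm
// whose execute() is const and keeps all event data on the stack, and (b) a
// stateful algorithm that is cloned once per worker thread instead of being shared.

#include <atomic>
#include <cmath>
#include <iostream>
#include <memory>
#include <thread>

struct EventContext { int eventNumber; double rawEnergy; };

// (a) Re-entrant algorithm: no mutable state touched in execute(), so one
//     instance can safely process many events concurrently.
class ReentrantCalibAlg {
public:
  explicit ReentrantCalibAlg(double scale) : m_scale(scale) {}
  double execute(const EventContext& ctx) const {
    return m_scale * std::sqrt(ctx.rawEnergy);   // pure function of the event
  }
private:
  const double m_scale;                          // configuration only, set once
};

// (b) Non-re-entrant algorithm with per-event scratch state: clone one copy
//     per thread instead of sharing a single instance.
class StatefulSmoothingAlg {
public:
  double execute(const EventContext& ctx) {
    m_lastValue = 0.5 * (m_lastValue + ctx.rawEnergy);  // mutable state!
    return m_lastValue;
  }
  std::unique_ptr<StatefulSmoothingAlg> clone() const {
    return std::make_unique<StatefulSmoothingAlg>(*this);
  }
private:
  double m_lastValue = 0.0;
};

int main() {
  ReentrantCalibAlg calib(1.2);
  std::atomic<int> processed{0};

  auto worker = [&](int firstEvt) {
    auto mySmoother = StatefulSmoothingAlg().clone();   // private clone per thread
    for (int i = 0; i < 4; ++i) {
      EventContext ctx{firstEvt + i, 100.0 + i};
      double e = calib.execute(ctx);                    // shared, re-entrant
      double s = mySmoother->execute(ctx);              // thread-local clone
      (void)e; (void)s;
      ++processed;
    }
  };

  std::thread t1(worker, 0), t2(worker, 100);
  t1.join(); t2.join();
  std::cout << "processed " << processed << " events\n";
  return 0;
}
```

In a real framework the scheduler, not user code, would own the threads and decide when to clone; the point is only that algorithms written in style (a) or (b) can be run concurrently, while algorithms with hidden shared mutable state cannot.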

Concurrent Framework

[Figure: L1 Muon RoI]

Some key differences online c.f. offline:
We don't reconstruct the whole event:
– Because we run at a 100 kHz input rate we can only afford ~250 ms/event (for a 25k-core farm)
– The trigger rejects 99 in 100 events
=> Use Regions of Interest
=> A chain terminates when its selection fails (a toy sketch of this RoI-driven early termination follows below)
Error handling:
– Algorithm errors force routing of events to the debug stream
Configuration: from the database, rather than python (=> reproducible)
– 3 integers specify: menu & algorithm parameters, L1 prescales, HLT prescales
=> Need additional framework functionality
– Run-1&2: provided by trigger-specific additions to the framework: HLT Steering & HLT Navigation
– Run-3 goal: this functionality is provided by the common framework
Key questions: How do we implement Event Views? What extra Scheduler functionality is required?
=> Address through requirements capture (FFReq) and prototyping (see Ben's talk)
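The RoI and early-termination behaviour is easiest to see in a toy example. The sketch below is purely illustrative: RoI, Event, Step and runChain are hypothetical names, not the ATLAS HLT Steering API. Only the RoI seeded by L1 is processed, and the chain stops at the first step whose selection fails, so later (more expensive) steps never run for rejected events.

```cpp
// Toy illustration of RoI-seeded reconstruction and a selection chain that
// terminates as soon as one step rejects the event (hypothetical types only).

#include <functional>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct RoI { double eta, phi, etaHalfWidth, phiHalfWidth; };

struct Event {
  int number;
  std::vector<RoI> l1MuonRoIs;   // seeds provided by Level-1
};

// A "step" runs some reconstruction inside the RoI and returns pass/fail.
using Step = std::function<bool(const Event&, const RoI&)>;

bool runChain(const Event& evt, const RoI& roi,
              const std::vector<std::pair<std::string, Step>>& steps) {
  for (const auto& [name, step] : steps) {
    if (!step(evt, roi)) {
      std::cout << "event " << evt.number << ": chain stopped at '" << name
                << "' (early rejection, later steps never run)\n";
      return false;                       // chain terminates when selection fails
    }
  }
  return true;                            // all hypotheses confirmed -> accept
}

int main() {
  std::vector<std::pair<std::string, Step>> muonChain = {
    {"fastTracking",   [](const Event&, const RoI& r) { return r.etaHalfWidth <= 0.2; }},
    {"muonConfirm",    [](const Event& e, const RoI&) { return e.number % 100 == 0; }},
    {"precisionTrack", [](const Event&, const RoI&)   { return true; }},
  };

  Event evt{42, {{1.1, 0.3, 0.1, 0.1}}};          // only this RoI is reconstructed,
  for (const RoI& roi : evt.l1MuonRoIs)           // not the whole event
    runChain(evt, roi, muonChain);
  return 0;
}
```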

HLT Farm

What will the HLT farm look like in 2020? In 2025?
– When & how do we narrow the technology options? The choice affects software design as well as farm infrastructure
– How do we evaluate the costs and benefits of different technologies?
Key criteria:
– Cost: financial, effort
– Benefit: throughput per rack (events/s)
– Constraints: cooling per rack, network…
Important questions for assessing GPU technology:
– Are GPUs useful? What is the cost? What is the benefit?
– What is the optimum balance of GPU to CPU?
– What fraction of code (by CPU time) could realistically be ported to GPU?
– What fraction of code must be ported to make GPUs cost-effective?
– What is the overhead imposed by the EDM? How could it be reduced? See Dmitry's talk at FFReq on a possible GPU-friendly IdentifiableContainer
=> Aim to get some answers through a Trigger Demonstrator: see Dmitry's talk

GPGPU

A toy cost-benefit analysis has been made based on today's technology. It is done to illustrate the process - there is not enough information to draw any firm conclusions.

Assume 50 HLT racks: max. power 12 kW per rack; usable space 47 U per rack.
Compare a) CPU-only and b) CPU+GPU systems, where each rack has:
a) 10 x (2U with 4 motherboards, 8 CPU): 80 CPU; 11 kW; ~40 TFLOPS
b) 16 x (Supermicro 1027GR-TR2 servers): 32 CPU; 32 GPU; ~12 kW
CPU: Intel E5-2697 v2: 12 cores, ~0.5 TFLOPS
GPU: Nvidia K20: 2496 cores, 13 SMX, 3.5 (1.1) TFLOPS for SP (DP)
Assuming fixed cost and fixed power per rack:
=> win with the CPU+GPU solution when the throughput per CPU is increased by more than a factor ~2.5
=> i.e. ~65% of the work (by CPU time) transferred to GPU, the exact fraction depending on the speed-up of GPU code relative to CPU code, t(CPU)/t(GPU) (see the back-of-envelope model below)
Need to redo this using the results of the demonstrator.
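The relation between the required throughput factor and the offloaded fraction is not spelt out on the slide. The following is a back-of-envelope model, under the assumption (also used in the plot in the additional material) that the CPU sits idle while the GPU runs; it is broadly consistent with the factor ~2.5 and the ~65% quoted above.

```latex
% Assumed serial-offload model: a fraction f of the per-event CPU time is offloaded
% with speed-up s = t(CPU)/t(GPU), and the CPU waits for the GPU to finish.
\[
  G(f,s) \;=\; \frac{1}{(1-f) + f/s}
  \qquad\Longrightarrow\qquad
  G \ge R \;\Leftrightarrow\; f \;\ge\; \frac{1 - 1/R}{1 - 1/s}.
\]
% For a break-even factor R = 2.5 and a speed-up s of order 10 or more this gives
% f of roughly 0.60-0.65, i.e. about the "~65% of work transferred to GPU" above.
```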

Timescales: Framework, Steering & New Technologies

[Timeline chart running from 2014 (during LS1) through to Run 3; the recoverable content is summarised below.]
Framework: requirements capture complete -> design & prototype -> implement core functionality -> extend to full functionality -> commissioning. Milestones: initial framework & HLT components available; framework core functionality complete (incl. HLT components & new-tech support); complete framework & algorithms.
New Tech.: evaluate -> implement infrastructure -> exploit new technologies in algorithms. Milestones: narrow h/w choices (e.g. use GPU or not); fix PC architecture.
Algs & Menus: speed up code, thread-safety, investigate possibilities for internal parallelisation -> implement algorithms in the new framework (prototype with 1 or 2 chains) -> simple menu -> full menu complete -> HLT software commissioning complete -> final software complete.

Summary

For Run 3 we need:
– A framework supporting concurrent execution of algorithms
– To make efficient use of computing technology (h/w & s/w)
Work has started:
– FFReq & framework demonstrators
– GPU demonstrator
Success requires significant developments in core software, reconstruction and the EDM:
– Algorithms must support concurrent execution (thread-safe or able to be cloned)
– The EDM must become data-oriented
– Maintain and increase the gains made via code optimisation
It is vital for Trigger & Offline to work together on common solutions.

Additional Material

Top CPU Consumers in Run-1

Combined HLT: 60% tracking, 20% Calo, 10% Muon, 10% other.

GPUs

Example of a complete L2 ID chain implemented on GPU (Dmitry Emeliyanov).
[Table (individual timings not recoverable): time in ms for a Tau RoI of 0.6x0.6 in ttbar events at 2x10^34, C++ on a 2.4 GHz CPU vs CUDA on a Tesla C2050, with the CPU/GPU speed-up per step: Data Prep., Seeding, Seed ext., Triplet merging, Clone removal, CPU-GPU transfer (n/a on CPU, 0.1 ms on GPU), Total.]
[Figure annotations: Data Prep., L2 Tracking, x2.4, x5.]
Max. speed-up: x26. Overall speed-up t(CPU)/t(GPU): 12.

Sharing of GPU resource

With a balanced load on CPU and GPU, several CPU cores can share one GPU, e.g. a test of L2 ID tracking with 8 CPU cores sharing one Tesla C2050 GPU (a toy sketch of this sharing pattern follows below).
[Figure: blue = tracking running on the CPU; red = most tracking steps on the GPU, final ambiguity solving on the CPU; ~x2.4 speed-up.]
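A minimal sketch of this sharing pattern, under stated assumptions: GpuBroker, offload and cpuOnlyStep are hypothetical stand-ins with no real CUDA/OpenCL calls, and the mutex simply models exclusive use of a single device. Each worker thread keeps the CPU-side steps to itself (e.g. data preparation and final ambiguity solving) and funnels only the GPU-friendly step through the shared broker.

```cpp
// Hypothetical sketch (no real CUDA/OpenCL calls) of several CPU worker threads
// sharing one GPU through a single broker, as in the 8-core / one-C2050 test above.

#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

class GpuBroker {                         // stand-in for a real offload service
public:
  // Pretend to run a kernel; the mutex models exclusive use of the device.
  double offload(double workload) {
    std::lock_guard<std::mutex> lock(m_deviceMutex);
    std::this_thread::sleep_for(std::chrono::milliseconds(2));   // "kernel" time
    return workload * 0.5;                // dummy result
  }
private:
  std::mutex m_deviceMutex;
};

void cpuOnlyStep(double& x) {
  x += 1.0;                               // dummy CPU-side work
  std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

int main() {
  GpuBroker gpu;                          // one GPU shared by all workers
  const int nCores = 8;                   // as in the test quoted above
  std::vector<std::thread> workers;

  for (int core = 0; core < nCores; ++core) {
    workers.emplace_back([&gpu, core] {
      double data = core;                 // toy per-event data
      cpuOnlyStep(data);                  // e.g. data preparation on the CPU
      double tracks = gpu.offload(data);  // e.g. track seeding/extension on the GPU
      cpuOnlyStep(tracks);                // e.g. final ambiguity solving on the CPU
      std::cout << "core " << core << " done\n";
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}
```

The balance matters: if the CPU-side steps are too short relative to the offloaded step, the workers queue up behind the device and the sharing gain disappears.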

Power & Cooling

SDX racks: max. power 12 kW; usable space 47 U.
Current power is ~300 W per motherboard => max. 40 motherboards per rack.
Compare 2U units:
a) 4 motherboards, 8 CPU: ~1.1 kW
b) 1 motherboard, 2 CPU with 2 GPU (~750 W) or with 4 GPU (~1.2 kW)
Based on max. power: K20 GPU 225 W c.f. E5-2697 v2 CPU 130 W (typical power still needs to be measured).

Illustrative farm configurations for 50 racks (the arithmetic is reproduced in the sketch below):

Configuration                                   | Total farm nodes | CPU   | Cores (max threads) | GPU (SMX)      | Required throughput per node (per CPU core)
40 nodes per rack, ~300 W/node                  | 2,000            | 4,000 | 48,000 (96,000)     | 0              | 50 Hz (2.1 Hz)
10 nodes per rack, 4 GPU per node, ~1200 W/node | 500              | 1,000 | 12,000 (24,000)     | 2,000 (26,000) | 200 Hz (8.3 Hz)
16 nodes per rack, 2 GPU per node, ~750 W/node  | 800              | 1,600 | 19,200 (38,400)     | 1,600 (20,800) | 125 Hz (5.2 Hz)

(The 2-GPU and 4-GPU options have a factor ~2.5 and ~4 fewer CPU cores, respectively, than the CPU-only option.)
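The table entries follow directly from the inputs quoted on these slides. The short program below, assuming a 100 kHz HLT input rate, 50 racks, 2 CPUs per node, 12 cores per CPU and 13 SMX per K20 (all taken from the slides), reproduces the numbers; it is only a worked check of the arithmetic, not part of any real tool.

```cpp
// Reproduces the farm-configuration arithmetic in the table above.
// Assumptions (from the slides): 50 racks, 100 kHz HLT input rate, 2 CPUs per node,
// 12 cores per CPU (E5-2697 v2), 13 SMX per GPU (K20), 2 hardware threads per core.

#include <cstdio>

struct FarmOption {
  const char* label;
  int nodesPerRack;
  int gpusPerNode;
};

int main() {
  const int    nRacks      = 50;
  const double inputRateHz = 100000.0;   // L1 accept rate into the HLT
  const int    cpusPerNode = 2;
  const int    coresPerCpu = 12;
  const int    smxPerGpu   = 13;

  const FarmOption options[] = {
    {"40 nodes/rack, CPU only  ", 40, 0},
    {"10 nodes/rack, 4 GPU/node", 10, 4},
    {"16 nodes/rack, 2 GPU/node", 16, 2},
  };

  for (const FarmOption& o : options) {
    int nodes   = nRacks * o.nodesPerRack;
    int cpus    = nodes * cpusPerNode;
    int cores   = cpus * coresPerCpu;
    int threads = 2 * cores;
    int gpus    = nodes * o.gpusPerNode;
    int smx     = gpus * smxPerGpu;
    double hzPerNode = inputRateHz / nodes;
    double hzPerCore = inputRateHz / cores;
    std::printf("%s: %5d nodes, %5d CPU, %6d cores (%6d threads), %5d GPU (%6d SMX), "
                "%6.1f Hz/node (%4.1f Hz/core)\n",
                o.label, nodes, cpus, cores, threads, gpus, smx, hzPerNode, hzPerCore);
  }
  return 0;
}
```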

Packaging

Chassis: Supermicro 1027GR-TR2 (1U) or 2027GR-TR2 (2U)
– 1U: 2x E5-2600 or E5-2600 v2 CPU, 3x GPU
– 2U: 2x E5-2600 or E5-2600 v2 CPU, 4x GPU
CPU: Intel E5-2697 v2, 12 cores, ~0.5 TFLOPS, ~2.3k CHF
GPU: Nvidia K20, 2496 cores, 13 SMX (192 cores per SMX), 3.5 (1.1) TFLOPS for SP (DP), ~2.4k CHF
Examples:
– Total for a 1027 or 2027 with 2 K20 GPU: ~15k CHF => 12 CPU cores per GPU
– Total for a 2027 with 4 K20 GPU: ~20k CHF => 6 CPU cores per GPU

Summary

The current limiting factor is cooling: 12 kW/rack.
=> Adding GPUs means removing CPUs
=> For fixed cooling there would be a factor 2.5 (4) less CPU when adding 2 (4) GPUs per node
The financial cost is ~25% (70%) more per 2U with 2 CPU and 2 (4) GPU than for a 2U with 8 CPU.
=> For fixed cooling and fixed cost there would be a factor 5-7 less CPU
=> Win with the CPU+GPU solution when the throughput per CPU is increased by more than a factor 5-7
=> i.e. 80-85% of the work (by CPU time) transferred to GPU (a worked check follows below)
Whether we need 1 or 2 GPUs per CPU depends on the relative CPU & GPU load.
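A worked check of the last step, using the same simple serial-offload model as in the throughput plot below (the CPU waits while the GPU runs; the offloaded fraction is defined in terms of CPU execution time). The break-even factors R are from this slide; the speed-up values s are illustrative choices, not measurements. For R = 5-7 and large speed-ups it gives roughly 80-90%, in the same ballpark as the 80-85% quoted above.

```cpp
// Minimum offload fraction needed to gain a factor R in throughput per CPU, in the
// serial-wait model: gain G = 1 / ((1-f) + f/s), so G >= R  <=>  f >= (1-1/R)/(1-1/s).
// R values are taken from the slides; s values are illustrative only.
// Results above 100% mean the target gain is not achievable at that speed-up.

#include <cstdio>

double minOffloadFraction(double R, double s) {
  return (1.0 - 1.0 / R) / (1.0 - 1.0 / s);
}

int main() {
  const double speedups[]  = {5.0, 12.0, 26.0};  // illustrative t(CPU)/t(GPU) values
  const double breakEven[] = {2.5, 5.0, 7.0};    // required throughput factors (slides)

  for (double R : breakEven) {
    std::printf("R = %.1f:", R);
    for (double s : speedups)
      std::printf("  f >= %3.0f%% (s = %2.0f)", 100.0 * minOffloadFraction(R, s), s);
    std::printf("\n");
  }
  return 0;
}
```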

Increase in Throughput per CPU when GPU added

[Plot: throughput gain per CPU vs the speed-up t(CPU)/t(GPU), assuming the CPU code is serial and waits for GPU completion; the offloaded fraction is defined in terms of execution time on the CPU.]

[Figure: 6 jobs per GPU]
