1 Trigger Software Upgrades, John Baines & Tomasz Bold

2 Introduction
High Level Trigger challenges:
– Faster than linear scaling of execution times with luminosity, e.g. tracking
– Some HLT rejection power moved to L1: addition of L1Topo in Phase-I, addition of L1Track in Phase-II
– Need to maintain current levels of rejection (otherwise a problem for offline) => HLT needs to move closer to offline
In Phase-II the L1 rate increases to 400 kHz => more computing power needed at the HLT
But limited rack space & cooling => need to use computing technologies efficiently
(Plot: EF ID tracking in a muon RoI)

3 Technologies
CPU: increasing core counts, currently 18 cores (36 threads), e.g. Xeon E5-2600 v3 series
– Trend to more cores, possibly with less memory per core
– Run-2: one job per thread (AthenaMP saves memory), but this may not be sustainable long-term
=> Develop a new framework supporting concurrent execution
=> Ensure algorithms support concurrent execution (thread-safe or can be cloned; see the sketch below)
Accelerators:
– Rapid increase in power of GPGPUs, e.g. Nvidia K40: 2880 cores, 12 GB memory
– Increased power & ease of programming of FPGAs
=> Need to monitor & evaluate key technologies
=> Ensure ATLAS code doesn't preclude use of accelerators
=> Integrate accelerator support into the framework, e.g. OffLoadSvc
=> Ensure the EDM doesn't impose big overheads => flattening of the EDM (xAOD helps)
Software tools:
– New compilers & language standards, e.g. support for multi-threading, accelerators etc.
– Faster libraries (also existing libraries becoming unsupported)
– New code optimisation and profiling tools
=> Assess new tools
=> Recommendations, documentation, core help for migration
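To illustrate what "thread-safe or can be cloned" means in practice, here is a minimal C++ sketch. It is not the actual Athena/Gaudi interface; the algorithm name and event type are invented for illustration. The algorithm holds only immutable configuration and its execute() is const, so one instance can serve several events concurrently, or the framework could clone it per thread.

```cpp
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

struct Event { std::vector<double> hits; };   // hypothetical, highly simplified event data

class TrackSeedingAlg {                       // hypothetical algorithm, not a real Athena class
public:
  explicit TrackSeedingAlg(double ptCut) : ptCut_(ptCut) {}
  // const execute(): no data members are modified, so the same instance can be
  // called from several threads at once (or the framework could clone it).
  int execute(const Event& ev) const {
    int nSeeds = 0;
    for (double pt : ev.hits)
      if (pt > ptCut_) ++nSeeds;
    return nSeeds;
  }
private:
  const double ptCut_;                        // configuration fixed at construction, never mutated
};

int main() {
  const TrackSeedingAlg alg(1.0);
  const std::vector<Event> events{{{0.5, 1.5, 2.0}}, {{3.0, 0.1}}, {{1.2, 1.3, 0.9}}};
  std::vector<int> nSeeds(events.size());
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < events.size(); ++i)        // one concurrent "event slot" per thread
    workers.emplace_back([&, i] { nSeeds[i] = alg.execute(events[i]); });
  for (auto& t : workers) t.join();
  for (int n : nSeeds) std::cout << n << '\n';           // 2, 1, 2
}
```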

4 Concurrent Framework
(Figure: L1 muon RoI)
Some key differences online cf. offline:
Don't reconstruct the whole event:
– Because we run at a 100 kHz input rate => can only afford ~250 ms/event (for a 25k-core farm)
– The trigger rejects 99 in 100 events
=> Use Regions of Interest
=> A chain terminates when its selection fails (see the sketch below)
Error handling:
– Algorithm errors force routing of events to the debug stream
Configuration: from the database, rather than python (=> reproducible)
– 3 integers specify: menu & algorithm parameters, L1 prescales, HLT prescales
=> Need additional framework functionality
– Run-1 & 2: provided by trigger-specific additions to the framework: HLT Steering & HLT Navigation
– Run-3 goal: functionality provided by the common framework
Key questions: How to implement Event Views? What extra Scheduler functionality is required?
=> Address through requirements capture (FFReq) and prototyping (see Ben's talk)
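As an illustration of the RoI-driven, early-terminating chain logic described above, here is a minimal sketch. The types and selection steps are hypothetical, not the real HLT Steering: each step sees only the RoI, and the chain stops at the first step whose selection fails.

```cpp
#include <cmath>
#include <functional>
#include <iostream>
#include <vector>

struct RoI { double eta, phi; };                        // hypothetical Region-of-Interest descriptor

using Step = std::function<bool(const RoI&)>;           // a selection step: true = pass

// Run the steps of one chain on one RoI; stop at the first failing step so a
// rejected event costs as little CPU time as possible.
bool runChain(const std::vector<Step>& steps, const RoI& roi) {
  for (const auto& step : steps)
    if (!step(roi)) return false;                       // chain terminates here on failure
  return true;                                          // all steps passed
}

int main() {
  const std::vector<Step> muonChain = {
    [](const RoI& r) { return std::abs(r.eta) < 2.5; }, // fast confirmation step
    [](const RoI& r) { return r.phi > 0.0; }            // stand-in for a precision step
  };
  std::cout << runChain(muonChain, {1.2, 0.7}) << '\n'; // 1: both steps pass
  std::cout << runChain(muonChain, {3.0, 0.7}) << '\n'; // 0: fails step 1, step 2 never runs
}
```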

5 HLT Farm
What will the HLT farm look like in 2020? In 2025?
– When & how do we narrow the technology options? The choice affects software design as well as farm infrastructure
– How do we evaluate the cost/benefit of different technologies?
Key criteria:
– Cost: financial, effort
– Benefit: throughput per rack (events/s)
– Constraints: cooling per rack, network, ...
e.g. important questions for assessing GPU technology:
– Are GPUs useful? What is the cost? What is the benefit?
– What is the optimum balance of GPU to CPU?
– What fraction of code (by CPU time) could realistically be ported to GPU?
– What fraction of code must be ported to make GPUs cost-effective?
– What is the overhead imposed by the EDM? How could it be reduced? See Dmitry's talk at FFReq on a possible GPU-friendly IdentifiableContainer (a sketch of a flattened layout follows below)
=> Aim to get some answers through a Trigger Demonstrator: see Dmitry's talk
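The slides do not spell out what reducing the EDM overhead would look like in code; one common reading of "flattening" (also mentioned on slide 3) is moving from collections of pointer-linked objects to a struct-of-arrays layout that can be copied to an accelerator in a few contiguous blocks. A minimal sketch under that assumption; the CaloCluster/FlatClusters types are invented for illustration, not xAOD classes.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Object-oriented layout: one heap object per cluster, reached through pointers,
// so the collection cannot be shipped to an accelerator as a single contiguous copy.
struct CaloCluster { double e, eta, phi; };
using PointerCollection = std::vector<CaloCluster*>;

// Flattened layout: parallel arrays, each contiguous in memory, so the whole
// collection can be transferred with a few block copies.
struct FlatClusters {
  std::vector<double> e, eta, phi;
  std::size_t size() const { return e.size(); }
  std::size_t payloadBytes() const { return size() * 3 * sizeof(double); }
};

int main() {
  FlatClusters clusters;
  for (int i = 0; i < 1000; ++i) {
    clusters.e.push_back(10.0 + i);
    clusters.eta.push_back(0.001 * i);
    clusters.phi.push_back(0.0);
  }
  std::cout << clusters.size() << " clusters, "
            << clusters.payloadBytes() << " bytes to transfer\n";   // 1000 clusters, 24000 bytes
}
```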

6 GPGPU
A toy cost-benefit analysis based on today's technology, done to illustrate the process: there is not enough information to draw any firm conclusions.
Assume 50 HLT racks; max. power 12 kW per rack; usable space 47 U per rack.
Compare a) CPU-only and b) CPU+GPU systems, where each rack has:
a) 10 x (2U unit with 4 motherboards, 8 CPUs): 80 CPUs; 11 kW; ~40 TFLOPS
b) 16 x (Supermicro 1027GR-TR2 server): 32 CPUs; 32 GPUs; ~12 kW
CPU: Intel E5-2697v2: 12 cores, ~0.5 TFLOPS
GPU: Nvidia K20: 2496 cores, 13 SMX, 3.5 (1.1) TFLOPS for SP (DP)
Assume fixed cost and fixed power per rack
=> The CPU+GPU solution wins when the throughput per CPU is increased by more than a factor ~2.5
=> i.e. ~65% of the work (by CPU time) transferred to GPU (a sketch of the assumed model follows below)
(Plot: throughput gain vs speed-up of GPU code relative to CPU code, t(CPU)/t(GPU); markers at 2.5 and 0.65)
Need to redo this using the results of the demonstrator.
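The model behind the 2.5 / 65% figures is not shown on the slide; a plausible reading, consistent with the serial-wait assumption stated on backup slide 19, is an Amdahl-style gain G = 1/((1-f) + f/s), where f is the fraction of CPU time offloaded and s the speed-up of the offloaded code. A small sketch of that assumed model:

```cpp
#include <cstdio>

// Throughput gain per CPU when a fraction f of the CPU time is offloaded to a
// GPU that runs that code s times faster, with the CPU waiting for the GPU
// (the serial-wait assumption of backup slide 19).
double gain(double f, double s) { return 1.0 / ((1.0 - f) + f / s); }

int main() {
  // f = 0.65 with the overall x12 speed-up measured for L2 tracking on a C2050
  // (backup slide 11) gives a gain of about 2.5, the break-even factor quoted above.
  std::printf("f = 0.65, s = 12   -> gain = %.2f\n", gain(0.65, 12.0));
  // Even for an arbitrarily fast GPU the gain saturates at 1/(1-f).
  std::printf("f = 0.65, s -> inf -> gain = %.2f\n", 1.0 / (1.0 - 0.65));
}
```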

7 Timescales: Framework, Steering & New Technologies
(Timeline chart covering 2014 Q3/Q4, LS1 and Run 3; activities and milestones listed below)
Framework: design & prototype; implement core functionality; extend to full functionality; commissioning; run.
New technologies: evaluate; implement infrastructure; exploit new technologies in algorithms.
Algorithms & menus: speed up code, thread-safety, investigate possibilities for internal parallelisation; implement algorithms in the new framework; HLT software commissioning.
Milestones: framework requirements capture complete; initial framework & HLT components available; framework core functionality complete (incl. HLT components & new-technology support); narrow hardware choices (e.g. use GPUs or not); fix PC architecture; prototype with 1 or 2 chains; simple menu; full menu complete; complete framework & algorithms; HLT software commissioning complete; final software complete.

8 Summary
For Run 3 we need:
– A framework supporting concurrent execution of algorithms
– To make efficient use of computing technology (hardware & software)
Work has started:
– FFReq & framework demonstrators
– GPU demonstrator
Success requires significant developments in core software, reconstruction and the EDM:
– Algorithms must support concurrent execution (thread-safe or able to be cloned)
– The EDM must become data-oriented
– Maintain and increase the gains made via code optimisation
Vital for Trigger & Offline to work together on common solutions

9 Additional Material

10 Top CPU Consumers in Run-1
Combined HLT: 60% tracking, 20% Calo, 10% Muon, 10% other

11 GPUs
Example of a complete L2 ID chain implemented on GPU (Dmitry Emeliyanov)
Times in ms for a tau RoI (0.6x0.6), ttbar events at 2x10^34:
Step             | C++ on 2.4 GHz CPU | CUDA on Tesla C2050 | Speed-up CPU/GPU
Data prep.       | 27                 | 3                   | 9
Seeding          | 8.3                | 1.6                 | 5
Seed ext.        | 156                | 7.8                 | 20
Triplet merging  | 7.4                | 3.4                 | 2
Clone removal    | 70                 | 6.2                 | 11
CPU-GPU transfer | n/a                | 0.1                 | n/a
Total            | 268                | 22                  | 12
Max. speed-up: x26; overall speed-up t(CPU)/t(GPU): 12
(Charts: Data Prep. and L2 Tracking; annotations x2.4 and x5)

12 Sharing of GPU Resource
With a balanced load on CPU and GPU, several CPU cores can share one GPU,
e.g. a test of L2 ID tracking with 8 CPU cores sharing one Tesla C2050 GPU.
(Plot: blue = tracking running on CPU; red = most tracking steps on GPU, final ambiguity solving on CPU; annotation x2.4)
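The slide does not show how the sharing is implemented; as a pattern illustration only (plain C++ threads and a mutex, no real CUDA or offload service), the sketch below has eight workers whose CPU-side work runs in parallel while their submissions to a single shared "GPU" are serialised.

```cpp
#include <chrono>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex gpuMutex;                              // stands in for the single shared device

// The "offloaded" part: only one worker uses the GPU at a time.
double offloadToGpu(double work) {
  std::lock_guard<std::mutex> lock(gpuMutex);
  std::this_thread::sleep_for(std::chrono::milliseconds(5));   // stand-in for kernel execution
  return 2.0 * work;
}

// Each worker's CPU-side part runs in parallel with the others; only the GPU
// submissions are serialised, so the GPU stays busy if the load is balanced.
void worker(int id, double& result) {
  double cpuPart = 1.5 * id;                      // stand-in for CPU-side reconstruction work
  result = offloadToGpu(cpuPart);
}

int main() {
  const int nCores = 8;                           // as in the 8-core / one-C2050 test on the slide
  std::vector<double> results(nCores);
  std::vector<std::thread> workers;
  for (int i = 0; i < nCores; ++i)
    workers.emplace_back(worker, i, std::ref(results[i]));
  for (auto& t : workers) t.join();
  for (double r : results) std::cout << r << ' ';
  std::cout << '\n';
}
```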

13 Power & Cooling
SDX racks: max. power 12 kW; usable space 47 U.
Current power ~300 W per motherboard => max. 40 motherboards per rack.
Compare 2U units:
a) 4 motherboards, 8 CPUs: 1.1 kW
b) 1 motherboard, 2 CPUs with 2 GPUs (750 W) or 4 GPUs (1.2 kW)
Based on max. power: K20 GPU: 225 W cf. E5-2697v2 CPU: 130 W (need to measure typical power)
Illustrative farm configurations (50 racks):
Configuration                                    | Total farm nodes | CPUs  | Cores (max threads) | GPUs (SMX)     | Required throughput per node (per CPU core)
40 nodes per rack, ~300 W/node                   | 2,000            | 4,000 | 48,000 (96,000)     | 0              | 50 Hz (2.1 Hz)
10 nodes per rack, 4 GPUs per node, ~1200 W/node | 500              | 1,000 | 12,000 (24,000)     | 2,000 (26,000) | 200 Hz (8.3 Hz)
16 nodes per rack, 2 GPUs per node, ~750 W/node  | 800              | 1,600 | 19,200 (38,400)     | 1,600 (20,800) | 125 Hz (5.2 Hz)
(Annotations: factor x4 / x2.5 fewer CPUs relative to the CPU-only configuration)
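A quick cross-check of the last column, assuming it is simply the 100 kHz HLT input rate (slide 4) divided evenly over the nodes, respectively the CPU cores, of the 50-rack farm:

```cpp
#include <cstdio>

int main() {
  const double inputRateHz = 100000.0;            // 100 kHz HLT input rate (slide 4)
  const int racks = 50;
  struct Config { const char* name; int nodesPerRack; int coresPerNode; };
  const Config configs[] = {
    {"CPU only, 40 nodes/rack   ", 40, 24},       // 2 x 12-core E5-2697v2 per node
    {"4 GPUs/node, 10 nodes/rack", 10, 24},
    {"2 GPUs/node, 16 nodes/rack", 16, 24},
  };
  for (const Config& c : configs) {
    const int nodes = racks * c.nodesPerRack;
    std::printf("%s: %4d nodes -> %3.0f Hz/node (%.1f Hz/core)\n",
                c.name, nodes, inputRateHz / nodes,
                inputRateHz / (nodes * c.coresPerNode));
  }
}
// Reproduces the last column: 50 Hz (2.1 Hz), 200 Hz (8.3 Hz), 125 Hz (5.2 Hz).
```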

16 Packaging
Chassis: Supermicro 1027GR-TR2 (1U: 2 x E5-2600 or E5-2600v2, 3 x GPU) or 2027GR-TR2 (2U: 2 x E5-2600 or E5-2600v2, 4 x GPU)
CPU: Intel E5-2697v2: 12 cores, ~0.5 TFLOPS, ~2.3k CHF
GPU: Nvidia K20: 2496 cores, 13 SMX (192 cores per SMX), 3.5 (1.1) TFLOPS for SP (DP), ~2.4k CHF
Examples:
– Total for a 1027 or 2027 with 2 K20 GPUs: ~15k CHF => 12 CPU cores/GPU
– Total for a 2027 with 4 K20 GPUs: ~20k CHF => 6 CPU cores/GPU

18 Summary
The current limiting factor is cooling: 12 kW/rack
=> Adding GPUs means removing CPUs
=> For fixed cooling we would have a factor 2.5 (4) fewer CPUs when adding 2 (4) GPUs
Financial cost is ~25% (70%) more per 2U with 2 CPUs and 2 (4) GPUs than for a 2U with 8 CPUs
=> For fixed cooling and fixed cost we would have a factor 5-7 fewer CPUs
=> The CPU+GPU solution wins when the throughput per CPU is increased by more than a factor 5-7
=> i.e. 80-85% of the work (by CPU time) transferred to GPU (see the worked limit below)
Whether we need 1 or 2 GPUs per CPU depends on the relative CPU & GPU load
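Assuming the same Amdahl-style serial-wait model sketched after slide 6 (and stated as an assumption on slide 19), the gain for a very fast GPU saturates at 1/(1-f), which ties the factor 5-7 to the quoted 80-85%:

```latex
% Gain G for offloaded fraction f and offloaded-code speed-up s, and its large-s limit:
\[
  G(f,s) = \frac{1}{(1-f) + f/s} \;\to\; \frac{1}{1-f} \quad (s \gg 1),
  \qquad
  G = 5 \;\Rightarrow\; f = 0.80,
  \qquad
  G = 7 \;\Rightarrow\; f \approx 0.86 .
\]
```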

19 Increase in Throughput per CPU when a GPU is Added
(Plot: throughput gain per CPU vs speed-up of the offloaded code, t(CPU)/t(GPU), for different offloaded fractions)
Assumptions: the CPU code is serial and waits for GPU completion; the offloaded fraction is defined in terms of execution time on the CPU.

20 (Plot: 6 jobs per GPU)

