Future farm technologies & architectures
John Baines
Introduction

What will the HLT farm look like in 2020?
When and how do we narrow the options?
– The choice affects software design as well as farm infrastructure.
How do we evaluate costs and benefits?
When and how do we make the final choices for farm purchases?
How do we design software now to ensure we can fully exploit the capability of future farm hardware?
What do we need in the way of demonstrators for specific technologies? We can't evaluate all options, so what should be the priorities?
Timescales: Framework, Steering & New Technologies

[Timeline chart (draft version for discussion) spanning LS1 (2014, Q3-Q4) to Run 3. Framework track: requirements capture complete; design of framework & HLT components complete; design & prototype; implement core functionality (framework core functionality complete, incl. HLT components and new-technology support); extend to full functionality; commissioning. Algorithms & menus track: implement algorithms in the new framework; prototype with 1 or 2 chains; simple menu; full menu complete; HLT software commissioning complete; final software complete. New-technology track: evaluate; narrow hardware choices (e.g. use GPUs or not); fix PC architecture; implement infrastructure; exploit new technologies in algorithms: speed up code, thread-safety, investigate possibilities for internal parallelisation.]
Technologies

CPU: increasing core counts
– Currently 12 cores (24 threads), e.g. Xeon E5-2600 v2 series, ~0.5 TFLOPS
– 18 cores (36 threads) coming soon (Xeon E5-2600 v3 series)
– Possible trend towards many cores with less memory per core => cannot continue to run one job per core

GPU: much larger core count, e.g. Nvidia K40: 15 SMX, 2880 cores, 12 GB memory, 4.3 (1.4) TFLOPS SP (DP)

Coprocessor: e.g. Intel Xeon Phi: up to 61 cores, 244 threads, 1.2 TFLOPS
GPU: Towards a cost-benefit analysis

Will need to assess:
– Effort needed to port code to GPU, to maintain it (bug fixing, new hardware, ...) and to support MC simulation on the Grid
– Speed-up for individual components and for the full chain
– What can be outsourced to the GPU and what must be done on the CPU
– Integration with Athena (APE)
– Balance of CPU cores to GPUs, i.e. sharing of the GPU resource between several jobs
– Farm integration issues: packaging, power consumption, ...
– Financial cost: hardware, installation, commissioning, maintenance, ...

As an exercise, see what we can learn from studies to date, i.e. the cost-benefit if we were to purchase today.
Demonstrators

ID (RAL, Edinburgh, Oxford):
– Complete L2 ID chain ported to CUDA for NVIDIA GPUs
– ID data preparation (bytestream conversion, clustering, space-point formation) additionally ported to OpenCL
Muon (CERN, Rome):
– Muon calorimeter isolation implemented in CUDA
Jet (Lisbon):
– Just starting

See Twiki: TriggerSoftwareUpgrade

Effort needed to port code? Porting L2 ID tracking to CUDA took ~2 years at 0.5 FTE => 1 staff-year (for a very experienced expert!)
GPUs

Example of the complete L2 ID chain implemented on a GPU (Dmitry Emeliyanov). Times per Tau RoI (0.6x0.6), ttbar events at 2x10^34:

Step             | C++ on 2.4 GHz CPU (ms) | CUDA on Tesla C2050 (ms) | Speed-up CPU/GPU
Data prep.       | 27   | 3   | x9
Seeding          | 8.3  | 1.6 | x5
Seed extension   | 156  | 7.8 | x20
Triplet merging  | 7.4  | 3.4 | x2
Clone removal    | 70   | 6.2 | x11
CPU-GPU transfer | n/a  | 0.1 | n/a
Total            | 268  | 22  | x12

Max. speed-up: x26. Overall speed-up t(CPU)/t(GPU): x12.
[Chart annotations: Data Prep., L2 Tracking, x2.4, x5]
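As a cross-check, the per-step and overall speed-ups follow directly from the quoted times; a minimal sketch in Python (numbers transcribed from the table above, no other assumptions):

```python
# Cross-check of the per-step and overall speed-ups quoted in the
# table above (times in ms, transcribed from the slide).
cpu_ms = {"data_prep": 27.0, "seeding": 8.3, "seed_ext": 156.0,
          "triplet_merging": 7.4, "clone_removal": 70.0}
gpu_ms = {"data_prep": 3.0, "seeding": 1.6, "seed_ext": 7.8,
          "triplet_merging": 3.4, "clone_removal": 6.2, "xfer": 0.1}

for step, t_cpu in cpu_ms.items():
    print(f"{step:16s} x{t_cpu / gpu_ms[step]:.1f}")

total_cpu = sum(cpu_ms.values())   # ~268 ms
total_gpu = sum(gpu_ms.values())   # ~22 ms (includes the CPU-GPU transfer)
print(f"overall t(CPU)/t(GPU) = x{total_cpu / total_gpu:.1f}")  # ~x12
```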
Sharing of the GPU resource

With a balanced load on CPU and GPU, several CPU cores can share a GPU, e.g. a test with L2 ID tracking and 8 CPU cores sharing one Tesla C2050 GPU.

[Plot: blue = tracking running on the CPU; red = most tracking steps on the GPU, with final ambiguity solving on the CPU. Annotation: x2.4]
Packaging

Examples:
1U: 2x E5-2600 or E5-2600v2 + 3 GPUs
2U: 2x E5-2600 or E5-2600v2 + 4 GPUs

CPU: Intel E5-2697v2: 12 cores, ~0.5 TFLOPS, ~2.3k CHF
GPU: Nvidia K20: 2496 cores, 13 SMX (192 cores per SMX), 3.5 (1.1) TFLOPS SP (DP), ~2.4k CHF

Total for a 1027 or 2027 server with 2 K20 GPUs: ~15k CHF => 12 CPU cores/GPU
Total for a 2027 server with 4 K20 GPUs: ~20k CHF => 6 CPU cores/GPU
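A rough sketch of how the quoted totals decompose; the base price for chassis, motherboard, memory, etc. is an assumed figure inferred from the totals, not stated on the slide:

```python
# Back-of-envelope node costs consistent with the slide's totals.
# ASSUMPTION: the "base" price (chassis, motherboard, memory, ...) is
# inferred from the quoted totals, not stated on the slide.
cpu_chf = 2.3e3        # Intel E5-2697v2, ~2.3k CHF
gpu_chf = 2.4e3        # Nvidia K20, ~2.4k CHF
base_chf = 5.6e3       # assumed chassis + motherboard + memory

for n_gpu in (2, 4):
    total = base_chf + 2 * cpu_chf + n_gpu * gpu_chf
    cores_per_gpu = 2 * 12 / n_gpu      # two 12-core CPUs per node
    print(f"{n_gpu} GPUs: ~{total / 1e3:.0f}k CHF, "
          f"{cores_per_gpu:.0f} CPU cores/GPU")
```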
Power & Cooling

SDX racks: max. power 12 kW, usable space 47U.
Current power: ~300 W per motherboard => max. 40 motherboards per rack.
Compare 2U units: a) 4 motherboards, 8 CPUs: 1.1 kW; b) 1 motherboard, 2 CPUs with 2 GPUs (750 W) or 4 GPUs (1.2 kW).
Based on max. power: K20 GPU 225 W c.f. E5-2697v2 CPU 130 W (typical power still to be measured).

Illustrative farm configuration, 50 racks:

Configuration                            | Nodes | CPUs  | Cores (max threads) | GPUs (SMX)     | Required throughput per node (per CPU core)
40 nodes/rack, ~300 W/node               | 2,000 | 4,000 | 48,000 (96,000)     | 0              | 50 Hz (2.1 Hz)
10 nodes/rack, 4 GPUs/node, ~1200 W/node | 500   | 1,000 | 12,000 (24,000)     | 2,000 (26,000) | 200 Hz (8.3 Hz)
16 nodes/rack, 2 GPUs/node, ~750 W/node  | 800   | 1,600 | 19,200 (38,400)     | 1,600 (20,800) | 125 Hz (5.2 Hz)

(Annotations: x4 and x2.5, the required per-node throughput relative to the CPU-only configuration.)
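The required-throughput column follows from a fixed total event rate shared across the farm; a minimal sketch, assuming a total HLT input rate of 100 kHz (an inferred value, consistent with 2,000 nodes x 50 Hz in the CPU-only row):

```python
# Required per-node and per-core throughput for each farm layout.
# ASSUMPTION: a fixed total HLT input rate of 100 kHz, inferred from
# the CPU-only row (2,000 nodes x 50 Hz); not stated on the slide.
TOTAL_RATE_HZ = 100e3
CORES_PER_NODE = 24            # 2 x 12-core CPUs

for name, nodes in [("CPU only, 40 nodes/rack", 2000),
                    ("4 GPUs/node, 10 nodes/rack", 500),
                    ("2 GPUs/node, 16 nodes/rack", 800)]:
    per_node = TOTAL_RATE_HZ / nodes
    print(f"{name}: {per_node:.0f} Hz/node "
          f"({per_node / CORES_PER_NODE:.1f} Hz/core)")
```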
Summary

The current limiting factor is cooling (12 kW/rack) => adding GPUs means removing CPUs.
A factor 2.5-4 fewer CPUs requires a corresponding increase in throughput per CPU.
Financial cost per motherboard (2U box with 8 CPUs versus 2 CPUs + 4 GPUs): the CPU+GPU option is a factor ~2 more expensive.
=> We win with the CPU+GPU solution when the throughput per CPU is increased by more than a factor 8, i.e. (since the gain is roughly 1/(1-f) for a fraction f of work offloaded) when ~90% of the work (by CPU time) is transferred to the GPU.
Discussion

Benefits:
– If we can port the bulk of the time-consuming code to the GPU, the benefit is potentially much better scaling with pile-up: with no combinatorial code left on the CPU, CPU execution times will scale slowly with pile-up, and the code on the GPU is parallel and will also scale slowly with pile-up.
Costs:
– Significant effort needed to port the code
– Need to support different GPU generations with rolling replacements
– Potential divergence from offline
– Need to support a CPU version of the code for simulation
– Possibly more expensive than a CPU-only farm

A CPU+GPU solution is attractive IF a CPU-based farm cannot provide enough processing power. However, it currently looks like a CPU-only farm is the least-cost solution. Discuss!
CPU

Coming, e.g.: Xeon E5-2699 v3: 18 cores, 36 threads, 3,960 EUR ($5,392)
GPU

GPU prices (US $):

Qty | K40    | K20X   | K20    | M2090 | C2050
1   | 4,435  | 3,200  | 2,695  | 1,825 | 1,100
2   | 8,870  | 9,600  | 8,085  | 5,475 | 2,200
4   | 17,740 | 12,800 | 10,780 | 7,300 | 4,400
Increase in throughput per CPU when a GPU is added

[Plot: increase in throughput per CPU versus speed-up t(CPU)/t(GPU), for different fractions of work moved to the GPU. The CPU code is serial: it waits for GPU completion; the fraction is defined in terms of execution time on the CPU.]

If the CPU count is reduced by a factor 4, we need a factor 4 increase in throughput to break even, i.e. 75% of the work moved to the GPU.
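The curves behind that plot follow from a simple blocking-offload model; a minimal sketch, assuming (as stated on the slide) that the CPU code is serial, waits for GPU completion, and the offloaded fraction f is defined in terms of CPU execution time:

```python
# Throughput gain per CPU core when a fraction f of the work (measured
# in CPU execution time) is offloaded to a GPU running it s times
# faster, with the CPU blocking until the GPU finishes (no overlap).
def throughput_gain(f: float, s: float) -> float:
    return 1.0 / ((1.0 - f) + f / s)

# In the fast-GPU limit the gain tends to 1/(1 - f): f = 0.75 breaks
# even with a 4x smaller CPU count, f ~ 0.875 with the factor-8 target.
for f in (0.75, 0.875):
    print(f"f = {f:.3f}: limit x{1.0 / (1.0 - f):.0f}, "
          f"with s = 20: x{throughput_gain(f, 20.0):.2f}")
```

In this model the gain saturates at 1/(1-f) for large s, which is why moving 75% of the work caps the gain at x4 however fast the GPU is.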
Speed-up factors

HLT time breakdown: 60% tracking, 20% Calo, 10% Muon, 10% other
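If per-domain speed-ups were known, the overall HLT speed-up would follow from these fractions via Amdahl's law; a minimal sketch, where the per-domain speed-up values are hypothetical placeholders (only the x12 for tracking echoes the measured L2 ID figure):

```python
# Overall HLT speed-up implied by the time breakdown on this slide
# (60% tracking, 20% calo, 10% muon, 10% other) via Amdahl's law.
# ASSUMPTION: the per-domain speed-ups below are placeholders; only
# the x12 for tracking echoes the measured L2 ID tracking figure.
fractions = {"tracking": 0.60, "calo": 0.20, "muon": 0.10, "other": 0.10}
speedups  = {"tracking": 12.0, "calo": 4.0,  "muon": 3.0,  "other": 1.0}

overall = 1.0 / sum(f / speedups[d] for d, f in fractions.items())
print(f"overall speed-up: x{overall:.1f}")   # ~x4.3 with these numbers
```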
– Cost of GPU minus cost of CPU
– Cost of effort for the online version
– Cost for simulation
[Timing diagram: CPU#1 (12 CPU cores, 12/24 CPU threads) sharing GPU#1 and GPU#2 (each 15 SMX, 2880 cores); marks at 120 ms, 240 ms, 360 ms and 10 ms, 240 ms, 250 ms; annotations: CPU 69% (x0.69), throughput x1.44.]
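One possible reading of those numbers (entirely an assumption): 240 ms of CPU work plus a 10 ms transfer gives 250 ms CPU-busy out of 360 ms wall time per event, hence ~69% CPU utilisation and a potential x1.44 throughput gain if the GPU-wait gaps are filled by other jobs:

```python
# ASSUMED reading of the diagram: per event, the CPU is busy for
# 240 ms of work plus a 10 ms CPU<->GPU transfer (250 ms), and the
# total wall time is 360 ms when the CPU blocks on the GPU.
cpu_busy_ms = 240 + 10
wall_ms = 360

print(f"CPU utilisation while blocking: {cpu_busy_ms / wall_ms:.0%}")  # ~69%
# If other jobs keep the CPU busy during GPU waits, the node becomes
# CPU-bound and the event throughput rises by wall/busy:
print(f"potential throughput gain: x{wall_ms / cpu_busy_ms:.2f}")      # x1.44
```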
6 jobs per GPU

[Slides 20-21: plots shown as images; no further text recovered.]
Data Preparation Code

[Slides 22-53: content shown as images; no text recovered.]
Power & Cooling

SDX racks:
Upper level: 27(?) XPU racks, each 47U usable, 9.5 kW
– 1U servers, 31, 32 or 40 per rack (=> max. 300 W per 1U)
– Current power consumption 6-9 kW per rack
Lower level: partially equipped with 10 racks (+6 pre-series racks), each 47U (could be 52U with additional reinforcing), 15 kW
– 2U 4-blade servers, 1100 W, 8 or 10 per rack (9-11 kW)

GPU max. power: C2050 <238 W, K20 <225 W, K40 <235 W; c.f. CPU 130 W (12 cores, 2.7 GHz) => GPU max. power ~80% higher than CPU. Adding one GPU roughly doubles the power consumption of a node.

50 racks:

Configuration                            | Nodes (motherboards) | CPUs  | Cores (max threads) | GPUs (SMX)     | Throughput per node (per CPU core)
40 nodes/rack                            | 2,000                | 4,000 | 48,000 (96,000)     | 0              | 50 Hz (2.08 Hz)
10 nodes/rack, 4 GPUs/node, ~1200 W/node | 500                  | 1,000 | 12,000 (24,000)     | 2,000 (30,000) | 200 Hz (8.33 Hz)
15 nodes/rack, 2 GPUs/node, ~800 W/node  | 750                  | 1,500 | 18,000 (36,000)     | 1,500 (22,500) | 133 Hz (5.55 Hz)
Packaging

1U: 2x E5-2600 or E5-2600v2 + 3 GPUs, e.g. 2x12 = 24 CPU cores, 3 GPUs => 8 CPU cores/GPU (3x $4,435)
2U: 2x E5-2600 or E5-2600v2 + 4 GPUs, e.g. 2x12 = 24 CPU cores, 4 GPUs => 6 CPU cores/GPU

GPU: e.g. K40: 2880 cores, 15 SMX (192 cores per SMX), 4.3 (1.4) TFLOPS SP (DP): $4,400