Future farm technologies & architectures John Baines 1.

1 Future farm technologies & architectures, John Baines

2 Introduction

What will the HLT farm look like in 2020?
When & how do we narrow the options?
– Choice affects software design as well as farm infrastructure
How do we evaluate costs/benefits?
When & how do we make the final choices for farm purchases?
How do we design software now to ensure we can fully exploit the capability of future farm hardware?
What do we need in the way of demonstrators for specific technologies?
– We can't evaluate all options – what should be the priorities?

3 Timescales: Framework, Steering & New Technologies

(Timeline figure, draft version for discussion, running from 2014 Q3/Q4 through LS1 to commissioning, Run 2 and Run 3. Milestones shown: framework requirements capture complete; design of framework & HLT components complete; narrow hardware choices, e.g. use GPU or not; framework core functionality complete, incl. HLT components & new-technology support; prototype with 1 or 2 chains; implement algorithms in new framework; fix PC architecture; framework & algorithms complete; simple menu, then full menu complete; HLT software commissioning complete; final software complete. Parallel activities: design & prototype, implement core functionality, extend to full functionality; evaluate, implement infrastructure, exploit new technologies in algorithms; speed up code, thread-safety, investigate possibilities for internal parallelisation.)

4 Technologies

CPU: increased core counts
– currently 12 cores (24 threads), e.g. Xeon E5-2600 v2 series, ~0.5 TFLOPS
– 18 cores (36 threads) coming soon (Xeon E5-2600 v3 series)
– possible trend to many cores with lower memory => cannot continue to run one job per core

GPU: much bigger core count, e.g. Nvidia K40: 15 SMX, 2880 cores, 12 GB memory, 4.3 (1.4) TFLOPS SP (DP)

Coprocessor: e.g. Intel Xeon Phi: up to 61 cores, 244 threads, 1.2 TFLOPS

5 GPU: Towards a cost-benefit analysis

Will need to assess:
– Effort needed to port code to GPU, to maintain it (bug fixing, new hardware…) and to support MC simulation on the Grid
– Speed-up for individual components & full chain
– What can be outsourced to GPU and what done on CPU
– Integration with Athena (APE)
– Balance of CPU cores to GPU, i.e. sharing of GPU resource between several jobs
– Farm integration issues: packaging, power consumption…
– Financial cost: hardware, installation, commissioning, maintenance…

As an exercise, see what we can learn from studies to date, i.e. the cost-benefit if we were to purchase today.

6 Demonstrators

ID (RAL, Edinburgh, Oxford):
– Complete L2 ID chain ported to CUDA for NVIDIA GPU
– ID data preparation (bytestream conversion, clustering, space-point formation) additionally ported to OpenCL
Muon (CERN, Rome):
– Muon calorimeter-isolation implemented in CUDA
Jet (Lisbon):
– Just starting

See Twiki: TriggerSoftwareUpgrade

Effort needed to port code? Porting L2 ID tracking to CUDA took ~2 years @ 0.5 FTE => 1 staff-year (for a very experienced expert!)

7 GPUs

Example of the complete L2 ID chain implemented on GPU (Dmitry Emeliyanov). Tau RoI 0.6x0.6, tt events at 2x10^34:

Time (ms)       | C++ on 2.4 GHz CPU | CUDA on Tesla C2050 | Speedup CPU/GPU
Data Prep.      | 27                 | 3                   | 9
Seeding         | 8.3                | 1.6                 | 5
Seed ext.       | 156                | 7.8                 | 20
Triplet merging | 7.4                | 3.4                 | 2
Clone removal   | 70                 | 6.2                 | 11
CPU-GPU xfer    | n/a                | 0.1                 | n/a
Total           | 268                | 22                  | 12

(Figure annotations: Data Prep. x2.4; L2 Tracking x5.) Max. speed-up: x26. Overall speed-up t(CPU)/t(GPU): 12.
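The per-stage numbers on this slide can be cross-checked with a few lines of Python. This is just arithmetic on the published timings, not the actual trigger code:

```python
# L2 ID chain per-stage times in ms, as quoted on the slide
# (C++ on a 2.4 GHz CPU vs. CUDA on a Tesla C2050).
cpu_ms = {"data_prep": 27, "seeding": 8.3, "seed_ext": 156,
          "triplet_merging": 7.4, "clone_removal": 70}
gpu_ms = {"data_prep": 3, "seeding": 1.6, "seed_ext": 7.8,
          "triplet_merging": 3.4, "clone_removal": 6.2,
          "xfer": 0.1}  # CPU<->GPU transfer exists only on the GPU side

total_cpu = sum(cpu_ms.values())           # ~268 ms per RoI
total_gpu = sum(gpu_ms.values())           # ~22 ms per RoI
overall_speedup = total_cpu / total_gpu    # ~12, as on the slide
print(f"CPU {total_cpu:.1f} ms, GPU {total_gpu:.1f} ms, "
      f"overall speed-up x{overall_speedup:.1f}")
```

Note that the overall x12 is much lower than the best per-stage speed-up (x20 for seed extension) because the less parallel stages and the transfer overhead dilute it.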

8 Sharing of GPU resource

With balanced load on CPU/GPU, several CPU cores can share a GPU, e.g. test with L2 ID tracking with 8 CPU cores sharing one Tesla C2050 GPU.

(Figure: blue: tracking running on CPU; red: most tracking steps on GPU, final ambiguity solving on CPU; x2.4.)
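A back-of-envelope way to see how many CPU cores can share one GPU: if each event spends t_cpu on the CPU and only t_gpu on the GPU, the GPU sits idle most of the time and can serve roughly (t_cpu + t_gpu) / t_gpu concurrent jobs before it saturates. The timings below are illustrative placeholders, not measurements from the slide:

```python
import math

def max_jobs_per_gpu(t_cpu_ms: float, t_gpu_ms: float) -> int:
    """Number of CPU jobs one GPU can serve before GPU occupancy
    reaches 100% (simple estimate, ignores queueing effects)."""
    return math.floor((t_cpu_ms + t_gpu_ms) / t_gpu_ms)

# Hypothetical split: 22 ms of GPU work per event (slide 7 total),
# with an assumed ~154 ms of remaining CPU work per event.
print(max_jobs_per_gpu(154, 22))  # 8 cores can share one GPU
```

This matches the order of magnitude of the 8-cores-per-C2050 test, but a real sizing would need measured occupancies.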

9 Packaging

Examples:
1U: 2x E5-2600 or E5-2600v2 + 3 GPU
2U: 2x E5-2600 or E5-2600v2 + 4 GPU

CPU: Intel E5-2697v2: 12 cores, ~0.5 TFLOPS, ~2.3k CHF
GPU: Nvidia K20: 2496 cores, 13 SMX, 192 cores per SMX, 3.5 (1.1) TFLOPS SP (DP), ~2.4k CHF

Total for 1027 or 2027 with 2 K20 GPU: ~15k CHF => 12 CPU cores/GPU
Total for 2027 with 4 K20 GPU: ~20k CHF => 6 CPU cores/GPU

10 Power & Cooling

SDX racks: max. power 12 kW; usable space 47 U.
Current power ~300 W per motherboard => max. 40 motherboards per rack.
Compare 2U unit: a) 4 motherboards, 8 CPU: 1.1 kW; b) 1 motherboard, 2 CPU with 2 GPU (750 W) or with 4 GPU (1.2 kW).
Based on max. power: K20 GPU: 225 W c.f. E5-2697v2 CPU: 130 W (need to measure typical power).

Illustrative farm configuration, 50 racks:

Configuration                            | Total farm nodes | CPU   | Cores (max threads) | GPU (SMX)      | Required throughput per node (per CPU core)
40 nodes per rack, ~300 W/node           | 2,000            | 4,000 | 48,000 (96,000)     | 0              | 50 Hz (2.1 Hz)
10 nodes per rack, 4 GPU/node, ~1200 W/node | 500           | 1,000 | 12,000 (24,000)     | 2,000 (26,000) | 200 Hz (8.3 Hz)
16 nodes per rack, 2 GPU/node, ~750 W/node  | 800           | 1,600 | 19,200 (38,400)     | 1,600 (20,800) | 125 Hz (5.2 Hz)

(Figure annotations: x4 and x2.5 required throughput increase relative to the CPU-only row.)
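The required-throughput column follows from spreading a fixed total HLT rate over fewer nodes. Assuming a total input rate of 100 kHz (consistent with 2,000 nodes x 50 Hz; the constant is an assumption, not stated on the slide), the numbers can be reproduced as:

```python
RACKS = 50
TOTAL_RATE_HZ = 100_000   # assumed total HLT input rate (2000 nodes x 50 Hz)
CORES_PER_NODE = 24       # 2 CPUs x 12 cores each
SMX_PER_GPU = 13          # Nvidia K20

def config(nodes_per_rack: int, gpus_per_node: int):
    """Return (nodes, gpus, smx, Hz per node, Hz per CPU core)."""
    nodes = RACKS * nodes_per_rack
    gpus = nodes * gpus_per_node
    per_node = TOTAL_RATE_HZ / nodes
    return nodes, gpus, gpus * SMX_PER_GPU, per_node, per_node / CORES_PER_NODE

for label, npr, gpn in [("CPU only", 40, 0), ("4 GPU/node", 10, 4),
                        ("2 GPU/node", 16, 2)]:
    nodes, gpus, smx, hz_node, hz_core = config(npr, gpn)
    print(f"{label}: {nodes} nodes, {gpus} GPUs ({smx} SMX), "
          f"{hz_node:.0f} Hz/node ({hz_core:.1f} Hz/core)")
```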

11 Summary

Current limiting factor is cooling: 12 kW/rack => adding GPU means removing CPU.
A factor 2.5-4 fewer CPUs requires a corresponding increase in CPU throughput.
Financial cost per motherboard (2U box with 8 CPU versus 2 CPU + 4 GPU): CPU+GPU is a factor ~2 more expensive
=> win with the CPU+GPU solution when throughput per CPU is increased by more than a factor 8
=> 90% of the work (by CPU time) transferred to GPU
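The factor-8 break-even is just the cost ratio times the CPU reduction, and the ~90% follows from requiring the residual per-event CPU time to shrink by that factor (serial model, GPU assumed fast). A sketch with the slide's numbers:

```python
cost_factor = 2      # CPU+GPU node ~2x the price of a CPU-only node
cpu_reduction = 4    # cooling limit: factor ~4 fewer CPUs once GPUs added

# Throughput per CPU must rise by both factors combined to break even.
needed_gain = cost_factor * cpu_reduction   # = 8

# If a fraction f of the per-event CPU time is offloaded to a fast GPU,
# the CPU time shrinks to (1 - f), so the gain is 1 / (1 - f).
f_breakeven = 1 - 1 / needed_gain           # = 0.875, i.e. ~90% as quoted
print(f"need x{needed_gain} per CPU -> offload {f_breakeven:.1%} of work")
```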

12 Discussion

Benefits:
– If we can manage to port the bulk of the time-consuming code to GPU, the benefit is potentially much better scaling with pile-up, i.e. no combinatorial code left on the CPU => execution times will scale slowly with pile-up; code on the GPU is parallel and will also scale slowly with pile-up.
Costs:
– Significant effort needed to port code
– Need to support different GPU generations with rolling replacements
– Potential divergence from offline
– Need to support CPU version of code for simulation
– Possibly more expensive than a CPU-only farm

=> A CPU+GPU solution is attractive IF a CPU-based farm cannot provide enough processing power.
=> However, it currently looks like a CPU-only farm is the least-cost solution.
=> Discuss!

13 CPU

Coming, e.g.: Xeon E5-2699 v3: 18 cores, 36 threads; 3,960 EUR / $5,392

14 GPU

US $ | K40    | K20X   | K20    | M2090 | C2050
1    | 4435   | 3200   | 2695   | 1825  | 1100
2    | 8870   | 9600   | 8085   | 5475  | 2200
4    | 17740  | 12800  | 10780  | 7300  | 4400

15 Increase in throughput per CPU when GPU added, vs. speed-up t(CPU)/t(GPU)

(Figure.) The CPU code is serial: it waits for GPU completion. The offloaded fraction is defined in terms of execution time on the CPU. If the CPU count is reduced by a factor 4, a factor 4 increase in throughput is needed to break even, i.e. 75% of the work moved to GPU.
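The curve on this slide is the standard serial-offload (Amdahl-style) relation: offloading a fraction f of the per-event CPU time to a GPU that runs it s times faster, with the CPU core waiting, raises throughput per core by 1 / ((1 - f) + f/s). A quick sketch:

```python
import math

def throughput_gain(f: float, s: float) -> float:
    """Throughput increase per CPU core when a fraction f of the CPU
    time is offloaded to a GPU with speed-up s (CPU waits for GPU)."""
    return 1.0 / ((1.0 - f) + f / s)

# Slide's example: 75% of the work on an arbitrarily fast GPU
# gives exactly the factor 4 needed to offset a 4x CPU reduction.
print(throughput_gain(0.75, math.inf))   # -> 4.0
```

With a finite GPU speed-up the gain is lower, e.g. s = 12 (slide 7) gives only x3.2 at f = 0.75, which is why the offloaded fraction has to be pushed so high.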

17 Speed-up factors

HLT: 60% tracking, 20% Calo, 10% Muon, 10% other
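These fractions bound the overall gain: even with tracking (60% of HLT time) fully on GPU, the whole-chain speed-up is modest. A hedged estimate using the same serial-offload model, and assuming the x12 tracking speed-up from slide 7 carries over:

```python
# HLT CPU-time breakdown from this slide.
fractions = {"tracking": 0.60, "calo": 0.20, "muon": 0.10, "other": 0.10}

def chain_gain(offloaded: dict) -> float:
    """Overall HLT speed-up when the named components run on GPU with
    the given speed-ups; everything else stays on the CPU at speed 1."""
    remaining = 1.0
    for name, s in offloaded.items():
        remaining -= fractions[name] * (1.0 - 1.0 / s)
    return 1.0 / remaining

# Tracking only, at the x12 measured for the L2 ID chain:
print(round(chain_gain({"tracking": 12}), 2))  # -> 2.22
```

So offloading tracking alone buys at most a factor ~2.2 per core, well short of the factor ~8 break-even from slide 11; the Calo and Muon fractions would have to move as well.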

18 Costs

– Cost of GPU - cost of CPU
– Cost of effort for online version
– Cost for simulation

19 …

(Figure: timing diagram of GPU sharing on one node. CPU#1: 12 CPU cores, 12/24 CPU threads; GPU#1: 15 SMX, 2880 cores; GPU#2: 15 SMX, 2880 cores. Marked times: 120 ms, 240 ms, 360 ms, 10 ms, 240 ms, 250 ms. CPU busy 69% of the time: x0.69 => throughput x1.44.)

20 6 jobs per GPU

(Figure.)

22 Data Preparation Code

(Code-listing slides follow.)

54 Power & Cooling

SDX racks:
Upper level: 27? XPU racks – each 47U usable; 9.5 kW
– 1U units, 31, 32 or 40 per rack (=> max 300 W per 1U)
– Current power consumption 6-9 kW per rack
Lower level: partially equipped with 10 racks (+6 preseries racks) – each 47U (could be 52U with additional reinforcing); 15 kW
– 2U 4-blade servers, 1100 W, 8 or 10 per rack (9-11 kW)

GPU: C2050: <238 W; K20: <225 W; K40: <235 W c.f. CPU: 130 W (12 core, 2.7 GHz)
=> GPU max. power consumption 80% higher than CPU
=> Adding 1 GPU ~doubles the power consumption of a node

Configuration (50 racks)                    | Nodes (motherboards) | CPU   | Cores (max threads) | GPU (SMX)      | Throughput per node (per CPU core)
40 nodes per rack                           | 2,000                | 4,000 | 48,000 (96,000)     | 0              | 50 Hz (2.08 Hz)
10 nodes per rack, 4 GPU/node, ~1200 W/node | 500                  | 1,000 | 12,000 (24,000)     | 2,000 (30,000) | 200 Hz (8.33 Hz)
15 nodes per rack, 2 GPU/node, ~800 W/node  | 750                  | 1,500 | 18,000 (36,000)     | 1,500 (22,500) | 133 Hz (5.55 Hz)

55 Packaging

1U: 2x E5-2600 or E5-2600v2 + 3 GPU, e.g. 2x12 = 24 CPU cores, 3 GPU => 8 CPU cores/GPU (+ GPU 3x $4435)
2U: 2x E5-2600 or E5-2600v2 + 4 GPU, e.g. 2x12 = 24 CPU cores, 4 GPU => 6 CPU cores/GPU

GPU: e.g. K40: 2880 cores, 15 SMX, 192 cores per SMX, 4.3 (1.4) TFLOPS SP (DP): $4400

