
1 Exascale: challenges and opportunities in a power constrained world Carlo Cavazzoni – c.cavazzoni@cineca.it SuperComputing Applications and Innovation Department

2 Outline - Cineca roadmap - Exascale challenge - Applications challenge - Energy efficiency

3 GALILEO Name: Galileo Model: IBM/Lenovo NeXtScale Processor type: Intel Xeon Haswell @ 2.4 GHz Computing Nodes: 516 Each node: 16 cores, 128 GB of RAM Computing Cores: 8,256 RAM: 66 TByte Internal Network: Infiniband 4x QDR switches (40 Gb/s) Accelerators: 768 Intel Phi 7120p (2 per node on 384 nodes) + 80 Nvidia K80 Peak Performance: 1.5 PFlops National and PRACE Tier-1 calls. x86-based system for production runs of medium-scalability applications.

4 PICO Model: IBM NeXtScale IB Linux cluster Processor type: Intel Xeon E5 2670 v2 @ 2.5 GHz Computing Nodes: 80 Each node: 20 cores, 128 GB of RAM 2 visualization nodes 2 big-memory nodes 4 data-mover nodes. Storage and processing of large volumes of data. Storage: 50 TByte of SSD, 5 PByte on-line repository (same fabric as the cluster), 16 PByte of tape. Services: Hadoop & PBS, OpenStack cloud, NGS pipelines, workflows (weather/sea forecast), analytics, high-throughput workloads.

5 Cineca road-map. Today: Tier-0: Fermi (BGQ); Tier-1: Galileo; BigData: Pico. Q1 2016: Tier-0: new system (procurement ongoing, HPC Top10); BigData: Galileo/Pico. Q1 2019: Tier-0 + BigData: 50 PFlops, 50 PByte.

6 Dennard scaling law (downscaling). Old VLSI generations (ideal Dennard scaling): L' = L/2, V' = V/2, F' = 2F, D' = 1/L'^2 = 4D, P' = P. This does not hold anymore: the power crisis! New VLSI generations: L' = L/2, V' ≈ V, F' ≈ 2F, D' = 4D, P' = 4P. The core frequency and performance no longer grow following Moore's law, so the number of cores is increased to keep the architecture's evolution on Moore's law: the programming crisis!
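
To make the 4x explicit, here is a short derivation (my own sketch, not from the slide) using the standard dynamic-power model P ∝ D·C·V²·F, where D is the device density and C the capacitance per device, and assuming C' = C/2 at the new node:

```latex
% Dynamic power per unit area: P \propto D \cdot C \cdot V^2 \cdot F.
% Halving the feature size gives D' = 4D and C' = C/2.
\begin{align*}
\text{Dennard (old gen.):} \quad
  P' &\propto (4D)\,\tfrac{C}{2}\,\left(\tfrac{V}{2}\right)^{2}(2F) = D\,C\,V^{2}F = P \\
\text{post-Dennard (new gen.):} \quad
  P' &\propto (4D)\,\tfrac{C}{2}\,V^{2}(2F) = 4\,D\,C\,V^{2}F = 4P
\end{align*}
```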

7 Moore's law: the number of transistors per chip doubles every 18 months. Oh-oh, Houston! In truth it doubles every 24 months.

8 The silicon lattice: the Si lattice constant is 0.54 nm. There are still 4-6 cycles (technology generations) left until we reach the 11 to 5.5 nm technologies, at which point we hit the downscaling limit, somewhere between 2020 and 2030 (H. Iwai, IWJT 2008). That is about 50 atoms!

9 Amdahl's law: the upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations. The maximum speedup tends to 1/(1 - P), where P is the parallel fraction. With 1,000,000 cores you need P = 0.999999, i.e. a serial fraction of only 0.000001 (see the sketch below).
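
A minimal sketch (not from the slides) that evaluates Amdahl's formula S(N) = 1/((1 - P) + P/N) for the numbers quoted above:

```c
/* Amdahl's law: S(N) = 1 / ((1 - P) + P / N), P = parallel fraction. */
#include <stdio.h>

static double amdahl_speedup(double parallel_fraction, double ncores)
{
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / ncores);
}

int main(void)
{
    const double ncores = 1.0e6;                       /* 1,000,000 cores */
    const double fractions[] = { 0.99, 0.9999, 0.999999 };

    for (int i = 0; i < 3; ++i) {
        double p = fractions[i];
        printf("P = %.6f -> speedup on %.0f cores: %.0f (asymptotic limit 1/(1-P) = %.0f)\n",
               p, ncores, amdahl_speedup(p, ncores), 1.0 / (1.0 - p));
    }
    return 0;
}
```

With P = 0.999999 the million-core speedup is only about 500,000, half of what the hardware could deliver: even a 10^-6 serial fraction is already the bottleneck.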

10 HPC trends (constrained by the three laws): peak performance - Moore's law; FPU performance - Dennard's law; number of FPUs - Moore + Dennard; application parallelism - Amdahl's law. An exaflops is 10^9 gigaflops (the opportunity); the serial fraction must shrink to about 1/10^9 (the challenge).

11 Energy trends. "Traditional" RISC and CISC chips are designed for maximum performance on all possible workloads: a lot of silicon to maximize single-thread performance. (Diagram: compute vs. power/energy vs. datacenter capacity.)

12 Change of paradigm. New chips are designed for maximum performance on a small set of workloads: simple functional units, poor single-thread performance, but maximum throughput. (Diagram: compute vs. power/energy vs. datacenter capacity.)

13 Architecture toward exascale. CPU + discrete accelerator (GPU/MIC/FPGA): the CPU provides single-thread performance, the accelerator provides throughput, and the link between them is the bottleneck. Integrated alternatives: APU (AMD), SoC (KNL, ARM), or CPU tightly coupled to an accelerator (OpenPower + Nvidia GPU). Enabling technologies: 3D stacking (TSV), active memory, photonics -> platform flexibility.

14 Exascale architecture, two models. Hybrid: CPU + accelerator (AMD APU, Nvidia GPU). Homogeneous: SoC (ARM, Intel).

15 Accelerator/GPGPU: sum of a 1D array, element by element in parallel (see the sketch below).
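
A minimal sketch of the idea in C with OpenACC directives (one of the accelerator programming models listed later in the deck); the array size and values are purely illustrative:

```c
/* Offloading the element-wise sum of two 1D arrays to an accelerator. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void)
{
    float *a = malloc(N * sizeof *a);
    float *b = malloc(N * sizeof *b);
    float *c = malloc(N * sizeof *c);

    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Each element is independent, so the loop maps naturally onto
     * thousands of accelerator threads. */
    #pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

    printf("c[0] = %f, c[N-1] = %f\n", c[0], c[N - 1]);
    free(a); free(b); free(c);
    return 0;
}
```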

16 Intel vector units. Next to come: AVX-512, up to 16 multiply-adds per clock.
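
A hedged illustration (not from the slides) of what this looks like in C intrinsics, assuming an AVX-512F capable CPU and a compiler flag such as -mavx512f: a single _mm512_fmadd_ps performs 16 single-precision multiply-adds.

```c
/* One AVX-512 fused multiply-add over 16 single-precision lanes. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[16], b[16], c[16], r[16];
    for (int i = 0; i < 16; ++i) { a[i] = (float)i; b[i] = 2.0f; c[i] = 1.0f; }

    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    __m512 vc = _mm512_loadu_ps(c);

    /* r = a * b + c, all 16 lanes in one instruction */
    __m512 vr = _mm512_fmadd_ps(va, vb, vc);
    _mm512_storeu_ps(r, vr);

    for (int i = 0; i < 16; ++i)
        printf("%g ", r[i]);
    printf("\n");
    return 0;
}
```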

17 Applications Challenges -Programming model -Scalability -I/O, Resiliency/Fault tolerance -Numerical stability -Algorithms -Energy Aware/Efficiency

18 www.quantum-espresso.org H2020 MaX Center of Excellence

19 Scalability: the case of Quantum ESPRESSO. The QE parallelization hierarchy.

20 OK for petascale, not enough for exascale. Ab-initio simulations -> numerical solution of the quantum mechanical equations.

21 QE evolution: new algorithms (CG vs Davidson), double buffering, task-level parallelism, communication avoiding, coupled applications / DSL (LAMMPS + QE), high-throughput / ensemble simulations; reliability, completeness, robustness, standard interfaces.

22 Multi-level parallelism. MPI: domain partitioning. OpenMP: node-level shared memory. CUDA/OpenCL/OpenACC/OpenMP4: floating-point accelerators. Python: ensemble simulations, workflows. Workload management: system level, high throughput. (A minimal sketch of the first two levels follows.)
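
A minimal sketch (not QE source code) of the first two levels: MPI partitions the domain across ranks, OpenMP threads share a rank's slice inside the node. The loop body is a stand-in for real work.

```c
/* Hybrid MPI + OpenMP skeleton: domain partition across ranks,
 * shared-memory threading within each rank. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NGLOBAL 1000000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* MPI level: each rank owns a contiguous slice of the global domain. */
    int chunk = NGLOBAL / nranks;
    int lo = rank * chunk;
    int hi = (rank == nranks - 1) ? NGLOBAL : lo + chunk;

    /* OpenMP level: threads share the rank's slice. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = lo; i < hi; ++i)
        local += 1.0 / (1.0 + (double)i);   /* stand-in for real work */

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%f\n",
               nranks, omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}
```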

23 QE (Al2O3 small benchmark): energy to solution as a function of the clock frequency.
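
A toy model (my own assumption, not the measured Al2O3 data) of why energy-to-solution has a minimum as a function of clock: E(f) = P(f)·t(f), where dynamic power grows superlinearly with frequency (through voltage) while run time shrinks roughly as 1/f for compute-bound code, and static power rewards racing to idle. All constants below are hypothetical.

```c
/* Toy energy-to-solution model, NOT measured data. */
#include <stdio.h>

int main(void)
{
    const double p_static = 40.0;   /* W, hypothetical static/idle node power */
    const double k        = 60.0;   /* W, hypothetical dynamic power at f_ref, v_ref */
    const double f_ref    = 2.4;    /* GHz, reference clock */
    const double v_ref    = 0.92;   /* V, hypothetical voltage at f_ref */
    const double t_ref    = 100.0;  /* s, hypothetical run time at f_ref */

    for (double f = 1.2; f <= 3.01; f += 0.3) {
        double v = 0.8 + 0.1 * (f - 1.2);                  /* crude V-vs-f curve */
        double p = p_static + k * (f / f_ref) * (v * v) / (v_ref * v_ref);
        double t = t_ref * f_ref / f;                      /* compute-bound scaling */
        printf("f=%.1f GHz  P=%5.1f W  t=%6.1f s  E=%7.0f J\n", f, p, t, p * t);
    }
    return 0;
}
```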

24 Conclusions. Exascale systems will be there. Power is the main architectural constraint. Exascale applications? Yes, but... concurrency, fault tolerance, I/O, energy awareness.

25 Backup slides

26 I/O challenges. The I/O subsystems of high-performance computers are still deployed using spinning disks, with their mechanical limitations (the spinning speed cannot grow above a certain regime, beyond which vibration cannot be controlled), and, like DRAM, they consume energy even when their state does not change. Solid-state technology appears to be a possible alternative, but its cost does not allow data-storage systems of the same size. Hierarchical solutions can probably exploit both technologies, but that does not solve the problem of spinning disks spinning for nothing. Today: 100 clients, 1,000 cores per client, 3 PByte, 3K disks, 100 GByte/s, 8 MByte blocks, parallel filesystem, one-tier architecture. Tomorrow: 10K clients, 100K cores per client, 1 Exabyte, 100K disks, 100 TByte/s, 1 GByte blocks, parallel filesystem, multi-tier architecture.
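
A back-of-the-envelope check, derived only from the numbers on the slide: the aggregate bandwidth spread over disks and clients.

```c
/* Aggregate bandwidth divided over disks and clients for both columns. */
#include <stdio.h>

static void io_case(const char *label, double bw_gbs, int clients, int disks)
{
    printf("%-9s aggregate %.0f GB/s -> %.2f GB/s per disk, %.2f GB/s per client\n",
           label, bw_gbs, bw_gbs / disks, bw_gbs / clients);
}

int main(void)
{
    io_case("Today",    100.0,      100,   3000);   /* 100 GB/s,  100 clients,   3K disks  */
    io_case("Tomorrow", 100000.0, 10000, 100000);   /* 100 TB/s, 10K clients, 100K disks */
    return 0;
}
```

The "tomorrow" column works out to roughly 1 GB/s sustained per disk, well beyond what a single spinning disk can stream, which is the point of the multi-tier argument.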

27 Storage I/O. The I/O subsystem is not keeping pace with the CPU, so checkpointing will not be possible. Reduce I/O: on-the-fly analysis and statistics, disks only for archiving, scratch on non-volatile memory ("close to RAM").

28 Today: cores -> I/O clients -> switch -> I/O servers -> RAID controllers -> disks. Example: 160K cores, 96 I/O clients, 24 I/O servers, 3 RAID controllers. IMPORTANT: the I/O subsystem has its own parallelism!

29 Today-tomorrow: cores -> I/O clients -> switch -> I/O servers -> Tier-1 FLASH RAID controllers and Tier-2 disk RAID controllers. Example: 1M cores, 1,000 I/O clients, 100 I/O servers, 10 RAID FLASH/DISK controllers.

30 Tomorrow: cores with NVRAM nodes as Tier-1 (byte addressable?) -> I/O clients -> switch -> I/O servers -> FLASH and disk RAID controllers as Tier-2/Tier-3 (block devices). Example: 1G cores, 10K NVRAM nodes, 1,000 I/O clients, 100 I/O servers, 10 RAID controllers.

31 Resiliency/fault tolerance: checkpoint/restart, warm start vs cold start, non-volatile memory -> data pooling. (Diagram: in one pool each node holds only its own workflow, Node 1: Wf 1 ... Node 4: Wf 4; in the other, each node also keeps a copy of a buddy node's workflow, e.g. Node 1: Wf 1 + Wf 2, Node 2: Wf 2 + Wf 1. A minimal buddy-checkpoint sketch follows.)
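
A minimal sketch in C with MPI, illustrating the buddy/data-pooling idea above (not the site's actual scheme): paired ranks exchange checkpoint data so either one can warm-start the other's workflow.

```c
/* Buddy checkpointing sketch: each rank keeps its own checkpoint and a
 * copy of its buddy's. Assumes an even number of ranks. */
#include <mpi.h>
#include <stdio.h>

#define STATE_SIZE 4   /* tiny stand-in for real application state */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Pair ranks (0,1), (2,3), ... */
    int buddy = (rank % 2 == 0) ? rank + 1 : rank - 1;

    double my_state[STATE_SIZE], buddy_state[STATE_SIZE];
    for (int i = 0; i < STATE_SIZE; ++i)
        my_state[i] = rank * 100.0 + i;          /* pretend checkpoint data */

    /* Exchange checkpoints with the buddy (would land in NVRAM in practice). */
    MPI_Sendrecv(my_state, STATE_SIZE, MPI_DOUBLE, buddy, 0,
                 buddy_state, STATE_SIZE, MPI_DOUBLE, buddy, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d holds its own state and a copy of rank %d's (first value %.0f)\n",
           rank, buddy, buddy_state[0]);

    MPI_Finalize();
    return 0;
}
```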

32 Energy Aware/Efficiency

33 EURORA PRACE prototype experience: 3,200 MFLOPS/W, 30 kW, #1 in the Green500 list of June 2013. Address today's HPC constraints: Flops/Watt, Flops/m2, Flops/Dollar. Efficient cooling technology: hot-water cooling (free cooling); measure power efficiency, evaluate PUE & TCO. Improve application performance at the same rate as in the past (~Moore's law); new programming models. Evaluate hybrid (accelerated) technology: Intel Xeon Phi, NVIDIA Kepler. Custom interconnect technology: 3D torus network (FPGA), evaluation of accelerator-to-accelerator communication. Hardware: 64 compute cards, 128 Xeon SandyBridge (2.1 GHz, 95 W and 3.1 GHz, 150 W), 16 GByte DDR3 1600 MHz per node, 160 GByte SSD per node, 1 FPGA (Altera Stratix V) per node, IB QDR interconnect, 3D torus interconnect, 128 accelerator cards (NVIDIA K20 and Intel Phi).

34 Monitoring infrastructure. Data collection front-end: powerDAM (LRZ) for monitoring and energy accounting; Matlab for modelling and feature extraction. Data collection back-end: node stats (Intel CPUs, Intel MIC, NVIDIA GPUs; 12-20 ms overhead, update every 5 s), rack stats (power distribution unit), room stats (cooling and power supply), job stats (PBS accounting).

35 What we collect. PMU (per core): CPI, HW frequency, load, temperature. RAPL (per CPU): Pcores, Ppackage, Pdram. GPU (NVML API): die temperature, die power, GPU load, memory load, GPU/SMEM/MEM frequency. MIC (Ganglia): temperature (VRs, 3 card sensors, die & GDDR), 7 power measurement points, core & GDDR frequency, core & memory load, PCIe link bandwidth.
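
A minimal sketch of reading package energy the way RAPL exposes it on Linux, assuming the powercap sysfs interface (/sys/class/powercap/intel-rapl:0/energy_uj) is present and readable; this is an illustration, not the deck's monitoring code.

```c
/* Sample the RAPL package energy counter twice and report average power. */
#include <stdio.h>
#include <unistd.h>

static long long read_energy_uj(const char *path)
{
    long long uj = -1;
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (fscanf(f, "%lld", &uj) != 1) uj = -1;
    fclose(f);
    return uj;
}

int main(void)
{
    const char *path = "/sys/class/powercap/intel-rapl:0/energy_uj";

    long long e0 = read_energy_uj(path);
    sleep(1);                                 /* 1 s sampling interval */
    long long e1 = read_energy_uj(path);

    if (e0 < 0 || e1 < 0) {
        fprintf(stderr, "RAPL sysfs not available or not readable\n");
        return 1;
    }
    /* Counter is in microjoules; counter wrap-around is ignored in this sketch. */
    printf("package power ~ %.2f W\n", (e1 - e0) / 1e6);
    return 0;
}
```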

36 Eurora at work: CPU frequency (map of node ID vs core ID, average frequency over 24 h; inactive nodes, 2.1 GHz CPUs and 3.1 GHz CPUs are visible). The on-demand governor is used to track the workload, per core, with a different frequency per core; turbo boost is HW-driven; the other CPU of the pair is inactive.

37 Eurora at work: clocks per instruction (map of node ID vs core ID). CPI = non-idle clock periods / instructions executed.

38 Application Benchmarks

39 Quantum ESPRESSO Energy to Solution (PHI) Time-to-solution (right) and Energy-to-solution (left) compared between Xeon Phi and CPU only versions of QE on a single node.

40 Time-to-solution (right) and Energy-to-solution (left) compared between GPU and CPU only versions of QE on a single node Quantum ESPRESSO Energy to Solution (K20)

41

42

43 Conclusions. Exascale systems will be there. Power is the main architectural constraint. Exascale applications? Yes, but... concurrency, fault tolerance, I/O, energy awareness.

