
1 CARLA 2017 - Buenos Aires, Argentina - Sept. 20-22, 2017
Benchmarking Performance: Influence of Task Location on Cluster Throughput
Manuel Rodríguez-Pascual, José A. Moríñigo, Rafael Mayo-García
CIEMAT – ICT Division, Avda. Complutense 40, Madrid 28040, SPAIN

2 Overview: Introduction, Facility, Design of experiments, NAS Parallel Benchmarks, Results, Conclusions

3 Introduction – Prospects for HPC environments
Applications with very different requirements may coexist: CPU-, I/O- and memory-bound.
MPI & OpenMP continue to be instrumental… and also: execution time, degree of parallelism, ….
Need for scalability experiments to understand weak scaling vs. strong scaling.
A % of serial applications in clusters → they may act as perturbations.

4 Introduction – Motivation
What to do with partially-filled multicore CPUs?
How sensitive is performance to the application itself?
To what extent does the system architecture drive an application's behaviour?
….
Goal: to achieve better exploitation and cost-effective HPC (focus of this work).
Results (answers) will impact specific scheduling decisions: maximize performance, minimize energy consumption. Sweet spot?

5 Overview: Introduction, Facility, NAS Parallel Benchmarks, Design of experiments, Results, Conclusions

6 Facility – Starting point: our HPC characterization
Two families of benchmarks:
1 - System benchmarks: "raw" performance of the HPC components: OSU Micro-Benchmarks (Ohio State University), Bonnie++, STREAM, Intel Memory Latency Checker.
2 - Scientific application benchmarks: behaviour when running real applications: NAS (more on this later…).
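As a rough illustration (not part of the talk), these system benchmarks are typically invoked along the following lines; binary locations, build flags and MPI launcher details depend on the local installation and are assumptions here:

    # STREAM: sustained memory bandwidth of one node (OpenMP build)
    gcc -O3 -fopenmp stream.c -o stream
    ./stream

    # OSU Micro-Benchmarks: MPI point-to-point latency and bandwidth
    # (placement across two nodes is controlled by the MPI launcher / scheduler)
    mpirun -np 2 ./osu_latency
    mpirun -np 2 ./osu_bw

    # Bonnie++: local filesystem throughput
    bonnie++ -d /tmp -u $USER

    # Intel Memory Latency Checker: idle latency between NUMA nodes
    ./mlc --latency_matrix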

7 Facility – Cluster ACME: state-of-the-art, for research
10 nodes: 8 compute nodes (128 cores) + 2 Xeon Phi nodes.
Basic data: 2 Bull chassis with 4 nodes each; 2 nodes with one Xeon Phi each.
Each CPU node: 2 x 8-core Xeon, 32 GB DDR4, NUMA model.
(Figure on the slide: intra-node BW, inter-node BW and their ratio.)
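A quick way to confirm the per-node layout described above (2 sockets x 8 cores, NUMA) on a compute node; a minimal sketch using standard Linux tools:

    lscpu | grep -E 'Socket|Core|NUMA'   # sockets, cores per socket, NUMA nodes
    numactl --hardware                   # NUMA node sizes and distance matrix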

8 Facility – Software stack in ACME
Slurm v.16.05 (cluster manager): slurmctld runs on the login node, slurmd on the 8 compute nodes (#1 … #8); users reach the login node via ssh and jobs are queued from there.
MVAPICH v.2.2 (MPI library).
Interconnect: InfiniBand FDR.
The 2 Xeon Phi nodes (2 CPUs/node) are not used in this work.
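For reference, the state of this stack can be inspected with standard Slurm commands; the node name used below is a placeholder:

    sinfo -N -l                  # nodes and their state as reported by slurmctld
    squeue -u $USER              # jobs currently queued or running
    scontrol show node acme-01   # detailed view of one compute node (hypothetical hostname)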

9 Overview: Introduction, Facility, NAS Parallel Benchmarks, Design of experiments, Results, Conclusions

10 NAS Parallel Benchmarks
NAS Parallel Benchmarks (NPB) v. 2.0:
- Code: Fortran & MPI.
- Building blocks (7): BT, CG, LU, IS, EP, FT, MG, ranging from CPU-bound to memory-bound.
- Classes (problem size).
Usefulness – metric: computational efficiency → feedback to the scientific groups of production clusters.
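As an illustration of how one NPB kernel is built and run (following the public NPB-MPI distribution; the exact make syntax may differ for the NPB version used in the talk, and the directory and class are placeholders):

    # Build the CG kernel for class B and 16 MPI ranks
    cd NPB-MPI
    make cg CLASS=B NPROCS=16      # produces bin/cg.B.16

    # Run it with 16 MPI tasks
    mpirun -np 16 ./bin/cg.B.16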

11 Overview: Introduction, Facility, NAS Parallel Benchmarks, Design of experiments, Results, Conclusions

12 Design of experiments – Some definitions
Mapping: NAS kernels run as MPI tasks placed onto groups of cores.
Configuration, linked to the job: nN x nT = [# nodes] x [# MPI tasks].
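A minimal Slurm job script sketching one such configuration, taking nT as the total number of MPI tasks of the job (reading the definition above literally); the kernel binary and time limit are placeholders:

    #!/bin/bash
    #SBATCH --job-name=npb-cg
    #SBATCH --nodes=2          # nN = 2
    #SBATCH --ntasks=8         # nT = 8 MPI tasks
    #SBATCH --time=00:30:00

    # Launch the NPB kernel with the requested task count
    srun ./bin/cg.B.8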

13 Design of experiments – Executions under Slurm (cluster manager) setups:
Dedicated Network (reference setup): only 1 job at a time running in the cluster.
Dedicated Cores: one-to-one assignment of cores to the MPI tasks of the job.
Dedicated Nodes: an entire node is assigned to execute MPI tasks of the same job.
Dedicated Sockets: a socket executes MPI tasks of the same job (no part of another job may share it during execution).
An experiment: one NAS kernel (+ class) + a definite nN x nT configuration + a definite Slurm setup.
4,000 executions at each cluster setup; each experiment was run 10 times and averaged; standard deviation < 1%.
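The talk does not show how these setups were configured in Slurm; as a sketch, node-level exclusivity maps to a submission flag, while socket- or core-level allocation is usually a cluster-side setting:

    # Dedicated Nodes: whole nodes granted exclusively to the job
    sbatch --nodes=2 --ntasks=8 --exclusive job.sh

    # Dedicated Sockets / Dedicated Cores: allocation granularity is typically
    # fixed in slurm.conf rather than per job, e.g.
    #   SelectType=select/cons_res
    #   SelectTypeParameters=CR_Socket   # sockets are the allocation unit
    #   SelectTypeParameters=CR_Core     # cores are the allocation unit
    # The job then only requests tasks (and, optionally, their spread):
    sbatch --ntasks=8 --ntasks-per-socket=4 job.sh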

14 Design of experiments – Some data on the submitted jobs
Range of partitions: 1 (serial), 2, 4, ….. 64, 128 tasks (MPI ranks).
Averaged execution time; consistency: 3 repetitions of each experiment, then averaged.
Cost: 48,000 submitted jobs (4,000 jobs sent to the queue at once).
More than 45% of the jobs use 8 MPI tasks or fewer.
Time ratio per setup (Dedicated Nodes alone accounts for > 30%):

Setup              Time ratio (%)
Dedicated Nodes    32.6
Dedicated Sockets  25.8
Dedicated Cores    17.5
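A sketch of what the submission sweep could look like; kernel names are from the talk, while the class list, configuration range and job script are placeholders:

    for kernel in bt cg ep ft is lu mg; do
      for class in A B C; do                      # classes used are not stated in the transcript
        for ntasks in 1 2 4 8 16 32 64 128; do
          for rep in 1 2 3; do                    # 3 repetitions per experiment
            sbatch --ntasks=$ntasks run_npb.sh $kernel $class
          done
        done
      done
    done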

15 Overview: Introduction, Facility, NAS Parallel Benchmarks, Design of experiments, Results, What's next?

16 Results – Dedicated Cores setup: a realistic scenario in production
Nondimensional execution time, shown for 4, 8, 16 and 32 tasks.
(Plots on this slide: 4 tasks and 8 tasks.)

17 Results – Dedicated Cores setup (cont.)
Nondimensional execution time, shown for 4, 8, 16 and 32 tasks.
(Plots on this slide: 16 tasks and 32 tasks.)

18 Results – Focus on kernels EP and MG under the Dedicated Nodes setup
(Plots on this slide: MG and EP.)

19 Results – Focus on kernels EP and MG under the Dedicated Nodes setup (cont.)
(Plots on this slide: MG and EP; one case is annotated "Out of the box!")

20 Results – Sensitivity to cluster setup: kernels MG and EP
(Plots on this slide: MG and EP.)

21 Results – Maps of speedup vs. computational resources
How many unused resources? Example: Dedicated Nodes → performance / energy decisions.

22 Results – Maps of speedup vs. computational resources (cont.)
E.g.: 2x4 configuration under Dedicated Nodes → 4 tasks per socket: 75% unused.
(Under Dedicated Cores, the same configuration leaves 0% unused.)
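As a worked reading of those figures (the core counts are inferred, not stated on the slide): under Dedicated Nodes the job is granted two complete nodes, i.e. 2 x 16 = 32 cores, while only 8 of them execute MPI tasks, so 24/32 = 75% of the allocation sits idle; under Dedicated Cores the same job is granted exactly the 8 cores it uses, hence 0% idle.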

23 Conclusions
NAS has been executed in a systematic way. Computational efficiency?
No single rule regarding the grouping of MPI tasks.
Statistical patterns depend on the cluster manager (Slurm) setup.
Interestingly, in ACME under the Dedicated Cores setup there are more cases in which grouping improves the speedup → good for production scenarios.
More experiments are needed to be conclusive!

24 Conclusions – In perspective…
Execution of NAS is a first step → initial characterization of our HPC facility.
Ongoing: analysis of scientific (production) codes. MPI-based codes: LAMMPS (MD), OpenFOAM (CFD), FFTW, miniFE (FEM solver - NERSC), …
How does the picture change?

25 Thank you!
CIEMAT – Avda. Complutense, 40 – Madrid, SPAIN
{manuel.rodriguez, josea.morinigo,
Take a look at our group site!

