CARLA 2017 - Buenos Aires, Argentina - Sept. 20-22, 2017

Benchmarking Performance: Influence of Task Location on Cluster Throughput

Manuel Rodríguez-Pascual, José A. Moríñigo, Rafael Mayo-García
CIEMAT – ICT Division
Avda. Complutense 40, Madrid 28040, SPAIN

Overview
- Introduction
- Facility
- NAS Parallel Benchmarks
- Design of experiments
- Results
- Conclusions

Introduction: Prospects for HPC environments
- Applications with very different requirements may coexist: CPU-, I/O- and memory-bound.
- MPI & OpenMP continue to be instrumental, and applications also differ in execution time, degree of parallelism, ...
- Scalability experiments are needed to understand behaviour: weak scaling and strong scaling (standard definitions sketched below).
- A share of serial applications in clusters may act as perturbations.
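For reference, these are the standard definitions behind weak- and strong-scaling experiments (generic notation, not taken from the slides), with T(p) the execution time on p tasks:

```latex
% Strong scaling: total problem size fixed while p grows
S(p) = \frac{T(1)}{T(p)}, \qquad E_{\mathrm{strong}}(p) = \frac{S(p)}{p}

% Weak scaling: problem size per task fixed (total size grows with p)
E_{\mathrm{weak}}(p) = \frac{T(1)}{T(p)}
```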

Introduction: Motivation
- What to do with partially filled multicore CPUs?
- How sensitive is performance to the application itself?
- To what extent does the system architecture drive application behaviour?
- Goal: better exploitation and cost-effective HPC (the focus of this work).
- Results (answers) will impact specific scheduling decisions: maximize performance, minimize energy consumption. Where is the sweet spot?

Overview
- Introduction
- Facility
- NAS Parallel Benchmarks
- Design of experiments
- Results
- Conclusions

Facility: Starting point, our HPC characterization
Two families of benchmarks:
1. System benchmarks ("raw" performance of HPC components): OSU Micro-Benchmarks (Ohio State University), Bonnie++, STREAM, Intel Memory Latency Checker.
2. Scientific application benchmarks (behaviour when running real applications): NAS (more on this later...).
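As an illustration of how this raw-performance characterization is typically launched (a sketch only: binary names, paths and sizes are assumptions, not taken from the slides):

```bash
# OSU Micro-Benchmarks: MPI point-to-point latency and bandwidth between two ranks
mpirun -np 2 ./osu_latency
mpirun -np 2 ./osu_bw

# Bonnie++: disk/file-system throughput on a scratch directory
bonnie++ -d /tmp/bench -s 64g

# STREAM: sustainable memory bandwidth of one node (array sized to exceed the caches)
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream
OMP_NUM_THREADS=16 ./stream

# Intel Memory Latency Checker: default run reports idle latencies and bandwidths
./mlc
```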

Facility: Cluster ACME (state-of-the-art, for research)
10 nodes:
- 8 compute nodes (128 cores)
- 2 Xeon Phi nodes
Intra-node BW / inter-node BW ratio ≈ 3-3.5
Basic data:
- 2 Bull chassis with 4 nodes each
- 2 nodes with one Xeon Phi each
- Each CPU node: 2 x 8-core Xeon E5-2640 @ 2.6 GHz, 32 GB DDR4 RAM @ 2133 MHz, NUMA model

Facility: Software stack in ACME
- Slurm v.16.05 (cluster manager): slurmctld on the login node (ssh access), jobs queued and dispatched to the 8 compute nodes (#1 to #8), each running slurmd.
- MVAPICH v.2.2 (MPI library).
- InfiniBand FDR interconnect.
- The 2 Xeon Phi nodes (2 CPUs/node) are not used in this study.
[Slide diagram: login node with slurmctld and job queue feeding the 8 compute nodes over InfiniBand FDR.]

Overview
- Introduction
- Facility
- NAS Parallel Benchmarks
- Design of experiments
- Results
- Conclusions

NAS Parallel Benchmarks (NPB) v. 2.0
- Code: Fortran & MPI.
- Building blocks (7 kernels): BT, CG, LU, IS, EP, FT, MG, ranging from CPU-bound to memory-bound.
- Classes set the problem size.
- Usefulness metric: computational efficiency → feedback to the scientific groups of production clusters.
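A minimal sketch of how an NPB-MPI kernel is built and checked interactively (directory layout and build configuration are assumptions; the actual setup on ACME is not shown in the slides):

```bash
cd NPB-MPI                                    # assumed location of the NPB MPI source tree
cp config/make.def.template config/make.def   # set the MPI Fortran compiler and flags here

# Build one kernel for a given class (problem size) and MPI task count,
# e.g. MG, class B, 16 ranks -> bin/mg.B.16
make mg CLASS=B NPROCS=16

# Quick check outside the batch system
mpirun -np 16 bin/mg.B.16
```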

Overview
- Introduction
- Facility
- NAS Parallel Benchmarks
- Design of experiments
- Results
- Conclusions

Design of experiments: Some definitions
- Mapping: NAS kernels run as MPI tasks mapped onto groups of cores.
- Configuration (linked to the job): nN x nT = [# nodes] x [# MPI tasks].

Design of experiments: Executions under Slurm (cluster manager) setups
- Dedicated Network (reference setup): only one job runs in the cluster at a time.
- Dedicated Cores: one-to-one assignment of cores to the MPI tasks of the job.
- Dedicated Nodes: an entire node is assigned to MPI tasks of the same job.
- Dedicated Sockets: a socket executes MPI tasks of the same job only (no part of another job may share it during execution).
An experiment: one NAS kernel (and class) + a definite nN x nT configuration + a definite Slurm setup.
4,000 executions at each cluster setup; each point is the average of 10 executions, with standard deviation < 1%.
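As an example of how one such experiment might be submitted (a sketch under assumptions: the actual batch scripts and slurm.conf of ACME are not shown in the slides), a 2-node x 16-task job under the Dedicated Nodes setup could look like:

```bash
#!/bin/bash
#SBATCH --job-name=npb-mg-B-2x16
#SBATCH --nodes=2          # nN
#SBATCH --ntasks=16        # nT (MPI ranks)
#SBATCH --exclusive        # whole nodes reserved for this job ("Dedicated Nodes" style)
#SBATCH --time=00:30:00

srun ./bin/mg.B.16
```

Core- and socket-level granularity (the Dedicated Cores and Dedicated Sockets setups) is normally configured on the Slurm side, e.g. SelectType=select/cons_res with SelectTypeParameters=CR_Core or CR_Socket in slurm.conf.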

Design of experiments: Some data on the submitted jobs
- Range of partitions: 1 (serial), 2, 4, ..., 64, 128 tasks (MPI ranks).
- More than 45% of the jobs use 8 MPI tasks or fewer.
- Metric: averaged execution time.
- Consistency: 3 repetitions of each experiment, then averaged.
- Cost: 48,000 submitted jobs (4,000 jobs sent to the queue at once).
Time ratio (%) by setup (Dedicated Nodes exceeds 30%):
  Setup              Time ratio (%)
  Dedicated Nodes    32.6
  Dedicated Sockets  25.8
  Dedicated Cores    17.5
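A campaign of 48,000 jobs suggests scripted submission; a minimal, hypothetical sweep generator (kernel/class lists, script name and repetition count are illustrative assumptions, and run_npb.sh is assumed to pick the right binary and call srun):

```bash
#!/bin/bash
KERNELS="bt cg lu is ep ft mg"
CLASSES="A B C"
TASKS="1 2 4 8 16 32 64 128"
REPS=3                                   # repetitions per experiment, later averaged

for k in $KERNELS; do
  for c in $CLASSES; do
    for t in $TASKS; do
      for r in $(seq 1 $REPS); do
        sbatch --ntasks="$t" run_npb.sh "$k" "$c" "$t"
      done
    done
  done
done
```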

Overview
- Introduction
- Facility
- NAS Parallel Benchmarks
- Design of experiments
- Results
- What's next?

Results: Dedicated Cores setup (a realistic scenario in production)
- Nondimensional execution time (4, 8, 16, 32 tasks).
[Figure: panels for 4 tasks and 8 tasks.]
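The slide does not spell out the normalization; one plausible reading (an assumption), given that Dedicated Network is labelled the reference setup, is that each execution time is divided by the reference run at the same configuration:

```latex
\tilde{t}_{\mathrm{setup}}(nN \times nT) \;=\;
  \frac{t_{\mathrm{setup}}(nN \times nT)}{t_{\mathrm{Dedicated\ Network}}(nN \times nT)}
```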

Results: Dedicated Cores setup (cont.)
- Nondimensional execution time (4, 8, 16, 32 tasks).
[Figure: panels for 16 tasks and 32 tasks.]

Results: Focus on kernels EP and MG under the Dedicated Nodes setup
[Figure: panels for MG and EP.]

Results: Focus on kernels EP and MG under the Dedicated Nodes setup (cont.)
[Figure: panels for MG and EP; slide annotation: "Out of the box!"]

Results: Sensitivity to cluster setup, kernels MG and EP
[Figure: panels for MG and EP.]

Results: Maps of speedup vs. computational resources
- How many unused resources? Example: Dedicated Nodes → input to performance / energy decisions.

Results: Maps of speedup vs. computational resources (cont.)
- E.g., a 2x4 configuration under Dedicated Nodes → 4 tasks per socket: 75% of the reserved cores are unused (under Dedicated Cores, the same configuration leaves 0% unused). A worked calculation is sketched below.
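A worked version of the 75% figure, reading 2x4 as 2 nodes with 4 MPI tasks per node (8 tasks in total, 4 per socket when packed onto one socket of each node); this reading is an assumption made to match the numbers on the slide:

```latex
\mathrm{unused} \;=\; 1 - \frac{\#\,\mathrm{tasks\ placed}}{\#\,\mathrm{cores\ reserved}}
  \;=\; 1 - \frac{8}{2 \times 16} \;=\; 0.75
```

Under Dedicated Cores only the 8 requested cores are reserved, so the unused fraction is 0, as stated on the slide.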

Conclusions
- NAS has been executed in a systematic way. What about computational efficiency?
- There is no single rule regarding the grouping of MPI tasks.
- Statistical patterns depend on the cluster manager (Slurm) setups.
- Interestingly, in ACME under the Dedicated Cores setup there are more cases in which grouping improves speedup → good for production scenarios.
- More experiments are needed to be conclusive!

Conclusions: In perspective...
- Execution of NAS is a first step → initial characterization of our HPC facility.
- Ongoing: analysis of scientific (production) MPI-based codes: LAMMPS (MD), OpenFOAM (CFD), FFTW, miniFE (FEM solver - NERSC), ...
- How does the picture change?

Thank you!
CIEMAT – Avda. Complutense, 40 – 28040 Madrid, SPAIN
e-mail: {manuel.rodriguez, josea.morinigo, rafael.mayo} @ciemat.es
Take a look at our group site! http://rdgroups.ciemat.es/web/sci-track/