
Introduction to an HPC environment
George S. Markomanolis, Oriol Jorba, Kim Serradell
Belgrade, 24 September 2014

2 Objectives
– Overview of an HPC environment
– Present some High Performance Computing topics
– Describe tools, options and techniques to get better performance in your simulations
Feel free to ask whenever you want…

3 Why HPC?
HPC: High Performance Computing
Definition: High-performance computing (HPC) is the use of parallel processing for running advanced application programs efficiently, reliably and quickly.
We need HPC to compute the operations inside the models.

4 What is a Supercomputer?
Processors, blades, blade centers, racks and the network.
[Figure: core, processor, blade center, racks]

5 Mare Nostrum layout

6 Mare Nostrum 3 Hardware

7 Networks
In Mare Nostrum 3, there are three different connection networks in each compute rack:
– InfiniBand: high-speed interconnection network via fibre optic cables connecting the compute nodes, using a "Fat Tree" topology
– Ethernet for management
– Ethernet for the GPFS disk

8 Working in HPC
Constraints:
– Many users
– Many jobs
– Limited resources
We need a job submitter and scheduler that:
– Distributes jobs across the machine
– Assigns priorities
– Gives each job an id
– Maintains a waiting queue
The user submits jobs and waits for the results. Users need to learn the commands to communicate with the queue, as in the sketch below.
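For illustration, a minimal LSF job script might look like this (the job name, queue, core count, time limit and executable are assumptions, not BSC defaults):

  #!/bin/bash
  #BSUB -J nmmb_test          # job name
  #BSUB -q bsc_es             # queue to submit to
  #BSUB -n 16                 # number of cores
  #BSUB -W 01:00              # wall-clock limit (hh:mm)
  #BSUB -o output_%J.log      # standard output (%J expands to the job id)
  #BSUB -e error_%J.log       # standard error
  mpirun ./nmmb_model.exe

It is submitted with "bsub < job.lsf" and monitored with "bjobs".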

9 Working in HPC

10 Queue and Scheduling
In Mare Nostrum 3, LSF is the queue managing system:
– Distributes work across nodes
– Balances load
– Manages jobs (id, status…)
– Detects faulty nodes
– Manages queues, reservations, downtimes…
– Administration done by the Operations Team at BSC

~]$ bjobs
JOBID USER    STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
...   bsc3235 RUN  bsc_es s16r2b45  s11       *AG_AND1km Dec 4 06:...
...   bsc3235 RUN  bsc_es s16r2b45  s16       IMAG_CAT   Dec 4 07:...
...   bsc3235 RUN  xlarge login2    s10       WRF-TEST   Dec 4 08:...
...   bsc3235 RUN  xlarge login2    s03       *-WRF-2013 Dec 4 08:...
...   bsc3235 RUN  bsc_es login2    s15       *GE-NETCDF Dec 4 08:...
...   bsc3235 RUN  bsc_es login1    s15       KRIGING_AM Dec 4 09:...
...   bsc3235 RUN  xlarge s10r1b29  s03       *OPE-CAN_t Dec 4 09:...
...   bsc3235 RUN  xlarge s03r1b15  s08       *LIOPE-CAN Dec 4 09:...
...   bsc3235 RUN  bsc_es s03r1b15  16*s16    NDOWN_IP   Dec 4 09:17

11 Workflow
Two ways to manage a workflow:
– Using a script language, then waiting for files, executions…
– Using dependencies in queues. With LSF: bsub -w "ended(jobid)" < job_command (see the sketch below)
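A minimal sketch of chaining two jobs through an LSF dependency (the script names are hypothetical; extracting the id assumes the usual "Job <id> is submitted" message):

  # submit the first job and capture its numeric job id
  JOBID=$(bsub < preprocess.lsf | sed 's/[^0-9]*//g')
  # the second job starts only once the first one has ended
  bsub -w "ended($JOBID)" < model_run.lsf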

12 NMMB/BSC Forecasts at Earth Sciences
Three operational and daily forecasts:
– Global (1.4ºx1º horizontal resolution) for 120h: GFS 00h; Meteo, Dust and Sea Salt; 49 minutes for the model execution
– Regional over NA-ME-E (0.33ºx0.33º) for 72h: GFS 12h; Meteo and Dust; 71 minutes for the model execution
– Regional over NA-ME-E (0.1ºx0.1º) for 72h: GFS 12h; Meteo and Dust; 51 minutes for the model execution
Using a dedicated cluster with 16 CPUs
Using 256 CPUs in Mare Nostrum 3

13 Manage data
The Earth Sciences community can generate HUGE amounts of data. We need different filesystems to work with and store this data:
– Immediate disk: disk in the supercomputer where the model is run. Very fast disk.
– Medium-term disk: once the data is generated, we need to work with it to analyse it.
– Long-term storage: store the data in case it is reused later. Usually tapes; speed is not a constraint.
Data storage costs money!!! Before running a simulation, analyse which variables you will need later.

14 Compilers
We need to translate code written in languages that we understand (Fortran, C, C++,…) into code that the machine can understand. Each final executable will be different on each machine:
– Sometimes compatible, sometimes not.

15 Compilers (II)
Different compilers:
– GNU Compilers: gcc, gfortran
– Licensed compilers: Intel Compilers, PGI Compilers
Different results and behaviours!!! Usually, Intel compilers perform much better than others on Intel architectures, as illustrated below.
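As a sketch only (the source and binary names are assumptions), the same source built with two compilers:

  gfortran -O2 -o model_gnu   model.f90
  ifort    -O2 -o model_intel model.f90

The two binaries can differ both in speed and in the last digits of floating-point results.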

16 Different Architectures
– Distributed memory: MPI
– Shared memory: OpenMP
– Accelerators (GPU): CUDA, OpenCL, Cell, Xeon Phi…

17 Programming Models
– Message Passing (MPI)
– Shared Memory (OpenMP)
– Partitioned Global Address Space (PGAS) languages: UPC, Coarray Fortran, Titanium
– Next-generation programming languages and models: Chapel, X10, Fortress, OmpSs
– Languages and paradigms for hardware accelerators: CUDA, OpenCL
– Hybrid: MPI + OpenMP + CUDA/OpenCL
– New Xeon Phi co-processors
A minimal hybrid MPI + OpenMP example follows.
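A minimal sketch of the hybrid model, where each MPI rank spawns OpenMP threads (the compile line is illustrative; the flag is -qopenmp for Intel, -fopenmp for GNU):

  ! hybrid_hello.f90: build e.g. with  mpif90 -fopenmp hybrid_hello.f90
  program hybrid_hello
    use mpi
    use omp_lib
    implicit none
    integer :: rank, nprocs, provided, ierr
    ! ask for an MPI library that tolerates threaded callers
    call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    !$omp parallel
    print *, 'rank', rank, 'of', nprocs, 'thread', omp_get_thread_num()
    !$omp end parallel
    call MPI_Finalize(ierr)
  end program hybrid_hello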

18 BSC Programming Model
OmpSs:
– Focused on extending OpenMP with new directives to support asynchronous parallelism and heterogeneity (devices like GPUs)
– Exploits data dependencies
– Heterogeneity: reduces the complexity of working with new devices and technologies.

19 Parallelizing Atmospheric Models (MPI/OpenMP)
We need to be able to run these models on multi-core architectures. The model domain is decomposed into patches.
– Patch: portion of the model domain allocated to a distributed/shared memory node.
[Figure: patches communicating with their neighbours]

20 Parallelizing Atmospheric Models Halo exchange

21 Parallelizing Atmospheric Models
Use a "smart" domain decomposition:
– Cores for the X and Y axes
– Try to follow Y > X
– Example: if we have 60 cores, X=6 and Y=10
We want to increase computation and reduce communication time. If the communication pattern is poor, the application will not perform well and will scale poorly. A sketch of the halo exchange behind this communication follows.
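This is a minimal one-dimensional halo exchange, not the actual NMMB communication code; the field size, data type and tags are arbitrary choices for illustration:

  program halo_demo
    use mpi
    implicit none
    integer, parameter :: nloc = 100        ! local points per rank
    real :: f(0:nloc+1)                     ! one halo cell on each side
    integer :: rank, nprocs, left, right, ierr
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    ! boundary ranks talk to MPI_PROC_NULL, so no special cases below
    left  = merge(MPI_PROC_NULL, rank-1, rank == 0)
    right = merge(MPI_PROC_NULL, rank+1, rank == nprocs-1)
    f(1:nloc) = real(rank)                  ! dummy interior data
    ! send first interior point left, receive right halo from the right
    call MPI_Sendrecv(f(1),      1, MPI_REAL, left,  0, &
                      f(nloc+1), 1, MPI_REAL, right, 0, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    ! send last interior point right, receive left halo from the left
    call MPI_Sendrecv(f(nloc),   1, MPI_REAL, right, 1, &
                      f(0),      1, MPI_REAL, left,  1, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    call MPI_Finalize(ierr)
  end program halo_demo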

22 Optimizing your code
Optimizing the code means getting better performance so the models run faster. Many optimizations can be done:
– Take time to write/learn your code. Ask engineers.
– At compilation time, ask for the usual compilation flags on your machine. But be careful.
Optimizations are sometimes not for free:
– Compilation time is increased
– There is a trade-off with precision
– Use tools to improve the analysis and get better performance

23 Analyze your code
It seems obvious, but it is important to know the structure of your code. In NMMB:
– Main structure:

24 Analyze your code (II)

25 Optimization flags in Intel Fortran Compiler
Take a look at the "Quick reference guide to optimization with Intel compilers version 13":
– Step-by-step approach to optimizing your code
– Extensive explanation of compiler options
General optimization options:
– -O0: No optimization. Used during the early stages of application development and debugging.
– -O1: Optimize for size. Omits optimizations that tend to increase object size. Creates the smallest optimized code in most cases.
– -O2: Maximize speed. Default setting. Enables many optimizations, including vectorization.
– -O3: Enables -O2 optimizations, plus more aggressive loop and memory access optimizations.

26 Optimization flags in Intel Fortran Compiler (II)
Processor-specific optimization options:
– -xtarget: Generates specialized code for any Intel processor that supports the instruction set specified by target.
– -mtarget: Generates specialized code for any Intel processor or compatible, non-Intel processor that supports the instruction set specified by target.
Floating-point arithmetic options (-fp-model settings):
– -fp-model fast[=1|2]: Allows more aggressive optimizations at a slight cost in accuracy or consistency (fast=1 is the default). This may include some additional optimizations that are performed on Intel microprocessors but not on non-Intel microprocessors.
– -fp-model precise: Allows only value-safe optimizations on floating-point code.
– -fp-model double/extended/source: Intermediate results are computed in double, extended, or source precision. Implies precise unless overridden.
– -fp-model except: Enforces floating-point exception semantics.
– -fp-model strict: Enables both the precise and except options, and does not assume the default floating-point environment.

27 Optimization flags in Intel Fortran Compiler (III)
Interprocedural optimization:
– -ip: Single-file interprocedural optimizations, including selective inlining, within the current source file.
– -ipo: Permits inlining and other interprocedural optimizations among multiple source files. Long compilation time and, in my experience, it can fail at build time.
A typical combination of these flags is sketched below.
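As an illustration only (source and output names are assumptions), these options are often combined like this:

  # aggressive but value-safe floating point, tuned for the host CPU
  ifort -O3 -xHost -fp-model precise -ip -o nmmb_model model.f90
  # debug build for the early development stage
  ifort -O0 -g -traceback -o nmmb_model_dbg model.f90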

28 Optimization through Cache Hierarchy
Accessing data is an "expensive" operation: it takes more time to "move" data from memory than to execute an operation instruction. Take advantage of your cache hierarchy.
[Figure: cache hierarchy; access time increases at each level further from the core]

29 Optimization through Cache Hierarchy (II)
How to do it?
– Make use of every word of every cache line.
– Use a cache line intensively and then do not return to it later.
– Fortran stores matrices in column-major order (consecutive elements of a column are in consecutive memory locations). Using this ordering, data is copied one column at a time, which means copying adjacent memory locations in sequence. All data in the same cache line are used, and each line needs to be loaded only once. The loop-order example below shows the difference.
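A minimal sketch of the loop-order effect (array size and the scaling operation are arbitrary):

  program loop_order
    implicit none
    integer, parameter :: n = 1000
    real :: a(n,n), b(n,n)
    integer :: i, j
    call random_number(b)
    ! Good: inner index i varies fastest, matching column-major storage,
    ! so consecutive iterations touch consecutive memory locations.
    do j = 1, n
       do i = 1, n
          a(i,j) = 2.0 * b(i,j)
       end do
    end do
    ! Bad: the same work with the loops swapped strides through memory
    ! by n elements per access and misses cache far more often.
    do i = 1, n
       do j = 1, n
          a(i,j) = 2.0 * b(i,j)
       end do
    end do
    print *, a(1,1), a(n,n)   ! keep the compiler from removing the loops
  end program loop_order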

30 Scalability
How do you decide how many CPUs to use in a job?
– It depends on the scalability of the code.
– Avoid wasting resources.
– If the code does not scale, giving it twice as many CPUs will not make it twice as fast!!! Amdahl's law below makes this concrete.
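One way to quantify this is Amdahl's law: if a fraction p of the runtime is parallelizable and N CPUs are used,

  speedup(N) = 1 / ((1 - p) + p/N)

For example, with p = 0.9, going from 1 to 2 CPUs gives 1 / (0.1 + 0.45) ≈ 1.8x rather than 2x, and even with unlimited CPUs the speedup never exceeds 1 / (1 - p) = 10x.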

31 Tools
Timing your application:
– /usr/bin/time
Example output:
  real 49m54.536s
  user 788m36.269s
  sys  2m54.719s
Profiling your application:
– Compile with debug and profile support
– Use the gprof tool to analyze the results
– Focus on the functions that run the longest
A sketch of a gprof session follows.
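A minimal sketch of such a profiling session (the program name is an assumption; ifort accepts -pg for gprof-style instrumentation):

  # build with profiling instrumentation
  ifort -O2 -g -pg -o nmmb_model model.f90
  # run normally; this writes gmon.out in the working directory
  ./nmmb_model
  # print the flat profile, functions sorted by time spent
  gprof ./nmmb_model gmon.out | head -40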

32 Tools (II) In NMMB: –Internal timings

33 Tools - Paraver
Tools to analyse and visualize simulations. At BSC, we use Paraver:
– Paraver extracts information from an execution: CPU use, memory consumption, hardware counters, communication pattern
A sketch of the usual tracing workflow follows.
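A rough sketch of the usual Extrae-to-Paraver workflow (installation paths, the process count and the exact library name vary by setup and are assumptions here):

  # run the MPI application with the Extrae tracing library preloaded
  export EXTRAE_CONFIG_FILE=extrae.xml
  mpirun -np 64 env LD_PRELOAD=$EXTRAE_HOME/lib/libmpitrace.so ./nmmb_model
  # merge the per-process traces into a single Paraver trace
  mpi2prv -f TRACE.mpits -o nmmb_model.prv
  # open the trace in the Paraver GUI
  wxparaver nmmb_model.prv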

34 Tools - Paraver

35 QUESTIONS
Thank you! (Хвала!)