Single Node Optimization
Computational Astrophysics
Outline
- Node Topology
- Vectorization and cache blocking
- OpenMP
- Performance Tuning Tips
Node Topology
- Consider a dual-socket Intel Haswell node
- Each socket is a NUMA (non-uniform memory access) domain
- The local memory controller is the gatekeeper for access to main memory (the local DIMMs); a first-touch sketch follows below
- Each socket has its own inclusive L3 cache
- Each core has its own private L2 cache
- Each core has its own L1 data and instruction caches
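One practical consequence of the "local memory controller is the gatekeeper" point: on Linux, physical pages are placed in the NUMA domain of the thread that first touches them. The sketch below is a minimal, hedged illustration of first-touch initialization; it uses OpenMP (covered later in these slides), assumes threads are pinned to cores, and the array name and size are made up for illustration.

  ! first_touch.f90 -- hedged sketch: initialize data with the same OpenMP
  ! loop structure that will later compute on it, so each thread's pages land
  ! in its own NUMA domain (assumes threads are pinned to cores; the name `q`
  ! and the size are illustrative only).
  program first_touch
    implicit none
    integer, parameter :: n = 50000000
    real(8), allocatable :: q(:)
    integer :: i

    allocate(q(n))           ! allocation reserves virtual pages only

    !$omp parallel do schedule(static)
    do i = 1, n
       q(i) = 0.0d0          ! first touch: each page lands near the touching thread
    end do
    !$omp end parallel do

    !$omp parallel do schedule(static)
    do i = 1, n
       q(i) = q(i) + 1.0d0   ! later work with the same schedule stays NUMA-local
    end do
    !$omp end parallel do

    print *, 'q(1) =', q(1)
  end program first_touch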
[Figure: Intel Haswell 12-core server chip, from cyberparse.co.uk]
[Figure from anandtech.com]
SIMD Vectorization
- SIMD = single instruction, multiple data
- For x86 we have:
  - SSE and revisions (128-bit vector length = 2 DP values)
  - AVX and revisions (256-bit vector length = 4 DP values)
  - AVX-512, coming on KNL (512-bit vector length = 8 DP values)
- Before MPP, SIMD was the way to get parallelization and performance
  - Cray Black Widow vector processors had 128-element vector registers
Vectorization Requirements
1. Independent operations for each iteration
   - Operations that depend on the previous iteration are called a "recurrence" and often prevent vectorization
2. Stride-1 access pattern
   - Data used in each iteration must be contiguous
   - Without restrict on pointers, this is typically why C code will not vectorize
3. Little to no conditional code in the loop body
   - x86 processors have "predicate registers" that allow conditional execution on portions of vectors, but their number is limited
Loop 1:
  DO i = 1, N
     q(i) = g(i)*b(i) + c(i)          ! independent, stride-1, no conditionals
  END DO

Loop 2:
  DO i = 1, N
     q(i) = q(i) + f * b(i)           ! q(i) is read and written only within its own iteration
  END DO

Loop 3:
  DO i = 1, N
     q(i) = 0.5 * (q(i-1) + q(i+1))   ! recurrence: q(i-1) was written by the previous iteration
  END DO

Loop 4:
  DO i = 1, N, 2
     q(i) = 0.5 * (q(i-1) + q(i+1))   ! stride-2 access; reads even elements, writes odd ones
  END DO

Loop 5:
  DO i = 1, Nx
     IF (F(i+1) .LT. 1.0E-10) F(i+1) = 0.0
     q(i) = q(i) - dtdx * (F(i+1) - F(i))   ! conditional code, and F(i+1) is stored then reused as F(i)
  END DO
Example Code
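The example code shown with this slide is not reproduced here; as a stand-in, below is a minimal sketch of a loop that meets all three requirements above, with assumed compile lines in the comments so the vectorization report can be checked. Array names and sizes are illustrative.

  ! vec_example.f90 -- hedged sketch of a vectorizable kernel.
  ! Assumed compile lines (flags as on the "Helpful Compiler Flags" slide):
  !   gfortran -O3 -fopt-info vec_example.f90
  !   ifort -O3 -opt-report vec_example.f90
  program vec_example
    implicit none
    integer, parameter :: n = 1000000
    real(8), allocatable :: q(:), b(:), c(:), g(:)
    integer :: i

    allocate(q(n), b(n), c(n), g(n))
    g = 1.0d0;  b = 2.0d0;  c = 3.0d0

    ! independent iterations, stride-1 access, no conditionals
    do i = 1, n
       q(i) = g(i)*b(i) + c(i)
    end do

    print *, 'q(n) =', q(n)
  end program vec_example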
Cache Blocking and Reuse
- Code may reuse data across multiple iterations in a loop
- Best is to keep that data in the highest level of cache until it is no longer needed
- Board example (a sketch follows below)
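Since the board example is not reproduced here, the following is a minimal, hedged cache-blocking sketch: a blocked 2D transpose. The array names, sizes, and block size are assumptions; the best block size depends on the cache sizes listed on the topology slide.

  ! blocked_transpose.f90 -- hedged sketch of cache blocking (tune `bs` so a
  ! bs x bs tile of each array fits comfortably in L1/L2; names and sizes are
  ! illustrative).
  program blocked_transpose
    implicit none
    integer, parameter :: n = 2048, bs = 64
    real(8), allocatable :: a(:,:), at(:,:)
    integer :: i, j, ii, jj

    allocate(a(n,n), at(n,n))
    a = 1.0d0

    ! loop over tiles, then over elements within a tile: the strided reads of
    ! `a` touch only bs cache lines per inner sweep, so those lines are still
    ! resident when they are reused for the next value of j
    do jj = 1, n, bs
       do ii = 1, n, bs
          do j = jj, min(jj+bs-1, n)
             do i = ii, min(ii+bs-1, n)
                at(i,j) = a(j,i)
             end do
          end do
       end do
    end do

    print *, 'at(1,n) =', at(1,n)
  end program blocked_transpose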
OpenMP Threading
- Directive-based language extension (C, C++, Fortran) that allows code to be multithreaded by the compiler
- Generally much easier to use than the pthread library or equivalent, and the code is much more portable
- Many compilers have fairly good OpenMP implementations that can scale to dozens of cores
  - Intel is quite good, but launches a "helper" thread that often gets in the way
  - GNU is fairly good, but thread synchronization performance is generally slower
  - PGI is OK
  - Cray/IBM/etc. custom compilers may perform much better on specific codes
OpenMP Threading
- User-defined parallel regions where all threads operate
- Work can be divided amongst threads either via sets of "tasks" or by giving portions of a loop to each thread
- Unsafe operations (such as stores to a shared variable) can be done with atomic operations or critical sections (see the sketch below)
- The LLNL OpenMP tutorial (llnl.gov) is a great resource!
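A minimal sketch of the points above: an explicit parallel region, loop worksharing, and a reduction used instead of unguarded shared stores (an atomic or critical section would also work, usually at higher cost). The kernel and names are illustrative; compile with -fopenmp / -openmp as on the compiler-flags slide.

  ! omp_example.f90 -- hedged sketch: parallel region + worksharing loop, with
  ! the shared accumulation handled safely by a reduction.
  program omp_example
    use omp_lib
    implicit none
    integer, parameter :: n = 1000000
    real(8), allocatable :: q(:), b(:)
    real(8) :: total
    integer :: i

    allocate(q(n), b(n))
    b = 2.0d0
    total = 0.0d0

    !$omp parallel                            ! user-defined parallel region: all threads active
    !$omp do schedule(static) reduction(+:total)
    do i = 1, n                               ! portions of the loop go to each thread
       q(i) = 0.5d0 * b(i)
       total = total + q(i)                   ! each thread sums privately, then results combine
    end do
    !$omp end do
    !$omp single                              ! one thread reports after the implicit barrier
    print *, omp_get_num_threads(), ' threads, total =', total
    !$omp end single
    !$omp end parallel
  end program omp_example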
[Figure from llnl.gov]
Put it all together!
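As a stand-in for the demo behind this slide, here is a hedged sketch that combines the pieces: OpenMP threads over blocks, cache blocking within each block, and a stride-1, recurrence-free inner loop the compiler can vectorize. This is not the code that produced the results below; the blocked matrix multiply kernel, sizes, and block size are all assumptions chosen for illustration.

  ! combined.f90 -- hedged sketch only: threading + cache blocking + a
  ! vectorizable inner loop, using c = c + a*b as the stand-in kernel.
  program combined
    implicit none
    integer, parameter :: n = 1024, bs = 64
    real(8), allocatable :: a(:,:), b(:,:), c(:,:)
    integer :: i, j, k, jj, kk

    allocate(a(n,n), b(n,n), c(n,n))
    a = 1.0d0;  b = 2.0d0;  c = 0.0d0

    !$omp parallel do schedule(static)
    do jj = 1, n, bs                      ! each thread owns whole blocks of columns of c
       do kk = 1, n, bs                   ! cache blocking: small tile of b, panels of a and c reused
          do j = jj, min(jj+bs-1, n)
             do k = kk, min(kk+bs-1, n)
                do i = 1, n               ! stride-1, independent: vectorizes
                   c(i,j) = c(i,j) + a(i,k)*b(k,j)
                end do
             end do
          end do
       end do
    end do
    !$omp end parallel do

    print *, 'c(1,1) =', c(1,1), ' (expect', 2.0d0*n, ')'
  end program combined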
Performance Results
[Table: Version/Threads (Serial and OMP at several thread counts) vs. GNU, Intel, GNU cache blocked, and Intel cache blocked builds]
Tuning Tips
- Experiment! Write multiple versions of your code with various techniques included
- Try all available compilers. What does each one do?
- It is almost always fastest to write code that can vectorize, AND it makes adding OpenMP much easier if it does
- Count your operations if you can. Measure the run time and ask yourself: am I getting 0.1% of peak or 10%? Aim for 10% or better for your "hot" loops and subroutines (worked example below)
- IT CAN ALWAYS BE FASTER!!!
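A hedged worked example of the "percent of peak" check; the clock rate and peak figure below are assumptions for illustration, not measured values. A core at 2.5 GHz issuing 4-wide double-precision FMAs (2 flops each) on two ports peaks near 2.5e9 * 4 * 2 * 2 = 40 GFLOP/s; a loop doing 2 flops per element over 1e8 elements in 0.05 s delivers 4 GFLOP/s, i.e. about 10% of that peak. The sketch below does the same arithmetic in code, using Loop 2 from the earlier slide as the timed kernel.

  ! percent_of_peak.f90 -- hedged sketch: time a kernel with a known flop
  ! count and compare against an assumed per-core peak.
  program percent_of_peak
    implicit none
    integer, parameter :: n = 10000000
    real(8), parameter :: peak_gflops = 40.0d0   ! assumed: 2.5 GHz x 4 DP lanes x 2 flops/FMA x 2 ports
    real(8), allocatable :: q(:), b(:)
    real(8) :: f, seconds, gflops
    integer :: i
    integer(8) :: c0, c1, rate

    allocate(q(n), b(n))
    q = 1.0d0;  b = 2.0d0;  f = 0.5d0

    call system_clock(c0, rate)
    do i = 1, n
       q(i) = q(i) + f*b(i)        ! 2 flops per iteration (multiply + add)
    end do
    call system_clock(c1)

    seconds = real(c1 - c0, 8) / real(rate, 8)
    gflops  = 2.0d0*real(n, 8) / seconds / 1.0d9
    print *, 'time (s), GFLOP/s, % of assumed peak:', seconds, gflops, 100.0d0*gflops/peak_gflops
    print *, 'q(1) =', q(1)        ! keeps the compiler from discarding the loop
  end program percent_of_peak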
Helpful Compiler Flags
GNU:
- -fopt-info will show what optimizations the compiler did
- -fopenmp to enable OpenMP
Intel:
- -opt-report will produce a *.optrpt file for each source file showing where vectorization was done
- -openmp to enable OpenMP
Useful Tools
objdump -l -d [executable or .o]
- Can be used to look at the generated assembly. Use with the -g compiler flag to see where in the source you are.
- Try to learn assembly basics, as it is a very useful skill when investigating performance issues. No need to learn to program in it.
PAPI
- Used to gather counters from the processor
- Requires admin to install on Linux
DDT / TAU / Intel VTune / CrayPAT
- Full performance suites for analyzing applications, finding bottlenecks, etc.