Show don’t tell: improving vectorization awareness in HPC. Mark O’Connor, VP Product Management.

Presentation transcript:

Show don’t tell: improving vectorization awareness in HPC. Mark O’Connor, VP Product Management

An Uncomfortable Truth About HPC Codes

This is a problem when CPU architectures change… Image © Guido da Rozze, CC-BY

This is a problem when CPU architectures change… The Dragon of Hell © Baltasar Vischi, CC-BY-ND

How do we help scientific users get the most out of our machines? Brave Hero: 1 code 10x faster. Wise Ruler: 100 codes 10% faster. Images © Frank Kovalchek, CC-BY

Two very different ways to improve the situation

Can we help scientific developers scalably? You can teach a man to fish, but first he must realize he is hungry. Image © Kanani, CC-BY

Communicating the benefits of optimization. This is your brain on drugs… this is your code on -O0

Show the user with a performance model they understand. “Vectorization, how does it work?”
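One way to picture the question on this slide is a toy model that counts "instructions" rather than measuring real hardware. The sketch below is an illustration only, not anything taken from the talk: `scalar_add`, `vector_add` and the `lanes` parameter are hypothetical names, and plain Python does not itself issue SIMD instructions. The value `lanes=4` is chosen because a 256-bit AVX register holds four double-precision values.

```python
def scalar_add(a, b):
    """Add element by element: one modelled instruction per element."""
    out, instructions = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        instructions += 1
    return out, instructions


def vector_add(a, b, lanes=4):
    """Add `lanes` elements per modelled instruction, as a SIMD unit would."""
    out, instructions = [], 0
    for i in range(0, len(a), lanes):
        # One modelled vector instruction handles a whole chunk of elements.
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        instructions += 1
    return out, instructions


a = list(range(10))
b = list(range(10, 20))
scalar_result, scalar_count = scalar_add(a, b)
vector_result, vector_count = vector_add(a, b, lanes=4)

print("same answer:", scalar_result == vector_result)
print("scalar instructions:", scalar_count)
print("vector instructions:", vector_count)
```

In this model the final, partly filled chunk still costs one instruction. Real compilers typically handle such remainders with a scalar clean-up loop or masked operations, so the counts here are a simplification.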

Communicating at the user’s abstraction level: out-of-order, pipelined, time per retired instruction

Explaining performance in terms of the program counter: a statistical wallclock time estimate of scalar numeric operations, AVX/AVX2 operations, memory accesses (what about indirect accesses?) and other operations (branch, logic, …), plus simple, actionable advice
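The statistical estimate described on this slide can be sketched as follows, assuming the profiler periodically samples the program counter and records the instruction found there. This is a minimal sketch under those assumptions, not the tool's actual implementation: the mnemonic table, the category names and `estimate_time_breakdown` are all hypothetical, and treating every `mov` as a memory access is a deliberate simplification.

```python
from collections import Counter

# Hypothetical mapping from sampled instruction mnemonic to report category.
CATEGORY_BY_MNEMONIC = {
    "addsd": "scalar numeric",
    "mulsd": "scalar numeric",
    "vaddpd": "vector (AVX/AVX2)",
    "vfmadd231pd": "vector (AVX/AVX2)",
    "mov": "memory access",  # simplification: some moves never touch memory
    "cmp": "other",
    "jne": "other",
}


def estimate_time_breakdown(samples, wallclock_seconds):
    """Estimate seconds spent per category from program-counter samples.

    Each sample is the mnemonic seen at one sampling tick. With evenly
    spaced ticks, a category's share of samples approximates its share
    of wallclock time.
    """
    if not samples:
        return {}
    counts = Counter(CATEGORY_BY_MNEMONIC.get(m, "other") for m in samples)
    total = len(samples)
    return {cat: wallclock_seconds * n / total for cat, n in counts.items()}


# Made-up samples for illustration: a loop dominated by scalar arithmetic.
samples = ["addsd"] * 6 + ["mov"] * 3 + ["jne"] * 1
breakdown = estimate_time_breakdown(samples, wallclock_seconds=10.0)
for category, seconds in sorted(breakdown.items()):
    print(f"{category}: {seconds:.1f} s")
```

A breakdown in which the vector category is absent while scalar numeric time dominates is the kind of result that could lead to the simple, actionable advice the slide mentions, such as checking whether the hot loop was vectorized.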

Not just for vectorization, but for MPI, I/O, memory and energy usage too

Diving deeper with Allinea Forge – where can I improve? Performance over time, slow lines of code, unvectorized loops, MPI bottlenecks, OpenMP bottlenecks… and much more!

Intel® Xeon Phi™ Knights Landing Support
Debug: first-class Intel® Xeon Phi™ support; memory debugging enhancements for HBM.
Tune and Analyze: first-class Intel® Xeon Phi™ support; investigating additional Intel® Xeon Phi™ metrics (watch this space!).
Profile: first-class Intel® Xeon Phi™ support; investigating additional Intel® Xeon Phi™ metrics (watch this space!).

Thank you! Any questions? Mark O’Connor, VP Product Management