Performance Analysis and Optimization Tool
Andres S. CHARIF-RUBIAL, Emmanuel OSERET
Performance Analysis Team, University of Versailles

VI-HPS Introduction Performance Analysis
 Understand the performance of an application: how well it behaves on a given machine, and what the issues are
 Generally a multifaceted problem: the more complementary views you get, the better you understand
 Use techniques and tools to understand the issues
 Once understood, optimize the application

VI-HPS Introduction Compilation chain
 The compiler remains your best friend
 Be sure to select the proper flags (e.g., -xAVX)
 Pragmas: unrolling, vector alignment
 -O2 vs. -O3
 Vectorisation/optimisation reports

VI-HPS Introduction MAQAO Tool
 Open source (LGPL 3.0)
 Currently a binary release; source release by mid-December
 Available for x86-64 and Xeon Phi
 Porting MAQAO to BlueGene is planned

VI-HPS Introduction MAQAO Tool
 Easy install
 Packaging: ONE (static) standalone binary
 Easy to embed
 Audience
 User/tool developer: analysis and optimisation tool
 Performance tool developer: framework services
 TAU: tau_rewrite (MIL)
 Score-P: on-going effort (MIL)

VI-HPS Introduction MAQAO Tool

VI-HPS Introduction MAQAO Tool
 Scripting language
 Lua language: simplicity and productivity
 Fast prototyping
 MAQAO Lua API: access to services

VI-HPS Introduction MAQAO Tool
 Built on top of the framework
 Loop-centric approach
 Produces reports
 MAQAO deals with the low-level details; you get high-level reports

VI-HPS Outline
 Introduction
 Pinpointing hotspots
 Code quality analysis
 Upcoming modules

VI-HPS Pinpointing hotspots Measurement methodology
 MAQAO profiling
 Instrumentation
 Through binary rewriting
 Higher overhead / more precision
 Sampling
 Hardware counters through the perf_event_open system call
 Very low overhead / fewer details

VI-HPS Pinpointing hotspots Parallelism level
 SPMD: program level
 SIMD: instruction level
 By default MAQAO only considers system processes and threads

VI-HPS Pinpointing hotspots Parallelism level
 Displays functions and their exclusive time
 Associated callchains and their contributions
 Loops
 Innermost loops can then be analyzed by the code quality analyzer module (CQA)
 Command-line and GUI (HTML) outputs

VI-HPS Pinpointing hotspots GUI snapshot

VI-HPS Outline
 Introduction
 Pinpointing hotspots
 Code quality analysis
 Upcoming modules

VI-HPS Static performance modeling Introduction
 Main performance issues:
 Core level
 Multicore interactions
 Communications
 Most of the time, the core level is overlooked

VI-HPS Static performance modeling Goals
 Static performance model
 Targets innermost loops
 Source loop vs. assembly loop
 Takes the processor (micro)architecture into account
 Assesses code quality
 Estimates performance
 Degree of vectorization
 Impact on the micro-architecture
(Diagram: a single source loop may compile to several assembly loops, ASM Loop 1 through ASM Loop 5.)

VI-HPS Static performance modeling Model
 Simulates the target (micro)architecture
 Instruction descriptions (latency, uop dispatch...)
 Machine model
 For a given binary and micro-architecture, provides:
 Quality metrics (how well the binary fits the micro-architecture)
 Static performance (lower bounds on cycles)
 Hints and workarounds to improve static performance

VI-HPS Static performance modeling Metrics
 Vectorization (ratio and speedup)
 Predicts the vectorization speedup (where possible) and whether increasing the vectorization ratio is worthwhile
 High-latency instructions (division/square root)
 Suggests less precise but faster instructions such as RCP (1/x) and RSQRT (1/sqrt(x))
 Unrolling (unroll factor detection)
 Statically predicts performance for different unroll factors (finds main loops)

VI-HPS Static performance modeling Report example

Pathological cases
Your loop is processing FP elements but is NOT OR PARTIALLY VECTORIZED. Since your execution units are vector units, only a fully vectorized loop can use their full power. By fully vectorizing your loop, you can lower the cost of an iteration from … to 3.50 cycles (4.00x speedup). Two propositions:
- Try another compiler or update/tune your current one:
  * gcc: use -O3 or -Ofast. If targeting IA32, add -mfpmath=sse combined with -march=…, -msse or -msse2.
  * icc: use the -vec-report option to understand why your loop was not vectorized. If "existence of vector dependences", try the IVDEP directive. If, using IVDEP, "vectorization possible but seems inefficient", try the VECTOR ALWAYS directive.
- Remove inter-iteration dependences from your loop and make it unit-stride.
WARNING: Fix as many pathological cases as you can before reading the following sections.

Bottlenecks
The divide/square root unit is a bottleneck. Try to reduce the number of division or square root instructions. If you accept losing numerical precision, you can speed up your code by passing the following options to your compiler:
- gcc: (-ffast-math or -Ofast) and -mrecip
- icc: this should be done automatically by default
By removing all these bottlenecks, you can lower the cost of an iteration from … to 1.50 cycles (9.33x speedup).

VI-HPS Outline
 Introduction
 Pinpointing hotspots
 Code quality analysis
 Upcoming modules

VI-HPS Ongoing work
 Dynamic bottleneck analyzer
 Differential analysis
 Memory characterization tool
 Access patterns
 Data reshaping
 Cache simulator
 Value profiler
 Function specialization / memoizing
 Loop instance (iteration count) variations

VI-HPS MAQAO Tool
Thanks for your attention! Questions?