Beyond Automatic Performance Analysis Prof. Dr. Michael Gerndt Technische Universität München

Performance Analysis and Tuning is Essential

Performance Analysis at Scale
A high-level description of the performance of the cosmology code MADCAP. (Source: David Skinner, NERSC)

Performance Analysis for Parallel Systems
Development cycle – assumption: reproducibility
Instrumentation – static vs. dynamic – source-level vs. binary-level
Monitoring – software vs. hardware – statistical profiles vs. event traces
Analysis – source-based tools – visualization tools – automatic analysis tools
(Development-cycle diagram: Coding, Performance Monitoring and Analysis, Program Tuning, Production)

Periscope
Automated search – based on formalized performance properties
Online analysis – search performed while the application is executing
Distributed search – user-specified number of analysis agents – additional cores for the agents
Profile data only – even for MPI wait-time analysis

Properties
StallCycles(Region, Rank, Thread, Metric, Phase)
– Condition: percentage of lost cycles > 30%
– Confidence: 1.0
– Severity: percentage of lost cycles
StallCyclesIntegerLoads
– Requires access to two counters
L3MissesDominatingMemoryAccess
– Condition: importance of L3 misses (theoretical latencies)
– Severity: importance multiplied by actual stall cycles
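To make the condition/confidence/severity scheme concrete, the following is a minimal C++ sketch of how a property such as StallCycles could be formalized. The Property interface, the Context structure, and the helper names are illustrative assumptions, not the actual Periscope API.

    #include <cstdint>

    // Hypothetical measurement context for one region/rank/thread in a phase.
    struct Context {
        std::int64_t stallCycles;   // cycles lost in pipeline stalls
        std::int64_t totalCycles;   // total cycles spent in the region
    };

    // Illustrative property interface mirroring the scheme on the slide.
    class Property {
    public:
        virtual ~Property() = default;
        virtual bool   condition (const Context& c) const = 0;  // does the property hold?
        virtual double confidence(const Context& c) const = 0;  // certainty of the diagnosis
        virtual double severity  (const Context& c) const = 0;  // how much performance is lost
    };

    class StallCycles : public Property {
    public:
        bool condition(const Context& c) const override {
            return lostCyclesPercent(c) > 30.0;          // condition: lost cycles > 30%
        }
        double confidence(const Context&) const override {
            return 1.0;                                  // counter-based, hence certain
        }
        double severity(const Context& c) const override {
            return lostCyclesPercent(c);                 // severity: percentage of lost cycles
        }
    private:
        static double lostCyclesPercent(const Context& c) {
            return c.totalCycles == 0 ? 0.0
                 : 100.0 * static_cast<double>(c.stallCycles) / static_cast<double>(c.totalCycles);
        }
    };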

Periscope Design
(Architecture diagram: interactive frontend; performance analysis agent network with master agent, communication agents, and analysis agents; the application with its monitor is accessed via the MRI)

Agent Search Strategies
An application phase is a period of the program's execution
– Phase regions: the full program, or a single user region assumed to be executed repetitively
– Phase boundaries have to be global (SPMD programs)
Search strategies determine the hypothesis refinement (see the sketch below)
– Region nesting
– Property hierarchy-based refinement
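As an illustration of hypothesis refinement by region nesting, here is a hedged C++ sketch of a breadth-first strategy: candidate properties are evaluated on the phase region, insignificant findings are pruned, and significant ones are refined into the nested regions in the next experiment. The Region and Property structures, the evaluate callback, and the severity threshold are assumptions made for this example only.

    #include <functional>
    #include <queue>
    #include <vector>

    struct Region {
        std::vector<Region*> children;   // nested regions (subroutines, loops, ...)
    };

    struct Property {
        Region* region;    // region in which the property was detected
        double  severity;  // e.g. percentage of lost cycles
    };

    // The callback is assumed to evaluate the candidate properties of a region
    // from the monitoring data of one phase execution (one experiment).
    using Evaluate = std::function<std::vector<Property>(Region*)>;

    // Breadth-first refinement over the region tree.
    std::vector<Property> searchStrategy(Region* phase, double threshold,
                                         const Evaluate& evaluate) {
        std::vector<Property> found;
        std::queue<Region*> work;
        work.push(phase);
        while (!work.empty()) {
            Region* r = work.front();
            work.pop();
            for (const Property& p : evaluate(r)) {
                if (p.severity < threshold) continue;     // prune insignificant hypotheses
                found.push_back(p);
                for (Region* child : p.region->children)
                    work.push(child);                     // refine into nested regions
            }
        }
        return found;
    }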

Agent Search Strategies
– Stall cycle analysis
– Stall cycle analysis with breadth-first search
– MPI strategy
– OpenMP strategy
– OpenMP scalability strategy
– Benchmarking strategy
– Default strategy

Crystal Growth Simulation

Results
USER_REGION; cx.f:61 – IA64 pipeline stall cycles; stalls due to L1D TLB misses; L3 misses dominate memory access; stalls due to waiting for FP register
CALL_REGION; cx.f:70 – IA64 pipeline stall cycles; L3 misses dominate memory access; stalls due to L1D TLB misses; stalls due to waiting for FP register
velo; cx.f:622 – IA64 pipeline stall cycles; L3 misses dominate memory access; stalls due to L1D TLB misses; stalls due to waiting for FP register
LOOP_REGION; cx.f:731 – IA64 pipeline stall cycles; L3 misses dominate memory access; stalls due to waiting for FP register

Integration in Eclipse (PTP)
– Where is the problem?
– What is the most severe problem?
– Filter problems for a region

PerSyst: Periscope System Monitoring on the Altix 4700
– Distributed, fault-tolerant architecture
– Incremental analysis
(Agent hierarchy diagram: a high-level agent for synchronization and aggregation; analysis agents and I/O agents, one agent per partition; results are stored in a database)

PerSyst: Data Reduction
– Aggregation of data into properties
– Aggregation per job
(Agent hierarchy diagram with data volumes: 280 MB/day of raw data, reduced to < 244 MB/day and < 18 MB/day along the agent hierarchy towards the database)
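To illustrate the kind of reduction involved, here is a small C++ sketch that aggregates per-node severity samples into a single per-job property; the structures and the max/average aggregation are assumptions made for this example, not the actual PerSyst implementation.

    #include <vector>

    // One raw sample per node/partition, as an I/O or analysis agent might collect it.
    struct NodeSample {
        double severity;   // e.g. percentage of lost cycles on this node
    };

    // Aggregated per-job property stored in the database instead of the raw samples.
    struct JobProperty {
        double maxSeverity;   // worst node in the job
        double avgSeverity;   // average across all nodes of the job
    };

    JobProperty aggregate(const std::vector<NodeSample>& samples) {
        JobProperty p{0.0, 0.0};
        if (samples.empty()) return p;
        for (const NodeSample& s : samples) {
            if (s.severity > p.maxSeverity) p.maxSeverity = s.severity;
            p.avgSeverity += s.severity;
        }
        p.avgSeverity /= static_cast<double>(samples.size());
        return p;
    }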

AutoTune
Automatic application tuning – performance and energy
Parallel architectures – HPC and parallel servers – homogeneous and heterogeneous – multicore and GPU-accelerated systems – reproducible execution capabilities
Variety of parallel paradigms – MPI, HMPP, parallel patterns

Partners
– Technische Universität München
– Universität Wien
– CAPS Entreprises
– Universitat Autònoma de Barcelona
– Leibniz Computing Centre
– National University of Galway, ICHEC

AutoTune Approach
Predefined tuning strategies combining performance analysis and tuning
Plugins
– Compiler-based optimization
– HMPP tuning for GPUs
– Parallel pattern tuning
– MPI tuning
– Energy efficiency tuning

Periscope Tuning Framework
Online
– Analysis and evaluation of the tuned versions in a single application run
– Multiple versions per step, exploiting the parallelism in the application
Result
– Tuning recommendation
– Adaptation of the source code and/or the execution environment
– Impact on production runs

Autotuning Extension in HMPP
Directives to provide the optimization space to explore
– Parameterized loop transformations
– Alternative/specialized code declarations to specify various implementations
Runtime API
– Optimization space description
– Static code information collection
– Dynamic information collection (e.g. timing, parameter values)

    #pragma hmppcg(CUDA) unroll(RANGE), jam
    for( i = 0 ; i < n; i++ ) {
      for( j = 0 ; j < n; j++ ) {
        ...
        VC(j,i) = alpha*prod + beta * VC(j,i);
      }
    }

Energy Efficiency Plugin (LRZ)
1. Preparation phase: selection of possible core frequencies; selection of regions for code instrumentation
2. "While" loop (until region refinement is finished), containing a "for" loop over all frequencies (sketched below):
   a) Set the new frequency for the tuned region
   b) Periscope analysis (instrumentation, (re-)compilation, start and stop of the experiment)
   c) Measure the execution time and energy of the tuned region
   d) Evaluate the experiment
3. Evaluate the results of all experiments of the refinement loop
4. Store the best frequency-region combination
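The frequency sweep in step 2 can be illustrated with a minimal C++ sketch. The setFrequency and runExperiment hooks and the energy-delay objective are assumptions for this example; they stand in for the actual DVFS interface and the Periscope experiment control, which are not shown on the slide.

    #include <functional>
    #include <limits>
    #include <vector>

    struct Measurement {
        double seconds;  // execution time of the tuned region
        double joules;   // energy consumed by the tuned region
    };

    // Hypothetical hooks, injected so the sketch stays self-contained:
    //  - setFrequency(mhz) stands in for the DVFS interface,
    //  - runExperiment() stands in for one instrumented experiment.
    int tuneRegionFrequency(const std::vector<int>& frequenciesMHz,
                            const std::function<void(int)>& setFrequency,
                            const std::function<Measurement()>& runExperiment) {
        if (frequenciesMHz.empty()) return 0;       // no candidate frequencies
        int best = frequenciesMHz.front();
        double bestScore = std::numeric_limits<double>::max();
        for (int f : frequenciesMHz) {
            setFrequency(f);                        // step 2a: set new frequency
            Measurement m = runExperiment();        // steps 2b/2c: run and measure
            double score = m.joules * m.seconds;    // step 2d: evaluate (energy-delay product)
            if (score < bestScore) { bestScore = score; best = f; }
        }
        return best;                                // step 4: best frequency for this region
    }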

Parallel Pattern/MPI Tuning
– The PP or MPI plugin encapsulates a performance model (e.g. master/worker) based on Periscope-provided performance data (a sketch of such a model follows below)
– Automatically generated tuning decisions are sent to the Tuner
– The Tuner dynamically modifies the application before the next experiment
(Diagram: PP or MPI plugin + Tuner)
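As one illustration of such a performance model, the following C++ sketch chooses a worker count for a master/worker pattern from measured task and communication times; the simple cost formula and the parameter names are assumptions made for this example, not the model actually used by the plugin.

    // Measured inputs a plugin could derive from Periscope performance data
    // (names and granularity are illustrative assumptions).
    struct MasterWorkerData {
        double totalTaskSeconds;    // sum of the compute time of all tasks
        double perWorkerOverhead;   // communication/management cost added per worker
    };

    // Predicted runtime with w workers under a very simple master/worker model:
    // compute time is divided among the workers, overhead grows with their number.
    static double predictedTime(const MasterWorkerData& d, int w) {
        return d.totalTaskSeconds / w + d.perWorkerOverhead * w;
    }

    // Tuning decision: choose the worker count that minimizes the predicted time.
    int chooseWorkerCount(const MasterWorkerData& d, int maxWorkers) {
        int best = 1;
        for (int w = 2; w <= maxWorkers; ++w)
            if (predictedTime(d, w) < predictedTime(d, best))
                best = w;
        return best;
    }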

Expected Impact
– Improved performance of applications
– Reduced power consumption of parallel systems
– Facilitated program development and porting
– Reduced time for application tuning
– Leadership of European performance-tools groups
– Strengthened European HPC industry

THANK YOU