Houssam Haitof Technische Universität München

Presentation transcript:

Houssam Haitof Technische Universität München

AutoTune AutoTune's objective is to combine performance analysis and tuning into a single tool. AutoTune will extend Periscope with automatic online tuning plugins for performance and energy-efficiency tuning; the result will be the Periscope Tuning Framework. The whole tuning process (automatic performance analysis and automatic tuning) will be executed online.

AutoTune Automatic application tuning – Performance and energy Parallel architectures – HPC and parallel servers – Homogeneous and heterogeneous – Multicore and GPU-accelerated systems – Reproducible execution capabilities Variety of parallel paradigms – MPI, HMPP, parallel patterns

Partners Technische Universität München Universität Wien CAPS Entreprises Universitat Autònoma de Barcelona Leibniz Computing Centre National University of Galway, ICHEC

Periscope Performance Analysis Toolkit
Distributed architecture
– Analysis performed by multiple distributed hierarchical agents
– Enables locality: processing of performance data close to its point of generation
Iterative online analysis
– Measurements are configured, obtained, and evaluated while the program is running – no tracing
Automatic bottleneck search
– Based on the knowledge of performance optimization experts
Enhanced GUI
– Eclipse-based integrated development and performance analysis environment
Instrumentation
– Fortran, C/C++
– Automatic overhead control

Periscope Performance Analysis Toolkit
Automation in Periscope is based on formalized performance properties
– e.g., inefficient cache usage or load imbalance
– Periscope automatically checks which measurements are required, which properties were found, and which properties to look for in the next step
The overall search for performance problems is determined by search strategies
– A search strategy defines in which order an analysis agent investigates the multidimensional search space of properties, program regions, and processes. Periscope provides search strategies for single-node performance (e.g., searching for inefficient use of the memory hierarchy), MPI, and OpenMP.
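To make the search-space idea concrete, here is a minimal, purely illustrative C sketch of a strategy walking the property/region/process dimensions and reporting only severe findings; the property names, region and process counts, and the measure_severity() helper are hypothetical and are not the Periscope API.

/* Illustrative sketch only: hypothetical names, not the Periscope API.
 * A "strategy" here is simply the order of the nested loops over the
 * property / region / process search space. */
#include <stdio.h>

static double measure_severity(const char *prop, int region, int process) {
    (void)prop;
    /* placeholder for a real measurement request to the monitoring layer */
    return (region + process) % 3 == 0 ? 42.0 : 5.0;
}

int main(void) {
    const char *properties[] = { "StallCycles", "LoadImbalance" };
    const double threshold = 30.0;                 /* report only severe findings */

    for (int p = 0; p < 2; ++p)                    /* property dimension */
        for (int region = 0; region < 4; ++region) /* program regions    */
            for (int proc = 0; proc < 2; ++proc) { /* processes          */
                double sev = measure_severity(properties[p], region, proc);
                if (sev > threshold)
                    printf("%s in region %d, process %d: severity %.1f\n",
                           properties[p], region, proc, sev);
            }
    return 0;
}

In the real tool, the next set of candidate hypotheses is refined from what was found in the previous step rather than enumerated exhaustively.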

Periscope Performance Analysis Toolkit

Integration in Eclipse (PTP)
– Where is the problem?
– What is the most severe problem?
– Filter problems for a region

AutoTune Approach
AutoTune will follow Periscope principles
– Predefined tuning strategies combining performance analysis and tuning, online search, distributed processing, ...
Plugins (online and semi-online)
– Compiler-based optimization
– HMPP tuning for GPUs
– Parallel pattern tuning
– MPI tuning
– Energy efficiency tuning

Periscope Tuning Framework
Online
– Analysis and evaluation of tuned versions in a single application run
– Multiple versions in a single step, thanks to parallelism in the application
Result
– Tuning recommendation
– Adaptation of source code and/or execution environment
– Impact on production runs

Example Plugin A demo plugin was developed to demonstrate and validate the auto-tuning extensions to Periscope.

Compiler Flag Selection
The tuning objective is to reduce the execution time of the application's phase region by selecting the best compiler flags for the application.
Compilers can automatically apply many optimizations to the code (e.g., loop interchange, data prefetching, vectorization, and software pipelining).
It is very difficult to predict the performance impact and to select the right sequence of transformations.
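As a rough illustration of what such a flag search boils down to (not the AutoTune plugin itself), the following C driver rebuilds a hypothetical application app.c with a few candidate flag sets, times each run, and keeps the fastest; the flag list, the compiler command, and the application name are all assumptions.

/* Hypothetical flag-search driver; the flag sets, the "cc ... app.c" build
 * command, and the ./app binary are placeholders, not AutoTune's interface. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const char *flag_sets[] = {
        "-O2", "-O3", "-O3 -funroll-loops", "-O3 -march=native"
    };
    const char *best_flags = NULL;
    double best_time = 1e30;
    char cmd[256];

    for (size_t i = 0; i < sizeof flag_sets / sizeof flag_sets[0]; ++i) {
        /* rebuild the application with the candidate flags */
        snprintf(cmd, sizeof cmd, "cc %s -o app app.c", flag_sets[i]);
        if (system(cmd) != 0) continue;

        /* time one run (a real plugin would time the phase region only) */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (system("./app") != 0) continue;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

        printf("%-22s %.3f s\n", flag_sets[i], elapsed);
        if (elapsed < best_time) { best_time = elapsed; best_flags = flag_sets[i]; }
    }
    if (best_flags)
        printf("best flags: %s (%.3f s)\n", best_flags, best_time);
    return 0;
}

A real plugin would of course explore a much larger flag space and repeat measurements to control noise.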

Autotuning Extension in HMPP
Directive-based programming model supporting the OpenHMPP and OpenACC standards. The directives allow the programmer to write hardware-independent applications that can significantly speed up C and Fortran code.
Directives to provide the optimization space to explore
– Parameterized loop transformations and gridification
– Alternative/specialized code declarations to specify various implementations
Runtime API
– Optimization space description
– Static code information collection
– Dynamic information collection (e.g., timing, parameter values)
Example from the slide:
#pragma hmppcg(CUDA) unroll(RANGE), jam
for( i = 0 ; i < n; i++ ) {
  for( j = 0 ; j < n; j++ ) {
    ...
    VC(j,i) = alpha*prod + beta * VC(j,i);
  }
}

Parallel Patterns for GPGPU
Tuning of high-level patterns for single-node heterogeneous manycore architectures comprising CPUs and GPUs.
Focus on pipeline patterns, which are expressed in C/C++ programs with annotated while-loops.
Pipeline stages correspond to functions for which different implementation variants may exist.
The associated programming framework has been developed in the EU project PEPPHER.
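A hedged sketch of such a pipelined while-loop in C follows: the structure (stages as functions, one stage for which implementation variants may exist) mirrors the description above, but the pragma text shown in the comments is purely hypothetical and is not the PEPPHER/AutoTune annotation syntax.

/* Purely illustrative pipeline skeleton; the commented-out pragmas are
 * hypothetical placeholders, not the real annotation syntax. */
#include <stdio.h>

#define FRAME_SIZE 1024

/* stage 1: produce work items (here: 8 dummy frames) */
static int read_frame(double *buf) {
    static int n = 0;
    if (n++ >= 8) return 0;
    for (int i = 0; i < FRAME_SIZE; ++i) buf[i] = i;
    return 1;
}

/* stage 2: the stage for which CPU / GPU implementation variants may exist */
static void process_frame(double *buf) {
    for (int i = 0; i < FRAME_SIZE; ++i) buf[i] *= 2.0;
}

/* stage 3: consume the results */
static void write_frame(const double *buf) { printf("frame starts with %g\n", buf[0]); }

int main(void) {
    double frame[FRAME_SIZE];
    /* #pragma pipeline                    -- hypothetical: mark the while-loop */
    while (read_frame(frame)) {
        /* #pragma pipeline stage variant  -- hypothetical: per-stage tuning knob */
        process_frame(frame);
        write_frame(frame);
    }
    return 0;
}

Typical tuning knobs for such a pattern would be the implementation variant chosen per stage and how the stages are replicated and buffered, though the exact parameter set here is an assumption.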

MPI Tuning
The MPI plugin encloses a performance model (e.g., master/worker) based on Periscope-provided performance data.
Automatically generated tuning decisions are sent to the Tuner.
The Tuner dynamically modifies the application before the next experiment.
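For context, a minimal master/worker MPI program in C is sketched below; the TUNE_CHUNK_SIZE environment variable stands in for a parameter a tuner could change between experiments and is an invented name, as is the overall structure (this is not the plugin's code).

/* Minimal master/worker sketch; compile with mpicc, run with mpirun -np N. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define TOTAL_WORK 1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* hypothetical tuning knob, set by the tuner before the next experiment */
    const char *env = getenv("TUNE_CHUNK_SIZE");
    int chunk = env ? atoi(env) : 10;

    if (rank == 0) {                       /* master: hand out chunks of work */
        int next = 0, done = 0;
        while (done < size - 1) {
            int request;
            MPI_Status st;
            MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            int work = (next < TOTAL_WORK) ? chunk : -1;   /* -1 = stop */
            if (work > 0) next += chunk; else ++done;
            MPI_Send(&work, 1, MPI_INT, st.MPI_SOURCE, 1, MPI_COMM_WORLD);
        }
        printf("all work handed out in chunks of %d\n", chunk);
    } else {                               /* worker: request, compute, repeat */
        for (;;) {
            int dummy = 0, work;
            MPI_Send(&dummy, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&work, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (work < 0) break;
            /* ... compute on 'work' items ... */
        }
    }
    MPI_Finalize();
    return 0;
}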

Energy Efficiency Plugin (LRZ)
1. Preparation phase:
   – Selection of possible core frequencies
   – Selection of regions for code instrumentation
2. While-loop (until region refinement), with a for-loop over all frequencies:
   a) Set the new frequency for the tuned region
   b) Periscope analysis (instrumentation, (re-)compiling, start and stop of the experiment)
   c) Measure execution time and energy of the tuned region
   d) Evaluate the experiment
3. Evaluate the results of all experiments in the refinement loop
4. Store the best frequency-region combination
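A rough, self-contained illustration of the inner measurement loop is given below; it assumes a Linux machine with the userspace cpufreq governor and an Intel RAPL powercap counter (both sysfs paths, the frequency list, and the tuned_region() workload are assumptions, and writing the frequency normally requires root privileges). It is not the LRZ plugin's code.

/* Rough illustration of the frequency-selection loop; paths and values
 * are assumptions for a typical Linux/Intel node. */
#include <stdio.h>
#include <time.h>

static void set_frequency_khz(long khz) {
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "w");
    if (f) { fprintf(f, "%ld\n", khz); fclose(f); }
}

static long long read_energy_uj(void) {             /* RAPL package energy */
    long long uj = 0;
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (f) { if (fscanf(f, "%lld", &uj) != 1) uj = 0; fclose(f); }
    return uj;
}

static void tuned_region(void) {                     /* stand-in for the phase */
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; ++i) x += 1e-9;
}

int main(void) {
    const long freqs_khz[] = { 1200000, 1800000, 2400000 };   /* assumed list */
    double best_energy = 1e30;
    long best_freq = 0;

    for (int i = 0; i < 3; ++i) {
        set_frequency_khz(freqs_khz[i]);             /* a) set new frequency   */
        long long e0 = read_energy_uj();
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        tuned_region();                              /* b) run the experiment  */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long long e1 = read_energy_uj();             /* c) time + energy       */
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double joules = (e1 - e0) / 1e6;
        printf("%ld kHz: %.2f s, %.2f J\n", freqs_khz[i], secs, joules);
        if (joules < best_energy) { best_energy = joules; best_freq = freqs_khz[i]; }
    }
    printf("best frequency: %ld kHz\n", best_freq);  /* d) store best combo    */
    return 0;
}

The plugin drives the same kind of loop through Periscope experiments on instrumented regions rather than through direct sysfs access.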

Expected Impact Improved performance of applications Reduced power consumption of parallel systems Facilitated program development and porting Reduced time for application tuning Leadership of European performance tools groups Strengthened European HPC industry

THANK YOU

BACKUP SLIDES

Performance Analysis for Parallel Systems
Development cycle
– Assumption: reproducibility
Instrumentation
– Static vs. dynamic
– Source-level vs. binary-level
Monitoring
– Software vs. hardware
– Statistical profiles vs. event traces
Analysis
– Source-based tools
– Visualization tools
– Automatic analysis tools
(Development-cycle figure: Coding, Performance Monitoring and Analysis, Program Tuning, Production)

Properties
StallCycles(Region, Rank, Thread, Metric, Phase)
– Condition: percentage of lost cycles > 30%
– Confidence: 1.0
– Severity: percentage of lost cycles
StallCyclesIntegerLoads
– Requires access to two counters
L3MissesDominatingMemoryAccess
– Condition: importance of L3 misses (theoretical latencies)
– Severity: importance multiplied by actual stall cycles
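As an illustration of how such a condition/confidence/severity triple could be evaluated from raw counter values, here is a small C sketch for the StallCycles property; the struct layout, counter names, and dummy values are assumptions, not Periscope's implementation.

/* Sketch of evaluating a property from counter data (illustrative only). */
#include <stdio.h>

typedef struct {
    long long stall_cycles;     /* e.g. from a hardware stall-cycle counter */
    long long total_cycles;
} RegionMeasurement;

typedef struct {
    int    holds;               /* condition: lost cycles > 30 %            */
    double confidence;          /* fixed at 1.0 for this property           */
    double severity;            /* percentage of lost cycles                */
} PropertyResult;

static PropertyResult stall_cycles_property(RegionMeasurement m) {
    PropertyResult r;
    double lost = 100.0 * (double)m.stall_cycles / (double)m.total_cycles;
    r.holds = lost > 30.0;
    r.confidence = 1.0;
    r.severity = lost;
    return r;
}

int main(void) {
    RegionMeasurement m = { 420, 1000 };            /* dummy counter values */
    PropertyResult r = stall_cycles_property(m);
    if (r.holds)
        printf("StallCycles found: severity %.1f%%, confidence %.1f\n",
               r.severity, r.confidence);
    return 0;
}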

Agent Search Strategies
An application phase is a period of the program's execution
– Phase regions: the full program, or a single user region that is assumed to be repetitive
– Phase boundaries have to be global (SPMD programs)
Search strategies
– Determine the hypothesis refinement: region nesting, property hierarchy-based refinement

Agent Search Strategies
– Stall cycle analysis
– Stall cycle analysis with breadth-first search
– MPI strategy
– OpenMP strategy
– OMP scalability strategy
– Benchmarking strategy
– Default strategy

Periscope Phases
Periscope performs multiple iterative performance measurement experiments on the basis of phases:
All measurements are performed inside the phase
– Begin and end of the phase are global synchronization points
By default, the phase is the whole program
– Needs a restart if multiple experiments are required (single-core performance analysis strategies require multiple experiments)
– Unnecessary code parts are also measured
A user-specified region in Fortran files, marked with !$MON USER REGION and !$MON END USER REGION, will be used as the phase:
– Typically the main loop of the application
– No need for a restart, faster analysis
– Unnecessary code parts are not measured, so less measurement overhead
– The severity value is normalized to the main loop iteration time, giving a more precise estimate of the performance impact
(Figure: phase structure: Initialization, Measurement phase / main loop iteration, Finalization, Analysis)

Automatic search for bottlenecks
Automation based on formalized expert knowledge
– Potential performance problems → properties
– Efficient search algorithm → search strategies
Performance property
– Condition
– Confidence
– Severity
Performance analysis strategies
– Itanium2 Stall Cycle Analysis
– Westmere-EX Single Core Performance
– IBM POWER6 Single Core Performance Analysis
– MPI Communication Pattern Analysis
– Generic Memory Strategy
– OpenMP-based Performance Analysis
– Scalability Analysis – OpenMP codes

Periscope Performance Property
Based on the APART Specification Language:
– Performance property: a performance property (e.g., load imbalance, communication, cache misses, redundant computations) characterizes a specific performance behavior of a program (or part of it) and can be checked by a set of conditions. Conditions are associated with a confidence value (between 0 and 1) indicating the degree of confidence about the existence of the performance property. In addition, for every performance property a severity measure is provided, the magnitude of which specifies the importance of the property. The severity can be used to focus effort on the important performance issues during the (manual or automatic) performance tuning process. Performance properties, confidence, and severity are defined over performance-related data.
(Figure: example for one region/thread. Hypothesis: Property: cycles lost due to cache misses, Condition: PM_MISS_L1_CYC/PM_RUN_CYC > a, Confidence: 1.0, Severity: PM_MISS_L1_CYC/PM_RUN_CYC. Found property: Condition: TRUE, Confidence: 1.0, Severity: 21%)

Example Properties
Long Latency Instruction Exception
– Condition: percentage of exceptions weighted over region time > 30%
– Severity: percentage of exceptions weighted over region time
MPI Late Sender
– Automatic detection of wait patterns
– Measurement on the fly
– No tracing required!
OpenMP synchronization properties
– Critical section overhead property
– Frequent atomic property
(Figure: late-sender timeline with MPI_Send on p1 and MPI_Recv on p2)
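To make the Late Sender idea tangible, a tiny C sketch follows that computes the wait time from the begin timestamps of a matching send/receive pair; the timestamps are dummy values and the calculation reflects only the textbook definition of the pattern, not Periscope's on-the-fly detection code.

/* Illustrative Late Sender wait-time calculation with dummy timestamps. */
#include <stdio.h>

int main(void) {
    double recv_begin = 10.0;   /* time MPI_Recv was entered on p2            */
    double send_begin = 14.5;   /* time the matching MPI_Send started on p1   */

    /* the receiver waits whenever the matching send starts after the receive */
    double wait = send_begin > recv_begin ? send_begin - recv_begin : 0.0;
    printf("Late Sender wait time: %.1f time units\n", wait);
    /* severity could then be expressed as wait time relative to phase time   */
    return 0;
}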

Scalability Analysis – OpenMP codes
Identifies the OpenMP code regions that do not scale well.
The scalability analysis is done by the frontend, which restarts the application – no need to manually configure the runs and compute the speedup!
For each configuration (1, 2, …, 2^n threads), Frontend.run():
i. Starts the application
ii. Starts the analysis agents
iii. Receives the found properties
After n runs, the frontend extracts information from the found properties, performs the scalability analysis, and exports the properties for GUI-based analysis.