Performance Tools developed in IBM Haifa
Gad Haber
IBM Haifa Labs © 2005 IBM Corporation

HRL Performance Tools

- FDPR-Pro
  - Feedback-based optimizer operating on binary executable files
  - Part of AIX 5L
  - Available on Linux on Power via alphaWorks
  - Under development for:
    - Mac OS X (to be available soon via alphaWorks)
    - z/OS
- CodeAnalyzer
  - Eclipse plugin tool for analyzing executable files
  - Under development
  - To be added as part of the Performance Workbench (PerfWB)
- BProber
  - Utility for instrumenting binary executable files
  - Under development
- ESTO
  - Utility for identifying the optimal set of optimization options
  - Under development

FDPR-Pro: Feedback Directed Program Restructuring

- Uses a global view of the entire program
- Operates on the executable file after linkage
- These properties enable FDPR-Pro to perform:
  - Global code reordering
  - Optimizations across procedure boundaries
  - Static data rearrangement
  - Constant area rearrangement
  - Data prefetching
- Examples of additional FDPR-Pro optimizations:
  - Use of branch tables
  - Use of TOC load instructions
  - More...
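Global code reordering can be illustrated with a small profile-driven layout example. The slides do not describe FDPR-Pro's actual algorithm, so the Python sketch below swaps in a simplified greedy chaining heuristic (in the spirit of classic profile-guided code positioning): it merges basic blocks along the hottest control-flow edges first, so that frequently taken edges become fall-throughs. The block labels, the `edge_counts` format, and the function name are all assumptions made for illustration.

```python
def reorder_blocks(blocks, edge_counts):
    """Greedy chaining: place blocks joined by hot edges next to each other.

    blocks:      basic-block labels in their original layout order.
    edge_counts: dict mapping (src, dst) -> profiled execution count.
    Returns a new layout order (list of labels).
    """
    # Every block starts out as its own single-element chain.
    chain_of = {b: [b] for b in blocks}

    # Visit edges from hottest to coldest.
    for (src, dst), _count in sorted(edge_counts.items(),
                                     key=lambda kv: -kv[1]):
        a, b = chain_of[src], chain_of[dst]
        # Merge only when src ends one chain and dst begins another,
        # so the edge src -> dst becomes a fall-through.
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a

    # Emit chains in the order their first member appears originally.
    layout, emitted = [], set()
    for blk in blocks:
        chain = chain_of[blk]
        if id(chain) not in emitted:
            emitted.add(id(chain))
            layout.extend(chain)
    return layout


# Hypothetical diamond: A branches to B (rare) or C (common); both reach D.
edges = {("A", "B"): 5, ("A", "C"): 95, ("B", "D"): 5, ("C", "D"): 95}
new_layout = reorder_blocks(["A", "B", "C", "D"], edges)
```

In this made-up example the hot path A, C, D becomes contiguous and the rarely executed block B moves to the end. A real binary optimizer also has to fix up branch targets and handle indirect branches, which this sketch ignores.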

Method

- Phase 1: Code instrumentation
  - At the basic-block level
- Phase 2: Profile information gathering
  - Selection of the "right" input set (a representative workload)
  - Accumulation over several input sets
- Phase 3: Global code and data optimizations
  - Complements the compiler
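The accumulation step in Phase 2 can be pictured as summing per-basic-block execution counters over several profiling runs. The following Python sketch assumes a hypothetical in-memory profile (a mapping from block label to execution count); FDPR-Pro's real profile file format is not described in these slides, and the labels and numbers are made up.

```python
from collections import Counter


def accumulate_profiles(runs):
    """Sum per-basic-block execution counts over several profiling runs.

    runs: iterable of dicts, each mapping a (hypothetical) basic-block
          label to how often it executed for one input set.
    """
    total = Counter()
    for run in runs:
        total.update(run)  # Counter.update adds counts; it does not replace
    return total


def hottest_blocks(profile, n):
    """Return the n most frequently executed block labels, hottest first."""
    return [block for block, _count in profile.most_common(n)]


# Two illustrative input sets.
run_a = {"main.entry": 1, "loop.body": 900, "error.path": 0}
run_b = {"main.entry": 1, "loop.body": 450, "error.path": 3}
profile = accumulate_profiles([run_a, run_b])
```

Accumulating over more than one input set reduces the risk of tuning the layout to a single, unrepresentative workload.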

FDPR-Pro Optimization Options

Option                        Description
-RC                           Reorder code
-bf                           Branch folding
-bp                           Branch prediction bit setting
-align                        Code alignment
-nop                          Eliminate nop instructions
-uce                          Unreachable code elimination
-hco_resched                  Hot/cold instruction scheduling
-RD, -build_dcg               Static data reordering
-tocload, -reduce_toc         TOC load optimizations
-si, -ipht, -ihf, -isf        Aggressive function inlining options
-ptrgl_optimization           Optimize function calls via pointers
-dcbt_optimization            Inject data prefetching instructions
-link_reg_optimization        Eliminate stores/restores of the link register
-volatile_regs                Eliminate stores/restores using available volatile registers
-killed_regs                  Eliminate stores/restores of killed registers
-load_after_store             Separate frequent loads and stores to the same address
-loop_unroll                  Loop unrolling
-stack_opt                    Reduce the stack frame size of hot functions
-dce                          Dead code elimination
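To make one of these options concrete: unreachable code elimination (-uce) is, conceptually, a reachability walk over the control flow graph starting from the program's entry points. The Python sketch below shows that idea on a hypothetical graph given as a dictionary of successor lists. It is not FDPR-Pro's implementation; a real post-link tool must also be conservative about indirect branches and address-taken code.

```python
def reachable_blocks(cfg, entries):
    """Return the set of blocks reachable from any entry point.

    cfg:     dict mapping a block label to a list of successor labels.
    entries: iterable of entry-point labels.
    """
    seen = set()
    worklist = list(entries)
    while worklist:
        block = worklist.pop()
        if block in seen:
            continue
        seen.add(block)
        worklist.extend(cfg.get(block, []))
    return seen


def eliminate_unreachable(cfg, entries):
    """Drop every block that no entry point can reach."""
    live = reachable_blocks(cfg, entries)
    return {block: succs for block, succs in cfg.items() if block in live}


# Hypothetical graph: "orphan" has no predecessor and is not an entry.
cfg = {
    "entry": ["check"],
    "check": ["work", "exit"],
    "work": ["check"],
    "orphan": ["exit"],
    "exit": [],
}
pruned = eliminate_unreachable(cfg, ["entry"])
```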

CodeAnalyzer: Motivation

- Architectures are becoming more complex
- Detecting potential performance bottlenecks in a given program using hardware simulators alone is hard
- There is a need for performance tools that can statically analyze and visualize programs for a given platform design, for use by:
  - Hardware architects
  - Compiler writers
  - Application developers

CodeAnalyzer

- CodeAnalyzer is an Eclipse plugin that performs comprehensive static analysis of executable files and DLLs
- It relies on the FDPR-Pro tool for the analysis phase
- CodeAnalyzer displays the analyzed information together with profiling data collected by:
  - tprof
  - FDPR-Pro
- The code is then colored according to:
  - Frequency counters, gathered by FDPR-Pro
  - Hardware event ticks, gathered by tprof

CodeAnalyzer (continued)

- Provides several views of the input binary:
  - Assembly instructions
  - Basic blocks
  - Procedures
  - CSECT modules
  - Control flow graph
  - Hot loops
  - Call graph
  - Annotated source code
  - Dispatch group formation
  - Pipeline slots and functional units

CodeAnalyzer (continued): eight further slides with no transcribed text.

CodeAnalyzer: Performance Comments

- Performance comments displayed by CodeAnalyzer
  - Comments that do not require profiling:
    - Pipeline stalls for the Power architecture
    - Unreachable code and unused data
  - Profile-based comments:
    - Invariant instructions within hot loops
    - Hot function calls preceded by overwriting of non-volatile registers
    - Hot saves and restores of registers that could be relocated to cold spill areas
    - Hot instructions that could be scheduled to colder areas of the code
    - Removable hot branches:
      - Hot direct unconditional branches
      - Hot direct conditional branches that are taken and have a colder fall-through
    - Hot call sites that are appropriate candidates for function inlining
    - Hot call sites that are appropriate for function specialization
    - Hot loops that are appropriate for loop unrolling
    - Hot TOC load instructions that can be replaced by add-immediate instructions
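One of the profile-based comments, a hot conditional branch that is usually taken while its fall-through is colder, can be expressed as a simple filter over branch profile records. The Python sketch below assumes a hypothetical record layout (`CondBranch`) and a user-chosen hotness threshold; neither comes from CodeAnalyzer itself, and the addresses and counts are made up.

```python
from dataclasses import dataclass


@dataclass
class CondBranch:
    """Hypothetical profile record for one direct conditional branch."""
    address: int
    taken_count: int
    fallthrough_count: int


def taken_with_colder_fallthrough(branches, hot_threshold):
    """Select hot branches whose taken path outweighs the fall-through.

    Such branches are candidates for inverting the condition and
    swapping the two successors, so the hot path falls through.
    """
    return [
        b for b in branches
        if b.taken_count >= hot_threshold
        and b.taken_count > b.fallthrough_count
    ]


branches = [
    CondBranch(0x1000, taken_count=980, fallthrough_count=20),  # hot, taken
    CondBranch(0x1040, taken_count=10, fallthrough_count=990),  # falls through
    CondBranch(0x1080, taken_count=3, fallthrough_count=1),     # cold
]
flagged = taken_with_colder_fallthrough(branches, hot_threshold=100)
```

Only the first record is flagged here: the second already falls through on its hot path, and the third is too cold to matter.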

Performance Workbench (PerfWB)

- CodeAnalyzer is part of the Performance Workbench (PerfWB) utility
- PerfWB is a collection of Eclipse plugins that provide performance monitoring, tuning, and analysis
- PerfWB consists of the following Eclipse plugins:
  - ProcMon: system-level monitoring tool that displays system state and monitors running processes and threads
  - E-Tune: visualizer of feedback information produced by tprof
  - CodeAnalyzer: performance analyzer of executables and DLLs

ProcMon (no text transcribed for this slide)

E-Tune with CodeAnalyzer (no text transcribed for this slide)