Multi-core Programming VTune Analyzer Basics. 2 Basics of VTune™ Performance Analyzer Topics What is the VTune™ Performance Analyzer? Performance tuning.

Slides:



Advertisements
Similar presentations
Processes and Threads Chapter 3 and 4 Operating Systems: Internals and Design Principles, 6/E William Stallings Patricia Roy Manatee Community College,
Advertisements

Module 13: Performance Tuning. Overview Performance tuning methodologies Instance level Database level Application level Overview of tools and techniques.
3.1 Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition Process An operating system executes a variety of programs: Batch system.
Profiling your application with Intel VTune at NERSC
Intel® performance analyze tools Nikita Panov Idrisov Renat.
WHAT IS AN OPERATING SYSTEM? An interface between users and hardware - an environment "architecture ” Allows convenient usage; hides the tedious stuff.
Overview Motivations Basic static and dynamic optimization methods ADAPT Dynamo.
® IBM Software Group © 2010 IBM Corporation What’s New in Profiling & Code Coverage RAD V8 April 21, 2011 Kathy Chan
Last update: August 9, 2002 CodeTest Embedded Software Verification Tools By Advanced Microsystems Corporation.
MCTS GUIDE TO MICROSOFT WINDOWS 7 Chapter 10 Performance Tuning.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
Thursday, June 08, 2006 The number of UNIX installations has grown to 10, with more expected. The UNIX Programmer's Manual, 2nd Edition, June, 1972.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment Chapter 11: Monitoring Server Performance.
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 3: Processes.
Operating Systems (CSCI2413) Lecture 3 Processes phones off (please)
OllyDbg Debuger.
Hands-On Microsoft Windows Server 2008 Chapter 11 Server and Network Monitoring.
Windows Server 2008 Chapter 11 Last Update
Page 1 © 2001 Hewlett-Packard Company Tools for Measuring System and Application Performance Introduction GlancePlus Introduction Glance Motif Glance Character.
1 Chapter Overview Monitoring Server Performance Monitoring Shared Resources Microsoft Windows 2000 Auditing.
Chapter 17: Watching Your System BAI617. Chapter Topics Working With Event Viewer Performance Monitor Resource Monitor.
Multi-core Programming Thread Profiler. 2 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Topics Look at Intel® Thread Profiler features.
MCTS Guide to Microsoft Windows Vista Chapter 11 Performance Tuning.
MCTS Guide to Microsoft Windows 7
Hp education services education.hp.com 33 GlancePlus Version B.02 H4262S Module 3 Slides.
Process Management. Processes Process Concept Process Scheduling Operations on Processes Interprocess Communication Examples of IPC Systems Communication.
CCS APPS CODE COVERAGE. CCS APPS Code Coverage Definition: –The amount of code within a program that is exercised Uses: –Important for discovering code.
Software Performance Analysis Using CodeAnalyst for Windows Sherry Hurwitz SW Applications Manager SRD Advanced Micro Devices Lei.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 11: Monitoring Server Performance.
Performance Monitoring on the Intel ® Itanium ® 2 Processor CGO’04 Tutorial 3/21/04 CK. Luk Massachusetts Microprocessor Design.
Application performance and communication profiles of M3DC1_3D on NERSC babbage KNC with 16 MPI Ranks Thanh Phung, Intel TCAR Woo-Sun Yang, NERSC.
Windows 2000 Course Summary Computing Department, Lancaster University, UK.
Tools - Implementation Options - Chapter15 slide 1 FPGA Tools Course Implementation Options.
ACMSE’04, ALDepartment of Electrical and Computer Engineering - UAH Execution Characteristics of SPEC CPU2000 Benchmarks: Intel C++ vs. Microsoft VC++
Srihari Makineni & Ravi Iyer Communications Technology Lab
* Third party brands and names are the property of their respective owners. Performance Tuning Linux* Applications LinuxWorld Conference & Expo Gary Carleton.
NVTune Kenneth Hurley. NVIDIA CONFIDENTIAL NVTune Overview What issues are we trying to solve? Games and applications need to have high frame rates Answer.
Intel Software Development Products. ZJU-Intel Embedded Technology Center VTune ™ Performance Analyzer  Helps you identify.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 11: Monitoring Server Performance.
Agilent Technologies Copyright 1999 H7211A+221 v Capture Filters, Logging, and Subnets: Module Objectives Create capture filters that control whether.
Performance Counters on Intel® Core™ 2 Duo Xeon® Processors Michael D’Mello
1: Operating Systems Overview 1 Jerry Breecher Fall, 2004 CLARK UNIVERSITY CS215 OPERATING SYSTEMS OVERVIEW.
Tool Visualizations, Metrics, and Profiled Entities Overview [Brief Version] Adam Leko HCS Research Laboratory University of Florida.
Processor Architecture
Profiling Tools Introduction to Computer System, Fall (PPI, FDU) Vtune & GProfile.
Multithreaded Programing. Outline Overview of threads Threads Multithreaded Models  Many-to-One  One-to-One  Many-to-Many Thread Libraries  Pthread.
Full and Para Virtualization
Process Description and Control Chapter 3. Source Modified slides from Missouri U. of Science and Tech.
A Software Performance Monitoring Tool Daniele Francesco Kruse March 2010.
Join us on Twitter: #AU2013 Building Well-Performing Autodesk® AutoCAD® Applications Albert Szilvasy Software Architect.
1 Process Description and Control Chapter 3. 2 Process A program in execution An instance of a program running on a computer The entity that can be assigned.
Teaching Digital Logic courses with Altera Technology
*Pentium is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries Performance Monitoring.
Tuning Threaded Code with Intel® Parallel Amplifier.
Beyond Application Profiling to System Aware Analysis Elena Laskavaia, QNX Bill Graham, QNX.
Confessions of a Performance Monitor Hardware Designer Workshop on Hardware Performance Monitor Design HPCA February 2005 Jim Callister Intel Corporation.
July 10, 2016ISA's, Compilers, and Assembly1 CS232 roadmap In the first 3 quarters of the class, we have covered 1.Understanding the relationship between.
SQL Database Management
Using the VTune Analyzer on Multithreaded Applications
AESA – Module 8: Using Dashboards and Data Monitors
MCTS Guide to Microsoft Windows 7
Process Management Presented By Aditya Gupta Assistant Professor
Threads and Locks.
What we need to be able to count to tune programs
CSCI 315 Operating Systems Design
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
Dynamic Binary Translators and Instrumenters
In Today’s Class.. General Kernel Responsibilities Kernel Organization
Chapter 3: Process Management
Presentation transcript:

Multi-core Programming VTune Analyzer Basics

2 Basics of VTune™ Performance Analyzer Topics What is the VTune™ Performance Analyzer? Performance tuning concepts Using the sampling collector How sampling works Sampling Over Time Call Graph

3 Basics of VTune™ Performance Analyzer VTune™ Performance Analyzer Helps you identify and characterize performance issues by: – Collecting performance data from the system running your application. – Organizing and displaying the data in a variety of interactive views, from system-wide down to source code or processor instruction perspective. – Identifying potential performance issues and suggesting improvements.

4 Basics of VTune™ Performance Analyzer Supported Environments Local and remote data collection Profile applications that are running on the system that has the analyzer installed on it, or Run profiling experiments on other systems that are running VTune analyzer remote agents on them

5 Basics of VTune™ Performance Analyzer Local Performance Analysis Intel ® IA-32 Processors – Microsoft Windows* operating systems – Red Hat Linux* – SuSE Linux Itanium ® Family Processors – Microsoft Windows operating systems – Red Hat Linux – SuSE Linux For specific operating systems versions, see the release notes

6 Basics of VTune™ Performance Analyzer Host/Target Environment VTune™ Performance Analyzer supports remote data collection VTune™ Performance Analyzer installed on host system Remote agent installed on target system Host System Windows* operating system Controls target View results of data collection Target System IA-32 or Itanium ® processor family Windows or Linux* Intel ® PXA2xx processors running Windows CE* LAN Connection

7 Basics of VTune™ Performance Analyzer Feature Overview Sampling Call graph

8 Basics of VTune™ Performance Analyzer Sampling Collects System-wide Performance Data VTune™ Analyzer Features and Usage Models

9 Basics of VTune™ Performance Analyzer Sampling Over Time Views Show How Sampling Data Changes Over Time VTune™ Analyzer Features and Usage Models

10 Basics of VTune™ Performance Analyzer Sampling Source View Displays Source Code Annotated with Performance Data VTune™ Analyzer Features and Usage Models

11 Basics of VTune™ Performance Analyzer Call Graph Collects and Displays Information About the Program Flow of the Application VTune™ Analyzer Features and Usage Models

12 Basics of VTune™ Performance Analyzer What Is a Hotspot? Where in an application or system there is a significant amount of activity – Where = address in memory => OS process => OS thread => executable file or module => user function (requires symbols) => line of source code (requires symbols with line numbers) or processor (assembly) instruction – Significant = activity that occurs infrequently probably does not have much impact on system performance – Activity = time spent or other internal processor event Examples of other events: Cache misses, branch mispredictions, floating-point instructions retired, partial register stalls, and so on.

13 Basics of VTune™ Performance Analyzer Sampling: The Statistical Method of Finding Hotspots The sampling collector – Periodically interrupts the processor Time-based Event-based: Triggered by the occurrence of a certain number of microarchitectural events – Collects the execution context Execution address in memory (CS:IP) Operating system process and thread ID Executable module loaded at that address – If you have symbols for the module, post-processing can identify the function or method at the memory address. – Line numbers from the symbol file can direct you to the relevant line of source code.

14 Basics of VTune™ Performance Analyzer Sampling Collector Periodically interrupt the processor to obtain the execution context – Time-based sampling (TBS) is triggered by: Operating system timer services Every n processor clockticks – Event-based sampling (EBS) is triggered by processor event counter overflow These events are processor-specific, like L2 cache misses, branch mispredictions, floating-point instructions retired, and so on

15 Basics of VTune™ Performance Analyzer Sampling Results by Operating System Process This operating system process that has the most clockticks samples Sorted by Clockticks Sorted by Clockticks VTune™ Analyzer Sampling Collector

16 Basics of VTune™ Performance Analyzer Click here for OS Process view. Click here for OS Process view. Display only Clockticks sample data. VTune™ Analyzer Sampling Collector Click here for Over Time view. Click here for Over Time view.

17 Basics of VTune™ Performance Analyzer This view was filtered by selecting only one item from the process view. VTune™ Analyzer Sampling Collector Click here to break down by CPU. Click here to break down by CPU.

18 Basics of VTune™ Performance Analyzer Table View: Selection Summary of line is highlighted in table. VTune™ Analyzer Sampling Collector

19 Basics of VTune™ Performance Analyzer Hotspot view of one module for all OS processes and threads grouped by function (or method). VTune™ Analyzer Sampling Collector

20 Basics of VTune™ Performance Analyzer VTune™ Analyzer Sampling Collector

21 Basics of VTune™ Performance Analyzer Totals for Source Line Activity at Instruction Locations VTune™ Analyzer Sampling Collector Click here for disassembly view. Click here for disassembly view.

22 Basics of VTune™ Performance Analyzer Totals for Source Line Activity at Instruction Locations VTune™ Analyzer Sampling Collector Red time intervals have more samples in them. Zoom In Zoom Out Select Event

23 Basics of VTune™ Performance Analyzer Three Key Benefits of Sampling You do not have to modify your code. – But DO compile/link with symbols and line numbers. – But DO make release builds with optimizations. Sampling is system-wide. – Not just YOUR application. – You can see activity in operating system code, including drivers. Sampling overhead is very low. – Validity is highest when perturbation is low. – Overhead can be reduced further by turning off progress meters in the user interface. How else can you reduce sampling overhead?

24 Basics of VTune™ Performance Analyzer How Event-based Sampling (EBS) Works Conceptual Diagram Select Event Signal “Sample After” Number Count Down Underflow to Zero Internal Interrupt Controller § Interrupt CPU to Take Sample How do you choose a “Sample After” number?

25 Basics of VTune™ Performance Analyzer How Many Samples Are Enough? One million samples for a five-second run? – Do you have enough samples for it to be statistically significant? – How much overhead are you causing? What if you only get 100 samples? – Is your sample after number 1? – Are you getting a good profile? About 1,000 samples per second is a good balance between significance and overhead

26 Basics of VTune™ Performance Analyzer Objective: 1,000 Samples Per Second What is the sample after value for clockticks? – Dependent upon CPU clock speed – ANSWER: CPU clock speed in KHz If CPU clock speed = 1,400,000,000 Hz Sample after 1,400,000 clockticks What is the sample after value for L2 cache read misses? – It depends on how often you miss the L2 cache! Circular definition? Is not that what you are trying to determine? – Make an intelligent guess! Estimate! More or less often than the clockticks? 10 times? 100 times? 1000 times?

27 Basics of VTune™ Performance Analyzer Calibration Sets the sample after value to get a reasonable number of samples. – ~1000 samples per second per logical CPU Requires the workload to be run twice Manual Calibration: – Uncheck Calibrate Sample After value Found on Advanced Activity Configuration dialog – Start with default value or an estimate – Run a test – Modify the sample after value and re-test – Try to get about a 1000 samples per second per logical CPU

28 Basics of VTune™ Performance Analyzer Sampling Over Time Shows how sample distributions change over time by process, thread, or module Zoom in on time regions Useful for: – Identifying time-variant performance characteristics – Understanding thread behavior

29 Basics of VTune™ Performance Analyzer Sampling Over Time Usage Model Collect sampling data Select items of interest from either the process, thread, or modules view Click Highlight region of interest Click Click to see process/thread/address histogram for time region

30 Basics of VTune™ Performance Analyzer Call Graph Profiling Tracks the function entry and exit points of your code at run time Uses binary instrumentation Uses this data to determine program flow, critical functions and call sequences Not system-wide: Only profiles code in applications call path in Ring 3

31 Basics of VTune™ Performance Analyzer What Can You Profile? Win32 applications Stand-alone Win32* DLLs Stand-alone COM+ DLLs Java applications.NET* applications ASP.NET applications Linux32* applications

32 Basics of VTune™ Performance Analyzer Call Graph View The red lines show the critical path. The critical path is the most time- consuming call path. It is based on self time. Bright orange nodes indicate functions with the highest self time. Filter view by self time

33 Basics of VTune™ Performance Analyzer Call Graph Navigation Window Use the graph navigation window for an overview of the entire call graph.

34 Basics of VTune™ Performance Analyzer Switch between call list and call graph views here. Call Graph Call List View

35 Basics of VTune™ Performance Analyzer Call Graph Metrics Performance Metric Description Self TimeTotal time in a function, excluding time spent in its children (includes wait time) Total TimeTime measured from a function entry to exit point Total Wait TimeTime spent in a function and its children when the thread is blocked Wait TimeTime spent in a function when the thread is blocked (excludes blocked time in its children) CallsNumber of times the function is called

36 Basics of VTune™ Performance Analyzer Sampling Versus Call Graph SamplingCall graph Low overheadHigher overhead System-wideRing 3 only on your application call tree System-wide address histogramShow function level hierarchy with call counts, times, and the critical path For function level drill-down, must have debug information Must re-link with /fixed:no, automatically instruments Can sample based on time and other processor events Results are based on time

37 Basics of VTune™ Performance Analyzer Java* and.NET* Applications Provides performance data for both managed code and unmanaged code Gives insight into how managed code calls translate into Win32* calls Uses managed code profiling API and binary instrumentation

38 Basics of VTune™ Performance Analyzer Basics of VTune™ Performance Analyzer What’s Been Covered You can use the different profilers in the VTune™ analyzer to understand the different aspects of the performance of your application.