Josef Weidendorfer KDE Developer Conference 2004 Ludwigsburg, Germany.


Performance Optimization – Simulation and Real Measurement
Josef Weidendorfer, Ludwigsburg, Germany 2004

• Introduction
• Performance Analysis
• Profiling Tools: Examples & Demo
• KCachegrind: Visualizing Results
• What’s to come …

• Why Performance Analysis in KDE?
  – Key to useful Optimizations
  – Responsive Applications required for Acceptance
  – Not everybody owns a 3 GHz Machine
• About Me
  – Supporter of KDE since the Beginning (“KAbalone”)
  – Currently at TU Munich, working on Cache Optimization for Numerical Code & Tools

• Introduction
• Performance Analysis
  – Basics, Terms and Methods
  – Hardware Support
• Profiling Tools: Examples & Demo
• KCachegrind: Visualizing Results
• What’s to come …

• Why to use…
  – Locate Code Regions for Optimizations (Calls to time-intensive Library Functions)
  – Check Assumptions on Runtime Behavior (same Paint Operation multiple times?)
  – Find the best Algorithm among Alternatives for a given Problem
  – Get Knowledge about unknown Code (includes used Libraries like KDE-Libs/Qt)

• How to do…
• At End of (fully tested) Implementation
• On Compiler-Optimized Release Version
• With typical/representative Input Data
• Steps of Optimization Cycle
  Start → Measurement → Locate Bottleneck → Modify Code → Check for Improvement (Runtime) → Improvement Satisfying?
  Yes: Finished
  No: repeat the Cycle

• Performance Bottlenecks (sequential)
  – Logical Errors: Functions called too often
  – Algorithms with bad Complexity or Implementation
  – Bad Memory Access Behavior (Bad Layout, Low Locality)
  – Lots of (conditional) Jumps, Lots of (unnecessary) Data Dependencies, ...
Too low-level for GUI Applications?

• Wanted:
  – Time Partitioning with
    • Reason for Performance Loss (Stall because of…)
    • Detailed Relation to Source (Code, Data Structure)
  – Runtime Numbers
    • Call Relationships, Call Counts
    • Loop Iterations, Jump Counts
  – No Perturbation of Results because of Measurement

• Trace: Stream of Time-Stamped Events
• Enter/Leave of Code Region, Actions, …
  Example: Dynamic Call Tree
• Huge Amount of Data (Linear to Runtime)
• Unneeded for Sequential Analysis (?)
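To make the trace idea concrete, here is a minimal Python sketch that turns a stream of time-stamped enter/leave events into a dynamic call tree. The event format `(timestamp, kind, function)`, the `CallNode` class, and the sample trace are assumptions made for illustration; they are not taken from any of the tools named in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class CallNode:
    """One node of the dynamic call tree: a single invocation of a function."""
    name: str
    enter: int
    leave: int = 0
    children: list = field(default_factory=list)


def build_call_tree(trace):
    """Build a dynamic call tree from (timestamp, kind, function) events."""
    root = CallNode("<root>", enter=trace[0][0] if trace else 0)
    stack = [root]
    for timestamp, kind, name in trace:
        if kind == "enter":
            node = CallNode(name, enter=timestamp)
            stack[-1].children.append(node)
            stack.append(node)
        elif kind == "leave":
            node = stack.pop()
            if node.name != name:
                raise ValueError(f"unbalanced trace at {timestamp}: {name}")
            node.leave = timestamp
    root.leave = trace[-1][0] if trace else 0
    return root


# Hypothetical trace: main calls paint twice, and paint calls draw each time.
TRACE = [
    (0, "enter", "main"),
    (1, "enter", "paint"), (2, "enter", "draw"),
    (5, "leave", "draw"), (6, "leave", "paint"),
    (7, "enter", "paint"), (8, "enter", "draw"),
    (11, "leave", "draw"), (12, "leave", "paint"),
    (13, "leave", "main"),
]

tree = build_call_tree(TRACE)
```

Every invocation gets its own node, so the tree grows with the length of the run. That is the "linear to runtime" data volume the slide warns about.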

• Profiling (e.g. Time Partitioning)
  – Summary over Execution
    • Exclusive, Inclusive Cost / Time, Counters
    • Example: DCT (Dynamic Call Tree) → DCG (Dynamic Call Graph)
  – Amount of Data Linear to Code Size
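The reduction from a call tree to a call graph with exclusive and inclusive cost can be sketched as follows. This is an illustrative Python sketch, not how any particular profiler stores its data; the tuple layout `(function, inclusive time, children)` and the sample numbers are assumptions.

```python
from collections import defaultdict

# Hypothetical dynamic call tree: (function, inclusive time, child invocations).
CALL_TREE = ("main", 13, [
    ("paint", 5, [("draw", 3, [])]),
    ("paint", 5, [("draw", 3, [])]),
])


def summarize(tree):
    """Collapse a call tree into per-function costs and call-graph edges."""
    inclusive = defaultdict(int)
    exclusive = defaultdict(int)
    call_counts = defaultdict(int)

    def visit(node, caller, active):
        name, incl, children = node
        if caller is not None:
            call_counts[(caller, name)] += 1
        # Exclusive cost: time spent in the function itself, without callees.
        exclusive[name] += incl - sum(child[1] for child in children)
        # Add inclusive cost only for the outermost active invocation,
        # so that recursive functions are not counted twice.
        if name not in active:
            inclusive[name] += incl
        for child in children:
            visit(child, name, active | {name})

    visit(tree, None, frozenset())
    return dict(inclusive), dict(exclusive), dict(call_counts)


inclusive, exclusive, call_counts = summarize(CALL_TREE)
```

The result holds one entry per function and per (caller, callee) edge no matter how long the program ran, which is why the amount of data is linear to code size.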

• Precise Measurements
  – Increment Counter (Array) on Event
  – Attribute Counters to
    • Code / Data
  – Data Reduction Possibilities
    • Selection (Event Type, Code/Data Range)
    • Online Processing (Compression, …)
  – Needs Instrumentation (Measurement Code)
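As a rough illustration of "increment a counter on an event", the Python sketch below wraps functions with a small piece of measurement code. The decorator name `instrumented`, the `event_counts` table, and the two sample functions are hypothetical; real tools insert comparable code at source, compile, or binary level.

```python
import functools
from collections import Counter

# One counter per (event type, function), i.e. counters attributed to code.
event_counts = Counter()


def instrumented(func):
    """Wrap func with measurement code that counts every entry."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        event_counts[("call", func.__name__)] += 1
        return func(*args, **kwargs)
    return wrapper


@instrumented
def draw_item(item):
    return item * 2


@instrumented
def paint(items):
    return [draw_item(i) for i in items]


paint([1, 2, 3])
paint([4])
```

The counts are exact, but every call now pays for the extra bookkeeping. That overhead is the cost of precise measurement.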

– Manual
– Source Instrumentation
– Library Version with Instrumentation
– Compiler
– Binary Editing
– Runtime Instrumentation / Compiler
– Runtime Injection

• Statistical Measurement (“Sampling”)
  – TBS (Time Based), EBS (Event Based)
  – Assumption: Event Distribution over Code Approximated by checking every N-th Event
  – Similar Way for Iterative Code: Measure only every N-th Iteration
• Data Reduction Tunable
  – Compromise between Quality/Overhead
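The sampling assumption can be tried out on a synthetic event stream. In this Python sketch the stream, the function names, and the sampling interval of 97 are all made up for illustration; only every N-th event is inspected, and each sample is scaled by N to estimate the full distribution.

```python
from collections import Counter
from itertools import islice


def sample_every_nth(events, n):
    """Estimate per-function event counts by looking at every n-th event only."""
    samples = Counter(islice(events, n - 1, None, n))
    return {name: count * n for name, count in samples.items()}


# Hypothetical event stream: 70% of events in draw, 20% in layout, 10% in parse.
events = (["draw"] * 7 + ["layout"] * 2 + ["parse"]) * 1000

exact = Counter(events)
estimate = sample_every_nth(events, 97)
```

A larger N lowers the overhead but makes the estimate coarser. An interval that lines up with a periodic pattern in the program can also bias the result, which is one reason to avoid round intervals.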

• Simulation
  – Events for (not existent) HW Models
  – Results not influenced by Measurement
  – Compromise Quality / Slowdown
    • Rough Model = High Discrepancy to Reality
    • Detailed Model = Best Match to Reality
      But: Reality (CPU) often unknown…
  – Allows for Architecture Parameter Studies
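A deliberately rough hardware model is enough to show what simulation offers. The Python sketch below models a direct-mapped cache and counts misses for two traversal orders of a matrix. The cache size, line size, and matrix dimensions are assumptions, and the model is far simpler than a real cache simulator.

```python
def simulate_direct_mapped(addresses, cache_bytes=1024, line_bytes=32):
    """Count misses of a direct-mapped cache for a sequence of byte addresses."""
    num_lines = cache_bytes // line_bytes
    resident = [None] * num_lines      # memory line currently held in each slot
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        slot = line % num_lines
        if resident[slot] != line:
            resident[slot] = line
            misses += 1
    return misses


ROWS, COLS, ELEM = 64, 64, 4   # hypothetical 64x64 matrix of 4-byte elements

# Same data, two access orders: row by row versus column by column.
row_major = [(r * COLS + c) * ELEM for r in range(ROWS) for c in range(COLS)]
col_major = [(r * COLS + c) * ELEM for c in range(COLS) for r in range(ROWS)]

row_misses = simulate_direct_mapped(row_major)
col_misses = simulate_direct_mapped(col_major)

# Parameter study: the same access pattern with a doubled line size.
row_misses_64 = simulate_direct_mapped(row_major, line_bytes=64)
```

The simulated program runs unchanged and is not disturbed by the measurement. Changing `cache_bytes` or `line_bytes` gives the kind of architecture parameter study mentioned above.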

• Monitor Hardware
  – Event Sensors (in CPU, on Board)
  – Event Processing / Collection / Storing
    • Best: Separate HW
    • Compromise: Use Same Resources after Data Reduction
  – Most CPUs nowadays include Performance Counters

• Multiple Event Sensors
  – ALU Utilization, Branch Prediction, Cache Events (L1/L2/TLB), Bus Utilization
• Processing Hardware
  – Counter Registers
    • Itanium2: 4, Pentium-4: 18, Opteron: 8, Athlon: 4, Pentium-II/III/M: 2, Alpha 21164: 3

• Two Uses:
  – Read
    • Get Precise Count of Events in Code Regions by Enter/Leave Instrumentation
  – Interrupt on Overflow
    • Allows Statistical Sampling
    • Handler Gets Process State & Restarts Counter
• Both can have Overhead
• Often Difficult to Understand
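The two uses of a counter register can be contrasted with a toy model. The `EventCounter` class below is a Python stand-in invented for illustration, not an interface to real performance counters, and the region names and event counts are assumptions.

```python
from collections import Counter


class EventCounter:
    """Toy model of one hardware counter register (not a real CPU interface)."""

    def __init__(self, overflow_at=None, on_overflow=None):
        self.value = 0
        self.overflow_at = overflow_at
        self.on_overflow = on_overflow

    def count(self, state):
        """Register one event; 'state' stands for the current program position."""
        self.value += 1
        if self.overflow_at is not None and self.value >= self.overflow_at:
            self.on_overflow(state)    # handler gets the process state
            self.value = 0             # and the counter is restarted

    def read(self):
        return self.value


# Hypothetical program: three code regions raising a fixed number of events.
REGIONS = (("parse", 30), ("layout", 50), ("paint", 120))

# Use 1: read the counter on enter and leave of each region.
reader = EventCounter()
exact = {}
for region, events in REGIONS:
    before = reader.read()                     # enter instrumentation
    for _ in range(events):
        reader.count(region)
    exact[region] = reader.read() - before     # leave instrumentation

# Use 2: interrupt on overflow; the handler records where the program was.
samples = Counter()
sampler = EventCounter(overflow_at=10,
                       on_overflow=lambda state: samples.update([state]))
for region, events in REGIONS:
    for _ in range(events):
        sampler.count(region)
estimate = {region: hits * 10 for region, hits in samples.items()}
```

Reading needs instrumentation around every region of interest. The overflow handler needs none, but it delivers only a statistical picture whose resolution depends on the overflow threshold.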

• Introduction
• Performance Analysis
• Profiling Tools: Examples & Demo
  – Callgrind/Calltree
  – OProfile
• KCachegrind: Visualizing Results
• What’s to come …

• Read Hardware Performance Counters
  – Specific: PerfCtr (x86), Pfmon (Itanium), perfex (SGI)
  – Portable: PAPI, PCL
• Statistical Sampling
  – PAPI, Pfmon (Itanium), OProfile (Linux), VTune (commercial - Intel), Prof/GProf (TBS)
• Instrumentation
  – GProf, Pixie (HP/SGI), VTune (Intel)
  – DynaProf (Using DynInst), Valgrind (x86 Simulation)

• GProf (Compiler generated Instr.):
• Function Entries increment Call Counter for (caller, called)-Tuple
• Combined with Time Based Sampling
• Compile with “gcc -pg ...”
• Run creates “gmon.out”
• Analyse with “gprof ...”
• Overhead still around 100%!
• Available with GCC on UNIX

• Callgrind/Calltree (Linux/x86), GPL
  – Cache Simulator using Valgrind
  – Builds up Dynamic Call Graph
  – Comfortable Runtime Instrumentation
• Disadvantages
  – Time Estimation Inaccurate (No Simulation of modern CPU Characteristics!)
  – Only User-Level

• Callgrind/Calltree (Linux/x86), GPL
  – Run with “callgrind prog”
  – Generates “callgrind.out.xxx”
  – Results with “callgrind_annotate” or “kcachegrind”
  – Cope with Slowness of Simulation:
    • Switch off Cache Simulation: --simulate-cache=no
    • Use “Fast Forward”: --instr-atstart=no / callgrind_control -i on
• DEMO: KHTML Rendering…

• OProfile
  – Configure (as Root: oprof_start, ~/.oprofile/daemonrc)
  – Start the OProfile daemon (opcontrol -s)
  – Run your code
  – Flush Measurement, Stop daemon (opcontrol -d/-h)
  – Use tools to analyze the profiling data
    opreport: Breakdown of CPU time by procedures (better: opreport -gdf | op2calltree)
• DEMO: KHTML Rendering…

• Introduction
• Performance Analysis
• Profiling Tools: Examples & Demo
• KCachegrind: Visualizing Results
  – Data Model, GUI Elements, Basic Usage
  – DEMO
• What’s to come …

• Hierarchy of Cost Items (=Code Relations)
  – Profile Measurement Data
  – Profile Data Dumps
  – Function Groups: Source files, Shared Libs, C++ classes
  – Functions
  – Source Lines
  – Assembler Instructions

• List of Functions / Function Groups
• Visualizations for an Activated Function
• DEMO

• Introduction
• Performance Analysis
• Profiling Tools: Examples & Demo
• KCachegrind: Visualizing Results
• What’s to come …
  – Callgrind
  – KCachegrind

• Callgrind
  – Freely definable User Costs (“MyCost += arg1” on Entering MyFunc)
  – Relation of Events to Data Objects/Structures
  – More Optional Simulation (TLB, HW Prefetch)

• KCachegrind
  – Supplement Sampling Data with Inclusive Cost via Call-Graph from Simulation
  – Comparison of Measurements
  – Plugins for
    • Interactive Control of Profiling Tools
    • Visualizations
    • Visualizations for Data Relation

THANKS FOR LISTENING