Software Performance Monitoring Daniele Francesco Kruse July 2010.

Slides:

Advertisements

Similar presentations

Profiler In software engineering, profiling ("program profiling", "software profiling") is a form of dynamic program analysis that measures, for example,

Advertisements

Exploring P4 Trace Cache Features Ed Carpenter Marsha Robinson Jana Wooten.

AMD OPTERON ARCHITECTURE Omar Aragon Abdel Salam Sayyad This presentation is missing the references used.

DBMSs on a Modern Processor: Where Does Time Go? Anastassia Ailamaki Joint work with David DeWitt, Mark Hill, and David Wood at the University of Wisconsin-Madison.

PERFORMANCE ANALYSIS OF MULTIPLE THREADS/CORES USING THE ULTRASPARC T1 (NIAGARA) Unique Chips and Systems (UCAS-4) Dimitris Kaseridis & Lizy K. John The.

Performance Monitoring Update Daniele Francesco Kruse April 2010.

Superscalar Organization Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.

Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.

Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.

CSIE30300 Computer Architecture Unit 10: Virtual Memory Hsin-Chou Chi [Adapted from material by and

April 27, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 23: Putting it all together: Intel Nehalem Krste Asanovic Electrical.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Putting it all together: Intel Nehalem Steve Ko Computer Sciences and Engineering University.

331 Lec20.1Fall :332:331 Computer Architecture and Assembly Language Fall 2003 Week 13 Basics of Cache [Adapted from Dave Patterson’s UCB CS152.

Memory Management 2010.

EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.

Embedded Computing From Theory to Practice November 2008 USTC Suzhou.

The PowerPC Architecture  IBM, Motorola, and Apple Alliance  Based on the IBM POWER Architecture Facilitate parallel execution Scale well with advancing.

331 Lec20.1Spring :332:331 Computer Architecture and Assembly Language Spring 2005 Week 13 Basics of Cache [Adapted from Dave Patterson’s UCB CS152.

Introduction Operating Systems’ Concepts and Structure Lecture 1 ~ Spring, 2008 ~ Spring, 2008TUCN. Operating Systems. Lecture 1.

Code Coverage Testing Using Hardware Performance Monitoring Support Alex Shye, Matthew Iyer, Vijay Janapa Reddi and Daniel A. Connors University of Colorado.

Time-predictability of a computer system Master project in progress By Wouter van der Put.

CS 152 Computer Architecture and Engineering Lecture 23: Putting it all together: Intel Nehalem Krste Asanovic Electrical Engineering and Computer Sciences.

Superscalar SMIPS Processor Andy Wright Leslie Maldonado.

Intel Architecture. Changes in architecture Software architecture: –Front end (Feature changes such as adding more graphics, changing the background colors,

Performance Counters on Intel® Core™ 2 Duo Xeon® Processors Intel® Software College.

Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

CSE431 L22 TLBs.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 22. Virtual Memory Hardware Support Mary Jane Irwin (

Performance Monitoring on the Intel ® Itanium ® 2 Processor CGO’04 Tutorial 3/21/04 CK. Luk Massachusetts Microprocessor Design.

Carnegie Mellon /18-243: Introduction to Computer Systems Instructors: Bill Nace and Gregory Kesden (c) All Rights Reserved. All work.

High Performance Computing Processors Felix Noble Mirayma V. Rodriguez Agnes Velez Electric and Computer Engineer Department August 25, 2004.

Nicolas Tjioe CSE 520 Wednesday 11/12/2008 Hyper-Threading in NetBurst Microarchitecture David Koufaty Deborah T. Marr Intel Published by the IEEE Computer.

JPCM - JDC121 JPCM. Agenda JPCM - JDC122 3 Software performance is Better Performance tuning requires accurate Measurements. JPCM - JDC124 Software.

Hyper-Threading Technology Architecture and Micro-Architecture.

Instruction Issue Logic for High- Performance Interruptible Pipelined Processors Gurinder S. Sohi Professor UW-Madison Computer Architecture Group University.

AMD Opteron Overview Michael Trotter (mjt5v) Tim Kang (tjk2n) Jeff Barbieri (jjb3v)

Srihari Makineni & Ravi Iyer Communications Technology Lab

* Third party brands and names are the property of their respective owners. Performance Tuning Linux* Applications LinuxWorld Conference & Expo Gary Carleton.

Dynamic Pipelines. Interstage Buffers Superscalar Pipeline Stages In Program Order In Program Order Out of Order.

Performance Counters on Intel® Core™ 2 Duo Xeon® Processors Michael D’Mello

Computer Architecture 2 nd year (computer and Information Sc.)

Precomputation- based Prefetching By James Schatz and Bashar Gharaibeh.

1 Copyright © 2010, Elsevier Inc. All rights Reserved Chapter 2 Parallel Hardware and Parallel Software An Introduction to Parallel Programming Peter Pacheco.

Dept. of Computer Science - CS6461 Computer Architecture CS6461 – Computer Architecture Fall 2015 Lecture 1 – Introduction Adopted from Professor Stephen.

Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.

1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.

A Software Performance Monitoring Tool Daniele Francesco Kruse March 2010.

6 September 2007CHEP07 Parallel - SJ1 Perfmon2: A leap forward in Performance Monitoring Sverre Jarp, Ryszard Jurga, Andrzej Nowak CERN openlab CHEP 2007.

Question What technology differentiates the different stages a computer had gone through from generation 1 to present?

To Compute: To Do Math. Information is collected by tallying data as it travels across circuits. The key part of integrated circuits are transistors.transistors.

Performance profiling of Experiments’ Geant4 Simulations Geant4 Technical Forum Ryszard Jurga.

Performance Monitoring Update Daniele Francesco Kruse August 2010.

Computer Structure 2015 – Intel ® Core TM μArch 1 Computer Structure Multi-Threading Lihu Rappoport and Adi Yoaz.

Modern general-purpose processors. Post-RISC architecture Instruction & arithmetic pipelining Superscalar architecture Data flow analysis Branch prediction.

Niagara: A 32-Way Multithreaded Sparc Processor Kongetira, Aingaran, Olukotun Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing.

Confessions of a Performance Monitor Hardware Designer Workshop on Hardware Performance Monitor Design HPCA February 2005 Jim Callister Intel Corporation.

PipeliningPipelining Computer Architecture (Fall 2006)

Modular Software Performance Monitoring Daniele Francesco Kruse – CERN – PH / SFT Karol Kruzelecki – CERN – PH / LBC.

Modular Software Performance Monitoring

Computer Structure Multi-Threading

PowerPC 604 Superscalar Microprocessor

QuickPath interconnect GB/s GB/s total To I/O

Performance Tuning Team Chia-heng Tu June 30, 2009

Bruhadeshwar Meltdown Bruhadeshwar

What we need to be able to count to tune programs

The Microarchitecture of the Pentium 4 processor

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Hardware Multithreading

Intro to Architecture & Organization

15-740/ Computer Architecture Lecture 5: Precise Exceptions

CS 286 Computer Architecture & Organization

Presentation transcript:

Software Performance Monitoring Daniele Francesco Kruse July 2010

Summary 1.Performance Monitoring 2.Performance Counters 3.Core and Nehalem PMUs – Overview 4.Nehalem : Overview of the architecture 5.μops flow in Nehalem pipeline 6.Perfmon2 7.Cycle Accounting Analysis 8.The 4-way Performance Monitoring 9.PfmCodeAnalyser : a new tool for fast monitoring 10.David Levinthal : the expert 11.Conclusions 2

Performance Monitoring DEF : The action of collecting information related to how an application or system performs HOW : Obtain micro-architectural level information from hardware performance counters WHY : To identify bottlenecks, and possibly remove them in order to improve application performance 3

Performance Counters All recent processor architectures include a processor–specific PMU The Performance Monitoring Unit contains several performance counters Performance counters are able to count micro-architectural events from many hardware sources (cpu pipeline, caches, bus, etc…) 4

Core and Nehalem PMUs - Overview 5 Intel Core Microarchitecture PMU 3 fixed counters (INSTRUCTIONS_RETIRED, UNHALTED_CORE_CYCLES, UNHALTED_REFERENCE_CYCLES) 2 programmable counters Intel Nehalem Microarchitecture PMU 3 fixed core-counters (INSTRUCTIONS_RETIRED, UNHALTED_CORE_CYCLES, UNHALTED_REFERENCE_CYCLES) 4 programmable core-counters 1 fixed uncore-counter (UNCORE_CLOCK_CYCLES) 8 programmable uncore-counters

Nehalem : Overview of the architecture 6 Core 0 Core 1 Core 2 Core 3 Level 3 Cache shared, writeback (lazy write) and inclusive (contains L1 & L2 of each core) Integrated Memory Controller Link to local memory (DDR3) Quick Path Interconnect Link to I/O hub (& to other processors, if present) L1D Cache L1I Cache L2 Cache TLBs writeback writeback unified, writeback, not-inclusive DTLB0 & ITLB (1 st Level), STLB (unified 2 nd Level)

μops flow in Nehalem pipeline 7 We are mainly interested in UOPS_EXECUTED (dispatched) and UOPS_RETIRED (the useful ones). Mispredicted UOPS_ISSUED may be eliminated before being executed. Instruction Fetch & Branch Prediction Unit Decoder Retirement & Writeback Re-Order Buffer Resource Allocator Execution Units Reservation Station UOPS_ISSUED UOPS_RETIRED UOPS_EXECUTED

Perfmon2 A generic API to access the PMU ( libpfm ) Developed by Stéphane Eranian Portable across all new processor micro-architectures Supports system-wide and per-thread monitoring Supports counting and sampling CPU Hardware Linux Kernel Generic Perfmon Architectural Perfmon PMU User space Pfmon Other libpfm-based Apps libpfm 8

Cycle Accounting Analysis Total Cycles (Application total execution time) Issuing μops Not Issuing μops Stalled (no work) Not retiring μops (useless work) Retiring μops (useful work) Store-Fwd L2 miss L2 hit LCP L1 TLB miss 9

The 4-way Performance Monitoring Overall Analysis 2. Symbol Level Analysis 3. Module Level Analysis 4. Modular Symbol Level Analysis Overall (pfmon)Modular Sampling Counting

PfmCodeAnalyser : a new tool for fast monitoring 11 Unreasonable (and useless) to run a complete analysis for every change in code Often interested in only small part of code and in one single event Solution: a fast, precise and light “singleton” class called PfmCodeAnalyser How to use it: #include PfmCodeAnalyser::Instance(“INSTRUCTIONS_RETIRED”).start(); //code to monitor PfmCodeAnalyser::Instance().stop();

PfmCodeAnalyser : a new tool for fast monitoring 12 PfmCodeAnalyser::Instance("INSTRUCTIONS_RETIRED", 0, 0, "UNHALTED_CORE_CYCLES", 0, 0, "ARITH:CYCLES_DIV_BUSY", 0, 0, "UOPS_RETIRED:ANY", 0, 0).start(); Event: INSTRUCTIONS_RETIRED Total count: Number of counts:10 Average count: Event: UNHALTED_CORE_CYCLES Total count: Number of counts:10 Average count: Event: ARITH:CYCLES_DIV_BUSY Total count: Number of counts:10 Average count: Event: UOPS_RETIRED:ANY Total count: Number of counts:10 Average count:

David Levinthal : the expert 13 Intel senior engineer specialized in performance monitoring of applications He is going to be here from the 13th to the 30th of July He will give a detailed lecture about performance monitoring on the 21st or 22nd of July in Anyone who’s interested in partecipating may contact us for details

Conclusions 14 1.We gave a very brief introduction to Nehalem multicore architecture and to Performance Monitoring using hardware performance counters 2.A consistent set of events has been monitored across the 4 different analysis approaches in CMSSW, Gaudi and Geant4 (Cycle Accounting Analysis) 3.A monitoring tool has been developed for quick performance monitoring: PfmCodeAnalyser 4.The report and background of the work done for CMSSW is available at

Questions ?