Performance Monitoring Update Daniele Francesco Kruse April 2010.

Presentation transcript:

Performance Monitoring Update Daniele Francesco Kruse April 2010

Summary
1. Monitoring CMSSW on Nehalem
2. SDL substitute candidates
3. Monitoring Geant4

A performance model for Nehalem: Overview
Total_cycles: CPU_CLK_UNHALTED:THREAD_P
Useless_uops: (UOPS_EXECUTED:PORT015 + UOPS_EXECUTED:PORT234_CORE) - UOPS_RETIRED:ANY
PORT015_uop_execution_rate: UOPS_EXECUTED:PORT015 / UOPS_EXECUTED:PORT015(CMASK=1)
PORT234_CORE_uop_execution_rate: UOPS_EXECUTED:PORT234_CORE / UOPS_EXECUTED:PORT234_CORE(CMASK=1)
Uop_execution_rate: PORT015_uop_execution_rate + PORT234_CORE_uop_execution_rate
Useless_cycles: Useless_uops / Uop_execution_rate
Useful_cycles: UOPS_RETIRED:ANY / Uop_execution_rate
Active_cycles: Useless_cycles + Useful_cycles
Stalled_cycles: Total_cycles - Active_cycles
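The model above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the slide, not the tool's actual code: the function name `cycle_accounting`, the dictionary layout and the sample counter values are all assumptions made for the example.

```python
def cycle_accounting(c):
    """Derive the slide's cycle metrics from raw counter values.

    `c` maps event names (as written on the slide) to counts. The
    "(CMASK=1)" entries count cycles in which at least one uop was
    executed on that port group.
    """
    port015 = c["UOPS_EXECUTED:PORT015"]
    port234 = c["UOPS_EXECUTED:PORT234_CORE"]
    retired = c["UOPS_RETIRED:ANY"]
    total_cycles = c["CPU_CLK_UNHALTED:THREAD_P"]

    # Uops that were executed but never retired (speculative work thrown away).
    useless_uops = (port015 + port234) - retired

    # Average uops per busy cycle, summed over the two port groups.
    rate = (port015 / c["UOPS_EXECUTED:PORT015(CMASK=1)"]
            + port234 / c["UOPS_EXECUTED:PORT234_CORE(CMASK=1)"])

    useless_cycles = useless_uops / rate
    useful_cycles = retired / rate
    active_cycles = useless_cycles + useful_cycles
    return {
        "Total_cycles": total_cycles,
        "Useless_uops": useless_uops,
        "Uop_execution_rate": rate,
        "Useless_cycles": useless_cycles,
        "Useful_cycles": useful_cycles,
        "Active_cycles": active_cycles,
        "Stalled_cycles": total_cycles - active_cycles,
    }


# Made-up counter values, chosen only to keep the arithmetic easy to follow.
sample = {
    "CPU_CLK_UNHALTED:THREAD_P": 1000,
    "UOPS_EXECUTED:PORT015": 600,
    "UOPS_EXECUTED:PORT015(CMASK=1)": 300,
    "UOPS_EXECUTED:PORT234_CORE": 400,
    "UOPS_EXECUTED:PORT234_CORE(CMASK=1)": 400,
    "UOPS_RETIRED:ANY": 900,
}
metrics = cycle_accounting(sample)
```

With these sample values the execution rate is 3 uops per cycle, so 900 retired uops account for 300 useful cycles and most of the 1000 total cycles end up classified as stalled.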

Cycle Accounting Analysis for Nehalem (Intel Core i7)
Total Cycles (application total execution time): CPU_CLK_UNHALTED:THREAD_P
- Issuing μops: Active_cycles = Useless_cycles + Useful_cycles
  - Retiring μops (useful work): UOPS_RETIRED:ANY / Uop_execution_rate
  - Not retiring μops (useless work): Useless_uops / Uop_execution_rate
- Not issuing μops, stalled (no work): Total_cycles - Active_cycles

Nehalem: Overview of memory and cache stalls
Memory and cache related stalls:
MEM_LOAD_RETIRED:DTLB_MISS // ~10 cycles
MEM_LOAD_RETIRED:L1D_HIT // too small: penalty hidden
MEM_LOAD_RETIRED:L2_HIT // ~14.5 cycles
MEM_LOAD_RETIRED:L3_MISS // ~180 cycles (arch. dependent)
MEM_LOAD_RETIRED:L3_UNSHARED_HIT // ~42 cycles
MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM // ~74 cycles
ITLB_MISS_RETIRED // too small: penalty hidden
Other stalls:
ILD_STALL:ANY // BROKEN?!?!
RAT_STALLS
RESOURCE_STALLS
SEG_RENAME_STALLS
SQ_FULL_STALL_CYCLES
STORE_BLOCKS
And finally, what happened to store-forward stalls?
- Loads spanning across cache lines cause almost no stalls anymore
- Loads blocked by unknown-address stores, and loads blocked because they are not completely contained in a preceding store, still cause stalls
- Unfortunately there is no direct event to count these situations
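One plausible use of the per-event penalties listed above is to weight each event count by its approximate penalty and sum the results into a rough estimate of memory-related stall cycles. The slide does not say the tool does exactly this, so the sketch below is an assumption; the names `PENALTY_CYCLES` and `estimate_memory_stall_cycles` and the sample counts are invented for illustration.

```python
# Approximate penalties in cycles, copied from the slide. Events whose
# penalty the slide calls "too small: penalty hidden" are left out.
PENALTY_CYCLES = {
    "MEM_LOAD_RETIRED:DTLB_MISS": 10.0,
    "MEM_LOAD_RETIRED:L2_HIT": 14.5,
    "MEM_LOAD_RETIRED:L3_MISS": 180.0,  # architecture dependent
    "MEM_LOAD_RETIRED:L3_UNSHARED_HIT": 42.0,
    "MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM": 74.0,
}


def estimate_memory_stall_cycles(counts):
    """Return (total, per_event) estimated stall cycles.

    `counts` maps event names to occurrence counts. Events without a
    listed penalty are ignored rather than guessed at.
    """
    per_event = {
        event: counts[event] * penalty
        for event, penalty in PENALTY_CYCLES.items()
        if event in counts
    }
    return sum(per_event.values()), per_event


# Made-up event counts for illustration only.
sample_counts = {
    "MEM_LOAD_RETIRED:DTLB_MISS": 1000,
    "MEM_LOAD_RETIRED:L2_HIT": 2000,
    "MEM_LOAD_RETIRED:L3_MISS": 100,
    "MEM_LOAD_RETIRED:L3_UNSHARED_HIT": 500,
    "MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM": 50,
}
total_stall, breakdown = estimate_memory_stall_cycles(sample_counts)
```

Such a sum is an upper-bound style estimate at best, since out-of-order execution can overlap several of these penalties.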

Monitoring CMSSW on Nehalem
- First Nehalem results with CMSSW pre2 (here) (compare with Core results here)
- Tool discovers architecture at runtime (CPUID)
- First performance considerations on Nehalem:
  - Faster (generally 3 – 15% over Core cycle count)
  - Lower CPI (the number and type of instructions stay the same, obviously)
  - Stalled cycles cut down to around 30% of Core values (but we need to verify coverage accurately)
  - Same percentage of mispredicted branches
  - More useful cycles seem to be required to do the same job
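Runtime architecture discovery of the kind mentioned above could look roughly like the following. The real tool queries CPUID directly; this sketch instead parses the family and model fields that Linux exposes in /proc/cpuinfo, which carry the same CPUID information. The function name `classify_cpu` is hypothetical, and the model-number sets are written from memory of Intel's published CPUID tables, so treat them as assumptions to verify.

```python
# Family 6 model numbers; assumptions recalled from Intel's CPUID tables.
NEHALEM_MODELS = {26, 30, 46}
CORE_MODELS = {15, 23, 29}


def classify_cpu(cpuinfo_text):
    """Return 'nehalem', 'core' or 'unknown' from /proc/cpuinfo-style text."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        # Keep only the first processor's value for each key.
        fields.setdefault(key.strip(), value.strip())
    try:
        family = int(fields["cpu family"])
        model = int(fields["model"])
    except (KeyError, ValueError):
        return "unknown"
    if family == 6 and model in NEHALEM_MODELS:
        return "nehalem"
    if family == 6 and model in CORE_MODELS:
        return "core"
    return "unknown"
```

On a Linux machine this would be called as `classify_cpu(open("/proc/cpuinfo").read())`, and the result could select which event list and performance model to apply.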

Version with graphs: Structure and libraries
Stages: Start, Analysis Configuration, Performance Data Taking, Program Run, Performance Data Output, Performance Data Analysis, Browsable HTML results, End
Libraries: libpng, zlib, libSDL, libSDL_ttf, libpfm

SDL substitute candidates
libSDL_ttf is not part of the standard SLC5 installation
SDL substitute candidates (both successfully tested):
- HTML5’s tag:
  - Supported by Firefox, Opera, Safari & Chrome
  - Text drawing supported only by Firefox (Gecko), Safari & Chrome
  - Internet Explorer also supports it through Mozilla’s plugin
- ROOT:
  - A little heavier and more difficult to adapt
  - Works the same way as the current SDL implementation (png output)

Monitoring Geant4
- Overall and symbol analysis already possible
  - pfmon command line tool & FullCMS simulation example
- Modular analysis through User Actions
  - Probably RunAction and EventAction combined
  - Type of particle, direction and energy determine complexity and type of event
  - This triple may be used to describe the “module” of the analysis
- Proposal: event-level granularity
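The idea of using the (particle type, direction, energy) triple as the "module" can be sketched as a binning function plus a per-module accumulator. Everything specific here is an assumption made for illustration: the names `module_key` and `accumulate`, describing direction by a polar angle in degrees, the 30-degree angular bins and the energy bin edges are not taken from the presentation, and real Geant4 user actions would be written in C++.

```python
from bisect import bisect_right
from collections import defaultdict

ENERGY_EDGES_GEV = (1.0, 10.0, 100.0)  # hypothetical bin edges
THETA_BIN_DEG = 30.0                   # hypothetical angular bin width


def module_key(particle, theta_deg, energy_gev):
    """Map an event's primary particle to a coarse (type, direction, energy) key."""
    theta_bin = int(theta_deg // THETA_BIN_DEG)
    energy_bin = bisect_right(ENERGY_EDGES_GEV, energy_gev)
    return (particle, theta_bin, energy_bin)


def accumulate(events):
    """Sum per-event counter values into per-module totals.

    Each event is (particle, theta_deg, energy_gev, counters), where
    `counters` maps counter names to the values measured for that event.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for particle, theta_deg, energy_gev, counters in events:
        key = module_key(particle, theta_deg, energy_gev)
        for name, value in counters.items():
            totals[key][name] += value
    return totals


# Made-up events with made-up cycle counts.
events = [
    ("e-", 45.0, 5.0, {"cycles": 1000}),
    ("e-", 50.0, 7.0, {"cycles": 500}),
    ("pi+", 10.0, 50.0, {"cycles": 2000}),
]
totals = accumulate(events)
```

Here the two electron events fall into the same module and their cycle counts are summed, which is the event-level granularity the proposal describes.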

Conclusions
- CMSSW has been successfully monitored on a Nehalem machine
- Two proposed substitutes for results graphics display have been successfully tested for suitability: ROOT and HTML5’s tag
- A study to apply modular monitoring of Geant4 is underway

What’s next?
- Further study stall impacts on Nehalem and validate Cycle Accounting Analysis (possibly with David Levinthal in May)
- Implement graphics display without SDL dependency, using ROOT or HTML5’s tag
- Run Geant4 monitoring exercises with simple examples and later with the FullCMS application

Thank you. Questions?