Understanding Performance Counter Data - 1

Presentation transcript:

Understanding Performance Counter Data - 1
Methodology
- [Configuration micro-benchmark]
- Validation micro-benchmark – used to predict event count
- Prediction via tool, mathematical model, and/or simulation
- Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated)
- Comparison/analysis
- Report findings
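The collection and comparison steps can be sketched in a few lines. This is only an illustration of the arithmetic, assuming the per-run event counts have already been gathered (for example, from 100 instrumented runs). The function name `summarize_counts` and the sample numbers are hypothetical and do not come from the slides.

```python
import statistics


def summarize_counts(measured, predicted):
    """Compare hardware-reported event counts against a predicted count.

    measured  -- list of event counts, one per instrumented benchmark run
    predicted -- event count predicted by a tool, model, or simulation
    """
    if not measured:
        raise ValueError("need at least one measured count")
    if predicted == 0:
        raise ValueError("predicted count must be non-zero for a relative error")

    mean = statistics.mean(measured)
    # Sample standard deviation is undefined for a single run; report 0.0.
    stdev = statistics.stdev(measured) if len(measured) > 1 else 0.0
    abs_error = mean - predicted
    return {
        "mean": mean,
        "stdev": stdev,
        "abs_error": abs_error,
        "rel_error": abs_error / predicted,
    }


if __name__ == "__main__":
    # Hypothetical data: 100 runs that each report slightly more than predicted.
    runs = [1000 + (i % 5) for i in range(100)]
    print(summarize_counts(runs, predicted=1000))
```

A small standard deviation together with a stable, non-zero error suggests a systematic difference (such as instrumentation overhead) rather than run-to-run noise.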

Understanding Performance Counter Data - 2
- Can quantify PAPI overhead in some cases, e.g.,
  - Loads and stores
  - Floating-point operations (on some platforms)
- Can show that count is reasonable in others, e.g.,
  - L1 Dcache misses
  - DTLB misses (R10K)
  - Multiprocessor cache consistency protocol-related events (R10K)
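One possible way to quantify a fixed overhead for a deterministic event such as loads or stores is sketched below. It assumes the mean count grows linearly with the benchmark's iteration count, so the intercept of a least-squares line estimates the constant instrumentation overhead. This is an illustrative assumption, not necessarily how the authors measured it; `fit_overhead` and the sample data are hypothetical.

```python
def fit_overhead(sizes, mean_counts):
    """Least-squares fit of mean_counts = slope * size + overhead.

    sizes       -- benchmark sizes (e.g., loop iteration counts)
    mean_counts -- mean hardware-reported event count at each size
    Returns (slope, overhead): events per iteration and the constant term.
    """
    n = len(sizes)
    if n < 2 or n != len(mean_counts):
        raise ValueError("need at least two (size, count) pairs of equal length")

    mean_x = sum(sizes) / n
    mean_y = sum(mean_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("sizes must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, mean_counts))

    slope = sxy / sxx
    overhead = mean_y - slope * mean_x
    return slope, overhead


if __name__ == "__main__":
    # Hypothetical data: 2 loads per iteration plus a fixed extra per run.
    sizes = [100, 200, 400, 800]
    counts = [2 * s + 37 for s in sizes]
    print(fit_overhead(sizes, counts))
```

If the fitted slope matches the predicted events per iteration, the intercept can be read as the overhead added by the measurement calls themselves. A slope that disagrees points to a discrepancy in the benchmark body rather than in the instrumentation.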

Understanding Performance Counter Data - 3
Interesting facts
- Stream buffers are incredibly effective!
- Itanium has 17% more instructions retired and 17% more Icache misses than predicted – this is due to no-ops
- Itanium has 5x as many TLB misses as predicted – don’t know why yet!
- Power3 has 5x (for smaller versions of the benchmark) and 2x (for larger versions) as many TLB misses as predicted – don’t know why yet!

Understanding Performance Counter Data - 4
Interesting facts
- Power3 (gcc compiler): single-precision vs. double-precision floating-point add benchmark
  - The double-precision benchmark has ½ the number of floating-point operations, due to rounding instructions needed for the single-precision benchmark
  - The single-precision benchmark takes 1.39x the cycles of the double-precision benchmark