Evaluation of Hardware Performance Counters on the R12000 Microprocessor

Methodology

1. Calculate the number of events by searching for the event in an assembly file or by using an analytical model.
2. Validate the numbers from step one with a simulator.
3. Compare the numbers with those generated by the counters.

The necessity for accurate performance counters became apparent when we began characterizing the resource usage of Sweep3D, an ASCI benchmark used to evaluate high-performance computers. For years, many computer scientists have used performance counters to help find problem areas in code. This study shows that the performance counters on modern microprocessors provide rudimentary performance measurements that may or may not be accurate. Below, we describe the methodology used to determine the accuracy of this hardware feature on the R12000, as well as the results.

Performance counters are used mainly to optimize code. For example, consider this nested loop, which accesses data in a matrix:

    for i = 1 to n do
      for j = 1 to n do
        a[i][j] := a[i][j] + 1

The way the matrix is stored in memory determines the number of cache misses, and cache misses increase execution time. If this code were analyzed using performance counters and the results showed many cache misses during execution, the programmer could try to tune the code to decrease the miss rate and, thus, decrease execution time.

To quantify the accuracy of performance counters, the number of events a program generates must be known. Thus, we designed microbenchmarks to generate events for which we could predict counts. For example, with the code above we could measure the number of cache misses it generates; each type of code exercises certain events.
Microbenchmarks

To generate events, we use small programs, or microbenchmarks. The R12000's two counters can count up to 30 events in total; we studied nine of them:

1. Decoded instructions
2. Decoded loads
3. Decoded stores
4. Conditional resolved branches
5. Primary instruction cache misses
6. Translation lookaside buffer (TLB) misses
7. Primary data cache misses
8. Secondary data cache misses
9. Secondary instruction cache misses

Below are the three types of microbenchmarks (Loop, Linear, and Array Data) and the events they can generate. Based on the results, conclusions can be drawn about problem areas in code.

The Loop and Linear microbenchmarks use sequences of simple assignments:

    a = 1; b = 1; c = 1;
    a = b + 1; b = a + 1; c = a + b;
    a = b + c; b = a + c; c = a + b;

The Array Data microbenchmark sweeps over an array:

    #define MAXSIZE
    int main (int argc, char *argv[]) {
        int a[MAXSIZE], ARRAYSIZE, i;
        ARRAYSIZE = atoi(argv[1]);
        for (i = 0; i < ARRAYSIZE; i++)
            a[i] = a[i] + 1;
    }

The methodology is applied as follows:

1. Predictions: use grep on the assembly file to find events such as loads, stores, and branches.
2. Simulations: validate the predictions using sim-outorder from the SimpleScalar simulation tool suite with an R12000 configuration file.
3. Counter data: use the perfex and libperfex interfaces to access the counters.
4. Compare: compare the numbers from steps 1, 2, and 3.

Conclusions

Per figures A-F below, counters accessed via perfex exhibit poorer accuracy than those accessed via libperfex for microbenchmarks that generate small numbers of events. Per figures D and E, cache-miss counts were not accurate using either interface. Per figure A, load counts were accurate when the number of generated events was large enough.
The linear microbenchmark neither generated enough data cache misses to provide accurate counts nor provided accurate instruction counts.

The accuracy of a counter depends on:
1. the interface used,
2. the event being measured, and
3. the application run to generate the events.

[Figures A-F: event-count accuracy results; not reproduced in this transcript.]

Wendy Korn, Senior, SSEAL, Computer Science. Mentor: Dr. Patricia Teller.