Enabling Efficient On-the-fly Microarchitecture Simulation Thierry Lafage September 2000.


Introduction

Microarchitecture simulation:
  - Accurate, but slow (execution × …)
  - "On-the-fly" (vs. trace-driven):
      - Enables execution-driven simulation (complex microprocessors)
      - Enables simulation of long-running workloads
Complete microprocessor simulation requires:
  - Realistic workloads and working sets
  - A huge amount of CPU time

Introduction (2)

Realistic simulations in an affordable time ⇒ simulations of a reduced number of instructions:
  - One "big slice" (e.g. after the program start-up phase)
  - Trace sampling
⇒ Representativeness of the simulated execution slices?

On-the-fly simulations ⇒ fast forwarding
⇒ Current tools' "fast" forwarding mode: >20× execution slowdown

[Figure: execution timelines with marks at 0, 500M, 1B and 1.5B]

Outline

1. Speeding up the fast forwarding mode
  - Approach
  - Implementation
  - Performance on the SPEC95 benchmarks
  - Conclusion
2. Selecting representative execution slices
  - Approach
  - Application to data cache simulations
  - Conclusion
Conclusion and Future Work

Speeding up the fast forwarding mode

Two execution modes:
  - A really fast mode (static code annotation)
      ⇒ Direct execution rapidly positions the program at the point where the simulation is to begin
  - An emulation mode (embedded instruction-set emulator)
      ⇒ Calls to analysis routines (user provided)
⇒ At run time: dynamic switches between both modes
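
The interplay of the two modes can be sketched as a toy loop. This is only an illustration under stated assumptions, not the calvin2+DICE implementation: the program is modeled as a list of `(mnemonic, is_checkpoint)` pairs, and the names `run`, `start_emulating_at`, `emulate_count` and `analyze` are hypothetical. The only property taken from the slide is that the fast mode does no per-instruction work and that switches to emulation happen at annotated checkpoints.

```python
from typing import Callable, List, Tuple

# Hypothetical program model: (mnemonic, is_checkpoint).
Instruction = Tuple[str, bool]


def run(program: List[Instruction],
        start_emulating_at: int,
        emulate_count: int,
        analyze: Callable[[int, str], None]) -> int:
    """Emulate `emulate_count` instructions, starting at the first checkpoint
    at or after index `start_emulating_at`; everything else runs in fast mode.
    Returns the number of instructions handled in emulation mode."""
    emulated = 0
    emulating = False
    for index, (mnemonic, is_checkpoint) in enumerate(program):
        # Fast mode: only a checkpoint can trigger a switch to emulation.
        if (not emulating and is_checkpoint and emulated == 0
                and emulate_count > 0 and index >= start_emulating_at):
            emulating = True
        if emulating:
            analyze(index, mnemonic)  # user-provided analysis routine
            emulated += 1
            if emulated == emulate_count:
                emulating = False  # switch back to fast mode
        # Otherwise the instruction "runs natively": no per-instruction work.
    return emulated
```

Because a switch can only happen at a checkpoint, the density of checkpoints bounds how precisely the start of a simulated slice can be positioned.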

Implementation

[Figure: block diagram. Labels: original code (SPARC V9 assembly code); calvin2 (static code annotation tool); checkpoint; switching event; emulation mode; DICE (host ISA emulator); user analysis routines]

Performance on the SPEC95 Benchmarks

calvin2+DICE:
  - Average slowdown in fast mode (checkpoints at procedure calls and inside loops): 1.31
  - Average slowdown in emulation mode (instruction and data address trace): …
Shade (instruction and data address generation enabled):
  - Average slowdown in "fast forward" mode (empty analysis routine): …
  - Average slowdown in emulation mode (tracing analysis routine): …

A Simple Example of Microprocessor Simulation

Simulation of 1% of a 1 hour workload; the simulation itself adds a ×1000 slowdown.
  - With calvin2+DICE (direct execution, then emulation + simulation): 0.99 × … ( … ) = 12.5 hours
  - With Shade (fast forward, then emulation + simulation): 0.99 × … ( … ) = 27.7 hours
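
The shape of this estimate can be sketched in a few lines. The formula below is an assumption reconstructed from the surviving terms: the non-simulated part of the workload pays the fast-forward slowdown, and the simulated part pays the emulation slowdown plus the additional ×1000. The function name and the emulation slowdown of 100 in the usage lines are hypothetical; only the 1% fraction, the 1 hour workload, the ×1000 factor and the 1.31 fast-mode slowdown come from the slides.

```python
def total_time_hours(workload_hours: float,
                     simulated_fraction: float,
                     forward_slowdown: float,
                     emulation_slowdown: float,
                     simulation_slowdown: float = 1000.0) -> float:
    """Estimated wall-clock time, assuming the simulation slowdown adds to
    the emulation slowdown on the simulated part of the workload."""
    forwarded = (1.0 - simulated_fraction) * workload_hours * forward_slowdown
    simulated = (simulated_fraction * workload_hours
                 * (emulation_slowdown + simulation_slowdown))
    return forwarded + simulated


# Hypothetical emulation slowdown of 100; fast-mode slowdown 1.31 from the slides.
estimate = total_time_hours(1.0, 0.01, 1.31, 100.0)
print(f"{estimate:.2f} hours")
```

With 99% of the workload fast-forwarded, the first term scales directly with the fast-forward slowdown, which is why a fast mode close to native speed matters more than a faster emulator.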

Conclusion for calvin2+DICE

  - Performance of the emulator: not an issue
  - Overall performance is determined by the performance of the fast forwarding mode (long-running workloads)
⇒ calvin2+DICE enables simulations on slices spread over a whole application

Outline

1. Speeding up the fast forwarding mode
  - Approach
  - Implementation
  - Performance on the SPEC95 benchmarks
  - Conclusion
2. Selecting representative execution slices
  - Approach
  - Application to cache simulations
  - Conclusion
Conclusion and Future Work

Introduction

On-the-fly simulations using realistic applications in an affordable time ⇒ simulations of a reduced number of instructions
  - Before: one "big slice" (after the program start-up phase)
  - With calvin2+DICE: on-the-fly statistical sampling
Number of simulated instructions often determined by:
  - The simulation time
  - Empirical results
⇒ Representativeness of the simulated instructions?

[Figure: execution timelines with marks at 0, 500M, 1B and 1.5B]
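
One way to spread simulated slices over a whole execution is stratified random sampling: divide the run into equal strata and draw one slice start in each. The slides do not say which sampling scheme is used, so the scheme, the function name and the parameters below are assumptions made for illustration.

```python
import random
from typing import List


def sample_slice_starts(total_instructions: int,
                        slice_length: int,
                        slice_count: int,
                        seed: int = 0) -> List[int]:
    """Pick one slice start at random inside each of `slice_count` equal
    strata, so that slices never overlap and cover the whole run."""
    stratum = total_instructions // slice_count
    if stratum < slice_length:
        raise ValueError("slices do not fit in their strata")
    rng = random.Random(seed)
    return [k * stratum + rng.randrange(stratum - slice_length + 1)
            for k in range(slice_count)]


# Hypothetical run of 1.5 billion instructions, 15 slices of 1 million each.
starts = sample_slice_starts(1_500_000_000, 1_000_000, 15)
```

Each start would then serve as the point at which the fast mode hands control to the emulator.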

Our Approach

  - Dynamic characterization of the target programs
  - Selection of representative execution slices for simulations (classification)
Aim:
  ⇒ Tune a per-program amount of simulated activity
  ⇒ Reduce simulation time or increase simulation result accuracy

Dynamic Characterization of the Target Programs

[Figure labels: N execution slices; program characterization]

Metrics are independent of the implementation details of the simulated components.

Selection of Representative Execution Slices

[Figure: hierarchical classification yielding the classes {2,1,3} and {0,4}; two slices selected]
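
A hierarchical classification of slices can be sketched as agglomerative clustering over per-slice characterization vectors. The slides do not give the distance or the merge criterion, so Euclidean distance between class centroids is an assumption here, and the two-dimensional vectors in the usage lines are invented so that the result matches the grouping on the slide.

```python
import math
from itertools import combinations
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(slices: Sequence[Sequence[float]],
             class_count: int) -> List[List[int]]:
    """Start with one class per slice, then repeatedly merge the two classes
    whose centroids are closest until `class_count` classes remain."""
    classes = [[i] for i in range(len(slices))]
    while len(classes) > class_count:
        a, b = min(
            combinations(range(len(classes)), 2),
            key=lambda pair: math.dist(
                centroid([slices[i] for i in classes[pair[0]]]),
                centroid([slices[i] for i in classes[pair[1]]])))
        classes[a] += classes[b]  # a < b, so deleting b keeps a valid
        del classes[b]
    return classes


# Invented characterization vectors for five slices (indices 0 to 4).
slices = [[0.9, 0.1], [0.2, 0.8], [0.25, 0.75], [0.35, 0.7], [0.8, 0.3]]
classes = classify(slices, 2)
```

Simulating one slice per class then costs two slices instead of five for this toy input.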

Selection of Class Representatives

⇒ Wmdc indicator: weighted mean of distances from class centers

[Figure legend: class centers; class representatives]
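
The slide gives only the name of the indicator, so the following is one possible reading, not the author's definition: each class is represented by the member slice closest to the class center, and Wmdc is the mean of the representative-to-center distances weighted by class size. The Euclidean distance and the function names are assumptions as well.

```python
import math
from typing import List, Sequence, Tuple


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def representatives(slices: Sequence[Sequence[float]],
                    classes: Sequence[Sequence[int]]
                    ) -> List[Tuple[int, float]]:
    """For each class, return (index of the member closest to the class
    center, distance of that member to the center)."""
    chosen = []
    for members in classes:
        center = centroid([slices[i] for i in members])
        best = min(members, key=lambda i: math.dist(slices[i], center))
        chosen.append((best, math.dist(slices[best], center)))
    return chosen


def wmdc(slices: Sequence[Sequence[float]],
         classes: Sequence[Sequence[int]]) -> float:
    """Mean representative-to-center distance, weighted by class size
    (an assumed reading of the Wmdc indicator)."""
    total = sum(len(members) for members in classes)
    reps = representatives(slices, classes)
    return sum(len(members) * distance
               for members, (_, distance) in zip(classes, reps)) / total
```

Under this reading, a value near zero means that every representative sits almost exactly at its class center, and the value is exactly zero when each slice forms its own class.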

Application to the Data Stream

Data stream characterization:
  - Temporal locality: data reuse distances
  - Spatial locality: data reuse distances with several line sizes

[Figure: relative frequency (%) versus data reuse distance (in instructions)]
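
A reuse-distance histogram of this kind can be sketched as follows, assuming the trace is a sequence of `(instruction_number, byte_address)` data accesses (a hypothetical format). The distance is counted in instructions since the previous access to the same line; treating a line size of 1 byte as the temporal-locality case is an assumption.

```python
from collections import Counter
from typing import Iterable, Tuple


def reuse_distances(trace: Iterable[Tuple[int, int]],
                    line_size: int = 1) -> Counter:
    """Histogram of reuse distances, in instructions, between consecutive
    accesses to the same `line_size`-byte line."""
    last_seen = {}
    histogram = Counter()
    for instruction_number, address in trace:
        line = address // line_size
        if line in last_seen:
            histogram[instruction_number - last_seen[line]] += 1
        last_seen[line] = instruction_number
    return histogram


# Invented trace: (instruction number, byte address).
trace = [(0, 0x100), (3, 0x104), (5, 0x100), (9, 0x200), (10, 0x104)]
temporal = reuse_distances(trace)               # exact addresses
spatial = reuse_distances(trace, line_size=16)  # 16-byte lines
```

Repeating the computation for several line sizes gives one histogram per size, and none of them depends on a particular cache configuration.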

Results for Trained Cache Simulations on the SPEC95 Benchmarks

Cache configurations: 4-way set associative, LRU, write back, write allocate
  ⇒ sizes from 4KB to 512KB
  ⇒ line sizes from 16B to 128B
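
A cache with this policy can be modeled in a few lines. This is a minimal sketch, not the simulator used for the results: the class name and counters are hypothetical, and only the policy (4-way set associative, LRU, write back, write allocate) comes from the slide.

```python
from collections import OrderedDict


class Cache:
    """Set-associative cache with LRU replacement, write back, write allocate."""

    def __init__(self, size_bytes: int, line_bytes: int, ways: int = 4):
        self.line_bytes = line_bytes
        self.ways = ways
        self.set_count = size_bytes // (line_bytes * ways)
        # One OrderedDict per set: tag -> dirty flag, least recently used first.
        self.sets = [OrderedDict() for _ in range(self.set_count)]
        self.hits = self.misses = self.writebacks = 0

    def access(self, address: int, is_write: bool = False) -> bool:
        """Simulate one data access; return True on a hit."""
        line = address // self.line_bytes
        lines = self.sets[line % self.set_count]
        tag = line // self.set_count
        hit = tag in lines
        if hit:
            self.hits += 1
            lines.move_to_end(tag)  # becomes most recently used
        else:
            self.misses += 1
            if len(lines) == self.ways:
                _, dirty = lines.popitem(last=False)  # evict the LRU line
                if dirty:
                    self.writebacks += 1
            lines[tag] = False  # allocate on read and on write misses
        if is_write:
            lines[tag] = True  # mark dirty; position in LRU order unchanged
        return hit
```

Sweeping `size_bytes` over 4 KB to 512 KB and `line_bytes` over 16 B to 128 B on the same trace would cover the configurations listed on the slide.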

Conclusion for Representative Slice Selection

Similar results with:
  - Branch characterization for branch predictor simulations
  - Data stream characterization, branch characterization, instruction mix and basic block sizes for data cache simulations and branch predictor simulations
⇒ Program characterization actually helps in tuning the amount of simulated activity

General Conclusion

  - calvin2+DICE enables simulations on slices spread over a whole application
  - Our approach makes it possible to select representative execution slices

Future Work

  - Complete execution-driven simulations (complex microprocessor)
  - Operating system activity: LiKE, a Linux Kernel Emulator

Static Code Annotation with calvin2

Light instrumentation:
  - Use of the SALTO library (assembly language level)
  - Instrumentation of SPARC V9 code
Checkpoint code insertion:
  - At the beginning of each procedure
  - Inside each loop

DICE: A Dynamic Inner Code Emulator

  - Host and target: SPARC V9 ISA
  - Modeling of architectural resources (registers)
  - DICE is an archive library:
      - Able to receive control or return to direct execution at any moment
      - Access to the complete target program state (registers, memory, …)
  - User-defined analysis routines are called for each emulated instruction (trace information passed as a parameter)

LiKE: A Linux Kernel Emulator

  - Derived from DICE
  - Host and target: SPARC V9 ISA (full 64 bits)
  - Dynamically loaded module
  - Receives control at the beginning of system calls
  - Returns to direct execution at the end of system calls
Not yet implemented:
  - Support for all system calls and other OS activity
  - Interface with the on-the-fly simulator shared with the user-space emulated program
  - Full debugging