© Michel Dubois, Murali Annavaram, Per Strenstrom. All rights reserved.
CHAPTER 9 Simulation Methods
SIMULATION METHODS - SIMPOINTS - PARALLEL SIMULATIONS - NONDETERMINISM

How to study a computer system
- Methodologies:
  - Construct a hardware prototype
  - Mathematical modeling
  - Simulation

Construct a hardware prototype
- Advantages
  - Runs fast
- Disadvantages
  - Takes a long time to build; RPM (the Rapid Prototyping engine for Multiprocessors) at USC took a few graduate students several years
  - Expensive
  - Not flexible

Mathematically model the system
- Use analytical modeling: probabilistic, queuing, Markov, Petri net
- Advantages
  - Very flexible
  - Very quick to develop
  - Runs quickly
- Disadvantages
  - Cannot capture the effects of system details
  - Computer architects are skeptical of models

Simulation
- Write a program that mimics system behavior
- Advantages
  - Very flexible
  - Relatively quick to develop
- Disadvantages
  - Runs slowly (e.g., 30,000 times slower than hardware)
  - Execution-driven simulators are increasingly complex; how do we manage that complexity?

Most popular research method
- Simulation is chosen by MOST research projects
- Why?
  - Mathematical models are NOT accurate enough
  - Building a prototype is too time-consuming and too expensive for academic researchers

Computer architecture simulation
- Study the characteristics of a complicated computer system with a fixed configuration
- Explore the design space of a system
- With an accurate model, we can make changes and see how they will affect the system

Tool classification: OS code execution
- System-level (complete system)
  - Does simulate the behavior of an entire computer system, including OS and user code
  - Examples: Simics, SimOS
- User-level
  - Does NOT simulate OS code
  - Does emulate system calls
  - Examples: SimpleScalar

Tool classification: simulation detail
- Instruction set
  - Does simulate the function of instructions
  - Does NOT model detailed microarchitectural timing
  - Examples: Simics
- Microarchitecture
  - Does clock-cycle-level simulation
  - Does speculative, out-of-order multiprocessor timing simulation
  - May NOT implement the functionality of the full instruction set or any devices
  - Examples: SimpleScalar
- RTL
  - Does logic gate-level simulation
  - Examples: Synopsys

Tool classification: simulation input
- Trace-driven
  - The simulator reads a "trace" of instructions captured during a previous execution by software or hardware
  - Easy to implement; no functional component needed
  - Large trace size; no branch prediction
- Execution-driven
  - The simulator "runs" the program, generating the trace on the fly
  - More difficult to implement, but has many advantages
  - Interpreter or direct execution
  - Examples: Simics, SimpleScalar, ...

Tools introduction and tutorial
- SimpleScalar
- Simics
- SimWattch
- WattchCMP

Simulation Bottleneck
- 1 GHz = 1 billion cycles per second
- Simulating one second of a future machine's execution = simulating 1B cycles!!
- Simulating 1 cycle of a target takes about 30,000 cycles on a host
- 1 second of target execution = 30,000 seconds on the host = 8.3 hours
- CPU2K benchmarks run for a few hours natively
- Speed is much worse when simulating CMP targets!!
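As a quick check of these numbers, the back-of-the-envelope calculation can be scripted; the 1 GHz clock and the 30,000x slowdown are the illustrative assumptions from the slide, not measured values.

```python
# Back-of-the-envelope cost of detailed simulation (assumed figures from the slide).
target_clock_hz = 1e9        # 1 GHz target
slowdown = 30_000            # host cycles needed to simulate one target cycle (assumed)
target_seconds = 1.0         # simulate one second of target execution

host_cycles = target_seconds * target_clock_hz * slowdown
host_seconds = host_cycles / target_clock_hz   # assuming the host also runs at ~1 GHz
print(f"{host_seconds:,.0f} s on the host = {host_seconds / 3600:.1f} hours")
# -> 30,000 s on the host = 8.3 hours
```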

What to Simulate
- Simulating the entire application takes too long
- So simulate a subsection
- But which subsection?
  - Random?
  - Starting point?
  - Ending point?
- How do we know that what we selected is good?

Phase behavior: a visual illustration with MCF
- What is a "phase"?
  - An interval of execution, not necessarily contiguous, during which a measured program metric (i.e., code flow) is relatively stable
- "Phase behavior" in this study
  - The relationship between Extended Instruction Pointers (EIPs) and Cycles Per Instruction (CPI)
[Figure: EIPs and CPI over time for the mcf benchmark]
M. Annavaram, R. Rakvic, M. Polito, J. Bouguet, R. Hankins, B. Davies. The Fuzzy Correlation between Code and Performance Predictability. In Proceedings of the 37th International Symposium on Microarchitecture, Dec. 2004.

Why Correlate Code and Performance?
- Improve simulation speed by selective sampling
  - Simulate only a few samples per phase
- Dynamic optimizations
  - Phase changes may trigger dynamic program optimizations
- Reconfigurable/power-aware computing
[Figure: EIPs and CPI over time for the mcf benchmark, with two samples marked]

Program Phase Identification
- Must be independent of the architecture
- Must be quick
- Phases must exist in the dimension we are interested in: CPI, cache misses, branch mispredictions, ...

Basic Block Vectors
- Use the program's basic-block flow as a mechanism to identify similarity
- Control-flow similarity implies program-phase similarity
[Figure: example control-flow graph with basic blocks B1-B4 and two basic block vectors]
- Manhattan distance = |1 - 2| + |1 - 0| = 2
- Euclidean distance = sqrt((1 - 2)^2 + (1 - 0)^2) = sqrt(2)
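To make the distance metrics concrete, here is a minimal sketch in Python; the two vectors b1 and b2 are hypothetical basic block vectors chosen to reproduce the slide's numbers, not values taken from a real trace.

```python
import math

def manhattan(u, v):
    """Manhattan (L1) distance between two basic block vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    """Euclidean (L2) distance between two basic block vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical BBVs: one execution count per basic block (B1..B4).
b1 = [1, 1, 0, 0]
b2 = [2, 0, 0, 0]
print(manhattan(b1, b2))   # |1-2| + |1-0| = 2
print(euclidean(b1, b2))   # sqrt((1-2)^2 + (1-0)^2) = sqrt(2) ~ 1.414
```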

Generating BBVs
- Split the program into 100M-instruction windows
- For each window, compute the BBV
- Compare similarities between BBVs using a distance metric
- Cluster BBVs with minimum distance between themselves into groups
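A minimal sketch of the BBV-generation step, assuming the executed basic blocks are available as a stream of (block id, block length) pairs from some tracing front end; the trace source and the exact normalization are placeholder assumptions.

```python
from collections import Counter

WINDOW = 100_000_000   # instructions per window, as in the slide

def basic_block_vectors(block_trace, num_blocks):
    """Build one basic block vector per fixed-size instruction window.

    block_trace yields (block_id, block_length) pairs in execution order;
    how that trace is produced (binary instrumentation, simulator hooks, ...)
    is outside the scope of this sketch.
    """
    bbvs = []
    counts = Counter()
    executed = 0
    for block_id, block_len in block_trace:
        counts[block_id] += block_len      # weight each block by instructions executed in it
        executed += block_len
        if executed >= WINDOW:
            vec = [counts.get(b, 0) for b in range(num_blocks)]
            total = sum(vec) or 1
            bbvs.append([c / total for c in vec])  # normalize so windows are comparable
            counts.clear()
            executed = 0
    return bbvs
```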

Basic Block Similarity Matrix
- The darker the pattern, the higher the similarity
Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. Automatically characterizing large scale program behavior. SIGOPS Oper. Syst. Rev. 36, 5 (October 2002).

Identifying Phases from BBVs
- A BBV is a very high-dimensional vector (one entry per unique basic block)
- Clustering in high dimensions is extremely complex
- Reduce the dimensionality using random linear projection
- Cluster the lower-order projected vectors using k-means
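A sketch of that projection-plus-clustering step using scikit-learn; the 15-dimensional projection and the fixed k = 4 are arbitrary illustrative choices (SimPoint-style tools typically search over the number of clusters rather than fixing it).

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.cluster import KMeans

def phases_from_bbvs(bbvs, n_dims=15, n_phases=4):
    """Assign each execution window to a phase.

    bbvs: list of normalized basic block vectors, one per window.
    n_dims and n_phases are illustrative values, not prescribed by SimPoint.
    """
    X = np.asarray(bbvs)
    # Random linear projection: reduces dimensionality while roughly preserving distances.
    X_low = GaussianRandomProjection(n_components=n_dims, random_state=0).fit_transform(X)
    # k-means on the low-dimensional vectors groups similar windows into phases.
    labels = KMeans(n_clusters=n_phases, n_init=10, random_state=0).fit_predict(X_low)
    return labels   # labels[i] = phase id of window i; simulate a few windows per phase
```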

Parallel Simulations

Organization
- Why parallel simulation is critical in the future
- Improving parallel simulation speed using slack
- SlackSim: an implementation of our parallel simulator
- Comparison of slack simulation schemes on SlackSim
- Conclusion and future work

CMP Simulation - A Major Design Bottleneck
- Era of CMPs
  - CMPs have become mainstream (Intel, AMD, Sun, ...)
  - Increasing core counts
- Simulation: a crucial tool for architects
  - Simulate a target design on an existing host system
  - Explore the design space
  - Evaluate the merit of design changes
- Typically, all target CMP cores are simulated in a single host thread (single-threaded CMP simulation)
  - When running a single-threaded simulator on a CMP host, only one core is utilized
  - Increasing gap between target core count and the simulation speed achievable with one host core

Parallel Simulation
- Parallel Discrete Event Simulation (PDES)
  - Conservative: barrier synchronization, lookahead
  - Optimistic (checkpoint and rollback): Time Warp
- WWT and WWT II
  - Multiprocessor simulators
  - Conservative quantum-based synchronization
  - Compared to SlackSim:
    - SlackSim provides higher simulation speed
    - SlackSim provides new trade-offs between simulation speed and accuracy
    - Slack is not limited by the target architecture's critical latency

Multithreaded Simulation Schemes
- Simulate a subset of the target cores in each host thread (multithreaded CMP simulation)
- Problem: how do we synchronize the interactions between multiple target cores?
- Cycle-by-cycle
  - Synchronizes all threads at the end of every simulated cycle
  - Simulation is more accurate (not necessarily 100% accurate, due to time dilation!)
  - Improves speed compared to a single thread
  - But still suffers from heavy synchronization overhead and scalability issues

Quantum-based Simulation Schemes
- Critical latency
  - The shortest delay between any two communicating threads (typically the L2 cache access latency in CMPs)
- Quantum-based
  - Synchronize all threads at the end of every quantum (a few simulated cycles), with the quantum no larger than the critical latency
  - Guarantees cycle-by-cycle-equivalent accuracy if the quantum is smaller than the shortest delay (critical latency) between two communicating threads
  - As the communication delays between threads shrink (as is the case in CMPs), the quantum size must be reduced
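A minimal sketch of quantum-based synchronization using one Python thread per target core and a barrier at each quantum boundary; the core count, the 20-cycle quantum, and the simulate_cycle placeholder are assumptions for illustration (cycle-by-cycle synchronization is simply the special case QUANTUM = 1).

```python
import threading

NUM_CORES = 4       # target cores, one host thread each (assumed)
QUANTUM = 20        # simulated cycles between synchronizations; must not exceed
                    # the critical latency for cycle-accurate-equivalent results
TOTAL_CYCLES = 1_000_000

barrier = threading.Barrier(NUM_CORES)

def simulate_cycle(core_id, cycle):
    """Placeholder for one cycle of detailed core + cache simulation."""
    pass

def core_thread(core_id):
    cycle = 0
    while cycle < TOTAL_CYCLES:
        # Advance this core's local time by one quantum without synchronizing.
        for _ in range(QUANTUM):
            simulate_cycle(core_id, cycle)
            cycle += 1
        # All threads meet at the quantum boundary before any of them proceeds,
        # so no core can ever run ahead by more than QUANTUM cycles.
        barrier.wait()

threads = [threading.Thread(target=core_thread, args=(i,)) for i in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```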

Slack Simulation Schemes
- Bounded slack
  - No synchronization as long as a thread's local time < the maximum allowed local time
  - Trades off some accuracy for speed
  - Bounding the slack reduces inaccuracies (yet gives good speedup)
- Unbounded slack
  - No synchronization at all
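A sketch of the bounded-slack idea, assuming each host thread keeps a local cycle count and blocks only when it is more than SLACK cycles ahead of the slowest thread; the structure, parameter values, and function names are illustrative, not the actual SlackSim implementation.

```python
import threading

NUM_CORES = 4
SLACK = 100          # a core may run at most this many cycles ahead of the slowest core
TOTAL_CYCLES = 1_000_000

local_time = [0] * NUM_CORES
cond = threading.Condition()

def simulate_cycle(core_id, cycle):
    """Placeholder for one cycle of core simulation."""
    pass

def core_thread(core_id):
    while local_time[core_id] < TOTAL_CYCLES:
        with cond:
            # Bounded slack: block only if we are too far ahead of the slowest core.
            while local_time[core_id] >= min(local_time) + SLACK:
                cond.wait()
        simulate_cycle(core_id, local_time[core_id])
        with cond:
            local_time[core_id] += 1
            cond.notify_all()   # a core waiting on the old minimum may now proceed

threads = [threading.Thread(target=core_thread, args=(i,)) for i in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Setting SLACK to 1 approximates cycle-by-cycle synchronization, while letting it grow without bound gives the unbounded-slack scheme.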

Comparing Simulation Speed

Simulation Speedup
- Simulate an 8-core target CMP on a 2-, 4-, or 8-core host CMP
- Baseline: the 8-core target CMP simulated on one host core
- Average speedup over Barnes, FFT, LU, and Water-Nsquared computed with the harmonic mean
- As the host core count increases, the gap between the simulation speeds of the target cores widens
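For reference, a tiny example of averaging speedups with the harmonic mean; the per-benchmark speedup values are made up purely to show the calculation and are not results from SlackSim.

```python
from statistics import harmonic_mean

# Hypothetical speedups of parallel simulation over the single-host-core
# baseline for each benchmark (illustrative numbers only).
speedups = {"Barnes": 3.1, "FFT": 2.4, "LU": 2.8, "Water-Nsquared": 3.5}

# The harmonic mean weights the slower benchmarks more heavily than the
# arithmetic mean would, which is the convention used on the slide above.
print(harmonic_mean(list(speedups.values())))
```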

Nondeterministic Workloads (Mikko Lipasti, University of Wisconsin, April 28, 2003)
- Source of nondeterminism: interrupts (e.g., due to constant disk latency)
  - O/S scheduling is perturbed
  - No longer a scientifically controlled experiment
  - How do we compare performance?
[Figure: execution timelines of threads t_a, t_b, and t_c, with and without an interrupt]

Nondeterministic Workloads (continued)
- Source of nondeterminism: data races (e.g., a RAW dependence becomes a WAR)
  - t_d observes the older version, not the value produced by t_b
  - IPC is not a meaningful metric
  - Use workload-specific, high-level measures of work
  - These suffer from cold-start and end effects
[Figure: thread timelines t_a-t_e showing a RAW dependence in one run becoming a WAR in another]

Spatial Variability
- SPECjbb with 16 warehouses on a 16-processor PowerPC SMP, 400 operations per warehouse
- Study the effect of a 10% variation in memory latency
- Same end-to-end work, yet 40% variation in cycles and instructions

Spatial Variability (continued)
- The problem: variability due to (minor) machine changes
  - Interrupts and thread synchronization differ in each experiment
  - Result: a different set of instructions retires in every simulation
  - Cannot use conventional performance metrics (e.g., IPC, miss rates)
- Must measure work and count cycles per unit of work
  - Work: a transaction, a web interaction, a database query
  - Modify the workload to count work and signal the simulator when the work completes
  - Simulate billions of instructions to overcome cold-start and end effects
- One solution: statistical simulation [Alameldeen et al., 2003]
  - Simulate the same interval n times with random perturbations
  - n is determined by the coefficient of variation and the desired confidence interval
  - Problem: for a small relative error, n can be very large
  - Simulate n x billions of instructions per experiment
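As a rough illustration of how quickly n grows, the textbook sample-size estimate is n ≈ (z · CoV / e)² for z-score z at the chosen confidence level, coefficient of variation CoV, and target relative error e; this is the standard formula, not necessarily the exact procedure used by Alameldeen et al.

```python
import math
from statistics import NormalDist

def runs_needed(cov, rel_error, confidence=0.95):
    """Estimate how many randomly perturbed simulations are needed so that the
    sample mean lies within rel_error of the true mean at the given confidence."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)    # two-sided z-score
    return math.ceil((z * cov / rel_error) ** 2)

print(runs_needed(cov=0.05, rel_error=0.01))   # 5% variability, 1% error -> ~97 runs
```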

A Better Solution
- Eliminate spatial variability [Lepak, Cain, Lipasti, PACT 2003]
  - Force each experiment to follow the same path
  - Record a control "trace"
  - Inject stall time to prevent deviation from the trace
  - Bound the sacrifice in fidelity by the injected stall time
- Enables comparisons with a single simulation at each design point
- Simulate tens of millions of instructions per experiment
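A highly simplified sketch of the record/replay idea: a recording run logs the order in which cores win races, and replay runs stall a core that would deviate from that order. The class, method names, and bookkeeping below are invented for illustration; they are not the Lepak/Cain/Lipasti implementation.

```python
from collections import deque

class DeterminismController:
    """Replay a recorded order of race outcomes by injecting stall cycles."""

    def __init__(self, recorded_order=None):
        self.recording = recorded_order is None
        self.trace = deque(recorded_order or [])   # (core_id, address) in logged order
        self.log = []
        self.injected_stall_cycles = 0             # bounds the loss of fidelity

    def request_access(self, core_id, address):
        """Return True if this core may perform the racy access now."""
        if self.recording:
            self.log.append((core_id, address))    # record the observed outcome
            return True
        if self.trace and self.trace[0] == (core_id, address):
            self.trace.popleft()                   # matches the recorded outcome
            return True
        self.injected_stall_cycles += 1            # stall this core for one cycle
        return False
```

On replay, the simulator would call request_access before committing a racy access and stall that core for a cycle whenever it returns False; the accumulated injected_stall_cycles bounds the perturbation introduced by enforcing determinism.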

Determinism Results
- Results match intuition
- Experimental error is bounded (4.2%)
- Can reason about minor variations

Conclusions
- Spatial variability complicates multithreaded program performance evaluation
- Enforcing determinism enables:
  - Relative comparisons with a single simulation
  - Immunity to start/end effects
  - Use of conventional performance metrics
  - Avoiding cumbersome workload-specific setup
- Error is bounded by the injected delay
- AMD has already adopted determinism

© Michel Dubois, Murali Annavaram, Per Strenstrom All rights reserved CHAPTER 9 Simulation Methods SIMULATION METHODS SIMPOINTS PARALLEL SIMULATIONS NONDETERMINISM