1
© Michel Dubois, Murali Annavaram, Per Stenström. All rights reserved.
CHAPTER 9: Simulation Methods
- SIMULATION METHODS
- SIMPOINTS
- PARALLEL SIMULATIONS
- NONDETERMINISM
2
How to Study a Computer System
Methodologies:
- Construct a hardware prototype
- Mathematical modeling
- Simulation
3
Construct a Hardware Prototype
Advantages:
- Runs fast
Disadvantages:
- Takes a long time to build: the RPM (Rapid Prototyping engine for Multiprocessors) project at USC took a few graduate students several years
- Expensive
- Not flexible
4
Mathematically Model the System
Use analytical modeling:
- Probabilistic
- Queuing
- Markov
- Petri net
Advantages:
- Very flexible
- Very quick to develop
- Runs quickly
Disadvantages:
- Cannot capture the effects of system details
- Computer architects are skeptical of models
5
Simulation
Write a program that mimics system behavior.
Advantages:
- Very flexible
- Relatively quick to develop
Disadvantages:
- Runs slowly (e.g., 30,000 times slower than hardware)
- Execution-driven simulators are increasingly complex. How to manage this complexity?
6
Most Popular Research Method
Simulation is chosen by MOST research projects. Why?
- Mathematical models are NOT accurate enough
- Building a prototype is too time-consuming and too expensive for academic researchers
7
Computer Architecture Simulation
- Study the characteristics of a complicated computer system with a fixed configuration
- Explore the design space of a system
- With an accurate model, we can make changes and see how they affect the system
8
Tool Classification: OS Code Execution
System-level (complete system):
- Simulates the behavior of an entire computer system, including OS and user code
- Examples: Simics, SimOS
User-level:
- Does NOT simulate OS code; emulates system calls instead
- Example: SimpleScalar
9
Tool Classification: Simulation Detail
Instruction set:
- Simulates the function of instructions
- Does NOT model detailed micro-architectural timing
- Example: Simics
Micro-architecture:
- Clock-cycle-level simulation
- Speculative, out-of-order multiprocessor timing simulation
- May NOT implement the functionality of the full instruction set or any devices
- Example: SimpleScalar
RTL:
- Logic gate-level simulation
- Example: Synopsys
10
Tool Classification: Simulation Input
Trace-driven:
- The simulator reads a "trace" of instructions captured during a previous execution by software or hardware
- Easy to implement; no functional component needed
- Drawbacks: large trace size; cannot model branch prediction effects such as wrong-path execution
Execution-driven:
- The simulator "runs" the program, generating the instruction stream on the fly
- More difficult to implement, but has many advantages
- Implemented via interpretation or direct execution
- Examples: Simics, SimpleScalar, ... (a minimal trace-driven sketch follows this list)
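To make the trace-driven style concrete, here is a minimal sketch of a trace-driven, direct-mapped cache simulator. The trace format (one hex address per line) and all names are illustrative assumptions, not code from any of the tools above.

```python
# Minimal trace-driven simulation sketch: a direct-mapped cache fed by a
# pre-recorded address trace. Trace format (one hex address per line) and
# all names are illustrative assumptions.
BLOCK_BITS = 6      # 64-byte blocks
NUM_SETS = 1024     # 64 KB direct-mapped cache

def simulate(trace_path):
    tags = [None] * NUM_SETS
    hits = misses = 0
    with open(trace_path) as trace:
        for line in trace:
            addr = int(line, 16)
            block = addr >> BLOCK_BITS
            index = block % NUM_SETS
            tag = block // NUM_SETS
            if tags[index] == tag:
                hits += 1
            else:
                misses += 1
                tags[index] = tag   # fill on miss
    return hits, misses

# hits, misses = simulate("addresses.trace")
```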
11
Tools Introduction and Tutorial
- SimpleScalar: http://www.simplescalar.com/
- Simics: http://www.virtutech.com/ and https://www.simics.net/
- SimWattch
- WattchCMP
12
Simulation Bottleneck
- 1 GHz = 1 billion cycles per second, so simulating one second of a future machine's execution means simulating 1B cycles!
- Simulating 1 cycle of a target costs about 30,000 cycles on the host
- 1 second of target execution = 30,000 seconds on the host = 8.3 hours
- The SPEC CPU2000 benchmarks run for a few hours natively
- Speed is much worse when simulating CMP targets!
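As a quick check of the arithmetic above (the 1 GHz host speed is an assumption):

```python
# Quick check of the slowdown arithmetic above; the 1 GHz host is an assumption.
TARGET_HZ = 1_000_000_000   # 1 GHz target machine
HOST_HZ = 1_000_000_000     # 1 GHz host machine (assumed)
SLOWDOWN = 30_000           # host cycles per simulated target cycle

host_seconds = TARGET_HZ * SLOWDOWN / HOST_HZ   # host time for 1 target second
print(host_seconds, "s =", round(host_seconds / 3600, 1), "hours")  # 30000.0 s = 8.3 hours
```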
13
What to Simulate
- Simulating the entire application takes too long, so simulate a subsection
- But which subsection? A random one? Chosen by starting point? By ending point?
- How do we know that what we selected is representative?
14
Phase Behavior: A Visual Illustration with MCF
What is a "phase"?
- An interval of execution, not necessarily contiguous, during which a measured program metric (e.g., code flow) is relatively stable
"Phase behavior" in this study:
- The relationship between Extended Instruction Pointers (EIPs) and Cycles Per Instruction (CPI)
[Figure: EIPs and CPI over time for the mcf benchmark]
Source: M. Annavaram, R. Rakvic, M. Polito, J. Bouguet, R. Hankins, B. Davies. The Fuzzy Correlation between Code and Performance Predictability. In Proceedings of the 37th International Symposium on Microarchitecture, pages 93-104, Dec. 2004.
15
Why Correlate Code and Performance?
- Improve simulation speed by selective sampling: simulate only a few samples per phase
- Dynamic optimizations: phase changes may trigger dynamic program optimizations
- Reconfigurable/power-aware computing
[Figure: EIPs and CPI over time for the mcf benchmark, with Sample 1 and Sample 2 marked]
16
Program Phase Identification
- Must be independent of architecture
- Must be quick
- Phases must exist in the dimension we are interested in: CPI, cache misses, branch mispredictions, ...
17
Basic Block Vectors
- Use program basic-block flow as a mechanism to identify similarity
- Control-flow similarity implies program-phase similarity
[Figure: a control-flow graph over basic blocks B1-B4, with two execution windows yielding the BBVs (2, 2, 0, 2) and (2, 1, 1, 2)]
Only the B2 and B3 counts differ between the two example BBVs:
- Manhattan distance = |2 - 1| + |0 - 1| = 2
- Euclidean distance = sqrt((2 - 1)^2 + (0 - 1)^2) = sqrt(2)
Both distance computations are sketched below.
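The two distance metrics on the slide's example vectors, as a short sketch:

```python
# Distances between the slide's two example BBVs over blocks B1..B4.
import math

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

bbv1 = [2, 2, 0, 2]
bbv2 = [2, 1, 1, 2]
print(manhattan(bbv1, bbv2))  # 2
print(euclidean(bbv1, bbv2))  # 1.4142... = sqrt(2)
```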
18
Generating BBVs
- Split the program into 100M-instruction windows
- For each window, compute the BBV
- Compare BBV similarity using a distance metric
- Cluster BBVs with minimum distance between themselves into groups
A sketch of the window-by-window BBV generation follows.
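A sketch of BBV generation from a basic-block execution stream; the stream format ((block_id, block_size) pairs) and all names are assumptions for illustration:

```python
# Sketch: build one BBV per 100M-instruction window from a stream of
# (block_id, block_size_in_instructions) events. The stream format and all
# names are illustrative assumptions.
WINDOW = 100_000_000  # instructions per window, as on the slide

def generate_bbvs(block_stream, num_blocks):
    bbvs = []
    bbv, executed = [0] * num_blocks, 0
    for block_id, block_size in block_stream:
        bbv[block_id] += block_size   # weight blocks by their instruction count
        executed += block_size
        if executed >= WINDOW:
            total = sum(bbv)
            bbvs.append([count / total for count in bbv])  # normalize the window
            bbv, executed = [0] * num_blocks, 0
    return bbvs
```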
19
Basic Block Similarity Matrix
- The darker the pattern, the higher the similarity
Source: Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. Automatically Characterizing Large Scale Program Behavior. SIGOPS Oper. Syst. Rev. 36, 5 (October 2002), 45-57.
20
Identifying Phases from BBVs
- A BBV is a very high-dimensional vector (one entry per unique basic block)
- Clustering in high dimensions is extremely expensive
- So reduce dimensionality using random linear projection
- Then cluster the lower-dimensional projected vectors using k-means (sketched below)
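A sketch of this pipeline, assuming NumPy; the projected dimension (15), cluster count, and all names are assumptions here, not the exact SimPoint parameters:

```python
# Sketch: random linear projection of BBVs to a few dimensions, then plain
# k-means clustering. Parameters are illustrative assumptions.
import numpy as np

def identify_phases(bbvs, proj_dim=15, k=4, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(bbvs, dtype=float)              # (windows, num_blocks)
    P = rng.uniform(-1, 1, size=(X.shape[1], proj_dim))
    Y = X @ P                                      # random linear projection
    centers = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)              # nearest center per window
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Y[labels == c].mean(axis=0)
    return labels  # phase id for each 100M-instruction window
```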
21
Parallel Simulations
22
Organization
- Why parallel simulation is critical in the future
- Improving parallel simulation speed using slack
- SlackSim: implementation of our parallel simulator
- Comparison of slack simulation schemes on SlackSim
- Conclusion and future work
23
CMP Simulation: A Major Design Bottleneck
Era of CMPs:
- CMPs have become mainstream (Intel, AMD, Sun, ...)
- Core counts keep increasing
Simulation: a crucial tool for architects:
- Simulate a target design on an existing host system
- Explore the design space
- Evaluate the merit of design changes
Typically, all target CMP cores are simulated in a single host thread (single-threaded CMP simulation):
- When running a single-threaded simulator on a CMP host, only one core is utilized
- The gap between target core count and the simulation speed achievable on one host core keeps widening
24
Parallel Simulation
Parallel Discrete Event Simulation (PDES):
- Conservative: barrier synchronization, lookahead
- Optimistic (checkpoint and rollback): Time Warp
WWT and WWT II:
- Multiprocessor simulators
- Conservative quantum-based synchronization
Compared to them:
- SlackSim provides higher simulation speed
- SlackSim provides new trade-offs between simulation speed and accuracy
- Slack is not limited by the target architecture's critical latency
25
Multithreaded Simulation Schemes
Simulate a subset of target cores in each host thread (multi-threaded CMP simulation).
Problem: how do we synchronize the interactions between multiple target cores?
Cycle-by-cycle:
- Synchronizes all threads at the end of every simulated cycle
- The most accurate scheme (though not necessarily 100% accurate, due to time dilation!)
- Improves speed compared to a single thread
- But still suffers from heavy synchronization overhead and scalability issues
26
Quantum-Based Simulation Schemes
Critical latency:
- The shortest delay between any two communicating threads (typically the L2 cache access latency in CMPs)
Quantum-based:
- Synchronize all threads at the end of every quantum, a fixed number of simulated cycles chosen so that quantum <= critical latency
- Guarantees cycle-by-cycle-equivalent accuracy if the quantum is smaller than the critical latency
- As communication delays between threads shrink (as is the case in CMPs), the quantum size must shrink too
A barrier-based sketch of quantum synchronization follows.
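A minimal sketch of quantum-based synchronization, assuming a hypothetical simulate_quantum() callback that advances one host thread's target cores; the constants are assumptions too:

```python
# Sketch of quantum-based synchronization: each host thread simulates its
# target cores for one quantum, then waits at a barrier for all the others.
# simulate_quantum() and the constants are illustrative assumptions.
import threading

QUANTUM = 10        # simulated cycles per quantum; must be <= critical latency
NUM_THREADS = 4
TOTAL_CYCLES = 1_000
barrier = threading.Barrier(NUM_THREADS)

def host_thread(thread_id, simulate_quantum):
    local_cycle = 0
    while local_cycle < TOTAL_CYCLES:
        simulate_quantum(thread_id, local_cycle, QUANTUM)  # advance target cores
        local_cycle += QUANTUM
        barrier.wait()  # nobody starts the next quantum until everyone finishes
```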
27
Slack Simulation Schemes
Bounded slack:
- No synchronization as long as a thread's local time stays below the maximum allowed local time (the slowest thread's time plus the slack bound)
- Trades some accuracy for speed
- Bounding the slack reduces inaccuracy while retaining good speedup
Unbounded slack:
- No synchronization at all; threads never wait for each other
A sketch of the bounded-slack loop follows.
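A minimal sketch of bounded slack, assuming a hypothetical simulate_cycle() callback; the busy-wait and all names are illustrative, not SlackSim's implementation:

```python
# Sketch of bounded slack: a host thread advances freely while its local
# clock stays within SLACK cycles of the slowest thread. All names are
# illustrative assumptions, not SlackSim's code.
import threading

SLACK = 100                 # maximum allowed gap between fastest and slowest
NUM_THREADS = 4
local_time = [0] * NUM_THREADS
lock = threading.Lock()

def host_thread(thread_id, simulate_cycle, total_cycles):
    while local_time[thread_id] < total_cycles:
        with lock:
            can_advance = local_time[thread_id] < min(local_time) + SLACK
        if not can_advance:
            continue        # spin until the slowest thread catches up
        simulate_cycle(thread_id, local_time[thread_id])  # advance target core
        with lock:
            local_time[thread_id] += 1
```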
28
Comparing Simulation Speed
29
Simulation Speedup
- Simulate an 8-core target CMP on a 2-, 4-, or 8-core host CMP
- Baseline: the 8-core target CMP simulated on one host core
- Average speedup over Barnes, FFT, LU, and Water-Nsquared, computed with the harmonic mean (see the sketch below)
- As host core count increases, the gap between the simulation speeds of target cores widens
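The harmonic-mean averaging, with placeholder speedup values (not the measured data):

```python
# Harmonic-mean average of per-benchmark speedups. The values below are
# placeholders for illustration, not the measured results.
from statistics import harmonic_mean

speedups = {"Barnes": 2.1, "FFT": 1.8, "LU": 2.4, "Water-Nsquared": 2.0}
print(round(harmonic_mean(speedups.values()), 2))
```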
30
Nondeterministic Workloads (Mikko Lipasti, University of Wisconsin, April 28, 2003)
Source of nondeterminism: interrupts (e.g., due to constant disk latency)
- O/S scheduling is perturbed
- No longer a controlled scientific experiment
- How do we compare performance?
[Figure: two executions of threads t_a and t_b; an interrupt in the second run shifts the schedule and introduces an extra segment t_c]
31
Nondeterministic Workloads
Source of nondeterminism: data races (e.g., a RAW dependence becomes a WAR)
- Thread t_d observes an older version of the data, not the one produced by t_b
- IPC is not a meaningful metric
- Use workload-specific high-level measures of work instead
- These suffer from cold-start and end effects
[Figure: threads t_a through t_e; the same pair of accesses is ordered RAW in one run and WAR in another]
32
Spatial Variability
- SPECjbb with 16 warehouses on a 16-processor PowerPC SMP, 400 operations per warehouse
- Study the effect of a 10% variation in memory latency
- Same end-to-end work, yet 40% variation in cycles and instructions
33
Spatial Variability
The problem: variability due to (minor) machine changes
- Interrupts and thread synchronization differ in each experiment
- Result: a different set of instructions retires in every simulation
- Conventional performance metrics (e.g., IPC, miss rates) cannot be used
- Instead, measure work and count cycles per unit of work
  - Work: a transaction, a web interaction, a database query
  - Modify the workload to count work and signal the simulator when a unit of work completes
  - Simulate billions of instructions to overcome cold-start and end effects
One solution: statistical simulation [Alameldeen et al., 2003]
- Simulate the same interval n times with random perturbations
- n is determined by the coefficient of variation and the desired confidence interval
- Problem: for a small relative error, n can be very large
- Simulate n x billions of instructions per experiment
A sketch of a common way to size n follows.
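One common way to estimate n from the coefficient of variation (CoV) and a desired relative error at a given confidence level is n >= (z * CoV / error)^2; this is a standard statistics result and may not match the exact procedure of Alameldeen et al.:

```python
# Estimate the number of runs n from the coefficient of variation and the
# desired relative error: n >= (z * CoV / rel_error)^2. A standard statistics
# sketch, not necessarily the exact procedure of Alameldeen et al.
import math
from statistics import NormalDist

def runs_needed(cov, rel_error, confidence=0.95):
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided z-score
    return math.ceil((z * cov / rel_error) ** 2)

print(runs_needed(cov=0.05, rel_error=0.01))  # small error -> many runs (97)
```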
34
A Better Solution
Eliminate spatial variability [Lepak, Cain, Lipasti, PACT 2003]:
- Force each experiment to follow the same path
  - Record a control "trace"
  - Inject stall time to prevent deviation from the trace
- Bound the sacrifice in fidelity by the amount of injected stall time
- Enables comparisons with a single simulation at each design point
- Simulate tens of millions of instructions per experiment
A sketch of the record/replay idea follows.
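A sketch of the record/replay idea; every name here is an illustrative assumption, not the PACT 2003 implementation:

```python
# Sketch of determinism via a recorded control trace. In record mode, the
# simulator logs the global order of critical events (e.g., lock acquires);
# in replay mode, a thread may perform an event only when it is next in the
# trace, otherwise it stalls for a cycle. Names are illustrative assumptions.

def record_event(trace, thread_id, event_id):
    trace.append((thread_id, event_id))   # remember the observed order

def try_event(trace, next_idx, thread_id, event_id, stalls):
    """Called each simulated cycle a thread attempts a critical event."""
    if trace[next_idx[0]] == (thread_id, event_id):
        next_idx[0] += 1
        return True                       # this thread's turn: proceed
    stalls[thread_id] += 1                # not its turn: inject one stall cycle
    return False                          # injected stalls bound fidelity loss
```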
35
Determinism Results
- Results match intuition
- Experimental error is bounded (4.2%)
- Can reason about minor variations
36
Conclusions
- Spatial variability complicates multithreaded program performance evaluation
- Enforcing determinism enables:
  - Relative comparisons with a single simulation
  - Immunity to start/end effects
  - Use of conventional performance metrics
  - Avoiding cumbersome workload-specific setup
- Error is bounded by the injected delay
- AMD has already adopted determinism
37
© Michel Dubois, Murali Annavaram, Per Strenstrom All rights reserved CHAPTER 9 Simulation Methods SIMULATION METHODS SIMPOINTS PARALLEL SIMULATIONS NONDETERMINISM