Published by Amani Newbury. Modified over 10 years ago.
1
SimpleScalar v3.0 Tutorial
U. of Wisconsin, CS752, Fall 2004
Andrey Litvin (main source: Austin & Burger; also Dana Vantrease’s slides)
2
Simulator Basics
What is an architectural simulator?
–Tool that reproduces the behavior of a computing device
Why use a simulator?
–Flexible: rapid design space exploration, abstraction level tailored to need
–Cheap
Why not use a simulator?
–Slow
–Correctness?
3
Functional vs. Performance
Functional simulators implement the architecture
–Perform the actual execution
–Implement what programmers see
Performance (or timing) simulators implement the microarchitecture
–Model system resources/internals
–Measure time
–Implement what programmers do not see
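The split described on this slide can be illustrated with a minimal sketch. The toy ISA, the `LATENCY` table, and the function names below are hypothetical and are not taken from SimpleScalar; the timing model is deliberately naive (serial, non-pipelined) so that the separation between "what the program computes" and "how long it takes" stays visible.

```python
# Hypothetical toy ISA: ("addi", rd, rs, imm) and ("mul", rd, rs, rt).
LATENCY = {"addi": 1, "mul": 3}  # assumed per-op latencies, in cycles


def functional_step(regs, inst):
    """Architectural effect only: the state a programmer can observe."""
    op, rd, a, b = inst
    if op == "addi":
        regs[rd] = regs[a] + b
    elif op == "mul":
        regs[rd] = regs[a] * regs[b]
    else:
        raise ValueError(f"unknown op: {op}")


def run(program, nregs=4):
    """Run the functional model and a naive serial timing model side by side."""
    regs = [0] * nregs
    cycles = 0
    for inst in program:
        functional_step(regs, inst)   # functional simulation: compute results
        cycles += LATENCY[inst[0]]    # performance simulation: account for time
    return regs, cycles


program = [("addi", 1, 0, 5), ("addi", 2, 0, 7), ("mul", 3, 1, 2)]
regs, cycles = run(program)
```

Replacing the `cycles += ...` line with a pipelined or out-of-order model would change the cycle count but never the register contents, which is the point of keeping the two concerns apart.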
5
Trace vs. Execution Driven (2)
Trace-driven
+ Easy to implement
- Requires large disk files to store the instruction stream
- Limited state stored (speculation? multiprocessor?)
Execution-driven
- Hard to implement
+ Allows access to full state information at any point during program execution
+/- Requires an instruction set emulator and an I/O emulation module
6
SimpleScalar
Developed by Todd Austin with Doug Burger at UW, ’94-’96
Execution-driven
Collection of simulators that emulate the microprocessor at different levels of detail (functional, functional + cache/bpred, out-of-order cycle-timer, etc.)
Tools:
–C/Fortran compiler, assembler, linker (for PISA)
–DLite: target-machine-level debugger
–Pipeline trace viewer
7
Advantages of SimpleScalar
Fast (speed given priority over clarity)
–4 MIPS functional, 300 KIPS OOO on 1.5 GHz host
Relatively(!) simple: short learning curve
Modular design
Well documented and commented
Popular: support community, extensions
Limitations will be summarized at the end
10
Sim-Fast
Bare functional simulator
Does not account for the behavior of any part of the microarchitecture
Optimized for speed
11
Sim-Safe
Similar to sim-fast, but slower
Implements some memory operation safeguards
–Memory alignment
–Memory access permission
Good for debugging sim-fast crashes
12
Sim-EIO
External trace/checkpoint generator
Functional simulator like sim-fast and sim-safe
Implements checks like sim-safe
13
Sim-Profile
Profiles by symbol and address
Keeps track of and reports (many options)
–Dynamic instruction counts
–Instruction class counts
–Usage of address modes
–Profiles of the text & data segment
14
Sim-Cache/Sim-Bpred
Functional core drives the detailed model of the cache/branch predictor
Similar to trace-driven cache/bpred simulation
Fast results for miss/misprediction rates
No timing simulation: performance impact is not measured
15
Cache Implementation
Block size, number of sets, and associativity are all customizable
Replacement policies: random, FIFO, LRU
2-level cache hierarchy supported (easily extended)
Unified or separate I/D caches at both L1 and L2
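To make the three parameters on this slide concrete, here is a minimal sketch of a set-associative cache with LRU replacement. It is an illustration only and is not SimpleScalar's cache module; the class name `Cache` and its fields are hypothetical. An address is split into a block number, a set index, and a tag, and each set keeps its tags in recency order.

```python
from collections import OrderedDict


class Cache:
    """Hypothetical set-associative cache model with LRU replacement."""

    def __init__(self, nsets, bsize, assoc):
        self.nsets, self.bsize, self.assoc = nsets, bsize, assoc
        # One ordered dict per set; iteration order is LRU -> MRU.
        self.sets = [OrderedDict() for _ in range(nsets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit, False on a miss (the block is then filled)."""
        block = addr // self.bsize
        index = block % self.nsets
        tag = block // self.nsets
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)        # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(ways) >= self.assoc:
            ways.popitem(last=False)     # evict the least recently used tag
        ways[tag] = True
        return False


cache = Cache(nsets=2, bsize=16, assoc=2)
results = [cache.access(a) for a in (0, 32, 0, 64, 32)]
```

Swapping the eviction line for a random choice, or skipping the `move_to_end` call, would give the random and FIFO policies the slide mentions.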
16
Implemented Predictors
Specifying branch predictor type: -bpred <type>
Implemented predictors
–nottaken: always predicts not taken
–taken: always predicts taken
–perfect: always right
–bimod: BTB with 2-bit counters
–2lev: 2-level adaptive branch predictor
  Specify level-1 size, level-2 size, history size (# bits), and whether to XOR PC and history
  Covers GAg, GAp, PAg, PAp, gshare
–comb: combining (meta) predictor
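The difference between the bimod and gshare entries above can be sketched in a few lines. This is a simplified illustration, not SimpleScalar's bpred module: the class names, the initial counter value, the 4-byte instruction assumption, and the `accuracy` helper are all hypothetical. Both predictors use 2-bit saturating counters; gshare additionally XORs a global history register into the table index.

```python
class Bimod:
    """Table of 2-bit saturating counters indexed by PC (hypothetical sketch)."""

    def __init__(self, size):
        self.size = size
        self.table = [2] * size          # start weakly taken (an assumption)

    def _index(self, pc):
        return (pc >> 2) % self.size     # assumes 4-byte instructions

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


class GShare(Bimod):
    """Same counters, but indexed by PC XOR global branch history."""

    def __init__(self, size, hist_bits):
        super().__init__(size)
        self.hist_bits = hist_bits
        self.history = 0

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) % self.size

    def update(self, pc, taken):
        super().update(pc, taken)        # uses the history seen at predict time
        mask = (1 << self.hist_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def accuracy(pred, pc, outcomes):
    """Fraction of outcomes predicted correctly for a single branch."""
    correct = 0
    for taken in outcomes:
        correct += pred.predict(pc) == taken
        pred.update(pc, taken)
    return correct / len(outcomes)
```

On a strictly alternating taken/not-taken branch, the bimodal counter oscillates between two states and mispredicts every other outcome, while the history-indexed table learns a separate counter for each history pattern.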
17
Sim-Outorder
Detailed performance simulator
Out-of-order execution core
Register renaming, reorder buffer, speculative execution
2-level cache hierarchy
Branch prediction
20
Making a Fast Timing Simulator
Hardware advantage: parallelism
Software emulator advantage: free space for auxiliary structures
Functional and detailed performance parts can be decoupled
21
Simulator RUU vs. Logical RUU
No consumer->producer links (tags)
Rather, producer->consumer links for efficient result broadcast at writeback
Values not tracked (except addresses)
“Completed” bit (ROB and IQ combined)
Same struct used for LSQ
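A minimal sketch of the producer->consumer idea, under stated assumptions: `RUUEntry`, `link`, and `writeback` are hypothetical names, and `pending` counts only the source operands still waiting on an in-flight producer. Instead of every waiting instruction comparing tags against each broadcast result, the producer walks its own list of consumers when it completes.

```python
class RUUEntry:
    """Hypothetical RUU entry: no operand values, only dependence bookkeeping."""

    def __init__(self, name, n_pending):
        self.name = name
        self.pending = n_pending   # operands still waiting on in-flight producers
        self.consumers = []        # producer -> consumer links
        self.completed = False


def link(producer, consumer):
    """Record that consumer reads a result produced by producer."""
    producer.consumers.append(consumer)


def writeback(entry):
    """Mark entry completed and return the consumers that just became ready."""
    entry.completed = True
    woken = []
    for consumer in entry.consumers:
        consumer.pending -= 1
        if consumer.pending == 0:
            woken.append(consumer)
    return woken


a = RUUEntry("a", 0)
b = RUUEntry("b", 0)
c = RUUEntry("c", 2)   # c reads the results of both a and b
link(a, c)
link(b, c)
```

The cost of a writeback is proportional to the number of actual consumers rather than to the size of the window, which is the efficiency the slide refers to.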
23
Other Simulator Structures
ready_queue
–Ready-to-issue instructions
–Used at issue stage
–Built at writeback
–Limited issue bandwidth
–Policy: mul/div, load/store, branch first, then oldest first
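The issue policy in the last bullet can be sketched as an insertion rule on a list. This is an assumption-laden illustration, not the sim-outorder code: `PRIORITY_CLASSES`, `readyq_insert`, and `issue` are hypothetical names, instructions are modeled as `(seq, op)` pairs, and the order among priority operations themselves is simply most-recently-inserted first.

```python
PRIORITY_CLASSES = {"mul", "div", "load", "store", "branch"}


def readyq_insert(queue, inst):
    """Insert (seq, op): priority ops go to the head, the rest stay oldest first."""
    seq, op = inst
    if op in PRIORITY_CLASSES:
        queue.insert(0, inst)
        return
    pos = len(queue)
    for i, (other_seq, other_op) in enumerate(queue):
        # Skip past priority ops and any older ordinary instructions.
        if other_op not in PRIORITY_CLASSES and other_seq > seq:
            pos = i
            break
    queue.insert(pos, inst)


def issue(queue, width):
    """Take up to width instructions from the head (limited issue bandwidth)."""
    return queue[:width], queue[width:]


queue = []
for inst in [(3, "add"), (1, "add"), (2, "load"), (4, "mul")]:
    readyq_insert(queue, inst)
issued, remaining = issue(queue, width=2)
```

With an issue width of 2, the long-latency and memory operations leave first and the two ordinary adds wait in age order.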
24
Other Simulator Structures
fu_pool: functional units
–Issue latency vs. operational latency
–Both constant, hard-coded in sim-outorder
–Read port latency more detailed (cache simulation)
–Issue latency: busy counter at FU
–Operational latency: event queue
event_queue
–Ordered by cycle
–Schedules writeback events
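The two latencies play different roles, which a short sketch may make clearer. The names `FU`, `try_issue`, and `pop_completed` and the latency values are hypothetical. Issue latency determines when the unit can accept another operation (the busy counter); operational latency determines when the result is written back (an entry in a cycle-ordered event queue).

```python
import heapq


class FU:
    """Hypothetical functional unit with separate issue and operation latency."""

    def __init__(self, oplat, issuelat):
        self.oplat = oplat
        self.issuelat = issuelat
        self.busy_until = 0    # first cycle at which a new op can be accepted


def try_issue(fu, now, inst, event_queue):
    """Reserve the unit and schedule the writeback event, if the unit is free."""
    if now < fu.busy_until:
        return False
    fu.busy_until = now + fu.issuelat
    heapq.heappush(event_queue, (now + fu.oplat, inst))
    return True


def pop_completed(event_queue, now):
    """Remove and return every instruction whose writeback cycle has arrived."""
    done = []
    while event_queue and event_queue[0][0] <= now:
        done.append(heapq.heappop(event_queue)[1])
    return done


event_queue = []
divider = FU(oplat=20, issuelat=19)   # assumed values: a mostly unpipelined unit
first = try_issue(divider, 0, "div1", event_queue)
blocked = try_issue(divider, 1, "div2", event_queue)
second = try_issue(divider, 19, "div2", event_queue)
```

A fully pipelined unit would have an issue latency of 1 and could accept a new operation every cycle regardless of its operational latency.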
25
Other Simulator Structures
create_vector
–Register renaming
–Maps logical register to RUU entry (architectural register is up to date if entry is null)
–Updated at dispatch and writeback
–Similar to Qi in Tomasulo
–Backed up during dispatch if sim “divines” misspeculation
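A minimal sketch of the mapping described above, with hypothetical names (`RenameTable`, `cv`, `checkpoint`, `restore`) that are not taken from the SimpleScalar source. Each logical register maps either to the index of the RUU entry that will produce it or to `None`, meaning the architectural register already holds the current value.

```python
class RenameTable:
    """Hypothetical create-vector model: logical register -> producing RUU entry."""

    def __init__(self, nregs):
        self.cv = [None] * nregs   # None: architectural register is up to date

    def dispatch(self, ruu_index, dest, srcs):
        """Return the in-flight producers of srcs, then claim dest."""
        deps = [self.cv[r] for r in srcs if self.cv[r] is not None]
        if dest is not None:
            self.cv[dest] = ruu_index
        return deps

    def writeback(self, ruu_index, dest):
        """Clear the mapping unless a younger instruction has since renamed dest."""
        if dest is not None and self.cv[dest] == ruu_index:
            self.cv[dest] = None

    def checkpoint(self):
        """Copy taken at dispatch when a misprediction is detected."""
        return list(self.cv)

    def restore(self, saved):
        self.cv = list(saved)


rt = RenameTable(4)
deps0 = rt.dispatch(0, dest=1, srcs=[2, 3])   # r1 <- r2 op r3
deps1 = rt.dispatch(1, dest=2, srcs=[1])      # r2 <- f(r1), depends on entry 0
saved = rt.checkpoint()
deps2 = rt.dispatch(2, dest=1, srcs=[2])      # wrong-path instruction renames r1
rt.writeback(0, dest=1)                       # entry 0 completes; r1 stays mapped to 2
mapped_after_writeback = rt.cv[1]
rt.restore(saved)                             # roll back the wrong-path renaming
```

The check in `writeback` matters: without it, an older producer completing would erase the mapping created by a younger instruction writing the same register.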
28
Stage Implementation (greater detail in SS hack guide and v2.0 tutorial)
Fetch (at ruu_fetch())
–Fetch instructions up to cache line boundary
–Block on miss
–Put in Fetch Queue (FQ)
–Probe branch predictor for next cache line to fetch from
29
Stage Implementation
Decode/dispatch (at ruu_dispatch())
–Fetch from FQ
–Decode
–Rename registers
–Put into RUU and LSQ
–Update RUU dependency lists, renaming table
–Sim (functional core):
  Execute instruction, update machine state
  Detect branch misprediction, back up state (checkpoint)
30
Stage Implementation
Issue (at ruu_issue() and lsq_refresh())
–Ready queue -> Event queue
–Order based on policy (see ready_queue slide)
–Check and reserve FUs
–Memory ops check memory dependences
  No load speculation (“maybe” dependence respected)
31
Stage Implementation
Writeback (at ruu_writeback())
–Get finished instructions from event queue
–Wake up (put in ready queue) instructions with ready operands (use dependence list)
–Performance core detects misprediction here and rolls back the state
32
Stage Implementation
Commit (at ruu_commit())
–Service D-TLB misses
–Update register file (logically) and rename table
–Retire stores to D-cache
–Reclaim RUU/LSQ entries of retirees
33
Limitations
I/O and other system calls
–Only some limited functional simulation
Lacks support for arbitrary speculation
–Only branches can cause rollback
–No speculative loads
–Harder problem: decoupling of functional and timing cores (for performance) complicates data speculation extensions
34
Limitations of Memory System
No causality in memory system
–All events are calculated at time of first access
Simple TLB and virtual memory model
–No address translation
–TLB miss is a (small) fixed latency, in parallel with cache access
Bandwidth, non-blocking
–Modeled as n FUs (read/write ports with fixed issue delay)
Accurate if memory system is lightly utilized
–SMT extensions?
Overhaul required for multiprocessor simulation
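The "calculated at time of first access" and "parallel with cache access" points can be illustrated with a small sketch. The function name and all latency values are assumptions chosen for illustration. The whole latency of a memory access is decided in one call, and a TLB miss overlaps with the cache access instead of adding to it.

```python
def access_latency(cache_hit, tlb_hit, hit_lat=1, miss_lat=18, tlb_miss_lat=30):
    """Latency of one access, fixed at access time (all values are assumed)."""
    cache_lat = hit_lat if cache_hit else miss_lat
    tlb_lat = 0 if tlb_hit else tlb_miss_lat
    # Overlapped, not summed: the slower of the two paths determines the latency.
    return max(cache_lat, tlb_lat)


latencies = [
    access_latency(cache_hit=True, tlb_hit=True),
    access_latency(cache_hit=False, tlb_hit=True),
    access_latency(cache_hit=True, tlb_hit=False),
    access_latency(cache_hit=False, tlb_hit=False),
]
```

Because the result is a single number computed up front, later events such as contention from other requests cannot lengthen an access already in flight, which is why the model is only accurate when the memory system is lightly utilized.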
35
Extensions
www.simplescalar.com
Trace cache
Value prediction
SMT
Multiprocessor
More target ISA / host OS support
36
Miscellaneous
Enter “sim-outorder” with no options to get the options list and usage examples
Options database (options.h)
–Interface to register an option and check if entered at initialization
Stats database (stats.h)
–Register new stats
–Define secondary stats and track distributions
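The idea of primary and secondary stats can be sketched as a small registry. This is a hypothetical illustration and does not reproduce the stats.h interface: the class `StatsDB`, its methods, and the stat names used in the example are assumptions. Primary stats are counters updated during simulation; secondary stats are formulas evaluated over the counters at report time.

```python
class StatsDB:
    """Hypothetical stats registry: raw counters plus derived (secondary) stats."""

    def __init__(self):
        self.counters = {}
        self.formulas = {}

    def reg_counter(self, name):
        self.counters[name] = 0

    def reg_formula(self, name, fn):
        """fn receives the counter dict and returns the derived value."""
        self.formulas[name] = fn

    def add(self, name, n=1):
        self.counters[name] += n

    def report(self):
        out = dict(self.counters)
        for name, fn in self.formulas.items():
            out[name] = fn(self.counters)
        return out


db = StatsDB()
db.reg_counter("num_insn")
db.reg_counter("num_cycles")
db.reg_formula("ipc", lambda c: c["num_insn"] / c["num_cycles"])
db.add("num_insn", 300)
db.add("num_cycles", 200)
report = db.report()
```

Keeping derived values as formulas means the simulator's hot loop only increments counters, and ratios such as IPC or miss rates are computed once at the end.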