2015-11-22 Introduction to SimpleScalar (Based on SimpleScalar Tutorial) CSCE614 Hyunjun Jang Texas A&M University.



Overview
● What is an architectural simulator?
  – A tool that reproduces the behavior of a computing device
● Why use a simulator?
  – Leverages a faster, more flexible software development cycle
  – Permits more design space exploration
  – Facilitates validation before H/W becomes available
  – Level of abstraction is tailored to the design task
  – Possible to increase/improve system instrumentation
  – Usually less expensive than building a real system

Advantages of SimpleScalar
● Highly flexible
  – Functional simulator + performance simulator
● Portable
  – Host: virtual target runs on most Unix-like systems
  – Target: simulators can support multiple ISAs
● Extensible
  – Source is included for compiler, libraries, simulators
  – Easy to write simulators
● Performance
  – Runs codes approaching ‘real’ sizes

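Simulation Tools
[Figure: taxonomy of architectural simulators, with nodes Functional, Performance, Trace-Driven, Exec-Driven, Interpreters, Direct Execution, Inst Schedulers, and Cycle Timers. Shaded tools are included in the SimpleScalar Tool Set. The labels 1), 2) and 3) mark the three distinctions covered on the following slides.]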

1) Functional vs. Performance Simulators
● Functional simulators implement the architecture
  – Perform real execution
  – Implement what programmers see
● Performance simulators implement the microarchitecture
  – Model system resources/internals
  – Are concerned with timing
  – Do not implement what programmers see

2) Trace-Driven vs. Execution-Driven Simulators
● Trace-Driven
  – Simulator reads a ‘trace’ of the instructions captured during a previous execution
  – Easy to implement
  – No functional components necessary
  – No feedback to the trace (e.g., mis-prediction)
● Execution-Driven
  – Simulator runs the program (trace-on-the-fly)
  – Hard to implement
  – Advantages
    – Faster than tracing
    – No need to store traces
    – Register and memory values are available (they usually are not in a trace)
    – Supports mis-speculation cost modeling
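
The trace-driven case can be illustrated with a minimal sketch. It is not SimpleScalar code: the record layout `trace_rec` and the function `replay_not_taken` are hypothetical names chosen for this example. The simulator only replays what was recorded, so it needs no functional components, and it cannot model what would have happened on a mis-predicted path.

```c
#include <stddef.h>

/* One record of a hypothetical branch trace captured during an earlier run. */
struct trace_rec {
    unsigned long pc; /* address of the branch instruction */
    int taken;        /* 1 if the branch was taken, 0 otherwise */
};

/* Replay the trace against a static "always not taken" predictor and
   return the number of mis-predictions. The trace is read-only: the
   simulator has no way to feed anything back into it. */
size_t replay_not_taken(const struct trace_rec *trace, size_t n)
{
    size_t mispredicts = 0;
    for (size_t i = 0; i < n; i++) {
        if (trace[i].taken) /* predicted not taken, but it was taken */
            mispredicts++;
    }
    return mispredicts;
}
```

An execution-driven simulator would instead produce each record on the fly by running the program, which is why it can also follow wrong-path instructions.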

3) Instruction Schedulers vs. Cycle Timers
● Instruction Schedulers
  – Simulator schedules an instruction when resources are available
  – Instructions are processed one at a time
  – Simpler, but less detailed
● Cycle Timers
  – Simulator tracks microarchitecture state each cycle
  – Simulator state == microarchitecture state
  – Perfect for microarchitecture simulation

SimpleScalar Release 3.0
● SimpleScalar now executes multiple instruction sets: SimpleScalar PISA (the old "SimpleScalar ISA") and Alpha AXP
● All simulators now support external I/O traces (EIO traces)
  – Generated with a new simulator (sim-eio)
● Supports more platforms
● Explicit fault support
● And many more

Simulator Suite
● 1) Sim-Fast
  – 300 lines
  – functional
  – 4+ MIPS
● 2) Sim-Safe
  – 350 lines
  – functional w/checks
● 3) Sim-Profile
  – 900 lines
  – functional
  – Lot of stats
● 4) Sim-Cache and 5) Sim-BPred
  – < 1000 lines
  – functional
  – Cache stats
  – Branch stats
● 6) Sim-Outorder
  – lines (count lost in extraction)
  – performance
  – OoO issue
  – Branch pred.
  – Mis-spec.
  – ALUs
  – Cache
  – TLB
  – KIPS (rate lost in extraction)
[Figure axes: Performance, Detail]

1) Sim-Fast
● Functional simulation
● Optimized for speed
● Assumes no cache
● Assumes no instruction checking
● Does not support Dlite!
● Does not allow command line arguments
● <300 lines of code

2) Sim-Safe
● Functional simulation
● Checks for instruction errors
● Optimized for speed
● Assumes no cache
● Supports Dlite!
● Does not allow command line arguments

3) Sim-Profile
● Program profiler
● Generates detailed profiles, by symbol and by address
● Keeps track of and reports
  – Dynamic instruction counts
  – Instruction class counts
  – Branch class counts
  – Usage of address modes
  – Profiles of the text & data segment

4) Sim-Cache
● Cache simulation
● Ideal for fast simulation of caches (when the effect of cache performance on execution time is not needed)
● Accepts command line arguments for:
  – Level 1 & 2 instruction and data caches
  – TLB configuration (data and instruction)
  – Flush and compress
  – And more
● Ideal for performing high-level cache studies that don’t take the access time of the caches into account
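
A functional cache simulation of this kind only needs to track which blocks are resident and count hits and misses. The following is a minimal sketch of a direct-mapped cache, not the Sim-Cache implementation: the names `dm_cache`, `dm_init` and `dm_access`, and the sizes (64 lines of 32 bytes), are assumptions made for the example.

```c
#include <stdint.h>
#include <string.h>

#define DM_SETS 64      /* number of cache lines (assumed) */
#define DM_BLOCK_BITS 5 /* 32-byte blocks (assumed) */

struct dm_cache {
    uint32_t tag[DM_SETS];
    uint8_t valid[DM_SETS];
    unsigned long hits, misses;
};

/* Start with every line invalid and both counters at zero. */
void dm_init(struct dm_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Look up one address. Returns 1 on a hit and 0 on a miss; on a miss
   the line is filled, evicting whatever block was there. No timing is
   modeled, only hit/miss statistics. */
int dm_access(struct dm_cache *c, uint32_t addr)
{
    uint32_t block = addr >> DM_BLOCK_BITS; /* drop the block offset */
    uint32_t set = block % DM_SETS;         /* line the block maps to */
    uint32_t tag = block / DM_SETS;         /* remaining address bits */

    if (c->valid[set] && c->tag[set] == tag) {
        c->hits++;
        return 1;
    }
    c->valid[set] = 1;
    c->tag[set] = tag;
    c->misses++;
    return 0;
}
```

Feeding every load and store address of a functionally executed program through `dm_access` yields a miss rate without any notion of access latency, which matches the kind of high-level study described above.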

5) Sim-Bpred
● Simulates different branch prediction mechanisms
● Generates prediction hit and miss rate reports
● Does not simulate the effect of branch prediction on total execution time
● Supported predictors:
  – notTaken
  – taken
  – perfect
  – bimod: bimodal predictor, using a branch target buffer (BTB) with 2-bit counters
  – 2lev: 2-level adaptive predictor
  – comb: combined predictor (bimodal and 2-level)
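
The 2-bit counter scheme behind a bimodal predictor can be sketched as follows. This is an illustration under stated assumptions rather than the Sim-Bpred code: the table size, the PC-based index and the names `bimod`, `bimod_predict` and `bimod_update` are hypothetical, and the branch target buffer is left out so that only the direction prediction is shown.

```c
#include <stdint.h>

#define BIMOD_SIZE 1024 /* table entries (assumed, power of two) */

struct bimod {
    uint8_t ctr[BIMOD_SIZE]; /* 2-bit saturating counters, values 0..3 */
};

/* Start every counter at 1 ("weakly not taken"). */
void bimod_init(struct bimod *b)
{
    for (int i = 0; i < BIMOD_SIZE; i++)
        b->ctr[i] = 1;
}

/* Index the table with low-order bits of the word-aligned PC. */
static unsigned bimod_index(uint32_t pc)
{
    return (pc >> 2) & (BIMOD_SIZE - 1);
}

/* Predict taken (1) when the counter is in its upper half. */
int bimod_predict(const struct bimod *b, uint32_t pc)
{
    return b->ctr[bimod_index(pc)] >= 2;
}

/* Train with the actual outcome, saturating at 0 and 3. */
void bimod_update(struct bimod *b, uint32_t pc, int taken)
{
    uint8_t *c = &b->ctr[bimod_index(pc)];
    if (taken) {
        if (*c < 3)
            (*c)++;
    } else {
        if (*c > 0)
            (*c)--;
    }
}
```

Comparing `bimod_predict` with the real outcome before each `bimod_update` gives the hit and miss counts that a branch prediction report is built from.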

6) Sim-Outorder
● Most complicated and detailed simulator
● Supports out-of-order issue and execution
● Provides reports
  – Branch prediction
  – Cache
  – External memory
  – Various configurations

Sim-Outorder HW Architecture
[Figure: pipeline diagram. Stages: Fetch, Dispatch, Register Scheduler, Memory Scheduler, Exe, Mem, Writeback, Commit. Memory system: I-Cache, I-TLB, D-Cache, D-TLB, Virtual Memory.]

Sim-Outorder (Main Loop)
● sim_main() in sim-outorder.c
    ruu_init();
    for (;;) {
      ruu_commit();
      ruu_writeback();
      lsq_refresh();
      ruu_issue();
      ruu_dispatch();
      ruu_fetch();
    }
● The loop body is executed once for each simulated machine cycle
● Walks the pipeline from Commit to Fetch
  – Reverse traversal handles inter-stage latch synchronization in a single pass
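
Why the reverse walk needs only one pass can be seen in a toy model. The sketch below is not taken from sim-outorder.c: it is a hypothetical three-stage pipeline (`pipe`, `pipe_init`, `pipe_cycle`) with a single latch per stage. Because the last stage is processed first, each stage consumes its input latch before the stage in front of it overwrites that latch, so no second copy of the latches is required.

```c
#define PIPE_STAGES 3
#define PIPE_EMPTY (-1)

struct pipe {
    int latch[PIPE_STAGES]; /* latch[i]: instruction id held in stage i */
    int next_id;            /* id of the next instruction to fetch */
    int retired;            /* instructions that have left the last stage */
};

void pipe_init(struct pipe *p)
{
    for (int i = 0; i < PIPE_STAGES; i++)
        p->latch[i] = PIPE_EMPTY;
    p->next_id = 0;
    p->retired = 0;
}

/* Simulate one machine cycle, walking from the last stage to the first. */
void pipe_cycle(struct pipe *p)
{
    /* Last stage drains first, freeing its latch. */
    if (p->latch[PIPE_STAGES - 1] != PIPE_EMPTY)
        p->retired++;

    /* Each later stage takes what the stage before it held last cycle. */
    for (int i = PIPE_STAGES - 1; i > 0; i--)
        p->latch[i] = p->latch[i - 1];

    /* First stage fetches a new instruction. */
    p->latch[0] = p->next_id++;
}
```

If the stages were walked front to back with the same single set of latches, an instruction fetched in a cycle would be copied through every stage within that same cycle.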

Sim-Outorder (RUU/LSQ)
● RUU (Register Update Unit)
  – Handles register synchronization/communication
  – Serves as reorder buffer and reservation stations
  – Performs out-of-order issue when register and memory dependences are satisfied
● LSQ (Load/Store Queue)
  – Handles memory synchronization/communication
  – Contains all loads and stores in program order
● Relationship between RUU and LSQ
  – Memory dependences are resolved by the LSQ
  – Load/store effective addresses are calculated in the RUU

Sim-Outorder: Fetch
● ruu_fetch()
● Models machine fetch bandwidth
● Fetches instructions from one I-cache/memory block; fetch stalls until I-cache misses are resolved
● Instructions are put into the instruction fetch queue, named fetch_data in sim-outorder.c (it is also called the dispatch queue in the tutorial paper)
● Probes the branch predictor to obtain the cache line for the next cycle

Sim-Outorder: Dispatch
● ruu_dispatch()
● Models instruction decoding and register renaming
● Takes instructions from fetch_data
● Decodes instructions
● Enters and links instructions into the RUU and LSQ
● Splits memory operations into two separate instructions
  – Address calculation, memory operation itself

Sim-Outorder: Execute
● ruu_issue()
● Models functional units, D-cache issue and execute latencies
● Gets instructions that are ready
● Reserves a free functional unit
● Schedules write-back events using the latency of the functional unit
● Latencies are hardcoded in fu_config[] in sim-outorder.c
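
Reserving a functional unit and scheduling the write-back from its latency can be sketched like this. The structure and names (`fu_sketch`, `fu_try_issue`, `op_lat`, `issue_lat`) are hypothetical and do not reproduce the layout of fu_config[]; the sketch assumes each unit has one latency until its result is ready and another until it can accept the next operation.

```c
struct fu_sketch {
    int op_lat;    /* cycles until the result is available */
    int issue_lat; /* cycles until the unit accepts another operation */
    int free_at;   /* first cycle at which the unit is free again */
};

/* Try to issue an operation at cycle `now`. Returns the cycle at which
   the write-back event should fire, or -1 if the unit is still busy. */
int fu_try_issue(struct fu_sketch *fu, int now)
{
    if (now < fu->free_at)
        return -1;                     /* unit is reserved, retry later */
    fu->free_at = now + fu->issue_lat; /* reserve the unit */
    return now + fu->op_lat;           /* schedule the write-back */
}
```

With `issue_lat` of 1 the unit behaves as fully pipelined; setting it close to `op_lat` models an unpipelined unit such as a divider.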

Sim-Outorder: Scheduler
● lsq_refresh()
● Models instruction selection, wakeup and issue
● Separate schedulers track register and memory dependences
● Locates instructions with all register inputs ready and all memory inputs ready
● Issue of a ready load is stalled if there is an earlier store with an unresolved effective address in the LSQ
● If an earlier store address matches the load address, the store value is forwarded to the load; otherwise the load is sent to memory
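
The load-handling rule can be written down as a small decision function. This is a sketch of the rule as stated on the slide, not the body of lsq_refresh(): the types and the function `lsq_check_load` are hypothetical, the queue is a plain array in program order, and all accesses are assumed to be the same size and aligned so that an address match is an exact comparison.

```c
#include <stdint.h>

enum lsq_op { LSQ_LOAD, LSQ_STORE };
enum load_action { LOAD_STALL, LOAD_FORWARD, LOAD_TO_MEMORY };

struct lsq_entry {
    enum lsq_op op;
    int addr_ready; /* 1 once the effective address is known */
    uint32_t addr;
    uint32_t value; /* store data (unused for loads) */
};

/* Decide what to do with the load at q[load_idx]. Entries before it
   are older in program order. On LOAD_FORWARD, *fwd receives the value
   of the youngest older store to the same address. */
enum load_action lsq_check_load(const struct lsq_entry *q, int load_idx,
                                uint32_t *fwd)
{
    int match = -1;

    for (int i = 0; i < load_idx; i++) {
        if (q[i].op != LSQ_STORE)
            continue;
        if (!q[i].addr_ready)
            return LOAD_STALL; /* an older store might alias the load */
        if (q[i].addr == q[load_idx].addr)
            match = i;         /* remember the youngest match so far */
    }
    if (match >= 0) {
        *fwd = q[match].value;
        return LOAD_FORWARD;
    }
    return LOAD_TO_MEMORY;
}
```

The conservative stall on any unresolved older store is what keeps the load from reading a stale value from memory.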

Sim-Outorder: Writeback
● ruu_writeback()
● Models writeback bandwidth, detects mis-predictions, initiates the mis-prediction recovery sequence
● Gets instructions that have finished execution from the event queue
● Wakes up instructions that depend on the completed instruction, following the dependence chains of the instruction's output
● Detects branch mis-predictions and rolls state back to the checkpoint, discarding the associated instructions

Sim-Outorder: Commit
● ruu_commit()
● Models in-order commit of instructions
● Updates the data caches (or memory) with store values, and handles data TLB misses
● Keeps retiring instructions at the head of the RUU that are ready to commit
● When an instruction commits, its result is placed into the register file, and the RUU/LSQ resources devoted to that instruction are reclaimed
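
In-order retirement from the head of a circular buffer can be sketched as below. This is an illustration only, not ruu_commit(): the names `ruu_sketch`, `ruu_sketch_init` and `ruu_commit_sketch`, the capacity and the single `completed` flag are assumptions, and the register file and store handling are omitted.

```c
#define RUU_SKETCH_SIZE 8 /* capacity (assumed) */

struct ruu_sketch {
    int completed[RUU_SKETCH_SIZE]; /* 1 once the entry has finished */
    int head;                       /* index of the oldest entry */
    int num;                        /* number of occupied entries */
};

void ruu_sketch_init(struct ruu_sketch *r)
{
    for (int i = 0; i < RUU_SKETCH_SIZE; i++)
        r->completed[i] = 0;
    r->head = 0;
    r->num = 0;
}

/* Retire up to `width` instructions from the head, stopping at the
   first one that has not completed. Returns how many were retired. */
int ruu_commit_sketch(struct ruu_sketch *r, int width)
{
    int committed = 0;

    while (r->num > 0 && committed < width && r->completed[r->head]) {
        r->completed[r->head] = 0;                /* reclaim the entry */
        r->head = (r->head + 1) % RUU_SKETCH_SIZE; /* advance, wrapping */
        r->num--;
        committed++;
    }
    return committed;
}
```

A completed instruction behind an incomplete one stays in the buffer, which is what keeps commit in program order even though execution is out of order.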

Sim-Outorder: Processor core and other specifications
● Instruction fetch, decode and issue bandwidth
● Capacity of RUU and LSQ
● Branch mis-prediction latency
● Number of functional units
  – Integer ALU, integer multipliers/dividers
  – FP ALU, FP multipliers/dividers
● Latency of I-cache/D-cache, memory and TLB
● Record statistics

Global Options
These are supported in most simulators:
  -h                   print help message
  -d                   enable debug message
  -i                   start up in the Dlite! debugger
  -q                   quit immediately (use with -dumpconfig)
  -config <file>       read config parameters from <file>
  -dumpconfig <file>   save config parameters into <file>

Useful Links
– commandlines.html

How to get assistance
● Drop by HRBB 335 during office hours
  – T/W 11:00-12:00