Introduction to SimpleScalar (Based on SimpleScalar Tutorial)


Introduction to SimpleScalar (Based on SimpleScalar Tutorial) CSCE614 Texas A&M University 2017-04-17

Overview
- What is an architectural simulator?
  - A tool that reproduces the behavior of a computing device
- Why use a simulator?
  - Leverages a faster, more flexible software development cycle
  - Permits more design space exploration
  - Facilitates validation before hardware becomes available
  - Level of abstraction is tailored to the design task
  - Makes it possible to increase or improve system instrumentation
  - Usually less expensive than building a real system

Advantages of SimpleScalar
- Highly flexible: functional simulator + performance simulator
- Portable
  - Host: virtual target runs on most Unix-like systems
  - Target: simulators can support multiple ISAs
- Extensible
  - Source is included for compiler, libraries, simulators
  - Easy to write simulators
- Performance
  - Runs codes approaching "real" sizes

Simulation Tools
[Figure: taxonomy of architectural simulators. Functional: Trace-Driven, Exec-Driven (Interpreters, Direct Execution). Performance: Inst Schedulers, Cycle Timers. Shaded tools are included in the SimpleScalar Tool Set.]

Functional vs. Performance Simulators
- Functional simulators implement the architecture
  - Perform real execution
  - Implement what programmers see
- Performance simulators implement the microarchitecture
  - Model system resources/internals
  - Are concerned with time
  - Do not implement what programmers see

Trace-Driven vs. Execution-Driven Simulators
- Trace-driven
  - Simulator reads a "trace" of the instructions captured during a previous execution
  - Easy to implement
  - No functional components necessary
  - No feedback to the trace (e.g., mis-prediction)
- Execution-driven
  - Simulator runs the program (trace-on-the-fly)
  - Hard to implement
  - Advantages
    - Faster than tracing
    - No need to store traces
    - Register and memory values usually are not in a trace
    - Supports mis-speculation cost modeling

Instruction Schedulers vs. Cycle Timers
- Instruction schedulers
  - Simulator schedules each instruction when resources are available
  - Instructions are processed one at a time
  - Simpler, but less detailed
- Cycle timers
  - Simulator tracks microarchitecture state each cycle
  - Simulator state == microarchitecture state
  - Perfect for microarchitecture simulation

SimpleScalar Release 3.0
- SimpleScalar now executes multiple instruction sets: SimpleScalar PISA (the old "SimpleScalar ISA") and Alpha AXP
- All simulators now support external I/O traces (EIO traces)
  - Generated with a new simulator (sim-eio)
- Supports more platforms
- Explicit fault support
- And many more

Simulator Suite
Ordered from highest simulation performance (top) to greatest detail (bottom):

Simulator                          Size          Kind                   Notes
Sim-Fast                           300 lines     functional             4+ MIPS
Sim-Safe                           350 lines     functional w/ checks
Sim-Profile                        900 lines     functional             lot of stats
Sim-Cache, Sim-Cheetah, Sim-BPred  < 1000 lines  functional             cache stats, branch stats
Sim-Outorder                       3900 lines    performance            OoO issue, branch pred., mis-spec., ALUs, cache, TLB; 200+ KIPS

Sim-Fast
- Functional simulation
- Optimized for speed
- Assumes no cache
- Assumes no instruction checking
- Does not support Dlite!
- Does not allow command line arguments
- <300 lines of code

Sim-Safe
- Functional simulation
- Checks for instruction errors
- Optimized for speed
- Assumes no cache
- Supports Dlite!
- Does not allow command line arguments

Sim-Cache
- Cache simulation
- Ideal for fast simulation of caches (when the effect of cache performance on execution time is not needed)
- Accepts command line arguments for:
  - Level 1 and 2 instruction and data caches
  - TLB configuration (data and instruction)
  - Flush and compress
  - And more
- Ideal for performing high-level cache studies that don't take the access time of the caches into account

Sim-Cache (cont'd)
- Generates one- and two-level cache hierarchy statistics and profiles
- Extra options (also supported on sim-outorder):
    -cache:dl1 <config>   - level 1 data cache configuration
    -cache:dl2 <config>   - level 2 data cache configuration
    -cache:il1 <config>   - level 1 instruction cache configuration
    -cache:il2 <config>   - level 2 instruction cache configuration
    -tlb:dtlb <config>    - data TLB configuration
    -tlb:itlb <config>    - instruction TLB configuration
    -flush <config>       - flush caches on system calls
    -icompress            - remaps 64-bit inst addresses to 32-bit equiv.
    -pcstat <stat>        - record statistic <stat> by text address

Specifying Cache Configurations
- All cache and TLB configurations are specified with the same format:
    <name>:<nsets>:<bsize>:<assoc>:<repl>
- Where:
    <name>   - cache name (make this unique)
    <nsets>  - number of sets
    <bsize>  - block size in bytes (for a TLB, the page size)
    <assoc>  - associativity (number of "ways")
    <repl>   - set replacement policy
                 l - for LRU
                 f - for FIFO
                 r - for RANDOM
- Examples:
    il1:1024:32:2:l     2-way set-assoc 64k-byte cache, LRU
    dtlb:1:4096:64:r    64-entry fully assoc TLB w/ 4k pages, random replacement
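
The two example configurations can be checked with a little arithmetic: capacity is sets x block size x ways. The sketch below is only an illustration of the format described on this slide; the names `CacheConfig`, `parse_cache_config` and `capacity_bytes` are made up for this example and are not part of SimpleScalar.

```python
from typing import NamedTuple


class CacheConfig(NamedTuple):
    name: str
    nsets: int   # number of sets
    bsize: int   # block size in bytes (page size for a TLB)
    assoc: int   # associativity (number of ways)
    repl: str    # 'l', 'f' or 'r'


REPLACEMENT_POLICIES = {"l": "LRU", "f": "FIFO", "r": "RANDOM"}


def parse_cache_config(spec: str) -> CacheConfig:
    """Split '<name>:<nsets>:<bsize>:<assoc>:<repl>' into typed fields."""
    name, nsets, bsize, assoc, repl = spec.split(":")
    if repl not in REPLACEMENT_POLICIES:
        raise ValueError(f"unknown replacement policy: {repl!r}")
    return CacheConfig(name, int(nsets), int(bsize), int(assoc), repl)


def capacity_bytes(cfg: CacheConfig) -> int:
    """Total data capacity: sets x block size x ways."""
    return cfg.nsets * cfg.bsize * cfg.assoc


il1 = parse_cache_config("il1:1024:32:2:l")
dtlb = parse_cache_config("dtlb:1:4096:64:r")

print(capacity_bytes(il1) // 1024, "KB")    # 1024 * 32 * 2 bytes = 64 KB
print(dtlb.nsets * dtlb.assoc, "entries")   # 1 set * 64 ways = 64 entries
```

A single set with 64 ways is what makes the TLB example fully associative: every entry is a candidate for every lookup.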

Sim-Bpred
- Simulates different branch prediction mechanisms
- Generates prediction hit and miss rate reports
- Does not simulate the effect of branch prediction on total execution time
- Predictor types:
    nottaken   always predict not taken
    taken      always predict taken
    perfect    perfect predictor
    bimod      bimodal predictor
    2lev       2-level adaptive predictor
    comb       combined predictor (bimodal and 2-level)

Sim-Profile
- Program profiler
- Generates detailed profiles, by symbol and by address
- Keeps track of and reports:
  - Dynamic instruction counts
  - Instruction class counts
  - Branch class counts
  - Usage of address modes
  - Profiles of the text and data segments

Sim-Outorder
- Most complicated and detailed simulator
- Supports out-of-order issue and execution
- Provides reports on:
  - Branch prediction
  - Cache
  - External memory
  - Various configurations

Sim-Outorder: Detailed Performance Simulator
- Generates timing statistics for a detailed out-of-order issue processor core with a two-level cache memory hierarchy and main memory
- Extra options:
    -fetch:ifqsize <size>     - instruction fetch queue size (in insts)
    -fetch:mplat <cycles>     - extra branch mis-prediction latency (cycles)
    -bpred <type>             - specify the branch predictor
    -decode:width <insts>     - decoder bandwidth (insts/cycle)
    -issue:width <insts>      - RUU issue bandwidth (insts/cycle)
    -issue:inorder            - constrain instruction issue to program order
    -issue:wrongpath          - permit instruction issue after mis-speculation
    -ruu:size <insts>         - capacity of RUU (insts)
    -lsq:size <insts>         - capacity of load/store queue (insts)
    -cache:dl1 <config>       - level 1 data cache configuration
    -cache:dl1lat <cycles>    - level 1 data cache hit latency

Sim-Outorder: Detailed Performance Simulator (cont'd)
    -cache:dl2 <config>       - level 2 data cache configuration
    -cache:dl2lat <cycles>    - level 2 data cache hit latency
    -cache:il1 <config>       - level 1 instruction cache configuration
    -cache:il1lat <cycles>    - level 1 instruction cache hit latency
    -cache:il2 <config>       - level 2 instruction cache configuration
    -cache:il2lat <cycles>    - level 2 instruction cache hit latency
    -cache:flush              - flush all caches on system calls
    -cache:icompress          - remap 64-bit inst addresses to 32-bit equiv.
    -mem:lat <1st> <next>     - specify memory access latency (first, rest)
    -mem:width                - specify width of memory bus (in bytes)
    -tlb:itlb <config>        - instruction TLB configuration
    -tlb:dtlb <config>        - data TLB configuration
    -tlb:lat <cycles>         - latency (in cycles) to service a TLB miss
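
One way to read how -mem:lat and -mem:width interact: a cache block is transferred over the memory bus in bus-width chunks, the first chunk costs <1st> cycles and each later chunk costs <next> cycles. The sketch below illustrates that reading only; the function name and the sample numbers (32-byte block, 8-byte bus, latencies 18 and 2) are assumptions chosen for the example, not values taken from these slides.

```python
def block_access_latency(block_bytes: int, bus_bytes: int,
                         first: int, nxt: int) -> int:
    """Cycles to move one block across the memory bus.

    first: latency of the first chunk (the <1st> argument of -mem:lat)
    nxt:   latency of each following chunk (the <next> argument)
    """
    chunks = (block_bytes + bus_bytes - 1) // bus_bytes  # ceiling division
    return first + nxt * (chunks - 1)


# A 32-byte block over an 8-byte bus takes 4 chunks: 18 + 3 * 2 cycles.
print(block_access_latency(32, 8, 18, 2))
```

Under this model, widening the bus or shrinking the block reduces the number of chunks, while the first-chunk latency is paid on every miss regardless.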

Sim-Outorder: Detailed Performance Simulator (cont'd)
    -res:ialu                 - specify number of integer ALUs
    -res:imult                - specify number of integer multiplier/dividers
    -res:memports             - specify number of first-level cache ports
    -res:fpalu                - specify number of FP ALUs
    -res:fpmult               - specify number of FP multiplier/dividers
    -pcstat <stat>            - record statistic <stat> by text address
    -ptrace <file> <range>    - generate pipetrace

Specifying the Branch Predictor
- Specifying the branch predictor type:
    -bpred <type>
- The supported predictor types are:
    nottaken   always predict not taken
    taken      always predict taken
    perfect    perfect predictor
    bimod      bimodal predictor (BTB w/ 2-bit counters)
    2lev       2-level adaptive predictor
- Configuring the bimodal predictor (only useful when “-bpred bimod” is specified):
    -bpred:bimod <size>   size of direct-mapped BTB
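
To make "2-bit counters" concrete, here is a minimal sketch of a direct-mapped table of 2-bit saturating counters. It is a teaching model, not SimpleScalar's code: the class name, the initial counter value, and the 3-bit address shift (which assumes 8-byte instructions) are all assumptions of this example.

```python
class BimodalPredictor:
    """Direct-mapped table of 2-bit saturating counters (values 0..3)."""

    INST_SHIFT = 3  # assumption: 8-byte instructions, so drop 3 low PC bits

    def __init__(self, size: int):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.size = size
        self.counters = [1] * size  # assumption: start at weakly not-taken

    def _index(self, pc: int) -> int:
        return (pc >> self.INST_SHIFT) & (self.size - 1)

    def predict(self, pc: int) -> bool:
        """Predict taken when the counter is in one of the two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Move the counter one step toward the actual outcome, saturating."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


# A loop branch: taken 9 times, then not taken once, for 3 trips.
bp = BimodalPredictor(size=2048)
outcomes = ([True] * 9 + [False]) * 3
mispredicts = 0
for taken in outcomes:
    if bp.predict(0x400100) != taken:
        mispredicts += 1
    bp.update(0x400100, taken)
print(mispredicts, "of", len(outcomes))
```

The two-bit hysteresis is the point of the design: after the first trip, a single not-taken outcome at the loop exit does not flip the prediction for the next trip.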

Specifying the Branch Predictor (cont'd)
- Configuring the 2-level adaptive predictor (only useful when “-bpred 2lev” is specified):
    -bpred:2lev <l1size> <l2size> <hist_size> <xor>
- Configuration parameters: N, M, W, X
    N: # entries in first level (# of shift registers)
    M: # entries in second level (# of counters, or other FSM)
    W: width of shift register(s) (# of bits in each shift register)
    X: XOR history and address for second-level index (yes = 1, no = 0; we use 0 for this homework)
- Sample predictors:
    GAg: 1, M, W, 0 where M = 2^W
    GAp: 1, M, W, 0 where M = C * 2^W, C is # of per-address prediction tables
    PAg: N, M, W, 0 where M = 2^W
    PAp: N, M, W, 0 where M = N * 2^W
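
The sample-predictor formulas can be turned into a small helper that computes M and prints the matching -bpred:2lev argument string. This is only a sketch of the arithmetic on this slide; the function names and the sample values of N, W and C are hypothetical.

```python
def second_level_entries(kind: str, n: int, w: int, c: int = 1) -> int:
    """Return M for the sample predictor types, given N, W and (for GAp) C."""
    if kind in ("GAg", "GAp") and n != 1:
        raise ValueError("global-history predictors use one shift register (N = 1)")
    if kind == "GAg":
        return 2 ** w
    if kind == "GAp":
        return c * 2 ** w
    if kind == "PAg":
        return 2 ** w
    if kind == "PAp":
        return n * 2 ** w
    raise ValueError(f"unknown predictor kind: {kind!r}")


def bpred_2lev_args(kind: str, n: int, w: int, c: int = 1) -> str:
    """Format '-bpred:2lev <l1size> <l2size> <hist_size> <xor>' with xor = 0."""
    m = second_level_entries(kind, n, w, c)
    return f"-bpred:2lev {n} {m} {w} 0"


# Hypothetical sizes: 4-bit histories, 8 per-address tables, 64 history registers.
print(bpred_2lev_args("GAg", 1, 4))
print(bpred_2lev_args("GAp", 1, 4, c=8))
print(bpred_2lev_args("PAg", 64, 4))
print(bpred_2lev_args("PAp", 64, 4))
```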

Performance Comparison of GAg, GAp, PAg and PAp
- GAp: 1 global history register and 8 per-address prediction tables
[Figure: (a) GAp; (b) (2,2) predictor, in which the branch address and a 2-bit global branch history select a 2-bit per-branch predictor that produces the prediction.]

Hack the State Machine of the Branch Predictor!
[Figure: predictor state machines. (a) A3 (same as shown in the textbook); (b) A2 (original SimpleScalar implementation).]

Sim-Outorder HW Architecture
[Figure: pipeline block diagram. Stages: Fetch, Dispatch, Scheduler (Register Scheduler and Memory Scheduler), Exe, Mem, Writeback, Commit. Memory system: I-Cache, I-TLB, D-Cache, D-TLB, Virtual Memory.]

Sim-Outorder (Main Loop)
sim_main() in sim-outorder.c:

    ruu_init();
    for (;;) {
        ruu_commit();
        ruu_writeback();
        lsq_refresh();
        ruu_issue();
        ruu_dispatch();
        ruu_fetch();
    }

- The loop body is executed once for each simulated machine cycle
- Walks the pipeline from Commit to Fetch
- Reverse traversal handles inter-stage latch synchronization in only one pass
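
Why the reverse walk needs only one pass can be seen in a toy model: if stages are updated in place from Commit back to Fetch, each slot is vacated before the previous stage writes into it, so an instruction advances exactly one stage per cycle. The sketch below is an illustration of that idea only; the five-slot pipeline and the `tick` function are invented for this example and do not mirror the data structures in sim-outorder.c.

```python
def tick(pipe, new_inst, order):
    """Advance the pipeline one cycle in place, visiting slots in `order`.

    pipe[i] holds the instruction currently in stage i (or None).
    Returns the instruction that left the last stage this cycle, if any.
    """
    last = len(pipe) - 1
    retired = None
    for i in order:
        if pipe[i] is None:
            continue
        if i == last:
            retired, pipe[i] = pipe[i], None       # leaves the pipeline
        elif pipe[i + 1] is None:
            pipe[i + 1], pipe[i] = pipe[i], None   # moves to the next stage
    if pipe[0] is None:
        pipe[0] = new_inst                         # fetch a new instruction
    return retired


DEPTH = 5
commit_to_fetch = range(DEPTH - 1, -1, -1)
fetch_to_commit = range(DEPTH)

# Reverse walk: I1 needs DEPTH cycles to cross the pipeline.
pipe = [None] * DEPTH
reverse_log = [tick(pipe, inst, commit_to_fetch)
               for inst in ["I1", "I2", "I3", "I4", "I5", "I6"]]
print(reverse_log)

# Forward walk with in-place updates: I1 falls through every stage in one cycle.
pipe = ["I1"] + [None] * (DEPTH - 1)
forward_result = tick(pipe, None, fetch_to_commit)
print(forward_result)
```

A forward walk could be made correct with a second copy of every latch (read old state, write new state), but that is the extra pass the reverse order avoids.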

Sim-Outorder (RUU/LSQ)
- RUU (Register Update Unit)
  - Handles register synchronization/communication
  - Serves as reorder buffer and reservation stations
  - Performs out-of-order issue when register and memory dependences are satisfied
- LSQ (Load/Store Queue)
  - Handles memory synchronization/communication
  - Contains all loads and stores in program order
- Relationship between RUU and LSQ
  - Memory dependences are resolved by the LSQ
  - Load/store effective addresses are calculated in the RUU

Sim-Outorder: Fetch
ruu_fetch()
- Models machine fetch bandwidth
- Fetches instructions from one I-cache/memory block per cycle, stalling until I-cache misses are resolved
- Instructions are put into the instruction fetch queue, named fetch_data in sim-outorder.c (it is also called the dispatch queue in the paper)
- Probes the branch predictor to obtain the cache line for the next cycle

Sim-Outorder: Dispatch
ruu_dispatch()
- Models instruction decoding and register renaming
- Takes instructions from fetch_data
- Decodes instructions
- Enters and links instructions into the RUU and LSQ
- Splits each memory operation into two separate instructions (an effective-address computation in the RUU and a memory access in the LSQ)

Sim-Outorder: Scheduler
lsq_refresh()
- Models instruction selection, wakeup and issue
- Separate schedulers track register and memory dependences
- Locates instructions with all register inputs ready and all memory inputs ready
- Issue of ready loads is stalled if there is a store with an unresolved effective address in the LSQ
- If an earlier store address matches the load address, the target value is forwarded to the load
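
The two load rules on this slide (stall behind an unresolved store, forward from a matching earlier store) can be sketched as a single decision function. This is a simplified model under stated assumptions: the `Store` record and `schedule_load` are hypothetical names, addresses are compared exactly (no partial overlap), and a store's data is assumed to be available as soon as its address is known.

```python
from typing import List, NamedTuple, Optional, Tuple


class Store(NamedTuple):
    addr: Optional[int]   # None while the effective address is unresolved
    value: Optional[int]  # data to be written


def schedule_load(older_stores: List[Store],
                  load_addr: int) -> Tuple[str, Optional[int]]:
    """Decide what a ready load may do.

    older_stores: stores ahead of the load in the LSQ, in program order.
    Returns ('stall', None), ('forward', value) or ('cache', None).
    """
    # Any older store with an unknown address might alias the load.
    if any(s.addr is None for s in older_stores):
        return ("stall", None)
    # Search from the youngest older store so the latest write wins.
    for s in reversed(older_stores):
        if s.addr == load_addr:
            return ("forward", s.value)
    return ("cache", None)


print(schedule_load([Store(0x100, 7), Store(None, 9)], 0x200))   # stall
print(schedule_load([Store(0x100, 7), Store(0x100, 9)], 0x100))  # forward 9
print(schedule_load([Store(0x100, 7)], 0x200))                   # go to cache
```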

Sim-Outorder: Execute
ruu_issue()
- Models functional units, D-cache issue and execute latencies
- Gets instructions that are ready
- Reserves a free functional unit
- Schedules writeback events using the latency of the functional unit
- Latencies are hardcoded in fu_config[] in sim-outorder.c
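
The reserve-then-schedule pattern described here can be sketched with a small unit table and an event queue ordered by completion cycle. Everything in this block is illustrative: `FU_TABLE`, its latency values, and the `Execute` class are assumptions for the example and do not reproduce the contents of fu_config[].

```python
import heapq

# Hypothetical unit table: class -> (number of units, operation latency, issue latency)
FU_TABLE = {"ialu": (2, 1, 1), "imult": (1, 3, 1)}


class Execute:
    def __init__(self, table):
        self.latency = {k: (op, iss) for k, (n, op, iss) in table.items()}
        # Cycle at which each unit can accept another instruction.
        self.free_at = {k: [0] * n for k, (n, op, iss) in table.items()}
        self.events = []  # min-heap of (writeback_cycle, instruction)

    def issue(self, inst, fu_class, now):
        """Reserve a free unit and queue a writeback event; False if none is free."""
        op_lat, issue_lat = self.latency[fu_class]
        units = self.free_at[fu_class]
        for i, free in enumerate(units):
            if free <= now:
                units[i] = now + issue_lat
                heapq.heappush(self.events, (now + op_lat, inst))
                return True
        return False

    def completed(self, now):
        """Pop every instruction whose writeback event is due by cycle `now`."""
        done = []
        while self.events and self.events[0][0] <= now:
            done.append(heapq.heappop(self.events)[1])
        return done


ex = Execute(FU_TABLE)
print(ex.issue("mul1", "imult", now=0))  # the only multiplier is free
print(ex.issue("mul2", "imult", now=0))  # multiplier already reserved this cycle
print(ex.issue("mul2", "imult", now=1))  # free again after the issue latency
print(ex.completed(now=3))               # mul1 finishes 3 cycles after issue
```

Separating issue latency from operation latency is what lets a pipelined unit accept a new instruction before the previous one has written back.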

Sim-Outorder: Writeback
ruu_writeback()
- Models writeback bandwidth, detects mis-predictions, and initiates the mis-prediction recovery sequence
- Gets instructions that have finished execution (specified in the event queue)
- Wakes up instructions that depend on the completed instruction, following the dependence chains of the instruction's outputs
- Detects branch mis-predictions and rolls state back to the checkpoint

Sim-Outorder: Commit
ruu_commit()
- Models in-order retirement of instructions, store commits to the D-cache, and D-TLB miss handling
- While the head of the RUU/LSQ is ready to commit:
  - Handle D-TLB misses
  - Retire stores to the D-cache
  - Update the register file and rename table
  - Reclaim RUU/LSQ resources
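
In-order retirement means that only the head of the queue may leave, so a completed instruction still waits behind an older one that has not finished. The sketch below shows just that rule; the `commit` function, the dictionary fields, and the `width` limit are hypothetical and leave out store commits and D-TLB handling.

```python
from collections import deque


def commit(ruu: deque, width: int) -> list:
    """Retire up to `width` completed instructions from the head, in program order."""
    retired = []
    while ruu and len(retired) < width and ruu[0]["completed"]:
        retired.append(ruu.popleft()["name"])
    return retired


ruu = deque([
    {"name": "I1", "completed": True},
    {"name": "I2", "completed": False},  # still executing
    {"name": "I3", "completed": True},   # done, but younger than I2
])

first_cycle = commit(ruu, width=4)
print(first_cycle)      # only I1 can retire; I3 waits behind I2

ruu[0]["completed"] = True
second_cycle = commit(ruu, width=4)
print(second_cycle)     # I2 and I3 now retire together, in order
```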

Sim-Outorder: Processor Core and Other Specifications
- Instruction fetch, decode and issue bandwidth
- Capacity of RUU and LSQ
- Branch mis-prediction latency
- Number of functional units
  - Integer ALUs, integer multipliers/dividers
  - FP ALUs, FP multipliers/dividers
- Latency of I-cache/D-cache, memory and TLB
- Record statistic by text address

Global Options
These are supported on most simulators:
    -h                   print help message
    -d                   enable debug message
    -i                   start up in Dlite! debugger
    -q                   quit immediately (use with -dumpconfig)
    -config <file>       read config parameters from <file>
    -dumpconfig <file>   save config parameters into <file>

How to Get Help from Us
- Drop by during the TA's office hours
- E-mail: khkim@cse.tamu.edu