Timing Model of a Superscalar O-o-O processor in HAsim Framework

Presentation transcript:

Timing Model of a Superscalar O-o-O processor in HAsim Framework Murali Vijayaraghavan

What is HAsim? A framework for writing software-like timing models and running them on FPGAs. Software timing models are inherently sequential, and hence slow; parallelism is achieved by implementing the timing model on an FPGA.

HAsim contd. Functional Partition: correct execution (multiply, divide, etc.). Timing Partition: models time (stalls, mispredicts, etc.). The two partitions communicate through requests and responses. Functional partition == ISA; timing partition == micro-architecture.

Functional Partition [Diagram: a pipeline of functional-partition operations — TOK GEN, FET, DEC, EXE, MEM, LCO, GCO — each backed by its algorithm module (FetAlg, DecAlg, ExeAlg, MemAlg, LCOAlg, GCOAlg) and sharing RegState and MemState.]

Model cycle vs FPGA cycle The functional simulator can take any number of FPGA cycles for an operation, so there must be an explicit mechanism to track the ticks of the processor being modelled.
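The decoupling above can be illustrated with a small sketch (hypothetical code, not HAsim's actual API): the host (FPGA) clock runs freely, while the model clock advances exactly once per completed operation, however many host cycles that operation needed.

```python
# Hypothetical sketch: model cycles vs host (FPGA) cycles.
class ModelClock:
    def __init__(self):
        self.fpga_cycles = 0   # host cycles consumed so far
        self.model_cycles = 0  # ticks of the processor being modelled

    def run_operation(self, host_cycles_needed):
        """One model cycle that takes a variable number of host cycles."""
        self.fpga_cycles += host_cycles_needed
        self.model_cycles += 1  # exactly one model tick per completed operation

clock = ModelClock()
for cost in [3, 1, 5]:        # e.g. a multi-cycle functional operation costs more
    clock.run_operation(cost)
print(clock.model_cycles, clock.fpga_cycles)  # 3 model cycles took 9 FPGA cycles
```

The ratio of the two counters is the simulator's slowdown, which is exactly what the simulation-results slide later measures.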

APorts – tracking ticks The modules in the timing partition are connected to each other using APorts. A clock tick conceptually begins when a module has read from every input APort and ends when it has written to every output APort — so the tick is localized to each module's ports.
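This port discipline can be sketched as follows (an illustrative model only — the class and method names are hypothetical, not HAsim's): each APort is a message queue, and a module's model tick consists of reading all inputs, computing, and writing all outputs.

```python
from collections import deque

class APort:
    """Illustrative A-Port: carries one message per model cycle."""
    def __init__(self):
        self.q = deque()
    def write(self, msg):
        self.q.append(msg)
    def read(self):
        return self.q.popleft()  # conceptually blocks; here data must be present

class Module:
    def __init__(self, inputs, outputs):
        self.inputs, self.outputs = inputs, outputs
    def tick(self, compute):
        in_msgs = [p.read() for p in self.inputs]     # tick begins: all inputs read
        for p, msg in zip(self.outputs, compute(in_msgs)):
            p.write(msg)                              # tick ends: all outputs written

# A two-stage chain: one tick moves a message through the stage.
a, b = APort(), APort()
stage = Module([a], [b])
a.write(1)
stage.tick(lambda msgs: [msgs[0] * 2])
print(b.read())  # 2
```

Because each module only synchronizes through its own ports, no global model clock is needed: modules can be at different model cycles on the FPGA as long as every port is read and written once per tick.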

MIPS R10K specs 64-bit processor. Out-of-order execution. Superscalar. Fetch width: 4. Commit width: 4. 2 ALUs, 1 load/store unit, 1 FPU.

Timing model design The functional partition operates on only one instruction at a time, but the timing model time-multiplexes its operations to model more than one instruction per cycle.
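The time-multiplexing idea can be sketched like this (hypothetical interface — `functional_decode` stands in for a functional-partition request/response pair): a 4-wide model stage simply calls the one-instruction functional operation up to four times per model cycle.

```python
def functional_decode(inst):
    # Stand-in for a functional-partition request/response (one inst at a time).
    return {"inst": inst, "decoded": True}

def model_decode_stage(fetch_buffer, width=4):
    """One model cycle of a 4-wide decode: up to `width` sequential requests."""
    decoded = []
    for _ in range(min(width, len(fetch_buffer))):
        decoded.append(functional_decode(fetch_buffer.pop(0)))
    return decoded

out = model_decode_stage(["i0", "i1", "i2", "i3", "i4"])
print(len(out))  # 4 instructions handled in one model cycle; "i4" waits
```

The sequential calls cost extra FPGA cycles, but only one model cycle — another illustration of the model-cycle/FPGA-cycle split.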

Timing Model top-level design [Diagram: Fetch → Decode/Dispatch → Issue → Execute → Commit, attached to the functional partition's Token, Fetch, Decode, Mem, Exec, LCO, and GCO operations. Signals include the predicted PC and the PC at mispredict, 4 tokens per fetch, 4-wide issue and commit, IntQ and AddrQ free-buffer counts, FU ops, and execution results.]

Decode/Dispatch Module [Diagram: an 8-entry instruction buffer feeds a 4-wide decode. Decode consults the branch/JR predictor (which supplies the predicted PC and is updated from execute and on mispredict), the busy-register file, and the ROB, and inserts up to 4 instructions per cycle into the IntQ and AddrQ, subject to their free counts and to the 4-wide commit.]

Issue Module [Diagram: receives up to 4 instructions per cycle; a scoreboard selects instructions from the IntQ (out of order) for the 2 ALUs and from the AddrQ (in order) for the load/store unit.]
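The two issue policies can be sketched as follows (a hypothetical simplification — the real scoreboard tracks more state): the integer queue may issue any instruction whose source registers are ready, while the address queue issues only from its head.

```python
class Scoreboard:
    """Tracks registers with a pending writer (busy)."""
    def __init__(self):
        self.busy = set()
    def ready(self, inst):
        return not (set(inst["srcs"]) & self.busy)

def issue(int_q, addr_q, sb, num_alus=2):
    issued = []
    # IntQ: out of order — pick up to num_alus ready instructions, any position.
    for inst in list(int_q):
        if len(issued) >= num_alus:
            break
        if sb.ready(inst):
            int_q.remove(inst)
            issued.append(inst)
    # AddrQ: in order — single load/store unit, head of queue only.
    if addr_q and sb.ready(addr_q[0]):
        issued.append(addr_q.pop(0))
    return issued

sb = Scoreboard()
sb.busy = {"r1"}                         # r1 has a pending writer
int_q = [{"name": "A", "srcs": ["r1"]},  # blocked on r1
         {"name": "B", "srcs": ["r2"]}]  # ready
addr_q = [{"name": "L", "srcs": ["r3"]}]
print([i["name"] for i in issue(int_q, addr_q, sb)])  # ['B', 'L']
```

Note how the younger "B" bypasses the stalled "A" in the integer queue, while a stalled head of the address queue would block everything behind it.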

Differences between my timing model and the R10K SMIPS ISA – no floating-point ops. 32-bit registers and addressing. No delay slot. One extra cycle on a branch mispredict. JR and JALR have to go through the integer queue.

Reasons for the timing differences Currently the functional partition gives information only about branches, so the target address of a JR or JALR can be obtained only after the instruction executes. I didn't implement the branch cache, which would eliminate the extra cycle on a branch mispredict.

Simulation results Simulated the SMIPS v2 ADDUI test case. It took 239 FPGA cycles to simulate 7 model cycles – a number that needs investigation, since the bottleneck is the instruction queue at 21 FPGA cycles per model cycle, which accounts for only 7 × 21 = 147 cycles.
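The arithmetic behind the slide's concern, spelled out: the slowest stage alone explains 147 of the 239 FPGA cycles, leaving 92 cycles of overhead unaccounted for.

```python
# Checking the numbers from the slide.
model_cycles = 7
fpga_cycles_total = 239
queue_cost_per_model_cycle = 21   # instruction queue, the slowest stage

queue_cycles = model_cycles * queue_cost_per_model_cycle
overhead = fpga_cycles_total - queue_cycles
print(queue_cycles, overhead)                 # 147 92
print(round(fpga_cycles_total / model_cycles, 1))  # 34.1 FPGA cycles/model cycle
```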

Miscellaneous Lines of code for the timing model: ~1300, compared to ~1200 for a simple SMIPS processor in Lab 2, excluding caches.