Prof. John Nestor
ECE Department, Lafayette College
Easton, Pennsylvania 18042
nestorj@lafayette.edu

ECE 313 - Computer Organization
Lecture 19 - Pipelined Processor Design 3: Superscalar CPU
Fall 2004

Reading: 6.7, 6.9-6.12, 6.13*
Homework Due 12/8: 6.1, 6.2, 6.3, 6.4, 6.7, 6.8, 6.9, 6.15
Assignment: Project 4

Portions of these slides are derived from:
- Textbook figures © 1998 Morgan Kaufmann Publishers, all rights reserved
- Tod Amon's COD2e slides © 1998 Morgan Kaufmann Publishers, all rights reserved
- Dave Patterson’s CS 152 slides, Fall 1997 © UCB
- Rob Rutenbar’s 18-347 slides, Fall 1999, CMU
- Other sources as noted

ECE 313 Fall 2004, Lecture 19 - Pipelining 3

Project 4 – Basic Pipelined MIPS

Project 4 - What to Do
- Download and simulate the basic model
- Extend the processor to implement either:
  - Data forwarding + load/use stall (see Fig. 6.36/6.33), OR
  - Branch implementation in ID, including IF.Flush (see Fig. 6.38)
- Simulate the extended processor to show that it works
- You may work in groups of two

Pipelined Processor Design with Hazard Detection (Fig. 6.36, old 6.46)
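
For the first Project 4 option, the forwarding unit and the hazard detection unit in this design reduce to a few comparisons between pipeline-register fields. The Python sketch below restates the usual textbook conditions. The `PipelineState` field names are hypothetical stand-ins for the IF/ID, ID/EX, EX/MEM and MEM/WB fields, and the strings `'EX/MEM'`, `'MEM/WB'` and `'REG'` stand in for the forwarding mux selects (10, 01 and 00 in the book's encoding).

```python
from dataclasses import dataclass


@dataclass
class PipelineState:
    """Hypothetical snapshot of the pipeline-register fields the two units read.
    Registers are numbered 0-31; register 0 ($zero) is never forwarded."""
    if_id_rs: int          # sources of the instruction in ID
    if_id_rt: int
    id_ex_rs: int          # sources of the instruction in EX
    id_ex_rt: int          # (also the destination when that instruction is a load)
    id_ex_mem_read: bool   # True if the instruction in EX is a load
    ex_mem_reg_write: bool
    ex_mem_rd: int
    mem_wb_reg_write: bool
    mem_wb_rd: int


def forward_source(s: PipelineState, reg: int) -> str:
    """Pick where one ALU operand comes from: 'EX/MEM', 'MEM/WB' or 'REG'."""
    # The newer result (in EX/MEM) takes priority over the older one (in MEM/WB).
    if s.ex_mem_reg_write and s.ex_mem_rd != 0 and s.ex_mem_rd == reg:
        return "EX/MEM"
    if s.mem_wb_reg_write and s.mem_wb_rd != 0 and s.mem_wb_rd == reg:
        return "MEM/WB"
    return "REG"


def forwarding_unit(s: PipelineState) -> tuple:
    """Mux selects for the two ALU operands (rs, rt) of the instruction in EX."""
    return forward_source(s, s.id_ex_rs), forward_source(s, s.id_ex_rt)


def load_use_stall(s: PipelineState) -> bool:
    """Stall for one cycle if a load in EX writes a register that the
    instruction in ID reads; forwarding alone cannot cover this case."""
    return s.id_ex_mem_read and s.id_ex_rt in (s.if_id_rs, s.if_id_rt)


# lw $t0, 0($s1) is in EX and addu $t1, $t0, $s2 is in ID ($t0 = 8, $s1 = 17, $s2 = 18)
state = PipelineState(if_id_rs=8, if_id_rt=18, id_ex_rs=17, id_ex_rt=8,
                      id_ex_mem_read=True, ex_mem_reg_write=False, ex_mem_rd=0,
                      mem_wb_reg_write=False, mem_wb_rd=0)
print(load_use_stall(state))  # True
```

Checking EX/MEM before MEM/WB matters when both stages are about to write the same register: the newer value must win.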

Pipelined Processor Design with Branch Hardware in ID (Old Fig. 6.51)
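
For the second Project 4 option, resolving the branch in ID needs a branch-target adder and a register comparator in that stage, plus the IF.Flush signal. The Python sketch below models only that decision logic; the function and argument names are hypothetical, and only `beq`/`bne` (equality tests) are handled.

```python
def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_in_id(pc_plus_4: int, rs_value: int, rt_value: int, imm16: int, opcode: str):
    """Resolve beq/bne in ID. Returns (taken, target, if_flush).
    When the branch is taken, IF.Flush turns the instruction that was
    already fetched into IF/ID into a nop."""
    if opcode not in ("beq", "bne"):
        raise ValueError(f"unsupported branch opcode: {opcode}")
    # Branch target adder: PC + 4 + (sign-extended offset << 2), kept to 32 bits.
    target = (pc_plus_4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
    # Register comparator: equality is all that beq and bne need.
    equal = rs_value == rt_value
    taken = equal if opcode == "beq" else not equal
    return taken, target, taken


print(branch_in_id(0x1004, 5, 5, 3, "beq"))  # (True, 4112, True): target 0x1010
```

Because the decision is made in ID, a taken branch flushes only the single instruction fetched behind it.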

Pipelining Outline
- Introduction
- Pipelined Processor Design
- Advanced Pipelining
  - Overview - Instruction Level Parallelism
  - Superpipelining
  - Static Multiple Issue
  - Dynamic Multiple Issue
  - Speculation
  - Simultaneous Multithreading (HyperThreading)

Instruction Level Parallelism (ILP)
- Parallel execution of instructions is known as Instruction Level Parallelism (ILP)
- Pipelining exploits ILP by overlapping the execution of successive instructions
- ILP is limited by:
  - Data hazards
  - Control hazards
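
One way to see how data hazards bound ILP is to measure the longest chain of true (read-after-write) dependences. The Python sketch below does this for the loop body used later in this lecture (p. 438) under idealized assumptions: unlimited functional units, one-cycle latency for every instruction, perfect branch prediction and register renaming. These are modeling assumptions, not properties of any real processor.

```python
def ideal_ilp(program):
    """program: list of (text, dest, sources) in program order.
    Only true (RAW) dependences limit parallelism in this model.
    Returns instructions / length of the critical path in cycles."""
    ready = {}      # register -> cycle in which its newest value is available
    finish = []
    for _text, dest, sources in program:
        start = max((ready.get(r, 0) for r in sources), default=0)
        done = start + 1                      # every instruction takes 1 cycle
        if dest is not None:
            ready[dest] = done
        finish.append(done)
    return len(program) / max(finish)


loop_body = [
    ("lw   $t0, 0($s1)",      "t0", ["s1"]),
    ("addu $t0, $t0, $s2",    "t0", ["t0", "s2"]),
    ("sw   $t0, 0($s1)",      None, ["t0", "s1"]),
    ("addi $s1, $s1, -4",     "s1", ["s1"]),
    ("bne  $s1, $zero, Loop", None, ["s1", "zero"]),
]
print(round(ideal_ilp(loop_body), 2))  # 1.67
```

The chain lw -> addu -> sw is three instructions long, so in this model one iteration cannot average more than 5/3 instructions per clock, however many functional units are available.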

Techniques to Increase ILP
- Forwarding
- Branch Prediction
- Superpipelining
- Static Multiple Issue
- Dynamic Multiple Issue
- Speculation
- Simultaneous Multithreading (SMT)

Superpipelining
- Key idea: increase the number of pipeline stages
  - MIPS R2000: 5 stages
  - MIPS R4000: 8 stages
  - Pentium 3: 10 stages
  - Pentium 4: 20 stages
- Tradeoffs
  (+) Less logic in each stage -> faster clock
  (-) Longer pipeline -> higher penalty for stalls and flushes
- Used in conjunction with other techniques (e.g., branch prediction) to overcome the disadvantages
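
The superpipelining tradeoff can be made concrete with a toy model: splitting a fixed amount of logic over more stages shortens the clock, but every stage adds latch overhead and lengthens the penalty for each stall or flush. The model and all parameter values below are invented for illustration.

```python
def time_per_instruction(stages, logic_delay=20.0, latch_overhead=0.5,
                         stall_rate=0.05, penalty_fraction=0.5):
    """Average time per instruction (arbitrary time units) in a toy model.
    Clock period: the fixed logic delay split evenly over the stages plus a
    per-stage latch overhead. Each stall/flush (stall_rate per instruction)
    costs penalty_fraction * stages cycles, so deeper pipelines pay more."""
    cycle_time = logic_delay / stages + latch_overhead
    cpi = 1.0 + stall_rate * penalty_fraction * stages
    return cycle_time * cpi


for n in (5, 10, 20, 40, 80):
    print(n, "stages:", round(time_per_instruction(n), 3))
best = min(range(1, 101), key=time_per_instruction)
print("best depth in this model:", best)  # 40 with the default parameters
```

With these made-up parameters the time per instruction bottoms out at 40 stages; raising `stall_rate` or `latch_overhead` moves the optimum toward shallower pipelines, which is why deep pipelines depend on techniques such as branch prediction.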

Techniques to Increase ILP
- Forwarding
- Branch Prediction
- Superpipelining
- Superscalar with Static Multiple Issue
- VLIW
- Superscalar with Dynamic Multiple Issue
- Superscalar with Speculation
- Superscalar with Simultaneous Multithreading (SMT)

Static Multiple Issue
- Key idea: issue (decode & execute) multiple instructions in each clock cycle
- Example: issue one load/store and one ALU/branch instruction per cycle in MIPS (Fig. 6.44, old 6.57)

  Instruction type   Pipe stages
  ALU or branch      IF  ID  EX  MEM WB
  Load/store         IF  ID  EX  MEM WB
  ALU or branch          IF  ID  EX  MEM WB
  Load/store             IF  ID  EX  MEM WB
  ALU or branch              IF  ID  EX  MEM WB
  Load/store                 IF  ID  EX  MEM WB
  ALU or branch                  IF  ID  EX  MEM WB
  Load/store                     IF  ID  EX  MEM WB
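
The staggered diagram above can be generated mechanically, which also makes the ideal speedup easy to count: each clock starts one ALU/branch and one load/store instruction. The Python sketch below assumes no stalls of any kind; the function names are hypothetical.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")


def dual_issue_diagram(num_packets):
    """Rows of the ideal two-issue pipeline diagram: each clock issues one
    ALU/branch and one load/store instruction, and each pair starts one
    clock after the previous pair (no hazards assumed)."""
    rows = []
    for p in range(num_packets):
        for kind in ("ALU or branch", "Load/store"):
            cells = ["   "] * p + [f"{stage:<3}" for stage in STAGES]
            rows.append(f"{kind:<15}" + " ".join(cells).rstrip())
    return rows


def cycles_to_finish(num_instructions, issue_width, depth=len(STAGES)):
    """Clocks until the last instruction leaves WB, ignoring all stalls."""
    groups = -(-num_instructions // issue_width)  # ceiling division
    return depth + groups - 1


print("\n".join(dual_issue_diagram(4)))
print(cycles_to_finish(8, 2), "clocks dual-issue vs", cycles_to_finish(8, 1), "single-issue")
```

Under these ideal conditions, 8 instructions finish in 8 clocks on the two-issue pipeline versus 12 on a single-issue five-stage pipeline.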

Example - A Static Multiple Issue MIPS (Fig. 6.45, old 6.58)
- Separate portions of the datapath execute ALU/branch instructions and load/store instructions

Static Multiple Issue Tradeoffs
- Advantage: increased performance
  - Real processors issue up to 6 instructions / cycle
- Several challenges:
  - Building a register file with lots of ports
  - Dealing with data dependencies
  - Stalls due to control dependencies (branch prediction helps!)
  - Building a memory system that can “keep up” (caches help!)
  - Finding opportunities to fully utilize the functional units

VLIW / EPIC Processors
- VLIW - Very Long Instruction Word
  - Functional units are exposed in the instruction word
  - Static scheduling by the compiler
  - Pipeline is exposed; the compiler must schedule delays to get the right result
  - Examples: Philips Trimedia, Texas Instruments C6000
- EPIC - Explicitly Parallel Instruction Computing
  - Three 41-bit instructions in each instruction packet
  - Compiler determines parallelism
  - Hardware checks dependencies and forwards/stalls
  - Examples: Intel Itanium, Itanium 2
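
In Intel's IA-64 terminology, the instruction packet on this slide is a 128-bit bundle: three 41-bit instruction slots plus a 5-bit template field. The Python sketch below only packs and unpacks those fields to show how the bits add up; it treats each slot as an opaque 41-bit number and does not decode instructions or interpret templates.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
BUNDLE_BITS = TEMPLATE_BITS + 3 * SLOT_BITS   # 5 + 3 * 41 = 128


def pack_bundle(template: int, slots: list) -> int:
    """Pack a 5-bit template (bits 0-4) and three 41-bit instruction slots
    (slot 0 in bits 5-45, slot 1 in bits 46-86, slot 2 in bits 87-127)."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError(f"slot {i} must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle: int):
    """Inverse of pack_bundle: returns (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask for i in range(3)]
    return template, slots


b = pack_bundle(0x10, [1, 2, (1 << SLOT_BITS) - 1])
print(BUNDLE_BITS, b.bit_length(), unpack_bundle(b)[0])  # 128 128 16
```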

Itanium Block Diagram (Source: ExtremeTech, www.extremetech.com)

Software Manipulation to Increase ILP
- Software transformations can increase ILP
  - Code reordering to reduce stalls
  - Loop unrolling
- Example (p. 438):

  Loop: lw   $t0, 0($s1)       # $t0 = array element
        addu $t0, $t0, $s2     # add scalar in $s2
        sw   $t0, 0($s1)       # store result
        addi $s1, $s1, -4      # decrement pointer
        bne  $s1, $zero, Loop  # repeat until $s1 == 0

- Goal: reorder to speed superscalar execution
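
Before any reordering, a two-issue machine can do little with this loop because each instruction needs the previous one's result. The Python sketch below is a hypothetical greedy scheduler (not the book's method) that packs instructions in order into ALU/branch + load/store pairs, assuming a one-clock load-use delay and ignoring branch delays. It compares the original order with the reordered version on the next slide; there, `sw` is listed after `bne` only so that the two share a clock.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ALU, MEM = "ALU/branch", "load/store"


@dataclass
class Instr:
    text: str
    slot: str                 # issue slot it needs: ALU or MEM
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read
    is_load: bool = False


def dual_issue_schedule(program, load_latency=2, alu_latency=1):
    """Greedy in-order packing into (ALU/branch, load/store) pairs.
    A result produced in clock c can be read from clock c + latency on;
    load_latency=2 models the one-clock load-use delay."""
    ready = {}                 # register -> first clock its new value can be read
    packets = []               # list of (clock, [Instr, ...])
    clock, i = 0, 0
    while i < len(program):
        first = program[i]
        clock = max([clock + 1] + [ready.get(r, 0) for r in first.srcs])
        packet = [first]
        if i + 1 < len(program):
            nxt = program[i + 1]
            # Pair only if the slots differ, nxt does not read first's result,
            # they do not write the same register, and nxt's operands are ready.
            # (nxt may overwrite a register first reads: reads precede writes.)
            if (nxt.slot != first.slot
                    and first.dest not in nxt.srcs
                    and (nxt.dest is None or nxt.dest != first.dest)
                    and all(ready.get(r, 0) <= clock for r in nxt.srcs)):
                packet.append(nxt)
        for ins in packet:
            if ins.dest is not None:
                ready[ins.dest] = clock + (load_latency if ins.is_load else alu_latency)
        packets.append((clock, packet))
        i += len(packet)
    return packets


original = [
    Instr("lw   $t0, 0($s1)", MEM, "t0", ("s1",), is_load=True),
    Instr("addu $t0, $t0, $s2", ALU, "t0", ("t0", "s2")),
    Instr("sw   $t0, 0($s1)", MEM, None, ("t0", "s1")),
    Instr("addi $s1, $s1, -4", ALU, "s1", ("s1",)),
    Instr("bne  $s1, $zero, Loop", ALU, None, ("s1", "zero")),
]
reordered = [
    Instr("lw   $t0, 0($s1)", MEM, "t0", ("s1",), is_load=True),
    Instr("addi $s1, $s1, -4", ALU, "s1", ("s1",)),
    Instr("addu $t0, $t0, $s2", ALU, "t0", ("t0", "s2")),
    Instr("bne  $s1, $zero, Loop", ALU, None, ("s1", "zero")),
    Instr("sw   $t0, 4($s1)", MEM, None, ("t0", "s1")),   # shares bne's clock
]
for name, prog in (("original", original), ("reordered", reordered)):
    sched = dual_issue_schedule(prog)
    print(name, "->", sched[-1][0], "clocks")
    for clock, packet in sched:
        print("  ", clock, " | ".join(ins.text for ins in packet))
```

The original order needs 5 clocks and the reordered one needs 4. The greedy packer places addi next to lw in clock 1 rather than in clock 2; either way nothing useful can issue in clock 2, because addu is still waiting for the load.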

Software Manipulation - Reordering Code

         ALU or branch instruction   Data transfer instruction   Clock
  Loop:                              lw   $t0, 0($s1)            1
         addi $s1, $s1, -4                                       2
         addu $t0, $t0, $s2                                      3
         bne  $s1, $zero, Loop       sw   $t0, 4($s1)            4

- The sw offset becomes 4 because addi has already decremented $s1
- Note the sparse utilization of the superscalar pipeline!
- End result: 5 instructions in 4 clocks, CPI = 0.8
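
Moving addi above sw means sw sees the already-decremented $s1, which is why its offset changes from 0 to 4. The Python sketch below checks that by interpreting each schedule packet by packet, with every instruction in a packet reading the register values from before that packet. The interpreter, its tuple format and the test data are hypothetical; it models results only, not timing.

```python
def run_packets(packets, regs, mem):
    """Execute issue packets (lists of instruction tuples) until the loop exits.
    All instructions in a packet read register values from before the packet,
    as in the dual-issue pipeline. A taken bne jumps back to packet 0 (Loop)."""
    pc = 0
    while pc < len(packets):
        old = dict(regs)              # same-packet reads see pre-packet values
        taken = False
        for op, *args in packets[pc]:
            if op == "lw":
                rt, offset, base = args
                regs[rt] = mem[old[base] + offset]
            elif op == "sw":
                rt, offset, base = args
                mem[old[base] + offset] = old[rt]
            elif op == "addu":
                rd, rs, rt = args
                regs[rd] = old[rs] + old[rt]
            elif op == "addi":
                rt, rs, imm = args
                regs[rt] = old[rs] + imm
            elif op == "bne":
                rs, rt = args
                taken = old[rs] != old[rt]
            else:
                raise ValueError(f"unsupported op {op}")
        pc = 0 if taken else pc + 1
    return mem


original = [[("lw", "t0", 0, "s1")], [("addu", "t0", "t0", "s2")],
            [("sw", "t0", 0, "s1")], [("addi", "s1", "s1", -4)],
            [("bne", "s1", "zero")]]
reordered = [[("lw", "t0", 0, "s1")], [("addi", "s1", "s1", -4)],
             [("addu", "t0", "t0", "s2")],
             [("bne", "s1", "zero"), ("sw", "t0", 4, "s1")]]
# Same schedule, but without adjusting the store offset for the earlier addi.
unadjusted = reordered[:3] + [[("bne", "s1", "zero"), ("sw", "t0", 0, "s1")]]


def fresh():
    # Hypothetical data: 4 elements at addresses 4..16; $s1 points at the last one.
    return {"t0": 0, "s1": 16, "s2": 100, "zero": 0}, {0: 0, 4: 1, 8: 2, 12: 3, 16: 4}


for name, prog in (("original", original), ("reordered", reordered), ("unadjusted", unadjusted)):
    print(name, run_packets(prog, *fresh()))
```

Both the original loop and the reordered schedule add 100 to every element; keeping the old offset of 0 stores each result to the wrong element and corrupts the array.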

Software Manipulation - Loop Unrolling
- Assume the loop count is a multiple of 4, and unroll by 4

         ALU or branch instruction   Data transfer instruction   Clock
  Loop:  addi $s1, $s1, -16          lw   $t0, 0($s1)            1
                                     lw   $t1, 12($s1)           2
         addu $t0, $t0, $s2          lw   $t2, 8($s1)            3
         addu $t1, $t1, $s2          lw   $t3, 4($s1)            4
         addu $t2, $t2, $s2          sw   $t0, 16($s1)           5
         addu $t3, $t3, $s2          sw   $t1, 12($s1)           6
                                     sw   $t2, 8($s1)            7
         bne  $s1, $zero, Loop       sw   $t3, 4($s1)            8

- lw $t0 issues in the same clock as addi, so it still uses the old $s1; every later access sees $s1 - 16, so sw $t0 needs offset 16
- End result: 4 loop iterations in 8 clocks = 2 clocks / iteration!
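
The tricky part of the unrolled code is the offsets: lw $t0 shares clock 1 with addi, so it still reads the old $s1, while every later load and store sees $s1 - 16. The Python sketch below computes each effective address under that rule and checks that every value is stored back where it was loaded; the `(clock, op, register, offset)` table format is only an illustration.

```python
# Data-transfer column of the unrolled schedule: (clock, op, register, offset).
UNROLLED = [
    (1, "lw", "t0", 0), (2, "lw", "t1", 12), (3, "lw", "t2", 8), (4, "lw", "t3", 4),
    (5, "sw", "t0", 16), (6, "sw", "t1", 12), (7, "sw", "t2", 8), (8, "sw", "t3", 4),
]
ADDI_CLOCK = 1        # addi $s1, $s1, -16 issues in clock 1


def addresses(schedule, s1_start):
    """Effective address of every lw/sw, keyed by (op, register). A memory op in
    the same clock as the addi still reads the old $s1; later ones see $s1 - 16."""
    result = {}
    for clock, op, reg, offset in schedule:
        s1 = s1_start if clock <= ADDI_CLOCK else s1_start - 16
        result[(op, reg)] = s1 + offset
    return result


def is_consistent(schedule, s1_start):
    """Each value must be stored back where it was loaded from, and the four
    loads must cover the elements at $s1, $s1-4, $s1-8 and $s1-12."""
    addr = addresses(schedule, s1_start)
    regs = ("t0", "t1", "t2", "t3")
    same_place = all(addr[("lw", r)] == addr[("sw", r)] for r in regs)
    covered = {addr[("lw", r)] for r in regs} == {s1_start - 4 * k for k in range(4)}
    return same_place and covered


print(is_consistent(UNROLLED, s1_start=64))     # True
wrong = [(c, op, r, 0 if (op, r) == ("sw", "t0") else off) for c, op, r, off in UNROLLED]
print(is_consistent(wrong, s1_start=64))        # False

instructions = 1 + 4 * 3 + 1      # addi + 4 x (lw, addu, sw) + bne = 14
print("CPI =", 8 / instructions, "| clocks per original iteration =", 8 / 4)
```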

Techniques to Increase ILP
- Forwarding
- Branch Prediction
- Superpipelining
- Superscalar with Static Multiple Issue
- VLIW
- Superscalar with Dynamic Multiple Issue
- Superscalar with Speculation
- Superscalar with Simultaneous Multithreading (SMT)

Dynamic Multiple Issue
- Key ideas:
  - "Look past" stalls for instructions that can execute:

        lw   $t0, 20($t2)
        addu $t1, $t0, $s2   # addu stalls until $t0 is available
        sub  $s4, $s4, $s3   # sub is ready to execute but blocked by the stall!
        slti $t5, $s4, 20

  - Execute instructions out of order
  - Use multiple functional units for parallel execution
  - Forward results between functional units when necessary
  - Update registers in the original program order
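
The benefit of looking past the stall can be counted with a small single-issue model of the four instructions above. The Python sketch assumes a one-clock load-use delay, one-clock ALU latency and renamed registers (so only true dependences matter); none of these numbers come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Op:
    text: str
    dest: str
    srcs: tuple
    latency: int     # clocks until the result can be used (assumed)


PROGRAM = [
    Op("lw   $t0, 20($t2)",  "t0", ("t2",), 2),   # assumed one-clock load-use delay
    Op("addu $t1, $t0, $s2", "t1", ("t0", "s2"), 1),
    Op("sub  $s4, $s4, $s3", "s4", ("s4", "s3"), 1),
    Op("slti $t5, $s4, 20",  "t5", ("s4",), 1),
]


def issue_clocks(program, out_of_order):
    """Single-issue model returning {index: issue clock}. Only true (RAW)
    dependences are tracked, as if registers were renamed."""
    ready = {}        # register -> first clock its new value can be used
    issued = {}
    clock = 0
    while len(issued) < len(program):
        clock += 1
        pending = set()   # results of older instructions that have not issued yet
        for idx, op in enumerate(program):
            if idx in issued:
                continue
            if all(r not in pending and ready.get(r, 0) <= clock for r in op.srcs):
                issued[idx] = clock
                ready[op.dest] = clock + op.latency
                break                 # at most one instruction per clock
            if not out_of_order:
                break                 # in order: a stalled instruction blocks the rest
            pending.add(op.dest)
    return issued


in_order = issue_clocks(PROGRAM, out_of_order=False)
dynamic = issue_clocks(PROGRAM, out_of_order=True)
print("in order:", max(in_order.values()), "clocks")      # 5
print("out of order:", max(dynamic.values()), "clocks")   # 4
```

In order, the stalled addu holds sub and slti back, so the sequence needs 5 clocks to issue; out of order, sub issues in clock 2 while addu waits, and the last instruction issues in clock 4.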

Speculation
- Guess the outcome of an instruction (e.g., a branch or load)
- Based on the guess, start executing instructions
- Cancel the started instructions if the guess is incorrect
- Complicating factors:
  - Must buffer instruction results until the outcome is known
  - Exceptions in speculated instructions: how can you have an exception in an instruction that should never have executed?
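
The buffer-then-commit idea can be shown with a toy model in Python: the predicted path runs into a private buffer, and only a correct guess lets those results reach the registers. Everything here (the hypothetical `execute`/`speculate` helpers and representing instructions as Python functions) is an illustration, not how hardware encodes it.

```python
from typing import Callable, Dict, List, Tuple

Regs = Dict[str, int]
Instr = Tuple[str, Callable[[Regs], int]]   # (destination, value computed from registers)


def execute(path: List[Instr], regs: Regs) -> Regs:
    """Run a straight-line path and return its results WITHOUT touching regs."""
    results: Regs = {}
    for dest, compute in path:
        results[dest] = compute({**regs, **results})
    return results


def speculate(regs: Regs, predict_taken: bool, branch_taken: Callable[[Regs], bool],
              taken_path: List[Instr], fallthrough_path: List[Instr]):
    """Return (new registers, mispredicted)."""
    # 1. Guess and start executing the predicted path; results are only buffered.
    buffered = execute(taken_path if predict_taken else fallthrough_path, regs)
    # 2. The branch resolves.
    actually_taken = branch_taken(regs)
    if actually_taken == predict_taken:
        return {**regs, **buffered}, False        # guess right: commit the buffer
    # 3. Guess wrong: cancel (discard the buffer) and run the correct path.
    correct = execute(taken_path if actually_taken else fallthrough_path, regs)
    return {**regs, **correct}, True


def beq_s1_zero(r: Regs) -> bool:                 # beq $s1, $zero, Target
    return r["s1"] == 0


regs = {"s1": 0, "t0": 5}
taken_path = [("t1", lambda r: r["t0"] + 1)]      # code at Target
fallthrough = [("t1", lambda r: r["t0"] - 1)]     # code after the beq
print(speculate(regs, True, beq_s1_zero, taken_path, fallthrough))
print(speculate(regs, False, beq_s1_zero, taken_path, fallthrough))
```

Both calls end with $t1 = 6; the second one first computes 4 into the buffer and throws it away. Exceptions fit the same pattern: a design can record a fault alongside the buffered result and raise it only if that instruction commits, so a cancelled instruction never causes a visible exception.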

Superscalar Dynamic Pipelining (Fig. 6.49, old 6.61)
- Instruction fetch and decode unit: in-order issue
- Reservation stations in front of the functional units (integer, floating point, load/store): out-of-order execute
- Commit unit: in-order commit
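
The commit unit's job, in-order commit after out-of-order execution, amounts to a queue that only retires from its head. A minimal Python sketch of such a reorder buffer follows; the class and method names are hypothetical, and the register names reuse the earlier dynamic-issue example.

```python
from collections import deque


class ReorderBuffer:
    """Minimal commit-unit model: entries are allocated in program order at
    issue, results may arrive in any order, and registers are updated only
    from the head of the buffer, i.e. in program order."""

    def __init__(self):
        self.entries = deque()   # each entry: {"tag", "dest", "value", "done"}
        self.next_tag = 0

    def issue(self, dest: str) -> int:
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})
        return tag

    def complete(self, tag: int, value: int) -> None:
        """A functional unit finished (possibly out of order)."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True
                return
        raise KeyError(f"no in-flight instruction with tag {tag}")

    def commit(self, regs: dict) -> list:
        """Retire finished entries from the head; stop at the first unfinished one."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            regs[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired


rob, regs = ReorderBuffer(), {}
lw, addu, sub = rob.issue("t0"), rob.issue("t1"), rob.issue("s4")
rob.complete(sub, 7)        # sub finishes first...
print(rob.commit(regs))     # [] ...but cannot commit ahead of lw
rob.complete(lw, 10)
print(rob.commit(regs))     # [0]
rob.complete(addu, 12)
print(rob.commit(regs))     # [1, 2]
print(regs)                 # {'t0': 10, 't1': 12, 's4': 7}
```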

Superscalar Dynamic Pipelining in the Pentium 4 (Fig. 6.50, modified from old 6.62)

Superscalar Dynamic Pipelining in the Pentium 4
Source: “The Microarchitecture of the Pentium® 4 Processor”, Intel Technology Journal, First Quarter 2001, http://developer.intel.com/technology/itj/q12001/articles/art_2.htm

Pentium 3 & 4 Pipeline Stages
Source: “The Microarchitecture of the Pentium® 4 Processor”, Intel Technology Journal, First Quarter 2001, http://developer.intel.com/technology/itj/q12001/articles/art_2.htm
- Drive stages: waiting for signal propagation

Techniques to Increase ILP
- Forwarding
- Branch Prediction
- Superpipelining
- Multiple Issue - Superscalar, VLIW/EPIC
- Software manipulation
- Dynamic Pipeline Scheduling
- Speculative Execution
- Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT)
- Key idea: extend a superscalar processor to multiple threads of execution that run concurrently
  - Each thread has its own PC and register state
  - Scheduling hardware shares the functional units among the threads
  - Appears to software as two “separate” processors
- Advantage: when one thread stalls, another may be ready
- Proposed for servers, where multiple threads are common
[Figure: issue slots over time, filled from the state of Thread A and Thread B on shared functional units]

Roadmap for the term: major topics
- Overview / Abstractions and Technology
- Performance
- Instruction sets
- Logic & arithmetic
- Processor Implementation
- Memory systems
- Input/Output