CMCS 411-101 Computer Architecture Lecture 20 Pipelined Datapath and Control April 11, 2001 www.csee.umbc.edu/~younis/CMSC411/ CMSC411.htm Mohamed.

Presentation transcript:

CMCS 411-101 Computer Architecture Lecture 20 Pipelined Datapath and Control April 11, 2001 www.csee.umbc.edu/~younis/CMSC411/ CMSC411.htm Mohamed Younis CMCS 411, Computer Architecture 1

Lecture’s Overview

Previous Lecture: An overview of pipelining
  Pipelining concept is natural
    Start handling of next instruction while current one is in progress
  Pipeline performance
    Performance improvement by increasing instruction throughput
    Ideal and upper bound for speedup is the number of stages in the pipeline
  Pipelined hazards
    Structural, data and control hazards
    Hazard resolution techniques

This Lecture:
  Designing a pipelined datapath
  Controlling pipeline operations

Multi-stage Instruction Execution

Stages of Instruction Execution

          Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
  Load    Ifetch    Reg/Dec   Exec      Mem       Wr

The load instruction is the longest. All instructions follow at most the following five steps:
  Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory.
  Reg/Dec: Registers Fetch and Instruction Decode.
  Exec: Calculate the memory address.
  Mem: Read the data from the Data Memory.
  Wr: Write the data back to the register file.

As shown here, each of these five steps takes one clock cycle to complete. In pipeline terminology, each step is referred to as one stage of the pipeline.

* Slide is courtesy of Dave Patterson

Instruction Pipelining

Start handling of the next instruction while the current instruction is in progress.
Pipelining is feasible when different devices are used at different stages of instruction execution.

[Figure: overlapped instructions, each passing through IFetch, Dec, Exec, Mem, WB; axes: Program Flow and Time]

Pipelining improves performance by increasing instruction throughput.

Pipelined Datapath
Data Stationary

Datapath for Pipelined Processor

[Figure: pipelined datapath. Labels: PC, Next PC, Inst. Mem, IR, Reg. File, A, B, Equal, Exec, S, Mem Access, Data Mem, M, Reg File, Valid; stage control blocks Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl; instruction registers IRex, IRmem, IRwb]

* Slide is courtesy of Dave Patterson

Control and Datapath

All instructions:
  IR <- Mem[PC]; PC <- PC+4
  A <- R[rs]; B <- R[rt]

Then, by instruction type:
  R-type:  S <- A + B;     R[rd] <- S
  Load:    S <- A + SX;    M <- Mem[S];    R[rd] <- M
  Ori:     S <- A or ZX;   R[rt] <- S
  Store:   S <- A + SX;    Mem[S] <- B
  Branch:  if Cond PC <- PC+SX

[Figure: datapath. Labels: PC, Next PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Mem Access, Data Mem, D, M, Reg File]

* Slide is courtesy of Dave Patterson

Pipelining the same Instruction

[Figure: 1st lw, 2nd lw and 3rd lw each pass through Ifetch, Reg/Dec, Exec, Mem, Wr over Clock Cycles 1 to 7, each starting one cycle after the previous one. A second timeline shows an R-type instruction passing through Ifetch, Reg/Dec, Exec, Wr.]

For the load instructions, the five independent functional units in the pipeline datapath are:
  (a) Instruction Memory for the Ifetch stage.
  (b) Register File’s read ports for the Reg/Dec stage.
  (c) ALU for the Exec stage.
  (d) Data Memory for the Mem stage.
  (e) Register File’s write port for the Write Back stage.

Notice that I have treated the Register File’s read and write ports as separate functional units, because the register file we have allows us to read and write at the same time.

Notice that as soon as the 1st load finishes its Ifetch stage, it no longer needs the Instruction Memory. Consequently, the 2nd load can start using the Instruction Memory (2nd Ifetch). Furthermore, since each functional unit is only used ONCE per instruction, we will not have any conflict down the pipeline (Exec-Ifet, Mem-Exec, Wr-Mem) either. I will show you the interaction between instructions in the pipelined datapath later.

For now, I want to point out the performance advantages of pipelining. If these 3 load instructions are executed by the multiple cycle processor, they take 15 cycles. With pipelining, they take only 7 cycles. This (7 cycles), however, is not the best way to look at the performance advantages of pipelining. A better way to look at it is that one instruction enters the pipeline every cycle, so one instruction comes out of the pipeline (Wr stage) every cycle. Consequently, the “effective” (or average) number of cycles per instruction is now ONE, even though it takes a total of 5 cycles to complete each instruction.
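
The cycle counts quoted above (15 versus 7 for three loads) can be reproduced with a minimal sketch in Python. The function names are illustrative and not part of the lecture, and the sketch assumes an ideal pipeline with no hazards and five cycles per load on the multiple cycle processor.

```python
def multicycle_cycles(n_instructions, cycles_per_instruction=5):
    """Cycles when each instruction finishes before the next one starts."""
    return n_instructions * cycles_per_instruction


def pipelined_cycles(n_instructions, n_stages=5):
    """Cycles in an ideal pipeline: fill the pipe once, then one finish per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def effective_cpi(n_instructions, n_stages=5):
    """Average cycles per instruction over the whole run."""
    return pipelined_cycles(n_instructions, n_stages) / n_instructions


if __name__ == "__main__":
    print(multicycle_cycles(3))   # three loads, multiple cycle processor
    print(pipelined_cycles(3))    # three loads, pipelined
    print(effective_cpi(1000))    # approaches 1 as the run gets longer
```

For long instruction streams the fill time of the pipeline is amortized, which is why the effective CPI tends toward one.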

Pipelining R-type & Load Instructions

[Figure: instruction sequence R-type, R-type, Load, R-type, R-type over Clock Cycles 1 to 9. R-type uses Ifetch, Reg/Dec, Exec, Wr; Load uses Ifetch, Reg/Dec, Exec, Mem, Wr. Annotation: Oops! We have a problem!]

We have a pipeline conflict or structural hazard:
  Two instructions try to write to the register file at the same time!
  Only one write port

What happens if we try to pipeline the R-type instructions with the Load instructions? Well, we have a problem here! We end up having two instructions trying to write to the register file at the same time. Why do we have this problem (the write “bubble”)? Well, the reason for this problem is that there is something I have not yet told you.

* Slide is courtesy of Dave Patterson
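
The write-port conflict can be located mechanically. The following is a minimal sketch in Python, assuming one instruction is fetched per cycle starting in cycle 1 and using the instruction sequence from the figure; the names `STAGES` and `write_port_conflicts` are illustrative only.

```python
# Stage sequences as drawn on the slide: R-type skips the Mem stage.
STAGES = {
    "R-type": ["Ifetch", "Reg/Dec", "Exec", "Wr"],
    "Load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}


def write_port_conflicts(program):
    """Return {cycle: [(position, kind), ...]} for cycles with more than one Wr."""
    writers = {}
    for position, kind in enumerate(program, start=1):
        fetch_cycle = position  # assumption: one fetch per cycle, no stalls
        wr_cycle = fetch_cycle + STAGES[kind].index("Wr")
        writers.setdefault(wr_cycle, []).append((position, kind))
    return {cycle: who for cycle, who in writers.items() if len(who) > 1}


if __name__ == "__main__":
    program = ["R-type", "R-type", "Load", "R-type", "R-type"]
    print(write_port_conflicts(program))
```

Under these assumptions the Load (third instruction) and the R-type that follows it both reach Wr in cycle 7, which is the structural hazard on the single write port.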

Solution 1: Insert “Bubble” into Pipeline

[Figure: Load and R-type instructions over Clock Cycles 1 to 9, with a pipeline bubble inserted after the Load]

Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle:
  The control logic can be complex.
  Lost instruction fetch and speedup opportunity: no instruction is started in Cycle 6!

The first solution is to insert a “bubble” into the pipeline AFTER the load instruction, to push back by one cycle every instruction after the load that is already in the pipeline. At the same time, the bubble delays by one cycle the Instruction Fetch of the instruction that is about to enter the pipeline. Needless to say, the control logic to accomplish this can be complex.

Furthermore, this solution also has a negative impact on performance. Notice that due to the “extra” stage (Mem) the Load instruction has, we will not have one instruction finishing every cycle (points to Cycle 5). Consequently, a mix of load and R-type instructions will NOT have an average CPI of 1 because, in effect, the Load instruction has an effective CPI of 2. So this is not that hot an idea. Let’s try something else.

* Slide is courtesy of Dave Patterson
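
The performance cost of the bubble can be estimated from the two effective CPI values given above (2 for a Load, 1 for an R-type). A minimal sketch in Python; the function name and the 25% load fraction are hypothetical values chosen for illustration, not figures from the lecture.

```python
def average_cpi(load_fraction, load_cpi=2.0, other_cpi=1.0):
    """Weighted average CPI for a mix of loads and R-type instructions."""
    if not 0.0 <= load_fraction <= 1.0:
        raise ValueError("load_fraction must be between 0 and 1")
    return load_fraction * load_cpi + (1.0 - load_fraction) * other_cpi


if __name__ == "__main__":
    # Hypothetical mix: one instruction in four is a load.
    print(average_cpi(0.25))
```

Any nonzero share of loads pushes the average above the ideal CPI of 1, which is the motivation for looking at a second solution.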

Solution 2: Delay Write by One Cycle

Delay R-type’s register write by one cycle:
  Now R-type instructions also use the Reg File’s write port at Stage 5
  Mem stage is a NOOP stage: nothing is being done.

            1        2         3      4     5
  R-type    Ifetch   Reg/Dec   Exec   Mem   Wr

[Figure: R-type and Load instructions over Clock Cycles 1 to 9, every instruction passing through Ifetch, Reg/Dec, Exec, Mem, Wr]

Well, one thing we can do is add a “Nop” stage to the R-type instruction pipeline to delay its register file write by one cycle. Now the R-type instruction ALSO uses the register file’s write port at its 5th stage, so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned. As far as performance is concerned, we also get back to having one instruction complete per cycle.

This is kind of like promoting socialism: by making each individual R-type instruction take 5 cycles instead of 4 cycles to finish, our overall performance is actually better off. The reason for this higher performance is that we end up having a more efficient pipeline.

* Slide is courtesy of Dave Patterson
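
The effect of padding R-type with a no-op Mem stage can be checked by comparing write-back cycles before and after the change. A minimal sketch in Python, assuming one fetch per cycle starting in cycle 1 and the same R-type, R-type, Load, R-type, R-type mix used earlier in the lecture; all names are illustrative.

```python
FIVE_STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

# Before: R-type writes in its 4th stage. After: every instruction writes in stage 5.
ORIGINAL = {"R-type": ["Ifetch", "Reg/Dec", "Exec", "Wr"], "Load": FIVE_STAGES}
PADDED = {"R-type": FIVE_STAGES, "Load": FIVE_STAGES}


def write_back_cycles(program, stage_table):
    """Cycle in which each instruction uses the register file write port."""
    return [
        fetch_cycle + stage_table[kind].index("Wr")
        for fetch_cycle, kind in enumerate(program, start=1)
    ]


def has_write_conflict(cycles):
    """True when two instructions would write in the same cycle."""
    return len(set(cycles)) != len(cycles)


if __name__ == "__main__":
    program = ["R-type", "R-type", "Load", "R-type", "R-type"]
    before = write_back_cycles(program, ORIGINAL)
    after = write_back_cycles(program, PADDED)
    print(before, has_write_conflict(before))
    print(after, has_write_conflict(after))
```

With uniform five-stage instructions the write-back cycles are consecutive, so exactly one instruction completes per cycle even though each R-type now takes one cycle longer.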

Modified Control & Datapath

All instructions:
  IR <- Mem[PC]; PC <- PC+4
  A <- R[rs]; B <- R[rt]

Then, by stage and instruction type:

          R-type        Ori            Load           Store          Branch
  Exec    S <- A + B    S <- A or ZX   S <- A + SX    S <- A + SX    if Cond PC <- PC+SX
  Mem     M <- S        M <- S         M <- Mem[S]    Mem[S] <- B
  Wr      R[rd] <- M    R[rt] <- M     R[rd] <- M

[Figure: datapath. Labels: PC, Next PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Mem Access, Data Mem, D, M, Reg File]

* Slide is courtesy of Dave Patterson

Standardized Five Stages Instructions

            Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
  Load      Ifetch    Reg/Dec   Exec      Mem       Wr
  Store     Ifetch    Reg/Dec   Exec      Mem       Wr
  R-type    Ifetch    Reg/Dec   Exec      Mem       Wr
  Beq       Ifetch    Reg/Dec   Exec      Mem       Wr
  Ori       Ifetch    Reg/Dec   Exec      Mem       Wr

Data Stationary Control

The Main Control generates the control signals during Reg/Dec:
  Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
  Control signals for Mem (MemWr, Branch) are used 2 cycles later
  Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later

[Figure: stages Reg/Dec, Exec, Mem, Wr separated by the IF/ID, ID/Ex, Ex/Mem and Mem/Wr registers. The Main Control outputs ExtOp, ALUSrc, ALUOp and RegDst are carried to Exec; MemWr and Branch are carried to Mem; MemtoReg and RegWr are carried to Wr.]

The main control here is identical to the one in the single cycle processor. It generates all the control signals necessary for a given instruction during that instruction’s Reg/Decode stage. All these control signals are saved in the ID/Exec pipeline register at the end of the Reg/Decode cycle.

The control signals for the Exec stage (ALUSrc, etc.) come from the output of the ID/Exec register. That is, they are delayed ONE cycle from the cycle in which they are generated. The rest of the control signals, which are not used during the Exec stage, are passed down the pipeline and saved in the Exec/Mem register.

The control signals for the Mem stage (MemWr, Branch) come from the output of the Exec/Mem register. That is, they are delayed two cycles from the cycle in which they are generated. Finally, the control signals for the Wr stage (MemtoReg and RegWr) come from the output of the Mem/Wr register: they are delayed three cycles from the cycle in which they are generated.

* Slide is courtesy of Dave Patterson
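
The one, two and three cycle delays can be seen by modeling the three pipeline registers as variables that shift on every clock. The following is a minimal sketch in Python, not the lecture's implementation: the signal groupings come from the slide, while the function names and the concrete signal values in `CONTROL_TABLE` are assumptions made for illustration.

```python
EXEC_SIGNALS = ("ExtOp", "ALUSrc", "ALUOp", "RegDst")
MEM_SIGNALS = ("MemWr", "Branch")
WR_SIGNALS = ("MemtoReg", "RegWr")

# Assumed, illustrative control values; the lecture does not list them.
CONTROL_TABLE = {
    "lw": {"ExtOp": 1, "ALUSrc": 1, "ALUOp": "add", "RegDst": 0,
           "MemWr": 0, "Branch": 0, "MemtoReg": 1, "RegWr": 1},
    "add": {"ExtOp": 0, "ALUSrc": 0, "ALUOp": "add", "RegDst": 1,
            "MemWr": 0, "Branch": 0, "MemtoReg": 0, "RegWr": 1},
}


def main_control(op):
    """Generate every control signal for one instruction during Reg/Dec."""
    signals = dict(CONTROL_TABLE[op])
    signals["op"] = op
    return signals


def pick(register, names):
    return {name: register[name] for name in names}


def drop(register, names):
    """Pass on only the signals that later stages still need."""
    if register is None:
        return None
    return {k: v for k, v in register.items() if k not in names}


def simulate(ops):
    """Return (cycle, op in Reg/Dec, signals consumed by each later stage)."""
    id_ex = ex_mem = mem_wr = None
    trace = []
    for cycle, op in enumerate(list(ops) + [None] * 3, start=1):
        used = {}
        if id_ex is not None:
            used["Exec"] = (id_ex["op"], pick(id_ex, EXEC_SIGNALS))
        if ex_mem is not None:
            used["Mem"] = (ex_mem["op"], pick(ex_mem, MEM_SIGNALS))
        if mem_wr is not None:
            used["Wr"] = (mem_wr["op"], pick(mem_wr, WR_SIGNALS))
        trace.append((cycle, op, used))
        # Clock edge: every pipeline register latches its new contents.
        mem_wr = drop(ex_mem, MEM_SIGNALS)
        ex_mem = drop(id_ex, EXEC_SIGNALS)
        id_ex = main_control(op) if op is not None else None
    return trace


if __name__ == "__main__":
    for row in simulate(["lw", "add"]):
        print(row)
```

In the trace, cycle 1 is the Reg/Dec cycle of `lw`; its Exec signals are consumed in cycle 2, its Mem signals in cycle 3 and its Wr signals in cycle 4, so the control information travels alongside the instruction it belongs to.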

Datapath + Data Stationary Control

[Figure: pipelined datapath with data stationary control. Labels: PC, Next PC, Inst. Mem, IR, Decode, Reg. File, A, B, Exec, S, Mem Access, Data Mem, D, M, Reg File; control blocks Ex Ctrl, Mem Ctrl, WB Ctrl; fields carried down the pipeline: op, fun, rs, rt, im, rw, ex, me, wb, v]

* Slide is courtesy of Dave Patterson

An example
Start: Fetch 12

Program (address, instruction):
   12  lw   r1, r2(35)        (IF)
   16  addI r2, r2, 3
   20  sub  r3, r4, r5
   24  beq  r6, r7, 100
   28  ori  r8, r9, 17
   32  add  r10, r11, r12
  100  and  r13, r14, r15

[Figure: datapath snapshot. PC / Next PC shows 12; the downstream control slots are marked n.]

An example
Fetch 16, Decode 12

  IF:  16  addI r2, r2, 3
  ID:  12  lw   r1, r2(35)

[Figure: datapath snapshot. Next PC shows 16; the Decode stage holds lw r1, r2(35) and reads register 2; the remaining control slots are marked n.]

An example
Fetch 20, Decode 16, Exec 12

  IF:  20  sub  r3, r4, r5
  ID:  16  addI r2, r2, 3
  EX:  12  lw   r1, r2(35)

[Figure: datapath snapshot. Next PC shows 20; Decode holds addI r2, r2, 3; Ex Ctrl holds lw r1; the Exec inputs are r2 and 35; the remaining control slots are marked n.]

An example
Fetch 24, Decode 20, Exec 16, Mem 12

  IF:  24  beq  r6, r7, 100
  ID:  20  sub  r3, r4, r5
  EX:  16  addI r2, r2, 3
  M:   12  lw   r1, r2(35)

[Figure: datapath snapshot. Next PC shows 24; Decode holds sub r3, r4, r5 and reads registers 4 and 5; the Exec inputs are r2 and 3; the Mem stage holds the address r2+35; the WB slot is marked n.]

An example
Fetch 28, Dcd 24, Ex 20, Mem 16, WB 12

Note: Delayed Branch always executes ori after beq.

  IF:  28  ori  r8, r9, 17
  ID:  24  beq  r6, r7, 100
  EX:  20  sub  r3, r4, r5
  M:   16  addI r2, r2, 3
  WB:  12  lw   r1, r2(35)

[Figure: datapath snapshot. Next PC shows 28; Decode holds beq r6, r7, 100 and reads registers 6 and 7; the Exec inputs are r4 and r5; the Mem stage holds r2+3; the WB stage holds M[r2+35].]

An example
Fetch 32, Dcd 28, Ex 24, Mem 20, WB 16

  IF:  32  add  r10, r11, r12
  ID:  28  ori  r8, r9, 17
  EX:  24  beq  r6, r7, 100
  M:   20  sub  r3, r4, r5
  WB:  16  addI r2, r2, 3

[Figure: datapath snapshot. Next PC shows 32; Decode holds ori r8, r9, 17 and reads register 9; the Exec stage holds beq with inputs r6, r7 and offset 100; the Mem stage holds r4-r5; the WB stage holds r2+3. Completed so far: r1=M[r2+35].]

An example
Fetch 100, Dcd 32, Ex 28, Mem 24, WB 20

  IF:  100  and  r13, r14, r15
  ID:   32  add  r10, r11, r12
  EX:   28  ori  r8, r9, 17
  M:    24  beq  r6, r7, 100
  WB:   20  sub  r3, r4, r5

[Figure: datapath snapshot. Next PC shows 100; Decode holds add r10, r11, r12 and reads registers 11 and 12; the Exec stage holds ori r8 with inputs r9 and 17; the Mem stage holds beq; the WB stage holds r4-r5. Completed so far: r1=M[r2+35], r2 = r2+3.]

An example
Fetch 104, Dcd 100, Ex 32, Mem 28, WB 24

  IF:  104
  ID:  100  and  r13, r14, r15
  EX:   32  add  r10, r11, r12
  M:    28  ori  r8, r9, 17
  WB:   24  beq  r6, r7, 100

[Figure: datapath snapshot. Next PC shows 104; Decode holds and r13, r14, r15 and reads registers 14 and 15; the Exec stage holds add r10 with inputs r11 and r12; the Mem stage holds r9 | 17; the WB stage holds beq. Completed so far: r1=M[r2+35], r2 = r2+3, r3 = r4-r5.]

An example
Fetch 108, Dcd 104, Ex 100, Mem 32, WB 28

  IF:  108
  ID:  104
  EX:  100  and  r13, r14, r15
  M:    32  add  r10, r11, r12
  WB:   28  ori  r8, r9, 17

[Figure: datapath snapshot. Next PC shows 108; the Exec stage holds and r13 with inputs r14 and r15; the Mem stage holds r11+r12; the WB stage holds r9 | 17. Completed so far: r1=M[r2+35], r2 = r2+3, r3 = r4-r5.]

An example
Fetch 112, Dcd 108, Ex 104, Mem 100, WB 32

  IF:  112
  ID:  108
  EX:  104
  M:   100  and  r13, r14, r15
  WB:   32  add  r10, r11, r12

[Figure: datapath snapshot. Next PC shows 112; the Mem stage holds r14 & r15 for and r13; the WB stage holds r11+r12 for add r10. Completed so far: r1=M[r2+35], r2 = r2+3, r3 = r4-r5, r8 = r9 | 17.]
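
The captions of the example slides follow from shifting the fetch sequence through the five stages, one stage per cycle. A minimal sketch in Python that regenerates them; the fetch order (12, 16, 20, 24, 28, 32, 100, 104, 108, 112) is copied from the slides rather than computed from branch semantics, and the names `FETCH_ORDER`, `STAGE_NAMES` and `captions` are illustrative.

```python
# Addresses in the order the example fetches them (taken from the slides).
FETCH_ORDER = [12, 16, 20, 24, 28, 32, 100, 104, 108, 112]

# Stage labels, using the abbreviations of the later example slides.
STAGE_NAMES = ["Fetch", "Dcd", "Ex", "Mem", "WB"]


def captions(fetch_order):
    """One caption per cycle: which address occupies which stage."""
    result = []
    for cycle in range(len(fetch_order)):
        parts = []
        for depth, stage in enumerate(STAGE_NAMES):
            index = cycle - depth  # the instruction fetched `depth` cycles ago
            if index >= 0:
                parts.append(f"{stage} {fetch_order[index]}")
        result.append(", ".join(parts))
    return result


if __name__ == "__main__":
    for line in captions(FETCH_ORDER):
        print(line)
```

Each instruction moves one stage to the right per cycle, so the caption for any cycle is just a five-wide window over the fetch order.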

Conclusion

Summary
  Designing a pipelined datapath
    Standardized multi-stage instruction execution
    Unique resources per stage
  Pipeline control
    Simplified stage-based control
    Detailed working example

Next Lecture
  Pipeline hazard detection
  Handling hazards in the pipeline design

We will take a break and talk about class philosophy.

Reading assignment includes sections 6.2 & 6.3 in the text book.