1 CS152 Computer Architecture and Engineering, Lecture 12: Introduction to Pipelining
October 11, 1999. John Kubiatowicz (http.cs.berkeley.edu/~kubitron) lecture slides: 10/11/99 ©UCB Fall 1999

2 Recap: Microprogramming
Microprogramming is a convenient method for implementing structured control state diagrams: random logic is replaced by a microPC sequencer and ROM. Each line of ROM is called a microinstruction: it contains sequencer control plus values for the control points. Limited state transitions: branch to zero, next sequential, or branch to the instruction address from the dispatch ROM. Horizontal µcode: one control bit in the microinstruction for every control line in the datapath. Vertical µcode: groups of control lines coded together in the microinstruction (e.g. possible ALU destinations). Control design reduces to microprogramming. Part of the design process is to develop a "language" that describes control and is easy for humans to understand. 10/11/99 ©UCB Fall 1999
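The sequencer idea can be made concrete with a tiny simulator. The sketch below is purely illustrative (the ROM layout, micro-addresses, and control-signal names are invented, not the course's actual microcode format): a µPC indexes a ROM of microinstructions, each carrying datapath control values plus a sequencer field that is next-sequential, branch-to-fetch, or dispatch through a hypothetical dispatch ROM keyed by opcode.

```python
# Minimal microcode sequencer sketch (illustrative only; fields and addresses are invented).
# Each microinstruction = (datapath_controls, seq_field).
# seq_field: "seq" = next sequential, "fetch" = branch back to microaddress 0,
#            "dispatch" = branch to the address from the dispatch ROM (indexed by opcode).

DISPATCH_ROM = {"lw": 4, "sw": 6, "rtype": 8}          # hypothetical opcode -> micro-address

MICRO_ROM = [
    ({"IR<=MEM[PC]": 1, "PC<=PC+4": 1}, "seq"),        # 0: instruction fetch
    ({"A<=R[rs]": 1, "B<=R[rt]": 1},    "dispatch"),   # 1: decode / dispatch on opcode
    ({}, "fetch"), ({}, "fetch"),                      # 2-3: unused
    ({"S<=A+SX": 1},       "seq"),                     # 4: lw execute
    ({"R[rt]<=MEM[S]": 1}, "fetch"),                   # 5: lw mem + write-back
    ({"S<=A+SX": 1},       "seq"),                     # 6: sw execute
    ({"MEM[S]<=B": 1},     "fetch"),                   # 7: sw mem
    ({"S<=A fun B": 1},    "seq"),                     # 8: R-type execute
    ({"R[rd]<=S": 1},      "fetch"),                   # 9: R-type write-back
]

def run_one_instruction(opcode: str) -> list[dict]:
    """Step the micro-PC from fetch until it wraps back to fetch; return the control words issued."""
    upc, issued = 0, []
    while True:
        controls, seq = MICRO_ROM[upc]
        issued.append(controls)
        if seq == "seq":
            upc += 1
        elif seq == "dispatch":
            upc = DISPATCH_ROM[opcode]
        else:  # "fetch": instruction done, next macroinstruction starts at microaddress 0
            return issued

print(run_one_instruction("lw"))   # 4 microinstructions: fetch, decode, execute, mem+wb
```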

3 Recap: Microprogramming
[Figure: µcode ROM with sequencer-control and datapath-control fields; decoders; µ-sequencer (fetch, dispatch, sequential micro-PC); Dispatch ROM indexed by Opcode, driving the datapath.] Decoders implement our µ-code language: for instance, rt-ALU, rd-ALU, mem-ALU. Microprogramming is a fundamental concept: implement an instruction set by building a very simple processor and interpreting the instructions; essential for very complex instructions and when few register transfers are possible; overkill when the ISA matches the datapath 1-1. 10/11/99 ©UCB Fall 1999

4 Exception = unprogrammed control transfer
[Figure: user program with normal control flow (sequential, jumps, branches, calls, returns) transferring to a System Exception Handler and returning from the exception.] Exception = unprogrammed control transfer: the system takes action to handle the exception. It must record the address of the offending instruction and any other information necessary to return afterwards, then return control to the user program; it must save and restore user state. Allows construction of a "user virtual machine". 10/11/99 ©UCB Fall 1999

5 Two Types of Exceptions: Interrupts and Traps
Interrupts: caused by external events (network, keyboard, disk I/O, timer); asynchronous to program execution. Most interrupts can be disabled for brief periods of time; some (like "Power Failing") are non-maskable (NMI). May be handled between instructions: simply suspend and resume the user program. Traps: caused by internal events, such as exceptional conditions (overflow), errors (parity), or faults (non-resident page); synchronous to program execution. The condition must be remedied by the handler; the instruction may be retried or simulated and the program continued, or the program may be aborted. 10/11/99 ©UCB Fall 1999

6 MIPS convention: exception means any unexpected change in control flow, without distinguishing internal or external; use the term interrupt only when the event is externally caused.
Type of event | From where? | MIPS terminology
I/O device request | External | Interrupt
Invoke OS from user program | Internal | Exception
Arithmetic overflow | Internal | Exception
Using an undefined instruction | Internal | Exception
Hardware malfunctions | Either | Exception or Interrupt
10/11/99 ©UCB Fall 1999

7 What happens to Instruction with Exception?
The MIPS architecture defines the instruction as having no effect if the instruction causes an exception. When we get to virtual memory we will see that certain classes of exceptions must prevent the instruction from changing the machine state. This aspect of handling exceptions becomes complex and potentially limits performance => why it is hard. 10/11/99 ©UCB Fall 1999

8 Performance goals often lead designers to forsake precise interrupts
Precise => the state of the machine is preserved as if the program executed up to the offending instruction. All previous instructions completed; the offending instruction and all following instructions act as if they have not even started. The same system code will work on different implementations. Position clearly established by IBM. Difficult in the presence of pipelining, out-of-order execution, ... MIPS takes this position. Imprecise => system software has to figure out what is where and put it all back together. Performance goals often lead designers to forsake precise interrupts; system software developers, users, markets etc. usually wish they had not done this. Modern techniques for out-of-order execution and branch prediction help implement precise interrupts. 10/11/99 ©UCB Fall 1999

9 Big Picture: user / system modes
By providing two modes of execution (user/system) it is possible for the computer to manage itself. The operating system is a special program that runs in the privileged mode and has access to all of the resources of the computer. It presents "virtual resources" to each user that are more convenient than the physical resources (files vs. disk sectors, virtual memory vs. physical memory), protects each user program from the others, and protects the system from malicious users. The OS is assumed to "know best" and is trusted code, so we enter system mode on an exception. Exceptions allow the system to take action in response to events that occur while the user program is executing: they might provide supplemental behavior (dealing with denormal floating-point numbers, for instance), and "unimplemented instruction" can be used to emulate instructions that were not included in hardware (e.g. MicroVAX). 10/11/99 ©UCB Fall 1999

10 Addressing the Exception Handler
Traditional approach: interrupt vector, PC <- MEM[IV_base + cause || 00] (370, 68000, VAX, 80x86, ...). RISC handler table: PC <- IT_base + cause || 0000; the table entry saves state and jumps to the handler (SPARC, PA, M88K, ...). MIPS approach: fixed entry, PC <- EXC_addr; actually a very small table (RESET entry, TLB entry, other). [Figure: interrupt-vector table at iv_base indexed by cause pointing to handler code, vs. a handler table of entry-code stubs.] 10/11/99 ©UCB Fall 1999
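To make the three addressing schemes concrete, here is a small Python sketch that computes the handler entry point each way. The base addresses and the shift amounts are made-up values for illustration only, not real MIPS or VAX constants.

```python
# Illustrative handler-address computation (base addresses are invented for the example).

IV_BASE  = 0x0000_1000   # hypothetical interrupt-vector table base
IT_BASE  = 0x0000_2000   # hypothetical RISC handler-table base
EXC_ADDR = 0x0000_0080   # hypothetical single fixed entry point

def vectored(cause: int) -> int:
    # PC <- MEM[IV_base + cause || 00]: a table of 4-byte handler *pointers*.
    table_slot = IV_BASE + (cause << 2)
    return table_slot            # hardware would then load MEM[table_slot] into the PC

def handler_table(cause: int) -> int:
    # PC <- IT_base + cause || 0000: jump directly into a table of 16-byte code stubs.
    return IT_BASE + (cause << 4)

def fixed_entry(cause: int) -> int:
    # MIPS style: one entry point; software reads the Cause register to decide what happened.
    return EXC_ADDR

for cause in (0, 4, 12):
    print(cause, hex(vectored(cause)), hex(handler_table(cause)), hex(fixed_entry(cause)))
```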

11 Saving State
Push it onto the stack: VAX, 68k, 80x86. Save it in special registers: MIPS EPC, BadVaddr, Status, Cause. Shadow registers: M88k saves state in a shadow of the internal pipeline registers. 10/11/99 ©UCB Fall 1999

12 Additions to MIPS ISA to support Exceptions?
Exception state is kept in "coprocessor 0". EPC: a 32-bit register used to hold the address of the affected instruction (register 14 of coprocessor 0). Cause: a register used to record the cause of the exception; in the MIPS architecture this register is 32 bits, though some bits are currently unused. Assume that bits 5 to 2 of this register encode the two possible exception sources mentioned above: undefined instruction = 0 and arithmetic overflow = 1 (register 13 of coprocessor 0). BadVAddr: a register containing the memory address at which the offending memory reference occurred (register 8 of coprocessor 0). Status: interrupt mask and enable bits (register 12 of coprocessor 0). Also needed: control signals to write EPC, Cause, BadVAddr, and Status; the ability to write the exception address into the PC (widen the PC mux to take the exception-handler address as an additional input); and we may have to undo PC = PC + 4, since we want EPC to point to the offending instruction (not its successor): PC = PC - 4. 10/11/99 ©UCB Fall 1999

13 Recap: Details of Status register
[Bit layout of Status: bits 15-8 = interrupt Mask; bits 5-0 = three (k, e) pairs: old, previous, current.] Mask = 1 bit for each of 5 hardware and 3 software interrupt levels: 1 => enables that interrupt, 0 => disables it. k = kernel/user: 0 => was in the kernel when the interrupt occurred, 1 => was running in user mode. e = interrupt enable: 0 => interrupts were disabled, 1 => interrupts were enabled. When an interrupt occurs, the 6 LSBs are shifted left 2 bits, setting the 2 LSBs to 0: run in kernel mode with interrupts disabled. 10/11/99 ©UCB Fall 1999
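The shift-left-by-two behavior is easy to sanity-check in a few lines of Python. This is a simplified model of the six LSBs described above, assuming the current (k, e) pair sits in the two lowest bits; it is not a full Status register implementation.

```python
# Simplified model of the low 6 bits of Status: three (k, e) pairs = old, prev, current.
STATUS_LOW6 = 0b111111   # mask for the six LSBs

def enter_exception(status: int) -> int:
    """On an exception, shift the 6 LSBs left by 2 and clear the 2 LSBs:
    current moves to previous, previous to old (old is discarded), and the new
    current (k, e) = (0, 0): kernel mode with interrupts disabled."""
    low6 = (status & STATUS_LOW6) << 2
    return (status & ~STATUS_LOW6) | (low6 & STATUS_LOW6)

def return_from_exception(status: int) -> int:
    """The rfe-style inverse: shift the 6 LSBs right by 2, restoring previous -> current."""
    low6 = (status & STATUS_LOW6) >> 2
    return (status & ~STATUS_LOW6) | low6

s = 0b11111111_00000011        # mask bits set, currently user mode with interrupts enabled
s2 = enter_exception(s)
assert s2 & 0b11 == 0b00       # now kernel mode, interrupts disabled
assert return_from_exception(s2) & 0b11 == 0b11   # back to the pre-exception (k, e) pair
```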

14 Recap: Details of Cause register
[Bit layout of Cause: bits 15-10 = Pending interrupts; bits 5-2 = exception Code.] Pending interrupts (5 hardware levels): a bit is set if an interrupt has occurred but not yet been serviced; this handles cases when more than one interrupt occurs at the same time, or records interrupt requests while interrupts are disabled. The exception Code encodes the reason for the exception: 0 (INT) => external interrupt; 4 (ADDRL) => address error exception (load or instruction fetch); 5 (ADDRS) => address error exception (store); 6 (IBUS) => bus error on instruction fetch; 7 (DBUS) => bus error on data fetch; 8 (Syscall) => syscall exception; 9 (BKPT) => breakpoint exception; 10 (RI) => reserved instruction exception; 12 (OVF) => arithmetic overflow exception. 10/11/99 ©UCB Fall 1999
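A small lookup like the one below shows how a handler might turn the Cause register into a reason. It assumes, as this slide does, that the code sits in bits 5-2 and the pending flags in bits 15-10; it is a sketch, not MIPS system code.

```python
# Decode the exception-code field of Cause, assuming bits 5..2 hold the code
# and bits 15..10 hold the pending-interrupt flags (as described on this slide).
EXC_NAMES = {0: "INT", 4: "ADDRL", 5: "ADDRS", 6: "IBUS", 7: "DBUS",
             8: "SYSCALL", 9: "BKPT", 10: "RI", 12: "OVF"}

def decode_cause(cause: int) -> tuple[str, list[int]]:
    code = (cause >> 2) & 0xF                                # bits 5..2
    pending = [(cause >> (10 + i)) & 1 for i in range(6)]    # bits 15..10
    return EXC_NAMES.get(code, "unknown"), pending

print(decode_cause(12 << 2))   # ('OVF', [0, 0, 0, 0, 0, 0])
```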

15 Example: How Control Handles Traps in our FSD
Undefined instruction: detected when no next state is defined from state 1 for the op value. We handle this exception by defining the next-state value for all op values other than lw, sw, 0 (R-type), jmp, beq, and ori as new state 12. Shown symbolically using "other" to indicate that the op field does not match any of the opcodes that label arcs out of state 1. Arithmetic overflow: detected on ALU ops such as signed add; used to save the PC and enter the exception handler. External interrupt: flagged by an asserted interrupt line; again, we must save the PC and enter the exception handler. Note: the challenge in designing the control of a real machine is to handle different interactions between instructions and other exception-causing events such that the control logic remains small and fast. Complex interactions make the control unit the most challenging aspect of hardware design. 10/11/99 ©UCB Fall 1999

16 How add traps and interrupts to state diagram?
[State diagram with exception arcs, flattened:] "instruction fetch" (0000): IR <= MEM[PC]; PC <= PC + 4; pending INT => Handle Interrupt: EPC <= PC - 4, PC <= exp_addr, cause <= 0 (INT). "decode" (0001): S <= PC + SX; undefined instruction ("other" opcode) => EPC <= PC - 4, PC <= exp_addr, cause <= 10 (RI). Execute: R-type S <= A fun B, with overflow => EPC <= PC - 4, PC <= exp_addr, cause <= 12 (Ovf); ORi S <= A op ZX; LW/SW S <= A + SX; BEQ S <= A - B, if A = B then PC <= S. Memory: LW M <= MEM[S]; SW MEM[S] <= B. Write-back: R-type R[rd] <= S; ORi R[rt] <= S; LW R[rt] <= M. (States 0010, 0100-1100 as in the base FSD.) 10/11/99 ©UCB Fall 1999

17 But: What has to change in our µ-sequencer?
Need the concept of a branch at the micro-code level. [Figure: µAddress Select Logic fed by the Opcode (via the Dispatch ROM), the microPC plus an adder, and a mux; condition inputs include overflow and pending interrupt; microinstruction fields: Cond Select (do µ-branch), µ-offset, Seq Select.] Example: the R-type execute state (S <= A fun B, 0100) must µ-branch on overflow into the exception microcode: EPC <= PC - 4; PC <= exp_addr; cause <= 12 (Ovf). 10/11/99 ©UCB Fall 1999

18 Example: Can easily use µcode with non-ideal memory
"instruction fetch": IR <= MEM[PC] (loop while wait; advance on ~wait). "decode / operand fetch": A <= R[rs]; B <= R[rt]. Execute: R-type S <= A fun B; ORi S <= A or ZX; LW/SW S <= A + SX; BEQ PC <= Next(PC). Memory: LW M <= MEM[S], SW MEM[S] <= B, each looping while wait and advancing on ~wait. Write-back: R-type R[rd] <= S, PC <= PC + 4; ORi R[rt] <= S, PC <= PC + 4; LW R[rt] <= M, PC <= PC + 4; SW/BEQ PC <= PC + 4. 10/11/99 ©UCB Fall 1999

19 Summary: Microprogramming one inspiration for RISC
If simple instructions could execute at a very high clock rate… If you could even write compilers to produce microinstructions… If most programs use simple instructions and addressing modes… If microcode is kept in RAM instead of ROM so as to fix bugs… If the same memory used for control memory could be used instead as a cache for "macroinstructions"… Then why not skip instruction interpretation by a microprogram and simply compile directly into the lowest-level language of the machine? (Microprogramming is overkill when the ISA matches the datapath 1-1.) 10/11/99 ©UCB Fall 1999

20 Administrative Issues: Result of Midterm I
Exam Average: 62, Standard Dev: 13.5 People had trouble with square-root problem This was very much like Divide! Large shift register moving left Some issues with microcode 10/11/99 ©UCB Fall 1999

21 Square root example: consider remainder shifting LEFT
Starting: M = 01110110, R0 = 01110110, and S0 = 0000. Try: N1 = (2S0+1000)·1000 = 01000000; R1 = 00110110; S1 = S0 + 1000 = 1000. Try: N2 = (2S1+0100)·0100 = 01010000; result < 0, so S2 = S1 = 1000, R2 = 00110110 (unchanged). Try: N3 = (2S2+0010)·0010 = 00100100; R3 = 00010010; S3 = S2 + 0010 = 1010. Try: N4 = (2S3+0001)·0001 = 00010101; result < 0, so S4 = S3 = 1010, R4 = 00010010 (unchanged). Final result: 01110110 = 1010² + 00010010, or: 118 = 10² with 18 remainder! 10/11/99 ©UCB Fall 1999
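The same shift-left trial-subtraction idea can be written as a short Python routine. This is a sketch of the algorithm worked above, not the exam's microcode; the test value 118 is chosen only because it reproduces the slide's final answer of 10 with remainder 18.

```python
def sqrt_shift_left(m: int, bits: int = 8) -> tuple[int, int]:
    """Restoring square root: shift two radicand bits into the remainder per step,
    try subtracting (4*S + 1), and append a root bit: S <- 2S (+1 if the try succeeded)."""
    s, r = 0, 0
    for i in range(bits // 2 - 1, -1, -1):
        r = (r << 2) | ((m >> (2 * i)) & 0b11)   # bring in the next two bits of M
        trial = (s << 2) | 1                     # the (2S + next_bit) * next_bit trial, normalized
        if r >= trial:
            r, s = r - trial, (s << 1) | 1       # try succeeded: keep the subtraction, root bit = 1
        else:
            s = s << 1                           # try failed: restore the remainder, root bit = 0
    return s, r

print(sqrt_shift_left(0b01110110))   # (10, 18): 118 = 10**2 + 18, matching the slide
```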

22 Administrative Issues (continued)
Get started reading Chapter 6! Complete chapter on Pipelining... Next week sections => Cory 119 Computers in the News: Merced silicon has finally seen light of day 10/11/99 ©UCB Fall 1999

23 The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output. Next topics: Pipelining by Analogy; Administrivia; Course road map. So where are we in the overall scheme of things? Well, we just finished designing the processor's datapath. Now I am going to show you how to design the control for the datapath. +1 = 7 min. (X:47) 10/11/99 ©UCB Fall 1999

24 Pipelining is Natural! Laundry Example
Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold. Washer takes 30 minutes; dryer takes 40 minutes; "folder" takes 20 minutes. [Figure: the four loads A, B, C, D.] 10/11/99 ©UCB Fall 1999

25 Sequential laundry takes 6 hours for 4 loads
[Timeline: 6 PM to Midnight; tasks A, B, C, D each take 30 + 40 + 20 minutes, run back to back.] Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take? 10/11/99 ©UCB Fall 1999

26 Pipelined Laundry: Start work ASAP
[Timeline: 6 PM onward; loads A-D overlapped, with the 40-minute dryer as the bottleneck stage.] Pipelined laundry takes 3.5 hours for 4 loads. 10/11/99 ©UCB Fall 1999
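The 6-hour and 3.5-hour figures follow directly from the stage times; a few lines of Python make the bottleneck (the 40-minute dryer) explicit.

```python
# Laundry timing: washer 30 min, dryer 40 min, folder 20 min, 4 loads.
stages, loads = [30, 40, 20], 4

sequential = loads * sum(stages)                          # 4 * 90 = 360 min = 6 hours
pipelined  = sum(stages) + (loads - 1) * max(stages)      # 90 + 3*40 = 210 min = 3.5 hours

print(sequential / 60, pipelined / 60)                    # 6.0 3.5
```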

27 Pipelining Lessons
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. Pipeline rate is limited by the slowest pipeline stage. Multiple tasks operate simultaneously using different resources. Potential speedup = number of pipe stages. Unbalanced lengths of pipe stages reduce speedup. Time to "fill" the pipeline and time to "drain" it reduce speedup. Stall for dependences. [Timeline figure as on the previous slides.] 10/11/99 ©UCB Fall 1999

28 The Five Stages of Load
[Pipeline diagram: Load occupies Ifetch, Reg/Dec, Exec, Mem, Wr across Cycles 1-5.] Ifetch: Instruction Fetch, fetch the instruction from the Instruction Memory. Reg/Dec: Register Fetch and Instruction Decode. Exec: calculate the memory address. Mem: read the data from the Data Memory. Wr: write the data back to the register file. As shown here, each of these five steps will take one clock cycle to complete. And in pipeline terminology, each step is referred to as one stage of the pipeline. +1 = 8 min. (X:48) 10/11/99 ©UCB Fall 1999

29 Note: These 5 stages were there all along!
Fetch (0000): IR <= MEM[PC]; PC <= PC + 4. Decode (0001): ALUout <= PC + SX. Execute: R-type (0100) ALUout <= A fun B; ORi (0110) ALUout <= A op ZX; LW (1000) / SW (1011) ALUout <= A + SX; BEQ (0010) if A = B then PC <= ALUout. Memory: LW (1001) M <= MEM[ALUout]; SW (1100) MEM[ALUout] <= B. Write-back: R-type (0101) R[rd] <= ALUout; ORi (0111) R[rt] <= ALUout; LW (1010) R[rt] <= M. 10/11/99 ©UCB Fall 1999

30 Improve performance by increasing throughput
Pipelining Improve performance by increasing throughput Ideal speedup is number of stages in the pipeline. Do we achieve this? 10/11/99 ©UCB Fall 1999

31 What do we need to add to split the datapath into stages?
Basic Idea What do we need to add to split the datapath into stages? 10/11/99 ©UCB Fall 1999

32 Graphically Representing Pipelines
Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths 10/11/99 ©UCB Fall 1999

33 Conventional Pipelined Execution Representation
[Diagram: successive instructions each flow through IFetch, Dcd, Exec, Mem, WB, staggered by one cycle along the time axis; program flow runs downward.] 10/11/99 ©UCB Fall 1999
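This representation is easy to generate mechanically. The sketch below prints a staggered stage chart for an ideal pipeline with no stalls; hazards are ignored, so it shows only the conventional picture, not real execution.

```python
# Print an ideal (no-stall) pipeline occupancy chart: one instruction issued per cycle.
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]

def pipeline_chart(n_instructions: int) -> None:
    total_cycles = n_instructions + len(STAGES) - 1
    header = "      " + "".join(f"C{c + 1:<7}" for c in range(total_cycles))
    print(header)
    for i in range(n_instructions):
        row = ["        "] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name:<8}"          # instruction i is in stage s during cycle i + s
        print(f"I{i:<4} " + "".join(row))

pipeline_chart(4)   # four instructions, each shifted one cycle to the right
```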

34 Single Cycle, Multiple Cycle, vs. Pipeline
[Timing diagrams. Single Cycle implementation: Load, then Store, with wasted time at the end of each long cycle. Multiple Cycle implementation (Cycles 1-10): Load = Ifetch Reg Exec Mem Wr, Store = Ifetch Reg Exec Mem, then the R-type begins. Pipeline implementation: Load, Store, and R-type each go Ifetch Reg Exec Mem Wr, overlapped.] Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. In the multiple clock cycle implementation, however, we cannot start executing the store until Cycle 6 because we must wait for the load instruction to complete. Similarly, we cannot start the execution of the R-type instruction until the store instruction has completed its execution in Cycle 9. In the Single Cycle implementation, the cycle time is set to accommodate the longest instruction, the Load instruction. Consequently, the cycle time for the Single Cycle implementation can be five times longer than the multiple cycle implementation. But maybe more importantly, since the cycle time has to be long enough for the load instruction, it is too long for the store instruction, so the last part of the cycle here is wasted. +2 = 77 min. (X:57) 10/11/99 ©UCB Fall 1999

35 Why Pipeline?
Suppose we execute 100 instructions. Single Cycle Machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns. Multicycle Machine: 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns. Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns. 10/11/99 ©UCB Fall 1999
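The three totals are simple arithmetic; the snippet below just reproduces the numbers on the slide.

```python
n = 100  # instructions

single_cycle = 45 * 1.0 * n        # 45 ns/cycle, CPI 1                      -> 4500 ns
multicycle   = 10 * 4.6 * n        # 10 ns/cycle, CPI 4.6 (instruction mix)  -> 4600 ns
pipelined    = 10 * (1.0 * n + 4)  # 10 ns/cycle, CPI 1 plus a 4-cycle drain -> 1040 ns

print(round(single_cycle), round(multicycle), round(pipelined))   # 4500 4600 1040
```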

36 Why Pipeline? Because the resources are there!
[Diagram: instructions Inst 0 through Inst 4, each using Im, Reg, ALU, Dm, Reg in successive clock cycles, overlapped so every resource is busy every cycle.] 10/11/99 ©UCB Fall 1999

37 Can pipelining get us into trouble?
Yes: Pipeline Hazards. Structural hazards: attempt to use the same resource two different ways at the same time. E.g., a combined washer/dryer would be a structural hazard, or the folder busy doing something else (watching TV). Data hazards: attempt to use an item before it is ready. E.g., one sock of a pair in the dryer and one in the washer; can't fold until you get the sock from the washer through the dryer. An instruction depends on the result of a prior instruction still in the pipeline. Control hazards: attempt to make a decision before the condition is evaluated. E.g., washing football uniforms and needing to get the proper detergent level; need to see what comes out of the dryer before the next load goes in. In a processor: branch instructions. We can always resolve hazards by waiting: pipeline control must detect the hazard and take action (or delay action) to resolve it. 10/11/99 ©UCB Fall 1999

38 Single Memory is a Structural Hazard
[Diagram: Load followed by Instr 1-4, each using Mem, Reg, ALU, Mem, Reg; with a single memory, Load's data access and a later instruction's fetch need the memory in the same cycle.] Detection is easy in this case! (Right-half highlight means read, left half write.) 10/11/99 ©UCB Fall 1999

39 Structural Hazards limit performance
Example: if there are 1.3 memory accesses per instruction and only one memory access is possible per cycle, then average CPI >= 1.3; otherwise the memory resource would be more than 100% utilized. 10/11/99 ©UCB Fall 1999

40 Control Hazard Solution #1: Stall
[Diagram: Add, Beq, Load flowing through the pipeline; the cycles after the branch are lost potential.] Stall: wait until the decision is clear. Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow. Move the decision to the end of decode: save 1 cycle per branch. 10/11/99 ©UCB Fall 1999

41 Control Hazard Solution #2: Predict
[Diagram: Add, Beq, Load in the pipeline, with the predicted-path instruction fetched immediately after the branch.] Predict: guess one direction, then back up if wrong. Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right ~50% of the time). Need to "squash" and restart the following instruction if wrong. Produces a CPI on branches of (1 x 0.5 + 2 x 0.5) = 1.5. Total CPI might then be: 1.5 x 0.2 + 1 x 0.8 = 1.1 (20% branches). More dynamic scheme: keep a history per branch (~90% right). 10/11/99 ©UCB Fall 1999
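The CPI estimate is just expected-value arithmetic, written out below with the slide's 50% prediction accuracy and 20% branch frequency.

```python
p_right, branch_frac = 0.5, 0.20   # predictor right ~50% of the time; 20% of instructions branch

branch_cpi = 1 * p_right + 2 * (1 - p_right)                     # 1.5 cycles per branch
total_cpi  = branch_cpi * branch_frac + 1 * (1 - branch_frac)    # 1.1 overall

print(branch_cpi, round(total_cpi, 2))                           # 1.5 1.1
```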

42 Control Hazard Solution #3: Delayed Branch
[Diagram: Add, Beq, Misc, Load in the pipeline; the instruction after the branch (the delay slot) always executes.] Delayed branch: redefine branch behavior (the branch takes place after the next instruction). Impact: 0 clock cycles per branch instruction if we can find an instruction to put in the "slot" (~50% of the time). As we launch more instructions per clock cycle, this becomes less useful. 10/11/99 ©UCB Fall 1999

43 Data Hazard on r1
add r1, r2, r3; sub r4, r1, r3; and r6, r1, r7; or r8, r1, r9; xor r10, r1, r11 10/11/99 ©UCB Fall 1999

44 Data Hazard on r1: Dependencies backwards in time are hazards
[Pipeline diagram: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11, each flowing through IF, ID/RF, EX, MEM, WB (Im, Reg, ALU, Dm, Reg); the later instructions read r1 in ID/RF before add writes it in WB, so the dependences point backwards in time.] 10/11/99 ©UCB Fall 1999

45 Data Hazard Solution: "Forward" result from one stage to another
[Pipeline diagram: the same add/sub/and/or/xor sequence, with the add result forwarded from the later pipeline stages to the ALU inputs of the following instructions.] "or" is OK if we define register-file read/write within a cycle properly. 10/11/99 ©UCB Fall 1999
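Forwarding boils down to comparing register numbers held in adjacent pipeline registers. The sketch below checks, for the slide's sequence, which source operands would be satisfied by a bypass from one or two instructions earlier; it is a simplified model, not the full MIPS forwarding unit.

```python
# Each instruction: (mnemonic, destination register, source registers).
prog = [("add", 1, (2, 3)),
        ("sub", 4, (1, 3)),
        ("and", 6, (1, 7)),
        ("or",  8, (1, 9)),
        ("xor", 10, (1, 11))]

def forwarding_needed(prog):
    """For each instruction, report sources produced 1 cycle earlier (EX/MEM bypass)
    or 2 cycles earlier (MEM/WB bypass); older results are read normally from the register file."""
    for i, (op, dst, srcs) in enumerate(prog):
        for src in srcs:
            for distance, path in ((1, "EX/MEM"), (2, "MEM/WB")):
                j = i - distance
                if j >= 0 and prog[j][1] == src:
                    print(f"{op} r{dst}: forward r{src} from {path} ({prog[j][0]})")
                    break

forwarding_needed(prog)
# sub r4: forward r1 from EX/MEM (add)
# and r6: forward r1 from MEM/WB (add)
# or and xor read r1 from the register file (assuming write-then-read within a cycle)
```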

46 Forwarding (or Bypassing): What about Loads?
Dependencies backwards in time are hazards. Can't solve this one with forwarding: we must delay/stall the instruction dependent on the load. [Pipeline diagram: lw r1,0(r2) followed by sub r4,r1,r3; the loaded value is not available until the end of lw's MEM stage, after sub's EX stage needs it.] 10/11/99 ©UCB Fall 1999

47 Forwarding (or Bypassing): What about Loads
Dependencies backwards in time are hazards. Can't solve this with forwarding alone: we must delay/stall the instruction dependent on the load. [Pipeline diagram: lw r1,0(r2) followed by sub r4,r1,r3 with a one-cycle Stall (bubble) inserted so the loaded value can be forwarded to sub's EX stage.] 10/11/99 ©UCB Fall 1999
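Whether a bubble is required is the same kind of register-number check, with the extra condition that the producer is a load. A minimal sketch, assuming a classic 5-stage pipeline with full forwarding:

```python
# Detect load-use hazards: a load's result is only available after MEM,
# so an immediately following consumer must stall one cycle even with forwarding.
prog = [("lw", 1, (2,)),       # lw  r1, 0(r2)
        ("sub", 4, (1, 3))]    # sub r4, r1, r3

def count_load_use_stalls(prog) -> int:
    stalls = 0
    for i in range(1, len(prog)):
        op_prev, dst_prev, _ = prog[i - 1]
        _, _, srcs = prog[i]
        if op_prev == "lw" and dst_prev in srcs:
            stalls += 1        # insert one bubble between the load and its user
    return stalls

print(count_load_use_stalls(prog))   # 1
```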

48 Designing a Pipelined Processor
Go back and examine your datapath and control diagram. Associate resources with states. Ensure that flows do not conflict, or figure out how to resolve the conflicts. Assert control in the appropriate stage. 10/11/99 ©UCB Fall 1999

49 Control and Datapath: Split state diag into 5 pieces
Fetch: IR <- Mem[PC]; PC <- PC+4. Decode: A <- R[rs]; B <- R[rt]. Execute: S <- A + B; S <- A or ZX; S <- A + SX; if Cond PC <- PC+SX. Memory: M <- Mem[S]; Mem[S] <- B. Write-back: R[rd] <- S; R[rt] <- S; R[rd] <- M. [Datapath pieces: PC / Next PC and Inst. Mem feeding IR; Reg. File read (A, B) with Equal comparator; Exec producing S; Mem Access (M, D, Data Mem); Reg File write.] 10/11/99 ©UCB Fall 1999

50 Pipelined Processor (almost) for slides
What happens if we start a new instruction every cycle? [Pipelined datapath: PC / Next PC and Inst. Mem feeding IR; pipeline registers IR, IRex, IRmem, IRwb with Valid bits; control split into Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl; Reg. File (A, B) with Equal; Exec (S); Mem Access (M, D, Data Mem); Reg File write.] 10/11/99 ©UCB Fall 1999

51 Pipelined Datapath (as in book); hard to read
10/11/99 ©UCB Fall 1999

52 Pipelining the Load Instruction
[Pipeline diagram: three lw instructions, each occupying Ifetch, Reg/Dec, Exec, Mem, Wr, staggered by one cycle across Cycles 1-7.] The five independent functional units in the pipeline datapath are: the Instruction Memory for the Ifetch stage; the Register File's read ports (busA and busB) for the Reg/Dec stage; the ALU for the Exec stage; the Data Memory for the Mem stage; and the Register File's write port (busW) for the Wr stage. Notice that I have treated the Register File's read and write ports as separate functional units because the register file we have allows us to read and write at the same time. Notice that as soon as the 1st load finishes its Ifetch stage, it no longer needs the Instruction Memory. Consequently, the 2nd load can start using the Instruction Memory (2nd Ifetch). Furthermore, since each functional unit is only used ONCE per instruction, we will not have any conflict down the pipeline (Exec-Ifet, Mem-Exec, Wr-Mem) either. I will show you the interaction between instructions in the pipelined datapath later. But for now, I want to point out the performance advantages of pipelining. If these 3 load instructions are to be executed by the multiple cycle processor, it will take 15 cycles. But with pipelining, it only takes 7 cycles. This (7 cycles), however, is not the best way to look at the performance advantages of pipelining. A better way to look at this is that we have one instruction enter the pipeline every cycle, so we will have one instruction coming out of the pipeline (Wr stage) every cycle. Consequently, the "effective" (or average) number of cycles per instruction is now ONE, even though it takes a total of 5 cycles to complete each instruction. +3 = 14 min. (X:54) 10/11/99 ©UCB Fall 1999
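The 7-versus-15-cycle comparison in the notes generalizes to the usual formula: fill time plus one cycle per additional instruction. A quick check:

```python
def pipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    # The first instruction takes n_stages cycles to fill the pipe; each later one adds a cycle.
    return n_stages + (n_instructions - 1)

print(pipelined_cycles(3))   # 7 cycles for three back-to-back loads
print(3 * 5)                 # 15 cycles on the multicycle machine (5 cycles each)
```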

53 The Four Stages of R-type
[Pipeline diagram: R-type occupies Ifetch, Reg/Dec, Exec, Wr across Cycles 1-4.] Ifetch: Instruction Fetch, fetch the instruction from the Instruction Memory. Reg/Dec: Register Fetch and Instruction Decode. Exec: the ALU operates on the two register operands; update the PC. Wr: write the ALU output back to the register file. Well, so far so good. Let's take a look at the R-type instructions. The R-type instruction does NOT access data memory, so it only takes four clock cycles, or in our new pipeline terminology, four stages to complete. The Ifetch and Reg/Dec stages are identical to the Load instruction's. Well, they have to be, because at this point we do not know we have an R-type instruction yet. Instead of calculating the effective address during the Exec stage, the R-type instruction will use the ALU to operate on the register operands. The result of this ALU operation is written back to the register file during the Wr back stage. +1 = 15 min. (55) 10/11/99 ©UCB Fall 1999

54 Pipelining the R-type and Load Instruction
[Pipeline diagram, Cycles 1-9: R-type, R-type, Load, R-type, R-type issued on successive cycles; the Load's Wr stage collides with a later R-type's Wr stage.] Oops! We have a problem! What happens if we try to pipeline the R-type instructions with the Load instructions? Well, we have a problem here!!! We end up having two instructions trying to write to the register file at the same time! Why do we have this problem (the write "bubble")? Well, the reason for this problem is that there is something I have not yet told you. +1 = 16 min. (X:56) We have a pipeline conflict or structural hazard: two instructions try to write to the register file at the same time, and there is only one write port! 10/11/99 ©UCB Fall 1999

55 Important Observation
Each functional unit can only be used once per instruction Each functional unit must be used at the same stage for all instructions: Load uses Register File’s Write Port during its 5th stage R-type uses Register File’s Write Port during its 4th stage Ifetch Reg/Dec Exec Mem Wr Load 1 2 3 4 5 Ifetch Reg/Dec Exec Wr R-type 1 2 3 4 I already told you that in order for pipeline to work perfectly, each functional unit can ONLY be used once per instruction. What I have not told you is that this (1st bullet) is a necessary but NOT sufficient condition for pipeline to work. The other condition to prevent pipeline hiccup is that each functional unit must be used at the same stage for all instructions. For example here, the load instruction uses the Register File’s Wr port during its 5th stage but the R-type instruction right now will use the Register File’s port during its 4th stage. This (5 versus 4) is what caused our problem. How do we solve it? We have 2 solutions. +1 = 17 min. (X:57) 2 ways to solve this pipeline hazard. 10/11/99 ©UCB Fall 1999

56 Solution 1: Insert “Bubble” into the Pipeline
[Pipeline diagram, Cycles 1-9: a Load followed by R-types, with a Bubble inserted so that no instruction starts in Cycle 6.] Insert a "bubble" into the pipeline to prevent 2 writes in the same cycle. The control logic can be complex, and we lose an instruction fetch and issue opportunity: no instruction is started in Cycle 6! The first solution is to insert a "bubble" into the pipeline AFTER the load instruction to push back every instruction after the load that is already in the pipeline by one cycle. At the same time, the bubble will delay the Instruction Fetch of the instruction that is about to enter the pipeline by one cycle. Needless to say, the control logic to accomplish this can be complex. Furthermore, this solution also has a negative impact on performance. Notice that due to the "extra" stage (Mem) the Load instruction has, we will not have one instruction finish every cycle (points to Cycle 5). Consequently, a mix of load and R-type instructions will NOT have an average CPI of 1 because, in effect, the Load instruction has an effective CPI of 2. So this is not that hot an idea. Let's try something else. +2 = 19 min. (X:59) 10/11/99 ©UCB Fall 1999

57 Solution 2: Delay R-type’s Write by One Cycle
Delay the R-type's register write by one cycle: now R-type instructions also use the Reg File's write port at Stage 5, and the Mem stage becomes a NOOP stage: nothing is being done there. [Pipeline diagram, Cycles 1-9: R-type, R-type, Load, R-type, R-type, each now going Ifetch, Reg/Dec, Exec, Mem, Wr across stages 1-5.] Well, one thing we can do is to add a "NOP" stage to the R-type instruction pipeline to delay its register file write by one cycle. Now the R-type instruction ALSO uses the register file's write port at its 5th stage, so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned. As far as performance is concerned, we also get back to having one instruction complete per cycle. This is kind of like promoting socialism: by making each individual R-type instruction take 5 cycles instead of 4 cycles to finish, our overall performance is actually better off. The reason for this higher performance is that we end up having a more efficient pipeline. +1 = 20 min. (Y:00) 10/11/99 ©UCB Fall 1999

58 Modified Control & Datapath
Fetch: IR <- Mem[PC]; PC <- PC+4. Decode: A <- R[rs]; B <- R[rt]. Execute: S <- A + B; S <- A or ZX; S <- A + SX; if Cond PC <- PC+SX. Memory: M <- S (for non-memory instructions); M <- Mem[S] (load); Mem[S] <- B (store). Write-back: R[rd] <- M; R[rt] <- M. [Datapath: PC / Next PC, Inst. Mem, IR; Reg. File (A, B) with Equal; Exec (S); Mem Access (M, D, Data Mem); Reg File write.] 10/11/99 ©UCB Fall 1999

59 The Four Stages of Store
[Pipeline diagram: Store occupies Ifetch, Reg/Dec, Exec, Mem across Cycles 1-4; Wr is a no-op.] Ifetch: Instruction Fetch, fetch the instruction from the Instruction Memory. Reg/Dec: Register Fetch and Instruction Decode. Exec: calculate the memory address. Mem: write the data into the Data Memory. Let's continue our lecture by looking at the store instruction. Once again, the Ifetch and Reg/Dec stages are the same as for all other instructions. The Exec stage of the store instruction calculates the memory address. Once the address is calculated, the store instruction will write the data it read from the register file back at the Reg/Dec stage into the data memory during the Mem stage. Notice that unlike the load instruction, which takes five cycles to accomplish its task, the store instruction only takes four cycles or four pipe stages. In order to keep our pipeline diagram looking more uniform, however, we will keep the Wr stage for the store instruction in the pipeline diagram. But keep in mind that as far as the pipelined control and pipelined datapath are concerned, the store instruction requires NOTHING to be done once it finishes its Mem stage. +2 = 27 min. (Y:07) 10/11/99 ©UCB Fall 1999

60 The Three Stages of Beq
[Pipeline diagram: Beq goes Ifetch, Reg/Dec, Exec, with nothing left to do in Mem and Wr.] Ifetch: Instruction Fetch, fetch the instruction from the Instruction Memory. Reg/Dec: Register Fetch and Instruction Decode. Exec: compare the two register operands, select the correct branch target address, and latch it into the PC. Well, similar to the store instruction, the branch instruction only consists of four pipe stages. Ifetch and Reg/Dec are the same as for all other instructions because we do not know what instruction we have at this point; we have not finished decoding the instruction yet. During the Execute stage of the pipeline, the BEQ instruction will use the ALU to compare the two register operands it fetched during the Reg/Dec stage. At the same time, a separate adder is used to calculate the branch target address. If the registers we compared during the Execute stage have the same value, the branch is taken; that is, the branch target address we calculated earlier will be written into the Program Counter. Once again, similar to the Store instruction, the BEQ instruction will require NEITHER the pipelined control nor the pipelined datapath to do ANYTHING once it finishes its Mem stage. With all this talk about pipelined datapath and pipelined control, let's take a look at what the pipelined datapath looks like. +2 = 29 min. (Y:09) 10/11/99 ©UCB Fall 1999

61 Control Diagram
Fetch: IR <- Mem[PC]; PC <- PC+4. Decode: A <- R[rs]; B <- R[rt]. Execute: S <- A + B; S <- A or ZX; S <- A + SX; if Cond PC <- PC+SX. Memory: M <- S; M <- Mem[S]; Mem[S] <- B. Write-back: R[rd] <- S; R[rt] <- S; R[rd] <- M. [Datapath: PC / Next PC, Inst. Mem, IR; Reg. File (A, B) with Equal; Exec (S); Mem Access (M, D, Data Mem); Reg File write.] 10/11/99 ©UCB Fall 1999

62 Datapath + Data Stationary Control
[Datapath with data-stationary control: the IR is decoded once, and the control fields (op, fun, im, rs, rt, rw, ex, me/mem, wb) travel down the pipeline registers with valid bits, driving Exec, Mem Ctrl, and WB Ctrl in the stage where each is used. Datapath: PC / Next PC, Inst. Mem, Reg. File (A, B), Exec (S), Mem Access (M, D, Data Mem), Reg File write.] 10/11/99 ©UCB Fall 1999

63 Let's Try it Out
10 lw r1, r2(35); 14 addI r2, r2, 3; 20 sub r3, r4, r5; 24 beq r6, r7, 100; 30 ori r8, r9, 17; 34 add r10, r11, r12; 100 and r13, r14, 15 (these addresses are octal) 10/11/99 ©UCB Fall 1999

64 Start: Fetch 10
[Pipeline snapshot: PC = 10; fetching 10 (lw r1, r2(35)); Decode, Exec, Mem, and WB all hold nops. The program listing from slide 63 is shown alongside.] 10/11/99 ©UCB Fall 1999

65 Fetch 14, Decode 10
[Pipeline snapshot: PC = 14; fetching 14; Decode holds lw r1, r2(35) (rs = 2, imm = 35); Exec, Mem, and WB hold nops.] 10/11/99 ©UCB Fall 1999

66 Fetch 20, Decode 14, Exec 10
[Pipeline snapshot: PC = 20; fetching 20; Decode holds addI r2, r2, 3; Exec holds lw r1 (computing r2 + 35); Mem and WB hold nops.] 10/11/99 ©UCB Fall 1999

67 Fetch 24, Decode 20, Exec 14, Mem 10
[Pipeline snapshot: PC = 24; fetching 24; Decode holds sub r3, r4, r5; Exec holds addI r2, r2, 3; Mem holds lw r1 (address r2+35); WB holds a nop.] 10/11/99 ©UCB Fall 1999

68 Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
[Pipeline snapshot: PC = 30; fetching 30; Decode holds beq r6, r7, 100; Exec holds sub r3 (r4 - r5); Mem holds addI r2 (r2+3); WB writes r1 = M[r2+35] from the lw.] 10/11/99 ©UCB Fall 1999

69 Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14
[Pipeline snapshot: PC = 34; fetching 34; Decode holds ori r8, r9, 17; Exec holds beq r6, r7, 100 (comparing r6 and r7); Mem holds sub r3 (r4-r5); WB writes r2 = r2+3; r1 = M[r2+35] is now in the register file.] 10/11/99 ©UCB Fall 1999

70 Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20
[Pipeline snapshot: the branch at 24 resolves taken, so PC = 100; Decode holds add r10, r11, r12; Exec holds ori r8 (r9 | 17); Mem holds beq; WB writes r3 = r4-r5; r2 = r2+3 is in the register file.] Oops, we should have only one delayed instruction! 10/11/99 ©UCB Fall 1999

71 Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24
[Pipeline snapshot: PC = 104; Decode holds and r13, r14, r15; Exec holds add r10 (r11, r12); Mem holds ori r8 (r9 | 17); WB holds the beq (no register write); r2 = r2+3 and r3 = r4-r5 are in the register file.] Squash the extra instruction in the branch shadow! 10/11/99 ©UCB Fall 1999

72 Fetch 108, Dcd 104, Ex 100, Mem 34, WB 30
[Pipeline snapshot: PC = 110; Decode holds the instruction fetched from 104; Exec holds and r13 (r14, r15); Mem holds add r10 (r11+r12); WB writes r8 = r9 | 17; r2 = r2+3 and r3 = r4-r5 are in the register file.] Squash the extra instruction in the branch shadow! 10/11/99 ©UCB Fall 1999

73 Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34
[Pipeline snapshot: PC = 114; Decode holds the instruction from 110; Exec holds the instruction from 104; Mem holds and r13 (r14 & r15); WB holds add r10 (r11+r12), marked NO WB / NO Ovflow because it sits in the branch shadow; r2 = r2+3, r3 = r4-r5, and r8 = r9 | 17 are in the register file.] Squash the extra instruction in the branch shadow! 10/11/99 ©UCB Fall 1999

74 We’ll build a simple pipeline and look at these issues
Summary: Pipelining. What makes it easy: all instructions are the same length; just a few instruction formats; memory operands appear only in loads and stores. What makes it hard: structural hazards (suppose we had only one memory); control hazards (need to worry about branch instructions); data hazards (an instruction depends on a previous instruction). We'll build a simple pipeline and look at these issues. We'll talk about modern processors and what really makes it hard: exception handling; trying to improve performance with out-of-order execution, etc. 10/11/99 ©UCB Fall 1999

75 Pipelining is a fundamental concept
Summary: Pipelining is a fundamental concept: multiple steps using distinct resources. Utilize the capabilities of the datapath by pipelined instruction processing: start the next instruction while working on the current one; throughput is limited by the length of the longest stage (plus fill/flush); detect and resolve hazards. 10/11/99 ©UCB Fall 1999

