Part IV Data Path and Control
July 2005 Computer Architecture, Data Path and Control
About This Presentation
  This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami
  Edition: First. Released: July 2003. Revised: July 2004, July 2005.
IV Data Path and Control
  Design a simple computer (MicroMIPS) to learn about:
    Data path: the part of the CPU where data signals flow
    Control unit: guides data signals through the data path
    Pipelining: a way of achieving greater performance
  Topics in This Part
    Chapter 13 Instruction Execution Steps
    Chapter 14 Control Unit Synthesis
    Chapter 15 Pipelined Data Paths
    Chapter 16 Pipeline Performance Limits
13 Instruction Execution Steps
  A simple computer executes instructions one at a time:
    Fetches an instruction from the location pointed to by the PC
    Interprets and executes the instruction, then repeats
  Topics in This Chapter
    13.1 A Small Set of Instructions
    13.2 The Instruction Execution Unit
    13.3 A Single-Cycle Data Path
    13.4 Branching and Jumping
    13.5 Deriving the Control Signals
    13.6 Performance of the Single-Cycle Design
13.1 A Small Set of Instructions
  Fig. MicroMIPS instruction formats and naming of the various fields. We will refer to this diagram later.
  Seven R-format ALU instructions (add, sub, slt, and, or, xor, nor)
  Six I-format ALU instructions (lui, addi, slti, andi, ori, xori)
  Two I-format memory access instructions (lw, sw)
  Three I-format conditional branch instructions (bltz, beq, bne)
  Four unconditional jump instructions (j, jr, jal, syscall)
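The field naming referred to above can be illustrated with a short sketch. It assumes the standard 32-bit MIPS layout (6-bit op, 5-bit rs/rt/rd/sh, 6-bit fn, 16-bit imm, 26-bit jta); the function name `decode_fields` and the sample instruction are illustrative choices, not taken from the slides.

```python
def decode_fields(word: int) -> dict:
    """Split a 32-bit instruction word into fields (standard MIPS layout assumed)."""
    return {
        "op":  (word >> 26) & 0x3F,   # bits 31-26: opcode
        "rs":  (word >> 21) & 0x1F,   # bits 25-21: first source register
        "rt":  (word >> 16) & 0x1F,   # bits 20-16: second source or destination
        "rd":  (word >> 11) & 0x1F,   # bits 15-11: R-format destination
        "sh":  (word >> 6) & 0x1F,    # bits 10-6: shift amount
        "fn":  word & 0x3F,           # bits 5-0: R-format function code
        "imm": word & 0xFFFF,         # bits 15-0: I-format immediate
        "jta": word & 0x3FFFFFF,      # bits 25-0: J-format jump target address
    }

# Hypothetical sample: add $8, $16, $17 (op = 0, fn = 32)
word = (16 << 21) | (17 << 16) | (8 << 11) | 32
fields = decode_fields(word)
```

Every format shares the same bit positions, so one extraction step serves R-, I-, and J-format instructions; the control unit decides later which fields are meaningful.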
The MicroMIPS Instruction Set
  Table 13.1 The MicroMIPS instruction set.
  Class             Instruction               Usage            op   fn
  Copy              Load upper immediate      lui  rt,imm      15
  Arithmetic        Add                       add  rd,rs,rt     0   32
                    Subtract                  sub  rd,rs,rt     0   34
                    Set less than             slt  rd,rs,rt     0   42
                    Add immediate             addi rt,rs,imm    8
                    Set less than immediate   slti rt,rs,imm   10
  Logic             AND                       and  rd,rs,rt     0   36
                    OR                        or   rd,rs,rt     0   37
                    XOR                       xor  rd,rs,rt     0   38
                    NOR                       nor  rd,rs,rt     0   39
                    AND immediate             andi rt,rs,imm   12
                    OR immediate              ori  rt,rs,imm   13
                    XOR immediate             xori rt,rs,imm   14
  Memory access     Load word                 lw   rt,imm(rs)  35
                    Store word                sw   rt,imm(rs)  43
  Control transfer  Jump                      j    L            2
                    Jump register             jr   rs           0    8
                    Branch less than 0        bltz rs,L         1
                    Branch equal              beq  rs,rt,L      4
                    Branch not equal          bne  rs,rt,L      5
                    Jump and link             jal  L            3
                    System call               syscall           0   12
13.2 The Instruction Execution Unit
  Fig. Abstract view of the instruction execution unit for MicroMIPS. For naming of instruction fields, see the instruction-format figure in Section 13.1.
13.3 A Single-Cycle Data Path
  Fig. Key elements of the single-cycle MicroMIPS data path.
An ALU for MicroMIPS
  Fig. A multifunction ALU with 8 control signals (2 for function class, 1 arithmetic, 3 shift, 2 logic) specifying the operation.
13.4 Branching and Jumping
  Update options for PC:
    (PC)31:2 + 1          Default option
    (PC)31:2 + 1 + imm    When the instruction is a branch and the condition is met
    (PC)31:28 | jta       When the instruction is j or jal
    (rs)31:2              When the instruction is jr
    SysCallAddr           Start address of an operating system routine
  Fig. Next-address logic for MicroMIPS (see top part of Fig. 13.3).
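A minimal sketch of the PC update options listed above may help. It assumes the PC is kept as a 30-bit word address (bits 31:2), that the default update is the incremented word address, and that the branch immediate is sign-extended; the function and parameter names are hypothetical.

```python
MASK30 = (1 << 30) - 1  # the PC is modeled as its upper 30 bits (a word address)

def sign_extend16(imm: int) -> int:
    """Interpret a 16-bit field as a signed offset."""
    return imm - (1 << 16) if imm & 0x8000 else imm

def next_pc(pc30, kind, *, imm=0, jta=0, rs=0, taken=False, syscall_addr=0):
    """Return the next 30-bit word address for one instruction (illustrative)."""
    incr = (pc30 + 1) & MASK30                       # default: next word
    if kind == "branch" and taken:
        return (incr + sign_extend16(imm)) & MASK30  # incremented PC plus offset
    if kind in ("j", "jal"):
        return ((pc30 >> 26) << 26) | (jta & 0x3FFFFFF)  # (PC)31:28 | jta
    if kind == "jr":
        return (rs >> 2) & MASK30                    # (rs)31:2
    if kind == "syscall":
        return syscall_addr & MASK30
    return incr
```

For example, `next_pc(100, "branch", imm=0xFFFF, taken=True)` returns 100, because the offset of -1 is applied to the already incremented address.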
13.5 Deriving the Control Signals
  Table. Control signals for the single-cycle MicroMIPS implementation.
  Block       Control signal         0            1          2           3
  Reg file    RegWrite               Don't write  Write
              RegDst1, RegDst0       rt           rd         $31
              RegInSrc1, RegInSrc0   Data out     ALU out    IncrPC
  ALU         ALUSrc                 (rt)         imm
              AddSub                 Add          Subtract
              LogicFn1, LogicFn0     AND          OR         XOR         NOR
              FnClass1, FnClass0     lui          Set less   Arithmetic  Logic
  Data cache  DataRead               Don't read   Read
              DataWrite              Don't write  Write
  Next addr   BrType1, BrType0       No branch    beq        bne         bltz
              PCSrc1, PCSrc0         IncrPC       jta        (rs)        SysCallAddr
Control Signal Settings
  Table 13.3
Instruction Decoding
  Fig. Instruction decoder for MicroMIPS built of two 6-to-64 decoders.
Control Signal Generation
  Auxiliary signals identifying instruction classes
    arithInst = addInst ∨ subInst ∨ sltInst ∨ addiInst ∨ sltiInst
    logicInst = andInst ∨ orInst ∨ xorInst ∨ norInst ∨ andiInst ∨ oriInst ∨ xoriInst
    immInst = luiInst ∨ addiInst ∨ sltiInst ∨ andiInst ∨ oriInst ∨ xoriInst
  Example logic expressions for control signals
    RegWrite = luiInst ∨ arithInst ∨ logicInst ∨ lwInst ∨ jalInst
    ALUSrc = immInst ∨ lwInst ∨ swInst
    AddSub = subInst ∨ sltInst ∨ sltiInst
    DataRead = lwInst
    PCSrc0 = jInst ∨ jalInst ∨ syscallInst
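The example expressions above are ORs of decoded instruction signals, which the following sketch evaluates in software. Representing the decoder output as a mnemonic string and the set and function names are assumptions made for illustration; only the five signals given on the slide are modeled.

```python
# Instruction classes, mirroring the auxiliary signals arithInst, logicInst, immInst
ARITH = {"add", "sub", "slt", "addi", "slti"}
LOGIC = {"and", "or", "xor", "nor", "andi", "ori", "xori"}
IMM = {"lui", "addi", "slti", "andi", "ori", "xori"}

def control_signals(inst: str) -> dict:
    """Evaluate the example control-signal expressions for one mnemonic."""
    return {
        "RegWrite": int(inst == "lui" or inst in ARITH or inst in LOGIC
                        or inst in ("lw", "jal")),
        "ALUSrc":   int(inst in IMM or inst in ("lw", "sw")),
        "AddSub":   int(inst in ("sub", "slt", "slti")),
        "DataRead": int(inst == "lw"),
        "PCSrc0":   int(inst in ("j", "jal", "syscall")),
    }
```

Because the single-cycle control unit is memoryless, each signal depends only on the current instruction, which is why a pure function of the mnemonic suffices here.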
Putting It All Together
  Fig. Composite diagram combining Fig. 13.4, Fig. 13.3, and the control unit, with the decoded instruction signals (addInst, subInst, sltInst, ..., jInst) shown.
13.6 Performance of the Single-Cycle Design
  Step latencies:
    Instruction access   2 ns
    Register read        1 ns
    ALU operation        2 ns
    Data cache access    2 ns
    Register write       1 ns
    Total                8 ns
  Single-cycle clock = 125 MHz
  Instruction mix and latency per class:
    R-type   44%   6 ns
    Load     24%   8 ns
    Store    12%   7 ns
    Branch   18%   5 ns
    Jump      2%   3 ns
    Weighted mean  6.36 ns
  Fig. The MicroMIPS data path unfolded (by depicting the register write step as a separate block) so as to better visualize the critical-path latencies.
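The figures above can be checked with a few lines of arithmetic. This sketch uses only the instruction mix and per-class latencies from the slide; the variable names are illustrative.

```python
# (fraction of instructions, latency in ns) per instruction class, from the slide
MIX = {
    "R-type": (0.44, 6),
    "Load":   (0.24, 8),
    "Store":  (0.12, 7),
    "Branch": (0.18, 5),
    "Jump":   (0.02, 3),
}

# The single-cycle clock must accommodate the slowest instruction (the load)
worst_case_ns = max(latency for _, latency in MIX.values())
clock_mhz = 1000 / worst_case_ns

# Average latency if every instruction could take only as long as it needs
weighted_mean_ns = sum(frac * latency for frac, latency in MIX.values())
```

The gap between the 8 ns clock period and the 6.36 ns weighted mean is the time wasted by forcing every instruction to take as long as the slowest one.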
14 Control Unit Synthesis
  The control unit for the single-cycle design is memoryless
    This is problematic when instructions vary greatly in complexity
    Multiple cycles are needed when resources must be reused
  Topics in This Chapter
    14.1 A Multicycle Implementation
    14.2 Choosing the Clock Cycle
    14.3 The Control State Machine
    14.4 Performance of the Multicycle Design
    14.5 Microprogramming
    14.6 Exception Handling
14.1 A Multicycle Implementation
  Fig. Single-cycle versus multicycle instruction execution.
13.3 A Single-Cycle Data Path (repeated for comparison)
  Fig. Key elements of the single-cycle MicroMIPS data path.
A Multicycle Data Path
  Fig. Abstract view of a multicycle instruction execution unit for MicroMIPS. For naming of instruction fields, see the instruction-format figure in Section 13.1.
Differences
  A single cache serves both instructions and data
  One ALU, which also performs the additions that compute the branch target address and increment the program counter
  The x, y, and z registers hold information between cycles
Cycles
  Fetch: the instruction is read from the cache into the instruction register (IR)
  Decode: includes reading rs and rt, even if those values are not needed
Jump and Branch Instructions
  j, jr, jal, syscall: terminate in the third cycle by writing the appropriate value into the PC
  beq, bne, bltz: check the branch condition and write to the PC during the third cycle
Other Instructions
  Complete during the fourth cycle, with one exception:
  lw requires a fifth cycle to write the data retrieved from the cache into a register
Multicycle Data Path with Control Signals Shown
  Fig. Key elements of the multicycle MicroMIPS data path.
14.2 Clock Cycle and Control Signals
  Table 14.1
  Block            Control signal       0            1            2           3
  Program counter  JumpAddr             jta          SysCallAddr
                   PCSrc1, PCSrc0       Jump addr    x reg        z reg       ALU out
                   PCWrite              Don't write  Write
  Cache            InstData             PC           z reg
                   MemRead              Don't read   Read
                   MemWrite             Don't write  Write
                   IRWrite              Don't write  Write
  Register file    RegWrite             Don't write  Write
                   RegDst1, RegDst0     rt           rd           $31
                   RegInSrc             Data reg     z reg
  ALU              ALUSrcX              PC           x reg
                   ALUSrcY1, ALUSrcY0   4            y reg        imm         4 × imm
                   AddSub               Add          Subtract
                   LogicFn1, LogicFn0   AND          OR           XOR         NOR
                   FnClass1, FnClass0   lui          Set less     Arithmetic  Logic
Execution Cycles
  Table. Execution cycles for multicycle MicroMIPS.
  Cycle 1: Fetch & PC incr
    Any instruction
      Operations: Read out the instruction and write it into the instruction register, increment PC
      Signal settings: InstData = 0, MemRead = 1, IRWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = '+', PCSrc = 3, PCWrite = 1
  Cycle 2: Decode & reg read
    Any instruction
      Operations: Read out rs & rt into x & y registers, compute branch address and save in z register
      Signal settings: ALUSrcX = 0, ALUSrcY = 3, ALUFunc = '+'
  Cycle 3: ALU oper & PC update
    ALU type
      Operations: Perform ALU operation and save the result in z register
      Signal settings: ALUSrcX = 1, ALUSrcY = 1 or 2, ALUFunc: varies
    Load/Store
      Operations: Add base and offset values, save in z register
      Signal settings: ALUSrcX = 1, ALUSrcY = 2, ALUFunc = '+'
    Branch
      Operations: If (x reg) =, ≠, or < (y reg), depending on the branch type, set PC to branch target address
      Signal settings: ALUSrcX = 1, ALUSrcY = 1, ALUFunc = '−', PCSrc = 2, PCWrite = ALUZero or ALUZero′ or ALUOut31
    Jump
      Operations: Set PC to the target address jta, SysCallAddr, or (rs)
      Signal settings: JumpAddr = 0 or 1, PCSrc = 0 or 1, PCWrite = 1
  Cycle 4: Reg write or mem access
    ALU type
      Operations: Write back z reg into rd
      Signal settings: RegDst = 1, RegInSrc = 1, RegWrite = 1
    Load
      Operations: Read memory into data reg
      Signal settings: InstData = 1, MemRead = 1
    Store
      Operations: Copy y reg into memory
      Signal settings: InstData = 1, MemWrite = 1
  Cycle 5: Reg write for lw
    Load
      Operations: Copy data register into rt
      Signal settings: RegDst = 0, RegInSrc = 0, RegWrite = 1
14.3 The Control State Machine
  Fig. The control state machine for multicycle MicroMIPS.
State and Instruction Decoding
  Fig. State and instruction decoders for multicycle MicroMIPS.
Control Signal Generation
  Certain control signals depend only on the control state
    ALUSrcX = ControlSt2 ∨ ControlSt5 ∨ ControlSt7
    RegWrite = ControlSt4 ∨ ControlSt8
  Auxiliary signals identifying instruction classes
    addsubInst = addInst ∨ subInst ∨ addiInst
    logicInst = andInst ∨ orInst ∨ xorInst ∨ norInst ∨ andiInst ∨ oriInst ∨ xoriInst
  Logic expressions for ALU control signals
    AddSub = ControlSt5 ∨ (ControlSt7 ∧ subInst)
    FnClass1 = ControlSt7′ ∨ addsubInst ∨ logicInst
    FnClass0 = ControlSt7 ∧ (logicInst ∨ sltInst ∨ sltiInst)
    LogicFn1 = ControlSt7 ∧ (xorInst ∨ xoriInst ∨ norInst)
    LogicFn0 = ControlSt7 ∧ (orInst ∨ oriInst ∨ norInst)
14.4 Performance of the Multicycle Design
  Instruction mix, cycle counts, and contribution to CPI:
    R-type   44%   4 cycles   0.44 × 4 = 1.76
    Load     24%   5 cycles   0.24 × 5 = 1.20
    Store    12%   4 cycles   0.12 × 4 = 0.48
    Branch   18%   3 cycles   0.18 × 3 = 0.54
    Jump      2%   3 cycles   0.02 × 3 = 0.06
    Average CPI ≈ 4.04
  Fig. The MicroMIPS data path unfolded (by depicting the register write step as a separate block) so as to better visualize the critical-path latencies.
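The average CPI above is a weighted sum, sketched below. The 500 MHz clock rate is the figure these slides quote for the multicycle design; the variable names are illustrative.

```python
# (fraction of instructions, cycles needed) per instruction class, from the slide
CYCLES = {
    "R-type": (0.44, 4),
    "Load":   (0.24, 5),
    "Store":  (0.12, 4),
    "Branch": (0.18, 3),
    "Jump":   (0.02, 3),
}

# Average cycles per instruction: sum of fraction x cycles over all classes
cpi = sum(frac * cycles for frac, cycles in CYCLES.values())

# Instruction rate in MIPS at the 500 MHz multicycle clock
CLOCK_MHZ = 500
mips = CLOCK_MHZ / cpi
```

With a CPI of about 4.04 the multicycle design delivers slightly under 125 MIPS, so the faster clock does not by itself outperform the single-cycle design for this mix.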
14.5 Microprogramming
  Fig. Possible 22-bit microinstruction format for MicroMIPS.
Control Unit for Microprogramming
  Fig. Microprogrammed control unit for MicroMIPS. The microprogram memory holds routines labeled fetch:, andi:, and so on, reached through a multiway branch.
Microinstruction Field Encodings
  PC control
    0001   PCjump
    1001   PCsyscall
    x011   PCjreg
    x101   PCbranch
    x111   PCnext
  Cache control
    0101   CacheFetch
    1010   CacheStore
    1100   CacheLoad
  Register control
    1000   rt ← Data
    1001   rt ← z
    1011   rd ← z
    1101   $31 ← PC
  ALU inputs
    000    PC, 4
    011    PC, 4 × imm
    101    x, y
    110    x, imm
  ALU function
    0xx10  +
    1xx01  <
    1xx10  −
    x0011  and
    x0111  or
    x1011  xor
    x1111  nor
    xxx00  lui
  Sequence control
    01     mPCdisp1
    10     mPCdisp2
    11     mPCfetch
Microprogram for MicroMIPS
  Fig. The complete MicroMIPS microprogram.
  fetch:     PCnext, CacheFetch, PC + 4     # State 0 (start)
             PC + 4 × imm, mPCdisp1         # State 1
  lui1:      lui(imm)                       # State 7lui
             rt ← z, mPCfetch               # State 8lui
  add1:      x + y                          # State 7add
             rd ← z, mPCfetch               # State 8add
  sub1:      x − y                          # State 7sub
             rd ← z, mPCfetch               # State 8sub
  slt1:      x − y                          # State 7slt
             rd ← z, mPCfetch               # State 8slt
  addi1:     x + imm                        # State 7addi
             rt ← z, mPCfetch               # State 8addi
  slti1:     x − imm                        # State 7slti
             rt ← z, mPCfetch               # State 8slti
  and1:      x ∧ y                          # State 7and
             rd ← z, mPCfetch               # State 8and
  or1:       x ∨ y                          # State 7or
             rd ← z, mPCfetch               # State 8or
  xor1:      x ⊕ y                          # State 7xor
             rd ← z, mPCfetch               # State 8xor
  nor1:      x nor y                        # State 7nor
             rd ← z, mPCfetch               # State 8nor
  andi1:     x ∧ imm                        # State 7andi
             rt ← z, mPCfetch               # State 8andi
  ori1:      x ∨ imm                        # State 7ori
             rt ← z, mPCfetch               # State 8ori
  xori1:     x ⊕ imm                        # State 7xori
             rt ← z, mPCfetch               # State 8xori
  lwsw1:     x + imm, mPCdisp2              # State 2
  lw2:       CacheLoad                      # State 3
             rt ← Data, mPCfetch            # State 4
  sw2:       CacheStore, mPCfetch           # State 6
  j1:        PCjump, mPCfetch               # State 5j
  jr1:       PCjreg, mPCfetch               # State 5jr
  branch1:   PCbranch, mPCfetch             # State 5branch
  jal1:      PCjump, $31 ← PC, mPCfetch     # State 5jal
  syscall1:  PCsyscall, mPCfetch            # State 5syscall
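The state numbers in the microprogram comments imply a fixed path through the control state machine for each instruction class. The sketch below tabulates those paths; grouping instructions into four classes and the names `STATE_SEQUENCE` and `cycles` are assumptions made for illustration.

```python
# Control states visited per instruction class, read off the microprogram comments:
# every instruction starts with state 0 (fetch) and state 1 (decode).
STATE_SEQUENCE = {
    "alu":            [0, 1, 7, 8],     # ALU operation, then register write
    "lw":             [0, 1, 2, 3, 4],  # address, cache load, register write
    "sw":             [0, 1, 2, 6],     # address, cache store
    "branch_or_jump": [0, 1, 5],        # PC update in the third cycle
}

def cycles(instruction_class: str) -> int:
    """Clock cycles needed, one per control state visited."""
    return len(STATE_SEQUENCE[instruction_class])
```

The resulting counts (4, 5, 4, and 3 cycles) agree with the per-class cycle counts used in the multicycle performance analysis of Section 14.4.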
14.6 Exception Handling
  Exceptions and interrupts alter the normal program flow
  Examples of exceptions (things that can go wrong):
    ALU operation leads to overflow (incorrect result is obtained)
    Opcode field holds a pattern not representing a legal operation
    Cache error-code checker deems an accessed word invalid
    Sensor signals a hazardous condition (e.g., overheating)
  The exception handler is an OS program that takes care of the problem
    Derives the correct result of an overflowing computation, if possible
    An invalid operation may be a software-implemented instruction
  Interrupts are similar, but usually have external causes (e.g., I/O)
Exception Control States
  Fig. Exception states 9 and 10 added to the control state machine.
15 Pipelined Data Paths
  Pipelining is now used in even the simplest of processors
    Same principles as assembly lines in manufacturing
    Unlike items on an assembly line, instructions are not independent of one another
  Topics in This Chapter
    15.1 Pipelining Concepts
    15.2 Pipeline Stalls or Bubbles
    15.3 Pipeline Timing and Performance
    15.4 Pipelined Data Path Design
    15.5 Pipelined Control
    15.6 Optimal Pipelining
Single-Cycle Data Path of Chapter 13
  Clock rate = 125 MHz, CPI = 1 (125 MIPS)
  Fig. Key elements of the single-cycle MicroMIPS data path.
Multicycle Data Path of Chapter 14
  Clock rate = 500 MHz, CPI ≈ 4 (≈ 125 MIPS)
  Fig. Key elements of the multicycle MicroMIPS data path.
Getting the Best of Both Worlds
  Single-cycle: Clock rate = 125 MHz, CPI = 1
  Multicycle: Clock rate = 500 MHz, CPI ≈ 4
  Pipelined: Clock rate = 500 MHz, CPI ≈ 1
15.1 Pipelining Concepts
  Strategies for improving performance:
    1. Use multiple independent data paths accepting several instructions that are read out at once: multiple-instruction-issue or superscalar
    2. Overlap execution of several instructions, starting the next instruction before the previous one has run to completion: (super)pipelined
  Fig. Pipelining in the student registration process.
Pipelined Instruction Execution
  Fig. Pipelining in the MicroMIPS instruction execution process.
Alternate Representations of a Pipeline
  Except for start-up and drainage overheads, a pipeline can execute one instruction per clock tick; IPS is dictated by the clock frequency
  Fig. Two abstract graphical representations of a 5-stage pipeline executing 7 tasks (instructions).
Pipelining Example in a Photocopier
  A photocopier with an x-sheet document feeder copies the first sheet in 4 s and each subsequent sheet in 1 s. The copier's paper path is a 4-stage pipeline with each stage having a latency of 1 s. The first sheet goes through all 4 pipeline stages and emerges after 4 s. Each subsequent sheet emerges 1 s after the previous sheet. How does the throughput of this photocopier vary with x, assuming that loading the document feeder and removing the copies takes 15 s?
  Solution
    Each batch of x sheets is copied in 15 + 4 + (x − 1) = 18 + x seconds.
    A nonpipelined copier would require 4x seconds to copy x sheets.
    For x > 6, the pipelined version has a performance edge.
    When x = 50, the pipelining speedup is (4 × 50) / (18 + 50) = 2.94.
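The photocopier solution can be reproduced with a small model. The timing values are the ones stated in the example; the function names and default parameters are illustrative, and, as in the example, the 15 s handling overhead is charged only to the pipelined copier.

```python
def pipelined_time(x, overhead=15, stages=4, stage_latency=1):
    """Seconds to copy x sheets: handling overhead, pipeline fill, then one sheet per tick."""
    return overhead + stages * stage_latency + (x - 1) * stage_latency

def nonpipelined_time(x, per_sheet=4):
    """Seconds to copy x sheets when each sheet takes the full 4 s."""
    return per_sheet * x

def speedup(x):
    """Pipelining speedup for a batch of x sheets."""
    return nonpipelined_time(x) / pipelined_time(x)

def throughput(x):
    """Sheets per second for the pipelined copier, including handling overhead."""
    return x / pipelined_time(x)
```

The throughput x / (18 + x) approaches one sheet per second as x grows, which is the pipeline's ideal rate once the fixed 18 s of overhead and fill time is amortized.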
15.2 Pipeline Stalls or Bubbles
  First type of data dependency
  Fig. Read-after-write data dependency and its possible resolution through data forwarding.
Second Type of Data Dependency
  Fig. Read-after-load data dependency and its possible resolution through bubble insertion and data forwarding.
Control Dependency in a Pipeline
  Fig. Control dependency due to conditional branch.
15.3 Pipeline Timing and Performance
  Fig. Pipelined form of a function unit with latching overhead.
Throughput Increase in a q-Stage Pipeline
  With function-unit latency t and per-stage latching overhead τ, the clock period is t/q + τ, so the throughput improves by a factor of
    t / (t/q + τ)   or   q / (1 + qτ/t)
  Fig. Throughput improvement due to pipelining as a function of the number of pipeline stages for different pipelining overheads.
Pipeline Throughput with Dependencies
  Assume that one bubble must be inserted due to a read-after-load dependency, and after a branch when its delay slot cannot be filled. Let b be the fraction of all instructions that are followed by a bubble.
    Pipeline speedup = q / [(1 + qτ/t)(1 + b)]
  where q is the number of stages, t the unpipelined latency, and τ the per-stage latching overhead.
  Instruction mix: R-type 44%, Load 24%, Store 12%, Branch 18%, Jump 2%
  Example 15.3
    Calculate the effective CPI for MicroMIPS, assuming that a quarter of branch and load instructions are followed by bubbles.
  Solution
    Fraction of bubbles b = 0.25(0.24 + 0.18) = 0.105
    CPI = 1 + b = 1.105 (which is very close to the ideal value of 1)
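Example 15.3 and the speedup expression can be evaluated directly. In this sketch `tau` stands for the per-stage latching overhead and `t` for the unpipelined latency; the function name is illustrative.

```python
def pipeline_speedup(q, t, tau, b=0.0):
    """Speedup of a q-stage pipeline with stage overhead tau and bubble fraction b."""
    return q / ((1 + q * tau / t) * (1 + b))

# Example 15.3: a quarter of loads (24%) and branches (18%) are followed by a bubble
LOAD_FRACTION = 0.24
BRANCH_FRACTION = 0.18
b = 0.25 * (LOAD_FRACTION + BRANCH_FRACTION)

# Each bubble costs one extra cycle, so the effective CPI is 1 + b
cpi = 1 + b
```

With no overhead and no bubbles the function returns the ideal speedup q; both the overhead term and the bubble term can only reduce it.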
15.4 Pipelined Data Path Design
  Fig. Key elements of the pipelined MicroMIPS data path.
15.5 Pipelined Control
  Fig. Pipelined control signals.
15.6 Optimal Pipelining
  MicroMIPS pipeline with more than four-fold improvement
  Fig. Higher-throughput pipelined data path for MicroMIPS and the execution of consecutive instructions in it.
Optimal Number of Pipeline Stages
  Fig. Pipelined form of a function unit with latching overhead.
  Assumptions:
    Pipeline sliced into q stages
    Stage overhead is τ
    q/2 bubbles per branch (decision made midway)
    Fraction b of all instructions are taken branches
  Derivation of q_opt
    Average CPI = 1 + bq/2
    Throughput = Clock rate / CPI = 1 / [(t/q + τ)(1 + bq/2)]
    Differentiate the throughput expression with respect to q and equate with 0:
    q_opt = [2(t/τ) / b]^(1/2)
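The closed-form optimum can be sanity-checked numerically. Maximizing throughput means minimizing (t/q + τ)(1 + bq/2) = t/q + tb/2 + τ + τbq/2, whose derivative −t/q² + τb/2 vanishes at q² = 2t/(τb). The sketch below encodes both expressions; the sample values of t, tau, and b are hypothetical and chosen only to exercise the formula.

```python
import math

def throughput(q, t, tau, b):
    """Instructions per unit time: clock rate 1/(t/q + tau) divided by CPI 1 + b*q/2."""
    return 1 / ((t / q + tau) * (1 + b * q / 2))

def q_opt(t, tau, b):
    """Stage count that maximizes throughput under the stated assumptions."""
    return math.sqrt(2 * (t / tau) / b)

# Hypothetical values: 8 ns total latency, 0.2 ns stage overhead, 10% taken branches
t, tau, b = 8.0, 0.2, 0.1
best_q = q_opt(t, tau, b)
```

In practice the stage count must be an integer and logic cannot be sliced arbitrarily, so the formula indicates a trend rather than an exact design point.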
16 Pipeline Performance Limits
  Pipeline performance is limited by data and control dependencies
    Hardware provisions: data forwarding, branch prediction
    Software remedies: delayed branch, instruction reordering
  Topics in This Chapter
    16.1 Data Dependencies and Hazards
    16.2 Data Forwarding
    16.3 Pipeline Branch Hazards
    16.4 Delayed Branch and Branch Prediction
    16.5 Dealing with Exceptions
    16.6 Advanced Pipelining
16.1 Data Dependencies and Hazards
  Fig. Data dependency in a pipeline.
Resolving Data Dependencies via Forwarding
  Fig. When a previous instruction writes back a value computed by the ALU into a register, the data dependency can always be resolved through forwarding.
Certain Data Dependencies Lead to Bubbles
  Fig. When the immediately preceding instruction writes a value read out from the data memory into a register, the data dependency cannot be resolved through forwarding (i.e., we cannot go back in time) and a bubble must be inserted in the pipeline.
16.2 Data Forwarding
  Fig. Forwarding unit for the pipelined MicroMIPS data path.
Design of the Data Forwarding Units
  Let's focus on designing the upper data forwarding unit
  Fig. Forwarding unit for the pipelined MicroMIPS data path.
  Table 16.1 Partial truth table for the upper forwarding unit in the pipelined MicroMIPS data path.
    Columns: RegWrite3, RegWrite4, s2matchesd3, s2matchesd4, RetAddr3, RegInSrc3, RegInSrc4, Choose
    Choices listed under Choose: x2, x4, y4, x3, y3
    Slide annotation on one entry: "Incorrect in textbook"
Hardware for Inserting Bubbles
  Fig. Data hazard detector for the pipelined MicroMIPS data path.
16.3 Pipeline Branch Hazards
  Software-based solutions
    Compiler inserts a "no-op" after every branch (simple, but wasteful)
    Branch is redefined to take effect after the instruction that follows it
    Branch delay slot(s) are filled with useful instructions via reordering
  Hardware-based solutions
    Mechanism similar to data hazard detector to flush the pipeline
    Constitutes a rudimentary form of branch prediction: always predict that the branch is not taken, flush if mistaken
    More elaborate branch prediction strategies are possible
16.4 Branch Prediction
  Predicting whether a branch will be taken:
    Always predict that the branch will not be taken
    Use program context to decide (backward branch is likely taken, forward branch is likely not taken)
    Allow programmer or compiler to supply clues
    Decide based on past history (maintain a small history table); to be discussed later
    Apply a combination of factors: modern processors use elaborate techniques due to deep pipelines
A Simple Branch Prediction Algorithm
  Fig. Four-state branch prediction scheme.
  Example 16.1 Impact of different branch prediction schemes
    L1:  ----
         ----
    L2:  ----
         br <c2> L2     (inner loop, 20 iterations)
         br <c1> L1     (outer loop, 10 iterations)
  Solution
    Always taken: 11 mispredictions, 94.8% accurate
    1-bit history: 20 mispredictions, 90.5% accurate
    2-bit history: Same as always taken
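The misprediction counts in Example 16.1 can be reproduced with a short simulation. This sketch makes several assumptions not stated on the slide: each branch has its own history entry, histories start in the "taken" state, and the four-state scheme is modeled as a 2-bit saturating counter, which is one common realization. All names are illustrative.

```python
def branch_outcomes(inner=20, outer=10):
    """Yield (branch_id, taken) for the nested loops of Example 16.1."""
    for i in range(outer):
        for j in range(inner):
            yield "inner", j < inner - 1   # loop back except on the last pass
        yield "outer", i < outer - 1

def mispredictions(scheme):
    """Count wrong predictions for 'always_taken', '1bit', or '2bit'."""
    state = {}   # per-branch history (assumed to start as 'taken')
    wrong = 0
    for branch, taken in branch_outcomes():
        if scheme == "always_taken":
            predicted = True
        elif scheme == "1bit":
            predicted = state.get(branch, True)   # repeat the last outcome
            state[branch] = taken
        else:  # "2bit": saturating counter 0..3, predict taken when >= 2
            counter = state.get(branch, 3)
            predicted = counter >= 2
            state[branch] = min(3, counter + 1) if taken else max(0, counter - 1)
        wrong += predicted != taken
    return wrong

TOTAL_BRANCHES = sum(1 for _ in branch_outcomes())
```

The 1-bit scheme mispredicts twice per inner-loop execution (once on exit and once on re-entry), whereas the 2-bit scheme tolerates the single not-taken outcome at loop exit and so matches the always-taken count.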
Hardware Implementation of Branch Prediction
  Fig. Hardware elements for a branch prediction scheme.
  The mapping scheme used to go from PC contents to a table entry is the same as that used in direct-mapped caches (Chapter 18)
16.5 Advanced Pipelining
  Deep pipeline = superpipeline; also, superpipelined, superpipelining
  Parallel instruction issue = superscalar, j-way issue (2-4 is typical)
  Fig. Dynamic instruction pipeline with in-order issue, possible out-of-order completion, and in-order retirement.
Performance Improvement for Deep Pipelines
  Hardware-based methods
    Look ahead past an instruction that will/may stall in the pipeline (out-of-order execution; requires in-order retirement)
    Issue multiple instructions (requires more ports on register file)
    Eliminate false data dependencies via register renaming
    Predict branch outcomes more accurately, or speculate
  Software-based method
    Pipeline-aware compilation
    Loop unrolling to reduce the number of branches
      Original loop:
        Loop: Compute with index i
              Increment i by 1
              Go to Loop if not done
      Unrolled loop:
        Loop: Compute with index i
              Compute with index i + 1
              Increment i by 2
              Go to Loop if not done
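The loop-unrolling transformation above can be mimicked in a high-level sketch. Here "compute with index i" is taken to be accumulating a sum, and the `branches` counter tallies one loop-closing branch per iteration; both choices, along with the function names, are assumptions for illustration only.

```python
def sum_rolled(data):
    """Original loop: one element and one loop-closing branch per iteration."""
    total, i, branches = 0, 0, 0
    while i < len(data):
        total += data[i]        # compute with index i
        i += 1                  # increment i by 1
        branches += 1           # go to Loop if not done
    return total, branches

def sum_unrolled_by_2(data):
    """Unrolled loop: two elements per iteration, so about half as many branches."""
    total, i, branches = 0, 0, 0
    while i + 1 < len(data):
        total += data[i]        # compute with index i
        total += data[i + 1]    # compute with index i + 1
        i += 2                  # increment i by 2
        branches += 1           # go to Loop if not done
    if i < len(data):           # leftover element when the length is odd
        total += data[i]
    return total, branches
```

Both versions compute the same result; the unrolled one executes roughly half as many loop-closing branches, at the cost of longer code and a cleanup step for odd lengths.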
CPI Variations with Architectural Features
  Table. Effect of processor architecture, branch prediction methods, and speculative execution on CPI.
  Architecture               Methods used in practice                       CPI
  Nonpipelined, multicycle   Strict in-order instruction issue and exec     5-10
  Nonpipelined, overlapped   In-order issue, with multiple function units   3-5
  Pipelined, static          In-order exec, simple branch prediction        2-3
  Superpipelined, dynamic    Out-of-order exec, adv branch prediction       1-2
  Superscalar                2- to 4-way issue, interlock & speculation     0.5-1
  Advanced superscalar       4- to 8-way issue, aggressive speculation
Development of Intel's Desktop/Laptop Micros
  In the beginning, there was the 8080; it led to the 80x86 = IA32 ISA
  Half a dozen or so pipeline stages:
    80286, 80386, 80486, Pentium (80586)
  A dozen or so pipeline stages, with out-of-order instruction execution:
    Pentium Pro, Pentium II, Pentium III, Celeron
  Two dozen or so pipeline stages:
    Pentium 4
  Each step reflects more advanced technology
  Instructions are broken into micro-ops, which are executed out-of-order but retired in-order
16.6 Dealing with Exceptions
  Exceptions present the same problems as branches
    How to handle instructions that are ahead in the pipeline? (let them run to completion and retirement of their results)
    What to do with instructions after the exception point? (flush them out so that they do not affect the state)
  Precise versus imprecise exceptions
    Precise exceptions hide the effects of pipelining and parallelism by forcing the same state as that of strict sequential execution (desirable, because exception handling is not complicated)
    Imprecise exceptions are messy, but lead to faster hardware (interrupt handler can clean up to offer precise exception)
Where Do We Go from Here?
  Memory Design: How to build a memory unit that responds in 1 clock
  Input and Output: Peripheral devices, I/O programming, interfacing, interrupts
  Higher Performance: Vector/array processing, parallel processing