John Kubiatowicz (http.cs.berkeley.edu/~kubitron)

CS152 Computer Architecture and Engineering
Lecture 17: Branch Prediction, Explicit Renaming, ILP
April 5, 2004
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
Lecture slides: http://inst.eecs.berkeley.edu/~cs152/

Review: Tomasulo Organization
[Datapath diagram: the FP Op Queue and FP Registers feed the Reservation Stations (Add1-Add3, Mult1-Mult2); Load Buffers (Load1-Load6) come from memory and Store Buffers go to memory; the FP adders and FP multipliers broadcast results on the Common Data Bus (CDB). The integer unit executes in parallel; RAW memory conflicts are resolved via the addresses in the memory buffers.]
4/05/04 ©UCB Spring 2004

Review: Three Stages of the Tomasulo Algorithm
1. Issue: get the instruction from the FP Op Queue. If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renaming registers).
2. Execution (EX): operate on the operands. When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result (WB): finish execution. Write on the Common Data Bus to all awaiting units; mark the reservation station available.
A normal data bus carries data + destination (a "go to" bus). The common data bus carries data + source (a "come from" bus): 64 bits of data + 4 bits of functional-unit source address. A waiting unit captures the value if the source matches the functional unit it expects; the producing functional unit does the broadcast.
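The write-result step can be sketched in code: every reservation station snoops the Common Data Bus, and any station waiting on the broadcasting unit's tag captures the value. This is a minimal illustrative sketch, not the slide's exact hardware; the station fields (Vj/Vk for values, Qj/Qk for pending tags) and the station names are assumptions:

```python
# Hedged sketch of a CDB broadcast, assuming each reservation station
# tracks operand values (vj, vk) and pending source tags (qj, qk).
class RS:
    def __init__(self, op, qj=None, qk=None, vj=None, vk=None):
        self.op, self.qj, self.qk, self.vj, self.vk = op, qj, qk, vj, vk

    def ready(self):
        # ready to execute once no operand is still pending on the bus
        return self.qj is None and self.qk is None

def cdb_broadcast(stations, src_tag, value):
    """Everyone snoops the bus; stations waiting on src_tag capture value."""
    for rs in stations.values():
        if rs.qj == src_tag:
            rs.vj, rs.qj = value, None
        if rs.qk == src_tag:
            rs.vk, rs.qk = value, None

stations = {"Add1": RS("ADDD", qj="Mult1", vk=2.0),
            "Add2": RS("ADDD", qj="Mult1", qk="Add1")}
cdb_broadcast(stations, "Mult1", 10.0)
assert stations["Add1"].ready()        # both operands now present
assert stations["Add2"].qk == "Add1"   # still waiting on Add1's result
```

Note that the broadcast is a "come from" bus exactly as described: the consumers match on the source tag, not on a destination address.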

Review: Tomasulo Architecture
Reservation stations provide renaming to a larger set of registers plus buffering of source operands:
 - Prevents the registers from becoming the bottleneck
 - Avoids the WAR and WAW hazards of the scoreboard
 - Not limited to basic blocks: the integer unit gets ahead, beyond branches
Dynamic scheduling (scoreboarding/Tomasulo): in-order issue, out-of-order execution, out-of-order commit.
Tomasulo can unroll loops dynamically in hardware! It needs renaming (different physical names for different iterations) and fast branch computation.

Review: Tomasulo With Reorder Buffer (ROB)
[Diagram: the reorder buffer holds, oldest to newest, LD F0,10(R2); ADDD F10,F4,F0; DIVD F2,F10,F6; BNE F2,<…>; LD F4,0(R3); ADDD F0,F4,F6; ST 0(R3),F0 in entries ROB1-ROB7, each with a destination, a value, and a done bit (the LD F4 is done with value M[10]; the BNE is not). Reservation stations hold 2 ADDD R(F4),ROB1 and 3 DIVD ROB2,R(F6); load buffer entry 1 holds address 10+R2. Results flow through the ROB to the registers and to memory.]

Review: Four Steps of the Speculative Tomasulo Algorithm
1. Issue: get the instruction from the FP Op Queue. If a reservation station and a reorder-buffer slot are free, issue the instruction and send the operands and the reorder-buffer number for the destination (this stage is sometimes called "dispatch").
2. Execution (EX): operate on the operands. When both operands are in the reservation station, execute; if not ready, watch the CDB for the result. This checks RAW hazards (sometimes called "issue").
3. Write result (WB): finish execution. Write on the Common Data Bus to all awaiting FUs and to the reorder buffer.
4. Commit: update the register with the reorder-buffer (ROB) result. When the instruction at the head of the reorder buffer has its result, update the register with the result (or store to memory) and remove the instruction from the reorder buffer.
 - Stores commit to memory only when they reach the head of the ROB
 - Values overwrite registers only when they reach the head
 - A mispredicted branch or an interrupt flushes the reorder buffer
NOTES: in-order issue, out-of-order execution, in-order commit. The contents of the reorder buffer can always be thrown out (running ops must be cancelled). The precise exception point is the instruction at the head of the buffer.
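The commit step can be sketched as a loop that retires only from the head of the ROB, in program order. This is a simplified sketch: the entry layout (destination, value, done bit) and the register dictionary are assumptions, and stores are only hinted at:

```python
from collections import deque

# Each ROB entry: (dest_reg, value, done). Only the head may update state,
# which is what makes commit in-order and the head a precise exception point.
def commit(rob, regs):
    while rob and rob[0][2]:          # head instruction has its result
        dest, value, _ = rob.popleft()
        if dest is not None:          # a store would write memory here instead
            regs[dest] = value

rob = deque([("F0", 1.5, True), ("F4", 2.5, True), ("F2", None, False)])
regs = {}
commit(rob, regs)
assert regs == {"F0": 1.5, "F4": 2.5}
assert len(rob) == 1                  # F2 is not done: it stays and blocks commit
```

Flushing on a mispredict is then just `rob.clear()` plus cancelling in-flight operations, since nothing speculative has touched `regs` yet.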

Tomasulo With Reorder Buffer: Memory Disambiguation
What about memory hazards?
[Diagram: the ROB now holds, oldest to newest, LD F0,10(R2); ADDD F10,F4,F0; DIVD F2,F10,F6; BNE F2,<…>; LD F4,0(R3) (done); ADDD F0,F4,F6 (done, value M[10]); ST 0(R3),F4. The store and the loads reference memory, raising the question of RAW hazards through memory.]

Memory Disambiguation: Handling RAW Hazards in Memory
Question: given a load that follows a store in program order, are the two related? (Alternatively: is there a RAW hazard between the store and the load?) For example:
 st 0(R2),R5
 ld R6,0(R3)
Can we go ahead and start the load early? The store address could be delayed for a long time by some calculation that leads to R2 (a divide?), and we might want to issue/begin execution of both operations in the same cycle. Two techniques:
 - No speculation: we are not allowed to start the load until we know for sure that address 0(R2) ≠ 0(R3).
 - Speculation: we guess whether or not they are dependent (called "dependence speculation") and use the reorder buffer to fix up if we are wrong.

Hardware Support for Memory Disambiguation
We need a buffer (a store queue) that tracks all outstanding stores to memory, in program order: record each store's address (when it becomes available) and value (when it becomes available). FIFO ordering: stores retire from this buffer in program order.
When issuing a load, record the current head of the store queue (so we know which stores are ahead of us). When the load's address is available, check the store queue:
 - If any store prior to the load is still waiting for its address: if not speculating, stall the load; if speculating, send the request to memory (predict no dependence).
 - If the load address matches an earlier store address (associative lookup), we have a memory-induced RAW hazard: if the store value is available, return the value; if not, return the ROB number of its source.
 - Otherwise, send the request out to memory.
Actual stores commit in order, so there is no worry about WAR/WAW hazards through memory.
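The store-queue check might look like the following sketch. This is the non-speculative policy only, and the queue layout (a list of (address, value) pairs in program order, with None for not-yet-available fields) is an assumption:

```python
# Each store, in program order: (address or None if unknown, value or None).
def load_check(store_queue, load_addr):
    """Decide what a load may do under the no-speculation policy above."""
    for addr, value in reversed(store_queue):   # youngest first: closest match wins
        if addr is None:
            return "stall"                       # a prior store's address is unknown
        if addr == load_addr:                    # memory-induced RAW hazard
            return value if value is not None else "wait-for-source"
    return "go-to-memory"                        # no conflict: access memory

assert load_check([(100, 5), (200, 7)], 200) == 7         # forward the store value
assert load_check([(None, 5), (200, 7)], 300) == "stall"  # unknown address ahead
assert load_check([(100, 5)], 300) == "go-to-memory"
```

Scanning youngest-first means a younger matching store correctly shadows older ones; the speculative variant would replace the "stall" outcome with a memory request plus a later ROB fix-up.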

Memory Disambiguation Example
[Diagram: the ROB holds, oldest to newest, ST 0(R3),F4 (done, value <val 1>); LD F0,32(R2); ST 10(R3),F5; LD F4,10(R3). The youngest load must wait for (or forward from) the store to 10(R3) ahead of it. Load buffers hold 2 32+R2 and 4 ROB3.]

Review: Independent "Fetch" Unit
A stream of instructions to execute flows from an instruction-fetch unit with branch prediction to an out-of-order execution unit, which feeds back correctness information on branch results. Instruction fetch is decoupled from execution; issue logic (+ rename) is often included with fetch.

Branches Must Be Resolved Quickly
In our loop-unrolling example, we relied on the fact that branches were under the control of the "fast" integer unit in order to get overlap:
 Loop: LD F0,0(R1)
  MULTD F4,F0,F2
  SD F4,0(R1)
  SUBI R1,R1,#8
  BNEZ R1,Loop
What happens if the branch depends on the result of the MULTD? We completely lose all of our advantage! We need to be able to "predict" the branch outcome; if we predicted the branch taken, we would be right most of the time. The problem is much worse for superscalar machines!

Handling Some Branches: Conditional Instructions
Avoid branch prediction by turning branches into conditionally executed instructions:
 if (x) then A = B op C else NOP
If the condition is false, neither store the result nor cause an exception. The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction; EPIC has 64 1-bit condition fields that select conditional execution.
Drawbacks of conditional instructions:
 - Still take a clock cycle even if "annulled"
 - Stall if the condition is evaluated late
 - Complex conditions reduce effectiveness, since the condition becomes known late in the pipeline
 - Cannot eliminate loop branches!

Prediction: Branches, Dependencies, Data
Prediction has become essential to getting good performance from scalar instruction streams. We will discuss predicting branches; however, architects are now predicting everything: data dependencies, actual data, and results of groups of instructions. At what point does computation become a probabilistic operation + verification? With control hazards we are pretty close already.
Why does prediction work?
 - The underlying algorithm has regularities
 - The data being operated on has regularities
 - The instruction sequence has redundancies that are artifacts of the way humans and compilers think about problems
Prediction → compressible information streams?

Dynamic Branch Prediction
Prediction can be "static" (at compile time) or "dynamic" (at runtime). For our example, if we statically predicted "taken", we would be wrong only once each pass through the loop. Static information is passed through bits in the opcode.
Is dynamic branch prediction better than static branch prediction? It seems to be, though there is still some debate. Today, lots of hardware is devoted to dynamic branch predictors.
Does branch prediction make sense for a 5-stage, in-order pipeline? What about an 8-stage pipeline? Perhaps: eliminate branch delay slots, then predict branches.

Simple Dynamic Prediction: Branch Target Buffer (BTB)
Use the PC as the branch index to get both a prediction AND the branch target address:
 - Must check that the entry matches this PC, since we can't use the wrong branch's address
 - Grab the predicted PC from the table, since the target may take several cycles to compute
 - Update the predicted PC when the branch is actually resolved
We can predict the branch while still in the Fetch stage, before we even know it is a branch! (The BTB caches previous decoding of the instruction.)
[Diagram: in FETCH, the PC of the instruction is compared (=?) against the Branch PC tags in the table; on a match, the Predicted PC is used and the entry predicts taken or untaken.]
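A BTB can be sketched as a small tagged table consulted at fetch time. This toy version uses a Python dict (the key match plays the role of the tag compare) and assumes 4-byte instructions; both are illustrative assumptions:

```python
btb = {}   # branch PC -> predicted target PC (dict key match = tag check)

def fetch_predict(pc):
    """At fetch, before decoding: a hit predicts taken with a cached target."""
    return btb.get(pc, pc + 4)          # miss: predict fall-through

def resolve(pc, taken, target):
    """When the branch actually resolves, update (or remove) the entry."""
    if taken:
        btb[pc] = target
    else:
        btb.pop(pc, None)

assert fetch_predict(0x40) == 0x44      # cold: fall through
resolve(0x40, taken=True, target=0x10)
assert fetch_predict(0x40) == 0x10      # now predicted taken, target cached
```

The point of caching the target is exactly the one made above: the predicted PC is available in the same cycle as the lookup, before the instruction is even decoded.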

Branch History Table (BHT)
The BHT is a table of "predictors" (Predictor 0, Predictor 1, …, Predictor 7 in the diagram):
 - Usually 2-bit saturating counters
 - Indexed by the PC address of the branch, without tags
In the Fetch stage of a branch: the BTB identifies the branch, and the predictor from the BHT is used to make the prediction. When the branch completes, the corresponding predictor is updated.

Dynamic Branch Prediction (standard techniques)
Combine a Branch Target Buffer and History Tables:
 - Branch Target Buffer (BTB): identifies branches and holds taken-branch target addresses. Trick: identify the branch before fetching the instruction! Must be careful not to misidentify branches or destinations.
 - Branch History Table (BHT): makes the prediction. Can use complex prediction mechanisms with long history. No address check: can be good, can be bad (aliasing).
A simple 1-bit BHT keeps the last direction of the branch. Problem: in a loop, a 1-bit BHT causes two mispredictions per pass (the average loop runs 9 iterations before exit): once at the end of the loop, when it exits instead of looping as before, and once the first time through the loop on the next pass, when it predicts exit instead of looping.
Performance = f(accuracy, cost of misprediction). Misprediction → flush the reorder buffer.
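The two-mispredictions-per-pass behavior of a 1-bit predictor can be demonstrated with a toy simulation (not hardware, just the counting argument made above):

```python
def run_1bit(outcomes, last=False):
    """1-bit BHT: predict the last outcome; return the mispredict count."""
    mispredicts = 0
    for taken in outcomes:
        if last != taken:
            mispredicts += 1
        last = taken                     # remember only the last direction
    return mispredicts

one_pass = [True] * 9 + [False]          # a 9-iteration loop, then the exit
# Steady state: the previous exit left the bit at "not taken", so both
# the loop re-entry and the next exit mispredict.
assert run_1bit(one_pass, last=False) == 2
```

With 9 iterations per pass that is 2/10 = 20% mispredicted, even though the branch is taken 90% of the time.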

Dynamic Branch Prediction: 2-bit Scheme
Solution: a 2-bit scheme that changes the prediction only after mispredicting twice (Figure 4.13, p. 264). Red states: stop, predict not taken; green states: go, predict taken. This adds hysteresis to the decision-making process.
[State diagram: two "Predict Taken" states and two "Predict Not Taken" states; a taken (T) outcome moves toward strongly-taken, a not-taken (NT) outcome moves toward strongly-not-taken, and only two consecutive mispredictions flip the prediction.]
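The state machine above is a 2-bit saturating counter. As a sketch (the 0-3 encoding is an assumption, with 2-3 meaning "predict taken"):

```python
def run_2bit(outcomes, ctr=3):
    """2-bit saturating counter; returns (final counter, mispredict count)."""
    mispredicts = 0
    for taken in outcomes:
        if (ctr >= 2) != taken:          # states 2 and 3 predict taken
            mispredicts += 1
        # move one step toward the outcome, saturating at 0 and 3: hysteresis
        ctr = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
    return ctr, mispredicts

# The same loop that cost the 1-bit scheme two mispredictions per pass:
ctr, mis = run_2bit([True] * 9 + [False], ctr=3)
assert mis == 1                          # only the loop exit mispredicts
assert ctr == 2                          # still predicts taken for re-entry
```

Because one not-taken outcome only moves the counter from 3 to 2, the loop re-entry on the next pass is still predicted correctly, halving the misprediction rate on this pattern.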

2-bit BHT Accuracy
Mispredictions occur because either:
 - The guess for that branch was wrong, or
 - The table was indexed with the history of the wrong branch
For a 4096-entry table, the misprediction rate varies from 1% (nasa7, tomcatv) to 9% (spice), 12% (gcc), and 18% (eqntott). 4096 entries is about as good as an infinite table (measured on the Alpha 21164).

Correlating Branches
Hypothesis: the behavior of recently executed branches affects the prediction of the current branch. Two possibilities; the current branch depends on:
 - The last m most recently executed branches anywhere in the program: produces a "GA" (global, adaptive) predictor in the Yeh and Patt classification (e.g. GAg)
 - The last m most recent outcomes of the same branch: produces a "PA" (per-address, adaptive) predictor in the same classification (e.g. PAg)
Idea: record the last m branches as taken or not taken, and use that pattern to select the proper branch-history-table entry:
 - A single history table shared by all branches (appends a "g" to the name) is indexed by the history value
 - If the branch address is used along with the history to select the table entry, a "p" is appended
 - If only a portion of the address is used, an "s" indicates set-indexed tables (e.g. GAs)

Correlating Branches: (2,2) GAs Predictor
For instance, consider a global-history, set-indexed BHT: that gives us a GAs history table. In a (2,2) GAs predictor, the first 2 means we keep two bits of history; the second 2 means each slot holds a 2-bit counter. The behavior of recent branches then selects among, say, four predictions of the next branch, updating just that prediction.
Note that the original two-bit-counter solution would be a (0,2) GAs predictor. Note also that aliasing is possible here.
[Diagram: the branch address selects a row of four 2-bit counters; the 2-bit global branch history register selects which counter supplies the prediction.]
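A toy (2,2) GAs-style predictor can be sketched as follows: two bits of global history select one of four 2-bit counters in a set-indexed slot. The table size, slot indexing, and initial counter values are all assumptions for illustration:

```python
NSLOTS = 16                                    # set index = low PC bits
bht = [[1] * 4 for _ in range(NSLOTS)]         # 4 counters/slot, weakly not-taken
history = 0                                    # 2-bit global history register

def predict(pc):
    return bht[pc % NSLOTS][history] >= 2      # history picks the counter

def update(pc, taken):
    global history
    c = bht[pc % NSLOTS][history]
    bht[pc % NSLOTS][history] = min(c + 1, 3) if taken else max(c - 1, 0)
    history = ((history << 1) | taken) & 0b11  # shift in the new outcome

def run(pattern, reps):
    mis = 0
    for _ in range(reps):
        for taken in pattern:
            if predict(0) != taken:
                mis += 1
            update(0, taken)
    return mis

run([True, True, False], 8)                    # warm up on a period-3 pattern
assert run([True, True, False], 4) == 0        # history disambiguates it
```

A single 2-bit counter cannot learn this taken-taken-not-taken pattern perfectly (it mispredicts every third outcome), which is exactly the gain correlation buys.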

Accuracy of Different Schemes
[Chart: frequency of mispredictions, 0% to 18%, across SPEC benchmarks for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT; the small correlating predictor generally wins.]

Administrivia
Lab 4: due tonight.
 - The lab report must be submitted by midnight.
 - You must demo a working system to your TA sometime today (work it out with them). Your TA will run code that prints out the contents of memory, using a series of break instructions plus address/contents through memory-mapped I/O. You will need to demo single-stepping to your TA as well.
Lab 5: out very soon (tonight/tomorrow).
 - Building a DRAM memory controller/cache controller. Need to get the stalls right!
 - Design document: due Wednesday by 9pm.
 - Problem 0: also due Wednesday, to the TAs by email. This is very important: points may be taken off if you do not submit.
Midterm II: Wednesday 5/5.

Explicit Register Renaming
Make use of a physical register file that is larger than the number of registers specified by the ISA. Key insight: allocate a new physical destination register for every instruction that writes.
 - Very similar to a compiler transformation called Static Single Assignment (SSA) form, but in hardware!
 - Removes all chance of WAR or WAW hazards
 - Like Tomasulo, good for allowing full out-of-order completion
 - Like hardware-based dynamic compilation?
Mechanism? Keep a translation table with the ISA register → physical register mapping. When a register is written, replace its entry with a new register from the freelist. A physical register becomes free when it is no longer used by any active instruction.

Advantages of Explicit Renaming
Decouples renaming from scheduling:
 - The pipeline can be exactly like the "standard" DLX pipeline (perhaps with multiple operations issued per cycle)
 - Or the pipeline could be Tomasulo-like, a scoreboard, etc.
 - Standard forwarding or bypassing can be used
Allows data to be fetched from a single register file:
 - No need to bypass values from the reorder buffer
 - This can be important for balancing the pipeline
Many processors use a variant of this technique: R10000, Alpha 21264, HP PA8000.
Another way to get precise interrupt points:
 - All that needs to be "undone" for a precise break point is the table mappings
 - Provides an interesting mix between reorder buffer and future file: results are written immediately back to the register file, and register names are "freed" in program order (by the ROB)

Can We Use Explicit Register Renaming With a Scoreboard?
[Diagram: a rename table sits in front of the registers; the scoreboard controls the functional units (Integer, Memory, FP Add, FP Mult, FP Divide) as before.]

Stages of Scoreboard Control With Explicit Renaming
1. Issue: decode instructions, check for structural hazards, and allocate a new physical register for the result. Instructions issue in program order (for hazard checking); don't issue if there are no free physical registers, and don't issue on a structural hazard.
2. Read operands: wait until there are no hazards, then read operands. All true dependencies (RAW hazards) are resolved in this stage, since we wait for instructions to write back their data.
3. Execution: operate on operands. The functional unit begins execution upon receiving operands; when the result is ready, it notifies the scoreboard.
4. Write result: finish execution.
Note: no checks for WAR or WAW hazards!

Scoreboard With Explicit Renaming
[Diagram: the initialized rename table.]

Renamed Scoreboard 1: each instruction allocates a free register, similar to the single-assignment compiler transformation.

Renamed Scoreboard 2

Renamed Scoreboard 3

Renamed Scoreboard 4

Renamed Scoreboard 5

Renamed Scoreboard 6

Renamed Scoreboard 7

Renamed Scoreboard 8

Renamed Scoreboard 9

Renamed Scoreboard 10: the WAR hazard is gone! Notice that P32 is no longer listed in the rename table, but it is still live and must not be reallocated by accident.

Renamed Scoreboard 11

Renamed Scoreboard 12

Renamed Scoreboard 13

Renamed Scoreboard 14

Renamed Scoreboard 15

Renamed Scoreboard 16

Renamed Scoreboard 17

Renamed Scoreboard 18

Explicit Renaming Support Includes:
 - Rapid access to a table of translations
 - A physical register file that has more registers than specified by the ISA
 - The ability to figure out which physical registers are free; no free registers → stall on issue
Thus, register renaming doesn't require reservation stations. However, many modern architectures use explicit register renaming plus Tomasulo-like reservation stations to control execution.
Two questions: How do we manage the "free list"? How does explicit register renaming mix with precise interrupts?

Explicit Register Renaming (MIPS R10000 Style)
The physical register file is larger than the ISA register file. On issue, each instruction that modifies a register is allocated a new physical register from the freelist.
[Diagram: the current map table maps the ISA registers (F0, F2, …) to physical registers P0-P30; the freelist holds P32, P34, P36, P38, …, P60, P62.]

Explicit Register Renaming (MIPS R10000 Style), continued
[Diagram: LD F0,10(R2) has issued and been renamed to LD P32,10(R2); the map table now maps F0 to P32, and P34 is at the head of the freelist.]
Note that physical register P0 is "dead" (not "live") past the point of this load: when we go to commit the load, we free up P0.

Explicit Register Renaming (MIPS R10000 Style), continued
[Diagram: ADDD F10,F4,F0 has issued as ADDD P34,P4,P32; the map table now maps F10 to P34, and the freelist head advances to P36.]

Explicit Register Renaming (MIPS R10000 Style), continued
[Diagram: the active instructions are now, oldest to newest, LD P32,10(R2); ADDD P34,P4,P32; DIVD P36,P34,P6; BNE P36,<…>. At the BNE instruction, the map table and the freelist head are checkpointed so the branch can be undone.]

Explicit Register Renaming (MIPS R10000 Style), continued
[Diagram: instructions past the branch — LD P38,0(R3); ADDD P40,P38,P6; ST 0(R3),P40 — have executed (done) even though the BNE has not yet resolved. The earlier LD P32 and ADDD P34 have completed, while the DIVD P36 is still executing. The checkpoint taken at the BNE still holds the old map table and freelist.]

Explicit Register Renaming (MIPS R10000 Style): Misprediction Recovery
Speculation is fixed by restoring the map table and the head of the freelist from the checkpoint taken at the BNE instruction.
[Diagram: the instructions past the branch have been squashed; only LD P32,10(R2) (done), ADDD P34,P4,P32 (done), and DIVD P36,P34,P6 remain.]
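The branch-checkpoint recovery shown in these slides can be sketched as saving and then restoring the map table and freelist; the register names and list contents here are illustrative:

```python
rename = {"F0": "P0", "F2": "P2"}        # current map table
free = ["P32", "P33", "P34"]             # head of the freelist

checkpoint = (dict(rename), list(free))  # snapshot taken at the BNE

rename["F0"] = free.pop(0)               # speculative LD   renames F0 -> P32
rename["F2"] = free.pop(0)               # speculative DIVD renames F2 -> P33

rename, free = checkpoint                # mispredict: restore in one step
assert rename == {"F0": "P0", "F2": "P2"}
assert free == ["P32", "P33", "P34"]     # speculative registers back on list
```

The whole speculative path is undone by swapping two pointers (table and freelist head); no registers need to be copied back, which is why this recovery is fast.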

Limits to Multi-Issue Machines
Multi-issue itself is largely a matter of accounting: we must do dataflow analysis across multiple instructions simultaneously, and the rename table must be updated as if the instructions had happened serially!
Inherent limitations of ILP:
 - 1 branch in 5 instructions: how do we keep a 5-way superscalar busy?
 - Latencies of units: many operations must be scheduled. We need about (pipeline depth × number of functional units) independent instructions to keep fully busy.
 - Must increase ports to the register file: the VLIW example needs 7 read and 3 write ports for the integer registers, and 5 read and 3 write ports for the FP registers.
 - Must increase ports to memory.
Current state of the art: many hardware structures (such as issue/rename logic) have delay proportional to the square of the number of instructions issued per cycle.

Limits to ILP
Studies of the amount of ILP conflict, depending on the benchmarks (vectorized Fortran FP vs. integer C programs), hardware sophistication, and compiler sophistication. Initial HW model here: MIPS compilers.
Assumptions for an ideal/perfect machine to start:
1. Register renaming: infinite virtual registers, so all WAW & WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted ⇒ a machine with perfect speculation and an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses are known, and a store can be moved before a load provided the addresses are not equal
Also: one-cycle latency for all instructions, and an unlimited number of instructions issued per clock cycle.

Upper Limit to ILP: Ideal Machine
[Chart: IPC on the ideal machine — FP programs achieve 75-150, integer programs 18-60.]

More Realistic HW: Branch Impact
Change from an infinite instruction window to a 2000-entry window and a maximum issue of 64 instructions per clock cycle.
[Chart: IPC with perfect prediction, the better of a correlating predictor or BHT, a 512-entry BHT, profile-based prediction, and no prediction — FP drops to 15-45, integer to 6-12.]

More Realistic HW: Register Impact (rename registers)
2000-instruction window, 64-instruction issue, 8K two-level prediction.
[Chart: IPC with infinite, 256, 128, 64, 32, and no extra rename registers — FP 11-45, integer 5-15.]

Realistic HW for the '9X: Window Impact
Perfect disambiguation (HW), 1K selective prediction, a 16-entry return stack, 64 registers, issuing as many instructions as the window allows.
[Chart: IPC vs. window size (infinite, 256, 128, 64, 32, 16, 8, 4) — FP 8-45, integer 6-12.]

Summary
Modern computer architects predict everything: branches, data dependencies, even the data itself!
 - Fairly simple hardware structures can do a good job of predicting branches: the Branch Target Buffer (BTB) identifies branches and branch offsets, and the Branch History Table (BHT) does the prediction
 - More sophisticated prediction uses correlation: different branches depend on one another!
Explicit renaming: more physical registers than the ISA specifies.
 - Separates renaming from scheduling
 - Opens up lots of options for resolving RAW hazards
 - The rename table tracks the current association between architectural registers and physical registers
 - Rename-table management is potentially complicated
Parallelism is hard to get from real hardware.