Module: Speculative Execution

ECE 4100/6100 Spring 2005 © Krishna V. Palem, Weng Fai Wong, and Sudhakar Yalamanchili, Georgia Institute of Technology (slides contributed by Prof. Weng Fai Wong were prepared while visiting, and employed by, Georgia Tech)

Reading for This Module
- Speculative Execution: Section 3.7
  - The Reorder Buffer and Register Renaming
- Multithreading: Section 6.9
- Additional Reading: Section 3.10, Section 4.5 (pp )
  - Hyperthreading
  - P4 Microarchitecture

Speculation
- Speculative execution is the execution of instructions before it is known whether it is safe to do so
- Relies on branch prediction to get the branch direction right in most cases

Speculation vs. Prediction
[Figure: pipeline with IF, ID, parallel execution units (EX INT, EX FP, EX BR), MEM, and WB. Prediction keeps the instruction pipeline full; speculation keeps the out-of-order execution core full while maintaining correctness of out-of-order execution.]
- Prediction is targeted at instruction fetch
- Prediction is decoupled from the decision to execute fetched instructions
- Prediction helps boost the issue rate
- Speculation refers to the execution of predicted instructions

Speculation
- Hardware-based speculation, as an extension of dynamic scheduling, is composed of
  - Branch prediction: to select instructions to be speculatively executed
  - Dynamic scheduling: what we have seen so far
  - Execution
  - Commitment: update machine state
  - Exception handling
- Challenges
  - Handling multiple execution completions per cycle
  - Enforcing dependencies to ensure correctness
  - Handling exceptions

The Reorder Buffer

Principle
[Figure: processor datapath: I-Fetch, Execution Core, Retire]
- Basic block sizes are not very large
- Prediction can increase the issue rate but not the completion rate
  - Boosting the issue rate by itself is insufficient
  - The completion rate has to be increased to keep up with the issue rate
- Need speculative execution
- Key idea: separate instruction execution from instruction commitment
  - Compute on a need-to-know basis until the speculation outcome is determined

Issues
- What is commitment?
  - Updating the register file!
  - Permanent update to the machine state
- What should be the criteria?
  - Commitment is performed in program order
- How to enforce the criteria?
  - Reorder the instructions that complete out of order: the Reorder Buffer

The Reorder Buffer
- Initially proposed to support precise interrupts
- Handles output and anti-dependences
  - Another form of register renaming
- Does not take care of flow dependences
- A FIFO circular queue

Three Simple Steps
1. Every instruction gets a reorder buffer entry allocated in order as it is issued; the entry is marked "invalid"
2. When an instruction completes, it writes its result to the corresponding entry in the reorder buffer; the entry is now valid
3. When the entry at the head of the reorder buffer is "valid", it is committed to the register file
[Figure: Tomasulo-style FP unit extended with a ROB. Labeled components: Instruction Unit, FP Ops "Stack", Decoder, ROB, FP Registers, Load Buffer (from memory), Store Buffer (to memory), Operand Busses, Operation Bus, Reservation Stations, FP Add (2 stage), FP Mul/Div (6 stage), Common Data Bus]
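The three steps can be sketched as a small Python model. This is an illustrative sketch only, not the hardware design from the slides: the names `RobEntry`, `ReorderBuffer`, `issue`, `complete`, and `commit` are hypothetical, and the model ignores capacity limits, reservation stations, and the common data bus.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional


@dataclass
class RobEntry:
    dest: str                    # architectural destination register
    value: Optional[int] = None  # result, filled in when the instruction completes
    valid: bool = False          # "invalid" until the result has been written


class ReorderBuffer:
    """Hypothetical model: a FIFO of entries plus the committed register file."""

    def __init__(self) -> None:
        self.entries: Deque[RobEntry] = deque()  # entries[0] is the head
        self.regfile: Dict[str, int] = {}

    def issue(self, dest: str) -> RobEntry:
        """Step 1: allocate an entry in program order, marked invalid."""
        entry = RobEntry(dest)
        self.entries.append(entry)
        return entry

    def complete(self, entry: RobEntry, value: int) -> None:
        """Step 2: write the result into the entry, which becomes valid."""
        entry.value = value
        entry.valid = True

    def commit(self) -> List[str]:
        """Step 3: commit entries from the head for as long as the head is valid."""
        committed = []
        while self.entries and self.entries[0].valid:
            head = self.entries.popleft()
            self.regfile[head.dest] = head.value
            committed.append(head.dest)
        return committed


rob = ReorderBuffer()
a = rob.issue("F0")
b = rob.issue("F2")
c = rob.issue("F4")
rob.complete(b, 20)            # the second instruction finishes first
first_commit = rob.commit()    # nothing commits: the head (F0) is still invalid
rob.complete(a, 10)
second_commit = rob.commit()   # F0 and F2 commit in program order; F4 still waits
```

In this sketch a completed instruction cannot update the register file until everything older than it has done so, which is the separation of execution from commitment described above.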

Structure/Operation of the ROB
- ROB entry fields
  - I-Type: branch, memory, or register
  - Dest: register or memory address
  - Value
  - Ready: status
  - Speculation info: speculative? identify which block?
- Why do you need this information?
- Issue/dispatch must now allocate a ROB entry
  - The ROB tag is used in renaming
- Execute in a data-driven manner
  - Write results on the CDB using the ROB tag
- Commit instructions in order
  - Commit valid instructions at the head of the ROB
  - Incorrect branches cause the ROB to be flushed and execution restarted

The Result
- Results are written into the register file in order
- Destination registers are effectively renamed to reorder buffer entries
  - Each instruction writes to a new "destination register"

Using the Speculative Bits
- Speculative instructions are marked in the reorder buffer by a special "speculative" bit
  - When a branch is confirmed, the status of the corresponding speculative instructions is changed from "speculative" to "confirm"
  - If the branch is not confirmed, the status is set to "not confirmed"
- When an instruction reaches the head of the reorder buffer
  - If it is marked "speculative", commitment is stalled until its status is determined, i.e., it is no longer speculative
  - If it is marked "confirm", the instruction is committed
  - If it is marked "not confirmed", its result is discarded
- These are known as speculative writebacks
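The head-of-buffer rules can be sketched as follows. This is a minimal illustration under stated assumptions: each entry is a plain dictionary, every speculative entry carries a hypothetical `branch` tag naming the branch it depends on, and the function names `resolve_branch` and `retire_head` are invented for this example.

```python
from collections import deque
from enum import Enum


class Status(Enum):
    SPECULATIVE = "speculative"
    CONFIRM = "confirm"
    NOT_CONFIRMED = "not confirmed"


def resolve_branch(rob, branch_id, predicted_correctly):
    """Update every entry that was issued under the given branch."""
    new_status = Status.CONFIRM if predicted_correctly else Status.NOT_CONFIRMED
    for entry in rob:
        if entry["branch"] == branch_id and entry["status"] is Status.SPECULATIVE:
            entry["status"] = new_status


def retire_head(rob, regfile):
    """Apply the head rules; returns 'empty', 'stall', 'commit', or 'discard'."""
    if not rob:
        return "empty"
    head = rob[0]
    if not head["valid"] or head["status"] is Status.SPECULATIVE:
        return "stall"      # wait until both the result and the status are known
    rob.popleft()
    if head["status"] is Status.CONFIRM:
        regfile[head["dest"]] = head["value"]
        return "commit"
    return "discard"        # not confirmed: the result never reaches the register file


rob = deque([
    {"dest": "R1", "value": 5, "valid": True, "branch": "B1", "status": Status.SPECULATIVE},
    {"dest": "R2", "value": 7, "valid": True, "branch": "B2", "status": Status.SPECULATIVE},
])
regfile = {}
before = retire_head(rob, regfile)     # head is still speculative
resolve_branch(rob, "B1", True)
after_b1 = retire_head(rob, regfile)   # R1 is committed
resolve_branch(rob, "B2", False)
after_b2 = retire_head(rob, regfile)   # R2 is dropped
```

A real design would flush all younger entries on a misprediction rather than discard them one at a time; the per-entry discard here only mirrors the status rules listed above.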

Speculative Memory References
- Loads/stores do exhibit flow, output, and anti-dependences
- Speculative stores are a problem
  - Use a store buffer and manage it like a reorder buffer

Simple solution
- Only one load/store unit
- The reservation stations for this single load/store unit form a queue
- Process in strict queue order (in order)
- Inefficient

Separate Load/Store Units

An Example - Alpha 21264
[Figure: block diagram with Instruction Cache; Instruction processing and dispatch; Integer issue and FP issue; Integer execution unit and FP execution unit; Memory interface with Load queue and Store queue; Data Cache]
- The load queue and the store queue are both 32-entry reorder buffers

Parallel Retirement
- Retiring only one instruction per cycle is also a bottleneck
- Can retire instructions in parallel
- Advantage: frees up more reorder buffer entries quickly
- Does not affect instruction execution directly, as instructions can read from the reorder buffer directly

Parallel Retirement
[Figure: instruction results enter the Reorder Buffer; the Retirement Logic moves them into the Register File; instruction operands are read from both the Reorder Buffer and the Register File]

Parallel Retirement
- Although retiring in parallel, the retirement logic must guarantee in-order retirement
  - Must check valid bits in sequence
  - Must check destination register numbers
- Requires more ports to the register file
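The two checks can be illustrated with a short Python sketch. It assumes a hypothetical list-of-dictionaries ROB ordered oldest first and a retirement width chosen by the caller; `retire_group` is an invented name, and the handling of duplicate destinations (only the youngest value is written) is one plausible reading of "must check destination register number".

```python
def retire_group(rob, regfile, width):
    """Retire up to `width` instructions from the head of `rob` in one cycle.

    `rob` is a list of dicts ordered oldest first, with keys
    'dest', 'value', and 'valid'. Returns the number retired.
    """
    # Check valid bits in sequence: stop at the first entry that is not valid,
    # even if younger entries behind it have already completed.
    count = 0
    while count < width and count < len(rob) and rob[count]["valid"]:
        count += 1

    # Check destination register numbers: if several retiring entries write
    # the same register, only the youngest value may survive.
    writes = {}
    for entry in rob[:count]:
        writes[entry["dest"]] = entry["value"]  # later entries overwrite earlier ones
    regfile.update(writes)

    del rob[:count]
    return count


rob = [
    {"dest": "R1", "value": 1, "valid": True},
    {"dest": "R1", "value": 2, "valid": True},
    {"dest": "R3", "value": None, "valid": False},  # blocks everything behind it
    {"dest": "R4", "value": 4, "valid": True},
]
regfile = {}
retired = retire_group(rob, regfile, width=4)
```

Here two instructions retire together, R1 receives the younger value, and the completed R4 entry stays in the buffer because an older instruction has not finished.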

Forwarding from the ROB
- Results from the ROB can be forwarded directly to executing instructions
  - Valid results can be read directly from the reorder buffer
- Suppose two reorder buffer entries write to the same destination register, say R0
  - For an instruction reading from R0, extra hardware must decide which entry is the right one to read from
  - The later of the two instructions writing to R0 has the higher priority
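The priority rule can be sketched as a search from the youngest entry backwards. This is an illustrative model, not the actual selection hardware: `read_operand` is a hypothetical name, the reading instruction is assumed to be younger than every entry in the list, and returning "wait" when the latest writer has not completed is an assumption added so that the reader never picks up a stale older value.

```python
def read_operand(reg, rob, regfile):
    """Return (source, value) for a read of `reg`.

    `rob` is ordered oldest first; the reader is assumed to be younger
    than every entry in it.
    """
    for entry in reversed(rob):          # youngest writer has the highest priority
        if entry["dest"] == reg:
            if entry["valid"]:
                return ("rob", entry["value"])
            return ("wait", None)        # latest producer has not completed yet
    return ("regfile", regfile[reg])     # no in-flight writer: use committed state


rob = [
    {"dest": "R0", "value": 1, "valid": True},
    {"dest": "R0", "value": 2, "valid": True},
    {"dest": "R5", "value": None, "valid": False},
]
regfile = {"R0": 0, "R5": 50, "R7": 70}

r0 = read_operand("R0", rob, regfile)   # forwarded from the later of the two R0 writers
r5 = read_operand("R5", rob, regfile)   # producer still executing
r7 = read_operand("R7", rob, regfile)   # no writer in the ROB
```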

Register Renaming

Dependencies and Register Pressure
- Registers are reused over the life of a program
  - Compilers provide a static scheme for reusing registers
- Speculative execution creates a greater demand for registers to eliminate name dependencies
- Register renaming increases the issue rate

Renaming Used at Different Points
[Figure: pipeline with IF, ID, execution units (EX INT, EX FP, EX BR), MEM, and WB, annotated with the points where values are available for forwarding and where values are available for commitment]
- Extend the resources available for renaming
  - More physical registers are available than are visible in the ISA
- Renaming is performed at/during ID or prior to issue
- The number of registers determines how many instructions can exist between issue and commit

Principle
[Figure: a Register Re-Map Table (Logical Register File) with entries labeled R0, R1, R7, each containing the name of a physical register, pointing into a Physical Register File with entries labeled P0, P1, P11]
- Instructions specify logical or architectural registers
- At instruction issue, a logical register is re-mapped or re-named to one of a larger pool of physical registers

Example: IBM RS 6000
[Figure: RS 6000 scheme: a mapping table indexed by register number (R0, R1, R2, Rj), a list of free registers, a list of registers in use, and the extra registers]
- Add a few extra registers to be reused over the life of the program
- How do we keep track of this mapping information?
  - Index a table with the register number: the mapping table
  - Keep track of free registers available for renaming
  - Keep track of registers currently in use

When Is It Safe to Re-Use a Register?
- If no active instruction is using that register, it can be reused
- One approach is to check the registers being used by all active instructions
  - Expensive
- Another approach is to perform checks at instruction commitment

Case Study: MIPS R10000

MIPS R10000
- There are 32 logical registers
  - 5-bit logical register specifiers
- There are 64 physical registers
  - 6-bit physical register identifiers
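The specifier widths follow directly from the register counts. The short check below is only arithmetic on the numbers given above; the helper name `specifier_bits` is hypothetical, and reading the difference between the two counts as the pool available for renaming is an interpretation, not a statement from the slides.

```python
def specifier_bits(num_registers):
    """Smallest number of bits that can name `num_registers` distinct registers."""
    return max(1, (num_registers - 1).bit_length())


LOGICAL_REGISTERS = 32
PHYSICAL_REGISTERS = 64

logical_bits = specifier_bits(LOGICAL_REGISTERS)     # width of a logical specifier
physical_bits = specifier_bits(PHYSICAL_REGISTERS)   # width of a physical identifier

# Physical registers beyond the architectural set: the pool left over for renaming
# once every logical register has a committed home (an interpretation).
extra_registers = PHYSICAL_REGISTERS - LOGICAL_REGISTERS
```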

Main Data Structures
- The Register Map Table
- The Free Register List
- The Active List
- The Busy Bit Table
- Duplicated for general-purpose and floating-point registers

The Register Map Table
- A multi-ported static random access memory (SRAM)
  - Takes 5-bit addresses
  - Delivers 6-bit results
- For each instruction that may be issued in one cycle, requires three read ports (two source lookups plus the current mapping of the destination)
  - Example: ADD.D F0, F2, F4
- Needs at least one write port per instruction that can be retired in a cycle (recall parallel retirement)

Active List
- A FIFO queue, similar in function to the reorder buffer
- Each instruction has a corresponding active list entry
- Processing the head of the active list is called instruction retirement or graduation (what we referred to as commitment)

Free Register List
- A FIFO queue of physical registers that are available for reuse

Busy Bit Table
- A table to indicate the availability of source operands
- The busy bit in the instruction queue entry must be updated constantly
  - Each time a physical register is written, all corresponding busy bits in the instruction queues must be updated

Functional Unit Instruction Queue
- Equivalent of reservation stations for each functional unit
- Consists of
  - opcode
  - ready bits of the physical register operands
  - physical source register identifiers
  - physical destination register identifier
  - a TAG field for locating the corresponding active list entry

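MIPS R10000
[Figure: an instruction (op, src1, src2, dst) indexes the Register Map Table (RMT); the Free Register List supplies the new Pdst; the RMT supplies the old Pdst; the Busy Bit Table tracks operand readiness; each FU Instruction Queue entry holds Op, Ready field, Psrc1, Psrc2, Pdst, and Tag; each Active List entry holds Old Pdst, Dst, and Done bit]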

Upon Instruction Issue...
- Each instruction gets the following allocated
  - an entry in the corresponding FU instruction queue
  - an entry in the active list
  - a new physical destination register from the free register list

Next...
- The two 5-bit logical source register specifiers are used to access the RMT to obtain the corresponding physical registers
- The 5-bit logical destination register specifier is used to access the RMT
  - The output (the old physical destination register) is written to the corresponding active list entry
- The busy bit for the physical destination is set

Instruction Execution
- When both physical source registers are ready, proceed with operand read
  - Takes care of flow dependences
- The result is written directly to the physical destination register
- Update the Busy Bit Table
- The DONE bit in the active list entry is set

Instruction Retirement
- When the entry at the head of the active list is marked "DONE", proceed to retire the instruction
- The old physical register is released to the free register list for reuse
- Each allocated physical register is written exactly once
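The issue, execution, and retirement steps can be tied together in a small Python model. This is a sketch under simplifying assumptions, not the R10000 implementation: the class `Renamer` and its method names are hypothetical, there is a single register class and no instruction queues, every instruction has two sources and one destination, and the demo uses 8 logical and 12 physical registers instead of 32 and 64 to keep the numbers small.

```python
from collections import deque


class Renamer:
    def __init__(self, num_logical=32, num_physical=64):
        # Initially logical register i lives in physical register i.
        self.rmt = list(range(num_logical))                  # Register Map Table
        self.free = deque(range(num_logical, num_physical))  # Free Register List
        self.busy = [False] * num_physical                   # Busy Bit Table
        self.active = deque()                                # Active List (FIFO)
        self.prf = [0] * num_physical                        # physical register file

    def issue(self, dst, src1, src2):
        """Rename `dst = op(src1, src2)` and return its active list entry."""
        psrc = (self.rmt[src1], self.rmt[src2])  # look up sources before remapping dst
        old_pdst = self.rmt[dst]                 # previous mapping, kept for retirement
        new_pdst = self.free.popleft()           # raises IndexError if none is free
        self.rmt[dst] = new_pdst
        self.busy[new_pdst] = True
        entry = {"psrc": psrc, "pdst": new_pdst, "old_pdst": old_pdst, "done": False}
        self.active.append(entry)
        return entry

    def ready(self, entry):
        """An instruction may execute once neither physical source is busy."""
        return not any(self.busy[p] for p in entry["psrc"])

    def execute(self, entry, value):
        """Write the result, clear the busy bit, and set the DONE bit."""
        self.prf[entry["pdst"]] = value
        self.busy[entry["pdst"]] = False
        entry["done"] = True

    def retire(self):
        """Graduate the head if it is DONE; its old physical register is freed."""
        if self.active and self.active[0]["done"]:
            entry = self.active.popleft()
            self.free.append(entry["old_pdst"])
            return entry
        return None


r = Renamer(num_logical=8, num_physical=12)
i1 = r.issue(dst=1, src1=2, src2=3)   # R1 is mapped to P8
i2 = r.issue(dst=4, src1=1, src2=1)   # reads P8: flow dependence on i1
i3 = r.issue(dst=1, src1=5, src2=6)   # R1 is remapped to P10: no output/anti hazard
i2_ready_early = r.ready(i2)          # i1 has not produced P8 yet
i3_ready_early = r.ready(i3)          # independent of i1 and i2
r.execute(i3, 99)                     # i3 may finish out of order
blocked = r.retire()                  # head (i1) is not DONE, so nothing graduates
r.execute(i1, 7)
r.execute(i2, 14)
graduated = [r.retire(), r.retire(), r.retire()]
```

After the three graduations the free list holds P11 followed by the released P1, P4, and P8, in the order the instructions retired.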

When Is It Safe?
- When is it safe to reuse a physical register?
- Example: R1 → P7 previously; now a new instruction, I1, will write to R1 and gets assigned P5
    R1 = ...      (R1 remapped to P7)
    ... = R1
    R1 = ...      (I1: R1 remapped to P5)
- It is safe to reuse P7 when I1 has completed execution (and has written to P5)
- From the behavior of the ROB, we know all prior instructions have committed, i.e., P7 can be freed after this instruction commits

Why?
- Because the logical register R1 has been overwritten
- Any subsequent read of R1 should go to P5

Handling Flow Dependences
- Each time we allocate a new destination register, we update the RMT
  - Any subsequent read will get the correct mapping from the RMT
- The busy-bit system, comprising the Busy Bit Table and the constant updating of the busy fields in the instruction queue entries, ensures data availability checking

Handling Output and Anti-Dependences
- Each instruction writes to a newly allocated physical register
- Registers are renamed from logical to physical
- Can use more physical registers

Case Study: Intel Pentium III and Pentium IV (NETBURST)

Intel IA32
- Due to backward compatibility
  - Complex instructions
  - Limited number of registers
- Each complex instruction is translated into several micro-ops (uops)
- Register renaming is used to allow for more registers

The Pipeline
- Basic Pentium 3 misprediction pipeline (10 stages)
    1 fetch | 2 fetch | 3 dec | 4 dec | 5 dec | 6 rename | 7 ROB rd | 8 Rdy/sch | 9 dispatch | 10 exec
- Basic Pentium 4 misprediction pipeline (20 stages), key stages in order
    TC Nxt IP | TC Fetch | rename | que | sch | sch | sch | disp | disp | RF | RF | EX

ROB
- The Reorder Buffer (ROB) in the IA32 is implemented as a content-addressable memory
- Serves as an instruction pool
[Figure: system bus, bus interface, L2 cache, L1 I-cache, and L1 D-cache (fetch, load, store paths); the Fetch/decode unit, Dispatch/execute unit, and Retire unit all connect to the ROB instruction pool]

ROB Entries
- Each ROB entry has a data field and a status field
- The ROB data field stores the data result of a uop
- The ROB status field tracks the status of the uop producing the result that is to go into the corresponding data field

Register Renaming in P-III
- A Register Alias Table (multi-ported SRAM) keeps track of the latest alias for logical registers
- The ROB is managed like a reorder buffer
  - Tracks availability of data
- Once retired, data is copied from the ROB to the Retirement Register File (RRF)

Pentium 3
[Figure: the Register Alias Table (RAT) remembers the most current version of each register (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP); also shown are the 40-entry ROB, with Data and Status fields, and the RRF]

Register Renaming
- A RAT entry may point to a ROB entry or to an RRF entry
- No physical EAX, EBX, etc. exist
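A RAT whose entries point either into the ROB or into the RRF can be sketched as below. This is a simplified illustration, not Intel's implementation: the class `P3Rename` and its methods are hypothetical, the ROB is an insertion-ordered dictionary rather than a 40-entry circular buffer, and repointing the RAT at the RRF on retirement (when no younger writer exists) is an assumption made so that reads stay correct after the ROB entry is released.

```python
REGS = ("EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "ESP", "EBP")


class P3Rename:
    def __init__(self):
        self.rrf = {r: 0 for r in REGS}            # Retirement Register File
        self.rat = {r: ("RRF", r) for r in REGS}   # every alias starts at the RRF
        self.rob = {}                              # entry id -> data and status fields
        self.next_id = 0

    def allocate(self, dest):
        """Allocate a ROB entry for a uop writing `dest` and update the alias."""
        entry_id = self.next_id
        self.next_id += 1
        self.rob[entry_id] = {"dest": dest, "data": None, "done": False}
        self.rat[dest] = ("ROB", entry_id)
        return entry_id

    def complete(self, entry_id, data):
        """The uop finished: fill in the data field and mark the status done."""
        self.rob[entry_id].update(data=data, done=True)

    def read(self, reg):
        """Follow the RAT to the latest version; None means not yet available."""
        where, key = self.rat[reg]
        if where == "RRF":
            return self.rrf[key]
        entry = self.rob[key]
        return entry["data"] if entry["done"] else None

    def retire(self):
        """Retire the oldest uop if done: copy its data from the ROB to the RRF."""
        if not self.rob:
            return False
        entry_id, entry = next(iter(self.rob.items()))   # oldest entry
        if not entry["done"]:
            return False
        del self.rob[entry_id]
        self.rrf[entry["dest"]] = entry["data"]
        if self.rat[entry["dest"]] == ("ROB", entry_id):  # no younger writer exists
            self.rat[entry["dest"]] = ("RRF", entry["dest"])
        return True


m = P3Rename()
a = m.allocate("EAX")
b = m.allocate("EAX")            # the RAT now points at the younger entry
m.complete(a, 11)
m.complete(b, 22)
latest = m.read("EAX")           # served from the ROB
m.retire()                       # the older value reaches the RRF
alias_after_first = m.rat["EAX"]
m.retire()                       # the younger value reaches the RRF
alias_after_second = m.rat["EAX"]
```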

Pentium IV
- Introduced the NetBurst architecture
- Eliminates the copying of ROB data values to the RRF
- Consists of two RATs
  - Frontend RAT
  - Retirement RAT

Pentium IV
- The 128-entry Register File (RF) is separated from the ROB, which now consists only of status fields
- A unique, in-order sequence number is allocated for each uop; it points to the corresponding ROB entry

Pentium IV NetBurst
[Figure: the Front End RAT and the Retirement RAT, each with entries EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, point into the RF of 128 physical registers (Data); the ROB holds Status only]

Pentium IV Execution Core
- Up to 126 instructions in flight, with up to 48 loads and 24 stores pending
- The front end feeds the execution core
- The allocator allocates a ROB entry, renames registers, and allocates a μop queue entry and a load/store buffer
- Front-end μop supply and back-end μop retirement bandwidth is 3 μops per cycle
- Dispatch bandwidth into the execution core is 6 μops per cycle
- Multi-clock bypass network for double-speed integer ALUs

Pentium IV Execution Core
- A compute μop queue and a memory μop queue feed the schedulers
- Out-of-order schedulers feed the dispatch ports
- Dispatch ports
  - Exec Port 0: ALU (2X) for Add/Sub, Logic, Store Data, Branches; FP Move for FP/SSE Move, FP/SSE Store
  - Exec Port 1: ALU (2X) for Add/Sub; Integer for Shift/Rotate; FP for FP/SSE Add, Mul, Div
  - Load Port: Load
  - Store Port: Store

Some Observations
- Applications have a high level of thread parallelism
- Within a thread, high-latency operations have to be tolerated, e.g., cache misses
- Transistors have been invested to improve the performance of a single thread
- Sub-linear relationship between investment (chip area) and return (execution speed): utilization is the key!

What Next?
- Exploit thread-level parallelism
  - Use multiple processors and keep them busy
- Time sharing
  - Switch-on-event time sharing
  - Need to flush the deep pipelines
- Fine-grained multithreading to keep the pipelines full
- Simultaneous multithreading to maximize resource utilization with minimal overhead

Forms of Multithreading
[Figure: issue slots across, time down, with stalls marked, for four designs: Superscalar, Coarse Grain Multithreading, Fine Grained Multithreading, Simultaneous Multithreading]
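The difference between the four designs can be illustrated by counting filled issue slots in a toy model. Everything here is an assumption made for illustration: `READY[t][c]` is a made-up number of instructions thread `t` could issue in cycle `c` (0 models a stall), work that is not issued in its cycle is simply lost, and thread switches cost nothing. The function names are hypothetical, and the numbers say nothing about real processors.

```python
def superscalar(ready, width):
    """Single-threaded baseline: only thread 0 may use the issue slots."""
    return sum(min(width, n) for n in ready[0])


def coarse_grain(ready, width):
    """Stay on one thread until it stalls, then switch for the next cycle."""
    current, used = 0, 0
    for cycle in range(len(ready[0])):
        issued = min(width, ready[current][cycle])
        used += issued
        if issued == 0:                           # stall: the slots in this cycle are lost
            current = (current + 1) % len(ready)
    return used


def fine_grain(ready, width):
    """Rotate to a different thread every cycle; one thread issues per cycle."""
    threads = len(ready)
    return sum(min(width, ready[c % threads][c]) for c in range(len(ready[0])))


def simultaneous(ready, width):
    """Fill each cycle's slots with instructions from every thread."""
    return sum(min(width, sum(column)) for column in zip(*ready))


WIDTH = 4
READY = [
    [2, 0, 0, 0],   # thread 0 stalls after the first cycle
    [1, 2, 2, 1],   # thread 1 always has some work
]
total_slots = WIDTH * len(READY[0])
slots_used = {
    "superscalar": superscalar(READY, WIDTH),
    "coarse grain": coarse_grain(READY, WIDTH),
    "fine grained": fine_grain(READY, WIDTH),
    "simultaneous": simultaneous(READY, WIDTH),
}
```

In this particular trace, the simultaneous scheme fills the most slots because it is the only one that can combine instructions from both threads within a single cycle.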

Increasing Utilization in the NetBurst Microarchitecture
- Observations for dynamically scheduled processors
  - Have large register sets with support for renaming
  - Tag support enables tracking of instructions across threads
  - Schedulers and execution units track dependencies
- Idea: provide support for sharing resources across threads with little additional hardware support: Hyper-threading
- Abstraction: logical processors
  - This is what the programmer and the operating system see

Hyper-threading in the Xeon Processor Family
[Figure: 2 CPUs without Hyper-threading: each processor has one architectural state on top of its own execution resources. 2 CPUs with Hyper-threading: each processor has two architectural states sharing one set of execution resources]
- Goals
  - Minimize die area cost
  - Independent forward progress for a logical processor
  - Do not penalize single-thread performance
- Implementation of Hyper-threading adds less than 5% to the chip area
- Principle: share major logic components by adding or partitioning buffering logic

The Xeon Pipeline
[Figure: pipeline stages: Trace cache (TC) access, μop queue, Rename (Reg rename, Reg allocator), Queue, Schedule, Register Read, Execute (EX INT, FP, BR), L1 cache, WB (Register cache), Retire (ROB)]
- Fetch logic
  - Duplicate ITLBs and PCs
  - Independent I-buffers for decode
  - RAS duplicated and some sharing of branch prediction logic
- Annotations on the pipeline stages
  - Round-robin access (shown twice)
  - Dynamic sharing
  - Fairness enforced by limits on buffer sharing (shown twice)
  - Separate RATs
  - Schedulers oblivious to logical processors
  - Execution units oblivious to logical processors

Performance
- 65% performance increase for high-end server applications on a 4-way server platform
- ~20%-30% performance improvement for categories such as transactions, web servers, and server-side Java environments
- The operating system can optimize scheduling of threads across logical/physical processor combinations

Recall…
[Figure: division of work between compiler and hardware. Stages: Front-End & Optimizer, Determine Dependences, Determine Independences, Bind Resources, Execute. The hand-off from compiler to hardware falls at a different stage for each class: Sequential (superscalar), Dependence Architecture (dataflow), Independence Architecture (Horizon), Independence Architecture (VLIW)]

Review of the Superscalar Datapath
[Figure: in-order fetch and issue logic, out-of-order execution core, in-order completion logic]
- Instruction Issue
  - Renaming
  - Allocate reservation stations
  - Allocate re-order buffer entry
  - Check for structural hazards
- Instruction Execution
  - Data-driven execution: all dependencies have been resolved
  - Issue to functional unit
  - De-allocate reservation stations
  - Forwarding
  - Check load/store dependencies
- Instruction Completion
  - Enable waiting instructions
  - Retire from re-order buffer
  - Forward from re-order buffer

Concluding Remarks
- Degree of speculation
  - Speculate the bad along with the good, e.g., cache misses
  - Speculating through multiple branches
- Hide long functional unit delays
  - May need to speculate through multiple branches in one cycle
- Use SATSIM
  - Follow the execution and understand the use of register renaming and of the re-order buffer
- Check the data sheets for modern processors. What techniques do they use?

Study Guide
- Given a code sequence
  - What is the state of the ROB at some point in time?
- Exception handling
  - Using a ROB
- Register renaming
  - Given a code sequence, what would be the contents of the rename table or rename register file (depending on which technique is used) at some point in time?
  - Which physical registers are available?
- Forms of speculation: understanding how they work
  - Across branches
  - Speculating memory accesses
