Approaches to exploiting Instruction Level Parallelism (ILP)


1 Approaches to exploiting Instruction Level Parallelism (ILP)
Hardware – the hardware uncovers parallelism at run time (Intel Pentium series) Software – the compiler finds and exposes parallelism at compile time (Intel Itanium series) Section 2.1

2 Pipeline CPI Ideal pipeline CPI + structural stalls + data hazard stalls + control stalls Ideal pipeline CPI – maximum performance attainable by the implementation Decrease pipeline CPI by decreasing any of the four terms Chapter 2 discusses techniques to reduce these terms Section 2.1
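The CPI decomposition above can be sketched numerically; all the stall counts below are made-up illustrative numbers, not figures from the text.

```python
# Pipeline CPI = ideal CPI + structural + data-hazard + control stalls.
# Every stall value here is an illustrative assumption.
ideal_cpi = 1.0
structural_stalls = 0.1   # stalls per instruction from structural hazards
data_hazard_stalls = 0.3  # stalls per instruction from data hazards
control_stalls = 0.2      # stalls per instruction from branches

pipeline_cpi = ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls
print(pipeline_cpi)
```

Reducing any one term reduces the sum directly, which is why Chapter 2 attacks each term separately.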

3 Instruction-level Parallelism
Potential execution overlap among instructions Basic block (straight line sequence of code – enter at top, exit at bottom) parallelism is small – 3 to 6 instructions execute between a pair of branches for typical MIPS programs Section 2.1

4 Loop-level parallelism
Parallelism among loop iterations
Example: for (i = 1; i <= 1000; i++) a[i] = a[i] + b[i];
All iterations can be executed in parallel We’ll look at techniques that convert loop-level parallelism into instruction-level parallelism (ILP) Section 2.1

5 Data dependences and hazards
Two dependent instructions can not be executed in parallel An instruction j is data dependent on an instruction i if: Instruction i produces a result that may be used by j, or Instruction j is data dependent on instruction k and instruction k is data dependent on i Section 2.1

6 Data dependences Indicate:
1) the possibility of a hazard (whether one occurs and the length of any stall is a property of the pipeline) 2) the order in which results must be calculated 3) an upper bound on the amount of parallelism that can be exploited Section 2.1

7 Name dependences Two instructions access the same register or memory location (name) but there is no flow of data between them via that name Also called a false dependence Two types Antidependence between instructions i and j if instruction j writes to a register or memory location that instruction i reads Output dependence between instructions i and j if instruction j writes to the same memory location or register that instruction i writes to Section 2.1

8 Possible hazards Hazard – stalls due to dependences between instructions (control, data, output, anti) or insufficient hardware resources (structural) RAW hazard – corresponds to true data dependence WAW hazard – corresponds to an output dependence WAR hazard – corresponds to an antidependence Note that true data dependences cannot be eliminated; however, the other two types can sometimes be eliminated, which also eliminates the hazard Section 2.1

9 Control dependence An instruction j is control dependent upon an instruction i if i determines whether j will be executed In general, these are the constraints imposed by control dependences 1. An instruction that is control dependent on a branch cannot be moved before the branch so that it is no longer control dependent upon the branch 2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch Section 2.1

10 Control dependence Violating control dependences is possible if we can do so without affecting the correctness of the program Two properties critical to program correctness Exception behavior – can’t change how exceptions are raised in the program Data flow – flow of data values among instructions Section 2.1

11 Basic Pipeline Scheduling
Keep pipeline full by finding sequences of unrelated instructions that can be overlapped in the pipeline A dependent instruction must be separated from its source instruction by X instructions, where X is the latency between the two instructions Section 2.2

12 Latency (review) Number of independent instructions that must be between two data dependent instructions in order to avoid a stall Latencies of FP operations for examples in this chapter – see figure 2.2 (in book) Section 2.2

13 Example for (i = 1000; i > 0; i--) x[i] = x[i] + s; MIPS code:
Loop: L.D    F0, 0(R1)    ;load x[i]
      ADD.D  F4, F0, F2   ;add s
      S.D    F4, 0(R1)    ;store x[i]
      DADDUI R1, R1, #-8  ;decrement pointer
      BNE    R1, R2, Loop ;branch if R1 != R2
How many cycles for 1 iteration? Section 2.2

14 After scheduling
Loop: L.D    F0, 0(R1)
      DADDUI R1, R1, #-8
      ADD.D  F4, F0, F2
      BNE    R1, R2, Loop
      S.D    F4, 8(R1)
Note: since the DADDUI is moved before the S.D, the offset on the S.D had to be adjusted (smart compiler) How many cycles for 1 iteration? Section 2.2

15 Loop unrolling Note that two of the instructions (DADDUI, BNE) are loop overhead instructions We can eliminate loop overhead instructions by eliminating loop iterations Loop unrolling must be done in conjunction with register renaming to eliminate false loop dependences Section 2.2

16 Loop unrolling Loop unrolled k times means that new loop has k copies of the original loop body and is executed (n /k) times where n is the number of times original loop is executed Unrolled loop is preceded by a copy of the original loop that is executed n % k times Section 2.2

17 Loop unrolling example
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop
How many cycles? Section 2.2

18 Scheduled/unrolled version
Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D    F12, 16(R1)
      BNE    R1, R2, Loop
      S.D    F16, 8(R1)
Note: the offsets on the last two stores are adjusted because they now follow the DADDUI How many cycles? Section 2.2

19 Loop unrolling/scheduling
Compiler must determine dependences within and across loop iterations (data dependence analysis) Analysis must consider access to scalars (in registers) as well as access to arrays (in memory) Compiler must use different registers to eliminate false dependences (register renaming) Section 2.2

20 Limitations to the gains of loop unrolling
Decrease in the amount of overhead amortized with each unroll – the amount of loop overhead per iteration becomes less and less with each unrolling so there is also less gained (in terms of eliminating overhead) by additional unrolling Code size – this is especially a problem for embedded systems Increase in register pressure – register renaming causes an increase in the number of registers needed by the register allocator Section 2.2

21 Static Branch Prediction
Simplest scheme – predict all branches as being taken; misprediction rate ranges from 9% to 59% on the SPEC benchmarks Better – predict branches based on profile information; see figure 2.3 (next slide) Section 2.3

22 Figure 2.3 Section 2.3

23 Additional uses of static branch prediction
Scheduling for the canceling branch Assisting dynamic branch predictors – prediction used when dynamic predictor doesn’t have a valid prediction Determining frequently executed code paths – compiler focuses optimizations on the frequently executed paths Section 2.3

24 Dynamic Branch Prediction
Branch prediction made by the hardware; prediction can change as the program executes Dynamic branch prediction is very important for any processor that tries to issue more than one instruction per clock cycle Branches arrive up to n times faster in an n-issue processor Since branches will arrive faster, Amdahl’s law tells us that the impact of control stalls will be more significant Section 2.3

25 Branch Prediction Buffer
Buffer that holds prediction of the behavior of fetched branches Indexed by lower portion of address of instructions (alternatively, bits stored in instruction cache with instruction) If instruction is a branch and prediction is taken, fetching begins at the target as soon as it is known Wrong prediction causes bit(s) to be changed Section 2.3

26 1-bit prediction scheme
Buffer contains a single bit that indicates whether the branch was last taken or not Fetching begins in the predicted direction If hint is wrong, prediction bit is reversed Section 2.3

27 2-bit prediction scheme
2 bits used to represent the prediction 00, 01 – predict not taken 10, 11 – predict taken Prediction must be wrong twice before it is changed See figure 2.4 Section 2.3
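The state machine above can be sketched as a single saturating counter (encoding as in the slide: states 0–1 predict not taken, 2–3 predict taken):

```python
class TwoBitPredictor:
    """One 2-bit saturating counter: states 0-1 predict not taken,
    states 2-3 predict taken. The prediction flips only after two
    consecutive mispredictions."""
    def __init__(self, state=0):
        self.state = state  # 0..3

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        # move one step toward the actual outcome, saturating at 0 and 3
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)
```

Starting from strongly-taken (state 3), one not-taken outcome leaves the prediction at taken; a second not-taken outcome flips it, matching "wrong twice before it is changed".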

28 Figure 2.4 Section 2.3

29 Effectiveness of branch prediction buffer
Only effective if target can be calculated before branch outcome Studies indicate that the 2-bit scheme has a misprediction rate ranging from 0% to 18% on the SPEC89 benchmarks (figure 2.5, next slide) Figure 2.6 (next slide + 1) compares misprediction rates of a 4k buffer to an “infinite” buffer Section 2.3

30 Figure 2.5 Section 2.3

31 Figure 2.6 Section 2.3

32 Correlating branch predictors
Branch predictors that use the behavior of other branch instructions to predict the behavior of the current branch instruction
if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) { … }
Note: if both of the first two conditions are true, then the last condition won’t be true Section 2.3

33 (1, 1) correlating branch predictor
1 bit of correlation (uses behavior of 1 previous branch) 1 bit of prediction The behavior of the last branch executed (taken or not taken) is used to choose the prediction bit for the current branch 2 prediction bits per branch are needed – one is used if the last branch executed was not taken and the other is used if the last branch executed was taken Section 2.3

34 (m, n) predictor Uses behavior of last m branches to choose among 2^m n-bit predictors Global history of the most recent m branches can be recorded in an m-bit shift register Branch prediction buffer accessed with branch address and m-bit global history See next slide Section 2.3
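A sketch of an (m, n) predictor along these lines; the table size and the modulo indexing by branch address are illustrative assumptions, not details from the text:

```python
class CorrelatingPredictor:
    """(m, n) predictor sketch: an m-bit global history shift register
    selects one of 2**m n-bit saturating counters in each buffer entry,
    and the low bits of the branch address pick the entry."""
    def __init__(self, m=2, n=2, entries=16):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0  # m-bit global history shift register
        self.table = [[0] * (2 ** m) for _ in range(entries)]

    def predict(self, pc):
        counter = self.table[pc % self.entries][self.history]
        return counter >= 2 ** (self.n - 1)  # upper half of range = taken

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        c = row[self.history]
        row[self.history] = min(c + 1, 2 ** self.n - 1) if taken else max(c - 1, 0)
        # shift the actual outcome into the m-bit global history
        self.history = ((self.history << 1) | int(taken)) & (2 ** self.m - 1)
```

Each branch thus trains a separate counter per recent-history pattern, which is what lets it exploit correlations like the aa/bb example on the previous slide.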

35 Section 2.3

36 Figure 2.7 Compares performance of 2-bit predictors
(2, 2) predictor performs better than the simple 2-bit predictors even when the buffer size is infinite Section 2.3

37 Figure 2.7 Section 2.3

38 Multilevel branch predictors (tournament predictors)
Uses several branch-prediction tables together with an algorithm for choosing among the predictors Predictors may be: Global – based on behavior of branches executed before current branch Local – based only on behavior of current branch See figure 2.8 (next slide) Section 2.3

39 Figure 2.8 Section 2.3

40 Dynamic Scheduling Hardware rearranges instruction execution to reduce stalls while maintaining data flow and exception behavior Advantages Compiler can’t always determine data dependences Compiler can be simpler Allows code compiled on one pipeline to execute efficiently on a different pipeline Disadvantage – hardware complexity Section 2.4

41 Pipelining Limitations
Instructions are issued in program order If an instruction is stalled from issuing, all later instructions are also stalled Example:
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F12, F8, F14
Here ADD.D stalls waiting for F0 from the slow DIV.D, and SUB.D stalls behind it even though SUB.D depends on neither instruction Section 2.4

42 Out-of-order execution
Instructions fetched and placed into a queue of pending instructions ID stage separated into two parts Issue: Decode instructions, check for structural hazards (in-order issue) Read operands – wait until no data hazards, then read operands Instructions can begin execution out of order (out of order execution) Instructions can complete out of order (out-of-order completion) Section 2.4

43 Dynamic Scheduling using Tomasulo
Developed by Robert Tomasulo and used in the IBM 360/91 floating point unit Many variations on the ideas in modern processors General idea Track when operands for instructions are available Do register renaming to minimize WAW and WAR hazards Sections 2.4, 2.5

44 Register renaming
Before renaming:
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D   F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8
After renaming (S and T are temporary registers):
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D   F6, 0(R1)
SUB.D S, F10, F14
MUL.D T, F10, S
Given sufficient registers, the compiler can do this type of renaming. Sections 2.4, 2.5

45 Tomasulo’s register renaming
Reservation station fetches and holds an operand, eliminating need to get operand from a register Pending instructions designate the reservation station that will provide their input When there are successive pending writes to a register, only the last one actually updates the register Sections 2.4, 2.5

46 Tomasulo hardware Instruction queue – queue of instructions that have been fetched; waiting to be sent to a reservation station Reservation stations – holds instructions and their operands waiting to be issued; indicate other reservation stations holding instructions that will produce a needed value Common data bus – result bus that allows all reservation stations waiting for an operand to be loaded simultaneously Load/Store buffers – buffer for the memory unit Sections 2.4, 2.5

47 Figure 2.9 Section 2.4

48 Tomasulo instruction execution
Three steps Issue Execute Write result Sections 2.4, 2.5

49 Tomasulo Issue Step Get next instruction from instruction queue
If reservation station available, issue instruction with the operand values if they are currently in registers If operands are not in registers, store in reservation station the name of the reservation station holding the instruction whose execution will produce the needed operand (renaming) If no reservation station available, stall and stall subsequent instructions (instructions leave queue in order) Sections 2.4, 2.5

50 Tomasulo Execute Step If one or more operands are not available, monitor the Common Data Bus while waiting for them to be computed When an operand is placed on the CDB, it is read and stored in the reservation station When all operands are available, the operation can be executed Multiple instructions can become ready for execution and begin execution in the same clock cycle (unless they are competing for the same FU) Sections 2.4, 2.5

51 Tomasulo Execute Step Loads and stores require two-step execution
Step 1: effective address computed and stored in the load or store buffer Step 2: instruction sent to memory unit Note: Stores in the store buffer also have to wait for the value being stored Sections 2.4, 2.5

52 Tomasulo Execute Step Exception behavior is preserved by preventing any instructions from initiating execution until all branches preceding it have finished execution Guarantees that exceptions are only caused by instructions that really would have been executed Sections 2.4, 2.5

53 Tomasulo Write Step When result is ready and CDB is available, write result to the CDB Reservation stations, register file and store buffer are monitoring CDB and will read needed values off of CDB Sections 2.4, 2.5

54 Reservation Stations Op – operation to perform on the source operands
Qj, Qk – reservation station that will produce the corresponding source operand; value of zero indicates source operand is already available Vj, Vk – value of the source operands; if the corresponding Q field is non-zero, the V field is invalid Busy – indicates whether reservation station is busy Sections 2.4, 2.5
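The fields listed above might be sketched as a small data structure; using None in place of the zero/empty tag is a Python convenience, and the `capture` method models a CDB broadcast reaching this station:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    op: Optional[str] = None   # operation to perform on the source operands
    qj: Optional[str] = None   # station producing operand j (None = available)
    qk: Optional[str] = None   # station producing operand k (None = available)
    vj: float = 0.0            # operand values; valid only when Q field is None
    vk: float = 0.0
    busy: bool = False         # whether this station is in use

    def capture(self, source, value):
        """CDB broadcast: latch the value if waiting on this source tag."""
        if self.qj == source:
            self.vj, self.qj = value, None
        if self.qk == source:
            self.vk, self.qk = value, None

    def ready(self):
        """Ready to execute once both operands are available."""
        return self.busy and self.qj is None and self.qk is None
```

A station waiting on "Mult1" for its first operand becomes ready the cycle that result appears on the CDB.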

55 Register File In addition to the value of the register, there is a Q field for each register Qi – the reservation station that will produce the result to be stored in this register; zero if no pending result Sections 2.4, 2.5

56 Load/Store buffers A – initially holds the immediate (offset) field of the load or store; holds the effective address once it has been calculated Sections 2.4, 2.5

57 Common Data Bus Normal bus: data (value to be stored) + address (destination) CDB: data (value to be stored) + source (name of the reservation station or load buffer that is producing the result) Sections 2.4, 2.5

58 MIPS fp unit using Tomasulo
Figure 2.9 Instructions are issued FIFO from the instruction unit and placed in a reservation station, a load buffer, or a store buffer Results from either the FP units or the load unit are placed on the CDB, which goes to the FP register file, reservation stations, and store buffers FP adders implement addition and subtraction FP multipliers implement multiplication and division Sections 2.4, 2.5

59 Figure 2.10 Note these tables aren’t part of hardware; they simply illustrate the algorithm First L.D issued, executed and result written Second load issued, executed and will write result in next clock cycle Note: the names of the reservation stations are in the leftmost column Register file holds values and names of reservation stations Sections 2.4, 2.5

60 Advantages of Tomasulo
Distribution of hazard detection logic - Multiple reservation stations and the use of a CDB allow multiple instructions waiting on a single result to read the result and simultaneously begin execution (if they already have their other operand) Elimination of WAR and WAW hazards Sections 2.4, 2.5

61 Eliminating WAR hazards
Example: MUL.D F8, F2, F6 ADD.D F2, F6, F4 When MUL.D is issued, either the value of F2 is placed in the Vj field of the reservation station or Qj is set to the name of the reservation station that will produce the source operand When ADD.D is issued, the Qi field of the register file is set to the name of the reservation station holding the ADD.D Sections 2.4, 2.5

62 Eliminating WAW hazards
Example: MUL.D F2, F3, F4 ADD.D F2, F5, F6 First, MUL.D is issued and the Qi field in the register file is set to the reservation station holding the MUL.D Next, ADD.D is issued and the Qi field is modified to the name of the reservation station holding the ADD.D Sections 2.4, 2.5

63 Figure 2.11 ADD.D has finished and written its result because the value of F6 was copied into the reservation station holding the DIV.D Sections 2.4, 2.5

64 Figure 2.12 Source registers: rs, rt Destination register: rd
Immediate field (loads, stores): imm RS is reservation station data structure RegisterStat is the register status data structure Value returned by an FP unit or load unit is called the result Sections 2.4, 2.5

65 Figure 2.13 Shows Tomasulo at work on two iterations of a loop
Uses the prediction that the branch will be taken The two multiplies identify two different load buffers for their source for F0 The two stores name two different reservation stations as the source of the value to store Sections 2.4, 2.5

66 Handling loads and stores
Early versions of Tomasulo performed loads and stores in the order in which they were issued. However, a load and a store can be done in a different order provided they access different addresses To determine whether a load can be executed: the processor must check whether any uncompleted stores that precede the load share the same data memory address Sections 2.4, 2.5

67 Handling loads and stores
To determine whether a store can be executed: the processor must check whether any uncompleted loads or stores that precede the store share the same data memory address Checking whether a load or store can be executed is done by examining the A field of already issued loads and stores If the address of the load or store matches an A field in the store buffer (or, for a store, also in the load buffer), then the load or store is not sent to the memory unit Sections 2.4, 2.5
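The two checks above can be sketched as follows; the A fields of earlier, uncompleted loads and stores are modeled as plain lists of addresses, which is an illustrative simplification of the buffer hardware:

```python
def load_may_execute(load_addr, pending_store_addrs):
    """A load may go to memory only if no uncompleted earlier store
    has a matching effective address (A field)."""
    return load_addr not in pending_store_addrs

def store_may_execute(store_addr, pending_load_addrs, pending_store_addrs):
    """A store must also check earlier uncompleted loads, since it
    must not overwrite a value an earlier load has yet to read."""
    return (store_addr not in pending_load_addrs
            and store_addr not in pending_store_addrs)
```

A load or store that fails its check simply waits until the conflicting buffer entries complete.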

