1
Instruction-Level Parallelism
CSIT 301 (Blum)
2
Instruction-Level Parallelism
Instruction-level parallelism (ILP) occurs when a processor has more than one execution unit and thus can execute more than one instruction simultaneously. It should be distinguished from parallelism on a higher level, which might be accomplished by having more than one processor (multicore). It should also be distinguished from pipelining, which has various instructions in various stages but only one in the execution stage.
3
Pipeline Hazards
Recall the hazards and potential hazards of pipelining. Having multiple instructions in the pipeline means that the first instruction is not complete before the second instruction begins, which could be a problem if the instructions share data/registers. Another term used is dependency.
4
Dependency Categories
RAR: Read After Read. 1st instruction reads, 2nd instruction reads.
RAW: Read After Write. 1st instruction writes, 2nd instruction reads.
WAR: Write After Read. 1st instruction reads, 2nd instruction writes.
WAW: Write After Write. 1st instruction writes, 2nd instruction writes.
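The four categories can be checked mechanically. The following is a minimal sketch, assuming each instruction is described by two bit masks (bit n set when register rn is read or written); the names Instr, reg and classify are hypothetical and not taken from the slides or from Carter's book.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: bit n of a mask is set when register rn is used. */
typedef struct {
    uint32_t reads;   /* registers the instruction reads  */
    uint32_t writes;  /* registers the instruction writes */
} Instr;

/* One flag per dependency category; a pair may have several. */
enum { DEP_RAR = 1, DEP_RAW = 2, DEP_WAR = 4, DEP_WAW = 8 };

/* Mask bit for register rn. */
uint32_t reg(unsigned n) {
    assert(n < 32);
    return (uint32_t)1 << n;
}

/* Dependency flags of `second` on `first`, where `first` comes
   earlier in program order. Returns 0 if no register is shared. */
int classify(Instr first, Instr second) {
    int deps = 0;
    if (first.reads  & second.reads)  deps |= DEP_RAR;
    if (first.writes & second.reads)  deps |= DEP_RAW;
    if (first.reads  & second.writes) deps |= DEP_WAR;
    if (first.writes & second.writes) deps |= DEP_WAW;
    return deps;
}
```

For instance, describing ADD r1, r2, r3 as reads {r2, r3} and writes {r1}, and SUB r7, r1, r9 as reads {r1, r9} and writes {r7}, classify reports only a RAW on the pair.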
5
Bigger Problems
WAR and WAW are not really problems in a single, in-order pipeline. However, in an out-of-order pipeline or in multiple pipelines:
  The write may get ahead of the read in a WAR, turning it into a RAW.
  The second write may get ahead of the first write in a WAW, leaving the wrong value in the register for subsequent processing.
6
Example from Carter’s Book
LD r1, (r2)        Load Reg. 1 with memory location pointed to by Reg. 2
ADD r5, r6, r7     Add values in Reg. 6 and Reg. 7, put answer in Reg. 5
SUB r4, r1, r4     Subtract value in Reg. 4 from value in Reg. 1, put answer in Reg. 4
MUL r8, r9, r10    Multiply values in Reg. 9 and Reg. 10, put answer in Reg. 8
ST (r11), r4       Store value in Reg. 4 in memory location pointed to by Reg. 11
7
Example from Carter’s Book
Execution Unit 1      Execution Unit 2
LD r1, (r2)           ADD r5, r6, r7
SUB r4, r1, r4        MUL r8, r9, r10
ST (r11), r4
This program fragment can be broken into the parallel pieces shown above since they do not use the same registers.
8
Another Example from Carter’s Book
1. ADD r1, r2, r3
2. LD r4, (r5)
3. SUB r7, r1, r9
4. MUL r5, r4, r4
5. SUB r1, r12, r10
6. ST (r13), r14
7. OR r15, r14, r12
9
Type of access to registers in the sequential program fragment
Registers R1 and R4 have RAWs and Registers R1 and R5 have WARs
10
Hazards (RAW)
Instruction 3 must follow Instruction 1 because they have a RAW dependency in Register 1. Instruction 4 must follow Instruction 2 because they have a RAW dependency in Register 4.
12
Potential Hazards (WAR)
Instruction 5 (writes to R1) can at the earliest be simultaneous with Instruction 3 (reads from R1), because the read stage of an instruction precedes the write stage. Likewise, Instruction 4 (writes to R5) can at the earliest be simultaneous with Instruction 2 (reads from R5), but we already have the stronger condition that it must follow it.
13
Division of Labor
After identifying the various conditions on the ordering of instructions, the instructions can be divided up among the execution units in any way that respects the conditions. Instructions that must follow each other will be sent to the same execution unit. This ensures their order and also allows for bypassing.
14
With Two Execution Units
Execution Unit 1         Execution Unit 2
1. ADD r1, r2, r3        2. LD r4, (r5)
3. SUB r7, r1, r9        4. MUL r5, r4, r4
5. SUB r1, r12, r10      6. ST (r13), r14
7. OR r15, r14, r12
7 cycles reduced to 4 cycles
15
With Four Execution Units
Execution Unit 1     Execution Unit 2     Execution Unit 3     Execution Unit 4
ADD r1, r2, r3       LD r4, (r5)          ST (r13), r14        OR r15, r14, r12
SUB r7, r1, r9       MUL r5, r4, r4       SUB r1, r12, r10
7 cycles reduced to 2 cycles
Because of the RAW dependencies, we cannot do better than 2 cycles here, no matter how many execution units there are.
16
Another Distinction
In the two execution unit result, one has not changed the order of the instructions, apart from executing Instructions 1 and 2 simultaneously. In the four execution unit result, one has changed the order of the instructions: Instructions 6 and 7 occur in the first time cycle, before Instructions 3, 4 and 5, which are in the second. Therefore the benefit we gained from the latter assumes that the processor allows for out-of-order processing.
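The cycle counts for the seven-instruction fragment can be reproduced with a small simulation. This is only a sketch under stated assumptions: every instruction takes one cycle, a RAW or WAW forces the later instruction into a strictly later cycle, a WAR allows the same cycle, and dependencies through memory (the LD and ST) are ignored. The names prog, can_issue and schedule are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_INSTRS 7
#define REG(n) ((uint32_t)1 << (n))   /* mask bit for register rn */

typedef struct { uint32_t reads, writes; } Instr;

/* The fragment from the slides as register read/write sets. */
static const Instr prog[NUM_INSTRS] = {
    { REG(2)  | REG(3),  REG(1)  },   /* 1. ADD r1, r2, r3    */
    { REG(5),            REG(4)  },   /* 2. LD  r4, (r5)      */
    { REG(1)  | REG(9),  REG(7)  },   /* 3. SUB r7, r1, r9    */
    { REG(4),            REG(5)  },   /* 4. MUL r5, r4, r4    */
    { REG(12) | REG(10), REG(1)  },   /* 5. SUB r1, r12, r10  */
    { REG(13) | REG(14), 0       },   /* 6. ST  (r13), r14    */
    { REG(14) | REG(12), REG(15) },   /* 7. OR  r15, r14, r12 */
};

/* May instruction i issue in cycle c? cycle[j] is the cycle in which
   instruction j issued, or 0 if it has not issued yet. */
static int can_issue(int i, int c, const int cycle[NUM_INSTRS]) {
    for (int j = 0; j < i; j++) {
        int strict = (prog[j].writes & prog[i].reads)      /* RAW */
                  || (prog[j].writes & prog[i].writes);    /* WAW (assumed strict) */
        int war = (prog[j].reads & prog[i].writes) != 0;   /* WAR */
        if (strict && (cycle[j] == 0 || cycle[j] >= c)) return 0;
        if (war && cycle[j] == 0) return 0;   /* same cycle is acceptable */
    }
    return 1;
}

/* Number of cycles needed with `units` execution units. */
int schedule(int units, int out_of_order) {
    int cycle[NUM_INSTRS] = {0};
    int issued = 0, c = 0;
    assert(units > 0);
    while (issued < NUM_INSTRS) {
        int used = 0;
        c++;
        for (int i = 0; i < NUM_INSTRS && used < units; i++) {
            if (cycle[i] != 0) continue;          /* already issued */
            if (can_issue(i, c, cycle)) {
                cycle[i] = c;
                used++;
                issued++;
            } else if (!out_of_order) {
                break;   /* in order: a stalled instruction blocks the rest */
            }
        }
    }
    return c;
}
```

Under these assumptions schedule(2, 0) gives 4 cycles and schedule(4, 1) gives 2, matching the two slides, while schedule(4, 0) gives 3: without out-of-order issue, the extra execution units cannot pull Instructions 6 and 7 forward.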
17
Superscalar
A processor is said to be superscalar if it has multiple execution units and if the placement of the instructions into the parallel execution units is handled by the processor’s hardware. In other scenarios the hardware may have parallel execution units but the hardware does not determine the splitting up of the instructions among the execution units. The parallelization of instructions will occur at a higher level. It is done by the compiler.
18
Don’t have to recompile
A superscalar processor can give ILP (Instruction-Level Parallelism) to code that was compiled for a processor without ILP, without the code being recompiled, provided the new processor (with ILP) is backward compatible with the old processor (without ILP).
19
But consider recompiling
The hardware can only consider so many instructions at once – its window of instructions. The compiler can take a much broader view of the code and arrange instructions in a way that allows the superscalar processor to take greater advantage of ILP.
20
Loop Unrolling
One example of what a compiler might do to exploit ILP is loop unrolling. Branching is the bane of pipelining and parallelism. Loops have at least one possible branch with each iteration. Loop unrolling is doing two or more iterations' worth of work in one iteration. It reduces the number of branch considerations and promotes parallelism.
21
Loop Unrolling Example
Original:
for(i=0; i<100; i++){
    a[i] = b[i] + c[i];
}
Unrolled:
for(i=0; i<100; i+=2){
    a[i] = b[i] + c[i];
    a[i+1] = b[i+1] + c[i+1];
}
The unrolled version has half as many branches and so is easier to pipeline. The unrolled version will use more independent registers within each iteration and so takes greater advantage of ILP.
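The slide's example works because 100 is even. A sketch of how the same unrolling might be written for an arbitrary length follows; the function name add_unrolled and the cleanup step are additions for illustration, not part of the original slide.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise a = b + c, unrolled by a factor of two. */
void add_unrolled(int *a, const int *b, const int *c, size_t n) {
    size_t i = 0;
    assert(a && b && c);
    /* Main loop: two independent additions per branch. */
    for (; i + 1 < n; i += 2) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
    }
    /* Cleanup: the one leftover element when n is odd. */
    if (i < n) {
        a[i] = b[i] + c[i];
    }
}
```

The two statements in the main loop touch different elements, so neither has to wait for the other's result.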
22
Don’t try this at home
Loop unrolling requires knowledge of the processor’s capabilities (the number of execution units, the number of stages in the pipeline, etc.). If the programmer does not have this knowledge, the unrolling and other code optimization techniques should be left to the compiler.
23
Superscalar Versus Vector
A vector is essentially a one-dimensional array. A program that is optimized for the efficient handling of such arrays is said to be vectorized. In a superscalar processor, the execution units can be doing different operations on different data, whereas with vectorization the execution units would be doing the same operation on different data.
24
Vectorization
Vectorization could even be beneficial if there is only one execution unit because the same operation would be performed over and over again (on different data) so it would not have to be decoded over and over again. Vectorization is more restrictive but easier to implement than making the processor superscalar. But since it is exactly the kind of processing that arises so often, it is worth investing effort in doing it well.
25
SIMD
Recall that one of the features of MMX (MultiMedia eXtensions or Matrix Math eXtensions) was SIMD (Single Instruction Multiple Data), in which an individual instruction allowed one to operate on many pieces of data simultaneously (i.e. vectorization). In mathematics, matrices operate on vectors. SIMD instructions are important to the optimization of audio-visual processing, since such processing involves a lot of data that can be operated on in parallel.
26
Try this at home
While loop unrolling is probably best left to the compiler, there are some things the high-level programmer can consider to try to ensure that his or her code can be vectorized to the fullest extent. Recall that vectorization is concerned with the processing of arrays.
27
Whenever Possible
Use for loops instead of while loops
Make the number of iterations a power of 2
Avoid ifs
Avoid subroutine calls
In nested loops, make the loop with the larger number of iterations the inner loop
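A loop nest written along these guidelines might look like the sketch below. The dimensions ROWS and COLS and the function clamp_negatives are hypothetical; whether a particular compiler actually vectorizes the loop depends on the compiler and its settings.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 4       /* hypothetical: the smaller iteration count          */
#define COLS 1024    /* hypothetical: the larger count, and a power of 2   */

/* Replace every negative entry of m by zero. */
void clamp_negatives(float m[ROWS][COLS]) {
    assert(m != NULL);
    for (size_t r = 0; r < ROWS; r++) {        /* short loop outside */
        for (size_t c = 0; c < COLS; c++) {    /* long loop inside   */
            /* A conditional expression rather than an if statement, and
               no subroutine call in the loop body. */
            m[r][c] = (m[r][c] < 0.0f) ? 0.0f : m[r][c];
        }
    }
}
```

Both loops are for loops with fixed iteration counts, and the inner loop applies the same operation to consecutive array elements, which is the pattern vectorization is concerned with.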
28
Who bears the burden?
In superscalar processors, it is the hardware that provides the ILP. The compiler can help exploit the hardware’s capabilities. But the superscalar processor can yield ILP (on the fly) even for code compiled for a sequential processor. In Very Long Instruction Word (VLIW) processors, the burden for discovering ILP is on the compiler.
29
VLIW Processors
When the program is compiled, operations which can be done in parallel are sandwiched together in one long instruction, hence the name “very long instruction word” processor. The processor has to parse this long instruction, but it does not have to make decisions about what can be done in parallel since that has been done by the compiler.
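To make the "sandwiching" concrete, here is a toy sketch of the packing step. It assumes the compiler has already decided which cycle each operation belongs to; the slot count WIDTH, the LongWord type and the pack function are hypothetical, and real VLIW encodings are binary formats rather than strings.

```c
#include <assert.h>

#define WIDTH 4        /* hypothetical: operation slots per long instruction */
#define MAX_WORDS 8    /* hypothetical: capacity of the output buffer        */

typedef struct {
    const char *slot[WIDTH];   /* one operation per execution unit */
} LongWord;

/* Place operation i into the long word for cycle[i] (1-based), as chosen
   by the compiler beforehand. Unused slots stay "NOP". Returns the number
   of long words used, or -1 if a word would need more than WIDTH slots. */
int pack(const char *const ops[], const int cycle[], int n,
         LongWord out[MAX_WORDS]) {
    int used[MAX_WORDS] = {0};
    int words = 0;
    for (int w = 0; w < MAX_WORDS; w++)
        for (int s = 0; s < WIDTH; s++)
            out[w].slot[s] = "NOP";
    for (int i = 0; i < n; i++) {
        int w = cycle[i] - 1;
        assert(w >= 0 && w < MAX_WORDS);
        if (used[w] == WIDTH) return -1;
        out[w].slot[used[w]++] = ops[i];
        if (w + 1 > words) words = w + 1;
    }
    return words;
}
```

The processor only has to split each long word into its slots. Note that WIDTH is baked into the packed code, which is why code packed for one slot count does not automatically suit hardware with another.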
30
VLIW Pros and Cons
The good thing about VLIW processors is that they depend on the compiler (pre-processor). The bad thing about VLIW processors is that they depend on the compiler (pre-processor). ???
31
VLIW Pro
Placing the burden for parallelizing the code on software allows the hardware to be simpler. The instruction-issue logic circuitry that would determine parallelization in the superscalar processor now does little more than parsing. This allows the hardware:
  To be cheaper
  To use less power
  And possibly to be faster
32
VLIW Pro
The simplification of hardware puts it along the same lines as the RISC philosophy. The reduced hardware leads to a reduction in power consumption. E.g. computers based on the Crusoe family of processors from Transmeta can go almost all day without having to recharge the battery.
33
VLIW Pro
The compiler can take a more global view when looking for parallelization. The superscalar processor has a window, a limited number of instructions it sees, and it looks for ILP within that window. This is not a real advantage of VLIW over superscalar, since code on a superscalar processor must also be compiled and that compiler can also look for ILP on a more global scale.
34
VLIW Con
The dependence on the compiler for ILP can lead to backward compatibility issues. Within a family of superscalar processors, one can change the micro-architecture (hardware implementation) without changing the architecture. Compiled code is architecture specific but not micro-architecture specific.
35
VLIW Con (Cont.)
The new superscalar micro-architecture can take advantage (to some extent) of any new ILP capability without recompiling the code. In a VLIW processor, more of the hardware details must be exposed to the software, and thus changes in the hardware require changes in the software (recompiling). The old VLIW-compiled code may not work on a new VLIW processor.
36
Hyper-Threading Technology
37
HT Technology
38
Thread-Level Parallelism
“Hyper-Threading Technology provides thread-level-parallelism (TLP) on each processor resulting in increased utilization of processor execution resources.”
“Hyper-Threading Technology makes a single physical processor appear as two logical processors ….”
39
EPIC
The Itanium processors have a feature known as EPIC.
“EPIC (Explicitly Parallel Instruction Computing) is a 64-bit microprocessor instruction set, jointly defined and designed by Hewlett Packard and Intel, that provides up to 128 general and floating point unit registers and uses speculative loading, predication, and explicit parallelism to accomplish its computing tasks.”
40
Need a compiler to take advantage
One feature of Itanium is its use of a "smart compiler" to optimize how instructions are sent to the processor. This approach allows Itanium and future IA-64 microprocessors to process more instructions per clock cycle (IPCs). IPCs can be used along with clock speed in terms of megahertz (MHz) to indicate a microprocessor's overall performance.
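As a rough illustration of how the two figures combine, the rate of instruction execution can be taken as IPC times clock speed. The two processors below are hypothetical and their numbers are not from the slides.

```latex
\[
\text{instructions per second} = \text{IPC} \times f_{\text{clock}}
\]
\[
2 \times 800\ \text{MHz} = 1600 \text{ million instructions/s},
\qquad
1 \times 1200\ \text{MHz} = 1200 \text{ million instructions/s}
\]
```

On this measure, a processor with a lower clock speed but a higher IPC can come out ahead.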
41
References
Computer Architecture, Nicholas Carter
PC Hardware in a Nutshell, Thompson and Thompson