Published by Nathen Brede. Modified over 9 years ago.
1
CSC 370 (Blum)1 Instruction-Level Parallelism
2
Instruction-Level Parallelism
Instruction-level parallelism (ILP) occurs when a processor has more than one execution unit and can therefore execute more than one instruction simultaneously.
–It should be distinguished from higher-level parallelism, which might be achieved by having more than one processor.
–It should be distinguished from pipelining, which has various instructions in various stages but only one in the execution stage.
3
Pipeline Hazards
Recall the hazards and potential hazards of pipelining.
–Having multiple instructions in the pipeline means that the first instruction is not complete before the second begins, which can be a problem if the instructions share data or registers.
–Such a relationship between instructions is also called a dependency.
4
Dependency Categories
RAR: Read After Read
–1st instruction reads, 2nd instruction reads
RAW: Read After Write
–1st instruction writes, 2nd instruction reads
WAR: Write After Read
–1st instruction reads, 2nd instruction writes
WAW: Write After Write
–1st instruction writes, 2nd instruction writes
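The four categories can be checked mechanically once each instruction's registers are known. The following is a minimal sketch, not taken from the slides: the `Instr` record, the `Dep` flags and the `classify` function are hypothetical names, and each instruction is assumed to have at most one destination and two source registers. RAR is left out because two reads never constrain each other.

```c
/* Hypothetical instruction record (an assumption for this sketch):
   one destination register and up to two source registers.
   NO_REG marks an unused slot. */
#define NO_REG (-1)

typedef struct {
    int dst;
    int src1;
    int src2;
} Instr;

/* Bit flags so that one pair of instructions can carry several dependencies. */
typedef enum { DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 } Dep;

/* Nonzero if the instruction reads the given register. */
static int reads(const Instr *in, int reg) {
    return reg != NO_REG && (in->src1 == reg || in->src2 == reg);
}

/* Returns the set of dependencies that `second` has on `first`,
   where `first` comes earlier in program order. */
int classify(const Instr *first, const Instr *second) {
    int mask = 0;
    if (reads(second, first->dst)) mask |= DEP_RAW;  /* first writes, second reads  */
    if (reads(first, second->dst)) mask |= DEP_WAR;  /* first reads, second writes  */
    if (first->dst != NO_REG && first->dst == second->dst)
        mask |= DEP_WAW;                             /* both write the same register */
    return mask;
}
```

For example, `ADD r1, r2, r3` followed by `SUB r7, r1, r9` would be encoded as `{1, 2, 3}` and `{7, 1, 9}`, and `classify` reports a RAW on the pair.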
5
Bigger Problems
WAR and WAW are not really problems in a single, in-order pipeline. However, in an out-of-order pipeline or with multiple pipelines:
–The write may get ahead of the read in a WAR, so the read picks up the new value as if the pair were a RAW.
–The second write may get ahead of the first write in a WAW, leaving the wrong value in the register for subsequent processing.
6
Example from Carter’s Book
LD  r1, (r2)      Load Reg. 1 from the memory location pointed to by Reg. 2
ADD r5, r6, r7    Add the values in Reg. 6 and Reg. 7; put the answer in Reg. 5
SUB r4, r1, r4    Subtract the value in Reg. 4 from the value in Reg. 1; put the answer in Reg. 4
MUL r8, r9, r10   Multiply the values in Reg. 9 and Reg. 10; put the answer in Reg. 8
ST  (r11), r4     Store the value in Reg. 4 in the memory location pointed to by Reg. 11
7
Example from Carter’s Book
Execution Unit 1
LD  r1, (r2)
SUB r4, r1, r4
ST  (r11), r4
Execution Unit 2
ADD r5, r6, r7
MUL r8, r9, r10
This program fragment can be split into the two parallel pieces shown above because the pieces do not use any of the same registers.
8
Another Example from Carter’s Book
1. ADD r1, r2, r3
2. LD  r4, (r5)
3. SUB r7, r1, r9
4. MUL r5, r4, r4
5. SUB r1, r12, r10
6. ST  (r13), r14
7. OR  r15, r14, r12
9
Type of access to registers in the sequential program fragment
Registers R1 and R4 have RAWs, and Registers R1 and R5 have WARs.
10
Hazards (RAW)
Instruction 3 must follow Instruction 1 because they have a RAW dependency in Register 1.
Instruction 4 must follow Instruction 2 because they have a RAW dependency in Register 4.
11
Type of access to registers in the sequential program fragment
Registers R1 and R4 have RAWs, and Registers R1 and R5 have WARs.
12
Potential Hazards (WAR)
Instruction 5 (writes to R1) can at best be simultaneous with Instruction 3 (reads from R1), never ahead of it, because the read stage of an instruction precedes the write stage.
Instruction 4 can at best be simultaneous with Instruction 2, but we already have the stronger condition that it must follow it.
13
Division of Labor
After identifying the various conditions on the ordering of instructions, the instructions can be divided up among the execution units in any way that respects the conditions.
Instructions that must follow each other will be sent to the same execution unit. This ensures their order and also allows for bypassing.
14
With Two Execution Units
Execution Unit 1          Execution Unit 2
1. ADD r1, r2, r3         2. LD  r4, (r5)
3. SUB r7, r1, r9         4. MUL r5, r4, r4
5. SUB r1, r12, r10       6. ST  (r13), r14
7. OR  r15, r14, r12
7 cycles (sequential) versus 4 cycles (two execution units)
15
With Four Execution Units
          Unit 1            Unit 2            Unit 3              Unit 4
Cycle 1   ADD r1, r2, r3    LD  r4, (r5)      ST  (r13), r14      OR r15, r14, r12
Cycle 2   SUB r7, r1, r9    MUL r5, r4, r4    SUB r1, r12, r10
7 cycles (sequential) versus 2 cycles (four execution units)
Because of the RAW dependencies, we cannot do better than 2 cycles here, no matter how many execution units there are.
16
Another Distinction
In the two-execution-unit result, the order of the instructions has not changed, apart from each pair (1 and 2, 3 and 4, 5 and 6) executing simultaneously.
In the four-execution-unit result, the order of the instructions has changed: Instructions 6 and 7 occur in the first cycle, before Instructions 3, 4 and 5, which are in the second.
Therefore the benefit gained from the latter assumes that the processor allows out-of-order processing.
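The two schedules can be reproduced with a small greedy scheduler. This is a simplified sketch under stated assumptions, not the method from Carter's book: every instruction takes one cycle, a RAW or WAW forces the later instruction into a later cycle, a WAR allows the same cycle (reads precede writes), and memory dependencies are ignored. The names `Instr`, `program` and `schedule` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define N_INSTR 7
#define NO_REG  (-1)

/* Hypothetical encoding: destination register and up to two source registers. */
typedef struct { int dst, src1, src2; } Instr;

/* The seven-instruction fragment, in program order. */
static const Instr program[N_INSTR] = {
    {      1,  2,      3 },  /* 1. ADD r1, r2, r3   */
    {      4,  5, NO_REG },  /* 2. LD  r4, (r5)     */
    {      7,  1,      9 },  /* 3. SUB r7, r1, r9   */
    {      5,  4,      4 },  /* 4. MUL r5, r4, r4   */
    {      1, 12,     10 },  /* 5. SUB r1, r12, r10 */
    { NO_REG, 13,     14 },  /* 6. ST  (r13), r14   */
    {     15, 14,     12 },  /* 7. OR  r15, r14, r12 */
};

static bool reads(const Instr *in, int reg) {
    return reg != NO_REG && (in->src1 == reg || in->src2 == reg);
}

/* Assigns each instruction a 1-based cycle and returns the total cycle count.
   width    = number of execution units (instructions per cycle)
   in_order = if true, an instruction may not issue before its predecessor */
int schedule(int width, bool in_order, int cycle[N_INSTR]) {
    int used[N_INSTR + 2] = {0};   /* instructions already placed in each cycle */
    int total = 0;
    assert(width >= 1);
    for (int j = 0; j < N_INSTR; j++) {
        int earliest = 1;
        for (int i = 0; i < j; i++) {
            const Instr *a = &program[i], *b = &program[j];
            bool raw = reads(b, a->dst);
            bool waw = a->dst != NO_REG && a->dst == b->dst;
            bool war = reads(a, b->dst);
            /* RAW and WAW: must be strictly later than the earlier instruction. */
            if ((raw || waw) && cycle[i] + 1 > earliest) earliest = cycle[i] + 1;
            /* WAR: the write may share a cycle with the read but not precede it. */
            if (war && cycle[i] > earliest) earliest = cycle[i];
        }
        if (in_order && j > 0 && cycle[j - 1] > earliest) earliest = cycle[j - 1];
        int c = earliest;
        while (used[c] >= width) c++;   /* first cycle with a free execution unit */
        used[c]++;
        cycle[j] = c;
        if (c > total) total = c;
    }
    return total;
}
```

Under this model, `schedule(2, true, cycle)` returns 4 and `schedule(4, false, cycle)` returns 2, matching the two slides. With four units but `in_order` set, the same model needs 3 cycles, which is one way to see why the 2-cycle result depends on out-of-order processing.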
17
Superscalar
A processor is said to be superscalar if it has multiple execution units and if the placement of instructions into the parallel execution units is handled by the processor’s hardware.
–In other scenarios the hardware may have parallel execution units but does not decide how the instructions are split among them. The parallelization then occurs at a higher level: it is done by the compiler.
18
Don’t have to recompile
A superscalar processor can give ILP (Instruction-Level Parallelism) to code that was compiled for a processor without ILP, and it can do so without the code being recompiled.
–Provided the new processor (with ILP) is backward compatible with the old processor (without ILP).
19
But consider recompiling
The hardware can only consider so many instructions at once: its window of instructions.
The compiler can take a much broader view of the code and arrange instructions in a way that allows the superscalar processor to take greater advantage of ILP.
20
Loop Unrolling
One example of what a compiler might do to exploit ILP is loop unrolling.
Branching is the bane of pipelining and parallelism. Loops have at least one branch with each iteration.
Loop unrolling is doing two or more iterations’ worth of work in one iteration. It reduces the number of branches to be evaluated and promotes parallelism.
21
Loop Unrolling Example
for(i=0; i<100; i++){
    a[i] = b[i] + c[i];
}
for(i=0; i<100; i+=2){
    a[i] = b[i] + c[i];
    a[i+1] = b[i+1] + c[i+1];
}
The unrolled version has half as many branches and so is easier to pipeline.
The unrolled version will use more independent registers within each iteration and so takes greater advantage of ILP.
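The slide's unrolled loop works because 100 is a multiple of 2. For an arbitrary length, an unrolled loop usually needs a short cleanup loop for the leftover elements. The sketch below shows one way to do that; the function names `add_rolled` and `add_unrolled4` and the unroll factor of 4 are assumptions chosen for illustration, not something the slides prescribe.

```c
#include <stddef.h>

/* Straightforward version: one addition and one loop branch per element. */
void add_rolled(float *a, const float *b, const float *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Unrolled by four: one loop branch per four elements.
   The four statements in the body are independent of one another,
   so they are candidates for parallel execution units. */
void add_unrolled4(float *a, const float *b, const float *c, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }
    /* Cleanup loop: handles the last n % 4 elements. */
    for (; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

Both functions produce the same array for any `n`; whether the unrolled one is actually faster depends on the processor and on what the compiler already does on its own.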
22
Don’t try this at home
Loop unrolling requires knowledge of the processor’s capabilities (the number of execution units, the number of stages in the pipeline, etc.).
If the programmer does not have this knowledge, the unrolling and other code optimization techniques should be left to the compiler.
23
Superscalar Versus Vector
A vector is essentially a one-dimensional array. A program that is optimized for the efficient handling of such arrays is said to be vectorized.
In a superscalar processor, the execution units can be doing different operations on different data, whereas with vectorization the execution units would be doing the same operation on different data.
24
Vectorization
Vectorization can be beneficial even if there is only one execution unit: the same operation is performed over and over again (on different data), so it does not have to be decoded over and over again.
Vectorization is more restrictive but easier to implement than making the processor superscalar. And since it is exactly the kind of processing that arises so often, it is worth investing effort in doing it well.
25
SIMD
Recall that one of the features of MMX (MultiMedia eXtensions or Matrix Math eXtension) was SIMD (Single Instruction Multiple Data), in which one instruction operates on many pieces of data simultaneously (i.e. vectorization).
–In mathematics, matrices operate on vectors.
These instructions are important to the optimization of audio-visual data, since such processing involves a lot of data that can be operated on in parallel.
26
Try this at home
While loop unrolling is probably best left to the compiler, there are some things the high-level programmer can consider to try to ensure that his or her code can be vectorized to the fullest extent.
Recall that vectorization is concerned with the processing of arrays.
27
Whenever Possible
1. Use for loops instead of while loops
2. Make the number of iterations a power of 2
3. Avoid ifs
4. Avoid subroutine calls
5. In nested loops, make the loop with the larger number of iterations the inner loop
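A loop nest that follows these guidelines might look like the sketch below. The function `scale_rows`, the array sizes and the operation itself are made up for illustration; whether a given compiler actually vectorizes the inner loop depends on the compiler, its flags and the target processor.

```c
#include <stddef.h>

#define ROWS 4       /* the shorter dimension goes in the outer loop (item 5) */
#define COLS 1024    /* the longer dimension, a power of 2 (item 2)            */

/* Multiplies every row of a ROWS x COLS array (stored row by row)
   by that row's gain and adds a common bias. */
void scale_rows(float *out, const float *in, const float *gain, float bias) {
    for (size_t r = 0; r < ROWS; r++) {        /* counted for loops (item 1)  */
        for (size_t c = 0; c < COLS; c++) {
            /* Same operation on consecutive elements,
               no ifs and no subroutine calls (items 3 and 4). */
            out[r * COLS + c] = in[r * COLS + c] * gain[r] + bias;
        }
    }
}
```

The inner loop applies one operation to 1024 adjacent elements, which is the pattern the vectorization slides describe: the same operation on different data.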
28
Who bears the burden?
In superscalar processors, it is the hardware that provides the ILP. The compiler can help exploit the hardware’s capabilities, but the superscalar processor can yield ILP (on the fly) even for code compiled for a sequential processor.
In Very Long Instruction Word (VLIW) processors, the burden of discovering ILP is on the compiler.
29
VLIW Processors
When the program is compiled, operations that can be done in parallel are sandwiched together in one long instruction, hence the name “very long instruction word” processor.
The processor has to parse this long instruction, but it does not have to make decisions about what can be done in parallel, since that has been done by the compiler.
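One way to picture a very long instruction word is as a fixed number of operation slots that the hardware executes together without checking them against each other. The sketch below is a toy model under stated assumptions, not the encoding of any real VLIW processor: the names `Slot`, `Bundle` and `execute_bundle`, the four-slot width, the 16 registers and the three-operation instruction set are all hypothetical.

```c
#define NUM_REGS 16
#define SLOTS    4

typedef enum { OP_NOP, OP_ADD, OP_SUB, OP_MUL } Op;

/* One operation inside the long instruction word. */
typedef struct { Op op; int dst, src1, src2; } Slot;

/* The long instruction word: SLOTS operations that the compiler
   has already determined can be done in parallel. */
typedef struct { Slot slot[SLOTS]; } Bundle;

/* Executes all slots "simultaneously": every slot reads its sources from the
   old register values, and only then are the results written back.
   The hardware model does no dependency checking at all. */
void execute_bundle(int regs[NUM_REGS], const Bundle *b) {
    int result[SLOTS] = {0};
    for (int s = 0; s < SLOTS; s++) {          /* read-and-compute phase */
        const Slot *sl = &b->slot[s];
        switch (sl->op) {
        case OP_ADD: result[s] = regs[sl->src1] + regs[sl->src2]; break;
        case OP_SUB: result[s] = regs[sl->src1] - regs[sl->src2]; break;
        case OP_MUL: result[s] = regs[sl->src1] * regs[sl->src2]; break;
        case OP_NOP: break;                    /* empty slot: nothing to do */
        }
    }
    for (int s = 0; s < SLOTS; s++) {          /* write-back phase */
        if (b->slot[s].op != OP_NOP) {
            regs[b->slot[s].dst] = result[s];
        }
    }
}
```

Because reads happen before writes, a bundle may contain a WAR pair (one slot reads r1 while another writes r1). A RAW or WAW pair inside one bundle would give the wrong answer, and in this model nothing stops it: avoiding such bundles is the compiler's job, and unused slots are filled with `OP_NOP`.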
30
VLIW Pros and Cons
The good thing about VLIW processors is that they depend on the compiler (preprocessor).
The bad thing about VLIW processors is that they depend on the compiler (preprocessor).
???
31
VLIW Pro
Placing the burden of parallelizing the code on software allows the hardware to be simpler. The instruction issue logic that would determine parallelization in a superscalar processor now does little more than parsing.
This allows the hardware
–To be cheaper
–To use less power
–And possibly to be faster.
32
VLIW Pro
The simplification of hardware puts it along the same lines as the RISC philosophy.
The reduced hardware leads to a reduction in power consumption.
–E.g. computers based on the Crusoe family of processors from Transmeta can go almost all day without having to recharge the battery.
33
VLIW Pro
The compiler can take a more global view when looking for parallelization.
–The superscalar processor has a window, a limited number of instructions that it sees, and it looks for ILP only within that window.
This is not a real advantage of VLIW over superscalar, since code for a superscalar processor must also be compiled, and that compiler can also look for ILP on a more global scale.
34
VLIW Con
The dependence on the compiler for ILP can lead to backward compatibility issues.
Within a family of superscalar processors, one can change the micro-architecture (hardware implementation) without changing the architecture.
Compiled code is architecture specific but not micro-architecture specific.
35
VLIW Con (Cont.)
The new superscalar micro-architecture can take advantage (to some extent) of any new ILP capability without recompiling the code.
In a VLIW processor, more of the hardware details must be exposed to the software. Thus changes in the hardware require changes in the software, namely recompiling.
The old VLIW-compiled code may not work on a new VLIW processor.
36
Hyper-Threading Technology
37
HT Technology
38
Thread-Level Parallelism
“Hyper-Threading Technology provides thread-level-parallelism (TLP) on each processor resulting in increased utilization of processor execution resources.”
“Hyper-Threading Technology makes a single physical processor appear as two logical processors ….”
39
EPIC
The new Itanium processors have a feature known as EPIC.
“EPIC (Explicitly Parallel Instruction Computing) is a 64-bit microprocessor instruction set, jointly defined and designed by Hewlett Packard and Intel, that provides up to 128 general and floating point unit registers and uses speculative loading, predication, and explicit parallelism to accomplish its computing tasks.”
40
Need a compiler to take advantage
One feature of Itanium is its use of a "smart compiler" to optimize how instructions are sent to the processor.
This approach allows Itanium and future IA-64 microprocessors to process more instructions per clock cycle (IPCs).
–IPCs can be used along with clock speed in terms of megahertz (MHz) to indicate a microprocessor's overall performance.
41
References
Computer Architecture, Nicholas Carter
http://www.whatis.com
http://www.webopedia.com
PC Hardware in a Nutshell, Thompson and Thompson
http://www.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm