Published by Aubrey Sullivan. Modified over 9 years ago.
1
ELEC 5200-001/6200-001 Computer Architecture and Design
Fall 2014
Instruction-Level Parallelism
Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
http://www.eng.auburn.edu/~vagrawal
vagrawal@eng.auburn.edu
Fall 2014, Nov 19... ELEC 5200-001/6200-001 Lecture 12
2
A Computer System
[Figure: block diagram. Labels: Processor, Cache, Main memory, I/O controller, Disk, Graphics output, Network, Memory-I/O bus, Interrupts.]
3
Advanced Architectures: ILP
- Instruction-level parallelism (ILP): multiple instructions are fetched and executed simultaneously.
- ILP is used in addition to pipelining.
- Processors with ILP are called multiple-issue processors: multiple instructions are launched in one clock cycle.
- Two ways:
  - MIMD: Multiple Instructions, Multiple Data
    - Superpipeline
    - Superscalar: dynamic multiple issue
    - Very long instruction word (VLIW): static multiple issue
  - SIMD: Single Instruction, Multiple Data
    - Vector processor
4
Superpipeline and Superscalar
[Figure: timing diagrams of the IF, ID, EX, MEM, WB stages against system clock cycles 0 to 8 for three organizations.]
- Pipeline: 1 instruction/cycle
- Superpipeline (pipeline clock is twice as fast as the system clock): 2 instructions per cycle
- Superscalar: 2 (or more) instructions/cycle
5
A Static Two-Issue MIPS Pipeline
- Read two instructions per cycle:
  - An ALU or branch instruction, and
  - A load or store instruction
- Insert one nop if the above pair is not available.
- Added hardware (Figure 4.69, page 336):
  - A second instruction memory
  - Additional input/output ports in the register file
  - Additional ALU in the execute stage for address calculation
6
An Example (Page 337)
  Loop: lw   $t0, 0($s1)
        addu $t0, $t0, $s2
        sw   $t0, 0($s1)
        addi $s1, $s1, -4
        bne  $s1, $0, Loop
7
Static Two-Issue Execution
        ALU or branch instruction   Data transfer instruction   Clock cycle
  Loop: nop                         lw $t0, 0($s1)              1
        addi $s1, $s1, -4           nop                         2
        addu $t0, $t0, $s2          nop                         3
        bne  $s1, $0, Loop          sw $t0, 4($s1)              4
Note the code reordering and the change in the sw argument.
CPI = 4/5 = 0.8 > 0.5 (ideal)
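The CPI figure on this slide can be checked with a few lines of Python. This is only a sketch: the list `schedule` and the helper `cpi` are names made up for illustration, and each tuple simply transcribes one clock cycle of the two-issue schedule.

```python
# One (ALU/branch slot, data-transfer slot) pair per clock cycle,
# transcribed from the slide. "nop" marks an empty issue slot.
schedule = [
    ("nop", "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4", "nop"),
    ("addu $t0, $t0, $s2", "nop"),
    ("bne $s1, $0, Loop", "sw $t0, 4($s1)"),
]


def cpi(slots):
    """Clock cycles divided by the number of non-nop instructions."""
    cycles = len(slots)
    useful = sum(1 for pair in slots for op in pair if op != "nop")
    return cycles / useful


two_issue_cpi = cpi(schedule)
print(two_issue_cpi)  # 0.8
```

Three of the eight issue slots hold a nop, which is why the loop stays above the ideal two-issue CPI of 0.5.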
8
Loop Unrolling (Index Multiple of 4)
        ALU or branch instruction   Data transfer instruction   Clock cycle
  Loop: addi $s1, $s1, -16          lw $t0, 0($s1)              1
        nop                         lw $t1, 12($s1)             2
        addu $t0, $t0, $s2          lw $t2, 8($s1)              3
        addu $t1, $t1, $s2          lw $t3, 4($s1)              4
        addu $t2, $t2, $s2          sw $t0, 16($s1)             5
        addu $t3, $t3, $s2          sw $t1, 12($s1)             6
        nop                         sw $t2, 8($s1)              7
        bne  $s1, $0, Loop          sw $t3, 4($s1)              8
CPI = 8/14 = 0.57 > 0.5 (ideal)
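As a high-level analogue of the unrolled schedule, the sketch below adds a scalar to every array element, first one element per iteration and then four per iteration. The function names are hypothetical, and Python lists stand in for the memory addressed through $s1; the point is only that the unrolled body has four independent load/add/store chains and one index update.

```python
def add_scalar_rolled(a, s):
    """One element per iteration, walking down from the last index."""
    out = list(a)
    i = len(out) - 1
    while i >= 0:
        out[i] += s
        i -= 1
    return out


def add_scalar_unrolled4(a, s):
    """Four elements per iteration; the length must be a multiple of 4."""
    if len(a) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = list(a)
    i = len(out)
    while i > 0:
        i -= 4  # one index update serves four elements
        # Four independent "loads" into separate temporaries (t0..t3),
        # so no add has to wait on another add.
        t0, t1, t2, t3 = out[i + 3], out[i + 2], out[i + 1], out[i]
        t0 += s
        t1 += s
        t2 += s
        t3 += s
        out[i + 3], out[i + 2], out[i + 1], out[i] = t0, t1, t2, t3
    return out


data = [1, 2, 3, 4, 5, 6, 7, 8]
rolled = add_scalar_rolled(data, 10)
unrolled = add_scalar_unrolled4(data, 10)
unrolled_cpi = 8 / 14  # 8 cycles for 14 non-nop instructions
print(rolled == unrolled, round(unrolled_cpi, 2))
```

Using separate temporaries plays the same role as the distinct registers $t0 to $t3 in the MIPS schedule.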
9
VLIW: Very Long Instruction Word
- Static multiple issue; ILP is determined by the compiler.
- The datapath contains multiple execution units.
- The compiler groups instructions that have no data or resource conflicts for parallel execution.
- Grouped instructions are packed into the very long words of a wide instruction memory.
- The speedup benefit of VLIW is highly program dependent.
References:
- J. A. Fisher, "Very Long Instruction Word Architecture and ELI-512," Proc. 10th Symp. on Computer Architecture, Stockholm, June 1983, pp. 478-490.
- J. A. Fisher, P. Faraboschi and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools, Morgan Kaufmann.
10
Superscalar: Dynamic Scheduling and Out-of-Order Execution
[Figure: block diagram. An instruction fetch and decode unit feeds four reservation stations; these feed the functional units (integer, floating point, load/store), which feed a commit unit. Annotations: In-order issue, Out-of-order issue, Out-of-order execution, In-order commit.]
11
Out of Order Execution (OOE)
- A procedural programming language sequences instructions.
- Sequencing assumes an order of execution: no parallelism.
- OOE must preserve the correctness of the result.
- Principle: two instructions can be executed in parallel if they do not have dependences.
12
RAW Dependence
Read after write (RAW): a dependent instruction reads from a register being written to by another instruction.
Example:
  add $s1, $s2, $s3
  sub $s2, $s1, $s3
sub has a RAW dependence on add.
13
WAR Dependence
Write after read (WAR): a dependent instruction writes to a register being read by another instruction.
Example:
  add $s1, $s2, $s3
  sub $s2, $s1, $s3
sub has a WAR dependence on add.
14
WAW Dependence
Write after write (WAW): one instruction writes to a register being written to by another instruction.
Example:
  add $s2, $s2, $s3
  sub $s2, $s1, $s3
sub has a WAW dependence on add.
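The three definitions can be turned into a small checker. The sketch below is illustrative only: `Instr` and `dependences` are hypothetical names, and an instruction is reduced to one destination register and a tuple of source registers.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Dependences of `second` on `first`, where `first` is earlier in program order."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")  # second writes what first reads
    if second.dest == first.dest:
        found.add("WAW")  # both write the same register
    return found


# Example pair from the RAW and WAR slides.
add1 = Instr("add", "$s1", ("$s2", "$s3"))
sub1 = Instr("sub", "$s2", ("$s1", "$s3"))
raw_war = dependences(add1, sub1)

# Example pair from the WAW slide.
add2 = Instr("add", "$s2", ("$s2", "$s3"))
sub2 = Instr("sub", "$s2", ("$s1", "$s3"))
waw_pair = dependences(add2, sub2)

print(sorted(raw_war), sorted(waw_pair))
```

Run on the slide examples, the first pair reports both RAW and WAR, as the two slides state. The WAW pair also reports WAR, because add reads $s2 before sub overwrites it.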
15
Superscalar Instruction Issue
Rules:
- RAW dependence: if any operand is being written, do not issue.
- WAR dependence: if the result register is being read, do not issue.
- WAW dependence: if the result register is being written, do not issue.
Scoreboard: cycle-by-cycle record of registers and execution units showing how many instructions are using them.
- Example 1: In-order issue (next 2 slides).
- Example 2: Out-of-order issue (3rd slide).
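A minimal sketch of these issue rules follows, assuming a scoreboard that keeps one "being read" counter and one "being written" counter per register. The class name `Scoreboard`, its methods, and the eight-register size are assumptions for illustration, not the lecture's implementation; execution units are not modeled.

```python
class Scoreboard:
    """Per-register counters of in-flight readers and writers."""

    def __init__(self, num_regs=8):
        self.reading = [0] * num_regs
        self.writing = [0] * num_regs

    def can_issue(self, dest, srcs):
        if any(self.writing[r] for r in srcs):
            return False  # RAW: an operand is being written
        if self.reading[dest]:
            return False  # WAR: the result register is being read
        if self.writing[dest]:
            return False  # WAW: the result register is being written
        return True

    def issue(self, dest, srcs):
        for r in srcs:
            self.reading[r] += 1
        self.writing[dest] += 1

    def retire(self, dest, srcs):
        for r in srcs:
            self.reading[r] -= 1
        self.writing[dest] -= 1


sb = Scoreboard()
sb.issue(3, (0, 1))                    # R3 = R0 * R1
sb.issue(4, (0, 2))                    # R4 = R0 + R2
ok_before = sb.can_issue(6, (1, 4))    # R6 = R1 + R4: R4 is still being written
sb.retire(4, (0, 2))
ok_after = sb.can_issue(6, (1, 4))
print(ok_before, ok_after)  # False True
```

An instruction that names the same source twice, such as R1 = R4 + R4, increments that register's read counter twice under this sketch.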
16
Dynamic Scheduling
Consider an example:
- First with in-order issue
- Then with out-of-order issue
Assume:
- Up to two instructions are fetched in a cycle.
- The instruction register can hold two instructions.
- An instruction is issued in its decode cycle, or must wait until there is no RAW, WAR or WAW dependence.
- An instruction can retire two or three cycles after it is issued.
17
  Cycle  Decoded                            Issued  Retired
  1      1: R3 = R0 * R1, 2: R4 = R0 + R2   1, 2
  2      3: R5 = R0 + R1, 4: R6 = R1 + R4   3, -
  3
  4                                                 1, 2, 3
  5      5: R7 = R1 * R2                    4, 5
  6      6: R1 = R0 - R2                    -
  7                                                 4
  8                                                 5
  9      7: R3 = R3 * R1                    6, -
A "-" in the Issued column marks a decoded instruction that could not issue in that cycle.
[The per-register "Reg. to read" and "Reg. to write" counters (registers 0 to 7) are garbled in this transcript and are omitted.]
18
In-order Issue Scoreboard (Continued)
  Cycle  Decoded           Issued  Retired
  10
  11                               6
  12     8: R1 = R4 + R4   7, -
  13
  14
  15                               7
  16                       8
  17
  18                               8
[The per-register "Reg. to read" and "Reg. to write" counters (registers 0 to 7) are garbled in this transcript and are omitted.]
Out-of-order scoreboard (next 2 slides)
19
Questions?
RAW dependence: Inst# 4 (R6 = R1 + R4) could not be issued until cycle 5. Should Inst# 5 (R7 = R1 * R2) wait in the queue?
Answer: No. Inst# 5 can be issued in cycle 3, as there is no register conflict (out-of-order issue).
WAR dependence: Must the issue of Inst# 6 (R1 = R0 - R2) wait until cycle 9, when all instructions reading R1 have retired?
Answer: No, provided the new result of Inst# 6 does not affect the R1 value being used by previous instructions (register renaming).
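Register renaming, as named in the second answer, can be sketched as follows. This is a simplified illustration, not the lecture's mechanism: the function `rename` and the names P0, P1, ... are hypothetical, and unlike the slides (which rename only the conflicting destinations) the sketch gives every destination a fresh name.

```python
def rename(program):
    """Give each destination a fresh register; later reads follow the newest name."""
    latest = {}   # architectural register -> most recent fresh name
    renamed = []
    for n, (dest, src1, src2) in enumerate(program):
        # Sources are looked up before the destination is remapped,
        # so R3 = R3 * R1 reads the old R3 and writes a new one.
        s1 = latest.get(src1, src1)
        s2 = latest.get(src2, src2)
        fresh = f"P{n}"
        latest[dest] = fresh
        renamed.append((fresh, s1, s2))
    return renamed


# (destination, source 1, source 2) for the eight example instructions.
program = [
    ("R3", "R0", "R1"),
    ("R4", "R0", "R2"),
    ("R5", "R0", "R1"),
    ("R6", "R1", "R4"),
    ("R7", "R1", "R2"),
    ("R1", "R0", "R2"),
    ("R3", "R3", "R1"),
    ("R1", "R4", "R4"),
]
renamed = rename(program)
for row in renamed:
    print(row)
```

After renaming, Inst# 6 writes P5 instead of R1, so the earlier readers of R1 no longer hold it back, and only RAW dependences remain.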
20
  Cycle  Decoded                            Issued    Retired
  1      1: R3 = R0 * R1, 2: R4 = R0 + R2   1, 2
  2      3: R5 = R0 + R1, 4: R6 = R1 + R4   3, -
  3      5: R7 = R1 * R2, 6: S1 = R0 - R2   5, 6      2
  4      7: R3 = R3 * S1, 8: S2 = R4 + R4   4, -, 8   1, 3
  5                                                   6
  6                                         7         4, 5, 8
  7
  8
  9                                                   7
A "-" in the Issued column marks a decoded instruction that could not issue in that cycle. S1 and S2 are renamed registers.
[The per-register "Reg. to read" and "Reg. to write" counters are garbled in this transcript and are omitted.]
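The issue and retire cycles in the scoreboard examples can be approximated with a short simulation. The sketch below rests on assumptions that go beyond the slides: additions and subtractions take two cycles and multiplications three, a conflicting instruction may issue in the cycle after the earlier one retires, and the out-of-order mode also retires out of order. The names `schedule`, `conflict` and `LATENCY` are hypothetical.

```python
LATENCY = {"+": 2, "-": 2, "*": 3}  # assumed cycles from issue to retirement


def conflict(earlier, later):
    """True if `later` has a RAW, WAR or WAW dependence on `earlier`."""
    e_dest, _, *e_srcs = earlier
    l_dest, _, *l_srcs = later
    return e_dest in l_srcs or l_dest in e_srcs or l_dest == e_dest


def schedule(program, in_order, width=2):
    """Return (issue cycles, retire cycles), one entry per instruction."""
    issue, retire = [], []
    for i, ins in enumerate(program):
        cycle = i // width + 1  # earliest decode cycle
        for j in range(i):
            if conflict(program[j], ins):
                cycle = max(cycle, retire[j] + 1)  # wait for the conflict to clear
        if in_order and issue:
            cycle = max(cycle, issue[-1])  # never issue before a predecessor
        while issue.count(cycle) >= width:
            cycle += 1  # at most `width` issues per cycle
        done = cycle + LATENCY[ins[1]]
        if in_order and retire:
            done = max(done, retire[-1])  # in-order retirement
        issue.append(cycle)
        retire.append(done)
    return issue, retire


# (destination, operation, source 1, source 2)
original = [("R3", "*", "R0", "R1"), ("R4", "+", "R0", "R2"),
            ("R5", "+", "R0", "R1"), ("R6", "+", "R1", "R4"),
            ("R7", "*", "R1", "R2"), ("R1", "-", "R0", "R2"),
            ("R3", "*", "R3", "R1"), ("R1", "+", "R4", "R4")]
# Same program with instructions 6 to 8 using renamed registers S1 and S2.
renamed = original[:5] + [("S1", "-", "R0", "R2"),
                          ("R3", "*", "R3", "S1"), ("S2", "+", "R4", "R4")]

in_issue, in_retire = schedule(original, in_order=True)
ooo_issue, ooo_retire = schedule(renamed, in_order=False)
print(in_issue, in_retire)
print(ooo_issue, ooo_retire)
```

Under these assumptions the last instruction retires in cycle 18 with in-order issue and the whole sequence finishes by cycle 9 with out-of-order issue and renaming.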
21
References
Previous example is from:
- A. S. Tanenbaum, Structured Computer Organization, Fifth Edition, Prentice-Hall, 2006, pp. 304-309, Section 4.5.3.
Further reading:
- D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, "The IBM 360 Model 91: Processor Philosophy and Instruction Handling," IBM J. Res. & Dev., vol. 11, no. 1, pp. 8-24, Jan. 1967.
22
Power Reduction by Slack Scheduling
- Application: superscalar, out-of-order execution.
  - An instruction is executed as soon as the required data and resources become available.
  - A commit unit reorders the results.
- Delay the completion of instructions whose result is not immediately needed.
- Example of RISC instructions:
    add r0, r1, r2    ; (A)
    sub r3, r4, r5    ; (B)
    and r9, r1, r9    ; (C)
    or  r5, r9, r10   ; (D)
    xor r2, r10, r11  ; (E)
J. Casmira and D. Grunwald, "Dynamic Instruction Scheduling Slack," Proc. ACM Kool Chips Workshop, Dec. 2000.
23
Slack Scheduling Example
[Figure: two schedules of instructions A to E, one labeled "Standard scheduling" and one labeled "Slack scheduling".]
24
Slack Scheduling
[Figure: block diagram. Labels: Slack bit, Low-power execution units (reduced voltage), Re-order buffer, Scheduling logic.]
25
Superscalar Design of P4 (CISC)
CISC shell:
- The processor fetches instructions from memory in the order of the static program.
- Each instruction is translated into one or more fixed-length RISC instructions, known as micro-operations (micro-ops).
RISC core:
- Micro-ops are executed out of order in a dynamically scheduled pipeline.
- The processor commits the result of each micro-op execution to the register file in the order of the original program flow.
26
Superscalars
3 or more instruction issues per clock:
- Intel P6
- AMD K5
- Sun UltraSPARC
- Alpha 21164
- MIPS R10000
- PowerPC 604/620
- HP 8000
References:
- D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, "The IBM 360 Model 91: Processor Philosophy and Instruction Handling," IBM J. Res. Dev., vol. 11, pp. 8-24, January 1967.
- T. Agerwala and J. Cocke, "Reduced Instruction Set Processors," Technical Report RC12434 (#55845), Yorktown Heights, NY: IBM T. J. Watson Research Center, January 1987.
27
Topics in Computer Architecture
- Instruction set
- Program execution through register transfer
See Lectures 13-14.
- Computer arithmetic (2's complement, IEEE 754 floating point standard, addition, multiplication)
- Datapaths (single-cycle, multicycle, pipeline)
- Control (combinational logic, FSM, microcode)
- Pipelining (throughput, hazards, forwarding, stall, branch prediction)
- Memory organization (cache, virtual memory)
- Performance (benchmarks, energy efficiency, Amdahl's law)
- Advanced architectures (ILP, OOE, superscalar, etc.)
Not discussed in this course:
- Multiprocessors
- Compiler and software techniques: loop unrolling, trace execution, etc.
- Input and output
- Power management
28
One who claims to know much about computer architecture speaks from ignorance... because a lot is going to happen in the future, which is...
http://www.youtube.com/watch?v=xZbKHDPPrrc
Doris Day in Hitchcock's 1956 movie "The Man Who Knew Too Much"