CS5100 Advanced Computer Architecture Instruction-Level Parallelism


CS5100 Advanced Computer Architecture: Instruction-Level Parallelism. Prof. Chung-Ta King, Department of Computer Science, National Tsing Hua University, Taiwan. (Slides are adapted from the textbook and from Prof. Hsien-Hsin Lee and Prof. Yasun Hsu.)

About This Lecture
Goal: To review the basic concepts of instruction-level parallelism and pipelining, and to study compiler techniques for exposing ILP that are useful for processors with static scheduling.
Outline:
Instruction-level parallelism: concepts and challenges (Sec. 3.1): basic concepts, factors affecting ILP, strategies for exploiting ILP
Basic compiler techniques for exposing ILP (Sec. 3.2)

Sequential Program Semantics
Humans expect "sequential semantics": a program counter (PC) steps through the instructions of the program one at a time until the computation completes. The result of this sequential execution is considered "correct" (the ground truth), so any optimization, e.g. pipelining or parallelism, must preserve the same semantics (result). Sequential semantics thus dictates a computation model of one instruction executed after another. While ensuring execution "correctness" (i.e. sequential semantics), how can execution be optimized? By overlapping executions: instruction-level, thread-level, data-level, request-level.

Instruction-Level Parallelism (ILP)
ILP is the parallelism that exists among instructions.
Example form of parallelism 1: independent instructions can be reordered. Executing
  sub r1,r2,r3
  add r4,r5,r6
or
  add r4,r5,r6
  sub r1,r2,r3
gives the same result up to this point.
Example form of parallelism 2: independent instructions can be overlapped across the pipeline stages (IF DE EX MEM WB); their sub-operations can proceed at the same time.

Instruction-Level Parallelism (ILP)
Two instructions are parallel if they can be executed simultaneously in a pipeline of arbitrary depth without causing any stalls, assuming the pipeline has sufficient resources (no structural hazards). If two instructions are dependent, they are not parallel and must be executed in order, although they may be partially overlapped. The amount of ILP that can be exploited is affected by the program, the compiler, and the architecture. The key is to determine the dependences among instructions: data (true), name, and control dependences.

True Dependence
Instruction i is followed by instruction j (j need not be the instruction immediately after i), e.g.:
i writes a value to a register, j uses that value in the register:
  add r2, r4, r6
  add r8, r2, r9
i writes to memory via a write buffer, j loads the same value into a register:
  sw r2, 100(r3)
  lw r4, 100(r3)
i loads a register from memory, j uses the content of that register for an operation:
  lw r2, 200(r7)
  add r1, r2, r3
True dependence reflects sequential semantics and forces "sequentiality".

True Dependence Causes Data Hazard
True dependences indicate data flows and are a result of the program (semantics). A true dependence may cause a Read-After-Write (RAW) data hazard in the pipeline: instruction i is followed by instruction j, and j tries to read an operand before i writes it, e.g.:
  lw  r1, 30(r2)
  add r3, r1, r5
Without care, the add would read the old value of r1 instead of the newly loaded one; this is what we want to avoid.

Name Dependence
Anti- and output dependences: the two instructions use the same name (register or memory location) but do not exchange data, so there is no data flow. Name dependences arise from register reuse by the compiler or from architectural limitations, and can be removed by renaming (using different registers). Example:
  lw  r2, (r12)
  add r1, r2, 9
  mul r2, r3, r4
The lw and add have a true dependence on r2, the add and mul have an anti-dependence on r2, and the lw and mul have an output dependence on r2.

Anti-Dependence
An anti-dependence may cause a Write-After-Read (WAR) data hazard: instruction i is followed by instruction j, and j tries to write a register/memory location before i reads it. Instruction i, which wanted the old value, would get the wrong (new) value; this must be avoided.

Output Dependence
An output dependence may cause a Write-After-Write (WAW) data hazard: instruction i is followed by instruction j, and j tries to write a register/memory location before i writes it, leaving the wrong (older) result in place; this must be avoided.

Register Renaming
Output dependence example: in
  LW   R2, 0(R1)
  DADD R2, R3, R4
  DSUB R6, R2, R5
the LW and DADD both write R2. Renaming the DADD destination (and its use) gives
  LW   R2, 0(R1)
  DADD R10, R3, R4
  DSUB R6, R10, R5
and the WAW hazard disappears.
Anti-dependence example: in
  ADD.D F4, F2, F8
  L.D   F2, 0(R10)
  SUB.D F14, F2, F6
the ADD.D reads F2 before the L.D overwrites it. Renaming the L.D destination (and its use) gives
  ADD.D F4, F2, F8
  L.D   F16, 0(R10)
  SUB.D F14, F16, F6
and the WAR hazard disappears.

Memory Dependence
Ambiguous dependences through memory also force "sequentiality". For example, in
  i1: lw r2, (r12)
  i2: sw r7, 24(r20)
  i3: sw r1, (0xFF00)
it may be unknown whether the three addresses overlap, so the accesses must conservatively be kept in order. To increase ILP, we need dynamic memory disambiguation mechanisms that are either safe or recoverable.
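To see the same ambiguity at the source level, here is a minimal C sketch (the function names and the use of restrict are illustrative, not from the slides): stores through pointers that may alias the loaded address prevent the compiler from reordering the memory operations.

  /* Minimal sketch of an ambiguous memory dependence.
   * If p, q, and r can point to the same location, the compiler
   * must keep the load and the stores in program order. */
  void update(int *p, int *q, int *r) {
      int x = *p;      /* i1: load                        */
      *q = x + 1;      /* i2: store, may alias *p or *r   */
      *r = 42;         /* i3: store, may alias *p or *q   */
  }

  /* Declaring the pointers 'restrict' (C99) asserts no aliasing,
   * so the compiler is free to reorder or overlap these accesses. */
  void update_noalias(int *restrict p, int *restrict q, int *restrict r) {
      int x = *p;
      *q = x + 1;
      *r = 42;
  }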

Control Dependence
Control dependence is the ordering of an instruction i with respect to a branch instruction. An instruction that is control-dependent on a branch cannot be moved before that branch, or its execution would no longer be controlled by the branch. An instruction that is not control-dependent on a branch cannot be moved after the branch, or its execution would become controlled by the branch. Example:
  bge r8, r9, Next
  add r1, r2, r3
The add is control-dependent on the bge.

Control Dependence and Basic Blocks
Control dependence limits the size of basic blocks. C source:
  a = array[i];
  b = array[j];
  c = array[k];
  d = b + c;
  while (d < t) { a++; c *= 5; }
  array[i] = a;
  array[j] = d;
Corresponding assembly:
  i1:  lw   r1, (r11)
  i2:  lw   r2, (r12)
  i3:  lw   r3, (r13)
  i4:  add  r2, r2, r3
  i5:  bge  r2, r9, i9
  i6:  addi r1, r1, 1
  i7:  mul  r3, r3, 5
  i8:  j    i4
  i9:  sw   r1, (r11)
  i10: sw   r2, (r12)
  i11: jr   r31


Control Flow Graph
The same code partitioned into basic blocks:
  BB1: i1:  lw   r1, (r11)
       i2:  lw   r2, (r12)
       i3:  lw   r3, (r13)
  BB2: i4:  add  r2, r2, r3
       i5:  bge  r2, r9, i9
  BB3: i6:  addi r1, r1, 1
       i7:  mul  r3, r3, 5
       i8:  j    i4
  BB4: i9:  sw   r1, (r11)
       i10: sw   r2, (r12)
       i11: jr   r31
Typical size of a basic block = 3~6 instructions.

Data Dependence: A Short Summary
Dependences are a property of the program and compiler; the pipeline organization determines whether a dependence is detected and whether it causes a stall. A data dependence conveys: the possibility of a hazard, the order in which results must be calculated, and an upper bound on exploitable ILP. Dependences can be overcome by maintaining the dependence while avoiding the hazard, or by eliminating the dependence by transforming the code (see the sketch below). Dependences that flow through memory locations are difficult to detect.
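As a hedged illustration of eliminating a dependence by transforming code (this example is not from the slides, and the function names are illustrative): a reduction loop with a single accumulator carries a true dependence from one add to the next, while splitting it into independent partial sums exposes ILP.

  #include <stddef.h>

  /* Single accumulator: every add depends on the previous one,
   * so the adds are serialized by a true dependence. */
  long sum_serial(const int *a, size_t n) {
      long s = 0;
      for (size_t i = 0; i < n; i++)
          s += a[i];
      return s;
  }

  /* Two independent partial sums: adds in one chain do not depend
   * on the other chain, so they can overlap in the pipeline.
   * (Assumes n is even, for brevity.) */
  long sum_ilp(const int *a, size_t n) {
      long s0 = 0, s1 = 0;
      for (size_t i = 0; i < n; i += 2) {
          s0 += a[i];
          s1 += a[i + 1];
      }
      return s0 + s1;
  }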

Another Factor Affecting ILP Exploitation
A limited HW/SW window in the search for ILP also restricts what can be exploited. Consider:
  R5  = 8(R6)
  R7  = R5 - R4
  R9  = R7 * R7
  R15 = 16(R6)
  R17 = R15 - R14
  R19 = R15 * R15
Viewed in isolation, the first three instructions form a serial chain (ILP = 1) and the last three achieve ILP = 1.5. What is the ILP of the sequence as a whole?

Window in Search of ILP
With a window that covers all six instructions, they can be scheduled in three cycles:
  C1: R5  = 8(R6)       R15 = 16(R6)
  C2: R7  = R5 - R4     R17 = R15 - R14
  C3: R9  = R7 * R7     R19 = R15 * R15
ILP = 6/3 = 2, better than the 1 and 1.5 obtained from the two halves separately. A larger window gives more opportunities. Who exploits the instruction window, and what limits the window?

Strategies for Exploiting ILP
Replicate resources: execute independent instructions (e.g., sub r1,r2,r3 and add r4,r5,r6) side by side using, say, multiple adders or multi-ported data caches; this is the superscalar approach.
Overlap uses of resources: pipelining, where instructions flow through the IF DE EX MEM WB stages of the datapath, separated by pipeline registers.

Pipelining
One machine cycle per stage; with a synchronous design, the slowest stage dominates the cycle time. Pipelining is a pure hardware solution: no change to software, and sequential semantics is preserved. All modern machines are pipelined; it was the key technique in advancing performance in the 1980s.

Advanced Means of Exploiting ILP
Hardware: control speculation (control), dynamic scheduling (data), register renaming (data), dynamic memory disambiguation (data).
Software: (sophisticated) program analysis, predication or conditional instructions (control), better register allocation (data), memory disambiguation by the compiler (data).
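As a hedged illustration of predication (if-conversion), not taken from the slides: a short branch can be rewritten so the result is selected by a condition rather than by control flow, which a compiler can map to a conditional-move or predicated instruction.

  /* Branching version: the assignment is control-dependent on the if. */
  int max_branch(int a, int b) {
      int m = b;
      if (a > b)
          m = a;
      return m;
  }

  /* If-converted version: both values are computed and the condition
   * only selects between them, so no control dependence remains and
   * the compiler may emit a conditional move instead of a branch. */
  int max_select(int a, int b) {
      return (a > b) ? a : b;
  }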

Exploiting ILP
Any attempt to exploit ILP must make sure the optimized program is still "correct". Two properties are critical to program correctness:
Data flow: the flow of data values among instructions.
Exception behavior: the ordering of instruction execution must not change how exceptions are raised in the program (or cause any new exceptions), e.g.
  DADDU R2,R3,R4
  BEQZ  R2,L
  LW    R1,0(R2)   // may cause a memory protection exception
  L:    ...        // cannot reorder BEQZ and LW
Branches make data flow dynamic: an instruction may be data dependent on more than one predecessor, and program order determines which predecessor actually supplies the value, which in turn is decided by control dependence.
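A hedged source-level analogue (not from the slides): hoisting a load above the branch that guards it can introduce an exception the original program never raised.

  /* The load *p is control-dependent on the null check.  Moving it
   * above the branch could fault when p == 0, changing exception
   * behavior even though the loaded value would be unused. */
  int read_guarded(int *p) {
      int v = 0;
      if (p != 0)
          v = *p;     /* safe: executed only when p is non-null */
      return v;
  }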

Exploiting ILP - Preserving Data Flow
Example 1:
  DADDU R1,R2,R3
  BEQZ  R4,L
  DSUBU R1,R1,R6
  L: ...
  OR    R7,R1,R8
The OR depends on the DADDU and the DSUBU, but correct execution also depends on the BEQZ, not just on the ordering of DADDU, DSUBU, and OR.
Example 2:
  BEQZ  R12,skip
  DSUBU R4,R5,R6
  DADDU R5,R4,R9
  skip: OR R7,R8,R9
Assume R4 is not used after skip; then it is possible to move the DSUBU before the branch without affecting exceptions or data flow.

Outline
Instruction-level parallelism: concepts and challenges (Sec. 3.1)
Basic compiler techniques for exposing ILP (Sec. 3.2)

Finding ILP
In gcc, about 17% of instructions are control transfers, i.e. roughly 5 instructions plus 1 branch per basic block. We must move beyond a single block to get more ILP. Loop-level parallelism gives one opportunity: loops can be unrolled statically by software or dynamically by hardware, or vector instructions can be used. Principle of pipeline scheduling: a dependent instruction must be separated from its source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction. How can a compiler perform such pipeline scheduling?

Assumptions
5-stage integer pipeline; the FP ALU takes 4 cycles (3 stall cycles for a consuming instruction; 2 for a consuming store). Producer-to-consumer stalls:
  LD -> any:      1 stall
  FP ALU -> any:  3 stalls
  FP ALU -> ST:   2 stalls
  Int ALU -> BR:  1 stall

Compiler Techniques for Exposing ILP
Add a scalar to a vector (a parallel loop):
  for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;
Straightforward MIPS code, with stalls shown:
  Loop: L.D    F0,0(R1)
        stall
        ADD.D  F4,F0,F2
        stall
        stall
        S.D    F4,0(R1)
        DADDUI R1,R1,#-8
        stall              // need R1 in ID
        BNE    R1,R2,Loop
  9 cycles per iteration
Assumptions: the standard five-stage integer pipeline, so branches have a delay of 1 clock cycle; R1 initially holds the address of the array element with the highest address; F2 contains the scalar value s; R2 is precomputed so that 8(R2) is the address of the last element to operate on; the latency from the integer ALU to the branch is 1 cycle, since the branch address is calculated in ID, which occurs in the same cycle as the EX of the previous instruction; delayed branches are ignored.

Pipeline Scheduling
Scheduled code:
  Loop: L.D    F0,0(R1)
        DADDUI R1,R1,#-8
        ADD.D  F4,F0,F2
        stall
        stall
        S.D    F4,8(R1)
        BNE    R1,R2,Loop
  7 cycles per iteration

Loop Unrolling
Unroll 4 times (assume the number of elements is divisible by 4) and eliminate the unnecessary intermediate instructions:
  Loop: L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)    ;drop DADDUI & BNE
        L.D    F6,-8(R1)
        ADD.D  F8,F6,F2
        S.D    F8,-8(R1)   ;drop DADDUI & BNE
        L.D    F10,-16(R1)
        ADD.D  F12,F10,F2
        S.D    F12,-16(R1) ;drop DADDUI & BNE
        L.D    F14,-24(R1)
        ADD.D  F16,F14,F2
        S.D    F16,-24(R1)
        DADDUI R1,R1,#-32
        BNE    R1,R2,Loop
27 cycles: each of the four L.D/ADD.D/S.D groups incurs 1 stall after the L.D and 2 stalls after the ADD.D, and there is 1 stall between the DADDUI and the BNE, so 14 instructions + 13 stalls = 27 cycles (6.75 per element).
Note the number of live registers compared with the original loop.

Loop Unrolling + Pipeline Scheduling
  Loop: L.D    F0,0(R1)
        L.D    F6,-8(R1)
        L.D    F10,-16(R1)
        L.D    F14,-24(R1)
        ADD.D  F4,F0,F2
        ADD.D  F8,F6,F2
        ADD.D  F12,F10,F2
        ADD.D  F16,F14,F2
        S.D    F4,0(R1)
        S.D    F8,-8(R1)
        DADDUI R1,R1,#-32
        S.D    F12,16(R1)
        S.D    F16,8(R1)
        BNE    R1,R2,Loop
14 cycles, or 3.5 per element, with no stalls remaining.
It is OK to move the S.D past the DADDUI even though the DADDUI changes the register it uses, as long as the store offset is adjusted. It is OK to move loads before stores once the memory addresses have been analyzed (memory disambiguation). When is it safe to make such changes? We must understand the dependences.
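For reference, a hedged C-level sketch of what unrolling by four looks like at the source level (the function names are illustrative, and the trip count is assumed to be a multiple of 4, mirroring the slide's assumption):

  /* Original loop: x[i] = x[i] + s for i = n-1 .. 0 */
  void add_scalar(double *x, double s, int n) {
      for (int i = n - 1; i >= 0; i--)
          x[i] = x[i] + s;
  }

  /* Unrolled by 4 (assumes n is a multiple of 4): four independent
   * load/add/store groups per iteration give the scheduler room to
   * separate each load from its add and each add from its store. */
  void add_scalar_unrolled(double *x, double s, int n) {
      for (int i = n - 1; i >= 3; i -= 4) {
          x[i]     = x[i]     + s;
          x[i - 1] = x[i - 1] + s;
          x[i - 2] = x[i - 2] + s;
          x[i - 3] = x[i - 3] + s;
      }
  }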

Loop Unrolling: Summary
Unrolling increases the number of useful instructions relative to the branch and overhead instructions, and it allows instructions from different iterations to be scheduled together. It needs different registers for each iteration, which increases the register count. Unrolling is usually done early in compilation. When the trip count n is not a multiple of the unroll factor k, a pair of consecutive loops is generated: the first executes (n mod k) times and the second has the unrolled body that iterates n/k times (see the sketch below).
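A hedged C sketch of that remainder-plus-unrolled-body split (function name illustrative; the loop here runs upward for simplicity, whereas the slides' loop runs downward):

  /* Unroll by k = 4 with a remainder loop: the first loop handles
   * n mod 4 elements, the second runs the unrolled body n/4 times. */
  void add_scalar_split(double *x, double s, int n) {
      int i = 0;
      int rem = n % 4;
      for (; i < rem; i++)          /* executes n mod k times       */
          x[i] = x[i] + s;
      for (; i < n; i += 4) {       /* unrolled body, n/k iterations */
          x[i]     = x[i]     + s;
          x[i + 1] = x[i + 1] + s;
          x[i + 2] = x[i + 2] + s;
          x[i + 3] = x[i + 3] + s;
      }
  }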

Recap
What is instruction-level parallelism? Pipelining and superscalar execution enable ILP.
Factors affecting ILP: true, anti-, and output dependences; dependences may cause pipeline hazards (RAW, WAR, WAW); control dependence, basic blocks, and the control flow graph.
Compiler techniques for exploiting ILP: loop unrolling.