DAP.F96 1 Lecture 9: Introduction to Compiler Techniques Chapter 4, Sections 4.1-4.3 L.N. Bhuyan CS 203A.

Slides:



Advertisements
Similar presentations
CPE 731 Advanced Computer Architecture Instruction Level Parallelism Part I Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
Advertisements

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.
ENGS 116 Lecture 101 ILP: Software Approaches Vincent H. Berk October 12 th Reading for today: , 4.1 Reading for Friday: 4.2 – 4.6 Homework #2:
Copyright 2001 UCB & Morgan Kaufmann ECE668.1 Adapted from Patterson, Katz and Culler © UCB Csaba Andras Moritz UNIVERSITY OF MASSACHUSETTS Dept. of Electrical.
CPE 631: ILP, Static Exploitation Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic,
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
1 4/20/06 Exploiting Instruction-Level Parallelism with Software Approaches Original by Prof. David A. Patterson.
FTC.W99 1 Advanced Pipelining and Instruction Level Parallelism (ILP) ILP: Overlap execution of unrelated instructions gcc 17% control transfer –5 instructions.
Eliminating Stalls Using Compiler Support. Instruction Level Parallelism gcc 17% control transfer –5 instructions + 1 branch –Reordering among 5 instructions.
ILP: Loop UnrollingCSCE430/830 Instruction-level parallelism: Loop Unrolling CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Yifeng.
Lecture 6: ILP HW Case Study— CDC 6600 Scoreboard & Tomasulo’s Algorithm Professor Alvin R. Lebeck Computer Science 220 Fall 2001.
EECC551 - Shaaban #1 Fall 2003 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.
EEL Advanced Pipelining and Instruction Level Parallelism Lotzi Bölöni.
Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.
CSE 502 Graduate Computer Architecture Lec 11 – More Instruction Level Parallelism Via Speculation Larry Wittie Computer Science, StonyBrook University.
Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.
EENG449b/Savvides Lec /24/04 March 24, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG Computer.
CS152 Lec15.1 Advanced Topics in Pipelining Loop Unrolling Super scalar and VLIW Dynamic scheduling.
Pipelining 5. Two Approaches for Multiple Issue Superscalar –Issue a variable number of instructions per clock –Instructions are scheduled either statically.
Lecture 3: Chapter 2 Instruction Level Parallelism Dr. Eng. Amr T. Abdel-Hamid CSEN 601 Spring 2011 Computer Architecture Text book slides: Computer Architec.
Static Scheduling for ILP Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.
CS252 Graduate Computer Architecture Lecture 6 Static Scheduling, Scoreboard February 6 th, 2012 John Kubiatowicz Electrical Engineering and Computer Sciences.
DAP.F96 1 Lecture 4: Hazards, Introduction to Compiler Techniques, Chapter 2.
CS136, Advanced Architecture Speculation. CS136 2 Outline Speculation Speculative Tomasulo Example Memory Aliases Exceptions VLIW Increasing instruction.
DAP Spr.‘98 ©UCB 1 Lecture 6: ILP Techniques Contd. Laxmi N. Bhuyan CS 162 Spring 2003.
CSE 502 Graduate Computer Architecture Lec – More Instruction Level Parallelism Via Speculation Larry Wittie Computer Science, StonyBrook University.
EECC551 - Shaaban #1 lec # 6 Fall Multiple Instruction Issue: CPI < 1 To improve a pipeline’s CPI to be better [less] than one, and to.
EECC551 - Shaaban #1 Winter 2002 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.
EECC551 - Shaaban #1 Spring 2006 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.
EECC551 - Shaaban #1 Fall 2005 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.
COMP381 by M. Hamdi 1 Superscalar Processors. COMP381 by M. Hamdi 2 Recall from Pipelining Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data.
1 Lecture 5: Pipeline Wrap-up, Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2) Assignment 1 due at the start of class on Thursday.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 9, 2002 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
Chapter 2 Instruction-Level Parallelism and Its Exploitation
Review of CS 203A Laxmi Narayan Bhuyan Lecture2.
EENG449b/Savvides Lec /24/05 February 24, 2005 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.
EECC551 - Shaaban #1 lec # 8 Winter Multiple Instruction Issue: CPI < 1 To improve a pipeline’s CPI to be better [less] than one, and to.
CIS 629 Fall 2002 Multiple Issue/Speculation Multiple Instruction Issue: CPI < 1 To improve a pipeline’s CPI to be better [less] than one, and to utilize.
EECC551 - Shaaban #1 Spring 2004 lec# Definition of basic instruction blocks Increasing Instruction-Level Parallelism & Size of Basic Blocks.
COMP381 by M. Hamdi 1 Loop Level Parallelism Instruction Level Parallelism: Loop Level Parallelism.
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
CS252/Patterson Lec /6/01 CS252 Graduate Computer Architecture Lecture 19: Intro to Static Pipelining April 6, 2001 Prof. David A. Patterson Computer.
Intel IA-64 Architecture Chehun Kim Glenn Ramos. Contents *Pipelining - Stages of pipelining *Microprogramming *Interconnection Structures.
1 Overcoming Control Hazards with Dynamic Scheduling & Speculation.
Lecture 5: Pipelining & Instruction Level Parallelism Professor Alvin R. Lebeck Computer Science 220 Fall 2001.
Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.
1 Chapter 2: ILP and Its Exploitation Review simple static pipeline ILP Overview Dynamic branch prediction Dynamic scheduling, out-of-order execution Hardware-based.
Review Professor Alvin R. Lebeck Compsci 220 / ECE 252 Fall 2006.
CS203 – Advanced Computer Architecture Instruction Level Parallelism.
CS 5513 Computer Architecture Lecture 6 – Instruction Level Parallelism continued.
现代计算机体系结构 1 主讲教师:张钢 教授 天津大学计算机学院 通信邮箱: 提交作业邮箱: 2012 年.
Compiler Techniques for ILP
CS 352H: Computer Systems Architecture
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue
CSCE430/830 Computer Architecture
CPE 631 Lecture 13: Exploiting ILP with SW Approaches
IA-64 Microarchitecture --- Itanium Processor
Lecture 23: Static Scheduling for High ILP
Adapted from the slides of Prof
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
CSC3050 – Computer Architecture
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
CPE 631 Lecture 14: Exploiting ILP with SW Approaches (2)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Lecture 5: Pipeline Wrap-up, Static ILP
Presentation transcript:

DAP.F96 1 Lecture 9: Introduction to Compiler Techniques Chapter 4, Sections L.N. Bhuyan CS 203A

DAP.F96 2 FP Loop: Where are the Hazards? Consider the following example; For (i=1000; i>0; i=i-1) x[i] = x[i] + s; MIPS code: Loop:LDF0,0(R1);F0=vector element ADDDF4,F0,F2;add scalar from F2 SD0(R1),F4;store result SUBIR1,R1,8;decrement pointer 8B (DW) BNEZR1,Loop;branch R1!=zero NOP;delayed branch slot Where are the stalls?

DAP.F96 3 FP Loop Hazards InstructionInstructionLatency in producing resultusing result clock cycles FP ALU opAnother FP ALU op3 FP ALU opStore double2 Load doubleFP ALU op1 Load doubleStore double0 Integer opInteger op1 Loop:LDF0,0(R1);F0=vector element ADDDF4,F0,F2;add scalar in F2 SD0(R1),F4;store result SUBIR1,R1,8;decrement pointer 8B (DW) BNEZR1,Loop;branch R1!=zero NOP;delayed branch slot

DAP.F96 4 FP Loop Showing Stalls 10 clocks: Rewrite code to minimize stalls? 1 Loop:LDF0,0(R1);F0=vector element 2stall 3ADDDF4,F0,F2;add scalar in F2 4stall 5stall 6 SD0(R1),F4;store result 7 SUBIR1,R1,8;decrement pointer 8B (DW) 8 stall 9 BNEZR1,Loop;branch R1!=zero 10stall;delayed branch slot

DAP.F96 5 Minimizing Stalls Technique 1: Compiler Optimization 6 clocks, but actual work is only 3 cycles for LD, ADD and SD!! Swap BNEZ and SD by changing address of SD 1 Loop:LDF0,0(R1) 2SUBIR1,R1,8 3ADDDF4,F0,F2 4Stall 5BNEZR1,Loop;delayed branch 6 SD8(R1),F4;Address altered from 0(R1) to 8(R1) when moved past SUBI What assumptions made when moved code? OK to move store past SUBI even though changes register OK to move loads before stores: get right data? When is it safe for compiler to do such changes?

DAP.F96 6 Compiler Technique 2: Loop Unrolling 1 Loop:LDF0,0(R1) 2ADDDF4,F0,F2 ;1 cycle delay * 3SD0(R1),F4 ;drop SUBI & BNEZ – 2cycles delay * 4LDF6,-8(R1) 5ADDDF8,F6,F2 ; 1 cycle delay 6SD-8(R1),F8 ;drop SUBI & BNEZ – 2 cycles delay 7LDF10,-16(R1) 8ADDDF12,F10,F2 ; 1 cycle delay 9SD-16(R1),F12 ;drop SUBI & BNEZ – 2 cycles delay 10LDF14,-24(R1) 11ADDDF16,F14,F2 ; 1 cycle delay 12SD-24(R1),F16 ; 2 cycles daly 13SUBIR1,R1,#32;alter to 4*8; 1 cycle delay 14BNEZR1,LOOP ; Delayed branch 15NOP *1 cycle delay for FP operation after load. 2 cycles delay for store after FP. 1 cycle after SUBI x (1+2) + 1 = 28 clock cycles, or 7 per iteration Loop Unrolling is essential for ILP Processors Why? But, increase in Code memory and no. of registers.

DAP.F96 7 Minimize Stall + Loop Unrolling 1 Loop:LDF0,0(R1) 2LDF6,-8(R1) 3LDF10,-16(R1) 4LDF14,-24(R1) 5ADDDF4,F0,F2 6ADDDF8,F6,F2 7ADDDF12,F10,F2 8ADDDF16,F14,F2 9SD0(R1),F4 10SD-8(R1),F8 11SD-16(R1),F12 12SUBIR1,R1,#32 13BNEZR1,LOOP ; Delayed branch 14SD8(R1),F16; 8-32 = clock cycles, or 3.5 per iteration

DAP.F96 8 Compiler Perspectives on Code Movement Compiler concerned about dependencies in program Whether or not a HW hazard depends on pipeline Try to schedule to avoid hazards that cause performance losses (True) Data dependencies (RAW if a hazard for HW) –Instruction i produces a result used by instruction j, or –Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i. If dependent, can’t execute in parallel Easy to determine for registers (fixed names) Hard for memory (“memory disambiguation” problem): –Does 100(R4) = 20(R6)? –From different loop iterations, does 20(R6) = 20(R6)?

DAP.F96 9 Compiler Perspectives on Code Movement Name Dependencies are Hard to discover for Memory Accesses –Does 100(R4) = 20(R6)? –From different loop iterations, does 20(R6) = 20(R6)? Our example required compiler to know that if R1 doesn’t change then: 0(R1)  -8(R1)  -16(R1)  -24(R1) There were no dependencies between some loads and stores so they could be moved by each other

DAP.F96 10 Steps Compiler Performed to Unroll Check OK to move the S.D after DSUBUI and BNEZ, and find amount to adjust S.D offset Determine unrolling the loop would be useful by finding that the loop iterations were independent Rename registers to avoid name dependencies Eliminate extra test and branch instructions and adjust the loop termination and iteration code Determine loads and stores in unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent –requires analyzing memory addresses and finding that they do not refer to the same address. Schedule the code, preserving any dependences needed to yield same result as the original code

DAP.F96 11 When Safe to Unroll Loop? Example: Where are data dependencies? (A,B,C distinct & nonoverlapping) for (i=0; i<100; i=i+1) { A[i+1] = A[i] + C[i]; /* S1 */ B[i+1] = B[i] + A[i+1]; /* S2 */ } 1. S2 uses the value, A[i+1], computed by S1 in the same iteration. 2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a “loop-carried dependence”: between iterations For our prior example, each iteration was distinct Implies that iterations can’t be executed in parallel, Right????

DAP.F96 12 Compiler Optimization + 2-issue Superscalar EX: One LD/ST unit and one FP ALU 1 Loop:LDF0,0(R1) 2LDF6,-8(R1) 3LDF10,-16(R1) ADDDF4,F0,F2 4LDF14,-24(R1) ADDDF8,F6,F2 5SD0(R1),F4 ADDDF12,F10,F2 6SD-8(R1),F8 ADDDF16,F14,F2 7SD-16(R1),F12 8SUBIR1,R1,#32 9BNEZR1,LOOP ; Delayed branch 10SD8(R1),F16 ; 8-32 = clock cycles, or 2.5 per iteration

DAP.F Issue without Loop Unrolling 1 Loop:LDF0,0(R1) 2SUBIR1,R1,8 ADDD F4,F0,F2 3Stall 4BNEZR1,Loop 5 SD8(R1),F4 - Branch Delay Only One cycle improvement even with 2-issue superscalar! HW: What if 2-issue no compiler optimization?

DAP.F96 14 Static Multiple Issue: Very Long Instruction Word (VLIW) Architectures Wide-issue processor that relies on compiler to –Packet together independent instructions to be issued in parallel –Schedule code to minimize hazards and stalls Very long instruction words (3 to 8 operations) –Can be issued in parallel without checks –If compiler cannot find independent operations, it inserts nops Advantage: simpler HW for wide issue –Faster clock cycle –Lower design & verification cost Disadvantages: –Code size –Requires aggressive compilation technology

DAP.F96 15 Traditional VLIW Hardware Multiple functional units, many registers (e.g. 128) –Large multiported register file (for N FUs need ~3N ports) Simple instruction fetch unit –No checks, direct correspondence between slots & FUs Instruction format –16 to 24 bits per op => 5*16=80 bits to 5*24=120 bits wide –Can share immediate fields (1 per long instruction)

DAP.F96 16 VLIW Code Example Consider the following example; For (i=1000; i>0; i=i-1) x[i] = x[i] + s; MIPS code: Loop:LDF0,0(R1);F0=vector element ADDDF4,F0,F2;add scalar from F2 SD0(R1),F4;store result SUBIR1,R1,8;decrement pointer 8B (DW) BNEZR1,Loop;branch R1!=zero NOP;delayed branch slot

DAP.F96 17 Loop Unrolling in VLIW Memory MemoryFPFPInt. op/Clock reference 1reference 2operation 1 op. 2 branch L.D F0,0(R1)L.D F6,-8(R1)1 L.D F10,-16(R1)L.D F14,-24(R1)2 L.D F18,-32(R1)L.D F22,-40(R1)ADD.D F4,F0,F2ADD.D F8,F6,F23 L.D F26,-48(R1)ADD.D F12,F10,F2ADD.D F16,F14,F24 ADD.D F20,F18,F2ADD.D F24,F22,F25 S.D 0(R1),F4S.D -8(R1),F8ADD.D F28,F26,F26 S.D -16(R1),F12S.D -24(R1),F167 S.D -32(R1),F20S.D -40(R1),F24DSUBUI R1,R1,#488 S.D -0(R1),F28BNEZ R1,LOOP9 Unrolled 7 times to avoid delays 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X) Average: 2.5 ops per clock, 50% efficiency Note: Need more registers in VLIW (15 vs. 6 in SS)

DAP.F96 18 How Can the HW Help the Compiler with Discovering ILP? Compiler’s performance is critical for VLIW processors –Find many independent instructions & schedule them in best possible way What limits the compiler’s ability to discover ILP –Name dependencies (WAW & WAR) »Can eliminate with large number of registers –Branches »Limit compiler’s ability to schedule »Modern VLIW processors use branch prediction too –Dependencies through memory »Force the compiler to use conservative schedule Can the HW help the compiler? –Ideally, with techniques simpler than those for superscalar processors

DAP.F96 19 VLIW Vs. Superscalar sequential stream of long instruction words instructions scheduled statically by the compiler number of simultaneously issued instructions is fixed during compile-time instruction issue is less complicated than in a superscalar processor Disadvantage: VLIW processors cannot react on dynamic events, e.g. cache misses, with the same flexibility like superscalars. The number of instructions in a VLIW instruction word is usually fixed. Padding VLIW instructions with no-ops is needed in case the full issue bandwidth is not be met. This increases code size. More recent VLIW architectures use a denser code format which allows to remove the no-ops. VLIW is an architectural technique, whereas superscalar is a microarchitecture technique. VLIW processors take advantage of spatial parallelism.

DAP.F96 20 Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)” IA-64: instruction set architecture; EPIC is type –EPIC = 2nd generation VLIW? Itanium™ is name of first implementation (2001) –Highly parallel and deeply pipelined hardware at 800Mhz –6-wide, 10-stage pipeline at 800Mhz on 0.18 µ process bit integer registers bit floating point registers –Not separate register files per functional unit as in old VLIW Hardware checks dependencies (interlocks => binary compatibility over time) Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?

DAP.F96 21 Branch Hints Memory Hints Instruction Cache & Branch Predictors Fetch Fetch Memory Subsystem Memory Subsystem Three levels of cache: L1, L2, L3 Register Stack & Rotation Explicit Parallelism 128 GR & 128 FR, Register Remap & Stack Engine RegisterHandling Fast, Simple 6-Issue Issue Control Micro-architecture Features in hardware : Itanium™ EPIC Design Maximizes SW-HW Synergy (Copyright: Intel at Hotchips ’00) : Architecture Features programmed by compiler: Predication Data & Control Speculation Bypasses & Dependencies Parallel Resources 4 Integer + 4 MMX Units 2 FMACs (4 for SSE) 2 L.D/ST units 32 entry ALAT Speculation Deferral Management

DAP.F Stage In-Order Core Pipeline (Copyright: Intel at Hotchips ’00) Front End Pre-fetch/Fetch of up to 6 instructions/cyclePre-fetch/Fetch of up to 6 instructions/cycle Hierarchy of branch predictorsHierarchy of branch predictors Decoupling bufferDecoupling buffer Instruction Delivery Dispersal of up to 6 instructions on 9 portsDispersal of up to 6 instructions on 9 ports Reg. remappingReg. remapping Reg. stack engineReg. stack engine Operand Delivery Reg read + BypassesReg read + Bypasses Register scoreboardRegister scoreboard Predicated dependencies Predicated dependencies Execution 4 single cycle ALUs, 2 ld/str4 single cycle ALUs, 2 ld/str Advanced load controlAdvanced load control Predicate delivery & branchPredicate delivery & branch Nat/Exception//RetirementNat/Exception//Retirement IPGFET ROTEXP RENREGEXEDETWRBWL.D REGISTER READ WORD-LINE DECODE RENAMEEXPAND INST POINTER GENERATION FETCH ROTATE EXCEPTION DETECT EXECUTEWRITE-BACK

DAP.F96 23 Itanium processor 10-stage pipeline Front-end (stages IPG, Fetch, and Rotate): prefetches up to 32 bytes per clock (2 bundles) into a prefetch buffer, which can hold up to 8 bundles (24 instructions) –Branch prediction is done using a multilevel adaptive predictor like P6 microarchitecture Instruction delivery (stages EXP and REN): distributes up to 6 instructions to the 9 functional units –Implements registers renaming for both rotation and register stacking.

DAP.F96 24 Itanium processor 10-stage pipeline Operand delivery (WLD and REG): accesses register file, performs register bypassing, accesses and updates a register scoreboard, and checks predicate dependences. –Scoreboard used to detect when individual instructions can proceed, so that a stall of 1 instruction in a bundle need not cause the entire bundle to stall Execution (EXE, DET, and WRB): executes instructions through ALUs and load/store units, detects exceptions and posts NaTs, retires instructions and performs write-back –Deferred exception handling for speculative instructions is supported by providing the equivalent of poison bits, called NaTs for Not a Thing, for the GPRs (which makes the GPRs effectively 65 bits wide), and NaT Val (Not a Thing Value) for FPRs (already 82 bits wides)

DAP.F96 25 Test 1, Nov 2 (Thursday) 4 Questions with varying points. 30 pts=>75 mins. Approx 2 mins per point plus 15 mins. Answer to the point. Topics: Performance – CPI etc. (Ch 1.5, 1.6) Pipeline Hazards – Appendix A.1 Branch Prediction – A.2, Ch 3 Scoreboarding – A.8 Tomasulo – Ch 3 Speculation – Ch 3