Compiler techniques for exposing ILP

Instruction Level Parallelism
- Potential overlap among instructions
- Few possibilities in a basic block
    - Blocks are small (6-7 instructions)
    - Instructions are dependent
- Goal: Exploit ILP across multiple basic blocks
    - Iterations of a loop

    for (i = 1000; i > 0; i=i-1)
        x[i] = x[i] + s;

Basic Scheduling

    for (i = 1000; i > 0; i=i-1)
        x[i] = x[i] + s;

Sequential MIPS Assembly Code:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BNEZ R1, Loop

Pipelined execution (cycle number on the right):

    Loop: LD   F0, 0(R1)       1
          stall                2
          ADDD F4, F0, F2      3
          stall                4
          stall                5
          SD   0(R1), F4       6
          SUBI R1, R1, #8      7
          stall                8
          BNEZ R1, Loop        9
          stall               10

Scheduled pipelined execution (cycle number on the right):

    Loop: LD   F0, 0(R1)       1
          SUBI R1, R1, #8      2
          ADDD F4, F0, F2      3
          stall                4
          BNEZ R1, Loop        5
          SD   8(R1), F4       6
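The cycle counts on the Basic Scheduling slide can be cross-checked with a small stall counter. The following C sketch is only an illustration under stated assumptions: the latency table in `stalls_between` is inferred from the stall positions on the slide (load to FP add: 1 stall, FP add to store: 2, integer op to branch: 1, plus one branch delay slot), the register numbering (R1 = 1, F0 = 100, and so on) is hypothetical, and names such as `loop_cycles`, `UNSCHEDULED`, and `SCHEDULED` are made up for this example. It models only read-after-write dependences within one pass over the loop body.

```c
#include <stddef.h>

/* Instruction classes for a simplified in-order pipeline model. */
enum Kind { LOAD, FP_ALU, STORE, INT_ALU, BRANCH };

/* One instruction: its class, the register it writes (-1 if none),
   and up to two registers it reads (-1 if unused). */
struct Instr {
    enum Kind kind;
    int dst;
    int src[2];
};

/* Assumed stall cycles between a producer and a dependent consumer
   (inferred from the slide, not taken from a real processor manual). */
static int stalls_between(enum Kind producer, enum Kind consumer) {
    if (producer == LOAD    && consumer == FP_ALU) return 1;
    if (producer == FP_ALU  && consumer == FP_ALU) return 3;
    if (producer == FP_ALU  && consumer == STORE)  return 2;
    if (producer == INT_ALU && consumer == BRANCH) return 1;
    return 0;
}

/* Cycles for one pass over a loop body of n instructions (n <= 64).
   Returns 0 for an empty or oversized body. */
int loop_cycles(const struct Instr *code, size_t n) {
    int issue[64];
    if (n == 0 || n > 64) return 0;
    for (size_t i = 0; i < n; i++) {
        int cycle = (i == 0) ? 1 : issue[i - 1] + 1;
        for (int s = 0; s < 2; s++) {
            /* Find the most recent earlier writer of this source register. */
            for (size_t j = i; j-- > 0;) {
                if (code[j].dst >= 0 && code[j].dst == code[i].src[s]) {
                    int ready = issue[j] + 1 +
                                stalls_between(code[j].kind, code[i].kind);
                    if (ready > cycle) cycle = ready;
                    break;
                }
            }
        }
        issue[i] = cycle;
    }
    /* A branch at the very end leaves its delay slot unfilled: one more cycle. */
    return issue[n - 1] + (code[n - 1].kind == BRANCH ? 1 : 0);
}

/* Hypothetical register numbering: R1 = 1, F0 = 100, F2 = 102, F4 = 104. */
const struct Instr UNSCHEDULED[5] = {
    { LOAD,    100, { 1,   -1  } },  /* LD   F0, 0(R1)  */
    { FP_ALU,  104, { 100, 102 } },  /* ADDD F4, F0, F2 */
    { STORE,   -1,  { 104, 1   } },  /* SD   0(R1), F4  */
    { INT_ALU, 1,   { 1,   -1  } },  /* SUBI R1, R1, #8 */
    { BRANCH,  -1,  { 1,   -1  } },  /* BNEZ R1, Loop   */
};

const struct Instr SCHEDULED[5] = {
    { LOAD,    100, { 1,   -1  } },  /* LD   F0, 0(R1)  */
    { INT_ALU, 1,   { 1,   -1  } },  /* SUBI R1, R1, #8 */
    { FP_ALU,  104, { 100, 102 } },  /* ADDD F4, F0, F2 */
    { BRANCH,  -1,  { 1,   -1  } },  /* BNEZ R1, Loop   */
    { STORE,   -1,  { 104, 1   } },  /* SD   8(R1), F4  */
};
```

Under these assumptions `loop_cycles(UNSCHEDULED, 5)` gives 10 and `loop_cycles(SCHEDULED, 5)` gives 6, matching the slide's totals. The model places the remaining stall after BNEZ rather than before it; the slide puts it before the branch so that SD sits directly in the delay slot, but the total is the same.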

Loop Unrolling

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BEQZ R1, Exit
          LD   F6, 0(R1)
          ADDD F8, F6, F2
          SD   0(R1), F8
          LD   F10, 0(R1)
          ADDD F12, F10, F2
          SD   0(R1), F12
          LD   F14, 0(R1)
          ADDD F16, F14, F2
          SD   0(R1), F16
          BNEZ R1, Loop
    Exit:

- Pros:
    - Larger basic block
    - More scope for scheduling and eliminating dependencies
- Cons:
    - Increases code size
- Comment:
    - Often a precursor step for other optimizations
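At the source level, unrolling by four looks roughly like the C sketch below. It is an illustration, not the slide's exact transformation: the function names are hypothetical, the array is indexed upward from 0 rather than downward from 1000, and a cleanup loop is added for trip counts that are not a multiple of four (the slide's trip count of 1000 would not need it).

```c
#include <stddef.h>

/* Original loop: add s to every element of x[0..n-1]. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Same computation with the body replicated four times per iteration.
   Only one index update and one branch are executed per four elements. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
    /* Leftover iterations when n is not a multiple of 4. */
    for (; i < n; i++)
        x[i] = x[i] + s;
}
```

The four statements in the unrolled body touch different elements, so they are independent of one another, which is what gives a scheduler room to interleave them.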

Loop Transformations
- Instruction independence is the key requirement for the transformations
- Example
    - Determine that it is legal to move SD after SUBI and BNEZ
    - Determine that unrolling is useful (iterations are independent)
    - Use different registers to avoid unnecessary constraints
    - Eliminate extra tests and branches
    - Determine that LD and SD can be interchanged
    - Schedule the code, preserving the semantics of the code

1. Eliminating Name Dependences

Before (every copy reuses F0 and F4):

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          LD   F0, -8(R1)
          ADDD F4, F0, F2
          SD   -8(R1), F4
          LD   F0, -16(R1)
          ADDD F4, F0, F2
          SD   -16(R1), F4
          LD   F0, -24(R1)
          ADDD F4, F0, F2
          SD   -24(R1), F4
          SUBI R1, R1, #32
          BNEZ R1, Loop

After register renaming:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          LD   F6, -8(R1)
          ADDD F8, F6, F2
          SD   -8(R1), F8
          LD   F10, -16(R1)
          ADDD F12, F10, F2
          SD   -16(R1), F12
          LD   F14, -24(R1)
          ADDD F16, F14, F2
          SD   -24(R1), F16
          SUBI R1, R1, #32
          BNEZ R1, Loop

2. Eliminating Control Dependences

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BEQZ R1, Exit
          LD   F6, 0(R1)
          ADDD F8, F6, F2
          SD   0(R1), F8
          LD   F10, 0(R1)
          ADDD F12, F10, F2
          SD   0(R1), F12
          LD   F14, 0(R1)
          ADDD F16, F14, F2
          SD   0(R1), F16
          BNEZ R1, Loop
    Exit:

- Intermediate BEQZ are never taken
- Eliminate!

3. Eliminating Data Dependences

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          LD   F6, 0(R1)
          ADDD F8, F6, F2
          SD   0(R1), F8
          LD   F10, 0(R1)
          ADDD F12, F10, F2
          SD   0(R1), F12
          LD   F14, 0(R1)
          ADDD F16, F14, F2
          SD   0(R1), F16
          BNEZ R1, Loop

- Data dependencies SUBI, LD, SD
    - Force sequential execution of iterations
- Compiler removes this dependency by:
    - Computing intermediate R1 values
    - Eliminating intermediate SUBI
    - Changing final SUBI
- Data flow analysis
    - Can do on registers
    - Cannot do easily on memory locations
        - e.g., does 100(R1) refer to the same address as 20(R2)?

4. Alleviating Data Dependencies

Unrolled loop:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          LD   F6, -8(R1)
          ADDD F8, F6, F2
          SD   -8(R1), F8
          LD   F10, -16(R1)
          ADDD F12, F10, F2
          SD   -16(R1), F12
          LD   F14, -24(R1)
          ADDD F16, F14, F2
          SD   -24(R1), F16
          SUBI R1, R1, #32
          BNEZ R1, Loop

Scheduled unrolled loop:

    Loop: LD   F0, 0(R1)
          LD   F6, -8(R1)
          LD   F10, -16(R1)
          LD   F14, -24(R1)
          ADDD F4, F0, F2
          ADDD F8, F6, F2
          ADDD F12, F10, F2
          ADDD F16, F14, F2
          SD   0(R1), F4
          SD   -8(R1), F8
          SUBI R1, R1, #32
          SD   16(R1), F12
          BNEZ R1, Loop
          SD   8(R1), F16

Some General Comments
- Dependences are a property of programs
- Actual hazards are a property of the pipeline
- Techniques to avoid dependence limitations
    - Maintain dependences but avoid hazards
        - Code scheduling (hardware or software)
    - Eliminate dependences by code transformations
        - Complex
        - Compiler-based

Loop-level Parallelism
- Primary focus of dependence analysis
- Determine all dependences and find cycles

Dependence only within an iteration:

    for (i=1; i<=100; i=i+1) {
        x[i] = y[i] + z[i];
        w[i] = x[i] + v[i];
    }

Loop-carried dependence that forms a cycle:

    for (i=1; i<=100; i=i+1) {
        x[i+1] = x[i] + z[i];
    }

Loop-carried dependence without a cycle:

    for (i=1; i<=100; i=i+1) {
        x[i] = x[i] + y[i];
        y[i+1] = w[i] + z[i];
    }

The same loop transformed so that the dependence stays inside one iteration:

    x[1] = x[1] + y[1];
    for (i=1; i<=99; i=i+1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];
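The equivalence of the loop with the acyclic loop-carried dependence and its transformed version can be checked directly. The C sketch below wraps both forms in functions; the function names and the constant `N` are introduced here for illustration, and the arrays are assumed to hold at least `N + 2` elements so that indexes 1 through 101 are valid.

```c
#define N 100  /* trip count used on the slide */

/* Loop as written on the slide: x[i] reads the y[i] produced by the
   previous iteration, so there is a loop-carried dependence. */
void original_loop(double *x, double *y, const double *w, const double *z) {
    for (int i = 1; i <= N; i++) {
        x[i] = x[i] + y[i];
        y[i + 1] = w[i] + z[i];
    }
}

/* Transformed loop: the first x update and the last y update are peeled
   off, so each iteration now produces and consumes its own y[i+1]. */
void transformed_loop(double *x, double *y, const double *w, const double *z) {
    x[1] = x[1] + y[1];
    for (int i = 1; i <= N - 1; i++) {
        y[i + 1] = w[i] + z[i];
        x[i + 1] = x[i + 1] + y[i + 1];
    }
    y[N + 1] = w[N] + z[N];
}
```

Running both functions on identical copies of the input arrays should leave `x` and `y` with the same contents, since the two versions execute the same assignments, only grouped into iterations differently.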

Dependence Analysis Algorithms
- Assume array indexes are affine (ai + b)
- GCD test:
    - For two affine array indexes ai+b and ci+d: if a loop-carried dependence exists, then GCD(c,a) must divide (d-b)
    - Example: x[8*i] = x[4*i + 2] + 3
        - (d-b)/GCD(c,a) = (2-0)/GCD(8,4) = 2/4, which is not an integer, so no dependence is possible
- Exact dependence determination is NP-complete in the general case
- a, b, c, and d may not be known at compile time
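The GCD test is short enough to write out. The C sketch below is a minimal version under the slide's assumptions (a store to index a*i + b and a load from index c*i + d, with all four constants known); the function names are hypothetical. The test is conservative: a `false` result rules a dependence out, while a `true` result only means one may exist, because loop bounds are ignored.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor of |m| and |n| (Euclid's algorithm). */
static long gcd(long m, long n) {
    m = labs(m);
    n = labs(n);
    while (n != 0) {
        long t = m % n;
        m = n;
        n = t;
    }
    return m;
}

/* GCD test for a store to x[a*i + b] and a load from x[c*i + d].
   Returns false when no loop-carried dependence can exist,
   true when the test cannot rule one out. */
bool gcd_test_may_depend(long a, long b, long c, long d) {
    long g = gcd(a, c);
    if (g == 0)              /* both indexes are constants */
        return b == d;
    return (d - b) % g == 0;
}
```

For the slide's example, `gcd_test_may_depend(8, 0, 4, 2)` returns `false`, since GCD(8,4) = 4 does not divide 2. For a pair such as x[2*i + 3] and x[2*i + 1], `gcd_test_may_depend(2, 3, 2, 1)` returns `true`.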

Software Pipelining

[Figure: Iterations 0, 1, 2, and 3 overlapped in time. Each software-pipelined iteration is drawn across several original iterations, with start-up code before and finish-up code after the pipelined iterations.]

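Example

Original loop:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BNEZ R1, Loop

Three consecutive iterations of the loop body:

    Iteration i:    LD   F0, 0(R1)
                    ADDD F4, F0, F2
                    SD   0(R1), F4
    Iteration i+1:  LD   F0, 0(R1)
                    ADDD F4, F0, F2
                    SD   0(R1), F4
    Iteration i+2:  LD   F0, 0(R1)
                    ADDD F4, F0, F2
                    SD   0(R1), F4

Software-pipelined loop (one instruction taken from each of the three iterations):

    Loop: SD   16(R1), F4      ; store for iteration i
          ADDD F4, F0, F2      ; add for iteration i+1
          LD   F0, 0(R1)       ; load for iteration i+2
          SUBI R1, R1, #8
          BNEZ R1, Loop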
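The same restructuring can be sketched at the source level, including the start-up and finish-up code that the assembly example leaves out. The C version below is an illustration under assumptions: the function names are hypothetical, the array is traversed upward from index 0, and arrays shorter than two elements fall back to the plain loop.

```c
#include <stddef.h>

/* Reference version: one load, one add, one store per iteration. */
void add_scalar_plain(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Software-pipelined version: each trip through the loop stores the
   result for element i, adds for element i+1, and loads element i+2. */
void add_scalar_swp(double *x, size_t n, double s) {
    if (n < 2) {                      /* too short to pipeline */
        add_scalar_plain(x, n, s);
        return;
    }
    /* Start-up: fill the pipeline. */
    double loaded = x[0];             /* load for element 0 */
    double sum = loaded + s;          /* add for element 0 */
    loaded = x[1];                    /* load for element 1 */

    for (size_t i = 0; i + 2 < n; i++) {
        x[i] = sum;                   /* store for element i */
        sum = loaded + s;             /* add for element i+1 */
        loaded = x[i + 2];            /* load for element i+2 */
    }

    /* Finish-up: drain the pipeline. */
    x[n - 2] = sum;                   /* store for element n-2 */
    x[n - 1] = loaded + s;            /* add and store for element n-1 */
}
```

Inside the pipelined loop the store, add, and load belong to three different elements, so none of them has to wait on a result produced in the same trip. That is the property the SD/ADDD/LD ordering in the assembly version relies on.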

Trace (global-code) Scheduling
- Find ILP across conditional branches
- Two-step process
    - Trace selection
        - Find a trace (sequence of basic blocks)
        - Use loop unrolling to generate long traces
        - Use static branch prediction for other conditional branches
    - Trace compaction
        - Squeeze the trace into a small number of wide instructions
        - Preserve data and control dependences

Trace Selection

[Figure: control-flow graph. A[I] = A[I] + B[I]; test A[I] = 0?; on T: B[I] = ...; on F: X; both paths join at C[I] = ...]

          LW   R4, 0(R1)
          LW   R5, 0(R2)
          ADD  R4, R4, R5
          SW   0(R1), R4
          BNEZ R4, else
          . . . .
          SW   0(R2), . . .
          J    join
    Else: . . . .
          X
    Join: . . . .
          SW   0(R3), . . .

Summary of Compiler Techniques
- Try to avoid dependence stalls
- Loop unrolling
    - Reduce loop overhead
- Software pipelining
    - Reduce single body dependence stalls
- Trace scheduling
    - Reduce impact of other branches
- Compilers use a mix of the three
- All techniques depend on prediction accuracy

Food for thought: Analyze this
- Analyze this for different values of X and Y
    - To evaluate different branch prediction schemes
    - For compiler scheduling purposes

          add  r1, r0, 1000     # all numbers in decimal
          add  r2, r0, a        # Base address of array a
    loop: andi r10, r1, X
          beqz r10, even
          lw   r11, 0(r2)
          addi r11, r11, 1
          sw   0(r2), r11
    even: addi r2, r2, 4
          subi r1, r1, Y
          bnez r1, loop
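One way to start on the exercise is to count how often `beqz r10, even` is taken for a given X and Y. The C sketch below is a hypothetical helper, not part of the original exercise: `simulate`, `struct BranchStats`, and the `max_iters` cap are names introduced here, registers are modeled as 32-bit unsigned values, and the memory accesses are ignored because they do not affect the branches. The cap is there because some choices of Y (for example 0 or 3) never bring r1 to exactly zero within a reasonable number of iterations.

```c
#include <stdint.h>

struct BranchStats {
    uint32_t iterations;   /* times the loop body ran */
    uint32_t beqz_taken;   /* times "beqz r10, even" was taken */
    int terminated;        /* 1 if r1 reached 0 before the cap */
};

/* Simulate the branch behaviour of the loop for mask X and step Y,
   stopping after max_iters iterations at the latest. */
struct BranchStats simulate(uint32_t x_mask, uint32_t y_step,
                            uint32_t max_iters) {
    struct BranchStats st = {0, 0, 0};
    uint32_t r1 = 1000;                    /* add r1, r0, 1000 */
    while (st.iterations < max_iters) {
        st.iterations++;
        if ((r1 & x_mask) == 0)            /* andi r10, r1, X; beqz r10, even */
            st.beqz_taken++;
        r1 -= y_step;                      /* subi r1, r1, Y */
        if (r1 == 0) {                     /* bnez r1, loop falls through */
            st.terminated = 1;
            break;
        }
    }
    return st;
}
```

With X = 1 and Y = 1 the loop runs 1000 times and the `beqz` is taken on the 500 even values of r1, so it alternates every iteration. With X = 1 and Y = 2, r1 stays even, and the `beqz` is taken on all 500 iterations. Patterns like these are what distinguish a simple one-bit predictor from schemes that track history.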