Compiler techniques for exposing ILP (cont)

Loop-Level Parallelism

- Analysis at the source level
- Dependences across iterations

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;           /* no dependence across iterations */

    for (i=1; i<=100; i=i+1) {
        x[i+1] = x[i] + z[i];      /* loop-carried dependence */
        y[i+1] = y[i] + x[i+1];
    }
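Because the first loop carries no dependence between iterations, the compiler is free to overlap or reorder them. A minimal source-level sketch of unrolling it by four (safe here because the trip count, 1000, is a multiple of the unroll factor):

    /* Hypothetical unrolled form of the independent loop above.
       Legal only because no iteration reads a value another writes. */
    for (i=1000; i>0; i=i-4) {
        x[i]   = x[i]   + s;    /* four disjoint elements, so the      */
        x[i-1] = x[i-1] + s;    /* scheduler can interleave the loads, */
        x[i-2] = x[i-2] + s;    /* adds, and stores freely             */
        x[i-3] = x[i-3] + s;
    }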

Loop-Carried Dependences

    for (i=1; i<=100; i=i+1) {
        x[i] = x[i] + y[i];
        y[i+1] = w[i] + z[i];
    }

- Non-circular dependences: y[i] is produced one iteration earlier, but
  nothing flows back the other way, so the loop can be transformed to
  remove the loop-carried dependence:

    x[1] = x[1] + y[1];
    for (i=1; i<=99; i=i+1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];
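A quick way to convince yourself the transformation is safe is to run both versions on the same data and compare. A throwaway check (array sizes follow the slide; the initial values are arbitrary):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        double x1[102], y1[102], x2[102], y2[102], w[102], z[102];
        int i;
        for (i = 0; i <= 101; i++) {       /* arbitrary test data */
            x1[i] = x2[i] = i * 0.5;
            y1[i] = y2[i] = i * 0.25;
            w[i] = i;
            z[i] = 2.0 * i;
        }
        /* original loop */
        for (i = 1; i <= 100; i++) {
            x1[i] = x1[i] + y1[i];
            y1[i+1] = w[i] + z[i];
        }
        /* transformed loop */
        x2[1] = x2[1] + y2[1];
        for (i = 1; i <= 99; i++) {
            y2[i+1] = w[i] + z[i];
            x2[i+1] = x2[i+1] + y2[i+1];
        }
        y2[101] = w[100] + z[100];

        printf("%s\n", (memcmp(x1, x2, sizeof x1) == 0 &&
                        memcmp(y1, y2, sizeof y1) == 0) ? "match" : "MISMATCH");
        return 0;
    }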

Compiler support for ILP

- Dependence analysis. Finding dependences is important for:
  - Good scheduling of code
  - Determining loop-level parallelism
  - Eliminating name dependences
  Complexity:
  - Simple for scalar variable references
  - Complex for pointers and array references
- Software pipelining
- Trace scheduling

Loop-level Parallelism

- Primary focus of dependence analysis
- Determine all dependences and find cycles

    for (i=1; i<=100; i=i+1) {
        x[i] = y[i] + z[i];        /* no loop-carried dependence */
        w[i] = x[i] + v[i];
    }

    for (i=1; i<=100; i=i+1) {
        x[i+1] = x[i] + z[i];      /* loop-carried, recurrent, circular dependence */
    }

    for (i=1; i<=100; i=i+1) {     /* loop-carried but non-circular, */
        x[i] = x[i] + y[i];        /* so it can be transformed:      */
        y[i+1] = w[i] + z[i];
    }

    x[1] = x[1] + y[1];
    for (i=1; i<=99; i=i+1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];

Dependence Analysis Algorithms

- Assume array indexes are affine: a*i + b
- GCD test: for two affine indexes a*i + b (written) and c*i + d (read),
  if a loop-carried dependence exists, then GCD(c,a) must divide (d-b)
- Example: x[8*i] = x[4*i + 2] + 3
  Here a=8, b=0, c=4, d=2; GCD(4,8) = 4 does not divide (2-0) = 2,
  so no dependence is possible
- General graph cycle determination is NP-complete
- Limitation: a, b, c, and d may not be known at compile time
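The test itself is mechanical. A minimal sketch in C (assuming the coefficients are known, positive compile-time constants):

    #include <stdio.h>

    static int gcd(int a, int b) {            /* Euclid's algorithm */
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    /* Write to x[a*i + b], read of x[c*i + d]: a loop-carried dependence
       can exist only if gcd(a, c) divides d - b.
       Returns 1 if a dependence is possible, 0 if it is ruled out. */
    static int gcd_test(int a, int b, int c, int d) {
        return (d - b) % gcd(a, c) == 0;
    }

    int main(void) {
        /* The slide's example: x[8*i] = x[4*i + 2] + 3 */
        printf("%d\n", gcd_test(8, 0, 4, 2));  /* prints 0: no dependence */
        return 0;
    }

Note the test is conservative in one direction only: when it returns 1, a dependence may exist but is not guaranteed (the loop bounds can still rule it out).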

Software Pipelining

[Figure: instructions from four consecutive original iterations (iteration 0
through iteration 3) are combined so that each software-pipelined iteration
mixes instructions drawn from different original iterations, with start-up
code before the pipelined loop and finish-up code after it.]

Example

Each original iteration i, i+1, i+2 executes:

    LD   F0, 0(R1)
    ADDD F4, F0, F2
    SD   0(R1), F4

Original loop:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

Software-pipelined loop body (one instruction from each of three iterations):

    Loop: SD    16(R1), F4    ; store: result of the element loaded two trips ago
          ADDD  F4, F0, F2    ; add:   the element loaded on the previous trip
          LD    F0, 0(R1)     ; load:  a new element
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
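The same rotation can be sketched at the source level (a conceptual illustration only, not what a compiler literally emits; assumes at least two elements):

    /* Software-pipelined form of: for (i = 0; i < n; i++) a[i] = a[i] + c;
       Conceptual sketch; assumes n >= 2. */
    void add_pipelined(double *a, int n, double c) {
        double t = a[0];         /* start-up: load for iteration 0          */
        double s = t + c;        /* start-up: add for iteration 0           */
        t = a[1];                /* start-up: load for iteration 1          */
        for (int i = 2; i < n; i++) {
            a[i-2] = s;          /* store for iteration i-2                 */
            s = t + c;           /* add for iteration i-1                   */
            t = a[i];            /* load for iteration i                    */
        }
        a[n-2] = s;              /* finish-up: store for iteration n-2      */
        a[n-1] = t + c;          /* finish-up: add and store, iteration n-1 */
    }

As in the assembly version, the store, add, and load inside the loop belong to three different original iterations, so none of them has to wait for the others within one trip.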

Trace (global-code) Scheduling

- Find ILP across conditional branches
- Two-step process:
  1. Trace selection
     - Find a trace (a sequence of basic blocks)
     - Use loop unrolling to generate long traces
     - Use static branch prediction for other conditional branches
  2. Trace compaction
     - Squeeze the trace into a small number of wide instructions
     - Preserve data and control dependences

Trace Selection

Source fragment (shown as a flowchart on the slide):

    A[I] = A[I] + B[I];
    if (A[I] == 0)
        B[I] = . . . ;     /* T path */
    else
        X;                 /* F path */
    C[I] = . . . ;

Code, with the selected trace following the predicted (then) path:

    LW   R4, 0(R1)        ; A[I] = A[I] + B[I]
    LW   R5, 0(R2)
    ADD  R4, R4, R5
    SW   0(R1), R4
    BNEZ R4, else         ; A[I] = 0? predicted: fall through
    SW   0(R2), . . .     ; B[I] = . . .
    J    join
    else: . . .           ; X
    join: SW 0(R3), . . . ; C[I] = . . .

Summary of Compiler Techniques

- Try to avoid dependence stalls
- Loop unrolling: reduces loop overhead
- Software pipelining: reduces dependence stalls within a single loop body
- Trace scheduling: reduces the impact of other branches
- Compilers use a mix of all three
- All techniques depend on branch prediction accuracy

Analyze this

Analyze this loop for different values of X and Y:
- to evaluate different branch prediction schemes
- for compiler scheduling purposes

          add  r1, r0, 1000    # all numbers in decimal
          add  r2, r0, a       # base address of array a
    loop: andi r10, r1, X
          beqz r10, even
          lw   r11, 0(r2)
          addi r11, r11, 1
          sw   0(r2), r11
    even: addi r2, r2, 4
          subi r1, r1, Y
          bnez r1, loop
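To get started, it can help to restate the loop in C (a sketch; X and Y stand for the constants you are asked to vary):

    /* Hypothetical C restatement of the loop above. */
    #define X 1
    #define Y 1

    void analyze(int *a) {
        int i = 1000;        /* r1: counter, counts down            */
        int *p = a;          /* r2: walks the array a               */
        do {
            if (i & X)       /* beqz branches when (i & X) == 0     */
                *p += 1;     /* lw/addi/sw: increment the element   */
            p += 1;          /* addi r2, r2, 4 (one word)           */
            i -= Y;          /* subi r1, r1, Y                      */
        } while (i != 0);    /* bnez r1, loop                       */
    }
    /* E.g. with X = 1, Y = 1 the inner branch alternates taken /
       not-taken every iteration, which defeats 1-bit and 2-bit
       counters but is easy for a history-based predictor. */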

Midterm Performance

- Solutions posted on the website
- Range: 16.5 to 45.5 out of 50

    15 - 19.5   C+
    20 - 24.5   B-
    25 - 29.5   B
    30 - 34.5   B+
    35 - 40     A-
    40 - 45     A
    45+         A+

- If you have questions about your score, please see me on Thursday during
  office hours