Optimizing Compilers for Modern Architectures
Enhancing Fine-Grained Parallelism, Part II
Chapter 5 of Allen and Kennedy

Seen So Far...
Uncovering potential vectorization in loops by
—Loop Interchange
—Scalar Expansion
—Scalar and Array Renaming
Safety and profitability of these transformations
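As a reminder of how these work, a minimal scalar-expansion sketch (illustrative, not taken from the slides): expanding the scalar T into a compiler temporary T$ removes the loop-carried reuse of T, so both statements can run in vector mode.

    DO I = 1, N
      T = A(I) + B(I)
      C(I) = T + T
    ENDDO

becomes

    T$(1:N) = A(1:N) + B(1:N)
    C(1:N) = T$(1:N) + T$(1:N)
    T = T$(N)      ! needed only if T is live after the loop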

Today's Talk...
More transformations
—Node Splitting
—Recognition of Reductions
—Index-Set Splitting
—Run-time Symbolic Resolution
—Loop Skewing
A unified framework to generate vector code

Node Splitting
Sometimes renaming fails:

    DO I = 1, N
    S1:   A(I) = X(I+1) + X(I)
    S2:   X(I+1) = B(I) + 32
    ENDDO

The recurrence is kept intact by the renaming algorithm: there is a loop-independent antidependence from S1's read of X(I+1) to S2's write of it, and a loop-carried true dependence from S2's write back to S1's read of X(I) on the next iteration, and renaming cannot break this cycle.

Node Splitting

    DO I = 1, N
    S1:   A(I) = X(I+1) + X(I)
    S2:   X(I+1) = B(I) + 32
    ENDDO

Break the critical antidependence: make a copy of the node from which the antidependence emanates.

    DO I = 1, N
    S1':  X$(I) = X(I+1)
    S1:   A(I) = X$(I) + X(I)
    S2:   X(I+1) = B(I) + 32
    ENDDO

The recurrence is broken; this vectorizes to

    X$(1:N) = X(2:N+1)
    X(2:N+1) = B(1:N) + 32
    A(1:N) = X$(1:N) + X(1:N)

Node Splitting Algorithm
Given a constant, loop-independent antidependence D:
—Add a new assignment x: T$ = source(D)
—Insert x before source(D)
—Replace source(D) with T$
—Update the dependence graph accordingly
Applied to the loop above, with D the antidependence on X(I+1), these steps produce exactly the X$ version shown earlier.

Node Splitting: Profitability
Not always profitable. For example:

    DO I = 1, N
    S1:   A(I) = X(I+1) + X(I)
    S2:   X(I+1) = A(I) + 32
    ENDDO

Node splitting gives

    DO I = 1, N
    S1':  X$(I) = X(I+1)
    S1:   A(I) = X$(I) + X(I)
    S2:   X(I+1) = A(I) + 32
    ENDDO

The recurrence is still not broken: the antidependence was not critical. Here the cycle runs through true dependences (S1 feeds A(I) to S2, and S2 feeds X back to S1 on the next iteration), which node splitting cannot remove.

Node Splitting
Determining a minimal set of critical antidependences is NP-complete, so a perfect job of node splitting is difficult. Heuristic:
—Select an antidependence
—Delete it and test whether the dependence graph becomes acyclic
—If it does, apply node splitting

Recognition of Reductions
Sum reduction, min/max reduction, count reduction: a vector is reduced to a single element.

    S = 0.0
    DO I = 1, N
      S = S + A(I)
    ENDDO

Not directly vectorizable.
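For reference, the min and count reductions named above have the same single-accumulator shape (an illustrative sketch, not from the slides); each carries the same self dependence cycle on the scalar accumulator, so neither is directly vectorizable either.

    AMIN = A(1)                        ! min reduction
    DO I = 2, N
      IF (A(I) .LT. AMIN) AMIN = A(I)
    ENDDO

    ICOUNT = 0                         ! count reduction
    DO I = 1, N
      IF (A(I) .GT. 0.0) ICOUNT = ICOUNT + 1
    ENDDO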

Recognition of Reductions
Assuming commutativity and associativity:

    S = 0.0
    DO k = 1, 4
      SUM(k) = 0.0
      DO I = k, N, 4
        SUM(k) = SUM(k) + A(I)
      ENDDO
      S = S + SUM(k)
    ENDDO

Distribute the k loop:

    S = 0.0
    DO k = 1, 4
      SUM(k) = 0.0
    ENDDO
    DO k = 1, 4
      DO I = k, N, 4
        SUM(k) = SUM(k) + A(I)
      ENDDO
    ENDDO
    DO k = 1, 4
      S = S + SUM(k)
    ENDDO

Recognition of Reductions
After loop interchange:

    DO I = 1, N, 4
      DO k = I, min(I+3,N)
        SUM(k-I+1) = SUM(k-I+1) + A(k)
      ENDDO
    ENDDO

Vectorize:

    DO I = 1, N, 4
      SUM(1:4) = SUM(1:4) + A(I:I+3)
    ENDDO

Recognition of Reductions
This is useful for vector machines with a four-stage pipeline: recognize the reduction and replace it with the efficient version, as sketched below.
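Putting the pieces together, a sketch of the recognized reduction replaced by the efficient version (assuming, for simplicity, that N is a multiple of 4):

    S = 0.0
    SUM(1:4) = 0.0
    DO I = 1, N, 4
      SUM(1:4) = SUM(1:4) + A(I:I+3)   ! four independent partial sums keep the pipeline full
    ENDDO
    S = SUM(1) + SUM(2) + SUM(3) + SUM(4)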

Recognition of Reductions
Properties of reductions:
—Reduce a vector/array to one element
—No use of intermediate values
—The reduction operates on the vector and nothing else

Recognition of Reductions
A reduction is recognized by
—the presence of self true, output, and anti dependences
—the absence of other true dependences

    DO I = 1, N
      S = S + A(I)
    ENDDO

is a reduction;

    DO I = 1, N
      S = S + A(I)
      T(I) = S
    ENDDO

is not, because T(I) = S uses the intermediate values of S.

Index-Set Splitting
Subdivide the loop into different iteration ranges to achieve partial parallelization:
—Threshold analysis [strong SIV, weak crossing SIV]
—Loop peeling [weak zero SIV]
—Section-based splitting [a variation of loop peeling]

Threshold Analysis

    DO I = 1, 20
      A(I+20) = A(I) + B
    ENDDO

vectorizes to

    A(21:40) = A(1:20) + B

When the trip count exceeds the dependence threshold,

    DO I = 1, 100
      A(I+20) = A(I) + B
    ENDDO

strip mine to

    DO I = 1, 100, 20
      DO i = I, I+19
        A(i+20) = A(i) + B
      ENDDO
    ENDDO

and vectorize the inner loop (see the sketch below).
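Each strip writes A(I+20:I+39) and reads A(I:I+19), which are disjoint, so the inner loop is safe in vector form; a sketch of the result:

    DO I = 1, 100, 20
      A(I+20:I+39) = A(I:I+19) + B
    ENDDO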

Threshold Analysis
Crossing thresholds:

    DO I = 1, 100
      A(101-I) = A(I) + B
    ENDDO

Strip mine to

    DO I = 1, 100, 50
      DO i = I, I+49
        A(101-i) = A(i) + B
      ENDDO
    ENDDO

Vectorize to

    DO I = 1, 100, 50
      A(101-I:52-I:-1) = A(I:I+49) + B
    ENDDO

Loop Peeling
The source of the dependence is a single iteration:

    DO I = 1, N
      A(I) = A(I) + A(1)
    ENDDO

Peel the loop:

    A(1) = A(1) + A(1)
    DO I = 2, N
      A(I) = A(I) + A(1)
    ENDDO

Vectorize to

    A(1) = A(1) + A(1)
    A(2:N) = A(2:N) + A(1)

Section-based Splitting

    DO I = 1, N
      DO J = 1, N/2
    S1:   B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N
    S2:   A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO

—The J loops are bound in a recurrence due to B
—Only a portion of B is responsible for it
Partition the second loop into a loop that uses the result of S1 and a loop that does not:

    DO I = 1, N
      DO J = 1, N/2
    S1:   B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
    S2:   A(J,I+1) = B(J,I) + D
      ENDDO
      DO J = N/2+1, N
    S3:   A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO

Section-based Splitting
In the split code above, S3 is now independent of S1 and S2. Loop distribution gives:

    DO I = 1, N
      DO J = N/2+1, N
    S3:   A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO
    DO I = 1, N
      DO J = 1, N/2
    S1:   B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
    S2:   A(J,I+1) = B(J,I) + D
      ENDDO
    ENDDO

Section-based Splitting
The distributed code vectorizes to

    A(N/2+1:N, 2:N+1) = B(N/2+1:N, 1:N) + D
    DO I = 1, N
      B(1:N/2,I) = A(1:N/2,I) + C
      A(1:N/2,I+1) = B(1:N/2,I) + D
    ENDDO

Run-time Symbolic Resolution
"Breaking conditions":

    DO I = 1, N
      A(I+L) = A(I) + B(I)
    ENDDO

Transformed to

    IF (L .LE. 0) THEN
      A(L+1:N+L) = A(1:N) + B(1:N)
    ELSE
      DO I = 1, N
        A(I+L) = A(I) + B(I)
      ENDDO
    ENDIF

Run-time Symbolic Resolution
Identifying the minimum number of breaking conditions needed to break a recurrence is NP-complete. Heuristic:
—Identify when a critical dependence can be conditionally eliminated via a breaking condition

Loop Skewing
Reshape the iteration space to uncover parallelism:

    DO I = 1, N
      DO J = 1, N
    S:    A(I,J) = A(I-1,J) + A(I,J-1)
      ENDDO
    ENDDO

The reference A(I-1,J) gives a dependence with direction vector (<,=) and A(I,J-1) gives (=,<). The parallelism is not apparent.

Loop Skewing
Dependence pattern before loop skewing (figure not reproduced in the transcript).

Loop Skewing
Apply the skewing transformation j = J + I (equivalently J = j - I):

    DO I = 1, N
      DO j = I+1, I+N
    S:    A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
    ENDDO

Note that the direction vectors change: (<,=) becomes (<,<), while (=,<) is unchanged.

Loop Skewing
The accesses to A follow this pattern (each assignment is shown with the skewed iteration S(I,j) that executes it):

    A(1,1) = A(0,1) + A(1,0)    at S(1,2)
    A(1,2) = A(0,2) + A(1,1)    at S(1,3)
    A(1,3) = A(0,3) + A(1,2)    at S(1,4)
    A(1,4) = A(0,4) + A(1,3)    at S(1,5)
    A(2,1) = A(1,1) + A(2,0)    at S(2,3)
    A(2,2) = A(1,2) + A(2,1)    at S(2,4)
    A(2,3) = A(1,3) + A(2,2)    at S(2,5)
    A(2,4) = A(1,4) + A(2,3)    at S(2,6)

Loop Skewing
Dependence pattern after loop skewing (figure not reproduced in the transcript).

Loop Skewing

    DO I = 1, N
      DO j = I+1, I+N
    S:    A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
    ENDDO

Loop interchange gives

    DO j = 2, N+N
      DO I = max(1,j-N), min(N,j-1)
    S:    A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
    ENDDO

Vectorize to

    DO j = 2, N+N
      FORALL I = max(1,j-N), min(N,j-1)
    S:    A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      END FORALL
    ENDDO

Loop Skewing
Disadvantages:
—Varying vector length; not profitable if N is small
—If the vector startup time exceeds the time saved, the transformation is not profitable
—Vector bounds must be recomputed on each iteration of the outer loop
Apply loop skewing only if everything else fails.

Putting It All Together
The good part:
—More transformations mean more choices for exploiting parallelism
The bad part:
—Choosing the right transformation
—Automating the transformation-selection process
—Interference between transformations

Putting It All Together
Example of interference:

    DO I = 1, N
      DO J = 1, M
        S(I) = S(I) + A(I,J)
      ENDDO
    ENDDO

Sum reduction gives

    DO I = 1, N
      S(I) = S(I) + SUM(A(I,1:M))
    ENDDO

while loop interchange and vectorization give

    DO J = 1, M
      S(1:N) = S(1:N) + A(1:N,J)
    ENDDO

Applying the reduction first forecloses the interchange-based version: the two transformations interfere.

Putting It All Together
Any algorithm that tries to tie all these transformations together must
—take a global view of the transformed code
—know the architecture of the target machine
Goal of our algorithm: find ONE good vector loop [this works well for most vector register architectures]

Unified Framework
Detection: find ALL loops for EACH statement that can be run in vector mode.
Selection: choose the best loop for vector execution for EACH statement.
Transformation: carry out the transformations needed to vectorize the selected loop.

Unified Framework: Detection

    procedure mark_loop(S, D)
      for each edge e in D deletable by scalar expansion,
          array and scalar renaming, node splitting,
          or symbolic resolution do begin
        add e to deletable_edges;
        delete e from D;
      end
      mark_gen(S, 1, D);
      for each statement x in S with no vector loop marked do begin
        attempt index-set splitting and loop skewing;
        mark vector loops found;
      end
      // Restore deletable edges from deletable_edges to D
    end mark_loop

Unified Framework: Detection

    procedure mark_gen(S, k, D)
    // Variation of codegen; does no vectorization, only marks vector loops
      for i = 1 to m do begin
        if S_i is cyclic then
          if the outermost carried dependence is at level p > k then
            // Loop shifting
            mark all loops at level < p as vector for S_i;
          else if S_i is a reduction then begin
            mark loop k as vector;
            mark S_i reduction;
          end
          else begin
            // Recur at deeper level
            mark_gen(S_i, k+1, D_i);
          end
        else
          mark statements in S_i as vector for loops k and deeper;
      end
    end mark_gen

Selection and Transformation

    procedure transform_code(R, k, D)
    // Variation of codegen
      for i = 1 to m do begin
        if k is the index of the best vector loop
            for some statement in R_i then
          if R_i is cyclic then begin
            select_and_apply_transformation(R_i, k, D);
            // Retry vectorization on the new dependence graph
            transform_code(R_i, k, D);
          end
          else
            generate a vector statement for R_i in loop k;
        else begin
          // Recur at deeper level
          // Generate level-k DO and ENDDO statements
          transform_code(R_i, k+1, D);
        end
      end
    end transform_code

Selection of Transformations

    procedure select_and_apply_transformation(R_i, k, D)
      if loop k does not carry a dependence in R_i then
        shift loop k to the innermost position;
      else if R_i is a reduction at level k then
        replace with a reduction and adjust dependences;
      else // transform and adjust dependences
        if array renaming is possible then
          apply array renaming and adjust dependences;
        else if node splitting is possible then
          apply node splitting and adjust dependences;
        else if scalar expansion is possible then
          apply scalar expansion and adjust dependences;
        else
          apply loop skewing or index-set splitting and adjust dependences;
    end select_and_apply_transformation

Performance on Benchmarks
(benchmark results chart not reproduced in the transcript)