Loop Tiling for Iterative Stencil Computations Marta Jiménez.


What is an Iterative Stencil Computation? ISCs are often performed for PDE, GM, and IP applications –swim, tomcatv, mgrid (from the SPEC95 benchmark suite) –Jacobi DO K = 1, NITER /* time-step loop */ do J =... do I =... {A(I,J), A(I+1,J),…} enddo {wrapped-around computations} ENDDO Matrix A
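The Fortran fragment above elides the stencil body; a minimal sketch of what such a computation looks like (a Jacobi-style 4-point stencil swept repeatedly over a 2D grid; the grid contents and the averaging kernel are illustrative assumptions, not SWIM's actual code):

```python
def jacobi_step(a):
    """One time step: each interior point becomes the average of its
    four neighbours; boundary values are left unchanged."""
    n = len(a)
    b = [row[:] for row in a]
    for j in range(1, n - 1):
        for i in range(1, n - 1):
            b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1])
    return b

def jacobi(a, niter):
    """The time-step loop (the outer K loop in the slides)."""
    for _ in range(niter):
        a = jacobi_step(a)
    return a
```

The outer loop over time steps is the K loop that the later slides tile together with the spatial loops.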

Loop Tiling –divides the iteration space (IS) into regular tiles so that the working set fits in the memory level being exploited –can be applied hierarchically (Multilevel Tiling) Current algorithms for Loop Tiling are limited to loops that: –are “perfectly” nested –are fully permutable –define a rectangular IS However, in iterative stencil computations, loops are: –NOT perfectly nested –NOT fully permutable
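For a loop nest that does meet these conditions, tiling is just strip-mining plus loop interchange; a minimal sketch (the summation body and tile size are illustrative, not from the slides):

```python
def sum_tiled(a, ts):
    """Rectangular 2D tiling of a doubly nested loop: the JJ/II loops
    walk over tiles, the inner J/I loops walk inside each tile."""
    n, m = len(a), len(a[0])
    total = 0.0
    for jj in range(0, n, ts):
        for ii in range(0, m, ts):
            for j in range(jj, min(jj + ts, n)):
                for i in range(ii, min(ii + ts, m)):
                    total += a[j][i]
    return total
```

Each JJ/II tile touches at most a ts-by-ts block of the array, so the per-tile working set can be sized to the cache level being exploited.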

Today’s talk Show how Loop Tiling can be applied to iterative stencil computations –based on Song & Li’s paper [PLDI99] define a Program Model 1 Level of 1D-Tiling (cache) –program example: SWIM 2 levels of Tiling –2D-Tiling at the cache level –1D-Tiling at the register level (based on Jiménez et al. [ICS98][HPCA98]) Performance Results –Loop Tiling on EV5 & EV6

Steps 1- Apply a set of transformations to the original program to achieve the desired program model defined by Song & Li 2- Perform 2D-Tiling for the Cache Level 3- Perform 1D-Tiling for the Register Level

1st Step: achieve the desired program model DO K = 1, NITER /* time-step loop */ do J1 = L J1, U J1 do I1 = L I1, U I1 {A(I,J), A(I+1,J),…} enddo... do Jm = L Jm, U Jm do Im = L Im, U Im {A(I,J), A(I+1,J),…} enddo ENDDO Program Model: Usually, programs are NOT directly written in this form – We must apply a set of transformations to achieve this program model

SWIM original code initializations 90 NCYCLE = NCYCLE +1 CALL CALC1 CALL CALC2 IF (NCYCLE >= ITMAX) STOP IF (NCYCLE <= 1) THEN CALL CALC3Z ELSE CALL CALC3 ENDIF GO TO 90 Transformations – Inline subroutines – Convert GO TO into DO -loop – Peel iterations of the time-step loop to eliminate IF-statements guarded by NCYCLE SUBROUTINE CALCX do J = 1,N do I = 1,M... enddo c wrapped-around computations do J = 1, N... enddo do I = 1, M... enddo...

Wrapped-around Computations DO K = 2, ITMAX-1 do J = 1,N do I = 1,M... enddo wrapped-around comp do J = 1, N... enddo do I = 1, M... enddo... do J = 1,N do I = 1,M... enddo... ENDDO J I I J CALC1 CALC2 CALC3

Projection along direction I DO K = 2, ITMAX-1 do J = 1,N... enddo wrapped-around comp do J = 1, N... enddo do J = 1,N... enddo wrapped-around comp do J = 1, N... enddo... ENDDO c J Wrapped-around Computations c Another way of dealing with the wrapped-around computations is performing code sinking

1st Step: achieved program model Flow dependencies & iteration space for SWIM (projection along direction I) DO K = 2, ITMAX-1 do J = 1,N... enddo wrapped-around do J = 1,N... enddo wrapped-around do J = 1,N... enddo wrapped-around ENDDO [figure: CALC1, CALC2, CALC3 sweeps over J (1..N) for K=2, K=3 on the K-loop (time) axis]

Steps 1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li 2- Perform 2D-Tiling for the Cache Level 3- Perform 1D-Tiling for the Register Level

1D-Tiling With rectangular tiles across the time-step loop, dependencies are violated; the tiles must be skewed. Tiling parameters: SLOPE, OFFSET-i [figure: J axis (1..N) for K=2,3,4; skewed tiles drawn with SLOPE and OFFSET-i]
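One way to realize such skewed (parallelogram) tiles in code, sketched for a 1D 3-point stencil with non-periodic boundaries (an illustrative stand-in, not SWIM; SLOPE=1 because the dependence distance per time step is 1, and the tile size must be at least SLOPE+1):

```python
def stencil_reference(a, niter):
    """Untiled reference: one full J sweep per time step K."""
    for _ in range(niter):
        b = a[:]
        for j in range(1, len(a) - 1):
            b[j] = (a[j-1] + a[j] + a[j+1]) / 3.0
        a = b
    return a

def stencil_time_skewed(a0, niter, ts):
    """Tile the J loop across the time-step loop.  Tiles are
    parallelograms: each tile's J range shifts back by SLOPE per time
    step, so every value a tile reads from the previous time step has
    already been computed (by this tile or by a tile to its left).
    Keeping every time plane sidesteps anti/output dependencies."""
    n, slope = len(a0), 1              # SLOPE = dependence distance per step
    planes = [a0[:]]
    for _ in range(niter):             # boundary points keep initial values
        planes.append([a0[0]] + [0.0] * (n - 2) + [a0[-1]])
    for jj in range(1, n - 1 + slope * niter, ts):   # tiles, left to right
        for k in range(1, niter + 1):                # time steps inside a tile
            lo = max(1, jj - slope * (k - 1))
            hi = min(n - 1, jj + ts - slope * (k - 1))
            cur, prev = planes[k], planes[k - 1]
            for j in range(lo, hi):
                cur[j] = (prev[j-1] + prev[j] + prev[j+1]) / 3.0
    return planes[niter]
```

Because every point is computed from exactly the same operands as in the untiled sweep, the tiled and reference results match exactly; only the traversal order (and thus the locality) changes.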

2D-Tiling Tiling parameters: SLOPE, OFFSET-i for each tiled dimension (J and I), computed using the JI-loop distance subgraph [figure: 3D iteration space over K (time-step loop), J (1..N) and I (1..M), partitioned into skewed tiles]

JI-loop Distance Subgraph Each node represents a JI-loop nest Each edge represents a dependence (distance vector) [figure: nodes JI 1 -loop, JI 2 -loop, JI 3 -loop; flow, anti- and output dependence edges labeled with distance vectors such as [1,-1,0], [1,0,-1], [1,-1,-1], [1,0,0], [0,0,0]]
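A generic sketch of the legality reasoning such distance vectors support (not the paper's exact algorithm; the slides' SLOPE plays the role of the skew factor f here):

```python
def rectangular_tiling_legal(distance_vectors):
    """A loop band may be tiled with rectangular tiles only if it is
    fully permutable: every dependence distance must be non-negative
    in every loop of the band."""
    return all(all(d >= 0 for d in v) for v in distance_vectors)

def skew(distance_vectors, dim, f):
    """Skew loop `dim` by factor f with respect to the outermost
    (time) loop: component d[dim] becomes d[dim] + f * d[0]."""
    out = []
    for v in distance_vectors:
        w = list(v)
        w[dim] += f * w[0]
        out.append(tuple(w))
    return out
```

With the flow-dependence vectors shown on this slide, the (K, J, I) band is not fully permutable as written, but skewing J and I with respect to K makes every component non-negative, which is what licenses the tiling.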

SWIM: projection along direction I Backward dependencies with large distances make tiling unprofitable – apply Circular Loop Skewing to shorten the backward dependencies DO K = 2, ITMAX-1 do J = 1,N... enddo wrapped-around do J = 1,N... enddo wrapped-around do J = 1,N... enddo wrapped-around ENDDO [figure: long backward (wrapped-around) dependencies across J (1..N) for K=2, K=3 on the K-loop (time) axis]

Circular Loop Skewing Shortens backward dependencies by changing the iteration order CLS parameters: BETA-i, DELTA (computed using the JI-loop distance subgraph) [figure: iteration order over J (1..N) for K=2, K=3 before and after CLS, annotated with BETA-i and DELTA=2]

Circular Loop Skewing DO K = 2, ITMAX-1 do JX = 1+BETA1+DELTA(K-2), N+BETA1+DELTA(K-2) J = MOD(JX-1, N) + 1 enddo wrapped-around do JX = 1+BETA2+DELTA(K-2), N+BETA2+DELTA(K-2) J = MOD(JX-1, N) + 1 enddo wrapped-around do JX = 1+BETA3+DELTA(K-2), N+BETA3+DELTA(K-2) J = MOD(JX-1, N) + 1 enddo wrapped-around ENDDO
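In 0-based pseudocode, the effect of the MOD is simply a rotated visit order: the loop still covers every J exactly once per time step, starting from a point that advances with K. A sketch (parameter values in the usage below are illustrative):

```python
def cls_order(n, k, beta, delta):
    """Visit order of one J loop of length n after circular loop
    skewing: the starting point rotates by beta + delta*k with the
    time step k, wrapping around via the modulo."""
    start = beta + delta * k
    return [jx % n for jx in range(start, start + n)]
```

For example, with beta=1 and delta=2 the K=0 sweep visits 1,2,...,0 and the K=1 sweep starts two positions later, which is exactly how CLS shortens the backward (wrapped-around) dependencies.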

2nd Step: 2D-Tiling for cache level DO JJ =... DO II =... DO K =... if (first tile) then do JX =... offsets iter. enddo endif do JX =... iter. inside tile enddo do JX =... iter. inside tile enddo do JX =... iter. inside tile enddo ENDDO SWIM: projection along direction I CLS parameters: DELTA=2, BETA1=0, BETA2=1, BETA3=2 Tiling parameters: SLOPE=2, OFFSET1=1, OFFSET2=OFFSET3=0 [figure: tiled, skewed iteration space over J for K=2,3,4]

Steps 1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li 2- Perform 2D-Tiling for the Cache Level 3- Perform 1D-Tiling for the Register Level

3rd Step: 1D-Tiling for register level DO JJ =... DO II =... DO K =... do JX = L J, U J J = MOD(JX-1, N)+1 do IX = L I, U I I = MOD(IX-1, M)+1 [loop body: {I,J}] enddo... ENDDO The MOD operation introduced by CLS prevents us from fully unrolling the loop  first apply Index Set Splitting to the J loop [figure: iteration space wrapped along J, innermost I loop unrolled]

Index Set Splitting ISS splits a loop into two new loops that iterate over non-intersecting portions of the iteration space DO JJ =... DO II =... DO K =... do JX = L J, min(N,U J ) J = JX do IX =... enddo do JX = max(N+1,L J ), U J J = JX-N do IX =... enddo... ENDDO [figure: iteration space split at J=N into a non-wrapping part and a wrapping part]
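The same splitting in 0-based form, with the loop body abstracted as a callback (names here are illustrative): after ISS neither loop needs the modulo, so both can be fully unrolled.

```python
def wrapped_loop(body, lo, hi, n):
    """Original: one loop whose index wraps via a modulo (as after CLS)."""
    for jx in range(lo, hi):
        body(jx % n)

def split_loop(body, lo, hi, n):
    """After index set splitting: two loops over the non-wrapping and
    wrapping portions of [lo, hi); neither needs the modulo.
    Assumes lo < n and hi <= 2*n (at most one wrap), as after CLS."""
    for jx in range(lo, min(n, hi)):
        body(jx)            # j = jx
    for jx in range(max(n, lo), hi):
        body(jx - n)        # j = jx - n
```

Both versions visit the same indices in the same order; the second merely trades the per-iteration modulo for one extra loop, which is what re-enables register-level unrolling.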

3 rd Step: 1D-Tiling for register level

Code Transformations Summary 1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li –Inline subroutines –Convert GOTO into DO -loop –Peel iterations of the time-step loop to eliminate IF-statements 2- Perform 2D-Tiling for the Cache Level –Construct JI -loop distance subgraph –Compute DELTA and BETAs and apply CLS to shorten backward dependencies –Update JI -loop distance subgraph –Compute OFFSETs and SLOPE and tile the IS 3- Perform 1D-Tiling for the Register Level –Index Set Splitting –Tiling in a straightforward manner

Performance Results (SWIM) Architecture: EV56 (500 MHz, L1: 8KB, L2: 96KB), EV6 (500 MHz, L1: 64KB, L2: 4MB) Compiler invocation: –f77 -O5 -arch ev56 (EV5) –kf77 -O5 -arch ev6 -notransform_loop -unroll 1 (EV6) Programs: –1D-Tiling for the Cache Level: loop J, TS = 4 (EV5), TS = 8 (EV6) –2D-Tiling for the Cache Level: TS IxJ = 32x16 (EV5), TS IxJ = 40x12 (EV6) –1D-Tiling for the register level: loop J, TS = 4 (EV5 & EV6) Execution times:
        ORI     ORI+RT   1D      1D+RT   2D      2D+RT
EV5     1519s   1533s    1023s   999s    1009s   677s
EV6     439s    658s     294s    371s    578s    296s

Performance Results EV5 (SWIM) Architecture: EV56 (500 MHz, L1: 8KB, L2: 96KB) Compiler invocations: –base: kf77 -O5 -arch ev56 –no_prefetch: kf77 -O5 -arch ev56 -switch nolu_prefetch_fetch ….. [chart: speedup over ORI (base) for ORI, ORI+RT, 1D, 1D+RT, 2D, 2D+RT]

Performance Results EV6 (SWIM) Architecture: EV6 (500 MHz, L1: 64KB, L2: 4MB) Compiler invocations: –base: f77 -O5 -arch ev6 –no_prefetch: f77 -O5 -arch ev6 -switch nolu_prefetch_fetch ….. [chart: speedup over ORI (base) for ORI, ORI+RT, 1D, 1D+RT, 2D, 2D+RT]

Code for Result Verification (NEW in SPEC2000!) DO K = 2, ITMAX-1... do J = 1,N... enddo result verification IF (MOD(K,MPRINT).eq.0) THEN do I =... do J =... UCHECK = UCHECK + {UNEW(I,J)} enddo UNEW (I,I) =... enddo PRINTS ENDIF do J = 1,N... enddo ENDDO Apply strip-mining to loop K (only useful if MPRINT is large)