Carnegie Mellon Lecture 15: Loop Transformations
Chapter 11.10-11.11
Dror E. Maydan, CS243: Loop Optimization and Array Analysis

Loop Optimization

Domain
- Loops: change the order in which we iterate through loops
Goals
- Minimize inner-loop dependences that inhibit software pipelining
- Minimize loads and stores
- Parallelism: SIMD/vector today; in general, multiprocessor as well
- Minimize cache misses
- Minimize register spilling
Tools
- Loop interchange
- Fusion
- Fission
- Outer loop unrolling
- Cache tiling
- Vectorization
- An algorithm for putting it all together

Loop Interchange

for i = 1 to n
  for j = 1 to n
    A[j][i] = A[j-1][i-1] * b[i]

for j = 1 to n
  for i = 1 to n
    A[j][i] = A[j-1][i-1] * b[i]

Should I interchange the two loops?
- Stride-1 accesses to A are better for the cache; after interchange, i (the fastest-varying subscript) is the inner loop
- But b[i] was inner-loop invariant before, so interchanging adds one more load to the inner loop
- On the other hand, one less register is needed, since b[i] no longer has to be held across the inner loop
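To make the tradeoff concrete, a minimal C sketch (my illustration, not from the slides; assumes row-major double arrays sized so the 1-based bounds are in range):

    enum { N = 512 };

    void original(int n, double A[N+1][N+1], const double b[N+1]) {
        for (int i = 1; i <= n; i++) {
            double bi = b[i];                 /* hoisted: one load per i iteration */
            for (int j = 1; j <= n; j++)
                A[j][i] = A[j-1][i-1] * bi;   /* column walk: stride N+1 doubles */
        }
    }

    void interchanged(int n, double A[N+1][N+1], const double b[N+1]) {
        for (int j = 1; j <= n; j++)
            for (int i = 1; i <= n; i++)      /* stride-1 in i: good for cache */
                A[j][i] = A[j-1][i-1] * b[i]; /* but b[i] is re-read every iteration */
    }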

Loop Interchange

for i = 1 to n
  for j = 1 to n
    A[j][i] = A[j+1][i-1] * b[i]

The distance vector is (delta_i, delta_j) = (1, -1); the direction vector is (>, <).
A dependence records that one reference (a write w) must happen before another (a read r).
To permute loops, permute the direction vectors in the same manner:
- The permutation is legal iff all permuted direction vectors are lexicographically positive.
Special case: a fully permutable loop nest
- Every dependence is either "carried" by a loop outside the nest, or has all components > or =.
- All the loops in such a nest can be permuted arbitrarily.
Examples:
- (>, >, <): the inner two loops are fully permutable.
- (>=, =, >): all three loops are fully permutable.
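A minimal sketch of the legality test (my encoding, not the lecture's: +1 for '>', 0 for '=', -1 for '<', one entry per loop level):

    #include <stdio.h>

    /* Lexicographically positive: the first non-'=' component is '>'.
       (An all-'=' vector is a loop-independent dependence and is fine.) */
    static int lex_nonnegative(const int *v, int depth) {
        for (int k = 0; k < depth; k++) {
            if (v[k] > 0) return 1;
            if (v[k] < 0) return 0;
        }
        return 1;  /* all '=' */
    }

    /* perm[k] names the original loop placed at new nesting level k. */
    static int permutation_is_legal(int dirs[][3], int ndeps,
                                    const int *perm, int depth) {
        for (int d = 0; d < ndeps; d++) {
            int permuted[3];
            for (int k = 0; k < depth; k++)
                permuted[k] = dirs[d][perm[k]];
            if (!lex_nonnegative(permuted, depth)) return 0;
        }
        return 1;
    }

    int main(void) {
        int dirs[][3] = { { +1, -1, 0 } };  /* the slide's (>, <) */
        int interchange[3] = { 1, 0, 2 };   /* swap the two loops */
        printf("interchange legal? %d\n",
               permutation_is_legal(dirs, 1, interchange, 2));  /* prints 0 */
        return 0;
    }

Interchanging (>, <) yields (<, >), which is lexicographically negative, so the checker correctly rejects it.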

Loop Interchange

for i = 1 to n
  for j = 1 to i
    ...

How do I interchange a triangular nest like this? The interchanged loops become

for j = 1 to n
  for i = j to n
    ...

In general this is ugly, but doable.

Non-Perfectly Nested Loops

for i = 1 to n
  for j = 1 to n
    S1
  for j = 1 to n
    S2

We can't always interchange non-perfectly nested loops, and even when we can, it can be expensive.

Loop Fusion

for i = 1 to n
  for j = 1 to n
    S1
  for j = 1 to n
    S2

for i = 1 to n
  for j = 1 to n
    S1
    S2

Fusion moves S2 across "j" iterations, but not across any "i" iterations.
Pretend to fuse, then test: fusion is legal as long as there is no direction vector from S2 to S1 with "=" in all the outer loops and ">" in one of the inner loops, i.e. (=, =, ..., =, >, ...).
- Such a vector would mean the fused loop runs an instance of S2 before an instance of S1 that originally preceded it.

Loop Fusion

for i = 1 to n
  for j = 1 to n
    a[i][j] = ...
  for j = 1 to n
    ... = a[i][j+1]

Fusion is legal as long as there is no direction vector from the read to the write with "=" in all the outer loops and ">" in one of the inner loops, i.e. (=, =, ..., =, >, ...).
- Here the dependence has distance (0, 1), i.e. direction (=, >), so we cannot fuse.

Loop Fusion

for i = 1 to n
  for j = 1 to n
    a[i][j] = ...
  for j = 1 to n
    ... = a[i][j+1]

If the first "+" direction is always a small literal constant, we can peel and shift (skew) the loop to allow fusion. Bonus: we can get rid of a load and maybe a store.

Peel the first write and the last read:

for i = 1 to n
  a[i][1] = ...
  for j = 2 to n
    a[i][j] = ...
  for j = 1 to n-1
    ... = a[i][j+1]
  ... = a[i][n+1]

Shift the read loop so both loops run over j = 2 to n:

for i = 1 to n
  a[i][1] = ...
  for j = 2 to n
    a[i][j] = ...
  for j = 2 to n
    ... = a[i][j]
  ... = a[i][n+1]

Now fuse:

for i = 1 to n
  a[i][1] = ...
  for j = 2 to n {
    a[i][j] = ...
    ... = a[i][j]
  }
  ... = a[i][n+1]
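The promised load elimination falls out of the fused form: the stored value can be forwarded in a register. A minimal C sketch (my illustration; f and use are hypothetical stand-ins for the slide's elided right-hand side and consumer):

    enum { MAXN = 1024 };                     /* rows/columns sized >= n + 2 */

    static double f(int i, int j) { return i + 0.5 * j; }  /* stand-in for "..." */
    static double sink;
    static void use(double v) { sink += v; }                /* stand-in consumer */

    void fused(int n, double a[][MAXN + 2]) {
        for (int i = 1; i <= n; i++) {
            a[i][1] = f(i, 1);                /* peeled first write */
            for (int j = 2; j <= n; j++) {
                double t = f(i, j);
                a[i][j] = t;                  /* store */
                use(t);                       /* t forwarded in a register:
                                                 the reload of a[i][j] is gone */
            }
            use(a[i][n + 1]);                 /* peeled final read */
        }
    }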

Loop Fission

for i = 1 to n
  for j = 1 to n
    S1
  for j = 1 to n
    S2

for i = 1 to n
  for j = 1 to n
    S1
for i = 1 to n
  for j = 1 to n
    S2

Fission moves S2 across all later "i" iterations.
- It is legal as long as there are no dependences from S2 to S1 with ">" in the fissioned outer loops.

Loop Fission

for i = 1 to n
  for j = 1 to n
    ... = a[i-1][j]
  for j = 1 to n
    a[i][j] = ...

for i = 1 to n
  for j = 1 to n
    ... = a[i-1][j]
for i = 1 to n
  for j = 1 to n
    a[i][j] = ...

Again, fission moves S2 across all later "i" iterations, and is legal only if there are no dependences from S2 to S1 with ">" in the fissioned outer loops. Here there is a dependence from the write (S2) to the read (S1) with distance (1) in i, so this fission is not legal.

Inner Loop Fission

for i = 1 to n
  for j = 1 to n
    ... = h[i];
    ... = h[i+1];
    ...
    ... = h[i+49];
    ... = h[i+50];

for i = 1 to n
  for j = 1 to n
    ... = h[i];
    ...
    ... = h[i+25];
  for j = 1 to n
    ... = h[i+26];
    ...
    ... = h[i+50];

Splitting the body roughly halves the number of h values that must be kept live at once, cutting register pressure. Fission of the inner loop is legal as long as there is no dependence from an S2 to an S1 whose first ">" is in the "j" loop.

Inner Loop Fission

for j = 1 to n
  S1
  S2
  S3

Look at the edges carried by the innermost loop. Strongly connected components (SCCs) cannot be fissioned apart; everything else can be fissioned, as long as the resulting loops are emitted in topological order.

(Figure: dependence graph over S1, S2, S3 with edges labeled =, =, >.)
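A compact sketch of the fission driver (my code, not the lecture's): Tarjan's algorithm finds the SCCs of the statement dependence graph, and it happens to finish SCCs in reverse topological order, so walking the SCC ids backwards emits the fissioned loops in a legal order. Statements are numbered S1..S3 with assumed dependences S1 -> S2 -> S3:

    #include <stdio.h>

    enum { MAXV = 16, MAXE = 64 };
    static int nv, head[MAXV], nxt[MAXE], to[MAXE], ne;
    static int idx[MAXV], low[MAXV], oncur[MAXV], stk[MAXV], sp, counter;
    static int comp[MAXV], ncomp;   /* comp[v]: SCC id, reverse topo order */

    static void add_edge(int u, int v) { to[ne]=v; nxt[ne]=head[u]; head[u]=ne++; }

    static void tarjan(int v) {
        idx[v] = low[v] = ++counter;
        stk[sp++] = v; oncur[v] = 1;
        for (int e = head[v]; e != -1; e = nxt[e]) {
            int w = to[e];
            if (!idx[w]) { tarjan(w); if (low[w] < low[v]) low[v] = low[w]; }
            else if (oncur[w] && idx[w] < low[v]) low[v] = idx[w];
        }
        if (low[v] == idx[v]) {     /* v is the root of an SCC: pop it */
            int w;
            do { w = stk[--sp]; oncur[w] = 0; comp[w] = ncomp; } while (w != v);
            ncomp++;
        }
    }

    int main(void) {
        nv = 3;                     /* statements S1, S2, S3 */
        for (int v = 0; v < nv; v++) head[v] = -1;
        add_edge(0, 1);             /* S1 -> S2 */
        add_edge(1, 2);             /* S2 -> S3 */
        for (int v = 0; v < nv; v++) if (!idx[v]) tarjan(v);
        /* Emit one loop per SCC, highest id first = topological order. */
        for (int c = ncomp - 1; c >= 0; c--) {
            printf("fissioned loop:");
            for (int v = 0; v < nv; v++) if (comp[v] == c) printf(" S%d", v + 1);
            printf("\n");
        }
        return 0;
    }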

Outer Loop Unrolling

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      c[i][j] += a[i][k] * b[k][j];

How many loads in the inner loop? How many MACs (multiply-accumulates)? With c[i][j] held in a register, the inner loop does two loads (a[i][k] and b[k][j]) for every one MAC.

Outer Loop Unrolling

for i = 1 to n by 2
  for j = 1 to n by 2
    for k = 1 to n
      c[i][j]     += a[i][k]   * b[k][j];
      c[i][j+1]   += a[i][k]   * b[k][j+1];
      c[i+1][j]   += a[i+1][k] * b[k][j];
      c[i+1][j+1] += a[i+1][k] * b[k][j+1];

Now the inner loop does four loads for four MACs. Is it legal?

Outer Loop Unrolling

If n = 2:
- The original order was (1,1,1) (1,1,2) (1,2,1) (1,2,2) (2,1,1) (2,1,2) (2,2,1) (2,2,2)
- The new order is (1,1,1) (1,2,1) (2,1,1) (2,2,1) (1,1,2) (1,2,2) (2,1,2) (2,2,2)

for i = 1 to 2 by 2
  for j = 1 to 2 by 2
    for k = 1 to 2
      c[i][j]     += a[i][k]   * b[k][j];
      c[i][j+1]   += a[i][k]   * b[k][j+1];
      c[i+1][j]   += a[i+1][k] * b[k][j];
      c[i+1][j+1] += a[i+1][k] * b[k][j+1];

This is equivalent to permuting the loops into

for k = 1 to 2
  for i = 1 to 2
    for j = 1 to 2

So if the loops are fully permutable, we can also unroll the outer loops.
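Putting the unrolled matmul into runnable form, with the remainder loops a real compiler must generate when n is odd (a sketch, assuming 0-based square double arrays):

    enum { MAXN = 512 };

    void matmul_unroll2x2(int n, double c[][MAXN],
                          const double a[][MAXN], const double b[][MAXN]) {
        int i, j, k;
        for (i = 0; i + 1 < n; i += 2) {
            for (j = 0; j + 1 < n; j += 2) {
                /* four accumulators stay in registers across the k loop */
                double c00 = c[i][j],   c01 = c[i][j+1];
                double c10 = c[i+1][j], c11 = c[i+1][j+1];
                for (k = 0; k < n; k++) {
                    double a0 = a[i][k], a1 = a[i+1][k];   /* 2 loads */
                    double b0 = b[k][j], b1 = b[k][j+1];   /* 2 loads */
                    c00 += a0 * b0; c01 += a0 * b1;        /* 4 MACs  */
                    c10 += a1 * b0; c11 += a1 * b1;
                }
                c[i][j] = c00;   c[i][j+1] = c01;
                c[i+1][j] = c10; c[i+1][j+1] = c11;
            }
            for (; j < n; j++)                   /* remainder column */
                for (k = 0; k < n; k++) {
                    c[i][j]   += a[i][k]   * b[k][j];
                    c[i+1][j] += a[i+1][k] * b[k][j];
                }
        }
        for (; i < n; i++)                       /* remainder row */
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }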

Unrolling Trapezoidal Loops

for i = 1 to n by 2
  for j = 1 to i
    ...

We do unroll two-level trapezoidal (triangular) loop nests, but the details are very ugly.

(Figure: the triangular iteration space in the (i, j) plane.)

Trapezoidal Example

for (i = 0; i < n; i++) {
  for (j = 2*i; j < n-i; j++) {
    a[i][j] += 1;
  }
}

Unrolling i by 2 produces code like the following (brace placement reconstructed from the slide):

for (i = 0; i <= (n + -2); i = i + 2) {
  lstar = (i * 2) + 2;
  ustar = (n - (i + 1)) + -1;
  if (((i * 2) + 2) < (n - (i + 1))) {
    /* prologue: iterations before the common region */
    for (r2d_i = i; r2d_i <= (i + 1); r2d_i = r2d_i + 1) {
      for (j = r2d_i * 2; j <= ((i * 2) + 1); j = j + 1) {
        a[r2d_i][j] = a[r2d_i][j] + 1;
      }
    }
    /* unrolled body over the common region */
    for (j0 = lstar; ustar >= j0; j0 = j0 + 1) {
      a[i][j0] = a[i][j0] + 1;
      a[i + 1][j0] = a[i + 1][j0] + 1;
    }
    /* epilogue: iterations after the common region */
    for (r2d_i0 = i; r2d_i0 <= (i + 1); r2d_i0 = r2d_i0 + 1) {
      for (j1 = n - (i + 1); j1 < (n - r2d_i0); j1 = j1 + 1) {
        a[r2d_i0][j1] = a[r2d_i0][j1] + 1;
      }
    }
  } else {
    /* common region is empty: run the two rows separately */
    for (r2d_i1 = i; r2d_i1 <= (i + 1); r2d_i1 = r2d_i1 + 1) {
      for (j2 = r2d_i1 * 2; j2 < (n - r2d_i1); j2 = j2 + 1) {
        a[r2d_i1][j2] = a[r2d_i1][j2] + 1;
      }
    }
  }
}
/* remainder row when n is odd */
if (n > i) {
  for (j3 = i * 2; j3 < (n - i); j3 = j3 + 1) {
    a[i][j3] = a[i][j3] + 1;
  }
}

Cache Tiling

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      c[i][j] += a[i][k] * b[k][j];

How many cache misses? Once the arrays exceed the cache, there is essentially no reuse across i iterations: b is walked column by column, so each of its n^3 accesses can miss.

Cache Tiling

for jb = 1 to n by b
  for kb = 1 to n by b
    for i = 1 to n
      for j = jb to jb+b-1
        for k = kb to kb+b-1
          c[i][j] += a[i][k] * b[k][j];

How many cache misses now?
- Order-b reuse for each array: each line brought in is used on the order of b times before eviction.
If the loops are fully permutable, we can cache tile.
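A runnable version of the tiled loop (a sketch; the tile size B is an assumption that would really come from the cache model described later, and n need not be a multiple of B):

    enum { N = 1024, B = 32 };   /* 3 tiles of 32x32 doubles fit a 32 KB L1 */

    void matmul_tiled(int n, double c[][N],
                      const double a[][N], const double b[][N]) {
        for (int jb = 0; jb < n; jb += B)
            for (int kb = 0; kb < n; kb += B) {
                int jmax = jb + B < n ? jb + B : n;
                int kmax = kb + B < n ? kb + B : n;
                for (int i = 0; i < n; i++)
                    for (int j = jb; j < jmax; j++) {
                        double sum = c[i][j];
                        for (int k = kb; k < kmax; k++)
                            sum += a[i][k] * b[k][j];  /* b tile reused across all i */
                        c[i][j] = sum;
                    }
            }
    }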

Vectorization: SIMD

for i = 1 to n
  for j = 1 to n
    a[j][i] = 0;

for i = 1 to n by 8
  for j = 1 to n
    a[j][i:i+7] = 0;

This is N-way parallel, where N is the SIMD width of the machine (8 here).
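On an AVX machine with 32-bit floats this maps directly onto real intrinsics (a sketch; assumes float elements and n a multiple of 8, which a compiler would handle with a remainder loop):

    #include <immintrin.h>

    enum { N = 1024 };

    /* Vectorize the *outer* i loop: a[j][i:i+7] is contiguous because i is
       the last subscript, so each store writes 8 consecutive floats. */
    void zero_columns(int n, float a[][N]) {
        __m256 z = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8)
            for (int j = 0; j < n; j++)
                _mm256_storeu_ps(&a[j][i], z);
    }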

Vectorization: SIMD

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      S1
      ...
      SM

Vectorizing moves later iterations of S1 ahead of earlier iterations of S2, ..., SM, and so on. It is legal as long as there is no dependence from a later statement to an earlier statement that is carried by the vector loop.
- E.g., it is legal to vectorize "j" above if there is no dependence from a later S to an earlier S with direction (=, >, *).

Putting It All Together

A three-phase algorithm:
1. Use fission and fusion to build perfectly nested loops.
   - We prefer fusion, though it is not obvious that that is always right.
2. Enumerate the possibilities for unrolling, interchanging, cache tiling, and vectorizing.
3. Use inner loop fission if necessary to minimize register pressure.

Phase 2

Choose a loop to vectorize:
- All references that refer to the vector loop must be stride-1.
For each possible inner loop:
- Compute the best possible unrollings for each outer loop.
- Compute the best possible ordering and tiling.
To compute the best possible unrolling (see the sketch after this slide):
- Try all combinations of unrolling, up to a maximum product of 16.
- For each possible unrolling:
  - Estimate the machine cycles for the inner loop (ignoring cache).
  - Estimate the register pressure.
  - Don't unroll more if there is too much register pressure.
To compute the best possible ordering and tiling:
- Consider only loops with "reuse"; choose the best three.
- Iterate over all orderings of the three, with a binary search on the cache tile size.
- Note and record the total cycle time for this configuration.
Pick the best configuration overall.
Estimating cycles: we could compile every combination, but that would be far too expensive, so we estimate with a machine model.
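A toy sketch of the unroll-factor search over a 2-D unrolling (ui, uj). The stand-ins are mine, not the lecture's model: estimate_cycles charges one cycle per MAC and per load in the unrolled body, and register_pressure is a made-up count:

    #include <stdio.h>

    /* Hypothetical stand-ins for the machine model on the next slide. */
    static double estimate_cycles(int ui, int uj) {
        return (double)(ui * uj) + (ui + uj);   /* MACs + loads per body */
    }
    static int register_pressure(int ui, int uj) { return ui * uj + ui + uj; }

    enum { MAX_PRODUCT = 16, MAX_REGS = 28 };

    int main(void) {
        int best_ui = 1, best_uj = 1;
        double best = 1e30;
        for (int ui = 1; ui <= MAX_PRODUCT; ui++)
            for (int uj = 1; ui * uj <= MAX_PRODUCT; uj++) {
                if (register_pressure(ui, uj) > MAX_REGS)
                    continue;               /* don't unroll past the register file */
                double per_point = estimate_cycles(ui, uj) / (ui * uj);
                if (per_point < best) { best = per_point; best_ui = ui; best_uj = uj; }
            }
        printf("best unrolling: %d x %d\n", best_ui, best_uj);  /* 4 x 4 here */
        return 0;
    }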

Machine Modeling

Recall that software pipelining had both resource limits and latency limits.
- Map the high-level IR to machine resources.
- Unroll the high-level IR operations:
  - Remove duplicate loads and stores.
  - Count machine resources.
- Build a latency graph of the unrolled operations:
  - Iterate over inner-loop cycles and find the worst cycle.
- Assume performance is the worse of the two limits.
Model register pressure:
- Count loop-invariant loads and stores.
- Count address streams.
- Count cross-iteration CSEs, e.g. = a[i] + a[i-2].
- Add a machine-dependent constant.
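Tying back to the software pipelining lecture, the cycle estimate is the worse of the two bounds. A toy worked example (the operation counts and latencies are invented for illustration):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* resource bound: ops per iteration / functional units available */
        double res_limit = fmax(6 / 2.0,   /* 6 memory ops, 2 ports     */
                                4 / 2.0);  /* 4 MACs, 2 multiply units  */
        /* latency bound: latency around a dependence cycle / its distance,
           e.g. a 4-cycle add feeding itself two iterations later */
        double rec_limit = 4.0 / 2.0;
        printf("cycles per iteration >= %g\n", fmax(res_limit, rec_limit));
        return 0;   /* prints 3: the resource limit dominates here */
    }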

Cache Modeling

Given a loop ordering and a set of tile factors:
- Combine array references that differ by a constant, e.g. a[i][j] and a[i+1][j+1].
- Estimate the capacity required by all array references, multiply by a fudge factor for interference, and stop increasing block sizes if the capacity is larger than the cache.
- Estimate the quantity of data that must be brought into the cache.
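A toy version of the capacity test (the fudge factor of 1.5 and the B x B-per-array footprint are illustrative assumptions, not the lecture's numbers):

    #include <stdio.h>

    enum { CACHE_BYTES = 32 * 1024 };

    /* Toy model: a b x b double tile touches b*b*8 bytes per array;
       scale by a fudge factor to account for interference misses. */
    static int tile_fits(int b, int narrays) {
        double fudge = 1.5;
        double bytes = (double)narrays * b * b * sizeof(double) * fudge;
        return bytes <= CACHE_BYTES;
    }

    int main(void) {
        int b = 1;
        while (tile_fits(b * 2, 3))        /* three arrays: a, b, c */
            b *= 2;                        /* grow until it no longer fits */
        printf("chosen tile size: %d\n", b);   /* prints 16 here */
        return 0;
    }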

Phase 3: Inner Loop Fission

Does the inner loop use too many registers?
- Break the loop down into SCCs.
- Pick the biggest SCC. Does it use too many registers?
  - If yes, too bad.
  - If no, search for other SCCs to merge in:
    - Pick the one with the most commonality.
    - Keep merging while there are enough registers.

Extra: Reductions

for i = 1 to n
  for j = 1 to n
    a[j] += b[i][j];

Can I unroll?

for i = 1 to n by 2
  for j = 1 to n
    a[j] += b[i][j];
    a[j] += b[i+1][j];

Is it legal?
- Integer: yes.
- Floating point: maybe. Unrolling reorders the additions into a[j], and FP addition is not associative, so the rounding can change; it is legal only if the compiler is allowed to relax FP semantics.

Extra: Outer Loop Invariants

for i
  for j
    a[i][j] += b[i] * cos(c[j])

We can replace this with

for j
  t[j] = cos(c[j])
for i
  for j
    a[i][j] += b[i] * t[j];

This needs to be integrated with the model:
- The model must assume that the invariant computation will be replaced with loads.