October 18, 2018 Kit Barton, IBM Canada

Presentation transcript:

Revisiting Loop Fusion and its place in the loop transformation framework
October 18, 2018
Kit Barton, IBM Canada
Johannes Doerfert, Argonne National Labs
Hal Finkel, Argonne National Labs
Michael Kruse, Argonne National Labs

LLVM Developers’ Meeting 2018

Agenda
- Motivation and Goals
- Loop representation in LLVM
- Algorithm for loop fusion
- Current Results
- Next Steps

Loop Fusion
Combine two (or more) loops into a single loop.
Motivation
- Data reuse, parallelism, minimizing bandwidth, …
- Increase scope for loop optimizations
Our Goals
- A way to learn how to implement a loop optimization in LLVM
- A starting point for establishing a loop optimization pipeline in LLVM

Before fusion:

    for (int i=0; i < N; ++i) {
      A[i] = i;
    }
    for (int j=0; j < N; ++j) {
      B[j] = j;
    }

After fusion:

    for (int i=0, j=0; i < N && j < N; ++i,++j) {
      A[i] = i;
      B[j] = j;
    }
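As a minimal sketch of the transformation on this slide, the following C++ wraps the two original loops and the fused loop in functions so that their results can be compared. The function names `separateLoops` and `fusedLoop` and the use of `std::vector` are assumptions made for illustration; they do not come from the talk.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: the two original loops, run one after the other.
void separateLoops(std::vector<int> &A, std::vector<int> &B) {
  assert(A.size() == B.size());
  const int N = static_cast<int>(A.size());
  for (int i = 0; i < N; ++i) {
    A[i] = i;
  }
  for (int j = 0; j < N; ++j) {
    B[j] = j;
  }
}

// Hypothetical helper: the fused loop, with both bodies in one iteration space.
void fusedLoop(std::vector<int> &A, std::vector<int> &B) {
  assert(A.size() == B.size());
  const int N = static_cast<int>(A.size());
  for (int i = 0, j = 0; i < N && j < N; ++i, ++j) {
    A[i] = i;
    B[j] = j;
  }
}
```

Because the two bodies touch different arrays and the trip counts match, both versions should leave A and B in the same state.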

XL Loop Optimization Pipeline
- IBM’s XL Compiler has a very mature loop optimization pipeline
- The pipeline begins with maximal fusion: greedily fuse loops to create large loop nests
- Run a series of loop optimizations on the loop nests created by fusion
- Selectively distribute loops based on a set of heuristics, including:
  - data reuse
  - independent loops
  - perfect loop nests
  - register pressure
  - …
[Figure: pipeline diagram with the stages Aggressive Copy Propagation, Dead Store Elimination, Loop Fusion, Loop Unroll and Jam, Node Splitting, Scalar Expansion, Loop Distribution, Loop Permutation.]
Christopher Barton. Code transformations to augment the scope of loop fusion in a production compiler. Master's thesis, University of Alberta, January 2003.

Loop Representation in LLVM

    int A[1024];
    void example() {
      for (int i = 0; i < N; i++)
        A[i] = i;
    }

- Preheader: the block with a single edge to the header of the loop from outside of the loop.
- Header: single entry point to the loop; it dominates all other blocks in the loop.
- Exiting Block: the block within the loop that has successors outside of the loop. If multiple blocks have successors outside of the loop, this is null.
- Exit Block: the successor block of this loop. If the loop has multiple successors, this is null.
- Latch: the block that contains the branch back to the loop header.
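To relate these terms to the example, the loop can be lowered by hand into labeled blocks. This is only a sketch: `N` is passed as a parameter because the slide does not define it, the label names are hypothetical, and the layout mirrors a typical `for` loop lowering in which the header is also the exiting block.

```cpp
#include <cassert>

int A[1024];

// Hand-lowered version of the slide's loop. N is a parameter here because the
// slide does not define it (an assumption of this sketch).
void example(int N) {
  assert(N >= 0 && N <= 1024);
  int i;
  // Preheader: runs once and has a single edge into the header.
  i = 0;
header:
  // Header: single entry point of the loop. In this shape it is also the
  // exiting block, because it holds the only branch that leaves the loop.
  if (!(i < N)) goto exit_block;
  // Body.
  A[i] = i;
  // Latch: increments i and branches back to the header.
  ++i;
  goto header;
exit_block:
  // Exit block: the single successor of the loop.
  return;
}
```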

Requirements for loop fusion
In order for two loops, Lj and Lk, to be fused, they must satisfy the following conditions:
- Lj and Lk must be adjacent
  - There cannot be any statements that execute between the end of Lj and the beginning of Lk
- Lj and Lk must iterate the same number of times
- Lj and Lk must be control flow equivalent
  - If either loop executes, the other is guaranteed to execute as well
- There cannot be any negative distance dependencies between Lj and Lk
  - A negative distance dependence occurs between Lj and Lk (Lj before Lk) when, at iteration m, Lk uses a value that is computed by Lj at a future iteration m+n (where n > 0).
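The negative distance condition is easiest to see on a concrete pair of loops. In the sketch below (function names and array contents are invented for illustration), Lk at iteration i reads `A[i + 1]`, which Lj only writes at iteration i + 1. Running the loops separately and running them fused therefore give different results, which is why such a pair must not be fused as-is.

```cpp
#include <cassert>
#include <vector>

// Original program: Lj runs to completion before Lk starts.
std::vector<int> runSeparate(int N) {
  assert(N >= 1);
  std::vector<int> A(N + 1, 0), B(N, 0);
  for (int i = 0; i < N; ++i) A[i] = i * 10;    // Lj: writes A[i]
  for (int i = 0; i < N; ++i) B[i] = A[i + 1];  // Lk: reads A[i + 1]
  return B;
}

// Illegally fused program: at iteration i the read of A[i + 1] happens
// before Lj's write to A[i + 1], which only occurs at iteration i + 1.
std::vector<int> runFusedIllegally(int N) {
  assert(N >= 1);
  std::vector<int> A(N + 1, 0), B(N, 0);
  for (int i = 0; i < N; ++i) {
    A[i] = i * 10;
    B[i] = A[i + 1];
  }
  return B;
}
```

With N = 4, the separate version produces B = {10, 20, 30, 0}, while the fused version reads only stale zeros.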

Loop Fusion Algorithm

    fuseLoops(Function F)
      for each nest level NL, outermost to innermost
        Collect loops that are candidates for loop fusion at NL
        Sort candidates into control-flow equivalent sets
        for each CFE set
          for each pair of loops, Lj and Lk
            if Lj and Lk do not have identical trip counts then continue
            if Lj and Lk cannot be made adjacent then continue
            if Lj and Lk have invalid dependencies then continue
            if fusing Lj and Lk is not beneficial then continue
            Move intervening code to make Lj and Lk adjacent
            fuse Lj and Lk
            Update fusion candidate list

[Figure: control-flow graph of two loops before fusion. First loop: entry (Preheader), for.cond (Header, ExitingBB), for.body, for.inc (Latch), for.cond.cleanup (ExitBB), for.end (Preheader of the second loop). Second loop: for.cond2 (Header, ExitingBB), for.body5, for.inc12 (Latch), for.cond.cleanup4 (ExitBB), for.end14.]

Loop Fusion: collect candidates
Loops are not candidates for fusion if:
- They might throw an exception
- They contain volatile memory accesses
- They are not in simplified form
- Any of the necessary information is not available (preheader, header, latch, exiting blocks, exit block)

Loop Fusion: sort based on control-flow equivalence
- Dominator and post-dominator trees are used to determine control-flow equivalence: if Lj dominates Lk and Lk post-dominates Lj, then Lj and Lk are control-flow equivalent
- Build sets of candidates that are all control-flow equivalent by comparing a new loop to the first loop in a set
- Once all loops have been placed into sets, sets with a single loop are discarded
- Remaining sets are sorted in dominance order: if Lj is located in the set before Lk, then Lj dominates Lk
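The grouping step can be sketched on a toy control-flow graph. The code below is not LLVM's implementation: it stands in for the dominator and post-dominator trees with a simple reachability test, represents each loop by a single block id, and assumes the loops are visited in program order. All names (`Graph`, `buildCFESets`, and so on) are hypothetical.

```cpp
#include <cassert>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // successors of each block

// True if `to` can be reached from `from` without passing through `removed`.
static bool reachableAvoiding(const Graph &G, int from, int to, int removed) {
  if (from == removed || to == removed) return false;
  std::vector<bool> seen(G.size(), false);
  std::vector<int> work{from};
  seen[from] = true;
  while (!work.empty()) {
    int n = work.back();
    work.pop_back();
    if (n == to) return true;
    for (int s : G[n])
      if (s != removed && !seen[s]) { seen[s] = true; work.push_back(s); }
  }
  return false;
}

// a dominates b: every path from entry to b passes through a.
bool dominates(const Graph &G, int entry, int a, int b) {
  return a == b || !reachableAvoiding(G, entry, b, a);
}

// a post-dominates b: every path from b to exit passes through a.
bool postDominates(const Graph &G, int exit, int a, int b) {
  return a == b || !reachableAvoiding(G, b, exit, a);
}

// The slide's rule: Lj dominates Lk and Lk post-dominates Lj.
bool controlFlowEquivalent(const Graph &G, int entry, int exit, int Lj, int Lk) {
  return dominates(G, entry, Lj, Lk) && postDominates(G, exit, Lk, Lj);
}

// Place each loop into the first set whose first member it is equivalent to,
// then discard sets that hold a single loop.
std::vector<std::vector<int>> buildCFESets(const Graph &G, int entry, int exit,
                                           const std::vector<int> &loops) {
  std::vector<std::vector<int>> sets;
  for (int L : loops) {
    bool placed = false;
    for (auto &S : sets) {
      if (controlFlowEquivalent(G, entry, exit, S.front(), L)) {
        S.push_back(L);
        placed = true;
        break;
      }
    }
    if (!placed) sets.push_back({L});
  }
  std::vector<std::vector<int>> result;
  for (const auto &S : sets)
    if (S.size() > 1) result.push_back(S);
  return result;
}
```

For example, in a graph where block 1 always runs, block 3 sits on one arm of a branch, and block 4 is the join, loops 1 and 4 end up in one set while loop 3 is left alone and dropped.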

Loop Fusion: check trip counts
- Scalar Evolution (SCEV) is used to determine trip counts
- If it cannot compute trip counts, or determine that the trip counts are identical, loops are not fused
- We currently do not try to make trip counts the same via peeling
  - This needs to be added in the future to enable more loop optimizations
  - Interaction with other loop optimizations will be critical here
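The conservative rule above (unknown or unequal trip counts block fusion) can be sketched without LLVM. The stand-in below handles only loops of the form `for (i = start; i < end; i += step)` with constant bounds, whereas SCEV reasons about symbolic expressions. The `TripCount` struct and both function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for a trip-count query: either a known constant
// count or "unknown".
struct TripCount {
  bool known;
  long value;
};

// Trip count of: for (i = start; i < end; i += step).
// Reports "unknown" when the bounds are not known or the step is not positive.
TripCount computeTripCount(bool boundsKnown, long start, long end, long step) {
  if (!boundsKnown || step <= 0) return {false, 0};
  if (end <= start) return {true, 0};
  return {true, (end - start + step - 1) / step};
}

// Fusion proceeds only when both counts are known and equal; two unknown
// counts are never treated as identical.
bool haveIdenticalTripCounts(TripCount a, TripCount b) {
  return a.known && b.known && a.value == b.value;
}
```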

Loop Fusion: check adjacent
- Analyze all instructions between the exit of Lj and the preheader of Lk and determine if they can be moved prior to Lj or past Lk
- Build a map of all instructions and the location where they can move (prior, past, both, none)
- If any instructions cannot be moved, the two loops cannot be made adjacent and thus cannot be fused
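One way to picture the prior/past/both/none map is with read and write sets over named memory locations. The sketch below is a simplification, not the pass's actual analysis: it ignores dependences among the intervening instructions themselves and any code outside the two loops, and the `Effects`, `Move`, `classify`, and `canMakeAdjacent` names are invented for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

enum class Move { None, Prior, Past, Both };

// Locations read and written by an instruction or by a whole loop.
struct Effects {
  std::set<std::string> reads, writes;
};

static bool intersects(const std::set<std::string> &a,
                       const std::set<std::string> &b) {
  for (const auto &x : a)
    if (b.count(x)) return true;
  return false;
}

// Two pieces of code may be reordered when there is no read-after-write,
// write-after-read, or write-after-write conflict between them.
static bool independent(const Effects &x, const Effects &y) {
  return !intersects(x.writes, y.reads) && !intersects(x.reads, y.writes) &&
         !intersects(x.writes, y.writes);
}

// Where can an instruction sitting between Lj and Lk be moved?
Move classify(const Effects &inst, const Effects &Lj, const Effects &Lk) {
  const bool prior = independent(inst, Lj);  // may be hoisted above Lj
  const bool past = independent(inst, Lk);   // may be sunk below Lk
  if (prior && past) return Move::Both;
  if (prior) return Move::Prior;
  if (past) return Move::Past;
  return Move::None;
}

// The loops can be made adjacent only if every intervening instruction
// can move somewhere.
bool canMakeAdjacent(const std::vector<Effects> &between, const Effects &Lj,
                     const Effects &Lk) {
  for (const auto &inst : between)
    if (classify(inst, Lj, Lk) == Move::None) return false;
  return true;
}
```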

Loop Fusion: check dependencies
Three different algorithms are used to test dependencies for fusion:
- Alias Analysis: test if two memory locations alias each other
- Dependence Info: uses the depends interface from Dependence Info
- SCEV: use SCEV to determine if there could be negative dependencies between the two loops
If any can prove valid dependencies, then fusion is legal.
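For one simple access pattern the negative-dependence test reduces to comparing constant offsets. Suppose Lj writes `A[i + writeOffset]` and Lk reads `A[i + readOffset]` over the same range of i. Element e is then written at iteration e - writeOffset and read at iteration e - readOffset, so the distance from write to read is writeOffset - readOffset. The sketch below is hand-derived for this pattern only; it covers flow dependences on a single array, is not the Alias Analysis, Dependence Info, or SCEV logic used by the pass, and its function names are hypothetical.

```cpp
#include <cassert>

// Distance, in iterations, from Lj's write of an element to Lk's read of it,
// for Lj: A[i + writeOffset] = ...  and  Lk: ... = A[i + readOffset].
int dependenceDistance(int writeOffset, int readOffset) {
  return writeOffset - readOffset;
}

// After fusion, Lj's statement precedes Lk's within each iteration, so a
// distance of zero or more keeps every write ahead of its read. A negative
// distance means Lk would read a value Lj has not produced yet.
bool fusionPreservesFlowDependence(int writeOffset, int readOffset) {
  return dependenceDistance(writeOffset, readOffset) >= 0;
}
```

For instance, a write to `A[i]` paired with a read of `A[i + 1]` has distance -1 and is rejected, while a read of `A[i]` or `A[i - 1]` is accepted.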

Loop Fusion: profitability analysis
- Hook that will allow different heuristics to be used to determine whether loops should be fused
- Currently this always returns true, to allow maximal fusion

Loop Fusion: move code to make adjacent
[Figure: control-flow graph of the two loops after intervening code has been moved; same blocks as before fusion (entry, for.cond, for.body, for.inc, for.cond.cleanup, for.end, for.cond2, for.body5, for.inc12, for.cond.cleanup4, for.end14), annotated with Preheader, Header, Latch, ExitingBB, and ExitBB roles.]

Loop Fusion: fuse loops
[Figure: control-flow graph while the loops are being fused; blocks shown: entry, for.cond, for.body, for.inc, for.end, for.cond2, for.body5, for.inc12, for.cond.cleanup4, for.end14. for.cond.cleanup no longer appears.]

Loop Fusion: update data structures
[Figure: control-flow graph while data structures are updated; blocks shown: entry, for.cond, for.body, for.inc, for.cond2, for.body5, for.inc12, for.cond.cleanup4, for.end14. for.end no longer appears.]

After Loop Fusion
[Figure: control-flow graph of the fused loop; blocks shown: entry, for.cond, for.body, for.inc, for.cond2, for.body5, for.inc12 (Latch), for.cond.cleanup4 (ExitBB), for.end14.]

Current placement of Loop Fusion
[Figure: pass-order diagrams for the Old Pass Manager and the New Pass Manager. Labels in transcript order: Loop Rotate, Loop Fusion, Loop Distribution, Loop Vectorization, Loop Load Elimination, Loop Data Prefetch, Loop Load Elimination, Loop Fusion, Loop Distribution, Loop Vectorize.]

Number of Loops Fused
Columns: Benchmark, Candidates for Fusion, Loops Fused
[Table flattened in the transcript: some cells were dropped, so a row with fewer values than columns cannot be aligned to specific columns.]

SPEC 2006
  perlbench    5
  bzip2        21   1
  gcc          50   8
  namd         4
  gobmk        96   7
  dealII       355
  soplex
  povray       14   3
  hmmer        19
  h264ref      159  2
  lbm
  astar
  sphinx3

SPEC 2017
  perlbench_r  7
  gcc_r        114  2
  namd_r       11
  parest_r     137  1
  povray_r     22   6
  lbm_r
  omnetpp_r    19
  x264_r       81
  blender_r    259
  deepsjeng_r  34
  imagick_r    45   5
  nab_r        9
  xz_r         18

Reasons for not fusing
Columns: Benchmark, Dependencies, Non-equal Trip Count, Cannot make adjacent
[Table flattened in the transcript: some cells were dropped, so a row with fewer values than columns cannot be aligned to specific columns.]

SPEC 2006
  perlbench    1    2
  bzip2        60   9
  gcc          17   13   45
  namd         3
  gobmk        41   231
  dealII       278  31   470
  soplex
  povray       4
  hmmer        7
  h264ref      105  74   506
  astar        5
  sphinx3

SPEC 2017
  perlbench_r  6    2
  gcc_r        47   80   107
  namd_r       8    1    3
  parest_r     43   485
  povray_r     36   7
  omnetpp_r    29
  x264_r       42   26   67
  blender_r    164  53   310
  deepsjeng_r  30   33   136
  imagick_r    15   24   56
  nab_r        12
  xz_r         66

Ineligible Loops
Columns: Benchmark, May Throw Exception, Contains Volatile Access, Invalid Exiting Blocks, Invalid Exit Block, Invalid Trip Count
[Table flattened in the transcript: some cells were dropped, so a row with fewer values than columns cannot be aligned to specific columns.]

SPEC 2006
  perlbench      62    451   485   340
  bzip2          76    122
  gcc            6     1643  1697  1418
  mcf            8     11
  milc           13    116
  namd           4     91    67
  gobmk          274   280   157
  dealII         810   2098  2106  1319
  soplex         98    155
  povray         448   408   426   156
  hmmer          246
  libquantum     3     15
  h264ref        80    81    225
  omnetpp        70    174   183
  astar          16    5
  sphinx3        87    203
  xalancbmk      704   1817  1888  1062

SPEC 2017
  perlbench_r    18    850   875   736
  gcc_r          2864  2923  4012
  mcf_r          15    26
  namd_r         67    75    71
  parest_r       545   2147  3816
  povray_r       435   421   439   167
  omnetpp_r      293   400   415   321
  xalancbmk_r    2477  2450  2526  1344
  x264_r         211   212   500
  blender_r      31    10    3058  3067  5952
  deepsjeng_r    30    32    6
  imagick_r      526   542   2387
  leela_r        97
  nab_r          117   163   323
  xz_r           74    61

Next steps
- Post patch
- Investigate location to run loop fusion
- Enhancements to fuse more
  - Non-equal trip counts: loop peeling or splitting
  - Dependencies: loop alignment or skewing