Advanced Analysis in SUIF2 and Future Work
Monica Lam
Stanford University

Interprocedural, High-Level Transforms for Locality and Parallelism
- Program transformation for computational kernels
  - a new technique based on affine partitioning
- Interprocedural analysis framework: to maximize code reuse
  - flow sensitivity; context sensitivity
- Interprocedural program analysis
  - pointer alias analysis (Steensgaard's algorithm)
  - scalar/scalar dependence, privatization, reduction recognition
- Parallel code generation
  - define new IR nodes for parallel code

Loop Transforms: Cholesky factorization example

DO 1 J = 0, N
  I0 = MAX ( -M, -J )
  DO 2 I = I0, -1
    DO 3 JJ = I0 - I, -1
      DO 3 L = 0, NMAT
3       A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
    DO 2 L = 0, NMAT
2     A(L,I,J) = A(L,I,J) * A(L,0,I+J)
  DO 4 L = 0, NMAT
4   EPSS(L) = EPS * A(L,0,J)
  DO 5 JJ = I0, -1
    DO 5 L = 0, NMAT
5     A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
  DO 1 L = 0, NMAT
1   A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )

DO 6 I = 0, NRHS
  DO 7 K = 0, N
    DO 8 L = 0, NMAT
8     B(I,L,K) = B(I,L,K) * A(L,0,K)
    DO 7 JJ = 1, MIN (M, N-K)
      DO 7 L = 0, NMAT
7       B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
  DO 6 K = N, 0, -1
    DO 9 L = 0, NMAT
9     B(I,L,K) = B(I,L,K) * A(L,0,K)
    DO 6 JJ = 1, MIN (M, K)
      DO 6 L = 0, NMAT
6       B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)

Results for Optimizing Perfect Nests

Speedup on a Digital TurboLaser with eight 300 MHz processors.

Optimizing Arbitrary Loop Nesting Using Affine Partitions

(Same Cholesky factorization code as above, annotated with affine partitions that map arrays A, B, and EPSS onto the L dimension.)

Results with Affine Partitioning + Blocking

New Transform Theory
- Domain: arbitrary loop nesting; each instruction optimized separately
- Unifies (see the examples below)
  - permutation
  - skewing
  - reversal
  - fusion
  - fission
  - statement reordering
- Supports blocking across all loop nests
- Optimal: maximum degree of parallelism & minimum degree of synchronization
- Minimizes communication by aligning the computation and pipelining
- More powerful & simpler software engineering
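To make the unification concrete (standard identities, not taken from the slides): for a two-deep nest with indices i and j, the classical transforms are particular affine mappings, and because each statement receives its own mapping, reordering, fusion, and fission fall out of the same framework.

  interchange:  T(i,j) = (j, i)
  reversal:     T(i,j) = (-i, j)
  skewing:      T(i,j) = (i, i+j)
  fission of S1, S2 in a single loop over i:  T_1(i) = (0, i),  T_2(i) = (1, i)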

A Simple Example

FOR i = 1 TO n DO
  FOR j = 1 TO n DO
    A[i,j] = A[i,j] + B[i-1,j];    (S1)
    B[i,j] = A[i,j-1] * B[i,j];    (S2)

(figure: iteration space with axes i and j, showing instances of S1 and S2)

Best Parallelization Scheme

SPMD code (let p be the processor's ID number):

if (1-n <= p <= n) then
  if (1 <= p) then
    B[p,1] = A[p,0] * B[p,1];                          (S2)
  for i1 = max(1,1+p) to min(n,n-1+p) do
    A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];            (S1)
    B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];          (S2)
  if (p <= 0) then
    A[n+p,n] = A[n+p,n] + B[n+p-1,n];                  (S1)

The solution can be expressed as affine partitions:
  S1: execute iteration (i, j) on processor i-j.
  S2: execute iteration (i, j) on processor i-j+1.
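A minimal sketch (Python with numpy; not part of the original slides) that checks this partitioning: run the original nest sequentially, then run the SPMD body once per processor id p in an arbitrary order, and verify both produce identical arrays, which is what "communication-free" predicts.

import numpy as np

def sequential(A, B, n):
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            A[i, j] = A[i, j] + B[i - 1, j]    # S1
            B[i, j] = A[i, j - 1] * B[i, j]    # S2

def spmd_body(A, B, n, p):
    # processor p owns S1 iterations with i-j == p and S2 iterations with i-j+1 == p
    if 1 <= p:
        B[p, 1] = A[p, 0] * B[p, 1]                            # S2
    for i1 in range(max(1, 1 + p), min(n, n - 1 + p) + 1):
        A[i1, i1 - p] = A[i1, i1 - p] + B[i1 - 1, i1 - p]      # S1
        B[i1, i1 - p + 1] = A[i1, i1 - p] * B[i1, i1 - p + 1]  # S2
    if p <= 0:
        A[n + p, n] = A[n + p, n] + B[n + p - 1, n]            # S1

n = 6
rng = np.random.default_rng(0)
A0, B0 = rng.random((n + 1, n + 1)), rng.random((n + 1, n + 1))
A_seq, B_seq = A0.copy(), B0.copy()
A_par, B_par = A0.copy(), B0.copy()
sequential(A_seq, B_seq, n)
for p in range(1 - n, n + 1):          # any processor order works: no communication
    spmd_body(A_par, B_par, n, p)
assert np.allclose(A_seq, A_par) and np.allclose(B_seq, B_par)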

Maximum Parallelism & No Communication

Let F_xj be an access to array x in statement j,
    i_j be an iteration index for statement j,
    B_j i_j >= 0 represent the loop bound constraints for statement j.

Find C_j, which maps an instance of statement j to a processor, such that

  for all i_j, i_k with B_j i_j >= 0 and B_k i_k >= 0:
    F_xj(i_j) = F_xk(i_k)  ==>  C_j(i_j) = C_k(i_k)

with the objective of maximizing the rank of C_j.

(figure: loop iterations mapped to array elements by F1(i1), F2(i2) and to processor IDs by C1(i1), C2(i2))
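A worked instance (filled in here for the simple example above, using the partitions already shown on the previous slide): take the accesses to array B, where S1 reads B[i-1,j] and S2 writes B[i,j], so

  F_B,1(i,j) = (i-1, j)    and    F_B,2(i,j) = (i, j).

F_B,1(i_1,j_1) = F_B,2(i_2,j_2) forces (i_2, j_2) = (i_1-1, j_1), so the constraint becomes C_1(i_1,j_1) = C_2(i_1-1, j_1), which is satisfied by C_1(i,j) = i-j and C_2(i,j) = i-j+1. The accesses to A impose the analogous constraint C_1(i,j) = C_2(i, j+1), which the same mappings also satisfy.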

Algorithm

Given the constraint
  for all i_j, i_k with B_j i_j >= 0 and B_k i_k >= 0:
    F_xj(i_j) = F_xk(i_k)  ==>  C_j(i_j) = C_k(i_k)

- Rewrite the partition constraints as systems of linear equations
  - use the affine form of Farkas' Lemma to rewrite the constraints as systems of linear inequalities in C and the Farkas multipliers
  - use the Fourier-Motzkin algorithm to eliminate the Farkas multipliers and obtain a system of linear equations AC = 0
- Find solutions using linear algebra techniques
  - the null space of matrix A is a solution for C with maximum rank (sketched below)
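A minimal sketch (Python with numpy) of the final linear-algebra step; the matrix A below is a hypothetical stand-in, since building the real one requires the Farkas/Fourier-Motzkin step described above. The columns of a null-space basis of A give a maximum-rank solution for the coefficients of C.

import numpy as np

def null_space_basis(A, tol=1e-10):
    """Basis (as columns) of the null space of A, computed via SVD."""
    U, s, Vt = np.linalg.svd(A)
    rank = int((s > tol).sum())
    return Vt[rank:].T                 # vectors v with A @ v ~= 0

# hypothetical constraint matrix standing in for A C = 0
A = np.array([[1.0, -1.0, 0.0,  0.0],
              [0.0,  0.0, 1.0, -1.0]])
C = null_space_basis(A)
assert np.allclose(A @ C, 0.0)
print("maximum rank of C:", C.shape[1])   # degrees of freedom in the partition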

Pipelining: Alternating Direction Integration Example

Requires transposing data:
  DO J = 1 to N (parallel)
    DO I = 1 to N
      A(I,J) = f(A(I,J), A(I-1,J))
  DO J = 1 to N
    DO I = 1 to N (parallel)
      A(I,J) = g(A(I,J), A(I,J-1))

Moves only boundary data:
  DO J = 1 to N (parallel)
    DO I = 1 to N
      A(I,J) = f(A(I,J), A(I-1,J))
  DO J = 1 to N (pipelined)
    DO I = 1 to N
      A(I,J) = g(A(I,J), A(I,J-1))

Finding the Maximum Degree of Pipelining

Let F_xj be an access to array x in statement j,
    i_j be an iteration index for statement j,
    B_j i_j >= 0 represent the loop bound constraints for statement j.

Find T_j, which maps an instance of statement j to a time stage, such that

  for all i_j, i_k with B_j i_j >= 0 and B_k i_k >= 0:
    (i_j precedes i_k) and F_xj(i_j) = F_xk(i_k)  ==>  T_j(i_j) <= T_k(i_k), lexicographically

with the objective of maximizing the rank of T_j.

(figure: loop iterations mapped to array elements by F1(i1), F2(i2) and to time stages by T1(i1), T2(i2))

Key Insight
- Choice in time mapping => (pipelined) parallelism
- Degrees of parallelism = rank(T) - 1

Putting it All Together
- Find maximum outer-loop parallelism with minimum synchronization (sketched below)
  - divide into strongly connected components
  - apply the processor mapping algorithm (no communication) to the program
  - if no parallelism is found,
    - apply the time mapping algorithm to find pipelining
    - if no pipelining is found (an outer sequential loop), repeat the process on the inner loops
- Minimize communication
  - use a greedy method to order communicating pairs
  - try to find communication-free, or neighborhood-only, communication by solving similar equations
- Aggregate computations on consecutive data to improve spatial locality
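A minimal sketch of the driver described above (Python; the helper names are hypothetical, since the actual SUIF2 interfaces are not shown in the slides):

def parallelize(region, strongly_connected_components, processor_mapping,
                time_mapping, rank, emit_parallel_loops, emit_pipelined_loops, inner_loops):
    # try a communication-free space partition first, then pipelining, then recurse inward
    for scc in strongly_connected_components(region):
        C = processor_mapping(scc)           # communication-free space partition
        if rank(C) > 0:
            emit_parallel_loops(scc, C)
            continue
        T = time_mapping(scc)                # pipelined time partition
        if rank(T) > 1:                      # degrees of parallelism = rank(T) - 1
            emit_pipelined_loops(scc, T)
            continue
        for inner in inner_loops(scc):       # outer loop stays sequential
            parallelize(inner, strongly_connected_components, processor_mapping,
                        time_mapping, rank, emit_parallel_loops, emit_pipelined_loops, inner_loops)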

Current Status
- Completed:
  - Mathematics package
    - integrated Omega: Pugh's Presburger arithmetic
    - linear algebra package: Farkas' lemma, Gaussian elimination, finding null spaces
  - Can find communication-free partitions
- In progress
  - rest of affine partitioning
  - code generation

Interprocedural Analysis
- Two major design choices in program analysis
  - Across procedures
    - no interprocedural analysis
    - interprocedural: context-insensitive
    - interprocedural: context-sensitive
  - Within a procedure
    - flow-insensitive
    - flow-sensitive: interval/region based
    - flow-sensitive: iterative over the flow graph
    - flow-sensitive: SSA based

Efficient Context-Sensitive Analysis
- Bottom-up (see the sketch below)
  - a region/interval: a procedure or a loop
  - an edge: a call or code in an inner scope
  - summarize each region (with a transfer function)
  - find strongly connected components (SCCs)
  - bottom-up traversal of SCCs
  - iteration to find the fixed point for recursive functions
- Top-down
  - top-down propagation of values
  - iteration to find the fixed point for recursive functions

(figure: region nesting annotated with "call", "inner loop", and "scc" edges)
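A minimal sketch (Python) of the bottom-up phase: visit call-graph SCCs in reverse topological order and iterate to a fixed point within each SCC to handle recursion. The reverse_topological_sccs and summarize arguments are hypothetical stand-ins for the SCC traversal and the analysis-specific transfer-function builder.

def bottom_up(call_graph, reverse_topological_sccs, summarize):
    summaries = {}                                   # procedure -> summary (transfer function)
    for scc in reverse_topological_sccs(call_graph):
        changed = True
        while changed:                               # fixed-point iteration for recursive cycles
            changed = False
            for proc in scc:
                new = summarize(proc, summaries)     # combines callee summaries already computed
                if new != summaries.get(proc):
                    summaries[proc] = new
                    changed = True
    return summaries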

Interprocedural Framework Architecture

(diagram: a Driver layer (bottom-up, top-down, and linear traversals) dispatches to Compound Handlers (procedure calls and returns, regions & statements, basic blocks) and Primitive Handlers (e.g. array summaries, pointer aliases), all built on shared Data Structures (call graph and SCCs, regions, control flow graphs))

Interprocedural Framework Architecture
- Interprocedural analysis data structures
  - e.g. call graphs, SSA form, regions or intervals
- Handlers: orthogonal sets of handlers for different groups of constructs (see the sketch below)
  - primitive: the user specifies the analysis-specific semantics of primitives
  - compound: handles compound statements and calls
    - the user chooses between handlers of different strengths, e.g. no interprocedural analysis versus context-sensitive, or flow-insensitive versus flow-sensitive (CFG-based)
  - all the handlers are registered in a visitor
- Driver
  - invoked by the user's request for information (demand driven)
  - builds prepass data structures
  - invokes the right set of handlers in the right order (e.g. bottom-up traversal of the call graph)
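A minimal, hypothetical sketch (Python) of the registration idea: primitive handlers carry the analysis-specific semantics, compound handlers of the chosen strength combine them over regions and calls, and the visitor dispatches on the construct kind. The tiny may-read analysis is for illustration only.

class AnalysisVisitor:
    def __init__(self):
        self._handlers = {}
    def register(self, kind, handler):
        self._handlers[kind] = handler
    def visit(self, node):
        kind, children = node[0], node[1:]
        return self._handlers[kind](self, *children)

v = AnalysisVisitor()
# primitive handler: a variable reference contributes to the may-read set
v.register("ref", lambda v, name: {name})
# compound handlers: a region unions its children; a call uses the callee's summary
v.register("region", lambda v, *stmts: set().union(*(v.visit(s) for s in stmts)))
v.register("call", lambda v, callee_summary: set(callee_summary))

program = ("region", ("ref", "x"), ("call", {"y", "z"}), ("ref", "x"))
print(sorted(v.visit(program)))     # ['x', 'y', 'z']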

Pointer Alias Analysis
- Steensgaard's pointer alias analysis (completed)
  - flow-insensitive and context-insensitive, type-inference-based analysis (illustrated below)
  - very efficient: near linear-time analysis
  - very inaccurate
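An illustrative sketch (Python; a simplification, since the real algorithm works over typed storage-shape graphs) of the unification idea behind Steensgaard's analysis: every pointer assignment unifies the abstract locations its two sides may point to, using union-find, so the whole program is processed in a single near-linear pass.

class Node:
    def __init__(self, name):
        self.name, self.parent, self.target = name, self, None
    def find(self):
        while self.parent is not self:
            self.parent = self.parent.parent       # path compression
            self = self.parent
        return self

def union(a, b):
    a, b = a.find(), b.find()
    if a is b:
        return a
    b.parent = a
    if a.target and b.target:
        union(a.target, b.target)                  # recursively unify points-to targets
    else:
        a.target = a.target or b.target
    return a

def points_to_target(n):
    r = n.find()
    if r.target is None:
        r.target = Node(r.name + "*")              # fresh abstract location
    return r.target.find()

# p = &x; q = &y; p = q  =>  x and y end up in the same alias class
p, q, x, y = (Node(v) for v in "pqxy")
union(points_to_target(p), x)                      # p = &x
union(points_to_target(q), y)                      # q = &y
union(points_to_target(p), points_to_target(q))    # p = q
assert points_to_target(p) is points_to_target(q)
assert x.find() is y.find()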

Parallelization Analysis
- Scalar analysis
  - mod/ref, reduction recognition: bottom-up, flow-insensitive
  - liveness for privatization: bottom-up and top-down, flow-sensitive (see the sketch below)
- Region-based array analysis
  - may-write, must-write, read, upwards-exposed read: bottom-up
  - array liveness for privatization: bottom-up and top-down
  - uses our interprocedural framework + Omega
- Symbolic analysis
  - find linear relationships between scalar variables to improve array analysis
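A deliberately simplified sketch (Python; hypothetical event representation, and it ignores values live after the loop) of the condition the privatization liveness analysis establishes: a scalar can be privatized if no loop iteration reads it before writing it.

def privatizable(iteration_body, var):
    """iteration_body: ordered ('read'|'write', name) events of one loop iteration."""
    for op, name in iteration_body:
        if name == var:
            return op == 'write'     # first access decides: a write means no upwards-exposed read
    return True                      # never accessed in the loop body

assert privatizable([('write', 's'), ('read', 's')], 's')        # s redefined before use
assert not privatizable([('read', 's'), ('write', 's')], 's')    # s carried in from outside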

Parallel Code Generation
- Loop bound generation
  - use Omega, based on the affine mappings
- Outlining and cloning primitives
- Special IR nodes to represent parallelization primitives
  - allow a succinct and high-level description of parallelization decisions
  - for communication to and from users
  - reduction and private variables and primitives
  - synchronization and parallelization primitives

(figure: SUIF extended with parallel IR nodes, alongside standard SUIF)

Status
- Completed
  - call graphs, SCCs
  - Steensgaard's pointer alias analysis
  - integration of a garbage collector with SUIF
- In progress
  - interprocedural analysis framework
  - array summaries
  - scalar dependence analysis
  - parallel code generation
- To be done
  - scalar symbolic analysis

Future work: Basic compiler research
- A flexible and integrated platform for new optimizations
  - combinations of pointer, OO, and parallelization optimizations to parallelize or SIMDize (MMX) multimedia applications
  - interaction of garbage collection and exception handling with back-end optimizations
  - embedded compilers with application-specific additions at the source-language and architectural level

As a Useful Compiler for High-Performance Computers
- Basic ingredients of a state-of-the-art parallelizing compiler
- Requires experimentation, tuning, refinement
  - first implementation of affine partitioning
  - interprocedural parallelization requires many analyses working together
- Missing functions
  - automatic data distribution
  - user interaction needed for parallelizing large code regions
    - SUIF Explorer: a prototype interactive parallelizer in SUIF1
    - requires tools: algorithms to guide performance tuning, program slices, visualization tools
- New techniques
  - extend affine mapping to sparse codes (with permutation index arrays)
- Fortran 90 front end
- Debugging support

New-Generation Productivity Tool
- Apply high-level program analysis to increase programmers' productivity
- Many existing analyses
  - high-level, interprocedural side-effect analysis with pointers and arrays
- New analyses
  - flow- and context-sensitive pointer alias analysis
  - interprocedural control-path-based analysis
- Examples of tools
  - find bugs in programs
  - prove or disprove user invariants
  - generate test cases
  - interactive demand-driven analysis to aid in program debugging
  - can also apply to Verilog/VHDL to improve hardware verification

Finally...
- The system has to be actively maintained and supported to be useful.

The End