Idiom Recognition in the Polaris Parallelizing Compiler Bill Pottenger and Rudolf Eigenmann Presented by Vincent Yau.

Induction Variable Substitution Most compilers are unable to transform some general forms of induction variables: triangular nested loops and multiplicative expressions.

General Induction Variable Algorithm
Step 1: Recognize the induction variable pattern iv = iv + inc_expression, where inc_expression is an outer-loop index, another induction variable, or a loop-invariant expression.
Step 2: Compute 3 definitions (next slide) and compute the closed forms.
Step 3: Directly substitute the closed forms.

Example:

   iv = 0
   do i = 1, n
      do j = 1, i
         a(iv) = …
         iv = iv + 1
      enddo
   enddo

At the reference a(iv), the inner j loop contributes j and the outer i loop contributes i*(i - 1)/2, giving the closed form iv = j + (i**2 - i)/2 - 1. Direct substitution yields:

   do i = 1, n
      do j = 1, i
         a(j + (i**2 - i)/2 - 1) = …
      enddo
   enddo
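The closed form for this triangular loop can be checked numerically. A minimal sketch (in Python rather than the Fortran of the slides; `iv_sequence` and `iv_closed_form` are hypothetical names, not Polaris code):

```python
def iv_sequence(n):
    """Value of iv at each a(iv) reference in the original loop."""
    vals = []
    iv = 0
    for i in range(1, n + 1):
        for j in range(1, i + 1):
            vals.append(iv)
            iv += 1          # the induction variable update
    return vals

def iv_closed_form(n):
    """Same references, using the substituted index j + (i**2 - i)/2 - 1."""
    return [j + (i * i - i) // 2 - 1
            for i in range(1, n + 1)
            for j in range(1, i + 1)]

assert iv_sequence(10) == iv_closed_form(10)
```

Because the substituted index no longer depends on the previous iteration, the loop-carried dependence on iv disappears and the nest becomes parallelizable.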

The symbolic sum function is based on Bernoulli numbers. Bernoulli numbers are defined as a special case of Bernoulli polynomials.
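As a rough illustration of how such a symbolic sum function can work (a Python sketch under standard definitions, not the Polaris implementation; `bernoulli` and `power_sum` are hypothetical names): Bernoulli numbers computed from the usual recurrence give Faulhaber's closed form for sums of powers, which is exactly what closed-form induction variable substitution needs.

```python
from fractions import Fraction
from math import comb

def bernoulli(m):
    """Bernoulli numbers B_0..B_m (convention B_1 = -1/2), via the
    recurrence sum_{j=0}^{n} C(n+1, j) * B_j = 0 for n >= 1."""
    B = [Fraction(1)]
    for n in range(1, m + 1):
        s = sum(comb(n + 1, j) * B[j] for j in range(n))
        B.append(-s / (n + 1))
    return B

def power_sum(n, p):
    """Faulhaber's formula: sum_{k=1}^{n} k**p as an exact rational.
    Uses B_j^+ = (-1)**j * B_j (the B_1 = +1/2 convention)."""
    B = bernoulli(p)
    total = sum(comb(p + 1, j) * B[j] * (-1) ** j * Fraction(n) ** (p + 1 - j)
                for j in range(p + 1))
    return total / (p + 1)

assert power_sum(10, 1) == 55      # 1 + 2 + ... + 10
```

For example, power_sum(n, 1) reproduces the n*(n+1)/2 term that appears in the triangular-loop closed form on the previous slide.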

Wrap-Around Variables Definition: a variable that takes on the value of an induction variable after one iteration of a loop.

Example:

   m = 0
   do i = 1, n
      do j = 1, i
         lb = j
         ub = i
         do k = i, n
            do l = lb, ub
               m = m + 1
               a(m) = …
            enddo
            lb = 1
            ub = k + 1
         enddo
         m = m + i
      enddo
   enddo

Step 1: Recognize the wrap-around variables (lb, ub).
Step 2: Remove the wrap-around variables by peeling the first iteration of the k loop (next slide).
Step 3: Apply induction variable substitution.
Powerful symbolic manipulation is needed.

Step 2 applied to the example (first iteration of the k loop peeled, so lb and ub disappear):

   m = 0
   do i = 1, n
      do j = 1, i
         do l = j, i
            m = m + 1
            a(m) = …
         enddo
         do k = 1 + i, n
            do l = 1, k
               m = m + 1
               a(m) = …
            enddo
         enddo
         m = m + i
      enddo
   enddo
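That the peeling step preserves the program's behavior can be checked numerically. A Python sketch (a hypothetical checker, not part of Polaris) confirming that the original and peeled nests reference a(m) with the same sequence of m values:

```python
def m_original(n):
    """m at each a(m) reference in the loop with wrap-around lb, ub."""
    vals, m = [], 0
    for i in range(1, n + 1):
        for j in range(1, i + 1):
            lb, ub = j, i
            for k in range(i, n + 1):
                for l in range(lb, ub + 1):
                    m += 1
                    vals.append(m)
                lb, ub = 1, k + 1        # wrap-around updates
            m += i
    return vals

def m_peeled(n):
    """Same references after peeling the first k iteration."""
    vals, m = [], 0
    for i in range(1, n + 1):
        for j in range(1, i + 1):
            for l in range(j, i + 1):    # peeled iteration (k = i): lb=j, ub=i
                m += 1
                vals.append(m)
            for k in range(i + 1, n + 1):   # remaining iterations: lb=1, ub=k
                for l in range(1, k + 1):
                    m += 1
                    vals.append(m)
            m += i
    return vals

assert m_original(6) == m_peeled(6)
```

With lb and ub gone, m is a plain (if complicated) induction variable, so the substitution algorithm from the previous slides applies.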

Step 3 applied to the peeled example (the incremental updates of m are replaced by closed forms):

   m = 0
   do i = 1, n
      do j = 1, i
         do l = j, i
            m = l + (i + (-9*i**2 - 3*i**4 + 6*i + 6*i**3 - 6*i*n
                    - 6*i*n**2 + 6*i**2*n**2)/4 - 3*n - 3*j**2 - n**2
                    + 2*i**3 + 3*j + 3*i**2 - 3*i*j - 3*j*i**2
                    + 3*j*n + 3*j*n**2)/6 - 2*i + 2*i*j
            a(m) = …
         enddo
         do k = 1 + i, n
            do l = 1, k
               m = l + ((-9*i**2 - 3*i**4 + 6*i + 6*i**3 - 6*i*n
                       - 6*i*n**2 + 6*i**2*n**2)/4 - 3*k - 3*j**2
                       - 3*n**2 - 2*i + 3*j**3 + 3*j + 3*k**2 - 3*i*j
                       - 3*j*i**2 + 3*j*n + 3*j*n**2)/6 - 2*i + 2*i*j
               a(m) = …
            enddo
         enddo
      enddo
   enddo

Reduction Recognition
Step 1: Recognition pass — match the pattern A(x1, x2, x3, …) = A(x1, x2, x3, …) + B and set the reduction-variable flag for A( ).
Step 2: Data dependence pass — analyzes the candidate reduction variables and removes the reduction flag if the variable can be proven independent.
Step 3: Transformation pass — three different types of parallel reduction transformation: blocked, privatized, expanded.
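The idea behind the privatized/expanded transformations can be sketched as follows (a Python illustration with a simulated thread split, not Polaris output; `reduce_privatized` and the `nthreads` parameter are hypothetical): each worker accumulates into a private cell, and the partial results are combined in a single final step instead of synchronizing on every reduction statement.

```python
def reduce_privatized(values, nthreads=4):
    """Sum `values` the way a privatized parallel reduction would."""
    partial = [0] * nthreads          # one private accumulator per thread
    for t in range(nthreads):
        for x in values[t::nthreads]: # cyclic split of the iteration space
            partial[t] += x           # no synchronization needed here
    return sum(partial)               # single combining step at the end

assert reduce_privatized(list(range(1, 101))) == 5050
```

The blocked variant differs only in how the iteration space is split among threads; the combining step is the same.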

Transformation Pass Inserting synchronization primitives around each reduction statement is one option, but its synchronization overhead was high.

Performance Results Overall program speedups (running on an 8-processor set of an SGI Challenge R4400)

Performance Results