Program Analysis & Transformations: Loop Parallelization and Vectorization (Toheed Aslam)


Background
- Vector processors
- Multiprocessors
- Vectorizing & parallelizing compilers
- SIMD & MIMD models

Loop Parallelism (figure from Bräunl 2004)

Data Dependence Analysis
- Flow dependence:    S1: X = …    S2: … = X    (S1 δ S2)
- Anti dependence:    S1: … = X    S2: X = …    (S1 δ⁻¹ S2)
- Output dependence:  S1: X = …    S2: X = …    (S1 δ° S2)
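
A minimal straight-line Fortran sketch (not taken from the slides; the variables are illustrative) showing where each of the three dependence types arises:

    program dep_demo
      implicit none
      real :: x, y, z
      x = 1.0     ! S1: X = ...
      y = x       ! S2: ... = X    flow dependence    S1 δ S2
      z = x       ! S3: ... = X
      x = 2.0     ! S4: X = ...    anti dependence    S3 δ⁻¹ S4
      x = 3.0     ! S5: X = ...    output dependence  S4 δ° S5
      print *, x, y, z
    end program dep_demo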

Vectorization
- Exploiting vector architecture

    DO I = 1, 50
      A(I) = B(I) + C(I)
      D(I) = A(I) / 2.0
    ENDDO

  vectorizes to:

    A(1:50) = B(1:50) + C(1:50)
    D(1:50) = A(1:50) / 2.0
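
A self-contained sketch (the data values are illustrative, not from the slide) checking that the scalar loop and the array-syntax form produce the same result:

    program vec_demo
      implicit none
      integer, parameter :: n = 50
      real :: a(n), b(n), c(n), d_loop(n), d_vec(n)
      integer :: i
      b = [(real(i), i = 1, n)]
      c = 2.0
      ! scalar loop form, as on the slide
      do i = 1, n
        a(i) = b(i) + c(i)
        d_loop(i) = a(i) / 2.0
      end do
      ! vectorized (array-syntax) form
      a(1:n) = b(1:n) + c(1:n)
      d_vec(1:n) = a(1:n) / 2.0
      print *, all(d_loop == d_vec)   ! expected: T
    end program vec_demo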

Vectorization

    A(1:50) = B(1:50) + C(1:50)
    D(1:50) = A(1:50) / 2.0

  Memory-to-memory vector instructions (SR holds the scalar divisor 2.0):

    vadd  A[1], B[1], C[1], 50
    vdiv  D[1], A[1], SR, 50

  Vector-register instructions:

    mov    VL, 50
    vload  V1, B[1], 1
    vload  V2, C[1], 1
    vadd   V1, V2, V3
    vstore V3, A[1], 1
    vdiv   V3, SR, V4
    vstore V4, D[1], 1

Vectorization
- Discovery
  - build the data dependence graph
  - inspect dependence cycles
  - inspect each loop statement to see whether the target machine has a vector instruction that can execute it
- Proper course of action?

Vectorization
- Transformation
  - loops with multiple statements must first be transformed by loop distribution
  - this applies to loops with no loop-carried dependences, or whose loop-carried flow dependences all run forward (from an earlier statement to a later one)

    DO I = 1, N
    S1   A(I+1) = B(I) + C
    S2   D(I) = A(I) + E
    ENDDO

  loop distribution:

    DO I = 1, N
    S1   A(I+1) = B(I) + C
    ENDDO
    DO I = 1, N
    S2   D(I) = A(I) + E
    ENDDO

  vectorize:

    S1   A(2:N+1) = B(1:N) + C
    S2   D(1:N) = A(1:N) + E
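
A runnable sketch of the distributed-and-vectorized result, assuming for illustration that C and E are scalar constants and that A starts out zeroed:

    program dist_demo
      implicit none
      integer, parameter :: n = 10
      real, parameter :: c = 1.0, e = 2.0   ! illustrative constants
      real :: a(n+1), b(n), d(n)
      integer :: i
      a = 0.0
      b = [(real(i), i = 1, n)]
      ! after loop distribution, each loop becomes a single array statement
      a(2:n+1) = b(1:n) + c          ! S1
      d(1:n)   = a(1:n) + e          ! S2: A(1) is original, A(2:N) just computed
      print *, d
    end program dist_demo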

Vectorization
- Dependence cycles: acyclic (backward) dependences
  - solution: re-ordering of statements

    DO I = 1, N
    S1   D(I) = A(I) + E
    S2   A(I+1) = B(I) + C
    ENDDO

  re-order S2 before S1, then loop distribution:

    DO I = 1, N
    S2   A(I+1) = B(I) + C
    ENDDO
    DO I = 1, N
    S1   D(I) = A(I) + E
    ENDDO

  vectorize:

    S2   A(2:N+1) = B(1:N) + C
    S1   D(1:N) = A(1:N) + E

Vectorization
- Dependence cycles: cyclic case
  - solution: statement substitution
  - otherwise, distribute the loop: keep the cycle's statements in a serial loop and vectorize the rest (a sketch of this follows below)

    DO I = 1, N
    S1   A(I) = B(I) + C(I)
    S2   B(I+1) = A(I) + 2
    ENDDO

  after substituting A(I) into S2:

    DO I = 1, N
    S1   A(I) = B(I) + C(I)
    S2   B(I+1) = (B(I) + C(I)) + 2
    ENDDO
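
A runnable sketch of the distribute-the-loop fallback: the recurrence on B stays in a serial loop, after which S1 becomes a single vector statement (the initial values are illustrative):

    program cycle_demo
      implicit none
      integer, parameter :: n = 10
      real :: a(n), b(n+1), c(n)
      integer :: i
      b(1) = 0.0
      c = 1.0
      ! serial loop: the recurrence on B cannot be vectorized
      do i = 1, n
        b(i+1) = (b(i) + c(i)) + 2.0   ! S2 after substituting A(I)
      end do
      ! S1 no longer participates in a cycle and vectorizes
      a(1:n) = b(1:n) + c(1:n)
      print *, a
    end program cycle_demo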

Vectorization
- Nested loops

    DO I = 1, N
    S1   A(I, 1:N) = B(I, 1:N) + C(I, 1:N)
    S2   B(I+1, 1:N) = A(I, 1:N) + 2
    ENDDO
    S3   D(1:N, 1:N) = D(1:N, 1:N) + 1

- Conditions in the loop (a runnable sketch follows below)

    DO I = 1, N
    S1   IF (A(I) < 0)
    S2     A(I) = -A(I)
    ENDDO

  vectorizes to:

    WHERE (A(1:N) < 0) A(1:N) = -A(1:N)
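
A small runnable example of the masked form (the data are illustrative): the WHERE statement applies the assignment only to the negative elements, exactly like the guarded loop:

    program where_demo
      implicit none
      integer, parameter :: n = 8
      real :: a(n)
      integer :: i
      a = [(real(i) * (-1.0)**i, i = 1, n)]   ! alternating signs
      where (a(1:n) < 0.0) a(1:n) = -a(1:n)   ! vector form of the IF-guarded loop
      print *, a                              ! all elements now non-negative
    end program where_demo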

Parallelization
- Exploiting multiprocessors
- Allocate individual loop iterations to different processors
- Additional synchronization is required, depending on the data dependences

Parallelization
- Fork/join parallelism
- Scheduling
  - static
  - self-scheduled
  (an OpenMP-style sketch follows below)
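
The slides do not name a particular runtime; as a hedged illustration, OpenMP can express fork/join loop parallelism with either a static or a self-scheduled (dynamic) assignment of iterations (compile with an OpenMP flag such as -fopenmp):

    program sched_demo
      implicit none
      integer, parameter :: n = 16
      real :: a(n)
      integer :: i
      ! fork: worker threads start at the parallel do, join at its end
      !$omp parallel do schedule(static)       ! static: iterations pre-assigned
      do i = 1, n
        a(i) = real(i) * 2.0
      end do
      !$omp end parallel do
      !$omp parallel do schedule(dynamic, 2)   ! self-scheduled: threads grab chunks of 2
      do i = 1, n
        a(i) = a(i) + 1.0
      end do
      !$omp end parallel do
      print *, a
    end program sched_demo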

Parallelization

    for i := 1 to n do
    S1:  A[i] := C[i];
    S2:  B[i] := A[i];
    end;

- Data dependence: S1 δ(=) S2 (due to A[i])
- Synchronization required: NO

    doacross i := 1 to n do
    S1:  A[i] := C[i];
    S2:  B[i] := A[i];
    enddoacross;
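
A hedged OpenMP rendering of the doacross loop above (OpenMP is an assumption; the slides use doacross notation). Because the only dependence stays inside one iteration, no cross-iteration synchronization is needed:

    program doacross_demo
      implicit none
      integer, parameter :: n = 8
      real :: a(n), b(n), c(n)
      integer :: i
      c = [(real(i), i = 1, n)]
      !$omp parallel do                ! iterations may run on different processors
      do i = 1, n
        a(i) = c(i)                    ! S1
        b(i) = a(i)                    ! S2: reads only this iteration's A(i)
      end do
      !$omp end parallel do
      print *, b
    end program doacross_demo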

Parallelization
- The inner loop is to be parallelized:

    for i := 1 to n do
      for j := 1 to m do
      S1:  A[i,j] := C[i,j];
      S2:  B[i,j] := A[i-1,j-1];
      end;

- Data dependence: S1 δ(<,<) S2 (due to A[i,j])
- Synchronization required: NO (the dependence is carried by the outer loop, which stays serial)

    for i := 1 to n do
      doacross j := 1 to m do
      S1:  A[i,j] := C[i,j];
      S2:  B[i,j] := A[i-1,j-1];
      enddoacross;
    end;

Parallelization

    for i := 1 to n do
    S1:  A[i] := B[i] + C[i];
    S2:  D[i] := A[i] + E[i-1];
    S3:  E[i] := E[i] + 2 * B[i];
    S4:  F[i] := E[i] + 1;
    end;

- Data dependences:
  - S1 δ(=) S2 (due to A[i]): no synchronization required
  - S3 δ(=) S4 (due to E[i]): no synchronization required
  - S3 δ(<) S2 (due to E[i]): synchronization required

Parallelization
- After re-ordering and adding synchronization code (figure from Bräunl 2004); an OpenMP-style sketch follows below.
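
One way to express the required cross-iteration synchronization is with OpenMP 4.5 ordered (doacross) dependences; this is an analogous sketch rather than the post/wait code of the figure, and the data values are illustrative:

    program sync_demo
      implicit none
      integer, parameter :: n = 8
      real :: a(n), b(n), c(n), d(n), f(n)
      real :: e(0:n)
      integer :: i
      b = 1.0
      c = 2.0
      e = 0.0
      !$omp parallel do ordered(1)
      do i = 1, n
        a(i) = b(i) + c(i)                ! S1
        e(i) = e(i) + 2.0 * b(i)          ! S3 (re-ordered ahead of S2)
        f(i) = e(i) + 1.0                 ! S4
        !$omp ordered depend(sink: i-1)
        d(i) = a(i) + e(i-1)              ! S2: needs E[i-1] from iteration i-1
        !$omp ordered depend(source)      ! signal that this iteration's E(i) is done
      end do
      !$omp end parallel do
      print *, d
    end program sync_demo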

Review-I
- Data dependence of a statement on itself:

    for i := 1 to n do
    S1:  A[i] := A[i+1]
    end;

- Is this loop vectorizable?

Review-II
- Data dependence of a statement on itself:

    for j := 1 to n do
    S1:  X[j+1] := X[j] + C
    end;

- Is this loop vectorizable?

References
- Michael Wolfe, Optimizing Supercompilers for Supercomputers
- Rudolf Eigenmann and Jay Hoeflinger, Parallelizing and Vectorizing Compilers
- Randy Allen and Ken Kennedy, Optimizing Compilers for Modern Architectures