Compiling Fortran D for MIMD Distributed-Memory Machines
Authors: Seema Hiranandani, Ken Kennedy, Chau-Wen Tseng
Published: 1992
Presented by: Sunjeev Sikand

Problem
Parallel computers represent the only plausible way to continue increasing the computational power available to scientists and engineers. However, they are difficult to program. In particular, MIMD machines require message passing between separate address spaces and synchronization among processors.

Problem cont.
Because parallel programs are machine-specific, scientists are discouraged from using them: they lose their investment when the program changes or a new architecture arrives. Vectorizable programs, in contrast, are easily maintained, debugged, and ported, and the compiler does all the work.

Solution
Previous Fortran dialects lack a means of specifying a data decomposition. The authors believe that if a program is written in a data-parallel programming style with reasonable data decompositions, it can be implemented efficiently. Thus they propose to develop compiler technology that establishes such a machine-independent programming model. The goal is to reduce both communication and load imbalance.

Data Decomposition
A decomposition is an abstract problem or index domain; it does not require any storage. Each element of a decomposition represents a unit of computation. The DECOMPOSITION statement declares the name, dimensionality, and size of a decomposition for later use. There are two levels of parallelism in data-parallel applications.

Decomposition Statement
DECOMPOSITION D(N,N)

Data Decomposition - Alignment
The first level of parallelism is array alignment, or problem mapping: how arrays are aligned with respect to one another. It represents the minimal requirements for reducing data movement in the program, given an unlimited number of processors. Alignment is machine independent and depends on the fine-grained parallelism defined by the individual elements of the data arrays.

Alignment cont.
Corresponding elements in aligned arrays are always mapped to the same processor. Array operations between aligned arrays are usually more efficient than array operations between arrays that are not known to be aligned.

Alignment Example
REAL A(N,N)
DECOMPOSITION D(N,N)
ALIGN A(I,J) with D(J-2,I+3)
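Reading this example: the ALIGN statement maps array element A(I,J) to decomposition element D(J-2, I+3), i.e., A is transposed and offset relative to D. As an illustrative case (not on the original slide), A(1,3) is aligned with D(1,4), so A(1,3) and any other array element aligned with D(1,4) are placed on the same processor.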

Data Decomposition - Distribution
The other level of parallelism is distribution, or machine mapping: how arrays are distributed on the actual parallel machine. It represents the translation of the problem onto the finite resources of the machine and is affected by the topology, communication mechanisms, size of local memory, and number of processors of the underlying machine.

Distribution cont.
A distribution is specified by assigning an independent attribute to each dimension. Predefined attributes include BLOCK, CYCLIC, and BLOCK_CYCLIC. The symbol : marks dimensions that are not distributed.

Distribution Example 1
DISTRIBUTE D(:,BLOCK)

Distribution Example 2
DISTRIBUTE D(:,CYCLIC)
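To make the two attributes concrete, using the four-processor, 100x100 setting of the Jacobi example later in the slides (the formulas are standard BLOCK/CYCLIC semantics, spelled out here for illustration): DISTRIBUTE D(:,BLOCK) gives processor p the contiguous columns 25(p-1)+1 through 25p, while DISTRIBUTE D(:,CYCLIC) deals columns out round-robin, assigning column j to processor mod(j-1,4)+1. In both cases the first (":") dimension stays entirely local.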

Fortran D Compiler
The two major steps in writing a data-parallel program are selecting a data decomposition and using it to derive node programs with explicit data movement. The former is left to the user; the latter is generated automatically by the compiler, given a data decomposition. The compiler translates the program into an SPMD program with explicit message passing that executes directly on the nodes of the distributed-memory machine.

Fortran D Compiler Structure
1. Program analysis
   a. Dependence analysis
   b. Data decomposition analysis
   c. Partitioning analysis
   d. Communication analysis
2. Program optimization
   a. Message vectorization
   b. Collective communications
   c. Run-time processing
   d. Pipelined computations
3. Code generation
   a. Program partitioning
   b. Message generation
   c. Storage management

Partition Analysis
Original program:
REAL A(100)
do i = 1, 100
   A(i) = 0.0
enddo
SPMD node program:
REAL A(25)
do i = 1, 25
   A(i) = 0.0
enddo
Converting global to local indices.
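As a minimal sketch (not from the paper) of the arithmetic behind this translation: for a one-dimensional BLOCK distribution of A(1:100) over nproc processors, each processor owns a contiguous block of ceil(100/nproc) elements, and the compiler turns global indices into local ones by subtracting the block offset. The names nproc, myproc (this node's number, 1..nproc), blk, gi, and li below are illustrative, not the compiler's.
      nproc = 4                              ! number of processors
      blk = (100 + nproc - 1) / nproc        ! block size = ceil(100/4) = 25
      do gi = (myproc-1)*blk + 1, min(myproc*blk, 100)
         li = gi - (myproc-1)*blk            ! local index in 1..blk
         A(li) = 0.0
      enddo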

Jacobi Relaxation
In the grid approximation that discretizes the physical problem, the heat flow into any given point at a given moment is the sum of the four temperature differences between that point and each of the four points surrounding it. Translating this into an iterative method, the correct solution can be found if the temperature of a given grid point at a given iteration is taken to be the average of the temperatures of the four surrounding grid points at the previous iteration.
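In formula form (this is exactly the update computed by statement S1 on the next slide, with B holding the previous iteration's values):
   A(i,j) = ( B(i,j-1) + B(i-1,j) + B(i+1,j) + B(i,j+1) ) / 4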

Jacobi Relaxation Code
REAL A(100,100), B(100,100)
DECOMPOSITION D(100,100)
ALIGN A, B with D
DISTRIBUTE D(:,BLOCK)
do k = 1, time
   do j = 2, 99
      do i = 2, 99
S1       A(i,j) = (B(i,j-1)+B(i-1,j)+B(i+1,j)+B(i,j+1))/4
      enddo
   enddo
   do j = 2, 99
      do i = 2, 99
S2       B(i,j) = A(i,j)
      enddo
   enddo
enddo

Jacobi Relaxation Processor Layout
Compiling for a four-processor machine. Both arrays A and B are aligned identically with decomposition D, so they have the same distribution as D. Because the first dimension of D is local and the second dimension is block-distributed, the local index set for both A and B on each processor (in local indices) is [1:100, 1:25].

Jacobi Relaxation cont.

The iteration set of the loop nest (in global indices) is [1:time, 2:99, 2:99]. Local iteration sets for each processor (in local indices):
Proc(1) = [1:time, 2:25, 2:99]
Proc(2:3) = [1:time, 1:25, 2:99]
Proc(4) = [1:time, 1:24, 2:99]
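A quick check of these bounds (reasoning spelled out here, not on the original slide): processor p owns global columns 25(p-1)+1 through 25p. Since the global j loop covers columns 2:99, processor 1 (global columns 1:25) skips its first local column and starts at local j = 2, processors 2 and 3 sweep all 25 of their local columns, and processor 4 (global columns 76:100) skips its last column and stops at local j = 24.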

Generated Jacobi
REAL A(100,25), B(100,0:26)
if (Plocal = 1) lb1 = 2 else lb1 = 1
if (Plocal = 4) ub1 = 24 else ub1 = 25
do k = 1, time
   if (Plocal > 1) send(B(2:99,1), Pleft)
   if (Plocal < 4) send(B(2:99,25), Pright)
   if (Plocal < 4) recv(B(2:99,26), Pright)
   if (Plocal > 1) recv(B(2:99,0), Pleft)
   do j = lb1, ub1
      do i = 2, 99
         A(i,j) = (B(i,j-1)+B(i-1,j)+B(i+1,j)+B(i,j+1))/4
      enddo
   enddo

Generated Jacobi cont.
   do j = lb1, ub1
      do i = 2, 99
         B(i,j) = A(i,j)
      enddo
   enddo
enddo
The only true cross-processor dependences are carried by the k loop, so the compiler is able to vectorize the messages: boundary columns are exchanged once per k iteration rather than element by element.

Pipelined Computation
In loosely synchronous computations, all processors execute in loose lockstep, alternating between phases of local computation and global communication, e.g., Red-Black SOR and Jacobi. However, some computations, such as SOR, contain loop-carried dependences. These present an opportunity to exploit parallelism through pipelining.
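For concreteness, here is an illustrative SOR-style kernel (a sketch added here, not taken from the slides), written for the same 100x100 grid as the Jacobi example. Unlike Jacobi, the right-hand side reads A(i,j-1) and A(i-1,j), which were already updated in the current sweep, so the i and j loops carry dependences; with columns block-distributed, the j-loop dependence crosses processor boundaries.
      do j = 2, 99
         do i = 2, 99
            A(i,j) = (A(i,j-1)+A(i-1,j)+A(i+1,j)+A(i,j+1))/4
         enddo
      enddo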

Pipelined Computation cont.
The observation is that, for some pipelined computations, the loop order of the program must be changed. Fine-grained pipelining interchanges cross-processor loops as deeply as possible to reduce the sequential portion of the computation, but it incurs the most communication overhead. Coarse-grained pipelining uses strip mining and loop interchange to adjust the granularity of the pipelining; it decreases communication overhead at the expense of some parallelism. A sketch of the coarse-grained case follows.
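A hedged sketch of coarse-grained pipelining applied to the SOR-style kernel above, written in the same style as the generated Jacobi code and reusing its four-processor column-block layout and its Plocal, Pleft, Pright, lb1, and ub1 names. Here A is assumed to be declared with ghost columns as A(100,0:26), strip is an assumed granularity parameter, and the enclosing time-step loop and the exchange of the right-hand ghost column of old values are omitted. This illustrates the transformation; it is not the compiler's actual output.
      do is = 2, 99, strip                          ! strip-mined i loop
         ie = min(is + strip - 1, 99)
         if (Plocal > 1) recv(A(is:ie,0), Pleft)    ! wait for left neighbor's updated strip
         do j = lb1, ub1
            do i = is, ie
               A(i,j) = (A(i,j-1)+A(i-1,j)+A(i+1,j)+A(i,j+1))/4
            enddo
         enddo
         if (Plocal < 4) send(A(is:ie,25), Pright)  ! forward this processor's boundary strip
      enddo
Smaller values of strip approach fine-grained pipelining (neighbors start sooner, but many small messages); larger values reduce the message count at the cost of pipeline parallelism.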

Conclusions
A usable and efficient machine-independent parallel programming model is needed to make large-scale parallel machines useful to scientific programmers. Fortran D, with its data decomposition model, performs message vectorization, collective communication, fine-grained pipelining, and several other optimizations for block-distributed arrays. The Fortran D compiler will generate efficient code for a large class of data-parallel programs with minimal effort.

Discussion
Q: How is this applicable to sensor networks?
A: There is no explicit reference to sensor networks, as this paper was written over a decade ago. But the authors provide a unified programming methodology for distributing data and communicating among processors. Replace the processors with motes and you'll see this is indeed relevant.

Discussion cont.
Q: What about issues such as fault tolerance?
A: Point well taken. If a message is lost, it doesn't seem as though the infrastructure is there to deal with it. The model could be extended with redundant computation, or perhaps even checkpointing, but as someone mentioned, the limited memory of motes may be an issue here.
Q: They provide a means for load balancing; is this even applicable to sensor networks?
A: Yes, it is, since in sensor networks we want to balance the load so that a single mote's energy isn't completely spent.