
University of Minnesota Introduction to Co-Array Fortran Robert W. Numrich Minnesota Supercomputing Institute University of Minnesota, Minneapolis and Goddard Space Flight Center Greenbelt, Maryland Assisted by Carl Numrich, Minnehaha Academy High School John Numrich, Minnetonka Middle School West

2 What is Co-Array Fortran? Co-Array Fortran is one of three simple language extensions to support explicit parallel programming. –Co-Array Fortran (CAF) Minnesota –Unified Parallel C (UPC) GWU-Berkeley-NSA-Michigan Tech –Titanium (extension to Java) Berkeley

3 The Guiding Principle What is the smallest change required to make Fortran 90 an effective parallel language? How can this change be expressed so that it is intuitive and natural for Fortran programmers? How can it be expressed so that existing compiler technology can implement it easily and efficiently?

4 Programming Model Single-Program-Multiple-Data (SPMD) Fixed number of processes/threads/images Explicit data decomposition All data is local All computation is local One-sided communication through co-dimensions Explicit synchronization

5 Co-Array Fortran Execution Model The number of images is fixed and each image has its own index, retrievable at run-time: 1 ≤ num_images(), 1 ≤ this_image() ≤ num_images() Each image executes the same program independently of the others. The programmer inserts explicit synchronization and branching as needed. An “object” has the same name in each image. Each image works on its own local data. An image moves remote data to local data through, and only through, explicit co-array syntax.

6 What is Co-Array Syntax? Co-Array syntax is a simple parallel extension to normal Fortran syntax. –It uses normal rounded brackets ( ) to point to data in local memory. –It uses square brackets [ ] to point to data in remote memory. –Syntactic and semantic rules apply separately but equally to ( ) and [ ].

7 Declaration of a Co-Array real :: x(n)[*]

8 CAF Memory Model (figure: every image holds its own local array x(1:n); square-bracket references such as x(1)[q] and x(n)[p] reach the copies on images q and p)

9 Examples of Co-Array Declarations
real :: a(n)[*]
complex :: z[0:*]
integer :: index(n)[*]
real :: b(n)[p,*]
real :: c(n,m)[0:p, -7:q, +11:*]
real, allocatable :: w(:)[:]
type(field) :: maxwell[p,*]

10 Communication Using CAF Syntax
y(:) = x(:)[p]
x(index(:)) = y[index(:)]
x(:)[q] = x(:) + x(:)[p]
Absent co-dimension defaults to the local object.

11 One-to-One Execution Model (figure: the CAF memory model with one physical processor per image)

12 Many-to-One Execution Model (figure: the CAF memory model with many physical processors per image)

13 One-to-Many Execution Model (figure: the CAF memory model with many images sharing one physical processor)

14 Many-to-Many Execution Model (figure: the CAF memory model with images and physical processors in a many-to-many mapping)

15 What Do Co-Dimensions Mean? real :: x(n)[p,q,*]
1. Replicate an array of length n, one on each image.
2. Build a map so each image knows how to find the array on any other image.
3. Organize images in a logical (not physical) three-dimensional grid.
4. The last co-dimension acts like an assumed-size array: * ≥ num_images()/(p*q)

16 Relative Image Indices (I) x[4,*]: this_image() = 15, this_image(x) = (/3,4/)

17 Relative Image Indices (II) x[0:3,0:*]: this_image() = 15, this_image(x) = (/2,3/)

18 Relative Image Indices (III) x[-5:-2,0:*]: this_image() = 15, this_image(x) = (/-3,3/)

19 Relative Image Indices (IV) x[0:1,0:*]: this_image() = 15, this_image(x) = (/0,7/)
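The four examples above all follow the same rule: co-subscripts are laid out over the image index in column-major order, offset by each co-dimension's lower bound. A serial Python sketch of that mapping (the helper name image_coords is hypothetical, not a CAF intrinsic):

```python
def image_coords(me, lowers, extents):
    """Map a 1-based image index to co-subscripts, column-major,
    the way this_image(x) does for a co-array whose co-dimensions
    have the given lower bounds and extents (the '*' dimension's
    extent just needs to be large enough to cover all images)."""
    coords = []
    k = me - 1                      # zero-based linear image index
    for low, ext in zip(lowers, extents):
        coords.append(low + k % ext)
        k //= ext
    return coords
```

With this rule, image 15 lands on (/3,4/) for x[4,*], (/2,3/) for x[0:3,0:*], (/-3,3/) for x[-5:-2,0:*], and (/0,7/) for x[0:1,0:*], matching the slides.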

20 Synchronization Intrinsic Procedures
sync_all() Full barrier; wait for all images before continuing.
sync_all(wait(:)) Partial barrier; wait only for those images in the wait(:) list.
sync_team(list(:)) Team barrier; only images in list(:) are involved.
sync_team(list(:),wait(:)) Team barrier; wait only for those images in the wait(:) list.
sync_team(myPartner) Synchronize with one other image.

21 Exercise 1: Global Reduction
subroutine globalSum(x)
  real(kind=8), dimension[0:*] :: x
  real(kind=8) :: work
  integer :: n, bit, i, mypal, dim, me, m
  dim = log2_images()
  if (dim .eq. 0) return
  m = 2**dim
  bit = 1
  me = this_image(x)
  do i = 1, dim
    mypal = xor(me, bit)
    bit = shiftl(bit, 1)
    call sync_all()
    work = x[mypal]
    call sync_all()
    x = x + work
  enddo
end subroutine globalSum
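The exercise is an XOR-butterfly reduction: in round r, each image exchanges partial sums with the partner whose index differs in bit r, so after log2 of the image count rounds every image holds the global sum. A serial Python simulation of the same pattern, for a power-of-two number of images:

```python
def global_sum(values):
    """Simulate the XOR-butterfly reduction: values[me] is image me's
    local x; each round pairs image me with me ^ bit."""
    n = len(values)
    dim = n.bit_length() - 1                      # plays the role of log2_images()
    x = list(values)
    bit = 1
    for _ in range(dim):
        work = [x[me ^ bit] for me in range(n)]   # work = x[mypal], after sync_all
        x = [xi + wi for xi, wi in zip(x, work)]  # x = x + work
        bit <<= 1
    return x
```

For example, global_sum([1, 2, 3, 4]) leaves every image holding 10, mirroring what the co-array version does with remote reads bracketed by barriers.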

22 Events
sync_team(list(:),list(me:me)) post event
sync_team(list(:),list(you:you)) wait event

23 Other CAF Intrinsic Procedures
sync_memory() Make co-arrays visible to all images.
sync_file(unit) Make local I/O operations visible to the global file system.
start_critical() end_critical() Allow only one image at a time into a protected region.

24 Other CAF Intrinsic Procedures
log2_images() Log base 2 of the greatest power of two less than or equal to num_images().
rem_images() The difference between num_images() and that greatest power of two.
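These two intrinsics are simple integer functions of the image count. A Python sketch of the same definitions (function names borrowed from the slide, implemented here for illustration):

```python
def log2_images(n):
    """Greatest d such that 2**d <= n (n = number of images, n >= 1)."""
    return n.bit_length() - 1

def rem_images(n):
    """Images left over beyond the greatest power of two <= n."""
    return n - (1 << log2_images(n))
```

So with 7 images, log2_images() is 2 and rem_images() is 3; the global-sum exercise above returns immediately when log2_images() is 0.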

25 Matrix Multiplication (figure: C = A x B with each matrix distributed in blocks over a logical processor grid; image [myP,myQ] owns one block of each matrix)

26 Matrix Multiplication
real, dimension(n,n)[p,*] :: a, b, c
do k = 1, n
   do q = 1, p
      c(i,j)[myP,myQ] = c(i,j)[myP,myQ] + a(i,k)[myP,q] * b(k,j)[q,myQ]
   enddo
enddo

27 Matrix Multiplication
real, dimension(n,n)[p,*] :: a, b, c
do k = 1, n
   do q = 1, p
      c(i,j) = c(i,j) + a(i,k)[myP,q] * b(k,j)[q,myQ]
   enddo
enddo
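The two fragments above compute the same thing; the second merely drops the redundant co-dimension on the local block c. A serial Python simulation of the whole computation, with the image grid flattened into ordinary array indices (nested lists stand in for the per-image blocks; this is an illustration, not the slides' code):

```python
def block_matmul(A, B, p, n):
    """Image (P,Q) owns n-by-n blocks A[P][Q] and B[P][Q]; its C block
    accumulates a(i,k)[myP,q] * b(k,j)[q,myQ] over the co-dimension q,
    exactly as in the co-array loop nest."""
    C = [[[[0.0] * n for _ in range(n)] for _ in range(p)] for _ in range(p)]
    for P in range(p):
        for Q in range(p):
            for q in range(p):              # the co-dimension loop
                for i in range(n):
                    for j in range(n):
                        for k in range(n):
                            C[P][Q][i][j] += A[P][q][i][k] * B[q][Q][k][j]
    return C
```

With p = 2 and n = 1 each block is a scalar, so multiplying [[1,2],[3,4]] by itself gives the familiar [[7,10],[15,22]], distributed one entry per image.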

28 Block Matrix Multiplication

29 Block Matrix Multiplication

30 2. An Example from the UK Met Office Unified Model

31 Incremental Conversion to Co-Array Fortran Fields are allocated on the local heap. One processor knows nothing about another processor’s memory structure, but each processor knows how to find co-arrays in another processor’s memory. Define one supplemental co-array structure. Create an alias for the local field through the co-array field. Communicate through the alias.

32 CAF Alias to Local Fields
real :: u(0:m+1,0:n+1,lev)
type(field) :: z[p,*]
z%ptr => u
u = z[p,q]%ptr

33 Irregular and Changing Data Structures (figure: on each image, z%ptr points to the local field u; z[p,q]%ptr reaches the field owned by image [p,q])

34 Problem Decomposition and Co-Dimensions (figure: image [p,q] with west neighbor [p-1,q], east neighbor [p+1,q], and north-south neighbors [p,q+1] and [p,q-1])

35 Cyclic Boundary Conditions East-West Direction
real, dimension [p,*] :: z
myP = this_image(z,1)  ! East-West
West = myP - 1
if (West < 1) West = nProcEW  ! Cyclic
East = myP + 1
if (East > nProcEW) East = 1  ! Cyclic
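The neighbor arithmetic is just a wrap-around on the 1-based co-subscript. A Python sketch of the same logic (helper name ew_neighbors is made up for this illustration):

```python
def ew_neighbors(myP, nProcEW):
    """Cyclic east-west neighbors for image column myP in 1..nProcEW,
    matching the West/East computation in the slide."""
    west = myP - 1 if myP > 1 else nProcEW   # wrap below 1
    east = myP + 1 if myP < nProcEW else 1   # wrap above nProcEW
    return west, east
```

On a 4-wide grid, column 1's west neighbor wraps to 4 and column 4's east neighbor wraps to 1.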

36 East-West Halo Swap
Move last row from west to my first halo:
u(0,1:n,1:lev) = z[West,myQ]%ptr(m,1:n,1:lev)
Move first row from east to my last halo:
u(m+1,1:n,1:lev) = z[East,myQ]%ptr(1,1:n,1:lev)
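Each image pulls one row from each neighbor into its own halo cells; no neighbor has to send anything. A one-dimensional serial Python simulation of that one-sided exchange (the list layout is an assumption made for the sketch):

```python
def halo_swap(blocks):
    """blocks[p] = [west_halo, interior..., east_halo] for image p on a
    cyclic ring. Each image overwrites its own halos by reading its
    west neighbor's last interior cell and its east neighbor's first,
    mirroring the one-sided co-array gets above."""
    n = len(blocks)
    for p in range(n):
        west, east = (p - 1) % n, (p + 1) % n
        blocks[p][0] = blocks[west][-2]   # last interior cell of west
        blocks[p][-1] = blocks[east][1]   # first interior cell of east
    return blocks
```

Only halo slots are written and only interior slots are read, so the exchange is safe to do in place, which is why the co-array version needs no matching send.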

37 Total Time (s) (table: rows are PxQ processor grids; columns compare SHMEM, SHMEM w/CAF SWAP, MPI w/CAF SWAP, and MPI; the timing values were lost in transcription)

38 3. CAF and “Object-Oriented” Programming Methodology

39 Using “Object-Oriented” Techniques with Co-Array Fortran Fortran 95 is not an object-oriented language. But it contains some features that can be used to emulate object-oriented programming methods. –Allocate/deallocate for dynamic memory management –Named derived types are similar to classes without methods. –Modules can be used to associate methods loosely with objects. –Constructors and destructors can be defined to encapsulate parallel data structures. –Generic interfaces can be used to overload procedures based on the named types of the actual arguments.

40 A Parallel “Class Library” for CAF Combine the object-based features of Fortran 95 with co-array syntax to obtain an efficient parallel numerical class library that scales to large numbers of processors. Encapsulate all the hard stuff in modules using named objects, constructors, destructors, generic interfaces, and dynamic memory management.

41 CAF Parallel “Class Libraries”
use BlockMatrices
use BlockVectors
type(PivotVector) :: pivot[p,*]
type(BlockMatrix) :: a[p,*]
type(BlockVector) :: x[*]
call newBlockMatrix(a,n,p)
call newPivotVector(pivot,a)
call newBlockVector(x,n)
call luDecomp(a,pivot)
call solve(a,x,pivot)

42 LU Decomposition

43 Communication for LU Decomposition
Row interchange:
temp(:) = a(k,:)
a(k,:) = a(j,:)[p,myQ]
a(j,:)[p,myQ] = temp(:)
Row “Broadcast”:
L0(i:n,i) = a(i:n,i)[p,p] i=1,n
Row/Column “Broadcast”:
L1(:,:) = a(:,:)[myP,p]
U1(:,:) = a(:,:)[p,myQ]

44 Vector Maps

45 Cyclic-Wrap Distribution

46 Vector Objects
type vector
   real, allocatable :: vector(:)
   integer :: lowerBound
   integer :: upperBound
   integer :: halo
end type vector

47 Block Vectors
type BlockVector
   type(VectorMap) :: map
   type(Vector), allocatable :: block(:)
   --other components--
end type BlockVector

48 Block Matrices
type BlockMatrix
   type(VectorMap) :: rowMap
   type(VectorMap) :: colMap
   type(Matrix), allocatable :: block(:,:)
   --other components--
end type BlockMatrix

49 CAF I/O for Named Objects
use BlockMatrices
use DiskFiles
type(PivotVector) :: pivot[p,*]
type(BlockMatrix) :: a[p,*]
type(DirectAccessDiskFile) :: file
call newBlockMatrix(a,n,p)
call newPivotVector(pivot,a)
call newDiskFile(file)
call readBlockMatrix(a,file)
call luDecomp(a,pivot)
call writeBlockMatrix(a,file)

50 5. Where Can I Try CAF?

51 CRAY Co-Array Fortran CAF has been a supported feature of Cray Fortran 90 since release 3.1 CRAY T3E –f90 -Z src.f90 –mpprun -n7 a.out CRAY X1 –ftn -Z src.f90 –aprun -n7 a.out

52 Co-Array Fortran on Other Platforms Rice University is developing an open source compiling system for CAF. –Runs on the HP-Alpha system at PSC –Runs on SGI platforms –We are planning to install it on Halem at GSFC IBM may put CAF on the BlueGene/L machine at LLNL. DARPA High Productivity Computing Systems (HPCS) Project wants CAF. –IBM, CRAY, SUN

53 The Co-Array Fortran Standard Co-Array Fortran is defined by: –R.W. Numrich and J.K. Reid, “Co-Array Fortran for Parallel Programming”, ACM Fortran Forum, 17(2):1-31, 1998

54 6. Summary

55 Why Language Extensions? Programmer uses a familiar language. Syntax gives the programmer control and flexibility. Compiler concentrates on local code optimization. Compiler evolves as the hardware evolves. –Lowest latency and highest bandwidth allowed by the hardware –Data ends up in registers or cache, not in memory –Arbitrary communication patterns –Communication along multiple channels

56 Summary Co-dimensions match your logical problem decomposition –Run-time system matches them to hardware decomposition –Explicit representation of neighbor relationships –Flexible communication patterns Code simplicity –Non-intrusive code conversion –Modernize code to Fortran 95 standard Code is always simpler and performance is always better than MPI.