Chapel: Motivating Themes
Brad Chamberlain, Cray Inc.
CSEP 524, May 20, 2010

What is Chapel?
- A new parallel language being developed by Cray Inc.
- Part of Cray's entry in DARPA's HPCS program
- Main goal: improve programmer productivity
  - improve the programmability of parallel computers
  - match or beat the performance of current programming models
  - provide better portability than current programming models
  - improve the robustness of parallel codes
- Target architectures:
  - multicore desktop machines
  - clusters of commodity processors
  - Cray architectures
  - systems from other vendors
- A work in progress

Chapel's Setting: HPCS
- HPCS: High Productivity Computing Systems (DARPA et al.)
- Goal: raise the productivity of high-end computing users by 10x
  - Productivity = Performance + Programmability + Portability + Robustness
- Phase II: Cray, IBM, Sun (July 2003 – June 2006)
  - evaluated the entire system architecture's impact on productivity...
    - processors, memory, network, I/O, OS, runtime, compilers, tools, ...
  - ...and new languages: Cray: Chapel, IBM: X10, Sun: Fortress
- Phase III: Cray, IBM (July 2006 – )
  - implement the systems and technologies resulting from Phase II
  - (Sun also continues work on Fortress, without HPCS funding)

Chapel: Motivating Themes
1) general parallel programming
2) global-view abstractions
3) multiresolution design
4) control of locality/affinity
5) reduce the gap between mainstream & parallel languages

1) General Parallel Programming
- General software parallelism
  - Algorithms: should be able to express any algorithm that comes to mind
    - users should never hit a limitation requiring them to return to MPI
  - Styles: data-parallel, task-parallel, concurrent algorithms
    - as well as the ability to compose these naturally (see the sketch after this list)
  - Levels: module-level, function-level, loop-level, statement-level, ...
- General hardware parallelism
  - Types: multicore desktops, clusters, HPC systems, ...
  - Levels: inter-machine, inter-node, inter-core, vectors, multithreading
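As a concrete illustration of composing these styles, here is a minimal sketch (not from the slides) in the same 'def'-style Chapel syntax used later in this talk; it nests data-parallel forall loops inside a task-parallel cobegin:

def main() {
  var A, B: [1..1000] real;

  cobegin {                   // task parallelism: the two statements run as concurrent tasks
    forall i in 1..1000 do    // data parallelism inside the first task
      A(i) = i;
    forall i in 1..1000 do    // data parallelism inside the second task
      B(i) = 2*i;
  }
}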

2) Global-view vs. Fragmented
Problem: "Apply 3-pt stencil to vector"
[figure: the stencil b(i) = (a(i-1) + a(i+1))/2 depicted once as a single whole-vector (global-view) computation and once as a fragmented computation over per-processor chunks with overlapping boundary elements]

2) Global-view vs. SPMD Code
Problem: "Apply 3-pt stencil to vector"

global-view:
def main() {
  var n: int = 1000;
  var a, b: [1..n] real;

  forall i in 2..n-1 {
    b(i) = (a(i-1) + a(i+1))/2;
  }
}

SPMD:
def main() {
  var n: int = 1000;
  var locN: int = n/numProcs;
  var a, b: [0..locN+1] real;

  if (iHaveRightNeighbor) {
    send(right, a(locN));
    recv(right, a(locN+1));
  }
  if (iHaveLeftNeighbor) {
    send(left, a(1));
    recv(left, a(0));
  }
  forall i in 1..locN {
    b(i) = (a(i-1) + a(i+1))/2;
  }
}

2) Global-view vs. SPMD Code
Problem: "Apply 3-pt stencil to vector"

SPMD:
def main() {
  var n: int = 1000;
  var locN: int = n/numProcs;
  var a, b: [0..locN+1] real;
  var innerLo: int = 1;
  var innerHi: int = locN;

  if (iHaveRightNeighbor) {
    send(right, a(locN));
    recv(right, a(locN+1));
  } else {
    innerHi = locN-1;
  }
  if (iHaveLeftNeighbor) {
    send(left, a(1));
    recv(left, a(0));
  } else {
    innerLo = 2;
  }
  forall i in innerLo..innerHi {
    b(i) = (a(i-1) + a(i+1))/2;
  }
}

Assumes numProcs divides n; a more general version would require additional effort (see the sketch below).

global-view:
def main() {
  var n: int = 1000;
  var a, b: [1..n] real;

  forall i in 2..n-1 {
    b(i) = (a(i-1) + a(i+1))/2;
  }
}
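As a hedged sketch (not from the slides) of the "additional effort" the note mentions, each process could compute its own local problem size rather than assuming an even split, for example by giving the first n%numProcs processes one extra element:

var locN: int = n/numProcs +
                (if myPE < n%numProcs then 1 else 0);  // spread the remainder over the low-numbered ranks

(myPE is the process's rank, as in the MPI version on the next slide; the surrounding bookkeeping and communication would also need to account for each neighbor's locN.)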

2) SPMD pseudo-code + MPI
Problem: "Apply 3-pt stencil to vector"

SPMD (pseudocode + MPI):
var n: int = 1000, locN: int = n/numProcs;
var a, b: [0..locN+1] real;
var innerLo: int = 1, innerHi: int = locN;
var numProcs, myPE: int;
var retval: int;
var status: MPI_Status;

MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
MPI_Comm_rank(MPI_COMM_WORLD, &myPE);

if (myPE < numProcs-1) {
  retval = MPI_Send(&(a(locN)), 1, MPI_FLOAT, myPE+1, 0, MPI_COMM_WORLD);
  if (retval != MPI_SUCCESS) { handleError(retval); }
  retval = MPI_Recv(&(a(locN+1)), 1, MPI_FLOAT, myPE+1, 1, MPI_COMM_WORLD, &status);
  if (retval != MPI_SUCCESS) { handleErrorWithStatus(retval, status); }
} else
  innerHi = locN-1;

if (myPE > 0) {
  retval = MPI_Send(&(a(1)), 1, MPI_FLOAT, myPE-1, 1, MPI_COMM_WORLD);
  if (retval != MPI_SUCCESS) { handleError(retval); }
  retval = MPI_Recv(&(a(0)), 1, MPI_FLOAT, myPE-1, 0, MPI_COMM_WORLD, &status);
  if (retval != MPI_SUCCESS) { handleErrorWithStatus(retval, status); }
} else
  innerLo = 2;

forall i in (innerLo..innerHi) {
  b(i) = (a(i-1) + a(i+1))/2;
}

Communication becomes geometrically more complex for higher-dimensional arrays.

2) rprj3 stencil from NAS MG
[figure: the 27-point rprj3 stencil decomposed into four weight classes w0, w1, w2, w3, applied to the center point, face neighbors, edge neighbors, and corner neighbors, respectively]

2) NAS MG rprj3 stencil in Fortran + MPI

      subroutine comm3(u,n1,n2,n3,kk)
      use caf_intrinsics
      implicit none
      include 'cafnpb.h'
      include 'globals.h'
      integer n1, n2, n3, kk
      double precision u(n1,n2,n3)
      integer axis

      if( .not. dead(kk) )then
         do axis = 1, 3
            if( nprocs .ne. 1) then
               call sync_all()
               call give3( axis, +1, u, n1, n2, n3, kk )
               call give3( axis, -1, u, n1, n2, n3, kk )
               call sync_all()
               call take3( axis, -1, u, n1, n2, n3 )
               call take3( axis, +1, u, n1, n2, n3 )
            else
               call comm1p( axis, u, n1, n2, n3, kk )
            endif
         enddo
      else
         do axis = 1, 3
            call sync_all()
         enddo
         call zero3(u,n1,n2,n3)
      endif
      return
      end

      subroutine give3( axis, dir, u, n1, n2, n3, k )
      use caf_intrinsics
      implicit none
      include 'cafnpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3, k, ierr
      double precision u( n1, n2, n3 )
      integer i3, i2, i1, buff_len, buff_id

      buff_id = 2 + dir
      buff_len = 0

      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( 2, i2,i3)
               enddo
            enddo
            buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >           buff(1:buff_len,buff_id)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( n1-1, i2,i3)
               enddo
            enddo
            buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >           buff(1:buff_len,buff_id)
         endif
      endif

      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( i1, 2,i3)
               enddo
            enddo
            buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >           buff(1:buff_len,buff_id)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( i1,n2-1,i3)
               enddo
            enddo
            buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >           buff(1:buff_len,buff_id)
         endif
      endif

      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( i1,i2,2)
               enddo
            enddo
            buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >           buff(1:buff_len,buff_id)
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( i1,i2,n3-1)
               enddo
            enddo
            buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >           buff(1:buff_len,buff_id)
         endif
      endif
      return
      end

      subroutine take3( axis, dir, u, n1, n2, n3 )
      use caf_intrinsics
      implicit none
      include 'cafnpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3
      double precision u( n1, n2, n3 )
      integer buff_id, indx
      integer i3, i2, i1

      buff_id = 3 + dir
      indx = 0

      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(n1,i2,i3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(1,i2,i3) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif

      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,n2,i3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,1,i3) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif

      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,n3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,1) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif
      return
      end

      subroutine comm1p( axis, u, n1, n2, n3, kk )
      use caf_intrinsics
      implicit none
      include 'cafnpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3
      double precision u( n1, n2, n3 )
      integer i3, i2, i1, buff_len, buff_id
      integer i, kk, indx

      dir = -1
      buff_id = 3 + dir
      buff_len = nm2
      do i=1,nm2
         buff(i,buff_id) = 0.0D0
      enddo

      dir = +1
      buff_id = 3 + dir
      buff_len = nm2
      do i=1,nm2
         buff(i,buff_id) = 0.0D0
      enddo

      dir = +1
      buff_id = 2 + dir
      buff_len = 0

      if( axis .eq. 1 )then
         do i3=2,n3-1
            do i2=2,n2-1
               buff_len = buff_len + 1
               buff(buff_len, buff_id) = u( n1-1, i2,i3)
            enddo
         enddo
      endif
      if( axis .eq. 2 )then
         do i3=2,n3-1
            do i1=1,n1
               buff_len = buff_len + 1
               buff(buff_len, buff_id) = u( i1,n2-1,i3)
            enddo
         enddo
      endif
      if( axis .eq. 3 )then
         do i2=1,n2
            do i1=1,n1
               buff_len = buff_len + 1
               buff(buff_len, buff_id) = u( i1,i2,n3-1)
            enddo
         enddo
      endif

      dir = -1
      buff_id = 2 + dir
      buff_len = 0

      if( axis .eq. 1 )then
         do i3=2,n3-1
            do i2=2,n2-1
               buff_len = buff_len + 1
               buff(buff_len,buff_id) = u( 2, i2,i3)
            enddo
         enddo
      endif
      if( axis .eq. 2 )then
         do i3=2,n3-1
            do i1=1,n1
               buff_len = buff_len + 1
               buff(buff_len, buff_id) = u( i1, 2,i3)
            enddo
         enddo
      endif
      if( axis .eq. 3 )then
         do i2=1,n2
            do i1=1,n1
               buff_len = buff_len + 1
               buff(buff_len, buff_id) = u( i1,i2,2)
            enddo
         enddo
      endif

      do i=1,nm2
         buff(i,4) = buff(i,3)
         buff(i,2) = buff(i,1)
      enddo

      dir = -1
      buff_id = 3 + dir
      indx = 0

      if( axis .eq. 1 )then
         do i3=2,n3-1
            do i2=2,n2-1
               indx = indx + 1
               u(n1,i2,i3) = buff(indx, buff_id)
            enddo
         enddo
      endif
      if( axis .eq. 2 )then
         do i3=2,n3-1
            do i1=1,n1
               indx = indx + 1
               u(i1,n2,i3) = buff(indx, buff_id)
            enddo
         enddo
      endif
      if( axis .eq. 3 )then
         do i2=1,n2
            do i1=1,n1
               indx = indx + 1
               u(i1,i2,n3) = buff(indx, buff_id)
            enddo
         enddo
      endif

      dir = +1
      buff_id = 3 + dir
      indx = 0

      if( axis .eq. 1 )then
         do i3=2,n3-1
            do i2=2,n2-1
               indx = indx + 1
               u(1,i2,i3) = buff(indx, buff_id)
            enddo
         enddo
      endif
      if( axis .eq. 2 )then
         do i3=2,n3-1
            do i1=1,n1
               indx = indx + 1
               u(i1,1,i3) = buff(indx, buff_id)
            enddo
         enddo
      endif
      if( axis .eq. 3 )then
         do i2=1,n2
            do i1=1,n1
               indx = indx + 1
               u(i1,i2,1) = buff(indx, buff_id)
            enddo
         enddo
      endif
      return
      end

      subroutine rprj3(r,m1k,m2k,m3k,s,m1j,m2j,m3j,k)
      implicit none
      include 'cafnpb.h'
      include 'globals.h'
      integer m1k, m2k, m3k, m1j, m2j, m3j, k
      double precision r(m1k,m2k,m3k), s(m1j,m2j,m3j)
      integer j3, j2, j1, i3, i2, i1, d1, d2, d3, j
      double precision x1(m), y1(m), x2, y2

      if(m1k.eq.3)then
         d1 = 2
      else
         d1 = 1
      endif
      if(m2k.eq.3)then
         d2 = 2
      else
         d2 = 1
      endif
      if(m3k.eq.3)then
         d3 = 2
      else
         d3 = 1
      endif

      do j3=2,m3j-1
         i3 = 2*j3-d3
         do j2=2,m2j-1
            i2 = 2*j2-d2
            do j1=2,m1j
               i1 = 2*j1-d1
               x1(i1-1) = r(i1-1,i2-1,i3  ) + r(i1-1,i2+1,i3  )
     >                  + r(i1-1,i2,  i3-1) + r(i1-1,i2,  i3+1)
               y1(i1-1) = r(i1-1,i2-1,i3-1) + r(i1-1,i2-1,i3+1)
     >                  + r(i1-1,i2+1,i3-1) + r(i1-1,i2+1,i3+1)
            enddo
            do j1=2,m1j-1
               i1 = 2*j1-d1
               y2 = r(i1, i2-1,i3-1) + r(i1, i2-1,i3+1)
     >            + r(i1, i2+1,i3-1) + r(i1, i2+1,i3+1)
               x2 = r(i1, i2-1,i3  ) + r(i1, i2+1,i3  )
     >            + r(i1, i2,  i3-1) + r(i1, i2,  i3+1)
               s(j1,j2,j3) =
     >              0.5D0    * r(i1,i2,i3)
     >            + 0.25D0   * (r(i1-1,i2,i3) + r(i1+1,i2,i3) + x2)
     >            + 0.125D0  * (x1(i1-1) + x1(i1+1) + y2)
     >            + 0.0625D0 * (y1(i1-1) + y1(i1+1))
            enddo
         enddo
      enddo

      j = k-1
      call comm3(s,m1j,m2j,m3j,j)
      return
      end

2) NAS MG rprj3 stencil in Chapel

def rprj3(S, R) {
  const Stencil = [-1..1, -1..1, -1..1],
        w: [0..3] real = (0.5, 0.25, 0.125, 0.0625),
        w3d = [(i,j,k) in Stencil] w((i!=0) + (j!=0) + (k!=0));

  forall ijk in S.domain do
    S(ijk) = + reduce [offset in Stencil]
                      (w3d(offset) * R(ijk + offset*R.stride));
}

Our previous work in ZPL showed that compact, global-view codes like these can result in performance that matches or beats hand-coded Fortran+MPI while also supporting more runtime flexibility.

NAS MG rprj3 stencil in ZPL

procedure rprj3(var S,R: [,,] double; d: array [] of direction);
begin
  S := 0.5    *  R
     + 0.25   * (R@^d[ 1, 0, 0] + R@^d[ 0, 1, 0] + R@^d[ 0, 0, 1] +
                 R@^d[-1, 0, 0] + R@^d[ 0,-1, 0] + R@^d[ 0, 0,-1])
     + 0.125  * (R@^d[ 1, 1, 0] + R@^d[ 1, 0, 1] + R@^d[ 0, 1, 1] +
                 R@^d[ 1,-1, 0] + R@^d[ 1, 0,-1] + R@^d[ 0, 1,-1] +
                 R@^d[-1, 1, 0] + R@^d[-1, 0, 1] + R@^d[ 0,-1, 1] +
                 R@^d[-1,-1, 0] + R@^d[-1, 0,-1] + R@^d[ 0,-1,-1])
     + 0.0625 * (R@^d[ 1, 1, 1] + R@^d[ 1, 1,-1] + R@^d[ 1,-1, 1] + R@^d[ 1,-1,-1] +
                 R@^d[-1, 1, 1] + R@^d[-1, 1,-1] + R@^d[-1,-1, 1] + R@^d[-1,-1,-1]);
end;

NAS MG Speedup: ZPL vs. Fortran + MPI
[speedup chart: Cray T3E]
- ZPL scales better than MPI since its communication is expressed in an implementation-neutral way; this permits the compiler to use SHMEM on this Cray T3E but MPI on a commodity cluster
- ZPL also performs better at smaller scales, where communication is not the bottleneck
  => new languages need not imply performance sacrifices
- Similar observations, and more dramatic ones, have been made using more recent architectures, languages, and benchmarks

Generality Notes
[speedup chart: Cray T3E]
- Each ZPL binary supports:
  - an arbitrary load-time problem size
  - an arbitrary load-time # of processors
  - 1D/2D/3D data decompositions
- This MPI binary only supports:
  - a static 2**k problem size
  - a static 2**j # of processors
  - a 3D data decomposition
- The code could be rewritten to relax these assumptions, but at what cost?
  - in performance?
  - in development effort?

Code Size

Code Size Notes
- The ZPL codes are 5.5–6.5x shorter because ZPL supports a global view of parallelism rather than an SPMD programming model
  - little/no code for communication
  - little/no code for array bookkeeping
- More important than the size difference is that the ZPL code is easier to write, read, modify, and maintain

Global-view models can benefit productivity:
- more programmable, flexible
- able to achieve competitive performance
- more portable; leave low-level details to the compiler
[speedup chart: Cray T3E]

NPB: MPI vs. ZPL
[bar charts: code size of the timed kernel for MG, CG, EP, IS, FT]

NPB: MPI vs. ZPL
[charts: performance of the C/Fortran + MPI versions vs. the ZPL versions for MG, CG, EP, IS, FT]

2) Classifying HPC Programming Notations (data model / control model)

communication libraries:
  MPI, MPI-2              fragmented / fragmented/SPMD
  SHMEM, ARMCI, GASNet    fragmented / SPMD
shared memory models:
  OpenMP, pthreads        global-view / global-view (trivially)
PGAS languages:
  Co-Array Fortran        fragmented / SPMD
  UPC                     global-view / SPMD
  Titanium                fragmented / SPMD
HPCS languages:
  Chapel, X10 (IBM), Fortress (Sun)   global-view / global-view

3) Multiresolution Languages: Motivation
Two typical camps of parallel language design: low-level vs. high-level
[diagram: low-level approaches (MPI, OpenMP, pthreads) expose implementing mechanisms above the target machine ("Why is everything so tedious?"); high-level approaches (ZPL, HPF) provide higher-level abstractions above the target machine ("Why don't I have more control?")]

3) Multiresolution Language Design
Our approach: structure the language in a layered manner, permitting it to be used at multiple levels as required/desired
- support high-level features and automation for convenience
- provide the ability to drop down to lower, more manual levels (see the sketch below)
- use appropriate separation of concerns to keep these layers clean
[diagram: language concepts layered over the target machine: Distributions, Data Parallelism, Task Parallelism, and Locality Control, built on the Base Language, above the Target Machine]
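To make the layering concrete, here is a minimal sketch (not from the slides; written in the 'def'-style syntax these slides use, with standard-library names such as Locales and here.id, so treat the details as illustrative rather than exact). It shows the same spirit at two levels: a high-level data-parallel loop over a domain, and a lower-level loop that places explicit tasks on specific locales.

// higher level: data parallelism; the compiler/runtime manage tasks and locality
def highLevel() {
  var n = 1000;
  var D = [1..n];        // a domain; a distribution could later be applied to it
  var A: [D] real;
  forall i in D do
    A(i) = i * 2.0;
}

// lower level: explicit tasks and explicit placement
def lowLevel() {
  coforall loc in Locales do   // one task per locale
    on loc do                  // run the body on that locale
      writeln("hello from locale ", here.id);
}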

4) Ability to Tune for Locality/Affinity
- Large-scale systems tend to store memory with processors
  - a good approach for building scalable parallel systems
- Remote accesses tend to be significantly more expensive than local ones
- Therefore, placement of data relative to computation matters for scalable performance
  - the programmer should have control over the placement of data and tasks (see the sketch below)
- As multicore chips grow in #cores, locality is likely to become more important in desktop parallel programming as well
  - GPUs/accelerators also expose node-level locality concerns
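A hedged sketch (not from the slides, and written against the Block distribution module and 'dmapped' syntax of later Chapel releases, so treat the exact spelling as an assumption) of what programmer control over data placement can look like:

use BlockDist;

config const n = 8;

// Distribute D's indices block-wise across the machine's locales...
const D = {1..n, 1..n} dmapped Block(boundingBox={1..n, 1..n});
var A: [D] real;

// ...so that each iteration of this forall runs on the locale that owns A[i,j],
// keeping the computation next to its data.
forall (i,j) in D do
  A[i,j] = here.id;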

4) A Note on Machine Model
- As with ZPL, the CTA (Candidate Type Architecture) is still present in our design as a way to reason about locality
- That said, it is probably more subconscious for us
- And we vary in some minor ways:
  - no controller node
    - though we do utilize a front-end launcher node in practice
  - nodes can execute multiple tasks/threads
    - through software multiplexing if not hardware

5) Support for Modern Language Concepts
- Students graduate with training in Java, Matlab, Perl, C#
- The HPC community is mired in Fortran, C (maybe C++) and MPI
- We'd like to narrow this gulf:
  - leverage advances in modern language design
  - better utilize the skills of the entry-level workforce...
  - ...while not ostracizing traditional HPC programmers
- Examples:
  - build on an imperative, block-structured language design
  - support object-oriented programming, but make its use optional
  - support static type inference and generic programming (see the sketch below) to support...
    - ...exploratory programming as in scripting languages
    - ...code reuse
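As a small, hedged illustration (not from the slides) of the type-inference and generic-programming bullets above, in the 'def'-style Chapel syntax used earlier in this talk:

def area(radius) {              // no declared types: a generic function
  return 3.14159 * radius**2;   // works for either int or real arguments
}

def main() {
  var r = 2.5;                  // r's type (real) is inferred from its initializer
  writeln(area(r));             // instantiated for a real argument
  writeln(area(3));             // instantiated for an int argument
}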