1
High-Level Programming Models for Clusters Issues and Challenges Hans P. Zima Institute of Scientific Computing, University of Vienna, Austria and Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA CCGSC 2006 Flat Rock, NC September 13, 2006
2
Contents Introduction Issues and Challenges for HPC Languages
From HPF to High Productivity Languages A Short Overview of Chapel Future Research Conclusion
3
Abstraction in Programming
Programming models and languages bridge the gap between "reality" and hardware, at different levels of abstraction, e.g.:
- assembly languages
- general-purpose procedural languages
- functional languages
- very high-level domain-specific languages
- libraries
Abstraction implies loss of information: a gain in simplicity, clarity, verifiability, and portability versus potential performance degradation
4
The Emergence of High-Level Sequential Languages
The designers of the very first high level programming language were aware that their success depended on acceptable performance of the generated target programs: John Backus (1957): “… It was our belief that if FORTRAN … were to translate any reasonable scientific source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger …” High-level algorithmic languages became generally accepted standards for sequential programming since their advantages outweighed any performance drawbacks For parallel programming no similar development took place
5
HPC Programming Paradigm: State of the Art
Current HPC programming is dominated by the use of a standard language (Fortran, C/C++) combined with message passing (MPI)
MPI has made a tremendous contribution to the field, providing a portable standard accessible everywhere
BUT: there is a wide gap between the domain of the scientist and this programming model:
- conceptually simple problems (e.g., stencil computations) can result in very complex programs
- conceptually simple changes (like replacing a block data distribution with a cyclic distribution) are not easy to handle
- exploiting performance may require "heroic" programmer effort
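The "conceptually simple change" mentioned above can be made concrete with a small sketch (mine, not from the talk): when a distribution is expressed as a first-class mapping from global index to owner, swapping block for cyclic is a one-line change, whereas in hand-written MPI the same change ripples through buffer packing, neighbor computation, and index arithmetic in every communication routine. The function names below are illustrative.

```python
# Hypothetical sketch: a distribution as a mapping from index to owner.

def block_owner(i, n, p):
    """Owner of index i (0-based) under a block distribution of n items over p processes."""
    block = -(-n // p)          # ceil(n / p)
    return i // block

def cyclic_owner(i, n, p):
    """Owner of index i under a cyclic (round-robin) distribution."""
    return i % p

# Switching distributions is just switching the mapping function;
# no communication code needs to be rewritten by hand.
owners_block  = [block_owner(i, 8, 4) for i in range(8)]
owners_cyclic = [cyclic_owner(i, 8, 4) for i in range(8)]
```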
6
Fortran+MPI Communication for 3D Stencil in NAS MG (Source: Brad Chamberlain, Cray Inc.)
```fortran
      subroutine comm3(u,n1,n2,n3,kk)
      use caf_intrinsics
      implicit none
      include 'cafnpb.h'
      include 'globals.h'
      integer n1, n2, n3, kk
      double precision u(n1,n2,n3)
      integer axis

      if( .not. dead(kk) )then
        do axis = 1, 3
          if( nprocs .ne. 1) then
            call sync_all()
            call give3( axis, +1, u, n1, n2, n3, kk )
            call give3( axis, -1, u, n1, n2, n3, kk )
            call take3( axis, -1, u, n1, n2, n3 )
            call take3( axis, +1, u, n1, n2, n3 )
          else
            call comm1p( axis, u, n1, n2, n3, kk )
          endif
        enddo
      else
        call zero3(u,n1,n2,n3)
      endif
      return
      end

      subroutine give3( axis, dir, u, n1, n2, n3, k )
      use caf_intrinsics
      implicit none
      include 'cafnpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3, k, ierr
      double precision u( n1, n2, n3 )
      integer i3, i2, i1, buff_len, buff_id

      buff_id = 2 + dir
      buff_len = 0
      if( axis .eq. 1 )then
        if( dir .eq. -1 )then
          do i3=2,n3-1
            do i2=2,n2-1
              buff_len = buff_len + 1
              buff(buff_len,buff_id) = u( 2, i2, i3)
            enddo
          enddo
          buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >      buff(1:buff_len,buff_id)
        else if( dir .eq. +1 ) then
          ...
```

The remainder of give3, and the subroutines take3 and comm1p, continue in the same vein for well over a hundred further lines: symmetric packing, one-sided transfer, and unpacking code for every axis and direction. The point of the slide is the sheer volume of boilerplate a simple 3D stencil exchange requires.
7
Contents Introduction Issues and Challenges for HPC Languages
From HPF to High-Productivity Languages A Short Overview of Chapel Future Research Conclusion
8
Productivity Challenges for Peta-Scale Systems
Large-scale architectural parallelism:
- tens of thousands to hundreds of thousands of processors
- component failures may occur frequently
Extreme non-uniformity in data access:
- more than 1000 cycles to access local memory
Applications: large, complex, and long-lived:
- multi-disciplinary, multi-language, multi-paradigm
- dynamic, irregular, and adaptive
- long-lived, surviving many hardware generations: support for efficient migration has high priority
9
Key Requirements for High-Productivity Languages
High-level support for:
- explicit concurrency
- locality-awareness
- distributed collections
- multi-lingual, multi-paradigm, multi-disciplinary programming-in-the-large
10
Goal: Enhance productivity of scientists and engineers, without compromising performance
Vision: a programming environment built around three elements:
- a universal High Productivity Language, to serve as a standard for programming parallel systems over the next years
- a common Intermediate Language and Execution Model, providing a common infrastructure for multi-platform, multi-lingual, multi-paradigm compilation and performance-portable migration of legacy codes
- an infrastructure for an Intelligent Programming Environment supporting autonomous system operation and self-tuning, based on advanced expert-system and introspection technology
11
UNCOL Revisited: Portable Compilation/Tool Environment
[Diagram: legacy source programs (languages lg 1 .. lg m) and HPL source programs pass through an architecture-independent unified compiler front end into an IHPL intermediate program; after optimization transformations, architecture-specific back ends (Back End 1 .. Back End n) emit target programs TPL 1 .. TPL n.]
12
Contents Introduction Issues and Challenges for HPC Languages
From HPF to High Productivity Languages A Short Overview of Chapel Future Research Conclusion
13
The Path to High Productivity Languages
High Performance Fortran (HPF) Language Family:
- HPF predecessors: CM-Fortran, Fortran D, Vienna Fortran
- High Performance Fortran: HPF-1 (1993), HPF-2 (1997)
- post-HPF developments: HPF+, JAHPF
OpenMP
ZPL
Partitioned Global Address Space (PGAS) Languages:
- Co-Array Fortran, UPC, Titanium
High-Productivity Languages:
- Chapel, X10, Fortress
14
The High Performance Fortran Idea
Message-passing approach (communication is hand-coded; only a fragment of the boundary exchange is shown):

```fortran
! initialize MPI; compute local data distribution
do while (.not. converged)
  ! local computation
  do J=1,M
    do I=1,N
      B(I,J) = 0.25 * (A(I-1,J)+A(I+1,J)+A(I,J-1)+A(I,J+1))
    end do
  end do
  A(1:N,1:N) = B(1:N,1:N)
  ! explicit communication of boundary columns
  if (MOD(myrank,2) .eq. 1) then
    call MPI_SEND(B(1,1),N,...,myrank-1,..)
    call MPI_RCV(A(1,0),N,...,myrank-1,..)
    if (myrank .lt. s-1) then
      call MPI_SEND(B(1,M),N,...,myrank+1,..)
      call MPI_RCV(A(1,M+1),N,...,myrank+1,..)
    endif
  else
    ...
  endif
end do
```

HPF approach (global computation; communication is compiler-generated from the data-distribution directives):

```fortran
processors P(NUMBER_OF_PROCESSORS)
distribute (*,BLOCK) onto P :: A, B

do while (.not. converged)
  do J=1,N
    do I=1,N
      B(I,J) = 0.25 * (A(I-1,J)+A(I+1,J)+A(I,J-1)+A(I,J+1))
    end do
  end do
  A(1:N,1:N) = B(1:N,1:N)
end do
```
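The HPF side's "global view" can be paraphrased in plain Python (a hedged sketch; jacobi_step is my name, not from the slide): the programmer writes only the whole-array stencil, and the partitioning and communication concerns of the MPI side simply do not appear.

```python
# Illustrative global-view Jacobi sweep, mirroring the HPF fragment above.

def jacobi_step(a):
    """One 4-point stencil sweep over the interior of a 2-D grid (list of lists)."""
    n = len(a)
    b = [row[:] for row in a]          # boundary values are kept unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1])
    return b

grid = [[0.0] * 4 for _ in range(4)]
grid[1][1] = 4.0
grid = jacobi_step(grid)               # the spike spreads to its 4 neighbors
```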
15
HPF: Problems and Successes
HPF-1 lacked important functionality:
- data distributions do not support irregular data structures and algorithms
- lack of flexibility for processor mapping and data/thread affinity
- focus on SPMD data parallelism
- Fortran 90 as the base language lacked vendor compiler support
- compiler and runtime support for some HPF features (e.g., dynamic data distributions) was not sufficiently mature
However, the HPF concept survived:
- an HPF+/JAHPF plasma code reached an efficiency of 40% on the Earth Simulator:
  - explicit high-level formulation of communication patterns
  - explicit high-level control of communication schedules and "halos"
- the basic idea underlying HPF has been re-incarnated, in a more general context, in the recently developed HPCS languages
16
PGAS Language Overview
Typical representatives: Co-Array Fortran, UPC, Titanium
- based on an explicit SPMD model
- both private and shared data
- support for global distributed data structures: the global address space is logically partitioned and mapped to processors; in general, a static distinction between local and global references
- processor-centric view ("fragmented programming model")
- one-sided shared-memory communication
- collective communication and I/O libraries
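As a rough illustration (not from the slide, and ignoring real communication), the PGAS idea of a logically partitioned global address space can be sketched as a class that statically translates a global index into an (owner, local offset) pair; BlockPGASArray and its methods are hypothetical names.

```python
# Toy model of a PGAS array: globally addressable, physically partitioned.

class BlockPGASArray:
    def __init__(self, n, nprocs):
        self.block = -(-n // nprocs)       # ceil(n / nprocs)
        # each "processor" owns one contiguous block of the global array
        self.parts = [[0.0] * min(self.block, n - r * self.block)
                      for r in range(nprocs)]

    def locate(self, i):
        """Static translation of a global index into (owner rank, local offset)."""
        return i // self.block, i % self.block

    def get(self, i):
        """A 'one-sided' read: the owner does not participate explicitly."""
        r, off = self.locate(i)
        return self.parts[r][off]

a = BlockPGASArray(10, 4)                  # block size 3; last part holds 1 element
```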
17
Example: Setting up a block-distributed array in Titanium
```java
// determine parameters of local block:
Point<3> startCell = myBlockPos * numCellsPerBlockSide;
Point<3> endCell   = startCell + (numCellsPerBlockSide - [1,1,1]);

// create local myBlock array:
double [3d] myBlock = new double[startCell:endCell];

// build the distributed structure:
// (blocks was declared as a 1D array of references, one element per processor)
blocks.exchange(myBlock);
```

[Diagram: after the exchange, each processor P0, P1, P2 holds a blocks array whose entries reference every processor's myBlock.]
Source: K. Yelick et al.: Parallel Languages and Compilers: Perspective from the Titanium Experience
18
Example: Setting up a block-distributed array in Chapel
Standard Distribution Library:

```chapel
class block:      Distribution {…}
class cyclic:     Distribution {…}
…
class sparse-brd: Distribution {…}

const D: domain(3) distributed (block) = [l1..u1, l2..u2, l3..u3];
…
var A: [D] float;
```
19
HPCS Language Overview
HPCS Languages:
- Chapel (Cascade project, led by Cray Inc.)
- X10 (PERCS project, led by IBM)
- Fortress (HERO project, led by Sun Microsystems Inc.)
Common characteristics:
- global name space and global data access; in general, no static distinction between local and global references
- explicit high-level specification of parallelism
- explicit high-level specification of locality: data distribution and alignment, affinity (on-clause)
- high-level support for distributed collections
- support for data and task parallelism
- object orientation
20
Contents Introduction Issues and Challenges for HPC Languages
From HPF to High Productivity Languages A Short Overview of Chapel Future Research Conclusion
21
The Cascade Project Phase 1: Concept Study July 2002 -- June 2003
Phase 2: Prototyping Phase, July 2003 -- July 2006
Led by Cray Inc. Cascade partners: Caltech/JPL, University of Notre Dame, Stanford University
Chapel is the Cascade High Productivity Language
David Callahan, Brad Chamberlain, Steve Deitz, Roxana Diaconescu, John Plevyak, Hans Zima
22
High-Level Control of Locality
Locality control is key for performance:
- all modern HPC architectures have distributed memory
Fully automatic control is beyond the state of the art:
- automatic locality analysis had some limited successes for regular codes
- for irregular and adaptive applications, compiler knowledge is not sufficient to exploit locality efficiently, and runtime analysis can be highly expensive
- explicit user control of locality is essential for achieving high performance
Chapel approach:
- responsibility for generating communication is delegated to the compiler/runtime system
- locality is specified by data distributions, data alignment, and affinity
- all data distributions are user-defined
- additional user-provided locality assertions help the compiler where static knowledge is not available
23
Example: Matrix-Vector Multiplication (dense)
Version 1:

```chapel
var Mat:    domain(2) = [1..m, 1..n];
var MatCol: domain(1) = Mat(2);
var MatRow: domain(1) = Mat(1);
var A: [Mat] float;
var v: [MatCol] float;
var s: [MatRow] float;

s = sum(dim=2) [i,j in Mat] A(i,j)*v(j);
```

Version 2 (distributions added, algorithm unchanged):

```chapel
var L: [1..p1, 1..p2] locale = reshape(Locales);
var Mat:    domain(2) distributed(myB,myB) on L = [1..m, 1..n];
var MatCol: domain(1) aligned(*, Mat(2)) = Mat(2);
var MatRow: domain(1) aligned(Mat(1), *) = Mat(1);
var A: [Mat] float;
var v: [MatCol] float;
var s: [MatRow] float;

s = sum(dim=2) [i,j in Mat] A(i,j)*v(j);
```

Sparse code can be formulated in a similar way
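The reduction expression on this slide corresponds to an ordinary row-wise dot product; a plain-Python paraphrase (illustrative only, not the Chapel semantics in detail):

```python
# s(i) = sum over j of A(i,j) * v(j), i.e. a reduction along dimension 2.

def matvec(a, v):
    """Dense matrix-vector product over a list-of-lists matrix."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

s = matvec([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
```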
24
Locality Control in Chapel: Basic Concepts
Locale:
- "locale": abstract unit of locality, bound to an execution
- user-defined views of locale sets
- explicit allocation of data and computations on locales
Domain:
- a first-class entity
- components: index set, distribution, associated arrays, iterators
Array: a mapping from a domain to a set of variables
User-defined distributions:
- original ideas in Vienna Fortran and library-based approaches
- the user can work with distributions at three levels:
  - naive use of a predefined library distribution
  - explicit specification of a distribution by its global mapping
  - explicit specification of a distribution by global mapping and data layout
25
Key Functionality of the Distribution Framework
Two levels: global mapping and layout mapping
User-defined global mappings from index sets to locales:
- "standard" distributions (block, block-cyclic, etc.)
- distribution of irregular meshes; links to external partitioners
- distribution of hierarchical structures (multiblock)
User-defined layout specifications:
- a layout specifies the data arrangement within a locale
- sparse data structures are an important target
Dynamic reallocation and redistribution
High-level control of communication:
- user-defined specification of halos (ghost cells)
- user-defined assertions on communication
26
User-Defined Distributions:Global Mapping(1)
```chapel
/* declaration of distribution classes MyC and MyB: */
class MyC: Distribution {
  const z:   integer;   /* block size */
  const ntl: integer;   /* number of target locales */
  function map(i:index(source)):locale {   /* global mapping for MyC */
    return Locales(mod(ceil(i/z-1)+1,ntl));
  }
}

class MyB: Distribution {
  var bl: integer = ...;                   /* block length */
  function map(i:index(source)):locale {   /* global mapping for MyB */
    return Locales(ceil(i/bl));
  }
}

/* use of distribution classes MyC and MyB in declarations: */
const D1C: domain(1) distributed(MyC(z=100)) = 1..n1;
const D1B: domain(1) distributed(MyB) on Locales(1..num_locales/10) = 1..n1;
var A1: [D1C] float;
var A2: [D1B] float;
```
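For readers unfamiliar with the pseudo-Chapel above, the two global mappings can be restated in Python (a cleaned-up sketch with 1-based indices and locales numbered 1..ntl; the slide's mod/ceil arithmetic is simplified to an equivalent form, and the function names are mine):

```python
import math

def myc_map(i, z, ntl):
    """Block-cyclic (MyC): consecutive blocks of size z cycle over ntl locales."""
    return (math.ceil(i / z) - 1) % ntl + 1

def myb_map(i, bl):
    """Block (MyB): indices 1..bl on locale 1, bl+1..2*bl on locale 2, ..."""
    return math.ceil(i / bl)
```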
27
User-Defined Distributions:Global Mapping(2)
```chapel
/* declaration of distribution class MyC1 (cyclic(1)): */
class MyC1: Distribution {
  const ntl: integer;   /* number of target locales */
  function map(i:index(source)):locale {   /* global mapping for MyC1 */
    return Locales(mod(i-1,ntl)+1);
  }

  /* set of local iterators: */
  iterator DistSegIterator(loc: index(target)): index(source) {
    const N: integer = getSource().extent;
    const k: integer = locale_index(loc);
    for i in k..N by ntl { yield(i); }
  }

  /* distribution segment: */
  function GetDistributionSegment(loc: index(target)): Domain {
    return (k..N by ntl);
  }
}

/* use of distribution class MyC1 in declarations: */
const D1C1: domain(1) distributed(MyC1()) on Locales(1..4) = 1..16;
var A1: [D1C1] float;
...
```
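The cyclic(1) mapping and its local iterator translate directly into Python (an illustrative sketch; function names are mine, not the slide's):

```python
def myc1_map(i, ntl):
    """cyclic(1): index i (1-based) lives on locale ((i-1) mod ntl) + 1."""
    return (i - 1) % ntl + 1

def dist_seg_iterator(k, n, ntl):
    """Yield the distribution segment of locale k: k, k+ntl, k+2*ntl, ... up to n."""
    yield from range(k, n + 1, ntl)

# locale 2 of 4 over the index set 1..16, as in the slide's declaration
owned = list(dist_seg_iterator(2, 16, 4))
```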
28
An Artificial Example: Banded Distribution*
[Figure: a banded matrix with bandwidth bw = 3, distributed over p = 4 locales. Diagonal d of A is A/d = { A(i,j) | d = i+j }.
Distribution (global map): blocks of bw diagonals are cyclically mapped to locales.
Layout: each diagonal is represented as a one-dimensional dense array; the arrays in a locale are referenced through a pointer array.]
*This example was proposed by Brad Chamberlain (Cray Inc.)
29
User-Defined Banded Distribution (1)
```chapel
/* declaration of distribution class Banded: */
class Banded: Distribution {
  const b: integer;
  const n: integer = getSource()(1).extent();
  const p: integer = getTargetLocales().extent();
  const ndiags: integer = 2*n-1;
  var firstDiagsInDS: [1..p] domain(1);
  var DiagsInDS: [1..p] seq(integer) = nil;

  constructor Banded() {
    forall k in 1..p {
      firstDiagsInDS(k) = (k-1)*b+2 .. ndiags by b*p;
      forall d in firstDiagsInDS(k) {
        DiagsInDS(k) # d..min(d+b-1,2*n);
      }
    }
  }

  /* global mapping: */
  function map(i,j:integer):locale {
    return Locales(mod((i+j-2)/b+1,p));
  }

  /* set of local iterators: */
  iterator DistSegIterator(loc:locale): index(source) {
    const k: integer = locale_index(loc);
    for d in DiagsInDS(k) {
      for i in first_i(d) .. first_i(d)+length(d)-1 {
        yield(i, d-i);
      }
    }
  }
}
```
30
User-Defined Banded Distribution (2)
```chapel
/* declaration of layout class BandedLayout: */
class BandedLayout: LocalSegment {
  /* diagonals in distribution segment k: */
  const DIAGS_k: domain = Banded.diagsInDS(k);
  /* local index domain -- structure of the local array segment: */
  const LocalDomain = [DIAGS_k] domain(1);
  ...
  constructor BandedLayout() {
    forall d in DIAGS_k { LocalDomain(d) = 1..Banded.length(d); }
  }
  function layout(i,j:integer): index(LocalDomain) {  /* layout mapping */
    return this.index(i+j, i-first_i(i+j)+1);
  }
}

/* use of distribution class Banded and layout class BandedLayout in declarations: */
const D: domain(2) distributed(Banded(b=bw), BandedLayout())
         on Locales[1..p] = [1..n,1..n];
var A: [D] eltType;

iterator acrossDiagonal(d: index(diagD)): (integer,integer) {…}

forall d in diagD {
  for dx in acrossDiagonal(d) { … A(dx) … }
}
```
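The essence of the banded example, diagonal-blocked ownership plus a per-diagonal dense layout, can be sketched in Python (my paraphrase of the slide's map and layout, not the Chapel code itself; 1-based matrix indices as in the slides):

```python
def banded_owner(i, j, bw, p):
    """Locale (1-based) owning element (i, j): diagonal d = i + j ranges over
    2 .. 2*n, and blocks of bw consecutive diagonals cycle over p locales."""
    d = i + j
    return ((d - 2) // bw) % p + 1

def diagonal_layout(n):
    """Layout sketch: map each diagonal d to the ordered list of its (i, j)
    elements, each diagonal being stored as its own dense 1-D array."""
    diags = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            diags.setdefault(i + j, []).append((i, j))
    return diags

diags = diagonal_layout(3)   # diagonals 2..6 of a 3x3 matrix
```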
31
Example: Matrix-Vector Multiplication (sparse CRS)
[Figure: a sparse matrix in CRS format distributed across four locales; each locale k holds its own local data, column-index, and row-pointer arrays (D0..D3, C0..C3, R0..R3).]

```chapel
const D: domain(2) = [1..m, 1..n];
const DD: domain(D) sparse(CRS) = …;
distribute(DD, Block_CRS);
var AA: [DD] float;
…
```
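Independently of the distribution machinery, the CRS storage scheme pictured above supports a simple matrix-vector product; a minimal plain-Python version (0-based indices; the array names vals/cols/rowptr are mine, not the slide's):

```python
def crs_matvec(vals, cols, rowptr, v):
    """s(i) = sum of vals[k] * v[cols[k]] over row i's segment of the CRS arrays."""
    s = []
    for r in range(len(rowptr) - 1):
        s.append(sum(vals[k] * v[cols[k]]
                     for k in range(rowptr[r], rowptr[r + 1])))
    return s

# the 2x3 matrix [[5, 0, 2], [0, 3, 0]] in CRS form:
s = crs_matvec([5.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3], [1.0, 1.0, 1.0])
```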
32
User-Defined Halos
User-defined specification of the halo (ghost cells)
Compiler/runtime system:
- allocates images
- defines the mapping between remote objects and their images
Halo management:
- update
- flush
[Diagram: a distribution segment surrounded by halo cells holding images of neighboring segments' boundary data.]
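The halo mechanism can be illustrated with a toy 1-D sketch (hypothetical, not the Chapel API): each segment keeps read-only images of its neighbors' boundary cells, and an explicit update refreshes them before a stencil sweep.

```python
class Segment:
    def __init__(self, data):
        self.data = list(data)        # cells owned by this segment
        self.left = self.right = 0.0  # halo images of neighbor boundary cells

def halo_update(segs):
    """Refresh every segment's halo images from its neighbors' boundary cells."""
    for k, s in enumerate(segs):
        if k > 0:
            s.left = segs[k - 1].data[-1]
        if k + 1 < len(segs):
            s.right = segs[k + 1].data[0]

segs = [Segment([1.0, 2.0]), Segment([3.0, 4.0]), Segment([5.0, 6.0])]
halo_update(segs)   # the middle segment now sees both neighbors' boundaries
```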
33
Contents Introduction Issues and Challenges for HPC Languages
From HPF to High Productivity Languages A Short Overview of Chapel Future Research Conclusion
34
Intelligent Programming Environments
Objective: create an infrastructure for an intelligent programming environment that supports:
- a portable framework for introspection
- fault tolerance and resilience to component failures for large-scale systems
- autonomy and dynamic adaptation (self-tuning) for complex applications
- tool integration and support for high productivity
- knowledge acquisition, learning, consulting, and knowledge presentation
- a shared user interface
35
Case Study: Offline Performance Tuning
[Diagram: offline tuning cycle. The source program is compiled by a parallelizing compiler into an instrumented target program running on the HPCS system; monitoring data feeds an analysis component backed by a knowledge base, which issues restructuring commands to a transformation system (the actuators) until the end of the tuning cycle.]
36
Case Study: Online Performance Tuning
[Diagram: online tuning loop. Monitoring data from the instrumented target program running on the HPCS system feeds an expert system, whose actuators can re-instrument the program, change function implementations, or drive the transformation system and parallelizing compiler to restructure and recompile the source program.]
37
Introspection Framework Overview
[Diagram: introspection framework overview. The framework connects an application to a knowledge base via sensors and actuators; an agent system with an inference engine performs monitoring, analysis, and tuning, drawing on system knowledge (hardware, operating system, languages, compilers, libraries), application-domain knowledge, application knowledge (components, semantics, performance experiments), and presentation knowledge.]
38
Heterogeneous System Example: Future On-Board Systems
Future space missions will require enhanced on-board computing systems:
- deep-space missions operating in hazardous environments
- advanced sensor technology producing large amounts of high-quality data
- latency and bandwidth limitations for transmissions between spacecraft and Earth
- real-time on-board decisions concerning navigation, planning of science experiments, and analysis of unexpected phenomena
On-board systems consist of two major components:
- a radiation-hardened control system
- a heterogeneous, highly parallel, reduced-reliability computational subsystem
New high-level, fault-tolerant programming models for such systems are needed
[Diagram: on-board system structure, spacecraft control computer plus computational subsystem, communicating with Earth.]
39
Software-Enhanced Computational Subsystem
[Diagram: a massively parallel, heterogeneous computational subsystem enhanced by a global introspection framework, with introspection sensors and actuators supporting fault tolerance, performance tuning, and other services, attached to the spacecraft control computer, which communicates with Earth.]
40
Conclusion
- MPI provides a key infrastructure for HPC that is here to stay. However, productivity and reliability considerations will, in the long term, enforce a programming paradigm in which MPI is the target, not the source
- From a practical point of view, general agreement on a single HPC language, combined with a common intermediate language, front end, and programming environment, would be the best way to address the difficult open problems
- Acceptance of a new language depends on many criteria, including:
  - functionality and target-code performance
  - mature compiler and runtime-system technology
  - an intelligent programming environment closely integrated with the language
  - users' familiarity with the syntax and semantics of conventional (sequential) features
  - support by funding agencies and major vendors
- Open research issues include models for heterogeneous systems, design of intelligent programming environments, fault tolerance, and automatic performance tuning