X10 Tutorial PSC Software Productivity Study May 23 – 27, 2005 Vivek Sarkar IBM T.J. Watson Research Center This work has been supported.

X10 Tutorial
PSC Software Productivity Study, May 23 – 27, 2005
Vivek Sarkar, IBM T.J. Watson Research Center, vsarkar@us.ibm.com
This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.

Acknowledgments
X10 core team: Philippe Charles, Chris Donawa, Kemal Ebcioglu, Christian Grothoff, Allan Kielstra, Christoph von Praun, Vivek Sarkar, Vijay Saraswat
X10 productivity team: Catalina Danis, Christine Halverson
Additional contributors to PSC Productivity Study: David Bader, Bill Clark, Nick Nystrom, John Urbanic

Outline
1. What is X10?
   background, status
2. Basic X10 (single place)
   async, finish, atomic
   future, force
3. Basic X10 (arrays & loops)
   points, rectangular regions, arrays
   for, foreach
4. Scalable X10 (multiple places)
   places, distributions, distributed arrays, ateach, BadPlaceException
5. Clocks
   creation, registration, next, resume, drop, ClockUseException
6. Basic serial constructs that differ from Java
   const, nullable, extern
7. Advanced topics
   Value types, conditional atomic sections (when), general regions & distributions
   Refer to language spec for details

What is X10?
X10 is a new experimental language developed in the IBM PERCS project as part of the DARPA program on High Productivity Computing Systems (HPCS)
X10's goal is to provide a new parallel programming model and its embodiment in a high level language that:
1. is more productive than current models,
2. can support higher levels of abstraction better than current models, and
3. can exploit the multiple levels of parallelism and nonuniform data access that are critical for obtaining scalable performance in current and future HPC systems.

X10 status and schedule
6/2003   PERCS programming model concept (end of PERCS Phase 1)
7/2004   Start of PERCS Phase 2
2/2004   Kickoff of X10 as concrete embodiment of PERCS programming model as a new language
7/2004   First draft of X10 language specification
2/2005   First X10 implementation: unoptimized single-VM prototype
           Emulates distributed parallelism in a single process
           This is what you will use to run X10 programs this week
5/2005   X10 productivity study at Pittsburgh Supercomputing Center
7/2005   Results from X10 application & productivity studies
2H2005   Revise language based on application & productivity feedback
2H2005   Start participation in High Productivity Language consortium?
1/2006   Second X10 implementation: optimized multi-VM prototype
6/2006   Open source release of X10 reference implementation
6/2006   Design completed for production X10 implementation in Phase 3 (end of Phase 2)

Current X10 Environment: Unoptimized Single-VM Implementation
[Figure: tool flow from Foo.x10 through x10c to Foo.java and Foo.class, then into the X10 Virtual Machine]
Foo.x10: X10 source program; must contain a class named Foo with a public static void main(String[] args) method
x10c: X10 compiler; translates Foo.x10 to Foo.java, then uses javac to generate Foo.class from Foo.java
Foo.java: X10 program translated into Java; a // #line pseudocomment in Foo.java specifies the source line mapping in Foo.x10
X10 Virtual Machine: JVM + J2SE libraries + X10 libraries + X10 Multithreaded Runtime
External DLLs: reached through the X10 extern interface
Outputs: X10 Program Output; X10 Abstract Performance Metrics (event counts, distribution efficiency)
Commands: x10c Foo.x10, x10 Foo.x10
Caveat: this is a prototype implementation with many limitations. Please be patient!

Examples of X10 Compiler Error Messages
1) x10c TutError1.x10
    TutError1.x10:8: Could not find field or local variable "evenSum".
    for (int i = 2 ; i <= n ; i += 2 ) evenSum += i;
                                       ^----^
   Case 1: Error message identifies source file and line number; carets indicate column range
2) x10c TutError2.x10
    x10c: TutError2.x10:4:27:4:27: unexpected token(s) ignored
   Case 2: Error message identifies source file, line number, and column range
3) x10c TutError3.x10
    x10c: C:\vivek\eclipse\workspace\x10\examples\Tutorial\TutError3.java:49: local variable n is accessed from within inner class; needs to be declared final
   Case 3: Error message reported by Java compiler; look for the #line comment in the .java file to identify the X10 source location

Future X10 Environment
[Figure: layered software stack]
Layers, top to bottom: Very High Level Languages (VHLLs) and Domain Specific Languages (DSLs); X10 High Level Language; X10 Deployment; HPC Runtime Environment (Parallel Environment, MPI, LAPI, …); HPC Parallel System
Additional label: X10 Libraries
Annotations on the layers:
  Implicit parallelism, implicit data distributions
  X10 places: abstraction of explicit control & data distribution
  Primitive constructs for parallelism, communication, and synchronization
  Mapping of places to nodes in HPC Parallel Environment
  Target system for parallel application

Future X10 Environment: Targeting Scalable HPC Parallel Systems
[Figure: cluster deployment. Labels: Front-end Nodes, File Servers, Console interconnect, Functional Gigabit Ethernet, Pset 0 through Pset 1023, I/O Node 0 through I/O Node 1023, C-Node 0 through C-Node 63 in each Pset with a Thin X10 VM on each C-Node, Thick X10 VMs, Full X10 VM.]

Future X10 Environment: Targeting Scalable HPC Parallel Systems
[Figure: the same cluster deployment diagram as on the previous slide, extended with the node-level hierarchy. Memory labels: Memory, L3 Cache, L2 Cache, Proc Clusters of PEs with L1 caches.]
Levels of parallelism shown: Clusters (scale-out), SMP, multiple cores on a chip, coprocessors (SPUs), SMTs, SIMD, ILP

X10 vs. Java
X10 is an extended subset of Java
  Base language = Java 1.4
  Java 5 features (generics, metadata, etc.) are currently not supported in X10
Notable features removed from Java
  Concurrency: threads, synchronized, etc.
  Java arrays: replaced by X10 arrays
Notable features added to Java
  Concurrency: async, finish, atomic, future, force, foreach, ateach, clocks
  Distribution: points, distributions
  X10 arrays: multidimensional distributed arrays, array reductions, array initializers
  Serial constructs: nullable, const, extern, value types
X10 supports both OO and non-OO programming paradigms

x10.lang standard library
Java package with built-in classes that provide support for selected X10 constructs
  Standard types: boolean, byte, char, double, float, int, long, short, String
  x10.lang.Object: root class for all instances of X10 objects
  x10.lang.clock: clock instances & clock operations
  x10.lang.dist: distribution instances & distribution operations
  x10.lang.place: place instances & place operations
  x10.lang.point: point instances & point operations
  x10.lang.region: region instances & region operations
All X10 programs implicitly import the x10.lang.* package, so the x10.lang prefix can be omitted when referring to members of x10.lang.* classes
  e.g., place.MAX_PLACES, dist.factory.block([0:100,0:100]), …
Similarly, all X10 programs also implicitly import the java.lang.* package
  e.g., X10 programs can use Math.min() and Math.max() from java.lang

Calling foreign functions from X10 programs
Java methods
  Can be called directly from X10 programs
  Java class will be loaded automatically as part of X10 program execution
  Basic rule: don't call any method that can perform wait/notify or related thread operations
  Calling synchronized methods is okay
C functions
  Need to use an extern declaration in X10, and perform a System.loadLibrary() call

Resources available in current X10 installation
Readme.txt: basic information on X10 installation and usage
Limitations.txt: list of known limitations in the current X10 implementation
etc/standard.cfg: default configuration information
examples/: root directory for a number of working X10 example programs
  examples/Constructs shows usage of different X10 constructs
  examples/Tutorial contains examples used in this tutorial

X10 Programming Model (Single Place)
[Figure: Place 0 holds a set of Activities, each with an Activity Stack (S), together with the Shared Heap (H) and Immutable Data (I). Caption: Locally Synchronous (coherent access to intra-place shared heap).]
Storage classes
  Immutable Data (I): final variables, value type instances
  Shared Heap (H)
  Activity Stacks (S)
Activity = lightweight thread
Main program starts as single activity in Place 0
Permitted object references (pointers): I -> H, H -> I, I -> I, H -> H, S -> H, S -> I
Prohibited references: H -> S, I -> S, S -> S
  No data sharing permitted between parent activity's stack and child activity's stack
Single Place Memory model
  No coherence constraints needed for I and S storage classes
  Guaranteed coherence for H storage class: all writes to same shared location are observed in same order by all activities
Largest deployment granularity for a single place is a single SMP

Basic X10 (Single Place)
Core constructs used for intra-place (shared memory) parallel programming:
  async = construct used to execute a statement in parallel as a new activity
  finish = construct used to check for global termination of a statement and all the activities that it has created
  atomic = construct used to coordinate accesses to shared heap by multiple activities
  future = construct used to evaluate an expression in parallel as a new activity
  force = construct used to check for termination of a future

async
async S
  Parent activity creates a new child activity to execute S in the same place as the parent activity
  An async statement returns immediately: parent execution proceeds immediately to the next statement
  Any access to the parent's local data must be through final variables
    Similar to data access rules for inner classes in Java
Example (fragment; continued on the next slide):
    public class TutAsync {
      const boxedInt oddSum=new boxedInt();
      const boxedInt evenSum=new boxedInt();
      public static void main(String[] args) {
        final int n = 100;   // n must be declared final: its value is passed from parent to child activity
        async for (int i=1 ; i<=n ; i+=2 ) oddSum.val += i;   // async statement
        for (int j=2 ; j<=n ; j+=2 ) evenSum.val += j;
        ...

finish
finish S
  Execute S as usual, but wait until all activities spawned (transitively) by S have terminated before completing the execution of finish S
  finish traps all exceptions thrown by activities spawned by S, and throws a wrapping exception after S has terminated.
Example (see TutAsync.x10):
        ...
        finish {   // finish statement
          async for (int i=1 ; i<=n ; i+=2 ) oddSum.val += i;
          for (int j=2 ; j<=n ; j+=2 ) evenSum.val += j;
        }
        // Both oddSum and evenSum have been computed now
        System.out.println("oddSum = " + oddSum.val + " ; evenSum = " + evenSum.val);
      } // main()
    } // TutAsync
Console output:
    oddSum = 2500 ; evenSum = 2550
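For readers who know Java threads better than X10, the async/finish pattern above can be approximated as follows. This is only a plain-Java analogy, not X10 code and not how the X10 runtime is implemented: `Thread.start()` stands in for `async`, `Thread.join()` stands in for the end of the `finish` block, and the names `FinishAsyncSketch`, `BoxedInt`, and `sums` are hypothetical. A real `finish` also waits for transitively spawned activities, which a single `join()` does not.

```java
// Java analogy only: X10 itself replaces Java threads with async/finish.
final class FinishAsyncSketch {
    // Mutable holder, standing in for the tutorial's boxedInt helper class.
    static final class BoxedInt { int val; }

    // Returns {oddSum, evenSum} for the integers 1..n.
    static int[] sums(final int n) {
        final BoxedInt oddSum = new BoxedInt();
        final BoxedInt evenSum = new BoxedInt();
        // Plays the role of "async": the child sums the odd numbers in parallel.
        Thread child = new Thread(() -> {
            for (int i = 1; i <= n; i += 2) oddSum.val += i;
        });
        child.start();
        // The parent continues immediately with the even numbers.
        for (int j = 2; j <= n; j += 2) evenSum.val += j;
        // Plays the role of the end of "finish": wait for the child to terminate.
        try {
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return new int[] { oddSum.val, evenSum.val };
    }

    public static void main(String[] args) {
        int[] s = sums(100);
        System.out.println("oddSum = " + s[0] + " ; evenSum = " + s[1]);
    }
}
```

Reading `oddSum.val` before the `join()` would be a data race, which is the same reason the X10 version prints only after the `finish` block.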

Atomic statements & methods
atomic S (statement), atomic method declaration
  An atomic statement/method is conceptually executed in a single step, while other activities are suspended
  Note: programmer does not manage any locks explicitly
  An atomic section may not include
    Blocking operations
    Creation of activities
Example (see TutAtomic1.x10):
    finish {
      async for (int i=1 ; i<=n ; i+=2 ) {
        double r = 1.0d / i ; atomic rSum += r;
      }
      for (int j=2 ; j<=n ; j+=2 ) {
        double r = 1.0d / j ; atomic rSum += r;
      }
    }
    System.out.println("rSum = " + rSum);
Console output:
    rSum = 5.187377517639618

Another Example (TutAtomic2.x10)
    public class TutAtomic2 {
      const boxedInt a = new boxedInt(100);
      const boxedInt b = new boxedInt(100);
      public static atomic void incr_a() { a.val++ ; b.val-- ; }
      public static atomic void decr_a() { a.val-- ; b.val++ ; }
      public static void main(String args[]) {
        int sum;
        finish {
          async for (int i=1 ; i<=10 ; i++ ) incr_a();
          for (int i=1 ; i<=10 ; i++ ) decr_a();
        }
        atomic sum = a.val + b.val;
        System.out.println("a+b = " + sum);
      } // main()
    } // TutAtomic2
Console output:
    a+b = 200

Future & Force
future<T> F = future { E }
  Parent activity creates a new asynchronous child activity to evaluate expression E
value = F.force()
  Caller blocks until return value is obtained from future (and all activities spawned transitively by E have terminated)
Example (see TutFuture2.x10):
    // Note that future<int> and int are different types
    future<int> Fi = future { fib(10) } ;
    int i = Fi.force();
    // Nested future types can also be created (if need be)
    future<future<int>> FFj = future { future { fib(100) } };
    future<int> Fj = FFj.force();
    int j = Fj.force();
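The future/force pairing is close in spirit to `java.util.concurrent.CompletableFuture`, so a small Java sketch may help make the two-step pattern concrete. This is an analogy under stated assumptions, not X10 semantics: `supplyAsync` stands in for `future { E }`, `join()` stands in for `force()`, and the class and method names (`FutureForceSketch`, `forceOne`, `forceNested`) are hypothetical. The nested case uses a small argument so the naive recursive `fib` finishes quickly.

```java
import java.util.concurrent.CompletableFuture;

final class FutureForceSketch {
    // Plain sequential Fibonacci, with fib(0) = 0 and fib(1) = 1.
    static int fib(int n) {
        if (n <= 0) return 0;
        if (n == 1) return 1;
        return fib(n - 1) + fib(n - 2);
    }

    // Analogue of: future<int> F = future { fib(n) }; int v = F.force();
    static int forceOne(final int n) {
        CompletableFuture<Integer> f = CompletableFuture.supplyAsync(() -> fib(n));
        return f.join(); // blocks until the value is available
    }

    // Analogue of the nested future<future<int>> example: force twice.
    static int forceNested(final int n) {
        CompletableFuture<CompletableFuture<Integer>> ff =
            CompletableFuture.supplyAsync(() -> CompletableFuture.supplyAsync(() -> fib(n)));
        CompletableFuture<Integer> f = ff.join(); // first force yields the inner future
        return f.join();                          // second force yields the int
    }

    public static void main(String[] args) {
        System.out.println("fib(10) = " + forceOne(10));
        System.out.println("fib(20) = " + forceNested(20));
    }
}
```

As in X10, the future and its value are different types here (`CompletableFuture<Integer>` versus `int`), so the forcing step is always explicit.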

Example (TutFuture1.x10)
    public class TutFuture1 {
      static int fib(final int n) {
        if ( n <= 0 ) return 0;
        else if ( n == 1 ) return 1;
        else {
          future<int> fn_1 = future { fib(n-1) };
          future<int> fn_2 = future { fib(n-2) };
          return fn_1.force() + fn_2.force();
        }
      } // fib()
      public static void main(String[] args) {
        System.out.println("fib(10) = " + fib(10));
      } // main()
    } // TutFuture1
Example of recursive divide-and-conquer parallelism: calls to fib(n-1) and fib(n-2) execute in parallel

Parallel Programming Pitfalls: Deadlock
Deadlock occurs when parallel threads/activities acquire locks or perform other blocking operations in a sequence that creates a dependence cycle
Java example:
    Thread 0: synchronized (Foo.a) { synchronized (Foo.b) { … } }
    Thread 1: synchronized (Foo.b) { synchronized (Foo.a) { … } }
MPI example:
    Process 0: MPI_Recv(recvbuf, count, MPI_REAL, 1, tag, …)
    Process 1: MPI_Recv(recvbuf, count, MPI_REAL, 0, tag, …)

Parallel Programming Pitfalls: Deadlock (contd.)
X10 guarantee
  Any program written with async, finish, atomic, foreach, ateach, and clock parallel constructs will never deadlock
Unrestricted use of future and force may lead to deadlock (see examples/Constructs/Future/FutureDeadlock_MustFailTimeout.x10):
    f1 = future { a1() } ; f2 = future { a2() };
    int a1() { … f2.force(); … }
    int a2() { … f1.force(); … }
Restricted use of future and force in X10 can preserve guaranteed freedom from deadlocks
  Sufficient condition #1: ensure that the activity that creates the future also performs the force() operation
  Sufficient condition #2:...

Parallel Programming Pitfalls: Data Races
A data race occurs when two (or more) threads/activities can access the same shared location in parallel such that one of the accesses is a write operation
Java example:
    Thread 0: a++ ; b-- ;
    Thread 1: a++ ; b-- ;
  Data race can violate the invariant that (a+b) is constant
  Data race may also prevent multiple increments from being combined correctly
X10 guidelines for avoiding data races
  Use atomic methods and blocks without worrying about deadlock
  Declare data to be read-only (i.e., final or value type instance) whenever possible
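To see what the guideline buys in practice, here is a minimal Java sketch of the same a/b invariant with the updates protected. It is an illustration, not the tutorial's code: Java `synchronized` methods play the role that X10 `atomic` methods play in TutAtomic2.x10, and the names `AtomicInvariantSketch`, `incrA`, `decrA`, and `run` are hypothetical. Unlike X10's `atomic`, `synchronized` relies on an explicit lock object, which is exactly the bookkeeping X10 hides from the programmer.

```java
final class AtomicInvariantSketch {
    private int a = 100;
    private int b = 100;

    // Each method updates both fields as one indivisible step,
    // so no other thread can observe a state where a + b != 200.
    synchronized void incrA() { a++; b--; }
    synchronized void decrA() { a--; b++; }
    synchronized int sum() { return a + b; }
    synchronized int a() { return a; }

    // Runs `iterations` increments in a child thread and as many decrements in the caller.
    static AtomicInvariantSketch run(final int iterations) {
        final AtomicInvariantSketch s = new AtomicInvariantSketch();
        Thread child = new Thread(() -> {
            for (int i = 0; i < iterations; i++) s.incrA();
        });
        child.start();
        for (int i = 0; i < iterations; i++) s.decrA();
        try {
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return s;
    }

    public static void main(String[] args) {
        AtomicInvariantSketch s = run(100000);
        System.out.println("a = " + s.a() + " ; a+b = " + s.sum());
    }
}
```

Removing the `synchronized` keywords reintroduces the race described above: increments can be lost, and the final value of `a` may then differ from 100.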

Points
A point is an element of an n-dimensional Cartesian space (n>=1) with integer-valued coordinates, e.g., [5], [1, 2], …
  Dimensions are numbered from 0 to n-1
  n is also referred to as the rank of the point
A point variable can hold values of different ranks, e.g., point p; p = [1]; … p = [2,3]; …
The following operations are defined on a point-valued expression p1
  p1.rank: returns rank of point p1
  p1.get(i): returns element i of point p1
    Returns element (i mod p1.rank) if i < 0 or i >= p1.rank
  p1.lt(p2), p1.le(p2), p1.gt(p2), p1.ge(p2)
    Returns true iff p1 is lexicographically <, <=, >, or >= p2
    Only defined when p1.rank and p2.rank are equal

Example (see TutPoint.x10)
    public class TutPoint {
      public static void main(String[] args) {
        point p1 = [1,2,3,4,5];
        point p2 = [1,2];
        point p3 = [2,1];
        System.out.println("p1 = " + p1 + " ; p1.rank = " + p1.rank + " ; p1.get(2) = " + p1.get(2));
        System.out.println("p2 = " + p2 + " ; p3 = " + p3 + " ; p2.lt(p3) = " + p2.lt(p3));
      } // main()
    } // TutPoint
Console output:
    p1 = [1,2,3,4,5] ; p1.rank = 5 ; p1.get(2) = 3
    p2 = [1,2] ; p3 = [2,1] ; p2.lt(p3) = true
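The lexicographic comparisons (lt, le, gt, ge) can be spelled out in a few lines of plain Java, which may help when reasoning about why [1,2] is less than [2,1]. This sketch represents a point as an `int[]`; the class `PointSketch` and its methods are hypothetical stand-ins, not part of x10.lang.

```java
final class PointSketch {
    // Lexicographic comparison: negative if p1 < p2, zero if equal, positive if p1 > p2.
    // Like the X10 operations, it is only defined for points of equal rank.
    static int compare(int[] p1, int[] p2) {
        if (p1.length != p2.length) {
            throw new IllegalArgumentException("ranks differ");
        }
        for (int d = 0; d < p1.length; d++) {
            // The first dimension in which the points differ decides the order.
            if (p1[d] != p2[d]) return Integer.compare(p1[d], p2[d]);
        }
        return 0;
    }

    static boolean lt(int[] p1, int[] p2) { return compare(p1, p2) < 0; }
    static boolean le(int[] p1, int[] p2) { return compare(p1, p2) <= 0; }
    static boolean gt(int[] p1, int[] p2) { return compare(p1, p2) > 0; }
    static boolean ge(int[] p1, int[] p2) { return compare(p1, p2) >= 0; }

    public static void main(String[] args) {
        int[] p2 = {1, 2};
        int[] p3 = {2, 1};
        System.out.println("p2.lt(p3) = " + lt(p2, p3));
    }
}
```

Dimension 0 is compared first, so [1,2] precedes [2,1] even though its second coordinate is larger.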

Rectangular Regions
A rectangular region is the set of points contained in a rectangular subspace
A region variable can hold values of different ranks, e.g., region R; R = [0:10]; … R = [-100:100, -100:100]; … R = [0:-1]; …
The following operations are defined on a region-valued expression R
  R.rank = # dimensions in region; R.size() = # points in region
  R.contains(P) = true if region R contains point P
  R.contains(S) = true if region R contains region S
  R.equal(S) = true if region R equals region S
  R.rank(i) = projection of region R on dimension i (a one-dimensional region)
  R.rank(i).low() = lower bound of i-th dimension of region R
  R.rank(i).high() = upper bound of i-th dimension of region R
  R.ordinal(P) = ordinal value of point P in region R
  R.coord(N) = point in region R with ordinal value = N
  R1 && R2 = region intersection (will be rectangular if R1 and R2 are rectangular)
  R1 || R2 = union of regions R1 and R2 (may not be rectangular)
  R1 – R2 = region difference (may not be rectangular)

Example (see TutRegion.x10)
    public class TutRegion {
      public static void main(String[] args) {
        region R1 = [1:10, -100:100];
        System.out.println("R1 = " + R1 + " ; R1.rank = " + R1.rank + " ; R1.size() = " + R1.size() + " ; R1.ordinal([10,100]) = " + R1.ordinal([10,100]));
        region R2 = [1:10,90:100];
        System.out.println("R2 = " + R2 + " ; R1.contains(R2) = " + R1.contains(R2) + " ; R2.rank(1).low() = " + R2.rank(1).low() + " ; R2.coord(0) = " + R2.coord(0));
      } // main()
    } // TutRegion
Console output:
    R1 = {1:10,-100:100} ; R1.rank = 2 ; R1.size() = 2010 ; R1.ordinal([10,100]) = 2009
    R2 = {1:10,90:100} ; R1.contains(R2) = true ; R2.rank(1).low() = 90 ; R2.coord(0) = [1,90]
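The size, ordinal, and coord values in the console output above can be reproduced by hand with row-major arithmetic. The Java sketch below does that for rectangular regions only. It assumes that ordinals follow the canonical lexicographic (row-major) order, which matches the printed value 2009 but is inferred rather than quoted from the language spec; `RegionSketch` and its fields are hypothetical names.

```java
final class RegionSketch {
    final int[] low;   // inclusive lower bound per dimension
    final int[] high;  // inclusive upper bound per dimension

    RegionSketch(int[] low, int[] high) {
        if (low.length != high.length) throw new IllegalArgumentException("rank mismatch");
        this.low = low.clone();
        this.high = high.clone();
    }

    // Number of index values in dimension d (0 for an empty range such as [0:-1]).
    int extent(int d) { return Math.max(0, high[d] - low[d] + 1); }

    // Number of points: the product of the per-dimension extents.
    int size() {
        int n = 1;
        for (int d = 0; d < low.length; d++) n *= extent(d);
        return n;
    }

    // Position of point p in lexicographic order (last dimension varies fastest).
    int ordinal(int[] p) {
        int ord = 0;
        for (int d = 0; d < low.length; d++) ord = ord * extent(d) + (p[d] - low[d]);
        return ord;
    }

    // Inverse of ordinal: the point whose ordinal is n (region must be non-empty).
    int[] coord(int n) {
        int[] p = new int[low.length];
        for (int d = low.length - 1; d >= 0; d--) {
            p[d] = low[d] + n % extent(d);
            n /= extent(d);
        }
        return p;
    }

    public static void main(String[] args) {
        RegionSketch r1 = new RegionSketch(new int[]{1, -100}, new int[]{10, 100});
        System.out.println("size = " + r1.size() + " ; ordinal([10,100]) = " + r1.ordinal(new int[]{10, 100}));
    }
}
```

For R1 = [1:10, -100:100] the extents are 10 and 201, so the size is 2010 and the last point [10,100] has ordinal 9 * 201 + 200 = 2009.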

X10 Arrays
Java arrays are one-dimensional and local, e.g., array args in main(String[] args)
  Multi-dimensional arrays are represented as arrays of arrays in Java
X10 has true multi-dimensional arrays (as in C, Fortran) that can be distributed (as in UPC, Co-Array Fortran, ZPL, Chapel, etc.)
Array declaration
  T [.] A declares an X10 array with element type T
  An array variable can hold values of different rank
  The [.] syntax is used to avoid confusion with Java arrays
Array creation
  new T [ R ] creates a local rectangular X10 array with rectangular region R as the index domain and T as the element (range) type
    e.g., int[.] A = new int[ [0:N+1, 0:N+1] ];
  Array initializers can also be specified in conjunction with creation (see TutArray1.x10)
    e.g., int[.] A = new int[ [1:10,1:10] ] (point[i,j]) { return i+j; } ;

X10 Array Operations
The following operations are defined on array-valued expressions
  A.rank = # dimensions in array
  A.region = index region (domain) of array
  A[P] = element at point P, where P belongs to A.region
  A | R = restriction of array onto region R
    Useful for extracting subarrays
  A.sum(), A.max() = sum/max of elements in array
  A1 op A2 returns the result of applying a pointwise op on array elements, when A1.region = A2.region
    op can include +, -, *, and /
  A1 || A2 = disjoint union of arrays A1 and A2 (A1.region and A2.region must be disjoint)
  A1.overlay(A2)
    Returns an array with region A1.region || A2.region, with element value A2[P] for all points P in A2.region and A1[P] otherwise
  A.distribution = distribution of array A
    Will be discussed later when we introduce X10 places

Example (see TutArray1.x10)
    public class TutArray1 {
      public static void main(String[] args) {
        int[.] A = new int[ [1:10,1:10] ] (point [i,j]) { return i+j;} ;
        System.out.println("A.rank = " + A.rank + " ; A.region = " + A.region);
        int[.] B = A | [1:5,1:5];
        System.out.println("B.max() = " + B.max());
      } // main()
    } // TutArray1
Console output:
    A.rank = 2 ; A.region = {1:10,1:10}
    B.max() = 10

Pointwise for loop
X10 extends Java's for loop to support sequential iteration over points in region R in canonical lexicographic order
    for ( point p : R )...
Standard point operations can be used to extract individual index values from point p
    for ( point p : R ) { int i = p.get(0); int j = p.get(1);... }
Or an exploded syntax can be used instead of explicitly declaring a point variable
    for ( point [i,j] : R ) {... }
The exploded syntax declares the constituent variables (i, j, …) as local int variables in the scope of the for loop body

Example (see TutFor.x10)
    public class TutFor {
      public static void main(String[] args) {
        region R = [0:1,0:2];
        System.out.print("Points in region " + R + " =");
        for ( point p : R ) System.out.print(" " + p);
        System.out.println();
        // Use exploded syntax instead
        System.out.print("(i,j) pairs in region " + R + " =");
        for ( point[i,j] : R ) System.out.print("(" + i + "," + j + ")");
        System.out.println();
      } // main()
    } // TutFor
Console output:
    Points in region {0:1,0:2} = [0,0] [0,1] [0,2] [1,0] [1,1] [1,2]
    (i,j) pairs in region {0:1,0:2} =(0,0)(0,1)(0,2)(1,0)(1,1)(1,2)

foreach loop (Parallel iteration)
The X10 foreach loop is similar to the pointwise for loop, except that each iteration executes in parallel as a new asynchronous activity
  i.e., foreach ( point p : R ) S is equivalent to for ( point p : R ) async S
As before, finish can be used to wait for termination of all foreach iterations
    finish foreach ( point[i,j] : [0:M-1,0:N-1] )...
Special case: use foreach to create a single-dimensional parallel loop
    foreach ( point[i] : [0 : N-1] ) S
Allowing a single foreach construct to span multiple dimensions makes it convenient to write parallel matrix code that is independent of the underlying rank and region
    e.g., foreach ( point p : A.region ) A[p] = f(B[p], C[p], D[p]) ;
Multiple foreach instances may access shared data in the same place: use finish, atomic, and force to avoid data races

Example (see TutForeach1.x10)
    public class TutForeach1 {
      public static void main(String[] args) {
        final int N = 5;
        int[.] A = new int[[1:N,1:N]] (point[i,j]) {return i+j;};
        // For the A[i,j] = F(A[i,j]) case,
        // both loops can execute in parallel
        finish foreach ( point[i,j] : A.region ) A[i,j] = A[i,j] + 1;
        // For the A[i,j] = F(A[i,j-1]) case,
        // only the outer loop can execute in parallel
        finish foreach ( point[i] : A.region.rank(0) )
          for (point[j]: [(A.region.rank(1).low()+1):A.region.rank(1).high()])
            A[i,j] = A[i,j-1] + 1;
        ...
NOTE: A.region.rank(0) is the same as [1:N]

Example contd. (see TutForeach1.x10)
        // For the A[i,j] = F(A[i-1,j]) case,
        // only the inner loop can execute in parallel
        for (point[i]: [(A.region.rank(0).low()+1):A.region.rank(0).high()] )
          finish foreach ( point[j] : A.region.rank(1) )
            A[i,j] = A[i-1,j] + 1;
        // For the A[i,j] = F(A[i-1,j],A[i,j-1]) case,
        // use loop skewing to execute the inner loop in parallel
        for ( point[t] : [4:2*N]) {
          finish foreach ( point[j] : [Math.max(2,t-N):Math.min(N,t-2)]) {
            int i = t - j;
            System.out.print("(" + i + "," + j + ")");
            A[i,j] = A[i-1,j] + A[i,j-1] + 1;
          }
          System.out.println();
        }
        ...
Console output (the order of pairs within a line can vary, because the inner iterations run in parallel):
    (2,2)
    (3,2)(2,3)
    (4,2)(3,3)(2,4)
    (5,2)(3,4)(4,3)(2,5)
    (5,3)(4,4)(3,5)
    (5,4)(4,5)
    (5,5)
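Loop skewing is easier to trust once the skewed traversal is checked against the ordinary row-by-row one. The Java sketch below runs both versions sequentially on the same initial array and lets the caller compare the results. It is an illustration only: the inner `j` loop is written as a plain `for` loop here, whereas the X10 version runs it as a `finish foreach`, and the names `SkewSketch`, `rowMajor`, and `skewed` are hypothetical.

```java
final class SkewSketch {
    // 1-based (n+1) x (n+1) array with a[i][j] = i + j, mirroring the X10 initializer.
    static int[][] init(int n) {
        int[][] a = new int[n + 1][n + 1];
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                a[i][j] = i + j;
        return a;
    }

    // Reference order: plain nested loops, respecting both dependences.
    static int[][] rowMajor(int n) {
        int[][] a = init(n);
        for (int i = 2; i <= n; i++)
            for (int j = 2; j <= n; j++)
                a[i][j] = a[i - 1][j] + a[i][j - 1] + 1;
        return a;
    }

    // Skewed order: iterate over anti-diagonals t = i + j.
    // All cells on one diagonal depend only on the previous diagonal,
    // so the inner loop has no loop-carried dependence.
    static int[][] skewed(int n) {
        int[][] a = init(n);
        for (int t = 4; t <= 2 * n; t++) {
            for (int j = Math.max(2, t - n); j <= Math.min(n, t - 2); j++) {
                int i = t - j;
                a[i][j] = a[i - 1][j] + a[i][j - 1] + 1;
            }
        }
        return a;
    }

    public static void main(String[] args) {
        boolean same = java.util.Arrays.deepEquals(rowMajor(5), skewed(5));
        System.out.println("skewed traversal matches row-major traversal: " + same);
    }
}
```

The bounds `max(2, t-N)` and `min(N, t-2)` simply keep both `i = t - j` and `j` inside the range 2..N.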

Limitations of using a Single Place
[Figure: Place 0 holds a set of Activities, each with an Activity Stack (S), together with the Shared Heap (H) and Immutable Data (I). Caption: Locally Synchronous (coherent access to intra-place shared heap).]
Storage classes
  Immutable Data (I): final variables, value type instances
  Shared Heap (H)
  Activity Stacks (S)
Largest deployment granularity for a single place is a single SMP
  Smallest granularity can be a single CPU or even a single hardware thread
Single SMP is inadequate for solving problems with large memory and compute requirements
X10 solution: incorporate multiple places as a core foundation of the X10 programming model
  Enable deployment on large-scale clustered machines, with integrated support for intra-place parallelism

Scalable X10: using multiple places
[Figure: Place 0 through Place (MAX_PLACES - 1). Each place holds Activities with Activity Stacks (S) and a Local Heap (LH), and exchanges outbound and inbound activities and activity replies with other places. Captions: Locally Synchronous (coherent access to intra-place shared heap); Globally Asynchronous; Partitioned Global Address Space (PGAS).]
Place = collection of activities & objects
  Activities and data objects do not move after being created
  Scalar object, O: maps to a single place specified by O.location
  Array object, A: may be local to a place or distributed across multiple places, as specified by A.distribution
Storage classes
  Immutable Data (I): final variables, value type instances
  PGAS
    Local Heap (LH)
    Remote Heap (RH)
  Activity Stacks (S)

Locality Rule
Any access to a mutable (shared heap) datum must be performed by an activity located at the same place as the datum
  The prohibited references are similar to before: LH/RH -> S, I -> S, S -> S
  Local-to-remote (LH -> RH) and remote-to-local (RH -> LH) heap references are freely permitted
  However, direct access via a remote heap reference is not permitted!
  Inter-place data accesses can only be performed by creating remote activities (with weaker ordering guarantees than intra-place data accesses)
The locality rule is currently not checked by default. Instead, the user can perform the check explicitly by inserting a place cast operator as follows:
  (@ P) E checks if expression E can be evaluated at place P
  If so, expression E is evaluated as usual
  If not, a BadPlaceException is thrown

Activity Execution within a Place
[Figure: a place with inbound activities, inbound replies, outbound activities, and outbound replies. Inside the place: Ready Activities, Executing Activities, Blocked Activities (labels: Clock, Future), and Completed Activities.]
Atomic sections do not have blocking semantics
A place-local activity can only access its stack (S), the place-local heap (LH), or immutable data (I)

Places
place.MAX_PLACES = total number of places
  Default value is 4
  Can be changed by using the -NUMBER_OF_LOCAL_PLACES option in the x10 command
place.places = set of all places in an X10 program (see java.util.Set)
place.factory.place(i) = place corresponding to index i
here = place in which current activity is executing
For a place P:
  P.toString() returns a string of the form place(id=99)
  P.id returns the id of the place
[Figure: X10 Data Structures map to X10 Places, which map to System Nodes]
  X10 language defines the mapping from X10 objects to X10 places, and abstract performance metrics on places
  Future X10 deployment system will define the mapping from X10 places to system nodes; not supported in current implementation

Extension of async and future to places
async (P) S
  Creates new activity to execute statement S at place P
  async S is equivalent to async (here) S
future (P) { E }
  Creates new activity to evaluate expression E at place P
  future { E } is equivalent to future (here) { E }
Note that here in a child activity for an async/future computation will refer to the place P at which the child activity is executing, not the place where the parent activity is executing
The goal is to specify the destination place for async/future activities so as to obey the Locality Rule, e.g.,
    async (O.location) O.x = 1;
    future<T> F = future (A.distribution[i]) { A[i] } ;   // T = element type of A

Distribution = mapping from region to places
Creating distributions (x10.lang.dist):
    dist D1 = R-> here;                // local distribution: maps region R to here
    dist D2 = dist.factory.block(R);   // blocked distribution
    dist D3 = dist.factory.cyclic(R);  // cyclic distribution
    dist D4 = dist.factory.unique();   // identity map on [0:MAX_PLACES-1]
Using distributions
  D[P] = place to which point P is mapped by distribution D (assuming that P is in D.region)
  Allocate a distributed array, e.g., T[.] A = new T[ D ];
    Allocates an array with index set = D.region, such that element A[P] is located at place D[P] for each point P in D.region
    NOTE: new T[R] for region R is equivalent to new T[R->here]
  Iterating over a distribution: generalization of foreach to ateach
    ateach is discussed in more detail later

Operations defined on distributions
D.region = source region of distribution
D.rank = rank of D.region
D | R = region restriction for distribution D and region R (returns a restricted distribution)
D | P = place restriction for distribution D and place P (returns region mapped by D to place P)
D1 || D2 = union of distributions D1 and D2 (assumes that D1.region and D2.region are disjoint)
D1.overlay(D2) = overlay of D2 over D1 (asymmetric union)
D.contains(p) = true iff D.region contains point p
D = R -> P, constant distribution which maps entire region R to place P
D1 – D2 = distribution difference = D1 | (D1.region – D2.region)
D.distributionEfficiency() = load balance efficiency of distribution D
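A distribution is just a function from points to places, and the block and cyclic cases are short enough to write out. The Java sketch below does so for a one-dimensional region [0:n-1]; it is a model for intuition, not the x10.lang.dist implementation. Two things are assumptions: the block rule used here (contiguous chunks of ceil(n / places) indices) may differ from how the X10 runtime splits regions that do not divide evenly, and the names `DistSketch`, `blockPlace`, `cyclicPlace`, and `loads` are hypothetical. The `loads` method computes the size of "D | P" for every place P.

```java
final class DistSketch {
    // Cyclic distribution: index i of [0:n-1] is mapped to place (i mod places).
    static int cyclicPlace(int i, int places) {
        return i % places;
    }

    // Block distribution (assumed rule): contiguous chunks of ceil(n / places) indices.
    static int blockPlace(int i, int n, int places) {
        int chunk = (n + places - 1) / places;
        return i / chunk;
    }

    // Number of indices mapped to each place, i.e. the size of (D | P) for every P.
    static int[] loads(int n, int places, boolean cyclic) {
        int[] count = new int[places];
        for (int i = 0; i < n; i++) {
            int p = cyclic ? cyclicPlace(i, places) : blockPlace(i, n, places);
            count[p]++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("block : " + java.util.Arrays.toString(loads(10, 4, false)));
        System.out.println("cyclic: " + java.util.Arrays.toString(loads(10, 4, true)));
    }
}
```

With 10 indices and 4 places, the two rules assign different per-place counts, which is the kind of difference D.distributionEfficiency() is meant to expose.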

Inter-place communication using async and future
Question: how to assign A[i] = B[j], when A[i] and B[j] may be in different places?
Answer #1: use nested asyncs!
    finish async ( B.distribution[j] ) {
      final int bb = B[j];
      async ( A.distribution[i] ) A[i] = bb;
    }
Answer #2: use future-force and an async!
    final int b = future (B.distribution[j]) { B[j] }.force();
    finish async ( A.distribution[i] ) A[i] = b;

Load Balance Efficiency
Consider a parallel application that is executed on P places
Let T(i) = computation load mapped to place i
  For distribution D, T(i) = (D | place.factory.place(i)).size()
Let Tmax = max { T(i) | 0 <= i <= P-1 }
Let E = SUM { T(i) | 0 <= i <= P-1 } / (Tmax * P)
E is the load balance efficiency, 1/P <= E <= 1
  E = 1 is the best case: computation load is perfectly balanced
  E = 1/P is the worst case: computation load is placed on a single processor/place
Load balance efficiency is one of the key factors that limit speedup on a parallel machine
  There are several other factors, e.g., communication & synchronization overhead
  Ignoring other factors, we expect speedup to be <= E * P
NOTE: also try x10 -DUMP_STATS_ON_EXIT=true … to see activity and atomic counts
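A worked calculation makes the bounds on E concrete. The Java sketch below evaluates E = SUM T(i) / (Tmax * P) for a given array of per-place loads; the class `LoadBalanceSketch` and the method `efficiency` are hypothetical names, and the sample loads are made up for illustration rather than taken from a real run.

```java
final class LoadBalanceSketch {
    // E = (sum of per-place loads) / (max load * number of places).
    static double efficiency(int[] loads) {
        if (loads.length == 0) throw new IllegalArgumentException("no places");
        long sum = 0;
        int max = 0;
        for (int t : loads) {
            sum += t;
            max = Math.max(max, t);
        }
        if (max == 0) throw new IllegalArgumentException("no load on any place");
        return (double) sum / ((double) max * loads.length);
    }

    public static void main(String[] args) {
        // Perfectly balanced, everything on one place, and an uneven split.
        System.out.println(efficiency(new int[]{4, 4, 4, 4}));
        System.out.println(efficiency(new int[]{16, 0, 0, 0}));
        System.out.println(efficiency(new int[]{3, 3, 3, 1}));
    }
}
```

With P = 4, the balanced case gives E = 1 and the single-place case gives E = 1/P = 0.25, so the expected speedup bound E * P drops from 4 to 1.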

ateach loop (distributed parallel iteration)
The X10 ateach loop is similar to the foreach loop, except that each iteration executes in parallel at a place specified by a distribution
  ateach ( point p : D ) S is equivalent to for ( point p : D.region ) async (D[p]) S
As before, finish can be used to wait for termination of all ateach iterations
  finish ateach( point[i] : dist.factory.unique() ) S creates one activity per place, as in an SPMD computation
ateach is a convenient construct for writing parallel matrix code that is independent of the underlying distribution
    e.g., ateach ( point p : A.distribution ) A[p] = f(B[p], C[p], D[p]) ;

Example (see TutAteach1.x10)
    public class TutAteach1 {
      public static void main(String args[]) {
        finish ateach( point[i] : dist.factory.unique() ) {
          System.out.println("Hello from " + i);
        }
      } // main()
    } // TutAteach1
Console output:
    Hello from 1
    Hello from 0
    Hello from 3
    Hello from 2
dist.factory.unique() maps point i in the region [0 : place.MAX_PLACES-1] to place place.factory.place(i)

Example: converting foreach to ateach (see TutAteach2.x10)
foreach version:
    // For the A[i,j] = F(A[i,j]) case,
    // both loops can execute in parallel
    finish foreach ( point[i,j] : A.region ) A[i,j] = A[i,j] + 1;
ateach version #1:
    finish ateach ( point[i,j] : A.distribution) A[i,j] = A[i,j] + 1;
ateach version #2 (create only one activity per place):
    finish ateach ( point p : dist.factory.unique() )
      for ( point[i,j] : A.distribution | here ) A[i,j] = A[i,j] + 1;

Example: converting foreach to ateach, contd. (see TutAteach2.x10)
foreach version:
    // For the A[i,j] = F(A[i,j-1]) case,
    // only the outer loop can execute in parallel
    finish foreach ( point[i] : [1:N] )
      for ( point[j]: [2:N] ) A[i,j] = A[i,j-1] + 1;
ateach version:
    // Assume that N is a multiple of place.MAX_PLACES
    finish ateach ( point[i] : dist.factory.block([1:N]) )
      for ( point[j]: [2:N] ) A[i,j] = A[i,j-1] + 1;

X10 clocks: Motivation
Activity coordination using finish and force() is accomplished by checking for activity termination
However, there are many cases in which a producer-consumer relationship exists among the activities, and a barrier-like coordination is needed without waiting for activity termination
  The activities involved may be in the same place or in different places
[Figure: Activity 0, Activity 1, Activity 2, ... advancing together through Phase 0, Phase 1, ...]

X10 Clocks
clock c = clock.factory.clock();
  Allocate a clock, register current activity with it. Phase 0 of c starts.
async(…) clocked (c1,c2,…) S
ateach(…) clocked (c1,c2,…) S
foreach(…) clocked (c1,c2,…) S
  Create async activities registered on clocks c1, c2, …
c.resume();
  Nonblocking operation that signals completion of work by current activity for this phase of clock c
next;
  Barrier: suspend until all clocks that the current activity is registered with can advance. c.resume() is first performed for each such clock, if needed.
  next can be viewed like a finish of all computations under way in the current phase of the clock

X10 Clocks (contd.)
c.drop();
  Unregister with c. A terminating activity will implicitly drop all clocks that it is registered on.
c.registered()
  Returns true iff current activity is registered on clock c
  c.dropped() returns the opposite of c.registered()
ClockUseException
  Thrown if an activity attempts to transmit or operate on a clock that it is not registered on
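Readers coming from Java may find it helpful to compare clocks with `java.util.concurrent.Phaser`, which also supports dynamic registration and phased barriers. The sketch below is an analogy, not a description of how X10 clocks are implemented: `register()` stands in for `clocked (c)`, `arriveAndAwaitAdvance()` for `next`, and `arriveAndDeregister()` for `c.drop()`. The class `ClockSketch` and the toy two-phase computation (each worker writes a value, then reads its left neighbour's value) are made up for illustration.

```java
import java.util.concurrent.Phaser;

final class ClockSketch {
    // Each of n workers writes a[i] in phase 0, then reads a neighbour's value in phase 1.
    static int[] run(final int n) {
        final int[] a = new int[n];
        final int[] b = new int[n];
        // Like clock.factory.clock(): the creating thread starts out registered.
        final Phaser clock = new Phaser(1);
        Thread[] workers = new Thread[n];
        for (int k = 0; k < n; k++) {
            final int i = k;
            clock.register();                    // like "clocked (c)": register before the child starts
            workers[i] = new Thread(() -> {
                a[i] = i + 1;                    // phase 0: produce
                clock.arriveAndAwaitAdvance();   // like "next": wait for all registered parties
                int left = a[(i + n - 1) % n];   // phase 1: consume a neighbour's phase-0 value
                b[i] = a[i] + left;
                clock.arriveAndDeregister();     // like c.drop()
            });
            workers[i].start();
        }
        // The parent drops the clock so that it does not hold up the children's barrier.
        clock.arriveAndDeregister();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return b;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(4)));
    }
}
```

Without the barrier, a worker could read its neighbour's slot before it was written. The barrier separates the produce phase from the consume phase without waiting for any activity to terminate, which is the motivation given for clocks.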

Example (see TutClock1.x10)
    finish async {
      final clock c = clock.factory.clock();
      foreach (point[i]: [1:N]) clocked (c) {   // example of transmitting clock from parent to child
        while ( true ) {
          int old_A_i = A[i];
          int new_A_i = Math.min(A[i],B[i]);
          if ( i > 1 ) new_A_i = Math.min(new_A_i,B[i-1]);
          if ( i < N ) new_A_i = Math.min(new_A_i,B[i+1]);
          A[i] = new_A_i;
          next;
          int old_B_i = B[i];
          int new_B_i = Math.min(B[i],A[i]);
          if ( i > 1 ) new_B_i = Math.min(new_B_i,A[i-1]);
          if ( i < N ) new_B_i = Math.min(new_B_i,A[i+1]);
          B[i] = new_B_i;
          next;
          if ( old_A_i == new_A_i && old_B_i == new_B_i ) break;
        } // while
      } // foreach
    } // finish async
NOTE: exiting from the while loop terminates the activity for iteration i, and automatically deregisters the activity from the clock

nullable
By default, object references in X10 are not allowed to take on the null value
However, the nullable type constructor can be used to enable certain object references to be set to null, or to compare them with null, e.g.,
    T1 a;
    nullable T2 b;
    a = null; // Not allowed
    b = null; // Allowed
NOTE: const is simply a shorthand for static final

extern
X10 provides a simple mechanism for invoking external functions written in C
Currently, the C function is restricted to arguments with primitive types or references to unsafe X10 arrays
The X10 program must contain an external declaration of the C function as follows
    … static extern char doit(int a, float b) …
and also a statement to ensure that the native DLL, <name>.dll, is loaded
    static { System.loadLibrary("<name>"); }
The X10 compiler then generates a C stub file whose name ends in _x10stub.c
To generate the DLL, the C programmer must compile the C function, including the file jni.h in their C source, and must link with the object file obtained from the generated _x10stub.c file
