Distributed Programming in Scala with APGAS
Philippe Suter, Olivier Tardieu, Josh Milthorpe
IBM Research
Picture by Simon Greig.

2 APGAS - Context
APGAS (Asynchronous Partitioned Global Address Space) is the model for concurrency and distribution in X10.
X10 is a general-purpose language:
– Developed at IBM Research for 10+ years.
– Focus/bias towards distributed HPC tasks.
– JVM and native back-ends (through Java and C++).
– Some X10 apps ran on >50K cores.
See http://x10-lang.org and X10’15 @ PLDI (tomorrow).

3 APGAS in Scala
Goal: expose the concurrent/distributed core of X10 as a library.
– In Java 8 and as a Scala DSL.
This contribution:
– Introduction to programming with APGAS in Scala.
– Illustrated through two benchmarks: K-means clustering and Unbalanced Tree Search (see paper).
– Contrasting the model with Akka (see paper).
– Preliminary experimental scaling results.

4 APGAS Primer
Concurrent tasks run at distributed places. The environment exposes the available places.
    def places : Seq[Place]
    def here : Place
Tasks can be remote or local, and are asynchronous by default.
    def asyncAt(p : Place)(body: =>Unit) : Unit
    def async(body: =>Unit) : Unit

5 APGAS Primer
The termination of tasks is controlled by the finish construct.
    def finish(body: =>Unit) : Unit
finish blocks until all enclosed tasks have completed, including all nested tasks, local or remote.
Distributed termination is challenging; finish is a powerful contribution of APGAS.

6 Hello World
    finish {
      for(p <- places) {
        asyncAt(p) {
          println(s"Hello from $here.")
        }
      }
    }
asyncAt returns immediately. finish completes when all places have completed their task.
    $> …
    Hello from place(0).
    Hello from place(3).
    Hello from place(1).
    Hello from place(2).

7 “Academic” Fibonacci
    def fibonacci(i: Int) : Long = {
      if(i <= 1) i else {
        var a, b = 0L
        finish {
          async { a = fibonacci(i - 2) }
          b = fibonacci(i - 1)
        }
        a + b
      }
    }
Each finish guards a single async, but the recursive invocations enclose many more.
finish completes exactly when the computation of all dependencies is complete.
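To make the blocking behaviour of finish concrete, here is a minimal single-JVM sketch that imitates finish and async with a java.util.concurrent.Phaser. It is not the APGAS library and says nothing about how APGAS implements distributed termination; the object name LocalFinish, the thread pool, and the scope variable are assumptions made for illustration only.

```scala
import java.util.concurrent.{Executors, Phaser}
import scala.util.DynamicVariable

// Hypothetical single-JVM stand-in for finish/async; NOT the APGAS library.
object LocalFinish {
  // Daemon threads, so a script using this object can exit without an explicit shutdown.
  private val pool = Executors.newCachedThreadPool { (r: Runnable) =>
    val t = new Thread(r)
    t.setDaemon(true)
    t
  }

  // The innermost finish scope enclosing the currently running code, if any.
  private val scope = new DynamicVariable[Option[Phaser]](None)

  def finish(body: => Unit): Unit = {
    val phaser = new Phaser(1)              // one party: the finish body itself
    scope.withValue(Some(phaser)) { body }
    phaser.arriveAndAwaitAdvance()          // block until every registered task has arrived
  }

  def async(body: => Unit): Unit = {
    val enclosing = scope.value.getOrElse(sys.error("async called outside finish"))
    enclosing.register()                    // count the task before it starts
    pool.execute { () =>
      // Tasks spawned inside this body register with the same finish scope.
      try scope.withValue(Some(enclosing)) { body }
      finally enclosing.arriveAndDeregister()
    }
  }

  // The slide's Fibonacci, unchanged apart from using the stand-ins above.
  def fibonacci(i: Int): Long =
    if (i <= 1) i
    else {
      var a, b = 0L
      finish {
        async { a = fibonacci(i - 2) }
        b = fibonacci(i - 1)
      }
      a + b
    }
}
```

In this sketch every blocked finish holds a thread, so it only suits small inputs such as LocalFinish.fibonacci(10), which evaluates to 55.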

8 Messages and Memory
The default mechanism for transferring memory between places is to capture it in the closure of the body of asyncAt.
APGAS also lets the programmer define global symbols for memory local to places.
    class Worker(…) extends PlaceLocal

9 Place-local Objects
All instances of PlaceLocal resolve to objects that are place-specific.
    class Worker(…) extends PlaceLocal

    // One distinct instance is created at each place.
    val w : Worker = PlaceLocal.forPlaces(places) { new Worker(…) }

    for(p <- places) {
      asyncAt(p) {
        // Here, w resolves to the worker at place p.
      }
    }

10 Global and Shared References
For objects that cannot extend PlaceLocal, APGAS provides a wrapper (“pointer”).
    trait GlobalRef[T] { def apply(): T }
Shared references refer to an object at a particular place and can only be dereferenced there.
– Useful to “call back” from an asynchronous task.
    trait SharedRef[T] { def apply(): T }

11 Global and Shared References
    // at place p1
    val largeArray : Array[Double] = …
    val ref = SharedRef.make(largeArray)
    asyncAt(p2) {
      …  // Dereferencing ref() here would be an error.
      asyncAt(p1) {
        val array = ref()  // Dereference at p1 resolves to largeArray.
        array(…) = …
      }
      …
    }
largeArray is never captured, therefore never serialized.

12 Distributed K-means Clustering
Goal: iteratively divide a set of points into K disjoint clusters.
Distribute the points among workers. In each iteration:
– the workers compute the new centroids for their own points and communicate their view of the centroids to the master;
– the master aggregates all workers’ data and checks convergence.

13 Distributed K-Means: Memory
Each worker needs to hold (GlobalRef[WorkerData]):
– Its set of points.
– Its local view of the centroids.
In addition, the master holds (SharedRef[MasterData]):
– The aggregated centroids.
In our implementation, the workers write their results directly at the master.
– Requires a synchronized data structure.

14 Distributed K-Means: Structure
    while(!converged) {
      finish {
        for(p <- places) {
          asyncAt(p) {
            // compute new local centroids
            asyncAt(masterRef.home()) {
              // merge local centroids in master
            }
          }
        }
      }
    }
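The two comments in the loop above (compute local centroids, merge in the master) can be sketched sequentially in plain Scala. This is not the benchmark code from the paper: the names KMeansSketch, Partial, localStep, merge and run, the convergence threshold, and the choice to exchange per-cluster sums and counts are all assumptions made for illustration. Each element of shards plays the role of one worker's points.

```scala
// Hypothetical, sequential sketch of one distributed K-means loop; no APGAS calls.
object KMeansSketch {
  type Point = Array[Double]

  // What a worker would send to the master: per-cluster coordinate sums and point counts.
  final case class Partial(sums: Array[Array[Double]], counts: Array[Int])

  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  def closest(p: Point, centroids: Array[Point]): Int =
    centroids.indices.minBy(k => sqDist(p, centroids(k)))

  // Worker step: assign each local point to its nearest centroid and accumulate sums.
  def localStep(points: Seq[Point], centroids: Array[Point]): Partial = {
    val dim = centroids(0).length
    val sums = Array.fill(centroids.length, dim)(0.0)
    val counts = Array.fill(centroids.length)(0)
    for (p <- points) {
      val k = closest(p, centroids)
      for (d <- 0 until dim) sums(k)(d) += p(d)
      counts(k) += 1
    }
    Partial(sums, counts)
  }

  // Master step: merge all partial results; an empty cluster keeps its old centroid.
  def merge(partials: Seq[Partial], old: Array[Point]): Array[Point] =
    old.indices.map { k =>
      val n = partials.map(_.counts(k)).sum
      if (n == 0) old(k)
      else old(k).indices.map(d => partials.map(_.sums(k)(d)).sum / n).toArray
    }.toArray

  def run(shards: Seq[Seq[Point]], init: Array[Point],
          eps: Double = 1e-9, maxIter: Int = 100): Array[Point] = {
    var centroids = init
    var converged = false
    var iter = 0
    while (!converged && iter < maxIter) {
      // In the APGAS version this map would be one asyncAt(p) per place inside a finish.
      val partials = shards.map(localStep(_, centroids))
      val next = merge(partials, centroids)
      converged = centroids.indices.forall(k => sqDist(centroids(k), next(k)) <= eps)
      centroids = next
      iter += 1
    }
    centroids
  }
}
```

Exchanging sums and counts rather than averaged centroids lets the master weight each worker's contribution by the number of points it actually assigned to a cluster.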

15 Unbalanced Tree Search
Counts nodes in a dynamically generated tree. Each node:
– Has an associated SHA-1 hash.
– Has a number of children determined by a probabilistic law.
Trees are unbalanced in an unpredictable but deterministic way.
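As a rough illustration of "unpredictable but deterministic", the following sequential sketch derives each node's hash from its parent's hash with SHA-1 and reads the number of children off the hash. The law used here (first hash byte modulo 4, with a depth cap to guarantee termination), the seed encoding, and the names UtsSketch, childCount and count are assumptions for illustration, not the benchmark's actual parameters.

```scala
import java.security.MessageDigest

// Hypothetical, sequential sketch of an unbalanced tree count; not the benchmark code.
object UtsSketch {
  def sha1(bytes: Array[Byte]): Array[Byte] =
    MessageDigest.getInstance("SHA-1").digest(bytes)

  // Assumed law: 0 to 3 children taken from the first hash byte; none at the depth cap.
  def childCount(hash: Array[Byte], depth: Int, maxDepth: Int): Int =
    if (depth >= maxDepth) 0 else (hash(0) & 0xff) % 4

  // Counts all nodes reachable from the root, using an explicit work list.
  // In the distributed benchmark, such work lists are what workers steal from or deal out.
  def count(seed: Int, maxDepth: Int): Long = {
    var work = List((sha1(seed.toString.getBytes("UTF-8")), 0))
    var nodes = 0L
    while (work.nonEmpty) {
      val (hash, depth) = work.head
      work = work.tail
      nodes += 1
      for (i <- 0 until childCount(hash, depth, maxDepth))
        work = (sha1(hash :+ i.toByte), depth + 1) :: work  // child hash depends only on parent and index
    }
    nodes
  }
}
```

Because every child hash is a pure function of the parent hash and the child index, two runs with the same seed visit exactly the same tree, even though its shape cannot be predicted without computing the hashes.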

16 Unbalanced Tree Search
The algorithm combines work-stealing and work-dealing among workers.
Workers are modeled as state machines.
Termination:
– in APGAS: a single, top-level finish.
– in Akka: requires a counting protocol.

17 APGAS Implementation
APGAS implementation:
– ~2000 lines of Java 8
– ~200 lines of Scala (definitions, helpers, serialization)
Tasks are scheduled using fork/join.
Distribution is built on top of Hazelcast.
Benchmarks are ~1200 lines of Scala:
– 1/3 APGAS, 1/3 Akka, 1/3 common.

18 Performance Evaluation
For both benchmarks, we ran a fixed problem using 1, 2, 4, 8, 16, and 32 workers.
We measured “units of work” per second per worker.
All experiments ran on a single 48-core machine.
– Akka benchmarks use akka-remote.

19 Performance Evaluation
Experiments are meant to:
– be a sanity check,
– provide evidence of scalability potential.
Please do not interpret them as a claim that X is better than Y.
“Comparable performance and scalability for comparable complexity.”

20 K-Means
[Plot: iterations/second/worker (y-axis) against number of workers (x-axis).]

21 Unbalanced Tree Search
[Plot: millions of nodes/second/worker (y-axis) against number of workers (x-axis).]

22 Conclusion
We made the APGAS programming model accessible to Scala programmers.
The programming style is different, but a good fit for some problems.
In particular, finish concisely solves hard distributed termination problems.
Complexity is similar to that of the equivalent Akka implementations.
Preliminary scaling results are promising.

23 Thank you!

