Pattern Parallel Programming
B. Wilkinson/Clayton Ferner. PatternProgIntro.ppt. Modification date: August 14, 2014
Problem Addressed
To make parallel programming more usable and scalable.
Parallel programming, that is, writing programs that use multiple computers and processors collectively to solve problems, has a very long history but is still a challenge.
Traditional approach
Explicitly specifying message passing (MPI), and low-level thread APIs (Pthreads, Java threads, OpenMP, …).
Both require programmers to use low-level routines.
Need a better structured approach.
Pattern Programming Concept
The programmer begins by constructing the program from established computational or algorithmic “patterns” that provide a structure.
“Design patterns” have been part of software engineering for many years:
  Reusable solutions to commonly occurring problems *
  Patterns provide a guide to “best practices”, not a final implementation
  Provide a good scalable design structure
  Make it easier to reason about programs
  Potential for automatic conversion into executable code, avoiding low-level programming. We do that here.
  Particularly useful for the complexities of parallel/distributed computing
* http://en.wikipedia.org/wiki/Design_pattern_(computer_science)
In parallel/distributed computing, what patterns are we talking about?
Low-level algorithmic patterns that might be embedded into a program, such as fork-join and broadcast/scatter/gather.
Higher-level algorithmic patterns for forming a complete program, such as workpool, pipeline, stencil, and map-reduce.
We tend to concentrate on the higher-level “computational/algorithm” patterns rather than the lower-level patterns.
Low-level MPI message-passing patterns
MPI point-to-point data transfer (send-receive): data is sent from a source process to a destination process.
Collective patterns
Broadcast Pattern
Sends the same data from one source to each of a group of destination processes.
A common pattern to get the same data to all processes, especially at the beginning of a computation.
Note: The patterns as drawn do not mean that the implementation performs them as shown. Only the final result is the same in any parallel implementation. Patterns do not describe the implementation.
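The note above, that a pattern fixes the result and not the implementation, can be illustrated with a minimal sketch. It is plain Python that simulates process buffers in a list, not MPI code, and the function names `broadcast_direct` and `broadcast_tree` are hypothetical. One version has the source send to each destination in turn; the other assumes a tree-style forwarding scheme in which processes that already hold the data pass it on. Both leave every process with the same data.

```python
def broadcast_direct(data, num_processes):
    """Source (rank 0) sends the data to every destination itself."""
    buffers = [None] * num_processes
    buffers[0] = data
    for dest in range(1, num_processes):
        buffers[dest] = buffers[0]
    return buffers


def broadcast_tree(data, num_processes):
    """Ranks that already hold the data forward it, doubling coverage each round."""
    buffers = [None] * num_processes
    buffers[0] = data
    step = 1
    while step < num_processes:
        for src in range(step):          # ranks 0..step-1 already hold the data
            dest = src + step
            if dest < num_processes:
                buffers[dest] = buffers[src]
        step *= 2
    return buffers


# Different communication schedules, identical final result.
same = broadcast_direct("config", 5) == broadcast_tree("config", 5)
```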
Scatter Pattern
Distributes a collection of data items to a group of processes: the source sends different data to each destination.
A common pattern to get data to all processes. Usually the data sent are parts of an array.
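As a rough illustration of the scatter result, the sketch below splits an array into one contiguous block per process. It is plain Python rather than MPI, the name `scatter` is hypothetical, and handing any leftover items to the lowest-numbered processes is an assumption made here for the example, not something the slides specify.

```python
def scatter(items, num_processes):
    """Return a list where entry `rank` is the block sent to that process."""
    base, extra = divmod(len(items), num_processes)
    blocks = []
    start = 0
    for rank in range(num_processes):
        # Assumption: the first `extra` ranks each take one additional item.
        size = base + (1 if rank < extra else 0)
        blocks.append(items[start:start + size])
        start += size
    return blocks


blocks = scatter(list(range(10)), 4)
```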
Gather Pattern
Essentially the reverse of the scatter pattern. It receives data items from a group of source processes, and the data are collected at the destination in an array.
A common pattern, especially at the end of a computation to collect results.
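The gather result can be sketched in the same spirit. This is a plain Python simulation, not MPI; the name `gather` is hypothetical, and placing the blocks in rank order is an assumption of this example.

```python
def gather(blocks):
    """Collect per-process blocks into one array at the destination.

    blocks[rank] is the data held by process `rank`.
    """
    collected = []
    for block in blocks:      # assumption: results are placed in rank order
        collected.extend(block)
    return collected


results = gather([[0, 1, 4], [9, 16], [25]])
```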
Reduce Pattern
A common pattern to get data back to the master from all processes and then aggregate it by combining the collected data into one answer.
The reduction operation must be a binary operation that is commutative (changing the order of the operands does not change the result), so that the implementation can do the operations in any order. For partial results to be grouped freely as well, the operation should also be associative; addition, maximum, and minimum satisfy both.
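Why the order of operations matters can be seen in a small sketch. It is plain Python, not MPI, and `reduce_in_order` is a hypothetical helper that combines the contributions in whatever order an implementation happens to choose. With addition, every order gives the same answer; with subtraction, which is neither commutative nor associative, the answer depends on the order.

```python
from functools import reduce
import operator


def reduce_in_order(values, op, order):
    """Combine values[i] for i in `order`, left to right, with binary op."""
    return reduce(op, (values[i] for i in order))


values = [5, 3, 8, 1]          # one contribution per process
rank_order = [0, 1, 2, 3]
arrival_order = [2, 0, 3, 1]   # hypothetical order in which messages arrive

sum_a = reduce_in_order(values, operator.add, rank_order)
sum_b = reduce_in_order(values, operator.add, arrival_order)
diff_a = reduce_in_order(values, operator.sub, rank_order)
diff_b = reduce_in_order(values, operator.sub, arrival_order)
```

Here `sum_a` and `sum_b` agree, while `diff_a` and `diff_b` do not, which is why subtraction is not a safe reduction operation.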
Collective all-to-all broadcast
Sources and destinations are the same processes.
A common all-to-all pattern, often within a computation, is to send data from all processes to all processes: every process sends data to every other process (one-way).
Versions of this can be found in MPI.
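A minimal sketch of the end state of an all-to-all broadcast follows. It is a plain Python simulation with a hypothetical function name; in MPI, `MPI_Allgather` is one routine that produces this kind of result.

```python
def all_to_all_broadcast(contributions):
    """contributions[rank] is the data that process `rank` starts with.

    Returns a list where entry `rank` is what that process holds afterwards:
    its own copy of every process's contribution, in rank order.
    """
    num_processes = len(contributions)
    return [list(contributions) for _ in range(num_processes)]


after = all_to_all_broadcast(["a", "b", "c"])
```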
Some Higher Level Message-Passing Patterns
Master/slave
The computation is divided into parts, which are passed out to slaves to perform; the slaves return their results to the master over two-way connections. This is the basis of most parallel computing.
(Diagram legend: compute node, source/sink, two-way connection.)
Workpool
Very widely applicable pattern.
The master holds a task queue and hands a task from the queue to each slave/worker. Each worker returns its result, and the master aggregates the answers.
Once a slave completes a task, it is given another task from the task queue by the master if the queue is not empty, which gives the pattern its load-balancing quality.
Need to differentiate this from the master-slave pattern, which does not imply a task queue.
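The workpool behaviour described above can be sketched with Python threads and a shared queue. This is only an illustration of the task-queue idea on one machine, not the Seeds, Paraguin, or Suzaku implementation; `run_workpool` and its parameters are hypothetical names. Each worker keeps taking tasks until the queue is empty, so faster workers naturally end up doing more tasks.

```python
import queue
import threading


def run_workpool(tasks, work, num_workers=3):
    """Run work(task) for every task using a shared task queue."""
    task_queue = queue.Queue()
    for index, task in enumerate(tasks):
        task_queue.put((index, task))
    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                index, task = task_queue.get_nowait()
            except queue.Empty:
                return                      # no tasks left: worker stops
            results[index] = work(task)     # each index is written only once

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


squares = run_workpool([1, 2, 3, 4, 5], lambda n: n * n)
total = sum(squares)   # the master aggregates the answers
```

All tasks are queued before the workers start, so an empty queue reliably means the work is finished.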
More Specialized High-level Patterns
Pipeline
(Diagram: a master and workers arranged as Stage 1, Stage 2, and Stage 3. Legend: one-way connection, two-way connection, compute node, source/sink.)
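The data flow through a pipeline can be sketched with chained Python generators. This is a sequential simulation of how each item passes through the stages in order, not a parallel implementation, and the names `stage` and `run_pipeline` are hypothetical.

```python
def stage(func, upstream):
    """One pipeline stage: apply func to each item arriving from upstream."""
    for item in upstream:
        yield func(item)


def run_pipeline(source, stage_funcs):
    """Connect the stages in order and collect what leaves the last one."""
    stream = iter(source)
    for func in stage_funcs:
        stream = stage(func, stream)
    return list(stream)


outputs = run_pipeline(
    [1, 2, 3],
    [lambda x: x + 1,    # Stage 1
     lambda x: x * 2,    # Stage 2
     lambda x: x - 3],   # Stage 3
)
```

In a real parallel pipeline each stage would run on its own worker, so that Stage 1 can start on the next item while Stage 2 is still busy with the previous one.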
Divide and Conquer
(Diagram: a divide phase followed by a merge phase. Legend: two-way connection, compute node, source/sink.)
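Merge sort is a familiar instance of the divide and merge phases, chosen here only as an illustration; the slides do not name a specific algorithm. The sketch is sequential Python. In a parallel version the two recursive calls could be handed to different compute nodes.

```python
import heapq


def merge_sort(values):
    """Sort by dividing the list in half, sorting each half, and merging."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left = merge_sort(values[:mid])           # divide
    right = merge_sort(values[mid:])
    return list(heapq.merge(left, right))     # merge two sorted halves


ordered = merge_sort([7, 2, 9, 4, 1])
```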
All-to-All
All compute nodes can communicate with all the other nodes.
(Diagram legend: two-way connection, compute node, source/sink, master.)
Iterative synchronous patterns
When a pattern is repeated until some termination condition occurs.
Synchronization occurs at each iteration to establish the termination condition, which is often a global condition.
Note that this is two patterns merged together sequentially if we call iteration a pattern. We actually have a pattern operator that can do this.
(Diagram: run the pattern, check the termination condition, then repeat or stop.)
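The repeat-until-termination structure can be sketched as a small driver that wraps any pattern. This is plain Python with hypothetical names (`iterate_until`, `pattern`, `converged`), and the `max_iterations` safety limit is an assumption added for the example.

```python
def iterate_until(pattern, state, converged, max_iterations=1000):
    """Apply `pattern` repeatedly, checking termination after each iteration.

    Returns the final state and the number of iterations performed.
    """
    for iteration in range(1, max_iterations + 1):
        new_state = pattern(state)
        # Synchronization point: all results of this iteration are available,
        # so a global termination condition can be evaluated.
        if converged(state, new_state):
            return new_state, iteration
        state = new_state
    return state, max_iterations


# Toy example: halve a value until it changes by less than 0.001.
final, count = iterate_until(
    pattern=lambda x: x / 2,
    state=1.0,
    converged=lambda old, new: abs(new - old) < 1e-3,
)
```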
Iterative synchronous all-to-all pattern
Example: The N-body problem needs an “iterative synchronous all-to-all” pattern, where on each iteration all processes exchange data with each other.
(Diagram: check the termination condition, then repeat or stop.)
Stencil
All compute nodes can communicate only with neighboring nodes.
Usually a synchronous computation: it performs a number of iterations to converge on a solution, e.g. solving Laplace’s/heat equation.
On each iteration, each node communicates with its neighbors to get their stored computed values.
(Diagram legend: two-way connection, compute node, source/sink.)
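A stencil computation can be sketched with a one-dimensional Jacobi iteration for steady-state heat flow along a rod. Using one dimension, fixed end temperatures, and the names `jacobi_step` and `solve` are all assumptions made to keep the example short; each list element stands in for a compute node that reads only its two neighbors' previous values.

```python
def jacobi_step(grid):
    """One synchronous iteration: each interior point averages its neighbors."""
    new_grid = list(grid)                  # end points stay fixed
    for i in range(1, len(grid) - 1):
        new_grid[i] = 0.5 * (grid[i - 1] + grid[i + 1])
    return new_grid


def solve(grid, tolerance=1e-6, max_iterations=10000):
    """Repeat the stencil until the largest change falls below tolerance."""
    for iteration in range(1, max_iterations + 1):
        new_grid = jacobi_step(grid)
        change = max(abs(a - b) for a, b in zip(new_grid, grid))
        grid = new_grid
        if change < tolerance:             # global termination condition
            return grid, iteration
    return grid, max_iterations


# Rod held at 0 degrees on the left and 100 degrees on the right.
temperatures, iterations = solve([0.0, 0.0, 0.0, 0.0, 100.0])
```

The interior values settle toward a straight-line profile between the two fixed ends, which is the expected steady state for this one-dimensional case.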
Parallel Patterns
Advantages
  Abstracts/hides the underlying computing environment
  Generally avoids deadlocks and race conditions
  Reduces source code size (lines of code)
  Leads to automated conversion into parallel programs without the need to write low-level message-passing routines such as MPI
  Hierarchical designs, with patterns embedded into patterns and pattern operators to combine patterns
Disadvantages
  New approach to learn
  Takes away some freedom from the programmer
  Performance reduced (cf. using high-level languages instead of assembly language)
Previous/Existing Work
Patterns have been explored in several projects.
Industrial efforts:
  Intel Threading Building Blocks (TBB), Cilk Plus, Array Building Blocks (ArBB). Focus on very low-level patterns such as fork-join.
Universities:
  University of Illinois at Urbana-Champaign and University of California, Berkeley
  University of Torino/Università di Pisa, Italy
Book: “Structured Parallel Programming: Patterns for Efficient Computation,” Michael McCool, James Reinders, Arch Robison, Morgan Kaufmann, 2012 (Intel tools: TBB, Cilk, ArBB).
Note on Terminology: “Skeletons”
Sometimes the term “skeleton” is used to describe “patterns”, especially directed acyclic graphs with a source, a computation, and a sink.
We do not make that distinction and use the term “pattern” whether directed or undirected and whether acyclic or cyclic. This is done elsewhere.
We focus on higher-level tools to avoid using low-level routines.
We have developed several tools at different levels of abstraction that avoid using low-level MPI and enable students to create working patterns very quickly.
  Seeds framework: high-level Java-based software that self-deploys and executes on any platform, local computers or distributed computers. Several patterns are implemented, including workpool, pipeline, synchronous iterative all-to-all, stencil, …
  Paraguin compiler: C-based compiler directive approach that creates MPI code. Patterns implemented include scatter-gather for a master-slave pattern, stencil, …
  Suzaku framework: provides pre-written pattern-based routines and macros that hide the MPI code. At an early stage of development.
Acknowledgements
The Seeds framework was developed by Jeremy Villalobos in his PhD thesis “Running Parallel Applications on a Heterogeneous Environment with Accessible Development Practices and Automatic Scalability,” UNC-Charlotte, 2011.
Extending the work to a teaching environment is supported by the National Science Foundation under grant "Collaborative Research: Teaching Multicore and Many-Core Programming at a Higher Level of Abstraction" #1141005/1141006 (2012-2015). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Questions