
1 Pattern Programming Approach for Teaching Parallel and Distributed Computing
Dr. Barry Wilkinson, University of North Carolina Charlotte; Dr. Jeremy Villalobos, formerly of University of North Carolina Charlotte; Dr. Clayton Ferner, University of North Carolina Wilmington. SIGCSE 2013, The 44th ACM Technical Symposium on Computer Science Education. Friday March 8, 2013, 1:45 PM - 3:00 PM; Room: Governors 12. © B. Wilkinson/Clayton Ferner, SIGCSE13paper.ppt, Modification date: March 4, 2013

2 Problem Addressed
To make parallel programming more usable and scalable. Parallel programming -- writing programs that use multiple computers and processors collectively to solve problems -- has a very long history but remains a challenge.

3 Traditional approach
Explicitly specifying message passing (MPI); low-level threads APIs (Pthreads, Java threads, OpenMP, ...); CUDA; ... . A better-structured approach is needed for good programming practices and scalable designs.

4 Pattern Programming Concept
The programmer begins by constructing the program using established computational or algorithmic “patterns” that provide a structure. What patterns are we talking about? Low-level algorithmic patterns that might be embedded into a program, such as fork-join and broadcast/scatter/gather. Higher-level algorithmic patterns for forming a complete program, such as workpool, pipeline, stencil, and map-reduce. We concentrate upon the higher-level “computational/algorithmic” patterns rather than the lower-level patterns.

5 Some patterns
[Figure: (a) Workpool, (b) Pipeline (Stage 1, Stage 2, Stage 3), (c) Divide and conquer (Divide, Merge), (d) Stencil, (e) All-to-all. Legend: compute node; master (source/sink); one-way connection; two-way connection.]

6 Combining patterns
Many problems require a pattern to be repeated without returning control to the master. Example: a pattern that combines all-to-all with synchronous iteration, which we call the CompleteSyncGraph pattern. Slave processes can exchange data with each other at each iteration (synchronization point) without stopping the pattern. Some problems of this type require a number of iterations to converge on the solution. Example applications: the N-body problem; solving a general system of linear equations by iteration.

7 Pattern operators
In our framework, you can create your own combined patterns with a pattern operator. Example: adding the Stencil and All-to-All synchronous patterns. Example use: heat-distribution simulation (Laplace's equation). Multiple cells in a stencil pattern work in a loop-parallel fashion, computing and synchronizing on each iteration. However, every x iterations they must perform an all-to-all communication pattern to run an algorithm that detects termination (sketched below).
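As a rough illustration of the control flow just described (a minimal, self-contained sketch only; the class name and the helper methods stencilUpdate and allToAllTerminationCheck are hypothetical placeholders, not the Seeds pattern-operator API):

public class StencilWithPeriodicTerminationCheck {
    static final int CHECK_EVERY = 10;   // the "every x iterations" from the slide

    public static void main(String[] args) {
        double[][] grid = new double[64][64];
        boolean done = false;
        for (int iteration = 1; !done; iteration++) {
            // Stencil pattern: each cell is computed from its neighbours;
            // in the parallel version there is a synchronization point each iteration.
            stencilUpdate(grid);
            if (iteration % CHECK_EVERY == 0) {
                done = allToAllTerminationCheck(grid);   // all-to-all pattern: global termination test
            }
        }
    }

    // Hypothetical placeholder: Laplace/heat-distribution relaxation step
    // (each interior point replaced by the average of its four neighbours).
    static void stencilUpdate(double[][] g) {
        for (int i = 1; i < g.length - 1; i++)
            for (int j = 1; j < g[i].length - 1; j++)
                g[i][j] = 0.25 * (g[i - 1][j] + g[i + 1][j] + g[i][j - 1] + g[i][j + 1]);
    }

    // Hypothetical placeholder: in the combined pattern, every node would exchange
    // data with every other node here and run a termination-detection algorithm.
    static boolean allToAllTerminationCheck(double[][] g) {
        return true;   // stub so this sketch stops after the first check
    }
}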

8 Patterns -- Advantages
Leads to an automated conversion into parallel programs without the need to write low-level message-passing routines such as MPI -- see later. Abstracts/hides the underlying computing environment. Generally avoids deadlocks and race conditions. Reduces source code size (lines of code). "Design patterns" have been part of software engineering for many years: "reusable solutions to commonly occurring problems". Patterns provide a guide to "best practices", not a final implementation. Provides a good, scalable design structure for parallel programs. Makes it easier to reason about programs.

9 Patterns -- Disadvantages
A new approach to learn. Takes away some of the freedom from the programmer. Performance is reduced slightly (but c.f. using high-level languages instead of assembly language).

10 Previous/Existing Work
Patterns/skeletons have been explored in several projects. Industrial efforts: Intel -- Intel Threading Building Blocks (TBB), Intel Cilk Plus, Intel Array Building Blocks (ArBB); these focus on very low-level patterns such as fork-join and provide constructs for them; somewhat competing tools obtained through takeovers of small companies, each implemented differently. Microsoft. Universities: University of Illinois at Urbana-Champaign and University of California, Berkeley; University of Torino/Università di Pisa, Italy.

11 Our approach
Focuses on a few higher-level patterns of wide applicability (e.g. workpool, synchronous all-to-all, pipeline, stencil). A software framework called Seeds has been developed that enables the programmer to very easily construct an application from established patterns without needing to write low-level message-passing or thread-based code. The framework can automatically distribute code across processor cores, computers, or geographically distributed computers and execute the parallel code.

12 “Seeds” Parallel Grid Application Framework
Some key features: pattern-programming (Java) user interface; self-deploys on computers, clusters, and geographically distributed computers; load balances; three levels of user interface: Basic, Advanced, Expert.

13 Basic User Programmer Interface
To create and execute parallel programs, the programmer selects a pattern and implements three principal Java methods: the Diffuse method -- to distribute pieces of data; the Compute method -- the actual computation; the Gather method -- used to gather the results. The programmer also has to fill in details in a "bootstrap" class to deploy and start the framework. The framework then self-deploys on a geographically distributed platform and executes the pattern.
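As a preview, a workpool module has roughly the following shape (a minimal sketch only; the class name MyWorkpoolModule is hypothetical, and the method signatures are taken from the MonteCarloPiModule example on the following slides):

import edu.uncc.grid.pgaf.datamodules.Data;
import edu.uncc.grid.pgaf.interfaces.basic.Workpool;

public class MyWorkpoolModule extends Workpool {
    private static final long serialVersionUID = 1L;

    public void initializeModule(String[] args) { }          // set up module state
    public Data DiffuseData(int segment) { return null; }    // distribute one piece of data (a job unit)
    public Data Compute(Data data) { return null; }          // the actual computation on one piece
    public void GatherData(int segment, Data dat) { }        // gather and aggregate the results
    public int getDataCount() { return 0; }                  // number of job units to diffuse
}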

14 Seeds Workpool DiffuseData, Compute, and GatherData Methods
[Figure: data flow in the Seeds workpool. The master creates a DataMap d in DiffuseData and returns d to each slave as the Data argument data of Compute; each slave casts it to a DataMap input, creates a DataMap output in Compute, and returns it as the Data argument dat of GatherData, where the master aggregates it into the private variable total (the answer).]
Note: the DiffuseData, Compute and GatherData method names start with a capital letter, although Java method names conventionally should not.

15 Complete Seeds Workpool code to compute pi by Monte Carlo method
See Session 1 Hands-on.

Computation class (note: no explicit message passing):

package edu.uncc.grid.example.workpool;

import java.util.Random;
import java.util.logging.Level;
import edu.uncc.grid.pgaf.datamodules.Data;
import edu.uncc.grid.pgaf.datamodules.DataMap;
import edu.uncc.grid.pgaf.interfaces.basic.Workpool;
import edu.uncc.grid.pgaf.p2p.Node;

public class MonteCarloPiModule extends Workpool {
    private static final long serialVersionUID = 1L;
    private static final int DoubleDataSize = 1000;
    double total;
    int random_samples;
    Random R;

    public MonteCarloPiModule() {
        R = new Random();
    }

    public void initializeModule(String[] args) {
        total = 0;
        Node.getLog().setLevel(Level.WARNING);  // reduce verbosity for logging
        random_samples = 3000;                  // set number of random samples
    }

    public Data DiffuseData(int segment) {
        DataMap<String, Object> d = new DataMap<String, Object>();
        d.put("seed", R.nextLong());
        return d;   // returns a random seed for each job unit
    }

    public Data Compute(Data data) {
        // input gets the data produced by DiffuseData()
        DataMap<String, Object> input = (DataMap<String, Object>) data;
        // output will emit the partial answers done by this method
        DataMap<String, Object> output = new DataMap<String, Object>();
        Long seed = (Long) input.get("seed");   // get random seed
        Random r = new Random();
        r.setSeed(seed);
        Long inside = 0L;
        for (int i = 0; i < DoubleDataSize; i++) {
            double x = r.nextDouble();
            double y = r.nextDouble();
            double dist = x * x + y * y;
            if (dist <= 1.0) {
                ++inside;
            }
        }
        output.put("inside", inside);   // store partial answer to return to GatherData()
        return output;
    }

    public void GatherData(int segment, Data dat) {
        DataMap<String, Object> out = (DataMap<String, Object>) dat;
        Long inside = (Long) out.get("inside");
        total += inside;   // aggregate answer from all the worker nodes
    }

    public double getPi() {
        // returns value of pi based on the job done by all the workers
        double pi = (total / (random_samples * DoubleDataSize)) * 4;
        return pi;
    }

    public int getDataCount() {
        return random_samples;
    }
}
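For reference, the value that getPi() returns is the standard Monte Carlo estimate: the fraction of random points (x, y) in the unit square that satisfy x^2 + y^2 <= 1 approximates pi/4, so

    pi ≈ 4 × (number of points with x^2 + y^2 <= 1) / (total number of points),

where the total number of points is random_samples × DoubleDataSize in the code above.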

16 Data cast into a DataMap
public Data DiffuseData (int segment) {
    // segment is used by the framework to keep track of where to put results
    DataMap<String, Object> d = new DataMap<String, Object>();
    inputData = ....
    d.put("name_of_inputdata", inputData);
    return d;
}

public Data Compute (Data data) {
    // data produced by DiffuseData(), cast into a DataMap by the framework
    DataMap<String, Object> input = (DataMap<String, Object>) data;
    // output returned to GatherData()
    DataMap<String, Object> output = new DataMap<String, Object>();
    inputData = input.get("name_of_inputdata");
    ...   // computation
    output.put("name_of_results", results);   // to return to GatherData()
    return output;
}

public void GatherData (int segment, Data dat) {
    // GatherData is given back a Data object with a segment number by the framework
    DataMap<String, Object> out = (DataMap<String, Object>) dat;
    outdata = out.get("name_of_results");
    result ...   // aggregate outdata from all the worker nodes (result is a private variable)
}

17 Bootstrap class
package edu.uncc.grid.example.workpool;

import java.io.IOException;
import net.jxta.pipe.PipeID;
import edu.uncc.grid.pgaf.Anchor;
import edu.uncc.grid.pgaf.Operand;
import edu.uncc.grid.pgaf.Seeds;
import edu.uncc.grid.pgaf.p2p.Types;

public class RunMonteCarloPiModule {
    public static void main(String[] args) {
        try {
            MonteCarloPiModule pi = new MonteCarloPiModule();
            Seeds.start("/path/to/seeds/seed/folder", false);
            PipeID id = Seeds.startPattern(new Operand(
                    (String[]) null,
                    new Anchor("hostname", Types.DataFlowRoll.SINK_SOURCE),
                    pi));
            System.out.println(id.toString());
            Seeds.waitOnPattern(id);
            System.out.println("The result is: " + pi.getPi());
            Seeds.stop();
        } catch (SecurityException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

This is the bootstrap class: it deploys the framework and starts execution of the pattern. Different patterns have similar code.

18 Multicore version of Seeds
A multicore version using shared memory. Faster on a multicore platform. Does not use the JXTA P2P network to run cluster nodes; it is thread based. The bootstrap class does not need to start and stop JXTA P2P: Seeds.start() and Seeds.stop() are not needed. Otherwise the user code is similar.

public class RunMonteCarloPiModule {
    public static void main(String[] args) {
        try {
            MonteCarloPiModule pi = new MonteCarloPiModule();
            Thread id = Seeds.startPatternMulticore(
                    new Operand(
                            (String[]) null,
                            new Anchor(args[0], Types.DataFlowRole.SINK_SOURCE),
                            pi),
                    4);
            id.join();
            System.out.println("The result is: " + pi.getPi());
        } catch (SecurityException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

19 Compiling/executing
Can be done on the command line (an ant script is provided) or through an IDE (Eclipse).

20 Pattern programming was introduced into our regular undergraduate parallel programming course in Fall 2012, before lower-level tools such as MPI, OpenMP, and CUDA. The prototype course ran on the NCREN televideo network between UNC-Charlotte and UNC-Wilmington. Future offerings will include other sites.

21 Brief outline of course contents
Parallel Computing -- Demand for computational speed, grand challenge problems, potential speed-up using multiple processors, speed-up factor, maximum speed-up, Amdahl's law, Gustafson's law. Parallel Computers -- types of parallel computers, shared memory systems, multicore, distributed memory systems, networked computer clusters, GPU systems. Pattern Programming -- parallel patterns for structured parallel programming, workpool, pipeline, divide and conquer, stencil, all-to-all patterns, advantages of patterns, Seeds framework, user interface, programming examples. Assignment 1 -- Using the Seeds Pattern Programming Framework: 1 - Workpool.
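For reference, the standard forms of the two laws listed above, with f the serial fraction of a computation and p the number of processors:

    Amdahl's law:     S(p) = 1 / ( f + (1 - f)/p )            (speed-up bounded above by 1/f)
    Gustafson's law:  S(p) = p - f(p - 1) = f + p(1 - f)      (scaled speed-up for a fixed parallel execution time)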

22 Lower-level message-passing computing -- MPI point-to-point message passing, message tags, MPI communicators, blocking, nonblocking, and synchronous send/recv, MPI collective routines (broadcast, scatter, gather, reduce, barrier), compiling and executing MPI programs, measuring execution time. Assignment 2 -- Compiling and executing MPI programs; comparison with the pattern framework. More patterns and applications -- synchronous All-To-All pattern, CompleteSynchGraph pattern, gravitational N-body problem, Barnes-Hut algorithm, code. Divide and conquer pattern, recursion, examples: numerical integration with adaptive quadrature. Pipeline pattern, examples: sorting, prime numbers, upper triangular linear equations; Seeds pipeline pattern, code. Iterative synchronous All-To-All pattern, solving a system of linear equations by iteration, Jacobi iteration, convergence rate. Stencil pattern, applications, heat distribution problem, Seeds code, cellular automata, Game of Life, partially synchronous method.

23 Compiler directive approach -- Paraguin compiler, parallel region, forall, broadcast, gather, examples. Assignment 3 -- Using Paraguin to create MPI programs (workpool pattern). Programming with Shared Memory -- processes, threads, interleaved statements, thread-safe routines, re-ordering code, compiler/processor optimizations, accessing shared data, critical sections, locks, condition variables, deadlock, semaphores, monitors, dependency analysis (Bernstein's conditions), serializing code, cache false sharing, sequential consistency. OpenMP -- directives/constructs, parallel, shared and local variables, work-sharing, sections, for, loop scheduling, for reduction, single, master, critical, barrier, atomic, flush. Assignment 4 -- Using Paraguin to create MPI programs, Sobel edge detection, and hybrid MPI/OpenMP.

24 Data parallel pattern -- examples, data parallel prefix sum algorithm, matrix multiplication, introduction to HPC GPU systems and CUDA. Assignment 5 -- CUDA programs using the GPU server: vector addition and the heat distribution problem, with graphics.

25 Conclusions This presentation describes an approach for teaching parallel programming by first starting with higher-level computational patterns. We have developed a new software framework that enables parallel and distributed programs to be implemented and executed on a parallel or distributed platform without needing to write low-level message passing code. We strongly believe that our approach builds a foundation for students to tackle larger professional applications by thinking about established higher-level patterns first.

26 Acknowledgements
This work was initiated by Jeremy Villalobos and described in his PhD thesis "Running Parallel Applications on a Heterogeneous Environment with Accessible Development Practices and Automatic Scalability," UNC-Charlotte. Jeremy developed the "Seeds" pattern programming software. Extending the work to a teaching environment is based upon work supported by the National Science Foundation under the collaborative grant "Collaborative Research: Teaching Multicore and Many-Core Programming at a Higher Level of Abstraction" # / ( ). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

27 Questions

28 Follow-up Hands-on SIGCSE 13 Workshop
"Workshop 31: Developing a Hands-on Undergraduate Parallel Programming Course with Pattern Programming Saturday  March 9, 2013 3:00 pm - 6:00 pm Room: Governors 9 28

