1
STAPL: A High Productivity Programming Infrastructure for Parallel & Distributed Computing
Lawrence Rauchwerger, Parasol Lab, Dept. of Computer Science, Texas A&M University
http://parasol.tamu.edu | http://parasol.tamu.edu/~rwerger/
2
Motivation
- Parallel programming is costly
- Parallel programs are not portable
- Scalability & efficiency are (usually) poor
- Dynamic programs are even harder
- Small-scale parallel machines are ubiquitous
3
Our Approach: STAPL
- STAPL: parallel components library
  - Extensible, open-ended
  - Parallel superset of STL
  - Sequential inter-operability
- Layered architecture: User - Developer - Specialist
  - Extensible
  - Portable (only the lowest layer needs to be specialized)
- High-productivity environment
  - Components have (almost) sequential interfaces
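The "(almost) sequential interfaces" point is the key productivity claim: code written against STL idioms should carry over. A minimal sequential STL fragment of the style STAPL parallelizes is shown below; the hypothetical parallel counterpart would keep the same shape (e.g., a pVector with a p_sort, names used here only for illustration, not the verified STAPL API) while distribution and communication happen underneath.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Plain sequential STL. STAPL's claim: the parallel version keeps
// (almost) this interface, with the container distributed and the
// algorithm run by parallel tasks instead of one thread.
int sum_of_sorted_prefix(std::vector<int>& v, std::size_t k) {
  std::sort(v.begin(), v.end());                     // -> p_sort in STAPL
  return std::accumulate(v.begin(), v.begin() + k, 0);  // -> p_accumulate
}
```

Because the interfaces match, the sequential version also serves as a correctness reference for the parallel one (the "sequential inter-operability" bullet above).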
4
STAPL Specification
- STL philosophy
- Shared object view
  - User layer: no explicit communication
  - Machine layer: architecture-dependent code
- Distributed objects
  - no replication
  - no software coherence
- Portable efficiency
  - Runtime system virtualizes the underlying architecture
- Concurrency & communication layer
  - SPMD (for now) parallelism
5
STAPL Applications
- Motion Planning: probabilistic roadmap methods for motion planning, with applications to protein folding, intelligent CAD, animation, robotics, etc.
- Molecular Dynamics: a discrete event simulation that computes interactions between particles.
[Figures: motion planning (start, goal, obstacles); protein folding]
6
STAPL Applications
- Seismic Ray Tracing: simulation of the propagation of seismic rays in the earth's crust.
- Particle Transport Computation: efficient massively parallel implementation of a discrete ordinates particle transport calculation.
[Figures: seismic ray tracing; particle transport simulation]
7
STAPL Overview
- Data is stored in pContainers: parallel equivalents of all STL containers & more (e.g., pGraph)
- STAPL provides generic pAlgorithms: parallel equivalents of STL algorithms & more (e.g., list ranking)
- pRanges bind pAlgorithms to pContainers: similar to STL iterators, but also support parallelism
[Diagram: pAlgorithm, pRange, and pContainer layered over the Runtime System (Scheduler, Executor)]
8
STAPL Overview
- pContainers
- pRange
- pAlgorithms
- RTS & ARMI communication infrastructure
- Applications using STAPL
9
pContainer Overview
pContainer: a distributed (no replication) data structure with parallel (thread-safe) methods
- Ease of use
  - Shared object view
  - Handles data distribution and remote data access internally (no explicit communication)
- Efficiency
  - De-centralized distribution management
  - OO design to optimize specific containers
  - Minimum overhead over STL containers
- Extensibility
  - A set of base classes provides the basic functionality
  - New pContainers can be derived from the base classes with extended and optimized functionality
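"De-centralized distribution management" means every location can resolve where an element lives without asking a central directory. A minimal sketch of that idea (illustrative only, not STAPL's actual distribution classes) is a block distribution whose owner lookup is pure arithmetic, so no communication is needed to translate a global index:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Sketch: a block distribution every thread/process can compute locally.
// Resolving a global index to (owner location, local offset) is a pair of
// divisions -- no directory, no message. (Hypothetical helper, not STAPL.)
struct BlockDistribution {
  std::size_t n;          // total number of elements
  std::size_t locations;  // number of threads/processes

  std::size_t block() const { return (n + locations - 1) / locations; }

  // owner location and local offset of global index i
  std::pair<std::size_t, std::size_t> lookup(std::size_t i) const {
    return {i / block(), i % block()};
  }
};
```

Remote accesses then become a method invocation on the owner location, which is what the shared object view hides from the basic user.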
10
pContainer Layered Architecture
pContainer provides different views for users with different needs/levels of expertise:
- Basic user view:
  - a single address space
  - interfaces similar to STL containers
- Advanced user view:
  - access to data distribution info to optimize methods
  - can provide customized distributions that exploit knowledge of the application
[Diagram: non-partitioned and partitioned shared-memory views of data, layered over the STAPL pContainer, the STAPL RTS and ARMI, and data in shared or distributed memory]
11
pContainer Design
- Base sequential container
  - STL containers are used to store the data
- Distribution manager
  - provides the shared object view
- BasePContainer
12
STAPL Overview
- pContainers
- pRange
- pAlgorithms
- RTS & ARMI communication infrastructure
- Applications using STAPL
13
pRange Overview
- Interface between pAlgorithms and pContainers
  - pAlgorithms are expressed in terms of pRanges
  - pContainers provide pRanges
  - Similar to the STL iterator
- Parallel programming support
  - Expresses the computation as a parallel task graph
  - Stores the DDGs used in processing subranges
- Less abstract than the STL iterator
  - Access to pContainer methods
- Expresses the data-task parallelism duality
14
pRange
- View of a work space
  - Set of tasks in a parallel computation
- Can be recursively partitioned into subranges
  - Defined on disjoint portions of the work space
  - Leaf subrange in the hierarchy:
    - represents a single task
    - smallest schedulable entity
- Task:
  - Function object to apply (using the same function for all subranges results in SPMD)
  - Description of the data to which the function is applied
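The recursive-partitioning idea above can be sketched in a few lines (an illustration of the concept, not STAPL's real pRange type): a range over a work space that splits itself until subranges reach task granularity, with the leaves being the schedulable tasks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of recursive pRange partitioning over an index space [begin, end).
// Leaves are the smallest schedulable entities; the tree of subranges is
// what a scheduler would walk. (Hypothetical type, not STAPL's pRange.)
struct Subrange {
  std::size_t begin, end;
  std::vector<Subrange> children;

  bool is_leaf() const { return children.empty(); }

  void partition(std::size_t min_size) {
    std::size_t len = end - begin;
    if (len <= min_size) return;  // small enough: stays a single task
    std::size_t mid = begin + len / 2;
    children = {{begin, mid, {}}, {mid, end, {}}};
    for (auto& c : children) c.partition(min_size);
  }

  std::size_t leaves() const {  // number of schedulable tasks
    if (is_leaf()) return 1;
    std::size_t s = 0;
    for (const auto& c : children) s += c.leaves();
    return s;
  }
};
```

Attaching the same function object to every leaf gives the SPMD case noted above; attaching different ones per subrange gives task parallelism over the same structure.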
15
pRange Example
- Each subrange is a task
- The boundary of each subrange is a set of cut edges
- Data from several threads may lie in a subrange
  - If the pRange partition matches the data distribution, then all data access is local
[Diagram: a pRange defined on application data stored in a pMatrix across Threads 1 and 2, partitioned into Subranges 1-6, with dependences between subranges and a function object per subrange]
16
pRange Example
- Each subrange has a boundary and a function object
- Data from several threads may lie in a subrange
  - The pMatrix is distributed
  - If the subrange partition matches the data distribution, then all data access is local
- DDGs can be defined on the subranges of the pRange and on the elements inside each subrange
  - No DDG is shown here
- Subranges of the pRange
  - Matrix elements lie in several subranges
  - Each subrange has a function object
- Partitioning of a subrange
  - Subranges can be recursively partitioned
  - Each subrange has a function object
17
Overview
- pContainers
- pRange
- pAlgorithms
- RTS & ARMI communication infrastructure
- Applications using STAPL
18
pAlgorithms
- A pAlgorithm is a set of parallel task objects
  - input for the parallel tasks is specified by the pRange
  - (intermediate) results are stored in pContainers
  - ARMI is used for communication between parallel tasks
- pAlgorithms in STAPL
  - Parallel counterparts of the STL algorithms
  - Additional parallel algorithms:
    - list ranking
    - parallel strongly connected components
    - parallel Euler tour
    - etc.
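List ranking, cited above as an algorithm with no STL counterpart, computes each list node's distance to the tail. The classic parallel formulation is pointer jumping in O(log n) synchronous rounds; the sketch below simulates those rounds sequentially to show the structure a pAlgorithm would parallelize (each round's inner loop is the parallel step).

```cpp
#include <cassert>
#include <vector>

// List ranking by pointer jumping, simulated sequentially.
// Convention: next[i] == i marks the tail node; rank[i] is the distance
// from node i to the tail. Each round halves the remaining distances,
// which is why the parallel version takes O(log n) rounds.
std::vector<int> list_rank(std::vector<int> next) {
  const int n = static_cast<int>(next.size());
  std::vector<int> rank(n);
  for (int i = 0; i < n; ++i) rank[i] = (next[i] == i) ? 0 : 1;
  bool done = false;
  while (!done) {
    done = true;
    const std::vector<int> r = rank;   // snapshots mimic the synchronous
    const std::vector<int> nx = next;  // update of one parallel round
    for (int i = 0; i < n; ++i) {      // this loop is the parallel step
      if (nx[i] == i) continue;        // tail: nothing to do
      rank[i] = r[i] + r[nx[i]];       // absorb successor's rank...
      next[i] = nx[nx[i]];             // ...and jump over it
      if (next[i] != nx[i]) done = false;
    }
  }
  return rank;
}
```

In STAPL the list would live in a pContainer and the round's loop would be a pRange-driven parallel task; the snapshot arrays stand in for the barrier between rounds.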
19
Algorithm Adaptivity in STAPL
- Problem: parallel algorithms are highly sensitive to:
  - Architecture: number of processors, memory interconnection, cache, available resources, etc.
  - Environment: thread management, memory allocation, operating system policies, etc.
  - Data characteristics: input type, layout, etc.
- Solution: adaptively choose the best algorithm from a library of options at run-time
- Adaptive patterns?
20
Adaptive Framework: Overview of Approach
- Given: multiple implementation choices for the same high-level algorithm.
- STAPL installation: analyze each pAlgorithm's performance on the system and create a selection model.
- Program execution: gather parameters, query the model, and use the predicted algorithm.
[Diagram: installation benchmarks feed architecture & environment data and algorithm performance into a data repository and selection model; user code, the parallel algorithm choices, data characteristics, and run-time tests yield the selected algorithm inside the STAPL adaptive executable]
21
Model Generation
- Installation benchmarking
  - Determine the parameters that may affect performance (e.g., number of procs, input size, algorithm-specific parameters, ...)
  - Run all algorithms on a sampling of the instance space
  - Insert the timings from the runs into a performance database
- Create a decision model
  - A generic interface enables learners to compete (currently: decision tree, neural net, naïve Bayes classifier), judged on predicted accuracy (10-way validation test)
  - The "winning" learner outputs a query function in C++:
    func* predict_pAlgorithm(attribute1, attribute2, ...)
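A decision tree compiles naturally into such a query function: nested branches over the gathered attributes returning an algorithm choice. The sketch below is purely hypothetical (the thresholds, attributes, and winning algorithms are platform- and model-specific, and are invented here for illustration only):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical shape of a generated query function. The real emitted
// function, its attribute set, and its thresholds come from the trained
// model; every constant below is made up for illustration.
std::string predict_pAlgorithm(std::size_t procs, std::size_t n,
                               double dist_norm) {
  if (dist_norm < 0.05) return "column_sort";        // nearly sorted input
  if (procs <= 4 && n < 1000000) return "radix_sort";  // small runs
  return "sample_sort";
}
```

Because the learner emits ordinary C++, the query costs a handful of comparisons at run-time, which is what makes per-call selection affordable.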
22
Runtime Algorithm Selection
- Gather parameters
  - Immediately available (e.g., number of procs)
  - Computed (e.g., a disorder estimate for sorting)
- Query the model and execute
  - The query function is statically linked at compile time. Current work: dynamic linking with online model refinement.
23
Experiments
- Investigated two operations:
  - Parallel sorting
  - Parallel matrix multiplication
- Three platforms:
  - 128-processor SGI Altix
  - 1152-node, dual-processor Xeon cluster
  - 68-node, 16-way IBM SMP cluster
24
Parallel Sorting Algorithms
- Sample sort
  - Samples to define per-processor bucket thresholds
  - Scans and distributes elements into the buckets
  - Each processor sorts its local elements
- Radix sort
  - Parallel version of the linear-time sequential approach
  - Passes over the data multiple times, each time considering r bits
- Column sort
  - O(lg n) running time
  - Requires 4 local sorts and 4 communication steps
  - Uses the pMatrix data structure for workspace
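The three sample-sort steps listed above can be sketched sequentially (the structure is the standard algorithm; the parallel version scatters the buckets across processors, and real implementations sample sparsely rather than sorting the whole input to pick splitters):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential sketch of sample sort: pick splitters, scatter into
// buckets, sort each bucket ("each processor sorts local elements"),
// concatenate. Buckets are ordered, so the concatenation is sorted.
std::vector<int> sample_sort(std::vector<int> v, std::size_t buckets) {
  if (v.size() <= 1 || buckets < 2) {
    std::sort(v.begin(), v.end());
    return v;
  }
  // 1. sample: evenly spaced splitters from a sorted copy
  //    (a real implementation samples a small subset instead)
  std::vector<int> sample = v;
  std::sort(sample.begin(), sample.end());
  std::vector<int> splitters;
  for (std::size_t b = 1; b < buckets; ++b)
    splitters.push_back(sample[b * sample.size() / buckets]);
  // 2. scatter: each element goes to the first bucket whose
  //    threshold exceeds it
  std::vector<std::vector<int>> bucket(buckets);
  for (int x : v) {
    std::size_t b = std::upper_bound(splitters.begin(), splitters.end(), x) -
                    splitters.begin();
    bucket[b].push_back(x);
  }
  // 3. local sort per bucket, then concatenate
  std::vector<int> out;
  for (auto& bk : bucket) {
    std::sort(bk.begin(), bk.end());
    out.insert(out.end(), bk.begin(), bk.end());
  }
  return out;
}
```

The quality of the splitters controls load balance across processors, which is one reason the best sort depends on the data characteristics the model tracks.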
25
Sorting Attributes
Attributes used to model the sorting decision:
- Processor count
- Data type
- Input size
- Max value
  - Smaller value ranges may favor radix sort by reducing the required passes
- Presortedness
  - Generate data by varying the initial state (sorted, random, reversed) and % displacement
  - Quantified at runtime with a normalized average distance metric derived from input sampling
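One plausible reading of the "normalized average distance" presortedness metric (the slide does not give the exact formula, so the definition below is an assumption, and it scans the whole input where the real metric samples): average how far each element sits from its position in sorted order, normalized by n. Under this definition a sorted input scores 0 and a fully reversed input scores 0.5.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed disorder estimate: mean displacement of each element from its
// sorted position, normalized by n. Sorted -> 0.0, reversed -> 0.5.
// (Illustrative definition; not necessarily STAPL's exact metric.)
double dist_norm(const std::vector<int>& v) {
  if (v.empty()) return 0.0;
  std::vector<int> s = v;
  std::sort(s.begin(), s.end());
  double total = 0.0;
  for (std::size_t i = 0; i < v.size(); ++i) {
    std::size_t pos = std::lower_bound(s.begin(), s.end(), v[i]) - s.begin();
    total += std::fabs(static_cast<double>(pos) - static_cast<double>(i));
  }
  return total / (static_cast<double>(v.size()) * v.size());
}
```

A metric of this shape is cheap enough to evaluate on a sample at run-time, which is what makes it usable as a model attribute before the sort begins.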
26
Training Set Creation
1000 training inputs per platform, by uniform random sampling of the space defined below:

  Parameter      Value
  Processors*    2, 4, 8, 16, 32, 64
  Data Type      int, double
  Input Size     (100K..20M) * P / sizeof(T)
  Max Value      N/1000, N/100, N/10, N, 3N, MAX_INT
  Input Order    sorted, reverse sorted, random
  Displacement   0..20% of N**

* P = 64 on the Linux cluster and frost
** only for sorted and reverse-sorted inputs
27
Model Error Rate
Model accuracy with all training inputs is 94%, 98%, and 94% on the Cluster, Altix, and SMP Cluster, respectively.
28
Selected model per platform:
  Altix        F(p, dist_norm)
  Cluster      F(p, n, dt, dist_norm, max)
  SMP Cluster  F(p, n, dt, dist_norm, max)
29
Parallel Sorting: Experimental Results
[Figure: Altix model decision tree validation (N=120M) on Altix; includes an interpretation of dist_norm ranging over sorted (0), reversed (0.5), and random]
30
Overview
- pContainers
- pRange
- pAlgorithms
- RTS & ARMI communication infrastructure
- Applications using STAPL
31
Current Implementation Protocols
- Shared memory (OpenMP/Pthreads)
  - shared request queues
- Message passing (MPI-1.1)
  - sends/receives
- Mixed mode
  - combination of MPI with threads
  - flat view of parallelism (for now)
  - takes advantage of shared memory
32
STAPL Run-Time System
- Scheduler
  - Determines an execution order (DDG)
  - Policies:
    - Automatic: static, block, dynamic, partial self scheduling, complete self scheduling
    - User-defined
- Executor
  - Executes the DDG
  - Processor assignment
  - Synchronization and communication
[Diagram: a pAlgorithm mapped by the run-time onto Clusters 1-4, with Procs 12-15 inside Cluster 1]
33
ARMI: STAPL Communication Infrastructure
ARMI: Adaptive Remote Method Invocation
- abstraction of the shared-memory and message-passing communication layers
- the programmer expresses fine-grain parallelism that ARMI adaptively coarsens
- support for sync, async, point-to-point, and group communication
ARMI can be as easy/natural as shared memory and as efficient as message passing.
34
ARMI Communication Primitives
- armi_sync
  - question: ask a thread something
  - blocking version: the function doesn't return until the answer is received from the RMI
  - non-blocking version: the function returns without the answer; the program can poll with rtnHandle.ready() and then access ARMI's return value with rtnHandle.value()
- collective operations
  - armi_broadcast, armi_reduce, etc.
  - can adaptively set groups for communication
- arguments are always passed by value
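The ready()/value() handle semantics of the non-blocking call can be emulated with standard C++ futures (an analogy only: ARMI predates std::future and implements its own buffering and transport; RtnHandle and armi_async below are hypothetical names for the sketch):

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <utility>

// Emulation of the non-blocking handle: fire the request, keep working,
// poll ready(), then fetch value(). Not ARMI's implementation -- just
// the caller-visible contract, expressed with std::future.
template <typename T>
struct RtnHandle {
  std::future<T> fut;
  bool ready() {
    return fut.wait_for(std::chrono::seconds(0)) ==
           std::future_status::ready;
  }
  T value() { return fut.get(); }  // blocks if not yet ready
};

template <typename F>
auto armi_async(F f) -> RtnHandle<decltype(f())> {
  // the "remote method" runs concurrently with the caller
  return {std::async(std::launch::async, std::move(f))};
}
```

The pay-off of this pattern is overlap: the caller computes between issuing the request and needing the answer, which is exactly where ARMI's adaptive aggregation gets its slack from.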
35
ARMI Synchronization Primitives
- armi_fence, armi_barrier
  - tree-based barrier
  - implements a distributed termination algorithm to ensure that all outstanding ARMI requests have been sent, received, and serviced
- armi_wait
  - blocks until at least one (possibly more) ARMI request is received and serviced
- armi_flush
  - empties the local send buffer, pushing outstanding ARMI requests to their remote destinations
36
ARMI Request Scheduling: Future Work
- Optimizing communication with a communication thread
  - If servicing multiple computation threads, aggregate all messages to a single remote host and send them together
  - If using different processing resources than the computation, spend free cycles optimizing communication (e.g., the aggregation factor, discovering the program's communication patterns)
- Explore the benefits of extreme platform customization
  - Further clarify the layers of abstraction within ARMI to ease specialized implementations
  - Customize the implementation and consistency model to the target platform
37
Overview
- pContainers
- pRange
- pAlgorithms
- RTS & ARMI communication infrastructure
- Applications using STAPL
38
Particle Transport
Q: What is the particle transport problem?
A: Particle transport is all about counting particles (such as neutrons). Given a physical volume, we want to know how many particles there are, and their locations, directions, and energies.
Q: Why is it an important problem?
A: It is needed for the accurate simulation of complex physical systems such as nuclear reactions, and it requires an estimated 50-80% of the total execution time in multi-physics simulations.
39
Transport Problem Applications
- Oil well logging tool
  - A shaft is dug at a possible well location
  - Radioactive sources are placed in the shaft with monitoring equipment
  - Simulation allows for verification of new techniques and tools
40
Discrete Ordinates Method
An iterative method for solving the first-order form of the transport equation. It discretizes:
- Ω, the angular directions
- R, the spatial domain
- E, the energy variable

Algorithm:
  for each direction in Ω
    for each grid cell in R
      for each energy group in E
41
Discrete Ordinates Method
42
TAXI Algorithm
43
Transport Sweeps
- Involves a sweep of the spatial grid for each direction in Ω.
- For orthogonal grids there are only eight distinct sweep orderings.
- Note: a full transport sweep must process each direction.
44
Multiple Simultaneous Sweeps
- One approach is to sequentially process each direction.
- Another approach is to process each direction simultaneously.
  - Requires processors to sequentially process each direction during the sweep.
45
Sweep Dependence
- Each sweep direction generates a unique dependence graph. A sweep starting from cell 1 is shown.
- For example, cell 3 must wait until cell 1 has been processed, and must itself be processed before cells 5 and 7.
- Note that all cells in the same diagonal plane can be processed simultaneously.
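The diagonal-plane observation above can be made concrete in 2D (a sketch of the general wavefront idea, not the TAXI code): in a sweep from one corner of an orthogonal grid, cell (i, j) depends only on (i-1, j) and (i, j-1), so every cell on anti-diagonal i + j belongs to the same parallel step.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Group the cells of a rows x cols orthogonal grid into wavefront
// levels for a sweep starting at the (0, 0) corner. All cells in
// levels[k] (i.e., with i + j == k) can be processed simultaneously;
// level k must finish before level k + 1 starts.
std::vector<std::vector<std::pair<int, int>>> wavefronts(int rows,
                                                         int cols) {
  std::vector<std::vector<std::pair<int, int>>> levels(rows + cols - 1);
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < cols; ++j)
      levels[i + j].push_back({i, j});
  return levels;
}
```

The available parallelism grows and then shrinks with the diagonal length, which is why processing multiple directions (and angle-sets) simultaneously, as on the previous slide, helps keep processors busy.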
46
pRange Dependence Graph
- Numbers are cellset indices
- Colors indicate processors
[Figure: dependence graphs for angle-sets A, B, C, and D]
47
Adding a Reflecting Boundary
[Figure: cellset sweep orderings (indices 1-32) for angle-sets A-D with one reflecting boundary]
48
Opposing Reflecting Boundaries
[Figure: cellset sweep orderings (indices 1-32) for angle-sets A-D with opposing reflecting boundaries]
49
Strong Scalability
System specs:
- Large, dedicated IBM cluster at LLNL
- 68 nodes
- 16 × 375 MHz POWER3 processors and 16 GB RAM per node
- Nodes connected by an IBM SP switch
Problem info:
- 64x64x256 grid
- 6,291,456 unknowns
50
Work in Progress (Open Topics)
- STAPL algorithms
- STAPL adaptive containers
- ARMI v2 (multi-threaded, communication pattern library)
- STAPL RTS / K42 interface
- A compiler for STAPL:
  - A high-level, source-to-source compiler
  - Understands STAPL blocks
  - Optimizes composition
  - Automates composition
  - Generates checkers for STAPL programs