STAPL: The C++ Standard Template Adaptive Parallel Library

Alin Jula, Ping An, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy M. Amato and Lawrence Rauchwerger
Department of Computer Science, Texas A&M University
Motivation
– Building-block library: nested parallelism
– Inter-operability with existing code: superset of STL
– Portability and performance: layered architecture, run-time adaptivity
Philosophy
– Interface Layer: STL compatible
– Concurrency & Communication Layer: generic parallelism, synchronization
– Software Implementation Layer: instantiates concurrency & communication
– Machine Layer: architecture-dependent code
Related Work

STAPL is compared with AVTL, CHARM++, CHAOS++, CILK*, NESL*, POOMA, PSTL and SPLIT-C* (* = parallel programming language) along these axes:
– Paradigm: STAPL supports both SPMD and MIMD; most of the others are SPMD-only or MIMD-only
– Architecture: STAPL targets both shared and distributed memory
– Nested parallelism: supported by STAPL, CILK and NESL; not by the others
– Adaptivity: STAPL is the only adaptive library in the comparison
– Genericity and irregular data: supported by STAPL; only partially or not at all by the others
– Data decomposition and mapping: automatic and/or user-specified in STAPL
– Scheduling: STAPL offers user-defined, static, dynamic and block scheduling; the others rely on MPI-based scheduling, prioritized execution, decomposition-based scheduling, work stealing, the work-and-depth model, pthread scheduling, the Tulip RTS, or user scheduling
– Communication/computation overlap: supported by STAPL; only by some of the others
STL Overview
– Containers: data is stored in Containers
– Algorithms: STL provides standardized Algorithms
– Iterators: bind Algorithms to Containers; they are generalized pointers

Example (Container, Iterator, Algorithm):
    vector<int> vect;
    ...                               // initialization of 'vect'
    sort(vect.begin(), vect.end());
STAPL Overview
– pContainers: data is stored in pContainers
– pAlgorithms: STAPL provides standardized pAlgorithms
– pRanges: bind pAlgorithms to pContainers
  – Similar to STL Iterators, but must also support parallelism
pRange
The pRange is the parallel counterpart of the STL iterator:
– Binds pAlgorithms to pContainers
– Provides an abstract view of a scoped data space
– The data space is (recursively) partitioned into subranges
It is more than an iterator, since it supports parallelization:
– A scheduler/distributor decides how computation and data structures should be mapped to the machine
– Data dependences among subranges can be represented by a data dependence graph (DDG)
– An executor launches the parallel computation, manages communication, and enforces dependences
pRange
– Provides random access to a partition of the data space
  – The view and access are provided by a collection of iterators describing the pRange boundary
– pRanges are partitioned into subranges
  – Automatically by STAPL, based on machine characteristics, number of processors, partition factors, etc.
  – Manually, according to user-specified partitions
– A pRange can represent relationships among subspaces as a data dependence graph (DDG), used for scheduling
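The recursive partitioning of a data space into disjoint subranges can be sketched in plain C++. This is a toy illustration only: `Range` and `partition` are hypothetical names, not STAPL's actual interface.

```cpp
#include <cstddef>
#include <vector>

// Toy sketch of a range over a data space that can be split into
// disjoint, contiguous subranges, as a STAPL pRange is. Because each
// subrange is itself a Range, partitioning applies recursively
// (nested parallelism). Names are illustrative, not STAPL's API.
template <class Iter>
struct Range {
    Iter first, last;

    std::size_t size() const { return static_cast<std::size_t>(last - first); }

    // Split into n disjoint subranges whose sizes differ by at most one.
    std::vector<Range> partition(std::size_t n) const {
        std::vector<Range> subs;
        const std::size_t total = size(), base = total / n, rem = total % n;
        Iter cur = first;
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t len = base + (i < rem ? 1 : 0);
            subs.push_back(Range{cur, cur + len});
            cur += len;
        }
        return subs;
    }
};
```

A subrange returned by `partition` can itself be partitioned, which is exactly what makes nested parallelism possible.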
pRange
– Each subspace is disjoint and can itself be a pRange (nested parallelism)

    stapl::pRange<Container::iterator> dataRange(segBegin, segEnd);
    dataRange.partition();
    stapl::pRange<Container::iterator> dataSubrange = dataRange.get_subrange(3);
    dataSubrange.partition_like( ... );
pContainer
The pContainer is the parallel counterpart of the STL container:
– Provides parallel and concurrent methods
– Maintains an internal pRange
  – Updated during insert/delete operations
  – Minimizes redistribution
– Completed: pVector, pList, pTree
Example: a pVector combines a pRange with underlying STL vector storage.
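The point that a pContainer keeps its internal pRange consistent across insertions can be illustrated with a toy wrapper (hypothetical names, not STAPL code): appending an element only touches the bookkeeping of the last chunk, so the partition of earlier elements never needs to be rebuilt.

```cpp
#include <cstddef>
#include <vector>

// Toy illustration (not STAPL's API): a container that maintains an
// internal partition of its elements, updating it incrementally on
// insert so the data need not be redistributed on every operation.
class ToyPVector {
    std::vector<int> data_;
    std::vector<std::size_t> chunk_size_;  // one entry per "subrange"
public:
    explicit ToyPVector(std::size_t nchunks) : chunk_size_(nchunks, 0) {}

    // Append at the end: only the last chunk's size changes, so the
    // existing partition of earlier elements is left untouched.
    void push_back(int x) {
        data_.push_back(x);
        ++chunk_size_.back();
    }

    std::size_t size() const { return data_.size(); }
    std::size_t chunk_size(std::size_t i) const { return chunk_size_[i]; }
};
```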
pAlgorithm
The pAlgorithm is the parallel counterpart of the STL algorithm.
Parallel algorithms take as input:
– a pRange
– work functions that operate on subranges; the work function is applied to all subranges

    template <class SubRange>
    class pAddOne : public stapl::pFunction {
    public:
      ...
      void operator()(SubRange& spr) {
        typename SubRange::iterator i;
        for (i = spr.begin(); i != spr.end(); i++)
          (*i)++;
      }
    };
    ...
    p_transform(pRange, pAddOne());
Run-Time System
– Support for different architectures: HP V2200; SGI Origin 2000, SGI Power Challenge
– Support for different paradigms: OpenMP, Pthreads; MPI
– Memory allocation: HOARD
[Figure: a pAlgorithm mapped by the run-time onto clusters 1-4; cluster 4 holds processors 12-15]
Run-Time System
– Scheduler
  – Determines an execution order (DDG)
  – Policies: automatic (static, block, dynamic, partial self-scheduling, complete self-scheduling) or user-defined
– Distributor
  – Hierarchical data distribution
  – Automatic and user-defined
– Executor
  – Executes the DDG: processor assignment, synchronization and communication
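A minimal sketch of an executor enforcing a DDG's ordering, in plain C++ (hypothetical; STAPL's executor additionally handles processor assignment and communication): a task is released only once all of its predecessors have run.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Toy executor sketch: run tasks 0..n-1 in an order consistent with a
// data-dependence graph, releasing a task when all its predecessors are
// done (Kahn's algorithm). deps[i] lists the tasks that must precede i.
std::vector<std::size_t>
execute_ddg(const std::vector<std::vector<std::size_t>>& deps) {
    const std::size_t n = deps.size();
    std::vector<std::size_t> indegree(n, 0);
    std::vector<std::vector<std::size_t>> succ(n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t p : deps[i]) {
            succ[p].push_back(i);
            ++indegree[i];
        }
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < n; ++i)
        if (indegree[i] == 0) ready.push(i);
    std::vector<std::size_t> order;  // stands in for "run the task"
    while (!ready.empty()) {
        std::size_t t = ready.front(); ready.pop();
        order.push_back(t);
        for (std::size_t s : succ[t])
            if (--indegree[s] == 0) ready.push(s);
    }
    return order;
}
```

Tasks with no remaining dependences form the ready set; a real executor would hand that set to idle processors instead of running it serially.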
STL to STAPL Automatic Translation
– A C++ preprocessor converts STL code into STAPL parallel code
– Iterators are used to construct pRanges
– The user is responsible for safe parallelization

Original STL code:
    accumulate(x.begin(), x.end(), 0);
    for_each(x.begin(), x.end(), foo());

After the preprocessing phase:
    pi_accumulate(x.begin(), x.end(), 0);
    pi_for_each(x.begin(), x.end(), foo());

After pRange construction:
    p_accumulate(x_pRange, 0);
    p_for_each(x_pRange, foo());

In some cases automatic translation achieves performance similar to hand-written STAPL code (about 5% deterioration).
Performance: p_find
[Figure: experimental results on HP V2200]
Performance: p_inner_product
[Figure: experimental results on HP V2200]
pTree
– The parallel tree supports bulk commutative operations in parallel
– Each processor is assigned a set of subtrees to maintain
– Operations on the base are atomic; operations on subtrees are parallel
[Figure: base (atomic) and subtrees (parallel), owned by P1, P2, P3]

Example: parallel insertion algorithm. Each processor is given a set of elements:
1) Each processor creates local buckets corresponding to the subtrees
2) Each processor collects the buckets that correspond to its subtrees
3) Elements in the subtree buckets are inserted into the tree in parallel
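The bulk-insertion idea can be emulated with standard containers and threads (an illustration of the scheme, not STAPL's pTree): subtrees are modeled as `std::set`s owning disjoint key ranges, so in step 3 each thread inserts into its own subtree with no locking.

```cpp
#include <set>
#include <thread>
#include <vector>

// Illustration of the pTree bulk-insert scheme with standard containers.
// Each "processor" owns one subtree; keys are bucketed by value range,
// buckets are exchanged, then each owner inserts its bucket in parallel.
// Because owners touch disjoint sets, step 3 needs no locking.
std::vector<std::set<int>> bulk_insert(const std::vector<int>& elems,
                                       int nsubtrees, int range_per_subtree) {
    // Steps 1+2: build one bucket per subtree (done centrally here).
    std::vector<std::vector<int>> buckets(nsubtrees);
    for (int x : elems) {
        int b = x / range_per_subtree;
        if (b >= nsubtrees) b = nsubtrees - 1;
        buckets[b].push_back(x);
    }
    // Step 3: each owner inserts its own bucket, in parallel.
    std::vector<std::set<int>> subtrees(nsubtrees);
    std::vector<std::thread> workers;
    for (int p = 0; p < nsubtrees; ++p)
        workers.emplace_back([&, p] {
            for (int x : buckets[p]) subtrees[p].insert(x);
        });
    for (auto& w : workers) w.join();
    return subtrees;
}
```

The commutativity requirement shows up here too: the result is the same regardless of the order in which threads finish.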
pTree
– Basis for the STAPL pSet, pMultiSet, pMap and pMultiMap containers
  – Covers all remaining STL containers
– Results are sequentially consistent, although the internal structure may vary
– Requires negligible additional memory
– pTrees can be used either sequentially or in parallel in the same execution
  – Allows switching back and forth between parallel and sequential use
Performance: pTree
[Figure: experimental results on HP V2200]
Algorithm Adaptivity
Problem: parallel algorithms are highly sensitive to
– Architecture: number of processors, memory interconnection, cache, available resources, etc.
– Environment: thread management, memory allocation, operating system policies, etc.
– Data characteristics: input type, layout, etc.
Solution: implement a number of different algorithms and adaptively choose the best one at run time.
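The run-time selection step can be sketched as a small dispatch table (names and predicates here are illustrative, not STAPL's adaptive framework): each registered variant carries an applicability test over the run-time characteristics, and the framework dispatches to the first variant whose test holds.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy sketch of run-time algorithm selection (illustrative names).
// Each variant registers a predicate over run-time characteristics;
// the framework dispatches to the first variant whose predicate holds.
struct Inputs {
    bool integer_keys;  // data characteristics
    int  num_procs;     // machine characteristics
};

struct Variant {
    std::string name;
    std::function<bool(const Inputs&)> applies;
};

std::string choose(const std::vector<Variant>& variants, const Inputs& in) {
    for (const auto& v : variants)
        if (v.applies(in)) return v.name;
    return "default";
}
```

With sorting variants registered in order of preference (radix for integer keys, merge for small processor counts, column otherwise), `choose` reproduces a decision-tree style of dispatch.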
Adaptive Framework
[Figure: the adaptive framework]
Case Study: Adaptive Sorting

Sort    | Strength                   | Weakness
--------|----------------------------|----------------------
Column  | Theoretically time-optimal | Many passes over data
Merge   | Low memory overhead        | Poor scalability
Radix   | Extremely fast             | Integers only
Sample  | Two passes over data       | High memory overhead
Performance: Adaptive Sorting
[Figure: performance on 10 million integers on the HP V2200, SGI Power Challenge and SGI Origin 2000]
Performance: Run-Time Tests

    if (data_type == INTEGER)
      radix_sort();
    else if (num_procs < 5)
      merge_sort();
    else
      column_sort();

[Figure: experimental results on SGI Origin 2000]
Performance: Molecular Dynamics*
Discrete-time particle interaction simulation:
– Written in STL
– Time steps calculate system evolution (dependence)
– Parallelized within each time step
STAPL utilization:
– pAlgorithms: p_for_each, p_transform, p_accumulate
– pContainers: pVector (push_back)
– Automatic vs. manual parallelization (5% performance deterioration)

* Code written by Danny Rintoul at Sandia National Labs
Performance: Molecular Dynamics
[Figure: execution time (sec) vs. number of processors on HP V2200, for 108K and 23K particles]
– The code is partially parallelized (up to 49%)
– Input sensitive
– Use pTree on the rest
Performance: Particle Transport*
Generic particle transport solver:
– Regular and arbitrary grids
– Numerically intensive; 25K lines of C++ STAPL code
– The sweep function is unaware of parallel issues
STAPL utilization:
– pAlgorithms: p_for_each
– pContainers: pVector (for data distribution)
– Scheduler: determines grid data dependencies
– Executor: satisfies data dependencies

* Joint effort between Texas A&M Nuclear Engineering and Computer Science, funded by DOE ASCI
Performance: Particle Transport
Profile and speedups on an SGI Origin 2000 using 16 processors:

Code Region                     | % Seq. | Speedup
--------------------------------|--------|--------
Create computational grid       |        |
Scattering across group-sets    | 0.05   | N/A
Scattering within a group-set   |        |
Sweep                           |        |
Convergence across group-sets   | 0.05   | N/A
Convergence within group-sets   | 0.05   | N/A
Other                           | 2.59   | N/A
Total                           |        |
Performance: Particle Transport
[Figure: experimental results on SGI Origin 2000]
Summary
– Parallel equivalent of STL
  – Many codes can immediately utilize STAPL
  – Automatic translation
– Building-block library
  – Portability (layered architecture)
  – Performance (adaptivity)
  – Automatic recursive parallelism
– STAPL performs well both in small pAlgorithm test cases and in large codes
STAPL Status and Current Work
– pAlgorithms: fully implemented
– pContainers: pVector, pList, pTree
– pRange: mostly implemented
– Run-Time
  – Executor: fully implemented
  – Scheduler: fully implemented
  – Distributor: work in progress
– Adaptive mechanism (case study: sorting)
– OpenMP + MPI (mixed): work in progress
  – OpenMP version: fully implemented
  – MPI version: work in progress
Project funded by NSF and DOE.