1
A Parallel Communication Infrastructure for STAPL
Steven Saunders and Lawrence Rauchwerger, PARASOL Laboratory, Department of Computer Science, Texas A&M University
2
Overview
STAPL
- the Standard Template Adaptive Parallel Library
- a parallel superset of the C++ Standard Template Library
- provides transparent communication through parallel containers and algorithms
The Parallel Communication Infrastructure
- the foundation for communication in STAPL
- maintains high performance and portability
- simplifies parallel programming
3
Common Communication Models
[Figure: common communication models arranged by level of abstraction]
4
STL Overview
The C++ Standard Template Library
- a set of generic containers and algorithms, generically bound by iterators
- containers: data structures with methods
- algorithms: operations over a sequence of data
- iterators: abstract view of data; an abstracted pointer supporting dereference, increment, and equality
Example:
  std::vector<int> v( 100 );
  ...initialize v...
  std::sort( v.begin(), v.end() );
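For reference, a complete, compilable version of the example above might look as follows; the random initialization is an illustrative assumption, since the slide elides it.

  #include <algorithm>
  #include <cstdlib>
  #include <iostream>
  #include <vector>

  int main() {
      std::vector<int> v( 100 );
      for ( int& x : v )                   // initialize v (illustration: pseudo-random values)
          x = std::rand() % 1000;
      std::sort( v.begin(), v.end() );     // generic algorithm, bound to the container via its iterators
      std::cout << "min = " << v.front() << ", max = " << v.back() << "\n";
      return 0;
  }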
5
STAPL Overview
The Standard Template Adaptive Parallel Library
- a set of generic, parallel containers and algorithms, generically bound by pRanges
- pContainers: distributed containers
- pAlgorithms: parallel algorithms
- pRange: abstract view of partitioned data; a view of distributed data with random access and data dependencies
Example:
  stapl::pvector<int> pv( 100 );
  ...initialize pv...
  stapl::p_sort( pv.get_prange() );
6
Fundamental Requirements for the Communication Infrastructure
Communication
- statement: tell a process something
- question: ask a process for something
Synchronization
- mutual exclusion: ensure atomic access
- event ordering: ensure dependencies are satisfied
7
Design
Goal:
- abstract the underlying communication model to enable efficient support for the requirements
- focus on parallelism for STL-oriented C++ code
Solution:
- message passing: makes STL-style C++ code difficult to write
- shared-memory: not yet implemented on large systems
- remote method invocation:
  - can support the communication requirements
  - can support high performance through message-passing or shared-memory implementations
  - maps cleanly to object-oriented C++
8
Design
Communication
- statement:
  template<class Class, class Rtn, class Arg1...>
  void async_rmi( int destNode, Class* destPtr, Rtn (Class::*method)(Arg1...), Arg1 a1... )
- question:
  template<class Class, class Rtn, class Arg1...>
  Rtn sync_rmi( int destNode, Class* destPtr, Rtn (Class::*method)(Arg1...), Arg1 a1... )
- groups: void broadcast_rmi(), Rtn collect_rmi()
Synchronization
- mutual exclusion: remote methods are atomic
- event ordering: void rmi_fence(), void rmi_wait()
(a hypothetical call site is sketched below)
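To make the interface concrete, a hypothetical call site written against the signatures above could look like this; the Counter class, the remote object pointer, and the node number are illustrative assumptions, not part of STAPL.

  class Counter {
  public:
    void add( int x ) { value += x; }    // a "statement": no reply is needed
    int  get()        { return value; }  // a "question": the caller needs an answer
  private:
    int value = 0;
  };

  int example( Counter* remoteCounter ) {  // remoteCounter: the Counter owned by node 3
    stapl::async_rmi( 3, remoteCounter, &Counter::add, 10 );    // statement: fire-and-forget
    stapl::rmi_fence();                                         // event ordering: wait until the update completes
    return stapl::sync_rmi( 3, remoteCounter, &Counter::get );  // question: blocks until the value returns
  }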
9
Design: Data Transfer
Transfer the work to the data
- only one instance of an object exists at a time: no replication and merging as in DSM
- transfer granularity: method arguments
- pass-by-value: eliminates sharing
Argument classes must implement a method that defines their type
- internal variables are either local or dynamic
- used to pack/unpack as necessary (see the sketch below)
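As a sketch of what such an argument class might look like: the define_type method and the typer's local/dynamic calls below are assumptions based on the slide's description of local and dynamic variables, not a verbatim STAPL interface.

  class Buffer {
  public:
    Buffer( int n ) : size( n ), data( new double[n] ) {}
    void define_type( stapl::typer& t ) {
      t.local( size );          // local variable: copied directly
      t.dynamic( data, size );  // dynamic (heap) variable: packed and unpacked only when
                                // the argument must cross address spaces
    }
  private:
    int     size;
    double* data;
  };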
10
Integration with STAPL
[Layer diagram, top to bottom: User Code → pAlgorithms, pContainers, pRange → Address Translator, Communication Infrastructure → Pthreads, OpenMP, MPI, Native]
11
Integration: pContainers
Set of distributed sub-containers
- pContainer methods abstract communication
- the decision between shared-memory and message passing is made in the communication infrastructure
Communication patterns:
- access: access data in another sub-container; handled by sync_rmi
- update: update data in another sub-container; handled by async_rmi (sketched below)
- group update: update all sub-containers; handled by broadcast_rmi
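A minimal sketch of the "update" pattern, mirroring the pvector::operator[] shown later in the case study; the set method and its locality test are hypothetical placeholders, not STAPL source.

  template<class T>
  void pvector<T>::set( const int index, const T& value ) {
    if( /*...index is local...*/ )
      /*...local element...*/ = value;               // direct update within this sub-container
    else
      stapl::async_rmi( /*owning node*/, ...,        // forward the update to the owning
                        &stapl::pvector<T>::set,     // sub-container; no reply is needed
                        index, value );
  }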
12
Integration: pAlgorithms
Set of parallel_task objects
- the input per parallel_task is specified by the pRange
- intermediate results are stored in pContainers
- RMI is used for communication between parallel_tasks
Communication patterns:
- event ordering: tell workers when to start
- data parallel: apply the operation in parallel, followed by a parallel reduction; handled by collect_rmi (see the sketch below)
- bulk communication: large numbers of small messages; handled by async_rmi
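Schematically, the data-parallel pattern inside a parallel_task could look like the following; the slide does not show collect_rmi's argument list, so the reduction call is only indicative, and first1/last1/first2 stand for the locally owned sub-ranges provided by the pRange.

  // reduce the local chunk of the data (std::inner_product from <numeric>)...
  double local = std::inner_product( first1, last1, first2, 0.0 );
  // ...then combine the per-task partial results across all parallel_tasks
  // (argument list indicative only, e.g. combining with addition)
  double global = stapl::collect_rmi( /*...std::plus<double>()...*/ local );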
13
Case Study: Sample Sort
A common parallel algorithm for sorting, based on distributing data into buckets
Algorithm:
- sample a set of splitters from the input
- send each element to the appropriate bucket based on the splitters (e.g., elements less than splitter 0 are sent to bucket 0)
- sort each bucket
[Figure: input elements distributed by the splitters into buckets]
14
Case Study: Sample Sort
//...all processors execute code in parallel...
...
stapl::pvector<int> splitters( p-1 );
splitters[id] = //sample
stapl::pvector< vector<int> > buckets( p );
for( i=0; i<size; i++ ) {                        //distribute
  int dest = //...appropriate bucket based on splitters...
  stapl::async_rmi( dest, ..., &stapl::pvector::push_back, input[i] );
}
stapl::rmi_fence();
sort( buckets[id].begin(), buckets[id].end() );  //sort

template<class T>
T& pvector<T>::operator[]( const int index ) {
  if( /*...index is local...*/ )
    return //...element...
  else
    return stapl::sync_rmi( /*owning node*/, ..., &stapl::pvector<T>::operator[], index );
}
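For comparison, a purely sequential, std-only sketch of the same idea (not STAPL code): the splitters partition the value range, upper_bound picks the destination bucket, and concatenating the independently sorted buckets yields the sorted sequence.

  #include <algorithm>
  #include <vector>

  std::vector<std::vector<int>>
  sample_sort_buckets( const std::vector<int>& input, std::vector<int> splitters ) {
    std::sort( splitters.begin(), splitters.end() );
    std::vector<std::vector<int>> buckets( splitters.size() + 1 );
    for ( int x : input ) {
      // the number of splitters <= x selects the bucket, so elements less
      // than splitter 0 land in bucket 0, as on the previous slide
      std::size_t dest =
          std::upper_bound( splitters.begin(), splitters.end(), x ) - splitters.begin();
      buckets[dest].push_back( x );
    }
    for ( auto& b : buckets )           // in STAPL each processor sorts its own bucket;
      std::sort( b.begin(), b.end() );  // here the buckets are simply sorted in turn
    return buckets;
  }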
15
Implementation Issues
RMI request scheduling
- tradeoff: local computation versus servicing incoming RMI requests
- current solution: explicit polling
async_rmi
- automatic buffering (aggregation) to reduce network congestion
rmi_fence
- deadlock: native barriers block while waiting, but a node must poll while waiting
  - current solution: custom fence implementation (see the schematic below)
- completion: RMIs can invoke other RMIs...
  - current solution: overlay a distributed termination-detection algorithm
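The fence issue can be pictured with a short schematic (illustrative names, not STAPL source): a node that reaches the fence cannot simply block in a native barrier, because it must keep servicing RMI requests that other nodes may still send, and it may only leave once termination detection reports that no requests remain in flight.

  void rmi_fence() {                        // schematic only
    announce_arrival();                     // hypothetical: tell the other nodes this one has arrived
    while( !all_nodes_arrived() || !no_rmis_in_flight() )  // distributed termination detection
      poll();                               // hypothetical: execute any incoming RMI requests
  }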
16
Performance
Two implementations:
- Pthreads (shared-memory)
- MPI-1.1 (message passing)
Three benchmark platforms:
- Hewlett Packard V2200: 16 processors, shared-memory SMP
- SGI Origin 3800: 48 processors, distributed shared-memory CC-NUMA
- Linux Cluster: 8 processors, 1 Gb/s Ethernet switch
17
Latency
Time (µs) to ping-pong a message between two processors using explicit communication or STAPL (async_rmi/sync_rmi).
- Overhead due to the high cost of RMI request creation and scheduling versus the low cost of communication.
- Overhead due to native MPI optimizations that are not applicable with RMI.
18
Effect of Automatic Aggregation
19
Effect of Automatic Aggregation
20
Native Barrier vs. STAPL Fence
Message passing: overhead due to native optimizations within MPI_Barrier that are unavailable to STAPL.
Shared-memory: the majority of the overhead is due to the cost of polling and termination detection.
21
Sample Sort for 10M Integers
22
Sample Sort for 10M Integers
23
Inner Product for 40M Integers
Time (s) to compute the inner product of 40M-element vectors using shared-memory
24
Inner Product for 40M Integers
Time (s) to compute the inner product of 40M-element vectors using message passing
25
Conclusion
STAPL
- provides transparent communication through parallel containers and algorithms
The Parallel Communication Infrastructure
- the foundation for communication in STAPL
- maintains high performance and portability
- simplifies parallel programming
Future Work
- mixed-mode MPI and OpenMP
- additional implementation issues