Advances in PUMI for High Core Count Machines
Dan Ibanez, Micah Corah, Seegyoung Seol, Mark Shephard
Scientific Computation Research Center, Rensselaer Polytechnic Institute
2/27/2013
Outline
1. Distributed Mesh Data Structure
2. Phased Message Passing
3. Hybrid (MPI/thread) Programming Model
4. Hybrid Phased Message Passing
5. Hybrid Partitioning
6. Hybrid Mesh Migration
Unstructured Mesh Data Structure
[Diagram: a mesh divided into parts; each part holds regions, faces, edges, and vertices connected by pointers in the data structure.]
Distributed Mesh Representation
- Mesh elements are assigned to parts
- Entities are uniquely identified by handle or global ID
- Each part is treated as a serial mesh with the addition of part boundaries
- Part boundary: groups of mesh entities on shared links between parts
- Remote copy: duplicated entity copy on a non-local part
- Resident part set: list of parts on which the entity exists
- A process can hold multiple parts
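To make the bookkeeping concrete, here is a minimal sketch in C++ of the per-entity parallel data described above (global ID, resident part set, remote copies); the type and field names are illustrative assumptions, not the actual PUMI data structure.

```cpp
// Hypothetical per-entity parallel data for a distributed mesh.
#include <map>
#include <set>

struct MeshEntity;

struct EntityParallelData {
  long globalId;                           // unique ID across the whole mesh
  std::set<int> residentParts;             // parts on which this entity exists
  std::map<int, MeshEntity*> remoteCopies; // part ID -> copy on that non-local part
  bool onPartBoundary() const { return residentParts.size() > 1; }
};

struct MeshEntity {
  int dimension;                           // 0 vertex, 1 edge, 2 face, 3 region
  EntityParallelData parallel;
  // downward/upward adjacency pointers omitted from this sketch
};
```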
Message Passing
Primitive functional set:
- Size: number of members in the group
- Rank: ID of self in the group
- Send: non-blocking synchronous send
- Probe: non-blocking probe
- Receive: blocking receive
- Non-blocking barrier (ibarrier):
  - API call 1: begin the ibarrier
  - API call 2: wait for ibarrier termination
  - Used for phased message passing
  - Will be available in MPI 3; for now a custom solution is used
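As a rough illustration, the primitive set above could be wrapped over MPI as follows; the wrapper names are assumptions, not the actual PUMI API. Note that the send is a non-blocking synchronous send (MPI_Issend): it completes only once the matching receive has started, which is what the phased termination detection described later relies on.

```cpp
#include <mpi.h>

int commSize(MPI_Comm c) { int n; MPI_Comm_size(c, &n); return n; }  // Size
int commRank(MPI_Comm c) { int r; MPI_Comm_rank(c, &r); return r; }  // Rank

// Send: non-blocking synchronous send
void send(const void* data, int bytes, int to, int tag, MPI_Comm c,
          MPI_Request* req) {
  MPI_Issend(data, bytes, MPI_CHAR, to, tag, c, req);
}

// Probe: non-blocking probe for any pending message with the given tag
bool probe(int tag, MPI_Comm c, MPI_Status* status) {
  int flag;
  MPI_Iprobe(MPI_ANY_SOURCE, tag, c, &flag, status);
  return flag != 0;
}

// Receive: blocking receive
void receive(void* data, int bytes, int from, int tag, MPI_Comm c) {
  MPI_Recv(data, bytes, MPI_CHAR, from, tag, c, MPI_STATUS_IGNORE);
}
```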
ibarrier Implementation
Using all non-blocking point-to-point calls:
- For N ranks, lg(N) messages go to and from rank 0 (a reduce toward rank 0 followed by a broadcast)
- Uses a separate MPI communicator
[Diagram: ranks 0 through 4 reducing toward rank 0, then broadcasting back out.]
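A hedged sketch of one way to build such a custom non-blocking barrier: a binary tree of one-byte point-to-point messages reduces toward rank 0 and then broadcasts back out, giving on the order of lg(N) hops to and from rank 0, over a communicator reserved for barrier traffic. The class and method names are illustrative, not the actual implementation.

```cpp
#include <mpi.h>
#include <vector>

class IBarrier {
  MPI_Comm comm;              // separate communicator for barrier traffic
  int rank, size, parent;
  std::vector<int> children;  // binary-tree children of this rank
  std::vector<MPI_Request> reqs;
  std::vector<char> childBuf; // one dummy byte per expected child message
  char parentBuf, sendBuf;
  enum { UP, DOWN, DONE } state;
public:
  explicit IBarrier(MPI_Comm c) : comm(c), state(DONE) {
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    parent = (rank - 1) / 2;
    for (int child = 2 * rank + 1; child <= 2 * rank + 2; ++child)
      if (child < size)
        children.push_back(child);
  }
  void begin() {              // API call 1: begin the ibarrier
    state = UP;
    childBuf.assign(children.size(), 0);
    reqs.assign(children.size(), MPI_REQUEST_NULL);
    for (size_t i = 0; i < children.size(); ++i)
      MPI_Irecv(&childBuf[i], 1, MPI_CHAR, children[i], 0, comm, &reqs[i]);
  }
  bool done() {               // API call 2: poll for termination
    if (state == DONE) return true;
    int flag = 0;
    MPI_Testall((int)reqs.size(), reqs.data(), &flag, MPI_STATUSES_IGNORE);
    if (!flag) return false;
    if (state == UP) {        // every child has checked in
      reqs.clear();
      if (rank != 0) {        // notify parent, then await the wave back down
        reqs.assign(2, MPI_REQUEST_NULL);
        MPI_Isend(&sendBuf, 1, MPI_CHAR, parent, 0, comm, &reqs[0]);
        MPI_Irecv(&parentBuf, 1, MPI_CHAR, parent, 0, comm, &reqs[1]);
      }
      state = DOWN;
      return done();
    }
    for (size_t i = 0; i < children.size(); ++i)  // release the children
      MPI_Send(&sendBuf, 1, MPI_CHAR, children[i], 0, comm); // 1 byte, eager
    state = DONE;
    return true;
  }
};
```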
Phased Message Passing
Similar to Bulk Synchronous Parallel; uses the non-blocking barrier.
1. Begin phase
2. Send all messages
3. Receive any messages sent this phase
4. End phase
Benefits:
- Efficient termination detection when neighbors are unknown
- Phases are implicit barriers, which simplifies algorithms
- Allows buffering all messages per rank per phase
Phased Message Passing
Implementation:
1. Post all sends for this phase
2. While local sends are incomplete: receive any message
   (local sends are now complete; recall that they are synchronous)
3. Begin the "stopped sending" ibarrier
4. While that ibarrier is incomplete: receive any message
   (all sends are now complete, so this rank can stop receiving)
5. Begin the "stopped receiving" ibarrier
6. While that ibarrier is incomplete: compute
   (all ranks have now stopped receiving, so it is safe to send the next phase)
7. Repeat
[Diagram: alternating send/receive intervals on each rank, with the ibarriers acting as signal edges between phases.]
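Below is a condensed sketch of one such phase, assuming MPI-3's MPI_Ibarrier stands in for the custom ibarrier above and leaving the handling of a received message abstract; the function and parameter names are illustrative. The synchronous sends are the key: a send only completes once its receive has begun, so once every rank has both finished its sends and passed the barrier, no message can still be in flight.

```cpp
#include <mpi.h>
#include <vector>

void runPhase(MPI_Comm comm,
              const std::vector<int>& destinations,
              const std::vector<std::vector<char> >& outgoing,
              void (*handle)(const std::vector<char>&, int from)) {
  // 1. post all synchronous sends for this phase
  std::vector<MPI_Request> sends(destinations.size());
  for (size_t i = 0; i < destinations.size(); ++i)
    MPI_Issend(outgoing[i].data(), (int)outgoing[i].size(), MPI_CHAR,
               destinations[i], 0, comm, &sends[i]);
  MPI_Request barrier = MPI_REQUEST_NULL;
  int sendsDone = 0, barrierDone = 0;
  while (!barrierDone) {
    // 2./4. receive anything that has arrived
    int flag;
    MPI_Status status;
    MPI_Iprobe(MPI_ANY_SOURCE, 0, comm, &flag, &status);
    if (flag) {
      int count;
      MPI_Get_count(&status, MPI_CHAR, &count);
      std::vector<char> msg(count);
      MPI_Recv(msg.data(), count, MPI_CHAR, status.MPI_SOURCE, 0, comm,
               MPI_STATUS_IGNORE);
      handle(msg, status.MPI_SOURCE);
    }
    if (!sendsDone) {
      // 3. once our own sends complete, announce "stopped sending"
      MPI_Testall((int)sends.size(), sends.data(), &sendsDone,
                  MPI_STATUSES_IGNORE);
      if (sendsDone)
        MPI_Ibarrier(comm, &barrier);
    } else {
      // when every rank has stopped sending, no more messages can arrive
      MPI_Test(&barrier, &barrierDone, MPI_STATUS_IGNORE);
    }
  }
  // 5./6. a second, "stopped receiving" ibarrier (omitted here) overlaps
  // computation before it becomes safe to post the next phase's sends
}
```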
Hybrid System
[Diagram: mapping of the program onto Blue Gene/Q — processes map to nodes and threads map to cores. Processes per node and threads per core are variable.]
Hybrid Programming System
1. Message passing: the de facto standard programming model for distributed-memory architectures.
2. The classic shared-memory programming model: mutexes, atomic operations, lock-free structures.
Most massively parallel code currently uses model 1. The two models are very different, making it hard to convert from 1 to 2.
Hybrid Programming System
We will try message passing between threads:
- Threads can send to other threads in the same process and to threads in a different process.
- Same model as MPI, with "process" replaced by "thread".
- Porting is faster: only the message passing API changes.
- Shared memory is still exploited; locking is replaced with messages:

  Thread 1: Write(A); Release(lockA)
  Thread 2: Lock(lockA); Write(A)

  becomes

  Thread 1: Write(A); SendTo(2)
  Thread 2: ReceiveFrom(1); Write(A)
Parallel Control Utility
- Multi-threading API for hybrid MPI/thread mode
  - Launch a function pointer on N threads
  - Get the thread ID and the number of threads in the process
  - Uses pthreads directly
- Phased communication API
  - Send messages in batches per phase; detect the end of each phase
- Hybrid MPI/thread communication API
  - Uses hybrid ranks and sizes
  - Same phased API; automatically switches to hybrid when called within threads
- Future: hardware queries by wrapping hwloc*

* Portable Hardware Locality (http://www.open-mpi.org/projects/hwloc/)
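A hypothetical sketch of the "launch a function pointer on N threads" portion, using pthreads directly as the slide says; the function names are illustrative, not the actual PCU API. A thread-specific key stores each thread's ID so it can be queried from anywhere inside the launched function.

```cpp
#include <pthread.h>
#include <vector>

static pthread_key_t idKey;   // holds this thread's ID
static int threadCount = 0;

struct ThreadArg { void (*func)(void*); void* data; int id; };

static void* trampoline(void* p) {
  ThreadArg* a = static_cast<ThreadArg*>(p);
  pthread_setspecific(idKey, &a->id);  // record this thread's ID
  a->func(a->data);
  return 0;
}

// valid only inside a function launched by threadRun
int threadId()    { return *static_cast<int*>(pthread_getspecific(idKey)); }
int threadPeers() { return threadCount; }

// launch func(data) on n threads and wait for all of them to finish
void threadRun(void (*func)(void*), void* data, int n) {
  threadCount = n;
  pthread_key_create(&idKey, 0);
  std::vector<pthread_t> threads(n);
  std::vector<ThreadArg> args(n);
  for (int i = 0; i < n; ++i) {
    args[i].func = func; args[i].data = data; args[i].id = i;
    pthread_create(&threads[i], 0, trampoline, &args[i]);
  }
  for (int i = 0; i < n; ++i)
    pthread_join(threads[i], 0);
  pthread_key_delete(idKey);
}
```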
Hybrid Message Passing
Everything is built from the primitives, so we need hybrid primitives:
- Size: number of threads on the whole machine
- Rank: machine-unique ID of the thread
- Send, Probe, and Receive using hybrid ranks
[Diagram: two processes, each with thread ranks 0-3; hybrid ranks 0-7 number the threads consecutively across both processes.]
Hybrid Message Passing
Initial simple hybrid primitives: just wrap the MPI primitives.
- MPI_Init_thread with MPI_THREAD_MULTIPLE
- MPI rank = floor(hybrid rank / threads per process)
- MPI tag bit fields: | from thread | to thread | hybrid tag |
Communication stack:
- MPI mode: phased layer → ibarrier → MPI primitives
- Hybrid mode: phased layer → ibarrier → hybrid primitives → MPI primitives
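The mapping could look roughly like the sketch below; the bit-field widths, the threads-per-process constant, and the function names are assumptions for illustration only (and a real implementation must check that the packed tag stays below MPI_TAG_UB).

```cpp
#include <mpi.h>

static const int threadsPerProcess = 16;  // e.g. one thread per BG/Q core
static const int threadBits = 4;          // enough to encode 16 thread IDs

struct HybridAddress { int mpiRank; int thread; };

// MPI rank = floor(hybrid rank / threads per process)
HybridAddress toMpi(int hybridRank) {
  HybridAddress a;
  a.mpiRank = hybridRank / threadsPerProcess;
  a.thread  = hybridRank % threadsPerProcess;
  return a;
}

// tag layout: | from thread | to thread | user tag |
int packTag(int fromThread, int toThread, int userTag) {
  return (((fromThread << threadBits) | toThread) << 16) | userTag;
}

void hybridSend(const void* buf, int bytes, int toHybridRank,
                int fromThread, int userTag, MPI_Comm comm,
                MPI_Request* req) {
  HybridAddress to = toMpi(toHybridRank);
  MPI_Issend(buf, bytes, MPI_CHAR, to.mpiRank,
             packTag(fromThread, to.thread, userTag), comm, req);
}
```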
Hybrid Partitioning
- Partition the mesh to processes, then partition to threads
- Map parts to threads, 1-to-1
- Share entities on inter-thread part boundaries
[Diagram: four processes, each running pthreads with one part per thread.]
Hybrid Partitioning
- Entities are shared within a process
- A part boundary entity is created once per process
- A part boundary entity is shared by all local parts
- Only the owning part can modify an entity (avoids almost all contention)
- Remote copy: duplicate entity copy on another process
- The parallel control utility can provide architecture information to the mesh, which is distributed accordingly
[Diagram: parts on processes i and j, separated by an inter-process boundary, with implicit intra-process part boundaries between the parts within each process.]
Mesh Migration
Moving mesh entities between parts
- Input: local mesh elements to send to other parts
- Other entities to move are determined by adjacencies
Complex subtasks
- Reconstructing mesh adjacencies
- Restructuring the partition model
- Recomputing remote copies
Considerations
- Neighborhoods change: try to maintain scalability despite the loss of communication locality
- How to benefit from shared memory
Mesh Migration Steps
(A) Mark destination part IDs
(B) Collect the affected entities and compute post-migration residence parts
(C) Exchange entities and update the part boundary
(D) Delete migrated entities
[Diagram: parts P0, P1, P2 before and after migration, with destination part IDs marked on the elements to be moved.]
Hybrid Migration
Shared memory optimizations:
- Thread-to-part matching: use the partition model for concurrency
- Threads handle the part boundary entities they own
- Other entities are 'released'
- Inter-process entity movement: send the entity to one thread per process
- Intra-process entity movement: send a message containing a pointer
[Diagram: threads 0-3 releasing shared entities before migration and grabbing them back afterward.]
Hybrid Migration
1. Release shared entities
2. Update entity resident part sets
3. Move entities between processes
4. Move entities between threads
5. Grab shared entities
Two-level temporary ownership: Master and Process Master
- Master: smallest resident part ID
- Process Master: smallest on-process resident part ID
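A small sketch of the two ownership rules, assuming parts are assigned to processes in contiguous blocks (an assumption for illustration only); the names are illustrative.

```cpp
#include <set>

typedef std::set<int> ResidentPartSet;  // global part IDs holding the entity

// Master: smallest resident part ID
int master(const ResidentPartSet& parts) {
  return *parts.begin();                // std::set keeps IDs sorted
}

// Process Master: smallest resident part ID that lives on this process
int processMaster(const ResidentPartSet& parts, int process,
                  int partsPerProcess) {
  for (ResidentPartSet::const_iterator it = parts.begin();
       it != parts.end(); ++it)
    if (*it / partsPerProcess == process)  // assumes block assignment of parts
      return *it;
  return -1;                            // no resident part on this process
}
```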
Hybrid Migration: Representative Phase
1. The old Master part sends the entity to the new Process Master parts
2. Receivers bounce back the addresses of the created entities
3. Senders broadcast the union of all addresses
[Diagram: an entity with old resident parts {1,2,3} and new resident parts {5,6,7}; the data to create a copy flows to the new parts, the address of each local copy flows back, and the addresses of all copies are then broadcast.]
Hybrid Migration
Many subtle complexities:
1. Most steps have to be done one dimension at a time
2. Assigning upward adjacencies causes thread contention
   - Use a separate phase of communication to create them
   - Use another phase to remove them when entities are deleted
3. Assigning downward adjacencies requires addresses on the new process
   - Use a separate phase to gather remote copies
Preliminary Results
- Model: bi-unit cube
- Mesh: 260K tets, 16 parts
- Migration: sort by X coordinate
Preliminary Results
First test of the hybrid algorithm, using 1 node of the CCNI Blue Gene/Q:
Case 1: 16 MPI ranks, 1 thread per rank
- 18.36 seconds for migration
- 433 MB mesh memory use (sum over all MPI ranks)
Case 2: 1 MPI rank, 16 threads per rank
- 9.62 seconds for migration + thread create/join
- 157 MB mesh memory use (sum over all threads)
Thank You
Seegyoung Seol – FMDB architect, part boundary sharing
Micah Corah – SCOREC undergraduate, threaded part loading