1 Advances in PUMI for High Core Count Machines
Dan Ibanez, Micah Corah, Seegyoung Seol, Mark Shephard
2/27/2013
Scientific Computation Research Center, Rensselaer Polytechnic Institute

2 Outline
1. Distributed Mesh Data Structure
2. Phased Message Passing
3. Hybrid (MPI/thread) Programming Model
4. Hybrid Phased Message Passing
5. Hybrid Partitioning
6. Hybrid Mesh Migration

3 Unstructured Mesh Data Structure
Diagram: a mesh part containing regions, faces, edges, and vertices, linked by pointers in the data structure.

4 Distributed Mesh Representation
- Mesh elements are assigned to parts and uniquely identified by handle or global ID
- Each part is treated as a serial mesh with the addition of part boundaries
- Part boundary: groups of mesh entities on shared links between parts
- Remote copy: duplicate copy of an entity on a non-local part
- Resident part set: list of parts on which the entity exists
- A process can hold multiple parts
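A minimal sketch of the per-entity bookkeeping this slide describes, assuming a fixed-size copy list; the struct and field names are illustrative, not PUMI's actual data structure.

```c
#include <stddef.h>

#define MAX_COPIES 8

typedef struct {
  int part_id;    /* non-local part holding the copy   */
  void* handle;   /* address of the copy on that part  */
} RemoteCopy;

typedef struct {
  long global_id;                       /* unique identifier             */
  int resident_parts[MAX_COPIES];       /* parts where the entity exists */
  int num_resident_parts;
  RemoteCopy remote_copies[MAX_COPIES]; /* copies on non-local parts     */
  int num_remote_copies;
} MeshEntity;

/* An entity lies on a part boundary if it exists on more than one part. */
static int is_part_boundary(const MeshEntity* e)
{
  return e->num_resident_parts > 1;
}
```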

5 Message Passing
Primitive functional set:
- Size – number of members in the group
- Rank – ID of self in the group
- Send – non-blocking synchronous send
- Probe – non-blocking probe
- Receive – blocking receive
Non-blocking barrier (ibarrier):
- API call 1: begin ibarrier
- API call 2: wait for ibarrier termination
- Used for phased message passing
- Will be available in MPI-3; for now a custom implementation is used (a minimal sketch of the two-call API follows below)
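A minimal sketch of the two-call ibarrier API, here backed by MPI-3's MPI_Ibarrier rather than the custom point-to-point version the slides describe; the function names are illustrative.

```c
#include <mpi.h>

static MPI_Request ibarrier_req;

/* API call 1: enter the barrier without blocking. */
void ibarrier_begin(MPI_Comm comm)
{
  MPI_Ibarrier(comm, &ibarrier_req);
}

/* API call 2: block until every rank has entered the barrier. */
void ibarrier_wait(void)
{
  MPI_Wait(&ibarrier_req, MPI_STATUS_IGNORE);
}

/* Non-blocking check, useful for the "receive while waiting" loops
 * in phased message passing. */
int ibarrier_done(void)
{
  int flag;
  MPI_Test(&ibarrier_req, &flag, MPI_STATUS_IGNORE);
  return flag;
}
```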

6 ibarrier Implementation
- Built entirely from non-blocking point-to-point calls
- For N ranks, lg(N) steps to reach rank 0 (reduce) and lg(N) steps back (broadcast)
- Uses a separate MPI communicator
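For concreteness, a hedged model of such a custom non-blocking barrier built only from point-to-point calls: a lg(N) reduce toward rank 0 over a binary tree followed by a lg(N) broadcast back down, driven by a test function. It assumes the communicator passed in is a dedicated one, duplicated once at startup; the struct and function names are illustrative, not PUMI's actual implementation.

```c
#include <mpi.h>

enum { IB_REDUCE, IB_BCAST, IB_DONE };

typedef struct {
  MPI_Comm comm;             /* dedicated communicator, dup'ed once at startup */
  int rank, size, state, nchildren;
  MPI_Request child_reqs[2]; /* receives from the two binary-tree children */
  MPI_Request parent_send;   /* notification sent up toward rank 0         */
  MPI_Request parent_recv;   /* release received back from the parent      */
  char child_buf[2], up, down;
} TreeIBarrier;

void tree_ibarrier_begin(TreeIBarrier* b, MPI_Comm dedicated_comm) {
  b->comm = dedicated_comm;
  MPI_Comm_rank(b->comm, &b->rank);
  MPI_Comm_size(b->comm, &b->size);
  b->nchildren = 0;
  for (int i = 1; i <= 2; ++i) {        /* children of rank r are 2r+1, 2r+2 */
    int child = 2 * b->rank + i;
    if (child < b->size)
      MPI_Irecv(&b->child_buf[i - 1], 1, MPI_CHAR, child, 0, b->comm,
                &b->child_reqs[b->nchildren++]);
  }
  b->state = IB_REDUCE;
}

/* Advance the barrier; returns nonzero once every rank has entered it. */
int tree_ibarrier_test(TreeIBarrier* b) {
  int flag;
  if (b->state == IB_REDUCE) {
    MPI_Testall(b->nchildren, b->child_reqs, &flag, MPI_STATUSES_IGNORE);
    if (!flag) return 0;                /* our subtree has not all arrived */
    if (b->rank > 0) {                  /* notify parent, await the release */
      int parent = (b->rank - 1) / 2;
      MPI_Isend(&b->up, 1, MPI_CHAR, parent, 0, b->comm, &b->parent_send);
      MPI_Irecv(&b->down, 1, MPI_CHAR, parent, 0, b->comm, &b->parent_recv);
    }
    b->state = IB_BCAST;
  }
  if (b->state == IB_BCAST) {
    if (b->rank > 0) {
      MPI_Test(&b->parent_recv, &flag, MPI_STATUS_IGNORE);
      if (!flag) return 0;              /* release has not arrived yet */
      MPI_Wait(&b->parent_send, MPI_STATUS_IGNORE);
    }
    for (int i = 1; i <= 2; ++i) {      /* forward the release down the tree */
      int child = 2 * b->rank + i;
      if (child < b->size)
        MPI_Send(&b->down, 1, MPI_CHAR, child, 0, b->comm);
    }
    b->state = IB_DONE;
  }
  return 1;
}
```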

7 Phased Message Passing
- Similar to Bulk Synchronous Parallel
- Uses the non-blocking barrier
1. Begin phase
2. Send all messages
3. Receive any messages sent this phase
4. End phase
Benefits:
- Efficient termination detection when neighbors are unknown
- Phases are implicit barriers, which simplifies algorithms
- Allows buffering all messages per rank per phase

8 Phased Message Passing
Implementation:
1. Post all sends for this phase
2. While local sends are incomplete: receive any message
   (local sends are now complete; since they are synchronous, each has been matched by a receive)
3. Begin the "stopped sending" ibarrier
4. While that ibarrier is incomplete: receive any message
   (all sends everywhere are now complete, so it is safe to stop receiving)
5. Begin the "stopped receiving" ibarrier
6. While that ibarrier is incomplete: compute
   (all ranks have now stopped receiving, so it is safe to send the next phase)
7. Repeat
Diagram: alternating send/receive stages, with the ibarriers acting as the signal edges between them; the sketch below illustrates the loop.
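A hedged sketch of one phase of this loop, assuming the ibarrier_begin/ibarrier_done helpers sketched earlier, synchronous sends (MPI_Issend), and a separate communicator for the barrier; message handling is stubbed out and all names are illustrative.

```c
#include <mpi.h>
#include <stdlib.h>

/* Probe for and consume at most one incoming message (handler stubbed). */
static void receive_any(MPI_Comm comm) {
  int flag, count;
  MPI_Status st;
  MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &st);
  if (!flag) return;
  MPI_Get_count(&st, MPI_BYTE, &count);
  char* buf = malloc(count);
  MPI_Recv(buf, count, MPI_BYTE, st.MPI_SOURCE, st.MPI_TAG, comm, &st);
  /* ... handle the message for this phase ... */
  free(buf);
}

/* One communication phase: outgoing buffers, lengths, and destination
 * ranks are given per message. */
void run_phase(MPI_Comm comm, MPI_Comm barrier_comm,
               char** bufs, int* lens, int* dests, int nsends) {
  MPI_Request* reqs = malloc(nsends * sizeof(MPI_Request));
  /* 1. post all synchronous sends for this phase */
  for (int i = 0; i < nsends; ++i)
    MPI_Issend(bufs[i], lens[i], MPI_BYTE, dests[i], 0, comm, &reqs[i]);
  /* 2. receive while our own sends are still in flight */
  int done = 0;
  while (!done) {
    MPI_Testall(nsends, reqs, &done, MPI_STATUSES_IGNORE);
    receive_any(comm);
  }
  /* 3-4. "stopped sending" barrier: keep receiving until everyone has
   * finished sending */
  ibarrier_begin(barrier_comm);
  while (!ibarrier_done())
    receive_any(comm);
  /* 5-6. "stopped receiving" barrier: once it completes, every rank has
   * stopped receiving and the next phase may begin sending */
  ibarrier_begin(barrier_comm);
  while (!ibarrier_done())
    ; /* computation could overlap here */
  free(reqs);
}
```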

9 Hybrid System
Diagram: the program's processes and threads are mapped onto the Blue Gene/Q machine's nodes and cores.
*Processes per node and threads per core are variable

10 Hybrid Programming System
1. Message passing is the de facto standard programming model for distributed-memory architectures.
2. The classic shared-memory programming model: mutexes, atomic operations, lockless structures.
Most massively parallel code currently uses model 1.
The models are very different, and it is hard to convert from model 1 to model 2.

11 Hybrid Programming System
We will try message passing between threads.
Threads can send to other threads in the same process and to threads in a different process.
It is the same model as MPI, with "process" replaced by "thread".
Porting is faster: only the message passing API changes.
Shared memory is still exploited; locking is replaced with messages:

  Thread 1:          Thread 2:
  Write(A)           Lock(lockA)
  Release(lockA)     Write(A)

becomes

  Thread 1:          Thread 2:
  Write(A)           ReceiveFrom(1)
  SendTo(2)          Write(A)

12 Parallel Control Utility
Multi-threading API for hybrid MPI/thread mode:
- Launch a function pointer on N threads (sketched below)
- Get thread ID and number of threads in the process
- Uses pthreads directly
Phased communication API:
- Send messages in batches per phase, detect end of phase
Hybrid MPI/thread communication API:
- Uses hybrid ranks and size
- Same phased API; automatically switches to hybrid mode when called within threads
Future: hardware queries by wrapping hwloc*
* Portable Hardware Locality (http://www.open-mpi.org/projects/hwloc/)
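A minimal sketch of the "launch a function pointer on N threads" facility, using pthreads directly as the slide describes; the function and struct names are illustrative, not PUMI's API.

```c
#include <pthread.h>
#include <stdlib.h>

typedef void (*thread_fn)(int thread_id, int num_threads, void* arg);

typedef struct {
  thread_fn fn;
  int id;
  int nthreads;
  void* arg;
} launch_ctx;

static void* trampoline(void* p) {
  launch_ctx* c = p;
  c->fn(c->id, c->nthreads, c->arg);
  return NULL;
}

/* Run fn on n threads and join them all before returning. */
void launch_threads(thread_fn fn, int n, void* arg) {
  pthread_t* tids = malloc(n * sizeof(pthread_t));
  launch_ctx* ctxs = malloc(n * sizeof(launch_ctx));
  for (int i = 0; i < n; ++i) {
    ctxs[i] = (launch_ctx){ fn, i, n, arg };
    pthread_create(&tids[i], NULL, trampoline, &ctxs[i]);
  }
  for (int i = 0; i < n; ++i)
    pthread_join(tids[i], NULL);
  free(ctxs);
  free(tids);
}
```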

13 Hybrid Message Passing
Everything is built from the primitives, so we need hybrid primitives:
- Size: number of threads on the whole machine
- Rank: machine-unique ID of the thread
- Send, Probe, and Receive using hybrid ranks
Diagram: process ranks 0-1, thread ranks 0-3 within each process, hybrid ranks 0-7 across the machine.
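For concreteness, a small sketch of the rank bookkeeping implied by the diagram, assuming each process's threads receive consecutive hybrid ranks; the helper names are illustrative.

```c
/* Hybrid size = total number of threads on the machine. */
int hybrid_size(int mpi_size, int threads_per_process) {
  return mpi_size * threads_per_process;
}

/* Machine-unique thread ID, matching the diagram's layout. */
int hybrid_rank(int mpi_rank, int thread_id, int threads_per_process) {
  return mpi_rank * threads_per_process + thread_id;
}
```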

14 Hybrid Message Passing
Initial simple hybrid primitives just wrap the MPI primitives:
- MPI_Init_thread with MPI_THREAD_MULTIPLE
- MPI rank = floor(hybrid rank / threads per process)
- MPI tag bit fields: from thread | to thread | hybrid tag
Layering:
- MPI mode: phased API → ibarrier → MPI primitives
- Hybrid mode: phased API → ibarrier → hybrid primitives → MPI primitives
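A hedged sketch of that wrapping: a hybrid non-blocking synchronous send that maps the hybrid destination rank to an MPI rank and packs the source and destination thread IDs into bit fields of the MPI tag. It assumes MPI was initialized with MPI_THREAD_MULTIPLE; the bit width and function names are illustrative.

```c
#include <mpi.h>

#define THREAD_BITS 4   /* enough for up to 16 threads per process */

/* Pack (from thread, to thread, user tag) into one MPI tag. */
int hybrid_tag(int from_thread, int to_thread, int user_tag) {
  return (user_tag << (2 * THREAD_BITS)) |
         (from_thread << THREAD_BITS) | to_thread;
}

/* Non-blocking synchronous send to a hybrid rank, wrapping MPI_Issend. */
void hybrid_send(const void* buf, int count, MPI_Datatype type,
                 int from_thread, int dest_hybrid_rank, int user_tag,
                 int threads_per_process, MPI_Comm comm, MPI_Request* req) {
  int dest_process = dest_hybrid_rank / threads_per_process;
  int dest_thread  = dest_hybrid_rank % threads_per_process;
  MPI_Issend(buf, count, type, dest_process,
             hybrid_tag(from_thread, dest_thread, user_tag), comm, req);
}
```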

15 Hybrid Partitioning
- Partition the mesh to processes, then partition each process's mesh to threads
- Map parts to threads, one-to-one
- Share entities on inter-thread part boundaries
Diagram: processes 1-4, each holding several parts, with one pthread per part.

16 Hybrid Partitioning
- Entities are shared within a process:
  - A part boundary entity is created once per process
  - A part boundary entity is shared by all local parts
  - Only the owning part can modify an entity (this avoids almost all contention)
- Remote copy: duplicate copy of an entity on another process
- The parallel control utility can provide architecture information to the mesh, which is distributed accordingly.
Diagram: parts P0, P1, P2 across processes i and j, separated by an inter-process boundary and implicit intra-process part boundaries.

17 Mesh Migration
- Moving mesh entities between parts
  - Input: local mesh elements to send to other parts
  - Other entities to move are determined by adjacencies
- Complex subtasks:
  - Reconstructing mesh adjacencies
  - Restructuring the partition model
  - Recomputing remote copies
- Considerations:
  - Neighborhoods change: try to maintain scalability despite the loss of communication locality
  - How to benefit from shared memory

18 Mesh Migration
Migration steps (illustrated on parts P0, P1, P2):
(A) Mark destination part IDs
(B) Get affected entities and compute post-migration residence parts
(C) Exchange entities and update part boundaries
(D) Delete migrated entities

19 Hybrid Migration
Shared memory optimizations:
- Thread-to-part matching: use the partition model for concurrency
  - Threads handle the part boundary entities they own
  - Other entities are 'released'
- Inter-process entity movement: send the entity to one thread per process
- Intra-process entity movement: send a message containing a pointer
Diagram: threads 0-3 release and then grab shared entities.

20 Hybrid Migration
1. Release shared entities
2. Update entity resident part sets
3. Move entities between processes
4. Move entities between threads
5. Grab shared entities
Two-level temporary ownership: Master and Process Master
- Master: smallest resident part ID
- Process Master: smallest on-process resident part ID
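A hedged sketch of the two-level ownership rule above: Master is the smallest resident part ID overall, Process Master is the smallest resident part ID that lives on the local process. The part-to-process mapping helper is an illustrative assumption, not PUMI's actual scheme.

```c
#include <limits.h>

/* Assumed mapping from a part ID to the process that owns it. */
int process_of_part(int part_id, int parts_per_process) {
  return part_id / parts_per_process;
}

/* Master: smallest resident part ID. */
int master_part(const int* resident_parts, int n) {
  int m = INT_MAX;
  for (int i = 0; i < n; ++i)
    if (resident_parts[i] < m)
      m = resident_parts[i];
  return m;
}

/* Process Master: smallest resident part ID on the local process. */
int process_master_part(const int* resident_parts, int n,
                        int local_process, int parts_per_process) {
  int m = INT_MAX;
  for (int i = 0; i < n; ++i)
    if (process_of_part(resident_parts[i], parts_per_process) == local_process
        && resident_parts[i] < m)
      m = resident_parts[i];
  return m;  /* INT_MAX if no resident part is on this process */
}
```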

21 Hybrid Migration
Representative phase:
1. The old Master Part sends the entity to the new Process Master Parts
2. Receivers bounce back the addresses of the created entities
3. Senders broadcast the union of all addresses
Example: old resident parts {1,2,3}, new resident parts {5,6,7}; the three messages carry the data to create a copy, the address of the local copy, and the addresses of all copies.

22 Hybrid Migration
Many subtle complexities:
1. Most steps have to be done one dimension at a time
2. Assigning upward adjacencies causes thread contention
   - Use a separate phase of communication to create them
   - Use another phase to remove them when entities are deleted
3. Assigning downward adjacencies requires addresses on the new process
   - Use a separate phase to gather remote copies

23 Preliminary Results
- Model: bi-unit cube
- Mesh: 260K tets, 16 parts
- Migration: sort by X coordinate

24 Preliminary Results
First test of the hybrid algorithm, using 1 node of the CCNI Blue Gene/Q:
Case 1: 16 MPI ranks, 1 thread per rank
- 18.36 seconds for migration
- 433 MB mesh memory use (sum over all MPI ranks)
Case 2: 1 MPI rank, 16 threads per rank
- 9.62 seconds for migration plus thread create/join
- 157 MB mesh memory use (sum over all threads)

25 Thank You
Seegyoung Seol – FMDB architect, part boundary sharing
Micah Corah – SCOREC undergraduate, threaded part loading

