Implementation and Performance of Munin (Distributed Shared Memory System) Dongying Li Department of Electrical and Computer Engineering University of.


1 Implementation and Performance of Munin (Distributed Shared Memory System) Dongying Li Department of Electrical and Computer Engineering University of Toronto (Original Authors: J. B. Carter, et al.) ECE 1147, Parallel Computation Oct. 30, 2006

2 Distributed Shared Memory
A shared address space spanning the processors of a distributed-memory multiprocessor
[Figure: proc1, proc2, proc3; shared variable X=0]

3 Distributed Shared Memory
[Figure: nodes mem0/proc0, mem1/proc1, mem2/proc2, ..., memN/procN connected by a network; the memories together form the shared memory]

4 Distributed Shared Memory
Design objectives
– Performance comparable to shared memory programs
– No significant deviation from the shared memory coding model
– Low communication and message passing overhead

5 Munin System
Characteristic features
– Software release consistency
– Multiple consistency protocols
Same interface as the shared memory programming model
– Threads, synchronization, data sharing, etc.
– Deviations
  All shared variables are annotated with their access pattern
  Synchronization is explicitly visible to the runtime system (important for release consistency!)

6 Contents
Basic concepts
– Shared object
– Software release consistency
– Multiple consistency protocols
Software implementation
– Prototype overview
– Execution process
– Advanced programming features
– Data object directory and delayed update queue
– Synchronization
Performance
Overview of other DSM systems
Conclusion

7 Basic Concepts

8 Shared Object
[Figure: variables x and y laid out in shared objects; label "8-kilo"]

9 Software Release Consistency
Sequential consistency
– All processors observe the same order of operations
– The order must correspond to some serial order
– The only ordering constraint is that the reads/writes of each processor (e.g. P1) appear in its own program order; there is no restriction on the relative ordering between processors
Synchronous read/write
– Each write must be propagated before moving on to the next operation

10 Software Release Consistency
A special weak consistency protocol
Reduces message passing overhead
Two categories of shared variable operations
– Ordinary access
  Read
  Write
– Synchronization access (lock, semaphore, barrier)
  Acquire
  Release

11 Software Release Consistency
Before an ordinary access (read, write) is allowed to perform, all previous acquires must have performed
Before a release is allowed to perform, all previous ordinary accesses must have performed
Before an acquire is allowed to perform, all previous releases must have performed
Before a release is allowed to perform, all previous acquires must have performed
In short: the results of writes made before a release are propagated before the next processor acquires the released lock

12 Release Consistency
Writes are propagated at release

13 Multiple Consistency Protocols
No single consistency protocol suits every parallel program
Shared variables are accessed in different ways within a single program
A variable's access pattern can change during execution
Multiple protocols allow tuning each shared variable according to its access pattern

14 Multiple Consistency Protocols
High-level sharing pattern annotation
– Specified in the shared variable declaration
– A combination of low-level protocol parameters
Low-level protocol parameter
– Specified in the shared variable directory
– Controls one specific aspect of the protocol

15 Protocol Parameters
I: Invalidate (rather than update) remote copies after a modification?
R: Replicas allowed on other nodes?
D: Delayed operations (updates, invalidations) allowed?
FO: Fixed owner (no writes at other nodes)?
M: Multiple writers allowed?
S: Stable sharing pattern (accessed by a fixed set of threads)?
FL: Flush changes to the owner and invalidate the local copy?
W: Writable?

16 Sharing annotations
Read-only
– Simplest pattern: once initialized, no further writes
– Suitable for constants etc.
Migratory
– Only one thread accesses the object at any one time
– Suitable for variables accessed only inside a critical section
Write-shared
– Can be written concurrently by multiple threads
– Different threads update different words of the variable
Producer-consumer
– Written by only one thread and read by others
– The object is replicated and updated, not invalidated

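17 Sharing annotations
Example: producer-consumer

    for (some number of timesteps/iterations) {
        /* interior points only: grid[i-1] and grid[i+1] must stay in bounds */
        for (i = 1; i < n - 1; i++)
            for (j = 1; j < n - 1; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                     grid[i][j-1] + grid[i][j+1]);
        /* copy the new values back for the next iteration */
        for (i = 1; i < n - 1; i++)
            for (j = 1; j < n - 1; j++)
                grid[i][j] = temp[i][j];
    }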
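Why this loop is a producer-consumer case can be seen once the rows are divided among workers. The sketch below assumes a row-block partition, which the slide does not specify; `N`, `row_block` and `sweep` are hypothetical names chosen for illustration. Each worker writes only its own rows, but computing its first and last row reads one row owned by a neighbour, so those edge rows have one writer and a fixed reader.

```c
#define N 6   /* illustrative grid size; rows/columns 0 and N-1 are fixed boundary */

/* Compute the half-open row range [*lo, *hi) of interior rows 1..N-2
 * assigned to worker `tid` out of `nworkers` (assumed block partition). */
void row_block(int tid, int nworkers, int *lo, int *hi)
{
    int interior = N - 2;
    int base = interior / nworkers;
    int extra = interior % nworkers;   /* first `extra` workers get one more row */
    *lo = 1 + tid * base + (tid < extra ? tid : extra);
    *hi = *lo + base + (tid < extra ? 1 : 0);
}

/* One averaging sweep over rows [lo, hi). Row lo reads grid[lo-1] and
 * row hi-1 reads grid[hi]; those rows belong to neighbouring workers. */
void sweep(double grid[N][N], double temp[N][N], int lo, int hi)
{
    for (int i = lo; i < hi; i++)
        for (int j = 1; j < N - 1; j++)
            temp[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                 grid[i][j - 1] + grid[i][j + 1]);
}
```

With two workers and `N` equal to 6, worker 0 owns rows 1 and 2 and worker 1 owns rows 3 and 4; worker 0 reads row 3 and worker 1 reads row 2 on every iteration, which is the stable one-writer, fixed-reader pattern the annotation describes.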

18 Sharing annotations
Reduction
– Accessed by fetch-and-operation (read, write, then release)
– Example: min(), a++
Result
– Phase 1: multiple writers allowed
– Phase 2: one thread (the one collecting the result) accesses the object exclusively
Conventional
– Conventional update protocol for shared variables
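The fetch-and-operation pattern behind the reduction annotation can be sketched in a few lines. This is a single-process model only: `reduction_obj`, `fetch_and_min` and `fetch_and_add` are hypothetical names, and the `locked` flag merely marks where a real runtime would acquire and release the lock held at the object's fixed owner.

```c
/* Model of a reduction object kept at its fixed owner (illustrative only). */
typedef struct {
    int value;    /* the reduction variable */
    int locked;   /* 1 while a fetch-and-op is in progress */
} reduction_obj;

/* Fetch-and-min: acquire, read, write the smaller value, release.
 * Returns the value observed before the update. */
int fetch_and_min(reduction_obj *obj, int candidate)
{
    obj->locked = 1;                /* acquire */
    int old = obj->value;           /* read */
    if (candidate < old)
        obj->value = candidate;     /* write */
    obj->locked = 0;                /* release */
    return old;
}

/* Fetch-and-add, the a++ case from the slide when delta is 1. */
int fetch_and_add(reduction_obj *obj, int delta)
{
    obj->locked = 1;
    int old = obj->value;
    obj->value = old + delta;
    obj->locked = 0;
    return old;
}
```

Keeping the read and the write inside one acquire/release pair is what lets the whole operation travel to the owner as a single request instead of moving the object back and forth.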

19 Sharing annotations
[Figure: timeline of w(x) and r(x) accesses]

20 Sharing annotations

Sharing annotation   Protocol parameters
                     I   R   D   FO  M   S   FL  W
Read-only            N   Y   -   -   -   -   -   N
Migratory            Y   N   -   N   N   -   N   Y
Write-shared         N   Y   Y   N   Y   N   N   Y
Producer-consumer    N   Y   Y   N   Y   Y   N   Y
Reduction            N   Y   N   Y   N   -   N   Y
Result               N   Y   Y   Y   Y   -   Y   Y
Conventional         Y   Y   N   N   N   -   N   Y
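One way to hold this table in a program is a small lookup array, sketched below. The values are copied from the slide; the type and function names (`protocol_row`, `protocol_param`) and the lowercase annotation strings are assumptions made for the example, not Munin's actual data structures.

```c
#include <stddef.h>
#include <string.h>

/* Parameter columns in the order used on the slide. */
enum { P_I, P_R, P_D, P_FO, P_M, P_S, P_FL, P_W, P_COUNT };

typedef struct {
    const char *annotation;
    const char  params[P_COUNT + 1];   /* 'Y', 'N' or '-' (not applicable) */
} protocol_row;

static const protocol_row protocol_table[] = {
    { "read-only",         "NY-----N" },
    { "migratory",         "YN-NN-NY" },
    { "write-shared",      "NYYNYNNY" },
    { "producer-consumer", "NYYNYYNY" },
    { "reduction",         "NYNYN-NY" },
    { "result",            "NYYYY-YY" },
    { "conventional",      "YYNNN-NY" },
};

/* Return 'Y', 'N' or '-' for the given annotation and parameter,
 * or '?' if the annotation or parameter index is unknown. */
char protocol_param(const char *annotation, int param)
{
    size_t n = sizeof protocol_table / sizeof protocol_table[0];
    if (param < 0 || param >= P_COUNT)
        return '?';
    for (size_t i = 0; i < n; i++)
        if (strcmp(protocol_table[i].annotation, annotation) == 0)
            return protocol_table[i].params[param];
    return '?';
}
```

For instance, `protocol_param("producer-consumer", P_I)` yields 'N', matching the earlier remark that producer-consumer objects are updated rather than invalidated.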

21 Software Implementation

22 Prototype Overview
A simple preprocessor that converts annotations to a suitable format
A linker that creates the shared memory segment
Library routines linked into the program
Operating system support for page fault handling and page table manipulation

23 Execution Process
Compiling
[Figure: compile-time flow involving the sharing annotations, Munin processor, auxiliary files, linker, shared data segment and shared data description table]

24 Execution Process
Initialization
[Figure: processors P1, P2, ..., Pn; Munin root thread, Munin worker threads and the user root thread running User_init(); each node holds a code copy and a data segment]

25 Execution Process
Synchronization
[Figure: processors P1, P2, ..., Pn; Munin root thread, Munin worker thread, user thread, synchronization operation]

26 Advanced Programming Features
Associate data & synch
[Figure: message (msg) timelines with acq(m), r(x), rel(m) and w(x)]

27 Advanced Programming Features
PhaseChange()
– Changes the producer-consumer relationship
– Example: adaptive mesh SOR
ChangeAnnotation()
– Changes the access pattern during execution
Invalidate()
Flush()
SingleObject()
PreAcquire()

28 Data Object Directory
Start address and size
Protocol parameters
Object state (valid, writable, invalid)
Copyset (which remote nodes have copies)
Synchq (corresponding synchronization object)
Probable owner
Home node
Access control semaphore
Links
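The fields listed above map naturally onto a C structure. The following is a sketch under assumptions, not Munin's real declaration: the field types, the 32-node bitmask for the copyset, and the helper names are all invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { OBJ_INVALID, OBJ_VALID, OBJ_WRITABLE } obj_state;

/* Hypothetical layout of one data object directory entry. */
typedef struct dir_entry {
    void             *start;           /* start address of the object */
    size_t            size;            /* size in bytes */
    unsigned          params;          /* protocol parameter bits */
    obj_state         state;           /* valid, writable or invalid */
    uint32_t          copyset;         /* bit i set: node i holds a copy (assumes < 32 nodes) */
    int               synchq;          /* id of the associated synchronization object */
    int               probable_owner;  /* best guess at the current owner node */
    int               home_node;       /* node where the object was created */
    int               access_sem;      /* placeholder for the access control semaphore */
    struct dir_entry *next;            /* link to the next entry */
} dir_entry;

/* Record that `node` now holds a copy. */
void copyset_add(dir_entry *e, int node)
{
    e->copyset |= (uint32_t)1 << node;
}

/* Non-zero if `node` is in the copyset. */
int copyset_has(const dir_entry *e, int node)
{
    return (int)((e->copyset >> node) & 1u);
}

/* Number of nodes that would receive an update or invalidation. */
int copyset_count(const dir_entry *e)
{
    int n = 0;
    for (uint32_t bits = e->copyset; bits != 0; bits >>= 1)
        n += (int)(bits & 1u);
    return n;
}
```

A bitmask keeps the copyset cheap to test and to iterate when propagating updates; a larger system would need a wider or dynamic representation.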

29 Delayed Update Queue
[Figure: access sequence acq(m), w(x), w(y), rel(m), with x and y entered in the queue]
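The sequence acq(m), w(x), w(y), rel(m) can be modelled with a tiny queue that records written variables and pushes them out only at release. This is a single-process sketch: the "remote copy" is just another `int`, and `delayed_update_queue`, `duq_init`, `duq_write` and `duq_release` are hypothetical names, not Munin's interface.

```c
#include <stddef.h>

#define DUQ_MAX 16

typedef struct {
    int   *local[DUQ_MAX];    /* locally written variables since the acquire */
    int   *remote[DUQ_MAX];   /* stand-in for the copy held by another node */
    size_t count;
} delayed_update_queue;

void duq_init(delayed_update_queue *q)
{
    q->count = 0;
}

/* Perform the write locally and queue the variable; nothing is sent yet.
 * Returns 0 on success, -1 if the queue is full. */
int duq_write(delayed_update_queue *q, int *local, int *remote, int value)
{
    size_t i;
    for (i = 0; i < q->count; i++)
        if (q->local[i] == local)
            break;                      /* already queued: coalesce */
    if (i == q->count) {
        if (q->count == DUQ_MAX)
            return -1;
        q->local[i] = local;
        q->remote[i] = remote;
        q->count++;
    }
    *local = value;
    return 0;
}

/* Release: propagate every queued update, then empty the queue.
 * Returns the number of variables propagated. */
size_t duq_release(delayed_update_queue *q)
{
    size_t sent = q->count;
    for (size_t i = 0; i < q->count; i++)
        *q->remote[i] = *q->local[i];
    q->count = 0;
    return sent;
}
```

Writing x twice before the release still produces a single propagated value for x, which is where the saving in messages comes from.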

30 Multiple Writer Handling
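The body of this slide did not survive extraction. In the Munin paper by Carter et al., concurrent writers to a write-shared object are handled by copying the object to a "twin" before the first write and, at release, sending only the words that differ from the twin. The sketch below models that idea on a tiny array; `WORDS`, `diff_entry` and the three function names are hypothetical, and the real system works on whole objects and a compact encoding rather than an index/value list.

```c
#include <stddef.h>
#include <string.h>

#define WORDS 8   /* tiny "object" for illustration */

typedef struct {
    size_t index;   /* which word changed */
    int    value;   /* its new value */
} diff_entry;

/* Save a pristine copy of the object before the first local write. */
void make_twin(const int *obj, int *twin)
{
    memcpy(twin, obj, WORDS * sizeof *obj);
}

/* Compare the object with its twin word by word and record the changes.
 * `out` must have room for WORDS entries. Returns the number of entries. */
size_t make_diff(const int *obj, const int *twin, diff_entry *out)
{
    size_t n = 0;
    for (size_t i = 0; i < WORDS; i++) {
        if (obj[i] != twin[i]) {
            out[n].index = i;
            out[n].value = obj[i];
            n++;
        }
    }
    return n;
}

/* Merge a received diff into another copy of the object. */
void apply_diff(int *copy, const diff_entry *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        copy[d[i].index] = d[i].value;
}
```

If two nodes modify different words, as the write-shared annotation assumes, exchanging and applying their diffs leaves both copies identical without either node overwriting the other's changes.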

31 Synchronization
Queue-based synchronization
Request – reply – lock forward mechanism
CreateLock(), AcquireLock(), ReleaseLock(), CreateBarrier(), WaitAtBarrier()
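A queue-based lock with request, reply and forward steps can be modelled as follows. This is a single-process sketch with node ids standing in for machines; `dist_lock`, `lock_init`, `lock_request` and `lock_release` are hypothetical names and deliberately differ from the Munin routines listed above, whose signatures the slides do not give.

```c
#define MAX_NODES 8

typedef struct {
    int owner;                 /* node holding the lock, -1 if free */
    int waiters[MAX_NODES];    /* FIFO ring of nodes waiting for the lock */
    int head;                  /* index of the oldest waiter */
    int count;                 /* number of waiters */
} dist_lock;

void lock_init(dist_lock *l)
{
    l->owner = -1;
    l->head = 0;
    l->count = 0;
}

/* Request: returns 1 if the lock is granted at once (the "reply"),
 * 0 if the requester was queued, -1 if the queue is full. */
int lock_request(dist_lock *l, int node)
{
    if (l->owner == -1) {
        l->owner = node;
        return 1;
    }
    if (l->count == MAX_NODES)
        return -1;
    l->waiters[(l->head + l->count) % MAX_NODES] = node;
    l->count++;
    return 0;
}

/* Release: forward the lock to the oldest waiter, if any.
 * Returns the new owner, or -1 if the lock is now free. */
int lock_release(dist_lock *l)
{
    if (l->count == 0) {
        l->owner = -1;
        return -1;
    }
    l->owner = l->waiters[l->head];
    l->head = (l->head + 1) % MAX_NODES;
    l->count--;
    return l->owner;
}
```

Forwarding the lock directly to the next waiter avoids a round trip through a central manager on every hand-off.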

32 Performance

33 Matrix Multiply

34 Matrix Multiply Optimized

35 SOR

36 Effect of Multiple Protocols

Protocol       Matrix Multiply   SOR
Multiple       72.41             27.64
Write-shared   75.59             64.48
Conventional   75.85             67.64

37 Performance Problem with Munin
Note: poor performance for the task-queue model (TSP-Q, quicksort, etc.)
E.g. speedup for TSP (16 procs):

          code I   code II
MPI       8.9      13.4
Munin     6.0      8.9

Major overhead: time threads spend waiting at the lock that protects the work queue, caused by transferring the whole work queue between threads

38 Overview of Other DSM Systems

39 Overview of Other DSM Systems
Clouds: per-segment (object) based consistency protocol
Mirage: per-page based
Orca: reliable ordered broadcast protocol
Amber: user responsible for the data distribution among processors
Linda: shared variables in tuple space; atomic operations: insertion, removal, reading
Midway: uses entry consistency (weaker than release consistency)
DASH: hardware DSM

40 Conclusion
Objective: an efficient DSM system with a programming model similar to shared memory programming and small message passing overhead
Special features: multiple protocols, software release consistency
Implementation: synchronization realized by the Munin root thread and Munin worker threads

41 Thank you

