Message Passing Vs. Shared Address Space on a Cluster of SMPs
Leonid Oliker, NERSC/LBNL, www.nersc.gov/~oliker
Hongzhang Shan and Jaswinder Pal Singh, Princeton University
Rupak Biswas, NASA Ames Research Center
2 Overview
- Scalable computing using clusters of PCs has become an attractive platform for high-end scientific computing
- Currently, MP and SAS are the leading programming paradigms
- MPI is more mature and provides performance and portability; however, code development can be very difficult
- SAS provides substantial ease of programming, but performance may suffer due to poor spatial locality and protocol overhead
- We compare the performance of the MP and SAS models using the best implementations available to us (MPI/Pro and GeNIMA SVM)
- We also examine hybrid programming (MPI + SAS)
- Platform: eight 4-way 200 MHz Pentium Pro SMPs (32 processors)
- Applications: regular (LU, OCEAN) and irregular (RADIX, N-BODY)
- We propose and investigate improved collective communication on SMP clusters
3 Architectural Platform
- 32-processor Pentium Pro system: eight 4-way SMP nodes
- Per node: 200 MHz processors, 8 KB L1 cache, 512 KB L2 cache, 512 MB memory
- Interconnect: Giganet or Myrinet through a single crossbar switch
- Network interface: 33 MHz processor
- Node-to-network bandwidth constrained by the 133 MB/s PCI bus
4 Comparison of Programming Models
- MP: data A moves from P0 to P1 through the MPI communication library via an explicit send/receive pair
- SAS: P1 reads the shared data with ordinary loads and stores (A1 = A0); no explicit receive is needed
[Diagram: MPI send-receive pair vs. SAS load/store between processes P0 and P1]
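The contrast can be made concrete with a minimal sketch (not from the original slides): the message-passing half uses standard MPI calls, while the SAS half is shown as it would look under any shared-address-space runtime, hardware cache-coherent or SVM-based. The function names and array size are illustrative.

```c
/* Hypothetical sketch contrasting the two models for moving an array A
 * from process P0 to process P1. */
#include <mpi.h>

#define N 1024

/* --- Message passing: explicit send/receive pair --- */
void mp_version(double *A, int rank)
{
    if (rank == 0)
        MPI_Send(A, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(A, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* --- Shared address space: ordinary loads and stores ---
 * A lives in shared memory; once the producer has synchronized (e.g. via a
 * barrier or flag), P1 simply reads the values that P0 wrote (A1 = A0 in the
 * slide's notation).  The coherence protocol, whether hardware or SVM,
 * moves the data implicitly. */
void sas_version(const double *shared_A, double *local_copy, int rank)
{
    if (rank == 1)
        for (int i = 0; i < N; i++)
            local_copy[i] = shared_A[i];   /* plain load, no explicit receive */
}
```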
5 SAS Programming
- SAS in software: page-based shared virtual memory (SVM)
- Uses the GeNIMA protocol built on VMMC over the Myrinet network
- VMMC (Virtual Memory-Mapped Communication): protected, reliable user-level communication with variable-size packets; allows data transfer directly between two virtual memory address spaces
- Single 16-way Myrinet crossbar switch: high-speed system area network with point-to-point links
- Each NI connects a node to the network with two unidirectional links of 160 MB/s peak bandwidth
- Question: what is the SVM overhead compared with a hardware-supported cache-coherent system (Origin2000)?
6 GeNIMA Protocol
- GeNIMA (GEneral-purpose NI support in a shared Memory Abstraction): home-based lazy release consistency with protocol handling at synchronization points
- Uses the virtual memory management system for page-level coherence
- Most current systems use asynchronous interrupts for both data exchange and protocol handling; handling messages asynchronously on the network interface (NI) eliminates the need to interrupt the receiving host processor
- General-purpose NI mechanisms move data between the network and user-level memory and provide mutual exclusion
- Protocol handling runs on the host processor only at "synchronous" points, i.e., when a process is sending or receiving messages
- Processes can modify their local page copies until synchronization (see the sketch below)
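A minimal sketch of the programming pattern this implies, assuming a generic lock-based acquire/release interface to a page-based SVM; svm_acquire, svm_release, and GRID_LOCK are hypothetical stand-ins, not GeNIMA's actual API:

```c
/* Hypothetical lock primitives standing in for the SVM protocol's
 * synchronization operations (illustrative only, not GeNIMA's API). */
extern void svm_acquire(int lock_id);
extern void svm_release(int lock_id);
#define GRID_LOCK 0

/* grid lives in a shared, page-granularity SVM region */
extern double *grid;

void update_my_rows(int first, int last, int n)
{
    svm_acquire(GRID_LOCK);              /* apply write notices, invalidate stale pages */
    for (int i = first; i <= last; i++)
        for (int j = 0; j < n; j++)
            grid[i * n + j] += 1.0;      /* writes go to the local page copy only */
    svm_release(GRID_LOCK);              /* diffs made visible to the next acquirer */
}
```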
7 MP Programming
- Uses MPI/Pro, built on the VIA interface over Giganet
- VIA (Virtual Interface Architecture): an industry-standard interface for system area networks; protected, zero-copy, user-space inter-process communication
- Giganet NIs, like Myrinet, connect through a single crossbar switch
- VIA and VMMC have similar communication overhead
[Chart: communication time in µsecs for VIA vs. VMMC]
8 Regular Applications: LU and OCEAN
- LU factorization: factors a matrix into lower and upper triangular matrices
- Lowest communication requirements among our benchmarks; one-to-many non-personalized communication
- In SAS, each process directly fetches the pivot block; in MPI, the block owner sends the pivot block to the other processes
- OCEAN: models large-scale eddy and boundary currents
- Nearest-neighbor communication pattern in a multigrid formulation; red-black Gauss-Seidel multigrid equation solver (see the sketch below); high communication-to-computation ratio
- Partitioning by rows instead of by blocks (fewer but larger messages) increased the speedup from 14.1 to 15.2 on 32 processors
- MP and SAS partition the subgrids in the same way, but the MPI version requires more programming
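As a concrete illustration of the nearest-neighbor pattern, here is a minimal sketch of one red-black sweep with an MPI halo exchange over a row-wise partition. It assumes a (local_rows + 2) x n array with one ghost row at the top and bottom; boundary conditions, the global row offset for the red/black parity, and convergence testing are omitted, and this is not the benchmark's actual code.

```c
#include <mpi.h>

/* Exchange ghost rows with the upper and lower neighbors. */
void exchange_halos(double *u, int local_rows, int n, int rank, int nprocs)
{
    int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send first interior row up, receive bottom ghost row from below */
    MPI_Sendrecv(&u[1 * n],              n, MPI_DOUBLE, up,   0,
                 &u[(local_rows + 1)*n], n, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send last interior row down, receive top ghost row from above */
    MPI_Sendrecv(&u[local_rows * n],     n, MPI_DOUBLE, down, 1,
                 &u[0],                  n, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* One half-sweep: update only the points of the given color (0 or 1). */
void red_black_sweep(double *u, const double *f, int local_rows, int n,
                     int rank, int nprocs, int color)
{
    exchange_halos(u, local_rows, n, rank, nprocs);
    for (int i = 1; i <= local_rows; i++)
        for (int j = 1; j < n - 1; j++)
            if ((i + j) % 2 == color)    /* local parity; a real code adds the global row offset */
                u[i*n + j] = 0.25 * (u[(i-1)*n + j] + u[(i+1)*n + j]
                                   + u[i*n + j-1] + u[i*n + j+1]
                                   - f[i*n + j]);
}
```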
9 Irregular Applications: RADIX and N-BODY
- RADIX sorting: iterative sorting based on histograms
- Each process builds a local histogram, the local histograms are combined into a global histogram, and the keys are then permuted
- Irregular all-to-all communication; large communication-to-computation ratio and high memory bandwidth requirement (can exceed the capacity of a PC-SMP)
- SAS uses a global binary prefix tree to collect the local histograms; MPI uses Allgather instead of fine-grained communication (see the sketch below)
- N-BODY: simulates body interactions (galaxies, particles, etc.)
- 3D Barnes-Hut hierarchical octree method; the most complex code, with highly irregular, fine-grained communication
- Compute the forces on the particles, then update their positions
- The MPI and SAS tree-building algorithms differ significantly
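A minimal sketch of the histogram phase in the MPI formulation, assuming an 8-bit digit per pass (the actual radix is not given on the slides); the key permutation step is omitted.

```c
#include <mpi.h>
#include <string.h>

#define RADIX 256   /* 8-bit digit per pass (assumed, not from the slides) */

/* Build the local histogram for one digit position, then gather every
 * process's histogram so each rank can compute global permutation offsets.
 * all_hist must have room for nprocs * RADIX ints. */
void histogram_phase(const unsigned *keys, int nkeys, int shift, int *all_hist)
{
    int local_hist[RADIX];
    memset(local_hist, 0, sizeof(local_hist));

    for (int i = 0; i < nkeys; i++)
        local_hist[(keys[i] >> shift) & (RADIX - 1)]++;

    /* coarse-grained exchange in place of the SAS version's fine-grained
     * prefix-tree updates */
    MPI_Allgather(local_hist, RADIX, MPI_INT,
                  all_hist,   RADIX, MPI_INT, MPI_COMM_WORLD);
}
```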
10 N-BODY Implementation Differences
[Diagram: contrasting the SAS and MPI implementations; the MPI version distributes and collects cells and particles]
11 Improving the N-BODY SAS Implementation
- Duplicate the high-level cells
- The algorithm becomes much more like message passing
- Replication is not a "natural" programming style for SAS
[Diagram: SAS shared tree]
12 Performance of LU
- Communication requirements are small compared to our other applications
- SAS and MPI have similar performance characteristics
- The protocol overhead of running the SAS version is a small fraction of the overall time (speedups on 32 processors: SAS = 21.78, MPI = 22.43)
- For applications with low communication requirements, high scalability is achievable on PC clusters with both MPI and SAS
[Chart: time (sec) broken down into LOCAL, RMEM, and SYNC for SAS and MPI; 6144 x 6144 matrix on 32 processors]
13 Performance of OCEAN
- SAS performance is significantly worse than MPI (speedups on 32 processors: SAS = 6.49, MPI = 15.20)
- SAS suffers from expensive synchronization overhead: a barrier is required after each nearest-neighbor communication
- 50% of the synchronization overhead is spent waiting; the rest is protocol processing
- Synchronization cost in MPI is much lower because it is implicit in the send/receive pairs
[Chart: time (sec) broken down into LOCAL, RMEM, and SYNC for SAS and MPI; 514 x 514 grid on 32 processors]
14 Performance of RADIX
- MPI performance is more than three times better than SAS (speedups on 32 processors: SAS = 2.07, MPI = 7.78)
- Poor SAS speedup is due to memory bandwidth contention
- Once again, SAS suffers from the high protocol overhead of maintaining page coherence: computing diffs, creating timestamps, generating write notices, and garbage collection
[Chart: time (sec) broken down into LOCAL, RMEM, and SYNC for SAS and MPI; 32M integers on 32 processors]
15 Performance of N-BODY
- SAS performance is about half that of MPI (speedups on 32 processors: SAS = 14.30, MPI = 26.94)
- Synchronization overhead dominates the SAS runtime; 82% of barrier time is spent on protocol handling
- If very high performance is the goal, message passing is necessary on commodity SMP clusters
[Chart: time (sec) broken down into LOCAL, RMEM, and SYNC for SAS and MPI; 128K particles on 32 processors]
16 Origin2000 (Hardware Cache Coherency)
- Previous results showed that on a hardware-supported cache-coherent multiprocessor platform (the SGI Origin2000), SAS achieved MPI performance for this set of applications
[Diagram: Origin2000 node architecture (R12K processors, L2 caches, hub, memory, directory) and communication architecture (routers)]
17 Hybrid Performance on PC Cluster
- The latest teraflop-scale systems contain large numbers of SMPs; a hybrid paradigm combines two layers of parallelism (see the sketch below)
- It allows codes to benefit from loop-level parallelism and shared-memory algorithms in addition to coarse-grained parallelism
- Tradeoff: SAS may reduce intra-SMP communication, but can incur additional overhead for explicit synchronization
- Complexity example: the hybrid N-BODY code requires two kinds of tree building: a distributed local tree for MPI and a globally shared tree for SAS
- The hybrid performance gain (11% at most) does not compensate for the increased programming complexity
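A minimal sketch of the two-level structure: MPI between SMP nodes, shared memory within a node. The study pairs MPI with SVM-based SAS; OpenMP is used here only as a familiar stand-in for the intra-node shared-memory layer, and the computation itself is arbitrary.

```c
#include <mpi.h>
#include <omp.h>

/* One hybrid step: exchange a halo with a neighboring node via MPI, then
 * update the local chunk using loop-level parallelism across the 4 CPUs of
 * an SMP node.  Assumes MPI was initialized with at least
 * MPI_THREAD_FUNNELED and that 'neighbor' is a valid rank. */
void hybrid_step(double *local_chunk, int nlocal,
                 double *halo_buf, int halo_n, int neighbor)
{
    /* coarse-grained layer: explicit message passing between nodes */
    MPI_Sendrecv(local_chunk, halo_n, MPI_DOUBLE, neighbor, 0,
                 halo_buf,    halo_n, MPI_DOUBLE, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* fine-grained layer: shared-memory parallelism within the node */
    #pragma omp parallel for
    for (int i = 0; i < nlocal; i++)
        local_chunk[i] = 0.5 * (local_chunk[i] + halo_buf[i % halo_n]);
}
```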
18 MPI Collective Function: MPI_Allreduce
- How can collective communication be better structured on PC-SMP clusters? We explore algorithms for MPI_Allreduce and MPI_Allgather
- The stock MPI/Pro version is labeled "Original" (its exact algorithms are undocumented)
- For MPI_Allreduce, the 4-way structure of our SMPs motivates changing the deepest level of the binary tree (B-Tree) to a quadtree (B-Tree-4); see the sketch below
- Using SAS or MPI communication at the lowest level makes no difference

Execution time (in µsecs) on 32 processors for one double-precision variable:
  Original  1117
  B-Tree    1035
  B-Tree-4   981
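A minimal sketch of the B-Tree-4 idea: a 4-way fan-in inside each SMP node, a reduction among node leaders, and a fan-out back to the node. It is expressed with communicator splitting for brevity rather than the hand-built trees the slide describes, and the mapping of four consecutive ranks per node is an assumption.

```c
#include <mpi.h>

double allreduce_btree4(double local_val, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* 4 consecutive ranks per SMP node (assumed mapping) */
    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split(comm, rank / 4, rank, &node_comm);
    MPI_Comm_split(comm, (rank % 4 == 0) ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    double node_sum = 0.0, global_sum = 0.0;

    /* deepest level: 4-way fan-in inside the SMP node */
    MPI_Reduce(&local_val, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* upper levels: combine the node leaders (one per SMP) */
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* fan the result back out to the node's other 3 processors */
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);
    MPI_Comm_free(&node_comm);

    return global_sum;
}
```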
19 MPI Collective Function: MPI_Allgather
- Several algorithms were explored: initially B-Tree and B-Tree-4
- B-Tree-4*: after a processor at Level 0 collects the data, it sends it to Level 1 and below; however, Level 1 already holds the data from its own subtree
- It is therefore redundant to broadcast ALL of the data back; only the missing data needs to be exchanged (this optimization can be extended down to the lowest level of the tree, bounded by the size of an SMP)
- The improved communication functions yield up to a 9% performance gain (most time is spent in the send/receive functions)
[Chart: time in µsecs for P = 32 (8 nodes)]
20 Conclusions
- Examined the performance of several regular and irregular applications using MP (MPI/Pro over VIA on Giganet) and SAS (GeNIMA over VMMC on Myrinet) on a 32-processor PC-SMP cluster
- SAS provides substantial ease of programming, especially for the more complex codes that are irregular and dynamic
- Unlike previous results on hardware-supported CC-SAS machines, SAS achieved only about half the parallel efficiency of MPI for most of our applications (LU was an exception, where performance was similar)
- The high SAS overhead stems from the cost of the SVM protocol for maintaining page coherence and implementing synchronization
- Hybrid codes offered no significant performance advantage over pure MPI, while increasing programming complexity and reducing portability
- Presented new algorithms for improved SMP collective communication functions
- If very high performance is the goal, the difficulty of MPI programming appears to be necessary on today's commodity SMP clusters