Slide 1: On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon
D. Pekurovsky, L. Nett-Carrington, D. Holland, T. Kaiser
San Diego Supercomputer Center
Slide 2: Overview
- Blue Horizon hardware
- Motivation for this work
- Two methods of hybrid programming
- Fine grain results
- A word on coarse grain techniques
- Coarse grain results
- Time variability
- Effects of thread binding
- Final conclusions
Slide 3: Blue Horizon hardware
- 144 IBM SP High Nodes
- Each node: 8-way SMP, 4 GB memory, crossbar
- Each processor: Power3, 222 MHz, 4 Flop/cycle
- Aggregate peak: 1.002 Tflop/s
- Compilers: IBM mpxlf_r version 7.0.1; KAI guidef90 version 3.9
Slide 4: Blue Horizon hardware
Interconnect (between nodes):
- Currently: 115 MB/s, 4 MPI tasks/node; must use OpenMP to utilize all processors
- Soon: 500 MB/s, 8 MPI tasks/node; can use OpenMP to supplement MPI (if it is worth it)
Slide 5: Hybrid programming: why use it?
Non-performance-related reasons:
- Avoid replication of data on the node
Performance-related reasons:
- Avoid latency of MPI on the node
- Avoid unnecessary data copies inside the node
- Reduce latency of MPI calls between the nodes
- Reduce the cost of global MPI operations (reduction, all-to-all) by involving fewer tasks
The price to pay:
- OpenMP overheads
- False sharing
Is it really worth trying?
Slide 6: Hybrid programming
Two methods of combining MPI and OpenMP in parallel programs:

Fine grain:

      program main
      ! MPI initialization
      ...
      ! cpu intensive loop
!$OMP PARALLEL DO
      do i = 1, n
         ! work
      end do
      ...
      end

Coarse grain:

      program main
      ! MPI initialization
!$OMP PARALLEL
      ...
      do i = 1, n
         ! work
      end do
      ...
!$OMP END PARALLEL
      end
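For concreteness, a minimal runnable version of the fine-grain pattern is sketched below. This is an illustration, not code from the study: the program name fine_grain_demo, the array size n, the loop body, and the closing MPI_REDUCE are all placeholder choices.

program fine_grain_demo
  ! Illustrative fine-grain hybrid sketch: MPI between tasks, OpenMP only
  ! around the compute loop. n and the loop body are placeholders.
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 1000000
  integer :: ierr, rank, ntasks, i
  real(8), allocatable :: a(:)
  real(8) :: local_sum, global_sum

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierr)
  allocate(a(n))

  ! The cpu-intensive loop: the thread team is created and joined right
  ! here, every time the loop is reached (the source of fine-grain overhead).
  local_sum = 0.0d0
!$OMP PARALLEL DO REDUCTION(+:local_sum)
  do i = 1, n
     a(i) = dble(i + rank)
     local_sum = local_sum + a(i)
  end do
!$OMP END PARALLEL DO

  ! All MPI calls stay outside the parallel region.
  call MPI_REDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                  MPI_SUM, 0, MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'global sum =', global_sum
  call MPI_FINALIZE(ierr)
end program fine_grain_demo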
Slide 7: Hybrid programming
Fine grain approach:
- Easy to implement
- Performance: low, due to the overhead of OpenMP directives (OMP PARALLEL DO)
Coarse grain approach:
- Time-consuming to implement
- Performance: less overhead, since threads are created only once
Slide 8: Hybrid NPB using fine grain parallelism
CG, MG, and FT suites of the NAS Parallel Benchmarks (NPB).

Suite name                  # loops parallelized
CG - Conjugate Gradient     18
MG - Multi-Grid             50
FT - Fourier Transform      8

Results shown are the best of 5-10 runs.
Complete results at http://www.sdsc.edu/SciComp/PAA/Benchmarks/Hybrid/hybrid.html
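To give a feel for what a "parallelized loop" means in this style, here is an illustrative CG-like sparse matrix-vector product (CSR storage) annotated with a single directive. It is written for this overview and is not the actual NPB source.

subroutine sparse_matvec(n, rowstr, colidx, a, p, q)
  ! Illustrative fine-grain parallelization of a CG-style sparse
  ! matrix-vector product q = A*p in CSR storage (not the NPB source).
  implicit none
  integer, intent(in)  :: n, rowstr(n+1), colidx(*)
  real(8), intent(in)  :: a(*), p(*)
  real(8), intent(out) :: q(n)
  integer :: i, k
  real(8) :: s
  ! One OMP PARALLEL DO per compute loop: simple to add, but the thread
  ! team is forked and joined on every call, which is where the
  ! fine-grain overhead comes from.
!$OMP PARALLEL DO PRIVATE(k, s)
  do i = 1, n
     s = 0.0d0
     do k = rowstr(i), rowstr(i+1) - 1
        s = s + a(k) * p(colidx(k))
     end do
     q(i) = s
  end do
!$OMP END PARALLEL DO
end subroutine sparse_matvec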
Slide 9: Fine grain results - CG (A&B class)
Slide 10: Fine grain results - MG (A&B class)
Slide 11: Fine grain results - MG (C class)
Slide 12: Fine grain results - FT (A&B class)
Slide 13: Fine grain results - FT (C class)
Slide 14: Hybrid NPB using coarse grain parallelism: MG suite
Overview of the method
[Diagram: four MPI tasks (Task 1-4), each running OpenMP threads (Thread 1, Thread 2)]
Slide 15: Coarse grain programming methodology
- Start with the MPI code
- Each MPI task spawns threads once, at the beginning
- Serial work (initialization etc.) and MPI calls are done inside a MASTER or SINGLE region
- Main arrays are global
- Work distribution: each thread gets a chunk of the array based on its number (omp_get_thread_num()); in this work, one-dimensional blocking
- Avoid using OMP DO
- Be careful with variable scoping and synchronization
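A minimal sketch of this methodology, assuming a simple one-dimensional array update with one allreduce per iteration (the program name, n, niter, and the per-iteration work are invented for illustration; the real MG code is far more involved):

program coarse_grain_demo
  ! Sketch of the coarse-grain methodology above: threads are spawned once,
  ! each thread works on its own 1-D block of the global arrays, and MPI is
  ! called only from the master thread. Illustration only.
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 1000000, niter = 10
  real(8), allocatable :: u(:)          ! main arrays are global (shared)
  real(8) :: local_sum, global_sum
  integer :: ierr, rank, it, i
  integer :: nthreads, tid, ilo, ihi
  integer, external :: omp_get_num_threads, omp_get_thread_num

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  allocate(u(n))
  u = 0.0d0

!$OMP PARALLEL PRIVATE(tid, nthreads, ilo, ihi, i, it)
  tid      = omp_get_thread_num()
  nthreads = omp_get_num_threads()
  ! 1-D blocking: thread tid owns the contiguous chunk u(ilo:ihi)
  ilo = tid * n / nthreads + 1
  ihi = (tid + 1) * n / nthreads

  do it = 1, niter
     ! Thread-local work on the owned chunk; no OMP DO is used.
     do i = ilo, ihi
        u(i) = u(i) + 1.0d0
     end do
!$OMP BARRIER        ! all chunks updated before the master reads them

!$OMP MASTER
     ! Serial work and MPI calls are confined to the master thread.
     local_sum = sum(u)
     call MPI_ALLREDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                        MPI_SUM, MPI_COMM_WORLD, ierr)
!$OMP END MASTER
!$OMP BARRIER        ! keep the other threads from racing ahead of the master
  end do
!$OMP END PARALLEL

  if (rank == 0) print *, 'final global sum =', global_sum
  call MPI_FINALIZE(ierr)
end program coarse_grain_demo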
Slide 16: Coarse grain results - MG (A class)
Slide 17: Coarse grain results - MG (C class)
Slide 18: Coarse grain results - MG (C class), full node results
[Table: Max and Min MOPS/CPU for runs on 8 and 64 SMP nodes with MPI tasks x OpenMP threads configurations of 4x2, 2x4, and 1x8]
Slide 19: Variability
- Run times vary by a factor of 2 to 5 (on 64 nodes)
- Seen mostly when the full node is used
- Seen in both fine grain and coarse grain runs
- Seen with both the IBM and the KAI compiler
- Seen in runs on the same set of nodes as well as between different sets
- On a large number of nodes, average performance suffers badly
- Confirmed in a micro-study of OpenMP on one node
Slide 20: OpenMP on 1 node: microbenchmark results
http://www.sdsc.edu/SciComp/PAA/Benchmarks/Open/open.html
Slide 21: Thread binding
Question: is the variability related to thread migration?
A study on 1 node:
- Each OpenMP thread performs an independent matrix inversion taking about 1.6 seconds
- Monitor the processor id and run time of each thread
- Repeat 100 times
- Threads bound OR not bound
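A hypothetical sketch of such a single-node probe is shown below (not the code used in the study). The dense triple loop merely stands in for the ~1.6 s matrix inversion, thread binding on or off is assumed to be selected outside the program through the OpenMP runtime environment, and recording the processor id would additionally require a platform-specific call that is omitted here.

program binding_probe
  ! Hypothetical single-node probe: each thread repeatedly does an
  ! independent piece of dense work and reports its thread id and the
  ! elapsed time, so unusually slow repetitions stand out.
  implicit none
  integer, parameter :: n = 200, nrep = 100
  real(8) :: a(n, n), b(n, n)
  real(8) :: t0, t1
  integer :: tid, rep, i, j, k
  real(8), external :: omp_get_wtime
  integer, external :: omp_get_thread_num

!$OMP PARALLEL PRIVATE(tid, rep, i, j, k, a, b, t0, t1)
  tid = omp_get_thread_num()
  do j = 1, n
     do i = 1, n
        a(i, j) = 1.0d0 / dble(i + j + tid)
     end do
  end do

  do rep = 1, nrep
     t0 = omp_get_wtime()
     ! Stand-in for the matrix inversion: a dense matrix-matrix product.
     b = 0.0d0
     do j = 1, n
        do k = 1, n
           do i = 1, n
              b(i, j) = b(i, j) + a(i, k) * a(k, j)
           end do
        end do
     end do
     t1 = omp_get_wtime()
!$OMP CRITICAL
     print '(a,i2,a,i4,a,f8.3,a)', 'thread ', tid, '  rep ', rep, &
           '  time ', t1 - t0, ' s'
!$OMP END CRITICAL
  end do
!$OMP END PARALLEL
end program binding_probe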
Slide 22: Thread binding
Results for OMP_NUM_THREADS=8:
- Without binding, threads migrated in about 15% of the runs
- With thread binding turned on there was no migration
- In 3% of iterations some thread ran longer than 2.0 seconds, a 25% slowdown
- The slowdown occurs with and without binding
Effect of a single slow thread:
- Probability that the complete calculation is slowed: P = 1 - (1 - c)^M
- With c = 3% and M = 144 nodes of Blue Horizon, P = 0.9876: the overall result is almost certain to be slowed by 25%
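Written out, the quoted probability comes from treating each of the M node-jobs as independently having a c = 3% chance of containing a slow thread:

\[
P \;=\; 1 - (1 - c)^{M} \;=\; 1 - 0.97^{144} \;\approx\; 1 - 0.0124 \;=\; 0.9876
\]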
Slide 23: Thread binding
The calculation was rerun with OMP_NUM_THREADS=7:
- 12.5% reduction in available computational power (7 of 8 processors used)
- No thread showed a slowdown; all ran in about 1.6 seconds
Summary:
- OMP_NUM_THREADS=7 gives a 12.5% reduction in computational power
- OMP_NUM_THREADS=8 gives a 0.9876 probability that the overall result is slowed by 25%, independent of thread binding
Slide 24: Overall conclusions
Based on our study of NPB on Blue Horizon:
- The fine grain hybrid approach is generally worse than pure MPI
- The coarse grain approach for MG is comparable with pure MPI or slightly better
- The coarse grain approach is time- and effort-consuming; general coarse grain techniques are given
- There is large run-to-run variability when using the full node; until this is fixed, we recommend using fewer than 8 threads per node
- Thread binding does not influence performance