1 Presentation at the 4th PMEO-PDS Workshop
Benchmark Measurements of Current UPC Platforms
Zhang Zhang and Steve Seidel
Michigan Technological University
Denver, Colorado, 3/22/2005
2 Presentation Outline
Background
  – Unified Parallel C, implementations and users
  – Previous UPC performance studies
Experiments
  – Available UPC platforms
  – Benchmarks
Performance measurements
Conclusions
3 UPC Overview
UPC is an extension of C for partitioned shared memory parallel programming.
  – A special case of the shared memory programming model.
  – Similar languages: Co-Array Fortran, Titanium.
  – UPC homepage: http://www.upc.gwu.edu
Platforms supported:
  – Cray X1, Cray T3E, SGI Origin, HP AlphaServer, HP-UX, Linux clusters, IBM SP.
UPC compilers:
  – Open source: MuPC, Berkeley UPC, Intrepid UPC
  – Commercial: HP UPC, Cray UPC
Users:
  – LBNL, IDA, AHPCRC, …
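As a rough illustration of what "partitioned" means here: every element of a UPC shared array has affinity to exactly one thread, and a reference is local or remote depending on whether the referencing thread owns the element. The sketch below is plain C, not UPC. It models the standard block-cyclic layout rule, where block size 1 corresponds to the default cyclic layout. The function names owner_thread and is_local are hypothetical helpers introduced for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Thread that element i of a shared array has affinity to, for a
 * block-cyclic layout with `block` elements per block spread over
 * `threads` threads. block == 1 models the default cyclic layout.
 * (Hypothetical helper; not part of any UPC runtime API.) */
static size_t owner_thread(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    return (i / block) % threads;
}

/* Nonzero when thread `me` would see element i as a local shared
 * reference, zero when the reference would be remote. */
static int is_local(size_t i, size_t block, size_t threads, size_t me)
{
    return owner_thread(i, block, threads) == me;
}
```

With 4 threads and the cyclic layout, element 5 belongs to thread 1. With a block size of 2 it belongs to thread 2.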
4 Related UPC Performance Studies
Performance benchmark suites
  – UPC_Bench (GWU)
      – Synthetic microbenchmark based on the STREAM benchmark.
      – Application benchmarks: Sobel edge detection, matrix multiplication, N-Queens problem.
  – UPC NAS Parallel Benchmarks (GWU)
Performance monitoring
  – Performance analysis for HP UPC compiler (GWU)
  – Performance of Berkeley UPC on HP AlphaServer (Berkeley)
  – Performance of Intrepid UPC on SGI Origin (GWU)
5 Benchmarking UPC Systems
Extended shared memory bandwidth microbenchmarks to cover various reference patterns:
  – Scalar references: 11 access patterns
  – Block memory operations: 9 access patterns
Benchmarked six combinations of available UPC compilers and platforms using both the UPC STREAM (MTU code) and the UPC NAS Parallel Benchmarks (GWU code).
  – Compilers: MuPC, HP UPC, Berkeley UPC and Intrepid UPC
  – Platforms: Myrinet Linux cluster, HP AlphaServer SC, and T3E
The first comparison of performance for currently available UPC implementations.
The first report on MuPC performance.
6 Benchmarks
Synthetic benchmarks:
  – The STREAM microbenchmark was rewritten in UPC to cover a wider variety of shared memory access patterns:
      – Local shared read / write
      – Unit stride shared read / write / copy
      – Random shared read / write / copy
      – Stride-n shared read / write / copy
      – Block transfers with variations of source and sink affinities
NAS Parallel Benchmark Suite v2.4
  – The UPC version was developed at GWU.
  – Five kernels: CG, EP, FT, IS and MG.
  – Two variations: naïve version and hand-tuned version.
  – Input size: Class A workload.
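The access patterns listed above differ mainly in the sequence of indices they touch. The following is a minimal sketch in plain C, not the MTU benchmark code, of how unit stride, stride-n and random index sequences might be generated and how a STREAM-style copy kernel reports bytes moved. All function names, the generator constants and the byte accounting (one read plus one write per reference) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Index touched at step k of a strided sweep over n elements.
 * stride == 1 gives the unit stride pattern. */
static size_t strided_index(size_t k, size_t stride, size_t n)
{
    assert(n > 0);
    return (k * stride) % n;
}

/* Reproducible pseudo-random index from a small linear congruential
 * generator; the constants are illustrative, not from the benchmark. */
static size_t random_index(unsigned long *state, size_t n)
{
    assert(n > 0);
    *state = *state * 1103515245UL + 12345UL;
    return (size_t)((*state >> 16) % n);
}

/* STREAM-style copy: dst[i] = src[i] for `count` strided references.
 * Returns the bytes moved, counting one read and one write each. */
static size_t copy_strided(double *dst, const double *src,
                           size_t n, size_t stride, size_t count)
{
    for (size_t k = 0; k < count; k++) {
        size_t i = strided_index(k, stride, n);
        dst[i] = src[i];
    }
    return 2 * count * sizeof(double);
}

/* Bandwidth in MB/s (10^6 bytes per second) from bytes and seconds. */
static double bandwidth_mb_per_s(size_t bytes, double seconds)
{
    return (double)bytes / seconds / 1.0e6;
}
```

In a real measurement the kernel would be wrapped in a timer and the arrays would be UPC shared arrays whose affinity decides whether each reference is local or remote. Here ordinary C arrays stand in for them.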
7 Local Shared References
Intrepid UPC: performance is poor on local shared accesses.
HP UPC: cache state has a significant effect on local shared accesses.
8 Remote Shared References
HP UPC and MuPC: caches help unit stride remote shared accesses.
Intrepid UPC does the best for remote shared accesses.
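One way to see why a cache favors unit stride remote accesses is a toy model that counts how many remote fetches a sweep needs when each fetch brings in a whole line of consecutive elements. The sketch below is plain C and assumes a single cached line of `line` elements. It is not a description of the actual caches in HP UPC or MuPC, and the name remote_fetches is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count remote fetches for `count` references at the given stride,
 * starting from element 0, under a toy cache that holds exactly one
 * line of `line` consecutive elements. */
static size_t remote_fetches(size_t count, size_t stride, size_t line)
{
    assert(line > 0);
    size_t fetches = 0;
    size_t cached_line = (size_t)-1;        /* nothing cached yet */
    for (size_t k = 0; k < count; k++) {
        size_t this_line = (k * stride) / line;
        if (this_line != cached_line) {     /* miss: fetch the line */
            fetches++;
            cached_line = this_line;
        }
    }
    return fetches;
}
```

Under this model, 64 references with 8-element lines need 8 fetches at unit stride, 32 at stride 4, and 64 at stride 8, so the benefit of caching disappears once the stride reaches the line size.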
9 Block Memory Operations
HP UPC: performance is poor on certain string functions.
Intrepid UPC: low performance in all categories.
10 NPB – CG
The only case that scales well: Berkeley UPC + optimized code.
11 NPB – EP
12 NPB – FT
HP, Berkeley and MuPC: performance is comparable.
13 NPB – IS
HP, Berkeley and MuPC: performance is comparable.
14 NPB – MG
MG performance is very inconsistent.
15 Conclusions
STREAM benchmarking:
  – UPC language overhead reduces the performance of local shared references.
  – Remote reference caching helps stride-1 accesses.
  – Copying between two locations with the same affinity to a remote thread needs optimization.
NPB benchmarking:
  – Some implementations failed on some benchmarks. More stable and reliable implementations are needed.
  – Hand-tuning techniques (e.g. prefetching) are critical to performance.
  – Berkeley UPC is the best at handling unstructured, fine-grained references.
  – MuPC experience shows that it will be more rewarding to optimize remote shared references than to improve network interconnects.
16 Thank you!
For more information: http://www.upc.mtu.edu