1
Understanding Application Scaling
NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000
Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler
{fredwong, rmartin, remzi, davidwu, culler}@CS.Berkeley.EDU
Department of Electrical Engineering and Computer Science
Computer Science Division
University of California, Berkeley
June 15th, 1998
2
Introduction
- The NAS Parallel Benchmarks suite 2.2 (NPB) has been widely used to evaluate modern parallel systems
- 7 scientific benchmarks that represent the most common computation kernels
- NPB is written on top of the Message Passing Interface (MPI) for portability
- NPB is a Constant Problem Size (CPS) scaling benchmark suite
- This study focuses on understanding NPB scaling on both NOW and SGI Origin 2000
3
Motivation
- Early study on NPB shows ideal speedup on NOW!
  - Scaling as good as T3D and better than SP-2
  - Per-node performance better than T3D, close to SP-2
- Submitted results for Origin 2000 show a spread
4
Presentation Outline
- Hardware Configuration
- Time Breakdown of the Applications
- Communication Performance
- Computation Performance
- Conclusion
5
Hardware Configuration
- SGI Origin 2000 (64 nodes)
  - MIPS R10000 processor, 195 MHz, 32 KB/32 KB L1
  - 4 MB external L2 cache per processor
  - 16 GB memory total
  - MPI performance: 13 µs one-way latency, 150 MB/s peak bandwidth, half-power point at 8 KB message size
- Network of Workstations (NOW)
  - UltraSPARC I processor, 167 MHz, 16 KB/16 KB L1
  - 512 KB external L2 cache per processor
  - 128 MB memory per processor
  - MPI performance: 22 µs one-way latency, 27 MB/s peak bandwidth, half-power point at 4 KB message size
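To make the peak and half-power figures concrete, here is a minimal sketch that turns them into a bandwidth-versus-message-size curve. It assumes a Hockney-style saturation model, r(n) = r_peak * n / (n + n_half), which is a common textbook approximation and not necessarily how the authors measured or fit their data. It also assumes the peak figures are in MB/s. The function and dictionary names are illustrative only.

```python
def delivered_bandwidth(msg_bytes, peak_mb_s, half_power_bytes):
    """Hockney-style model (an assumption, not the authors' method).

    Delivered bandwidth rises with message size and reaches exactly half
    of the peak when msg_bytes == half_power_bytes.
    """
    return peak_mb_s * msg_bytes / (msg_bytes + half_power_bytes)


# Peak bandwidth and half-power message size as quoted on the slide.
PLATFORMS = {
    "SGI Origin 2000": {"peak_mb_s": 150.0, "half_power_bytes": 8 * 1024},
    "NOW": {"peak_mb_s": 27.0, "half_power_bytes": 4 * 1024},
}


def main():
    for name, cfg in PLATFORMS.items():
        for size in (1024, 4096, 8192, 65536):
            bw = delivered_bandwidth(size, **cfg)
            print(f"{name:16s} {size:6d} B -> {bw:6.1f} MB/s")


if __name__ == "__main__":
    main()
```

Under this model, a platform with a larger half-power point needs larger messages before it gets close to its peak, so peak bandwidth alone says little about what an application with small messages will see.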
6
Time Breakdown -- LU
- Black line -- total running time, summed over all processors
  - a job that takes one worker 10 secs
  - ideally takes 2 workers 5 secs each
  - total amount of work -- still 10 secs
- Parallel runs do more total work because they need communication
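The accounting behind the black line can be sketched in a few lines. The 10 s and 5 s figures come from the slide's example; the 6 s case is a made-up illustration of a run that pays for communication, and the function names are hypothetical.

```python
def total_work(num_procs, time_per_proc_s):
    """Total work is the per-processor running time summed over processors."""
    return num_procs * time_per_proc_s


def extra_work(num_procs, time_per_proc_s, serial_time_s):
    """Work beyond the one-processor job, e.g. communication overhead."""
    return total_work(num_procs, time_per_proc_s) - serial_time_s


def speedup(serial_time_s, time_per_proc_s):
    """Classic speedup: serial running time over parallel running time."""
    return serial_time_s / time_per_proc_s


if __name__ == "__main__":
    serial = 10.0  # one worker, 10 secs (from the slide)

    # Ideal case from the slide: 2 workers, 5 secs each, no extra work.
    print(total_work(2, 5.0), extra_work(2, 5.0, serial), speedup(serial, 5.0))

    # Hypothetical case: each worker needs 6 secs because of communication.
    print(total_work(2, 6.0), extra_work(2, 6.0, serial), speedup(serial, 6.0))
```

A flat total-work line therefore corresponds to ideal speedup, and any rise in the line is work the serial run did not have to do.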
7
Time Breakdown -- LU
8
Time Breakdown -- SP
9
Communication Performance
- Micro-benchmarks show that the SGI O2000 has better point-to-point communication performance than NOW
10
Communication Efficiency
- absolute delivered bandwidth is close on the two platforms
  - SP/32 on NOW -- 215s
  - SP/32 on SGI -- 289s
- communication on SGI achieved only 30% of the potential bandwidth
- protocol tradeoffs are pronounced
  - handshake vs. bulk send in point-to-point
  - collective ops
11
Computation Performance
- Relative single-node performance of the benchmarks roughly matches the difference in processor performance
- Both computational CPI and L2 misses change significantly on both platforms when scaled
12
Recap on CPS Scaling
[Figure; axis ticks: 4, 8, 16, 32, 64, 128, 256]
13
LU Working Set
- 4-processor
  - Knee starts at 256 KB
14
LU Working Set
- 4-processor
  - Knee starts at 256 KB
- 8-processor
  - Knee starts at 128 KB
15
LU Working Set
- 4-processor
  - Knee starts at 256 KB
- 8-processor
  - Knee starts at 128 KB
- 16-processor
  - Knee starts at 64 KB
16
LU Working Set
- 4-processor
  - Knee starts at 256 KB
- 8-processor
  - Knee starts at 128 KB
- 16-processor
  - Knee starts at 64 KB
- 32-processor
  - Knee starts at 32 KB
- miss rate drops from 2 MB to 4 MB global cache
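The four knee positions halve each time the processor count doubles, which is what constant problem size scaling would predict if the working set is split evenly. The sketch below checks that pattern against the slide's numbers. The 1/P relationship is inferred from these four points only, and the global-cache helper assumes global cache means processors times per-processor L2 (512 KB per node on NOW, per the hardware slide). The slide does not say which platform the 2 MB and 4 MB figures refer to.

```python
# Knee positions read off the slide: processors -> per-processor knee in KB.
OBSERVED_KNEE_KB = {4: 256, 8: 128, 16: 64, 32: 32}


def implied_total_working_set_kb(observed):
    """Return the set of P * knee products.

    A single value in the result means the knee scales exactly as 1/P.
    """
    return {procs * knee for procs, knee in observed.items()}


def predicted_knee_kb(num_procs, total_kb):
    """Per-processor knee if a fixed working set is divided evenly."""
    return total_kb / num_procs


def global_cache_kb(num_procs, l2_per_proc_kb):
    """Aggregate L2 capacity (assumed definition of 'global cache')."""
    return num_procs * l2_per_proc_kb


if __name__ == "__main__":
    print(implied_total_working_set_kb(OBSERVED_KNEE_KB))
    for procs in (4, 8):
        print(procs, "procs:", global_cache_kb(procs, 512), "KB global L2")
```

With 512 KB of L2 per node, 4 and 8 processors give 2 MB and 4 MB of aggregate cache, which lines up with the sizes named in the last bullet.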
17
SP Working Set
- Cost under scaling
  - extra work worsens the memory system's performance
  - total memory references on SGI
    - 4-processor run has 64.38 billion memory references
    - 25-processor run has 72.35 billion memory references
    - 12.38% increase
[Figure labels: Cost, Benefit]
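The quoted increase follows directly from the two reference counts. A short check, using only the numbers on the slide (the function name is illustrative):

```python
def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    refs_4_procs = 64.38   # billions of memory references, 4 processors
    refs_25_procs = 72.35  # billions of memory references, 25 processors
    print(f"{percent_increase(refs_4_procs, refs_25_procs):.2f}% increase")
```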
18
Conclusion
- NPB
  - communication performance is hard to predict from micro-benchmarks
  - increases in global cache effectively reduce computation time
  - sequential node architecture is a dominant factor in NPB performance
- NOW
  - an inexpensive way to go parallel
  - absolute performance is excellent
  - MPI on NOW has good scalability and performance
  - NOW vs. proprietary system -- detailed instrumentation ability
- speedup cannot tell the whole story; scalability involves:
  - the interplay of program and machine scaling
  - delivered communication performance, not micro-benchmarks
  - complicated memory system performance