1
Eidgenössische Technische Hochschule Zürich
Ecole polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich
25th Annual International Symposium on Computer Architecture
7th Workshop on Scalable Shared Memory Multiprocessor
Memory System Performance of High End SMPs, PCs and Clusters of PCs
Ch. Kurmann, T. Stricker
Laboratory for Computer Systems
ETHZ - Swiss Institute of Technology
CH-8092 Zurich
Color Slides: http://www.cs.inf.ethz.ch/CoPs/isca98ws/
2
2 Memory Systems
- Low End designs in PCs:
  - extremely low cost
  - standard I/O interface
- High End designs in “Killer” Workstations:
  - well engineered memory systems
  - support for additional datastreams
  - better I/O busses
- Are Low End SMPs the universal compute nodes for parallel and distributed systems?
3
3 Contribution
- The answer is probably the memory system performance.
- How significant are the differences in memory system performance?
- Limitations of Low End memory systems
  - for local computation (e.g. in scientific applications)
  - for inter-node communication (e.g. in databases)
4
4 Extended Copy Transfer Characterization
ECT is a method to characterize the performance of memory systems (ISCA95 and HPCA97):
- Categories
  - Access pattern, stride (spatial locality)
  - Working set (temporal locality)
- Value
  - Transfer bandwidth (large amount of data)
- Same chart resulting from one microbenchmark
  - Local and Remote transfers
  - compute and communicate accesses
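The two chart categories (stride and working set) and the measured value (transfer bandwidth) can be illustrated with a small load kernel. This is a minimal sketch and not the authors' benchmark: the function names `strided_load_sum` and `mbytes_per_sec` are hypothetical, the choice to touch every word exactly once by sweeping all start offsets is an assumption, and 1 MByte is assumed to be 10^6 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: load every 64-bit word of a working set of
 * `words` words, visiting them with the given stride. Assumption: each
 * start offset 0..stride-1 gets its own strided pass, so the whole
 * working set is read exactly once whatever the stride. The sum is
 * returned so the loads have an observable result. */
uint64_t strided_load_sum(const uint64_t *buf, size_t words, size_t stride)
{
    assert(buf != NULL && stride >= 1);
    uint64_t sum = 0;
    for (size_t offset = 0; offset < stride && offset < words; offset++) {
        for (size_t i = offset; i < words; i += stride) {
            sum += buf[i];
        }
    }
    return sum;
}

/* Hypothetical helper: convert bytes moved and elapsed seconds into
 * MByte/s, assuming 1 MByte = 10^6 bytes. */
double mbytes_per_sec(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return ((double)bytes / 1.0e6) / seconds;
}
```

One point of a chart would then come from timing repeated calls for a fixed (working set, stride) pair, for example with `clock()` from `<time.h>`, and passing the total bytes loaded and the elapsed time to `mbytes_per_sec`.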
5
5 Measurement Problems
Some parameter combinations are hard to measure, even with carefully tuned C code:
- Reduced performance for large strides and small working sets in L1 caches is a measurement artifact and not architecture related.
- Compilers occasionally generate suboptimal instruction schedules for loads / stores.
6
6 Local Load Access: Pentium Pro PC
[Chart axes: working set; access pattern (stride between 64-bit words)]
7
7 Local Load Access: SGI Origin
[Chart axes: working set; access pattern (stride between 64-bit words)]
8
8 Local Load Access: DEC 8400
[Chart axes: working set; access pattern (stride between 64-bit words)]
9
9 Local Load Access: Sun Enterprise
[Chart: load bandwidth (MByte/s, 0 to 700) by working set (0.5 K to 16 M) and access pattern (stride between 64-bit words, 1 to 128); regions labeled L1, L2 and DRAM. System: Sun Ultra Enterprise, one UltraSPARC II, 248 MHz]
10
10 Local Load Access: SGI Cray T3E
[Chart: load bandwidth (MByte/s, 0 to 1200) by working set (0.5 K to 16 M) and access pattern (stride between 64-bit words, 1 to 128); regions labeled L1, L2 and DRAM. System: Cray T3E, one processor, 300 MHz]
11
11 Comparison - Local Access
12
12 Performance in an SMP setting
- Copy bandwidth decreases for simultaneous access with 1, 2, 4 and 8 processors
- Topics of interest:
  - small working sets in caches: performance remains same
  - large working sets in memory: interesting differences
  - behavior for even/uneven strides
- “Gather copy stream” (strided load / contiguous store)
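The "gather copy stream" named above pairs strided loads with contiguous stores. The following is a minimal single-processor sketch of that access pattern, not the authors' code: the name `gather_copy` is hypothetical, and sweeping all start offsets so that every source word is copied exactly once is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical gather copy: read `src` with the given stride (strided
 * load) and write the words back to back into `dst` (contiguous store).
 * Assumption: each start offset 0..stride-1 gets its own pass, so all
 * `words` source words are copied exactly once. `dst` must have room
 * for `words` elements and must not overlap `src`. */
void gather_copy(uint64_t *dst, const uint64_t *src, size_t words, size_t stride)
{
    assert(dst != NULL && src != NULL && stride >= 1);
    size_t k = 0; /* next contiguous store position */
    for (size_t offset = 0; offset < stride && offset < words; offset++) {
        for (size_t i = offset; i < words; i += stride) {
            dst[k++] = src[i];
        }
    }
}
```

With eight source words numbered 0 to 7 and a stride of 4, the destination receives the words in the order 0, 4, 1, 5, 2, 6, 3, 7. In an SMP measurement, each processor would presumably run such a loop on its own buffers at the same time; that setup is not shown here.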
13
13 Local Copy: Pentium Pro SMP
14
14 Local Copy: SGI Origin CC-NUMA
15
15 Local Copy: DEC 8400 SMP
16
16 Local Copy: Sun Enterprise SMP
17
17 Remote in Parallel Computers
- Parallel & Network Computers: SGI Cray T3E, SGI Origin, Clusters of PCs (CoPs)
- Symmetric Multiprocessors: DEC 8400, Sun Enterprise, Pentium Pro SMPs
[Diagram legend: P = Processor, C = Caches, M = Memory. One panel shows P-C-M nodes joined by a network; the other shows P-C pairs sharing memory over a bus/network.]
18
18 Remote Transfers: CoPs Pentium Pro with SCI / Myrinet
19
19 Remote Transfers: SGI Origin
20
20 Remote Transfers: DEC 8400
21
21 Remote Transfers: SGI Cray T3E
22
22 Comparison - Remote Transfers
23
23 Improvement of PC Chipsets
- Intel 440 BX AGP Chip Set 400 MHz / 100 MHz
- Intel 440 LX AGP Chip Set 233 MHz / 66 MHz
- Intel 440 FX Natoma Chip Set 200 MHz / 66 MHz
24
24 Conclusion
- ECT characterizations for different memory systems:
  - T3E (MPP node), Origin (NUMA), DEC 8400 (SMP)
  - CoPs: Intel P6 SMPs and clusters
- High End SMP vs. Low End SMP:
  - Less than half the performance on two-processor PCs.
- Fast communication puts high demands on the memory system:
  - Unlike in traditional SMPs and CC-NUMAs, fine-grained remote accesses do not perform at all in PC SMPs and CoPs.
- Adding more commodity microprocessors without reinforcing the memory system is therefore questionable.