Eidgenössische Technische Hochschule Zürich / Ecole polytechnique fédérale de Zurich / Politecnico federale di Zurigo / Swiss Federal Institute of Technology Zurich

25th Annual International Symposium on Computer Architecture
7th Workshop on Scalable Shared Memory Multiprocessor

Memory System Performance of High End SMPs, PCs and Clusters of PCs

Ch. Kurmann, T. Stricker
Laboratory for Computer Systems
ETHZ - Swiss Institute of Technology
CH-8092 Zurich

Color Slides: http://www.cs.inf.ethz.ch/CoPs/isca98ws/

2 Memory Systems
- Low End designs in PCs:
    - extremely low cost
    - standard I/O interface
- High End designs in “Killer” Workstations:
    - well engineered memory systems
    - support for additional datastreams
    - better I/O busses
- Are Low End SMPs the universal compute nodes for parallel and distributed systems?

3 Contribution
- The answer probably depends on memory system performance.
- How significant are the differences in memory system performance?
- Limitations of Low End memory systems
    - for local computation (e.g. in scientific applications)
    - for inter-node communication (e.g. in databases)

4 Extended Copy Transfer Characterization
ECT is a method to characterize the performance of memory systems (ISCA95 and HPCA97):
- Categories
    - Access pattern, stride (spatial locality)
    - Working set (temporal locality)
- Value
    - Transfer bandwidth (large amount of data)
- Same chart resulting from one microbenchmark
    - Local and Remote transfers
    - compute and communicate accesses
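As a rough illustration of what such a microbenchmark might look like for local loads, the sketch below times a strided read over a buffer of a given working-set size. It is not the authors' ECT code: the function names (`strided_load_sum`, `words_touched`, `load_bandwidth_mbs`), the use of `clock()` for timing, and the choice to count only the 64-bit words actually read are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Read every stride-th 64-bit word of buf[0..n_words) and return the sum,
 * so the loads have a visible result. (Hypothetical helper.) */
uint64_t strided_load_sum(const uint64_t *buf, size_t n_words, size_t stride)
{
    assert(stride >= 1);
    uint64_t sum = 0;
    for (size_t i = 0; i < n_words; i += stride)
        sum += buf[i];
    return sum;
}

/* Number of words read in one pass: ceil(n_words / stride). */
size_t words_touched(size_t n_words, size_t stride)
{
    assert(stride >= 1);
    return (n_words + stride - 1) / stride;
}

/* Estimate load bandwidth in MByte/s for one (working set, stride) pair.
 * Assumption: bandwidth counts only the words actually read.
 * Returns -1.0 if allocation fails and 0.0 if the run was too short
 * for clock() to resolve. */
double load_bandwidth_mbs(size_t working_set_bytes, size_t stride, int repeats)
{
    size_t n_words = working_set_bytes / sizeof(uint64_t);
    assert(n_words >= 1 && repeats >= 1);

    uint64_t *buf = malloc(n_words * sizeof *buf);
    if (buf == NULL)
        return -1.0;
    for (size_t i = 0; i < n_words; i++)
        buf[i] = i;

    volatile uint64_t sink = 0;
    sink += strided_load_sum(buf, n_words, stride);   /* warm-up pass */

    clock_t t0 = clock();
    for (int r = 0; r < repeats; r++) {
        /* Storing a value read from a volatile discourages the compiler
         * from hoisting the whole pass out of the repeat loop. */
        buf[0] = sink;
        sink += strided_load_sum(buf, n_words, stride);
    }
    clock_t t1 = clock();
    free(buf);

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    double bytes = (double)words_touched(n_words, stride)
                 * sizeof(uint64_t) * (double)repeats;
    return secs > 0.0 ? bytes / secs / 1e6 : 0.0;
}
```

Calling `load_bandwidth_mbs` for each combination of working set (for example 0.5 K to 16 M bytes) and stride (for example 1 to 128 words) would fill in one chart of the kind the slides show. A production benchmark would likely need a finer timer and checks on the generated instruction schedule.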

5 Measurement Problems
Some parameter combinations are hard to measure, even with carefully tuned C code:
- Reduced performance for large strides and small working-sets in L1 caches is a measurement artifact and not architecture related.
- Compilers occasionally generate suboptimal instruction schedules for loads / stores.

6 Local Load Access: Pentium Pro PC
[Figure: axes are working set and access pattern (stride between 64bit words)]

7 Local Load Access: SGI Origin
[Figure: axes are working set and access pattern (stride between 64bit words)]

8 Local Load Access: DEC 8400
[Figure: axes are working set and access pattern (stride between 64bit words)]

9 Local Load Access: Sun Enterprise
[Figure: load bandwidth (MByte/s, 0 to 700) over working set (0.5 K to 16 M) and access pattern (stride between 64bit words, 1 to 128); Sun Ultra Enterprise, one UltraSPARC II 248 MHz; regions labeled L1, L2, DRAM]

10 Local Load Access: SGI Cray T3E
[Figure: load bandwidth (MByte/s, 0 to 1200) over working set (0.5 K to 16 M) and access pattern (stride between 64bit words, 1 to 128); Cray T3E, one processor 300 MHz; regions labeled L1, L2, DRAM]

11 Comparison - Local Access

12 Performance in an SMP setting
- Copy bandwidth decreases for simultaneous access with 1, 2, 4 and 8 processors
- Topics of interest:
    - small working sets in caches: performance remains same
    - large working sets in memory: interesting differences
    - behavior for even/uneven strides
- “Gather copy stream” (strided load / contiguous store)
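The "gather copy stream" named above can be pictured as the loop below: a strided load from the source paired with a contiguous store to the destination. This is a minimal sketch, not the benchmark used for the slides; the function name `gather_copy` and its signature are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather copy: load every stride-th 64-bit word from src (strided load)
 * and store the words back to back in dst (contiguous store).
 * dst must have room for ceil(n_src_words / stride) words.
 * Returns the number of words copied. (Hypothetical helper.) */
size_t gather_copy(uint64_t *dst, const uint64_t *src,
                   size_t n_src_words, size_t stride)
{
    assert(stride >= 1);
    size_t j = 0;
    for (size_t i = 0; i < n_src_words; i += stride)
        dst[j++] = src[i];
    return j;
}
```

To approximate the SMP setting, one would presumably run one such loop per processor on disjoint buffers at the same time (for example with one thread per processor) and time them together; that harness is left out here.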

13 Local Copy: Pentium Pro SMP

14 Local Copy: SGI Origin CC-NUMA

15 Local Copy: DEC 8400 SMP

16 Local Copy: Sun Enterprise SMP

17 Remote in Parallel Computers
- Parallel & Network Computers: SGI Cray T3E, SGI Origin, Clusters of PCs (CoPs)
- Symmetric Multiprocessors: DEC 8400, Sun Enterprise, Pentium Pro SMPs
[Diagram: Processor (P), Caches (C), Memory (M); on one side P-C-M nodes connected by a Network, on the other P-C units sharing memory over a Bus/Network]

18 Remote Transfers: CoPs Pentium Pro with SCI / Myrinet

19 Remote Transfers: SGI Origin

20 Remote Transfers: DEC 8400

21 Remote Transfers: SGI Cray T3E

22 Comparison - Remote Transfers

23 Improvement of PC Chipsets
- Intel 440 BX AGP Chip Set 400 MHz / 100 MHz
- Intel 440 LX AGP Chip Set 233 MHz / 66 MHz
- Intel 440 FX Natoma Chip Set 200 MHz / 66 MHz

24 Conclusion
- ECT characterizations for different memory systems:
    - T3E (MPP node), Origin (NUMA), DEC 8400 (SMP)
    - CoPs: Intel P6 SMPs and clusters
- High End SMP vs. Low End SMP:
    - Less than half the performance on two-processor PCs.
- Fast communication puts high demands on the memory system:
    - Unlike in traditional SMPs and CC-NUMAs, fine-grained remote accesses do not perform at all in PC SMPs and CoPs.
- Adding more commodity microprocessors without reinforcing the memory system is therefore questionable.
