
1 Uppsala University, Information Technology, Department of Computer Systems
Uppsala Architecture Research Team [UART]
Removing the Overhead from Software-Based Shared Memory
Zoran Radovic and Erik Hagersten {zoranr, eh}@it.uu.se

2 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Problems with Traditional SW-DSMs
• Page-sized coherence unit
  - False Sharing! [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, GeNIMA, …]
• Protocol agent messaging is slow
  - Most efficiency is lost in interrupt/poll
[Figure: two SMP nodes (CPUs, Mem, protocol agent) exchanging an LD x request]

3 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Our Proposal: DSZOOM
• Run the entire protocol in the requesting processor
  - No protocol agent communication!
• Assumes user-level remote memory access
  - put, get, and atomics [e.g., InfiniBand]
• Fine-grain access-control checks [e.g., Shasta, Blizzard-S, Sirocco-S]
[Figure: two SMP nodes with a distributed DIR; on LD x the requesting node issues atomic, get, and put operations directly to the remote node's memory]
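The protocol is driven entirely by the requesting processor through a small set of user-level remote-memory primitives. The C declarations below are a minimal sketch of such an interface; the names (rm_get, rm_put, rm_fetch_and_set) are hypothetical stand-ins for whatever the interconnect (InfiniBand-style put/get/atomics) actually exposes, not DSZOOM's real API.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical user-level remote-memory primitives: each call is issued
 * directly by the requesting CPU, with no interrupt or protocol agent
 * running on the remote node. */

/* Copy 'len' bytes from a remote node's memory into a local buffer. */
void rm_get(int node, uint64_t remote_addr, void *local_buf, size_t len);

/* Copy 'len' bytes from a local buffer into a remote node's memory. */
void rm_put(int node, uint64_t remote_addr, const void *local_buf, size_t len);

/* Atomically fetch the old value of a remote word and overwrite it with
 * 'new_val'; used in the later sketches to grab a directory/MTAG lock in
 * a single round trip. */
uint64_t rm_fetch_and_set(int node, uint64_t remote_addr, uint64_t new_val);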

4 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Outline
• Motivation
• General DSZOOM Overview
• DSZOOM-WF Implementation Details
• Experimentation Environment
• Performance Results
• Conclusions

5 Supercomputing 2001, Uppsala Architecture Research Team (UART)
DSZOOM Cluster
• DSZOOM Nodes:
  - Each node consists of an unmodified SMP multiprocessor
  - SMP hardware keeps coherence among the caches and the memory within each SMP node
• DSZOOM Cluster Network:
  - Non-coherent cluster interconnect
  - Inexpensive user-level remote memory access
  - Remote atomic operations [e.g., InfiniBand]

6 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Squeezing Protocols into Binaries …
• Static Binary Instrumentation
  - EEL: machine-independent Executable Editing Library, implemented in C++
  - Instrument global LOADs with snippets containing fine-grain access-control checks
  - Instrument global STOREs with MTAG snippets
  - Insert calls to coherence protocols implemented in C

7 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Fine-grain Access Control Checks
• The "magic" value is a small integer corresponding to an IEEE floating-point NaN [e.g., Blizzard-S, Sirocco-S]
• Floating-point load example:

      ld     [address], %reg    // original LOAD
      fcmps  %fcc0, %reg, %reg  // compare reg with itself
      fbe,pt %fcc0, hit         // if (reg == reg) goto hit
      nop
      // call global coherence load routine (C-code)
  hit:
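The same idea can be written at the C level to show why the self-comparison works: an invalid line is filled with a NaN bit pattern, and NaN is the only value for which x != x is true. The sketch below is ours, not from the talk; dszoom_load_double() and coherence_load_routine() are hypothetical names for the inline check and the C protocol entry point.

/* Hypothetical C-level view of the instrumented floating-point load. */
extern void coherence_load_routine(void *addr);   /* global coherence protocol (C-code) */

static inline double dszoom_load_double(double *addr)
{
    double x = *addr;                  /* original LOAD */
    if (x != x) {                      /* true only for NaN: line may be invalid locally */
        coherence_load_routine(addr);  /* fetch the line via the protocol, then retry    */
        x = *addr;
    }
    /* A genuine NaN in the application data also takes the slow path; the
     * protocol then simply finds the line valid and returns. */
    return x;
}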

8 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Blocking Directory Protocols
• Originally proposed to simplify the design and verification of HW-DSMs
  - Eliminates race conditions
• DSZOOM implements a distributed version of a blocking protocol
[Figure: distributed DIR kept in Node 0's G_MEM, one DIR_ENTRY per cache line; each entry holds a LOCK bit plus presence bits, e.g., 01101000 before a MEM_STORE and 00000001 after it]
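A minimal sketch of such a directory entry in C; the field layout, widths, and names are assumptions for illustration, not DSZOOM's actual encoding.

#include <stdint.h>

/* Hypothetical directory entry, one per cache line, stored in the home
 * node's globally accessible memory (G_MEM). The lock makes the protocol
 * "blocking": a requester first acquires the entry with a remote
 * fetch-and-set, performs the entire coherence action, and only then
 * clears the lock, so two actions can never race on the same line. */
typedef struct dir_entry {
    uint8_t lock;       /* 1 while a coherence action is in progress          */
    uint8_t presence;   /* bit i set => node i holds a copy (8 nodes per byte) */
} dir_entry_t;

/* Example from the slide: before a MEM_STORE the line is shared
 * (presence = 01101000); after the store only the writer remains
 * (presence = 00000001). */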

9 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Global Coherency Action
Read data from the home node: 2-hop read
[Figure: on an LD x miss the requestor (1a) issues a fetch-and-set (f&s) to the home node's DIR, (1b) gets the data from the home node's Mem, and (2) puts the updated entry back. Legend: small packet ~10 bytes, large packet ~68 bytes; messages on vs. off the critical path]
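A hedged C sketch of the 2-hop read path as drawn on the slide. It reuses the hypothetical rm_* primitives introduced earlier, packs the directory entry into one 64-bit word for the atomic operation, and invents the DIR_LOCKED mask; none of these details are confirmed by the talk.

#include <stdint.h>
#include <stddef.h>

/* Assumed remote-memory primitives from the earlier sketch. */
extern uint64_t rm_fetch_and_set(int node, uint64_t raddr, uint64_t new_val);
extern void     rm_get(int node, uint64_t raddr, void *buf, size_t len);
extern void     rm_put(int node, uint64_t raddr, const void *buf, size_t len);

#define DIR_LOCKED (1ull << 63)   /* assumed lock bit in a packed 64-bit DIR_ENTRY */

/* 2-hop read (slide 9): the requesting CPU performs the whole action itself;
 * no CPU on the home node is interrupted or polled. */
void coherence_read_2hop(int home, uint64_t dir_addr, uint64_t data_addr,
                         void *local_line, size_t line_size, int my_node)
{
    /* 1a. Lock the home node's directory entry and read its previous state. */
    uint64_t dir = rm_fetch_and_set(home, dir_addr, DIR_LOCKED);

    /* 1b. The line is clean at home: fetch it with a remote get. */
    rm_get(home, data_addr, local_line, line_size);

    /* 2.  Add ourselves to the presence bits and release the lock with a put. */
    uint64_t dir_new = (dir | (1ull << my_node)) & ~DIR_LOCKED;
    rm_put(home, dir_addr, &dir_new, sizeof dir_new);
}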

10 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Global Coherency Action
Read data modified in a third node: 3-hop read
[Figure: on an LD x miss the requestor (1) issues an f&s to the home node's DIR, which indicates a dirty copy in a third node; (2a) an f&s locks that node's MTAG and (2b) a get fetches the data; (3a, 3b) puts release the MTAG and update/unlock the DIR]
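For completeness, the 3-hop case in the same hedged style: the ordering of the final puts and the MTAG_LOCKED mask are assumptions read off the slide's arrows, not DSZOOM's verified code.

#include <stdint.h>
#include <stddef.h>

/* Same assumed helpers and DIR_LOCKED mask as in the previous sketch. */
extern uint64_t rm_fetch_and_set(int node, uint64_t raddr, uint64_t new_val);
extern void     rm_get(int node, uint64_t raddr, void *buf, size_t len);
extern void     rm_put(int node, uint64_t raddr, const void *buf, size_t len);

#define DIR_LOCKED  (1ull << 63)
#define MTAG_LOCKED (1ull << 63)  /* assumed lock bit in the owner's MTAG word */

/* 3-hop read (slide 10): the directory says the line is dirty in a third
 * node ("owner"), so the data comes from there. */
void coherence_read_3hop(int home, uint64_t dir_addr,
                         int owner, uint64_t mtag_addr, uint64_t data_addr,
                         void *local_line, size_t line_size, int my_node)
{
    /* 1.  Lock the home directory entry; its old state names the owner. */
    uint64_t dir = rm_fetch_and_set(home, dir_addr, DIR_LOCKED);

    /* 2a. Lock the owner's MTAG for the line, then
     * 2b. fetch the modified data directly from the owner's memory. */
    uint64_t mtag = rm_fetch_and_set(owner, mtag_addr, MTAG_LOCKED);
    rm_get(owner, data_addr, local_line, line_size);

    /* 3a. Release the owner's MTAG and
     * 3b. add ourselves to the presence bits and unlock the home DIR. */
    uint64_t mtag_new = mtag & ~MTAG_LOCKED;
    rm_put(owner, mtag_addr, &mtag_new, sizeof mtag_new);
    uint64_t dir_new = (dir | (1ull << my_node)) & ~DIR_LOCKED;
    rm_put(home, dir_addr, &dir_new, sizeof dir_new);
}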

11 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Compilation Process
[Figure: an unmodified SPLASH-2 application is expanded with m4 using the DSZOOM-WF implementation of the PARMACS macros, compiled with GNU gcc, and linked with the DSZOOM-WF run-time library into an a.out "(un)executable"; EEL then instruments this binary and inserts calls to the coherence protocols (C-code)]

12 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Instrumentation Performance

Program         Problem Size                      % LD   % ST   Instr. Overhead
FFT             1,048,576 points (48.1 MB)        19.0   16.5   1.38
LU-Cont         1024 × 1024, block 16 (8.0 MB)    15.5    9.4   1.59
LU-Non-Cont     1024 × 1024, block 16 (8.0 MB)    16.7   11.1   1.50
Radix           4,194,304 items (36.5 MB)         15.6   11.6   1.13
Barnes-Hut      16,384 bodies (32.8 MB)           23.8   31.1   1.03
FMM             32,768 particles (8.1 MB)         17.5   13.6   1.06
Ocean-Cont      514 × 514 (57.5 MB)               27.0   23.9   1.34
Ocean-Non-Cont  258 × 258 (22.9 MB)               11.6   28.0   1.24
Radiosity       Room (29.4 MB)                    26.3   27.2   1.07
Raytrace        Car (32.2 MB)                     19.0   18.1   1.21
Water-nsq       2,197 mols., 2 steps (2.0 MB)     13.4   16.2   1.06
Water-sp        2,197 mols., 2 steps (1.5 MB)     15.7   13.9   1.09
Average                                           18.4   18.3   1.22

13 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Instrumentation Breakdown
[Figure: instrumentation overhead breakdown for sequential execution]

14 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Current DSZOOM Hardware
• Two E6000s connected through a hardware-coherent interface (Sun WildFire) with a raw bandwidth of 800 MB/s in each direction
• Data migration and coherent memory replication (CMR) are kept inactive
• 16 UltraSPARC II (250 MHz) CPUs and 8 GB memory per node
• Memory access times: 330 ns local / 1700 ns remote (lmbench latency)
• Run as a 16-way SMP, 2 × 8 CC-NUMA, and 2 × 8 SW-DSM

15 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Process and Memory Distribution
[Figure: each cabinet creates a shared segment with shmget (shmid = A in Cabinet 1, shmid = B in Cabinet 2); processes are created with fork and bound to their cabinet with pset_bind; every process shmat-s both Cabinet_1_G_MEM and Cabinet_2_G_MEM so the global memory G_MEM appears at the same virtual address (0x80000000) in every address space ("aliasing"), alongside its private Stack, Text & Data, Heap, and PRIVATE_DATA]
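A hedged C sketch of that setup using Solaris-style APIs; the segment size, fixed virtual address, and processor-set ids are placeholders, error handling is omitted, and the structure is only meant to mirror the shmget / fork / pset_bind / shmat steps shown on the slide.

#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/procset.h>   /* P_PID, P_MYID          */
#include <sys/pset.h>      /* Solaris pset_bind()    */
#include <sys/types.h>
#include <unistd.h>

#define G_MEM_SIZE   (64UL << 20)                /* placeholder segment size           */
#define G_MEM_VADDR  ((void *)0x80000000UL)      /* same virtual address in every process */

int main(void)
{
    /* One shared segment per cabinet (shmid = A and shmid = B on the slide). */
    int shm_cab1 = shmget(IPC_PRIVATE, G_MEM_SIZE, IPC_CREAT | 0600);
    int shm_cab2 = shmget(IPC_PRIVATE, G_MEM_SIZE, IPC_CREAT | 0600);

    /* fork the worker processes (the real system forks one per CPU). */
    pid_t pid = fork();

    /* Placeholder pset ids: in practice they would come from pset_create()
     * with the CPUs of each cabinet assigned to them. */
    psetid_t my_pset = (pid == 0) ? 2 : 1;

    /* Bind this process to its cabinet's processor set. */
    pset_bind(my_pset, P_PID, P_MYID, NULL);

    /* Attach both cabinets' G_MEM at fixed addresses so every process sees
     * the global memory at the same virtual address ("aliasing"). */
    void *cab1_g_mem = shmat(shm_cab1, G_MEM_VADDR, 0);
    void *cab2_g_mem = shmat(shm_cab2, (char *)G_MEM_VADDR + G_MEM_SIZE, 0);

    (void)cab1_g_mem; (void)cab2_g_mem;
    return 0;
}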

16 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Results (1)
[Figure: execution times in seconds with 16 CPUs, comparing the hardware (HW) and software (SW, EEL-instrumented) configurations on 8 and 16 processors]

17 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Results (2)
[Figure: normalized execution-time breakdowns with 16 CPUs for the software (SW, EEL-instrumented) runs]

18 Supercomputing 2001, Uppsala Architecture Research Team (UART)
Conclusions
• DSZOOM completely eliminates asynchronous messaging between protocol agents
• Consistently competitive and stable performance in spite of high instrumentation overhead
  - ~30% slowdown compared to hardware
• State-of-the-art checking overheads are in the range of 5–35% (e.g., Shasta); DSZOOM: 3–59%

19 Supercomputing 2001, Uppsala Architecture Research Team (UART)
DSZOOM's Home Page
http://www.it.uu.se/research/group/uart

