1 Implementing Low Latency Distributed Software-Based Shared Memory
Zoran Radović and Erik Hagersten, {zoranr, eh}@it.uu.se
Uppsala Architecture Research Team (UART)
Uppsala University, Information Technology, Department of Computer Systems
DSZOOM@wmpi2001

2 Problems with Traditional SW-DSMs
- Page-sized coherence unit leads to false sharing [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, Shasta, GeNIMA, …]
- Protocol agent messaging is slow: most efficiency is lost in interrupt/poll
[Figure: two nodes, each with CPUs, Mem, and a Prot. agent; one CPU issues LD x]
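The false-sharing point on this slide can be illustrated with a little address arithmetic. The sketch below is only an illustration: the 8 KB page size, the 64-byte fine-grain unit, and the two addresses are assumptions chosen for the example, not values taken from the slides.

```python
PAGE_SIZE = 8192   # assumed page size in bytes (illustrative only)
LINE_SIZE = 64     # assumed fine-grain coherence unit in bytes (illustrative only)


def coherence_unit(addr, unit_size):
    """Index of the coherence unit that contains addr."""
    return addr // unit_size


def falsely_shared(addr_a, addr_b, unit_size):
    """True when two distinct addresses fall in the same coherence unit."""
    return addr_a != addr_b and (
        coherence_unit(addr_a, unit_size) == coherence_unit(addr_b, unit_size)
    )


# Two unrelated variables 1 KB apart, written by different nodes (hypothetical).
a, b = 0x80000000, 0x80000400
page_conflict = falsely_shared(a, b, PAGE_SIZE)   # same page: writes interfere
line_conflict = falsely_shared(a, b, LINE_SIZE)   # different lines: no interference
```

With a page-sized unit the two variables share one coherence unit, so every write by one node disturbs the other node's copy even though no data is actually shared. With the smaller unit they are tracked independently.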

3 Our proposal: DSZOOM
- Run the entire protocol in the requesting processor: no protocol agent communication!
- Assumes user-level remote memory access: put, get, and atomics [e.g., InfiniBand]
- Fine-grain access-control checks [e.g., Shasta, Blizzard-S, Sirocco-S]
[Figure: two nodes with CPUs and Mem; the protocol runs on the requesting node, which serves LD x with an atomic on DIR and a get]

4 Outline
- Motivation
- General DSZOOM Overview
- Experimentation Environment
- DSZOOM-WF Implementation Details
- Performance Results
- Improved DSZOOM… [SC2001]

5 DSZOOM Cluster
DSZOOM Nodes:
- Each node consists of an unmodified SMP multiprocessor
- SMP hardware keeps coherence among the caches and the memory within each SMP node
DSZOOM Cluster Network:
- Non-coherent cluster interconnect
- Inexpensive user-level remote memory access
- Remote atomic operations [e.g., InfiniBand]

6 Current DSZOOM Hardware
- Two E6000 connected through a hardware-coherent interface (Sun WildFire) with a raw bandwidth of 800 MB/s in each direction
  - Data migration and coherent memory replication (CMR) are kept inactive
- 16 UltraSPARC II (250 MHz) CPUs per node and 8 GB memory
  - Memory access times: 330 ns local / 1700 ns remote (lmbench latency)
- Run as 16-way SMP, 2 × 8 HW-ccNUMA, and 2 × 8 SW-DSM

7 Compilation Process
[Figure: build flow. Inputs: unmodified SPLASH-2 application, DSZOOM-WF implementation of PARMACS macros. Tools: m4, GNU gcc, EEL. Other components: coherence protocols, DSZOOM-WF run-time library. Output: a.out binary.]

8 Process and Memory Distribution
[Figure: processes in Cabinet 1 and Cabinet 2 are created with fork and bound with pset_bind. Each process has Stack, Text & Data, Heap, and PRIVATE_DATA regions, plus G_MEM at 0x80000000. Two shared segments are created with shmget (shmid = A in the physical memory of Cabinet 1, shmid = B in the physical memory of Cabinet 2) and attached with shmat as Cabinet_1_G_MEM and Cabinet_2_G_MEM. G_MEM is marked "Aliasing".]
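The "Aliasing" label in this figure suggests that the same physical memory is reachable through more than one virtual range. The sketch below shows only that general idea and makes several assumptions: it uses a file-backed `mmap` as a stand-in for the `shmget`/`shmat` segments named on the slide, and the choice that G_MEM aliases Cabinet_1_G_MEM is one reading of the figure, not something the slide states.

```python
import mmap
import tempfile

SIZE = mmap.PAGESIZE  # one page is enough for the illustration

# A temporary file stands in for a shared segment (the slides use shmget/shmat).
with tempfile.TemporaryFile() as backing:
    backing.truncate(SIZE)

    # Two independent mappings of the same backing object: an alias pair.
    g_mem = mmap.mmap(backing.fileno(), SIZE)            # hypothetical G_MEM view
    cabinet_1_g_mem = mmap.mmap(backing.fileno(), SIZE)  # hypothetical Cabinet_1_G_MEM view

    g_mem[0:4] = b"DATA"                  # store through one view
    seen = bytes(cabinet_1_g_mem[0:4])    # load through the other view

    g_mem.close()
    cabinet_1_g_mem.close()
```

Because both mappings are shared views of one object, a store through either name is visible through the other without any copying.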

9 So far …
[Figure: the build flow from slide 7 (unmodified SPLASH-2 application, DSZOOM-WF implementation of PARMACS macros, m4, GNU gcc, EEL, coherence protocols, DSZOOM-WF run-time library), with a.out now labeled "(Un)executable".]

10 Squeezing Protocols into Binaries …
- Static binary instrumentation with EEL, a machine-independent Executable Editing Library implemented in C++
- Replace global loads with snippets containing fine-grain access control checks
- Insert coherence protocols

11 Fine-grain Access Control Checks
- The "magic" value is a small integer corresponding to an IEEE floating-point NaN [Blizzard-S, Sirocco-S]
- Floating-point load example:

        ld      [address], %reg      // original LD
        fcmps   %fcc0, %reg, %reg    // compare reg with itself
        fbe,pt  %fcc0, hit           // if (reg == reg) goto hit
        nop
        call the global coherence load routine
    hit:
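The self-comparison in this snippet works because a NaN is the only floating-point value that does not compare equal to itself. A minimal Python sketch of the same check follows. The magic value -253, the list-of-words memory model, and the `miss_handler` callback are all assumptions made for illustration; the slide does not give the actual constant or the routine's interface.

```python
import struct

MAGIC = -253  # hypothetical "magic" integer; its 32-bit pattern 0xFFFFFF03 is a NaN


def int_to_float(word):
    """Reinterpret a signed 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">i", word))[0]


def float_to_int(value):
    """Inverse of int_to_float, used here only to build example data."""
    return struct.unpack(">i", struct.pack(">f", value))[0]


def checked_fp_load(memory, index, miss_handler):
    """Load a float and fall back to the handler when the word holds a NaN."""
    value = int_to_float(memory[index])   # original LD
    if value == value:                    # fcmps + fbe: a non-NaN equals itself
        return value                      # hit: data was valid
    return miss_handler(memory, index)    # NaN: call the coherence load routine
```

A word holding valid data such as 1.5 takes the hit path with no extra memory access, while a word holding the magic pattern is routed to the handler. A genuine NaN in application data would also reach the handler, so a real handler would presumably have to tell the two cases apart.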

12 Modified-Shared-Invalid (MSI)
- One DIR_ENTRY per cache line, holding presence bits and a LOCK
- Distributed DIR
- DIR_ENTRY before MEM_STORE: presence bits 00000001, LOCK
- DIR_ENTRY after MEM_STORE: presence bits 00000010, LOCK
[Figure: G_MEM "aliasing" over Cabinet_1_G_MEM and Cabinet_2_G_MEM, showing a shared cache line and an invalid cache line around a MEM_STORE]
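The before/after presence bits on this slide can be reproduced with a small bitmask model. This is a sketch under assumptions: the mapping of bit positions to nodes, the `DirEntry` class, and the `mem_store` helper are hypothetical names introduced for the example, and the lock handling is reduced to a flag.

```python
class DirEntry:
    """One directory entry per cache line: 8 presence bits plus a lock flag."""

    def __init__(self, presence=0):
        self.presence = presence
        self.locked = False


def mem_store(entry, node):
    """Give `node` the only valid copy and return the nodes to invalidate."""
    others = entry.presence & ~(1 << node)
    to_invalidate = [n for n in range(8) if (others >> n) & 1]
    entry.presence = 1 << node   # only the storing node keeps a copy
    return to_invalidate


entry = DirEntry(0b00000001)           # before MEM_STORE: node 0 holds the line
invalidated = mem_store(entry, 1)      # node 1 stores to the line
after = format(entry.presence, "08b")  # presence bits after MEM_STORE
```

In this model a store by node 1 to a line held by node 0 invalidates node 0's copy and leaves the presence bits at 00000010, matching the transition shown on the slide.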

13 Read Data from Home Node: 2-hop read
[Figure: the Requestor exchanges messages with Mem and DIR at the home node.]
- 1a. f&s
- 1b. get data
- 2. put
Legend: small packet (~10 bytes), large packet (~68 bytes), message on the critical path, message off the critical path
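The three messages of the 2-hop read can be sketched as a requester-side routine running against a simulated home node. Everything here is an assumption for illustration: the `HomeNode` class, the `LOCKED` bit layout, the 64-byte line size, and the reading that f&s locks the directory entry while the final put releases it and records the new sharer. The slide itself names only the three operations.

```python
LOCKED = 0x100  # hypothetical lock bit placed above the 8 presence bits


class HomeNode:
    """Simulated home node: a directory and memory reachable by remote operations."""

    def __init__(self, n_lines):
        self.directory = [0] * n_lines
        self.memory = [bytes(64) for _ in range(n_lines)]  # 64-byte lines assumed
        self.log = []

    def fetch_and_set(self, line):
        """Atomic: return the old directory entry and mark it locked."""
        old = self.directory[line]
        self.directory[line] = old | LOCKED
        self.log.append("f&s")
        return old

    def get(self, line):
        self.log.append("get")
        return self.memory[line]

    def put(self, line, entry):
        self.log.append("put")
        self.directory[line] = entry


def read_from_home(home, line, my_node):
    """Run the whole read protocol on the requesting processor."""
    old = home.fetch_and_set(line)          # 1a. f&s: lock the directory entry
    while old & LOCKED:                     # someone else holds it: retry
        old = home.fetch_and_set(line)
    data = home.get(line)                   # 1b. get data
    home.put(line, old | (1 << my_node))    # 2. put: record sharer, release lock
    return data
```

No code runs on the home node's processors in this model; the requester drives every step, which is the property the proposal slide describes as running the entire protocol in the requesting processor.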

14 Instrumentation Performance

Program          Problem Size                      % LD   % ST   Instrumentation Overhead
FFT              1,048,576 points (48.1 MB)        26.1   22.2   1.43
LU-Cont          1024 × 1024, block 16 (8.0 MB)    22.7   14.5   1.68
LU-Non-Cont      1024 × 1024, block 16 (8.0 MB)    23.9   16.6   1.42
Radix            4,194,304 items (36.5 MB)         24.1   14.9   1.15
Barnes-Hut       16,384 bodies (32.8 MB)           37.5   50.5   1.25
FMM              32,768 particles (8.1 MB)         25.5   22.9   1.12
Ocean-Cont       514 × 514 (57.5 MB)               28.6   26.2   1.34
Ocean-Non-Cont   258 × 258 (22.9 MB)               15.5   31.6   1.21
Radiosity        Room (29.4 MB)                    31.1   35.0   1.11
Raytrace         Car (32.2 MB)                     28.8   31.5   1.53
Water-nsq        2,197 mols., 2 steps (2.0 MB)     24.5   32.4   1.21
Water-sp         2,197 mols., 2 steps (1.5 MB)     25.5   27.6   1.21
Average                                            26.2   27.2   1.30
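The overhead column can be cross-checked against the 11–68% range quoted in the conclusions. The short sketch below copies the overhead factors from the table and assumes that each factor is instrumented execution time divided by uninstrumented execution time, so that a factor of 1.11 corresponds to an 11% slowdown. The helper name is hypothetical.

```python
# Instrumentation overhead factors copied from the table above.
OVERHEAD = {
    "FFT": 1.43, "LU-Cont": 1.68, "LU-Non-Cont": 1.42, "Radix": 1.15,
    "Barnes-Hut": 1.25, "FMM": 1.12, "Ocean-Cont": 1.34, "Ocean-Non-Cont": 1.21,
    "Radiosity": 1.11, "Raytrace": 1.53, "Water-nsq": 1.21, "Water-sp": 1.21,
}


def slowdown_percent(factor):
    """Convert an overhead factor such as 1.43 into a whole-number percentage."""
    return round((factor - 1.0) * 100)


lowest = min(OVERHEAD, key=OVERHEAD.get)    # program with the smallest overhead
highest = max(OVERHEAD, key=OVERHEAD.get)   # program with the largest overhead
mean = sum(OVERHEAD.values()) / len(OVERHEAD)
```

Under that assumption the extremes are Radiosity at 11% and LU-Cont at 68%, and the mean of the twelve factors agrees with the table's Average row to within rounding.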

15 Normalized Instrumentation Overhead Breakdown (Seq. Exec.)
[Figure]

16 Results (1): Execution Times in Seconds (16 CPUs)
[Figure]

17 Results (2): Normalized Execution Time Breakdowns (16 CPUs)
[Figure]

18 Conclusions
- DSZOOM completely eliminates asynchronous messaging between protocol agents
- Consistently competitive and stable performance in spite of high instrumentation overhead
  - 35% slowdown compared to hardware
  - State-of-the-art checking overheads are in the range of 5–35% (e.g., Shasta); DSZOOM: 11–68%

19 Improved DSZOOM… [SC2001]
- Protocol/overall optimizations
  - Coherency unit variations
- Synchronization improvements
  - More balanced execution between cabinets
- Better instrumentation
  - More detailed backward slice algorithm

20 SC2001 Teaser: Execution Times in Seconds (16 CPUs)
[Figure]

21 DSZOOM's Home Page
http://www.it.uu.se/research/group/uart
