Uppsala University, Information Technology, Department of Computer Systems
Uppsala Architecture Research Team (UART)

Removing the Overhead from Software-Based Shared Memory
Zoran Radovic and Erik Hagersten
Problems with Traditional SW-DSMs

- Page-sized coherence unit: false sharing! [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, GeNIMA, …]
- Protocol agent messaging is slow: most of the efficiency is lost in the interrupt/poll handling on the protocol-agent CPUs

[Figure: two SMP nodes, each with CPUs, memory, and a protocol agent; a LD x miss on one node is serviced by messaging between the nodes' protocol agents]
Our Proposal: DSZOOM

- Run the entire protocol in the requesting processor: no protocol-agent communication!
- Assumes user-level remote memory access: put, get, and atomics [e.g., InfiniBand SM]
- Fine-grain access-control checks [e.g., Shasta, Blizzard-S, Sirocco-S]

[Figure: on a LD x miss, the requesting CPU itself issues atomic, get/put operations against the remote node's directory (DIR) and memory]
Outline

- Motivation
- General DSZOOM Overview
- DSZOOM-WF Implementation Details
- Experimentation Environment
- Performance Results
- Conclusions
DSZOOM Cluster

DSZOOM nodes:
- Each node consists of an unmodified SMP multiprocessor
- SMP hardware keeps coherence among the caches and the memory within each SMP node

DSZOOM cluster network:
- Non-coherent cluster interconnect
- Inexpensive user-level remote memory access
- Remote atomic operations [e.g., InfiniBand SM]
(See the interface sketch below.)
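The protocol layer needs only three remote operations from the interconnect. A minimal C sketch of the assumed user-level interface; all names and signatures here are hypothetical stand-ins for whatever the interconnect (e.g., InfiniBand SM) actually exposes:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical user-level remote memory interface (names invented
     * for illustration). Every call is issued directly by the
     * requesting processor; no CPU on the remote node is involved. */

    /* Copy len bytes from address src on remote node into local dst. */
    void rget(int node, const void *src, void *dst, size_t len);

    /* Copy len bytes from local src to address dst on remote node. */
    void rput(int node, void *dst, const void *src, size_t len);

    /* Atomically store val into a remote word and return the old
     * value: the "f&s" (fetch-and-set) used by the protocol. */
    uint64_t ratomic_fas(int node, uint64_t *addr, uint64_t val);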
Squeezing Protocols into Binaries …

Static binary instrumentation:
- EEL, a machine-independent Executable Editing Library implemented in C++
- Instrument global LOADs with snippets containing fine-grain access-control checks
- Instrument global STOREs with MTAG snippets (see the sketch below)
- Insert calls to coherence protocols implemented in C
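For intuition, a hedged C sketch of what a store-side MTAG snippet checks; the tag encoding, names, and layout are assumptions for illustration, not DSZOOM's actual ones (the real snippet is inlined machine code):

    #include <stddef.h>
    #include <stdint.h>

    enum mtag_state { MTAG_INVALID, MTAG_READ_ONLY, MTAG_WRITABLE };

    extern char    *g_mem;     /* base of the global shared memory   */
    extern uint8_t  mtag[];    /* one tag per coherence unit (line)  */
    #define LINE_SHIFT 6       /* assumed 64-byte coherence unit     */

    void coherence_store(double *addr, double val);  /* protocol (C) */

    static inline void checked_store(double *addr, double val)
    {
        size_t line = ((uintptr_t)addr - (uintptr_t)g_mem) >> LINE_SHIFT;
        if (mtag[line] == MTAG_WRITABLE)
            *addr = val;                 /* fast path: write permitted */
        else
            coherence_store(addr, val);  /* slow path: run protocol    */
    }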
Fine-grain Access Control Checks

The "magic" value is a small integer corresponding to an IEEE floating-point NaN [e.g., Blizzard-S, Sirocco-S]. Floating-point load example (the slow path calls into the coherence protocols, written in C):

    ld     [address],%reg    // original LOAD
    fcmps  %fcc0,%reg,%reg   // compare reg with itself
    fbe,pt %fcc0,hit         // if (reg == reg) goto hit
    nop
    // call global coherence load routine
    hit:
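In C terms, the check exploits the IEEE rule that a NaN never compares equal to itself, so a single register self-comparison distinguishes valid data from an invalid line without any extra memory reference. A sketch of the same logic (names hypothetical; a genuine NaN in user data also takes the slow path, where the protocol can confirm the line is in fact valid):

    void coherence_load(double *addr);  /* global coherence load (C)  */

    static inline double checked_load(double *addr)
    {
        double v = *addr;        /* original LOAD                     */
        if (v == v)              /* only a NaN fails self-comparison  */
            return v;            /* hit: line holds valid data        */
        coherence_load(addr);    /* miss: fetch line, overwrite the   */
        return *addr;            /* magic value, then reload          */
    }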
Blocking Directory Protocols

- Originally proposed to simplify the design and verification of HW-DSMs
- Eliminates race conditions
- DSZOOM implements a distributed version of a blocking protocol (see the sketch below)

[Figure: distributed directory in node 0's G_MEM, one DIR_ENTRY per cache line; each entry holds a LOCK bit plus presence bits, shown before and after a MEM_STORE]
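A hedged C sketch of the blocking acquire/release on a directory entry, using the hypothetical rput/ratomic_fas primitives sketched earlier (the entry layout is an assumption). Swapping a locked sentinel into the entry serializes all protocol actions on the line, which is what eliminates the races, and it returns the presence bits in the same round trip:

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed per-line directory entry at the home node: one lock
     * bit plus one presence bit per node. */
    #define DIR_LOCKED (1ull << 63)

    uint64_t ratomic_fas(int node, uint64_t *addr, uint64_t val);
    void     rput(int node, void *dst, const void *src, size_t len);

    /* Blocking acquire: spin until the fetched entry has the lock
     * bit clear; the returned value carries the presence bits. */
    static uint64_t dir_acquire(int home, uint64_t *entry)
    {
        uint64_t old;
        do
            old = ratomic_fas(home, entry, DIR_LOCKED);
        while (old & DIR_LOCKED);
        return old;
    }

    /* Release: write back the updated entry, lock bit clear. */
    static void dir_release(int home, uint64_t *entry, uint64_t val)
    {
        rput(home, entry, &val, sizeof val);
    }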
Global Coherency Action

Read data from the home node: 2-hop read.
1a. f&s on the home node's directory entry (small packet, ~10 bytes, on the critical path)
1b. get the data from home memory (large packet, ~68 bytes, on the critical path)
2. put to release the directory entry (off the critical path)

[Figure: on a LD x miss the requestor sends messages 1a/1b/2 to the home node's DIR and memory]
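Putting the helpers from the earlier sketches together, a hedged sketch of the whole action (LINE_SIZE and MY_NODE_BIT are assumed constants). Every message is issued by the requesting processor itself, and only the f&s and the get sit on the critical path:

    #define LINE_SIZE   64            /* assumed coherence unit       */
    #define MY_NODE_BIT (1ull << 0)   /* this node's presence bit     */

    /* 2-hop read: the line is clean at its home node. */
    double read_2hop(int home, uint64_t *dir, double *src, double *dst)
    {
        uint64_t pres = dir_acquire(home, dir);     /* 1a. f&s        */
        rget(home, src, dst, LINE_SIZE);            /* 1b. get data   */
        dir_release(home, dir, pres | MY_NODE_BIT); /* 2.  put, off   */
        return *dst;                                /* critical path  */
    }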
Global Coherency Action

Read data modified in a third node: 3-hop read.
1. f&s on the home node's directory entry
2a. f&s on the third node's MTAG
2b. get the data from the third node
3a. put to release the MTAG
3b. put to release the directory entry (off the critical path)

[Figure: requestor's LD x miss; DIR at the home node, memory and MTAG at the third node]
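Under the same assumptions, the 3-hop case adds a lock/fetch pair against the dirty node's MTAG; mtag_downgrade stands in for whatever new tag state the protocol writes back and is purely hypothetical:

    uint64_t mtag_downgrade(uint64_t tag);   /* hypothetical */

    /* 3-hop read: the home directory shows the line dirty in a third
     * node, so the data is fetched from there, not from home memory. */
    double read_3hop(int home, int dirty, uint64_t *dir, uint64_t *mtag,
                     double *src, double *dst)
    {
        uint64_t pres = dir_acquire(home, dir);        /* 1.  f&s     */
        uint64_t tag  = dir_acquire(dirty, mtag);      /* 2a. f&s     */
        rget(dirty, src, dst, LINE_SIZE);              /* 2b. get     */
        dir_release(dirty, mtag, mtag_downgrade(tag)); /* 3a. put     */
        dir_release(home, dir, pres | MY_NODE_BIT);    /* 3b. put     */
        return *dst;
    }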
Compilation Process

DSZOOM-WF implements the PARMACS macros. An unmodified SPLASH-2 application is expanded with m4 and compiled with GNU gcc into an (as yet uninstrumented) a.out; EEL then instruments the binary and links in the DSZOOM-WF run-time library and the coherence protocols (C code) to produce the final executable.
Instrumentation Performance

Program         Problem Size                      % LD   % ST   Instr. Overhead
FFT             1,048,576 points (48.1 MB)         —      —      —
LU-Cont         1024 × 1024, block 16 (8.0 MB)     —      —      —
LU-Non-Cont     1024 × 1024, block 16 (8.0 MB)     —      —      —
Radix           4,194,304 items (36.5 MB)          —      —      —
Barnes-Hut      16,384 bodies (32.8 MB)            —      —      —
FMM             32,768 particles (8.1 MB)          —      —      —
Ocean-Cont      514 × 514 (57.5 MB)                —      —      —
Ocean-Non-Cont  258 × 258 (22.9 MB)                —      —      —
Radiosity       Room (29.4 MB)                     —      —      —
Raytrace        Car (32.2 MB)                      —      —      —
Water-nsq       2,197 mols., 2 steps (2.0 MB)      —      —      —
Water-sp        2,197 mols., 2 steps (1.5 MB)      —      —      —
Average                                            —      —      —

[The per-program % LD, % ST, and instrumentation-overhead figures did not survive extraction.]
Instrumentation Breakdown

[Chart: breakdown of the instrumentation overhead, sequential execution]
Current DSZOOM Hardware

- Two Sun E6000 servers connected through a hardware-coherent interface (Sun WildFire) with a raw bandwidth of 800 MB/s in each direction
- Data migration and coherent memory replication (CMR) are kept inactive
- 16 UltraSPARC II (250 MHz) CPUs and 8 GB of memory per node
- Memory access times: 330 ns local / 1700 ns remote (lmbench latency)
- Run as a 16-way SMP, as a 2 × 8 CC-NUMA, and as a 2 × 8 SW-DSM
Process and Memory Distribution

[Figure: two cabinets. Each cabinet allocates its slice of the global memory with shmget (shmid = A in cabinet 1, shmid = B in cabinet 2), and every process attaches both slices with shmat so that Cabinet_1_G_MEM and Cabinet_2_G_MEM form one contiguous G_MEM ("aliasing"). Processes are created with fork and pinned to their cabinet with pset_bind; each keeps its own private stack, text & data, heap, and PRIVATE_DATA. A sketch follows.]
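A hedged C sketch of this setup using Solaris SysV shared memory and processor sets; the segment size, base address, and processor-set ids are invented for illustration, and error handling is elided:

    #include <stddef.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/procset.h>
    #include <sys/pset.h>
    #include <unistd.h>

    #define SLICE ((size_t)1 << 30)   /* per-cabinet G_MEM slice (assumed) */

    extern psetid_t pset_cab1, pset_cab2; /* made with pset_create elsewhere */

    int main(void)
    {
        /* One SysV segment per cabinet; its physical pages stay there. */
        int shm_a = shmget(IPC_PRIVATE, SLICE, IPC_CREAT | 0600);
        int shm_b = shmget(IPC_PRIVATE, SLICE, IPC_CREAT | 0600);

        /* Attach both slices back-to-back at a fixed (assumed) address
         * in every process, so Cabinet_1_G_MEM and Cabinet_2_G_MEM
         * alias one contiguous G_MEM. */
        char *g_mem = (char *)0x80000000UL;
        shmat(shm_a, g_mem, 0);
        shmat(shm_b, g_mem + SLICE, 0);

        /* fork a worker and bind each process to its cabinet's
         * processor set. */
        psetid_t pset = (fork() == 0) ? pset_cab2 : pset_cab1;
        pset_bind(pset, P_PID, P_MYID, NULL);

        /* ... run the instrumented application on G_MEM ... */
        return 0;
    }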
Results (1)

[Chart: execution times in seconds on 16 CPUs, comparing the hardware 16-way and 2 × 8 configurations with the 2 × 8 SW-DSM and the EEL-instrumented binaries]
Results (2)

[Chart: normalized execution time breakdowns on 16 CPUs for the 2 × 8 SW-DSM (EEL-instrumented) configuration]
Conclusions

- DSZOOM completely eliminates asynchronous messaging between protocol agents
- Consistently competitive and stable performance in spite of high instrumentation overhead: 30% slowdown compared to hardware
- State-of-the-art checking overheads are in the range of 5–35% (e.g., Shasta); DSZOOM: 3–59%
DSZOOM's Home Page