1
Euro-Par 2004. Uppsala Architecture Research Team [UART] | www.it.uu.se/research/group/uart
Uppsala University, Dept. of Information Technology, Div. of Computer Systems
Exploiting Store Locality through Permission Caching in Software DSMs
Håkan Zeffer, Zoran Radovic, Oskar Grenholm and Erik Hagersten
zeffer@it.uu.se
2
Software Distributed Shared Memory
3
Traditional Software DSMs
  Page-based coherence [e.g., Ivy, Munin, TreadMarks]
  Virtual memory hardware for coherence checks
  Expensive TLB traps
  Large coherence unit size
  Problem: False sharing
  Solution: Weak memory consistency models
  [Figure labels: CPUs, DATA, dir, req., ST miss]
4
Fine-Grain Software DSMs
  Fine-grain access-control checks [Shasta, Blizzard]
  Relies on binary instrumentation
  Avoids operating system trapping
  Less false sharing
  Extra instructions introduce overhead
  Checking code instrumented into the application
  [Figure labels: CPUs, DATA, dir, req.; check placed before ST: if (miss) goto st_protocol]
5
Fine-Grain Pros and Cons
  Pros
    Small coherence unit
    Hardware-like memory consistency model
  Cons
    Extra check instructions to execute
  Our proposal: Write Permission Cache (WPC)
    Exploits store locality
    Caches write permission
    Effectively reduces the store instrumentation cost
6
Outline
  Motivation
  Problem: Instrumentation Overhead
  Solution: Write Permission Cache
  Experimental Setup
  Results on Real HW- and SW-DSM Systems
  Conclusions
7
Software Fine-Grain Coherence
  Binary instrumentation of global loads and stores
  Inserted code "snippet" maintains coherence

  Original program:
        add R1, R2 -> R3
    loop:
        ld [R1 + R4] -> R6    // G_LD1
        ld [R2 + R4] -> R7    // G_LD2
        sub R9, 1 -> R9
        add R6, R7 -> R8
        st R8 -> [R3 + R4]    // G_ST1
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st R3 -> [R7 + 4]

  Instrumented program:
        add R1, R2 -> R3
    loop:
        load snippet for G_LD1
            call coherence protocol if load miss
        load snippet for G_LD2
            call coherence protocol if load miss
        sub R9, 1 -> R9
        add R6, R7 -> R8
        store snippet for G_ST1
            call coherence protocol if store miss
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st R3 -> [R7 + 4]
8
The Lock Problem (original DSZOOM)
  Example store access pattern (array traversal)

  Operation        CUID   Original snippet handling
  ST 0xE22F0000    98     lock dir entry 98; store; unlock dir entry 98
  ST 0xE22F0008    98     lock dir entry 98; store; unlock dir entry 98
  ST 0xE22F0010    98     lock dir entry 98; store; unlock dir entry 98
  ST 0xE22F0018    98     lock dir entry 98; store; unlock dir entry 98
  ST 0xE22F0020    98     lock dir entry 98; store; unlock dir entry 98
  ST 0xE22F0028    98     lock dir entry 98; store; unlock dir entry 98
  ST 0xE22F0030    98     lock dir entry 98; store; unlock dir entry 98
  ST 0xE22F0038    98     lock dir entry 98; store; unlock dir entry 98
  ST 0xE22F0040    99     lock dir entry 99; store; unlock dir entry 99
  ST 0xE22F0048    99     lock dir entry 99; store; unlock dir entry 99
9
DSZOOM Fine-Grain Coherence
  Magic value (load), atomic operations (store)

  Original program:
        add R1, R2 -> R3
    loop:
        ld [R1 + R4] -> R6    // G_LD1
        ld [R2 + R4] -> R7    // G_LD2
        sub R9, 1 -> R9
        add R6, R7 -> R8
        st R8 -> [R3 + R4]    // G_ST1
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st R3 -> [R7 + 4]

  Instrumented program:
        add R1, R2 -> R3
    loop:
        ld [R1 + R4] -> R6                    // original load
        if (R6 == MAGIC)                      // test permission
            LD_PROTOCOL();                    // protocol if miss
        ld [R2 + R4] -> R7                    // original load
        if (R7 == MAGIC)                      // test permission
            LD_PROTOCOL();                    // protocol if miss
        sub R9, 1 -> R9
        add R6, R7 -> R8
        LOCK(LOCAL_DIR);                      // lock local dir
        if (LOCAL_DIR != WRITE_PERMISSION)
            ST_PROTOCOL();                    // protocol if miss
        st R8 -> [R3 + R4]                    // original store
        UNLOCK(LOCAL_DIR);                    // unlock local dir
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st R3 -> [R7 + 4]
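The magic-value load check can be illustrated with a small Python model. This is only a sketch under stated assumptions, not DSZOOM's code: the `Node` class, the `MAGIC` constant and the dictionary standing in for the home copy are all hypothetical names introduced here. It shows the idea from the slide: the load itself doubles as the permission test, and the protocol is called only when the loaded value equals the magic value.

```python
MAGIC = 0xDEADBEEF  # hypothetical magic value; the real constant is not given in the slides


class Node:
    """One node's private copy of global memory (illustrative model only)."""

    def __init__(self, home):
        self.home = home          # dict standing in for the up-to-date home copy
        self.local = {}           # address -> value in this node's private copy
        self.protocol_calls = 0   # how often the load protocol was entered

    def invalidate(self, addr):
        # Assumption: invalid data is marked by overwriting it with MAGIC.
        self.local[addr] = MAGIC

    def ld_protocol(self, addr):
        # Stand-in for LD_PROTOCOL(): fetch the current value from the home copy.
        self.protocol_calls += 1
        self.local[addr] = self.home[addr]
        return self.local[addr]

    def load(self, addr):
        value = self.local.get(addr, MAGIC)  # original load
        if value == MAGIC:                   # test permission
            value = self.ld_protocol(addr)   # protocol if miss
        return value


node = Node({0x100: 7})
print(node.load(0x100), node.protocol_calls)  # first access misses
print(node.load(0x100), node.protocol_calls)  # second access needs no protocol call
```

In this model, data that happens to equal `MAGIC` would enter the protocol on every load; how the real system resolves that case is not covered by the slides.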
10
Sequential Instrumentation Overhead
  Average instrumentation overhead when run on a single processor (SPLASH2 -O3):
    Integer load instrumentation overhead: 3% (only integer loads instrumented)
    Float load instrumentation overhead: 31% (only floating-point loads instrumented)
    Store instrumentation overhead: 61% (only stores instrumented)
11
Write Permission Caching in Action
  Example store access pattern (array traversal)

  Operation        CUID   WPC snippet handling
  ST 0xE22F0000    98     check WPC; miss; upd. WPC; lock dir entry 98; store
  ST 0xE22F0008    98     check WPC; hit; store
  ST 0xE22F0010    98     check WPC; hit; store
  ST 0xE22F0018    98     check WPC; hit; store
  ST 0xE22F0020    98     check WPC; hit; store
  ST 0xE22F0028    98     check WPC; hit; store
  ST 0xE22F0030    98     check WPC; hit; store
  ST 0xE22F0038    98     check WPC; hit; store
  ST 0xE22F0040    99     check WPC; miss; unlock 98; upd. WPC; lock 99; store
  ST 0xE22F0048    99     check WPC; hit; store

  [Figure: Write Permission Cache contents, CUID 98, then 99]
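The array-traversal example can be replayed in a few lines of Python to count the lock and unlock operations with and without a single-entry WPC. This is a sketch, not the DSZOOM implementation: the 64-byte coherence unit is inferred from the trace (eight 8-byte stores share one CUID), and `cu_id`, `original_trace`, `wpc_trace` and `count_atomics` are hypothetical helpers. The address-to-CUID mapping is chosen only so that the numbers match the slide.

```python
CU_SIZE = 64            # assumed coherence-unit size in bytes (inferred from the trace)
BASE_ADDR = 0xE22F0000  # first store address in the example
BASE_CUID = 98          # CUID the slide assigns to that address


def cu_id(addr):
    """Illustrative address-to-CUID mapping; the real mapping is not given."""
    return BASE_CUID + (addr - BASE_ADDR) // CU_SIZE


def original_trace(addresses):
    """Original snippet: lock and unlock the directory entry around every store."""
    return [f"lock dir entry {cu_id(a)}; store; unlock dir entry {cu_id(a)}"
            for a in addresses]


def wpc_trace(addresses):
    """Single-entry WPC: keep the lock until a store touches another CU."""
    wpc = None  # CUID whose directory lock is currently held
    actions = []
    for addr in addresses:
        cuid = cu_id(addr)
        if wpc == cuid:
            actions.append("check WPC; hit; store")
            continue
        steps = ["check WPC", "miss"]
        if wpc is not None:
            steps.append(f"unlock {wpc}")
        steps += ["upd. WPC", f"lock {cuid}", "store"]
        wpc = cuid
        actions.append("; ".join(steps))
    return actions


def count_atomics(actions):
    """Count the lock and unlock steps in a list of action strings."""
    return sum(step.split()[0] in ("lock", "unlock")
               for action in actions for step in action.split("; "))


trace = [BASE_ADDR + 8 * i for i in range(10)]
print(count_atomics(original_trace(trace)))  # 20 lock/unlock operations
print(count_atomics(wpc_trace(trace)))       # 3 lock/unlock operations
```

The lock on CUID 99 is still held when the trace ends; in this model it would be released by the next miss or by a flush.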
12
The Write Permission Cache Idea
  Keep the lock
  Rely on store locality
  SPARC application registers

  Original program:
        add R1, R2 -> R3
    loop:
        ld [R1 + R4] -> R6    // G_LD1
        ld [R2 + R4] -> R7    // G_LD2
        sub R9, 1 -> R9
        add R6, R7 -> R8
        st R8 -> [R3 + R4]    // G_ST1
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st R3 -> [R7 + 4]

  Write Permission Cache Snippet:
    WPC_FASTPATH:
        if (WPC != CU_ID(ADDR))
            WPC_SLOWPATH()
        st R8 -> [R3 + R4];    // original store
    WPC_SLOWPATH:
        UNLOCK(WPC)
        WPC = CU_ID(ADDR)
        LOCK(WPC);
        if (LOCAL_DIR != WRITE_PERMISSION)
            ST_PROTOCOL();
13
Experimental Setup: Software
  Benchmarks: unmodified SPLASH2
  Compiler: GCC 3.3.3 (-O0 and -O3)
  Instrumentation tool: custom made
14
Experimental Setup: Hardware
  SMP: Sun Enterprise E6000 Server
    16 UltraSPARC II (250 MHz)
    Memory access time 330 ns [lmbench]
  HW-DSM: Sun Wildfire (2 E6000 nodes)
    Remote memory access time 1700 ns [lmbench]
    Hardware coherent interconnect, bandwidth 800 MB/s
  DSZOOM: Runs in user space on the Wildfire system
    put (get) = uncacheable block store (load) operation
    atomic = ldstub (load-store unsigned byte, SPARC V9)
    maintains coherence between private copies of G_MEM
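Since the atomic primitive is ldstub, a directory lock can be pictured as a test-and-set on one byte: SPARC V9 ldstub returns the old byte and writes 0xFF in a single atomic step. The Python model below is a sketch only. The `Memory` class, the one-byte-per-lock layout and the `try_lock`/`unlock` helpers are assumptions made for illustration, and plain Python does not provide the atomicity that the hardware instruction does.

```python
class Memory:
    """Byte-addressable memory holding one lock byte per directory entry (assumed layout)."""

    def __init__(self, size):
        self.bytes = bytearray(size)  # all lock bytes start at 0 (free)

    def ldstub(self, addr):
        """Model of SPARC ldstub: return the old byte and set it to 0xFF.

        On real hardware this read-modify-write is atomic; here it is not.
        """
        old = self.bytes[addr]
        self.bytes[addr] = 0xFF
        return old


def try_lock(mem, addr):
    """The lock is acquired only if the byte was 0 before ldstub set it."""
    return mem.ldstub(addr) == 0


def unlock(mem, addr):
    """Release the lock by storing 0 back to the lock byte."""
    mem.bytes[addr] = 0


mem = Memory(4)
print(try_lock(mem, 1))  # True: lock byte was free
print(try_lock(mem, 1))  # False: already held
unlock(mem, 1)
print(try_lock(mem, 1))  # True again after the release
```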
15
Write Permission Cache Hit Rate
16
Sequential Instrumentation Overhead
17
Execution Time, 16 processors (2x8)
  Performance bug in paper (popc).
18
Conclusions
  Write permission cache (WPC)
    Effectively reduces store instrumentation overhead
    2 entries are sufficient
    Store instrumentation overhead reduction: 42%
    HW-, SW-DSM gap reduction: 28%
    Parallel performance improvement: 9%
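Why a second entry can matter is easy to see on a synthetic pattern in which a loop writes two arrays alternately. The sketch below is a toy model, not a reproduction of the SPLASH2 measurements: the LRU replacement policy, the 64-byte coherence unit, the array base addresses and the names `WritePermissionCache`, `interleaved_trace` and `simulate` are all assumptions introduced here.

```python
from collections import OrderedDict

CU_SIZE = 64  # assumed coherence-unit size in bytes


class WritePermissionCache:
    """N-entry WPC model with LRU replacement (the policy is an assumption)."""

    def __init__(self, entries):
        self.entries = entries
        self.held = OrderedDict()  # CU IDs whose directory locks are held, LRU first
        self.hits = 0
        self.misses = 0

    def store(self, addr):
        cuid = addr // CU_SIZE
        if cuid in self.held:
            self.held.move_to_end(cuid)    # mark as most recently used
            self.hits += 1
            return
        self.misses += 1
        if len(self.held) >= self.entries:
            self.held.popitem(last=False)  # evict LRU entry: its lock would be released
        self.held[cuid] = True             # the new directory lock would be taken here

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def interleaved_trace(n):
    """Synthetic pattern: two arrays of 8-byte elements written alternately."""
    a, b = 0x10000, 0x80000  # hypothetical array base addresses
    trace = []
    for i in range(n):
        trace += [a + 8 * i, b + 8 * i]
    return trace


def simulate(entries, trace):
    wpc = WritePermissionCache(entries)
    for addr in trace:
        wpc.store(addr)
    return wpc.hit_rate()


for entries in (1, 2):
    print(entries, simulate(entries, interleaved_trace(64)))
```

On this pattern a single entry misses on every store, because consecutive stores always touch different coherence units, while two entries miss only when a stream crosses into a new unit.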
19
Thanks and Questions
  http://www.it.uu.se/research/group/uart
20
Memory Consistency
  The base architecture implements sequential consistency (SC) by requiring acknowledgements from all sharing nodes before a global store request is granted
  Introducing the WPC in an invalidation-based environment will not weaken the memory model
  The WPC only extends the duration of the permission tenure before the write permission is given up
  If the memory model of each node is weaker than SC, it will decide the memory model of the system
21
Deadlock
  WPC entries are flushed at:
    Synchronization points
    Failures to acquire directory locks
    Thread termination
  WPC + flag synchronization can lead to deadlock
  Timers
  Interrupt other CPUs
  Lack of forward progress
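The flush rules can be sketched with a small two-processor model in which a processor never waits for a directory lock while still holding others. This is a hedged illustration, not the DSZOOM implementation: the non-blocking acquire with retry, the two-entry capacity, the oldest-first eviction and the `Directory` and `Wpc` classes are assumptions. Python's `threading.Lock` stands in for a directory lock, and the scenario runs deterministically in a single thread.

```python
import threading


class Directory:
    """Directory entries, each guarded by its own lock (illustrative)."""

    def __init__(self):
        self.locks = {}

    def lock_for(self, cuid):
        return self.locks.setdefault(cuid, threading.Lock())


class Wpc:
    """Per-processor WPC that flushes all entries when a lock acquire fails."""

    def __init__(self, directory, entries=2):
        self.directory = directory
        self.entries = entries
        self.held = []       # CU IDs whose directory locks are held, oldest first
        self.collisions = 0  # failed attempts to acquire a directory lock

    def flush(self):
        """Release every held lock (sync point, failed acquire, thread exit)."""
        for cuid in self.held:
            self.directory.lock_for(cuid).release()
        self.held.clear()

    def store(self, cuid):
        """Return True if the store may proceed, False if it must be retried."""
        if cuid in self.held:
            return True                                   # WPC hit
        if self.directory.lock_for(cuid).acquire(blocking=False):
            if len(self.held) >= self.entries:
                victim = self.held.pop(0)                 # evict the oldest entry
                self.directory.lock_for(victim).release()
            self.held.append(cuid)
            return True
        self.collisions += 1                              # lock is held elsewhere
        self.flush()                                      # never wait while holding locks
        return False


directory = Directory()
p0, p1 = Wpc(directory), Wpc(directory)
print(p0.store(98), p1.store(99))  # each processor caches one permission
print(p0.store(99))                # fails; p0 flushes and releases 98
print(p1.store(98))                # succeeds because p0 let go of 98
p1.flush()                         # p1 reaches a synchronization point
print(p0.store(99))                # the retry now succeeds
```

Without the flush on a failed acquire, the two processors in this scenario would each hold the lock the other one needs.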
22
Directory Collisions
  Directory collision: a requesting processor fails to acquire a directory lock
  The number of directory collisions does not increase when fewer than 32 WPC entries are used
  More information in the paper