
1 Exploiting Store Locality through Permission Caching in Software DSMs
Håkan Zeffer, Zoran Radovic, Oskar Grenholm and Erik Hagersten (zeffer@it.uu.se)
Euro-Par 2004. Uppsala Architecture Research Team [UART] | www.it.uu.se/research/group/uart
Uppsala University, Dept. of Information Technology, Div. of Computer Systems

2 Software Distributed Shared Memory

3 Traditional Software DSMs
- Page-based coherence [e.g., Ivy, Munin, TreadMarks]
- Virtual memory hardware for coherence checks
    - Expensive TLB traps
- Large coherence unit size
    - Problem: False sharing
    - Solution: Weak memory consistency models
[Figure labels: CPUs, DATA, dir, req., ST miss]

4 Fine-Grain Software DSMs
- Fine-grain access-control checks [Shasta, Blizzard]
- Relies on binary instrumentation
- Avoids operating system trapping
- Less false sharing
- Extra instructions introduce overhead
[Figure labels: CPUs, DATA, dir, req.; checking code instrumented into the application: "if (miss) goto st_protocol" before the ST]
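To make the false-sharing point concrete, the sketch below maps two unrelated addresses to coherence units at page granularity and at fine granularity. The 8 KB page size and the 64-byte unit size are illustrative assumptions, not values stated on the slide, and both helper functions are hypothetical.

```python
PAGE_SIZE = 8192   # assumed page size for a page-based DSM
FINE_SIZE = 64     # assumed fine-grain coherence unit size


def coherence_unit(addr: int, unit_size: int) -> int:
    """Return the index of the coherence unit that contains addr."""
    return addr // unit_size


def falsely_shared(addr_a: int, addr_b: int, unit_size: int) -> bool:
    """Two distinct addresses falsely share when they land in the same unit."""
    same_unit = coherence_unit(addr_a, unit_size) == coherence_unit(addr_b, unit_size)
    return addr_a != addr_b and same_unit


# Two processors write to different variables that are 256 bytes apart.
a, b = 0xE22F0000, 0xE22F0100
print(falsely_shared(a, b, PAGE_SIZE))  # True: both addresses sit in one page
print(falsely_shared(a, b, FINE_SIZE))  # False: they sit in different 64-byte units
```

With the larger unit, every write by one processor invalidates the other processor's copy even though the two never touch the same data.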

5 Fine-Grain Pros and Cons
- Pros
    - Small coherence unit
    - Hardware-like memory consistency model
- Cons
    - Extra check instructions to execute
- Our proposal: Write Permission Cache (WPC)
    - Exploits store locality
    - Caches write permission
    - Effectively reduces the store instrumentation cost

6 Outline
- Motivation
- Problem: Instrumentation Overhead
- Solution: Write Permission Cache
- Experimental Setup
- Results on Real HW- and SW-DSM Systems
- Conclusions

7 Software Fine-Grain Coherence
- Binary instrumentation of global loads and stores
- Inserted code "snippet" maintains coherence
Original program:
        add R1, R2 -> R3
    loop:
        ld  [R1 + R4] -> R6      // G_LD1
        ld  [R2 + R4] -> R7      // G_LD2
        sub R9, 1 -> R9
        add R6, R7 -> R8
        st  R8 -> [R3 + R4]      // G_ST1
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st  R3 -> [R7 + 4]
Instrumented program:
        add R1, R2 -> R3
    loop:
        load snippet for G_LD1
        call coherence protocol if load miss
        load snippet for G_LD2
        call coherence protocol if load miss
        sub R9, 1 -> R9
        add R6, R7 -> R8
        store snippet for G_ST1
        call coherence protocol if store miss
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st  R3 -> [R7 + 4]

8 The Lock Problem (original DSZOOM)
- Example store access pattern (array traversal)
Operation       CUID  Original snippet handling
ST 0xE22F0000   98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0008   98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0010   98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0018   98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0020   98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0028   98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0030   98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0038   98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0040   99    lock dir entry 99; store; unlock dir entry 99
ST 0xE22F0048   99    lock dir entry 99; store; unlock dir entry 99
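The cost in the trace above can be counted with a small model. This is a sketch under stated assumptions: the 64-byte coherence unit is inferred from where the CUID changes in the trace, and `cu_id` and `original_store_trace` are hypothetical helpers, not DSZOOM code.

```python
CU_SIZE = 64            # assumed: the CUID in the trace changes every 64 bytes
BASE_ADDR = 0xE22F0000  # first address in the slide's trace
BASE_CUID = 98          # CUID of that address in the slide's trace


def cu_id(addr: int) -> int:
    """Map an address to its coherence-unit ID (hypothetical mapping)."""
    return BASE_CUID + (addr - BASE_ADDR) // CU_SIZE


def original_store_trace(addrs):
    """Handle every store as lock; store; unlock and return the operation log."""
    log = []
    for addr in addrs:
        cu = cu_id(addr)
        log.append(("lock", cu))
        log.append(("store", addr))
        log.append(("unlock", cu))
    return log


# Ten 8-byte stores, as in the array traversal on the slide.
trace = [BASE_ADDR + 8 * i for i in range(10)]
log = original_store_trace(trace)
lock_count = sum(1 for op, _ in log if op == "lock")
print(lock_count)  # 10: one lock acquisition per store
```

Eight of the ten acquisitions target the same directory entry back to back, which is the redundancy the slide points at.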

9 DSZOOM Fine-Grain Coherence
- Magic value (load), atomic operations (store)
Original program:
        add R1, R2 -> R3
    loop:
        ld  [R1 + R4] -> R6      // G_LD1
        ld  [R2 + R4] -> R7      // G_LD2
        sub R9, 1 -> R9
        add R6, R7 -> R8
        st  R8 -> [R3 + R4]      // G_ST1
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st  R3 -> [R7 + 4]
Instrumented program:
        add R1, R2 -> R3
    loop:
        ld  [R1 + R4] -> R6                  // original load
        if (R6 == MAGIC)                     // test permission
            LD_PROTOCOL();                   // protocol if miss
        ld  [R2 + R4] -> R7                  // original load
        if (R7 == MAGIC)                     // test permission
            LD_PROTOCOL();                   // protocol if miss
        sub R9, 1 -> R9
        add R6, R7 -> R8
        LOCK(LOCAL_DIR);                     // lock local dir
        if (LOCAL_DIR != WRITE_PERMISSION)
            ST_PROTOCOL();                   // protocol if miss
        st  R8 -> [R3 + R4]                  // original store
        UNLOCK(LOCAL_DIR);                   // unlock local dir
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st  R3 -> [R7 + 4]
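The load check on this slide can be modelled in a few lines. The sketch below is a toy model only: the `MAGIC` constant, the `Node` class and its dictionaries are hypothetical stand-ins, and the real protocol's messaging is reduced to a dictionary lookup.

```python
MAGIC = 0xDEADBEEF  # hypothetical magic value; the slides do not give the real one


class Node:
    """Toy model of one node's private copy of global memory."""

    def __init__(self, home):
        self.home = home          # authoritative data, standing in for the remote node
        self.local = {}           # private copy; absent or MAGIC means no read permission
        self.protocol_calls = 0

    def ld_protocol(self, addr):
        """Fetch the value from home and install it in the local copy."""
        self.protocol_calls += 1
        self.local[addr] = self.home[addr]
        return self.local[addr]

    def load(self, addr):
        """Instrumented load: do the original load, then compare against MAGIC."""
        value = self.local.get(addr, MAGIC)   # original load
        if value == MAGIC:                    # test permission
            value = self.ld_protocol(addr)    # protocol if miss
        return value


node = Node(home={0x100: 7, 0x104: 9})
first = node.load(0x100)    # miss: the local copy holds MAGIC
second = node.load(0x100)   # hit: no protocol call
print(first, second, node.protocol_calls)  # 7 7 1
```

In this model the check needs no separate permission lookup on the hit path, which fits the low load overhead the talk reports; stores cannot use the same trick and take the directory lock instead.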

10 Sequential Instrumentation Overhead
Average instrumentation overhead when run on a single processor (SPLASH2, -O3):
- Integer load instrumentation overhead: 3%
    - Overhead when only integer loads are instrumented
- Float load instrumentation overhead: 31%
    - Only floating-point loads instrumented
- Store instrumentation overhead: 61%
    - Only stores instrumented

11 Write Permission Caching in Action
- Example store access pattern (array traversal)
Operation       CUID  WPC snippet handling
ST 0xE22F0000   98    check WPC; miss; upd. WPC; lock dir entry 98; store
ST 0xE22F0008   98    check WPC; hit; store
ST 0xE22F0010   98    check WPC; hit; store
ST 0xE22F0018   98    check WPC; hit; store
ST 0xE22F0020   98    check WPC; hit; store
ST 0xE22F0028   98    check WPC; hit; store
ST 0xE22F0030   98    check WPC; hit; store
ST 0xE22F0038   98    check WPC; hit; store
ST 0xE22F0040   99    check WPC; miss; unlock 98; upd. WPC; lock 99; store
ST 0xE22F0048   99    check WPC; hit; store
[Figure: Write Permission Cache contents: 98, then 99]
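Replaying the same ten stores through a one-entry WPC gives the hit and lock counts directly. As before, this is a sketch: the 64-byte unit size is inferred from the trace, and `cu_id` and `wpc_store_trace` are hypothetical names.

```python
CU_SIZE = 64            # assumed: the CUID in the trace changes every 64 bytes
BASE_ADDR = 0xE22F0000  # first address in the slide's trace
BASE_CUID = 98          # CUID of that address in the slide's trace


def cu_id(addr: int) -> int:
    """Map an address to its coherence-unit ID (hypothetical mapping)."""
    return BASE_CUID + (addr - BASE_ADDR) // CU_SIZE


def wpc_store_trace(addrs):
    """Replay stores through a one-entry WPC; return (hits, misses, locks, unlocks)."""
    wpc = None  # CU ID whose directory lock is currently held
    hits = misses = locks = unlocks = 0
    for addr in addrs:
        cu = cu_id(addr)
        if wpc == cu:
            hits += 1            # check WPC; hit
        else:
            misses += 1          # check WPC; miss
            if wpc is not None:
                unlocks += 1     # unlock the previously cached entry
            wpc = cu             # upd. WPC
            locks += 1           # lock the new directory entry
        # the original store would execute here
    return hits, misses, locks, unlocks


trace = [BASE_ADDR + 8 * i for i in range(10)]
hits, misses, locks, unlocks = wpc_store_trace(trace)
print(hits, misses, locks, unlocks)  # 8 2 2 1
```

For this traversal the WPC needs two lock acquisitions, where handling each store with its own lock and unlock would need ten. The final entry stays locked until something flushes it.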

12 The Write Permission Cache Idea
- Keep the lock
- Rely on store locality
- SPARC application registers
Original program:
        add R1, R2 -> R3
    loop:
        ld  [R1 + R4] -> R6      // G_LD1
        ld  [R2 + R4] -> R7      // G_LD2
        sub R9, 1 -> R9
        add R6, R7 -> R8
        st  R8 -> [R3 + R4]      // G_ST1
        add R4, 4 -> R4
        bnz R9, loop
    L134:
        st  R3 -> [R7 + 4]
Write Permission Cache snippet:
    WPC_FASTPATH:
        if (WPC != CU_ID(ADDR))
            WPC_SLOWPATH();
        st  R8 -> [R3 + R4];     // original store
    WPC_SLOWPATH:
        UNLOCK(WPC);
        WPC = CU_ID(ADDR);
        LOCK(WPC);
        if (LOCAL_DIR != WRITE_PERMISSION)
            ST_PROTOCOL();
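The fast path and slow path can be sketched as a small class. This is a single-threaded illustration under assumptions, not the DSZOOM snippet: directory locks are modelled with `threading.Lock`, the store protocol is reduced to setting a permission flag, and the 64-byte unit size and all class and attribute names are hypothetical.

```python
import threading
from collections import defaultdict


class WritePermissionCache:
    """One-entry WPC for a single thread; comments map to the slide's pseudocode."""

    def __init__(self, dir_locks, dir_state, cu_size=64):
        self.dir_locks = dir_locks   # CU ID -> threading.Lock (directory entry lock)
        self.dir_state = dir_state   # CU ID -> permission string
        self.cu_size = cu_size       # assumed coherence-unit size
        self.wpc = None              # CU ID whose lock this thread currently holds
        self.protocol_calls = 0

    def cu_id(self, addr):
        return addr // self.cu_size

    def st_protocol(self, cu):
        """Stand-in for the store protocol: just grant write permission."""
        self.protocol_calls += 1
        self.dir_state[cu] = "WRITE"

    def slowpath(self, cu):
        if self.wpc is not None:
            self.dir_locks[self.wpc].release()   # UNLOCK(WPC)
        self.wpc = cu                            # WPC = CU_ID(ADDR)
        self.dir_locks[cu].acquire()             # LOCK(WPC)
        if self.dir_state.get(cu) != "WRITE":    # LOCAL_DIR != WRITE_PERMISSION
            self.st_protocol(cu)

    def store(self, memory, addr, value):
        cu = self.cu_id(addr)
        if self.wpc != cu:       # fast-path check
            self.slowpath(cu)
        memory[addr] = value     # original store


locks = defaultdict(threading.Lock)
state, memory = {}, {}
wpc = WritePermissionCache(locks, state)
for i in range(10):
    wpc.store(memory, 0x1000 + 8 * i, i)

print(wpc.protocol_calls)                      # 2
print(locks[64].locked(), locks[65].locked())  # False True
```

On a hit the only added work is one comparison, which is why keeping the cached ID in a register matters for the fast path.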

13 Experimental Setup: Software
- Benchmarks: unmodified SPLASH2
- Compiler: GCC 3.3.3 (-O0 and -O3)
- Instrumentation tool: custom made

14 Experimental Setup: Hardware
- SMP: Sun Enterprise E6000 Server
    - 16 UltraSPARC II (250 MHz)
    - Memory access time 330 ns [lmbench]
- HW-DSM: Sun Wildfire (2 E6000 nodes)
    - Remote memory access time 1700 ns [lmbench]
    - Hardware-coherent interconnect, bandwidth 800 MB/s
- DSZOOM: Runs in user space on the Wildfire system
    - put (get) = uncacheable block load (store) operation
    - atomic = ldstub (load-store unsigned byte, SPARC V9)
    - maintains coherence between private copies of G_MEM

15 Write Permission Cache Hit Rate

16 Sequential Instrumentation Overhead

17 Execution Time, 16 processors (2x8)
Performance bug in paper (popc).

18 Conclusions
- Write permission cache (WPC)
    - Effectively reduces store instrumentation overhead
    - 2 entries are sufficient
- Store instrumentation overhead reduction: 42%
- Reduction of the gap between HW-DSM and SW-DSM: 28%
- Parallel performance improvement: 9%

19 Thanks and Questions
http://www.it.uu.se/research/group/uart

20 Memory Consistency
- The base architecture implements sequential consistency by requiring acknowledgements from all sharing nodes before a global store request is granted
- Introducing the WPC in an invalidation-based environment will not weaken the memory model
    - The WPC just extends the duration of the permission tenure before the write permission is given up
- If the memory model of each node is weaker than SC, the node's model determines the memory model of the system

21 Deadlock
- WPC entries are flushed at:
    - Synchronization points
    - Failures to acquire directory locks
    - Thread termination
- WPC + flag synchronization can lead to deadlock
    - Timers
    - Interrupt other CPUs
    - Lack of forward progress
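Two of the flush rules listed above can be illustrated with a toy model. This is a sketch of the idea only: the `Wpc` class, its method names and the use of `threading.Lock` for directory locks are assumptions, and the slides do not show how DSZOOM implements flushing.

```python
import threading


class Wpc:
    """Minimal WPC that holds directory locks until flushed (hypothetical model)."""

    def __init__(self, dir_locks):
        self.dir_locks = dir_locks   # CU ID -> threading.Lock
        self.held = []               # CU IDs whose locks this WPC holds

    def try_acquire(self, cu):
        """Try to take a directory lock; on failure, flush this WPC's entries."""
        if self.dir_locks[cu].acquire(blocking=False):
            self.held.append(cu)
            return True
        self.flush()                 # rule: failure to acquire a directory lock
        return False

    def flush(self):
        """Release every cached write permission."""
        for cu in self.held:
            self.dir_locks[cu].release()
        self.held.clear()

    def barrier(self):
        """Stand-in for a synchronization point: flush before synchronizing."""
        self.flush()


dir_locks = {98: threading.Lock(), 99: threading.Lock()}
a, b = Wpc(dir_locks), Wpc(dir_locks)

a.try_acquire(98)            # A caches write permission for CU 98
b.try_acquire(99)            # B caches write permission for CU 99
got = b.try_acquire(98)      # A holds 98, so B fails and flushes 99
a.barrier()                  # A reaches a synchronization point and flushes
retry = b.try_acquire(98)    # B now succeeds
print(got, retry)            # False True
```

Without the flush on failure, two threads that each hold one entry and want the other's could wait on each other indefinitely. Flag synchronization is harder because the holder never reaches a flush point on its own, which is presumably where the timer and interrupt options on the slide come in.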

22 Directory Collisions
- Directory collision: a requesting processor fails to acquire a directory lock
- The number of directory collisions does not increase when fewer than 32 WPC entries are used
- More information in the paper

