Nov 18, 2005 Seminar Dissertation Seminar, 18/11 – 2005 Auditorium Minus, Museum Gustavianum Software Techniques for.


1 Nov 18, 2005 Zoran.Radovic@it.uu.se Dissertation Seminar
  Dissertation Seminar, 18/11 – 2005, Auditorium Minus, Museum Gustavianum
  Software Techniques for Distributed Shared Memory
  Zoran Radovic, zoran.radovic@it.uu.se

2 Outline
  - NUCA Locks
  - DSZOOM – Software-based Shared Memory
  - TMA – Trap-based Memory Architecture

3 Vasaloppet: “Contention Problem in Sweden”
  Traditional cross-country ski race, 90 km
  … 85.6533 km to go… CS

4 Spin Locks under Contention
  [Plot: critical section (CS) cost vs. amount of contention; curve: spin locks with backoff]
  IF (more contention) THEN less efficient CS …
  “The more important the slower it runs…”
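For readers unfamiliar with the baseline, a spin lock with backoff can be sketched roughly as follows. This is a minimal illustration in C11, not code from the talk: the names `spin_lock_t`, `next_backoff`, and `pause_for`, and the constants `BACKOFF_BASE` and `BACKOFF_CAP`, are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>

#define BACKOFF_BASE 4u      /* hypothetical tuning values, not from the talk */
#define BACKOFF_CAP  1024u

typedef struct { atomic_int held; } spin_lock_t;   /* 0 = free, 1 = held */

/* Next delay: double the current one, but never exceed the cap. */
static unsigned next_backoff(unsigned delay) {
    assert(delay > 0);
    return delay * 2u > BACKOFF_CAP ? BACKOFF_CAP : delay * 2u;
}

/* Busy-wait for roughly `delay` iterations without touching the lock. */
static void pause_for(unsigned delay) {
    for (volatile unsigned i = 0; i < delay; i++) { }
}

/* One acquisition attempt: returns 1 on success, 0 if the lock is held. */
static int spin_try_lock(spin_lock_t *l) {
    int expected = 0;
    return atomic_compare_exchange_strong(&l->held, &expected, 1);
}

/* Retry until acquired, waiting longer after every failed attempt. */
static void spin_lock(spin_lock_t *l) {
    unsigned delay = BACKOFF_BASE;
    while (!spin_try_lock(l)) {
        pause_for(delay);
        delay = next_backoff(delay);
    }
}

static void spin_unlock(spin_lock_t *l) {
    atomic_store(&l->held, 0);
}
```

Under heavy contention every waiter keeps retrying on the same lock word, and the lock can be handed to any waiter regardless of where it runs.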

5 Queue-based Locks
  [Plot: CS cost vs. amount of contention; curves: spin locks with backoff, queue-based locks]
  IF (more contention) THEN constant CS cost …

6 This Dissertation
  [Plot: CS cost vs. amount of contention; curves: queue-based locks, spin locks with backoff, NUCA locks]
  IF (more contention) THEN more efficient CS …
  “The more important the faster it runs…”

7 NUCA Locks (Basic Idea)
  [Diagram labels: Switch, Memory, Test, Lock/Unlock; three nodes, each P $ P $ P $ …]
  1) Reduce traffic - one CPU per node is testing…
  2) Improve lock handover
  3) More efficient CS - local traffic is cheaper

8 The HBO Lock (the simplest HBO)
  - What do we need?
    - node_id
    - Compare&swap (CAS) atomic operation: CAS(Lock_address, FREE, node_id)
  - lock-acquire:
    - If the lock value is in the state FREE: the node_id is CAS-ed into the lock location
    - Else, 2 cases:
      - The lock is “local”: spin with small backoff
      - The lock is “remote”: spin with large backoff
  - Simple but fairly effective…
  - Creates communication affinity
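The lock-acquire steps on this slide can be sketched in C11 as below. This is an illustration under stated assumptions rather than the dissertation's implementation: node ids are assumed to start at 1 so that `FREE` can be 0, the backoff constants are invented, and the helper names (`hbo_lock_t`, `hbo_try_acquire`, `hbo_backoff`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define FREE           0      /* lock value when nobody holds it */
#define BACKOFF_LOCAL  8u     /* hypothetical: short wait, holder is in my node */
#define BACKOFF_REMOTE 256u   /* hypothetical: long wait, holder is in another node */

typedef struct { atomic_int owner_node; } hbo_lock_t;  /* FREE or holder's node_id */

/* Pick the backoff from where the current holder runs. */
static unsigned hbo_backoff(int holder_node, int my_node) {
    return holder_node == my_node ? BACKOFF_LOCAL : BACKOFF_REMOTE;
}

/* CAS(Lock_address, FREE, node_id).
 * Returns FREE on success, otherwise the node_id observed in the lock. */
static int hbo_try_acquire(hbo_lock_t *l, int my_node) {
    assert(my_node != FREE);          /* assumption: node ids start at 1 */
    int seen = FREE;
    if (atomic_compare_exchange_strong(&l->owner_node, &seen, my_node))
        return FREE;
    return seen;                      /* CAS failed: `seen` holds the holder's node */
}

/* Spin until acquired; local waiters retry sooner than remote ones. */
static void hbo_acquire(hbo_lock_t *l, int my_node) {
    int holder;
    while ((holder = hbo_try_acquire(l, my_node)) != FREE) {
        unsigned delay = hbo_backoff(holder, my_node);
        for (volatile unsigned i = 0; i < delay; i++) { }
    }
}

static void hbo_release(hbo_lock_t *l) {
    atomic_store(&l->owner_node, FREE);
}
```

Because waiters in the holder's node retry more often, a released lock is more likely to be picked up by a neighbor, which is the communication affinity the slide refers to.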

9 Performance Results
  Realistic microbenchmark, 2-node WildFire, 28 CPUs
  [Chart; labels: WF, 14]
  Fairness?

10 Fairness Study
  Realistic microbenchmark, 2-node WildFire, 28 CPUs
  [Chart; axis label: t]

11 Application Performance
  28-processor runs
  [Chart; label: ≈ 4x]

12 Total Traffic: Raytrace

13 HBO Locks inside Linux Kernel
  - Patch provided by Silicon Graphics, Inc.
  - Linux-IA64 kernel implementation, May 2005
  - Page-fault handler runs 3x faster
  - 60 processors

14 Outline
  - NUCA Locks
  - DSZOOM – Software-based Shared Memory
  - TMA – Trap-based Memory Architecture

15 The DSZOOM Proposal

16 The DSZOOM Proposal
  - Run entire protocol in requesting-processor
  - No protocol agent communication!
  - Assumes user-level remote memory access
  - put, get, and atomics [InfiniBand]
  - Fine-grain memory protocols (e.g., 64 bytes)
  - Hardware-like memory models [Shasta, Blizzard, Sirocco]

17 “Squeezing” Protocols into Binaries…
  Binary/Assembler level instrumentation
  Original Program:
    ld [%o1 + 64], %o0
  DSZOOM Program:
    ld [%o1 + 64], %o0
    Fast-path Protocol Code:
      mov 255, %g6
      and %g6, %o0, %g6
      cmp %g6, 170
      bne 0x24450
      nop
    Slow-path Protocol Code (C-code)
  Surrounding code, shown in both programs:
    ...
    cmp %g0, %l5
    bne 0x24431
    nop
    ldd [%o0 + 16], %f4
    clr %l5
    ...
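One way to read the fast-path instructions on this slide is as a check of the loaded value's low byte against the constant 170, branching past the slow path when they differ. The C sketch below expresses that reading. It is an interpretation of the listing, not the DSZOOM source: `checked_load`, `slow_path_load`, and `slow_path_calls` are hypothetical names, and the slow path here is only a stub.

```c
#include <assert.h>
#include <stdint.h>

#define MAGIC_BYTE 170u   /* constant taken from the slide's `cmp %g6, 170` */

static int slow_path_calls = 0;   /* counter for demonstration only */

/* Hypothetical stand-in for the slow-path protocol code (C code on the slide).
 * A real protocol would first obtain valid data for the address. */
static uint32_t slow_path_load(const uint32_t *addr) {
    slow_path_calls++;
    return *addr;
}

/* Instrumented load: do the original load, then run the fast-path check. */
static uint32_t checked_load(const uint32_t *addr) {
    uint32_t value = *addr;                 /* ld  [%o1 + 64], %o0            */
    if ((value & 255u) != MAGIC_BYTE)       /* mov/and/cmp, then bne          */
        return value;                       /* common case: skip the slow path */
    return slow_path_load(addr);            /* low byte matched: slow path    */
}
```

In this reading the common case costs a mask, a compare, and a branch after each load. Ordinary data whose low byte happens to equal the constant would also enter the slow path, which would then have to tell the two cases apart.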

18 Write Permission Caching
  - Problem: store instrumentation relies on locking
    - More complex instrumentation
  - Solution: write permission cache (WPC)
    - Small and fast software-managed cache
    - Keeps write permissions
  - The WPC idea:
    - Exploit store locality
    - Dynamically reduce the number of memory references in store checking code
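The WPC idea can be illustrated with a single-entry cache that remembers the coherence unit most recently granted write permission. This is a simplified sketch under assumptions, not the DSZOOM code: the 64-byte unit follows the example size mentioned earlier in the talk, `acquire_write_permission` is a hypothetical stub for the locking slow path, and releasing the evicted unit's permission is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define UNIT_SHIFT 6              /* 64-byte coherence unit (assumed) */
#define WPC_EMPTY  UINTPTR_MAX    /* marks an empty WPC entry */

static uintptr_t wpc_unit = WPC_EMPTY;   /* single-entry write permission cache */
static int permission_fetches = 0;       /* counts slow-path invocations */

/* Hypothetical slow path: a real protocol would take the lock for this
 * coherence unit and release the previously cached permission. */
static void acquire_write_permission(uintptr_t unit) {
    permission_fetches++;
    wpc_unit = unit;
}

/* Instrumented store: consult the WPC first, fall back only on a miss. */
static void checked_store(uint32_t *addr, uint32_t value) {
    uintptr_t unit = (uintptr_t)addr >> UNIT_SHIFT;
    if (unit != wpc_unit)                 /* WPC miss */
        acquire_write_permission(unit);
    *addr = value;                        /* the original store */
}
```

Consecutive stores to the same coherence unit hit in the WPC, so only the first one pays for the permission lookup. That is how store locality reduces the memory references in the store-checking code.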

19 Other “Features”
  - Two kinds of protocols
    - Invalidate
    - Update
  - Many optimizations
    - Instrumentation scheduling (update and invalidate)
    - Instrumentation batching (invalidate)
    - WPC-based write batching (update)
    - WPC-based dirty-data filtering (update)
    - Private-data filtering (update)
    - # of WPC entries (update and invalidate)
    - Coherence unit size (update and invalidate)

20 Coherence Flags and Profiling
  - Coherence flags
    - Similar to optimization flags of compilers
    - Possible scenario: gcc -dszoom-cl 128 -dszoom-inv -O3 my_app.c
  - Execution profiling
    - Similar to profile feedback of compilers
    - Helps find appropriate coherence flag settings
    - Low-overhead implementation in DSZOOM: less than 30 percent overhead
    - Works for both small and large input sets

21 DSZOOM Results
  2-node WildFire, 16 CPUs
  [Chart; labels: 1.45x, 1.11x]

22 Outline
  - NUCA Locks
  - DSZOOM – Software-based Shared Memory
  - TMA – Trap-based Memory Architecture

23 Instrumentation Drawbacks
  Original Program:
    ld [%o1 + 64], %o0
  DSZOOM Program:
    ld [%o1 + 64], %o0
    Fast-path Protocol Code:
      mov 255, %g6
      and %g6, %o0, %g6
      cmp %g6, 170
      bne 0x24450
      nop
    Slow-path Protocol Code (C-code)
  Surrounding code, shown in both programs:
    ...
    cmp %g0, %l5
    bne 0x24431
    nop
    ldd [%o0 + 16], %f4
    clr %l5
    ...
  - Binary transparency?
  - Run-time execution overhead

24 Trap-Based Memory Architectures
  - Basic idea
    - Detect fine-grained coherence violations in hardware
    - Trigger a coherence trap when one occurs
    - Maintain coherence by software protocols
  - No memory system modifications
  - Minimal processor modifications
  - Binary transparency
    - No need to instrument binaries/applications

25 TMA Lite Proof-of-concept Implementation
  - Load permission check
    - Hardware implementation of software check
    - Predefined “magic-value” convention
  - Store permission check
    - Hardware WPC
    - Can be seen as a very small cache
    - Operates on virtual addresses
    - Accessed in parallel with the data TLB

26 TMA Lite Performance
  [TMA: simulation study, 4 nodes | DSZOOM: 2-node WildFire]
  [Chart; labels: 1.75x, 1.01x]

27 Topics not Presented
  - RH lock algorithm
    - Controlled (un)fairness
  - HBO_GT and HBO_GT_SD algorithms
    - Global throttling and starvation detection
  - DSZOOM implementation details
    - Instrumentation challenges; scheduling, batching, etc.
    - Bandwidth filtering techniques; dirty- & private-data
  - Innovative TMA simulation tricks
    - Low-level “good days” hacks
    - Reusing Simics checkpoints

28 Dissertation Seminar, 18/11 – 2005, Auditorium Minus, Museum Gustavianum
  Software Techniques for Distributed Shared Memory
  Zoran Radovic, zoran.radovic@it.uu.se

