1
Software Techniques for Distributed Shared Memory
Zoran Radovic, zoran.radovic@it.uu.se
Dissertation Seminar, Nov 18, 2005
Auditorium Minus, Museum Gustavianum
2
Outline
  NUCA Locks
  DSZOOM: Software-based Shared Memory
  TMA: Trap-based Memory Architecture
3
Vasaloppet: “Contention Problem in Sweden”
  Traditional cross-country ski race, 90 km
  … 85.6533 km to go…
  [Figure label: CS]
4
Spin Locks under Contention
  [Plot: critical section (CS) cost vs. amount of contention; curve: spin locks with backoff]
  IF (more contention) THEN less efficient CS …
  “The more important the slower it runs…”
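For readers unfamiliar with the baseline, a spin lock with backoff can be sketched as below. This is a minimal test-and-test-and-set lock with exponential backoff written with C11 atomics, not code from the dissertation; the names `backoff_lock`, `spin_try_acquire`, and `next_backoff` and the constants `BACKOFF_MIN` and `BACKOFF_MAX` are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical tuning constants, not taken from the slides. */
enum { BACKOFF_MIN = 4, BACKOFF_MAX = 1024 };

typedef struct { atomic_int locked; } backoff_lock;   /* 0 = free, 1 = held */

/* Busy-wait for roughly `iters` loop iterations. */
static void delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) { }
}

/* One attempt: read first, and only then try the atomic exchange. */
static bool spin_try_acquire(backoff_lock *l) {
    if (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
        return false;
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Double the backoff after each failed attempt, up to BACKOFF_MAX. */
static unsigned next_backoff(unsigned b) {
    return b * 2 > BACKOFF_MAX ? BACKOFF_MAX : b * 2;
}

static void spin_acquire(backoff_lock *l) {
    unsigned b = BACKOFF_MIN;
    while (!spin_try_acquire(l)) {
        delay(b);
        b = next_backoff(b);
    }
}

static void spin_release(backoff_lock *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Every waiter spins on the same lock word, so each release is followed by a burst of competing attempts. That shared traffic is one plausible reading of why the CS cost on this slide grows with contention.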
5
Queue-based Locks
  [Plot: CS cost vs. amount of contention; curves: spin locks with backoff, queue-based locks]
  IF (more contention) THEN constant CS cost …
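The slide does not say which queue-based lock was measured. As an illustration only, the sketch below shows the MCS lock, a well-known queue-based lock, in C11 atomics. Each waiter spins on a flag in its own queue node, which is the property behind the roughly constant handover cost. The type and function names are assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the queue, if any */
    atomic_bool waiting;               /* true while this waiter must spin */
} mcs_node;

typedef struct { _Atomic(mcs_node *) tail; } mcs_lock;   /* NULL = free */

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->waiting, true, memory_order_relaxed);
    /* Append this node to the queue with one atomic swap on the tail. */
    mcs_node *pred = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (pred == NULL)
        return;                        /* queue was empty: lock acquired */
    atomic_store_explicit(&pred->next, me, memory_order_release);
    /* Spin on our own node only, not on a shared lock word. */
    while (atomic_load_explicit(&me->waiting, memory_order_acquire)) { }
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        mcs_node *expected = me;
        /* No visible successor: try to swing the tail back to empty. */
        if (atomic_compare_exchange_strong_explicit(&l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor is mid-enqueue; wait until it links itself in. */
        while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL) { }
    }
    atomic_store_explicit(&succ->waiting, false, memory_order_release);
}
```

Each thread passes its own `mcs_node` to both calls, typically one per thread and lock, and must keep it alive until `mcs_release` returns.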
6
This Dissertation
  [Plot: CS cost vs. amount of contention; curves: spin locks with backoff, queue-based locks, NUCA locks]
  IF (more contention) THEN more efficient CS …
  “The more important the faster it runs…”
7
NUCA Locks (Basic Idea)
  [Diagram: processors (P) with caches ($), memory, and a switch; labels: Test, Lock/Unlock]
  1) Reduce traffic: one CPU per node is testing…
  2) Improve lock handover
  3) More efficient CS: local traffic is cheaper
8
The HBO Lock (the simplest HBO)
  What do we need?
    node_id
    Compare&swap (CAS) atomic operation: CAS(Lock_address, FREE, node_id)
  lock-acquire:
    If the lock value is in the state FREE:
      The node_id is CAS-ed into the lock location
    Else: 2 cases
      The lock is “local”: spin with small backoff
      The lock is “remote”: spin with large backoff
  Simple but fairly effective…
  Creates communication affinity
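The lock-acquire steps on this slide can be sketched in C11 atomics as follows. This is a simplified reading of the slide, not the dissertation's code. It assumes node ids are nonzero so that `FREE` can be 0, and it uses two fixed backoff constants (`BACKOFF_LOCAL`, `BACKOFF_REMOTE`, both hypothetical values) where a production lock would likely adapt the backoff over time.

```c
#include <stdatomic.h>

enum { FREE = 0 };                                   /* assumption: node ids are nonzero */
enum { BACKOFF_LOCAL = 8, BACKOFF_REMOTE = 256 };    /* hypothetical delays */

typedef struct { atomic_int owner_node; } hbo_lock;  /* FREE or the holder's node_id */

static void hbo_delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) { }
}

/* CAS(Lock_address, FREE, node_id): returns the value found in the lock,
 * i.e. FREE if we just took it, otherwise the node id of the holder. */
static int hbo_cas(hbo_lock *l, int node_id) {
    int expected = FREE;
    atomic_compare_exchange_strong_explicit(&l->owner_node, &expected, node_id,
                                            memory_order_acquire,
                                            memory_order_relaxed);
    return expected;
}

/* Small backoff if the holder is in our node ("local"), large if "remote". */
static unsigned hbo_backoff_for(int holder_node, int my_node) {
    return holder_node == my_node ? BACKOFF_LOCAL : BACKOFF_REMOTE;
}

static void hbo_acquire(hbo_lock *l, int my_node) {
    for (;;) {
        int seen = hbo_cas(l, my_node);
        if (seen == FREE)
            return;                                  /* lock is ours */
        hbo_delay(hbo_backoff_for(seen, my_node));
    }
}

static void hbo_release(hbo_lock *l) {
    atomic_store_explicit(&l->owner_node, FREE, memory_order_release);
}
```

Because waiters in the holder's node retry sooner than remote waiters, a released lock tends to be picked up by a neighbor in the same node. That is the communication affinity the slide refers to.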
9
Performance Results
  Realistic microbenchmark, 2-node WildFire, 28 CPUs
  [Plot; surviving labels: WF, 14]
  Fairness?
10
Fairness Study
  Realistic microbenchmark, 2-node WildFire, 28 CPUs
  [Plot; surviving axis label: t]
11
Application Performance
  28-processor runs
  [Plot; surviving annotation: ≈ 4x]
12
Total Traffic: Raytrace
13
HBO Locks inside Linux Kernel
  Patch provided by Silicon Graphics, Inc.
  Linux-IA64 kernel implementation, May 2005
  Page-fault handler runs 3x faster (60 processors)
14
Outline
  NUCA Locks
  DSZOOM: Software-based Shared Memory
  TMA: Trap-based Memory Architecture
15
The DSZOOM Proposal
16
The DSZOOM Proposal
  Run the entire protocol in the requesting processor
    No protocol agent communication!
  Assumes user-level remote memory access
    put, get, and atomics [InfiniBand]
  Fine-grain memory protocols (e.g., 64 bytes)
  Hardware-like memory models [Shasta, Blizzard, Sirocco]
17
“Squeezing” Protocols into Binaries…
  Binary/assembler-level instrumentation

  Surrounding code (shown in both the original and the DSZOOM program):
      ...
      cmp %g0, %l5
      bne 0x24431
      nop
      ldd [%o0 + 16], %f4
      clr %l5
      ...

  Original program, the load:
      ld [%o1 + 64], %o0

  DSZOOM program, the load plus fast-path protocol code:
      ld [%o1 + 64], %o0
      mov 255, %g6
      and %g6, %o0, %g6
      cmp %g6, 170
      bne 0x24450
      nop

  Slow-path protocol code (C code)
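In C terms, the fast-path check inserted after the load could look like the sketch below. It assumes one reading of the assembly: the low byte of the loaded word is compared with 170 (0xAA), and `bne` skips the slow path whenever they differ. `slow_path_load` and its call counter are hypothetical stand-ins for the slow-path protocol code, which the slide only says is written in C.

```c
#include <stdint.h>

enum { MAGIC_BYTE = 170 };   /* 0xAA, the constant in `cmp %g6, 170` */

static unsigned slow_path_calls = 0;

/* Hypothetical stand-in for the slow-path protocol code. A real protocol
 * would decide whether the data is valid and fetch it if not; this stub
 * only counts the call and re-reads the location. */
static uint32_t slow_path_load(const volatile uint32_t *addr) {
    slow_path_calls++;
    return *addr;
}

/* Instrumented load: the original load followed by the fast-path check. */
static uint32_t checked_load(const volatile uint32_t *addr) {
    uint32_t v = *addr;                 /* ld  [%o1 + 64], %o0          */
    if ((v & 255u) != MAGIC_BYTE)       /* mov / and / cmp / bne        */
        return v;                       /* common case: no protocol work */
    return slow_path_load(addr);        /* low byte matches: slow path  */
}
```

The point of the design is that the common case costs only a mask, a compare, and a branch after the load that the program was going to execute anyway.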
18
Write Permission Caching
  Problem: store instrumentation relies on locking
    More complex instrumentation
  Solution: write permission cache (WPC)
    Small and fast software-managed cache
    Keeps write permissions
  The WPC idea:
    Exploit store locality
    Dynamically reduce the number of memory references in store-checking code
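One way to picture the WPC idea is a tiny direct-mapped table of coherence units for which write permission is already held. The sketch below is illustrative and not DSZOOM's implementation. The table size `WPC_ENTRIES`, the 64-byte `UNIT_BYTES`, and the stub `acquire_write_permission` (standing in for the locking-based store check) are all assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

enum { WPC_ENTRIES = 4, UNIT_BYTES = 64 };   /* hypothetical sizes */

typedef struct {
    uintptr_t unit[WPC_ENTRIES];    /* coherence-unit number held per slot */
    bool      valid[WPC_ENTRIES];
    unsigned  slow_checks;          /* how often the expensive path ran */
} wpc;

/* Stand-in for the locking-based store check; here it only counts calls. */
static void acquire_write_permission(wpc *c, uintptr_t unit) {
    (void)unit;
    c->slow_checks++;
}

/* Store check: a WPC hit needs no locking at all. */
static void wpc_store_check(wpc *c, const void *addr) {
    uintptr_t unit = (uintptr_t)addr / UNIT_BYTES;
    unsigned slot = (unsigned)(unit % WPC_ENTRIES);
    if (c->valid[slot] && c->unit[slot] == unit)
        return;                                /* hit: permission cached */
    /* Miss: a real protocol would also give up the permission of the
     * evicted unit; that step is omitted in this sketch. */
    acquire_write_permission(c, unit);
    c->unit[slot] = unit;
    c->valid[slot] = true;
}
```

Consecutive stores to the same coherence unit, the store locality the slide mentions, pay for the slow check only once.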
19
Other “Features”
  Two kinds of protocols
    Invalidate
    Update
  Many optimizations
    Instrumentation scheduling (update and invalidate)
    Instrumentation batching (invalidate)
    WPC-based write batching (update)
    WPC-based dirty-data filtering (update)
    Private-data filtering (update)
    # of WPC entries (update and invalidate)
    Coherence unit size (update and invalidate)
20
Coherence Flags and Profiling
  Coherence flags
    Similar to optimization flags of compilers
    Possible scenario:
      gcc -dszoom-cl 128 -dszoom-inv -O3 my_app.c
  Execution profiling
    Similar to profile feedback of compilers
    Helps find appropriate coherence flag settings
    Low-overhead implementation in DSZOOM
      Less than 30 percent overhead
      Works for both small and large input sets
21
DSZOOM Results
  2-node WildFire, 16 CPUs
  [Plot; surviving annotations: 1.45x, 1.11x]
22
Outline
  NUCA Locks
  DSZOOM: Software-based Shared Memory
  TMA: Trap-based Memory Architecture
23
Instrumentation Drawbacks
  [Same instrumentation figure as slide 17: original program vs. DSZOOM program, with fast-path protocol code and slow-path protocol code (C code)]
  Binary transparency?
  Run-time execution overhead
24
Trap-Based Memory Architectures
  Basic idea
    Detect fine-grained coherence violations in hardware
    Trigger a coherence trap when one occurs
    Maintain coherence by software protocols
  No memory system modifications
  Minimal processor modifications
  Binary transparency
    No need to instrument binaries/applications
25
TMA Lite: Proof-of-concept Implementation
  Load permission check
    Hardware implementation of software check
    Predefined “magic-value” convention
  Store permission check
    Hardware WPC
      Can be seen as a very small cache
      Operates on virtual addresses
      Accessed in parallel with the data TLB
26
TMA Lite Performance
  [TMA: simulation study, 4 nodes | DSZOOM: 2-node WildFire]
  [Plot; surviving annotations: 1.75x, 1.01x]
27
Topics not Presented
  RH lock algorithm
    Controlled (un)fairness
  HBO_GT and HBO_GT_SD algorithms
    Global throttling and starvation detection
  DSZOOM implementation details
    Instrumentation challenges: scheduling, batching, etc.
    Bandwidth filtering techniques: dirty- & private-data
  Innovative TMA simulation tricks
    Low-level “good days” hacks
    Reusing Simics checkpoints
28
Software Techniques for Distributed Shared Memory
Zoran Radovic, zoran.radovic@it.uu.se
Dissertation Seminar, Nov 18, 2005
Auditorium Minus, Museum Gustavianum