Published by Todd Lynch. Modified over 9 years ago.
1
Fence Scoping
Changhui Lin†, Vijay Nagarajan*, Rajiv Gupta†
† University of California, Riverside
* University of Edinburgh
2
Reordering in Uniprocessors
Memory operations are reordered to improve performance:
  Hardware (e.g., store buffer, reorder buffer)
  Compiler (e.g., code motion, caching a value in a register)
No harm as long as dependences are respected.
Example: a1: St x; a2: Ld y may execute as a2: Ld y; a1: St x.
3
Reordering in Multiprocessors
Initially x=y=0
  P1                P2
  a1: x = 1;        b1: Ry = y;
  a2: y = 1;        b2: Rx = x;
Intuitively, y=1 implies x=1, so Ry=1 implies Rx=1.
Interleavings that keep each processor's program order give (Rx=1, Ry=1), (Rx=1, Ry=0), or (Rx=0, Ry=0).
With reordering, (Rx=0, Ry=1) also becomes possible: counter-intuitive program behavior.
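The outcome table above can be checked by brute force. The following is a minimal sketch, not part of the original slides: it enumerates every interleaving of P1's two stores with P2's two loads, once with P1 in program order and once with P1's stores swapped (standing in for hardware or compiler reordering). The names `outcomes`, `allows`, `kProgramOrder`, and `kStoresSwapped` are made up for this illustration.

```cpp
#include <array>
#include <cassert>
#include <set>
#include <utility>

using Outcome = std::pair<int, int>;  // (Rx, Ry)

// Order in which P1 performs its stores: 'x' means "x = 1", 'y' means "y = 1".
const std::array<char, 2> kProgramOrder{{'x', 'y'}};   // a1 then a2
const std::array<char, 2> kStoresSwapped{{'y', 'x'}};  // a2 then a1 (reordered)

// Enumerate all interleavings of P1's two stores with P2's two loads
// (b1: Ry = y, then b2: Rx = x) and collect the observable outcomes.
std::set<Outcome> outcomes(const std::array<char, 2>& p1_order) {
    assert(p1_order[0] != p1_order[1]);  // exactly one store to x and one to y
    std::set<Outcome> seen;
    // Slots i < j of the 4-step schedule belong to P1; the other two to P2.
    for (int i = 0; i < 4; ++i) {
        for (int j = i + 1; j < 4; ++j) {
            int x = 0, y = 0, rx = 0, ry = 0;
            int p2_step = 0;
            for (int slot = 0; slot < 4; ++slot) {
                if (slot == i || slot == j) {
                    char op = p1_order[slot == i ? 0 : 1];
                    if (op == 'x') x = 1; else y = 1;
                } else {
                    if (p2_step == 0) ry = y; else rx = x;
                    ++p2_step;
                }
            }
            seen.insert(Outcome{rx, ry});
        }
    }
    return seen;
}

// True if some interleaving produces the given register values.
bool allows(const std::array<char, 2>& p1_order, int rx, int ry) {
    return outcomes(p1_order).count(Outcome{rx, ry}) > 0;
}
```

In this model `allows(kProgramOrder, 0, 1)` is false while `allows(kStoresSwapped, 0, 1)` is true, which matches the slide: the surprising result needs a reordering, not just an unlucky interleaving.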
4
Reordering in Multiprocessors
Initially p=NULL, flag = false
  P1                  P2
  p = new A(…)        if (flag)
  flag = true;          a = p->var;
flag is supposed to be set after p is allocated.
If P1's two stores are reordered, P2 can see flag set while p is still NULL: counter-intuitive program behavior.
5
Fence Instructions
Memory Consistency Models
  Specify what reordering is allowed
  e.g., SC, TSO (x86, SPARC), RMO (ARM, PowerPC)
Fence Instructions (Fences/Memory barriers)
  Selectively override default relaxed memory order
  Order memory operations before and after the fence
Example (P1):
  p = new A(…)
  FENCE
  flag = true;
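For readers who want to see the fence from this slide in a real language, here is a hedged sketch using standard C++11 atomics. It is one possible rendering, not the authors' code: `std::atomic_thread_fence` stands in for FENCE, the function names `producer` and `consumer` are invented, and the acquire fence on the reader side is an addition that the C++ memory model requires but the slide does not show.

```cpp
#include <atomic>
#include <cassert>

struct A { int var; };

std::atomic<A*> p{nullptr};
std::atomic<bool> flag{false};

// P1: allocate the object, then publish it by setting flag.
void producer(int value) {
    A* obj = new A{value};
    p.store(obj, std::memory_order_relaxed);
    // Plays the role of FENCE on the slide: the store to p above may not
    // be reordered after the store to flag below.
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}

// P2: read p->var only if flag is set. Returns -1 if nothing is published yet.
int consumer() {
    if (!flag.load(std::memory_order_relaxed)) return -1;
    // Reader-side counterpart (an assumption of this sketch, not on the slide).
    std::atomic_thread_fence(std::memory_order_acquire);
    A* obj = p.load(std::memory_order_relaxed);
    assert(obj != nullptr);  // guaranteed by the fence pairing
    return obj->var;
}
```

The object is deliberately never freed here to keep the sketch short; a real program would manage its lifetime.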
6
Fence Instructions
Inevitable: needed for building concurrent implementations (e.g., mutual exclusion, queues) [Attiya et al., POPL’11]
Expensive: Cilk-5’s THE protocol spends 50% of its time executing a memory fence [Frigo et al., PLDI’98]
7
Motivation
Not all memory orderings enforced by fences are necessary.
Fences are usually used to order a few specific memory operations.
Programmers know best how a fence is used, and this knowledge can be conveyed to the hardware.
[Figure: a concurrent algorithm, with labels "Process Data" and "Control Data Access"]
8
Scoped Fence (S-Fence)
An S-Fence only orders memory operations in its scope
  Scope definition (class scope, set scope)
Bridges the gap between programmers’ intention and hardware execution
  Programmers specify the scope
  Scope information is conveyed to hardware, imposing fewer ordering constraints
Lightweight hardware and compiler support
9
Scoped Fence (S-Fence)
Programming support
  S-FENCE                              global scope
  S-FENCE[class]                       class scope
  S-FENCE[set, {var1, var2, …}]        set scope
10
Work-Stealing Queue Algorithm
Chase-Lev lock-free concurrent work-stealing queue
   1 void put (TASK task){
   2   tail = TAIL;
   3   wsq[tail] = task;
   4   FENCE // store-store
   5   TAIL = tail+1;
   6 }
   7 TASK take ( ){
   8   tail = TAIL - 1;
   9   TAIL = tail;
  10   FENCE // store-load
  11   head = HEAD;
  12   if (tail<head){
  13     TAIL = head;
  14     return EMPTY;
  15   }
     … …
  24   return task;
  25 }
  26 TASK steal ( ){
  27   head = HEAD;
  28   tail = TAIL;
     … …
  35   return task;
  36 }
11
Parallel Spanning Tree
(a) Worker loop
   1 task = wsq.take();
   2 for (each neighbor task’ of task)
   3   if (task’ is not processed){
   4     process(task’);
   5     wsq.put(task’);
   6   }
(b) Memory operations executed (markers ① ② ③ on the slide; line numbers refer to the work-stealing queue listing)
   8 tail = TAIL - 1;
   9 TAIL = tail;
  10 FENCE
  11 head = HEAD;
     ……
     color[task’] = label;
     parent[task’] = task;
   2 tail = TAIL;
   3 wsq[tail] = task’;
   4 FENCE
   5 TAIL = tail + 1;
Fence used: FENCE
12
Class Scope
S-FENCE[class]    class scope
Uses the class construct of OO languages to illustrate the concept
Constrains a fence to the object class where it is used (encapsulation)
Intuition: function members operate on data members of the class
13
Class Scope
S-FENCE[class]    class scope
  class A {
    B b;
    int m1, m2;
    void funcA() {
      m1 = val1;
      b.funcB();
      S-FENCE1[class]
      m2 = val2;
    }
  };
  class B {
    int n1, n2;
    void funcB() {
      n1 = val3;
      S-FENCE2[class]
      n2 = val4;
    }
  };
S-FENCE1 orders: m1, m2, n1, n2
S-FENCE2 orders: n1, n2
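The two scope sets listed on this slide can be reproduced with a small static model. This is only a sketch under an assumption of mine: that a class-scope fence in a member function covers the data members that function accesses, plus those accessed by member functions it calls, transitively. The slides defer the precise semantics to the paper, and the types `Func`, `fence_scope`, and `slide_program` are hypothetical names for this illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical summary of a member function: the data members it accesses
// directly and the member functions it calls.
struct Func {
    std::set<std::string> accesses;
    std::vector<std::string> calls;
};

// Variables ordered by a class-scope fence placed in function `name`:
// everything the function touches directly or through nested calls.
std::set<std::string> fence_scope(const std::map<std::string, Func>& program,
                                  const std::string& name) {
    std::set<std::string> scope;
    std::set<std::string> visited;
    std::vector<std::string> worklist{name};
    while (!worklist.empty()) {
        std::string f = worklist.back();
        worklist.pop_back();
        if (!visited.insert(f).second) continue;  // already expanded
        auto it = program.find(f);
        assert(it != program.end());  // every called function must be described
        scope.insert(it->second.accesses.begin(), it->second.accesses.end());
        worklist.insert(worklist.end(), it->second.calls.begin(),
                        it->second.calls.end());
    }
    return scope;
}

// The example from the slide: funcA touches m1, m2 and calls funcB,
// which touches n1, n2.
std::map<std::string, Func> slide_program() {
    std::map<std::string, Func> program;
    program["funcA"] = Func{{"m1", "m2"}, {"funcB"}};
    program["funcB"] = Func{{"n1", "n2"}, {}};
    return program;
}
```

With this model, a fence in `funcA` yields {m1, m2, n1, n2} and a fence in `funcB` yields {n1, n2}, the same sets the slide gives for S-FENCE1 and S-FENCE2.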
14
Class Scope Semantics
More details in the paper.
15
Parallel Spanning Tree
(a) Worker loop
   1 task = wsq.take();
   2 for (each neighbor task’ of task)
   3   if (task’ is not processed){
   4     process(task’);
   5     wsq.put(task’);
   6   }
(b) Memory operations executed (markers ① ② ③ on the slide; line numbers refer to the work-stealing queue listing)
   8 tail = TAIL - 1;
   9 TAIL = tail;
  10 FENCE
  11 head = HEAD;
     ……
     color[task’] = label;
     parent[task’] = task;
   2 tail = TAIL;
   3 wsq[tail] = task’;
   4 FENCE
   5 TAIL = tail + 1;
Fence used: S-FENCE[class]
16
Compiler Support
ISA extension
  class-fence
  fs_start: start of a fence scope
  fs_end: end of a fence scope
Use fs_start and fs_end to bracket functions that contain fences,
informing the hardware so that it marks memory operations properly.
17
Hardware Support
Fence Scope Bits (FSB)
  Each entry of the ROB and the store buffer is associated with FSB
  Flag whether a memory operation is in the scope of some fence
Decoding: memory operations in the scope are marked via FSB
Fence issue: check the entries for the current scope
[Figure: store buffer and reorder buffer, each entry extended with fence scope bits (FSB)]
19
Hardware Support
Setting Fence Bits
FSS: stack to record scope
[Figure: instructions I0 to I7 with FSB columns 0 to 3; fs_start a and fs_end a delimit the outer scope, fs_start b and fs_end b the inner scope]
21
Hardware Support
Setting Fence Bits
FSS: stack to record scope
Issue Fence by checking FSB on the current scope
[Figure: instructions I0 to I7 with FSB columns 0 to 3; fs_start a and fs_end a delimit the outer scope, fs_start b and fs_end b the inner scope]
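The interplay of the scope stack (FSS) and the fence scope bits (FSB) can be made concrete with a toy software model. This is a sketch of one plausible reading of the slides, not the authors' hardware design. It assumes that each `fs_start` claims a fresh FSB column, that a decoded memory operation is marked in every scope currently on the stack (so inner operations also belong to the outer scope), and that a fence outside any scope behaves as a global fence. The class `ScopeTracker` and its method names other than `fs_start` and `fs_end` are invented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct MemOp {
    std::uint8_t fsb = 0;  // fence scope bits, one bit per scope
    bool done = false;     // has the operation completed?
};

class ScopeTracker {
public:
    // fs_start: enter a scope by pushing a fresh FSB column onto the FSS.
    void fs_start() {
        assert(next_bit_ < 8);  // simplification: columns are never recycled
        fss_.push_back(next_bit_++);
    }
    // fs_end: leave the innermost scope.
    void fs_end() {
        assert(!fss_.empty());
        fss_.pop_back();
    }
    // Decode a memory operation: set its FSB for every scope on the stack.
    std::size_t decode_mem_op() {
        MemOp op;
        for (int bit : fss_) op.fsb |= static_cast<std::uint8_t>(1u << bit);
        window_.push_back(op);
        return window_.size() - 1;
    }
    void complete(std::size_t id) { window_[id].done = true; }

    // A fence may issue once no pending earlier operation has the FSB of the
    // current (innermost) scope set. With an empty stack, every pending
    // operation blocks the fence (global scope).
    bool fence_can_issue() const {
        for (const MemOp& op : window_) {
            if (op.done) continue;
            if (fss_.empty()) return false;
            if (op.fsb & (1u << fss_.back())) return false;
        }
        return true;
    }

private:
    std::vector<int> fss_;       // fence scope stack
    std::vector<MemOp> window_;  // stands in for ROB and store buffer entries
    int next_bit_ = 0;
};
```

In this model, a fence issued inside the inner scope ignores pending operations that were decoded only in the outer scope or outside any scope, which is the effect the slide describes as checking the FSB on the current scope.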
23
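Why Does S-Fence Perform Better?
Instruction sequence: St A; St X; Ld Y; FENCE; St B (St A is a cache miss)
Traditional fence: stalls until the store buffer is drained, and only then is the fence issued.
Scoped fence: compared on the same sequence.
[Figure: timelines of store buffer (SB) and reorder buffer (ROB) contents for the traditional fence and the scoped fence, with stall periods marked]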
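A back-of-the-envelope model shows where the saving comes from. The sketch below is illustrative only: the cycle counts are invented, and it assumes that St A (the cache miss) lies outside the fence's scope while St X and Ld Y lie inside it, which the slide suggests but does not state. The names `PendingOp` and `fence_issue_cycle` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct PendingOp {
    int completes_at;  // cycle at which the operation completes
    bool in_scope;     // would its fence scope bit be set for this fence?
};

// Cycle at which a fence reached at cycle `now` can issue.
// scoped == false models a traditional fence: wait for every earlier op.
// scoped == true models a scoped fence: wait only for in-scope ops.
int fence_issue_cycle(const std::vector<PendingOp>& earlier, int now,
                      bool scoped) {
    assert(now >= 0);
    int ready = now;
    for (const PendingOp& op : earlier) {
        if (!scoped || op.in_scope) ready = std::max(ready, op.completes_at);
    }
    return ready;
}
```

For example, with St A completing at cycle 100 (out of scope), St X at cycle 5 and Ld Y at cycle 6 (both in scope), a fence reached at cycle 3 issues at cycle 100 in the traditional case and at cycle 6 in the scoped case.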
24
Set Scope
Dekker algorithm
Initially flag1 = flag2 = 0
  P1                       P2
  m1 = …                   m2 = …
  flag1 = 1;               flag2 = 1;
  FENCE                    FENCE
  if (flag2 == 0)          if (flag1 == 0)
    critical section         critical section
25
Set Scope
Dekker algorithm
Initially flag1 = flag2 = 0
  P1                                 P2
  m1 = …                             m2 = …
  flag1 = 1;                         flag2 = 1;
  S-FENCE[set, {flag1, flag2}]       S-FENCE …
  if (flag2 == 0)                    if (flag1 == 0)
    critical section                   critical section
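In standard C++ the entry protocol from this slide can be sketched as follows. Note the limits of the sketch: C++ only offers a full store-load fence (`std::atomic_thread_fence` with `memory_order_seq_cst`), so it corresponds to the traditional FENCE version; the set-scoped fence has no standard equivalent and appears only in a comment. The function names are invented for this illustration.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> flag1{0}, flag2{0};

// P1's entry attempt: announce intent, fence, then check the other flag.
bool p1_try_enter() {
    flag1.store(1, std::memory_order_relaxed);
    // Store-load fence. The slide's S-FENCE[set, {flag1, flag2}] would order
    // only the accesses to the two flags; this full fence orders everything.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return flag2.load(std::memory_order_relaxed) == 0;
}

// P2's entry attempt, symmetric to P1.
bool p2_try_enter() {
    flag2.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return flag1.load(std::memory_order_relaxed) == 0;
}

// Withdraw P1's claim after the critical section or a failed attempt.
void p1_leave() {
    assert(flag1.load(std::memory_order_relaxed) == 1);
    flag1.store(0, std::memory_order_relaxed);
}
```

With the fences in place, the two attempts cannot both return true; without them, each store could still sit in a store buffer when the other side loads, and both processors could enter the critical section.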
26
Set Scope
S-FENCE[set, {var1, var2, …}]    set scope
  Only orders memory accesses to {var1, var2, …}
Compiler and Hardware Support
  Flag memory accesses to the specified variables
  Set fence scope bits in hardware for flagged memory accesses
  For simplicity, we do not differentiate memory accesses to different sets
27
Experimental Evaluation
Cycle-accurate simulation (SESC)
  Integrate scoped fence logic
  RMO memory model
Benchmarks
  pst: parallel spanning tree (work-stealing queue, class scope)
  ptc: parallel transitive closure (work-stealing queue, class scope)
  barnes: from SPLASH2 (fences inserted for SC, set scope)
  radiosity: from SPLASH2 (fences inserted for SC, set scope)
28
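Experimental Evaluation
Traditional fence (T) vs. Scoped fence (S)
[Figure: results chart with benchmarks grouped into class scope and set scope; annotations on the chart read "Fence Stall Reduced", "~40-50%", "~13%", and "~50%"]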
29
Conclusion
Introduce the concept of fence scope
Propose class scope and set scope
  OpenCL 2.0 (sub-group, work-group, device, system)
Lightweight compiler and hardware support
  No change in inter-processor communication
Fence scope should be implemented in some form!