Fence Scoping
Changhui Lin†, Vijay Nagarajan*, Rajiv Gupta†
† University of California, Riverside
* University of Edinburgh


Reordering in Uniprocessors
- Memory operations are reordered to improve performance
  - Hardware (e.g., store buffer, reorder buffer)
  - Compiler (e.g., code motion, caching a value in a register)
- No harm as long as dependences are respected
- Figure: a1: St x and a2: Ld y, shown in program order and reordered

Reordering in Multiprocessors
Initially x = y = 0
  P1: a1: x = 1;   a2: y = 1;
  P2: b1: Ry = y;  b2: Rx = x;
- Intuitively, y = 1 ⇒ x = 1, so Ry = 1 ⇒ Rx = 1
- Interleavings that keep program order:
  - a1, a2, b1, b2 gives (Rx=1, Ry=1)
  - a1, b1, b2, a2 gives (Rx=1, Ry=0)
  - b1, b2, a1, a2 gives (Rx=0, Ry=0)
- With reordering, e.g. b2, a1, a2, b1 gives (Rx=0, Ry=1): counter-intuitive program behavior
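The outcomes listed on this slide can be checked mechanically. The sketch below is not from the presentation: it enumerates every global order of the four operations and records the resulting (Rx, Ry) pairs, once keeping each thread's program order and once allowing any order. The names `Op`, `run`, and `outcomes` are illustrative assumptions.

```cpp
#include <algorithm>
#include <array>
#include <set>
#include <utility>

// The four operations of the litmus test (names are illustrative).
enum class Op { StoreX, StoreY, LoadY, LoadX };  // a1, a2, b1, b2

// Execute one global order of the operations and return (Rx, Ry).
std::pair<int, int> run(const std::array<Op, 4>& order) {
    int x = 0, y = 0, rx = 0, ry = 0;
    for (Op op : order) {
        switch (op) {
            case Op::StoreX: x = 1;  break;  // a1: x = 1
            case Op::StoreY: y = 1;  break;  // a2: y = 1
            case Op::LoadY:  ry = y; break;  // b1: Ry = y
            case Op::LoadX:  rx = x; break;  // b2: Rx = x
        }
    }
    return {rx, ry};
}

// Position of one operation within a global order.
static int pos(const std::array<Op, 4>& order, Op op) {
    return static_cast<int>(std::find(order.begin(), order.end(), op) - order.begin());
}

// Collect all (Rx, Ry) outcomes. If keep_program_order is true, only orders
// with a1 before a2 and b1 before b2 are considered.
std::set<std::pair<int, int>> outcomes(bool keep_program_order) {
    std::array<Op, 4> order = {Op::StoreX, Op::StoreY, Op::LoadY, Op::LoadX};
    std::sort(order.begin(), order.end());
    std::set<std::pair<int, int>> result;
    do {
        bool in_order = pos(order, Op::StoreX) < pos(order, Op::StoreY) &&
                        pos(order, Op::LoadY) < pos(order, Op::LoadX);
        if (in_order || !keep_program_order) result.insert(run(order));
    } while (std::next_permutation(order.begin(), order.end()));
    return result;
}
```

Calling `outcomes(true)` yields only the three intuitive results, while `outcomes(false)` also contains (Rx=0, Ry=1). This is a toy model of reordering, not a model of any particular memory consistency model.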

Reordering in Multiprocessors
Initially p = NULL, flag = false
  P1: p = new A(…);  flag = true;
  P2: if (flag) a = p->var;
- flag is supposed to be set after p is allocated
- Reordering leads to counter-intuitive program behavior

Fence Instructions
- Memory consistency models
  - Specify what reordering is allowed
  - e.g., SC, TSO (x86, SPARC), RMO (ARM, PowerPC)
- Fence instructions (fences / memory barriers)
  - Selectively override the default relaxed memory order
  - Order memory operations before and after the fence
  P1: p = new A(…);  FENCE;  flag = true;

Fence Instructions (continued)
- Inevitable: needed when building concurrent implementations (e.g., mutual exclusion, queues) [Attiya et al., POPL'11]
- Expensive: Cilk-5's THE protocol spends 50% of its time executing a memory fence [Frigo et al., PLDI'98]
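The p/flag example can be written with standard C++ fences. This is a sketch under assumptions, not code from the presentation: `flag` is made a `std::atomic<bool>` because C++ forbids data races on plain variables, the consumer spins on `flag` instead of testing it once so that the example always reads a value, and the helper name `publish_and_read` is hypothetical. The fences here are traditional (unscoped) fences.

```cpp
#include <atomic>
#include <thread>

struct A { int var; };

// Publish an object through a flag and read it from a second thread.
// Returns the value the consumer observed in p->var.
int publish_and_read(int value) {
    A* p = nullptr;                 // initially p = NULL
    std::atomic<bool> flag{false};  // initially flag = false
    int seen = -1;

    std::thread producer([&] {
        p = new A{value};
        // FENCE: the store to p must be ordered before the store to flag.
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(true, std::memory_order_relaxed);
    });

    std::thread consumer([&] {
        // Assumption: spin until the flag is set (the slide uses a single if).
        while (!flag.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Matching fence on the reader side: the load of flag must be
        // ordered before the load of p.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = p->var;
    });

    producer.join();
    consumer.join();
    delete p;
    return seen;
}
```

Without the two fences, the relaxed accesses to `flag` would not order the accesses to `p`, which is the counter-intuitive behavior the earlier slide describes.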

Motivation
- Not all memory orderings enforced by fences are necessary
- Fences are usually used to order some specific memory operations
- Programmers know better how a fence is used, and this knowledge can be conveyed to the hardware
- (Figure labels: Process Data Control Data Access Concurrent algorithm)

Scoped Fence (S-Fence)
- An S-Fence only orders memory operations in its scope
  - Scope definition (class scope, set scope)
- Bridges the gap between programmers' intention and hardware execution
  - Programmers specify the scope
  - Scope information is conveyed to hardware, imposing fewer ordering constraints
- Lightweight hardware and compiler support

Scoped Fence (S-Fence): Programming support
  S-FENCE                            global scope
  S-FENCE[class]                     class scope
  S-FENCE[set, {var1, var2, …}]      set scope

Work-Stealing Queue Algorithm
Chase-Lev lock-free concurrent work-stealing queue
   1 void put (TASK task){
   2   tail = TAIL;
   3   wsq[tail] = task;
   4   FENCE              // store-store
   5   TAIL = tail+1;
   6 }
   7 TASK take ( ){
   8   tail = TAIL - 1;
   9   TAIL = tail;
  10   FENCE              // store-load
  11   head = HEAD;
  12   if (tail<head){
  13     TAIL = head;
  14     return EMPTY;
  15   }
       … …
  24   return task;
  25 }
  26 TASK steal ( ){
  27   head = HEAD;
  28   tail = TAIL;
       … …
  35   return task;
  36 }
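For readers who want something compilable, here is a minimal fixed-capacity sketch in the style of the Chase-Lev deque, written with C++11 atomics. It is not the code elided on the slide: the bodies of `take` and `steal` past the visible lines, the `EMPTY` and `ABORT` sentinels, the capacity of 64, the `int` task type, and the class name are all assumptions. The two fences correspond to the store-store fence in `put` and the store-load fence in `take`.

```cpp
#include <atomic>

constexpr int EMPTY = -1;  // assumed sentinel: no task available
constexpr int ABORT = -2;  // assumed sentinel: steal lost a race

class WorkStealingQueue {
    static constexpr long kCapacity = 64;  // assumption: no resizing
    std::atomic<long> head_{0};            // HEAD: next index to steal
    std::atomic<long> tail_{0};            // TAIL: next free slot (owner side)
    std::atomic<int> wsq_[kCapacity];

public:
    // Owner only. Returns false if the queue is full.
    bool put(int task) {
        long tail = tail_.load(std::memory_order_relaxed);
        long head = head_.load(std::memory_order_acquire);
        if (tail - head >= kCapacity) return false;
        wsq_[tail % kCapacity].store(task, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);  // store-store
        tail_.store(tail + 1, std::memory_order_relaxed);
        return true;
    }

    // Owner only. Takes from the tail.
    int take() {
        long tail = tail_.load(std::memory_order_relaxed) - 1;
        tail_.store(tail, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // store-load
        long head = head_.load(std::memory_order_relaxed);
        if (tail < head) {  // queue was empty: undo the decrement
            tail_.store(head, std::memory_order_relaxed);
            return EMPTY;
        }
        int task = wsq_[tail % kCapacity].load(std::memory_order_relaxed);
        if (tail > head) return task;  // more than one element: no race
        // Last element: compete with thieves for it.
        if (!head_.compare_exchange_strong(head, head + 1,
                                           std::memory_order_seq_cst,
                                           std::memory_order_relaxed)) {
            task = EMPTY;  // a thief got it first
        }
        tail_.store(tail + 1, std::memory_order_relaxed);
        return task;
    }

    // Any thread. Steals from the head.
    int steal() {
        long head = head_.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        long tail = tail_.load(std::memory_order_acquire);
        if (head >= tail) return EMPTY;
        int task = wsq_[head % kCapacity].load(std::memory_order_relaxed);
        if (!head_.compare_exchange_strong(head, head + 1,
                                           std::memory_order_seq_cst,
                                           std::memory_order_relaxed)) {
            return ABORT;  // lost the race; caller may retry
        }
        return task;
    }
};
```

The owner pushes and pops at the tail while thieves remove from the head, so after `put(1)`, `put(2)`, `put(3)` a `take()` returns 3 and a `steal()` returns 1.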

Parallel Spanning Tree
(a) Main loop:
   1 task = wsq.take();
   2 for (each neighbor task’ of task)
   3   if (task’ is not processed){
   4     process(task’);
   5     wsq.put(task’);
   6   }
(b) Memory operations performed, using the line numbers of the work-stealing queue:
   ① take:     8 tail = TAIL - 1;  9 TAIL = tail;  10 FENCE;  11 head = HEAD;  ……
   ② process:  color[task’] = label;  parent[task’] = task;
   ③ put:      2 tail = TAIL;  3 wsq[tail] = task’;  4 FENCE;  5 TAIL = tail + 1;
Fence type shown: FENCE

Class Scope
  S-FENCE[class]    class scope
- Uses classes in OO languages to illustrate the concept
- Constrains a fence to the object class where it is used (encapsulation)
- Intuition: member functions operate on the data members of the class

Class Scope
  S-FENCE[class]    class scope

  class A {
    B b;
    int m1, m2;
    void funcA() {
      m1 = val1;
      b.funcB();
      S-FENCE1[class]
      m2 = val2;
    }
  }

  class B {
    int n1, n2;
    void funcB() {
      n1 = val3;
      S-FENCE2[class]
      n2 = val4;
    }
  }

Variables in scope:
  S-FENCE1: m1, m2, n1, n2
  S-FENCE2: n1, n2
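One way to read the two scope lists is that a class-scoped fence covers the data members touched by the class's own member functions plus everything touched by the member functions they call. The sketch below encodes only that reading as a toy static model. It is an assumption for illustration, not the semantics defined in the paper, and the names `ClassInfo`, `callees`, and `fence_scope` are hypothetical. It also assumes the call relation between classes has no cycles.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy description of a class: its own data members and the classes whose
// member functions are invoked from its member functions.
struct ClassInfo {
    std::set<std::string> data_members;
    std::vector<std::string> callees;
};

// Variables covered by an S-FENCE[class] placed in class `cls`, under the
// assumed reading: own data members plus the scope of every callee class.
std::set<std::string> fence_scope(const std::map<std::string, ClassInfo>& classes,
                                  const std::string& cls) {
    std::set<std::string> scope;
    auto it = classes.find(cls);
    if (it == classes.end()) return scope;
    scope = it->second.data_members;
    for (const std::string& callee : it->second.callees) {
        std::set<std::string> nested = fence_scope(classes, callee);
        scope.insert(nested.begin(), nested.end());
    }
    return scope;
}
```

Describing class A with members {m1, m2} and callee B, and class B with members {n1, n2}, reproduces the two lists from the slide.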

Class Scope Semantics
- More details in the paper

Parallel Spanning Tree (with class scope)
(a) Main loop:
   1 task = wsq.take();
   2 for (each neighbor task’ of task)
   3   if (task’ is not processed){
   4     process(task’);
   5     wsq.put(task’);
   6   }
(b) Memory operations performed, using the line numbers of the work-stealing queue:
   ① take:     8 tail = TAIL - 1;  9 TAIL = tail;  10 FENCE;  11 head = HEAD;  ……
   ② process:  color[task’] = label;  parent[task’] = task;
   ③ put:      2 tail = TAIL;  3 wsq[tail] = task’;  4 FENCE;  5 TAIL = tail + 1;
Fence type shown: S-FENCE[class]

Compiler Support
- ISA extension
  - class-fence
  - fs_start: start of a fence scope
  - fs_end: end of a fence scope
- Use fs_start and fs_end to enclose functions containing fences
- This informs the hardware so that it can mark memory operations properly

Hardware Support
- Fence Scope Bits (FSB)
  - Each entry of the ROB and store buffer is associated with FSB
  - Flag whether a memory operation is in the scope of some fence
- Decoding: memory operations in the scope are marked via FSB
- Fence issue: check the entry for the current scope
- Figure: store buffer and reorder buffer entries, each extended with fence scope bits


Hardware Support
- Setting fence bits
- FSS: stack to record scope
- Figure: nested scopes (outer: fs_start a … fs_end a; inner: fs_start b … fs_end b) over instructions I0 to I7, each with its FSB


Hardware Support
- Setting fence bits
- FSS: stack to record scope
- Issue a fence by checking FSB on the current scope
- Figure: nested scopes (outer: fs_start a … fs_end a; inner: fs_start b … fs_end b) over instructions I0 to I7, each with its FSB

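The interplay of FSS and FSB can be illustrated with a small software model. The sketch below rests on assumptions that the slides do not spell out: one FSB bit per nesting level of the FSS, a memory operation decoded inside nested scopes gets the bit of every open scope, and a fence issued outside any scope behaves like a traditional fence. Bits of closed scopes are never cleared here, which can only over-order. The class and method names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class FenceScopeModel {
    struct MemOp {
        std::uint32_t fsb;  // fence scope bits, bit i = nesting level i
        bool done;          // has the operation completed?
    };
    std::vector<char> fss_;   // FSS: open scopes, outermost first
    std::vector<MemOp> ops_;  // stand-in for ROB / store buffer entries

public:
    void fs_start(char scope) { fss_.push_back(scope); }
    void fs_end() { if (!fss_.empty()) fss_.pop_back(); }

    // Decode a memory operation: set one FSB bit per currently open scope.
    // Returns an index that identifies the operation. Assumes fewer than 32
    // nested scopes.
    std::size_t decode_mem_op() {
        std::uint32_t fsb = (1u << fss_.size()) - 1u;
        ops_.push_back({fsb, false});
        return ops_.size() - 1;
    }

    void complete(std::size_t op) { ops_[op].done = true; }

    // A fence in the innermost open scope waits only for pending operations
    // whose FSB has that scope's bit set. With no open scope it waits for
    // every pending operation.
    bool fence_can_issue() const {
        for (const MemOp& op : ops_) {
            if (op.done) continue;
            if (fss_.empty()) return false;
            std::uint32_t bit = 1u << (fss_.size() - 1);
            if (op.fsb & bit) return false;
        }
        return true;
    }
};
```

In this model a fence inside the inner scope b ignores pending operations decoded outside b, while a fence in the outer scope a still waits for operations from both a and b.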

Why Does S-Fence Perform Better?
- Instruction sequence: St A; St X; Ld Y; FENCE; St B (St A is a cache miss)
- Traditional fence: stalls until the store buffer is drained, and only then is the fence issued
- Figure: timelines for the traditional fence and the scoped fence, showing store buffer (SB) and ROB contents and the stall periods
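The intuition behind the timeline can be put into a few lines of arithmetic: a fence stalls until the slowest operation it has to wait for completes, and a scoped fence skips operations outside its scope. The sketch below is a simplified model, not the simulator used in the evaluation. The struct `PendingOp`, the function `fence_stall_cycles`, the latencies, and the choice of which operations are in scope are assumptions made for illustration.

```cpp
#include <algorithm>
#include <vector>

struct PendingOp {
    int cycles_left;  // cycles until the operation completes
    bool in_scope;    // would its fence scope bit be set for this fence?
};

// Cycles a fence must stall: the longest remaining latency among the
// operations it has to wait for. A scoped fence ignores out-of-scope ones.
int fence_stall_cycles(const std::vector<PendingOp>& pending, bool scoped) {
    int stall = 0;
    for (const PendingOp& op : pending) {
        if (scoped && !op.in_scope) continue;
        stall = std::max(stall, op.cycles_left);
    }
    return stall;
}
```

With a hypothetical 200-cycle cache miss for St A outside the scope and short in-scope operations St X and Ld Y, the traditional fence stalls for 200 cycles and the scoped fence only for the in-scope latency.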

Set Scope: Dekker algorithm
Initially flag1 = flag2 = 0
  P1: flag1 = 1;  FENCE;  if (flag2 == 0) critical section
  P2: flag2 = 1;  FENCE;  if (flag1 == 0) critical section
Other memory accesses shown: m1 = …, m2 = …

Set Scope: Dekker algorithm with a set-scoped fence
Initially flag1 = flag2 = 0
  P1: flag1 = 1;  S-FENCE[set, {flag1, flag2}];  if (flag2 == 0) critical section
  P2: flag2 = 1;  S-FENCE[set, {flag1, flag2}];  if (flag1 == 0) critical section
Other memory accesses shown: m1 = …, m2 = …

Set Scope
  S-FENCE[set, {var1, var2, …}]    set scope
- Only orders memory accesses to {var1, var2, …}
- Compiler and hardware support
  - Flag memory accesses to the specified variables
  - Set fence scope bits in hardware for flagged memory accesses
- For simplicity, memory accesses to different sets are not differentiated
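Standard C++ has no scoped fence, but the Dekker-style entry protocol from the slides can be written with a traditional full fence to show which ordering the set scope has to preserve. In this sketch each thread makes a single entry attempt. The helper names `both_entered_once` and `violations` are hypothetical, and the flags are `std::atomic<int>` because C++ forbids data races on plain variables. An S-FENCE[set, {flag1, flag2}] would need to enforce only the store-to-load order between the two flags that the full fence enforces here.

```cpp
#include <atomic>
#include <thread>

// One round of the protocol: each thread raises its flag, executes a full
// fence, then checks the other flag. Returns true if both threads entered
// the critical section, which the fences are meant to rule out.
bool both_entered_once() {
    std::atomic<int> flag1{0}, flag2{0};
    bool in1 = false, in2 = false;

    std::thread p1([&] {
        flag1.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // FENCE
        in1 = (flag2.load(std::memory_order_relaxed) == 0);
    });
    std::thread p2([&] {
        flag2.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // FENCE
        in2 = (flag1.load(std::memory_order_relaxed) == 0);
    });

    p1.join();
    p2.join();
    return in1 && in2;
}

// Repeat the round and count how often mutual exclusion was violated.
int violations(int rounds) {
    int count = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_entered_once()) ++count;
    }
    return count;
}
```

With the two sequentially consistent fences in place, at least one thread must observe the other's flag, so `violations` should always return 0. Removing the fences would permit both loads to return 0 on relaxed hardware.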

Experimental Evaluation
- Cycle-accurate simulation (SESC)
  - Integrated scoped fence logic
  - RMO memory model
- Benchmarks
  - pst: parallel spanning tree (work-stealing queue, class scope)
  - ptc: parallel transitive closure (work-stealing queue, class scope)
  - barnes: from SPLASH2 (fences inserted for SC, set scope)
  - radiosity: from SPLASH2 (fences inserted for SC, set scope)

Experimental Evaluation
- Traditional fence (T) vs. scoped fence (S)
- Figure: fence stall reduction per benchmark, grouped into class scope and set scope; the chart annotations read ~40-50%, ~13%, and ~50%

Conclusion
- Introduce the concept of fence scope
- Propose class scope and set scope
  - OpenCL 2.0 (sub-group, work-group, device, system)
- Lightweight compiler and hardware support
  - No change in inter-processor communication
- Fence scope should be implemented in some form!
