Are We Trading Consistency Too Easily? A Case for Sequential Consistency
Madan Musuvathi (Microsoft Research)
Dan Marino, Todd Millstein (UCLA)
Abhay Singh, Satish Narayanasamy (University of Michigan)

MEMORY CONSISTENCY MODEL
Abstracts the program runtime (compiler + hardware):
- Hides compiler transformations
- Hides hardware optimizations, cache hierarchy, ...
Sequential consistency (SC) [Lamport '79]: "The result of any execution is the same as if the operations were executed in some sequential order, and the operations of each individual processor (thread) appear in this sequence in program order."

SEQUENTIAL CONSISTENCY EXPLAINED
int X = F = 0;  // F = 1 implies X is initialized

Thread 1:    Thread 2:
X = 1;       t = F;
F = 1;       u = X;

Under SC the possible outcomes are t=0,u=0; t=0,u=1; and t=1,u=1.
t=1 implies u=1: if Thread 2 observes F = 1, it must also observe X = 1.

CONVENTIONAL WISDOM
SC is slow:
- Disables important compiler optimizations
- Disables important hardware optimizations
Relaxed memory models are faster.

CONVENTIONAL WISDOM
"SC is slow":
- Hardware speculation can hide the cost of SC hardware [Gharachorloo et al. '91, ..., Blundell et al. '09]
- Compiler optimizations that break SC provide negligible performance improvement [PLDI '11]
"Relaxed memory models are faster":
- They need fences for correctness
- Programmers conservatively add more fences than necessary
- Libraries use the strongest fence necessary for all clients
- Fence implementations are slow
- Efficient fence implementations require speculation support

IMPLEMENTING SEQUENTIAL CONSISTENCY EFFICIENTLY

src: t = X;  ==[SC-Preserving Compiler]==>  asm: mov eax, [X]  ==[SC Hardware]==>  execution

SC-Preserving Compiler: every SC behavior of the binary is an SC behavior of the source. (This talk.)
SC Hardware: every observed runtime behavior is an SC behavior of the binary.

CHALLENGE: IMPORTANT COMPILER OPTIMIZATIONS ARE NOT SC-PRESERVING
Example: Common Subexpression Elimination (CSE)
t, u, v are local variables; X, Y are possibly shared.

Before CSE:          After CSE:
L1: t = X*5;         L1: t = X*5;
L2: u = Y;           L2: u = Y;
L3: v = X*5;         L3: v = t;

COMMON SUBEXPRESSION ELIMINATION IS NOT SC-PRESERVING
Init: X = Y = 0;

Thread 1 (before CSE):   Thread 1 (after CSE):   Thread 2:
L1: t = X*5;             L1: t = X*5;            M1: X = 1;
L2: u = Y;               L2: u = Y;              M2: Y = 1;
L3: v = X*5;             L3: v = t;

Before CSE: u == 1 implies v == 5 under SC.
After CSE: u == 1 && v == 0 is possible, because t may hold the old X even though u sees the new Y.

IMPLEMENTING CSE IN AN SC-PRESERVING COMPILER

L1: t = X*5;         L1: t = X*5;
L2: u = Y;     ==>   L2: u = Y;
L3: v = X*5;         L3: v = t;

Enable this transformation when X is a local variable, or Y is a local variable; in these cases the transformation is SC-preserving.
Identifying local variables:
- compiler-generated temporaries
- stack-allocated variables whose address is not taken

AN SC-PRESERVING LLVM COMPILER FOR C PROGRAMS
Modify each of ~70 phases in LLVM to be SC-preserving, without any additional analysis:
- Enable trace-preserving optimizations, which do not change the order of memory operations (e.g. loop unrolling, procedure inlining, control-flow simplification, dead-code elimination, ...)
- Enable transformations on local variables
- Enable transformations involving a single shared variable (e.g. t=X; u=X; v=X;  ==>  t=X; u=t; v=t;)

AVERAGE PERFORMANCE OVERHEAD IS ~2%
Baseline: LLVM -O3
Experiments on an Intel Xeon: 8 cores, 2 threads/core, 6 GB RAM

HOW FAR CAN AN SC-PRESERVING COMPILER GO?

Source:
float s, *x, *y; int i;
s=0;
for( i=0; i<n; i++ ){
  s += (x[i]-y[i]) * (x[i]-y[i]);
}

No optimization (address arithmetic made explicit):
float s, *x, *y; int i;
s=0;
for( i=0; i<n; i++ ){
  s += (*(x + i*sizeof(float)) - *(y + i*sizeof(float))) *
       (*(x + i*sizeof(float)) - *(y + i*sizeof(float)));
}

SC-preserving optimizations (the induction variables are local, so strength reduction is allowed, but the loads of *px and *py are still repeated):
float s, *x, *y;
float *px, *py, *e;
s=0; py=y; e = &x[n];
for( px=x; px<e; px++, py++ ){
  s += (*px-*py) * (*px-*py);
}

Full optimizations (also CSEs the possibly shared loads into t):
float s, *x, *y;
float *px, *py, *e, t;
s=0; py=y; e = &x[n];
for( px=x; px<e; px++, py++ ){
  t = (*px-*py);
  s += t*t;
}

WE CAN REDUCE THE FACESIM OVERHEAD (IF WE CHEAT A BIT)
30% of the overhead comes from the inability to perform CSE in:

return MATRIX_3X3(
  x[0]*A.x[0]+x[3]*A.x[1]+x[6]*A.x[2],
  x[1]*A.x[0]+x[4]*A.x[1]+x[7]*A.x[2],
  x[2]*A.x[0]+x[5]*A.x[1]+x[8]*A.x[2],
  x[0]*A.x[3]+x[3]*A.x[4]+x[6]*A.x[5],
  x[1]*A.x[3]+x[4]*A.x[4]+x[7]*A.x[5],
  x[2]*A.x[3]+x[5]*A.x[4]+x[8]*A.x[5],
  x[0]*A.x[6]+x[3]*A.x[7]+x[6]*A.x[8],
  x[1]*A.x[6]+x[4]*A.x[7]+x[7]*A.x[8],
  x[2]*A.x[6]+x[5]*A.x[7]+x[8]*A.x[8] );

But argument evaluation in C is nondeterministic: the specification explicitly allows overlapped evaluation of function arguments, so the order of these loads is not fixed by the source anyway.

IMPROVING PERFORMANCE OF AN SC-PRESERVING COMPILER
- Request programmers to reduce shared accesses in hot loops
- Use sophisticated static analysis: infer more thread-local variables; infer data-race-free shared variables
- Use program annotations: requires changing the programming language; minimal annotations suffice to optimize the hot loops
- Perform load optimizations speculatively: hardware exposes a speculative-load optimization to the software; load optimizations reduce the max overhead to 6%

CONCLUSION
Hardware should support strong memory models:
- TSO is efficiently implementable [Mark Hill]
- Speculation support for SC over TSO is not currently justifiable
- Can we quantify the programmability cost of TSO?
Compiler optimizations should preserve the hardware memory model.
High-level programming models can abstract TSO/SC:
- Further enable compiler/hardware optimizations
- Improve programmer productivity, testability, and debuggability

EAGER-LOAD OPTIMIZATIONS
Eagerly perform loads, or reuse values from previous loads or stores.

Common Subexpression Elimination:
L1: t = X*5;         L1: t = X*5;
L2: u = Y;     ==>   L2: u = Y;
L3: v = X*5;         L3: v = t;

Constant/copy propagation:
L1: X = 2;           L1: X = 2;
L2: u = Y;     ==>   L2: u = Y;
L3: v = X*5;         L3: v = 10;

Loop-invariant code motion:
L1:                  L1: u = X*5;
L2: for(...)   ==>   L2: for(...)
L3:   t = X*5;       L3:   t = u;

PERFORMANCE OVERHEAD
Allowing eager-load optimizations alone reduces the max overhead to 6%.

CORRECTNESS CRITERIA FOR EAGER-LOAD OPTIMIZATIONS
Eager-load optimizations rely on a variable remaining unmodified in a region of code.

L1: t = X*5;   // enables the invariant t == X*5
L2: *p = q;    // must maintain the invariant t == X*5
L3: v = X*5;   // uses the invariant to transform L3 into v = t;

Sequential validity: no modifications of X by the current thread in L1-L3.
SC-preservation: no modifications of X by any other thread in L1-L3.

SPECULATIVELY PERFORMING EAGER-LOAD OPTIMIZATIONS

L1: t = X*5;         L1: t = monitor.load(X, tag) * 5;
L2: u = Y;     ==>   L2: u = Y;
L3: v = X*5;         L3: v = t;
                     C4: if (interference.check(tag))
                     C5:   v = X*5;

On monitor.load, the hardware starts tracking coherence messages on X's cache line. The interference check fails if X's cache line has been downgraded since the monitor.load. In our implementation, a single instruction checks interference on up to 32 tags.