Detecting and Eliminating Potential Violation of Sequential Consistency for concurrent C/C++ program Duan Yuelu, Feng Xiaobing, Pen-chung Yew.



Outline
- Motivation
- Approach & Implementation
- Results
- Related Work
- Conclusion

Motivation
Programmers develop “low-lock” code for better performance:
- Locks are expensive.
- Data races are deliberately employed.
- Such code requires a sequentially consistent (SC) memory model.
Such code might fail under relaxed consistency (RC) models.
- E.g., Double-Checked Locking (DCL) for a lazily initialized singleton.

Example 1 (a): Lazily initialized singleton

    Object::Object() { this->field = 100; }

    // Works only for a single thread.
    Object* Object::getInstance() {
        if (!_instance)
            _instance = new Object();
        return _instance;
    }

    // Works for multiple threads, but is expensive...
    Object* Object::getInstance() {
        lock(l);
        if (!_instance)
            _instance = new Object();
        unlock(l);
        return _instance;
    }

    void Object::useInstance() {
        Object* ins = Object::getInstance();
        int f = ins->getField();
    }

(b): Double-Checked Locking for a lazily initialized singleton

    Object* Object::getInstance() {
        if (!_instance) {
            lock(l);
            if (!_instance)
                _instance = new Object();
            unlock(l);
        }
        return _instance;
    }

If the architecture is SC, this works correctly and performs better than the locked version in (a). But what happens on RC models that allow write-write reordering?

A possible execution interleaving… correct!

Initializer Thread (T1):

    Object* Object::getInstance() {
        if (!_instance) {
            lock(l);
            if (!_instance) {
                temp = malloc(..);
                A1: temp->field = 100;
                A2: _instance = temp;
            }
            unlock(l);
        }
        return _instance;
    }

Reader Thread (T2):

    B1: if (!_instance) {…}
    …
    B2: read _instance->field;

Data races are employed, since these accesses are improperly synchronized.

But what if the two writes are reordered?

Initializer Thread (T1):

    Object* Object::getInstance() {
        if (!_instance) {
            lock(l);
            if (!_instance) {
                temp = malloc(..);
                A2: _instance = temp;      // reordered ahead of A1
                A1: temp->field = 100;
            }
            …

Reader Thread (T2):

    B1: if (!_instance) {…}
    …
    B2: read _instance->field;

T2 gets an uninitialized value of _instance->field, which violates sequential consistency.

Bug pattern: Potential Violation of Sequential Consistency (PVSC)
- So named because these defects might cause an SC violation.
How to detect and eliminate PVSC bugs?
- Basically, we combine Shasha and Snir's conflict graph and delay set theory with an existing data race detection scheme.


Our scheme
(1) Construct the race graph.
(2) Find cycles in it.
    - A cycle in the race graph corresponds to a PVSC bug.
(3) Compute the delay set.
(4) Insert memory ordering fences.

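Constructing the Race Graph
For all the instructions executed in a particular execution of a program P:
- Add a program order edge between the instructions of each thread.
- Add a race edge for each data race.
[Figure: Thread 1 executes wr a, then wr b; Thread 2 executes rd b, then rd a. Program order edges connect the accesses within each thread; race edges connect the conflicting accesses to a and to b.]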

Example 1. Race Graph for DCL

Initializer thread:

    …
    lock(l);
    if (!_instance) {
        temp = malloc(..);
        temp->field = 100;     // A: wr a
        _instance = temp;      // B: wr b
    }
    unlock(l);
    }

Reader thread:

    if (!_instance) {…}        // C: rd b
    …
    read _instance->field;     // D: rd a
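The construction rules can be sketched in C++ as follows. This is only an illustration under stated assumptions: the `Access` and `RaceGraph` types and the functions `build_race_graph` and `dcl_example` are hypothetical names, not part of the presented tool, and every pair of conflicting accesses from different threads is treated as a race. A real implementation would take the race pairs from its data race detector instead.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one executed memory access.
struct Access {
    int thread;       // id of the executing thread (assumed non-negative)
    std::string var;  // name of the shared location
    bool is_write;    // true for a write, false for a read
};

struct RaceGraph {
    std::vector<Access> nodes;
    std::vector<std::pair<int, int>> program_order;  // directed: first -> second
    std::vector<std::pair<int, int>> races;          // undirected
};

// Builds a race graph from a trace that lists each thread's accesses in
// program order. Assumption: any two accesses to the same location from
// different threads, at least one of them a write, count as a data race.
RaceGraph build_race_graph(const std::vector<Access>& trace) {
    RaceGraph g;
    g.nodes = trace;
    const int n = static_cast<int>(trace.size());
    for (int i = 0; i < n; ++i) {
        assert(trace[i].thread >= 0);
        // Program order edge to the next access of the same thread.
        for (int j = i + 1; j < n; ++j) {
            if (trace[j].thread == trace[i].thread) {
                g.program_order.emplace_back(i, j);
                break;
            }
        }
        // Race edges to conflicting accesses of other threads.
        for (int j = i + 1; j < n; ++j) {
            if (trace[j].thread != trace[i].thread &&
                trace[j].var == trace[i].var &&
                (trace[i].is_write || trace[j].is_write)) {
                g.races.emplace_back(i, j);
            }
        }
    }
    return g;
}

// The DCL example: nodes 0..3 stand for the accesses A..D.
RaceGraph dcl_example() {
    return build_race_graph({
        {1, "field", true},       // A: temp->field = 100
        {1, "_instance", true},   // B: _instance = temp
        {2, "_instance", false},  // C: if (!_instance)
        {2, "field", false},      // D: read _instance->field
    });
}
```

For the DCL trace this yields the program order edges A->B and C->D and the race edges A–D and B–C, which together form the cycle discussed on the next slide.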

Find cycles in the race graph
Theorem 1. A cycle in the race graph corresponds to a PVSC bug.
- Proof: If a cycle is found in the race graph, then it is possible to get a non-sequentially-consistent execution by letting the race order be consistent with the cycle. E.g., we can get a non-SC execution E = {B->C, D->A} from the cycle A->B->C->D->A in the previous example.

Compute the delay set
Delay lemma: Any execution should be consistent with a delay set D. [Shasha/Snir]
Theorem 2. Let D be the delay set that contains all the program order edges of the race cycles in the race graph. Then D enforces sequential consistency for the executions that generate D.
- Proof: Omitted.
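A minimal sketch of how such a delay set could be computed, assuming the race graph is given as edge lists over node indices 0..n-1. The names `Edge`, `mark_reachable` and `delay_set` are hypothetical. The sketch relies on one observation: a program order edge u->v lies on a race cycle exactly when u is reachable again from v, with race edges traversable in both directions. This is a conservative reading of Theorem 2, not necessarily the procedure used in the presented tool.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // (from, to) node indices

// Depth-first search that marks every node reachable from u.
void mark_reachable(const std::vector<std::vector<int>>& adj, int u,
                    std::vector<bool>& seen) {
    seen[u] = true;
    for (int v : adj[u]) {
        if (!seen[v]) mark_reachable(adj, v, seen);
    }
}

// Returns the program order edges that lie on some cycle of the race graph.
// A non-empty result also means that the graph contains a race cycle.
std::vector<Edge> delay_set(int n, const std::vector<Edge>& program_order,
                            const std::vector<Edge>& races) {
    assert(n >= 0);
    std::vector<std::vector<int>> adj(n);
    for (const Edge& e : program_order) adj[e.first].push_back(e.second);
    for (const Edge& e : races) {  // race edges have no fixed direction
        adj[e.first].push_back(e.second);
        adj[e.second].push_back(e.first);
    }
    std::vector<Edge> delays;
    for (const Edge& e : program_order) {
        std::vector<bool> seen(n, false);
        mark_reachable(adj, e.second, seen);
        if (seen[e.first]) delays.push_back(e);  // path e.second -> ... -> e.first exists
    }
    return delays;
}
```

With the DCL graph (nodes A..D as 0..3, program order edges 0->1 and 2->3, race edges 0–3 and 1–2) both program order edges end up in the delay set. If the reader thread instead reads a before b (race edges 0–2 and 1–3), the result is empty.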

Insert memory ordering fences
A fence instruction delays the issue of an instruction until all previous instructions have completed.
Insert a fence for each delay in D. Then D is enforced, and the detected PVSC bugs are eliminated.
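As a rough illustration of the insertion step, the sketch below takes one thread's instruction listing and the delays that fall inside that thread, and places a `Fence();` line directly before the second instruction of each delay. The function name `insert_fences` and the textual representation of instructions are assumptions made for this example. Placing the fence immediately before the delayed instruction is just one valid choice; any position between the two instructions would enforce the delay.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// code:   one thread's instructions in program order.
// delays: pairs (u, v) of indices into code, u < v, meaning that
//         instruction v must not be issued before instruction u completes.
std::vector<std::string> insert_fences(
        const std::vector<std::string>& code,
        const std::vector<std::pair<int, int>>& delays) {
    const int n = static_cast<int>(code.size());
    std::set<int> fence_before;  // instructions that get a fence in front
    for (const auto& d : delays) {
        assert(0 <= d.first && d.first < d.second && d.second < n);
        fence_before.insert(d.second);  // the set removes duplicate fences
    }
    std::vector<std::string> out;
    for (int i = 0; i < n; ++i) {
        if (fence_before.count(i) != 0) out.push_back("Fence();");
        out.push_back(code[i]);
    }
    return out;
}
```

Applied to the initializer thread of DCL with the single delay between `temp->field = 100;` and `_instance = temp;`, the fence lands between the two writes.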

Examples for the above 3 steps…
[Figure: Thread 1 executes wr a, then wr b; Thread 2 executes rd a, then rd b.]
Fig. 1: No cycles, no PVSC, no fence is needed. (This implies that any execution on RC is sequentially consistent, so we don't need fences.)

Initially a = b = 0
Thread 1: A: a = 1
Thread 2: B: if (a)  C: b = 1
Thread 3: D: if (b)  E: R1 = a
Fig. 2: Contains a cycle A->B->C->D->E->A, so there is a PVSC. It is possible to get the execution {A->B, C->D, E->A}, which violates SC and results in {a=1, b=1, R1=0}. If we insert fences on the program order edges of the cycle (between B and C, and between D and E), then the PVSC is eliminated.

Fig. 3: Corrected version of DCL for a lazily initialized singleton.

    Object* getInstance() {
        Object* tmp = _instance;
        Fence();
        if (!tmp) {
            lock(l);
            tmp = _instance;
            if (!tmp)
                tmp = new Object();
            Fence();
            _instance = tmp;
            unlock(l);
        }
        return _instance;
    }
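For comparison, the same fence placement can be written with the standard C++11 atomics library. This is a sketch, not the code from the slides: `std::atomic_thread_fence` stands in for the slide's `Fence()`, `std::mutex` stands in for `lock(l)`/`unlock(l)`, and the member names `instance_`, `lock_` and `field_` are made up for the example. The acquire and release fences used here are weaker than a full hardware fence, but they are sufficient for this pattern under the C++11 memory model.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

class Object {
public:
    static Object* getInstance();
    int getField() const { return field_; }

private:
    Object() : field_(100) {}
    int field_;
    static std::atomic<Object*> instance_;
    static std::mutex lock_;
};

std::atomic<Object*> Object::instance_{nullptr};
std::mutex Object::lock_;

Object* Object::getInstance() {
    Object* tmp = instance_.load(std::memory_order_relaxed);
    // Reader-side fence: later reads through tmp cannot be reordered
    // before the load of instance_.
    std::atomic_thread_fence(std::memory_order_acquire);
    if (tmp == nullptr) {
        std::lock_guard<std::mutex> guard(lock_);
        tmp = instance_.load(std::memory_order_relaxed);
        if (tmp == nullptr) {
            tmp = new Object();  // intentionally never deleted in this sketch
            // Writer-side fence: the constructor's writes cannot be
            // reordered after the store that publishes the pointer.
            std::atomic_thread_fence(std::memory_order_release);
            instance_.store(tmp, std::memory_order_relaxed);
        }
    }
    assert(tmp != nullptr);
    return tmp;
}
```

The two fences correspond to the two delays of the DCL race cycle: the release fence orders the write of the field before the write of the pointer, and the acquire fence orders the read of the pointer before the read of the field.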

Optimization
To handle real-world applications with
- long execution times
- many threads
we convert the race graph into a PC race graph:
- Combine nodes with the same PC into one node. The graph then contains N nodes, where N is the number of racing access instructions.
- Apply an SCC (strongly connected component) algorithm to the PC race graph. Each SCC corresponds to a PVSC bug.
- This can introduce false negatives.
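The node-merging step might look roughly like the sketch below. It assumes each dynamic node carries the program counter of its instruction; `PcGraph` and `merge_by_pc` are hypothetical names. Dropping program order self-loops (the same PC executed twice in a row) is a simplification chosen for this example and not something the slides specify. An SCC algorithm such as Tarjan's could then be run on the merged graph.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

struct PcGraph {
    std::vector<unsigned long> pcs;  // one node per distinct PC
    std::set<Edge> program_order;    // directed edges between merged nodes
    std::set<Edge> races;            // undirected, stored as (min, max)
};

// node_pc[i] is the program counter of dynamic node i; the edge lists
// refer to dynamic node indices.
PcGraph merge_by_pc(const std::vector<unsigned long>& node_pc,
                    const std::vector<Edge>& program_order,
                    const std::vector<Edge>& races) {
    PcGraph g;
    std::map<unsigned long, int> index_of;  // PC -> merged node index
    std::vector<int> merged(node_pc.size());
    for (std::size_t i = 0; i < node_pc.size(); ++i) {
        auto it = index_of.find(node_pc[i]);
        if (it == index_of.end()) {
            it = index_of.emplace(node_pc[i], static_cast<int>(g.pcs.size())).first;
            g.pcs.push_back(node_pc[i]);
        }
        merged[i] = it->second;
    }
    const int n = static_cast<int>(merged.size());
    for (const Edge& e : program_order) {
        assert(0 <= e.first && e.first < n && 0 <= e.second && e.second < n);
        const int u = merged[e.first];
        const int v = merged[e.second];
        if (u != v) g.program_order.insert(Edge(u, v));  // assumption: drop self-loops
    }
    for (const Edge& e : races) {
        assert(0 <= e.first && e.first < n && 0 <= e.second && e.second < n);
        const int u = merged[e.first];
        const int v = merged[e.second];
        g.races.insert(Edge(std::min(u, v), std::max(u, v)));
    }
    return g;
}
```

Because the edge sets deduplicate, a loop that executes the same racing instructions many times contributes a fixed number of nodes and edges, which is what keeps the graph size bounded by the number of racing instructions rather than by the execution length.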


Results
- Detected PVSC bugs
- Performance loss after fence insertion
- Cost of PVSC detection over race detection

Part of detected bugs
- MySQL 5.0.x, sql/slave.c, handle_slave_io(): Assertion failure during slave shutdown. mi->slave_running=0 could become visible to other threads before the cleanup is completed, which triggers an assertion during slave shutdown.
- httpd 2.2.x, modules/cache/mod_cache.c, cache_store_content(): The effects of store_header() might become visible to other threads before those of store_body(), so mod_cache might serve old content even though new content has been fetched.
- httpd 2.2.x, prefork/prefork.c, ap_mpm_run(): restart_pending = shutdown_pending = 0; might become visible to child threads after set_signals(), so if httpd receives SIGTERM while child processes are being spawned, the signal is ignored.

Performance loss of SPLASH-2 Figure 10: Performance on Intel Itanium SMP

Cost over data race detection Figure 13: Cost of PVSC detection over different race detection algorithms

Related Work
- Compiler analysis: Conservative for C/C++ programs; inserts many redundant fences, which hurt performance severely.
- Verification: Enumerates all possible executions that fit an RC model. Does not scale to large applications.
- Data race detection: Not concerned with the problem of SC violation. [many]
- Other concurrency bugs: Work on atomicity [AVIO, yyzhou] and correlation [MUVI, yyzhou] does not consider the PVSC problem.


Conclusion
An effective and efficient scheme to detect Potential Violation of Sequential Consistency for concurrent C/C++ programs.
- Easy to port to mature data race detection tools.
- Retains performance after PVSC elimination.
- Scalable and low-cost.
Current limitations
- Limitations of dynamic data race detection: false positives and false negatives.
  - Can be addressed with progress in data race detection.
- Loop

Thanks! Suggestions?