Published by Marshall Mathews. Modified over 9 years ago.
1
Diagnosing and Fixing Concurrency Bugs
Credits to Dr. Guoliang Jin, Computer Science, NC State
Presented by Tao Wang
2
We need reliable software
People’s daily lives now depend on reliable software.
Software companies spend substantial resources on debugging:
  More than 50% of effort goes to finding and fixing bugs
  Around $300 billion per year
3
Concurrency bugs hurt
It is an increasingly parallel world.
Concurrency bugs in history
4
Multi-threaded program
Concurrent programs under the shared-memory model:
  Programs execute multiple interacting threads in parallel
  Threads communicate via shared memory
  Shared-memory accesses should be well synchronized
[Figure: a multicore chip with four cores (core1 to core4), each with its own cache and running one thread (thread1 to thread4), all connected to shared memory]
5
The interleaving space
An example of a concurrency bug:
  Thread 1:  if (ptr != NULL) { ptr->field = 1; }
  Thread 2:  ptr = NULL;
The interleaving space is huge, and only some interleavings are bad. Previous research focuses on finding the bad interleavings.
Bad interleaving: Thread 2 executes ptr = NULL after Thread 1 has checked ptr but before Thread 1 dereferences it. Result: segmentation fault.
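The two-thread example is small enough to enumerate by hand: Thread 2's single statement can run before the check, between the check and the dereference, or after the dereference. The C sketch below simulates those three orderings. The function name `interleaving_crashes` and the parameter `r_pos` are hypothetical, and the pointer is modeled as a boolean so that the "crash" is only reported, never triggered.

```c
#include <assert.h>
#include <stdbool.h>

/* r_pos = number of Thread 1 steps that run before Thread 2's "ptr = NULL":
 *   0: before the check, 1: between check and dereference, 2: after both.
 * Returns true when the interleaving dereferences a NULL pointer. */
bool interleaving_crashes(int r_pos) {
    bool ptr_valid = true;      /* ptr starts out non-NULL */
    bool passed_check = false;  /* did "if (ptr != NULL)" succeed? */

    assert(r_pos >= 0 && r_pos <= 2);
    for (int step = 0; step <= 2; step++) {
        if (step == r_pos)
            ptr_valid = false;              /* Thread 2: ptr = NULL; */
        if (step == 0)
            passed_check = ptr_valid;       /* Thread 1: if (ptr != NULL) */
        else if (step == 1 && passed_check && !ptr_valid)
            return true;                    /* Thread 1: ptr->field = 1; on NULL */
    }
    return false;
}
```

Only `r_pos == 1` is reported as a failure, which is the single bad interleaving out of three for this example. Real programs have far more statements and threads, so the space grows combinatorially.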
6
Bug fixing
Software quality does not improve until bugs are fixed.
Manual concurrency-bug fixing is
  time-consuming: 73 days on average
  error-prone: 39% of patches are buggy in the first release
CFix: automated concurrency-bug fixing [PLDI’11*, OSDI’12]
  The program behaves correctly if bad interleavings do not occur
  Fix concurrency bugs by disabling bad interleavings
*SIGPLAN: “one of the first papers to attack the problem of automated bug fixing”
7
The interleaving space (again)
[Figure: the huge interleaving space; some bad interleavings are disabled, the remaining bad interleavings lead to production-run failures]
8
Failure diagnosis
Failures still happen in production runs, and the reason behind each failure needs to be understood.
  Tools dealing with production runs demand low overhead
  Diagnostic information needs to be informative
Production-run concurrency-bug failure diagnosis: design new monitoring schemes and sampling strategies
  CCI: a pure software solution [OOPSLA’10]
  PBI, LXR: hardware-assisted solutions [ASPLOS’13 & 14]
9
My work on concurrency bugs
Bug detection and software testing: ConSeq [ASPLOS’11]
Automated concurrency-bug fixing: CFix [PLDI’11*, OSDI’12]
Production-run failure diagnosis: CCI/PBI/LXR [OOPSLA’10, ASPLOS’13 & 14]
*Received a SIGPLAN CACM nomination
10
Outline
Motivation and overview
Automated concurrency-bug fixing
  The problem and idea
  Overview
  Internals of CFix
  Evaluation and summary
11
Automated fixing is difficult
What is the correct behavior?
  Usually requires developers’ knowledge
How to get the correct behavior?
  Correct program states under bug-triggering inputs
  No change to program states under other inputs
[Figure: bug description (symptom, triggering condition, …), then "?", then patch (correctness, performance, simplicity)]
12
CFix’s insights
What is the correct behavior?
  The program state is correct as long as the buggy interleaving does not occur
How to get the correct behavior?
  Only need to disable failure-inducing interleavings
  Can leverage well-defined synchronization operations
[Figure: bug description (symptom, triggering condition, …), then "?", then patch (correctness, performance, simplicity)]
13
How to get a general solution that generates good patches?
[Figure: description (interleavings that lead to software failure), then "?", then patch (correctness, performance, simplicity)]
Detectors that can supply the description, with the statements each one reports:
  Atomicity violation detectors (p, c, r): Park ASPLOS’09, Flanagan POPL’04, Lu ASPLOS’06, Chew EuroSys’10
  Order violation detectors (A, B): Zhang ASPLOS’10, Lucia MICRO’09, Yu ISCA’09, Gao ASPLOS’11
  Data race detectors (I1, I2): Sen PLDI’08, Savage TOCS’97, Yu SOSP’05, Erickson OSDI’10, Kasikci ASPLOS’10
  Abnormal data flow detectors (Wb, R, Wg): Zhang ASPLOS’11, Shi OOPSLA’10
14
[Figure: CFix overview]
Input: bug reports (description: interleavings that lead to software failure) and source code
Components: Fix-Strategy Design, Synchronization Enforcement (mutual exclusion, order), Patch Testing & Selection, Patch Merging, Run-time Support
Intermediate results: patched binaries, selected binary, merged binary, final patched binary
Output: a patch judged on correctness, performance, and simplicity
15
Fix-strategy design: what to fix
Challenge: huge variety of bugs
16
Two types of concurrency bugs
Atomicity violation and order violation
Why these two?
  Real-world concurrency bug characteristics study [SHAN ASPLOS’08]: 97% of the non-deadlock bugs studied are either atomicity violations or order violations
  Either can be fixed by mutual exclusion or order enforcement
17
Fix-strategy design: how to fix
Challenge: inaccurate root cause
18
Atomicity violation
  Thread 1:  if (ptr != NULL) {     // p
                 ptr->field = 1;    // c
             }
  Thread 2:  ptr = NULL;            // r
19
Fix strategy for atomicity violation
  Thread 1:  if (ptr != NULL) { ptr->field = 1; }
  Thread 2:  ptr = NULL;
20
CFix: fix-strategy design
Challenges:
  Inaccurate root cause
  Huge variety of bugs
Solution: a combination of mutual exclusion and order relationship enforcement
21
Fix-strategies overview
[Figure: fix strategies for the statements reported by each detector]
  OV detector: A, B
  AV detector: p, c, r
  Race detector: I1, I2
  DU detector: Wb, R, Wg
22
CFix: synchronization enforcement
Challenges: correctness, performance, simplicity
Solution:
  Mutual exclusion enforcement: AFix [PLDI’11]
  Order relationship enforcement: OFix [OSDI’12]
23
Fixing an atomicity violation
Input: three statements (p, c, r) with contexts
Idea: make the code region from p to c mutually exclusive with r
  Thread 1:  if (ptr != NULL) {     // p
                 ptr->field = 1;    // c
             }
  Thread 2:  ptr = NULL;            // r
24
Mutual exclusion enforcement: AFix
Approach: lock
Goals:
  Correctness: paired lock acquisition and release operations
  Performance: make the critical section as small as possible
[Figure: the p to c region and r]
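As an illustration of the lock-based approach, here is a minimal hand-written C sketch of the ptr example with the p to c region and r protected by the same pthread mutex. It is not output generated by AFix. The names `afix_lock`, `shared_obj`, `thread1_check_then_use`, and `thread2_reset` are assumptions made for this sketch.

```c
#include <pthread.h>
#include <stddef.h>

struct obj { int field; };

static struct obj shared_obj = { 0 };
static struct obj *ptr = &shared_obj;
static pthread_mutex_t afix_lock = PTHREAD_MUTEX_INITIALIZER; /* hypothetical name */

/* Thread 1: p (the check) and c (the dereference) sit in one critical
 * section. Returns 1 if the write happened, 0 if ptr was already NULL. */
int thread1_check_then_use(void) {
    int wrote = 0;
    pthread_mutex_lock(&afix_lock);
    if (ptr != NULL) {        /* p */
        ptr->field = 1;       /* c: r cannot run between p and c */
        wrote = 1;
    }
    pthread_mutex_unlock(&afix_lock);
    return wrote;
}

/* Thread 2: r takes the same lock, so it runs either before p or after c. */
void thread2_reset(void) {
    pthread_mutex_lock(&afix_lock);
    ptr = NULL;               /* r */
    pthread_mutex_unlock(&afix_lock);
}
```

The critical section covers only the check and the dereference, in line with the performance goal, and every path that takes the lock also releases it, in line with the correctness goal.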
25
A naïve solution
Add lock on edges reaching p
Add unlock on edges leaving c
Potential new bugs:
  Could lock without unlock
  Could unlock without lock
  etc.
[Figure: four control-flow examples containing p and c]
26
The AFix solution
Assume p and c are in the same function f
Step 1: find the protected nodes in the critical section
Step 2: add lock operations on edges from an unprotected node to a protected node, and unlock operations on edges from a protected node to an unprotected node
This avoids the potential bugs mentioned for the naïve solution.
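The two steps can be sketched on a small control-flow graph. The code below assumes one plausible reading of "protected node": a node that is reachable from p and from which c is reachable. The deck does not spell out the exact definition, so treat this as an approximation. The types and names (`Cfg`, `protected_nodes`, `count_sync_edges`, `example_cfg`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 16

typedef struct {
    int n;                             /* number of nodes in use */
    bool edge[MAX_NODES][MAX_NODES];   /* edge[u][v]: CFG edge u -> v */
} Cfg;

/* Depth-first search from start; follows edges backwards when backward is true. */
static void reach(const Cfg *g, int start, bool backward, bool seen[MAX_NODES]) {
    int stack[MAX_NODES], top = 0;
    memset(seen, 0, sizeof(bool) * MAX_NODES);
    seen[start] = true;
    stack[top++] = start;
    while (top > 0) {
        int u = stack[--top];
        for (int v = 0; v < g->n; v++) {
            bool has = backward ? g->edge[v][u] : g->edge[u][v];
            if (has && !seen[v]) { seen[v] = true; stack[top++] = v; }
        }
    }
}

/* Step 1 (assumed definition): protected = reachable from p AND can reach c. */
void protected_nodes(const Cfg *g, int p, int c, bool prot[MAX_NODES]) {
    bool from_p[MAX_NODES], to_c[MAX_NODES];
    assert(p >= 0 && p < g->n && c >= 0 && c < g->n);
    reach(g, p, false, from_p);
    reach(g, c, true, to_c);
    for (int v = 0; v < MAX_NODES; v++) prot[v] = from_p[v] && to_c[v];
}

/* Step 2: count edges that would get a lock (entering) or an unlock (leaving). */
int count_sync_edges(const Cfg *g, const bool prot[MAX_NODES], bool entering) {
    int count = 0;
    for (int u = 0; u < g->n; u++)
        for (int v = 0; v < g->n; v++)
            if (g->edge[u][v] && prot[u] != prot[v] && prot[v] == entering)
                count++;
    return count;
}

/* 0: entry, 1: p (the check), 2: c (the body), 3: branch that skips c, 4: exit. */
Cfg example_cfg(void) {
    Cfg g;
    memset(&g, 0, sizeof g);
    g.n = 5;
    g.edge[0][1] = true;
    g.edge[1][2] = true;
    g.edge[1][3] = true;
    g.edge[2][4] = true;
    g.edge[3][4] = true;
    return g;
}
```

For `example_cfg()` with p = 1 and c = 2, the protected set is {1, 2}, with one lock edge (0 to 1) and two unlock edges (1 to 3 and 2 to 4). The naïve scheme would place an unlock only on the edge leaving c, so the path through node 3 would lock without unlocking.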
27
Subtle details
Adjusting p and c when they are in different functions
  Observation: people put lock and unlock in one function
  Find the longest common prefix of p’s and c’s stack traces
  Adjust p and c accordingly
Putting r into a critical section
  Do nothing if r can be reached from the p–c critical section
Lock type
  Lock with timeout: if the critical section has blocking operations
  Reentrant lock: if recursion is possible within the critical section
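The stack-trace step above can be sketched as a plain longest-common-prefix computation. This sketch assumes each trace is an array of function names ordered from the outermost frame inward, which simplifies real stack frames (they also carry call sites). The name `common_prefix_len` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns how many leading frames the two traces share. */
size_t common_prefix_len(const char **p_trace, size_t p_len,
                         const char **c_trace, size_t c_len) {
    size_t i = 0;
    assert(p_trace != NULL && c_trace != NULL);
    while (i < p_len && i < c_len && strcmp(p_trace[i], c_trace[i]) == 0)
        i++;
    return i;
}
```

For example, with p's trace `main, worker, check` and c's trace `main, worker, update, store`, the common prefix has length 2 and ends at `worker`. Under the assumption above, p would be adjusted to the call to `check` inside `worker` and c to the call to `update` inside `worker`, so that lock and unlock land in the same function.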
28
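OFix: two order relationships
allA-B: B must wait for all instances of A (A1 … An, then B)
firstA-B: B must wait for the first instance of A (Ai, then B; other instances Aj may execute later)
[Figure labels: use, read, initialization, destroy]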
29
OFix allA-B enforcement
Approach: condition variable and flag
  Insert signal operations in A-threads
  Insert a wait operation before B
Rules
  An A-thread signals exactly once, when it will not execute any more A
  An A-thread signals as soon as possible
  B proceeds when each A-thread has signaled
30
OFix allA-B enforcement: A side
How to identify the last A instance in one thread
  ...; for (...) ... ; // A
  ...;
Each thread that executes A signals exactly once, as soon as it can execute no more A
31
OFix allA-B enforcement: A side
How to identify the last thread that executes A
A counter for signal threads: initialized to 1, incremented at each thread_create
  void main() {
    for (...) thread_create(thr_main);
    ...;
  }
  void thr_main() {
    for (...) ... ; // A
    ...;
  }
  void ofix_signal() {
    mutex_lock(L);
    counter--;                  // counter for signal threads
    if (counter == 0) cond_broadcast(con);
    mutex_unlock(L);
  }
32
OFix allA-B enforcement: B side
B is safe to execute only when the counter is 0
Give up if OFix knows that it introduces a new deadlock
Timed wait operation to mask potential deadlocks
  void ofix_wait() {
    mutex_lock(L);
    if (counter != 0) cond_timedwait(con, L, t);
    mutex_unlock(L);
  }
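The signal and wait operations on the slides can be written out with POSIX threads. The following is a sketch under stated assumptions rather than the code OFix emits: the name `ofix_register_thread`, the `counter` variable name, the seconds-based timeout parameter, and the convention that the creating thread also signals once after its creation loop are assumptions of this sketch. It also waits in a loop, to tolerate spurious wakeups, where the slide uses a single `if`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t con = PTHREAD_COND_INITIALIZER;
static int counter = 1;   /* 1 accounts for the thread that creates A-threads */

/* Parent calls this just before each thread_create of an A-thread. */
void ofix_register_thread(void) {
    pthread_mutex_lock(&L);
    counter++;
    pthread_mutex_unlock(&L);
}

/* Called exactly once per registered thread, and once by the parent after
 * its creation loop, at the point where no more A can execute there. */
void ofix_signal(void) {
    pthread_mutex_lock(&L);
    assert(counter > 0);
    if (--counter == 0)
        pthread_cond_broadcast(&con);
    pthread_mutex_unlock(&L);
}

/* Called right before B. Returns 1 if every A-thread has signaled,
 * 0 if the timeout expired first (B then proceeds anyway). */
int ofix_wait(int timeout_sec) {
    struct timespec deadline;
    int rc = 0, all_signaled;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&L);
    while (counter != 0 && rc == 0)
        rc = pthread_cond_timedwait(&con, &L, &deadline);
    all_signaled = (counter == 0);
    pthread_mutex_unlock(&L);
    return all_signaled;
}
```

The timed wait means a patch that accidentally introduces a deadlock degrades into a delay instead of a hang, which matches the "mask potential deadlocks" point on the slide. The return value lets a caller or run-time monitor distinguish a timeout from a normal wake-up.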
33
OFix firstA-B
Basic enforcement
When A may not execute: add a safety net of signals with the allA-B algorithm
[Figure: A, then B]
34
CFix: patch testing & selection
Challenge: multi-threaded software testing
Solution: CFix-patch-oriented testing
35
Patch testing principles
Two ideas:
  No exhaustive testing, but patch-oriented testing
  Leverage existing testing techniques, with extra heuristics
The workflow:
  Step 1: prune incorrect patches (patches causing failures due to wrong fix strategies, etc.)
  Step 2: prune slow patches
  Step 3: prune complicated patches
36
Run once without external perturbation
Reject the patch if there is a time-out or failure
Patches that fix the wrong root cause make the software fail deterministically
  Thread 1:  ptr->field = 1;
  Thread 2:  ptr = NULL;
37
Implicit bad patch
A failure in patch_b implies a failure in patch_a if patch_a is less restrictive than patch_b
  Helpful for pruning patch_a
  Traditional testing may not find the failure in patch_a
[Figure: patches a, b, and c, related by mutual exclusion and order relationships]
38
CFix: patch merging
Challenge: one single programming mistake usually leads to multiple bug reports
Solution: heuristics to merge patches
39
An example with multiple reports
  void buf_write() {
    int tmp = buf_len + str_len;
    if (tmp > MAX) return;
    memcpy(&buf[buf_len], str, str_len);
    buf_len = tmp;
  }
[Figure labels on the statements: p1, c1, p2, r1, c2, r2]
Patching each report separately has drawbacks:
  Too many lock/unlock operations
  Potential new deadlocks
  May hurt performance and simplicity
40
Related patch: a case of AFix
Merge if p, c, or r is in some other patch’s critical section
Before merging:
  lock(L1) p1 lock(L2) p2 c1 unlock(L1) c2 unlock(L2)
  lock(L1) r1 unlock(L1)
  lock(L2) r2 unlock(L2)
After merging:
  lock(L1) p1 p2 c1 c2 unlock(L1)
  lock(L1) r2 unlock(L1)
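One simple way to model the lock-sharing part of merging is a union-find structure over patches: patches that end up in the same set use the same lock, as L1 replaces L2 in the slide. This is only an illustrative sketch; it does not model how CFix decides that two patches are related or how it adjusts critical-section boundaries. All names (`init_patches`, `find_lock`, `merge_patches`) are hypothetical.

```c
#include <assert.h>

#define MAX_PATCHES 32

/* lock_of[i] is the parent of patch i; a root's index doubles as its lock id. */
static int lock_of[MAX_PATCHES];

/* Start with one private lock per patch. */
void init_patches(int n) {
    assert(n >= 0 && n <= MAX_PATCHES);
    for (int i = 0; i < n; i++)
        lock_of[i] = i;
}

/* Returns the lock id currently used by the given patch. */
int find_lock(int patch) {
    while (lock_of[patch] != patch) {
        lock_of[patch] = lock_of[lock_of[patch]];  /* path halving */
        patch = lock_of[patch];
    }
    return patch;
}

/* Called when a statement of one patch lies in the other's critical section. */
void merge_patches(int a, int b) {
    int ra = find_lock(a), rb = find_lock(b);
    if (ra != rb)
        lock_of[rb] = ra;
}
```

After `init_patches(3)` and `merge_patches(0, 1)`, patches 0 and 1 share one lock while patch 2 keeps its own, which removes the nested lock(L1)/lock(L2) pattern that could otherwise introduce deadlocks.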
41
The merged patch for the example
  void buf_write() {
    int tmp = buf_len + str_len;
    if (tmp > MAX) {
      return;
    }
    memcpy(&buf[buf_len], str, str_len);
    buf_len = tmp;
  }
[Figure: statement labels before merging (p1, c1, p2, r1, c2, r2) and after merging (p1; c1, p2; c2, r1, r2)]
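A hand-written approximation of what a merged patch could look like for this function is shown below: a single lock covers the read of `buf_len`, the copy, and the update, and the early-return path releases the lock too. The lock name `L1` follows the slides. The parameters, the return value, the buffer size, and the declarations of `buf` and `buf_len` are assumptions added so that the sketch compiles on its own.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX 64                       /* assumed buffer capacity */

static char buf[MAX];
static int buf_len = 0;
static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;

/* Appends str to buf. Returns 0 on success, -1 if it would not fit. */
int buf_write(const char *str, int str_len) {
    assert(str != NULL && str_len >= 0);
    pthread_mutex_lock(&L1);
    int tmp = buf_len + str_len;     /* reads buf_len */
    if (tmp > MAX) {
        pthread_mutex_unlock(&L1);   /* unlock on the early-return edge */
        return -1;
    }
    memcpy(&buf[buf_len], str, (size_t)str_len);  /* reads buf_len again */
    buf_len = tmp;                   /* writes buf_len */
    pthread_mutex_unlock(&L1);
    return 0;
}
```

With one lock and one critical section there are two unlock sites and no nested locking, compared with the per-report patches that each add their own lock/unlock pairs around overlapping regions.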
42
CFix: run-time support
To understand whether there is a deadlock underlying a time-out
Low overhead, and suitable for production runs
43
Evaluation methodology
Applications: PBZIP2, x264, FFT, HTTrack, Mozilla-1, transmission, ZSNES, Apache, MySQL-1, MySQL-2, Mozilla-2, Cherokee, Mozilla-3
Detectors: AV detector, OV detector, RA detector, DU detector
44
Evaluation result
  App.            # of Ops
  PBZIP2          5
  x264            7
  FFT             5
  HTTrack         2
  Mozilla-1       2
  transmission    2
  ZSNES           3
  Apache          3
  MySQL-1         5
  MySQL-2         9
  Mozilla-2       3
  Cherokee        2
  Mozilla-3       5
45
Summary
Software reliability is critical
Fixing concurrency bugs is costly and error-prone
CFix uses some heuristics, with good results in practice
  A combination of mutual exclusion and order enforcement
  Use testing to select the best patch
  Fix the root cause without requiring detectors to report it
  Small overhead and good simplicity
46
Questions?
Thank you