Published by Gregory Barber. Modified over 9 years ago.
1
- 1 - Chimera: Hybrid Program Analysis for Determinism
Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy
University of Michigan, Ann Arbor
* Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html
2
- 2 - Deterministic Replay Goal: record and reproduce multithreaded execution Debugging concurrency bugs Offline heavyweight dynamic analysis Forensics and intrusion detection … and many more uses Problem Multithreaded record-and-replay is too slow (>2x) or requires custom hardware
3
- 3 - Multithreaded Record-and-Replay is Slow
Checkpoint memory and register state
Log non-deterministic program input
− Interrupts, I/O values, DMA, etc.
Log shared memory dependencies
[Figure: Thread 1, Thread 2, and Thread 3 with write-to-read dependencies between them]
4
- 4 - Replay for Data-Race-Free Programs is Cheap
Data-race-free programs
− Shared memory accesses are well ordered by synchronization ops.
− Recording the happens-before order of sync. ops. is sufficient
Problem: Programs with data races
[Figure: threads T1, T2, and T3 accessing X, Y, and Z, ordered by Lock(l)/Unlock(l) and Signal(c)/Wait(c); legend: order of mem. ops., order of sync. ops.]
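The idea on this slide, that logging only the order of synchronization operations is enough for a data-race-free program, can be sketched as a small log that is appended to while recording and consulted while replaying. This is a minimal single-lock model, not the Chimera runtime: `sync_log`, `record_acquire`, and `replay_try_acquire` are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 64

/* Hypothetical log: which thread performed each sync operation, in order. */
typedef struct {
    int tid[MAX_EVENTS];
    size_t len;    /* number of events recorded */
    size_t cursor; /* index of the next event to replay */
} sync_log;

/* Recording: append the thread that just acquired the lock.
 * Returns 0 on success, -1 if the log is full. */
int record_acquire(sync_log *log, int tid) {
    if (log->len >= MAX_EVENTS) return -1;
    log->tid[log->len++] = tid;
    return 0;
}

/* Replay: a thread may acquire only if it is next in the recorded order.
 * Returns 1 and advances the cursor if allowed; returns 0 if the caller
 * has to wait for another thread to go first (or the log is exhausted). */
int replay_try_acquire(sync_log *log, int tid) {
    if (log->cursor >= log->len) return 0;
    if (log->tid[log->cursor] != tid) return 0;
    log->cursor++;
    return 1;
}
```

If thread 2 and then thread 1 took the lock during recording, a replayed thread 1 is held back until thread 2 has gone first. Memory accesses need no log entries of their own because, without data races, their order follows from the sync order.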
5
- 5 - Our Contribution: A Hybrid Analysis Potentially racy program P Data-race-free program P’ Sound static data race analysis Add synchronizations for potential data races Problem: Too many false positives Profiling non-concurrent code regions Symbolic bounds analysis Chimera
6
- 6 - Roadmap Motivation Chimera Analysis 1)Static data race analysis 2)Profiling non-concurrent code regions 3)Symbolic bounds analysis Weak-lock Design Evaluation Conclusion
8
- 8 - Static Data Race Analysis Find potential data-races using a sound static data race detector RELAY [Voung et al., FSE’07] Protect all potential data-races using weak-locks − A new time-out lock which may be preempted (discussed later) Record and replay the happens-before order of weak-locks
9
- 9 - Protect Potential Races using Weak-locks
Static analysis helps avoid instrumentation for access to Z

void foo() {
    X = 0;                  /* potential racy-pair with X = 1 */
    for (i = ...) {
        Y[tid][i] = 0;      /* potential racy-pair with Y[tid][i] = 1 */
    }
}

void bar() {
    X = 1;
    for (i = ...) {
        Y[tid][i] = 1;
        Z = 1;              /* no race report */
    }
}
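As a rough illustration of what instruction-level instrumentation could look like, the sketch below wraps the reported accesses to X and Y in weak-lock calls and leaves Z untouched. Everything about the lock is an assumption: `weak_lock`, `weak_lock_acquire`, and `weak_lock_release` are hypothetical stand-ins that only count acquisitions, whereas a real weak-lock would block, time out, and log its order. The array sizes and the placement of `Z = 1` inside the loop are also assumptions.

```c
#include <assert.h>

#define N 4
#define NTHREADS 2

int X, Z;
int Y[NTHREADS][N];

/* Hypothetical weak-lock stub: it only counts acquisitions so that the
 * instrumentation points are visible. It does not block or log. */
typedef struct { int held; int acquisitions; } weak_lock;

void weak_lock_acquire(weak_lock *l) { l->held = 1; l->acquisitions++; }
void weak_lock_release(weak_lock *l) { l->held = 0; }

weak_lock lock_X, lock_Y;

void foo(int tid) {
    weak_lock_acquire(&lock_X);
    X = 0;                      /* reported as a potential race */
    weak_lock_release(&lock_X);
    for (int i = 0; i < N; i++) {
        weak_lock_acquire(&lock_Y);
        Y[tid][i] = 0;          /* reported as a potential race */
        weak_lock_release(&lock_Y);
    }
}

void bar(int tid) {
    weak_lock_acquire(&lock_X);
    X = 1;
    weak_lock_release(&lock_X);
    for (int i = 0; i < N; i++) {
        weak_lock_acquire(&lock_Y);
        Y[tid][i] = 1;
        weak_lock_release(&lock_Y);
        Z = 1;                  /* no race report, so no instrumentation */
    }
}
```

One lock operation per loop iteration is where the cost of instruction-level weak-locks comes from, which is what the coarser function-level and loop-level locks later in the deck try to avoid.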
10
- 10 - Sources of False Positives in RELAY
Sound data-race detector reports too many false data-races
− 53x overhead
Source 1: Non-mutex synchronizations are ignored
− Lockset based analysis ignores fork-join, barrier, signal-wait, etc.
− May report a false data-race between memory instructions that can never execute concurrently
− Solution: Profiling non-concurrent code regions
Source 2: Conservative pointer analysis
− Overestimates the variables accessed by a memory instruction
− May report a false data-race between memory instructions that can never access the same location
− Solution: Symbolic bounds analysis
11
- 11 - Roadmap Motivation Chimera Analysis 1)Static data race analysis 2)Profiling non-concurrent code regions 3)Symbolic bounds analysis Weak-lock Design Evaluation Conclusion
12
- 12 - Profiling Non-concurrent Code Regions Problem Lockset based analysis ignores non-mutex synchronization ops. Solution Profile non-concurrent code regions (e.g., functions) Increase the granularity of weak-locks to protect a larger code region instead of each potential racy instruction Parallelism is preserved unless mis-profiled T1 foo() BARRIER T2 BARRIER bar() False Race
13
- 13 - Function-Level Weak-Locks
Use function-level weak-locks if the profiler says foo() and bar() are not likely to run concurrently
[Figure: foo() and bar() separated by a BARRIER; label: False Race]

void foo() {
    X = 0;
    for (i = ...) {
        Y[tid][i] = 0;
    }
}

void bar() {
    X = 1;
    for (i = ...) {
        Y[tid][i] = 1;
        Z = 1;
    }
}
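One way to picture the profiling step is as a pass over a trace of function invocations that checks whether two functions were ever active at the same time on different threads. The sketch below assumes a trace format with logical enter and leave timestamps. The `invocation` struct and `ran_concurrently` are hypothetical, and the slides do not say how Chimera's profiler represents its data.

```c
#include <assert.h>
#include <stddef.h>

/* One profiled invocation: which function ran, on which thread, and its
 * logical enter/leave timestamps (half-open interval [enter, leave)). */
typedef struct {
    int func_id;
    int tid;
    long enter;
    long leave;
} invocation;

/* Returns 1 if some invocation of f overlapped in time with some
 * invocation of g on a different thread, and 0 otherwise. */
int ran_concurrently(const invocation *trace, size_t n, int f, int g) {
    for (size_t i = 0; i < n; i++) {
        if (trace[i].func_id != f) continue;
        for (size_t j = 0; j < n; j++) {
            if (trace[j].func_id != g) continue;
            if (trace[i].tid == trace[j].tid) continue;
            if (trace[i].enter < trace[j].leave &&
                trace[j].enter < trace[i].leave) {
                return 1;
            }
        }
    }
    return 0;
}
```

If foo() and bar() never overlap in the profiled runs, a single function-level weak-lock around each is unlikely to be contended. Because profiling is not sound, a mis-profiled pair costs parallelism rather than correctness, as the previous slide notes.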
14
- 14 - Roadmap Motivation Chimera Analysis 1)Static data race analysis 2)Profiling non-concurrent code regions 3)Symbolic bounds analysis Weak-lock Design Evaluation Conclusion
15
- 15 - Imprecision in Conservative Pointer Analysis
[Figure: threads T1 and T2 running foo() and bar() around a BARRIER; label: May run concurrently]
16
- 16 - Imprecision in Conservative Pointer Analysis
RELAY uses Steensgaard’s and Andersen’s pointer analysis
− Flow-Insensitive and Context-Insensitive (FICI) analysis
− Naming heap objects is conservative
Overestimates the variables accessed by a memory instruction
[Figure: Thread 1 and Thread 2 writing different rows of Y[][]; the potential racy-pair is labeled False Race]

void foo() {
    …
    for (i = 0 to N) {
        Y[tid][i] = 0;
    }
    …
}

void bar() {
    …
    for (i = 0 to N) {
        Y[tid][i] = 1;
    }
    …
}
17
- 17 - Symbolic Bounds Analysis Our Solution Derive the symbolic lower and upper bounds that a racy code region may access (e.g., loops) [Rugina and Rinard, PLDI’00] Increase the granularity of weak-locks to protect a larger code region for a set of addresses specified by a symbolic expression Parallelism is preserved if the bounds are precise enough void foo() { … for(i = 0 to N){ Y[ tid ][ i ] = 0; } … } Bounds: &Y[tid][0] to &Y[tid][N] Symbolic Bounds Analysis
18
- 18 - Loop-level Weak-locks
Symbolic bounds: &Y[tid][0] ~ &Y[tid][N]
Weak-lock address range: (&Y[tid][0], &Y[tid][N])

void foo() {
    X = 0;
    for (i = 0 to N) {
        Y[tid][i] = 0;
    }
}

void bar() {
    X = 1;
    for (i = 0 to N) {
        Y[tid][i] = 1;
        Z = 1;
    }
}
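To see why bounds of this form preserve parallelism, it helps to evaluate them for two threads and compare the resulting address ranges. The sketch below assumes the loop runs for i = 0..N inclusive, so each row holds N + 1 elements. `addr_range`, `loop_bounds`, and `ranges_overlap` are hypothetical helpers, not part of Chimera or of the Rugina and Rinard analysis.

```c
#include <assert.h>
#include <stdint.h>

#define NTHREADS 2
#define N 7                 /* assumed: loop covers i = 0..N inclusive */

int Y[NTHREADS][N + 1];

/* Closed range [lo, hi] of the addresses of the first and last element
 * that the loop may touch. */
typedef struct { uintptr_t lo, hi; } addr_range;

/* Evaluate the symbolic bounds &Y[tid][0] .. &Y[tid][N] for one thread. */
addr_range loop_bounds(int tid) {
    addr_range r = { (uintptr_t)&Y[tid][0], (uintptr_t)&Y[tid][N] };
    return r;
}

/* Two loop-level weak-locks conflict only if their ranges intersect. */
int ranges_overlap(addr_range a, addr_range b) {
    return a.lo <= b.hi && b.lo <= a.hi;
}
```

Threads with different `tid` values get disjoint ranges, so their loop-level weak-locks never contend. A range covering the whole address space, which is what imprecise bounds amount to, overlaps everything and serializes the loops.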
19
- 19 - Imprecise Symbolic Bounds Sources Depend on the value computed inside the code region Depend on arithmetic operations not supported in the analysis − e.g., modulo operations, logical AND/OR, etc. Choosing the optimal granularity If bounds are too imprecise and the loop body is long enough, resort to instruction (basic-block) level weak-locks for parallelism void qux() { … for(i = 0 to N){ prev = Z[ prev ]; } … } Bounds: -INF to +INF Symbolic Bounds Analysis
20
- 20 - Roadmap Motivation Chimera Analysis Weak-lock Design Evaluation Conclusion
21
- 21 - Deadlock due to Weak-locks
No deadlocks between weak-locks
− function-level > loop-level > instruction-level
Deadlock between weak-locks and original sync. ops. is possible
[Figure: T1 at wait(cv), T2 at signal(cv); label: Time-out!!]
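The ordering on this slide reads like a lock hierarchy: if every thread takes coarser weak-locks before finer ones, no cycle of waiting weak-lock holders can form. The check below is one interpretation of that rule, with a hypothetical `lock_level` enum and `respects_hierarchy` helper. How locks of the same level are ordered is not stated in the slides, so the sketch simply allows them.

```c
#include <assert.h>
#include <stddef.h>

/* Granularity levels, ordered so that a coarser lock has a larger value. */
typedef enum { LEVEL_INSTR = 0, LEVEL_LOOP = 1, LEVEL_FUNC = 2 } lock_level;

/* Given the levels of the weak-locks a thread acquires while nesting,
 * in acquisition order, return 1 if it never takes a coarser lock while
 * already holding a finer one, and 0 otherwise. */
int respects_hierarchy(const lock_level *acquired, size_t n) {
    for (size_t i = 1; i < n; i++) {
        if (acquired[i] > acquired[i - 1]) return 0;
    }
    return 1;
}
```

A thread that takes a function-level lock, then a loop-level lock, then an instruction-level lock passes the check. Taking a function-level lock while holding a loop-level one does not.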
22
- 22 - Weak-lock Time-out
A weak-lock might time-out
− Invoke a special system call to handle it
Weak-lock guarantee
− Only one thread holds a given weak-lock at any given time
− Mutual exclusion may be compromised; but sufficient for replay
[Figure: T1 at wait(cv) is the current owner, T2 at signal(cv) times out; the logged order of weak-locks is shown]
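The guarantee on this slide can be modelled as a lock with a single owner field and an ownership log: on a time-out the waiter takes ownership away from the current holder, so there is still exactly one owner and the hand-over is recorded for replay. This is a toy single-threaded model under that assumption. The slides do not describe what the special system call does, and all names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NO_OWNER (-1)
#define MAX_LOG 32

typedef struct {
    int owner;          /* the single current owner, or NO_OWNER */
    int order[MAX_LOG]; /* logged order in which threads obtained the lock */
    size_t logged;
} weak_lock;

void weak_lock_init(weak_lock *l) { l->owner = NO_OWNER; l->logged = 0; }

/* Make tid the owner and log the hand-over so replay can repeat it. */
static void grant(weak_lock *l, int tid) {
    l->owner = tid;
    if (l->logged < MAX_LOG) l->order[l->logged++] = tid;
}

/* Normal path: succeeds (returns 1) only if the lock is free. */
int weak_lock_try_acquire(weak_lock *l, int tid) {
    if (l->owner != NO_OWNER) return 0;
    grant(l, tid);
    return 1;
}

/* Time-out path: the waiter preempts the current holder.
 * Returns the tid that lost ownership. */
int weak_lock_timeout_preempt(weak_lock *l, int waiter) {
    int preempted = l->owner;
    grant(l, waiter);
    return preempted;
}

/* Release is a no-op for a thread that has already been preempted. */
void weak_lock_release(weak_lock *l, int tid) {
    if (l->owner == tid) l->owner = NO_OWNER;
}
```

After a preemption the old holder may still be inside its critical section, which is the sense in which mutual exclusion is compromised. The owner field and the log stay consistent, and that is the part replay relies on.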
23
- 23 - Roadmap Motivation Chimera Analysis Weak-lock Design Evaluation Conclusion
24
- 24 - Implementation
Source-to-source instrumentation
− Implemented in OCaml using CIL as a front end
Static analysis
− Data race detection: RELAY [Voung et al., FSE’07]; includes all library source code for soundness (uClibc’s libc, libm, etc.)
− Symbolic bounds analysis: [Rugina and Rinard, PLDI’00]; intra-procedural analysis for racy loops only
Runtime system
− Modified Linux kernel to record/replay program input
− Modified pthread library to record/replay the happens-before order of original synchronization operations and weak-locks
25
- 25 - Evaluation Setup Test Environment 2.66 GHz 8-core Xeon processor with 4 GB of RAM Different set of inputs for profiling and performance evaluation Average of five trials with 4 worker threads 2, 4, 8 threads for scalability results Benchmarks Desktop applications − aget, pfscan, and pbzip2 Server programs − knot and apache SPLASH-2 suite − ocean, water-nsq, fft, and radix
26
- 26 - Record and Replay Performance
Recording: 39% overhead on average
Replay: similar to recording; much lower for I/O-intensive programs
[Figure: per-benchmark slowdown; labeled values: 2.4% slowdown, 86% slowdown, 39% average]
27
- 27 - Effectiveness of Coarse-grained Weak-locks
[Figure label: 53x]
28
- 28 - Effectiveness of Coarse-grained Weak-locks Coarse-grained weak-locks reduce the cost of instrumentation
29
- 29 - Effectiveness of Coarse-grained Weak-locks Coarse-grained weak-locks reduce the cost of instrumentation Exception: control-flow dependency (e.g., pfscan)
31
- 31 - Effectiveness of Coarse-grained Weak-locks
Coarse-grained weak-locks reduce the cost of instrumentation
Exception: control-flow dependency (e.g., pfscan)
[Figure label: 1.39x]
32
- 32 - Breakdown of Recording Overhead
Weak-lock overhead = contention (waiting) cost + logging cost
[Figure legend: func locks, loop locks, instr/bb locks, sync op & system log]
33
- 33 - Breakdown of Recording Overhead
Weak-lock overhead = contention (waiting) cost + logging cost
[Figure legend: func wait, loop wait, instr/bb wait, func log, loop log, instr/bb log, sync op & system log; annotations: High loop-lock contention, High instr/bb-lock contention]
34
- 34 - Scalability Scientific applications scale worse due to imprecise symbolic bounds analysis
35
- 35 - Conclusion
Goal: Software-only deterministic multiprocessor replay systems
Chimera Analysis
Static data race analysis
− Find and protect potential data races with weak-locks
− Instruction/basic-block-level weak-locks
Profiling non-concurrent code regions
− Address the inadequacy of lockset-based algorithm
− Function-level weak-locks
Symbolic bounds analysis
− Address the imprecision of conservative pointer analysis
− Loop-level weak-locks
Low Recording Overhead
39% recording overhead for 4 worker threads
36
- 36 - Thank you