- 1 - Chimera: Hybrid Program Analysis for Determinism
Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy
University of Michigan, Ann Arbor
* Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html

- 2 - Deterministic Replay
Goal: record and reproduce multithreaded execution
− Debugging concurrency bugs
− Offline heavyweight dynamic analysis
− Forensics and intrusion detection
− … and many more uses
Problem: multithreaded record-and-replay is too slow (>2x) or requires custom hardware

- 3 - Multithreaded Record-and-Replay is Slow
− Checkpoint memory and register state
− Log non-deterministic program input (interrupts, I/O values, DMA, etc.)
− Log shared memory dependencies
[Figure: Thread 1, Thread 2, and Thread 3, with Write and Read accesses to shared memory]

- 4 - Replay for Data-Race-Free Programs is Cheap
Data-race-free programs
− Shared memory accesses are well ordered by synchronization ops.
− Recording the happens-before order of sync. ops. is sufficient
Problem: programs with data races
[Figure: threads T1, T2, and T3 performing memory ops. (X=0, Y=0, X=1, Y=1, Y=2, Z=1, X=2, Z=2) and sync. ops. (Lock(l), Unlock(l), Signal(c), Wait(c)); legend: order of mem. ops., order of sync. ops.]
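
As a rough illustration of the slide above (not Chimera's actual recorder), the sketch below logs the order of lock acquisitions while recording and enforces the same order while replaying. The names `sync_log`, `record_acquire`, and `replay_try_acquire` are hypothetical. It assumes a single global log, which is stricter than the happens-before order the slide asks for, and it is not thread-safe on its own: the caller is assumed to serialize access to the log.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_CAPACITY 1024

/* One entry per lock acquisition, in the order the acquisitions happened. */
typedef struct {
    int lock_id;
    int thread_id;
} sync_event;

typedef struct {
    sync_event events[LOG_CAPACITY];
    size_t len;     /* entries written while recording */
    size_t cursor;  /* next entry to enforce while replaying */
} sync_log;

/* Recording: call right after a thread acquires the lock. */
void record_acquire(sync_log *log, int lock_id, int thread_id) {
    assert(log->len < LOG_CAPACITY);
    log->events[log->len].lock_id = lock_id;
    log->events[log->len].thread_id = thread_id;
    log->len++;
}

/* Replay: returns 1 and advances the cursor if this thread is the next
 * logged acquirer of this lock; returns 0 if the caller must retry later. */
int replay_try_acquire(sync_log *log, int lock_id, int thread_id) {
    if (log->cursor >= log->len) {
        return 0;
    }
    const sync_event *next = &log->events[log->cursor];
    if (next->lock_id != lock_id || next->thread_id != thread_id) {
        return 0;
    }
    log->cursor++;
    return 1;
}
```

During replay, a thread that gets 0 back would wait and retry, so lock hand-offs happen in the recorded order even if the scheduler behaves differently.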

- 5 - Our Contribution: A Hybrid Analysis
Chimera: potentially racy program P to data-race-free program P’
− Sound static data race analysis
− Add synchronizations for potential data races
Problem: too many false positives
− Profiling non-concurrent code regions
− Symbolic bounds analysis

- 6 - Roadmap
Motivation
Chimera Analysis
  1) Static data race analysis
  2) Profiling non-concurrent code regions
  3) Symbolic bounds analysis
Weak-lock Design
Evaluation
Conclusion

- 8 - Static Data Race Analysis
Find potential data races using a sound static data race detector
− RELAY [Voung et al., FSE’07]
Protect all potential data races using weak-locks
− A new time-out lock which may be preempted (discussed later)
Record and replay the happens-before order of weak-locks

- 9 - Protect Potential Races using Weak-locks
Two potential racy pairs are marked in the code
Static analysis helps avoid instrumentation for the access to Z (no race report)
    void foo() {
      X = 0;
      for(i = … ){
        Y[ tid ][ i ] = 0;
      }
    }
    void bar() {
      X = 1;
      for(i = … ){
        Y[ tid ][ i ] = 1;
      }
      Z = 1;
    }
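
The instrumentation on the slide above might look roughly like the following sketch. Everything beyond `foo`, `bar`, `X`, `Y`, `Z`, and `tid` is an assumption: `weak_lock`, `wl_acquire`, and `wl_release` are hypothetical stand-ins built on a plain C11 spinlock (no time-out, no logging), and the loop bound `N`, the thread count, and the placement of `Z = 1` after the loop are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

#define N 4            /* hypothetical loop bound */
#define NUM_THREADS 2  /* hypothetical thread count */

int X, Z;
int Y[NUM_THREADS][N];

/* Hypothetical stand-in for a weak-lock: a plain spinlock. */
typedef struct { atomic_flag held; } weak_lock;

static void wl_acquire(weak_lock *wl) {
    while (atomic_flag_test_and_set(&wl->held)) { /* spin */ }
}

static void wl_release(weak_lock *wl) {
    atomic_flag_clear(&wl->held);
}

/* One weak-lock per potential racy pair reported by the static analysis. */
static weak_lock wl_X = { ATOMIC_FLAG_INIT };
static weak_lock wl_Y = { ATOMIC_FLAG_INIT };

void foo(int tid) {
    assert(tid >= 0 && tid < NUM_THREADS);
    wl_acquire(&wl_X); X = 0; wl_release(&wl_X);
    for (int i = 0; i < N; i++) {
        wl_acquire(&wl_Y); Y[tid][i] = 0; wl_release(&wl_Y);
    }
}

void bar(int tid) {
    assert(tid >= 0 && tid < NUM_THREADS);
    wl_acquire(&wl_X); X = 1; wl_release(&wl_X);
    for (int i = 0; i < N; i++) {
        wl_acquire(&wl_Y); Y[tid][i] = 1; wl_release(&wl_Y);
    }
    Z = 1;  /* no race report for Z, so no weak-lock is added */
}
```

Locking every racy instruction inside the loop is what makes this instruction-level scheme expensive, which motivates the coarser function-level and loop-level weak-locks on later slides.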

- 10 - Sources of False Positives in RELAY
A sound data race detector reports too many false data races
− 53x overhead
Source 1: Non-mutex synchronizations are ignored
− Lockset-based analysis ignores fork-join, barrier, signal-wait, etc.
− May report a false data race between memory instructions that can never execute concurrently
− Solution: profiling non-concurrent code regions
Source 2: Conservative pointer analysis
− Overestimates the variables accessed by a memory instruction
− May report a false data race between memory instructions that can never access the same location
− Solution: symbolic bounds analysis

- 12 - Profiling Non-concurrent Code Regions
Problem
− Lockset-based analysis ignores non-mutex synchronization ops.
Solution
− Profile non-concurrent code regions (e.g., functions)
− Increase the granularity of weak-locks to protect a larger code region instead of each potential racy instruction
− Parallelism is preserved unless mis-profiled
[Figure: T1 runs foo() before a BARRIER; T2 runs bar() after the BARRIER; the reported race between foo() and bar() is a false race]
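
The slides do not say how the profiler decides that two functions are non-concurrent. One plausible approach, sketched below purely as an assumption, is to record an enter and leave timestamp for every call during profiling runs and check whether any call of one function overlaps in time with a call of the other on a different thread. The `call_interval` type and `observed_concurrent` function are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One profiled function invocation (hypothetical record format). */
typedef struct {
    int  func_id;
    int  thread_id;
    long enter;  /* timestamp at function entry */
    long leave;  /* timestamp at function exit */
} call_interval;

static bool overlaps(const call_interval *a, const call_interval *b) {
    return a->enter < b->leave && b->enter < a->leave;
}

/* Returns true if some call of f and some call of g ran at the same time
 * on different threads in the profiled run. */
bool observed_concurrent(const call_interval *calls, size_t n, int f, int g) {
    assert(calls != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (calls[i].func_id != f || calls[j].func_id != g) continue;
            if (calls[i].thread_id == calls[j].thread_id) continue;
            if (overlaps(&calls[i], &calls[j])) return true;
        }
    }
    return false;
}
```

If no overlap is ever observed, a function-level weak-lock is unlikely to cause contention. A profile that misses a real overlap costs parallelism rather than correctness, matching the slide's note that parallelism is preserved unless mis-profiled.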

- 13 - Function-Level Weak-Locks
Used if the profiler says foo() and bar() are not likely to run concurrently
[Figure: foo(), then a BARRIER, then bar(); the reported race is a false race]
    void foo() {
      X = 0;
      for(i = … ){
        Y[ tid ][ i ] = 0;
      }
    }
    void bar() {
      X = 1;
      for(i = … ){
        Y[ tid ][ i ] = 1;
      }
      Z = 1;
    }

- 14 - Roadmap
Motivation
Chimera Analysis
  1) Static data race analysis
  2) Profiling non-concurrent code regions
  3) Symbolic bounds analysis
Design
Evaluation
Conclusion

- 15 - Imprecision in Conservative Pointer Analysis
[Figure: threads T1 and T2 with foo(), BARRIER, and bar(); label: may run concurrently]

- 16 - Imprecision in Conservative Pointer Analysis
RELAY uses Steensgaard’s and Andersen’s pointer analyses
− Flow-insensitive and context-insensitive (FICI) analysis
− Naming heap objects is conservative
Overestimates the variables accessed by a memory instruction
    void foo() {
      …
      for(i = 0 to N){
        Y[ tid ][ i ] = 0;
      }
      …
    }
    void bar() {
      …
      for(i = 0 to N){
        Y[ tid ][ i ] = 1;
      }
      …
    }
[Figure: Y[][] with the parts accessed by Thread 1 and Thread 2; the potential racy pair is a false race]

- 17 - Symbolic Bounds Analysis
Our solution
− Derive the symbolic lower and upper bounds that a racy code region (e.g., a loop) may access [Rugina and Rinard, PLDI’00]
− Increase the granularity of weak-locks to protect a larger code region for a set of addresses specified by a symbolic expression
− Parallelism is preserved if the bounds are precise enough
    void foo() {
      …
      for(i = 0 to N){
        Y[ tid ][ i ] = 0;
      }
      …
    }
Bounds from symbolic bounds analysis: &Y[tid][0] to &Y[tid][N]

- 18 - Loop-level Weak-locks
Symbolic bounds: &Y[tid][0] ~ &Y[tid][N]
Weak-lock address range: (&Y[tid][0], &Y[tid][N])
    void foo() {
      X = 0;
      for(i = 0 to N){
        Y[ tid ][ i ] = 0;
      }
    }
    void bar() {
      X = 1;
      for(i = 0 to N){
        Y[ tid ][ i ] = 1;
      }
      Z = 1;
    }
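
To see why a range-based weak-lock preserves parallelism here, consider the following sketch. It assumes, hypothetically, that two loop-level weak-lock holders conflict only when their address ranges overlap; the slides do not spell out this rule. `addr_range`, `loop_bounds`, and `ranges_overlap` are illustrative names, and `N` and the thread count are made-up values. The loop is taken to run from 0 to N inclusive, so each row of `Y` has N + 1 elements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N 7            /* hypothetical inclusive upper loop bound */
#define NUM_THREADS 2  /* hypothetical thread count */

int Y[NUM_THREADS][N + 1];

/* Closed address range [lo, hi] that a loop may touch. */
typedef struct {
    uintptr_t lo;
    uintptr_t hi;
} addr_range;

/* Bounds for the loop in foo()/bar(): &Y[tid][0] to &Y[tid][N]. */
addr_range loop_bounds(int tid) {
    assert(tid >= 0 && tid < NUM_THREADS);
    addr_range r = { (uintptr_t)&Y[tid][0], (uintptr_t)&Y[tid][N] };
    return r;
}

/* Two closed ranges overlap when each one starts before the other ends. */
bool ranges_overlap(addr_range a, addr_range b) {
    return a.lo <= b.hi && b.lo <= a.hi;
}
```

With these bounds, threads with different `tid` values get disjoint ranges and could run their loops in parallel, while two loops over the same row would be serialized.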

- 19 - Imprecise Symbolic Bounds
Sources
− Bounds depend on a value computed inside the code region
− Bounds depend on arithmetic operations not supported in the analysis (e.g., modulo operations, logical AND/OR, etc.)
Choosing the optimal granularity
− If the bounds are too imprecise and the loop body is long enough, resort to instruction (basic-block) level weak-locks for parallelism
    void qux() {
      …
      for(i = 0 to N){
        prev = Z[ prev ];
      }
      …
    }
Bounds from symbolic bounds analysis: -INF to +INF
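
The granularity rule on the slide above could be expressed as a small decision function. This is only a sketch: the slides give no concrete threshold for "long enough", so `LONG_LOOP_BODY` is a made-up value, "imprecise" is simplified to "a bound is unknown (-INF or +INF)", and falling back to a loop-level weak-lock in every other case is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { LOOP_LEVEL, INSTR_LEVEL } granularity;

/* Result of symbolic bounds analysis, reduced to whether each bound
 * could be derived at all. */
typedef struct {
    bool lower_known;
    bool upper_known;
} symbolic_bounds;

#define LONG_LOOP_BODY 50  /* hypothetical threshold, in instructions */

granularity choose_granularity(symbolic_bounds b, int body_instructions) {
    assert(body_instructions >= 0);
    bool imprecise = !b.lower_known || !b.upper_known;
    if (imprecise && body_instructions >= LONG_LOOP_BODY) {
        /* A loop-level lock over an unbounded range would serialize a long
         * loop, so pay per-instruction locking cost to keep parallelism. */
        return INSTR_LEVEL;
    }
    return LOOP_LEVEL;
}
```

For `qux()` on the slide, the bounds are -INF to +INF, so the choice would depend only on how long the loop body is.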

- 20 - Roadmap
Motivation
Chimera Analysis
Weak-lock Design
Evaluation
Conclusion

- 21 - Deadlock due to Weak-locks
No deadlocks between weak-locks
− function-level > loop-level > instruction-level
Deadlock between weak-locks and original sync. ops. is possible
[Figure: T1 … wait(cv) …; T2 signal(cv) …; time-out]

- 22 - Weak-lock Time-out
A weak-lock might time out
− Invoke a special system call to handle it
Weak-lock guarantee
− Only one thread holds a given weak-lock at any given time
− Mutual exclusion may be compromised, but this is sufficient for replay
[Figure: T1 … wait(cv) …; T2 … signal(cv) …; time-out; labels: current owner, logged order of weak-locks]
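
A user-level approximation of the time-out behavior might look like the sketch below. It is not the authors' implementation: the real system handles time-outs with a special system call, which `weak_lock_preempt` only stands in for, and a bounded spin count replaces a real timer. All names are hypothetical. The sketch keeps the stated guarantee that a weak-lock has at most one owner at a time, while a preempted owner may still be inside its critical region.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define WEAK_LOCK_FREE (-1)

typedef struct {
    atomic_int owner;  /* thread id of the current owner, or WEAK_LOCK_FREE */
} weak_lock;

void weak_lock_init(weak_lock *wl) {
    atomic_init(&wl->owner, WEAK_LOCK_FREE);
}

/* Try to take the lock, giving up after max_spins attempts.
 * Returns true on success and false on time-out. */
bool weak_lock_acquire(weak_lock *wl, int tid, long max_spins) {
    assert(tid != WEAK_LOCK_FREE);
    for (long i = 0; i < max_spins; i++) {
        int expected = WEAK_LOCK_FREE;
        if (atomic_compare_exchange_strong(&wl->owner, &expected, tid)) {
            return true;
        }
    }
    return false;
}

/* Release only if the caller still owns the lock; a preempted former
 * owner's release is a no-op. */
void weak_lock_release(weak_lock *wl, int tid) {
    int expected = tid;
    atomic_compare_exchange_strong(&wl->owner, &expected, WEAK_LOCK_FREE);
}

/* Time-out handler (stand-in for the special system call): forcibly hand
 * ownership to new_tid. Returns the previous owner so that the hand-over
 * can be added to the logged order of weak-lock acquisitions. */
int weak_lock_preempt(weak_lock *wl, int new_tid) {
    return atomic_exchange(&wl->owner, new_tid);
}
```

A waiter that times out would call `weak_lock_preempt`, log the ownership change, and proceed. This breaks the wait(cv)/signal(cv) deadlock shown on the slide at the cost of strict mutual exclusion.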

- 23 - Roadmap
Motivation
Chimera Analysis
Weak-lock Design
Evaluation
Conclusion

- 24 - Implementation
Source-to-source instrumentation
− Implemented in OCaml using CIL as a front end
Static analysis
− Data race detection: RELAY [Voung et al., FSE’07]; includes all library source code for soundness (uClibc’s libc, libm, etc.)
− Symbolic bounds analysis: [Rugina and Rinard, PLDI’00]; intra-procedural analysis for racy loops only
Runtime system
− Modified Linux kernel to record/replay program input
− Modified pthread library to record/replay the happens-before order of original synchronization operations and weak-locks

- 25 - Evaluation Setup
Test environment
− 2.66 GHz 8-core Xeon processor with 4 GB of RAM
− Different sets of inputs for profiling and performance evaluation
− Average of five trials with 4 worker threads
− 2, 4, and 8 threads for scalability results
Benchmarks
− Desktop applications: aget, pfscan, and pbzip2
− Server programs: knot and apache
− SPLASH-2 suite: ocean, water-nsq, fft, and radix

- 26 - Record and Replay Performance
Recording: 39% overhead on average
Replay: similar to recording; much lower for I/O-intensive programs
[Chart labels: 2.4% slowdown, 86% slowdown, 39%]

- 27 - Effectiveness of Coarse-grained Weak-locks
[Chart label: 53x]

- 28 - Effectiveness of Coarse-grained Weak-locks
Coarse-grained weak-locks reduce the cost of instrumentation

- 29 - Effectiveness of Coarse-grained Weak-locks
Coarse-grained weak-locks reduce the cost of instrumentation
Exception: control-flow dependency (e.g., pfscan)

- 31 - Effectiveness of Coarse-grained Weak-locks
Coarse-grained weak-locks reduce the cost of instrumentation
Exception: control-flow dependency (e.g., pfscan)
[Chart label: 1.39x]

- 32 - Breakdown of Recording Overhead
Weak-lock overhead = contention (waiting) cost + logging cost
[Chart legend: func locks, loop locks, instr/bb locks, sync op & system log]

- 33 - Breakdown of Recording Overhead
Weak-lock overhead = contention (waiting) cost + logging cost
[Chart legend: func wait, loop wait, instr/bb wait, func log, loop log, instr/bb log, sync op & system log]
[Chart annotations: high loop-lock contention, high instr/bb-lock contention]

- 34 - Scalability
Scientific applications scale worse due to imprecise symbolic bounds analysis

- 35 - Conclusion
Goal: software-only deterministic multiprocessor replay systems
Chimera Analysis
Static data race analysis
− Find and protect potential data races with weak-locks
− Instruction/basic-block-level weak-locks
Profiling non-concurrent code regions
− Address the inadequacy of the lockset-based algorithm
− Function-level weak-locks
Symbolic bounds analysis
− Address the imprecision of conservative pointer analysis
− Loop-level weak-locks
Low recording overhead
− 39% recording overhead for 4 worker threads

- 36 - Thank you
