Scalable Dynamic Analysis for Automated Fault Location and Avoidance
Rajiv Gupta
Funded by NSF grants from the CPA, CSR, and CRI programs and by grants from Microsoft Research.

Motivation
- Software bugs cost the U.S. economy about $59.5 billion each year [NIST 02].
- Embedded systems perform mission-critical and safety-critical tasks, where a failure can lead to loss of mission or life:
  - (Ariane 5) An arithmetic overflow led to the shutdown of the guidance computer.
  - (Mars Climate Orbiter) A missed unit conversion led to faulty navigation data.
  - (Mariner I) A missing superscripted bar in the specification for the guidance program led to its destruction 293 seconds after launch.
  - (Mars Pathfinder) A priority inversion error caused system resets.
  - (Boeing) Loss of engine and flight displays while in flight.
  - (Toyota hybrid Prius) VSC: the gasoline-powered engine shut off.
  - (Therac-25) Wrong dosage during radiation therapy.
  - …

Overview
- Program execution: fault location via dynamic slicing (offline) and fault avoidance for environment faults (online).
- Scalability: tracing + logging for long-running, multi-threaded programs.

Fault Location
Goal: assist the programmer in debugging by automatically narrowing the fault down to a small section of the code.
- Dynamic information: data dependences, control dependences, values.
- Execution runs: one failed execution and its perturbations.

Dynamic Information (figure: a program and its execution yield a dynamic dependence graph with data and control dependence edges)

Approach
Detect the execution of a statement s such that:
- the faulty code affects the value computed by s; or
- the faulty code is affected by the value computed by s through a chain of dependences.
Estimate the set of potentially faulty statements from s:
- Affects: statements from which s is reachable in the dynamic dependence graph (backward slice).
- Affected-by: statements that are reachable from s in the dynamic dependence graph (forward slice).
Intersect the slices to obtain a smaller set of fault candidates.
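The slice computation and intersection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The dynamic dependence graph is given as (user, definer) pairs over hypothetical node names. The intersection is taken between the backward slice of the erroneous output and the forward slice of the failure-inducing input.

```python
from collections import defaultdict, deque

def reachable(graph, start):
    """Return all nodes reachable from `start` (inclusive) by breadth-first search."""
    seen = {start}
    work = deque([start])
    while work:
        node = work.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return seen

def fault_candidates(dep_edges, erroneous_output, failure_input):
    """dep_edges holds (user, definer) pairs: `user` depends on `definer`.

    Returns (backward slice, forward slice, their intersection).
    """
    depends_on = defaultdict(set)   # user -> definers (walk backward)
    used_by = defaultdict(set)      # definer -> users (walk forward)
    for user, definer in dep_edges:
        depends_on[user].add(definer)
        used_by[definer].add(user)
    backward = reachable(depends_on, erroneous_output)  # affects the bad output
    forward = reachable(used_by, failure_input)         # affected by the bad input
    return backward, forward, backward & forward

# Hypothetical run:  s1: n = read()   s2: k = 10   s3: m = n * 2
#                    s4: print(m + k)  (wrong)     s5: log(n)
edges = [("s3", "s1"), ("s4", "s3"), ("s4", "s2"), ("s5", "s1")]
bs, fs, candidates = fault_candidates(edges, erroneous_output="s4", failure_input="s1")
print(sorted(candidates))  # ['s1', 's3', 's4']
```

In this toy run, s2 is dropped because it does not depend on the input, and s5 is dropped because it does not reach the output.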

Backward & Forward Slices (figure: the backward slice of the erroneous output [Korel & Laski, 1988] and the forward slice of the failure-inducing input [ASE-05])

Backward & Forward Slices, continued (figure: failure-inducing input and erroneous output) [ASE-05]
- For memory bugs, the number of statements is very small (fewer than 5).

Bidirectional Slices [ICSE-06]
- Critical predicate (CP): an execution instance of a predicate such that changing its outcome "repairs" the program state.
- Bidirectional slice: the backward slice of the CP plus the forward slice of the CP; it contributes to the combined slice.
- Critical predicates were found in 12 out of 15 bugs.
- Search for the critical predicate: brute force examines 32 to 155K predicates; after filtering and ordering, 1 to 7K predicates.
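The brute-force search for a critical predicate can be illustrated with a hypothetical faulty program. Its predicate instances can be switched one at a time. This sketch flips instances in plain execution order. It does not implement the filtering and ordering heuristics mentioned above. The program, the predicate id "P1", and the helper names are all illustrative assumptions.

```python
def buggy_sum_of_positives(values, flip=None):
    """Hypothetical faulty program: should sum values > 0 but tests v > 1.

    flip: a (predicate_id, instance) pair whose outcome is inverted, or None.
    Returns (output, executed predicate instances in execution order).
    """
    executed, counts = [], {}

    def predicate(pid, outcome):
        counts[pid] = counts.get(pid, 0) + 1
        instance = (pid, counts[pid])
        executed.append(instance)
        return (not outcome) if instance == flip else outcome

    total = 0
    for v in values:
        if predicate("P1", v > 1):  # bug: should be v > 0
            total += v
    return total, executed

def find_critical_predicate(run, expected_output):
    """Brute force: flip one predicate instance at a time, in execution order,
    and return the first one whose switched outcome makes the output correct."""
    output, instances = run(None)
    if output == expected_output:
        return None  # the run did not fail
    for instance in instances:
        if run(instance)[0] == expected_output:
            return instance
    return None

critical = find_critical_predicate(lambda f: buggy_sum_of_positives([3, 1, 2], f), 6)
print(critical)  # ('P1', 2)
```

The second execution of P1 (for v = 1) is critical because forcing it to true repairs the output. Its backward and forward slices would then form the bidirectional slice.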

Pruning Slices [PLDI-06]
- Confidence in a value v: C(v) ∈ [0, 1].
  - C(v) = 0: all values of v produce the same output.
  - C(v) = 1: any change in v will change the output.
- How? Value profiles.
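The slide gives only the endpoints of C(v). One formula consistent with both endpoints is C(v) = 1 − log_{|Range(v)|} |Alt(v)|. Here Alt(v) is the set of values of v that would still yield the same correct output. This formula is an assumption in this sketch, as is the pruning rule: drop statements with C = 1 and rank the rest by ascending confidence.

```python
import math

def confidence(num_alternatives, range_size):
    """Assumed confidence measure C(v) = 1 - log_{|Range(v)|} |Alt(v)|.

    num_alternatives: |Alt(v)|, how many values of v (v itself included) would
        still produce the same correct output, e.g. estimated from value profiles.
    range_size: |Range(v)|, how many values v could take (must be at least 2).
    """
    if range_size < 2 or not 1 <= num_alternatives <= range_size:
        raise ValueError("need 1 <= |Alt(v)| <= |Range(v)| and |Range(v)| >= 2")
    return 1.0 - math.log(num_alternatives, range_size)

def prune(slice_confidence):
    """Assumed rule: drop fully trusted statements; rank the rest, least confident first."""
    kept = [stmt for stmt, c in slice_confidence.items() if c < 1.0]
    return sorted(kept, key=lambda stmt: slice_confidence[stmt])

print(confidence(1, 256))    # 1.0: any change in v changes the output
print(confidence(256, 256))  # 0.0: every value of v gives the same output
print(prune({"s1": confidence(1, 256), "s2": confidence(16, 256), "s3": confidence(200, 256)}))
# ['s3', 's2']
```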

Test Programs
Real reported bugs:
- Nine logical bugs (incorrect output) in Unix utilities: grep 2.5, grep 2.5.1, flex, make.
- Six memory bugs (program crashes) in Unix utilities: gzip, ncompress, polymorph, tar, bc, tidy.
Injected bugs:
- Siemens suite (numerous versions): schedule, schedule2, replace, print_tokens, …
- Unix utilities: gzip, flex.

Dynamic Slice Sizes: table of backward (BS), forward (FS), and bidirectional (BiS) slice sizes for the buggy runs flex (a), (b), (c); grep 2.5; grep 2.5.1 (a), (b), (c); make 3.80 (a), (b); gzip; ncompress; polymorph; tar; bc; and tidy. BS is NA for the grep runs.

Combined Slices (BS ∩ FS ∩ BiS, as a percentage of BS, or of executed statements (EXEC) where BS is NA)
Buggy run        BS     BS ∩ FS ∩ BiS
flex (a)         695    27 (3.9%)
flex (b)                (37.5%)
flex (c)                (10%)
grep 2.5         NA     86 (7.4% of EXEC)
grep 2.5.1 (a)   NA     25 (4.9% of EXEC)
grep 2.5.1 (b)   NA     599 (53.3% of EXEC)
grep 2.5.1 (c)   NA     12 (0.9% of EXEC)
make 3.80 (a)           (81.4%)
make 3.80 (b)           (75.3%)
gzip                    (8.8%)
ncompress               (14.3%)
polymorph               (14.3%)
tar                     (42.9%)
bc                      (50%)
tidy                    (29.1%)

Evaluation of Pruning
Benchmarks (the slide also lists LOC, number of versions, and number of tests for each):
Program         Description
print_tokens    Lexical analyzer
print_tokens2   Lexical analyzer
replace         Pattern replacement
schedule        Priority scheduler
schedule2       Priority scheduler
gzip            Unix utility
flex            Unix utility
Siemens suite: a single error is injected in each version. Not all versions are included; a version is excluded when:
- there is no output, or the very first output is wrong; or
- the root cause is not contained in the BS (code-missing error).

Evaluation of Pruning, continued: table of the backward slice (BS), the pruned slice, and pruned slice / BS (%) for print_tokens, print_tokens2, replace, schedule, schedule2, gzip, and flex.

Effectiveness
- Backward slice [AADEBUG-05]: ≈ 31% of executed statements.
- Combined slice [ASE-05, ICSE-06] (erroneous output, failure-inducing input, critical predicate): ≈ 36% of the backward slice, ≈ 11% of executed statements.
- Pruned slice [PLDI-06] (confidence analysis): ≈ 41% of the backward slice, ≈ 13% of executed statements.

Effectiveness
Slicing is effective in locating faults: no more than 10 static statements had to be inspected.
Program – bug            Inspected statements
mutt – heap overflow     8
pine – stack overflow    3
pine – heap overflow     10
mc – stack overflow      2
squid – heap overflow    5
bc – heap overflow       3

Execution Omission Errors [PLDI-07]
(figure: the statements X = …, A = …, the predicate A < 0, and the use … = X, shown in three stages, the last with an implicit dependence edge)
1. Inspect the pruned slice.
2. Dynamically detect an implicit dependence.
3. Incrementally expand the pruned slice.
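An implicit dependence can be detected by switching the predicate's outcome and checking whether the value read at the use changes. This is a sketch of that idea under assumptions. The toy function mimics the slide's fragment. It shows why an ordinary dynamic slice misses the predicate when the second definition of X is omitted. All names are hypothetical.

```python
def run(a, flip_predicate=False):
    """Hypothetical version of the slide's fragment:
    X = ...; A = ...; if (A < 0) X = ...; ... = X
    Returns the value read by the final use of X."""
    x = 1                    # X = ...  (always executed)
    taken = a < 0            # predicate A < 0
    if flip_predicate:
        taken = not taken    # force the opposite branch outcome
    if taken:
        x = 2                # X = ...  (omitted when A >= 0)
    return x                 # ... = X

def implicit_dependence(a):
    """The use of X implicitly depends on the predicate if switching the
    predicate's outcome changes the value the use observes."""
    return run(a) != run(a, flip_predicate=True)

# With A = 5 the second definition of X is omitted, so the ordinary dynamic
# slice of "... = X" misses the predicate; switching it exposes the dependence.
print(implicit_dependence(5))  # True
```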

Scalability of Tracing
Dynamic information needed:
- Dynamic dependences: for all slicing.
- Values: for confidence analysis, i.e., for pruning slices.
Whole Execution Trace (WET): annotates the static program representation with this information. Trace size ≈ 15 bytes per instruction.

Trace Sizes & Collection Overheads
Trace sizes are very large even for executions of about 10 seconds.
Program    Running time   Dep. trace size   Collection time
mysql      13 s           21 GB             2886 s
prozilla   8 s            6 GB              2640 s
proxyC     10 s           456 MB            880 s
mc         10 s           55 GB             418 s
mutt       20 s           388 GB            3238 s
pine       14 s           156 GB            2088 s
squid      15 s           88 GB             1132 s

Compacting Whole Execution Traces
- Explicitly store the dynamic control-flow trace.
- Infer as many dynamic dependences as possible from control flow (≈ 94%); store the remaining dependences explicitly (≈ 6%).
- Use a specialized graph representation to enable this inference.
- Explicitly store the value trace.
- Use a context-based method to compress the dynamic control-flow, value, and address traces.
- Support bidirectional traversal with equal ease [MICRO-04, TACO-05].
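As an illustration of context-based compression, the sketch below uses a generic order-k finite-context-method (FCM) predictor. It was chosen as a stand-in and is not necessarily the scheme of [MICRO-04, TACO-05]. Each trace element is replaced by a hit flag when the value last seen after the same context predicts it correctly. Only mispredicted values are stored. The decoder reruns the same predictor.

```python
_MISSING = object()  # sentinel: context never seen before

def fcm_encode(stream, order=2):
    """Encode a trace with an order-k finite-context-method (FCM) predictor.

    Each element becomes True when the value last seen after the same k-value
    context predicts it correctly, otherwise ("miss", value). Repetitive
    control-flow, value, or address traces become mostly True flags.
    """
    table, history, out = {}, (), []
    for value in stream:
        out.append(True if table.get(history, _MISSING) == value else ("miss", value))
        table[history] = value
        history = (history + (value,))[-order:]
    return out

def fcm_decode(encoded, order=2):
    """Invert fcm_encode by running the same predictor."""
    table, history, out = {}, (), []
    for item in encoded:
        value = table[history] if item is True else item[1]
        out.append(value)
        table[history] = value
        history = (history + (value,))[-order:]
    return out

trace = [1, 2, 3] * 4                 # e.g. a loop's basic-block sequence
encoded = fcm_encode(trace)
print(sum(item is True for item in encoded), "of", len(trace), "predicted")  # 7 of 12 predicted
assert fcm_decode(encoded) == trace
```

The predictor warms up on the first repetition of the loop and then predicts every block. Longer, repetitive traces therefore compress much better than this short example suggests.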

Dependence Graph Representation
Example program (input: N = 2):
  1: z = 0
  2: a = 0
  3: b = 2
  4: p = &b
  5: for i = 1 to N do
  6:   if (i % 2 == 0) then
  7:     p = &a
       endif
  8:   a = a + 1
  9:   z = 2 * (*p)
     endfor
  10: print(z)
(figure: dynamic dependence graph over the execution instances 1_1, 2_1, 3_1, 4_1, 5_1, 6_1, 8_1, 9_1, 5_2, 6_2, 7_1, 8_2, 9_2, 10_1)

Dependence Graph Representation, continued
(figure: static graph of statements 1 to 10 with the T/F branches of statements 5 and 6)
Execution trace for input N = 2:
  1_1: z=0   2_1: a=0   3_1: b=2   4_1: p=&b
  5_1: for i = 1 to N do   6_1: if (i%2==0) then   8_1: a=a+1   9_1: z=2*(*p)
  5_2: for i = 1 to N do   6_2: if (i%2==0) then   7_1: p=&a   8_2: a=a+1   9_2: z=2*(*p)
  10_1: print(z)
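To make the example concrete, the sketch below executes the example program from the Dependence Graph Representation slides in Python. It records each statement instance (e.g. "8_2") and the data dependences between instances. Assumptions: the pointer p is modeled as the name of its target, the loop header is treated as defining i, and control dependences are omitted.

```python
from collections import Counter

def run_example(n):
    """Execute the slide's example program, recording every statement
    instance (e.g. "8_2") and the data dependences between instances."""
    counts, last_def = Counter(), {}
    trace, data_deps = [], []

    def step(stmt, uses=(), defines=None):
        counts[stmt] += 1
        instance = f"{stmt}_{counts[stmt]}"
        trace.append(instance)
        for location in uses:            # edge: this use -> reaching definition
            data_deps.append((instance, last_def[location]))
        if defines is not None:
            last_def[defines] = instance

    mem = {}
    step(1, defines="z")
    mem["z"] = 0
    step(2, defines="a")
    mem["a"] = 0
    step(3, defines="b")
    mem["b"] = 2
    step(4, defines="p")
    p = "b"                              # p = &b: a pointer is modeled as its target's name
    for i in range(1, n + 1):
        step(5, defines="i")             # loop header assigns i
        step(6, uses=("i",))
        if i % 2 == 0:
            step(7, defines="p")
            p = "a"                      # p = &a
        step(8, uses=("a",), defines="a")
        mem["a"] += 1
        step(9, uses=("p", p), defines="z")   # reads p and *p
        mem["z"] = 2 * mem[p]
    step(10, uses=("z",))
    return trace, data_deps, mem["z"]

trace, deps, printed = run_example(2)
print(trace)    # ['1_1', '2_1', '3_1', '4_1', '5_1', '6_1', '8_1', '9_1',
                #  '5_2', '6_2', '7_1', '8_2', '9_2', '10_1']
print(printed)  # 4
```

The recorded edges show how the same static statement 9 depends on different definitions in different iterations. 9_1 reads p from 4_1 and *p = b from 3_1, while 9_2 reads p from 7_1 and *p = a from 8_2.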

Transform: Traces of Blocks

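Infer: Local Dependence Labels (figure: a block containing X = … and Y = X executes at timestamps 10, 20, 30, producing the local labels (10,10), (20,20), (30,30), which follow from the block's timestamp list 10, 20, 30; a use = Y at timestamp 21 in another block gives the non-local label (20,21))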
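The inference step can be sketched as a filter over dependence labels. This relies on an assumption read off the figure: timestamps are assigned per block execution. Under that assumption, a dependence whose definition and use run in the same block execution carries a label (t, t). Such labels are implied by the control-flow trace and need no storage. The function name and label layout (definition timestamp, use timestamp) are hypothetical.

```python
def split_labels(block_timestamps, edge_labels):
    """Separate inferable dependence labels from ones that must be stored.

    block_timestamps: timestamps at which a block executed (e.g. [10, 20, 30]).
    edge_labels: observed (def_timestamp, use_timestamp) pairs for one static
        dependence edge whose use lies in that block.
    A label (t, t) with t an execution of the block is local: it follows from
    the control-flow trace, so only the remaining labels need explicit storage.
    """
    executions = set(block_timestamps)
    inferred, explicit = [], []
    for def_ts, use_ts in edge_labels:
        if def_ts == use_ts and use_ts in executions:
            inferred.append((def_ts, use_ts))
        else:
            explicit.append((def_ts, use_ts))
    return inferred, explicit

# Edge X = ... -> Y = X inside one block executed at 10, 20, 30: fully inferable.
print(split_labels([10, 20, 30], [(10, 10), (20, 20), (30, 30)]))
# Edge Y = ... -> ... = Y where the use runs in a later block at 21: stored.
print(split_labels([21], [(20, 21)]))
```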

Transform: Local Dependence Labels (figure: a block X = …; *P = …; Y = X executed at timestamps 10 and 20, with labels (10,10) and (20,20))

Transform: Local Dependence Labels, continued (figure: the same block followed by a use = Y executed at timestamps 11 and 21, giving labels (10,11) and (20,21))

Group: Non-Local Dependence Edges (figure: blocks X = …; Y = … and = Y; = X, with non-local edges labeled (10,21) and (20,11))

Compacted WET Sizes: table of statements executed (millions) and WET size (MB) before and after compaction, with the before/after ratio, for 300.twolf, 256.bzip2, 255.vortex, 197.parser, 181.mcf, 164.gzip, 130.li, 126.gcc, 099.go, and the average. After compaction, the WET takes ≈ 4 bits per instruction.
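The arithmetic below assumes that the ≈ 15 bytes per instruction quoted on the Scalability of Tracing slide is the uncompacted WET size. Under that assumption, the compacted size of ≈ 4 bits per instruction implies roughly a thirtyfold reduction:

```latex
\frac{15\,\tfrac{\text{bytes}}{\text{instr}} \times 8\,\tfrac{\text{bits}}{\text{byte}}}
     {4\,\tfrac{\text{bits}}{\text{instr}}}
  = \frac{120}{4} = 30
```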

[PLDI-04] vs. [ICSE-03] Slicing Times

Dep. Graph Generation Times
- Offline post-processing after collecting address and control-flow traces: ≈ 35× the execution time.
- Online techniques [ICSM 2007]:
  - Information flow: 9× to 18× slowdown.
  - Basic-block optimization: 6× to 10× slowdown.
  - Trace-level optimization: 5.5× to 7.5× slowdown.
  - Dual core: ≈ 1.5× slowdown.
- Online filtering techniques:
  - Forward slice of all inputs.
  - User-guided bypassing of functions.

Reducing Online Overhead
- Record non-deterministic events online: less than 2× overhead, enabling deterministic replay of executions.
- Trace faulty executions offline: replay the execution, switch on tracing, then collect and inspect the traces.
- Trace analysis is still a problem: the traces correspond to huge executions, and the offline overhead of trace collection is still significant.

Reducing Trace Sizes
- Checkpointing schemes: trace from the most recent checkpoint. Checkpoints are on the order of minutes apart; this helps, but the trace sizes are still very large.
- Exploiting program characteristics: multithreaded and server-like programs [ISSTA-07, FSE-06], e.g., mysql and apache. Each request spawns a new thread, so irrelevant threads need not be traced.

Beyond Tracing
1. Checkpoint: capture the memory image.
2. Execute and record (log) events.
3. Upon a crash, roll back to the checkpoint.
4. Reduce the log and replay the execution using the reduced log.
5. Turn on tracing during replay.
(figure: checkpoint and log up to the crash; trace collected while replaying the reduced log)
Applicable to multithreaded programs [ISSTA-07].
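The checkpoint, log, rollback, and replay-with-tracing cycle can be sketched with a toy server. Its only nondeterminism is the stream of incoming requests. The class and function names are hypothetical. The log-reduction step (4) is omitted here, and a real system such as Jockey records far more kinds of events.

```python
import copy
import random

class EventSource:
    """Wraps a nondeterministic event source. In record mode it logs every
    value it produces; in replay mode it returns the logged values in order."""

    def __init__(self, replay_log=None):
        self.replaying = replay_log is not None
        self.log = list(replay_log) if self.replaying else []
        self._next = 0

    def next_event(self, produce):
        if self.replaying:
            value = self.log[self._next]
            self._next += 1
            return value
        value = produce()
        self.log.append(value)
        return value

def serve(state, source, count, trace=None):
    """Toy server: each request is a nondeterministic integer added to a total."""
    for _ in range(count):
        request = source.next_event(lambda: random.randint(0, 9))
        if trace is not None:            # tracing is switched on only during replay
            trace.append((state["total"], request))
        state["total"] += request
    return state

state = {"total": 0}
checkpoint = copy.deepcopy(state)        # 1. capture the memory image
recorder = EventSource()
serve(state, recorder, 5)                # 2. execute and log events
restored = copy.deepcopy(checkpoint)     # 3. roll back (e.g. after a crash)
trace = []
serve(restored, EventSource(recorder.log), 5, trace)   # 5. replay with tracing on
assert restored == state                 # the replay reproduces the original run
```

The live run pays only for logging. Detailed tracing is paid for during replay, which reproduces the same execution from the log.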

An Example
A mysql bug: the "load …" command crashes the server if no database has been specified. Without a prior "use database_name", thd->db is NULL.

Example – Execution and Log File
User actions, in time order:
- Run mysql server.
- User 1 connects to the server.
- User 2 connects to the server.
- User 1: "show databases".
- User 2: "use test", "select * from b".
- User 1: "load data into table1".
Log (figure legend: blue – T0, red – T1, green – T2, gray – scheduler): open path=/etc/my.cnf …; wait for connection; create thread 1; wait for command; create thread 2; wait for command; recv "show databases"; handle command; recv "load data …"; handle (server crashes); recv "use test; select * from b"; handle command.

Execution Replay Using the Reduced Log
Reduced log: open path=/etc/my.cnf …; wait for connection; create thread 1; recv "load data …"; handle (server crashes).
User actions, in time order:
- Run mysql server.
- User 1 connects to the server.
- User 2 connects to the server.
- User 1: "show databases".
- User 2: "show databases", "select * from b".
- User 1: "load data into table1".

Execution Reduction
- Effects of reduction: irrelevant threads are removed; replay-only vs. replay & trace.
- How? By identifying inter-thread dependences:
  - Event dependences: found using the log.
  - File dependences: found using the log.
  - Shared-memory dependences: found using replay.
- Space requirement reduced by 4×; time requirement reduced by 2×.
- A naïve approach requires the thread id of the last writer of each address.
- Space- and time-efficient detection:
  - Memory regions: non-shared vs. shared.
  - Locality of references to regions.
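The naïve shared-memory dependence detection and the resulting reduction can be sketched as follows. During replay, the last writer thread of each address is tracked. A read of another thread's write adds a dependence. Only threads that the faulty thread transitively depends on are kept. This covers shared-memory dependences only. Event and file dependences from the log, and the region-based optimizations, are left out. Thread names, addresses, and function names are hypothetical.

```python
from collections import defaultdict

def shared_memory_deps(events):
    """Naive detection from a replayed access stream.

    events: (thread, op, address) tuples in replay order, op is "read" or
    "write". Keeps the last writer thread of every address; a read of another
    thread's write creates a dependence of the reader on the writer.
    """
    last_writer = {}
    deps = defaultdict(set)              # thread -> threads it depends on
    for thread, op, address in events:
        if op == "write":
            last_writer[address] = thread
        elif last_writer.get(address, thread) != thread:
            deps[thread].add(last_writer[address])
    return deps

def relevant_threads(deps, faulty_thread):
    """The faulty thread plus every thread it transitively depends on."""
    relevant, work = {faulty_thread}, [faulty_thread]
    while work:
        for other in deps.get(work.pop(), ()):
            if other not in relevant:
                relevant.add(other)
                work.append(other)
    return relevant

events = [
    ("T0", "write", "config"),    # main thread loads the configuration
    ("T1", "read", "config"),     # session that later crashes
    ("T2", "write", "table_b"),   # unrelated client session
    ("T2", "read", "table_b"),
    ("T1", "read", "thd_db"),
]
keep = relevant_threads(shared_memory_deps(events), faulty_thread="T1")
reduced_log = [e for e in events if e[0] in keep]
print(sorted(keep))  # ['T0', 'T1']
```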

Experimental Results

Experimental Results (figure: trace sizes and number of dependences per program-bug, original vs. optimized)

Experimental Results (figure: execution times in seconds per program-bug; series: logging, original, optimized)

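Debugging System (architecture diagram)
- Execution engine (Valgrind): runs the application binary on its input, instruments the code, and produces compressed traces along with the output.
- Static binary analyzer (Diablo): supplies control dependences.
- Slicing module: builds the WET from the traces and computes slices.
- Record/replay (Jockey): records a checkpoint + log and replays the execution from the reduced log.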

Fault Avoidance
A large number of faults in server programs are caused by the environment: 56% of the faults in the Apache server.
Types of faults handled:
- Atomicity violation faults: try alternate scheduling decisions.
- Heap buffer overflow faults: pad memory requests.
- Bad user request faults: drop bad requests.
Avoidance strategy: recover the first time, prevent later; record the change that avoided the fault.

Experiments
Program   Type of bug           Env. change    # of trials   Time taken (s)
mysql-1   Atomicity violation   Scheduler      1             130
mysql-2   Atomicity violation   Scheduler      1             65
mysql-3   Atomicity violation   Scheduler      1             65
mysql-4   Buffer overflow       Mem. padding   1             700
pine-1    Buffer overflow       Mem. padding   1             325
pine-2    Buffer overflow       Mem. padding   1             270
mutt-1    Bad user request      Drop request   3             205
bc-1      Bad user request      Drop request   3             290
bc-2      Bad user request      Drop request   3             195

Summary
- Program execution: fault location via dynamic slicing (offline) and fault avoidance for environment faults (online).
- Scalability: tracing + logging for long-running, multi-threaded programs.

Dissertations
- Xiangyu Zhang (Purdue University): Fault Location Via Precise Dynamic Slicing; SIGPLAN Outstanding Doctoral Dissertation Award.
- Sriraman Tallam (Google): Fault Location and Avoidance in Long-Running Multithreaded Programs, 2007.

Ongoing Work
- Monitoring parallel applications on multicores: Vijay Nagarajan [ISCA'09].
- Debugging parallel applications:
  - State-based approach: Dennis Jeffrey [ISSTA'08a].
  - Race detection in parallel applications: Chen Tian & Vijay Nagarajan [ISSTA'08b].
- Optimistic parallelization for multicores:
  - Speculation: Min Feng & Chen Tian [MICRO'08].

Breakdowns of Different Optimizations (figure; categories: Infer, Transform, Group, Others, Explicit)

Reducing Trace Sizes
- Checkpointing schemes: trace from the most recent checkpoint. Checkpoints are on the order of minutes apart; this helps, but the trace sizes are still very large.
- Exploiting program characteristics:
  - Multithreaded and server-like programs [ISSTA-07], e.g., mysql and apache. Each request spawns a new thread, so irrelevant threads need not be traced.
  - Single-threaded, event-processing programs [FSE-06], e.g., pine, mutt, and squid. Events are independent, so irrelevant events need not be traced.

Example of a Fault
Input:
  > ./mysqld --log-bin=binlog
  > ./mysql -u root -D test -e 'delete from b' &
  > ./mysql -u root -D test -e 'insert into b values (1)' &
Output observation:
  > ./mysqlbinlog binlog
  set timestamp =
  insert into b values (1);
  set timestamp =
  delete from b
The binary log records the insert before the delete.

Example of a Fault, continued
File: mysql_delete.cc
  mysql_delete(THD *thd, ...)
    error = generate_table(thd, ...);
    ...
  generate_table(THD *thd, ...)
    pthread_mutex_lock(...);
    ...   // critical section
  105 pthread_mutex_unlock(...);
    // logging is not locked
  109 mysql_update_log.write(thd, ...);
    ...
Log of an execution (scheduling decisions; figure legend: red – T1, green – T2, gray – scheduler):
- TEI 1: receive "delete …" and handle it, but do not log it yet.
- TEI 2: receive "insert" and handle it (the insert is logged).
- TEI 3: log the delete.
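The atomicity violation can be reproduced deterministically by simulating the two sessions under explicit interleavings. Each session executes its statement inside the critical section and then appends it to the binary log outside the lock, as on the slide. The session names "A" and "B" and all helper names are hypothetical. The "atomic" schedule stands in for logging inside the lock.

```python
def apply_statement(table, statement):
    if statement == "delete":
        table.clear()               # delete from b
    else:
        table.add(1)                # insert into b values (1)

def run_schedule(schedule):
    """Simulate two sessions under one explicit interleaving.

    Each session runs its statement inside the critical section ("execute")
    and writes it to the binary log afterwards, outside the lock ("log").
    Returns (final table contents, binary log).
    """
    statements = {"A": "delete", "B": "insert"}   # hypothetical session names
    table, binlog = set(), []
    for session, step in schedule:
        if step == "execute":
            apply_statement(table, statements[session])
        else:
            binlog.append(statements[session])
    return table, binlog

def replay_binlog(binlog):
    table = set()
    for statement in binlog:
        apply_statement(table, statement)
    return table

# Interleaving from the slide: the delete is handled first but logged last.
faulty = [("A", "execute"), ("B", "execute"), ("B", "log"), ("A", "log")]
# Execute-and-log as one atomic step (e.g. logging inside the lock).
atomic = [("A", "execute"), ("A", "log"), ("B", "execute"), ("B", "log")]

for name, schedule in (("faulty", faulty), ("atomic", atomic)):
    table, binlog = run_schedule(schedule)
    print(name, binlog, "consistent" if replay_binlog(binlog) == table else "INCONSISTENT")
# faulty ['insert', 'delete'] INCONSISTENT
# atomic ['delete', 'insert'] consistent
```

Under the faulty interleaving, the table ends up containing the row, but replaying the binary log deletes it. This is why trying an alternate scheduling decision is a plausible way to avoid the fault.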

System for Fault Avoidance [COMPSAC-08]
The server program is run within a three-phase system:
1. Logging phase: log the original execution (logging uses the Jockey system).
2. Fault avoidance phase: replay the execution and apply environment changes (using the Valgrind system).
3. Prevention-logging phase: use the resulting environment patch to prevent the fault from occurring again.
(figure labels: original execution, event log, replay & avoid, environment fault patch, record patch, old fault avoided, new fault)
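The avoidance phase can be sketched as a loop that replays the logged failing run under candidate environment changes. The first change under which the fault disappears is returned as the patch to record. The replay function here is a hypothetical stand-in that models a heap buffer overflow avoided by padding the allocation. A real system would re-execute the logged run under Valgrind.

```python
def avoid_fault(replay, changes):
    """Replay a logged failing run under candidate environment changes.

    replay(change) re-executes the logged run with one change applied
    (None = original environment) and returns True if the fault occurs.
    Returns (patch, trials): the first change that avoided the fault, to be
    recorded and applied to later runs, or (None, trials) if none worked.
    """
    if not replay(None):
        return None, 0                   # fault not reproduced; nothing to patch
    trials = 0
    for change in changes:
        trials += 1
        if not replay(change):
            return change, trials
    return None, trials

def replay_overflow(change):
    """Hypothetical heap overflow: a request writes 10 bytes into 8 allocated."""
    allocated, written = 8, 10
    padding = change[1] if change is not None and change[0] == "pad" else 0
    return written > allocated + padding   # True means the fault occurs

patch, trials = avoid_fault(replay_overflow, [("pad", 1), ("pad", 4), ("pad", 16)])
print(patch, trials)  # ('pad', 4) 2
```

This follows the "recover first time, prevent later" strategy. The search runs once, after the first failure, and the recorded patch is then applied in the prevention-logging phase.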