Finding Errors in .NET with Feedback-Directed Random Testing
Carlos Pacheco (MIT), Shuvendu Lahiri (Microsoft), Thomas Ball (Microsoft)
July 22, 2008


Outline
Motivation for case study
−Do techniques based on random test generation work in the real world?
Feedback-Directed Random Test Generation
−Technique and Randoop tool overview
Case study: Finding errors in .NET with Randoop
−Goals, process, results
Insights
−Open research problems based on our observations

Motivation
Software testing is expensive
−Can consume half of the entire software development budget
−At Microsoft, there is a test engineer for every developer
Automated test generation techniques can
−Reduce cost
−Improve quality
The research community has developed many techniques
−E.g. based on exhaustive search, symbolic execution, random generation, etc.

Research and Practice
Random vs. non-random techniques
−Some results suggest random-testing-based techniques are less effective than non-random techniques
−Other results suggest the opposite
How do these results translate to an industrial setting?
−Large amounts of code to test
−Human time is a scarce resource
−A test generation tool must prove cost-effective vs. other tools/methods
Our goal: shed light on this question for feedback-directed random testing.

Random testing
Easy to implement, fast, scalable, creates useful tests
But also has weaknesses
−Creates many illegal and redundant test inputs
Example: randomly generated unit tests for Java's JDK:

Useful test:
  Date d = new Date();
  assertTrue(d.equals(d));

Illegal test:
  Date d = new Date();
  d.setMonth(-1);
  assertTrue(d.equals(d));

Useful test:
  Set s = new HashSet();
  s.add("a");
  assertTrue(s.equals(s));

Redundant test:
  Set s = new HashSet();
  s.add("a");
  s.isEmpty();
  assertTrue(s.equals(s));

Feedback-directed random testing
Incorporate execution into the generation process
−Execute every sequence immediately after creating it
−If a sequence reveals an error, output it as a failing test case
−If a sequence appears to be illegal or redundant, discard it
Build method sequences incrementally
−Use (legal, non-redundant) sequences to create new, larger ones
−E.g. don't use sequences that raise exceptions to create new sequences
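Below is a minimal, runnable sketch of this loop over a toy target (StringBuilder). It is not Randoop's implementation: Randoop explores arbitrary APIs via reflection, and the assert here stands in for its default object contracts. Run with assertions enabled (java -ea FeedbackDirectedSketch).

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;

    public class FeedbackDirectedSketch {
        // A "sequence" is a list of calls replayed against a fresh receiver.
        interface Call { void apply(StringBuilder sb); }

        static final Random RND = new Random(0);
        static final List<Call> API = List.of(
            sb -> sb.append((char) ('a' + RND.nextInt(3))),
            sb -> sb.deleteCharAt(RND.nextInt(5)),   // may be illegal (index out of bounds)
            sb -> sb.reverse());

        public static void main(String[] args) {
            List<List<Call>> components = new ArrayList<>();
            components.add(new ArrayList<>());       // start from the empty sequence
            Set<String> seenStates = new HashSet<>();

            for (int i = 0; i < 1000; i++) {
                // Extend a previously validated sequence with one random call.
                List<Call> seq = new ArrayList<>(
                    components.get(RND.nextInt(components.size())));
                seq.add(API.get(RND.nextInt(API.size())));

                StringBuilder sb = new StringBuilder();
                try {
                    for (Call c : seq) c.apply(sb); // feedback: execute immediately
                    assert sb.length() == sb.toString().length() : "contract violated";
                } catch (IndexOutOfBoundsException e) {
                    continue;                       // illegal: discard, never extend
                } catch (AssertionError e) {
                    System.out.println("failing test, length " + seq.size());
                    continue;                       // error-revealing: output as a failing test
                }
                if (seenStates.add(sb.toString())) {
                    components.add(seq);            // legal and novel: keep for reuse
                }                                   // else redundant: discard
            }
            System.out.println(components.size() + " reusable sequences");
        }
    }

The key design point mirrors the slide: only sequences that executed normally enter the component pool, so illegal or redundant behavior is never built upon.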

Feedback-directed random testing: examples

Useful test:
  Date d = new Date(2007, 5, 23);
  assertTrue(d.equals(d));

Illegal test (do not output):
  Date d = new Date(2007, 5, 23);
  d.setMonth(-1);
  assertTrue(d.equals(d));

Illegal test, extends the test above (never create):
  Date d = new Date(2007, 5, 23);
  d.setMonth(-1);
  d.setDay(5);
  assertTrue(d.equals(d));

Useful test:
  Set s = new HashSet();
  s.add("a");
  assertTrue(s.equals(s));

Redundant test (do not output):
  Set s = new HashSet();
  s.add("a");
  s.isEmpty();
  assertTrue(s.equals(s));

Randoop
[Architecture diagram: a .NET dll or exe feeds the public-API extractor; a method sequence/input generator builds sequences; each sequence is executed, its output examined, and the feedback guides further generation; results are emitted as violating and good C# test cases.]
Generates tests for .NET assemblies
Input: an assembly (.dll or .exe)
Output: test cases, one per file, each an executable C# program
Violating tests raise assertion or access violations at runtime

Randoop for Java: try it out!
Google "randoop"
Has been used in research projects and courses
Version 1.2 just released
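For reference, a typical invocation of a recent Randoop-for-Java release looks like the line below. The flags are taken from current Randoop documentation and may differ from the 1.2 release mentioned on this slide; myclasspath and the jar name are placeholders for your own paths.

    java -classpath myclasspath:randoop-all-4.3.2.jar randoop.main.Main gentests \
         --testclass=java.util.TreeSet --time-limit=30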

Randoop: previous experimental evaluations
On container data structures
−Higher or equal coverage, in less time, than:
 o Model checking (with and without abstraction)
 o Symbolic execution
 o Undirected random testing
On real-sized programs (totaling 750 KLOC)
−Finds more errors than:
 o JPF: model checking, symbolic execution [Visser 2003, 2006]
 o jCUTE: concolic testing [Sen 2006]
 o JCrasher: undirected random testing [Csallner 2004]

Goal of the Case Study
Evaluate FDRT's effectiveness in an industrial setting
−Will the tool be effective outside a research setting?
−Is FDRT cost-effective? Under what circumstances?
−How does FDRT compare with other techniques/methods?
−How will a test team use the tool?
Suggest research directions
−Grounded in industrial experience

Case study structure
Ask engineers from a test team at Microsoft to use Randoop on their code base over 2 months
Provide technical support for Randoop
−Fix bugs, implement feature requests
Meet on a regular basis (approx. every 2 weeks)
−Ask the team for experience and results
 o Amount of time spent using the tool
 o Errors found
 o Ways in which they used the tool
 o Comparison with other techniques/methodologies in use

Subject program
Test team responsible for a critical .NET component
100 KLOC, large API, used by all .NET applications
Uses both managed and native code
Heavy use of assertions
Component stable, heavily tested: a high bar for a new technique
−40 testers over 5 years
Many automatic techniques already applied
−Fuzz, robustness, stress, boundary-condition testing
−Concurrently trying a research tool based on symbolic execution

Results
Human time spent interacting with Randoop: 15 hours
CPU time: 150 hours
Total distinct test cases generated by Randoop: 4 million
New errors revealed by Randoop: 30
Error-revealing test sequence length: average 3.4 calls, min 1 call, max 15 calls

Human effort with/without Randoop
At this point in the component's lifecycle, a test engineer is expected to discover ~20 new errors in one year of effort.
Randoop found 30 new errors in 15 hours of effort.
−This time includes:
 o Interacting with Randoop
 o Inspecting the resulting tests
 o Discarding redundant failures

What kinds of errors did Randoop find?
Randoop found errors:
−In code where tests achieved full coverage
 o By following error-revealing code paths not previously considered
−That were supposed to be caught by other tools
 o Revealed errors in testing tools
−That highlighted holes in existing manual testing practices
 o Tool helped institute new practices
−When combined with other testing tools in the team's toolbox
 o Tool was used as a building block for testing activities

Errors in fully-covered code
Randoop revealed errors in code in which existing tests achieved 100% branch coverage
Example: garbage collection error
−The component includes managed and native code
−If a native call manipulates references, it must inform the GC of the changes
−A previously untested path in native code caused the component to report a new reference to an invalid address
−The garbage collector raised an assertion violation
−The erroneous code was in a method with 100% branch coverage

Errors in testing tools
Randoop revealed errors in the team's testing and program analysis tools
Example: missing resource
−When an exception is raised, the component finds its message in a resource file
−A rarely-used exception was missing its message in the file
−Attempting the lookup led to an assertion violation
−Two errors:
 o Missing message in the resource file
 o Error in the tool that verified the state of the resource file

Errors highlighted holes in existing practices
Errors revealed by Randoop led to other testing activities
−Writing new manual tests
−Instituting new manual testing guidelines
Example: empty arrays
−Many methods in the component API take array inputs
−Testing the empty-array case was left to the discretion of the test creator
−Randoop revealed an error that caused an access violation on an empty array
−New practice: always test the empty array (illustrated below)
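A minimal illustration of that practice, assuming JDK array methods as stand-ins for the team's (non-public) component API; run with java -ea.

    import java.util.Arrays;

    public class EmptyArrayTests {
        public static void main(String[] args) {
            int[] empty = new int[0];
            Arrays.sort(empty);                          // must not throw
            assert Arrays.binarySearch(empty, 42) == -1; // -(insertion point) - 1 == -1
            assert Arrays.copyOf(empty, 0).length == 0;  // copying empty stays empty
            System.out.println("empty-array cases pass");
        }
    }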

Errors when combining Randoop with other tools
Initially we thought of Randoop as an end-to-end bug finder
The test team also used Randoop's tests as input to other tools
−Feature request: output all generated inputs, not just error-revealing ones
−Used test inputs to drive other tools (a sketch follows)
 o Stress tester: run an input while invoking the GC every few instructions
 o Concurrency tester: run an input multiple times, in parallel
This increased the scope of the exploration and the types of errors revealed beyond those that Randoop could find.
−For example, the team discovered concurrency errors this way
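A sketch of the concurrency-tester idea: replay one generated input on several threads at once. The deliberately unsynchronized StringBuilder below is a stand-in for the team's component, so a data race can surface as a wrong final length or an exception.

    import java.util.concurrent.CountDownLatch;

    public class ParallelReplay {
        public static void main(String[] args) throws InterruptedException {
            final StringBuilder shared = new StringBuilder();
            final int threads = 8, iters = 10_000;
            CountDownLatch start = new CountDownLatch(1);
            Thread[] workers = new Thread[threads];
            for (int i = 0; i < threads; i++) {
                workers[i] = new Thread(() -> {
                    try { start.await(); } catch (InterruptedException e) { return; }
                    for (int j = 0; j < iters; j++) shared.append('x'); // replayed input
                });
                workers[i].start();
            }
            start.countDown();                  // release all threads together
            for (Thread w : workers) w.join();
            System.out.println("expected " + (threads * iters)
                + ", got " + shared.length());
        }
    }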

Summary: strengths and weaknesses
Strengths of feedback-directed random testing
−Finds new, critical errors (not subsumed by other techniques)
−Fully automatic
−Scalable, immediately applicable to large software
−Unbiased search finds holes in existing testing infrastructure
Weaknesses of feedback-directed random testing
−No clear stopping criterion, which can lead to wasted effort
−Spends the majority of its time on a subset of the classes
−Reaches a coverage plateau
−Only as good as the manually created oracle

Randoop vs. other techniques
Randoop revealed errors not found by other techniques
−Manual testing
−Fuzz testing
−Bounded exhaustive testing over a small domain
−Test generation based on symbolic execution
These techniques revealed errors not found by Randoop
Random testing techniques are not subsumed by non-random techniques

Randoop vs. symbolic execution
Concurrently with Randoop, the test team used a test generator based on symbolic execution
−Input/output similar to Randoop's; internal operation different
In theory, the tool was more powerful than Randoop
In practice, it found no errors
Example: the garbage collection error was not discoverable via symbolic execution because it was in native code

Randoop vs. fuzz testing
Randoop found errors not caught by fuzz testing
Fuzz testing's domain is files, streams, and protocols
Randoop's domain is method sequences
Think of Randoop as a smart fuzzer for APIs

The Plateau Effect
After its initial period of effectiveness, Randoop ceased to reveal errors
−Randoop stopped covering new code
Towards the end, the test team made a parallel run of Randoop
−Dozens of machines, hundreds of machine hours
−Each machine with a different random seed
−Found fewer errors than its first 2 hours of use on a single machine
Our observations are consistent with recent studies reporting a coverage plateau for random test generation

Future Research Directions
Overcome the coverage plateau
−New techniques will be required
−Combining random and non-random generation is a promising approach
Richer oracles could yield more bugs
−Regression oracles: capture the state of objects (sketched below)
Test amplification
−Take advantage of existing test suites
−One idea: use existing tests as input to Randoop
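As a hedged sketch of the regression-oracle idea (the observer choice and the baseline values are illustrative, not Randoop's mechanism):

    import java.util.List;

    public class RegressionOracleSketch {
        public static void main(String[] args) {
            // The generated input sequence under test.
            StringBuilder sb = new StringBuilder();
            sb.append("abc").reverse();

            // Capture object state through pure observer methods.
            List<String> observed = List.of("toString=" + sb, "length=" + sb.length());

            // Baseline recorded on an earlier run; a mismatch signals a regression.
            List<String> baseline = List.of("toString=cba", "length=3");
            if (!observed.equals(baseline))
                throw new AssertionError("regression: " + observed + " vs " + baseline);
            System.out.println("state matches recorded baseline");
        }
    }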

Conclusion
Feedback-directed random test generation finds errors
−In mature, well-tested code
−When used in a real industrial setting
−That elude other techniques
Randoop is still used internally at Microsoft
−Added to the list of recommended tools for other product groups
−Has revealed dozens more errors in other products
Random testing techniques are effective in industry
−Find deep and critical errors
−Randomness reveals biases in a test team's practices
−Scalability yields impact