Testing Challenges for Next-Generation CPS Software
Mike Whalen, University of Minnesota

Heterogeneity: at the low level, hardware failures, system resets, etc.; at the high level, command-and-control or human interfaces that tend to focus on higher-level goals. For example, tests of data storage on the Mars Rover were split into two parts; the tests for the low-level flash storage were produced by random testing and model checking, using hardware simulators and fault injection, with very high coverage and a complex spec.

Speaker notes: the presentation never talks about the environment for "closed loop" tests, never talks about concurrency (certainly a distinguishing factor for CPS), and doesn't really cover IoT, where you have cloud and device interactions. Wish I could have talked about more stuff!
Acknowledgements
Rockwell Collins: Steven Miller, Darren Cofer, Lucas Wagner, Andrew Gacek, John Backes
University of Minnesota: Mats P. E. Heimdahl, Sanjai Rayadurgam, Matt Staats, Ajitha Rajan, Gregory Gay
Funding Sponsors: NASA, Air Force Research Labs, DARPA
Who Am I?
My main aim is reducing verification and validation (V&V) cost and increasing rigor. I applied automated V&V techniques to industrial systems at Rockwell Collins for 6½ years: proofs, bounded analyses, static analysis, automated testing, and combining several kinds of assurance artifacts. I'm interested in requirements as they pertain to V&V.
Main research thrusts in testing:
-> Factors in testing: how do we make testing experiments fair and repeatable?
-> Test metrics: what are reasonable metrics for testing safety-critical systems? What does it mean for a metric to be reasonable?
Software Size
Graphic: Andrea Busnelli
The Future of Software Engineering, December 2010
Slide courtesy Lockheed Martin, Inc.
Software Connectivity
Networked Vehicles
Currently: Bluetooth and OnStar
Adaptive cruise control, platooning, traffic routing, emergency response, adaptive traffic lights
What could possibly go wrong?
Image courtesy of energyclub.stanford.edu
Attacks on Embedded Systems
-> Poland tram hack: a 14-year-old derails four trams and forces emergency stops
-> Stuxnet
-> FBI - iPhone
-> Miller: remote car hack
Hypotheses
-> CPS testers are facing enormous challenges of scale and scrutiny: substantially larger code bases and increased attention from attackers.
-> Thorough use of automation is necessary to increase rigor for CPS verification; this requires understanding the factors in testing.
-> Common coverage metrics are not as well-suited for CPS as for general-purpose software. The structure of programs and of oracles is important for automated testing!
-> Creating intelligent / adaptive systems will make the testing problem harder. With "deep learning" used for critical functionality, we have little knowledge of how to systematically white-box test deep-learning-generated code such as neural nets.
(Speaker note: should I talk about differences between GP software and embedded software here?)
Testing Process
Diagram: test inputs from a test suite are executed on the model/program, which implements the specification; an oracle evaluates the results as correct/incorrect; test coverage of the executed program paths is assessed against a metric, and additional tests are created as needed.
Testing Artifacts
J. Gourlay. A mathematical framework for the investigation of testing. TSE, 1983.
Staats, Whalen, and Heimdahl. Programs, Tests, and Oracles: The Foundations of Testing Revisited. ICSE 2011.
Testing Artifacts – In Practice
The argument here is about two things: 1. embedded programs often have different characteristics than general-purpose software, and 2. the choice of oracle is very important and tends to be less accurate for embedded systems.
Staats’ Framework
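A sketch of the underlying formalism, reconstructed from Gourlay (1983) and Staats et al. (ICSE 2011); the notation below is my paraphrase under that reading, not a quotation from either paper:

\[
\begin{aligned}
&\text{A testing system is a tuple } (P, S, T, O, \mathit{corr}) \text{ where}\\
&\quad P \text{ is a set of programs, } S \text{ a set of specifications, } T \text{ a set of tests,}\\
&\quad \mathit{corr} \subseteq P \times S \text{ relates correct programs to specifications, and}\\
&\quad O \text{ is a set of oracles, each } o : T \times P \to \{\mathit{pass}, \mathit{fail}\}.\\
&\text{A sound oracle never rejects a correct program:}\\
&\quad \mathit{corr}(p, s) \implies \forall t \in T.\ o(t, p) = \mathit{pass}.
\end{aligned}
\]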
Theory in Practice
Complete Adequacy Criteria
I.e.: is your testing regimen adequate, given the program structure, specification, oracle, and test suite?
Complete Adequacy Criteria
Fault-finding effectiveness at 100% branch coverage:

Model           Output-Only Oracle    Maximum Oracle
DWM_1           55%                   83%
DWM_2           14%
Latctl_Batch    33%                   89%
Vertmax_Batch   32%                   85%
Complete Adequacy Criteria
Complete Adequacy Criteria
Gay, Staats, Whalen, and Heimdahl. The Risks of Coverage-Directed Test Case Generation. FASE 2012, TSE 2015.
MC/DC Effectiveness
Code structure has a large effect! Choice of oracle has a large effect!
(Charts: DWM_2, Vertmax_Batch, DWM_3)
Goals for a “Good” Test Metric
-> Effective at finding faults: better than random testing for suites of the same size, and better than other metrics. This often requires accounting for the oracle.
-> Robust to changes in program structure.
-> Reasonable in terms of the number of required tests and the cost of coverage analysis.
Inozemtseva and Holmes. Coverage Is Not Strongly Correlated with Test Suite Effectiveness. ICSE 2014.
Zhang and Mesbah. Assertions Are Strongly Correlated with Test Suite Effectiveness. FSE 2015.
Another Way to Look at MC/DC
Masking MC/DC can be expressed in terms of observability: whether a condition is observable in a decision (i.e., not masked). Writing P[e_n -> v] to mean "for program P, the computed value of the n-th instance of expression e is replaced by value v", a condition c is observable in decision D under a test when flipping c flips the value of D. For MC/DC, given decision D, for each condition c in D we want a pair of test cases t_i and t_j that ensure c is observable for both its true and false values.
Problem 1: any masking after the decision is not accounted for.
Problem 2: we can rewrite programs to make decisions large or small (and MC/DC easy or hard to satisfy!); see the sketch below.
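A small illustration of Problem 2 (my example, not one from the talk): the same logic written as one large decision or as several single-condition decisions. Behavior is identical, but the MC/DC obligations change drastically.

public class DecisionSize {
    // One decision with three conditions: MC/DC must show each of
    // a, b, and c independently affecting the outcome.
    static boolean oneBigDecision(boolean a, boolean b, boolean c) {
        return (a && b) || c;
    }

    // Identical behavior, rewritten so that every decision tests a single
    // condition: MC/DC degenerates to branch coverage and is far easier
    // to satisfy, yet the logic is tested no harder.
    static boolean manySmallDecisions(boolean a, boolean b, boolean c) {
        boolean ab = false;
        if (a) { ab = b; }          // ab == (a && b)
        boolean result = c;
        if (ab) { result = true; }  // result == (ab || c)
        return result;
    }
}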
Reachability and Observability
Examining Observability with Model Counting and Symbolic Evaluation

int test(int x, int y) {
    int z;
    if (y == x*10) S0; else S1;
    if (x > 3 && y > 10) S2;
    S3;
    return z;
}

Symbolic evaluation yields the path conditions:
[ true ] test(X,Y)
[ Y=X*10 ] S0        [ Y!=X*10 ] S1
[ X>3 & 10<Y=X*10 ] S2        [ X>3 & 10<Y!=X*10 ] S2
[ Y=X*10 & !(X>3 & Y>10) ] S3        [ Y!=X*10 & !(X>3 & Y>10) ] S3

Test(1,10) reaches S0, S3
Test(0,1) reaches S1, S3
Test(4,11) reaches S1, S2

Work by: Willem Visser, Matt Dwyer, Jaco Geldenhuys, Corina Pasareanu, Antonio Filieri, Tevfik Bultan. ISSTA '12, ICSE '13, PLDI '14, SPIN '15, CAV '15.
Probabilistic Symbolic Execution

int test(int x, int y) {   // x and y range over 0..99
    int z;
    if (y == x*10) S0; else z = 10;
    if (x > 3 && y > 10) z = 8;
    S3;
    return z;
}

Counting models for each path condition over the 10^4 possible inputs:
[ true ]: 10^4
[ Y=X*10 ]: 10        [ Y!=X*10 ]: 9990
[ X>3 & 10<Y=X*10 ]: 6        [ X>3 & 10<Y!=X*10 ]: 8538
[ Y=X*10 & !(X>3 & Y>10) ]: 4        [ Y!=X*10 & !(X>3 & Y>10) ]: 1452

The statement z = 10 gets visited in 99.9% of tests (9990 of 10^4), but it only affects the outcome in about 14% of tests (1452 of 10^4), because z = 8 overwrites it whenever x > 3 && y > 10. A brute-force check of these counts is sketched below.
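A minimal brute-force "model counting" sketch (my illustration, not the authors' tooling, with S0 replaced by a hypothetical assignment z = 1): enumerate the 100x100 input domain and count how often z = 10 executes versus how often its value survives to the output.

public class CountPaths {
    public static void main(String[] args) {
        int visited = 0, survived = 0, total = 0;
        for (int x = 0; x <= 99; x++) {
            for (int y = 0; y <= 99; y++) {
                total++;
                int z;
                boolean hitElse = (y != x * 10);
                if (!hitElse) { z = 1; /* stands in for S0 */ } else { z = 10; }
                if (x > 3 && y > 10) { z = 8; }
                if (hitElse) visited++;               // z = 10 was executed
                if (hitElse && z == 10) survived++;   // ...and reached the output
            }
        }
        System.out.printf("z=10 executed: %d/%d (%.1f%%)%n", visited, total, 100.0 * visited / total);
        System.out.printf("z=10 reaches output: %d/%d (%.1f%%)%n", survived, total, 100.0 * survived / total);
        // Prints 9990/10000 (99.9%) and 1452/10000 (14.5%), matching the counts above.
    }
}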
Probabilistic SE: Reachability vs. Observability
Hard to reach, easy to observe: [ Y=X*10 ], only 10 of the 10^4 inputs.
Easy to reach, (somewhat) hard to observe: [ Y!=X*10 ], 9990 of the 10^4 inputs, but the effect survives to the output in only 1452 of them.
Now suppose we put assertions at different points in the code. What are we doing? We are increasing the observability of the code.
Location, Location, Location
How hard is it to kill a mutant? Spoiler alert: not hard at all. Location is more important than chicken or bull.
Just, Jalali, Inozemtseva, Ernst, Holmes, and Fraser. Are mutants a valid substitute for real faults in software testing? FSE 2014.
Yao, Harman, and Jia. A study of Equivalent and Stubborn Mutation Operators using Human Analysis of Equivalence. ICSE 2014.
W. Visser. What makes killing a mutant hard? ASE 2016.
In the initial results, they saw something interesting.
What did they find?

public static int classify(int i, int j, int k) {
    if ((i <= 0) || (j <= 0) || (k <= 0)) return 4;
    int type = 0;
    if (i == j) type = type + 1;
    if (i == k) type = type + 2;
    if (j == k) type = type + 3;
    if (type == 0) {
        if ((i + j <= k) || (j + k <= i) || (i + k <= j)) type = 4;
        else type = 1;
        return type;
    }
    if (type > 3) type = 3;
    else if ((type == 1) && (i + j > k)) type = 2;
    else if ((type == 2) && (i + k > j)) type = 2;
    else if ((type == 3) && (j + k > i)) type = 2;
    else type = 4;
    return type;
}

Below the stubborn barrier (the code after the type == 0 block), almost all mutations are stubborn (killed by < 1% of inputs).
Why? Only about 3% of inputs get past the type == 0 block in the code above (they need at least two equal sides), so the code below it is rarely reached, and almost all mutations in it are stubborn (< 1%). A concrete example of such a mutant is sketched below.
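A hypothetical mutant (my example, not one from the cited studies) showing why mutations below the barrier are stubborn: a relational-operator mutation on the isosceles check changes behavior only on the exact boundary i + j == k.

public class StubbornMutantDemo {
    // The isosceles test from classify's (type == 1) branch, original vs. mutant.
    static boolean original(int i, int j, int k) { return i + j > k; }
    static boolean mutant(int i, int j, int k)   { return i + j >= k; }

    public static void main(String[] args) {
        // Restrict to i == j (so type == 1 is even plausible) and count the
        // inputs on which the mutant is distinguishable from the original.
        int killing = 0, total = 0;
        for (int i = 1; i <= 100; i++) {
            for (int k = 1; k <= 100; k++) {
                total++;
                if (original(i, i, k) != mutant(i, i, k)) killing++;  // only when 2*i == k
            }
        }
        // Prints 50/10000: even on this favorable slice, only 0.5% of inputs
        // kill the mutant (e.g., classify(2,2,4): original 4, mutant 2).
        System.out.println(killing + "/" + total);
    }
}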
A (Very) Small Experiment on Operator Mutations
Three versions of a triangle program (the triangle calculator: "general purpose" software) and the Siemens TCAS example ("embedded" software). Columns: Programs, Muts, Stubborn (< 0.1%), Really stubborn (< 0.1%), Always (100%), Easy (> 33%).

Arithmetic operators:
  TRI-YHJ 5 4
  TRI-V1 19 1 8 18
  TRI-V2 7
  TCAS 38 9 28

Relational operators:
  TRI-YHJ 40 5 24
  TRI-V1 85 6 3 4 61
  TRI-V2 55 38
  TCAS 185 32 12 46
Why is observability an issue for embedded systems?
-> Long tests are often required to expose faults from earlier computations: rate limiters (see the sketch below), hysteresis / de-bounce, feedback bounds, system modes.
-> Physical systems can impede observability: we cannot observe all outputs, or cannot observe them accurately.
-> Fault-tolerance logic can impede observability: richer oracle data than system outputs is required.
-> The structure of programs can impede observability: graphical dataflow notations (Simulink / SCADE) put conditional blocks at the end of computation flows rather than at the beginning.
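A minimal sketch (my illustration, not from the talk) of why rate limiting forces long tests: an upstream fault is invisible at the output until the limiter has ramped far enough.

public class RateLimiter {
    private double prev;
    private final double maxDelta;

    public RateLimiter(double initial, double maxDelta) {
        this.prev = initial;
        this.maxDelta = maxDelta;
    }

    // Clamp the output change to +/- maxDelta per step.
    public double step(double input) {
        double delta = Math.min(maxDelta, Math.max(-maxDelta, input - prev));
        prev += delta;
        return prev;
    }

    public static void main(String[] args) {
        RateLimiter good = new RateLimiter(0.0, 1.0);
        RateLimiter faulty = new RateLimiter(0.0, 1.0);
        // Suppose a fault upstream commands 110.0 instead of 100.0.
        for (int step = 1; step <= 105; step++) {
            double g = good.step(100.0);
            double f = faulty.step(110.0);
            if (g != f) {
                // Both ramp 1.0 per step up to 100.0; they first diverge at step 101.
                System.out.println("Outputs first diverge at step " + step);
                break;
            }
        }
    }
}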
Observable MC/DC
Idea: lift observation from decisions to programs.
-> Explicitly account for the oracle.
-> Strength should be unaffected by simple program transformations (e.g., inlining).
As with MC/DC, given decision D, for each condition c in D we want a pair of test cases t_i and t_j that ensure c is observable for both true and false values; observability now also requires the effect to propagate to a variable the oracle watches (see the example below).
Whalen, Gay, You, Staats, and Heimdahl. Observable Modified Condition / Decision Coverage. ICSE 2013.
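A tiny example (mine, illustrating the idea as I understand it from the paper) of why observing the decision alone is not enough: MC/DC can be satisfied at the decision while an output-only oracle never sees the difference.

public class MaskedDecision {
    static int compute(boolean a, boolean b, boolean mode) {
        int z = (a && b) ? 1 : 2;  // MC/DC can be satisfied at this decision...
        return mode ? z : 0;       // ...but if every test runs with mode == false,
                                   // z is masked and the oracle (watching the
                                   // return value) sees no difference.
    }
}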
DWM1
Research questions:
-> Effectiveness of fault finding, especially for the output-only oracle
-> Robustness to inlining
-> Test suite size
-> Effect of oracle
(Chart: results for DWM1)
DWM2, Latctl, Vertmax, Microwave
Same research questions as above. (Charts: results for DWM2, Latctl, Vertmax, and Microwave)
Adoption in SCADE and Mathworks Tools
SCADE: generalization to all variables, called Stream Coverage.
Also in discussions with The Mathworks on these ideas; currently they support an inlining solution for MC/DC.
Testing Code for Complex Mathematics
Testing Complex Mathematics
Metrics describing branching logic often miss errors in complex mathematics. Errors often exist in parts of the "numerical space" rather than portions of the CFG (see the sketch below):
- Overflow / underflow
- Loss of precision
- Divide by zero
- Oscillation
- Transients
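A sketch (my example) of a fault that lives in the numerical space rather than the CFG: the function has a single path, so any one test yields 100% branch coverage, yet only large inputs expose the overflow.

public class Scale {
    // Fixed-point style scaling: raw * num / den.
    // Single path, no branches: any one test achieves full branch coverage.
    public static int scale(int raw, int num, int den) {
        return raw * num / den;   // raw * num silently overflows for large raw
    }

    public static void main(String[] args) {
        System.out.println(scale(1_000, 1_000, 1_000));      // 1000, as expected
        System.out.println(scale(3_000_000, 1_000, 1_000));  // -1294967: 3e9 wraps around in 32-bit int
    }
}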
Matinnejad, Nejati & Briand: Metrics for Complex Mathematics
Use multi-objective search-based testing to maximize the diversity of output vectors, in terms of distance and the number of numerical features, and to maximize failure features in a test suite.
Output Diversity -- Vector-Based
(Charts: Output Signal 1 and Output Signal 2 over time)
Matinnejad, Nejati, and Briand. Automated Test Suite Generation for Time-Continuous Simulink Models. ICSE 2016.
Matinnejad, Nejati, Briand, Bruckmann, and Poull. Search-based automated testing of continuous controllers: Framework, tool support, and case studies. I&ST 57 (2015).
Matinnejad, Nejati, Briand, and Bruckmann. Effective test suites for mixed discrete-continuous stateflow controllers. FSE 2015.
Failure-based Test Generation
Maximizing the likelihood of the presence of specific failure patterns in output signals: instability, discontinuity.
Search-Based Test Generation Procedure
Start from an initial test suite, slightly modify each test input, and accept or reject changes according to output-based heuristics (sketched below).
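A minimal sketch (my reconstruction, not the authors' tool) of the search loop: perturb each test input slightly and keep the variant only if an output-based heuristic improves; here the heuristic is summed pairwise output distance, a simple stand-in for vector-based output diversity.

import java.util.Random;

public class OutputDiversitySearch {
    interface Sut { double[] run(double[] input); }  // system under test (assumed interface)

    static double diversity(double[][] outputs) {
        double sum = 0;
        for (int i = 0; i < outputs.length; i++)
            for (int j = i + 1; j < outputs.length; j++)
                sum += euclidean(outputs[i], outputs[j]);  // pairwise output distance
        return sum;
    }

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(s);
    }

    static void search(Sut sut, double[][] suite, int iterations, Random rnd) {
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < suite.length; i++) {
                double before = diversity(runAll(sut, suite));
                double[] saved = suite[i];
                double[] candidate = saved.clone();
                candidate[rnd.nextInt(candidate.length)] += rnd.nextGaussian() * 0.1; // small tweak
                suite[i] = candidate;
                if (diversity(runAll(sut, suite)) <= before) suite[i] = saved; // revert if no improvement
            }
        }
    }

    static double[][] runAll(Sut sut, double[][] suite) {
        double[][] out = new double[suite.length][];
        for (int i = 0; i < suite.length; i++) out[i] = sut.run(suite[i]);
        return out;
    }
}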
Output Diversity: Comparison with Random
Seeded faults into mathematical software with few branches; measured deviation from expected values.
Two metrics: O_f is failure diversity and O_v is variable (vector) diversity; q is the size of the test suite, Q0 is the maximum deviation, and FR is the % of faults revealed.
Output Diversity: Comparison with SLDV
Caveats:
- Not much branching logic in the models (MC/DC strength)
- MC/DC is not very good at catching relational or arithmetic faults
- SLDV is not designed for non-linear arithmetic and continuous time
However, this demonstrates the need for new kinds of metrics and generation tools.
Example: Neural Nets
Build a network of "neurons" that maps from inputs to outputs. Each node performs a summation and has a threshold to "fire"; each connection has a weight, which can be positive or negative (see the sketch below). As of 2017, neural networks typically have a few thousand to a few million units and millions of connections. Neural nets are trained rather than programmed.
Image: By Glosser.ca - Own work, Derivative of File:Artificial neural network.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24913461
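A one-neuron sketch (my illustration) of the computation each node performs: a weighted sum plus bias, squashed by an activation function.

public class Neuron {
    // Weighted sum of inputs plus a bias, passed through tanh
    // (the same "tansig" activation used in the MATLAB code below).
    static double fire(double[] w, double[] x, double bias) {
        double sum = bias;
        for (int i = 0; i < w.length; i++) {
            sum += w[i] * x[i];   // each connection contributes weight * input
        }
        return Math.tanh(sum);    // smooth threshold, output in (-1, 1)
    }

    public static void main(String[] args) {
        double[] weights = {0.5, -1.2};  // hypothetical trained weights
        double[] inputs = {1.0, 0.3};
        System.out.println(fire(weights, inputs, 0.1));
    }
}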
Machine Learning
Use cases: (self-)diagnosis, predictive maintenance, condition monitoring, anomaly detection / event detection, image analysis in production, pattern recognition.
Increasingly proposed for use in safety-critical applications: road following, adaptive control.
Neural Net Code Structure (in MATLAB)

function [y1] = simulateStandaloneNet(x1)
  % Input 1
  x1_step1_xoffset = 0;
  x1_step1_gain = 0.200475452649894;
  x1_step1_ymin = -1;
  % Layer 1
  b1 = [6.0358701949520981;2.725693924978148;0.58426771719145909;-5.1615078566382975];
  IW1_1 = [-14.001919491063946;4.90641117353245;-15.228280764533135;-5.264207948688032];
  % Layer 2
  b2 = -0.75620725148640833;
  LW2_1 = [0.5484626432316061 -0.43580234386123884 -0.085111261420612969 -1.1367922825337915];
  % Output 1
  y1_step1_ymin = -1;
  y1_step1_gain = 0.2;
  y1_step1_xoffset = 0;

  % ===== SIMULATION ========
  % Dimensions
  Q = size(x1,2); % samples
  % Input 1
  xp1 = mapminmax_apply(x1,x1_step1_gain,x1_step1_xoffset,x1_step1_ymin);
  % Layer 1
  a1 = tansig_apply(repmat(b1,1,Q) + IW1_1*xp1);
  % Layer 2
  a2 = repmat(b2,1,Q) + LW2_1*a1;
  % Output 1
  y1 = mapminmax_reverse(a2,y1_step1_gain,y1_step1_xoffset,y1_step1_ymin);
end
…continued

% ===== MODULE FUNCTIONS ========
% Map Minimum and Maximum Input Processing Function
function y = mapminmax_apply(x, settings_gain, settings_xoffset, settings_ymin)
  y = bsxfun(@minus,x,settings_xoffset);
  y = bsxfun(@times,y,settings_gain);
  y = bsxfun(@plus,y,settings_ymin);
end

% Sigmoid Symmetric Transfer Function
function a = tansig_apply(n)
  a = 2 ./ (1 + exp(-2*n)) - 1;
end

% Map Minimum and Maximum Output Reverse-Processing Function
function x = mapminmax_reverse(y, settings_gain, settings_xoffset, settings_ymin)
  x = bsxfun(@minus,y,settings_ymin);
  x = bsxfun(@rdivide,x,settings_gain);
  x = bsxfun(@plus,x,settings_xoffset);
end

Code observations: No branches! No relational operators!
So, how do we test this?
-> Black-box reliability testing? How do we determine the input distributions? How do we gain sufficient confidence for safety-critical use? (Ricky W. Butler, George B. Finelli: The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software)
-> Mutation testing? What do we mutate? What is our expectation as to the output effect? (One possible shape of this is sketched below.)
-> A specialized testing regime?
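One hypothetical shape of mutation testing for a trained network (my sketch; the talk leaves the question open): perturb a single weight and ask whether any test input's output deviates beyond a tolerance. Note that choosing the tolerance, absent a spec, is exactly the open problem.

public class WeightMutation {
    // A toy one-hidden-neuron net: y = w2 * tanh(w1 * x + b1) + b2.
    static double net(double w1, double b1, double w2, double b2, double x) {
        return w2 * Math.tanh(w1 * x + b1) + b2;
    }

    public static void main(String[] args) {
        double w1 = -14.0, b1 = 6.0, w2 = 0.55, b2 = -0.76;  // loosely echoing the MATLAB constants
        double mutatedW1 = w1 * 1.01;                        // "mutant": one weight nudged by 1%
        double tolerance = 1e-3;                             // arbitrary: this choice is the open problem
        boolean killed = false;
        for (double x = -1.0; x <= 1.0; x += 0.01) {         // test inputs over the normalized range
            double diff = Math.abs(net(w1, b1, w2, b2, x) - net(mutatedW1, b1, w2, b2, x));
            if (diff > tolerance) { killed = true; break; }
        }
        System.out.println(killed ? "mutant killed" : "mutant survived");
    }
}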
To Recap
-> CPS systems are getting enormous.
-> The character of CPS systems is different from the problems in common benchmark suites!
-> Understanding the factors in CPS systems influencing test is key to effective testing.
-> Observability is important but more difficult in CPS systems.
-> Testing complex mathematical code needs more research.
-> More research will be necessary to help gain confidence in safety-critical "deep learning"-generated code.
Citations
J. Gourlay. A mathematical framework for the investigation of testing. TSE, 1983.
Staats, Whalen, and Heimdahl. Programs, Tests, and Oracles: The Foundations of Testing Revisited. ICSE 2011.
Gay, Staats, Whalen, and Heimdahl. The Risks of Coverage-Directed Test Case Generation. FASE 2012, TSE 2015.
Inozemtseva and Holmes. Coverage Is Not Strongly Correlated with Test Suite Effectiveness. ICSE 2014.
Zhang and Mesbah. Assertions Are Strongly Correlated with Test Suite Effectiveness. FSE 2015.
Just, Jalali, Inozemtseva, Ernst, Holmes, and Fraser. Are mutants a valid substitute for real faults in software testing? FSE 2014.
Yao, Harman, and Jia. A study of Equivalent and Stubborn Mutation Operators using Human Analysis of Equivalence. ICSE 2014.
W. Visser. What makes killing a mutant hard? ASE 2016.
Whalen, Gay, You, Staats, and Heimdahl. Observable Modified Condition / Decision Coverage. ICSE 2013.
Matinnejad, Nejati, and Briand. Automated Test Suite Generation for Time-Continuous Simulink Models. ICSE 2016.
Matinnejad, Nejati, Briand, Bruckmann, and Poull. Search-based automated testing of continuous controllers: Framework, tool support, and case studies. I&ST 57 (2015).
Matinnejad, Nejati, Briand, and Bruckmann. Effective test suites for mixed discrete-continuous stateflow controllers. FSE 2015.
Machine Learning for Cyber Physical Systems. Springer, 2015, 2016, 2017.
Ricky W. Butler and George B. Finelli. The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software. TSE 1993.
Basic Idea of ML
Parameters are things like the number and size of hidden layers, the weight-initialization scheme, etc.
Some examples of ML techniques: linear models, kernel methods like Gaussian processes, support vector machines.
Image from: https://www.kth.se/polopoly_fs/1.578616!/RH_MachineLearning_presentation.pdf
Differences between CPS Software and GP Testing Benchmarks
Some ad-hoc observations, comparing standard GP testing benchmarks ((say) defects4j or Java datatypes) against CPS software:

Types
  GP: Rich data types throughout.
  CPS: Usually simple, non-recursive data.
Test depth to failure
  GP: Minimal; short sequences of operations or single inputs can trigger behavior.
  CPS: Often long input sequences are required. For controllers, often hundreds of input "steps" are necessary to trigger erroneous behavior.
Statement observability
  GP: Straightforward; can be assisted with mocks and stubs.
  CPS: Often poor. Effects of an executed line of code are often (1) masked by other logic, or (2) delayed by other logic for a non-trivial amount of time until visible at the output. Worse with HIL testing.
Complexity of numerics
  GP: Usually not complex (Apache.math is an exception, but most library routines are small).
  CPS: Usually long and complex.
Timing
  GP: Usually ignored.
  CPS: Usually important!