Learning Software Behavior for Automated Diagnosis Ori Bar-ilan | Dr. Meir Kalech | Dr. Roni Stern
Background & Motivation: High-Level Research Goal
Integrate ML techniques into software diagnosis.
1. Background & Motivation Let’s talk about diagnosis for a moment
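Background & Motivation: Model-Based Diagnosis [F. Wotawa, ‘02]
Traditional diagnosis is model-based: a system model plus observations (e.g., 𝐴=1, 𝐵=1, 𝐶=0, 𝑍=1) yield a diagnosis. Such a model is OFTEN ABSENT IN SOFTWARE.
[Figure: system model with labels w, x, y, 𝑊; System Model + Observations → Diagnosis]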
Background & Motivation: Spectrum-Based Diagnosis [R. Abreu, ‘09]
Another approach is spectrum-based: observing the system for expected behavior.
[Figure: the System is run under test_1 (V), test_2 (X), …, test_N (V), where V marks a passed test and X a failed one]
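Background & Motivation: Spectrum-Based Diagnosis [R. Abreu, ‘09]
System’s behavior representation, in practice: an N × M binary matrix (N tests as rows, M components as columns) alongside a "test failed?" column with one entry per test.
$$\begin{pmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix} \qquad \text{test failed?}\ \begin{pmatrix} 1 \\ \vdots \\ 0 \end{pmatrix}$$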
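A minimal Python sketch of how such a spectrum could be assembled from test traces. The function name `build_spectrum`, the dictionaries `traces` and `failed`, and the three-test example values are assumptions made for illustration; they are not taken from the slides' tooling.

```python
from typing import Dict, List, Set, Tuple


def build_spectrum(
    components: List[str],
    traces: Dict[str, Set[str]],
    failed: Dict[str, bool],
) -> Tuple[List[List[int]], List[int]]:
    """Return (A, e).

    A[i][j] is 1 if test i executed component j, else 0.
    e[i] is 1 if test i failed, else 0.
    Tests are ordered by name so that rows are reproducible.
    """
    tests = sorted(traces)
    A = [[1 if c in traces[t] else 0 for c in components] for t in tests]
    e = [1 if failed[t] else 0 for t in tests]
    return A, e


# Hypothetical traces: which components each test touched, and its outcome.
components = ["C1", "C2", "C3"]
traces = {"test1": {"C1", "C2"}, "test2": {"C1", "C3"}, "test3": {"C1"}}
failed = {"test1": True, "test2": True, "test3": False}

A, e = build_spectrum(components, traces, failed)
```

Each row of `A` is one test trace, and `e` is the "test failed?" column.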
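Background & Motivation: Motivation
Candidate ranking challenge: there are too many candidate diagnoses. Which one to choose?
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1M} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NM} \end{pmatrix}, \qquad e = \begin{pmatrix} e_1 \\ \vdots \\ e_N \end{pmatrix}$$
Diagnoses: {C2, C3}, {C4, C8, C12}, …, {C2}
Ordered diagnoses: {C2}, {C2, C3}, …, {C4, C8, C12}
The ordering uses the likelihood of each diagnosis given the observation.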
Background & Motivation: Motivation
A “ranker” takes the diagnoses ({C2, C3}, {C4, C8, C12}, …, {C2}) and, given each observation, computes the likelihood of each diagnosis being correct. Its output is the ordered diagnoses ({C2}, {C2, C3}, …, {C4, C8, C12}).
Background & Motivation: Motivation [R. Abreu, ‘09]
BARINEL’s ranker: Diagnoses ({C2, C3}, {C4, C8, C12}, …, {C2}) → Ordered diagnoses ({C2}, {C2, C3}, …, {C4, C8, C12}), using the likelihood of each diagnosis given the observation.
Observation = Test trace
Background & Motivation: Motivation
Ranker: Diagnoses ({C2, C3}, {C4, C8, C12}, …, {C2}) → Ordered diagnoses ({C2}, {C2, C3}, …, {C4, C8, C12}), using the likelihood of each diagnosis given the observation.
Observation = Test trace + What more can be observed?
2. Research: Method and Details
Research: Components’ State - Intuition
Think of the possible diagnoses and their ranking:

         C1   C2   C3   test failed?
test1     1    1    0   1
test2     1    0    1   1
test3     1    0    0   0
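Research: Components’ State - Intuition
Assume each component has 2 possible arguments. Each cell now pairs the 0/1 entry with the component's argument (shown graphically on the slide, marked … here):

         C1      C2      C3      test failed?
test1    1, …    1, …    0, …    1
test2    1, …    0, …    1, …    1
test3    1, …    0, …    0, …    0

Think of the possible diagnoses and their ranking. And now?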
Research: Components’ State
State-Oriented Ranker: Diagnoses ({C2, C3}, {C4, C8, C12}, …, {C2}) → Ordered diagnoses ({C2}, {C2, C3}, …, {C4, C8, C12}), using the likelihood of each diagnosis given the observation.
Observation = Test trace + Components’ State
Research: High-Level Methodology
[Figure: pipeline stages, as labeled on the slide]
- Sample a Project
- Model Components’ Behavior
- Invoke Diagnosis Algorithm
- Create State-Oriented Input
(One stage is marked SYNTHETIC MODEL.)
Research: Granularity
Granularity of atomic components: Statements, Blocks, Methods (chosen for this discussion), Modules, …
Research: Component’s State
What is a method’s state?
[Figure: Component j executed in Test i, with inputs (…) on one side and 𝒐𝒖𝒕𝒑𝒖𝒕 on the other]
Function Foo(self invoker, boolean a, Object o):
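One way to picture a method's state is as a record of the invoker, the arguments, and the output or exception of a single call. The sketch below captures such records with a Python decorator. The names `record_state`, `STATE_LOG`, and the `Invoker.foo` method are hypothetical; the slides do not say how states are sampled.

```python
import functools
from typing import Any, Callable, Dict, List

# Hypothetical in-memory log: one dictionary per observed method call.
STATE_LOG: List[Dict[str, Any]] = []


def record_state(method: Callable) -> Callable:
    """Wrap a method so every call appends its state to STATE_LOG."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        sample = {
            "method": method.__qualname__,
            "self": id(self),  # identity of the invoker object
            "args": args,
            "kwargs": kwargs,
            "output": None,
            "exception": None,
        }
        try:
            sample["output"] = method(self, *args, **kwargs)
        except Exception as exc:
            # Record the exception type, then let the failure propagate.
            sample["exception"] = type(exc).__name__
            raise
        finally:
            STATE_LOG.append(sample)
        return sample["output"]

    return wrapper


class Invoker:
    @record_state
    def foo(self, a: bool, o: object) -> int:
        if o is None:
            raise ValueError("o must not be None")
        return 1 if a else 0
```

Calling `Invoker().foo(True, "x")` appends one record holding the invoker identity, the arguments `(True, "x")`, and the output `1`.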
Research: Enriching SFL Input
𝒔_ij = sampled state of component 𝑗 in test 𝑖.

Original spectrum:

         C1   C2   C3   test failed?
test1     1    1    0   1
test2     1    0    1   1
test3     1    0    0   0

Enriched with components’ states:

         C1        C2        C3        test failed?
test1    1, S11    1, S12    0, S13    1
test2    1, S21    0, S22    1, S23    1
test3    1, S31    0, S32    0, S33    0
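The enrichment step can be sketched in Python as pairing every spectrum entry with its sampled state. The function name `enrich_spectrum`, the `(i, j)`-keyed `states` dictionary, and the string state labels are illustrative assumptions.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Cell = Tuple[int, Optional[Hashable]]


def enrich_spectrum(
    A: List[List[int]],
    states: Dict[Tuple[int, int], Hashable],
) -> List[List[Cell]]:
    """Pair each a_ij with s_ij, the sampled state of component j in test i.

    Cells with no sampled state get None as their state.
    """
    enriched = []
    for i, row in enumerate(A):
        enriched.append([(a, states.get((i, j))) for j, a in enumerate(row)])
    return enriched


A = [[1, 1, 0], [1, 0, 1], [1, 0, 0]]
# Placeholder state labels S11..S33 (0-based indices shifted to 1-based names).
states = {(i, j): f"S{i + 1}{j + 1}" for i in range(3) for j in range(3)}
cells = enrich_spectrum(A, states)
```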
Research: Method
- Observe the system over time and sample components’ state
- Learn states that correlate to failures
- Prioritize diagnoses with a stronger correlation to test failures
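Research: Learning Components’ Behavior
Train set for method foo(), one row per sample from 𝑪_j in 𝒕𝒆𝒔𝒕_i:

Self     Arg1    …   ArgN   Output/Exception   Failure
21312    0.756   …   1      5                  …
23423    0.223   …   …      …                  …

A new sample 𝑆_ij (Self = 21312, Arg1 = 0.756, ArgN = 1, Output/Exception = 5) is passed to the learned model, which answers: does state 𝑆_ij imply that test 𝑡_i fails? The resulting correlation with failures (0.82 in the example) is added to the input as 𝑏_ij.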
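The slides do not name the learner, so the following Python sketch uses a simple k-nearest-neighbors estimate as a stand-in: 𝑏_ij is approximated by the fraction of failing tests among the training states closest to the new state. The function name `failure_score`, the numeric feature encoding, and all training values are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple

# (numeric state features, 1 if the enclosing test failed else 0)
Sample = Tuple[Sequence[float], int]


def failure_score(train: List[Sample], state: Sequence[float], k: int = 3) -> float:
    """Estimate b_ij for `state` from the k nearest training states.

    Returns the fraction of failing tests among those neighbors.
    With no training data, returns 0.5 (an assumed neutral value).
    """
    if not train:
        return 0.5
    nearest = sorted(train, key=lambda s: math.dist(s[0], state))[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Hypothetical training set for one method: (Arg1, ArgN) -> failure label.
train = [
    ((0.75, 1.0), 1),
    ((0.80, 1.0), 1),
    ((0.70, 0.0), 0),
    ((0.20, 0.0), 0),
    ((0.25, 1.0), 0),
]
b = failure_score(train, (0.756, 1.0))
```

Keeping one small model per method matches the "model per component" setting, which is also why small data-sets are a concern.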
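Research: Ranking Policy (The Goodness Function)
Ranking the diagnosis candidates is done according to this policy, for a candidate diagnosis $\omega$ and test $i$:
$$\epsilon = \begin{cases} \prod\limits_{C_j \in \omega,\ a_{ij} = 1} (1 - b_{ij}) & \text{if } e_i = 0 \\ 1 - \prod\limits_{C_j \in \omega,\ a_{ij} = 1} (1 - b_{ij}) & \text{if } e_i = 1 \end{cases}$$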
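A minimal Python sketch of this policy. The per-test term follows the goodness function above. Multiplying the terms across all tests is an assumption borrowed from BARINEL-style likelihoods, since the slide shows only the per-test expression. The name `goodness` and the example values of `b` are illustrative.

```python
from typing import List, Set


def goodness(
    A: List[List[int]],
    e: List[int],
    b: List[List[float]],
    omega: Set[int],
) -> float:
    """Score a candidate diagnosis omega (a set of component indices).

    For each test i, take the product of (1 - b[i][j]) over the components
    of omega that the test executed. A passing test contributes that
    product; a failing test contributes one minus it.
    The per-test terms are multiplied together (assumed aggregation).
    """
    score = 1.0
    for i, row in enumerate(A):
        no_failure = 1.0
        for j in omega:
            if row[j] == 1:
                no_failure *= 1.0 - b[i][j]
        score *= no_failure if e[i] == 0 else 1.0 - no_failure
    return score


A = [[1, 1, 0], [1, 0, 1], [1, 0, 0]]
e = [1, 1, 0]
# Hypothetical learned values b_ij (0.0 where the component was not executed).
b = [[0.9, 0.1, 0.0], [0.9, 0.0, 0.1], [0.1, 0.0, 0.0]]

score_c1 = goodness(A, e, b, {0})        # diagnosis {C1}
score_c2_c3 = goodness(A, e, b, {1, 2})  # diagnosis {C2, C3}
```

With these values, {C1} scores far higher than {C2, C3}, so it is ranked first.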
3. Experiment: Setup, Evaluated Algorithms, Evaluation Metrics & Results
Experiment: Setup [Elmishali, 2015]
- 4 real-world open-source Java projects
- Known bugs (using issue trackers)
- Generating 536 instances (134 per project)

Project        Tests   Methods   Bugs Reported   Bugs Fixed
Orient DB        790    19,207           4,625        2,459
Eclipse CDT    3,990    66,982          17,713        9,091
Apache Ant     5,190    10,830           5,890        1,176
Apache POI     2,346    21,475           3,361        1,408
Experiment: Evaluated Algorithms
Our State-Augmented diagnoser is compared against:
- BARINEL (Abreu et al.)
- Data-Augmented variant of BARINEL (Elmishali et al.)
Experiment: Synthesizing Behavior
Using a synthetic behavior model to control the model’s accuracy:
- Use the true diagnosis (ground truth)
- Add synthetic noise
[Figure: Generated Model, Software System, Spectrum-based Algorithm, Diagnosis]
Experiment: Synthesizing Behavior, Example
Given: Ground Truth = {C1}, over the spectrum of test1 to test3 and components C1 to C3.
Synthetic error: 0 → 0.1. In the example table, values of 1 become 0.9 and values of 0 become 0.1.
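One possible reading of this example, sketched in Python: a perfect model would assign 1 to a ground-truth faulty component executed in a failing test and 0 to every other executed cell, and the synthetic error pulls those values toward each other (1 becomes 1 - error, 0 becomes error). This deterministic scheme, the function name `synthesize_behavior`, and the choice of 0.0 for non-executed cells are assumptions; the slides may add the noise differently.

```python
from typing import List, Set


def synthesize_behavior(
    A: List[List[int]],
    e: List[int],
    ground_truth: Set[int],
    error: float,
) -> List[List[float]]:
    """Build synthetic b_ij values from the true diagnosis.

    Executed cells of truly faulty components in failing tests get
    1 - error; all other executed cells get error.
    Non-executed cells get 0.0 (assumed convention).
    """
    b = []
    for i, row in enumerate(A):
        b_row = []
        for j, executed in enumerate(row):
            if not executed:
                b_row.append(0.0)
            elif j in ground_truth and e[i] == 1:
                b_row.append(1.0 - error)
            else:
                b_row.append(error)
        b.append(b_row)
    return b


A = [[1, 1, 0], [1, 0, 1], [1, 0, 0]]
e = [1, 1, 0]
perfect = synthesize_behavior(A, e, ground_truth={0}, error=0.0)
noisy = synthesize_behavior(A, e, ground_truth={0}, error=0.1)
```

Raising `error` makes the synthetic model less accurate, which is how the experiment controls the model's accuracy.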
Experiment: Evaluation Metric
How to measure a diagnosis’ quality? We used 3 known metrics:
- Weighted Average Precision / Weighted Average Recall
- Health State
- Wasted Effort
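The slides name the metrics without defining them, so the Python sketch below uses common formulations as assumptions. Weighted precision and recall average the per-diagnosis precision and recall, weighted by each diagnosis' normalized score. Wasted effort counts the healthy components inspected while walking down the ranking until every faulty component has been seen. The function names and example scores are illustrative.

```python
from typing import FrozenSet, List, Set, Tuple

Ranked = List[Tuple[FrozenSet[str], float]]  # (diagnosis, score), non-empty diagnoses


def weighted_precision_recall(ranked: Ranked, faulty: Set[str]) -> Tuple[float, float]:
    """Score-weighted average of per-diagnosis precision and recall.

    Assumes at least one positive score and a non-empty `faulty` set.
    """
    total = sum(score for _, score in ranked)
    precision = recall = 0.0
    for diagnosis, score in ranked:
        hits = len(diagnosis & faulty)
        weight = score / total
        precision += weight * hits / len(diagnosis)
        recall += weight * hits / len(faulty)
    return precision, recall


def wasted_effort(ranked: Ranked, faulty: Set[str]) -> int:
    """Count healthy components inspected before all faults are found.

    Diagnoses are visited by descending score. Each visited diagnosis is
    inspected in full, and each component is counted at most once.
    """
    inspected: Set[str] = set()
    wasted = 0
    for diagnosis, _ in sorted(ranked, key=lambda d: d[1], reverse=True):
        for component in sorted(diagnosis):
            if component in inspected:
                continue
            inspected.add(component)
            if component not in faulty:
                wasted += 1
        if faulty <= inspected:
            break
    return wasted


ranked = [(frozenset({"C1"}), 0.729), (frozenset({"C2", "C3"}), 0.01)]
precision, recall = weighted_precision_recall(ranked, {"C1"})
effort = wasted_effort(ranked, {"C1"})
```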
Experiment: Overview Results
[Figure: Precision and Recall plots]
- With 0.15 synthetic error: similar results to the Data-Augmented (DA) diagnoser
- With 0.2 synthetic error: significantly better results than BARINEL
Experiment: Results
[Figure: Health State and Wasted Effort plots]
- With 0.3 synthetic error rate: superior results over both diagnosers
Experiment: Conclusions
Even with 30% error, this technique can provide a significant improvement in candidate ranking.
4. Roadmap: Challenges & Future Steps
Roadmap: Challenges
- Dealing with small data-sets (model per component):
  - Live systems / test generation
  - Diagnosing on a higher level (e.g. class)
- Learning states with imbalanced data-sets (only few faults):
  - Abnormal states rather than “correlative to faults”
Roadmap: Future Steps
- Instrument real software for a non-synthetic behavior approximation
- Consider more variations for utilizing the learned behavior
- Combine this work with orthogonal diagnosers
THANKS! Any questions?