1 Identifying Bug Signatures Using Discriminative Graph Mining Hong Cheng 1, David Lo 2, Yang Zhou 1, Xiaoyin Wang 3, and Xifeng Yan 4 1 Chinese University.

Slides:

Advertisements

Similar presentations

Delta Debugging and Model Checkers for fault localization

Advertisements

 Data mining has emerged as a critical tool for knowledge discovery in large data sets. It has been extensively used to analyze business, financial,

Mining Compressed Frequent- Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Department of Computer Science University of Illinois at Urbana-Champaign.

gSpan: Graph-based substructure pattern mining

Shuai Ma, Yang Cao, Wenfei Fan, Jinpeng Huai, Tianyu Wo Capturing Topology in Graph Pattern Matching University of Edinburgh.

Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.

Mahadevan Subramaniam and Bo Guo University of Nebraska at Omaha An Approach for Selecting Tests with Provable Guarantees.

Knowledge Graph: Connecting Big Data Semantics

A survey of techniques for precise program slicing Komondoor V. Raghavan Indian Institute of Science, Bangalore.

Mining Top-K Large Structural Patterns in a Massive Network Feida Zhu 1, Qiang Qu 2, David Lo 1, Xifeng Yan 3, Jiawei Han 4, and Philip S. Yu 5 1 Singapore.

Context-aware Query Suggestion by Mining Click-through and Session Data Authors: H. Cao et.al KDD 08 Presented by Shize Su 1.

Comprehensive Evaluation of Association Measures for Software Fault Localization LUCIA, David LO, Lingxiao JIANG, Aditya BUDI Singapore Management University.

Using Statically Computed Invariants Inside the Predicate Abstraction and Refinement Loop Himanshu Jain Franjo Ivančić Aarti Gupta Ilya Shlyakhter Chao.

© 2008 IBM Corporation Mining Significant Graph Patterns by Leap Search Xifeng Yan (IBM T. J. Watson) Hong Cheng, Jiawei Han (UIUC) Philip S. Yu (UIC)

CS590Z Statistical Debugging Xiangyu Zhang (part of the slides are from Chao Liu)

Statistical Debugging: A Tutorial Steven C.H. Hoi Acknowledgement: Some slides in this tutorial were borrowed from Chao Liu at UIUC.

Comparing path-based and vertically-partitioned RDF databases Preetha Lakshmi & Chris Mueller 12/10/2007 CSCI 8715 Shashi Shekhar.

Mining Long Sequential Patterns in a Noisy Environment Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han SIGMOD 2002.

State coverage: an empirical analysis based on a user study Dries Vanoverberghe, Emma Eyckmans, and Frank Piessens.

Automated Diagnosis of Software Configuration Errors

FAST FREQUENT FREE TREE MINING IN GRAPH DATABASES Marko Lazić 3335/2011 Department of Computer Engineering and Computer Science,

Software Testing Sudipto Ghosh CS 406 Fall 99 November 9, 1999.

272: Software Engineering Fall 2012 Instructor: Tevfik Bultan Lecture 17: Code Mining.

What Is Sequential Pattern Mining?

Slides are modified from Jiawei Han & Micheline Kamber

Subgraph Containment Search Dayu Yuan The Pennsylvania State University 1© Dayu Yuan9/7/2015.

Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Theory and Practice, Do They Match ? A Case with Spectrum-Based Fault Localization Tien-Duy B. Le, Ferdian Thung, and David Lo School of Information Systems.

An Integrated Approach to Extracting Ontological Structures from Folksonomies Huairen Lin, Joseph Davis, Ying Zhou ESWC 2009 Hyewon Lim October 9 th, 2009.

Mining Optimal Decision Trees from Itemset Lattices Dr, Siegfried Nijssen Dr. Elisa Fromont KDD 2007.

An Automated Approach to Predict Effectiveness of Fault Localization Tools Tien-Duy B. Le, and David Lo School of Information Systems Singapore Management.

VLDB 2012 Mining Frequent Itemsets over Uncertain Databases Yongxin Tong 1, Lei Chen 1, Yurong Cheng 2, Philip S. Yu 3 1 The Hong Kong University of Science.

Bug Localization with Machine Learning Techniques Wujie Zheng

Xiangnan Kong,Philip S. Yu Department of Computer Science University of Illinois at Chicago KDD 2010.

1 Frequent Subgraph Mining Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY June 12, 2010.

Graph Indexing: A Frequent Structure- based Approach Alicia Cosenza November 26 th, 2007.

Xiangnan Kong,Philip S. Yu Multi-Label Feature Selection for Graph Classification Department of Computer Science University of Illinois at Chicago.

Stefan Mutter, Mark Hall, Eibe Frank University of Freiburg, Germany University of Waikato, New Zealand The 17th Australian Joint Conference on Artificial.

Frequent Subgraph Discovery Michihiro Kuramochi and George Karypis ICDM 2001.

1 Program Testing (Lecture 14) Prof. R. Mall Dept. of CSE, IIT, Kharagpur.

Outline Introduction – Frequent patterns and the Rare Item Problem – Multiple Minimum Support Framework – Issues with Multiple Minimum Support Framework.

1 Test Selection for Result Inspection via Mining Predicate Rules Wujie Zheng

University at BuffaloThe State University of New York Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo Frequent.

MINING COLOSSAL FREQUENT PATTERNS BY CORE PATTERN FUSION FEIDA ZHU, XIFENG YAN, JIAWEI HAN, PHILIP S. YU, HONG CHENG ICDE07 Advisor: Koh JiaLing Speaker:

References: “Pruning Dynamic Slices With Confidence’’, by X. Zhang, N. Gupta and R. Gupta (PLDI 2006). “Locating Faults Through Automated Predicate Switching’’,

“Isolating Failure Causes through Test Case Generation “ Jeremias Rößler Gordon Fraser Andreas Zeller Alessandro Orso Presented by John-Paul Ore.

Bug Localization with Association Rule Mining Wujie Zheng

Computer Science and Engineering TreeSpan Efficiently Computing Similarity All-Matching Gaoping Zhu #, Xuemin Lin #, Ke Zhu #, Wenjie Zhang #, Jeffrey.

Mining Top-K Large Structural Patterns in a Massive Network Feida Zhu 1, Qiang Qu 2, David Lo 1, Xifeng Yan 3, Jiawei Han 4, and Philip S. Yu 5 1 Singapore.

1 Knowledge Discovery from Transportation Network Data Paper Review Jiang, W., Vaidya, J., Balaporia, Z., Clifton, C., and Banich, B. Knowledge Discovery.

Discriminative Frequent Pattern Analysis for Effective Classification By Hong Cheng, Xifeng Yan, Jiawei Han, Chih- Wei Hsu Presented by Mary Biddle.

HEMANTH GOKAVARAPU SANTHOSH KUMAR SAMINATHAN Frequent Word Combinations Mining and Indexing on HBase.

Graph Indexing From managing and mining graph data.

Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,

Data Mining: Principles and Algorithms Graph Pattern Mining Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign

1 Substructure Similarity Search in Graph Databases R 陳芃安.

Ning Jin, Wei Wang ICDE 2011 LTS: Discriminative Subgraph Mining by Learning from Search History.

Gspan: Graph-based Substructure Pattern Mining

10/23/ /23/2017 Presented at KDD’09 Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach David Lo1,

Experience Report: System Log Analysis for Anomaly Detection

Software Testing.

Mining in Graphs and Complex Structures

Mining Frequent Subgraphs

Mining Frequent Itemsets over Uncertain Databases

Graph Database Mining and Its Applications

Slides are modified from Jiawei Han & Micheline Kamber

50.530: Software Engineering

Precise Condition Synthesis for Program Repair

CoXML: A Cooperative XML Query Answering System

Presentation transcript:

1 Identifying Bug Signatures Using Discriminative Graph Mining Hong Cheng 1, David Lo 2, Yang Zhou 1, Xiaoyin Wang 3, and Xifeng Yan 4 1 Chinese University of Hong Kong 2 Singapore Management University 3 Peking University 4 University of California at Santa Barbara ISSTA’09

2 Automated Debugging o Bugs part of day-to-day software development o Bugs caused the loss of much resources – NIST report 2002 – 59.5 billion dollars/annum o Much time is spent on debugging – Need support for debugging activities – Automate debugging process o Problem description – Given labeled correct and faulty execution traces – Make debugging an easier task to do

3 Bug Localization and Signature Identification oBug localization – Pinpointing a single statement or location which is likely to contain bugs – Does not produce the bug context oBug signature mining [Hsu et al., ASE’08] – Provides the context where a bug occurs – Does not assume “perfect bug understanding” – In the form of sequences of program elements – Occur when the bug is manifested

4 Outline oMotivation: Bug Localization and Bug Signature oPioneer Work on Bug Signature Mining oIdentifying Bug Signatures Using Discriminative Graph Mining oExperimental Study oRelated Work oConclusions and Future Work

5 Pioneer Work on Bug Signature Identification oRAPID [Hsu et al., ASE’08] –Identify relevant suspicious program elements via Tarantula –Compute the longest common subsequences that appear in all faulty executions with a sequence mining tool BIDE [Wang and Han, ICDE’04] –Sort returned signatures by length –Able to identify a bug involving path-dependent fault

6 Software Behavior Graphs o Model software executions as behavior graphs –Node: method or basic block –Edge: call or transition (basic block/method) or return –Two levels of granularities: method and basic block o Represent signatures as discriminating subgraphs oAdvantages of graph over sequence representation –Compactness: loops  mining scalability –Expressiveness: partial order and total order

7 Example: Software Behavior Graphs Two executions from Mozilla Rhino with a bug of number Solid edge: function call Dashed edge: function transition

8 Bug Signature: Discriminative Sub-Graph oGiven two sets of graphs: correct and failing oFind the most discriminative subgraph oInformation gain: IG(c|g) = H(c) – H(c|g) – Commonly used in data mining/machine learning – Capacity in distinguishing instances from different classes – Correct vs. Failing o Meaning: – As frequency difference of a subgraph g in faulty and correct executions increases – The higher is the information gain of g oLet F be the objective function (i.e., information gain), compute:

9 Bug Signature: Discriminative Sub-Graph oImplication – High information gain if: – Observed in many failing but few correct execution – Observed in many correct but few failing executions

10 Bug Signature: Discriminative Sub-Graph oThe discriminative subgraph mined from behavior graphs contrasts the program flow of correct and failing executions and provides context for understanding the bug oDifferences with RAPID: –Not only element-level suspiciousness, signature-level suspiciousness/discriminative-ness –Does not restrict that the signature must hold across all failing executions –Sort by level of suspiciousness

11 System Framework STEP 1 STEP 2 STEP 3

12 oStep 1 – Trace is “coiled” to form behavior graphs – Based on transitions, call, and return relationship – Granularity: method calls, basic blocks oStep 2 –Filter off non-suspicious edges –Similar to Tarantula suspiciousness –Focus on relationship between blocks/calls oStep 3 –Mine top-k discriminating graphs –Distinguishes buggy from correct executions System Framework (2)

13 An Example Four test cases Generated traces 1: void replaceFirstOccurrence (char arr [], int len, char cx, char cy, char cz) { int i; 2: for (i=0;i<len;i++) { 3: if (arr[i]==cx){ 4: arr[i] = cz; 5: // a bug, should be a break; 6: } 7: if (arr[i]==cy)){ 8: arr[i] = cz; 9: // a bug, should be a break; 10: } 11: }}

14 An Example (2) Behavior Graphs for Trace 1, 2, 3 & 4 Normal Buggy

15 An Example (3)

16 Challenges in Graph Mining: Search Space Explosion oIf a graph is frequent, all its subgraphs are frequent – the Apriori property oAn n-edge frequent graph may have up to 2 n subgraphs which are also frequent oAmong 423 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are around 1,000,000 frequent subgraphs if the minimum support is 5%

17 Traditional Frequent Graph Mining Framework Exploratory task Graph clustering Graph classification Graph index Objective functions: discrimininative, selective clustering tendency Graph Database Frequent Patterns Optimal Patterns 1.Computational bottleneck : millions, even billions of patterns 2.No guarantee of quality

18 Leap Search for Discriminative Graph Mining oYan et al. proposed a new leap search mining paradigm in SIGMOD’08 –Core idea: structural proximity for search space pruning oDirectly outputs the most discriminative subgraph, highly efficient!

19 Core Idea: Structural Similarity Sibling Structural similarity  Significance similarity Mine one branch and skip the other similar branch! Size-4 graph Size-5 graph Size-6 graph

20 Skip g’ subtree if : tolerance of frequency dissimilarity Structural Leap Search Criterion Mining PartLeap Part g : a discovered graph g’: a sibling of g g g’

21 Extending LEAP to Top-K LEAP oLEAP returns the single most discriminative subgraph from the dataset oA ranked list of k most discriminative subgraphs is more informative than the single best one oTop-K LEAP idea –The LEAP procedure is called for k times –Checking partial result in the process –Producing k most discriminative subgraphs

22 Experimental Evaluation oDatasets – Siemens datasets: All 7 programs, all versions oMethods – RAPID [Hsu et al., ASE’08] – Top-K LEAP: our method oMetrics – Recall and Precision from top-k returned signatures – Recall = proportion of the bugs that could be found by the bug signatures – Precision = proportion of the returned results that highlight the bug – Distance-based metric to exact bug location penalize the bug context

23 Experimental Results (Top 5) Result - Method Level

24 Experimental Results (Top 5) Result – Basic Block Level

25 Experimental Results (2) - Schedule PrecisionRecall

26 Efficiency Test oTop-K LEAP finishes mining on every dataset between 1 and 258 seconds oRAPID cannot finish running on several datasets in hours –Version 6 of replace dataset, basic block level –Version 10 of print_tokens2, basic block level

27 Experience (1) Version 7 of schedule Top-K LEAP finds the bug, while RAPID fails

28 Experience (2) Version 18 of tot_info if ( rdf <=0 || cdf <= 0) Our method finds a graph connecting block 3 with block 5 with a transition edge For rdf<0, cdf<0 bb1  bb3  bb5

29 Threat to Validity o Human error during the labeling process – Human is the best judge to decide whether a signature is relevant or not. o Only small programs – Scalability on larger programs o Only c programs – Concept of control flow is universal

30 Related Work oBug Signature Mining: RAPID [Hsu et al., ASE’08] oBug Predictors to Faulty CF Path [Jiang et al., ASE’07] – Clustering similar bug predictors and inferring approximate path connecting similar predictors in CFG. – Our work: finding combination of bug predictors that are discriminative. Result guaranteed to be feasible paths. oBug Localization Methods –Tarantula [Jones and Harrold, ASE’05], WHITHER [Renieris and Reiss, ASE’03], Delta Debugging [Zeller and Hildebrandt, TSE’02], AskIgor [Cleve and Zeller, ICSE’05], Predicate evaluation [Liblit et al., PLDI’03, PLDI’05], Sober [Liu et al., FSE’05], etc.

31 Related Work on Graph Mining oEarly work –SUBDUE [Holder et al., KDD’94], WARMR [Dehaspe et al., KDD’98] o Apriori-based approach AGM [Inokuchi et al., PKDD’00] FSG [Kuramochi and Karypis, ICDM’01] oPattern-growth approach– state-of-the-art gSpan [Yan and Han, ICDM’02] MoFa [Borgelt and Berthold, ICDM’02] FFSM [Huan et al., ICDM’03] Gaston [Nijssen and Kok, KDD’04]

32 Conclusions oA discriminative graph mining approach to identify bug signatures –Compactness, Expressiveness, Efficiency oExperimental results on Siemens datasets –On average, 18.1% higher precision, 32.6% higher recall (method level) –On average, 1.8% higher precision, 17.3% higher recall (basic block level) –Average signature size of 3.3 nodes (vs. 4.1) (method level) or 3.8 nodes (vs 10.3) (basic block level) –Mining at basic block level is more accurate than method level - (74.3%,91%) vs (58.5%,73%)

33 Future Extensions o Mine minimal subgraph patterns – Current patterns may contain irrelevant nodes and edges for the bug o Enrich software behavior graph representation – Currently only captures program flow semantics – May attach additional information to nodes and edges such as program parameters and return values

34 Questions, Comments, Advice ? Thank You