Mining Software Data: Code Tao Xie University of Illinois at Urbana-Champaign

1 Mining Software Data: Code Tao Xie University of Illinois at Urbana-Champaign http://web.engr.illinois.edu/~taoxie/ taoxie@illinois.edu

2 Mining Software Engineering Data
MAIN GOAL
• Transform static record-keeping SE data into active data
• Make SE data actionable by uncovering hidden patterns and trends
Data sources: mailings, Bugzilla, code repository, execution traces, CVS

3 Mining Software Engineering Data
• Software engineering data: code bases, change history, program states, structural entities, bug reports/NL, …
• Data mining techniques
• Software engineering tasks: programming, defect detection, testing, debugging, maintenance, …
https://sites.google.com/site/asergrp/dmse

5 Motivation
• Programmers commonly reuse APIs of existing frameworks or libraries
  – Advantages: high development productivity
  – Challenges: complexity and lack of documentation
  – Consequences: programmers spend more effort understanding APIs and introduce defects in API client code
  – Solution: mine API properties as common patterns across API client code

6 Solution-Driven -> Problem-Driven
• Solution-driven: "Where can I apply X miner?"
  – Basic mining algorithms, e.g., association rule mining, frequent itemset mining, …
  – Advanced mining algorithms, e.g., frequent partial order mining [ESEC/FSE 07]
• Problem-driven: "What patterns do we really need?"
  – New/adapted mining algorithms, e.g., [ICSE 09], [ASE 09]

7 Mining -> Searching + Mining
• Traditional approaches: mine patterns directly from a fixed set of code repositories (1, 2, …, N), e.g., Eclipse, Linux, …
  – Often lack sufficient relevant data points (e.g., API call sites)
• Our new approaches: search code repositories (1, 2, …, N) through a code search engine, e.g., open source code on the web, then mine patterns from the search results

8 Agenda
• Motivation
• Mining Sequence Association Rules (CAR-Miner) [ICSE 09]
• Detecting Exception-Handling Defects
• Mining Alternative Patterns (Alattin) [ASE 09]
• Detecting Neglected Condition Defects
• Conclusion

9 Exception Handling
• APIs throw exceptions when runtime errors occur
  – Example: the Session API of the Hibernate framework throws HibernateException
• APIs expect client applications to implement recovery actions after exceptions occur
  – Example: the Hibernate Session API expects the client application to roll back open uncommitted transactions after a HibernateException occurs
• Failure to handle exceptions results in fatal issues, e.g., a database lock won't be released if the transaction is not rolled back

10 Problem Addressed by CAR-Miner
• Use exception-handling specifications to detect violations as defects
• Problem: specifications are often not documented
• Solution: mine specifications from existing API client code
• Challenges:
  – Limited data points: only from a few code bases -> searching + mining
  – Limited expressiveness: not sufficient to characterize common exception-handling behaviors. Why?

11 Example
• Defect: no rollback is done when SQLException occurs (missing "conn.rollback()")
• Requires a specification such as "Connection should be rolled back when a connection is created and SQLException occurs"
• Q: Does every connection instance have to be rolled back when SQLException occurs?

12 Example (cont.)
• Specification: "Connection creation => Connection rollback"
• Satisfied by Scenario 1 but not by Scenario 2
• But Scenario 2 has no defect

13-16 Example (cont.)
• Simple association rules of the form "FCa => FCe" are not expressive enough
• More general association rules (sequence association rules) are required, such as (FCc1 FCc2) Λ FCa => FCe1, where
  FCc1 -> Connection conn = OracleDataSource.getConnection()  //Context
  FCc2 -> Statement stmt = conn.createStatement()  //Context
  FCa -> stmt.executeUpdate()  //Triggering Action
  FCe1 -> conn.rollback()  //Recovery Action
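The client code behind this example is not part of the transcript. A minimal sketch of code that satisfies the rule (FCc1 FCc2) Λ FCa => FCe1 might look as follows. The class name, the SQL string, and the proxy-based stubs that simulate a failing executeUpdate are assumptions made for illustration; only the JDBC calls named on the slide come from the source.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

final class RollbackRuleExample {

    // Client code that follows the rule: the recovery action runs on the exception path.
    static void update(Connection conn, String sql) throws SQLException {
        Statement stmt = null;
        try {
            stmt = conn.createStatement();   // FCc2: context
            stmt.executeUpdate(sql);         // FCa: triggering action
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();                 // FCe1: recovery action
            throw e;
        } finally {
            if (stmt != null) {
                stmt.close();
            }
        }
    }

    // Runs update() against stubs whose executeUpdate always fails (hypothetical test
    // doubles) and returns the JDBC methods that were called, in order.
    static List<String> callsOnFailure() {
        List<String> calls = new ArrayList<>();
        ClassLoader loader = RollbackRuleExample.class.getClassLoader();
        InvocationHandler stmtHandler = (proxy, method, args) -> {
            calls.add("Statement." + method.getName());
            if (method.getName().equals("executeUpdate")) {
                throw new SQLException("simulated failure");
            }
            return null;
        };
        Statement stmt = (Statement) Proxy.newProxyInstance(
                loader, new Class<?>[] {Statement.class}, stmtHandler);
        InvocationHandler connHandler = (proxy, method, args) -> {
            calls.add("Connection." + method.getName());
            return method.getName().equals("createStatement") ? stmt : null;
        };
        Connection conn = (Connection) Proxy.newProxyInstance(
                loader, new Class<?>[] {Connection.class}, connHandler);
        try {
            update(conn, "UPDATE t SET x = 1");
        } catch (SQLException expected) {
            // The simulated failure propagates after the rollback.
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(callsOnFailure());
    }
}
```

Deleting the catch block from update() gives the kind of defect slide 11 describes: the trigger fails and no rollback follows.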

17 CAR-Miner Approach
Goal: check whether the input application has any exception-related defects
1. Extract the classes and functions reused by the input application
2. Issue queries and collect relevant code examples from open source projects on the web (1, 2, …, N), e.g., "lang:java java.sql.Statement executeUpdate"
3. Construct exception-flow graphs
4. Collect static traces
5. Mine static traces to obtain sequence association rules
6. Detect violations
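The query step can be sketched as simple string construction. The format below is inferred only from the example query on the slide; the class and method names are hypothetical, and the real tool may build its queries differently.

```java
final class SearchQuery {

    // Builds a query in the form shown on the slide: "lang:java <class> <method>".
    static String forApi(String className, String methodName) {
        return "lang:java " + className + " " + methodName;
    }

    public static void main(String[] args) {
        System.out.println(forApi("java.sql.Statement", "executeUpdate"));
    }
}
```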

19 Exception-Flow-Graph Construction
• Based on a previous algorithm [Sinha&Harrold TSE 00]
• Legend: solid edge: normal execution path; dashed edge (----): exceptional execution path

20 Exception-Flow-Graph Construction
• Prevent infeasible edges using a sound static analysis [Robillard&Murphy FSE 99]

22 Static Trace Generation
• Collect static traces with the actions taken when exceptions occur
• A static trace for Node 7: "4 -> 5 -> 6 -> 7 -> 15 -> 16 -> 17"

23 Static Trace Generation
• A static trace for Node 7: "4 -> 5 -> 6 -> 7 -> 15 -> 16 -> 17"
• Includes 3 sections:
  – Normal function-call sequence (4 -> 5 -> 6)
  – Function call (7)
  – Exception function-call sequence (15 -> 16 -> 17)

24 Trace Post-Processing
• Identify and remove unrelated function calls using data dependency
• "4 -> 5 -> 6 -> 7 -> 15 -> 16 -> 17"
  4: FileWriter fw = new FileWriter("output.txt")
  5: BufferedWriter bw = new BufferedWriter(fw)
  ...
  7: Statement stmt = conn.createStatement()
  ...
• Filtered sequence: "6 -> 7 -> 15 -> 16"
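The filtering step can be illustrated with a deliberately simplified data-dependency check: keep a call only if it shares a variable with the triggering call. This sketch is not the tool's actual analysis. The slide gives the statements for nodes 4, 5, and 7 only; the variables assigned to nodes 6, 15, 16, and 17 below are assumptions chosen so that the example reproduces the filtered sequence on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TraceFilter {

    // One call in a static trace: its node id and the variables it defines or uses.
    static final class Call {
        final int node;
        final Set<String> vars;

        Call(int node, String... vars) {
            this.node = node;
            this.vars = new HashSet<>(Arrays.asList(vars));
        }
    }

    // Keeps the trigger and every call that shares at least one variable with it.
    static List<Integer> filter(List<Call> trace, int triggerNode) {
        Set<String> relevant = new HashSet<>();
        for (Call c : trace) {
            if (c.node == triggerNode) {
                relevant.addAll(c.vars);
            }
        }
        List<Integer> kept = new ArrayList<>();
        for (Call c : trace) {
            if (!Collections.disjoint(c.vars, relevant)) {
                kept.add(c.node);
            }
        }
        return kept;
    }

    static List<Call> slideTrace() {
        return Arrays.asList(
                new Call(4, "fw"),            // FileWriter fw = new FileWriter("output.txt")
                new Call(5, "bw", "fw"),      // BufferedWriter bw = new BufferedWriter(fw)
                new Call(6, "conn"),          // assumed: creates conn
                new Call(7, "stmt", "conn"),  // Statement stmt = conn.createStatement()
                new Call(15, "conn"),         // assumed: conn.rollback()
                new Call(16, "stmt"),         // assumed: stmt.close()
                new Call(17, "bw"));          // assumed: bw.close()
    }

    public static void main(String[] args) {
        System.out.println(filter(slideTrace(), 7));
    }
}
```

A fuller analysis would follow dependencies transitively; the one-step check is enough for this trace.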

26 Static Trace Mining
• Handle the traces of each function call (triggering function call) individually
• Input: two sequence databases with a one-to-one mapping
  – Normal function-call sequences (context)
  – Exception function-call sequences (recovery)
• Objective: generate sequence association rules of the form
  (FCc1 ... FCcn) Λ FCa => FCe1 ... FCen
  where (FCc1 ... FCcn) is the context, FCa the trigger, and FCe1 ... FCen the recovery

27 Mining Problem Definition
• Input: two sequence databases with a one-to-one mapping (context and recovery)
• Objective: get association rules of the form
  FC1 FC2 ... FCm -> FE1 FE2 ... FEn
  where {FC1, FC2, ..., FCm} ∈ SDB1 and {FE1, FE2, ..., FEn} ∈ SDB2
• Existing association rule mining algorithms cannot be applied directly to multiple sequence databases

28 Mining Problem Solution
• Annotate the sequences to generate a single combined database
• Apply a frequent subsequence mining algorithm [Wang and Han, ICDE 04] to get frequent sequences
• Transform the mined sequences into sequence association rules, e.g., (3 10) Λ FCa => (2 8), with context (3 10), trigger FCa, and recovery (2 8)
• Rank rules based on the support assigned by the frequent subsequence mining algorithm
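One way to picture the annotation step is to tag each call with its role and concatenate context, trigger, and recovery into a single sequence, so that an ordinary subsequence check covers both databases at once. The sketch below does that and computes the support of one candidate rule by brute force. It does not implement the frequent subsequence miner cited on the slide, and the "c:", "a:", "e:" tags and the four sample traces are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class AnnotatedTraces {

    // Merges one context/recovery pair into a single annotated sequence.
    static List<String> annotate(List<String> context, String trigger, List<String> recovery) {
        List<String> combined = new ArrayList<>();
        for (String c : context) {
            combined.add("c:" + c);
        }
        combined.add("a:" + trigger);
        for (String e : recovery) {
            combined.add("e:" + e);
        }
        return combined;
    }

    // True if every element of pattern occurs in seq in the same order (gaps allowed).
    static boolean isSubsequence(List<String> pattern, List<String> seq) {
        int i = 0;
        for (String s : seq) {
            if (i < pattern.size() && pattern.get(i).equals(s)) {
                i++;
            }
        }
        return i == pattern.size();
    }

    // Fraction of annotated sequences that contain the pattern.
    static double support(List<String> pattern, List<List<String>> database) {
        if (database.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (List<String> seq : database) {
            if (isSubsequence(pattern, seq)) {
                hits++;
            }
        }
        return (double) hits / database.size();
    }

    // Support of the slide's rule (3 10) Λ FCa => (2 8) over four made-up traces.
    static double demoSupport() {
        List<List<String>> database = List.of(
                annotate(List.of("3", "5", "10"), "FCa", List.of("2", "8")),  // supports the rule
                annotate(List.of("3", "10"), "FCa", List.of("2", "7", "8")),  // supports the rule
                annotate(List.of("3", "10"), "FCa", List.of()),               // no recovery calls
                annotate(List.of("10", "3"), "FCa", List.of("2", "8")));      // context out of order
        List<String> rule = annotate(List.of("3", "10"), "FCa", List.of("2", "8"));
        return support(rule, database);
    }

    public static void main(String[] args) {
        System.out.println(demoSupport());
    }
}
```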

30 Violation Detection
• Analyze each call site of the triggering call FCa
  – Step 1: extract the context call sequence "CC1 CC2 ... CCm" from the beginning of the function to the call site of FCa
  – Step 2: if CC1 CC2 ... CCm is a super-sequence of FCc1 ... FCcn, report any function calls of {FCe1 ... FCen} that are missing in any exception path
• API client: (CC1 CC2 ... CCm) Λ FCa => missing any?
• API rule: (FCc1 ... FCcn) Λ FCa => FCe1 ... FCen (context, trigger, recovery)
• Check: (CC1 CC2 ... CCm) isSuperSeqOf (FCc1 ... FCcn)
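The two steps above can be sketched for a single call site and a single exception path. The method names are hypothetical, calls are represented as plain strings, and the check for missing recovery calls ignores their order; these are simplifications, not the tool's implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class ViolationDetector {

    // True if pattern occurs in seq in order (gaps allowed).
    static boolean isSuperSeqOf(List<String> seq, List<String> pattern) {
        int i = 0;
        for (String s : seq) {
            if (i < pattern.size() && pattern.get(i).equals(s)) {
                i++;
            }
        }
        return i == pattern.size();
    }

    // Returns the rule's recovery calls that are absent from the given exception path.
    // Returns an empty list when the rule's context does not match the call site.
    static List<String> missingRecovery(List<String> clientContext, List<String> exceptionPath,
                                        List<String> ruleContext, List<String> ruleRecovery) {
        List<String> missing = new ArrayList<>();
        if (!isSuperSeqOf(clientContext, ruleContext)) {
            return missing;  // Step 2 precondition fails: the rule does not apply here.
        }
        for (String call : ruleRecovery) {
            if (!exceptionPath.contains(call)) {
                missing.add(call);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> ruleContext = List.of("getConnection", "createStatement");
        List<String> ruleRecovery = List.of("rollback");
        List<String> clientContext = List.of("getConnection", "setAutoCommit", "createStatement");
        // The exception path only closes the statement, so "rollback" is reported.
        System.out.println(missingRecovery(clientContext, List.of("close"), ruleContext, ruleRecovery));
    }
}
```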

31 Evaluation
Research Questions:
1. Do the mined rules represent real rules?
2. Do the detected violations represent real defects?
3. Does CAR-Miner perform better than WN-miner [Weimer and Necula, TACAS 05]?
4. Do the sequence association rules help detect new defects?

32 Subjects
• Internal Info: classes and methods belonging to the app
• External Info: classes and methods used by the app
• Code examples: #files collected through the code search engine

33 RQ1: Real Rules
• Do the mined rules represent real rules?
• Real rules: 55% (Total: 294)
• Usage patterns: 3%
• False positives: 43%

34 RQ1: Distribution of Real Rules for Axion
• Distribution of rules based on ranks assigned by CAR-Miner
• #false positives is quite low among the rules ranked 1 to 60

35 RQ2: Detected Violations
• Do the detected violations represent real defects?
• Total number of defects: 160
• New defects not found by the WN-miner approach: 87

36 RQ2: Status of Detected Violations
• HsqlDB developers responded on the first 10 reported defects
  – Accepted 7 defects
  – Rejected 3 defects
• Reason given by HsqlDB developers for the rejected defects: "Although it can throw exceptions in general, it should not throw with HsqlDB, So it is fine"

37 RQ3: Comparison with WN-miner
• Does CAR-Miner perform better than WN-miner?
• Found 224 new rules and missed 32 rules
• CAR-Miner detected most of the rules mined by WN-miner
• Two major factors:
  – Sequence association rules
  – Increase in the data scope

38 RQ4: New Defects by Sequence Association Rules
• Do the sequence association rules detect new defects?
• Detected 21 new real defects among all applications

39 Agenda
• Motivation
• Mining Sequence Association Rules (CAR-Miner) [ICSE 09]
• Detecting Exception-Handling Defects
• Mining Alternative Patterns (Alattin) [ASE 09]
• Detecting Neglected Condition Defects
• Conclusion

40 Large Number of False Positives
• Existing approaches produce a large number of false positives
• One major observation:
  – Programmers often write code in different ways to achieve the same task
  – Some ways are more frequent than others
• Diagram: frequent ways -> (mine patterns) -> mined patterns -> (detect violations) -> infrequent ways

41 Example: java.util.Iterator.next()
Code Sample 1:
  void PrintEntries1(ArrayList entries) {
    …
    Iterator it = entries.iterator();
    if (it.hasNext()) {
      String last = (String) it.next();
    }
    …
  }
Code Sample 2:
  void PrintEntries2(ArrayList entries) {
    …
    if (entries.size() > 0) {
      Iterator it = entries.iterator();
      String last = (String) it.next();
    }
    …
  }
• java.util.Iterator.next() throws NoSuchElementException when invoked on a list without any elements

42 Example: java.util.Iterator.next()
• Code Samples 1 and 2 as on slide 41
• 1243 code examples: Sample 1 (1218/1243), Sample 2 (6/1243)
• Mined pattern from existing approaches: "boolean check on return of Iterator.hasNext before Iterator.next"

43 Example: java.util.Iterator.next()
• Code Samples 1 and 2 as on slide 41
• Require more general patterns (alternative patterns): P1 or P2
  – P1: boolean check on return of Iterator.hasNext before Iterator.next
  – P2: boolean check on return of ArrayList.size before Iterator.next
• Cannot be mined by existing approaches, since alternative P2 is infrequent
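The two alternatives, and the neglected condition they guard against, can be shown in a small runnable form. The method names and the use of generics are choices made for this sketch; the slide's samples elide their surrounding code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class IteratorGuards {

    // P1: boolean check on the return of hasNext() before next().
    static String firstWithHasNext(List<String> entries) {
        Iterator<String> it = entries.iterator();
        if (it.hasNext()) {
            return it.next();
        }
        return null;
    }

    // P2: check on the return of size() before next().
    static String firstWithSizeCheck(List<String> entries) {
        if (entries.size() > 0) {
            Iterator<String> it = entries.iterator();
            return it.next();
        }
        return null;
    }

    // Neglected condition: next() is called with no guard at all.
    static String firstUnguarded(List<String> entries) {
        return entries.iterator().next();
    }

    // Reports whether the unguarded version fails on an empty list.
    static boolean unguardedThrowsOnEmpty() {
        try {
            firstUnguarded(new ArrayList<>());
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> empty = new ArrayList<>();
        System.out.println(firstWithHasNext(empty));
        System.out.println(firstWithSizeCheck(empty));
        System.out.println(unguardedThrowsOnEmpty());
    }
}
```

A detector that knows only P1 would flag firstWithSizeCheck as a violation even though it is safe, which is the false positive that alternative patterns are meant to remove.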

44 Our Solution: ImMiner Algorithm
• Mines alternative patterns of the form P1 or P2
• Based on the observation that infrequent alternatives such as P2 are frequent among code examples that do not support P1
• 1243 code examples: Sample 1 (1218/1243), Sample 2 (6/1243)
  – P2 is infrequent among the entire 1243 code examples
  – P2 is frequent among code examples not supporting P1
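The counts on the slide make the observation concrete. The arithmetic below assumes that none of the 6 examples written like Sample 2 also supports P1, which the slide does not state explicitly.

```latex
\[
\mathrm{sup}(P_1) = \frac{1218}{1243} \approx 0.980, \qquad
\mathrm{sup}(P_2) = \frac{6}{1243} \approx 0.005
\]
\[
\mathrm{sup}(P_2 \mid \neg P_1) = \frac{6}{1243 - 1218} = \frac{6}{25} = 0.24
\]
```

Restricting attention to the 25 examples that do not support P1 raises the relative support of P2 by a factor of about 50.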

45 Alternative Patterns
• ImMiner mines three kinds of alternative patterns of the general form "P1 or P2"
  – Balanced: all alternatives (both P1 and P2) are frequent
  – Imbalanced: some alternatives (P1) are frequent and others (P2) are infrequent; represented as "P1 or P̂2"
  – Single: only one alternative

46 ImMiner Algorithm
• Uses frequent-itemset mining [Burdick et al. ICDE 01] iteratively
• Example: an input database with the following APIs for Iterator.next()
[Tables: input database; mapping of IDs to APIs]

47 ImMiner Algorithm: Frequent Alternatives
• Input database -> frequent itemset mining (min_sup 0.5) -> frequent item: 1
• P1: boolean check on the return of Iterator.hasNext() before Iterator.next()

48 ImMiner: Infrequent Alternatives of P1
• Split the input database into two databases: positive (PSD) and negative (NSD)
• Mine patterns that are frequent in NSD and infrequent in PSD
  – Reason: only such patterns serve as alternatives for P1
• Alternative pattern P2: "const check on the return of ArrayList.size() before Iterator.next()"
• Alattin applies the ImMiner algorithm to detect neglected conditions
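The split-and-mine step can be sketched for single items as follows. This is a simplification under stated assumptions: it handles single items rather than itemsets, uses one threshold for both databases, and runs on a made-up seven-row database in which ID 1 stands for the hasNext check and ID 2 for the size check. The slide's actual input table is not in the transcript.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ImbalancedAlternatives {

    // Fraction of transactions that contain the item.
    static double support(int item, List<Set<Integer>> db) {
        if (db.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (Set<Integer> t : db) {
            if (t.contains(item)) {
                hits++;
            }
        }
        return (double) hits / db.size();
    }

    // Splits db by whether a transaction contains frequentItem (PSD) or not (NSD),
    // then returns the items that are frequent in NSD and infrequent in PSD.
    static Set<Integer> alternativesOf(int frequentItem, List<Set<Integer>> db, double minSup) {
        List<Set<Integer>> psd = new ArrayList<>();
        List<Set<Integer>> nsd = new ArrayList<>();
        for (Set<Integer> t : db) {
            if (t.contains(frequentItem)) {
                psd.add(t);
            } else {
                nsd.add(t);
            }
        }
        Set<Integer> candidates = new TreeSet<>();
        for (Set<Integer> t : nsd) {
            candidates.addAll(t);
        }
        Set<Integer> alternatives = new TreeSet<>();
        for (int item : candidates) {
            if (support(item, nsd) >= minSup && support(item, psd) < minSup) {
                alternatives.add(item);
            }
        }
        return alternatives;
    }

    // Hypothetical database: 1 = hasNext check, 2 = size check, 3 = unrelated check.
    static List<Set<Integer>> sampleDatabase() {
        return List.of(Set.of(1), Set.of(1), Set.of(1, 3), Set.of(1),
                Set.of(2), Set.of(2), Set.of(3));
    }

    public static void main(String[] args) {
        System.out.println(alternativesOf(1, sampleDatabase(), 0.5));
    }
}
```

In the sample database, item 2 appears in 2 of 7 rows overall, below the 0.5 threshold, but in 2 of the 3 rows that lack item 1, so it is returned as an alternative.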

49 Neglected Conditions
• Neglected conditions refer to
  – Missing conditions that check the arguments or receiver of the API call before the API call
  – Missing conditions that check the return or receiver of the API call after the API call
• One of the primary reasons for many fatal issues, e.g., security or buffer-overflow vulnerabilities [Chang et al. ISSTA 07]

50 Alattin Approach
Goal: detect neglected conditions in the application under analysis
• Extract the classes and methods reused by the application
• Phase 1: issue queries and collect relevant code samples from open source projects on the web (1, 2, …, N), e.g., "lang:java java.util.Iterator next"
• Phase 2: generate pattern candidates
• Phase 3: mine alternative patterns
• Phase 4: detect neglected conditions statically and report violations

51 Evaluation
Research Questions:
1. Do alternative patterns exist in real applications?
2. What percentage of false positives in detected violations is eliminated (with low or no increase in false negatives)?

52 Subjects
• Two categories of subjects:
  – 3 Java default API libraries
  – 3 popular open source libraries
• #Samples: #code examples collected from Google code search

53 RQ1: Balanced and Imbalanced Patterns
• What percentage of patterns in real apps are balanced or imbalanced?
• Balanced patterns: 0% to 30% (average: 9.69%)
• Imbalanced patterns:
  – 30% to 100% (average: 65%) for Java default API libraries
  – 0% to 9.5% (average: 5%) for open source libraries
• Explanation: Java default API libraries provide more different ways of writing code than open source libraries

54 RQ2: False Positives and False Negatives
• What percentage of false positives is eliminated (with low or no increase in false negatives)?
• Applied mined patterns ("P1 or P2 or ... or Pi or Â1 or Â2 or ... or Âj") in three modes:
  – Existing mode: P1, P2, ..., Pi
  – Balanced mode: "P1 or P2 or ... or Pi"
  – Imbalanced mode: "P1 or P2 or ... or Pi or Â1 or Â2 or ... or Âj"
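The difference between the three modes can be sketched as a predicate over the set of patterns that a call site satisfies. This follows one reading of the slide: existing mode checks each frequent pattern separately, balanced mode accepts any frequent alternative, and imbalanced mode also accepts the infrequent alternatives. The class, enum, and parameter names are hypothetical.

```java
import java.util.List;
import java.util.Set;

final class DetectionModes {

    enum Mode { EXISTING, BALANCED, IMBALANCED }

    // True if a call site satisfying the given patterns is reported as a violation.
    static boolean isViolation(Mode mode, Set<String> satisfied,
                               List<String> frequent, List<String> infrequent) {
        switch (mode) {
            case EXISTING:
                // Each frequent pattern P1, ..., Pi is enforced on its own.
                return !satisfied.containsAll(frequent);
            case BALANCED:
                // "P1 or ... or Pi": any frequent alternative is enough.
                return frequent.stream().noneMatch(satisfied::contains);
            default:
                // IMBALANCED: frequent or infrequent alternatives are enough.
                return frequent.stream().noneMatch(satisfied::contains)
                        && infrequent.stream().noneMatch(satisfied::contains);
        }
    }

    public static void main(String[] args) {
        List<String> frequent = List.of("P1", "P2");
        List<String> infrequent = List.of("A1");
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": " + isViolation(mode, Set.of("A1"), frequent, infrequent));
        }
    }
}
```

Each successive mode accepts more call sites, so it can remove false positives; a real defect that happens to satisfy an accepted alternative would then be missed.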

55 RQ2: False Positives and False Negatives
• Existing Mode vs Balanced Mode
                    Existing Mode              Balanced Mode
Application         Defects  False Positives   Defects  False Positives  % of reduction  False Negatives
Java Util           37       104               37       104              0               0
Java Transaction    51       105               51       105              0               0
Java SQL            56       143               56       90               37.06           0
BCEL                2        14                2        8                42.86           0
HSqlDB              1        0                 1        0                0               0
Hibernate           10       9                          8                11.11           0
AVERAGE/TOTAL                                                            15.17           0
• Balanced mode reduced false positives by 15.17% without any increase in false negatives

56 RQ2: False Positives and False Negatives
• Existing Mode vs Imbalanced Mode
                    Existing Mode              Imbalanced Mode
Application         Defects  False Positives   Defects  False Positives  % of reduction  False Negatives
Java Util           37       104               36       74               28.85           1
Java Transaction    51       105               47       76               27.62           4
Java SQL            56       143               53       81               43.36           3
BCEL                2        14                2        6                57.14           0
HSqlDB              1        0                 1        0                0               0
Hibernate           10       9                          8                11.11           0
AVERAGE/TOTAL                                                            28.01           8
• Imbalanced mode reduced false positives by 28% with a quite small increase in false negatives
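The "% of reduction" column in both tables is consistent with the relative drop in false positives from existing mode, averaged over the six subjects. A short sketch of that arithmetic, assuming a subject with zero false positives in existing mode (HSqlDB) counts as a 0% reduction:

```java
final class FalsePositiveReduction {

    // Percentage of false positives removed relative to existing mode.
    static double reductionPercent(int existingFp, int modeFp) {
        if (existingFp == 0) {
            return 0.0;  // assumption: nothing to reduce counts as 0%
        }
        return 100.0 * (existingFp - modeFp) / existingFp;
    }

    // Plain average of per-subject reductions; pairs[i] = {existingFp, modeFp}.
    static double averageReduction(int[][] pairs) {
        double sum = 0.0;
        for (int[] p : pairs) {
            sum += reductionPercent(p[0], p[1]);
        }
        return sum / pairs.length;
    }

    // False-positive counts from the two tables, in subject order.
    static final int[][] BALANCED = {{104, 104}, {105, 105}, {143, 90}, {14, 8}, {0, 0}, {9, 8}};
    static final int[][] IMBALANCED = {{104, 74}, {105, 76}, {143, 81}, {14, 6}, {0, 0}, {9, 8}};

    public static void main(String[] args) {
        System.out.printf("balanced: %.2f%%%n", averageReduction(BALANCED));
        System.out.printf("imbalanced: %.2f%%%n", averageReduction(IMBALANCED));
    }
}
```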

57 Conclusion
• Problem-driven methodology by identifying
  – new problems, patterns
  – mining algorithms, defects
• CAR-Miner [ICSE 09]: mining sequence association rules of the form (FCc1 ... FCcn) Λ FCa => (FCe1 ... FCen), with context, trigger, and recovery
  – reduce false negatives
• Alattin [ASE 09]: mining alternative patterns classified into three categories (balanced, imbalanced, and single): P1 or P2 or ... or Pi or Â1 or Â2 or ... or Âj
  – reduce false positives

58 Thank You
Questions?
Tao Xie, University of Illinois at Urbana-Champaign
http://web.engr.illinois.edu/~taoxie/
taoxie@illinois.edu

