Consensus-based Mining of API Preconditions in Big Code Hoan NguyenRobert DyerTien N. NguyenHridesh Rajan.

Consensus-based Mining of API Preconditions in Big Code Hoan NguyenRobert DyerTien N. NguyenHridesh Rajan

API Preconditions 2 Constraints on the receiver and parameters that must hold right before calling the API Constraints on the receiver and parameters that must hold right before calling the API java.lang.String.substring(int start, int end) start >= 0 start <= end end <= length() start >= 0 start <= end end <= length()

API Preconditions 3 Must hold to guarantee the method behaves as expected Otherwise  Bug

while (...) { start++; if (start >= lenMacro) break; ch = macro.substring(start, 1); } while (...) { start++; if (start >= lenMacro) break; ch = macro.substring(start, 1); } A bug with String.substring(int, int) in project MSS Code Factory 4

while (...) { start++; if (start >= lenMacro) break; ch = macro.substring(start, 1); } while (...) { start++; if (start >= lenMacro) break; ch = macro.substring(start, 1); } A bug with String.substring(int, int) in project MSS Code Factory 5 StringIndexOutOfBoundsException when start > 1

while (...) { start++; if (start >= lenMacro) break; ch = macro.substring(start, 1); } while (...) { start++; if (start >= lenMacro) break; ch = macro.substring(start, 1); } A bug with String.substring(int, int) in project MSS Code Factory 6 StringIndexOutOfBoundsException when start > 1

while (...) { start++; if (start >= lenMacro) break; ch = macro.substring(start, 1); } while (...) { start++; if (start >= lenMacro) break; ch = macro.substring(start, 1); } A bug with String.substring(int, int) in project MSS Code Factory 7

while (...) { start++; if (start >= lenMacro) break; ch = macro.substring(start, 1); } while (...) { start++; if (start >= lenMacro) break; ch = macro.substring(start, 1); } macro.substring(start, start + 1) A precondition-related bug fix in project MSS Code Factory 8 fix bug

Study on the precondition-related bug fixing Data collection – SourceForge projects: 3,413 – Revisions: ~2M – Fixing revisions: ~370,000 Method – Analyzing code changes in each revision – Using heuristics to identify candidate fixing changes that added precondition(s) – Verifying candidates manually 9

Study on the precondition-related bug fixing Result – Candidates: 3,130 (0.85%) fixing revisions 4,399 call sites – Manually verify a sample of 100 call sites 80 are actually related to missing preconditions 10

Use of Preconditions Writing code conforming to the specifications Automated program verification – Runtime assertion checking – Extended static checking Bug detection Automatic test case generation 11

Challenges Manually specifying the specifications is time-consuming Not many APIs are released with specified specifications 12 Mining Specifications Automatically

13 Prior Work Focuses on single projects This Work Uses consensus across large number of projects to separate API-specific preconditions (wheat) from project-specific constraints (chaff)

Precondition Mining – Key Ideas 14 Preconditions can be mined from guard conditions at the call sites of the code using the APIs Preconditions mined from multiple projects in a large-scale code corpus can be used to filter out chaff

15 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 Key Ideas Preconditions can be mined from guard conditions at the call sites of the code using the APIs Preconditions mined from multiple projects in a large-scale code corpus can be used to filter out chaff

16 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 Key Ideas Preconditions can be mined from guard conditions at the call sites of the code using the APIs Preconditions mined from multiple projects in a large-scale code corpus can be used to filter out chaff

17 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 Key Ideas Preconditions can be mined from guard conditions at the call sites of the code using the APIs Preconditions mined from multiple projects in a large-scale code corpus can be used to filter out chaff

api() C1 C3 C2 api() C1 C3 api() C1 C3 C2 18 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 api() C1 C3 C2 Key Ideas Preconditions can be mined from guard conditions at the call sites of the code using the APIs Preconditions mined from multiple projects in a large-scale code corpus can be used to filter out chaff

19 Preconditions can be mined from guard conditions at the call sites of the code using the APIs Client code of API String.substring(int,int) in project SeMoA at revision 1929 completePath_.substring(servletPathStart, extraPathStart) servletPathStart >= 0 extraPathStart >= 0 servletPathStart <= completePath_.length() extraPathStart <= completePath_.length() servletPathStart <= extraPathStart servletPathStart >= 0 extraPathStart >= 0 servletPathStart <= completePath_.length() extraPathStart <= completePath_.length() servletPathStart <= extraPathStart

20 Project-specific guard conditions will be filtered completePath_.substring(servletPathStart, extraPathStart) completePath_.charAt(servletPathStart) == ‘/’ completePath_.charAt(extraPathStart) == ‘/’

Consensus-based Precondition Mining 21 Client method M 1 Client method M 1 Conditions 0 <= start start <= end end <= length contains(‘@’) 0 <= start start <= end end <= length contains(‘@’) Build CFG Extract and Normalize Conditions Extract and Normalize Conditions Infer 0 <= start start <= end end <= length 0 <= start start <= end end <= length Client method M N Client method M N 0 < start start <= end end <= length ends(‘\n’) 0 < start start <= end end <= length ends(‘\n’)... Client method M i Client method M i... Preconditions 0 = start start <= end end <= length starts(‘/’) 0 = start start <= end end <= length starts(‘/’) api(...) Build CFG Extract and Normalize Conditions Extract and Normalize Conditions Build CFG Extract and Normalize Conditions Extract and Normalize Conditions Filter and Rank Filter and Rank Conditions 0 <= start start <= end end <= length contains(‘@’) 0 <= start start <= end end <= length contains(‘@’) api(...)

Precondition Mining 22 Entry Exit string.substring (start, end) c1 start > end Exit do_true do_false true false true false client method class String... substring (int, int)... class String... substring (int, int)... Build CFG

Precondition Mining 23 Entry Exit string.substring (start, end) c1 start > end Exit do_true do_false true false true false substring (int, int): {start <= end} client method Build CFG control-dependent

Precondition Mining 24 Entry Exit string.substring (start, end) c1 start > end Exit do_true do_false true false true false substring (int, int): {start <= end} client method Build CFG

Precondition Mining 25 Entry Exit string.substring (start, end) c1 start > end Exit do_true do_false true false true false substring (int, int): {start <= end} client method Build CFG

Precondition Mining 26 Entry Exit string.substring (start, end) c1 start > end Exit do_true do_false true false true false substring (int, int): {start <= end} client method Build CFG Abstract substring (int, int): {arg0 <= arg1}

Precondition Mining 27 arg0 < arg1 arg1 > arg0 arg0 – arg1 < 0 arg1 – arg0 > 0

arg0 < arg1 arg1 > arg0 arg0 – arg1 < 0 arg1 – arg0 > 0 Precondition Mining 28 Normalize arg0 < arg1

Precondition Mining 29 arg0 – arg1 == 0 arg0 == arg1 arg0 < arg1 arg1 > arg0 arg0 – arg1 < 0 arg1 – arg0 > 0 Normalize

Precondition Mining 30 Infer arg0 – arg1 == 0 arg0 == arg1 arg0 < arg1 arg1 > arg0 arg0 – arg1 < 0 arg1 – arg0 > 0 Normalize arg0 <= arg1

Precondition Mining 31 confidence >  Filter and Rank arg0 <= arg1 Infer arg0 – arg1 == 0 arg0 == arg1 arg0 < arg1 arg1 > arg0 arg0 – arg1 < 0 arg1 – arg0 > 0 Normalize arg0 <= arg1

Precondition Mining 32 confidence >  confidence <  Filter and Rank arg0 <= 12 arg0 <= arg1 Infer arg0 – arg1 == 0 arg0 == arg1 arg0 < arg1 arg1 > arg0 arg0 – arg1 < 0 arg1 – arg0 > 0 Normalize arg0 <= arg1 arg0 <= 12

Tool Implementation Eclipse plugin 33

Datasets 34 SourceForgeApache Projects3,413146 Total source files497,453132,951 Total classes600,274173,120 Total methods4,735,1511,243,911 Total SLOCs92,495,41025,117,837 Total used JDK classes806 (63%)918 (72%) Total used JDK methods7,592 (63%)6,109 (55%) Total method calls22,308,2515,544,437 Total JDK method calls5,588,4871,271,210

Datasets 35 SourceForgeApache Projects3,413146 Total source files497,453132,951 Total classes600,274173,120 Total methods4,735,1511,243,911 Total SLOCs92,495,41025,117,837 Total used JDK classes806 (63%)918 (72%) Total used JDK methods7,592 (63%)6,109 (55%) Total method calls22,308,2515,544,437 Total JDK method calls5,588,4871,271,210 Almost 120 millions lines of source code

Datasets 36 SourceForgeApache Projects3,413146 Total source files497,453132,951 Total classes600,274173,120 Total methods4,735,1511,243,911 Total SLOCs92,495,41025,117,837 Total used JDK classes806 (63%)918 (72%) Total used JDK methods7,592 (63%)6,109 (55%) Total method calls22,308,2515,544,437 Total JDK method calls5,588,4871,271,210 Not all JDK APIs are used in SourceForge and Apache

Datasets 37 SourceForgeApache Projects3,413146 Total source files497,453132,951 Total classes600,274173,120 Total methods4,735,1511,243,911 Total SLOCs92,495,41025,117,837 Total used JDK classes806 (63%)918 (72%) Total used JDK methods7,592 (63%)6,109 (55%) Total method calls22,308,2515,544,437 Total JDK method calls5,588,4871,271,210 More than 20% method calls are to JDK APIs

Evaluation – Accuracy 38 Data collection Building ground-truth Extracting preconditions from published formal specifications for JDK APIs on JML website Building ground-truth Extracting preconditions from published formal specifications for JDK APIs on JML website /*@ public normal_behavior @ requires 0 <= beginIndex @ && beginIndex <= endIndex @ && endIndex <= length(); @ … /*@ public behavior @ … @ signals (NoSuchElementException) isEmpty(); @*/

Building ground-truth 39 Number of methods: 797 Number of preconditions: 1155

Evaluation – Accuracy 40 Data collection Metrics Precision Recall Metrics Precision Recall

Running Time for Mining Preconditions of 797 JDK APIs 41 SLOCsClient methodsTime SourceForge92M4.7M17h35m Apache25M1.2M34m

Types of Incorrectly-mined Preconditions Type 1. The mined preconditions are stronger than specified – java.util.List.add(Object obj): obj != null 42 DatasetTotalStrongerSpecificAnalysis Error SourceForge173118532 Apache187121651 Both195129660 Type 2. The mined preconditions are project-specific, but common – java.lang.Math.min(double a, double b): a > 0, b > 0 Type 3. The mined preconditions are incorrect due to error in analysis – java.lang.StringBuffer.ensureCapacity(int capacity): capacity <= 0 Developers sometimes check stronger preconditions than specified

Types of Missing Preconditions 43 PrivateNo callNo occurLow confidence SF4% 9%3% Apache5% 12%3% Both5%2%10%4% Preconditions involve private element(s) of classes Type 1. Private Methods are never called Type 2. No call Preconditions are never checked Type 3. No occur Preconditions are checked with low confidence Type 4. Low frequency

Accuracy over Preconditions 44 PrecisionRecall SourceForge84%79% Apache82%75% Both83%80%

Newly Found Preconditions 45 5 preconditions are newly found for the JDK API methods that has already had JML specifications MethodPrecondition String. getChars(int,int,char[],int)arg3 >= 0 StringBuffer.append(char[])arg0 != null BitSet.flip(int, int)arg0 <= arg1 BitSet.set(int, int)arg0 <= arg1 BitSet.set(int, int, boolean)arg0 <= arg1

Accuracy by size 46 SourceForgeApache

Evaluation – Usefulness Suggesting preconditions for writing formal specification Web-based survey 47

Suggesting Preconditions for Writing Formal Specifications 48 ClassMethodSuggestAccept StringBufferdelete(int,int)3Y replace(int,int,String)2Y* setLength(int)1Y subSequence(int,int)3Y substring(int,int)3Y LinkedListadd(int,Object)2Y addAll(int,Collection)3Y get(int)2Y listIterator(int)2Y remove(int)2Y set(int,Object)2Y 2 classes11 methods25 miss

Conclusions Mining API preconditions from large code corpus – 120 million SLOCs on SourceForge and Apache Tool implementation: Eclipse plugin High accuracy – Recall: 75–80% and Precision: 82–84% Useful for helping write specifications – All suggestions are accepted by specification writer 49

Contact Hoan Nguyen hoan@iastate.edu 50

Web-based Survey http://boa.cs.iastate.edu/jml/ http://boa.cs.iastate.edu/jml/ 51 API method

Web-based Survey http://boa.cs.iastate.edu/jml/ http://boa.cs.iastate.edu/jml/ 52 Documentation links

Web-based Survey http://boa.cs.iastate.edu/jml/ http://boa.cs.iastate.edu/jml/ 53 Mined preconditions

Web-based Survey http://boa.cs.iastate.edu/jml/ http://boa.cs.iastate.edu/jml/ 54 Rating on correctness

Web-based Survey http://boa.cs.iastate.edu/jml/ http://boa.cs.iastate.edu/jml/ 55 Rating on correctness Rating on usefulness

Web-based Survey http://boa.cs.iastate.edu/jml/ http://boa.cs.iastate.edu/jml/ 56

Web-based Survey http://boa.cs.iastate.edu/jml/ http://boa.cs.iastate.edu/jml/ 57

Consensus-based Mining of API Preconditions in Big Code Hoan NguyenRobert DyerTien N. NguyenHridesh Rajan.

Similar presentations

Presentation on theme: "Consensus-based Mining of API Preconditions in Big Code Hoan NguyenRobert DyerTien N. NguyenHridesh Rajan."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Consensus-based Mining of API Preconditions in Big Code Hoan NguyenRobert DyerTien N. NguyenHridesh Rajan.

Similar presentations

Presentation on theme: "Consensus-based Mining of API Preconditions in Big Code Hoan NguyenRobert DyerTien N. NguyenHridesh Rajan."— Presentation transcript:

Similar presentations

About project

Feedback