Mining Programming Feature Usage at a Very Large Scale
Robert Dyer
These research activities are supported in part by the US National Science Foundation (NSF) grants CNS , CNS , CCF , CCF , CCF , CCF , CCF , TWC , CCF , CCF , and CCF
Collaborators: Tien N. Nguyen, Hridesh Rajan, Hoan Anh Nguyen, Nitin Tiwari, Sambhav Srirama
3 Participate in the MSR 2016 Mining Challenge (deadline: Feb 19)
4 Boa [TOSEM] (to appear) [ICSE'14] [GPCE'13] [ICSE'13]
5 What is the most used programming language?
6 How many words are in commit messages?
Words[] = update,
Words[] = cleanup,
Words[] = updated,
Words[] = refactoring,
Words[] = fix,
Words[] = test, 9428
Words[] = typo, 9288
Words[] = updates, 7746
Words[] = javadoc, 6893
Words[] = bugfix, 6295
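The word ranking above can be approximated on a small scale as follows. This is a minimal Python sketch, not the Boa query behind the slide: the function name `top_words`, the tokenization rule (lowercased runs of letters), and the sample messages are all assumptions made for illustration.

```python
import re
from collections import Counter


def top_words(messages, k):
    """Return the k most frequent words across commit messages (hypothetical helper)."""
    counts = Counter()
    for message in messages:
        # Assumption: a "word" is a lowercased run of ASCII letters.
        counts.update(re.findall(r"[a-z]+", message.lower()))
    return counts.most_common(k)


# Made-up commit messages, for illustration only.
messages = ["Fix typo", "fix test", "Update javadoc", "fix"]
print(top_words(messages, 1))
```

A real query would run over millions of revisions, which is where a built-in top(k) aggregator and automatic parallelization matter.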
7 How has unit testing been adopted over time? JUnit 4 release
8 What makes this ultra-large-scale mining?
9 Previous examples queried...
Projects: 699,331
Code Repositories: 494,158
Revisions: 15,063,073
Unique Files: 69,863,970
File Snapshots: 147,074,540
AST Nodes: 18,651,043,23
Over 250GB of pre-processed data from SourceForge
10 Most recent dataset (Sep 2015)
Projects: 7,830,023
Code Repositories: 380,125
Revisions: 23,229,406
Unique Files: 146,398,339
File Snapshots: 484,947,086
AST Nodes: 71,810,106,868
Over 270GB of pre-processed data from GitHub (focusing on Java projects)
11 What can we do with Boa?
12 Previous Language Studies
What languages do programmers choose? [Meyerovich&Rabkin SPLASH'13]
Reflection [Livshits et al. APLAS'05] [Callaú et al. MSR'11]
JavaScript / eval [Yue&Wang WWW'09] [Richards et al. PLDI'10] [Ratanaworabhan et al. WEBAPPS'10] [Richards et al. ECOOP'11]
Generics [Basit et al. SEKE'05] [Parnin et al. MSR'11] [Hoppe&Hanenberg SPLASH'13]
Object-oriented Features [Tempero et al. ECOOP'08] [Muschevici et al. OOPSLA'08] [Tempero ASWEC'09] [Grechanik et al. ESEM'10] [Gorschek et al. ICSE'10]
What is this study about?
How have new Java language features been adopted over time?
Assume Java
Corpus of 30k+ projects
Study 18 new features from 3 language editions
Over 10 years of history
14 Finding use of assert
Requires use of a parser (e.g. JDT)
Requires knowledge of several APIs
– SF.net / GitHub API
– SVNKit / JGit / etc.
Must be manually parallelized
15 Finding use of assert
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: CodeRepository -> {
            snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
                ASSERTS << 1;
    });
Automatically parallelized
Analyzes 18 billion AST nodes in minutes
Only 12 lines of code
No external libraries
16 Boa's Architecture
[Figure: architecture diagram]
Boa's Data Infrastructure: Replicate, cache, and Transform; Stored on cluster
Boa's Computing Infrastructure: User submits query; Compiled into Hadoop program; Deployed and executed on cluster; Query result returned via web
17 [Figure: the Dataset is split into inputs (input = project 1, input = project 2, input = project 3, ..., input = project n); each input is handled by its own process running the Boa Program (the assert query from slide 15); every Assert << 1 emitted by a process is aggregated into the single Output, Assert]
18 Automatic Parallelization
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: CodeRepository -> {
            snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
                ASSERTS << 1;
    });
Output variables with built-in aggregator functions: sum, mean, top(k), bottom(k), set, collection, etc.
Compiler generates Hadoop MapReduce code
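One way to picture how an output variable with a sum aggregator maps onto MapReduce is sketched below in plain Python. This is only an illustration of the idea, not the code Boa's compiler generates: `map_project`, `reduce_sum`, and the representation of a project as a list of statement kinds are hypothetical.

```python
def map_project(statement_kinds):
    """Map step: emit a 1 for every assert statement found in one project."""
    return [1 for kind in statement_kinds if kind == "ASSERT"]


def reduce_sum(emitted_per_project):
    """Reduce step: the sum aggregator adds up every emitted value."""
    return sum(value for emitted in emitted_per_project for value in emitted)


# Each inner list stands in for the statements of one project (made-up data).
projects = [
    ["ASSERT", "IF", "RETURN"],
    ["WHILE"],
    ["ASSERT", "ASSERT"],
]
total_asserts = reduce_sum(map_project(p) for p in projects)
print(total_asserts)
```

Because each project is mapped independently and the aggregator only combines emitted values, the map step can run on as many machines as there are inputs.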
19 Abstracting MSR with Types
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: CodeRepository -> {
            snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
                ASSERTS << 1;
    });
Custom domain-specific types for mining software repositories
5 base types and 9 types for source code
No need to understand multiple data formats or APIs
20 Abstracting MSR with Types
[Figure: base types Project, CodeRepository, Revision, ChangedFile, ASTRoot; multiplicity labels 1, 1..*, 1, *, 1, *]
21 Abstracting MSR with Types
[Figure: source code types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression; multiplicity labels 1, *, 1, 1..*, 1, *, 1, *, 1, *, *, *, 1, 1]
22 Challenge: How can we make mining source code easier? Answer: Declarative Visitors
23 Easing Source Code Mining with Visitors
    id := visitor {
        before T -> statement;
        after T -> statement;
    };
    visit(node, id);
24 Easing Source Code Mining with Visitors
    id := visitor {
        before id : T1 -> statement;
        before T2, T3 -> statement;
        before _ -> statement;
    };
25 Easing Source Code Mining with Visitors
[Figure: traversal over ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]
26 Easing Source Code Mining with Visitors
[Figure: ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]
    before n: Declaration -> { }

    before n: Declaration -> {
        foreach (i: int; def(n.fields[i]))
            visit(n.fields[i]);
    }

    before n: Declaration -> {
        foreach (i: int; def(n.fields[i]))
            visit(n.fields[i]);
        stop;
    }
27 Let's revisit the assert use example.
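The effect of a custom traversal followed by stop can be sketched in Python. This is an analogy under stated assumptions, not Boa's implementation: the `Node` class, the `visit` function, the handler table, and the convention that a handler returning False behaves like stop are all hypothetical, and the "_" entry is assumed to act as a fallback for kinds without their own handler.

```python
class Node:
    """A tiny tree node: a kind name plus child nodes (hypothetical model)."""

    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = list(children)


def visit(node, before):
    """Depth-first traversal; a handler returning False suppresses the default descent."""
    handler = before.get(node.kind, before.get("_"))
    descend = True
    if handler is not None:
        descend = handler(node) is not False
    if descend:
        for child in node.children:
            visit(child, before)


seen = []


def on_declaration(node):
    # Custom traversal: visit only the field-like children...
    for child in node.children:
        if child.kind == "Variable":
            visit(child, handlers)
    # ...then skip the default traversal, in the spirit of `stop`.
    return False


handlers = {"Declaration": on_declaration, "_": lambda n: seen.append(n.kind)}

tree = Node("Declaration", [
    Node("Variable", [Node("Type")]),
    Node("Method", [Node("Statement")]),
])
visit(tree, handlers)
print(seen)
```

In this sketch the Method subtree is never reached, because the Declaration handler visits only its Variable children and then suppresses the default descent.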
28 Finding use of assert
    ASSERTS: output sum of int;
29 Finding use of assert
    ASSERTS: output sum of int;
    visit(input, visitor {
    });
30 Finding use of assert
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: Statement ->
    });
31 Finding use of assert
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
    });
32 Finding use of assert
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
                ASSERTS << 1;
    });
33 Finding use of assert
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: CodeRepository -> {
            snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
                ASSERTS << 1;
    });
34 Back to our feature study…
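For comparison, here is what the same kind of query looks like when written by hand for a single file. This is an analogy only: it uses Python's standard `ast` module on Python source rather than Boa on Java, and the names `AssertCounter` and `count_asserts` are made up for the sketch.

```python
import ast


class AssertCounter(ast.NodeVisitor):
    """Counts assert statements while walking a Python syntax tree."""

    def __init__(self):
        self.count = 0

    def visit_Assert(self, node):
        self.count += 1
        # Keep walking in case the assert contains nested nodes of interest.
        self.generic_visit(node)


def count_asserts(source):
    """Parse one source string and return the number of assert statements."""
    counter = AssertCounter()
    counter.visit(ast.parse(source))
    return counter.count


sample = """
def f(x):
    assert x > 0
    if x > 10:
        assert x < 100
    return x
"""
print(count_asserts(sample))
```

This handles one file in one process; fetching repositories, selecting snapshots, and distributing the work would all still have to be written by hand.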
35 Research Questions
RQ2: How frequently is each feature used?
RQ4: Could features have been used more?
RQ5: Was old code converted to use new features?
Research Question 2 How frequently was each language feature used?
37 Project Histogram: Annotation Use
38 Project Density: Annotation Use
39 Some features popular
40 Some features popular. Why?
41 Some features popular. Why?
List, ArrayList, Map, HashMap, Set, Collection, Vector, Class, Iterator, HashSet
(confirms [Parnin et al. MSR'11])
Research Question 4 Could features have been used more?
43 Opportunity: Assert
Find methods that throw IllegalArgumentException.
Before:
    void m(..) { if (cond) throw new IllegalArgumentException(); ... }
After:
    void m(..) { assert !cond; ... }
Simpler
Machine-checkable
Easily disabled for production
44 Opportunity: Binary Literals
Find where literal 1 is shifted left.
    int x = 1 << 5;
Before:
    short[] phases = { 0x7, 0xE, 0xD, 0xB };
After:
    short[] phases = { 0b0111, 0b1110, 0b1101, 0b1011 };
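The rewrite on this slide is purely notational, which a few lines of Python can confirm. The helper `to_binary_literal` is hypothetical and only renders text in the shape of a Java binary literal; it is not part of the study's tooling.

```python
def to_binary_literal(value, bits=4):
    """Render a non-negative int as a Java-style binary literal, zero-padded to `bits` digits."""
    return "0b" + format(value, "0{}b".format(bits))


phases = [0x7, 0xE, 0xD, 0xB]
rewritten = [to_binary_literal(p) for p in phases]
print(rewritten)

# The shifted-literal case from the slide: 1 << 5 written out as a bit pattern.
print(to_binary_literal(1 << 5, bits=6))
```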
45 Opportunity: Underscore Literals
Find integers with 7 or more digits and no underscores.
Before:
    int x = 1000000;
After:
    int x = 1_000_000;
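A rough text-level approximation of this search is sketched below in Python. It is an assumption-laden stand-in for the AST-based query used in the study: the regular expression and the function name `underscore_opportunities` are invented here, and a pattern match on raw text will also fire inside comments and string literals, which an AST-based query would not.

```python
import re

# 7 or more consecutive decimal digits (optionally with a long suffix),
# not embedded in an identifier, a hex literal, or a literal that already
# contains underscores.
LONG_LITERAL = re.compile(r"(?<![\w.])\d{7,}[lL]?(?![\w.])")


def underscore_opportunities(java_source):
    """Return decimal integer literals that could be rewritten with underscores."""
    return LONG_LITERAL.findall(java_source)


# Made-up Java snippet, for illustration only.
code = "int a = 1000000; int b = 1_000_000; int c = 999999; long d = 12345678L;"
print(underscore_opportunities(code))
```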
46 Opportunity: Diamond
Instantiation of generics not using diamond.
Before:
    List<T> l = new ArrayList<T>();
After:
    List<T> l = new ArrayList<>();
47 Opportunity: MultiCatch
A try with multiple, identical catch blocks.
Before:
    try { .. } catch (T1 e) { b1 } catch (T2 e) { b1 }
After:
    try { .. } catch (T1 | T2 e) { b1 }
48 Opportunity: Try w/ Resources
Try statements calling close() in the finally block.
Before:
    try { .. } finally { var.close(); }
After:
    try (var = ..) { .. }
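The multi-catch check amounts to grouping catch clauses by their body. A minimal Python sketch, assuming each clause has already been extracted as an (exception type, body text) pair; the function name `multicatch_groups` and the whitespace normalization are illustrative choices, not the study's method.

```python
from collections import defaultdict


def multicatch_groups(catch_clauses):
    """Group exception types whose catch bodies are textually identical."""
    by_body = defaultdict(list)
    for exc_type, body in catch_clauses:
        # Assumption: bodies are compared after collapsing whitespace.
        by_body[" ".join(body.split())].append(exc_type)
    return [types for types in by_body.values() if len(types) > 1]


# Made-up catch clauses from a single try statement.
clauses = [
    ("IOException", "log(e);"),
    ("SQLException", "log(e);  "),
    ("RuntimeException", "throw e;"),
]
print(multicatch_groups(clauses))
```

A real check would also need to rule out types related by subclassing, since Java rejects those as alternatives in one multi-catch clause.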
49 Millions of opportunities!
       Assert   Varargs   Binary Literals   Diamond   MultiCatch   Try w/ Resources   Underscore Literals
Old    89K      612K      56K               3.3M      341K         489K               5.3M
New    291K     1.6M      5K                414K      24K          33K                507K
50 Millions of opportunities!
                             Assert   Varargs   Binary Literals   Diamond   MultiCatch   Try w/ Resources   Underscore Literals
Potential Uses (Projects)    18.18%   88.78%    5.9%              59.08%    49.75%       37.27%             51.15%
Actual Uses (Projects)       12.72%   15.43%    0.02%             0.4%      0.27%        0.21%              0.02%
Research Question 5 Was old code converted to use new features?
52 Detecting Conversions
Compare File.java (Revision N) with File.java (Revision N+1).
For each revision, count the feature's uses and its potential uses.
A conversion is detected when uses N < uses N+1 and potential N > potential N+1.
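The rule on this slide compares two consecutive revisions of the same file. A minimal Python sketch, assuming the per-revision counts of uses and potential uses are already available; the `Snapshot` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Feature counts for one revision of a file (hypothetical record)."""
    uses: int
    potential: int


def is_conversion(prev, curr):
    """Uses went up while potential uses went down between consecutive revisions."""
    return prev.uses < curr.uses and prev.potential > curr.potential


def count_conversions(history):
    """Count revision pairs (N, N+1) in a file history that look like conversions."""
    return sum(is_conversion(a, b) for a, b in zip(history, history[1:]))


# Made-up history of one file across four revisions.
history = [Snapshot(0, 3), Snapshot(1, 2), Snapshot(1, 2), Snapshot(3, 0)]
print(count_conversions(history))
```

Requiring both inequalities filters out revisions that merely add new code using the feature, where uses rise but potential uses do not fall.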
53 Detected lots of conversions!
Manual, systematic sampling confirms:
2602 conversions
13 not conversions
[Table: Count, Files, and Projects for Assert, Varargs, Diamond, MultiCatch, Try w/ Resources, Underscore Literals; surviving fragments: Count K8.5K, Files K3.8K]
54 To summarize...
Similar usage patterns
Old code converted to use new features
[Table: Count, Files, and Projects for Assert, Varargs, Diamond, MultiCatch, Try w/ Resources, Underscore Literals; surviving fragments: Count K8.5K, Files K3.8K]
Only a few features see high use
Despite (missed) potential for use
            Assert   Varargs   Binary Literals   Diamond   MultiCatch   Try w/ Resources   Underscore Literals
Old         89K      612K      56K               3.3M      341K         489K               5.3M
New         291K     1.6M      5K                414K      24K          33K                507K
All         380K     2.2M      61K               3.7M      365K         522K               5.8M
Files       1.39%    12.74%    0.11%             12.25%    2.28%        1.85%              5.86%
Projects    18.18%   88.78%    5.9%              59.08%    49.75%       37.27%             51.15%
Feature adoption by individuals
55 Summary
Ultra-large-scale software repository mining poses several challenges
Boa provides abstractions to address these challenges
– Ultra-large-scale dataset with millions of projects
– Automatically parallelizes queries
– Domain-specific language, types, and functions to make mining software repositories easier
56 Boa's Global Impact 300+ users from over 20 countries!