Efficiently Mining Source Code with Boa
Robert Dyer, Tien N. Nguyen, Hridesh Rajan, Hoan Anh Nguyen
The research activities described in this talk were supported in part by the US National Science Foundation (NSF) grants CCF , CCF , TWC , CCF , CCF , and CCF
2 What do I mean by software repository?
4 What features do they have?
5 What do I mean by mining software repositories (MSR)?
7 What are some examples of software repository mining?
8 What is the most used programming language?
9 How many words are in commit messages?
    Words[] = update,
    Words[] = cleanup,
    Words[] = updated,
    Words[] = refactoring,
    Words[] = fix,
    Words[] = test, 9428
    Words[] = typo, 9288
    Words[] = updates, 7746
    Words[] = javadoc, 6893
    Words[] = bugfix, 6295
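A minimal Python sketch of the kind of counting behind this query. It is not Boa code: the commit messages are invented for illustration, and treating a word as a run of ASCII letters is an assumption.

```python
import re
from collections import Counter

# Hypothetical commit messages; a real query would read them from revisions.
commit_messages = [
    "Fix typo in README",
    "update build script",
    "Update javadoc and fix test",
]

def count_words(messages):
    """Count lowercase words across all commit messages."""
    counts = Counter()
    for message in messages:
        # Assumption: a word is a maximal run of ASCII letters.
        counts.update(re.findall(r"[a-z]+", message.lower()))
    return counts

words = count_words(commit_messages)
for word, count in words.most_common(3):
    # Mirror the "Words[] = word, count" shape of the output on the slide.
    print(f"Words[] = {word}, {count}")
```

Lowercasing before counting merges "Update" and "update"; it does not merge variants such as "update" and "updates", which the slide also lists separately.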
10 How has unit testing evolved over time? [Figure: chart annotated with the JUnit 4 release]
11 What makes this ultra-large-scale mining?
12 Previous examples queried...
    Projects            699,331
    Code Repositories   494,158
    Revisions           15,063,073
    Unique Files        69,863,970
    File Snapshots      147,074,540
    AST Nodes           18,651,043,23
    Over 250GB of pre-processed data
13 What does bringing BIGDATA to the masses mean?
14 How has unit testing evolved over time? How can we solve this task?
15 [Figure: flowchart of a manual mining workflow. Node labels: foreach project, mine project metadata; Has repository? (yes); Access repository; mine revisions; Find all source files; mine sources; Find all methods; Method; Results]
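The steps named in the flowchart can be sketched as a sequential Python loop. This is only an illustration of the manual approach: the `projects` structure and its keys are hypothetical stand-ins for real project metadata and repository access.

```python
# Hypothetical in-memory stand-in for project metadata and repositories.
projects = [
    {"name": "a", "repository": {"revisions": [
        {"files": {"A.java": ["foo", "bar"], "notes.txt": []}},
    ]}},
    {"name": "b", "repository": None},  # no repository, so it is skipped
]

def mine_methods(projects):
    """Follow the flowchart steps one project at a time and collect methods."""
    results = []
    for project in projects:                      # foreach: mine project metadata
        repository = project["repository"]
        if repository is None:                    # Has repository?
            continue
        for revision in repository["revisions"]:  # access repository, mine revisions
            for path, methods in revision["files"].items():
                if not path.endswith(".java"):    # find all source files
                    continue
                for method in methods:            # find all methods
                    results.append((project["name"], path, method))
    return results

print(mine_methods(projects))
```

Every step runs in sequence on one machine, which is where the volume, velocity, and variety challenges on the following slides come in.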
16 Challenge: Volume [Figure: the mining workflow flowchart from slide 15]
17 Challenge: Volume
    Projects            699,331
    Code Repositories   494,158
    Revisions           15,063,073
    Unique Files        69,863,970
    File Snapshots      147,074,540
    AST Nodes           18,651,043,23
    How do you:
        Find such a large dataset?
        Access this data?
        Store the data?
        Transform the data for analysis?
        Efficiently analyze the data?
18 Challenge: Velocity [Figure: the mining workflow flowchart from slide 15]
19 Challenge: Velocity
21 Challenge: Variety [Figure: the mining workflow flowchart from slide 15]
22 Challenge: Variety
23 Ultra-large-scale Software Repository Mining: The Boa Experience [ICSE'14] [ICSE'13] [GPCE'13] [SPLASH'13 SRC] [TOSEM] (under review)
24 Boa's Architecture
    Boa's Data Infrastructure: replicate and transform; cache; stored on cluster
    Boa's Computing Infrastructure: user submits query; compiled into Hadoop program; deployed and executed on cluster; query result returned via web
25 Challenge: Volume, Challenge: Velocity, Challenge: Variety [Figure: the mining workflow flowchart from slide 15]
26 A better solution...
    Tests: output sum[timestamp] of int;
    cur_time: timestamp;
    visit(input, visitor {
        before n: Revision -> cur_time = n.commit_date;
        before n: Modifier ->
            if (n.kind == ModifierKind.ANNOTATION &&
                match(`^(org\.junit\.)?Test$`, n.annotation_name))
                Tests[cur_time] << 1;
    });
    Automatically parallelized
    Analyzes 18 billion AST nodes in minutes
    Only 10 lines of code
    No external libraries
27 How has unit testing evolved over time? (Slides 27-33 build the query one step at a time.) Declare the output, a sum of integers indexed by timestamp:
    Tests: output sum[timestamp] of int;
28 Visit the input with an initially empty visitor:
    visit(input, visitor {
    });
29 Add a clause that runs before each Modifier node:
        before n: Modifier ->
30 Keep only modifiers that are annotations:
            if (n.kind == ModifierKind.ANNOTATION &&
31 And whose name matches JUnit's Test annotation:
                match(`^(org\.junit\.)?Test$`, n.annotation_name))
32 Emit 1 to the output for each match, indexed by the current time:
                Tests[cur_time] << 1;
33 Track the current time by recording each Revision's commit date. The complete program:
    Tests: output sum[timestamp] of int;
    cur_time: timestamp;
    visit(input, visitor {
        before n: Revision -> cur_time = n.commit_date;
        before n: Modifier ->
            if (n.kind == ModifierKind.ANNOTATION &&
                match(`^(org\.junit\.)?Test$`, n.annotation_name))
                Tests[cur_time] << 1;
    });
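For readers unfamiliar with Boa, the logic of the complete program can be mirrored in plain Python. This is a rough analog, not how Boa executes the query: the dictionary layout of `project` is hypothetical, and the modifiers are flattened per revision instead of being reached through an AST traversal.

```python
import re
from collections import defaultdict

# Same pattern as the Boa program: "Test" or "org.junit.Test".
TEST_ANNOTATION = re.compile(r"^(org\.junit\.)?Test$")

def count_tests(project, tests):
    """Add 1 to tests[commit_date] for each JUnit Test annotation found."""
    for revision in project["revisions"]:
        cur_time = revision["commit_date"]          # before n: Revision
        for modifier in revision["modifiers"]:      # before n: Modifier
            if (modifier["kind"] == "ANNOTATION"
                    and TEST_ANNOTATION.search(modifier["annotation_name"])):
                tests[cur_time] += 1                # Tests[cur_time] << 1

# Hypothetical project data for illustration only.
project = {"revisions": [
    {"commit_date": 1, "modifiers": [
        {"kind": "ANNOTATION", "annotation_name": "Test"},
        {"kind": "ANNOTATION", "annotation_name": "org.junit.Test"},
        {"kind": "ANNOTATION", "annotation_name": "Override"},
        {"kind": "VISIBILITY", "annotation_name": ""},
    ]},
    {"commit_date": 2, "modifiers": [
        {"kind": "ANNOTATION", "annotation_name": "Test"},
    ]},
]}

tests = defaultdict(int)
count_tests(project, tests)
print(dict(tests))
```

The `Override` annotation and the non-annotation modifier are ignored, so only the three `Test` annotations contribute to the sums.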
34 [Figure: the Boa program is run once per project in the dataset (input = project 1, input = project 2, input = project 3, ..., input = project n). Each run emits values such as Tests[ ] << 1, and the Tests output variable sums them per timestamp to produce the output: Tests[ ] = 5, Tests[ ] = 12, Tests[ ] = 14, Tests[ ] = 18.]
35 Automatic Parallelization (applied to the program on slide 33)
    Output variables with built-in aggregator functions: sum, mean, top(k), bottom(k), set, collection, etc.
    Compiler generates Hadoop MapReduce code
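The split between per-project work and aggregation can be sketched in Python. This is a toy model under stated assumptions, not the Hadoop code Boa generates: `run_query`, `aggregate`, and the `dataset` layout are hypothetical, a thread pool stands in for the cluster, and only three of the aggregators named above are shown.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def run_query(project):
    """'Map' step: run the query on one project, emitting (key, value) pairs."""
    return [(timestamp, 1) for timestamp in project["test_timestamps"]]

# A few of the aggregators named on the slide, written as plain reducers.
AGGREGATORS = {
    "sum": sum,
    "mean": lambda values: sum(values) / len(values),
    "set": lambda values: sorted(set(values)),
}

def aggregate(emitted, kind):
    """'Reduce' step: group emitted values by key and apply the aggregator."""
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    return {key: AGGREGATORS[kind](values) for key, values in groups.items()}

# Hypothetical dataset: each project lists the timestamps of its test annotations.
dataset = [
    {"test_timestamps": [2006, 2006, 2007]},
    {"test_timestamps": [2006]},
    {"test_timestamps": [2007, 2008]},
]

# Projects are independent of each other, so the map step can run concurrently.
with ThreadPoolExecutor() as pool:
    per_project = list(pool.map(run_query, dataset))

emitted = [pair for pairs in per_project for pair in pairs]
tests = aggregate(emitted, "sum")
print(tests)
```

Because the query only ever sees one project and only communicates through output variables, the user never writes the grouping or concurrency code themselves.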
36 Abstracting MSR with Types (applied to the program on slide 33)
    Custom domain-specific types for mining software repositories
    5 base types and 9 types for source code
    No need to understand multiple data formats or APIs
37 Abstracting MSR with Types [Figure: containment chain of the base types Project, CodeRepository, Revision, ChangedFile, ASTRoot, with multiplicity labels 1, 1..*, 1, *, 1, *]
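One way to picture the containment chain of the five base types is as nested records. The Python dataclasses below are only an illustration: the field names are guesses made for this sketch and should not be read as Boa's actual attribute names or multiplicities.

```python
from dataclasses import dataclass, field
from typing import List

# Field names are illustrative assumptions, not Boa's schema.
@dataclass
class ASTRoot:
    namespaces: List[str] = field(default_factory=list)

@dataclass
class ChangedFile:
    name: str
    ast: ASTRoot = field(default_factory=ASTRoot)

@dataclass
class Revision:
    commit_date: int
    files: List[ChangedFile] = field(default_factory=list)

@dataclass
class CodeRepository:
    revisions: List[Revision] = field(default_factory=list)

@dataclass
class Project:
    name: str
    code_repositories: List[CodeRepository] = field(default_factory=list)

def file_snapshots(project):
    """Count ChangedFile entries by walking the containment chain."""
    return sum(len(revision.files)
               for repo in project.code_repositories
               for revision in repo.revisions)

project = Project("demo", [CodeRepository([
    Revision(1, [ChangedFile("A.java")]),
    Revision(2, [ChangedFile("A.java"), ChangedFile("B.java")]),
])])
print(file_snapshots(project))
```

With a single typed model like this, a query walks the same structure regardless of which version control system or hosting site the data originally came from.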
38 Abstracting MSR with Types [Figure: containment diagram of the source code types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression, with multiplicity labels on the links]
39 Challenge: How can we make mining source code easier? Answer: Declarative Visitors
40 Background: Visitor Pattern
    Without visitors: Rectangle, Circle, and Triangle each define draw(Graphics g) and scale(int x, int y)
    With visitors: Rectangle, Circle, and Triangle each define accept(Visitor v)
        DrawVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)
        ScaleVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)
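The classes on this slide can be sketched in Python as follows. Python has no method overloading, so the overloaded visit(...) methods become visit_rectangle, visit_circle, and visit_triangle here; the returned strings are placeholders for real drawing and scaling.

```python
class Rectangle:
    def accept(self, visitor):
        return visitor.visit_rectangle(self)

class Circle:
    def accept(self, visitor):
        return visitor.visit_circle(self)

class Triangle:
    def accept(self, visitor):
        return visitor.visit_triangle(self)

class DrawVisitor:
    """All drawing logic lives in one class instead of in every shape."""
    def visit_rectangle(self, r):
        return "draw rectangle"
    def visit_circle(self, c):
        return "draw circle"
    def visit_triangle(self, t):
        return "draw triangle"

class ScaleVisitor:
    """A second operation added without touching the shape classes."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def visit_rectangle(self, r):
        return f"scale rectangle by {self.x}x{self.y}"
    def visit_circle(self, c):
        return f"scale circle by {self.x}x{self.y}"
    def visit_triangle(self, t):
        return f"scale triangle by {self.x}x{self.y}"

shapes = [Rectangle(), Circle(), Triangle()]
drawn = [shape.accept(DrawVisitor()) for shape in shapes]
scaled = [shape.accept(ScaleVisitor(2, 3)) for shape in shapes]
print(drawn)
print(scaled)
```

Each shape only knows how to hand itself to a visitor, so adding a new operation means adding a new visitor class.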
41 Easing Source Code Mining with Visitors
    id := visitor {
        before T -> statement;
        after T -> statement;
    };
    visit(node, id);
42 Easing Source Code Mining with Visitors
    id := visitor {
        before id : T1 -> statement;
        before T2, T3 -> statement;
        before _ -> statement;
    };
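The clause forms above (a clause for one type, a clause shared by several types, and a wildcard) can be imitated in Python with dictionaries of callbacks. This is a sketch of the idea only: `Node`, `visit`, and the "_" key convention are assumptions made for the example, not Boa's implementation.

```python
class Node:
    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = list(children)

def visit(node, before=None, after=None):
    """Depth-first traversal calling before/after clauses keyed by node kind.

    The "_" key plays the role of the wildcard clause for unlisted kinds.
    """
    before = before or {}
    after = after or {}
    handler = before.get(node.kind, before.get("_"))
    if handler:
        handler(node)
    for child in node.children:
        visit(child, before, after)
    handler = after.get(node.kind, after.get("_"))
    if handler:
        handler(node)

log = []

def shared_clause(n):
    # One clause registered for two kinds, like "before T2, T3 -> statement;".
    log.append("before " + n.kind)

tree = Node("Declaration", [
    Node("Method", [Node("Statement", [Node("Expression")])]),
])

visit(
    tree,
    before={
        "Method": lambda n: log.append("before Method"),
        "Statement": shared_clause,
        "Expression": shared_clause,
        "_": lambda n: log.append("before other"),
    },
    after={"Method": lambda n: log.append("after Method")},
)
print(log)
```

The user supplies only the clauses they care about; the traversal order itself is provided by `visit`.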
43 Easing Source Code Mining with Visitors [Figure: two copies of the AST type tree: ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]
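44 Easing Source Code Mining with Visitors [Figure: AST type tree: ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]
    Empty clause; the default traversal continues:
        before n: Declaration -> { }
    Visit the fields explicitly:
        before n: Declaration -> {
            foreach (i: int; n.fields[i])
                visit(n.fields[i]);
        }
    Visit the fields explicitly, then stop the default traversal:
        before n: Declaration -> {
            foreach (i: int; n.fields[i])
                visit(n.fields[i]);
            stop;
        }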
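The effect of `stop` can be illustrated with a small Python traversal. This is a hypothetical model: here a before clause returning False plays the role of `stop`, and `Node`, `visit`, and `make_before` are names invented for the sketch.

```python
class Node:
    def __init__(self, kind, name, fields=(), methods=()):
        self.kind = kind
        self.name = name
        self.fields = list(fields)
        self.methods = list(methods)

def visit(node, before, seen):
    """Record the node, run its before clause, then traverse all children.

    A before clause that returns False stops the default traversal.
    """
    seen.append(node.name)
    handler = before.get(node.kind)
    if handler is not None and handler(node) is False:
        return
    for child in node.fields + node.methods:  # default traversal
        visit(child, before, seen)

def make_before(seen, stop):
    """Build a Declaration clause that visits only the fields explicitly."""
    before = {}
    def on_declaration(n):
        for f in n.fields:                    # foreach (...) visit(n.fields[i]);
            visit(f, before, seen)
        return False if stop else None        # stop;
    before["Declaration"] = on_declaration
    return before

decl = Node("Declaration", "C",
            fields=[Node("Variable", "x")],
            methods=[Node("Method", "m")])

default_seen, no_stop_seen, stop_seen = [], [], []
visit(decl, {}, default_seen)
visit(decl, make_before(no_stop_seen, stop=False), no_stop_seen)
visit(decl, make_before(stop_seen, stop=True), stop_seen)
print(default_seen, no_stop_seen, stop_seen)
```

Without the stop, the field is visited twice (once explicitly and once by the default traversal) and the method is still reached; with it, only the field is visited.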
45 Let's see it in action!
46 Summary
    Ultra-large-scale software repository mining poses several challenges
    Boa provides abstractions to address these challenges
        Ultra-large-scale dataset with almost 700k projects
        Domain-specific language, types, and functions to make mining software repositories easier
        Automatically parallelizes queries
47 Boa's Global Impact 90+ users from over 20 countries!
48 Thank you!