Efficiently Mining Source Code with Boa Robert Dyer The research activities described in this talk were supported in part by the US National Science Foundation.

Presentation transcript:

1 Efficiently Mining Source Code with Boa. Robert Dyer, Tien N. Nguyen, Hridesh Rajan, Hoan Anh Nguyen. The research activities described in this talk were supported in part by the US National Science Foundation (NSF) grants CCF , CCF , TWC , CCF , CCF , and CCF

2 What do I mean by software repository?

3

4 What features do they have?

5 What do I mean by mining software repositories (MSR)?

6

7 What are some examples of software repository mining?

8 What is the most used programming language?

9 How many words are in commit messages?
Words[] = update,
Words[] = cleanup,
Words[] = updated,
Words[] = refactoring,
Words[] = fix,
Words[] = test, 9428
Words[] = typo, 9288
Words[] = updates, 7746
Words[] = javadoc, 6893
Words[] = bugfix, 6295
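A tally like the one on this slide can be approximated outside Boa. The following Python sketch is not Boa code: it simply counts lowercase words across a list of commit messages. The function name `count_commit_words` and the sample messages are made up for illustration and are not taken from the Boa dataset.

```python
import re
from collections import Counter

def count_commit_words(messages):
    """Count how often each lowercase word appears across commit messages."""
    counts = Counter()
    for message in messages:
        # Split on anything that is not a letter or digit; drop empty pieces.
        counts.update(w for w in re.split(r"[^a-z0-9]+", message.lower()) if w)
    return counts

# Hypothetical sample data for illustration only.
sample = ["Fix typo in README", "fix build", "Update javadoc", "cleanup"]
word_counts = count_commit_words(sample)
for word, n in word_counts.most_common(3):
    print(f"Words[] = {word}, {n}")
```

On real data the same loop would run over every commit message in every repository, which is the part Boa distributes across a cluster.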

10 How has unit testing evolved over time? [Chart annotated with the JUnit 4 release]

11 What makes this ultra-large-scale mining?

12 Previous examples queried...
Projects: 699,331
Code Repositories: 494,158
Revisions: 15,063,073
Unique Files: 69,863,970
File Snapshots: 147,074,540
AST Nodes: 18,651,043,23
Over 250GB of pre-processed data

13 What does bringing BIGDATA to the masses mean?

14 How has unit testing evolved over time? How can we solve this task?

15 [Flowchart with steps: foreach project; mine project metadata; Has repository? (yes); Access repository; mine revisions; Find all source files; mine sources; Find all methods; Method; Results]

16 Challenge: Volume [Flowchart with steps: foreach project; mine project metadata; Has repository? (yes); Access repository; mine revisions; Find all source files; mine sources; Find all methods; Method; Results]

17 Challenge: Volume
Projects: 699,331
Code Repositories: 494,158
Revisions: 15,063,073
Unique Files: 69,863,970
File Snapshots: 147,074,540
AST Nodes: 18,651,043,23
How do you: Find such a large dataset? Transform the data for analysis? Access this data? Efficiently analyze the data? Store the data?

18 Challenge: Velocity [Flowchart with steps: foreach project; mine project metadata; Has repository? (yes); Access repository; mine revisions; Find all source files; mine sources; Find all methods; Method; Results]

19 Challenge: Velocity

20 Challenge: Velocity

21 Challenge: Variety [Flowchart with steps: foreach project; mine project metadata; Has repository? (yes); Access repository; mine revisions; Find all source files; mine sources; Find all methods; Method; Results]

22 Challenge: Variety

23 Ultra-large-scale Software Repository Mining: The Boa Experience. [ICSE'14] [ICSE'13] [GPCE'13] [SPLASH'13 SRC] [TOSEM] (under review)

24 Boa's Architecture
Boa's Data Infrastructure: Replicate, cache, and Transform; Stored on cluster.
Boa's Computing Infrastructure: User submits query; Compiled into Hadoop program; Deployed and executed on cluster; Query result returned via web.

25 Challenge: Volume. Challenge: Velocity. Challenge: Variety. [Flowchart with steps: foreach project; mine project metadata; Has repository? (yes); Access repository; mine revisions; Find all source files; mine sources; Find all methods; Method; Results]

26 A better solution...
    Tests: output sum[timestamp] of int;
    cur_time: timestamp;
    visit(input, visitor {
        before n: Revision -> cur_time = n.commit_date;
        before n: Modifier ->
            if (n.kind == ModifierKind.ANNOTATION && match(`^(org\.junit\.)?Test$`, n.annotation_name))
                Tests[cur_time] << 1;
    });
Automatically parallelized. Analyzes 18 billion AST nodes in minutes. Only 10 lines of code. No external libraries.

27 How has unit testing evolved over time?
    Tests: output sum[timestamp] of int;

28 How has unit testing evolved over time?
    Tests: output sum[timestamp] of int;
    visit(input, visitor {
    });

29 How has unit testing evolved over time?
    Tests: output sum[timestamp] of int;
    visit(input, visitor {
        before n: Modifier ->
    });

30 How has unit testing evolved over time?
    Tests: output sum[timestamp] of int;
    visit(input, visitor {
        before n: Modifier ->
            if (n.kind == ModifierKind.ANNOTATION &&
    });

31 How has unit testing evolved over time?
    Tests: output sum[timestamp] of int;
    visit(input, visitor {
        before n: Modifier ->
            if (n.kind == ModifierKind.ANNOTATION && match(`^(org\.junit\.)?Test$`, n.annotation_name))
    });

32 How has unit testing evolved over time?
    Tests: output sum[timestamp] of int;
    visit(input, visitor {
        before n: Modifier ->
            if (n.kind == ModifierKind.ANNOTATION && match(`^(org\.junit\.)?Test$`, n.annotation_name))
                Tests[cur_time] << 1;
    });

33 How has unit testing evolved over time?
    Tests: output sum[timestamp] of int;
    cur_time: timestamp;
    visit(input, visitor {
        before n: Revision -> cur_time = n.commit_date;
        before n: Modifier ->
            if (n.kind == ModifierKind.ANNOTATION && match(`^(org\.junit\.)?Test$`, n.annotation_name))
                Tests[cur_time] << 1;
    });
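The pattern passed to `match` accepts both the simple and the fully qualified annotation name. A quick way to check that is to try the same pattern in Python, whose `re` module reads these constructs the same way. This is a sketch only, and the helper name `is_junit_test_annotation` is hypothetical.

```python
import re

# Same pattern as in the Boa query: an optional "org.junit." prefix, then "Test".
JUNIT_TEST = re.compile(r"^(org\.junit\.)?Test$")

def is_junit_test_annotation(name):
    """Return True if the annotation name is Test or org.junit.Test."""
    return JUNIT_TEST.search(name) is not None

for candidate in ["Test", "org.junit.Test", "Tests", "MyTest"]:
    print(candidate, is_junit_test_annotation(candidate))
```

Because the pattern is anchored at both ends, names that merely contain "Test" are not counted.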

34 [Diagram: the same Boa program (the Tests query) runs once per project in the Dataset (input = project 1, input = project 2, input = project 3, ..., input = project n). Each instance emits values such as Tests[ ] << 1, and the Tests output variable combines them into the Output: Tests[ ] = 5, Tests[ ] = 12, Tests[ ] = 14, Tests[ ] = 18.]
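The data flow in this diagram can be mimicked in a few lines of Python. This is only a sketch of the idea, not Boa's implementation: each project is reduced to a hypothetical list of `(commit_date, annotation_names)` revisions, `run_query` plays the role of one program instance, and `aggregate_sum` plays the role of the `sum` output variable. All names and sample values are assumptions.

```python
from collections import defaultdict

def run_query(project):
    """Emit (timestamp, 1) for every JUnit Test annotation in one project."""
    emitted = []
    for commit_date, annotation_names in project:
        for name in annotation_names:
            if name in ("Test", "org.junit.Test"):
                emitted.append((commit_date, 1))  # mirrors: Tests[cur_time] << 1
    return emitted

def aggregate_sum(emissions):
    """Sum emitted values per index, like `output sum[timestamp] of int`."""
    totals = defaultdict(int)
    for key, value in emissions:
        totals[key] += value
    return dict(totals)

# Hypothetical dataset: two projects, each a list of revisions.
dataset = [
    [(2005, ["Test", "Override"]), (2006, ["Test", "Test"])],
    [(2006, ["org.junit.Test"]), (2007, [])],
]
# Each project is processed independently, so this step could run in parallel.
all_emissions = [e for project in dataset for e in run_query(project)]
tests = aggregate_sum(all_emissions)
print(tests)
```

The query code never mentions other projects or threads, which is what makes the per-project step safe to distribute.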

35 Automatic Parallelization
    Tests: output sum[timestamp] of int;
    cur_time: timestamp;
    visit(input, visitor {
        before n: Revision -> cur_time = n.commit_date;
        before n: Modifier ->
            if (n.kind == ModifierKind.ANNOTATION && match(`^(org\.junit\.)?Test$`, n.annotation_name))
                Tests[cur_time] << 1;
    });
Output variables with built-in aggregator functions: sum, mean, top(k), bottom(k), set, collection, etc. Compiler generates Hadoop MapReduce code.
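To give a feel for what such aggregators compute, here is a simplified Python sketch of four of them. It is not Boa's implementation and the exact semantics are assumptions; in particular, `agg_top` assumes each emitted value carries a weight and ranks values by total weight. The `agg_` function names are hypothetical.

```python
from collections import Counter

def agg_sum(values):
    """Add up all emitted values."""
    return sum(values)

def agg_mean(values):
    """Average of all emitted values (assumes at least one value)."""
    return sum(values) / len(values)

def agg_top(k, weighted_values):
    """Return the k values with the largest total weight."""
    totals = Counter()
    for value, weight in weighted_values:
        totals[value] += weight
    return [value for value, _ in totals.most_common(k)]

def agg_set(values):
    """Keep each distinct emitted value once."""
    return set(values)

print(agg_top(2, [("java", 3), ("c", 1), ("java", 2), ("python", 4)]))
```

Each of these can be computed from partial results, which is why they fit the reduce side of a MapReduce job.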

36 Abstracting MSR with Types
    Tests: output sum[timestamp] of int;
    cur_time: timestamp;
    visit(input, visitor {
        before n: Revision -> cur_time = n.commit_date;
        before n: Modifier ->
            if (n.kind == ModifierKind.ANNOTATION && match(`^(org\.junit\.)?Test$`, n.annotation_name))
                Tests[cur_time] << 1;
    });
Custom domain-specific types for mining software repositories. 5 base types and 9 types for source code. No need to understand multiple data formats or APIs.

37 Abstracting MSR with Types [Diagram: the base types Project, CodeRepository, Revision, ChangedFile, and ASTRoot linked in a chain, with multiplicity labels 1, 1..*, 1, *, 1, *]
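One way to picture this chain of types is as nested records. The Python sketch below models the five base types as dataclasses; apart from `commit_date`, which appears in the query on earlier slides, every field name here is an assumption made for illustration and not Boa's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ASTRoot:
    namespaces: List[str] = field(default_factory=list)  # placeholder content

@dataclass
class ChangedFile:
    name: str
    ast: ASTRoot = field(default_factory=ASTRoot)

@dataclass
class Revision:
    commit_date: int
    files: List[ChangedFile] = field(default_factory=list)

@dataclass
class CodeRepository:
    url: str
    revisions: List[Revision] = field(default_factory=list)

@dataclass
class Project:
    name: str
    code_repositories: List[CodeRepository] = field(default_factory=list)

def count_file_snapshots(project):
    """Walk the containment chain and count ChangedFile entries."""
    return sum(len(rev.files)
               for repo in project.code_repositories
               for rev in repo.revisions)

demo = Project("demo", [CodeRepository("repo", [
    Revision(1, [ChangedFile("A.java"), ChangedFile("B.java")]),
    Revision(2, [ChangedFile("A.java")]),
])])
print(count_file_snapshots(demo))
```

Because every level is an ordinary typed value, a query can walk from a project down to its source code without touching repository-specific formats.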

38 Abstracting MSR with Types [Diagram: the source code types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, and Expression, connected by associations with multiplicity labels 1, *, and 1..*]

39 Challenge: How can we make mining source code easier? Answer: Declarative Visitors

40 Background: Visitor Pattern
Without visitors: Rectangle, Circle, and Triangle each define draw(Graphics g) and scale(int x, int y).
With visitors: Rectangle, Circle, and Triangle each define accept(Visitor v); DrawVisitor and ScaleVisitor each define visit(Rectangle r), visit(Circle c), and visit(Triangle t).
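The shape example can be written out as a short Python sketch. Python has no method overloading, so the `visit(...)` overloads become `visit_rectangle`, `visit_circle`, and `visit_triangle`. Two simplifications are assumptions of this sketch: `DrawVisitor` returns a description string instead of drawing to a `Graphics` object, and a circle is scaled by the x factor only.

```python
class Rectangle:
    def __init__(self, width, height):
        self.width, self.height = width, height
    def accept(self, visitor):
        return visitor.visit_rectangle(self)

class Circle:
    def __init__(self, radius):
        self.radius = radius
    def accept(self, visitor):
        return visitor.visit_circle(self)

class Triangle:
    def __init__(self, base, height):
        self.base, self.height = base, height
    def accept(self, visitor):
        return visitor.visit_triangle(self)

class DrawVisitor:
    """Describes each shape instead of rendering it."""
    def visit_rectangle(self, r):
        return f"rectangle {r.width}x{r.height}"
    def visit_circle(self, c):
        return f"circle r={c.radius}"
    def visit_triangle(self, t):
        return f"triangle base={t.base} height={t.height}"

class ScaleVisitor:
    """Scales each shape in place by the factors x and y."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def visit_rectangle(self, r):
        r.width *= self.x
        r.height *= self.y
    def visit_circle(self, c):
        c.radius *= self.x  # simplification: only the x factor is used
    def visit_triangle(self, t):
        t.base *= self.x
        t.height *= self.y

shapes = [Rectangle(2, 3), Circle(1), Triangle(4, 5)]
for s in shapes:
    s.accept(ScaleVisitor(2, 3))
descriptions = [s.accept(DrawVisitor()) for s in shapes]
print(descriptions)
```

Adding a new operation now means adding one visitor class, with no change to the shape classes.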

41 Easing Source Code Mining with Visitors
    id := visitor {
        before T -> statement;
        after T -> statement;
    };
    visit(node, id);

42 Easing Source Code Mining with Visitors
    id := visitor {
        before id : T1 -> statement;
        before T2, T3 -> statement;
        before _ -> statement;
    };
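The before/after clauses and the `_` wildcard can be imitated in Python with a small recursive traversal. This is a sketch under assumptions, not how Boa is implemented: node types are plain strings in a hypothetical `Node` class, handlers live in dictionaries keyed by type name, and `"_"` is used only when no handler is registered for the exact type.

```python
class Node:
    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = list(children)

def visit(node, before=None, after=None):
    """Depth-first traversal calling before/after handlers by node kind."""
    before = before or {}
    after = after or {}
    handler = before.get(node.kind, before.get("_"))  # "_" is the wildcard
    if handler:
        handler(node)
    for child in node.children:
        visit(child, before, after)
    handler = after.get(node.kind, after.get("_"))
    if handler:
        handler(node)

tree = Node("Declaration", [Node("Method", [Node("Statement")]), Node("Variable")])
log = []
visit(
    tree,
    before={
        "Method": lambda n: log.append("before Method"),
        "_": lambda n: log.append(f"before {n.kind} (wildcard)"),
    },
    after={"Declaration": lambda n: log.append("after Declaration")},
)
print(log)
```

Only the node types of interest need a handler; the traversal itself is written once and reused by every query.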

43 Easing Source Code Mining with Visitors [Diagram, shown twice: the AST types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, and Expression]

44 Easing Source Code Mining with Visitors
[Diagram: the AST types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, and Expression]
    before n: Declaration -> { }
    before n: Declaration -> {
        foreach (i: int; n.fields[i])
            visit(n.fields[i]);
    }
    before n: Declaration -> {
        foreach (i: int; n.fields[i])
            visit(n.fields[i]);
        stop;
    }
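The difference between the last two versions is the `stop` statement, which prevents the default traversal from continuing below the current node. The Python sketch below imitates that behavior under assumptions: a hypothetical `Node` class keeps both its full `children` and the `fields` subset, and a handler returns the `STOP` sentinel to skip the default traversal. None of these names come from Boa.

```python
class Node:
    def __init__(self, kind, children=(), fields=()):
        self.kind = kind
        self.children = list(children)  # everything the default traversal visits
        self.fields = list(fields)      # the subset a handler may visit by hand

STOP = object()  # sentinel a handler returns to skip the default traversal

def visit(node, before):
    """Call the handler for this kind, then visit children unless it stopped."""
    handler = before.get(node.kind)
    if handler is not None and handler(node) is STOP:
        return
    for child in node.children:
        visit(child, before)

def run(with_stop):
    seen = []
    def on_declaration(n):
        for f in n.fields:          # visit only the fields by hand
            visit(f, handlers)
        return STOP if with_stop else None
    handlers = {
        "Declaration": on_declaration,
        "Variable": lambda n: seen.append("Variable"),
        "Method": lambda n: seen.append("Method"),
    }
    field_node, method_node = Node("Variable"), Node("Method")
    decl = Node("Declaration", children=[field_node, method_node], fields=[field_node])
    visit(decl, handlers)
    return seen

print(run(with_stop=False))  # fields visited by hand, then again by default
print(run(with_stop=True))   # only the hand-picked fields are visited
```

Without the stop, the fields are visited twice and the methods are still traversed, which is usually not what a fields-only query wants.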

45 Let's see it in action!

46 Summary. Ultra-large-scale software repository mining poses several challenges. Boa provides abstractions to address these challenges: a domain-specific language, types, and functions to make mining software repositories easier; automatically parallelized queries; and an ultra-large-scale dataset with almost 700k projects.

47 Boa's Global Impact 90+ users from over 20 countries!

48 Thank you!