Mining Programming Feature Usage at a Very Large Scale Robert Dyer These research activities supported in part by the US National Science Foundation (NSF)

Presentation transcript:

Mining Programming Feature Usage at a Very Large Scale Robert Dyer These research activities supported in part by the US National Science Foundation (NSF) grants CNS , CNS , CCF , CCF , CCF , CCF , CCF , TWC , CCF , CCF , and CCF

Collaborators: Tien N. Nguyen, Hridesh Rajan, Hoan Anh Nguyen, Nitin Tiwari, Sambhav Srirama

3 Participate in the MSR 2016 Mining Challenge (deadline: Feb 19)

4 Boa [TOSEM] (to appear) [ICSE'14] [GPCE'13] [ICSE'13]

5 What is the most used programming language?

6 How many words are in commit messages?
Words[] = update,
Words[] = cleanup,
Words[] = updated,
Words[] = refactoring,
Words[] = fix,
Words[] = test, 9428
Words[] = typo, 9288
Words[] = updates, 7746
Words[] = javadoc, 6893
Words[] = bugfix, 6295

7 How has unit testing been adopted over time? JUnit 4 release

8 What makes this ultra-large-scale mining?

9 Previous examples queried...
Projects: 699,331
Code Repositories: 494,158
Revisions: 15,063,073
Unique Files: 69,863,970
File Snapshots: 147,074,540
AST Nodes: 18,651,043,23
Over 250GB of pre-processed data from SourceForge

10 Most recent dataset (Sep 2015)
Projects: 7,830,023
Code Repositories: 380,125
Revisions: 23,229,406
Unique Files: 146,398,339
File Snapshots: 484,947,086
AST Nodes: 71,810,106,868
Over 270GB of pre-processed data from GitHub (focusing on Java projects)

11 What can we do with Boa?

12 Previous Language Studies
What languages do programmers choose? [Meyerovich&Rabkin SPLASH'13]
Reflection: [Livshits et al. APLAS'05] [Callaú et al. MSR'11]
JavaScript / eval: [Yue&Wang WWW'09] [Richards et al. PLDI'10] [Ratanaworabhan et al. WEBAPPS'10] [Richards et al. ECOOP'11]
Generics: [Basit et al. SEKE'05] [Parnin et al. MSR'11] [Hoppe&Hanenberg SPLASH'13]
Object-oriented Features: [Tempero et al. ECOOP'08] [Muschevici et al. OOPSLA'08] [Tempero ASWEC'09] [Grechanik et al. ESEM'10] [Gorschek et al. ICSE'10]

What is this study about?
How have new Java language features been adopted over time?
Assume Java
Corpus of 30k+ projects
Study 18 new features from 3 language editions
Over 10 years of history

14 Finding use of assert
Requires use of a parser (e.g. JDT)
Requires knowledge of several APIs
  – SF.net / GitHub API
  – SVNkit/JGit/etc
Must be manually parallelized

15 Finding use of assert
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: CodeRepository -> {
            snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
                ASSERTS << 1;
    });
Automatically parallelized
Analyzes 18 billion AST nodes in minutes
Only 12 lines of code
No external libraries

16 Boa's Architecture
Boa's Data Infrastructure: Replicate, cache, and transform; stored on cluster
Boa's Computing Infrastructure: User submits query; compiled into Hadoop program; deployed and executed on cluster; query result returned via web

17 [Diagram: the dataset is split into inputs (input = project 1, input = project 2, input = project 3, ..., input = project n); a separate process runs the Boa program on each input and emits ASSERTS << 1 values, which are aggregated into the output.]
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: CodeRepository -> {
            snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
                ASSERTS << 1;
    });

18 Automatic Parallelization
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: CodeRepository -> {
            snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
                ASSERTS << 1;
    });
Output variables with built-in aggregator functions: sum, mean, top(k), bottom(k), set, collection, etc.
Compiler generates Hadoop MapReduce code
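The idea behind the sum aggregator can be approximated locally in plain Java. This is only an analogy, not Boa's generated Hadoop code: each project is processed independently (the "map" side) and the per-project counts are summed (the "reduce" side). The class name, the method names, and the representation of a project as a list of statement-kind strings are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class SumAggregatorSketch {
    /** "Map" step: counts ASSERT statements in one project (hypothetical representation). */
    static long countAsserts(List<String> statementKinds) {
        return statementKinds.stream().filter("ASSERT"::equals).count();
    }

    /** "Reduce" step: sums the per-project counts; projects may be processed in parallel. */
    static long sumOverProjects(List<List<String>> projects) {
        return projects.parallelStream()
                .mapToLong(SumAggregatorSketch::countAsserts)
                .sum();
    }

    public static void main(String[] args) {
        List<List<String>> projects = Arrays.asList(
                Arrays.asList("ASSERT", "IF"),
                Arrays.asList("RETURN"),
                Arrays.asList("ASSERT", "ASSERT"));
        System.out.println(sumOverProjects(projects));
    }
}
```

Because summation is associative and each project is independent, the order in which projects are processed does not affect the result, which is what makes this kind of query straightforward to parallelize.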

19 Abstracting MSR with Types
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: CodeRepository -> {
            snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
                ASSERTS << 1;
    });
Custom domain-specific types for mining software repositories
5 base types and 9 types for source code
No need to understand multiple data formats or APIs

20 Abstracting MSR with Types
[Diagram: containment hierarchy of the base types Project, CodeRepository, Revision, ChangedFile, ASTRoot, with multiplicity labels 1, 1..*, and *]

21 Abstracting MSR with Types
[Diagram: containment hierarchy of the source code types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression, with multiplicity labels 1, 1..*, and *]

22 Challenge: How can we make mining source code easier? Answer: Declarative Visitors

23 Easing Source Code Mining with Visitors
    id := visitor {
        before T -> statement;
        after T -> statement;
    };
    visit(node, id);

24 Easing Source Code Mining with Visitors
    id := visitor {
        before id : T1 -> statement;
        before T2, T3 -> statement;
        before _ -> statement;
    };

25 Easing Source Code Mining with Visitors
[Diagram: traversal over the AST types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]

26 Easing Source Code Mining with Visitors
[Diagram: AST types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]
    before n: Declaration -> { }

    before n: Declaration -> {
        foreach (i: int; n.fields[i])
            visit(n.fields[i]);
    }

    before n: Declaration -> {
        foreach (i: int; n.fields[i])
            visit(n.fields[i]);
        stop;
    }
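The effect of visiting selected children by hand and then stopping the default traversal can be sketched in plain Java. This is an illustrative analogy rather than Boa's implementation: the Node class, the Visitor interface, and the convention that before() returns false to play the role of stop are all hypothetical, and the sketch assumes that after() still runs when the traversal is stopped.

```java
import java.util.Arrays;
import java.util.List;

final class VisitorSketch {
    /** Hypothetical tree node; "kind" stands in for a Boa node type. */
    static final class Node {
        final String kind;
        final List<Node> children;

        Node(String kind, Node... children) {
            this.kind = kind;
            this.children = Arrays.asList(children);
        }
    }

    /** before() returns false to skip the default traversal, like "stop". */
    interface Visitor {
        boolean before(Node node);

        default void after(Node node) { }
    }

    /** Depth-first traversal: before, then children unless stopped, then after. */
    static void visit(Node node, Visitor visitor) {
        if (visitor.before(node)) {
            for (Node child : node.children) {
                visit(child, visitor);
            }
        }
        visitor.after(node);
    }

    /** Counts Variable nodes that are direct children of a Declaration (its fields). */
    static int countFields(Node root) {
        final int[] count = {0};
        visit(root, new Visitor() {
            @Override
            public boolean before(Node node) {
                if (node.kind.equals("Declaration")) {
                    // Visit only the fields by hand, then stop the default traversal.
                    for (Node child : node.children) {
                        if (child.kind.equals("Variable")) {
                            visit(child, this);
                        }
                    }
                    return false;
                }
                if (node.kind.equals("Variable")) {
                    count[0]++;
                }
                return true;
            }
        });
        return count[0];
    }

    public static void main(String[] args) {
        Node tree = new Node("Declaration",
                new Node("Variable"),
                new Node("Method", new Node("Variable")));
        System.out.println(countFields(tree));
    }
}
```

In the example tree, the local variable inside the method is never reached, because the Declaration handler visits only its Variable children and then suppresses the default descent.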

27 Let’s revisit the assert use example.

28 Finding use of assert ASSERTS: output sum of int;

29 Finding use of assert ASSERTS: output sum of int; visit(input, visitor { });

30 Finding use of assert
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: Statement ->
    });

31 Finding use of assert
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
    });

32 Finding use of assert
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
                ASSERTS << 1;
    });

33 Finding use of assert
    ASSERTS: output sum of int;
    visit(input, visitor {
        before node: CodeRepository -> {
            snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Statement ->
            if (node.kind == StatementKind.ASSERT)
                ASSERTS << 1;
    });

34 Back to our feature study…

35 Research Questions
RQ2: How frequently is each feature used?
RQ4: Could features have been used more?
RQ5: Was old code converted to use new features?

Research Question 2 How frequently was each language feature used?

37 Project Histogram: Annotation Use

38 Project Density: Annotation Use

39 Some features popular

40 Some features popular. Why?

41 Some features popular. Why?
List, ArrayList, Map, HashMap, Set, Collection, Vector, Class, Iterator, HashSet (confirms [Parnin et al. MSR'11])

Research Question 4 Could features have been used more?

43 Opportunity: Assert
Find methods that throw IllegalArgumentException.
    void m(..) {
        if (!cond) throw new IllegalArgumentException();
        ...
    }

    void m(..) {
        assert cond;
        ...
    }
Simpler
Machine-checkable
Easily disabled for production

44 Opportunity: Binary Literals
Find where literal 1 is shifted left.
    int x = 1 << 5;

    short[] phases = { 0x7, 0xE, 0xD, 0xB };
    short[] phases = { 0b0111, 0b1110, 0b1101, 0b1011 };
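As a small runnable illustration of this opportunity, the sketch below checks that the hex and binary spellings of the slide's values denote the same numbers, and renders a shifted 1 as a binary literal string of the kind a rewrite could suggest. The class and method names are hypothetical and not part of the study's tooling.

```java
import java.util.Arrays;

final class BinaryLiteralSketch {
    /** Renders the value of (1 << shift) as a Java binary literal string. */
    static String shiftAsBinaryLiteral(int shift) {
        return "0b" + Integer.toBinaryString(1 << shift);
    }

    /** Compares the hex and binary spellings of the phase values from the slide. */
    static boolean spellingsAgree() {
        short[] hex = { 0x7, 0xE, 0xD, 0xB };
        short[] bin = { 0b0111, 0b1110, 0b1101, 0b1011 };
        return Arrays.equals(hex, bin);
    }

    public static void main(String[] args) {
        System.out.println(spellingsAgree());
        System.out.println(shiftAsBinaryLiteral(5));
    }
}
```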

45 Opportunity: Underscore Literals
Find integers with 7 or more digits and no underscores.
    int x = 1000000;
    int x = 1_000_000;
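The rule "7 or more digits and no underscores" can be sketched as a check on the literal's source text. This is a simplified assumption for illustration: it handles only decimal integer literals with an optional long suffix, and the class and method names are hypothetical rather than taken from the study.

```java
import java.util.regex.Pattern;

final class UnderscoreLiteralCheck {
    // Decimal integer literal: at least 7 digits, optional long suffix, no underscores.
    private static final Pattern LONG_PLAIN_INT = Pattern.compile("[0-9]{7,}[lL]?");

    /** Returns true if the literal text could be made more readable with underscores. */
    static boolean isOpportunity(String literal) {
        return LONG_PLAIN_INT.matcher(literal).matches();
    }

    public static void main(String[] args) {
        System.out.println(isOpportunity("1000000"));
        System.out.println(isOpportunity("1_000_000"));
        // Both spellings denote the same value.
        System.out.println(1000000 == 1_000_000);
    }
}
```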

46 Opportunity: Diamond
Instantiation of generics not using diamond.
    List<T> l = new ArrayList<T>();
    List<T> l = new ArrayList<>();

47 Opportunity: MultiCatch
A try with multiple, identical catch blocks.
    try { .. } catch (T1 e) { b1 } catch (T2 e) { b1 }

    try { .. } catch (T1 | T2 e) { b1 }
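A concrete, runnable instance of this pattern might look as follows. The class, the method names, and the choice of exception types are illustrative assumptions; the two methods are meant to behave identically.

```java
final class MultiCatchSketch {
    /** Old style: two catch blocks with identical bodies. */
    static int parseAtOld(String[] values, int index) {
        try {
            return Integer.parseInt(values[index]);
        } catch (NumberFormatException e) {
            return -1;
        } catch (ArrayIndexOutOfBoundsException e) {
            return -1;
        }
    }

    /** Java 7 style: one multi-catch block replaces the duplicated bodies. */
    static int parseAtNew(String[] values, int index) {
        try {
            return Integer.parseInt(values[index]);
        } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        String[] values = { "42", "x" };
        System.out.println(parseAtOld(values, 0) == parseAtNew(values, 0));
        System.out.println(parseAtOld(values, 1) == parseAtNew(values, 1));
        System.out.println(parseAtOld(values, 5) == parseAtNew(values, 5));
    }
}
```

Note that the types in a multi-catch must not be subclasses of one another, which holds for the two exception types chosen here.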

48 Opportunity: Try w/ Resources
Try statements calling close() in the finally block.
    try { .. } finally { var.close(); }

    try (var = ..) { .. }
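The following sketch shows both forms against a hypothetical Resource class that records whether it was closed. All names are assumptions for illustration; the point is only that both forms close the resource, with the second doing so implicitly.

```java
final class TryWithResourcesSketch {
    /** Hypothetical resource that remembers whether close() was called. */
    static final class Resource implements AutoCloseable {
        boolean closed = false;

        String read() {
            return "data";
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Old style: close() is called explicitly in a finally block. */
    static String readOld(Resource resource) {
        try {
            return resource.read();
        } finally {
            resource.close();
        }
    }

    /** Java 7 style: the resource is closed automatically when the block exits. */
    static String readNew(Resource resource) {
        try (Resource r = resource) {
            return r.read();
        }
    }

    public static void main(String[] args) {
        Resource a = new Resource();
        Resource b = new Resource();
        System.out.println(readOld(a) + " " + a.closed);
        System.out.println(readNew(b) + " " + b.closed);
    }
}
```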

49 Millions of opportunities!
      Assert  Varargs  Binary Literals  Diamond  MultiCatch  Try w/ Resources  Underscore Literals
Old   89K     612K     56K              3.3M     341K        489K              5.3M
New   291K    1.6M     5K               414K     24K         33K               507K

50 Millions of opportunities!
                          Assert  Varargs  Binary Literals  Diamond  MultiCatch  Try w/ Resources  Underscore Literals
Potential Uses, Projects  18.18%  88.78%   5.9%             59.08%   49.75%      37.27%            51.15%
Actual Uses, Projects     12.72%  15.43%   0.02%            0.4%     0.27%       0.21%             0.02%

Research Question 5 Was old code converted to use new features?

52 Detecting Conversions
File.java (Revision N): uses N, potential N
File.java (Revision N+1): uses N+1, potential N+1
Conversion detected when: uses N < uses N+1 and potential N > potential N+1
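The two conditions on this slide can be read as a simple predicate over consecutive revisions of one file. The sketch below is an interpretation of the slide, not the study's actual code: it assumes both conditions must hold together, and the class, method, and parameter names are hypothetical.

```java
final class ConversionCheck {
    /**
     * Flags a conversion between revision N and N+1 of a file: actual uses of a
     * feature went up while potential (missed) uses went down.
     */
    static boolean isConversion(int usesN, int potentialN, int usesNext, int potentialNext) {
        return usesN < usesNext && potentialN > potentialNext;
    }

    public static void main(String[] args) {
        // One potential use was rewritten into an actual use.
        System.out.println(isConversion(0, 3, 1, 2));
        // A new use was added, but no potential use disappeared.
        System.out.println(isConversion(0, 3, 1, 3));
    }
}
```

Requiring both changes in the same revision helps separate rewriting old code from simply adding new code that happens to use the feature.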

53 Detected lots of conversions!
manual, systematic sampling confirms 2602 conversions, 13 not conversions
[Table: Count, Files, and Projects of detected conversions per feature (Assert, Varargs, Diamond, MultiCatch, Try w/ Resources, Underscore Literals); surviving values: Count "K8.5K", Files "K3.8K"]

54 To summarize...
Similar usage patterns
[Table: Count, Files, and Projects of detected conversions per feature (Assert, Varargs, Diamond, MultiCatch, Try w/ Resources, Underscore Literals); surviving values: Count "K8.5K", Files "K3.8K"]
Old code converted to use new features
Only few features see high use
          Assert  Varargs  Binary Literals  Diamond  MultiCatch  Try w/ Resources  Underscore Literals
Old       89K     612K     56K              3.3M     341K        489K              5.3M
New       291K    1.6M     5K               414K     24K         33K               507K
All       380K    2.2M     61K              3.7M     365K        522K              5.8M
Files     1.39%   12.74%   0.11%            12.25%   2.28%       1.85%             5.86%
Projects  18.18%  88.78%   5.9%             59.08%   49.75%      37.27%            51.15%
Despite (missed) potential for use
Feature adoption by individuals

55 Summary
Ultra-large-scale software repository mining poses several challenges
Boa provides abstractions to address these challenges
  – Automatically parallelizes queries
  – Domain-specific language, types, and functions to make mining software repositories easier
  – Ultra-large-scale dataset with millions of projects

56 Boa's Global Impact 300+ users from over 20 countries!