Improving Programmer Productivity via Mining Program Source Code Tao Xie Department of Computer Science North Carolina State University


T. Xie Mining Program Source Code

Mining SE Data
MAIN GOAL
– Transform static record-keeping SE data to active data
– Make SE data actionable by uncovering hidden patterns and trends
SE data sources: mailings, Bugzilla, code repository, execution traces, CVS

Overview of Mining SE Data
Software engineering data: code bases, change history, program states, structural entities, bug reports/NL, …
Data mining techniques: classification, association/patterns, clustering, …
Software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …

Overview of Mining SE Data
Software engineering data: code bases, change history, program states, structural entities, bug reports/NL, …
Publication venues by year (per data type): 99 ASE 00 ICSE 05 FSE*2 ASE PLDI POPL OSDI 06 PLDI OOPSLA KDD 07 ICSE*3 FSE*3 ASE PLDI*2 ISSTA*2 KDD 04 ICSE 05 FSE*2 06 ASE 07 ICSE*2 99 ICSE 02 ICSE 03 PLDI 05 FSE PLDI 06 ISSTA 07 ISSTA 99 FSE 01 ICSE FSE 02 ISSTA POPL KDD 03 PLDI 04 ASE ISSTA 05 ICSE ASE 06 ICSE FSE*2 07 PLDI 03 ICSE 06 ICSE 06 ASE 07 ICSE SOSP

Overview of Mining SE Data
Software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …
Publication venues by year (per task): 99 ASE 00 ICSE 05 FSE PLDI POPL 06 FSE OOPSLA PLDI 07 FSE ASE ISSTA KDD 01 SOSP 04 OSDI 05 FSE*2 06 ICSE*2 07 ICSE*2 FSE*2 ISSTA PLDI*2 SOSP 99 ICSE 01 ICSE*2 FSE 02 ICSE ISSTA POPL 04 ISSTA 06 ISSTA 03 ICSE PLDI*2 05 ICSE FSE ASE PLDI 06 ICSE FSE 07 ICSE ISSTA PLDI 02 KDD 04 ICSE ASE 05 FSE ASE*2 06 KDD 07 ICSE*3

Sample Projects on Mining Program Source Code
Data | Algorithms | Tasks
Set of functions, variables, etc. in a C function | Frequent itemset | Programming-rules-related bug finding, UIUC [FSE 05]
Statement seq in a basic block in C | Frequent subsequence | Copy-paste bug finding, UIUC [OSDI 04]
Method seq in a Java method from code search engine | Frequent subsequence | API usage patterns, NCSU [MSR 06]
Function seq in whole C program | Frequent partial order | API usage patterns/properties, NCSU [FSE 07]
System dependence graph in whole C program | Frequent subgraph | Neglected-condition bug finding, CASE [ISSTA 07]
Java API method signatures | Plan generation | API Jungloids, Berkeley [PLDI 05]
Method seq in a Java method from code search engine | Frequent sequences | API Jungloids, NCSU [ASE 07]

Some Recent Trends
Data: dynamic execution data → + static code bases
Task: productivity (programming) → + quality (defect detection, testing, debugging)
Mining algorithm: simple ones (association rule) → + frequent itemset/subsequence/partial order/subgraph
Data scope: local repositories → public repositories with code search engines

Mining API Usage Patterns
How should an API be used correctly?
– An API may serve multiple functionalities
– Different styles of API usage
MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]

Example Task: MAPO
“Instrument the bytecode of a Java class by adding an extra method to the class”
– org.apache.bcel.generic.ClassGen
  public void addMethod(Method m)

First Try: ClassGen Java API Doc
addMethod
    public void addMethod(Method m)
Add a method to this class.
Parameters: m - method to add

Second Try: Code Search Engine

MAPO Approach
Analyze code segments relevant to a given API and disclose the inherent usage patterns
– Input: an API characterized by a method, class, or package
– Code search engine: used to search relevant source files from open source repositories
– Frequent sequence miner: uses BIDE [Wang&Han 04] to mine closed sequential patterns from extracted method-call sequences
– Output: a short list of frequent API usage patterns related to the API
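The mining step can be illustrated with a brute-force sketch. This is not BIDE, which avoids enumerating candidates; it only shows what "closed sequential pattern" means on a handful of short call sequences. The function names, the `max_len` cap, and the sample sequences are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_subsequences(sequences, min_support, max_len=4):
    """Brute force: count each distinct subsequence once per containing sequence."""
    support = Counter()
    for seq in sequences:
        seen = set()
        for n in range(1, min(max_len, len(seq)) + 1):
            # Index tuples are increasing, so call order is preserved.
            for idx in combinations(range(len(seq)), n):
                seen.add(tuple(seq[i] for i in idx))
        support.update(seen)
    return {p: s for p, s in support.items() if s >= min_support}


def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting elements."""
    it = iter(long)
    return all(x in it for x in short)


def closed_patterns(frequent):
    """Keep patterns that have no longer super-pattern with the same support."""
    return {
        p: s
        for p, s in frequent.items()
        if not any(
            len(q) > len(p) and t == s and is_subsequence(p, q)
            for q, t in frequent.items()
        )
    }


# Hypothetical call sequences extracted from three source files.
seqs = [
    ["<init>", "setMaxStack", "setMaxLocals", "getMethod", "addMethod"],
    ["<init>", "setMaxStack", "getMethod", "addMethod", "dispose"],
    ["setMaxStack", "setMaxLocals", "getMethod", "addMethod"],
]
print(closed_patterns(frequent_subsequences(seqs, min_support=3)))
```

With a support threshold of 3, seven subsequences are frequent but only the longest one is closed, which is why closed-pattern mining yields a short output list. Closedness here is relative to the `max_len` cap.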

Sequence Extraction
Method sequences: extracted from Java source files returned from code search engines

Source code:
    public void generateStubMethod(ClassGen c) {
        InstructionList il = new InstructionList();
        MethodGen m = genFromISList(il);
        m.setMaxLocals();
        m.setMaxStack();
        c.addMethod(m.getMethod());
        System.out.println("…");
        …
    }

Call sequence:
    InstructionList.<init>()
    genFromISList(InstructionList)
    MethodGen.setMaxLocals()
    MethodGen.setMaxStack()
    MethodGen.getMethod()
    ClassGen.addMethod(Method)
    PrintStream.println(String)
    …

Sequence Preprocessing
– Remove common Java library calls
– Inline callees of the same class
– Remove sequences that contain no query words: ClassGen and addMethod

Source code:
    public void generateStubMethod(ClassGen c) {
        InstructionList il = new InstructionList();
        MethodGen m = genFromISList(il);
        m.setMaxLocals();
        m.setMaxStack();
        c.addMethod(m.getMethod());
        System.out.println("…");
        …
    }

Call sequence:
    InstructionList.<init>()
    genFromISList(InstructionList)
    MethodGen.setMaxLocals()
    MethodGen.setMaxStack()
    MethodGen.getMethod()
    ClassGen.addMethod(Method)
    PrintStream.println(String)
    …
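Two of the three preprocessing steps are simple filters and can be sketched as follows. The prefix list standing in for "common Java library calls" is an assumption, as is reading "contain no query words" as "none of the query words appears in any call". Inlining callees needs the callee's source and is left out.

```python
# Assumption: a small prefix list approximates "common Java library calls".
COMMON_LIBRARY_PREFIXES = ("PrintStream.", "String.", "System.")


def preprocess(sequences, query_words):
    """Drop common library calls, then drop sequences mentioning no query word."""
    kept = []
    for seq in sequences:
        filtered = [c for c in seq if not c.startswith(COMMON_LIBRARY_PREFIXES)]
        if any(word in call for call in filtered for word in query_words):
            kept.append(filtered)
    return kept


sequences = [
    [
        "InstructionList.<init>()",
        "MethodGen.getMethod()",
        "ClassGen.addMethod(Method)",
        "PrintStream.println(String)",
    ],
    ["PrintStream.println(String)", "Foo.bar()"],  # hypothetical unrelated sequence
]
print(preprocess(sequences, ["ClassGen", "addMethod"]))
```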

Frequent Seq Postprocessing
– Remove sequences that contain no query words: ClassGen and addMethod
– Compress consecutive calls of the same method into one, e.g., abbba → aba
– Remove duplicate frequent sequences after the compression, e.g., aba, aba → aba
– Reduce a seq if it is a subseq of another, e.g., aba, abab → abab
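The last three postprocessing rules can be sketched directly on the slide's own letter examples. The function names are illustrative, and the query-word filter is omitted because it works the same way as in preprocessing.

```python
from itertools import groupby


def compress(seq):
    """Collapse runs of the same call into one: abbba -> aba."""
    return tuple(key for key, _ in groupby(seq))


def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting elements."""
    it = iter(long)
    return all(x in it for x in short)


def postprocess(patterns):
    """Compress each pattern, drop duplicates, drop patterns subsumed by another."""
    unique = list(dict.fromkeys(compress(p) for p in patterns))  # keeps first-seen order
    return [
        p for p in unique
        if not any(p != q and is_subsequence(p, q) for q in unique)
    ]


print(compress("abbba"))
print(postprocess(["abbba", "aba", "abab"]))
```

Each letter stands for one method call. In practice the elements would be call signatures such as `ClassGen.addMethod(Method)`.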

Tool Architecture
Code search engine: e.g., koders.com

Sample Mined API Sequence
    InstructionList.<init>()
    InstructionFactory.createLoad(Type, int)
    InstructionList.append(Instruction)
    InstructionFactory.createReturn(Type)
    InstructionList.append(Instruction)
    MethodGen.setMaxStack()
    MethodGen.setMaxLocals()
    MethodGen.getMethod()
    ClassGen.addMethod(Method)
    InstructionList.dispose()

Mining API Usage Patterns
MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]
Apiartor: “I know what possible set of APIs I need, but I don’t know what need to be used and what orders to use” [Acharya et al. FSE 07]

Usage Patterns as Partial Order

(a) Example code
    #include
    void p ( ) { b ( ); c ( ); }
    void q ( ) { c ( ); b ( ); }
    void r ( ) { e ( ); f ( ); }
    void s ( ) { f ( ); e ( ); }
    int main ( ) {
        int i, j, k;
        a ( );
        if ( i == 1 ) {
            f ( ); e ( ); c ( ); exit ( );
        } else {
            if ( j == 1 ) p ( ); else q ( );
            d ( );
            if ( k == 1 ) r ( ); else s ( );
        }
    }

(b) Static program traces
    1  a → f → e → c
    2  a → b → c → d → e → f
    3  a → c → b → d → e → f
    4  a → b → c → d → f → e
    5  a → c → b → d → f → e

(c) Frequent subseq patterns
    a → b → d → e
    a → b → d → f
    a → c → d → e
    a → c → d → f

(d) Frequent partial order R
    a precedes b and c; b and c precede d; d precedes e and f
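One simplified way to see how the partial order R relates to the traces is to keep only the orderings that hold in every supporting trace and then remove the edges implied by transitivity. The sketch below does this for traces 2 to 5. It is an illustration only and not the frequent-partial-order miner used in the cited work.

```python
from itertools import permutations


def precedence(traces):
    """Pairs (x, y) such that x occurs before y in every trace."""
    items = set(traces[0]).intersection(*traces[1:])
    return {
        (x, y)
        for x, y in permutations(items, 2)
        if all(t.index(x) < t.index(y) for t in traces)
    }


def transitive_reduction(order):
    """Drop (x, z) whenever some y gives both (x, y) and (y, z)."""
    nodes = {n for pair in order for n in pair}
    return {
        (x, z)
        for (x, z) in order
        if not any((x, y) in order and (y, z) in order for y in nodes)
    }


# Traces 2-5 from the slide, one letter per API call.
traces = ["abcdef", "acbdef", "abcdfe", "acbdfe"]
print(sorted(transitive_reduction(precedence(traces))))
```

The pairs b/c and e/f get no edge because both orders occur in the traces, which is exactly the information a single sequential pattern cannot express.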

Apiartor Overview
[Architecture diagram]
Inputs: source code, user-specified APIs
Components: Trace Generator, Trigger Generator, Model Checker, Scenario Extractor, Miner, Specification Extractor
Artifacts: triggers, traces, independent scenarios, frequent usage scenarios, related APIs, partial orders, specifications

Example Partial Orders
[Figure] A usage scenario around the XOpenDisplay API as a partial order. Specifications are shown with dotted lines.
APIs in the figure: XOpenDisplay, XCloseDisplay, XCreateWindow, XGetWindowAttributes, XCreateGC, XSetForeground, XGetBackground, XMapWindow, XChangeWindowAttributes, XMapWindow, XSelectInput, XGetAtomName, XFreeGC, XNextEvent

Mining API Usage Patterns
MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]
Apiartor: “I know what possible set of APIs I need, but I don’t know what need to be used and what orders to use” [Acharya et al. FSE 07]
PARSEWeb: “I know what type of object I need, but I don’t know how to write the code to get the object” [Thummalapenta&Xie ASE 07]

Example Task: OpenJMS
Query: “javax.jms.QueueConnectionFactory -> javax.jms.QueueSender”
PARSEWeb Solution:
    FileName:0_UserBean.java MethodName:ingest Rank:1 NumberOfOccurrences:23 Confidence:True
    Path:
    javax.jms.QueueConnectionFactory,createQueueConnection() ReturnType:javax.jms.QueueConnection
    javax.jms.QueueConnection,createQueueSession(boolean,javax.jms.Session.AUTO_ACKNOWLEDGE) ReturnType:javax.jms.QueueSession
    javax.jms.QueueSession,createSender(javax.jms.Queue) ReturnType:javax.jms.QueueSender
Sun Java Message Services API Spec

PARSEWeb Overview
[Architecture diagram]
Components: Code Search Engine, Code Downloader, Code Analyzer, Sequence Miner, Query Splitter
Data: Query, Open Source Repositories, Local Source Code Repository, Method Invocation Sequences, Clustered Method Invocation Sequences, Final Method Invocation Sequences

Code Analyzer
Collect [Source → Destination] method sequences invoked by each public method
– Deal with local method calls by inlining methods
– Deal with conditionals/loops by traversing control flow graphs
Resolve types in sequences
– Challenges: downloaded files are partial
– Solutions: heuristics are developed

Type Heuristics
Heuristic 1: The return type of a method-invocation statement contained in an initialization expression is the same as the type of the declared variable.
    e.g., QueueConnection connect;
          QueueSession session = connect.createQueueSession(false,int)
Heuristic 2: The return type of an outermost method invocation contained in a return statement is the same as the return type of the enclosing method declaration.
    e.g., public int test() {... return connect.createQueueSession(false,int); }
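Both heuristics can be mimicked with line-level pattern matching, as in the rough sketch below. A real analyzer would work on a parsed syntax tree. The regular expressions, the function names, and the sample statements are assumptions made for illustration.

```python
import re

# Matches "<Type> <var> = <receiver>.<method>(" at the start of a statement.
INIT_CALL = re.compile(r"^\s*([\w.]+)\s+\w+\s*=\s*(\w+)\.(\w+)\s*\(")
# Matches "return <receiver>.<method>(" at the start of a statement.
RETURN_CALL = re.compile(r"^\s*return\s+(\w+)\.(\w+)\s*\(")


def type_from_initializer(statement):
    """Heuristic 1: the call's return type is the declared variable's type."""
    m = INIT_CALL.match(statement)
    if m is None:
        return None
    declared_type, receiver, method = m.groups()
    return (receiver, method, declared_type)


def type_from_return(statement, enclosing_return_type):
    """Heuristic 2: the call's return type is the enclosing method's return type."""
    m = RETURN_CALL.match(statement)
    if m is None:
        return None
    receiver, method = m.groups()
    return (receiver, method, enclosing_return_type)


print(type_from_initializer(
    "QueueSession session = connect.createQueueSession(false, mode);"))
print(type_from_return("return factory.createQueueConnection();", "QueueConnection"))
```

The point of both heuristics is that the type is recovered from the surrounding declaration, so the class that defines the called method does not have to be among the downloaded files.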

Sequence Miner
The code analyzer may produce too many candidate sequences
Solutions:
Cluster similar sequences
– Clustering heuristics are developed
Rank sequences
– Ranking heuristics are developed

Clustering Heuristics
Heuristic 1: Method-invocation sequences with the same set of statements can be considered similar, although the statements are in different order.
    e.g., '…' and '…'
Heuristic 2: Method-invocation sequences differing by a given cluster precision value can be considered similar.
    e.g., '…' and '…' can be considered similar under cluster precision value one.
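A possible reading of the two clustering heuristics is sketched below. It assumes that "differing by a cluster precision value n" means each sequence has at most n calls that the other lacks, and it compares each sequence only with the first member of a cluster. Both choices are assumptions, and the numbered calls are made-up sample data.

```python
def cluster(sequences, precision=0):
    """Greedy clustering by call set; precision 0 means identical sets only."""
    clusters = []  # each entry: (call set of the first member, list of members)
    for seq in sequences:
        calls = set(seq)
        for representative, members in clusters:
            if (len(calls - representative) <= precision
                    and len(representative - calls) <= precision):
                members.append(seq)
                break
        else:
            clusters.append((calls, [seq]))
    return [members for _, members in clusters]


# Hypothetical sequences; each number stands for one method invocation.
sequences = [[1, 2, 3, 4], [1, 3, 2, 4], [1, 2, 5, 4]]
print(cluster(sequences, precision=0))
print(cluster(sequences, precision=1))
```

With precision 0 only the reordered pair is merged (heuristic 1). With precision 1 the third sequence joins the same cluster because it differs by one call on each side (heuristic 2).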

Ranking Heuristics
Heuristic 1: Higher frequency -> Higher rank
Heuristic 2: Shorter length -> Higher rank
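The two ranking heuristics amount to a two-key sort. How they are combined is not stated here, so the sketch below assumes frequency is compared first and length only breaks ties. The sample sequences and counts are made up.

```python
def rank(candidates):
    """Sort (sequence, frequency) pairs: higher frequency first, then shorter length."""
    return sorted(candidates, key=lambda c: (-c[1], len(c[0])))


candidates = [
    (["lookup", "narrow", "getConnection"], 5),
    (["lookup", "getConnection"], 5),
    (["getConnection"], 2),
]
for position, (sequence, frequency) in enumerate(rank(candidates), start=1):
    print(position, frequency, sequence)
```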

Query Splitter
Lack of code samples in the results of code search engines
– Code samples are split among different files
Solution:
– Split the user query into multiple queries
– Compose the results for each split query

Query Splitting Example
1. User query: “org.eclipse.jface.viewers.IStructuredSelection->java.io.ObjectInputStream”
   Results: None
2. Query: “java.io.ObjectInputStream”
   Results: 3. Most used sources are: java.io.InputStream, java.io.ByteArrayInputStream, java.io.FileInputStream
3. Three queries to be fired:
   “org.eclipse.jface.viewers.IStructuredSelection->java.io.InputStream” Results: 1
   “org.eclipse.jface.viewers.IStructuredSelection->java.io.ByteArrayInputStream” Results: 5
   “org.eclipse.jface.viewers.IStructuredSelection->java.io.FileInputStream” Results: None
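The splitting logic in this example can be sketched as a fallback around the normal lookup. The `search` and `most_used_sources` callables are hypothetical stand-ins for the mined sequence index, and the call names in the toy index are placeholders rather than real API methods.

```python
def split_query(source, destination, search, most_used_sources):
    """Answer source -> destination directly, or compose via an intermediate type."""
    direct = search(source, destination)
    if direct:
        return direct
    composed = []
    for middle in most_used_sources(destination):
        for head in search(source, middle):
            for tail in search(middle, destination):
                composed.append(head + tail)
    return composed


# Toy index with placeholder call names (hypothetical data).
index = {
    ("IStructuredSelection", "InputStream"): [["call_a", "call_b"]],
    ("InputStream", "ObjectInputStream"): [["call_c"]],
}


def search(source, destination):
    return index.get((source, destination), [])


def most_used_sources(destination):
    return ["InputStream", "ByteArrayInputStream", "FileInputStream"]


print(split_query("IStructuredSelection", "ObjectInputStream",
                  search, most_used_sources))
```

Composing a head and a tail that come from different files is what lets the tool answer queries for which no single downloaded sample covers the whole path.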

Eclipse Plugin

Evaluations
Real programming problems: to address problems posted in developer forums
Real projects: to show that solutions recommended by PARSEWeb are
– available in real projects
– on average better than solutions recommended by related tools: PROSPECTOR, Strathcona, Google Code Search Engine

Jakarta BCEL User Forum
Jakarta BCEL user forum, 2001
Problem: “How to disassemble java byte code”
Query: “Code → Instruction”
Solution sample code:
    Code code;
    InstructionList il = new InstructionList(code.getCode());
    Instruction[] ins = il.getInstructions();

Dev2Dev Newsgroups
Dev2Dev Newsgroups, 2006
Problem: “how to connect db by sesseionBean”
Query: javax.naming.InitialContext → java.sql.Connection
Solution sequence:
    FileName:3_AddressBean.java MethodName:getNextUniqueKey Rank:1 NumberOfOccurrences:34
    javax.naming.InitialContext,lookup(java.lang.String) ReturnType:javax.sql.DataSource
    javax.sql.DataSource,getConnection() ReturnType:java.sql.Connection

Challenges in Mining Code
Sometimes too few data samples
– Scalability is usually not an issue
– Static code bases vs. change histories
Data preparation/preprocessing
– Related to traditional program analysis
Pattern postprocessing (filtering and ranking)
– Heuristics play important roles
Demand-driven mining vs. any-gold mining
– Programming vs. bug finding

Conclusion
Mining various types of software engineering data to aid software engineering tasks
Mining program source code to improve programmer productivity
– MAPO: mining API usage patterns for a given API
– Apiartor: mining API usage patterns for a given set of APIs
– PARSEWeb: mining API usage patterns for input-output-type queries

Questions?
Mining Software Engineering Data Bibliography
– What software engineering tasks can be helped by data mining?
– What kinds of software engineering data can be mined?
– How are data mining techniques used in software engineering?
– Resources

Demand-Driven Or Not
Any-gold mining
– Examples: DynaMine, …
– Advantages: Surface up only cases that are applicable
– Issues: How much gold is good enough given the amount of data to be mined?
Demand-driven mining
– Examples: MAPO, BugTriage, …
– Advantages: Exploit demands to filter out irrelevant information
– Issues: How high percentage of cases would work well?

Code vs. Non-Code
Code / programming languages
– Examples: MAPO, DynaMine, …
– Advantages: Relatively stable and consistent representation
Non-code / natural languages
– Examples: BugTriage, CVS/code comments, emails, docs
– Advantages: Common source of capturing programmers’ intentions
Issues: What project/context-specific heuristics to use?

Static vs. Dynamic
Static data: code bases, change histories
– Examples: MAPO, DynaMine, …
– Advantages: No need to set up exec environment; more scalable
– Issues: How to reduce false positives?
Dynamic data: prog states, structural profiles
– Examples: Spec discovery, …
– Advantages: More-precise info
– Issues: How to reduce false negatives? Where do tests come from?

Snapshot vs. Changes
Code snapshot
– Examples: MAPO, …
– Advantages: Larger amount of available data
Code change history
– Examples: DynaMine, …
– Advantages: Revision transactions encode more-focused entity relationships
Issues: How to group CVS changes into transactions?