
Improving Programmer Productivity via Mining Program Source Code Tao Xie Department of Computer Science North Carolina State University


1 Improving Programmer Productivity via Mining Program Source Code Tao Xie Department of Computer Science North Carolina State University http://ase.csc.ncsu.edu/dmse/

2 T. Xie Mining Program Source Code
Mining SE Data
MAIN GOAL
–Transform static record-keeping SE data to active data
–Make SE data actionable by uncovering hidden patterns and trends
SE data sources: Mailings, Bugzilla, Code repository, Execution traces, CVS

3 Overview of Mining SE Data
software engineering data: code bases, change history, program states, structural entities, bug reports/nl, …
data mining techniques: classification, association/patterns, clustering, …
software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …

4 Overview of Mining SE Data
software engineering data: code bases, change history, program states, structural entities, bug reports/nl, …
Venues by year: 99 ASE 00 ICSE 05 FSE*2 ASE PLDI POPL OSDI 06 PLDI OOPSLA KDD 07 ICSE*3 FSE*3 ASE PLDI*2 ISSTA*2 KDD 04 ICSE 05 FSE*2 06 ASE 07 ICSE*2 99 ICSE 02 ICSE 03 PLDI 05 FSE PLDI 06 ISSTA 07 ISSTA 99 FSE 01 ICSE FSE 02 ISSTA POPL KDD 03 PLDI 04 ASE ISSTA 05 ICSE ASE 06 ICSE FSE*2 07 PLDI 03 ICSE 06 ICSE 06 ASE 07 ICSE SOSP


6 Overview of Mining SE Data
software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …
Venues by year: 99 ASE 00 ICSE 05 FSE PLDI POPL 06 FSE OOPSLA PLDI 07 FSE ASE ISSTA KDD 01 SOSP 04 OSDI 05 FSE*2 06 ICSE*2 07 ICSE*2 FSE*2 ISSTA PLDI*2 SOSP 99 ICSE 01 ICSE*2 FSE 02 ICSE ISSTA POPL 04 ISSTA 06 ISSTA 03 ICSE PLDI*2 05 ICSE FSE ASE PLDI 06 ICSE FSE 07 ICSE ISSTA PLDI 02 KDD 04 ICSE ASE 05 FSE ASE*2 06 KDD 07 ICSE*3


8 Sample Projects on Mining Program Source Code
Data | Algorithms | Tasks | Source
Set of functions, variables, etc. in a C function | Frequent itemset | Programming-rules-related bug finding | UIUC [FSE 05]
Statement seq in a basic block in C | Frequent subsequence | Copy-paste bug finding | UIUC [OSDI 04]
Methods seq in a Java method from code search engine | Frequent subsequence | API usage patterns | NCSU [MSR 06]
Function seq in whole C program | Frequent partial order | API usage patterns/properties | NCSU [FSE 07]
System dependence graph in whole C program | Frequent subgraph | Neglected-condition bug finding | CASE [ISSTA 07]
Java API method signatures | Plan generation | API Jungloids | Berkeley [PLDI 05]
Method seq in a Java method from code search engine | Frequent sequences | API Jungloids | NCSU [ASE 07]

9 Some Recent Trends
Data: dynamic execution data -> + static code bases
Task: productivity (programming) -> + quality (defect detection, testing, debugging)
Mining algorithm: simple ones (association rule) -> + frequent itemset/subsequence/partial order/subgraph
Data scope: local repositories -> public repositories with code search engines


11 Mining API Usage Patterns
How should an API be used correctly?
–An API may serve multiple functionalities
–Different styles of API usage
MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]

12 Example Task -- MAPO
“Instrument the bytecode of a Java class by adding an extra method to the class”
–org.apache.bcel.generic.ClassGen
 public void addMethod(Method m)

13 First Try: ClassGen Java API Doc
addMethod
public void addMethod(Method m)
Add a method to this class.
Parameters: m - method to add

14 Second Try: Code Search Engine

15 MAPO Approach
Analyze code segments relevant to a given API and disclose the inherent usage patterns
–Input: an API characterized by a method, class, or package
–Code search engine: used to search relevant source files from open source repositories
–Frequent sequence miner: use BIDE [Wang&Han 04] to mine closed sequential patterns from extracted method-call sequences
–Output: a short list of frequent API usage patterns related to the API
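To make the "frequent sequence miner" step concrete, here is a minimal sketch of frequent subsequence counting. It is not BIDE and does not compute closed patterns; it is a brute-force stand-in that enumerates every subsequence up to a length bound, so it only suits tiny inputs. The class name, method names, and the `maxLen` bound are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Brute-force illustration only (hypothetical names); BIDE itself is far more efficient.
final class FrequentSeqSketch {
    // Collect every distinct subsequence of seq with length 1..maxLen.
    static void subsequences(List<String> seq, int start, int maxLen,
                             List<String> current, Set<List<String>> out) {
        if (!current.isEmpty()) out.add(new ArrayList<>(current));
        if (current.size() == maxLen) return;
        for (int i = start; i < seq.size(); i++) {
            current.add(seq.get(i));
            subsequences(seq, i + 1, maxLen, current, out);
            current.remove(current.size() - 1);
        }
    }

    // Support of a pattern = number of input sequences containing it as a subsequence.
    static Map<List<String>, Integer> frequent(List<List<String>> db,
                                               int minSupport, int maxLen) {
        Map<List<String>, Integer> support = new HashMap<>();
        for (List<String> seq : db) {
            Set<List<String>> seen = new HashSet<>();
            subsequences(seq, 0, maxLen, new ArrayList<>(), seen);
            // Each input sequence contributes at most 1 to a pattern's support.
            for (List<String> pattern : seen) support.merge(pattern, 1, Integer::sum);
        }
        support.values().removeIf(count -> count < minSupport);
        return support;
    }
}
```

Calling `FrequentSeqSketch.frequent(db, minSupport, maxLen)` on a handful of extracted call sequences returns each surviving pattern with its support count; a closed-pattern miner would additionally drop patterns that have a longer super-pattern with the same support.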

16 Sequence Extraction
Method sequences: extracted from Java source files returned from code search engines
Source code:
public void generateStubMethod(ClassGen c) {
    InstructionList il = new InstructionList();
    MethodGen m = genFromISList(il);
    m.setMaxLocals();
    m.setMaxStack();
    c.addMethod(m.getMethod());
    System.out.println("…");
    …
}
Call sequence:
InstructionList.<init>()
genFromISList(InstructionList)
MethodGen.setMaxStack()
MethodGen.setMaxLocals()
MethodGen.getMethod()
ClassGen.addMethod(Method)
PrintStream.println(String)
…

17 Sequence Preprocessing
–Remove common Java library calls
–Inline callees of the same class
–Remove sequences that contain no query words: ClassGen and addMethod
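Two of these preprocessing steps can be sketched in a few lines. This is an illustration under stated assumptions, not MAPO's implementation: the list of "common library" prefixes is hypothetical, calls are represented as plain strings such as `ClassGen.addMethod(Method)`, and the inlining step is left out because it needs a parsed program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper names; calls are plain strings like "ClassGen.addMethod(Method)".
final class PreprocessSketch {
    // Assumed stand-in for "common Java library calls".
    static final List<String> COMMON_PREFIXES =
            Arrays.asList("PrintStream.", "String.", "Object.");

    // Drop every call whose receiver type is on the common-library list.
    static List<String> dropCommonCalls(List<String> seq) {
        List<String> kept = new ArrayList<>();
        for (String call : seq) {
            boolean common = false;
            for (String prefix : COMMON_PREFIXES) {
                if (call.startsWith(prefix)) { common = true; break; }
            }
            if (!common) kept.add(call);
        }
        return kept;
    }

    // A sequence is kept only if at least one call mentions a query word.
    static boolean mentionsQuery(List<String> seq, List<String> queryWords) {
        for (String call : seq) {
            for (String word : queryWords) {
                if (call.contains(word)) return true;
            }
        }
        return false;
    }
}
```

For the slide's example, `dropCommonCalls` would discard `PrintStream.println(String)`, and `mentionsQuery` with the words `ClassGen` and `addMethod` keeps the sequence because it contains `ClassGen.addMethod(Method)`.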

18 Frequent Seq Postprocessing
–Remove sequences that contain no query words: ClassGen and addMethod
–Compress consecutive calls of the same method into one, e.g., abbba -> aba
–Remove duplicate frequent sequences after the compression, e.g., aba, aba -> aba
–Reduce a seq if it is a subseq of another, e.g., aba, abab -> abab
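The compression, de-duplication, and subsequence-reduction steps are mechanical enough to sketch directly. The class and method names below are hypothetical, and the query-word filter is omitted; the code only illustrates the three transformations named on the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical names; a sketch of the three postprocessing transformations.
final class PostprocessSketch {
    // Collapse runs of the same call: a b b b a becomes a b a.
    static List<String> compress(List<String> seq) {
        List<String> out = new ArrayList<>();
        for (String call : seq) {
            if (out.isEmpty() || !out.get(out.size() - 1).equals(call)) out.add(call);
        }
        return out;
    }

    // True if small can be obtained from big by deleting elements.
    static boolean isSubsequence(List<String> small, List<String> big) {
        int matched = 0;
        for (String call : big) {
            if (matched < small.size() && small.get(matched).equals(call)) matched++;
        }
        return matched == small.size();
    }

    static List<List<String>> postprocess(List<List<String>> patterns) {
        // Compress each pattern, then drop exact duplicates (insertion order kept).
        LinkedHashSet<List<String>> unique = new LinkedHashSet<>();
        for (List<String> p : patterns) unique.add(compress(p));
        // Drop any pattern that is a subsequence of a different remaining pattern.
        List<List<String>> result = new ArrayList<>();
        for (List<String> p : unique) {
            boolean covered = false;
            for (List<String> q : unique) {
                if (!p.equals(q) && isSubsequence(p, q)) { covered = true; break; }
            }
            if (!covered) result.add(p);
        }
        return result;
    }
}
```

Given the slide's examples abbba, aba, and abab as input, `postprocess` compresses the first to aba, merges it with the second, and then drops aba because it is a subsequence of abab.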

19 Tool Architecture
e.g., koders.com

20 Sample Mined API Sequence
InstructionList.<init>()
InstructionFactory.createLoad(Type, int)
InstructionList.append(Instruction)
InstructionFactory.createReturn(Type)
InstructionList.append(Instruction)
MethodGen.setMaxStack()
MethodGen.setMaxLocals()
MethodGen.getMethod()
ClassGen.addMethod(Method)
InstructionList.dispose()


22 Mining API Usage Patterns
MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]
Apiartor: “I know what possible set of APIs I need, but I don’t know which ones need to be used and in what order” [Acharya et al. FSE 07]

23 Usage Patterns as Partial Order
(a) Example code
#include
void p ( ) { b ( ); c ( ); }
void q ( ) { c ( ); b ( ); }
void r ( ) { e ( ); f ( ); }
void s ( ) { f ( ); e ( ); }
int main ( ) {
    int i, j, k;
    a ( );
    if ( i == 1) {
        f ( ); e ( ); c ( ); exit ( );
    } else {
        if ( j == 1 ) p ( ); else q ( );
        d ( );
        if ( k == 1 ) r ( ); else s ( );
    }
}
(b) Static program traces
1 a -> f -> e -> c
2 a -> b -> c -> d -> e -> f
3 a -> c -> b -> d -> e -> f
4 a -> b -> c -> d -> f -> e
5 a -> c -> b -> d -> f -> e
(c) Frequent subseq patterns
a -> b -> d -> e
a -> b -> d -> f
a -> c -> d -> e
a -> c -> d -> f
(d) Frequent partial order R
a -> {b, c} -> d -> {e, f}
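One simple way to see how the sequential patterns in (c) relate to the partial order in (d) is to merge the patterns into a set of precedence edges. The sketch below does only that: it takes the union of adjacent pairs in each pattern. It is an illustration with hypothetical names, not the frequent-partial-order mining algorithm of the cited work.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration: merge sequential patterns into precedence edges.
final class PartialOrderSketch {
    static Set<String> precedenceEdges(List<List<String>> patterns) {
        Set<String> edges = new TreeSet<>();   // sorted, duplicates ignored
        for (List<String> pattern : patterns) {
            for (int i = 0; i + 1 < pattern.size(); i++) {
                edges.add(pattern.get(i) + "->" + pattern.get(i + 1));
            }
        }
        return edges;
    }
}
```

Applied to the four patterns in (c), this yields the edges a->b, a->c, b->d, c->d, d->e, and d->f, and leaves b and c (and likewise e and f) unordered relative to each other.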

24 Apiartor Overview
Components and artifacts shown: User-specified APIs, Trigger Generator, Triggers, Model Checker, Traces, Scenario Extractor, Independent Scenarios, Miner, Partial Orders, Source Code, Specification Extractor, Specifications, Frequent Usage Scenarios, Related APIs, Trace Generator

25 Example Partial Orders
A usage scenario around XOpenDisplay API as a partial order. Specifications are shown with dotted lines.
Nodes shown: XOpenDisplay, XCloseDisplay, XCreateWindow, XGetWindowAttributes, XCreateGC, XSetForeground, XGetBackground, XMapWindow, XChangeWindowAttributes, XMapWindow, XSelectInput, XGetAtomName, XFreeGC, XNextEvent


27 Mining API Usage Patterns
MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]
Apiartor: “I know what possible set of APIs I need, but I don’t know which ones need to be used and in what order” [Acharya et al. FSE 07]
PARSEWeb: “I know what type of object I need, but I don’t know how to write the code to get the object” [Thummalapenta&Xie ASE 07]

28 Example Task - OpenJMS
Query: “javax.jms.QueueConnectionFactory -> javax.jms.QueueSender”
PARSEWeb Solution:
FileName:0_UserBean.java MethodName:ingest Rank:1 NumberOfOccurrences:23 Confidence:True Path: 1 2 3
javax.jms.QueueConnectionFactory,createQueueConnection() ReturnType:javax.jms.QueueConnection
javax.jms.QueueConnection,createQueueSession(boolean,javax.jms.Session.AUTO_ACKNOWLEDGE) ReturnType:javax.jms.QueueSession
javax.jms.QueueSession,createSender(javax.jms.Queue) ReturnType:javax.jms.QueueSender
Sun Java Message Services API Spec

29 PARSEWeb Overview
Components and artifacts shown: Query, Query Splitter, Code Search Engine, Open Source Repositories, Code Downloader, Local Source Code Repository, Code Analyzer, Method Invocation Sequences, Sequence Miner, Clustered Method Invocation Sequences, Final Method Invocation Sequences


31 Code Analyzer
Collect [Source -> Destination] method sequences invoked by each public method
–Deal with local method calls by inlining methods
–Deal with conditionals/loops by traversing control flow graphs
Resolve types in sequences
–Challenges: downloaded files are partial
–Solutions: heuristics are developed

32 Type Heuristics
Heuristic 1: The return type of a method-invocation statement contained in an initialization expression is the same as the type of the declared variable.
e.g., QueueConnection connect;
      QueueSession session = connect.createQueueSession(false,int)
Heuristic 2: The return type of an outermost method invocation contained in a return statement is the same as the return type of the enclosing method declaration.
e.g., public int test() {... return connect.createQueueSession(false,int); }


34 Sequence Miner
The code analyzer may produce too many candidate sequences
Solutions:
Cluster similar sequences
–Clustering heuristics are developed
Rank sequences
–Ranking heuristics are developed

35 Clustering Heuristics
Heuristic 1: Method-invocation sequences with the same set of statements can be considered similar, although the statements are in different order.
e.g., "2 3 4 5" and "2 4 3 5"
Heuristic 2: Method-invocation sequences differing by the given cluster precision value can be considered similar.
e.g., "8 9 6 7" and "8 6 10 7" can be considered similar under cluster precision value one.
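Both clustering heuristics can be read as a single set comparison, sketched below. The interpretation is an assumption: two sequences count as similar when each has at most `precision` statements that the other lacks, so precision 0 reproduces Heuristic 1. Statements are represented by integer ids as in the slide's examples, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical names; one possible reading of the two clustering heuristics.
final class ClusterSketch {
    // Number of distinct statements in a that do not occur in b.
    static int missingFrom(List<Integer> a, List<Integer> b) {
        Set<Integer> rest = new HashSet<>(a);
        rest.removeAll(b);
        return rest.size();
    }

    // Order is ignored; precision 0 means "same set of statements".
    static boolean similar(List<Integer> a, List<Integer> b, int precision) {
        return missingFrom(a, b) <= precision && missingFrom(b, a) <= precision;
    }
}
```

With this reading, "2 3 4 5" and "2 4 3 5" are similar at precision 0, while "8 9 6 7" and "8 6 10 7" are similar only from precision 1 upward, since each contains one statement the other lacks.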

36 Ranking Heuristics
Heuristic 1: Higher frequency -> Higher rank
Heuristic 2: Shorter length -> Higher rank
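The two ranking heuristics can be combined into one sort order. The slide does not say how they interact, so the sketch below assumes frequency is the primary key and length breaks ties; the `Candidate` type and its fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical types; assumes frequency first, then shorter length as tie-breaker.
final class RankSketch {
    static final class Candidate {
        final String label;
        final int frequency;   // number of occurrences of the sequence
        final int length;      // number of method invocations in the sequence

        Candidate(String label, int frequency, int length) {
            this.label = label;
            this.frequency = frequency;
            this.length = length;
        }
    }

    // Returns a new list: highest frequency first, shorter sequences first on ties.
    static List<Candidate> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((Candidate c) -> -c.frequency)
                .thenComparingInt(c -> c.length));
        return sorted;
    }
}
```

A candidate seen 23 times with 3 calls would therefore rank above one seen 23 times with 5 calls, and both would rank above a candidate seen only 5 times.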


38 Query Splitter
Lack of code samples in the results of code search engines
–Code samples are split among different files
Solution:
–Split the user query into multiple queries
–Compose the results for each split query

39 Query Splitting Example
1. User query: “org.eclipse.jface.viewers.IStructuredSelection->java.io.ObjectInputStream” Results: None
2. Query: “java.io.ObjectInputStream” Results: 3. Most used sources are: java.io.InputStream, java.io.ByteArrayInputStream, java.io.FileInputStream
3. Three Queries to be fired:
“org.eclipse.jface.viewers.IStructuredSelection-> java.io.InputStream” Results: 1
“org.eclipse.jface.viewers.IStructuredSelection-> java.io.ByteArrayInputStream” Results: 5
“org.eclipse.jface.viewers.IStructuredSelection-> java.io.FileInputStream” Results: None

40 Eclipse Plugin

41 Evaluations
Real Programming Problems: To address problems posted in developer forums.
Real Projects: To show that solutions recommended by PARSEWeb are
–available in real projects
–on average better than solutions recommended by related tools: PROSPECTOR, Strathcona, Google Code Search Engine

42 Jakarta BCEL User Forum
Jakarta BCEL user forum, 2001
Problem: “How to disassemble java byte code”
Query: “Code -> Instruction”
Solution Sample Code:
Code code;
InstructionList il = new InstructionList(code.getCode());
Instruction[] ins = il.getInstructions();

43 Dev2Dev Newsgroups
Dev2Dev Newsgroups, 2006
Problem: “how to connect db by sessionBean”
Query: javax.naming.InitialContext -> java.sql.Connection
Solution Sequence:
FileName:3_AddressBean.java MethodName:getNextUniqueKey Rank:1 NumberOfOccurrences:34
javax.naming.InitialContext,lookup(java.lang.String) ReturnType:javax.sql.DataSource
javax.sql.DataSource,getConnection() ReturnType:java.sql.Connection

44 Challenges in Mining Code
Sometimes too few data samples
–Scalability is usually not an issue
–Static code bases vs. change histories
Data preparation/preprocessing
–Related to traditional program analysis
Pattern postprocessing (filtering and ranking)
–Heuristics play important roles
Demand-driven mining vs. any-gold mining
–Programming vs. bug finding

45 Conclusion
Mining various types of software engineering data to aid software engineering tasks
Mining program source code to improve programmer productivity
–MAPO: mining API usage patterns for a given API
–Apiartor: mining API usage patterns for a given set of APIs
–PARSEWeb: mining API usage patterns for input-output-type queries

46 Questions?
Mining Software Engineering Data Bibliography: http://ase.csc.ncsu.edu/dmse/
–What software engineering tasks can be helped by data mining?
–What kinds of software engineering data can be mined?
–How are data mining techniques used in software engineering?
–Resources

47 Demand-Driven Or Not
 | Any-gold mining | Demand-driven mining
Examples | DynaMine, … | MAPO, BugTriage, …
Advantages | Surface up only cases that are applicable | Exploit demands to filter out irrelevant information
Issues | How much gold is good enough given the amount of data to be mined? | How high a percentage of cases would work well?

48 Code vs. Non-Code
 | Code/Programming Langs | Non-Code/Natural Langs
Examples | MAPO, DynaMine, … | BugTriage, CVS/Code comments, emails, docs
Advantages | Relatively stable and consistent representation | Common source of capturing programmers’ intentions
Issues | What project/context-specific heuristics to use?

49 Static vs. Dynamic
 | Static Data: code bases, change histories | Dynamic Data: prog states, structural profiles
Examples | MAPO, DynaMine, … | Spec discovery, …
Advantages | No need to set up exec environment; More scalable | More-precise info
Issues | How to reduce false positives? | How to reduce false negatives? Where do tests come from?

50 Snapshot vs. Changes
 | Code snapshot | Code change history
Examples | MAPO, … | DynaMine, …
Advantages | Larger amount of available data | Revision transactions encode more-focused entity relationships
Issues | How to group CVS changes into transactions?

