Duplicate code detection using anti-unification Peter Bulychev Moscow State University Marius Minea Institute eAustria, Timisoara.


Outline
  Code duplication problem
  Our anti-unification based algorithm
  Comparison with existing methods
  Clone Digger, the tool for finding software clones

What is a software clone?
  Two fragments of code form a clone if they are similar enough (according to a given measure of similarity).

    for (int i = 0; i < 5; i++)
        for (int j = 0; j <= i; j++)
            cout << i + j;

    for (int k = 0; k < 6; k++)
        for (int m = 0; m <= k; m++)
            cout << k + m;

Why is it important to detect code clones?
  5% to 20% of the code in software systems consists of clones [1]
  Why do programmers produce clones? [2]
    Development strategy
    Maintenance benefits
    Overcoming underlying limitations
    Cloning by accident
  Why is the presence of code clones bad?
    Errors in the original must be fixed in every clone
  [1] I.D. Baxter et al. Clone Detection Using Abstract Syntax Trees, 1998.
  [2] C.K. Roy and J.R. Cordy. A Survey on Software Clone Detection Research, 2007.

Our clone definition
  Clone definitions can be classified by the level of granularity at which they work:
    List of strings
    Sequence of tokens
    Abstract syntax trees (AST)
    Semantic information
  We work at the AST level.
  We consider two sequences of statements a clone if one of them can be obtained from the other by replacing some subtrees.

Example
    x = a;              x = a + b;
    y = f(x, i);        y = f(x, j);
    cout << y;          cout << y;
  [Figure: ASTs of the two fragments. They differ only in the subtrees a vs. a + b and i vs. j.]

Automatic clone detection tool
  Detects occurrences of similar code
  Applications
    Refactoring into new functions or base classes
    The number of clones can be used as a measure of code quality
  Several tools exist [1]
  [1] S. Bellon et al. Comparison and Evaluation of Clone Detection Tools, 2007.

The sketch of the algorithm
  1. Partition similar statements into clusters
  2. Find pairs of identical cluster sequences
  3. Refine by examining the identified code sequences for structural similarity
  [Figure: example statements i=0, i++, f(i), k=0, k++, f(k).]

Main problems
  How to compute similarity between two trees?
    Use edit distance
  How to compute similarity between a new tree and an existing cluster of trees?
    Comparing with each tree in the cluster is expensive
    Compare the new tree with an average value stored for the cluster

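Anti-unification
  The anti-unifier of two trees is their most specific generalization, that is, the most specific tree that matches both.
  [Figure: two example trees, f(x + y, x * 2) and f(x + z, x / 2), and their anti-unifier, in which the differing subtrees are replaced by the placeholder ?.]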

Anti-unification features
  The anti-unifier of a set of trees keeps their common features: tree structure and common labels
  Anti-unification can be used to compute the edit distance between two trees:
    $\theta_1$ and $\theta_2$ are substitutions such that $E_0 \theta_1 = E_1$ and $E_0 \theta_2 = E_2$
    $\text{distance} = |\theta_1| + |\theta_2|$
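The definitions on the two slides above can be tried out with a minimal sketch. It rests on assumptions that are not in the slides: a tree is a nested tuple `(label, child, ...)` with plain strings as leaves, placeholders are named `?0`, `?1`, ..., and the size of a substitution is taken to be the total number of nodes in the subtrees it inserts. The function names are illustrative and are not taken from Clone Digger.

```python
def size(tree):
    """Number of nodes in a tree; a leaf is a plain string."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


def _generalize(t1, t2, holes):
    """Build the common pattern, recording differing subtree pairs in `holes`."""
    if t1 == t2:
        return t1
    same_shape = (isinstance(t1, tuple) and isinstance(t2, tuple)
                  and t1[0] == t2[0] and len(t1) == len(t2))
    if same_shape:
        children = [_generalize(a, b, holes) for a, b in zip(t1[1:], t2[1:])]
        return (t1[0],) + tuple(children)
    # Reuse the placeholder if this exact pair of subtrees was seen before,
    # so that the result stays the most specific generalization.
    return holes.setdefault((t1, t2), "?%d" % len(holes))


def anti_unify(t1, t2):
    """Return (pattern, theta1, theta2) such that pattern with theta_i applied is t_i."""
    holes = {}
    pattern = _generalize(t1, t2, holes)
    theta1 = {name: pair[0] for pair, name in holes.items()}
    theta2 = {name: pair[1] for pair, name in holes.items()}
    return pattern, theta1, theta2


def distance(t1, t2):
    """|theta1| + |theta2|, with |theta| assumed to be the inserted node count."""
    _, theta1, theta2 = anti_unify(t1, t2)
    return (sum(size(s) for s in theta1.values())
            + sum(size(s) for s in theta2.values()))


e1 = ("f", ("+", "x", "y"), ("*", "x", "2"))   # f(x + y, x * 2)
e2 = ("f", ("+", "x", "z"), ("/", "x", "2"))   # f(x + z, x / 2)
e0, theta1, theta2 = anti_unify(e1, e2)
print(e0)                # ('f', ('+', 'x', '?0'), '?1')
print(distance(e1, e2))  # 8
```

In this sketch a leaf whose label happens to look like `?0` would be confused with a placeholder; a real implementation would use a dedicated placeholder type.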

The first phase: building clusters of statements
  We use a simple one-pass clustering algorithm:

    for tree in statement_trees:
        # pick the cluster that is cheapest to extend with this tree
        best = min(clusters, key=lambda c: c.add_cost(tree), default=None)
        if best is not None and best.add_cost(tree) < threshold:
            best.append(tree)
        else:
            clusters.append(Cluster(tree))

Finding the best cluster
  What add_cost function should we use?
  The cost should be high in these cases:
    The cluster is large and adding the new tree changes the cluster's average value significantly
    The average value of the new cluster is far from the tree
  add_cost = n * (|au| - |au'|) + (|tree| - |au'|)
    n: the old size of the cluster
    au: the old anti-unifier of the cluster
    au': the new anti-unifier of the cluster
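A small worked example may make the formula concrete. The sketch below assumes that sizes such as |au| and |tree| are node counts (the slides do not say how size is measured) and takes the sizes as plain numbers instead of computing the anti-unifiers.

```python
def add_cost(n, au_size, new_au_size, tree_size):
    """Cost of adding a tree to a cluster, following the slide's formula.

    n           -- old size of the cluster
    au_size     -- size of the old anti-unifier (node count, an assumption)
    new_au_size -- size of the anti-unifier after adding the tree
    tree_size   -- size of the tree being added
    """
    # First term: how much each of the n existing members loses in detail.
    # Second term: how far the new tree is from the new anti-unifier.
    return n * (au_size - new_au_size) + (tree_size - new_au_size)


# A cluster of 3 trees with a 7-node anti-unifier; adding a 9-node tree
# shrinks the anti-unifier to 5 nodes: 3 * (7 - 5) + (9 - 5) = 10.
print(add_cost(n=3, au_size=7, new_au_size=5, tree_size=9))  # 10
```

A tree that matches the cluster's anti-unifier exactly leaves it unchanged, so both terms are zero and the cost is 0.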

Improving efficiency
  To avoid comparing each AST with every other AST, we use hashing.
  The upper parts of the trees are hashed.
  [Figure: two assignment-statement ASTs whose upper parts are hashed.]
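One way to picture this step is the sketch below. It assumes trees are nested tuples `(label, child, ...)` with strings as leaves, cuts each tree below a fixed depth, and uses the cut tree as a dictionary key, so the dictionary does the hashing. The depth of 2 and the function names are illustrative choices, not details from the slides.

```python
from collections import defaultdict


def upper_part(tree, depth):
    """Copy of `tree` cut off below `depth` levels; cut children become '_'."""
    if not isinstance(tree, tuple):
        return tree
    if depth == 1:
        return (tree[0],) + ("_",) * (len(tree) - 1)
    return (tree[0],) + tuple(upper_part(c, depth - 1) for c in tree[1:])


def bucket_by_upper_part(trees, depth=2):
    """Group trees whose top `depth` levels are identical."""
    buckets = defaultdict(list)
    for tree in trees:
        buckets[upper_part(tree, depth)].append(tree)
    return buckets


statements = [
    ("=", ("[]", "a", "b"), ("+", "x", "0")),   # a[b] = x + 0
    ("=", ("[]", "a", "c"), ("+", "y", "1")),   # a[c] = y + 1
    ("=", "a", ("+", "b", "c")),                # a = b + c
]
for key, group in bucket_by_upper_part(statements).items():
    print(key, len(group))
```

Only trees that land in the same bucket then need to be compared with each other.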

Why is this not enough?
  If we only consider pairs of statements from the same cluster one at a time, we miss cloned sequences of statements.
  We should find all pairs of identical cluster sequences and then check them for similarity.

    void f() {               // cluster 1
        cin >> i;            // cluster 2
        int j = i * 100;     // cluster 3
        cout << i << j;      // cluster 4
    }

    void f(int j) {          // cluster 5
        cin >> i;            // cluster 2
        int j = i * 100;     // cluster 3
        cout << j;           // cluster 6
    }

The second phase: finding all common subsequences
  After the first phase, each statement node is marked with the ID of its cluster
  We want to find all pairs of similar sequences of cluster IDs
  We do this using suffix trees
  Only long common subsequences are considered
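The slides use suffix trees for this phase. The sketch below swaps in a simpler quadratic dynamic-programming scan, which is easier to read but slower, and only compares two sequences of cluster IDs. The name `common_runs` and the `min_len` parameter are illustrative assumptions.

```python
def common_runs(a, b, min_len=2):
    """Return (start_a, start_b, length) for each maximal common contiguous run."""
    runs = []
    # prev[j] = length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                # The run is maximal if it cannot be extended to the right.
                at_end = i == len(a) or j == len(b) or a[i] != b[j]
                if at_end and cur[j] >= min_len:
                    runs.append((i - cur[j], j - cur[j], cur[j]))
        prev = cur
    return runs


# Cluster IDs of the two functions from the previous slide.
first = [1, 2, 3, 4]
second = [5, 2, 3, 6]
print(common_runs(first, second))  # [(1, 1, 2)]
```

The single result says that the two statements starting at index 1 in each function belong to the same clusters (2 and 3), so they form a candidate pair for the third phase.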

The third phase: finding similar sequences of statements
  [Figure: the statement sequences i=0; k=3; f(i,k) and k=0; n=3; f(k,n).]

Comparison with existing AST methods
  W. Yang, 1991: edit distance between two trees
  I. Baxter et al., 1998: hash functions on subtrees, some kind of edit distance
  V. Wahler, 2004: comparison of feature vectors
  S. Evans et al., 2007: subtree patterns (similar to anti-unification), hash functions on subtrees

Clone Digger
  The tool is written in Python
  Supported languages:
    Python (ASTs are built using the standard package "compiler")
    Java 1.5 (using the parser generator ANTLR)
  Information on the clones found is written to HTML, with the differences highlighted
  Its application to the open-source projects NLTK and BioPython showed that 12% of their code consists of clones
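The "compiler" package named above was removed in Python 3. As a rough illustration of what per-statement ASTs look like, the sketch below uses the standard `ast` module as a stand-in. It is not Clone Digger's own code, and the helper name `statement_trees` is hypothetical.

```python
import ast


def statement_trees(source):
    """Parse Python source and return the AST node of each top-level statement."""
    return ast.parse(source).body


source = "x = a + b\ny = f(x, j)\nprint(y)\n"
for node in statement_trees(source):
    # ast.dump gives a textual view of the subtree rooted at the statement.
    print(type(node).__name__, ast.dump(node))
```

Each returned node is the root of one statement's subtree, which is the unit that the first phase groups into clusters.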

Clone Digger Provided under the GPL license and can be downloaded from the site

Thank you!