Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Code-Clone Detection.


Code-Clone Detection Tool CCFinder
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University, Japan

Contents
  Code Clone
  Code Clone Detection Tool: CCFinder
  Code Clone Analysis Tool: Gemini
  Applications
  Summaries and Future Works

Code Clone
  In our studies, a code clone (or software clone) is a code fragment in source files that is identical or similar to another fragment.
  Figure labels: clone pair, clone class

Problems caused by code clones
  Code clones are generally regarded as one of the problems in software maintenance.
  If a fault is found in a code portion, all of its cloned portions should be modified as well.
  “Programs that have duplicate logic are hard to modify.” [Fowler]
  It is unrealistic to find code clones by hand in a million lines of source code.

Initial motivation of this project
  A huge software system used in a government division
    One million lines of code in two thousand modules
    Written mainly in COBOL
  The system was developed more than 20 years ago and has been maintained continually by a large number of engineers.
  It was believed that the system contained many code clones, but the documentation did not provide enough information about them.

Code Clone Analysis Tools: CCFinder & Gemini
  We have been developing code clone analysis tools:
    a code clone detection tool, CCFinder [1]
    a GUI-based clone analysis environment, Gemini [2]
  We have delivered these tools to software companies and evaluated their usefulness through several case studies.
  [1] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: A multi-linguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, 28(7).
  [2] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue, “Gemini: Maintenance Support Environment Based on Code Clone Analysis”, Proc. of the 8th IEEE International Symposium on Software Metrics, 67-76, 2002.

Outline of CCFinder
  CCFinder compares source code directly, token by token, and detects code clones.
    Normalization of name space
    Replacement of user-defined names
    Removal of table initialization
    Consideration of module delimiters
  CCFinder can analyze a system of millions of lines in a practical amount of time.
  Target languages: C/C++, Java, COBOL, FORTRAN, LISP

CCFinder: Example of clone detection process
  Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Clone pairs

  Source files:
     1. static void foo() throws RESyntaxException {
     2.   String a[] = new String [] { "123,400", "abc", "orange 100" };
     3.   org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
     4.   int sum = 0;
     5.   for (int i = 0; i < a.length; ++i)
     6.     if (pat.match(a[i]))
     7.       sum += Sample.parseNumber(pat.getParen(0));
     8.   System.out.println("sum = " + sum);
     9. }
    10. static void goo(String [] a) throws RESyntaxException {
    11.   RE exp = new RE("[0-9,]+");
    12.   int sum = 0;
    13.   for (int i = 0; i < a.length; ++i)
    14.     if (exp.match(a[i]))
    15.       sum += parseNumber(exp.getParen(0));
    16.   System.out.println("sum = " + sum);
    17. }

  Clone pairs: 0.13,1 9,111,1 17,1
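The pipeline above (lexical analysis, transformation, match detection) can be illustrated with a very small sketch. It is not CCFinder's implementation: the keyword set, the regular-expression lexer, the `$` placeholder, and the class and method names are all assumptions made for illustration, and match detection here uses a simple dynamic-programming search for the longest common contiguous token run between two snippets instead of a suffix tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenCloneSketch {
    // Hypothetical, deliberately small keyword set; a real lexer covers the whole language.
    private static final Set<String> KEYWORDS = Set.of(
        "static", "void", "int", "for", "if", "new", "throws", "return");

    // Identifiers, integer literals, string literals, or any single non-space character.
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z_][A-Za-z_0-9]*|\\d+|\"[^\"]*\"|\\S");

    /** Lexical analysis plus transformation: every user-defined name becomes "$". */
    static List<String> normalize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            boolean identifier = Character.isJavaIdentifierStart(t.charAt(0));
            tokens.add(identifier && !KEYWORDS.contains(t) ? "$" : t);
        }
        return tokens;
    }

    /** Match detection: length of the longest common contiguous run of tokens. */
    static int longestCommonRun(List<String> a, List<String> b) {
        int best = 0;
        int[] prev = new int[b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            int[] cur = new int[b.size() + 1];
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    cur[j] = prev[j - 1] + 1;   // extend the run ending at (i-1, j-1)
                    best = Math.max(best, cur[j]);
                }
            }
            prev = cur;
        }
        return best;
    }

    public static void main(String[] args) {
        String foo = "int sum = 0; for (int i = 0; i < a.length; ++i) if (pat.match(a[i])) sum += n;";
        String goo = "int total = 0; for (int k = 0; k < b.length; ++k) if (exp.match(b[k])) total += n;";
        // After normalization the two snippets differ only in names, so the run spans both.
        System.out.println(longestCommonRun(normalize(foo), normalize(goo)));
    }
}
```

Because names are replaced before matching, fragments such as `foo` and `goo` above can match even though their variable names differ; a minimum run length (the slides mention 30 tokens in one case study) would then decide which matches are reported.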

Outline of Gemini
  Gemini is a GUI-based clone analysis environment.
  Gemini uses CCFinder as its clone detection unit.
  Gemini has three main interfaces:
  Scatter plot
    – Users can select clones by dragging the mouse.
    – The scatter plot supports sorting, zooming, and so on.
  Metric graph
    – The metric graph shows several metrics of each clone class.
    – Users can select clones by specifying a range for each metric value.
  Source code view
    – Users can browse the source code of clones selected in the other views.
  Gemini is implemented in Java.

Gemini: Architecture
  Components: source files, code clone detector (CCFinder), code clone database, clone pair manager, metrics manager, source code manager
  User interfaces: scatter plot view, metric graph views, source code view, sharing clone selection information with the user
  Scatter plot example: a b c a b c a d e c (a, b, c, ...: tokens; dot: matched position)
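The scatter plot idea can be tried out in text form. The sketch below is only an illustration, not Gemini's code: the class and method names are invented, and comparing the example token sequence against itself is an assumption about how the slide's figure was laid out. A cell is marked when the tokens on the two axes are equal, so diagonal runs of marks correspond to repeated token runs, that is, clone candidates.

```java
class ScatterPlotSketch {
    /** Renders a text dot matrix: row i, column j is '*' when ys[i] equals xs[j]. */
    static String plot(String[] xs, String[] ys) {
        StringBuilder sb = new StringBuilder();
        for (String y : ys) {
            for (String x : xs) {
                sb.append(x.equals(y) ? '*' : '.');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Token sequence taken from the slide; both axes use the same sequence (assumption).
        String[] tokens = "a b c a b c a d e c".split(" ");
        System.out.print(plot(tokens, tokens));
    }
}
```

The main diagonal is always fully marked in a self-comparison; the interesting structure is in the shorter diagonals parallel to it.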

Gemini: Architecture (metrics manager)
  DFL(C): Estimate of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine
  Figure labels: new subroutine, caller statements


Application of CCFinder & Gemini
  Open source software
    JDK libraries (Java, 570 KLOC)
    Linux, FreeBSD (C, MLOC)
    FreeBSD, OpenBSD, NetBSD (C)
    Qt (C++, 240 KLOC)
  Commercial software (about 30 companies)
    NTT Data Corp., Hitachi Ltd., Hitachi GP, NEC soft Ltd., ASTEC Inc., SRA Inc., NASDA, Daiwa Computer, etc.
  Student exercises at Osaka University
  Filed in court as evidence in a software copyright suit

Application: JDK library
  JDK (Java Development Kit)
  Number of files: 1700
  LOC: 500,000
  Analysis time: 3 minutes on a Pentium III 650 MHz with 1 GB RAM

Scatter plot
  Unit of clone: 20 LOC
  A: Many code clones are detected.
  B: The longest clone

A: Many code clones
  29 files in src/javax/swing/plaf/multi/*.java
  This code was generated by an automatic code generation tool.

     31|  */
     32| public class MultiButtonUI extends ButtonUI {
     33|
    160|     public static ComponentUI createUI(JComponent a) {
    161|         ComponentUI mui = new MultiButtonUI();
    162|         return MultiLookAndFeel.createUIs(mui,
    163|                                           ((MultiButtonUI) mui).uis,
    164|                                           a);
    165|     }

B: The longest clone
  349 LOC
  Eighteen “sort” methods in src/java/util/Arrays.java
  Difference: argument types and number of arguments

Application: FreeBSD, Linux, NetBSD
  Three types of UNIX
    FreeBSD 4.0 (C, 2200 KLOC)
    Linux (C, 2400 KLOC)
    NetBSD 1.5 (C, 2600 KLOC)
  FreeBSD and NetBSD were derived from the same code.
  Unit of code clone: more than 30 tokens
  Analysis time: 108 minutes

Scatter Plot

Clones of FreeBSD and Linux
  Device driver

Summary and Future Works
  Code clone detection tool: CCFinder
  Code clone analysis tool: Gemini
  Practical use of code clone information
    Refactoring
    Reusable components


The difference between ‘diff’ and clone detection tools
  Diff finds the longest common subsequence of lines.
  Given a code portion, diff does not report two or more identical code portions (clones).
  A clone detection tool finds all identical or similar code portions.

Suffix tree
  A suffix tree is a tree that satisfies the following conditions:
  1. A leaf node represents the starting position of a substring.
  2. A path from the root node to a leaf node represents a substring.
  3. The first characters of the labels of all edges leaving one node are different from each other.
  → A common path means a clone
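A shared path from the root of a suffix tree corresponds to a prefix shared by two suffixes. The sketch below illustrates that idea with a simpler stand-in, a naively sorted list of suffix start positions (a suffix array), not a real suffix tree: after sorting, the longest prefix shared by any two suffixes is found between neighbours. The class and method names are hypothetical, and the construction is far slower than what a production tool would use.

```java
import java.util.Arrays;
import java.util.List;

class RepeatedRunSketch {
    /** Returns the longest token run that occurs at least twice (occurrences may overlap). */
    static List<String> longestRepeatedRun(List<String> tokens) {
        int n = tokens.size();
        Integer[] starts = new Integer[n];
        for (int i = 0; i < n; i++) starts[i] = i;
        // Sort suffix start positions lexicographically (naive construction).
        Arrays.sort(starts, (x, y) -> compareSuffixes(tokens, x, y));
        int bestLen = 0;
        int bestStart = 0;
        for (int k = 1; k < n; k++) {
            // Neighbouring suffixes in sorted order share the longest prefixes.
            int len = commonPrefix(tokens, starts[k - 1], starts[k]);
            if (len > bestLen) {
                bestLen = len;
                bestStart = starts[k];
            }
        }
        return tokens.subList(bestStart, bestStart + bestLen);
    }

    /** Number of leading tokens shared by the suffixes starting at i and j. */
    static int commonPrefix(List<String> t, int i, int j) {
        int len = 0;
        while (i + len < t.size() && j + len < t.size()
                && t.get(i + len).equals(t.get(j + len))) {
            len++;
        }
        return len;
    }

    /** Lexicographic comparison of two suffixes; a suffix that is a prefix of another sorts first. */
    static int compareSuffixes(List<String> t, int i, int j) {
        int len = commonPrefix(t, i, j);
        if (i + len == t.size() || j + len == t.size()) {
            return Integer.compare(t.size() - i, t.size() - j);
        }
        return t.get(i + len).compareTo(t.get(j + len));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("x y z q x y z".split(" "));
        System.out.println(longestRepeatedRun(tokens));
    }
}
```

A clone detector would report every shared prefix above a minimum length, not only the longest one, and would map token positions back to file and line locations.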

Example of transformation rules in Java
  All user-defined identifiers are transformed to the same token.
  A unique identifier is inserted at each end of the top-level definitions and declarations.
    This prevents detecting clones that begin in the middle of one class definition and end in the middle of another.
  "java.lang.Math.PI" is transformed to "Math.PI".
    With an import statement, a class can be referred to by either its full package name or a shorter name.
  "new int[] {1, 2, 3}" is transformed to "new int[] {$}".
    This eliminates table initialization code.
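Two of these rules can be approximated with regular expressions over source text. This is a rough sketch under stated assumptions, not how CCFinder applies its rules (the slides describe transformations on token sequences): the class and method names are hypothetical, and the package-prefix rule relies on the naming convention that package segments start with a lower-case letter and class names with an upper-case one, so it will misfire on code that breaks that convention.

```java
class TransformRulesSketch {
    /** Replaces the contents of an array initializer after "new T[]" with "$". */
    static String collapseTableInit(String code) {
        // Group 1 keeps "new T[] "; the braces and their contents become "{$}".
        return code.replaceAll("(new\\s+\\w+\\s*\\[\\s*\\]\\s*)\\{[^{}]*\\}", "$1{\\$}");
    }

    /** Drops a lower-case package prefix from a qualified name such as java.lang.Math.PI. */
    static String stripPackagePrefix(String code) {
        // One or more lower-case segments followed by a dot, directly before a capitalized name.
        return code.replaceAll("\\b(?:[a-z_][a-z0-9_]*\\.)+(?=[A-Z])", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPackagePrefix("double x = java.lang.Math.PI;"));
        System.out.println(collapseTableInit("int[] t = new int[] {1, 2, 3};"));
    }
}
```

Applied to the earlier example, `org.apache.regexp.RE` and `RE` become the same text, which is what lets the two methods match despite the different ways they refer to the class.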

The output of CCFinder
  #version: ccfinder 3.1
  #langspec: JAVA
  #option: -b 30,1
  #option: -k +
  #option: -r abcdfikmnprsv
  #option: -c wfg
  #begin{file description}
  C:\Gemini.java
  C:\GeneralManager.java
  :
  #end{file description}
  #begin{clone}
  ,9 63, ,9 553, ,9 63, ,9 633, ,9 152, ,9 216,51 42
  :
  #end{clone}

  Annotations:
    Object file ID (file 0 in Group 0)
    Location of a clone pair (Lines in file 0.1 and Lines in file 1.10 are identical or similar to each other)
  It is difficult to analyze source code using only this text-based information about the locations of clone pairs.

The analysis of comparison among students (non-gapped clones only)
  Figure labels: A, B, the corresponding code
  A (2 students): The similar code fragments came from the source code of a sample compiler described in the textbook.
  B (4 students): Many code fragments were similar even in variable names and comments.

Clone class metrics
  LEN(C): Length of the token sequence of each element in clone class C
  LNR(C): Length of the non-repetitive token sequence of LEN(C)
  POP(C): Number of elements in clone class C
  DFL(C): Estimate of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine
  RAD(C): Distribution in the file system of the elements in clone class C
  Figure labels: new subroutine, caller statements
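The DFL description can be turned into a small worked calculation. The formula below is one plausible reading of the prose, not a formula taken from the slides: POP(C) fragments of LEN(C) tokens each are removed, one caller statement is added per fragment, and a single new routine of about LEN(C) tokens is added. The `callTokens` parameter (the token cost of one caller statement) and the class name are assumptions, and the token cost of the routine's own header is ignored.

```java
class CloneMetricsSketch {
    /**
     * Estimated token reduction when every fragment of a clone class is replaced
     * by a call to one shared routine (an assumed reading of DFL).
     *
     * @param len        LEN(C), tokens per fragment
     * @param pop        POP(C), number of fragments in the clone class
     * @param callTokens assumed token cost of one caller statement
     */
    static int dfl(int len, int pop, int callTokens) {
        int removed = len * pop;               // all cloned fragments disappear
        int added = callTokens * pop + len;    // one call per fragment plus the new routine
        return removed - added;
    }

    public static void main(String[] args) {
        // Example: 3 fragments of 100 tokens each, with calls assumed to cost 5 tokens.
        System.out.println(dfl(100, 3, 5));
    }
}
```

Under this reading, a clone class with a high LEN and a high POP yields the largest estimated reduction, which matches the way the slides use the metric to pick refactoring candidates.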

Comparison with AST approach
  Features of the AST approach
    Extracts identical sub-trees of the AST as clones
    The result is precise because of strict syntax analysis.
    High space and time complexity
  Features of our approach
    A hybrid of CCFinder's quick but inaccurate clone detection and CCShaper's filtering, which takes syntax structure into account

The other approaches
  AST (abstract syntax tree) approach
    Clone = the same sub-trees in an AST
    Deep dependence on the programming language
  PDG (program dependence graph) approach
    Clone = the same sub-graphs in a PDG
    Graph comparison is difficult
  Code metrics
    Clone = routines that have the same metric values
    Severe restriction in granularity
  CCFinder & CCShaper
    Clone = code fragments that have the same syntax structure
    Limited precision

Why I chose “a”
  I selected the clones by the following criteria:
    All clone code fragments appear in the same class.
    The metric LEN is high.
    The code fragment includes a whole method body.

Matching algorithm using a suffix tree
  A tree that satisfies the following conditions:
  (1) Each leaf is the starting position of a substring.
  (2) Following the labels from the root to a leaf yields a substring.
  (3) The labels of the edges leaving a node all begin with different characters.
  → A common path = a clone