Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Code-Clone Detection.


1 Code-Clone Detection Tool CCFinder. Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University, Japan

2 Contents
- Code Clone
- Code Clone Detection Tool: CCFinder
- Code Clone Analysis Tool: Gemini
- Applications
- Summary and Future Work

3 Code Clone
- In our studies, a code clone (or software clone) is a code fragment in source files that is identical or similar to another fragment.
- Clone pair: a pair of code fragments that are identical or similar to each other.
- Clone class: a set of code fragments in which every two fragments form a clone pair.
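The relationship between clone pairs and clone classes can be illustrated with a small grouping routine. This is only a sketch: it assumes the clone relation is treated as an equivalence relation, so that pairs sharing a fragment fall into one class, and the fragment names `F1` to `F5` are hypothetical.

```python
def clone_classes(pairs):
    """Group fragments into clone classes from a list of clone pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for fragment in list(parent):
        groups.setdefault(find(fragment), set()).add(fragment)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical fragment names; each tuple is one clone pair.
pairs = [("F1", "F2"), ("F2", "F3"), ("F4", "F5")]
classes = clone_classes(pairs)
print(classes)
```

Three pairs collapse into two classes here, which is why a class-level view is usually more compact than a list of pairs.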

4 Problems caused by code clones
- Code clones are generally regarded as one of the factors that make software maintenance difficult.
- If a fault is found in a code portion, every clone of that portion should be modified as well.
- “Programs that have duplicate logic are hard to modify.” [Fowler]
- It is unrealistic to find code clones by hand in millions of lines of source code.

5 Initial motivation of this project
- A huge software system used in a division of the government
  - One million lines of code in two thousand modules
  - Written mainly in COBOL
- The system was developed more than 20 years ago and has been maintained continually by a large number of engineers.
- It was believed that the system contained many code clones, but the documentation did not provide enough information about them.

6 Code Clone Analysis Tools: CCFinder & Gemini
- We have been developing two code clone analysis tools:
  - CCFinder [1], a code clone detection tool
  - Gemini [2], a GUI-based clone analysis environment
- We have delivered these tools to software companies and evaluated their usefulness through case studies.
[1] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: A multi-linguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, 28(7):654-670, 2002.
[2] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue, “Gemini: Maintenance Support Environment Based on Code Clone Analysis”, Proc. of the 8th IEEE International Symposium on Software Metrics, 67-76, 2002.

7 Outline of CCFinder
- CCFinder compares source code directly, token by token, and detects code clones. Its transformations include:
  - Normalization of name spaces
  - Replacement of user-defined names
  - Removal of table initialization
  - Consideration of module delimiters
- CCFinder can analyze a system of millions of lines in a practical amount of time.
- Target languages: C/C++, Java, COBOL, FORTRAN, LISP

8 CCFinder: Example of clone detection process
Process: Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Clone pairs
Source file:
     1. static void foo() throws RESyntaxException {
     2.   String a[] = new String [] { "123,400", "abc", "orange 100" };
     3.   org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
     4.   int sum = 0;
     5.   for (int i = 0; i < a.length; ++i)
     6.     if (pat.match(a[i]))
     7.       sum += Sample.parseNumber(pat.getParen(0));
     8.   System.out.println("sum = " + sum);
     9. }
    10. static void goo(String [] a) throws RESyntaxException {
    11.   RE exp = new RE("[0-9,]+");
    12.   int sum = 0;
    13.   for (int i = 0; i < a.length; ++i)
    14.     if (exp.match(a[i]))
    15.       sum += parseNumber(exp.getParen(0));
    16.   System.out.println("sum = " + sum);
    17. }
Clone pair: 0.1 3,1 9,1 and 11,1 17,1 (lines 3 - 9 and lines 11 - 17)
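The process on this slide (lexical analysis, transformation, match detection) can be sketched in a few lines of Python. This is a simplified illustration, not CCFinder's implementation: the tokenizer regular expression, the keyword list, and the quadratic match detection are all assumptions made for the example, and CCFinder itself uses a suffix-tree matcher and many more transformation rules.

```python
import re

# Assumed, deliberately incomplete keyword list and token pattern.
KEYWORDS = {"static", "void", "int", "for", "if", "new", "throws", "return"}
TOKEN_RE = re.compile(r'"[^"]*"|[A-Za-z_]\w*|\d+|\+\+|\+=|[^\s\w]')


def tokenize(source):
    """Lexical analysis: split source text into a flat token sequence."""
    return TOKEN_RE.findall(source)


def transform(tokens):
    """Transformation: replace every non-keyword identifier with the token '$'."""
    return ["$" if re.fullmatch(r"[A-Za-z_]\w*", t) and t not in KEYWORDS else t
            for t in tokens]


def find_clones(tokens, min_len):
    """Match detection: return (start_i, start_j, length) for maximal,
    non-overlapping matching runs of at least min_len tokens (naive O(n^2))."""
    n = len(tokens)
    run = [[0] * (n + 1) for _ in range(n + 1)]  # run[i][j]: match length from i and j
    for i in range(n - 1, -1, -1):
        for j in range(n - 1, i, -1):
            if tokens[i] == tokens[j]:
                run[i][j] = run[i + 1][j + 1] + 1
    clones = []
    for i in range(n):
        for j in range(i + 1, n):
            length = min(run[i][j], j - i)  # the two fragments must not overlap
            starts_run = i == 0 or tokens[i - 1] != tokens[j - 1]
            if starts_run and length >= min_len:
                clones.append((i, j, length))
    return clones


first = "int sum = 0; for (int i = 0; i < a.length; ++i) sum += f(a[i]);"
second = "int total = 0; for (int k = 0; k < b.length; ++k) total += g(b[k]);"
half = len(tokenize(first))
raw_tokens = tokenize(first + " " + second)

print(find_clones(raw_tokens, 30))             # identifiers differ, so nothing matches
print(find_clones(transform(raw_tokens), 30))  # after transformation the halves match
```

The two snippets differ only in user-defined names, so they are found as a clone pair only after the transformation step, which is the point of normalizing names before matching.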

9 Outline of Gemini
- Gemini is a GUI-based clone analysis environment.
- Gemini uses CCFinder as its clone detection unit.
- Gemini has three main interfaces:
  - Scatter plot
    - The user can select clones by dragging the mouse.
    - The scatter plot provides sorting, zooming, and other functions.
  - Metric graph
    - The metric graph shows several metrics for each clone class.
    - The user can select clones by specifying a range for each metric value.
  - Source code view
    - The user can browse the source code of clones selected in the other views.
- Gemini is implemented in Java.

10 Gemini: Architecture
[Figure: architecture diagram. Components: User; Source files; Code clone detector (CCFinder); Code clone database; Gemini; Clone pair manager; Metrics manager; Source code manager; User interfaces (Scatter plot view, Metric graph views, Source code view); Clone selection information]
[Figure: scatter plot example over the tokens a b c a b c a d e c. Legend: a, b, c, ... are tokens; a dot marks a matched position]
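The scatter plot in this figure can be imitated in text form. The sketch below is an assumption about how such a plot is laid out, not Gemini's code: it compares the slide's token sequence with itself and marks each matched position with `*`.

```python
def dot_matrix(xs, ys):
    """Return one text row per token in ys: '*' where it equals the token in xs."""
    return ["".join("*" if y == x else "." for x in xs) for y in ys]


tokens = "a b c a b c a d e c".split()
rows = dot_matrix(tokens, tokens)
for token, row in zip(tokens, rows):
    print(token, row)
```

The main diagonal is always filled because every token matches itself. A clone shows up as a diagonal run away from the main diagonal, for example the `a b c` at positions 3 to 5 matching the `a b c` at positions 0 to 2.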

11 Gemini: Architecture
[Figure: the architecture diagram of slide 10, annotated with the DFL metric; labels: new subroutine, caller statements]
DFL(C): Estimation of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine.

12 Gemini: Architecture
[Figure: the same architecture diagram as slide 10]

13 Applications of CCFinder & Gemini
- Open source software
  - JDK libraries (Java, 570 KLOC)
  - Linux, FreeBSD (C, 1.6 + 1.3 MLOC)
  - FreeBSD, OpenBSD, NetBSD (C)
  - Qt (C++, 240 KLOC)
- Commercial software (about 30 companies)
  - NTT Data Corp., Hitachi Ltd., Hitachi GP, NEC Soft Ltd., ASTEC Inc., SRA Inc., NASDA, Daiwa Computer, etc.
- Student exercises at Osaka University
- Filed in court as evidence in a software copyright suit

14 Application: JDK library
- JDK (Java Development Kit) 1.2.2
  - Number of files: 1,700
  - LOC: 500,000
- Analysis time: 3 minutes on a Pentium III 650 MHz with 1 GB RAM

15 Scatter plot
- Unit of clone: 20 LOC
[Figure: scatter plot with regions A and B]
- A: Many code clones are detected.
- B: The longest clone

16 A: Many code clones
- 29 files in src/javax/swing/plaf/multi/*.java
- This code was generated by an automatic code generation tool.
    public class MultiButtonUI extends ButtonUI {
        // ... (lines 34 - 159 omitted)
        public static ComponentUI createUI(JComponent a) {
            ComponentUI mui = new MultiButtonUI();
            return MultiLookAndFeel.createUIs(mui,
                                              ((MultiButtonUI) mui).uis,
                                              a);
        }

17 B: The longest clone
- 349 LOC
- Eighteen “sort” methods in src/java/util/Arrays.java
- Differences: the types and number of arguments

18 Application: FreeBSD, Linux, NetBSD
- Three UNIX variants
  - FreeBSD 4.0 (C, 2,200 KLOC)
  - Linux 2.4.0 (C, 2,400 KLOC)
  - NetBSD 1.5 (C, 2,600 KLOC)
- FreeBSD and NetBSD were derived from the same code base.
- Unit of code clone: more than 30 tokens
- Analysis time: 108 minutes

19 Scatter Plot

20 Clones of FreeBSD and Linux: Device driver

21 Summary and Future Work
- Code Clone Detection Tool: CCFinder
- Code Clone Analysis Tool: Gemini
- Practical use of code clone information
  - Refactoring
  - Reusable components

23 The difference between ‘diff’ and clone detection tools
- diff finds the longest common subsequence of its two inputs.
- Given a code portion, diff does not report two or more identical code portions (clones).
- A clone detection tool finds all identical or similar code portions.
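The contrast on this slide can be shown with a small experiment. Python's `difflib.SequenceMatcher` stands in for diff here, which is an assumption made for illustration: it is not the Unix diff algorithm, but it likewise aligns each part of one input with at most one part of the other. The token lists are hypothetical.

```python
from difflib import SequenceMatcher


def all_occurrences(pattern, tokens):
    """Clone-detection view: every start index where pattern occurs in tokens."""
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == pattern]


portion = ["x", "=", "y", ";"]
target = ["x", "=", "y", ";", "call", "(", ")", ";", "x", "=", "y", ";"]

# Alignment view: the portion is matched against a single location in target.
matcher = SequenceMatcher(None, portion, target, autojunk=False)
blocks = [tuple(block) for block in matcher.get_matching_blocks()]
occurrences = all_occurrences(portion, target)

print(blocks)       # one real match plus difflib's zero-length terminator
print(occurrences)  # both copies of the portion
```

The aligner reports the copy at index 0 only, while the exhaustive search also finds the second copy at index 8.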

24 Suffix tree
A suffix tree is a tree that satisfies the following conditions:
1. A leaf node represents the starting position of a substring.
2. A path from the root node to a leaf node represents a substring.
3. The first characters of the labels on all edges leaving a node differ from one another.
→ A common path means a clone.
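The idea that a common path means a clone can be tried out with a naive suffix trie. This is a simplification chosen for brevity: a real suffix tree compresses chains of single-child nodes and can be built in linear time, whereas this trie stores one node per distinct substring and takes quadratic space. The function names and the token sequence are hypothetical.

```python
def build_suffix_trie(tokens):
    """Insert every suffix of tokens; each node records which suffixes pass through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(tokens)):
        node = root
        for token in tokens[start:]:
            node = node["children"].setdefault(token, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root


def repeated_paths(root, min_len):
    """Return (path, start positions) for every path shared by two or more suffixes."""
    results = []
    stack = [(root, [])]
    while stack:
        node, path = stack.pop()
        for token, child in node["children"].items():
            if len(child["starts"]) >= 2:  # shared path: the substring occurs twice
                new_path = path + [token]
                if len(new_path) >= min_len:
                    results.append((tuple(new_path), tuple(child["starts"])))
                stack.append((child, new_path))
    return sorted(results)


tokens = "a b c x a b c y".split()
repeats = repeated_paths(build_suffix_trie(tokens), min_len=2)
for path, starts in repeats:
    print(" ".join(path), "at", starts)
```

The longest shared path, `a b c` starting at positions 0 and 4, corresponds to the clone pair in this sequence.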

25 Example of transformation rules in Java
- All user-defined identifiers are transformed to the same token.
- A unique identifier is inserted at each end of top-level definitions and declarations.
  - This prevents detecting clones that begin in the middle of one class definition and end in the middle of another.
- “java.lang.Math.PI” is transformed to “Math.PI”.
  - With an import statement, a class can be referred to by either its full package name or a shorter name.
- “new int[] {1, 2, 3}” is transformed to “new int[] {$}”.
  - This eliminates table initialization code.
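Two of these rules can be approximated with regular expressions over source text. This is a rough sketch under stated assumptions, not CCFinder's rule set: CCFinder works on token sequences, and the package-prefix rule below relies on the heuristic that lowercase dotted segments before a capitalized name form a package path. Both function names are hypothetical.

```python
import re


def strip_package_prefix(code):
    """Drop a dotted lowercase prefix in front of a capitalized name (heuristic)."""
    return re.sub(r"\b(?:[a-z_][a-z0-9_]*\.)+(?=[A-Z])", "", code)


def collapse_initializers(code):
    """Replace the element list of an array initializer after 'new T[]' with '$'."""
    return re.sub(r"(new\s+\w+\s*\[\s*\]\s*)\{[^{}]*\}", r"\1{$}", code)


print(strip_package_prefix("double x = java.lang.Math.PI;"))
print(collapse_initializers("int[] t = new int[] {1, 2, 3};"))
```

The heuristic would also strip a lowercase variable in front of an uppercase field, such as `obj.CONSTANT`, so a real implementation needs the import and declaration information that a regular expression does not have.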

26 The output of CCFinder
    #version: ccfinder 3.1
    #langspec: JAVA
    #option: -b 30,1
    #option: -k +
    #option: -r abcdfikmnprsv
    #option: -c wfg
    #begin{file description}
    0.0 52 C:\Gemini.java
    0.1 94 C:\GeneralManager.java
    :
    #end{file description}
    #begin{clone}
    0.1 53,9 63,13 1.10 542,9 553,13 35
    0.1 53,9 63,13 1.10 624,9 633,13 35
    0.2 124,9 152,31 0.2 154,9 216,51 42
    :
    #end{clone}
- File description: the first field is the object file ID (0.0 is file 0 in group 0).
- Clone section: each line gives the location of a clone pair (lines 53 - 63 in file 0.1 and lines 542 - 553 in file 1.10 are identical or similar to each other).
- It is difficult to analyze source code using only this text-based information about the locations of clone pairs.
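A small parser shows how the clone section of this output could be read. The field layout is inferred from the slide's annotation, so treat it as an assumption: each line is taken to hold a file ID, a start and an end position (the number before the comma being the line), the same three fields for the second fragment, and a final length value. `Fragment` and `parse_clone_section` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    file_id: str
    start_line: int
    end_line: int


def parse_clone_section(text):
    """Return (left fragment, right fragment, length) for each clone pair line."""
    pairs = []
    in_clone = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "#begin{clone}":
            in_clone = True
        elif line == "#end{clone}":
            in_clone = False
        elif in_clone:
            fields = line.split()
            if len(fields) != 7:
                continue  # skip elision markers such as ":"
            file_a, start_a, end_a, file_b, start_b, end_b, length = fields
            left = Fragment(file_a, int(start_a.split(",")[0]), int(end_a.split(",")[0]))
            right = Fragment(file_b, int(start_b.split(",")[0]), int(end_b.split(",")[0]))
            pairs.append((left, right, int(length)))
    return pairs


SAMPLE = """#begin{clone}
0.1 53,9 63,13 1.10 542,9 553,13 35
0.1 53,9 63,13 1.10 624,9 633,13 35
0.2 124,9 152,31 0.2 154,9 216,51 42
:
#end{clone}"""

clone_pairs = parse_clone_section(SAMPLE)
for left, right, length in clone_pairs:
    print(left, right, length)
```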

27 Comparison among students' programs (non-gapped clones only)
[Figure: scatter plot with regions A and B, and the corresponding code]
- A (2 students): The similar code fragments came from the source code of a sample compiler described in the textbook.
- B (4 students): Many code fragments were similar even in variable names and comments.

28 Clone class metrics
- LEN(C): Length of the token sequence of each element in clone class C
- LNR(C): Length of the non-repetitive token sequence of LEN(C)
- POP(C): Number of elements in clone class C
- DFL(C): Estimation of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine
- RAD(C): Distribution in the file system of the elements in clone class C
[Figure: clone fragments replaced by caller statements of a new subroutine]
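The DFL estimate can be worked through with simple arithmetic. The slide gives only the verbal definition, so the formula below is an assumption: before refactoring the class occupies LEN × POP tokens, and afterwards it occupies one routine body of LEN tokens plus one caller statement per element. The caller size `call_tokens` and the sample values are hypothetical.

```python
def pop(clone_class):
    """POP(C): number of code fragments in the clone class."""
    return len(clone_class)


def dfl(length, population, call_tokens):
    """Estimated tokens removed by replacing each fragment with a call
    to one shared routine (assumed formula, see the lead-in)."""
    before = length * population               # every fragment written out in full
    after = length + call_tokens * population  # one routine body plus the callers
    return before - after


# Hypothetical clone class: three fragments of 35 tokens each.
clone_class = ["F1", "F2", "F3"]
estimate = dfl(length=35, population=pop(clone_class), call_tokens=5)
print(estimate)
```

Under this formula a class with only two short fragments can have a DFL near zero or below, which matches the intuition that such clones are poor candidates for extraction.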

29 Comparison with the AST approach
- Features of the AST approach
  - Extracts identical subtrees of the AST as clones
  - The result is precise because of strict syntax analysis.
  - High space and time complexity
- Features of our approach
  - A hybrid of CCFinder’s quick but imprecise clone detection and CCShaper’s filtering, which takes syntax structure into account

30 Other approaches
- AST (abstract syntax tree) approach
  - Clone = identical subtrees in an AST
  - Strongly dependent on the programming language
- PDG (program dependence graph) approach
  - Clone = identical subgraphs in a PDG
  - Graph comparison is difficult
- Code metrics
  - Clone = routines that have the same metric values
  - Severe restriction on granularity
- CCFinder & CCShaper
  - Clone = code fragments that have the same syntax structure
  - Limited precision

31 Why I chose “a”
- I selected the clones by the following criteria:
  - All clone code fragments appear in the same class.
  - The metric LEN is high.
  - The code fragment includes a whole method body.

32 Matching algorithm using a suffix tree
A tree that satisfies the following conditions:
(1) Each leaf is the starting position of a substring.
(2) Following the labels from the root to a leaf yields a substring.
(3) The labels of the edges leaving a single node all begin with different characters.
→ A common path = a clone

