Published by Allan O’Neal. Modified over 8 years ago.
1
On Detection of Gapped Code Clones using Gap Locations
Yasushi Ueda†, Toshihiro Kamiya‡, Shinji Kusumoto†, and Katsuro Inoue†
†Graduate School of Information Science and Technology, Osaka University, Japan
{y-ueda, y-higo, kusumoto, inoue}@ist.osaka-u.ac.jp
‡PRESTO, Japan Science and Technology Corporation, Japan
kamiya@ist.osaka-u.ac.jp
2
APSEC 2002
Contents
- Background
- Research goals
- Gapped code clone detection
- Case study
- Conclusions and future work
3
Background (1/2)
A code clone is a pair or set of code portions in source files that are identical or similar to each other.
4
Background (2/2)
Code clones are one of the factors that make software maintenance more difficult.
- If a fault is found in a code portion, it must also be corrected in all of its clones.
We have developed a code clone detection tool, CCFinder [1], and a clone analysis tool, Gemini [2].
- CCFinder: token-based clone detector
  - The input is a set of source files; the output (text-based) is the locations of clone pairs.
- Gemini: GUI-based clone analysis environment
  - Uses CCFinder as its clone detector.
[1] T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: A multi-linguistic token-based code clone detection system for large scale source code", IEEE Transactions on Software Engineering, 28(7):654-670, 2002.
[2] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue, "Gemini: Maintenance Support Environment Based on Code Clone Analysis", Proc. of the 8th IEEE International Symposium on Software Metrics, 67-76, 2002.
5
CCFinder/Gemini (1/4) Example of clone detection process
Source files -> Lexical analysis -> Token sequence -> Transformation -> Transformed token sequence -> Match detection -> Clones on transformed sequence -> Formatting -> Clone pairs

    static void foo() throws RESyntaxException {
        String a[] = new String [] { "123,400", "abc", "orange 100" };
        org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (pat.match(a[i]))
                sum += Sample.parseNumber(pat.getParen(0));
        System.out.println("sum = " + sum);
    }
    static void goo(String [] a) throws RESyntaxException {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
        System.out.println("sum = " + sum);
    }
6
CCFinder/Gemini (2/4) Gemini overview
A GUI-based code clone analysis tool
- Uses CCFinder as its code clone detector.
- Has several views for interactive analysis:
  - Scatter plot view: select clones by dragging the mouse.
  - Metric graph view: select clones by the values of clone metrics.
  - Source code view
[Figure: scatter plot over the tokens a b c a b c a d e c; a, b, c, ... are tokens and marks show matched positions]
7
CCFinder/Gemini (3/4) Classification of code clones
- Non-gapped clones: exact clone, renamed clone
- Gapped clone: a clone containing gaps
Original code:
    if (a > b) { b++; a=1; }
Reused by copy-and-paste:
- Exact clone:
    if (a > b) { b++; a=1; }
- Renamed clone (renamed):
    if (i > j) { j++; i=0; }
- Gapped clone (inserted):
    if (i > j) { i = i / 2; j++; i=0; }
- Gapped clone (deleted):
    if (i > j) { i=0; }
- Gapped clone (modified):
    if (i > j) { j = j + 1; i=0; }
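The three classes can be illustrated with a small sketch. This is not CCFinder's algorithm: the tokenizer, the keyword set, the `$id`/`$num` placeholders and the 0.6 similarity cut-off are all assumptions made for this example, and Python's `difflib.SequenceMatcher` stands in for real gapped matching.

```python
import re
from difflib import SequenceMatcher

# Hypothetical mini-tokenizer: identifiers, integers, or any single symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "for", "while", "return"}  # assumed, incomplete


def tokenize(code):
    return TOKEN.findall(code)


def normalize(tokens):
    """Replace identifiers and integer literals with placeholder tokens."""
    out = []
    for tok in tokens:
        if tok in KEYWORDS:
            out.append(tok)
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            out.append("$id")
        elif tok.isdigit():
            out.append("$num")
        else:
            out.append(tok)
    return out


def classify(a, b, min_ratio=0.6):
    """Label a pair of fragments as exact, renamed, gapped or different."""
    ta, tb = tokenize(a), tokenize(b)
    if ta == tb:
        return "exact"
    na, nb = normalize(ta), normalize(tb)
    if na == nb:
        return "renamed"
    # Assumption: a high similarity ratio is treated as "same code with gaps".
    ratio = SequenceMatcher(None, na, nb, autojunk=False).ratio()
    return "gapped" if ratio >= min_ratio else "different"


base = "if (a > b) { b++; a=1; }"
for other in ["if (a > b) { b++; a=1;}",
              "if (i > j) { j++; i=0; }",
              "if (i > j) { i = i / 2; j++; i=0; }",
              "if (i > j) { i=0; }"]:
    print(classify(base, other), "<-", other)
```

Comparing token sequences rather than characters is what makes whitespace differences irrelevant for the exact case.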
8
CCFinder/Gemini (4/4) Needs of gapped clone detection
- CCFinder detects non-gapped clones.
- A gapped clone is detected separately as several short non-gapped clones.
- If a matched portion is too short, CCFinder does not identify it as a clone, because the minimum length of clones to be detected must be set in CCFinder beforehand.
- In general, if the minimum length is set too short, too many clones are detected.
  - Clones longer than 30 tokens: the number of clone pairs is 1208
  - Clones longer than 10 tokens: the number of clone pairs is 26984

Example (matched portions of 14 tokens, 27 tokens and 13 tokens; set the min. length to 20 tokens…):

    static void foo() throws RESyntaxException
    {
        String a[] = new String [] {"123,400", "abc"};
        org.apache.regexp.RE pat =
            new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
        {
            if (pat.match(a[i])){
                sum += Sample.parseNumber(pat.getParen(0));}
        }
        System.out.println("sum = " + sum);
    }

    static void goo(String [] a) throws RESyntaxException
    {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        int i = 0;
        while (i < a.length)
        {
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
            i++;
        }
        System.out.println("sum = " + sum);
    }
9
Research goals
- Propose a method to efficiently detect gapped clones.
- Conduct a case study to evaluate the method.
10
Gapped code clone detection - Overview (1/2)
Major premise
- Treat the problem of detecting gapped clones as a problem of combining non-gapped clones.
Combinatorial explosion of non-gapped clones
- If there are many overlapping or densely packed non-gapped clones, identifying gapped clones causes a combinatorial explosion, because one non-gapped clone may have many other non-gapped clones that it could be combined with into a gapped clone.
- The computation takes a long time.
[Figure: examples in which the number of combinations is 3, 15 and 105]
11
Gapped code clone detection - Overview (2/2)
Approach: man-machine collaboration
- Extract concatenated subsets ("entanglements") from the set of all non-gapped clones.
- Visualize the entanglements on a scatter plot.
- Users can see the locations where gapped clones possibly exist and interactively pick one of them to find the gapped clones in it.
Detecting process
- Step1: Non-gapped clone detection
- Step2: Gap identification
- Step3: Visualization
- Step4: Source code investigation
[Figure: process diagram with the stages Non-gapped clone detection, Gap identification, Visualization and Source code investigation, and the data Source files, Non-gapped clones, Gaps, Gap-and-clone scatter plot and Correspondences]
12
Gapped code clone detection - Detecting process
Sample input
- Code sequence of source file X: "ABCDCDEFBCDG"
- Code sequence of source file Y: "ABCEFBCDEBCD"
- "A", "B", "C", ... are code portions in a certain unit.
[Figure: scatter plot of the non-gapped clones between file X and file Y]
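For the sample input, the non-gapped clones are the maximal diagonal runs of matching code portions between X and Y. The sketch below finds them by brute force, which is only an illustration and not how CCFinder performs match detection; the function name, the `min_len` parameter and the 1-based tuple layout `(x_start, x_end, y_start, y_end)` are assumptions of this example.

```python
def non_gapped_clones(x, y, min_len=1):
    """Return maximal common runs as 1-based (x_start, x_end, y_start, y_end)."""
    clones = []
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] != y[j]:
                continue
            # Skip positions that only continue a run started earlier.
            if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
                continue
            k = 0
            while i + k < len(x) and j + k < len(y) and x[i + k] == y[j + k]:
                k += 1
            if k >= min_len:
                clones.append((i + 1, i + k, j + 1, j + k))
    # Clones that start earlier in file X come first in the list.
    return sorted(clones, key=lambda c: (c[0], c[1], c[2]))


X = "ABCDCDEFBCDG"
Y = "ABCEFBCDEBCD"
for x_start, x_end, y_start, y_end in non_gapped_clones(X, Y):
    print(f"X {x_start}-{x_end}  Y {y_start}-{y_end}  {X[x_start - 1:x_end]}")
```

Raising `min_len` plays the role of the minimum clone length: short runs such as the single "C" disappear from the result.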
13
Gapped code clone detection - Detecting process
[Figure: a gap between two non-gapped clones, limited by the upper limit of gap length]
14
Gapped code clone detection - Detecting process
[Figure: scatter plots of file X (ABCDCDEFBCDG) against file Y (ABCEFBCDEBCD) showing the non-gapped clones]
15
Gapped code clone detection - Detecting process
[Figure: scatter plot of file X (ABCDCDEFBCDG) against file Y (ABCEFBCDEBCD)]
16
Gapped code clone detection - Implementation
- CCFinder is used as the non-gapped clone detection tool.
- Gemini, a GUI-based maintenance support tool, is extended.
- On the gap-and-clone scatter plot view implemented in Gemini, the user can select non-gapped clones by dragging the mouse and refer to the actual source code.
[Figure: Gemini screenshot with an entanglement marked]
17
Case study overview
Application target
- Programs developed in a programming exercise at Osaka University
- A compiler written in C
- Consists of three steps (sub-exercises):
  - Step1 (Ex.1): Making a syntax checker
  - Step2 (Ex.2): Making a semantic checker
  - Step3 (Ex.3): Making a compiler
- In Ex.2 and Ex.3, students were also required to develop the programs by reusing the code of the previous programs.
- Programs of 69 students; total size is 360,000 lines of code.
Issues of analysis
- Type of gapped clones found in the gap-and-clone scatter plot
- Usefulness of the gap-and-clone scatter plot
18
Analysis - Type of gapped clone found in gap-and-clone scatter plot
Compare three versions of a function "sentence()" in Ex.1, Ex.2 and Ex.3 of a certain student.
- A: the minimum size of non-gapped clones is 20 tokens.
- B: the minimum size of non-gapped clones is 10 tokens, the maximum size of gaps is 10 tokens, and the minimum size of entanglements is 20 tokens.
[Figure: scatter plots of Ex.1, Ex.2 and Ex.3 against each other for settings A and B; clone lengths shown are 40, 45, 27 and 50 tokens (A) and 18, 14, 12 and 14 tokens (B); regions are labelled A and B1 to B4]

sentence() in Ex.2:

    void sentence()
    {
        if ((tok_name == SIDENTIFIER) || (tok_name == SREADLN) ||
            (tok_name == SWRITELN) || (tok_name == SBEGIN))
            basic_sen();
        else if (tok_name == SIF) {
            scan();
            if (expression() != TBOOLEAN) error(4);
            if (tok_name != STHEN) syntax_error();
            scan();
            multi_sentence();
            if (tok_name == SELSE) {
                scan();
                multi_sentence();
            }
        } else if (tok_name == SWHILE) {
            scan();
            if (expression() != TBOOLEAN) error(4);
            if (tok_name != SDO) syntax_error();
            scan();
            sentence();
        } else
            syntax_error();
    }

sentence() in Ex.3:

    void sentence()
    {
        int llt, llf, lp, lpf;
        llt = lt; llf = lf; lp = p; lpf = pf;
        if ((tok_name == SIDENTIFIER) || (tok_name == SREADLN) ||
            (tok_name == SWRITELN) || (tok_name == SBEGIN))
            basic_sen();
        else if (tok_name == SIF) {
            scan();
            if (expression() != TBOOLEAN) error(4);
            fprintf(outfile,"\tPOP\tGR2\t;%d\n",tok_line);
            fprintf(outfile,"\tCPA\tGR2,TRUE\n",sub);
            fprintf(outfile,"\tJNZ\tLF%d\n\n",llf);
            lf++; lt++;
            if (tok_name != STHEN) syntax_error();
            scan();
            multi_sentence();
            fprintf(outfile,"\tJMP\tLT%d\n",llt);
            fprintf(outfile,"LF%d\n\n",llf);
            if (tok_name == SELSE) {
                scan();
                multi_sentence();
            }
            fprintf(outfile,"LT%d\n",llt);
        } else if (tok_name == SWHILE) {
            scan();
            fprintf(outfile,"LOOP%d\n",lp);
            p++;
            if (expression() != TBOOLEAN) error(4);
            fprintf(outfile,"\tPOP\tGR2\t;%d\n",tok_line);
            fprintf(outfile,"\tCPA\tGR2,TRUE\n",sub);
            fprintf(outfile,"\tJNZ\tLOOF%d\n\n",lpf);
            pf++;
            if (tok_name != SDO) syntax_error();
            scan();
            sentence();
            fprintf(outfile,"\tJMP\tLOOP%d\n",lp);
            fprintf(outfile,"LOOF%d\n\n",lpf);
        } else
            syntax_error();
    }
19
Conclusions and future work
- A method that shows gapped clones based on gap location information was proposed and implemented.
- A case study was conducted. As a result, we successfully found gapped clones composed of several short clones, each of which is too short to be detected individually.
- At present we only show gapped clones and have no mechanism to evaluate the characteristics of each gapped clone quantitatively. As future work, we plan to examine a method for extracting each gapped clone efficiently.
21
The CCFinder/Gemini web page is available at http://sel.ist.osaka-u.ac.jp/cdtools/index.html.en
22
Application of CCFinder/Gemini
- Free software
  - JDK libraries (Java, 570 KLOC)
  - Linux, FreeBSD (C, 1.6 + 1.3 MLOC)
  - FreeBSD, OpenBSD, NetBSD (C)
  - Qt (C++, 240 KLOC)
- Commercial software
  - NTT Data Corp., Hitachi Ltd., Hitachi GP Ltd., NEC soft Ltd., ASTEC Inc., SRA Inc., NASDA, etc.
- Student exercises at Osaka University
- Filed in court as evidence in a software copyright suit.
23
Differences between our method and homology analysis in genome informatics
- Alignment analysis
  - Dynamic programming: O(mn) (m, n: lengths of the sequences)
  - The optimal alignment is not our interest.
- Homology search
  - BLAST, FASTA
  - We have no query sequence to search for, and we want to detect all gapped clones.
24
Related work
Baxter et al. [3]
- Extract clone pairs of statements, declarations, or sequences of them from C source files.
- Parse the source code to build an abstract syntax tree (AST) and compare its sub-trees using characterization metrics (hash functions).
- The computational complexity is O(n), where n is the number of sub-trees of the source files.
- The hash function makes it possible to do parameterized matching, to detect gapped clones, and to identify clones of code portions in which some statements are reordered.
[3] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier, "Clone Detection Using Abstract Syntax Trees", Proc. of ICSM ’98, pp. 368-377, Bethesda, Maryland, 1998.
25
Computation cost of our method
- Non-gapped clone detection (in CCFinder): O(n + m)
  - n: length of the source code
  - m: number of non-gapped clones
- Gap identification: O(m)
  - Identification of the gaps combined with each non-gapped clone: O(1)
- Total: O(n + m)
26
The difference between 'diff' and clone detection tools
- diff finds the longest common subsequence.
  - Given a code portion, diff does not report two or more identical code portions (clones).
- A clone detection tool finds all identical or similar code portions.
27
Snapshots of clone class metric graph
[Figure: metric graph with the axes RAD, LEN, POP and DFL; filtering mode: ON]
28
Clone class metrics
- LEN(C): length of the token sequence of each element in clone class C
- POP(C): number of elements in clone class C
- DFL(C): estimate of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine
- RAD(C): distribution in the file system of the elements in clone class C
[Figure: code fragments replaced by caller statements of a new subroutine]
29
Definitions of DFL and RAD
DFL(C)
  DFL(C) = LEN(C) × POP(C) - (5 × POP(C) + LEN(C))
  - LEN(C) × POP(C): the target code size for restructuring
  - 5 × POP(C): the code size of the new caller statements
  - LEN(C): the code size of the new identical routine
RAD(C)
  Distribution in the file system of the elements in clone class C
  - RAD(C) = 0: C is enclosed within a single file.
  - RAD(C) = 1: C is enclosed within a single directory.
  - RAD(C) = n: C is enclosed within a directory tree of n layers.
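A small worked example of DFL, reading the three terms as described on the slide: the code that exists today, minus the caller statements and the one new routine that would replace it. The function name and the `call_cost` parameter are assumptions of this sketch; the constant 5 is the per-call token cost given in the definition.

```python
def dfl(length, pop, call_cost=5):
    """Estimated tokens removed by merging a clone class into one routine."""
    old_size = length * pop               # LEN(C) x POP(C): code to restructure
    new_size = call_cost * pop + length   # caller statements + the new routine
    return old_size - new_size


# A clone class with 3 fragments of 35 tokens each.
print(dfl(length=35, pop=3))  # 105 - (15 + 35) = 55
```

Under this reading, a clone class whose fragments are shorter than about one call statement gives a negative DFL, so merging it would not pay off.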
30
Analysis using clone class metrics
Example of analysis issues
- Finding clones that are appropriate for refactoring
  - Clones having high DFL
  - Clones having high POP and low RAD
    - It may be easy and meaningful to merge such clones into one routine because of their density.
- Finding portions that are not reliable
  - Clones having high LEN
    - Modules having larger code clones are less maintainable than modules having smaller code clones [4].
[4] A. Monden, D. Nakae, T. Kamiya, S. Sato, and K. Matsumoto, "Software Quality Analysis by Code Clones in Industrial Legacy Software", Proc. of the 8th IEEE International Symposium on Software Metrics, 87-96, 2002.
31
Suffix tree
A suffix tree is a tree that satisfies the following conditions.
1. A leaf node represents the starting position of a sub-string.
2. A path from the root node to a leaf node represents a sub-string.
3. The first characters of the labels of all edges leaving one node are different from each other.
-> A common path means a clone.
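The idea that a common path means a clone can be tried out with an uncompressed suffix trie, a simpler and slower relative of the suffix tree. The sketch below is an assumption-laden illustration, not CCFinder's data structure: it inserts every suffix, counts how many suffixes pass through each node, and returns the longest path shared by at least two suffixes. The token sequence is made up.

```python
def longest_repeat(tokens):
    """Longest token run that starts at two or more positions (may overlap)."""
    root = {"count": 0, "children": {}}
    # Insert every suffix; 'count' records how many suffixes pass a node.
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:]:
            node = node["children"].setdefault(tok, {"count": 0, "children": {}})
            node["count"] += 1
    # Walk only the paths shared by at least two suffixes.
    best = []
    stack = [(root, [])]
    while stack:
        node, path = stack.pop()
        for tok, child in node["children"].items():
            if child["count"] >= 2:
                new_path = path + [tok]
                if len(new_path) > len(best):
                    best = new_path
                stack.append((child, new_path))
    return best


print(longest_repeat(list("abcabcadec")))
```

A real suffix tree compresses chains of single-child nodes into one edge, which is what keeps its size linear in the input; the trie here grows quadratically and is only suitable for small examples.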
32
Example of transformation rules in Java
- All user-defined identifiers are transformed into the same token.
- A unique identifier is inserted at the end of each top-level definition and declaration.
  - Prevents detecting clones that begin in the middle of one class definition and end in the middle of another.
- "java.lang.Math.PI" is transformed to "Math.PI".
  - With an import statement, a class can be referred to by either its full package name or a shorter name.
- "new int[] {1, 2, 3}" is transformed to "new int[] {$}".
  - Eliminates table initialization code.
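Two of these rules can be approximated with regular expressions. This is a rough sketch rather than CCFinder's transformation: it works on raw text instead of a token stream, and it assumes the usual Java naming convention (lower-case package segments, upper-case class names) to decide what counts as a package qualifier. The function name `transform` is hypothetical.

```python
import re


def transform(source):
    """Apply two simplified transformation rules to Java source text."""
    # Rule: drop package qualifiers, e.g. java.lang.Math.PI -> Math.PI.
    # Assumption: package segments start lower-case, class names upper-case.
    source = re.sub(r"\b(?:[a-z_]\w*\.)+(?=[A-Z])", "", source)
    # Rule: collapse array initializers, e.g. new int[] {1, 2, 3} -> new int[] {$}.
    source = re.sub(r"(new\s+\w+\s*\[\s*\]\s*)\{[^{}]*\}", r"\1{$}", source)
    return source


print(transform("double r = java.lang.Math.PI;"))
print(transform("int[] t = new int[] {1, 2, 3};"))
print(transform('org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");'))
```

After the qualifier rule, the declaration of `pat` in `foo()` and the declaration of `exp` in `goo()` from the earlier example differ only in the variable name, which the identifier rule would then remove.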
33
The output of CCFinder

    #version: ccfinder 3.1
    #langspec: JAVA
    #option: -b 30,1
    #option: -k +
    #option: -r abcdfikmnprsv
    #option: -c wfg
    #begin{file description}
    0.0 52 C:\Gemini.java
    0.1 94 C:\GeneralManager.java
    :
    #end{file description}
    #begin{clone}
    0.1 53,9 63,13 1.10 542,9 553,13 35
    0.1 53,9 63,13 1.10 624,9 633,13 35
    0.2 124,9 152,31 0.2 154,9 216,51 42
    :
    #end{clone}

- Object file ID: "0.0" is file 0 in Group 0.
- Location of a clone pair: in the first clone line, lines 53 - 63 in file 0.1 and lines 542 - 553 in file 1.10 are identical or similar to each other.
- It is difficult to analyze source code using only this text-based information on the locations of clone pairs.
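Reading the clone section into records makes the pairs easier to work with. The parser below is a minimal sketch based only on the sample shown on the slide: the names `Fragment`, `ClonePair` and `parse_clone_pairs` are hypothetical, the number after each comma is ignored because the slide does not say what it means, and treating the last field as a length is an assumption.

```python
from typing import NamedTuple

SAMPLE = r"""#begin{file description}
0.0 52 C:\Gemini.java
0.1 94 C:\GeneralManager.java
:
#end{file description}
#begin{clone}
0.1 53,9 63,13 1.10 542,9 553,13 35
0.1 53,9 63,13 1.10 624,9 633,13 35
0.2 124,9 152,31 0.2 154,9 216,51 42
:
#end{clone}
"""


class Fragment(NamedTuple):
    file_id: str
    start_line: int
    end_line: int


class ClonePair(NamedTuple):
    left: Fragment
    right: Fragment
    length: int  # assumption: the slide does not name this last field


def _line_of(field):
    # "53,9" -> 53; the value after the comma is not interpreted here.
    return int(field.split(",")[0])


def parse_clone_pairs(text):
    pairs = []
    in_clone_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "#begin{clone}":
            in_clone_section = True
        elif line == "#end{clone}":
            in_clone_section = False
        elif in_clone_section:
            fields = line.split()
            if len(fields) != 7:  # skips the ":" elision marker
                continue
            left = Fragment(fields[0], _line_of(fields[1]), _line_of(fields[2]))
            right = Fragment(fields[3], _line_of(fields[4]), _line_of(fields[5]))
            pairs.append(ClonePair(left, right, int(fields[6])))
    return pairs


for pair in parse_clone_pairs(SAMPLE):
    print(pair.left, pair.right)
```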
34
Gapped code clone detection - Algorithm (1/5)
Step1: Non-gapped clone detection
- Detect non-gapped clones from the input source files.
  - Set the minimum length of a clone (threshold1).
- Sort the list of detected non-gapped clones so that gap locations can be identified efficiently in Step2.
  - A clone pair that appears earlier in the file also appears earlier in the sorted list.
  - When the detection result comes from comparing three or more files, the set of non-gapped clones can be divided into subsets, one for each combination of two files.

    Non-gapped   Pos. in file X    Pos. in file Y    Matched
    clone ID     (ABCDCDEFBCDG)    (ABCEFBCDEBCD)    subsequence
    c1           1 - 3             1 - 3             "ABC"
    c2           2 - 4             6 - 8             "BCD"
    c3           2 - 4             10 - 12           "BCD"
    c4           5 - 5             3 - 3             "C"
    c5           5 - 6             11 - 12           "CD"
    c6           5 - 7             7 - 9             "CDE"
    c7           7 - 11            4 - 8             "EFBCD"
    c8           9 - 10            2 - 3             "BC"
    c9           9 - 11            10 - 12           "BCD"
35
Gapped code clone detection - Algorithm (2/5)
Step2: Gap identification
- Generate gap locations from the sorted list of non-gapped clones.
  - A gap location is defined by a combination of two non-gapped clones.
    (c1, c6) = ((1-3, 1-3), (5-7, 7-9))
    g1 = (4-4, 4-6)
  - The length of a gap is the length of the longer unmatched subsequence.
  - Set the upper limit of the length of each gap (threshold2).
- Optimizations rely on two facts:
  - The non-gapped clones are stored in sorted order.
  - The number of gaps connected from each non-gapped clone can be considered bounded by a constant.
- The overall time complexity of Step2 is O(n) (n: number of non-gapped clones).

    Gap ID   Pos. in file X    Pos. in file Y    Length of the
             (ABCDCDEFBCDG)    (ABCEFBCDEBCD)    longer side
    g1       4 - 4             4 - 6             3
    g2       4 - 4             4 - 10            7
    g3       4 - 6             -                 3
    g4       4 - 8             4 - 9             6
    g5       -                 9 - 10            2
    g6       5 - 8             9                 4
    g7       8 - 8             -                 1
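Gap identification can be sketched as follows. This version compares every ordered pair of clones, so it is quadratic; it does not use the sorted-list optimization that gives the O(n) bound. The names `Clone` and `find_gaps` are hypothetical. Two assumptions are made so that the output lines up with the gap table: the single-token match c4 is left out, and c6 is taken as X positions 5-7, the span that "CDE" occupies in file X.

```python
from typing import NamedTuple


class Clone(NamedTuple):
    name: str
    x_start: int
    x_end: int
    y_start: int
    y_end: int


CLONES = [
    Clone("c1", 1, 3, 1, 3), Clone("c2", 2, 4, 6, 8), Clone("c3", 2, 4, 10, 12),
    Clone("c5", 5, 6, 11, 12), Clone("c6", 5, 7, 7, 9), Clone("c7", 7, 11, 4, 8),
    Clone("c8", 9, 10, 2, 3), Clone("c9", 9, 11, 10, 12),
]


def span_len(span):
    return 0 if span is None else span[1] - span[0] + 1


def find_gaps(clones, max_gap):
    """Return (first, second, x_gap, y_gap, length) for each linkable pair."""
    gaps = []
    for a in clones:
        for b in clones:
            # b must start strictly after a ends in both files.
            if b.x_start <= a.x_end or b.y_start <= a.y_end:
                continue
            x_gap = (a.x_end + 1, b.x_start - 1) if b.x_start - a.x_end > 1 else None
            y_gap = (a.y_end + 1, b.y_start - 1) if b.y_start - a.y_end > 1 else None
            length = max(span_len(x_gap), span_len(y_gap))
            if 0 < length <= max_gap:  # threshold2
                gaps.append((a.name, b.name, x_gap, y_gap, length))
    return gaps


for gap in find_gaps(CLONES, max_gap=10):
    print(gap)
```

Lowering `max_gap` to 3 keeps only the four shortest gaps, which shows how threshold2 limits the number of combinations that have to be considered.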
36
Gapped code clone detection - Algorithm (3/5)
Step3-1: Visualization - gap-and-clone scatter plot
- Draw the gaps on the scatter plot of non-gapped clones to visualize gapped clones in a pseudo way.
[Figure: gap-and-clone scatter plot of file X (ABCDCDEFBCDG) against file Y (ABCEFBCDEBCD), positions 1 to 12, with clone labels c1, c2, c3, c5, c6, c7, c8, c9 and gap labels g1, g3, g5, g7]

    Gapped     Path              Subsequence in file X   Subsequence in file Y
    clone ID                     (ABCDCDEFBCDG)          (ABCEFBCDEBCD)
    gc1        c1 g1 c5 g7 c7    "ABC-CDE--CD"           "ABC---CDEBCD"
    gc2        c1 g3 c6          "ABC---EFBCD"           "ABCEFBCD"
    gc3        c2 g5 c4          "BCDCD"
37
Gapped code clone detection - Algorithm (4/5)
Step3-2: Visualization - filtering
- Remove the non-gapped clones and gaps that do not contribute to a long gapped clone.
- Introduce the length ("eSize") of each entanglement of non-gapped clones and gaps:
    eSize = max(eSizeX, eSizeY)
    eSizeX = eEndX - eStartX
    eSizeY = eEndY - eStartY
- "eSize" is the maximum length of a gapped clone included in the entanglement.
- Set the minimum "eSize" for display (threshold3).
[Figure: gap-and-clone scatter plot of file X against file Y, positions 1 to 12, with the labels g1, g3, g7, c1, c7, c6, c9 and g5, c2, c3, c5, c8]
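The filtering step can be sketched by grouping clones that are linked by gaps and measuring each group. Several things here are assumptions of this example rather than statements from the slides: an entanglement is modelled as a connected component, eStart and eEnd are taken as the bounding positions of the clones in the component, and the clone positions and links are illustrative data based on the sample input (c6 as X 5-7, gaps of length 3 or less).

```python
# name -> (x_start, x_end, y_start, y_end)
CLONES = {
    "c1": (1, 3, 1, 3), "c2": (2, 4, 6, 8), "c3": (2, 4, 10, 12),
    "c5": (5, 6, 11, 12), "c6": (5, 7, 7, 9), "c7": (7, 11, 4, 8),
    "c8": (9, 10, 2, 3), "c9": (9, 11, 10, 12),
}
# Pairs of clones joined by a gap (g1, g3, g5, g7 of the running example).
LINKS = [("c1", "c6"), ("c1", "c7"), ("c2", "c5"), ("c6", "c9")]


def entanglements(clones, links):
    """Group clones joined by gaps into connected components (union-find)."""
    parent = {name: name for name in clones}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for name in clones:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


def e_size(group, clones):
    """eSize = max(eEndX - eStartX, eEndY - eStartY) over the group's extent."""
    x_starts, x_ends, y_starts, y_ends = zip(*(clones[n] for n in group))
    e_size_x = max(x_ends) - min(x_starts)
    e_size_y = max(y_ends) - min(y_starts)
    return max(e_size_x, e_size_y)


THRESHOLD3 = 7
groups = entanglements(CLONES, LINKS)
kept = [g for g in groups if e_size(g, CLONES) >= THRESHOLD3]
print(kept)
```

With these inputs only the component containing c1, c6, c7 and c9 survives; the pair c2 and c5 and the isolated clones c3 and c8 fall below the threshold.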
38
Gapped code clone detection - Algorithm (5/5)
Step4: Source code investigation
- Investigate the source files with the gap-and-clone scatter plot.
- Change the parameters:
  - Threshold1: minimum size of non-gapped clones in non-gapped clone detection
  - Threshold2: maximum size of gaps in the identification of gap locations
  - Threshold3: minimum size of an entanglement of non-gapped clones and gaps in the gap-and-clone scatter plot
- Threshold1 and threshold2 greatly affect computation time.
  - A small threshold1 causes O(m^2) non-gapped clone pairs to be detected from source code of size m.
  - A large threshold2 causes O(n^2) gaps to be detected from n clone pairs.
39
Analysis - Usefulness of gap-and-clone scatter plot
- Compared the scatter plots of non-gapped clones with the gap-and-clone scatter plot.
- Three programs (Ex.1: 2267 tokens, Ex.2: 4394 tokens and Ex.3: 5738 tokens) of a student S are arranged on both the vertical and horizontal axes.
- The grid represents the boundary lines between sub-exercises.
[Figure: scatter plots of non-gapped clones with Threshold1 = 10 and with Threshold1 = 30, and a gap-and-clone scatter plot with Threshold1 = 10, Threshold2 = 10, Threshold3 = 30; both axes show Ex.1, Ex.2, Ex.3. Histograms plot the frequency of non-gapped clones (0 to 1500) against length in tokens (0 to 50). Annotation: short clones show up as long gapped clones.]
40
The analysis of comparison among students (non-gapped clones only)
[Figure: scatter plot with regions A and B marked, and the corresponding code]
- A (2 students): similar code fragments came from the source code of a sample compiler described in the textbook.
- B (4 students): many code fragments were similar, even in the names of variables and in comments.