Refactoring Support Based on Code Clone Analysis

Presentation transcript:

Refactoring Support Based on Code Clone Analysis
Yoshiki Higo*, Toshihiro Kamiya**, Shinji Kusumoto*, Katsuro Inoue*
*Graduate School of Information Science and Technology, Osaka University
**PRESTO, Japan Science and Technology Corporation
2004/4/7 PROFES2004
Thank you, chair. I am Yoshiki Higo from Osaka University, Japan. The title of my talk is "Refactoring Support Based on Code Clone Analysis".

Contents
  Background: Code Clone
  Objective
  Code Clone Analysis Tools: CCFinder & Gemini
  Extraction of Structural Clones for Refactoring
  Case Study
  Summary and Future Work
These are the contents of my presentation. First, I talk about the background of our research. Second, I explain the research objective and our code clone analysis tools, CCFinder and Gemini. Next, I explain the extraction of structural clones for refactoring and the case study. Finally, I give a summary and talk about future work.

Background: Code Clone
A code clone is a code fragment in source files that is identical or similar to another.
[Figure: two source files containing three gray code fragments, labeled Clone Pair and Clone Class]
Code clones are one of the factors that make software maintenance more difficult.
If a fault is found in a code clone, every other clone of that fragment must be examined to decide whether it needs the same modification.
First, I explain the background of our research. A code clone is a code fragment in source files that is identical or similar to another. For example, these two figures represent source files, and these three gray parts are code clones. We call each pair of them a clone pair, and all of them together are called a clone class. It is generally said that code clones are one of the factors that make software maintenance more difficult. For example, if a fault is found in a code clone, every other clone of that fragment must be examined to decide whether it needs the same modification. As shown in this figure, when there are only three code clones, it is easy to correct them. But if a very large number of code clones exist in a huge software system, detecting and correcting them becomes a very serious problem.

Code Clone Analysis Tools: CCFinder & Gemini
We have been developing code clone analysis tools:
  Code clone detection tool, CCFinder [1]
  GUI-based clone analysis environment, Gemini [2]
We have delivered these tools to software companies and evaluated their usefulness through several case studies.
So, we have been developing code clone analysis tools. One is a code clone detection tool, CCFinder. The other is a GUI-based clone analysis environment, Gemini. We have delivered these tools to software companies and evaluated their usefulness through several case studies.
[1] T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: A multi-linguistic token-based code clone detection system for large scale source code", IEEE Transactions on Software Engineering, 28(7):654-670, 2002.
[2] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue, "Gemini: Maintenance Support Environment Based on Code Clone Analysis", Proc. of the 8th IEEE International Symposium on Software Metrics, 67-76, 2002.

Outline of CCFinder
CCFinder compares source code directly, token by token, and detects code clones. It uses the following heuristics:
  Normalization of name space
  Replacement of names defined by the user
  Removal of table initialization
  Consideration of module delimiters
CCFinder can analyze systems on the scale of millions of lines in a practical amount of time.
Next, I explain CCFinder. CCFinder compares source code directly, token by token, and detects code clones. In finding clones, CCFinder uses these heuristics to detect more significant code clones. Also, CCFinder can analyze systems on the scale of millions of lines in a practical amount of time.

CCFinder: Clone Detection Process
Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Clone pairs
Example source code:
    static void foo() throws RESyntaxException {
        String a[] = new String [] { "123,400", "abc", "orange 100" };
        org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (pat.match(a[i]))
                sum += Sample.parseNumber(pat.getParen(0));
        System.out.println("sum = " + sum);
    }
    static void goo(String [] a) throws RESyntaxException {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
        System.out.println("sum = " + sum);
    }
Next, I explain the clone detection process. In this explanation, I use this source code as an example. Lexical analysis is done first, and the source code is divided into tokens like this. The next process is transformation. In this process, identifiers are replaced, and this token sequence is generated. The next process is match detection. CCFinder detects clone pairs from the token sequence generated in the previous step. Finally, formatting is performed to map the clone pairs onto the actual source code. By performing this process, CCFinder outputs the location information of the code clones.
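The transformation and match detection steps can be illustrated with a minimal sketch. It assumes a drastically simplified regex tokenizer, a tiny hypothetical keyword set, and a naive quadratic comparison in place of CCFinder's actual matching; the class and method names (`TokenCloneSketch`, `normalize`, `longestCommonRun`) are illustrative and do not come from CCFinder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenCloneSketch {
    // Hypothetical, tiny keyword set; a real lexer knows the full language.
    private static final Set<String> KEYWORDS =
            Set.of("int", "for", "if", "new", "static", "void");
    // Identifiers, integer literals, two operators, then any single symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\d+|\\+\\+|\\+=|\\S");

    // Lexical analysis plus transformation: every user-defined name becomes "$".
    static List<String> normalize(String code) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String t = m.group();
            boolean ident = Character.isJavaIdentifierStart(t.charAt(0));
            out.add(ident && !KEYWORDS.contains(t) ? "$" : t);
        }
        return out;
    }

    // Naive match detection: length of the longest common run of tokens.
    static int longestCommonRun(List<String> a, List<String> b) {
        int best = 0;
        int[][] run = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    run[i][j] = run[i - 1][j - 1] + 1;
                    best = Math.max(best, run[i][j]);
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> x = normalize("if (pat.match(a[i])) sum += f(pat.getParen(0));");
        List<String> y = normalize("if (exp.match(a[i])) sum += g(exp.getParen(0));");
        System.out.println(x);
        System.out.println("identical after transformation: " + x.equals(y));
        System.out.println("longest common run: " + longestCommonRun(x, y));
    }
}
```

Because both statements reduce to the same token sequence once names are replaced, they match even though the variable and method names differ. The sketch does not handle qualified names such as `Sample.parseNumber`, which the slides attribute to name space normalization.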

Outline of Gemini
Gemini is a GUI-based clone analysis environment.
Gemini uses CCFinder as its clone detection unit.
Gemini has three main interfaces:
  Scatter plot
    The user can select clones by dragging the mouse.
    The scatter plot has a sort function, a zoom function, and so on.
  Metric graph
    The metric graph shows several metrics of each clone class.
    The user can select clones by specifying a range for each metric value.
  Source code view
    The user can browse the source code of clones selected in the other views.
Gemini is implemented in Java.
Next, I explain Gemini. Gemini is a GUI-based clone analysis environment. Gemini uses CCFinder as its clone detection unit, and it has three main interfaces. The first interface is the scatter plot. On the scatter plot, the user can select clones by dragging the mouse. The scatter plot also has a sort function, a zoom function, and so on. The second interface is the metric graph. The metric graph shows several metrics of each clone class, and the user can select clones by specifying a range for each metric value. The last interface is the source code view. In the source code view, the user can browse the source code of clones selected in the other views.

Gemini: Architecture
[Architecture diagram: Source files, Code clone detector (CCFinder), Code clone database, Clone pair manager, Source code manager, Metrics manager, User Interfaces (Scatter plot view, Metric graph views, Source code view), Clone selection information, User]
[Scatter plot model: the token sequence "a b c a b c a d e c" on both axes; a, b, c, ... : tokens; dots: matched positions]
This is the architecture of Gemini. First, the source files are input to the clone detection unit, CCFinder. The output of CCFinder is stored in the code clone database, and each interface retrieves the code clone information it needs. This is a snapshot of the scatter plot. The scatter plot shows visually where code clones exist. In this view, the user can select code clones by dragging the mouse. Now, I use this model to explain the scatter plot. The origin of the scatter plot is the upper left corner. The token sequence of the source code is arranged along both the horizontal and the vertical axis, in the same order, starting from the origin. Each cell of this matrix is checked to see whether its horizontal and vertical tokens are identical. For example, this dot means that this token "a" equals this token "a". There is a main diagonal line like this, because along it each token is compared with itself. A clone pair is shown as a diagonal line segment. In this case, if we require at least four tokens for a code clone, this line represents a code clone. Naturally, the distribution is symmetrical about the main diagonal. Using this plot as the user interface, the user can easily identify the location of clone pairs.
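The scatter plot model can be reproduced with a short sketch that builds the dot matrix for the token sequence on the slide and reports diagonal line segments of at least four tokens. This is only an illustration of the idea, not Gemini's implementation; `ScatterPlotSketch` and `diagonalRuns` are hypothetical names, and only the half above the main diagonal is scanned because the plot is symmetrical.

```java
import java.util.ArrayList;
import java.util.List;

public class ScatterPlotSketch {
    // Returns maximal diagonal runs above the main diagonal as {row, col, length}.
    static List<int[]> diagonalRuns(String[] tokens, int minLength) {
        List<int[]> runs = new ArrayList<>();
        int n = tokens.length;
        for (int row = 0; row < n; row++) {
            for (int col = row + 1; col < n; col++) {
                // Skip cells that merely continue a run started further up-left.
                boolean continues = row > 0
                        && tokens[row - 1].equals(tokens[col - 1]);
                if (continues || !tokens[row].equals(tokens[col])) {
                    continue;
                }
                int len = 0;
                while (col + len < n
                        && tokens[row + len].equals(tokens[col + len])) {
                    len++;
                }
                if (len >= minLength) {
                    runs.add(new int[] {row, col, len});
                }
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        String[] tokens = "a b c a b c a d e c".split(" ");
        // Print the dot matrix: '*' where the row token equals the column token.
        for (String rowToken : tokens) {
            StringBuilder line = new StringBuilder();
            for (String colToken : tokens) {
                line.append(rowToken.equals(colToken) ? '*' : '.');
            }
            System.out.println(line);
        }
        for (int[] run : diagonalRuns(tokens, 4)) {
            System.out.println("clone pair: row " + run[0]
                    + ", column " + run[1] + ", length " + run[2]);
        }
    }
}
```

For this sequence the only segment of four or more tokens starts at row 0 and column 3 (zero-based): the tokens "a b c a" at positions 0 to 3 match those at positions 3 to 6.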

Gemini: Architecture
DFL(C): an estimate of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine.
[Figure: code fragments replaced by caller statements of a new subroutine]
This is a snapshot of the metric graph. These five bars each represent a metric. In this view, the user can select code clones by specifying a range for each metric value. This is an example of a metric, called DFL. It is an estimate of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine, as shown in this figure.
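The talk does not give the formula behind DFL, so the following is only a sketch of one plausible way to compute such an estimate. It assumes every fragment has the same length, each fragment is replaced by a call of fixed length, and the new routine costs its body plus a fixed declaration overhead. The parameter names and the sample values are hypothetical.

```java
public class DflSketch {
    // Tokens saved when every fragment of a clone class becomes a call
    // to one new routine (assumed cost model, not Gemini's definition).
    static int estimateDfl(int fragments, int fragmentTokens,
                           int callTokens, int declarationTokens) {
        int before = fragments * fragmentTokens;
        int after = fragments * callTokens       // one call per former fragment
                + fragmentTokens                 // body of the new routine
                + declarationTokens;             // signature and braces
        return before - after;
    }

    public static void main(String[] args) {
        // Six fragments of 40 tokens, 5-token calls, 8-token declaration.
        System.out.println(estimateDfl(6, 40, 5, 8)); // prints 162
    }
}
```

Under this model a clone class with a single fragment gives a negative value, which matches the intuition that extracting code used only once does not shrink the source.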

Gemini: Architecture
This is a snapshot of the source code view. In this view, the user can browse the source code of the code clones selected in the other views. The highlighted parts are code clones.

Application of CCFinder & Gemini
Open source software
  JDK libraries (Java, 570 KLOC)
  Linux, FreeBSD (C, 1.6 + 1.3 MLOC)
  FreeBSD, OpenBSD, NetBSD (C)
  Qt (C++, 240 KLOC)
Commercial software (more than 50 companies)
  NTT Data Corp., Hitachi Ltd., Hitachi GP, NEC soft Ltd., ASTEC Inc., SRA Inc., NASDA, Daiwa Computer, etc.
Student exercises at Osaka University
Filed in court as evidence in a software copyright suit
We have applied CCFinder and Gemini to this software. In open source software, for example, we investigated the JDK libraries and compared the differences between Linux and FreeBSD. For commercial software, we have delivered these tools to NTT Data, Hitachi, and so on. CCFinder and Gemini were delivered to more than 50 companies. We have also applied these tools to student exercises at Osaka University and investigated the degree of similarity among student programs. In addition, the output of CCFinder has been filed in court as evidence in a software copyright suit.

Feedback
As a practical application of CCFinder, users want to use code clone analysis in the refactoring process.
But the code clones detected by CCFinder are sequences of tokens, so they are not appropriate for direct replacement by one module (subroutine, function, and so on).
We then got some feedback from the companies. One request was to use the code clones detected by CCFinder in the refactoring process. But currently, the code clones detected by CCFinder are sequences of tokens, so they are not appropriate for direct replacement by one module (subroutine, function, and so on). I will show you examples later.

Objective
We propose a method that extracts, from the code clones detected by CCFinder, those that are well suited to the refactoring process ([Extract Method], [Pull Up Method])*.
We apply the proposed method to open source software and evaluate its applicability.
So, as the purpose of this research, we propose a method that extracts, from the code clones detected by CCFinder, those that are well suited to the refactoring process, especially Extract Method and Pull Up Method. We then apply the proposed method to open source software and evaluate its applicability.
*M. Fowler: Refactoring: Improving the Design of Existing Code, Addison-Wesley, 1999.

Extract Method
Before:
    void methodA(int i){
        methodZ();
        System.out.println("name:" + name);
        System.out.println("amount:" + i);
    }
    void methodB(int i){
        methodY();
        System.out.println("name:" + name);
        System.out.println("amount:" + i);
    }
After:
    void methodA(int i){
        methodZ();
        methodC(i);
    }
    void methodB(int i){
        methodY();
        methodC(i);
    }
    void methodC(int i){
        System.out.println("name:" + name);
        System.out.println("amount:" + i);
    }
Next, I briefly explain the two refactoring techniques, Extract Method and Pull Up Method. First, I explain Extract Method. For example, this source code defines two methods, A and B. Each method includes these two statements. By extracting them as one method, the source code is modified like this. In this example, the two System.out.println statements are redefined as a new method, C.

Pull Up Method
[Figure: before, method A is defined in both class B and class C, which are child classes of class A; after, method A is defined in class A]
Second, I explain Pull Up Method. This means that identical methods defined in several child classes are pulled up to their common parent class. In this example, method A, defined in both class B and class C, is pulled up to class A.
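A minimal Java sketch of the result of Pull Up Method, using the class names from the slide. The method body is a placeholder assumed for illustration, and the wrapper class `PullUpSketch` exists only to keep the example in one file.

```java
public class PullUpSketch {
    // After refactoring: methodA is defined once, in the common parent.
    static class A {
        String methodA() {
            return "shared behavior";
        }
    }

    // Before refactoring, B and C each carried an identical copy of methodA.
    static class B extends A { }

    static class C extends A { }

    public static void main(String[] args) {
        // Both subclasses now inherit the single definition.
        System.out.println(new B().methodA());
        System.out.println(new C().methodA());
    }
}
```

The refactoring only applies when the copies in the child classes really are identical, which is why exact clones that span a whole method body are natural candidates.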

Outline of proposed method
Structural code clones are extracted from the output of CCFinder.
Because the result uses the same format as CCFinder's output, the extracted code clones can be examined with Gemini.
[Figure: Source files → CCFinder → Clone data → Filter → Structural clone data → Gemini]
Next, I explain the outline of the proposed method. In the proposed method, structural code clones are extracted from the output of CCFinder. Because the result uses the same format as CCFinder's output, we can examine the extracted code clones with Gemini. The proposed method is implemented as a filter between CCFinder and Gemini. Conventionally, as shown in this figure, source files were analyzed by CCFinder, the analysis result was passed directly to Gemini, and the user analyzed the code clones. In the proposed method, we introduce a filter between CCFinder and Gemini and obtain the location information of the structural code clones.

Synopsis of structural clones (for Java)
  Declaration: class { … }, interface { … }
  Method: method, constructor, static initializer
  Statement: if, for, while, do, switch, try, and synchronized statements
  Block: range surrounded by `{` and `}`
This is the synopsis of structural clones. As declarations, class declarations and interface declarations are extracted. As methods, method bodies, constructors, and static initializers are extracted. As statements, if statements, for statements, and so on are extracted. As blocks, ranges surrounded by braces are extracted.

Implementation of proposed method
CCShaper (Code Clone Shaper)
  Target programs: Java
  CCShaper extracts structural clones from the output of CCFinder
  Implementation language: Java
  Source size: about 12,000 LOC
  The syntax analysis unit is built with JavaCC
We implemented the filter as an actual tool named CCShaper (Code Clone Shaper). Currently, the only target programming language is Java. Intuitively speaking, CCShaper extracts structural clones from the output of CCFinder.

Processes executed by CCShaper
Inputs: Source files, Output of CCFinder
  Syntax analysis unit: performs syntax analysis on the source code that includes code clones
  Clone extraction unit: extracts structural code clones from the result of the syntax analysis and the output of CCFinder
  Clone management unit: sorts and merges the code clones detected by the clone extraction unit
Output
Next, I explain CCShaper. The inputs of CCShaper are the output of CCFinder and the source files. The filter consists of three units. The first is the syntax analysis unit, which performs syntax analysis on the source code that includes code clones. The second is the clone extraction unit, which extracts structural code clones from the result of the syntax analysis and the output of CCFinder. The third is the clone management unit, which sorts and merges the code clones detected by the clone extraction unit.

[Figure: a Clone Pair consisting of Code Fragment 1 and Code Fragment 2, each of the following form]
    ・・・・・
    methodA(){
       ・・・・・
       if( ・・・ ){
       }
    }
    ・・・・・
    methodB(){
       ・・・・・
       if( ・・・ ){
       }
    }
This is an example of the extraction process. CCFinder finds these two fragments as a clone pair. Each fragment includes a method declaration and an if statement. In this case, CCShaper extracts only the method as a code clone, because the method clone still covers that code even when the nested if-statement clone is discarded.
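The extraction step can be sketched as a containment filter: keep the structural blocks that lie entirely inside a clone fragment, then drop any block nested inside another kept block. This is an illustration under simplifying assumptions (positions are line numbers, blocks are already known from syntax analysis), not CCShaper's actual algorithm. `StructuralFilterSketch`, `Block`, and the sample positions are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class StructuralFilterSketch {
    // A structural unit found by syntax analysis, with inclusive line range.
    static final class Block {
        final String kind;
        final int start;
        final int end;

        Block(String kind, int start, int end) {
            this.kind = kind;
            this.start = start;
            this.end = end;
        }

        boolean within(int s, int e) {
            return s <= start && end <= e;
        }
    }

    static List<Block> extract(List<Block> blocks, int cloneStart, int cloneEnd) {
        // Step 1: keep only blocks fully covered by the clone fragment.
        List<Block> inside = new ArrayList<>();
        for (Block b : blocks) {
            if (b.within(cloneStart, cloneEnd)) {
                inside.add(b);
            }
        }
        // Step 2: drop blocks nested in a strictly larger kept block.
        List<Block> outermost = new ArrayList<>();
        for (Block b : inside) {
            boolean nested = false;
            for (Block other : inside) {
                boolean larger = (other.end - other.start) > (b.end - b.start);
                if (larger && b.within(other.start, other.end)) {
                    nested = true;
                }
            }
            if (!nested) {
                outermost.add(b);
            }
        }
        return outermost;
    }

    public static void main(String[] args) {
        List<Block> blocks = List.of(
                new Block("method", 10, 20),   // methodA, fully inside the clone
                new Block("if", 14, 18),       // if statement nested in methodA
                new Block("method", 25, 40));  // methodB, only partly inside
        for (Block b : extract(blocks, 8, 30)) {
            System.out.println(b.kind + " " + b.start + "-" + b.end);
        }
    }
}
```

With a clone fragment spanning lines 8 to 30, only the method at lines 10 to 20 survives: the if statement is nested inside it, and the second method extends past the end of the fragment.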

Example 1 (Code clones including statements that are needless for refactoring)
            righttokennumber = c.getEndNumber() - c.getStartNumber() + 1;
        }
        string getLeftClone() const
        {
            char temp[STRLENGTH];
            snprintf(temp,STRLENGTH, "%s\t%d,%d,%d\t%d,%d,%d\t",leftID.c_str(),
                     leftstartline,leftstartcolumn,leftstartnumber,
                     leftendline,leftendcolumn,leftendnumber);
            string clone(temp);
            return clone;
        }
        string getRightClone() const
        {
            char temp[STRLENGTH];
            snprintf(temp,STRLENGTH, "%s\t%d,%d,%d\t%d,%d,%d\t",rightID.c_str(),
                     rightstartline,rightstartcolumn,rightstartnumber,
                     rightendline,rightendcolumn,rightendnumber);
            string clone(temp);
            return clone;
        }
        int getLeftTokenNumber() const
        {
            return lefttokennumber;
Now I explain the behavior of CCShaper using actual source code. For this source code, CCFinder reports that this highlighted part is a code clone of this other highlighted part.

Example 1 (Code clones including statements that are needless for refactoring)
Only the highlighted parts should be detected.
As a result, this clone pair is extracted. The left code clone includes all of the method "getLeftClone" and the first part of the next method. The right code clone includes all of the method "getRightClone" and the first part of the next method. In this case, CCShaper extracts only these blue and red parts, because these parts are method bodies.

Example 2 (Code clones not suited to refactoring)
CCFinder extracts the highlighted parts as code clones, but these are not suited to refactoring.
This is the second example. For this source code, CCFinder extracts these fragments as a clone pair. But, as you can see, these code clones do not include structural blocks, and they are not suited to refactoring. So, in this case, CCShaper does not extract any fragment as a code clone.

Case study: Overview
In this case study, we refactored ANTLR, which is an open source Java program.
We investigated the following code clones:
  Exact clones: code clones that are identical verbatim
  Renamed clones: code clones that differ only in names defined by the programmer
As a result of refactoring with CCShaper, we removed 2 clone classes.
Through a regression test, we confirmed the behavior of ANTLR.
Next, I explain the case study that we conducted. In this case study, we refactored ANTLR, which is an open source Java program. We investigated two kinds of code clones. One is exact clones, which are identical verbatim. The other is renamed clones, which differ only in names defined by the programmer. Exact clones are easy to refactor. Renamed clones, on the other hand, must be modified carefully in order to merge them. As a result of refactoring with CCShaper, we removed 2 clone classes. After refactoring, we confirmed the behavior of ANTLR through a regression test.

Refactoring process
  Step1: Exact Clone removal
  Step2: Renamed Clone removal
  Step3: Regression Test
This case study consists of three steps. The first step is exact clone removal. The second step is renamed clone removal. The final step is the regression test.

Step1: Exact Clone removal
In this analysis, CCFinder found exact clones that have more than 30 tokens.
[Figure: metric graphs without CCShaper (left) and with CCShaper (right); the selected clone class is labeled "a"]
First, CCFinder found exact clones that have more than 30 tokens. The left metric graph is without CCShaper, and the right one is with CCShaper. In the right metric graph, this polygonal line stands out, so we selected it. It is labeled "a".

Step1: Exact Clone removal
Source code of the selected code clone (a):
            }
            _cnt3++;
        } while (true);
        if ( _createToken && _token==null && _ttype!=Token.SKIP ) {
            _token = makeToken(_ttype);
            _token.setText(new String(text.getBuffer(), _begin, text.length()-_begin));
        }
        _returnToken = _token;
These were "if-statement" code clones.
These clones appeared in six classes.
These six classes inherited from the same class.
These statements could be merged into one method in the common parent class.
This is one of the selected code clones. These were "if-statement" code clones, and they appeared in six classes. Furthermore, these six classes inherited from the same class. So, these statements could be merged into one method in the common parent class.

Step1: Exact Clone removal
Before refactoring:
            }
            _cnt3++;
        } while (true);
        if ( _createToken && _token==null && _ttype!=Token.SKIP ) {
            _token = makeToken(_ttype);
            _token.setText(new String(text.getBuffer(), _begin, text.length()-_begin));
        }
        _returnToken = _token;
After refactoring:
            }
            _cnt3++;
        } while (true);
        _token = _CLASS0(_createToken,_token,_ttype,_begin);
        _returnToken = _token;
New method:
    protected Token _CLASS0(boolean _CLASS0_first, Token _CLASS0_second, int _CLASS0_third, int _CLASS0_forth){
        if ( _CLASS0_first && _CLASS0_second==null && _CLASS0_third!=Token.SKIP ){
            _CLASS0_second = makeToken(_CLASS0_third);
            _CLASS0_second.setText(new String(text.getBuffer(), _CLASS0_forth, text.length()-_CLASS0_forth));
        }
        return _CLASS0_second;
    }
This is the code before refactoring. The highlighted part is the "if-statement" code clone. And this is the code after refactoring. The code clone is replaced with a call to the new method. This is the source code of the new method, which was added to the common parent class.

Step2: Renamed Clone removal
In this analysis, CCFinder found renamed clones that have more than 50 tokens.
[Figure: scatter plots without CCShaper (left) and with CCShaper (right); the selected portion is labeled "b"]
Second, CCFinder found renamed clones that have more than 50 tokens. The left scatter plot is without CCShaper, and the right one is with CCShaper. In the left scatter plot, a very large number of code clones were detected, and we cannot tell which portion deserves attention. In the right scatter plot, on the other hand, we can see that the number of code clones has decreased considerably. We selected the portion labeled "b" and examined it.

Step2: Renamed Clone removal
This code clone appeared in 20 places in ANTLR.
Source code of the selected code clone (b):
    public final void mOPEN_ELEMENT_OPTION(boolean _createToken)
            throws RecognitionException, CharStreamException, TokenStreamException {
        int _ttype;
        Token _token=null;
        int _begin=text.length();
        _ttype = OPEN_ELEMENT_OPTION;
        int _saveIndex;
        match('<');
        if ( _createToken && _token==null && _ttype!=Token.SKIP ) {
            _token = makeToken(_ttype);
            _token.setText(new String(text.getBuffer(), _begin, text.length()-_begin));
        }
        _returnToken = _token;
    }
Only the gray portions were different from the other clones.
All clones were methods in the same class.
These methods could be merged into one method by adding 2 arguments.
This is one of the selected code clones. This code clone appeared in 20 places in ANTLR. Only these two gray portions were different from the other clones belonging to the same clone class. Furthermore, all the code clones were methods in the same class. So, these methods could be merged into one method by adding 2 arguments.
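The merge by adding two arguments can be sketched on a toy lexer. The transcript does not say which two portions differed, so this sketch assumes they were the token type constant and the matched character. Everything here is a hypothetical stand-in written for illustration (`MergedRuleSketch`, `matchSingleCharToken`, the constant values, and the second rule `mCLOSE_ELEMENT_OPTION`); it is not ANTLR's code and not the refactoring the authors applied.

```java
public class MergedRuleSketch {
    // Hypothetical token type values, chosen only for this example.
    static final int OPEN_ELEMENT_OPTION = 1;
    static final int CLOSE_ELEMENT_OPTION = 2;

    private final String input;
    private int pos = 0;
    int lastType = 0;
    String lastText = null;

    MergedRuleSketch(String input) {
        this.input = input;
    }

    // Consumes one expected character or fails.
    private void match(char expected) {
        if (pos >= input.length() || input.charAt(pos) != expected) {
            throw new IllegalStateException(
                    "expected '" + expected + "' at position " + pos);
        }
        pos++;
    }

    // Merged body: the two parts that differed between the clones
    // are now passed in as arguments.
    private void matchSingleCharToken(int tokenType, char expected) {
        int begin = pos;
        match(expected);
        lastType = tokenType;
        lastText = input.substring(begin, pos);
    }

    // The formerly duplicated rule methods shrink to one-line delegations.
    void mOPEN_ELEMENT_OPTION() {
        matchSingleCharToken(OPEN_ELEMENT_OPTION, '<');
    }

    void mCLOSE_ELEMENT_OPTION() {
        matchSingleCharToken(CLOSE_ELEMENT_OPTION, '>');
    }

    public static void main(String[] args) {
        MergedRuleSketch lexer = new MergedRuleSketch("<>");
        lexer.mOPEN_ELEMENT_OPTION();
        System.out.println(lexer.lastType + " " + lexer.lastText);
        lexer.mCLOSE_ELEMENT_OPTION();
        System.out.println(lexer.lastType + " " + lexer.lastText);
    }
}
```

Keeping the original method names as thin wrappers means existing callers do not change, which makes a before-and-after output comparison a reasonable check for this kind of merge.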

Step3: Regression Test
After the refactoring process, we performed a regression test.
We used all 84 sample files included in ANTLR.
  These sample files are used as input.
  The original ANTLR and the modified one each output source code.
We compared the results of the original ANTLR and the modified one.
[Figure: Input (grammar definition file) → ANTLR (original) and ANTLR (modified) → Output (source code of syntax analysis) from each → compare]
After the refactoring process, we performed a regression test. We used all 84 sample files included in ANTLR. These sample files are used as input, and the original ANTLR and the modified one each output source code. Finally, we compared the results of the original ANTLR and the modified one, and confirmed that there is no difference between the two.

Summary
Now I conclude my presentation. We have developed a filtering tool, CCShaper, that extracts code clones that are well suited to refactoring. We have also evaluated the applicability of CCShaper by applying it to an actual Java program.

Future work
As future work, we are going to develop a more practical filtering method by considering the context of clones (for example, the variables referred to in clones and the spatial relationships between clones). We will also extend CCShaper to other programming languages and apply it to the refactoring of commercial software products. Thank you.

The difference between ‘diff’ and clone detection tools
Diff finds the longest common subsequence of its two inputs. Given a code portion, diff does not report two or more occurrences of the same code portion (clones). A clone detection tool finds all identical or similar code portions.

Suffix-tree
A suffix tree is a tree that satisfies three conditions. Condition 1: a leaf node represents the starting position of a sub-string. Condition 2: a path from the root node to a leaf node represents a sub-string. Condition 3: the first characters of the labels of all the edges leaving one node are different from each other. Therefore, a common path means a clone pair.
For example, the figure shows a suffix tree for the string “xxyxyz”. One path from the root node to a leaf node represents the string “xyxyz”, and the sub-string starting at position 2 is also “xyxyz”, so that leaf has the number 2. Another path from the root node to a leaf node represents the string “xyz”, and the sub-string starting at position 4 is also “xyz”. The common path between them is “xy”, so the sub-string “xy” is detected as a code clone. The same “xy” also appears in the scatter plot.
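The “xxyxyz” example can be checked with a short sketch. Note the substitution: instead of building a suffix tree, it sorts the suffixes and compares neighbours, which finds the same longest shared prefix (the “common path”) but is much slower on large inputs. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class RepeatedSubstringSketch {
    // Returns the longest sub-string that starts at two different positions.
    // Sorted suffixes stand in for the suffix tree: two suffixes sharing a
    // path from the root end up adjacent and share a common prefix.
    // The two occurrences are allowed to overlap.
    static String longestRepeated(String s) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            starts.add(i);
        }
        starts.sort((a, b) -> s.substring(a).compareTo(s.substring(b)));

        String best = "";
        for (int i = 1; i < starts.size(); i++) {
            int a = starts.get(i - 1);
            int b = starts.get(i);
            int k = 0;
            while (a + k < s.length() && b + k < s.length()
                    && s.charAt(a + k) == s.charAt(b + k)) {
                k++;
            }
            if (k > best.length()) {
                best = s.substring(a, a + k);
            }
        }
        return best;
    }
}
```

For “xxyxyz”, the suffixes “xyxyz” and “xyz” become neighbours after sorting and share the prefix “xy”, matching the clone found in the slide.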

Example of transformation rules in Java
All identifiers defined by the user are transformed to the same token. This transformation absorbs differences in the names of variables.
A unique identifier is inserted at each end of the top-level definitions and declarations. This rule prevents detecting clones that begin in the middle of one class definition and end in the middle of another.
“java.lang.Math.PI” is transformed to “Math.PI”. With an import statement, a class can be referred to by either its full package name or a shorter name.
“new int[] {1, 2, 3}” is transformed to “new int[] {$}”. This eliminates table initialization code.
These practical devices are also built into the rules.
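The first rule, replacing user-defined identifiers with one token, can be sketched as below. This is a simplified illustration and not CCFinder's implementation: the tokenizer is a single regular expression, the keyword list is a small hand-picked subset, and the placeholder “$” is chosen here only because the slide uses it for the array rule.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdentifierNormalizerSketch {
    // A deliberately small subset of Java keywords (assumption for the sketch).
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "int", "boolean", "void", "if", "else", "for", "while",
            "return", "new", "class", "public", "private", "static", "final"));

    // Identifier-like words, integer literals, or any single other symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\d+|\\S");

    // Returns the token sequence with every non-keyword identifier
    // replaced by "$", tokens separated by single spaces.
    static String normalize(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String tok = m.group();
            char first = tok.charAt(0);
            boolean isIdentifier = Character.isLetter(first) || first == '_';
            if (isIdentifier && !KEYWORDS.contains(tok)) {
                tok = "$";
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(tok);
        }
        return out.toString();
    }
}
```

After normalization, `int a = b + 1;` and `int x = y + 1;` produce the same token sequence, which is why fragments that differ only in variable names can be reported as renamed clones.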

The output of CCFinder
    #version: ccfinder 3.1
    #langspec: JAVA
    #option: -b 30,1
    #option: -k +
    #option: -r abcdfikmnprsv
    #option: -c wfg
    #begin{file description}
    0.0 52 C:\Gemini.java
    0.1 94 C:\GeneralManager.java
    :
    #end{file description}
    #begin{clone}
    0.1 53,9 63,13 1.10 542,9 553,13 35
    0.1 53,9 63,13 1.10 624,9 633,13 35
    0.2 124,9 152,31 0.2 154,9 216,51 42
    :
    #end{clone}
Annotations in the figure: object file ID (0.0 is file 0 in group 0); location of a clone pair (lines 53 - 63 in file 0.1 and lines 542 - 553 in file 1.10 are identical or similar to each other).
However, the output of CCFinder is difficult to understand intuitively. The listing above shows an example of the actual output. One part describes file ID numbers, and another part describes the locations of clone pairs. As you can see, it is difficult to analyze source code using only this text-based information about the location of clone pairs. The analysis of a large number of clone pairs is especially difficult.
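A clone pair line from the output above could be read with a sketch like the following. The field meanings are partly assumptions: the slide confirms only the file IDs and the start and end line numbers, so the second number of each position pair is ignored here and the last field is merely labeled as a length. The class name and field names are hypothetical.

```java
final class ClonePairLineSketch {
    final String leftFile;
    final int leftStartLine;
    final int leftEndLine;
    final String rightFile;
    final int rightStartLine;
    final int rightEndLine;
    final int length; // assumed meaning of the last field

    private ClonePairLineSketch(String[] f) {
        leftFile = f[0];
        leftStartLine = lineOf(f[1]);
        leftEndLine = lineOf(f[2]);
        rightFile = f[3];
        rightStartLine = lineOf(f[4]);
        rightEndLine = lineOf(f[5]);
        length = Integer.parseInt(f[6]);
    }

    // Parses one line such as "0.1 53,9 63,13 1.10 542,9 553,13 35".
    static ClonePairLineSketch parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 7) {
            throw new IllegalArgumentException("expected 7 fields: " + line);
        }
        return new ClonePairLineSketch(fields);
    }

    // A position looks like "53,9"; only the line number (first part) is used.
    private static int lineOf(String position) {
        return Integer.parseInt(position.split(",")[0]);
    }
}
```

Even with such a parser, the result is still a long list of pairs, which is the motivation for viewing clone pairs graphically instead.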

The analysis of comparison among students (non-gapped clones only)
[Figure: scatter plot with two crowded areas labeled A and B.]
We analyzed the actual source code corresponding to these crowded clone pairs through the source code viewer. Area A involves 2 students; their similar code fragments were not plagiarism but came from the source code of a sample compiler described in the textbook. Area B involves 4 students; many of their code fragments were similar even in the names of variables and in comments. So it is highly likely that plagiarism occurred among them.

Clone class metrics
LEN(C): length of the token sequence of each element in clone class C
LNR(C): length of the non-repetitive token sequence of LEN(C)
POP(C): number of elements in clone class C
DFL(C): estimation of how many tokens would be removed from the source files when all code fragments of clone class C are replaced with caller statements of a new identical routine
RAD(C): distribution in the file system of the elements in clone class C
[Figure: the code fragments of a clone class are replaced by caller statements of a new sub routine.]
Next, I explain the clone class metrics used in the metric graph of Gemini. LEN(C) is the length of the token sequence of each element in a clone class. POP(C) is the number of elements in a clone class. DFL(C) estimates how many tokens would be removed from the source files when all code fragments of clone class C are replaced with caller statements of a new identical routine, as in the figure; this value can be calculated from the values of LEN and POP. RAD(C) is the distribution in the file system of the elements in a clone class.
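The statement that DFL can be calculated from LEN and POP can be illustrated with a small sketch. The slide does not give the formula, so the one below is an assumption that follows the verbal definition: the tokens removed are the POP copies of LEN tokens each, and the tokens added are one caller statement per copy plus one new routine of LEN tokens. The parameter `callerTokens` (the size of one caller statement) is hypothetical.

```java
final class DflSketch {
    // Estimated net reduction in tokens when every fragment of a clone class
    // is replaced by a call to one new routine.
    // len: LEN(C), pop: POP(C), callerTokens: assumed size of a caller statement.
    static int dfl(int len, int pop, int callerTokens) {
        int removed = len * pop;               // all cloned fragments disappear
        int added = callerTokens * pop + len;  // callers plus the new routine
        return removed - added;
    }
}
```

Under these assumptions, a clone class with LEN = 50 and POP = 20 and a caller statement of 5 tokens gives 50 × 20 − (5 × 20 + 50) = 850 tokens saved, so long clones with many elements score highest.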

Comparison with AST approach
Features of the AST approach: it extracts the same sub-trees of an AST as a clone. The result is precise because of strict syntax analysis. It has high space and time complexity.
Features of our approach: a hybrid of CCFinder’s quick but inaccurate clone detection and CCShaper’s filtering, which considers syntax structure.

The other approaches
AST (abstract syntax tree) approach: clone = the same sub-trees in an AST. Deep dependence on the programming language.
PDG (program dependence graph) approach: clone = the same sub-graphs in a PDG. Graph comparison is difficult.
Code metric approach: clone = the routines which have the same metric values. Severe restriction in granularity.
CCFinder & CCShaper: clone = the code fragments which have the same syntax structure. Limited precision.

Why I chose “a”
I selected the clones by the following criteria: all clone code fragments appear in the same class; the metric LEN is high; the code fragment includes a whole method body. (Note: positional relationship between clones.)