Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Detection and evolution analysis of code clones for efficient management of large-scale software systems
Software Engineering Laboratory
Eunjong Choi

Code Clone
- A code fragment that has other code fragments identical or similar to it in the source code
- One of the main factors that hamper software maintenance
[Figure: code clones in Source File 1 and Source File 2 forming a clone set]

Clone Detection Tools
- Detect code clones at various granularities: strings, tokens, program dependence graphs
- CCFinder [Kamiya2002]
  - Token-based code clone detection tool
  - Known for its high recall
[Kamiya2002] T. Kamiya et al.: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code. IEEE TSE, 28(7), 2002.

Example of CCFinder [Ueda2002]

    static void foo() throws RESyntaxException {
        String a[] = new String [] { "123,400", "abc", "orange 100" };
        org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (pat.match(a[i]))
                sum += Sample.parseNumber(pat.getParen(0));
        System.out.println("sum = " + sum);
    }
    static void goo(String [] a) throws RESyntaxException {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
        System.out.println("sum = " + sum);
    }

Detection pipeline: Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Code clones
[Ueda2002] Y. Ueda et al.: Gemini: Maintenance Support Environment Based on Code Clone Analysis.
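
The lexical analysis and transformation steps of the pipeline above can be illustrated with a toy normalizer. This is a minimal sketch, not CCFinder's actual rule set: the class name `TokenNormalizer`, the small keyword list, and the choice to collapse every identifier or dotted name into a single `$` placeholder are all assumptions made for illustration. Match detection (which CCFinder performs on the transformed sequence) is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CCFinder.
final class TokenNormalizer {
    // Small keyword subset, enough for the slide example; a real lexer needs the full list.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "new", "int", "for", "if", "static", "void", "throws", "return"));

    // Alternatives, in order: string literal, identifier or dotted name, number, any other symbol.
    private static final Pattern TOKEN = Pattern.compile(
            "\"(?:\\\\.|[^\"\\\\])*\""
            + "|[A-Za-z_][A-Za-z_0-9]*(?:\\.[A-Za-z_][A-Za-z_0-9]*)*"
            + "|\\d+"
            + "|\\S");

    /** Splits source text into tokens and replaces identifiers with a placeholder. */
    static List<String> normalize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            if (Character.isJavaIdentifierStart(t.charAt(0)) && !KEYWORDS.contains(t)) {
                tokens.add("$");   // identifier or dotted name becomes a placeholder
            } else {
                tokens.add(t);     // keywords, literals, and punctuation are kept
            }
        }
        return tokens;
    }
}
```

Under these assumptions, the declarations of `pat` in `foo` and `exp` in `goo` normalize to the same token sequence even though one uses the fully qualified class name, which is why a token-based detector can report them as clones.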

Clone Management
- Tools for managing code clones support consistent change and clone refactoring
- Clone refactoring
  - Merges code clones into a single unit (i.e., a method or function)
  - Reduces the effort and time needed for clone management
[Figure: clone refactoring, where code clones are replaced by a call to the merged unit]

Motivation of the Thesis (1/2)
- Many companies release new models at short intervals [Bosch 2010]
- Robust parts of existing source code are frequently reused for new development
[Figure: reused parts + unique features]
[Bosch 2010] J. Bosch and P. Bosch-Sijtsema: From integration to composition: On the impact of software product lines, global development and ecosystems. J. Syst. Softw. 83(1), January 2010.

Motivation of the Thesis (2/2)
- The existing tools are insufficient for large-scale software systems
  - Detection takes a long time on systems that contain a large number of code clones
  - Tools for clone management are commonly underused
- RQ1. How can code clones be detected quickly in large-scale software systems?
- RQ2. How can more widely used tools that support clone refactoring be developed?

Thesis Outline
- Chapter 1: Introduction
- Chapter 2: Related Work [1-2]
- Chapter 3: Proposing and Evaluating Clone Detection Approaches [1-1] (to answer RQ1)
- Chapter 4: Investigating Merged Code Clones during Software Evolution [1-3][1-4] (to answer RQ2)
- Chapter 5: Conclusion and Future Work

Chapter 3: Proposing and Evaluating Clone Detection Approaches

Motivation of This Study (1/2)
- It is important to detect code clones across different release models and versions
- A large number of code clones increases detection time
- Identical files increase the computational cost of code clone detection
  - Code clones are repeatedly detected within them

Motivation of This Study (2/2)
- Different degrees of normalization cause subtly different source code to be detected as code clones
  - Normalization: transformation of program elements
- Example of a code clone detected by CCFinder:
    org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); ……
    RE exp = new RE("[0-9,]+"); ……
    org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); ……

Overview of This Study (1/2)
- Proposes and evaluates six approaches to investigate how normalization affects code clone detection
  - One approach with non-normalization
  - Five approaches with normalization

Overview of This Study (2/2)
- The proposed approaches share three pipeline phases
  - Preprocessing: partitions the input files into equivalence classes (i.e., sets of files that are identical to each other) based on their MD5 hash values, and then generates a corpus (i.e., a set of files containing one representative of each equivalence class)
  - Clone detection: detects code clones on the corpus using CCFinder
  - Post-processing: generates all clone sets by mapping the output of CCFinder onto the equivalence classes and, if necessary, other information
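
The preprocessing phase described above can be sketched as follows. This is an illustrative sketch only: the class `CorpusBuilder`, its method names, the use of in-memory strings instead of files on disk, and the rule of taking the first member of each class as its representative are assumptions, not details given in the slides.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration.
final class CorpusBuilder {
    /** Hex-encoded MD5 digest of a file's content. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest is unavailable", e);
        }
    }

    /** Partitions files (name -> content) into equivalence classes keyed by MD5 value. */
    static Map<String, List<String>> partition(Map<String, String> files) {
        Map<String, List<String>> classes = new LinkedHashMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            classes.computeIfAbsent(md5Hex(file.getValue()), k -> new ArrayList<>())
                   .add(file.getKey());
        }
        return classes;
    }

    /** Builds the corpus: one representative per class (assumed: the first member). */
    static List<String> corpus(Map<String, List<String>> classes) {
        List<String> representatives = new ArrayList<>();
        for (List<String> members : classes.values()) {
            representatives.add(members.get(0));
        }
        return representatives;
    }
}
```

With four input files of which two are byte-identical, `partition` yields three classes and `corpus` returns three file names, so the clone detector only has to process three files instead of four.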

Approach with Non-normalization
[Figure: Input source files (a1, a2, b1, c1) → Preprocessing (calculate MD5 hash values, partition into equivalence classes, select representatives) → Clone detection (detect code clones) → Post-processing (mapping)]
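
The mapping step in post-processing might look like the sketch below. It works at file granularity only, whereas real clone sets refer to code fragments; the class `CloneSetMapper`, the method `expand`, and the data shapes (clone sets as lists of representative file names, equivalence classes as a map from representative to members) are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration.
final class CloneSetMapper {
    /**
     * Expands clone sets reported on the corpus so that each representative
     * is replaced by every file in its equivalence class.
     */
    static List<Set<String>> expand(List<List<String>> cloneSetsOnCorpus,
                                    Map<String, List<String>> classOfRepresentative) {
        List<Set<String>> expanded = new ArrayList<>();
        for (List<String> cloneSet : cloneSetsOnCorpus) {
            Set<String> files = new LinkedHashSet<>();
            for (String representative : cloneSet) {
                List<String> members = classOfRepresentative.get(representative);
                if (members == null) {
                    files.add(representative);   // no recorded class: the file stands for itself
                } else {
                    files.addAll(members);       // identical files share the same clones
                }
            }
            expanded.add(files);
        }
        return expanded;
    }
}
```

For example, if a1 represents the class {a1, a2} and the detector reports a clone set between a1 and b1 on the corpus, the expanded clone set covers a1, a2, and b1.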

Approach with Normalizations (1/2)
[Figure: Input source files (a1, a2, b1, c1) → Preprocessing (parse and normalize, calculate MD5 hash values, partition into equivalence classes, select representatives) → Clone detection (detect code clones) → Post-processing (mapping)]

Approach with Normalizations (2/2)
- Identical Except for Comments (IEC) approach
- Identical Except for Macros (IEM) approach
- Identical Except for Macros and Comments (IEMC) approach
- Identical Source Code (ISC) approach
- Identical Normalized Source Code (INSC) approach
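
As an illustration of the idea behind the IEC approach, the sketch below strips comments before two files are compared, so files that differ only in comments fall into the same equivalence class. It is a deliberately naive sketch: the class `CommentStripper` is hypothetical, the regular expression does not protect comment markers inside string literals, and whitespace is not normalized. The slides do not say how the thesis implements this step.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
final class CommentStripper {
    // Block comments (non-greedy, may span lines) or line comments up to the end of the line.
    private static final Pattern COMMENT =
            Pattern.compile("/\\*.*?\\*/|//[^\\n]*", Pattern.DOTALL);

    /** Removes C-style block comments and line comments from the source text. */
    static String stripComments(String source) {
        return COMMENT.matcher(source).replaceAll("");
    }

    /** True if the two sources are equal once comments are removed. */
    static boolean identicalExceptForComments(String a, String b) {
        return stripComments(a).equals(stripComments(b));
    }
}
```

In a pipeline like the one described earlier, the stripped text (rather than the raw file content) would be the input to the MD5 hash.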

Case Study (1/2)
- Research Questions (RQs)
  - RQ1. Can the proposed approaches detect code clones faster than an approach that uses only CCFinder?
  - RQ2. Which of the proposed approaches is the fastest?
- The approaches are applied to different versions of three open source software (OSS) systems
  - Our proposed approaches
  - Approach that uses only CCFinder

Case Study (2/2)
- Statistics of subject systems

| Project name | Language | Versions | #Files | Lines of code | #Tokens |
|---|---|---|---|---|---|
| Apache Ant | Java | 29 consecutive versions ( ) | 18,708 | 4,862,102 | 8,404,790 |
| Linux kernel | C | 12 consecutive versions ( ) | 7,839 | 5,690,967 | 12,537,555 |
| Samsung Galaxy | C | 2 versions for different areas | 29,573 | 19,920,387 | 43,924,235 |

- Detection environment: 64-bit Windows 7 Professional workstation equipped with 2 processors and 24 gigabytes of main memory

Detection Time in Seconds (Samsung Galaxy)

| Approach | Total detection | Preprocessing | Clone detection | Post-processing |
|---|---|---|---|---|
| Approach that uses only CCFinder | 19, | | | |
| Approach with non-normalization | 4,326 | 7 (0.16%) | 4,307 (99.56%) | 12 (0.28%) |
| IEC approach | 8, | (2.31%) | 4,686 (53.23%) | 3,913 (44.46%) |
| IEM approach | 9, | (2.93%) | 4,601 (49.79%) | 4,368 (47.28%) |
| IEMC approach | 8, | (2.60%) | 4,530 (52.01%) | 3,954 (45.39%) |
| ISC approach | 8, | (2.84%) | 4,398 (51.67%) | 3,873 (45.49%) |
| INSC approach | 8, | (2.63%) | 4,552 (51.18%) | 4,108 (46.19%) |

Answers to RQs
- RQ1. Can the proposed approaches detect code clones faster than an approach that uses only CCFinder?
  - Answer to RQ1: Our proposed approaches are able to detect code clones faster than the "approach that uses only CCFinder".
- RQ2. Which of the proposed approaches is the fastest?
  - Answer to RQ2: The "approach with non-normalization" is the fastest.

Chapter 4: Investigating Merged Code Clones during Software Evolution

Motivation of This Study
- Clone refactoring tools are commonly underused compared to general refactoring tools
- This study investigates instances of clone refactoring in open source software systems
  - To uncover clues that could contribute to the development of more widely used tools for clone refactoring

Research Questions (RQs)
- RQ1: Which refactoring patterns are the most frequently used in clone refactoring?
- RQ2: How similar are the token sequences between pairs of merged code clones?
- RQ3: How different are the lengths of token sequences between pairs of merged code clones?
- RQ4: How far apart are pairs of code clones located before clone refactoring?

Steps of Investigation
- Step 1: Detecting instances of refactoring (Ref-Finder is applied to the source code of versions k and k+1 extracted from the software repository)
- Step 2: Identifying instances of clone refactoring among the detected instances of refactoring
- Step 3: Measuring the characteristics of merged code clones

Step 1: Detecting Instances of Refactoring (1/2)
- Ref-Finder [Prete2010] was applied to identify instances of the following refactoring patterns
  - Extract Method (EM)
  - Extract Class (EC)
  - Extract Superclass (ES)
  - Form Template Method (FTM)
  - Pull Up Method (PUM)
  - Parameterize Method (PM)
  - Replace Method with Method Object (RMMO)
[Prete2010] K. Prete et al.: Template-based reconstruction of complex refactorings. In Proc. of ICSM, pp. 1-10, 2010.

Step 1: Detecting Instances of Refactoring (2/2)
- The output of Ref-Finder was manually validated to exclude false positives
  - Existing validated output data [Bavota2012] were also used
- Subject systems

| Subject system | Versions | #Versions | Period |
|---|---|---|---|
| Apache Ant | 1.2 – | | Jan – Dec |
| ArgoUML | | | Oct – Dec |
| Xerces-J | | | Nov – Nov. 2010 |

[Bavota2012] G. Bavota et al.: When Does a Refactoring Induce Bugs? An Empirical Study. In Proc. of SCAM, 2012.

Step 2: Identifying Instances of Clone Refactoring (1/3)
- Undirected similarity (usim) [Mende2010]: determines the similarity between two token sequences
  - Uses the Levenshtein distance, which measures the amount of difference between two sequences
  - Example: the Levenshtein distance between "survey" and "surgery" is 2 [B-yates]
    survey → surgey (+1) → surgery (+1)
[Mende2010] T. Mende et al.: An evaluation of code similarity identification for the grow-and-prune model. Journal of Software Maintenance, 21(2), 2009.
[B-yates] R. Baeza-Yates and B. Ribeiro-Neto: Modern Information Retrieval: The Concepts and Technology behind Search (2nd Edition). Addison Wesley, 2010.

Step 2: Identifying Instances of Clone Refactoring (2/3)
- The Levenshtein distance between two token sequences is normalized by the length of the longer sequence:

  $$\mathrm{usim}(s_x, s_y) = \left(1 - \frac{LD(s_x, s_y)}{\max(l_x, l_y)}\right) \times 100$$

  - $LD(s_x, s_y)$: number of items that have to be changed to turn function $f_x$ into $f_y$
  - $s_x$: the normalized token sequence of $f_x$
  - $l_x$: the length of the normalized token sequence $s_x$
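
The similarity measure can be sketched in code as follows. The sketch assumes that usim is one minus the Levenshtein distance divided by the length of the longer token sequence, expressed as a percentage; this matches the description of normalizing by the maximum size and the percentage threshold used in the deck, but the exact formula should be checked against [Mende2010]. The class and method names are hypothetical, and treating two empty sequences as 100% similar is an additional assumption.

```java
import java.util.List;

// Hypothetical helper for illustration.
final class UndirectedSimilarity {
    /** Levenshtein distance between two token sequences (insert, delete, substitute cost 1). */
    static int levenshtein(List<String> x, List<String> y) {
        int[] prev = new int[y.size() + 1];
        int[] curr = new int[y.size() + 1];
        for (int j = 0; j <= y.size(); j++) {
            prev[j] = j;                       // turning an empty prefix into y[0..j)
        }
        for (int i = 1; i <= x.size(); i++) {
            curr[0] = i;                       // deleting the first i tokens of x
            for (int j = 1; j <= y.size(); j++) {
                int substitution = x.get(i - 1).equals(y.get(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + substitution);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[y.size()];
    }

    /** Assumed usim: (1 - distance / longer length) * 100, as a percentage. */
    static double usim(List<String> x, List<String> y) {
        int longer = Math.max(x.size(), y.size());
        if (longer == 0) {
            return 100.0;                      // assumption: two empty sequences are identical
        }
        return (1.0 - (double) levenshtein(x, y) / longer) * 100.0;
    }
}
```

Treating each character of "survey" and "surgery" as a token gives a distance of 2, in line with the textbook example, and two four-token sequences that differ in one token have a usim of 75%.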

Step 2: Identifying Instances of Clone Refactoring (3/3)
- A pair of refactored code fragments was defined as an instance of clone refactoring only if it satisfied the following three conditions
  - Syntax condition: both code fragments were refactored into the same new method in the new version
  - Similarity condition: the usim value of the pair of code fragments in the old version was more than 65% [Mende2010]
  - Volume condition: the token length of each refactored pair was greater than 10 in the old version [Mende2010]
[Mende2010] T. Mende et al.: An evaluation of code similarity identification for the grow-and-prune model. Journal of Software Maintenance, 21(2), 2009.
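
The three conditions can be combined into a single predicate, sketched below. The class `CloneRefactoringFilter` and its parameters are hypothetical, and the volume condition is interpreted here as requiring both fragments of the pair to be longer than 10 tokens, which is one possible reading of the slide.

```java
// Hypothetical helper for illustration.
final class CloneRefactoringFilter {
    static final double MIN_USIM_PERCENT = 65.0;  // similarity threshold from the slide
    static final int MIN_TOKENS = 10;             // volume threshold from the slide

    /**
     * Decides whether a pair of refactored fragments counts as clone refactoring.
     * targetX and targetY name the new method each fragment was refactored into.
     */
    static boolean isCloneRefactoring(String targetX, String targetY,
                                      double usimPercent, int tokensX, int tokensY) {
        boolean syntax = targetX.equals(targetY);                           // same new method
        boolean similarity = usimPercent > MIN_USIM_PERCENT;                // strictly more than 65%
        boolean volume = tokensX > MIN_TOKENS && tokensY > MIN_TOKENS;      // assumed: both fragments
        return syntax && similarity && volume;
    }
}
```

A pair merged into the same method with a usim of 80% and 25 and 30 tokens passes, while a pair with a usim of exactly 65% does not, because the threshold is strict.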

RQ1. The Most Frequently Used Refactoring Patterns
- Sets of code clones were categorized by the refactoring pattern used to merge them into the same newly created method
- A total of 35 sets of merged code clones were identified
[Table: number of instances per refactoring pattern (EM, ES, FTM, RMMO)]
- Answer to RQ1: RMMO was the most frequently used refactoring pattern, followed by EM.

RQ2. Similarities of Token Sequences between Pairs of Merged Code Clones
[Figure: average usim values of sets of merged code clones]
- Answer to RQ2: EM and RMMO were mainly used to merge pairs of code clones with various token similarities. The token similarities for EM and RMMO were relatively low compared to those for ES and FTM.

Suggestions for Tool Development
- It is vital for tools to support the RMMO and EM patterns
- To support the EM pattern, tools should suggest pairs of code clones with various token similarities as candidates for clone refactoring
- To support the RMMO pattern, tools should likewise suggest pairs of code clones with various token similarities as candidates

Future Work
- Higher speed: extend the approach in Chapter 3
  - Use a distributed approach such as D-CCFinder [Livieri2007]
- Tool development: based on the investigation results of Chapter 4
[Livieri2007] S. Livieri et al.: Very-Large Scale Code Clone Analysis and Visualization of Open Source Programs Using Distributed CCFinder: D-CCFinder. ICSE 2007.

Any Questions?