Department of Computer Science, Graduate School of Information Science & Technology, Osaka University: Detection and evolution analysis of code clones for efficient management of large-scale software systems

Presentation transcript:

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Detection and evolution analysis of code clones for efficient management of large-scale software systems
Software Engineering Laboratory
Eunjong Choi

Code Clone
A code clone is a code fragment that has other identical or similar code fragments elsewhere in the source code.
[Figure: Source File 1 and Source File 2 each contain code clones; corresponding clones form a clone set.]
Code clones are a representative factor that hampers software maintenance.

Clone Detection Tools
Detection tools work at various granularities:
- String, token, program dependency graph
CCFinder [Kamiya2002]
- Token-based code clone detection tool
- Known for its high recall
[Kamiya2002] T. Kamiya et al.: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code. IEEE TSE, 28(7), 2002.

Example of CCFinder [Ueda2002]

```java
// Example input: foo() and goo() contain near-identical loop bodies.
static void foo() throws RESyntaxException {
    String a[] = new String[] { "123,400", "abc", "orange 100" };
    org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
    int sum = 0;
    for (int i = 0; i < a.length; ++i)
        if (pat.match(a[i]))
            sum += Sample.parseNumber(pat.getParen(0));
    System.out.println("sum = " + sum);
}

static void goo(String[] a) throws RESyntaxException {
    RE exp = new RE("[0-9,]+");
    int sum = 0;
    for (int i = 0; i < a.length; ++i)
        if (exp.match(a[i]))
            sum += parseNumber(exp.getParen(0));
    System.out.println("sum = " + sum);
}
```

Detection pipeline: Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Code clones
[Ueda2002] Y. Ueda et al. Gemini: Maintenance Support Environment Based on Code Clone Analysis.
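The first three pipeline stages can be sketched in a few lines. This is a minimal illustration only, not CCFinder's implementation: the regex lexer, the keyword list, the `$` placeholder, and the function names are all assumptions made for this example, and the quadratic dynamic-programming match stands in for whatever matching algorithm the real tool uses.

```python
import re

# Assumed, deliberately tiny keyword set for the Java snippet above.
KEYWORDS = {"static", "void", "int", "for", "if", "new", "throws", "String"}
TOKEN_RE = re.compile(r'"[^"]*"|[A-Za-z_]\w*|\d+|\+\+|\+=|[^\s\w]')
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def lex(source):
    """Lexical analysis: split source text into a token sequence."""
    return TOKEN_RE.findall(source)


def transform(tokens):
    """Transformation: replace every non-keyword identifier with '$'."""
    return [
        "$" if IDENT_RE.fullmatch(tok) and tok not in KEYWORDS else tok
        for tok in tokens
    ]


def longest_common_run(a, b):
    """Match detection: length of the longest contiguous run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


pat_tokens = transform(lex("if (pat.match(a[i]))"))
exp_tokens = transform(lex("if (exp.match(a[i]))"))
print(pat_tokens == exp_tokens)  # True: the variable names no longer differ
```

Because identifiers are replaced before matching, `pat` and `exp` compare equal, which is why a token-based detector can report the two loops as clones even though the variable names differ.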


Clone Management
Tools for managing code clones support:
- Consistent change and clone refactoring
Clone refactoring
- Merges code clones into a single unit (i.e., a method or function)
- Reduces the effort and time required for clone management
[Figure: clone refactoring replaces the cloned fragments with calls to the merged unit.]

Motivation of the Thesis (1/2)
Many companies release new models at short, rushed intervals [Bosch 2010].
- They frequently reuse robust parts of existing source code for new development.
[Figure: a new model combines reused parts with unique features.]
[Bosch 2010] J. Bosch and P. Bosch-Sijtsema. From integration to composition: On the impact of software product lines, global development and ecosystems. J. Syst. Softw. 83(1), January 2010.

Motivation of the Thesis (2/2)
Existing tools are insufficient for large-scale software systems.
- Detection takes a long time on systems that contain a large number of code clones.
- Tools for clone management are commonly underused.
RQ1. How can code clones be detected quickly in large-scale software systems?
RQ2. How can tools that support clone refactoring be developed so that they are more widely used?

Thesis Outline
Chapter 1: Introduction
Chapter 2: Related Work [1-2]
Chapter 3: Proposing and Evaluating Clone Detection Approaches [1-1]
- To answer RQ1
Chapter 4: Investigating Merged Code Clones during Software Evolution [1-3][1-4]
- To answer RQ2
Chapter 5: Conclusion and Future Work

Chapter 3: Proposing and Evaluating Clone Detection Approaches

Motivation of This Study (1/2)
It is important to detect code clones across different release models and versions.
A large number of code clones increases detection time.
- Identical files increase the computational cost of code clone detection, because the same code clones are detected repeatedly within them.

Motivation of This Study (2/2)
Different degrees of normalization cause subtly different source code to be detected as code clones.
- Normalization: transformation of program elements
Code clone by CCFinder:
org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); ……
RE exp = new RE("[0-9,]+"); ……
org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); ……

Overview of This Study (1/2)
This study proposes six approaches and evaluates them.
- Goal: to investigate how normalization affects code clone detection
- Approach with non-normalization
- Approaches with normalization

Overview of This Study (2/2)
The proposed approaches share three pipeline phases:
- Preprocessing: partitions the input files into equivalence classes (i.e., sets of files that are identical to each other) based on their MD5 hash values, then generates a corpus (i.e., a set of files, one representative from each equivalence class).
- Clone detection: detects code clones in the corpus using CCFinder.
- Post-processing: generates all clone sets by mapping the output of CCFinder onto the equivalence classes, using other information if necessary.
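The preprocessing and post-processing phases can be sketched as follows. This is a hedged illustration under stated assumptions, not the thesis implementation: files are passed in as an in-memory `path -> content` dictionary, the representative is simply the first path in sorted order, and the function names (`partition_by_md5`, `build_corpus`, `expand_clone_pair`) are invented for this example. The clone detection phase itself (running CCFinder on the corpus) is not shown.

```python
import hashlib
from collections import defaultdict


def partition_by_md5(files):
    """Preprocessing: group paths whose contents have the same MD5 digest."""
    classes = defaultdict(list)
    for path, content in files.items():
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        classes[digest].append(path)
    return dict(classes)


def build_corpus(classes):
    """Preprocessing: pick one representative per equivalence class.

    Assumption: the representative is the first path in sorted order.
    """
    return {digest: sorted(paths)[0] for digest, paths in classes.items()}


def expand_clone_pair(pair, corpus, classes):
    """Post-processing: map a clone pair found between two representatives
    back to every file in the two equivalence classes."""
    rep_to_digest = {rep: digest for digest, rep in corpus.items()}
    left = classes[rep_to_digest[pair[0]]]
    right = classes[rep_to_digest[pair[1]]]
    return sorted((l, r) for l in left for r in right if l != r)


files = {
    "v1/a.c": "int f(){return 1;}",
    "v2/a.c": "int f(){return 1;}",  # identical to v1/a.c
    "v1/b.c": "int g(){return 2;}",
}
classes = partition_by_md5(files)
corpus = build_corpus(classes)
print(sorted(corpus.values()))
print(expand_clone_pair(("v1/a.c", "v1/b.c"), corpus, classes))
```

The detector only sees one copy of each distinct file, and the mapping step restores the clone relations for the copies that were left out of the corpus.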

Approach with Non-normalization
[Figure: input source files pass through preprocessing (calculate MD5 hash values, partition equivalence class, select), clone detection (detect code clones), and post-processing (mapping). Example files: a1, a2, b1, c1.]

Approach with Normalizations (1/2)
[Figure: input source files pass through preprocessing (calculate MD5 hash values, partition equivalence class, select), clone detection (detect code clones), and post-processing (mapping), with an added parse-and-normalize step. Example files: a1, a2, b1, c1.]

Approach with Normalizations (2/2)
- Identical Except for Comments (IEC) approach
- Identical Except for Macros (IEM) approach
- Identical Except for Macros and Comments (IEMC) approach
- Identical Source Code (ISC) approach
- Identical Normalized Source Code (INSC) approach
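As a rough illustration of what an "identical except for comments" key might look like, the sketch below strips comments before hashing. Everything here is an assumption made for the example: the slides do not say how the IEC approach parses or normalizes files, the regex-based comment removal is naive (it does not handle comment markers inside string literals), and collapsing whitespace is an extra simplification so that a removed comment does not leave a differing blank behind.

```python
import hashlib
import re

# Naive pattern for /* block */ and // line comments in C or Java source.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def strip_comments(source):
    """Remove comments; does not understand string literals."""
    return COMMENT_RE.sub("", source)


def iec_key(source):
    """MD5 of the source after comment removal and whitespace collapsing."""
    text = " ".join(strip_comments(source).split())
    return hashlib.md5(text.encode("utf-8")).hexdigest()


a = "int f() { return 1; } // version 1\n"
b = "/* header */\nint f() { return 1; }\n"
c = "int f() { return 2; }\n"

print(iec_key(a) == iec_key(b))  # same code, different comments
print(iec_key(a) == iec_key(c))  # different code
```

Files `a` and `b` would land in the same equivalence class under this key, while plain MD5 hashing of the raw text would keep them apart.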

Case Study (1/2)
Research Questions (RQs)
- RQ1. Can the proposed approaches detect code clones faster than an approach that uses only CCFinder?
- RQ2. Which of the proposed approaches is the fastest?
The following approaches are applied to different versions of three open source software (OSS) systems:
- Our proposed approaches
- Approach that uses only CCFinder

Case Study (2/2)
Statistics of subject systems

| Project name | Language | Versions | #Files | Lines of code | #Tokens |
|---|---|---|---|---|---|
| Apache ant | Java | 29 consecutive versions ( ) | 18,708 | 4,862,102 | 8,404,790 |
| Linux kernel | C | 12 consecutive versions ( ) | 7,839 | 5,690,967 | 12,537,555 |
| Samsung galaxy | C | 2 versions for different areas | 29,573 | 19,920,387 | 43,924,235 |

Detection environment
- 64-bit Windows 7 Professional workstation equipped with 2 processors and 24 gigabytes of main memory

Detection Time in Seconds (Samsung galaxy)

| Approach name | Total detection | Preprocessing | Clone detection | Post-processing |
|---|---|---|---|---|
| Approach that uses only CCFinder | 19,… | | | |
| Approach with non-normalization | 4,326 | 7 (0.16%) | 4,307 (99.56%) | 12 (0.28%) |
| IEC approach | 8,… | … (2.31%) | 4,686 (53.23%) | 3,913 (44.46%) |
| IEM approach | 9,… | … (2.93%) | 4,601 (49.79%) | 4,368 (47.28%) |
| IEMC approach | 8,… | … (2.60%) | 4,530 (52.01%) | 3,954 (45.39%) |
| ISC approach | 8,… | … (2.84%) | 4,398 (51.67%) | 3,873 (45.49%) |
| INSC approach | 8,… | … (2.63%) | 4,552 (51.18%) | 4,108 (46.19%) |

Answers to RQs
RQ1. Can the proposed approaches detect code clones faster than an approach that uses only CCFinder?
Answer to RQ1: The proposed approaches detect code clones faster than the approach that uses only CCFinder.
RQ2. Which of the proposed approaches is the fastest?
Answer to RQ2: The approach with non-normalization is the fastest.


Chapter 4: Investigating Merged Code Clones during Software Evolution

Motivation of This Study
Clone refactoring tools are commonly underused compared to general refactoring tools.
This study investigated instances of clone refactoring in open source software systems.
- To uncover clues that could contribute to the development of more widely used tools for clone refactoring

Research Questions (RQs)
RQ1: Which refactoring patterns are the most frequently used in clone refactoring?
RQ2: How similar are the token sequences of pairs of merged code clones?
RQ3: How different are the lengths of the token sequences of pairs of merged code clones?
RQ4: How far apart are pairs of code clones located before clone refactoring?

Steps of Investigation
Step 1: Detecting Instances of Refactoring
Step 2: Identifying Instances of Clone Refactoring
Step 3: Measuring the Characteristics of Merged Code Clones
[Figure: source code of versions k and k+1 is extracted from the software repository; Ref-Finder detects instances of refactoring between them, from which instances of clone refactoring are identified.]

Step 1: Detecting Instances of Refactoring (1/2)
Ref-Finder [Prete2010] was applied to identify instances of the following refactoring patterns:
- Extract Method (EM)
- Extract Class (EC)
- Extract Superclass (ES)
- Form Template Method (FTM)
- Pull Up Method (PUM)
- Parameterize Method (PM)
- Replace Method with Method Object (RMMO)
[Prete2010] K. Prete et al. Template-based reconstruction of complex refactorings. In Proc. of ICSM, pp. 1-10, 2010.

Step 1: Detecting Instances of Refactoring (2/2)
The output of Ref-Finder was manually validated.
- To exclude false positives
- Existing validated output data [Bavota2012] was consulted
Subject systems

| Subject system | Versions | #Versions | Period |
|---|---|---|---|
| Apache ant | 1.2 – … | … | Jan. … – Dec. … |
| ArgoUML | … | … | Oct. … – Dec. … |
| Xerces-j | … | … | Nov. … – Nov. 2010 |

[Bavota2012] G. Bavota et al. When Does a Refactoring Induce Bugs? An Empirical Study. In Proc. of SCAM, 2012.

Step 2: Identifying Instances of Clone Refactoring (1/3)
Undirected similarity (usim) [Mende2010] determines the similarity between two token sequences.
- It is based on the Levenshtein distance, which measures the amount of difference between two character sequences.
- Example: the Levenshtein distance between "survey" and "surgery" is 2 [B-yates]: survey → surgey (+1) → surgery (+1)
[Mende2010] T. Mende et al. An evaluation of code similarity identification for the grow-and-prune model. Journal of Software Maintenance, 21(2), 2009.
[B-yates] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval: The Concepts and Technology behind Search (2nd Edition). Addison Wesley, 2010.
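The distance in the example can be checked with the standard dynamic-programming formulation. The sketch below is a generic textbook implementation, not code from the thesis; the function name is chosen for this example. It works on any two sequences, so the same function applies to lists of tokens as well as to strings.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            cur.append(min(prev[j] + 1,          # delete x
                           cur[j - 1] + 1,       # insert y
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


print(levenshtein("survey", "surgery"))  # 2
```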

Step 2: Identifying Instances of Clone Refactoring (2/3)
The Levenshtein distance between two token sequences is normalized by the larger of their two lengths. The definition uses the following quantities:
- the number of items that have to be changed to turn function fx into fy
- a normalized token sequence
- the length of a normalized token sequence
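One way to write this normalization is given below. The symbol names are assumptions introduced for this sketch: $\Delta(f_x, f_y)$ for the number of items that have to be changed to turn $f_x$ into $f_y$ (the Levenshtein distance between their normalized token sequences), and $l_x$, $l_y$ for the lengths of those sequences. The percentage scale is also an assumption, chosen to match the 65% threshold used in the similarity condition; the exact form in [Mende2010] may differ.

```latex
\[
  \mathit{usim}(f_x, f_y)
    = \frac{\max(l_x, l_y) - \Delta(f_x, f_y)}{\max(l_x, l_y)} \times 100
\]
```

Under this form, identical sequences score 100 and sequences with nothing in common score 0.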

Step 2: Identifying Instances of Clone Refactoring (3/3)
A pair of refactored code fragments was counted as an instance of clone refactoring only if it satisfied all three of the following conditions:
- Syntax condition: both code fragments were refactored into the same new method in the new version.
- Similarity condition: the usim value of the pair in the old version was more than 65% [Mende2010].
- Volume condition: the token length of each refactored pair was greater than 10 in the old version [Mende2010].
[Mende2010] T. Mende et al. An evaluation of code similarity identification for the grow-and-prune model. Journal of Software Maintenance, 21(2), 2009.
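The three conditions can be combined into a single predicate, sketched below. Several details are assumptions made for illustration: the `Fragment` type and its `merged_into` field are hypothetical, usim is computed as the Levenshtein distance normalized by the longer sequence and expressed as a percentage, and the volume condition is read as applying to each fragment of the pair.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    tokens: list      # normalized token sequence in the old version
    merged_into: str  # hypothetical id of the new method that replaced it


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (0 if x == y else 1)))
        prev = cur
    return prev[-1]


def usim(x, y):
    """Assumed form: similarity in percent, normalized by the longer sequence."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 100.0
    return (longest - levenshtein(x, y)) / longest * 100


def is_clone_refactoring(f1, f2, min_usim=65.0, min_tokens=10):
    syntax = f1.merged_into == f2.merged_into
    similarity = usim(f1.tokens, f2.tokens) > min_usim
    volume = len(f1.tokens) > min_tokens and len(f2.tokens) > min_tokens
    return syntax and similarity and volume


old_a = Fragment(list("abcdefghijkl"), "Util.sum")
old_b = Fragment(list("abcdefghijkx"), "Util.sum")
print(is_clone_refactoring(old_a, old_b))
```

Single characters stand in for tokens here to keep the example short; in practice the token lists would come from a lexer and normalizer.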

RQ1. The Most Frequently Used Refactoring Patterns
Sets of code clones were categorized based on whether they were merged into the same newly created method using the refactoring patterns.
- A total of 35 sets of merged code clones were identified.
[Table: #Instances per refactoring pattern (EM, ES, FTM, RMMO)]
Answer to RQ1: RMMO was the most frequently used refactoring pattern observed, followed by the EM pattern.

RQ2. Similarities of Token Sequences between Pairs of Merged Code Clones
[Figure: the average usim values of sets of merged code clones]
Answer to RQ2: EM and RMMO were mainly used to merge pairs of code clones with a wide range of token similarities.
- The token similarities for EM and RMMO were relatively low compared to those for ES and FTM.

Suggestions for Tool Development
It is vital for tools to support the RMMO and EM patterns.
- To support the EM pattern, tools should suggest pairs of code clones with various token similarities as candidates for clone refactoring.
- To support the RMMO pattern, tools should likewise suggest pairs of code clones with various token similarities as candidates.


Future Work
Higher speed: extend the approach in Chapter 3
- Using a distributed approach such as D-CCFinder [Livieri2007]
Tool development: based on the investigation results of Chapter 4
[Livieri2007] S. Livieri et al. Very-Large Scale Code Clone Analysis and Visualization of Open Source Programs Using Distributed CCFinder: D-CCFinder. ICSE 2007.
Any Questions?