Estimating Code Size After a Complete Code-Clone Merge
Buford Edwards III, Yuhao Wu, Makoto Matsushita, Katsuro Inoue
Graduate School of Information Science and Technology, Osaka University

Outline
- Review Code Clones
- Prior Code Clone Research
- Refactoring/Merging Code Clones
- Complete Code-Clone Merge Explanation
- Basic Case and Illustration
- Expand to Difficult Case (Overlapping and Embedded Code Clones)
- Prototype tool and its application
- Conclusions

What are code clones?
- Code clones – sections of code that are the same as, or very similar to, each other
- How similar they must be depends on the kind of clone and on how similarity is measured.
Image: http://learn.genetics.utah.edu/content/cloning/whyclone/images/clones.jpg

Types of Code Clones
- Type 1 – Identical
- Type 2 – Different variable names/values
- Type 3 – May have added, deleted, or altered statements due to editing
- Type 4 – Semantic: same function but different structure or syntax

Why do code clones matter?
- Code clones increase maintenance costs
- Inconsistent changes lead to bugs [1]
- “Nearly every second unintentionally inconsistent change to a code clone leads to a fault” [2]
- As a project grows in size, unintentional code clones become more likely to appear [3]
[1] Chanchal K. Roy, James R. Cordy, Rainer Koschke, Comparison and evaluation of code clone detection techniques and tools: A qualitative approach, Sci. Comput. Program., Vol.74, No.7, pp.470-497 (2007).
[2] Elmar Juergens, Florian Deissenboeck, Benjamin Hummel, Stefan Wagner, Do code clones matter?, In Proceedings of the 31st International Conference on Software Engineering (ICSE ’09), pp.485-495 (2009).
[3] Michel Dagenais, Ettore Merlo, Bruno Laguë, Daniel Proulx, Clones occurrence in large object oriented software packages, In Proceedings of the 8th IBM Centre for Advanced Studies Conference (CASCON ’98), pp.192-200 (1998).

Should we get rid of clones?
- Quantitative evaluation of code clones may help us decide
  - How much of the software system is made of code clones?
  - How much of the system size will be reduced if we merge all code clones?
- Code clone detection tools exist to answer the first question.

What is Merging?
- Merging – here, a kind of refactoring
- Code refactoring – restructuring existing code without changing its external behavior or final execution result [4]
- Code clone refactoring technique [5]:
  - Extract clones from the code
  - Create a shared function that contains the cloned portion
  - Create calls to that shared function
[4] Martin Fowler, Refactoring: Improving the Design of Existing Code, Addison-Wesley (1999).
[5] Yoshiki Higo, Toshihiro Kamiya, Shinji Kusumoto, Katsuro Inoue, Refactoring Support Based on Code Clone Analysis, In Proceedings of the 5th International Conference on Product Focused Software Process Improvement, pp.220-233 (2004).
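The three refactoring steps can be illustrated with a minimal Python sketch. The function names (`report_orders_*`, `report_refunds_*`, `total_with_tax`) and the 10% tax computation are hypothetical and are not taken from the slides; they only show an identical (Type 1) clone being merged into a shared function without changing external behavior.

```python
# Before merging: the same computation is cloned in two functions.
def report_orders_before(amounts):
    subtotal = sum(amounts)
    tax = subtotal * 0.1
    total = subtotal + tax
    return f"orders: {total:.2f}"


def report_refunds_before(amounts):
    subtotal = sum(amounts)
    tax = subtotal * 0.1
    total = subtotal + tax
    return f"refunds: {total:.2f}"


# After merging: the cloned portion is extracted into one shared function
# (hypothetical name), and each original site becomes a call to it.
def total_with_tax(amounts):
    subtotal = sum(amounts)
    tax = subtotal * 0.1
    return subtotal + tax


def report_orders_after(amounts):
    return f"orders: {total_with_tax(amounts):.2f}"


def report_refunds_after(amounts):
    return f"refunds: {total_with_tax(amounts):.2f}"
```

The shared function carries its own definition and return lines, which is the kind of overhead the size estimate has to account for when a clone is merged.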

Complete Code-Clone Merge
- How much of the system size will be reduced if we merge all code clones?
- Complete Code-Clone Merge (CCM) is an algorithm designed to help answer that question

CCM Explained
- We have a source file S with a line length of |S|
- Each code clone will have a unique ID.
- Each unique code clone will be extracted to a shared function.

CCM Explained
- Within S, each clone instance will be replaced with a call to its respective shared function.
- Merging all code clones creates S’ with a line length of |S’|
- We expect |S’| < |S|

Basic Case and Illustration
- |S| = 100 lines
- Recognize clones A and B.
- A = 15 lines, B = 10 lines
- POP of A = 2, POP of B = 2
  - POP (population) – number of times a clone appears
- Merge clones into individual shared functions

[Figure: CCM workflow. Clone detection software produces clone pair data, which CCM uses together with the source code S (|S| = 100 lines, lines 1 to 100, containing two instances each of clone A, 15 lines, and clone B, 10 lines). In S’ (lines 1 to 83), each clone instance is replaced by a 1-line function call, and A and B become shared functions, each with a 1-line initialization and a 1-line termination. |S’| = 83 lines.]

Basic Case and Illustration
Result Summary
  Initial Size |S|         100 lines
  Total Clone Length       50 lines
  Reduced Size |S’|        83 lines
  Lines of Code Reduced    17 lines
  Percent Reduction        17%

Basic Case and Illustration
Total Clone Length = sum of (code clone length x POP) over all unique code clones

  Clone ID      A     B     Total
  Lines         15    10
  POP           2     2
  Total Size    30    20    50

Basic Case and Illustration
Reduced Size |S’| = (|S| - Total Clone Length) + Total Function Calls + Total Shared Function Size
                  = 50 lines + 4 lines + 29 lines
                  = 83 lines

  Function (Clone ID)     A     B     Total
  Core Lines              15    10
  Initialization Lines    1     1
  Termination Lines       1     1
  Total Size              17    12    29

Note: Initialization and Termination may be configured to a value other than the 1-line default.
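The size formula for the basic (non-overlapping) case can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the prototype tool itself: the `Clone` class, the function name `estimate_merged_size`, and the `call_lines` parameter (one line per function call, matching the 4 lines used for the four clone instances) are names introduced here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clone:
    clone_id: str
    length: int  # lines in one instance of the clone
    pop: int     # population: number of times the clone appears


def estimate_merged_size(total_lines, clones, call_lines=1, init_lines=1, term_lines=1):
    """Estimate |S'| for non-overlapping clones (hypothetical helper).

    |S'| = (|S| - total clone length) + total function calls
           + total shared function size
    """
    total_clone_length = sum(c.length * c.pop for c in clones)
    total_function_calls = sum(c.pop * call_lines for c in clones)
    total_shared_function_size = sum(
        c.length + init_lines + term_lines for c in clones
    )
    return (
        (total_lines - total_clone_length)
        + total_function_calls
        + total_shared_function_size
    )


# Values from the basic case: |S| = 100, A = 15 lines, B = 10 lines, POP 2 each.
total = 100
clones = [Clone("A", 15, 2), Clone("B", 10, 2)]
reduced = estimate_merged_size(total, clones)
lines_reduced = total - reduced
percent_reduction = 100 * lines_reduced / total
print(reduced, lines_reduced, percent_reduction)  # 83 17 17.0
```

Keeping the initialization and termination costs as parameters mirrors the note that these values are configurable.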

Basic Case and Illustration
Lines of Code Reduced = |S| - |S’| = 100 - 83 = 17 lines

Basic Case and Illustration
Percent Reduction = (Lines of Code Reduced / |S|) x 100 = (17 lines / 100 lines) x 100 = 17%

Overlapping and Embedded Code Clones
[Diagram: source code S, lines 1 to 100, with clone instances B (15 lines), A (15 lines), and B (15 lines).]
- Sections of code identified as code clones that share a portion of their code with another unique code clone
- Not uncommon, so they must be accounted for.

Overlapping and Embedded Code Clones
- We can no longer simply create a shared function for each of A and B
- We decide to use the “Chunking Method”

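Overlapping and Embedded Code Clones
[Diagram: before chunking, S (|S| = 100, lines 1 to 100) contains clone instances B (15 lines), A (15 lines), and B (15 lines), with a shared portion C (5 lines). After chunking, the same regions are divided into B’ (10 lines), A’ (10 lines), B’ (10 lines), and C (5 lines).]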

Overlapping and Embedded Code Clones
[Diagram: S, lines 1 to 100, divided into chunks B’ (10 lines), A’ (10 lines), B’ (10 lines), and C (5 lines).]
- After creating “chunks”, we can create a shared method for each
- Create calls as normal
- Overlaps increase the number of lines required in S’
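The slides do not spell out how chunks are computed, so the following Python sketch is only one plausible reading of the chunking idea: cut every clone instance at every boundary of any other instance, so that the resulting chunks never overlap. The function name `chunk_instances` and the line positions in the example are hypothetical; only the lengths (A and B at 15 lines each, sharing a 5-line portion C) follow the diagram.

```python
def chunk_instances(instances):
    """Split possibly overlapping line ranges into non-overlapping chunks.

    instances: list of (clone_id, start_line, end_line), end inclusive.
    Returns a list of (start_line, end_line, frozenset_of_clone_ids).
    This is an assumed interpretation, not the tool's actual algorithm.
    """
    # Every start line, and every line just past an end, is a cut point.
    cuts = sorted(
        {start for _, start, _ in instances}
        | {end + 1 for _, _, end in instances}
    )
    chunks = []
    for lo, next_lo in zip(cuts, cuts[1:]):
        hi = next_lo - 1
        # Between two consecutive cut points, an instance either covers
        # the whole segment or does not touch it at all.
        covering = frozenset(
            cid for cid, start, end in instances if start <= lo and hi <= end
        )
        if covering:
            chunks.append((lo, hi, covering))
    return chunks


# Hypothetical placement: B occupies lines 10-24, A occupies lines 20-34,
# so they share the 5 lines 20-24.
chunks = chunk_instances([("B", 10, 24), ("A", 20, 34)])
for lo, hi, ids in chunks:
    print(lo, hi, hi - lo + 1, sorted(ids))
```

In this example the three chunks have lengths 10, 5, and 10, corresponding to B’, C, and A’. Each chunk would then get its own shared method and calls, which is why overlaps add lines to the estimate.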

CCM Size Estimation Prototype Tool
- Tool used to estimate system size after merging all code clones.
- Tool uses CCFinderX as part of the required input [6]
  - Generates clone pair data used by the algorithm
- Source code S is also required input.
- Whitespace and comments are removed before running CCFinderX and the tool.
[6] CCFinderX Official site, http://www.ccfinder.net/.
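The slides do not say how whitespace and comments are removed. As a rough sketch of that preprocessing step, assuming Java- or C-style comments and assuming "whitespace" means blank lines and trailing spaces, something like the following could be used. The function name is hypothetical, and the regular expressions are deliberately naive: they would also strip comment-like text inside string literals.

```python
import re


def strip_comments_and_blank_lines(source):
    """Naively remove /* ... */ and // comments, then drop blank lines."""
    # Block comments may span lines, so let "." match newlines too.
    no_block = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    # Line comments run to the end of the line.
    no_line = re.sub(r"//[^\n]*", "", no_block)
    # Keep only lines that still contain something besides whitespace.
    kept = [line.rstrip() for line in no_line.splitlines() if line.strip()]
    return "\n".join(kept)


sample = "int a = 1; // set a\n\n/* block\n comment */\nint b = 2;\n"
print(strip_comments_and_blank_lines(sample))
```

Normalizing the source this way makes |S| count only lines of code, so comment-only or blank lines do not distort the clone volume or the reduction estimate.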

Application of the Tool
- Three source code examples were used in applying the CCM prototype:
  - Multilap.java
  - Java JDK [7]
  - Quake Engine [8]
- Java JDK and Quake Engine were chosen for their large size.
[7] Java SE | Oracle Technology Network | Oracle, http://www.oracle.com/technetwork/java/javase. Java SE Development Kit 8, Update 77 Release Notes, http://www.oracle.com/technetwork/java/javase/8u77-relnotes-2944725.html.
[8] GitHub - id-Software/Quake: Quake GPL Source Release, https://github.com/id-Software/Quake. © 1992

Multilap.java
- A control example demonstrating multiple overlapping code clones.
- The calculations for it can be followed step by step in the paper.

Java JDK
- Code clone volume: calculated via (Total Clone Length / |S|) x 100

Result Summary (Java JDK 1.8.0_77-b03)
  Initial Size |S|         813,546 lines
  Total Clone Length       207,072 lines
  Code Clone Volume        25.45%
  Reduced Size |S’|        708,139 lines
  Lines of Code Reduced    105,407 lines
  Percent Reduction        12.96%
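The two percentages in the summary follow directly from the reported line counts. A small Python check, using only the figures from the table (the helper name `percent` is introduced here for illustration):

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


# Figures reported for Java JDK 1.8.0_77-b03.
initial_size = 813_546
total_clone_length = 207_072
reduced_size = 708_139

clone_volume = percent(total_clone_length, initial_size)
lines_reduced = initial_size - reduced_size
percent_reduction = percent(lines_reduced, initial_size)

print(f"{clone_volume:.2f}% {lines_reduced} {percent_reduction:.2f}%")
# 25.45% 105407 12.96%
```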

Java JDK
- Code clone volume: approx. 25%
- Most common POP is 2
- If every clone had a POP of 2, the expected percent reduction would be about half of the code clone volume (12.73%)
- Actual reduction: 12.96%
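The "about half" expectation can be made explicit with a short derivation. It assumes the default costs from the basic case (1 line per function call, 1 line each for initialization and termination) and a non-overlapping clone of length $\ell$ lines with population $p$; the symbols $\ell$ and $p$ are introduced here for illustration.

```latex
\begin{align*}
\text{lines before merge} &= p\,\ell \\
\text{lines after merge}  &= \underbrace{p}_{\text{calls}}
                             + \underbrace{\ell + 2}_{\text{shared function}} \\
\text{lines saved}        &= p\,\ell - (p + \ell + 2) \\
p = 2:\quad \text{lines saved} &= \ell - 4,
  \qquad \frac{\ell - 4}{2\ell} = \frac{1}{2} - \frac{2}{\ell}
  \approx \frac{1}{2} \ \text{for large } \ell \\
\text{expected reduction (Java JDK)} &\approx \tfrac{1}{2} \times 25.45\% \approx 12.73\%
\end{align*}
```

As a check against the basic case, A ($\ell = 15$) saves 11 lines and B ($\ell = 10$) saves 6 lines, 17 lines in total. Under the same assumptions, a clone with POP greater than 2 saves more than half of its total clone length, which is one possible reason the actual reduction can slightly exceed the POP-2 estimate.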

Quake Engine
Result Summary
  Initial Size |S|         216,722 lines
  Total Clone Length       49,098 lines
  Code Clone Volume        22.66%
  Reduced Size |S’|        194,324 lines
  Lines of Code Reduced    22,398 lines
  Percent Reduction        10.33%

Quake Engine
- Code clone volume: approx. 22.66%
- POP 2 is again most frequent, although to a lesser extent.
- Expected reduction: 11.33%
- Actual reduction: 10.33%

Conclusions
- Quantitative evaluation:
  - What percentage of the source code could theoretically be reduced?
- Application results seem reasonable
  - Analyzing the POP frequencies, the reduction seems consistent with what is expected
  - Code clones with a POP value of 2 were the most common in the large sources analyzed by the prototype
