Assessing the Refactorability of Software Clones

N. Tsantalis, D. Mazinanian and G. P. Krishnan
Department of Computer Science & Software Engineering
Clone images created by Rebecca Tiarks et al.

Some clones need to be refactored
Motivation
Clones may be harmful:
- Clones are associated with error-proneness due to inconsistent updates (Juergens et al., ICSE’09).
- Clones significantly increase the maintenance effort and cost (Lozano et al., ICSM’08).
- Clones are change-prone (Mondal et al. 2012).
Some studies have shown that clones are stable, but there is evidence that clones may be harmful. So, what support can we get from clone refactoring tools?

Motivation (cont'd): Tools should be able to refactor more clones
Current refactoring tools perform poorly. A study by Tairas & Gray [IST’12] on Type-II clones detected by Deckard in 9 open-source projects revealed that:
- only 10.6% of them could be refactored by Eclipse;
- CeDAR [IST’12] was able to refactor 18.7% of them.
Our position is that tools should be able to refactor more clones.

Limitation #1
Current tools can parameterize only a small subset of the differences in clones, mostly differences between variable identifiers, literals, and simple method calls.
In this example, two clones call the same constructor, Rectangle, but differ in the arguments they pass: the method call getHeight() in one clone is replaced with the infix expression “high - low” in the other. This case is not supported by current clone refactoring tools.
Clone #1: Rectangle rectangle = new Rectangle(a, b, c, high - low);
Clone #2: Rectangle rectangle = new Rectangle(a, b, c, getHeight());
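As a rough illustration of what parameterizing such a difference could look like, the sketch below turns the differing fourth argument into a parameter of an extracted method. The Rectangle class here is a local stand-in rather than any real library type, and the class name RectangleClones, the method name createRectangle, and the field values are all hypothetical.

```java
// Hypothetical stand-in for the Rectangle type shown on the slide.
final class Rectangle {
    final int x, y, width, height;

    Rectangle(int x, int y, int width, int height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

class RectangleClones {
    private final int low = 2;
    private final int high = 10;

    int getHeight() {
        return 5;
    }

    // Unified code: the differing argument expression becomes a parameter.
    private static Rectangle createRectangle(int a, int b, int c, int height) {
        return new Rectangle(a, b, c, height);
    }

    // Former Clone #1: passes the infix expression as the argument.
    Rectangle first(int a, int b, int c) {
        return createRectangle(a, b, c, high - low);
    }

    // Former Clone #2: passes the method call as the argument.
    Rectangle second(int a, int b, int c) {
        return createRectangle(a, b, c, getHeight());
    }
}
```

In this sketch the argument expression is evaluated at the call site, before the unified code runs, so it only preserves behavior when that earlier evaluation does not change what the expression yields.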

Limitation #2
Current approaches may return non-optimal matching solutions.
- They do not explore the entire search space of possible matches.
- In case of multiple possible matches, they select the “first” or “best” match.
- They face scalability issues due to the problem of combinatorial explosion.

NOT APPROVED: 24 differences

Clone #1:
if (orientation == VERTICAL) {
    Line2D line = new Line2D.Double();
    double y0 = dataArea.getMinY();
    double y1 = dataArea.getMaxY();
    g2.setPaint(im.getOutlinePaint());
    g2.setStroke(im.getOutlineStroke());
    if (range.contains(start)) {
        line.setLine(start2d, y0, start2d, y1);
        g2.draw(line);
    }
    if (range.contains(end)) {
        line.setLine(end2d, y0, end2d, y1);
else if (orientation == HORIZONTAL) {
    double x0 = dataArea.getMinX();
    double x1 = dataArea.getMaxX();
    line.setLine(x0, start2d, x1, start2d);
    line.setLine(x0, end2d, x1, end2d);

Clone #2:
if (orientation == VERTICAL) {
    Line2D line = new Line2D.Double();
    double x0 = dataArea.getMinX();
    double x1 = dataArea.getMaxX();
    g2.setPaint(im.getOutlinePaint());
    g2.setStroke(im.getOutlineStroke());
    if (range.contains(start)) {
        line.setLine(x0, start2d, x1, start2d);
        g2.draw(line);
    }
    if (range.contains(end)) {
        line.setLine(x0, end2d, x1, end2d);
else if (orientation == HORIZONTAL) {
    double y0 = dataArea.getMinY();
    double y1 = dataArea.getMaxY();
    line.setLine(start2d, y0, start2d, y1);
    line.setLine(end2d, y0, end2d, y1);

Here we can see two clones (in JFreeChart) as matched by current clone unification techniques. There are several differences between the matched statements, which make the refactoring of the clones more difficult. For each difference a parameter has to be added to the extracted method, and in some cases there are differences that cannot be parameterized. So, this matching solution is not acceptable for the purpose of refactoring.

Clone #1:
if (orientation == VERTICAL) {
    Line2D line = new Line2D.Double();
    double y0 = dataArea.getMinY();
    double y1 = dataArea.getMaxY();
    g2.setPaint(im.getOutlinePaint());
    g2.setStroke(im.getOutlineStroke());
    if (range.contains(start)) {
        line.setLine(start2d, y0, start2d, y1);
        g2.draw(line);
    }
    if (range.contains(end)) {
        line.setLine(end2d, y0, end2d, y1);
}
else if (orientation == HORIZONTAL) {
    Line2D line = new Line2D.Double();
    double x0 = dataArea.getMinX();
    double x1 = dataArea.getMaxX();
    g2.setPaint(im.getOutlinePaint());
    g2.setStroke(im.getOutlineStroke());
    if (range.contains(start)) {
        line.setLine(x0, start2d, x1, start2d);
        g2.draw(line);
    }
    if (range.contains(end)) {
        line.setLine(x0, end2d, x1, end2d);
}

Clone #2:
if (orientation == VERTICAL) {
    Line2D line = new Line2D.Double();
    double x0 = dataArea.getMinX();
    double x1 = dataArea.getMaxX();
    g2.setPaint(im.getOutlinePaint());
    g2.setStroke(im.getOutlineStroke());
    if (range.contains(start)) {
        line.setLine(x0, start2d, x1, start2d);
        g2.draw(line);
    }
    if (range.contains(end)) {
        line.setLine(x0, end2d, x1, end2d);
}
else if (orientation == HORIZONTAL) {
    Line2D line = new Line2D.Double();
    double y0 = dataArea.getMinY();
    double y1 = dataArea.getMaxY();
    g2.setPaint(im.getOutlinePaint());
    g2.setStroke(im.getOutlineStroke());
    if (range.contains(start)) {
        line.setLine(start2d, y0, start2d, y1);
        g2.draw(line);
    }
    if (range.contains(end)) {
        line.setLine(end2d, y0, end2d, y1);
}

If we take a closer look at the clones, we notice that the bodies of the if statements are exactly the same, but in symmetric (swapped) positions.

APPROVED: 2 differences

Clone #1:
if (orientation == VERTICAL) {
    Line2D line = new Line2D.Double();
    double y0 = dataArea.getMinY();
    double y1 = dataArea.getMaxY();
    g2.setPaint(im.getOutlinePaint());
    g2.setStroke(im.getOutlineStroke());
    if (range.contains(start)) {
        line.setLine(start2d, y0, start2d, y1);
        g2.draw(line);
    }
    if (range.contains(end)) {
        line.setLine(end2d, y0, end2d, y1);
else if (orientation == HORIZONTAL) {
    double x0 = dataArea.getMinX();
    double x1 = dataArea.getMaxX();
    line.setLine(x0, start2d, x1, start2d);
    line.setLine(x0, end2d, x1, end2d);

Clone #2:
if (orientation == HORIZONTAL) {
    Line2D line = new Line2D.Double();
    double y0 = dataArea.getMinY();
    double y1 = dataArea.getMaxY();
    g2.setPaint(im.getOutlinePaint());
    g2.setStroke(im.getOutlineStroke());
    if (range.contains(start)) {
        line.setLine(start2d, y0, start2d, y1);
        g2.draw(line);
    }
    if (range.contains(end)) {
        line.setLine(end2d, y0, end2d, y1);
else if (orientation == VERTICAL) {
    double x0 = dataArea.getMinX();
    double x1 = dataArea.getMaxX();
    line.setLine(x0, start2d, x1, start2d);
    line.setLine(x0, end2d, x1, end2d);

As a result, we can find a better matching for the clones by parameterizing the differences in the conditional expressions of the “if” statements. This solution is easier to refactor, since fewer differences have to be parameterized.
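To make the two-difference matching concrete, here is a reduced sketch of a unified method in which the two conditional expressions are the only parameterized differences. It keeps just the start line and drops the drawing calls, so it is not the JFreeChart code itself. The class name MarkerLineSketch, the method names, and the integer values of the orientation constants are assumptions made for illustration.

```java
import java.awt.geom.Line2D;
import java.awt.geom.Rectangle2D;

class MarkerLineSketch {
    // Hypothetical constants standing in for the orientation values on the slide.
    static final int VERTICAL = 0;
    static final int HORIZONTAL = 1;

    // Unified body: 'first' and 'second' replace the two differing constants
    // in the conditional expressions.
    static Line2D startLine(int orientation, int first, int second,
                            Rectangle2D dataArea, double start2d) {
        Line2D line = new Line2D.Double();
        if (orientation == first) {
            double y0 = dataArea.getMinY();
            double y1 = dataArea.getMaxY();
            line.setLine(start2d, y0, start2d, y1);
        } else if (orientation == second) {
            double x0 = dataArea.getMinX();
            double x1 = dataArea.getMaxX();
            line.setLine(x0, start2d, x1, start2d);
        }
        return line;
    }

    // Former Clone #1 compares against VERTICAL first, then HORIZONTAL.
    static Line2D clone1(int orientation, Rectangle2D dataArea, double start2d) {
        return startLine(orientation, VERTICAL, HORIZONTAL, dataArea, start2d);
    }

    // Former Clone #2 compares against HORIZONTAL first, then VERTICAL.
    static Line2D clone2(int orientation, Rectangle2D dataArea, double start2d) {
        return startLine(orientation, HORIZONTAL, VERTICAL, dataArea, start2d);
    }
}
```

With the 24-difference matching, the same extraction would need a parameter for each mismatched expression, which is why the symmetric matching is the one worth refactoring.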

Minimizing differences
Minimizing the differences during the matching process is critical for refactoring. It directly affects the number of parameters that have to be introduced in the extracted method containing the common functionality, as well as the feasibility of the refactoring transformation.
- Fewer differences mean fewer parameters for the extracted method (i.e., a more reusable method).
- Fewer differences also mean a lower probability of precondition violations (i.e., higher refactoring feasibility).
Matching process objectives:
- Maximize the number of matched statements.
- Minimize the number of differences between them.
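One simple way to combine the two objectives is a lexicographic ranking: prefer more matched statements, and break ties by fewer differences. The sketch below assumes that ordering; the slides state the two objectives but not how they are weighed against each other. The class names and the statement count of 16 are made up for illustration, while 24 and 2 are the difference counts from the JFreeChart example.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class MappingSolution {
    final String label;
    final int mappedStatements;
    final int differences;

    MappingSolution(String label, int mappedStatements, int differences) {
        this.label = label;
        this.mappedStatements = mappedStatements;
        this.differences = differences;
    }
}

final class SolutionRanking {
    // More mapped statements first; ties are broken by fewer differences.
    static final Comparator<MappingSolution> BEST_FIRST =
            Comparator.comparingInt((MappingSolution s) -> s.mappedStatements)
                    .reversed()
                    .thenComparingInt(s -> s.differences);

    static MappingSolution best(List<MappingSolution> candidates) {
        return candidates.stream()
                .min(BEST_FIRST)
                .orElseThrow(() -> new IllegalArgumentException("no candidates"));
    }

    public static void main(String[] args) {
        // 16 is a hypothetical statement count; 24 and 2 come from the slides.
        List<MappingSolution> candidates = Arrays.asList(
                new MappingSolution("positional", 16, 24),
                new MappingSolution("symmetric", 16, 2));
        System.out.println(best(candidates).label);
    }
}
```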

Limitation #3
There are no preconditions to determine whether clones can be safely refactored.
- The parameterization of differences might change the behavior of the program.
- Statements in gaps need to be moved before the cloned code.
- Changing the order of statements might also affect the behavior of the program.

Our goal
Improve the state of the art in the refactoring of software clones. Given two code fragments containing clones:
1. Find potential control structures that can be refactored. The clones may not necessarily extend over the entire methods, and the methods may not necessarily have the same control structure.
2. Find an optimal mapping between the statements of the two clones. An optimal mapping should have the maximum number of mapped statements with the minimum number of differences between them.
3. Make sure that the refactoring of the clones will preserve program behavior. This requires defining preconditions to be examined before the refactoring is applied.
4. Find the most appropriate refactoring strategy to eliminate the clones. Based on the location of the clones (i.e., same or different classes) and their particular characteristics (i.e., gaps), we can decide to Extract Method, Extract and Pull Up Method in a common superclass, apply the Template Method design pattern to handle gaps, or extract a static method in a utility class (if the clones do not access instance variables or methods).

Our approach
[Figure: three-phase pipeline: Control Structure Matching, PDG Mapping, Precondition Examination. Labels on the arrows between phases: isomorphic CDT pairs; differences; unmapped statements.]
Our approach takes as input clone fragments (detected by tools) or entire methods containing clones. First, it extracts the control structure of the input code fragments and tries to find matching subtrees. Next, for each pair of matched subtrees, it generates the corresponding PDG subgraphs and tries to find an optimal mapping. In the final step, the differences resulting from the mapping solution are examined against some preconditions. If all preconditions are satisfied, the clones can be refactored.

Phase 1: Control Structure Matching
Intuition: two pieces of code can be merged only if they have an identical control structure. We extract the Control Dependence Trees (CDTs) representing the control structure of the input methods or clones. We find all non-overlapping largest common subtrees within the CDTs. Each subtree match will be treated as a separate refactoring opportunity.

CDT Subtree Matching
[Figure: CDT of Fragment #1 and CDT of Fragment #2; node labels: x, A, a, y, B, C, D, E, F, G, f, g, d, e]
Assuming that these are the CDTs of two code fragments, the largest common subtree corresponds to the highlighted nodes.
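The idea of looking for a common subtree can be sketched with a deliberately simplified matcher. The version below only recognizes complete subtrees whose node kinds and child order agree exactly, and it returns a single largest pair, whereas the slides describe finding all non-overlapping largest common subtrees. The CdtNode and CdtMatcher classes and the node kinds are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CdtNode {
    final String kind;              // e.g. "if", "for", "while"
    final List<CdtNode> children;

    CdtNode(String kind, CdtNode... children) {
        this.kind = kind;
        this.children = Arrays.asList(children);
    }

    // Number of nodes in the subtree rooted here.
    int size() {
        int n = 1;
        for (CdtNode c : children) {
            n += c.size();
        }
        return n;
    }
}

class CdtMatcher {
    // Two subtrees match when the kinds agree and the children match pairwise, in order.
    static boolean sameStructure(CdtNode a, CdtNode b) {
        if (!a.kind.equals(b.kind) || a.children.size() != b.children.size()) {
            return false;
        }
        for (int i = 0; i < a.children.size(); i++) {
            if (!sameStructure(a.children.get(i), b.children.get(i))) {
                return false;
            }
        }
        return true;
    }

    private static void collect(CdtNode n, List<CdtNode> out) {
        out.add(n);
        for (CdtNode c : n.children) {
            collect(c, out);
        }
    }

    // Returns {root1, root2} of the largest structurally identical subtrees, or null if none.
    static CdtNode[] largestCommonSubtree(CdtNode t1, CdtNode t2) {
        List<CdtNode> nodes1 = new ArrayList<>();
        List<CdtNode> nodes2 = new ArrayList<>();
        collect(t1, nodes1);
        collect(t2, nodes2);
        CdtNode[] best = null;
        for (CdtNode a : nodes1) {
            for (CdtNode b : nodes2) {
                if (sameStructure(a, b) && (best == null || a.size() > best[0].size())) {
                    best = new CdtNode[] {a, b};
                }
            }
        }
        return best;
    }
}
```

Comparing every node pair is quadratic in the tree sizes, which is acceptable for a sketch because control dependence trees of single methods tend to be small.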

Phase 2: PDG Mapping
We extract the PDG subgraphs corresponding to the matched CDT subtrees. We want to find the common subgraph that satisfies two conditions:
- It has the maximum number of matched nodes.
- The matched nodes have the minimum number of differences.
This is an optimization problem that can be solved using an adaptation of a Maximum Common Subgraph algorithm [McGregor, 1982].

MCS Algorithm
- Builds a search tree in depth-first order, where each node represents a state of the search space.
- Explores the entire search space.
- Has exponential worst-case complexity.
- As the number of possible matching node combinations increases, the width of the search tree grows rapidly (combinatorial explosion).
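The depth-first exploration can be illustrated with a brute-force backtracking search. This is a sketch in the same spirit, not McGregor's algorithm and not the adaptation used by the authors: it has no pruning, it treats two statements as differing whenever their text differs, and the class McsSketch with its adjacency-matrix graphs is an assumption made for the example.

```java
import java.util.Arrays;

class McsSketch {
    final String[] s1, s2;      // statement text of the two PDG subgraphs
    final boolean[][] e1, e2;   // e[u][v] is true when v depends on u
    int[] best;                 // best[i] = index in s2 mapped to s1[i], or -1
    int bestMapped = -1;
    int bestDiffs = Integer.MAX_VALUE;

    McsSketch(String[] s1, boolean[][] e1, String[] s2, boolean[][] e2) {
        this.s1 = s1;
        this.e1 = e1;
        this.s2 = s2;
        this.e2 = e2;
    }

    int[] solve() {
        int[] map = new int[s1.length];
        Arrays.fill(map, -1);
        search(0, map, new boolean[s2.length], 0, 0);
        return best;
    }

    // Each call is one state of the search tree; the whole space is explored.
    private void search(int i, int[] map, boolean[] used, int mapped, int diffs) {
        if (i == s1.length) {
            if (mapped > bestMapped || (mapped == bestMapped && diffs < bestDiffs)) {
                bestMapped = mapped;
                bestDiffs = diffs;
                best = map.clone();
            }
            return;
        }
        for (int j = 0; j < s2.length; j++) {
            if (used[j] || !consistent(i, j, map)) {
                continue;
            }
            map[i] = j;
            used[j] = true;
            search(i + 1, map, used, mapped + 1, diffs + (s1[i].equals(s2[j]) ? 0 : 1));
            map[i] = -1;
            used[j] = false;
        }
        search(i + 1, map, used, mapped, diffs);   // leave s1[i] unmapped
    }

    // The pair (i, j) must agree with every earlier pair on the edges between them.
    private boolean consistent(int i, int j, int[] map) {
        for (int k = 0; k < i; k++) {
            if (map[k] < 0) {
                continue;
            }
            if (e1[k][i] != e2[map[k]][j] || e1[i][k] != e2[j][map[k]]) {
                return false;
            }
        }
        return true;
    }
}
```

Every statement of the first graph can be paired with any unused statement of the second or left out, so the number of states grows exponentially with the graph sizes, which is the combinatorial explosion mentioned above.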

Divide-and-Conquer
We break the original matching problem into smaller sub-problems based on the control dependence structure of the clones. Starting from the control node pairs nested deepest in the control structure of the clones, we follow a bottom-up divide-and-conquer approach to find the best sub-solutions at each level. Finally, we combine the sub-solutions into a global solution to the original matching problem.

Bottom-up Divide-and-Conquer
[Figure: CDT subtree of Clone #1 (nodes A, B, C, D, E, F, G) and CDT subtree of Clone #2 (nodes a, b, c, d, e, f, g), Level 2. Best sub-solution from (D, d).]
Assuming that these are the two CDT subtrees that have been matched in the previous phase, our approach starts from node D of the first tree and tries to find all possible matching nodes (at the same level) in the second tree. Every pair of matched nodes is used as a starting point for the MCS algorithm, which returns a sub-solution. At the end, we select the best sub-solution and continue the same process in a bottom-up fashion.

Bottom-up Divide-and-Conquer
[Figure: CDT subtree of Clone #1 (nodes A, B, C, E, F, G) and CDT subtree of Clone #2 (nodes a, b, c, e, f, g), Level 2. Best sub-solution from (E, e).]
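The level-by-level pairing walked through on these slides can be sketched as follows. The scoring function is only a stand-in for an MCS run seeded with a node pair (it counts statements that appear under both control nodes), the choice per node is greedy, and all class names, node names, and statement labels are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ControlNode {
    final String name;
    final int level;                // nesting depth in the CDT subtree
    final Set<String> statements;   // statements directly nested under this node

    ControlNode(String name, int level, String... statements) {
        this.name = name;
        this.level = level;
        this.statements = new HashSet<>(Arrays.asList(statements));
    }
}

class BottomUpMatcher {
    // Stand-in for one MCS run seeded with (a, b): counts the shared statements.
    static int subSolutionScore(ControlNode a, ControlNode b) {
        Set<String> common = new HashSet<>(a.statements);
        common.retainAll(b.statements);
        return common.size();
    }

    // Pairs control nodes level by level, starting from the deepest level.
    static Map<String, String> match(List<ControlNode> t1, List<ControlNode> t2, int maxLevel) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Set<ControlNode> taken = new HashSet<>();
        for (int level = maxLevel; level >= 0; level--) {
            for (ControlNode a : t1) {
                if (a.level != level) {
                    continue;
                }
                ControlNode bestB = null;
                int bestScore = -1;
                // Try every still-unpaired node at the same level in the other tree.
                for (ControlNode b : t2) {
                    if (b.level != level || taken.contains(b)) {
                        continue;
                    }
                    int score = subSolutionScore(a, b);
                    if (score > bestScore) {
                        bestScore = score;
                        bestB = b;
                    }
                }
                if (bestB != null) {
                    taken.add(bestB);
                    pairs.put(a.name, bestB.name);
                }
            }
        }
        return pairs;
    }
}
```

Because each sub-problem only involves the statements under one pair of control nodes, the expensive search runs on small inputs instead of the whole clone.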

Phase 3: Precondition examination
Preconditions related to clone differences:
- Parameterization of differences should not break existing data dependences in the PDGs.
- Reordering of unmapped statements should not break existing data dependences in the PDGs.
Preconditions related to method extraction:
- The unified code should return one variable at most.
- Matched branching (break, continue) statements should be accompanied by the corresponding matched loops in the unified code.
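The two method-extraction preconditions lend themselves to a small sketch. It assumes that a variable must be returned when it is defined inside the unified code and used after it, and it encodes the statements as a flat list of kinds such as "loop-start" and "break". Both the encoding and the class ExtractionPreconditions are hypothetical simplifications, not the checks implemented by the authors.

```java
import java.util.List;
import java.util.Set;

class ExtractionPreconditions {
    // The unified code may hand back at most one variable to its caller.
    // Assumption: a variable escapes when it is defined inside and used afterwards.
    static boolean returnsAtMostOneVariable(Set<String> definedInside, Set<String> usedAfter) {
        int escaping = 0;
        for (String variable : definedInside) {
            if (usedAfter.contains(variable)) {
                escaping++;
            }
        }
        return escaping <= 1;
    }

    // Every break or continue must sit inside a loop that is part of the unified code.
    static boolean branchesStayInsideLoops(List<String> statementKinds) {
        int loopDepth = 0;
        for (String kind : statementKinds) {
            switch (kind) {
                case "loop-start":
                    loopDepth++;
                    break;
                case "loop-end":
                    loopDepth--;
                    break;
                case "break":
                case "continue":
                    if (loopDepth == 0) {
                        return false;
                    }
                    break;
                default:
                    break;
            }
        }
        return true;
    }
}
```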

Correctness Evaluation
Hypothesis: A clone pair that has been assessed as refactorable can be refactored without causing any compile errors, and all unit tests of the project pass after the refactoring is applied.

Correctness Evaluation
Collected 610 clone pairs covered by unit tests from the JFreeChart source directory:
- 45% of them belong to different classes of the same inheritance hierarchy
- 33% belong to the same method
- 17% belong to different methods of the same class
- 5% belong to different classes that are not part of the same inheritance hierarchy

Correctness Evaluation Results
- All 610 refactorings were applied without causing any compile errors.
- 13 refactorings led to a test failure related to serialization.
- The problem was caused by serialized fields being pulled up to the superclass.
- The problem was fixed by changing the refactoring mechanics.

Performance Evaluation
We collected 1,150,967 clone pairs from 9 open-source projects using 4 different clone detectors (CCFinder, CloneDR, Deckard, NiCad). We ran our analysis on every clone pair.

Execution time (x) in seconds    # Cases      %
x ≤ 1                            1,149,274    99.853
1 < x ≤ 10                       1,460        0.127
10 < x ≤ 100                     177          0.015
x > 100 *                        56           0.005
* Maximum 199.4

Empirical Study
RQ1: How does the source code type (production vs. test) of software clones affect their refactorability?
RQ2: How does the relative location of software clones affect their refactorability?
RQ3: How does the clone type of software clones affect their refactorability?
RQ4: How does the size of software clones affect their refactorability?
RQ5: What are the most frequent reasons (precondition violations) that hinder the refactoring of software clones?

RQ1 - Production vs Test Code
Clones in production code tend to be more refactorable than clones in test code. AST-based clone detection tools (i.e., CloneDR and Deckard) are more efficient in detecting refactorable clones in production code. CCFinder is more efficient in detecting refactorable clones in test code.

RQ2 - Relative Location
Clones with a close relative location (i.e., same method, type, or file) tend to be more refactorable than clones in distant locations (i.e., same hierarchy, or unrelated types). CloneDR is more efficient in detecting refactorable clones located in the same method and type, as well as in unrelated types.

RQ3 - Clone Types
- Type-1 clones: identical code fragments except for variations in whitespace, layout, and comments.
- Type-2 clones: structurally/syntactically identical fragments except for variations in identifiers, literals, and types, in addition to Type-1 differences.
- Type-3 clones: copied fragments with statements changed, added, or removed, in addition to Type-2 differences.

RQ3 - Clone Types
- Type-1 clones are more refactorable than Type-2 and Type-3 clones.
- There is a significant number of Type-3 clones that can be refactored as Type-2 clones by moving the unmapped statements before or after the mapped ones.
- AST-based clone detection tools (i.e., CloneDR and Deckard) are more efficient in detecting Type-2 and Type-3 refactorable clones in production code.
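A tiny, made-up example shows how moving an unmapped statement lets a Type-3 pair be extracted as if it had no gap. The class Type3Sketch, its methods, and the logging statement are hypothetical and are not taken from the studied projects.

```java
class Type3Sketch {
    final StringBuilder log = new StringBuilder();

    // Before refactoring (a Type-3 pair):
    //   Clone #1: total = 0; loop adding the values; return total;
    //   Clone #2: total = 0; log.append("sum;"); loop adding the values; return total;
    // The log statement is unmapped. It neither reads nor writes anything the
    // mapped statements touch, so it can be moved before the common code.

    // Common code extracted from both clones.
    private static int sum(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    int clone1(int[] values) {
        return sum(values);
    }

    int clone2(int[] values) {
        log.append("sum;");   // former gap statement, now ahead of the call
        return sum(values);
    }
}
```

If the unmapped statement had instead modified total, the move would break a data dependence and the reordering precondition would be violated.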

Why are some Type-1 clones not refactorable?
- The clones return more than one variable (49%).
- The clones return variables with different types (30%).
- The clones contain conditional return statements (16%).
- The clones contain branching statements (e.g., break, continue) without the corresponding loop (5%).

Refactorable Type-3 clones

RQ4 - Clone Size
Clones with a small size tend to be more refactorable than clones with a larger size. Deckard detects larger and more uniformly distributed (in terms of size) refactorable clones.

RQ5 – Precondition violations
The two most dominant reasons hindering the refactoring of clones are:
- the presence of variables in mapped statements having different types;
- the presence of unmapped statements that cannot be moved before or after the common code due to existing dependencies.

Visit our project at
