1 Evaluating Code Duplication Detection Techniques Filip Van Rysselberghe and Serge Demeyer Lab On Re-Engineering University Of Antwerp Towards a Taxonomy.

Slides:



Advertisements
Similar presentations
Duplicate code detection using Clone Digger Peter Bulychev Lomonosov Moscow State University CS department.
Advertisements

R2PL, Pittsburgh November 10, 2005 Copyright © Fraunhofer IESE 2005 Analyzing the Product Line Adequacy of Existing Components Jens Knodel
Alternate Software Development Methodologies
Developer Testing and Debugging. Resources Code Complete by Steve McConnell Code Complete by Steve McConnell Safari Books Online Safari Books Online Google.
1 Exponential Distribution and Reliability Growth Models Kan Ch 8 Steve Chenoweth, RHIT Right: Wait – I always thought “exponential growth” was like this!
Reverse Engineering © SERG Code Cloning: Detection, Classification, and Refactoring.
Paper Title Your Name CMSC 838 Presentation. CMSC 838T – Presentation Motivation u Problem paper is trying to solve  Characteristics of problem  … u.
Memories of Bug Fixes Sunghun Kim, Kai Pan, and E. James Whitehead Jr., University of California, Santa Cruz Presented By Gleneesha Johnson CMSC 838P,
Metamorphic Malware Research
7. Duplicated Code Metrics Duplicated Code Software quality
RIT Software Engineering
SE 450 Software Processes & Product Metrics 1 Defect Removal.
Object-Oriented Reengineering Patterns and Techniques Prof. O. Nierstrasz Prof. S. Ducasse T.
Applied Software Project Management Andrew Stellman & Jennifer Greene Applied Software Project Management Applied Software.
Near-Duplicate Detection by Instance-level Constrained Clustering Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie.
Refactoring Support Tool: Cancer Yoshiki Higo Osaka University.
PROJECT EVALUATION. Introduction Evaluation  comparing a proposed project with alternatives and deciding whether to proceed with it Normally carried.
On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach Published by Steven L. Salzberg Presented by Prakash Tilwani MACS 598 April 25 th.
Engineering Design Process
Software Life Cycle Model
1 Functional Testing Motivation Example Basic Methods Timing: 30 minutes.
Jieming Zhu 1, Pinjia He 1, Qiang Fu 2, Hongyu Zhang 3, Michael R. Lyu 1, Dongmei Zhang 3 1 The Chinese University of Hong Kong, Hong Kong 2 Microsoft,
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
Reverse Engineering State Machines by Interactive Grammar Inference Neil Walkinshaw, Kirill Bogdanov, Mike Holcombe, Sarah Salahuddin.
(1) Automated Quality Assurance Philip Johnson Collaborative Software Development Laboratory Information and Computer Sciences University of Hawaii Honolulu.
Database Design - Lecture 2
CSCE 548 Code Review. CSCE Farkas2 Reading This lecture: – McGraw: Chapter 4 – Recommended: Best Practices for Peer Code Review,
Mining and Analysis of Control Structure Variant Clones Guo Qiao.
« Performance of Compressed Inverted List Caching in Search Engines » Proceedings of the International World Wide Web Conference Commitee, Beijing 2008)
CMCD: Count Matrix based Code Clone Detection Yang Yuan and Yao Guo Key Laboratory of High-Confidence Software Technologies (Ministry of Education) Peking.
1 Experience-Driven Process Improvement Boosts Software Quality © Software Quality Week 1996 Experience-Driven Process Improvement Boosts Software Quality.
ISV Innovation Presented by ISV Innovation Presented by Business Intelligence Fundamentals: Data Cleansing Ola Ekdahl IT Mentors 9/12/08.
Copyright © 2015 – Curt Hill Version Control Systems Why use? What systems? What functions?
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University Inoue Laboratory Eunjong Choi 1 Investigating Clone.
1 A Heuristic Approach Towards Solving the Software Clustering Problem ICSM03 Brian S. Mitchell /
ReiserFS Hans Reiser
University of Waterloo How does your software grow? Evolution and architectural change in open source software Michael Godfrey Software Architecture Group.
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University Retrieving Similar Code Fragments based on Identifier.
EXPLOITING DYNAMIC VALIDATION FOR DOCUMENT LAYOUT CLASSIFICATION DURING METADATA EXTRACTION Kurt Maly Steven Zeil Mohammad Zubair WWW/Internet 2007 Vila.
Compression of Real-Time Cardiac MRI Video Sequences EE 368B Final Project December 8, 2000 Neal K. Bangerter and Julie C. Sabataitis.
Detecting Group Differences: Mining Contrast Sets Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Yan-Cheng Lin.
Chapter 3: Software Project Management Metrics
International Workshop on Principles of Software Evolution1Vienna, 11 September 2001 Evolution Metrics Tom Mens Programming Technology Lab Vrije Universiteit.
ANKITHA CHOWDARY GARAPATI
Duplicate code detection using anti-unification Peter Bulychev Moscow State University Marius Minea Institute eAustria, Timisoara.
CSc 461/561 Information Systems Engineering Lecture 5 – Software Metrics.
Application of Design Heuristics in the Designing and Implementation of Object Oriented Informational Systems.
Chapter 5: Software Re-Engineering Omar Meqdadi SE 3860 Lecture 5 Department of Computer Science and Software Engineering University of Wisconsin-Platteville.
CISC Machine Learning for Solving Systems Problems Presented by: Suman Chander B Dept of Computer & Information Sciences University of Delaware Automatic.
A Study of Cloning in the Linux SCSI Drivers Wei Wang, Michael W. Godfrey Software Architecture Group David R. Cheriton School of Computer Science University.
LogTree: A Framework for Generating System Events from Raw Textual Logs Liang Tang and Tao Li School of Computing and Information Sciences Florida International.
1 Measuring Similarity of Large Software System Based on Source Code Correspondence Tetsuo Yamamoto*, Makoto Matsushita**, Toshihiro Kamiya***, Katsuro.
1 Experience from Studies of Software Maintenance and Evolution Parastoo Mohagheghi Post doc, NTNU-IDI SEVO Seminar, 16 March 2006.
© Bennett, McRobb and Farmer 2005
University of Waterloo Four “interesting” ways in which history can teach us about software Michael W. Godfrey * Xinyi Dong Cory Kapser Lijie Zou Software.
Refactoring as a Lifeline : Lessons Learned from Refactoring Amr Noaman Abdel-Hamid Software Engineering Competence Center ( SECC ) IT Indeustry Development.
Aiding Comprehension of Cloning Through Categorization Cory Kapser and Michael W. Godfrey Software Architecture Group School of Computer Science, University.
(6) Estimating Computer’s efficiency Software Estimation The objective of Software Estimation is to provide the skills needed to accurately predict the.
1 Modeling the Search Landscape of Metaheuristic Software Clustering Algorithms Dagstuhl – Software Architecture Brian S. Mitchell
The PLA Model: On the Combination of Product-Line Analyses 강태준.
Opinion spam and Analysis 소프트웨어공학 연구실 G 최효린 1 / 35.
YAHMD - Yet Another Heap Memory Debugger
Why Every Dev. Team Needs Static Analysis
Simple metrics to assess code quantity and quality
IM-pack: Software Installation Using Disk Images
Systematic Manual Testing
CBCD: Cloned Buggy Code Detector
Software engineering – 1
Design and Programming
Refactoring Support Tool: Cancer
Presentation transcript:

1 Evaluating Code Duplication Detection Techniques Filip Van Rysselberghe and Serge Demeyer Lab On Re-Engineering University Of Antwerp Towards a Taxonomy of Clones in Source Code: A Case Study Cory J. Kapser and Michael W. Godfrey Software Architecture Group University of Waterloo

2 Duplicated Code (a.k.a. code clone) Code duplication occurs when developers systematically copy previously existing code which solved a problem similar to the one they are currently trying to solve. Typically 5% to 10% of code, up to 50%. Variety of reasons duplication occurs.

3 Associated Problems Errors can be difficult to fix. Change in requirements may be difficult to implement. Code size unnecessarily increased. Can lead to unused, dead code. Can be indicative of design problems. Bugs may be copied as well.

4 Evaluating Duplicated Code Detection Techniques Authors set out to evaluate the qualities of several clone detection techniques and determine where they fit best into the software maintenance process. Compares 3 representative techniques on 5 small to medium size cases.

5 Duplication Detection Techniques Authors suggest there are three groups of methods of detecting duplicated code: –String based –Token based –Parse-tree based

6 Research Structure Goal Questions Experimental Setup

7 Selected Cases ScoreMaster TextEdit Brahms Jmocha JavaParser of JMetric

8 Results: Portability Simple line matching most portable. Parameterized line matching and suffix tree matching are fairly portable. Metric based matching least portable.

9 Results: What Kind of Matches Found? Metrics based approach find function block duplication. Simple string matching finds equal lines. Parameterized line matching finds duplicated lines. Suffix tree matching finds duplicated series of tokens.

10 Results: Accuracy Number of false matches: –Parameterized suffix tree matching and simple line matching find no false matches. –Parameterized line matching finds few false matches. –Metrics based matching finds many false positives when applying metrics to block fragments, only a few when applying to methods.

11 Results: Accuracy Number of useless matches: –Both parameterized methods returned low amounts of useless matches. –Metrics found more useless matches, 133 out of 138 in TextEdit when applying metrics to methods. –Simple line matching finds many, 229 useless matches in TextEdit.

12 Results: Accuracy Number of recognizable matches –Metric fingerprints is very high. –Parameterized matching techniques return less recognizable matches. –Simple string match returns the lowest.

13 Results: Performance

14 Conclusions Based on comparing the 3 representative duplication detection techniques, the following conclusions were drawn: –Simple line matching is suitable for problem detection and assessment. –Parameterized matching will work well with fine-grained refactoring tools. –Metric Fingerprints will work well with method level refactoring techniques. Have shown that each technique has specific advantages and disadvantages. Have laid the ground work for a systemic approach to detecting and removing clones.

15 Toward a Taxonomy of Clones Aim to profile cloning as it occurs in the real world and generate a taxonomy of types of code duplications. This will give us insight into how and why developers duplicate code, and aid the effort in developing clone detection techniques and tools.

16 The Study Performed on the Linux kernel file- system subsystem. –Consists of 538.c and.h files, 279,118 LOC. –42 file system implementations. –Layered design.

17 Study Methods Used parameterized string matching and metrics based detection to gather clones. Manually inspected clones returned from the detection tools and created the current taxonomy. Generated scripts to classify each clone into one of clone types, and again manually inspected these results.

18 Taxonomy of Clones Duplicated blocks within the same function. Cloned blocks across functions, files and directories. Similar functions, same file. Functions cloned between files in the same directory. Functions cloned across directories. Cloned files. Initialization and finalization clones.

19 Results 12% of the Linux kernel file-system code is involved in code duplication. Detected 3116 clone pairs, with an average length is 13.5 lines. 78% of cloning occurs in the same directory.

20 Locality of Clone Pairs

21 Frequency of Clone Types

22 Families of File Systems ext2 and ext3 highly related. Intermezzo cloned much from the main file-system code and Coda. Jffs has cloned much from inflate_fs, most of the clones were put into 1 file.

23 Visualization of Cloning Without Showing Same Directory Clones

24 Metrics Vs. String Matching

25 Conclusions We have begun to build a taxonomy of code clones in software. Cloning activity in the Linux kernel file-system subsystem is at a non-trivial rate. Cloning most commonly occurs within a subsystem. Parameterized string matching provides an interesting and powerful method for function duplication detection. 3D visualization provided an interesting method of viewing clones amongst subsystems.

26 Importance of this Work Lots of clone detection methods out there, few comparisons. What we catch and what we miss is unclear.