7. Duplicated Code Metrics Duplicated Code Software quality

Slides:



Advertisements
Similar presentations
CS 11 C track: lecture 7 Last week: structs, typedef, linked lists This week: hash tables more on the C preprocessor extern const.
Advertisements

Welcome to. Who am I? A better way to code Design Patterns ???  What are design patterns?  How many are there?  How do I use them?  When do I use.
File Processing - Indirect Address Translation MVNC1 Hashing Indirect Address Translation Chapter 11.
Reverse Engineering © SERG Code Cloning: Detection, Classification, and Refactoring.
OORPT Object-Oriented Reengineering Patterns and Techniques 7. Problem Detection Prof. O. Nierstrasz.
© S. Demeyer, S. Ducasse, O. Nierstrasz Duplication.1 7. Problem Detection Metrics  Software quality  Analyzing trends Duplicated Code  Detection techniques.
© S. Demeyer, S. Ducasse, O. Nierstrasz Duplication.1 7. Problem Detection Metrics  Software quality  Analyzing trends Duplicated Code  Detection techniques.
OORPT Object-Oriented Reengineering Patterns and Techniques 7. Problem Detection Prof. O. Nierstrasz.
Memories of Bug Fixes Sunghun Kim, Kai Pan, and E. James Whitehead Jr., University of California, Santa Cruz Presented By Gleneesha Johnson CMSC 838P,
Testing an individual module
Object-Oriented Reengineering Patterns Serge Demeyer Stéphane Ducasse Oscar Nierstrasz
7/14/2015EECS 584, Fall MapReduce: Simplied Data Processing on Large Clusters Yunxing Dai, Huan Feng.
1.3 Executing Programs. How is Computer Code Transformed into an Executable? Interpreters Compilers Hybrid systems.
Data Structures Introduction Phil Tayco Slide version 1.0 Jan 26, 2015.
Introducing Java.
REFACTORING Lecture 4. Definition Refactoring is a process of changing the internal structure of the program, not affecting its external behavior and.
Software Reengineering & Evolution Serge Demeyer Stéphane Ducasse Oscar Nierstrasz
Data Structures & AlgorithmsIT 0501 Algorithm Analysis I.
Database Technical Session By: Prof. Adarsh Patel.
1 CSC 221: Introduction to Programming Fall 2012 Functions & Modules  standard modules: math, random  Python documentation, help  user-defined functions,
©Ian Sommerville 2000 Software Engineering, 6th edition. Chapter 10Slide 1 Architectural Design l Establishing the overall structure of a software system.
Object-Oriented Reengineering Patterns 5. Problem Detection.
Computer Programming TCP1224 Chapter 3 Completing the Problem-Solving Process and Getting Started with C++
CMCD: Count Matrix based Code Clone Detection Yang Yuan and Yao Guo Key Laboratory of High-Confidence Software Technologies (Ministry of Education) Peking.
Cross Language Clone Analysis Team 2 October 27, 2010.
Rossella Lau Lecture 1, DCO10105, Semester B, DCO10105 Object-Oriented Programming and Design  Lecture 1: Introduction What this course is about:
Testing Methods Carl Smith National Certificate Year 2 – Unit 4.
Problem Solving Techniques. Compiler n Is a computer program whose purpose is to take a description of a desired program coded in a programming language.
COMPUTER PROGRAMMING. A Typical C++ Environment Phases of C++ Programs: 1- Edit 2- Preprocess 3- Compile 4- Link 5- Load 6- Execute Loader Primary Memory.
1 Evaluating Code Duplication Detection Techniques Filip Van Rysselberghe and Serge Demeyer Lab On Re-Engineering University Of Antwerp Towards a Taxonomy.
DATA STRUCTURE & ALGORITHMS (BCS 1223) NURUL HASLINDA NGAH SEMESTER /2014.
1 Ch. 1: Software Development (Read) 5 Phases of Software Life Cycle: Problem Analysis and Specification Design Implementation (Coding) Testing, Execution.
Chapter 9 Database Systems © 2007 Pearson Addison-Wesley. All rights reserved.
Logic Analyzer ECE-4220 Real-Time Embedded Systems Final Project Dallas Fletchall.
1 Compiler Design (40-414)  Main Text Book: Compilers: Principles, Techniques & Tools, 2 nd ed., Aho, Lam, Sethi, and Ullman, 2007  Evaluation:  Midterm.
1 Chapter 2 C++ Syntax and Semantics, and the Program Development Process.
1 Debugging and Syntax Errors in C++. 2 Debugging – a process of finding and fixing bugs (errors or mistakes) in a computer program.
Software Development Problem Analysis and Specification Design Implementation (Coding) Testing, Execution and Debugging Maintenance.
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Chapter 6: Analyzing and Interpreting Quantitative Data
Cross Language Clone Analysis Team 2 February 3, 2011.
LogTree: A Framework for Generating System Events from Raw Textual Logs Liang Tang and Tao Li School of Computing and Information Sciences Florida International.
 Software Clones:( Definitions from Wikipedia) ◦ Duplicate code: a sequence of source code that occurs more than once, either within a program or across.
Cross Language Clone Analysis Team 2 February 3, 2011.
The Hashemite University Computer Engineering Department
JavaScript Introduction and Background. 2 Web languages Three formal languages HTML JavaScript CSS Three different tasks Document description Client-side.
CHAPTER 3 COMPLETING THE PROBLEM- SOLVING PROCESS AND GETTING STARTED WITH C++ An Introduction to Programming with C++ Fifth Edition.
Tool Support for Testing Classify different types of test tools according to their purpose Explain the benefits of using test tools.
Introduction to Computer Programming Concepts M. Uyguroğlu R. Uyguroğlu.
Coupling and Cohesion Pfleeger, S., Software Engineering Theory and Practice. Prentice Hall, 2001.
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University A Metric-based Approach for Reconstructing Methods.
Compiler Design (40-414) Main Text Book:
Software Metrics 1.
Definition CASE tools are software systems that are intended to provide automated support for routine activities in the software process such as editing.
CSC 221: Computer Programming I Spring 2010
CSC 221: Computer Programming I Fall 2005
Revision Lecture
CBCD: Cloned Buggy Code Detector
Un</br>able’s MySecretSecrets
Design, prototyping and construction
Introduction to Database Management System
○Yuichi Semura1, Norihiro Yoshida2, Eunjong Choi3, Katsuro Inoue1
Program Design Introduction to Computer Programming By:
Introduction to Programming
C Preprocessor(CPP).
Review for Midterm Exam
CHAPTER 1: THE DATABASE ENVIRONMENT AND DEVELOPMENT PROCESS
Automatically Diagnosing and Repairing Error Handling Bugs in C
Design, prototyping and construction
Presentation transcript:

7. Duplicated Code Metrics Duplicated Code Software quality Analyzing trends Duplicated Code Detection techniques Visualizing duplicated code © S. Demeyer, S. Ducasse, O. Nierstrasz

Code is Copied Small Example from the Mozilla Distribution (Milestone 9) Extract from /dom/src/base/nsLocation.cpp © S. Demeyer, S. Ducasse, O. Nierstrasz

How Much Code is Duplicated? Usual estimates: 8 to 12% in normal industrial code 15 to 25 % is already a lot! Case Study LOC Duplication without comments with comments gcc 460’000 8.7% 5.6% Database Server 245’000 36.4% 23.3% Payroll 40’000 59.3% 25.4% Message Board 6’500 29.4% 17.4% Duplication figures in parens are entire source code (incl. white space & comments), first number is when all white lines and comments are filtered out © S. Demeyer, S. Ducasse, O. Nierstrasz

What is Duplicated Code? Duplicated Code = Source code segments that are found in different places of a system. in different files in the same file but in different functions in the same function The segments must contain some logic or structure that can be abstracted, i.e., Copied artefacts range from expressions, to functions, to data structures, and to entire subsystems. © S. Demeyer, S. Ducasse, O. Nierstrasz

Copied Code Problems General negative effect Code bloat Negative effects on Software Maintenance Copied Defects Changes take double, triple, quadruple, ... Work Dead code Add to the cognitive load of future maintainers Copying as additional source of defects Errors in the systematic renaming produce unintended aliasing Metaphorically speaking: Software Aging, “hardening of the arteries”, “Software Entropy” increases even small design changes become very difficult to effect © S. Demeyer, S. Ducasse, O. Nierstrasz

Code Duplication Detection Nontrivial problem: No a priori knowledge about which code has been copied How to find all clone pairs among all possible pairs of segments? © S. Demeyer, S. Ducasse, O. Nierstrasz

General Schema of Detection Process Author Level Transformed Code Comparison Technique [John94a] Lexical Substrings String-Matching [Duca99a] Normalized Strings [Bake95a] Syntactical Parameterized Strings [Mayr96a] Metric Tuples Discrete comparison [Kont97a] Euclidean distance [Baxt98a] AST Tree-Matching © S. Demeyer, S. Ducasse, O. Nierstrasz

Recall and Precision © S. Demeyer, S. Ducasse, O. Nierstrasz

Simple Detection Approach (i) Assumption: Code segments are just copied and changed at a few places Noise elimination transformation remove white space, comments remove lines that contain uninteresting code elements (e.g., just ‘else’ or ‘}’) … //assign same fastid as container fastid = NULL; const char* fidptr = get_fastid(); if(fidptr != NULL) { int l = strlen(fidptr); fastid = newchar[ l + 1 ]; … fastid=NULL; constchar*fidptr=get_fastid(); if(fidptr!=NULL) intl=strlen(fidptr) fastid = newchar[l+] © S. Demeyer, S. Ducasse, O. Nierstrasz

Simple Detection Approach (ii) Code Comparison Step Line based comparison (Assumption: Layout did not change during copying) Compare each line with each other line. Reduce search space by hashing: Preprocessing: Compute the hash value for each line Actual Comparison: Compare all lines in the same hash bucket Evaluation of the Approach Advantages: Simple, language independent Disadvantages: Difficult interpretation © S. Demeyer, S. Ducasse, O. Nierstrasz

A Perl script for C++ (i) Here the idea is not to explain the perl script but just to say that it is possible to have a three pages long script that handle Handle multiple files Remove comments and white spaces Control noise (if, {,) Granularity (number of lines) Possible to remove keywords © S. Demeyer, S. Ducasse, O. Nierstrasz

A Perl script for C++ (ii) Handles multiple files Removes comments and white spaces Controls noise (if, {,) Granularity (number of lines) Possible to remove keywords © S. Demeyer, S. Ducasse, O. Nierstrasz

Output Sample Lines = duplicated lines create_property(pd,pnImplObjects,stReference,false,*iImplObjects); create_property(pd,pnElttype,stReference,true,*iEltType); create_property(pd,pnMinelt,stInteger,true,*iMinelt); create_property(pd,pnMaxelt,stInteger,true,*iMaxelt); create_property(pd,pnOwnership,stBool,true,*iOwnership); Locations: </face/typesystem/SCTypesystem.C>6178/6179/6180/6181/6182 </face/typesystem/SCTypesystem.C>6198/6199/6200/6201/6202 create_property(pd,pnSupertype,stReference,true,*iSupertype); create_property(pd,pMinelt,stInteger,true,*iMinelt); Locations: </face/typesystem/SCTypesystem.C>6177/6178 </face/typesystem/SCTypesystem.C>6229/6230 Lines = duplicated lines Locations = file names and line number © S. Demeyer, S. Ducasse, O. Nierstrasz

Enhanced Simple Detection Approach Code Comparison Step As before, but now Collect consecutive matching lines into match sequences Allow holes in the match sequence Evaluation of the Approach Advantages Identifies more real duplication, language independent Disadvantages Less simple Misses copies with (small) changes on every line © S. Demeyer, S. Ducasse, O. Nierstrasz

Abstraction Abstracting selected syntactic elements can increase recall, at the possible cost of precision © S. Demeyer, S. Ducasse, O. Nierstrasz

Visualization of Duplicated Code Visualization provides insights into the duplication situation A simple version can be implemented in three days Scalability issue Dotplots — Technique from DNA Analysis Code is put on vertical as well as horizontal axis A match between two elements is a dot in the matrix © S. Demeyer, S. Ducasse, O. Nierstrasz

Visualization of Copied Code Sequences Detected Problem File A contains two copies of a piece of code File B contains another copy of this code Possible Solution Extract Method All examples are made using Duploc from an industrial case study (1 Mio LOC C++ System) © S. Demeyer, S. Ducasse, O. Nierstrasz

Visualization of Repetitive Structures Detected Problem 4 Object factory clones: a switch statement over a type variable is used to call individual construction code Possible Solution Strategy Method © S. Demeyer, S. Ducasse, O. Nierstrasz

Visualization of Cloned Classes Class A Class B Detected Problem: Class A is an edited copy of class B. Editing & Insertion Possible Solution Subclassing … Class A Class B © S. Demeyer, S. Ducasse, O. Nierstrasz

Visualization of Clone Families Overview Detail Mural at left. 20 Classes implementing lists for different data types © S. Demeyer, S. Ducasse, O. Nierstrasz

Duploc is scalable, integrates detection and visualization I.e., it also makes sense to invest in specialized tools (as long as you don’t have to implement them yourself!). © S. Demeyer, S. Ducasse, O. Nierstrasz

Conclusion Duplicated code is a real problem makes a system progressively harder to change Detecting duplicated code is a hard problem some simple technique can help tool support is needed Visualization of code duplication is useful some basic support are easy to build one student build a simple visualization tool in three days Curing duplicated code is an active research area © S. Demeyer, S. Ducasse, O. Nierstrasz