

• Software clones (definitions from Wikipedia):
  ◦ Duplicate code: a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity.
  ◦ Clones: sequences of duplicate code.
• "Clones are segments of code that are similar according to some definition of similarity." (Ira Baxter, 2002)

• How clones are created:
  ◦ copy-and-paste programming
  ◦ similar functionality, similar code
  ◦ plagiarism

• Three types of clones:
  ◦ Type 1: an exact copy without modifications (except for whitespace and comments).
  ◦ Type 2: a syntactically identical copy; only variable, type, or function identifiers have been changed.
  ◦ Type 3: a copy with further modifications; statements have been changed, added, or removed.
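The three types can be illustrated with a small token-based check. This is only a sketch, not part of the original slides: it uses Python's standard `tokenize` module on Python snippets, and the sample functions and the `token_stream` helper are hypothetical names chosen for the example.

```python
import io
import keyword
import tokenize

# Layout tokens are mapped to fixed markers so spacing differences vanish.
LAYOUT = {tokenize.NEWLINE: "<NEWLINE>",
          tokenize.INDENT: "<INDENT>",
          tokenize.DEDENT: "<DEDENT>"}
SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}


def token_stream(source, normalize_identifiers=False):
    """Return the significant tokens of a Python snippet.

    Comments and blank lines are dropped, which is enough to equate
    Type 1 clones. With normalize_identifiers=True every non-keyword
    name becomes "ID", which also equates Type 2 clones.
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type in LAYOUT:
            tokens.append(LAYOUT[tok.type])
        elif (normalize_identifiers and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            tokens.append("ID")
        else:
            tokens.append(tok.string)
    return tokens


ORIGINAL = """def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
"""

# Type 1: only comments and whitespace differ.
TYPE_1 = """def total(xs):
    s = 0  # running sum

    for x in xs:
        s   +=   x
    return s
"""

# Type 2: identifiers renamed, structure unchanged.
TYPE_2 = """def add_all(values):
    acc = 0
    for v in values:
        acc += v
    return acc
"""

# Type 3: a statement was added on top of the renaming.
TYPE_3 = """def add_all(values):
    acc = 0
    for v in values:
        if v > 0:
            acc += v
    return acc
"""

if __name__ == "__main__":
    print(token_stream(ORIGINAL) == token_stream(TYPE_1))
    print(token_stream(ORIGINAL, True) == token_stream(TYPE_2, True))
    print(token_stream(ORIGINAL, True) == token_stream(TYPE_3, True))
```

Exact token equality catches Type 1, and equality after identifier normalization catches Type 2. The Type 3 copy fails both checks, which is why it needs a similarity measure rather than an equality test.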

• For our task, finding clones across different programming languages, we must first convert the code from each language into a language-independent object model.
• Some language-independent object models:
  ◦ Dagstuhl Middle Metamodel (DMM)
  ◦ Microsoft CodeDOM
• Both models represent the structure of source code independently of the source language.

• Detecting clones across multiple programming languages is on the cutting edge of research.
• A preliminary version of this work was done by Dr. Kraft and his students for C# and VB.
  ◦ They compared the Mono C# parser (written in C#) to the Mono VB parser (written in VB).
  ◦ Publication: Nicholas A. Kraft, Brandon W. Bonds, Randy K. Smith: Cross-language Clone Detection. SEKE 2008: 54-59

• Token sequence of CodeDOM graphs with Levenshtein distance
  ◦ The Levenshtein distance between two sequences is defined as the minimum number of edits needed to transform one sequence into the other.
• Performs comparisons of code files
• CodeDOM tree is tokenized
• Based on distances
  ◦ Percentage of matching tokens in a sequence
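A minimal sketch of the distance computation over token sequences follows. The token names are hypothetical stand-ins for tokens produced from a CodeDOM tree, and the `similarity` formula (one minus the distance divided by the longer length) is an assumption about how a matching percentage could be derived; the slides do not give the exact formula.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete item_a
                               current[j - 1] + 1,       # insert item_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed score in [0, 1]: 1.0 means the sequences are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # Hypothetical token sequences for `x = y + 1` and `x = y - 1`.
    first = ["Assign", "VarRef", "BinaryOp:Add", "VarRef", "Literal"]
    second = ["Assign", "VarRef", "BinaryOp:Sub", "VarRef", "Literal"]
    print(levenshtein(first, second), similarity(first, second))
```

The table is filled row by row, so the cost is proportional to the product of the two sequence lengths for every pair of files compared.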

• Only does file-to-file comparisons
  ◦ Does not detect clones within the same source file
• Can only detect Type 1 and some Type 2 clones
• Not very efficient (brute force)

• Tokens are split into parameter tokens (identifiers and literals) and non-parameter tokens
• Non-parameter tokens are summarized using a hash function
• Parameter tokens are encoded using a position index for their occurrence in the sequence
  ◦ Abstracts concrete names and values while maintaining order
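One plausible reading of this encoding is sketched below. The `(kind, text)` token format, the `PARAMETER_KINDS` set, the choice of CRC32 as the hash, and the first-occurrence numbering are all assumptions made for illustration; the slides do not specify them.

```python
import zlib

# Assumed token categories that count as "parameters".
PARAMETER_KINDS = {"identifier", "literal"}


def encode(tokens):
    """Encode a list of (kind, text) tokens.

    Parameter tokens become ("param", n), where n is the order in which
    that name or value first appeared in the sequence. Non-parameter
    tokens become ("fixed", h), where h is a hash of their text.
    """
    first_seen = {}
    encoded = []
    for kind, text in tokens:
        if kind in PARAMETER_KINDS:
            # setdefault numbers each new parameter by first occurrence.
            index = first_seen.setdefault((kind, text), len(first_seen))
            encoded.append(("param", index))
        else:
            encoded.append(("fixed", zlib.crc32(text.encode("utf-8"))))
    return encoded


if __name__ == "__main__":
    # x = x + 1
    a = [("identifier", "x"), ("operator", "="), ("identifier", "x"),
         ("operator", "+"), ("literal", "1")]
    # total = total + 5
    b = [("identifier", "total"), ("operator", "="), ("identifier", "total"),
         ("operator", "+"), ("literal", "5")]
    # x = y + 1 (a different usage pattern)
    c = [("identifier", "x"), ("operator", "="), ("identifier", "y"),
         ("operator", "+"), ("literal", "1")]
    print(encode(a) == encode(b), encode(a) == encode(c))
```

Because only the pattern of reuse is kept, `x = x + 1` and `total = total + 5` encode identically, while `x = y + 1` does not. This is what lets a consistent renaming still match.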

• Represent all suffixes of the sequence in a suffix tree
• Suffixes that share the same set of edges have a common prefix
  ◦ That prefix occurs more than once (a clone)
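The idea can be sketched with a naive suffix trie, which stores one node per token and takes quadratic space. It is not the compressed suffix tree a real tool would use, and the function names and the `min_length` threshold are hypothetical.

```python
def build_suffix_trie(seq):
    """Insert every suffix of seq into a trie.

    Each node records the start positions of the suffixes passing through
    it, so a node with two or more starts marks a repeated run of tokens.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(seq)):
        node = root
        for item in seq[start:]:
            node = node["children"].setdefault(
                item, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root


def repeated_runs(seq, min_length=2):
    """Return {run: start positions} for repeated runs of at least
    min_length tokens that cannot be extended to the right."""
    found = {}
    stack = [(build_suffix_trie(seq), ())]
    while stack:
        node, path = stack.pop()
        shared = [(item, child) for item, child in node["children"].items()
                  if len(child["starts"]) >= 2]
        if not shared and len(path) >= min_length:
            found[path] = node["starts"]
        for item, child in shared:
            stack.append((child, path + (item,)))
    return found


if __name__ == "__main__":
    tokens = ["a", "b", "c", "X", "a", "b", "c", "Y"]
    print(repeated_runs(tokens))
```

For the sample sequence the run `a b c` is found at positions 0 and 4. The shorter run `b c` is also reported because the sketch extends matches only to the right; a fuller implementation would discard runs contained in a longer match.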