Published by Rodger May. Modified over 8 years ago.
1
Software Clones (definitions from Wikipedia):
◦ Duplicate code: a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity.
◦ Clones: sequences of duplicate code.
“Clones are segments of code that are similar according to some definition of similarity.” (Ira Baxter, 2002)
2
How clones are created:
◦ copy-and-paste programming
◦ similar functionality, similar code
◦ plagiarism
3
Three Types of Clones:
◦ Type 1: an exact copy without modifications (except for whitespace and comments).
◦ Type 2: a syntactically identical copy; only variable, type, or function identifiers have been changed.
◦ Type 3: a copy with further modifications; statements have been changed, added, or removed.
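The three types can be illustrated with a small token-based check. This is only a sketch under stated assumptions: it uses Python's own `tokenize` module on Python snippets, the helper names `tokens` and `classify` are hypothetical, and replacing every identifier with the same placeholder is more permissive than checking for a consistent renaming.

```python
import io
import keyword
import tokenize

# Layout and comment tokens are ignored (a simplification: block structure
# carried by INDENT/DEDENT is dropped as well).
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}

def tokens(src, abstract=False):
    """Return the token texts of src; optionally abstract names and literals."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type in SKIP:
            continue
        if abstract and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")    # any identifier
        elif abstract and tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")   # any literal
        else:
            out.append(tok.string)
    return out

def classify(a, b):
    """Rough classification of fragment b relative to fragment a."""
    if tokens(a) == tokens(b):
        return "Type 1"
    if tokens(a, abstract=True) == tokens(b, abstract=True):
        return "Type 2"
    return "Type 3 or unrelated"

original = "total = 0\nfor x in items:\n    total += x\n"
type1 = "total = 0  # running sum\nfor x in items:\n    total  +=  x\n"
type2 = "acc = 0\nfor v in values:\n    acc += v\n"
type3 = "acc = 0\nfor v in values:\n    if v > 0:\n        acc += v\n"

for candidate in (type1, type2, type3):
    print(classify(original, candidate))
```

Telling a Type 3 clone apart from unrelated code needs a similarity measure rather than an equality test, which is why the last case is left undecided here.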
4
For our task, finding clones across different programming languages first requires converting the code from each language to a language-independent object model.
Some language-independent object models:
◦ Dagstuhl Middle Metamodel (DMM)
◦ Microsoft CodeDOM
Both provide a language-independent object model for representing the structure of source code.
5
Detecting clones across multiple programming languages is at the cutting edge of research. Dr. Kraft and his students produced a preliminary version of such a detector for C# and VB.
◦ They compared the Mono C# parser (written in C#) to the Mono VB parser (written in VB).
◦ Publication: Nicholas A. Kraft, Brandon W. Bonds, Randy K. Smith: Cross-language Clone Detection. SEKE 2008: 54-59
6
Token sequences of CodeDOM graphs, compared with Levenshtein distance
◦ The Levenshtein distance between two sequences is the minimum number of edits (insertions, deletions, or substitutions) needed to transform one sequence into the other.
Performs comparisons of code files
The CodeDOM tree is tokenized
Matching is based on distances
◦ Percentage of matching tokens in a sequence
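A minimal sketch of the distance computation on token sequences follows. The token labels are hypothetical stand-ins for tokenized CodeDOM nodes, and the `similarity` formula (one minus the distance divided by the longer length) is an assumption about how a matching percentage could be derived, not necessarily the formula used in the original tool.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]                          # deleting the first i items of a
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Assumed similarity score in [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# Hypothetical token labels for two methods from different source files.
file_a = ["Method", "Param", "Assign", "Return"]
file_b = ["Method", "Param", "Param", "Assign", "Return"]

print(levenshtein(file_a, file_b))
print(similarity(file_a, file_b))
```

Because the function accepts any sequences, the same code works on lists of tokens and on plain strings.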
8
Only does file-to-file comparisons
◦ Does not detect clones within the same source file
Can only detect Type 1 and some Type 2 clones
Not very efficient (brute force)
9
Tokens are split into parameter tokens (identifiers and literals) and non-parameter tokens
Non-parameter tokens are summarized using a hash function
Parameter tokens are encoded using a position index for their occurrence in the sequence
◦ Abstracts concrete names and values while maintaining order
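One possible reading of this encoding is sketched below. It assumes each token arrives as a `(text, is_parameter)` pair, numbers parameters by the order in which they first appear, and uses `zlib.crc32` as a stand-in hash; the function name `encode` and these choices are illustrative assumptions, since the slide does not fix the exact scheme.

```python
import zlib

def encode(tokens):
    """Encode (text, is_parameter) pairs so that concrete names and values vanish."""
    first_seen = {}   # parameter text -> index of its first appearance
    encoded = []
    for text, is_param in tokens:
        if is_param:
            # Reuse the index if this parameter was seen before, else assign the next one.
            index = first_seen.setdefault(text, len(first_seen))
            encoded.append(("P", index))
        else:
            # Stand-in hash for a non-parameter token (keyword, operator, ...).
            encoded.append(("T", zlib.crc32(text.encode("utf-8"))))
    return encoded

# x = x + 1   and   y = y + 2   differ only in names and values.
a = [("x", True), ("=", False), ("x", True), ("+", False), ("1", True)]
b = [("y", True), ("=", False), ("y", True), ("+", False), ("2", True)]
# x = y + 1   uses its parameters in a different pattern.
c = [("x", True), ("=", False), ("y", True), ("+", False), ("1", True)]

print(encode(a) == encode(b))
print(encode(a) == encode(c))
```

Under this encoding a consistently renamed fragment maps to the same sequence, while a fragment that reuses its parameters in a different pattern does not.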
10
Represent all suffixes of the sequence in a suffix tree
Suffixes that share the same set of edges have a common prefix
◦ That prefix occurs more than once (a clone)
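The idea can be sketched with a naive suffix trie instead of a real suffix tree. This swaps in a simpler structure that takes quadratic time and space, whereas practical detectors use linear-time suffix tree construction. The function name `longest_repeat` and the token labels are hypothetical.

```python
def longest_repeat(seq):
    """Return the longest subsequence that starts at two or more positions in seq."""
    root = {}
    best = ()
    for start in range(len(seq)):
        node = root
        # Insert the suffix seq[start:] one token at a time.
        for depth, tok in enumerate(seq[start:], 1):
            child = node.setdefault(tok, {"count": 0, "next": {}})
            child["count"] += 1   # number of suffixes passing through this edge
            # An edge shared by two suffixes marks a prefix that occurs twice.
            if child["count"] >= 2 and depth > len(best):
                best = tuple(seq[start:start + depth])
            node = child["next"]
    return best

# Hypothetical normalized token sequence containing one repeated fragment.
sequence = ["ID", "=", "LIT", "call", "ID", "=", "LIT", "ret"]
print(longest_repeat(sequence))
```

Note that this sketch also reports overlapping occurrences (for example in a run of identical tokens), which a clone detector would normally filter out.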