ACM Southeast Conference Melbourne, FL March 11, 2006 Phoenix-Based Clone Detection using Suffix Trees Robert Tairas


ACM Southeast Conference Melbourne, FL March 11, 2006 Phoenix-Based Clone Detection using Suffix Trees Robert Tairas Advisor: Dr. Jeff Gray

Code Clones
A sequence of statements that is duplicated in multiple locations in a program.
[Figure: a source code listing with the cloned code blocks highlighted]

Clones in Source Code
Copy-and-paste parts of code from one location to another
• The copied code already works correctly
• No time to be efficient
Research shows that 5-10% of large-scale computer programs are clones (Baxter, 98)
[Figure: a source code listing]

Clones in Source Code
Dominant decomposition: A block of statements that performs a function/concern dominates another block
• The two concerns crosscut each other
• One concern will have to yield to the other
• Related to Aspect-Oriented Programming (AOP)

Clones in Source Code
Logging in org.apache.tomcat
• Red shows lines of code that handle logging
• Not in just one place
• Not even in a small number of places

Clone Dilemma
Maintenance
• Updating code that is cloned requires updating every clone
Restructure/refactor
Separate into aspects
But first we need to find the clones

Contribution: Automated Clone Detection
Searches for exact-matching function-level clones, utilizing suffix tree structures in the Microsoft Phoenix framework
[Diagram: Source Code → Clone Detector (Microsoft Phoenix, Suffix Trees) → Report of Clones]

Types of Clones

Original code:
    int main() { int x = 1; int y = x + 5; return y; }

Exact match:
    int func1() { int x = 1; int y = x + 5; return y; }

Exact match, with only the variable names differing:
    int func2() { int p = 1; int q = p + 5; return q; }

Near exact match:
    int func3() { int s = 1; int t = s + 5; s++; return t; }

As defined in an experiment comparing existing clone detection techniques at the 1st International Workshop on Detection of Software Clones (02)

What is Phoenix?
Next-Generation Framework for
• building Compilers
• building Software Analysis Tools
Basis for Microsoft compilers for 10+ years
More information:
Note: Contents of this slide courtesy of John Lefor at Microsoft Research

[Diagram: Phoenix architecture. The Phoenix Core (AST, IR, Syms, Types, CFG, SSA) and the Phx APIs sit at the center. Compilers: C#, VB, C++, Delphi, Cobol, Eiffel, and Tiger (via Lex/Yacc), with HL Opts, LL Opts, and Code Gen stages producing a Native Image. Tools: Xlator, Formatter, Browser, Profiler, Obfuscator, Visualizer, Security Checker, Refactor, Lint, and PREfast.]
Note: This slide courtesy of John Lefor at Microsoft Research

Suffix Trees
A suffix tree of a string is a tree where each suffix of the string is represented by a path from the root to a leaf
In bioinformatics it is used to search for patterns in DNA or protein sequences
Example: suffix tree for abgf$
[Diagram: a root with five edges, one per suffix: abgf$, bgf$, gf$, f$, $]
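The definition above can be checked with a small sketch. It builds an uncompressed suffix trie (one character per edge) rather than a true suffix tree with merged edges, and it uses naive quadratic construction; the function names are illustrative and not taken from the talk.

```python
def suffixes(text):
    """Return (start_position, suffix) pairs, with positions counted from 1."""
    return [(i + 1, text[i:]) for i in range(len(text))]


def build_suffix_trie(text):
    """Build an uncompressed suffix trie as nested dicts.

    A leaf is marked by the key "leaf", whose value is the 1-based
    starting position of the suffix that ends there.
    """
    root = {}
    for start, suffix in suffixes(text):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
        node["leaf"] = start
    return root


def find_leaf(trie, suffix):
    """Follow `suffix` from the root; return its leaf number, or None."""
    node = trie
    for ch in suffix:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("leaf")


trie = build_suffix_trie("abgf$")
for start, suffix in suffixes("abgf$"):
    # Every suffix is a root-to-leaf path, and the leaf holds its start position.
    print(start, suffix, find_leaf(trie, suffix))
```

The terminating character `$` guarantees that no suffix is a prefix of another, so every suffix ends at its own leaf.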

Another Suffix Tree Example
Suffix tree for abcebcf$
Leaf numbers: The number indicates the starting position of the suffix from the left of the string
[Diagram: edges from the root, with leaf numbers in parentheses]
• abcebcf$ (1)
• bc, branching into ebcf$ (2) and f$ (5)
• c, branching into ebcf$ (3) and f$ (6)
• ebcf$ (4)
• f$ (7)
• $ (8)

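Another Suffix Tree Example
Suffix tree for abgf$abgf#
Two identical strings (abgf) separated by unique terminating characters
Leaf numbers: The first number indicates the string. The second number indicates the starting position of the suffix in that string.
[Diagram: edges from the root, with leaf numbers in parentheses]
• abgf, branching into $abgf# (1,1) and # (2,1)
• bgf, branching into $abgf# (1,2) and # (2,2)
• gf, branching into $abgf# (1,3) and # (2,3)
• f, branching into $abgf# (1,4) and # (2,4)
• $abgf# (1,5)
• # (2,5)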
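The (string, position) leaf labels can be reproduced with a short sketch. It does not build the tree; it groups suffixes by the text that precedes their first terminator, which is the path shared before the branches split. The helper names are assumptions made for illustration.

```python
from collections import defaultdict


def leaf_labels(strings):
    """Return {(string_number, position): suffix} for the concatenated text.

    Each input string is expected to end with its own unique terminator.
    """
    text = "".join(strings)
    leaves = {}
    offset = 0
    for number, s in enumerate(strings, start=1):
        for i in range(len(s)):
            leaves[(number, i + 1)] = text[offset + i:]
        offset += len(s)
    return leaves


def group_by_shared_path(strings, terminators):
    """Group leaf labels by the part of each suffix before its first terminator."""
    groups = defaultdict(list)
    for label, suffix in leaf_labels(strings).items():
        end = min(suffix.index(t) for t in terminators if t in suffix)
        groups[suffix[:end]].append(label)
    return dict(groups)


groups = group_by_shared_path(["abgf$", "abgf#"], "$#")
print(groups["abgf"])  # [(1, 1), (2, 1)]
```

A path whose group contains leaves from both strings is text that both strings share.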

Abstract Syntax Tree Nodes
int func1() { return x; }  →  FUNCDEFN COMPOUND RETURN SYMBOL
int func2() { return y; }  →  FUNCDEFN COMPOUND RETURN SYMBOL
Note: Node names are Phoenix-defined.
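A minimal sketch of how a function's AST can be flattened into such a node sequence, assuming a pre-order traversal (the slide does not state the traversal order). The `Node` class is a hypothetical stand-in, not the Phoenix AST API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node: a node kind plus child nodes."""
    kind: str
    children: list = field(default_factory=list)


def preorder_kinds(node):
    """Return the node kinds of the tree in pre-order (parent before children)."""
    kinds = [node.kind]
    for child in node.children:
        kinds.extend(preorder_kinds(child))
    return kinds


def return_symbol_function():
    """AST for a function of the form `int f() { return v; }`."""
    return Node("FUNCDEFN", [Node("COMPOUND", [Node("RETURN", [Node("SYMBOL")])])])


print(preorder_kinds(return_symbol_function()))
```

Because the sequence records only node kinds, func1 and func2 flatten to the same sequence even though they return different variables.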

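Remember This?
Suffix tree for abgf$abgf#
Leaf numbers: The first number indicates the function. The second number indicates the starting position of the suffix in that function.
Each character stands for one AST node: a = FUNCDEFN, b = COMPOUND, g = RETURN, f = SYMBOL. The first function ends with $ and the second with #.
[Diagram: edges from the root, with leaf numbers in parentheses]
• abgf, branching into $abgf# (1,1) and # (2,1)
• bgf, branching into $abgf# (1,2) and # (2,2)
• gf, branching into $abgf# (1,3) and # (2,3)
• f, branching into $abgf# (1,4) and # (2,4)
• $abgf# (1,5)
• # (2,5)
For exact function matching, we look for suffix tree nodes reached by edges whose labels include all the AST nodes of a function.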
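For the whole-function case, the same groups can be found without a suffix tree by grouping functions on their encoded node sequences. The sketch below uses that simpler substitute, not the suffix tree method of the talk; the function names and sample data are illustrative.

```python
from collections import defaultdict


def encode(sequences):
    """Map each distinct node kind to an integer symbol, per function."""
    alphabet = {}
    return {
        name: tuple(alphabet.setdefault(kind, len(alphabet)) for kind in kinds)
        for name, kinds in sequences.items()
    }


def exact_function_clones(sequences):
    """Return groups of function names whose full node sequences are identical."""
    groups = defaultdict(list)
    for name, symbols in encode(sequences).items():
        groups[symbols].append(name)
    return [names for names in groups.values() if len(names) > 1]


functions = {
    "func1": ["FUNCDEFN", "COMPOUND", "RETURN", "SYMBOL"],
    "func2": ["FUNCDEFN", "COMPOUND", "RETURN", "SYMBOL"],
    "func3": ["FUNCDEFN", "COMPOUND", "RETURN", "CONSTANT"],
}
print(exact_function_clones(functions))  # [['func1', 'func2']]
```

Grouping by whole sequence only answers the exact function-level question. A suffix tree also exposes shared subsequences, which matters once matching moves below the function level.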

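False Positives

Original code:
    int main() { int x = 1; int y = x + 5; return y; }

    int func1() { int x = 3; int y = x + 5; return y; }
    int func2() { int x = 1; int y = y + 5; return y; }

func1 differs from the original in a constant (3 instead of 1) and func2 in an operand (y instead of x), yet all three functions have the same sequence of AST node types.
[Diagram: AST of the original code with nodes FUNCDEFN, COMPOUND, DECLARATION, SYMBOL, CONSTANT, PLUS, and RETURN, annotated with attributes: x, i32; i32, 1; y, i32; x, i32; i32, 5; y, i32; i32]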
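A sketch of how such false positives arise and how comparing node attributes filters them out. The node layout and the `(kind, detail)` tuples are assumptions made for illustration; they are not Phoenix data structures.

```python
def body(init_value, operand):
    """Hypothetical node list for: int x = <init_value>; int y = <operand> + 5; return y;"""
    return [
        ("FUNCDEFN", None), ("COMPOUND", None),
        ("DECLARATION", None), ("SYMBOL", "x"), ("CONSTANT", init_value),
        ("DECLARATION", None), ("SYMBOL", "y"),
        ("PLUS", None), ("SYMBOL", operand), ("CONSTANT", 5),
        ("RETURN", None), ("SYMBOL", "y"),
    ]


def same_kinds(a, b):
    """Compare node kinds only, as matching on node types alone would."""
    return [kind for kind, _ in a] == [kind for kind, _ in b]


def same_nodes(a, b):
    """Compare node kinds together with symbol names and constant values."""
    return a == b


main = body(1, "x")
func1 = body(3, "x")   # different constant
func2 = body(1, "y")   # different operand

for name, candidate in [("func1", func1), ("func2", func2)]:
    print(name, same_kinds(main, candidate), same_nodes(main, candidate))
```

Both candidates match on node kinds and fail once attributes are compared, so a second pass over attributes is one way to confirm a reported clone.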

Phoenix Phases
• Processes are divided into “phases”
• Custom phases can be inserted to perform tasks such as software analysis
• Phases are inserted through “plug-ins” in the form of a library (DLL) module
[Diagram: a plug-in inserts a custom phase, the Clone Detection Phase, into Microsoft Phoenix]

Clone Detector in Phoenix
[Diagram: example.c → C/C++ Front-end → example.ast → Phoenix Back-end → Report; csclones.cs → C# → csclones.dll, loaded by the Phoenix Back-end]

Case Study

Program: Abyss, a small web server (~1500 LOC)
Duplicate function groups:
• Functions ConfGetToken (in conf.c) and GetToken (in http.c).
• Functions ThreadRun (in thread.c) and ThreadStop (in thread.c).
Note: Out of 5 duplicate function groups found, 3 were in predefined header files.

Program: Weltab, an election results program (~11K LOC)
Duplicate function groups:
• Function canvw (in canv.c, cnv1.c, and cnv1a.c).
• Functions lhead (in lans.c and lansxx.c) and rshead (in r01tmp.c, r101tmp.c, r11tmp.c, r26tmp.c, r51tmp.c, rsum.c, and rsumxx.c).
• Function rsprtpag (in r01tmp.c, r101tmp.c, r11tmp.c, r26tmp.c, r51tmp.c, and rsum.c).
• Function askchange (in vedt.c, vfix.c, and xfix.c).
Note: Out of 6 duplicate function groups found, 2 were in predefined header files.

Limitations and Future Work
Looks only for exact matches
• Currently working on a process called hybrid dynamic programming, which includes the use of suffix trees (k-difference inexact matching)
Looks only at the function level
• Enable clone detection at multiple levels
• Higher: statement level; Lower: program level
Recognizes only C nodes
• Coverage for other languages, such as C++ and C#
• Another approach: language independent
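To illustrate what k-difference inexact matching means, here is a sketch using plain edit distance computed by dynamic programming. This is a swapped-in textbook technique, not the hybrid dynamic programming method mentioned above, and the node sequences are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, keeping one DP row at a time."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # delete x
                current[j - 1] + 1,      # insert y
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def within_k_differences(a, b, k):
    """True if `a` can be turned into `b` with at most k single-node edits."""
    return edit_distance(a, b) <= k


original = ["FUNCDEFN", "COMPOUND", "DECLARATION", "DECLARATION", "RETURN"]
near_match = ["FUNCDEFN", "COMPOUND", "DECLARATION", "DECLARATION", "INCREMENT", "RETURN"]
print(within_k_differences(original, near_match, 1))  # True
```

With k = 0 this reduces to exact matching. Larger values of k admit near-exact matches such as a function with one extra statement.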

Thank you…Questions? Phoenix-Based Clone Detection using Suffix Trees