Cross Language Clone Analysis Team 2 October 27, 2010.

• Current Tasks
• GOLD Parsing System
• Grammar Update
• Clone Analysis
• Demonstration
• Team Collaboration
• Path Forward

• Allen Tucker
• Patricia Bradford
• Greg Rodgers
• Brian Bentley
• Ashley Chafin

What we are tackling…

• Current tasks created for the first user story, “Source Code Load & Translate”:
  ◦ Load & parse C# source code.
  ◦ Load & parse Java source code.
  ◦ Load & parse C++ source code.
  ◦ Translate the parsed C# source code to CodeDOM.
  ◦ Translate the parsed Java source code to CodeDOM.
  ◦ Translate the parsed C++ source code to CodeDOM.
  ◦ Associate the CodeDOM with the original source code.

GOLD Parsing: Populating CodeDOM

• What are we doing?
• Compiled Grammar Table
• Bookkeeping
• Testing

[Diagram: Grammar, Compiled Grammar Table (*.cgt), Source Code, Parsed Data]

Typical output from the engine: a long nested tree.

[Diagram: Compiled Grammar Table (*.cgt), Source Code, Parsed Data (AST), Conversion, CodeDOM]

• We need to write a routine to move data from the parse tree to CodeDOM.
• The parse trees produced by the parser are stored in a consistent data structure, but their shape depends on the rules defined within each grammar.
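One way to picture such a conversion routine is a per-grammar lookup table from production rule names to converter functions that all build the same language-independent nodes. The sketch below is only an illustration in Python, not the team's implementation: `ParseNode`, `ModelNode`, and every rule name in `JAVA_RULES` and `CSHARP_RULES` are hypothetical stand-ins, and the real target would be CodeDOM objects.

```python
from dataclasses import dataclass, field


@dataclass
class ParseNode:
    """One node of the parser's output tree (hypothetical shape)."""
    rule: str                                   # production rule or terminal name
    children: list = field(default_factory=list)
    text: str = ""                              # lexeme, for terminals only


@dataclass
class ModelNode:
    """Language-independent node standing in for a CodeDOM object."""
    kind: str
    name: str = ""
    members: list = field(default_factory=list)


def identifier_of(node):
    """Return the text of the first Identifier child."""
    return next(c.text for c in node.children if c.rule == "Identifier")


def convert_class(node, convert):
    # Children without a handler (e.g. the Identifier itself) convert to None.
    converted = [convert(child) for child in node.children]
    members = [m for m in converted if m is not None]
    return ModelNode("TypeDeclaration", identifier_of(node), members)


def convert_method(node, convert):
    return ModelNode("Method", identifier_of(node))


# One table per grammar: the rule names differ (all hypothetical here),
# but both tables feed the same converters and the same target model.
JAVA_RULES = {"ClassDeclaration": convert_class, "MethodDeclaration": convert_method}
CSHARP_RULES = {"Class Decl": convert_class, "Method Dec": convert_method}


def to_model(node, rules):
    """Walk the parse tree and build model nodes for the rules handled so far."""
    handler = rules.get(node.rule)
    if handler is None:
        return None                             # rule not handled yet
    return handler(node, lambda child: to_model(child, rules))
```

With this layout, an unhandled rule simply yields nothing, which matches the idea of tracking per rule what has and has not been implemented.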

• For Java, there are:
  ◦ 359 production rules
  ◦ 249 distinct symbols (terminal & non-terminal)
• For C#, there are:
  ◦ 415 production rules
  ◦ 279 distinct symbols (terminal & non-terminal)

Since there are so many production rules, we came up with the following bookkeeping:
• A spreadsheet of the compiled grammar table (one for each language) with each production rule indexed.
  ◦ This spreadsheet covers:
    • various aspects of the language
    • what we have/have not handled from the parser
    • what we have/have not implemented in CodeDOM
    • percentage complete

• White Box Testing:
  ◦ Unit Testing
• Black Box Testing:
  ◦ Production Rule Testing
    • Allows us to test the robustness of our engine because we can force rule production errors.
    • Regression Testing
    • Automated

• Three-Step Process
  ◦ Step 1: Code Translation
  ◦ Step 2: Clone Detection
  ◦ Step 3: Visualization

[Diagram: Source Files → Translator → Common Model → Inspector → Detected Clones → UI → Clone Visualization]

Java & C#

• The grammars we currently have for the GOLD parser are outdated.
• Current GOLD grammars:
  ◦ C# version 2.0
  ◦ Java version 1.4
• Currently available language versions:
  ◦ C# version 4.0
  ◦ Java version 6

• Available updated grammars:
  ◦ ANTLR has grammars updated to more recent versions of both C# and Java.
  ◦ C# version 4.0 (latest version)
  ◦ Java version 1.5 (second-to-latest version)
• We are currently attempting to transform the ANTLR grammars into GOLD Parser grammars.

• Grammars for C# and Java are very complex and require a lot of work to build.
• ANTLR and GOLD Parser grammars use completely different syntax.
• Positive note: other development is not halted by the use of the older grammars.

Overview and Dr. Kraft’s Students’ Tool

• Software clones (definitions from Wikipedia):
  ◦ Duplicate code: a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity.
  ◦ Clones: sequences of duplicate code.
• “Clones are segments of code that are similar according to some definition of similarity.” (Ira Baxter, 2002)

• How clones are created:
  ◦ copy-and-paste programming
  ◦ similar functionality, similar code
  ◦ plagiarism

• 3 Types of Clones:
  ◦ Type 1: an exact copy without modifications (except for whitespace and comments).
  ◦ Type 2: a syntactically identical copy
    • only variable, type, or function identifiers have been changed.
  ◦ Type 3: a copy with further modifications
    • statements have been changed, added, or removed.
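The three types can be told apart mechanically on small fragments. The following Python sketch is an illustration only: the regex tokenizer, the tiny `KEYWORDS` set, the 0.7 similarity threshold, and the use of `difflib.SequenceMatcher` for Type 3 are all assumptions made for the example, not part of the project.

```python
import re
from difflib import SequenceMatcher

# Small illustrative keyword subset; type names such as "int" are deliberately
# left out so that changed type identifiers still count as Type 2.
KEYWORDS = {"return", "if", "else", "for", "while"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def tokens(source):
    """Drop comments and whitespace, then split into tokens."""
    return TOKEN.findall(COMMENT.sub(" ", source))


def normalise(toks):
    """Replace every identifier that is not a keyword with the placeholder ID."""
    return ["ID" if re.match(r"[A-Za-z_]", t) and t not in KEYWORDS else t
            for t in toks]


def classify(a, b, threshold=0.7):
    """Label a pair of fragments as Type 1, Type 2, Type 3, or not a clone."""
    ta, tb = tokens(a), tokens(b)
    if ta == tb:
        return "Type 1"            # identical apart from whitespace and comments
    na, nb = normalise(ta), normalise(tb)
    if na == nb:
        return "Type 2"            # identical once identifiers are abstracted
    if SequenceMatcher(None, na, nb).ratio() >= threshold:
        return "Type 3"            # similar, with statements added or changed
    return "not a clone"


original = "int sum(int a, int b) { return a + b; }"
renamed = "long total(long x, long y) { return x + y; }"
print(classify(original, renamed))
```

Note that the placeholder approach does not check that identifiers were renamed consistently; it only abstracts them away.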

• Per our task, in order to find clones across different programming languages, we first have to convert the code from each language to a language-independent object model.
• Some language-independent object models:
  ◦ Dagstuhl Middle Metamodel (DMM)
  ◦ Microsoft CodeDOM
• Both of these models provide a language-independent object model for representing the structure of source code.

• Detecting clones across multiple programming languages is on the cutting edge of research.
• A preliminary version of this was done by Dr. Kraft and his students for C# and VB.
  ◦ They compared the Mono C# parser (written in C#) to the Mono VB parser (written in VB).
  ◦ Publication:
    • Nicholas A. Kraft, Brandon W. Bonds, Randy K. Smith: Cross-language Clone Detection. SEKE 2008: 54-59

• Compares token sequences of CodeDOM graphs using Levenshtein distance
  ◦ The Levenshtein distance between two sequences is the minimum number of single-element insertions, deletions, or substitutions needed to transform one sequence into the other.
• Performs comparisons of code files
• CodeDOM tree is tokenized
• Based on distances
  ◦ Percentage of matching tokens in a sequence
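As a rough illustration of the distance-based comparison, the Python sketch below computes the Levenshtein distance between two token sequences and turns it into a percentage. The `similarity` formula and the token names are assumptions made for the example; the slides do not say exactly how the tool derives its percentage.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]                           # deleting the first i elements of a
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # delete x
                current[j - 1] + 1,             # insert y
                previous[j - 1] + (x != y),     # substitute, free when x == y
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Percentage of the longer sequence left unedited (assumed formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


# Hypothetical token names for two tokenized CodeDOM trees.
csharp_tokens = ["Method", "Param", "Param", "Return", "BinaryOp", "VarRef", "VarRef"]
vb_tokens = ["Method", "Param", "Param", "Assign", "BinaryOp", "VarRef", "VarRef"]
print(levenshtein(csharp_tokens, vb_tokens), round(similarity(csharp_tokens, vb_tokens), 1))
```

The table is filled row by row, so every pair of files costs time proportional to the product of their token counts.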

• Only does file-to-file comparisons
  ◦ Does not detect clones within the same source file
• Can only detect Type 1 and some Type 2 clones
• Not very efficient (brute force)

• Tokens are split into parameter tokens (identifiers and literals) and non-parameter tokens
• Non-parameter tokens are summarized using a hash function
• Parameter tokens are encoded using a position index for their occurrence in the sequence
  ◦ Abstracts concrete names and values while maintaining order
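A minimal Python sketch of this encoding, under stated assumptions: the regex tokenizer, the `KEYWORDS` set, the choice of SHA-1, and the "index of first occurrence" numbering are illustrative choices, not details taken from the slides.

```python
import hashlib
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"return", "if", "else", "for", "while"}   # illustrative subset


def is_parameter(token):
    """Identifiers and literals are parameters; keywords and punctuation are not."""
    first = token[0]
    return (first.isalpha() or first == "_" or first.isdigit()) and token not in KEYWORDS


def encode(source):
    """Return (hash of the non-parameter structure, parameter position indices)."""
    structure, parameters, first_seen = [], [], {}
    for token in TOKEN.findall(source):
        if is_parameter(token):
            # Each distinct name or value gets the index of its first occurrence.
            index = first_seen.setdefault(token, len(first_seen))
            parameters.append(index)
            structure.append("P")               # slot marker inside the structure
        else:
            structure.append(token)
    digest = hashlib.sha1(" ".join(structure).encode()).hexdigest()
    return digest, tuple(parameters)


print(encode("total = total + price * 2;") == encode("sum = sum + cost * 7;"))
```

Two fragments with the same digest share their structure; equal index tuples additionally mean the names and values are used in the same pattern.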

• Represent all suffixes of the sequence in a suffix tree
• Suffixes that share the same edges from the root have a common prefix
  ◦ That prefix occurs more than once (clone)
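The underlying idea, that a prefix shared by two suffixes is a run occurring more than once, can be shown without building a real suffix tree. The Python sketch below swaps in a plainly simpler technique: it sorts all suffixes and compares neighbours. It is considerably slower than a suffix tree on large inputs and is meant only as an illustration; `repeated_runs` and `min_length` are hypothetical names.

```python
def repeated_runs(tokens, min_length=3):
    """Return {run: [start positions]} for token runs that occur more than once."""
    # Sorting the suffixes places suffixes with a shared prefix next to each other.
    suffixes = sorted(range(len(tokens)), key=lambda i: tokens[i:])
    found = {}
    for left, right in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        n = 0
        while (left + n < len(tokens) and right + n < len(tokens)
               and tokens[left + n] == tokens[right + n]):
            n += 1
        if n >= min_length:
            run = tuple(tokens[left:left + n])
            found.setdefault(run, set()).update((left, right))
    return {run: sorted(starts) for run, starts in found.items()}


# Hypothetical normalized token sequence in which one statement appears twice.
sequence = "P = P + P ; call ( P ) ; P = P + P ;".split()
print(repeated_runs(sequence, min_length=6))
```

Lowering `min_length` also reports the shorter runs contained in a longer clone, so a real detector would keep only maximal runs.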

Demonstration: What’s been done

Team Collaboration: Team 2 & Team 3

• Team 2
  ◦ We plan to start giving Team 3 periodic drops of our source code for Java and C# parsing.
  ◦ We are researching and working to update the Java and C# grammars.
• Team 3
  ◦ Team 3 is working on C++ parsing.
    • Looking into another parser, ELSA.

Next Iteration & Schedule

Path Forward
• Finalize Iteration 1 (C++ to CodeDOM)
• Iteration 2 (Code Analysis)
• Iteration 3 (Begin GUI)

Schedule