Cross Language Clone Analysis
Team 2
February 3, 2011
Agenda:
◦ Parsing/CodeDOM
◦ Clone Analysis
◦ Customer Meeting
◦ GUI Implementation
◦ Testing
◦ Current Status
◦ Path Forward
Team: Allen Tucker, Patricia Bradford, Greg Rodgers, Ashley Chafin
Quick Overview
A quick overview of our project and where we currently stand.
Clone Types
3 Types of Clones (Definition of Similarity):
◦ Type 1: An exact copy without modifications (except for whitespace and comments)
◦ Type 2: A syntactically identical copy; only variable, type, or function identifiers have been changed
◦ Type 3: A copy with further modifications; statements have been changed, reordered, added, or removed
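To make these definitions concrete, here is a small hypothetical C# illustration (fragments written for this overview, not taken from any analyzed code base):

// Original fragment.
int Sum(int[] values)
{
    int total = 0;
    foreach (int v in values) { total += v; }
    return total;
}

// Type 1 clone: identical apart from whitespace and comments.
int Sum(int[] values) { int total = 0; /* add them up */ foreach (int v in values) { total += v; } return total; }

// Type 2 clone: same syntax, identifiers renamed.
int Total(int[] numbers)
{
    int acc = 0;
    foreach (int n in numbers) { acc += n; }
    return acc;
}

// Type 3 clone: statements changed or added.
int SumPositive(int[] values)
{
    int total = 0;
    foreach (int v in values)
    {
        if (v > 0) { total += v; }  // added statement
    }
    return total;
}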
Task Understanding
Three Step Process:
◦ Step 1: Code Translation
◦ Step 2: Clone Detection
◦ Step 3: Visualization
(Diagram: Source Files -> Translator -> Common Model -> Inspector -> Detected Clones -> UI -> Clone Visualization)
Task Understanding (cont.)
Step 1: Code Translation
◦ C#, C++, Java, VB (or Python)
◦ CodeDOM
Step 2: Clone Detection
◦ Leverage current clone detection techniques and research
Step 3: Clone Visualization
◦ Need for an intuitive user interface
Dr. Kraft's Application
Limitations
Only does file-to-file comparisons
◦ Does not detect clones within the same source file
Can only detect Type 1 and some Type 2 clones
Not very efficient (brute force)
Enhancements
Add support for same-file clone detection
Add support for Type 3 clone detection
◦ Requires more research
Provide a more efficient clone analysis algorithm
Features
Clone Detection Software Suite
◦ Identifies, tracks, and manages software clones
Multi-language support
◦ C++
◦ C#
◦ Java
Features (cont.)
Provides complete code coverage
Multi-application support
◦ Stand-alone
◦ Plug-in based (Eclipse)
◦ Backend service (Ant task)
Extendible
◦ Built on a plug-in framework
◦ Add new languages
Easy to navigate between clones
Persists clones for easy retrieval
Risks
Complexity of the problem proves more difficult than initial estimates.
Technology to be applied is either not well established or has yet to be developed.
Unable to complete the defined project scope within schedule.
Volatile user requirements lead to redefinition of project objectives.
Design and Architecture
Key Architecture Points
Multi-language support
Configurable for different platforms
◦ Stand-alone application
◦ Plug-in
◦ Backend service
Extendable
Architecture
(Diagram: the core, consisting of the Code Model, Clone Detection Algorithms, Core API, and the Language Support interface, connects the language services (C# Service, Java Service, C++ Service) to the front ends: application user interface, Eclipse plug-in, web interface, etc.)
Core Unit
Code Model
◦ Stores the code in a common format
Application Programming Interface
◦ Used to embed clone detection in applications
Language Service Interface
◦ Communication layer between the core and the specific language services
(Diagram: Code Model, Clone Detection Algorithms, Core API, Language Service Interface)
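A minimal sketch of how the Language Service Interface could be expressed in C# (the interface and member names are illustrative assumptions, not the project's actual API):

using System.CodeDom;
using System.Collections.Generic;

// Hypothetical contract that each language service (C#, Java, C++) would implement.
public interface ILanguageService
{
    // Human-readable language name, e.g. "C#", "Java", "C++".
    string LanguageName { get; }

    // File extensions this service can parse, e.g. ".cs".
    IEnumerable<string> FileExtensions { get; }

    // Parse one source file and translate it into the common CodeDOM model.
    CodeCompileUnit Translate(string sourceFilePath);
}

The core would only talk to an interface like this, which is what lets new languages be added as plug-ins.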
(Screenshot walkthrough of the Visual Studio solution: Visual Studio Solution, Core, Core - API, Language Service, App Configuration)
The Algorithm
Recall the 3 Types of Clones (Definition of Similarity):
◦ Type 1: An exact copy without modifications (except for whitespace and comments)
◦ Type 2: A syntactically identical copy; only variable, type, or function identifiers have been changed
◦ Type 3: A copy with further modifications; statements have been changed, reordered, added, or removed
Pipeline: Code Base -> CodeDOM Conversion -> Processed Code -> Transformation -> Transformed Code -> Match Detection -> Clones -> Formatting -> Clone Pairs/Classes -> Filtering
◦ CodeDOM Conversion: use the GOLD Parser to convert the code base into CodeDOM
◦ Transformation: transform the CodeDOM elements into a sequence of tokens
◦ Match Detection: run the comparison algorithm on the transformed code
◦ Formatting: clone pair/class locations in the transformed code are mapped back to the original code base by line numbers and file location
◦ Filtering: clones are extracted from the source, visualized, and manually analyzed to filter out false positives
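As a toy, text-level illustration of the Transformation step (the real pipeline abstracts CodeDOM elements rather than raw text, and this helper class is ours, not the project's):

using System.Collections.Generic;
using System.Text.RegularExpressions;

static class Tokenizer
{
    // Keywords are kept; every other identifier collapses to "$p", so that
    // Type 2 clones (renamed identifiers) yield identical token strings.
    static readonly HashSet<string> Keywords =
        new HashSet<string> { "for", "while", "if", "else", "return", "int" };

    public static string Abstract(string code)
    {
        return Regex.Replace(code, @"[A-Za-z_]\w*",
            m => Keywords.Contains(m.Value) ? m.Value : "$p");
    }
}

For example, Tokenizer.Abstract("sum += values[i];") yields "$p += $p[$p];".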
Convert source code to CodeDOM
Transform the CodeDOM syntax to a sequence of tokens
Example of two transformed token sequences:
$p$p($p$p&$p){$p$p=$p;$p$p=$p.$p();for(; $p!=$p. $p();++$p){$p<<$p<<$p<<*$p<<$p;++$p;}}
$p$p($p$p&$p){$p$p=$p;$p$p=$p.$p();for(; $p!=$p. $p();++$p){$p $p $p<<$p;++$p;}}
Levenshtein Distance
◦ The minimum number of edits needed to transform one string into the other: insertion, deletion, substitution
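A self-contained C# sketch of the Levenshtein distance used for this comparison (the standard dynamic-programming formulation; the CloneMatcher class name is ours, not the project's):

using System;

public static class CloneMatcher
{
    // Minimum number of insertions, deletions, and substitutions needed
    // to transform string a into string b.
    public static int Levenshtein(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];

        for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // delete all of a
        for (int j = 0; j <= b.Length; j++) d[0, j] = j; // insert all of b

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;    // substitution cost
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,                          // deletion
                    d[i, j - 1] + 1),                         // insertion
                    d[i - 1, j - 1] + cost);                  // substitution or match
            }
        }
        return d[a.Length, b.Length];
    }
}

Two transformed token strings whose distance falls below a chosen threshold would be reported as a candidate clone pair.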
Parsing and conversion to CodeDOM
How It Works (Block Structure)
(Diagram: Grammar -> Compiled Grammar Table (*.cgt); Compiled Grammar Table + Source Code -> Parsed Data)
How It Works (Process)
(Diagram: Grammar -> Compiled Grammar Table (*.cgt) -> parser engine + Source Code -> Parsed Data)
Typical output from the engine: a long nested tree
Usage within CloneDigger
(Diagram: Compiled Grammar Table (*.cgt) + Source Code -> Parsed Data -> CodeDOM Conversion -> AST)
Need to write a routine to move data from the parsed tree to CodeDOM
Parsed data trees from the parser are stored in a consistent data structure, but are based on the rules defined within the grammars
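A rough sketch of what such a conversion routine could look like (the ParseNode type and the rule names are hypothetical stand-ins for the GOLD parser's actual reduction tree):

using System.CodeDom;
using System.Collections.Generic;

// Hypothetical stand-in for one node of the parser's reduction tree.
public class ParseNode
{
    public string Rule;                                      // grammar rule that produced this node
    public string Text;                                      // token text for terminals
    public List<ParseNode> Children = new List<ParseNode>();
}

public static class CodeDomConverter
{
    // Walk the parsed tree and build the corresponding CodeDOM elements.
    public static CodeCompileUnit Convert(ParseNode root)
    {
        var unit = new CodeCompileUnit();
        var ns = new CodeNamespace("Translated");            // placeholder namespace
        unit.Namespaces.Add(ns);

        foreach (ParseNode child in root.Children)
        {
            // Rule names depend entirely on the grammar; "ClassDecl" is an example.
            if (child.Rule == "ClassDecl")
            {
                ns.Types.Add(new CodeTypeDeclaration(child.Text) { IsClass = true });
            }
        }
        return unit;
    }
}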
Grammar Updates
The grammars we currently have for the GOLD parser are outdated.
Current GOLD grammars:
◦ C# version 2.0
◦ Java version 1.4
Currently available versions:
◦ C# version 4.0
◦ Java version 6
Received the grammar and included it in the project.
One parser engine == three languages
CodeDOM
Document Object Model for source code
API: System.CodeDom
Only supports certain aspects of each language since it is language agnostic
◦ Good enough
What does it do?
◦ Programmatically constructs code
What doesn't it do?
◦ Does NOT parse
CodeDOM Example
CodeCompileUnit
◦ CodeNamespace
    Imports
    Types
        Members (Event, Field, Method, Property)
            Statements
                Expressions
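A small, runnable illustration of that hierarchy using the System.CodeDom API (the Demo namespace, Greeter class, and SayHello method are placeholders made up for the example):

using System;
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

class CodeDomDemo
{
    static void Main()
    {
        // CodeCompileUnit -> CodeNamespace -> type -> member -> statements.
        var unit = new CodeCompileUnit();
        var ns = new CodeNamespace("Demo");
        ns.Imports.Add(new CodeNamespaceImport("System"));
        unit.Namespaces.Add(ns);

        var type = new CodeTypeDeclaration("Greeter") { IsClass = true };
        ns.Types.Add(type);

        var method = new CodeMemberMethod
        {
            Name = "SayHello",
            Attributes = MemberAttributes.Public | MemberAttributes.Final
        };
        method.Statements.Add(new CodeMethodInvokeExpression(
            new CodeTypeReferenceExpression("Console"), "WriteLine",
            new CodePrimitiveExpression("Hello from CodeDOM")));
        type.Members.Add(method);

        // CodeDOM constructs the code graph; a provider renders it as C# source.
        using (var writer = new StringWriter())
        {
            new CSharpCodeProvider().GenerateCodeFromCompileUnit(
                unit, writer, new CodeGeneratorOptions());
            Console.WriteLine(writer.ToString());
        }
    }
}

Note that nothing here parses existing source; CodeDOM only builds and emits code, which is why the GOLD parser is needed on the front end.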
White Box and Black Box Testing
White Box Testing:
◦ Unit Testing
Black Box Testing:
◦ Production Rule Testing: allows us to test the robustness of our engine because we can force rule production errors
Regression Testing (Automated)
◦ Functional Testing
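For example, a white-box unit test for the comparison step might look like this (NUnit-style and purely illustrative; it exercises the CloneMatcher sketch shown earlier, not the project's actual code):

using NUnit.Framework;

[TestFixture]
public class CloneMatcherTests
{
    [Test]
    public void IdenticalTokenStrings_HaveZeroDistance()
    {
        Assert.AreEqual(0, CloneMatcher.Levenshtein("$p$p($p)", "$p$p($p)"));
    }

    [Test]
    public void SingleSubstitution_HasDistanceOne()
    {
        Assert.AreEqual(1, CloneMatcher.Levenshtein("kitten", "sitten"));
    }
}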
Current test count: 33
Added tests to cover existing code
All tests are passing…
◦ "Happy path" tests
◦ Will begin off-nominal tests
Where we currently stand
Where we stand…
These estimates are only for work done this semester.
Source Code Load & Translate
◦ C - %
◦ C# - 0%
◦ Java - 35%
◦ Associate - 0%
Source Code Analyze
◦ Dr. Kraft's analysis technique - 40%
◦ Type 1 clones - 0% (implement next iteration)
◦ Type 2 clones - 0%
◦ Type 3 clones - 0%
Where we stand…
Project Management
◦ Remove "demo" GUI - 100%
◦ Sketches for visual design - 40%
◦ GUI rework - 83%
Testing
◦ Baseline unit tests - 100%
◦ Update unit tests for this iteration - 90%
◦ Create/update functional tests - 75%
As of Feb 3, 2011
SLOC (counted with lcounter.exe):
◦ CS666_Client = 2137 lines
◦ CS666_Core = 2695 lines
◦ CS666_Console = 138 lines
◦ CS666_CppParser = 155 lines
◦ CS666_CsParser = 3265 lines
◦ CS666_JavaParser = 3388 lines
◦ CS666_LanguageSupport = 84 lines
◦ CS666_UnitTests = 944 lines
Total = 12806 lines (including unit tests)
Path Forward for the next iteration
Schedule
Next Iteration
Below is a list of the tasks for our next iteration:
◦ Parsing/CodeDOM
    C++ parsing
    Complete Java conversion to CodeDOM
◦ Clone Analysis
    Detecting Type 1 clones
◦ GUI
    Project management
    Displaying source code
    Sketches for visual design
Next Iteration (cont.)
◦ Documentation
    User stories, use cases, UML models, sketches
    Project management
    Displaying source code
    Displaying CodeDOM
    Displaying Type 1 clones detected
    Functional tests
    Update schedule
◦ Testing
    Unit tests
    Execute functional tests