Multilingual Detection of Code Clones Using ANTLR Grammar Definitions

Slides:



Advertisements
Similar presentations
ANTLR in SSP Xingzhong Xu Hong Man Aug Outline ANTLR Abstract Syntax Tree Code Equivalence (Code Re-hosting) Future Work.
Advertisements

Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Extraction of.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Extracting Code.
176 Formal Languages and Applications: We know that Pascal programming language is defined in terms of a CFG. All the other programming languages are context-free.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Industrial Application.
Software Engineering Lab, Osaka University Code Clone Analysis and Its Application Katsuro Inoue Osaka University.
1 Introduction to Parsing Lecture 5. 2 Outline Regular languages revisited Parser overview Context-free grammars (CFG’s) Derivations.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University ICSE 2003 Java.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Finding Similar.
(1.1) COEN 171 Programming Languages Winter 2000 Ron Danielson.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University What Kinds of.
Department of Computer Science, Graduate School of Information Science and Technology, Osaka University DCCFinder: A Very- Large Scale Code Clone Analysis.
Chapter 1 Introduction Dr. Frank Lee. 1.1 Why Study Compiler? To write more efficient code in a high-level language To provide solid foundation in parsing.
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University A clone detection approach for a collection of similar.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University What Do Practitioners.
Chapter 10: Compilers and Language Translation Invitation to Computer Science, Java Version, Third Edition.
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University A Method to Detect License Inconsistencies for Large-
Mining and Analysis of Control Structure Variant Clones Guo Qiao.
Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Machine-independent code improvement Target code generation Machine-specific.
2002/12/11PROFES20021 On software maintenance process improvement based on code clone analysis Yoshiki Higo* , Yasushi Ueda* , Toshihiro Kamiya** , Shinji.
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University Detection and evolution analysis of code clones for.
1 Gemini: Maintenance Support Environment Based on Code Clone Analysis *Graduate School of Engineering Science, Osaka Univ. **PRESTO, Japan Science and.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Applying Clone.
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University Inoue Laboratory Eunjong Choi 1 Investigating Clone.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University How to extract.
D. M. Akbar Hussain: Department of Software & Media Technology 1 Compiler is tool: which translate notations from one system to another, usually from source.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Development of.
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University Retrieving Similar Code Fragments based on Identifier.
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University 1 Towards an Assessment of the Quality of Refactoring.
Introduction to Parsing
Introduction to Lex Ying-Hung Jiang
Duplicate code detection using anti-unification Peter Bulychev Moscow State University Marius Minea Institute eAustria, Timisoara.
Introduction to Lexical Analysis and the Flex Tool. © Allan C. Milne Abertay University v
Weaving a Debugging Aspect into Domain-Specific Language Grammars SAC ’05 PSC Track Santa Fe, New Mexico USA March 17, 2005 Hui Wu, Jeff Gray, Marjan Mernik,
1 Measuring Similarity of Large Software System Based on Source Code Correspondence Tetsuo Yamamoto*, Makoto Matsushita**, Toshihiro Kamiya***, Katsuro.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University 1 Classification.
CSC3315 (Spring 2009)1 CSC 3315 Lexical and Syntax Analysis Hamid Harroud School of Science and Engineering, Akhawayn University
1 January 18, January 18, 2016January 18, 2016January 18, 2016 Azusa, CA Sheldon X. Liang Ph. D. Computer Science at Azusa Pacific University Azusa.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University 1 Extracting Sequence.
What kind of and how clones are refactored? A case study of three OSS projects WRT2012 June 1, Eunjong Choi†, Norihiro Yoshida‡, Katsuro Inoue†
1 Gemini: Code Clone Analysis Tool †Graduate School of Engineering Science, Osaka Univ., Japan ‡ Graduate School of Information Science and Technology,
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University Detection of License Inconsistencies in Free and.
Estimating Code Size After a Complete Code-Clone Merge Buford Edwards III, Yuhao Wu, Makoto Matsushita, Katsuro Inoue 1 Graduate School of Information.
CS 3304 Comparative Languages
Regular Expressions, Backus-Naur Form and Reverse Polish Notation
Chapter 3 – Describing Syntax
Lexical and Syntax Analysis
A Simple Syntax-Directed Translator
Lexical Analysis.
CS510 Compiler Lecture 4.
Structured Programming
PROGRAMMING LANGUAGES
Formal Language Theory
Yuta Nakamura1, Eunjong Choi1, Norihiro Yoshida2,
○Yuichi Semura1, Norihiro Yoshida2, Eunjong Choi3, Katsuro Inoue1
Programming Language Syntax 2
Ada – 1983 History’s largest design effort
Review: Compiler Phases:
R.Rajkumar Asst.Professor CSE
CS 3304 Comparative Languages
Quaid-i-Azam University
CS 3304 Comparative Languages
Yuhao Wu1, Yuki Manabe2, Daniel M. German3, Katsuro Inoue1
On Refactoring Support Based on Code Clone Dependency Relation
12th Computer Science – Unit 5
Chapter 10: Compilers and Language Translation
Compiler Construction
Faculty of Computer Science and Information System
Presentation transcript:

Multilingual Detection of Code Clones Using ANTLR Grammar Definitions Yuichi Semura Osaka University Norihiro Yoshida Nagoya University Eunjong Choi NAIST Katsuro Inoue Osaka University APSEC2018

Code Clone A code fragment with an identical or similar code fragment. Introduced in source program by various reasons such as reusing code by `copy-and-paste’. Make software maintenance more difficult. Several tools have been proposed for detecting code clones. Code Clone File A File B copy and paste copy and paste

Parameterized Clones [1] Code fragments that are structurally/syntactically identical, except for variations in identifiers, literals, types, layout and comments. void show(int range){ int x = 0; //init for(int i=0 ; i<range ; i++){ printf(“%d ”,x); x=x+i; } void print(int max){ int x = 0; /*total*/ for(int i=0 ; i<max ; i++){ These are structurally similar each other. [1] Roy, Chanchal K., James R. Cordy, and Rainer Koschke. "Comparison and evaluation of code clone detection techniques and tools: A qualitative approach." Science of computer programming 74.7 (2009): 470-495.

A Clone Detection Tool: CCFinderX [2] [3] Features of CCFinderX CCFinderX is widely-used in academic research as well as industries. Detects token-based and parameterized code clones from input source code. Handles COBOL, C/C++, C#, Java, Visual Basic. [2] Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., Vol. 28, No. 7, pp. 654–670, 2002. [3] T. Kamiya, “the archive of CCFinder Official Site,” 2005. [Online]. Available: http://www.ccfinder.net/

Lexical Analysis in CCFinderX if (b==c) value=i ; if b == c value = i ; ( ) Lexical Analysis Source Code Source Files CCFinderX Lexical Analysis Transformation Splitting source codes to tokens. The lexical analysis depends on the grammar of the language. To apply an additional language, it is necessary to implement the corresponding lexical analyzer. Detection/Formatting Token-based Clone-Pairs 

Transformation in CCFinderX Source Code Source Files if (b==c) value=i ; CCFinderX Transformation Lexical Analysis if ( b == c ) value = i ; Transformation if ( $ == $ ) $ = $ ; Detection/Formatting Replacing identifiers and literals to the same token (ex. $). This transformation enables the detection of parameterized clones. Token-based Clone-Pairs 

Motivation of Our Research When we apply a clone detection tool to additional languages… It is necessary to implement lexical analyzers. -- This implementation takes more time and effort. More simple extension mechanism is required to handle additional languages. [3] It is necessary to catch and analyze lexical differences among programming languages. [3] Kazunori Sakamoto. Occf: A framework for developing test coverage measurement tools supporting multiple programming languages. Software Testing, IEEE Sixth International Conference on, Verification and Validation (ICST), pp.422--430. 2013

A Parser Generator: ANTLR [4] ANTLR is a parser generator and widely-used. It creates a lexer, parser or compiler based on a grammar definition file of the target language. A grammar definition file of a programming language may have lexical information for code clone detection. A GitHub repository namely ‘grammars-v4’ [5] is a collection of grammar definition files of many languages. Parser Generator ANTLR prog: expr; expr: term (('+'|'-') term)*; term: factor (('*'|'/') factor)*; factor: INT | '(' expr ')' ; INT : [0-9]+ ; Grammar Definition Files (~.g4) Lexer, Parser or Compiler input output [4] ANTLR: http://www.antlr.org/ [5] https://github.com/antlr/grammars-v4

An Overview of Our Research Code Clone Detection using grammar definitions We extended CCFinderSW, a code clone detection tool based on CCFinderX. CCFinderSW has a lexical information extractor and uses grammar definition files of ANTLR. The extractor transforms comment rules and reserved words into Regular Expressions. Evaluation We applied the extractor to the 43 grammar definition files in the GitHub repository namely ‘grammars-v4’.

CCFinderSW [6] Users can download grammar files from an online repository such as ‘grammars-v4’. Source Code Source Files Grammar Definition File Lexical Information Extractor CCFinderSW Comment Rules And Reserved Words Lexical Analysis Transformation Detection/Formatting Token-based Clone-Pairs  [6] Y. Semura, N. Yoshida, E. Choi, and K. Inoue, “CCFinderSW: Clone detection tool with flexible multilingual tokenization,” in Proc. of APSEC 2017, 2017, pp. 654–659.

Implementation of the lexical information extractor Extracts comment and reserved word definitions and transforms them into Regular Expressions (RegEx) Lexical Information Extractor Reserved Word RegEx Comment Grammar Definition File

Investigation of grammar definition files We investigated notations of grammar definitions files in the GitHub repository ‘grammars-v4’ This repository contains over 150 ANTLR grammar definition files(~.g4) Example: Comment rule definitions in C[7] BlockComment : '/*' .*? '*/' -> skip ; This comment is enclosed by /* and */. LineComment : '//' ~[\r\n]* -> skip ; This comment continues to the end of line. [7] https://github.com/antlr/grammars-v4/blob/master/c/C.g4

Identification of comment rules Based on our investigation, we prepare 4 patterns so as to identify comment rules. A name of a definition contains ‘comment’, ‘COMMENT’ and so on. A definition is linked to a ‘skip’ command. A definition is linked to a ‘channel(HIDDEN)’ command. A definition is linked to a ‘channel(X)’ command. Moreover, X contains ‘comment’, ‘COMMENT’, and so on. Comment: ‘/*’ .*? ‘*/’; Block1: ‘/*’ .*? ‘*/’ -> skip; Block2: '/*' .*? '*/‘ -> channel(HIDDEN); Block3: '/*' .*? '*/' -> channel(COMMENT_C);

Transformation of comment definitions into a RegEx The transformation of comment definitions is comprised of the following 4 steps: Identifies comment definitions from all definitions. Step A Apply other grammar definitions to references in identified definitions. Step B Transform applied definitions into RegExes in available in Java. Step C Combine all transformed RegExes into one RegEx. Step D

Transformation of comment definitions (Step A) Step A: The extractor identifies comment definitions from all definitions. Step A BComment: CSTART .*? CEND; LComment: DSLASH ~[\r\n]*; CSTART: ’/*’; CEND: ’*/’; DSLASH: ’//’; Step B BComment: '/*' .*? '*/'; LComment: '//' ~[\r\n]*; Step C /\*[\s\S]*?\*/ //((?![\r\n])[\s\S])* Step D /\*[\s\S]*?\*/|//((?![\r\n])[\s\S])*

Transformation of comment definitions (Step B) Step B: The extractor applies recursively other grammar definitions to references in identified definitions. Step A BComment: CSTART .*? CEND; LComment: DSLASH ~[\r\n]*; CSTART: ’/*’; CEND: ’*/’; DSLASH: ’//’; Step B BComment: '/*' .*? '*/'; LComment: '//' ~[\r\n]*; Step C /\*[\s\S]*?\*/ //((?![\r\n])[\s\S])* Step D /\*[\s\S]*?\*/|//((?![\r\n])[\s\S])*

Transformation of comment definitions (Step C) Step C: The extractor transforms applied definitions to RegExes of Java language. Step A BComment: CSTART .*? CEND; LComment: DSLASH ~[\r\n]*; CSTART: ’/*’; CEND: ’*/’; DSLASH: ’//’; Step B BComment: '/*' .*? '*/'; LComment: '//' ~[\r\n]*; Step C /\*[\s\S]*?\*/ //((?![\r\n])[\s\S])* Step D /\*[\s\S]*?\*/|//((?![\r\n])[\s\S])*

Transformation of comment definitions (Step D) Step D: The extractor combines all transformed RegExes into a RegEx. Step A BComment: CSTART .*? CEND; LComment: DSLASH ~[\r\n]*; CSTART: ’/*’; CEND: ’*/’; DSLASH: ’//’; Step B BComment: '/*' .*? '*/'; LComment: '//' ~[\r\n]*; Step C /\*[\s\S]*?\*/ //((?![\r\n])[\s\S])* Step D /\*[\s\S]*?\*/|//((?![\r\n])[\s\S])*

Identification of reserved words In our investigation of reserved word definitions, 2 notations appear frequently. A grammar definition which is composed of only alphabets and is enclosed in single quotations. A grammar definition matches alphabets sequences. A character set ‘[wW]’ matches both uppercase and lowercase letter of ‘w’. BREAK : 'break'; CONTINUE : 'continue'; BREAK: [bB] [rR] [eE] [aA] [kK] ; // BREAK break BrEAk... CONTINUE: [cC] [oO] [nN] [tT] [iI] [nN] [uU] [eE]; //continue ConTInUe CoNtInuE...

Transformation of reserved word into a RegEx The process of transformation of reserved word into a RegEx is comprised of the following five steps: Identifies character sequences composed of only alphabets. Step 1 Apply other grammar definitions to references in identified definitions. Step 2 Transform applied definitions into RegExes in available in Java. Step 3 Identifies RegExes which match alphabet strings transformed in Step 3. Step 4 Combine all transformed RegExes into one RegEx. Step 5

Transformation of reserved word definitions (Step 1) Step 1: The extractor identifies character sequence composed of only alphabets. Step 1 All definitions in a grammar file CONTINUE : ‘continue'; CASE: ‘c’ A S E; A: [aA]; S: ‘s’ | ’S’; E: [eE]; Step 2 CASE: ‘c’ [aA] ( ‘s’ | ’S’ ) [eE]; Step 3 c[aA](s|S)[eE] Step 4,5 continue|c[aA](s|S)[eE]

Transformation of reserved word definitions (Step 2) Step 2: The extractor applies other grammar definitions to references in identified definitions. Step 1 All definitions in a grammar file CONTINUE : ‘continue'; CASE: ‘c’ A S E; A: [aA]; S: ‘s’ | ’S’; E: [eE]; Step 2 CASE: ‘c’ [aA] ( ‘s’ | ’S’ ) [eE]; Step 3 c[aA](s|S)[eE] Step 4,5 continue|c[aA](s|S)[eE]

Transformation of reserved word definitions (Step 3) Step 3: The extractor transforms applied definitions into RegExes of Java language. . Step 1 All definitions in a grammar file CONTINUE : ‘continue'; CASE: ‘c’ A S E; A: [aA]; S: ‘s’ | ’S’; E: [eE]; Step 2 CASE: ‘c’ [aA] ( ‘s’ | ’S’ ) [eE]; Step 3 c[aA](s|S)[eE] Step 4,5 continue|c[aA](s|S)[eE]

Transformation of reserved word definitions (Step 4 and 5) Step 4 and 5: The extractor identifies RegExes which match alphabet strings and combines all transformed RegExes into one RegEx. Step 1 All definitions in a grammar file CONTINUE : ‘continue'; CASE: ‘c’ A S E; A: [aA]; S: ‘s’ | ’S’; E: [eE]; Step 2 CASE: ‘c’ [aA] ( ‘s’ | ’S’ ) [eE]; Step 3 c[aA](s|S)[eE] Step 4,5 continue|c[aA](s|S)[eE]

Evaluation Purpose of the evaluation We confirmed the accuracy of the transformation from grammar definitions into RegExes. Methodology We investigated how many grammar files can our tool analyze in terms of comments and reserved words.

Evaluation: selection of target files Selected 154 grammar files in ‘grammars-v4’. Focused 43 files, which are available in the advanced search of the code search engine at GitHub[7]. agc antlrv4 apex asm6502 aspectj brainf**k c clojure cobol85 cool cpp14 csharp css3 ecmascript erlang fortran77 golang html idl java9 kotlin lua modelica m2pim4 objective-c pascal php plsql prolog protobuf3 python3 r rexx scala smalltalk smtlibv2 swift3 vba verilog vhdl visualbasic webidl xml [7] https://github.com/search/advanced

Evaluation: Methods Methods: Identify manually the notations of comments and reserved words from each of the grammar definition files of the 43 languages. Apply the extractor to the 43 grammar definition files. Check manually whether the extracted RegExes are correct as comments and reserved words.

Evaluation: Result Result: Comment Rules: 38 out of 43 grammar files Reserved Words: 36 out of 43 grammar files Reasons why our tool cannot extract some comment rules and reserved words A grammar files contains complex notations. A programming language has a comment rule which is transformed into RegEx.

Summary Conclusion We extended CCFinderSW that has a lexical information extractor from grammar definitions of ANTLR. We indicated that the extended CCFinderSW can extract most grammar rules in 43 languages. Future Work We are now analyzing OSS project written in Go, Python and Rust using the extended CCFinderSW.