DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool
Simone Livieri, Yoshiki Higo, Makoto Matsushita, Katsuro Inoue
Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Background
- Open-Source Software (OSS) is used in many software systems
- Relations between software systems can be exposed through code clone analysis
- Large collections of OSS exist
- Analyzing them has huge memory requirements and a long running time
- Computing power is cheap
- Large numbers of computers are often easily accessible
- Code clone analysis can be distributed

In the beginning was CCFinder
- CCFinder is a code clone analysis tool
- Widely used and cited
- Token based
- Many languages supported (e.g. C, C++, Java)
- Good scalability (but can't handle very large input)

DCCFinder
- D(istributed)CCFinder is a tool for distributed code clone analysis
- Master-slave distributed system
- Data sharing through a shared file system
- Uses CCFinder to perform the code clone analysis
- The prototype ran on 80 computers of the Student Laboratory of our department

Computational Model
[Diagram: the target is divided into categories (category 1 to category 4), projects (project 1 to project 8) and units (unit 1, ..., unit i, ..., unit j, ..., unit n); piece (i, j) combines unit i and unit j and is passed to CCFinder on a slave node]
- The target is the set of source files undergoing code clone analysis
- A category is a set of source files sharing a specific feature or use
- A project is a single software system
- A unit is a set of source files that may cross multiple projects
- Two units make a piece. A piece is the collection of files that is analyzed on a single slave node
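The unit and piece definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual code: the greedy size-based packing, the function names `make_units` and `make_pieces`, and the choice to pair each unit with itself (so that clones inside a single unit are also found) are all hypothetical.

```python
from itertools import combinations_with_replacement


def make_units(files, unit_size):
    """Greedily pack (path, size) entries, in order, into units of at most unit_size.

    Assumption: units are cut purely by accumulated size, so a unit may span
    several projects. A single file larger than unit_size gets its own unit.
    """
    units, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > unit_size:
            units.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        units.append(current)
    return units


def make_pieces(n_units):
    """Enumerate pieces as index pairs (i, j) with i <= j.

    Assumption: a unit is also paired with itself.
    """
    return list(combinations_with_replacement(range(n_units), 2))


files = [("a.c", 6), ("b.c", 5), ("c.c", 7), ("d.c", 3)]
units = make_units(files, unit_size=10)
print(units)                    # [['a.c'], ['b.c'], ['c.c', 'd.c']]
print(make_pieces(len(units)))  # [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (2, 2)]
```

With n units this enumeration yields n(n + 1)/2 pieces, each of which can be handed to a slave node independently of the others.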

System Implementation (1)
- Written in Java (about 20 kLOC)
- Master-Slave-Registry communication handled with Java RMI
- Basic fault tolerance

Master and slave node characteristics:
  Processor      Pentium IV 3 GHz
  Memory         1 GByte
  Network link   Gigabit Ethernet connected to 100 MBit/s network hubs
  OS             FreeBSD 5.3-STABLE
  Local storage  40 to 50 GBytes

Analysis Process

System Implementation (2)
Indexer
- Examines the target and collects file size, LoC, project name and category name
- Computes unit boundaries
Master Node
- Creates the input files for CCFinder and assigns jobs to the slaves
Slave Node
- Copies the files to the local storage
- Executes CCFinder
- Copies the output to the shared storage
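As a rough illustration of the indexer step, the sketch below walks a directory tree and records size, LoC, project and category for each source file. The `root/<category>/<project>/...` layout, the function name `index_target`, and counting LoC as physical lines are all assumptions made for this example; the slides do not describe how the real indexer derives these values.

```python
from pathlib import Path


def index_target(root, suffix=".c"):
    """Return one record per source file found under root.

    Assumption: files live under root/<category>/<project>/..., so the first
    two path components name the category and the project.
    """
    root = Path(root)
    records = []
    for path in sorted(root.rglob("*" + suffix)):
        parts = path.relative_to(root).parts
        if not path.is_file() or len(parts) < 3:
            continue  # skip directories and files outside category/project
        with open(path, "rb") as handle:
            loc = sum(1 for _ in handle)  # physical lines, an assumed LoC measure
        records.append({
            "path": str(path),
            "category": parts[0],
            "project": parts[1],
            "size": path.stat().st_size,
            "loc": loc,
        })
    return records
```

The resulting records carry the file sizes needed to compute unit boundaries and the project and category names needed later for aggregating clone coverage.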

System Implementation (3)
Clone Coverage Analyzer
- Computes the number of shared lines of code between each pair of files, projects and categories
Image Generator
- Generates scatter plots, heat maps or bar charts from the clone coverage data

Case Study I: The FreeBSD Target
- Vast collection of open-source software used by the FreeBSD OS
- Unit size: 15 MBytes
- Minimum code clone length: 50 tokens
- Total number of tasks: 269,745

Target size:
  Number of categories      45
  Number of projects        6,658
  Number of .c files        754,552
  Total lines of code       403,625,067
  Total size                10.8 GBytes

Time elapsed:
  Indexer                                 22 minutes
  D-CCFinder                              51 hours
  Scatter plot: Clone Coverage Analyzer   23 hours
  Scatter plot: Image Generator           4 hours
  Total                                   78 hours 22 minutes
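The reported task count can be cross-checked against the unit size. The sketch below assumes that every task is one piece and that pieces are all unit pairs (i, j) with i <= j, so that n units give n(n + 1)/2 tasks; the slides do not state this formula, and `units_from_tasks` is a hypothetical helper.

```python
import math


def units_from_tasks(tasks):
    """Solve n * (n + 1) / 2 == tasks for a whole number n, or return None."""
    n = (math.isqrt(8 * tasks + 1) - 1) // 2
    return n if n * (n + 1) // 2 == tasks else None


n_units = units_from_tasks(269_745)
print(n_units)  # 734

# Average unit size in MBytes, taking 10.8 GBytes as 10,800 MBytes.
print(10.8 * 1000 / n_units)
```

Under that assumption the 269,745 tasks correspond to exactly 734 units, and 10.8 GBytes spread over 734 units comes to a little under 15 MBytes each, which agrees with the stated unit size.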

Case Study I: Result

Case Study I: Result
- php4 and php5: duplicated source tree

Case Study I: Result
- gstream's main source tree is duplicated inside all the gstream plugin projects

Case Study I: Result
- Multiple copies of the X Window System source tree

Case Study I: Result

Case Study I: Result
Database Category
- CCC1: 41%
- Causes:
  - Different versions of the same software
  - Database drivers for different languages
  - Multiple copies of the phpX source tree

Case Study I: Result
Development Category
- CCC1: 38%
- Causes:
  - Mainly the presence of different versions of the GNU binary utilities and compilers

Case Study I: Result
Lang and Development Categories
- CCC1: 28%
- Causes:
  - The presence in both categories of the suite of GNU compilers

Case Study I: Result
X11 Fonts Category
- CCC1: 46%
- Causes:
  - Small category size
  - Seven copies of the X Window System source tree

Case Study II: SPARS-J and the FreeBSD Target
- SPARS-J is a Java component analysis tool
- About 47,000 lines of code, written in C
- Code clones between SPARS-J and the whole FreeBSD target were detected

Case Study II: Code Clone Coverage (before)
- Most of the code clones were from a single file: getopt.c

Case Study II: Code Clone Coverage (after)
- Code clones from CGI handling source code
- Specialized version of getopt.c

Summary
- Proposed a new approach to distributed large-scale code clone analysis
- Obtained a global overview of code clones in the FreeBSD target
- In SPARS-J, easily identified the use of code from the FreeBSD target

Summary (2)
- The speedup (acceleration gain) was 20
  - Limited by: data transfer, network congestion, master-slave coordination
- Generating a scatter plot of reasonable size traded accuracy for speed
  - Effects: source code organization easily visible, enhanced artifacts, finer details not distinguishable
- Currently can't efficiently filter unnecessary or not-so-interesting code clones
  - Being addressed by exploring fingerprint-based source code analysis

Future Work
- D-CCFinder is currently being rewritten
  - Better fault tolerance
  - GUI
  - Distributed post-processing and image generation
- Exploring the evolution of different software systems with code clone analysis

Metrics
- (M0, M1): a pair of files, projects or categories
- Segments of the code clones between M0 and M1
- Segments of the code clones between M0 and M1 that lie in M0
- Number of lines of code in x
- CCC1 is the percentage of shared lines of code between M0 and M1, computed over the total lines of code of M0 and M1
- CCC2 is the percentage of lines of code that M0 shares with M1, computed over the total lines of code of M0
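The two coverage metrics can be illustrated with a short sketch. It is one reading of the verbal definitions above, not the authors' formulas: M0 and M1 are treated as single files, clone segments are given as inclusive (start, end) line ranges, and the names `covered_lines`, `ccc1` and `ccc2` are hypothetical.

```python
def covered_lines(segments):
    """Count distinct lines covered by inclusive (start, end) ranges.

    Overlapping clone segments are counted once.
    """
    lines = set()
    for start, end in segments:
        lines.update(range(start, end + 1))
    return len(lines)


def ccc1(clones_in_m0, clones_in_m1, loc_m0, loc_m1):
    """Shared lines on both sides, as a percentage of the combined size."""
    shared = covered_lines(clones_in_m0) + covered_lines(clones_in_m1)
    return 100.0 * shared / (loc_m0 + loc_m1)


def ccc2(clones_in_m0, loc_m0):
    """Lines of M0 shared with M1, as a percentage of the size of M0."""
    return 100.0 * covered_lines(clones_in_m0) / loc_m0


# M0 has 100 lines, 30 of them inside clones; M1 has 200 lines, 30 inside clones.
in_m0 = [(10, 29), (20, 39)]
in_m1 = [(1, 30)]
print(ccc1(in_m0, in_m1, 100, 200))  # 20.0
print(ccc2(in_m0, 100))              # 30.0
```

CCC1 is symmetric in M0 and M1, while CCC2 is not: a small file copied wholesale into a large project has a high CCC2 from its own side and a low one from the project's side.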
