Source File Set Search for Clone-and-Own Reuse Analysis

Presentation transcript:

Source File Set Search for Clone-and-Own Reuse Analysis Takashi Ishio†‡, Yusuke Sakaguchi‡, Kaoru Ito‡, Katsuro Inoue‡ NAIST SE LAB † Nara Institute of Science and Technology, Japan ‡ Osaka University, Japan MSR2017

Motivation: Software Reuse
Developers often reuse existing libraries.
Cloned components in Firefox 45: libjpeg, expat, libpng, zlib, libjar, stlport, libvpx, libogg, freetype2

Library Update Problem
Release notes and security advisories often specify existing versions that should be updated.
"Due to the bug fixes, any installations of 1.2.9 or 1.2.10 should be immediately replaced with 1.2.11." (zlib 1.2.11 release, http://www.zlib.net/)
The version number of a library copy is very important to answer the question: "Should we update this library copy?"

Version number is often unavailable [Xia, 2013]
Some projects record version numbers in their repositories, but this information may get lost over time.
Firefox-45:modules/zlib: "Upgrade zlib to version 1.2.8"
NSS-3.14:lib/zlib: "reorganize NSS directory layout, moving files, very large changeset!" (No version information)

Recovering Version Information from Source Files
Query: A set of files
Result: A list of components that are likely reused
Query (Firefox-45.0, 27 files): zlib/gzlib.c, zlib/inflate.c, zlib/mozzconf.h, zlib/zconf.h, zlib/zlib.h, zlib/zutil.c, …
Database: Debian GNU/Linux Package Database (200,018 packages)
Result:
# | Package Name | Total Sim | Same | Similar
1 | zlib-1:1.2.8.dfsg-2 | 25.9714 | 22 | 4
2 | genometools-1.5.8-2 | 25.9670 | |
3 | mongodb-1:3.2.8-1 | 19.9505 | 15 | 5

Process
1. Component Search: Compute similarity between query files and existing component files.
2. Component Ranking: Select components using aggregated file similarity.

Similarity Definition
Jaccard index of trigrams: an approximation of edit distance.
$\mathrm{sim}(a, b) = \frac{|\mathrm{trigrams}(a) \cap \mathrm{trigrams}(b)|}{|\mathrm{trigrams}(a) \cup \mathrm{trigrams}(b)|}$
White space and comments are ignored. C/C++ and Java are supported in this paper.
Example:
f1: while (( *dst++ = *src++) != '\0');
f2: while (*dst++ = *src++);
trigrams(f1): [_, _, while], [_, while, (], [while, (, (], [(, (, *], [(, *, dst], [*, dst, ++], …
trigrams(f2): [_, _, while], [_, while, (], [while, (, *], [(, *, dst], [*, dst, ++], …
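The definition above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the regular-expression tokenizer and the names `TOKEN`, `trigrams`, and `sim` are assumptions made for illustration, and comment stripping is omitted. White space is dropped by the tokenizer, and two "_" markers pad the start of the token sequence as in the slide example.

```python
import re

# Hypothetical tokenizer: identifiers, numbers, the two-character operators
# "++" and "!=", and otherwise any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|!=|\S")

def trigrams(source):
    """Return the set of token trigrams, padded with two leading "_" markers."""
    tokens = ["_", "_"] + TOKEN.findall(source)
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def sim(a, b):
    """Jaccard index of the trigram sets of two source texts."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    # Two empty inputs are treated as identical (an assumption).
    return len(ta & tb) / len(union) if union else 1.0

f1 = r"while (( *dst++ = *src++) != '\0');"
f2 = "while (*dst++ = *src++);"
print(sorted(trigrams(f1) - trigrams(f2)))  # trigrams that occur only in f1
print(sim(f1, f2))
```

Because the score depends only on the sets of trigrams, moving a statement within a file changes only the few trigrams around its old and new positions.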

1. Component Search
Find the most similar file in each component. Less similar files are ignored: only file pairs with $\mathrm{sim}(a, b) \geq th$ are kept.
[Figure: query files in Firefox-45.0 (zlib/gzlib.c, zlib/inflate.c, zlib/mozzconf.h, zlib/zconf.h, zlib/zlib.h, zlib/zutil.c, …) are linked to files of the database components zlib-1.2.8 (gzlib.c, inflate.c, zconf.h, …), zlib-1.2.7 (gzlib.c, inflate.c, zconf.h, …), and mongodb-3.2.8 (inflate.c, …). Edge labels: 1.0, 0.9948, 0.9858, 0.9160, 0.9568, 0.9384, 0.991.]
Components including similar files are likely reused.
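The search step can be sketched as below. This is an illustrative sketch under assumptions: the database is modeled as a plain dictionary from component name to files, tokens are split on white space, and the names `component_search`, `lib-A`, and `lib-B` are hypothetical. The real database is far too large for this exhaustive loop.

```python
def trigrams(text):
    """Token trigrams of a text whose tokens are separated by white space."""
    tokens = ["_", "_"] + text.split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def sim(a, b):
    """Jaccard index of the trigram sets of two texts."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

def component_search(query, database, th):
    """For each component, keep the best similarity per query file if >= th."""
    result = {}
    for component, files in database.items():
        best = {}
        for qname, qtext in query.items():
            score = max((sim(qtext, text) for text in files.values()), default=0.0)
            if score >= th:
                best[qname] = score
        if best:
            result[component] = best
    return result

# Hypothetical toy data, not taken from the slides.
query = {"zlib/inflate.c": "int inflate ( z_streamp strm , int flush )"}
database = {
    "lib-A": {"inflate.c": "int inflate ( z_streamp strm , int flush )"},
    "lib-B": {"other.c": "void unrelated ( void )"},
}
print(component_search(query, database, th=0.6))
```

Components with no file at or above the threshold are dropped from the result, which matches the idea that only components including similar files are reuse candidates.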

Implementation Issue
A naïve file comparison takes time: |Q| × |F| comparisons.
|Q|: number of query files (27 files in the zlib directory)
|F|: number of database files (11,040,924 files in Debian GNU/Linux, C/C++ and Java)
→ We employ the b-bit minwise hashing technique.

b-bit minwise hashing [Li, 2010]
1-bit Min-Hash: $b(f) = \min \{ h(t) \mid t \in \mathrm{trigrams}(f) \} \bmod 2$
Each trigram t_i of f1 ([_, _, while], [_, while, (], [while, (, (], [(, (, *], [(, *, dst], [*, dst, ++], …) is hashed to h(t_i); the minimum hash value modulo 2 gives b(f1) ∈ {0, 1}. b(f2) ∈ {0, 1} is computed from trigrams(f2) in the same way.
The more similar f1 and f2 are, the more likely b(f1) = b(f2).
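A minimal sketch of the 1-bit Min-Hash follows. The slides do not say which hash function h is used, so the seeded SHA-1 construction and the names `h` and `one_bit_minhash` are assumptions for illustration only.

```python
import hashlib

def h(trigram, seed=0):
    """Hypothetical hash: first 8 bytes of SHA-1 over the seed and the tokens."""
    data = "\x00".join([str(seed), *trigram]).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def one_bit_minhash(trigram_set, seed=0):
    """b(f): the minimum hash value over all trigrams of f, modulo 2."""
    return min(h(t, seed) for t in trigram_set) % 2

# The first trigrams of f1 and f2 from the similarity example.
f1 = {("_", "_", "while"), ("_", "while", "("), ("while", "(", "(")}
f2 = {("_", "_", "while"), ("_", "while", "("), ("while", "(", "*")}
for seed in range(4):
    print(seed, one_bit_minhash(f1, seed), one_bit_minhash(f2, seed))
```

Changing the seed yields a different hash function, so each seed contributes one bit for a file.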

Similarity estimation
A hash function extracts the same hash value from two files with probability p [Li, 2010]:
$p = \mathrm{sim}(f_1, f_2) + \frac{1 - \mathrm{sim}(f_1, f_2)}{2}$
The similarity is estimated from an observed probability $p_o$:
$\mathrm{sim}_e(f_1, f_2) = \left(p_o - \frac{1}{2}\right) \times 2$
We observe the probability $p_o$ using multiple independent hash functions $b_i(f)$ ($1 \leq i \leq k$). (We use $k = 2048$.)
Example (hash values b1, …, b10 of f1 and f2): $p_o = 0.9$ → $\mathrm{sim}_e(f_1, f_2) = 0.8$
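The estimation can be sketched as follows. This is a sketch under assumptions: the seeded SHA-1 hash, the helper names `bit`, `signature`, and `estimate_similarity`, and the concrete bit values of the ten-bit example are all hypothetical; only the formula and k = 2048 come from the slides.

```python
import hashlib

def bit(items, seed):
    """b_i(f): lowest bit of the minimum seeded hash over the items of f."""
    def h(item):
        data = f"{seed}:{item}".encode("utf-8")
        return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")
    return min(h(item) for item in items) % 2

def signature(items, k):
    """k bits obtained from k differently seeded hash functions."""
    return [bit(items, seed) for seed in range(k)]

def estimate_similarity(sig1, sig2):
    """sim_e = (p_o - 1/2) * 2, where p_o is the fraction of equal bits."""
    p_o = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
    return (p_o - 0.5) * 2

# Ten-bit example: 9 of 10 bits agree, so p_o = 0.9 (bit values are made up).
s1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
s2 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(estimate_similarity(s1, s2))

# Two hypothetical "files" modeled as item sets with exact Jaccard index 50/150.
a, b = set(range(0, 100)), set(range(50, 150))
k = 2048
print(estimate_similarity(signature(a, k), signature(b, k)))
```

The estimate can fall slightly below zero for dissimilar files because $p_o$ fluctuates around 1/2; a caller may want to clamp it to the range [0, 1].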

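Fast similarity computation
IF $\mathrm{sim}_e(f_1, f_2) \geq th - m$ (m: error margin)
THEN $\mathrm{sim}(f_1, f_2) = \frac{|\mathrm{trigrams}(f_1) \cap \mathrm{trigrams}(f_2)|}{|\mathrm{trigrams}(f_1) \cup \mathrm{trigrams}(f_2)|}$
ELSE $\mathrm{sim}(f_1, f_2) = 0$
An actual similarity is computed only if necessary.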
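The rule above amounts to a cheap filter in front of the exact computation. The sketch below illustrates it with hypothetical data: the file names, the precomputed estimates, and the margin m = 0.1 are assumptions, since the slides do not state the value of m.

```python
def filtered_sim(f1, f2, estimate, exact, th, m):
    """Return the exact similarity only when the estimate reaches th - m."""
    if estimate(f1, f2) >= th - m:
        return exact(f1, f2)
    return 0.0

# Hypothetical files modeled as sets, with made-up hash-based estimates.
files = {"q": set("abcdefgh"), "x": set("abcdefgz"), "y": set("stuvwxyz")}
estimates = {("q", "x"): 0.75, ("q", "y"): 0.02}
exact_calls = []

def estimate(n1, n2):
    return estimates[(n1, n2)]

def exact(n1, n2):
    """Exact Jaccard index; records each call to show when it is needed."""
    exact_calls.append((n1, n2))
    a, b = files[n1], files[n2]
    return len(a & b) / len(a | b)

print(filtered_sim("q", "x", estimate, exact, th=0.6, m=0.1))
print(filtered_sim("q", "y", estimate, exact, th=0.6, m=0.1))
print(exact_calls)  # only the pair that passed the filter
```

A larger margin m lowers the risk of discarding a pair whose estimate is too low by chance, at the cost of more exact comparisons.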

2. Component Ranking
Exclude uninteresting components.
[Figure: the component search result for the Firefox-45.0 query files, annotated with the sum of file similarities per component: zlib-1.2.8 (Sum=25.9714), mongodb-3.2.8 (Sum=19.9505). Edge labels: 1.0, 0.9948, 0.9858, 0.9160, 0.9568, 0.9384, 0.991.]
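Ranking by aggregated file similarity can be sketched as below. This sketch covers only the sum-based ordering; the slides do not spell out how uninteresting components are excluded, so that step is left out. The component names and scores are hypothetical.

```python
def rank_components(matches):
    """Sort components by the sum of their per-file similarities, descending."""
    totals = {name: sum(scores.values()) for name, scores in matches.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical search output: component -> {query file: best similarity}.
matches = {
    "lib-A-0.9": {"a.c": 0.90, "b.c": 0.92},
    "lib-A-1.0": {"a.c": 1.00, "b.c": 0.95},
    "app-B": {"a.c": 0.99},
}
for name, total in rank_components(matches):
    print(name, total)
```

Summing rewards components that match many query files closely, so a package that merely bundles one matching file ranks below the library the files were copied from.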

Our implementation: Clofile Search
http://sel.ist.osaka-u.ac.jp/clofile/
Submit a zip file including source files. You will receive the result as a web page.

Evaluation
Does it report an original version of a component?
Dataset: 75 directories in Firefox and Android.
Extracted version numbers from commit messages.
Analyzed the position of an original version in a result.
Accuracy measures:
Top-k Recall: How frequently an original component is included in the top-k of a result.
The sum of positions in the results: It approximates the manual effort needed to identify all the original components.
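The two accuracy measures can be computed as in the sketch below. It assumes each query yields the 1-based position of the original component in the result list, or None when the component is not reported. How unreported components count toward the sum of positions is not stated in the slides, so this sketch simply skips them. The function names and sample positions are hypothetical.

```python
import math

def top_k_recall(positions, k):
    """Fraction of queries whose original component is within the top k."""
    found = sum(1 for p in positions if p is not None and p <= k)
    return found / len(positions)

def sum_of_positions(positions):
    """Sum of the positions of the original components that were reported."""
    return sum(p for p in positions if p is not None)

# Hypothetical positions for four queries; None means "not reported".
positions = [1, 3, 7, None]
for k in (1, 5, 10, math.inf):
    print(k, top_k_recall(positions, k))
print(sum_of_positions(positions))
```

With k set to infinity the recall counts every query whose original component appears anywhere in the result, which corresponds to the Top-∞ column.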

Result
Columns: Method | Recall (Top-1, Top-5, Top-10, Top-∞) | Sum of positions
Each row lists only the recall values present in the transcript.
Baseline (SHA-1) | 0.640, 0.773, 0.827, 0.960 | 931
Baseline + Ranking | 0.707, 0.840, 0.867 | 719
th=1.0 | 0.733, 0.893, 0.987 | 785
th=0.9 | 0.907, 0.920, 1.000 | 551
th=0.8 | | 627
th=0.7 | 0.680, 0.880 | 692
th=0.6 | 0.667 | 689
Annotations: Ranking is added. Ignoring white space and comments. Identifying similar files.
Top-5: 0.773 → 0.907
Reduced manual effort: 931 → 551 (60%)
No false negatives!

Performance
Time per query [th=0.6]: Median: 77.7 seconds, Max: 25 minutes
13,720 files are analyzed in 3.5 hours (0.92 seconds per file).
Environment: Intel Xeon E5-2690 v3 (2.6 GHz), 64 GB RAM
4 GB of hash values and 20 GB of file names in memory.
300 GB of source files on HDD.

Conclusion
Our method reports existing components that are likely reused.
b-bit minwise hashing is employed to estimate similarity from hash values in a practical time.
Clofile Search: http://sel.ist.osaka-u.ac.jp/clofile/
We hope that the tool helps users analyze their cloned components. Please try it!