CBCD: Cloned Buggy Code Detector


CBCD: Cloned Buggy Code Detector
Jingyue Li, DNV Research & Innovation
Michael D. Ernst, University of Washington

Overview (09 September 2018)
- Cut-and-paste happens
- An empirical study of cloned buggy code
  - 4% of bugs are cloned in a commercial system
  - 350 cloned bugs in Open Source Software
- CBCD: a tool to detect cloned buggy code (optimizations of Program Dependency Graph based clone detection for finding bugs and for scaling)
- CBCD evaluation
  - CBCD outperformed 4 clone detectors for this task

An empirical study of cloned buggy code

Have bugs been cloned?
Investigated systems:
- Linux kernel (OSS): millions of LOC
- Git version control system (OSS): > 300 KLOC
- PostgreSQL database (OSS): > 300 KLOC
- A software product line (commercial): > 50 products, > 1 million LOC each
Approach:
- We searched bug reports and commit messages for these keywords: “same bug”, “same fix”, “same issue”, “same error”, and “same problem”
- For each match, we manually determined whether the commit was necessitated by duplication of a bug
- If so, we manually identified the buggy code and its clones
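
The keyword search in the approach above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `candidate_commits`, the sample commit ids and messages, and the case-insensitive matching are all assumptions. Only the five keywords come from the slide. Matches are candidates only; the slide states that each one was then checked manually.

```python
import re

# The five keywords quoted on the slide.
KEYWORDS = ("same bug", "same fix", "same issue", "same error", "same problem")

# Assumption: matching is case-insensitive; the slide does not say.
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def candidate_commits(messages):
    """Return the (commit_id, message) pairs whose message mentions a keyword."""
    return [(cid, msg) for cid, msg in messages if PATTERN.search(msg)]


# Hypothetical commit log, made up for illustration.
sample_log = [
    ("a1", "Fix null check; this is the same bug as in the tty driver"),
    ("b2", "Update documentation for the build system"),
    ("c3", "Apply the Same Fix to the second code path"),
]

for cid, msg in candidate_commits(sample_log):
    print(cid, msg)
```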

Results of the Empirical Study
Open source systems (cloned bugs / locatable clones):
- Linux kernel: 282 / 157
- Git: 33 / 12
- PostgreSQL: 22 (only one figure is given)
For some bugs, we could locate their clones, e.g. the commit says “This is quite the same fix as in 2cb96f86628d6e97fcbda5fe4d8d74876239834c.”
For some other bugs, we could not locate their clones, e.g. “Other platforms have this same bug, in one form or another.”
Commercial product line:
- 969 cloned bugs out of 25420 bugs (3.8%)
- No access to source code

The Buggy Code Segment is Usually Small
- More than 88% of the bugs cover 4 or fewer contiguous lines of code
- [Figure: size (contiguous lines) of the largest component of each bug fix]

The Need to Detect Cloned Buggy Code
- The empirical study confirmed that cloned bugs do exist
- Once a bug is detected, it is necessary to find all its clones and fix them
- Customer satisfaction drops when a customer re-encounters a bug that the vendor claimed to have fixed
Inputs:
- Buggy code snippet
- Entire system to be checked
Output:
- Bug clones

CBCD - A tool to detect cloned buggy code

CBCD Uses PDG-Based Clone Detection Approach
Clone detection approaches:
- Text-based
- Token-based
- Abstract Syntax Tree (AST)-based
- Program Dependency Graph (PDG)-based
PDG example, for the function
  static int add(int a, int b) { return(a+b); }
- Vertices: Entry: add; a = ain; b = bin; result = a+b; ret = result
- Edges: control dependency and data dependency
PDG-based clone detection approach:
- Check graph isomorphism of the PDGs of code segments to identify code clones
- More resilient to code changes
- CBCD: check whether the PDG of the buggy code is a subgraph of the PDG of the system
- Unlike other PDG-based tools, CBCD also requires vertex kind and source text to match
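
The subgraph check described above can be sketched with a brute-force search. This is a toy illustration under stated assumptions, not CBCD's implementation (the architecture slide says CBCD uses CodeSurfer and igraph). The PDG encoding, the vertex kind names (`entry`, `formal-in`, `expression`, `return`), and the function `find_embedding` are hypothetical. Source text is compared for exact equality here, which is stricter than the limited variable renaming mentioned in the limitations slide.

```python
from itertools import permutations

# A PDG is a pair (vertices, edges):
#   vertices: {vertex_id: (kind, source_text)}
#   edges:    {(src_id, dst_id, dependency)} with dependency "control" or "data"


def find_embedding(bug, system):
    """Return a map from bug vertices to system vertices, or None.

    A map is valid when it is injective, every mapped pair agrees on
    (kind, source_text), and every bug edge has a matching system edge.
    Brute force: only usable on tiny graphs.
    """
    bug_v, bug_e = bug
    sys_v, sys_e = system
    bug_ids = list(bug_v)
    for image in permutations(sys_v, len(bug_ids)):
        mapping = dict(zip(bug_ids, image))
        if any(bug_v[b] != sys_v[s] for b, s in mapping.items()):
            continue  # vertex kind or source text differs
        if all((mapping[u], mapping[v], dep) in sys_e for u, v, dep in bug_e):
            return mapping
    return None


# System PDG for the add() example on the slide (kind names are assumptions).
system_pdg = (
    {
        "entry": ("entry", "add"),
        "a": ("formal-in", "a = ain"),
        "b": ("formal-in", "b = bin"),
        "sum": ("expression", "result = a+b"),
        "ret": ("return", "ret = result"),
    },
    {
        ("entry", "a", "control"), ("entry", "b", "control"),
        ("entry", "sum", "control"), ("entry", "ret", "control"),
        ("a", "sum", "data"), ("b", "sum", "data"), ("sum", "ret", "data"),
    },
)

# Hypothetical "bug" PDG: the addition feeding the return value.
bug_pdg = (
    {"x": ("expression", "result = a+b"), "y": ("return", "ret = result")},
    {("x", "y", "data")},
)

print(find_embedding(bug_pdg, system_pdg))
```

Changing the source text of `x` to `result = a-b` makes `find_embedding` return `None`, which shows the effect of requiring the source text to match in addition to the graph structure.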

CBCD Architecture
- Subgraph isomorphism is O(N!N)
- We implemented optimizations to reduce N
Input: buggy lines; system to be checked
Step 1 (CodeSurfer): create Bug PDG; create System PDG
Step 2: split the Bug PDG and prune the System PDG
- Produces the split Bug PDG and the pruned System PDG
- Vertex kind and source text are used to prune parts of the System PDG that cannot match; they are also used in subgraph testing
Step 3: subgraph testing (igraph)
Output: bug clones

Four Optimizations: Reduce the N in Subgraph Isomorphism
- Isomorphism requires graph structure, vertex kind, and source text to match
- More than 88% of the bugs cover ≤ 4 lines
1. Prune System PDG: compare the vertex kind and source text of the start and end vertices of each edge in the System PDG and the Bug PDG
2. Break System PDG: use the vertex kind information of the Bug PDG to choose only small sections of the System PDG
3. Exclude irrelevant System PDG
4. Break Bug PDG
[Figure labels: Bug PDG, radius = 2; System PDG]
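
The first optimization can be sketched as an edge filter. This is one possible reading of the slide, not CBCD's code: the PDG encoding, the kind names, and `prune_system_pdg` are hypothetical, and comparing the dependency type along with the endpoint labels is an added assumption. A system edge survives only if some bug edge has the same (kind, source text) at both its start and end vertex.

```python
# A PDG is a pair (vertices, edges):
#   vertices: {vertex_id: (kind, source_text)}
#   edges:    {(src_id, dst_id, dependency)}


def prune_system_pdg(bug, system):
    """Drop system edges that cannot match any bug edge, plus orphaned vertices."""
    bug_v, bug_e = bug
    sys_v, sys_e = system
    # Label signature of every bug edge: (start label, end label, dependency).
    wanted = {(bug_v[u], bug_v[v], dep) for u, v, dep in bug_e}
    kept_edges = {
        (u, v, dep) for u, v, dep in sys_e
        if (sys_v[u], sys_v[v], dep) in wanted
    }
    kept_ids = {x for u, v, _ in kept_edges for x in (u, v)}
    return {i: sys_v[i] for i in kept_ids}, kept_edges


# Toy System PDG for add() and a two-vertex Bug PDG (names are made up).
system_pdg = (
    {
        "entry": ("entry", "add"),
        "a": ("formal-in", "a = ain"),
        "b": ("formal-in", "b = bin"),
        "sum": ("expression", "result = a+b"),
        "ret": ("return", "ret = result"),
    },
    {
        ("entry", "a", "control"), ("entry", "b", "control"),
        ("entry", "sum", "control"), ("entry", "ret", "control"),
        ("a", "sum", "data"), ("b", "sum", "data"), ("sum", "ret", "data"),
    },
)
bug_pdg = (
    {"x": ("expression", "result = a+b"), "y": ("return", "ret = result")},
    {("x", "y", "data")},
)

pruned_vertices, pruned_edges = prune_system_pdg(bug_pdg, system_pdg)
print(len(system_pdg[1]), "edges before,", len(pruned_edges), "after")
```

A Bug PDG with a single vertex and no edges would leave nothing after this filter, so a real tool would need a separate path for that case. The sketch only shows how edge labels shrink the graph handed to the subgraph test.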

CBCD evaluation

CBCD Evaluation and Comparison with Clone Detectors
Research questions:
- How well can CBCD find cloned buggy code?
- How well does CBCD scale?
Oracle for the evaluation:
- 53 bugs (34 Linux kernel, 5 Git, and 14 PostgreSQL) and their clones
Evaluation environment:
- Ubuntu 10.04, 4 GB memory, 3 GHz CPU
Clone detectors under comparison:
- CBCD (PDG-based)
- Simian (text-based)
- CCFinder (token-based)
- Deckard (AST-based)
- CloneDR (AST-based)

How Well Can CBCD Find Cloned Buggy Code?
For each investigated bug and each of its clones, we check whether a tool reports it:
- False Negative (FN): a clone is identified by the developer, but not by the tool
- False Positive (FP): a clone is reported by the tool, but not identified as buggy by developers
Each of the 53 investigated bugs falls into one of four categories
* Only 2% of the input files and 1% of the lines of code
General clone detectors and CBCD are designed for different purposes:
- General clone detectors are good at finding large code clones
- CBCD is more like an advanced “Find” command, good at finding small code clones
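
The false negative and false positive definitions above amount to two set differences. A minimal sketch, assuming each clone is identified by a (file, line) pair; the function name `classify` and the sample locations are made up for illustration.

```python
def classify(developer_clones, tool_reports):
    """Split clone locations into found, false negatives, and false positives."""
    developer_clones = set(developer_clones)
    tool_reports = set(tool_reports)
    return {
        # Identified by the developer and reported by the tool.
        "found": developer_clones & tool_reports,
        # Identified by the developer, but not reported by the tool.
        "false_negatives": developer_clones - tool_reports,
        # Reported by the tool, but not identified as buggy by developers.
        "false_positives": tool_reports - developer_clones,
    }


# Hypothetical locations for one bug.
oracle = {("drivers/a.c", 120), ("drivers/b.c", 48)}
reported = {("drivers/a.c", 120), ("fs/c.c", 301)}

result = classify(oracle, reported)
for name in ("found", "false_negatives", "false_positives"):
    print(name, sorted(result[name]))
```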

Advantages and Limitations of CBCD
Advantages: as expected, CBCD is more resilient to some types of code changes
- Code deletions or insertions that do not change the PDG
- Reordering of statements
- Control-statement replacement
Limitations:
- False positive: when the buggy code is a one-line function call
  Buggy code: memset(ib_ah_attr, 0, sizeof *path);
  True positive: memset(ib_ah_attr, 0, sizeof *path);
  False positive: memset(p, 0, padding);
- False negative: buggy code is in a data structure or in macros, due to limitations of CodeSurfer
- False negative: variable renaming in an expression, because CBCD limits such renaming
  Buggy code: if (!hpet && !ref1 && !ref2)
  True positive: if (!hpet && !ref_start && !rf_stop)

How well does CBCD scale?
- Less than 47 seconds to find clones of a bug in a system having ≥ 400 KNLOC, after the system PDG is generated by CodeSurfer
- Generating the system PDG with CodeSurfer is the bottleneck
  - However, this step needs to be done only once and its result is reusable
- The optimizations are effective
  - Most effective: Prune System PDG
    - Pruned 90% of the edges
    - 20237 and 11890 performance improvement in two cases
    - Avoided an “out-of-memory” error caused by an overly large System PDG in another case
  - Other optimizations provide moderate performance gains

Conclusions
- Examined real-world bug reports
  - Established that identical (cloned) bugs are a serious problem
  - Proposed that programmers should search for clones of buggy code
- Optimizations to PDG-based clone detection algorithms for scalability
- Implemented CBCD, a PDG-based tool that detects possible clones of buggy code
  - CBCD is scalable, effective, and outperforms other tools for this task
- Finding cloned bugs is very important, and now a tool (CBCD) exists for that!