Matching Program Versions

Presentation transcript:

CS590Z: Matching Program Versions
Xiangyu Zhang

Problem Statement
- Suppose a program P’ is created by modifying P. Determine the difference between P and P’.
- For an artifact c’ in P’, decide whether c’ belongs to the difference; if not, find the correspondence of c’ in P.
- Static mapping
- Non-trivial
  - Name comparison? What if
- Clone analysis, comparison checking

Motivations
- Validate compiler transformations
- Facilitate regression testing
- Reverse obfuscation
- Information propagation
- Debugging
- Code plagiarism detection
- Information assurance

Approaches
- Static approaches
  - Entity name based
  - String based (MOSS)
  - AST based (DECKARD)
  - CFG based (JDIFF)
  - PDG based (PDIFF)
  - Binary based (BMAT)
  - Log based (editor plugin, comparison checking)
- Dynamic approaches (not today)

Static Approaches
- Entity name matching
  - Model a function/field as tuples
  - Coarse-grained matching
- String matching
  - Diff (CVS, Subversion)
    - Longest common subsequence (LCS)
    - Available operations are addition and deletion
    - Matched pairs cannot cross one another
  - Programs are far more complicated than strings
    - Copy, paste, move
  - CP-Miner (scales to Linux kernel clone detection)
    - Frequent subsequence mining
- If two strings are considered, LCS has polynomial complexity (by dynamic programming)
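The LCS step behind a line-based diff can be sketched as follows. This is a minimal illustration of the textbook dynamic program, not the implementation used by any particular diff tool; the function name `lcs` and the list-of-lines input are assumptions made for the example.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b."""
    m, n = len(a), len(b)
    # table[i][j] = length of the LCS of a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table to recover the matched elements in order.
    # Matched pairs never cross because i and j only move forward.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # a[i] is unmatched: a deletion
        else:
            j += 1  # b[j] is unmatched: an addition
    return out


if __name__ == "__main__":
    old = ["int x;", "x = 1;", "return x;"]
    new = ["int x;", "return x;"]
    print(lcs(old, new))
```

The table has (m+1)(n+1) entries, which is where the polynomial cost for two strings comes from. Lines outside the returned subsequence are reported as additions or deletions, so a moved block shows up as one deletion plus one addition.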

MOSS
- Code plagiarism detection
  - It also handles other digital contents
- Challenges
  - White space (variable name)
  - Noise (“the”, “int i”)
  - Order scrambling (paragraph reorders)
- Problem statement
  - Given a set of documents, identify substring matches that satisfy two properties:
    - If there is a substring match at least as long as the guarantee threshold t, then this match is detected.
    - No match shorter than the noise threshold k is detected.

MOSS: k-gram
- A contiguous substring of length k
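A short sketch of k-gram extraction, to make the definition concrete. The `normalize` step (lowercasing and dropping whitespace) is an assumption standing in for whatever noise removal a real system applies; both function names are hypothetical.

```python
def normalize(text):
    """Assumed preprocessing: lowercase and drop all whitespace."""
    return "".join(text.lower().split())


def kgrams(text, k):
    """Return every contiguous substring of length k, in order."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


if __name__ == "__main__":
    print(kgrams(normalize("A do run"), 5))
```

A string of length n yields n - k + 1 k-grams, so almost every position starts one.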

MOSS: Incremental hashing
- Hashing strings of length k is expensive for large k.
- Use a “rolling” hash function.
- The (i+1)th k-gram hash = F(the ith k-gram hash, …)
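One common way to get such an F is a Karp-Rabin style polynomial hash. The sketch below assumes that choice; the slides do not say which rolling hash is used, and the `base` and `mod` values are arbitrary picks for illustration.

```python
def rolling_hashes(text, k, base=257, mod=(1 << 61) - 1):
    """Hash every k-gram of text, updating each hash from the previous one."""
    if k <= 0 or len(text) < k:
        return []
    # Hash of the first k-gram, computed directly.
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    top = pow(base, k - 1, mod)  # weight of the character leaving the window
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * top) % mod  # remove the oldest character
        h = (h * base + ord(text[i])) % mod     # append the newest character
        hashes.append(h)
    return hashes


if __name__ == "__main__":
    print(rolling_hashes("adorunrunrunadorunrun", 5)[:3])
```

Each update costs a constant number of arithmetic operations regardless of k, while rehashing every k-gram from scratch would cost k operations per position.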

MOSS: Fingerprint selection
- A subset of hash values
- Our goals: find all matching substrings of length at least t; ignore matches shorter than k
- One of every tth hash values
- 0 mod p
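A minimal sketch of the "0 mod p" rule, assuming the k-gram hashes are already computed. The hash values are made up for illustration and `select_mod_p` is a hypothetical name.

```python
def select_mod_p(hashes, p):
    """Keep (position, hash) pairs whose hash is divisible by p."""
    return [(i, h) for i, h in enumerate(hashes) if h % p == 0]


if __name__ == "__main__":
    # Illustrative hash values only.
    hashes = [77, 72, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 72, 42, 17, 98]
    print(select_mod_p(hashes, 4))
```

The rule depends only on the hash value, so identical k-grams in different documents are selected or skipped together. It does not bound the distance between consecutive selected positions, so a long shared substring can fall entirely inside a gap and go undetected.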

MOSS: Winnowing
- Observation: given a sequence of hashes h1, …, hn, if n > t-k, then at least one of the hi must be chosen.
- Use a sliding window of size w = t-k+1.
- In each window, select the minimum hash value; break ties by selecting the rightmost occurrence.
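The window rule can be sketched as below. This is an unoptimized illustration that rescans each window; the name `winnow` and the (position, hash) output format are assumptions, and the hash values are made up.

```python
def winnow(hashes, w):
    """Select the rightmost minimum of every window of w consecutive hashes.

    Returns (position, hash) pairs, recording each position only once.
    """
    fingerprints = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum within the window.
        offset = w - 1 - window[::-1].index(smallest)
        pos = start + offset
        # Selected positions never move left as the window slides,
        # so comparing with the previous one is enough to avoid repeats.
        if pos != last_pos:
            fingerprints.append((pos, smallest))
            last_pos = pos
    return fingerprints


if __name__ == "__main__":
    # Illustrative hash values only.
    hashes = [77, 72, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 72, 42, 17, 98]
    print(winnow(hashes, 4))
    # [(3, 17), (6, 17), (8, 8), (11, 39), (15, 17)]
```

Every window contributes a fingerprint, so any run of w = t-k+1 consecutive hashes (a substring of length t) contains at least one selected hash, which is the guarantee stated in the observation.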

MOSS: Algorithm
- Build an index mapping fingerprints to locations for all documents.
- Each document is fingerprinted a second time and the selected fingerprints are looked up in the index; this gives the list of all matching fingerprints for each document.
- Sort (d, d1, fx), (d, d2, fy) by the first two elements.
- Matches between documents are rank-ordered by size (number of fingerprints).
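The indexing and ranking steps can be sketched as follows, assuming each document has already been reduced to a list of (position, hash) fingerprints. The function name `rank_matches`, the input format, and the use of distinct shared hashes as the match size are all simplifying assumptions.

```python
from collections import defaultdict


def rank_matches(fingerprints):
    """fingerprints: dict mapping document name to a list of (position, hash).

    Returns ((doc_a, doc_b), shared_count) pairs, largest match first.
    """
    # Step 1: index every fingerprint by hash value.
    index = defaultdict(list)
    for doc, fps in fingerprints.items():
        for pos, h in fps:
            index[h].append((doc, pos))

    # Step 2: look each document's fingerprints up in the index and
    # group the hits by document pair.
    shared = defaultdict(set)
    for doc, fps in fingerprints.items():
        for _pos, h in fps:
            for other, _other_pos in index[h]:
                if other != doc:
                    pair = tuple(sorted((doc, other)))
                    shared[pair].add(h)

    # Step 3: rank pairs by the number of distinct shared fingerprints.
    ranked = [(pair, len(hs)) for pair, hs in shared.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    docs = {
        "a": [(0, 1), (3, 2), (5, 3)],
        "b": [(1, 2), (4, 3)],
        "c": [(0, 3), (2, 9)],
    }
    print(rank_matches(docs))
```

Positions are kept in the index but not used for ranking here; a fuller version would use them to report where the matching regions are.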

MOSS
- Advantages
  - Guaranteed to detect any substring match longer than t
- Limitations
  - Minor edits fail MOSS: x = a*b + c vs. z = c + a*b
  - Insertion, deletion

AST based matching [Yang, 1991, Software: Practice and Experience]
- Given two functions, build the ASTs.
- Match the roots.
- If the roots match, apply LCS to align the subtrees.
- Continue recursively.
- Fragile
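The recursive scheme above can be sketched on a toy tree representation. Trees are written as (label, children) tuples, which is an assumption made for the example rather than the representation used in the cited work; `match_count` is a hypothetical name.

```python
def match_count(t1, t2):
    """Count the nodes matched between two ordered trees.

    Each tree is a (label, children) tuple. Roots must have equal labels;
    children are then aligned with an LCS-style dynamic program in which
    the weight of pairing two children is their own recursive match count.
    """
    label1, kids1 = t1
    label2, kids2 = t2
    if label1 != label2:
        return 0  # roots differ: nothing below them is matched
    m, n = len(kids1), len(kids2)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],  # skip a child of t1
                table[i][j - 1],  # skip a child of t2
                table[i - 1][j - 1] + match_count(kids1[i - 1], kids2[j - 1]),
            )
    return 1 + table[m][n]


if __name__ == "__main__":
    old = ("if", [("cond", []), ("assign", []), ("call", [])])
    new = ("if", [("cond", []), ("call", [])])
    print(match_count(old, new))
```

The early return on a root mismatch illustrates the fragility noted on the slide: wrapping a function body in one extra construct changes the root and the whole subtree fails to match.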

DECKARD (ICSE 2007)

DECKARD
- Advantages
  - Scalability
  - Insensitive to minor structural changes such as reordering, insertion, deletion
- Limitations
  - Structural similarity only
  - Insertion that incurs structure change

CFG matching
- Hammock graph (JDIFF, ASE 2004)
  - Match classes by names
  - Match fields by types
  - Match methods by signatures
  - Match instructions in methods by hammock graphs
- A hammock is a single-entry, single-exit subgraph of a CFG.
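To make the single-entry, single-exit property concrete, here is a small checker. It assumes one common formulation in which the entry node belongs to the region and the exit node lies just outside it; the CFG as a list of (source, target) edges and the name `is_hammock` are assumptions for the example, not JDIFF's data structures.

```python
def is_hammock(edges, nodes, entry, exit_node):
    """Check whether `nodes` forms a single-entry, single-exit region.

    edges: iterable of (source, target) CFG edges.
    Assumed convention: entry is inside the region, exit_node is outside.
    """
    region = set(nodes)
    if entry not in region or exit_node in region:
        return False
    for src, dst in edges:
        # Every edge coming into the region must land on the entry node.
        if src not in region and dst in region and dst != entry:
            return False
        # Every edge leaving the region must go to the exit node.
        if src in region and dst not in region and dst != exit_node:
            return False
    return True


if __name__ == "__main__":
    # An if/else diamond: node 2 branches to 3 and 4, which join at 5.
    cfg = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5), (5, 6)]
    print(is_hammock(cfg, {2, 3, 4}, entry=2, exit_node=5))
    print(is_hammock(cfg, {3, 4}, entry=3, exit_node=5))
```

The second call fails because the edge from node 2 to node 4 enters the region at a node other than the declared entry.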

CFG matching
- Pros
  - Orthogonal: can be combined with other matching techniques
  - Simple
- Cons
  - Coarse-grained matching only
  - Not good at clone detection
  - In case of code transformation

Semantic Based Matching Using PDG (SAS’01)

Semantic Based

Semantic Based
- Pros
  - Non-contiguous, intertwined, reordered
  - Insensitive to code transformations
- Cons
  - Scalability
  - Points-to analysis
  - Starting from a matching pair seems to be a problem

Wrap Up
- For clone detection
  - Maybe structural / text similarity is a good idea
- For whole program matching / method matching with code transformations
  - Semantic based is more appropriate
- Scalability
  - PDG < CFG | AST < STRING < NAME