CS590Z Matching Program Versions
Xiangyu Zhang



Problem Statement
- Suppose a program P’ is created by modifying P. Determine the difference between P and P’.
  - For an artifact c’ in P’, decide whether c’ belongs to the difference; if not, find the counterpart of c’ in P.
  - Static mapping
  - Non-trivial
    - Name comparison?
    - What if
  - Clone analysis, comparison checking

Motivations
- Validate compiler transformations
- Facilitate regression testing
- Reverse obfuscation
- Information propagation
- Debugging
- Code plagiarism detection
- Information assurance

Approaches
- Static approaches
  - Entity name based
  - String based (MOSS)
  - AST based (DECKARD)
  - CFG based (JDIFF)
  - PDG based (PDIFF)
  - Binary based (BMAT)
  - Log based (editor plugin, comparison checking)
- Dynamic approaches (not today)

Static Approaches
- Entity name matching
  - Model a function/field as tuples
  - Coarse-grained matching
- String matching
  - Diff (CVS, Subversion)
  - Longest common subsequence (LCS)
    - Available operations are addition and deletion
    - Matched pairs cannot cross one another
    - Programs are far more complicated than strings
      - Copy, paste, move
  - CP-Miner (scales to clone detection in the Linux kernel)
    - Frequent subsequence mining
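The LCS bullets above can be made concrete with a small sketch. It is a textbook dynamic-programming LCS, not the actual implementation inside diff, and the names `lcs_table` and `lcs_diff` are chosen here for illustration. It shows why a moved line is reported as one deletion plus one addition: matched pairs are not allowed to cross.

```python
def lcs_table(a, b):
    """table[i][j] = length of the LCS of a[i:] and b[j:]."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def lcs_diff(a, b):
    """Return an edit script of (' ', line), ('-', line), ('+', line) entries."""
    table = lcs_table(a, b)
    i = j = 0
    script = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            script.append((" ", a[i]))  # line kept in both versions
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            script.append(("-", a[i]))  # deletion from the old version
            i += 1
        else:
            script.append(("+", b[j]))  # addition in the new version
            j += 1
    script.extend(("-", line) for line in a[i:])
    script.extend(("+", line) for line in b[j:])
    return script


old = ["int x = 1;", "int y = 2;", "return x + y;"]
new = ["int y = 2;", "int x = 1;", "return x + y;"]
for tag, line in lcs_diff(old, new):
    print(tag, line)
```

Swapping the first two lines leaves an LCS of length 2, so the moved line "int x = 1;" appears once with "-" and once with "+", even though nothing was really added or removed.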

MOSS
- Code plagiarism detection
  - It also handles other digital content
- Challenges
  - White space (variable name)
  - Noise (“the”, “int i”)
  - Order scrambling (paragraph reorders)
- Problem statement
  - Given a set of documents, identify substring matches that satisfy two properties:
    - If there is a substring match at least as long as the guarantee threshold t, then this match is detected.
    - No match shorter than the noise threshold k is detected.

MOSS
- k-gram
  - A contiguous substring of length k
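As a minimal illustration of the definition above, the following sketch lists every k-gram of a string. The function name `kgrams` is made up for this example.

```python
def kgrams(text, k):
    """Return every contiguous substring of length k, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


print(kgrams("abcde", 3))
```

A string of length n has n-k+1 k-grams, and none at all when n < k.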

MOSS
- Incremental hashing
  - Hashing strings of length k is expensive for large k.
  - Use a “rolling” hash function
    - The (i+1)th k-gram hash = F(the ith k-gram hash, …)
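The slide leaves F unspecified. One common choice, assumed here purely for illustration, is a Karp-Rabin style polynomial hash; the constants `BASE` and `MOD` and the function names are hypothetical. The sketch updates each hash in constant time from the previous one instead of rehashing all k characters.

```python
BASE = 257             # assumed polynomial base
MOD = (1 << 61) - 1    # assumed modulus


def direct_hash(s):
    """Hash a string from scratch: O(len(s)) work."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rolling_hashes(text, k):
    """Hash every k-gram of text, deriving each hash from the previous one."""
    if k <= 0 or len(text) < k:
        return []
    high = pow(BASE, k - 1, MOD)  # weight of the character that leaves the window
    h = direct_hash(text[:k])
    hashes = [h]
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * high) % MOD   # drop the oldest character
        h = (h * BASE + ord(text[i])) % MOD       # shift and append the new one
        hashes.append(h)
    return hashes


text = "a do run run run, a do run run"
assert rolling_hashes(text, 5) == [direct_hash(text[i:i + 5])
                                   for i in range(len(text) - 4)]
```

The final assertion checks that the incremental update agrees with hashing each k-gram from scratch.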

MOSS
- Fingerprint selection
  - A subset of the hash values
  - Our goals: find all matching substrings of length at least t; ignore matches shorter than k
  - One of every t-th hash value
  - Hash values that are 0 mod p
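The two selection rules named on this slide can be sketched as follows. The hash sequence is an arbitrary illustrative list and the function names are invented for this example.

```python
def select_every_tth(hashes, t):
    """Keep the hash at every t-th position (depends on absolute position)."""
    return [(i, h) for i, h in enumerate(hashes) if i % t == 0]


def select_mod_p(hashes, p):
    """Keep the hashes that are 0 mod p (depends only on the hash value)."""
    return [(i, h) for i, h in enumerate(hashes) if h % p == 0]


hashes = [77, 72, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 72, 42, 17, 98]
print(select_every_tth(hashes, 4))
print(select_mod_p(hashes, 4))
```

In this example the 0 mod 4 rule picks positions 1, 8, 9 and 13, so six consecutive hashes (positions 2 to 7) go unselected; a shared substring that falls entirely inside such a gap would not be detected. Selecting by position has a different weakness: inserting a single character shifts every later position, so the same k-gram is no longer selected in both documents.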

MOSS
- Winnowing
  - Observation: given a sequence of hashes h1, …, hn, if n > t-k, then at least one of the hi must be chosen
  - Use a sliding window of size w = t-k+1
  - In each window, select the minimum hash value; break ties by selecting the rightmost occurrence.
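A minimal sketch of the window rule described above, assuming the k-gram hashes have already been computed. The name `winnow` and the (position, hash) output format are choices made for this example. Because a substring of length t contains t-k+1 k-grams, a window of that size guarantees at least one selected hash inside every such substring.

```python
def winnow(hashes, w):
    """Select (position, hash) fingerprints with a sliding window of size w.

    In each window the minimum hash is chosen; ties go to the rightmost
    occurrence. A position is recorded only once even if several
    consecutive windows select it.
    """
    fingerprints = []
    last = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Rightmost occurrence of the minimum inside this window.
        offset = w - 1 - window[::-1].index(smallest)
        pos = start + offset
        if pos != last:
            fingerprints.append((pos, smallest))
            last = pos
    return fingerprints


hashes = [77, 72, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 72, 42, 17, 98]
print(winnow(hashes, 4))
# [(3, 17), (6, 17), (8, 8), (11, 39), (15, 17)]
```

Comparing only against the last recorded position is enough here because the selected position never moves left as the window slides right.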

MOSS
- Algorithm
  - Build an index mapping fingerprints to locations for all documents.
  - Each document is fingerprinted a second time, and the selected fingerprints are looked up in the index; this gives the list of all matching fingerprints for each document.
  - Sort (d, d1, fx), (d, d2, fy) by the first two elements.
  - Matches between documents are rank-ordered by size (number of fingerprints).
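The steps above can be sketched roughly as follows. This is a simplification under stated assumptions: fingerprints are given as (position, hash) pairs per document, the same fingerprints are reused for the lookup rather than fingerprinting a second time, and the function names and file names are hypothetical.

```python
from collections import defaultdict


def build_index(fingerprints_by_doc):
    """Map each fingerprint hash to the (document, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc, fingerprints in fingerprints_by_doc.items():
        for pos, h in fingerprints:
            index[h].append((doc, pos))
    return index


def rank_matches(fingerprints_by_doc):
    """Return document pairs ordered by the number of shared fingerprints."""
    index = build_index(fingerprints_by_doc)
    shared = defaultdict(set)
    for doc, fingerprints in fingerprints_by_doc.items():
        for _pos, h in fingerprints:
            for other, _other_pos in index[h]:
                if other != doc:
                    pair = tuple(sorted((doc, other)))  # group by document pair
                    shared[pair].add(h)
    return sorted(((pair, len(hs)) for pair, hs in shared.items()),
                  key=lambda item: (-item[1], item[0]))


docs = {
    "a.c": [(0, 17), (4, 8), (9, 39)],
    "b.c": [(2, 17), (7, 8)],
    "c.c": [(1, 39), (5, 50)],
}
print(rank_matches(docs))
# [(('a.c', 'b.c'), 2), (('a.c', 'c.c'), 1)]
```

Grouping by the document pair plays the role of sorting the (d, d', f) triples by their first two elements; the positions stored in the index would let a tool show where in each document the match occurs.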

MOSS
- Advantages
  - Guaranteed to detect any substring match of length at least t
- Limitations
  - Minor edits defeat MOSS.
    - x = a*b + c vs. z = c + a*b
    - Insertion, deletion

AST based matching
- [YANG, 1991, Software Practice and Experience]
  - Given two functions, build the ASTs
  - Match the roots
  - If they match, apply LCS to align the subtrees
  - Continue recursively
- Fragile
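A rough sketch of the recursive scheme outlined above, not the published algorithm's code. Trees are assumed to be (label, children) tuples, labels are node kinds rather than identifier names, and `match_size` is a name invented here. Children are aligned with an LCS-style dynamic program whose weights are the sizes of the recursive matches.

```python
def match_size(t1, t2):
    """Number of node pairs matched between two trees of (label, children)."""
    label1, kids1 = t1
    label2, kids2 = t2
    if label1 != label2:
        return 0  # roots differ: nothing below them is matched
    m, n = len(kids1), len(kids2)
    # best[i][j]: best order-preserving alignment of kids1[:i] with kids2[:j]
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1] + match_size(kids1[i - 1], kids2[j - 1]),
            )
    return 1 + best[m][n]


name = ("Name", [])
mul = ("Mul", [name, name])
# x = a*b + c   versus   z = c + a*b
first = ("Assign", [name, ("Add", [mul, name])])
second = ("Assign", [name, ("Add", [name, mul])])

print(match_size(first, first))   # 7: every node matches itself
print(match_size(first, second))  # 6: one operand is left unmatched
```

Because the alignment of children cannot cross, swapping the operands of "+" leaves one Name node unmatched even though the two statements compute the same value, which is one way to see the fragility noted on the slide.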

DECKARD (ICSE 2007)

DECKARD
- Advantages
  - Scalability
  - Insensitive to minor structural changes such as reordering, insertion, deletion
- Limitations
  - Structural similarity only
  - Insertion that incurs structure change

CFG matching
- Hammock graph (JDIFF, ASE 2004)
  - Match classes by names
  - Match fields by types
  - Match methods by signatures
  - Match instructions in methods by hammock graphs
    - A hammock is a single-entry, single-exit subgraph of a CFG.

CFG matching
- Pros
  - Orthogonal
    - Can be combined with other matching techniques
  - Simple
- Cons
  - Coarse-grained matching only
    - Not good at clone detection
  - In case of code transformation

Semantic Based Matching
- Using PDG (SAS’01)

Semantic Based

Semantic Based
- Pros
  - Non-contiguous, intertwined, reordered
  - Insensitive to code transformations
- Cons
  - Scalability
    - Points-to analysis
  - Starting from a matching pair seems to be a problem

Wrap Up
- For clone detection
  - Structural / text similarity may be a good idea
- For whole-program matching / method matching with code transformations
  - Semantic based is more appropriate
- Scalability
  - PDG < CFG | AST < STRING < NAME