CS590Z Matching Program Versions Xiangyu Zhang

CS590Z Problem Statement
- Suppose a program P’ is created by modifying P. Determine the difference between P and P’.
  - For an artifact c’ in P’, decide whether c’ belongs to the difference; if not, find the counterpart of c’ in P.
  - Static mapping
- Non-trivial
  - Name comparison?
  - What if
- Clone analysis, comparison checking

CS590Z Motivations
- Validate compiler transformations
- Facilitate regression testing
- Reverse obfuscation
- Information propagation
- Debugging
- Code plagiarism detection
- Information assurance

CS590Z Approaches
- Static approaches
  - Entity name based
  - String based (MOSS)
  - AST based (DECKARD)
  - CFG based (JDIFF)
  - PDG based (PDIFF)
  - Binary based (BMAT)
  - Log based (editor plugin, comparison checking)
- Dynamic approaches (not today)

CS590Z Static Approaches
- Entity name matching
  - Model a function/field as tuples
  - Coarse-grained matching
- String matching
  - Diff (CVS, Subversion)
  - Longest common subsequence (LCS)
    - Available operations are addition and deletion
    - Matched pairs cannot cross one another
    - Programs are far more complicated than strings: copy, paste, move
  - CP-Miner (scales to Linux kernel clone detection)
    - Frequent subsequence mining
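The LCS bullets above can be made concrete with a small sketch. It is a textbook dynamic-programming LCS over lines, not the implementation used by any particular diff tool; the function name `lcs_pairs` and the sample lines are made up for illustration. It shows why a moved line is reported as one deletion plus one addition: matched pairs must stay in order.

```python
def lcs_pairs(a, b):
    """Return matched index pairs (i, j) of one longest common subsequence."""
    n, m = len(a), len(b)
    # table[i][j] = LCS length of the suffixes a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one optimal alignment.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # a[i] is unmatched: reported as a deletion
        else:
            j += 1  # b[j] is unmatched: reported as an addition
    return pairs

old = ["x = 1", "y = 2", "print(x)"]
new = ["x = 1", "print(x)", "y = 2"]  # "y = 2" was moved to the end
pairs = lcs_pairs(old, new)
print(pairs)  # [(0, 0), (2, 1)]
```

The moved line `y = 2` appears in neither pair, because matching it would require the pair to cross `(2, 1)`.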

CS590Z MOSS
- Code plagiarism detection
  - It also handles other digital content
- Challenges
  - White space (variable names)
  - Noise (“the”, “int i”)
  - Order scrambling (paragraph reorders)
- Problem statement
  - Given a set of documents, identify substring matches that satisfy two properties:
    - If there is a substring match at least as long as the guarantee threshold t, then this match is detected.
    - No match shorter than the noise threshold k is detected.

CS590Z MOSS
- k-gram
  - A contiguous substring of length k
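A minimal sketch of k-gram extraction follows. The helper name `kgrams` and the normalization step (lowercasing and dropping white space) are assumptions for illustration; the slide only defines what a k-gram is.

```python
def kgrams(text, k):
    """Return every contiguous substring of length k, left to right."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

# Assumed preprocessing: lowercase the text and remove all white space.
normalized = "".join("A do run".lower().split())
grams = kgrams(normalized, 5)
print(grams)  # ['adoru', 'dorun']
```

A text of length n yields n - k + 1 k-grams, so there are almost as many k-grams as characters.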

CS590Z MOSS
- Incremental hashing
  - Hashing strings of length k is expensive for large k.
  - Use a “rolling” hash function
    - The (i+1)th k-gram hash = F(the ith k-gram hash, …)
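One common way to realize such an F is a Karp-Rabin style polynomial hash. The sketch below assumes that choice; the slide does not say which function is used, and the constants `BASE` and `MOD` are arbitrary illustrative values. Each new hash is derived from the previous one in constant time by removing the outgoing character and appending the incoming one.

```python
BASE = 257            # assumed radix, illustrative only
MOD = (1 << 61) - 1   # assumed modulus, illustrative only

def direct_hash(s):
    """Polynomial hash of s computed from scratch, in O(len(s)) time."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def rolling_hashes(text, k):
    """Hashes of all k-grams of text, each derived from the previous one."""
    if k <= 0 or len(text) < k:
        return []
    high = pow(BASE, k - 1, MOD)  # weight of the character leaving the window
    h = direct_hash(text[:k])
    hashes = [h]
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * high) % MOD  # drop the oldest character
        h = (h * BASE + ord(text[i])) % MOD      # shift and add the new one
        hashes.append(h)
    return hashes

text = "abracadabra"
rolled = rolling_hashes(text, 4)
direct = [direct_hash(text[i:i + 4]) for i in range(len(text) - 3)]
print(rolled == direct)  # True
```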

CS590Z MOSS
- Fingerprint selection
  - A subset of hash values
  - Our goals: find all matching substrings of length at least t; ignore matches shorter than k
  - One out of every t hash values
  - Hash values that are 0 mod p
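The "0 mod p" strategy can be sketched in a few lines. The function name and the hash values are made up for illustration. The example shows the weakness of this strategy: nothing bounds the distance between two selected hashes, so a long match can fall entirely inside a gap.

```python
def select_mod_p(hashes, p):
    """Keep the (position, hash) pairs whose hash is divisible by p."""
    return [(i, h) for i, h in enumerate(hashes) if h % p == 0]

hashes = [7, 12, 5, 9, 11, 13, 8, 4]  # made-up hash values
selected = select_mod_p(hashes, 4)
print(selected)  # [(1, 12), (6, 8), (7, 4)]

# Largest distance between consecutive selected positions.
gaps = [b[0] - a[0] for a, b in zip(selected, selected[1:])]
print(max(gaps))  # 5
```

The selection depends only on the hash value, not on position, so the same k-gram is selected in every document that contains it.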

CS590Z MOSS
- Winnowing
  - Observation: given a sequence of hashes h1, …, hn, if n > t-k, then at least one of the hi must be chosen
  - Use a sliding window of size w = t-k+1
  - In each window, select the minimum hash value; break ties by selecting the rightmost occurrence.
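The window rule above can be sketched directly. This is a straightforward, unoptimized version that rescans every window; the function name `winnow` and the hash values are illustrative assumptions. A hash selected by several overlapping windows is recorded only once.

```python
def winnow(hashes, w):
    """Select (position, hash) fingerprints using sliding windows of size w."""
    fingerprints = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Tie-break: take the rightmost occurrence of the minimum.
        offset = w - 1 - window[::-1].index(smallest)
        pick = (start + offset, smallest)
        # Selected positions never move left, so comparing with the last
        # recorded pick is enough to avoid duplicates.
        if not fingerprints or fingerprints[-1] != pick:
            fingerprints.append(pick)
    return fingerprints

hashes = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
fingerprints = winnow(hashes, 4)
print(fingerprints)  # [(3, 17), (6, 17), (8, 8), (11, 39), (15, 17)]
```

Every window contributes a pick, so consecutive fingerprints are never more than w positions apart, which is the property the observation relies on.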

CS590Z MOSS
- Algorithm
  - Build an index mapping fingerprints to locations for all documents.
  - Fingerprint each document a second time and look up the selected fingerprints in the index; this gives the list of all matching fingerprints for each document.
  - Sort the matches (d, d1, fx), (d, d2, fy) by their first two elements.
  - Rank-order matches between documents by size (number of fingerprints).
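A rough sketch of the indexing and ranking steps follows. It assumes fingerprints have already been selected for each document. The sort by document pair is swapped for a counter keyed by document pair, which groups matches the same way. The function name, document ids, and fingerprint values are all hypothetical.

```python
from collections import Counter, defaultdict

def rank_pairs(doc_fingerprints):
    """doc_fingerprints maps doc id -> list of (position, hash) fingerprints.

    Returns [((doc_a, doc_b), shared_count), ...], largest count first.
    """
    # Step 1: index every fingerprint hash to the places it occurs.
    index = defaultdict(list)
    for doc, fps in doc_fingerprints.items():
        for pos, h in fps:
            index[h].append((doc, pos))
    # Step 2: look each fingerprint up again and count matches per pair.
    shared = Counter()
    for doc, fps in doc_fingerprints.items():
        for _, h in fps:
            for other, _ in index[h]:
                if doc < other:  # count each unordered pair once
                    shared[(doc, other)] += 1
    # Step 3: rank pairs by number of shared fingerprints.
    return shared.most_common()

docs = {
    "a": [(0, 17), (4, 8), (9, 39)],
    "b": [(2, 17), (5, 8)],
    "c": [(1, 39), (7, 5)],
}
ranking = rank_pairs(docs)
print(ranking)  # [(('a', 'b'), 2), (('a', 'c'), 1)]
```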

CS590Z MOSS
- Advantages
  - Guaranteed to detect any substring match of length at least t
- Limitations
  - Minor edits defeat MOSS.
    - x = a*b + c vs. z = c + a*b
  - Insertion, deletion

CS590Z AST based matching
- [YANG, 1991, Software Practice and Experience]
  - Given two functions, build their ASTs
  - Match the roots
  - If they match, apply LCS to align the subtrees
  - Continue recursively
- Fragile
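The recursive scheme above can be sketched as follows. This follows the steps listed on the slide and is not necessarily Yang's exact formulation. Trees are plain `(label, children)` tuples, and the labels are made up for illustration. Children are aligned with an order-preserving, LCS-style dynamic program, which is why swapping two operands loses part of the match.

```python
def tree_match(a, b):
    """Count matched node pairs between two (label, children) trees."""
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return 0  # roots differ: nothing below them is matched
    n, m = len(kids_a), len(kids_b)
    # best[i][j] = best in-order alignment of the first i and first j children
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            paired = best[i - 1][j - 1] + tree_match(kids_a[i - 1], kids_b[j - 1])
            best[i][j] = max(best[i - 1][j], best[i][j - 1], paired)
    return best[n][m] + 1  # +1 for the matched roots

name = ("name", [])
product = ("*", [name, name])
# x = a*b + c   versus   z = c + a*b, with identifiers abstracted to "name"
t1 = ("assign", [name, ("+", [product, name])])
t2 = ("assign", [name, ("+", [name, product])])
print(tree_match(t1, t1))  # 7
print(tree_match(t1, t2))  # 6
```

Each tree has 7 nodes. After the operands of `+` are swapped, only 6 can be matched, since the alignment cannot cross.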

CS590Z DECKARD (ICSE 2007)

CS590Z DECKARD
- Advantages
  - Scalability
  - Insensitive to minor structural changes such as reordering, insertion, deletion
- Limitations
  - Structural similarity only
  - Insertions that incur structural change

CS590Z CFG matching
- Hammock graph (JDIFF, ASE 2004)
  - Match classes by names
  - Match fields by types
  - Match methods by signatures
  - Match instructions in methods by hammock graphs
- A hammock is a single-entry, single-exit subgraph of a CFG.

CS590Z CFG matching
- Pros
  - Orthogonal
    - Can be combined with other matching techniques
  - Simple
- Cons
  - Coarse-grained matching only
    - Not good at clone detection
  - In case of code transformation

CS590Z Semantic Based Matching
- Using PDG (SAS’01)

CS590Z Semantic Based

CS590Z Semantic Based
- Pros
  - Handles non-contiguous, intertwined, reordered code
  - Insensitive to code transformations
- Cons
  - Scalability
    - Points-to analysis
  - Starting from a matching pair seems to be a problem

CS590Z Wrap Up
- For clone detection
  - Structural / text similarity may be a good idea
- For whole-program matching / method matching with code transformations
  - Semantic-based matching is more appropriate
- Scalability
  - PDG < CFG | AST < STRING < NAME
