
1 CS590Z Matching Program Versions Xiangyu Zhang

2 CS590Z Problem Statement
- Suppose a program P’ is created by modifying P. Determine the difference between P and P’. For an artifact c’ in P’, decide whether c’ belongs to the difference; if not, find the correspondence of c’ in P.
- Static mapping
- Non-trivial
  - Name comparison?
  - What if
- Clone analysis, comparison checking

3 CS590Z Motivations
- Validate compiler transformations
- Facilitate regression testing
- Reverse obfuscation
- Information propagation
- Debugging
- Code plagiarism detection
- Information Assurance

4 CS590Z Approaches
- Static Approaches
  - Entity name based
  - String based (MOSS)
  - AST based (DECKARD)
  - CFG based (JDIFF)
  - PDG based (PDIFF)
  - Binary based (BMAT)
  - Log based (editor plugin, comparison checking)
- Dynamic Approaches (not today)

5 CS590Z Static Approaches
- Entity name matching
  - Model a function/field as tuples
  - Coarse-grained matching
- String matching
  - Diff (CVS, Subversion)
  - Longest common subsequence (LCS)
    - Available operations are addition and deletion
    - Matched pairs cannot cross one another
  - Programs are far more complicated than strings
    - Copy, paste, move
  - CP-Miner (scales to Linux kernel clone detection)
    - Frequent subsequence mining
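The LCS bullet can be made concrete with a small sketch. The function name `lcs_pairs` and the sample line lists are hypothetical, and this is a textbook dynamic-programming formulation, not necessarily what diff tools implement. It shows why a moved line is reported as a deletion plus an addition: matched pairs are never allowed to cross.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover one optimal alignment.
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # a[i] is treated as deleted
        else:
            j += 1  # b[j] is treated as added
    return pairs


old = ["x = 1", "y = 2", "print(x)"]
new = ["y = 2", "x = 1", "print(x)"]  # the first two lines were swapped
print(lcs_pairs(old, new))  # [(1, 0), (2, 2)]
```

Only two of the three lines are matched here: "x = 1" exists in both versions, but matching it would cross the pair for "y = 2", so it shows up as a deletion and an addition.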

6 CS590Z MOSS
- Code plagiarism detection
  - It also handles other digital content
- Challenges
  - White space (variable names)
  - Noise (“the”, “int i”)
  - Order scrambling (paragraph reordering)
- Problem statement
  - Given a set of documents, identify substring matches that satisfy two properties:
    - If there is a substring match at least as long as the guarantee threshold t, then this match is detected.
    - No match shorter than the noise threshold k is detected.

7 CS590Z MOSS
- k-gram
  - A contiguous substring of length k
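As a minimal sketch of the k-gram definition, the snippet below lists every k-gram of a string. The helper names `normalize` and `kgrams` and the sample sentence are assumptions for illustration; the normalization step (dropping whitespace and punctuation, lowercasing) is one simple way to address the white-space challenge and is not claimed to be what MOSS does.

```python
def normalize(text):
    """Drop whitespace and punctuation and lowercase the rest (assumed preprocessing)."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def kgrams(text, k):
    """Return all contiguous substrings of length k, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


grams = kgrams(normalize("A do run run run, a do run run"), 5)
print(len(grams))  # 17
print(grams[:3])   # ['adoru', 'dorun', 'orunr']
```

A string of length n yields n - k + 1 k-grams, so almost every position starts one; this is why a later selection step is needed to keep only a subset.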

8 CS590Z MOSS
- Incremental hashing
  - Hashing strings of length k is expensive for large k.
  - “Rolling” hash function
    - The (i+1)th k-gram hash = F(the ith k-gram hash, …)
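The slide leaves F unspecified. One common choice, assumed here for illustration, is a polynomial (Karp-Rabin style) hash, where the next hash is obtained by removing the contribution of the outgoing character and appending the incoming one. The function names and the `base` and `mod` values are hypothetical parameters, and k is assumed to be at least 1.

```python
def direct_hash(s, base=257, mod=(1 << 61) - 1):
    """Hash a string from scratch; costs O(len(s))."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h


def rolling_hashes(text, k, base=257, mod=(1 << 61) - 1):
    """Hash every k-gram of text, updating each hash from the previous one in O(1)."""
    if len(text) < k:
        return []
    h = direct_hash(text[:k], base, mod)
    hashes = [h]
    top = pow(base, k - 1, mod)  # weight of the character that leaves the window
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * top) % mod  # remove the outgoing character
        h = (h * base + ord(text[i])) % mod     # shift and add the incoming character
        hashes.append(h)
    return hashes


text = "adorunrunrunadorunrun"
rolled = rolling_hashes(text, 5)
# The incremental result agrees with hashing each 5-gram from scratch.
print(rolled == [direct_hash(text[i:i + 5]) for i in range(len(text) - 4)])  # True
```

Only the first k-gram pays the full O(k) cost; every later hash is a constant-time update, which is the point of the rolling formulation.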

9 CS590Z MOSS
- Fingerprint selection
  - A subset of the hash values
  - Our goals: find all matching substrings > t; ignore matches < k
  - One of every t-th hash value
  - Hash values equal to 0 mod p

10 CS590Z MOSS
- Winnowing
  - Observation: given a sequence of hashes h1, …, hn, if n > t-k, then at least one of the hi must be chosen
  - Use a sliding window of size w = t-k+1
  - In each window, select the minimum hash value; break ties by selecting the rightmost occurrence.
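The window rule above can be sketched in a few lines. The function name `winnow` and the hash sequence are illustrative assumptions; the window size w would be t-k+1 in the slide's notation, and a sequence shorter than w simply yields no fingerprints in this sketch.

```python
def winnow(hashes, w):
    """Select (hash, position) fingerprints: the rightmost minimum of each window of size w."""
    fingerprints = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Rightmost occurrence of the minimum inside this window.
        pos = start + (w - 1 - window[::-1].index(smallest))
        # Record a fingerprint only when the selected position changes.
        if pos != last_pos:
            fingerprints.append((smallest, pos))
            last_pos = pos
    return fingerprints


hashes = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(winnow(hashes, 4))
# [(17, 3), (17, 6), (8, 8), (39, 11), (17, 15)]
```

Consecutive windows usually share their minimum, so far fewer fingerprints are kept than there are windows, yet every window contributes at least one selected hash.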

11 CS590Z MOSS
- Algorithm
  - Build an index mapping fingerprints to locations for all documents.
  - Each document is fingerprinted a second time and the selected fingerprints are looked up in the index; this gives the list of all matching fingerprints for each document.
  - Sort (d, d1, fx), (d, d2, fy) by the first two elements.
  - Matches between documents are rank-ordered by size (number of fingerprints).
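The indexing and ranking steps can be sketched as follows, assuming each document has already been reduced to a list of (hash, position) fingerprints. The function names, document ids, and fingerprint values are hypothetical. As a simplification, this sketch groups matches per document pair with a dictionary instead of an explicit sort, and it counts distinct shared hash values as the match size.

```python
from collections import defaultdict


def build_index(doc_fingerprints):
    """Map each fingerprint hash to the (document id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, fingerprints in doc_fingerprints.items():
        for h, pos in fingerprints:
            index[h].append((doc_id, pos))
    return index


def rank_matches(doc_fingerprints):
    """Return document pairs ordered by the number of fingerprints they share."""
    index = build_index(doc_fingerprints)
    shared = defaultdict(set)
    # Second pass: look each document's fingerprints up in the index.
    for doc_id, fingerprints in doc_fingerprints.items():
        for h, _pos in fingerprints:
            for other_id, _other_pos in index[h]:
                if other_id != doc_id:
                    pair = tuple(sorted((doc_id, other_id)))
                    shared[pair].add(h)
    ranked = [(pair, len(hs)) for pair, hs in shared.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


docs = {
    "a": [(17, 3), (8, 8), (39, 11)],
    "b": [(17, 0), (8, 5), (99, 9)],
    "c": [(39, 2), (5, 7)],
}
print(rank_matches(docs))  # [(('a', 'b'), 2), (('a', 'c'), 1)]
```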

12 CS590Z MOSS
- Advantages
  - Guaranteed to detect any substring match > t
- Limitations
  - Minor edits defeat MOSS.
    - x = a*b + c vs. z = c + a*b
  - Insertion, deletion

13 CS590Z AST based matching
- [YANG, 1991, Software Practice and Experience]
  - Given two functions, build the ASTs
  - Match the roots
  - If they match, apply LCS to align subtrees
  - Continue recursively
- Fragile
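A rough sketch of the recursive idea follows, under stated assumptions: trees are hypothetical `(label, children)` tuples, roots match only when their labels are equal, and the children are aligned with an LCS-style dynamic program whose weights are the recursive match counts. This is an illustration of the steps listed above, not a reproduction of the published algorithm.

```python
def match_count(t1, t2):
    """Count matched nodes between two trees given as (label, children) tuples."""
    label1, kids1 = t1
    label2, kids2 = t2
    if label1 != label2:
        return 0  # roots differ, so nothing below them is matched
    m, n = len(kids1), len(kids2)
    # dp[i][j]: best total for aligning the first i children of t1
    # with the first j children of t2, keeping their order.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = dp[i - 1][j - 1] + match_count(kids1[i - 1], kids2[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], paired)
    return 1 + dp[m][n]


def leaf(name):
    return (name, [])


product = ("*", [leaf("a"), leaf("b")])
# x = a*b + c
before = ("=", [leaf("x"), ("+", [product, leaf("c")])])
# z = c + a*b
after = ("=", [leaf("z"), ("+", [leaf("c"), product])])

print(match_count(before, before))  # 7
print(match_count(before, after))   # 5
```

Both trees have seven nodes, but only five are matched after the edit: the renamed variable fails the label test, and the reordered operand `c` cannot be paired without crossing the match for `a*b`. This is one way to see why the slide calls the approach fragile.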

14 CS590Z DECKARD (ICSE 2007)

15 CS590Z DECKARD
- Advantages
  - Scalability
  - Insensitive to minor structural changes such as reordering, insertion, deletion
- Limitations
  - Structural similarity only
  - Insertions that incur structural changes

16 CS590Z CFG matching
- Hammock graph (JDIFF, ASE 2004)
  - Match classes by names
  - Match fields by types
  - Match methods by signatures
  - Match instructions in methods by hammock graphs
    - A hammock is a single-entry, single-exit subgraph of a CFG.

17 CS590Z CFG matching
- Pros
  - Orthogonal
    - Can be combined with other matching techniques
  - Simple
- Cons
  - Coarse-grained matching only
    - Not good at clone detection
  - In case of code transformation

18 CS590Z Semantic Based Matching
- Using PDG (SAS’01)

19 CS590Z Semantic Based

20 CS590Z Semantic Based
- Pros
  - Handles non-contiguous, intertwined, reordered code
  - Insensitive to code transformations
- Cons
  - Scalability
    - Points-to analysis
  - Starting from a matching pair seems to be a problem

21 CS590Z Wrap Up
- For clone detection
  - Structural / text similarity may be a good idea
- For whole-program matching / method matching with code transformations
  - Semantic-based matching is more appropriate
- Scalability
  - PDG < CFG | AST < STRING < NAME

