Chao Liu, Chen Chen, Jiawei Han, Philip S. Yu

GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis
Chao Liu, Chen Chen, Jiawei Han, Philip S. Yu
University of Illinois at Urbana-Champaign; IBM T.J. Watson Research Center
Presented by Chao Liu

Motivations
  Blossoming of open-source projects
    SourceForge.net: 125,090 projects as of July 2006
  Convenience for software plagiarism?
    You can always find something online
  Core-part plagiarism
    Ripping off GUIs and irrelevant parts
    (Illegally) reusing the implementations of core algorithms
  Our goal
    Efficient detection of core-part plagiarism

Challenges
  Effectiveness
    Professional plagiarists
    Automated plagiarism
  Efficiency
    Only a small part of the code is plagiarized; how can it be detected efficiently?

Outline
  Plagiarism Disguises
  Review of Plagiarism Detection
  GPLAG: PDG-based Plagiarism Detection
  Efficiency and Scalability
  Experiments
  Conclusions

Original Program
A procedure from the program join

    static void
    make_blank (struct line *blank, int count)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      blank->nfields = count;
      blank->buf.size = blank->buf.length = count + 1;
      blank->buf.buffer = (char*) xmalloc (blank->buf.size);
      buffer = (unsigned char *) blank->buf.buffer;
      blank->fields = fields = (struct field *) xmalloc (sizeof (struct field) * count);
      for (i = 0; i < count; i++){
        ...
      }
    }

Disguise 1: Format Alteration
Insert comments and blanks

    static void
    make_blank (struct line *blank, int count)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      blank->nfields = count; // initialization
      blank->buf.size = blank->buf.length = count + 1;
      blank->buf.buffer = (char*) xmalloc (blank->buf.size);
      buffer = (unsigned char *) blank->buf.buffer;
      blank->fields = fields = (struct field *) xmalloc (sizeof (struct field) * count);
      for (i = 0; i < count; i++){
        ...
      }
    }

Disguise 2: Identifier Renaming
Rename variables consistently

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->nfields = num; // initialization
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
      for (i = 0; i < num; i++){
        ...
      }
    }

Disguise 3: Statement Reordering
Reorder non-dependent statements

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->nfields = num; // initialization
      for (i = 0; i < num; i++){
        ...
      }
    }

Disguise 4: Control Replacement
Use an equivalent control structure

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->nfields = num; // initialization
      i = 0;
      while (i < num){
        ...
        i++;
      }
    }

Disguise 5: Code Insertion
Insert immaterial code

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->nfields = num; // initialization
      i = 0;
      while (i < num){
        ...
        for (int j = 0; j < i; j++);
        i++;
      }
    }

Fully Disguised


Review of Plagiarism Detection
  String-based [Baker 1995]
    A program is represented as a string
    Blanks and comments are ignored
  AST-based [Baxter et al. 1998, Kontogiannis et al. 1995]
    A program is represented as an Abstract Syntax Tree (AST)
    Fragile to statement reordering, control replacement and code insertion
  Token-based [Kamiya et al. 2002, Prechelt et al. 2002]
    Variables of the same type are mapped to the same token
    A program is represented as a token string
    Fingerprints of token strings are used for robustness [Schleimer et al. 2003]
    Partially robust to statement reordering, control replacement and code insertion
    Representatives: Moss and JPlag
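To illustrate why token-based detectors shrug off format alteration and identifier renaming, here is a minimal sketch in C. It is not how Moss or JPlag tokenize: as a simplifying assumption it maps every identifier (keywords included) to the single token I and every number to N, rather than using one token per variable type, and the function names normalize_tokens and same_token_string are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Write a crude token string for src into out (at most out_size - 1 chars).
 * Assumption: identifiers -> 'I', numbers -> 'N', whitespace and // comments
 * are dropped, every other character is copied through unchanged. */
void normalize_tokens(const char *src, char *out, size_t out_size)
{
    size_t o = 0;
    assert(out_size > 0);
    while (*src != '\0' && o + 1 < out_size) {
        unsigned char c = (unsigned char)*src;
        if (isspace(c)) {
            src++;                                  /* blanks carry no information */
        } else if (c == '/' && src[1] == '/') {
            while (*src != '\0' && *src != '\n')    /* skip a line comment */
                src++;
        } else if (isalpha(c) || c == '_') {
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;
            out[o++] = 'I';                         /* any identifier */
        } else if (isdigit(c)) {
            while (isalnum((unsigned char)*src))
                src++;
            out[o++] = 'N';                         /* any numeric literal */
        } else {
            out[o++] = *src++;                      /* operators and punctuation */
        }
    }
    out[o] = '\0';
}

/* Return 1 if both fragments normalize to the same token string. */
int same_token_string(const char *a, const char *b)
{
    char ta[256], tb[256];
    normalize_tokens(a, ta, sizeof ta);
    normalize_tokens(b, tb, sizeof tb);
    return strcmp(ta, tb) == 0;
}
```

Under these assumptions, `blank->nfields = count;` and `fill->nfields = num; // initialization` both normalize to `I->I=I;`, so Disguises 1 and 2 leave the token string unchanged. Reordered or inserted statements do change it, which is why token-based tools are only partially robust to the remaining disguises.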


Graphic representation of source code

    int sum(int array[], int count)
    {
      int i, sum;
      sum = 0;
      for(i = 0; i < count; i++){
        sum = add(sum, array[i]);
      }
      return sum;
    }

    int add(int a, int b)
    {
      return a + b;
    }


Control Dependency

    int sum(int array[], int count)
    {
      int i, sum;
      sum = 0;
      for(i = 0; i < count; i++){
        sum = add(sum, array[i]);
      }
      return sum;
    }

    int add(int a, int b)
    {
      return a + b;
    }

Data Dependency

    int sum(int array[], int count)
    {
      int i, sum;
      sum = 0;
      for(i = 0; i < count; i++){
        sum = add(sum, array[i]);
      }
      return sum;
    }

    int add(int a, int b)
    {
      return a + b;
    }

Plagiarism Detectable?

Corresponding PDGs
  [Figure: PDG for the original code; PDG for the plagiarized code]

PDG-based Plagiarism Detection
  A program is represented as a set of PDGs
    Let g be the PDG of procedure P in the original program
    Let g' be the PDG of procedure P' in the plagiarism suspect
  Subgraph isomorphism implies plagiarism
    If g is subgraph isomorphic to g', P' is likely plagiarized from P
  γ-isomorphism
    Graph g is γ-isomorphic to g' if there exists a subgraph s of g such that s is subgraph isomorphic to g' and |s| ≥ γ|g|
    If g is γ-isomorphic to g', the pair (g, g') is regarded as a plagiarized PDG pair and is returned to a human reviewer for examination
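The γ-isomorphism test above can be sketched as a brute-force search over tiny graphs. This is an illustration only, not GPLAG's implementation (which relies on an existing graph-matching library): the Pdg struct, the MAX_V cap, and the function names are hypothetical, |s| is assumed to count vertices, and "subgraph isomorphic" is read as an injective, kind-preserving vertex mapping under which every edge of s has a counterpart in g'.

```c
#include <assert.h>

#define MAX_V 16  /* hypothetical cap; the search below is exponential */

typedef struct {
    int n;                   /* number of vertices */
    int kind[MAX_V];         /* vertex kind, e.g. 0 = declaration, 1 = assignment */
    int edge[MAX_V][MAX_V];  /* edge[u][v] != 0 means a dependency edge u -> v */
} Pdg;

/* Can vertex u of g be mapped to vertex w of gp, given map[0..u-1]? */
static int compatible(const Pdg *g, const Pdg *gp, const int *map, int u, int w)
{
    if (g->kind[u] != gp->kind[w]) return 0;
    if (g->edge[u][u] && !gp->edge[w][w]) return 0;
    for (int v = 0; v < u; v++) {
        if (map[v] < 0) continue;        /* v was left out of s */
        if (map[v] == w) return 0;       /* w is already taken */
        if (g->edge[v][u] && !gp->edge[map[v]][w]) return 0;
        if (g->edge[u][v] && !gp->edge[w][map[v]]) return 0;
    }
    return 1;
}

/* Try every way to map or skip vertex u; record the largest |s| found. */
static void search(const Pdg *g, const Pdg *gp, int *map, int u, int size, int *best)
{
    if (size + (g->n - u) <= *best) return;  /* cannot beat the best so far */
    if (u == g->n) { *best = size; return; }
    for (int w = 0; w < gp->n; w++) {
        if (compatible(g, gp, map, u, w)) {
            map[u] = w;
            search(g, gp, map, u + 1, size + 1, best);
        }
    }
    map[u] = -1;                             /* leave u out of s */
    search(g, gp, map, u + 1, size, best);
}

/* Size of the largest subgraph s of g that embeds into gp. */
int largest_matched_size(const Pdg *g, const Pdg *gp)
{
    int map[MAX_V];
    int best = 0;
    assert(g->n <= MAX_V && gp->n <= MAX_V);
    for (int i = 0; i < MAX_V; i++) map[i] = -1;
    search(g, gp, map, 0, 0, &best);
    return best;
}

/* g is gamma-isomorphic to gp if some such s has |s| >= gamma * |g|. */
int is_gamma_isomorphic(const Pdg *g, const Pdg *gp, double gamma)
{
    assert(gamma > 0.0 && gamma <= 1.0);
    return largest_matched_size(g, gp) >= gamma * g->n;
}
```

With γ below 1, a suspect that has dropped or altered a few vertices can still be flagged, while inserted code only adds vertices to g' and so does not shrink the match.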

Advantages
  Robust because it is hard to overhaul PDGs
    Dependencies encode program logic
    Program logic is the incentive for plagiarism


Efficiency and Scalability
  Search space
    If the original program has n procedures and the plagiarism suspect has m procedures, n*m subgraph isomorphism tests are needed
  Pruning the search space
    Lossless filter
    Statistical lossy filter

Lossless Filter
  Interestingness
    PDGs smaller than an interesting size K are excluded from both sides
  γ-isomorphism definition
    A PDG pair (g, g') is discarded if |g'| < γ|g|
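The two pruning rules can be sketched in C as follows. PDG sizes are assumed to be vertex counts, and the names keep_pair and count_surviving_pairs are hypothetical. The second rule loses nothing because any subgraph s of g that is subgraph isomorphic to g' satisfies |s| ≤ |g'|, so |g'| < γ|g| already rules out |s| ≥ γ|g|.

```c
#include <assert.h>

/* Return 1 if the pair (g, g') must still go to isomorphism testing. */
int keep_pair(int size_g, int size_gp, int min_size_k, double gamma)
{
    assert(gamma > 0.0 && gamma <= 1.0);
    /* Interestingness: drop PDGs smaller than K on either side. */
    if (size_g < min_size_k || size_gp < min_size_k) return 0;
    /* g' is too small to contain a subgraph of size gamma * |g|. */
    if (size_gp < gamma * size_g) return 0;
    return 1;
}

/* Walk the full n*m search space and count the pairs that survive. */
int count_surviving_pairs(const int *orig_sizes, int n,
                          const int *susp_sizes, int m,
                          int min_size_k, double gamma)
{
    int kept = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            kept += keep_pair(orig_sizes[i], susp_sizes[j], min_size_k, gamma);
    return kept;
}
```

For example, with original PDG sizes {4, 20, 30}, suspect sizes {25, 8}, K = 10 and γ = 0.9, only the pair (20, 25) survives out of the 6 candidate pairs.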

Lossy Filter
  Observation
    If procedure P' is plagiarized from procedure P, its PDG g' should look similar to g
    So discard dissimilar PDG pairs
  Requirement
    The filter must be lightweight
    Otherwise, direct isomorphism testing could be more efficient

Vertex Histogram
  Represent PDG g by h(g) = (n1, n2, …, nk), where ni is the frequency of the i-th kind of vertex
  Similarly, represent PDG g' by h(g') = (m1, m2, …, mk)
  Direct similarity measurement?
    How to define a proper similarity threshold?
    Is a threshold defined this way program-independent?
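Computing h(g) is a single pass over the vertices, which is what keeps this representation cheap. A minimal sketch in C, assuming each vertex kind has already been encoded as an integer in [0, k); the encoding and the function name vertex_histogram are hypothetical:

```c
#include <assert.h>

/* Fill h[0..k-1] with the number of vertices of each kind.
 * kinds[v] is the (assumed) integer code of the kind of vertex v. */
void vertex_histogram(const int *kinds, int num_vertices, int k, int *h)
{
    for (int i = 0; i < k; i++)
        h[i] = 0;
    for (int v = 0; v < num_vertices; v++) {
        assert(kinds[v] >= 0 && kinds[v] < k);
        h[kinds[v]]++;
    }
}
```

For a PDG whose five vertices have kinds 0, 2, 2, 1, 2 and k = 3, this yields h = (1, 1, 3).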

Hypothesis Testing-based Approach
  Basic idea
    Estimate a k-dimensional multinomial distribution from h(g)
    Test whether h(g') is likely an observation from that distribution
    If it is, g' looks similar to g, and an isomorphism test is needed
    Otherwise, (g, g') is discarded
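One way to realize this idea is a goodness-of-fit test. The sketch below uses Pearson's chi-square statistic as a stand-in, since the statistic GPLAG actually uses is not recoverable from this transcript. The additive smoothing constant 0.5 (needed so that a vertex kind absent from g does not cause a division by zero), the caller-supplied critical value, and all function names are assumptions.

```c
#include <assert.h>

/* Pearson chi-square statistic of h(g') against the multinomial
 * estimated from h(g). hg and hgp both have k entries. */
double chi_square_statistic(const int *hg, const int *hgp, int k)
{
    double n_total = 0.0, m_total = 0.0, stat = 0.0;
    for (int i = 0; i < k; i++) {
        n_total += hg[i];
        m_total += hgp[i];
    }
    assert(n_total > 0.0 && m_total > 0.0);
    for (int i = 0; i < k; i++) {
        /* Smoothed estimate of the probability of vertex kind i. */
        double p = (hg[i] + 0.5) / (n_total + 0.5 * k);
        double expected = m_total * p;
        double diff = hgp[i] - expected;
        stat += diff * diff / expected;
    }
    return stat;
}

/* Keep the pair for isomorphism testing unless h(g') is an unlikely
 * observation, i.e. the statistic exceeds the chosen critical value. */
int passes_lossy_filter(const int *hg, const int *hgp, int k, double critical)
{
    return chi_square_statistic(hg, hgp, k) <= critical;
}
```

With k = 2 there is one degree of freedom, so 3.841 is the usual critical value at the 0.05 level: h(g) = (9, 1) against h(g') = (9, 1) passes, while h(g') = (1, 9) is rejected. Because a similar-looking pair can still be rejected by chance, the filter is lossy.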

Technical Details

Technical Details (cont’d)

Workflow of GPLAG
  PDGs are generated with CodeSurfer
  Isomorphism testing is implemented with VFLib


Experiment Design
  Subject programs
  Effectiveness
  Filter efficiency
  Core-part plagiarism detection

Effectiveness
  2 hours of manual plagiarism (could this be automated?)
  GPLAG detects all plagiarized PDG pairs within 1 second
  PDG isomorphism also reveals which plagiarism disguises were applied

Efficiency
  Subject programs
    bc, less and tar
    Exact copy as plagiarism
  Lossless and lossy filters
    Pruning PDG pairs
    Implication for overall time cost

Pruning Uninteresting PDG Pairs
  [Table: lossless filter only vs. lossless and lossy filters]

Implication for Overall Time Cost
  Subgraph isomorphism tests that hit the time-out are time hogs
  The lossless filter does not save much time
  The lossy filter significantly reduces the time cost
  The major time saving comes from avoiding time hogs

Detection of Core-part Plagiarism
  Lower time cost with the lossy filter
  Fewer false positives with the lossy filter


Conclusions
  We developed a new algorithm, GPLAG, for software plagiarism detection
    It is more effective against "professional" plagiarists
  We developed a statistical lossy filter, which improves the efficiency of GPLAG
  We experimentally verified the effectiveness and efficiency of GPLAG

Q & A Thank You!

References
[1] B. S. Baker. On finding duplication and near duplication in large software systems. In Proc. of 2nd Working Conf. on Reverse Engineering, 1995.
[2] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In Proc. of Int. Conf. on Software Maintenance, 1998.
[3] K. Kontogiannis, M. Galler, and R. DeMori. Detecting code similarity using patterns. In Working Notes of 3rd Workshop on AI and Software Engineering, 1995.
[4] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7), 2002.
[5] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. J. of Universal Computer Science, 8(11), 2002.
[6] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD, 2003.
[7] V. B. Livshits and T. Zimmermann. Dynamine: Finding common error patterns by mining software revision histories. In Proc. of 13th Int. Symp. on the Foundations of Software Engineering, 2005.
[8] C. Liu, X. Yan, and J. Han. Mining control flow abnormality for logic error isolation. In Proc. 2006 SIAM Int. Conf. on Data Mining, 2006.
[9] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for "backtrace" of noncrashing bugs. In SDM, 2005.