Simple techniques for plagiarism detection in student programming projects Szymon Grabowski, Wojciech Bieniecki Computer Engineering Dept., Tech. Univ.


Simple techniques for plagiarism detection in student programming projects. Szymon Grabowski, Wojciech Bieniecki, Computer Engineering Dept., Tech. Univ. of Łódź, Poland {SGrabow, Sieci i Systemy Informatyczne (Networks and Information Systems), Łódź, October 2006. We plagiarized it...

2 Sz. Grabowski & W. Bieniecki, Simple techniques for plagiarism detection... What is plagiarism?

3 Plagiarism everywhere
- text (articles, scientific papers (also self-plagiarisms), essays... or just plot ideas in fiction books)
- music (melodies, “sampling”)
- images (copy/paste, e.g. from web pages)
Our interest: text plagiarism.

4 Text plagiarism
- stealing natural language (NL) texts
- stealing software code

5 Previous work (1/3) Faidhi & Robinson (1987): six levels of program modification in a plagiarism attempt: (i) changing comments, (ii) changing identifier names, (iii) reordering variable positions, (iv) procedure combination, (v) changing program statements, (vi) changing control logic. Changes to the program control logic are the most laborious (and the most prone to hard-to-detect errors), but they are also the hardest to identify properly as plagiarism.

6 Previous work (2/3) Irving (2004): finding local similarity with a variant of the classic Smith-Waterman algorithm (1981). Aim: taking care of both precision/recall and speed. Prechelt et al. (2000): the JPlag online system. Basic technique: find a set of identical substrings of strings A and B, adhering to a few simple rules. Quite robust to reordering parts of the text.

7 Previous work (3/3) Many algorithms are based on various code complexity measures (e.g. the number of execution paths through a program); see [Clough, 2000] for details. Mozgovoy et al. (2005): a suffix-array-based algorithm that decreases the computational complexity of all-against-all file comparison.

8 Our motivation Some say that laziness is a professional feature... Therefore we wanted to keep things simple (as opposed to many algorithms from the literature). Our task: find plagiarisms in student homework, namely in Java projects. These are small projects: not more than a few hundred lines are expected.

9 Our approach We conjecture that the relative order and frequency of the keywords of a given language are quite a good indicator of whether two documents were created independently or not, because it is not easy to find synonymous constructs without some understanding of the code. Why keywords? Maybe operators instead? Rather not: operators are easy to swap for equivalent ones. Examples (in C and similar languages): x = y / 2; ⇔ x = y * 0.5; and x -= 2; ⇔ x--, x--;

10 Java keywords nutsandbolts/_keywords.html

11 Three variants Extracting keywords, that’s easy. What then? What similarity measure? We propose 3 variants:
- based on the context-free counts of the keywords, i.e., order-0 statistics;
- based on the similarity of the statistics of pairs of successive keywords in the source files, i.e., order-1 rather than order-0 statistics;
- based on the similarity between the whole sequences of used keywords, in the order of their appearance, with the aid of the LCS (longest common subsequence) measure.
In all variants we measure pair-wise file similarity.
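All three variants start from the keyword sequence of a source file. The slides do not show how the keywords were extracted, so the following is only a minimal Python 3 sketch under stated assumptions: the keyword set is the standard Java reserved-word list (which may differ from the list on slide 10), comments and string literals are blanked out with a regular expression rather than a real lexer, and the names `JAVA_KEYWORDS` and `extract_keywords` are hypothetical.

```python
import re

# ASSUMPTION: the standard Java reserved words (Java 5 era), including the
# unused `const` and `goto`; the slide's own list may differ.
JAVA_KEYWORDS = frozenset("""
abstract assert boolean break byte case catch char class const continue
default do double else enum extends final finally float for goto if
implements import instanceof int interface long native new package private
protected public return short static strictfp super switch synchronized
this throw throws transient try void volatile while
""".split())

# Comments and string/char literals are blanked out first, so that a keyword
# appearing inside them is not counted. This is a simplification, not a lexer.
_NOISE = re.compile(
    r'//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\\n])*"|\'(?:\\.|[^\'\\\n])*\'',
    re.DOTALL,
)
_WORD = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def extract_keywords(source):
    """Return the Java keywords found in `source`, in order of appearance."""
    cleaned = _NOISE.sub(" ", source)
    return [word for word in _WORD.findall(cleaned) if word in JAVA_KEYWORDS]


if __name__ == "__main__":
    demo = 'public class A { // if while\n String s = "for"; int x = 0; if (x > 0) return; }'
    print(extract_keywords(demo))
    # ['public', 'class', 'int', 'if', 'return']
```

The keywords inside the comment and the string literal are ignored, which keeps a plagiarist from shifting the statistics by padding a file with commented-out code.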

12 Algorithm I (order-0 statistics)
1. For both files we create a dictionary (Dict1 and Dict2, respectively) of occurring keywords with the number of occurrences (a histogram).
2. We calculate the total number of keywords C.
3. We calculate the number of keyword repetitions R: [formula not preserved in the transcript]
4. We evaluate the similarity S = R / C.
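The order-0 steps can be sketched in Python 3 as follows. The formula for R is missing from the transcript, so this sketch makes two assumptions that are not confirmed by the source: R is taken as the number of keyword occurrences the two histograms share (the sum of per-keyword minima), and C as the keyword total of the longer file, which keeps S between 0 and 1. The function name `order0_similarity` is hypothetical.

```python
from collections import Counter


def order0_similarity(keywords1, keywords2):
    """Order-0 similarity S = R / C of two keyword sequences."""
    dict1 = Counter(keywords1)  # histogram of file 1
    dict2 = Counter(keywords2)  # histogram of file 2

    # ASSUMPTION: C is the keyword total of the longer file.
    c = max(sum(dict1.values()), sum(dict2.values()))
    if c == 0:
        return 0.0

    # ASSUMPTION: R counts the occurrences shared by both histograms;
    # Counter's `&` keeps the minimum count of each keyword.
    r = sum((dict1 & dict2).values())
    return r / c


if __name__ == "__main__":
    a = ["if", "for", "if", "return"]
    b = ["for", "if", "return", "if"]   # same keywords, different order
    print(order0_similarity(a, b))                # 1.0
    print(order0_similarity(a, ["if", "while"]))  # 0.25
```

Because only counts are compared, any reordering of the same keywords yields a similarity of 1 under these assumptions.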

13 Algorithm II (order-1 statistics)
1. For both files create a sequence of keywords (List1 and List2).
2. For each element i of List1 (except the last one) take its successor List1(i+1) and add the pair to the list lp1. Delete the repeated records from lp1.
3. Build lp2 from List2 analogously.
4. Evaluate the similarity measure S: [formula not preserved in the transcript]
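A minimal Python 3 sketch of the order-1 variant follows. Steps 1 to 3 are taken from the slide; the similarity formula of step 4 is missing from the transcript, so the ratio used here (shared pairs divided by the size of the larger pair set) is an assumption, as are the names `successive_pairs` and `order1_similarity`.

```python
def successive_pairs(keywords):
    """Steps 2-3: the set of distinct (keyword, successor) pairs."""
    return set(zip(keywords, keywords[1:]))


def order1_similarity(list1, list2):
    """Order-1 similarity of two keyword sequences."""
    lp1 = successive_pairs(list1)
    lp2 = successive_pairs(list2)
    if not lp1 or not lp2:
        return 0.0
    # ASSUMPTION: S is the share of common pairs relative to the larger set;
    # the original formula is not preserved in the transcript.
    return len(lp1 & lp2) / max(len(lp1), len(lp2))


if __name__ == "__main__":
    a = ["public", "class", "int", "if", "return"]
    b = ["public", "class", "if", "return"]
    print(order1_similarity(a, a))  # 1.0
    print(order1_similarity(a, b))  # 0.5
```

Using a set implements the "delete the repeated records" step directly; moving a whole function elsewhere in the file only changes the few pairs at its boundaries.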

14 Longest Common Subsequence (LCS) Given strings A, B, |A| = n, |B| = m, find the longest subsequence shared by both strings. Sometimes we are interested in a simpler problem: finding only the length of the LCS (LLCS), not the matching sequence. LCS example: A = matter, B = brothers. LLCS(A, B) = 3, LCS(A, B) = ter.
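The slide's example can be reproduced with the textbook dynamic-programming solution, sketched here in Python 3. This is the standard O(nm) algorithm, not necessarily the implementation used by the authors; the function names `llcs` and `lcs` are chosen for illustration.

```python
def llcs(a, b):
    """Length of the longest common subsequence, keeping two table rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs(a, b):
    """One longest common subsequence of a and b, returned as a list."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover the subsequence.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    out.reverse()
    return out


if __name__ == "__main__":
    print(llcs("matter", "brothers"))          # 3
    print("".join(lcs("matter", "brothers")))  # ter
```

Both functions accept any sequences, so they work equally well on lists of keywords as on strings of characters.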

15 Algorithm III LCS on strings where the “characters” are keywords
1. Denote the sequences of keywords in file1 and file2 by Word1 and Word2, respectively.
2. Use the formula for the similarity measure S: [formula not preserved in the transcript]
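A possible shape of the LCS-based variant is sketched below in Python 3. The similarity formula is missing from the transcript, so normalising the LLCS by the length of the longer keyword sequence is an assumption, and the names `llcs` and `lcs_similarity` are hypothetical.

```python
def llcs(word1, word2):
    """Length of the longest common subsequence of two keyword lists."""
    prev = [0] * (len(word2) + 1)
    for x in word1:
        curr = [0]
        for j, y in enumerate(word2, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(word1, word2):
    """LCS-based similarity of two keyword sequences."""
    longest = max(len(word1), len(word2))
    if longest == 0:
        return 0.0
    # ASSUMPTION: S = LLCS / length of the longer sequence; the original
    # formula is not preserved in the transcript.
    return llcs(word1, word2) / longest


if __name__ == "__main__":
    original = ["public", "class", "int", "if", "return"]
    trimmed = ["public", "class", "if", "return"]
    shuffled = ["if", "return", "public", "class"]
    print(lcs_similarity(original, trimmed))   # 0.8
    print(lcs_similarity(original, shuffled))  # 0.4
```

The second call illustrates the weakness of this variant: swapping two blocks of code cuts the common subsequence, even though the keyword counts are almost unchanged.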

16 Implementation / test setup All code was written in Python 2.4 (a perfect language for reluctant coders). Test machine: Pentium 4 3 GHz, 512 MB of RAM, Windows XP SP2. Input files: 15 student Java projects (single source files) solving the same task: displaying the time on an analog clock using a client-server technology (a server provides the system time, and many clients can be connected to the server).

17 Test files The files that are in fact plagiarisms are at the positions: 5 → 1, 8 → 15, 7 → 10 and 7 → 13.

18 Alg I (order-0 stats), similarity measure ...But the 4th thief was not detected.

19 Alg II (order-1 stats), similarity measure Not perfect but not bad either...

20 Alg III (LCS), similarity measure All four thieves at the top!

21 Conclusions All the presented algorithms seem to indicate the plagiarized code properly. In practice, however, it is impossible to set in advance, for each algorithm, a threshold similarity value above which files should be treated as plagiarisms. In Algorithm I the similarity values vary from 0.75 to 1, and values below 0.98 do not indicate plagiarism. This algorithm is the most resistant to changing the order of instructions and functions.

22 Conclusions, cont’d Algorithm II is fairly resistant to changing the order of functions and blocks of instructions. The range of the obtained similarity values is much wider compared to the first case. Algorithm III, based on the LCS measure, is vulnerable to changing the order of functions and instructions in the file. In the inspected case, however, the students stealing the code did not bother to shuffle the functions, so the results are comparable to those of Algorithm II. All the presented algorithms should work properly if a stolen homework is only a part of the original code.

23 Future plans
- Making the detection more robust to function reordering (even Algorithm II). Idea: convert a source file to a canonical form, sorting functions according to their signatures.
- More experiments (also for sources in C++, PHP...).
- Handling multi-file projects.
- Using not only keywords but standard library function names too?
- Several independent similarity measures, with detection based on training?