Using Natural Language Processing for Automatic Plagiarism Detection
Miranda Chong*, Lucia Specia, Ruslan Mitkov
Research Group in Computational Linguistics, University of Wolverhampton, UK
*miranda.chong@wlv.ac.uk
23rd June 2010, 4th International Plagiarism Conference, Northumbria University, Newcastle upon Tyne, UK
Overview
- Introduction
- Challenges
- Aims
- NLP Explained
- Experimental Setup
- Findings
- Discussion
- Further Developments
- Summary
Introduction
What is plagiarism? What is plagiarism detection?
Humans find it easy to judge whether two passages are "similar", but can computers perform this judgement?
Challenges
Limitations of existing methodologies:
- Lexical changes: synonymy, related concepts
- Structural changes: active/passive voice, word order, joining/splitting sentences
- Textual entailment: sentence paraphrase and other semantic variations
- Multi-source plagiarism
- Multi-lingual plagiarism
The vector space model (or term vector model) is an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing and relevancy ranking. Its first use was in the SMART Information Retrieval System. The vector space model has the following limitations:
1. Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality).
2. Search keywords must precisely match document terms; word substrings might result in a "false positive match".
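The vector space model described above can be sketched in a few lines of Python. This is a deliberately simplified illustration (raw term frequencies and whitespace tokenisation, no weighting such as TF-IDF), not the SMART system itself:

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Represent each document as a term-frequency vector and
    return the cosine of the angle between the two vectors."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Identical documents score 1.0 and documents sharing no terms score 0.0, which also illustrates the second limitation above: a synonym substitution drops the score even though the meaning is unchanged.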
Aims
Current research focus: external plagiarism, monolingual (English), free text, document level
Proposed framework: existing approaches + NLP = improved accuracy
Scope of research: tackle genuine plagiarism cases
NLP Explained [Natural Language Processing]
Computer systems that analyse written and spoken human language
NLP draws on linguistics, computer science, mathematics and artificial intelligence
Applications: machine translation, information extraction, document summarisation, question answering, etc.
Experimental Setup (1)
Corpus of Plagiarised Short Answers (Clough & Stevenson, 2009)
- Original source documents (Wikipedia articles): 5
- Plagiarised documents: 57
  - Near copy: 19
  - Light revision: 19
  - Heavy revision: 19
- Non-plagiarised documents: 38
Experimental Setup (1 cont.)
4 levels of suspicious documents:
- Near copy (copy & paste without changes)
- Light revision (minor alteration)
- Heavy revision (rewriting and restructuring)
- Non-plagiarised (original text not given)
Alternatively, 2 levels of classification:
- Plagiarised (near copy + light revision + heavy revision)
- Non-plagiarised
Note: the 2-level classification was not used in the paper; see the poster presentation for a comparison.
Experimental Setup (2)
System architecture pipeline:
Corpus (suspicious documents + original documents) → raw text
→ text pre-processing & NLP techniques → processed text
→ comparison methodologies → feature sets
→ machine learning algorithm → classifier → accuracy score
Experimental Setup (3)
Text pre-processing & NLP techniques:
- Baseline: sentence segmentation, tokenisation
- Lowercase
- Part-of-speech tagging
- Stop-word removal
- Punctuation removal
- Number replacement
- Lemmatisation
- Stemming
Syntactic processing techniques:
- Dependency parsing
- Chunking
Experimental Setup (4)
Comparison methodologies:
- Trigram similarity measures
- Language model probability measure
- Longest common subsequence
- Dependency relations matching
Comparative baseline: Ferret Plagiarism Detector (Lyon et al., 2001)
Machine learning algorithm: Naïve Bayes classifier
Sentence segmentation
- Determines sentence boundaries
- Splits the text of a document into sentences
- Allows sentence-level matching
["To be or not to be – that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them."] ["To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – 'tis a consummation devoutly to be wished."]
- Quote from William Shakespeare's Hamlet
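A minimal sentence segmenter can be sketched with a regular expression. This is an illustrative assumption, not the tool used in the paper; real segmenters also handle abbreviations ("Dr.", "e.g.") and other edge cases:

```python
import re

def segment_sentences(text: str) -> list[str]:
    """Split on ., ! or ? followed by whitespace and a capital letter
    (or an opening quote). Naive: abbreviations will be mis-split."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z"\'])', text.strip())
    return [p for p in parts if p]
```

Each returned element can then be matched at the sentence level, as the slide describes.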
Tokenisation
- Determines word and punctuation boundaries
- Isolates punctuation marks from words
"To be or not to be– that is the question:"
↓
[To] [be] [or] [not] [to] [be] [–] [that] [is] [the] [question] [:]
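The tokenisation step above can be approximated with one regular expression; this is a sketch, not the exact tokeniser used in the experiments:

```python
import re

def tokenise(text: str) -> list[str]:
    """Words become one token each; every punctuation mark is split
    off as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)
```

Applied to the slide's example it yields twelve tokens, with the dash and colon isolated from the adjacent words.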
Lowercase
- Substitutes uppercase characters with lowercase
- Generalises word matching
"To be or not to be– that is the question:"
↓
to be or not to be– that is the question:
Part-of-speech tagging
- Assigns a grammatical tag to each word
- Allows analysing the sequence of tags at the syntactic level
"To be or not to be– that is the question:"
↓
TO VB CC RB TO VB : WDT VBZ DT NN :
Stop-word removal
- Removes function words that carry little content
- Keeps content words (verbs, adverbs, nouns, adjectives)
"To be or not to be– that is the question:"
↓
be or not be – question:
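Stop-word removal is a simple filter over the token stream. The stop list below is a tiny illustrative assumption (the exact list used in the paper is not given here; note it must exclude "be", "or" and "not" to reproduce the slide's example):

```python
# Hedged: a minimal stop list for illustration; real systems use
# lists of a few hundred function words.
STOP_WORDS = {"to", "the", "that", "is", "a", "an", "of", "in", "and"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token whose lowercased form is in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```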
Punctuation removal
- Removes punctuation marks
"To be or not to be– that is the question:"
↓
To be or not to be that is the question
Number replacement
- Replaces numbers with a dummy symbol
- Generalises word matching
"63.75 percent of all statistics are made up, including this one."
↓
[NUM] percent of all statistics are made up, including this one.
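Number replacement is a one-line substitution; a sketch that handles integers and decimals (the dummy symbol `[NUM]` follows the slide's example):

```python
import re

def replace_numbers(text: str, dummy: str = "[NUM]") -> str:
    """Replace integer and decimal literals with a dummy symbol so
    that documents differing only in figures still match."""
    return re.sub(r"\d+(?:\.\d+)?", dummy, text)
```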
Lemmatisation
- Transforms words into their dictionary base forms
- Allows matching of similar words
Produced → Produce
Stemming
- Truncates words to their stems, which, unlike lemmas, need not be dictionary words
Produced / Product / Produce → Produc
Computational → Comput
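A toy suffix-stripping stemmer shows the idea; this is an illustrative assumption chosen to reproduce the slide's examples, not the stemmer used in the experiments (production systems typically use the Porter stemmer, which applies ordered rewrite rules):

```python
def crude_stem(word: str) -> str:
    """Strip the first matching suffix, longest first; keep at least
    three leading characters. Purely illustrative."""
    for suffix in ("ational", "ation", "ed", "e", "t", "s"):
        if word.lower().endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

With this rule set, "Produced", "Product" and "Produce" all collapse to the same stem, which is exactly what makes stemming useful for matching.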
Dependency parsing
- Syntactic analysis of sentences (Stanford parser)
- Allows matching of related word pairs at the constituent level
"To be or not to be– that is the question:"
aux(be-2, To-1) cc(be-2, or-3) neg(be-6, not-4) aux(be-6, to-5) conj(be-2, be-6) nsubj(question-11, that-8) cop(question-11, is-9) det(question-11, the-10) parataxis(be-2, question-11)
Chunking
- Shallow parsing generates a parse tree
- Keep only the constituent labels and structure
"To be or not to be"
(S (VP (TO To) (VP (VB be))) (CC or) (PP (RB not) (IN to) (VP (VB be))))
↓
VP VP CC PP VP
Trigram similarity measures
"To be or not to be" → {"To","be","or"} {"be","or","not"} {"or","not","to"} {"not","to","be"}
Jaccard similarity coefficient (Ferret Plagiarism Detector):
matching trigrams in suspicious & original documents ÷ all trigrams in suspicious & original documents
Containment measure (Clough & Stevenson):
matching trigrams in suspicious & original documents ÷ all trigrams in the suspicious document
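Both trigram measures above are set operations and can be sketched directly; a minimal implementation over token lists:

```python
def trigrams(tokens: list[str]) -> set[tuple[str, ...]]:
    """The set of word trigrams in a token sequence."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def jaccard(suspicious: list[str], original: list[str]) -> float:
    """Ferret-style Jaccard coefficient: shared trigrams over the
    union of trigrams from both documents."""
    s, o = trigrams(suspicious), trigrams(original)
    return len(s & o) / len(s | o) if s or o else 0.0

def containment(suspicious: list[str], original: list[str]) -> float:
    """Clough & Stevenson containment: shared trigrams over the
    trigrams of the suspicious document only."""
    s, o = trigrams(suspicious), trigrams(original)
    return len(s & o) / len(s) if s else 0.0
```

Containment is asymmetric by design: a short passage copied wholesale into a long suspicious document still scores highly, whereas Jaccard is diluted by the non-matching material.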
Longest common subsequence
- Calculates the longest sequence of matching words between sentences
Sentence 1: to be or not to be– that is the question.
Sentence 2: should we trust our new PM? that is the question for many voters.
LCS = "that", "is", "the", "question" = 4
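The word-level LCS can be computed with the standard dynamic-programming recurrence; a compact sketch:

```python
def lcs_length(words_a: list[str], words_b: list[str]) -> int:
    """Length of the longest common subsequence of words.
    dp[i][j] = LCS length of the first i words of a and j of b."""
    m, n = len(words_a), len(words_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if words_a[i] == words_b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]
```

On the slide's two sentences (after stripping punctuation) the result is 4, matching "that is the question".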
Language model probability measure
- N-gram statistical model
- SRILM language modelling toolkit (Stolcke, 2002)
- Calculates the level of similarity between document pairs
- Combines probabilities of n-gram overlaps:
  - Unigrams, bigrams, trigrams (tokenised corpus)
  - 4-grams & 5-grams (chunked corpus)
Dependency relations matching
- Counts the number of matching parsed relations between documents
Suspicious doc: nsubj(question, that), cop(question, is), det(question, the), parataxis(be, question)
Original doc: aux(be, to), cc(be, or), neg(be, not), aux(be, to), conj(be, be), nsubj(question, that), cop(question, is)
Dependency score = overlapping relations ÷ number of relations in the suspicious doc = 2 ÷ 4 = 0.5
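Representing each relation as a (label, head, dependent) tuple, the matching score is a simple set overlap; a sketch reproducing the slide's worked example:

```python
def dependency_overlap(suspicious_rels: list[tuple[str, str, str]],
                       original_rels: list[tuple[str, str, str]]) -> float:
    """Fraction of the suspicious document's dependency relations
    that also appear in the original document."""
    shared = set(suspicious_rels) & set(original_rels)
    return len(shared) / len(suspicious_rels) if suspicious_rels else 0.0

susp = [("nsubj", "question", "that"), ("cop", "question", "is"),
        ("det", "question", "the"), ("parataxis", "be", "question")]
orig = [("aux", "be", "to"), ("cc", "be", "or"), ("neg", "be", "not"),
        ("conj", "be", "be"), ("nsubj", "question", "that"),
        ("cop", "question", "is")]
```

Here two of the four suspicious relations also occur in the original, giving 2 ÷ 4 = 0.5 as on the slide.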
Machine learning algorithm
- WEKA machine learning toolkit (Hall et al., 2009)
- The feature scores are used to train a Naïve Bayes classifier, which learns a model
- The model classifies documents according to their level of plagiarism: near copy, light revision, heavy revision, or non-plagiarism
Findings (1)
Comparison of feature sets (pre-processing technique + comparison methodology):
1. Trigram containment measure: baseline dataset
2. Ferret: baseline dataset
3. Ferret: baseline + lemmatisation
4. Ferret: baseline + stop-word removal + punctuation removal + number replacement
5. Language model: bigram perplexity
6. Language model: trigram perplexity
7. Longest common subsequence
8. Dependency relations matching
Findings (2)
Naïve Bayes classifier, 10-fold cross-validation (41 features in total):
- Best feature set (trigram containment measure baseline; Ferret baseline + lemmatisation; Ferret baseline + stop-word removal + punctuation removal + number replacement; language model bigram and trigram perplexity; longest common subsequence; dependency relations matching): 70% accuracy
- Ferret baseline: 66% accuracy
- All features: 60% accuracy
Discussion (1)
- NLP enhances existing approaches
- Effective at distinguishing between plagiarised and non-plagiarised documents
- Deep NLP techniques (parsing) + machine learning = promising framework
- Accuracy of the best feature set on two levels (plagiarised/non-plagiarised): 94.74%
Discussion (2)
- Final human judgement is needed to establish cases
- Potential educational uses:
  - Identify suspicious cases for further investigation
  - Pre-emptive tool to detect incorrectly referenced materials
Further Developments
Identify paraphrased texts:
- WordNet: correlation 0.72
- Parse tree dependency relations: correlation 0.67
Future plans:
- Passage level
- Integrate WordNet with the current framework
- Perform experiments on other corpora (METER, PAN)
- Address multi-lingual plagiarism detection
Summary
- Plagiarism detection methodologies can be improved using NLP
- These tools can identify possible cases of plagiarism
- Human intervention will always be required to judge plagiarised cases
THE END
References
Clough, P., & Stevenson, M. (2009). Developing a corpus of plagiarised short answers. Language Resources and Evaluation, LRE 2010.
Ferret (2009). University of Hertfordshire. [Accessed: 21/3/2010] Available at: http://homepages.feis.herts.ac.uk/~pdgroup/
Gumm, H. P. (2010). Plagiarism or "naturally given"? Decide for yourself. Philipps-Universität Marburg. [Accessed: 17/5/2010]
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10-18.
iParadigms (2010). Turnitin. [Accessed: 11/5/2010]
Lyon, C., Barrett, R., & Malcolm, J. (2001). Experiments in electronic plagiarism detection. [Accessed: 21/3/2010]
Stolcke, A. (2002). SRILM: An extensible language modelling toolkit. In Proceedings of the Seventh International Conference on Spoken Language Processing, 3, 901-904.
ZEIT Online (2010). Abrechnung im Netz. [Accessed: 17/5/2010]
Feature scores per document (sample; "—" marks a value not recoverable from the source):

| Trigram Containment | Ferret: Baseline | Ferret: Baseline + Lemma | Ferret: Baseline + Stopword + Punctuation + Number | LM Bigram | LM Trigram | Longest Common Subsequence | Parse Tree Dependency Relations | Plagiarism Level |
|---|---|---|---|---|---|---|---|---|
| 0.008163265 | 0.005894 | 0.005917 | 0.003378 | 0.860969 | 0.822155 | 0.045104 | 0.02551 | non-plag |
| 0.698689956 | 0.381503 | 0.377926 | — | 0.066008 | 0.067699 | 0.883869 | 0.759259 | copy |
| 0.56 | 0.429487 | 0.428115 | 0.325243 | 0.166063 | 0.130039 | 0.390746 | 0.664894 | light |
| 0.123152709 | 0.065463 | 0.065611 | 0.045283 | 0.797519 | 0.735923 | 0.128136 | 0.202454 | heavy |
| 0.019323671 | 0.008837 | 0.00885 | 0.002457 | 0.458233 | 0.439833 | 0.685169 | 0.031646 | non-plag |
| 0.006134969 | 0.007326 | 0.007299 | 0.003067 | 0.624461 | 0.595563 | 0.166293 | 0.008368 | non-plag |
| 0.024305556 | 0.015131 | 0.015110 | — | 0.993904 | 0.946112 | 0.145788 | 0.060185 | non-plag |
| … | … | … | … | … | … | … | … | … |
| 0.191111111 | 0.163435 | 0.172702 | 0.108108 | 0.197605 | 0.135869 | 0.303056 | 0.203593 | copy |
| 0.012195122 | 0 | 0 | 0 | 0.845157 | 0.800283 | 0 | 0.016304 | non-plag |