COMPUTER-ASSISTED PLAGIARISM DETECTION PRESENTER: CSCI 6530 STUDENT.

1 COMPUTER-ASSISTED PLAGIARISM DETECTION PRESENTER: CSCI 6530 STUDENT

2 OUTLINE
  - Introduction
  - Types of Plagiarism Detection
  - External Plagiarism Detection
  - Source Retrieval Process
  - Source Retrieval Process – Cont.
  - Text Alignment Algorithms
  - Intrinsic Plagiarism Detection
  - Intrinsic Plagiarism Detection – Cont.
  - Discussions
  - Conclusion
  - References

3 INTRODUCTION
  - Plagiarism [1]
    - The practice of duplicating someone else's work without crediting the source
      - Presenting others' work as your own
      - Failing to cite sources
      - Providing incorrect information about the source
      - Using one source for the majority of your work
    - Mostly related to documents, but also includes art designs and software
  - The expansion and availability of information
    - Easier to plagiarize
    - Harder to detect
  - PAN Workshop [2]
    - Evaluation lab on Uncovering Plagiarism, Authorship and Social Software Misuse
    - Materials from the 1st and 6th International Competition on Plagiarism Detection

4 TYPES OF PLAGIARISM DETECTION
  - Manual Detection
    - Human review
    - Impractical due to the vast number of documents
  - Computer-aided Detection
    - Algorithms spot potential plagiarism cases
    - A human is still the final reviewer
  - Two problem classes: [2]
    - Extrinsic/External
      - Comparing a document to a collection of documents
    - Intrinsic
      - Applying stylometry: techniques for analyzing writing style
      - No reference collection

5 EXTERNAL PLAGIARISM DETECTION
  - Overall Approach [2]
    - Figure 1: Generic Process for External Plagiarism Detection [2]
    - Source Retrieval: collect all source documents that are used by the suspicious document
    - Text Alignment: pair-wise comparison between each source and the suspicious document
      - Takes obfuscation into account
    - Post-Processing: filter the identified similar pairs for later visual inspection

6 SOURCE RETRIEVAL PROCESS
  - Chunking [2]
    - Dividing up a suspicious document
    - No overlap
  - Keyphrase Extraction [2]
    - Extracting keywords from a chunk
    - Most important step
      - Too many keywords: too many queries, increasing cost
      - Too few keywords: might yield low performance
    - Techniques:
      - n-gram-based searching
      - Term frequency-inverse document frequency (tf-idf)
      - Schemes from the research community
      - Combinations of the above
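
Chunking and tf-idf keyphrase extraction can be illustrated with a minimal sketch. It is not the method of any PAN participant: the function names `chunk_words` and `tfidf_keywords`, the word-based chunk size, and the choice to treat each chunk as a "document" for the idf term are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def chunk_words(text, size):
    """Split text into consecutive, non-overlapping chunks of `size` words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def tfidf_keywords(chunks, top_k=3):
    """Return the top_k highest-scoring tf-idf terms for each chunk.

    Assumption: the chunks of the suspicious document act as the
    collection over which document frequency is counted.
    """
    n_chunks = len(chunks)
    # Document frequency: number of chunks that contain each term.
    df = Counter(term for chunk in chunks for term in set(chunk))
    keywords = []
    for chunk in chunks:
        tf = Counter(chunk)
        scores = {
            term: (count / len(chunk)) * math.log(n_chunks / df[term])
            for term, count in tf.items()
        }
        # Sort by score (descending), then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords


text = (
    "the cat sat on the mat the cat purred "
    "the dog ran in the park the dog barked"
)
chunks = chunk_words(text, 9)
print(tfidf_keywords(chunks, top_k=2))  # [['cat', 'mat'], ['dog', 'barked']]
```

Raising `top_k` produces more keywords per chunk and therefore more queries, which is the cost trade-off noted on the slide.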

7 SOURCE RETRIEVAL PROCESS - CONTINUED
  - Query Formulation [2]
    - Non-overlapping and overlapping queries
  - Search Control [2]
    - Dynamically adjusting the submission of queries to the search engine based on search results
  - Download Filtering [2]
    - Removing non-relevant results from the search
    - Reducing time/cost for the following comparison step
    - Techniques: keeping the top 10 results, using a classifier, n-gram-based searching, etc.
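
The difference between non-overlapping and overlapping queries, and the simplest download filter (keeping the top-ranked results), can be sketched as follows. The helper names `formulate_queries` and `keep_top_results` and their parameters are hypothetical; no real search engine API is called here.

```python
def formulate_queries(keywords, size, overlap=0):
    """Group keywords into queries of up to `size` terms.

    overlap=0 gives non-overlapping queries; overlap>0 makes
    consecutive queries share that many terms.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    queries = []
    for start in range(0, len(keywords), step):
        queries.append(" ".join(keywords[start:start + size]))
        if start + size >= len(keywords):
            break  # the last keyword has been covered
    return queries


def keep_top_results(results, limit=10):
    """Download filtering: keep only the first `limit` ranked results."""
    return results[:limit]


keywords = ["plagiarism", "detection", "source", "retrieval", "alignment"]
print(formulate_queries(keywords, size=2))
# ['plagiarism detection', 'source retrieval', 'alignment']
print(formulate_queries(keywords, size=3, overlap=1))
# ['plagiarism detection source', 'source retrieval alignment']
```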

8 TEXT ALIGNMENT ALGORITHMS
  - Seeding [2]
    - Identifying matches between two documents
    - Exact matches or created matches
    - As many reasonable seeds as possible
    - Techniques: n-gram-based matching, fingerprinting, similarity thresholds, etc.
  - Extension [2]
    - Merging seeds into text passages for human detection
    - Techniques:
      - Rule-based approach
        - Combining seeds based on criteria
      - Dynamic programming
        - Using bioinformatics algorithms, applied to pairs of texts instead
      - Clustering-based approach
  - Filtering [2]
    - Removing passages based on criteria: overlapping, too short, etc.
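
Seeding by exact word n-gram matches and a rule-based extension step can be sketched as below. This is a simplified illustration, not an algorithm taken from [2]: the merge rule (join a seed to the previous passage when it starts within `max_gap` words of that passage's end in both documents) and all function names are assumptions.

```python
def word_ngrams(words, n):
    """Map each word n-gram to the list of positions where it starts."""
    index = {}
    for i in range(len(words) - n + 1):
        index.setdefault(tuple(words[i:i + n]), []).append(i)
    return index


def find_seeds(suspicious, source, n=3):
    """Seeding: exact n-gram matches as (suspicious_pos, source_pos) pairs."""
    source_index = word_ngrams(source, n)
    seeds = []
    for i in range(len(suspicious) - n + 1):
        for j in source_index.get(tuple(suspicious[i:i + n]), []):
            seeds.append((i, j))
    return sorted(seeds)


def extend_seeds(seeds, n=3, max_gap=2):
    """Extension: merge seeds that lie close together in both documents.

    Returns passages as ((susp_start, susp_end), (src_start, src_end)),
    with end positions exclusive.
    """
    passages = []
    for i, j in seeds:
        if passages:
            (s_start, s_end), (r_start, r_end) = passages[-1]
            if i <= s_end + max_gap and r_start <= j <= r_end + max_gap:
                passages[-1] = ((s_start, max(s_end, i + n)),
                                (r_start, max(r_end, j + n)))
                continue
        passages.append(((i, i + n), (j, j + n)))
    return passages


suspicious = "we show that the quick brown fox jumps over a lazy dog today".split()
source = "the quick brown fox jumps over the lazy dog".split()

seeds = find_seeds(suspicious, source, n=2)
print(extend_seeds(seeds, n=2, max_gap=2))  # [((3, 12), (0, 9))]
```

In this example the swapped word ("a" for "the") breaks the exact matches, but the gap rule still merges the seeds on both sides into one passage. A filtering step would then drop passages that are too short or that overlap.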

9 INTRINSIC PLAGIARISM DETECTION
  - Vector Space Model [3]
    - Vectorization
      - Created a normalized vector for each sentence for each stylometric feature
        - Average word frequency, punctuation, pronouns, etc.
      - Concatenated the vectors into a single vector and normalized again
      - Computed a mean vector
    - Outlier Detection
      - Calculated cosine similarity between the mean vector and each individual sentence vector
    - Post-Processing
      - Constructed text passages based on the deviating sentences
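
The mean-vector and cosine-similarity idea can be sketched with a toy example. This is not the feature set or threshold used in [3]: the three features (comma count, pronoun count, count of words with eight or more letters), the single normalization step, and the 0.5 threshold are assumptions chosen only to keep the sketch short.

```python
import math
import re

PRONOUNS = {"i", "we", "you", "he", "she", "it", "they"}


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if not norm_a or not norm_b:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def sentence_vector(sentence):
    """Toy stylometric vector: commas, pronouns, long words (normalized)."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return normalize([
        sentence.count(","),
        sum(w.lower() in PRONOUNS for w in words),
        sum(len(w) >= 8 for w in words),
    ])


def deviating_sentences(sentences, threshold=0.5):
    """Return indices of sentences whose vector is far from the mean vector."""
    vectors = [sentence_vector(s) for s in sentences]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    return [i for i, v in enumerate(vectors) if cosine(mean, v) < threshold]


sentences = [
    "We ran, we hid, and we won.",
    "They came, they saw, and they left.",
    "You sing, you dance, and you laugh.",
    "Notwithstanding considerable methodological heterogeneity persists.",
]
print(deviating_sentences(sentences))  # [3]
```

The flagged indices would then be grouped into passages in a post-processing step.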

10 INTRINSIC PLAGIARISM DETECTION – CONT.
  - Character n-gram Profiles [4]
    - Atomization: chunking for later analysis and comparison
      - By paragraph and by n characters (n = 5000)
    - Feature Extraction
      - Assigning numerical values to text to measure complexity
      - E.g. average sentence length, punctuation percentage, derived features, etc.
    - Feature Combination
      - Identifying sets of features that yield good results
      - Techniques: neural networks, genetic programming, etc.
    - Classification
      - k-means clustering: 'plagiarized' and 'non-plagiarized' clusters
      - Outlier detection
      - Hidden Markov modeling
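
Atomization, feature extraction, and a two-cluster k-means step can be sketched as follows. This is an illustrative sketch rather than the implementation in [4]: the two features, the deterministic initialization of the cluster centers, and the function names are assumptions, and which cluster counts as 'plagiarized' is left to the reviewer.

```python
import re


def atomize(text, n_chars):
    """Atomization: split text into consecutive chunks of n_chars characters."""
    return [text[i:i + n_chars] for i in range(0, len(text), n_chars)]


def extract_features(chunk):
    """Return (average sentence length in words, punctuation percentage)."""
    sentences = [s for s in re.split(r"[.!?]+", chunk) if s.strip()]
    words = re.findall(r"[A-Za-z']+", chunk)
    punctuation = sum(ch in ".,;:!?" for ch in chunk)
    avg_sentence_len = len(words) / max(len(sentences), 1)
    punct_pct = 100.0 * punctuation / max(len(chunk), 1)
    return (avg_sentence_len, punct_pct)


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def two_means(points, iterations=10):
    """Tiny k-means with k=2; returns a cluster label (0 or 1) per point."""
    # Assumption: start from the smallest and largest point so the
    # result is deterministic; real k-means usually starts randomly.
    centers = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            0 if squared_distance(p, centers[0]) <= squared_distance(p, centers[1]) else 1
            for p in points
        ]
        for c in (0, 1):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


print(atomize("abcdefg", 3))                          # ['abc', 'def', 'g']
print(extract_features("One two three. Four five!"))  # (2.5, 8.0)

# Hand-made feature vectors for four chunks; the last one differs in style.
points = [(5.0, 4.0), (6.0, 5.0), (5.5, 4.5), (20.0, 1.0)]
print(two_means(points))  # [0, 0, 0, 1]
```

With real text, `atomize(text, 5000)` would correspond to the n = 5000 setting on the slide, and each chunk's feature tuple would be one point passed to the clustering step.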

11 DISCUSSIONS
  - External Plagiarism Detection:
    - High detection rate
    - High cost against a large reference collection
    - Challenges: obfuscation, cross-language cases
  - Intrinsic Plagiarism Detection:
    - More similar to human detection
    - Would work in cases of obfuscation and across languages
    - Low detection rate
    - Harder problem due to the lack of a reference collection
  - Most research focuses on external plagiarism detection

12 CONCLUSION
  - Human Plagiarism Detection:
    - Most accurate, required for final detection
    - Needs support from software:
      - Large numbers of suspicious documents
      - Large reference collections
  - Combination of intrinsic and external plagiarism detection:
    - Intrinsic first, then external [3]
    - Combine the results [4]
  - Cross-Language Approach: [5]
    - Translate to one language, then use external detection
  - Non-text plagiarism detection

13 REFERENCES
  [1] What is Plagiarism?, Plagiarism.org, [online] 2014, http://www.plagiarism.org/plagiarism-101/what-is-plagiarism/ (Accessed: 12 July 2015).
  [2] M. Potthast, M. Hagen, T. Gollub, M. Tippmann, J. Kiesel, P. Rosso, et al., "Overview of the 6th international competition on plagiarism detection," in CLEF Conference on Multilingual and Multimodal Information Access Evaluation, 2014.
  [3] M. Zechner, M. Muhr, R. Kern, and M. Granitzer, "External and intrinsic plagiarism detection using vector space models," in Proc. SEPLN, 2009, pp. 47-55.
  [4] N. Carnahan, M. Huderle, N. Jones, C. Stephan, T. Tran, and Z. Wood-Doughty, "Plagiarism Detection," 2014.
  [5] A. Barrón-Cedeno, P. Rosso, E. Agirre, and G. Labaka, "Plagiarism detection across distant language pairs," in Proceedings of the 23rd International Conference on Computational Linguistics, 2010, pp. 37-45.

