1
A New Approach for Cross-Language Plagiarism Analysis
Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante
Universidade Federal do Rio Grande do Sul
CLEF 2010
2
Outline
Introduction
Related Work
The Proposed Approach
Experiments
Summary and Future Work
3
Introduction
Plagiarism is one of the most serious forms of academic misconduct
It is defined as “the use of another person's written work without acknowledging the source”
A study with over 80,000 students in the US and Canada found that many of them have already committed a plagiarism offense
  –36% of undergraduate students
  –24% of graduate students
Several types
  –Word-for-word, paraphrasing, text translation, etc.
4
Introduction
Cross-language plagiarism is becoming more common
  –Evolution of automatic translation systems
  –Increasing availability of textual content in many different languages
Common scenario
  –A student downloads a paper, translates it using an automatic translation tool, corrects some translation errors, and presents it as his own work
It can also involve self-plagiarism
  –Usually aims at increasing the number of publications
5
Introduction
What is the task?
  –Detect the plagiarized passages in the suspicious documents and their corresponding text fragments in the source documents, even if the documents are written in different languages
Known as external plagiarism analysis
7
Related Work
Monolingual Plagiarism Analysis
  –Fingerprints, fuzzy-fingerprints, ...
Cross-Language Plagiarism Analysis
  –Statistical bilingual dictionary + bilingual text alignment
  –Use EuroWordNet to transform words into a language-independent representation
PAN competition
  –Enables different methods to be compared against each other
  –It was held as an evaluation lab in conjunction with CLEF 2010
9
The Proposed Approach
[Diagram: overview of the five-stage pipeline]
(1) Language Normalization: Suspicious Documents and Original Documents become Norm. Susp. Documents and Norm. Orig. Documents
(2) Retrieval: for each Suspicious Document, the Index is queried to obtain Candidate Documents
(3) Feature Selection + Classifier Training: the Training Corpus yields a Classification Model
(4) Plagiarism Analysis: produces a Preliminary Result
(5) Post-Processing: produces the Final Result
10
(1) Language Normalization
All documents are converted into a common language
English was chosen
  –More translation resources
  –One of the easiest languages to translate into
Used a language guesser and an automatic translation tool
12
(2) Retrieval of Candidate Documents
Problem: It is not feasible to perform exhaustive comparisons
Solution: Use passages from the suspicious document as queries to be sent to an IR system
Note that documents are divided into subdocuments (paragraphs) in order to reduce the amount of text that must be analyzed
At the end of this phase, we have a list of at most ten candidate subdocuments for each passage in the suspicious document
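The retrieval phase could be sketched roughly as follows. This is only an illustration: the talk uses the Terrier IR system, whereas the code below substitutes a toy TF-IDF overlap score in pure Python. All function names, the paragraph-splitting rule, and the scoring formula are assumptions; only the paragraph-level subdocuments and the limit of ten candidates come from the slide.

```python
import math
import re
from collections import Counter

TOP_K = 10  # the slide states at most ten candidate subdocuments per passage


def tokenize(text):
    """Lowercase and keep alphabetic tokens only (assumed preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def split_into_subdocuments(doc_id, text):
    """Split a source document into paragraph-level subdocuments."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [((doc_id, i), p) for i, p in enumerate(paragraphs)]


def build_index(subdocs):
    """Return per-subdocument term counts and inverse document frequencies."""
    counts = {sid: Counter(tokenize(text)) for sid, text in subdocs}
    df = Counter(term for c in counts.values() for term in c)
    n = len(counts)
    idf = {term: math.log(1 + n / d) for term, d in df.items()}
    return counts, idf


def retrieve_candidates(passage, counts, idf, k=TOP_K):
    """Rank subdocuments by a simple TF-IDF overlap score and keep the top k."""
    query = Counter(tokenize(passage))
    scored = []
    for sid, c in counts.items():
        score = sum(qtf * c[t] * idf[t] ** 2 for t, qtf in query.items() if t in c)
        if score > 0:
            scored.append((score, sid))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(sid, score) for score, sid in scored[:k]]
```

Each suspicious passage is used as a query, so the expensive comparison in the later phases only touches the few subdocuments returned here instead of the whole collection.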
14
(3) Feature Selection and Classifier Training
The goal is to build a classification model that can learn how to distinguish between a plagiarized and a non-plagiarized text passage
Annotated synthetic examples used for training
J48 classification algorithm
Features
  –The cosine similarity between the suspicious passage and the candidate subdocument
  –The similarity score assigned by the IR system
  –The position of the candidate subdocument in the rank generated
  –The length (in characters) of the suspicious passage and the candidate subdocument
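A minimal sketch of how one training or test instance might be assembled from the features listed above. The slide does not say how terms are weighted for the cosine similarity, so raw term frequencies are assumed here; the function and key names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector with raw term frequencies (an assumed weighting)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between the term vectors of two texts."""
    va, vb = term_vector(a), term_vector(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def build_features(passage, candidate, ir_score, rank):
    """Collect the features named on the slide for one passage/candidate pair."""
    return {
        "cosine": cosine_similarity(passage, candidate),
        "ir_score": ir_score,             # score assigned by the IR system
        "rank": rank,                     # position in the ranked candidate list
        "passage_len": len(passage),      # length in characters
        "candidate_len": len(candidate),  # length in characters
    }
```

For training, each such feature dictionary would be paired with a plagiarized or non-plagiarized label taken from the annotated synthetic examples.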
16
(4) Plagiarism Analysis
Submit the test instances to the trained classifier and let it decide whether the suspicious passage is, in fact, plagiarized from one of the candidate subdocuments
[Diagram: each passage of the suspicious document (Passage 1, Passage 2, ..., Passage 5) is sent to index retrieval; the retrieved subdocuments (SubDoc 1, SubDoc 2, ..., SubDoc 10) are passed to the classifier, which outputs the class label Plagiarized or Non-Plagiarized]
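The decision loop of this phase might look like the sketch below. The trained J48 model is replaced by a hypothetical threshold rule so the example stays self-contained. The threshold value, the function names, and the choice to stop at the first candidate labeled as plagiarized are all assumptions rather than details given in the talk.

```python
def threshold_classifier(features, min_cosine=0.5):
    """Hypothetical stand-in for the trained model (the talk uses Weka's J48)."""
    return features["cosine"] >= min_cosine


def analyze_passage(candidates, classify=threshold_classifier):
    """Return the id of the first candidate labeled plagiarized, or None.

    `candidates` is a rank-ordered list of (subdoc_id, features) pairs
    for a single suspicious passage.
    """
    for subdoc_id, features in candidates:
        if classify(features):
            return subdoc_id
    return None
```

Passing the classifier in as a parameter keeps the loop independent of the model, so a real decision tree could be swapped in without changing the surrounding code.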
18
(5) Result Post-Processing
Join the contiguous plagiarized passages detected by the method in order to decrease the granularity score
The granularity score is a measure that assesses whether the plagiarism detection method reports a plagiarized passage as a whole or as several small plagiarized passages
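Joining contiguous detections can be sketched as a simple interval merge. The representation of a detection as `(start, end, source_id)` with character offsets in the suspicious document, and the `max_gap` parameter, are assumptions made for illustration. The slide only states that contiguous plagiarized passages are joined, and this simplified version does not check contiguity on the source side.

```python
def merge_contiguous(detections, max_gap=0):
    """Merge detections that point to the same source and touch or overlap.

    Each detection is (start, end, source_id), with character offsets in the
    suspicious document and an exclusive end offset.
    """
    merged = []
    for start, end, source in sorted(detections, key=lambda d: (d[2], d[0])):
        if merged and merged[-1][2] == source and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), source)
        else:
            merged.append((start, end, source))
    return sorted(merged)
```

Reporting one long passage instead of two adjacent ones means a single plagiarism case is covered by a single detection, which is the situation the granularity score rewards.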
20
Experiments
Multilingual Test Collection
  –ECLaPA collection assembled from the Europarl Parallel Corpus (English, Portuguese and French)
  –An analogous monolingual corpus was also assembled
  –Available at http://www.inf.ufrgs.br/~viviane/eclapa.html
Terrier IR System (Porter Stemmer + Stop-Word Removal)
Weka (J48 classification algorithm)
Google Translator (as language guesser)
LEC Power Translator
Evaluation Measures (PAN competition)
21
Experiments - Results
Monolingual vs. Multilingual
Recall was the most affected measure
  –Loss of information due to the translation process
The multilingual run reached 86% of the overall score of the monolingual baseline

Measure        Monolingual  Multilingual  % of Monolingual
Recall         0.8648       0.6760        78.16%
Precision      0.5515       0.5118        92.80%
F-Measure      0.6735       0.5825        86.48%
Granularity    1.0000       1.0000        100%
Overall Score  0.6735       0.5825        86.48%
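The overall score equals the F-measure in both columns. That matches the PAN overall score, which divides the F-measure by log2(1 + granularity): with a granularity of 1.0 the divisor is exactly 1. The formula comes from the PAN evaluation framework and is not spelled out on the slides, so the sketch below assumes it applies here. It recomputes the scores from the rounded precision and recall values, so small deviations in the fourth decimal are expected.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def overall_score(precision, recall, granularity):
    """PAN-style overall score: F-measure penalized by granularity."""
    return f_measure(precision, recall) / math.log2(1 + granularity)


monolingual = overall_score(0.5515, 0.8648, 1.0)
multilingual = overall_score(0.5118, 0.6760, 1.0)
print(f"monolingual:  {monolingual:.4f}")
print(f"multilingual: {multilingual:.4f}")
print(f"ratio:        {multilingual / monolingual:.2%}")
```

A granularity above 1.0 would make the divisor larger than 1 and lower the overall score, which is why the post-processing step merges contiguous detections.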
22
Experiments - Results
Detailed analysis
The larger the passage, the easier the detection
Plagiarized passages detected
  –Monolingual 90% vs. 77% Multilingual

               Monolingual              Multilingual
           Short  Medium  Large     Short  Medium  Large
Detected   435    1289    239       242    1190    239
Total      607    1323    239       607    1323    239
%          71     97      100       39     90      100
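The aggregate detection rates can be recomputed from the per-length counts in the table. The short check below sums the detected and total passages for each setting. It yields roughly 90% for the monolingual run and 77% for the multilingual run, in line with the figures on the slide. The variable and function names are illustrative.

```python
# Counts taken from the table, ordered as (short, medium, large).
detected = {
    "monolingual": (435, 1289, 239),
    "multilingual": (242, 1190, 239),
}
total = (607, 1323, 239)


def detection_rate(counts, totals):
    """Fraction of plagiarized passages detected across all length classes."""
    return sum(counts) / sum(totals)


rates = {name: detection_rate(counts, total) for name, counts in detected.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

The gap comes almost entirely from short passages, where the multilingual run detects 242 of 607 compared with 435 of 607 in the monolingual run.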
24
Summary
We proposed and evaluated a new method for cross-language plagiarism analysis (CLPA)
  –Used a classification algorithm in order to decide whether a text passage is plagiarized or not
We assembled an artificial cross-language plagiarism test collection to evaluate the method
  –It is freely available
The cross-language experiment achieved 86% of the performance of the monolingual baseline
25
Future Work
Reduce the time spent on the analysis of each suspicious document
  –Analyze each suspicious passage on a different computer?
Test other features during the classifier training phase
Evaluate the method while detecting plagiarism between documents written in unrelated languages
  –English vs. Chinese/Japanese
  –Many plagiarism cases happen between these pairs of languages
Citation analysis
27
Questions?