Using copy-detection and text comparison algorithms for cross-referencing multiple editions of literary works
A. Zaslavsky, Alejandro Bia, K. Monostori
School of Computer Science & Software Engineering, Monash University, Australia & Miguel de Cervantes DL, University of Alicante, Alicante, Spain
European Conference on Digital Libraries, Darmstadt, 2001
Overview
- Copy-detection, plagiarism and comparative literary analysis
- Text processing in DLs and humanities research
- Tools and approaches
- MatchDetectReveal architecture
- Cervantes's Quijote DL & MDR
- Conclusion
Introduction
- Problems
  - Intellectual property
  - Plagiarism
  - Search results
- Copy-prevention
  - Special hardware
  - Active documents
- Copy-detection
  - Plagiarism.org
  - SCAM
  - Koala
  - sif
Copy-detection
- Digital watermarking
  - Codewords
  - Line-shift coding
  - Word-shift coding
  - Feature coding
- String comparison
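The string-comparison approach can be illustrated with a minimal sketch. It assumes fixed-size word chunks that are hashed into fingerprints and compared as sets; this is an illustrative stand-in, not necessarily the method used by any of the systems named in these slides. The function names and the chunk size `k` are hypothetical.

```python
import hashlib
import re


def chunks(text, k=3):
    """Split text into overlapping chunks of k consecutive words (assumed chunking)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprints(text, k=3):
    """Hash every chunk so that documents can be compared as sets of short digests."""
    return {
        hashlib.sha1(chunk.encode("utf-8")).hexdigest()[:16]
        for chunk in chunks(text, k)
    }


def overlap(candidate, registered, k=3):
    """Fraction of the candidate's chunks that also occur in the registered text."""
    a = fingerprints(candidate, k)
    b = fingerprints(registered, k)
    return len(a & b) / len(a) if a else 0.0
```

A value near 1.0 means almost every chunk of the candidate also appears in the registered document; smaller chunks give finer granularity at the cost of more accidental matches.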
Copy-Detection Architecture
- Registration Module
- Comparison Module
- Parsing Module
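How the three modules might fit together can be sketched as follows. This is a simplified illustration under stated assumptions: sentence-level chunks, an in-memory inverted index, and the hypothetical names `parse`, `Repository`, `register` and `compare`. It is not the architecture of any specific system.

```python
import re
from collections import defaultdict


def parse(text):
    """Parsing module: normalise the text and split it into sentence-level chunks."""
    sentences = re.split(r"[.!?]+", text.lower())
    return {
        " ".join(re.findall(r"\w+", s))
        for s in sentences
        if re.search(r"\w", s)
    }


class Repository:
    def __init__(self):
        # Maps each chunk to the ids of the registered documents containing it.
        self.index = defaultdict(set)

    def register(self, doc_id, text):
        """Registration module: parse a document and store its chunks."""
        for chunk in parse(text):
            self.index[chunk].add(doc_id)

    def compare(self, text):
        """Comparison module: share of the query's chunks found in each registered document."""
        query_chunks = parse(text)
        hits = defaultdict(int)
        for chunk in query_chunks:
            for doc_id in self.index.get(chunk, ()):
                hits[doc_id] += 1
        return {doc_id: n / len(query_chunks) for doc_id, n in hits.items()}
```

For example, after `repo.register("ed1", "In a village of La Mancha. There lived a gentleman.")`, calling `repo.compare("There lived a gentleman! He owned a lance.")` reports that half of the query's sentences occur in `ed1`.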
MatchDetectReveal (MDR)
[Architecture diagram. Labels: Internet; MDR users; MDR customizer; matching engine; format converter; search engine; visualiser; local repository; matching rule DB; indexes; similarity & overlap rule interpreter; IEEE DL; ACM DL; local cluster; global resources; Base Document Set Generator]
Example screen dump
Conclusion
- Comparative analysis of editions
- Cleaning up OCR output
- Performance
- Text ordering not necessary
- Fine granularity of overlap detection
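As an illustration of comparing two editions, the sketch below uses Python's standard `difflib.SequenceMatcher` at word level to list passages shared by two texts. This is a swapped-in stand-in, not the MDR matching engine; unlike the approach summarised above, it is sensitive to text ordering. The function name `shared_passages` and the `min_words` threshold are hypothetical.

```python
from difflib import SequenceMatcher


def shared_passages(edition_a, edition_b, min_words=3):
    """Return (position in A, position in B, text) for each shared run of words."""
    a = edition_a.split()
    b = edition_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        # Skip short runs and the zero-length sentinel block that ends the list.
        if block.size >= min_words:
            text = " ".join(a[block.a:block.a + block.size])
            passages.append((block.a, block.b, text))
    return passages
```

Applied to two spellings of the same opening line (the second is a made-up variant for illustration), "En un lugar de la Mancha de cuyo nombre no quiero acordarme" and "En vn lugar de la Mancha de cuyo nombre no quiero acordarme", the function returns a single passage starting at word 2 in both texts, leaving the differing word as the gap to inspect.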
Future Work
- Similar blocks of text
- XML output
- Rules for overlap & similarity
- Visualisation of results