Document Similarities
Anand Bahety, Cody Dunne
Project Idea
Find similar segments of documents
Project Idea (cont…)
Inexact matching
  – Local alignment (Smith-Waterman, BLAST)
  – Based on character
      Meaningless to score character differences
  – Based on word
      Need a good scoring function
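A minimal sketch of word-level local alignment in the Smith-Waterman style may make the idea concrete. It is an illustration, not the project's implementation: the names tokenize, exact_score and local_align, the match/mismatch values and the linear gap penalty are all assumptions. The score argument is the place where a better scoring function would plug in.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def exact_score(a, b):
    """Placeholder scoring (assumed values): +2 for identical words, -1 otherwise."""
    return 2 if a == b else -1


def local_align(xs, ys, score=exact_score, gap=-1):
    """Smith-Waterman over two word lists.

    Returns (best_score, pairs), where pairs lists the aligned words and
    None marks a gap on that side.
    """
    rows, cols = len(xs) + 1, len(ys) + 1
    # H[i][j] holds the best score of a local alignment ending at xs[i-1], ys[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,  # start a fresh local alignment here
                H[i - 1][j - 1] + score(xs[i - 1], ys[j - 1]),
                H[i - 1][j] + gap,  # word of xs aligned to a gap
                H[i][j - 1] + gap,  # word of ys aligned to a gap
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(xs[i - 1], ys[j - 1]):
            pairs.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            pairs.append((xs[i - 1], None))
            i -= 1
        else:
            pairs.append((None, ys[j - 1]))
            j -= 1
    pairs.reverse()
    return best, pairs
```

Calling local_align(tokenize(doc_a), tokenize(doc_b)) returns the highest-scoring pair of similar segments. With exact_score only identical words count as matches, which is why a scoring function based on word relationships is needed.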
Project Idea (cont…)
Scoring function based on word relationships
  – Part of speech
      Noun -> pronoun (ok)
      Noun -> verb (worse)
  – Synonyms: positive score
  – Antonyms: negative score
  – Network of word relationships
      WordNet: publicly available English lexical database
  – Gaps
      Different numbers of adjectives/adverbs
      Prepositions, pronouns
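The scoring rules above could be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the tiny hand-built lexicon stands in for WordNet lookups, the numeric values only preserve the relative order given on the slide, and treating adjectives, adverbs, prepositions and pronouns as cheap to skip is one possible reading of the "Gaps" bullet.

```python
# Toy lexicon standing in for WordNet; every entry is illustrative only.
SYNONYMS = {frozenset(("famous", "renowned")), frozenset(("big", "large"))}
ANTONYMS = {frozenset(("famous", "obscure")), frozenset(("big", "small"))}
POS = {
    "wall": "noun", "it": "pronoun", "run": "verb", "of": "prep",
    "famous": "adj", "renowned": "adj", "obscure": "adj",
    "big": "adj", "large": "adj", "small": "adj", "very": "adv",
}

# Assumed score values; the slides give only the relative order.
IDENTICAL, SYNONYM, ANTONYM = 3, 2, -3
NOUN_PRONOUN, SAME_POS, DIFFERENT_POS = 0, -1, -2

# Assumed gap penalties: skipping modifiers and function words costs less.
CHEAP_GAP_POS = {"adj", "adv", "prep", "pronoun"}
CHEAP_GAP, DEFAULT_GAP = -1, -2


def word_score(a, b):
    """Score aligning word a with word b using word relationships."""
    if a == b:
        return IDENTICAL
    pair = frozenset((a, b))
    if pair in SYNONYMS:
        return SYNONYM
    if pair in ANTONYMS:
        return ANTONYM
    pos_a, pos_b = POS.get(a), POS.get(b)
    if {pos_a, pos_b} == {"noun", "pronoun"}:
        return NOUN_PRONOUN  # noun -> pronoun is acceptable
    if pos_a is not None and pos_a == pos_b:
        return SAME_POS
    return DIFFERENT_POS  # e.g. noun -> verb, or unknown words


def gap_score(word):
    """Penalty for leaving this word unaligned."""
    return CHEAP_GAP if POS.get(word) in CHEAP_GAP_POS else DEFAULT_GAP
```

In a real system the dictionary lookups would be replaced by queries against WordNet, and the constants would need tuning against example documents.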
Related Work
Document versioning (Versioning Machine, etc…)
Detecting plagiarism (Bagdis, etc…)
Potential Pitfalls
False positives
    The Great Wall of China is very famous.
    The Fantastic Wall by XYZ is very famous.
  – Pick correct word meanings
False negatives
  – Database isn’t perfect/complete
Incomplete scoring function
  – Only examines particular types of words
  – Depends on order
Limited to English
  – EuroWordNet