1
Document Similarities
Anand Bahety, Cody Dunne
2
Project Idea
Find similar segments of documents
3
Project Idea (cont…)
Inexact matching
– Local alignment (Smith-Waterman, BLAST)
– Based on character
  Meaningless to score character differences
– Based on word
  Need a good scoring function
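To make the word-based variant concrete, here is a minimal sketch of Smith-Waterman local alignment over word lists rather than characters. The function names, the linear gap penalty of -1, and the `exact_match` scorer (2 for identical words, -1 otherwise) are assumptions chosen for illustration; the slides do not specify them. The scorer is passed in as a parameter so that a better scoring function can be substituted.

```python
def smith_waterman(a, b, score, gap=-1):
    """Best local alignment of word lists a and b.

    Returns (best_score, aligned_a, aligned_b); "-" marks a gap.
    `score(x, y)` and `gap` are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two words
                H[i - 1][j] + gap,                            # skip a word of a
                H[i][j - 1] + gap,                            # skip a word of b
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score reaches 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]


def exact_match(x, y):
    """Placeholder scorer: rewards identical words only (case-insensitive)."""
    return 2 if x.lower() == y.lower() else -1


if __name__ == "__main__":
    s1 = "the quick brown fox jumps".split()
    s2 = "a quick brown dog jumps".split()
    print(smith_waterman(s1, s2, exact_match))
```

With the placeholder scorer, "fox" against "dog" is penalized like any other mismatch, which is why the approach needs a scoring function that knows how words relate.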
4
Project Idea (cont…)
Scoring function based on word relationships
– Part of speech
  Noun -> pronoun (ok)
  Noun -> verb (worse)
– Synonyms: positive score
– Antonyms: negative score
– Network of word relationships
  WordNet: publicly available English lexical database
– Gaps
  Different numbers of adjectives/adverbs
  Prepositions, pronouns
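A rough sketch of how such a scoring function might look is given below. The tiny hand-written tables are hypothetical stand-ins for WordNet lookups, and every numeric value is an assumption; only the ordering follows the slide (synonyms positive, antonyms negative, noun -> pronoun better than noun -> verb). Reading the "Gaps" bullet as "skipping adjectives, adverbs, prepositions, and pronouns should be cheap" is likewise an interpretation, not something the slides state.

```python
# Hypothetical toy lexicon standing in for WordNet queries (all entries assumed).
SYNONYMS = {frozenset(("great", "fantastic")), frozenset(("big", "large"))}
ANTONYMS = {frozenset(("big", "small")), frozenset(("famous", "unknown"))}
POS = {
    "wall": "noun", "china": "noun", "it": "pronoun", "run": "verb",
    "great": "adjective", "fantastic": "adjective", "big": "adjective",
    "large": "adjective", "small": "adjective", "very": "adverb",
    "of": "preposition", "by": "preposition",
}
# Part-of-speech substitutions: noun <-> pronoun is tolerated, noun <-> verb is not.
POS_PAIR_SCORE = {
    frozenset(("noun", "pronoun")): 0,
    frozenset(("noun", "verb")): -2,
}
# Word classes whose presence or absence should barely affect the alignment.
CHEAP_GAP_POS = {"adjective", "adverb", "preposition", "pronoun"}


def word_score(x, y):
    """Score aligning word x with word y (values are illustrative assumptions)."""
    x, y = x.lower(), y.lower()
    if x == y:
        return 3
    pair = frozenset((x, y))
    if pair in SYNONYMS:
        return 2
    if pair in ANTONYMS:
        return -3
    px, py = POS.get(x), POS.get(y)
    if px is not None and py is not None:
        if px == py:
            return -1  # same word class but unrelated meaning
        return POS_PAIR_SCORE.get(frozenset((px, py)), -2)
    return -2  # unknown word: treat as a plain mismatch


def gap_score(word):
    """Penalty for leaving `word` unaligned; free for modifier-like classes."""
    return 0 if POS.get(word.lower()) in CHEAP_GAP_POS else -1


if __name__ == "__main__":
    print(word_score("Great", "Fantastic"), word_score("big", "small"))
    print(word_score("wall", "it"), word_score("wall", "run"))
    print(gap_score("very"), gap_score("wall"))
```

In a real system the lookup tables would be replaced by queries against WordNet, and a per-word gap score like this would require the alignment recurrence to take the gap penalty from the skipped word instead of using a constant.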
5
Related Work
Document versioning (Versioning Machine, etc…)
Detecting plagiarism (Bagdis, etc…)
6
Potential Pitfalls
False positives
  The Great Wall of China is very famous.
  The Fantastic Wall by XYZ is very famous.
– Pick correct word meanings
False negatives
– Database isn’t perfect/complete
Incomplete scoring function
– Only examines particular types of words
– Depends on order
Limited to English
– EuroWordNet