Published by Ursula Lewis. Modified over 9 years ago.
1
Thesaurus-Based Index Term Extraction Olena Medelyan Digital Library Laboratory
2
Index Terms vs. Keyphrases
Both describe the topics in a document.
Example document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"
Index terms: controlled vocabulary (e.g. predatory birds, damage, aquaculture)
Keyphrases: freely chosen (e.g. techniques, bird predation, aquaculture)
Purposes:
– Organize the library's holdings
– Provide thematic access to documents
– Represent documents as a brief summary
– Aid navigation in search results
Manual assignment: expensive, time-consuming
3
Extraction vs. Assignment
Extraction: select significant n-grams or NPs according to their characteristics
- Restriction to syntax
- Bad quality phrases
- No consistency
+ Easy and fast implementation
+ Not much training required
Assignment: classify documents according to their content words into classes (labels = keyphrases)
- Needs large corpora
- Long computational time
- Not practical
+ Word co-occurrence
+ High accuracy
4
KEA++
Combines extraction with controlled vocabulary
Considers semantic relations
Controlled vocabulary = thesaurus
Experiment:
– agricultural documents (www.fao.org/docrep)
– Agrovoc thesaurus (www.fao.org/agrovoc)
5
How Does It Work?
1. Extract n-grams, transform them to pseudo-phrases, map to pseudo-phrases of the thesaurus' descriptors (slide example: bird predation, predat bird)
2. Each document = set of candidate phrases
3. Training (documents + manually assigned phrases)
a. Compute the features
b. Compute the model
4. Testing (new documents, no phrases)
a. Compute the features
b. Compute probabilities according to the model
5. Classification model: Naïve Bayes
6
Features
TF×IDF – phrases that are specific for a given document are significant
First Position – phrases that are in the beginning (or the end) of the document are significant
Phrase Length – phrases with a certain number of words are significant (2!)
Node Degree – phrases that are related to the most other phrases in the document are significant
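The four features could be computed along these lines. This is a sketch under assumptions: the natural-log IDF is one common variant and may differ from the one KEA++ uses, and the `related` dictionary holds hypothetical thesaurus links made up for illustration, not real Agrovoc relations.

```python
import math


def tf_idf(count_in_doc, doc_length, docs_with_phrase, total_docs):
    """Term frequency in this document times inverse document frequency."""
    tf = count_in_doc / doc_length
    idf = math.log(total_docs / docs_with_phrase)
    return tf * idf


def first_position(first_word_index, doc_length):
    """Relative position of the first occurrence: 0.0 is the very beginning."""
    return first_word_index / doc_length


def phrase_length(phrase):
    """Number of words in the phrase."""
    return len(phrase.split())


def node_degree(term, doc_candidates, related):
    """Count the other candidates in the document that the thesaurus links to term."""
    neighbours = related.get(term, set())
    return sum(1 for other in doc_candidates if other != term and other in neighbours)


# Hypothetical thesaurus links, for illustration only.
related = {"aquaculture": {"fish culture", "fisheries"}}
doc_candidates = {"aquaculture", "fish culture", "birds"}

print(node_degree("aquaculture", doc_candidates, related))
print(phrase_length("predatory birds"))
```

Each candidate phrase then becomes one feature vector, which is what the Naïve Bayes model is trained and tested on.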
7
Example
[Figure: Agrovoc terms assigned to the example document by six indexers and by KEA++, linked by Agrovoc relations. Legend: Indexers 1 2 3 4 5 6; Agrovoc relation; KEA++.]
Terms in the figure: fisheries fish culture aquaculture fish ponds aquaculture techniques bird control predatory birds noxious birds scares pest control control methods monitoring methods equipment protective structures electrical installation fencing damage noise north america techniques fishery production predation predators birds ropes fishing operations
8
Evaluation I
Standard evaluation:
– Number of exact matches in the test set
– Precision, Recall, F-measure
Problem:
– Semantic similarity is not considered
– Comparison only to one indexer, although indexing is subjective
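For reference, the standard exact-match measures can be sketched as below. The two term sets are made up for illustration and are not results from the experiment; returning 0.0 for empty sets is an assumption.

```python
def evaluate(extracted, gold):
    """Exact-match precision, recall and F-measure for one document."""
    extracted, gold = set(extracted), set(gold)
    matches = len(extracted & gold)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative sets only: "birds" vs. "bird control" counts as a miss,
# which is exactly the problem named on the slide.
extracted = {"aquaculture", "birds", "ropes", "fencing"}
gold = {"aquaculture", "bird control", "fencing"}
print(evaluate(extracted, gold))
```

Semantically close pairs such as "birds" and "bird control" contribute nothing here, which is why exact matching undersells a thesaurus-based system.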
9
Evaluation II
Inter-indexer consistency, e.g. Rolling's measure:
Rolling's IIC = 2C / (A + B)
C – number of phrases in common
A – number of phrases in the first set
B – number of phrases in the second set

Indexer   vs. other indexers   vs. KEA   vs. KEA++
1         42                   7         29
2         39                   8         28
3         37                   9         26
4         37                   6         31
5         37                   6         25
6         36                   4         20
avg       38                   7         27

KEA++ vs. average inter-indexer consistency: -11%
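Rolling's measure is simple enough to state in code. The sketch below uses made-up term sets, not the experiment's data; treating two empty sets as 0.0 and the helper name `mean_consistency` are assumptions.

```python
def rolling_consistency(first, second):
    """Rolling's inter-indexer consistency: 2C / (A + B)."""
    first, second = set(first), set(second)
    total = len(first) + len(second)
    if total == 0:
        return 0.0  # assumption: two empty sets count as no agreement
    common = len(first & second)
    return 2 * common / total


def mean_consistency(term_set, other_sets):
    """Average consistency of one term set against several others."""
    scores = [rolling_consistency(term_set, other) for other in other_sets]
    return sum(scores) / len(scores)


# Illustrative sets only: C = 2, A = 3, B = 5, so IIC = 4 / 8.
indexer = {"aquaculture", "fencing", "scares"}
system = {"aquaculture", "fencing", "ropes", "birds", "damage"}
print(rolling_consistency(indexer, system))  # 0.5
```

Averaging a system's score against every indexer puts it on the same scale as the indexers' agreement with each other.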
10
Results
Document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"

            Indexer                          KEA++
Exact       aquaculture                      aquaculture, damage, fencing, scares, noise*
Similar     bird control                     birds
            predatory birds                  predators
            fish culture                     fishing operations, fishery production
No match    noxious birds, control methods   ropes

*Selected by only one indexer
11
Problems & Future Work
Trivial problems (e.g. stemming errors)
Document chunking
– What are important and disturbing parts of the document?
Topic coverage
– Exploring the thesaurus' structure
– Lexical chains
Term occurrence
– Including other NLP resources (e.g. WordNet)
Multi-linguality, other domains