Thesaurus-Based Index Term Extraction Olena Medelyan Digital Library Laboratory
Index Terms vs. Keyphrases
Both describe the topics in a document
Example document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"
Index terms: taken from a controlled vocabulary (e.g. predatory birds, damage, aquaculture)
Keyphrases: freely chosen (e.g. techniques, bird predation, aquaculture)
Purposes:
–Organize a library's holdings
–Provide thematic access to documents
–Represent documents as a brief summary
–Aid navigation in search results
Manual assignment: expensive, time-consuming
Extraction vs. Assignment
Extraction: select significant n-grams or noun phrases (NPs) according to their characteristics
- Restriction to syntax
- Bad quality phrases
- No consistency
+ Easy and fast implementation
+ Not much training required
Assignment: classify documents according to their content words into classes (labels = keyphrases)
- Need large corpora
- Long computational time
- Not practical
+ Word co-occurrence
+ High accuracy
KEA++
Combines extraction with a controlled vocabulary
Considers semantic relations
Controlled vocabulary = thesaurus
Experiment:
–agricultural documents (
–Agrovoc thesaurus (
How Does It Work?
1. Extract n-grams, transform them to pseudo-phrases, and map them to the pseudo-phrases of the thesaurus descriptors (e.g. "bird predation" → "predat bird")
2. Each document = set of candidate phrases
3. Training (documents + manually assigned phrases)
a. Compute the features
b. Compute the model
4. Testing (new documents, no phrases)
a. Compute the features
b. Compute probabilities according to the model
5. Classification model: Naïve Bayes
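Step 1 can be sketched roughly as follows. This is a minimal illustration, not the KEA++ implementation: the stopword list, the suffix-stripping `toy_stem`, and the helper names are all hypothetical stand-ins (a real system would use a full stopword list and a proper stemmer). The sketch also assumes that stems are sorted alphabetically, so it yields "bird predat" rather than the word order shown on the slide; matching only requires that n-grams and descriptors are normalized the same way.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "of", "the", "to"}
SUFFIXES = ("ion", "ing", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(phrase):
    """Lowercase, drop stopwords, stem each word, and sort the stems."""
    words = re.findall(r"[a-z]+", phrase.lower())
    stems = [toy_stem(w) for w in words if w not in STOPWORDS]
    return " ".join(sorted(stems))


def ngrams(text, max_len=3):
    """Yield all word n-grams of length 1..max_len."""
    words = re.findall(r"[a-z]+", text.lower())
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def candidates(text, descriptors, max_len=3):
    """Return the descriptors whose pseudo-phrase matches some n-gram."""
    index = {pseudo_phrase(d): d for d in descriptors}
    found = set()
    for gram in ngrams(text, max_len):
        key = pseudo_phrase(gram)
        if key and key in index:  # skip n-grams made only of stopwords
            found.add(index[key])
    return found
```

With a hypothetical three-entry vocabulary, `candidates("Reducing bird predation at aquaculture facilities", ["predation of birds", "aquaculture", "fencing"])` returns the two descriptors whose pseudo-phrases occur in the text; "fencing" is not a candidate because nothing in the text maps to it.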
Features
TF×IDF – phrases that are specific to a given document are significant
First Position – phrases that occur at the beginning (or the end) of the document are significant
Phrase Length – phrases with a certain number of words are significant (especially 2)
Node Degree – phrases that are related to the most other phrases in the document are significant
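The four features could be computed along these lines. This is a sketch under stated assumptions: the TF×IDF form (relative frequency times negative log2 of the document-frequency ratio) and the normalization of first position by document length follow a common keyphrase-extraction formulation and are not confirmed by the slides; the function names and the `related` mapping (thesaurus links per phrase) are hypothetical.

```python
import math


def tf_idf(count_in_doc, doc_length, docs_with_phrase, total_docs):
    """Relative frequency in the document times inverse document frequency.

    Assumed formulation: (count / length) * -log2(df / N).
    """
    tf = count_in_doc / doc_length
    idf = -math.log2(docs_with_phrase / total_docs)
    return tf * idf


def first_position(first_word_index, doc_length):
    """Position of the first occurrence, as a fraction of the document."""
    return first_word_index / doc_length


def phrase_length(phrase):
    """Number of words in the phrase."""
    return len(phrase.split())


def node_degree(phrase, doc_candidates, related):
    """Number of thesaurus links from the phrase to other candidates
    of the same document. `related` maps a phrase to its linked phrases.
    """
    return sum(
        1
        for other in related.get(phrase, ())
        if other != phrase and other in doc_candidates
    )
```

For example, with the made-up relation table `{"predatory birds": {"noxious birds", "birds"}}` and the candidate set `{"predatory birds", "birds", "aquaculture"}`, "predatory birds" has node degree 1 and "aquaculture" has node degree 0.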
Example
[Diagram: terms assigned by the indexers and by KEA++, connected by Agrovoc relations. Legend: Indexers, Agrovoc relation, KEA++]
Terms shown: fisheries, fish culture, aquaculture, fish ponds, aquaculture techniques, bird control, predatory birds, noxious birds, scares, pest control, control methods, monitoring methods, equipment, protective structures, electrical installation, fencing, damage, noise, north america, techniques, fishery production, predation, predators, birds, ropes, fishing operations
Evaluation I
Standard evaluation:
–Number of exact matches in the test set
–Precision, Recall, F-measure
Problems:
–Semantic similarity is not considered
–Comparison with only one indexer, although indexing is subjective
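The standard evaluation can be illustrated with a short sketch. The function name `evaluate` and the example term sets are made up for illustration; precision, recall, and F-measure follow their usual definitions over exact matches.

```python
def evaluate(extracted, gold):
    """Return (precision, recall, F-measure) based on exact matches."""
    extracted, gold = set(extracted), set(gold)
    matches = len(extracted & gold)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative sets: "birds" and "predatory birds" are semantically close,
# but exact matching counts them as a miss.
p, r, f = evaluate(
    {"aquaculture", "damage", "birds", "ropes"},
    {"aquaculture", "damage", "predatory birds"},
)
```

Here only two of the four extracted terms match exactly, which shows the first problem named above: a near miss such as "birds" for "predatory birds" is scored the same as an unrelated term.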
Evaluation II
Inter-indexer consistency (IIC), e.g. Rolling's measure:
Rolling's IIC = 2C / (A + B)
C – number of phrases in common
A – number of phrases in the first set
B – number of phrases in the second set

Indexers   vs. other indexers   vs. KEA   vs. KEA++
avg        38                   7         27
Difference between KEA++ and other indexers: -11%
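Rolling's measure is simple to compute from two term sets. The sketch below assumes each indexer's output is a set of phrases; the function names, and the choice to return 0.0 when both sets are empty, are assumptions made for illustration.

```python
def rolling_iic(first, second):
    """Rolling's inter-indexer consistency: 2C / (A + B)."""
    first, second = set(first), set(second)
    if not first and not second:
        return 0.0  # assumption: consistency of two empty sets is 0
    common = len(first & second)
    return 2 * common / (len(first) + len(second))


def average_iic(candidate, others):
    """Mean consistency of one term set with each of the other sets."""
    if not others:
        return 0.0
    return sum(rolling_iic(candidate, o) for o in others) / len(others)
```

For two sets of 3 and 5 phrases with 2 in common, the measure is 2·2 / (3 + 5) = 0.5. For a pair of sets this is the same quantity as the F-measure of one set evaluated against the other.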
Results
Document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"

Match      Indexer                          KEA++
Exact      aquaculture, damage, fencing, scares, noise* (in both sets)
Similar    bird control                     birds
           predatory birds                  predators
           fish culture                     fishing operations, fishery production
No match   noxious birds, control methods   ropes
*Selected by only one indexer
Problems & Future Work
Trivial problems (e.g. stemming errors)
Document chunking
–Which parts of the document are important, and which are distracting?
Topic coverage
–Exploring the thesaurus' structure
–Lexical chains
Term occurrence
–Including other NLP resources (e.g. WordNet)
Multilinguality, other domains