Thesaurus-Based Index Term Extraction Olena Medelyan Digital Library Laboratory.


1 Thesaurus-Based Index Term Extraction Olena Medelyan Digital Library Laboratory

2 Index Terms vs. Keyphrases
Both describe the topics in a document.
Example document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"
Index terms: controlled vocabulary (e.g. predatory birds, damage, aquaculture)
Keyphrases: freely chosen (e.g. techniques, bird predation, aquaculture)
Purposes:
– Organize a library's holdings
– Provide thematic access to documents
– Represent documents as a brief summary
– Aid navigation in search results
Manual assignment: expensive, time-consuming

3 Extraction vs. Assignment
Extraction: select significant n-grams or NPs according to their characteristics
- Restricted to syntax
- Poor-quality phrases
- No consistency
+ Easy and fast to implement
+ Not much training required
Assignment: classify documents according to their content words into classes (labels = keyphrases)
- Needs large corpora
- Long computation time
- Not practical
+ Uses word co-occurrence
+ High accuracy

4 KEA++
Combines extraction with a controlled vocabulary
Considers semantic relations
Controlled vocabulary = thesaurus
Experiment:
– agricultural documents (www.fao.org/docrep)
– Agrovoc thesaurus (www.fao.org/agrovoc)

5 How Does It Work?
1. Extract n-grams, transform them into pseudo-phrases, and map them to the pseudo-phrases of the thesaurus descriptors
   bird predation → predat bird
2. Each document = set of candidate phrases
3. Training (documents + manually assigned phrases)
   a. Compute the features
   b. Compute the model
4. Testing (new documents, no phrases)
   a. Compute the features
   b. Compute probabilities according to the model
5. Classification model: Naïve Bayes
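Step 1 can be sketched roughly as follows. This is a minimal illustration, not the KEA++ implementation: the stopword list, the suffix-stripping `toy_stem`, and the function names are all hypothetical stand-ins, and sorting the stems alphabetically is an assumption about how word order is neutralized (the slide itself writes the normalized form as "predat bird").

```python
import re

# Hypothetical toy resources: stand-ins for a real stopword file and stemmer.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "of", "the", "to"}
SUFFIXES = ("ory", "ion", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(phrase):
    """Lowercase, drop stopwords, stem, and sort the remaining stems."""
    words = re.findall(r"[a-z]+", phrase.lower())
    stems = [toy_stem(w) for w in words if w not in STOPWORDS]
    return " ".join(sorted(stems))


def ngrams(text, max_len=3):
    """Yield every word n-gram of length 1..max_len."""
    words = re.findall(r"[a-z]+", text.lower())
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i : i + n])


def candidates(text, descriptors):
    """Map document n-grams onto thesaurus descriptors via pseudo-phrases."""
    index = {pseudo_phrase(d): d for d in descriptors}
    found = set()
    for gram in ngrams(text):
        key = pseudo_phrase(gram)
        if key in index:
            found.add(index[key])
    return found
```

With this sketch, the n-gram "bird predation" and the descriptor "predatory birds" share one normalized key, so the descriptor becomes a candidate for the document even though it never appears verbatim.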

6 Features
TF×IDF – phrases that are specific to a given document are significant
First Position – phrases that occur at the beginning (or the end) of the document are significant
Phrase Length – phrases with a certain number of words are significant (2!)
Node Degree – phrases that are related to the most other phrases in the document are significant
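The four features could be computed along these lines. This is a hedged sketch: the function names and argument layout are hypothetical, the TF×IDF form (relative frequency times the negative base-2 log of document frequency, without smoothing) is an assumption, and `related` stands for whatever thesaurus relation lookup is available.

```python
import math


def tf_idf(count_in_doc, doc_size, docs_with_phrase, num_docs):
    """Relative frequency in the document times inverse document frequency."""
    tf = count_in_doc / doc_size
    idf = -math.log2(docs_with_phrase / num_docs)
    return tf * idf


def first_position(first_word_index, doc_size):
    """Relative position of the first occurrence (0.0 = start of document)."""
    return first_word_index / doc_size


def phrase_length(phrase):
    """Number of words in the phrase."""
    return len(phrase.split())


def node_degree(phrase, doc_candidates, related):
    """Count the other candidates in this document that the thesaurus
    links to the phrase. `related` maps a term to its related terms."""
    neighbours = related.get(phrase, set())
    return len(neighbours & (set(doc_candidates) - {phrase}))
```

A candidate with many thesaurus neighbours among the document's other candidates gets a high node degree, which is how semantic relations enter the model.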

7 Example
[Figure: terms chosen for the example document by indexers 1 to 6 and by KEA++, with Agrovoc relations drawn between them]
Terms shown: fisheries, fish culture, aquaculture, fish ponds, aquaculture techniques, bird control, predatory birds, noxious birds, scares, pest control, control methods, monitoring, methods, equipment, protective structures, electrical installation, fencing, damage, noise, north america, techniques, fishery production, predation, predators, birds, ropes, fishing operations

8 Evaluation I
Standard evaluation:
– Number of exact matches in the test set
– Precision, Recall, F-measure
Problems:
– Semantic similarity is not considered
– Comparison to only one indexer, although indexing is subjective
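The standard exact-match evaluation can be sketched as below. The function name is hypothetical, and the term sets in the usage note are an illustrative selection from the slides, not the actual test data.

```python
def exact_match_scores(extracted, assigned):
    """Precision, recall and F-measure based on exact string matches."""
    extracted, assigned = set(extracted), set(assigned)
    matches = len(extracted & assigned)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(assigned) if assigned else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, comparing the extracted set {aquaculture, birds, ropes, fencing} with the assigned set {aquaculture, bird control, fencing, scares, damage} counts only two matches. "birds" and "bird control" score as a complete miss, which is the first problem named on the slide.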

9 Evaluation II
Inter-indexer consistency, e.g. Rolling's measure:

Rolling's IIC = 2C / (A + B)
C – number of phrases in common
A – number of phrases in the first set
B – number of phrases in the second set

Consistency (%):
Indexer   vs. other indexers   vs. KEA   vs. KEA++
1         42                   7         29
2         39                   8         28
3         37                   9         26
4         37                   6         31
5         37                   6         25
6         36                   4         20
avg       38                   7         27

KEA++ vs. the indexers' average: -11% (27 vs. 38)
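Rolling's measure is simple enough to sketch directly. The function names are hypothetical, and averaging one term set against every other set is an assumption about how the "vs. other indexers" column is obtained.

```python
def rolling_consistency(first, second):
    """Rolling's inter-indexer consistency: 2C / (A + B)."""
    first, second = set(first), set(second)
    total = len(first) + len(second)
    if total == 0:
        return 0.0
    return 2 * len(first & second) / total


def consistency_with_others(index, term_sets):
    """Average consistency of one term set against all the other sets."""
    others = [s for i, s in enumerate(term_sets) if i != index]
    scores = [rolling_consistency(term_sets[index], o) for o in others]
    return sum(scores) / len(scores)
```

Two sets of five phrases with two in common give 2·2 / (5 + 5) = 0.4. For a single pair of sets the value equals the F-measure of one set scored against the other.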

10 Results
Document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"

Match      Indexers                                       KEA++
Exact      aquaculture, damage, fencing, scares, noise*   aquaculture, damage, fencing, scares, noise
Similar    bird control                                   birds
           predatory birds                                predators
           fish culture                                   fishing operations, fishery production
No match   noxious birds, control methods                 ropes

*Selected by only one indexer

11 Problems & Future Work
Trivial problems (e.g. stemming errors)
Document chunking
– What are the important and the disturbing parts of the document?
Topic coverage
– Exploring the thesaurus structure
– Lexical chains
Term occurrence
– Including other NLP resources (e.g. WordNet)
Multi-linguality, other domains

