Thesaurus-Based Index Term Extraction
Olena Medelyan
Digital Library Laboratory

Index Terms vs. Keyphrases

Both describe the topics in a document.
Example document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"

Index terms: controlled vocabulary (e.g. predatory birds, damage, aquaculture)
Keyphrases: freely chosen (e.g. techniques, bird predation, aquaculture)

Purposes:
– Organize a library's holdings
– Provide thematic access to documents
– Represent documents as a brief summary
– Aid navigation in search results

Manual assignment: expensive, time-consuming

Extraction vs. Assignment

Extraction: select significant n-grams or NPs according to their characteristics
- Restriction to syntax
- Bad quality phrases
- No consistency
+ Easy and fast implementation
+ Not much training required

Assignment: classify documents according to their content words into classes (labels = keyphrases)
- Need large corpora
- Long computational time
- Not practical
+ Word co-occurrence
+ High accuracy

KEA++

Combines extraction with a controlled vocabulary
Considers semantic relations
Controlled vocabulary = thesaurus
Experiment:
– Agricultural documents (www.fao.org/docrep)
– Agrovoc thesaurus (www.fao.org/agrovoc)

How Does It Work?

1. Extract n-grams, transform them to pseudo-phrases, and map them to the pseudo-phrases of the thesaurus' descriptors
   bird predation → predat bird
2. Each document = set of candidate phrases
3. Training (documents + manually assigned phrases)
   a. Compute the features
   b. Compute the model
4. Testing (new documents, no phrases)
   a. Compute the features
   b. Compute probabilities according to the model
5. Classification model: Naïve Bayes
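Steps 1 and 2 above can be sketched in a few lines of Python. This is a minimal illustration, not the KEA++ code: the stopword list, the suffix-stripping `toy_stem`, and all function names are assumptions made for the example, and a real system would use a proper stemmer. The sketch orders the stems alphabetically, so it yields "bird predat" where the slide writes "predat bird"; any fixed ordering serves the same purpose of making word-order variants collide.

```python
# Hypothetical stopword list and suffix list, chosen only for this example.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "of", "the", "to"}
SUFFIXES = ("ion", "ory", "ing", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(phrase):
    """Lowercase, drop stopwords, stem, and sort the remaining stems."""
    words = [w for w in phrase.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(toy_stem(w) for w in words))


def candidates(text, descriptors, max_len=3):
    """Return the thesaurus descriptors whose pseudo-phrase matches an n-gram."""
    index = {pseudo_phrase(d): d for d in descriptors}
    tokens = text.lower().split()
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            key = pseudo_phrase(" ".join(tokens[i:i + n]))
            if key and key in index:
                found.add(index[key])
    return found


if __name__ == "__main__":
    title = "Reducing bird predation at aquaculture facilities"
    print(candidates(title, ["predatory birds", "aquaculture", "fencing"]))
```

In this sketch "bird predation" and the descriptor "predatory birds" share one pseudo-phrase, so the free-text n-gram is mapped to the controlled term.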

Features

TF×IDF – phrases that are specific to a given document are significant
First Position – phrases that occur at the beginning (or the end) of the document are significant
Phrase Length – phrases with a certain number of words are significant (2!)
Node Degree – phrases that are related to the most other phrases in the document are significant
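The four features can be illustrated with a short Python sketch. The slide names the features but gives no formulas, so the definitions below are assumptions: TF×IDF uses one common formulation (relative frequency times negative log document frequency), first position is the relative offset of the first occurrence, and node degree counts related thesaurus terms that are also candidates in the same document. The function names and the `related` mapping are hypothetical.

```python
import math


def tf_idf(count_in_doc, doc_len, docs_with_phrase, n_docs):
    """Relative frequency in the document times -log2 of document frequency.

    Assumes docs_with_phrase >= 1 and doc_len >= 1.
    """
    tf = count_in_doc / doc_len
    idf = -math.log2(docs_with_phrase / n_docs)
    return tf * idf


def first_position(first_index, doc_len):
    """Offset of the first occurrence, as a fraction of the document length."""
    return first_index / doc_len


def phrase_length(phrase):
    """Number of words in the phrase."""
    return len(phrase.split())


def node_degree(phrase, doc_candidates, related):
    """Count thesaurus neighbours of `phrase` that are candidates in this document."""
    neighbours = related.get(phrase, ())
    return sum(1 for other in neighbours if other != phrase and other in doc_candidates)


if __name__ == "__main__":
    # Illustrative relations only; not taken from Agrovoc.
    related = {"predatory birds": {"birds", "predators", "noxious birds"}}
    doc_candidates = {"predatory birds", "birds", "predators", "aquaculture"}
    print(tf_idf(2, 100, 1, 4))
    print(node_degree("predatory birds", doc_candidates, related))
```

Each candidate phrase would then be represented by these four values before being passed to the classifier.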

Example

[Figure: terms assigned to the example document by six indexers (1–6) and by KEA++, connected by Agrovoc relations.]
Terms shown: fisheries, fish culture, aquaculture, fish ponds, aquaculture techniques, bird control, predatory birds, noxious birds, scares, pest control, control methods, monitoring, methods, equipment, protective structures, electrical installation, fencing, damage, noise, north america, techniques, fishery production, predation, predators, birds, ropes, fishing operations

Evaluation I

Standard evaluation:
– Number of exact matches in the test set
– Precision, Recall, F-measure
Problem:
– Semantic similarity is not considered
– Comparison to only one indexer, although indexing is subjective
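The standard evaluation can be written down as a small Python sketch. The function name and the two term sets are made up for illustration; only the usual definitions of precision, recall and F-measure over exact matches are assumed.

```python
def exact_match_scores(extracted, gold):
    """Precision, recall and F-measure over exactly matching terms."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical term sets, not the sets from the experiment.
    system = {"aquaculture", "fencing", "birds", "ropes"}
    indexer = {"aquaculture", "fencing", "bird control", "predatory birds", "scares"}
    print(exact_match_scores(system, indexer))
```

In this example "birds" earns no credit against "bird control" or "predatory birds", which is the weakness noted above: semantic similarity is not considered.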

Evaluation II

Inter-indexer consistency, e.g. Rolling's measure:

Rolling's IIC = 2C / (A + B)
C – number of phrases in common
A – number of phrases in the first set
B – number of phrases in the second set

Consistency (%):

Indexer   vs. other indexers   vs. KEA   vs. KEA++
1         42                   7         29
2         39                   8         28
3         37                   9         26
4         37                   6         31
5         37                   6         25
6         36                   4         20
avg       38                   7         27

KEA++ vs. other indexers, on average: -11%
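Rolling's measure translates directly into code. The sketch below is a minimal Python version of IIC = 2C / (A + B); the function names, the choice to return 0.0 when both sets are empty, and the example term sets are assumptions for illustration.

```python
def rolling_iic(first, second):
    """Rolling's inter-indexer consistency: 2C / (A + B)."""
    first, second = set(first), set(second)
    total = len(first) + len(second)
    if total == 0:
        return 0.0  # assumption: two empty sets count as no consistency
    common = len(first & second)
    return 2 * common / total


def mean_consistency(term_set, other_sets):
    """Average consistency of one term set against several others."""
    scores = [rolling_iic(term_set, other) for other in other_sets]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical term sets, not the sets from the experiment.
    system = {"aquaculture", "damage", "fencing"}
    indexer_a = {"aquaculture", "fencing", "ropes", "scares"}
    indexer_b = {"aquaculture", "damage", "fencing"}
    print(rolling_iic(system, indexer_a))
    print(mean_consistency(system, [indexer_a, indexer_b]))
```

Averaging against every indexer, as `mean_consistency` does, avoids comparing to a single, subjective reference set.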

Results

Document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"

Exact matches (indexers and KEA++): aquaculture, damage, fencing, scares, noise*
Similar (indexers / KEA++):
– bird control / birds
– predatory birds / predators
– fish culture / fishing operations, fishery production
No match:
– Indexers: noxious birds, control methods
– KEA++: ropes

*Selected by only one indexer

Problems & Future Work

Trivial problems (e.g. stemming errors)
Document chunking
– What are the important and the disturbing parts of the document?
Topic coverage
– Exploring the thesaurus' structure
– Lexical chains
Term occurrence
– Including other NLP resources (e.g. WordNet)
Multi-linguality, other domains
