Thesaurus-Based Index Term Extraction Olena Medelyan Digital Library Laboratory.



Index Terms vs. Keyphrases
Both describe the topics in a document.
Example document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"
Index terms: taken from a controlled vocabulary (e.g. predatory birds, damage, aquaculture)
Keyphrases: freely chosen (e.g. techniques, bird predation, aquaculture)
Purposes:
– Organize a library's holdings
– Provide thematic access to documents
– Represent documents as a brief summary
– Aid navigation in search results
Manual assignment: expensive, time-consuming

Extraction vs. Assignment
Extraction: select significant n-grams or NPs according to their characteristics
- Restricted to syntax
- Poor-quality phrases
- No consistency
+ Easy and fast to implement
+ Not much training required
Assignment: classify documents into classes according to their content words (labels = keyphrases)
- Needs large corpora
- Long computation time
- Not practical
+ Uses word co-occurrence
+ High accuracy

KEA++
Combines extraction with a controlled vocabulary
Considers semantic relations
Controlled vocabulary = thesaurus
Experiment:
– agricultural documents (
– Agrovoc thesaurus (

How Does It Work?
1. Extract n-grams, transform them into pseudo-phrases, and map them to the pseudo-phrases of the thesaurus descriptors
   bird predation → predat bird
2. Each document = a set of candidate phrases
3. Training (documents + manually assigned phrases)
   a. Compute the features
   b. Compute the model
4. Testing (new documents, no phrases)
   a. Compute the features
   b. Compute probabilities according to the model
5. Classification model: Naïve Bayes
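Step 1 can be illustrated with a minimal Python sketch. It is not the KEA++ implementation: the stopword list, the toy suffix-stripping stemmer, the sample descriptors and the function names are all assumptions made for illustration. The stems are sorted alphabetically here so that word order does not affect matching, which is also an assumption; the slide's own example writes the stems as "predat bird".

```python
import re

# Assumed toy resources; a real system would use a full stopword list
# and a proper stemmer.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "of", "the", "to"}
SUFFIXES = ("ions", "ion", "ing", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(text):
    """Lowercase, drop stopwords, stem, and sort the stems."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [toy_stem(w) for w in words if w not in STOPWORDS]
    return " ".join(sorted(stems))


def candidates(document, descriptors, max_len=3):
    """Return the descriptors whose pseudo-phrase matches some n-gram."""
    index = {pseudo_phrase(d): d for d in descriptors}
    words = re.findall(r"[a-z]+", document.lower())
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            key = pseudo_phrase(" ".join(words[i:i + n]))
            if key in index:
                found.add(index[key])
    return found


if __name__ == "__main__":
    # Hypothetical descriptor list and sentence, for illustration only.
    descriptors = ["predatory birds", "aquaculture", "bird control", "fencing"]
    text = "Techniques for the control of birds in aquaculture"
    print(pseudo_phrase("bird predation"))
    print(sorted(candidates(text, descriptors)))
```

Because matching is done on pseudo-phrases, "control of birds" in the text is mapped to the descriptor "bird control" even though the surface forms differ.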

Features
TF×IDF: phrases that are specific to a given document are significant
First Position: phrases that appear at the beginning (or the end) of the document are significant
Phrase Length: phrases with a certain number of words are significant (two-word phrases in particular)
Node Degree: phrases that are related to the most other phrases in the document are significant
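The four features can be sketched as small Python functions. This is an illustrative sketch only: the TF×IDF formula is one common formulation (relative frequency times negative log document frequency) rather than a confirmed detail of KEA++, and the function names and the `related` mapping are hypothetical.

```python
import math


def tfidf(count_in_doc, doc_length, docs_with_phrase, num_docs):
    """Relative frequency in the document times inverse document frequency.

    Assumes docs_with_phrase >= 1 (no smoothing is applied in this sketch).
    """
    tf = count_in_doc / doc_length
    idf = -math.log2(docs_with_phrase / num_docs)
    return tf * idf


def first_position(first_word_index, doc_length):
    """Relative position of the first occurrence (0.0 = very beginning)."""
    return first_word_index / doc_length


def phrase_length(phrase):
    """Number of words in the phrase."""
    return len(phrase.split())


def node_degree(phrase, doc_phrases, related):
    """Count the other candidate phrases of the document that the thesaurus
    relates to `phrase`. `related` maps a term to its set of related terms."""
    neighbours = related.get(phrase, set())
    return sum(1 for other in doc_phrases if other != phrase and other in neighbours)


if __name__ == "__main__":
    # Made-up thesaurus links, not actual Agrovoc relations.
    related = {"predatory birds": {"birds", "predators", "noxious birds"}}
    doc_phrases = {"predatory birds", "birds", "predators", "aquaculture"}
    print(tfidf(2, 100, 1, 8))
    print(first_position(5, 100))
    print(phrase_length("predatory birds"))
    print(node_degree("predatory birds", doc_phrases, related))
```

In a trained model, feature values like these would be computed for each candidate phrase and passed to the classifier.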

Example
[Figure: term graph for the example document, with a legend for "Indexers", "Agrovoc relation" and "KEA++". Term labels in the figure: fisheries fish culture aquaculture fish ponds aquaculture techniques bird control predatory birds noxious birds scares pest control control methods monitoring methods equipment protective structures electrical installation fencing damage noise north america techniques fishery production predation predators birds ropes fishing operations]

Evaluation I
Standard evaluation:
– Number of exact matches in the test set
– Precision, Recall, F-measure
Problem:
– Semantic similarity is not considered
– Comparison with only one indexer, although indexing is subjective
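The standard evaluation can be written down in a few lines of Python. This is a minimal sketch using exact string matching; the function name and the sample term sets are illustrative and are not the experiment's data.

```python
def evaluate(extracted, gold):
    """Precision, recall and F-measure based on exact matches."""
    extracted, gold = set(extracted), set(gold)
    matches = len(extracted & gold)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Illustrative sets only.
    system_terms = {"aquaculture", "damage", "ropes", "birds"}
    indexer_terms = {"aquaculture", "damage", "bird control"}
    print(evaluate(system_terms, indexer_terms))
```

Here "birds" and "bird control" count as a complete miss, which is the first problem named on the slide: exact matching ignores semantic similarity.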

Evaluation II
Inter-indexer consistency, e.g. Rolling's measure:

Rolling's IIC = 2C / (A + B)
C: number of phrases in common
A: number of phrases in the first set
B: number of phrases in the second set

Indexers    vs. other indexers    vs. KEA    vs. KEA++
avg         38                    7          27

-11%: KEA++ compared with the indexers' consistency with each other (27 vs. 38)
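Rolling's measure follows directly from the definitions of A, B and C above. The sketch below assumes each indexer's phrases are given as a set and compared by exact match; the function name and the sample sets are illustrative.

```python
def rolling_iic(first, second):
    """Rolling's inter-indexer consistency: 2C / (A + B)."""
    first, second = set(first), set(second)
    total = len(first) + len(second)
    if total == 0:
        return 0.0
    common = len(first & second)
    return 2 * common / total


if __name__ == "__main__":
    # Illustrative sets only, not the experiment's data.
    indexer_a = {"aquaculture", "damage", "bird control"}
    indexer_b = {"aquaculture", "damage", "ropes", "birds"}
    print(rolling_iic(indexer_a, indexer_b))
```

The measure is symmetric, so neither set has to be treated as the reference, and it can be averaged over all pairs of indexers.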

Results
Document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"

            Indexer                           KEA++
Exact       aquaculture, damage, fencing, scares, noise* (chosen by both)
Similar     bird control                      birds
            predatory birds                   predators
            fish culture                      fishing operations, fishery production
No match    noxious birds, control methods    ropes

*Selected by only one indexer

Problems & Future Work
Trivial problems (e.g. stemming errors)
Document chunking
– Which parts of the document are important and which are disturbing?
Topic coverage
– Exploring the thesaurus' structure
– Lexical chains
Term occurrence
– Including other NLP resources (e.g. WordNet)
Multilinguality, other domains