Information Retrieval using Word Senses: Root Sense Tagging Approach
Sang-Bum Kim, Hee-Cheol Seo and Hae-Chang Rim
Natural Language Processing Lab., Department of Computer Science and Engineering, Korea University

Introduction
Since natural language is lexically ambiguous, one would expect text retrieval systems to benefit from resolving the ambiguity of every word in a given collection. Previous IR experiments using word senses, however, have shown disappointing results. Some reasons for the previous failures: skewed sense frequencies, the collocation problem, inaccurate sense disambiguation, etc.

Introduction
WSD for IR must be performed on all ambiguous words in a collection, since the user query cannot be known in advance. WSD performance reaches at most about 75% precision and recall on the all-words task in the SENSEVAL competition (about 95% on the lexical sample task).

Introduction
Several observations suggest that sense disambiguation for crude tasks such as IR differs from traditional word sense disambiguation. It is arguable whether fine-grained word sense disambiguation is necessary to improve retrieval performance; for example, "stock" has 17 different senses in WordNet. For IR, consistent disambiguation is more important than accurate disambiguation, and flexible disambiguation is better than strict disambiguation.

Root Sense Tagging Approach
This approach aims to improve the performance of large-scale text retrieval by conducting coarse-grained, consistent, and flexible WSD. The 25 root senses for the nouns in WordNet 2.0 are used. For example, "story" has 6 senses in WordNet: {message, fiction, history, report, fib} share the root sense "relation", while {floor} belongs to the root sense "artifact".
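The paper derives its 25 root senses from hypernym chains in WordNet 2.0. As a rough, minimal sketch, the grouping can be approximated with the lexicographer file names in NLTK's WordNet interface; NLTK ships a newer WordNet, and lexicographer files do not coincide exactly with the 25 unique beginners, so the groups will differ from the paper's in places.

```python
# A rough sketch: group a noun's WordNet senses by a coarse root
# category, approximated here by the lexicographer file name
# (e.g. "noun.artifact"). Requires: pip install nltk; then
# nltk.download("wordnet").
from collections import defaultdict
from nltk.corpus import wordnet as wn

def root_senses(noun):
    """Group the noun's synsets by lexicographer file."""
    groups = defaultdict(list)
    for synset in wn.synsets(noun, pos=wn.NOUN):
        groups[synset.lexname()].append(synset.name())
    return dict(groups)

print(root_senses("story"))
# The paper's WordNet 2.0 grouping: {message, fiction, history,
# report, fib} -> "relation"; {floor} -> "artifact". Newer WordNet
# versions and lexnames may group the senses differently.
```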

Root Sense Tagging Approach
The root sense tagger classifies each noun in the documents and queries into one of the 25 root senses; hence it is called coarse-grained disambiguation.

Root Sense Tagging Approach
When classifying a given ambiguous word, the most informative neighboring clue word, i.e., the one with the highest mutual information (MI) with the given word, is selected. Then the single most probable sense among the candidate root senses is chosen according to the MI between the selected clue word and each candidate root sense.
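A minimal sketch of this two-step rule, assuming co-occurrence counts have already been collected (next slides); MI here is pointwise mutual information, and the data-structure names are illustrative, not the authors':

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from raw co-occurrence counts."""
    if count_xy == 0 or count_x == 0 or count_y == 0:
        return float("-inf")
    return math.log((count_xy * total) / (count_x * count_y))

def tag_root_sense(target, context, candidates, ww, ws, w_cnt, s_cnt, total):
    """Two-step MI disambiguation (illustrative data structures):
    ww[(w1, w2)]   word-word co-occurrence counts
    ws[(w, sense)] word-sense co-occurrence counts
    w_cnt[w], s_cnt[sense]: marginal counts; total: corpus size."""
    # Step 1: pick the neighboring clue word with the highest MI
    # with the target word.
    clue = max(context,
               key=lambda c: pmi(ww.get((target, c), 0),
                                 w_cnt.get(target, 0), w_cnt.get(c, 0), total))
    # Step 2: among the target's candidate root senses, pick the one
    # with the highest MI with that clue word.
    return max(candidates,
               key=lambda s: pmi(ws.get((clue, s), 0),
                                 w_cnt.get(clue, 0), s_cnt.get(s, 0), total))
```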

Root Sense Tagging Approach
There are 101,778 non-ambiguous units in WordNet 2.0, e.g., "actor" = {role player, doer} → person; "computer system" → artifact.

Co-occurrence Data Construction
The steps to extract co-occurrence information from each document (steps 4 and 5 are sketched below):
1. Assign a root sense to each non-ambiguous noun in the document.
2. Assign a root sense to each second noun of non-ambiguous compound nouns in the document.
3. If any noun tagged in step 2 occurs alone in another position, assign it the same root sense as in step 2.
4. For each sense-assigned noun in the document, extract all (context word, sense) pairs within a predefined window.
5. Extract all (word, word) pairs.
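A sketch of steps 4 and 5, assuming steps 1-3 have already produced a per-token sense assignment; the window size and return format are illustrative:

```python
def extract_pairs(tokens, senses, window=3):
    """tokens: list of words in the document;
    senses[i]: root sense assigned to tokens[i], or None.
    Returns (context word, sense) pairs and (word, word) pairs."""
    word_sense_pairs, word_word_pairs = [], []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Step 4: pair every context word with the assigned sense.
            if senses[i] is not None:
                word_sense_pairs.append((tokens[j], senses[i]))
            # Step 5: pair the words themselves (once per pair).
            if j > i:
                word_word_pairs.append((word, tokens[j]))
    return word_sense_pairs, word_word_pairs
```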

Co-occurrence Data Construction [slide content not transcribed]

MI-based Root Sense Tagging
"system" has 9 fine-grained senses in WordNet and 5 candidate root senses: artifact, cognition, body, substance, and attribute.

Indexing and Retrieval
A 26-bit sense field is added to each term posting element in the index; 1 bit is reserved for unk, assigned to unknown words. If s(w) is set to null or w is not a noun, all the bits are 0. Two situations must be considered: several different root senses may be assigned to the same word within a document, and only nouns are sense-tagged, but a verb with the same indexing keyword form may exist in the document.
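A minimal sketch of such a sense field, assuming one bit per root sense plus the unk bit, with several bits allowed to be set when a word receives different root senses within a document; the bit order and the abbreviated sense list are illustrative:

```python
# Illustrative: only 5 of the 25 root senses listed here.
ROOT_SENSES = ["person", "artifact", "cognition", "relation", "act"]
UNK_BIT = 25  # the 26th bit, for unknown words

def sense_field(assigned_senses):
    """Pack a set of root senses (or {"unk"}) into a 26-bit integer.
    An empty set models s(w) = null or a non-noun term: all bits 0."""
    field = 0
    for s in assigned_senses:
        if s == "unk":
            field |= 1 << UNK_BIT
        else:
            field |= 1 << ROOT_SENSES.index(s)
    return field

# Example: "system" tagged as artifact in one position and cognition
# in another within the same document -> two bits set.
print(bin(sense_field({"artifact", "cognition"})))
```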

Indexing and Retrieval [slide content not transcribed]

Sense-Oriented Term Weighting
A sense-oriented term weighting method is proposed on top of the traditional term-based index: each term weight is transformed using a sense weight sw calculated by referring to the sense field. The sense weight sw_ij for term t_i in document d_j is defined as

sw_ij = 1 + α · q(dsf_ij, qsf_i)

where dsf_ij and qsf_i denote the sense fields of term t_i in document d_j and in the query, respectively, and q is a sense-matching function over the two sense fields.
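The slide's definition of q is not transcribed. A sketch under the assumption, not stated in the transcript, that q returns 1 when the document and query sense fields share at least one set bit and 0 otherwise, which is the simplest reading consistent with the bit-field index; α = 0.5 is the value used in the experiments:

```python
ALPHA = 0.5  # the value used in the paper's experiments

def q(dsf, qsf):
    """Hypothetical sense-matching function: 1 if the two 26-bit
    sense fields overlap, else 0. The paper's exact q is not in
    the transcript."""
    return 1 if (dsf & qsf) != 0 else 0

def sense_weight(dsf, qsf, alpha=ALPHA):
    """sw_ij = 1 + alpha * q(dsf_ij, qsf_i)"""
    return 1 + alpha * q(dsf, qsf)
```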

Data and Evaluation Methodologies
Two document collections and two query sets are used.
Documents:
- 210,157 documents of the Financial Times collection in TREC CD vol. 4
- 127,742 documents of the LA Times collection in vol. 5
Queries:
- TREC 7 and TREC 8 queries
Three baseline term weighting methods (sketched below):
- W1: simple idf weighting
- W2: tf · idf weighting
- W3: (1 + log(tf)) · idf weighting
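The three baselines as code, assuming the standard log(N/df) form of idf (a sketch; tf is the within-document term frequency, df the document frequency, N the collection size):

```python
import math

def idf(df, N):
    """Inverse document frequency; log(N/df) form assumed."""
    return math.log(N / df)

def w1(tf, df, N):
    """W1: simple idf weighting."""
    return idf(df, N)

def w2(tf, df, N):
    """W2: tf * idf weighting."""
    return tf * idf(df, N)

def w3(tf, df, N):
    """W3: (1 + log(tf)) * idf weighting."""
    return (1 + math.log(tf)) * idf(df, N) if tf > 0 else 0.0
```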

Experiment Results
The sense weight parameter α is set to 0.5. More improvement is obtained in the long-query experiments. Overgrown weights appear in W3(W2)+sense, since the sense multiplier further inflates already large tf-based weights.

Pseudo Relevance Feedback
Five terms from the top ten documents are selected by the probabilistic term selection method. For the sense fields of the new query terms in the +sense experiments, a voting method is used, in which the most frequent root sense in the top 10 documents is assigned to each new term.
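A sketch of the voting step; get_senses is an illustrative accessor, not the authors' API, standing for a lookup of the root senses recorded for a term in a document's sense fields:

```python
from collections import Counter

def vote_sense(term, top_docs, get_senses):
    """Assign the expansion term the root sense that occurs most often
    for it across the top-ranked documents. get_senses(term, doc)
    returns the root senses assigned to the term in that document
    (a hypothetical accessor for illustration)."""
    votes = Counter()
    for doc in top_docs:
        votes.update(get_senses(term, doc))
    return votes.most_common(1)[0][0] if votes else None
```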

Result of Pseudo Relevance Feedback [slide content not transcribed]

BM25 using Root Senses [slide content not transcribed]
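The slide itself is not transcribed. As a hedged sketch, here is Okapi BM25 with the sense weight sw folded in as a multiplier on the term weight; this combination is an assumption for illustration, not necessarily the paper's exact formulation:

```python
import math

def bm25_sense(tf, df, doc_len, avg_doc_len, N, sw, k1=1.2, b=0.75):
    """Okapi BM25 term weight scaled by the sense weight
    sw = 1 + alpha * q(dsf, qsf). Folding sw in as a multiplier is
    an assumption; the paper's combination is on the untranscribed
    slide. The +1 inside the log keeps idf non-negative."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return sw * idf * tf_norm
```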

Conclusions
A coarse-grained, consistent, and flexible sense tagging method is proposed to improve large-scale text retrieval performance. The approach can be applied to retrieval systems in other languages, even where the available lexical resources are much more roughly constructed than expensive resources like WordNet. The proposed sense-field-based indexing and sense-weight-oriented ranking do not seriously increase system overhead.

Conclusions
Experimental results show good performance even when combined with relevance feedback or the state-of-the-art BM25 retrieval model. In future work, verbs should also be assigned senses, and it is essential to develop a more elaborate retrieval model, i.e., a term weighting model that considers word senses.