1 UCB Digital Library Project An Experiment in Using Lexical Disambiguation to Enhance Information Access Robert Wilensky, Isaac Cheng, Timotius Tjahjadi,

Presentation transcript:

1 UCB Digital Library Project
An Experiment in Using Lexical Disambiguation to Enhance Information Access
Robert Wilensky, Isaac Cheng, Timotius Tjahjadi, and Heyning Cheng

2 Goal
• Enhance information access by
  – fully automated text categorization
  – adding searching by word sense
• Applied to the World Wide Web

3 Manual vs. Automatically Created Directories
• Manual classification of documents is
  – Expensive
  – Not scalable
    ▪ Hard to keep up with the rapid growth and changes of information sources such as the Web
• Would like fully automatic classification
  – no training set
  – no rules
  – appeal instead to “intrinsic semantics”

4 Lexical Disambiguation
• Problem: Determine the intended sense of an ambiguous word
• Approach: Based on Yarowsky et al.
  – Thesaurus categories as proxies for senses
    ▪ We used Roget’s 5th
  – Training: Count nearby word-category co-occurrences
  – Deployment: Add up the word-category evidence

5 Counting Co-occurrences of Terms with Categories
…while storks and cranes make their nests in the bank…
Result is a category co-occurrence vector for each term.
[Tools, Animals]
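The training and deployment steps described on slides 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the IAGO implementation: the toy thesaurus, the tiny corpus, the function names, and the log(1 + count) evidence weight are all assumptions made for the example, and the original Yarowsky-style weighting may well differ.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy thesaurus: word -> categories it is listed under.
# (The slides use Roget's 5th; these entries are invented for illustration.)
THESAURUS = {
    "crane": ["Tools", "Animals"],
    "stork": ["Animals"],
    "nest": ["Animals"],
    "drill": ["Tools"],
    "hoist": ["Tools"],
}


def train(sentences, half_window=50):
    """Count how often each context word occurs near a member of each category."""
    counts = defaultdict(Counter)  # context word -> category -> count
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - half_window)
            hi = min(len(tokens), i + half_window + 1)
            for category in THESAURUS.get(target, []):
                for j in range(lo, hi):
                    if j != i:
                        counts[tokens[j]][category] += 1
    return counts


def disambiguate(tokens, index, counts, half_window=50):
    """Add up the evidence of nearby words for each candidate category."""
    lo = max(0, index - half_window)
    hi = min(len(tokens), index + half_window + 1)
    scores = {}
    for category in THESAURUS[tokens[index]]:
        # Assumed weighting: smoothed log of the co-occurrence count.
        scores[category] = sum(
            math.log(1 + counts[tokens[j]][category])
            for j in range(lo, hi)
            if j != index
        )
    return max(scores, key=scores.get), scores


# Invented training corpus, already tokenized and reduced to singular forms.
corpus = [
    ["stork", "crane", "nest", "river"],
    ["drill", "hoist", "crane", "site"],
    ["stork", "nest", "river"],
    ["drill", "hoist", "site"],
]
counts = train(corpus)
bird_sense, _ = disambiguate(["stork", "crane", "nest"], 1, counts)
tool_sense, _ = disambiguate(["drill", "crane", "hoist"], 1, counts)
```

Note that an ambiguous training word such as "crane" contributes counts to every category it belongs to; the approach relies on the correct category accumulating more consistent evidence across a large corpus.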

6 Automatic Topic Assignment Based on Word Sense
• Hearst
  – Topic   word-category association vectors
• Fisher and Wilensky
  – Contrasted different algorithms
  – Concluded that exploiting word senses may improve topic assignment
• We use the prior probability distribution of word senses (and, more recently, disambiguation per se).

7 IAGO 0.1 vs. 1.0
• IAGO 0.1:
  – Eliminated short (< 100 content words) pages
  – Trained on newswire text
• IAGO 1.0:
  – Trained on Encarta encyclopedia
  – Estimated word sense priors on the Web (used 10 million words of random web documents)
  – Ignored proper nouns
  – Augmented stop-list to deal with various problems
• Tested categorization by mapping Yahoo categories to ours
• Tested disambiguation on newswire, then sampled Web.

8 IAGO! Overview

9 Classification Results

Category Name        Precision   Recall
ComputerScience         87.5%     19.4%
FinanceInvestment      100.0%     13.4%
FitnessExercise        100.0%      1.8%
MotionPictures         100.0%     54.8%
Music                   98.2%     42.4%
Nutrition               97.9%     29.9%
Occupation              97.8%     30.3%
TheEnvironment            n/a      0.0%
Travel                  75.0%     15.4%
Overall precision = 97%, overall recall = 21%
Now: (version 1.0)

Category Name        Precision   Recall
ComputerScience         31.6%     17.1%
FinanceInvestment       94.4%     22.0%
FitnessExercise        100.0%      4.3%
MotionPictures         100.0%     57.1%
Music                   97.5%     58.3%
Nutrition               80.3%     35.6%
Occupation             100.0%     13.1%
TheEnvironment            n/a      0.0%
Travel                  50.0%      5.7%
Overall precision = 88%, overall recall = 23%
Then: (version 0.1)

(92.3% and 20.4% if no adjustment by hand)

10 IAGO! 1.0 Internet Directory
• Used engine to classify a few tens of thousands of web documents into Roget’s categories.

12 Disambiguation Results

13 Application to Text Searching
• Present user with set of known word senses from which to select
  – e.g., keyword = “rock”
    ▪ = stone
    ▪ = kind of music
• Retrieve by word, filter by word sense
• Rank by number of matching word senses
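The retrieve, filter, and rank steps listed on slide 13 might look something like the sketch below. It assumes documents have already been disambiguated into (word, sense) pairs; the `search` function, the document format, and the sample data are hypothetical and chosen only to make the three steps concrete.

```python
def search(keyword, wanted_sense, documents):
    """Return ids of documents using `keyword` in `wanted_sense`, best first.

    `documents` is a list of (doc_id, [(word, sense), ...]) pairs in which
    each word occurrence already carries a disambiguated sense.
    """
    hits = []
    for doc_id, tagged in documents:
        senses = [sense for word, sense in tagged if word == keyword]
        if not senses:
            continue  # retrieve by word: keyword absent from this document
        matches = sum(1 for sense in senses if sense == wanted_sense)
        if matches == 0:
            continue  # filter by word sense: keyword present, wrong sense
        hits.append((matches, doc_id))
    # Rank by number of matching word senses (ties broken by document id).
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in hits]


# Invented sample collection.
docs = [
    ("geology", [("rock", "stone"), ("rock", "stone"), ("layer", None)]),
    ("concert", [("rock", "music")]),
    ("mixed", [("rock", "stone"), ("rock", "music")]),
    ("jazz", [("sax", "music")]),
]
stone_hits = search("rock", "stone", docs)
music_hits = search("rock", "music", docs)
```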

16 Is it Useful?
• Results in the literature generally suggest that disambiguation is not useful for long queries, and that its utility is highly sensitive to disambiguation accuracy.
• However, 40% of search queries on the web are reported to be single words.
• So, does disambiguation work well enough to aid with single-word queries?

17 Usefulness
• Let r be the frequency of the most common of the (non-overlapping) senses.
• One can show that, to do better than plain keyword retrieval, disambiguation accuracy needs to be at least 50%, with the required accuracy increasing as r increases, but it need not be highly accurate. (In fact, it can perform below the baseline.)
• IAGO! 1.0 performs well above this level.

18 Usefulness
• Keyword retrieval yields word-sense retrieval precision and recall of r and 1 for the more common sense, and (1-r) and 1 for the less common sense.
• A disambiguation method that is correct a fraction p of the time would have precision and recall values of [expression missing] and p for a word sense with frequency r.
• Using E as the metric, one can show that p needs to be at least [expression missing] for a disambiguation method to outperform keyword retrieval.
• For small r, p must be greater than 50%. For large r, this compares favorably with keyword retrieval even with fairly low disambiguation accuracy.
  – E.g., with a 90/10 distribution of word senses, for the more common word sense, E with a beta of 0.5 is better for a disambiguation algorithm with an accuracy over 77% than for keyword retrieval. (For the less common word sense, a “disambiguation” algorithm that is completely random gives a superior result.)
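The comparison on slide 18 can be explored numerically with a small sketch. Everything beyond the slide's own statements is an assumption here: the model has exactly two senses, the disambiguator is correct with the same probability p for both senses, and E is taken to be van Rijsbergen's measure, E = 1 - F_beta (lower is better). The precision expression in `disambiguated_pr` is derived from those assumptions and is not copied from the slide.

```python
def e_measure(precision, recall, beta=0.5):
    """van Rijsbergen's E = 1 - F_beta; lower values are better."""
    if precision == 0 or recall == 0:
        return 1.0
    b2 = beta * beta
    return 1.0 - (1 + b2) * precision * recall / (b2 * precision + recall)


def keyword_pr(r):
    """Keyword retrieval returns every occurrence: precision r, recall 1."""
    return r, 1.0


def disambiguated_pr(r, p):
    """Precision and recall for a sense of frequency r under the assumed model.

    Occurrences labelled with the sense are either true instances labelled
    correctly (r * p) or instances of the other sense mislabelled
    ((1 - r) * (1 - p)).
    """
    correct = r * p
    wrong = (1 - r) * (1 - p)
    precision = correct / (correct + wrong) if correct + wrong > 0 else 0.0
    return precision, p


# More common sense in a 90/10 split, beta = 0.5.
baseline = e_measure(*keyword_pr(0.9))
e_at_76 = e_measure(*disambiguated_pr(0.9, 0.76))
e_at_77 = e_measure(*disambiguated_pr(0.9, 0.77))
```

Under these assumptions the crossover for the more common sense falls between 76% and 77% accuracy, which is close to the 77% figure quoted on the slide.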

19 More Results
• Latest implementation (by Heyning Cheng) reduces training to about 1 hour (from about 24); classifying 1000 documents takes about 10 minutes.
• Also improved performance of disambiguation. This made it practical to use disambiguation in topic assignment:
  – I.e., it produces slightly better results, appears to be less sensitive to changes in the stoplist, and can be made to run quickly.
• Disambiguation with a substantially smaller window size (even as small as 5) did not reduce accuracy; in some cases, a half-window size of 10 outperformed one of 50.

20 More Results (cont’d)
• Weighted word sense priors by IDF of the term
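One plausible reading of weighting word sense priors by IDF is sketched below. The slides do not say how the weights are combined, so the scoring rule (sum of IDF times prior per category), the function names, and all numbers are assumptions for illustration only.

```python
import math
from collections import Counter


def idf(term, doc_freq, n_docs):
    """Plain inverse document frequency, log(N / df); unseen terms use df = 1."""
    return math.log(n_docs / max(1, doc_freq.get(term, 0)))


def topic_scores(tokens, sense_priors, doc_freq, n_docs):
    """Accumulate IDF-weighted sense priors per category for one document."""
    scores = Counter()
    for term in tokens:
        weight = idf(term, doc_freq, n_docs)
        for category, prior in sense_priors.get(term, {}).items():
            scores[category] += weight * prior
    return scores


# Invented priors P(category | term) and document frequencies.
sense_priors = {
    "rock": {"Music": 0.6, "Geology": 0.4},
    "guitar": {"Music": 1.0},
    "page": {"Printing": 1.0},
}
doc_freq = {"rock": 100, "guitar": 10, "page": 1000}
scores = topic_scores(["rock", "guitar", "page"], sense_priors, doc_freq, 1000)
best_topic = scores.most_common(1)[0][0]
```

In this toy setup a term that appears in every document ("page") gets an IDF of zero and so contributes nothing to its category, which is the intended effect of the weighting.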

21 More Results
• Excluding low-utility or confusing Roget’s categories (down to about 200) improved recall to about 40% on the 1000 document test set.
• The “purity” of topic assignment (% of all word senses disambiguated to the assigned topic) seems correlated with accuracy at least as well as IAGO’s ranking algorithm.
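The "purity" figure defined on slide 21 is simple enough to state as code. The sketch below follows the slide's definition (the share of disambiguated word senses that fall in the assigned topic); the function name and sample input are hypothetical.

```python
def purity(assigned_topic, disambiguated_senses):
    """Fraction of disambiguated word senses that equal the assigned topic."""
    if not disambiguated_senses:
        return 0.0
    matching = sum(1 for sense in disambiguated_senses if sense == assigned_topic)
    return matching / len(disambiguated_senses)


# Invented example: three of four disambiguated senses agree with the topic.
example_purity = purity("Music", ["Music", "Music", "Geology", "Music"])
```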

22 Future Work
• Get better word sense proxies!
• Word-sense searching
  – Create word sense index
  – Support word-sense searching within more general searches
  – Improve disambiguation by exploiting priors
  – Test against synonym expansion methods
• Automatic topic-categorization
  – Handle multi-word phrases; proper names

23 Future Plans: Longer Term
• Disambiguation
  – Handle non-nouns
  – Better word sense source
    ▪ Automatic grouping of thesaural word senses
• Topic-categorization
  – Multiple topic assignment
  – Quality
• Summarization via same techniques
• Other linguistic choices, e.g., thematic roles