Center for Natural Language Processing School of Information Studies

Whither Come the Words?
Dr. Elizabeth D. Liddy
Center for Natural Language Processing, School of Information Studies, Syracuse University

A Continuum from Human to Statistical Indexing
- Manual: controlled vocabularies
- Mixed Initiative: machine-aided / human-assisted; machine learning
- Automatic: statistical indexing; Natural Language Processing indexing

Basic Premise
The quality of the representation of documents determines:
- the ‘richness’ of the indexing
- the ‘quality’ of access to relevant information
- the ‘value-add’ analytics the system can accomplish for users

Central Problem of IR
How to represent documents for retrieval (Blair, 1990)
- key issue in controlled vocabulary representation & searching
- still true with full-text indexing and free-text querying systems
- because documents & queries are expressed in language
  - language is complex and ambiguous
  - methods for solving the language issue are difficult
  - some IR systems don’t even attempt to deal with it
- major challenge of high quality information access

1. Identify indexable / queryable elements: What is a term?
- Alpha-numeric characters between blank spaces or punctuation?
- What about non-compositional phrases?
- Multi-word proper names?
- What about inter-word symbols such as hyphens or apostrophes?
  - “small business men” vs. “small-business men”
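The hyphen question can be made concrete with a minimal sketch. The two function names and the regular expressions are illustrative assumptions, not part of the original slides. They show how one tokenization rule makes the two example phrases identical while another keeps them distinct.

```python
import re

def tokenize_split_hyphens(text):
    # Assumption: a term is any run of letters/digits; hyphens and
    # apostrophes are treated as separators.
    return re.findall(r"[a-z0-9]+", text.lower())

def tokenize_keep_hyphens(text):
    # Assumption: a hyphen or apostrophe between alphanumeric runs
    # is kept as part of a single term.
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())

a = "small business men"
b = "small-business men"

# Under the first rule both phrases index to the same three terms.
print(tokenize_split_hyphens(a), tokenize_split_hyphens(b))
# Under the second rule the hyphenated modifier survives as one term.
print(tokenize_keep_hyphens(b))
```

Which rule is preferable depends on the collection; the point is only that the choice changes what can later be matched.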

2. Represent the concept behind the term
Ability to take ‘terms’, and:
- Standardize
- Expand to alternative ‘terms’
- Disambiguate
So that the concept behind the ‘term’ is represented in both documents & queries

Term Expansion
Goal - add all variant terms which refer to the same concept:
- either synonymous expressions or associated terms
- use either thesaurus, semantic network, or statistically determined co-occurring terms/phrases
- inspired by success of humanly-consulted IR thesauri used in earliest systems
- relieves the user from needing to generate all conceptual variants

Term expansion: multiple approaches
- Knowledge-based
- Linguistic
- Statistical

Knowledge-based Thesauri
IR-style:
- intended for human indexers and searchers
- manually constructed for a specific domain
- contain synonymous, more general, and more specific terms, linked by the relations Use For, Broader, Narrower, and Related
- current question is how to utilize them appropriately in Web-based systems

Knowledge-based Thesauri
DATABASE MANAGEMENT SYSTEMS
  UF databases
  NT relational databases
  BT file organization; management information systems
  RT database theory; decision support systems
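A thesaurus record of this kind can be sketched as a small lookup table and used for query expansion. The dictionary layout and the `expand` function are illustrative assumptions. Splitting the BT and RT lines into two terms each is also an assumption about the flattened slide.

```python
# One IR-style thesaurus record, keyed by the preferred term.
# UF = Use For, NT = Narrower Term, BT = Broader Term, RT = Related Term.
THESAURUS = {
    "database management systems": {
        "UF": ["databases"],
        "NT": ["relational databases"],
        "BT": ["file organization", "management information systems"],  # assumed split
        "RT": ["database theory", "decision support systems"],          # assumed split
    }
}

def expand(term, relations=("UF", "NT")):
    """Return the term plus the terms linked to it by the chosen relations."""
    key = term.lower()
    entry = THESAURUS.get(key)
    if entry is None:
        return [key]  # unknown terms pass through unchanged
    expanded = [key]
    for rel in relations:
        expanded.extend(entry.get(rel, []))
    return expanded

print(expand("Database Management Systems"))
print(expand("database management systems", relations=("BT", "RT")))
```

Restricting the default to UF and NT is a design choice in this sketch: synonyms and narrower terms tend to stay on topic, while broader and related terms widen the query.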

Linguistic Thesauri
General-purpose style, e.g. Roget’s, WordNet:
- contain explicit concept hierarchies of up to 8 increasingly specified levels
- based on the assumption that the words in a semicolon group (RIT) or a synset (WordNet) are synonymous or near-synonymous
- issue / difficulty is selecting the correct sense for terms

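[Figure: a path through the Roget’s hierarchy, from the most general class down to a semicolon group]
The World
- Abstract Relations, Space, Physics, Matter, Sensation, Intellect, Volition, Affections
  - (under Sensation) Sensation in General, Touch, Taste, Smell, Sight, Hearing
    - (under Smell) Odor, Fragrance, Stench, Odorless
      - paragraphs .1 to .9
        - Incense; joss stick; pastille; frankincense or olibanum; agallock or aloeswood; calambac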

Linguistic Thesaurus Use in IR
- Can be used on either / both documents or queries; more commonly done on queries
- Terms are expanded by adding one or all of: synonyms, hyponyms, hypernyms
- Issues caused by: idiomatic, specialized terms; non-compositional phrases not in thesaurus

Process used by Voorhees ’93 Research
- Look up each word from the text in WordNet
- If the word is found, the synonyms from all of its synsets are added to the query representation
- Weight each added word as 0.8 rather than 1.0
- Found results to be better than plain SMART
- Variable performance over queries
- Major cause of error was when ambiguous words’ synsets are used in expansion
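The expansion procedure described on this slide can be sketched as follows. The `SYNSETS` table is a toy stand-in with hypothetical entries, not actual WordNet content, and `expand_query` is an illustrative name. Only the 0.8 and 1.0 weights come from the slide.

```python
# Toy stand-in for WordNet: each word maps to a list of synsets (hypothetical data).
SYNSETS = {
    "bond": [{"bond", "adhesion"}, {"bond", "security"}],
    "instrument": [{"instrument", "tool"}, {"instrument", "document"}],
}

def expand_query(query_terms, added_weight=0.8):
    """Add the synonyms from ALL synsets of each query word, at a reduced weight."""
    weights = {term: 1.0 for term in query_terms}  # original terms keep weight 1.0
    for term in query_terms:
        for synset in SYNSETS.get(term, []):       # words not found are left alone
            for synonym in synset:
                # setdefault never overwrites an original term's weight of 1.0
                weights.setdefault(synonym, added_weight)
    return weights

print(expand_query(["bond", "risk"]))
```

Because every synset of "bond" contributes, the financial query also picks up "adhesion". This is the error source the slide names: expansion from the synsets of an ambiguous word.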

Use of Thesauri for expansion
General thesauri such as Roget’s or WordNet have not been shown conclusively to improve results:
- may sacrifice precision for recall
- not domain specific
- not sense disambiguated
But a currently active field of R & D

Disambiguation
- Non-relevant documents may be retrieved because they contain the query term, but in the wrong sense
- Need good Word Sense Disambiguation

Sample ambiguous query: I would like information about developments in low-risk instruments, especially those being offered by companies specializing in bonds.

Human Sense Disambiguation
Sources of influence known from psycholinguistics research:
- local context: the sentence / query containing the ambiguous word restricts the interpretation of the ambiguous word
- domain knowledge: the fact that a text is concerned with a particular domain activates only the sense appropriate to that domain
- frequency data: the frequency of each sense in general usage affects its accessibility to the mind

Machine Readable Lexical Sources
Multiple entries for polysemous words, e.g. “Instrument”:
- Medical
- Financial
- Dental
- Musical
- Hardware
- Empirical experimentation
- General

Machine Readable Lexical Sources Senses are ranked by frequency of occurrence in usage: 1. Musical 2. Hardware 3. General 4. Medical 5. Dental 6. Financial 7. Empirical experimentation
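A minimal sketch of how local context and sense frequency might be combined for the word "instrument". The sense order follows the frequency ranking above. The cue words attached to each sense are hypothetical, and `disambiguate` is an illustrative name rather than a method from the slides.

```python
# Senses of "instrument" in frequency order; cue words are assumed for illustration.
SENSES = [
    ("musical", {"orchestra", "play", "string"}),
    ("hardware", {"tool", "gauge", "panel"}),
    ("general", set()),
    ("medical", {"surgical", "patient"}),
    ("dental", {"tooth", "dentist"}),
    ("financial", {"bonds", "risk", "companies", "securities"}),
    ("empirical experimentation", {"measurement", "survey"}),
]

def disambiguate(context_words):
    """Pick the sense whose cue words overlap most with the local context.

    With no overlap at all, fall back to the most frequent sense.
    """
    context = {w.lower().strip(".,") for w in context_words}
    best_sense, best_overlap = SENSES[0][0], 0
    for sense, cues in SENSES:
        overlap = len(cues & context)
        if overlap > best_overlap:  # strict ">" keeps the more frequent sense on ties
            best_sense, best_overlap = sense, overlap
    return best_sense

query = ("I would like information about developments in low-risk instruments, "
         "especially those being offered by companies specializing in bonds.")
print(disambiguate(query.split()))
```

For the sample query, the context words "companies" and "bonds" select the financial sense even though it ranks sixth by frequency. A frequency-only choice would pick the musical sense.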

Corpus-based Word Sense Disambiguation
- Supervised learning from manually sense-tagged corpora
  - allows development of algorithms which can correctly tag each word with its correct sense
  - utilizes context, which then proves essential in real-time disambiguation
  - usually a small window of words surrounding the ambiguous term
- Issues
  - time & cost in tagging the training sample
  - need to retag for new domains or genres

Word Sense Disambiguation: Impact on retrieval results
- Results vary
  - by approach used
  - by query (short queries, especially)
  - by engine
- Some consider it a proven technique for improving Precision
- Some are concerned about the trade-off in efficiency

Statistical Thesauri
- Automatic thesaurus construction
- Classes of terms produced are not necessarily synonymous, nor broader, nor narrower
- Rather, words that tend to co-occur with the head term
- Effectiveness varies considerably depending on technique used

Automatic Thesaurus Construction (Salton)
Document-collection based:
- based on index term similarities
- compute vector similarities for each pair of documents
- if sufficiently similar, create a thesaurus entry for each term which includes terms from the similar document
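The pairwise procedure above can be sketched under a few assumptions: documents are term-weight dictionaries, similarity is cosine, and the 0.5 threshold is arbitrary. Reading "terms from similar document" as "the other document's terms" is also an interpretation. The function names are illustrative.

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term -> weight dictionaries."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_thesaurus(docs, threshold=0.5):
    """For each sufficiently similar document pair, link each term of one
    document to the terms of the other document."""
    thesaurus = {}
    for d1, d2 in combinations(docs, 2):
        if cosine(d1, d2) >= threshold:
            for src, other in ((d1, d2), (d2, d1)):
                for term in src:
                    thesaurus.setdefault(term, set()).update(
                        t for t in other if t != term
                    )
    return thesaurus

docs = [
    {"heat-flow": 1, "heat-transfer": 1, "blast-cooled": 1},
    {"heat-flow": 1, "heat-transfer": 1, "anneal": 1},
    {"hysteresis": 1, "coercive": 1},
]
for term, related in sorted(build_thesaurus(docs).items()):
    print(term, sorted(related))
```

The third document shares no terms with the others, so its terms get no entries. The resulting classes group co-occurring terms, not synonyms.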

Sample Automatic Thesaurus Entries:
408: dislocation, junction, minority-carrier, point contact, recombine, transition
409: blast-cooled, heat-flow, heat-transfer
410: anneal, strain
411: coercive, demagnetize, flux-leakage, hysteresis, induct, insensitive, magnetoresistance, square-loop, threshold
412: longitudinal, transverse

Dynamic Automatic Thesaurus Construction
- Thesaurus short-cut
- Run at query time
- Take all terms in query into consideration at once
- Look at frequent words and phrases in top retrieved documents and add these to the query
= Automatic Relevance Feedback
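A minimal sketch of automatic relevance feedback as described on this slide. The term-count ranking, the stopword list, and the parameters `top_k` and `n_new` are all simplifying assumptions. A real engine would use its own ranking function.

```python
from collections import Counter

STOPWORDS = {"the", "of", "a", "in", "and", "to", "for"}  # assumed, minimal

def rank(query_terms, docs):
    """Rank documents by raw count of query terms (a stand-in for a real engine)."""
    scored = [(sum(doc.lower().split().count(t) for t in query_terms), doc)
              for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [doc for score, doc in scored if score > 0]

def feedback_expand(query_terms, docs, top_k=2, n_new=3):
    """Add the most frequent non-query words from the top-ranked documents."""
    top_docs = rank(query_terms, docs)[:top_k]
    counts = Counter(
        tok for doc in top_docs for tok in doc.lower().split()
        if tok not in STOPWORDS and tok not in query_terms
    )
    return list(query_terms) + [term for term, _ in counts.most_common(n_new)]

docs = [
    "immigration law reform and amnesty program",
    "new immigration law amnesty for undocumented workers",
    "heat transfer in blast cooled systems",
]
print(feedback_expand(["immigration", "law"], docs, n_new=1))
```

No thesaurus is stored. The expansion terms are computed from whatever the initial query retrieves, which is why the slide calls it a short-cut.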

Expansion by an Association Thesaurus
Query: Impact of the 1986 Immigration Law
Phrases retrieved by association in corpus:
- illegal immigration
- statutes
- amnesty program
- applicability
- immigration reform law
- seeking amnesty
- editorial page article
- legal status
- naturalization service
- immigration act
- civil fines
- undocumented workers
- new immigration law
- guest worker
- legal immigration
- sweeping immigration law
- employer sanctions
- undocumented aliens

NLP-based Indexing the computational process of identifying, selecting, and extracting useful information from massive volumes of textual data: - for potential review by indexers - or stand-alone representation of content - using Natural Language Processing

Natural Language Processing • a range of computational techniques • for analyzing and representing naturally occurring texts • at one or more levels of linguistic analysis • for the purpose of achieving human-like language processing • for a range of tasks or applications

Levels of Language Understanding
- Pragmatic
- Discourse
- Semantic
- Syntactic
- Lexical
- Morphological

What can NLP Indexing do?
- Phrase recognition
- Disambiguation
- Concept expansion

In Summary:
- There exists a range of approaches for representing documents and queries
- Each needs to be evaluated in terms of its ability to accomplish the goals of your application
- Web applications have opened a whole new world of possible variations on the traditional indexing approaches