Artificial intelligence & natural language processing Mark Sanderson Porto, 2000.


Aims
To provide an outline of the attempts made at using NLP techniques in IR

Objectives
At the end of this lecture you will be able to
–Outline a range of attempts to get NLP to work with IR systems
–Idly speculate on why they failed
–Describe the successful use of NLP in a limited domain

Why?
Seems an obvious area of investigation
–Why is it not working?

Use of NLP
Syntactic
–Parsing to identify phrases
–Full syntactic structure comparison
Semantic
–Building an understanding of a document’s content
Discourse
–Exploiting document structure?

Syntactic
Parsing to identify phrases
–The issues.
–Explain how it’s done (a bit).
–Is it worth it?
Other possibilities
–Grammatical tagging
–Full syntactic structure comparison
Explain how it’s done (a little bit).
Show results.

Simple phrase identification
High frequency terms could be good candidates.
–Why?
Terms co-occurring more often than chance.
–Within small number of words.
–Surrounding simple terms.
–Not surrounding punctuation.
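The co-occurrence idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any cited paper: the function name, the window size, the punctuation set and the `min_ratio` threshold are all assumptions made for the example. It counts ordered word pairs that fall within a small window, never pairs across punctuation, and keeps pairs seen at least twice whose count is well above what independent word frequencies would predict.

```python
from collections import Counter

PUNCT = {".", ",", ";", ":", "!", "?"}  # assumed punctuation tokens


def candidate_phrases(tokens, window=3, min_ratio=2.0):
    """Score word pairs that co-occur within `window` words more often than chance."""
    n = len(tokens)
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, left in enumerate(tokens):
        if left in PUNCT:
            continue
        for right in tokens[i + 1 : i + 1 + window]:
            if right in PUNCT:
                break  # do not pair words across punctuation
            pairs[(left, right)] += 1

    total_pairs = sum(pairs.values())
    scored = {}
    for (left, right), observed in pairs.items():
        # Count expected if the two words occurred independently of each other.
        expected = total_pairs * (unigrams[left] / n) * (unigrams[right] / n)
        if observed >= 2 and observed / expected >= min_ratio:
            scored[(left, right)] = observed / expected
    return scored


text = ("information retrieval systems index text . "
        "information retrieval research is old . text systems vary")
phrases = candidate_phrases(text.split(), window=2)
```

On real collections the thresholds would need tuning, and a proper association measure (for example pointwise mutual information) would likely replace the simple ratio used here.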

Problems
Close words that aren’t phrases.
“the use of computers in science & technology”
Distant words that are phrases.
“preparation & evaluation of abstracts and extracts”

Parsing for phrases
Using parsers to identify noun phrases.
Make a phrase out of a head and the head of its modifiers.
“automatic analysis of scientific text”
[Parse diagram labels: ADJ, NOUN, PREP, NP, PP]
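The head-plus-modifier-head rule can be illustrated with a hand-built parse. The sketch below assumes a parser has already produced the structure; the `Node` class, the function name and the particular tree for the example phrase are assumptions for illustration (the preposition is simply dropped), not the output of any specific parser.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A phrase reduced to its head word plus the phrases that modify it."""
    head: str
    modifiers: list = field(default_factory=list)  # list of Node


def head_modifier_pairs(node):
    """Pair each head with the head of each of its modifiers, recursively."""
    pairs = []
    for mod in node.modifiers:
        pairs.append((mod.head, node.head))
        pairs.extend(head_modifier_pairs(mod))
    return pairs


# Assumed parse of "automatic analysis of scientific text":
# "analysis" is the head, modified by "automatic" and by the
# prepositional phrase whose head is "text" (itself modified by "scientific").
np_tree = Node("analysis", [Node("automatic"), Node("text", [Node("scientific")])])
pairs = head_modifier_pairs(np_tree)
```

Each resulting pair is a candidate index phrase; a real system would still need restrictions to filter out bogus pairs.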

Errors
Not a perfect rule by any means.
–Need restrictions to eliminate bogus phrases.
“automatic analysis of these four scientific texts”
[Parse diagram labels: ADJ, NOUN, PREP, NP, PP, DET, QUANT]

Do they work?
Fagan compared statistical with syntactic phrases; statistics won, but only just
–J. Fagan (1987) Experiments in phrase indexing for document retrieval: a comparison of syntactic & nonsyntactic methods, in TR Department of Computer Science, Cornell University
More research has been conducted.
–T. Strzalkowski (1995) Natural language information retrieval, in Information Processing & Management, Vol. 31, No. 3, pp

Check out TREC
Overview of the Seventh Text REtrieval Conference (TREC-7), E. Voorhees, D. Harman (National Institute of Standards and Technology)
–Ad hoc track
Fairly even between statistical phrases, syntactic phrases and no phrases.

Grammatical tagging?
Tag document text with grammatical codes?
–R. Garside (1987). The CLAWS word tagging system, in The computational analysis of English: a corpus based approach, R. Garside, G. Leech, G. Sampson Eds., Longman:
Doesn’t appear to work
–R. Sacks-Davis, P. Wallis, R. Wilkinson (1990). Using syntactic analysis in a document retrieval system that uses signature files, in Proceedings of 13th ACM SIGIR Conference:

Syntactic structure comparison
Has been tried…
–A. F. Smeaton & P. Sheridan (1991) Using morpho-syntactic language analysis in phrase matching, in Proceedings of RIAO ‘91, Pages
Method
–Parse sentences into tree structures
–When you get a phrase match
Look at linking syntactic operator.
Look at the residual tree structure that didn’t match
Does not appear to work

Semantic
Disambiguation
–Given a word appearing in a certain context, disambiguators will tell you what sense it is.
IR system
–Index document collections by senses rather than words
–Ask the users what senses the query words are
–Retrieve on senses
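A toy version of sense-based indexing might look like the following. Everything here is an assumption made for illustration: the tiny `SENSES` inventory, the sense labels, and the overlap-counting disambiguator (a much simplified Lesk-style heuristic) stand in for a real disambiguator and lexicon.

```python
from collections import defaultdict

# Hypothetical sense inventory: each sense is described by a few cue words.
SENSES = {
    "bank": {
        "bank/finance": {"money", "loan", "account"},
        "bank/river": {"river", "water", "shore"},
    },
}


def disambiguate(word, context):
    """Pick the sense whose cue words overlap most with the context."""
    senses = SENSES.get(word)
    if not senses:
        return word  # words without listed senses are indexed as themselves
    return max(senses, key=lambda s: len(senses[s] & set(context)))


def build_sense_index(docs):
    """Map each sense (or plain word) to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for tok in tokens:
            index[disambiguate(tok, tokens)].add(doc_id)
    return index


docs = {1: "the bank approved the loan", 2: "we fished from the river bank"}
index = build_sense_index(docs)
# A user who picks the sense "bank/river" retrieves index["bank/river"].
```

Retrieval then works on the sense labels chosen by the user rather than on the ambiguous surface word.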

Disambiguation
Does it work?
–No (well maybe)
M. Sanderson, Word sense disambiguation and information retrieval, in Proceedings of the 17th ACM SIGIR Conference, Pages , 1994
M. Sanderson & C.J. van Rijsbergen, The impact on retrieval effectiveness of skewed frequency distributions, in ACM Transactions on Information Systems (TOIS) Vol. 17 No. 4, 1999, Pages

Partial conclusions
NLP has yet to prove itself in IR
–Agree
–D.D. Lewis & K. Sparck-Jones (1996) Natural language processing for information retrieval, in Communications of the ACM (CACM) 1996 Vol. 39, No. 1,
–Sort of don’t agree
–A. Smeaton (1992) Progress in the application of natural language processing to information retrieval tasks, in The Computer Journal, Vol. 35, No. 3.

Mark’s idle speculation
What people think is going on always
[Diagram labels: Keywords, NLP]

Mark’s idle speculation
What’s usually actually going on
[Diagram labels: Keywords, NLP]

Areas where NLP does work
Systems with the following ingredients.
–Collection documents cover small domain.
–Language use is limited in some manner.
–User queries cover tight subject area.
–Documents/queries very short
Image captions
–LSI, pseudo-relevance feedback
–People willing to spend money getting NLP to work

RIME & IOTA
From Grenoble
–Y. Chiaramella & J. Nie (1990) A retrieval model based on an extended modal logic and its application to the RIME experimental approach, in Proceedings of the 13th SIGIR conference, Pages
Medical record retrieval system
Some database’y parts
Free text descriptions of cases

Indexing
“an opacity affecting probably the lung and the trachea”
[Tree diagram nodes: {[p], SGN}; {[bears-on], SGN}; {[and], SGN}; {[bears-on], SGN}; {[lung], LOC}; {[opacity], SGN}; {[trachea], LOC}]
LOC - localisation
SGN - observed sign

Retrieval
How do we match a user’s query to these structures?
–Using transformations - bit like logic.
[Diagram nodes: {[bears-on], SGN}; {[lung], LOC}; {[opacity], SGN}; after transformation: {[lung], LOC}, t; {[opacity], SGN}, t]
t - uncertainty

Tree transformation
[Diagram nodes: {[bears-on], SGN}; {[has-for-value], SGN}; {[lung], LOC}; {[opacity], SGN}; {[contour], SGN}; {[blurred], LOC}; after transformation: {[opacity], SGN}; {[has-for-value], SGN}, t; {[has-for-value], SGN}; {[contour], SGN}; {[blurred], LOC}]

Term transforms
Basic medical terms stored in a hierarchy.
–Transformations possible again with uncertainty added.
Level 1, Level 2, Level 3: tumour, cancer, sarcoma; hygroma; kyste, polykystosis; pseudokyst; polyp, polyposis
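One way to picture a term transform with added uncertainty is to walk up the hierarchy from a document term towards a query term and discount the match at each step. The sketch below is only an illustration: the parent links are an assumed reading of a few of the slide's terms, and the per-level discount of 0.8 and the function name are invented for the example, not taken from the RIME paper.

```python
# Assumed child -> parent links for a fragment of the hierarchy (hypothetical).
PARENT = {
    "sarcoma": "cancer",
    "cancer": "tumour",
    "polyposis": "polyp",
    "polyp": "tumour",
}

STEP_CERTAINTY = 0.8  # assumed discount applied per level climbed


def generalisation_certainty(doc_term, query_term):
    """Certainty that a document term satisfies a (more general) query term."""
    certainty = 1.0
    term = doc_term
    while term is not None:
        if term == query_term:
            return certainty
        term = PARENT.get(term)  # climb one level; None once past the root
        certainty *= STEP_CERTAINTY
    return 0.0  # the query term is not an ancestor of the document term


exact = generalisation_certainty("sarcoma", "sarcoma")
two_levels = generalisation_certainty("sarcoma", "tumour")
no_match = generalisation_certainty("tumour", "sarcoma")
```

An exact match keeps full certainty, a match two levels up is discounted twice, and a query for a more specific term than the document contains does not match at all in this sketch.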

Isn’t this a bit slow?
Yes
Optimisation
–Scan for potential documents.
–Process them intensively.
Evaluation?
–Not in that paper.
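The scan-then-process optimisation is a two-stage search: a cheap keyword pass selects a few candidate documents, and only those are handed to the expensive matcher. The following sketch assumes a caller-supplied `deep_match` function standing in for the intensive structural matching; the function names, the `max_candidates` cut-off and the sample documents are hypothetical.

```python
def two_stage_search(query_terms, docs, deep_match, max_candidates=10):
    """Cheaply shortlist documents, then score only the shortlist intensively."""
    query = set(query_terms)

    # Stage 1: cheap scan, keep documents sharing at least one query term.
    scored = []
    for doc_id, text in docs.items():
        overlap = len(query & set(text.lower().split()))
        if overlap:
            scored.append((overlap, doc_id))
    candidates = [doc_id for _, doc_id in sorted(scored, reverse=True)[:max_candidates]]

    # Stage 2: run the expensive matcher on the shortlist only.
    results = [(deep_match(query_terms, docs[doc_id]), doc_id) for doc_id in candidates]
    return [doc_id for score, doc_id in sorted(results, reverse=True) if score > 0]


def count_terms(query_terms, text):
    """Placeholder for the intensive matcher: counts query terms in the text."""
    words = text.lower().split()
    return sum(words.count(term) for term in query_terms)


docs = {
    "a": "opacity in the lung",
    "b": "fracture of the wrist",
    "c": "lung contour blurred",
}
ranking = two_stage_search(["lung", "opacity"], docs, count_terms)
```

The design choice is the usual trade-off: the first stage may miss documents the deep matcher would have accepted, in exchange for running the slow step on far fewer documents.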

Not unique
SCISOR
–P.S. Jacobs & L.F. Rau (1990) SCISOR: Extracting Information from On-line News, in Communications of the ACM (CACM), Vol. 33, No. 11, 88-97

Why do they work?
Because of the restrictions
–Small subject domain.
–Limited vocabulary.
–Restricted type of question.
Compare with large scale IR system.
–Keywords are good enough.
–Long time to set up.
–Hard to adapt to new domain.

Anything else for NLP?
Text Generation
–IR system explaining itself?

Conclusions
By now, you will be able to
–Outline a range of attempts to get NLP to work with IR systems
–Idly speculate on why they failed
–Describe the successful use of NLP in a limited domain