COMP3410 DB32: Technologies for Knowledge Management Lecture 7: Query Broadening to improve IR By Eric Atwell, School of Computing, University of Leeds.


COMP3410 DB32: Technologies for Knowledge Management Lecture 7: Query Broadening to improve IR By Eric Atwell, School of Computing, University of Leeds (including re-use of teaching resources from other sources, esp. Stuart Roberts, School of Computing, Univ of Leeds)

Module Objectives On completion of this module, students should be able to: … describe classical and emerging information retrieval techniques, and their relevance to knowledge management; …

Today's objectives First we look at a method for query broadening that requires input from the user; then we look at an automatic method for query broadening using a thesaurus. By the end of the lecture you should understand what a thesaurus, terminology bank and ontology are, and how they are used to broaden queries.

Some issues to be resolved
– Synonyms: football / soccer, tap / faucet. Search for one, find both?
– Homonyms: lead (metal or leash?), tap. Find both, only want one?
– Local/global contexts determine good terms: football articles won't mention the word football, and will give a particular meaning to the word goal.
– Precoordination (proximity query) for multi-word terms: Venetian blind vs blind Venetian.

Evaluation/Effectiveness measures
– effort: required by the users in formulating queries
– time: between receipt of a user query and production of the list of hits
– presentation: of the output
– coverage: of the collection
– recall: the fraction of relevant items retrieved
– precision: the fraction of retrieved items that are relevant
– user satisfaction: with the retrieved items
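Recall and precision, the two measures used throughout this lecture, can be computed directly from sets of document identifiers. A minimal sketch in Python (the document IDs and sets below are hypothetical, purely for illustration):

```python
# Hypothetical judgment data: which documents are truly relevant,
# and which ones the system actually returned.
relevant = {"d1", "d3", "d4", "d7"}   # all relevant items in the collection
retrieved = {"d1", "d2", "d4"}        # items the system returned

hits = relevant & retrieved           # relevant items that were retrieved

recall = len(hits) / len(relevant)       # fraction of relevant items retrieved
precision = len(hits) / len(retrieved)   # fraction of retrieved items relevant

print(recall, precision)  # → 0.5 0.6666666666666666
```

Note the tension the slide hints at: broadening a query tends to raise recall (more relevant items found) while risking precision (more noise retrieved).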

Better hits: Query Broadening A user unaware of the collection's characteristics is likely to formulate a naïve query. Query broadening aims to replace the initial query with a new one featuring one or both of:
– new index terms
– adjusted term weights
One method uses feedback information from the user; another method uses a thesaurus / term-bank / ontology.

Relevance Feedback From the response to the initial query, gather relevance information:
– H_R = R ∩ H = set of retrieved, relevant hits
– H_NR = H − R = set of retrieved, non-relevant hits
Replace query q with a replacement query q':
q' = α q + β Σ_{d_i ∈ H_R} d_i / |H_R| − γ Σ_{d_i ∈ H_NR} d_i / |H_NR|
Note: this moves the query vector closer to the centroid of the relevant retrieved document vectors, and further from the centroid of the non-relevant retrieved document vectors.
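This update is the classical Rocchio formula. A minimal sketch in Python (not lecture code; vectors are plain lists, and the default α, β, γ values are common illustrative choices, not prescribed by the slides):

```python
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: move q toward the centroid of the
    relevant hits and away from the centroid of the non-relevant hits."""
    n = len(q)

    def centroid(docs):
        # Mean vector of a set of document vectors; zero vector if empty.
        if not docs:
            return [0.0] * n
        return [sum(d[i] for d in docs) / len(docs) for i in range(n)]

    c_rel, c_non = centroid(relevant), centroid(nonrelevant)
    return [alpha * q[i] + beta * c_rel[i] - gamma * c_non[i] for i in range(n)]
```

Setting gamma to 0 gives the positive-feedback variant discussed below; negative component weights produced by the subtraction are sometimes clipped to zero in practice.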

Using terms from relevant documents We expect documents that are similar to one another in meaning (or usefulness) to have similar index terms. The system creates a replacement query q' based on q, but:
– adds index terms that have been used to index known relevant documents,
– increases the relative weight of index terms in q that are also found in relevant documents, and
– reduces the weight of terms found in non-relevant documents.

How does this help? It can help if documents were being missed because of the synonym problem. The user uses the word jam, but some recipes use jelly instead. Once a hit that uses jelly has been recognised as relevant, jelly will appear in the next version of the query. Now hits may use jelly but not jam. Conversely, it can help with the homonym problem. If the user wants references to lead (the metal), but gets documents relating to dog-walking, then by marking the dog-walking references as not relevant, keywords associated with dog-walking will be reduced in weight.

Pros and cons of feedback
– If γ is set to 0, non-relevant hits are ignored: a positive feedback system, often preferred.
– The feedback formula can be applied repeatedly, asking the user for relevance information at each iteration.
– Relevance feedback is generally considered to be very effective for high-use systems.
– One drawback is that it is not fully automatic.

Simple feedback example: T = {pudding, jam, traffic, lane, treacle}
d1 = (0.8, 0.8, 0.0, 0.0, 0.4)  recipe for jam pudding
d2 = (0.0, 0.0, 0.9, 0.8, 0.0)  DoT report on traffic lanes
d3 = (0.8, 0.0, 0.0, 0.0, 0.8)  recipe for treacle pudding
d4 = (0.6, 0.9, 0.5, 0.6, 0.0)  radio item on traffic jam in Pudding Lane
Display the first 2 documents that match the query q = (1.0, 0.6, 0.0, 0.0, 0.0).
Similarities: r = (0.91, 0.0, 0.6, 0.73)
Retrieved documents are:
d1: recipe for jam pudding (relevant)
d4: radio item on traffic jam (not relevant)
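The ranking r is cosine similarity between the query vector and each document vector. A short sketch (function and variable names are our own) reproduces the scores:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Term order: (pudding, jam, traffic, lane, treacle)
docs = {
    "d1": (0.8, 0.8, 0.0, 0.0, 0.4),  # recipe for jam pudding
    "d2": (0.0, 0.0, 0.9, 0.8, 0.0),  # DoT report on traffic lanes
    "d3": (0.8, 0.0, 0.0, 0.0, 0.8),  # recipe for treacle pudding
    "d4": (0.6, 0.9, 0.5, 0.6, 0.0),  # radio item on traffic jam in Pudding Lane
}
q = (1.0, 0.6, 0.0, 0.0, 0.0)

r = {name: round(cosine(q, d), 2) for name, d in docs.items()}
print(r)  # d1 and d4 score highest, so they are the two documents displayed
```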

Positive and Negative Feedback Suppose we set α and β to 0.5, and γ to 0.2:
q' = α q + β Σ_{d_i ∈ H_R} d_i / |H_R| − γ Σ_{d_i ∈ H_NR} d_i / |H_NR|
   = 0.5 q + 0.5 d1 − 0.2 d4
   = 0.5 (1.0, 0.6, 0.0, 0.0, 0.0) + 0.5 (0.8, 0.8, 0.0, 0.0, 0.4) − 0.2 (0.6, 0.9, 0.5, 0.6, 0.0)
   = (0.78, 0.52, −0.1, −0.12, 0.2)
(Note |H_R| = 1 and |H_NR| = 1.)
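The same arithmetic checked in Python (a sketch, not lecture code). Note how the negative feedback term drives the traffic and lane weights below zero, steering the query away from the non-relevant hit:

```python
# Term order: (pudding, jam, traffic, lane, treacle)
q  = (1.0, 0.6, 0.0, 0.0, 0.0)
d1 = (0.8, 0.8, 0.0, 0.0, 0.4)   # relevant hit
d4 = (0.6, 0.9, 0.5, 0.6, 0.0)   # non-relevant hit
alpha, beta, gamma = 0.5, 0.5, 0.2

# One relevant and one non-relevant hit, so |H_R| = |H_NR| = 1.
q_new = [round(alpha * q[i] + beta * d1[i] - gamma * d4[i], 2) for i in range(5)]
print(q_new)  # → [0.78, 0.52, -0.1, -0.12, 0.2]
```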

Simple feedback example, second iteration: T = {pudding, jam, traffic, lane, treacle}
d1 = (0.8, 0.8, 0.0, 0.0, 0.4), d2 = (0.0, 0.0, 0.9, 0.8, 0.0), d3 = (0.8, 0.0, 0.0, 0.0, 0.8), d4 = (0.6, 0.9, 0.5, 0.6, 0.0)
Display the first 2 documents that match the revised query q' = (0.78, 0.52, −0.1, −0.12, 0.2).
Similarities: r = (0.96, 0.0, 0.86, 0.63)
Retrieved documents are:
d1: recipe for jam pudding (relevant)
d3: recipe for treacle pudding (relevant)

Thesaurus A thesaurus or ontology may contain:
– a controlled vocabulary of terms or phrases describing a specific restricted topic,
– synonym classes,
– a hierarchy defining broader terms (hypernyms) and narrower terms (hyponyms),
– classes of related terms.
A thesaurus or ontology may be:
– generic (such as Roget's thesaurus, or WordNet), or
– specific to a certain domain of knowledge, e.g. medical.

Language normalisation By replacing words from documents and query words with synonyms from a controlled language, we can improve precision and recall.
(Diagram: documents undergo content analysis, yielding uncontrolled keywords; the thesaurus maps these to index terms; a user query is likewise turned into a normalised query, which is matched against the index terms.)

Thesaurus / Ontology construction
– Include terms likely to be of value in content analysis.
– For each term, form classes of related words (separate classes for synonyms, hypernyms, hyponyms).
– Form separate classes for each relevant meaning of the word.
– Terms in a class should occur with roughly equal frequency (not easy: natural-language word frequencies follow Zipf's law).
– Avoid high-frequency terms.
Construction involves some expert judgment that will not be easy to automate.
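The frequency guidelines above can be sketched as a simple filter over corpus counts (the thresholds and tokens here are invented purely for illustration):

```python
from collections import Counter

def candidate_terms(tokens, low=2, high=50):
    """Keep terms whose corpus frequency falls in a mid-range band:
    very rare terms are unreliable evidence, and very frequent terms
    (the head of the Zipf distribution) discriminate poorly."""
    freq = Counter(tokens)
    return sorted(t for t, n in freq.items() if low <= n <= high)

# Tiny hypothetical corpus: "the" is too frequent, "figurer" too rare.
tokens = ["the"] * 100 + ["computer"] * 12 + ["reckoner"] * 3 + ["figurer"]
print(candidate_terms(tokens))  # → ['computer', 'reckoner']
```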

Example thesaurus A public-domain thesaurus (WordNet) is available from:
/home/cserv1_a/staff/nlplib/WordNet/2.0
/home/cserv1_a/staff/extras/nltk/1.4.2/corpora/wordnet
Synonyms of computer (sense 1): computer, data processor, electronic computer, information processing system

Example thesaurus Synonyms of computer (sense 2): computer, calculator, reckoner, figurer, estimator

Terminology (from WordNet Help) A hypernym is the generic term used to designate a whole class of specific instances: Y is a hypernym of X if X is a (kind of) Y. A hyponym is the term used to designate a member of a class: X is a hyponym of Y if X is a (kind of) Y. Coordinate words are words that have the same hypernym. Hypernym synsets are preceded by "->", and hyponym synsets are preceded by "=>".

Hypernyms Sense 1: computer, data processor, electronic computer, information processing system
-> machine -> device -> instrumentality, instrumentation -> artifact, artefact -> object, physical object -> entity, something

Hyponyms Sense 1: computer, data processor, electronic computer, information processing system
=> analog computer, analogue computer
=> digital computer
=> node, client, guest
=> number cruncher
=> pari-mutuel machine, totalizer, totaliser, totalizator, totalisator
=> server, host

Coordinate terms Sense 1: computer, data processor, electronic computer, information processing system
-> machine
=> assembly
=> calculator, calculating machine
=> calendar
=> cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM
=> computer, data processor, electronic computer, information processing system
=> concrete mixer, cement mixer
=> corker
=> cotton gin, gin
=> decoder
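Coordinate terms can be computed from any hypernym table by finding other terms that share a parent. A toy sketch (a tiny hand-made fragment in WordNet's spirit, not the real database):

```python
# Toy hypernym map: each term points to its immediate hypernym.
hypernym = {
    "computer": "machine",
    "calculator": "machine",
    "cash machine": "machine",
    "machine": "device",
}

def coordinates(term):
    """Coordinate terms: other terms sharing the same hypernym."""
    parent = hypernym.get(term)
    return sorted(t for t, p in hypernym.items() if p == parent and t != term)

print(coordinates("computer"))  # → ['calculator', 'cash machine']
```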

Thesaurus use
– Replace a term in a document and/or query with a term in the controlled language.
– Replace a term in the query with related or broader terms, to increase recall.
– Suggest narrower terms to the user, to increase precision.
(Diagram: both the document term and the query term are mapped by the thesaurus to computer (sense 1), so document and query match.)

Thesaurus use (continued) Replacing a query term with related or broader terms: the broadened query matches a larger subset of the whole collection, increasing recall.
(Diagram: the thesaurus maps the original query to a broader query, which matches more of the collection.)

Thesaurus use (continued) Suggesting narrower terms: the user narrows the query (e.g. client rather than computer), so fewer but more precise matches are retrieved from the collection, increasing precision.
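The recall-oriented broadening described above can be sketched with synonym classes (the thesaurus here is a hand-made toy, not a real resource):

```python
# Toy synonym thesaurus: each entry maps a term to its synonym class.
synonyms = {
    "jam": {"jam", "jelly", "preserve"},
    "football": {"football", "soccer"},
}

def broaden(query_terms):
    """Replace each query term by its whole synonym class to raise recall;
    terms not in the thesaurus pass through unchanged."""
    broadened = set()
    for t in query_terms:
        broadened |= synonyms.get(t, {t})
    return sorted(broadened)

print(broaden(["jam", "recipe"]))  # → ['jam', 'jelly', 'preserve', 'recipe']
```

This is the automatic counterpart of the jam/jelly relevance-feedback example earlier: the synonym class supplies jelly without waiting for the user to mark a jelly hit as relevant.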

Key points
– A thesaurus or ontology can be used to normalise a vocabulary and queries (or documents?).
– It can be used (with some human intervention) to increase recall and precision.
– A generic thesaurus/ontology may not be effective for specialised collections and/or queries.
– Semi-automatic construction of a thesaurus/ontology based on the retrieved set of documents has produced some promising results.