University of Malta CSA3080: Lecture 6 © 2003- Chris Staff 1 of 20 CSA3080: Adaptive Hypertext Systems I Dr. Christopher Staff Department.

Presentation transcript:

University of Malta CSA3080: Lecture 6 © Chris Staff 1 of 20 CSA3080: Adaptive Hypertext Systems I Dr. Christopher Staff Department of Computer Science & AI University of Malta Lecture 6: Information Retrieval II

Aims and Objectives
Statistical Model of IR

Aims and Objectives
Once we know what an AHS user’s interests are, we can find relevant information in the document collection
  – Guide user along path
  – Show relevant document to user
Boolean/Extended Boolean models have some limitations
Statistical model may provide advantages

Precision and Recall
What is relevance?
How do we measure performance?
Recall: percentage of the relevant docs that are retrieved
Precision: percentage of the retrieved docs that are relevant
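The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name `precision_recall` and the document ids are hypothetical, and relevance judgements are assumed to be available as a set.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                  # relevant docs that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d1", "d2", "d3", "d7", "d8", "d9"})
print(p, r)  # 0.75 0.5
```

Note that recall needs the full set of relevant documents in the collection, which is usually unknown in practice and has to be estimated.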

Boolean Model: Problems
Blair & Maron, 1985, “An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System”
Death-knell for pure Boolean approach
Evaluated IBM’s STorage And Information Retrieval System (STAIRS)
STAIRS used to index 40,000 legal documents representing c. 350,000 pages of text

Boolean Model: Problems
Goal: to retrieve all and only those documents that are relevant to a given request for information
Lawyers who made requests wanted at least 75% of relevant documents
Retrieval effectiveness discovered to be poor

Boolean Model: Problems
Lawyers would make request for information
Paralegals familiar with case and trained to use STAIRS would search for relevant documents
Lawyers would rate docs “vital”, “satisfactory”, “marginally relevant”, “irrelevant”
Lawyers could modify query
Iteration stops when lawyer signs that 75% of relevant docs have been seen

Boolean Model: Problems
Results:
  – Precision on average 79.0%
  – Recall on average only 20%!

Boolean Model: Problems
Why?
  – Mismatch between terminology used by lawyers/paralegals and authors of documents
  – Spelling mistakes in documents
  – Use of slang and indirect reference

Extended/Boolean Methods: other problems
The Vocabulary Problem
  – Furnas, et al, 1987, “The Vocabulary Problem in Human-System Communication”
  – “Armchair” naming of objects / concepts very inaccurate. Only c. 20% chance of two randomly selected people using the same name to refer to the same object/concept!
  – Implications for information retrieval
  – Why it appears to be a non-problem for Web-based systems

Extended/Boolean Methods: other problems
Boolean and Extended Boolean models require a document to satisfy a query by containing the terms as specified in the query
Document representation is independent of other documents in the collection
No way of indicating which terms are more significant than others in the query
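The limitations listed above can be seen in a toy Boolean AND matcher. This is a sketch under simple assumptions (whitespace tokenisation, lower-casing, no stemming); the function name `boolean_and` and the sample documents are hypothetical and are not taken from STAIRS or from the lecture.

```python
def boolean_and(query_terms, docs):
    """Return ids of documents containing every query term (pure Boolean AND)."""
    wanted = {t.lower() for t in query_terms}
    return sorted(doc_id for doc_id, text in docs.items()
                  if wanted <= set(text.lower().split()))


docs = {
    "d1": "the contract was signed by both parties",
    "d2": "the agreement was signed last year",
    "d3": "contract contract contract signed signed",
}
print(boolean_and(["contract", "signed"], docs))  # ['d1', 'd3']
```

Document d2 is missed because it says "agreement" rather than "contract", and d1 and d3 come back as an unordered match set even though d3 uses the query terms far more often. The query also has no way to say that "contract" matters more than "signed".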

Extended/Boolean Methods: problems
Relevance feedback
  – RF is an important tool, much underutilised in “popular” search engines
  – Users not always able to describe need fully
      But can always recognise a relevant document!
  – After initial query, mark documents in the results set as relevant or non-relevant
  – Let the IR system re-compute the query!
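The slide does not say how the query is re-computed. One widely used approach for weighted term vectors is the Rocchio formula, sketched below as an illustration only. The function name `rocchio`, the example vectors, and the constants `alpha`, `beta` and `gamma` are assumptions made for this sketch, not values from the lecture.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Re-weight a query vector (dict term -> weight) from user-marked documents.

    relevant / non_relevant are lists of document vectors (dict term -> weight).
    The constants are illustrative defaults, not values from the lecture.
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms |= set(doc)

    new_query = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(t, 0.0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if weight > 0:               # terms with non-positive weight are dropped
            new_query[t] = weight
    return new_query


# Hypothetical feedback round: the user marks one result relevant, one not.
updated = rocchio({"jaguar": 1.0},
                  relevant=[{"jaguar": 1.0, "car": 1.0}],
                  non_relevant=[{"jaguar": 1.0, "cat": 1.0}])
print(sorted(updated))
```

After one round the query gains the term "car", which the user never typed, while "cat" is excluded. This is the sense in which the user only has to recognise a relevant document rather than describe the need fully.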

Statistical Model of IR
For a given term, which documents are statistically most likely to be about the term?
How does the co-occurrence of terms affect the relevance of the document?
Reference:
  – G. Salton and C. Buckley. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):

Statistical Model of IR
A document is considered to be relevant to a query if it is similar enough
A similarity measure calculates how close a query and a document representation are when both are plotted into vector space (the measure used later in this lecture is the cosine of the angle between the two vectors, rather than the Euclidean distance between them)
Relevant documents can be ranked in descending order of similarity

Statistical Model of IR
Boolean model used simply presence or absence of a term in a document
Extended Boolean model used other term features, including term frequency to rank relevant documents
Statistical model also uses distribution of term in collection: document frequency
  – Size of collection / DF (inverse DF)
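Document frequency and its inverse can be computed as below. The sketch follows the ratio given on the slide (size of collection / DF); many weighting schemes, including those in Salton and Buckley, take the logarithm of this ratio instead. The function names and the four-document collection are hypothetical.

```python
def document_frequency(docs):
    """Map each term to the number of documents it occurs in."""
    df = {}
    for text in docs.values():
        for term in set(text.lower().split()):   # count each doc once per term
            df[term] = df.get(term, 0) + 1
    return df


def inverse_document_frequency(docs):
    """IDF as on the slide: collection size / DF (no logarithm applied)."""
    n = len(docs)
    return {term: n / count for term, count in document_frequency(docs).items()}


docs = {
    "d1": "hypertext systems adapt",
    "d2": "adaptive hypertext",
    "d3": "retrieval systems",
    "d4": "hypertext retrieval",
}
idf = inverse_document_frequency(docs)
print(idf["hypertext"], idf["systems"], idf["adapt"])
```

A term that occurs in most documents ("hypertext") gets a low IDF and so discriminates poorly between documents, while a rare term ("adapt") gets a high one.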

Statistical Model of IR
The term weight is:
  – Term frequency x Inverse Document Frequency
Also normalise term weight, so that length of document is taken into account
  – DL(j): no. of terms in document j
  – NDL(j): DL(j) / (Average document length)
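The slide defines NDL(j) but does not show how it enters the weight. The sketch below assumes the simplest option, dividing the term frequency by NDL(j) before multiplying by IDF. That choice, the function name `tf_idf_weights`, and the sample documents are assumptions for illustration only. The collection is assumed to be non-empty with no empty documents.

```python
def tf_idf_weights(docs):
    """Return {doc_id: {term: weight}} with weight = (TF / NDL) * IDF.

    Dividing TF by NDL is an assumed normalisation, not the lecture's formula.
    """
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n = len(tokens)
    avg_len = sum(len(terms) for terms in tokens.values()) / n

    df = {}
    for terms in tokens.values():
        for term in set(terms):
            df[term] = df.get(term, 0) + 1

    weights = {}
    for doc_id, terms in tokens.items():
        ndl = len(terms) / avg_len        # NDL(j) = DL(j) / average document length
        weights[doc_id] = {
            term: (terms.count(term) / ndl) * (n / df[term])   # IDF = N / DF
            for term in set(terms)
        }
    return weights


w = tf_idf_weights({"d1": "retrieval retrieval model", "d2": "model"})
print(w["d1"]["retrieval"], w["d1"]["model"], w["d2"]["model"])
```

In this example "model" occurs once in each document, but its weight is higher in the short document d2 than in the longer d1, which is the effect length normalisation is meant to have.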

Statistical Model of IR
Cosine Similarity Measure:
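The slide's formula is not reproduced in this transcript. The standard cosine measure divides the dot product of the query and document weight vectors by the product of their lengths. A minimal sketch follows, with vectors stored as dictionaries from term to weight; the function names `cosine_similarity` and `rank` and the example weights are hypothetical.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two sparse vectors (dict term -> weight)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0                     # an empty vector is similar to nothing
    return dot / (norm_q * norm_d)


def rank(query, doc_vectors):
    """Return (doc_id, similarity) pairs in descending order of similarity."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


query = {"hypertext": 1.0, "retrieval": 1.0}
doc_vectors = {
    "d1": {"hypertext": 1.0, "retrieval": 1.0},
    "d2": {"hypertext": 1.0, "systems": 1.0},
    "d3": {"systems": 1.0},
}
ranking = rank(query, doc_vectors)
print([doc_id for doc_id, _ in ranking])
```

Because the measure depends only on the angle between the vectors, a long document and a short one with the same proportions of terms score the same. Unlike the Boolean model, a partial match such as d2 still receives a non-zero score.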

Statistical Model of IR
Can now rank documents according to similarity
Can also support relevance feedback in iterative retrieval
Relevance feedback can help AHS determine unspecified significant terms that also indicate user interests

Statistical Model of IR
Disadvantage that same document in different collections can have different IDF, which affects term weight
Modern approaches use statistical language models to use the likelihood of occurrence in the language, rather than in the document collection
Reference:
  – Djoerd Hiemstra and Franciska de Jong, (19??), Statistical Language Models and Information Retrieval: natural language processing really meets retrieval.
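The collection dependence of IDF can be shown with one term and two small collections. The sketch uses the ratio N / DF from earlier in the lecture; the function name `idf` and both collections are hypothetical.

```python
def idf(term, collection):
    """Collection size / DF for one term; None if the term never occurs."""
    df = sum(1 for text in collection if term in text.lower().split())
    return len(collection) / df if df else None


legal = ["contract law", "contract dispute", "contract breach", "tort law"]
general = ["contract law", "football results", "weather report", "film review"]

# The document "contract law" appears in both collections, but the weight
# of its term "contract" depends on what else is in each collection.
print(idf("contract", legal), idf("contract", general))
```

In the legal collection "contract" is common and gets a low IDF, while in the general collection it is rare and gets a high one, so the same document is weighted differently depending on where it is indexed.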

Conclusion
Statistical model of IR yields improvements over Boolean/Extended Boolean, although it is still not popular for Web-based search
  – Why?
Many approaches to adaptation use statistical evidence (e.g., Amazon)
Will investigate other models in CSA4080