Probabilistic Information Retrieval


Probabilistic Information Retrieval
CSE6392 - Database Exploration
Gautam Das, Thursday, March 29, 2006
Z.M. Joseph, Spring 2006, CSE, UTA

Basic Rules of Probability
Recall the product rule: P(A, B) = P(A|B) P(B) = P(B|A) P(A)
Bayes' Theorem: P(A|B) = P(B|A) P(A) / P(B)
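A minimal Python sketch of these two rules on made-up numbers (the values of P(A), P(B|A), and P(B) below are arbitrary illustrative choices, not from the slides):

```python
# Toy check of the product rule and Bayes' theorem with made-up probabilities.
P_A = 0.3          # P(A)
P_B_given_A = 0.8  # P(B | A)
P_B = 0.5          # P(B)

# Product rule: P(A, B) = P(B | A) * P(A)
P_AB = P_B_given_A * P_A

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
P_A_given_B = P_B_given_A * P_A / P_B

print(f"P(A, B)  = {P_AB:.3f}")         # 0.240
print(f"P(A | B) = {P_A_given_B:.3f}")  # 0.480
```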

Basic Assumptions
Assume a database D consisting of a set of objects: documents, tuples, etc.
Q : query
R : 'relevant set' of objects for Q
The goal is to find R for each Q, given D. Instead of a deterministic answer set, consider a probabilistic ordering: the ranking/scoring function should reflect the degree of relevance of each document. Thus, given a document d:
Score(d) = P(R|d)   [1]
According to this, if the relevance set were known exactly, the members of R would have probability 1 (the maximum score) and all other documents would have probability 0.

Simplification
From [1], take the ratio of the probability that the document is relevant to the probability that it is not:
Score(d) = P(R|d) / P(NR|d)
where NR denotes the non-relevant set D - R. Since the odds are a monotone function of P(R|d), this retains the original ordering, while explicitly factoring in the objects outside R that are still part of D.
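A small illustrative check (with invented relevance probabilities) that ranking by the odds P(R|d) / P(NR|d) produces the same order as ranking by P(R|d) itself, since the odds are monotone in the probability:

```python
# Illustrative relevance probabilities for three hypothetical documents.
p_relevant = {"d1": 0.9, "d2": 0.6, "d3": 0.2}

def odds(p):
    # Odds of relevance: P(R|d) / P(NR|d) = p / (1 - p)
    return p / (1.0 - p)

by_probability = sorted(p_relevant, key=lambda d: p_relevant[d], reverse=True)
by_odds = sorted(p_relevant, key=lambda d: odds(p_relevant[d]), reverse=True)

print(by_probability)  # ['d1', 'd2', 'd3']
print(by_odds)         # ['d1', 'd2', 'd3'] -- same ordering
```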

Applying Bayes' Theorem
Applying Bayes' theorem to both the numerator and the denominator:
Score(d) = P(R|d) / P(NR|d) = [P(d|R) P(R)] / [P(d|NR) P(NR)]
Since P(R) and P(NR) do not depend on the particular document d, ranking by P(d|R) / P(d|NR) produces the same ordering.

Observations
This forms the scoring function. The equation still involves R, which we do not know in advance, but using it as a scoring function still produces the same ordering.

Derivation for Keyword Queries
Now assume the query is a vector of words, with zero probability contributed by a word that does not occur in the document. Assuming the query words occur independently, applying the previous equation to each word w (instead of to the document as a whole) and combining over all the words of the query gives:
Score(d) = ∏ over query words w present in d of [ P(w|R) / P(w|D) ]
where the non-relevant set is approximated by the whole database D, since almost all of D is non-relevant.

Search for “Microsoft Corporation”
For this query the expression becomes:
Score(d) = [P(Microsoft|R) / P(Microsoft|D)] × [P(Corporation|R) / P(Corporation|D)], restricted to the query words that actually appear in d.
Assume there are two documents:
D1 : contains 'Microsoft' but not 'Corporation'
D2 : contains 'Corporation' but not 'Microsoft'
Thus:
Score(D1) ∝ P(Microsoft|R) / P(Microsoft|D)
Score(D2) ∝ P(Corporation|R) / P(Corporation|D)

Search for “Microsoft Corporation”
Because 'Corporation' is a more common word in the database D, P(Corporation|D) will be far higher than P(Microsoft|D), so Score(D1) will be higher than Score(D2). The document that contains 'Microsoft' therefore gets the higher ranking, because 'Microsoft' is more specific (rarer) than the word 'Corporation'. This is similar to vector space ranking by relevance, where rarer terms carry more weight.
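Here is a rough Python sketch of this example over an invented toy collection; the document texts and the choice to hold P(w|R) constant for every query word are assumptions made for illustration, not part of the slides:

```python
# Invented toy database: 'corporation' is deliberately much more common
# than 'microsoft', mimicking the situation described in the slides.
database = [
    "microsoft windows release notes",   # D1: 'microsoft' but not 'corporation'
    "acme corporation annual report",    # D2: 'corporation' but not 'microsoft'
    "corporation tax filing guide",
    "global corporation merger news",
]
query = ["microsoft", "corporation"]

doc_words = [set(doc.split()) for doc in database]

# P(w|D) estimated as the fraction of documents containing word w.
p_w_given_D = {w: sum(w in d for d in doc_words) / len(doc_words) for w in query}

def score(doc_index, p_w_given_R=0.5):
    # Product over query words present in the document of P(w|R) / P(w|D).
    # P(w|R) is unknown here, so it is held constant (an assumption), which
    # leaves the rarity of each word in D to drive the ranking.
    s = 1.0
    for w in query:
        if w in doc_words[doc_index]:
            s *= p_w_given_R / p_w_given_D[w]
    return s

for i, doc in enumerate(database):
    print(f"{doc!r}: score = {score(i):.3f}")
# The D1-like document scores highest because it contains the rare word
# 'microsoft' (low P(w|D)), matching the slide's conclusion.
```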

Relevance Feedback
R can be fine-tuned iteratively by asking the user which of the initially ranked results are relevant to the query. Once a better estimate of R is known, better scoring and ranking of matches is possible.
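A rough sketch of how feedback might update the term statistics, assuming P(w|R) is estimated as the smoothed fraction of user-marked relevant documents that contain w; the example documents and the smoothing constant are invented:

```python
def estimate_p_w_given_R(word, relevant_docs, smoothing=0.5):
    # Estimate P(w|R) from feedback: the smoothed fraction of the documents
    # the user marked as relevant that contain the word.
    contains = sum(word in doc.split() for doc in relevant_docs)
    return (contains + smoothing) / (len(relevant_docs) + 1.0)

# Suppose the user marked these two results as relevant (invented examples).
feedback = [
    "microsoft windows release notes",
    "microsoft corporation headquarters",
]

for w in ["microsoft", "corporation"]:
    print(w, round(estimate_p_w_given_R(w, feedback), 3))
# 'microsoft' now gets a higher P(w|R) than 'corporation', so documents
# containing it are pushed further up when the query is re-scored.
```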

PIR Applied to Databases
Originally, PIR was applied to documents, not to databases. Applying PIR to databases is not easy, because several aspects are difficult to capture. One is the different values an attribute can take: PIR is based on the words in a document, and in a database the fact that a car is blue, black, etc. is not as easily captured. Would you assign each colour value its own keyword? Another is deciding what to sacrifice in the ranking: if a user's preference is black cars, how is PIR applied when listing results that do not match the query entirely?
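One possible workaround, sketched below, is to treat each attribute-value pair of a tuple as a pseudo-keyword so the document-oriented machinery can be reused; the relation, attribute names, and values here are hypothetical and not from the slides:

```python
# Hypothetical car tuples; in practice these would come from a relation.
cars = [
    {"make": "honda", "color": "black", "type": "sedan"},
    {"make": "ford",  "color": "blue",  "type": "suv"},
    {"make": "honda", "color": "blue",  "type": "sedan"},
]

def tuple_to_terms(row):
    # Represent a tuple as a set of attribute=value "keywords" so that
    # document-style scoring (e.g. the product formula above) can be applied.
    return {f"{attr}={value}" for attr, value in row.items()}

query_terms = {"color=black", "type=sedan"}

for row in cars:
    terms = tuple_to_terms(row)
    print(row, "matches:", query_terms & terms)
# Tuples that miss some query terms (e.g. blue sedans for a black-sedan
# query) still share terms with it, which is what a probabilistic ranking
# over these pseudo-keywords would exploit instead of returning nothing.
```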