CS347 Lecture 5 April 23, 2001 ©Prabhakar Raghavan.



Today's topics
Generalized query operators
– Evidence accumulation and structured queries
Basics of Bayesian networks
– Bayesian nets for text retrieval
Structured + unstructured queries
– Adding database-like queries

A step back
User has an information need
Translates the need to a query
– Translation depends on the IR system's syntax + the user's sophistication
System presents docs meeting the query
– User reviews, browses, navigates, updates the query
Key: the system's model of query-doc proximity

Models of query-doc proximity
Boolean queries
– Doc is either in or out for the query
Vector spaces
– Doc has non-negative proximity to the query
Evidence accumulation
– Combine scores from multiple sources
Bayesian nets and probabilistic methods
– Infer the probability that a doc meets the user's information need

Evidence accumulation
View each term in the query as providing partial evidence of a match
tf·idf + vector-space retrieval is one example
– Corpus-dependent (idf depends on the corpus)
In some situations corpus-dependent evidence is undesirable

Corpus-independent evidence
When is corpus-independent scoring useful?
– When corpus statistics are hard to maintain
  Distributed indices (more later)
  Rapidly changing corpora
– When stable scores are desired
  Users get used to issuing a search and seeing a doc with a score of (say) 0.9
  The user subsequently filters by score
– "Show me only docs with score at least 0.9"

Corpus-independent scoring
Document routing is a key application
– There is a list of standing queries
  e.g., "bounced check" in a bank's customer service department
– Each incoming doc is scored against all standing queries
– Routed to a destination (customer specialist) based on the best-scoring standing query
More on this with automatic classification
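A minimal sketch of the routing idea above: score each incoming document against every standing query and send it to the best-scoring one. The standing queries, their term weights, and the sample document are all invented for illustration; any corpus-independent score would do in place of the weighted term counts used here.

```python
# Document routing sketch: route each incoming doc to the standing query
# it scores highest against.  Queries, weights, and docs are made up.
standing_queries = {
    "bounced-check": {"bounced": 1.0, "check": 1.0, "overdraft": 0.5},
    "card-fraud": {"card": 1.0, "fraud": 1.0, "unauthorized": 0.8},
}

def score(doc_tokens, query_weights):
    # Corpus-independent: depends only on this doc's own term counts.
    counts = {}
    for t in doc_tokens:
        counts[t] = counts.get(t, 0) + 1
    return sum(w * counts.get(term, 0) for term, w in query_weights.items())

def route(doc_tokens):
    # Destination = standing query with the best score for this doc.
    return max(standing_queries, key=lambda q: score(doc_tokens, standing_queries[q]))

print(route("my check bounced and now I have an overdraft fee".split()))
# → bounced-check
```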

Typical corpus-independent score
Use a convex function of tf_ij
– e.g., Score(i, j) = 1 − exp(−a · tf_ij)
– a is a tuning constant
– this gives the contribution of query term i for doc j
Given a multi-term query, compute the average contribution over all query terms
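The scoring function above can be sketched directly; the tuning constant a = 0.3 is an arbitrary choice for illustration:

```python
import math

def term_contribution(tf, a=0.3):
    # Convex, saturating contribution of one query term: 1 - exp(-a * tf).
    # tf = 0 gives 0; large tf approaches (but never exceeds) 1.
    return 1 - math.exp(-a * tf)

def doc_score(tfs):
    # Multi-term query: average the per-term contributions over all
    # query terms (absent terms contribute 0).
    return sum(term_contribution(tf) for tf in tfs) / len(tfs)
```

Because no idf appears, the score of a doc for a query depends only on that doc's own term frequencies, so it stays stable as the corpus changes.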

Bayesian Networks for Text Retrieval
Text retrieval
– Find the best set of documents that satisfies a user's information need
Bayesian network
– Models causal relationships between events
– Infers the belief that an event holds based on observations of other events

What is a Bayesian network?
A directed acyclic graph
Nodes
– Events or variables
– Assume values; for our purposes, all Boolean
Links
– Model dependencies between nodes

Toy Example
Nodes: Finals (f), Project Due (d), No Sleep (n), Gloom (g), Triple Latte (t)
Links: f → n; f, d → g; g → t

Links as dependencies
Link matrix
– Attached to each node; gives the influence of the parents on that node
– Nodes with no parent get a "prior probability" (e.g., f, d)
– Interior node: conditional probability over all combinations of values of its parents (e.g., n, g, t)

Independence Assumption
Variables not connected by a link: no direct conditioning
Joint probability is obtained from the link matrices
See the examples on the next slide

Independence Assumption
Independence assumption: P(t | g, f) = P(t | g)
Joint probability:
P(f, d, n, g, t) = P(f) P(d) P(n | f) P(g | f, d) P(t | g)
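The factorization above can be checked numerically. The sketch below assigns invented link-matrix probabilities to the toy network (every number is made up for illustration) and verifies that the product of the per-node factors yields a proper distribution:

```python
from itertools import product

# Joint probability of the toy network, factored along the link matrices:
# P(f, d, n, g, t) = P(f) P(d) P(n|f) P(g|f,d) P(t|g).
P_f_true = 0.1                         # prior: finals
P_d_true = 0.2                         # prior: project due
P_n_given_f = {True: 0.8, False: 0.1}  # P(n=True | f)
P_g_given_fd = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.7, (False, False): 0.05}  # P(g=True | f, d)
P_t_given_g = {True: 0.9, False: 0.2}  # P(t=True | g)

def bern(p_true, value):
    # P(X = value) for a Boolean variable with P(X = True) = p_true.
    return p_true if value else 1 - p_true

def joint(f, d, n, g, t):
    # One factor per node, each conditioned only on that node's parents.
    return (bern(P_f_true, f) * bern(P_d_true, d)
            * bern(P_n_given_f[f], n)
            * bern(P_g_given_fd[(f, d)], g)
            * bern(P_t_given_g[g], t))

# Sanity check: the 2^5 assignments form a probability distribution.
total = sum(joint(*vals) for vals in product([True, False], repeat=5))
```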

Chained inference
Evidence: a node takes on some value
Inference
– Compute beliefs (probabilities) of other nodes conditioned on the known evidence
– Two kinds of inference: diagnostic and predictive
Computational complexity
– General networks: NP-hard
– Polytree networks: tractable

Key inference tool
Bayes' theorem: for any two events a, c
P(a | c) = P(c | a) P(a) / P(c)
Implies, for instance: P(a | c) P(c) = P(c | a) P(a)

Diagnostic Inference
Propagate beliefs through the parents of a node
Inference rule: (diagram over nodes b_i, a, c not reproduced in the transcript)

Diagnostic inference
Evidence: n = true
Belief: P(f | n) = ?

Diagnostic inference
Apply Bayes' theorem, then normalize
(worked equations not reproduced in the transcript)
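A worked version of this diagnostic step, using invented prior and link-matrix numbers for the toy network: observe n = True (no sleep) and infer P(f | n) by Bayes' theorem followed by normalization.

```python
# Diagnostic inference: observe n = True, infer P(f | n).
# The prior and link-matrix numbers are invented for illustration.
P_f_true = 0.1                         # prior P(f = True)
P_n_given_f = {True: 0.8, False: 0.1}  # P(n = True | f)

# Unnormalized posterior: P(n | f) * P(f) for each value of f.
unnorm = {
    True: P_n_given_f[True] * P_f_true,
    False: P_n_given_f[False] * (1 - P_f_true),
}
Z = sum(unnorm.values())               # normalizing constant = P(n)
posterior = {f: p / Z for f, p in unnorm.items()}
print(posterior[True])  # ~0.47: sleeplessness raises belief in finals above the 0.1 prior
```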

Predictive Inference
Compute the belief of the child nodes of the evidence
Inference rule: (diagram over nodes b_i, a, c not reproduced in the transcript)
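A worked predictive step on the toy network, again with invented numbers: observe f = True (finals) and propagate belief downward, first marginalizing the unobserved parent d out of P(g | f), then on to t.

```python
# Predictive inference: observe f = True, compute beliefs in the
# descendants g (gloom) and t (triple latte).  Numbers are invented.
P_d_true = 0.2
P_g_given_fd = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.7, (False, False): 0.05}  # P(g=True | f, d)
P_t_given_g = {True: 0.9, False: 0.2}                      # P(t=True | g)

# Marginalize the unobserved parent d out of P(g = True | f = True):
p_g = (P_d_true * P_g_given_fd[(True, True)]
       + (1 - P_d_true) * P_g_given_fd[(True, False)])

# Propagate one more level down to t:
p_t = p_g * P_t_given_g[True] + (1 - p_g) * P_t_given_g[False]
print(round(p_g, 3), round(p_t, 3))  # → 0.918 0.843
```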

Model for Text Retrieval
Goal
– Given a user's information need (the evidence), find the probability that a doc satisfies the need
Retrieval model
– Model the docs in a document network
– Model the information need in a query network

Bayesian nets for text retrieval
Document network: documents (d_i) linked to terms (r_i)
Query network: concepts (c_i), query operators (q_i: AND/OR/NOT), and the information need (i)

Link matrices and probabilities
Prior doc probability P(d) = 1/n
P(r | d)
– within-document term frequency
– tf·idf-based
P(c | r)
– 1-to-1
– thesaurus
P(q | c): canonical forms of query operators

Example
(diagram not reproduced in the transcript: a document network over Hamlet and Macbeth with terms "reason", "double", "two", "trouble", and a query network applying OR and NOT operators to the user query)

Extensions
Prior probabilities don't have to be 1/n
"User information need" doesn't have to be a query; it can be words typed in, docs read, any combination …
Link matrices can be modified over time
– User feedback: the promise of "personalization"

Computational details
Document network built at indexing time
Query network built/scored at query time
Representation:
– Link matrices from docs to any single term are like the postings entry for that term

Exercise
Consider ranking docs for a 1-term query. What is the difference between
– a cosine-based vector-space ranking where each doc has normalized tf·idf components;
– a Bayesian net in which the link matrices on the doc-to-term links are normalized tf·idf?

Semi-structured search
Structured search: search by restricting on attribute values, as in databases
Unstructured search: search in unstructured files, as in text
Semi-structured search: combine both

Terminology
Each document has
– structured fields (aka attributes, columns)
– free-form text
Each field assumes one of several possible values
– e.g., language (French, Japanese, etc.); price (for products); date; …
Fields can be ordered (price, speed) or unordered (language, color)

Queries
A query is any combination of
– a text query
– a field query
A field query specifies one or more values for one or more fields
– for numerical values, ranges are possible, e.g., price < 5000

Example

Indexing: structured portion
For each field, order docs by their values for that field
– e.g., sorted by authors' names, language, …
Maintain range indices (in memory) for each value of each attribute
– like a postings entry
– counts are like frequencies in postings
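The range-index idea above can be sketched with a sorted array per field plus binary search; the field values and doc ids below are invented:

```python
# In-memory range index for one structured field: (value, doc_id) pairs
# kept sorted by value, so lookups and counts are binary searches.
# Sample field = language; all data is made up.
from bisect import bisect_left, bisect_right

entries = sorted([("French", 3), ("French", 7), ("Japanese", 1),
                  ("English", 2), ("English", 5), ("English", 9)])
values = [v for v, _ in entries]

def docs_with_value(value):
    # The contiguous run of matching entries acts like a postings entry.
    lo, hi = bisect_left(values, value), bisect_right(values, value)
    return [doc for _, doc in entries[lo:hi]]

def count(value):
    # Count of matching docs; the query optimizer uses these counts
    # to pick the "lightest axis first".
    return bisect_right(values, value) - bisect_left(values, value)
```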

Query processing
Given a value for each field, determine counts of matching docs
Process the query using optimization heuristics
– lightest axis first
Merge with the text-search postings

Numerical attributes
Expensive to maintain separate postings for each value of a numerical attribute
– e.g., price
Bucket into numerical ranges; maintain postings for each bucket
At the user interface, present only the bucket boundaries
– e.g., if the index buckets price into steps of $5000, present only these buckets to the user
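The bucketing scheme above can be sketched as follows; the prices and doc ids are invented:

```python
# Bucket a numerical attribute (price) into $5000 ranges and keep a
# postings list per bucket; the UI only ever sees bucket boundaries.
from collections import defaultdict

BUCKET = 5000
postings = defaultdict(list)   # bucket index -> list of (price, doc_id)

def index_doc(doc_id, price):
    postings[price // BUCKET].append((price, doc_id))

for doc_id, price in [(1, 4200), (2, 7100), (3, 9999), (4, 12000)]:
    index_doc(doc_id, price)

def bucket_boundaries():
    # What the UI exposes: [0, 5000), [5000, 10000), ...
    return sorted((b * BUCKET, (b + 1) * BUCKET) for b in postings)

print(bucket_boundaries())  # → [(0, 5000), (5000, 10000), (10000, 15000)]
```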

General ranges
If the UI allows the user to specify an arbitrary numerical range
– in the used-car section of cars.com: price, year
– e.g., price between 1234 and …
Need to walk through the postings entry for (say) the boundary bucket until 1234 is reached
At most two postings entries need a walk-through
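The walk-through logic above can be sketched as follows: buckets fully inside the requested range are taken whole, and only the (at most two) boundary buckets are scanned entry by entry. The bucket contents and the range endpoints are invented, since the transcript elides the slide's upper bound.

```python
# Resolve an arbitrary numerical range against $5000 buckets.
BUCKET = 5000
# bucket index -> postings entry: list of (price, doc_id) sorted by price
postings = {
    0: [(1200, 5), (1234, 1), (3000, 2)],
    1: [(5200, 3), (7400, 4)],
    2: [(10100, 6), (14000, 7)],
}

def docs_in_range(lo, hi):
    result = []
    for b in range(lo // BUCKET, hi // BUCKET + 1):
        entries = postings.get(b, [])
        if lo <= b * BUCKET and (b + 1) * BUCKET - 1 <= hi:
            # Bucket lies entirely inside [lo, hi]: take it whole.
            result.extend(doc for _, doc in entries)
        else:
            # Boundary bucket: walk the postings entry, testing each price.
            result.extend(doc for price, doc in entries if lo <= price <= hi)
    return result

print(docs_in_range(1234, 12000))  # → [1, 2, 3, 4, 6]
```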

Resources
MIR 2.6, 2.8.
R.M. Tong, L.A. Appelbaum, V.N. Askman, J.F. Cunningham. Conceptual Information Retrieval using RUBRIC. Proc. ACM SIGIR (1987).

Bayesian Resources
E. Charniak. Bayesian nets without tears. AI Magazine 12(4) (1991).
H.R. Turtle and W.B. Croft. Inference Networks for Document Retrieval. Proc. ACM SIGIR: 1-24 (1990).
D. Heckerman. A Tutorial on Learning with Bayesian Networks. Microsoft Technical Report MSR-TR.
F. Crestani, M. Lalmas, C.J. van Rijsbergen, I. Campbell. Is This Document Relevant?... Probably: A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys 30(4) (1998).