Published by Earl Thornton. Modified over 9 years ago.
1
Language Models
LBSC 796/CMSC 828o
Session 4, February 16, 2004
Douglas W. Oard
2
Agenda
Questions
The meaning of “maybe”
Probabilistic retrieval
Comparison with vector space model
3
Muddiest Points
Why distinguish utility and relevance?
How coordination measure is ranked Boolean
Why use term weights?
The meaning of DF
The problem with log(1)=0
How the vectors are built
How to do cosine normalization (5)
Why to do cosine normalization (2)
Okapi graphs
4
Supporting the Search Process
[Diagram labels: Source Selection, Query Formulation, Query, Search, Ranked List, Selection, Document, Examination, Document Delivery; IR System, Indexing, Index, Acquisition, Collection]
5
Looking Ahead
We ask “is this document relevant?”
  –Vector space: we answer “somewhat”
  –Probabilistic: we answer “probably”
The key is to know what “probably” means
  –First, we’ll formalize that notion
  –Then we’ll apply it to retrieval
6
Probability
What is probability?
  –Statistical: relative frequency as n → ∞
  –Subjective: degree of belief
Thinking statistically
  –Imagine a finite amount of “stuff”
  –Associate the number 1 with the total amount
  –Distribute that “mass” over the possible events
7
Statistical Independence
A and B are independent if and only if:
  P(A and B) = P(A) × P(B)
Independence formalizes “unrelated”
  –P(“being brown eyed”) = 85/100
  –P(“being a doctor”) = 1/1000
  –P(“being a brown eyed doctor”) = 85/100,000
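The independence condition can be checked numerically. The following is a minimal sketch using the figures from this slide; `is_independent` is a hypothetical helper name, and exact fractions are used so that the equality test is not disturbed by floating-point rounding.

```python
from fractions import Fraction

def is_independent(p_a, p_b, p_a_and_b):
    """Return True when P(A and B) equals P(A) * P(B) exactly."""
    return p_a_and_b == p_a * p_b

# Figures from the slide.
p_brown_eyed = Fraction(85, 100)
p_doctor = Fraction(1, 1000)
p_brown_eyed_doctor = Fraction(85, 100000)

print(is_independent(p_brown_eyed, p_doctor, p_brown_eyed_doctor))
```

With these figures the joint probability equals the product, so the two events are treated as unrelated.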
8
Dependent Events
Suppose
  –P(“having a B.S. degree”) = 2/10
  –P(“being a doctor”) = 1/1000
Would you expect
  –P(“having a B.S. degree and being a doctor”) = 2/10,000 ???
Extreme example:
  –P(“being a doctor”) = 1/1000
  –P(“having studied anatomy”) = 12/1000
9
Conditional Probability
P(A | B) = P(A and B) / P(B)
[Venn diagram: regions A, B, and their overlap “A and B”]
P(A) = prob of A relative to the whole space
P(A|B) = prob of A considering only the cases where B is known to be true
10
More on Conditional Probability
Suppose
  –P(“having studied anatomy”) = 12/1000
  –P(“being a doctor and having studied anatomy”) = 1/1000
Consider
  –P(“being a doctor” | “having studied anatomy”) = 1/12
But if you assume all doctors have studied anatomy
  –P(“having studied anatomy” | “being a doctor”) = 1
Useful restatement of definition:
  P(A and B) = P(A|B) × P(B)
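The two conditional probabilities on this slide follow directly from the definition P(A | B) = P(A and B) / P(B). A small sketch, assuming a hypothetical helper named `conditional` and using the slide's figures:

```python
from fractions import Fraction

def conditional(p_a_and_b, p_b):
    """P(A | B) = P(A and B) / P(B); undefined when P(B) is zero."""
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return p_a_and_b / p_b

# Figures from the slides.
p_anatomy = Fraction(12, 1000)
p_doctor = Fraction(1, 1000)
p_doctor_and_anatomy = Fraction(1, 1000)

print(conditional(p_doctor_and_anatomy, p_anatomy))  # 1/12
print(conditional(p_doctor_and_anatomy, p_doctor))   # 1
```

The same joint probability gives very different answers depending on which event is conditioned on, which is why the direction of the bar matters.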
11
Some Notation
Consider
  –A set of hypotheses: H1, H2, H3
  –Some observable evidence O
P(O|H1) = probability of O being observed if we knew H1 were true
P(O|H2) = probability of O being observed if we knew H2 were true
P(O|H3) = probability of O being observed if we knew H3 were true
12
An Example
Let
  –O = “Joe earns more than $80,000/year”
  –H1 = “Joe is a doctor”
  –H2 = “Joe is a college professor”
  –H3 = “Joe works in food services”
Suppose we do a survey and we find out
  –P(O|H1) = 0.6
  –P(O|H2) = 0.07
  –P(O|H3) = 0.001
What should be our guess about Joe’s profession?
13
Bayes’ Rule
What’s P(H1|O)? P(H2|O)? P(H3|O)?
Theorem:
  P(H | O) = P(O | H) × P(H) / P(O)
  –P(H | O) is the posterior probability
  –P(H) is the prior probability
Notice that the prior is very important!
14
Back to the Example
Suppose we also have good data about priors:
  –P(O|H1) = 0.6, P(H1) = 0.0001 (doctor)
  –P(O|H2) = 0.07, P(H2) = 0.001 (prof)
  –P(O|H3) = 0.001, P(H3) = 0.2 (food)
We can calculate
  –P(H1|O) = 0.00006 / P(“earning > $80K/year”)
  –P(H2|O) = 0.00007 / P(“earning > $80K/year”)
  –P(H3|O) = 0.0002 / P(“earning > $80K/year”)
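The calculation can be sketched in a few lines by computing each numerator P(O|H) × P(H) directly from the survey figures and priors. This is an illustrative sketch, not part of the original deck; the dictionary names are hypothetical. It deliberately does not normalize, because dividing by P(O) would require knowing that the three hypotheses are exhaustive, which the slides do not state.

```python
# Likelihoods P(O|H) and priors P(H), taken from the slides.
likelihood = {"doctor": 0.6, "professor": 0.07, "food services": 0.001}
prior = {"doctor": 0.0001, "professor": 0.001, "food services": 0.2}

# Numerator of Bayes' rule: P(O|H) * P(H). P(O) is the same for every
# hypothesis, so it cannot change the ordering of the posteriors.
score = {h: likelihood[h] * prior[h] for h in likelihood}

# Hypotheses ordered from most to least probable given O.
ranked = sorted(score, key=score.get, reverse=True)
print(ranked)
```

Food services comes out first even though its likelihood is by far the smallest, because its prior is so much larger than the others.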
15
Key Ideas
Defining probability using frequency
Statistical independence
Conditional probability
Bayes’ rule
16
Agenda
Questions
Defining probability
Using probability for retrieval
  –Language modeling
  –Inference networks
Comparison with vector space model
17
Probability Ranking Principle
Assume binary relevance/document independence
  –Each document is either relevant or it is not
  –Relevance of one doc reveals nothing about another
Assume the searcher works down a ranked list
  –Seeking some number of relevant documents
Theorem (provable from assumptions):
  –Documents should be ranked in order of decreasing probability of relevance to the query, P(d relevant-to q)
18
Probabilistic Retrieval Strategy
Estimate how terms contribute to relevance
  –How do TF, DF, and length influence your judgments about document relevance? (e.g., Okapi)
Combine to find document relevance probability
Order documents by decreasing probability
19
Where do the probabilities fit?
[Diagram labels: Information Need, Query Formulation, Query, Query Processing, Representation Function, Query Representation; Document, Document Processing, Representation Function, Document Representation; Comparison Function, Retrieval Status Value, Human Judgment, Utility; annotated with P(d is Rel | q)]
20
Binary Independence Model
Binary refers again to binary relevance
Assume “term independence”
  –Presence of one term tells nothing about another
Assume “uniform priors”
  –P(d) is the same for all d
21
“Okapi” Term Weights
[Formula not recovered from the slide image; its two factors are labeled “TF component” and “IDF component”]
22
Stochastic Language Models
Models probability of generating any string
Model M:
  0.2   the
  0.1   a
  0.01  man
  0.01  woman
  0.03  said
  0.02  likes
  …
String s and per-word probabilities under M:
  the   man   likes  the   woman
  0.2   0.01  0.02   0.2   0.01
Multiply the per-word probabilities to get P(s | M)
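The "multiply" step can be written out as a short sketch. The model below contains only the six entries visible on the slide (the "…" entries are unknown), and giving probability 0 to any word outside those entries is an assumption made here for illustration; `string_probability` is a hypothetical helper name.

```python
import math

# The entries of model M that are visible on the slide.
model_m = {"the": 0.2, "a": 0.1, "man": 0.01,
           "woman": 0.01, "said": 0.03, "likes": 0.02}

def string_probability(words, model):
    """Multiply per-word probabilities.

    Assumption for this sketch: a word missing from the model gets
    probability 0, which zeroes the whole string.
    """
    return math.prod(model.get(w, 0.0) for w in words)

p = string_probability("the man likes the woman".split(), model_m)
print(p)
```

The product 0.2 × 0.01 × 0.02 × 0.2 × 0.01 is 0.00000008, so even a short, ordinary sentence has a very small probability; only comparisons between models or documents are meaningful.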
23
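Language Models, cont’d
Models probability of generating any string
Model M1:
  0.2   the
  0.1   a
  0.01  man
  0.01  woman
  0.03  said
  0.02  likes
  …
Model M2:
  0.2    the
  0.1    yon
  0.001  class
  0.01   maiden
  0.03   sayst
  0.02   pleaseth
  …
[Graphic: per-word probabilities under M1 and M2 for a string s made of the words “the”, “class”, “pleaseth”, “yon”, “maiden”. Values as extracted, with their alignment to words and models lost: 0.0005, 0.01, 0.0001, 0.2, 0.01, 0.0001, 0.02, 0.1, 0.2]
P(s|M2) > P(s|M1)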
24
Retrieval with Language Models
Treat each document as the basis for a model
Rank document d based on P(d | q)
  –P(d | q) = P(q | d) × P(d) / P(q)
P(q) is the same for all documents, so it can’t change ranks
P(d) [the prior] is often treated as the same for all d
  –But we could use criteria like authority, length, genre
P(q | d) is the probability of q given d’s model
  –With uniform priors, ranking by P(d | q) is the same as ranking by P(q | d)
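The effect of the prior on ranking can be illustrated with a small sketch. All numbers below are hypothetical values invented for illustration, as are the document ids and the `rank` helper; P(q) is omitted because it is constant across documents.

```python
def rank(likelihoods, priors):
    """Order document ids by P(q|d) * P(d), highest first."""
    return sorted(likelihoods,
                  key=lambda d: likelihoods[d] * priors[d],
                  reverse=True)

# Hypothetical query likelihoods P(q|d), for illustration only.
p_q_given_d = {"d1": 0.004, "d2": 0.003, "d3": 0.001}

# Uniform prior: every document is equally likely before seeing the query.
uniform = {d: 1 / 3 for d in p_q_given_d}

# Hypothetical non-uniform prior, e.g. derived from authority.
by_authority = {"d1": 0.1, "d2": 0.6, "d3": 0.3}

print(rank(p_q_given_d, uniform))       # ['d1', 'd2', 'd3']
print(rank(p_q_given_d, by_authority))  # ['d2', 'd1', 'd3']
```

Under the uniform prior the order is exactly the order of P(q | d); the non-uniform prior promotes d2 above d1.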
25
Computing P(q | d)
Build a smoothed language model for d
  –Count the frequency of each term in d
  –Count the frequency of each term in the collection
  –Combine the two in some way
  –Redistribute probabilities to unobserved events
    Example: add 1 to every count
Combine the probability for the full query
  –Summing over the terms in q is a soft “OR”
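A minimal sketch of these steps, using the slide's "add 1 to every count" example as the smoothing method. The two toy documents, the query, and the helper names are assumptions made for illustration. The sketch combines per-term probabilities by multiplying, as in the string-probability slide; replacing the product with a sum gives the soft "OR" behavior mentioned above. Query terms are assumed to occur somewhere in the collection vocabulary.

```python
import math
from collections import Counter

def add_one_model(doc_tokens, vocabulary):
    """Add 1 to every count so unseen vocabulary terms get nonzero probability."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens) + len(vocabulary)
    return {t: (counts[t] + 1) / total for t in vocabulary}

def query_likelihood(query_tokens, model):
    """P(q | d) as the product of per-term probabilities."""
    return math.prod(model[t] for t in query_tokens)

# Toy collection, for illustration only.
docs = {"d1": "the cat sat on the mat".split(),
        "d2": "the dog sat".split()}
vocabulary = sorted({t for tokens in docs.values() for t in tokens})

models = {d: add_one_model(tokens, vocabulary) for d, tokens in docs.items()}
query = "cat sat".split()
ranking = sorted(docs,
                 key=lambda d: query_likelihood(query, models[d]),
                 reverse=True)
print(ranking)
```

Without smoothing, d2 would score exactly zero because it never mentions "cat"; with add-one smoothing it keeps a small nonzero score and simply ranks below d1.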
26
Key Ideas
Probabilistic methods formalize assumptions
  –Binary relevance
  –Document independence
  –Term independence
  –Uniform priors
  –Top-down scan
Natural framework for combining evidence
  –e.g., non-uniform priors
27
Inference Networks
A flexible way of combining term weights
  –Boolean model
  –Binary independence model
  –Probabilistic models with weaker assumptions
Key concept: rank based on P(d | q)
  –P(d | q) = P(q | d) × P(d) / P(q)
Efficient large-scale implementation
  –InQuery text retrieval system from U Mass
28
A Boolean Inference Net
[Diagram: document nodes d1, d2, d3, d4; term nodes bat, cat, fat, hat, mat, pat, rat, sat, vat; operator nodes AND, OR, AND; top node I (information need)]
29
A Binary Independence Network
[Diagram: document nodes d1, d2, d3, d4; term nodes bat, cat, fat, hat, mat, pat, rat, sat, vat; top node “query”]
30
Probability Computation
Turn on exactly one document at a time
  –Boolean: Every connected term turns on
  –Binary Ind: Connected terms gain their weight
Compute the query value
  –Boolean: AND and OR nodes use truth tables
  –Binary Ind: Fraction of the possible weight
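One way to read these two computations is sketched below. The document-to-term links, the term weights, the nested-tuple query format, and the function names are all hypothetical choices made for illustration; they are not the InQuery implementation or the exact networks drawn on the slides.

```python
# Hypothetical document-to-term links, for illustration only.
doc_terms = {"d1": {"bat", "cat"}, "d2": {"cat", "hat", "mat"}}

def boolean_value(doc, query):
    """Evaluate a nested (operator, operand, ...) query with one doc turned on.

    A bare string is a term node: it is on if the document links to it.
    """
    if isinstance(query, str):
        return query in doc_terms[doc]
    op, *operands = query
    values = [boolean_value(doc, q) for q in operands]
    if op == "AND":
        return all(values)
    if op == "OR":
        return any(values)
    raise ValueError(f"unknown operator: {op}")

def binary_independence_value(doc, term_weights):
    """Weight gained by the doc's connected terms, as a fraction of the total."""
    gained = sum(w for t, w in term_weights.items() if t in doc_terms[doc])
    return gained / sum(term_weights.values())

query = ("AND", "mat", ("OR", "hat", "bat"))
weights = {"cat": 2.0, "hat": 1.0, "sat": 1.0}

for d in doc_terms:
    print(d, boolean_value(d, query), binary_independence_value(d, weights))
```

The Boolean net yields only true or false for each document, while the weighted net yields a graded value between 0 and 1 that can be used for ranking.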
31
A Critique
Most of the assumptions are not satisfied!
  –Searchers want utility, not relevance
  –Relevance is not binary
  –Terms are clearly not independent
  –Documents are often not independent
The best known term weights are quite ad hoc
  –Unless some relevant documents are known
32
But It Works!
Ranked retrieval paradigm is powerful
  –Well suited to human search strategies
Probability theory has explanatory power
  –At least we know where the weak spots are
  –Probabilities are good for combining evidence
Good implementations exist (InQuery, Lemur)
  –Effective, efficient, and large-scale
33
Comparison With Vector Space
Similar in some ways
  –Term weights can be based on frequency
  –Terms often used as if they were independent
Different in others
  –Based on probability rather than similarity
  –Intuitions are probabilistic rather than geometric
34
One Minute Paper
Which assumption underlying the probabilistic retrieval model causes you the most concern, and why?