Modern Information Retrieval, Chapter 2: Modeling
Probabilistic model
The appearance or absence of an index term in a document is interpreted as evidence either that the document is relevant to a query or that it is irrelevant.
Goal: establish a weight for each term.
Given a collection of $N$ documents:
$R$ of the documents are relevant
$R_t$ of the relevant documents contain term $t$
$f_t$ of all $N$ documents contain $t$
These values can be obtained from a training set with relevance judgments.
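Computing probabilities ($\bar{t}$ denotes the absence of $t$):
\[
\begin{aligned}
\Pr[\text{relevant} \mid t] &= \frac{R_t}{f_t} \\
\Pr[\text{irrelevant} \mid t] &= \frac{f_t - R_t}{f_t} \\
\Pr[\text{relevant} \mid \bar{t}] &= \frac{R - R_t}{N - f_t} \\
\Pr[\text{irrelevant} \mid \bar{t}] &= \frac{N - f_t - (R - R_t)}{N - f_t}
\end{aligned}
\]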
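The four probabilities can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name `term_probabilities` and the dictionary keys are hypothetical, and the counts are those of the first worked example in this section (N=20, R=13, R_t=11, f_t=12). It assumes 0 < f_t < N so that neither denominator is zero.

```python
from fractions import Fraction


def term_probabilities(N, R, R_t, f_t):
    """Return the four conditional probabilities as exact fractions.

    Assumes 0 < f_t < N; zero denominators are not handled here.
    """
    absent = N - f_t  # documents that do not contain t
    return {
        "relevant|t": Fraction(R_t, f_t),
        "irrelevant|t": Fraction(f_t - R_t, f_t),
        "relevant|not t": Fraction(R - R_t, absent),
        "irrelevant|not t": Fraction(absent - (R - R_t), absent),
    }


probs = term_probabilities(N=20, R=13, R_t=11, f_t=12)
for name, p in probs.items():
    print(f"Pr[{name}] = {p}")
```

Each pair of probabilities conditioned on the same event (t present, or t absent) sums to 1, which is a quick sanity check on the counts.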
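Computing the weight $W_t$ for $t$:
\[
W_t
= \frac{\Pr[\text{relevant} \mid t] \, \Pr[\text{irrelevant} \mid \bar{t}]}
       {\Pr[\text{irrelevant} \mid t] \, \Pr[\text{relevant} \mid \bar{t}]}
= \frac{\dfrac{R_t}{f_t} \cdot \dfrac{N - f_t - (R - R_t)}{N - f_t}}
       {\dfrac{f_t - R_t}{f_t} \cdot \dfrac{R - R_t}{N - f_t}}
= \frac{R_t / (R - R_t)}{(f_t - R_t) / (N - f_t - (R - R_t))}
\]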
$W_t > 1$ indicates that the appearance of $t$ supports the document being relevant.
Example: $N = 20$, $R = 13$, $R_t = 11$, $f_t = 12$ gives $W_t = 33$.
$W_t < 1$ indicates that the appearance of $t$ suggests the document is irrelevant.
Example: $N = 20$, $R = 13$, $R_t = 4$, $f_t = 7$ gives $W_t \approx 0.59$.
$W_t = 1$ indicates that $t$ is neutral.
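The two worked examples can be reproduced with a short script. This is a sketch under stated assumptions rather than a reference implementation: the function name `term_weight` is hypothetical, and the code does not handle the cases f_t = R_t or R = R_t, where the simplified formula divides by zero.

```python
from fractions import Fraction


def term_weight(N, R, R_t, f_t):
    """Weight W_t in its simplified form:

    W_t = [R_t / (R - R_t)] / [(f_t - R_t) / (N - f_t - (R - R_t))]

    Rearranged into a single exact fraction. Raises ZeroDivisionError
    when f_t == R_t or R == R_t (not handled in this sketch).
    """
    numerator = R_t * (N - f_t - (R - R_t))
    denominator = (f_t - R_t) * (R - R_t)
    return Fraction(numerator, denominator)


w_supporting = term_weight(N=20, R=13, R_t=11, f_t=12)
w_opposing = term_weight(N=20, R=13, R_t=4, f_t=7)

print(w_supporting)                  # 33
print(round(float(w_opposing), 2))   # 0.59
```

Exact fractions are used so that the first example yields exactly 33 rather than a floating-point approximation.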
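Document weight:
A negative weight indicates that the document is predicted to be irrelevant.
A zero weight indicates that the document is neutral.
(Because $W_t$ is a ratio of non-negative counts and cannot itself be negative, these statements presumably refer to a log-scale score, such as a sum of $\log W_t$ over the terms of the document.)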
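One way to read the sign convention is as a log-scale document score. The sketch below assumes, as an illustration not stated in the slides, that a document is scored by summing log W_t over the query terms it contains. The function name `document_score`, the term names, and the `weights` table are hypothetical; the weight values 33 and 16/27 reuse the two worked examples, and 1 is the neutral weight. Every query term is assumed to have an entry in `weights`.

```python
import math


def document_score(doc_terms, query_terms, weights):
    """Sum log W_t over the query terms that occur in the document.

    This scoring rule is an assumption made for illustration.
    """
    matched = set(doc_terms) & set(query_terms)
    return sum(math.log(weights[t]) for t in matched)


# Hypothetical term weights: supporting, opposing, and neutral.
weights = {"alpha": 33.0, "beta": 16 / 27, "gamma": 1.0}
query = ["alpha", "beta", "gamma"]

score_relevant = document_score(["alpha", "delta"], query, weights)
score_irrelevant = document_score(["beta"], query, weights)
score_neutral = document_score(["gamma"], query, weights)

print(score_relevant > 0, score_irrelevant < 0, score_neutral == 0)
```

Under this reading, a term with W_t > 1 contributes a positive amount, a term with W_t < 1 a negative amount, and a neutral term contributes nothing, which matches the interpretation of negative and zero weights.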
Comparison
The Boolean model is the weakest of the three models: it supports no partial matching.
The vector model and the probabilistic model are comparable, although the vector model is more popular.
Term frequency is not considered in the probabilistic model.