Tutorial #3
Retrieval models
Retrieval models match a query against documents, either to separate the documents into relevant and non-relevant classes or to rank the documents by relevance. Three families are covered: the Boolean model, the vector space model (VSM), and probabilistic models.
Boolean model
The Boolean model is the most common exact-match model. Queries are logic expressions with document features as operands. In the pure Boolean model, retrieved documents are not ranked.
Example: {D7} OR ({D1, D2, D5} AND {D2, D4, D5, D6, D8}) = {D7} OR {D2, D5} = {D2, D5, D7}
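The evaluation in this example can be sketched with set operations, where AND is set intersection and OR is set union over the lists of matching documents. This is a minimal Java sketch; the class and method names (`BooleanModelSketch`, `setOf`, `and`, `or`) are illustrative assumptions and are not part of the tutorial's code.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, for illustration only.
final class BooleanModelSketch {
    // Builds a sorted set of document ids.
    static Set<String> setOf(String... ids) {
        return new TreeSet<>(Arrays.asList(ids));
    }

    // AND: documents present in both operands (set intersection).
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // OR: documents present in either operand (set union).
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<String> first = setOf("D7");
        Set<String> second = setOf("D1", "D2", "D5");
        Set<String> third = setOf("D2", "D4", "D5", "D6", "D8");
        // AND is evaluated before OR, as in the example.
        System.out.println(or(first, and(second, third))); // prints [D2, D5, D7]
    }
}
```

The result is an unranked set, which matches the pure Boolean model: a document either satisfies the expression or it does not.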
Vector space model (VSM)
Documents and queries are represented as vectors:
d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})
q = (w_{1,q}, w_{2,q}, ..., w_{t,q})
Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero.
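As a rough illustration of this representation, the sketch below turns a piece of text into a term vector with one dimension per keyword stem. Matching a token by prefix is an assumption made here as a crude stand-in for real stemming, and `TermVectorSketch` and `toVector` are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class TermVectorSketch {
    // Returns one count per stem: how many tokens of the text start with that stem.
    // Prefix matching is an assumed simplification of stemming.
    static int[] toVector(String text, String[] stems) {
        int[] vector = new int[stems.length];
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            for (int i = 0; i < stems.length; i++) {
                if (!token.isEmpty() && token.startsWith(stems[i])) {
                    vector[i]++;
                }
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] stems = {"bak", "recipe", "bread", "cake", "pastr", "pie"};
        int[] d = toVector("How to Bake Bread Without Recipes", stems);
        System.out.println(Arrays.toString(d)); // prints [1, 1, 1, 0, 0, 0]
    }
}
```

Dimensions for terms that do not occur in the text stay at zero, so only the occurring terms contribute to any later comparison with a query vector.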
[Figure: a term-by-query matrix (terms K0 to K5, queries Q0 to Q4) and a term-by-document matrix (terms K0 to K5, documents D0 to D4)]
Vector space model (VSM)
Several ways of computing these vector values, also known as (term) weights, have been developed. One of the best-known schemes is tf-idf weighting.
(tf-idf) weighting
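Many tf-idf variants exist. One common form, assumed here and not necessarily the one used in this tutorial, is w_{i,j} = tf_{i,j} * log(N / n_i), where tf_{i,j} is the frequency of term i in document j, N is the number of documents, and n_i is the number of documents containing term i. A minimal Java sketch of that variant follows; `TfIdfSketch` and `tfIdf` are hypothetical names.

```java
// Hypothetical helper class, for illustration only.
final class TfIdfSketch {
    // counts[i][j] is the raw frequency of term i in document j.
    // Returns weights w[i][j] = counts[i][j] * ln(N / n_i) (assumed variant).
    static double[][] tfIdf(int[][] counts) {
        int terms = counts.length;
        int docs = counts[0].length;
        double[][] weights = new double[terms][docs];
        for (int i = 0; i < terms; i++) {
            int docFreq = 0; // n_i: number of documents containing term i
            for (int j = 0; j < docs; j++) {
                if (counts[i][j] > 0) {
                    docFreq++;
                }
            }
            if (docFreq == 0) {
                continue; // term occurs nowhere, its weights stay 0
            }
            double idf = Math.log((double) docs / docFreq);
            for (int j = 0; j < docs; j++) {
                weights[i][j] = counts[i][j] * idf;
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Two terms, two documents: term 0 is rare, term 1 occurs everywhere.
        int[][] counts = {{1, 0}, {1, 1}};
        double[][] w = tfIdf(counts);
        System.out.println(w[0][0]); // ln(2), the rare term gets a positive weight
        System.out.println(w[1][0]); // 0.0, a term found in every document carries no weight
    }
}
```

The design intent of the idf factor is visible in the example: a term that appears in every document does not help to distinguish documents, so its weight drops to zero.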
Example documents:
D0: 'How to Bake Bread Without Recipes'
D1: 'The Classic Art of Viennese Pastry'
D2: 'Numerical Recipes: The Art of Scientific Computing'
D3: 'Breads, Pastries, Pies and Cakes: Quantity Baking Recipes'
D4: 'Pastry: A Book of Best French Recipes'
Keywords: ['bak', 'recipe', 'bread', 'cake', 'pastr', 'pie']
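The example documents and keywords generate a term-document matrix of 6 terms x 5 documents. An entry is 1 where the keyword occurs in the document title and 0 otherwise:

           D0  D1  D2  D3  D4
'bak'       1   0   0   1   0
'recipe'    1   0   1   1   1
'bread'     1   0   0   1   0
'cake'      0   0   0   1   0
'pastr'     0   1   0   1   1
'pie'       0   0   0   1   0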
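Query: "baking bread" generates a query vector over the same 6 terms, with 1 for 'bak' and 'bread' and 0 for the other terms. The query vector is compared against each of the 5 document columns of the 6 terms x 5 documents matrix.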
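The comparison between the query and the documents can be sketched as follows. Cosine similarity is assumed here as the ranking measure, since it is the usual choice for the vector space model; the matrix entries are the binary occurrences derived from the example titles, and `CosineRankSketch`, `scores`, and `rank` are hypothetical names unrelated to VSMranker.java.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class CosineRankSketch {
    static final String[] DOCS = {"D0", "D1", "D2", "D3", "D4"};

    // Rows: bak, recipe, bread, cake, pastr, pie. Columns: D0..D4.
    static final double[][] TDM = {
        {1, 0, 0, 1, 0},
        {1, 0, 1, 1, 1},
        {1, 0, 0, 1, 0},
        {0, 0, 0, 1, 0},
        {0, 1, 0, 1, 1},
        {0, 0, 0, 1, 0}
    };

    // Query "baking bread": 1 for 'bak' and 'bread', 0 elsewhere.
    static final double[] QUERY = {1, 0, 1, 0, 0, 0};

    // Cosine similarity between the query and every document column.
    static double[] scores(double[][] tdm, double[] query) {
        int docs = tdm[0].length;
        double queryNorm = 0;
        for (double x : query) {
            queryNorm += x * x;
        }
        double[] result = new double[docs];
        for (int j = 0; j < docs; j++) {
            double dot = 0;
            double docNorm = 0;
            for (int i = 0; i < query.length; i++) {
                dot += tdm[i][j] * query[i];
                docNorm += tdm[i][j] * tdm[i][j];
            }
            double denom = Math.sqrt(docNorm) * Math.sqrt(queryNorm);
            result[j] = denom == 0 ? 0 : dot / denom;
        }
        return result;
    }

    // Document indices ordered by decreasing score (ties keep index order).
    static Integer[] rank(double[] s) {
        Integer[] order = new Integer[s.length];
        for (int i = 0; i < s.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(s[b], s[a]));
        return order;
    }

    public static void main(String[] args) {
        double[] s = scores(TDM, QUERY);
        for (int j : rank(s)) {
            System.out.printf(Locale.ROOT, "%s %.4f%n", DOCS[j], s[j]);
        }
        // prints:
        // D0 0.8165
        // D3 0.5774
        // D1 0.0000
        // D2 0.0000
        // D4 0.0000
    }
}
```

D0 ranks above D3 although both contain the two query terms, because D3 contains every keyword and its longer vector lowers the cosine.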
VSM Implementation
VSMranker.java ranks documents for a query and provides functions for developing different user interfaces. Stand-alone usage needs the document and query term-document matrices (TDMs):
java -cp ../java VSMranker cacm.tdm query.tdm 7
This retrieves the top 7 documents for the CACM queries.
Exercise #3 (to be solved during the tutorial)
References:
http://www.ccs.neu.edu/home/jaa/CSG339.06F/Lectures/vector.pdf
http://www.ccs.neu.edu/home/jaa/CSG339.06F/Lectures/boolean.pdf