1 Advanced Query Processing Dr. Susan Gauch

2 Query Term Weights
- The vector space model matches queries to documents with the inner product/cosine similarity measure
- Query vector * Document vector (inner product)
- Normalized_q_vector * Normalized_doc_vector
- Sum over all terms i in the vector space (i ∈ t): nwt_q_i * nwt_d_ij
- We implement this with:
  - For all terms i with non-zero query weight
    - For all documents j that contain term i
      - Sum (nwt_d_ij)
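The nested loop above can be sketched with per-document accumulators over an inverted index. This is a minimal illustration, not the slides' actual code; the postings weights (nwt_d_ij) are made-up values, and query terms are assumed to have weight 1.

```python
from collections import defaultdict

# Illustrative inverted index: term -> {doc_id: normalized doc weight (nwt_d_ij)}
postings = {
    "dog": {1: 0.8, 2: 0.3},
    "cat": {2: 0.6, 3: 0.9},
}

def score(query_terms, postings):
    accumulators = defaultdict(float)        # doc_id -> running inner product
    for term in query_terms:                 # all terms i with non-zero query weight
        for doc_id, nwt in postings.get(term, {}).items():
            accumulators[doc_id] += nwt      # Sum(nwt_d_ij), q_wt assumed 1
    # Rank documents by accumulated score, highest first
    return sorted(accumulators.items(), key=lambda kv: -kv[1])

ranking = score(["dog", "cat"], postings)
```

Only documents that share at least one term with the query ever get an accumulator, which is what makes this linear in the postings touched rather than in the collection size.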

3 Query Term Weights
- Where did the query term weights go?
- Essentially, we assume that all query terms are weighted "1"
- If a term occurs twice in a query
  - E.g., "dog cat dog"
  - We process "dog" twice and add the postings for "dog" twice, so we effectively have a q_wt of 2 for "dog"
- Can do this more efficiently by preprocessing the query with a... HASHTABLE! to count the term frequencies in the query
  - dog (2), cat (1)

4 Query Term Weights – Simple Implementation
- Can do this more efficiently by preprocessing the query with a... HASHTABLE! to count the term frequencies in the query
  - dog (2), cat (1)
- For all terms i with non-zero query weight
  - For all documents j that contain term i
    - Sum (q_wt_i * nwt_d_ij)
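In Python the hashtable step is a `Counter`; each distinct term is then processed once with its count as q_wt_i. A sketch, with illustrative postings weights (not from the slides):

```python
from collections import Counter, defaultdict

# Illustrative inverted index: term -> {doc_id: normalized doc weight (nwt_d_ij)}
postings = {
    "dog": {1: 0.8, 2: 0.3},
    "cat": {2: 0.6, 3: 0.9},
}

def score(query, postings):
    q_wt = Counter(query.lower().split())     # hashtable of query term frequencies
    accumulators = defaultdict(float)
    for term, wt in q_wt.items():             # each distinct term processed once
        for doc_id, nwt in postings.get(term, {}).items():
            accumulators[doc_id] += wt * nwt  # Sum(q_wt_i * nwt_d_ij)
    return dict(accumulators)

scores = score("dog cat dog", postings)       # q_wt: dog=2, cat=1
```

The gain is that "dog cat dog" walks the "dog" postings list once instead of twice.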

5 Query Term Weights – Proper Implementation
- Can change the query syntax to allow users to specify weights:
  - dog (2) cat (1)
  - dog 0.7 cat 0.3
- Needs better query parsing
- Can be tied to interface controls (e.g., sliders)
- Users are poor at selecting weights and often get worse retrieval, not better, so this is infrequently implemented
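A minimal sketch of the parsing such a syntax needs, assuming the "dog 0.7 cat 0.3" form from the slide (a term optionally followed by a numeric weight, defaulting to 1):

```python
def parse_weighted_query(query):
    """Parse 'dog 0.7 cat 0.3' into {'dog': 0.7, 'cat': 0.3}.

    A token that parses as a number is taken as the preceding term's
    weight; unweighted terms default to 1.0 and repeats accumulate.
    """
    tokens = query.split()
    weights = {}
    i = 0
    while i < len(tokens):
        term = tokens[i]
        if i + 1 < len(tokens):
            try:
                weights[term] = weights.get(term, 0.0) + float(tokens[i + 1])
                i += 2
                continue
            except ValueError:
                pass                     # next token is a term, not a weight
        weights[term] = weights.get(term, 0.0) + 1.0
        i += 1
    return weights

parse_weighted_query("dog 0.7 cat 0.3")  # {"dog": 0.7, "cat": 0.3}
```

With unweighted input, this degenerates to the hashtable frequency count from the previous slide: "dog cat dog" yields dog=2.0, cat=1.0.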

6 Query Term Weights – Document Similarity
- Where are query term weights actually used?
- When trying to locate "similar" documents
- Consider: how do you find the documents most similar to document d_k?
- Applications: plagiarism detection, document clustering/classification (unsupervised/supervised learning)
- Simple implementation:
  - Treat d_k as a query
  - The top results are the most similar documents

7 Document Similarity
- For all terms i with non-zero weight in d_k
  - For all documents j that contain term i
    - Sum (nwt_d_ik * nwt_d_ij)
- What is the weight nwt_d_ik?
  - The tf*idf of term i in d_k
  - We would need to store this
  - Or, start with the document and calculate it on the fly using the idf stored in the dictionary file
- Efficiency
  - Linear in the number of terms, so very slow for long documents
  - Instead: calculate tf*idf for all terms in document k, sort, and use only the top n weighted terms (n ≈ 10..50)
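The document-as-query idea with top-n term pruning can be sketched as below. The three-document corpus is illustrative, and for brevity the weights are raw tf*idf rather than the length-normalized nwt values the slides use:

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative corpus: doc_id -> text
docs = {
    1: "dog cat dog bird",
    2: "cat bird bird",
    3: "dog dog dog fish",
}

# idf computed from document frequencies, as a dictionary file would store it
N = len(docs)
df = Counter()
for text in docs.values():
    df.update(set(text.split()))
idf = {t: math.log(N / dft) for t, dft in df.items()}

def top_terms(doc_id, n=2):
    """tf*idf-weight all terms of doc k, keep only the top n."""
    tf = Counter(docs[doc_id].split())
    weighted = {t: tf[t] * idf[t] for t in tf}
    return dict(sorted(weighted.items(), key=lambda kv: -kv[1])[:n])

def similar_to(doc_id, n=2):
    """Treat doc k's top-n terms as a query; score every other document."""
    scores = defaultdict(float)
    for term, wt_k in top_terms(doc_id, n).items():
        for j, text in docs.items():
            tf_j = text.split().count(term)
            if j != doc_id and tf_j:
                scores[j] += wt_k * (tf_j * idf[term])  # wt_d_ik * wt_d_ij
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Pruning to the top n terms is what keeps a long document from touching every postings list in the index.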

8 ~Boolean Queries
- The vector space model merely sums the weights of the query terms in each document
- The top document may not contain all the query terms
- How do we implement quasi-Boolean retrieval?
  - "+canine feline -teeth"
  - Results must contain "canine", may contain "feline", and must not contain "teeth"
- Need to expand the accumulator buckets to keep track of the number of required terms contributing to the weights and the number of excluded terms

9 ~Boolean Queries
- Accumulator fields:
  - Total
  - Num-Required
  - Num-Excluded
- For regular terms (no + or -)
  - Just add to Total (nothing new)
- For required terms (+)
  - Add to Total
  - Add to Num-Required

10 (figure-only slide; no transcript)

11 ~Boolean Queries
- For excluded terms (-)
  - Subtract from Total
  - Add to Num-Excluded
- Presenting results:
  - First (or only) show results where
    - Num-Required in the accumulator == the number of required terms in the query, and Num-Excluded == 0
  - Sort by weight
- Can expand the results shown by later displaying groups of results with
  - High weights, but missing one or more required terms
  - High weights, but including one or more excluded terms
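Slides 9 and 11 together describe one extended accumulator; a minimal sketch, with illustrative postings weights (not from the slides):

```python
from collections import defaultdict

# Illustrative inverted index: term -> {doc_id: normalized doc weight}
postings = {
    "canine": {1: 0.8, 2: 0.5},
    "feline": {1: 0.4, 3: 0.7},
    "teeth":  {2: 0.6},
}

def boolean_score(query, postings):
    # Each accumulator tracks Total, Num-Required, Num-Excluded
    acc = defaultdict(lambda: {"total": 0.0, "required": 0, "excluded": 0})
    tokens = query.split()
    n_required = sum(1 for t in tokens if t.startswith("+"))
    for token in tokens:
        op, term = (token[0], token[1:]) if token[0] in "+-" else ("", token)
        for doc_id, nwt in postings.get(term, {}).items():
            a = acc[doc_id]
            if op == "+":           # required: add to Total and Num-Required
                a["total"] += nwt
                a["required"] += 1
            elif op == "-":         # excluded: subtract, count the exclusion
                a["total"] -= nwt
                a["excluded"] += 1
            else:                   # regular term: just add to Total
                a["total"] += nwt
    # First show only docs with all required terms and no excluded terms
    hits = [(d, a["total"]) for d, a in acc.items()
            if a["required"] == n_required and a["excluded"] == 0]
    return sorted(hits, key=lambda kv: -kv[1])

boolean_score("+canine feline -teeth", postings)
```

Relaxing the two filter conditions at the end is exactly how the later result groups (missing required terms, or containing excluded terms) would be surfaced.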

