Published by Brenda Reeves. Modified over 9 years ago.
1
Advanced Query Processing
Dr. Susan Gauch
2
Query Term Weights
The vector space model matches queries to documents with the inner product / cosine similarity measure:
  Query vector * Document vector (inner product)
  Normalized_q_vector * Normalized_doc_vector
  Sum over all terms i (i ∈ t, the terms in the vector space) of nwt_q_i * nwt_d_ij
We implement this with:
  For all terms i with non-zero query weight
    For all documents j that contain term i
      Sum(nwt_d_ij)
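The nested loop above can be sketched in Python as follows. This is only a minimal illustration: the `postings` dictionary, its toy weights, and the function name `score` are assumptions made for the example, not part of the slides.

```python
from collections import defaultdict

# Hypothetical toy inverted index: term -> list of (doc_id, nwt_d_ij).
postings = {
    "dog": [(1, 0.5), (2, 0.25)],
    "cat": [(1, 0.25), (3, 0.75)],
}

def score(query_terms, postings):
    """Sum nwt_d_ij over each query term i and each document j containing i."""
    accumulators = defaultdict(float)
    for term in query_terms:                           # terms with non-zero query weight
        for doc_id, weight in postings.get(term, []):  # documents that contain the term
            accumulators[doc_id] += weight
    # Rank documents by accumulated weight, highest first.
    return sorted(accumulators.items(), key=lambda item: item[1], reverse=True)

ranked = score(["dog", "cat"], postings)
```

Only the postings lists of the query terms are visited, so documents that share no term with the query never get an accumulator.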
3
Query Term Weights
Where did the query term weights go?
Essentially, we assume that all query terms are weighted “1”
If a term occurs twice in a query, e.g., “dog cat dog”:
  Process “dog” twice, adding the postings for “dog” twice, so we effectively have a q_wt of 2 for “dog”
Can do this more efficiently by preprocessing the query using a... HASHTABLE!
  To count the term frequencies in the query: dog (2), cat (1)
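A minimal sketch of the hashtable preprocessing step, using Python's `collections.Counter` as the hash table. Whitespace tokenization and lowercasing are assumptions of this example, as is the function name `query_term_counts`.

```python
from collections import Counter

def query_term_counts(query):
    """Return a hash table mapping each query term to its frequency."""
    # Assumption: terms are whitespace-separated and case-insensitive.
    return Counter(query.lower().split())

counts = query_term_counts("dog cat dog")
```

Here `counts["dog"]` is 2 and `counts["cat"]` is 1, matching the example on the slide.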
4
Query Term Weights – Simple Implementation
Can do this more efficiently by preprocessing the query using a... HASHTABLE!
  To count the term frequencies in the query: dog (2), cat (1)
For all terms i with non-zero query weight
  For all documents j that contain term i
    Sum(q_wt_i * nwt_d_ij)
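The weighted loop might look like this in Python. The toy `postings` index and the name `score_weighted` are hypothetical; the query weight q_wt_i is taken to be the term's frequency in the query, as the slide describes.

```python
from collections import Counter, defaultdict

# Hypothetical toy inverted index: term -> list of (doc_id, nwt_d_ij).
postings = {
    "dog": [(1, 0.5), (2, 0.25)],
    "cat": [(1, 0.25), (3, 0.75)],
}

def score_weighted(query, postings):
    """Accumulate q_wt_i * nwt_d_ij, visiting each postings list only once."""
    q_wt = Counter(query.split())          # term -> frequency in the query
    accumulators = defaultdict(float)
    for term, weight in q_wt.items():      # terms with non-zero query weight
        for doc_id, nwt in postings.get(term, []):
            accumulators[doc_id] += weight * nwt
    return dict(accumulators)

scores = score_weighted("dog cat dog", postings)
```

Compared with processing “dog” twice, the postings list for “dog” is read once and its weights are multiplied by 2.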
5
Query Term Weights – Proper Implementation
Can change the query syntax to allow users to specify weights:
  Dog (2) Cat (1)
  Dog 0.7 Cat 0.3
Need better query parsing
Can tie to interfaces (sliders)
Users are poor at selecting weights and often get worse retrieval, not better, so this is infrequently implemented
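One possible parser for the two example syntaxes is sketched below. The grammar is an assumption of this example (the slides do not define one): a number, optionally in parentheses, sets the weight of the term before it, and a term with no number gets weight 1. The names `NUMBER` and `parse_weighted_query` are hypothetical.

```python
import re

# A bare or parenthesized number such as "0.7" or "(2)".
NUMBER = re.compile(r"\(?(\d+(?:\.\d+)?)\)?")

def parse_weighted_query(query):
    """Map each query term to its user-specified weight (default 1.0)."""
    weights = {}
    last_term = None
    for token in query.split():
        match = NUMBER.fullmatch(token)
        if match and last_term is not None:
            # A number following a term overrides that term's weight.
            weights[last_term] = float(match.group(1))
        else:
            last_term = token.lower()
            weights.setdefault(last_term, 1.0)
    return weights

fractional = parse_weighted_query("Dog 0.7 Cat 0.3")
counted = parse_weighted_query("Dog (2) Cat (1)")
```

A limitation of this sketch is that a numeric search term following another term (a year, for instance) would be read as a weight, which is one reason real query parsing needs more care.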
6
Query Term Weights – Document Similarity
Where are query term weights actually used?
  When trying to locate “similar” documents
Consider: how do you find the most similar documents to document d_k?
Applications: plagiarism detection, document clustering/classification (unsupervised/supervised learning)
Simple implementation:
  Treat d_k as a query
  Top results are the most similar documents
7
Document Similarity
For all terms i with non-zero weight in d_k
  For all documents j that contain term i
    Sum(nwt_d_ik * nwt_d_ij)
What is the weight nwt_d_ik?
  The tf*idf of term i in d_k
  We would need to store this
  Or, start with the document and calculate on the fly using the stored idf in the dict file
Efficiency
  Linear in the number of terms
  Very slow for long documents
  Calculate tf*idf for all terms in document k
  Sort and use the top n weighted terms (n ~ 10..50)
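The last two steps (calculate tf*idf on the fly, then keep the top n terms) can be sketched as below. The `idf` dictionary stands in for the stored idf values in the dict file, and its numbers are made up for illustration; the weights here are raw tf*idf without length normalization, which is a simplification.

```python
from collections import Counter

def top_terms(doc_tokens, idf, n=10):
    """Return the n highest tf*idf terms of a document as (term, weight) pairs."""
    tf = Counter(doc_tokens)
    # Terms missing from the idf table are skipped (they are not indexed).
    weights = {term: count * idf[term] for term, count in tf.items() if term in idf}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:n]

# Hypothetical idf values; a stop-like word such as "the" gets idf 0.
idf = {"dog": 1.0, "cat": 2.0, "the": 0.0}
doc_k = "the dog the cat the cat".split()
query_for_dk = top_terms(doc_k, idf, n=2)
```

The resulting pairs can then be used as a weighted query, so the cost of the similarity search depends on n rather than on the length of document k.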
8
~Boolean Queries
The vector space model merely sums the weights of the query terms in each document
  The top document may not have all query terms in it
How do we implement quasi-Boolean retrieval?
  “+canine feline -teeth”
  Results must have “canine”, may have “feline”, must not have “teeth”
Need to expand the accumulator buckets to keep track of the number of required terms contributing to the weights and the number of excluded terms
9
~Boolean Queries
Accumulator: Total, Num-Required, Num-Excluded
For regular terms (no + or -)
  Just add to Total (nothing new)
For required terms (+)
  Add to Total
  Add to Num-Required
11
~Boolean Queries
For excluded terms (-)
  Subtract from Total
  Add to Num-Excluded
Presenting results:
  First (only) show results where Num-Required in Accumulator == Num-Required in query && Num-Excluded == 0
  Sort by weight
Can expand the results shown by later showing groups of results with
  High weights, but missing 1 or more required terms
  High weights, but including 1 or more excluded terms
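Putting the expanded accumulator together, a minimal sketch might look like this. The toy `postings` index, the function names, and the three-element list used as the accumulator are assumptions of this example; only the first group of results (all required terms present, no excluded terms) is returned.

```python
from collections import defaultdict

# Hypothetical toy inverted index: term -> list of (doc_id, nwt_d_ij).
postings = {
    "canine": [(1, 0.5), (2, 0.5)],
    "feline": [(1, 0.25), (3, 0.75)],
    "teeth": [(2, 0.25)],
}

def parse(query):
    """Split a query into sets of required (+), regular, and excluded (-) terms."""
    required, regular, excluded = set(), set(), set()
    for token in query.split():
        if token[0] == "+":
            required.add(token[1:])
        elif token[0] in "-\u2013":   # accept a typographic en dash as well
            excluded.add(token[1:])
        else:
            regular.add(token)
    return required, regular, excluded

def quasi_boolean(query, postings):
    required, regular, excluded = parse(query)
    # Accumulator per document: [Total, Num-Required, Num-Excluded]
    acc = defaultdict(lambda: [0.0, 0, 0])
    for term in regular:
        for doc_id, weight in postings.get(term, []):
            acc[doc_id][0] += weight
    for term in required:
        for doc_id, weight in postings.get(term, []):
            acc[doc_id][0] += weight
            acc[doc_id][1] += 1
    for term in excluded:
        for doc_id, weight in postings.get(term, []):
            acc[doc_id][0] -= weight
            acc[doc_id][2] += 1
    # Show only documents with every required term and no excluded term.
    hits = [(doc_id, a[0]) for doc_id, a in acc.items()
            if a[1] == len(required) and a[2] == 0]
    return sorted(hits, key=lambda item: item[1], reverse=True)

results = quasi_boolean("+canine feline -teeth", postings)
```

With this toy index, document 2 is dropped because it contains “teeth” and document 3 is dropped because it lacks “canine”, leaving only document 1. The later, relaxed result groups could be produced from the same accumulators by loosening the filter.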