Advanced Query Processing Dr. Susan Gauch

Query Term Weights
- The vector space model matches queries to documents with the inner product/cosine similarity measure
- Similarity is the inner product of the normalized query vector and the normalized document vector, summed over all terms i in the vector space:
    sim(q, d_j) = sum over i of (nwt_q_i * nwt_d_ij)
  where nwt_q_i is the normalized weight of term i in the query and nwt_d_ij is the normalized weight of term i in document j
- We implement this with:
    for each term i with a non-zero query weight:
      for each document j that contains term i:
        add nwt_d_ij to document j's accumulator
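A minimal Python sketch of this nested loop, assuming a toy inverted index that maps each term to (docid, normalized weight) postings; the index contents and names such as nwt_d are illustrative, not from the slides:

    from collections import defaultdict

    # Hypothetical inverted index: term -> list of (docid, normalized weight nwt_d_ij)
    index = {
        "dog": [(1, 0.42), (3, 0.17)],
        "cat": [(1, 0.10), (2, 0.33)],
    }

    def score(query_terms, index):
        # Term-at-a-time scoring; every query term is implicitly weighted 1
        acc = defaultdict(float)              # docid -> accumulated score
        for term in query_terms:              # all terms with non-zero query weight
            for docid, nwt_d in index.get(term, []):
                acc[docid] += nwt_d           # sum of nwt_d_ij
        return sorted(acc.items(), key=lambda x: x[1], reverse=True)

    print(score(["dog", "cat"], index))       # [(1, 0.52), (2, 0.33), (3, 0.17)]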

Query Term Weights
- Where did the query term weights go?
- Essentially, we assume that every query term is weighted 1
- If a term occurs twice in a query, e.g., "dog cat dog":
  - We process "dog" twice and add its postings twice, so "dog" effectively gets a query weight of 2
- We can do this more efficiently by preprocessing the query with a hashtable that counts the term frequencies in the query:
  - dog (2), cat (1)

Query Term Weights – Simple Implementation
- Preprocess the query with a hashtable that counts term frequencies: dog (2), cat (1)
- Then include the query weight in the scoring loop:
    for each term i with a non-zero query weight:
      for each document j that contains term i:
        add q_wt_i * nwt_d_ij to document j's accumulator
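A sketch of this hashtable preprocessing plus weighted scoring, using Python's Counter as the hashtable and reusing the illustrative index from the sketch above:

    from collections import Counter, defaultdict

    def score_weighted(query, index):
        # Hashtable pass: "dog cat dog" -> {"dog": 2, "cat": 1}
        q_wt = Counter(query.split())
        acc = defaultdict(float)
        for term, wt in q_wt.items():         # each distinct term visited once
            for docid, nwt_d in index.get(term, []):
                acc[docid] += wt * nwt_d      # sum of q_wt_i * nwt_d_ij
        return sorted(acc.items(), key=lambda x: x[1], reverse=True)

    print(score_weighted("dog cat dog", index))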

Query Term Weights – Proper Implementation
- We can change the query syntax to let users specify weights:
  - dog (2) cat (1)
  - dog 0.7 cat 0.3
- Needs better query parsing
- Can be tied to interface widgets (sliders)
- Users are poor at selecting weights and often get worse retrieval, not better, so this is infrequently implemented
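One possible parser for the second syntax above, assuming whitespace-separated terms each optionally followed by a numeric weight that defaults to 1.0; the exact syntax is an assumption, since the slides do not pin it down:

    def parse_weighted_query(text):
        # Parse "dog 0.7 cat 0.3" into {"dog": 0.7, "cat": 0.3};
        # a bare term with no following number gets the default weight 1.0
        tokens = text.split()
        weights, i = {}, 0
        while i < len(tokens):
            term, wt, step = tokens[i], 1.0, 1
            if i + 1 < len(tokens):
                try:
                    wt, step = float(tokens[i + 1]), 2
                except ValueError:
                    pass                      # next token is a term, not a weight
            weights[term] = weights.get(term, 0.0) + wt
            i += step
        return weights

    print(parse_weighted_query("dog 0.7 cat 0.3"))  # {'dog': 0.7, 'cat': 0.3}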

Query Term Weights – Document Similarity
- Where are query term weights actually used?
- When trying to locate "similar" documents
- Consider: how do you find the documents most similar to document d_k?
- Applications: plagiarism detection, document clustering/classification (unsupervised/supervised learning)
- Simple implementation:
  - Treat d_k as a query
  - The top results are the most similar documents

Document Similarity
- Scoring loop:
    for each term i with a non-zero weight in d_k:
      for each document j that contains term i:
        add nwt_d_ik * nwt_d_ij to document j's accumulator
- What is the weight nwt_d_ik?
  - The tf*idf weight of term i in d_k
  - We would need to store this
  - Or start from the document and calculate it on the fly using the idf stored in the dictionary file
- Efficiency
  - Linear in the number of terms
  - Very slow for long documents
  - Instead: calculate tf*idf for all terms in document k, sort, and use only the top n weighted terms (n ~ )
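A sketch of this pruned document-as-query similarity, assuming document k's tf*idf vector is already available as a dict; the cutoff n=20 is a placeholder, since the slide leaves n unspecified:

    from collections import defaultdict

    def similar_docs(doc_k_tfidf, index, n=20):
        # Prune: sort document k's terms by tf*idf and keep only the top n
        top = sorted(doc_k_tfidf.items(), key=lambda x: x[1], reverse=True)[:n]
        # Treat document k as a query: sum nwt_d_ik * nwt_d_ij over shared terms
        acc = defaultdict(float)
        for term, nwt_k in top:
            for docid, nwt_j in index.get(term, []):
                acc[docid] += nwt_k * nwt_j
        return sorted(acc.items(), key=lambda x: x[1], reverse=True)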

~Boolean Queries
- The vector space model merely sums the weights of the query terms in each document
- The top document may not contain all of the query terms
- How do we implement quasi-Boolean retrieval?
  - "+canine feline -teeth"
  - Results must contain "canine", may contain "feline", and must not contain "teeth"
- We need to expand the accumulator buckets to track the number of required terms contributing to the weight and the number of excluded terms

~Boolean Queries
- Accumulator fields:
  - Total
  - Num-Required
  - Num-Excluded
- For regular terms (no + or -):
  - Just add to Total (nothing new)
- For required terms (+):
  - Add to Total
  - Add to Num-Required

~Boolean Queries
- For excluded terms (-):
  - Subtract from Total
  - Add to Num-Excluded
- Presenting results:
  - First (or only) show results whose accumulator has Num-Required equal to the number of required terms in the query and Num-Excluded == 0
  - Sort those by weight
- Can expand the results shown by later displaying groups of results with:
  - High weights, but missing one or more required terms
  - High weights, but including one or more excluded terms
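A sketch of the expanded accumulator for a query like "+canine feline -teeth", assuming it has already been parsed into regular, required, and excluded term lists, and reusing the illustrative index from the earlier sketches:

    from dataclasses import dataclass
    from collections import defaultdict

    @dataclass
    class Acc:
        total: float = 0.0      # summed term weights
        num_required: int = 0   # how many required (+) terms this doc contains
        num_excluded: int = 0   # how many excluded (-) terms this doc contains

    def quasi_boolean(regular, required, excluded, index):
        accs = defaultdict(Acc)
        for term in regular + required:
            for docid, nwt in index.get(term, []):
                accs[docid].total += nwt          # add to Total
                if term in required:
                    accs[docid].num_required += 1  # add to Num-Required
        for term in excluded:
            for docid, nwt in index.get(term, []):
                accs[docid].total -= nwt          # subtract from Total
                accs[docid].num_excluded += 1     # add to Num-Excluded
        # First show only docs with every required term and no excluded terms
        hits = [(d, a.total) for d, a in accs.items()
                if a.num_required == len(required) and a.num_excluded == 0]
        return sorted(hits, key=lambda x: x[1], reverse=True)

    print(quasi_boolean(["feline"], ["canine"], ["teeth"], index))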