Searching
Binding of search statements
Boolean queries
– Boolean queries in weighted systems
– Weighted Boolean queries in non-weighted systems
Similarity measures
– Well-known measures
– Thresholds
– Ranking
Relevance feedback

Binding of Search Statements
Search statements are generated by users to describe their information needs. Typically, a search statement uses Boolean logic or natural language. Three levels of binding may be observed; at each level the query statement becomes more specific.
1. At the first level, the user attempts to specify the information needed, using his/her own vocabulary and past experience. Example: "Find me information on the impact of oil spills in Alaska on the price of oil"

Binding of Search Statements (cont.)
2. At the next level, the system translates the query into its own internal language. This process is similar to that of processing (indexing) a new document. Example: "impact, oil (petroleum), spills (accidents), Alaska, price (cost, value)"
3. At the final level, the system reconsiders the query based upon the specific database, for example by assigning weights to the terms based upon the document frequency of each term. Example: impact (.308), oil (.606), petroleum (.65), spills (.12), accidents (.23), Alaska (.45), price (.16), cost (.25), value (.10)

Boolean Queries
Boolean queries are natural in systems where weights are binary: a term either applies or does not apply to a document. Each term T is associated with the set of documents D_T in which it appears.
– A AND B: Retrieve all documents for which both A and B are relevant (D_A ∩ D_B)
– A OR B: Retrieve all documents for which either A or B is relevant (D_A ∪ D_B)
– A NOT B: Retrieve all documents for which A is relevant and B is not relevant (D_A − D_B)
Consider two "unnatural" situations:
– Boolean queries in systems that index documents with weighted terms.
– Weighted Boolean queries in systems that use non-weighted (binary) terms.
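The set-based interpretation above can be sketched directly with set operations; the following is a minimal illustration (the tiny inverted index and document names are made up for the example, not taken from the slides):

```python
# Minimal sketch of binary Boolean retrieval using set operations.
# The inverted index below is hypothetical, invented for illustration.
index = {
    "oil":    {"d1", "d2", "d4"},
    "alaska": {"d2", "d3"},
    "price":  {"d1", "d2"},
}

def docs(term):
    """D_T: the set of documents in which the term appears."""
    return index.get(term, set())

# A AND B  ->  D_A ∩ D_B
print(docs("oil") & docs("alaska"))   # {'d2'}
# A OR B   ->  D_A ∪ D_B
print(docs("oil") | docs("price"))    # {'d1', 'd2', 'd4'}
# A NOT B  ->  D_A − D_B
print(docs("oil") - docs("alaska"))   # {'d1', 'd4'}
```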

Boolean Queries in Weighted Systems
Environment:
– A weighted system, where the relevance of a term to a document is expressed with a weight.
– Boolean queries, involving AND and OR.
Possible approach: use a threshold to convert all weights to binary representations.
Possible approach:
– Transform the query to disjunctive normal form: conjunctions of the form T1 ∧ T2 ∧ T3 ... connected by ∨ (OR) operators.
– Given a document D: first, its relevance to each conjunct is computed as the minimum weight of any document term that appears in the conjunct; then, the document relevance for the complete query is the maximum of the conjunct weights.

Boolean Queries in Weighted Systems (cont.)
Example: Two documents indexed by 3 terms:
– Doc1 = Term1 / 0.2, Term2 / 0.5, Term3 / 0.6
– Doc2 = Term1 / 0.7, Term2 / 0.2, Term3 / 0.1
– Query: (Term1 AND Term2) OR Term3
Relevance of Doc1 to the query is max(min(0.2, 0.5), 0.6) = 0.6
Relevance of Doc2 to the query is max(min(0.7, 0.2), 0.1) = 0.2
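A minimal sketch of the min/max evaluation for queries in disjunctive normal form; representing the query as a list of conjuncts (term lists) is an assumption made for illustration, not a representation given on the slide:

```python
# Sketch: fuzzy-set style evaluation of a Boolean query over weighted terms.
# A DNF query is a list of conjuncts; each conjunct is a list of terms.
def dnf_relevance(doc_weights, dnf_query):
    # minimum over the terms of a conjunct, maximum over the conjuncts
    return max(min(doc_weights.get(t, 0.0) for t in conjunct)
               for conjunct in dnf_query)

doc1 = {"Term1": 0.2, "Term2": 0.5, "Term3": 0.6}
doc2 = {"Term1": 0.7, "Term2": 0.2, "Term3": 0.1}
query = [["Term1", "Term2"], ["Term3"]]   # (Term1 AND Term2) OR Term3

print(dnf_relevance(doc1, query))  # 0.6, as in the slide's example
print(dnf_relevance(doc2, query))  # 0.2
```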

Weighted Boolean Queries in Non-weighted Systems
Environment:
– A conventional system, where a term is either relevant or non-relevant to a document.
– Boolean queries, in which users associate a weight (importance) with each term.
Possible approach:
– OR: A/1 OR B/1 includes all the documents in D_A ∪ D_B. A/1 OR B/0 includes all the documents in D_A. As the weight of B changes from 0 to 1, documents from D_B − D_A are added to D_A.
– AND: A/1 AND B/1 includes all the documents in D_A ∩ D_B. A/1 AND B/0 includes all the documents in D_A. As the weight of B changes from 1 to 0, documents from D_A − D_B are added to D_A ∩ D_B.

Weighted Boolean Queries in Non-weighted Systems (cont.)
NOT:
– A/1 NOT B/1 includes all the documents in D_A − D_B.
– A/1 NOT B/0 includes all the documents in D_A.
– As the weight of B changes from 1 to 0, documents from D_A ∩ D_B are added to D_A − D_B.
Algorithm (sketched in code below):
– Determine the documents that satisfy each of the two "extreme" interpretations; the smaller result is the inner set.
– Determine the centroid of the inner set.
– Calculate the similarity between each document outside the inner set and the centroid.
– Determine the number of documents to be added, by multiplying the actual weight of B (a value between 0 and 1) by the number of documents outside the inner set.
– Select the documents to be added as those most similar to the centroid.
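The algorithm above can be sketched roughly as follows for the OR case (A/1 OR B/w). The binary document vectors, the cosine similarity used for the centroid comparison, and the data layout are all assumptions made for illustration, not details given on the slide:

```python
# Rough sketch of weighted Boolean retrieval in a binary-term system,
# for a query "A/1 OR B/w_b". All representations here are illustrative.
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx, ny = math.sqrt(sum(a * a for a in x)), math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def weighted_or(docs, term_a, term_b, w_b):
    """docs: {doc_id: {term: 0/1}}; returns the answer set for A/1 OR B/w_b."""
    d_a = {d for d, v in docs.items() if v.get(term_a)}
    d_b = {d for d, v in docs.items() if v.get(term_b)}
    inner = d_a                 # the answer under the "weight of B = 0" extreme
    outer = d_b - d_a           # candidates added as w_b grows toward 1
    if not inner or not outer:
        return inner if w_b < 1 else inner | outer
    terms = sorted({t for v in docs.values() for t in v})
    vec = lambda d: [docs[d].get(t, 0) for t in terms]
    c = centroid([vec(d) for d in inner])           # centroid of the inner set
    k = round(w_b * len(outer))                     # number of documents to add
    ranked = sorted(outer, key=lambda d: cosine(vec(d), c), reverse=True)
    return inner | set(ranked[:k])                  # add those closest to the centroid
```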

Similarity Measures
Typically, similarity measures are used when both queries and documents are described by vectors.
A similarity measure gauges the similarity between two documents (for the purpose of search we consider the similarity between a query and a document, but many of the considerations are identical).
The measure increases as similarity grows (0 reflects total dissimilarity).
A variety of similarity measures has been proposed and experimented with.
As queries are analogous to documents, the same similarity measures can be used to measure
– document-document similarity (used in document clustering)
– document-query similarity (used in searching)
– query-query similarity (?)

Similarity Measures: Inner Product
Consider again SIM(D_i, D_j) = Σ_k (w_ik × w_jk), where the weights w_ik are simple frequency counts.
The problem with this simple measure is that it is not normalized to account for variances in the length of documents.
– This might be corrected by dividing each frequency count by the length of the document.
– It may also be corrected by dividing each frequency count by the maximum frequency count for the document.
Additional normalization is often performed to force all similarity values into the range between 0 and 1.

Similarity Measures: Inner Product (cont.)
This is a refinement of the previous measure (alternatively, the measure remains the inner product, but the representations are different):
SIM(Q, D) = Σ_{k=1..n} (q_k × d_k)
where m is the number of documents in the collection and n is the number of indexing terms.
Each document is a sequence of n weights: D = (d_1, …, d_n)
A query is also a sequence of n weights: Q = (q_1, …, q_n)
Each weight q_k or d_k = IDF_k × TF_k / MF
IDF_k = the inverse document frequency for term T_k: a value that decreases as the frequency of the term in the collection increases; for example, log_2(m / DF_k) + 1, where DF_k counts the number of documents in which term T_k appears.
TF_k / MF = the frequency of term T_k in this document, divided by the maximal frequency of any term in this document.
There are other constants for fine-tuning the formula's performance.
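A small sketch of the weighting and scoring just described, using the log_2(m/DF_k) + 1 form of IDF and maximum-frequency normalization; the toy term counts are invented purely for illustration:

```python
# Sketch: tf-idf style weights (IDF_k * TF_k / MF) and inner-product scoring.
import math

collection = [                       # raw term-frequency counts per document (toy data)
    {"oil": 3, "alaska": 1, "price": 2},
    {"oil": 1, "geography": 2},
    {"painter": 4, "oil": 1},
]
m = len(collection)

def idf(term):
    df = sum(1 for doc in collection if term in doc)         # DF_k
    return math.log2(m / df) + 1 if df else 0.0              # log2(m/DF_k) + 1

def weights(counts):
    mf = max(counts.values())                                 # maximal frequency in the document
    return {t: idf(t) * tf / mf for t, tf in counts.items()}  # IDF_k * TF_k / MF

def inner_product(q, d):
    return sum(w * d.get(t, 0.0) for t, w in q.items())       # SIM(Q, D)

query = weights({"oil": 1, "price": 1})
for i, doc in enumerate(collection, 1):
    print(f"SIM(Q, Doc{i}) = {inner_product(query, weights(doc)):.3f}")
```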

Similarity Measures: Cosine
A document or a query is treated as an n-dimensional vector.
SIM(Q, D) = Σ_k (q_k × d_k) / (√(Σ_k q_k²) × √(Σ_k d_k²))
The formula measures the cosine of the angle between the two vectors. As the cosine approaches 1, the two vectors become coincident (document and query represent closely related concepts); a cosine of 0 means the vectors are orthogonal (unrelated concepts).
Problem: Does not take into account the length of the vectors. Consider
– Query = (4, 8, 0)
– Doc1 = (1, 2, 0)
– Doc2 = (3, 6, 0)
SIM(Query, Doc1) and SIM(Query, Doc2) are identical, even though Doc2 has significantly higher weights in the terms in common.
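A quick check of the observation above, under a plain cosine implementation:

```python
# Sketch: cosine similarity; verifies that Doc1 and Doc2 score identically
# against the query even though Doc2 has larger weights.
import math

def cosine(q, d):
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d)))

query = (4, 8, 0)
doc1  = (1, 2, 0)
doc2  = (3, 6, 0)
print(cosine(query, doc1))  # 1.0 (up to floating-point error)
print(cosine(query, doc2))  # 1.0 -- identical, as noted on the slide
```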

Similarity Measures: Summary
Four well-known measures of vector similarity sim(X, Y):

Measure               Binary term vectors                  Weighted term vectors
Inner product         |X ∩ Y|                              Σ_k x_k·y_k
Dice coefficient      2·|X ∩ Y| / (|X| + |Y|)              2·Σ_k x_k·y_k / (Σ_k x_k² + Σ_k y_k²)
Cosine coefficient    |X ∩ Y| / (|X|^½ · |Y|^½)            Σ_k x_k·y_k / (√(Σ_k x_k²) · √(Σ_k y_k²))
Jaccard coefficient   |X ∩ Y| / (|X| + |Y| − |X ∩ Y|)      Σ_k x_k·y_k / (Σ_k x_k² + Σ_k y_k² − Σ_k x_k·y_k)

Similarity Measures: Summary (cont.)
Observations:
– All four measures use the same inner product as numerator.
– The denominators of the last three may be viewed as normalizations of the inner product.
– The definitions for binary term vectors are more intuitive.
– The normalized measures are 1 when X = Y, and all measures are 0 when X and Y are disjoint.
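For concreteness, a small sketch of the four weighted-vector measures; it only illustrates the shared numerator and the different normalizations:

```python
# Sketch: the four similarity measures for weighted term vectors.
import math

def inner(x, y):    return sum(a * b for a, b in zip(x, y))
def dice(x, y):     return 2 * inner(x, y) / (inner(x, x) + inner(y, y))
def cosine(x, y):   return inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))
def jaccard(x, y):  return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

x = (0.2, 0.5, 0.0)
y = (0.2, 0.5, 0.0)
# The normalized measures equal 1 when X = Y:
print(dice(x, y), cosine(x, y), jaccard(x, y))   # 1.0 1.0 1.0 (up to rounding)
```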

Thresholds
Use of similarity measures alone may return the entire database as a search result, because the similarity measure might yield non-zero (though close-to-zero) values for most, if not all, of the documents. Similarity measures must be used with thresholds:
– Threshold: a value that the similarity measure must exceed.
– It might also be a limit on the size of the answer.
Example:
– Terms: American, geography, lake, Mexico, painter, oil, reserve, subject.
– Doc1: geography of Mexico suggests oil reserves are available. Doc1 = (0, 1, 0, 1, 0, 1, 1, 0)
– Doc2: American geography has lakes available everywhere. Doc2 = (1, 1, 1, 0, 0, 0, 0, 0)
– Doc3: painters suggest Mexico lakes as subjects. Doc3 = (0, 0, 1, 1, 1, 0, 0, 1)
– Query: oil reserves in Mexico. Query = (0, 0, 0, 1, 0, 1, 1, 0)

Thresholds (cont.)
Example (cont.): Using the inner product measure:
SIM(Query, Doc1) = 3
SIM(Query, Doc2) = 0
SIM(Query, Doc3) = 1
If a threshold of 2 is selected, then only Doc1 is retrieved.
Use of thresholds may decrease recall when documents are clustered and search compares queries to cluster centroids. There may be documents in a cluster that are not retrieved, even though they are similar enough to the query, because their cluster centroid is not close enough to the query. The risk increases as the deviation within the cluster increases (the documents are not tightly clustered around the centroid -- a bad cluster).
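The inner-product scores and the threshold filter from the example can be checked with a few lines (the threshold value 2 is the one chosen on the slide):

```python
# Sketch: inner-product retrieval with a threshold, using the slide's example.
docs = {
    "Doc1": (0, 1, 0, 1, 0, 1, 1, 0),
    "Doc2": (1, 1, 1, 0, 0, 0, 0, 0),
    "Doc3": (0, 0, 1, 1, 1, 0, 0, 1),
}
query = (0, 0, 0, 1, 0, 1, 1, 0)
threshold = 2

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

scores = {name: inner(query, vec) for name, vec in docs.items()}
print(scores)                                           # {'Doc1': 3, 'Doc2': 0, 'Doc3': 1}
print([n for n, s in scores.items() if s > threshold])  # ['Doc1']
```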

Ranking
Similarity measures provide a means for ranking the set of retrieved documents:
– Ordering the documents from the most likely to satisfy the query to the least likely.
– Ranking reduces the user's overhead.
– Because similarity measures are not accurate, a precise ranking may be misleading; documents may instead be grouped into sets, and the sets ranked in order of relevance.

Relevance Feedback
An initial query might not provide an accurate description of the user's needs:
– The user lacks knowledge of the domain.
– The user's vocabulary does not match the authors' vocabulary.
After examining the results of a query, a user can often improve the description of his or her needs:
– Querying is an iterative process.
– Further iterations are generated either manually or automatically.
Relevance feedback: knowledge of which returned documents are relevant and which are not is used to generate the next query.
– Assumption: the documents relevant to a query resemble each other (similar vectors).
– Hence, if a document is known to be relevant, the query can be improved by increasing its similarity to that document.
– Similarly, if a document is known to be non-relevant, the query can be improved by decreasing its similarity to that document.

Relevance Feedback (cont.)
Given a query (a vector) we
– add to it the average (centroid) of the relevant documents in the result, and
– subtract from it the average (centroid) of the non-relevant documents in the result.
As a vector algebra expression:
Q_{i+1} = Q_i + (1/r)·Σ_{D∈R} D − (1/nr)·Σ_{D∈NR} D
where
– Q_i = the present query.
– Q_{i+1} = the revised query.
– D = a document in the result.
– R = the relevant documents in the result (r = cardinality of R).
– NR = the non-relevant documents in the result (nr = cardinality of NR).

Relevance Feedback (cont.)
A revised formula, giving more control over the various components:
Q_{i+1} = α·Q_i + (β/r)·Σ_{D∈R} D − (γ/nr)·Σ_{D∈NR} D
where
– α, β, γ = tuning constants; for example, 1.0, 0.5, 0.25.
– β = positive feedback factor. Uses the user's judgments on relevant documents to increase the values of terms. Moves the query toward documents similar to the relevant documents retrieved (in the direction of more relevant documents).
– γ = negative feedback factor. Uses the user's judgments on non-relevant documents to decrease the values of terms. Moves the query away from non-relevant documents.
– Positive feedback is often weighted significantly more than negative feedback; often, only positive feedback is used.
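A minimal sketch of the revised formula (Rocchio-style feedback); representing vectors as plain Python sequences and clipping negative weights to zero are assumptions made for illustration, and the default constants are the 1.0 / 0.5 / 0.25 values mentioned above:

```python
# Sketch: relevance feedback (Rocchio-style query modification).
def revise_query(query, relevant, non_relevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Q' = alpha*Q + (beta/r)*sum(R) - (gamma/nr)*sum(NR); negative weights clipped to 0."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(query)
        return [sum(d[k] for d in docs) / len(docs) for k in range(len(query))]
    pos = centroid(relevant)        # centroid of the relevant documents
    neg = centroid(non_relevant)    # centroid of the non-relevant documents
    revised = [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]
    return [max(w, 0.0) for w in revised]
```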

Relevance Feedback (cont.)
[Figure: impact of relevance feedback, with one panel showing positive feedback only and one showing negative feedback only. Legend: boxes, filled = present query, hollow = modified query; oval = set of documents retrieved by the present query; circles, filled = non-relevant documents, hollow = relevant documents.]

Relevance Feedback (cont.)
Example:
– Assume query Q = (3, 0, 0, 2, 0) retrieved three documents Doc1, Doc2, Doc3.
– Assume Doc1 and Doc2 are judged relevant and Doc3 is judged non-relevant.
– Assume the constants used are 1.0, 0.5, 0.25.
– The revised query is:
Q' = 1.0·(3, 0, 0, 2, 0) + 0.5·((2+1)/2, (4+3)/2, (0+0)/2, (0+0)/2, (2+0)/2) − 0.25·(0, 0, 4, 3, 2)
= (3.75, 1.75, −1, 1.25, 0)
= (3.75, 1.75, 0, 1.25, 0) after negative weights are set to 0.
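The arithmetic of this example can be reproduced with a few standalone lines; note that the assignment of the two relevant vectors to Doc1 versus Doc2 is an assumption inferred from the averages shown on the slide:

```python
# Standalone check of the example's arithmetic (constants 1.0, 0.5, 0.25).
q = (3, 0, 0, 2, 0)
rel = [(2, 4, 0, 0, 2), (1, 3, 0, 0, 0)]         # Doc1, Doc2 (judged relevant)
non = [(0, 0, 4, 3, 2)]                          # Doc3 (judged non-relevant)
alpha, beta, gamma = 1.0, 0.5, 0.25

q_prime = [alpha * q[k]
           + beta  * sum(d[k] for d in rel) / len(rel)
           - gamma * sum(d[k] for d in non) / len(non)
           for k in range(len(q))]
print(q_prime)                                   # [3.75, 1.75, -1.0, 1.25, 0.0]
print([max(w, 0.0) for w in q_prime])            # [3.75, 1.75, 0.0, 1.25, 0.0]
```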

Relevance Feedback (cont.)
Example (cont.):
– Using the similarity formula we can compare the similarity of Q and Q' to the three documents.
– Compared to the original query, the new query is more similar to Doc1 and Doc2 (judged relevant), and less similar to Doc3 (judged non-relevant).
– Notice how the new query added Term2, which was not in the original query. For example, a user may be searching for a "word processor" to be used on a "PC", and the revised query may introduce the term "Mac".

Relevance Feedback (cont.)
Problem: Relevance feedback may not operate satisfactorily if the identified relevant documents do not form a tight cluster.
– Possible solution: Cluster the identified relevant documents, then split the original query into several, by constructing a new query for each cluster.
Problem: Some of the query terms might not be found in any of the retrieved documents. This will lead to a reduction of their relative weight in the modified query (or even their elimination). This is undesirable, because these terms might still be found in future iterations.
– Possible solutions: Ensure the original terms are kept; or present all modified queries to the user for review.
"Fully automatic" relevance feedback: The rank values for the documents in the first answer are used as relevance feedback to automatically generate the second query (no human judgment).
– The highest-ranking documents are assumed to be relevant (positive feedback only).
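A rough sketch of the "fully automatic" (pseudo-relevance) variant, which simply treats the top-ranked documents as relevant and applies positive feedback only; the cutoff k, the constants, and the data are all assumptions for illustration:

```python
# Sketch: pseudo-relevance feedback -- assume the top-k documents are relevant
# and apply positive feedback only (gamma = 0 here, by assumption).
def pseudo_feedback(query, ranked_docs, k=2, alpha=1.0, beta=0.5):
    """ranked_docs: document vectors already sorted by decreasing score."""
    top = ranked_docs[:k]
    centroid = [sum(d[i] for d in top) / len(top) for i in range(len(query))]
    return [alpha * q + beta * c for q, c in zip(query, centroid)]

# Hypothetical usage: a first-pass ranking produced these vectors, best first.
ranking = [(2, 4, 0, 0, 2), (1, 3, 0, 0, 0), (0, 0, 4, 3, 2)]
print(pseudo_feedback((3, 0, 0, 2, 0), ranking))   # [3.75, 1.75, 0.0, 2.0, 0.5]
```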