Server Ranking for Distributed Text Retrieval Systems on the Internet (Yuwono and Lee) presented by Travis Emmitt

General Architecture
[Diagram: users (User_X, User_Y, User_Z) send queries (Query_A, Query_B, Query_C) to brokers (Broker_1 ... Broker_M), which route them to collections (Coll_1 ... Coll_N) holding the documents relevant to each query. Brokers need DF info from the collections; broker clones are created when needed.]

Terminology
cooperating autonomous index servers
collection fusion problem
collections = databases = sites
broker servers =?= meta search engines
–collection of server-servers
index servers = collection servers
documents = information resources = texts

More Terminology
words
–before stemming and stopping
–example: { the, computer, computing, French }
terms
–after stemming and stopping
–example: { comput, French }
keywords
–meaning varies depending upon context

Subscripts
Often see TF_{i,j} and IDF_{j} within the context of a single collection
–In a multiple-collection environment, this notational shorthand can lead to ambiguity.
–Should instead use TF_{h,i,j} and IDF_{h,j}
h, i, and j are identifiers [possibly integers]
–c_h is a collection
–doc_{h,i} is a document in collection c_h
–t_{h,j} is a term in collection c_h

More Terminology
N_h = number of documents in collection c_h
V_h = vocabulary / set of all terms in c_h
M_h = number of terms in collection c_h
–M_q = number of terms in query q
–M_h = |V_h|

TF_{h,i,j} = Term Frequency
Definition: number of times term t_{h,j} occurs in document doc_{h,i}
gGLOSS assumes TF_{h,i,j} = 0 or avgTF_{h,j}
–avgTF_{h,j} = (1/N_h) * Sum_{i=1..N_h} TF_{h,i,j}
–TFs assumed to be identical for all documents in collection c_h that contain one or more occurrences of term t_{h,j}
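A minimal sketch (not from the paper) of how these counts could be computed; the toy collection and function names are invented for illustration:

from collections import Counter

# A toy collection c_h: each document is a list of (already stemmed/stopped) terms.
collection = {
    "doc1": ["comput", "cat", "comput"],
    "doc2": ["cat", "dog"],
    "doc3": ["comput"],
}

def tf(doc_terms, term):
    """TF_{h,i,j}: number of times `term` occurs in one document."""
    return Counter(doc_terms)[term]

def avg_tf(collection, term):
    """avgTF_{h,j}: average of TF over all N_h documents in the collection."""
    n_h = len(collection)
    return sum(tf(terms, term) for terms in collection.values()) / n_h

print(tf(collection["doc1"], "comput"))   # 2
print(avg_tf(collection, "comput"))       # (2 + 0 + 1) / 3 = 1.0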

TF_{h,i,max} = Maximum Term Frequency
t_{h,i,max} = term occurring the most frequently in document doc_{h,i}
TF_{h,i,max} = number of times that term t_{h,i,max} occurs in document doc_{h,i}
Example: doc_{h,i} = "Cat cat dog cat cat"
–t_{h,i,max} = "cat"
–TF_{h,i,max} = 4
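The slide's example, sketched in Python (assuming case folding but no stemming):

from collections import Counter

doc = "Cat cat dog cat cat".lower().split()
term, tf_max = Counter(doc).most_common(1)[0]
print(term, tf_max)   # cat 4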

IDF_{h,j} = Inverse Document Frequency
DF_{h,j} = document frequency
–number of docs in collection c_h containing term t_{h,j}
IDF_{h,j} = 1 / DF_{h,j}
–the literal interpretation of "inverse"
IDF_{h,j} = log(N_h / DF_{h,j})
–how it's used
–a normalization technique
Note: term t_{h,j} must appear in at least one document in the collection, or DF_{h,j} will be 0 and IDF_{h,j} will be undefined
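A hedged sketch of DF and the logarithmic IDF, reusing the toy collection idea above (names and data are illustrative only):

import math

collection = {
    "doc1": ["comput", "cat", "comput"],
    "doc2": ["cat", "dog"],
    "doc3": ["comput"],
}

def df(collection, term):
    """DF_{h,j}: number of documents in c_h containing the term at least once."""
    return sum(1 for terms in collection.values() if term in terms)

def idf(collection, term):
    """IDF_{h,j} = log(N_h / DF_{h,j}); undefined (here: an error) if DF is 0."""
    d = df(collection, term)
    if d == 0:
        raise ValueError("term does not occur in the collection; IDF undefined")
    return math.log(len(collection) / d)

print(df(collection, "cat"))    # 2
print(idf(collection, "cat"))   # log(3 / 2)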

W_{h,i,j}(scheme) = Term Weight
Definition: the "weight" assigned to term t_{h,j} in document doc_{h,i} by a weighting scheme
W_{q,j}(scheme) = the weight assigned to term t_{q,j} in query q by a weighting scheme
–We drop one subscript because queries don't belong to collections, unless you consider the set of queries to be a collection in itself [no one seems to do this]
Note: for single-term queries, weights might suffice

W_{h,i,j}(atn)
"atn" is a code representing choices made during a three-part calculation process [a, t, n]
X = TF_{h,i,j} / TF_{h,i,max}   -- the TF part
Y = log(N_h / DF_{h,j})         -- the IDF part
W_{h,i,j}(atn) = X * Y
Note: TF_{h,i,max} might be the maximum term frequency in doc_{h,i} with the added constraint that the max term must occur in the query. If so, then X depends on query composition and therefore cannot be calculated until query time.
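A minimal sketch of the atn weight as defined on this slide (function and argument names are my own):

import math

def w_atn(tf, tf_max, n_docs, df):
    """W_{h,i,j}(atn) = (TF / TF_max) * log(N_h / DF)."""
    x = tf / tf_max              # the TF part
    y = math.log(n_docs / df)    # the IDF part
    return x * y

# A term occurring 3 times in a doc whose most frequent term occurs 5 times,
# in a collection of 1000 docs where 10 docs contain the term.
print(w_atn(tf=3, tf_max=5, n_docs=1000, df=10))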

W_{h,i,j}(atc)
X = TF_{h,i,j} / TF_{h,i,max}              -- the TF part
Y = log(N_h / DF_{h,j})                    -- the IDF part
Z = sqrt( Sum_{k=1..M_h} (X_k * Y_k)^2 )   -- normalization over the document's terms
W_{h,i,j}(atc) = X * Y / Z
atc is atn with vector-length normalization
–atc is better for comparing long documents
–atn is better for comparing short documents, and is cheaper to calculate
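A sketch of atc weights for one document: compute the unnormalized atn weights for every term, then divide by the vector length (all names and data are illustrative):

import math

def atc_weights(doc_tf, n_docs, df):
    """
    doc_tf: {term: TF in this document}
    df:     {term: DF in the collection}
    Returns {term: W(atc)} for this document.
    """
    tf_max = max(doc_tf.values())
    # Unnormalized atn weights: (TF / TF_max) * log(N / DF)
    atn = {t: (tf / tf_max) * math.log(n_docs / df[t]) for t, tf in doc_tf.items()}
    # Vector-length normalization factor Z
    z = math.sqrt(sum(w * w for w in atn.values()))
    return {t: w / z for t, w in atn.items()}

doc_tf = {"cat": 4, "dog": 1}
df = {"cat": 20, "dog": 5}
print(atc_weights(doc_tf, n_docs=100, df=df))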

Query Time
TFs, IDFs, and [possibly] Ws can be calculated prior to performing any queries.
Queries are made up of one or more terms.
–Some systems perceive queries as documents.
–Others see them as sets of keywords.
The job at query time is to determine how well each document/collection "matches" a query.
We calculate a similarity score for each document/collection relative to a query.

S_{h,i,q}(scheme) = Similarity Score
Definition: estimated similarity of document doc_{h,i} to query q using a scheme
Also called relevance score
S_{h,i,q}(scheme) = Sum_{j=1..M_q} ( W_{h,i,j}(scheme) * W_{q,j}(scheme) )   -- Eq 1
CVV assumes that W_{q,j}(scheme) = 1 for all terms t_j that occur in query q, so:
–S_{h,i,q}(atn) = Sum_{j=1..M_q} W_{h,i,j}(atn)   -- Eq 3
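A sketch of Eq 1 and the Eq 3 simplification, assuming precomputed weight dictionaries (names are my own):

def similarity(doc_weights, query_weights):
    """Eq 1: S = sum over query terms of W_doc(term) * W_query(term)."""
    return sum(doc_weights.get(t, 0.0) * w_q for t, w_q in query_weights.items())

def similarity_atn(doc_weights, query_terms):
    """Eq 3: with all query-term weights assumed to be 1, S is just a sum of doc weights."""
    return sum(doc_weights.get(t, 0.0) for t in query_terms)

doc_weights = {"cat": 0.8, "dog": 0.3}
print(similarity(doc_weights, {"cat": 1.0, "dog": 1.0}))  # 1.1
print(similarity_atn(doc_weights, ["cat", "dog"]))        # 1.1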

Ranking and Returning the "Best" Documents
Rank documents in descending order of similarity score to the query.
One method: get all docs with similarity scores above a specified threshold theta
CVV retrieves the top-H+ documents
–Include all documents tied with the H-th best document
–Assume the H-th best doc's similarity score is > 0
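A sketch of top-H+ selection as described above (ties with the H-th best score are kept, zero scores dropped); data and names are illustrative:

def top_h_plus(scores, h):
    """
    scores: {doc_id: similarity score}
    Returns all docs whose score ties or beats the H-th best score and is > 0.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    cutoff = ranked[h - 1][1] if len(ranked) >= h else 0.0
    return [(d, s) for d, s in ranked if s >= cutoff and s > 0]

print(top_h_plus({"d1": 0.9, "d2": 0.5, "d3": 0.5, "d4": 0.1}, h=2))
# [('d1', 0.9), ('d2', 0.5), ('d3', 0.5)]  -- d3 is tied with the 2nd-best doc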

Multiple Collection Search
Also called collection selection
In CVV, brokers need access to DFs
–must be centralized, periodically updated
–all IDFs then provided to collection servers
Why?
1) "N is the number of texts in the database" [page 2]
2) "We refer to index servers as collection servers, as each of them can be viewed as a database carrying a collection of documents." [page 2]
3) N and DF are both particular to a collection, so what extra-collection information is needed in Equation 3?

CVV = Cue-Validity Variance
Also called the CVV ranking method
Goodness can be derived completely from DF_{i,j} and N_i

CVV Terminology
C = set of collections in the system
|C| = number of collections in the system
N_i = number of documents in collection c_i
DF_{i,j} = # times term t_j occurs in collection c_i, or # documents in c_i containing term t_j
CV_{i,j} = cue-validity of term t_j for collection c_i
CVV_j = cue-validity variance of term t_j across collections
G_{i,q} = goodness of collection c_i to query q

CVV: Calculation
A = DF_{i,j} / N_i
B = Sum_{k!=i} DF_{k,j} / Sum_{k!=i} N_k   ?=   Sum_{k!=i} (DF_{k,j} / N_k)
CV_{i,j} = A / (A + B)
avgCV_j = (1/|C|) * Sum_{i=1..|C|} CV_{i,j}
CVV_j = (1/|C|) * Sum_{i=1..|C|} (CV_{i,j} - avgCV_j)^2
G_{i,q} = Sum_{j=1..M} (CVV_j * DF_{i,j})
I assume that's what they meant (that the sum in G runs over the M query terms)
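A sketch of the CVV computation as laid out above, driven by per-collection DF tables (all data and names are invented for illustration):

def cvv_goodness(dfs, sizes, query_terms):
    """
    dfs:   {collection_id: {term: DF_{i,j}}}
    sizes: {collection_id: N_i}
    Returns {collection_id: G_{i,q}} for the given query terms.
    """
    colls = list(dfs)
    n_colls = len(colls)

    def cv(i, term):
        # A = density of the term in collection i
        a = dfs[i].get(term, 0) / sizes[i]
        # B = density of the term in all other collections (pooled)
        other_df = sum(dfs[k].get(term, 0) for k in colls if k != i)
        other_n = sum(sizes[k] for k in colls if k != i)
        b = other_df / other_n
        return a / (a + b) if (a + b) > 0 else 0.0

    cvv = {}
    for term in query_terms:
        cvs = [cv(i, term) for i in colls]
        avg = sum(cvs) / n_colls
        cvv[term] = sum((c - avg) ** 2 for c in cvs) / n_colls

    return {i: sum(cvv[t] * dfs[i].get(t, 0) for t in query_terms) for i in colls}

dfs = {"c1": {"cat": 50, "dog": 5}, "c2": {"cat": 5, "dog": 40}}
sizes = {"c1": 200, "c2": 150}
print(cvv_goodness(dfs, sizes, ["cat", "dog"]))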

Goodness
...of a collection relative to a query
Denoted G_{i,q}, where i is a collection id and q is a query id
G_{i,q} is a sum of scores over all terms in the query
Each score represents how well term q_j characterizes collection c_i [i is a collection id, j is a term id]
G_{i,q} = Sum_{j=1..M} (CVV_j * DF_{i,j})
The collection with the highest Goodness is the "best" [most relevant] collection for this query

Goodness: Example
Query_A = "cat dog"
q_1 = cat, q_2 = dog, M = |q| = 2
You can look at this as [with user-friendly subscripts]:
G_{Coll_1,Query_A} = score_{Coll_1,cat} + score_{Coll_1,dog}
G_{Coll_2,Query_A} = score_{Coll_2,cat} + score_{Coll_2,dog}
...
Note: The authors overload the identifier q. At times it represents a query id [see Equation 1]. At other times, it represents a set [bag?] of terms: {q_i, i from 1 to M}.

Query Term Weights
What if Query_A = "cat cat dog"?
–Do we allow this? Should we weigh cat more heavily than dog? If so, how?
Example: score_{Coll_1,cat} = 10, score_{Coll_1,dog} = 5; score_{Coll_2,cat} = 5, score_{Coll_2,dog} = 11
–Intuitively, Coll_1 is more relevant to Query_A
Scores might be computed prior to processing a query
–get all collections' scores for all terms in the vocabulary
–add the appropriate pre-computed scores when given a query

QTW: CVV Assumptions
The authors are concerned primarily with Internet queries [unlike us].
They assume [based on their observations of users' query tendencies] that terms appear at most once in a query.
Their design doesn't support query term weights; it only cares whether a term is present in the query.
Their design cannot easily be used to "find me documents like this one".

QTW: Approach #1
Approach #1: q_1 = cat, q_2 = dog
–Ignore duplicates.
–Results in a "binary term vector".
–G_{Coll_1,Query_A} = 10 + 5 = 15
 G_{Coll_2,Query_A} = 5 + 11 = 16   -- top guess
–Here we see their algorithm would consider Coll_2 to be more relevant than Coll_1, which is counter to our intuition.

QTW: Approach #2
Approach #2: q_1 = cat, q_2 = cat, q_3 = dog
–You need to make q a bag [allowing duplicate elements] instead of a set [which doesn't allow dups]
–G_{Coll_1,Query_A} = 10 + 10 + 5 = 25   -- top guess
 G_{Coll_2,Query_A} = 5 + 5 + 11 = 21
–Results in the "correct" answer.
–Easy to implement once you have a bag set up.
–However, primitive brokers will have to calculate [or locate if pre-calculated] cat's scores twice.

QTW: Approach #3
Approach #3: q_1 = cat, q_2 = dog, w_1 = 2, w_2 = 1
–The "true" term vector approach.
–G_{Coll_1,Query_A} = 10*2 + 5*1 = 25   -- top guess
 G_{Coll_2,Query_A} = 5*2 + 11*1 = 21
–Results in the "correct" answer.
–Don't need to calculate scores multiple times.
–If query term weights tend to be:
  >1 -- you save space: [cat,50] instead of fifty "cat cat..."
  almost all 1 -- less efficient
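A sketch contrasting the three approaches on the slides' cat/dog example (the scores and collection names are the slides' toy values):

from collections import Counter

scores = {
    "Coll_1": {"cat": 10, "dog": 5},
    "Coll_2": {"cat": 5, "dog": 11},
}
raw_query = ["cat", "cat", "dog"]

# Approach #1: ignore duplicates (binary term vector)
q1 = set(raw_query)
g1 = {c: sum(s[t] for t in q1) for c, s in scores.items()}

# Approach #2: keep the query as a bag; duplicate terms are scored twice
g2 = {c: sum(s[t] for t in raw_query) for c, s in scores.items()}

# Approach #3: collapse duplicates into explicit term weights
weights = Counter(raw_query)   # {'cat': 2, 'dog': 1}
g3 = {c: sum(s[t] * w for t, w in weights.items()) for c, s in scores.items()}

print(g1)  # {'Coll_1': 15, 'Coll_2': 16}  -- picks Coll_2 (counter to intuition)
print(g2)  # {'Coll_1': 25, 'Coll_2': 21}  -- picks Coll_1
print(g3)  # {'Coll_1': 25, 'Coll_2': 21}  -- same answer, each score looked up once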

QTW: Approach #3 (cont)
Approach #3 is the most TREC-friendly
–TREC queries often have duplicate terms
–Approach #3 results in "correct" answers and is more efficient than Approach #2
#3 is sometimes better for WWW search:
–"Find me more docs like this" -- doc similarities
–Iterative search engines can use query term weights to hone queries [example on the next slide]
–Possibility of negative term weights [see example]

QTW: Iterative Querying (example)
Query_1: "travis(5) emmitt(5) football(5)"
–results in lots on Emmitt Smith, nothing on Travis
–User tells the engine that "emmitt smith" is irrelevant
–Engine adjusts each query term weight in the "black list" by -1, then performs a revised query:
Query_2: "travis(5) emmitt(4) football(5) smith(-1)"
–Hopefully yields less Emmitt Smith, more Travis
–Repeat the cycle of user feedback, weight tweaking, and requerying until the user is satisfied [or gives up]
Can't do this easily without term weights
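A hedged sketch of the weight-tweaking step above; the -1 adjustment and the black list come from the slide's example, everything else is invented:

def revise_query(weights, black_list, penalty=1):
    """Subtract `penalty` from every black-listed term, adding unseen terms with a negative weight."""
    revised = dict(weights)
    for term in black_list:
        revised[term] = revised.get(term, 0) - penalty
    return revised

query_1 = {"travis": 5, "emmitt": 5, "football": 5}
query_2 = revise_query(query_1, black_list=["emmitt", "smith"])
print(query_2)  # {'travis': 5, 'emmitt': 4, 'football': 5, 'smith': -1}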

QTW: User Profiles
Might also have user profiles:
–Allison loves cats, hates XXX, likes football
–Her profile: cats(+3), XXX(-3), football(+1)
–Adjustments are made to every query she issues.
Issues: "wearing different hats", relying on keywords, wanting sensitivity to context:
–"XXX" is especially bad when JPEGs are present
–"XXX" is not bad when it appears in source code: "XXX:=1;"

QTW: Conclusion
The bottom line is that query term weights can be useful, not just in a TREC scenario but in an Internet search scenario.
CVV could probably be changed to support query term weights [it might already have been].
The QTW discussion was included mostly as a segue to interesting, advanced issues: iterative querying, user profiles, context.

Query Forwarding
Single-cast approach
–Get documents from the best collection only.
–Fast and simple. No result merging.
–Question: How often will this in fact suffice?
Multi-cast approach
–Get documents from the best n collections.
–Slower; requires result merging.
–Desired if the best collection isn't complete.

Result Merging
local doc ranks -> global doc ranks
r_{i,j} = rank of document doc_j in collection c_i
–Ambiguous when dealing with multiple queries and multiple similarity estimation schemes [which is what we do].
–Should actually be r_{i,j,q}(scheme)
c_{min,q} = collection with the least similarity to query q
G_{min,q} = goodness score of c_{min,q} relative to query q

Result Merging (cont)
D_i = estimated score distance between documents at ranks x and x+1
–D_i = G_{min,q} / (H * G_{i,q})
s_{i,j} = 1 - (r_{i,j} - 1) * D_i
–global relevance score of the j-th-ranked doc in c_i
–need to re-rank documents globally
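A sketch of this merging rule: each collection's local ranking is turned into global scores using its own D_i, then all documents are re-sorted together (names and data are illustrative):

def merge_results(local_rankings, goodness, h):
    """
    local_rankings: {collection_id: [doc_id, ...]} in local rank order (best first)
    goodness:       {collection_id: G_{i,q}}
    Returns a single list of (doc_id, global score), best first.
    """
    g_min = min(goodness.values())
    merged = []
    for coll, docs in local_rankings.items():
        d_i = g_min / (h * goodness[coll])   # score distance between consecutive ranks
        for rank, doc in enumerate(docs, start=1):
            s = 1 - (rank - 1) * d_i         # global relevance score s_{i,j}
            merged.append((doc, s))
    return sorted(merged, key=lambda x: x[1], reverse=True)

local = {"c1": ["d11", "d12", "d13"], "c2": ["d21", "d22"]}
goodness = {"c1": 0.9, "c2": 0.3}
print(merge_results(local, goodness, h=3))
# The top-ranked doc of every collection gets global score 1, matching Assumption 1.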

CVV Assumption #1
Assumption 1: The best document in collection c_i is equally relevant to query q (has the same global score) as the best document in collection c_k, for any k != i with G_{i,q}, G_{k,q} > 0.
Nitpick: if k = i then G_{i,q} = G_{k,q}, so there is no reason to require k != i

CVV Assumption #1: Motivation
They don't want to require the same search algorithm at each site [collection server].
Sites will therefore tend to use different scales for Goodness; you can't simply compare scores directly.
They want a "collection containing a few but highly relevant documents to contribute to the final result."

CVV Assumption #1: Critique
What about collections with a few weak documents? Or a few redundant documents [that also occur in other, "better" collections]?
They omit collections with goodness scores less than half the highest goodness score
–The best document could exist by itself in an otherwise lame collection. The overall Goodness for that collection might be lower than half the max (since doc scores are used).

CVV Assumption #2
Assumption 2: The distance, in terms of absolute relevance score difference, between two consecutive document ranks in the result set of a collection is inversely proportional to the goodness score of the collection.

CVV vs gGLOSS
Their characterization of gGLOSS:
–"a keyword based distributed database broker system"
–"relies on the weight-sum of every term in a collection."
–assumes that within a collection c_i all docs contain either 0 or avgTF_{i,j} occurrences of term t_j
–assumes document weights are computed similarly in all collections
–Y = log(N̂ / DF̂_j), where N̂ and DF̂_j are "global" values

Performance Comparisons
Accuracy calculated from the cosine of the angle between the estimated goodness vector and a baseline goodness vector.
–Based on the top H+
–Independent of precision and recall.
They of course say that CVV is best
–gGLOSS appears much better than CORI (!)