Query operations

1. Introduction
2. Relevance feedback with user relevance information
3. Relevance feedback without user relevance information
   - Local analysis (pseudo-relevance feedback)
   - Global analysis (thesaurus)
4. Evaluation
5. Issues
Introduction (1)

- No detailed knowledge of the collection and the retrieval environment
  - Difficult to formulate queries that are well designed for retrieval
  - Many query formulations are needed for effective retrieval
- First formulation: often a naïve attempt to retrieve relevant information
- Documents initially retrieved:
  - Examined for relevance information (by the user, or automatically)
  - Used to improve the query formulation and retrieve additional relevant documents
- Query reformulation:
  - Expanding the original query with new terms
  - Reweighting the terms in the expanded query
Introduction (2)

- Approaches based on feedback from users (relevance feedback)
- Approaches based on information derived from the set of initially retrieved documents (the local set of documents)
- Approaches based on global information derived from the document collection
Relevance feedback with user relevance information (1)

- Most popular query reformulation strategy
- Cycle:
  - The user is presented with a list of retrieved documents
  - The user marks those which are relevant
    - In practice: only the top-ranked documents are examined
    - The process can be incremental
  - Important terms are selected from the documents assessed relevant by the user
  - The importance of these terms is enhanced in a new query
- Expected effect:
  - The new query moves towards the relevant documents and away from the non-relevant documents
Relevance feedback with user relevance information (2)

- Two basic techniques
  - Query expansion: add new terms from relevant documents
  - Term reweighting: modify term weights based on user relevance judgements
- Advantages
  - Shields users from the details of the query reformulation process
  - Breaks the search down into a sequence of small steps
  - Provides a controlled process
    - Emphasise some terms (relevant ones)
    - De-emphasise other terms (non-relevant ones)
Relevance feedback with user relevance information (3)

- Query expansion and term reweighting in the vector space model
- Term reweighting in the probabilistic model
Query expansion and term reweighting in the vector space model

- Term weight vectors of documents assessed relevant: similar among themselves
- Term weight vectors of documents assessed non-relevant: dissimilar from those of the relevant documents
- Reformulated query: closer to the term weight vectors of the relevant documents
Query expansion and term reweighting in the vector space model

For a query q:
- D_r: set of relevant documents among the retrieved documents
- D_n: set of non-relevant documents among the retrieved documents
- C_r: set of relevant documents among all documents in the collection
- \alpha, \beta, \gamma: tuning constants

Assume that C_r is known (unrealistic!). The best query vector for distinguishing the relevant documents from the non-relevant documents is then

  q_{opt} = \frac{1}{|C_r|} \sum_{d_j \in C_r} d_j - \frac{1}{N - |C_r|} \sum_{d_j \notin C_r} d_j
Query expansion and term reweighting in the vector space model

- Problem: C_r is not known in advance
- Approach
  - Formulate an initial query
  - Incrementally change the initial query vector
  - Use D_r and D_n instead of C_r
- Rocchio formula
- Ide formula
Rocchio formula

- Direct application of the previous formula, with the original query added:

  q_{new} = \alpha q + \frac{\beta}{|D_r|} \sum_{d_j \in D_r} d_j - \frac{\gamma}{|D_n|} \sum_{d_j \in D_n} d_j

- Initial formulation: \alpha = 1
- Usually the information in relevant documents is more important than that in non-relevant documents (\gamma \ll \beta)
- Positive relevance feedback: \gamma = 0
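A minimal sketch of the Rocchio update in Python, assuming dense numpy term-weight vectors; the function name and the default values beta = 0.75 and gamma = 0.15 are common illustrative choices, not from the slides:

```python
import numpy as np

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the reformulated query vector.

    query:        term-weight vector of the original query
    relevant:     term-weight vectors of documents judged relevant (D_r)
    non_relevant: term-weight vectors of documents judged non-relevant (D_n)
    """
    q_new = alpha * query
    if relevant:                                         # beta term: centroid of D_r
        q_new = q_new + beta * np.mean(relevant, axis=0)
    if non_relevant:                                     # gamma term: centroid of D_n
        q_new = q_new - gamma * np.mean(non_relevant, axis=0)
    return np.maximum(q_new, 0.0)                        # drop negative weights

# Toy example over a 5-term vocabulary
q = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
d_r = [np.array([0.9, 0.1, 0.8, 0.0, 0.3])]
d_n = [np.array([0.0, 0.7, 0.1, 0.9, 0.0])]
print(rocchio(q, d_r, d_n))
```

Clipping negative weights to zero mirrors the SMART practice described on the next slide.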
Rocchio formula in practice (SMART)

- Tuning constants set to 1
- Terms selected for the expanded query:
  - Appear in the original query
  - Appear in more relevant documents than non-relevant documents
  - Appear in more than half of the relevant documents
- Negative weights are ignored
Ide formula

- Initial formulation: \alpha = \beta = \gamma = 1

  q_{new} = \alpha q + \beta \sum_{d_j \in D_r} d_j - \gamma \sum_{d_j \in D_n} d_j

- Same comments as for the Rocchio formula
- Neither Ide nor Rocchio is derived from an optimality criterion
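The Ide (regular) variant differs from Rocchio only in using unnormalised sums instead of centroids; a sketch under the same assumptions as the Rocchio example above:

```python
import numpy as np

def ide_regular(query, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    q_new = alpha * query
    if relevant:
        q_new = q_new + beta * np.sum(relevant, axis=0)       # sum over D_r, no 1/|D_r|
    if non_relevant:
        q_new = q_new - gamma * np.sum(non_relevant, axis=0)  # sum over D_n, no 1/|D_n|
    return np.maximum(q_new, 0.0)
```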
Term reweighting for the probabilistic model

- (See the note on the BIR model)
  - Use idf to rank documents for the original query
  - Calculate improved term weights from the relevance information
  - Predict relevance: an improved (optimal) retrieval function
Term reweighting for the probabilistic model

- Independence assumptions
  - I1: the distribution of terms in relevant documents is independent, and their distribution in all documents is independent
  - I2: the distribution of terms in relevant documents is independent, and their distribution in irrelevant documents is independent
- Ordering principles
  - O1: probable relevance is based on the presence of search terms in documents
  - O2: probable relevance is based on the presence of search terms in documents and their absence from documents
Term reweighting for the probabilistic model

Various combinations of the independence assumptions and ordering principles yield four weighting formulas:

        O1    O2
  I1    F1    F3
  I2    F2    F4
Term reweighting for the probabilistic model

F1 formula: ratio of the proportion of relevant documents in which the query term t_i occurs to the proportion of all documents in which t_i occurs

  F1 = \log \frac{r_i / R}{n_i / N}

- r_i = number of relevant documents containing t_i
- n_i = number of documents containing t_i
- R = number of relevant documents
- N = number of documents in the collection
Term reweighting for the probabilistic model

F2 formula: ratio of the proportion of relevant documents in which the term t_i occurs to the proportion of all irrelevant documents in which t_i occurs

  F2 = \log \frac{r_i / R}{(n_i - r_i) / (N - R)}

- r_i = number of relevant documents containing t_i
- n_i = number of documents containing t_i
- R = number of relevant documents
- N = number of documents in the collection
Term reweighting for the probabilistic model

F3 formula: ratio of the "relevance odds" (ratio of relevant documents containing t_i to relevant documents not containing t_i) and the "collection odds" (ratio of documents containing t_i to documents not containing t_i)

  F3 = \log \frac{r_i / (R - r_i)}{n_i / (N - n_i)}

- r_i = number of relevant documents containing t_i
- n_i = number of documents containing t_i
- R = number of relevant documents
- N = number of documents in the collection
Term reweighting for the probabilistic model

F4 formula: ratio of the "relevance odds" and the "non-relevance odds" (ratio of non-relevant documents containing t_i to non-relevant documents not containing t_i)

  F4 = \log \frac{r_i / (R - r_i)}{(n_i - r_i) / ((N - R) - (n_i - r_i))}

- r_i = number of relevant documents containing t_i
- n_i = number of documents containing t_i
- R = number of relevant documents
- N = number of documents in the collection
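A sketch computing the four weights as reconstructed above. The 0.5 smoothing constant eps is an assumption added so that zero counts do not produce division by zero or log of zero; the slide formulas themselves are unsmoothed:

```python
import math

def relevance_weights(r_i, n_i, R, N, eps=0.5):
    """F1-F4 for a term t_i, given the counts defined on the slides."""
    f1 = math.log(((r_i + eps) / (R + 1)) / ((n_i + eps) / (N + 1)))
    f2 = math.log(((r_i + eps) / (R + 1)) / ((n_i - r_i + eps) / (N - R + 1)))
    f3 = math.log(((r_i + eps) / (R - r_i + eps)) / ((n_i + eps) / (N - n_i + eps)))
    f4 = math.log(((r_i + eps) / (R - r_i + eps)) /
                  ((n_i - r_i + eps) / ((N - R) - (n_i - r_i) + eps)))
    return f1, f2, f3, f4

# Example: term occurs in 8 of 10 relevant documents, 100 of 10000 documents
print(relevance_weights(r_i=8, n_i=100, R=10, N=10000))
```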
Experiments

- F1, F2, F3 and F4 all outperform no relevance weighting and idf
- F1 and F2 perform in the same range, as do F3 and F4
- F3 and F4 outperform F1 and F2
- F4 is slightly better than F3
  - O2 is correct (look at both the presence and the absence of terms)
- No conclusion with respect to I1 and I2, although I2 seems the more realistic assumption
Relevance feedback without user relevance information

- Relevance feedback with user relevance information
  - Clustering hypothesis: known relevant documents contain terms which can be used to describe a larger cluster of relevant documents
  - The description of the cluster is built interactively, with user assistance
- Relevance feedback without user relevance information
  - Obtain the cluster description automatically
  - Identify terms related to the query terms (e.g. synonyms, stemming variations, terms close to the query terms in the text)
- Local strategies
- Global strategies
Local analysis (pseudo-relevance feedback)

- Examine the documents retrieved for the query to determine the query expansion
- No user assistance
- Clustering techniques
- Risk: query "drift"
Clusters (1)

- Synonymy association (one example): terms that frequently co-occur inside the local set of documents D_l
- Term-term (e.g. stem-stem) association matrix, normalised:

  c_{i,j} = \sum_{d \in D_l} f_{i,d} \times f_{j,d}

  m_{i,j} = \frac{c_{i,j}}{c_{i,i} + c_{j,j} - c_{i,j}}

  where f_{i,d} is the frequency of term t_i in document d
Clusters (2)

- For a term t_i
  - Take the n largest values m_{i,j}
  - The resulting terms t_j form the cluster for t_i
- For a query q (see the sketch below)
  - Find the clusters for the |q| query terms
  - Keep the clusters small
  - Expand the original query
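A minimal sketch of local cluster-based expansion, assuming the normalised association matrix m_{i,j} from the previous slide; the document representation ({term: frequency} dicts) and all names are illustrative:

```python
from collections import defaultdict

def association_matrix(local_docs):
    """local_docs: list of {term: frequency} for the top-ranked documents."""
    c = defaultdict(float)
    for freqs in local_docs:
        for ti, fi in freqs.items():
            for tj, fj in freqs.items():
                c[(ti, tj)] += fi * fj          # c_ij = sum of f_i,d * f_j,d
    # Normalise: m_ij = c_ij / (c_ii + c_jj - c_ij)
    return {(ti, tj): cij / (c[(ti, ti)] + c[(tj, tj)] - cij)
            for (ti, tj), cij in c.items() if ti != tj}

def expand_query(query_terms, local_docs, n=2):
    m = association_matrix(local_docs)
    expanded = set(query_terms)
    for ti in query_terms:
        neighbours = sorted(((mij, tj) for (a, tj), mij in m.items() if a == ti),
                            reverse=True)[:n]   # n largest m_ij form t_i's cluster
        expanded.update(tj for _, tj in neighbours)
    return expanded

docs = [{"ir": 3, "retrieval": 2, "query": 1},
        {"ir": 2, "query": 2, "expansion": 1},
        {"retrieval": 1, "ranking": 2}]
print(expand_query(["ir"], docs))
```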
Global analysis

- Expand the query using information from the whole set of documents in the collection
- Thesaurus-like structure built from all documents
  - An approach to automatically build the thesaurus (e.g. a similarity thesaurus based on co-occurrence frequency)
  - An approach to select terms for query expansion
Evaluation of relevance feedback strategies

- Use q_i and compute a precision-recall graph
- Use q_{i+1} and compute a precision-recall graph
  - Using all documents in the collection:
    - Spectacular improvements
    - But these are partly due to documents already known to the user being ranked higher
    - Must evaluate with respect to the documents not seen by the user
- Three techniques (below)
Evaluation of relevance feedback strategies

- Freezing (see the sketch below)
  - Full freezing
    - The top n documents (the ones used in relevance feedback) are frozen
    - The remaining documents are re-ranked
    - Precision-recall is computed on the whole ranking
    - Changes in effectiveness thus come from the unseen documents
    - With many iterations, the growing contribution of the frozen documents may lead to a decrease in measured effectiveness
  - Modified freezing
    - Freeze the ranking up to the position of the last document marked relevant
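A small sketch of full freezing, assuming rankings are lists of document ids; names are illustrative:

```python
def frozen_ranking(original_ranking, new_ranking, n):
    """Top n originally ranked documents stay fixed; the rest are re-ranked.

    original_ranking: ids in the order shown during relevance feedback
    new_ranking:      ids in the order produced by the reformulated query
    """
    frozen = original_ranking[:n]                       # documents used in RF
    rest = [d for d in new_ranking if d not in frozen]  # unseen documents, re-ranked
    return frozen + rest                                # evaluate P-R on this list

print(frozen_ranking(["d1", "d2", "d3", "d4"], ["d3", "d4", "d1", "d2"], n=2))
```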
Evaluation of relevance feedback strategies

- Test and control groups
  - Random splitting of the documents into a test group and a control group
    - Query reformulation is performed on the test documents
    - The new query is run against the control documents
  - Relevance feedback thus uses only the test group; evaluation uses only the control group
  - Difficulty in splitting the collection: the distribution of the relevant documents across the two groups
Evaluation of relevance feedback strategies

- Residual ranking (see the sketch below)
  - The documents used in assessing relevance are removed from the collection
  - Precision-recall is computed on the "residual collection"
  - Considers only the effect on unseen documents
  - Results are not comparable with the original ranking (fewer relevant documents)
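A corresponding sketch of residual-collection evaluation, again assuming list-of-ids rankings and known relevance judgements; names are illustrative:

```python
def residual_precision_at_k(ranking, seen, relevant, k=10):
    """Precision@k over the residual collection.

    ranking:  document ids ordered by the reformulated query's scores
    seen:     ids shown to the user during feedback (removed before scoring)
    relevant: ids of all documents judged relevant
    """
    residual = [d for d in ranking if d not in seen]  # residual collection
    top = residual[:k]
    return sum(1 for d in top if d in relevant) / max(len(top), 1)

ranking = ["d3", "d7", "d1", "d9", "d2", "d5"]
print(residual_precision_at_k(ranking, seen={"d3", "d1"},
                              relevant={"d7", "d2"}, k=3))
```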
Issues

- Interface
  - Allow the user to quickly identify relevant and non-relevant documents
  - What happens with 2D and 3D visualisation?
- Global analysis
  - On the web?
  - Yahoo!
- Local analysis
  - Computational cost (on-line)
- Interactive query expansion
  - The user chooses the terms to be added
Negative relevance feedback

- Documents explicitly marked as non-relevant by users
  - Implementation
  - Clarity
  - Usability