Assessing The Retrieval A.I Lab 박동훈
Contents 4.1 Personal Assessment of Relevance 4.2 Extending the Dialog with RelFbk 4.3 Aggregated Assessment : Search Engine Performance 4.4 RAVE : A Relevance Assessment Vehicle 4.5 Summary
4.1 Personal Assessment of Relevance Cognitive Assumptions – Users trying to do ‘object recognition’ – Comparison with respect to prototypic document – Reliability of user opinions? – Relevance Scale – RelFbk is nonmetric
Relevance Scale
Users naturally provide only preference information, not a (metric) measurement of how relevant a retrieved document is! RelFbk is nonmetric
4.2 Extending the Dialog with RelFbk RelFbk Labeling of the Retr Set
Query Session, Linked by RelFbk
4.2.1 Using RelFbk for Query Refinement
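A common way to turn RelFbk into query refinement is a Rocchio-style update, which moves the query vector toward documents judged relevant and away from documents judged irrelevant. The sketch below is illustrative only; the vector representation and the weights alpha, beta, gamma are assumptions, not values from the text.

```python
import numpy as np

def rocchio_refine(query_vec, rel_docs, nonrel_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style query refinement from relevance feedback.

    query_vec   : term-weight vector for the original query
    rel_docs    : list of vectors the user judged relevant
    nonrel_docs : list of vectors the user judged irrelevant
    alpha, beta, gamma are illustrative weights (assumptions).
    """
    new_q = alpha * np.asarray(query_vec, dtype=float)
    if rel_docs:
        new_q += beta * np.mean(rel_docs, axis=0)
    if nonrel_docs:
        new_q -= gamma * np.mean(nonrel_docs, axis=0)
    # Negative term weights are usually clipped to zero.
    return np.maximum(new_q, 0.0)
```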
4.2.2 Document Modifications due to RelFbk Fig 4.7 Change documents!? Make each document more/less like the query that successfully/unsuccessfully matches it
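A minimal sketch of the dual operation: nudging a document vector toward a query that retrieved it and was judged relevant, or away from one judged irrelevant. The step size eta is a hypothetical parameter, not from the text.

```python
import numpy as np

def adjust_document(doc_vec, query_vec, judged_relevant, eta=0.05):
    """Move a document vector slightly toward (relevant judgment) or
    away from (irrelevant judgment) the query that matched it."""
    direction = 1.0 if judged_relevant else -1.0
    new_doc = np.asarray(doc_vec, float) + direction * eta * np.asarray(query_vec, float)
    return np.maximum(new_doc, 0.0)  # keep term weights non-negative
```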
4.3 Aggregated Assessment : Search Engine Performance Underlying Assumptions – RelFbk(q,di) assessments are independent – Users' opinions will all agree with a single 'omniscient' expert's
4.3.2 Consensual relevance Consensually relevant
4.3.4 Basic Measures Relevant versus Retrieved Sets
Contingency table
– NRel : the number of relevant documents
– NNRel : the number of irrelevant documents
– NDoc : the total number of documents
– NRet : the number of retrieved documents
– NNRet : the number of documents not retrieved
4.3.4 Basic Measures (cont) – Recall = (number of relevant documents retrieved) / NRel – Precision = (number of relevant documents retrieved) / NRet – Fallout = (number of irrelevant documents retrieved) / NNRel
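A minimal sketch of these ratios, assuming the retrieved set and the relevant set are given as Python sets of document ids (the function and variable names are illustrative):

```python
def basic_measures(retrieved, relevant, n_docs):
    """Compute recall, precision, and fallout from a retrieved set and
    a relevant set (both sets of document ids) out of n_docs documents."""
    ret_rel = len(retrieved & relevant)            # retrieved and relevant
    recall = ret_rel / len(relevant) if relevant else 0.0
    precision = ret_rel / len(retrieved) if retrieved else 0.0
    n_nonrel = n_docs - len(relevant)
    fallout = (len(retrieved) - ret_rel) / n_nonrel if n_nonrel else 0.0
    return recall, precision, fallout

# Example: 10-document corpus, 4 relevant, 5 retrieved
print(basic_measures({1, 2, 3, 7, 9}, {2, 3, 4, 5}, 10))  # (0.5, 0.4, 0.5)
```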
4.3.5 Ordering the Retr Set
– Each document is assigned a hitlist rank Rank(di) by sorting on descending Match(q,di): Rank(di) < Rank(dj) iff Match(q,di) >= Match(q,dj)
– Coordination level: a document's rank in Retr is determined by the number of keywords shared by document and query
– Goal: the Probability Ranking Principle, Rank(di) < Rank(dj) iff Pr(Rel(di)) >= Pr(Rel(dj))
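A minimal sketch of hitlist ordering, assuming Match(q,d) is the coordination level (number of query keywords the document shares); the names used here are illustrative:

```python
def coordination_level(query_terms, doc_terms):
    """Match(q, d) as coordination level: number of shared keywords."""
    return len(set(query_terms) & set(doc_terms))

def rank_hitlist(query_terms, docs):
    """Order the retrieved set by descending Match(q, d);
    position in the returned list is Rank(d)."""
    scored = [(coordination_level(query_terms, terms), doc_id)
              for doc_id, terms in docs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for score, doc_id in scored if score > 0]

docs = {"d1": ["ir", "recall"], "d2": ["ir", "recall", "precision"], "d3": ["sports"]}
print(rank_hitlist(["ir", "precision", "recall"], docs))  # ['d2', 'd1']
```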
A tale of two retrievals: comparing the hitlists returned for Query1 and Query2
Recall/precision curve Query1
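The curve is traced by walking down the ranked hitlist and recording a (recall, precision) point at each rank where a relevant document appears. A minimal sketch, assuming the hitlist is a ranked list of document ids and relevant is a set:

```python
def recall_precision_curve(hitlist, relevant):
    """Return (recall, precision) points measured at each relevant
    document encountered while walking down the ranked hitlist."""
    points, found = [], 0
    for rank, doc_id in enumerate(hitlist, start=1):
        if doc_id in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

# Example: 3 relevant docs, two retrieved early and one late
print(recall_precision_curve(["d2", "d7", "d1", "d9", "d3"], {"d2", "d1", "d3"}))
```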
Retrieval envelope
4.3.6 Normalized Recall – ri : hitlist rank of the i-th relevant document – ranges between the worst-case ordering (all relevant documents ranked last) and the best-case ordering (all relevant documents ranked first)
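Using the standard definition of normalized recall, Rnorm = 1 − (Σ ri − Σ i) / (NRel · (NDoc − NRel)), which equals 1 when the relevant documents fill the top of the hitlist and 0 when they all sit at the bottom, a minimal sketch:

```python
def normalized_recall(relevant_ranks, n_docs):
    """Normalized recall: relevant_ranks are the hitlist ranks r_i
    (1-based) of the relevant documents; n_docs is the corpus size."""
    n_rel = len(relevant_ranks)
    actual = sum(relevant_ranks)
    ideal = sum(range(1, n_rel + 1))          # best case: ranks 1..n_rel
    return 1.0 - (actual - ideal) / (n_rel * (n_docs - n_rel))

print(normalized_recall([1, 2, 3], 100))      # 1.0 (best ordering)
print(normalized_recall([98, 99, 100], 100))  # 0.0 (worst ordering)
```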
4.3.8 One-Parameter Criteria Combining recall and precision Classification accuracy Sliding ratio Point alienation
Combining recall and precision F-measure – [Jardine & van Rijsbergen, 1971] – [Lewis & Gale, 1994] Effectiveness – [van Rijsbergen, 1979] E = 1 − F, α = 1/(β² + 1); α = 0.5 gives the harmonic mean of precision and recall
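Under van Rijsbergen's definition, E = 1 − 1/(α/P + (1−α)/R) with α = 1/(β² + 1), so F = 1 − E reduces to the harmonic mean 2PR/(P+R) when α = 0.5 (β = 1). A minimal sketch:

```python
def e_measure(precision, recall, beta=1.0):
    """van Rijsbergen's effectiveness measure E = 1 - F_beta."""
    if precision == 0.0 or recall == 0.0:
        return 1.0
    alpha = 1.0 / (beta ** 2 + 1.0)
    f = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return 1.0 - f

def f_measure(precision, recall, beta=1.0):
    """F_beta = 1 - E; beta = 1 gives the harmonic mean of P and R."""
    return 1.0 - e_measure(precision, recall, beta)

print(f_measure(0.5, 0.5))    # 0.5
print(f_measure(0.25, 0.75))  # 0.375 == 2*P*R/(P+R)
```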
Classification accuracy – fraction of documents correctly identified: relevant documents retrieved plus irrelevant documents not retrieved
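Viewing retrieval as binary classification over the whole corpus, accuracy counts correctly handled documents out of NDoc. A minimal sketch using sets of document ids (names are illustrative):

```python
def classification_accuracy(retrieved, relevant, all_docs):
    """Accuracy of treating 'retrieved' as the predicted-relevant class."""
    true_pos = len(retrieved & relevant)           # relevant and retrieved
    true_neg = len(all_docs - retrieved - relevant)  # irrelevant and not retrieved
    return (true_pos + true_neg) / len(all_docs)

all_docs = {"d1", "d2", "d3", "d4"}
print(classification_accuracy({"d1", "d2"}, {"d1", "d3"}, all_docs))  # 0.5
```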
Sliding ratio – imagine a nonbinary, metric Rel(di) measure – compare Rank1, Rank2, the orderings computed by two separate systems, via the relevance each accumulates down its hitlist
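A minimal sketch, assuming rel is a dict mapping document ids to graded (metric) relevance values and rank1, rank2 are the two systems' hitlists; the sliding ratio at cutoff k compares the relevance accumulated by each ordering:

```python
def sliding_ratio(rank1, rank2, rel, k):
    """Ratio of relevance accumulated in the top-k of two orderings.
    rel maps doc id -> graded relevance; rank1, rank2 are ranked lists."""
    gain1 = sum(rel.get(d, 0.0) for d in rank1[:k])
    gain2 = sum(rel.get(d, 0.0) for d in rank2[:k])
    return gain1 / gain2 if gain2 else float("inf")

rel = {"d1": 3.0, "d2": 2.0, "d3": 1.0, "d4": 0.0}
print(sliding_ratio(["d1", "d2", "d4"], ["d4", "d3", "d1"], rel, 2))  # 5.0
```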
Point alienation – developed to measure human preference data – captures the fundamentally nonmetric nature of RelFbk
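Point alienation scores how well an ordering agrees with pairwise preferences. The sketch below shows a generic pairwise agreement statistic of that flavor (not necessarily the exact formula used in the text), assuming prefs is a list of (preferred_doc, other_doc) pairs from RelFbk and rank maps document ids to hitlist positions:

```python
def pairwise_agreement(prefs, rank):
    """Fraction of preference pairs (a preferred over b) that the
    hitlist ordering respects (rank[a] < rank[b]), scaled to [-1, 1]."""
    agree = sum(1 if rank[a] < rank[b] else -1 for a, b in prefs)
    return agree / len(prefs) if prefs else 0.0

rank = {"d1": 1, "d2": 2, "d3": 3}
print(pairwise_agreement([("d1", "d3"), ("d3", "d2")], rank))  # 0.0
```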
4.3.9 Test corpora More data required for a "test corpus" Standard test corpora TREC: Text REtrieval Conference TREC's refined queries TREC constantly expanding, refining tasks
More data required for “test corpus” Documents Queries Relevance assessments Rel(q,d) Perhaps other data too – Classification data (Reuters) – Hypertext graph structure (EB5)
Standard test corpora
TREC constantly expanding, refining tasks – Ad hoc query task – Routing/filtering task – Interactive task
Other Measures Expected search length (ESL) – Length of the "path" as a user walks down the HitList – ESL = number of irrelevant documents examined before each relevant document – ESL for random retrieval – ESL reduction factor
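A minimal sketch of expected search length for a need of n relevant documents, counting the irrelevant documents a user passes while walking down the ranked hitlist (ties and the random-retrieval baseline are ignored here):

```python
def expected_search_length(hitlist, relevant, n_needed):
    """Number of irrelevant documents examined while walking down the
    hitlist until n_needed relevant documents have been found."""
    irrelevant_seen, found = 0, 0
    for doc_id in hitlist:
        if doc_id in relevant:
            found += 1
            if found == n_needed:
                return irrelevant_seen
        else:
            irrelevant_seen += 1
    return irrelevant_seen  # hitlist exhausted before satisfying the need

print(expected_search_length(["d5", "d2", "d8", "d1"], {"d2", "d1"}, 2))  # 2
```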
4.5 Summary Discussed both metric and nonmetric relevance feedback The difficulties of getting users to provide relevance judgments for documents in the retrieved set Quantified several measures of system performance