Measuring How Good Your Search Engine Is
Information System Evaluation
- Before 1993, evaluations were done using a few small, well-known corpora of test documents, such as the Cranfield collection: 1,400 documents, 225 queries, exhaustive relevance judgements.
- Systems are now evaluated at the annual Text REtrieval Conference (TREC).
Reasons to Evaluate the Effectiveness of an IR System
- To aid in the selection of a system to procure.
- To evaluate query generation processes for possible improvements.
- To determine the effects of changes made to an existing information system, e.g. of changing the system's algorithms.
Relevance
- The most important evaluation metrics of information systems will always be biased by human subjectivity.
- Relevance is not binary, but a spectrum ranging from exactly what is being looked for to the totally unrelated.
- Relevance may be:
  - Subjective, depending on a specific user's judgement; this is typically measured by inter-annotator agreement (see the sketch below).
  - Situational, related to a user's requirements: is information we already know relevant to our information need?
  - Temporal, changing over time: pertinence.
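The subjective nature of relevance is often quantified with an inter-annotator agreement statistic. Below is a minimal sketch of Cohen's kappa over two hypothetical sets of binary relevance judgements; the judgement lists are invented for illustration and are not from any real study.

```python
# Minimal sketch of inter-annotator agreement using Cohen's kappa.
# The two annotator label lists below are hypothetical examples.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators judging the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical relevance judgements (1 = relevant, 0 = not relevant) for ten documents.
annotator_1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
print(f"Cohen's kappa: {cohens_kappa(annotator_1, annotator_2):.2f}")
```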
The System View
- Relates to the match between query terms and index terms within an item. This can be tested objectively, without relying on human judgement, e.g.:
  - time to index an item
  - computer memory requirements
  - response time from query input to the first set of items retrieved for the user to view (see the timing sketch below)
- An objective measure that does involve the user is the time required to create a query.
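As one illustration of an objective system measure, the sketch below times the round trip from query input to the first set of retrieved items. The SearchEngine class is a hypothetical placeholder, not a real API; only the timing pattern is the point.

```python
# Minimal sketch of measuring response time; SearchEngine is a hypothetical stub.
import time

class SearchEngine:                         # stand-in for a real engine
    def query(self, text):
        return ["d1", "d2", "d3"]           # pretend hit list

def timed_query(engine, query_text):
    start = time.perf_counter()
    results = engine.query(query_text)      # query input -> first result set
    elapsed = time.perf_counter() - start
    return results, elapsed

hits, seconds = timed_query(SearchEngine(), "cranfield aerodynamics")
print(f"Response time: {seconds:.4f} s for {len(hits)} hits")
```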
Recall and Precision
- Precision = number_retrieved_and_relevant / total_number_retrieved
- Recall = number_retrieved_and_relevant / total_number_relevant
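A minimal sketch of these two formulas in Python, treating the retrieved results and the relevance judgements as sets; the example document IDs are hypothetical.

```python
# Precision and recall from a retrieved list and a set of known relevant documents.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                       # number_retrieved_and_relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d3", "d5", "d7"]          # documents the system returned (hypothetical)
relevant = ["d1", "d2", "d3", "d4"]           # exhaustive relevance judgements (hypothetical)
p, r = precision_recall(retrieved, relevant)
print(f"Precision = {p:.2f}, Recall = {r:.2f}")   # Precision = 0.50, Recall = 0.50
```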
Estimating Recall
- In controlled environments with small databases, the number of relevant documents can be found.
- For "open" searching on the internet, total_number_relevant is not known.
- Two approaches to estimating total_number_relevant:
  a) Use a sampling technique: judge a sample of the collection and ask what percentage of its documents are relevant.
  b) The technique used by TREC: use the aggregate pool of documents retrieved by several search engines (see the sketch below).
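A minimal sketch of the TREC pooling idea: the union of the top-k documents from several systems' ranked runs forms the pool that assessors judge, and the relevant documents found in that pool stand in for total_number_relevant. The run lists and the cut-off k below are hypothetical.

```python
# TREC-style pooling: judge only the union of each system's top-k results.

def build_pool(runs, k=3):
    """Union of the top-k documents from each system's ranked run."""
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:k])
    return pool

runs = [
    ["d1", "d4", "d2", "d9"],    # system A's ranking (hypothetical)
    ["d4", "d5", "d1", "d8"],    # system B's ranking (hypothetical)
    ["d7", "d1", "d3", "d2"],    # system C's ranking (hypothetical)
]
pool = build_pool(runs, k=3)
print(sorted(pool))   # the documents assessors would judge for relevance
```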
Recall or Precision?
- Do you want to find all the pages which are relevant to your query?
- Do you want all pages returned in the first screen of results to be relevant to your query?
- Precision at 10 (P@10) is precision considering only the top 10 ranked hits (sketched below).
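A minimal sketch of precision at 10 over a ranked hit list; the ranking and the judgement set are hypothetical.

```python
# Precision at k: the fraction of the top-k ranked hits that are relevant.

def precision_at_k(ranked_results, relevant, k=10):
    top_k = ranked_results[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

ranking = [f"d{i}" for i in range(1, 21)]           # d1 .. d20, in ranked order (hypothetical)
relevant = {"d1", "d3", "d4", "d8", "d15", "d19"}   # judged relevant (hypothetical)
print(f"P@10 = {precision_at_k(ranking, relevant, k=10):.2f}")   # 4 of the top 10 -> 0.40
```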
User Satisfaction (Platt et al., 2002)
- Used five-point Likert scale questionnaires to determine the degree of user satisfaction for each browser:
  1. I like this image browser.
  2. This browser is easy to use.
  3. This browser feels familiar.
  4. It is easy to find the photo I am looking for.
  5. A month from now, I would still be able to find these photos.
  6. I was satisfied with how the pictures were organised.
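Likert responses like these are usually summarised per statement. The sketch below computes the mean rating for each statement from a hypothetical response matrix; Platt et al.'s actual data and analysis are not reproduced here.

```python
# Summarising five-point Likert responses: mean rating per statement.
from statistics import mean

statements = [
    "I like this image browser.",
    "This browser is easy to use.",
]
# responses[u][s] = user u's 1-5 rating of statement s (hypothetical data).
responses = [
    [5, 4],
    [4, 4],
    [3, 5],
    [4, 3],
]
for s, text in enumerate(statements):
    ratings = [row[s] for row in responses]
    print(f"{text}  mean = {mean(ratings):.2f}  (n = {len(ratings)})")
```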
Text REtrieval Conference (TREC)
Contents of the TREC database:
- Wall Street Journal
- Associated Press Newswire
- Articles from Computer Select discs
- Federal Register
- Short abstracts from DOE publications
- San Jose Mercury News
- US Patents
Five New Areas of Testing, Called Tracks, at TREC
- Multilingual (e.g. the El Norte newspaper in Spanish)
- Interactive (e.g. relevance feedback, rather than batch mode)
- Database merging: merging the hit files of several subcollections
- Confusion: dealing with corrupted data
- Routing (dissemination): long-standing queries
Qualitative and Quantitative Methods
- Qualitative evaluation: what is it like?
- Quantitative evaluation: how much is it?
- A traditional comparison involves the following stages:
  1. Qualitative assessment of relevance at the level of question-document pairs.
  2. Quantitative analysis covering the different documents and the different questions, e.g. recall and precision (see the sketch below).
  3. A final qualitative assessment of which system(s) perform better than the other(s).
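As a sketch of the quantitative stage feeding that final comparison, the example below computes mean precision over a small query set for two hypothetical systems, A and B; all judgements and retrieved lists are invented for illustration.

```python
# Comparing two systems by mean precision over a query set (hypothetical data).

relevant = {                          # relevance judgements per query
    "q1": {"d1", "d3", "d7"},
    "q2": {"d5", "d6"},
}
runs = {                              # retrieved lists per system and query
    "A": {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d9"]},
    "B": {"q1": ["d7", "d8", "d9"], "q2": ["d4", "d6", "d5"]},
}

for system, per_query in runs.items():
    precisions = []
    for query, retrieved in per_query.items():
        hits = len(set(retrieved) & relevant[query])
        precisions.append(hits / len(retrieved))
    print(f"System {system}: mean precision = {sum(precisions)/len(precisions):.2f}")
```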