Reference Collections: Task Characteristics
TREC Collection
Text REtrieval Conference (TREC)
–sponsored by NIST and DARPA (1992-?)
Comparing approaches for information retrieval from large text collections:
–Uniform scoring procedures
–Large corpus of news and technical texts
–Texts tagged in SGML (includes some metadata and document structure)
–Specified tasks
Example Task
Number: 168
Topic: Financing AMTRAK
Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).
Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuous government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.
Deciding What is Relevant
Pooling method
–Set (pool) of potentially relevant documents is obtained by combining top N results from various retrieval systems
–Humans then examine these to determine which are truly relevant
–Assumes relevant documents will be in the pool and that documents not in the pool are not relevant
–Assumptions have been verified (at least for evaluation purposes)
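The pooling steps above can be sketched in a few lines. This is only an illustration: the system names, document IDs, and the `assess` callback (standing in for the human assessors) are hypothetical, and real TREC pools are far deeper than the depth of 2 used here.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents from each system's ranked run."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


def judge(pool, assess):
    """Keep the pooled documents that the assessor marks relevant.

    Documents outside the pool are never assessed, so they are
    treated as not relevant (the pooling assumption).
    """
    return {doc for doc in pool if assess(doc)}


# Hypothetical ranked runs from three retrieval systems.
runs = {
    "system_a": ["d3", "d1", "d7", "d2"],
    "system_b": ["d1", "d5", "d3", "d9"],
    "system_c": ["d8", "d3", "d4", "d1"],
}

pool = build_pool(runs, depth=2)

# Hypothetical ground truth: d9 is relevant but falls outside the pool.
truly_relevant = {"d1", "d8", "d9"}
relevant = judge(pool, lambda doc: doc in truly_relevant)

print(sorted(pool))      # ['d1', 'd3', 'd5', 'd8']
print(sorted(relevant))  # ['d1', 'd8']
```

Note that d9 is missed because no system ranked it in its top 2; this is the case the pooling assumption accepts in exchange for far less judging effort.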
Types of TREC Tasks
Ad hoc tasks:
–New queries against static collection
–IR systems return ranked results
–Systems get task and collection
Routing tasks:
–Standing queries for changing collection
–Basically a batch-mode filtering task
–Example: identifying topic from AP newswire
–Results must be ranked
–Systems get task and two collections, one for training and one for evaluation
Secondary Tasks at TREC
Chinese
–Documents and queries in Chinese
Filtering
–Determine whether each new document is relevant (no rank order)
Interactive
–Human searcher interacts with system to determine which documents are relevant (no rank order)
NLP
–Examining value of NLP in IR
Secondary Tasks at TREC
Cross Languages
–Documents are in one language, tasks in another
High Precision
–Retrieve 10 documents that answer a given information request in 5 minutes
Spoken Document Retrieval
–Documents are transcripts of radio broadcasts
Very Large Corpus
–Collection larger than 20 GB
Evaluation Measures
Summary Table Statistics
–# of requests in task, # of documents retrieved, # of relevant docs retrieved, total # of relevant docs
Recall-Precision Averages
–Avg. interpolated precision at the 11 standard recall levels (0%, 10%, ..., 100%)
Document Level Averages
–Avg. precision at specified # of retrieved docs, and after R docs retrieved (R = total # of relevant docs for the request)
Average Precision Histogram
–Graph showing how algorithm did for each request compared to average of all algorithms
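As a rough illustration of the per-request measures listed above, the sketch below computes precision at a document cutoff, R-precision, and interpolated precision at the 11 standard recall levels for one ranked result list. The function names, ranking, and relevance judgments are made up for the example; official TREC scoring tools may differ in details such as tie handling.

```python
def precision_at(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def r_precision(ranking, relevant):
    """Precision after R documents, where R is the number of relevant docs."""
    return precision_at(ranking, relevant, len(relevant))


def eleven_point_interpolated(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0.

    Interpolated precision at recall level r is the highest precision
    observed at any recall >= r (0.0 if that recall is never reached).
    """
    total = len(relevant)
    observed = []  # (relevant docs found so far, precision at that rank)
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            observed.append((hits, hits / rank))
    curve = []
    for j in range(11):
        # recall >= j/10 is tested with integers: hits/total >= j/10
        reachable = [p for h, p in observed if h * 10 >= j * total]
        curve.append(max(reachable, default=0.0))
    return curve


# Hypothetical ranked output and relevance judgments for one request.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d6", "d10"}

p5 = precision_at(ranking, relevant, 5)
rp = r_precision(ranking, relevant)
curve = eleven_point_interpolated(ranking, relevant)

print(p5)     # 0.4
print(rp)     # 0.5
print(curve)
```

Averaging each of these values over all requests in a task gives the recall-precision and document level averages reported in the summary.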
Reference Collections: Collection Characteristics
CACM Collection
3204 Communications of the ACM articles
Focus of collection: computer science
Structured subfields:
–Author names
–Date information
–Word stems from title and abstract
–Categories from hierarchical classification
–Direct references between articles
–Bibliographic coupling connections
–Number of co-citations for each pair of articles
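The last three subfields are related: bibliographic coupling and co-citation counts can both be derived from the direct references. The sketch below shows one way to do that, using the standard definitions (two articles are coupled when they cite the same article; two articles are co-cited when a third article cites both). The `references` map and article IDs are hypothetical and not taken from the CACM data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical direct references: article -> set of articles it cites.
references = {
    "A": {"C", "D"},
    "B": {"C", "D", "E"},
    "C": {"E"},
    "D": set(),
    "E": set(),
}


def bibliographic_coupling(references):
    """Number of shared references for each pair of citing articles."""
    counts = {}
    for a, b in combinations(sorted(references), 2):
        shared = len(references[a] & references[b])
        if shared:
            counts[(a, b)] = shared
    return counts


def co_citations(references):
    """Number of articles that cite both members of each pair."""
    counts = Counter()
    for cited in references.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return dict(counts)


coupling = bibliographic_coupling(references)
cocited = co_citations(references)

print(coupling)  # {('A', 'B'): 2, ('B', 'C'): 1}
print(cocited)   # {('C', 'D'): 2, ('C', 'E'): 1, ('D', 'E'): 1}
```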
CACM Collection
3204 Communications of the ACM articles
Test information requests:
–52 information requests in natural language with two Boolean query expressions
–Average of 11.4 terms per query
–Requests are rather specific with an average of about 15 relevant documents
–Result in relatively low precision and recall
ISI Collection
1460 documents from the Institute for Scientific Information
Focus of collection: information science
Structured subfields:
–Author names
–Word stems from title and abstract
–Number of co-citations for each pair of articles
ISI Collection
1460 documents from the Institute for Scientific Information
Test information requests:
–35 information requests in natural language with Boolean query expressions
–Average of 8.1 terms per query
–41 information requests in natural language without Boolean query expressions
–Requests are fairly general with an average of about 50 relevant documents
–Higher precision and recall
Observation
Collection    # of Docs    # of Terms    Terms/Doc
CACM          3204
ISI           1460
Number of terms increases slowly with number of documents
Cystic Fibrosis Collection
1239 articles with “Cystic Fibrosis” index in MEDLINE
Structured subfields:
–MEDLINE accession number
–Author
–Title
–Source
–Major subjects
–Minor subjects
–Abstract (or extract)
–References in the document
–Citations to the document
Cystic Fibrosis Collection
1239 articles with “Cystic Fibrosis” index in MEDLINE
Test information requests:
–100 information requests
–Relevance assessed by four experts with a scale of 0 (not relevant), 1 (marginal relevance), and 2 (high relevance)
–Overall relevance is sum (0-8)
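The overall relevance score is simple arithmetic: four expert scores, each 0, 1, or 2, are summed to a value between 0 and 8. A minimal sketch follows; the function name and the sample scores are illustrative only.

```python
def overall_relevance(expert_scores):
    """Sum four expert scores (each 0, 1, or 2) into an overall score of 0-8."""
    if len(expert_scores) != 4:
        raise ValueError("expected scores from exactly four experts")
    if any(score not in (0, 1, 2) for score in expert_scores):
        raise ValueError("each score must be 0, 1, or 2")
    return sum(expert_scores)


# Hypothetical judgments for one (request, document) pair.
score = overall_relevance([2, 2, 1, 0])
print(score)  # 5
```

A graded score like this lets an evaluation distinguish documents all four experts rated highly (8) from those only one expert found marginally relevant (1), which a single binary judgment cannot do.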
Discussion Questions
In developing a search engine:
–How would you use metadata (e.g. author, title, abstract)?
–How would you use document structure?
–How would you use references, citations, co-citations?
–How would you use hyperlinks?