UCB BioText TREC 2003 Participation
Participants: Marti Hearst, Gaurav Bhalotia, Preslav Nakov, Ariel Schwartz
Track: Genomics, tasks 1 and 2
TREC Task 1: Overview
Search 525,938 MEDLINE records
  Titles, abstracts, MeSH category terms, citation information
Topics: taken from the GeneRIF portion of the LocusLink database
  We are supplied with gene names
Definition of a GeneRIF: For gene X, find all MEDLINE references that focus on the basic biology of the gene or its protein products from the designated organism. Basic biology includes isolation, structure, genetics and function of genes/proteins in normal and disease states.
TREC Task 1: Sample Query
  Homo sapiens  OFFICIAL_GENE_NAME  ets variant gene 6 (TEL oncogene)
  Homo sapiens  OFFICIAL_SYMBOL  ETV
  Homo sapiens  ALIAS_SYMBOL  TEL
  Homo sapiens  PREFERRED_PRODUCT  ets variant gene
  Homo sapiens  PRODUCT  ets variant gene
  Homo sapiens  ALIAS_PROT  TEL1 oncogene
The first column is the official topic number (1-50).
The second column contains the LocusLink ID for the gene.
The third column contains the name of the organism.
The fourth column contains the gene name type.
The fifth column contains the gene name.
TREC Task 1: Approach
Two main components:
  Retrieve relevant docs
    May miss many because of variation in how gene names are expressed
  Rank order them
TREC Task 1: Approach
Retrieval
  Normalization of query terms
    Special characters are replaced with spaces in both queries and documents.
  Term expansion
    A set of pattern-based rules is applied to the original list of query terms to expand it and increase recall.
    Rules with lower confidence get a lower weight in the ranking step.
  Stop word removal
  Organism identification
    Gene names are often shared across different organisms.
    Developed a method to automatically determine which MeSH terms correspond to LocusLink organism terms:
      Retrieved the MEDLINE docs indicated by LocusLink links for a given organism
      The organism terms were the most frequent MeSH categories among the selected docs
      Used these terms to identify the organism term in MEDLINE
    An example of playing two databases off each other.
  MeSH concepts
    When an exact match is found between one of the query terms and a MeSH term assigned to a document, the document is retrieved.
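A minimal sketch of two of the retrieval steps above, under stated assumptions: the function names, the lowercasing step, and the toy MeSH sets are hypothetical and not taken from the slides. It shows special characters being replaced with spaces, and the organism MeSH term being chosen as the most frequent MeSH category among the documents that LocusLink links to one organism.

```python
import re
from collections import Counter


def normalize(text):
    """Replace every run of special characters with a space.

    Lowercasing and whitespace collapsing are assumptions added here;
    the slides only state that special characters become spaces.
    """
    spaced = re.sub(r"[^0-9A-Za-z]+", " ", text)
    return " ".join(spaced.lower().split())


def organism_mesh_term(linked_docs_mesh):
    """Return the most frequent MeSH term among the linked documents.

    linked_docs_mesh: one set of MeSH terms per MEDLINE document that
    LocusLink links to genes of a single organism (hypothetical input shape).
    """
    counts = Counter(term for terms in linked_docs_mesh for term in set(terms))
    return counts.most_common(1)[0][0]
```

Applying the same `normalize` function to both queries and documents keeps the two sides comparable, which is the point of normalizing before matching.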
Gene Name Expansion
Organism Filtering
TREC Task 1: Approach
Relevance ranking
  IBM’s DB2 Net Search Extender was used as the text search engine.
  Scoring:
    Each query is a union of 5 different sub-queries: titles, abstracts, titles using low-confidence expansion rules, abstracts using low-confidence expansion rules, and MeSH concepts.
    Each sub-query returns a set of documents with a relevance score from the text search engine (or a fixed value for MeSH matches).
    The aggregated score is the weighted SUM of the individual scores, with an optional weight applied to each sub-query score.
    SUM performs better than MAX, since it gives higher confidence to documents found in multiple sub-queries.
    Scores are normalized to the (0, 1] range by dividing each score by the highest aggregated score achieved for the query.
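The aggregation described above can be sketched as follows. This is an illustration only: the function name, the dictionary layout, and the example scores and weights are assumptions, and the actual sub-query scores came from the text search engine.

```python
def aggregate_scores(subquery_results, weights=None, combine="sum"):
    """Merge per-sub-query scores into one ranked list for a query.

    subquery_results: {sub_query_name: {doc_id: score}} (hypothetical layout)
    weights: optional {sub_query_name: weight}; missing names default to 1.0
    combine: "sum" for the weighted SUM, "max" for the MAX alternative
    """
    weights = weights or {}
    totals = {}
    for name, hits in subquery_results.items():
        weight = weights.get(name, 1.0)
        for doc_id, score in hits.items():
            value = weight * score
            if combine == "sum":
                totals[doc_id] = totals.get(doc_id, 0.0) + value
            else:
                totals[doc_id] = max(totals.get(doc_id, 0.0), value)
    # Normalize by the highest aggregated score achieved for the query.
    top = max(totals.values(), default=0.0)
    if top > 0:
        totals = {doc_id: s / top for doc_id, s in totals.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: d1 appears in two sub-queries, d3 in only one.
example = {
    "title": {"d1": 0.8, "d2": 0.4},
    "abstract": {"d1": 0.6, "d3": 0.9},
    "mesh": {"d2": 0.5},
}
ranked_sum = aggregate_scores(example, combine="sum")
ranked_max = aggregate_scores(example, combine="max")
```

With these made-up numbers, SUM ranks `d1` first because it is found by two sub-queries, while MAX ranks `d3` first on the strength of a single high score.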
TREC Task 1: Approach
GeneRIF classification
  A Naïve Bayes model is used to assign to each document the probability that it is a GeneRIF.
    MeSH terms are used as features.
  Combination of text retrieval score and GeneRIF classification score
    We tried both an additive and a multiplicative approach.
    Both behave similarly, with slightly better performance achieved with the additive one.
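A minimal sketch of the classification and combination steps, under stated assumptions: the slides do not give the model variant, the smoothing, or the combination weights, so the multinomial-style model with add-one smoothing, the class name `MeshNaiveBayes`, and the `alpha` parameter are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


class MeshNaiveBayes:
    """Naive Bayes over sets of MeSH terms with add-one smoothing (assumed variant)."""

    def fit(self, docs, labels):
        # docs: list of sets of MeSH terms; labels: True if the doc is a GeneRIF.
        # Assumes the training data contains both classes.
        self.vocab = set().union(*docs)
        self.class_docs = Counter(labels)
        self.term_counts = {True: Counter(), False: Counter()}
        for terms, label in zip(docs, labels):
            self.term_counts[label].update(terms)
        return self

    def _log_joint(self, terms, label):
        n_docs = sum(self.class_docs.values())
        total_terms = sum(self.term_counts[label].values())
        logp = math.log(self.class_docs[label] / n_docs)
        for term in terms:
            if term in self.vocab:  # terms never seen in training are ignored
                count = self.term_counts[label][term]
                logp += math.log((count + 1) / (total_terms + len(self.vocab)))
        return logp

    def prob_generif(self, terms):
        pos = self._log_joint(terms, True)
        neg = self._log_joint(terms, False)
        m = max(pos, neg)  # subtract the max for numerical stability
        return math.exp(pos - m) / (math.exp(pos - m) + math.exp(neg - m))


def combine(retrieval_score, generif_prob, mode="additive", alpha=0.5):
    """Combine the two scores; alpha is a hypothetical mixing weight."""
    if mode == "additive":
        return alpha * retrieval_score + (1 - alpha) * generif_prob
    return retrieval_score * generif_prob
```

One difference between the two modes in this sketch: the multiplicative form drives the combined score toward zero whenever either input is near zero, while the additive form lets a strong retrieval score partly offset a weak classifier score.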
TREC Task 1: Results
Performance is measured using the standard trec_eval program.
On training data:
  Best published result:
  With GeneRIF classifier:
  Without GeneRIF classifier:
On testing data (turned in 8/4/03):
  With GeneRIF classifier:
  Without GeneRIF classifier:
TREC Task 2
Problem Definition: Given GeneRIFs formatted as:
  J Biol Chem 2002 Sep 13;277(37): the death effector domain of FADD is involved in interaction with Fas
  Nucleic Acids Res 2002 Aug 15;30(16): In the case of Fas-mediated apoptosis, when we transiently introduced these hybrid-ribozyme libraries into Fas-expressing HeLa cells, we were able to isolate surviving clones that were resistant to or exhibited a delay in Fas-mediated apoptosis w …
reproduce the GeneRIF from the MEDLINE record.
TREC Task 2 What we did TBA