UCB BioText TREC 2003 Participation Participants: Marti Hearst Gaurav Bhalotia, Presley Nakov, Ariel Schwartz Track: Genomics, tasks 1 and 2.

1 UCB BioText TREC 2003 Participation Participants: Marti Hearst Gaurav Bhalotia, Presley Nakov, Ariel Schwartz Track: Genomics, tasks 1 and 2

2 TREC Task 1: Overview
- Search 525,938 MEDLINE records
  - Titles, abstracts, MeSH category terms, citation information
- Topics: taken from the GeneRIF portion of the LocusLink database
  - We are supplied with gene names
- Definition of a GeneRIF: For gene X, find all MEDLINE references that focus on the basic biology of the gene or its protein products from the designated organism. Basic biology includes isolation, structure, genetics and function of genes/proteins in normal and disease states.

3 TREC Task 1: Sample Query
  3 2120 Homo sapiens OFFICIAL_GENE_NAME ets variant gene 6 (TEL oncogene)
  3 2120 Homo sapiens OFFICIAL_SYMBOL ETV6
  3 2120 Homo sapiens ALIAS_SYMBOL TEL
  3 2120 Homo sapiens PREFERRED_PRODUCT ets variant gene 6
  3 2120 Homo sapiens PRODUCT ets variant gene 6
  3 2120 Homo sapiens ALIAS_PROT TEL1 oncogene
- The first column is the official topic number (1-50).
- The second column contains the LocusLink ID for the gene.
- The third column contains the name of the organism.
- The fourth column contains the gene name type.
- The fifth column contains the gene name.
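The five-column layout above can be read with a few lines of Python. This is only a sketch: it assumes the topic file is tab-separated (the slide does not state the delimiter, and the organism name contains a space), and `GeneNameRecord`, `parse_topic_line`, and `group_by_topic` are hypothetical names introduced here for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class GeneNameRecord(NamedTuple):
    """One row of the topic file (field names are illustrative)."""
    topic: int          # official topic number (1-50)
    locuslink_id: int   # LocusLink ID of the gene
    organism: str       # e.g. "Homo sapiens"
    name_type: str      # e.g. "OFFICIAL_SYMBOL"
    name: str           # the gene name itself


def parse_topic_line(line: str) -> GeneNameRecord:
    # Assumption: columns are separated by tabs.
    topic, locus_id, organism, name_type, name = line.rstrip("\n").split("\t")
    return GeneNameRecord(int(topic), int(locus_id), organism, name_type, name)


def group_by_topic(lines):
    """Collect all name records that belong to the same topic number."""
    topics = defaultdict(list)
    for line in lines:
        if line.strip():
            record = parse_topic_line(line)
            topics[record.topic].append(record)
    return dict(topics)


SAMPLE = [
    "3\t2120\tHomo sapiens\tOFFICIAL_GENE_NAME\tets variant gene 6 (TEL oncogene)",
    "3\t2120\tHomo sapiens\tOFFICIAL_SYMBOL\tETV6",
    "3\t2120\tHomo sapiens\tALIAS_SYMBOL\tTEL",
]

if __name__ == "__main__":
    for record in group_by_topic(SAMPLE)[3]:
        print(record.name_type, "->", record.name)
```

Grouping by topic number gives one list of names per query, which is the natural input for the retrieval step.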

4 TREC Task 1: Approach
Two main components:
- Retrieve relevant docs
  - May miss many because of variation in how gene names are expressed
- Rank order them

5 TREC Task 1: Approach
Retrieval
- Normalization of query terms
  - Special characters are replaced with spaces in both queries and documents.
- Term expansion
  - A set of pattern-based rules is applied to the original list of query terms to expand it and increase recall.
  - Rules with lower confidence get a lower weight in the ranking step.
- Stop word removal
- Organism identification
  - Gene names are often shared across different organisms.
  - Developed a method to automatically determine which MeSH terms correspond to LocusLink organism terms:
    - Retrieved the MEDLINE docs indicated by LocusLink links corresponding to a given organism
    - Organism terms were the most frequent MeSH categories among the selected docs
    - Used these terms to identify the organism term in MEDLINE
    - An example of playing two databases off each other.
- MeSH concepts
  - When an exact match is found between one of the query terms and a MeSH term assigned to a document, the document is retrieved.
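The normalization, expansion, and stop word steps above might look roughly like the following sketch. The slides do not list the actual expansion rules, stop words, or weights, so the two rules, the `STOP_WORDS` set, the `LOW_CONFIDENCE_WEIGHT` value, and the lowercasing step are all assumptions made for illustration.

```python
import re

# Hypothetical stop list; the slides do not say which words were removed.
STOP_WORDS = {"the", "of", "and", "gene", "protein"}

# Assumed weight for variants produced by low-confidence rules.
LOW_CONFIDENCE_WEIGHT = 0.5


def normalize(text):
    """Replace special characters with spaces and collapse whitespace.

    Lowercasing is an extra assumption, not stated on the slide.
    """
    return " ".join(re.sub(r"[^A-Za-z0-9]+", " ", text).lower().split())


def expand(term):
    """Return {variant: weight} for one normalized term."""
    variants = {term: 1.0}
    # Illustrative rule 1: insert a space at letter/digit boundaries
    # ("etv6" -> "etv 6").
    split = re.sub(r"(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ", term)
    variants.setdefault(split, 1.0)
    # Illustrative rule 2 (low confidence): drop a trailing number
    # ("etv6" -> "etv").
    stripped = re.sub(r"\s*[0-9]+$", "", term)
    if stripped:
        variants.setdefault(stripped, LOW_CONFIDENCE_WEIGHT)
    return variants


def build_query_terms(raw_names):
    """Normalize names, drop terms that are stop words, expand the rest."""
    weighted = {}
    for raw in raw_names:
        term = normalize(raw)
        if not term or term in STOP_WORDS:
            continue
        for variant, weight in expand(term).items():
            # Keep the highest weight if several rules yield the same variant.
            weighted[variant] = max(weight, weighted.get(variant, 0.0))
    return weighted


if __name__ == "__main__":
    for term, weight in build_query_terms(["ETV6", "TEL", "TEL1 oncogene"]).items():
        print(f"{weight:.1f}  {term}")
```

Carrying a weight alongside each variant is what lets low-confidence expansions count for less in the ranking step.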

6 Gene Name Expansion

7 Organism Filtering
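The organism identification method described on slide 5 (take the MEDLINE docs that LocusLink links to a given organism, then pick the most frequent MeSH terms among them) could be sketched as below. The function names and the toy records are invented for illustration and are not taken from the actual system or data.

```python
from collections import Counter


def organism_mesh_terms(linked_pmids, mesh_by_pmid, top_n=1):
    """Most frequent MeSH terms among the docs linked to one organism."""
    counts = Counter()
    for pmid in linked_pmids:
        # Count each term at most once per document.
        counts.update(set(mesh_by_pmid.get(pmid, ())))
    return [term for term, _ in counts.most_common(top_n)]


def matches_organism(doc_mesh_terms, organism_terms):
    """Keep a document only if it carries one of the organism's MeSH terms."""
    return bool(set(doc_mesh_terms) & set(organism_terms))


# Toy data (invented): LocusLink organism -> linked PMIDs, PMID -> MeSH terms.
LINKS = {"Homo sapiens": [1, 2, 3], "Mus musculus": [4, 5]}
MESH = {
    1: ["Human", "Apoptosis"],
    2: ["Human", "Female"],
    3: ["Human", "Apoptosis"],
    4: ["Mice", "Animal"],
    5: ["Mice"],
}

if __name__ == "__main__":
    for organism, pmids in LINKS.items():
        print(organism, "->", organism_mesh_terms(pmids, MESH))
```

Once each organism has its MeSH terms, `matches_organism` can drop retrieved documents about a gene with the same name in a different species.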

8 TREC Task 1: Approach
Relevance ranking
- IBM's DB2 Net Search Extender was used as the text search engine.
- Scoring: each query is a union of 5 different sub-queries:
  - titles,
  - abstracts,
  - titles using low-confidence expansion rules,
  - abstracts using low-confidence expansion rules, and
  - MeSH concepts.
- Each sub-query returns a set of documents with a relevance score from the text search engine (or a fixed value for MeSH matches).
- The aggregated score is the SUM of the individual scores, with an optional weight applied to each sub-query score.
  - SUM performs better than MAX, since it gives higher confidence to documents found in multiple sub-queries.
- Scores are normalized to lie between 0 and 1 by dividing each score by the highest aggregated score achieved for the query.
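A minimal sketch of the weighted-SUM aggregation and per-query normalization described above. The sub-query names, weights, and scores are made-up values; the actual weights used with DB2 Net Search Extender are not given in the slides.

```python
def aggregate(subquery_scores, weights=None):
    """Combine per-sub-query scores into one normalized score per document.

    subquery_scores: {sub_query_name: {doc_id: score}}
    weights:         {sub_query_name: weight}; missing names default to 1.0
    """
    weights = weights or {}
    totals = {}
    for name, scores in subquery_scores.items():
        w = weights.get(name, 1.0)
        for doc_id, score in scores.items():
            # SUM rewards documents that are found by several sub-queries.
            totals[doc_id] = totals.get(doc_id, 0.0) + w * score
    if not totals:
        return {}
    top = max(totals.values())
    if top <= 0:
        return totals
    # Divide by the best aggregated score so the top document scores 1.0.
    return {doc_id: total / top for doc_id, total in totals.items()}


if __name__ == "__main__":
    scores = {
        "title": {"d1": 0.8, "d2": 0.4},
        "abstract": {"d1": 0.6},
        "mesh": {"d3": 1.0},  # fixed value for an exact MeSH match
    }
    ranked = aggregate(scores, weights={"mesh": 0.5})
    for doc_id in sorted(ranked, key=ranked.get, reverse=True):
        print(doc_id, round(ranked[doc_id], 3))
```

In this toy input `d1` is found by both the title and abstract sub-queries, so SUM places it above documents that match only one sub-query.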

9 TREC Task 1: Approach
GeneRIF classification
- A Naïve Bayes model is used to assign to each document the probability that it is a GeneRIF.
  - MeSH terms are used as features.
- Combination of text retrieval score and GeneRIF classification score
  - We tried both an additive and a multiplicative approach.
  - Both behave similarly, with slightly better performance from the additive one.
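The classifier and the two combination schemes might be sketched as follows. The slides do not say which Naïve Bayes variant, smoothing, or mixing weight was used, so the term-presence model with add-one smoothing, the `alpha` parameter, and the training examples are all assumptions for illustration.

```python
import math
from collections import Counter


class MeshNaiveBayes:
    """Tiny Naive Bayes over MeSH-term presence with add-one smoothing."""

    def fit(self, docs, labels):
        # docs: list of MeSH-term lists; labels: True if the doc is a GeneRIF.
        self.vocab = {term for doc in docs for term in doc}
        self.class_counts = Counter(labels)
        self.term_counts = {label: Counter() for label in self.class_counts}
        for terms, label in zip(docs, labels):
            self.term_counts[label].update(set(terms))
        return self

    def predict_proba(self, terms):
        """Estimated probability that a document with these terms is a GeneRIF."""
        n_docs = sum(self.class_counts.values())
        log_post = {}
        for label, n_label in self.class_counts.items():
            log_p = math.log(n_label / n_docs)
            total = sum(self.term_counts[label].values())
            for term in set(terms) & self.vocab:  # unseen terms are ignored
                count = self.term_counts[label][term]
                log_p += math.log((count + 1) / (total + len(self.vocab)))
            log_post[label] = log_p
        # Normalize in log space for numerical stability.
        peak = max(log_post.values())
        exp = {label: math.exp(lp - peak) for label, lp in log_post.items()}
        return exp.get(True, 0.0) / sum(exp.values())


def combine_additive(retrieval_score, generif_prob, alpha=0.5):
    # alpha is an assumed mixing weight, not a value from the slides.
    return alpha * retrieval_score + (1 - alpha) * generif_prob


def combine_multiplicative(retrieval_score, generif_prob):
    return retrieval_score * generif_prob


if __name__ == "__main__":
    docs = [["Human", "Mutation", "Gene Expression"], ["Human", "Mutation"],
            ["Human", "Questionnaires"], ["Questionnaires", "Nursing"]]
    labels = [True, True, False, False]
    model = MeshNaiveBayes().fit(docs, labels)
    p = model.predict_proba(["Human", "Mutation"])
    print(round(p, 3), round(combine_additive(0.8, p), 3))
```

The additive form keeps a document with a strong retrieval score in play even when the classifier is unsure, whereas the multiplicative form pushes it toward zero.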

10 TREC Task 1: Results
Performance is measured using the standard trec_eval program.
- On training data:
  - Best published result: 0.4125
  - With GeneRIF classifier: 0.5101
  - Without GeneRIF classifier: 0.5028
- On testing data (turned in 8/4/03):
  - With GeneRIF classifier: 0.3933
  - Without GeneRIF classifier: 0.3768

11 TREC Task 2
Problem definition: given GeneRIFs formatted as:
  1 355 12107169 J Biol Chem 2002 Sep 13;277(37):34343-8. the death effector domain of FADD is involved in interaction with Fas.
  2 355 12177303 Nucleic Acids Res 2002 Aug 15;30(16):3609-14. In the case of Fas-mediated apoptosis, when we transiently introduced these hybrid-ribozyme libraries into Fas-expressing HeLa cells, we were able to isolate surviving clones that were resistant to or exhibited a delay in Fas-mediated apoptosis w …
reproduce the GeneRIF from the MEDLINE record.

12 TREC Task 2 What we did TBA

