BioCreAtIvE: Critical Assessment for Information Extraction in Biology. Granada, Spain, March 28-March 31, 2004. Task 2: Functional annotation of gene products
Task description: The assignment of GO annotations to human proteins. This is currently done by curators at Swiss-Prot. The full text of journal articles was used (636 training documents from the Journal of Biological Chemistry). Three subtasks.
Subtasks
1) "Recover" text that provides evidence for the GO annotation: given a (doc, protein, GO term) triplet, find the segment of text supporting this annotation.
2) Provide GO annotations for human proteins: given a (doc, protein) pair, return all GO terms that could be associated with this pair.
3) Selection of relevant papers: detect which papers are relevant for a protein, in the sense that they contain information suitable for deriving a GO annotation, and provide the evidence text.
Evaluation
Predictions were made in the form of triplets (protein, paper, GO) plus a piece of evidence text.
More than 30,000 of these individual results were submitted and had to be reviewed by the GO curators.
The evaluation scheme for both GO terms and proteins was:
"high": the GO term or the protein was correct.
"generally": for proteins, this means that the specific protein is not there, but a homologue from another organism or a reference to the protein family is.
"low": the prediction was wrong.
Results – Task 2.1
Cont’d
Results – Task 2.2
Cont’d
Summary of approaches adopted by some participants
User17: Soumya Ray and Mark Craven (University of Wisconsin)
User20: Francisco M. Couto et al. (from Portugal and France)
User4: Frédéric Ehrler and Patrick Ruch (University (Hospital) of Geneva)
Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text (User17)
Informative Term Model: Identify terms that are characteristic of a given GO term. Collect training data from other organism databases (SGD, MGI, RGD, TAIR). Perform a chi-squared test to identify the informative terms. Null hypothesis: the distributions of a term in the two classes (support and background) are identical.
Cont’d. Support set: the set of articles and abstracts associated with the GO term. Background set: the remaining articles and abstracts.
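The chi-squared step can be illustrated with a minimal sketch. It assumes document-level presence counts arranged in a 2x2 table (word present or absent, support or background), whitespace tokenization, and the standard 5% critical value for one degree of freedom (3.841). The function names and the toy documents are hypothetical; the slides do not say how User17 actually counted terms or chose a threshold.

```python
from collections import Counter


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]].

    a: support docs containing the word,    b: support docs without it
    c: background docs containing the word, d: background docs without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A word present in every document (or in none) carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    return Counter(w for doc in docs for w in set(doc.lower().split()))


def informative_terms(support_docs, background_docs, critical=3.841):
    """Words over-represented in the support set for which the null
    hypothesis (same distribution in both classes) is rejected.

    critical=3.841 is the 5% critical value for 1 degree of freedom;
    using it here is an assumption of this sketch.
    """
    sup_df = document_frequency(support_docs)
    bg_df = document_frequency(background_docs)
    n_sup, n_bg = len(support_docs), len(background_docs)
    scored = {}
    for word in set(sup_df) | set(bg_df):
        a, c = sup_df[word], bg_df[word]
        stat = chi_squared_2x2(a, n_sup - a, c, n_bg - c)
        if stat > critical and a / n_sup > c / n_bg:
            scored[word] = stat
    return sorted(scored.items(), key=lambda kv: -kv[1])


# Toy, made-up documents for a hypothetical kinase-related GO term.
support = [
    "kinase phosphorylates the protein",
    "protein kinase activity detected",
    "the kinase domain binds protein",
    "kinase assay of the protein",
]
background = [
    "the protein localizes to membrane",
    "protein transport in the cell",
    "membrane protein of the cell",
    "the protein binds dna",
]
result = informative_terms(support, background)
print(result)
```

With counts this small the chi-squared approximation is unreliable; the toy data only shows the mechanics. A word such as "protein", which occurs in every document of both sets, scores zero and is never selected.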
FiGO: Finding GO Terms in Unstructured Text (User20). Calculate the information content (IC) of each word w occurring in the GO terms: [equation], where #w is the number of GO terms whose name contains w, and #max is the maximum number of GO terms whose names contain a common word. The information content of a term’s name n is therefore: [equation]. A GO term may have multiple names (synonyms): [equation].
Annotation with a piece of text: Given a piece of text p, the local information content (LIC) of each term t is defined as follows: [equation]. FiGO identifies a term in a piece of text when its local information content is sufficiently close to its information content: [equation], where a parameter in [0,1] represents how close LIC should be to IC to decide that t is referred to in p. This parameter thus controls the trade-off between recall and precision in FiGO.
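Since the formulas themselves are not preserved on these slides, the following is only a sketch of how such a matcher could work, under loudly stated assumptions: word IC is taken as -log(#w / #max), a name's IC as the sum of its word ICs, a term's IC as the maximum over its names, LIC as the same sum restricted to words that occur in the text, and the decision rule as LIC >= alpha * IC. The symbol alpha, the function names, and the miniature GO vocabulary are all hypothetical and may differ from what FiGO actually does.

```python
import math
from collections import Counter

# Hypothetical miniature vocabulary: term id -> list of names (synonyms).
GO_NAMES = {
    "GO:A": ["protein kinase activity"],
    "GO:B": ["protein binding"],
    "GO:C": ["protein transport", "protein trafficking"],
}


def tokenize(text):
    return text.lower().split()


def word_ic(go_names):
    """Assumed formula: IC(w) = -log(#w / #max)."""
    counts = Counter()
    for names in go_names.values():
        # #w counts GO terms (not names) that contain the word.
        counts.update({w for name in names for w in tokenize(name)})
    max_count = max(counts.values())
    return {w: -math.log(c / max_count) for w, c in counts.items()}


def term_ic(names, ic):
    """Assumed: IC of a name is the sum over its words; a term takes
    the maximum over its names."""
    return max(sum(ic[w] for w in tokenize(n)) for n in names)


def local_ic(names, ic, text_words):
    """Assumed: same sum, restricted to words that occur in the text."""
    return max(
        sum(ic[w] for w in tokenize(n) if w in text_words) for n in names
    )


def annotate(text, go_names, alpha):
    """Return terms whose LIC is at least alpha * IC (assumed rule)."""
    ic = word_ic(go_names)
    text_words = set(tokenize(text))
    hits = []
    for term, names in go_names.items():
        full = term_ic(names, ic)
        if full > 0 and local_ic(names, ic, text_words) >= alpha * full:
            hits.append(term)
    return hits


strict = annotate("kinase assay", GO_NAMES, alpha=0.9)
lenient = annotate("kinase assay", GO_NAMES, alpha=0.4)
full_match = annotate(
    "we measured kinase activity of the purified protein", GO_NAMES, alpha=0.9
)
print(strict, lenient, full_match)
```

Lowering the threshold lets a partial mention ("kinase" alone) trigger the term, which raises recall at the cost of precision. Raising it requires most of the informative words of a name to be present.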
Preliminary Report on the BioCreative Experiment: Task Presentation, System Description and Preliminary Results (User4). An IR approach: Index the collection of GO terms as if they were documents. Treat each document (MedLine abstract) as a query to be categorized into GO categories. Combine two retrieval engines: a vector space model (TFIDF) and a pattern-matcher. Two types of indexing unit: stems (Porter-like) and linguistically motivated phrases (noun phrases). The UMLS is also used for string normalization.
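The core idea of indexing GO terms as documents and using an abstract as the query can be sketched with a bare TF-IDF vector space model and cosine similarity. This sketch covers only the vector space engine; the pattern-matcher, stemming, noun-phrase indexing, and UMLS normalization are omitted. The weighting scheme (log(N/df) + 1), the function names, and the three-term GO collection are assumptions for illustration, not the participants' actual configuration.

```python
import math
from collections import Counter

# Hypothetical miniature GO collection: each term name is one "document".
GO_TERMS = {
    "GO:A": "protein kinase activity",
    "GO:B": "dna binding",
    "GO:C": "protein transport",
}


def tokenize(text):
    return [w.strip(".,;:()") for w in text.lower().split()]


def build_index(go_terms):
    """Build TF-IDF vectors for GO term names (assumed weighting)."""
    docs = {t: Counter(tokenize(name)) for t, name in go_terms.items()}
    n = len(docs)
    df = Counter(w for tf in docs.values() for w in tf)
    idf = {w: math.log(n / c) + 1.0 for w, c in df.items()}
    vectors = {
        t: {w: f * idf[w] for w, f in tf.items()} for t, tf in docs.items()
    }
    return idf, vectors


def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_go_terms(abstract, idf, vectors):
    """Use the abstract as a query; words outside the GO vocabulary
    are ignored. Returns (term, score) pairs, best first."""
    tf = Counter(w for w in tokenize(abstract) if w in idf)
    query = {w: f * idf[w] for w, f in tf.items()}
    scores = [(t, cosine(query, v)) for t, v in vectors.items()]
    return sorted(scores, key=lambda kv: -kv[1])


idf, vectors = build_index(GO_TERMS)
ranked = rank_go_terms(
    "The purified enzyme showed kinase activity toward the protein substrate.",
    idf,
    vectors,
)
for term, score in ranked:
    print(term, round(score, 3))
```

Because every GO term receives a score, a ranked list like this naturally returns many candidates per abstract, which fits the observation that IR-like approaches tend toward higher recall.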
Summary: IR-like approaches generate higher recall. Almost all approaches depend on the collection of GO terms. GO term expansion (synonyms, related terms/phrases) seems important.