LESSONS FROM THE BIOCREATIVE PROTEIN-PROTEIN INTERACTION (PPI) TASK
RegCreative Jamboree, Friday, December 1st, 2006
MARTIN KRALLINGER, 2006 LESSONS FROM THE BIOCREATIVE PPI TASK

PROTEIN-PROTEIN INTERACTIONS (PPI)
Crucial to understanding the functional role of proteins
Relevant for the organization of biological processes
Development of high-throughput experimental technologies
Implications of PPI for gene regulation (TFs and co-regulators)
Interaction networks and diseases (e.g. cancer)
Reference: M. Krallinger and A. Valencia. Applications of Text Mining in Molecular Biology, from name recognition to Protein interaction maps. In Data Analysis and Visualization in Genomics and Proteomics, chapter 4, Wiley.

PPI ANNOTATION AND DATABASES
Database    Reference       URL
MINT        et al., 2002
DIP         et al., 2002
HPRD        et al., 2004
IntAct      et al., 2004
HPID        et al., 2004
IMEx agreement to share curation efforts
Proteomics Standards Initiative (PSI) recommendation
Molecular Interaction (MI) Ontology
Sources of interaction data: large-scale experiments and literature curation

BIOCREATIVE PPI TASK
Rapid literature growth and manual curation
Automatic extraction of protein-protein interactions from text
Variety of published strategies
Main goals:
  (1) To determine the state of the art
  (2) To produce useful resources for training and testing
  (3) To learn which approaches are successful and practical
  (4) To monitor interesting new approaches
  (5) To provide useful tools to extract protein-protein interactions from texts
Task design resembles the steps of the manual curation process
Structured record

Second BioCreative challenge evaluation

INTERACTION ARTICLE SUBTASK (IAS)
Identify those articles which are curation relevant
Document categorization task (RELEVANT vs. NOT RELEVANT)
Based on PubMed abstracts
Training set consisted of:
  (1) P: abstracts of PPI-relevant articles from MINT/IntAct
  (2) N: abstracts not relevant for PPI (exhaustive curation)
  (3) P*: abstracts of interaction-relevant articles from other databases
Return two collections of ranked documents: P, N
Evaluation: precision, recall, f-score and AROC
Participating systems: supervised learning
Balanced test set, recent publications

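The IAS evaluation measures listed above (precision, recall, f-score and AROC) can be illustrated with a minimal Python sketch. It is not the official BioCreative scoring script: the function names, the 0/1 label encoding, the 0.5 decision threshold and the toy data are assumptions made for illustration, and the f-score is taken here to be the balanced F1.

```python
from typing import List, Tuple


def precision_recall_f1(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-score for binary labels (1 = relevant)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def aroc(gold: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen relevant abstract is
    scored above a randomly chosen non-relevant one (ties count as 0.5).
    """
    pos = [s for g, s in zip(gold, scores) if g == 1]
    neg = [s for g, s in zip(gold, scores) if g == 0]
    if not pos or not neg:
        raise ValueError("AROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data (hypothetical): six abstracts with classifier scores.
    gold = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
    print(precision_recall_f1(gold, pred))
    print(aroc(gold, scores))
```

Precision, recall and f-score depend on where the relevant/not-relevant cut is made, whereas AROC scores the ranking as a whole, which is why a ranked submission can be judged on both.
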
LESSON I: IAS TASK AND OREGANNO
Determine relevance of abstract vs. full text for article selection
Balanced training collection: positive and negative
Avoid using journal and date as classifier features
Define training and test sets in terms of publication date, e.g.:
  Training set: published before 2003
  Test set: published after 2003
Enriched training data: sentences with relevant evidence
Define basic selection strategy:
  Exhaustive curation of a set of journals: high recall
  Whole PubMed mining: high precision
Curation relevance and annotation types
Integration of resulting applications into the annotation pipeline
Interactive evaluation: timing and annotation efficiency

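The date-based definition of training and test sets suggested above could look roughly like the following Python sketch. The `Abstract` record, its field names and the sample identifiers are hypothetical; dropping articles published in the cutoff year itself is also an assumption, made so that "before 2003" and "after 2003" stay strictly separated.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Abstract:
    """Hypothetical record for one labelled abstract."""
    doc_id: str   # placeholder identifier, not a real PMID
    year: int     # publication year
    label: int    # 1 = curation relevant, 0 = not relevant


def split_by_year(records: Iterable[Abstract],
                  cutoff: int = 2003) -> Tuple[List[Abstract], List[Abstract]]:
    """Split into (training, test) by publication year.

    Training: published before the cutoff year.
    Test: published after the cutoff year.
    Records from the cutoff year itself are left out (assumption).
    """
    records = list(records)
    train = [r for r in records if r.year < cutoff]
    test = [r for r in records if r.year > cutoff]
    return train, test


if __name__ == "__main__":
    corpus = [
        Abstract("doc-a", 2001, 1),
        Abstract("doc-b", 2002, 0),
        Abstract("doc-c", 2003, 1),
        Abstract("doc-d", 2005, 0),
    ]
    train, test = split_by_year(corpus)
    print([r.doc_id for r in train], [r.doc_id for r in test])
```

Splitting on the date rather than at random keeps the test set closer to the real curation setting, where a system is always applied to articles newer than anything it was trained on.
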
INTERACTION PAIR SUBTASK (IPS)
Identify protein-protein interaction pairs from full text articles (HTML, PDF)
Each protein identified by its UniProt ID/accession number
Restrict / define a baseline UniProt release
Extraction of physical PPI (MI ontology)
Training set: articles and associated PPI pairs
System output: for each article, a ranked list of PPI pairs
Evaluation: precision and recall of predicted pairs compared to manual annotation
Main difficulties: gene normalization, inter-species ambiguity
No limitation on source organism
Example record:
  PMID:
  Interactor 1: P73213_SYNY3 (Ssr2857 protein)
  Interactor 2: ATCS_SYNY3 (pacS protein)

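As a rough illustration of comparing predicted pairs against manual annotation for one article, the Python sketch below treats an interaction pair as unordered and computes precision and recall over sets of pairs. The function names and the second predicted pair (`HYPO1`, `HYPO2`) are hypothetical; the official scoring may have handled ranking and normalization differently.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def normalize_pair(a: str, b: str) -> Pair:
    """Order the two identifiers so that (A, B) and (B, A) compare equal."""
    first, second = sorted((a, b))
    return (first, second)


def pair_precision_recall(predicted: Iterable[Pair],
                          gold: Iterable[Pair]) -> Tuple[float, float]:
    """Precision and recall of predicted pairs for a single article."""
    pred_set: Set[Pair] = {normalize_pair(a, b) for a, b in predicted}
    gold_set: Set[Pair] = {normalize_pair(a, b) for a, b in gold}
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Gold pair taken from the example record on the slide.
    gold = [("P73213_SYNY3", "ATCS_SYNY3")]
    # System output: the gold pair in reverse order plus a made-up false positive.
    predicted = [("ATCS_SYNY3", "P73213_SYNY3"), ("HYPO1", "HYPO2")]
    print(pair_precision_recall(predicted, gold))
```

Because a pair only counts when both identifiers match, a single gene normalization error (for example, choosing the ortholog from the wrong species) turns an otherwise correct extraction into both a false positive and a false negative.
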
LESSON II: IPS TASK AND OREGANNO
GENERAL ASPECTS
Difficulties due to inter-organism gene name ambiguity
Difficulty in distinguishing experimentally confirmed interactions
Importance of additional lexical resources
Indirect expressions for interactions
Author-given names of the protein interactors for training
Protein family ambiguity
ASPECTS FOR A GENE REGULATION EXTRACTION TASK
Define the database for gene normalization
Consider experimentally confirmed regulation
Bio-entity types: protein vs. gene (promoter) name finding
Provide negative and positive training examples of co-occurrences (passages) compared to manual annotation
Define the actual evaluation metric depending on the needs

INTERACTION SENTENCE SUBTASK (ISS)
Select the most relevant sentence expressing a protein-protein interaction from a full text article
Useful for human interpretation and summary generation
Provide for each interaction pair a ranked list of at most 5 evidence passages (max 3 sentences each)
Pooling method for the predicted passages
Evaluation: percentage of relevant sentences with respect to the total number submitted, and mean reciprocal rank of the passages compared to the manual ones
Example: Using a biochemical approach to search for such co-regulatory factors, we identified hGCN5, TRRAP, and hMSH2/6 as BRCA1-interacting proteins.
Additional collections included: Prodisen collection, Veuthey collection, Brun collection, GeneRIF interaction sentences
Reference: M. Krallinger, R. Malik and A. Valencia. Text Mining and Protein Annotations: the Construction and Use of Protein Description Sentences. Genome Informatics, Vol. 17, No. 2.

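The two ISS measures named above can be sketched in Python as follows. This is a simplified reading, not the official procedure: passages are represented by hypothetical string identifiers, relevance is decided by exact membership in a manually curated set, and a list with no relevant passage contributes a reciprocal rank of 0.

```python
from typing import List, Sequence, Set, Tuple

# One run per interaction pair: (ranked passage ids, set of relevant passage ids).
Run = Tuple[Sequence[str], Set[str]]


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none is relevant."""
    for rank, passage in enumerate(ranked, start=1):
        if passage in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Run]) -> float:
    """Average reciprocal rank over all interaction pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def fraction_relevant(runs: List[Run]) -> float:
    """Share of all submitted passages that are relevant."""
    submitted = sum(len(ranked) for ranked, _ in runs)
    hits = sum(1 for ranked, rel in runs for p in ranked if p in rel)
    return hits / submitted if submitted else 0.0


if __name__ == "__main__":
    runs = [
        (["s3", "s1", "s7"], {"s1"}),  # first relevant passage at rank 2
        (["s2"], {"s2"}),              # first relevant passage at rank 1
        (["s4", "s5"], {"s9"}),        # no relevant passage returned
    ]
    print(mean_reciprocal_rank(runs))
    print(fraction_relevant(runs))
```

Mean reciprocal rank rewards placing a good passage near the top of each list, while the fraction of relevant passages penalizes padding a submission with weak candidates.
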
LESSON III: ISS TASK AND OREGANNO
GENERAL ASPECTS
Difficulties due to lack of collections of 'negative training sentences'
Need for larger (additional) sets of training instances from full text
Complex descriptions referring to interactions
Protein normalization and protein family name ambiguity problems
Multiple sentence evidence cases (referring expressions, anaphora)
Importance of figure legends and certain section titles
Article format dependency (PDF vs. HTML)
ASPECTS FOR A GENE REGULATION EXTRACTION TASK
Define semantic types (or structure) of comment fields
Length restriction of training passages
Restriction to certain format types and journals
Define the type of passage which should be extracted: for gene regulation or for evidence type annotation

INTERACTION METHOD SUBTASK (IMS)
Identify protein-protein interaction pairs from full text articles together with the interaction detection method
Map to the MI Ontology (CV)
Maximum of 5 MI terms for a PPI pair
Extraction of physical PPI (MI ontology)
Evaluation: mean reciprocal rank compared to the manual annotation
Example record:
  BC2_PPI_IMS T1_BC2_PPI Q08211 Q9UBU9 MI:0004 1

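A record like the example above could be read with a small Python parser such as the sketch below. The slide does not document the field layout, so the field names (`task`, `run`, `interactor_a`, `interactor_b`, `method`, `rank`) and the interpretation of the last column as a rank between 1 and 5 are assumptions inferred from the example and from the stated maximum of 5 MI terms per pair.

```python
import re
from typing import NamedTuple

# MI ontology identifiers as they appear in the example, e.g. "MI:0004".
MI_PATTERN = re.compile(r"^MI:\d{4}$")


class ImsRecord(NamedTuple):
    """Hypothetical field names for one whitespace-separated IMS line."""
    task: str
    run: str
    interactor_a: str
    interactor_b: str
    method: str
    rank: int


def parse_ims_line(line: str) -> ImsRecord:
    """Parse one six-column line; raise ValueError if it looks malformed."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    task, run, prot_a, prot_b, method, rank_text = fields
    if not MI_PATTERN.match(method):
        raise ValueError(f"not an MI identifier: {method!r}")
    rank = int(rank_text)
    if not 1 <= rank <= 5:  # assumed from "maximum of 5 MI terms per pair"
        raise ValueError(f"rank out of range: {rank}")
    return ImsRecord(task, run, prot_a, prot_b, method, rank)


if __name__ == "__main__":
    record = parse_ims_line("BC2_PPI_IMS T1_BC2_PPI Q08211 Q9UBU9 MI:0004 1")
    print(record)
```

Validating the MI identifier and the rank at parse time catches formatting problems before scoring, which matters when a single malformed line could otherwise invalidate a whole run.
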
LESSON IV: IMS AND OREGANNO
GENERAL ASPECTS
Difficulties due to lack of training sentences for methods
Very complex task: requires both the PPI pair and the terms for methods
Community focused more on IPS than on IMS (too much task overlap)
Difficulty in separating PPI pair identification from interaction detection method identification
Different parts of documents refer to the method
Information in non-textual data (e.g. figures)
ASPECTS FOR A GENE REGULATION EXTRACTION TASK
Define the controlled vocabulary relevant for annotation (e.g. evidence types)
Provide lexical resources for evidence types (synonyms, …)
Extraction of controlled vocabulary (ontology concepts) from full text

REGCREATIVE TEXT MINING TASKS
Different tasks might produce an automatic, annotation-relevant summary, which could include:
  0. Detection of relevant articles (document categorization & ranking)
  1. Ranked (normalized) TF list extracted from the paper
  2. Ranked list of regulated genes extracted from the paper
  3. Ranked list of evidence types (and subtypes) extracted from the articles together with text passages
  4. Ranked list of associations between TF and regulated genes together with evidence text

Acknowledgements
MINT and IntAct for providing the training and test data collections
Publishers for allowing use of the full text articles (NPG and Elsevier)
MITRE, NCBI for collaboration in organizing the BioCreative Challenge
CNIO for their assistance
Thanks to Lynette Hirschman and Alfonso Valencia for their coordination.
Thanks to the participating teams from all over the world for their effort in developing the participating systems.
Detailed results will be presented in Madrid at the BioCreative II Evaluation workshop, sponsored by the European Science Foundation, ESF (23-25th of April 2007, CNIO, Madrid) and in a special issue of Genome Biology.