LESSONS FROM THE BIOCREATIVE PROTEIN-PROTEIN INTERACTION (PPI) TASK
RegCreative Jamboree, Friday, December 1st, 2006
MARTIN KRALLINGER



PROTEIN-PROTEIN INTERACTIONS (PPI)
• Crucial to understanding the functional role of proteins
• Relevant for the organization of biological processes
• Development of high-throughput experimental technologies
• Implication of PPI for gene regulation (TF and co-regulators)
• Interaction networks and diseases (e.g. cancer)
Reference: M. Krallinger and A. Valencia. Applications of Text Mining in Molecular Biology, from name recognition to Protein interaction maps. In Data Analysis and Visualization in Genomics and Proteomics, chapter 4, Wiley.

PPI ANNOTATION AND DATABASES
• iMEX agreement to share curation efforts
• Proteomics Standards Initiative (PSI) recommendation
• Molecular Interaction (MI) Ontology
• Large-scale experiments
• Literature curation

Database   Reference             URL
MINT       [...] et al., 2002
DIP        [...] et al., 2002
HPRD       [...] et al., 2004
IntAct     [...] et al., 2004
HPID       [...] et al., 2004

BIOCREATIVE PPI TASK
• Rapid literature growth and manual curation
• Automatic extraction of protein-protein interactions from text
• Variety of published strategies
• Main goals:
    (1) To determine the state of the art
    (2) To produce useful resources for training and testing
    (3) To learn which approaches are successful and practical
    (4) To monitor interesting new approaches
    (5) To provide useful tools to extract protein-protein interactions from texts
• Task design resembles the steps of the manual curation process
[Figure: Structured record]

Second BioCreative challenge evaluation

INTERACTION ARTICLE SUBTASK (IAS)
[Figure labels: RELEVANT, NOT RELEVANT]
• Identify those articles which are curation relevant
• Document categorization task
• Based on PubMed abstracts
• Training set consisted of:
    (1) P: abstracts of PPI-relevant articles from MINT/IntAct
    (2) N: abstracts not relevant for PPI (exhaustive curation)
    (3) P*: abstracts of interaction-relevant articles from other databases
• Return two collections of ranked documents: P, N
• Evaluation: precision, recall, F-score and AROC (area under the ROC curve)
• Participating systems: supervised learning
• Balanced test set, recent publications
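The IAS evaluation measures named above (precision, recall, F-score, AROC) can be sketched as follows. This is a minimal illustration and not the official BioCreative scoring script: the function names, the use of PMIDs as string keys, and the pairwise (Mann-Whitney style) computation of the area under the ROC curve are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple


def precision_recall_f1(gold_relevant: Set[str],
                        predicted_relevant: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and balanced F-score for article selection."""
    true_positives = len(gold_relevant & predicted_relevant)
    precision = true_positives / len(predicted_relevant) if predicted_relevant else 0.0
    recall = true_positives / len(gold_relevant) if gold_relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def area_under_roc(scores: Dict[str, float], gold_relevant: Set[str]) -> float:
    """Probability that a relevant article is scored above a non-relevant one.

    Ties count as half a win. Quadratic in the number of articles, which is
    acceptable for a sketch but not for very large test sets.
    """
    positives = [s for pmid, s in scores.items() if pmid in gold_relevant]
    negatives = [s for pmid, s in scores.items() if pmid not in gold_relevant]
    if not positives or not negatives:
        raise ValueError("need at least one relevant and one non-relevant article")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical article identifiers and classifier scores, for illustration only.
gold = {"a1", "a2", "a3"}
predicted = {"a1", "a2", "a4"}
ranking_scores = {"a1": 0.9, "a2": 0.8, "a3": 0.3, "a4": 0.6, "a5": 0.1}

print(precision_recall_f1(gold, predicted))
print(area_under_roc(ranking_scores, gold))
```

Precision, recall and F-score judge the hard RELEVANT / NOT RELEVANT decision, while AROC judges the ranking, which is why the subtask asks for ranked collections.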

LESSON I: IAS TASK AND OREGANNO
• Determine relevance of abstract vs. full text for article selection
• Balanced training collection: positive and negative
• Avoid using journal and publication date as classifier features
• Define training and test set in terms of publication date, e.g.:
    - Training set: published before 2003
    - Test set: published after 2003
• Enriched training data: sentences with relevant evidence
• Define basic selection strategy:
    - Exhaustive curation of a set of journals: high recall
    - Whole PubMed mining: high precision
• Curation relevance and annotation types
• Integration of resulting applications into annotation pipeline
• Interactive evaluation: timing and annotation efficiency
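A minimal sketch of the date-based split and the balanced training collection suggested above. The `Abstract` record, the function names and the down-sampling strategy are assumptions for illustration. The slide only says "before 2003" and "after 2003", so this sketch leaves articles published in 2003 out of both sets.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Abstract:
    pmid: str       # hypothetical identifier
    year: int       # publication year
    relevant: bool  # curation-relevant (positive) or not (negative)


def split_by_year(abstracts: Iterable[Abstract],
                  cutoff_year: int = 2003) -> Tuple[List[Abstract], List[Abstract]]:
    """Training set: published before the cutoff. Test set: published after it."""
    items = list(abstracts)
    train = [a for a in items if a.year < cutoff_year]
    test = [a for a in items if a.year > cutoff_year]
    return train, test


def balance(abstracts: Iterable[Abstract], seed: int = 0) -> List[Abstract]:
    """Down-sample the larger class so positives and negatives are equal in number."""
    items = list(abstracts)
    positives = [a for a in items if a.relevant]
    negatives = [a for a in items if not a.relevant]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.sample(positives, n) + rng.sample(negatives, n)


# Made-up records, for illustration only.
collection = [
    Abstract("p1", 2001, True),
    Abstract("p2", 2002, False),
    Abstract("p3", 2002, False),
    Abstract("p4", 2003, True),
    Abstract("p5", 2004, True),
    Abstract("p6", 2005, False),
]
train, test = split_by_year(collection)
balanced_train = balance(train)
print(len(train), len(test), len(balanced_train))
```

Splitting on publication date keeps the test set strictly newer than the training set, which is closer to how a curation pipeline would meet unseen literature. It is also one reason the date itself should not be offered to the classifier as a feature.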

INTERACTION PAIR SUBTASK (IPS)
Example record:
    PMID:
    Interactor 1: P73213_SYNY3 (Ssr2857 protein)
    Interactor 2: ATCS_SYNY3 (pacS protein)
• Identify protein-protein interaction pairs from full text articles (HTML, PDF)
• Individual proteins identified using UniProt ID/Acc
• Restrict / define a baseline UniProt release
• Extraction of physical PPI (MI ontology)
• Training set: articles and associated PPI pairs
• System output: for each article, a ranked list of PPI pairs
• Evaluation: precision and recall of predicted pairs compared to manual annotation
• Main difficulties: gene normalization and inter-species ambiguity
• No restriction on source organism
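The IPS evaluation above compares predicted interaction pairs with the manually annotated ones for each article. A minimal sketch follows. It assumes that a pair is unordered (A-B equals B-A) and that identifiers are compared after trimming whitespace and upper-casing; neither assumption is stated on the slide, and the function names are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def normalize_pair(interactor_a: str, interactor_b: str) -> Pair:
    """Represent an interaction as an unordered pair of UniProt identifiers."""
    return frozenset((interactor_a.strip().upper(), interactor_b.strip().upper()))


def score_article(gold: Iterable[Tuple[str, str]],
                  predicted: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Precision and recall of predicted pairs against the curated pairs of one article."""
    gold_pairs: Set[Pair] = {normalize_pair(a, b) for a, b in gold}
    predicted_pairs: Set[Pair] = {normalize_pair(a, b) for a, b in predicted}
    correct = len(gold_pairs & predicted_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Identifiers taken from the examples on the slides; the pairing of the second
# prediction with this article is made up for illustration.
curated = [("P73213_SYNY3", "ATCS_SYNY3")]
system_output = [("ATCS_SYNY3", "P73213_SYNY3"), ("Q08211", "Q9UBU9")]
print(score_article(curated, system_output))
```

Because matching is done on database identifiers, a system that finds the right protein names but maps them to the wrong species entry scores zero for that pair, which is how the inter-species ambiguity problem shows up in the numbers.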

LESSON II: IPS TASK AND OREGANNO
GENERAL ASPECTS
• Difficulties due to inter-organism gene name ambiguity
• Difficulty in differentiating experimentally confirmed interactions
• Importance of additional lexical resources
• Indirect expressions for interactions
• Author names of the protein interactors for training
• Protein family ambiguity
ASPECTS FOR A GENE REGULATION EXTRACTION TASK
• Define database for gene normalization
• Consider experimentally confirmed regulation
• Bio-entity types: protein vs. gene (promoter) name finding
• Provide negative and positive training co-occurrences (passages) compared to manual annotation
• Define actual evaluation metric depending on the needs

INTERACTION SENTENCE SUBTASK (ISS)
• Select the most relevant sentence expressing a protein-protein interaction from the full text article
• Useful for human interpretation and summary generation
• Provide for each interaction pair a ranked list of at most 5 evidence passages (max. 3 sentences each)
• Pooling method for the predicted passages
• Evaluation: percentage of relevant sentences with respect to the total number submitted, and mean reciprocal rank of the passages compared to the manual ones
• Example: "Using a biochemical approach to search for such co-regulatory factors, we identified hGCN5, TRRAP, and hMSH2/6 as BRCA1-interacting proteins."
• Additional collections included: Prodisen collection, Veuthey collection, Brun collection, GeneRif interaction sentences
Reference: M. Krallinger, R. Malik and A. Valencia. Text Mining and Protein Annotations: the Construction and Use of Protein Description Sentences. Genome Informatics Vol. 17, No. 2.
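The two ISS measures above can be sketched as follows. This is an illustration under stated assumptions, not the official scorer: passages are matched to the manual ones by exact identifier, the reciprocal rank uses the first relevant passage in each list, and only the first 5 passages per interaction pair are considered. The function names and passage identifiers are hypothetical.

```python
from typing import Dict, List, Set, Tuple

MAX_PASSAGES = 5  # maximum ranked passages per interaction pair, as on the slide

# interaction pair id -> (ranked submitted passages, manually selected passages)
Runs = Dict[str, Tuple[List[str], Set[str]]]


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant passage, or 0.0 if none is relevant."""
    for rank, passage in enumerate(ranked[:MAX_PASSAGES], start=1):
        if passage in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Runs) -> float:
    """Average reciprocal rank over all interaction pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs.values()) / len(runs)


def fraction_relevant(runs: Runs) -> float:
    """Relevant passages as a fraction of all submitted passages."""
    submitted = 0
    relevant_hits = 0
    for ranked, relevant in runs.values():
        considered = ranked[:MAX_PASSAGES]
        submitted += len(considered)
        relevant_hits += sum(1 for passage in considered if passage in relevant)
    return relevant_hits / submitted if submitted else 0.0


# Made-up passage identifiers, for illustration only.
example_runs: Runs = {
    "pair1": (["s3", "s1"], {"s1"}),
    "pair2": (["s7"], {"s7"}),
    "pair3": (["s2", "s4"], {"s9"}),
}
print(mean_reciprocal_rank(example_runs))
print(fraction_relevant(example_runs))
```

The percentage measure rewards systems that submit few but good passages, while the mean reciprocal rank rewards placing the best passage first. Reporting both makes it harder to game either one.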

LESSON III: ISS TASK AND OREGANNO
GENERAL ASPECTS
• Difficulties due to the lack of collections of "negative" training sentences
• Need for larger (additional) sets of training instances from full text
• Complex descriptions referring to interactions
• Protein normalization and protein family name ambiguity problems
• Multiple-sentence evidence cases (referring expressions, anaphora)
• Importance of figure legends and certain section titles
• Article format dependency (PDF vs. HTML)
ASPECTS FOR A GENE REGULATION EXTRACTION TASK
• Define semantic types of (or structure for) comment fields
• Length restriction of training passages
• Restriction to certain format types and journals
• Define the type of passage which should be extracted: for gene regulation or for evidence type annotation

INTERACTION METHOD SUBTASK (IMS)
• Identify protein-protein interaction pairs from full text articles together with the interaction detection method
• Map to the MI Ontology (CV)
• Maximum of 5 MI for a PPI pair
• Extraction of physical PPI (MI ontology)
• Mean reciprocal rank compared to the manual annotation
Example:
    BC2_PPI_IMS T1_BC2_PPI Q08211 Q9UBU9 MI:0004 1
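A hedged sketch of how a prediction line like the example above could be parsed and checked. It handles only the six whitespace-separated tokens that survive in this transcript; the official submission format may contain fields that are not shown here. The field names (`task`, `run_id`, and so on) are guesses made for illustration, and the rank limit of 5 is taken from the "maximum of 5 MI for a PPI pair" rule.

```python
import re
from typing import NamedTuple


class ImsPrediction(NamedTuple):
    task: str          # e.g. "BC2_PPI_IMS" (field name is an assumption)
    run_id: str        # e.g. "T1_BC2_PPI" (field name is an assumption)
    interactor_a: str  # UniProt accession of the first interactor
    interactor_b: str  # UniProt accession of the second interactor
    method_id: str     # MI ontology identifier of the detection method
    rank: int          # rank of this method for the pair, 1 to 5


MI_ID = re.compile(r"MI:\d{4}")
MAX_METHODS_PER_PAIR = 5


def parse_ims_line(line: str) -> ImsPrediction:
    """Parse one prediction line; raise ValueError if it is malformed."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    task, run_id, interactor_a, interactor_b, method_id, rank_text = fields
    if MI_ID.fullmatch(method_id) is None:
        raise ValueError(f"not an MI identifier: {method_id!r}")
    rank = int(rank_text)  # raises ValueError for non-numeric ranks
    if not 1 <= rank <= MAX_METHODS_PER_PAIR:
        raise ValueError(f"rank out of range: {rank}")
    return ImsPrediction(task, run_id, interactor_a, interactor_b, method_id, rank)


prediction = parse_ims_line("BC2_PPI_IMS T1_BC2_PPI Q08211 Q9UBU9 MI:0004 1")
print(prediction)
```

Validating the identifier pattern and the rank at parse time catches formatting mistakes before scoring, so a malformed line is rejected rather than silently counted as a wrong prediction.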

LESSON IV: IMS AND OREGANNO
GENERAL ASPECTS
• Difficulties due to lack of training sentences for methods
• Very complex task: both the PPI pair and the terms for methods must be found
• Community focused more on IPS than on IMS (too much task overlap)
• Difficulty in separating PPI pair identification from interaction detection method identification
• Different parts of documents refer to the method
• Information in non-textual data (e.g. figures)
ASPECTS FOR A GENE REGULATION EXTRACTION TASK
• Define controlled vocabulary relevant for annotation (e.g. evidence types)
• Provide lexical resources for evidence types (synonyms, …)
• Extraction of controlled vocabulary (ontology concepts) to full text

REGCREATIVE TEXT MINING TASKS
• Different tasks which might result in an automatic, annotation-relevant summary, which could include:
    0. Detection of relevant articles (document categorization & ranking)
    1. Ranked (normalized) TF list extracted from the paper
    2. Ranked list of regulated genes extracted from the paper
    3. Ranked list of evidence types (and subtypes) extracted from the articles together with text passages
    4. Ranked list of associations between TF and regulated genes together with evidence text
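One possible shape for the annotation-relevant summary outlined above, written as a small data structure. Every class name, field name and score in this sketch is hypothetical; the slide lists only the kinds of ranked output (items 0 to 4), not a schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Association:
    """Item 4: a TF-to-regulated-gene association with its evidence text."""
    tf: str
    target_gene: str
    evidence_text: str
    score: float


@dataclass
class ArticleSummary:
    pmid: str
    relevance_score: float  # item 0: article relevance used for ranking
    tfs: List[Tuple[str, float]] = field(default_factory=list)              # item 1
    regulated_genes: List[Tuple[str, float]] = field(default_factory=list)  # item 2
    # item 3: (evidence type, score, supporting passage)
    evidence_types: List[Tuple[str, float, str]] = field(default_factory=list)
    associations: List[Association] = field(default_factory=list)           # item 4

    def ranked_associations(self) -> List[Association]:
        """Associations ordered from highest to lowest score."""
        return sorted(self.associations, key=lambda a: a.score, reverse=True)


# Placeholder names and scores, for illustration only.
summary = ArticleSummary(pmid="example-pmid", relevance_score=0.8)
summary.associations.append(Association("TF_A", "GENE_X", "TF_A binds the GENE_X promoter.", 0.4))
summary.associations.append(Association("TF_B", "GENE_Y", "TF_B represses GENE_Y.", 0.9))
print([a.tf for a in summary.ranked_associations()])
```

Keeping the evidence passage next to each ranked item lets a curator accept or reject a suggestion without reopening the article, which is the point of an annotation-relevant summary.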

Acknowledgements
• MINT and IntAct for providing the training and test data collections
• Publishers for allowing use of the full text articles (NPG and Elsevier)
• MITRE, NCBI for collaboration in organizing the BioCreative Challenge
• CNIO for their assistance
• Thanks to Lynette Hirschman and Alfonso Valencia for their coordination.
• Thanks to the participating teams from all over the world for their effort in developing the participating systems.
Detailed results will be presented in Madrid at the BioCreative II Evaluation workshop, sponsored by the European Science Foundation, ESF (23-25 April 2007, CNIO, Madrid), and in a special issue of Genome Biology.