Text mining activities at PIR Cecilia Arighi March 12, 2013.


Text mining projects in collaboration with:
1. Dr. Vijay Shanker, CIS Department, University of Delaware
2. BioCreative Consortium

iProLink: Text mining resources at PIR
Resource to facilitate text mining for biocuration, with a focus on annotation of post-translational modifications (PTMs)
- RLIMS-P: Text mining tool for extraction of protein phosphorylation information
- eFIP: Extracting Functional Impact of protein Phosphorylation
- eGIFT: Extracting Gene Information From Text

RLIMS-P: extraction of protein phosphorylation information
Rule-based systems make use of:
- knowledge about how language is structured
- specific knowledge about how biologically relevant facts are stated in the biomedical literature
PMID:
The tool needs to capture the different ways that protein phosphorylation is described in the literature.
Rule-based information extraction system. It extracts information about:
- the phosphorylated protein(s)
- the kinase(s)
- the phosphorylation site(s)
RLIMS-P 2.0: over 100 regular expressions, some of which are of a supporting nature (e.g., for anaphora resolution).
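To make the rule-based idea concrete, here is a minimal Python sketch of pattern-based extraction of kinase, substrate and site. The three patterns, the `PROTEIN` and `SITE` sub-expressions, and the function name `extract_phosphorylation` are hypothetical simplifications written for illustration; they are not the actual RLIMS-P rules, which the slides do not show. The example sentences are also made up for the sketch.

```python
import re

# Hypothetical, heavily simplified building blocks (not the RLIMS-P rules).
PROTEIN = r"[A-Z][A-Za-z0-9-]*"          # a capitalized token such as "BAD" or "Akt"
SITE = r"(?:Ser|Thr|Tyr|S|T|Y)-?\d+"     # a residue plus position, e.g. "Ser136"

PATTERNS = [
    # Active voice: "<kinase> phosphorylates <substrate> [at|on <site>]"
    re.compile(
        rf"(?P<kinase>{PROTEIN}) phosphorylates (?P<substrate>{PROTEIN})"
        rf"(?: (?:at|on) (?P<site>{SITE}))?"
    ),
    # Passive voice: "<substrate> is phosphorylated [at|on <site>] by <kinase>"
    re.compile(
        rf"(?P<substrate>{PROTEIN}) is phosphorylated"
        rf"(?: (?:at|on) (?P<site>{SITE}))? by (?P<kinase>{PROTEIN})"
    ),
    # Nominal form: "phosphorylation of <substrate> [at|on <site>] [by <kinase>]"
    re.compile(
        rf"[Pp]hosphorylation of (?P<substrate>{PROTEIN})"
        rf"(?: (?:at|on) (?P<site>{SITE}))?(?: by (?P<kinase>{PROTEIN}))?"
    ),
]


def extract_phosphorylation(sentence):
    """Return one dict per pattern match; missing arguments are None."""
    results = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            results.append(
                {
                    "kinase": match.group("kinase"),
                    "substrate": match.group("substrate"),
                    "site": match.group("site"),
                }
            )
    return results


if __name__ == "__main__":
    for text in (
        "Akt phosphorylates BAD at Ser136.",
        "BAD is phosphorylated on Ser112 by PKA.",
    ):
        print(extract_phosphorylation(text))
```

Each way of stating the same fact (active, passive, nominal) needs its own pattern, which is why a production rule set grows to a large number of expressions.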

RLIMS-P Interface: Search
New interface!
- Keywords
- List of PMIDs
- Provides suggestions of protein and gene names while typing

RLIMS-P Interface: Result Table
Query: BAD
Arrange data according to interest
Statistics
- Summary: lists all kinases and phospho-proteins found per abstract
- PMID: lists all kinases, phospho-proteins and sites found per abstract
- Kinase: lists results based on individual kinases extracted by RLIMS-P
- Substrate: lists results based on individual substrates extracted by RLIMS-P
Kinase, substrate and sites are color-coded.

Text Evidence Page

eFIP: Functional Impact of Phosphorylation
The eFIP system for text mining of protein interaction networks of phosphorylated proteins. Tudor CO, Arighi CN, Wang Q, Wu CH, Shanker VK. (2012) Database (doi: /database/bas044)
Find relation between phosphorylation and protein interaction
Bad phosphorylation induced by survival factors leads to its preferential binding to and suppression of the death-inducing function of Bad. (PMID )
Protein interaction in eFIP:
- Protein-protein
- Protein-protein complex
- Protein-protein region
- Protein-protein class
Examples of interaction-related terms detected by eFIP:
- Binding
- Interact
- Complex
- Dissociates (used to capture a negative impact of phosphorylation)
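The term list above suggests a simple way to picture the first step of such a system: flag sentences that mention phosphorylation together with an interaction-related term. The Python sketch below is an assumption-laden illustration only. The labels "interaction" and "negative", the function name `find_interaction_evidence`, and the co-occurrence logic are hypothetical; the slides do not describe how eFIP actually links the two events.

```python
import re

# Interaction-related terms named on the slide, mapped to a hypothetical label.
# "dissociates" is the slide's example of a negative impact of phosphorylation.
INTERACTION_TERMS = {
    "binding": "interaction",
    "interact": "interaction",
    "complex": "interaction",
    "dissociates": "negative",
}

# Matches "phosphorylation", "phosphorylated", "phosphorylates", ...
PHOSPHO = re.compile(r"phosphorylat", re.IGNORECASE)


def find_interaction_evidence(sentence):
    """Return (term, label) pairs if the sentence also mentions phosphorylation."""
    if not PHOSPHO.search(sentence):
        return []
    lowered = sentence.lower()
    return [
        (term, label)
        for term, label in INTERACTION_TERMS.items()
        if re.search(rf"\b{re.escape(term)}", lowered)
    ]


if __name__ == "__main__":
    slide_sentence = (
        "Bad phosphorylation induced by survival factors leads to its "
        "preferential binding to and suppression of the death-inducing "
        "function of Bad."
    )
    print(find_interaction_evidence(slide_sentence))
```

Plain co-occurrence over-generates; it only narrows down candidate sentences and does not establish that the phosphorylation causes the interaction.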

eFIP Architecture

eFIP Website
To correct and save eFIP results

eFIP: To find relevant papers about phosphorylated proteins and their functions
Search for BAD
If logged in

Discovery from Literature Mining
Distinct phosphorylated forms of a protein may have different interacting proteins, leading to different subcellular locations, functions and pathways.
Literature mining connects the impact to different BAD forms and, through kinases, links BAD to pathways.
PMID:

Figure 1: Overview of the Workflow
[Workflow diagram. Labels: TEXT MINING (PubMed search results; RLIMS-P; set of phosphorylation-related articles for curation); DATA MINING (Protein A, Protein B; protein-protein interaction databases; functional annotation terms); RACE-PRO; the Protein Ontology (PRO); VISUALIZATION (Cytoscape)]

eGIFT
Uses natural language processing techniques to retrieve iTerms (informative terms) relevant to a specific gene.
Gene-centric document retrieval and categorization

Applications
Finding relevant articles to assist in biocuration:
- of protein phosphorylated forms and complexes in the Protein Ontology
- of phosphorylated proteins in external databases, such as phospho.ELM (PMID: )
- pathway curation in Gallus Reactome (The Third Workshop on Integrative Data Analysis in Systems Biology (IDASB) 2012)
Automatic information extraction from literature to improve knowledgebase content (iPTM and Gallus Reactome)
Improvement of kinase site prediction algorithms (RLIMS-P)
Finding sets of genes/proteins with common iTerms (eGIFT)

What’s in it for UniProt?
1-For curation: Assist in prioritization of entry annotation based on potentially relevant information on protein features (phosphorylation)
As of 03/11/2013 in Medline:
- # of RLIMS-P positive PMIDs = 135,739
- # with site information = 41,947
- # with kinase information = 38,924
2-For the UniProt user: Processing the UniProtKB additional bibliography with RLIMS-P could provide the UniProt user with an extra layer of information that he/she could readily use. Use the eFIP/eGIFT model of displaying documents based on the information content of the additional bibliography.
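One way to picture the prioritization idea is to rank entries by how many PMIDs in their additional bibliography were flagged as positive by the text mining tool. The Python sketch below assumes a plain dictionary from entry identifier to PMIDs and a set of positive PMIDs; the function name `prioritize_entries`, the data layout, and all identifiers are placeholders invented for the example, not a UniProt or RLIMS-P interface.

```python
def prioritize_entries(bibliography, positive_pmids):
    """Rank entries by the number of text-mining-positive PMIDs they cite.

    bibliography: dict mapping an entry id to a list of PMIDs (strings).
    positive_pmids: set of PMIDs flagged as positive by the text mining tool.
    Returns a list of (entry id, sorted positive PMIDs), most hits first;
    ties are broken alphabetically by entry id.
    """
    scored = []
    for entry, pmids in bibliography.items():
        hits = sorted(set(pmids) & positive_pmids)
        scored.append((entry, hits))
    scored.sort(key=lambda item: (-len(item[1]), item[0]))
    return scored


if __name__ == "__main__":
    # Placeholder identifiers, not real entries or PMIDs.
    bibliography = {
        "ENTRY_A": ["101", "102", "103"],
        "ENTRY_B": ["104"],
        "ENTRY_C": ["105", "106"],
    }
    positive = {"101", "103", "106"}
    for entry, hits in prioritize_entries(bibliography, positive):
        print(entry, len(hits), hits)
```

A curator could then start with the entries at the top of the list, where unannotated phosphorylation information is most likely to be found.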

Example: Additional Bibliography for raptor: 30 PMIDs

New Information from Additional Bibliography and RLIMS-P
T908 not annotated

BioCreative Activities Interactive Text Mining

BioCreative: Critical Assessment of Information Extraction in Biology
International community-wide effort to evaluate text mining and information extraction systems applied to the biological domain
BioNLP, Text REtrieval Conference (TREC): strong linguistic focus, with topics of interest to the NLP community
BioCreative workshops are very much driven by the needs of users, with focus on:
- Biocuration tasks
- Biocuration workflows
- Interoperability

Background
- BioCreative I: 2004, Granada, Spain -> BMC Bioinformatics 2005, 6 (Suppl 1)
- BioCreative II: 2007, Madrid, Spain -> Genome Biology 2008, 9 (Suppl 2)
- BioCreative II.5: 2009, Madrid, Spain -> IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010
- BioCreative III: 2010, Bethesda, USA -> BMC Bioinformatics 2011, 12 (Suppl 8)
- Biocuration and Text Mining: 2012, Georgetown U, USA -> Database Virtual Issue 2012
- BioCreative IV: 2013

BioCreative Traditional Tracks
- Ranking of relevant documents (document triage)
- Extraction of gene and protein names (gene mention)
- Linkage of names to database identifiers (gene normalization)
- Extraction of functional annotation in standard ontologies (GO)
- Extraction of entity relations (e.g. protein-protein interaction)
[Evaluation diagram. Labels: Biocurators annotate corpus; Testing set; TM system; Compare annotation]
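The "compare annotation" step is commonly scored with precision, recall and F1 against the curator-annotated set. The Python sketch below assumes annotations can be represented as hashable items such as (document, gene) pairs; the function name `evaluate` and the sample identifiers are hypothetical, and the slides do not say which exact scoring each track used.

```python
def evaluate(gold, predicted):
    """Compare system annotations with curator (gold) annotations.

    gold, predicted: iterables of hashable annotations, e.g. (doc_id, gene_id).
    Returns (precision, recall, f1); each is 0.0 when undefined.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Placeholder document and gene identifiers.
    gold = {("DOC1", "GeneA"), ("DOC1", "GeneB"), ("DOC2", "GeneC"), ("DOC3", "GeneD")}
    predicted = {("DOC1", "GeneA"), ("DOC2", "GeneC"), ("DOC2", "GeneX")}
    precision, recall, f1 = evaluate(gold, predicted)
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case two of the three system annotations are correct and two of the four gold annotations are found.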

BioCreative Interactive task
Active involvement of the end users to guide development and evaluation of useful tools and standards.
[Evaluation diagram. Labels: Manual annotation; TM system; System-assisted annotation; Compare annotation and time spent in curation]

User Advisory Group (UAG)
A diverse sample of end users with multiple text mining needs
Roles:
- Develop the end user requirements for interactive text mining task
- Provide logistics on system evaluation
- Assist in annotating corpora and testing the systems
Workshop 2012 and BioCreative IV
UAG Member                 Affiliation
Donghui Li                 TAIR
Judy Blake                 MGI
Kimberly Van Auken         WormBase
Fiona McCarthy             AgBase
Mary Schaeffer             MaizeDB
Stan Laulederkind          RGD
Peter McQuilton            FlyBase
Phoebe Roberts             Pfizer
Andrew Chatr-Aryamontri    BioGrid
Sandra Orchard             IntAct
Sherri Matis               AstraZeneca
BioCreative III
UAG Member                 Affiliation
Eva Huala                  TAIR
Lois Maltais               MGI (not current)
Paul Sternberg             Wormbase
Pascale Gaudet             dictyBase (not current)
Ian Harrow                 Pfizer (not current)
Michele Gwinn Giglio       University of Maryland
Phoebe Roberts             Pfizer
Andrew Chatr-Aryamontri    BioGrid
Luca Toldo                 Merck (not current)
Gianni Cesarini            MINT

BioCreative Interactive Task
1-Recruitment of Teams
Call for participation via NLP-related mailing lists and
Interested teams should provide a document addressing:
- Relevance and Impact
- Adaptability
- Interactivity
- Performance
2-Recruitment of Curators
Call for participation via International Society for Biocuration (ISB) mailing list, and the ISB meeting and BioCreative websites

BioCreative Interactive Task Workflow
1-Preparation phase
- Submission of text mining system description
- Submission of internal benchmarking result, test set and URL
- Did team provide benchmarking results?
  Yes: Participation in pre-workshop evaluation
  No: System cannot participate in pre-workshop evaluation, but team is invited to participate in demo and poster session during workshop.
- Post list of systems and recruitment of biocurators
- Team/biocurator pairing
- System tuned to biocuration group (optional)
Key: Coordinators; Teams; Curators; Coord/teams

BioCreative interactive task workflow
2-Training phase
- Team provides training via demo, examples, help document, annotation guidelines, and output format
- Practice with examples, report bugs
- Is biocurator familiar with system and annotation? (Yes / No)
3-Evaluation phase
- Dataset selected by domain expert (or coordinator)
- Gold Standard: Dataset manually annotated by independent expert
- Manual Annotation
- System-assisted Annotation
- Fill user survey
- Collect output and calculate metrics
- Report at Workshop
Other diagram labels: 1/2
Key: Coordinators; Teams; Curators; Coord/teams

BioCreative Interactive Tasks
BioCreative III:
- Identify genes that are “primary/central” (biologically relevant) in the context of the article (full-length), and normalization
- Retrieve articles for which a given gene is “primary/central”
6 teams participated, 12 biocurators tested systems
BioCreative 2012:
- Open to any literature-based biocuration task
7 teams participated, more than 40 biocurators tested systems
BioCreative IV, October 2013:
- Open to any literature-based biocuration task
21 teams registered!! Will recruit biocurators at biocuration meeting

Teams Registered in BioCreative
teams covering very diverse tasks
System: TextPresso
  Tasks: Curation of subcellular localization using Gene Ontology cellular component
  Articles: Full-Text
System: PCS (Charaparser)
  Tasks: Curation of Entity-Quality terms from phylogenetic literature using ontologies
  Articles: NA
System: PubTator
  Tasks: Document triage (relevant documents for curation) and bioconcept annotation (gene, disease, chemicals)
  Articles: Abstract
System: PPIFinder
  Tasks: Mining of protein-protein interactions for human proteins (abstract and full-length articles): document classification and extraction of interacting proteins and keywords
  Articles: Abstract
System: eFIP
  Tasks: Mining protein interactions of phosphorylated proteins from the literature. Document classification and information extraction of phosphorylated protein, protein binding partners and impact keyword
  Articles: Abstract
System: T-HOD
  Tasks: Document triage for disease-related genes (relevant documents for curation) and bioconcept annotation (gene, disease and relation)
  Articles: Abstract
System: Tagtog
  Tasks: Protein/gene mention recognition via interactive learning and annotation framework
  Articles: Abstract

User Survey
What do we measure?
- Precision at document and/or sentence level
- Recall at document and/or sentence level
- Time: manual vs. system-assisted
- Survey results: correlation of responses to questions with overall system satisfaction, to learn what aspects are important to users
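The survey analysis described above can be sketched as a correlation between each question's scores and the overall satisfaction score. The Python example below uses a hand-written Pearson coefficient; the question names, the 1 to 5 scale, the sample scores, and the function names `pearson` and `rank_questions` are assumptions for illustration, since the slides do not state which correlation measure or scale was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (0.0 if undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (spread_x * spread_y)


def rank_questions(responses, overall):
    """Sort questions by the strength of their correlation with overall satisfaction.

    responses: dict mapping a question name to one score per curator.
    overall: overall satisfaction scores, in the same curator order.
    """
    scored = [(question, pearson(scores, overall)) for question, scores in responses.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical survey data: five curators, scores on a 1-5 scale.
    overall = [1, 2, 3, 4, 5]
    responses = {
        "speed": [2, 1, 4, 3, 5],
        "ease_of_use": [1, 2, 3, 4, 5],
        "layout": [3, 3, 3, 3, 3],
    }
    for question, r in rank_questions(responses, overall):
        print(f"{question}: r={r:.2f}")
```

Questions at the top of the ranking are the aspects that track overall satisfaction most closely in the sample; with so few respondents the values are indicative only.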

What’s in it for UniProt?
- As users, we can guide the development of tools that are useful for biocuration
- We have access to state-of-the-art text mining tools
- We can participate to ensure the use of standards and the quality of annotations provided by the tools
Publications