Text mining activities at PIR Cecilia Arighi March 12, 2013.

Presentation on theme: "Text mining activities at PIR Cecilia Arighi March 12, 2013."— Presentation transcript:

1 Text mining activities at PIR Cecilia Arighi March 12, 2013

2 Text mining projects in collaboration with:
1. Dr. Vijay Shanker, CIS Department, University of Delaware
2. BioCreative Consortium

3 iProLink: Text mining resources at PIR
http://proteininformationresource.org/pirwww/iprolink/
Resource to facilitate text mining for biocuration, with a focus on annotation of post-translational modifications (PTMs).
-RLIMS-P: Text mining tool for extraction of protein phosphorylation information
-eFIP: Extracting Functional Impact of protein Phosphorylation
-eGIFT: Extracting Gene Information From Text

4 RLIMS-P: extraction of protein phosphorylation information
Rule-based information extraction system.
Rule-based systems make use of:
-knowledge about how language is structured
-specific knowledge about how biologically relevant facts are stated in the biomedical literature
PMID:2141171
The tool needs to capture the different ways that protein phosphorylation is described in the literature.
It extracts information about:
-the phosphorylated protein(s)
-the kinase(s)
-the phosphorylation site(s)
RLIMS-P 2.0 uses over 100 regular expressions, some of which play a supporting role (e.g., for anaphora resolution).
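To make the rule-based idea concrete, here is a minimal sketch of how a pair of regular expressions could pull a kinase, a substrate and a site out of a sentence. The two patterns, the function name `extract_phosphorylation` and the example sentences are illustrative assumptions; they are not the actual RLIMS-P rules, which are far more numerous and also handle cases such as anaphora.

```python
import re

# Hypothetical rule 1 (active voice): "<kinase> phosphorylates <substrate> [at|on <site>]"
ACTIVE = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+) phosphorylates (?P<substrate>[A-Za-z0-9-]+)"
    r"(?: (?:at|on) (?P<site>(?:Ser|Thr|Tyr)-?\d+))?"
)

# Hypothetical rule 2 (passive voice): "<substrate> is phosphorylated [at|on <site>] [by <kinase>]"
PASSIVE = re.compile(
    r"(?P<substrate>[A-Za-z0-9-]+) (?:is|was) phosphorylated"
    r"(?: (?:at|on) (?P<site>(?:Ser|Thr|Tyr)-?\d+))?"
    r"(?: by (?P<kinase>[A-Za-z0-9-]+))?"
)


def extract_phosphorylation(sentence):
    """Return one dict per rule match; fields a rule did not capture are None."""
    results = []
    for pattern in (ACTIVE, PASSIVE):
        for match in pattern.finditer(sentence):
            results.append({
                "kinase": match.group("kinase"),
                "substrate": match.group("substrate"),
                "site": match.group("site"),
            })
    return results


if __name__ == "__main__":
    # Illustrative sentences written for this sketch, not taken from an abstract.
    print(extract_phosphorylation("Akt phosphorylates BAD at Ser136."))
    print(extract_phosphorylation("BAD is phosphorylated on Ser-112 by PKA."))
```

Each new phrasing found in the literature would need its own pattern, which is why a production system of this kind ends up with a large rule set.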

5 RLIMS-P Interface: Search (new interface!)
-Keywords
-List of PMIDs
Provides suggestions of protein and gene names while typing.

6 RLIMS-P Interface: Result Table
Query: BAD
Statistics
Arrange data according to interest:
-Summary: lists all kinases and phospho-proteins found per abstract
-PMID: lists all kinases, phospho-proteins and sites found per abstract
-Kinase: lists results based on individual kinases extracted by RLIMS-P
-Substrate: lists results based on individual substrates extracted by RLIMS-P
Kinases, substrates and sites are color-coded.

7 Text Evidence Page

8 eFIP: Functional Impact of Phosphorylation
The eFIP system for text mining of protein interaction networks of phosphorylated proteins. Tudor CO, Arighi CN, Wang Q, Wu CH, Shanker VK. (2012) Database (doi: 10.1093/database/bas044)
Find the relation between phosphorylation and protein interaction:
"Bad phosphorylation induced by survival factors leads to its preferential binding to 14-3-3 and suppression of the death-inducing function of Bad." (PMID 10579309)
Protein interaction in eFIP:
-Protein-protein
-Protein-protein complex
-Protein-protein region
-Protein-protein class
Examples of interaction-related terms detected by eFIP:
-Binding
-Interact
-Complex
-Dissociates (used to capture a negative impact of phosphorylation)
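A rough sketch of the term-detection step follows: it flags a sentence only when it mentions phosphorylation and at least one of the four interaction-related terms listed on the slide. The stem patterns, the function name `find_interaction_terms` and the "unspecified" label are assumptions made for illustration; the slide only states that "dissociates" is used to capture a negative impact, and this is not the eFIP implementation.

```python
import re

# Interaction-related terms from the slide, each mapped to a hypothetical
# pattern covering a few common inflections.
INTERACTION_PATTERNS = {
    "binding": r"\bbind(?:s|ing)?\b|\bbound\b",
    "interact": r"\binteract(?:s|ed|ing|ion|ions)?\b",
    "complex": r"\bcomplex(?:es)?\b",
    "dissociates": r"\bdissociat(?:e|es|ed|ing|ion)\b",
}

# The slide says "dissociates" captures a negative impact of phosphorylation.
NEGATIVE_TERMS = {"dissociates"}

PHOSPHO = re.compile(r"phosphorylat", re.IGNORECASE)


def find_interaction_terms(sentence):
    """Return (term, impact) pairs for a sentence that mentions phosphorylation."""
    if not PHOSPHO.search(sentence):
        return []
    hits = []
    for term, pattern in INTERACTION_PATTERNS.items():
        if re.search(pattern, sentence, re.IGNORECASE):
            # Assumption: terms other than "dissociates" get no polarity here.
            impact = "negative" if term in NEGATIVE_TERMS else "unspecified"
            hits.append((term, impact))
    return hits


if __name__ == "__main__":
    example = (
        "Bad phosphorylation induced by survival factors leads to its "
        "preferential binding to 14-3-3 and suppression of the "
        "death-inducing function of Bad."
    )
    print(find_interaction_terms(example))
```

Keyword matching of this kind only finds candidate sentences; linking the phosphorylated protein to its binding partner needs the additional extraction steps that the slide attributes to eFIP.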

9 eFIP Architecture

10 eFIP Website: To correct and save eFIP results

11 eFIP: To find relevant papers about phosphorylated proteins and their functions. Search for BAD. If logged in

12 Discovery from Literature Mining
Distinct phosphorylated forms of a protein may have different interacting proteins, leading to different subcellular locations, functions and pathways.
Literature mining connects the impact to different BAD forms and, through kinases, links BAD to pathways.
PMID:14967141

13 Figure 1: Overview of the Workflow
-TEXT MINING: PubMed search results, RLIMS-P, set of phosphorylation-related articles for curation
-DATA MINING: protein-protein interaction databases (Protein A, Protein B), functional annotation terms
-RACE-PRO, THE PROTEIN ONTOLOGY (PRO)
-VISUALIZATION: Cytoscape

14 eGIFT: Gene-centric document retrieval and categorization
Uses natural language processing techniques to retrieve iTerms (informative terms) relevant to a specific gene.
http://biotm.cis.udel.edu/eGIFT/
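One simple way to picture "informative terms" is to rank words by how much more often they occur in abstracts about a gene than in a background set. The sketch below does exactly that; the scoring formula, the function names and the toy documents are assumptions for illustration and should not be read as the eGIFT algorithm.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each lowercase token, the number of documents containing it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return counts


def rank_terms(gene_docs, background_docs, top_n=3):
    """Rank terms by (rate in gene documents) / (smoothed rate in background)."""
    gene_df = document_frequency(gene_docs)
    back_df = document_frequency(background_docs)
    scores = {}
    for term, n in gene_df.items():
        gene_rate = n / len(gene_docs)
        # Add-one smoothing so terms absent from the background do not divide by zero.
        back_rate = (back_df[term] + 1) / (len(background_docs) + 1)
        scores[term] = gene_rate / back_rate
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


if __name__ == "__main__":
    # Toy documents written for this sketch.
    gene_docs = ["bad promotes apoptosis", "bad binds bcl-xl in apoptosis"]
    background_docs = [
        "the protein binds dna",
        "cells in culture",
        "the kinase promotes growth",
    ]
    print(rank_terms(gene_docs, background_docs))
```

Words that are common everywhere ("in", "binds") score low, while words concentrated in the gene's own literature rise to the top, which is the intuition behind gene-centric term retrieval.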

15 Applications
Finding relevant articles to assist in biocuration of:
-protein phosphorylated forms and complexes in the Protein Ontology
-phosphorylated proteins in external databases, such as phospho.ELM (PMID: 17962309)
-pathways in Gallus Reactome (The Third Workshop on Integrative Data Analysis in Systems Biology (IDASB) 2012)
Automatic information extraction from literature to improve knowledgebase content (iPTM and Gallus Reactome)
Improvement of kinase site prediction algorithms (RLIMS-P)
Finding sets of genes/proteins with common iTerms (eGIFT)

16 What's in it for UniProt?
1-For curation: Assist in prioritizing entry annotation based on potentially relevant information on protein features (phosphorylation).
As of 03/11/2013 in Medline:
-# of RLIMS-P positive PMIDs = 135,739
-# with site information = 41,947
-# with kinase information = 38,924
2-For the UniProt user: Running RLIMS-P on the UniProtKB additional bibliography could provide the UniProt user with an extra layer of readily usable information. Use the eFIP/eGIFT model of displaying documents based on the information content of the additional bibliography.

17 Example: Additional Bibliography for raptor: 30 PMIDs

18 New Information from Additional Bibliography and RLIMS-P: T908 not annotated

19 BioCreative Activities Interactive Text Mining

20 BioCreative: Critical Assessment of Information Extraction in Biology
International community-wide effort to evaluate text mining and information extraction systems applied to the biological domain.
BioNLP and the Text REtrieval Conference (TREC): strong linguistic focus, with topics of interest to the NLP community.
BioCreative workshops are very much driven by the needs of users, with a focus on:
-Biocuration tasks
-Biocuration workflows
-Interoperability

21 Background
-BioCreative I: 2004, Granada, Spain → BMC Bioinformatics 2005, 6 (Suppl 1)
-BioCreative II: 2007, Madrid, Spain → Genome Biology 2008, 9 (Suppl 2)
-BioCreative II.5: 2009, Madrid, Spain → IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010
-BioCreative III: 2010, Bethesda, USA → BMC Bioinformatics 2011, Suppl 8
-Biocuration and Text Mining: 2012, Georgetown U, USA → Database Virtual Issue 2012
-BioCreative IV: 2013

22 BioCreative Traditional Tracks
-Ranking of relevant documents (document triage)
-Extraction of gene and protein names (gene mention)
-Linkage of names to database identifiers (gene normalization)
-Extraction of functional annotation in standard ontologies (GO)
-Extraction of entity relations (e.g. protein-protein interaction)
Evaluation: biocurators annotate a corpus, the TM system is run on the testing set, and the annotations are compared.

23 BioCreative Interactive Task
Active involvement of the end users to guide development and evaluation of useful tools and standards.
Evaluation: manual annotation vs. system-assisted annotation (TM system); compare annotation and time spent in curation.

24 User Advisory Group (UAG)
A diverse sample of end users with multiple text mining needs.
Roles:
-Develop the end user requirements for the interactive text mining task
-Provide logistics on system evaluation
-Assist in annotating corpora and testing the systems
Workshop 2012 and BioCreative IV (UAG Member, Affiliation):
-Donghui Li, TAIR
-Judy Blake, MGI
-Kimberly Van Auken, WormBase
-Fiona McCarthy, AgBase
-Mary Schaeffer, MaizeDB
-Stan Laulederkind, RGD
-Peter McQuilton, FlyBase
-Phoebe Roberts, Pfizer
-Andrew Chatr-Aryamontri, BioGrid
-Sandra Orchard, IntAct
-Sherri Matis, AstraZeneca
BioCreative III (UAG Member, Affiliation):
-Eva Huala, TAIR
-Lois Maltais, MGI (not current)
-Paul Sternberg, WormBase
-Pascale Gaudet, dictyBase (not current)
-Ian Harrow, Pfizer (not current)
-Michele Gwinn Giglio, University of Maryland
-Phoebe Roberts, Pfizer
-Andrew Chatr-Aryamontri, BioGrid
-Luca Toldo, Merck (not current)
-Gianni Cesarini, MINT

25 BioCreative Interactive Task
1-Recruitment of Teams
Call for participation via NLP-related mailing lists and
Interested teams should provide a document addressing:
-Relevance and Impact
-Adaptability
-Interactivity
-Performance
2-Recruitment of Curators
Call for participation via the International Society for Biocuration (ISB) mailing list, and the ISB meeting and BioCreative websites

26 BioCreative Interactive Task Workflow
Key: Coordinators, Teams, Curators, Coord/teams
1-Preparation phase
-Submission of text mining system description
-Submission of internal benchmarking result, test set and URL
-Did team provide benchmarking results?
 Yes: Participation in pre-workshop evaluation
 No: System cannot participate in pre-workshop evaluation, but team is invited to participate in demo and poster session during workshop.
-Post list of systems and recruitment of biocurators
-Team/biocurator pairing
-System tuned to biocuration group (optional)

27 BioCreative interactive task workflow
Key: Coordinators, Teams, Curators, Coord/teams
2-Training phase
-Team provides training via demo, examples, help document, annotation guidelines, and output format
-Practice with examples, report bugs
-Is biocurator familiar with system and annotation? (Yes/No)
3-Evaluation phase
-1/2 dataset, selected by domain expert (or coordinator)
-Gold standard: dataset manually annotated by independent expert
-Manual annotation
-System-assisted annotation
-Fill user survey
-Collect output and calculate metrics
-Report at workshop

28 BioCreative Interactive Tasks
BioCreative III:
-Identify genes that are “primary/central” (biologically relevant) in the context of the full-length article, and normalize them
-Retrieve articles for which a given gene is “primary/central”
6 teams participated; 12 biocurators tested systems
BioCreative 2012:
-Open to any literature-based biocuration task
7 teams participated; more than 40 biocurators tested systems
BioCreative IV, October 2013:
-Open to any literature-based biocuration task
21 teams registered!! Will recruit biocurators at the biocuration meeting

29 Teams Registered in BioCreative 2012
7 teams covering very diverse tasks (System. Tasks. Articles):
-TextPresso. Curation of subcellular localization using Gene Ontology cellular component. Articles: Full-Text
-PCS (Charaparser). Curation of Entity-Quality terms from phylogenetic literature using ontologies. Articles: NA
-PubTator. Document triage (relevant documents for curation) and bioconcept annotation (gene, disease, chemicals). Articles: Abstract
-PPIFinder. Mining of protein-protein interactions for human proteins (abstract and full-length articles): document classification and extraction of interacting proteins and keywords. Articles: Abstract
-eFIP. Mining protein interactions of phosphorylated proteins from the literature: document classification and information extraction of phosphorylated protein, protein binding partners and impact keyword. Articles: Abstract
-T-HOD. Document triage for disease-related genes (relevant documents for curation) and bioconcept annotation (gene, disease and relation). Articles: Abstract
-Tagtog. Protein/gene mention recognition via interactive learning and annotation framework. Articles: Abstract

30 User Survey
What do we measure?
-Precision at document and/or sentence level
-Recall at document and/or sentence level
-Time: manual vs. system-assisted
Survey results: correlation of responses to questions with overall system satisfaction, to learn what aspects are important to users
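The measures listed above can be sketched in a few lines. This example computes document-level precision and recall of a system's output against a gold standard, plus the fraction of curation time saved. The function names, document identifiers and timings are hypothetical, and the task's actual scoring scripts may differ.

```python
def precision_recall(predicted, gold):
    """Document-level precision and recall of predicted IDs against gold IDs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def time_saved_fraction(manual_minutes, assisted_minutes):
    """Fraction of manual curation time saved by system-assisted curation."""
    return (manual_minutes - assisted_minutes) / manual_minutes


if __name__ == "__main__":
    # Hypothetical identifiers and timings, for illustration only.
    system_output = ["doc1", "doc2", "doc3", "doc4"]
    gold_standard = ["doc2", "doc3", "doc5"]
    precision, recall = precision_recall(system_output, gold_standard)
    print(f"precision={precision:.2f} recall={recall:.2f}")
    print(f"time saved={time_saved_fraction(60, 45):.0%}")
```

The same functions apply at sentence level if the identifiers are sentence IDs instead of document IDs.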

31 User Survey: What's in it for UniProt?
-As users, we can guide the development of tools that are useful for biocuration
-We have access to state-of-the-art text mining tools
-Participate to ensure the use of standards and the quality of annotations provided by the tools
-Publications

