Published by Albert Todd. Modified over 8 years ago.
Labeling protein-protein interactions

Barbara Rosario, Marti Hearst
SIMS, UC Berkeley

Project overview

We address the problem of multi-way relation classification, using a database that serves as a proxy for training data. Using two graphical models we achieve an accuracy of 60% for a 10-way distinction between relation types on individual sentences. We also provide evidence that the exploitation of the sentences surrounding a citation to a paper can yield higher accuracy than the intra-document sentences alone.

The problem

- Identifying the interactions between proteins.
- Labeling the types of interactions.

Related work either identifies the interactions but not their types, or uses keywords to select out interaction types.

The data

HIV-1 Human Protein Interaction Database [1]

Documents interactions between HIV-1 proteins and:
- host cell proteins
- other HIV-1 proteins
- disease associated with HIV/AIDS

2224 pairs of interacting proteins, 65 interaction types.

For each documented protein-protein interaction the database includes information about:
- A pair of proteins (PP)
- The interaction type(s) between them (I)
- PubMed identification numbers of the journal article(s) describing the interaction(s) (A)

We use this information to gather labeled data and to train and test a system that classifies interaction types.

Example database entry:

Prot1    Prot2     Interaction I    PubMed A
10000    155871    activates        11156964

Using the HIV-1 database to gather labeled data

"Papers"

For a random subset of the protein pairs PP, we downloaded the corresponding PubMed papers. From these, we extracted all and only those sentences that contain both proteins from the indicated protein pair. We assigned each of these sentences the corresponding interaction I from the database. For the example entry: extract from paper A all the sentences that contain both Prot1 and Prot2, then label them with the interaction I given in the database (activates).

"Citances"

To test the hypothesis that the sentences surrounding citations to related work, or citances, are a useful resource for bioNLP [2], we downloaded the papers that cite A. From these citing papers, we extracted all and only those sentences that mention A explicitly; we further filtered these to include all and only the sentences that contain PP. We labeled each of these sentences with interaction type I.

Number of labeled sentences per interaction type:

Interaction        Papers    Citances
Degrades           60        63
Synergizes with    86        101
Stimulates         103       64
Binds              98        324
Inactivates        68        92
Interacts with     62        100
Requires           96        297
Upregulates        119       98
Inhibits           78        84
Suppresses         51        99

The tasks

Given the sentences extracted from paper A and/or the citation sentences:
- Determine the interaction I given in the HIV-1 database for paper A.
- Identify the proteins involved in the interaction (protein name tagging, or role extraction).

(We consider only the ten interactions of the table above.)

The models

- Dynamic graphical model (DM) for protein interaction classification (and role extraction) [3].
- Naïve Bayes (NB) for interaction classification.

Analyzing the results

- Hiding the protein names: "Selective CXCR4 antagonism by Tat" becomes "Selective PROT1 antagonism by PROT2". This checks whether the interaction types could be unambiguously determined by the protein names.
- Comparing the results with a trigger-words approach.

Results

Interaction classification (classification accuracies, %)

Model                                               All*    Papers    Citances
DM                                                  60.5    57.8      53.4
NB                                                  58.1    57.8      55.7
No protein names: DM                                60.5    44.4      52.3
No protein names: NB                                59.7    46.7      53.4
Trigger words (with back-off)                       25.8    40.0      26.1
Baseline: choose the most frequent interaction      21.8    11.1      26.1

Protein name tagging (with DM)

            Recall    Precision    F-measure
All*        0.74      0.85         0.79
Papers      0.56      0.83         0.67
Citances    0.75      0.84         0.79

* All: sentences from "papers" and "citances" together.
Smoothing: absolute discounting. Parameters found with cross-validation.

Summary

- Difficult and important problem: the classification of (ten) different interaction types between proteins in text.
- The dynamic graphical model DM can simultaneously perform protein name tagging and relation identification.
- High accuracy on both problems (well above the baselines).
- The results obtained by removing the protein names indicate that our models learn the linguistic context of the interactions.
- We found evidence supporting the hypothesis that citation sentences are a good source of training data, most likely because they provide a concise and precise way of summarizing facts in the bioscience literature.
- Finally, we used a protein-interaction database to automatically gather labeled data for this task.

References

1. HIV-1 database: http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/index.html
2. P. Nakov, A. Schwartz and M. Hearst, "Citances: Citation Sentences for Semantic Analysis of Bioscience Text", Proceedings of the SIGIR'04 Workshop on Search and Discovery in Bioinformatics, 2004.
3. B. Rosario and M. Hearst, "Classifying Semantic Relations in Bioscience Texts", Proceedings of ACL-04, 2004.

Our sponsors

NSF-DBI-0317510 and Genentech
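Illustration: gathering labeled sentences

The "Papers" labeling step and the protein-name hiding step described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the authors' implementation: the `DatabaseEntry` fields, the naive regex sentence splitter, the exact-string matching of protein names, and the interaction label used in the demo are all hypothetical choices made for illustration. The poster does not say how sentences were split or how protein mentions were matched.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseEntry:
    """One database record; the field names are hypothetical."""
    prot1: str        # surface name of the first protein
    prot2: str        # surface name of the second protein
    interaction: str  # interaction type I
    pubmed_id: str    # article A describing the interaction


def split_sentences(text: str) -> list[str]:
    """Naive splitter on sentence-final punctuation (assumption: a real
    system would use a proper biomedical sentence splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(sentence: str, name: str) -> bool:
    """True if `name` occurs in `sentence` as a whole word."""
    return re.search(rf"\b{re.escape(name)}\b", sentence) is not None


def label_sentences(entry: DatabaseEntry, paper_text: str) -> list[tuple[str, str]]:
    """Keep all and only the sentences that contain both proteins of the
    pair, and label each with the interaction given in the database."""
    return [
        (sentence, entry.interaction)
        for sentence in split_sentences(paper_text)
        if mentions(sentence, entry.prot1) and mentions(sentence, entry.prot2)
    ]


def hide_protein_names(sentence: str, entry: DatabaseEntry) -> str:
    """Replace the two protein names with the placeholders PROT1 and PROT2."""
    masked = re.sub(rf"\b{re.escape(entry.prot1)}\b", "PROT1", sentence)
    return re.sub(rf"\b{re.escape(entry.prot2)}\b", "PROT2", masked)


if __name__ == "__main__":
    # The label and PubMed id below are placeholders, not database values.
    entry = DatabaseEntry("CXCR4", "Tat", "inhibits", "PLACEHOLDER")
    text = "Selective CXCR4 antagonism by Tat. Tat alone appears in this sentence."
    for sentence, label in label_sentences(entry, text):
        print(label, "|", hide_protein_names(sentence, entry))
```

The same `label_sentences` filter could be applied to citances by first restricting the citing papers to the sentences that explicitly cite article A. Note that every matching sentence inherits the database label whether or not it actually expresses that interaction, which is what makes the database a proxy for hand-labeled training data.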