Literature Mining and Database Annotation of Protein Phosphorylation Using a Rule-based System Z. Z. Hu 1, M. Narayanaswamy 2, K. E. Ravikumar 2, K. Vijay-Shanker.

Slides:



Advertisements
Similar presentations
Literature Data Mining and Protein Ontology Development at the Protein Information Resource (PIR) Hu ZZ 1, Mani I 2, Liu H 3, Vijay-Shanker K 4, Hermoso.
Advertisements

Bio-Medical Interaction Extractor Syed Toufeeq Ahmed ASU.
Specialized models and ranking for coreference resolution Pascal Denis ALPAGE Project Team INRIA Rocquencourt F Le Chesnay, France Jason Baldridge.
Sequence Classification: Chunking Shallow Processing Techniques for NLP Ling570 November 28, 2011.
Mining External Resources for Biomedical IE Why, How, What Malvina Nissim
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and.
Protein databases Morten Nielsen. Background- Nucleotide databases GenBank, National Center for Biotechnology Information.
Automatic Web Page Categorization by Link and Context Analysis Giuseppe Attardi Antonio Gulli Fabrizio Sebastiani.
Predicting the Semantic Orientation of Adjective Vasileios Hatzivassiloglou and Kathleen R. McKeown Presented By Yash Satsangi.
Biomedical Information Extraction. Outline Intro to biomedical information extraction PASTA [Demetriou and Gaizauskas] Biomedical named entities Name.
Information Extraction from Biomedical Text Jerry R. Hobbs Artificial Intelligence Center SRI International.
Document Centered Approach to Text Normalization Andrei Mikheev LTG University of Edinburgh SIGIR 2000.
CSE 730 Information Retrieval of Biomedical Data The use of medical lexicon in biomedical IR.
Improved Cancer Risk Assessment Using Text Mining Ilona Silins 1, Anna Korhonen 2, Johan Högberg 1, Lin Sun 2 and Ulla Stenius 1 1 Institute of Environmental.
Mining the Medical Literature Chirag Bhatt October 14 th, 2004.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
1 iProLINK: An integrated protein resource for literature mining and literature-based curation 1. Bibliography mapping - UniProt mapped citations 2. Annotation.
Mining and Summarizing Customer Reviews
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Microsoft Research Asia Yunhua Hu, Guomao Xin, Ruihua Song, Guoping.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Probabilistic Parsing Reading: Chap 14, Jurafsky & Martin This slide set was adapted from J. Martin, U. Colorado Instructor: Paul Tarau, based on Rada.
1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.
© Wiley Publishing All Rights Reserved. Protein and Specialized Sequence Databases.
RLIMS-P: A Rule-Based Literature Mining System for Protein Phosphorylation Hu ZZ 1, Yuan X 1, Torii M 2, Vijay-Shanker K 3, and Wu CH 1 1 Protein Information.
Accomplishments and Challenges in Literature Data Mining for Biology L. Hirschman et al. Presented by Jing Jiang CS491CXZ Spring, 2004.
Outline Quick review of GS Current problems with GS Our solutions Future work Discussion …
IProLINK – A Literature Mining Resource at PIR (integrated Protein Literature INformation and Knowledge ) Hu ZZ 1, Liu H 2, Vijay-Shanker K 3, Mani I 4,
Ling 570 Day 17: Named Entity Recognition Chunking.
How do we Collect Data for the Ontology? AmphibiaTree 2006 Workshop Saturday 11:30–11:45 J. Leopold.
Finding High-frequent Synonyms of a Domain- specific Verb in English Sub-language of MEDLINE Abstracts Using WordNet Chun Xiao and Dietmar Rösner Institut.
Towards Building A Database of Phosphorylate Interactions Extracting Information from the Literature M. Narayanaswamy & K. E. Ravikumar AU-KBC Center,
1 Bio-Trac 40 (Protein Bioinformatics) October 8, 2009 Zhang-Zhi Hu, M.D. Associate Professor Department of Oncology Department of Biochemistry and Molecular.
Lars Juhl Jensen Biomedical text mining. exponential growth.
Copyright OpenHelix. No use or reproduction without express written consent1.
Recognizing Names in Biomedical Texts: a Machine Learning Approach GuoDong Zhou 1,*, Jie Zhang 1,2, Jian Su 1, Dan Shen 1,2 and ChewLim Tan 2 1 Institute.
Relevance Detection Approach to Gene Annotation Aid to automatic annotation of databases Annotation flow –Extraction of molecular function of a gene from.
Playing Biology ’ s Name Game: Identifying Protein Names In Scientific Text Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and Ralf Zimmer Pac Symp.
PIRSF Classification System PIRSF: Evolutionary relationships of proteins from super- to sub-families Homeomorphic Family: Homologous proteins sharing.
1 Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures Kevin Humphreys, George.
Associating Biomedical Terms: Case Study for Acetylation Aaron Buechlein Indiana University School of Informatics Advisor: Dr. Predrag Radivojac.
BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
PREDICTION OF CATALYTIC RESIDUES IN PROTEINS USING MACHINE-LEARNING TECHNIQUES Natalia V. Petrova (Ph.D. Student, Georgetown University, Biochemistry Department),
Protein Sequence Analysis - Overview - NIH Proteomics Workshop 2007 Raja Mazumder Scientific Coordinator, PIR Research Assistant Professor, Department.
Bioinformatics MEDC601 Lecture by Brad Windle Ph# Office: Massey Cancer Center, Goodwin Labs Room 319 Web site for lecture:
Using Domain Ontologies to Improve Information Retrieval in Scientific Publications Engineering Informatics Lab at Stanford.
Central dogma: the story of life RNA DNA Protein.
Reverse Interactomics
Distribution of information in biomedical abstracts and full- text publications M. J. Schuemie et al. Dept. of Medical Informatics, Erasmus University.
Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and Discovery Program.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Using Text Mining and Natural Language Processing for.
MedKAT Medical Knowledge Analysis Tool December 2009.
Bioinformatics and Computational Biology
DNAmRNAProtein Small molecules Environment Regulatory RNA How a cell is wired The dynamics of such interactions emerge as cellular processes and functions.
Opportunities for Text Mining in Bioinformatics (CS591-CXZ Text Data Mining Seminar) Dec. 8, 2004 ChengXiang Zhai Department of Computer Science University.
4. Relationship Extraction Part 4 of Information Extraction Sunita Sarawagi 9/7/2012CS 652, Peter Lindes1.
Automatically Identifying Candidate Treatments from Existing Medical Literature Catherine Blake Information & Computer Science University.
Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae Vu, T. T.,
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and Discovery Program.
Summer Bioinformatics Workshop 2008 BLAST Chi-Cheng Lin, Ph.D., Professor Department of Computer Science Winona State University – Rochester Center
1 GAPSCORE: Finding Gene and Protein Names one Word at a Time Jeffery T. Chang 1, Hinrich Schutze 2 & Russ B. Altman 1 1 Department of Genetics, Stanford.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Chinese Named Entity Recognition using Lexicalized HMMs.
Relation Extraction (RE) via Supervised Classification See: Jurafsky & Martin SLP book, Chapter 22 Exploring Various Knowledge in Relation Extraction.
Protein databases Henrik Nielsen
Department of Genetics • Stanford University School of Medicine
Multimedia Information Retrieval
Literature Data Mining and Protein Ontology Development
Tutorial: Bioinformatics Resources
Presentation transcript:

Literature Mining and Database Annotation of Protein Phosphorylation Using a Rule-based System Z. Z. Hu 1, M. Narayanaswamy 2, K. E. Ravikumar 2, K. Vijay-Shanker 3 and C. H. Wu 1 1 Dep. of Biochemistry and Molecular Biology, Georgetown University Medical Center, USA 2 AU-KBC Research Centre, Anna University 3 Dep. of Computer and Information Sciences, University of Delaware, USA (Bioinformatics, Vol. 21, no. 11, 2005, p )

2/24 Abstract RLISM-P Rule-based LIterature Mining System for Protein Phosphorylation. Extract protein phosphorylation information from MEDLINE abstracts. Phosphorylation objects: kinases, substrates and sites. RLISM-P achieved a precision and recall of 91.4 and 96.4% for paper retrieval, and of 97.9 and 88.0% for extraction of substrates and sites.

3/24 1. Introduction (1/2) Phosphorylation is one of the most common post- translational modifications (PTMs) for proteins and is involved in numerous biological processes. Detection of the dynamic phosphorylation state of the cellular proteome is essential for understanding the regulatory network of biological pathways. There are nearly experimental features annotated in PIR – PSD database, including over 2000 corresponding to five common PTMs — phosphorylation, acetylation, glycosylation, methylation and hydroxylation.

4/24 1. Introduction (2/2) RLIMS-P utilizes shallow parsing and extracts phosphorylation information by matching text with manually developed patterns. The RLIMS-P literature mining system was benchmarked using the iProLINK annotation- tagged corpus as a benchmark standard, and the results were evaluated by PIR (Protein Information Resource) curators.

5/24 2. Systems and Methods 2.1 Phosphorylation objects 2.2 The RLIMS-P architecture 2.3 Phrase detection 2.4 Semantic type classification 2.5 Rule-based relation identification: pattern templates and argument mapping

6/ Phosphorylation Objects (1/2) Three objects Enzyme: kinase that phosphorylates proteins. Substrate: protein that is phosphorylated. Site: phosphorylated residue. The RLIMS-P system is designed to detect and extract these three types of objects from MEDLINE papers, and assign them to argument roles named, and.

7/ Phosphorylation Objects (2/2)

8/ The RLIMS-P Architecture

9/ Phrase Detection (1/2) BaseNP chunks Simple noun phrases that do not include another noun phrase. use the POS tags of words that usually appear at the boundaries. Verb group chunks phosphorylate at Active p90Rsk2 was found to be able to phosphorylate histone H3 at Ser10. Consider the active or passive form.

10/ Phrase Detection (2/2) Phrases in apposition In the yeast Saccharomyces cerevisiae, Sic1, an inhibitor of Clb-Cdc28 kinases, must be phosphorylated and degraded in G 1 for cells to initiate DNA replication,...

11/ Semantic Type Classification (1/2) Syntactic pattern: ‘ X phosphorylated Y in Z ’ ATR/FRP-1 also phosphorylated p53 in Ser Active Chk2 phosphorylated the SQ/TQ sites in Ckk2 SCD... cdk9/cyclinT2 could phosphorylate the retinoblastoma gene (pRb) in human cell lines The relation extracted will depend on what matches Y and Z. and in the first example, and in the second example and only in the third example.

12/ Semantic Type Classification (2/2) NP must be classified as to whether they are of type protein (appropriate for the role of enzyme or substrate ), amino acid residue (for ) or cells, tissues, etc. (for source). Based on the previous work, the classification uses lexical information in the form of informative words that appear as head words (e.g. ‘ mitogen activated protein kinase ’ is classified as a protein because of its head word ‘ kinase ’ ), suffixes and nearby phrases. Additional rules and heuristics are employed based on detecting acronym, appositives and conjunction of entities.

13/ Rule-based Relation Identification (1/2) Pattern templates were manually created after examining a development text corpus of 300 MEDLINE abstracts and 10 journal articles and observing the different forms used to describe phosphorylation interactions. Verbal forms Pattern 1: (in/at )? where ‘ VG ’ denotes verb group and ‘ ? ’ denotes optional argument. Pattern 2: by

14/ Rule-based Relation Identification (2/2) Nominal forms Pattern 3: [ phosphorylation] NP of Pattern 4: phosphorylation of (by )? (in/at )? Pattern 5: by/via phosphorylation at ( )?

15/24 3 Implementation Datasets were derived from data sources in iProLINK: including citation mapping and evidence tagging. Citation mapping involves finding the specific papers describing a given phosphorylation feature of a protein entry from a list of papers in the PSD Reference section. Evidence tagging involves tagging the sentences providing experimental phosphorylation evidence in the abstract an/or full-text of the papers, which may include information of, and.

16/24

17/24 THEME AGENT SITE

18/24 4 Evaluation (1/4) RLIMS-P was evaluated for IR performance in two stages, a preliminary study using a small dataset to refine the system, followed by a benchmarking study using a larger dataset. The preliminary study used 146 abstracts, consisting of 56 positive papers and 90 negative papers: 83.0% precision. Common FPs include detection of phosphorylation of non-proteins or detection of dephosphorylation. The major FN pattern was specific phosphorylated residues of a phosphoprotein, such as phosphoserine or phosphothreonine. These phospho-residue patterns were later added to the rules.

19/24 4 Evaluation (2/4) For the benchmarking study, a larger dataset with 370 abstracts was used. Then further analyze the performance on phosphorylation information extraction using the PIR evidence-tagged abstracts as the benchmark standard.

20/24 4 Evaluation (3/4) For IR: The analysis of the FPs indicates that they often involve texts that describe general consensus sequence or predicted sites of protein phosphorylation. These FPs may result from a condition used in the system that focuses on finding all potential phosphorylation site information. The system missed only four phosphorylation papers, which contained texts with some unusual patterns.

21/24 4 Evaluation (4/4) For IE: The analysis of FNs showed that the program sometimes missed multiple sites in one sentence. Other FNs include cases where correct sites were extracted but the was not identified. The RLIMS-P system had a high precision (97.9%) with only two FPs. The two FP sites occurred in the text that does not indicate phosphorylation of Ser24 and Thr25.

22/24 5 Discussion (1/3) RLIMS-P has several special features: It provides semantic type assignment to simplify pattern specification and improve precisions. It provides phrase detection for pattern matching at a high level of syntactic abstraction. It uses patterns for both verbal and nominal forms, which are common for describing PTMs. It focuses on the specific interaction of protein phosphorylation and extracts not only the proteins involved but also the target sites.

23/24 5 Discussion (2/3) The high recall of citation mapping will ensure minimal ‘ loss ’ of phosphorylation papers and result in significant time saving for annotators to find relevant phosphorylation citations from long lists of papers in given protein entries. The high precision of annotation extraction from retrieved phosphorylation papers will ensure minimal effort in manual checking to validate the annotation. A few site features detected by RLIMS-P are missed by curators.

24/24 5 Discussion (3/3) Future enhancements (1) Adding more phospho-residue rule patterns using chemical synonyms for phosphorylated amino acids, such as ‘ phosphonoserine ’. (2) Coupling the rule patterns with short sequence patterns to recognize phosphorylated residues from sequence patterns (3) Fusing information from multiple sentences, especially when and are described in separate sentences. The system can also be adapted to mine other PTMs, such as methylation and acetylation.