Literature Data Mining and Protein Ontology Development at the Protein Information Resource (PIR)

Hu ZZ 1, Mani I 2, Liu H 3, Vijay-Shanker K 4, Hermoso V 1, Nikolskaya A 1, Natale DA 1, and Wu CH 1

1 Protein Information Resource, Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20057
2 Georgetown University, 37th and O Streets, NW, Washington, DC 20057
3 University of Maryland at Baltimore County, Baltimore, MD 21250
4 Department of Computer and Information Sciences, University of Delaware, Newark, DE

PIRSF-Based Protein Ontology
– PIRSF family hierarchy based on evolutionary relationships
– Standardized PIRSF family names as hierarchical protein ontology
– DAG: network structure for the PIRSF family classification system
Figure: PIRSF in DAG View

ABSTRACT

An integrated protein literature mining resource, iProLINK, has been developed at PIR to provide data sources for Natural Language Processing (NLP) research on bibliography mapping, annotation extraction, protein named-entity recognition, and protein ontology development. A rule-based text-mining system, RLIMS-P, is used to extract protein phosphorylation information from MEDLINE abstracts to assist database annotation; an online BioThesaurus is developed for protein/gene name mapping and to assist with protein named-entity recognition; and a protein ontology based on the PIRSF family classification is developed to complement other ontologies.

As the volume of scientific literature grows rapidly, literature data mining becomes increasingly critical for genome/proteome annotation and for improving the quality of biological databases. Annotations derived from experimentally verified data in the literature are of special value to the UniProtKB (UniProt Knowledgebase). One objective of UniProtKB is accurate, consistent, and rich annotation of protein sequence and function. Relevant to this goal are literature-based curation and the development and adoption of ontologies and controlled vocabularies.
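To make the idea of rule-based extraction concrete, here is a minimal sketch built around the poster's example sentence, "ATR/FRP-1 also phosphorylated p53 in Ser 15". It is not the RLIMS-P implementation: RLIMS-P works on parsed noun and verb groups, whereas this sketch substitutes a single regular expression. The reading of the rule as "agent, phosphorylation verb, theme, optional in/at site", and the names `PhosphoHit`, `PATTERN`, and `extract`, are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional


class PhosphoHit(NamedTuple):
    """One extracted phosphorylation statement (hypothetical record type)."""
    agent: str            # the kinase
    theme: str            # the phosphorylated protein
    site: Optional[str]   # residue and position, if the sentence gives one


# Assumed reading of the rule: AGENT <phosphorylate verb> THEME (in/at SITE)?
# A regex stands in for the shallow parsing a real system would use.
PATTERN = re.compile(
    r"(?P<agent>[\w/-]+)\s+"               # kinase name, e.g. ATR/FRP-1
    r"(?:also\s+)?phosphorylate[sd]?\s+"   # active-voice verb, optional adverb
    r"(?P<theme>[\w/-]+)"                  # substrate name, e.g. p53
    r"(?:\s+(?:in|at)\s+(?P<site>(?:Ser|Thr|Tyr)[\s-]?\d+))?",  # optional site
    re.IGNORECASE,
)


def extract(sentence: str) -> Optional[PhosphoHit]:
    """Return the first agent/theme/site triple found, or None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return PhosphoHit(match.group("agent"), match.group("theme"), match.group("site"))


if __name__ == "__main__":
    print(extract("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
```

A single pattern like this only covers active-voice sentences with the site stated last; passive constructions and nominal forms ("phosphorylation of X by Y") would each need their own rule.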
INTRODUCTION

Literature-Based Curation – Extract Reliable Information from Literature
Protein properties: protein function, domains and sites, developmental stages, catalytic activity, binding and modified residues, regulation, induction, pathways, tissue specificity, subcellular location, quaternary structure…
This will ensure high-quality, accurate, and up-to-date experimental data for each protein. But it is a major bottleneck!

Ontologies/Controlled Vocabularies – For Information Integration and Knowledge Management
UniProtKB entries will be annotated using widely accepted biological ontologies and other controlled vocabularies, e.g. Gene Ontology (GO) and EC nomenclature.

The Protein Information Resource has been collaborating with several NLP research groups to develop text-mining methodologies for extracting information from biological literature and to develop a protein ontology.

PIR – Integrated Protein Informatics Resource for Genomic/Proteomic Research
UniProt – Central international database of protein sequence and function

RLIMS-P: Rule-based LIterature Mining System for Protein Phosphorylation

Benchmarking of RLIMS-P
Bioinformatics Jun 1;21(11)
– High recall for paper retrieval and high precision for information extraction
– UniProtKB site feature annotation
– Proteomics MS data analysis: protein identification

Figure: RLIMS-P processing stages
Input: Abstracts, Full-Length Texts
1. Preprocessing: sentence extraction, part-of-speech tagging
2. Entity Recognition: acronym detection, term recognition
3. Phrase Detection: noun and verb group detection, other syntactic structure detection
4. Semantic Type Classification
5. Relation Identification: nominal level relation, verbal level relation
6. Post-Processing
Output: Extracted Annotations, Tagged Abstracts

Pattern 1: (in/at )?
Example: ATR/FRP-1 also phosphorylated p53 in Ser 15

RLIMS-P Protein Phosphorylation Annotation Extraction
– Manual tagging assisted with computational extraction
– Training sets of positive and negative samples

Protein Name Tagging
Tagging guideline versions 1.0 and 2.0
– Generation of domain expert-tagged corpora
– Inter-coder reliability: upper bound of machine tagging
Dictionary pre-tagging
– F-measure: (0.372 Precision, Recall)
– Advantages: helps with standardization and extent of tagging, reduces the fatigue problem, and improves inter-coder reliability
BioThesaurus for pre-tagging

BioThesaurus
Figure: BioThesaurus report, UniProtKB entry P35625
Figure: Example 2. Name ambiguity of CLIM1

Figure: BioThesaurus construction
Sources (iProClass):
  NCBI: Entrez Gene, RefSeq, GenPept
  UniProt: UniProtKB, UniRef90/50, PIR-PSD
  Genome: FlyBase, WormBase, MGD, SGD, RGD
  Other: HUGO, EC, OMIM
Steps: Name Extraction; Raw Thesaurus; Name Filtering (Highly Ambiguous, Nonsensical Terms); Semantic Typing (UMLS)
Result: BioThesaurus, linked to UniProtKB Entries: Protein/Gene Names & Synonyms

Applications:
– Biological entity tagging
– Name mapping
– Database annotation
– Literature mining
– Gateway to other resources

BioThesaurus v1.0 (May, 2005; m = million)
# UniProtKB entry           1.86m
# Source DB record          6.6m
# Gene/protein name/terms   3.6m

PIRSF to GO Mapping
– Superimpose GO and PIRSF hierarchies
– Bidirectional display (GO- or PIRSF-centric views)
– Complements GO: the PIRSF-based ontology can be used to analyze GO branches and concepts and to provide links between the GO sub-ontologies
– Mapped 5363 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
  – 68% of the PIRSF families and subfamilies map to GO leaf nodes
  – 2329 PIRSFs have shared GO leaf nodes
– DynGO viewer
– Two cases: analyze GO branches and concepts and identify missing GO nodes
  Case I. Nuclear receptor superfamily
  Case II. IGF-binding protein superfamily

iProLINK: An integrated protein resource for literature mining
1. Bibliography mapping - UniProt mapped citations
2. Annotation extraction - annotation tagged literature
3. Protein entity recognition - dictionary, tagged literature
4. Protein ontology development - PIRSF-based ontology
– Testing and Benchmarking Dataset
– RLIMS-P text mining tool
– Protein dictionaries
– Name tagging guideline
– Protein ontology

Protein Ontology Can Complement GO
Expanding a Node: identification of GO subtrees that need expansion if GO concepts are too broad
– IGFBP subfamilies
– High- vs. low-affinity binding for IGF between IGFBP and IGFBPrP

Exploration of Gene and Protein Ontology
Figure: GO-centric view and PIRSF-centric view (Molecular function, Biological process; Estrogen receptor alpha, PIRSF50001)
Systematic links between three GO sub-ontologies based on the shared annotations at different protein family levels, e.g., linking molecular function and biological process:
– estrogen receptor binding and
– estrogen receptor signaling pathway

Acknowledgements
Research Projects
– NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)
– NSF: SEIII (Entity Tagging)
– NSF: ITR (Ontology)
Collaborators
– I. Mani from Georgetown University Department of Linguistics on protein name recognition and protein name ontology.
– H. Liu from University of Maryland Department of Information System on protein name recognition and text mining.
– Vijay K. Shanker from University of Delaware Department of Computer and Information Science on text mining of protein phosphorylation features.

Summary
– The PIR iProLINK literature mining resource provides annotated data sets for NLP research on annotation extraction and protein ontology development.
– RLIMS-P is a text-mining tool for protein phosphorylation from PubMed literature. Coupling high recall for paper retrieval with high precision for information extraction, RLIMS-P can be applied to UniProtKB protein feature annotation.
– BioThesaurus can be used to resolve name synonymy and ambiguity and for name mapping.
– The PIRSF-based protein ontology can complement GO by identifying missing GO concepts/nodes, and it provides systematic links between the three GO sub-ontologies.

PIRSF Protein Family Classification
PIRSF: A network structure from superfamilies to subfamilies to reflect evolutionary relationships of full-length proteins
Definitions
– Basic unit = Homeomorphic Family
– Homeomorphic: full-length similarity, common domain architecture
– Network Structure: flexible number of levels with varying degrees of sequence conservation

Web-based BioThesaurus
Figure: Example 1. Name ambiguity of TIMP3
Gene/Protein Name Mapping
1. Search Synonyms
2. Resolve Name Ambiguity
3. Underlying ID Mapping

Online RLIMS-P text-mining tool (version 1.0)
prolink/rlimsp/
1. Search interface
2. Summary table with top hit of all sites
3. All sites and tagged text evidence

DAG file: ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/dagfiles/
Liu et al, 2005, submitted
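The network (DAG) structure of the PIRSF classification, and the notion of a leaf node used in the PIRSF to GO mapping, can be pictured with a small sketch. The format of the DAG files is not described in this transcript, so the edges below are hand-written and entirely hypothetical, as are the node names and the helper functions `build`, `ancestors`, and `leaves`. The sketch assumes only that a family may sit under more than one parent, which is what distinguishes a DAG from a tree.

```python
from collections import defaultdict

# Hypothetical (parent, child) edges; not taken from any PIRSF release.
EDGES = [
    ("superfamily_X", "family_A"),
    ("superfamily_X", "family_B"),
    ("superfamily_Y", "family_B"),   # second parent: a DAG, not a tree
    ("family_A", "subfamily_A1"),
    ("family_A", "subfamily_A2"),
]


def build(edges):
    """Index the edge list as child and parent adjacency maps."""
    children, parents = defaultdict(set), defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
        parents[child].add(parent)
    return children, parents


def ancestors(node, parents):
    """Collect every node reachable by following parent links upward."""
    seen, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def leaves(edges):
    """Return the nodes that have no children."""
    children, parents = build(edges)
    nodes = set(children) | set(parents)
    return {n for n in nodes if not children.get(n)}


if __name__ == "__main__":
    _, parents = build(EDGES)
    print(sorted(ancestors("family_B", parents)))
    print(sorted(leaves(EDGES)))
```

Storing both directions of each edge keeps upward queries (which superfamilies contain this family?) and downward queries (which nodes are leaves?) equally cheap, which is convenient for a bidirectional, GO-centric or PIRSF-centric display.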