Integration of PRO and UniProtKB Amherst, NY May 16, 2013 Cathy H. Wu, Ph.D. PRO-PO-GO Meeting.


2 PRO Framework
PRO terms are defined and annotated using other ontologies and resources, through defined relations or mappings where appropriate.

3 Relationships Between PRO-GO-UniProtKB
- Accessioned, species-specific protein complexes in ProComp are described using protein entities in ProForm, and are cross-referenced to species-independent complex representations in GO.
- A gene product (PR: ) and its isoforms and modified forms (PR: ; PR: ) are represented in PRO as separate, uniquely accessioned entities, but are described in the same UniProtKB record (UniProtKB:Q9D6R2).
Relations:
- ProComp-ProForm: has_part
- ProComp-GO: is_a
- ProForm-UniProtKB: xref
Reference: Bult CJ, Drabkin HJ, Evsikov A, Natale D, Arighi C, Roberts N, Ruttenberg A, D'Eustachio P, Smith B, Blake JA, Wu C. (2011) The representation of protein complexes in the Protein Ontology (PRO). BMC Bioinformatics 12, 371 [PMID: ]

4 PRO ID Mapping
Mappings to various external databases
- promapping.txt: tab-delimited, each line indicating the PRO ID, the database ID, and the type of mapping (is_a or exact)
- promapping.obo: the same information as promapping.txt, but in OBO format
Mappings are of two types:
- exact: The database object is an exact match to the PRO object.
  e.g., PR: describes an isoform of 6-phosphofructokinase type C in human only, which corresponds to UniProtKB:Q
- is_a: The database object is more specific than the PRO object.
  e.g., PR: describes an (organism-nonspecific) isoform of 6-phosphofructokinase type C, so UniProtKB:Q (human) and UniProtKB:Q9WUA3-1 (mouse) are mapped to this term
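The promapping.txt layout described above can be read with a few lines of Python. This is a minimal sketch, assuming the three columns appear in the order the slide lists them (PRO ID, database ID, mapping type) and that the file has no header line. The `PR:EXAMPLE_*` and `UniProtKB:X00001-1` identifiers are hypothetical placeholders, not real accessions; only `UniProtKB:Q9WUA3-1` comes from the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class Mapping(NamedTuple):
    pro_id: str
    db_id: str
    kind: str  # "exact" or "is_a"


VALID_KINDS = {"exact", "is_a"}


def parse_promapping(lines):
    """Parse tab-delimited mapping lines into Mapping records."""
    mappings = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"expected 3 tab-separated fields: {line!r}")
        pro_id, db_id, kind = (f.strip() for f in fields)
        if kind not in VALID_KINDS:
            raise ValueError(f"unknown mapping type: {kind!r}")
        mappings.append(Mapping(pro_id, db_id, kind))
    return mappings


def group_by_pro_id(mappings):
    """Group database IDs under each PRO ID, split by mapping type."""
    grouped = defaultdict(lambda: {"exact": [], "is_a": []})
    for m in mappings:
        grouped[m.pro_id][m.kind].append(m.db_id)
    return dict(grouped)


# Hypothetical sample: a human-only term with one exact match, and an
# organism-nonspecific term that two species-specific isoforms map to.
SAMPLE = [
    "PR:EXAMPLE_HUMAN\tUniProtKB:X00001-1\texact",
    "PR:EXAMPLE_GENERIC\tUniProtKB:X00001-1\tis_a",
    "PR:EXAMPLE_GENERIC\tUniProtKB:Q9WUA3-1\tis_a",
]

grouped = group_by_pro_id(parse_promapping(SAMPLE))
```

Here `grouped["PR:EXAMPLE_GENERIC"]["is_a"]` holds both isoform identifiers, mirroring the human and mouse example on the slide.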

5 PRO Reasoning with ID Mapping
bri1/iso1/phos 5 (PR: ) has two parents:
- explicit one in formal definition (PR: )
- implicit one only shown in the reasoned version (PR: )

  [Term]
  id: PR:
  name: protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5 (Arabidopsis thaliana)
  def: "A protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5 in Arabidopsis thaliana. UniProtKB:O , Thr-872, MOD:00047|Ser-858, MOD:00046|Ser-891, MOD:00046." [PMID: , PRO:LVM]
  comment: Category=organism-modification. Flag=automatic.
  synonym: "Athal-BRI1/iso:1/Phos:5" EXACT PRO-short-label [PRO:DNx]
  synonym: "At protein brassinosteroid insensitive 1 isoform 1 phosphorylated 4" RELATED []
  is_a: PR: ! implied link automatically realized ! protein brassinosteroid insensitive 1 isoform 1 (Arabidopsis thaliana)
  is_a: PR: ! implied link automatically realized ! protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5
  intersection_of: PR: ! protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5
  intersection_of: only_in_taxon NCBITaxon:3702 ! Arabidopsis thaliana

- pro.obo: PRO version with no implied links
- pro_reasoned.obo: implied link automatically realized via is_a
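The difference between the explicit and the implied parent can be illustrated by comparing the same term as it might appear in pro.obo and in pro_reasoned.obo. The sketch below is a simplified line-based reader, not a full OBO parser: it assumes one tag per line and that trailing comments are introduced by " ! ". All `PR:EXAMPLE_*` identifiers and both stanzas are hypothetical stand-ins for the elided IDs on the slide.

```python
def parse_term(stanza):
    """Collect tag -> list of values from one OBO [Term] stanza."""
    tags = {}
    for line in stanza.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blanks and the [Term] header
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag in ("is_a", "intersection_of"):
            # Drop the trailing " ! human-readable comment".
            value = value.split(" ! ", 1)[0].strip()
        tags.setdefault(tag, []).append(value)
    return tags


def explicit_parents(tags):
    """Genus terms of the formal definition: intersection_of values that
    are a bare ID rather than 'relation ID'."""
    return [v for v in tags.get("intersection_of", []) if " " not in v]


def implied_parents(asserted, reasoned):
    """is_a parents in the reasoned stanza that the formal definition
    of the asserted stanza does not already name."""
    explicit = set(explicit_parents(asserted))
    return sorted(p for p in reasoned.get("is_a", []) if p not in explicit)


# Hypothetical stanzas; real PRO identifiers are elided in the slide.
ASSERTED = """
[Term]
id: PR:EXAMPLE_CHILD
intersection_of: PR:EXAMPLE_GENERIC ! organism-nonspecific modified form
intersection_of: only_in_taxon NCBITaxon:3702 ! Arabidopsis thaliana
"""

REASONED = ASSERTED + """is_a: PR:EXAMPLE_ISOFORM ! species-specific isoform
is_a: PR:EXAMPLE_GENERIC ! organism-nonspecific modified form
"""

asserted = parse_term(ASSERTED)
reasoned = parse_term(REASONED)
implied = implied_parents(asserted, reasoned)
```

With these inputs, `explicit_parents(asserted)` names the generic modified form and `implied` names only the species-specific isoform, which is the link that the slide says appears only in the reasoned version.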

6 Ontological Representation of UniProtKB in PRO
- PRO provides the ontological representation for UniProtKB
- Integration of UniProt records/subrecords into the PRO ontological framework
- Use UniProtKB protein records (labeled by accession numbers, isoform IDs, and potentially other stable identifiers within UniProtKB records) to represent organism-gene level and sequence level (and potentially modification-level) terms of PRO
  - Organism-Gene: canonical protein record
  - Organism-Sequence: isoform subrecord
  - Organism-Modification: chain/variant subrecord

7 Organism-Gene/Sequence

8 Ontologizing UniProtKB
- Full-scale implementation of 12 reference genomes (others as needed)
  - Organism-Gene: canonical protein record – UniProtKB:xxxxxx
  - Organism-Sequence: isoform subrecord – UniProtKB:xxxxxx-1
- Persistent URL:
  - UniProtKB URL in the ontological space, proposed as:
    - PR:xxxxxx (UniProtKB at organism-gene level)
    - PR:xxxxxx-1 (UniProtKB at organism-sequence level)
- To consider
  - Organism-Modification: chain – UniProtKB:PRO_xxxxxxxxx
  - Organism-Modification: variant – UniProtKB:VAR_xxxxxx
  - Integration/coordination between ProComp and IntAct for ontological representation of protein complexes
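The proposed identifier scheme can be expressed as a small classifier. This is a sketch under stated assumptions: the regular expressions are deliberately loose (any run of uppercase letters and digits counts as an accession) and do not implement the official UniProtKB accession grammar, and the function names `classify` and `to_pr_id` are illustrative only. Chain and variant identifiers are classified but not converted, since the slide lists them under "To consider".

```python
import re

# Loose patterns for illustration; not the official accession grammar.
_ISOFORM = re.compile(r"^[A-Z0-9]+-\d+$")
_ACCESSION = re.compile(r"^[A-Z0-9]+$")


def _local_part(uniprot_id):
    """Strip an optional 'UniProtKB:' prefix."""
    return uniprot_id.split(":", 1)[-1]


def classify(uniprot_id):
    """Return the PRO level a UniProtKB identifier would represent."""
    local = _local_part(uniprot_id)
    if local.startswith("PRO_"):
        return "organism-modification (chain)"
    if local.startswith("VAR_"):
        return "organism-modification (variant)"
    if _ISOFORM.match(local):
        return "organism-sequence"
    if _ACCESSION.match(local):
        return "organism-gene"
    raise ValueError(f"unrecognized identifier: {uniprot_id!r}")


def to_pr_id(uniprot_id):
    """Build the proposed PR: identifier for gene- and sequence-level IDs."""
    level = classify(uniprot_id)
    if level not in ("organism-gene", "organism-sequence"):
        raise ValueError(f"no PR form proposed yet for {level}")
    return "PR:" + _local_part(uniprot_id)
```

For example, `to_pr_id("UniProtKB:Q96F24-2")` yields `PR:Q96F24-2`, while a `VAR_` identifier raises an error because no PR form has been proposed for it.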

9 UniProtKB in PRO Ontological Framework: Rich Relations
[Figure: levels shown are Orthologous-Gene, Ortho-Isoform, Ortho-PTM, Organism-PTM, Ortho-Complex, Organism-Complex]

10 Issues
- Stable identifiers
  - UniProtKB would provide stable identifiers
  - ID mapping service
- Need for sequence merging and isoform curation: when a Swiss-Prot (SP) entry exists for a given gene together with corresponding unmerged TrEMBL (Tr) entries that may represent a new isoform, a new variant, or a duplicate
  - Unmerged Tr entries corresponding to additional isoforms with a sequence different from any mentioned in the SP entry
    organism-gene (SP): Q96F24
    organism-sequence (SP): Q96F24-1, Q96F24-2
    organism-sequence (Tr): B4DWS0
- Organism-gene only represented in unreviewed (Tr) section: where one or multiple Tr entries exist for a given gene
  - One entry
    organism-gene accession (Tr) = Q8VGZ9
    organism-sequence accession (Tr; implied) = Q8VGZ9-1
  - Multiple entries
    organism-gene accession ***???***
    organism-sequence accession = B9E100, Q6W3E0

11 Integrating PRO curation into UniProtKB
Isoforms curated by PRO curators will continue to be integrated into UniProtKB as a priority
- PRO isoform curation (mostly done at MGI) is based on experimental information from literature, and covers information such as UniProtKB AC, GO annotation, and comments on evidence on isoform and expression
- PIR curators integrate new isoforms and associated annotations into the SP entry
Submission of annotation for a new SP entry
- PIR curators create new reviewed SP entries when annotating protein isoforms and PTM forms with no reference SP entry
- Example: BUB3_XENLA
Other areas of PRO annotations, particularly on PTMs and complexes, could be integrated as appropriate
Reciprocal links from UniProtKB to PRO

12 New isoform curation in PRO & UniProt
PRO literature-based annotation of isoforms 4 and 5 of a mouse protein
UniProt curation:
- Merged 3 TrEMBL entries into existing UniProtKB record (Q8BIF2)
- Added isoform-specific subcellular localization information
- Updated information about function and added new information

  CC -!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm.

  CC -!- SUBCELLULAR LOCATION: Isoform 1: Nucleus.
  CC -!- SUBCELLULAR LOCATION: Isoform 4: Cytoplasm.
  CC -!- SUBCELLULAR LOCATION: Isoform 5: Nucleus.
  CC -!- TISSUE SPECIFICITY: Widely expressed in brain, regions including …
  CC -!- DEVELOPMENTAL STAGE: In the neural tube, expressed as early as
  CC     embryonic day 9.5 (E9.5) and expression is confined to the nervous …
  CC -!- INDUCTION: By retinoic acid. Expression is up-regulated in P19
  CC     cells during neural differentiation upon retinoic acid treatment …
  CC -!- PTM: Phosphorylated (Probable).
  CC -!- SIMILARITY: Contains 1 RRM (RNA recognition motif) domain.
  CC -!- CAUTION: Initial characterization was derived from usage of a
  CC     monoclonal antibody (A60) directed to an unknown protein called...

13 Integrating PRO curation into UniProtKB
Reciprocal links from UniProtKB to PRO
- UniProtKB cross-reference (DR) lines [e.g., DR GO; GO: ; P:inflammatory response; IEA:Compara]
- DR line to include PRO identifier (PURL), PRO name, and short-label
- Link to the PRO page(s) at the exact (organism-gene) level and possibly also other PTM forms (organism-modification)
Other areas of PRO annotations, particularly on PTMs and complexes, could be integrated as appropriate
- Annotation of sequence features (such as PTMs not annotated in UniProtKB) and functional annotation that applies to those features
- Barrier for direct annotation integration: curation depth needed for all aspects of annotatable information beyond PTMs
- Possible solution: link to information in PRO as additionally annotated data, similar to the UniProt approach of including additional bibliography
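To make the DR-line proposal concrete, the sketch below splits a cross-reference line into its database name and fields, and formats a PRO line carrying the three items the slide asks for (identifier, name, short-label). The exact layout of a PRO DR line is an assumption made for illustration; the slide only states which items it should include. `GO:0000000`, the protein name, and the short-label are hypothetical placeholders.

```python
def parse_dr_line(line):
    """Split a 'DR' line into (database, [fields])."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    # Remove the line code, surrounding spaces, and the final period.
    body = line[2:].strip().rstrip(".")
    parts = [p.strip() for p in body.split(";")]
    return parts[0], parts[1:]


def format_pro_dr_line(pro_id, name, short_label):
    """Format a hypothetical PRO cross-reference line.

    The field order (identifier; name; short-label) follows the items
    listed on the slide and is an assumed layout, not a released one.
    """
    return f"DR   PRO; {pro_id}; {name}; {short_label}."


# Placeholder GO identifier; the real one is elided in the slide.
go_db, go_fields = parse_dr_line(
    "DR   GO; GO:0000000; P:inflammatory response; IEA:Compara."
)

# Hypothetical PRO line, parsed back to check the round trip.
pro_line = format_pro_dr_line("PR:Q96F24", "example protein", "hEXAMPLE")
pro_db, pro_fields = parse_dr_line(pro_line)
```

Splitting on ";" is enough for this sketch, but it would break on a protein name that itself contains a semicolon, so a real line format would need an escaping rule or a fixed field count.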