PRO and IntAct protein complexes Sandra Orchard PRO Meeting, June 19, 2014.


Project aims

IntAct:
- Reference resource for macromolecular complexes
- Create species-specific stable complex identifiers
- Central reference resource to link all related efforts (UniProt for protein complexes)
- Dedicated online Portal to search and visualise (text and graphics), can also export to Cytoscape
- Emphasis on major model organisms
- Stored in relational database (IntAct) with existing update mechanisms
- Download format – PSI-MI XML, can write user-defined format from database

PRO:
- Reference ontology for protein complexes
- PRO terms can span across species or be species-specific
- Stable ontology term identifier
- Searched and viewed (text-only) in existing PRO website – can export to Cytoscape
- Emphasis on major model organisms
- Stored in an internal database with existing update mechanism
- Download – ontology (OBO, OWL), annotation file

Complex definition

IntAct:
- A stable set (2 or more, to include homodimers) of interacting protein molecules which
  - can be co-purified and
  - have been shown to exist as a functional unit in vivo.
- Non-protein molecules (e.g. small molecules, nucleic acids) may also be present in the complex.
- Does not include
  - Molecules associated in a pulldown / coimmunoprecipitation but with no functional link
  - Enzyme/substrate, receptor/ligand or similar transient interactions (except when required for stable complex formation)

PRO:
- Protein complexes, including homo complexes (e.g. homodimers)
- Complexes may include non-protein components

Data capture

IntAct:
- Participants – proteins (UniProt), small molecules (ChEBI), nucleic acids (ChEBI, (RNACentral))
- Participant features - binding domains, required PTMs
- Species
- Stoichiometry – when known
- Topology (linked binding domains) – when known
- Function – free text
- Assembly, e.g. homodimer, heterotetramer…
- Physical properties, e.g. MW, size, topology/assembly
- Ligands
- Disease

PRO:
- Participants – proteins (PR identifiers), small molecules (ChEBI), nucleic acids (?)
- PTMs - implicit in PR term
- Species
- Cardinality – indicates stoichiometry
- Definition – free text “composed of x number of subunits of various components”
- Disease and functional properties are added as an annotation in PAF if known
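The captured fields can be pictured as a small record type. The following is only a minimal sketch, not the IntAct or PRO schema: the class names, field names, and the `assembly()` helper are hypothetical, and the identifiers used are placeholders rather than real accessions. It assumes the assembly label (e.g. homodimer, heterotetramer) can be derived from protein stoichiometry when that is known.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Participant:
    """One member of a complex (hypothetical structure, for illustration)."""
    identifier: str                      # placeholder for e.g. a UniProt, PR or ChEBI ID
    kind: str                            # "protein", "small molecule" or "nucleic acid"
    stoichiometry: Optional[int] = None  # None when not known


@dataclass
class ComplexRecord:
    """A complex entry holding a subset of the fields listed on the slide."""
    name: str
    species: str
    participants: List[Participant] = field(default_factory=list)
    function: str = ""                   # free text

    def assembly(self) -> Optional[str]:
        """Derive a label such as 'homodimer' from protein stoichiometry.

        Returns None when any protein stoichiometry is unknown, or when
        fewer than two protein molecules are present.
        """
        proteins = [p for p in self.participants if p.kind == "protein"]
        if not proteins or any(p.stoichiometry is None for p in proteins):
            return None
        total = sum(p.stoichiometry for p in proteins)
        if total < 2:
            return None
        distinct = {p.identifier for p in proteins}
        prefix = "homo" if len(distinct) == 1 else "hetero"
        suffix = {2: "dimer", 3: "trimer", 4: "tetramer"}.get(total, f"{total}-mer")
        return prefix + suffix


# Placeholder identifiers, not real accessions.
example = ComplexRecord(
    name="example complex",
    species="example species",
    participants=[Participant("PROT_A", "protein", 2)],
)
print(example.assembly())
```

Non-protein participants are stored in the same list but ignored when the assembly label is derived, which mirrors the slide's statement that non-protein molecules may be present in a protein complex.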

Data capture - nomenclature

IntAct:
- Recommended name: most recognisable name from literature, use GO component if specific complex exists in GO
- Systematic name: based on Reactome’s new CV names – ‘string of (species-specific) gene names with stoichiometry’
- Synonyms: all other names the complex may be known as

PRO:
- Name: most recognisable name from literature, use GO component if specific complex exists in GO
- Systematic name: based on Reactome’s new CV names (stoichiometry not incorporated)
- Synonyms: all other names the complex may be known as
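The difference between the two systematic-name conventions can be illustrated with a short sketch. The slide does not give the exact string syntax, so the `GENE(n)` notation, the `:` separator, and the function name below are assumptions made purely for illustration; the gene names are placeholders.

```python
from typing import List, Optional, Tuple


def systematic_name(
    subunits: List[Tuple[str, Optional[int]]],
    include_stoichiometry: bool = True,
) -> str:
    """Join gene names into one string, optionally annotated with stoichiometry.

    `subunits` is a list of (gene_name, stoichiometry) pairs, where the
    stoichiometry is None when unknown. The output syntax is hypothetical.
    """
    parts = []
    for gene, count in subunits:
        if include_stoichiometry and count is not None:
            parts.append(f"{gene}({count})")
        else:
            parts.append(gene)
    return ":".join(parts)


subunits = [("GENEA", 2), ("GENEB", 1)]
print(systematic_name(subunits))                               # with stoichiometry
print(systematic_name(subunits, include_stoichiometry=False))  # gene names only
```

With the flag set, the name carries stoichiometry as described for IntAct; with it cleared, the name matches the PRO convention of leaving stoichiometry out.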

Data capture - xrefs

IntAct:
- GO (BP, MF, CC) – manually curated to complex, not just imported from proteins
- Cross references to experimental evidence: IMEx (+ non-IMEx IntAct, MINT & DIP, MatrixDB)
- Reactome (human)
- PDB, EMDB
- ChEMBL
- PubMed (for further information)
- IntEnz (enzyme EC numbers)
- OMIM/EFO (disease)
- TaxID

PRO:
- GO – used as parent term
- Reactome (human)
- PubMed
- TaxID

Data capture - evidence

IntAct (ECO codes):
- ECO: (physical interaction evidence used in manual assertion) - full experimental evidence for the complex added to the entry
- ECO: (sequence orthology evidence used in manual assertion) + inferred from “complex ID” – across species
- ECO: (sequence similarity evidence used in manual assertion) + inferred from “complex ID” – within species
- ECO: (inference from background scientific knowledge used in manual assertion) - modelled

PRO (ECO codes):
- EXP experimentally verified → ECO: (experimental evidence used in manual assertion)
- ECO: (biological system reconstruction) - modelled

Linked binding domains
PTMs annotated using MOD

Protein Complex Statistics

Species / IntAct / PRO
- Human
- Mouse 17393
- Rat 490
- Cow 30
- Drosophila melanogaster 120
- C. elegans (2)0
- Xenopus laevis 30
- Arabidopsis thaliana 08
- Saccharomyces cerevisiae
- S. pombe 16
- E. coli 870
- Total (+215 protein agnostic parent terms)

IntAct - Parallel annotation of complexes in GO
- Project start: > 400 complex terms in GO Cellular Component branch, mostly children of GO: protein complex – lacks hierarchical structure
- Collaboration agreed with GO to provide more structured annotation whilst also adding new terms
- Parent terms mainly based on complex function, e.g. enzyme complexes, transcription factor complexes
  - TermGenie (TG) Standard Form
  - Otherwise use TG Free Form
  - Some complexes still direct children of GO: protein complex
- Adding “logical definitions” / “cross-products” / “extensions”, e.g. “capable of x activity”

IntAct data sources/Curation priorities
- PDBe – almost 1000 complexes imported, more planned. Experimental data can be imported at the same time. (N.B. many of these have proven to be partial/sub-complexes, so will not directly translate into 1000 finished products. Also many are from non-model organisms.) Curation ongoing. PDB collaborating and may add curation effort
- ChEMBL – 81 drug-target complexes imported – curation complete, more to come with each release (mostly human/mouse/rat)
- MatrixDB (Sylvie Ricard-Blum, Univ. of Lyon) – list of extracellular complexes – curation complete (human/mouse)
- Reactome – mapping into PSI-MI XML → direct import into IntAct ongoing, issue with sets has now been resolved (human)
- Mining UniProt (Bernd Roechert, SIB – manually) – curation ongoing (yeast)
- Manual curation from IMEx DBs & the literature
- SGD yeast complex list – SGD contributing curation effort
- EcoCyc – complex list has been dumped into Excel sheet, useful as ‘to do’ list but not suitable for import – curation ongoing (E. coli)

PRO data sources/Curation priorities
- Toll-like receptor pathway. Curation of both human and mouse (Anna Maria Masci at Duke and Veronica Shamovsky/Peter D’Eustachio from Reactome)
- Complexes for the Brassinosteroid signaling pathway in Arabidopsis (Mengxi Lv and Cecilia Arighi at University of Delaware)
- Complexes in TGF-beta signaling pathway (Cecilia, human complexes aligned with Reactome data)
- Complexes in cell cycle spindle checkpoint for human and yeast (Karen Ross, University of Delaware)
- Beta catenin related complexes (Irem Celen, University of Delaware)

What else has IntAct to offer?
1. Web-based editorial tool
- Institution/curator management system enables attribution of effort to institute
- APIs to UniProt, ChEBI (RNACentral when available) allow immediate import of interactors plus selected xrefs
- OLS enables enrichment of CV terms, e.g. GO names when AC no. used for import
- Pull-down menus restrict CV usage to appropriate fields
- Intelligent ‘syntax checker’ limits curator error

What else has IntAct to offer?
2. JIRA issue tracker
- enables tracking of complexes requiring QC by 2nd curator
- used to request addition of new complex GO terms or hierarchy re-organization, which is then undertaken via TermGenie
- could additionally be used to request IntAct curation of experimental evidence papers not already in database(s)

What else has IntAct to offer?
3. Automated update process
- Protein update system. Tracks changes to underlying sequence with every release of UniProt and remaps features (binding domains, PTMs) accordingly. Withdrawn proteins (TrEMBL) remapped.
- CV update system.
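The idea of remapping a feature after a sequence change can be sketched as follows. This is not the IntAct protein update system, whose logic is not described on the slide; it is a simplified illustration that assumes a feature is a 1-based inclusive range and that it can be relocated by finding its exact residues in the new sequence. The function name and the toy sequences are hypothetical.

```python
from typing import Optional, Tuple


def remap_feature(
    old_seq: str, new_seq: str, start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Relocate a feature range (1-based, inclusive) onto an updated sequence.

    Returns the new (start, end), or None when the feature residues are
    missing from the new sequence or occur more than once, in which case
    a curator would need to review the feature by hand.
    """
    if not (1 <= start <= end <= len(old_seq)):
        raise ValueError("feature range lies outside the old sequence")
    fragment = old_seq[start - 1:end]

    # Unchanged case: the same residues still sit at the same position.
    if new_seq[start - 1:end] == fragment:
        return (start, end)

    first = new_seq.find(fragment)
    if first == -1:
        return None  # residues no longer present
    if new_seq.find(fragment, first + 1) != -1:
        return None  # ambiguous: more than one possible location
    return (first + 1, first + len(fragment))


# Toy sequences: two residues were added at the N-terminus of the new version.
old = "MKTAYIAKQR"
new = "GSMKTAYIAKQR"
print(remap_feature(old, new, 4, 7))
```

Returning None for the missing and ambiguous cases keeps the sketch conservative: only features with a single exact match are moved automatically.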

Proposal for joint curation
1. IntAct/PRO to align curation rules – discussions ongoing
2. IntAct to import PRO complexes & update all existing to joint rule set
3. IntAct to produce script to write complexes to flat file format
4. PRO curators to train on IntAct editor – all new complexes curated in IntAct
5. IntAct responsible for long-term data maintenance

Proposal for joint curation
6. IntAct to write flat files for new/updated complexes with every release
7. PRO to map UniProt + MOD → PR IDs
8. PRO to create ontology, including addition of parent ‘species-agnostic’ terms (IntAct will have “super-complex (Reactome ‘set’ equivalent)/complex/sub-complex” relationship – OK for PRO?)