
1 Biomedical Ontologies
Bio-Trac 40 (Protein Bioinformatics), October 8, 2009
Zhang-Zhi Hu, M.D., Associate Professor, Department of Oncology; Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center

2 Overview
What is ontology?
– What is biomedical ontology?
What is gene ontology?
– How is it generated?
– How is it used for annotation?
What is protein ontology?
– Why is it necessary?
– How to use it?
……

3 Tree of Porphyry with Aristotle’s Categories (Aristotle, 384 BC – 322 BC)

4 Ontology
In philosophy, ontology seeks to describe the basic categories and relationships of being or existence, in order to define entities and types of entities within its framework:
– What do you know? How do you know it?
– What is existence? What is a physical object?
– What constitutes the identity of an object? ……
The central goal is a definitive and exhaustive classification of all entities.
onto-, of being or existence; -logy, study. Greek origin; Latin ontologia, 1606.
“The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality” – Barry Smith, U Buffalo

5 In computer and information science
An ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain.
Most ontologies describe individuals (instances), classes (concepts), attributes, and relations:
– Individuals (instances): your Ford, my Ford, his Ford…
– Classes (concepts)
– Attributes: e.g. color, engine, door…
– Relations: e.g. is_a
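The data model on this slide can be sketched in a few lines of Python. This is only an illustration: the class names "car" and "vehicle" and the placement of the attributes on "car" are assumptions made for the example; the slide itself only names the Ford instances and the attributes color, engine, and door.

```python
# Hypothetical toy ontology: is_a links between classes, attributes per class,
# and individuals (instances) assigned to a class.
IS_A = {"Ford": "car", "car": "vehicle"}          # class -> parent class
ATTRIBUTES = {"car": ["color", "engine", "door"]}  # class -> attributes
INSTANCES = {"your Ford": "Ford", "my Ford": "Ford", "his Ford": "Ford"}


def classes_of(instance):
    """Return the instance's class followed by all of its is_a ancestors."""
    chain = []
    cls = INSTANCES[instance]
    while cls is not None:
        chain.append(cls)
        cls = IS_A.get(cls)  # None once we pass the top class
    return chain


def inherited_attributes(instance):
    """Collect attributes from every class the instance belongs to."""
    attrs = []
    for cls in classes_of(instance):
        attrs.extend(ATTRIBUTES.get(cls, []))
    return attrs


print(classes_of("my Ford"))
print(inherited_attributes("his Ford"))
```

Reasoning "about the objects within that domain" here amounts to following is_a links: an individual is treated as a member of every ancestor class and inherits the attributes declared there.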

6 What are ontologies useful for?
Ontology is a form of knowledge representation about the world or some part of it.
Terminology management
Integration, interoperability, and sharing of data
– promote precise communication between scientists
– enable information retrieval across multiple resources
Knowledge reuse and decision support
– extend the power of computational approaches to perform data exploration, inference, and mining
Biomedical Terminology vs. Biomedical Ontology: UMLS (Unified Medical Language System), MeSH (Medical Subject Headings), NCI Thesaurus, SNOMED / SNODENT, Medical WordNet

7 Ontology Enables Large-Scale Biomedical Science
Ontology is at the center of two major activities in current biomedical research:
Structured representation of biomedicine:
– different types of entities and relations to describe biomedicine (ontology content curation)
Annotation: using ontologies to summarize and describe biomedical experimental results, to enable:
– integration of their data with other researchers’ results
– cross-species analyses

8 Gene Ontology (GO): what makes it so wildly successful?

9 GO Consortium
The Gene Ontology was originally constructed in 1998 by a consortium of researchers studying the genomes of three model organisms:
– Drosophila melanogaster (fruit fly) (FlyBase)
– Mus musculus (mouse) (MGD)
– Saccharomyces cerevisiae (yeast) (SGD)
Many other model organism databases have joined the GO consortium, contributing:
– development of the ontologies
– annotations for the genes of one or more organisms
http://www.geneontology.org/

10 What is Gene Ontology?
GO provides a controlled vocabulary to describe gene and gene product attributes in any organism: how gene products behave in a cellular context.
Three key concepts [27399 GO terms in total, May 2009]:
Biological process: a series of events accomplished by one or more ordered assemblies of molecular functions, e.g. signal transduction, pyrimidine metabolism, or alpha-glucoside transport. [total: 16468]
Molecular function: activities, such as catalytic or binding activities, that occur at the molecular level and that can be performed by individual gene products or by assembled complexes of gene products, e.g. catalytic activity, transporter activity. [total: 8585]
Cellular component: a component of a cell that is part of some larger object, which may be an anatomical structure (e.g. ER or nucleus) or a gene product group (e.g. ribosome, or a protein dimer). [total: 2346]
Need for annotation of genome sequences
GO annotation
– Characterization of gene products using GO terms
– Members submit their data, which are available at the GO website.

11 GO Representation: Tree or Network?
Node: a concept or a term. The diagram labels a root and a leaf node.
C has two parents, A and B.
GO is a network structure.
Relations: is_a or part_of
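The tree-versus-network point can be made concrete with a small sketch. The node names root, A, B, and C come from the slide; which relation (is_a or part_of) sits on which edge is an assumption made here for illustration.

```python
# Each term maps to a list of (relation, parent) links.
# The relation chosen for each edge is hypothetical.
PARENTS = {
    "root": [],
    "A": [("is_a", "root")],
    "B": [("is_a", "root")],
    "C": [("is_a", "A"), ("part_of", "B")],  # two parents: not a tree
}


def ancestors(term):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in PARENTS[current]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def is_tree(parents):
    """A hierarchy is a tree only if no node has more than one parent."""
    return all(len(links) <= 1 for links in parents.values())


print(sorted(ancestors("C")))
print(is_tree(PARENTS))
```

Because C has two parents, `is_tree` returns False, and any annotation made to C counts toward both A and B when terms are rolled up toward the root.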

12 http://www.geneontology.org/

13 GO search and display tool
GO term (GO:0006366): mRNA transcription from RNA polymerase II promoter (leaf node)

14 Human p53 – GO annotation (UniProtKB:P04637)
GO:0006289:nucleotide-excision repair [PMID:7663514; evidence:IMP]
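A GO annotation ties a gene product to a GO term, a literature reference, and an evidence code (IMP stands for "inferred from mutant phenotype"). As a rough sketch, the line shown on this slide can be split into those fields. The pattern below only targets the display format used on the slide; it is not the official GO annotation file format, and the `Annotation` type is a hypothetical helper.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    go_id: str
    term: str
    pmid: str
    evidence: str


# Matches lines shaped like the slide's example only (an assumption).
PATTERN = re.compile(r"(GO:\d{7}):(.+?)\s*\[PMID:(\d+);\s*evidence:(\w+)\]")


def parse_annotation(line):
    """Split one slide-style annotation line into its four fields."""
    match = PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized annotation line: {line!r}")
    return Annotation(*match.groups())


p53 = parse_annotation(
    "GO:0006289:nucleotide-excision repair [PMID:7663514; evidence:IMP]"
)
print(p53.go_id, p53.term, p53.pmid, p53.evidence)
```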

15 GO annotation of gene products
Scientific basis of the GO: trained experts use experimental observations from the literature to associate GO terms with gene products (to annotate the entities represented in the gene/protein databases).
This enables data integration across databases and makes the data available to semantic search.
Human, mouse, plant, worm, yeast …
http://www.geneontology.org/GO.current.annotations.shtml

16 What GO is NOT……
An ontology of gene products: e.g. cytochrome c is not in GO, but attributes of cytochrome c are, e.g. oxidoreductase activity.
Processes, functions and components unique to mutants or diseases: e.g. oncogenesis is not a valid GO term.
Protein domains or structural features.
Protein-protein interactions.
Environment, evolution and expression.
Anatomical or histological features above the level of cellular components, including cell types.
Nor is GO an ontology of genes: the name is a misnomer.

17 Missing GO nodes… not deep enough… not broad enough…

18 Lack of connections among GO terms (example: estrogen receptor)

19 GO: A Common Standard for Omics Data Analysis
What cellular component? What molecular function? What biological process?

20 Need more…
Need to improve the quality of GO to support more rigorous logic-based reasoning across the data annotated in its terms
Need to extend the GO by engaging ever broader community support for addition of new terms and for correction of errors
Need to extend the methodology to other domains, including clinical domains, such as:
– disease ontology
– immunology ontology
– symptom (phenotype) ontology
– clinical trial ontology
– ...

21 National Center for Biomedical Ontology (NCBO)
Establish common rules governing best practices for creating ontologies and for using these in annotations
Apply these rules to create a complete suite of orthogonal interoperable biomedical reference ontologies
http://www.obofoundry.org/
http://bioontology.org/

22 http://www.obofoundry.org/index.cgi?sort=domain&show=ontologies

23 The OBO Foundry
A family of interoperable gold standard biomedical reference ontologies to serve annotation of:
– scientific literature
– model organism databases
– clinical trial data …
OBO Foundry = a subset of OBO ontologies whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development, designed to ensure:
– tight connection to the biomedical basic sciences
– compatibility, interoperability, common relations
– support for logic-based reasoning
OBO Foundry Principles: http://www.obofoundry.org/crit.shtml

24 Rationale of OBO Foundry coverage
Columns (RELATION TO TIME): CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT. Rows: GRANULARITY.
ORGAN AND ORGANISM
– Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
– Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
– Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
– Independent continuant: Cell (CL); Cellular Component (FMA, GO)
– Dependent continuant: Cellular Function (GO)
MOLECULE
– Independent continuant: Molecule (ChEBI, SO, RNAO, PRO)
– Dependent continuant: Molecular Function (GO)
– Occurrent: Molecular Process (GO)

25 OBO Relation Ontology
Foundational: is_a, part_of
Spatial: located_in, contained_in, adjacent_to
Temporal: transformation_of, derives_from, preceded_by
Participation: has_participant, has_agent
e.g.: A is_a B =def. every instance of A is an instance of B
“rose is_a plant” means that every instance of rose is an instance of plant.
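The instance-level definition of is_a can be checked mechanically when the instances of each class are enumerated: A is_a B holds exactly when the instances of A form a subset of the instances of B. The sketch below uses made-up instance names (rose_1, fern_1, and so on) purely for illustration.

```python
# Hypothetical instance sets for three classes.
INSTANCES = {
    "rose": {"rose_1", "rose_2"},
    "plant": {"rose_1", "rose_2", "fern_1"},
    "organism": {"rose_1", "rose_2", "fern_1", "mouse_1"},
}


def is_a(a, b):
    """A is_a B =def. every instance of A is an instance of B."""
    return INSTANCES[a] <= INSTANCES[b]  # subset test


print(is_a("rose", "plant"))     # every rose is a plant
print(is_a("plant", "rose"))     # fern_1 is a plant but not a rose
print(is_a("rose", "organism"))  # follows from rose is_a plant is_a organism
```

Defined this way, is_a is automatically transitive, since the subset relation is transitive. That property is what lets annotations made to a specific class be inferred for its ancestors.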

26 What is Protein Ontology? Why?
PRO http://pir.georgetown.edu/pro/

27 PRotein Ontology (PRO)
PRO in OBO Foundry to represent protein entities
Ontology for Protein Evolution (ProEvo), for evolutionary classes of proteins: captures the protein classes reflecting evolutionary relationships at the full-length protein level. It is meant to indicate the relatedness of proteins, not their evolution. Uses the is_a relationship.
Ontology for Protein Forms (ProForm), for multiple protein forms of a gene: captures the different protein forms of a specific gene arising from genetic variation, alternative splicing, cleavage, and post-translational modification. Relations used: is_a or derives_from.

28 Why PRO?
– Allows specification of relationships between PRO and other ontologies, such as GO and Disease Ontology
– Provides a structure to support formal, computer-based inferences based on shared attributes among homologous proteins
– Provides a stable unique identifier to any protein type
– Provides formalization and precise annotation of specific protein forms/classes, allowing accurate and consistent data mapping, integration and analysis
PRO:000000650 Smad2 isoform 1 phosphorylated 1 (phosphorylated SSxS motif): NOT has_function GO:0003677 DNA binding
PRO:000000656 Smad2 isoform 2 phosphorylated 1 (phosphorylated SSxS motif): has_function GO:0003677 DNA binding
Example use, for a dendritic cell ontology or pathway (http://www.biomedcentral.com/1471-2105/10/70):
[Term]
id: DC_CL:0000003
name: conventional dendritic cell
def: "conventional dendritic cell is_a leukocyte that has_high_plasma_membrane_amount_relative_to_leukocyte CD11c and lacks_plasma_membrane_part CD19, CD3, CD34, and CD56." [AMM:amm]
comment: Immunological Reviews 2007 219: 118-142
intersection_of: CL:0000738 ! leukocyte
intersection_of: has_high_plasma_membrane_amount_relative_to_leukocyte PRO:000001013 ! CD11c
intersection_of: lacks_plasma_membrane_part DC_CL:0000072 ! CD3
intersection_of: lacks_plasma_membrane_part PRO:000001002 ! CD19
intersection_of: lacks_plasma_membrane_part PRO:000001003 ! CD34
intersection_of: lacks_plasma_membrane_part PRO:000001024 ! CD56
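To show how a stanza like the dendritic cell term can drive computation, here is a minimal sketch that reads the tag-value lines into a dictionary and pulls out the markers the cell type must lack. It handles only the simple "tag: value ! comment" lines used on this slide (the def and comment lines are left out of the sample text); a real OBO parser also has to deal with quoting, escapes, and trailing modifiers. The function names are hypothetical.

```python
STANZA = """
[Term]
id: DC_CL:0000003
name: conventional dendritic cell
intersection_of: CL:0000738 ! leukocyte
intersection_of: has_high_plasma_membrane_amount_relative_to_leukocyte PRO:000001013 ! CD11c
intersection_of: lacks_plasma_membrane_part DC_CL:0000072 ! CD3
intersection_of: lacks_plasma_membrane_part PRO:000001002 ! CD19
intersection_of: lacks_plasma_membrane_part PRO:000001003 ! CD34
intersection_of: lacks_plasma_membrane_part PRO:000001024 ! CD56
"""


def parse_stanza(text):
    """Map each tag to the list of its values, dropping '! comment' tails."""
    term = {}
    for raw_line in text.strip().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(": ")
        value = value.split(" ! ")[0].strip()
        term.setdefault(tag, []).append(value)
    return term


def targets_of(term, relation):
    """Return the identifiers linked by the given relation in intersection_of."""
    targets = []
    for value in term.get("intersection_of", []):
        parts = value.split()
        if len(parts) == 2 and parts[0] == relation:
            targets.append(parts[1])
    return targets


term = parse_stanza(STANZA)
print(term["name"][0])
print(targets_of(term, "lacks_plasma_membrane_part"))
```

Because the markers are referenced by PRO identifiers rather than by free-text names, a definition like this one can be compared or merged with other resources that use the same identifiers.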

29 TGF-beta signaling – comparison between PID and Reactome
[Pathway diagram. Labels: TGF-β, LAP, Furin, TGF-beta receptor (types I and II), Smad 2, Smad 4, Smad 7, STRAP, Ski, Shc, XIAP, TAK1, MEKK1, MAPKKK, ERK1/2, CaM, Ca2+, growth signals, stress signals, p38 MAPK pathway, JNK cascade, degradation, cytoplasm, nucleus, DNA binding and transcription regulation.]
Legend: phosphorylation (P) at serine (S), threonine (T) or tyrosine (Y); ubiquitination (U) at lysine (K). Components are marked as only included in Reactome or as common to both Reactome and PID; all others are in PID. Not all components in the pathway from both databases are listed.
PRO identifiers shown on the diagram: PRO:000000523, PRO:000000650, PRO:000000651, PRO:000000410, PRO:000000616, PRO:000000652, PRO:000000366, PRO:000000481, PRO:000000618, PRO:000000397, PRO:000000468

30 The Need for Representation of Various Protein Forms
Human PRLR and PTMs…
Glucocorticoid receptor (GR)

31 Sphingomyelin phosphodiesterase (SMPD1) (ASM_HUMAN)
Cleavage sites:
– lysosomal: the enzyme is transported from the Golgi apparatus to the lysosome after addition of mannose-6-phosphate (M6P) moieties and binding to the M6P receptor.
– secreted: the shorter cleaved form is not modified with M6P and is targeted for secretion to the extracellular space, with different functions such as LDL binding and oxidized LDL catabolism.
[Diagram labels: M6P, lysosome; extracellular, e.g. LDL binding]

32 GOA for Transcription factor Ovo-like 2
Form 1, long: GO:0045892 IDA, negative regulation of transcription, DNA-dependent
Form 2, short: GO:0045893 IDA, positive regulation of transcription, DNA-dependent
274 aa. OVOL2_MOUSE (Q8CIV7). Gene. 2004 336:47-58. PMID:15225875

33 Implications of Protein Evolution
[Tree diagram of species sharing a common ancestor: Human, Chimp, Mouse, E. coli, Fly, Yeast, Worm, Rat, B. subtilis]
Conclusions from experiments performed on proteins from one organism are often applicable to the homologous protein from another organism.
Information learned about existing proteins allows us to infer the properties of ancestral proteins.

34 Functional convergence
Protein classes of the same function derived from different evolutionary origins, e.g. carbonate dehydratase (carbonic anhydrase, EC 4.2.1.1), which has three independent gene families with functional convergence:
– Animal and prokaryotic type
– Plant and prokaryotic type
– Archaea type

35 Functional divergence
[Gene tree diagram: a gene duplication (TGM3/EPB42 split) and a speciation (human/mouse split) give a TGM3 branch and an EPB42 branch, each with a human and a mouse member: TGM3 (Human), TGM3 (Mouse), EPB42 (Human), EPB42 (Mouse)]
TGM3 = Protein-glutamine gamma-glutamyltransferase (transglutaminase; involved in protein modification)
EPB42 = Erythrocyte membrane protein band 4.2 (constituent of cytoskeleton; involved in cell shape)

36 Protein Ontology (PRO) http://pir.georgetown.edu/pro/
PRO protein hierarchy (ProEvo and ProForm):
– Root Level: protein
– Family-Level Distinction: translation product of an evolutionarily-related gene. Derivation: common ancestor. Source: PIRSF family.
– Gene-Level Distinction: translation product of a specific gene. Derivation: specific gene. Sources: PIRSF subfamily, Panther subfamily.
– Sequence-Level Distinction: translation product of a specific mRNA. Derivation: specific allele or splice variant. Source: UniProtKB.
– Modification-Level Distinction: cleaved/modified translation product. Derived from post-translational modification. Source: UniProtKB.
Relations within the hierarchy: is_a, derives_from
Links to other ontologies and resources:
– GO (Gene Ontology): has_function molecular function; located_in (for compartments) and part_of (for complexes) cellular component; participates_in biological process
– OMIM (Disease): agent_in disease
– PSI-MOD (Modification): has_modification protein modification
– SO (Sequence Ontology): sequence change, with has_agent (sequence change) and agent_of (effect on function)
– Pfam (Domain): has_part protein domain
Example:
TGF-beta receptor phosphorylated smad2 isoform 1 (Modification Level)
is_a phosphorylated smad2 isoform 1
derives_from smad2 isoform 1 (Sequence Level)
is_a smad2 (Gene Level)
is_a TGF-β receptor-regulated smad (Family Level)
is_a smad
is_a protein (Root Level)
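The smad2 example on this slide is a chain of is_a and derives_from links from a modified form up to the root term "protein". A short sketch of walking that chain follows. The term names are taken from the slide's example; the dictionary layout and function name are assumptions made for illustration, not PRO's actual data format.

```python
# child term -> (relation, parent term), following the slide's example chain
LINKS = {
    "TGF-beta receptor phosphorylated smad2 isoform 1":
        ("is_a", "phosphorylated smad2 isoform 1"),
    "phosphorylated smad2 isoform 1": ("derives_from", "smad2 isoform 1"),
    "smad2 isoform 1": ("is_a", "smad2"),
    "smad2": ("is_a", "TGF-beta receptor-regulated smad"),
    "TGF-beta receptor-regulated smad": ("is_a", "smad"),
    "smad": ("is_a", "protein"),
}


def path_to_root(term):
    """Return (relations, terms) met while walking from term up to the root."""
    terms = [term]
    relations = []
    while terms[-1] in LINKS:
        relation, parent = LINKS[terms[-1]]
        relations.append(relation)
        terms.append(parent)
    return relations, terms


relations, terms = path_to_root("phosphorylated smad2 isoform 1")
for relation, parent in zip(relations, terms[1:]):
    print(relation, parent)
```

Keeping the relation on each link matters: a modified form derives_from its unmodified sequence form rather than being a subclass of it, so a query that follows only is_a links would stop at that step.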

37 GO annotation of SMAD2_HUMAN: Mothers against decapentaplegic homolog 2 (Smad 2)
Cellular Component:
– nucleus
– cytoplasm
Molecular Function:
– protein binding
Biological Process:
– signal transduction
– regulation of transcription, DNA-dependent

38 [Diagram: Smad 2 signaling downstream of TGF-β and the TGF-beta receptor (types I and II). Numbered steps: 1 phosphorylation, 2 complex formation, 3 nuclear translocation, 4 DNA binding. Other labels: Smad 2, Smad 4, ERK1, CAMK2, phosphate groups (P), cytoplasm, nucleus, transcription regulation.]

39 SMAD2_HUMAN: Smad2 gene products
Columns: form, complex formation, location, transcription (Txn) effect, ID
– “normal”: cytoplasmic. PRO:00000011
– TGF-β receptor phosphorylated: forms complex, nuclear, Txn upregulation. PRO:00000013
– ERK1 phosphorylated: forms complex, nuclear, Txn upregulation++. PRO:00000014
– CAMK2 phosphorylated: forms complex, cytoplasmic, no Txn upregulation. PRO:00000015
– alternatively spliced short form: cytoplasmic. PRO:00000016
– phosphorylated short form: nuclear, Txn upregulation. PRO:00000018
– point mutation (causative agent: large intestine carcinoma): doesn’t form complex, cytoplasmic, no Txn upregulation. PRO:00000019

40 Search: smad2 -> 21 protein terms
http://pir.georgetown.edu/cgi-bin/pro/textsearch_pro

41 PRO hierarchy for smad2 isoform 2 (Root: Protein)

43 PRO entry report shows the ontology and the annotation

45 Summary
The vision of the biomedical ontology community is that all biomedical knowledge and data are disseminated on the Internet using principled ontologies, such that they are semantically interoperable and useful for improving biomedical science and clinical care.
The scope extends to all knowledge and data that are relevant to the understanding or improvement of human biology and health.
Knowledge and data are semantically interoperable when they enable predictable, meaningful computation across knowledge sources developed independently to meet diverse needs.
Principled ontologies are ones that follow NCBO-recommended formats and methodologies for ontology development, maintenance, and use.

