Alignment of Ontologies for Biological Research Judith A. Blake, Ph.D. Bioinformatics and Computational Biology The Jackson Laboratory
Dagstuhl What is my perspective? Biological data is voluminous and complex. Data integration is hard work. Bio-ontologies provide semantic structure and standards that aid in data analysis and hypothesis generation. There are many challenges to the effective use of bio-ontologies (in addition to challenges to the development of ontologies).
What is my approach? The goal is to facilitate ‘translational research’ through effective integration of experimental data from mouse models of human conditions with human clinical data from disease studies. Bio-ontologies provide a mechanism to support comprehensive data integration and analysis.
Interesting….
- Refine Relations Ontology (RO)
- Identify critical datasets
- Focus on bottlenecks
- Create views
Overview of Mouse Genome Informatics
- Phenotype: mutant allele definitions; QTL; strain characteristics; phenotype vocabularies; disease models (human); comparative phenotypes
- Genes & Gene Products: nomenclature; gene characterization; transcripts, proteins, gene products; functional annotation; orthologs & paralogs
- Sequences & Maps: sequence representation; C57BL/6J genomic sequence; SNPs and strain variants; adding biological context to computational gene models
- Gene Expression: mouse anatomy; time, tissue, level of expression; range of assays & results; emphasis on embryonic stages
- Tumor Biology: tumor classifications & descriptions; strain incidence; histopathology images; tumor genetics
Data acquisition is constant
Load Program: Summary of Data Loaded
- Mouse EntrezGene: EntrezGene IDs for mouse markers, plus marker-to-sequence associations from EntrezGene not already in MGD
- Human/Rat EntrezGene: Nomenclature, map position and other data regarding human and rat genes. OMIM associations for human.
- GenBank Seq: Mouse sequence records from GenBank
- RefSeq Seq: Mouse sequence records from RefSeq
- UniProt/TrEMBL Seq: Mouse sequence records from UniProt and TrEMBL
- TIGR/DoTS/NIA Seq: Mouse consensus sequence records from TIGR/DoTS/NIA clusters
- TIGR/DoTS/NIA Association: Associations between TIGR/DoTS/NIA cluster sequences and markers
- Ensembl Gene Model: Ensembl gene model sequences, coordinates, & associations between these & markers
- NCBI Gene Model: NCBI gene model sequences, coordinates, & associations between these & markers
- UniProt Association: UniProt/TrEMBL IDs and additional GenBank IDs for mouse markers, plus GO and InterPro annotations
- UniGene Association: UniGene cluster IDs for mouse markers
- EST cDNA Clone: Mouse IMAGE, NIA, MGC, Riken, cDNAs and EST sequence associations
- MGC Association: MGC IDs and associations between MGC full length sequences and MGC cDNAs
- RPCI Clone: RPCI 23/24 BAC clones and sequence associations
- GO Vocabulary: Updated Gene Ontology (GO) vocabularies from the central GO site
- OMIM Vocabulary: Updated OMIM disease terms
- MP Vocabulary: Updated MP vocabulary (from OBO-Edit)
- Anatomy: Updated adult mouse anatomy ontology (from OBO-Edit)
- Mapping panel: JAX, EUCIB, Copeland-Jenkins and many others
- PIRSF: Mouse PIR superfamily terms and associations to markers
- SNPs: Mouse SNPs from dbSNP and associations between SNPs & markers
Snapshot of MGI data content
MGI data statistics, March 2007
- Number of genes with sequence data: 28,292
- Number of genes (incl. unmapped mutants): 35,733
- Number of markers (including genes): 69,639
- Number of markers mapped: 65,345
- Number of genes with protein sequence information: 24,293
- Number of genes with GO annotations: 17,664
- Number of mouse/human orthologies: 16,127
- Number of mouse/rat orthologies: 15,802
- Number of genes with one or more phenotypic alleles: 6,979
- Number of cataloged phenotypic alleles: 17,494
- Number of references: 113,508
- Number of integrated mouse nucleotide sequences (+ ESTs): 8,3574,701
Build 36: Ensembl and NCBI Unification (Exon Overlap Detection)
[Figure: gene model categories. Unique to Ensembl; Unique to NCBI; Equivalent; 1:1; 1:n; n:1; n:m]
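The slide names exon overlap detection and a set of cardinality categories but does not show the procedure. The following is a minimal sketch of one way such a comparison could work, not MGI's actual pipeline. It assumes exons are `(chromosome, strand, start, end)` tuples with inclusive coordinates, that two gene models are linked when any pair of their exons overlaps, and that cardinality is read as Ensembl:NCBI. All identifiers and coordinates (`ENS_A`, `NCBI_X`, and so on) are hypothetical.

```python
from collections import defaultdict

# Assumption: an exon is (chromosome, strand, start, end), inclusive coordinates.

def exons_overlap(a, b):
    """Return True if two exons on the same chromosome and strand share a base."""
    return a[0] == b[0] and a[1] == b[1] and a[2] <= b[3] and b[2] <= a[3]

def models_overlap(exons_a, exons_b):
    """Return True if any exon of one gene model overlaps any exon of the other."""
    return any(exons_overlap(x, y) for x in exons_a for y in exons_b)

def classify_gene_models(ensembl, ncbi):
    """Group linked gene models and label each group by its cardinality."""
    # Build a bipartite graph: an edge joins an Ensembl and an NCBI model
    # whenever their exons overlap.
    links = defaultdict(set)
    for e_id, e_exons in ensembl.items():
        for n_id, n_exons in ncbi.items():
            if models_overlap(e_exons, n_exons):
                links[("E", e_id)].add(("N", n_id))
                links[("N", n_id)].add(("E", e_id))

    groups, seen = [], set()
    for node in [("E", i) for i in ensembl] + [("N", i) for i in ncbi]:
        if node in seen:
            continue
        # Collect the connected component containing this model.
        component, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current not in component:
                component.add(current)
                stack.extend(links[current] - component)
        seen |= component
        e_ids = sorted(i for src, i in component if src == "E")
        n_ids = sorted(i for src, i in component if src == "N")
        if not n_ids:
            label = "unique to Ensembl"
        elif not e_ids:
            label = "unique to NCBI"
        elif len(e_ids) == 1:
            label = "1:1" if len(n_ids) == 1 else "1:n"
        else:
            label = "n:1" if len(n_ids) == 1 else "n:m"
        groups.append((label, e_ids, n_ids))
    return groups

# Hypothetical gene models for illustration only.
ensembl = {
    "ENS_A": [("1", "+", 100, 200)],
    "ENS_B": [("1", "+", 1000, 1100)],
    "ENS_C": [("2", "-", 50, 80)],
}
ncbi = {
    "NCBI_X": [("1", "+", 150, 250)],
    "NCBI_Y": [("1", "+", 1050, 1080)],
    "NCBI_Z": [("1", "+", 1090, 1200)],
    "NCBI_W": [("3", "+", 1, 10)],
}
groups = classify_gene_models(ensembl, ncbi)
for label, e_ids, n_ids in groups:
    print(label, e_ids, n_ids)
```

The all-against-all comparison is quadratic in the number of gene models, which is fine for a sketch; a genome-scale run would more plausibly sort models by coordinate or use an interval index first.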
Who is the authority?
Mouse data for which MGI serves as the authoritative source.
Data type: Working relationship
- Gene Symbol/Name: MGI makes primary assignment; coordination with HGNC, RGNC
- Allele Symbol/Name: MGI makes primary assignment
- Strain Designations: MGI makes primary assignment
- Gene-to-nucleotide sequence association: Co-curation with NCBI
- Gene-to-protein sequence association: Co-curation with UniProt
- Gene Ontology (GO) annotations: MGI provides primary curation
- Gene homology data between mouse and other species: MGI curates orthology relationships
- Mammalian Phenotype Ontology: MGI develops vocabulary
- Genotype-to-phenotype data: MGI provides primary curation
- Mouse model-to-human disease (OMIM): MGI provides primary curation
Having the data, we want to ask complex questions
Multiple Controlled Vocabularies in MGI Gene Nomenclature Gene/Marker Type Allele Type Developmental and Adult Anatomies Assay Type Expression Mapping Molecular Mutation Inheritance Mode Gene Ontology Mammalian Phenotype Ontology Tissue Types Cell Types Cell Lines Units Cytogenetic Molecular ES Cell Line Strain Nomenclature
Vocabularies in MGI: GO Example
[Figure: three linked panels]
- Vocabulary: Terms (GO:54321 …), DAGs, Definition, Synonyms; example terms: Transcription factor, DNA binding, Protein binding, Ligand binding or carrier
- Annotations: J:65378 TAS, J:62648 IDA, J:60000 IEA, …
- Genes: Ahr, Edr2; Name, Synonyms, MGI:105043
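The figure relates three kinds of records: vocabulary terms, genes, and the annotations that join them with a reference and an evidence code. A minimal sketch of that structure follows. The class and field names are hypothetical and are not MGI's schema. The slide does not say which gene, term, and reference belong together, so the pairings below are assumptions made only for illustration; `GO:54321` is the placeholder ID that appears on the slide. The evidence code expansions are the standard GO ones.

```python
from dataclasses import dataclass

# Standard GO evidence code meanings for the three codes shown on the slide.
EVIDENCE_CODES = {
    "TAS": "Traceable Author Statement",
    "IDA": "Inferred from Direct Assay",
    "IEA": "Inferred from Electronic Annotation",
}

@dataclass(frozen=True)
class GOAnnotation:
    """One gene-to-term annotation (hypothetical field names)."""
    gene_symbol: str   # e.g. "Ahr"
    term_id: str       # GO term ID; "GO:54321" is the slide's placeholder
    reference: str     # MGI reference ID, e.g. "J:65378"
    evidence: str      # GO evidence code, e.g. "TAS"

def curated_only(annotations):
    """Drop electronically inferred (IEA) annotations, keeping curated ones."""
    return [a for a in annotations if a.evidence != "IEA"]

# Assumed pairings: the slide lists these values but not how they combine.
annotations = [
    GOAnnotation("Ahr", "GO:54321", "J:65378", "TAS"),
    GOAnnotation("Ahr", "GO:54321", "J:62648", "IDA"),
    GOAnnotation("Edr2", "GO:54321", "J:60000", "IEA"),
]

for a in curated_only(annotations):
    print(a.gene_symbol, a.term_id, a.reference, EVIDENCE_CODES[a.evidence])
```

Keeping the evidence code on each annotation, rather than on the gene or the term, is what lets a query separate curated statements from electronically inferred ones.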
Mammalian Phenotype Ontology
- Compositional terms
- ‘Working’ ontology
- Projected xref to ‘core’ ontologies (Anatomy, GO)
- Built with attention to ontological principles, but with the primary goal of supporting annotation of diverse experimental results from many research groups and perspectives
We are exploring ontological representations that relate human clinical data with mouse phenotypes
- Create compositional view for annotation of mouse models and human clinical data
- Provide xref / RO back to core ontologies
- Support both annotation and ontology alignment efforts
- Develop tools to support complex queries
We modeled gangliosidoses as a test case. Two types of gangliosidoses are Sandhoff and Tay-Sachs diseases.
Curators use controlled terms from structured vocabularies (ontologies) to curate complex biological systems described in the literature. The knowledge is in the details.
The knowledge is in the details
Including the relationship to human disease
More mouse models – Tay-Sachs
Different core ontologies need to be combined to describe complex biological systems
- Chemical Ontology: Dopamine (CHEBI:18243)
- Cell Type Ontology: Dopaminergic Neuron (CL:)
- Biological Process: Synaptic transmission (GO:)
- Anatomical Dictionary: Brain (MA:)
Dilemma: No formal links currently exist between the separate ontologies.
Solution?
1. Generate cross-products (compositional terms) as necessary for annotations of characteristics of disease cases and disease models;
2. Annotate specific instances of human cases and mouse models;
3. Visualize and mine co-annotated data
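The three steps above can be sketched in a few lines, under stated assumptions. A compositional term is modeled here as a quality combined with an entity from a core ontology, identified by label and ontology prefix only, since the slides do not give the core term IDs. The class name, the subject names (`human_case_1`, `mouse_model_1`, `mouse_model_2`), and their annotations are all hypothetical; this is an illustration of the idea, not the representation the project used.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class CompositionalTerm:
    """Cross-product of a quality and an entity from a core ontology."""
    quality: str   # e.g. "abnormal morphology"
    entity: str    # label of the core-ontology term, e.g. "neuron"
    source: str    # prefix of the core ontology, e.g. "CL", "GO", "MA"

    def label(self):
        return f"{self.quality} of {self.entity} [{self.source}]"

# Step 1: generate the cross-products needed for annotation.
neuron_morphology = CompositionalTerm("abnormal morphology", "neuron", "CL")
synaptic_function = CompositionalTerm("abnormal", "synaptic transmission", "GO")
brain_morphology = CompositionalTerm("abnormal morphology", "brain", "MA")

# Step 2: annotate specific human cases and mouse models (made-up data).
human_cases = {
    "human_case_1": {neuron_morphology, synaptic_function},
}
mouse_models = {
    "mouse_model_1": {neuron_morphology, brain_morphology},
    "mouse_model_2": {brain_morphology},
}

# Step 3: mine co-annotated data by finding shared compositional terms.
def co_annotated(cases, models):
    """Map each (case, model) pair to the terms both are annotated with."""
    shared = {}
    for (case, c_terms), (model, m_terms) in product(cases.items(), models.items()):
        common = c_terms & m_terms
        if common:
            shared[(case, model)] = common
    return shared

matches = co_annotated(human_cases, mouse_models)
for (case, model), terms in matches.items():
    print(case, model, sorted(t.label() for t in terms))
```

Exact matching on shared terms is the simplest possible mining step. Using the ontology hierarchy, so that an annotation to a more specific term also matches its parent, would require the formal links between ontologies that the slide identifies as missing.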
Abnormal neuron morphology
Next Steps
- Perspective (views): Lung Cancer. Provide Disease Ontology. Build compositional view.
- Mouse Data: Curate comprehensive annotations for genes implicated in lung phenotypes
- Human Data: Curate clinical data for ontology annotation
- Data Analysis: Use ontological structures to facilitate data exploration and hypothesis generation
Next conference? “Enabling technologies for ontological access to clinical and animal model data”. A hands-on problem-solving workshop – a problem use case
Mouse Genome Informatics, Bar Harbor, Maine, USA: MGI projects are supported by NIH [NHGRI, NICHD, and NCI].
Gene Ontology: GO Consortium is supported by NIH-NHGRI and by the European Union RTD Programme.