Published by Brett Craig. Modified over 9 years ago.
2
Standards and Ontologies for Data Annotation
Helen Parkinson
Microarray Informatics Team, European Bioinformatics Institute
NBN-EBI Course, October 2002
3
Talk structure
- Annotation, problems and solutions
- What is an ontology?
- Examples and uses of existing ontologies
- ArrayExpress: a database for microarray gene expression data
- Use of ontologies to annotate microarray data in ArrayExpress
4
Informatics resources for biologists
- Over 500 databanks and analysis tools that work over various resources
- Repositories of knowledge and data at various levels: primary and secondary databases, and interfaces, e.g. EMBL, Swiss-Prot, Ensembl
- Knowledge often held as free text; limited use made of controlled vocabularies
- Enormous amount of semantic heterogeneity and poor query facilities
5
Search for “Ssp1” gene in DDBJ/EMBL/GenBank
1: AB027913 Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76
2: AL441624 S.pombe chromosome I cosmid c110
3: AL159180 S.pombe chromosome I P1 p14E8
4: AL049609 S.pombe chromosome III cosmid c297
5: AL136235 S.pombe chromosome I cosmid c664
6: D45882 Yeast ssp1 gene for protein kinase, complete cds
7: X59987 S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)
6
Gene synonyms
- Problem: a name can identify different genes, even in a well-annotated organism like S. pombe
- Ssp1 = SPAC664.11, SPAC110.04c, SPCC297.03
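The ambiguity can be made concrete with a small lookup. This is a minimal sketch in Python that uses only the three systematic identifiers from the slide; the dictionary layout and the function names `resolve` and `is_ambiguous` are illustrative assumptions, not part of any database API.

```python
# Map a gene name to every systematic identifier it may refer to.
# The three S. pombe identifiers come from the slide; the structure is illustrative.
SYNONYMS = {
    "ssp1": ["SPAC664.11", "SPAC110.04c", "SPCC297.03"],
}


def resolve(name):
    """Return the systematic IDs recorded for a gene name (case-insensitive)."""
    return SYNONYMS.get(name.lower(), [])


def is_ambiguous(name):
    """A name is ambiguous when it maps to more than one systematic ID."""
    return len(resolve(name)) > 1


print(resolve("Ssp1"))
print(is_ambiguous("Ssp1"))
```

A free-text search on the name alone cannot choose between the three identifiers, which is why annotation against systematic identifiers is preferable.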
7
Annotation problems
- Free text entries in databases cause problems: they are human-readable but not machine-readable, and humans are error-prone
- Example: many genes and proteins can have the same name, even in well-annotated organisms
- Many important projects have no coordination of standards, e.g. for gene naming or describing developmental stages
- Whose responsibility is this? The community's?
8
Possible solutions
- Using ontologies, like the Gene Ontology but covering many more areas of biology than gene products
- What is an ontology and how can it be used?
- Thinking about how you will describe the experiment as you start it
9
What is an ontology?
- Captures knowledge for both humans and computer applications
- Has a set of vocabulary definitions that capture a community’s knowledge of a domain
- “An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.”
- It is more than a controlled vocabulary: it has structure (but a CV is a good place to start)
10
What does an ontology do?
- Captures knowledge
- Creates a shared understanding, between humans and for computers
- Makes knowledge machine-processable
- Makes meaning explicit, by definition and context
11
Range of ontologies
[Figure: spectrum of ontology types, from lightweight to formal: Catalog/ID; Terms/glossary; Thesauri (“narrower term” relation); Informal is-a; Formal is-a; Formal instance; Frames (properties); Value restrictions; Disjointness, Inverse, part-of…; General logical constraints. Ontologies placed on the spectrum: Gene Ontology, Mouse Anatomy, EcoCyc, TAMBIS, MGED.]
Slide from Robert Stevens, University of Manchester
12
Three types of ontologies
- Domain-oriented, which are either domain specific (e.g. E. coli) or domain general (e.g. gene function)
- Task-oriented, which are either task specific (e.g. annotation analysis) or task general (e.g. problem solving)
- Generic, which capture common high-level concepts, such as Physical, Abstract and Substance
13
How can ontologies be used?
- Community reference: neutral authoring
- Either defining database schema or defining a common vocabulary for database annotation (avoiding free text)
- Providing common access to information
- Ontology-based search by forming queries over databases
- Understanding database annotation and technical literature
- Guiding and interpreting analyses and hypothesis generation
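Ontology-based search can be illustrated with a toy example. The sketch below assumes a hypothetical is-a hierarchy and hypothetical annotations (only the accession numbers are taken from the earlier search slide); it shows how a query for a general term also retrieves entries annotated with more specific terms, which a plain text match would not guarantee.

```python
# Toy is-a hierarchy: child term -> parent term (illustrative, not a real ontology).
IS_A = {
    "Ser/Thr protein kinase": "protein kinase",
    "protein kinase": "enzyme",
    "Hsp70": "heat shock protein",
}

# Toy annotations: accession -> ontology term (the term assignments are assumed).
ANNOTATIONS = {
    "AB027913": "Ser/Thr protein kinase",
    "D45882": "protein kinase",
    "X59987": "Hsp70",
}


def ancestors(term):
    """Yield the term itself, then each parent term up to the root."""
    while term is not None:
        yield term
        term = IS_A.get(term)


def search(query):
    """Return accessions annotated with the query term or any of its subclasses."""
    return sorted(
        acc for acc, term in ANNOTATIONS.items() if query in ancestors(term)
    )


print(search("protein kinase"))
```

Walking up from each annotation to its ancestors keeps the sketch short; a larger system would more likely precompute the descendants of the query term once.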
14
Components of an ontology
- Class: a container for information; has a definition and a relationship to other classes (is-a, part-of, kind-of)
- Instances: terms that are contained within a class
15
Example of a class, subclass relationship
  Class def: African elephant
    sub-class of: elephant
    slot constraint: comes from
    slot has filler: Africa
This is just a formalised way to say that African elephants are a type of elephant that comes from Africa, but it is machine readable.
Ian Horrocks, creator of OilEd
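The same idea can be sketched outside OilEd. The Python below is an illustrative assumption, not OIL syntax or an OilEd API: the class `OntologyClass`, its `is_a` and `all_slots` methods, and the slot name `comes_from` are all hypothetical. It shows why the formal version is machine readable: a program can follow the sub-class link and read the slot filler.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class OntologyClass:
    name: str
    parent: Optional["OntologyClass"] = None  # the is-a (sub-class of) link
    slots: Dict[str, str] = field(default_factory=dict)  # slot name -> filler

    def is_a(self, other):
        """True if this class is `other` or a (transitive) subclass of it."""
        cls = self
        while cls is not None:
            if cls is other:
                return True
            cls = cls.parent
        return False

    def all_slots(self):
        """Slot constraints, including those inherited from superclasses."""
        inherited = self.parent.all_slots() if self.parent else {}
        return {**inherited, **self.slots}


elephant = OntologyClass("elephant")
african_elephant = OntologyClass(
    "African elephant", parent=elephant, slots={"comes_from": "Africa"}
)

print(african_elephant.is_a(elephant))
print(african_elephant.all_slots())
```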
16
Examples of usable external ontologies and CVs
- NCBI taxonomy database
- Jackson Lab mouse strains and genes
- Edinburgh mouse atlas anatomy
- Chemical and compound ontologies, e.g. CAS
- Species specific, e.g. fly, A. thaliana
- GO
- GOBO ontologies
- Various pathology ontologies
17
ICD10
- International Statistical Classification of Diseases and Related Health Problems… or what people die of
- Useful information that should be included in databases, e.g. microarray, health-related, deCODE
- International: defines disease etc. universally
- But… too much definition can be problematic…
18
Too much definition can be bad
Coding scheme (code range): count
ICD-9 (E826): 8
READ-2 (T30..): 81
READ-3: 87
ICD-10 (V10-19): 587
Example ICD-10 codes:
V31.22: Occupant of three-wheeled motor vehicle injured in collision with pedal cycle, person on outside of vehicle, nontraffic accident, while working for income
W65.40: Drowning and submersion while in bath-tub, street and highway, while engaged in sports activity
X35.44: Victim of volcanic eruption, street and highway, while resting, sleeping, eating or engaging in other vital activities
19
And coverage may not be universal
- ICD10 includes accidents in space
- But not accidents involving collisions between cars and moose
- Most Scandinavians are more likely to be injured colliding with a moose than in orbit
20
Summary
- Ontologies can define terms and structure knowledge for both humans and machines; many do this successfully
- If over-engineered, they are no longer human-readable and are too hard to use for annotating data
21
Introducing ArrayExpress - a database which needs an ontology
22
ArrayExpress
- Public database for gene expression data
- Aims to store well-annotated (MIAME-compliant) and well-structured data
MIAME
- Recorded information should be sufficient to interpret and replicate the experiment
- Information should be structured so that querying and automated data analysis and mining are feasible*
*Brazma et al., Nature Genetics, 2001
23
Infrastructure at the EBI
[Figure: data-flow diagram. Components: ArrayExpress (Oracle); MIAMExpress (MySQL); EBI Expression Profiler; other public microarray databases (GEO, CIBEX); external bioinformatic databases; array manufacturers, LIMS, microarray software and data analysis software with MAGE-ML export; local MIAMExpress installations. Flow labels: Submissions, MAGE-ML, MAGE-ML files, MAGE-ML pipelines, Queries (www), Data analysis (www).]
24
ArrayExpress Conceptual Model
[Figure: conceptual model with the entities Experiment, Hybridisation, Array, Sample, Source (e.g. Taxonomy), Gene (e.g. EMBL), Normalisation, Data, Publication and External links.]
26
Public Data Access
- Data export as tab-delimited file
- Export to Expression Profiler
- As MAGE-ML, from query interface
- Arrays exportable as tab-delimited file
27
Getting data in
- From a local LIMS
- From other microarray databases, e.g. BASE, Rosetta Resolver, SMD
- Via MIAMExpress, a point-and-click tool from EBI
28
MIAMExpress submission and annotation tool
- Based on MIAME concepts and questionnaire
- Perl-CGI, MySQL database
- Experiment, Array, Protocol submissions
- Generic annotation tool, all experiment types
- Exports MAGE-ML
30
Array Definition Format
- Tab-delimited file format describing an array
- Defines relationships between features and sequences
- Provides sequence annotation, database references
- Exportable from the database too
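Because the format is tab-delimited, it can be read with standard tools. The sketch below is only an illustration: the column names (`feature`, `sequence_id`, `database`, `accession`) and the rows are hypothetical and are not taken from the ADF specification. It shows how a feature-to-sequence relationship could be recovered from such a file.

```python
import csv
import io

# Hypothetical ADF-like content; the real ADF column names may differ.
ADF_TEXT = (
    "feature\tsequence_id\tdatabase\taccession\n"
    "1.1.1\tseq0001\tEMBL\tD45882\n"
    "1.1.2\tseq0002\tEMBL\tX59987\n"
)


def read_adf(text):
    """Parse tab-delimited text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


rows = read_adf(ADF_TEXT)

# Relationship between array features and database references.
feature_to_accession = {row["feature"]: row["accession"] for row in rows}
print(feature_to_accession)
```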
31
ArrayExpress curation effort
- User support and help documentation
- Curation at source (not destination)
- Support on ontologies and CVs
- Minimize free text, removal of synonyms
- MIAME encouragement
- Help on MAGE-ML
- Goal: to provide high-quality, well-annotated data to allow automated data analysis
32
Why do we need an ontology for the database?
- To help users annotate their data usefully and easily
- To perform structured queries
- To accurately compare data
- To avoid problems with free text searching
- To avoid excessive curation workload in future
33
Sample annotation
- Gene expression data only have meaning in the context of detailed sample descriptions
- If the data are going to be interpreted by independent parties, sample information has to be searchable and recorded in the database
- Controlled vocabularies and ontologies are needed for unambiguous sample description, e.g. cell type, compound, species, developmental stage
- None of this is trivial
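One way a submission tool could enforce unambiguous sample description is to check each field against a controlled vocabulary. The sketch below is an assumption for illustration only: the field names, the allowed terms and the `validate` function are hypothetical and do not describe how ArrayExpress or MIAMExpress work.

```python
# Hypothetical controlled vocabularies, keyed by annotation field.
CONTROLLED_VOCAB = {
    "sex": {"female", "male", "mixed", "unknown"},
    "organism_part": {"liver", "kidney", "brain"},
}


def validate(annotation):
    """Return the (field, value) pairs that are not in the controlled vocabulary.

    Fields without a controlled vocabulary are accepted as free text.
    """
    problems = []
    for field_name, value in annotation.items():
        allowed = CONTROLLED_VOCAB.get(field_name)
        if allowed is not None and value.lower() not in allowed:
            problems.append((field_name, value))
    return problems


print(validate({"sex": "Female", "organism_part": "Liver"}))
print(validate({"sex": "F", "organism_part": "hepatic tissue"}))
```

Rejecting "F" or "hepatic tissue" at submission time is what "curation at source" amounts to: the submitter picks an agreed term instead of a curator mapping synonyms later.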
34
MGED Biomaterial (sample) Ontology
- Under construction by the MGED OWG, using OilEd
- Motivated by MIAME and coordinated with the ArrayExpress database model
- We are defining classes, providing constraints, and adding terms
- Now being extended to describe experiments and arrays
35
MGED BioMaterial Internal Terms
37
Internal and External Terms combined
38
Examples of external ontologies and CVs
- NCBI taxonomy database
- Jackson Lab mouse strains and genes
- Edinburgh mouse atlas anatomy
- HUGO nomenclature for human genes
- Chemical and compound ontologies, e.g. CAS
- TAIR
- FlyBase anatomy
- GO (www.geneontology.org)
- GOBO ontologies
39
Example Annotation Sample source and treatment description, and its correct annotation using the MGED BioMaterial Ontology classes and corresponding external references: “Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”
40
MGED BioMaterial Ontology class: instance (external reference)
BioMaterialDescription
Biosource Property
Organism: Mus musculus musculus, id: 39442 (NCBI Taxonomy)
Age: 7 weeks after birth
DevelopmentStage: Stage 28 (Mouse Anatomical Dictionary)
Sex: Female
StrainOrLine: C57BL/6 (International Committee on Standardized Genetic Nomenclature for Mice)
BiosourceProvider: Charles River, Japan
OrganismPart: Liver (Mouse Anatomical Dictionary)
BioMaterialManipulation
EnvironmentalHistory
CultureCondition
Temperature: 22 ± 2 °C
Humidity: 55 ± 5%
Light: 12 hours light/dark cycle
PathogenTests: Specified pathogen free conditions
Water: ad libitum
Nutrients: MF, Oriental Yeast, Tokyo, Japan
Treatment
CompoundBasedTreatment
(Compound): Fenofibrate, CAS 49562-28-9 (ChemIDplus)
(Treatment_application): in vivo, oral gavage
(Measurement): 100 mg/kg body weight
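An annotation like this can be held as structured records rather than free text. The sketch below is illustrative: the `Annotation` type and `external_terms` function are hypothetical, and the pairing of classes with values and external references follows one reading of the slide rather than a published mapping.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    ontology_class: str  # MGED BioMaterial Ontology class name
    value: str  # instance or term chosen by the annotator
    source: Optional[str] = None  # external reference, if the term comes from one


# A subset of the example annotation (class/value/source pairing is assumed).
sample = [
    Annotation("Organism", "Mus musculus musculus", "NCBI Taxonomy"),
    Annotation("Age", "7 weeks after birth"),
    Annotation(
        "StrainOrLine",
        "C57BL/6",
        "International Committee on Standardized Genetic Nomenclature for Mice",
    ),
    Annotation("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    Annotation("Compound", "Fenofibrate", "ChemIDplus"),
]


def external_terms(annotations):
    """Map each class to its external reference, skipping internal terms."""
    return {a.ontology_class: a.source for a in annotations if a.source is not None}


print(external_terms(sample))
```

Keeping the source alongside each value makes it possible to tell internal ontology instances (such as the age) from terms borrowed from an external resource.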
41
Forms make this annotation easier
42
Sanger Human and Mouse Array Annotation Pipeline
- Takes sequences present on the array
- Aligns them with Exonerate (alignment algorithm) against the NCBI assembly (from Ensembl)
- Inherits annotation from Ensembl: gene names, database references, GO terms
- Provides a common annotation in tab-delimited format, which can be parsed to MAGE-ML or used in ADF
- Pipeline available for external users to beta test
43
Example BioSequence annotation
44
Future Further data acquisition for ArrayExpress Update ArrayExpress to newest MAGE-OM V2.0 MIAMExpress, domain specific, portable Further ontology development and integration into tools, use of OWL Curation tools (other than grep, Perl scripts) Improved query interface for AE ArrayExpress update tool Data exchange between public databases
45
Resources
- Schemas for both ArrayExpress and MIAMExpress, access to code
- MAGE-ML examples: Arrays, Expts, Protocols
- MIAME glossary, MAGE-MIAME-ontology mappings
- List of ontology resources from MGED pages
- Help in establishing pipelines
- MGED software: MAGEstk
- Curation, help and advice
- www.mged.org
- www.ebi.ac.uk/arrayexpress
46
Acknowledgments
- Microarray Informatics Team, EBI
- Robert Stevens, Jeremy Rogers and colleagues, University of Manchester, UK
- Chris Stoeckert, University of Pennsylvania, USA
47
Quote ‘Most biologists would rather share their toothbrush than share a gene name’ Michael Ashburner