Standards and Ontologies for Data Annotation Helen Parkinson Microarray Informatics Team European Bioinformatics Institute NBN-EBI Course, October 2002.

Talk structure
- Annotation, problems and solutions
- What is an ontology?
- Examples and uses of existing ontologies
- ArrayExpress – a database for microarray gene expression data
- Use of ontologies to annotate microarray data in ArrayExpress

Informatics resources for biologists
- Over 500 databanks and analysis tools that work over various resources
- Repositories of knowledge and data at various levels: primary and secondary databases, and interfaces, e.g. EMBL, Swissprot, Ensembl
- Knowledge often held as free text; limited use made of controlled vocabularies
- Enormous amount of semantic heterogeneity and poor query facilities

Search for “Ssp1” gene in DDBJ/EMBL/Genbank
1: AB027913 Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76
2: AL441624 S.pombe chromosome I cosmid c110
3: AL159180 S.pombe chromosome I P1 p14E8
4: AL049609 S.pombe chromosome III cosmid c297
5: AL136235 S.pombe chromosome I cosmid c664
6: D45882 Yeast ssp1 gene for protein kinase, complete cds
7: X59987 S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)

Gene synonyms
- Problem: a name can identify different genes, even in a well-annotated organism like S. pombe
- Ssp1 = SPAC664.11, SPAC110.04c, SPCC297.03
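The ambiguity on this slide can be made concrete with a small lookup sketch. The table holds only the three systematic identifiers listed above; the function names `resolve_gene_name` and `is_ambiguous` and the case-insensitive matching are assumptions made for illustration, not part of any database interface.

```python
from typing import Dict, List

# Illustrative table: the three S. pombe systematic identifiers are the ones
# listed on the slide; everything else here is a hypothetical sketch.
NAME_TO_SYSTEMATIC_IDS: Dict[str, List[str]] = {
    "ssp1": ["SPAC664.11", "SPAC110.04c", "SPCC297.03"],
}


def resolve_gene_name(name: str) -> List[str]:
    """Return every systematic identifier that shares the given gene name."""
    return NAME_TO_SYSTEMATIC_IDS.get(name.lower(), [])


def is_ambiguous(name: str) -> bool:
    """A name is ambiguous when it maps to more than one identifier."""
    return len(resolve_gene_name(name)) > 1


if __name__ == "__main__":
    print(resolve_gene_name("Ssp1"))
    print(is_ambiguous("Ssp1"))
```

A free-text search can only return all three candidates; a curated identifier lets the annotator pick exactly one.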

Annotation problems
- Free text entries in databases cause problems: they are human readable, not machine readable, and humans are error prone
- Example: many genes and proteins can have the same name, even in well-annotated organisms
- Many important projects have no coordination of standards, e.g. for gene naming or describing developmental stages
- Whose responsibility is this? The community's?

Possible solutions
- Using ontologies, like the Gene Ontology, but covering many more areas of biology than gene products
- What is an ontology and how can it be used?
- Thinking about how you describe the experiment as you start it

What is an ontology?
- Captures knowledge for both humans and computer applications
- Has a set of vocabulary definitions that capture a community’s knowledge of a domain
- ‘An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.’
- It is more than a controlled vocabulary: it has structure (but a CV is a good place to start)

What does an ontology do?
- Captures knowledge
- Creates a shared understanding, between humans and for computers
- Makes knowledge machine processable
- Makes meaning explicit, by definition and context

Range of ontologies
[Figure: spectrum of ontology types, from least to most expressive: Catalog/ID; Terms/glossary; Thesauri, “narrower term” relation; Informal is-a; Formal is-a; Formal instance; Frames (properties); Value restrictions; Disjointness, Inverse, part-of…; General logical constraints. Examples placed along the spectrum: Gene Ontology, Mouse Anatomy, EcoCyc, TAMBIS, MGED.]
Slide from Robert Stevens, University of Manchester

Three types of ontologies
- Domain-oriented, which are either domain specific (e.g. E. coli) or domain general (e.g. gene function)
- Task-oriented, which are either task specific (e.g. annotation analysis) or task general (e.g. problem solving)
- Generic, which capture common high-level concepts, such as Physical, Abstract and Substance

How can ontologies be used?
- Community reference: neutral authoring
- Either defining a database schema or defining a common vocabulary for database annotation (avoiding free text)
- Providing common access to information: ontology-based search by forming queries over databases
- Understanding database annotation and technical literature
- Guiding and interpreting analyses and hypothesis generation
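Ontology-based search, as mentioned on this slide, can be sketched as query expansion over is-a links: a query for a general term also returns records annotated with its more specific subtypes. The tiny hierarchy, the sample ids and the function names below are all hypothetical and are not taken from any real ontology or database.

```python
from typing import Dict, List, Set

# Hypothetical is-a links (child -> parent), for illustration only.
IS_A: Dict[str, str] = {
    "epithelial cell": "cell",
    "hepatocyte": "epithelial cell",
    "neuron": "cell",
}

# Hypothetical samples, each annotated with one term from the hierarchy.
SAMPLES: Dict[str, str] = {
    "s1": "hepatocyte",
    "s2": "neuron",
    "s3": "epithelial cell",
}


def descendants(term: str) -> Set[str]:
    """Return the term itself plus every term reachable through is-a links."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in IS_A.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def search(samples: Dict[str, str], query_term: str) -> List[str]:
    """Return ids of samples annotated with the query term or any subtype."""
    wanted = descendants(query_term)
    return sorted(sid for sid, term in samples.items() if term in wanted)


if __name__ == "__main__":
    print(search(SAMPLES, "epithelial cell"))
```

A plain string match on "epithelial cell" would miss the hepatocyte sample; the structure of the ontology is what recovers it.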

Components of an ontology
- Class: a container for information; has a definition and a relationship to other classes (is-a, part-of, kind-of)
- Instances: terms that are contained within a class

Example of a class, subclass relationship
Class def African elephant
  sub-class of elephant
  slot constraint comes from
    has filler Africa
This is just a formalised way to say that African elephants are a type of elephant that come from Africa, but it is machine readable.
Ian Horrocks, creator of OilEd
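To show what "machine readable" buys here, the elephant example can be mirrored in a few lines of Python. This is a minimal sketch and not OIL or OilEd itself: the `OntologyClass` type, its method names and the `comes_from` slot key are assumptions chosen to match the wording of the slide.

```python
from typing import Dict, Optional


class OntologyClass:
    """A class with an optional parent and slot constraints (slot -> filler)."""

    def __init__(self, name: str,
                 parent: Optional["OntologyClass"] = None,
                 slot_constraints: Optional[Dict[str, str]] = None) -> None:
        self.name = name
        self.parent = parent
        self.slot_constraints = dict(slot_constraints or {})

    def is_a(self, other: "OntologyClass") -> bool:
        """True if this class is `other` or a (transitive) sub-class of it."""
        node: Optional[OntologyClass] = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False

    def all_constraints(self) -> Dict[str, str]:
        """Own constraints merged over those inherited from ancestors."""
        inherited = self.parent.all_constraints() if self.parent else {}
        merged = dict(inherited)
        merged.update(self.slot_constraints)
        return merged


# The example from the slide: African elephant is a sub-class of elephant
# whose "comes from" slot has the filler Africa.
elephant = OntologyClass("elephant")
african_elephant = OntologyClass(
    "African elephant",
    parent=elephant,
    slot_constraints={"comes_from": "Africa"},
)

if __name__ == "__main__":
    print(african_elephant.is_a(elephant))
    print(african_elephant.all_constraints())
```

Once the definition is data rather than prose, a program can answer "is an African elephant an elephant?" without any text parsing.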

Examples of usable external ontologies and CVs
- NCBI taxonomy database
- Jackson Lab mouse strains and genes
- Edinburgh mouse atlas anatomy
- Chemical and compound ontologies, e.g. CAS
- Species specific: fly, A. thaliana
- GO
- GOBO ontologies
- Various pathology ontologies

ICD10
- International statistical classification of diseases and related health problems... or what people die of
- Useful information that should be included in databases, e.g. microarray, health related, DeCode
- International: defines diseases etc. universally
- But too much definition can be problematic...

Too much definition can be bad
- ICD-9 (E826): 8
- READ-2 (T30..): 81
- READ-3: 87
- ICD-10 (V10-19): 587
- V31.22 Occupant of three-wheeled motor vehicle injured in collision with pedal cycle, person on outside of vehicle, nontraffic accident, while working for income
- W65.40 Drowning and submersion while in bath-tub, street and highway, while engaged in sports activity
- X35.44 Victim of volcanic eruption, street and highway, while resting, sleeping, eating or engaging in other vital activities

And coverage may not be universal
- ICD10 includes accidents in space
- But not accidents involving collisions between cars and moose
- Most Scandinavians are more likely to be injured colliding with a moose than in orbit

Summary
- Ontologies can define terms and structure knowledge for both humans and machines; many do this successfully
- If over-engineered, they are no longer human readable and become too hard to use for annotating data

Introducing ArrayExpress - a database which needs an ontology

ArrayExpress
- Public database for gene expression data
- Aims to store well-annotated (MIAME compliant) and well-structured data
- MIAME
  - Recorded info should be sufficient to interpret and replicate the experiment
  - Information should be structured so that querying and automated data analysis and mining are feasible*
*Brazma et al., Nature Genetics, 2001

Infrastructure at the EBI
[Diagram: data flow around ArrayExpress. Labels: ArrayExpress (Oracle); MIAMExpress (MySQL); EBI Expression Profiler; other public microarray databases (GEO, CIBEX); external bioinformatic databases; array manufacturers; LIMS; microarray software; data analysis software; local MIAMExpress installations; web access for queries and data analysis; submissions as MAGE-ML files; MAGE-ML export; MAGE-ML pipelines.]

ArrayExpress Conceptual Model
[Diagram labels: Experiment; Hybridisation; Array; Sample; Source (e.g. Taxonomy); Gene (e.g. EMBL); Data; Normalisation; Publication; External links.]

Public Data Access
- Data export as tab delimited file
- Export to Expression Profiler
- As MAGE-ML, from query interface
- Arrays exportable as tab delimited file

Getting data in
- From local LIMS system
- From other microarray databases, e.g. BASE, Rosetta Resolver, SMD
- Via MIAMExpress, a point and click tool from EBI

MIAMExpress submission and annotation tool
- Based on MIAME concepts and questionnaire
- Perl-CGI, MySQL database
- Experiment, Array, Protocol submissions
- Generic annotation tool, all experiment types
- Exports MAGE-ML

Array Definition Format
- Tab delimited file format describing an array
- Defines relationships between features and sequences
- Provides sequence annotation, database references
- Exportable from the database too

ArrayExpress curation effort
- User support and help documentation
- Curation at source (not destination)
- Support on ontologies and CVs
- Minimize free text, removal of synonyms
- MIAME encouragement
- Help on MAGE-ML
- Goal: to provide high-quality, well-annotated data to allow automated data analysis

Why do we need an ontology for the database?
- To help users annotate their data usefully and easily
- To perform structured queries
- To accurately compare data
- To avoid problems with free text searching
- To avoid excessive curation workload in future
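One way to picture "avoiding free text" is a check at submission time that maps whatever the user typed onto a controlled term, or rejects it for curation. The sketch below is an assumption about how such a check might look, not a description of ArrayExpress or MIAMExpress; the vocabulary, its synonyms and the name `normalise` are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical controlled vocabulary for the species field:
# preferred term -> accepted free-text synonyms (lower case).
CONTROLLED_TERMS: Dict[str, Set[str]] = {
    "Mus musculus": {"mouse", "house mouse"},
    "Homo sapiens": {"human"},
}


def normalise(value: str) -> Optional[str]:
    """Map a free-text value to its controlled term, or None if unknown."""
    # Collapse repeated whitespace and ignore case before comparing.
    cleaned = " ".join(value.split()).lower()
    for term, synonyms in CONTROLLED_TERMS.items():
        if cleaned == term.lower() or cleaned in synonyms:
            return term
    return None


if __name__ == "__main__":
    for raw in ["  Mouse ", "HUMAN", "mice"]:
        print(repr(raw), "->", normalise(raw))
```

Values that come back as `None` are the ones a curator would otherwise have to fix later, which is the workload the slide wants to avoid.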

Sample annotation
- Gene expression data only have meaning in the context of detailed sample descriptions
- If the data are going to be interpreted by independent parties, sample information has to be searchable and recorded in the database
- Controlled vocabularies and ontologies are needed for unambiguous sample description, e.g. cell type, compound, species, developmental stage
- None of this is trivial

MGED Biomaterial (sample) Ontology
- Under construction by the MGED OWG, using OilEd
- Motivated by MIAME and coordinated with the ArrayExpress database model
- We are defining classes, providing constraints, and adding terms
- Now being extended to describe experiments and arrays

MGED BioMaterial Internal Terms

Internal and External Terms combined

Examples of external ontologies and CVs
- NCBI taxonomy database
- Jackson Lab mouse strains and genes
- Edinburgh mouse atlas anatomy
- HUGO nomenclature for human genes
- Chemical and compound ontologies, e.g. CAS
- TAIR
- Flybase anatomy
- GO (www.geneontology.org)
- GOBO ontologies

Example Annotation  Sample source and treatment description, and its correct annotation using the MGED BioMaterial Ontology classes and corresponding external references: “Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”

MGED BioMaterial Ontology annotation of the example (class: instance [external reference])
BioMaterialDescription
Biosource Property
Organism: Mus musculus musculus, id: 39442 [NCBI Taxonomy]
Age: 7 weeks after birth
DevelopmentStage: Stage 28 [Mouse Anatomical Dictionary]
Sex: Female
StrainOrLine: C57BL/6 [International Committee on Standardized Genetic Nomenclature for Mice]
BiosourceProvider: Charles River, Japan
OrganismPart: Liver [Mouse Anatomical Dictionary]
BioMaterialManipulation
EnvironmentalHistory
CultureCondition
Temperature: 22 ± 2 °C
Humidity: 55 ± 5%
Light: 12 hours light/dark cycle
PathogenTests: Specified pathogen free conditions
Water: ad libitum
Nutrients: MF, Oriental Yeast, Tokyo, Japan
Treatment
CompoundBasedTreatment
(Compound): Fenofibrate, CAS 49562-28-9 [ChemIDplus]
(Treatment_application): in vivo, oral gavage
(Measurement): 100mg/kg body weight
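The annotation on this slide can also be held as structured records, which is what makes it queryable. The sketch below stores a subset of the class, instance and external-reference values from the slide; the pairing of each value with its external reference follows one reading of the flattened table and is an assumption, as are the `Annotation` type and the function name.

```python
from typing import List, NamedTuple, Optional


class Annotation(NamedTuple):
    ontology_class: str          # MGED BioMaterial Ontology class name
    value: str                   # instance or term chosen by the annotator
    external_ref: Optional[str]  # external resource the value comes from


# Subset of the slide's example; the value/reference pairing is assumed.
SAMPLE_ANNOTATION: List[Annotation] = [
    Annotation("Organism", "Mus musculus musculus, id: 39442", "NCBI Taxonomy"),
    Annotation("Age", "7 weeks after birth", None),
    Annotation("DevelopmentStage", "Stage 28", "Mouse Anatomical Dictionary"),
    Annotation("Sex", "Female", None),
    Annotation("StrainOrLine", "C57BL/6",
               "International Committee on Standardized Genetic "
               "Nomenclature for Mice"),
    Annotation("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    Annotation("Compound", "Fenofibrate, CAS 49562-28-9", "ChemIDplus"),
    Annotation("Measurement", "100mg/kg body weight", None),
]


def classes_with_external_reference(records: List[Annotation]) -> List[str]:
    """Return the class names whose value is backed by an external resource."""
    return [r.ontology_class for r in records if r.external_ref is not None]


if __name__ == "__main__":
    print(classes_with_external_reference(SAMPLE_ANNOTATION))
```

Fields without an external reference (age, sex, dose) are the ones that still depend on local conventions for units and wording.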

Forms make this annotation easier

Sanger Human and Mouse Array Annotation Pipeline
- Takes sequences present on the array
- Runs Exonerate (alignment algorithm) against the NCBI assembly (from Ensembl)
- Inherits annotation from Ensembl: gene names, database references, GO terms
- Provides a common annotation in tab delimited format, which can be parsed to MAGE-ML or used in ADF
- Pipeline available for external users to beta test

Example BioSequence annotation

Future
- Further data acquisition for ArrayExpress
- Update ArrayExpress to newest MAGE-OM
- V2.0 MIAMExpress, domain specific, portable
- Further ontology development and integration into tools, use of OWL
- Curation tools (other than grep, Perl scripts)
- Improved query interface for AE
- ArrayExpress update tool
- Data exchange between public databases

Resources
- Schemas for both ArrayExpress and MIAMExpress, access to code
- MAGE-ML examples: Arrays, Expts, Protocols
- MIAME glossary, MAGE-MIAME-ontology mappings
- List of ontology resources from MGED pages
- Help in establishing pipelines
- MGED software MAGEstk
- Curation, help and advice
www.mged.org
www.ebi.ac.uk/arrayexpress

Acknowledgments
- Microarray Informatics Team, EBI
- Robert Stevens, Jeremy Rogers and colleagues, University of Manchester, UK
- Chris Stoeckert, University of Pennsylvania, USA

Quote ‘Most biologists would rather share their toothbrush than share a gene name’ Michael Ashburner
