Standards and Ontologies for Data Annotation Helen Parkinson Microarray Informatics Team European Bioinformatics Institute NBN-EBI Course, October 2002.


1

2 Standards and Ontologies for Data Annotation Helen Parkinson Microarray Informatics Team European Bioinformatics Institute NBN-EBI Course, October 2002

3  Annotation, problems and solutions  What is an ontology?  Examples and uses of existing ontologies  ArrayExpress – a database for microarray gene expression data  Use of ontologies to annotate microarray data in ArrayExpress Talk structure

4 Informatics resources for biologists  Over 500 databanks and analysis tools that work over various resources  Repositories of knowledge and data at various levels: primary and secondary databases, and interfaces, e.g. EMBL, Swissprot, Ensembl  Knowledge often held as free text; limited use made of controlled vocabularies  Enormous amount of semantic heterogeneity and poor query facilities

5 Search for “Ssp1” gene in DDBJ/EMBL/Genbank  1: AB027913 Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76  2: AL441624 S.pombe chromosome I cosmid c110  3: AL159180 S.pombe chromosome I P1 p14E8  4: AL049609 S.pombe chromosome III cosmid c297  5: AL136235 S.pombe chromosome I cosmid c664  6: D45882 Yeast ssp1 gene for protein kinase, complete cds  7: X59987 S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)

6 Gene synonyms  Problem, a name can identify different genes even in a well annotated organism like S.pombe Ssp1=SPAC664.11 SPAC110.04c SPCC297.03
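The ambiguity on this slide can be made concrete with a small lookup. This is a minimal sketch, assuming the three systematic identifiers listed above are what the name Ssp1 can resolve to; the table and function names are illustrative and do not come from any database API.

```python
from collections import defaultdict

# (gene name, systematic identifier) pairs taken from the slide above.
NAME_TO_SYSTEMATIC_ID = [
    ("Ssp1", "SPAC664.11"),
    ("Ssp1", "SPAC110.04c"),
    ("Ssp1", "SPCC297.03"),
]


def build_synonym_index(pairs):
    """Group systematic identifiers under a case-insensitive gene name."""
    index = defaultdict(set)
    for name, systematic_id in pairs:
        index[name.lower()].add(systematic_id)
    return index


def resolve(index, name):
    """Return every systematic identifier a free-text name could refer to."""
    return sorted(index.get(name.lower(), set()))


index = build_synonym_index(NAME_TO_SYSTEMATIC_ID)
candidates = resolve(index, "SSP1")
# A name is ambiguous when it maps to more than one identifier.
is_ambiguous = len(candidates) > 1
```

A free-text search for the name returns all three loci, whereas a query on a systematic identifier is unambiguous.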

7 Annotation problems  Free text entries in databases cause problems: they are human readable but not machine readable, and humans are error prone  Example: many genes and proteins can have the same name, even in well annotated organisms  Many important projects have no coordination of standards, e.g. for gene naming or for describing developmental stages  Whose responsibility is this? The community's?

8 Possible solutions  Using ontologies, like the gene ontology but covering many more areas of biology than gene products  What is an ontology and how can they be used?  Thinking about how you describe the experiment as you start it

9 What is an ontology?  Captures knowledge for both humans and computer applications  Has a set of vocabulary definitions that capture a community’s knowledge of a domain  `An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.‘  It is more than a controlled vocabulary, it has structure (but a cv is a good place to start)

10 What does an ontology do?  Captures knowledge  Creates a shared understanding – between humans and for computers  Makes knowledge machine processable  Makes meaning explicit – by definition and context

11 Range of ontologies  [Diagram: a spectrum from informal to formal]  Catalog/ID  Terms/glossary  Thesauri, “narrower term” relation  Informal is-a  Formal is-a  Formal instance  Frames (properties)  Value restrictions  Disjointness, inverse, part-of…  General logical constraints  Examples placed along the spectrum: Gene Ontology, Mouse Anatomy, EcoCyc, TAMBIS, MGED  Slide from Robert Stevens, University of Manchester

12 Three types of ontologies  Domain-oriented, which are either domain specific (e.g. E. coli) or domain general (e.g. gene function)  Task-oriented, which are either task specific (e.g. annotation analysis) or task general (e.g. problem solving)  Generic, which capture common high level concepts, such as Physical, Abstract and Substance

13 How can ontologies be used?  Community reference -- neutral authoring.  Either defining database schema or defining a common vocabulary for database annotation (avoiding free text).  Providing common access to information. Ontology-based search by forming queries over databases.  Understanding database annotation and technical literature.  Guiding and interpreting analyses and hypothesis generation

14 Components of an ontology  Class, container for information, has a definition and a relationship to other classes (is-a, part-of, kind-of)  Instances, terms that are contained within a class

15 Example of a class, subclass relationship  Class def African elephant  sub-class of elephant  slot constraint comes from  slot has filler Africa  This is just a formalised way to say that African elephants are a type of elephant that come from Africa, but it is machine readable  Ian Horrocks, creator of OilEd
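The elephant definition can be mirrored in a few lines of code to show what "machine readable" buys you. This is a rough sketch only: the `OntologyClass` type, its `is_a` and `filler` methods, and the slot name `comes_from` are assumptions made for illustration, and they are not the syntax or data model of any ontology editor.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class OntologyClass:
    """A class with one optional superclass and a set of slot fillers."""
    name: str
    parent: Optional["OntologyClass"] = None
    slot_fillers: Dict[str, str] = field(default_factory=dict)

    def is_a(self, other: "OntologyClass") -> bool:
        """True if this class is `other` or a (transitive) sub-class of it."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False

    def filler(self, slot: str) -> Optional[str]:
        """Look up a slot filler, inheriting from superclasses if needed."""
        node = self
        while node is not None:
            if slot in node.slot_fillers:
                return node.slot_fillers[slot]
            node = node.parent
        return None


elephant = OntologyClass("elephant")
african_elephant = OntologyClass(
    "African elephant",
    parent=elephant,
    slot_fillers={"comes_from": "Africa"},
)
```

With this structure a program can answer "is an African elephant an elephant?" and "where does it come from?" without parsing free text.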

16 Examples of usable external ontologies and cv’s  NCBI taxonomy database  Jackson Lab mouse strains and genes  Edinburgh mouse atlas anatomy  Chemical and compound Ontologies, e.g. CAS  Species specific, fly, A.thaliana,  GO  GOBO ontologies  various pathology ontologies

17 ICD10  International statistical classification of diseases and related health problems… or what people die of  Useful information that should be included in databases, e.g. microarray, health related, DeCode  International, defines disease etc. universally  But too much definition can be problematic…

18 Too much definition can be bad  ICD-9 (E826) 8  READ-2 (T30..) 81  READ-3 87  ICD-10 (V10-19) 587  V31.22 Occupant of three-wheeled motor vehicle injured in collision with pedal cycle, person on outside of vehicle, nontraffic accident, while working for income  W65.40 Drowning and submersion while in bath-tub, street and highway, while engaged in sports activity  X35.44 Victim of volcanic eruption, street and highway, while resting, sleeping, eating or engaging in other vital activities

19 And coverage may not be universal  ICD10 includes accidents in space  But not accidents involving collisions between cars and moose  Most Scandinavians are more likely to be injured colliding with a moose than in orbit

20 Summary  Ontologies can define terms and structure knowledge for both humans and machines, many do this successfully  If over engineered they are no longer human readable, and it is too hard to use them to annotate data

21 Introducing ArrayExpress - a database which needs an ontology

22 ArrayExpress  Public database for gene expression data  Aims to store well annotated (MIAME compliant) and well structured data  MIAME  Recorded info should be sufficient to interpret and replicate the experiment  Information should be structured so that querying and automated data analysis and mining are feasible*  *Brazma et al., Nature Genetics, 2001

23 Infrastructure at the EBI  [Diagram components]  ArrayExpress (Oracle)  MIAMExpress (MySQL)  EBI Expression Profiler  Other public microarray databases (GEO, CIBEX)  External bioinformatic databases  Web (www) access for queries, submissions and data analysis  MAGE-ML submission pipelines from array manufacturers, LIMS, microarray software and data analysis software  MAGE-ML export and MAGE-ML files from local MIAMExpress installations

24 ArrayExpress Conceptual Model  [Diagram components]  Experiment  Hybridisation  Array  Sample  Source (e.g., Taxonomy)  Gene (e.g., EMBL)  Data  Normalisation  Publication  External links

25

26 Public Data Access  Data export as tab delimited file  Export to Expression profiler  As MAGE-ML, from query interface  Arrays exportable as tab delimited file

27 Getting data in  From local LIMS system  From other microarray database, eg BASE, Rosetta Resolver, SMD  Via MIAMExpress, point and click tool from EBI

28 MIAMExpress submission and annotation tool  Based on MIAME concepts and questionnaire  Perl-CGI, MySQL database  Experiment, Array, Protocol submissions  Generic annotation tool, all expt types  Exports MAGE-ML

29

30 Array Definition Format  Tab delimited file format describing array  Defines relationships between features and sequences  Provides sequence annotation, database references  Exportable from db too
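Because the format is tab delimited, the feature-to-sequence relationship can be read with a standard CSV parser. The sketch below is only an illustration: the column names and every value in `EXAMPLE_ADF` are hypothetical placeholders, not the actual ADF specification or real accessions.

```python
import csv
import io

# Hypothetical ADF-like content. Column names and values are placeholders
# invented for this example, not the real ADF column set.
EXAMPLE_ADF = (
    "feature_id\tsequence_id\tdatabase\taccession\n"
    "F1\tSEQ_A\tEMBL\tEXAMPLE0001\n"
    "F2\tSEQ_A\tEMBL\tEXAMPLE0001\n"
    "F3\tSEQ_B\tEMBL\tEXAMPLE0002\n"
)


def read_feature_to_sequence(handle):
    """Map each array feature to the sequence it carries."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["feature_id"]: row["sequence_id"] for row in reader}


feature_to_sequence = read_feature_to_sequence(io.StringIO(EXAMPLE_ADF))
```

Several features may point at the same sequence, which is why the relationship is recorded explicitly rather than inferred from position.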

31 ArrayExpress curation effort  User support and help documentation  Curation at source (not destination)  Support on ontologies and CV’s  Minimize free text, removal of synonyms  MIAME encouragement  Help on MAGE-ML  Goal: to provide high-quality, well-annotated data to allow automated data analysis

32 Why do we need an ontology for the database?  To help users annotate their data usefully and easily  To perform structured queries  To accurately compare data  To avoid problems with free text searching  To avoid excessive curation workload in future

33 Sample annotation  Gene expression data only have meaning in the context of detailed sample descriptions  If the data are going to be interpreted by independent parties, sample information has to be searchable and recorded in the database  Controlled vocabularies and ontologies are needed for unambiguous sample description, e.g. cell type, compound, species, developmental stage  None of this is trivial
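One way to picture what a controlled vocabulary does at submission time is a simple membership check. This is a sketch under stated assumptions: the field names and the tiny term lists are hypothetical stand-ins for real resources such as a taxonomy or an anatomy vocabulary.

```python
# Hypothetical controlled vocabularies. Real ones would be loaded from
# external resources; these short lists exist only for the example.
CONTROLLED_VOCABULARY = {
    "species": {"Mus musculus", "Homo sapiens"},
    "organism_part": {"liver", "kidney"},
}


def find_uncontrolled_terms(annotation, vocabulary):
    """Return (field, value) pairs whose value is not an approved term.

    Fields that have no vocabulary are ignored rather than rejected.
    """
    problems = []
    for field_name, value in annotation.items():
        allowed = vocabulary.get(field_name)
        if allowed is not None and value not in allowed:
            problems.append((field_name, value))
    return problems


submitted = {"species": "mouse", "organism_part": "liver"}
problems = find_uncontrolled_terms(submitted, CONTROLLED_VOCABULARY)
```

Here the free-text value "mouse" is flagged, so the submitter can be asked to pick the approved term instead.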

34 MGED Biomaterial (sample) Ontology  Under construction – by MGED OWG – Using OILed  Motivated by MIAME and coordinated with the ArrayExpress database model  We are defining classes, providing constraints, and adding terms  Now being extended to describe experiments and arrays

35 MGED BioMaterial Internal Terms

36

37 Internal and External Terms combined

38 Examples of external ontologies and cv’s  NCBI taxonomy database  Jackson Lab mouse strains and genes  Edinburgh mouse atlas anatomy  HUGO nomenclature for Human genes  Chemical and compound Ontologies, e.g. CAS  TAIR  Flybase anatomy  GO (www.geneontology.org)  GOBO ontologies

39 Example Annotation  Sample source and treatment description, and its correct annotation using the MGED BioMaterial Ontology classes and corresponding external references: “Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”

40 MGED BioMaterial Ontology
Classes: BioMaterialDescription, Biosource Property, Organism, Age, DevelopmentStage, Sex, StrainOrLine, BiosourceProvider, OrganismPart, BioMaterialManipulation, EnvironmentalHistory, CultureCondition, Temperature, Humidity, Light, PathogenTests, Water, Nutrients, Treatment, CompoundBasedTreatment (Compound) (Treatment_application) (Measurement)
Instances: 7 weeks after birth; Female; Charles River, Japan; 22 ± 2 °C; 55 ± 5%; 12 hours light/dark cycle; Specified pathogen free conditions; ad libitum; MF, Oriental Yeast, Tokyo, Japan; in vivo, oral gavage; 100 mg/kg body weight
External References (sources): NCBI Taxonomy; Mouse Anatomical Dictionary; International Committee on Standardized Genetic Nomenclature for Mice; International Committee on Standardized Genetic Nomenclature for Mice; Mouse Anatomical Dictionary; ChemIDplus
External References (terms): Mus musculus musculus id: 39442; Stage 28; C57BL/6; Liver; Fenofibrate, CAS 49562-28-9
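The annotation on this slide can be pictured as a list of (class, value, external reference) records. The sketch below is one reading of the flattened slide, not the MGED data model: the `Annotation` type is hypothetical, only a subset of the classes is shown, and the pairing of each value with its external reference is an assumption inferred from the order of the items.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    ontology_class: str
    value: str
    external_reference: Optional[str] = None  # None means no external source


# Values are copied from the slide; the class-to-reference pairing is assumed.
sample_annotation = [
    Annotation("Organism", "Mus musculus musculus", "NCBI Taxonomy id: 39442"),
    Annotation("Age", "7 weeks after birth"),
    Annotation("Sex", "Female"),
    Annotation(
        "StrainOrLine",
        "C57BL/6",
        "International Committee on Standardized Genetic Nomenclature for Mice",
    ),
    Annotation("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    Annotation("Compound", "Fenofibrate", "ChemIDplus, CAS 49562-28-9"),
]


def without_external_reference(annotations):
    """List the classes whose values are not backed by an external resource."""
    return [a.ontology_class for a in annotations if a.external_reference is None]


unreferenced = without_external_reference(sample_annotation)
```

Separating internal terms from externally referenced ones makes it easy to see which parts of a sample description can be cross-checked against another resource.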

41 Forms make this annotation easier

42 Sanger Human and Mouse Array Annotation Pipeline  Takes sequences present on array  Runs Exonerate (alignment algorithm) against the NCBI assembly (from Ensembl)  Inherits annotation from Ensembl: gene names, database references, GO terms  Provides a common annotation in tab delimited format, which can be parsed to MAGE-ML or used in ADF  Pipeline available for external users to beta test

43 Example BioSequence annotation

44 Future  Further data acquisition for ArrayExpress  Update ArrayExpress to newest MAGE-OM  V2.0 MIAMExpress, domain specific, portable  Further ontology development and integration into tools, use of OWL  Curation tools (other than grep, Perl scripts)  Improved query interface for AE  ArrayExpress update tool  Data exchange between public databases

45 Resources  Schemas for both ArrayExpress and MIAMExpress, access to code  MAGE-ML examples, Arrays, Expts, Protocols  MIAME glossary, MAGE-MIAME-ontology mappings  List of ontology resources from MGED pages  Help in establishing pipelines  MGED software MAGEstk  Curation, help and advice  www.mged.org  www.ebi.ac.uk/arrayexpress

46 Acknowledgments  Microarray Informatics Team, EBI  Robert Stevens, Jeremy Rogers and colleagues, University of Manchester, UK  Chris Stoeckert, University of Pennsylvania, USA

47 Quote ‘Most biologists would rather share their toothbrush than share a gene name’ Michael Ashburner

