Published by Roger Wilson. Modified over 9 years ago.
2
The importance of meta data capture: problems and solutions
Helen Parkinson, Microarray Informatics Team, European Bioinformatics Institute
NERC Meta Data Meeting, Bristol, April 2003
3
Talk structure: Metadata; Annotation problems; What is an ontology?; The MGED ontology; Annotation and standards for microarray data; Annotation tools
4
The meta data challenge: Meta data is data about data. Microarray experiments (should) have lots of meta data of different types. Meta data varies by experiment and by community. Meta data is often free text. Free text is evil in a database context.
5
Informatics resources for biologists: Over 500 databanks and analysis tools that work over various resources. Repositories of knowledge and data at various levels: primary and secondary databases, and interfaces, e.g. EMBL, Swissprot, Ensembl. Knowledge is often held as free text; limited use is made of controlled vocabularies. Enormous amount of semantic heterogeneity and poor query facilities. Lots of effort goes into filtering databases to add value, e.g. RefSeq from Genbank.
6
What are the problems with free text? It’s slow to search. It’s written by humans, and humans are error prone (typos, synonyms). Humans have limited processing power. Meaning is not explicit in free text. Two options: develop tools for NLP (~40% efficient), or eliminate free text.
7
Why is there so much free text? Historically there was a small amount of data and a limited number of databases that held the data. Humans like free text and are resistant to change. There was nothing better than free text, though use of simple lists helped (e.g. SP keywords).
8
Gene names and free text: The EMBL flat file controls only species nomenclature; everything else is free text (genes, keywords, descriptions etc). Searching primary databases which are redundant and annotated with free text is hard work and relies upon the user inferring information from what is supplied. Reproducing this for new databases isn’t desirable. Without resources to help them, curators cannot control free text efficiently.
9
Search for “Ssp1” gene in DDBJ/EMBL/Genbank using SRS
gene=“Ssp1” Species=“S.pombe”
1: AB027913 Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76
2: AL441624 S.pombe chromosome I cosmid c110
3: AL159180 S.pombe chromosome I P1 p14E8
4: AL049609 S.pombe chromosome III cosmid c297
5: AL136235 S.pombe chromosome I cosmid c664
6: D45882 Yeast ssp1 gene for protein kinase, complete cds
7: X59987 S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)
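The search result above shows one gene name retrieving entries for two different products (a protein kinase and a mitochondrial Hsp70). A minimal Python sketch of that ambiguity follows; the accessions and descriptions are copied from hits 1, 6 and 7, while the function name and the crude keyword matching are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hits 1, 6 and 7 from the search above: (accession, free-text description).
HITS = [
    ("AB027913", "Schizosaccharomyces pombe gene for Ser/Thr protein kinase, "
                 "partial cds, clone:TA76"),
    ("D45882", "Yeast ssp1 gene for protein kinase, complete cds"),
    ("X59987", "S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)"),
]


def group_by_product(hits):
    """Group accessions by the product named in the free-text description.

    The keyword matching is a deliberately crude stand-in for the inference
    a human reader has to make from free text.
    """
    groups = defaultdict(list)
    for accession, description in hits:
        text = description.lower()
        if "kinase" in text:
            groups["protein kinase"].append(accession)
        elif "hsp70" in text:
            groups["mitochondrial Hsp70"].append(accession)
        else:
            groups["unknown"].append(accession)
    return dict(groups)


if __name__ == "__main__":
    for product, accessions in group_by_product(HITS).items():
        print(product, accessions)
```

With a controlled gene identifier in place of the free-text name, this grouping step would not be needed at all.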
10
Whose responsibility? Many important projects have no coordination of standards, e.g. for gene nomenclature, disease state, developmental stage. There are competing resources in some domains, and none in others. Successful projects are built by a community or with community input, e.g. GO.
11
What is an ontology? Captures knowledge for both humans and computer applications. Has a set of vocabulary definitions that capture a community’s knowledge of a domain. “An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.” It is more than a controlled vocabulary: it has structure (but a CV is a good place to start).
12
Advantages of using ontologies: Meaning is explicit. Meaning is human and computer readable. Ease of updating: no need to find terms in free text and change them. Data transfer is possible without loss of meaning. Reasoning to aid queries, annotation etc.
13
Building Ontologies: Simple lists work well, but adding structure adds reasoning power. Class: a container for information; it has a definition and a relationship to other classes (is-a, part-of, kind-of). Instances: terms that are contained within a class.
14
Example of a class, subclass relationship and a constraint
Class Domestic Cat
sub-class of Pet
slot constraint cleans itself, eats stinky food
slot has filler
slot has instance Suki
This is just a formalised way to say that a domestic cat is a pet that you don’t have to clean, but it is machine readable, and I can use it to classify.
Ian Horrocks, creator of OilEd
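To illustrate what "machine readable" buys, here is a minimal Python sketch of the Domestic Cat example with a simple is-a walk. It is not OilEd syntax or any real ontology API: the dictionaries, function names and the extra "Animal" level are assumptions added so that the transitive step is visible.

```python
# Toy ontology: class name -> parent class (is-a).
# "Animal" is an assumed extra level, not part of the slide's example.
IS_A = {"DomesticCat": "Pet", "Pet": "Animal"}

# Slot constraints attached to a class (from the slide's example).
SLOTS = {"DomesticCat": {"cleans_itself": True, "eats": "stinky food"}}

# Instances: individual name -> the class it belongs to.
INSTANCES = {"Suki": "DomesticCat"}


def ancestors(cls):
    """Return every superclass of cls, nearest first, by following is-a."""
    chain = []
    while cls in IS_A:
        cls = IS_A[cls]
        chain.append(cls)
    return chain


def is_a(instance, cls):
    """True if the instance belongs to cls directly or via a superclass."""
    direct = INSTANCES[instance]
    return cls == direct or cls in ancestors(direct)


def instances_of(cls):
    """Classify: list all instances that fall under cls."""
    return sorted(name for name in INSTANCES if is_a(name, cls))


if __name__ == "__main__":
    print(ancestors("DomesticCat"))
    print(is_a("Suki", "Pet"))
    print(instances_of("Animal"))
```

A flat list of terms could answer "is Suki a domestic cat?" but not "is Suki a pet?"; the is-a links are what make the second query possible.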
15
MIAME and ArrayExpress
16
MIAME: Minimum Information About a Microarray Experiment. MIAME is a guideline for microarray experimenters to describe their data so that sufficient information is recorded to correctly interpret and verify the experiments and to replicate them, and so that structured information is recorded to query and correctly retrieve the data and to analyse it.
17
MIAME: 6 parts of a microarray experiment
Experiment, Hybridisation, Sample, Array, Normalisation, Data
Details recorded: Sample source; Sample treatments; Extraction protocol; Labeling protocol; Array design information; Location of each element; Description of each element; Control array elements; Statistical treatment; Image; Scanning protocol; Software specifications; Quantification matrix; Analysis protocol; Software specifications
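The six parts listed above can be treated as a checklist. The Python sketch below reports which parts of a draft submission are absent or empty. MIAME is a guideline, not a schema, so the dictionary layout, the field names inside each part and the function name are all hypothetical.

```python
# The six parts of a microarray experiment named on the slide.
MIAME_PARTS = (
    "Experiment", "Hybridisation", "Sample", "Array", "Normalisation", "Data",
)


def missing_parts(submission):
    """Return the MIAME parts that are absent or empty in a submission dict."""
    return [part for part in MIAME_PARTS if not submission.get(part)]


# Hypothetical draft submission; inner field names are illustrative only.
draft = {
    "Experiment": {"description": "hypothetical example"},
    "Sample": {"source": "liver", "treatment": "fenofibrate"},
    "Array": {"design": "hypothetical array design"},
    "Data": {},  # present but empty, so it is reported as missing
}

if __name__ == "__main__":
    print(missing_parts(draft))
```

A real submission tool would check far more than presence, but even this level of structure cannot be enforced on a free-text description.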
18
ArrayExpress: A public repository for microarray data, MIAME compliant. Uses the MAGE model (designed to hold MIAME compliant data). Holds public and private data. Uses controlled vocabulary where possible. Can represent complex metadata and reference external resources, e.g. databases and ontologies.
19
Infrastructure at the EBI [diagram]. Components shown: ArrayExpress (Oracle); MIAMExpress (MySQL); EBI Expression Profiler; other public microarray databases (GEO, CIBEX); external bioinformatic databases; array manufacturers; LIMS; microarray software; data analysis software; local MIAMExpress installations. Flows shown: queries and data analysis via www; submissions as MAGE-ML files; MAGE-ML export; MAGE-ML pipelines.
20
Desirable Queries: Show me all experiments where gene x is on the array. Show me all the experiments where organism x is treated by compound Y. Return all experiments using developmental stage X, disease stage Y – Sort by platform type – Which are untreated? Treated? Treated by what? How comparable are these?
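Queries like these only become simple once annotation is held as controlled values rather than free text. The Python sketch below shows the idea over a few in-memory records; the experiment identifiers, field names and values are invented for illustration and do not reflect the ArrayExpress schema.

```python
# Hypothetical annotated experiments; every value here is illustrative.
EXPERIMENTS = [
    {"id": "exp1", "organism": "Mus musculus", "compound": "fenofibrate",
     "platform": "spotted cDNA"},
    {"id": "exp2", "organism": "Mus musculus", "compound": None,
     "platform": "oligonucleotide"},
    {"id": "exp3", "organism": "Homo sapiens", "compound": "fenofibrate",
     "platform": "oligonucleotide"},
]


def treated_with(experiments, organism, compound):
    """Experiments where the given organism is treated by the given compound."""
    return [e["id"] for e in experiments
            if e["organism"] == organism and e["compound"] == compound]


def untreated(experiments):
    """Experiments with no compound treatment recorded."""
    return [e["id"] for e in experiments if e["compound"] is None]


def sorted_by_platform(experiments):
    """Experiment ids ordered by platform type, then by id."""
    ordered = sorted(experiments, key=lambda e: (e["platform"], e["id"]))
    return [e["id"] for e in ordered]


if __name__ == "__main__":
    print(treated_with(EXPERIMENTS, "Mus musculus", "fenofibrate"))
    print(untreated(EXPERIMENTS))
    print(sorted_by_platform(EXPERIMENTS))
```

Exact equality tests such as `e["compound"] == compound` only work because each field holds one agreed term; with free text, every synonym and typo would need its own handling.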
21
MAGE-OM and Ontology Entries: ArrayExpress uses the MAGE-OM. Requires OntologyEntries of 3 types – Simple lists, e.g. image format: GIF, TIFF etc – Infinite, e.g. anything that could be meta data relating to a sample: species, disease state – In between: types of protocols, types of data transformation, types of BioSequence etc
22
ArrayExpress Conceptual Model [diagram]. Entities shown: Experiment, Hybridisation, Array, Sample, Normalisation, Data. External links shown: Publication, Source (e.g., Taxonomy), Gene (e.g., EMBL).
23
MGED: MGED is the Microarray Gene Expression Database group. It’s a group that develops resources, including the object model and ontologies, for the community. It includes SMD, TIGR, Affy, Agilent etc. Conference in Aix-en-Provence, Sep 2003. www.mged.org
24
The MGED Ontology: A framework for describing functional genomics experiments
25
The MGED Ontology
An ontology for microarray experiments
– Parts are applicable to describing experiments in general, with a focus on microarray
– Built by people with legacy data problems: EBI, U. Penn, TIGR, SMD, UC Berkeley, NIH, plus contributions from the mailing list
– Supports the ontology requirements of the MAGE model
Our approach to interfacing with other ontologies is “experimental”
– Provide a framework to point to other ontologies
– Know where to find different types of annotation
– Know how to interpret that annotation
26
The MGED ontology is not: Limited to any one domain or species. Modelling the real world or reinventing the MAGE model. Mapping terms from external non orthogonal ontologies. Recommending one ontology over another (though some are not freely available). Just for microarrays: the same concepts, like ‘assay’, apply to phenotype, and we want to make a reusable resource.
27
Current MGED Ontology
28
MGED Ontology: BioSequence
29
Organising by sub-classing: In MAGE, Spots relate to BioSequences; these have types (BioSequenceType): gene, intergenic sequence, clone, PCR product, EST.
30
MGED Ontology: limiting redundancy
31
MGED Ontology: OntologyEntry
32
Example of External Terms
33
Ontology in Browseable Form
34
Diagram labels: ArrayExpress, MIAMExpress, RAD; MAGE-ML data exchange; MGED Ontology.
Ontology instances are propagated to submission/annotation web forms.
User defined terms are collected via forms.
User defined terms are curated before inclusion in the ontology.
MGED Ontology classes shown: BiomaterialDescription, Sex, Gender.
Gender documentation: Subclass of sex applicable to heterogametic species (i.e., those in which the sexes produce gametes of markedly different size). Males produce small numerous gametes. Females produce small numbers of large gametes. Hermaphrodites are individuals with both male and female characteristics. Mixed refers to a population of individuals with more than one type of gender.
Used in individuals: female, hermaphrodite, male, mixed_sex, unknown_sex
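The collect-then-curate loop described above can be sketched in a few lines of Python. The controlled terms are the Gender instances listed on the slide; the normalisation rule (trim, lower-case, spaces to underscores) and the function name are assumptions, not the behaviour of MIAMExpress.

```python
# Instances of the Gender class as listed on the slide.
GENDER_TERMS = {"female", "hermaphrodite", "male", "mixed_sex", "unknown_sex"}


def triage(submitted, controlled=GENDER_TERMS):
    """Split submitted terms into accepted terms and a curation queue.

    A term is accepted if, after light normalisation, it matches a
    controlled term. Anything else is kept verbatim for a curator to
    review before it is considered for inclusion in the ontology.
    """
    accepted, for_curation = [], []
    for raw in submitted:
        term = raw.strip().lower().replace(" ", "_")
        if term in controlled:
            accepted.append(term)
        else:
            for_curation.append(raw)
    return accepted, for_curation


if __name__ == "__main__":
    ok, queue = triage(["Female", " male", "mixed sex", "F"])
    print(ok)
    print(queue)
```

Keeping the rejected input verbatim matters: the curator needs to see what the submitter actually typed in order to decide whether it is a synonym, a typo or a genuinely new term.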
38
Turning free text into controlled annotation Sample source and treatment description, and its correct annotation using the MGED ontology “Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”
39
MGED BioMaterial Ontology: classes, instances and external references
Classes without a value on the slide: BioMaterialDescription, Biosource Property, BioMaterialManipulation, EnvironmentalHistory, CultureCondition, Treatment, CompoundBasedTreatment
Organism: Mus musculus musculus, id: 39442 (external reference: NCBI Taxonomy)
Age: 7 weeks after birth
DevelopmentStage: Stage 28 (external reference: Mouse Anatomical Dictionary)
Sex: Female
StrainOrLine: C57BL/6 (external reference: International Committee on Standardized Genetic Nomenclature for Mice)
BiosourceProvider: Charles River, Japan
OrganismPart: Liver (external reference: Mouse Anatomical Dictionary)
Temperature: 22 ± 2 °C
Humidity: 55 ± 5%
Light: 12 hours light/dark cycle
PathogenTests: Specified pathogen free conditions
Water: ad libitum
Nutrients: MF, Oriental Yeast, Tokyo, Japan
(Compound): Fenofibrate, CAS 49562-28-9 (external reference: ChemIDplus)
(Treatment_application): in vivo, oral gavage
(Measurement): 100mg/kg body weight
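As a rough illustration of how this annotation differs from the original free-text sentence, the Python sketch below stores a few of the entries as category/value pairs with an optional external source. The `Annotation` class and its field names are a hypothetical simplification, not the MAGE-OM OntologyEntry definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """One controlled annotation: an ontology class, a value, and
    optionally the external resource the value comes from."""
    category: str
    value: str
    source: Optional[str] = None


# A subset of the annotations from the slide above.
SAMPLE = [
    Annotation("Organism", "Mus musculus musculus", "NCBI Taxonomy"),
    Annotation("Age", "7 weeks after birth"),
    Annotation("Sex", "Female"),
    Annotation("StrainOrLine", "C57BL/6",
               "International Committee on Standardized Genetic "
               "Nomenclature for Mice"),
    Annotation("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    Annotation("Compound", "Fenofibrate", "ChemIDplus"),
]


def values_for(annotations, category):
    """All values recorded for one ontology class."""
    return [a.value for a in annotations if a.category == category]


def external_sources(annotations):
    """Map each externally referenced category to its source."""
    return {a.category: a.source for a in annotations if a.source}


if __name__ == "__main__":
    print(values_for(SAMPLE, "OrganismPart"))
    print(sorted(external_sources(SAMPLE)))
```

Each fact in the sentence now has its own labelled slot, so "which tissue?" is a lookup by category rather than a text search.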
41
Care when using ontologies: Ontologies are rarely complete. Ontologies are fit for the purpose for which they are designed, not off the shelf solutions to your problem. Building ontologies is hard: it needs both domain experts and tools. A simple list is an excellent start.
42
Possible solutions: Think about what you want to query; this determines what you will annotate. Look for existing resources and use them if they are appropriate. Do the doable first. Share your resources.
43
Quote ‘Most biologists would rather share their toothbrush than share a gene name’ Michael Ashburner
44
Acknowledgements Chris Stoeckert, Trish Whetzel, Joe White, Cathy Ball, Paul Spellman - MGED ontology Microarray Informatics Team, EBI Robert Stevens, University of Manchester Funding: EMBL, EU, ILSI