
The importance of meta data capture – problems and solutions
Helen Parkinson, Microarray Informatics Team, European Bioinformatics Institute
NERC Meta Data Meeting, Bristol, April 2003

Talk structure
• Metadata
• Annotation problems
• What is an ontology?
• The MGED ontology
• Annotation and standards for microarray data
• Annotation tools

The meta data challenge
• Meta data is data about data
• Microarray experiments (should) have lots of meta data of different types
• Meta data varies by experiment and by community
• Meta data is often free text
• Free text is evil in a database context

Informatics resources for biologists
• Over 500 databanks and analysis tools that work over various resources
• Repositories of knowledge and data at various levels: primary and secondary databases, and interfaces, e.g. EMBL, Swiss-Prot, Ensembl
• Knowledge is often held as free text; limited use is made of controlled vocabularies
• Enormous amount of semantic heterogeneity and poor query facilities
• Lots of effort goes into filtering databases to add value, e.g. RefSeq from GenBank

What are the problems with free text?
• It’s slow to search
• It’s written by humans, and humans are error prone (typos, synonyms)
• Humans have limited processing power
• Meaning is not explicit in free text
• Two options: develop tools for NLP (~40% efficient), or eliminate free text

Why is there so much free text?
• Historically there was a small amount of data and a limited number of databases that held the data
• Humans like free text and are resistant to change
• There was nothing better than free text, though use of simple lists helped (e.g. SP keywords)

Gene names and free text
• The EMBL flat file controls only species nomenclature; everything else is free text: genes, keywords (KW), descriptions etc.
• Searching primary databases, which are redundant and annotated with free text, is hard work and relies upon the user inferring information from what is supplied
• Reproducing this for new databases isn’t desirable
• Without resources to help them, curators cannot control free text efficiently

Search for “Ssp1” gene in DDBJ/EMBL/GenBank using SRS
gene=“Ssp1” Species=“S.pombe”
1: AB027913 Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76
2: AL441624 S.pombe chromosome I cosmid c110
3: AL159180 S.pombe chromosome I P1 p14E8
4: AL049609 S.pombe chromosome III cosmid c297
5: AL136235 S.pombe chromosome I cosmid c664
6: D45882 Yeast ssp1 gene for protein kinase, complete cds
7: X59987 S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)
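A minimal sketch of why this result list is hard to use, assuming only the seven accession/description pairs shown on the slide. The function names `free_text_hits` and `exact_hits` are hypothetical, and the matching is plain substring search, not how SRS itself works.

```python
# Records copied from the SRS result list on the slide: (accession, description).
RECORDS = [
    ("AB027913", "Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76"),
    ("AL441624", "S.pombe chromosome I cosmid c110"),
    ("AL159180", "S.pombe chromosome I P1 p14E8"),
    ("AL049609", "S.pombe chromosome III cosmid c297"),
    ("AL136235", "S.pombe chromosome I cosmid c664"),
    ("D45882", "Yeast ssp1 gene for protein kinase, complete cds"),
    ("X59987", "S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)"),
]


def free_text_hits(term, records):
    """Case-insensitive substring search over the free-text description."""
    needle = term.lower()
    return [acc for acc, desc in records if needle in desc.lower()]


def exact_hits(term, records):
    """Case-sensitive substring search, as a naive exact query would behave."""
    return [acc for acc, desc in records if term in desc]


if __name__ == "__main__":
    # Only two of the seven descriptions mention the name at all, and they
    # describe different products (protein kinase vs. mitochondrial Hsp70).
    print(free_text_hits("Ssp1", RECORDS))
    # A case-sensitive match misses the record that spells the name "ssp1".
    print(exact_hits("Ssp1", RECORDS))
```

For the other five records the description does not name the gene, so the user has to infer the connection from what is supplied.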

Whose responsibility?
• Many important projects have no coordination of standards, e.g. for gene nomenclature, disease state, developmental stage
• There are competing resources in some domains, and none in others
• Successful projects are built by a community or with community input, e.g. GO

What is an ontology?
• Captures knowledge for both humans and computer applications
• Has a set of vocabulary definitions that capture a community’s knowledge of a domain
• ‘An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.’
• It is more than a controlled vocabulary: it has structure (but a CV is a good place to start)

Advantages of using ontologies
• Meaning is explicit
• Meaning is human and computer readable
• Ease of updating: no need to find terms in free text and change them
• Data transfer possible without loss of meaning
• Reasoning to aid queries, annotation etc.

Building Ontologies
• Simple lists work well, but adding structure adds reasoning power
• Class: a container for information; has a definition and a relationship to other classes (is-a, part-of, kind-of)
• Instances: terms that are contained within a class

Example of a class, subclass relationship and a constraint
Class Domestic Cat
sub-class of Pet
slot constraint cleans itself, eats stinky food
slot has filler
slot has instance Suki
Just a formalised way to say that a domestic cat is a pet that you don’t have to clean, but this is machine readable, and I can use it to classify.
Ian Horrocks, creator of OilEd
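The cat example can be sketched in a few lines of Python to show what "structure adds reasoning power" means in practice. This is an illustration only, not OilEd or any ontology language: the `OntologyClass` type, the `instances_of` helper and the two definition strings are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyClass:
    """A class has a name, a definition and an optional is-a parent."""
    name: str
    definition: str
    parent: Optional["OntologyClass"] = None   # the is-a relationship
    instances: set = field(default_factory=set)

    def is_a(self, other):
        """True if this class is `other` or a descendant of it."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


def instances_of(cls, all_classes):
    """Instances of `cls`, including those attached to its subclasses."""
    found = set()
    for candidate in all_classes:
        if candidate.is_a(cls):
            found |= candidate.instances
    return sorted(found)


# Definitions below are placeholders written for this sketch.
pet = OntologyClass("Pet", "An animal kept by a person")
domestic_cat = OntologyClass("DomesticCat", "A pet that cleans itself", parent=pet)
domestic_cat.instances.add("Suki")

if __name__ == "__main__":
    # Suki was only recorded as a DomesticCat, yet a query for pets finds her.
    print(instances_of(pet, [pet, domestic_cat]))
```

A flat list of terms could not answer the query for pets; the is-a link is what lets the instance be classified under the parent.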

MIAME and ArrayExpress

MIAME
Minimum Information About a Microarray Experiment. MIAME is a guideline for microarray experimenters to describe their data so that:
• Sufficient information is recorded to:
  – Correctly interpret and verify their experiments
  – Replicate the experiments
• Structured information is recorded to:
  – Query and correctly retrieve the data
  – Analyse the data

MIAME: 6 parts of a microarray experiment
• Parts: Experiment, Hybridisation, Sample, Array, Normalisation, Data
• Items shown under the parts: Sample source; Sample treatments; Extraction protocol; Labeling protocol; Array design information; Location of each element; Description of each element; Control array elements; Statistical treatment; Image; Scanning protocol; Software specifications; Quantification matrix; Analysis protocol; Software specifications

ArrayExpress
• A public repository for microarray data, MIAME compliant
• Uses the MAGE model (designed to hold MIAME compliant data)
• Holds public and private data
• Uses controlled vocabulary where possible
• Can represent complex metadata and reference external resources, e.g. databases and ontologies

Infrastructure at the EBI
[Diagram] Components: ArrayExpress (Oracle); MIAMExpress (MySQL); EBI Expression Profiler; other public microarray databases (GEO, CIBEX); external bioinformatic databases; array manufacturers; LIMS; microarray software; data analysis software; local MIAMExpress installations. Flows: MAGE-ML submissions, MAGE-ML export, MAGE-ML files, MAGE-ML pipelines, and queries and data analysis over the web (www).

Desirable Queries
• Show me all experiments where gene x is on the array
• Show me all the experiments where organism x is treated by compound Y
• Return all experiments using developmental stage X, disease stage Y
  – Sort by platform type
  – Which are untreated? Treated? Treated by what?
  – How comparable are these?
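Queries like these only become simple once the annotation is structured. The sketch below assumes a toy in-memory list of experiments; the records, field names and the `find` helper are all hypothetical and do not reflect the ArrayExpress schema or its query interface.

```python
# Hypothetical experiment records with controlled fields instead of free text.
EXPERIMENTS = [
    {"id": "EXP-1", "organism": "Mus musculus", "compound": "fenofibrate",
     "platform": "spotted cDNA"},
    {"id": "EXP-2", "organism": "Mus musculus", "compound": None,
     "platform": "oligonucleotide"},
    {"id": "EXP-3", "organism": "Schizosaccharomyces pombe", "compound": None,
     "platform": "spotted cDNA"},
]


def find(experiments, **criteria):
    """Return ids of experiments whose fields equal every given criterion."""
    return [
        exp["id"]
        for exp in experiments
        if all(exp.get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    # "Organism x treated by compound Y"
    print(find(EXPERIMENTS, organism="Mus musculus", compound="fenofibrate"))
    # "Which are untreated?" (no compound recorded in this toy model)
    print(find(EXPERIMENTS, organism="Mus musculus", compound=None))
    # "Sort by platform type"
    print([exp["id"] for exp in sorted(EXPERIMENTS, key=lambda e: e["platform"])])
```

Exact equality works here only because every value comes from a controlled list; with free-text fields each comparison would need synonym and typo handling.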

MAGE-OM and Ontology Entries
• ArrayExpress uses the MAGE-OM
• Requires OntologyEntries of 3 types
  – Simple lists, e.g. image format: GIF, TIFF etc.
  – Infinite, e.g. anything that could be meta data relating to a sample: species, disease state
  – In between: types of protocols, types of data transformation, types of BioSequence etc.

ArrayExpress Conceptual Model
[Diagram] Elements shown: Publication; External links; Hybridisation; Array; Sample; Source (e.g., Taxonomy); Experiment; Normalisation; Gene (e.g., EMBL); Data

MGED
• MGED is the Microarray Gene Expression Database group
• It’s a group that develops resources, including the object model and ontologies, for the community
• It includes SMD, TIGR, Affy, Agilent etc.
• Conference in Aix-en-Provence, Sep 2003: www.mged.org

The MGED Ontology: A framework for describing functional genomics experiments

The MGED Ontology
• An ontology for microarray experiments
  – Parts are applicable to describing experiments in general, with a focus on microarray
  – Built by people with legacy data problems: EBI, U.Penn, TIGR, SMD, UC Berkeley, NIH, plus contributions from the mailing list
  – Supports the ontology requirements of the MAGE model
• Our approach to interfacing with other ontologies is “experimental”
  – Provide a framework to point to other ontologies
  – Know where to find different types of annotation
  – How to interpret that annotation

The MGED ontology is not
• Limited to any one domain or species
• Modelling the real world or reinventing the MAGE model
• Mapping terms from external non orthogonal ontologies
• Recommending one ontology over another (though some are not freely available)
• Just for microarrays: the same concepts, like ‘assay’, apply to phenotype and we want to make a reusable resource

Current MGED Ontology

MGED Ontology: BioSequence

Organising by sub-classing
• In MAGE, Spots relate to BioSequences; these have types (BioSequenceType): gene, intergenic sequence, clone, PCR product, EST

MGED Ontology: limiting redundancy

MGED Ontology: OntologyEntry

Example of External Terms

Ontology in Browseable Form

[Diagram] MGED Ontology used by ArrayExpress, MIAMExpress and RAD
• MAGE-ML data exchange
• Ontology instances propagated to submission/annotation web forms
• User defined terms collected via forms
• Curation of user defined terms, before inclusion in the ontology
Example class path shown: MGED Ontology, BiomaterialDescription, Sex, Gender
documentation: Subclass of sex applicable to heterogametic species (i.e., those in which the sexes produce gametes of markedly different size). Males produce small numerous gametes. Females produce small numbers of large gametes. Hermaphrodites are individuals with both male and female characteristics. Mixed refers to a population of individuals with more than one type of gender.
used in individuals: female, hermaphrodite, male, mixed_sex, unknown_sex
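The form workflow on this slide (offer the ontology's instances, collect anything else for curation) can be sketched as follows. The five allowed terms are the individuals listed on the slide; the `annotate_sex` function, its normalisation rules and the `pending_curation` list are assumptions for illustration, not MIAMExpress code.

```python
# Individuals listed on the slide for this class.
SEX_TERMS = {"female", "hermaphrodite", "male", "mixed_sex", "unknown_sex"}


def annotate_sex(submitted, pending_curation):
    """Return the controlled term for a submitted value, or None.

    Values that do not match a controlled term are appended, unchanged,
    to `pending_curation` so a curator can review them before they are
    considered for inclusion in the ontology.
    """
    # Assumed normalisation: trim, lower-case, spaces to underscores.
    term = submitted.strip().lower().replace(" ", "_")
    if term in SEX_TERMS:
        return term
    pending_curation.append(submitted)
    return None


if __name__ == "__main__":
    pending = []
    print(annotate_sex(" Female ", pending))   # matches a controlled term
    print(annotate_sex("mixed sex", pending))  # matches after normalisation
    print(annotate_sex("femal", pending))      # typo: held back for curation
    print(pending)
```

Holding unknown values in a queue, instead of rejecting them or storing them as free text, is what lets the vocabulary grow under curator control.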

Turning free text into controlled annotation
• Sample source and treatment description, and its correct annotation using the MGED ontology
“Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”

MGED BioMaterial Ontology annotation of the sample
Each line gives an ontology class, then its instance or external term, with the external reference in brackets where one is used.
BioMaterialDescription
Biosource Property
• Organism: Mus musculus musculus, id: 39442 [NCBI Taxonomy]
• Age: 7 weeks after birth
• DevelopmentStage: Stage 28 [Mouse Anatomical Dictionary]
• Sex: Female
• StrainOrLine: C57BL/6 [International Committee on Standardized Genetic Nomenclature for Mice]
• BiosourceProvider: Charles River, Japan
• OrganismPart: Liver [Mouse Anatomical Dictionary]
BioMaterialManipulation
EnvironmentalHistory
CultureCondition
• Temperature: 22 ± 2 °C
• Humidity: 55 ± 5%
• Light: 12 hours light/dark cycle
• PathogenTests: Specified pathogen free conditions
• Water: ad libitum
• Nutrients: MF, Oriental Yeast, Tokyo, Japan
Treatment
CompoundBasedTreatment
• (Compound): Fenofibrate, CAS 49562-28-9 [ChemIDplus]
• (Treatment_application): in vivo, oral gavage
• (Measurement): 100mg/kg body weight
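As a rough sketch, part of this annotation could be held as category/value pairs with an optional pointer to the external resource that defines the value. The `OntologyEntry` dataclass below is an illustrative stand-in, not the MAGE-OM class of the same name, and the pairing of values with categories follows one reading of the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntologyEntry:
    """One piece of controlled annotation (illustrative, not MAGE-OM)."""
    category: str                 # ontology class, e.g. "Organism"
    value: str                    # instance or external term
    source: Optional[str] = None  # external reference, if the term lives elsewhere


# A subset of the annotation for the fenofibrate-treated mouse sample.
ANNOTATION = [
    OntologyEntry("Organism", "Mus musculus musculus", "NCBI Taxonomy"),
    OntologyEntry("Age", "7 weeks after birth"),
    OntologyEntry("StrainOrLine", "C57BL/6",
                  "International Committee on Standardized Genetic Nomenclature for Mice"),
    OntologyEntry("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    OntologyEntry("Compound", "Fenofibrate", "ChemIDplus"),
]


def lookup(entries, category):
    """Return the value recorded for a category, or None if absent."""
    for entry in entries:
        if entry.category == category:
            return entry.value
    return None


if __name__ == "__main__":
    print(lookup(ANNOTATION, "OrganismPart"))
    # Categories whose values are defined by an external resource.
    print([e.category for e in ANNOTATION if e.source is not None])
```

Keeping the source next to each value records where a term is defined, which is the "know where to find different types of annotation" role the MGED ontology takes on.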

Care when using ontologies
• Ontologies are rarely complete
• Ontologies are fit for the purpose for which they are designed, not off the shelf solutions to your problem
• Building ontologies is hard: it needs both domain experts and tools
• A simple list is an excellent start

Possible solutions
• Think about what you want to query: this determines what you will annotate
• Look for existing resources and use them if they are appropriate
• Do the doable first
• Share your resources

Quote
‘Most biologists would rather share their toothbrush than share a gene name’ (Michael Ashburner)

Acknowledgements
• MGED ontology: Chris Stoeckert, Trish Whetzel, Joe White, Cathy Ball, Paul Spellman
• Microarray Informatics Team, EBI
• Robert Stevens, University of Manchester
• Funding: EMBL, EU, ILSI
