ArrayExpress – a public database for microarray gene expression data Helen Parkinson Microarray Informatics Team European Bioinformatics Institute MGED V, Tokyo 2002
Talk structure
- Introducing ArrayExpress
- Submission methods
- Ontologies and Annotation
- The Future
Infrastructure at the EBI
[Diagram. Labels shown: ArrayExpress (Oracle); MIAMExpress (MySQL); Local MIAMExpress installations; MAGE-ML submissions; MAGE-ML pipelines; MAGE-ML export; MAGE-ML files; Array manufacturers; LIMS; Microarray software; Data analysis software; Queries (www); Data analysis (www); EBI Expression Profiler; External bioinformatic databases; Other public microarray databases (GEO, CIBEX).]
ArrayExpress
- Public database for gene expression data, one of three (with CIBEX and GEO)
- Aims to store well-annotated (MIAME-compliant) and well-structured data
MIAME
- Recorded information should be sufficient to interpret and replicate the experiment
- Information should be structured so that querying and automated data analysis and mining are feasible*
*Brazma et al., Nature Genetics, 2001
ArrayExpress conceptual model
[Diagram. Labels shown: Publication; External links; Hybridisation; Array; Sample; Source (e.g., Taxonomy); Experiment; Normalisation; Gene (e.g., EMBL); Data.]
ArrayExpress details
- Database schema derived from MAGE-OM
- Standard SQL; we use Oracle
- Validating data loader for MAGE-ML
- Web interface
  - Queries: experiment, array, sample
  - Browsing: views on experiments
- Object model-based query mechanism, automatic mapping to SQL
Data Submission to ArrayExpress
- Via MAGE-ML pipeline, from LIMS/local database
- Current direct MAGE-ML submitters: Sanger, TIGR, Paul Spellman, Affymetrix
- Via external tools, e.g. BASE, J-Express
- Via MIAMExpress
- Accepts Array, Experiment and Protocol submissions
- Provides accession numbers
ArrayExpress curation effort
- User support and help documentation
- Curation at source (not destination)
- Support on ontologies and CVs
- Minimize free text, removal of synonyms
- MIAME encouragement
- Help on MAGE-ML
- Goal: to provide high-quality, well-annotated data to allow automated data analysis
Public Data Access
- Data export as tab-delimited file
- Export to Expression Profiler
- As MAGE-ML, from query interface
- Arrays exportable as tab-delimited file
Submission Tool
MIAMExpress submission and annotation tool
- Based on MIAME concepts and questionnaire
- Perl-CGI, MySQL database
- Experiment, Array, Protocol submissions
- Generic annotation tool
- Exports MAGE-ML
What makes up a submission?
[Diagram. Labels shown: Experiment; Array, Sample and Hybridisation (repeated four times); Protocols; Normalisation; Final data.]
Array Definition Format
- Tab-delimited file format that can be parsed to MAGE-ML
- Defines relationships between features and sequences
- Provides sequence annotation
- Example in the Agenda, open letter to journals
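To make the idea of a tab-delimited array description concrete, here is a minimal Python sketch that reads such a file and recovers which features carry which sequence. The column names (`feature_row`, `feature_column`, `reporter_name`, `sequence_accession`) and the accession values are hypothetical placeholders chosen for illustration; they are not the actual Array Definition Format headers, and no MAGE-ML is generated here.

```python
import csv
import io

# Hypothetical header names and placeholder accessions, for illustration only.
# The real Array Definition Format columns may differ.
SAMPLE_ADF = (
    "feature_row\tfeature_column\treporter_name\tsequence_accession\n"
    "1\t1\tR0001\tX00001\n"
    "1\t2\tR0002\tX00002\n"
    "2\t1\tR0001\tX00001\n"
)


def parse_adf(text):
    """Return {(row, column): (reporter, accession)} from tab-delimited text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    features = {}
    for record in reader:
        position = (int(record["feature_row"]), int(record["feature_column"]))
        features[position] = (record["reporter_name"], record["sequence_accession"])
    return features


def features_by_sequence(features):
    """Invert the mapping: sequence accession -> sorted list of feature positions."""
    by_sequence = {}
    for position, (_reporter, accession) in features.items():
        by_sequence.setdefault(accession, []).append(position)
    return {accession: sorted(positions) for accession, positions in by_sequence.items()}


features = parse_adf(SAMPLE_ADF)
print(features_by_sequence(features))
```

The inverted mapping is the kind of feature-to-sequence relationship a loader would need before it could emit structured output; a real converter would also validate the headers and carry the sequence annotation columns through.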
Example BioSequence annotation
Sample annotation
- Gene expression data only have meaning in the context of detailed sample descriptions
- If the data are going to be interpreted by independent parties, sample information has to be searchable and in the database
- Controlled vocabularies and ontologies (species, cell types, compound nomenclature, treatments, etc.) are needed for unambiguous sample description
- None of this is trivial
What Does an Ontology Do?
- Captures knowledge
- Creates a shared understanding, between humans and for computers
- Makes knowledge machine processable
- Makes meaning explicit, by definition and context
- It is more than a controlled vocabulary: it has structure
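The difference between a controlled vocabulary and an ontology can be sketched in a few lines of Python. The terms below are a toy example invented for illustration and are not taken from the MGED ontology or any other real resource; the point is only that a flat term list can answer "is this term allowed?", whereas an is-a structure can also answer "is this term a kind of that one?".

```python
# Toy is-a hierarchy (child -> parent). The terms are illustrative only and
# are not drawn from any real ontology.
IS_A = {
    "liver": "organ",
    "kidney": "organ",
    "organ": "anatomical structure",
}

# A controlled vocabulary is just the set of permitted terms, with no structure.
FLAT_VOCABULARY = set(IS_A) | set(IS_A.values())


def ancestors(term):
    """Walk the is-a links upwards and return the parents, nearest first."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def is_a(term, candidate):
    """True if `term` is `candidate` itself or one of its descendants."""
    return term == candidate or candidate in ancestors(term)


print("liver" in FLAT_VOCABULARY)   # membership: all a flat vocabulary offers
print(is_a("liver", "organ"))       # subsumption: needs the structure
print(ancestors("liver"))
```

With the structure in place, a query for samples annotated with any "organ" can match a sample annotated as "liver", which a flat list cannot support without manual enumeration.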
MGED Biomaterial (sample) Ontology
- Under construction by the MGED OWG, using OILed
- Motivated by MIAME and coordinated with the database model (mapping available)
- We are extending classes, providing constraints, defining terms, and adding terms
- Extension to include other MAGE-OM required ontology entries and CVs
MGED BioMaterial Internal Terms
Internal and External Terms combined
Examples of external ontologies and CVs
- NCBI taxonomy database
- Jackson Lab mouse strains and genes
- Edinburgh mouse atlas anatomy
- HUGO nomenclature for human genes
- Chemical and compound ontologies, e.g. CAS
- TAIR
- FlyBase anatomy
- GO (GOBO ontologies)
Example Annotation
Sample source and treatment description, and its correct annotation using the MGED BioMaterial Ontology classes and corresponding external references:
“Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”
MGED BioMaterial Ontology class | Instance | External reference
BioMaterialDescription | |
Biosource Property | |
Organism | Mus musculus musculus id: | NCBI Taxonomy
Age | 7 weeks after birth |
DevelopmentStage | Stage 28 | Mouse Anatomical Dictionary
Sex | Female |
StrainOrLine | C57BL/6 | International Committee on Standardized Genetic Nomenclature for Mice
BiosourceProvider | Charles River, Japan |
OrganismPart | Liver | Mouse Anatomical Dictionary
BioMaterialManipulation | |
EnvironmentalHistory | |
CultureCondition | |
Temperature | 22 ± 2 °C |
Humidity | 55 ± 5% |
Light | 12 hours light/dark cycle |
PathogenTests | Specified pathogen free conditions |
Water | ad libitum |
Nutrients | MF, Oriental Yeast, Tokyo, Japan |
Treatment | |
CompoundBasedTreatment | |
(Compound) | Fenofibrate, CAS | ChemIDplus
(Treatment_application) | in vivo, oral gavage |
(Measurement) | 100mg/kg body weight |
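As a rough illustration of why this kind of annotation supports querying, the sketch below holds a few of the class, instance and external reference triples from the slide above as Python records. The `Annotation` type and the helper functions are hypothetical names introduced here, the pairing of each value with its class is read off the slide layout and is therefore an assumption, and this is not the ArrayExpress or MIAMExpress data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """One annotated property of a sample (hypothetical structure)."""
    ontology_class: str                       # MGED BioMaterial Ontology class name
    value: str                                # the instance recorded for this sample
    external_reference: Optional[str] = None  # source of the term, if any


# A subset of the example; the class/value pairing is assumed from the slide.
SAMPLE = [
    Annotation("Organism", "Mus musculus musculus", "NCBI Taxonomy"),
    Annotation("Age", "7 weeks after birth"),
    Annotation("Sex", "Female"),
    Annotation("StrainOrLine", "C57BL/6",
               "International Committee on Standardized Genetic Nomenclature for Mice"),
    Annotation("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    Annotation("Compound", "Fenofibrate", "ChemIDplus"),
]


def find(annotations, ontology_class):
    """Return the first annotation of the given class, or None if absent."""
    for annotation in annotations:
        if annotation.ontology_class == ontology_class:
            return annotation
    return None


def without_reference(annotations):
    """List annotations whose value is not tied to an external resource."""
    return [a for a in annotations if a.external_reference is None]


print(find(SAMPLE, "OrganismPart"))
print([a.ontology_class for a in without_reference(SAMPLE)])
```

Once the free-text sentence is broken into typed fields like these, a search for liver samples or for fenofibrate treatments becomes a field lookup instead of a text match.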
Future
- Data acquisition for ArrayExpress
- Update ArrayExpress to newest MAGE-OM
- V2.0 MIAMExpress, domain specific, portable
- Further ontology development and integration into tools, use of OWL
- Curation tools (other than grep, Perl scripts)
- Improved query interface for AE
- ArrayExpress update tool
Resources
- Schemas for both ArrayExpress and MIAMExpress, access to code
- MAGE-ML examples: Arrays, Expts, Protocols
- MIAME glossary, MAGE-MIAME-ontology mappings
- List of ontology resources from MGED pages
- Help in establishing pipelines
- MGED software MAGEstk
- Curation, help and advice
Acknowledgments
- Microarray Informatics Team, EBI, esp. Alvis Brazma, Ugis Sarkans, Mohammad Shojatalab, Ele Holloway, Gaurab Mukherjee, Philippe Rocca-Serra, Susanna Sansone
- Chris Stoeckert, U. Penn.
- Members of MGED
- Sanger Institute: Rob Andrews, Jurg Bahler, Adam Butler, Kate Rice
- EMBL Heidelberg: Wilhelm Ansorge, Martina Muckenthaler, Thomas Preiss
“I think you should be more explicit here in step two”
‘The miracle of microarray data analysis’, Genome Biology 2001, 2(9)