Microarray data repositories


Microarray data repositories Mark Lambrecht The Arabidopsis Information Resource http://www.arabidopsis.org

Bioinformatics: process
Bioinformatics is involved in storing, managing, integrating, retrieving and analysing biological data: genome data, gene expression data, proteomic data.

Part I: challenges in constructing MA data repositories
Part II: existing microarray data repositories
Case study: integrating microarray information in The Arabidopsis Information Resource

Goal
To store gene expression information in a way that allows for:
  dealing with the generation and interpretation of experimental data
  integration of data in a data repository
  dealing with integrity: testing the quality of data, QA (e.g. log(R/G) vs log(RG))
  high-throughput exchange of data
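
The QA comparison named on this slide, log(R/G) against log(RG), can be sketched as follows. This is a minimal illustration and not part of the original presentation: the function name `ma_values`, the base-2 logarithm and the factor 1/2 on log(RG) are assumptions that follow the common M-versus-A plotting convention, where R and G are the red and green channel intensities of one spot.

```python
import math


def ma_values(red, green):
    """Return a list of (M, A) pairs for paired red/green spot intensities.

    M = log2(R/G) is the log ratio; A = 0.5 * log2(R*G) is the mean log
    intensity. Both definitions are assumed conventions, not taken from
    the slides.
    """
    pairs = []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            # A logarithm is undefined here, so the spot is skipped.
            continue
        m = math.log2(r / g)
        a = 0.5 * math.log2(r * g)
        pairs.append((m, a))
    return pairs


if __name__ == "__main__":
    # Hypothetical intensities for three spots; the third has no red signal.
    print(ma_values([4.0, 8.0, 0.0], [1.0, 8.0, 5.0]))
```

Plotting M against A for every spot on an array is one way to see whether the log ratio depends on overall intensity, which is the kind of integrity check the slide refers to.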

A trivial problem?
The organisation of data determines the scope of the analysis and the results generated by analyses!
For a few experiments, organisation is not necessary, but what about 20, 300 or 1000 MA experiments?
The organisation of data has to reflect biological reality as well as possible, so expert bioinformaticians with a core biological background are needed.
Most complex: integrating MA data in an existing biological repository.

Problem 1: generation and interpretation of experimental data
Raw data as generated by a technology platform:
  oligo microarrays
  cDNA microarrays
  Affymetrix microarrays
Data are dependent on the platform (expression scales, linearity, absence of an 'interconversion' standard)

Interpretation
Data is redundant or non-redundant
Annotation of data: manual or automatic (data cleaning step). Example: annotation of Affymetrix probe sets.
Integrated (secondary) data: composition of related data from different sources

Key challenge
Keeping track of data processing, refinement, revision
Effect of data modelling, data collection, data management infrastructure

Example: Affymetrix data platform
12-20 probe pairs (PM = perfect match, MM = mismatch) per probe set
DAT file: binary image
CEL file: quantitative representation
CHP file: gene expression measurements

Example: Affymetrix platform
Raw data: probe array images (.DAT files)
Interpreted data:
  probe intensity data files (.CEL files)
  probe set summarized data (.CHP files)
  replicate summarized data
  normalized data across experiments
  filtered data
  annotation information
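
The step from probe intensity data to probe set summarized data can be illustrated with a deliberately simple sketch. It collapses the PM/MM pairs of one probe set into a single value by averaging the PM minus MM differences. This is an assumed stand-in for illustration only and is not claimed to be the algorithm that produces .CHP files; the function name and the intensities are hypothetical.

```python
from statistics import mean


def summarize_probe_set(pm, mm):
    """Average the PM - MM differences over all probe pairs of one probe set.

    pm and mm are equally long sequences of perfect-match and mismatch
    intensities. A plain average difference is an assumption made for
    this sketch.
    """
    if len(pm) != len(mm) or len(pm) == 0:
        raise ValueError("need the same non-zero number of PM and MM values")
    return mean(p - m for p, m in zip(pm, mm))


if __name__ == "__main__":
    # Hypothetical intensities for a probe set with three probe pairs.
    print(summarize_probe_set([120, 200, 90], [100, 150, 100]))
```

Because each derived value depends on the summarization method chosen, the repository has to record which method was applied alongside the value itself.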

Example: cDNA microarrays
ESTs/cDNAs are PCR-amplified and attached to glass slides by automation. Example: normalization by 'tip' groups.
Raw data: generated by image analysis programs (2-channel laser scanning device)
Processed data: preclustering files (.pcl files)
NCBI LocusLink: problem of annotation of ESTs (fully sequenced organism)
Problems: biochemical manipulations, PCR product concentration, hybridization efficiency, cross-hybridization
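
Normalization by 'tip' groups can be sketched as centring the log ratios of all spots printed by the same tip. The version below subtracts the per-group median, which is an assumption chosen for simplicity (the slides do not say which statistic was used); the function name and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import median


def normalize_by_tip_group(spots):
    """Centre log ratios within each print-tip group.

    spots is a list of (tip_group, log_ratio) tuples. The result is a list
    of centred log ratios in the same order as the input. Median centring
    is an assumed choice for this sketch.
    """
    by_tip = defaultdict(list)
    for tip, log_ratio in spots:
        by_tip[tip].append(log_ratio)
    # One centre per tip group, computed from that group's spots only.
    centres = {tip: median(values) for tip, values in by_tip.items()}
    return [log_ratio - centres[tip] for tip, log_ratio in spots]


if __name__ == "__main__":
    # Hypothetical spots from two print tips.
    demo = [(1, 1.0), (1, 3.0), (2, -0.5), (2, 0.5), (2, 1.5)]
    print(normalize_by_tip_group(demo))
```

To make such a correction reproducible, the tip group of every spot has to be stored with the array design.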

Example: cDNA microarrays

Data review
Review and, where possible, correct experimental results
Data: quality assessment (quality metrics: work in progress…)
Technology: LIMS (laboratory information management system), validation (QC) tools, tracking systems: complex!
Analyze and correct process steps and failures
Interpret analysis results

Data comparability
Each MA experiment is different: the scanner used, probes, glass type, tips, hybridization conditions, etc. all have to be tracked in the database
Computational factors: hypothesis testing, analytic factors

Data analysis and interpretation

Data collection: warehouse
Goal: a single source of expression data
Rationale: support periodic or continuous accumulation of data for effective exploration and analysis of massive amounts of data
Data structure: a central object characterized by measurement attributes, surrounded by dimension objects organized in classification hierarchies
Key challenge: applying the warehouse concept to the gene expression data domain
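
The warehouse data structure described here, a central object carrying the measurement and dimension objects carrying classification hierarchies, can be sketched with an in-memory SQLite database. All table names, column names, gene names and values below are hypothetical and chosen only for illustration; this is not the schema of any repository mentioned in the presentation.

```python
import sqlite3


def build_demo_warehouse():
    """Create a tiny star-shaped schema: one fact table, two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Dimension: genes, classified by a (hypothetical) family level.
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT, family TEXT);
        -- Dimension: samples, classified by tissue.
        CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, tissue TEXT);
        -- Central object: one expression measurement per gene and sample.
        CREATE TABLE expression (
            gene_id INTEGER REFERENCES gene(gene_id),
            sample_id INTEGER REFERENCES sample(sample_id),
            log_ratio REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [(1, "geneA", "kinase"), (2, "geneB", "kinase"), (3, "geneC", "transporter")],
    )
    conn.executemany("INSERT INTO sample VALUES (?, ?)", [(1, "root"), (2, "leaf")])
    conn.executemany(
        "INSERT INTO expression VALUES (?, ?, ?)",
        [(1, 1, 1.0), (2, 1, 3.0), (3, 1, -1.0), (1, 2, 0.5), (2, 2, 0.5), (3, 2, 2.0)],
    )
    return conn


def mean_by_family_and_tissue(conn):
    """Roll the measurement up along both classification hierarchies."""
    return conn.execute(
        """
        SELECT g.family, s.tissue, AVG(e.log_ratio)
        FROM expression AS e
        JOIN gene AS g ON g.gene_id = e.gene_id
        JOIN sample AS s ON s.sample_id = e.sample_id
        GROUP BY g.family, s.tissue
        ORDER BY g.family, s.tissue
        """
    ).fetchall()


if __name__ == "__main__":
    for row in mean_by_family_and_tissue(build_demo_warehouse()):
        print(row)
```

The point of the layout is that the same measurement table can be aggregated along any dimension without restructuring the data.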

Problem 2: MA data integration in MA repository
Alternatives:
  light: data items in autonomous repositories are cross-referenced
  loose: autonomous repositories are accessed through a common view
  tight: data are integrated in a common data repository => resolution of semantic problems required

Associations
Shallow: data collection and sorting
Deep: determine semantic relationships

Complexity
Determined by the number of data types involved in the integration
Quality of data in different repositories. Example: integrating an MA database with poorly described experiments into a model organism database.
Key challenge: understanding the analysis and performance requirements and finding the alternative that best addresses them

Problem 3: integrating MA data in a biological data warehouse
New challenges: several sources and platforms have to be integrated (Affymetrix, cDNA spotted arrays, oligo)
Providing enhanced expression data context for analysis

Methodology
Independent data collection:
  different image analysis software suites
  different 'treatment' of raw data
  different analysis of treated data by software suites (Spotfire, GeneSpring, Partek, GXExplorer, Genesis, Stanford software Xcluster and TreeView, …)
Statistical packages: S+, R, SAS

Methodology
Employ common formats for exporting or extracting gene expression data / sample data / gene annotation data
MAGE group: a common language to describe and exchange microarray data information. It describes:
  microarray design
  microarray manufacturing information
  microarray experiment setup
  gene expression data and analysis results

Methodology: MAGE
MAGE-ML has been automatically derived from the Microarray Gene Expression Object Model (MAGE-OM), which is developed and described using the Unified Modelling Language (UML), a standard language for describing object models.
MIAME: Minimum Information About a Microarray Experiment
Use of ontologies or controlled vocabularies (too many <-> not enough), microarray ontology
Classification: e.g. aging: death from natural causes, obesity/diabetes example

Example sample description
This is an example of the information requested for a sample (biomaterial) description for a microarray experiment (kindly provided by M. Hofmann and S. Schmidtke, LION bioscience AG), with notes added by C. Stoeckert, U. Penn.
Organism: Mus musculus [ NCBI taxonomy browser ]
  Note that the source of the term, "NCBI taxonomy browser", is provided. We will need to come up with a controlled vocabulary of sources and/or IDs.
Cell source: in-house bred mice (contact: norma.howells@itg.fzk.de)
Cell type: thymocytes (complex mixture of cells isolated from thymus, mostly T-lymphocytes and epithelial cells of different differentiation stages) [ no match for this cell type found in GXD "Mouse Anatomical Dictionary" ]
  Note: what we are going for here is not really the type of cell but rather the type of cell source, as in paraffin sample, biopsy, etc. This will be clarified in the concepts.
Sex: female [ MGED ]
Age: 3 - 4 weeks after birth [ MGED ]
Growth conditions: normal
  Note the need for further structuring of this type of information.
  controlled environment
  20 - 22 °C average temperature
  housed in cages according to German and EU legislation
  specified pathogen free conditions (SPF)
  14 hours light cycle
  10 hours dark cycle

Example sample description
Developmental stage: stage 28 (juvenile (young) mice) [ GXD "Mouse Anatomical Dictionary" ]
Organism part: thymus [ GXD "Mouse Anatomical Dictionary" ]
Strain or line: C57BL/6 [ International Committee on Standardized Genetic Nomenclature for Mice ]
Genetic Variation: Inbr (J) 150. Origin: substrains 6 and 10 were separated prior to 1937. This substrain is now probably the most widely used of all inbred strains. Substrains 6 and 10 differ at the H9, Igh2 and Lv loci. Maint. by J, N, Ola. [ International Committee on Standardized Genetic Nomenclature for Mice ]
Individual: -
Individual genetic characteristics: the same as in "Genetic Variation"
Disease state: normal
Targeted cell type: primary thymocytes (complex mixture of cells isolated from thymus, mostly T-lymphocytes and epithelial cells of different differentiation stages) [ no match for this cell type found in GXD "Mouse Anatomical Dictionary" ]
Cell line: no cell line; primary cells
Treatment: in vivo [ MGED ] intraperitoneal injection of Dexamethasone into mice, 10 microgram per 25 g bodyweight of the mouse
Compound: drug [ MGED ] synthetic glucocorticoid Dexamethasone, dissolved in PBS
Clinical Information: -
Separation technique: mice were killed 4 hours post injection and the thymus was taken out. Thymocytes were prepared by meshing the thymus through sterile gauze several times until a suspension of individual cells was obtained. RNA was prepared from thymocytes using a Qiagen RNA preparation kit.
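
The pattern used throughout this sample description, a value followed by the source of the term in brackets, can be sketched as a small data structure. The class and function names are hypothetical; the field values are copied from the example above. The check simply lists fields that carry no source, which is where a controlled vocabulary would still be needed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AnnotatedTerm:
    """One field of a sample description plus the vocabulary it came from."""

    field: str
    value: str
    source: Optional[str] = None  # None means no source was recorded


def terms_without_source(terms: List[AnnotatedTerm]) -> List[str]:
    """Return the field names whose value is not tied to a named source."""
    return [term.field for term in terms if term.source is None]


# A few fields from the example sample description.
SAMPLE = [
    AnnotatedTerm("Organism", "Mus musculus", "NCBI taxonomy browser"),
    AnnotatedTerm("Sex", "female", "MGED"),
    AnnotatedTerm("Organism part", "thymus", 'GXD "Mouse Anatomical Dictionary"'),
    AnnotatedTerm("Disease state", "normal"),
    AnnotatedTerm("Growth conditions", "normal"),
]

if __name__ == "__main__":
    print(terms_without_source(SAMPLE))
```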

Data analysis in warehouse environment
Goal: focus rapidly on the data set of interest (single experiment, aggregate of several experiments/replicates, …)
Mechanisms: free text search, key word search
Exploration: identify data of biological interest by clustering, classification
Consistent analysis and exploration
Data analysis evolves and is of secondary importance to storing data.

Outlook
Data management technology will provide continuously enhanced tools
Complexity will rise as new types of data appear
Expertise needed in:
  application: image analysis, data modelling, data analysis (classification/statistics)
  data management and software: DBA, software engineering, software QA + prod.

Part II: existing microarray data repositories
NCBI GEO: different types of expression data
  Submitter
  Platform
  Sample
  Series

Existing microarray data repositories
ArrayDB
NCGR's GeneX
Stanford Microarray Database (SMD) (see later)
They all have:
  clone viewers
  histograms
  scatter plots
  analysis software as plug-ins

Attached to other data repositories
Attached to biological pathway databases:
  KEGG
  GenMAPP http://www.genmapp.org/intro.html

Case study
Arabidopsis microarray data
AFGC: Arabidopsis Functional Genomics Consortium
Online demonstration
Describing the goals of the MA data repository: use case scenarios

Use cases for an expression database
Remember: these are all functions to be provided by the MA database and interfaces! (Complex queries with the DBMS could be simple for individual analysis.)
General functionalities
1. Retrieve/download all the genes that have been arrayed.
2. Retrieve/download the genes or array elements that have been arrayed using a specific technology or manufacturer or version.
3. Return array elements/genes by gene name, locus name, nucleotide accession, protein accession, and sequence (nucleotide and aa).
4. Download the raw (or processed) data for one ExprHybridization.
5. Download/retrieve the summary data for one ExpressionSet.
6. Browse ExpressionSets (experiments).

Use cases for an expression database
Simple queries
7. Select an ExprHybridization and retrieve processed data for a gene by name, accession or sequence.
8. Select an ExpressionSet and retrieve summary data for a gene by name, accession or sequence.
9. Select an experiment (ExpressionSet) and retrieve the genes that were up-/down-regulated, or varied X-fold (up or down), or uncertain, etc.
10. Retrieve experiments (ExpressionSets) by researcher's name, keyword, germplasm, classification, experiment type (time course, treated/untreated, dose response), and/or treatment (or only experimental variable treatment).
11. Retrieve detailed information about an experiment.
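
Use case 9 (select an ExpressionSet and retrieve the genes that varied X-fold, up or down) might look like the following against a relational store. This is a sketch under stated assumptions: the `summary` table, its columns, the gene names and the values are hypothetical and are not the schema of the repository described in the case study; SQLite stands in for the DBMS.

```python
import math
import sqlite3


def build_demo_db():
    """Create an in-memory table of per-gene summary log2 ratios (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE summary (expression_set TEXT, gene TEXT, log2_ratio REAL)"
    )
    conn.executemany(
        "INSERT INTO summary VALUES (?, ?, ?)",
        [
            ("ES1", "geneA", 1.5),
            ("ES1", "geneB", -2.0),
            ("ES1", "geneC", 0.3),
            ("ES2", "geneA", -1.2),
        ],
    )
    return conn


def genes_changed(conn, expression_set, fold):
    """Return (gene, direction) for genes changed at least `fold`-fold.

    An X-fold change in either direction corresponds to an absolute log2
    ratio of at least log2(X).
    """
    threshold = math.log2(fold)
    rows = conn.execute(
        "SELECT gene, log2_ratio FROM summary "
        "WHERE expression_set = ? AND ABS(log2_ratio) >= ? "
        "ORDER BY gene",
        (expression_set, threshold),
    ).fetchall()
    return [(gene, "up" if ratio > 0 else "down") for gene, ratio in rows]


if __name__ == "__main__":
    print(genes_changed(build_demo_db(), "ES1", 2.0))
```

Storing ratios on a log scale keeps the up and down cases symmetric, so one threshold serves both directions.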

Use cases for an expression database
More complex queries
12. Retrieve all the experiments in which gene X was up/down/expressed/etc.
13. Retrieve only the experiments that satisfy one or more given conditions in which gene X was up/down/expressed/etc.
14. Retrieve summary data for gene X in all the experiments that satisfy a set of experimental conditions (germplasm, treatment, anatomical part, etc.).
15. A user is browsing through experiments and wants a list of genes that are distinctive for a pairwise comparison, e.g. a search for auxin-upregulated or -downregulated genes.
Advanced queries
16. Taking different samples and different genes, can the genes be divided into clusters?
17. Given the expression values of gene X in an experiment, return other genes with similar expression values in the same experiment or across experiments.
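
Use case 17 (return other genes with expression values similar to gene X) needs a definition of "similar". The sketch below assumes Pearson correlation across experiments with a fixed cut-off, which is one common choice and not necessarily the one used in the repository; the function names, the cut-off and the profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        # A flat profile has no defined correlation; treat it as unrelated.
        return 0.0
    return cov / (sd_x * sd_y)


def similar_genes(profiles, query, min_r=0.9):
    """Return, sorted by name, genes whose profile correlates with the query.

    profiles maps gene name -> list of expression values, one per experiment.
    """
    target = profiles[query]
    return sorted(
        gene
        for gene, values in profiles.items()
        if gene != query and pearson(target, values) >= min_r
    )


if __name__ == "__main__":
    # Made-up profiles over four experiments.
    demo = {
        "geneX": [1.0, 2.0, 3.0, 4.0],
        "geneY": [2.0, 4.0, 6.0, 8.0],
        "geneZ": [4.0, 3.0, 2.0, 1.0],
        "geneW": [1.0, 1.0, 1.0, 1.0],
    }
    print(similar_genes(demo, "geneX"))
```

Correlation compares the shape of the profiles and ignores their scale, so geneY counts as similar to geneX although its values are twice as large. If similar absolute values are wanted, a distance measure would be the more fitting choice.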

Data model of Arabidopsis MA repository

Practical implementation of MA repository
DBMS: Sybase
Mixture of build/buy:
  Transact-SQL (Sybase's procedural SQL, the counterpart of Oracle's PL/SQL)
  analysis: Spotfire / GeneSpring
  building own analysis software in Perl, Python, C
Compliant with MAGE standards / MAML