Download presentation
1
Microarray data repositories
Mark Lambrecht The Arabidopsis Information Resource
2
Bioinformatics : process
Bioinformatics is involved in storing managing integrating retrieving analysing biological data; genome data, gene expression, proteomic data
3
Part I : challenges in constructing MA data repositories
Part II : Existing microarray data repositories case study : integrating microarray information in Arabidopsis Information Resource
4
Goal To store gene expression info to allow for Dealing with
generation and interpretation of experimental data integration of data in data repository Dealing with integrity testing quality of data, QA (log(R/G) vs log(RG) high-throughput exchange of data
5
A trivial problem ? Organisation of data determines scope of analysis and results generated by analyses ! Organisation for a few experiments not necessary, but in case of 20, 300, 1000 MA experiments ? Organisation of data has to reflect as good as possible biological reality and thus expert bioinformaticians with core biological background needed. Most complex: integrating MA data in existing biological repository.
6
Problem 1 : generation and interpretation of experimental data
Raw data as generated by a technology platform : oligo microarrays cDNA microarrays Affymetrix microarrays Data dependent on platform (expression scales, linearity, absence of ‘interconversion’ standard)
7
Interpretation Data is Integrated (secondary) data redundant
non-redundant annotation of data : manually, automatic (data cleaning step) Example : annotation of Affymetrix probe sets. Integrated (secondary) data composition of related data from different sources
8
Key challenge Keeping track of data processing, refinement, revision
Effect of data modelling, data collection, data management infrastructure
9
Example : Affymetrix data platform
12-20 probe pairs (PM, MM)/probe set DAT file : binary image CEL file : quantitative representation CHP file : gene expression measurements
10
Example : Affymetrix platform
Raw data : probe array images (.DAT files) Interpreted data probe intensity data files (.CEL files) probe set summarized data (.CHP files) replicate summarized data normalized data across experiments filtered data annotation information
11
Example: cDNA microarrays
EST/cDNAs are PCR amplified and attached to glass slides by automation Example : normalization by ‘tip’ groups Raw data generated by image analysis programs (2 channel laser scanning device) Processed data : preclustering files (.pcl files) NCBI Locuslink : problem of annotation of ESTs (fully-sequenced organism). Problems : biochemical manipulations, PCR product concentration, hybr efficiency, cross-hybridization
12
Example: cDNA microarrays
13
Data review Review, possibly correct experimental results
data : quality assessment (quality metrics : work in progress…) technology LIMS = laboratory information management system, validation (QC) tools, tracking systems : complex! analyze and correct process steps, failures interpret analysis results
14
Data comparability Each MA experiment is different ; scanner used, probes, glass type, tips, hybridization conditions, etc… has to be tracked in database Computational factors : hypothesis testing, analytic factors
15
Data analysis and interpretation
16
Data collection : warehouse
Goals: single source expression data Rationale : support periodic or continuous accumulation of data for effective exploration and analysis of massive amounts of data Data structure : central object characterized by measurement attribute dimension object ; organized in classification hierarchies Key challenge : applying warehouse concept to the gene expression data domain
17
Problem 2 : MA data integration in MA repository
Alternatives light ; data items in autonomous repositories cross-referenced loose : autonomous repositories accessed through common view tight : integrated in a common data repository => resolution of semantic problems required.
18
Associations Shallow : data collection and sorting
deep : determine semantic relationships
19
Complexity Determined by number of data types involved in integration
Quality of data in different repositories : example: integrate MA database with poorly described experiments in model organism database. Key challenge: understanding the analysis and performance requirements and finding the best alternative addressing them
20
Problem 3 : Integrating MA data in biological data warehouse
New challenges: several sources and platforms have to be integrated (Affymetrix, cDNA spotted arrays, oligo). Providing enhanced expression data context for analysis
21
Methodology Independent data collection
different image analysis software suites different ‘treatment’ of raw data different analysis of treated data by software suites (Spotfire, GeneSpring, Partek, GXExplorer, Genesis, Stanford software Xcluster Treeview,…) Statistical packages : S+, R , SAS
22
Methodology Employ common formats for exporting or extracting gene expression data / sample data / gene annotation data MAGE group : a common language to describe and exchange microarray data information, describes : microarray design microarray manufacturing information microarray experiment setup gene expression data and analysis results
23
Methodology : MAGE MAGE-ML has been automatically derived from Microarray Gene Expression Object Model (MAGE-OM), which is developed and described using the Unified Modelling Language (UML) – a standard language for describing object models. MIAME : minimum information about a microarray experiment Use of ontologies or controlled vocabularies (too many <-> not enough), microarray ontology Classification : e.g. aging : death to natural causes, obesitas/diabetes example
24
Example sample description
This is an example of the information requested for a sample (biomaterial) description for a microarray experiment (kindly provided by M. Hofmann and S. Schmidtke, LION bioscience AG) with notes added by C. Stoeckert, U. Penn. Organism: mus musculus [ NCBI taxonomy browser ] Note that the source of the term "NCBI taxonomy browser" is provided. We will need to come up with a controlled vocabulary of sources and/or IDs. Cell source: in-house bred mice (contact: Cell type: thymocytes (complex mixture of cells isolated from thymus. mostly T-lymphocytes and epithelial cells of different differentiation stage) [no matching for this cell type found in GXD "Mouse Anatomical Dictionary"] Note: what we are going for here is not really the type of cell but rather the type of cell source as in paraffin sample, biopsy, etc. This will be clarified in the concepts. Sex: female [ MGED ] Age: weeks after birth [ MGED ] Growth conditions: normal Note the need for further structuring of this type of information. controlled environment oC average temperature housed in cages according to German and EU legislation specified pathogen free conditions (SPF) 14 hours light cycle 10 hours dark cycle
25
Example sample description
Developmental stage: stage 28 (juvenile (young) mice) [ GXD "Mouse Anatomical Dictionary" ] Organism part: thymus [ GXD "Mouse Anatomical Dictionary" ] Strain or line: C57BL/6 [International Committee on Standardized Genetic Nomenclature for Mice] Genetic Variation: Inbr (J) 150. Origin: substrains 6 and 10 were separated prior to This substrain is now probably the most widely used of all inbred strains. Substrain 6 and 10 differ at the H9, Igh2 and Lv loci. Maint. by J,N, Ola. [International Committee on Standardized Genetic Nomenclature for Mice ] Individual: - Individual genetic characteristics: the same as in "Genetic Variation" Disease state: normal Targeted cell type: primary thymocytes (complex mixture of cells isolated from thymus. mostly T-lymphocytes and epithelial cells of different differentiation stage [ no matching for this cell type found in GXD "Mouse Anatomical Dictionary"] Cell line: no cell line; primary cells Treatment: in vivo [MGED] intraperitoneal injection of Dexamethasone into mice, 10 microgram per 25 g bodyweight of the mouse Compound: drug [MGED] synthetic glucocorticoid Dexamethasone, dissolved in PBS Clinical Information: - Separation technique: mice were killed 4 hours post injection and the thymus was taken out. Thymozytes were prepared by meshing the thymus through sterile gauze several times until a suspension of individual cells was obtained. RNA was prepared from thymozytes using a Qiagen RNA preparation kit.
26
Data analysis in warehouse environment
Goal : focus rapidly on data set of interest (single experiment, aggregate of several experiments/replicates,…) Mechanisms: free text search, key word search, Exploration : identify data of biological interest clustering, classification Consistent analysis and exploration Data analysis evolves and is of secondary importance to storing data.
27
Outlook Data management technology will provide continuously enhanced tools Complexity will rise as new types of data generated appear Expertise needed in application ; image analysis data modelling data analysis : classification/statistics and data management and software DBA, software engineering, software QA + prod.
28
Part II : existing microarray data repositories
NCBI Geo : different types of expression data Submitter Platform Sample Series
29
Existing microarray data repositories
ArrayDB NCGR’s GeneX Stanford Microarray Database (SMD) (see later) They all have clone viewers histogram scatter plots analysis software as plug-ins
30
Attached to other data repositories
Attached to biological pathway databases KEGG GenMAPP
31
Case study Arabidopsis microarray data
AFGC : Arabidopsis functional genomics consortium online demonstration Describing goals of MA data repository : use cases scenario
32
Usecases for an expression database
Remember : these are all functions to be provided by the MA database and interfaces ! (complex queries with DBMS could be simple for individual analysis). General functionalities 1. Retrieve/download all the genes that have been arrayed. 2. Retrieve/download the genes or array elements that have been arrayed using a specific technology or manufacturer or version 3. Return array elements/genes by gene name, locus name, nucleotide accession, Protein accession, and sequence (nucleotide and aa). 4. Download the raw( or processed) data for one ExprHybridization. 5. Download/retrieve the summary data for one ExpressionSet. 6. Browse ExpressionSets (experiments).
33
Usecases for an expression database
Simple Queries 7. Select an ExprHybridization and retrieve processed data for a gene by name, accession or sequence 8. Select an ExpressionSet and retrieve summary data for a gene by name, accession or sequence. 9. Select an experiment (ExpressionSet) and retrieve the genes that were up-/down-regulated, or varied X-fold (up or down), or uncertain, etc. 10. Retrieve experiments (ExpressionSets) by researcher's name, keyword, germplasm, classification, experiment type (time course, treated/untreated, dose response),and/or treatment (or only experimental variable treatment) 11. Retrieve detailed information about an experiment.
34
Usecases for an expression database
More complex queries 12. Retrieve all the experiments in which gene X was up/down/expressed/etc. 13. Retrieve only the experiments that satisfy a given condition/s in which gene X was up/down/expressed/etc 14. Retrieve summary data for Gene X in all the experiments that satisfy a set of experimental conditions (germplasm, treatment, anatomical part, etc). 15. A user is browsing through experiments and wants a list of genes that are distinctive for a pairwise comparison, e.g. a search for auxin–upregulated or –downregulated genes. Advanced queries 16. Taking different samples and different genes, can the genes be divided in certain clusters. 17. Given the expression values of gene X in an experiment, return other genes with similar expression values in the same experiment or across experiments
35
Data model of Arabidopsis MA repository
36
Practical implementation of MA repository
DBMS : Sybase Mixture of build/buy transact SQL ( Oracle’s procedural SQL) analysis Spotfire /GeneSpring Building own analysis software in Perl, Python, C Compliant with MAGE standards / MAML
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.