Published by Jasper Harrington. Modified over 9 years ago
1
ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data
Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
2
Talk structure
- Why do we need a database for functional genomics data?
- ArrayExpress database
  - Archive
  - Gene Expression Atlas
- ArrayExpress content
- How to query the database
- How to download data
- How to submit data
ArrayExpress
3
What is functional genomics (FG)?
- The aim of FG is to understand the function of genes and other parts of the genome
- FG experiments typically utilize genome-wide assays to measure and track many genes (or proteins) in parallel under different conditions
- High-throughput technologies such as microarrays and high-throughput sequencing (HTS) are frequently used in this field to interrogate the transcriptome
4
What biological questions is FG addressing?
- When and where are genes expressed?
- How do gene expression levels differ in various cell types and states?
- What are the functional roles of different genes and in what cellular processes do they participate?
- How are genes regulated?
- How do genes and gene products interact?
- How is gene expression changed in various diseases or following a treatment?
5
Components of a FG experiment
6
ArrayExpress www.ebi.ac.uk/arrayexpress/
- Is a public repository for FG data, which provides easy access to well-annotated data in a structured and standardized format
- Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
- Facilitates the sharing of experimental information associated with the data, such as microarray designs, experimental protocols, etc.
- Based on community standards: MIAME guidelines and MAGE-TAB format for microarray data, MINSEQE guidelines for HTS data
7
Reporting standards for microarrays MIAME checklist
Minimal Information About a Microarray Experiment
The 6 most critical elements contributing towards MIAME are:
1. Essential sample annotation including experimental factors and their values (e.g. compound and dose)
2. Experimental design including sample data relationships (e.g. which raw data file relates to which sample, etc.)
3. Sufficient array annotation (e.g. gene identifiers, genomic coordinates, probe sequences or array catalog number)
4. Essential laboratory and data processing protocols (e.g. normalization method used)
5. Raw data for each hybridization (e.g. CEL or GPR files)
6. Final normalized data for the set of hybridizations in the experiment
8
Reporting standards for sequencing MINSEQE checklist
Minimal Information about a high-throughput Nucleotide SEQuencing Experiment
The proposed MINSEQE guidelines (still a work in progress) are:
- General information about the experiment
- Essential sample annotation including experimental factors and their values (e.g. compound and dose)
- Experimental design including sample data relationships (e.g. which raw data file relates to which sample, etc.)
- Essential experimental and data processing protocols
- Sequence read data with quality scores, raw intensities and processing parameters for the instrument
- Final processed data for the set of assays in the experiment
9
Reporting standards for microarrays MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment. We have now adapted it to handle HTS data:
- IDF: Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.
- SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
- ADF (for array data only): Array Design Format file. Describes the design of an array, i.e. the sequence located at each feature on the array and its annotation.
- Data files: raw and processed data files. The ‘raw’ data files are the files produced by the microarray image analysis software (e.g. CEL files for Affymetrix or GPR files from GenePix) or the trace data files (.fastq, .srf or .sff) for HTS data. The processed data file is a ‘data matrix’ file containing processed values (e.g. files in which the expression values are linked to gene IDs or genome coordinates).
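As a rough illustration of the IDF described above, the sketch below reads a row-oriented, tab-delimited file in which the first cell of each row is a tag and the remaining cells are its values. The file content, names and values are made up for the example, and `parse_idf` is a hypothetical helper, not part of any ArrayExpress tool.

```python
import csv
import io

# Hypothetical, minimal IDF content; real files carry many more rows.
EXAMPLE_IDF = (
    "Investigation Title\tExample time-course experiment\n"
    "Person Last Name\tDoe\tRoe\n"
    "Protocol Name\tP-EXAMPLE-1\tP-EXAMPLE-2\n"
    "SDRF File\texample.sdrf.txt\n"
)


def parse_idf(text):
    """Return a dict mapping each IDF tag to its list of non-empty values."""
    fields = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        tag = row[0].strip()
        fields[tag] = [value.strip() for value in row[1:] if value.strip()]
    return fields


idf = parse_idf(EXAMPLE_IDF)
print(idf["Investigation Title"][0])
print(idf["SDRF File"])
```

The IDF points to the SDRF by file name, so a reader of this kind would typically open the SDRF next to follow the sample-to-data relationships.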
10
ArrayExpress – two databases
11
What is the difference between them?
ArrayExpress Archive
- Central object: experiment
- Query to retrieve experimental information and associated data
Expression Atlas
- Central object: gene/condition
- Query for gene expression changes across experiments and across platforms
12
ArrayExpress – two databases
13
ArrayExpress Archive – when to use it?
- Find FG experiments that might be relevant to your research
- Download data and re-analyze it. Often data deposited in public repositories can be used to answer different biological questions from the one asked in the original experiments.
- Submit microarray or HTS data that you want to publish. Major journals will require data to be submitted to a public repository like ArrayExpress as part of the peer-review process.
14
How much data in AE Archive?
15
HTS data in AE Archive
16
Browsing the AE Archive
17
Browsing the AE Archive
- The date when the data were loaded in the Archive
- Number of assays
- AE unique experiment ID
- Curated title of experiment
- Species investigated
- ‘Loaded in Atlas’ flag
- Raw sequencing data available in ENA
- The total number of experiments and assays retrieved
- The direct link to raw and processed data. An icon indicates that this type of data is available.
- The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed
18
Browsing the AE Archive
19
Experimental factor ontology (EFO) http://www.ebi.ac.uk/efo
- Application-focused ontology modeling experimental factors (EFs) in AE – selected by default
- Developed to:
  - increase the richness of annotations that are currently made in AE Archive
  - promote consistency
  - facilitate automatic annotation and integrate external data
- EFs are transformed into an ontological representation, forming classes and relationships between those classes
- EFO terms map to multiple existing domain-specific ontologies, such as the Disease Ontology and Cell Type Ontology
20
Building EFO An example
- Take all experimental factors: sarcoma, cancer, neoplasm, Kaposi’s sarcoma, disease
- Find the logical connection between them, using the relations ‘is the parent term’, ‘is a type of’ and ‘is synonym of’
- Organize them in an ontology with the terms disease, neoplasm, cancer, sarcoma and Kaposi’s sarcoma
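A minimal sketch of how the terms on this slide could be held as an ontology, using the two kinds of relation it names ("is a type of" and "is synonym of"). The exact wiring is an assumption: the slide's figure did not survive, so treating "cancer" as a synonym of "neoplasm" and placing "sarcoma" under it is a guess made for illustration, and the helper names are hypothetical.

```python
# Assumed "is a type of" links (child -> parent); "disease" is the parent term.
IS_A = {
    "neoplasm": "disease",
    "sarcoma": "neoplasm",
    "Kaposi's sarcoma": "sarcoma",
}

# Assumed "is synonym of" links (synonym -> preferred term).
SYNONYMS = {"cancer": "neoplasm"}


def canonical(term):
    """Map a synonym to its preferred term; other terms are returned unchanged."""
    return SYNONYMS.get(term, term)


def ancestors(term):
    """Walk the "is a type of" links upwards and return all parent terms in order."""
    chain = []
    current = IS_A.get(canonical(term))
    while current is not None:
        chain.append(current)
        current = IS_A.get(current)
    return chain


def descendants(term):
    """Return every term that sits below the given term, sorted alphabetically."""
    target = canonical(term)
    return sorted(t for t in IS_A if target in ancestors(t))


print(ancestors("Kaposi's sarcoma"))
print(descendants("cancer"))
```

Resolving synonyms before walking the hierarchy is what lets a query for one term also pick up annotations made with another term or with a more specific child term.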
21
Exploring EFO An example
22
Searching AE Archive Simple query
23
Searching AE Archive Simple query
Search across all fields:
- AE accession number, e.g. E-MEXP-568
- Secondary accession numbers, e.g. GEO series accession GSE5389
- Experiment name
- Submitter's experiment description
- Sample attributes, experimental factors and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression)
- Publication title, authors and journal name, PubMed ID
Synonyms for terms are always included in searches, e.g. 'human' and 'Homo sapiens'
24
AE Archive query output
- Matches to exact terms are highlighted in yellow
- Matches to synonyms are highlighted in green
- Matches to child terms in the EFO are highlighted in pink
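The three highlight colours correspond to three kinds of match. The sketch below shows one way such a classification could work; the lookup tables are tiny hypothetical stand-ins for the real synonym lists and EFO hierarchy, and `classify_match` is an illustrative helper rather than the Archive's actual logic.

```python
# Colours as described on the slide.
HIGHLIGHT = {"exact": "yellow", "synonym": "green", "child": "pink"}

# Hypothetical lookup tables (all lower case).
SYNONYMS = {"human": {"homo sapiens"}}
CHILDREN = {"neoplasm": {"sarcoma", "kaposi's sarcoma"}}


def classify_match(query, term):
    """Return 'exact', 'synonym', 'child', or None if the term does not match."""
    q, t = query.lower(), term.lower()
    if q == t:
        return "exact"
    if t in SYNONYMS.get(q, set()) or q in SYNONYMS.get(t, set()):
        return "synonym"
    if t in CHILDREN.get(q, set()):
        return "child"
    return None


def highlight_colour(query, term):
    """Return the highlight colour for a matched term, or None for no match."""
    return HIGHLIGHT.get(classify_match(query, term))


print(highlight_colour("human", "Homo sapiens"))
print(highlight_colour("neoplasm", "Sarcoma"))
```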
25
AE Archive – experiment view
- MIAME or MINSEQE scores show how well the experiment complies with the standard
- Links to the available files. These vary between sequencing and microarray data. For microarray experiments there is also an array design file, and you can view a graph of the experimental design
- Raw data in ENA (this is a sequencing experiment); processed data downloadable as a zip file
- Experimental factor(s) and their values
26
AE Archive – SDRF file
27
SDRF file – sample & data relationship
New view. It is now easy to identify the experimental variables investigated in each experiment. In this case ‘Developmental stage’ is the only variable, with values 10hpi, 15hpi, 20hpi, 25hpi, 30hpi, 35hpi, 40hpi and 5hpi. It is also easy to see the relationship between samples and data files.
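The same information can be pulled out of an SDRF file directly. The sketch below assumes a tab-delimited SDRF with one row per sample and column headers of the form `Factor Value[...]`; the sample names, organism and file names are invented for the example, and both helper functions are hypothetical.

```python
import csv
import io
import re

# Hypothetical three-sample SDRF; real files have many more columns.
EXAMPLE_SDRF = (
    "Source Name\tCharacteristics[organism]\t"
    "Factor Value[developmental stage]\tArray Data File\n"
    "sample 1\tExample organism\t5hpi\tsample1.raw\n"
    "sample 2\tExample organism\t10hpi\tsample2.raw\n"
    "sample 3\tExample organism\t10hpi\tsample3.raw\n"
)

# Accept both "Factor Value[x]" and "Factor Value [x]".
FACTOR_HEADER = re.compile(r"^Factor\s*Value\s*\[(.+?)\]", re.IGNORECASE)


def factor_values(text):
    """Return {factor name: sorted distinct values} found in the SDRF."""
    factors = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        for header, value in row.items():
            match = FACTOR_HEADER.match(header or "")
            if match and value:
                factors.setdefault(match.group(1), set()).add(value)
    return {name: sorted(values) for name, values in factors.items()}


def files_by_sample(text, sample_col="Source Name", file_col="Array Data File"):
    """Return {sample name: data file name}, one entry per SDRF row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[sample_col]: row[file_col] for row in reader}


print(factor_values(EXAMPLE_SDRF))
print(files_by_sample(EXAMPLE_SDRF))
```

One limitation of this sketch: `csv.DictReader` keeps only the last of any repeated column headers, so an SDRF that repeats a header would need a reader that works on column positions instead.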
28
Searching AE Archive Advanced query
Combine search terms
- Enter two or more keywords in the search box with the operators AND, OR or NOT.
- AND is the default operator; a search for ‘kidney cancer’ will return hits with a match to ‘kidney’ AND ‘cancer’.
- Search terms of more than one word must be entered inside quotes, otherwise only the first word will be searched for, e.g. “kidney cancer”.
Specify fields for searches
- Particular fields for searching can also be specified in the format fieldname:value
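A small sketch of how query strings following these rules could be assembled: multi-word terms are quoted, clauses are joined with AND or OR, and a field is targeted with `fieldname:value`. The helper functions are hypothetical, and the field name `species` is used only as an assumed example of a field name; the slide does not list the available fields.

```python
def quote_term(term):
    """Wrap a term in double quotes if it contains more than one word."""
    term = term.strip()
    if any(ch.isspace() for ch in term):
        return f'"{term}"'
    return term


def field_clause(field, value):
    """Build a fieldname:value clause, quoting the value when needed."""
    return f"{field}:{quote_term(value)}"


def combine(clauses, operator="AND"):
    """Join clauses with AND (the default operator) or OR."""
    if operator not in {"AND", "OR"}:
        raise ValueError("operator must be 'AND' or 'OR'")
    return f" {operator} ".join(clauses)


def exclude(query, clause):
    """Append a NOT clause to an existing query."""
    return f"{query} NOT {clause}"


query = combine([quote_term("kidney cancer"), field_clause("species", "Homo sapiens")])
print(query)
print(exclude("kidney", "cancer"))
```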
29
Searching AE Archive Advanced query – examples
30
ArrayExpress – two databases
31
Expression Atlas – when to use it?
- Find out if the expression of a gene (or a group of genes with a common gene attribute, e.g. GO term) change(s) across all the experiments available in the Expression Atlas
- Discover which genes are differentially expressed in a particular biological condition that you are interested in
32
Expression Atlas construction Experiment selection criteria
The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
- Array designs relating to the experiment must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
- High MIAME scores
- Experiment must have 6 or more hybridizations
- Sufficient replication and large sample size
- EF and EFV must be well annotated
- Adequate sample annotation must be provided
- Processed data must be provided, or raw data which can be renormalized must be available
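The criteria above that can be checked mechanically could be expressed as a simple filter. This is only a sketch: the `Experiment` fields are hypothetical, the accessions are made up, and the MIAME score threshold is passed in by the caller because the slide says only "high" without giving a number. Criteria that need human judgement (sufficient replication, adequate annotation) are reduced here to a single boolean.

```python
from dataclasses import dataclass

MIN_HYBRIDIZATIONS = 6  # "6 or more hybridizations", from the slide


@dataclass
class Experiment:
    accession: str
    has_array_design: bool
    miame_score: int
    hybridizations: int
    factors_annotated: bool
    has_processed_data: bool
    has_renormalizable_raw_data: bool


def failed_criteria(exp, min_miame_score):
    """Return a list of the criteria the experiment does not meet."""
    failures = []
    if not exp.has_array_design:
        failures.append("array design missing")
    if exp.miame_score < min_miame_score:
        failures.append("MIAME score too low")
    if exp.hybridizations < MIN_HYBRIDIZATIONS:
        failures.append("fewer than 6 hybridizations")
    if not exp.factors_annotated:
        failures.append("EF/EFV not well annotated")
    if not (exp.has_processed_data or exp.has_renormalizable_raw_data):
        failures.append("no processed or renormalizable raw data")
    return failures


def eligible_for_atlas(exp, min_miame_score):
    """True when no criterion fails."""
    return not failed_criteria(exp, min_miame_score)


# The threshold of 4 is an arbitrary value chosen for the example.
small = Experiment("E-EXAMPLE-2", True, 5, 4, True, False, True)
print(failed_criteria(small, min_miame_score=4))
```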
33
Expression Atlas construction Analysis pipeline
- New meta-analytical tool for searching gene expression profiles across experiments in AE
- Data is taken as normalized by the submitter
- Gene-wise linear models (limma) and t-statistics are applied to calculate the strength of genes’ differential expression across conditions across experiments
- The result is a two-dimensional matrix where rows correspond to genes and columns correspond to conditions, rather than samples
- The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
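To make the shape of the gene-by-condition matrix concrete, here is a simplified sketch. It is not the Atlas pipeline: it replaces limma's linear models with an ordinary Welch t-statistic comparing each condition against all other samples, and it stores the signed statistic rather than a p-value (which would need the t distribution). Gene names, condition labels and expression values are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group, rest):
    """Welch t-statistic for two samples; each needs at least two values."""
    standard_error = sqrt(variance(group) / len(group) + variance(rest) / len(rest))
    return (mean(group) - mean(rest)) / standard_error


def signed_statistics(expression, conditions):
    """Build a gene x condition matrix of signed t-statistics.

    expression: {gene: [value per sample]}
    conditions: [condition label per sample], same order as the values
    """
    matrix = {}
    for gene, values in expression.items():
        row = {}
        for condition in sorted(set(conditions)):
            group = [v for v, c in zip(values, conditions) if c == condition]
            rest = [v for v, c in zip(values, conditions) if c != condition]
            row[condition] = welch_t(group, rest)
        matrix[gene] = row
    return matrix


def direction(statistic):
    """The sign gives the direction of differential expression."""
    return "up" if statistic > 0 else "down"


conditions = ["control"] * 3 + ["treated"] * 3
expression = {
    "geneA": [1.0, 1.2, 0.8, 3.0, 3.2, 2.8],  # clearly higher when treated
    "geneB": [2.0, 2.1, 1.9, 2.0, 1.9, 2.1],  # no real difference
}
matrix = signed_statistics(expression, conditions)
for gene, row in matrix.items():
    print(gene, {c: (round(t, 2), direction(t)) for c, t in row.items()})
```

Columns here are conditions rather than samples, which matches the slide's description; a large absolute value indicates strong evidence of differential expression and the sign gives its direction.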
34
Expression Atlas construction
35
Expression Atlas construction
36
Expression Atlas
37
Atlas home page http://www.ebi.ac.uk/gxa/
- Restrict query by direction of differential expression
- Query for genes
- Query for conditions
- The ‘advanced query’ option allows building more complex queries
38
Atlas home page The ‘Genes’ and ‘Conditions’ search boxes
39
Atlas home page A single gene query
40
Atlas gene summary page
41
Atlas experiment page
42
Atlas experiment page – HTS data
43
Atlas home page A ‘Conditions’ only query
44
Atlas heatmap view
45
Atlas gene-condition query
46
Atlas query refining
47
Atlas gene-condition query
48
Atlas query refining
49
Atlas query refining
50
Data submission to AE
51
Data submission to AE www.ebi.ac.uk/microarray/submissions.html
52
Submission of HTS gene expression data
Submit via the MAGE-TAB submission route. Submit:
- MAGE-TAB spreadsheet containing details of the samples and protocols used
- Trace data files for each sample (in SRF, FASTQ or SFF format)
- Processed data files
For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA). If you have human identifiable sequencing data, you need to submit to the European Genome-phenome Archive and not ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.
53
Types of data that can be submitted
54
What happens after submission?
- Email confirmation
- Curation: the curation team will review your submission and will email you with any questions
- Possible reopening for editing
- We will send you an accession number when all the required information has been provided
- We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public
55
Find out more
- Visit our eLearning portal, Train online, for courses on ArrayExpress and Atlas
- Email us at:
- Atlas mailing list: