Published by Lewis Todd. Modified over 9 years ago.
1
ArrayExpress and Gene Expression Atlas: Mining Functional Genomics Data
Gabriella Rustici, PhD
Functional Genomics Team, EBI-EMBL
gabry@ebi.ac.uk
2
Talk structure
- Why do we need a database for functional genomics data?
- ArrayExpress database
  - Archive
  - Gene Expression Atlas
- Database content
- Query the database
- Data download
- Data submission
3
Components of a functional genomics experiment
Microarray experiment: array design information, location of each element, description of each element, hybridization protocol, quantification matrix, software specifications, sample source, sample treatments, RNA extraction protocol, labelling protocol, control array elements, normalization method, image scanning protocol.
Sequencing experiment: software specifications, sample source, sample treatments, template preparation, library preparation, cluster amplification, sequencing and imaging, from images to sequences, quality control, sequence alignment, assembly, specific steps depending on the application.
Diagram labels: Sample, Library, Chip, Data analysis; Array, Normalized data, Raw data, Sample, Data analysis.
4
ArrayExpress
www.ebi.ac.uk/arrayexpress/
- Is a public repository for functional genomics data, mostly generated using microarray or high-throughput sequencing (HTS) assays
- Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
- Provides easy access to well-annotated microarray data in a structured and standardized format
- Facilitates the sharing of microarray designs and experimental protocols
- Is based on FGED standards: the MIAME checklist, the MAGE-TAB format and the MGED Ontology (MO), plus the MINSEQE checklist for HTS data (http://www.mged.org/minseqe/)
5
Reporting standards for microarrays: MIAME checklist
Minimal Information About a Microarray Experiment
The 6 most critical elements contributing towards MIAME are:
1. Essential sample annotation including experimental factors and their values (e.g. compound and dose)
2. Experimental design including sample data relationships (e.g. which raw data file relates to which sample, …)
3. Sufficient array annotation (e.g. gene identifiers, genomic coordinates, probe sequences or array catalog number)
4. Essential laboratory and data processing protocols (e.g. normalization method used)
5. Raw data for each hybridization (e.g. CEL or GPR files)
6. Final normalized data for the set of hybridizations in the experiment
6
Reporting standards for sequencing: MINSEQE checklist
Minimal Information about a high-throughput Nucleotide SEQuencing Experiment
The proposed guidelines for MINSEQE (still a work in progress) are:
1. General information about the experiment
2. Essential sample annotation including experimental factors and their values (e.g. compound and dose)
3. Experimental design including sample data relationships (e.g. which raw data file relates to which sample, …)
4. Essential experimental and data processing protocols
5. Sequence read data with quality scores, raw intensities and processing parameters for the instrument
6. Final processed data for the set of assays in the experiment
7
Reporting standards for microarrays: MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment:
- IDF: Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.
- SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
- ADF: Array Design Format file. Describes the design of an array, i.e. the sequence located at each feature on the array and the annotation of the sequences.
- Data files: raw and processed data files. The ‘raw’ data files are the files produced by the microarray image analysis software, such as CEL files for Affymetrix or GPR files from GenePix. The processed data file is a ‘data matrix’ file containing processed values, as provided by the data submitter.
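Because the SDRF is a tab-delimited spreadsheet, the sample-to-data relationships it records can be read with a few lines of code. The sketch below is illustrative only: the file content, sample names and file names are made up, and the column headers ("Source Name", "Factor Value[compound]", "Array Data File") are assumed to follow the usual MAGE-TAB naming.

```python
import csv
import io

# Hypothetical SDRF content for illustration; real files come from the submitter.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[compound]\tArray Data File\n"
    "sample1\tMus musculus\tcontrol\tsample1.CEL\n"
    "sample2\tMus musculus\tdrugX\tsample2.CEL\n"
)


def read_sdrf(sdrf_text):
    """Parse tab-delimited SDRF text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(sdrf_text), delimiter="\t"))


def sample_to_raw_file(rows):
    """Map each sample to the raw data file derived from it."""
    return {row["Source Name"]: row["Array Data File"] for row in rows}


def factor_values(rows, factor):
    """Map each sample to its value for one experimental factor."""
    column = f"Factor Value[{factor}]"
    return {row["Source Name"]: row[column] for row in rows}


rows = read_sdrf(SDRF_TEXT)
print(sample_to_raw_file(rows))
print(factor_values(rows, "compound"))
```

Reading the file into dictionaries keyed by column header keeps the code independent of column order, which can differ between submissions.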
8
Reporting standards
What semantics (or ontology) should we use to best describe an experiment's annotation?
An ontology is a formal specification of terms in a particular subject area and the relations among them. Its purpose is to provide a basic, stable and unambiguous description of such terms and relations in order to avoid improper and inconsistent use of the terminology pertaining to a given domain.
Thus far, Gene Ontology (GO) has been the most successful ontology initiative. GO is a controlled vocabulary used to describe the biology of a gene product in any organism.
9
Reporting standards for microarrays: MGED ontology (MO)
- The MO provides terms for annotating all aspects of a microarray experiment, from the design of the experiment and array layout through to the preparation of the biological sample and the protocols used to hybridize the RNA and analyze the data
- The MO was developed to provide terms for annotating experiments in line with the MIAME guidelines, i.e. to provide the semantics to describe a microarray experiment according to the concepts specified in MIAME
- Also check the Open Biomedical Ontologies (OBO) initiative (www.obofoundry.org) for the development of life-science ontologies
10
ArrayExpress – two databases
11
How to query AE and Atlas?
AE Archive
- Query by experiment, sample and experimental factor annotations
- Filter on species, array platform, molecule assayed and technology used
Gene Expression Atlas
- Gene and/or condition queries
- Query across experiments and across platforms
12
ArrayExpress – two databases
13
How much data in AE Archive?
14
Archive by species
15
Browsing the AE Archive
16
Screenshot callouts:
- The total number of experiments and assays retrieved
- AE unique experiment ID
- Curated title of experiment
- Species investigated
- Number of assays
- The date when the data were loaded in the Archive
- The direct link to raw and processed data. An icon indicates that this type of data is available.
- ‘Loaded in Atlas’ flag
- Raw sequencing data available in ENA
- The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed
17
Browsing the AE Archive
18
Experimental factor ontology (EFO)
http://www.ebi.ac.uk/efo
- Application-focused ontology modeling experimental factors (EFs) in AE
- Developed to:
  - increase the richness of annotations that are currently made in the AE Archive
  - promote consistency
  - facilitate automatic annotation and integrate external data
- EFs are transformed into an ontological representation, forming classes and relationships between those classes
- EFO terms map to multiple existing domain-specific ontologies, such as the Disease Ontology and Cell Type Ontology
19
Experimental factor ontology (EFO): an example
20
Searching AE Archive: simple query - EFO
21
Searching AE Archive: simple query
Search across all fields:
- AE accession number, e.g. E-MEXP-568
- Secondary accession numbers, e.g. GEO series accession GSE5389
- Experiment name
- Submitter's experiment description
- Sample attributes, experimental factors and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression)
- Publication title, authors and journal name, PubMed ID
Synonyms for terms are always included in searches, e.g. ‘human’ and ‘Homo sapiens’
22
AE Archive query output
- Matches to exact terms are highlighted in yellow
- Matches to synonyms are highlighted in green
- Matches to child terms in the EFO are highlighted in pink
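The three highlight colours correspond to three kinds of match. A minimal sketch of that classification follows; the synonym and child-term tables are tiny hypothetical stand-ins, not real EFO content, and the function names are invented for illustration.

```python
# Hypothetical lookup tables for illustration; the real ones come from the EFO.
SYNONYMS = {"human": {"homo sapiens"}}
CHILD_TERMS = {"cancer": {"leukemia", "carcinoma"}}

# Colours as described on the slide.
HIGHLIGHT = {"exact": "yellow", "synonym": "green", "child": "pink"}


def classify_match(query, matched_text):
    """Return 'exact', 'synonym', 'child', or None if the text does not match."""
    q = query.lower()
    m = matched_text.lower()
    if m == q:
        return "exact"
    if m in SYNONYMS.get(q, set()):
        return "synonym"
    if m in CHILD_TERMS.get(q, set()):
        return "child"
    return None


def highlight_colour(query, matched_text):
    """Return the highlight colour for a match, or None for no match."""
    kind = classify_match(query, matched_text)
    return HIGHLIGHT.get(kind)


print(highlight_colour("human", "Homo sapiens"))
print(highlight_colour("cancer", "leukemia"))
```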
23
AE Archive – experiment view
24
How does processed data look?
Diagram labels: Samples; Sample annotation; Genes; Gene expression levels or count-level data; Gene annotations
25
AE Archive – SDRF file
26
SDRF file – sample & data relationship
27
AE Archive – ADF file
28
AE Archive – Old interface
29
AE Archive – all files
30
AE Archive – all files
31
Searching AE Archive: advanced query
Combine search terms
- Enter two or more keywords in the search box with the operators AND, OR or NOT. AND is the default operator; a search for ‘kidney cancer’ will return hits with a match to ‘kidney’ AND ‘cancer’.
- Search terms of more than one word must be entered inside quotes, otherwise only the first word will be searched for, e.g. "kidney cancer".
Specify fields for searches
- Particular fields for searching can also be specified in the format fieldname:value
32
Searching AE Archive: advanced query - fieldnames
Field name | Searches | Example
accession | Experiment primary or secondary accession | accession:E-MEXP-568
array | Array design accession or name | array:AFFY-2 OR array:Agilent*
ef | Experimental factor, the name of the main variables in an experiment | ef:celltype OR ef:compound
efv | Experimental factor value. Has EFO expansion. | efv:fibroblast
expdesign | Experiment design type | expdesign:"dose response"
exptype | Experiment type. Has EFO expansion. | exptype:RNA-seq
gxa | Presence in the Gene Expression Atlas. Only value is gxa:true. | ef:compound AND gxa:true
pmid | PubMed identifier | pmid:16553887
sa | Sample attribute values. Has EFO expansion. | sa:wild_type
species | Species of the samples. Has EFO expansion. | species:"homo sapiens" AND ef:cellline
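Queries in the fieldname:value format can be assembled programmatically. The helper names below are hypothetical, and the sketch only builds the query string to paste into the search box; it does not contact ArrayExpress.

```python
def field_term(field, value):
    """Build one fieldname:value term, quoting values of more than one word."""
    if " " in value:
        value = f'"{value}"'
    return f"{field}:{value}"


def combine(operator, *terms):
    """Join terms with AND or OR, the binary operators described on the slide."""
    if operator not in ("AND", "OR"):
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


query = combine(
    "AND",
    field_term("species", "homo sapiens"),
    field_term("ef", "cellline"),
)
print(query)
```

Quoting multi-word values automatically avoids the case where only the first word of the value is searched for.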
33
Searching AE Archive: advanced query
Filtering experiments by counts of a particular attribute
Experiments fulfilling certain count criteria can also be searched for, e.g. having more than 10 assays (hybridizations).
Filter | What is filtered
assaycount:[x TO y] | Filter on the number of assays, where x <= y and both values are between 0 and 99,999 (inclusive). To exclude the given values use curly brackets, e.g. assaycount:{1 TO 5} will find experiments with 2-4 assays. Single numbers may also be given, e.g. assaycount:10 will find experiments with 10 assays.
efcount:[x TO y] | Filter on the number of experimental factors
samplecount:[x TO y] | Filter on the number of samples
sacount:[x TO y] | Filter on the number of sample attribute categories
rawcount:[x TO y] | Filter on the number of raw files
fgemcount:[x TO y] | Filter on the number of final gene expression matrix (processed data) files
miamescore:[x TO y] | Filter on the MIAME compliance score (maximum score is 5)
date:yyyy-mm-dd | Filter by release date. date:2009-12-01 will search for experiments released on 1st of Dec 2009; date:2009* will search for experiments released in 2009; date:[2008-01-01 2008-05-31] will search for experiments released between 1st of Jan and end of May 2008
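The difference between square and curly brackets is easiest to see by evaluating a range against some counts. This is a sketch of the semantics described above, not the search engine's own parser; the function name is hypothetical.

```python
import re

# Matches "[x TO y]" or "{x TO y}" with non-negative integer bounds.
_RANGE = re.compile(r"^([\[{])\s*(\d+)\s+TO\s+(\d+)\s*([\]}])$")
MAX_COUNT = 99999  # upper limit stated on the slide


def count_filter(expr):
    """Turn '[x TO y]', '{x TO y}' or a single number into a predicate on a count."""
    expr = expr.strip()
    if expr.isdigit():
        exact = int(expr)
        return lambda count: count == exact
    match = _RANGE.match(expr)
    if match is None:
        raise ValueError(f"not a valid range: {expr}")
    opening, low, high, closing = (
        match.group(1), int(match.group(2)), int(match.group(3)), match.group(4)
    )
    if (opening, closing) not in (("[", "]"), ("{", "}")):
        raise ValueError("mismatched brackets")
    if low > high or high > MAX_COUNT:
        raise ValueError("bounds must satisfy 0 <= x <= y <= 99999")
    if opening == "[":
        return lambda count: low <= count <= high  # inclusive bounds
    return lambda count: low < count < high  # exclusive bounds


inclusive = count_filter("[1 TO 5]")
exclusive = count_filter("{1 TO 5}")
print([n for n in range(8) if inclusive(n)])
print([n for n in range(8) if exclusive(n)])
```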
34
Searching AE Archive: advanced query – an example
35
Exercise 1
36
ArrayExpress – two databases
37
Gene Expression Atlas: experiment selection criteria
The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
- Array designs relating to the experiment must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
- High MIAME scores
- Experiment must have 6 or more hybridizations
- Sufficient replication and large sample size
- EF and EFV must be well annotated
- Adequate sample annotation must be provided
- Processed data must be provided, or raw data which can be renormalized must be available
38
Gene Expression Atlas: Atlas construction
- New meta-analytical tool for searching gene expression profiles across experiments in AE
- Data is taken as normalized by the submitter
- Gene-wise linear models (limma) and t-statistics are applied to calculate the strength of genes’ differential expression across conditions across experiments
- The result is a two-dimensional matrix where rows correspond to genes and columns correspond to conditions, rather than samples
- The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
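To make the gene-by-condition matrix concrete, here is a heavily simplified sketch. It is not the Atlas pipeline: it swaps limma's linear models for a plain Welch t-statistic (each condition against all other samples) from the Python standard library, and it uses a fixed, arbitrary cutoff on |t| in place of a p-value. The data, names and cutoff are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t-statistic; needs >= 2 values per group and non-zero variance."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def call_matrix(expression, conditions, cutoff=4.0):
    """Return {gene: {condition: 'up' | 'down' | 'no change'}}.

    expression: {gene: [one value per sample]}
    conditions: [one condition label per sample], same order as the values
    cutoff: arbitrary |t| threshold standing in for a significance test
    """
    labels = sorted(set(conditions))
    matrix = {}
    for gene, values in expression.items():
        row = {}
        for label in labels:
            inside = [v for v, c in zip(values, conditions) if c == label]
            outside = [v for v, c in zip(values, conditions) if c != label]
            t = welch_t(inside, outside)
            if t > cutoff:
                row[label] = "up"
            elif t < -cutoff:
                row[label] = "down"
            else:
                row[label] = "no change"
        matrix[gene] = row
    return matrix


# Hypothetical normalized expression values: three control and three treated samples.
conditions = ["control"] * 3 + ["treated"] * 3
expression = {
    "geneA": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],
}
matrix = call_matrix(expression, conditions)
for gene, row in matrix.items():
    print(gene, row)
```

The point of the sketch is the shape of the result: one row per gene, one column per condition, and each entry carrying a direction, rather than one column per sample.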
39
Gene Expression Atlas: Atlas construction
40
Legend: up-regulated, down-regulated, no change
41
Gene Expression Atlas
42
Atlas home page
http://www.ebi.ac.uk/gxa/
- Query for gene(s)
- Query for condition(s)
- Restrict search by direction of differential expression
- The ‘advanced search’ option allows building more complex queries
43
Atlas home page: the ‘Genes’ search box & auto-complete function
44
Atlas home page: the ‘Conditions’ search box & ontology browsing
45
Atlas home page: a single gene query
46
Atlas gene summary page
47
Atlas experiment page
- Expression plot
- Table containing gene information and drop-down menus for searching within the experiment
- Experimental factors list
48
Atlas experiment page – HTS data
49
Atlas home page: a ‘Conditions’ only query
50
Atlas heatmap view
51
Atlas list view
Click the ‘expression profile’ link to view the experiment page
52
Atlas data download
53
Atlas gene-condition query
54
Atlas query refining
55
Atlas gene-condition query
56
Atlas query refining
57
Atlas query refining
58
Exercises 2, 3 & 4
59
Data submission to AE
60
Data submission to AE
www.ebi.ac.uk/microarray/submissions.html
61
Submission of HTS gene expression data
Submit via the MAGE-TAB submission route. Submit:
- A MAGE-TAB spreadsheet containing details of the samples and protocols used
- Trace data files for each sample (in SRF, FASTQ or SFF format)
- Processed data files
For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).
If you have human identifiable sequencing data, you need to submit to the European Genome-phenome Archive and not ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.
62
Types of data that can be submitted
63
What happens after submission?
- Email confirmation
- Curation: the curation team will review your submission and will email you with any questions
- Possible reopening for editing
- We will send you an accession number when all the required information has been provided
- We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public