EBI is an Outstation of the European Molecular Biology Laboratory. EBI Bioinformatics Roadshow ILRI/BecA Nairobi Campus 2 nd - 3 rd March 2011
EBI is an Outstation of the European Molecular Biology Laboratory. ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
ArrayExpress3 Talk structure Why do we need a database for functional genomics data? ArrayExpress database Archive Gene Expression Atlas Database content Query the database Data download Data submission
Components of a functional genomics experiment Array design information Location of each element Description of each element Hybridization protocol Quantification matrix Software specifications Sample source Sample treatments RNA extraction protocol Labelling protocol Control array elements Normalization method Image Scanning protocol Software specifications Array Normalized data Raw data Sample Sample source Sample treatments Template preparation Library Library preparation Chip Cluster amplification Sequencing and imaging Data analysis From images to sequences Quality Control Sequence alignment Assembly Specific steps depending on the application Data analysis
ArrayExpress Is a public repository for functional genomics data, mostly generated using microarray or high throughput sequencing (HTS) assays Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ Provides easy access to well annotated microarray data in a structured and standardized format Facilitates the sharing of microarray designs and experimental protocols Based on FGED standards: MIAME checklist, MAGE-TAB format and MO Ontology. MINSEQE checklist for HTS data ( ArrayExpress5
Reporting standards for microarrays MIAME checklist Minimal Information About a Microarray Experiment The 6 most critical elements contributing towards MIAME are: 1.Essential sample annotation including experimental factors and their values (e.g. compound and dose) 2.Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ….) 3.Sufficient array annotation (e.g. gene identifiers, genomic coordinates, probe sequences or array catalog number) 4.Essential laboratory and data processing protocols (e.g. normalization method used) 5.Raw data for each hybridization (e.g. CEL or GPR files) 6.Final normalized data for the set of hybridizations in the experiment ArrayExpress6
Reporting standards for sequencing MINSEQE checklist Minimal Information about a high-throughput Nucleotide SEQuencing Experiment The proposed guidelines for MINSEQE are (still work in progress): 1.General information about the experiment 2.Essential sample annotation including experimental factors and their values (e.g. compound and dose) 3.Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ….) 4.Essential experimental and data processing protocols 5.Sequence read data with quality scores, raw intensities and processing parameters for the instrument 6.Final processed data for the set of assays in the experiment ArrayExpress7
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment: IDFInvestigation Description Format file, contains top-level information about the experiment including title, description, submitter contact details and protocols. SDRFSample and Data Relationship Format file contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter. ADFArray Design Format file, describes the design of an array, i. e. the sequence located at each feature on the array and annotation of the sequences. Data filesRaw and processed data files. The ‘raw’ data files are the files produced by the microarray image analysis software, such as CEL files for Affymetrix or GPR files from GenePix. The processed data file is a ‘data matrix’ file containing processed values, as provided by the data submitter. 8 Reporting standards for microarrays MAGE-TAB format ArrayExpress
Reporting standards What semantics (or ontology) should we use to best describe its annotation? Ontology, which is a formal specification of terms in a particular subject area and the relations among them. Its purpose is to provide a basic, stable and unambiguous description of such terms and relations in order to avoid improper and inconsistent use of the terminology pertaining to a given domain. Thus far, Gene Ontology (GO) has been the most successful ontology initiative. GO is a controlled vocabulary used to describe the biology of a gene product in any organism. ArrayExpress9
Reporting standards for microarrays MGED ontology (MO) The MO provides terms for annotating all aspects of a microarray experiment from the design of the experiment and array layout, through to the preparation of the biological sample and the protocols used to hybridize the RNA and analyze the data The MO was developed to provide terms for annotating experiments in line with the MIAME guidelines, i.e. to provide the semantics to describe a microarray experiment according to the concepts specified in MIAME Also check Open Biomedical Ontologies (OBO) initiative ( for the development of life-science ontologies ArrayExpress10
ArrayExpress11 ArrayExpress – two databases
ArrayExpress12 How to query AE and Atlas? AE Archive Query by experiment, sample and experimental factor annotations Filter on species, array platform, molecule assayed and technology used Gene Expression Atlas Gene and/or condition queries Query across experiments and across platforms
ArrayExpress – two databases ArrayExpress13
How much data in AE Archive? ArrayExpress14
ArrayExpress15 Archive by species
ArrayExpress16 Browsing the AE Archive
The direct link to raw and processed data. An icon indicates that this type of data is available. The total number of experiments and assay retrieved Species investigated Curated title of experiment The date when the data were loaded in the Archive AE unique experiment ID Number of assays The list of experiments retrieved can be printed, saved as Tab- delimited format or exported to Excel or as RSS feed loaded in Atlas flag ArrayExpress17 Raw sequencing data available in ENA
ArrayExpress18 Browsing the AE Archive
Experimental factor ontology (EFO) Application focused ontology modeling experimental factors (EFs) in AE – selected by default Developed to: increase the richness of annotations that are currently made in AE Archive to promote consistency to facilitate automatic annotation and integrate external data EFs are transformed into an ontological representation, forming classes and relationships between those classes EFO terms map to multiple existing domain specific ontologies, such as the Disease Ontology and Cell Type Ontology ArrayExpress19
Searching AE Archive Simple query - EFO ArrayExpress20
Searching AE Archive Simple query Search across all fields: AE accession number e.g. E-MEXP-568 Secondary accession numbers e.g. GEO series accession GSE5389 Experiment name Submitter's experiment description Sample attributes, experimental factor and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression) Publication title, authors and journal name, PubMed ID Synonyms for terms are always included in searches e.g. 'human' and 'Homo sapiens’ ArrayExpress21
AE Archive query output Matches to exact terms are highlighted in yellow Matches to synonyms are highlighted in green Matches to child terms in the EFO are highlighted in pink
AE Archive – experiment view ArrayExpress23 MIAME or MINSEQE scores show how much the experiment is standard compliant Link to ENA and/or GEO corresponding page as well as to the experimental protocols used in this study Experimental factor(s) and its values Link to files available. This varies between sequencing and microarray data. For microarray experiments you also have array design file and you can view a graph of the experimental design
SamplesSample annotation Genes Gene expression levels or count level data Gene annotations How does processed data look? ArrayExpress24
AE Archive – SDRF file ArrayExpress25
SDRF file – sample & data relationship ArrayExpress26
AE Archive – all files ArrayExpress27
ArrayExpress28 AE Archive – all files
Searching AE Archive Advanced query Combine search terms a.Enter two or more keywords in the search box with the operators AND, OR or NOT. AND is the default search term; a search for kidney cancer' will return hits with a match to ‘kidney' AND ‘cancer’ b.Search terms of more than one word must be entered inside quotes otherwise only the first word will be searched for. E.g. “kidney cancer” Specify fields for searches a.Particular fields for searching can also be specified in the format of fieldname:value ArrayExpress29
Searching AE Archive Advanced query - fieldnames ArrayExpress30 Field nameSearchesExample accession Experiment primary or secondary accessionaccession:E-MEXP-568 array Array design accession or namearray:AFFY-2 OR array:Agilent* ef Experimental factor, the name of the main variables in an experiment. ef:celltype OR ef:compound efv Experimental factor value. Has EFO expansion.efv:fibroblast expdesign Experiment design typeexpdesign:”dose response” exptype Experiment type. Has EFO expansion.exptype:RNA-seq gxa Presence in the Gene Expression Atlas. Only value is gxa:true. ef:compound AND gxa:true pmid PubMed identifierpmid: sa Sample attribute values. Has EFO expansion.sa:wild_type species Species of the samples. Has EFO expansion.species:”homo sapiens” AND ef:cellline
Searching AE Archive Advanced query ArrayExpress31 FilterWhat is filtered assaycount:[x TO y]filter on the number of of assays where x <= y and both values are between 0 and 99,999 (inclusive). To count excluding the values given use curly brackets e.g. assaycount:{1 TO 5} will find experiments with 2-4 assays. Single numbers may also be given e.g. assaycount:10 will find experiments with 10 assays. efcount:[x TO y]filter on the number of experimental factors samplecount:[x TO y]filter on the number of samples sacount:[x TO y]filter on the number of sample attribute categories rawcount:[x TO y]filter on the number of raw files fgemcount:[x TO y]filter on the number of final gene expression matrix (processed data) files miamescore:[x TO y]filter on the MIAME compliance score (maximum score is 5) date:yyyy-mm-ddfilter by release date date: will search for experiments released on 1st of Dec 2009 date:2009* - will search for experiments released in 2009 date:[ ] - will search for experiments released between 1st of Jan and end of May 2008 Filtering experiments by counts of a particular attribute Experiments fulfilling certain count criteria can also be searched for e.g. having more than 10 assays (hybridizations)
Exercise 1a and 1b ArrayExpress32
ArrayExpress33 ArrayExpress – two databases
ArrayExpress34 The criteria we use for selecting experiments for inclusion in the Atlas are as follows: i.Array designs relating to experiment must be provided to enable re-annotation using Ensembl or Uniprot (or have the potential for this to be done) ii.High MIAME scores iii.Experiment must have 6 or more hybridizations iv.Sufficient replication and large sample size v.EF and EFV must be well annotated vi.Adequate sample annotation must be provided vii.Processed data must be provided or raw data which can be renormalized must be available Gene Expression Atlas Experiment selection criteria
ArrayExpress35 New meta-analytical tool for searching gene expression profiles across experiments in AE Data is taken as normalized by the submitter Gene-wise linear models (limma) and t-statistics are applied to calculate the strength of genes’ differential expression across conditions across experiments The result is a two-dimensional matrix where rows correspond to genes and columns correspond to conditions, rather than samples. The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression Gene Expression Atlas Atlas construction
ArrayExpress36 Gene Expression Atlas
ArrayExpress37 Atlas home page Query for genes Query for conditions Restrict query by direction of differential expression The ‘advanced query’ option allows building more complex queries
ArrayExpress38 Atlas home page Query auto-complete functionality and ontology browsing
Atlas gene summary page Click the ‘expression profile’ link to view the experiment page ArrayExpress39
ArrayExpress40 Atlas heatmap view
ArrayExpress41 Atlas experiment page Click on ‘Experiment analysis’ to get a table of 10 top differentially expressed genes in experiment E-GEOD-5418 Search for a specific gene, experimental factor or expression level (up/down) Box plot showing differential expression of LCK in disease state ‘experimental malaria-infected’. Data is available for 2 separate array probes.
ArrayExpress42 Atlas experiment page Switch between box plot and line plot views of the data
ArrayExpress & Atlas43 Atlas experiment page – HTS data
Atlas list view Click the ‘expression profile’ link to view the experiment page ArrayExpress44
Atlas data download ArrayExpress45
ArrayExpress46 Atlas query refining
ArrayExpress47 Atlas query refining
ArrayExpress48 Exercise 2
ArrayExpress49 Data submission to AE
ArrayExpress50 Data submission to AE
MIAMExpress- Overview 1.MIAMExpress is a web based tool for submitting microarray data and microarray designs to the ArrayExpress database 2.MIAMExpress can be used for experiments up to 50 hybridizations in size ArrayExpress & Atlas51 Array design (custom) Experiment Protocol
Types of data that can be submitted ArrayExpress & Atlas52
Submission of HTS gene expression data 1.Submit via MAGE-TAB submission route 2.Submit: a.MAGE-TAB spreadsheet containing details of the samples and protocols used. b.Trace data files for each sample (in SRF, FASTQ or SFF format ) c.Processed data files 3.For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA). 4.If you have human identifiable sequencing data you need to submit to the The European Genome-phenome Archive and not ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely. ArrayExpress & Atlas53
What happens after submission? 1. confirmation 2.Curation a.The curation team will review your submission and will you with any questions. b.Possible reopening for editing 3.We will send you an accession number when all the required information has been provided. 4.We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public. ArrayExpress & Atlas54
European Genome-phenome Archive (EGA) 55 1.The European Genome-phenome Archive (EGA) is a service dedicated to archiving, processing and disseminating all types of potentially identifiable genetic and phenotypic human data at the EBI 2.The EGA provides submitters with a free, secure and permanent archiving solution for sharing their data worldwide. 3.In addition, each study is given a stable identifier that may be referred to in publications
Data Submission to the EGA EGA will accept all data components from the study 1.The EGA accepts a.raw data formats arising from applications of DNA sequencing and array-based genotyping. b.processed data such as genotypes, structural variants and whole genome sequence with any value associated with these calls c.any associated phenotype information. d.General experiment information is provided in MAGE-TAB format 2.The data must be "de-identified“in accordance with the individual consent agreements and any applicable regulations. 3.Submissions may use manual upload or automated systems such as SIMBioMS* or direct link from the local LIMS system. EGA requires statements that verify: 1.submission is consistent with the initial informed consent and with the applicable laws and regulations 2.submitter has authority for the submission on behalf of the organisation 3.an individual or organisation can be identified that will be responsible for making data access decisions All data must be encrypted prior to submission. 56
How to submit data to the EGA ArrayExpress & Atlas57 Contact the EGA Helpdesk and inform us that you would like to submit data to the EGA. Contact the EGA Helpdesk and inform us that you would like to submit data to the EGA. Receive a unique accession number for your submission and Log-in details will be sent to you for a password protected submission box that can be used either with FTP or Aspera. Aspera Receive a unique accession number for your submission and Log-in details will be sent to you for a password protected submission box that can be used either with FTP or Aspera. Aspera Document your study protocols, samples and required statements. Encrypt all your documents and files Calculate md5 checksums for the submitted files. Upload all your files into a private submission box. Send your encryption key to the EGA by postal services, deliver in person or let us know over the phone
Submissions-Key Point Don’t be afraid to ask ArrayExpress & Atlas58 Useful Links.pdf