ArrayExpress and Gene Expression Atlas: Mining Functional Genomics Data Amy Tang, PhD ArrayExpress Production Team Functional Genomics Group EMBL-EBI
What’s covered this morning? http://www.ebi.ac.uk/training/course/bioinformatics-udine2013 What do we mean by “functional genomics data”? Why do we need databases for them? Two databases: ArrayExpress Expression Atlas What’s in each database, how to browse, search, interpret, download data (Microarray/sequencing data analysis; How to submit data to ArrayExpress?) 2 ArrayExpress
Functional genomics (FG) data The aim of FG is to understand the function of genes and other (non-genic) parts of the genome Often involved high-throughput technologies (microarrays, high-throughput sequencing [HTS]) Questions addressed: Gene expression - when? where? how much? changes? Gene function - roles of genes in cellular processes, pathways Gene/genome regulation - e.g. histone modifications, CpG (DNA) methylation 3 ArrayExpress
Example of FG data sets in ArrayExpress Questions addressed: Gene expression - when? where? how much? changes? Gene function - roles of genes in cellular processes, pathways 4 ArrayExpress
Example of FG data sets in ArrayExpress Questions addressed: Gene/genome regulation - e.g. histone modifications, CpG (DNA) methylation 5 ArrayExpress
The two databases: how are they related? ArrayExpress Direct submission Curation Statistical analysis Expression Atlas Import from external databases (mainly NCBI Gene Expr. Omnibus) Links to analysis software, e.g. Links to other databases, e.g. 6 ArrayExpress
The two databases: how do they compare? ArrayExpress Expression Atlas Central object Experiment Gene or condition Microarray data Sequencing data RNA-seq coming soon Query for… Experimental information and associated data Up/downregulated genes across experiments and microarray platforms Download data for further analysis Submit data X Curated data Yes (direct submissions) /No (GEO-imported) All curated 7 ArrayExpress
ArrayExpress www.ebi.ac.uk/arrayexpress Public repository for functional genomics data (both microarray and sequencing) Together with GEO at NCBI and CIBEX at DDBJ, serves the scientific community as a data archive supporting publications Provides access to curated data in a structured and standardised format – essential for easy sharing of experimental information Submissions are curated based on community standards: MIAME guidelines & MAGE-TAB format for microarray MINSEQE guidelines & MAGE-TAB format for HTS data 8 ArrayExpress
Community standards for data requirement MIAME = Minimal Information About a Microarray Experiment (http://www.mged.org/Workgroups/MIAME/miame_2.0.html) MINSEQE = Minimal Information about a high-throughput Nucleotide SEQuencing Experiment (http://www.mged.org/minseqe) The checklist: Requirements MIAME MINSEQE 1. Experiment design / background description 2. Sample annotation and experimental factor 3. Array design annotation (e.g. probe sequence) 4. All protocols (wet-lab bench and data processing) 5. Raw data files (from scanner or sequencing machine) 6. Processed data files (normalised and/or transformed) 9 ArrayExpress
What is an experimental factor? The main variable(s) studied, often related to the hypothesis of the experiment and is the independent variable. Values of the factor (“factor values”) should vary. Experimental design Factor Factor Values Not factor beef vs horse meat Diet beef, horse meat Organism (human) smoker vs non-smoker compound cigarette smoke (tobacco), no tobacco Organism (human), sex (male) face cream A vs control X Active ingredient A, “sham” control Cell type A X 1010 ArrayExpress
Reporting standards - MAGE-TAB format A simple spreadsheet format that uses a number of tab-delimited text files Array Design Format file Describes probes on an array, e.g. sequence, genomic mapping location ADF (microarray only) Investigation Description Format file Experiment title Experiment description Submitter’s contact details Definition of all protocols IDF Raw and processed data files 1.fq.gz .CEL A1.CEL Normalized.txt 2.fq.gz Sample Data Relationship Format file Starting materials with annotation Derived materials (e.g. RNA extracts) All assays (hybs/seq. lanes) Resulting data file(s) for each assay SDRF 1111 ArrayExpress
MAGE-TAB Example: IDF
MAGE-TAB Example: SDRF
How much data in ArrayExpress? (as of 29 Oct 2013) 14 ArrayExpress
HTS data in ArrayExpress (as of 29 October 2013) Microarray vs HTS RNA-, DNA-, ChIP-seq breakdown 15 ArrayExpress
Browsing ArrayExpress www.ebi.ac.uk/arrayexpress ArrayExpress
Browsing ArrayExpress experiments www. ebi. ac Browsing ArrayExpress experiments www.ebi.ac.uk/arrayexpress/experiments/browse.html All columns can be sorted by clicking at the heading ArrayExpress
File download on the Browse page Direct download link (e.g. here it’s for a single raw data archive [i.e. *.zip] file) A link to a page which lists all the archive files available for download. (No direct link because there are >1 archives) This is specifically for HTS experiments. Direct link to European Nucleotide Archive (ENA)’s page which lists all the sequencing assays (which are called “runs” at the ENA). 18 ArrayExpress
ArrayExpress single-experiment view Sample characteristics, factors and factor values The microarray design used MIAME or MINSEQE scores ( * = compliant) All files related to this experiment ( e.g. IDF, SDRF, array design, raw data, R object ) Send data to GenomeSpace and analyse it yourself 19 ArrayExpress
Samples view – microarray experiment All columns can be sorted by clicking at the heading Direct link to data files for one sample Sample characteristics Factor values Scroll left and right to see all sample characteristics and factor values 20 ArrayExpress
Samples view – sequencing experiment Direct link to European Nucleotide Archive (ENA) record about this sequencing assay Direct link to fastq files at European Nucleotide Archive (ENA) 21 ArrayExpress
Searching for experiments in ArrayExpress www. ebi. ac Searching for experiments in ArrayExpress www.ebi.ac.uk/arrayexpress/experiments/browse.html ArrayExpress
Experimental factor ontology (EFO) http://www.ebi.ac.uk/efo Ontology: a way to systematically organise experimental factor terms. controlled vocabulary + hierarchy (relationship) Used in EBI databases: and external projects (e.g. NHGRI GWAS Catalogue) Combine terms from a subset of well-maintained and compatible ontologies, e.g. Gene Ontology (cellular component + biological process terms) NCBI Taxonomy Ontology in layman terms: http://jamesmaloneebi.blogspot.co.uk/2012/06/common-ontology-questions-1-what-is-it.html ArrayExpress
Building EFO - an example Take all experimental factors sarcoma cancer neoplasm Kaposi’s sarcoma disease is the parent term is a type of is synonym of Find the logical connection between them disease neoplasm cancer sarcoma Kaposi’s sarcoma [-] Organize them in an ontology sarcoma cancer neoplasm disease Kaposi’s sarcoma ArrayExpress
Exploring EFO - an example ArrayExpress
Experimental factor ontology (EFO) http://www.ebi.ac.uk/efo EFO developed to: increase the richness of annotations in databases expand on search terms when querying ArrayExpress and Expression Atlas using synonyms (e.g. “cerebral cortex” = “adult brain cortex”) using child terms (e.g. “bone” “rib” and “vertebra”) promote consistency (e.g. F/female/, 1day/24hours) facilitate automatic annotation and integration of external data (e.g. changing “gender” to “sex” automatically) 26 ArrayExpress
Searching ArrayExpress Using EFO terms and filters Filter your search results by: Species of interest One array design (platform), molecule (DNA, RNA, protein, etc) technology (microarray or HTS) “Auto-complete” with suggestions (like Google search) Avoid acronyms as search terms Enter keyword, click search, then filter next. ArrayExpress
What search terms can I use? ArrayExpress accession number, e.g. “E-MEXP-568” Secondary accession number e.g. GEO series “GSE5389” Experiment title, description Submitter's email address Publication title, authors and journal name, PubMed ID Sample attributes and experimental factor / factor values: “genetic modification” “heart” “diabetes” “neural stem cells” “penicillin” “ChIP-chip” “methylation profiling” “Arabidopsis” “p53” * Powered by EFO expansion. Use EFO terms wherever possible. ArrayExpress
Example search: “leukemia” Exact match to search term Matched EFO synonyms to search term Matched EFO child term of search term 29 ArrayExpress
Experimental factor value Advanced search Allows you to restrict your search to a specific field Format of search term: field_name:search_term Some examples: Specific field Example term What it means Experimental factor “ef:genotype” Search for experiments where “genotype” is a factor Experimental factor value “efv:"wild type" Search for experiments with “wild type” as factor value. (Factor usually is “genotype” in this case) Expression atlas “gxa:yes” Search for experiments which are present in the Atlas Number of assays “assaycount:[5 TO 10]” Search for experiments which have 5-10 assays More examples: https://www.ebi.ac.uk/arrayexpress/help/how_to_search.html#AdvancedSearchExperiment
QUESTIONS? ArrayExpress
Hands-on exercise 1 Find RNA-seq assays studying human prostate adenocarcinoma Hands-on exercise 2 Find experiments studying the effect of sodium dodecyl sulphate on human skin ArrayExpress
Import from external databases (mainly NCBI Gene Expr. Omnibus) The two databases ArrayExpress Direct submission Curation Statistical analysis Expression Atlas Import from external databases (mainly NCBI Gene Expr. Omnibus) Links to analysis software, e.g. Links to other databases, e.g. 33 ArrayExpress
The two databases: how do they compare? ArrayExpress Expression Atlas Central object Experiment Gene or condition Microarray data Sequencing data RNA-seq coming soon Query for… Experimental information and associated data Up/downregulated genes across experiments and microarray platforms Download data for further analysis Submit data X Curated data Yes (direct submissions) /No (GEO-imported) All curated 34 ArrayExpress
Atlas construction - expt selection criteria Array (platform) designs relating to the experiment must be provided. Probe annotation must be adequate to map probes to genes and allow re-annotation of external references (e.g. Ensembl gene ID, Uniprot ID) At least 3 replicates for each value of the experimental factor Maximum 4 experimental factors Adequate sample annotation using EFO terms Presence of raw data files: CEL raw data files for Affymetrix assays, fastq files for RNA-seq experiments ArrayExpress
Atlas construction – analysis pipeline A dummy example from one experiment: Cond.1 Cond.2 Cond.3 genes Cond.1 Cond.2 Cond.3 Linear model* (Bio/C Limma) Output: 2-D matrix Input data (Affy CEL, non-Affy processed) * More information about the statistical methodology: http://nar.oxfordjournals.org/content/38/suppl_1/D690.full 1= differentially expressed 0 = not differentially expressed ArrayExpress
Atlas construction – analysis pipeline How differential expression is calculated in one experiment: “Is gene X differentially expressed in condition 1 in this experiment?” Gene X Cond.1 mean Cond.2 mean Cond.3 mean Mean of all samples = a single expression value for gene X Compare and calculate statistic ArrayExpress
Atlas construction - analysis pipeline genes Cond.1 Cond.2 Cond.3 Exp.1 Apply linear modelling statistics to each of the n experiments Statistical test genes Cond.4 Cond.5 Cond.6 Exp. 2 Statistical test genes Cond.X Cond.Y Cond.Z Exp. n Statistical test Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed or not under a certain condition ArrayExpress
Atlas construction - result Summary of the “verdicts” from different experiments ArrayExpress
Expression Atlas home page http://www.ebi.ac.uk/gxa Restrict query by direction of differential expression (up, down, both, neither) Query for conditions Query for genes The ‘advanced query’ option allows building more complex queries ArrayExpress
Mapping microarray probes to genes Probe identifiers Ensembl genes Expression data per probe Every (~monthly) Atlas release takes the latest Ensembl gene – probe identifier mapping data. From Ensembl genes, we also get: Compara genes External references (xrefs) to other databases E.g. UniProt protein IDs, NCBI RefSeq IDs, HGNC gene symbols, gene ontology terms, InterPro terms
Example Atlas search: KCC2 gene and BPA Scenario: You study the health impact of Bisphenol A (BPA) BPA: common additive in household plastic items. Negative health effects have been linked to BPA, e.g. on foetal and neonatal brain development. PNAS paper (Yeo et al., 2013) Bisphenol A delays the perinatal chloride shift in cortical neurons by epigenetic effects on the Kcc2 promoter. BPA + potassium chloride cotransporter 2 (Kcc2) mRNA levels ↓ Epigenetic downregulation Your questions: What is the human KCC2 gene? What is its general expression pattern? In which human organ/tissue is the KCC2 gene differentially expressed? What is the expression pattern of KCC2/Kcc2 orthologues? ArrayExpress
Gene search: human KCC2 gene ArrayExpress
(1) Summarised expression data for one gene Group by experimental factor / intent Default: Sort by levels of diff. expression Clicking at a factor/condition changes profile display ArrayExpress
(2) The anatomogram ArrayExpress
(3) Detailed expression profile Drill down to - 1 probe (210040_at) - mapped to 1 gene (KCC2) - in 1 experiment (E-GEOD-3526) * Samples mapped to “brain” experimental factor by EFO * * * * * * * ArrayExpress
(4) Jump to orthologues from gene summary Orthology comes from Ensembl Compara database ArrayExpress
(5) Compare orthologues with parallel heatmaps ArrayExpress
Atlas ‘condition-only’ query ArrayExpress
Atlas ‘condition-only’ query (cont’d) heatmap view ArrayExpress
Atlas gene + condition query ArrayExpress
Atlas query refining (method 1) What if there are no terms in the “REFINE YOUR QUERY” box which fit my biological question? ArrayExpress
Atlas query refining (method 2) ArrayExpress
Atlas query refining (method 2) AND ArrayExpress
Atlas query refining (method 2) AND ArrayExpress
QUESTIONS? ArrayExpress
Hands-on exercise 3 Find information on Tbx5 expression in mouse in relation to Holt-Oram syndrome Hands-on exercise 4 Find genes involved in human male (in)fertility ArrayExpress
More queries to try… In ArrayExpress In the Expression Atlas Find experiments which studied mouse models of rheumatoid arthritis, focusing on the synovial membrane. Find DNA-based experiments (e.g ChIP-chip, genotyping) directly submitted to ArrayExpress, and studying the effect of estrogen on epigenetic changes in human. In the Expression Atlas Find genes differentially expressed in human ovarian cancer cell lines. These genes may be molecular biomarkers for clinical diagnosis. Find genes which are upregulated in human under the condition of “ultraviolet light”. Among the genes returned, filter for those which have the function “DNA repair”. ArrayExpress
ArrayExpress-Atlas Crossword
A glimpse of what’s coming… “Differential atlas” “Is gene X differentially expressed in condition 1 in this experiment?” Gene X = a single expression value for gene X Cond.1 mean Cond.2 mean Cond.3 mean Mean of all samples Create “contrasts” and calculate statistic ArrayExpress
Differential atlas” mock-up (1) http://www-test.ebi.ac.uk/gxa/experiments/E-MTAB-1066 Lots of mouse-over tips/help (?) Experiment design, data analysis methods, full analytics data for download FDR cut-off Clearer indication of experimental factor and contrast Colour gradient showing severity of differential expression MA plots ArrayExpress
Differential atlas” mock-up (2) Clearer indication of experimental factor and contrast ArrayExpress
A glimpse of what’s coming… “Baseline atlas” http://www-test.ebi.ac.uk/gxa/baseline/experiments ArrayExpress
A glimpse of what’s coming… “Baseline atlas” Gene expression in normal tissues/cell lines, not looking for differentially expressed genes based on different conditions E.g. “Give me all the genes expressed in normal human kidney” Can also filter genes by expression level (e.g. FPKM values) Started with Illumina Body Map 2.0 RNA-seq data (16 tissues) (E-MTAB-513) human cell lines from ENCODE (E-GEOD-26284) Mouse RNA-Seq DBAxC57BL/6J heart, hippocampus, liver, lung, spleen and thymus (E-MTAB-599) ArrayExpress
KCC2 gene in Baseline Atlas FPKM threshold slider ArrayExpress
Find out more about the two databases…. Visit our eLearning portal, Train Online: http://www.ebi.ac.uk/training/online/ for tutorials on ArrayExpress and Expression Atlas ArrayExpress BioConductor R package: http://bioconductor.org/packages/release/bioc/html/ArrayExpress.htm l ArrayExpress help: www.ebi.ac.uk/arrayexpress/help/index.html Email us at: miamexpress@ebi.ac.uk Atlas mailing list: arrayexpress-atlas@ebi.ac.uk ArrayExpress
Open-source tools for FG data analysis Gene Pattern (Broad Institute) http://www.broadinstitute.org/cancer/software/genepattern/ GenomeSpace (incorporates Gene Pattern, ArrayExpress provides link to send data directly to GenomeSpace) http://genomespace.org/ Galaxy (allowing more modular customisation of workflow) BioConductor R (Comprehensive help doc on standard workflows) http://www.bioconductor.org/help/ BioConductor Case Studies (Hahne et al.) Microarray Technology in Practice (Russell et al.) ArrayExpress
Data submission to ArrayExpress Archive
Data submission to Arrayexpress Read this help page carefully before preparing any files Use the MAGE-TAB submission tools to create a tailor-made template spreadsheet (IDF and SDRF) for your experiment ArrayExpress
Submission of HTS data ArrayExpress acts as a “broker” for submitter. Meta-data and processed data: ArrayExpress Raw sequence reads* (e.g. fastq, bam): ENA *See http://www.ebi.ac.uk/ena/about/sra_data_format for accepted read file format ArrayExpress
What happens after submission? Email confirmation Submission ‘closed’ so no more editing on your end Can keep data private until publication. Will provide login account details to you and reviewer for private data access Curation: We will email you with any questions May ‘re-open’ submission for you to make changes Get your submission in the best possible shape to shorten curation and processing time! ArrayExpress
Submission checklist Microarrays HTS 1. Is your array design already accessioned in ArrayExpress? (Check: http://www.ebi.ac.uk/arrayexpress/arrays/browse.html?directsub=on If your array design is not represented, you will have to submit the array design to us before submitting any experimental data, because all data points in your raw/processed files refer back to the array design file) 2. Do you have all the data files ready in the required formats? 1. Are your reads file in a format accepted by the SRA? (Check here: http://www.ebi.ac.uk/ena/about/sra_data_format) 2. If yes, have you dropped the files on the private ArrayExpress FTP site and emailed us about them? 3. Have you filled in the MAGE-TAB spreadsheet with as much meta-data as possible? ArrayExpress
Need help with submitting your data? Visit our eLearning portal, Train Online for the specific tutorial on how to submit data using MAGE-TAB: www.ebi.ac.uk/training/online/course/arrayexpress-submitting-data- using-mage-tab ArrayExpress help page on submisisons: www.ebi.ac.uk/arrayexpress/help/submissions_overview.html Watch this short YouTube video on how to navigate the MAGE-TAB submission tool: http://youtu.be/KVpCVGpjw2Y Email curators at: miamexpress@ebi.ac.uk ArrayExpress