Mining Functional Genomics Data: ArrayExpress and Gene Expression Atlas
Amy Tang, PhD
ArrayExpress Production Team, Functional Genomics Group, EMBL-EBI
What's covered this morning?
- What do we mean by "functional genomics data"? Why do we need databases for them?
- Two databases: ArrayExpress and Expression Atlas
- What's in each database; how to browse, search, interpret and download data
- (Microarray/sequencing data analysis; how to submit data to ArrayExpress)
Functional genomics (FG) data
- The aim of FG is to understand the function of genes and other (non-genic) parts of the genome.
- It often involves high-throughput technologies (microarrays, high-throughput sequencing [HTS]).
- Questions addressed:
  - Gene expression: when? where? how much? changes?
  - Gene function: roles of genes in cellular processes, pathways
  - Gene/genome regulation: e.g. histone modifications, CpG (DNA) methylation
Examples of FG data sets in ArrayExpress
Questions addressed:
- Gene expression: when? where? how much? changes?
- Gene function: roles of genes in cellular processes, pathways

Examples of FG data sets in ArrayExpress (cont'd)
Questions addressed:
- Gene/genome regulation: e.g. histone modifications, CpG (DNA) methylation
The two databases: how are they related?
[Diagram: data enter ArrayExpress by direct submission or by import from external databases (mainly NCBI Gene Expression Omnibus); curation and statistical analysis lead from ArrayExpress to Expression Atlas; the databases link out to analysis software and to other databases.]
The two databases: how do they compare?

Feature | ArrayExpress | Expression Atlas
Central object | Experiment | Gene or condition
Microarray data | Yes | Yes
Sequencing data | Yes | RNA-seq data
Query for... | Experimental information and associated data | Gene expression patterns, up/down-regulated genes under certain experimental conditions
Download data for further analysis | Yes | Yes
Submit data | Yes | X
Curated data | Yes (direct submissions) / No (GEO-imported) | All curated
ArrayExpress
- Public repository for functional genomics data (both microarray and sequencing)
- Together with GEO at NCBI and CIBEX at DDBJ, serves the scientific community as a data archive supporting publications
- Provides access to curated data in a structured and standardised format, which is essential for easy sharing of experimental information
- Submissions are curated based on community standards:
  - MIAME guidelines and MAGE-TAB format for microarray data
  - MINSEQE guidelines and MAGE-TAB format for HTS data
Community standards for data requirements
MIAME = Minimum Information About a Microarray Experiment
MINSEQE = Minimum Information about a high-throughput Nucleotide SEQuencing Experiment
The checklist (requirements compared for MIAME and MINSEQE):
1. Experiment design / background description
2. Sample annotation and experimental factor
3. Array design annotation (e.g. probe sequence)
4. All protocols (wet-lab bench and data processing)
5. Raw data files (from scanner or sequencing machine)
6. Processed data files (normalised and/or transformed)
What is an experimental factor?
The main variable(s) studied: the independent variable(s), usually tied to the hypothesis of the experiment, e.g. "genotype". The "factor values" of the samples should vary (e.g. "p53 -/-", "wild type").

Experimental design | Factor | Factor values | Not factor
beef vs horse meat | diet | beef, horse meat | organism (human)
smoker vs non-smoker | compound | cigarette smoke (tobacco), no tobacco | organism (human), sex (male)
face cream A vs control X | compound | active ingredient A, "sham" control | cell type A X
Reporting standards - MAGE-TAB format
A simple spreadsheet format that uses a number of tab-delimited text files:
- IDF (Investigation Description Format file): experiment title, experiment description, submitter's contact details, definition of all protocols
- SDRF (Sample and Data Relationship Format file): starting materials with annotation, derived materials (e.g. RNA extracts), all assays (hybridisations / sequencing lanes), resulting data file(s) for each assay
- ADF (Array Design Format file, microarray only): describes probes on an array, e.g. sequence, genomic mapping location
- Raw and processed data files, e.g. 1.fq.gz, 2.fq.gz, A1.CEL, Normalized.txt
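Because MAGE-TAB files are plain tab-delimited text, an SDRF can be read with any spreadsheet or scripting tool. The following is a minimal sketch in Python, not an official parser: the sample rows and the helper names `read_sdrf` and `factor_values` are made up for illustration, and only the column-heading style (`Source Name`, `Characteristics[...]`, `Factor Value[...]`, `Array Data File`) follows the MAGE-TAB convention.

```python
import csv
import io

# Hypothetical SDRF content; real files have many more columns and rows.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[genotype]\tArray Data File\n"
    "sample 1\tMus musculus\twild type\tA1.CEL\n"
    "sample 2\tMus musculus\tp53 -/-\tA2.CEL\n"
)


def read_sdrf(handle):
    """Return one dict per SDRF row, keyed by column heading."""
    return list(csv.DictReader(handle, delimiter="\t"))


def factor_values(rows, factor):
    """Collect the distinct values of one experimental factor, in file order."""
    column = f"Factor Value[{factor}]"
    seen = []
    for row in rows:
        value = row[column]
        if value not in seen:
            seen.append(value)
    return seen


rows = read_sdrf(io.StringIO(SDRF_TEXT))
print(factor_values(rows, "genotype"))
```

One caveat with this shortcut: real SDRF files may repeat a column heading (for example several protocol columns), and a dictionary-based reader keeps only the last of them, so a production parser would need to work with column positions instead.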
MAGE-TAB Example: IDF
MAGE-TAB Example: SDRF
How much data in ArrayExpress? (as of 29 Oct 2013)
HTS data in ArrayExpress (as of 29 October 2013): microarray vs HTS; RNA-, DNA-, ChIP-seq breakdown
Browsing ArrayExpress
Browsing ArrayExpress experiments: all columns can be sorted by clicking on the heading
File download on the Browse page
- Direct download link (here, for a single raw data archive [*.zip] file).
- For HTS experiments only: direct link to the European Nucleotide Archive (ENA) page that lists all the sequencing assays (called "runs" at the ENA).
- A link to a page that lists all the archive files available for download (no direct link because there is more than one archive).

ArrayExpress single-experiment view
- Sample characteristics, factors and factor values
- MIAME or MINSEQE scores (* = compliant)
- The microarray design used
- All files related to this experiment (e.g. IDF, SDRF, array design, raw data, R object)
- Send data to GenomeSpace and analyse it yourself

Samples view: microarray experiment
- Sample characteristics and factor values; scroll left and right to see them all
- Direct link to data files for one sample
- All columns can be sorted by clicking on the heading

Samples view: sequencing experiment
- Direct link to fastq files at the European Nucleotide Archive (ENA)
- Direct link to the ENA record about this sequencing assay
Searching for experiments in ArrayExpress

Experimental factor ontology (EFO)
- Ontology: a way to systematically organise experimental factor terms; controlled vocabulary + hierarchy (relationships)
- Used in EBI databases and external projects (e.g. NHGRI GWAS Catalogue)
- Combines terms from a subset of well-maintained and compatible ontologies, e.g. Gene Ontology (cellular component + biological process terms), NCBI Taxonomy
Building EFO - an example
1. Take all experimental factors: sarcoma, cancer, neoplasm, disease, Kaposi's sarcoma
2. Find the logical connection between them:
   - disease is the parent term
   - neoplasm is a type of disease
   - cancer is a synonym of neoplasm
   - sarcoma is a type of cancer
   - Kaposi's sarcoma is a type of sarcoma
3. Organize them in an ontology (broadest term first): disease, neoplasm, cancer, sarcoma, Kaposi's sarcoma

Exploring EFO - an example
Experimental factor ontology (EFO)
EFO was developed to:
- increase the richness of annotations in databases
- expand on search terms when querying ArrayExpress and Expression Atlas
  - using synonyms (e.g. "cerebral cortex" = "adult brain cortex")
  - using child terms (e.g. "bone" also finds "rib" and "vertebra")
- promote consistency (e.g. F/female, 1day/24hours)
- facilitate automatic annotation and integration of external data (e.g. changing "gender" to "sex" automatically)
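The idea of query expansion can be illustrated with a toy ontology. This is only a sketch: the dictionaries `SYNONYMS` and `CHILDREN` and the function `expand_query` are hypothetical, hold just the example terms from these slides, and do not reflect how EFO is stored or how ArrayExpress implements its search.

```python
# Toy ontology built from the slide examples (hypothetical data structures).
SYNONYMS = {
    "cerebral cortex": ["adult brain cortex"],
    "neoplasm": ["cancer"],
}
CHILDREN = {
    "bone": ["rib", "vertebra"],
    "cancer": ["sarcoma"],
    "sarcoma": ["Kaposi's sarcoma"],
}


def expand_query(term):
    """Return the term plus its synonyms and all descendant terms, without duplicates."""
    expanded = []
    queue = [term]
    while queue:
        current = queue.pop(0)
        if current in expanded:
            continue
        expanded.append(current)
        # Follow both kinds of relationship from the current term.
        queue.extend(SYNONYMS.get(current, []))
        queue.extend(CHILDREN.get(current, []))
    return expanded


print(expand_query("bone"))
print(expand_query("neoplasm"))
```

In this sketch synonyms are followed in one direction only, so searching for "cancer" would not pull in "neoplasm"; a fuller version would store synonym groups symmetrically.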
Searching ArrayExpress using EFO terms and filters
- Enter a keyword, click search, then filter.
- "Auto-complete" with suggestions (like Google search)
- Avoid acronyms as search terms
- Filter your search results by: species of interest; array design (platform); molecule (DNA, RNA, protein, etc.); technology (microarray or HTS)

What search terms can I use?
- ArrayExpress accession number, e.g. "E-MEXP-568"
- Secondary accession number, e.g. GEO series "GSE5389"
- Experiment title, description
- Submitter's address
- Publication title, authors and journal name, PubMed ID
- Sample attributes and experimental factor / factor values: "genetic modification", "heart", "diabetes", "neural stem cells", "penicillin", "ChIP-chip", "methylation profiling", "Arabidopsis", "p53"
* Powered by EFO expansion. Use EFO terms wherever possible.

Example search: "leukemia"
The results mark three kinds of match:
- exact match to the search term
- matched EFO synonyms of the search term
- matched EFO child terms of the search term
Advanced search
Allows you to restrict your search to a specific field.
Format of search term: field_name:search_term
Some examples:

Specific field | Example term | What it means
Experimental factor | ef:genotype | Search for experiments where "genotype" is a factor
Experimental factor value | efv:"wild type" | Search for experiments with "wild type" as a factor value (the factor is usually "genotype" in this case)
Expression Atlas | gxa:yes | Search for experiments which are present in the Atlas
Number of assays | assaycount:[5 TO 10] | Search for experiments which have 5-10 assays
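The field_name:search_term pattern is easy to generate programmatically, for instance when preparing many searches. The helpers below are a sketch only: `field_query` and `assay_range` are hypothetical names, and the rule of quoting multi-word terms is inferred from the efv:"wild type" example rather than taken from the ArrayExpress documentation.

```python
def assay_range(low, high):
    """Format an inclusive numeric range as used in the assaycount example."""
    return f"[{low} TO {high}]"


def field_query(field, term):
    """Format one fielded search term as field_name:search_term.

    Multi-word terms are wrapped in double quotes; range expressions
    (which start with '[') are left untouched.
    """
    if " " in term and not term.startswith("["):
        term = f'"{term}"'
    return f"{field}:{term}"


examples = [
    field_query("ef", "genotype"),
    field_query("efv", "wild type"),
    field_query("gxa", "yes"),
    field_query("assaycount", assay_range(5, 10)),
]
for query in examples:
    print(query)
```

Each string printed here corresponds to one row of the table above and could be pasted into the search box as it stands.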
QUESTIONS?

Hands-on exercise 1: Find RNA-seq assays studying human prostate adenocarcinoma
Hands-on exercise 2: Find experiments studying the effect of sodium dodecyl sulphate on human skin
Atlas experiment selection criteria
- At least 3 replicates for each value of the experimental factor, and a maximum of 4 factors
- Adequate sample annotation using EFO terms
- Adequate array (platform) design to map probes to genes and allow re-annotation of external references (e.g. Ensembl gene ID, UniProt ID)
- RNA-seq experiments: good-quality reads and reference genome build
- Presence of good-quality raw data files, e.g. CEL raw data files for Affymetrix assays, fastq files for RNA-seq experiments
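The first criterion is mechanical enough to check in a few lines. The sketch below is an illustration, not the Atlas team's actual procedure: the function name `meets_design_criteria` and the sample layout are assumptions, and replicates are counted separately for each factor rather than for each combination of factor values.

```python
from collections import Counter

MIN_REPLICATES = 3  # "at least 3 replicates for each value"
MAX_FACTORS = 4     # "maximum 4 factors"


def meets_design_criteria(samples):
    """Check replicate and factor counts for one experiment.

    samples: one dict per sample, mapping factor name -> factor value.
    """
    factors = sorted({name for sample in samples for name in sample})
    if not factors or len(factors) > MAX_FACTORS:
        return False
    for factor in factors:
        # Count how many samples carry each value of this factor.
        counts = Counter(sample.get(factor) for sample in samples)
        if min(counts.values()) < MIN_REPLICATES:
            return False
    return True


good = [{"genotype": "wild type"}] * 3 + [{"genotype": "p53 -/-"}] * 3
bad = [{"genotype": "wild type"}] * 3 + [{"genotype": "p53 -/-"}] * 2
print(meets_design_criteria(good), meets_design_criteria(bad))
```

The remaining criteria (annotation quality, array design, read quality) need curator judgement and are not captured by a check like this.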
New Atlas is launching in 3 days' time!
Where to find the Atlases before and after launch? (Launch date: week of 1 Dec 2013)
[Table: addresses of the old and new Atlas before and after launch; the only address listed in full is http://www-test.ebi.ac.uk/gxa/]

New Atlas: "Baseline" and "differential"

Feature | Baseline | Differential
Query for... | Gene expression in normal tissues | Up/down-regulated genes in "contrasts" of experimental conditions (e.g. mutant vs wild type)
Microarray data | X | Yes
RNA-seq data | Yes | Yes
Data volume (as of 1 Dec 2013) | 9 experiments | 265 experiments
Predecessor | (None) | "Gene Expression Atlas"
Interface | Ready | Still under development

Experiencing the old and new Atlases today
[Diagram: which parts of today's session use the old and the new Atlas: example use case and exercise; taster and preview; example use case and exercise]
"Old" Atlas construction: analysis pipeline
- Input data: Affymetrix CEL files, Agilent feature extraction files, RNA-seq fastq files
- Statistics: linear model* (Bioconductor limma), moderated t-test
- Output: 2-D matrix of genes by conditions (Cond.1, Cond.2, Cond.3), where 1 = differentially expressed and 0 = not differentially expressed (the slide shows a dummy example from one experiment)
* More information about the statistical methodology is linked from the slide.

"Old" Atlas construction: analysis pipeline (cont'd)
How differential expression is calculated in one experiment: "Is gene X differentially expressed in condition 1 in this experiment?"
- Calculate the mean expression of gene X in each condition (Cond.1 mean, Cond.2 mean, Cond.3 mean)
- Calculate the mean of all samples = a single expression value for gene X
- Compare and calculate the statistic
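To make the "compare and calculate statistic" step concrete, here is a deliberately simplified sketch. It is not the Atlas pipeline: the pipeline uses limma's linear model and moderated t-test, whereas this example computes an ordinary Welch t-statistic between the samples of one condition and all remaining samples (rather than the mean of all samples). The expression values and function names are invented for illustration.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t-statistic for two groups of expression values.

    Each group needs at least two values and some variation overall.
    """
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error


def condition_statistics(expression):
    """For one gene, compare each condition with all other samples."""
    results = {}
    for condition, values in expression.items():
        others = [v for name, vals in expression.items() if name != condition for v in vals]
        results[condition] = welch_t(values, others)
    return results


# Hypothetical log2 expression values for gene X in three conditions.
gene_x = {
    "Cond.1": [8.1, 8.4, 7.9],
    "Cond.2": [5.0, 5.3, 4.8],
    "Cond.3": [5.1, 4.9, 5.2],
}
for condition, t_value in condition_statistics(gene_x).items():
    print(condition, round(t_value, 2))
```

A positive statistic suggests higher expression in that condition than elsewhere, a negative one lower expression; turning the statistic into a differentially expressed / not differentially expressed call additionally requires a p-value and a multiple-testing correction, which this sketch leaves out.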
"Old" Atlas construction: analysis pipeline (cont'd)
- Apply linear modelling statistics to each of the n experiments (Exp. 1 with Cond.1-3, Exp. 2 with Cond.4-6, ..., Exp. n with Cond.X-Z); each experiment gets its own statistical test.
- Each experiment has its own "verdict" or "vote" on whether a gene is differentially expressed or not under a certain condition.

"Old" Atlas construction: results
Summary of the "verdicts" from different experiments
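Summarising the verdicts amounts to counting votes across experiments. The sketch below assumes a hypothetical coding of each verdict as "up", "down" or "none" for one gene under one condition; the experiment labels and the function name `summarise_verdicts` are invented for illustration.

```python
from collections import Counter


def summarise_verdicts(verdicts):
    """Count the votes for one gene/condition pair.

    verdicts: dict mapping an experiment label to "up", "down" or "none".
    """
    counts = Counter(verdicts.values())
    return {"up": counts["up"], "down": counts["down"], "none": counts["none"]}


# Hypothetical verdicts from four experiments.
votes = {
    "experiment 1": "up",
    "experiment 2": "up",
    "experiment 3": "none",
    "experiment 4": "down",
}
print(summarise_verdicts(votes))
```

A count like this says how many experiments agree, but it treats all experiments as equally informative, which is one reason to inspect the individual experiments behind a summary.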
Mapping microarray probes to genes
- Expression data per probe are linked through probe identifiers to Ensembl genes.
- Every (~monthly) Atlas release takes the latest Ensembl gene to probe identifier mapping data.
- From Ensembl genes, we also get:
  - Compara genes
  - External references (xrefs) to other databases, e.g. UniProt protein IDs, NCBI RefSeq IDs, HGNC gene symbols, Gene Ontology terms, InterPro terms

Example Atlas use case: KCC2 gene and BPA
Scenario: You study the health impact of bisphenol A (BPA).
- BPA: common additive in household plastic items. Negative health effects have been linked to BPA, e.g. on foetal and neonatal brain development.
- PNAS paper (Yeo et al., 2013): "Bisphenol A delays the perinatal chloride shift in cortical neurons by epigenetic effects on the Kcc2 promoter". With BPA, potassium chloride cotransporter 2 (Kcc2) mRNA levels ↓ (epigenetic downregulation).
Your questions:
1. In which human organ/tissue is the KCC2 gene differentially expressed?
2. Under what condition(s) is the human KCC2 gene differentially expressed?
3. What is the expression pattern of KCC2/Kcc2 orthologues?
"Old" Atlas home page
- Query for a single gene or a group of genes
- Query for conditions
- Restrict the query by direction of differential expression (up, down, both, neither)
- The 'advanced query' option allows building more complex queries

Gene search (old Atlas): human KCC2 gene

(1) Summarised expression data for one gene
- Group by experimental factor / intent
- Default: sort by levels of differential expression
- Clicking on a factor/condition changes the profile display

(2) The anatomogram

(3) Detailed expression profile
- Drill down to 1 probe (210040_at), mapped to 1 gene (KCC2), in 1 experiment (E-GEOD-3526)
- * Samples mapped to the "brain" experimental factor by EFO

(4) Jump to orthologues from gene summary
- Orthology comes from the Ensembl Compara database

(5) Compare orthologues with parallel heatmaps
Baseline Atlas construction
Only RNA-seq data sets are used.
Example read (fastq):
  GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
  +
  !''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Pipeline:
1. Align reads with TopHat against the reference genome from Ensembl, giving mapped reads (BAM)
2. Run Cufflinks on the mapped reads, giving FPKMs
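FPKM stands for fragments per kilobase of transcript per million mapped fragments. The sketch below shows only the basic definition with invented numbers; Cufflinks itself estimates abundances with a more elaborate statistical model, so its output will not match this arithmetic exactly. The function names, gene labels and the threshold value are assumptions for illustration.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Basic FPKM definition: fragments per kilobase per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def above_threshold(fpkm_by_gene, threshold):
    """Keep genes whose FPKM is at or above a cut-off, as a threshold slider would."""
    return {gene: value for gene, value in fpkm_by_gene.items() if value >= threshold}


TOTAL_MAPPED = 10_000_000  # hypothetical library size

# Hypothetical (fragment count, transcript length in bp) per gene.
counts = {"gene A": (500, 2000), "gene B": (10, 4000)}
values = {gene: fpkm(n, length, TOTAL_MAPPED) for gene, (n, length) in counts.items()}

print(values)
print(above_threshold(values, threshold=0.5))
```

Dividing by transcript length and by library size is what makes values comparable between long and short genes and between samples sequenced to different depths.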
Baseline Atlas search for human KCC2
Baseline Atlas search results
Human KCC2 gene in Baseline Atlas: FPKM threshold slider
Old Atlas 'condition-only' query
Old Atlas 'condition-only' query (cont'd): heatmap view
Old Atlas gene + condition query
Old Atlas query refining
Old Atlas query refining: AND
QUESTIONS?

Hands-on exercise 3: Find information on Tbx5 expression in mouse in relation to Holt-Oram syndrome
Hands-on exercise 4: Find transcription factor genes belonging to the androgen signaling pathway in prostate cancer
Diff. Atlas changes: (1) analysis pipeline
How differential expression is calculated in one experiment: "Is gene X differentially expressed in condition 1 in this experiment?"
[Figure: Cond.1, Cond.2 and Cond.3 means for gene X; mean of all samples = a single expression value for gene X]
Create "contrasts" and calculate the statistic.

Diff. Atlas changes: (2) modern interface
- Clearer indication of experimental factor and contrast
- Lots of mouse-over tips/help (?)
- FDR cut-off
- MA plots
- Experiment design, data analysis methods, full analytics data for download
- Colour gradient showing significance of differential expression
Diff. Atlas changes: (3) verdict "summary"?
- Experiments 1, 2 and 3 all have the experimental factor "disease" with factor values AML, CML and normal. Can their verdicts simply be summarised (= ?)
- What if there are differences in sample attributes? Samples listed across experiments 1 to 3: normal x 20; AML x 10 (relapse, 1st diagnosis); CML x 10 (relapse, 1st diagnosis)
Diff. Atlas changes: (4) Histograms?

QUESTIONS?

ArrayExpress-Atlas Crossword
Find out more about the two databases
- Visit our eLearning portal, Train Online, for tutorials on ArrayExpress and Expression Atlas
- ArrayExpress Bioconductor R package
- ArrayExpress help pages and e-mail contact
- Atlas mailing list

Open-source tools for FG data analysis
- Bioconductor (R): comprehensive help documentation on standard workflows. Further reading: Bioconductor Case Studies (Hahne et al.); Microarray Technology in Practice (Russell et al.)
- GenePattern (Broad Institute)
- GenomeSpace (incorporates GenePattern; ArrayExpress provides a link to send data directly to GenomeSpace)
- Galaxy (allows more modular customisation of workflows)
Data submission to ArrayExpress Archive

Data submission to ArrayExpress
- Read the submission help page carefully before preparing any files
- Use the MAGE-TAB submission tools to create a tailor-made template spreadsheet (IDF and SDRF) for your experiment

Submission of HTS data
ArrayExpress acts as a "broker" for the submitter:
- Metadata and processed data: ArrayExpress
- Raw sequence reads* (e.g. fastq, BAM): ENA
* See the ENA documentation for accepted read file formats.

What happens after submission?
- Email confirmation
- Submission is 'closed', so no more editing on your end
- Curation: we will email you with any questions and may 're-open' the submission for you to make changes
- You can keep data private until publication. We will provide login account details to you and reviewers for private data access
- Get your submission in the best possible shape to shorten curation and processing time!
Submission checklist
Microarrays:
1. Is your array design already accessioned in ArrayExpress? If your array design is not represented, you will have to submit it to us before submitting any experimental data, because all data points in your raw/processed files refer back to the array design file.
2. Do you have all the data files ready in the required formats?
HTS:
1. Are your read files in a format accepted by the SRA?
2. If yes, have you dropped the files on the private ArrayExpress FTP site and emailed us about them?
Both:
3. Have you filled in the MAGE-TAB spreadsheet with as much metadata as possible?

Need help with submitting your data?
- Visit our eLearning portal, Train Online, for the specific tutorial on how to submit data using MAGE-TAB
- ArrayExpress help page on submissions
- Watch the short YouTube video on how to navigate the MAGE-TAB submission tool
- Email the curators