ArrayExpress and Gene Expression Atlas:


1 ArrayExpress and Gene Expression Atlas:
Mining Functional Genomics Data
Amy Tang, PhD
ArrayExpress Production Team, Functional Genomics Group, EMBL-EBI

2 What’s covered this morning?
- What do we mean by “functional genomics data”?
- Why do we need databases for them?
- Two databases: ArrayExpress and Expression Atlas
- What’s in each database; how to browse, search, interpret and download data
- (Microarray/sequencing data analysis; how to submit data to ArrayExpress)

3 Functional genomics (FG) data
The aim of FG is to understand the function of genes and other (non-genic) parts of the genome.
It often involves high-throughput technologies (microarrays, high-throughput sequencing [HTS]).
Questions addressed:
- Gene expression: when? where? how much? changes?
- Gene function: roles of genes in cellular processes and pathways
- Gene/genome regulation: e.g. histone modifications, CpG (DNA) methylation

4 Example of FG data sets in ArrayExpress
Questions addressed:
- Gene expression: when? where? how much? changes?
- Gene function: roles of genes in cellular processes and pathways

5 Example of FG data sets in ArrayExpress
Questions addressed:
- Gene/genome regulation: e.g. histone modifications, CpG (DNA) methylation

6 The two databases: how are they related?
Data enter ArrayExpress by:
- Direct submission
- Import from external databases (mainly NCBI Gene Expression Omnibus)
ArrayExpress → curation and statistical analysis → Expression Atlas
Links to analysis software and to other databases.

7 The two databases: how do they compare?
Comparison (ArrayExpress / Expression Atlas):
- Central object: experiment / gene or condition
- Microarray data: yes / yes
- Sequencing data: yes / RNA-seq coming soon
- Query for…: experimental information and associated data / up- and downregulated genes across experiments and microarray platforms
- Download data for further analysis: yes / yes
- Submit data: yes / no
- Curated data: yes (direct submissions), no (GEO-imported) / all curated

8 ArrayExpress www.ebi.ac.uk/arrayexpress
Public repository for functional genomics data (both microarray and sequencing).
Together with GEO at NCBI and CIBEX at DDBJ, it serves the scientific community as a data archive supporting publications.
It provides access to curated data in a structured and standardised format, which is essential for easy sharing of experimental information.
Submissions are curated based on community standards:
- MIAME guidelines and MAGE-TAB format for microarray data
- MINSEQE guidelines and MAGE-TAB format for HTS data

9 Community standards for data requirements
MIAME = Minimum Information About a Microarray Experiment
MINSEQE = Minimum Information about a high-throughput Nucleotide SEQuencing Experiment
The checklist compares requirements under MIAME and MINSEQE:
1. Experiment design / background description
2. Sample annotation and experimental factor
3. Array design annotation (e.g. probe sequence)
4. All protocols (wet-lab bench and data processing)
5. Raw data files (from scanner or sequencing machine)
6. Processed data files (normalised and/or transformed)

10 What is an experimental factor?
The main variable(s) studied. A factor is often related to the hypothesis of the experiment and is the independent variable. Values of the factor (“factor values”) should vary.
Examples (experimental design → factor: factor values; not a factor):
- beef vs horse meat → Diet: beef, horse meat; not a factor: organism (human)
- smoker vs non-smoker → Compound: cigarette smoke (tobacco), no tobacco; not a factor: organism (human), sex (male)
- face cream A vs control → Active ingredient: A, “sham” control; not a factor: cell type
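The rule that factor values should vary can be sketched in a few lines of Python. This is only an illustration: the function name `split_attributes` and the sample dictionaries are hypothetical, and a varying attribute is merely a candidate factor, since the real factor also depends on the hypothesis being tested.

```python
def split_attributes(samples):
    """Separate sample attributes whose values vary from those that stay constant."""
    attributes = sorted({key for sample in samples for key in sample})
    varying, constant = {}, {}
    for attr in attributes:
        values = sorted({str(sample.get(attr)) for sample in samples})
        if len(values) > 1:
            varying[attr] = values      # candidate experimental factor
        else:
            constant[attr] = values[0]  # same for every sample, so not a factor
    return varying, constant


# Hypothetical annotation for the "beef vs horse meat" design.
samples = [
    {"organism": "human", "diet": "beef"},
    {"organism": "human", "diet": "horse meat"},
]
varying, constant = split_attributes(samples)
print("candidate factors:", varying)
print("not factors:", constant)
```

Here "diet" varies and is a candidate factor, while "organism" is constant and is only a sample characteristic.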

11 Reporting standards - MAGE-TAB format
A simple spreadsheet format that uses a number of tab-delimited text files:
- IDF (Investigation Description Format file): experiment title, experiment description, submitter’s contact details, definition of all protocols
- SDRF (Sample and Data Relationship Format file): starting materials with annotation, derived materials (e.g. RNA extracts), all assays (hybridisations / sequencing lanes), resulting data file(s) for each assay
- ADF (Array Design Format file, microarray only): describes the probes on an array, e.g. sequence, genomic mapping location
- Raw and processed data files, e.g. 1.fq.gz, 2.fq.gz, A1.CEL, Normalized.txt
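Because MAGE-TAB files are tab-delimited text, they can be read with standard tools. The sketch below parses a tiny SDRF-like table with Python's `csv` module and collects the factor values. The table content is invented for illustration (real SDRF files carry many more columns), and the helper names are hypothetical.

```python
import csv
import io

# Invented two-sample SDRF fragment; columns are separated by tabs.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[diet]\tArray Data File\n"
    "sample 1\tHomo sapiens\tbeef\tA1.CEL\n"
    "sample 2\tHomo sapiens\thorse meat\tA2.CEL\n"
)


def read_sdrf(handle):
    """Read a tab-delimited SDRF table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def factor_values(rows):
    """Collect the values found in every 'Factor Value[...]' column."""
    result = {}
    for row in rows:
        for column, value in row.items():
            if column.startswith("Factor Value[") and column.endswith("]"):
                name = column[len("Factor Value["):-1]
                result.setdefault(name, []).append(value)
    return result


rows = read_sdrf(io.StringIO(SDRF_TEXT))
print(factor_values(rows))
print([row["Array Data File"] for row in rows])
```

Each row links one starting material to its annotation and to the data file produced by its assay, which is the relationship the SDRF is meant to capture.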

12 MAGE-TAB Example: IDF

13 MAGE-TAB Example: SDRF

14 How much data in ArrayExpress? (as of 29 Oct 2013)

15 HTS data in ArrayExpress (as of 29 October 2013)
Charts: microarray vs HTS data; breakdown of HTS data into RNA-, DNA- and ChIP-seq

16 Browsing ArrayExpress

17 Browsing ArrayExpress experiments www. ebi. ac
All columns can be sorted by clicking on the column heading.

18 File download on the Browse page
- Direct download link (here, for a single raw data archive [*.zip] file)
- A link to a page listing all the archive files available for download (no direct link because there is more than one archive)
- Specifically for HTS experiments: a direct link to the European Nucleotide Archive (ENA) page listing all the sequencing assays (called “runs” at the ENA)

19 ArrayExpress single-experiment view
- Sample characteristics, factors and factor values
- The microarray design used
- MIAME or MINSEQE scores (* = compliant)
- All files related to this experiment (e.g. IDF, SDRF, array design, raw data, R object)
- Send data to GenomeSpace and analyse it yourself

20 Samples view – microarray experiment
- All columns can be sorted by clicking on the column heading
- Direct link to the data files for one sample
- Sample characteristics
- Factor values
- Scroll left and right to see all sample characteristics and factor values

21 Samples view – sequencing experiment
- Direct link to the European Nucleotide Archive (ENA) record for this sequencing assay
- Direct link to the fastq files at the ENA

22 Searching for experiments in ArrayExpress www. ebi. ac

23 Experimental factor ontology (EFO) http://www.ebi.ac.uk/efo
Ontology in layman's terms: a way to systematically organise experimental factor terms (controlled vocabulary + hierarchy of relationships).
- Used in EBI databases and in external projects (e.g. NHGRI GWAS Catalogue)
- Combines terms from a subset of well-maintained and compatible ontologies, e.g. Gene Ontology (cellular component + biological process terms) and NCBI Taxonomy

24 Building EFO - an example
1. Take all experimental factors: sarcoma, cancer, neoplasm, Kaposi’s sarcoma, disease
2. Find the logical connection between them (“is the parent term of”, “is a type of”, “is synonym of”)
3. Organize them in an ontology: disease > neoplasm > cancer > sarcoma > Kaposi’s sarcoma

25 Exploring EFO - an example

26 Experimental factor ontology (EFO) http://www.ebi.ac.uk/efo
EFO was developed to:
- increase the richness of annotations in databases
- expand on search terms when querying ArrayExpress and Expression Atlas, using synonyms (e.g. “cerebral cortex” = “adult brain cortex”) and child terms (e.g. “bone” → “rib” and “vertebra”)
- promote consistency (e.g. F/female, 1day/24hours)
- facilitate automatic annotation and integration of external data (e.g. changing “gender” to “sex” automatically)
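Query expansion with synonyms and child terms can be pictured as a walk over two lookup tables. The sketch below is a toy illustration, not the EFO implementation: the dictionaries hold only the two examples from the slide, and `expand_term` is a hypothetical helper.

```python
# Toy lookup tables built from the slide's examples only.
SYNONYMS = {"cerebral cortex": ["adult brain cortex"]}
CHILDREN = {"bone": ["rib", "vertebra"]}


def expand_term(term, synonyms=SYNONYMS, children=CHILDREN):
    """Return the search term plus all synonyms and child terms reachable from it."""
    expanded = [term]
    queue = [term]
    while queue:
        current = queue.pop(0)
        for related in synonyms.get(current, []) + children.get(current, []):
            if related not in expanded:
                expanded.append(related)
                queue.append(related)  # children of children are followed too
    return expanded


print(expand_term("bone"))
print(expand_term("cerebral cortex"))
```

A search for "bone" would then also match experiments annotated with "rib" or "vertebra", which is the behaviour the slide describes.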

27 Searching ArrayExpress Using EFO terms and filters
- “Auto-complete” with suggestions (like Google search)
- Avoid acronyms as search terms
- Enter a keyword, click search, then filter
Filter your search results by:
- Species of interest
- Array design (platform)
- Molecule (DNA, RNA, protein, etc.)
- Technology (microarray or HTS)

28 What search terms can I use?
- ArrayExpress accession number, e.g. “E-MEXP-568”
- Secondary accession number, e.g. GEO series “GSE5389”
- Experiment title, description
- Submitter's address
- Publication title, authors and journal name, PubMed ID
- Sample attributes and experimental factors / factor values: “genetic modification”, “heart”, “diabetes”, “neural stem cells”, “penicillin”, “ChIP-chip”, “methylation profiling”, “Arabidopsis”, “p53”
* Powered by EFO expansion. Use EFO terms wherever possible.

29 Example search: “leukemia”
- Exact match to the search term
- Matched EFO synonyms of the search term
- Matched EFO child terms of the search term

30 Advanced search
Allows you to restrict your search to a specific field.
Format of search term: field_name:search_term
Some examples (field; example term; what it means):
- Experimental factor; ef:genotype; search for experiments where “genotype” is a factor
- Experimental factor value; efv:"wild type"; search for experiments with “wild type” as a factor value (the factor is usually “genotype” in this case)
- Expression Atlas; gxa:yes; search for experiments which are present in the Atlas
- Number of assays; assaycount:[5 TO 10]; search for experiments which have 5-10 assays
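The field_name:search_term format is easy to generate programmatically. The following sketch only builds the query strings shown on the slide; the helper names are hypothetical, and it makes no assumption about how the search box combines several clauses.

```python
def field_query(field, term):
    """Format one clause as field_name:search_term, quoting multi-word terms."""
    if " " in term:
        term = f'"{term}"'
    return f"{field}:{term}"


def assay_count_range(low, high):
    """Format a range clause such as assaycount:[5 TO 10]."""
    return f"assaycount:[{low} TO {high}]"


print(field_query("ef", "genotype"))
print(field_query("efv", "wild type"))
print(field_query("gxa", "yes"))
print(assay_count_range(5, 10))
```

Quoting the multi-word term keeps "wild type" together as a single factor value instead of two separate keywords.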

31 QUESTIONS? ArrayExpress

32 Hands-on exercises
Hands-on exercise 1: Find RNA-seq assays studying human prostate adenocarcinoma
Hands-on exercise 2: Find experiments studying the effect of sodium dodecyl sulphate on human skin

33 The two databases
Data enter ArrayExpress by:
- Direct submission
- Import from external databases (mainly NCBI Gene Expression Omnibus)
ArrayExpress → curation and statistical analysis → Expression Atlas
Links to analysis software and to other databases.

34 The two databases: how do they compare?
Comparison (ArrayExpress / Expression Atlas):
- Central object: experiment / gene or condition
- Microarray data: yes / yes
- Sequencing data: yes / RNA-seq coming soon
- Query for…: experimental information and associated data / up- and downregulated genes across experiments and microarray platforms
- Download data for further analysis: yes / yes
- Submit data: yes / no
- Curated data: yes (direct submissions), no (GEO-imported) / all curated

35 Atlas construction - expt selection criteria
- Array (platform) designs relating to the experiment must be provided. Probe annotation must be adequate to map probes to genes and allow re-annotation of external references (e.g. Ensembl gene ID, UniProt ID)
- At least 3 replicates for each value of the experimental factor
- Maximum 4 experimental factors
- Adequate sample annotation using EFO terms
- Presence of raw data files: CEL raw data files for Affymetrix assays, fastq files for RNA-seq experiments
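Two of these criteria (at least 3 replicates per factor value, at most 4 factors) are simple counting rules, so they can be checked automatically. The sketch below is an illustration only, not the Atlas curation code; `check_design` and the sample data are hypothetical, and the other criteria (probe annotation, EFO annotation, raw files) are not covered.

```python
from collections import Counter

MIN_REPLICATES = 3  # replicates required for each factor value
MAX_FACTORS = 4     # maximum number of experimental factors


def check_design(samples, factors):
    """Return a list of problems found; an empty list means both rules are met."""
    problems = []
    if len(factors) > MAX_FACTORS:
        problems.append(f"too many factors: {len(factors)} > {MAX_FACTORS}")
    for factor in factors:
        counts = Counter(sample[factor] for sample in samples)
        for value, n in sorted(counts.items()):
            if n < MIN_REPLICATES:
                problems.append(f"{factor}={value}: only {n} replicate(s)")
    return problems


# Hypothetical experiment: three wild type samples but only two knockouts.
samples = [{"genotype": "wild type"}] * 3 + [{"genotype": "knockout"}] * 2
print(check_design(samples, ["genotype"]))
```

In this example the knockout group has only two replicates, so the experiment would fail the replicate rule.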

36 Atlas construction – analysis pipeline
A dummy example from one experiment:
- Input data (Affymetrix CEL files; processed data for non-Affymetrix platforms): expression of each gene under Cond.1, Cond.2 and Cond.3
- Linear model* (Bioconductor limma)
- Output: 2-D matrix of genes by conditions (Cond.1, Cond.2, Cond.3), where 1 = differentially expressed and 0 = not differentially expressed
* More information about the statistical methodology:

37 Atlas construction – analysis pipeline
How differential expression is calculated in one experiment:
“Is gene X differentially expressed in condition 1 in this experiment?”
- For gene X, the plot shows the Cond.1 mean, Cond.2 mean, Cond.3 mean and the mean of all samples (each point is a single expression value for gene X)
- Compare each condition mean with the mean of all samples and calculate the statistic
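The comparison step can be illustrated with plain arithmetic. The sketch below only computes the difference between each condition mean and the mean of all samples for one gene; it is a simplified stand-in and does not reproduce the linear model or the test statistic used by the Atlas pipeline. The function name and the expression values are invented.

```python
from statistics import mean


def condition_vs_overall(expression_by_condition):
    """For one gene, return {condition: condition mean minus mean of all samples}."""
    all_values = [v for values in expression_by_condition.values() for v in values]
    overall = mean(all_values)
    return {
        condition: mean(values) - overall
        for condition, values in expression_by_condition.items()
    }


# Invented expression values for gene X, three samples per condition.
gene_x = {
    "Cond.1": [8.0, 8.0, 8.0],
    "Cond.2": [5.0, 5.0, 5.0],
    "Cond.3": [5.0, 5.0, 5.0],
}
print(condition_vs_overall(gene_x))
```

With these values the mean of all samples is 6.0, so Cond.1 sits 2.0 above it and the other two conditions sit 1.0 below. A real analysis would also account for the variability within each condition before calling the gene differentially expressed.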

38 Atlas construction - analysis pipeline
Apply linear modelling statistics to each of the n experiments:
- Exp. 1: genes by conditions (Cond.1, Cond.2, Cond.3) → statistical test
- Exp. 2: genes by conditions (Cond.4, Cond.5, Cond.6) → statistical test
- Exp. n: genes by conditions (Cond.X, Cond.Y, Cond.Z) → statistical test
Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed under a certain condition.

39 Atlas construction - result
Summary of the “verdicts” from different experiments
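Summarising the per-experiment verdicts amounts to counting votes for each gene and condition. The sketch below assumes each verdict is recorded as one of "up", "down" or "none"; that encoding, the function name and the example records are all hypothetical.

```python
from collections import Counter


def summarise_verdicts(verdicts):
    """Count calls per (gene, condition) across experiments.

    verdicts: iterable of (experiment, gene, condition, call) tuples,
    where call is "up", "down" or "none".
    """
    summary = {}
    for _experiment, gene, condition, call in verdicts:
        summary.setdefault((gene, condition), Counter())[call] += 1
    return summary


# Invented verdicts from three experiments.
verdicts = [
    ("Exp.1", "gene X", "brain", "up"),
    ("Exp.2", "gene X", "brain", "up"),
    ("Exp.3", "gene X", "brain", "none"),
    ("Exp.1", "gene X", "liver", "down"),
]
for key, counts in summarise_verdicts(verdicts).items():
    print(key, dict(counts))
```

Keeping the counts separate, rather than merging the underlying data, matches the idea that every experiment casts its own vote.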

40 Expression Atlas home page
- Query for genes
- Query for conditions
- Restrict the query by direction of differential expression (up, down, both, neither)
- The ‘advanced query’ option allows building more complex queries

41 Mapping microarray probes to genes
Probe identifiers → Ensembl genes (expression data are stored per probe).
Every Atlas release (roughly monthly) takes the latest Ensembl gene to probe identifier mapping data.
From Ensembl genes, we also get:
- Compara genes
- External references (xrefs) to other databases, e.g. UniProt protein IDs, NCBI RefSeq IDs, HGNC gene symbols, Gene Ontology terms, InterPro terms
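Since expression data are stored per probe, looking up a gene means inverting the probe-to-gene mapping. The sketch below shows that inversion with a plain dictionary. Only the pair 210040_at / KCC2 comes from this presentation; the other identifiers are placeholders, and the real mapping is taken from Ensembl at each release.

```python
def probes_by_gene(probe_to_gene):
    """Invert a probe -> gene mapping into gene -> sorted list of probes."""
    genes = {}
    for probe, gene in sorted(probe_to_gene.items()):
        genes.setdefault(gene, []).append(probe)
    return genes


# 210040_at -> KCC2 appears later in the slides; the rest are placeholders.
probe_to_gene = {
    "210040_at": "KCC2",
    "placeholder_probe_1": "KCC2",
    "placeholder_probe_2": "placeholder gene",
}
print(probes_by_gene(probe_to_gene))
```

Refreshing `probe_to_gene` at each release is what lets the same probe-level data follow updated gene annotation.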

42 Example Atlas search: KCC2 gene and BPA
Scenario: you study the health impact of bisphenol A (BPA).
- BPA is a common additive in household plastic items.
- Negative health effects have been linked to BPA, e.g. on foetal and neonatal brain development.
- PNAS paper (Yeo et al., 2013): “Bisphenol A delays the perinatal chloride shift in cortical neurons by epigenetic effects on the Kcc2 promoter.”
- With BPA: potassium chloride cotransporter 2 (Kcc2) mRNA levels ↓ (epigenetic downregulation)
Your questions:
- What is the human KCC2 gene? What is its general expression pattern?
- In which human organ/tissue is the KCC2 gene differentially expressed?
- What is the expression pattern of KCC2/Kcc2 orthologues?

43 Gene search: human KCC2 gene

44 (1) Summarised expression data for one gene
- Group by experimental factor / intent
- Default: sort by levels of differential expression
- Clicking on a factor/condition changes the profile display

45 (2) The anatomogram ArrayExpress

46 (3) Detailed expression profile
Drill down to:
- 1 probe (210040_at)
- mapped to 1 gene (KCC2)
- in 1 experiment (E-GEOD-3526)
* Samples mapped to the “brain” experimental factor by EFO

47 (4) Jump to orthologues from gene summary
Orthology comes from the Ensembl Compara database

48 (5) Compare orthologues with parallel heatmaps

49 Atlas ‘condition-only’ query

50 Atlas ‘condition-only’ query (cont’d) heatmap view

51 Atlas gene + condition query

52 Atlas query refining (method 1)
What if there are no terms in the “REFINE YOUR QUERY” box which fit my biological question?

53 Atlas query refining (method 2)

54 Atlas query refining (method 2)
AND ArrayExpress

55 Atlas query refining (method 2)
AND ArrayExpress

56 QUESTIONS? ArrayExpress

57 Hands-on exercises
Hands-on exercise 3: Find information on Tbx5 expression in mouse in relation to Holt-Oram syndrome
Hands-on exercise 4: Find genes involved in human male (in)fertility

58 More queries to try…
In ArrayExpress:
- Find experiments which studied mouse models of rheumatoid arthritis, focusing on the synovial membrane.
- Find DNA-based experiments (e.g. ChIP-chip, genotyping) directly submitted to ArrayExpress and studying the effect of estrogen on epigenetic changes in human.
In the Expression Atlas:
- Find genes differentially expressed in human ovarian cancer cell lines. These genes may be molecular biomarkers for clinical diagnosis.
- Find genes which are upregulated in human under the condition “ultraviolet light”. Among the genes returned, filter for those which have the function “DNA repair”.

59 ArrayExpress-Atlas Crossword

60 A glimpse of what’s coming…
“Differential atlas”
“Is gene X differentially expressed in condition 1 in this experiment?”
- For gene X, the plot shows the Cond.1 mean, Cond.2 mean, Cond.3 mean and the mean of all samples (each point is a single expression value for gene X)
- Create “contrasts” and calculate the statistic

61 “Differential atlas” mock-up (1)
- Lots of mouse-over tips/help (?)
- Experiment design, data analysis methods, full analytics data for download
- FDR cut-off
- Clearer indication of experimental factor and contrast
- Colour gradient showing severity of differential expression
- MA plots

62 “Differential atlas” mock-up (2)
Clearer indication of experimental factor and contrast

63 A glimpse of what’s coming…
“Baseline atlas”

64 A glimpse of what’s coming…
“Baseline atlas”
- Gene expression in normal tissues/cell lines; not looking for genes differentially expressed between conditions
- E.g. “Give me all the genes expressed in normal human kidney”
- Can also filter genes by expression level (e.g. FPKM values)
Started with:
- Illumina Body Map 2.0 RNA-seq data, 16 tissues (E-MTAB-513)
- Human cell lines from ENCODE (E-GEOD-26284)
- Mouse RNA-seq, DBAxC57BL/6J: heart, hippocampus, liver, lung, spleen and thymus (E-MTAB-599)
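A baseline query of the kind “give me all the genes expressed in normal human kidney” reduces to filtering by an expression threshold. The sketch below illustrates this with invented FPKM values; the gene names, the function `expressed_genes` and the default threshold of 1.0 are assumptions for illustration, not values used by the Atlas.

```python
def expressed_genes(fpkm_by_gene, tissue, threshold=1.0):
    """Return the genes whose FPKM in the given tissue is at or above the threshold."""
    return sorted(
        gene
        for gene, by_tissue in fpkm_by_gene.items()
        if by_tissue.get(tissue, 0.0) >= threshold
    )


# Invented FPKM values for three placeholder genes.
fpkm_by_gene = {
    "gene A": {"kidney": 25.0, "brain": 0.2},
    "gene B": {"kidney": 0.4, "brain": 12.0},
    "gene C": {"kidney": 3.5},
}
print(expressed_genes(fpkm_by_gene, "kidney"))
print(expressed_genes(fpkm_by_gene, "kidney", threshold=10.0))
```

Raising the threshold narrows the list to the most highly expressed genes.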

65 KCC2 gene in Baseline Atlas
FPKM threshold slider

66 Find out more about the two databases….
- Visit our eLearning portal, Train Online, for tutorials on ArrayExpress and Expression Atlas
- ArrayExpress Bioconductor R package:
- ArrayExpress help:
- Email us at:
- Atlas mailing list:

67 Open-source tools for FG data analysis
- GenePattern (Broad Institute)
- GenomeSpace (incorporates GenePattern; ArrayExpress provides a link to send data directly to GenomeSpace)
- Galaxy (allows more modular customisation of workflows)
- Bioconductor R (comprehensive help documentation on standard workflows)
- Bioconductor Case Studies (Hahne et al.)
- Microarray Technology in Practice (Russell et al.)

68 Data submission to ArrayExpress Archive

69 Data submission to ArrayExpress
- Read this help page carefully before preparing any files
- Use the MAGE-TAB submission tools to create a tailor-made template spreadsheet (IDF and SDRF) for your experiment

70 Submission of HTS data
ArrayExpress acts as a “broker” for the submitter.
- Meta-data and processed data: ArrayExpress
- Raw sequence reads* (e.g. fastq, bam): ENA
* Check the list of accepted read file formats.

71 What happens after submission?
- Email confirmation
- The submission is ‘closed’, so no more editing on your end
- You can keep data private until publication. We will provide login account details to you and the reviewer for private data access
- Curation: we will email you with any questions, and may ‘re-open’ the submission for you to make changes
Get your submission in the best possible shape to shorten curation and processing time!

72 Submission checklist
Microarrays:
1. Is your array design already accessioned in ArrayExpress? If your array design is not represented, you will have to submit it to us before submitting any experimental data, because all data points in your raw/processed files refer back to the array design file.
2. Do you have all the data files ready in the required formats?
HTS:
1. Are your read files in a format accepted by the SRA?
2. If yes, have you dropped the files on the private ArrayExpress FTP site and emailed us about them?
Both:
3. Have you filled in the MAGE-TAB spreadsheet with as much meta-data as possible?

73 Need help with submitting your data?
- Visit our eLearning portal, Train Online, for the specific tutorial on how to submit data using MAGE-TAB: using-mage-tab
- ArrayExpress help page on submissions:
- Watch this short YouTube video on how to navigate the MAGE-TAB submission tool:
- Email curators at:

