Download presentation
Presentation is loading. Please wait.
Published byAngela Golden Modified over 9 years ago
1
ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Emma Hastings Functional Genomics Team EBI-EMBL emma@ebi.ac.uk http://www.ebi.ac.uk/~emma/Zagreb/
2
2 Services available from EBI Genomes DNA & RNA sequence Gene expression Protein sequence Protein families, motifs and domains Protein families, motifs and domains Protein structure Protein interactions Chemical entities Pathways Systems Literature and ontologies www.ebi.ac.uk
3
ArrayExpress3 Talk structure Why do we need a database for functional genomics data? ArrayExpress database Archive Gene Expression Atlas Database content Query the database Data download Data submission
4
What is functional genomics? Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions. Unlike genomics and proteomics, functional genomics focuses on the dynamic aspects such as gene transcription, translation, and protein-protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA transcripts, and protein products. ArrayExpress & Atlas4
5
Components of a functional genomics experiment Array design information Location of each element Description of each element Hybridization protocol Quantification matrix Software specifications Sample source Sample treatments RNA extraction protocol Labelling protocol Control array elements Normalization method Image Scanning protocol Software specifications Array Normalized data Raw data Sample Sample source Sample treatments Template preparation Library Library preparation Chip Cluster amplification Sequencing and imaging Data analysis From images to sequences Quality Control Sequence alignment Assembly Specific steps depending on the application Data analysis
6
Why do we need a database for functional genomics data? ArrayExpress & Atlas6 E-MEXP-2451 Transcription profiling of grape berries during ripening Grape berries undergo considerable physical and biochemical changes during the ripening process. Ripening is characterized by a number of changes, including the degradation of chlorophyll, an increase in berry deformability, a rapid increase in the level of hexoses in the berry vacuole, an increase in berry volume, the catabolism of organic acids, the development of skin colour, and the formation of compounds that influence flavour, aroma, and therefore, wine quality. The aim of this work is to identify differentially expressed genes during grape ripening by microarray and real- time PCR techniques. Using a custom array of new generation, we analysed the expression of 6000 grape genes from pre-veraison to full maturity, in Vitis vinifera cultivar Muscat of Hamburg, in two different years (2006 and 2007). Five time points per year and two biological replicates per stadium were considered. To reduced intra- plant and inter-plant biological variability, for each ripening stadium we collected around hundred berries from several bunch grapes of five plants of V. vinifera cv Muscat of Hamburg. We will use the real- time PCR technique to validate microarray data.Muscat of Hamburg. We will use the real-time PCR technique to validate microarray data. E-MEXP-2451 Transcription profiling of grape berries during ripening Grape berries undergo considerable physical and biochemical changes during the ripening process. Ripening is characterized by a number of changes, including the degradation of chlorophyll, an increase in berry deformability, a rapid increase in the level of hexoses in the berry vacuole, an increase in berry volume, the catabolism of organic acids, the development of skin colour, and the formation of compounds that influence flavour, aroma, and therefore, wine quality. The aim of this work is to identify differentially expressed genes during grape ripening by microarray and real- time PCR techniques. Using a custom array of new generation, we analysed the expression of 6000 grape genes from pre-veraison to full maturity, in Vitis vinifera cultivar Muscat of Hamburg, in two different years (2006 and 2007). Five time points per year and two biological replicates per stadium were considered. To reduced intra- plant and inter-plant biological variability, for each ripening stadium we collected around hundred berries from several bunch grapes of five plants of V. vinifera cv Muscat of Hamburg. We will use the real- time PCR technique to validate microarray data.Muscat of Hamburg. We will use the real-time PCR technique to validate microarray data.
7
Why do we need a database for functional genomics data? ArrayExpress & Atlas7
8
ArrayExpress www.ebi.ac.uk/arrayexpress/ Is a public repository for functional genomics data, mostly generated using microarray or high throughput sequencing (HTS) assays Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ Provides easy access to well annotated microarray data in a structured and standardized format Facilitates the sharing of microarray designs and experimental protocols Based on FGED standards: MIAME checklist, MAGE-TAB format and MO Ontology. MINSEQE checklist for HTS data (http://www.mged.org/minseqe/) ArrayExpress8
9
Reporting standards Data standardization efforts focus on answering 3 major questions: 1.What information do we need to capture? 2.What syntax (or file format) should we use to exchange data? 3.What semantics (or ontology) should we use to best describe its annotation? ArrayExpress9
10
What information do we need to capture? MIAME checklist Minimal Information About a Microarray Experiment The 6 most critical elements contributing towards MIAME are: 1.Essential sample annotation including experimental factors and their values (e.g. compound and dose) 2.Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ….) 3.Sufficient array annotation (e.g. gene identifiers, genomic coordinates, probe sequences or array catalog number) 4.Essential laboratory and data processing protocols (e.g. normalization method used) 5.Raw data for each hybridization (e.g. CEL or GPR files) 6.Final normalized data for the set of hybridizations in the experiment ArrayExpress10
11
What information do we need to capture? MINSEQE checklist Minimal Information about a high-throughput Nucleotide SEQuencing Experiment The proposed guidelines for MINSEQE are (still work in progress): 1.General information about the experiment 2.Essential sample annotation including experimental factors and their values (e.g. compound and dose) 3.Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ….) 4.Essential experimental and data processing protocols 5.Sequence read data with quality scores, raw intensities and processing parameters for the instrument 6.Final processed data for the set of assays in the experiment ArrayExpress11
12
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment: IDFInvestigation Description Format file, contains top-level information about the experiment including title, description, submitter contact details and protocols. SDRFSample and Data Relationship Format file contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter. ADFArray Design Format file, describes the design of an array, i. e. the sequence located at each feature on the array and annotation of the sequences. Data filesRaw and processed data files. The ‘raw’ data files are the files produced by the microarray image analysis software, such as CEL files for Affymetrix or GPR files from GenePix. The processed data file is a ‘data matrix’ file containing processed values, as provided by the data submitter. 12 What syntax (or file format) should we use to exchange data?
13
What semantics (or ontology) should we use to best describe its annotation? Ontology, which is a formal specification of terms in a particular subject area and the relations among them. Its purpose is to provide a basic, stable and unambiguous description of such terms and relations in order to avoid improper and inconsistent use of the terminology pertaining to a given domain. ArrayExpress13
14
Experimental factor ontology (EFO) http://www.ebi.ac.uk/efo Application focused ontology modeling experimental factors (EFs) in AE Developed to: Consistent curation – ensure that curators use same vocabulary Query support (e.g, query for 'cancer' and get also ‘leukemia') EFs are transformed into an ontological representation, forming classes and relationships between those classes EFO terms map to multiple existing domain specific ontologies, such as the Disease Ontology and Cell Type Ontology ArrayExpress14
15
ArrayExpress & Atlas15 What semantics (or ontology) should we use to best describe its annotation?
16
ArrayExpress16 ArrayExpress – two databases
17
ArrayExpress17 How to query AE and Atlas? AE Archive Query by experiment, sample and experimental factor annotations Filter on species, array platform, molecule assayed and technology used Gene Expression Atlas Gene and/or condition queries Query across experiments and across platforms
18
ArrayExpress – two databases ArrayExpress18
19
ArrayExpress Archive 22281 ArrayExpress & Atlas19...and counting
20
How much data in AE Archive? ArrayExpress20
21
ArrayExpress21 Archive by species
22
ArrayExpress22 Browsing the AE Archive
23
Species investigated Curated title of experiment The date when the data released AE unique experiment ID ArrayExpress23 Raw sequencing data available in ENA The list of experiments retrieved can be printed, saved as Tab-delimited format or exported to Excel or as RSS feed The total number of experiments and assay retrieved The direct link to raw and processed data. An icon indicates that this type of data is available. Loaded in Atlas flag Number of assays
24
ArrayExpress24 Browsing the AE Archive
25
Experimental factor ontology (EFO) Example: ArrayExpress25 carcinoma acinar cell carcinoma adrenocortical carcinoma bladder carcinoma breast carcinoma cervical carcinoma esophageal carcinoma follicular thyroid carcinoma hepatocellular carcinoma pancreatic carcinoma papillary thyroid carcinoma renal carcinoma signet ring cell carcinoma uterine carcinoma adenocarcinoma adenoid cystic carcinoma endometrioid carcinoma gastric carcinoma hereditary leiomyomatosis and renal cell cancer hypopharyngeal carcinoma acinar cell carcinoma adrenocortical carcinoma bladder carcinoma breast carcinoma cervical carcinoma esophageal carcinoma follicular thyroid carcinoma hepatocellular carcinoma pancreatic carcinoma papillary thyroid carcinoma renal carcinoma signet ring cell carcinoma uterine carcinoma adenocarcinoma adenoid cystic carcinoma endometrioid carcinoma gastric carcinoma hereditary leiomyomatosis and renal cell cancer hypopharyngeal carcinoma
26
Searching AE Archive Simple query - EFO ArrayExpress26 Matches to exact terms are highlighted in yellow Matches to child terms in the EFO are highlighted in pink
27
Searching AE Archive Simple query Search across all fields: AE accession number e.g. E-MEXP-568 Secondary accession numbers e.g. GEO series accession GSE5389 Experiment name Submitter's experiment description Sample attributes, experimental factor and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression) Publication title, authors and journal name, PubMed ID Synonyms for terms are always included in searches e.g. 'human' and 'Homo sapiens’ ArrayExpress27
28
AE Archive query output Matches to exact terms are highlighted in yellow
29
AE Archive – experiment view ArrayExpress29 MIAME or MINSEQE scores show how much the experiment is standard compliant Experimental factor(s) and its values Link to files available. This varies between sequencing and microarray data. For microarray experiments you also have array design file. Link to files available. This varies between sequencing and microarray data. For microarray experiments you also have array design file.
30
AE Archive – SDRF file ArrayExpress30
31
SDRF file – sample & data relationship ArrayExpress31
32
AE Archive – ADF file ArrayExpress32
33
AE Archive – all files ArrayExpress33
34
ArrayExpress34 AE Archive – all files
35
Searching AE Archive Advanced query Combine search terms Enter two or more keywords in the search box with the operators AND, OR or NOT. AND is the default search term; a search for kidney cancer will return hits with a match to ‘kidney' AND ‘cancer’ Search terms of more than one word must be entered inside quotes otherwise only the first word will be searched for. E.g. “kidney cancer” Specify fields for searches Particular fields for searching can also be specified in the format of fieldname:value ArrayExpress35
36
Searching AE Archive Advanced query - fieldnames ArrayExpress36 Field nameSearchesExample accession Experiment primary or secondary accessionaccession:E-MEXP-568 array Array design accession or namearray:AFFY-2 OR array:Agilent* ef Experimental factor, the name of the main variables in an experiment. ef:celltype OR ef:compound efv Experimental factor value. Has EFO expansion.efv:fibroblast expdesign Experiment design typeexpdesign:”dose response” exptype Experiment type. Has EFO expansion.exptype:RNA-seq gxa Presence in the Gene Expression Atlas. Only value is gxa:true. ef:compound AND gxa:true pmid PubMed identifierpmid:16553887 sa Sample attribute values. Has EFO expansion.sa:wild_type species Species of the samples. Has EFO expansion.species:”homo sapiens” AND ef:cellline
37
Searching AE Archive Advanced query ArrayExpress37 FilterWhat is filtered assaycount:[x TO y]filter on the number of of assays where x <= y and both values are between 0 and 99,999 (inclusive). To count excluding the values given use curly brackets e.g. assaycount:{1 TO 5} will find experiments with 2-4 assays. Single numbers may also be given e.g. assaycount:10 will find experiments with 10 assays. efcount:[x TO y]filter on the number of experimental factors samplecount:[x TO y]filter on the number of samples sacount:[x TO y]filter on the number of sample attribute categories rawcount:[x TO y]filter on the number of raw files fgemcount:[x TO y]filter on the number of final gene expression matrix (processed data) files miamescore:[x TO y]filter on the MIAME compliance score (maximum score is 5) date:yyyy-mm-ddfilter by release date date:2009-12-01 - will search for experiments released on 1st of Dec 2009 date:2009* - will search for experiments released in 2009 date:[2008-01-01 2008-05-31] - will search for experiments released between 1st of Jan and end of May 2008 Filtering experiments by counts of a particular attribute Experiments fulfilling certain count criteria can also be searched for e.g. having more than 10 assays (hybridizations)
38
ArrayExpress & Atlas38 Searching AE Archive Advanced query – Examples
39
Exercise 1 ArrayExpress39
40
ArrayExpress40 ArrayExpress – two databases
41
ArrayExpress41 The criteria we use for selecting experiments for inclusion in the Atlas are as follows: Array designs relating to experiment must be provided to enable re-annotation using Ensembl or Uniprot (or have the potential for this to be done) High MIAME scores Experiment must have 6 or more hybridizations Sufficient replication and large sample size EF and EFV must be well annotated Adequate sample annotation must be provided Processed data must be provided or raw data which can be renormalized must be available Gene Expression Atlas Experiment selection criteria
42
ArrayExpress42 New meta-analytical tool for searching gene expression profiles across experiments in AE Data is taken as normalized by the submitter Gene-wise linear models (limma) and t-statistics are applied to calculate the strength of genes’ differential expression across conditions across experiments The result is a two-dimensional matrix where rows correspond to genes and columns correspond to conditions, rather than samples. The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression Gene Expression Atlas Atlas construction
43
ArrayExpress43 Gene Expression Atlas Atlas construction
45
ArrayExpress45 Gene Expression Atlas
46
ArrayExpress46 Atlas home page http://www.ebi.ac.uk/gxa/ Query for genes Query for conditions Restrict query by direction of differential expression The ‘advanced query’ option allows building more complex queries
47
ArrayExpress47 Atlas home page The ‘Genes’ search box
48
ArrayExpress48 Atlas home page The ‘Conditions’ search box
49
ArrayExpress49 Atlas home page A single gene query
50
Atlas gene summary page ArrayExpress50
51
ArrayExpress & Atlas51 Atlas home page A ‘Conditions’ only query
52
ArrayExpress52 Atlas heatmap view
53
Atlas list view Click the ‘expression profile’ link to view the experiment page ArrayExpress53
54
Atlas data download ArrayExpress54
55
ArrayExpress55 Atlas experiment page Box plot showing differential expression of Mt2,Agxt2l1 and Insig2 across the growth conditions studied. Data is available for 3 separate array probes. Search for a specific gene, experimental factor or expression level (up/down)
56
ArrayExpress & Atlas56 Atlas experiment page – HTS data
57
ArrayExpress57 Atlas gene-condition query
58
ArrayExpress58 Atlas query refining
59
ArrayExpress59 Atlas gene-condition query
60
ArrayExpress60 Atlas query refining
61
ArrayExpress61 Atlas query refining
62
ArrayExpress62 Exercises 2 & 3
63
ArrayExpress63 Data submission to AE
64
ArrayExpress64 Data submission to AE www.ebi.ac.uk/microarray/submissions.html
65
MIAMExpress- Overview MIAMExpress is a web based tool for submitting microarray data and microarray designs to the ArrayExpress database MIAMExpress can be used for experiments up to 50 hybridizations in size ArrayExpress & Atlas65 Array design (custom) Experiment Protocol
66
MIAMExpress- Getting Started ArrayExpress & Atlas66
67
MIAMExpress- Protocol Submission ArrayExpress & Atlas67 Change the automatically generated protocol name to something meaningful like Sanger Lab Growth Protocol
68
MIAMExpress- Experiment Submission ArrayExpress & Atlas68 Experiment description SamplesExtracts Labeled extracts Hybs Upload raw and norm data Upload combined data files Batchloader interface where you fill in your experiment information in a spreadsheet-like format Web page interface in which a series of web forms are filled out.
69
MIAMExpress-Batch Upload Tool ArrayExpress & Atlas69 Experiment description SamplesExtracts Labeled extracts Hybs Upload raw and norm data Upload combined data files HELP
70
MIAMExpress-Batch Upload Tool ArrayExpress & Atlas70 Experiment description SamplesExtracts Labeled extracts Hybs Upload raw and norm data Upload combined data files Click multiply to create as many rows as you need. Hint: If you complete all the values in the first row prior to this it will save you time Toolbar Cross document functionality
71
MIAMExpress-Batch Upload Tool ArrayExpress & Atlas71 Experiment description SamplesExtracts Labeled extracts Hybs Upload raw and norm data Upload combined data files Double click and select one or more samples per extract
72
MIAMExpress-Batch Upload Tool ArrayExpress & Atlas72 Experiment description SamplesExtracts Labeled extracts Hybs Upload raw and norm data Upload combined data files Use the folder icon to select the required data files Affymetrix users submit the.CEL file here (and.EXP if available). For all other submissions the raw data must be a single.txt or.gpr file.
73
MIAMExpress-Batch Upload Tool ArrayExpress & Atlas73 Experiment description SamplesExtracts Labeled extracts Hybs Upload raw and norm data Upload combined data files If you are supplying a Final Gene Expression Data File you must supply a transformation protocol
74
MIAMExpress-Data File Formats for Submission Submitted normalized data files must contain data from a single hybridization only. If your normalization procedure creates a file containing data from all your hybs then you can submit this as a final gene expression matrix (FGEM). A normalization protocol should be submitted along with your normalized data files-be precise when describing how the data was calculated A final gene expression matrix (FGEM) or combined data file is a file containing data from several hybridizations. The creation of your FGEM must be described in your transformation protocol. The format of the FGEM is as follows: each line corresponds to an array element each column corresponds to your calculated value ArrayExpress & Atlas74
75
Array Submission- Background An array design describes how a microarray was manufactured, what was printed/synthesized at each position on the array and what biological sequences these represent. If the array design has already been submitted to ArrayExpress, you do not need to re-submit it. If you have used a custom array design you will need to submit the array design as an Array Design Format (ADF) file After submitting an array design you can immediately continue with your experiment submission The ADF file can be created in any spreadsheet application but must be saved as a tab delimited text file. ArrayExpress & Atlas75
76
MIAMExpress-Array Submission ArrayExpress & Atlas76 Enter some general information about your array Checker- This tool will check for errors in the format and content of an ADF file.
77
MIAMExpress-Array Submission ArrayExpress & Atlas77 Location and nameLocation and name of each reporter (probe) on the array AnnotationAnnotation such as the nucleotide sequence or database accession associated with the reporter e.g. RefSeq Type of reporterType of reporter (cDNA, oligonucleotide, RNA, DNA etc) that is present Group of reporterGroup of reporter - which are the control and which are the experimental reporters Optional additional informationOptional additional information - this can be used to show that several reporters are associated with the same gene for example, or add a comment about a reporter. control_biosequence - for example a spike control_buffer - buffer spotted on the array control_empty - nothing spotted on the array control_genomic_DNA - e.g. salmon sperm DNA control_label - landing lights control_reporter_size - size standard control_spike_calibration - spike at varying concentrations control_unknown_type control_biosequence - for example a spike control_buffer - buffer spotted on the array control_empty - nothing spotted on the array control_genomic_DNA - e.g. salmon sperm DNA control_label - landing lights control_reporter_size - size standard control_spike_calibration - spike at varying concentrations control_unknown_type
78
MAGE-TAB Example: IDF ArrayExpress & Atlas78
79
MAGE-TAB Example: SDRF ArrayExpress & Atlas79
80
MAGE-TAB Example: SDRF ArrayExpress & Atlas80
81
MAGE-TAB Submission ArrayExpress & Atlas81 Submitter is directed to either submit an experiment or download a template to the desktop Ontology Link Indicate submission type
82
MAGE-TAB Submission ArrayExpress & Atlas82
83
What happens after submission? Email confirmation Curation The curation team will review your submission and will email you with any questions. Possible reopening for editing We will send you an accession number when all the required information has been provided. We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public. ArrayExpress & Atlas83
84
Submission of High-throughput sequencing gene expression data ArrayExpress & Atlas84
85
MAGE-TAB Example: SDRF
86
ArrayExpress & Atlas86 MAGE-TAB Example: SDRF
87
What needs to be included in the Spreadsheet? Include Assay Name and Technology Type columns Raw files must go in the Array Data File column A sequencing protocol must be provided. The sequencing protocol should have a performer- this is used as the run center name. This protocol must have a Protocol Hardware value saying which sequencing instrument was used (e.g. 454 GS, Illumina Genome Analyzer, AB SOLiD System Reference this in the Protocol REF column before the Assay Name column. These 4 extra Comment[] columns should be added after Extract Name to provide information about how the library was prepared 1. Comment[LIBRARY_LAYOUT] - Specifies whether to expect single or paired reads. The column should contain one of the following: SINGLE or PAIRED 2. Comment[LIBRARY_SOURCE] - Specifies the type of source material that is being sequenced. The column should contain one of the following: TRANSCRIPTOMIC, METAGENOMIC, SYNTHETIC, VIRAL RNA, OTHER * 3. Comment[LIBRARY_STRATEGY] - Sequencing technique intended for the library. The column should contain one of the following: WGS, WCS, WXS, CLONE, CLONEEND, POOLCLONE, FINISHING, AMPLICON, RNA-Seq, EST, FL-cDNA, CTS, ChIP-Seq, MNase-Seq, DNase-Hypersensitivity, Bisulfite-Seq, MRE-Seq, MeDIP-Seq, MBD-Seq, OTHER * 4. Comment[LIBRARY_SELECTION] - Method was used to select and/or enrich the material being sequenced. The column should contain one of the following: RANDOM, PCR, RANDOM PCR, RT-PCR, cDNA, CAGE, RACE, ChIP, MNase DNAse, HMPR, MF, MSLL, 5-methylcytidine antibody, MBD2 protein methyl-CpG binding domain, Hybrid Selection, Reduced Representation, Restriction Digest, size fractionation, CF-S, CF-M, CF-H, CF-T, other, unspecified * ArrayExpress & Atlas87
88
Submissions- Key Point Don’t be afraid to ask ArrayExpress & Atlas88 miamexpress@ebi.ac.uk
89
Summary: ArrayExpress Search by experiment Search by keyword Link to sample properties and experiment design View experiment Search by gene across experiments Browse results summary Search by gene name, species and experimental condition Search by gene name, species and experimental condition View expression under different conditions and profiles
90
ArrayExpress90 Acknowledgements ArrayExpress/Atlas training is funded by the European Commission under SLING, grant agreement number 226073 (Integrating Activity) within Research Infrastructures Action of the FP7 Capacities Specific Programme.
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.