ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL


1 ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL gabry@ebi.ac.uk

2 Talk structure
• Why do we need a database for functional genomics data?
• ArrayExpress database
  - Archive
  - Gene Expression Atlas
• Database content
• Query the database
• Data download
• Data submission

3 Components of a functional genomics experiment
[Diagram of two workflows; only the labels survive.]
Microarray experiment (stages: Sample, Array, Raw data, Normalized data, Data analysis)
• Array design information: location of each element, description of each element
• Hybridization protocol
• Quantification matrix, software specifications
• Sample source, sample treatments
• RNA extraction protocol, labelling protocol
• Control array elements
• Normalization method
• Image: scanning protocol, software specifications
Sequencing experiment (stages: Sample, Library, Chip, Data analysis)
• Sample source, sample treatments
• Template preparation, library preparation
• Cluster amplification
• Sequencing and imaging
• From images to sequences
• Quality control
• Sequence alignment, assembly
• Specific steps depending on the application

4 ArrayExpress
www.ebi.ac.uk/arrayexpress/
• Is a public repository for functional genomics data, mostly generated using microarray or high-throughput sequencing (HTS) assays
• Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
• Provides easy access to well-annotated microarray data in a structured and standardized format
• Facilitates the sharing of microarray designs and experimental protocols
• Based on FGED standards: the MIAME checklist, the MAGE-TAB format and the MGED Ontology (MO)
• MINSEQE checklist for HTS data (http://www.mged.org/minseqe/)

5 Reporting standards for microarrays – MIAME checklist
• Minimal Information About a Microarray Experiment
• The 6 most critical elements contributing towards MIAME are:
  1. Essential sample annotation including experimental factors and their values (e.g. compound and dose)
  2. Experimental design including sample data relationships (e.g. which raw data file relates to which sample, …)
  3. Sufficient array annotation (e.g. gene identifiers, genomic coordinates, probe sequences or array catalog number)
  4. Essential laboratory and data processing protocols (e.g. normalization method used)
  5. Raw data for each hybridization (e.g. CEL or GPR files)
  6. Final normalized data for the set of hybridizations in the experiment

6 Reporting standards for sequencing – MINSEQE checklist
• Minimal Information about a high-throughput Nucleotide SEQuencing Experiment
• The proposed guidelines for MINSEQE are (still work in progress):
  1. General information about the experiment
  2. Essential sample annotation including experimental factors and their values (e.g. compound and dose)
  3. Experimental design including sample data relationships (e.g. which raw data file relates to which sample, …)
  4. Essential experimental and data processing protocols
  5. Sequence read data with quality scores, raw intensities and processing parameters for the instrument
  6. Final processed data for the set of assays in the experiment

7 Reporting standards for microarrays – MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment:
• IDF: Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.
• SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
• ADF: Array Design Format file. Describes the design of an array, i.e. the sequence located at each feature on the array and the annotation of the sequences.
• Data files: raw and processed data files. The ‘raw’ data files are the files produced by the microarray image analysis software, such as CEL files for Affymetrix or GPR files from GenePix. The processed data file is a ‘data matrix’ file containing processed values, as provided by the data submitter.
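Because the SDRF is a tab-delimited spreadsheet, the sample-to-file relationships it records can be read with ordinary tools. The following is a minimal sketch, not an official parser: the file content is invented, the column headings are only illustrative of the SDRF style, and the helper names read_sdrf and files_by_factor are hypothetical.

```python
import csv
import io

# Hypothetical, heavily shortened SDRF content (assumption: real files
# carry many more columns and their exact headings may differ).
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tArray Data File\tFactor Value[compound]\n"
    "sample 1\tMus musculus\tsample1.CEL\tnone\n"
    "sample 2\tMus musculus\tsample2.CEL\tdrug A\n"
)


def read_sdrf(handle):
    """Return one dict per SDRF row, keyed by column heading."""
    return list(csv.DictReader(handle, delimiter="\t"))


def files_by_factor(rows, factor_column, file_column="Array Data File"):
    """Group raw data file names by the value of one experimental factor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor_column], []).append(row[file_column])
    return groups


rows = read_sdrf(io.StringIO(SDRF_TEXT))
groups = files_by_factor(rows, "Factor Value[compound]")
print(groups)
```

Grouping by a factor value column answers the question the SDRF exists to answer: which raw data file relates to which sample and condition.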

8 Reporting standards
What semantics (or ontology) should we use to best describe the annotation of an experiment?
• An ontology is a formal specification of terms in a particular subject area and the relations among them.
• Its purpose is to provide a basic, stable and unambiguous description of such terms and relations in order to avoid improper and inconsistent use of the terminology pertaining to a given domain.
• Thus far, Gene Ontology (GO) has been the most successful ontology initiative. GO is a controlled vocabulary used to describe the biology of a gene product in any organism.

9 Reporting standards for microarrays – MGED ontology (MO)
• The MO provides terms for annotating all aspects of a microarray experiment, from the design of the experiment and array layout through to the preparation of the biological sample and the protocols used to hybridize the RNA and analyze the data
• The MO was developed to provide terms for annotating experiments in line with the MIAME guidelines, i.e. to provide the semantics to describe a microarray experiment according to the concepts specified in MIAME
• Also check the Open Biomedical Ontologies (OBO) initiative (www.obofoundry.org) for the development of life-science ontologies

10 ArrayExpress – two databases

11 How to query AE and Atlas?
AE Archive
• Query by experiment, sample and experimental factor annotations
• Filter on species, array platform, molecule assayed and technology used
Gene Expression Atlas
• Gene and/or condition queries
• Query across experiments and across platforms

12 ArrayExpress – two databases

13 How much data in AE Archive?

14 Archive by species

15 Browsing the AE Archive

16 [Screenshot of the experiment list, with these callouts:]
• The direct link to raw and processed data. An icon indicates that this type of data is available.
• The total number of experiments and assays retrieved
• Species investigated
• Curated title of experiment
• The date when the data were loaded in the Archive
• AE unique experiment ID
• Number of assays
• The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed
• ‘Loaded in Atlas’ flag
• Raw sequencing data available in ENA

17 Browsing the AE Archive

18 Experimental factor ontology (EFO)
http://www.ebi.ac.uk/efo
• Application-focused ontology modeling experimental factors (EFs) in AE
• Developed to:
  - increase the richness of annotations that are currently made in AE Archive
  - promote consistency
  - facilitate automatic annotation and integrate external data
• EFs are transformed into an ontological representation, forming classes and relationships between those classes
• EFO terms map to multiple existing domain-specific ontologies, such as the Disease Ontology and Cell Type Ontology

19 Experimental factor ontology (EFO) – An example

20 Searching AE Archive – Simple query – EFO

21 Searching AE Archive – Simple query
• Search across all fields:
  - AE accession number, e.g. E-MEXP-568
  - Secondary accession numbers, e.g. GEO series accession GSE5389
  - Experiment name
  - Submitter's experiment description
  - Sample attributes, experimental factors and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression)
  - Publication title, authors and journal name, PubMed ID
• Synonyms for terms are always included in searches, e.g. 'human' and 'Homo sapiens'

22 AE Archive query output
• Matches to exact terms are highlighted in yellow
• Matches to synonyms are highlighted in green
• Matches to child terms in the EFO are highlighted in pink
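The three kinds of match can be illustrated with a toy ontology. This is only a sketch of the idea of query expansion, not how ArrayExpress implements it: the synonym and child-term tables are invented for illustration and are not taken from EFO, and classify_match is a hypothetical helper.

```python
# Toy data, invented for illustration; not taken from EFO.
SYNONYMS = {"cancer": {"neoplasm"}}
CHILDREN = {"cancer": {"leukemia", "carcinoma"}}

# Highlight colours as described on the slide.
HIGHLIGHT = {"exact": "yellow", "synonym": "green", "child": "pink"}


def classify_match(query, text):
    """Return 'exact', 'synonym', 'child' or None for the strongest match found."""
    words = set(text.lower().split())
    term = query.lower()
    if term in words:
        return "exact"
    if words & SYNONYMS.get(term, set()):
        return "synonym"
    if words & CHILDREN.get(term, set()):
        return "child"
    return None


for description in ("kidney cancer study",
                    "neoplasm of the lung",
                    "acute leukemia samples"):
    kind = classify_match("cancer", description)
    print(description, "->", kind, HIGHLIGHT[kind])
```

A real expansion would walk the ontology's class hierarchy rather than a flat lookup table, so that terms several levels below the query term also count as child matches.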

23 AE Archive – experiment view

24 How does processed data look?
[Diagram labels: Samples, Sample annotation, Genes, Gene annotations, Gene expression levels or count-level data]

25 AE Archive – SDRF file

26 SDRF file – sample & data relationship

27 AE Archive – ADF file

28 AE Archive – Old interface

29 AE Archive – all files

30 AE Archive – all files

31 Searching AE Archive – Advanced query
• Combine search terms
  - Enter two or more keywords in the search box with the operators AND, OR or NOT. AND is the default search term; a search for ‘kidney cancer’ will return hits with a match to ‘kidney’ AND ‘cancer’
  - Search terms of more than one word must be entered inside quotes, otherwise only the first word will be searched for, e.g. "kidney cancer"
• Specify fields for searches
  - Particular fields for searching can also be specified in the format fieldname:value

32 Searching AE Archive – Advanced query – fieldnames

Field name | Searches | Example
accession | Experiment primary or secondary accession | accession:E-MEXP-568
array | Array design accession or name | array:AFFY-2 OR array:Agilent*
ef | Experimental factor, the name of the main variables in an experiment | ef:celltype OR ef:compound
efv | Experimental factor value. Has EFO expansion. | efv:fibroblast
expdesign | Experiment design type | expdesign:"dose response"
exptype | Experiment type. Has EFO expansion. | exptype:RNA-seq
gxa | Presence in the Gene Expression Atlas. Only value is gxa:true. | ef:compound AND gxa:true
pmid | PubMed identifier | pmid:16553887
sa | Sample attribute values. Has EFO expansion. | sa:wild_type
species | Species of the samples. Has EFO expansion. | species:"homo sapiens" AND ef:cellline
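Query strings in the fieldname:value format can be assembled programmatically. The sketch below only builds the text to type into the search box; the helper names field_term and build_query are hypothetical, and no claim is made about any ArrayExpress URL or API.

```python
def field_term(field, value):
    """Format one fieldname:value term, quoting values that contain spaces."""
    if " " in value:
        value = f'"{value}"'
    return f"{field}:{value}"


def build_query(terms, operator="AND"):
    """Join (field, value) pairs with a single AND or OR operator."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(field_term(f, v) for f, v in terms)


query = build_query([("species", "homo sapiens"), ("ef", "cellline")])
print(query)  # species:"homo sapiens" AND ef:cellline
```

Quoting multi-word values automatically guards against the pitfall noted for advanced queries, where an unquoted multi-word term is only searched on its first word.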

33 Searching AE Archive – Advanced query
• Filtering experiments by counts of a particular attribute
  - Experiments fulfilling certain count criteria can also be searched for, e.g. having more than 10 assays (hybridizations)

Filter | What is filtered
assaycount:[x TO y] | Filter on the number of assays, where x <= y and both values are between 0 and 99,999 (inclusive). To exclude the values given, use curly brackets, e.g. assaycount:{1 TO 5} will find experiments with 2-4 assays. Single numbers may also be given, e.g. assaycount:10 will find experiments with 10 assays.
efcount:[x TO y] | Filter on the number of experimental factors
samplecount:[x TO y] | Filter on the number of samples
sacount:[x TO y] | Filter on the number of sample attribute categories
rawcount:[x TO y] | Filter on the number of raw files
fgemcount:[x TO y] | Filter on the number of final gene expression matrix (processed data) files
miamescore:[x TO y] | Filter on the MIAME compliance score (maximum score is 5)
date:yyyy-mm-dd | Filter by release date. date:2009-12-01 will search for experiments released on 1st of Dec 2009; date:2009* will search for experiments released in 2009; date:[2008-01-01 2008-05-31] will search for experiments released between 1st of Jan and end of May 2008
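The difference between square and curly brackets is easiest to see in code. This is a small sketch of the range semantics described above, not the search engine's own parser; parse_count_filter is a hypothetical helper, and rejecting mixed brackets such as [1 TO 5} is an assumption made here for simplicity.

```python
import re

# Matches "[x TO y]" or "{x TO y}" with non-negative integer bounds.
RANGE = re.compile(r"^([\[{])\s*(\d+)\s+TO\s+(\d+)\s*([\]}])$")


def parse_count_filter(expr):
    """Turn '[x TO y]', '{x TO y}' or a single number into a predicate on counts."""
    expr = expr.strip()
    if expr.isdigit():
        target = int(expr)
        return lambda n: n == target
    match = RANGE.match(expr)
    if not match:
        raise ValueError(f"unrecognised filter: {expr!r}")
    opening, low, high, closing = match.groups()
    low, high = int(low), int(high)
    if (opening == "[") != (closing == "]"):
        raise ValueError("mixed brackets are not handled in this sketch")
    if low > high:
        raise ValueError("x must not exceed y")
    if opening == "[":
        return lambda n: low <= n <= high   # square brackets: inclusive
    return lambda n: low < n < high         # curly brackets: exclusive


exclusive = parse_count_filter("{1 TO 5}")
print([n for n in range(7) if exclusive(n)])  # [2, 3, 4]
```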

34 Searching AE Archive – Advanced query – an example

35 Exercise 1

36 ArrayExpress – two databases

37 Gene Expression Atlas – Experiment selection criteria
• The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
  - Array designs relating to the experiment must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
  - High MIAME scores
  - Experiment must have 6 or more hybridizations
  - Sufficient replication and large sample size
  - EF and EFV must be well annotated
  - Adequate sample annotation must be provided
  - Processed data must be provided, or raw data which can be renormalized must be available

38 Gene Expression Atlas – Atlas construction
• New meta-analytical tool for searching gene expression profiles across experiments in AE
• Data is taken as normalized by the submitter
• Gene-wise linear models (limma) and t-statistics are applied to calculate the strength of genes’ differential expression across conditions across experiments
• The result is a two-dimensional matrix where rows correspond to genes and columns correspond to conditions, rather than samples
• The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
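The shape of the resulting matrix can be sketched with a much simpler statistic than the one the Atlas uses. In the illustration below a plain Welch t-statistic stands in for limma's gene-wise linear models, each condition is compared against all other samples, and the p-value comes from a normal approximation, which is crude for small sample sizes. The expression values, gene names and function names are all invented for the example.

```python
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch t-statistic for two lists of expression values."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error


def signed_p_matrix(expression, conditions):
    """Build gene -> condition -> (p-value, direction).

    expression: gene -> list of values, one per sample
    conditions: condition label for each sample, in the same order
    """
    labels = sorted(set(conditions))
    matrix = {}
    for gene, values in expression.items():
        row = {}
        for label in labels:
            inside = [v for v, c in zip(values, conditions) if c == label]
            outside = [v for v, c in zip(values, conditions) if c != label]
            t = welch_t(inside, outside)
            # Two-sided p-value from a normal approximation (assumption).
            p = 2 * (1 - NormalDist().cdf(abs(t)))
            direction = "up" if t > 0 else "down" if t < 0 else "none"
            row[label] = (p, direction)
        matrix[gene] = row
    return matrix


# Invented data: three control and three treated samples.
conditions = ["control"] * 3 + ["treated"] * 3
expression = {
    "geneA": [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],
    "geneB": [3.0, 3.4, 2.7, 3.1, 2.9, 3.3],
}
matrix = signed_p_matrix(expression, conditions)
for gene, row in matrix.items():
    print(gene, row)
```

Each row is a gene and each column a condition rather than a sample, which is what makes queries across experiments and platforms possible.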

39 Gene Expression Atlas – Atlas construction

40 [Figure legend: up-regulated, down-regulated, no change]

41 Gene Expression Atlas

42 Atlas home page
http://www.ebi.ac.uk/gxa/
[Screenshot callouts:]
• Query for gene(s)
• Query for condition(s)
• Restrict search by direction of differential expression
• The ‘advanced search’ option allows building more complex queries

43 Atlas home page – The ‘Genes’ search box & auto-complete function

44 Atlas home page – The ‘Conditions’ search box & ontology browsing

45 Atlas home page – A single gene query

46 Atlas gene summary page

47 Atlas experiment page
[Screenshot callouts:]
• Expression plot
• Table containing gene information and drop-down menus for searching within the experiment
• Experimental factors list

48 Atlas experiment page – HTS data

49 Atlas home page – A ‘Conditions’ only query

50 Atlas heatmap view

51 Atlas list view
Click the ‘expression profile’ link to view the experiment page

52 Atlas data download

53 Atlas gene-condition query

54 Atlas query refining

55 Atlas gene-condition query

56 Atlas query refining

57 Atlas query refining

58 Exercises 2, 3 & 4

59 Data submission to AE

60 Data submission to AE
www.ebi.ac.uk/microarray/submissions.html

61 Submission of HTS gene expression data
• Submit via the MAGE-TAB submission route
• Submit:
  - MAGE-TAB spreadsheet containing details of the samples and protocols used
  - Trace data files for each sample (in SRF, FASTQ or SFF format)
  - Processed data files
• For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).
• If you have human identifiable sequencing data, you need to submit to the European Genome-phenome Archive and not ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.

62 Types of data that can be submitted

63 What happens after submission?
• Email confirmation
• Curation: the curation team will review your submission and will email you with any questions.
• Possible reopening for editing
• We will send you an accession number when all the required information has been provided.
• We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public.

