ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL

Slides:



Advertisements
Similar presentations
EBSCO Discovery Service
Advertisements

Database Basics. What is Access? Database management system Computer-based equivalent of a manual database Makes it easy to organize and update information.
The Maize Inflorescence Project Website Tutorial Nov 7, 2014.
The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information Database Resources.
The Rice Functional Genomics Program of China cDNA microarray database (RIFGP-CDMD) consists of complete datasets, including the probe sequences, microarray.
Basic Genomic Characteristic  AIM: to collect as much general information as possible about your gene: Nucleotide sequence Databases ○ NCBI GenBank ○
Minimum Information About a Microarray Experiment - MIAME MGED 5 workshop.
Time line and procedures for datasets BCBC Pre-retreat Workshop Tyson’s Corner, VA May 11, 2011.
Kate Milova MolGen retreat March 24, Microarray experiments: Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Biological Databases Notes adapted from lecture notes of Dr. Larry Hunter at the University of Colorado.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
NCBI resources III: GEO and expression data analysis Yanbin Yin Fall
Using ArrayExpress. ArrayExpress is an international public repository for well-annotated microarray data, including gene expression, comparative genomic.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Tutorial 11: Connecting to External Data
Wiley Online Library. About Wiley Online Library Wiley Online Library hosts the world's broadest and deepest multidisciplinary collection of online resources.
MyiLibrary® ‘Search & View’ Website Training June 8, 2010.
ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
Gene expression services: ArrayExpress and the Gene Expression Atlas Contact: Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
ArrayExpress and Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
Microrray Data Standardisation Microarray Gene Expression Database group -- MGED December, 2000.
Support for MAGE-TAB in caArray 2.0 Overview and feedback MAGE-TAB Workshop January 24, 2008.
ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Emma Hastings Functional Genomics Team EBI-EMBL
Gene Expression Omnibus (GEO)
The MGED Society Facilitating Data Sharing and Integration with Standards CTSA Omics Data Standards Working Group Chris Stoeckert Dept. of Genetics and.
Test1 April 2004 Microarray Data Management Jianwei (Jerry) Li.
EBI is an Outstation of the European Molecular Biology Laboratory. EBI Bioinformatics Roadshow ILRI/BecA Nairobi Campus 2 nd - 3 rd March 2011.
ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
Copyright OpenHelix. No use or reproduction without express written consent1.
ArrayExpress and Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
1 Welcome to the GrameneMart Tutorial A tool for batch data sequence retrieval 1.Select a Gramene dataset to search against. 2.Add filters to the dataset.
Copyright OpenHelix. No use or reproduction without express written consent1.
The European Bioinformatics Institute MGED ontology for consistent annotation of microarray experiments Manchester Bioinformatics Week Ontologies Workshop1.
Abstract BarleyBase is a USDA-funded public repository for plant microarray data. BarleyBase houses raw and normalized expression data from the 22K Affymetrix.
GENOME-CENTRIC DATABASES Daniel Svozil. NCBI Gene Search for DUT gene in human.
Management Information Systems MS Access MS Access is an application software that facilitates us to create Database Management Systems (DBMS)
OARE Module 5A: Scopus (Elsevier). Table of Contents About Scopus (Elsevier) Using Scopus Search Page Results/Refine Search Pages Download, PDF, Export,
Copyright OpenHelix. No use or reproduction without express written consent1.
1 maxdLoad The maxd website: © 2002 Norman Morrison for Manchester Bioinformatics.
MS Access 2007 Management Information Systems 1. Overview 2  What is MS Access?  Access Terminology  Access Window  Database Window  Create New Database.
ITGS Databases.
© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice SISP 6.1 Delta Training Documentation.
PROGNOCHIP-BASE, FORTH-ICS 1 PrognoChip-BASE: An Information System for the Management of Spotted DNA MicroArray Experiments Extension of BASE v
Protein Data Bank: An Introduction Learning to Use the RCSB PDB Portal.
Analysis of GEO datasets using GEO2R Parthav Jailwala CCR Collaborative Bioinformatics Resource CCR/NCI/NIH.
Gene Expression Omnibus (GEO)
This tutorial will describe how to navigate the section of Gramene that provides descriptions of alleles associated with morphological, developmental,
1 Outline Standardization - necessary components –what information should be exchanged –how the information should be exchanged –common terms (ontologies)
Master headline RDFizing the EBI Gene Expression Atlas James Malone, Electra Tapanari
ArrayExpress and Expression Atlas: Mining Functional Genomics data Dr Sarah Morgan Training team
Applied Bioinformatics Week 9 Jens Allmer. Theory I Gene Expression Microarray.
CaArray User Community Meeting Feature Overview and Review of MAGE-TAB Update and Export Specification Call in: Participant Passcode:
Introduction and Applications of Microarray Databases Chen-hsiung Chan Department of Computer Science and Information Engineering National Taiwan University.
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Microsoft Office 2013 Try It! Chapter 4 Storing Data in Access.
21 Copyright © 2009, Oracle. All rights reserved. Working with Oracle Business Intelligence Answers.
Welcome to the GrameneMart Tutorial A tool for batch data sequence retrieval 1.Select a Gramene dataset to search against. 2.Add filters to the dataset.
Describing and Annotating Experimental Data: Hands On.
Bioinformatics Shared Resource Introduction to Gene Expression Omnibus (GEO) bsrweb.sanfordburnham.org
ArrayExpress Ugis Sarkans EMBL - EBI
MESA A Simple Microarray Data Management Server. General MESA is a prototype web-based database solution for the massive amounts of initial data generated.
GEO (Gene Expression Omnibus) Deepak Sambhara Georgia Institute of Technology 21 June, 2006.
Using ArrayExpress.
ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
How to store and visualize RNA-seq data
PubMed Database Interface (Basic Course Module 4 Part A)
Gene Expression Omnibus (GEO)
Welcome to the Quantitative Trait Loci (QTL) Tutorial
Welcome to the Markers Database Tutorial
Presentation transcript:

ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL

ArrayExpress2 Talk structure  Why do we need a database for functional genomics data?  ArrayExpress database Archive Gene Expression Atlas  Database content  Query the database  Data download  Data submission

Components of a functional genomics experiment Array design information Location of each element Description of each element Hybridization protocol Quantification matrix Software specifications Sample source Sample treatments RNA extraction protocol Labelling protocol Control array elements Normalization method Image Scanning protocol Software specifications Sample source Sample treatments Template preparation Library preparation Cluster amplification Sequencing and imaging From images to sequences Quality Control Sequence alignment Assembly Specific steps depending on the application Sample Library Chip Data analysis Array Normalized data Raw data Sample Data analysis

ArrayExpress  Is a public repository for functional genomics data, mostly generated using microarray or high throughput sequencing (HTS) assays  Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ  Provides easy access to well annotated microarray data in a structured and standardized format  Facilitates the sharing of microarray designs and experimental protocols  Based on FGED standards: MIAME checklist, MAGE-TAB format and MO Ontology.  MINSEQE checklist for HTS data ( ArrayExpress4

Reporting standards for microarrays MIAME checklist  Minimal Information About a Microarray Experiment  The 6 most critical elements contributing towards MIAME are: 1.Essential sample annotation including experimental factors and their values (e.g. compound and dose) 2.Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ….) 3.Sufficient array annotation (e.g. gene identifiers, genomic coordinates, probe sequences or array catalog number) 4.Essential laboratory and data processing protocols (e.g. normalization method used) 5.Raw data for each hybridization (e.g. CEL or GPR files) 6.Final normalized data for the set of hybridizations in the experiment ArrayExpress5

Reporting standards for sequencing MINSEQE checklist  Minimal Information about a high-throughput Nucleotide SEQuencing Experiment  The proposed guidelines for MINSEQE are (still work in progress): 1.General information about the experiment 2.Essential sample annotation including experimental factors and their values (e.g. compound and dose) 3.Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ….) 4.Essential experimental and data processing protocols 5.Sequence read data with quality scores, raw intensities and processing parameters for the instrument 6.Final processed data for the set of assays in the experiment ArrayExpress6

MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment: IDFInvestigation Description Format file, contains top-level information about the experiment including title, description, submitter contact details and protocols. SDRFSample and Data Relationship Format file contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter. ADFArray Design Format file, describes the design of an array, i. e. the sequence located at each feature on the array and annotation of the sequences. Data filesRaw and processed data files. The ‘raw’ data files are the files produced by the microarray image analysis software, such as CEL files for Affymetrix or GPR files from GenePix. The processed data file is a ‘data matrix’ file containing processed values, as provided by the data submitter. 7 Reporting standards for microarrays MAGE-TAB format ArrayExpress

Reporting standards What semantics (or ontology) should we use to best describe its annotation?  Ontology, which is a formal specification of terms in a particular subject area and the relations among them.  Its purpose is to provide a basic, stable and unambiguous description of such terms and relations in order to avoid improper and inconsistent use of the terminology pertaining to a given domain.  Thus far, Gene Ontology (GO) has been the most successful ontology initiative. GO is a controlled vocabulary used to describe the biology of a gene product in any organism. ArrayExpress8

Reporting standards for microarrays MGED ontology (MO)  The MO provides terms for annotating all aspects of a microarray experiment from the design of the experiment and array layout, through to the preparation of the biological sample and the protocols used to hybridize the RNA and analyze the data  The MO was developed to provide terms for annotating experiments in line with the MIAME guidelines, i.e. to provide the semantics to describe a microarray experiment according to the concepts specified in MIAME  Also check Open Biomedical Ontologies (OBO) initiative ( for the development of life-science ontologies ArrayExpress9

10 ArrayExpress – two databases

ArrayExpress11 How to query AE and Atlas? AE Archive Query by experiment, sample and experimental factor annotations Filter on species, array platform, molecule assayed and technology used Gene Expression Atlas Gene and/or condition queries Query across experiments and across platforms

ArrayExpress – two databases ArrayExpress12

How much data in AE Archive? ArrayExpress13

ArrayExpress14 Archive by species

ArrayExpress15 Browsing the AE Archive

The direct link to raw and processed data. An icon indicates that this type of data is available. The total number of experiments and assay retrieved Species investigated Curated title of experiment The date when the data were loaded in the Archive AE unique experiment ID Number of assays The list of experiments retrieved can be printed, saved as Tab- delimited format or exported to Excel or as RSS feed loaded in Atlas flag ArrayExpress16 Raw sequencing data available in ENA

ArrayExpress17 Browsing the AE Archive

Experimental factor ontology (EFO)  Application focused ontology modeling experimental factors (EFs) in AE  Developed to: increase the richness of annotations that are currently made in AE Archive to promote consistency to facilitate automatic annotation and integrate external data  EFs are transformed into an ontological representation, forming classes and relationships between those classes  EFO terms map to multiple existing domain specific ontologies, such as the Disease Ontology and Cell Type Ontology ArrayExpress18

ArrayExpress & Atlas19 Experimental factor ontology (EFO) An example

Searching AE Archive Simple query - EFO ArrayExpress20

Searching AE Archive Simple query  Search across all fields: AE accession number e.g. E-MEXP-568 Secondary accession numbers e.g. GEO series accession GSE5389 Experiment name Submitter's experiment description Sample attributes, experimental factor and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression) Publication title, authors and journal name, PubMed ID  Synonyms for terms are always included in searches e.g. 'human' and 'Homo sapiens’ ArrayExpress21

AE Archive query output Matches to exact terms are highlighted in yellow Matches to synonyms are highlighted in green Matches to child terms in the EFO are highlighted in pink

AE Archive – experiment view ArrayExpress23

SamplesSample annotation Genes Gene expression levels or count level data Gene annotations How does processed data look? ArrayExpress24

AE Archive – SDRF file ArrayExpress25

SDRF file – sample & data relationship ArrayExpress26

AE Archive – ADF file ArrayExpress27

AE Archive – Old interface ArrayExpress28

AE Archive – all files ArrayExpress29

ArrayExpress30 AE Archive – all files

Searching AE Archive Advanced query  Combine search terms Enter two or more keywords in the search box with the operators AND, OR or NOT. AND is the default search term; a search for kidney cancer' will return hits with a match to ‘kidney' AND ‘cancer’ Search terms of more than one word must be entered inside quotes otherwise only the first word will be searched for. E.g. “kidney cancer”  Specify fields for searches Particular fields for searching can also be specified in the format of fieldname:value ArrayExpress31

Searching AE Archive Advanced query - fieldnames ArrayExpress32 Field nameSearchesExample accession Experiment primary or secondary accessionaccession:E-MEXP-568 array Array design accession or namearray:AFFY-2 OR array:Agilent* ef Experimental factor, the name of the main variables in an experiment. ef:celltype OR ef:compound efv Experimental factor value. Has EFO expansion.efv:fibroblast expdesign Experiment design typeexpdesign:”dose response” exptype Experiment type. Has EFO expansion.exptype:RNA-seq gxa Presence in the Gene Expression Atlas. Only value is gxa:true. ef:compound AND gxa:true pmid PubMed identifierpmid: sa Sample attribute values. Has EFO expansion.sa:wild_type species Species of the samples. Has EFO expansion.species:”homo sapiens” AND ef:cellline

Searching AE Archive Advanced query ArrayExpress33 FilterWhat is filtered assaycount:[x TO y]filter on the number of of assays where x <= y and both values are between 0 and 99,999 (inclusive). To count excluding the values given use curly brackets e.g. assaycount:{1 TO 5} will find experiments with 2-4 assays. Single numbers may also be given e.g. assaycount:10 will find experiments with 10 assays. efcount:[x TO y]filter on the number of experimental factors samplecount:[x TO y]filter on the number of samples sacount:[x TO y]filter on the number of sample attribute categories rawcount:[x TO y]filter on the number of raw files fgemcount:[x TO y]filter on the number of final gene expression matrix (processed data) files miamescore:[x TO y]filter on the MIAME compliance score (maximum score is 5) date:yyyy-mm-ddfilter by release date date: will search for experiments released on 1st of Dec 2009 date:2009* - will search for experiments released in 2009 date:[ ] - will search for experiments released between 1st of Jan and end of May 2008  Filtering experiments by counts of a particular attribute Experiments fulfilling certain count criteria can also be searched for e.g. having more than 10 assays (hybridizations)

ArrayExpress & Atlas34 Searching AE Archive Advanced query – an example

Exercise 1 ArrayExpress35

ArrayExpress36 ArrayExpress – two databases

ArrayExpress37  The criteria we use for selecting experiments for inclusion in the Atlas are as follows: Array designs relating to experiment must be provided to enable re-annotation using Ensembl or Uniprot (or have the potential for this to be done) High MIAME scores Experiment must have 6 or more hybridizations Sufficient replication and large sample size EF and EFV must be well annotated Adequate sample annotation must be provided Processed data must be provided or raw data which can be renormalized must be available Gene Expression Atlas Experiment selection criteria

ArrayExpress38  New meta-analytical tool for searching gene expression profiles across experiments in AE  Data is taken as normalized by the submitter  Gene-wise linear models (limma) and t-statistics are applied to calculate the strength of genes’ differential expression across conditions across experiments  The result is a two-dimensional matrix where rows correspond to genes and columns correspond to conditions, rather than samples.  The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression Gene Expression Atlas Atlas construction

ArrayExpress39 Gene Expression Atlas Atlas construction

up-regulated  down-regulated  no change

ArrayExpress41 Gene Expression Atlas

ArrayExpress42 Atlas home page Query for gene(s)Query for condition(s) Restrict search by direction of differential expression The ‘advanced search’ option allows building more complex queries

ArrayExpress43 Atlas home page The ‘Genes’ search box & auto-complete function

ArrayExpress44 Atlas home page The ‘Conditions’ search box & ontology browsing

ArrayExpress45 Atlas home page A single gene query

Atlas gene summary page ArrayExpress46

ArrayExpress47 Atlas experiment page Expression plot Table containing gene information and drop down menus for searching within the experiment Experimental factors list

ArrayExpress & Atlas48 Atlas experiment page – HTS data

ArrayExpress & Atlas49 Atlas home page A ‘Conditions’ only query

ArrayExpress50 Atlas heatmap view

Atlas list view Click the ‘expression profile’ link to view the experiment page ArrayExpress51

Atlas data download ArrayExpress52

ArrayExpress53 Atlas gene-condition query

ArrayExpress54 Atlas query refining

ArrayExpress55 Atlas gene-condition query

ArrayExpress56 Atlas query refining

ArrayExpress57 Atlas query refining

ArrayExpress58 Exercises 2, 3 & 4

ArrayExpress59 Data submission to AE

ArrayExpress60 Data submission to AE

Submission of HTS gene expression data Submit via MAGE-TAB submission route Submit: MAGE-TAB spreadsheet containing details of the samples and protocols used. Trace data files for each sample (in SRF, FASTQ or SFF format ) Processed data files For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA). If you have human identifiable sequencing data you need to submit to the The European Genome-phenome Archive and not ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely. ArrayExpress & Atlas61

Types of data that can be submitted ArrayExpress & Atlas62

What happens after submission? confirmation Curation The curation team will review your submission and will you with any questions. Possible reopening for editing We will send you an accession number when all the required information has been provided. We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public. ArrayExpress & Atlas63