Data Mining in Ensembl with BioMart (Nov 2009)
www.ensembl.org/biomart/martview
www.biomart.org/biomart/martview


BioMart: Data mining
BioMart is a search engine that can retrieve several kinds of information at once and return them as a table, for example mouse gene IDs together with their chromosome and base-pair position. No programming required!

General or Specific Data Tables
All the genes for one species
Or… only the genes in one specific region of a chromosome
Or… the genes in one region of a chromosome that are associated with an InterPro domain

The First Step: Choose the Dataset
Dataset: Current Ensembl, Human genes

The Second Step: Filters
Filters: Define a gene set

Attributes attach information
Attributes: Determine output columns

Results
Tables or sequences

Query: For the human CFTR gene, can I export the EntrezGene ID, and also the probes that match this gene's sequence on the “Affy HG U133 Plus 2” microarray platform?
In the query:
Filters: what we know
Attributes: what we want to know (columns in the result table)
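The same question can also be written down as a BioMart XML query, which may help make the filter/attribute split concrete. The following is a minimal sketch, not the method shown in the slides: the dataset name `hsapiens_gene_ensembl`, the filter name `hgnc_symbol`, the attribute names and the `martservice` endpoint are assumptions that should be checked against the BioMart release being used. Nothing is sent over the network; the code only builds the query text and a request URL.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed internal names; verify them in the BioMart interface before use.
DATASET = "hsapiens_gene_ensembl"
FILTERS = {"hgnc_symbol": "CFTR"}  # what we know
ATTRIBUTES = [                     # what we want to know (output columns)
    "ensembl_gene_id",
    "entrezgene",
    "affy_hg_u133_plus_2",
]


def build_query(dataset, filters, attributes):
    """Return a BioMart-style XML query as a string."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


query_xml = build_query(DATASET, FILTERS, ATTRIBUTES)

# Assumed endpoint: the martservice path next to martview on the same host.
request_url = "http://www.ensembl.org/biomart/martservice?" + urlencode({"query": query_xml})

print(query_xml)
```

Each filter becomes one `Filter` element and each output column one `Attribute` element, so the XML mirrors the choices made in the web interface.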

A Brief Example
Select Homo sapiens.
Use the current Ensembl (archives are also available).

Select the genes with Filters
Click ‘Filters’.
Expand the ‘GENE’ panel to enter the gene ID(s).
Expand the ‘REGION’ panel.

Filters
Set the ID type to HGNC symbol.
Enter “CFTR” in the box.
Click “Count” to see how many genes pass your filters.

Attributes (Output Options)
Click on ‘Attributes’.
Expand the ‘GENE’ section.

Attributes (Output Options)
Select “Description” and “Associated Gene Name”.
Expand the ‘EXTERNAL’ panel for non-Ensembl IDs.

Attributes (Output)
External IDs include EntrezGene IDs and also microarray probe IDs.

The Results Table - Preview
“Results” shows the Description, Name, EntrezGene ID and probe matches from the Affy HG U133 Plus 2 platform.
For the full result table: click “Go” or view “ALL” rows.

Full Result Table
Columns: Ensembl Gene and Transcript IDs, Description, Gene Name, EntrezGene ID, Affy HG probe

Other Export Options (Attributes)
• Sequences: UTRs, flanking sequences, cDNA and peptides, etc.
• Gene IDs from Ensembl and external sources (MGI, Entrez, etc.)
• Microarray data
• Protein functions/descriptions (InterPro, GO)
• Orthologous gene sets
• SNP/variation data

BioMart Data Sets
Ensembl genes
Vega genes
Variations

BioMart around the world… BioMart started at Ensembl… To where has it travelled?

Central Portal

WormBase

HapMap
Population frequencies
Inter-population comparisons
Gene annotation

DictyBase

GRAMENE

The Potato Center

How to Get There
Go to www.ensembl.org/biomart/martview
Or click on ‘BioMart’ from Ensembl

The Flow
Choose Dataset (all genes for a species)
Choose Filters (narrows the gene set)
Choose Attributes (output options)
Now try the worked example on page 23!
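The dataset, filters, attributes flow can be pictured as a small in-memory model. This is only an illustrative sketch: the rows, gene names and domain labels are hypothetical placeholders, and `run_query` is a made-up helper, not part of BioMart.

```python
# Hypothetical rows standing in for a gene dataset; the values are placeholders.
DATASET = [
    {"gene": "GENE_A", "chromosome": "7", "start": 1000, "domain": "DOMAIN_X"},
    {"gene": "GENE_B", "chromosome": "7", "start": 5000, "domain": "DOMAIN_Y"},
    {"gene": "GENE_C", "chromosome": "12", "start": 2000, "domain": "DOMAIN_X"},
]


def run_query(dataset, filters, attributes):
    """Keep the rows that pass every filter, then keep only the requested columns."""
    rows = [row for row in dataset if all(test(row) for test in filters)]
    return [{name: row[name] for name in attributes} for row in rows]


# Filters narrow the gene set: here, chromosome 7 and one protein domain.
filters = [
    lambda row: row["chromosome"] == "7",
    lambda row: row["domain"] == "DOMAIN_X",
]

# Attributes decide the output columns.
result = run_query(DATASET, filters, ["gene", "start"])
print(result)  # [{'gene': 'GENE_A', 'start': 1000}]
```

With an empty filter list the call returns every row of the dataset, which corresponds to "all genes for a species".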

Ensembl Core Databases
Relational database, normalised: each data point is stored only once.
Therefore: quick updates and minimal storage requirements.
But: many tables, many joins for complicated queries, and slow for data mining applications.

Normalised Schema

gene_id   gene.symbol
9970      SMAD1
1712      SMAD2
8240      SMAD3
1967      SMAD4
…         …

gene_id   transcript
9970      ENST
          ENST
          ENST
          ENST
          ENST
…         …

gene_id   stable_id
9970      ENSG
          ENSG
          ENSG
          ENSG
…         …

BioMart Database
Data warehouse: de-normalised and query-optimised.
Therefore: fast and flexible, ideal for data mining.
But: tables with apparent “redundancy”, and it needs rebuilding from scratch from the normalised core databases for every release.

De-Normalised Schema

gene_id   transcript_id   gene.symbol
ENSG      ENST            SMAD1
ENSG      ENST            SMAD2
ENSG      ENST            SMAD2
ENSG      ENST            SMAD3
ENSG      ENST            SMAD4
…         …               …
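The trade-off between the normalised and de-normalised layouts can be tried out with an in-memory SQLite database. This is a rough sketch under stated assumptions: the table names are invented for illustration and do not reflect the real Ensembl or BioMart schemas, and the `ENSG_n` / `ENST_n` identifiers are placeholders, since the slides do not give the full IDs.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Normalised layout: one table per kind of fact, linked by the internal gene_id.
cur.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT)")
cur.execute("CREATE TABLE gene_stable_id (gene_id INTEGER, stable_id TEXT)")
cur.execute("CREATE TABLE transcript (gene_id INTEGER, transcript_id TEXT)")

cur.executemany("INSERT INTO gene VALUES (?, ?)",
                [(9970, "SMAD1"), (1712, "SMAD2"), (8240, "SMAD3"), (1967, "SMAD4")])
# Placeholder identifiers, not real Ensembl stable IDs.
cur.executemany("INSERT INTO gene_stable_id VALUES (?, ?)",
                [(9970, "ENSG_1"), (1712, "ENSG_2"), (8240, "ENSG_3"), (1967, "ENSG_4")])
cur.executemany("INSERT INTO transcript VALUES (?, ?)",
                [(9970, "ENST_1"), (1712, "ENST_2"), (1712, "ENST_3"),
                 (8240, "ENST_4"), (1967, "ENST_5")])

# De-normalised layout: one wide table built once from the normalised tables.
cur.execute("""
    CREATE TABLE mart AS
    SELECT s.stable_id AS gene_id, t.transcript_id, g.symbol
    FROM gene g
    JOIN gene_stable_id s ON s.gene_id = g.gene_id
    JOIN transcript t ON t.gene_id = g.gene_id
""")

# Normalised query: needs a join to connect a symbol to its transcripts.
normalised = cur.execute("""
    SELECT t.transcript_id
    FROM gene g
    JOIN transcript t ON t.gene_id = g.gene_id
    WHERE g.symbol = 'SMAD2'
    ORDER BY t.transcript_id
""").fetchall()

# De-normalised query: a single-table lookup, at the cost of repeated values.
flat = cur.execute(
    "SELECT transcript_id FROM mart WHERE symbol = 'SMAD2' ORDER BY transcript_id"
).fetchall()

mart_rows = cur.execute("SELECT COUNT(*) FROM mart").fetchone()[0]
print(normalised == flat, flat, mart_rows)
```

Both queries return the same transcripts. The wide table repeats the gene symbol and stable ID on every transcript row, which is the apparent "redundancy" mentioned above, and it has to be rebuilt whenever the normalised tables change.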

Information Flow
DATASET: species, focus
FILTER: region, SNP, protein, homology, gene, expression
ATTRIBUTES: region, SNP, protein, homology, gene, expression; RefSeq, InterPro, GO, SwissProt, EMBL, Affymetrix
Output: FASTA, file, Excel, text, GTF, HTML