Data Mining in Ensembl with BioMart Nov,
BioMart: Data mining. BioMart is a search engine that can find multiple terms and put them into a table format, such as mouse gene IDs, chromosome and base pair position. No programming required!
General or Specific Data-Tables: All the genes for one species. Or… only genes on one specific region of a chromosome. Or… genes on one region of a chromosome associated with an InterPro domain.
The First Step: Choose the Dataset. Dataset: Current Ensembl, Human genes.
The Second Step: Filters. Filters: Define a gene set.
Attributes attach information. Attributes: Determine output columns.
Results: Tables or sequences.
Query: For the human CFTR gene, can I export the EntrezGene ID, and also the probes that match this gene's sequence on the “Affy HG U133 Plus 2” microarray platform? In the query: Filters are what we know; Attributes are what we want to know (columns in the result table).
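The web interface needs no code, but the same filter/attribute split can be written down as a BioMart XML query, which is one way to script this export. The sketch below only builds the query and a request URL; it does not contact any server. The endpoint URL, the dataset name `hsapiens_gene_ensembl`, and the internal names `hgnc_symbol`, `ensembl_gene_id`, `entrezgene_id` and `affy_hg_u133_plus_2` are assumptions that can differ between Ensembl releases, and `build_query` is a hypothetical helper, not part of BioMart.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint; check it against the mart you actually use.
MARTSERVICE_URL = "https://www.ensembl.org/biomart/martservice"


def build_query(dataset, filters, attributes):
    """Build a BioMart XML query string.

    filters: dict of filter name -> value (what we know)
    attributes: list of attribute names (what we want to know)
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


# Dataset, filter and attribute names below are assumed internal names.
cftr_query = build_query(
    dataset="hsapiens_gene_ensembl",
    filters={"hgnc_symbol": "CFTR"},
    attributes=["ensembl_gene_id", "entrezgene_id", "affy_hg_u133_plus_2"],
)

# URL that would carry the query to the mart service (not requested here).
request_url = MARTSERVICE_URL + "?" + urlencode({"query": cftr_query})
print(cftr_query)
```

Keeping filters in a dict and attributes in a list mirrors the slide: one filter narrows the gene set, and each attribute becomes one output column.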
A Brief Example: Select Homo sapiens. Use the current Ensembl (archives are also available).
Select the genes with Filters: Click ‘Filters’. Expand the ‘REGION’ panel. Expand the ‘GENE’ panel to enter the gene ID(s).
Filters: Change this to HGNC symbol. Enter “CFTR” in the box. Click “Count” to see how many genes passed through your filters.
Attributes (Output Options): Click on ‘Attributes’. Expand the ‘GENE’ section.
Attributes (Output Options): Select “Description” and “Associated Gene Name”. Expand the ‘EXTERNAL’ panel for non-Ensembl IDs.
Attributes (Output): External IDs include EntrezGene IDs and also microarray probe IDs.
The Results Table - Preview: “Results” show Description, Name, EntrezGene and Probe matches from the Affy HG U133-Plus-2 platform. For the full result table, click “Go” or view “ALL” rows.
Full Result Table: Ensembl Gene and Transcript IDs, Description, Gene Name, EntrezGene ID, Affy HG probe.
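A gene with several matching probes appears on several rows of the exported table, one row per probe. As a rough sketch of how such an export could be tidied afterwards, the snippet below groups probe IDs by gene name. The column headers and every ID in `SAMPLE_TSV` are invented placeholders, not real BioMart output, and `probes_by_gene` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Placeholder export: headers and IDs are stand-ins, not real BioMart output.
SAMPLE_TSV = (
    "Gene stable ID\tGene name\tEntrezGene ID\tAffy HG probe\n"
    "GENE_A\tCFTR\tENTREZ_A\tPROBE_1\n"
    "GENE_A\tCFTR\tENTREZ_A\tPROBE_2\n"
    "GENE_A\tCFTR\tENTREZ_A\tPROBE_1\n"
)


def probes_by_gene(tsv_text, gene_column="Gene name", probe_column="Affy HG probe"):
    """Collect the distinct probe IDs listed for each gene, in first-seen order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        probe = row[probe_column]
        # Skip empty cells and probes already recorded for this gene.
        if probe and probe not in grouped[row[gene_column]]:
            grouped[row[gene_column]].append(probe)
    return dict(grouped)


grouped_probes = probes_by_gene(SAMPLE_TSV)
print(grouped_probes)
```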
Other Export Options (Attributes): Sequences (UTRs, flanking sequences, cDNA and peptides, etc.); Gene IDs from Ensembl and external sources (MGI, Entrez, etc.); Microarray data; Protein functions/descriptions (InterPro, GO); Orthologous gene sets; SNP/variation data.
BioMart Data Sets: Ensembl genes, Vega genes, Variations.
BioMart around the world… BioMart started at Ensembl… To where has it travelled?
Central Portal
WormBase
HapMap: Population frequencies, inter-population comparisons, gene annotation.
DictyBase
GRAMENE
The Potato Center
How to Get There Or click on ‘BioMart’ from Ensembl
The Flow: Choose Dataset (all genes for a species). Choose Filters (narrows the gene set). Choose Attributes (output options). Now try the worked example on page 23!
Ensembl Core Databases: Relational database. Normalised: each data point stored only once. Therefore: quick updates, minimal storage requirements. But: many tables, many joins for complicated queries, slow for data mining applications.
Normalised Schema. Table (gene_id, gene.symbol): 9970 SMAD1; 1712 SMAD2; 8240 SMAD3; 1967 SMAD4; … Table (gene_id, transcript): 9970 ENST ENST ENST ENST ENST … Table (gene_id, stable_id): 9970 ENSG ENSG ENSG ENSG …
BioMart Database: Data warehouse. De-normalised and query-optimised. Therefore: fast and flexible, ideal for data mining. But: tables with apparent “redundancy”; needs rebuilding from scratch from the normalised core databases for every release.
De-Normalised Schema. Table (gene_id, transcript_id, gene.symbol): ENSG ENST SMAD1; ENSG ENST SMAD2; ENSG ENST SMAD2; ENSG ENST SMAD3; ENSG ENST SMAD4; …
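The trade-off between the two schemas can be sketched with an in-memory SQLite database. This is only an illustration of the idea, not the actual Ensembl or BioMart schema: the table and column names are simplified assumptions, the gene IDs and symbols are taken from the slide, and the transcript IDs are placeholders rather than real ENST identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalised: each fact is stored once, in its own table.
cur.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT)")
cur.execute("CREATE TABLE transcript (gene_id INTEGER, transcript_id TEXT)")
cur.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [(9970, "SMAD1"), (1712, "SMAD2"), (8240, "SMAD3"), (1967, "SMAD4")],
)
# Transcript IDs here are placeholders, not real ENST identifiers.
cur.executemany(
    "INSERT INTO transcript VALUES (?, ?)",
    [(9970, "TX_1"), (1712, "TX_2"), (1712, "TX_3"), (8240, "TX_4"), (1967, "TX_5")],
)

# De-normalised: run the join once, at build time, and store the result.
cur.execute(
    """
    CREATE TABLE mart AS
    SELECT g.gene_id, t.transcript_id, g.symbol
    FROM gene AS g
    JOIN transcript AS t ON t.gene_id = g.gene_id
    """
)

# Query time: a single-table lookup; the symbol is repeated per transcript.
smad2_rows = cur.execute(
    "SELECT transcript_id, symbol FROM mart WHERE symbol = ? ORDER BY transcript_id",
    ("SMAD2",),
).fetchall()
print(smad2_rows)
```

The `mart` table repeats `SMAD2` once per transcript, which is the apparent “redundancy” of the de-normalised schema; in exchange, queries no longer need a join, and the table has to be rebuilt whenever the normalised tables change.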
Information Flow: DATASET (species, focus) → FILTER (region, SNP, protein, homology, gene, expression) → ATTRIBUTES (region, SNP, protein, homology, gene, expression; RefSeq, InterPro, GO, SwissProt, EMBL, Affymetrix) → output (FASTA, file, Excel, text, GTF, HTML).