Data Mining with BioMart


1 Data Mining with BioMart

2 Simple and … Complex Queries
Genes within a candidate region
Gene products with a particular protein domain
Genomic location and description of all mouse and rat homologues of all human genes that have transmembrane domains, are expressed in the cardiovascular system and are associated with non-synonymous SNPs

3 Ensembl Core Database
Relational database
Normalised
  Each data point stored only once
Therefore:
  Quick updates
  Minimal storage requirements
But:
  Many tables
  Many joins for complicated queries
  Slow for data mining applications

4 Normalised Schema

gene_id  gene_stable_id
9970     ENSG00000170365
1712
8240     ENSG
1967     ENSG

gene_id  gene_symbol
9970     SMAD1
1712     SMAD2
8240     SMAD3
1967     SMAD4

gene_id  transcript_id
9970     ENST
1712     ENST
1712     ENST
8240     ENST
1967     ENST
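The cost of the normalised layout can be sketched with an in-memory SQLite database: rebuilding one "gene, transcript, symbol" view takes two joins. This is only an illustration, not the Ensembl core schema. The table names (`stable_ids`, `symbols`, `transcripts`) are hypothetical, and every identifier containing `PLACEHOLDER` is made up because the real IDs are truncated in the slide.

```python
import sqlite3

# In-memory database standing in for a normalised core database.
# Table names are hypothetical; they mirror the three tables on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stable_ids  (gene_id INTEGER PRIMARY KEY, gene_stable_id TEXT);
CREATE TABLE symbols     (gene_id INTEGER PRIMARY KEY, gene_symbol TEXT);
CREATE TABLE transcripts (gene_id INTEGER, transcript_id TEXT);
""")

# Only ENSG00000170365 comes from the slide; PLACEHOLDER values are invented.
conn.executemany("INSERT INTO stable_ids VALUES (?, ?)", [
    (9970, "ENSG00000170365"),
    (1712, "ENSG_PLACEHOLDER_2"),
    (8240, "ENSG_PLACEHOLDER_3"),
    (1967, "ENSG_PLACEHOLDER_4"),
])
conn.executemany("INSERT INTO symbols VALUES (?, ?)", [
    (9970, "SMAD1"), (1712, "SMAD2"), (8240, "SMAD3"), (1967, "SMAD4"),
])
conn.executemany("INSERT INTO transcripts VALUES (?, ?)", [
    (9970, "ENST_PLACEHOLDER_1"),
    (1712, "ENST_PLACEHOLDER_2A"),  # gene 1712 has two transcripts
    (1712, "ENST_PLACEHOLDER_2B"),
    (8240, "ENST_PLACEHOLDER_3"),
    (1967, "ENST_PLACEHOLDER_4"),
])

# Each fact is stored once, so a combined view needs one join per table.
rows = conn.execute("""
SELECT s.gene_stable_id, t.transcript_id, y.gene_symbol
FROM stable_ids AS s
JOIN transcripts AS t ON t.gene_id = s.gene_id
JOIN symbols     AS y ON y.gene_id = s.gene_id
ORDER BY y.gene_symbol, t.transcript_id
""").fetchall()

for row in rows:
    print(row)
```

With only three tables the joins are cheap; the slide's point is that real data mining questions touch many more tables, and each one adds a join.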

5 BioMart Database
Data warehouse
De-normalised
Query-optimised
  Tables with apparent “redundancy”
Therefore:
  Fast and flexible
  Ideal for data mining
Produced from normalised core databases at every new release

6 De-Normalised Schema

gene_stable_id  transcript_id  gene.symbol
ENSG            ENST           SMAD1
ENSG            ENST           SMAD2
                ENST
ENSG            ENST           SMAD3
ENSG            ENST           SMAD4
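A minimal sketch of the de-normalisation step, again using SQLite rather than the actual BioMart build process: the join is run once to materialise a wide table, and later queries read that table without any join. The table names (`gene`, `transcript`, `gene_mart`) are hypothetical, and the `PLACEHOLDER` identifiers are invented because the slide's IDs are truncated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalised source tables (hypothetical names, placeholder IDs).
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, stable_id TEXT, symbol TEXT);
CREATE TABLE transcript (gene_id INTEGER, transcript_id TEXT);

INSERT INTO gene VALUES
    (9970, 'ENSG00000170365',    'SMAD1'),
    (1712, 'ENSG_PLACEHOLDER_2', 'SMAD2');
INSERT INTO transcript VALUES
    (9970, 'ENST_PLACEHOLDER_1'),
    (1712, 'ENST_PLACEHOLDER_2A'),
    (1712, 'ENST_PLACEHOLDER_2B');

-- Pay for the join once, at build time; gene columns are repeated
-- on every transcript row (the "apparent redundancy").
CREATE TABLE gene_mart AS
SELECT g.stable_id AS gene_stable_id,
       t.transcript_id,
       g.symbol    AS gene_symbol
FROM gene AS g
JOIN transcript AS t ON t.gene_id = g.gene_id;
""")

# Query time: a single-table lookup, no joins.
rows = conn.execute(
    "SELECT gene_stable_id, transcript_id FROM gene_mart "
    "WHERE gene_symbol = ? ORDER BY transcript_id",
    ("SMAD2",),
).fetchall()
print(rows)
```

The trade-off matches the slides: the wide table must be rebuilt whenever the source data changes, which is why the mart is regenerated from the core databases at each release.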

7 BioMart
Developed jointly by the European Bioinformatics Institute (EBI) and Cold Spring Harbor Laboratory (CSHL)
Publicly available implementations at:
  Ensembl Central Server
  Dictybase
  Wormbase (WormMart)
  Gramene (GrameneMart)
  euGenes
  HapMap (HapMart)
  ZF-Models

8 BioMart

9 Data Sets
Primary
  Ensembl Genes
  Vega Genes
  SNPs
Secondary
  Markers
  “Diseases”
  Gene ontology
  Gene expression information
  Homology predictions
  Protein annotation

10 Information Flow
START: DATABASE, SPECIES
FILTER: REGION, SNP, PROTEIN, HOMOLOGY, GENE, EXPRESSION
External references: RefSeq, InterPro, GO, Swiss-Prot, EMBL, Affymetrix
OUTPUT: REGION, SNP, PROTEIN, HOMOLOGY, GENE, EXPRESSION
Output formats: FASTA, FILE, EXCEL, TEXT, GTF, HTML

11 BioMart Example
Find all Ensembl genes on the short arm of human chromosome 1 which are known to be associated with a disease
Export the 100 bp upstream of the transcripts of the above genes

12
1. Select “Ensembl 38”
2. Select “Homo sapiens genes (NCBI36)”
3. Click “next”

13
4. Select “Chromosome 1”
5. Select “Band Start p36.33 – End p11.1”
6. Select “with Disease Association Only”
7. Click “next”

14
8. Select Attribute Page “Features”
9. Select “Ensembl Gene ID” and “Ensembl Transcript ID”
Summary of actions

15
10. Select “Disease OMIM ID” and “Disease description”
11. Select Output format “MS Excel”
12. Click “export”
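Steps 1 to 12 can also be written down as a query document instead of being clicked through. The sketch below builds a request in the XML query format that later BioMart releases accept programmatically; it only constructs the XML and sends nothing. `build_query` is a hypothetical helper. The dataset name `hsapiens_gene_ensembl` and the filter and attribute names are assumptions that may differ between releases, particularly `band_start` and `band_end`. The disease-association filter and the OMIM attributes are left out because their internal names are not given in the slides.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Return a BioMart-style XML query as a string (hypothetical helper)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",      # tab-separated output
        "header": "1",           # include a header row
        "uniqueRows": "1",       # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Filter and attribute names below are assumptions, not taken from the slides.
xml_query = build_query(
    dataset="hsapiens_gene_ensembl",
    filters={
        "chromosome_name": "1",   # step 4
        "band_start": "p36.33",   # step 5 (assumed filter name)
        "band_end": "p11.1",      # step 5 (assumed filter name)
        # step 6 (disease association) omitted: filter name not known here
    },
    attributes=["ensembl_gene_id", "ensembl_transcript_id"],  # step 9
)
print(xml_query)
```

Keeping the query as data makes it easy to rerun against a new release or to swap the region, which is tedious to do through the web form.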

16

17
13. Select Attribute Page “Sequences”
14. Select “Flank (Transcript)”
15. Enter “Upstream flank 100”
16. Select Header information
17. Click “export”

18

19 There are other ways…
MartShell
  Command line interface to Mart written in Java
  Mart Query Language

20 What about queries not possible to do in BioMart?
MySQL queries on ensembldb.ensembl.org
  MySQL client
Perl API
  BioPerl and Ensembl modules
Java API

21 Q & A
QUESTIONS
ANSWERS

22 Exercises
«The range and complexity of the questions you can address through the Ensembl MartView resource is truly impressive. We really encourage you to spend some time playing with it …»

