Published by Jefferson Mothershead. Modified over 10 years ago.
1
1 / 30 Data Mining with BioMart www.ensembl.org/biomart/martview www.biomart.org/biomart/martview
2
2 / 30 What is BioMart? A data export tool A quick table generator A web interface to mine Ensembl data
3
3 / 30 BioMart: Data mining BioMart is a search engine that can find multiple terms and put them into a table, such as mouse gene IDs, chromosome, and base pair position. No programming required!
4
4 / 30 General or Specific Data Tables All the genes for one species Or… only genes in one specific region of a chromosome Or… make BioMart select genes (e.g. all transcripts that match a microarray probe set, GO term, or InterPro domain).
5
5 / 30 Results Tables or sequences
6
6 / 30 The First Step: Choose the Dataset Dataset: Current Ensembl, Human genes
7
7 / 30 The Second Step: Filters Filters: Define a gene set
8
8 / 30 Attributes attach information Attributes: Determine output columns
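The dataset / filter / attribute split can be illustrated with a small in-memory sketch. This is not how BioMart is implemented; `run_query` is a hypothetical helper, and the records are the SMAD rows shown later in the de-normalised schema slide. Filters pick the rows (the gene set) and attributes pick the output columns.

```python
# Toy "dataset": rows taken from the de-normalised schema slide.
DATASET = [
    {"gene_id": "ENSG00000170365", "transcript_id": "ENST00000302085", "symbol": "SMAD1"},
    {"gene_id": "ENSG00000175387", "transcript_id": "ENST00000262160", "symbol": "SMAD2"},
    {"gene_id": "ENSG00000175387", "transcript_id": "ENST00000356825", "symbol": "SMAD2"},
    {"gene_id": "ENSG00000166949", "transcript_id": "ENST00000327367", "symbol": "SMAD3"},
    {"gene_id": "ENSG00000141646", "transcript_id": "ENST00000342988", "symbol": "SMAD4"},
]


def run_query(dataset, filters, attributes):
    """Hypothetical helper: filters select rows, attributes select columns."""
    selected = [
        row for row in dataset
        if all(row[name] == value for name, value in filters.items())
    ]
    return [tuple(row[name] for name in attributes) for row in selected]


# Filters: what we know (the gene symbol).
# Attributes: what we want to know (the transcript IDs).
smad2_transcripts = run_query(DATASET, {"symbol": "SMAD2"}, ["transcript_id"])
print(smad2_transcripts)
```

An empty filter set returns every row, which corresponds to the "all the genes for one species" case.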
9
9 / 30 Query For the human CFTR gene, export the Entrez Gene ID(s) and matching Affy HG U133-PLUS-2 probeset(s)
10
10 / 30 Query: For the human CFTR gene, export the Entrez Gene ID(s) and matching Affy HG U133-PLUS-2 probeset(s) In the query: Filters: what we know Attributes: what we want to know.
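Although the web interface needs no programming, the same query can be written down as a BioMart XML document. The sketch below only builds the XML and does not contact any server. The dataset name `hsapiens_gene_ensembl` and the internal names `hgnc_symbol`, `entrezgene_id` and `affy_hg_u133_plus_2` are assumptions that vary between Ensembl releases; check them against the release you use. `build_query` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Hypothetical helper: return a BioMart-style XML query as a string."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters: what we know.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Attributes: what we want to know.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


cftr_query = build_query(
    "hsapiens_gene_ensembl",                       # assumed dataset name
    {"hgnc_symbol": "CFTR"},                       # assumed filter name
    ["ensembl_gene_id", "entrezgene_id", "affy_hg_u133_plus_2"],  # assumed names
)
print(cftr_query)
```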
13
13 / 30 A Brief Example Use the current Ensembl (archives are also available) Select Homo sapiens genes
14
14 / 30 Select the Genes with Filters Click ‘Filters’. Expand the ‘GENE’ panel to enter the gene ID(s).
15
15 / 30 Filters (and Count) Change the ID type drop-down to HGNC curated name. Enter “CFTR” in the box. Click “Count” to see how many genes passed through your filters.
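For scripted use, a count can be requested in a similar way. The sketch below assumes that the BioMart `martservice` endpoint sits next to `martview` on the Ensembl site and that setting `count="1"` on the query asks for a count instead of rows; both are assumptions to verify, as are the dataset and filter names. The code only builds the request URL and sends nothing.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint: the slides only give the martview address.
MARTSERVICE = "http://www.ensembl.org/biomart/martservice"

# Assumed dataset, filter and attribute names; count="1" requests a count.
COUNT_QUERY = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<!DOCTYPE Query>'
    '<Query virtualSchemaName="default" count="1" datasetConfigVersion="0.6">'
    '<Dataset name="hsapiens_gene_ensembl" interface="default">'
    '<Filter name="hgnc_symbol" value="CFTR"/>'
    '<Attribute name="ensembl_gene_id"/>'
    '</Dataset></Query>'
)


def count_url(xml_query, base=MARTSERVICE):
    """Hypothetical helper: URL-encode the XML into a 'query' parameter."""
    return base + "?" + urlencode({"query": xml_query})


url = count_url(COUNT_QUERY)
print(url)
```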
16
16 / 30 Attributes (Output Options) Click on ‘Attributes’. ‘Attributes’ lets you choose which information to output.
17
17 / 30 Attributes (Output Options) Select ‘EntrezGene ID’
18
18 / 30 Attributes (Output Options) Select the Affy Platform ‘HG U133-PLUS-2’ in the ‘Microarray’ section
19
19 / 30 The Results Table - Preview For the full result table: click “Go” or View “ALL” rows.
20
20 / 30 Full Result Table Columns: Ensembl Gene ID for CFTR; Ensembl Transcript IDs; EntrezGene ID; Affy HG probeset
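An exported table usually has one row per transcript and probeset combination, so the same gene ID repeats. A minimal sketch of collapsing a tab-separated export into one probeset list per gene is shown below. The column headers and every value in `SAMPLE_TSV` are placeholders, not real BioMart output; `probesets_by_gene` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Placeholder export: headers and values are illustrative only.
SAMPLE_TSV = (
    "Ensembl Gene ID\tEnsembl Transcript ID\tEntrezGene ID\tAffy HG U133-PLUS-2\n"
    "GENE_X\tTRANSCRIPT_1\t0000\tPROBE_A\n"
    "GENE_X\tTRANSCRIPT_1\t0000\tPROBE_B\n"
    "GENE_X\tTRANSCRIPT_2\t0000\tPROBE_A\n"
)


def probesets_by_gene(tsv_text):
    """Group the distinct probesets found for each gene ID."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(set)
    for row in reader:
        probe = row["Affy HG U133-PLUS-2"]
        if probe:  # skip rows with no matching probeset
            grouped[row["Ensembl Gene ID"]].add(probe)
    return {gene: sorted(probes) for gene, probes in grouped.items()}


result = probesets_by_gene(SAMPLE_TSV)
print(result)
```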
21
21 / 30 Other Export Options (Attributes) Sequences: UTRs, flanking sequences, cDNA and peptides, etc.; Gene IDs from Ensembl and external sources (MGI, Entrez, etc.); Microarray data; Protein functions/descriptions (InterPro, GO); Orthologous gene sets; SNP/Variation data
22
22 / 30 BioMart around the world… BioMart started at Ensembl… To where has it travelled?
23
23 / 30 Central Portal www.biomart.org
24
24 / 30 WormBase
25
25 / 30 HapMap Population frequencies; Inter-population comparisons; Gene annotation
26
26 / 30 DictyBase
27
27 / 30 GRAMENE www.gramene.org
28
28 / 30 The Potato Center
29
29 / 30 How to Get There http://www.biomart.org/biomart/martview http://www.ensembl.org/biomart/martview Or click on ‘BioMart’ from Ensembl
30
30 / 30 Worked Example Follow the worked example on pg 26. Then do the exercises on pg 34 (answers on pg 37). This module should show you how to export multiple data types from Ensembl for gene IDs or chromosomal regions.
31
31 / 30 Ensembl Core Databases Relational database, normalised: each data point stored only once. Therefore: quick updates; minimal storage requirements. But: many tables; many joins for complicated queries; slow for data mining applications.
32
32 / 30 Normalised Schema
gene_id  gene.symbol
9970     SMAD1
1712     SMAD2
8240     SMAD3
1967     SMAD4
…        …

gene_id  transcript
9970     ENST00000302085
1712     ENST00000262160
1712     ENST00000356825
8240     ENST00000327367
1967     ENST00000342988
…        …

gene_id  stable_id
9970     ENSG00000170365
1712     ENSG00000175387
8240     ENSG00000166949
1967     ENSG00000141646
…        …
33
33 / 30 BioMart Database Data warehouse, de-normalised, query-optimised. Therefore: fast and flexible; ideal for data mining. But: tables with apparent “redundancy”; needs rebuilding from scratch from the normalised core databases for every release.
34
34 / 30 De-Normalised Schema
gene_id          transcript_id    gene.symbol
ENSG00000170365  ENST00000302085  SMAD1
ENSG00000175387  ENST00000262160  SMAD2
ENSG00000175387  ENST00000356825  SMAD2
ENSG00000166949  ENST00000327367  SMAD3
ENSG00000141646  ENST00000342988  SMAD4
…                …                …
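The relationship between the two schemas can be sketched with an in-memory SQLite database: three normalised tables hold the rows from the normalised schema slide, and a single join materialises the de-normalised "mart" table. The table and column names here (`gene_symbol`, `gene_stable_id`, `transcript`, `mart`) are illustrative and are not the real Ensembl or BioMart schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalised tables: each data point is stored only once.
conn.executescript("""
CREATE TABLE gene_symbol    (gene_id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE gene_stable_id (gene_id INTEGER PRIMARY KEY, stable_id TEXT);
CREATE TABLE transcript     (gene_id INTEGER, transcript_id TEXT);
""")
conn.executemany("INSERT INTO gene_symbol VALUES (?, ?)", [
    (9970, "SMAD1"), (1712, "SMAD2"), (8240, "SMAD3"), (1967, "SMAD4"),
])
conn.executemany("INSERT INTO gene_stable_id VALUES (?, ?)", [
    (9970, "ENSG00000170365"), (1712, "ENSG00000175387"),
    (8240, "ENSG00000166949"), (1967, "ENSG00000141646"),
])
conn.executemany("INSERT INTO transcript VALUES (?, ?)", [
    (9970, "ENST00000302085"), (1712, "ENST00000262160"),
    (1712, "ENST00000356825"), (8240, "ENST00000327367"),
    (1967, "ENST00000342988"),
])

# De-normalised table: the joins are done once, at build time,
# so later queries read a single wide table.
conn.execute("""
CREATE TABLE mart AS
SELECT s.stable_id AS gene_id, t.transcript_id, g.symbol
FROM transcript t
JOIN gene_stable_id s ON s.gene_id = t.gene_id
JOIN gene_symbol g    ON g.gene_id = t.gene_id
""")

rows = conn.execute(
    "SELECT gene_id, transcript_id, symbol FROM mart "
    "ORDER BY symbol, transcript_id"
).fetchall()
for row in rows:
    print(*row)
```

The SMAD2 gene ID and symbol now appear twice in `mart`, once per transcript, which is the apparent redundancy the slides mention.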
35
35 / 30 Information Flow [Diagram: DATASET (species) → FILTER (focus: region, SNP, protein, homology, gene, expression) → ATTRIBUTES (region, SNP, protein, homology, gene, expression; RefSeq, InterPro, GO, SwissProt, EMBL, Affymetrix) → output (FASTA, file, Excel, text, GTF, HTML)]