Published by Brendan Hopkins. Modified over 9 years ago.
1
BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005
2
BioMart
A joint project
– European Bioinformatics Institute (EBI)
– Cold Spring Harbor Laboratory (CSHL)
Aim
– To develop a simple and scalable data management system capable of integrating distributed data sources
3
Challenges
Data sources
– Large
– Distributed
– Different data
4
Requirements
User
– All data accessible through a single set of interfaces
– Suitable for power biologists and bioinformaticians
Deployer
– ‘Out of the box’ installation
– Built-in query optimization
– Easy data federation
Architecture
– Distributed
– Domain agnostic
– Platform independent
5
Query Engine Federated architecture
6
BioMart Data mart User interfaces Data sources
7
Data mart and dataset Dataset
8
Data mart, dataset and schema Schema
9
Dataset Configuration XML
10
BioMart abstractions
Dataset
– A subset of data organized into one or more tables
Attribute
– A single data point
– e.g. gene name
Filter
– An operation on an attribute
– e.g. ‘Chromosome = 1’
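As a rough illustration of the three abstractions above, the following Python sketch models a dataset as rows held in memory, an attribute as a column name, and a filter as a comparison on an attribute. The class names, the operator table and the sample gene rows are all invented for this example; none of it is BioMart's own API.

```python
import operator
from dataclasses import dataclass, field

# Hypothetical operator table: maps a filter symbol to a comparison function.
OPERATORS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


@dataclass(frozen=True)
class Filter:
    """An operation on an attribute, e.g. chromosome = '1'."""
    attribute: str
    op: str
    value: object

    def matches(self, row):
        return OPERATORS[self.op](row[self.attribute], self.value)


@dataclass
class Dataset:
    """A subset of data; here simply a list of dict rows (illustrative only)."""
    name: str
    rows: list = field(default_factory=list)

    def query(self, attributes, filters=()):
        # Keep rows that satisfy every filter, then project the attributes.
        return [
            tuple(row[a] for a in attributes)
            for row in self.rows
            if all(f.matches(row) for f in filters)
        ]


# Made-up sample data.
genes = Dataset("gene", rows=[
    {"gene_name": "geneA", "chromosome": "1", "gene_start": 100},
    {"gene_name": "geneB", "chromosome": "2", "gene_start": 200},
    {"gene_name": "geneC", "chromosome": "1", "gene_start": 300},
])

result = genes.query(["gene_name"], [Filter("chromosome", "=", "1")])
print(result)
```

Attributes select what comes back and filters restrict which rows qualify, which is the same split the slide describes.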
11
Datasets, Attributes and Filters
Table GENE: gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description
Figure labels: Mart, Dataset, Attribute, Filter
12
Examples
– Upstream sequences for all kinases up-regulated in brain and associated with a QTL for a neurological disorder
– Name, chromosome position and description of all genes located on chromosome 1, expressed in lung, associated with human homologues and non-synonymous SNP changes
13
Data model
Figure labels: FK, PK
14
Data model
Figure labels: FK, PK, FK, PK
15
Data model
Figure labels: FK, PK, FK
16
Data model - ‘reversed star’
Figure labels: main tables main1 (PK1) and main2 (PK2, PK1); dimension (dm) tables carrying FK1 or FK2
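One way to picture the ‘reversed star’ layout is a main table that owns the primary key, with dimension tables pointing back at it through a foreign key. The sketch below builds such a pair in an in-memory SQLite database. The table names, column names and rows are hypothetical, and SQLite is used only because it ships with Python; the slides name MySQL, Oracle and Postgres as the supported platforms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one main table (holds the PK) and one dimension
# table (holds an FK back to the main table).
conn.executescript("""
CREATE TABLE mygenes__gene__main (
    gene_id_key INTEGER PRIMARY KEY,
    gene_name   TEXT,
    chromosome  TEXT
);
CREATE TABLE mygenes__xref__dm (
    gene_id_key INTEGER REFERENCES mygenes__gene__main (gene_id_key),
    accession   TEXT
);
INSERT INTO mygenes__gene__main VALUES (1, 'geneA', '1'), (2, 'geneB', '2');
INSERT INTO mygenes__xref__dm VALUES (1, 'ACC001'), (1, 'ACC002'), (2, 'ACC003');
""")

# Filter on the main table, pull an attribute from the dimension table.
rows = conn.execute("""
SELECT m.gene_name, d.accession
FROM mygenes__gene__main AS m
JOIN mygenes__xref__dm AS d ON d.gene_id_key = m.gene_id_key
WHERE m.chromosome = '1'
ORDER BY d.accession
""").fetchall()
print(rows)
```

Because every dimension table joins to the main table on the same key, a query touching any dimension needs only one join per dimension.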
17
Dataset
Fixed schema transformation
Figure labels: A, B, C, TA, TB
18
BioMart abstractions
Link
– ‘Common currency’ between two datasets
– e.g. accession
Exportable
– Potential links to export
Importable
– Potential links to import
19
Exportables, Importables and Links Dataset 1 Dataset 2 Links
20
Exportables, Importables and Links
Dataset 1, Exportable: name = uniprot_id, attributes = uniprot_ac
Dataset 2, Importable: name = uniprot_id, filters = uniprot_ac
Links
21
Exportables, Importables and Links
Dataset 1, Exportable: name = genomic_region, attributes = chr_name, chr_start, chr_end
Dataset 2, Importable: name = genomic_region, filters = chr_name (=), chr_start (>=), chr_end (<=)
Links
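The genomic_region pairing can be read as: the exportable names which attribute values leave Dataset 1, and the importable names which filters receive them in Dataset 2, in the same order. The Python sketch below follows that reading. The dictionary layout, the `link` function and the sample rows are assumptions made for illustration and do not reflect how BioMart stores or resolves links.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}

# Hypothetical encodings of the exportable and importable on the slide.
exportable = {
    "name": "genomic_region",
    "attributes": ["chr_name", "chr_start", "chr_end"],
}
importable = {
    "name": "genomic_region",
    "filters": [("chr_name", "="), ("chr_start", ">="), ("chr_end", "<=")],
}


def link(source_rows, target_rows, exportable, importable):
    """Pair each source row with the target rows that pass the importable
    filters when fed the exported attribute values, position by position."""
    if exportable["name"] != importable["name"]:
        raise ValueError("exportable and importable names must match")
    linked = []
    for src in source_rows:
        exported = [src[a] for a in exportable["attributes"]]
        for tgt in target_rows:
            if all(
                OPS[op](tgt[col], val)
                for (col, op), val in zip(importable["filters"], exported)
            ):
                linked.append((src, tgt))
    return linked


# Made-up rows: one gene region (Dataset 1) and three SNPs (Dataset 2).
genes = [{"gene": "geneA", "chr_name": "1", "chr_start": 100, "chr_end": 200}]
snps = [
    {"snp": "snp1", "chr_name": "1", "chr_start": 150, "chr_end": 150},
    {"snp": "snp2", "chr_name": "1", "chr_start": 250, "chr_end": 250},
    {"snp": "snp3", "chr_name": "2", "chr_start": 150, "chr_end": 150},
]

pairs = [(g["gene"], s["snp"]) for g, s in link(genes, snps, exportable, importable)]
print(pairs)
```

Only snp1 sits on the same chromosome and inside the exported region, so it is the only SNP linked to geneA.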
22
Building BioMart databases
– Transformation: source databases to mart (MartBuilder)
– Configuration: XML (MartEditor)
24
Table naming convention
Naïve configuration
Tables
– Meta tables: meta_content
– Data tables: dataset__content__type
Data tables
– Main: __main
– Dimension: __dm
Columns
– Key: _key
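To make the convention concrete, here is a small Python sketch that splits a table name into its parts. It assumes one reading of the slide: data tables have three parts separated by double underscores and end in main or dm, meta tables start with meta_, and key columns end in _key. The function names and the example table names are hypothetical.

```python
def parse_table_name(table):
    """Classify a table name under the assumed naming convention."""
    if table.startswith("meta_"):
        return {"kind": "meta", "name": table}
    parts = table.split("__")
    if len(parts) != 3 or parts[2] not in ("main", "dm"):
        raise ValueError(f"not a dataset__content__type table name: {table!r}")
    dataset, content, table_type = parts
    return {"kind": "data", "dataset": dataset, "content": content, "type": table_type}


def is_key_column(column):
    """Key columns are assumed to carry the _key suffix."""
    return column.endswith("_key")


# Hypothetical names, used only to exercise the parser.
main_table = parse_table_name("mygenes__gene__main")
dimension_table = parse_table_name("mygenes__xref__dm")
meta_table = parse_table_name("meta_content")
print(main_table, dimension_table, meta_table)
```

A convention this regular is what allows a ‘naïve’ configuration: the roles of tables and columns can be inferred from names alone.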
25
BioMart architecture
Databases: public data, local or remote (SNP, Vega, Ensembl, UniProt, MSD), and myDatabase
Schema transformation: MartBuilder
Configuration: XML, MartEditor
Mart: myMart
Retrieval: BioMart API (Java, Perl); MartExplorer, MartShell, MartView
26
MartView
27
MartExplorer
28
MartShell
Using = dataset
Get = attribute
Where = filter
29
Mart Query Language (MQL)
● MQL syntax: using <dataset> get <attribute> where <filter>
● Can join datasets together:
using Dataset1 get Attribute1 where Filter1=var1 as q;
using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
● Can script and pipe:
martshell.sh -E MQLscript.mql > results.txt
martshell.sh -E MQLscript.mql | wc
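For scripted use, MQL statements like the ones above can be assembled as plain strings before being written to a script file. The Python sketch below reproduces the two-dataset example from the slide. The `mql` helper is hypothetical, it only covers the forms shown on the slide (one attribute, filters joined by `and`, an optional `as` alias), and it does not run MartShell.

```python
def mql(dataset, attribute, filters, alias=None):
    """Compose one MQL statement of the form shown on the slide."""
    query = f"using {dataset} get {attribute} where {' and '.join(filters)}"
    return f"{query} as {alias}" if alias else query


# First query is stored under the alias q; the second filters on its result.
first = mql("Dataset1", "Attribute1", ["Filter1=var1"], alias="q")
second = mql("Dataset2", "Attribute2", ["Filter2=var2", "filter3 in q"])

script = f"{first}; {second}"
print(script)
```

The resulting string could be saved to a file such as MQLscript.mql and passed to martshell.sh with the -E option, as on the slide.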
30
Third party software
Bioconductor (biomaRt)
– BioMart schema
Taverna
– BioMart Java library
DAS ProServer
– BioMart Perl library
31
biomaRt
32
Taverna
33
ProServer
No programming
DAS requests and responses defined by Exportables and Importables and configured by MartEditor
DAS1
34
BioMart deployers
– Large scale data federation (EBI)
– Optimising access to a large database (Ensembl, WormBase)
– Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)
35
Hinxton example
Figure labels: EBI, Uniprot, MSD, SANGER, Ensembl, SNP, Vega, Sequence, WWW
36
BioMart deployers
– Large scale data federation (Hinxton)
– Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
– Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)
37
WormBase
38
Ensembl
39
ArrayExpress
40
BioMart deployers
– Large scale data federation (Hinxton)
– Optimising access to a large database (Ensembl, WormBase)
– Federating user data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc.)
41
Genetics of Infectious and Autoimmune Diseases, Pasteur Institute, INSERM U730, Paris, France
GMIA_SNP_mart_database
Sources: dbSNP, HapMap, Ensembl, RefSeq, AceView, Vega
– Give me frequency data from dbSNP
– Give me genotype and frequency data from HapMap
– Give me SNP locations on gene/transcript
– Give me frequency, genotype, location on gene/transcript from dbSNP, HapMap, Ensembl, RefSeq, AceView and Vega
Interfaces: Java graphical user interface, WWW web browser
Example rows:
SNP1 T/A AL13929 963253 1
SNP2 C/T AL13929 963255 -1
SNP3 C/G AL13929 963258 1
42
… what next ?
43
BioMart model
Already applied
– Ensembl
– Vega
– SNP
– Uniprot
– MSD
– ArrayExpress
– WormBase
– Variety of ‘in house’ projects
In development
– HapMap
44
Summary
BioMart interface
– Batch queries
– ‘Data mining’
– Large annotation
BioMart software
– Set up your own database
– Make your database scalable and responsive
– Federate with other data
45
Where are we?
0.2 released in February
0.3 to be released in June
– Platforms: MySQL, Oracle, Postgres
46
Acknowledgments
BioMart
– Damian Smedley (EBI)
– Darin London (EBI)
– Will Spooner (CSHL)
Contributors
– Arne Stabenau (Ensembl)
– Andreas Kahari (Ensembl)
– Craig Melsopp (Ensembl)
– Katerina Tzouvara (Uniprot)
– Paul Donlon (Unilever)