Published by Myron Conley. Modified over 9 years ago.
1
BioMart
Databases made easy
Richard Holland
European Bioinformatics Institute
Helsinki, September 2006
2
BioMart
A joint project
–European Bioinformatics Institute (EBI)
–Cold Spring Harbor Laboratory (CSHL)
Aim
–To develop a generic, query-oriented data management system capable of integrating distributed data sources.
3
Focus
‘Data mining’ or advanced search
–Creating custom datasets
–Querying multiple datasets
–Interactive
Users
–People who provide database-based services
–‘Power user’ biologists and bioinformaticians
4
Requirements
User
–‘One-stop shop’ for biological data
–Suitable for power biologists and bioinformaticians
–A set of interfaces that allow users to group and refine biological data based on many criteria
Deployer
–‘Out of the box’ installation
–Built-in query optimization
–Easy data federation
Architecture
–Domain agnostic
–Distributed
–Platform independent
5
Advanced search GUIs
6
Single interface
7
Single access point
8
Queries across different databases
[Diagram: Dataset 1 and Dataset 2 connected by Links]
9
Main features
–Domain agnostic
–Platform independent (MySQL, Oracle, Postgres)
–Scalable for big datasets
–Federated architecture
–Automated UI configuration
10
How does it work?
11
BioMart
[Diagram: source data, data mart, XML metadata, BioMart software]
12
Federated architecture
[Diagram: Query Engine]
13
Data model
[Diagram: tables related through a primary key (PK) and a foreign key (FK)]
14
Data model
[Diagram: tables related through a primary key (PK) and foreign keys (FK)]
15
Data model - ‘reversed star’
[Diagram: main tables with primary keys PK1 and PK2, surrounded by dimension (dm) tables that carry the foreign keys FK1 and FK2]
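The ‘reversed star’ layout can be sketched with an in-memory SQLite database: one main table holding a primary key, and a dimension table that repeats that key as a foreign key. This is only an illustration. The table names, column names and rows are assumptions, and SQLite stands in for the MySQL, Oracle or Postgres servers named earlier in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Main table: one row per central object (hypothetical gene records).
cur.execute("CREATE TABLE gene_main (gene_id_key INTEGER PRIMARY KEY, name TEXT)")
# Dimension table: many rows per main row, linked through the main table's key.
cur.execute(
    "CREATE TABLE xref_dm ("
    "gene_id_key INTEGER REFERENCES gene_main(gene_id_key), xref TEXT)"
)

cur.executemany("INSERT INTO gene_main VALUES (?, ?)", [(1, "geneA"), (2, "geneB")])
cur.executemany(
    "INSERT INTO xref_dm VALUES (?, ?)", [(1, "X1"), (1, "X2"), (2, "X3")]
)


def xrefs_for(name):
    """Return the sorted dimension values attached to one main-table row."""
    rows = cur.execute(
        "SELECT d.xref FROM gene_main m "
        "JOIN xref_dm d ON d.gene_id_key = m.gene_id_key "
        "WHERE m.name = ? ORDER BY d.xref",
        (name,),
    ).fetchall()
    return [r[0] for r in rows]


print(xrefs_for("geneA"))
```

Every dimension table joins back to the main table on the same key, so a query only ever needs joins of depth one.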
16
Data mart and dataset Dataset
17
Data mart, dataset and virtual schema virtual schema
18
BioMart abstractions
Dataset
–A subset of data organized into 1 or more tables
Attribute
–A single data point
–e.g. gene name
Filter
–An operation on an attribute
–e.g. ‘Chromosome = 1’
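A minimal Python sketch of the three abstractions, assuming a dataset can be modelled as a list of rows. The class names, the `query` method and the sample rows are hypothetical and are not part of the BioMart API.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Filter:
    attribute: str                 # the attribute the filter operates on
    test: Callable[[Any], bool]    # the operation applied to that attribute


@dataclass
class Dataset:
    name: str
    rows: list                     # each row maps attribute name -> value

    def query(self, attributes, filters=()):
        """Return the requested attributes for rows that pass every filter."""
        return [
            tuple(row[a] for a in attributes)
            for row in self.rows
            if all(f.test(row[f.attribute]) for f in filters)
        ]


genes = Dataset("gene", [
    {"gene_name": "geneA", "chromosome": "1"},
    {"gene_name": "geneB", "chromosome": "2"},
])
chromosome_1 = Filter("chromosome", lambda value: value == "1")

print(genes.query(["gene_name"], [chromosome_1]))
```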
19
Datasets, Attributes and Filters
[Diagram: a GENE table with columns gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id and description, annotated with the labels Mart, Dataset, Attribute and Filter]
20
BioMart abstractions (cont)
Link
–‘Common currency’ between two datasets
–e.g. accession
Exportable
–Potential links to export
Importable
–Potential links to import
21
Exportables, Importables and Links
[Diagram: Dataset 1 and Dataset 2 connected by Links]
22
Exportables, Importables and Links
Exportable (Dataset 1)
–name = uniprot_id
–attributes = uniprot_ac
Importable (Dataset 2)
–name = uniprot_id
–filters = uniprot_ac
23
Exportables, Importables and Links
Exportable (Dataset 1)
–name = genomic_region
–attributes = chr_name, chr_start, chr_end
Importable (Dataset 2)
–name = genomic_region
–filters = chr_name (=), chr_start (>=), chr_end (<=)
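One way to read the genomic_region example is sketched below in Python: the exportable supplies attribute values from Dataset 1, and the importable applies them as filters on Dataset 2 using the listed operators. The classes, the rule that a link requires matching names, and the sample rows are assumptions for illustration and not BioMart's implementation.

```python
import operator
from dataclasses import dataclass

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


@dataclass
class Exportable:
    name: str
    attributes: list   # attribute names emitted by dataset 1


@dataclass
class Importable:
    name: str
    filters: list      # (column, operator) pairs applied to dataset 2


def can_link(exp, imp):
    """Assume a link needs matching names and one filter per attribute."""
    return exp.name == imp.name and len(exp.attributes) == len(imp.filters)


def linked_rows(exp, imp, rows1, rows2):
    """Rows of dataset 2 that satisfy the importable for some exported row."""
    if not can_link(exp, imp):
        raise ValueError("exportable and importable do not match")
    exported = [[row[a] for a in exp.attributes] for row in rows1]
    matches = []
    for row in rows2:
        for values in exported:
            if all(OPS[op](row[col], val)
                   for (col, op), val in zip(imp.filters, values)):
                matches.append(row)
                break
    return matches


exp = Exportable("genomic_region", ["chr_name", "chr_start", "chr_end"])
imp = Importable("genomic_region",
                 [("chr_name", "="), ("chr_start", ">="), ("chr_end", "<=")])
regions = [{"chr_name": "1", "chr_start": 100, "chr_end": 200}]
features = [
    {"id": "a", "chr_name": "1", "chr_start": 120, "chr_end": 180},
    {"id": "b", "chr_name": "1", "chr_start": 50, "chr_end": 150},
    {"id": "c", "chr_name": "2", "chr_start": 120, "chr_end": 180},
]

print([row["id"] for row in linked_rows(exp, imp, regions, features)])
```

Under this reading, only features lying entirely inside an exported region on the same chromosome are returned.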
24
Creating BioMart databases
25
Building BioMart databases
[Diagram: source databases are transformed into a mart with MartBuilder; the configuration XML is edited with MartEditor]
26
Schema transformation principles
Central table
–Longest n:1, 1:1 path
Dimension table
–Central transformation ‘around’ a 1:n table
–Link tables are first decomposed into a set of 1:n relations
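The ‘longest n:1, 1:1 path’ rule can be illustrated with a small Python search. This is a sketch of one possible reading of the slide, not MartBuilder's algorithm: `relations` maps each table to the tables it reaches through an n:1 or 1:1 relation, and the table names are hypothetical.

```python
def longest_path(start, relations, visited=frozenset()):
    """Return the longest chain of n:1 / 1:1 relations beginning at `start`."""
    visited = visited | {start}
    best = [start]
    for nxt in relations.get(start, ()):
        if nxt in visited:
            continue  # never walk through the same table twice
        candidate = [start] + longest_path(nxt, relations, visited)
        if len(candidate) > len(best):
            best = candidate
    return best


# Hypothetical source schema: table -> tables reachable via n:1 or 1:1.
relations = {
    "transcript": ["gene", "biotype"],
    "gene": ["seq_region"],
}

print(longest_path("transcript", relations))
```

Under this reading, the tables on the returned path would be merged into the central (main) table, and the remaining 1:n tables would become dimension tables.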
27
MartBuilder
Application
–Reads database metadata
–Transforms a source schema into suggested datasets and lets you edit the process
–Produces a set of SQL statements (DDL) to run against the server to perform the transformation
29
Dataset Configuration
Dataset configuration (XML-based)
–Elements: Attributes; Filters; Trees, Groups, Collections; Exportables, Importables
–Roles: Semantics; Relational mapping; User interface; Linking datasets
30
Table naming convention
Naïve configuration
Tables
–Meta tables: meta_content
–Data tables: dataset__content__type
Data tables
–Main: __main
–Dimension: __dm
Columns
–Key: _key
31
Naming convention examples
Homo sapiens gene ensembl
–hsapiens_gene_ensembl__gene__main
–hsapiens_gene_ensembl__xref_hugo__dm
Encode
–hsapiens_encode__encode__main
Uniprot
–uniprot__protein__main
–uniprot__interpro__dm
Uniprot sequence
–uniprot_sequence__sequence__main
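The dataset__content__type convention is regular enough to build and parse mechanically. The helper functions below are hypothetical and only illustrate the pattern shown in the examples above; they are not part of the BioMart software.

```python
TABLE_TYPES = ("main", "dm")


def table_name(dataset, content, table_type):
    """Join the three parts with double underscores."""
    if table_type not in TABLE_TYPES:
        raise ValueError(f"unknown table type: {table_type}")
    return "__".join([dataset, content, table_type])


def parse_table_name(name):
    """Split a data table name into (dataset, content, type)."""
    parts = name.split("__")
    if len(parts) != 3 or parts[2] not in TABLE_TYPES:
        raise ValueError(f"not a data table name: {name}")
    return tuple(parts)


print(table_name("hsapiens_gene_ensembl", "xref_hugo", "dm"))
print(parse_table_name("uniprot__protein__main"))
```

Single underscores inside a part, as in hsapiens_gene_ensembl, are preserved because only the double underscore acts as a separator.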
32
Dataset Configuration XML
33
MartEditor
34
Accessing BioMart databases
35
BioMart architecture
[Diagram, by layer]
–Databases: myDatabase; public data (local or remote): SNP, Vega, Ensembl, UniProt, MSD
–Schema transformation: MartBuilder, producing myMart
–Configuration: MartEditor (XML)
–BioMart API: Java, Perl
–Retrieval: MartExplorer, MartShell, MartView
36
MartView (current)
37
MartView (new 0_5)
38
MartExplorer
39
MartShell
–Using = dataset
–Get = attribute
–Where = filter
40
MartShell (MQL)
● Uses Mart Query Language (MQL) to generate queries:
using Dataset get Attribute where Filter
● Can join datasets together:
using Dataset1 get Attribute1 where Filter1=var1 as q;
using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
● Can script and pipe:
martshell.sh -E MQLscript.mql > results.txt
martshell.sh -E MQLscript.mql | wc
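When MQL scripts are generated rather than typed, a small helper can assemble the using / get / where clauses. The Python function below is a hypothetical sketch. It assumes multiple attributes are separated by commas, which the slides do not show, and it follows the slide examples in joining filters with "and" and ending each statement with a semicolon.

```python
def mql(dataset, attributes, filters=(), name=None):
    """Compose one MQL statement as a string.

    dataset    -- dataset name, e.g. "MSD.msd"
    attributes -- attribute names for the get clause
    filters    -- filter expressions, joined with "and" in the where clause
    name       -- optional label so a later query can refer to this one
    """
    query = f"using {dataset} get {', '.join(attributes)}"
    if filters:
        query += " where " + " and ".join(filters)
    if name:
        query += f" as {name}"
    return query + ";"


print(mql("MSD.msd", ["pdb_id"],
          ["resolution_less < 1.5", "has_ec_info only"], name="q"))
```

The generated text could then be saved to a file and passed to martshell.sh with the -E option shown above.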
41
MartShell examples
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only;
193l
194l
1arb
...
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q;
MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks+1000 where pdb in q;
ENST00000270142.2 ENSG00000142168.2 strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only
AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGG AA....
42
biomaRt
43
Taverna
44
DAS ProServer
45
BioMart deployers
–Large scale data federation (EBI)
–Optimising access to a large database (Ensembl, WormBase)
–Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)
46
Hinxton example
[Diagram: EBI (Uniprot, MSD) and Sanger (Ensembl, SNP, Vega, Sequence) marts, accessed over the WWW]
47
BioMart deployers
–Large scale data federation (Hinxton)
–Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
–Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)
48
WormBase
–Genes
–Expression
–Phenotypes
–Variations
–Literature
–Ontologies
–Sequence
49
Ensembl
–Genes
–Ontologies
–Variations
–Protein annotation
–Disease
–Homologies
–Sequence
–Array annotations
50
HapMap
–Population
–Frequencies
–Inter population comparisons
–Gene annotation
51
ArrayExpress
52
BioMart deployers
–Large scale data federation (Hinxton)
–Optimising access to a large database (Ensembl, WormBase)
–Federating third party data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc.)
53
In development
–CAPRISA
–RGD
–DICTYBASE
–PURDUE UNIVERSITY
–RZPD
54
Music Mart
55
BioMart model
Already applied
–Ensembl
–Vega
–SNP
–Uniprot
–MSD
–ArrayExpress
–WormBase
–Gramene
–HapMap
–Variety of ‘in house’ projects (academic and industrial)
56
User restriction
[Diagram: the dataset XML carries a martUser setting, with the example values “default” and “advanced”]
57
Interface configuration
[Diagram: the dataset XML carries an Interface setting, with the example values “single-page web interface” and “wizard style web interface”]
58
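Web services
[Diagram: MartView reaches a local mart on port 3306; the direct connection to a remote mart on port 3306 is marked X; instead XML is exchanged over port 80 with MartService, which connects to the remote mart on port 3306]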
59
Web services (cont)
MartService requests
Registry XML
Dataset information: name, type etc.
DatasetConfig XML
Mart Query:
–The API query object is converted to an XML representation on the client and sent to the server.
–The query object is regenerated on the server and processed.
Results are sent back to the client as a simple tab-delimited HTML page.
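The client-to-server round trip described above can be sketched in Python with the standard library. The element and attribute names used here (Query, Dataset, Filter, Attribute, name, value) are assumptions for illustration, since the slide does not give the XML schema, and a plain tuple stands in for the API query object.

```python
import xml.etree.ElementTree as ET


def query_to_xml(dataset, attributes, filters):
    """Client side: serialise a query to an XML string (hypothetical schema)."""
    query = ET.Element("Query")
    ds = ET.SubElement(query, "Dataset", name=dataset)
    for filter_name, value in filters.items():
        ET.SubElement(ds, "Filter", name=filter_name, value=str(value))
    for attribute_name in attributes:
        ET.SubElement(ds, "Attribute", name=attribute_name)
    return ET.tostring(query, encoding="unicode")


def xml_to_query(text):
    """Server side: rebuild (dataset, attributes, filters) from the XML."""
    ds = ET.fromstring(text).find("Dataset")
    attributes = [a.get("name") for a in ds.findall("Attribute")]
    filters = {f.get("name"): f.get("value") for f in ds.findall("Filter")}
    return ds.get("name"), attributes, filters


def parse_results(body):
    """Client side: split a tab-delimited result body into rows of fields."""
    return [line.split("\t") for line in body.splitlines() if line]


xml_text = query_to_xml("hsapiens_gene_ensembl", ["gene_name"], {"chromosome": "1"})
print(xml_text)
print(xml_to_query(xml_text))
```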
60
Summary
A generic data management system
–A set of easily configurable user interfaces
–Distributed data federation
–Query optimization
61
BioMart
www.biomart.org
Open source (LGPL)
Public MySQL server
ftp
mart-dev@ebi.ac.uk
mart-announce@ebi.ac.uk
62
Acknowledgments
BioMart
–Arek Kasprzyk (EBI)
–Damian Smedley (EBI)
–Syed Haider (EBI)
–Gudmundur Thorisson (CSHL)
Contributors
–Darin London (EBI)
–Will Spooner (CSHL)
–Damian Keefe (Ensembl)
–Arne Stabenau (Ensembl)
–Andreas Kahari (Ensembl)
–Craig Melsopp (Ensembl)
–Katerina Tzouvara (Uniprot)
–Paul Donlon (Unilever)
–Steffen Durinck (SCD-ESAT, Katholieke Universiteit Leuven)
–Benoit Ballester (Universite de la Mediterranee)
–Stephen Robinson (EBI)
–Asif Kibria (EBI)