BioMart
Databases made easy
Richard Holland, European Bioinformatics Institute
Helsinki, September 2006
BioMart
A joint project
– European Bioinformatics Institute (EBI)
– Cold Spring Harbor Laboratory (CSHL)
Aim
– To develop a generic, query-oriented data management system capable of integrating distributed data sources.
Focus
‘Data mining’ or advanced search
– Creating custom datasets
– Querying multiple datasets
– Interactive
Users
– People who provide database-based services
– ‘Power user’ biologists and bioinformaticians
Requirements
User
– ‘One-stop shop’ for biological data
– Suitable for power biologists and bioinformaticians
– A set of interfaces that allow users to group and refine biological data based on many criteria
Deployer
– ‘Out of the box’ installation
– Built-in query optimization
– Easy data federation
Architecture
– Domain agnostic
– Distributed
– Platform independent
Advanced search GUIs
Single interface
Single access point
Queries across different databases Dataset 1 Dataset 2 Links
Main features
– Domain agnostic
– Platform independent (MySQL, Oracle, Postgres)
– Scalable for big datasets
– Federated architecture
– Automated UI configuration
How does it work?
BioMart Data mart XML Meta data BioMart software Source data
Query Engine Federated architecture
Data model - ‘reversed star’ (figure: main tables with primary keys PK1 and PK2, linked to dimension (dm) tables through foreign keys FK1 and FK2)
Data mart and dataset Dataset
Data mart, dataset and virtual schema virtual schema
BioMart abstractions
Dataset
– A subset of data organized into 1 or more tables
Attribute
– A single data point
– e.g. gene name
Filter
– An operation on an attribute
– e.g. ‘Chromosome = 1’
Datasets, Attributes and Filters
(figure: table GENE with columns gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description; labelled Mart, Dataset, Attribute, Filter)
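A minimal Python sketch of the three abstractions, assuming an in-memory list of rows stands in for a mart table. The `Dataset` class, the `make_filter` helper and the row values are hypothetical and not part of the BioMart API; only the column names are taken from the GENE table above.

```python
import operator
from dataclasses import dataclass


@dataclass
class Dataset:
    """A named subset of data; here simply a list of row dictionaries."""
    name: str
    rows: list

    def query(self, attributes, filters=()):
        """Return the requested attributes for every row passing all filters."""
        selected = [row for row in self.rows if all(f(row) for f in filters)]
        return [tuple(row[a] for a in attributes) for row in selected]


def make_filter(attribute, op, value):
    """A filter is an operation on one attribute, e.g. chromosome = 1."""
    return lambda row: op(row[attribute], value)


# Illustrative rows only; the identifiers and coordinates are made up.
genes = Dataset("gene", [
    {"gene_stable_id": "G1", "chromosome": "1", "gene_start": 100},
    {"gene_stable_id": "G2", "chromosome": "2", "gene_start": 250},
    {"gene_stable_id": "G3", "chromosome": "1", "gene_start": 900},
])

on_chr1 = make_filter("chromosome", operator.eq, "1")
result = genes.query(["gene_stable_id", "gene_start"], [on_chr1])
print(result)
```

Attributes select columns and filters select rows, which is why the two can be configured independently.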
BioMart abstractions (cont)
Link
– ‘Common currency’ between two datasets
– e.g. accession
Exportable
– Potential links to export
Importable
– Potential links to import
Exportables, Importables and Links Dataset 1 Dataset 2 Links
Exportables, Importables and Links (figure labels: Dataset 1, Dataset 2, Links)
Exportable
– name = uniprot_id
– attributes = uniprot_ac
Importable
– name = uniprot_id
– filters = uniprot_ac
Exportables, Importables and Links (figure labels: Dataset 1, Dataset 2, Links)
Exportable
– name = genomic_region
– attributes = chr_name, chr_start, chr_end
Importable
– name = genomic_region
– filters = chr_name (=), chr_start (>=), chr_end (<=)
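One way to read the genomic_region link is that each exported attribute value becomes the value of the importable filter in the same position. The sketch below follows that reading in plain Python. The `link` function, the dictionary layout and all row values are assumptions for illustration, not the BioMart implementation.

```python
import operator

# Exportable and importable share a name; attributes and filters pair up by position.
EXPORTABLE = {
    "name": "genomic_region",
    "attributes": ["chr_name", "chr_start", "chr_end"],
}
IMPORTABLE = {
    "name": "genomic_region",
    "filters": [
        ("chr_name", operator.eq),   # chr_name (=)
        ("chr_start", operator.ge),  # chr_start (>=)
        ("chr_end", operator.le),    # chr_end (<=)
    ],
}


def link(source_rows, target_rows, exportable, importable):
    """Pair each source row with the target rows accepted by the importable filters."""
    if exportable["name"] != importable["name"]:
        raise ValueError("exportable and importable names must match")
    linked = []
    for src in source_rows:
        exported = [src[a] for a in exportable["attributes"]]
        for tgt in target_rows:
            checks = zip(importable["filters"], exported)
            if all(op(tgt[column], value) for (column, op), value in checks):
                linked.append((src, tgt))
    return linked


# Hypothetical rows: one region in dataset 1, three features in dataset 2.
regions = [{"region": "R1", "chr_name": "1", "chr_start": 100, "chr_end": 500}]
features = [
    {"feature": "F1", "chr_name": "1", "chr_start": 150, "chr_end": 300},
    {"feature": "F2", "chr_name": "1", "chr_start": 50, "chr_end": 300},
    {"feature": "F3", "chr_name": "2", "chr_start": 150, "chr_end": 300},
]

pairs = [(s["region"], t["feature"]) for s, t in link(regions, features, EXPORTABLE, IMPORTABLE)]
print(pairs)
```

Only F1 lies on the same chromosome and inside the exported start and end coordinates, so it is the only feature linked to R1.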
Creating BioMart databases
Building BioMart databases Source databases Mart Transformation MartBuilder Configuration XML MartEditor MartBuilder
Schema transformation principles
Central table
– Longest n:1, 1:1 path
Dimension table
– Central transformation ‘around’ 1:n table.
– Link tables are first decomposed into a set of 1:n relations
MartBuilder Application
– Reads database metadata
– Transforms a source schema into suggested datasets and lets you edit the process
– Produces a set of SQL statements (DDL) to run against the server to perform the transformation
Dataset Configuration Dataset configuration Attributes Filters Trees, Groups, Collections Exportables, Importables Semantics Relational mapping User interface Linking datasets XML-based
Table naming convention
Naïve configuration
Tables
– Meta tables: meta_content
– Data tables: dataset__content__type
Data tables
– Main: __main
– Dimension: __dm
Columns
– Key: _key
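The main/dimension layout and the naming convention can be tried out with an in-memory SQLite database. This is only a sketch: SQLite stands in for the MySQL, Oracle or Postgres server, and the `demo` dataset, its columns and its values are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Main table: one row per gene, key column ends in _key.
CREATE TABLE demo__gene__main (
    gene_id_key INTEGER PRIMARY KEY,
    gene_name   TEXT,
    chromosome  TEXT
);
-- Dimension table: many rows per gene, joined back through the same key.
CREATE TABLE demo__xref__dm (
    gene_id_key INTEGER,
    xref        TEXT
);
INSERT INTO demo__gene__main VALUES (1, 'geneA', '1'), (2, 'geneB', '2');
INSERT INTO demo__xref__dm VALUES (1, 'X100'), (1, 'X101'), (2, 'X200');
""")

# A filter on the main table combined with an attribute from a dimension table.
rows = conn.execute(
    """
    SELECT m.gene_name, d.xref
    FROM demo__gene__main AS m
    JOIN demo__xref__dm AS d ON d.gene_id_key = m.gene_id_key
    WHERE m.chromosome = ?
    ORDER BY d.xref
    """,
    ("1",),
).fetchall()
print(rows)
conn.close()
```

Because every dimension table joins to the main table on a single key column, a query touching one dimension needs only one join.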
Naming convention examples
Homo sapiens gene ensembl
– hsapiens_gene_ensembl__gene__main
– hsapiens_gene_ensembl__xref_hugo__dm
Encode
– hsapiens_encode__encode__main
Uniprot
– uniprot__protein__main
– uniprot__interpro__dm
Uniprot sequence
– uniprot_sequence__sequence__main
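Since data table names follow dataset__content__type with a double underscore as separator, they can be split mechanically. The helper below is a hypothetical sketch (the `MartTable` type and `parse_table_name` function are not BioMart code) that handles data tables only and rejects anything else, such as meta tables.

```python
from typing import NamedTuple


class MartTable(NamedTuple):
    dataset: str
    content: str
    table_type: str  # "main" or "dm"


def parse_table_name(name: str) -> MartTable:
    """Split a data table name of the form dataset__content__type."""
    parts = name.split("__")
    if len(parts) != 3 or not all(parts) or parts[2] not in ("main", "dm"):
        raise ValueError(f"not a mart data table name: {name!r}")
    return MartTable(*parts)


for table in (
    "hsapiens_gene_ensembl__gene__main",
    "hsapiens_gene_ensembl__xref_hugo__dm",
    "uniprot__interpro__dm",
):
    print(parse_table_name(table))
```

Single underscores stay inside the dataset and content names, which is why the separator has to be doubled.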
Dataset Configuration XML
MartEditor
Accessing BioMart databases
Retrieval myDatabase SNP Vega Ensembl UniProt myMart MSD BioMart API Java Perl MartExplorer MartShell MartView Schema transformation MartBuilder XML MartEditor Configuration Databases Public data (local or remote) BioMart architecture
MartView (current)
MartView (new 0_5)
MartExplorer
MartShell Using = dataset Get = attribute Where = filter
MartShell (MQL)
● Uses Mart Query Language (MQL) to generate queries:
using <dataset> get <attributes> where <filters>
● Can join datasets together:
using Dataset1 get Attribute1 where Filter1=var1 as q;
using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
● Can script and pipe:
martshell.sh -E MQLscript.mql > results.txt
martshell.sh -E MQLscript.mql | wc
MartShell examples
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only;
193l
194l
1arb
...
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q;
MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks+1000 where pdb in q;
ENST ENSG strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only
AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGG AA....
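The `as q` / `in q` pattern stores the results of one query under a name and then uses them as the value list of a filter in the next query. A rough Python analogy is shown below. The rows and identifiers are invented, and this illustrates the idea only, not how MartShell evaluates MQL.

```python
# Hypothetical rows standing in for the two datasets.
structures = [
    {"pdb_id": "p1", "resolution": 1.2, "has_ec_info": True},
    {"pdb_id": "p2", "resolution": 2.0, "has_ec_info": True},
    {"pdb_id": "p3", "resolution": 1.0, "has_ec_info": False},
]
transcripts = [
    {"transcript": "t1", "pdb": "p1"},
    {"transcript": "t2", "pdb": "p2"},
    {"transcript": "t3", "pdb": "p3"},
]

# First query, stored "as q": ids with resolution below 1.5 and EC information.
q = {s["pdb_id"] for s in structures if s["resolution"] < 1.5 and s["has_ec_info"]}

# Second query: keep transcripts whose pdb value is "in q".
hits = [t["transcript"] for t in transcripts if t["pdb"] in q]
print(sorted(q), hits)
```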
biomaRt
Taverna
DAS ProServer
BioMart deployers
Large scale data federation (EBI)
Optimising access to a large database (Ensembl, WormBase)
Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)
EBI Uniprot MSD SANGER Ensembl SNP Vega Sequence WWW Hinxton example
BioMart deployers
Large scale data federation (Hinxton)
Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)
WormBase Genes Expression Phenotypes Variations Literature Ontologies Sequence
Ensembl Genes Ontologies Variations Protein annotation Disease Homologies Sequence Array annotations
HapMap Population Frequencies Inter population comparisons Gene annotation
ArrayExpress
BioMart deployers
Large scale data federation (Hinxton)
Optimising access to a large database (Ensembl, WormBase)
Federating third party data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc.)
In development CAPRISA RGD DICTYBASE PURDUE UNIVERSITY RZPD
Music Mart
BioMart model
Already applied
– Ensembl
– Vega
– SNP
– Uniprot
– MSD
– ArrayExpress
– WormBase
– Gramene
– HapMap
– Variety of ‘in house’ projects (academic and industrial)
User restriction XML Dataset XML martUser “default” “advanced”
Interface configuration XML Dataset XML Interface “single-page web interface” “wizard style web interface”
Web services MartView 3306 Local Mart 3306 X Remote Mart MartService XML
Web services (cont)
MartService requests
Registry XML
Dataset information: name, type etc.
DatasetConfig XML
Mart Query:
– The API query object is converted to an XML representation on the client and sent to the server.
– The query object is regenerated on the server and processed. Results are sent back to the client as a simple tab-delimited HTML page.
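The query round trip described above (object to XML on the client, XML back to object on the server) can be sketched with Python's standard library. The element and attribute names (`Query`, `Dataset`, `Attribute`, `Filter`) are assumptions chosen for illustration and may not match the schema MartService actually expects; the function names are hypothetical too.

```python
import xml.etree.ElementTree as ET


def build_query_xml(dataset, attributes, filters):
    """Client side: serialise a query description to an XML string."""
    query = ET.Element("Query")
    ds = ET.SubElement(query, "Dataset", name=dataset)
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", name=name, value=str(value))
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    return ET.tostring(query, encoding="unicode")


def parse_query_xml(text):
    """Server side: rebuild the query description from the XML string."""
    ds = ET.fromstring(text).find("Dataset")
    return {
        "dataset": ds.get("name"),
        "attributes": [a.get("name") for a in ds.findall("Attribute")],
        "filters": {f.get("name"): f.get("value") for f in ds.findall("Filter")},
    }


xml_text = build_query_xml(
    "hsapiens_gene_ensembl",
    attributes=["gene_stable_id", "description"],
    filters={"chromosome": "1"},
)
restored = parse_query_xml(xml_text)
print(xml_text)
print(restored)
```

Sending a description of the query rather than SQL means the client needs no direct database connection, only HTTP access to the service.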
Summary
A generic data management system
– A set of easily configurable user interfaces
– Distributed data federation
– Query optimization
BioMart Open source (LGPL) Public MySQL server ftp
Acknowledgments
BioMart
– Arek Kasprzyk (EBI)
– Damian Smedley (EBI)
– Syed Haider (EBI)
– Gudmundur Thorisson (CSHL)
Contributors
– Darin London (EBI)
– Will Spooner (CSHL)
– Damian Keefe (Ensembl)
– Arne Stabenau (Ensembl)
– Andreas Kahari (Ensembl)
– Craig Melsopp (Ensembl)
– Katerina Tzouvara (Uniprot)
– Paul Donlon (Unilever)
– Steffen Durinck (SCD-ESAT, Katholieke Universiteit Leuven)
– Benoit Ballester (Universite de la Mediterranee)
– Stephen Robinson (EBI)
– Asif Kibria (EBI)