BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006.


1 BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006

2 BioMart
A joint project
– European Bioinformatics Institute (EBI)
– Cold Spring Harbor Laboratory (CSHL)
Aim
– To develop a generic, query-oriented data management system capable of integrating distributed data sources.

3 Focus
‘Data mining’ or advanced search
– Creating custom datasets
– Querying multiple datasets
– Interactive
Users
– People who provide database-based services
– ‘Power user’ biologists and bioinformaticians

4 Requirements
User
– ‘One-stop shop’ for biological data
– Suitable for power biologists and bioinformaticians
– A set of interfaces that allow users to group and refine biological data based upon many criteria
Deployer
– ‘Out of the box’ installation
– Built-in query optimization
– Easy data federation
Architecture
– Domain agnostic
– Distributed
– Platform independent

5 Advanced search GUIs

6 Single interface

7 Single access point

8 Queries across different databases Dataset 1 Dataset 2 Links

9 Main features
– Domain agnostic
– Platform independent (MySQL, Oracle, Postgres)
– Scalable for big datasets
– Federated architecture
– Automated UI configuration

10 How does it work?

11 BioMart Data mart XML Meta data BioMart software Source data

12 Query Engine Federated architecture

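13 Data model (diagram: tables joined through a primary key, PK, and a foreign key, FK)

14 Data model (diagram: tables joined through primary keys, PK, and foreign keys, FK)

15 Data model - ‘reversed star’ (diagram: main tables with primary keys PK1 and PK2, and dimension (dm) tables that reference them through foreign keys FK1 and FK2)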

16 Data mart and dataset Dataset

17 Data mart, dataset and virtual schema virtual schema

18 BioMart abstractions
Dataset
– A subset of data organized into 1 or more tables
Attribute
– A single data point
– e.g. gene name
Filter
– An operation on an attribute
– e.g. ‘Chromosome = 1’
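The three abstractions on this slide can be pictured with a small in-memory model. This is only a sketch to make the vocabulary concrete, not BioMart code: the class names mirror the slide, while the `OPERATORS` table, the `query` method and the toy rows are assumptions made for illustration.

```python
import operator
from dataclasses import dataclass

# Comparison operators a filter may apply (an illustrative subset).
OPERATORS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


@dataclass(frozen=True)
class Attribute:
    """A single data point, e.g. a gene name."""
    name: str


@dataclass(frozen=True)
class Filter:
    """An operation on an attribute, e.g. chromosome = 1."""
    attribute: Attribute
    op: str
    value: object

    def matches(self, row: dict) -> bool:
        return OPERATORS[self.op](row[self.attribute.name], self.value)


@dataclass
class Dataset:
    """A subset of data; here flattened to a list of row dictionaries."""
    name: str
    rows: list

    def query(self, attributes, filters=()):
        """Return the requested attributes of every row passing all filters."""
        return [
            tuple(row[a.name] for a in attributes)
            for row in self.rows
            if all(f.matches(row) for f in filters)
        ]


# Toy rows, invented for illustration only.
genes = Dataset("toy_genes", [
    {"gene_name": "geneA", "chromosome": "1"},
    {"gene_name": "geneB", "chromosome": "2"},
    {"gene_name": "geneC", "chromosome": "1"},
])
gene_name = Attribute("gene_name")
chromosome = Attribute("chromosome")
on_chr1 = genes.query([gene_name], [Filter(chromosome, "=", "1")])
```

Here `on_chr1` holds the gene names of the toy rows on chromosome 1: attributes select what is returned, filters select which rows qualify.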

19 Datasets, Attributes and Filters
Example table GENE: gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description
Diagram labels: Mart, Dataset, Attribute, Filter

20 BioMart abstractions (cont)
Link
– ‘common currency’ between two datasets
– e.g. accession
Exportable
– Potential links to export
Importable
– Potential links to import

21 Exportables, Importables and Links Dataset 1 Dataset 2 Links

22 Exportables, Importables and Links
Dataset 1, Exportable: name = uniprot_id, attributes = uniprot_ac
Dataset 2, Importable: name = uniprot_id, filters = uniprot_ac
Links

23 Exportables, Importables and Links
Dataset 1, Exportable: name = genomic_region, attributes = chr_name, chr_start, chr_end
Dataset 2, Importable: name = genomic_region, filters = chr_name (=), chr_start (>=), chr_end (<=)
Links
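One way to read the genomic_region example: the exportable projects chr_name, chr_start and chr_end out of Dataset 1's results, and the importable feeds those values into Dataset 2's filters with the operators shown. The sketch below assumes that reading; the function names and the toy rows are hypothetical, and it is not how the BioMart query engine is implemented.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}

# Link definition taken from the slide: same name on both sides.
EXPORTABLE = {"name": "genomic_region",
              "attributes": ["chr_name", "chr_start", "chr_end"]}
IMPORTABLE = {"name": "genomic_region",
              "filters": [("chr_name", "="), ("chr_start", ">="), ("chr_end", "<=")]}


def export_values(rows, exportable):
    """Project the linking attributes out of dataset 1's result rows."""
    return [tuple(row[a] for a in exportable["attributes"]) for row in rows]


def import_rows(rows, importable, exported):
    """Keep dataset 2 rows that satisfy every filter for at least one exported tuple."""
    def satisfies(row, values):
        return all(OPS[op](row[col], value)
                   for (col, op), value in zip(importable["filters"], values))
    return [row for row in rows if any(satisfies(row, v) for v in exported)]


# Toy data, invented for illustration only.
regions = [{"chr_name": "1", "chr_start": 100, "chr_end": 200}]
features = [
    {"id": "f1", "chr_name": "1", "chr_start": 120, "chr_end": 180},
    {"id": "f2", "chr_name": "1", "chr_start": 50, "chr_end": 150},
    {"id": "f3", "chr_name": "2", "chr_start": 120, "chr_end": 180},
]

assert EXPORTABLE["name"] == IMPORTABLE["name"]  # a link needs matching names
linked = import_rows(features, IMPORTABLE, export_values(regions, EXPORTABLE))
linked_ids = [row["id"] for row in linked]
```

With these operators, only features lying entirely inside an exported region on the same chromosome survive the link.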

24 Creating BioMart databases

25 Building BioMart databases Source databases Mart Transformation MartBuilder Configuration XML MartEditor MartBuilder

26 Schema transformation principles
Central table
– Longest n:1, 1:1 path
Dimension table
– Central transformation ‘around’ 1:n table
– Link tables are decomposed into a set of 1:n relations first

27 MartBuilder Application
– Reads database metadata
– Transforms a source schema into suggested datasets and lets you edit the process
– Produces a set of SQL statements (DDL) to run against the server to perform the transformation

28

29 Dataset Configuration Dataset configuration Attributes Filters Trees, Groups, Collections Exportables, Importables Semantics Relational mapping User interface Linking datasets XML-based

30 Table naming convention
Naïve configuration
Tables
– Meta tables: meta_content
– Data tables: dataset__content__type
Data tables
– Main: __main
– Dimension: __dm
Columns
– Key: _key

31 Naming convention examples
Homo sapiens gene ensembl
– hsapiens_gene_ensembl__gene__main
– hsapiens_gene_ensembl__xref_hugo__dm
Encode
– hsapiens_encode__encode__main
Uniprot
– uniprot__protein__main
– uniprot__interpro__dm
Uniprot sequence
– uniprot_sequence__sequence__main
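The dataset__content__type pattern is mechanical enough to build and parse with a few lines of code. The helpers below are a hypothetical sketch (neither function is part of BioMart); they assume that the double underscore appears only as the separator between the three parts.

```python
TABLE_TYPES = ("main", "dm")  # main and dimension tables, per the convention


def table_name(dataset, content, table_type):
    """Build a data table name following dataset__content__type."""
    if table_type not in TABLE_TYPES:
        raise ValueError(f"unknown table type: {table_type!r}")
    return f"{dataset}__{content}__{table_type}"


def parse_table_name(name):
    """Split a data table name into (dataset, content, type)."""
    parts = name.split("__")
    if len(parts) != 3 or parts[2] not in TABLE_TYPES:
        raise ValueError(f"not a data table name: {name!r}")
    return tuple(parts)


main_table = table_name("hsapiens_gene_ensembl", "gene", "main")
parsed = parse_table_name("uniprot__interpro__dm")
```

Single underscores inside a part (as in hsapiens_gene_ensembl or xref_hugo) are left alone, which is why the separator has to be doubled.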

32 Dataset Configuration XML

33 MartEditor

34 Accessing BioMart databases

35 BioMart architecture
Databases: myDatabase, myMart; public data (local or remote): SNP, Vega, Ensembl, UniProt, MSD
Schema transformation: MartBuilder
Configuration (XML): MartEditor
Retrieval: BioMart API (Java, Perl), MartExplorer, MartShell, MartView

36 MartView (current)

37 MartView (new 0_5)

38 MartExplorer

39 MartShell Using = dataset Get = attribute Where = filter

40 MartShell (MQL)
● Uses Mart Query Language (MQL) to generate queries:
using <dataset> get <attribute> where <filter>
● Can join datasets together:
using Dataset1 get Attribute1 where Filter1=var1 as q;
using Dataset2 get Attribute2 where Filter2=var2 and Filter3 in q
● Can script and pipe:
martshell.sh -E MQLscript.mql > results.txt
martshell.sh -E MQLscript.mql | wc

41 MartShell examples
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only;
193l
194l
1arb
...
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q;
MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks+1000 where pdb in q;
ENST00000270142.2 ENSG00000142168.2 strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only
AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGG AA....
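Since MQL statements are plain text, a script can assemble them before handing them to martshell.sh. The helper below is a hypothetical sketch, not part of MartShell: it covers only the forms shown on these slides (one attribute, filters joined with "and", an optional "as" alias) and does no validation of dataset, attribute or filter names.

```python
def mql(dataset, attribute, filters=(), alias=None):
    """Compose a single MQL statement of the form shown on the slides."""
    statement = f"using {dataset} get {attribute}"
    if filters:
        # Each filter is passed through verbatim, e.g. "resolution_less < 1.5".
        statement += " where " + " and ".join(filters)
    if alias:
        # Naming a query lets a later statement refer to it ("... in q").
        statement += f" as {alias}"
    return statement + ";"


first = mql("MSD.msd", "pdb_id",
            ["resolution_less < 1.5", "has_ec_info only"], alias="q")
second = mql("Dataset2", "Attribute2", ["Filter2=var2", "Filter3 in q"])
script = "\n".join([first, second])
```

The resulting text could be written to a file and run with the -E option shown on the slide.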

42 biomaRt

43 Taverna

44 DAS ProServer

45 BioMart deployers
– Large scale data federation (EBI)
– Optimising access to a large database (Ensembl, WormBase)
– Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc. …)

46 Hinxton example
EBI: Uniprot, MSD
Sanger: Ensembl, SNP, Vega, Sequence
WWW

47 BioMart deployers
– Large scale data federation (Hinxton)
– Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
– Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc. …)

48 WormBase
Genes, Expression, Phenotypes, Variations, Literature, Ontologies, Sequence

49 Ensembl
Genes, Ontologies, Variations, Protein annotation, Disease, Homologies, Sequence, Array annotations

50 HapMap
Population Frequencies, Inter population comparisons, Gene annotation

51 ArrayExpress

52 BioMart deployers
– Large scale data federation (Hinxton)
– Optimising access to a large database (Ensembl, WormBase)
– Federating third party data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc. …)

53 In development
CAPRISA, RGD, DICTYBASE, PURDUE UNIVERSITY, RZPD

54 Music Mart

55 BioMart model
Already applied
– Ensembl
– Vega
– SNP
– Uniprot
– MSD
– ArrayExpress
– WormBase
– Gramene
– HapMap
– Variety of ‘in house’ projects (academic and industrial)

56 User restriction XML Dataset XML martUser “default” “advanced”

57 Interface configuration XML Dataset XML Interface “single-page web interface” “wizard style web interface”

58 Web services MartView 3306 Local Mart 3306 X Remote Mart MartService 3306 80 XML

59 Web services (cont)
MartService requests
– Registry XML
– Dataset information: name, type etc.
– DatasetConfig XML
– Mart Query:
  – The API query object is converted to an XML representation on the client and sent to the server.
  – The query object is regenerated on the server and processed.
Results are sent back to the client as a simple tab-delimited HTML page.
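The Mart Query round trip (query object to XML on the client, XML back to a query object on the server) can be sketched with the Python standard library. The element and attribute names used here (`Query`, `Dataset`, `Attribute`, `Filter`, `name`, `value`) are assumptions chosen for illustration and may not match the schema of the MartService version described in this talk; no network request is made.

```python
import xml.etree.ElementTree as ET


def query_to_xml(dataset, attributes, filters):
    """Client side: serialise a query description to an XML string."""
    root = ET.Element("Query")
    ds = ET.SubElement(root, "Dataset", name=dataset)
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", name=name, value=str(value))
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    return ET.tostring(root, encoding="unicode")


def xml_to_query(xml_text):
    """Server side: regenerate the query description from the XML."""
    ds = ET.fromstring(xml_text).find("Dataset")
    return {
        "dataset": ds.get("name"),
        "attributes": [a.get("name") for a in ds.findall("Attribute")],
        "filters": {f.get("name"): f.get("value") for f in ds.findall("Filter")},
    }


# Dataset and column names are taken from earlier slides; the values are toy input.
xml_text = query_to_xml("hsapiens_gene_ensembl",
                        ["gene_stable_id", "description"],
                        {"chromosome": "1"})
regenerated = xml_to_query(xml_text)
```

Because the XML carries everything the query object held, the server can rebuild an equivalent object without sharing any code or database connection with the client.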

60 Summary
A generic data management system
– A set of easily configurable user interfaces
– Distributed data federation
– Query optimization

61 BioMart
www.biomart.org
Open source (LGPL)
Public MySQL server
ftp
mart-dev@ebi.ac.uk
mart-announce@ebi.ac.uk

62 Acknowledgments
BioMart
– Arek Kasprzyk (EBI)
– Damian Smedley (EBI)
– Syed Haider (EBI)
– Gudmundur Thorisson (CSHL)
Contributors
– Darin London (EBI)
– Will Spooner (CSHL)
– Damian Keefe (Ensembl)
– Arne Stabenau (Ensembl)
– Andreas Kahari (Ensembl)
– Craig Melsopp (Ensembl)
– Katerina Tzouvara (Uniprot)
– Paul Donlon (Unilever)
– Steffen Durinck (SCD-ESAT, Katholieke Universiteit Leuven)
– Benoit Ballester (Universite de la Mediterranee)
– Stephen Robinson (EBI)
– Asif Kibria (EBI)

