BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006.



BioMart A joint project –European Bioinformatics Institute (EBI) –Cold Spring Harbor Laboratory (CSHL) Aim –To develop a generic, query-oriented data management system capable of integrating distributed data sources.

Focus ‘Data mining’ or advanced search –Creating custom datasets –Querying multiple datasets –Interactive Users –People who provide database-based services –‘Power user’ biologists and bioinformaticians

Requirements User –‘One-stop shop’ for biological data –Suitable for power biologists and bioinformaticians –A set of interfaces that allow users to group and refine biological data based upon many criteria Deployer –‘Out of the box’ installation –Built-in query optimization –Easy data federation Architecture –Domain agnostic –Distributed –Platform independent

Advanced search GUIs

Single interface

Single access point

Queries across different databases Dataset 1 Dataset 2 Links

Main features Domain agnostic Platform independent (MySQL, ORACLE, Postgres) Scalable for big datasets Federated architecture Automated UI configuration

How does it work?

BioMart Data mart XML Meta data BioMart software Source data

Query Engine Federated architecture

Data model (diagram: tables linked by primary key, PK, and foreign key, FK)

Data model - ‘reversed star’ (diagram: main tables with primary keys PK1 and PK2, each linked to dimension (dm) tables through foreign keys FK1 and FK2)

Data mart and dataset Dataset

Data mart, dataset and virtual schema virtual schema

BioMart abstractions Dataset –A subset of data organized into 1 or more tables Attribute –A single data point –e.g. gene name Filter –An operation on an attribute –e.g. ‘Chromosome = 1’
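The three abstractions can be pictured with a small in-memory model. This is only a sketch: the class names, the `OPS` table, and the toy gene rows are assumptions made for illustration and are not part of the BioMart software or its API.

```python
import operator
from dataclasses import dataclass, field

# Comparison operators a filter may apply (an illustrative subset).
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


@dataclass(frozen=True)
class Filter:
    attribute: str   # name of the attribute the filter operates on
    op: str          # one of the keys in OPS
    value: object    # value the attribute is compared against


@dataclass
class Dataset:
    name: str
    rows: list = field(default_factory=list)  # each row maps attribute name -> value

    def query(self, attributes, filters=()):
        """Return the requested attributes for every row that passes all filters."""
        return [
            tuple(row[a] for a in attributes)
            for row in self.rows
            if all(OPS[f.op](row[f.attribute], f.value) for f in filters)
        ]


# Toy data, invented for this example.
genes = Dataset("gene", [
    {"gene_display_id": "geneA", "chromosome": "1"},
    {"gene_display_id": "geneB", "chromosome": "2"},
    {"gene_display_id": "geneC", "chromosome": "1"},
])

# Attribute: gene_display_id; Filter: chromosome = 1
on_chr1 = genes.query(["gene_display_id"], [Filter("chromosome", "=", "1")])
```

Here an attribute is simply a column name to return, and a filter is a comparison applied to one attribute, which mirrors the ‘Chromosome = 1’ example on the slide.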

Datasets, Attributes and Filters GENE gene_id(PK) gene_stable_id gene_start gene_chrom_end chromosome gene_display_id description Mart Dataset Attribute Filter

BioMart abstractions (cont) Link –‘common currency’ between two datasets –e.g. accession Exportable –Potential links to export Importable –Potential links to import

Exportables, Importables and Links Dataset 1 Dataset 2 Links

Exportables, Importables and Links Dataset 1 Dataset 2 Exportable Importable name = uniprot_id attributes = uniprot_ac name = uniprot_id filters = uniprot_ac Links

Exportables, Importables and Links Dataset 1 Dataset 2 Exportable Importable name=genomic_region attributes=chr_name, chr_start, chr_end name=genomic_region filters=chr_name (=), chr_start (>=), chr_end (<=) Links
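One way to read the genomic_region example: the exportable lists the attributes Dataset 1 hands over, and the importable lists the filters (with their operators) that Dataset 2 applies to those values. The sketch below follows that reading. The dictionary layout, the `link` function, and the toy rows are hypothetical and are not BioMart's actual configuration format or implementation.

```python
import operator

# Exportable of dataset 1: attributes whose values are handed over.
EXPORTABLE = {
    "name": "genomic_region",
    "attributes": ["chr_name", "chr_start", "chr_end"],
}

# Importable of dataset 2: filters that receive those values, in the same order.
IMPORTABLE = {
    "name": "genomic_region",
    "filters": [
        ("chr_name", operator.eq),   # chr_name (=)
        ("chr_start", operator.ge),  # chr_start (>=)
        ("chr_end", operator.le),    # chr_end (<=)
    ],
}


def link(rows1, rows2, exportable, importable):
    """Pair each row of dataset 1 with the rows of dataset 2 that pass the imported filters."""
    if exportable["name"] != importable["name"]:
        raise ValueError("exportable and importable names must match to form a link")
    pairs = []
    for r1 in rows1:
        exported = [r1[a] for a in exportable["attributes"]]
        for r2 in rows2:
            if all(op(r2[col], val)
                   for (col, op), val in zip(importable["filters"], exported)):
                pairs.append((r1["id"], r2["id"]))
    return pairs


# Toy rows, invented for this example.
genes = [{"id": "geneA", "chr_name": "1", "chr_start": 100, "chr_end": 500}]
snps = [
    {"id": "snp1", "chr_name": "1", "chr_start": 150, "chr_end": 150},
    {"id": "snp2", "chr_name": "2", "chr_start": 150, "chr_end": 150},
    {"id": "snp3", "chr_name": "1", "chr_start": 600, "chr_end": 600},
]
linked = link(genes, snps, EXPORTABLE, IMPORTABLE)
```

Only `snp1` lies on the same chromosome and inside the gene's start and end coordinates, so it is the only row linked to `geneA`.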

Creating BioMart databases

Building BioMart databases Source databases Mart Transformation MartBuilder Configuration XML MartEditor MartBuilder

Schema transformation principles Central table –Longest n:1, 1:1 path Dimension table –Central transformation ‘around’ 1:n table. –Link tables are decomposed into a set of 1:n first

MartBuilder Application –Reads database metadata –Transforms a source schema into suggested datasets and lets you edit the process –Produces a set of SQL statements (DDL) to run against the server to perform the transformation
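The kind of transformation described here can be imitated on a toy schema. The sketch below uses SQLite rather than MySQL, Oracle or Postgres, and the statements in `TRANSFORM` are hand-written assumptions, not output from MartBuilder. The source tables, the `demo` dataset name and all data are invented. Table and key names follow the naming convention given in this talk (`__main`, `__dm`, `_key`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A toy normalised source schema: one gene has many cross-references (1:n).
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT, chromosome TEXT);
CREATE TABLE xref (xref_id INTEGER PRIMARY KEY,
                   gene_id INTEGER REFERENCES gene(gene_id),
                   accession TEXT);
INSERT INTO gene VALUES (1, 'geneA', '1'), (2, 'geneB', '2');
INSERT INTO xref VALUES (10, 1, 'ACC001'), (11, 2, 'ACC002'), (12, 2, 'ACC003');
""")

# Hand-written stand-ins for the generated SQL: one central (main) table
# and one dimension table built around the 1:n relation.
TRANSFORM = [
    "CREATE TABLE demo__gene__main AS "
    "SELECT gene_id AS gene_id_key, name, chromosome FROM gene",
    "CREATE TABLE demo__xref__dm AS "
    "SELECT gene_id AS gene_id_key, accession FROM xref",
]
for statement in TRANSFORM:
    conn.execute(statement)

# Querying the mart: filter on the main table, read an attribute from the dimension.
rows = conn.execute("""
    SELECT m.name, d.accession
    FROM demo__gene__main AS m
    JOIN demo__xref__dm AS d ON d.gene_id_key = m.gene_id_key
    WHERE m.chromosome = '2'
    ORDER BY d.accession
""").fetchall()
```

The dimension table carries the main table's key, so a query needs only one join from main to dimension.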

Dataset Configuration Dataset configuration Attributes Filters Trees, Groups, Collections Exportables, Importables Semantics Relational mapping User interface Linking datasets XML-based

Table naming convention Naïve configuration Tables –Meta tables meta_content –Data tables dataset__content__type Data tables –Main __main –Dimension __dm Columns –Key _key

Naming convention examples Homo sapiens gene ensembl –hsapiens_gene_ensembl__gene__main –hsapiens_gene_ensembl__xref_hugo__dm Encode –hsapiens_encode__encode__main Uniprot –uniprot__protein__main –uniprot__interpro__dm Uniprot sequence –uniprot_sequence__sequence__main
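The `dataset__content__type` pattern is regular enough to build and parse mechanically. A minimal sketch follows; the helper names `table_name` and `parse_table_name` are made up for this example and are not BioMart functions.

```python
TABLE_TYPES = ("main", "dm")  # main and dimension tables


def table_name(dataset, content, table_type):
    """Compose a data table name as dataset__content__type."""
    if table_type not in TABLE_TYPES:
        raise ValueError(f"unknown table type: {table_type!r}")
    return f"{dataset}__{content}__{table_type}"


def parse_table_name(name):
    """Split a data table name into (dataset, content, type)."""
    parts = name.split("__")
    if len(parts) != 3 or parts[2] not in TABLE_TYPES:
        raise ValueError(f"not a dataset__content__type name: {name!r}")
    return tuple(parts)


main = table_name("hsapiens_gene_ensembl", "gene", "main")
parsed = parse_table_name("uniprot__interpro__dm")
```

Single underscores stay inside the dataset and content parts (as in `hsapiens_gene_ensembl` or `xref_hugo`), and only the double underscore acts as a separator.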

Dataset Configuration XML

MartEditor

Accessing BioMart databases

Retrieval myDatabase SNP Vega Ensembl UniProt myMart MSD BioMart API JAVA Perl MartExplorer MartShell MartView Schema transformation MartBuilder XML MartEditor Configuration Databases Public data (local or remote) BioMart architecture

MartView (current)

MartView (new 0_5)

MartExplorer

MartShell Using = dataset Get = attribute Where = filter

MartShell (MQL) ● Uses Mart Query Language (MQL) to generate queries: using Dataset get Attribute where Filter ● Can join datasets together: using Dataset1 get Attribute1 where Filter1=var1 as q; using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q ● Can script and pipe: martshell.sh -E MQLscript.mql > results.txt martshell.sh -E MQLscript.mql | wc

MartShell examples MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only; 193l 194l 1arb... MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q; MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks+1000 where pdb in q; ENST ENSG strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGG AA....
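Because MQL statements have a fixed `using ... get ... where ...` shape, they are easy to generate from a script before passing them to `martshell.sh -E`. The helper below is a sketch that only assembles text in the shape shown on these slides. `build_mql` is a made-up name, it handles a single attribute, and it does not validate anything against a real MartShell.

```python
def build_mql(dataset, attribute, filters=(), store_as=None):
    """Assemble one MQL statement: using <dataset> get <attribute> [where ...] [as <name>];"""
    statement = f"using {dataset} get {attribute}"
    if filters:
        # Filter clauses are joined with 'and', as in the slide examples.
        statement += " where " + " and ".join(filters)
    if store_as:
        # 'as q' names the result so a later statement can use 'in q'.
        statement += f" as {store_as}"
    return statement + ";"


first = build_mql(
    "MSD.msd",
    "pdb_id",
    ["resolution_less < 1.5", "has_ec_info only"],
    store_as="q",
)
script = "\n".join([first])  # text that could be saved as an .mql file
```

The resulting string is the same as the second statement in the MartShell examples above.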

biomaRt

Taverna

DAS ProServer

BioMart deployers Large scale data federation (EBI) Optimising access to a large database (Ensembl, WormBase) Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc …)

EBI Uniprot MSD SANGER Ensembl SNP Vega Sequence WWW Hinxton example

BioMart deployers Large scale data federation (Hinxton) Optimising access to a large database (Ensembl, WormBase, ArrayExpress) Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc …)

WormBase Genes Expression Phenotypes Variations Literature Ontologies Sequence

Ensembl Genes Ontologies Variations Protein annotation Disease Homologies Sequence Array annotations

HapMap Population Frequencies Inter population comparisons Gene annotation

ArrayExpress

BioMart deployers Large scale data federation (Hinxton) Optimising access to a large database (Ensembl, WormBase) Federating third party data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc …)

In development CAPRISA RGD DICTYBASE PURDUE UNIVERSITY RZPD

Music Mart

BioMart model Already applied –Ensembl –Vega –SNP –Uniprot –MSD –ArrayExpress –WormBase –Gramene –HapMap –Variety of ‘in house’ projects (academia and industrial)

User restriction XML Dataset XML martUser “default” “advanced”

Interface configuration XML Dataset XML Interface “single-page web interface” “wizard style web interface”

Web services MartView 3306 Local Mart 3306 X Remote Mart MartService XML

Web services (cont) MartService requests Registry XML Dataset information: name, type etc DatasetConfig XML Mart Query: –API query object is converted to an XML representation on the client and sent to the server. –Query object is regenerated on the server and processed. Results are sent back to the client as a simple tab-delimited HTML page.
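The round trip described here (query object to XML on the client, XML back to a query object on the server) can be sketched with the standard library. The `MartQuery` class and the `Query`, `Dataset`, `Attribute` and `Filter` element names are assumptions chosen for illustration. The slides do not show the actual MartService XML schema, and no request is sent anywhere.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class MartQuery:
    dataset: str
    attributes: list = field(default_factory=list)  # attribute names to return
    filters: dict = field(default_factory=dict)     # filter name -> value (as text)

    def to_xml(self):
        """Client side: serialise the query object to an XML string."""
        root = ET.Element("Query")
        ds = ET.SubElement(root, "Dataset", name=self.dataset)
        for name in self.attributes:
            ET.SubElement(ds, "Attribute", name=name)
        for name, value in self.filters.items():
            ET.SubElement(ds, "Filter", name=name, value=str(value))
        return ET.tostring(root, encoding="unicode")

    @classmethod
    def from_xml(cls, text):
        """Server side: regenerate the query object from the XML string."""
        ds = ET.fromstring(text).find("Dataset")
        return cls(
            dataset=ds.get("name"),
            attributes=[a.get("name") for a in ds.findall("Attribute")],
            filters={f.get("name"): f.get("value") for f in ds.findall("Filter")},
        )


client_query = MartQuery("hsapiens_gene_ensembl", ["gene_stable_id"], {"chromosome": "1"})
wire_format = client_query.to_xml()
server_query = MartQuery.from_xml(wire_format)
```

Filter values are kept as text so that the object rebuilt on the server compares equal to the one created on the client.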

Summary A generic data management system –A set of easily configurable user interfaces –Distributed Data federation –Query optimization

BioMart Open source (LGPL) Public MySQL server ftp

Acknowledgments BioMart –Arek Kasprzyk (EBI) –Damian Smedley (EBI) –Syed Haider (EBI) –Gudmundur Thorisson (CSHL) Contributors –Darin London (EBI) –Will Spooner (CSHL) –Damian Keefe (Ensembl) –Arne Stabenau (Ensembl) –Andreas Kahari (Ensembl) –Craig Melsopp (Ensembl) –Katerina Tzouvara (Uniprot) –Paul Donlon (Unilever) –Steffen Durinck (SCD-ESAT, Katholieke Universiteit Leuven) –Benoit Ballester (Universite de la Mediterranee) –Stephen Robinson (EBI) –Asif Kibria (EBI)