Interoperation of Molecular Biology Databases Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International Menlo Park, CA


1 Interoperation of Molecular Biology Databases
Peter D. Karp, Ph.D.
Bioinformatics Research Group
SRI International
Menlo Park, CA
pkarp@ai.sri.com

2 SRI International Bioinformatics: Main Message
- Interoperation of molecular-biology databases is a challenging problem of critical importance
- DOE should initiate a program in interoperation of molecular-biology databases
  - Pursue both the warehouse approach and the multidatabase approach
  - Major progress possible within 5 years

3 Motivations
- Important biological problems require access to multiple bioinformatics databases
- Different problems require different sets of databases
- Hundreds of bioinformatics databases exist
  - Nucleic Acids Research 32:2004, Database issue
  - Nucleic Acids Research DB list: http://www3.oup.co.uk/nar/database/a/
    - 350 databases listed in 2002
    - 560 databases listed in 2004
- Applications of integration include
  - Complex queries
  - Comparison of overlapping sources
  - Data mining

4 Bioinformatics Databases
- Tremendous progress in point-and-click access for biologist users
- Less progress toward providing a computable, interoperable infrastructure for large-scale data mining
- Every large-scale mining/learning problem requires time-consuming crafting of input/training datasets

5 Warehouse Approach vs Multidatabase Approach
- Multidatabase query approaches assume databases are in a queryable DBMS
- Most sites that do operate DBMSs do not allow remote query access because of security and loading concerns
- Users want to control data stability
- Users want to control hardware applied to problem
- Internet bandwidth limits query throughput
- Users need to capture, integrate, and publish locally produced data of different types
- Replicating and refreshing very large sources is expensive
- Multidatabase and warehouse approaches are complementary

6 SRI BioWarehouse Project
- Goal: create a toolkit for constructing bioinformatics database warehouses that integrate sets of bioinformatics databases into one physical DBMS

7 BioWarehouse Approach
- Warehouse schema defines many bioinformatics datatypes
- Create loaders for public bioinformatics DBs
  - Parse file format for the DB
  - Apply semantic transformations
  - Insert database into warehouse tables
- Oracle and MySQL implementations
- Warehouse query access mechanisms
  - SQL queries via JDBC, Lisp, Perl, ODBC, OAA
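
The three loader steps on this slide (parse, transform, insert) can be sketched roughly as follows. This is only an illustration, not the BioWarehouse loader code: the tab-separated file format, the X0001/X0002 accessions, the `Protein` table and its columns, and the dataset name are all hypothetical, and Python's built-in `sqlite3` stands in for the Oracle and MySQL back ends named above.

```python
import sqlite3

# Hypothetical flat-file export from a source database.
# Columns (invented for this sketch): accession, name, EC number.
SAMPLE_FILE = """\
X0001\tAlcohol  Dehydrogenase\tEC 1.1.1.1
X0002\tunknown protein\t
"""

def parse(text):
    """Step 1: parse the source file format into raw records."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        accession, name, ec = line.split("\t")
        yield {"accession": accession, "name": name, "ec": ec}

def transform(record):
    """Step 2: semantic transformation into the warehouse's conventions."""
    name = " ".join(record["name"].split()).lower()      # normalize spacing and case
    ec = record["ec"].replace("EC", "").strip() or None  # bare EC number, or NULL
    return {"accession": record["accession"], "name": name, "ec_number": ec}

def load(conn, records, dataset_name):
    """Step 3: insert the transformed records into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Protein "
        "(accession TEXT, name TEXT, ec_number TEXT, dataset TEXT)"
    )
    rows = [(r["accession"], r["name"], r["ec_number"], dataset_name) for r in records]
    conn.executemany("INSERT INTO Protein VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse DBMS
loaded = load(conn, (transform(r) for r in parse(SAMPLE_FILE)), "example-source")

# Once loaded, the data is reachable through ordinary SQL.
enzymes = conn.execute(
    "SELECT accession, name FROM Protein WHERE ec_number IS NOT NULL"
).fetchall()
print(loaded, enzymes)
```

Keeping the three steps as separate functions means a loader for another source would, under these assumptions, only need its own `parse` and `transform`; the insert step and the SQL access path stay the same.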

8 Warehouse Schema
- Manages many bioinformatics datatypes simultaneously
  - Pathways, Reactions, Chemicals
  - Proteins, Genes, Replicons
  - Sequences, Sequence Features
  - Organisms, Taxonomic relationships
  - Computations (sequence matches)
  - Citations, Controlled vocabularies
  - Links to external databases
- Each type of warehouse object is implemented through one or more relational tables (currently 43)

9 Warehouse Schema
- Manages multiple datasets simultaneously
  - Dataset = Single version of a database
  - Allows version comparison
  - Multiple software tools or experiments require access to different versions
- Each dataset is a warehouse entity
- Every warehouse object is registered in a dataset
- Different databases storing the same biological datatypes are coerced into the same warehouse tables
- Design of most datatypes inspired by multiple databases
- Representational tricks to decrease schema bloat
  - Single space of primary keys
  - Single set of satellite tables, such as for synonyms, citations, comments, etc.
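
A minimal sketch of how these schema ideas could fit together, assuming a design in which one `entry` table hands out every primary key. The table and column names (`dataset`, `entry`, `protein`, `gene`, `synonym`, `wid`) and the sample rows are hypothetical and are not taken from the actual BioWarehouse schema; `sqlite3` is used only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (wid INTEGER PRIMARY KEY, name TEXT, version TEXT);
-- One table hands out every object key, so keys are unique across all types.
CREATE TABLE entry (
    wid INTEGER PRIMARY KEY AUTOINCREMENT,
    dataset_wid INTEGER NOT NULL REFERENCES dataset(wid)
);
CREATE TABLE protein (wid INTEGER PRIMARY KEY REFERENCES entry(wid), name TEXT);
CREATE TABLE gene    (wid INTEGER PRIMARY KEY REFERENCES entry(wid), name TEXT);
-- A single satellite table serves every object type via the shared key space.
CREATE TABLE synonym (other_wid INTEGER NOT NULL REFERENCES entry(wid), syn TEXT NOT NULL);
""")

def new_dataset(name, version):
    """Register one version of a source database as a dataset."""
    cur = conn.execute("INSERT INTO dataset (name, version) VALUES (?, ?)", (name, version))
    return cur.lastrowid

def new_object(table, dataset_wid, name):
    """Register an object in a dataset, then insert its typed row under the same key."""
    if table not in ("protein", "gene"):
        raise ValueError("unknown object table: " + table)
    wid = conn.execute("INSERT INTO entry (dataset_wid) VALUES (?)", (dataset_wid,)).lastrowid
    conn.execute(f"INSERT INTO {table} (wid, name) VALUES (?, ?)", (wid, name))
    return wid

# Two versions of the same (hypothetical) source database, loaded side by side.
v1 = new_dataset("ExampleDB", "1")
v2 = new_dataset("ExampleDB", "2")
wids = [
    new_object("protein", v1, "protein A"),
    new_object("protein", v1, "protein B"),
    new_object("gene", v1, "gene A"),
    new_object("protein", v2, "protein A"),
    new_object("protein", v2, "protein C"),
]

# The same satellite table stores synonyms for a protein and for a gene.
conn.executemany("INSERT INTO synonym VALUES (?, ?)", [(wids[0], "protA"), (wids[2], "gA")])

# Version comparison: protein names present in version 2 but not in version 1.
NAMES_IN = "SELECT p.name FROM protein p JOIN entry e ON e.wid = p.wid WHERE e.dataset_wid = ?"
added = [row[0] for row in conn.execute(NAMES_IN + " EXCEPT " + NAMES_IN, (v2, v1))]
print(added)
```

Because a key never repeats across tables in this sketch, a synonym row needs no column saying which type of object it describes, which is one way a single satellite table can replace a per-type synonyms table.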

10 Current Databases Supported by BioWarehouse
- BioCyc
  - 15 genomes and metabolic networks
- Swiss-Prot, TrEMBL
  - 1.3M proteins
- ENZYME
- KEGG
- NCBI Taxonomy
- CMR
  - 105 genomes, 250K genes, 250K proteins
- Applications:
  - DARPA BioSpice program on biological simulation
  - Study of sequence coverage of known enzymes
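
The slide gives no detail on how the enzyme coverage study was carried out. One plausible reading is a cross-source question: for each known enzyme activity, does at least one protein sequence exist? The sketch below shows how such a question becomes a single SQL join once an enzyme list and a protein collection sit in the same DBMS. The tables, columns, and rows are placeholders invented for illustration (`EC-A`, `P1`, and so on are not real identifiers).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enzyme  (ec_number TEXT PRIMARY KEY);        -- e.g. loaded from an enzyme list
CREATE TABLE protein (accession TEXT, ec_number TEXT);    -- e.g. loaded from a protein DB
""")

# Placeholder rows, not real data.
conn.executemany("INSERT INTO enzyme VALUES (?)", [("EC-A",), ("EC-B",), ("EC-C",)])
conn.executemany(
    "INSERT INTO protein VALUES (?, ?)",
    [("P1", "EC-A"), ("P2", "EC-A"), ("P3", "EC-C"), ("P4", None)],
)

# LEFT JOIN keeps enzymes that have no matching protein, so they show up with a count of 0.
coverage = conn.execute("""
    SELECT e.ec_number, COUNT(p.accession) AS n_sequences
    FROM enzyme e
    LEFT JOIN protein p ON p.ec_number = e.ec_number
    GROUP BY e.ec_number
    ORDER BY e.ec_number
""").fetchall()

uncovered = [ec for ec, n in coverage if n == 0]
fraction_covered = (len(coverage) - len(uncovered)) / len(coverage)
print(coverage, uncovered, fraction_covered)
```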

11 Summary
- Interoperation of molecular-biology databases is a challenging problem of critical importance
- DOE should initiate a program in interoperation of molecular-biology databases
  - Pursue both the warehouse approach and the multidatabase approach
  - Major progress possible within 5 years

