Overview of Genome Databases Peter D. Karp, Ph.D. SRI International www-db.stanford.edu/dbseminar/seminar.html.

Slides:



Advertisements
Similar presentations
1 SRI International Bioinformatics The Ocelot Frame Knowledge Representation System Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International.
Advertisements

The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information Database Resources.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
Management Information Systems, Sixth Edition
SRI International Bioinformatics 1 The consistency Checker, or Overhauling a PGDB By Ron Caspi.
The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology.
Curation of the EcoCyc Database: The EcoCyc Update Project Martha Arnaud Scientific Database Curator Bioinformatics Research Group SRI International
The Pathway Tools Schema. SRI International Bioinformatics Motivations for Understanding Schema Pathway Tools visualizations and analyses depend upon.
Contents of this Talk [Used as intro to Genome Databases Seminar, 2002] Overview of bioinformatics Motivations for genome databases Analogy of virus reverse-eng.
Bioinformatics for biomedicine Summary and conclusions. Further analysis of a favorite gene Lecture 8, Per Kraulis
Interoperation of Molecular Biology Databases Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International Menlo Park, CA
Bioinformatics: a Multidisciplinary Challenge Ron Y. Pinter Dept. of Computer Science Technion March 12, 2003.
Database Management: Getting Data Together Chapter 14.
Use of Ontologies in the Life Sciences: BioPax Graciela Gonzalez, PhD (some slides adapted from presentations available at
August 29, 2002InforMax Confidential1 Vector PathBlazer Product Overview.
1 Lecture 13: Database Heterogeneity Debriefing Project Phase 2.
陳虹瑋 國立陽明大學 生物資訊學程 Genome Engineering Lab. Genome Engineering Lab The Newest.
The bioinformatics of biological processes The challenge of temporal data Per J. Kraulis CMCM, Tartu University.
Modeling Functional Genomics Datasets CVM Lesson 1 13 June 2007Bindu Nanduri.
Pathway/Genome Databases and Software Tools Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International
Update on The Pathway Tools Software Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International BioCyc.org EcoCyc.org MetaCyc.org.
Creating a … Community Database Organism-Specific Database Model-Organism Database.
Genome database & information system for Daphnia Don Gilbert, October 2002 Talk doc at
Overview of Bioinformatics A/P Shoba Ranganathan Justin Choo National University of Singapore A Tutorial on Bioinformatics.
Srihari-CSE730-Spring 2003 CSE 730 Information Retrieval of Biomedical Text and Data Inroduction.
1 SRI International Bioinformatics BioCyc Tutorial Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International BioCyc.org EcoCyc.org,
Chapter 5 Lecture 2. Principles of Information Systems2 Objectives Understand Data definition language (DDL) and data dictionary Learn about popular DBMSs.
Introduction to databases Tuomas Hätinen. Topics File Formats Databases -Primary structure: UniProt -Tertiary structure: PDB Database integration system.
GO and OBO: an introduction. Jane Lomax EMBL-EBI What is the Gene Ontology? What is OBO? OBO-Edit demo & practical What is the Gene Ontology? What is.
Biological Databases By : Lim Yun Ping E mail :
Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.
The BioCyc Collection of Pathway/Genome Databases Alexander Shearer Bioinformatics Research Group SRI International BioCyc.org EcoCyc.org.
SRI International Bioinformatics 1 Recent Developments in Pathway Tools GMOD Workshop November ‘07 Suzanne Paley Bioinformatics Research Group SRI International.
SRI International Bioinformatics 1 The Structured Advanced Query Page Tomer Altman & Mario Latendresse Bioinformatics Research Group SRI, International.
SRI International Bioinformatics 1 The Structured Advanced Query Page Tomer Altman & Mario Latendresse Bioinformatics Research Group SRI, International.
Professor Michael J. Losacco CIS 1110 – Using Computers Database Management Chapter 9.
Organizing information in the post-genomic era The rise of bioinformatics.
The consistency Checker, or Overhauling a PGDB By Ron Caspi.
Ontologies GO Workshop 3-6 August Ontologies  What are ontologies?  Why use ontologies?  Open Biological Ontologies (OBO), National Center for.
IS 325 Notes for Wednesday August 28, Data is the Core of the Enterprise.
SRI International Bioinformatics 1 Submitting pathway to MetaCyc Ron Caspi.
1 SRI International Bioinformatics And now for our ‘Feature’ presentation: Automatic Loading of Protein Sequence Annotation Data from UniProt to Pathway.
Data resource management
SRI International Bioinformatics 1 SmartTables & Enrichment Analysis Peter Karp SRI Bioinformatics Research Group September 2015.
Proteomics databases for comparative studies: Transactional and Data Warehouse approaches Patricia Rodriguez-Tomé, Nicolas Pinaud, Thomas Kowall GeneProt,
SRI International Bioinformatics 1 The Structured Advanced Query Page Tomer Altman Bioinformatics Research Group SRI, International February 1, 2008.
Mining the Biomedical Research Literature Ken Baclawski.
Bioinformatics and Computational Biology
Issues in Ontology-based Information integration By Zhan Cui, Dean Jones and Paul O’Brien.
Opportunities for Text Mining in Bioinformatics (CS591-CXZ Text Data Mining Seminar) Dec. 8, 2004 ChengXiang Zhai Department of Computer Science University.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
SRI International Bioinformatics 1 The Structured Advanced Query Page Mario Latendresse Tomer Altman Bioinformatics Research Group SRI International March,
SRI International Bioinformatics 1 Editing Pathway/Genome Databases Ron Caspi.
Copyright (c) 2014 Pearson Education, Inc. Introduction to DBMS.
SRI International Bioinformatics 1 Pathway Tools Features Available Only in the Desktop Version PathoLogic.
SRI International Bioinformatics 1 The Structured Advanced Query Page Tomer Altman Mario Latendresse Bioinformatics Research Group SRI International April.
Recent Developments and Future Directions in Pathway Tools Peter D. Karp SRI International.
High throughput biology data management and data intensive computing drivers George Michaels.
1 Survey of Biodata Analysis from a Data Mining Perspective Peter Bajcsy Jiawei Han Lei Liu Jiong Yang.
XML and Distributed Applications By Quddus Chong Presentation for CS551 – Fall 2001.
Informatics for Scientific Data Bio-informatics and Medical Informatics Week 9 Lecture notes INF 380E: Perspectives on Information.
Why Create a PGDB? Perform pathway analyses as part of a genome project Analyze omics data Create a central public information resource for the organism,
Biological Databases By: Komal Arora.
The Pathway Tools Schema
Bioinformatics Capstone Project
Chapter 2 Database Environment.
Mangaldai College, Mangaldai
Overview of Microbial Pathway and Genome Databases
MANAGING DATA RESOURCES
SUBMITTED BY: DEEPTI SHARMA BIOLOGICAL DATABASE AND SEQUENCE ANALYSIS.
Presentation transcript:

Overview of Genome Databases Peter D. Karp, Ph.D. SRI International www-db.stanford.edu/dbseminar/seminar.html

Talk Overview Definition of bioinformatics Motivations for genome databases Issues in building genome databases

Definition of Bioinformatics Computational techniques for management and analysis of biological data and knowledge l Methods for disseminating, archiving, interpreting, and mining scientific information Computational theories of biology Genome Databases is a subfield of bioinformatics

Motivations for Bioinformatics Growth in molecular-biology knowledge (literature) Genomics 1. Study of genomes through DNA sequencing 2. Industrial Biology

Example Genomics Datatypes Genome sequences l DOE Joint Genome Institute u 511M bases in Dec 2001 u 11.97G bases since Mar 1999 Gene and protein expression data Protein-protein interaction data Protein 3-D structures

Genome Databases Experimental data l Archive experimental datasets l Retrieving past experimental results should be faster than repeating the experiment l Capture alternative analyses l Lots of data, simpler semantics Computational symbolic theories l Complex theories become too large to be grasped by a single mind l The database is the theory l Biology is very much concerned with qualitative relationships l Less data, more complex semantics

Bioinformatics Distinct intellectual field at the intersection of CS and molecular biology Distinct field because researchers in the field must know CS, biology, and bioinformatics Spectrum from CS research to biology service Rich source of challenging CS problems Large, noisy, complex data-sets and knowledge-sets Biologists and funding agencies demand working solutions

Bioinformatics Research algorithms + data structures = programs algorithms + databases = discoveries Combine sophisticated algorithms with the right content: l Properly structured l Carefully curated l Relevant data fields l Proper amount of data

Reference on Major Genome Databases Nucleic Acids Research Database Issue l 112 databases

Questions to Ask of a New Genome Database

What are Database Goals and Requirements? What problems will database be used to solve? Who are the users and what is their expertise?

What is its Organizing Principle? Different DBs partition the space of genome information in different dimensions Experimental methods (Genbank, PDB) Organism (EcoCyc, Flybase)

What is its Level of Interpretation? Laboratory data Primary literature (Genbank) Review (SwissProt, MetaCyc) Does DB model disagreement?

What are its Semantics and Content? What entities and relationships does it model? How does its content overlap with similar DBs? How many entities of each type are present? Sparseness of attributes and statistics on attribute values

What are Sources of its Data? Potential information sources l Laboratory instruments l Scientific literature u Manual entry u Natural-language text mining l Direct submission from the scientific community u Genbank Modification policy l DB staff only l Submission of new entries by scientific community l Update access by scientific community

What DBMS is Employed? None Relational Object oriented Frame knowledge representation system

Distribution / User Access Multiple distribution forms enhance access Browsing access with visualization tools API Portability

What Validation Approaches are Employed? None Declarative consistency constraints Programmatic consistency checking Internal vs external consistency checking What types of systematic errors might DB contain?

Database Documentation Schema and its semantics Format API Data acquisition techniques Validation techniques Size of different classes Coverage of subject matter Sparseness of attributes Error rates Update frequency

Relationship of Database Field to Bioinformatics Scientists generally unaware of basic DB principles l Complex queries vs click-at-a-time access l Data model l Defined semantics for DB fields l Controlled vocabularies l Regular syntax for flatfiles l Automated consistency checking Most biologists take one programming class Evolution of typical genome database Finer points of DB research off their radar screen Handfull of DB researchers work in bioinformatics

Database Field For many years, the majority of bioinformatics DBs did not employ a DBMS l Flatfiles were the rule l Scientists want to see the data directly l Commercial DBMSs too expensive, too complex l DBAs too expensive Most scientists do not understand l Differences between BA, MS, PhD in CS l CS research vs applications l Implications for project planning, funding, bioinformatics research

Recommendation Teaching scientists programming is not enough Teaching scientists how to build a DBMS is irrelevant Teach scientists basic aspects of databases and symbolic computing l Database requirements analysis l Data models, schema design l Knowledge representation, ontologies l Formal grammars l Complex queries l Database interoperability

BioSPICE Bioinformatics Database Warehouse Peter Karp, Dave Stringer-Calvert, Tom Lee, Kemal Sonmez SRI International

Project Goal Create a toolkit for constructing bioinformatics database warehouses that collect together a set of bioinformatics databases into one physical DBMS

Motivations Important bioinformatics problems require access to multiple bioinformatics databases Hundreds of bioinformatics databases exist l Nucleic Acids Research 30(1) 2002 – DB issue l Nucleic Acids Research DB list: 350 DBs at Different problems require different sets of databases

Motivations Combining multiple databases allows for data verification and complementation Simulation problems require access to data on pathways, enzymes, reactions, genetic regulation

Why is the Multidatabase Approach Not Sufficient? Multidatabase query approaches assume databases are in a DBMS Internet bandwidth limits query throughput Most sites that do operate DBMSs do not allow remote SQL access because of security and loading concerns Control data stability Need to capture, integrate and publish locally produced data of different types Multidatabase and Warehouse approaches complementary

Scenario 1 BioSPICE scientist wants to model multiple metabolic pathways in a given organism l Enumerate pathways and reactions l What enzymes catalyze each reaction? l What genes code for each enzyme? l What control regions regulate each gene?

Approach Oracle and MySQL implementations Warehouse schema defines many bioinformatics datatypes Create loaders for public bioinformatics DBs l Parse file format for the DB l Semantic transformations l Insert database into warehouse tables Warehouse query access mechanisms l SQL queries via Perl, ODBC, OAA

Example: Swiss-Prot DB Version 40.0 describes 101K proteins in a 320MB file Each protein described as one block of records (an entry) in a large text file Loader tool parses file one entry at a time Creates new entries in a set of warehouse tables

Warehouse Schema Manages many bioinformatics datatypes simultaneously l Pathways, Reactions, Chemicals l Proteins, Genes, Replicons l Citations, Organisms l Links to external databases Each type of warehouse object implemented through one or more relational tables (currently 43)

Warehouse Schema Databases on our wish list: l Genbank (nucleotide sequences) l Protein expression database l Protein-protein interactions database l Gene expression database l NCBI Taxonomy database l Gene Ontology l CMR

Warehouse Schema Manages multiple datasets simultaneously l Dataset = Single version of a database Support alternative measurements and viewpoints Version comparison Multiple software tools or experiments that require access to different versions Each dataset is a warehouse entity Every warehouse object is registered in a dataset

Warehouse Schema Different databases storing the same biological types are coerced into same warehouse tables Design of most datatypes inspired by multiple databases Representational tricks to decrease schema bloat l Single space of primary keys l Single set of satellite tables such as for synonyms, citations, comments, etc.

Warehouse Schema Examples l Protein data from Swiss-Prot, TrEMBL, KEGG, and EcoCyc all loaded into same relational tables l Pathway data from MetaCyc and KEGG are loaded into the same relational tables

Example: Swiss-Prot DB ID 1A11_CUCMA STANDARD; PRT; 493 AA. AC P23599; DT 01-NOV-1991 (Rel. 20, Created) DT 01-NOV-1991 (Rel. 20, Last sequence update) DT 15-DEC-1998 (Rel. 37, Last annotation update) DE 1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC ) (ACC DE SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE). GN ACS1 OR ACCW.

How Swiss-Prot is Loaded into The Warehouse Register Swiss-Prot in Datasets table Create entry in Entry and Protein tables for each Swiss-Prot protein Satellite tables store l Protein synonyms, citations, comments, accession numbers, organism, sequence features, subunits/complexes, DB links

Protein Table CREATE TABLE Protein ( WID NUMBER --The warehouse ID of this protein Name VARCHAR2(500) --Common name of the protein AASequence VARCHAR2(4000),--Amino-acid sequence for this protein Charge NUMBER, --Charge of the chemical Fragment CHAR(1), --Is this protein a fragment or not, T or F MolecularWeightCalc NUMBER, --Molecular weight calculated from sequence. Units: Daltons. MolecularWeightExp NUMBER, --Molecular Weight determined through experimentation. Units: Daltons. PICalc VARCHAR2(50), --pI calculated from its sqeuence. PIExp VARCHAR2(50), --pI value determined through experimentation. DataSetWID NUMBER --Reference to the data set from which the entity came from );

Database Loaders Loader tool defined for each DB to be loaded into Warehouse Example loaders available in several languages Loaders l KEGG (C) l BioCyc collection of 15 pathway DBs (C) l Swiss-Prot (Java) l ENZYME (Java)

Terminology Model Organism Database (MOD) – DB describing genome and other information about an organism Pathway/Genome Database (PGDB) – MOD that combines information about l Pathways, reactions, substrates l Enzymes, transporters l Genes, replicons l Transcription factors, promoters, operons, DNA binding sites BioCyc – Collection of 15 PGDBs at BioCyc.org l EcoCyc, AgroCyc, YeastCyc

Loader Architecture Grammar for Swiss-Prot Parser for SwissProt ANTLR Parser Generator Swiss-Prot Datafile SQL Insert Commands Oracle Loadable File

Current Warehouse Contents KEGGENZYMESwissProtBsubCycWarehouse Total Chemicals 7,2842, ,812 Genes 5,714088,6054,22198,540 Organisms , ,868 Proteins 3,8293,870101,6024,150113,451 Enzymatic Reactions 3, ,226 Pathways 4, ,655 Pathway Reactions 36, ,801

Example Warehouse Uses Check completeness of data sources Count reactions in ENZYME database with (and without) associated protein sequences in SWISS-PROT database: 3870 reactions in ENZYME 1662 reactions (43%) with a sequence in SWISS-PROT 2208 reactions (57%) without a sequence in SWISS-PROT Count #of distinct non-partial EC numbers in SWISS-PROT: 1554 distinct EC numbers in SWISS-PROT (non-partial)