1
Contents of this Talk
[Used as intro to Genome Databases Seminar, 2002]
- Overview of bioinformatics
- Motivations for genome databases
- Analogy of virus reverse-engineering to genome analysis
- Questions to ask of a genome DB
2
Overview of Genome Databases
Peter D. Karp, Ph.D.
SRI International
pkarp@ai.sri.com
www-db.stanford.edu/dbseminar/seminar.html
3
Talk Overview
- Definition of bioinformatics
- Motivations for genome databases
- Computer virus analogy
- Issues in building genome databases
4
Definition of Bioinformatics
- Computational techniques for management and analysis of biological data and knowledge
  - Methods for disseminating, archiving, interpreting, and mining scientific information
- Computational theories of biology
- Genome Databases is a subfield of bioinformatics
5
Motivations for Bioinformatics
- Growth in molecular-biology knowledge (literature)
- Genomics
  1. Study of genomes through DNA sequencing
  2. Industrial Biology
6
Example Genomics Datatypes
- Genome sequences
  - DOE Joint Genome Institute
    - 511M bases in Dec 2001
    - 11.97G bases since Mar 1999
- Gene and protein expression data
- Protein-protein interaction data
- Protein 3-D structures
7
Genome Databases
- Experimental data
  - Archive experimental datasets
  - Retrieving past experimental results should be faster than repeating the experiment
  - Capture alternative analyses
  - Lots of data, simpler semantics
- Computational symbolic theories
  - Complex theories become too large to be grasped by a single mind
  - The database is the theory
  - Biology is very much concerned with qualitative relationships
  - Less data, more complex semantics
8
Bioinformatics
- Distinct intellectual field at the intersection of CS and molecular biology
- Distinct field because researchers in the field should know CS, biology, and bioinformatics
- Spectrum from CS research to biology service
- Rich source of challenging CS problems
- Large, noisy, complex data-sets and knowledge-sets
- Biologists and funding agencies demand working solutions
9
Common Computer-Science Areas
- Database design and interoperation
- Machine learning
- Scientific visualization
- Combinatorial algorithms
- Distributed systems
- Text understanding
10
Bioinformatics Research
- algorithms + data structures = programs
- algorithms + databases = discoveries
- Combine sophisticated algorithms with the right content:
  - Properly structured
  - Carefully curated
  - Relevant data fields
  - Proper amount of data
11
Goals of Systems Biology
- Catalog the molecular parts lists of cells
- Understand the function(s) of each part
- Understand how those parts interact to produce the behavior of a cell or organism
- Understand the evolution of those molecular parts
13
Analogy: Genome Analysis and Virus Analysis
- Given: Virus binary executable file for known machine architecture
- Reverse engineer the program
  - Procedures
  - Call graph
  - Specifications for I/O behavior of the program and all procedures
- Capture and publish an annotated analysis of the virus
- Comparative analysis of related viruses
14
Genome Analysis
- Example: M. tuberculosis genome
- Given: 4.4 Mbp of DNA (genome)
- Infer:
  - Molecular parts list of Mtb
  - A model of the biochemical machinery of the Mtb cell
- DNA is a blueprint for the program of life
15
Start
- Virus: 4.4 Mbyte binary program
- Genome: 4.4 Mbp DNA sequence
16
Step 1
- Virus: Distinguish code from data segments; find procedure boundaries
- Genome: Distinguish coding from non-coding regions (gene finding)
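As a rough illustration of the gene-finding step, the sketch below scans a DNA string for open reading frames (an ATG start codon followed in-frame by a stop codon). It is a deliberately simplified stand-in, not the method used in the talk: real gene finders rely on statistical models, and this version looks only at the forward strand. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
# Minimal sketch: forward-strand open-reading-frame scan.
# find_orfs and min_codons are illustrative names, not from the talk.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of ATG...stop reading frames.

    `end` is exclusive and includes the stop codon. An ORF is kept only
    if it spans at least `min_codons` codons, counting start and stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):              # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i               # open a candidate coding region
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None            # close it and keep scanning
    return sorted(orfs)


example = "CCATGAAATTTTAGGG"
print(find_orfs(example))
```

Everything outside the reported intervals is treated as non-coding here, which mirrors the code-versus-data distinction on the virus side of the analogy.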
17
Step 2
- Virus: Predict semantics of procedures
- Genome: Predict gene functions
[Diagram: nodes A, B, C, D]
18
Step 3
- Virus: Predict procedure call graph
- Genome: Predict biochemical and gene networks
[Diagram: networks over nodes A, B, C, D]
19
Step 4
- Virus: Predict conditions under which procedures are invoked
- Genome: Predict expression of network fragments
[Diagram: network fragments over nodes A, B, C, D and Q, R, S]
20
Step 5
- Virus: Infer complete program specification
- Genome: Formulate dynamic cellular simulation
21
Step 6
- Virus: Internet publishing of structured program annotation with explanations, references, commentary
- Genome: Internet publishing of structured genome annotation with explanations, references, commentary
22
Step 7
- Virus: Comparative analysis of viruses; evolutionary relationships among viruses
- Genome: Comparative analysis of genomes; evolutionary relationships among genomes
23
Step 8
- Virus: Identify measures to disable virus or prevent its spread
- Genome: Identify target proteins for anti-microbial drug discovery
[Diagram: network over nodes A, B, C, D and Q, R, S]
24
Database of Viruses
- Create a database that stores
  - Binaries for all viruses
  - All annotation of virus programs by different investigators
  - Comparative analyses
- Support
  - Remote API access
  - Click-at-a-time browsing
25
Reference on Major Genome Databases
- Nucleic Acids Research Database Issue, http://nar.oupjournals.org/content/vol30/issue1/
  - 112 databases
26
Questions to Ask of a New Genome Database
27
What are Database Goals and Requirements?
- How many users?
- What expertise do users have?
- What problems will database be used to solve?
28
What is its Organizing Principle?
- Different DBs partition the space of genome information in different dimensions
- Experimental methods (Genbank, PDB)
- Organism (EcoCyc, Flybase)
29
What is its Level of Interpretation?
- Laboratory data
- Primary literature (Genbank)
- Review (SwissProt, MetaCyc)
- Does DB model disagreement?
30
What are its Semantics and Content?
- What entities and relationships does it model?
- How does its content overlap with similar DBs?
- How many entities of each type are present?
- Sparseness of attributes and statistics on attribute values
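The last two questions can be answered mechanically once a database is loaded. The sketch below counts entities per type and reports, for each attribute, the fraction of records that actually carry a value. The records, entity types, and attribute names are invented for illustration and do not come from any real genome database.

```python
from collections import Counter

# Hypothetical records; types and attribute names are illustrative only.
records = [
    {"type": "gene", "name": "geneA", "function": "kinase", "citation": "ref1"},
    {"type": "gene", "name": "geneB", "function": "transporter", "citation": None},
    {"type": "gene", "name": "geneC", "function": None, "citation": None},
    {"type": "protein", "name": "protA", "function": None, "citation": None},
]


def entity_counts(rows):
    """Count how many entities of each type are present."""
    return Counter(row["type"] for row in rows)


def attribute_fill_rates(rows, attributes):
    """Fraction of records holding a non-missing value for each attribute."""
    return {
        attr: sum(1 for row in rows if row.get(attr) is not None) / len(rows)
        for attr in attributes
    }


print(entity_counts(records))
print(attribute_fill_rates(records, ["name", "function", "citation"]))
```

A low fill rate for an attribute is one simple way to quantify the sparseness the slide asks about.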
31
What are Sources of its Data?
- Potential information sources
  - Laboratory instruments
  - Scientific literature
    - Manual entry
    - Natural-language text mining
  - Direct submission from the scientific community
    - Genbank
- Modification policy
  - DB staff only
  - Submission of new entries by scientific community
  - Update access by scientific community
32
What DBMS is Employed?
- None
- Relational
- Object oriented
- Frame knowledge representation system
33
Distribution / User Access
- Multiple distribution forms enhance access
- Browsing access with visualization tools
- API
- Portability
34
What Validation Approaches are Employed?
- None
- Declarative consistency constraints
- Programmatic consistency checking
- Internal vs external consistency checking
- What types of systematic errors might DB contain?
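One way to picture declarative consistency constraints is as a table of named rules kept separate from the code that applies them, so that rules can be added or reviewed without touching the checking loop. The sketch below is a minimal example under that assumption; the gene records, field names, and the two rules are hypothetical.

```python
# Hypothetical gene records; field names are assumptions for illustration.
genes = [
    {"id": "G1", "start": 100, "end": 400, "strand": "+"},
    {"id": "G2", "start": 900, "end": 500, "strand": "+"},
    {"id": "G3", "start": 10, "end": 50, "strand": "?"},
]

# Each constraint is declared as (description, predicate over one record).
CONSTRAINTS = [
    ("start must precede end", lambda g: g["start"] < g["end"]),
    ("strand must be + or -", lambda g: g["strand"] in {"+", "-"}),
]


def validate(rows, constraints):
    """Return (record id, violated constraint description) pairs."""
    violations = []
    for row in rows:
        for description, holds in constraints:
            if not holds(row):
                violations.append((row["id"], description))
    return violations


for gene_id, problem in validate(genes, CONSTRAINTS):
    print(gene_id, problem)
```

These are internal checks, since they compare a record only against itself; an external check would compare records against another data source.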
35
Database Documentation
- Schema and its semantics
- Format
- API
- Data acquisition techniques
- Validation techniques
- Size of different classes
- Coverage of subject matter
- Sparseness of attributes
- Error rates
36
Relationship of Database Field to Bioinformatics
- Scientists generally ignorant of basic DB principles
  - Complex queries vs click-at-a-time access
  - Data model
  - Defined semantics for DB fields
  - Controlled vocabularies
  - Regular syntax for flatfiles
  - Automated consistency checking
- Most biologists take one programming class
- Evolution of typical genome database
- Finer points of DB research off their radar screen
- Handful of DB researchers work in bioinformatics
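To make the contrast between complex queries and click-at-a-time access concrete, the sketch below uses Python's built-in `sqlite3` module to ask one question across a whole database: which pathways contain the most genes of unknown function? Answering it by browsing would mean opening every gene page in turn. The two-table schema and all rows are invented for illustration.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (id TEXT PRIMARY KEY, function TEXT);
CREATE TABLE pathway_gene (pathway TEXT, gene_id TEXT REFERENCES gene(id));
INSERT INTO gene VALUES
    ('g1', 'kinase'), ('g2', NULL), ('g3', NULL), ('g4', 'transporter');
INSERT INTO pathway_gene VALUES
    ('pathwayA', 'g1'), ('pathwayA', 'g2'),
    ('pathwayA', 'g3'), ('pathwayB', 'g4');
""")

# One declarative query replaces many individual page visits.
rows = conn.execute("""
SELECT pg.pathway, COUNT(*) AS unknown_genes
FROM pathway_gene AS pg
JOIN gene AS g ON g.id = pg.gene_id
WHERE g.function IS NULL
GROUP BY pg.pathway
ORDER BY unknown_genes DESC
""").fetchall()

print(rows)
```

The query depends on a defined data model and on a consistent convention for "unknown" (here, NULL), which is where the other items on the slide come in.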
37
Database Field
- For many years, the majority of bioinformatics DBs did not employ a DBMS
  - Flatfiles were the rule
  - Scientists want to see the data directly
  - Commercial DBMSs too expensive, too complex
  - DBAs too expensive
- Most scientists do not understand
  - Differences between BA, MS, PhD in CS
  - CS research vs applications
  - Implications for project planning, funding, bioinformatics research
38
Recommendation
- Teaching scientists programming is not enough
- Teaching scientists how to build a DBMS is irrelevant
- Teach scientists basic aspects of databases and symbolic computing
  - Database requirements analysis
  - Data models, schema design
  - Knowledge representation, ontologies
  - Formal grammars
  - Complex queries
  - Database interoperability
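As a small illustration of why formal grammars matter for flatfiles, the sketch below parses a made-up tag-value format in which every line is a two-letter tag followed by a value and every record ends with `//`. Because the syntax is regular, a single pattern both parses the file and rejects malformed lines. The format, the tags, and the function name `parse_flatfile` are assumptions and do not describe any specific database's file format.

```python
import re

# Grammar for one line: two uppercase letters, whitespace, then the value.
LINE = re.compile(r"^([A-Z]{2})\s+(.*)$")


def parse_flatfile(text):
    """Parse records made of 'XX value' lines, each terminated by '//'."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue                      # ignore blank lines
        if line == "//":                  # record terminator
            records.append(current)
            current = {}
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"line does not match grammar: {line!r}")
        tag, value = match.groups()
        current.setdefault(tag, []).append(value)
    if current:
        raise ValueError("last record is missing its '//' terminator")
    return records


sample = "ID   G1\nDE   example gene\n//\nID   G2\n//\n"
print(parse_flatfile(sample))
```

A flatfile without such a regular syntax forces every consumer to write ad hoc parsing code and makes automated consistency checking much harder.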