The Pathway Tools Ontology and Inferencing Layer Peter D. Karp, Ph.D. SRI International.

SRI International Bioinformatics

Overview
- Definitions
- Ontologies are ultimately exciting because of the inferences/computations they enable: where are the ontology killer apps?
- Adding more facets to an ontology increases the inferences that can be made with it
- Pathway Tools ontology and associated applications

Terminology
- Model Organism Database (MOD): DB describing genome and other information about an organism
- Pathway/Genome Database (PGDB): MOD that combines information about
  - Pathways, reactions, substrates
  - Enzymes, transporters
  - Genes, replicons
  - Transcription factors, promoters, operons, DNA binding sites
- BioCyc: Collection of 15 PGDBs at BioCyc.org
  - EcoCyc, AgroCyc, YeastCyc

Terminology: Pathway Tools Software
- PathoLogic
  - Prediction of metabolic network from genome
  - Computational creation of new Pathway/Genome Databases
- Pathway/Genome Editors
  - Distributed curation of PGDBs
  - Distributed object database system, interactive editing tools
- Pathway/Genome Navigator
  - WWW publishing of PGDBs
  - Querying, visualization of pathways, chromosomes, operons
  - Analysis operations
    - Pathway visualization of gene-expression data
    - Global comparisons of metabolic networks
- Bioinformatics 18:S

Ontology
- Ontology = Terms + Taxonomy + Slots + Constraints

Pathway Tools Ontology: Terms and Taxonomy
- Pathway Tools ontology contains 916 classes
  - Define datatypes
    - Replicons, Genes, Operons, Promoters, Transcription Factor Binding Sites
    - Proteins: Enzymes, Transporters, Transcription Factors
    - Small molecule compounds
    - Reactions, pathways
  - Define taxonomies
    - Taxonomy of chemical compounds
    - Riley's gene ontology
    - Taxonomy of metabolic pathways
    - EC system
- Bioinformatics 16:

Operations Enabled by Controlled Vocabulary
- Equality testing:
  - Is the function of gene X in organism A the same as the function of gene Y in organism B?
  - Is location L1 in organism A the same as location L2 in organism B?

Operations Enabled by Taxonomy
- Counting / Pie charts
  - How many genes of category "small molecule metabolism" are in organism A?
- Intersecting sets
  - How many of these up-regulated genes are in class "cell cycle"?
- User search via drill down
- Applying rules
  - If the substrate of X is an amino acid, then XXX
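
The counting and set-intersection operations on this slide both reduce to an is-a test over the class hierarchy. The following is a minimal Python sketch under stated assumptions: the toy taxonomy, the gene names, and the helpers `is_a` and `genes_in_class` are all hypothetical illustrations, not part of the Pathway Tools ontology or its API.

```python
# Toy taxonomy (hypothetical): child class -> parent class; None marks the root.
PARENT = {
    "gene-function": None,
    "small-molecule-metabolism": "gene-function",
    "amino-acid-biosynthesis": "small-molecule-metabolism",
    "cell-processes": "gene-function",
    "cell-cycle": "cell-processes",
}

# Hypothetical assignment of each gene to its most specific class.
GENE_CLASS = {
    "geneA": "amino-acid-biosynthesis",
    "geneB": "small-molecule-metabolism",
    "geneC": "cell-cycle",
    "geneD": "cell-cycle",
}


def is_a(cls, ancestor):
    """Return True if cls equals ancestor or lies below it in the taxonomy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT[cls]
    return False


def genes_in_class(ancestor, genes=None):
    """Genes (optionally restricted to a given set) whose class is under ancestor."""
    candidates = GENE_CLASS if genes is None else genes
    return {g for g in candidates if is_a(GENE_CLASS[g], ancestor)}


# Counting: how many genes fall under "small-molecule-metabolism"?
metabolism_count = len(genes_in_class("small-molecule-metabolism"))

# Intersecting sets: which up-regulated genes are in class "cell-cycle"?
up_regulated = {"geneA", "geneC"}
cycling = genes_in_class("cell-cycle", up_regulated)
```

Because `is_a` walks up the parent links, a gene filed under a specific subclass is still counted in every ancestor category, which is what makes drill-down counts consistent.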

Ontology
- Ontology = Terms + Taxonomy + Slots + Constraints

Pathway Tools Ontology: Slots
- Pathway Tools ontology contains 199 slots
- Categories of slots:
  - Meta-data: Creator, Creation-Date
  - Textual data: Common-Name, Synonyms, Comment, Citations
  - Attributes: Molecular-Weight, pI
  - Relationships: Gene, Catalyzes, In-Reaction
- Give stats on how many slots in each of these classes

Pathway Tools Ontology: Slots
- Slots introduced at appropriate place in taxonomy
  - Child classes inherit the slot; parent classes do not
- Examples:
  - Proteins: pI, MolWt, Component-Of
    - Polypeptides: Gene
    - Protein-Complexes: Components
  - Reactions: Left, Right, Keq, In-Pathway
  - Pathways: Reaction-List, Predecessor-List
  - Transcription Units: Components
  - Genes: Product, Component-Of

Operations Enabled by Slots
- Store/retrieve attributes of an entity
  - Get pI of protein
  - Get citations associated with pathway
- Traverse network of semantic relationships
  - Find all substrates of all reactions in pathway X
  - Find all genes that encode an enzyme that catalyzes a reaction in pathway X
  - Find all regulons encoding multiple metabolic pathways
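
A query such as "find all substrates of all reactions in pathway X" amounts to following slot values from frame to frame. The sketch below is a Python illustration that models frames as nested dictionaries. The slot names Reaction-List, Left, Right, and pI come from the slides; the frame names, the stored values, and the helpers `get_slot_values` and `substrates_of_pathway` are hypothetical and do not reproduce the Pathway Tools API.

```python
# Hypothetical knowledge base: frame name -> {slot name -> list of values}.
KB = {
    "pathway-X": {"Reaction-List": ["rxn-1", "rxn-2"]},
    "rxn-1": {"Left": ["A", "B"], "Right": ["C"]},
    "rxn-2": {"Left": ["C"], "Right": ["D", "B"]},
    "protein-P": {"pI": [5.4]},
}


def get_slot_values(frame, slot):
    """Retrieve the values stored in one slot of one frame (empty if absent)."""
    return KB.get(frame, {}).get(slot, [])


def substrates_of_pathway(pathway):
    """Collect every compound on either side of every reaction in the pathway."""
    substrates = set()
    for rxn in get_slot_values(pathway, "Reaction-List"):
        substrates.update(get_slot_values(rxn, "Left"))
        substrates.update(get_slot_values(rxn, "Right"))
    return substrates


# Attribute retrieval: get the pI of a protein.
protein_pi = get_slot_values("protein-P", "pI")

# Relationship traversal: pathway -> reactions -> compounds.
pathway_substrates = substrates_of_pathway("pathway-X")
```

The same two-step pattern (read a slot, then read slots of the frames it points to) extends to longer chains such as pathway to reaction to enzyme to gene.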

Ontology
- Ontology = Terms + Taxonomy + Slots + Constraints

Pathway Tools Ontology: Constraints
- Every Pathway Tools slot has associated meta data:
  - Class(es) to which it pertains
    - Keq pertains to Reactions
  - Data type (number, string, frame, etc)
    - Keq data type is number
  - Collection type (list, bag)
    - Keq is not a collection
  - Documentation string
  - Cardinality constraints
    - At most one Keq value
  - Range constraints
  - Taxonomy constraints
    - Values of Left slot of Reactions must be Chemicals
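
Slot metadata of this kind is what lets a consistency checker run without hand-written rules per slot. The following Python sketch checks three of the constraint kinds listed above (class the slot pertains to, cardinality, and data type or value class) for the Keq and Left examples. The metadata layout, the flat `CLASS_OF` table, and the function `check_slot` are assumptions made for illustration; a real knowledge base would consult the class taxonomy instead of a flat lookup.

```python
# Hypothetical slot metadata, modeled on the constraint kinds listed above.
SLOT_META = {
    "Keq": {"domain": "Reactions", "datatype": float, "max_cardinality": 1},
    "Left": {"domain": "Reactions", "value_class": "Chemicals"},
}

# Hypothetical frame -> class table (flat; no inheritance in this sketch).
CLASS_OF = {
    "rxn-1": "Reactions",
    "succinate": "Chemicals",
    "FAD": "Chemicals",
    "sdhA": "Genes",
}


def check_slot(frame, slot, values):
    """Return a list of constraint violations for one slot of one frame."""
    meta = SLOT_META[slot]
    problems = []
    # Domain constraint: the slot must pertain to the frame's class.
    if CLASS_OF.get(frame) != meta["domain"]:
        problems.append(f"{slot} does not pertain to {frame}")
    # Cardinality constraint: at most N values.
    if "max_cardinality" in meta and len(values) > meta["max_cardinality"]:
        problems.append(
            f"{slot} has {len(values)} values, at most {meta['max_cardinality']} allowed"
        )
    for value in values:
        # Data type constraint.
        if "datatype" in meta and not isinstance(value, meta["datatype"]):
            problems.append(f"{slot} value {value!r} has the wrong data type")
        # Taxonomy constraint: each value must belong to the required class.
        if "value_class" in meta and CLASS_OF.get(value) != meta["value_class"]:
            problems.append(f"{slot} value {value!r} is not in class {meta['value_class']}")
    return problems


ok = check_slot("rxn-1", "Left", ["succinate", "FAD"])
bad_class = check_slot("rxn-1", "Left", ["succinate", "sdhA"])
bad_cardinality = check_slot("rxn-1", "Keq", [1.5, 2.0])
```

Here `ok` is empty, `bad_class` reports that a gene was placed on the left side of a reaction, and `bad_cardinality` reports a second Keq value.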

Operations Enabled by Constraints
- Constraints make a system "intelligent" because they encode definitions in a machine-understandable fashion
- Automated DB consistency checkers (batch or interactive)
- Schema-driven data input tools
- Subsumption: compare two concept definitions

Pathway Tools Inference Layer
- Commonly used queries implemented as stored procedures
- Infer what is implicitly recorded in the KB

Compute Transitive Relationships
[Diagram: a chain of frames linked by slots]
- Chrom: genes sdhA, sdhB, sdhC, sdhD
- product: genes to polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2
- component-of: polypeptides to Succinate dehydrogenase
- catalyzes: Succinate dehydrogenase to Enzymatic-reaction
- reaction: Enzymatic-reaction to succinate + FAD = fumarate + FADH2
- in-pathway: reaction to TCA Cycle
- left: succinate, FAD; right: fumarate, FADH2
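
The diagram suggests that a question like "which genes belong to the TCA Cycle?" is answered by composing several slot links in sequence. The sketch below illustrates that composition in Python. The frame and slot names are taken from the slide labels; the pairing of each gene with a specific polypeptide is inferred from label order and is an assumption, and the helpers `follow`, `pathways_of_gene`, and `genes_of_pathway` are hypothetical names, not Pathway Tools functions.

```python
# Slot links read off the slide's diagram. The gene-to-polypeptide pairing
# follows the order of the labels and is an assumption.
LINKS = {
    ("sdhA", "product"): ["Sdh-flavo"],
    ("sdhB", "product"): ["Sdh-Fe-S"],
    ("sdhC", "product"): ["Sdh-membrane-1"],
    ("sdhD", "product"): ["Sdh-membrane-2"],
    ("Sdh-flavo", "component-of"): ["Succinate dehydrogenase"],
    ("Sdh-Fe-S", "component-of"): ["Succinate dehydrogenase"],
    ("Sdh-membrane-1", "component-of"): ["Succinate dehydrogenase"],
    ("Sdh-membrane-2", "component-of"): ["Succinate dehydrogenase"],
    ("Succinate dehydrogenase", "catalyzes"): ["Enzymatic-reaction"],
    ("Enzymatic-reaction", "reaction"): ["succinate + FAD = fumarate + FADH2"],
    ("succinate + FAD = fumarate + FADH2", "in-pathway"): ["TCA Cycle"],
}

# The chain of slots that leads from a gene to the pathways it takes part in.
GENE_TO_PATHWAY = ["product", "component-of", "catalyzes", "reaction", "in-pathway"]


def follow(start, slots):
    """Follow a sequence of slots from one frame; return the frames reached."""
    frontier = {start}
    for slot in slots:
        frontier = {v for f in frontier for v in LINKS.get((f, slot), [])}
    return frontier


def pathways_of_gene(gene):
    """Pathways reachable from a gene through the full slot chain."""
    return follow(gene, GENE_TO_PATHWAY)


def genes_of_pathway(pathway):
    """Genes whose slot chain ends at the given pathway."""
    genes = {frame for (frame, slot) in LINKS if slot == "product"}
    return {g for g in genes if pathway in pathways_of_gene(g)}


sdh_pathways = pathways_of_gene("sdhA")
tca_genes = genes_of_pathway("TCA Cycle")
```

Packaging the chain as a named function is the point of an inference layer: users ask for the genes of a pathway without having to know the five intermediate slots.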

Pathway Tools Inference Layer
- Enumerate reactions given alternative definitions of a reaction: all, enzyme, transport, small-mol, smm
- All substrates, all cofactors, all transported chemicals
- Protein tests: Is X a transcription factor, enzyme, transporter
  - Rather than force user to manually assign physiological roles, compute when possible from biochemical function
- Transcription-unit-binding-sites
- Compute in parts hierarchy: monomers-of-protein, components-of-protein, genes-of-protein, modified-forms
- Complex: regulon-of-protein, regulator-proteins-of-transcription-unit

What Killer Apps have Ontologies Enabled?
- What comes after pie charts and drill-down interfaces?

Terminology: Pathway Tools Software
- PathoLogic
  - Prediction of metabolic network from genome
  - Computational creation of new Pathway/Genome Databases
- Pathway/Genome Editors
  - Distributed curation of PGDBs
  - Distributed object database system, interactive editing tools
- Pathway/Genome Navigator
  - WWW publishing of PGDBs
  - Querying, visualization of pathways, chromosomes, operons
  - Analysis operations
    - Pathway visualization of gene-expression data
    - Global comparisons of metabolic networks

BioCyc Collection of Pathway/Genome DBs
- Literature-based Datasets:
  - MetaCyc
  - Escherichia coli (EcoCyc)
- Computationally Derived Datasets:
  - Agrobacterium tumefaciens
  - Caulobacter crescentus
  - Chlamydia trachomatis
  - Bacillus subtilis
  - Helicobacter pylori
  - Haemophilus influenzae
  - Mycobacterium tuberculosis H37Rv
  - Mycobacterium tuberculosis CDC1551
  - Mycoplasma pneumoniae
  - Pseudomonas aeruginosa
  - Saccharomyces cerevisiae
  - Treponema pallidum
  - Vibrio cholerae
- Legend: Yellow Underlined = Open Database

Pathway/Genome DBs Created by External Users
- Plasmodium falciparum, Stanford University
  - plasmocyc.stanford.edu
- Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington
  - Arabidopsis.org:1555
- Methanococcus jannaschii, EBI
  - Maine.ebi.ac.uk:1555
- Other PGDBs in progress by 20 other users
- Software freely available
- Each PGDB owned by its creator

Ontology Reuse
- A holy grail in AI since "ontology" became a buzzword
  - Decrease knowledge acquisition bottleneck
- GO qualifies as a large success in ontology reuse
- Pathway Tools ontology reused across 18 PGDBs
- Pathway Tools algorithms portable across all PGDBs

Pathway Tools Algorithms
- Visualization and editing tools for following datatypes
- Full Metabolic Map
  - Paint gene expression data on metabolic network; compare metabolic networks
- Pathways
  - Pathway prediction
- Reactions
  - Balance checker
- Compounds
  - Chemical substructure comparison
- Enzymes, Transporters, Transcription Factors
- Genes
- Chromosomes
- Operons
  - Operon prediction; visualize genetic network

Inference of Metabolic Pathways
[Diagram: PathoLogic data flow]
- Input: Annotated Genomic Sequence (Genes/ORFs, Gene Products, DNA Sequences)
- Input: Multi-organism Pathway Database (MetaCyc) (Reactions, Pathways, Compounds)
- PathoLogic Software: integrates genome and pathway data to identify putative metabolic networks
- Output: Pathway/Genome Database (Genomic Map, Genes, Gene Products, Reactions, Pathways, Compounds)

PathoLogic Analysis Phases
- Trial parsing of input data files [few days]
- Initialize schema of new PGDB [3 min]
- Create DB objects for replicons, genes, proteins [5 min]
- Assign enzymes to reactions they catalyze [10 min / 1 week]
  - ferrochelatase
  - glutamate 1-semialdehyde 2,1-aminomutase
  - porphobilinogen deaminase
- [Diagram: pathway over compounds A to G with enzymes E1 and E2]

PathoLogic Analysis Phases
- From assigned reactions, infer what pathways are present [5 min / few days]
- Define metabolic overview diagram [1 day]
- Define protein complexes [few days]

Killer App: Global Consistency Checking of Biochemical Network
- Given:
  - A PGDB for an organism
  - A set of initial metabolites
- Infer:
  - What set of products can be synthesized by the small-molecule metabolism of the organism
- Can known growth medium yield known essential compounds?
- Pacific Symposium on Biocomputing p

Algorithm: Forward Propagation
[Diagram labels: Nutrient set, Transport, Metabolite set, "Fire" reactions, Reactants, Products, PGDB reaction pool]
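
Forward propagation can be read as a fixed-point computation: starting from the nutrient set, repeatedly "fire" every reaction whose reactants are all present and add its products to the metabolite set until nothing new appears. The Python sketch below is one plausible reading of the slide, not the published implementation. The reaction pool, compound names, and the functions `forward_propagate` and `unreachable` are hypothetical, and the transport step is simplified away by assuming nutrients are already inside the cell.

```python
# Hypothetical reaction pool: each reaction is (reactants, products).
REACTIONS = [
    ({"A", "B"}, {"C", "D"}),
    ({"C"}, {"E"}),
    ({"E", "N"}, {"F"}),
    ({"X"}, {"Y"}),  # never fires: X is not reachable from the nutrients
]


def forward_propagate(nutrients, reactions):
    """Return every compound producible from the nutrients (a fixed point)."""
    metabolites = set(nutrients)
    changed = True
    while changed:
        changed = False
        for reactants, products in reactions:
            # A reaction fires when all of its reactants are available
            # and it would contribute at least one new product.
            if reactants <= metabolites and not products <= metabolites:
                metabolites |= products
                changed = True
    return metabolites


def unreachable(essential, nutrients, reactions):
    """Essential compounds that the network cannot make from the nutrients."""
    return set(essential) - forward_propagate(nutrients, reactions)


reached = forward_propagate({"A", "B", "N"}, REACTIONS)
missing = unreachable({"E", "F", "Y"}, {"A", "B", "N"}, REACTIONS)
```

A non-empty `missing` set is the signal of interest: it points either to an incomplete nutrient set or to gaps and errors in the database.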

Results
- Phase I: Forward propagation
  - 21 initial compounds yielded only half of 38 essential compounds for E. coli
- Phase II: Manually identify
  - Bugs in EcoCyc (e.g., two objects for tryptophan)
  - Missing initial protein substrates (e.g., ACP)
  - Missing pathways in EcoCyc
- Phase III: Forward propagation with 11 more initial metabolites
  - Yielded all 38 essential compounds

How to Characterize the Metabolic Network of a Cell?

Aggregate Properties of the E. coli Metabolic Network
- EcoCyc is not a complete picture of E. coli metabolism
  - 30% of E. coli genes remain unidentified
- Analysis pertains to pathways of small-molecule metabolism
- Computed with respect to EcoCyc v4.5 (Sep-1998)
- Joint work with Christos Ouzounis of EBI
- Genome Research 10:

Enzymes
- 4391 genes in E. coli genome
- 4288 code for proteins
- 676 (15%) gene products form 607 enzymes
- Of the 607 enzymes, 296 are monomers, 311 are multimers
  - 90% of genes for heteromultimers are linked

Reactions
- 744 reactions of small-molecule metabolism
  - 582 assigned to at least one pathway

Compounds
- 791 substrates in the 744 reactions
- Each reaction contains 4.0 substrates on average
- Each substrate appears in 2.1 reactions

Enzyme Modulation
- 805 enzymatic-reaction objects in EcoCyc
  - 80 have physiological inhibitors
  - 22 have physiological activators
  - 17 have both
  - 43% have a modulator
  - 327 require a cofactor or prosthetic group

Enzyme-Reaction Associations
- 585 reactions catalyzed by 1 enzyme
- 55 reactions catalyzed by 2 enzymes
- 12 reactions catalyzed by 3 enzymes
- 1 reaction catalyzed by 4 enzymes
- 483 reactions belong to a single pathway
- 99 reactions belong to multiple pathways
- 100 of the 607 E. coli enzymes are multifunctional

Pathway Tools Implementation
- Allegro Common Lisp
- Sun and PC platforms
  - Run as window application or WWW server
- Ocelot object database
- 250,000 lines of code
- Lisp-based WWW server at BioCyc.org
  - Lisp process reads URLs from the network and generates GIF+HTML from PGDBs
  - Manages 15 PGDBs

Ocelot Knowledge Server Architecture
- Frame data model
  - Classes, instances, inheritance
- Persistent storage via disk files, Oracle DBMS
  - Concurrent development: Oracle
  - Single-user development: disk files
  - Read-only delivery: bundle data into binary program
- Transaction logging facility
- Schema evolution
- Local disk cache to improve Internet performance
- J. Intelligent Information Systems 1:

GKB Editor
- Browser and editor for KBs and ontologies
- Three editing tools:
  - Taxonomy editor
  - Frame editor
  - Relationships editor
- All operations are schema driven

The Common Lisp Programming Environment
- Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11: )

Peter Norvig’s Solution
“I wrote my version in Lisp. It took me about 2 hours (compared to a range of hours for the other Lisp programmers in the study, 3-25 for C/C++ and 4-63 for Java) and I ended up with 45 non-comment non-blank lines (compared with a range of for Lisp, and for the other languages). (That means that some Java programmer was spending 13 lines and 84 minutes to provide the functionality of each line of my Lisp program.)”

Common Lisp Programming Environment
- Interpreted and/or compiled execution
- Fabulous debugging environment
- High-level language
- Interactive data exploration
- Extensive built-in libraries
- Dynamic redefinition
- Find out more!
  - ALU.org -- Association of Lisp Users
  - BioLisp.org

Pathway Exchange Ontology
- BioPathways group developing ontology and format for exchange of pathway data
  - Metabolic pathways
  - Signaling pathways
  - Protein interactions
- Moving upwards from chemicals, proteins, to reactions and pathways
- Working to extend CML
- Draft ontology at

Summary
- Pathway Tools apps:
  - Predict pathways and generate PGDBs
  - Visualization and editing tools
  - Paint gene expression data; compare entire pathway maps
  - Global consistency checking of metabolic network
  - Characterize metabolic and genetic networks
- New killer apps:
  - Interoperability
  - Text mining
  - Bake-off for genome annotation pipelines

BioCyc and Pathway Tools Availability
- WWW BioCyc freely available to all
  - BioCyc.org
  - Six BioCyc DBs openly available to all
- BioCyc DBs freely available to non-profits
  - Flatfiles downloadable from BioCyc.org
  - Binary executable:
    - Sun UltraSparc-170 w/ 64MB memory
    - PC, 400MHz CPU, 64MB memory, Windows-98 or newer
  - PerlCyc API
- Pathway Tools freely available to non-profits

Acknowledgements
- SRI
  - Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud
- EcoCyc Project
  - Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier
- MetaCyc Project
  - Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville
- Stanford
  - Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh
- Funding sources:
  - NIH National Center for Research Resources
  - NIH National Institute of General Medical Sciences
  - NIH National Human Genome Research Institute
  - Department of Energy Microbial Cell Project
  - DARPA BioSpice, UPC
- BioCyc.org