Published by Dorthy Harriet Eaton. Modified over 9 years ago.
1
The Pathway Tools Ontology and Inferencing Layer
Peter D. Karp, Ph.D.
SRI International
2
Overview
- Definitions
- Ontologies ultimately exciting because of the inferences/computations they enable: Where are the ontology killer apps?
- Adding more facets to an ontology increases inferences that can be made with it
- Pathway Tools ontology and associated applications
3
SRI International Bioinformatics
Terminology
- Model Organism Database (MOD) – DB describing genome and other information about an organism
- Pathway/Genome Database (PGDB) – MOD that combines information about
  - Pathways, reactions, substrates
  - Enzymes, transporters
  - Genes, replicons
  - Transcription factors, promoters, operons, DNA binding sites
- BioCyc – Collection of 15 PGDBs at BioCyc.org
  - EcoCyc, AgroCyc, YeastCyc
4
Terminology – Pathway Tools Software
- PathoLogic
  - Prediction of metabolic network from genome
  - Computational creation of new Pathway/Genome Databases
- Pathway/Genome Editors
  - Distributed curation of PGDBs
  - Distributed object database system, interactive editing tools
- Pathway/Genome Navigator
  - WWW publishing of PGDBs
  - Querying, visualization of pathways, chromosomes, operons
  - Analysis operations
    - Pathway visualization of gene-expression data
    - Global comparisons of metabolic networks
Bioinformatics 18:S225 2002
5
Ontology
Ontology = Terms + Taxonomy + Slots + Constraints
6
Pathway Tools Ontology: Terms and Taxonomy
- Pathway Tools ontology contains 916 classes
  - Define datatypes
    - Replicons, Genes, Operons, Promoters, Transcription Factor Binding Sites
    - Proteins: Enzymes, Transporters, Transcription Factors
    - Small molecule compounds
    - Reactions, pathways
  - Define taxonomies
    - Taxonomy of chemical compounds
    - Riley’s gene ontology
    - Taxonomy of metabolic pathways
    - EC system
Bioinformatics 16:269 2000
7
Operations Enabled by Controlled Vocabulary
- Equality testing:
  - Is the function of gene X in organism A the same as the function of gene Y in organism B?
  - Is location L1 in organism A the same as location L2 in organism B?
8
Operations Enabled by Taxonomy
- Counting / Pie charts
  - How many genes of category “small molecule metabolism” are in organism A?
- Intersecting sets
  - How many of these up-regulated genes are in class “cell cycle”?
- User search via drill down
- Applying rules
  - If the substrate of X is an amino acid, then XXX
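The counting and set-intersection operations on this slide can be sketched in a few lines of Python. This is only an illustration: the class names, gene assignments, and the helpers `is_a` and `genes_in_class` are hypothetical and are not taken from the Pathway Tools ontology or its API.

```python
# Hypothetical toy taxonomy: child class -> parent class.
PARENT = {
    "amino-acid-biosynthesis": "small-molecule-metabolism",
    "carbon-utilization": "small-molecule-metabolism",
    "small-molecule-metabolism": "all-functions",
    "cell-cycle": "all-functions",
}

# Hypothetical gene -> functional class assignments for one organism.
GENE_CLASS = {
    "trpA": "amino-acid-biosynthesis",
    "lacZ": "carbon-utilization",
    "ftsZ": "cell-cycle",
}


def is_a(cls, ancestor):
    """Return True if cls equals ancestor or lies below it in the taxonomy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def genes_in_class(ancestor, genes=None):
    """Genes whose class falls under ancestor, optionally limited to a gene set."""
    candidates = GENE_CLASS if genes is None else genes
    return sorted(
        g for g in candidates
        if g in GENE_CLASS and is_a(GENE_CLASS[g], ancestor)
    )


# Counting: how many genes fall under "small-molecule-metabolism"?
count = len(genes_in_class("small-molecule-metabolism"))

# Intersecting sets: which of a (made-up) up-regulated set are in "cell-cycle"?
hits = genes_in_class("cell-cycle", genes={"ftsZ", "lacZ"})
```

The point of the sketch is that both operations reduce to one ancestor test over the taxonomy, which is why a drill-down search interface comes almost for free once the taxonomy exists.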
9
Ontology
Ontology = Terms + Taxonomy + Slots + Constraints
10
Pathway Tools Ontology: Slots
- Pathway Tools ontology contains 199 slots
- Categories of slots:
  - Meta-data: Creator, Creation-Date
  - Textual data: Common-Name, Synonyms, Comment, Citations
  - Attributes: Molecular-Weight, pI
  - Relationships: Gene, Catalyzes, In-Reaction
- Give stats on how many slots in each of these classes
11
Pathway Tools Ontology: Slots
- Slots introduced at appropriate place in taxonomy
  - Child classes inherit the slot; parent classes do not
- Examples:
  - Proteins: pI, MolWt, Component-Of
    - Polypeptides: Gene
    - Protein-Complexes: Components
  - Reactions: Left, Right, Keq, In-Pathway
  - Pathways: Reaction-List, Predecessor-List
  - Transcription Units: Components
  - Genes: Product, Component-Of
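A minimal sketch of slot inheritance, using the Proteins example from this slide. The `FrameClass` helper is hypothetical and is not how Ocelot or Pathway Tools implements frames; it only shows that a slot introduced on a class is visible to its children and not to its parent or siblings.

```python
class FrameClass:
    """Hypothetical frame class: holds its own slots and inherits the parent's."""

    def __init__(self, name, parent=None, slots=()):
        self.name = name
        self.parent = parent
        self.own_slots = set(slots)

    def slots(self):
        """All slots available on this class: inherited ones plus its own."""
        inherited = self.parent.slots() if self.parent else set()
        return inherited | self.own_slots


# Slot names are taken from the slide; the class layout is a simplification.
proteins = FrameClass("Proteins", slots={"pI", "MolWt", "Component-Of"})
polypeptides = FrameClass("Polypeptides", parent=proteins, slots={"Gene"})
complexes = FrameClass("Protein-Complexes", parent=proteins, slots={"Components"})
```

Here `polypeptides.slots()` includes `pI` (inherited from Proteins) and `Gene` (its own), while `proteins.slots()` does not include `Gene`.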
12
Operations Enabled by Slots
- Store/retrieve attributes of an entity
  - Get pI of protein
  - Get citations associated with pathway
- Traverse network of semantic relationships
  - Find all substrates of all reactions in pathway X
  - Find all genes that encode an enzyme that catalyzes a reaction in pathway X
  - Find all regulons encoding multiple metabolic pathways
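The "find all substrates of all reactions in pathway X" query can be sketched as a two-step slot traversal. The slot names (Reaction-List, Left, Right) come from the slides; the in-memory `KB` dictionary, the frame identifiers, and the `get_slot` helper are assumptions made for illustration, not the Pathway Tools API.

```python
# Hypothetical in-memory KB: frame id -> {slot name -> list of values}.
KB = {
    "TCA-Cycle": {"Reaction-List": ["succ-dehyd-rxn"]},
    "succ-dehyd-rxn": {
        "Left": ["succinate", "FAD"],
        "Right": ["fumarate", "FADH2"],
    },
}


def get_slot(frame, slot):
    """Return the values stored in a slot, or an empty list if absent."""
    return KB.get(frame, {}).get(slot, [])


def substrates_of_pathway(pathway):
    """Collect every compound on either side of every reaction in the pathway."""
    found = set()
    for rxn in get_slot(pathway, "Reaction-List"):
        found.update(get_slot(rxn, "Left"))
        found.update(get_slot(rxn, "Right"))
    return sorted(found)


substrates = substrates_of_pathway("TCA-Cycle")
```

The same pattern extends to the other queries on the slide by following different relationship slots in sequence.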
13
Ontology
Ontology = Terms + Taxonomy + Slots + Constraints
14
Pathway Tools Ontology: Constraints
- Every Pathway Tools slot has associated meta data:
  - Class(es) to which it pertains
    - Keq pertains to Reactions
  - Data type (number, string, frame, etc)
    - Keq data type is number
  - Collection type (list, bag)
    - Keq is not a collection
  - Documentation string
  - Cardinality constraints
    - At most one Keq value
  - Range constraints
  - Taxonomy constraints
    - Values of Left slot of Reactions must be Chemicals
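To make the slot meta-data concrete, here is a small sketch of a constraint checker built around the Keq example. The `SlotSpec` record and `check_frame` function are hypothetical names, and only three of the listed constraint kinds (class, data type, cardinality) are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotSpec:
    """Hypothetical slot meta-data: owning class, value type, max value count."""
    domain: str
    value_type: type
    max_cardinality: Optional[int] = None


# Keq pertains to Reactions, holds a number, and takes at most one value.
SPECS = {
    "Keq": SlotSpec(domain="Reactions", value_type=float, max_cardinality=1),
    "Left": SlotSpec(domain="Reactions", value_type=str),
}


def check_frame(frame_class, slots):
    """Return a list of constraint violations for one frame's slot values."""
    problems = []
    for slot, values in slots.items():
        spec = SPECS.get(slot)
        if spec is None:
            problems.append(f"{slot}: unknown slot")
            continue
        if spec.domain != frame_class:
            problems.append(f"{slot}: does not pertain to {frame_class}")
        if spec.max_cardinality is not None and len(values) > spec.max_cardinality:
            problems.append(f"{slot}: more than {spec.max_cardinality} value(s)")
        for value in values:
            if not isinstance(value, spec.value_type):
                problems.append(f"{slot}: {value!r} has the wrong data type")
    return problems
```

A batch consistency checker of the kind mentioned on the next slide would amount to running such a function over every frame in the database.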
15
Operations Enabled by Constraints
- Constraints make a system “intelligent” because they encode definitions in a machine-understandable fashion
- Automated DB consistency checkers (batch or interactive)
- Schema-driven data input tools
- Subsumption – Compare two concept definitions
16
Pathway Tools Inference Layer
- Commonly used queries implemented as stored procedures
- Infer what is implicitly recorded in the KB
17
Compute Transitive Relationships
[Figure: chain of relationships in the KB. Genes sdhA, sdhB, sdhC, sdhD (on Chrom) link via "product" to polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; these link via "component-of" to Succinate dehydrogenase, which links via "catalyzes" to an Enzymatic-reaction, via "reaction" to succinate + FAD = fumarate + FADH2 (left: succinate, FAD; right: fumarate, FADH2), and via "in-pathway" to TCA Cycle.]
18
Pathway Tools Inference Layer
- Enumerate reactions given alternative definitions of a reaction: all, enzyme, transport, small-mol, smm
- All substrates, all cofactors, all transported chemicals
- Protein tests: Is X a transcription factor, enzyme, transporter
  - Rather than force user to manually assign physiological roles, compute when possible from biochemical function
- Transcription-unit-binding-sites
- Compute in parts hierarchy: monomers-of-protein, components-of-protein, genes-of-protein, modified-forms
- Complex: regulon-of-protein, regulator-proteins-of-transcription-unit
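A sketch of how parts-hierarchy queries such as monomers-of-protein and genes-of-protein might be computed by recursion. The function names echo the slide, but the bodies, the dictionaries, and the pairing of each sdh gene with a subunit are assumptions made for illustration; they are not the Pathway Tools implementation.

```python
# Hypothetical parts hierarchy: complex -> direct components.
COMPONENTS = {
    "succinate-dehydrogenase": [
        "Sdh-flavo", "Sdh-Fe-S", "Sdh-membrane-1", "Sdh-membrane-2",
    ],
}

# Hypothetical polypeptide -> encoding gene (the pairing is assumed).
GENE_OF = {
    "Sdh-flavo": "sdhA",
    "Sdh-Fe-S": "sdhB",
    "Sdh-membrane-1": "sdhC",
    "Sdh-membrane-2": "sdhD",
}


def monomers_of_protein(protein):
    """Recursively expand a complex into its monomers; a monomer yields itself."""
    parts = COMPONENTS.get(protein)
    if not parts:
        return [protein]
    monomers = []
    for part in parts:
        monomers.extend(monomers_of_protein(part))
    return monomers


def genes_of_protein(protein):
    """Genes encoding every monomer of the protein, in component order."""
    return [GENE_OF[m] for m in monomers_of_protein(protein) if m in GENE_OF]


genes = genes_of_protein("succinate-dehydrogenase")
```

Because the recursion bottoms out at monomers, the same call works whether it is given a single polypeptide or a multi-level complex.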
19
What Killer Apps have Ontologies Enabled?
- What comes after pie charts and drill-down interfaces?
20
Terminology – Pathway Tools Software
- PathoLogic
  - Prediction of metabolic network from genome
  - Computational creation of new Pathway/Genome Databases
- Pathway/Genome Editors
  - Distributed curation of PGDBs
  - Distributed object database system, interactive editing tools
- Pathway/Genome Navigator
  - WWW publishing of PGDBs
  - Querying, visualization of pathways, chromosomes, operons
  - Analysis operations
    - Pathway visualization of gene-expression data
    - Global comparisons of metabolic networks
21
BioCyc Collection of Pathway/Genome DBs
- Literature-based Datasets:
  - MetaCyc
  - Escherichia coli (EcoCyc)
- Computationally Derived Datasets:
  - Agrobacterium tumefaciens
  - Caulobacter crescentus
  - Chlamydia trachomatis
  - Bacillus subtilis
  - Helicobacter pylori
  - Haemophilus influenzae
  - Mycobacterium tuberculosis H37Rv
  - Mycobacterium tuberculosis CDC1551
  - Mycoplasma pneumoniae
  - Pseudomonas aeruginosa
  - Saccharomyces cerevisiae
  - Treponema pallidum
  - Vibrio cholerae
- Yellow Underlined = Open Database
- http://BioCyc.org/
22
Pathway/Genome DBs Created by External Users
- Plasmodium falciparum, Stanford University
  - plasmocyc.stanford.edu
- Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington
  - Arabidopsis.org:1555
- Methanococcus jannaschii, EBI
  - Maine.ebi.ac.uk:1555
- Other PGDBs in progress by 20 other users
- Software freely available
- Each PGDB owned by its creator
23
Ontology Reuse
- A holy grail in AI since “ontology” became a buzzword
  - Decrease knowledge acquisition bottleneck
- GO qualifies as a large success in ontology reuse
- Pathway Tools ontology reused across 18 PGDBs
- Pathway Tools algorithms portable across all PGDBs
24
Pathway Tools Algorithms
Visualization and editing tools for following datatypes
- Full Metabolic Map
  - Paint gene expression data on metabolic network; compare metabolic networks
- Pathways
  - Pathway prediction
- Reactions
  - Balance checker
- Compounds
  - Chemical substructure comparison
- Enzymes, Transporters, Transcription Factors
- Genes
- Chromosomes
- Operons
  - Operon prediction; visualize genetic network
25
Inference of Metabolic Pathways
[Figure: PathoLogic data flow]
- Annotated Genomic Sequence: Genes/ORFs, Gene Products, DNA Sequences
- Multi-organism Pathway Database (MetaCyc): Reactions, Pathways, Compounds
- PathoLogic Software: Integrates genome and pathway data to identify putative metabolic networks
- Pathway/Genome Database: Genomic Map, Genes, Gene Products, Reactions, Pathways, Compounds
26
PathoLogic Analysis Phases
- Trial parsing of input data files [few days]
- Initialize schema of new PGDB [3 min]
- Create DB objects for replicons, genes, proteins [5 min]
- Assign enzymes to reactions they catalyze [10 min / 1 week]
  - ferrochelatase
  - glutamate 1-semialdehyde 2,1-aminomutase
  - porphobilinogen deaminase
[Figure: reaction network with compounds A through G and enzymes E1 and E2]
27
PathoLogic Analysis Phases
- From assigned reactions, infer what pathways are present [5 min / few days]
- Define metabolic overview diagram [1 day]
- Define protein complexes [few days]
28
Killer App: Global Consistency Checking of Biochemical Network
- Given:
  - A PGDB for an organism
  - A set of initial metabolites
- Infer:
  - What set of products can be synthesized by the small-molecule metabolism of the organism
- Can known growth medium yield known essential compounds?
Pacific Symposium on Biocomputing p471 2001
29
Algorithm: Forward Propagation
[Figure: forward propagation loop. Labels: Nutrient set, Transport, Metabolite set, Reactants, “Fire” reactions, Products, PGDB reaction pool]
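A minimal sketch of forward propagation as a fixed-point loop: start from the nutrient set, "fire" every reaction whose reactants are all available, add its products to the metabolite set, and repeat until nothing new appears. The function names and the toy reaction pool are hypothetical; the published algorithm also handles transport and other details not modeled here.

```python
def forward_propagate(nutrients, reactions):
    """Return every metabolite reachable from the nutrients.

    reactions is an iterable of (reactants, products) pairs.
    """
    metabolites = set(nutrients)
    changed = True
    while changed:
        changed = False
        for reactants, products in reactions:
            # A reaction fires once all of its reactants are available.
            if set(reactants) <= metabolites and not set(products) <= metabolites:
                metabolites.update(products)
                changed = True
    return metabolites


def missing_essentials(nutrients, reactions, essentials):
    """Essential compounds that the network cannot produce from the nutrients."""
    return sorted(set(essentials) - forward_propagate(nutrients, reactions))


# Toy reaction pool (made-up compounds, not EcoCyc data).
POOL = [
    (["A"], ["B"]),
    (["B", "C"], ["D"]),
    (["E"], ["F"]),
]

reachable = forward_propagate({"A", "C"}, POOL)
gaps = missing_essentials({"A", "C"}, POOL, essentials={"D", "F"})
```

In this toy pool `F` is never produced because `E` is not in the nutrient set; a gap of this kind is what flags a missing nutrient, a missing pathway, or a database error.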
30
Results
- Phase I: Forward propagation
  - 21 initial compounds yielded only half of 38 essential compounds for E. coli
- Phase II: Manually identify
  - Bugs in EcoCyc (e.g., two objects for tryptophan)
  - Missing initial protein substrates (e.g., ACP)
  - Missing pathways in EcoCyc
- Phase III: Forward propagation with 11 more initial metabolites
  - Yielded all 38 essential compounds
31
How to Characterize the Metabolic Network of a Cell?
32
Aggregate Properties of the E. coli Metabolic Network
- EcoCyc is not a complete picture of E. coli metabolism
  - 30% of E. coli genes remain unidentified
- Analysis pertains to pathways of small-molecule metabolism
- Computed with respect to EcoCyc v4.5 (Sep-1998)
- Joint work with Christos Ouzounis of EBI
Genome Research 10:268 2001
33
Enzymes
- 4391 genes in E. coli genome
- 4288 code for proteins
- 676 (15%) gene products form 607 enzymes
- Of the 607 enzymes, 296 are monomers, 311 are multimers
  - 90% of genes for heteromultimers are linked
34
Reactions
- 744 reactions of small-molecule metabolism
  - 582 assigned to at least one pathway
35
Compounds
- 791 substrates in the 744 reactions
- Each reaction contains 4.0 substrates on average
- Each substrate appears in 2.1 reactions
36
Enzyme Modulation
- 805 enzymatic-reaction objects in EcoCyc
  - 80 have physiological inhibitors
  - 22 have physiological activators
  - 17 have both
  - 43% have a modulator
  - 327 require a cofactor or prosthetic group
38
Enzyme-Reaction Associations
- 585 reactions catalyzed by 1 enzyme
- 55 reactions catalyzed by 2 enzymes
- 12 reactions catalyzed by 3 enzymes
- 1 reaction catalyzed by 4 enzymes
- 483 reactions belong to a single pathway
- 99 reactions belong to multiple pathways
- 100 of the 607 E. coli enzymes are multifunctional
39
Pathway Tools Implementation
- Allegro Common Lisp
- Sun and PC platforms
  - Run as window application or WWW server
- Ocelot object database
- 250,000 lines of code
- Lisp-based WWW server at BioCyc.org
  - Lisp process reads URLs from the network and generates GIF+HTML from PGDBs
  - Manages 15 PGDBs
40
Ocelot Knowledge Server Architecture
- Frame data model
  - Classes, instances, inheritance
- Persistent storage via disk files, Oracle DBMS
  - Concurrent development: Oracle
  - Single-user development: disk files
  - Read-only delivery: bundle data into binary program
- Transaction logging facility
- Schema evolution
- Local disk cache to improve Internet performance
J. Intelligent Information Systems 1:155-94 1999
41
GKB Editor
- Browser and editor for KBs and ontologies
- Three editing tools:
  - Taxonomy editor
  - Frame editor
  - Relationships editor
- All operations are schema driven
http://www.ai.sri.com/~gkb/user-man.html
42
The Common Lisp Programming Environment
- Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000)
43
Peter Norvig’s Solution
“I wrote my version in Lisp. It took me about 2 hours (compared to a range of 2-8.5 hours for the other Lisp programmers in the study, 3-25 for C/C++ and 4-63 for Java) and I ended up with 45 non-comment non-blank lines (compared with a range of 51-182 for Lisp, and 107-614 for the other languages). (That means that some Java programmer was spending 13 lines and 84 minutes to provide the functionality of each line of my Lisp program.)”
http://www.norvig.com/java-lisp.html
44
Common Lisp Programming Environment
- Interpreted and/or compiled execution
- Fabulous debugging environment
- High-level language
- Interactive data exploration
- Extensive built-in libraries
- Dynamic redefinition
- Find out more!
  - ALU.org -- Association of Lisp Users
  - BioLisp.org
45
Pathway Exchange Ontology
- BioPathways group developing ontology and format for exchange of pathway data
  - Metabolic pathways
  - Signaling pathways
  - Protein interactions
- Moving upwards from chemicals, proteins, to reactions and pathways
- Working to extend CML
- Draft ontology at http://www.ai.sri.com/pkarp/misc/interactions.html
46
Summary
- Pathway Tools apps:
  - Predict pathways and generate PGDBs
  - Visualization and editing tools
  - Paint gene expression data; compare entire pathway maps
  - Global consistency checking of metabolic network
  - Characterize metabolic and genetic networks
- New killer apps:
  - Interoperability
  - Text mining
  - Bake-off for genome annotation pipelines
47
BioCyc and Pathway Tools Availability
- WWW BioCyc freely available to all
  - BioCyc.org
  - Six BioCyc DBs openly available to all
- BioCyc DBs freely available to non-profits
  - Flatfiles downloadable from BioCyc.org
  - Binary executable:
    - Sun UltraSparc-170 w/ 64MB memory
    - PC, 400MHz CPU, 64MB memory, Windows-98 or newer
  - PerlCyc API
- Pathway Tools freely available to non-profits
48
Acknowledgements
- SRI
  - Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud
- EcoCyc Project
  - Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier
- MetaCyc Project
  - Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville
- Stanford
  - Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh
- Funding sources:
  - NIH National Center for Research Resources
  - NIH National Institute of General Medical Sciences
  - NIH National Human Genome Research Institute
  - Department of Energy Microbial Cell Project
  - DARPA BioSpice, UPC
BioCyc.org