SRI International Bioinformatics
EcoCyc, MetaCyc, and the Pathway Tools Software
Peter D. Karp, Ph.D.
Bioinformatics Research Group, SRI International
BioCyc.org, EcoCyc.org, MetaCyc.org
MetaCyc Family of Pathway/Genome Databases
- 1,700+ databases from multiple institutions
- Cover all domains of life, with a microbial emphasis
- All DBs derived from MetaCyc via computational pathway prediction
- Common schema
- Common controlled vocabularies
- Common methodologies
Archives of Toxicology 2011
Curated Databases Within the MetaCyc Family
Database   Organism        Organization       Curated From
MetaCyc    Multiorganism   SRI                26,000
EcoCyc     E. coli         SRI                21,000
HumanCyc   H. sapiens      SRI
AraCyc     A. thaliana     Carnegie Instit.   2,282
YeastCyc   S. cerevisiae   Stanford Univ      565
MouseCyc   M. musculus     Jackson Labs
BioCyc Collection of 1,100 Pathway/Genome Databases
A Pathway/Genome Database (PGDB) combines information about:
- Pathways, reactions, substrates
- Enzymes, transporters
- Genes, replicons
- Transcription factors/sites, promoters, operons
Tier 1: Literature-derived PGDBs
- MetaCyc
- EcoCyc: Escherichia coli K-12
Tier 2: Computationally derived PGDBs, some curation
- HumanCyc, BsubCyc
- Mycobacterium tuberculosis
Tier 3: Computationally derived PGDBs, no curation
- The remainder
EcoCyc Project – EcoCyc.org
E. coli Encyclopedia
- Review-level model-organism database for E. coli
- Tracks evolving annotation of the E. coli genome and cellular networks
- The two paradigms of EcoCyc
"Multi-dimensional annotation of the E. coli K-12 genome"
- Positions of genes; functions of gene products – 76% / 66% exp
- Gene Ontology terms; MultiFun terms
- Gene product summaries and literature citations
- Evidence codes
- Multimeric complexes
- Metabolic pathways
- Regulation of gene expression and of protein activity
Nuc. Acids Res. 35: ASM News 70: Science 293:2040
Karp, Gunsalus, Collado-Vides, Paulsen
EcoCyc = E. coli Dataset + Pathway/Genome Navigator
EcoCyc v15.0 (URL: EcoCyc.org)
Genes: 4,489
Proteins: 4,479
Complexes: 895
RNAs: 285
Reactions:
  Metabolic: 1,446
  Transport: 287
Pathways: 260
Compounds: 1,830
Regulation:
  Operons: 3,409
  Trans Factors: 206
  Promoters: 1,878
  TF Binding Sites: 2,394
  Reg Interactions: 5,345
Citations: 21,000
EcoCyc on the iPhone
PortEco.org
- EcoCyc + PortEco = E. coli model-organism database
- Query multiple E. coli databases simultaneously
- E. coli gene expression archive
- E. coli Wiki
- ~40 E. coli and Shigella databases available at BioCyc.org
MetaCyc: Metabolic Encyclopedia
- Describe a representative sample of every experimentally determined metabolic pathway
- Describe properties of metabolic enzymes
- Literature-based DB with extensive references and commentary
- Pathways, reactions, enzymes, substrates
- MetaCyc vs BioCyc: Experimentally elucidated pathways
- Jointly developed by
  - P. Karp, R. Caspi, C. Fulcher, SRI International
  - L. Mueller, A. Pujar, Boyce Thompson Institute
  - S. Rhee, P. Zhang, Carnegie Institution
Nucleic Acids Research 2010
Applications of MetaCyc
- Reference source on metabolic pathways and enzymes
- Predict pathways from genomes
- Metabolic engineering
  - Find desired metabolic pathways and reactions
  - Find enzymes with desired activities, regulatory properties
  - Determine cofactor requirements
MetaCyc Data: Version 15.4
Pathways          1,747
Reactions         9,460
Enzymes           7,424
Small Molecules   9,188
Organisms         2,170
Citations         29,900
Comparison with KEGG
KEGG vs MetaCyc: Reference pathway collections
- KEGG maps are not pathways Nuc Acids Res 34:
  - KEGG maps contain multiple biological pathways
  - KEGG maps are composites of pathways in many organisms; they do not identify which specific pathways were elucidated in which organisms
  - Two genes chosen at random from a BioCyc pathway are more likely to be related according to genome context methods than two genes chosen from a KEGG pathway
- KEGG has no literature citations, no comments, less enzyme detail
- KEGG assigns half as many reactions to pathways as MetaCyc
KEGG vs organism-specific PGDBs
- KEGG does not curate or customize pathway networks for each organism
- Highly curated PGDBs now exist for important organisms such as E. coli, yeast, mouse, Arabidopsis
Comparison of Pathway Tools to KEGG
Inference tools
- KEGG does not predict presence or absence of pathways
- KEGG lacks pathway hole filler, operon predictor
Curation tools
- KEGG does not distribute curation tools
- No ability to customize pathways to the organism
- Pathway Tools schema much more comprehensive
Visualization and analysis
- KEGG does not perform automatic pathway layout
- KEGG metabolic-map diagram extremely limited
- No comparative pathway analysis
EcoCyc and MetaCyc
- Review-level databases
- Data derived primarily from biomedical literature
  - Manual entry by staff curators
  - Updates by staff curators only
- DBMS: Frame knowledge representation system
- Data validation
  - Consistency constraints
  - Lisp programs that verify other semantic relationships
    - Unbalanced chemical reactions
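The check for unbalanced chemical reactions can be illustrated with a small sketch. This is not the Lisp code used by EcoCyc or MetaCyc; it is a minimal Python illustration that assumes plain formulas without parentheses or charges, and the function names are hypothetical.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no parentheses, no charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def side_atoms(side):
    """Sum atom counts over one reaction side given as [(coefficient, formula), ...]."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in parse_formula(formula).items():
            total[element] += coefficient * n
    return total

def imbalance(substrates, products):
    """Return {element: products minus substrates} for every element that does not balance."""
    left, right = side_atoms(substrates), side_atoms(products)
    return {e: right[e] - left[e] for e in set(left) | set(right) if right[e] != left[e]}

# Glucose combustion: C6H12O6 + 6 O2 -> 6 CO2 + 6 H2O
balanced = imbalance([(1, "C6H12O6"), (6, "O2")], [(6, "CO2"), (6, "H2O")])
# The same reaction with the water coefficient mistyped as 5
unbalanced = imbalance([(1, "C6H12O6"), (6, "O2")], [(6, "CO2"), (5, "H2O")])
print(balanced)
print(unbalanced)
```

An empty result means the reaction balances; a non-empty result names the elements a curator would need to inspect.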
Pathway Tools Software
Pathway Tools Software
[Diagram components: Annotated Genome, PathoLogic, Pathway/Genome Database, Pathway/Genome Editors, Pathway/Genome Navigator, Genome-Scale Flux Model]
Briefings in Bioinformatics 11:
Pathway Tools Software: PathoLogic
- Computational creation of new Pathway/Genome Databases
- Transforms genome into Pathway Tools schema and layers inferred information above the genome
- Predicts operons
- Predicts metabolic network
- Predicts which genes code for missing enzymes in metabolic pathways
- Infers transport reactions from transporter names
Pathway Tools Software: Pathway/Genome Editors
- Interactively update PGDBs with graphical editors
- Support geographically distributed teams of curators with an object database system
- Gene and protein editor
- Reaction editor
- Compound editor
- Pathway editor
- Operon editor
- Publication editor
Pathway Tools Software: Pathway/Genome Navigator
Querying and visualization of:
- Pathways
- Reactions
- Metabolites
- Genes/Proteins/RNA
- Regulatory interactions
- Chromosomes
Two modes of operation:
- Web mode
- Desktop mode
- Most functionality is shared, but each mode has unique functionality
Pathway Tools Implementation Details
- Platforms: Macintosh, PC/Linux, and PC/Windows
- Same binary can run as desktop app or Web server
- Production-quality software
  - Version control
  - Two regular releases per year
  - Extensive quality assurance
  - Extensive documentation
  - Auto-patch
  - Automatic DB-upgrade
- 480,000 lines of Lisp code
Why Do We Code in Common Lisp?
- Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11: )
  - The average Lisp program ran 33 times faster than the average Java program
  - The average Lisp program was written 5 times faster than the average Java program
- Roberts compared Java and Lisp implementations of a Domain Name System (DNS) resolver
  - The Lisp version had ½ as many lines of code
Cellular Overview Diagram
- Combines metabolic map and transporters
- Automatically generated for each organism
- Zoomable, queryable
- Web-based and desktop
- BioCyc.org
  - Tools → Cellular Overview
  - Tools → Regulatory Overview
  - Fastest with Safari, Chrome, Firefox
Omics Data Graphing on Cellular Overview
Genome Overview
Genome Poster
Regulatory Overview and Omics Viewer
- Show regulatory relationships among gene groups
Genome Browser
- ChIP-chip data shown in graph track
Enrichment Analysis
"My experiments yielded a set of genes/metabolites. What do they have in common?"
Given a set of genes:
- What GO terms are statistically over-represented in that set?
- What metabolic pathways are over-represented?
- What transcriptional regulators are over-represented?
Given a set of metabolites:
- What metabolic pathways are statistically over-represented in that set?
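One common way to score over-representation is a one-sided hypergeometric test. The slides do not say which statistic Pathway Tools uses, so the sketch below is an assumption for illustration only; the gene identifiers, category names, and function names are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` belong to the category."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / total

def enrichment(gene_set, categories, background):
    """Return {category: p-value} for every category that the gene set hits."""
    background = set(background)
    gene_set = set(gene_set) & background
    results = {}
    for name, members in categories.items():
        members = set(members) & background
        hits = len(gene_set & members)
        if hits:
            results[name] = hypergeom_pvalue(
                len(background), len(members), len(gene_set), hits)
    return results

# Hypothetical data: 20 background genes, two categories of 5 genes each.
background = [f"g{i}" for i in range(20)]
categories = {"pathway_A": background[:5], "pathway_B": background[5:10]}
result = enrichment(["g0", "g1", "g2", "g10"], categories, background)
print(result)
```

Here three of the four query genes fall in `pathway_A`, giving a p-value of 155/4845 (about 0.032); `pathway_B` receives no hits and is omitted from the result.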
Automated Generation of Metabolic Flux Models from PGDBs
Joint work with Mario Latendresse
Goals
- Decrease the time required to construct flux-balance analysis (FBA) models from 9-12 months to several weeks
- Create richer FBA models that are tightly coupled to genome and regulatory information
- Make FBA models and results more transparent
Approach: Derive FBA Models from PGDBs
- Store and update metabolic model within Pathway Tools
- Export to constraint solver for model execution/solving
- Fast generation of metabolic model from annotated genome
- Pathway Tools schema
  - Associate a wealth of information with each metabolic model
  - Unique identifiers and controlled vocabulary for model components
- Tools for querying and visualization of metabolic models
- Tools for model debugging and analysis
  - Reaction balance checking
  - Dead-end metabolite analysis
  - Visualize reaction flux using cellular overview
  - Multiple gap filling
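Dead-end metabolite analysis can be sketched as follows. The definition used here (a metabolite that is only produced or only consumed, and is not a boundary compound) is a simplifying assumption, not necessarily the one Pathway Tools applies; all reactions are treated as irreversible, and the reaction and metabolite names are made up.

```python
def dead_end_metabolites(reactions, boundary=()):
    """reactions: {name: (substrates, products)}, each read left to right.
    Returns metabolites that are only produced or only consumed,
    excluding boundary compounds (nutrients, secretions, biomass)."""
    consumed, produced = set(), set()
    for substrates, products in reactions.values():
        consumed.update(substrates)
        produced.update(products)
    boundary = set(boundary)
    return {
        "only_produced": sorted(produced - consumed - boundary),
        "only_consumed": sorted(consumed - produced - boundary),
    }

# Hypothetical toy network
reactions = {
    "R1": (["A"], ["B"]),
    "R2": (["B"], ["C"]),
    "R3": (["B"], ["D"]),  # D is made but never used
    "R4": (["E"], ["C"]),  # E is used but never made
}
report = dead_end_metabolites(reactions, boundary=["A", "C"])
print(report)
```

In this toy network, `D` and `E` are flagged: any reaction touching them cannot carry steady-state flux until the gap is curated.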
FBA Generation Module: Inputs
[Diagram: the inputs are Nutrients, Biomass, Secretions, and a Reaction List]
FBA Formulation as Linear Program
Boundary reactions:
- Exchange fluxes for nutrients and secretions
- Biomass reaction: L-arginine … + GTP … + … → biomass
For each internal metabolite M:
- R1: A + M → B
- R2: C + M → D
- R3: E + M → F + G
- R4: X + Y → M
- R5: W → M + Z
Consuming fluxes balance producing fluxes:
- R1 + R2 + R3 = R4 + R5
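The steady-state constraint for metabolite M can be derived mechanically from the reaction list. The sketch below uses the five reactions from this slide; the data layout and function names are illustrative assumptions, and the flux values are arbitrary numbers chosen to satisfy the balance.

```python
# Each reaction: (substrates, products), mapping metabolite -> stoichiometric coefficient
reactions = {
    "R1": ({"A": 1, "M": 1}, {"B": 1}),
    "R2": ({"C": 1, "M": 1}, {"D": 1}),
    "R3": ({"E": 1, "M": 1}, {"F": 1, "G": 1}),
    "R4": ({"X": 1, "Y": 1}, {"M": 1}),
    "R5": ({"W": 1}, {"M": 1, "Z": 1}),
}

def balance_row(metabolite, reactions):
    """Net coefficient of `metabolite` in each reaction:
    positive if produced, negative if consumed."""
    row = {}
    for name, (substrates, products) in reactions.items():
        coeff = products.get(metabolite, 0) - substrates.get(metabolite, 0)
        if coeff:
            row[name] = coeff
    return row

def is_steady_state(row, fluxes, tol=1e-9):
    """True if production and consumption of the metabolite cancel out."""
    return abs(sum(c * fluxes[r] for r, c in row.items())) <= tol

row_M = balance_row("M", reactions)
print(row_M)  # {'R1': -1, 'R2': -1, 'R3': -1, 'R4': 1, 'R5': 1}

# Arbitrary fluxes with R1 + R2 + R3 = R4 + R5 (3.5 on each side)
fluxes = {"R1": 1.0, "R2": 2.0, "R3": 0.5, "R4": 3.0, "R5": 0.5}
print(is_steady_state(row_M, fluxes))
```

Setting the weighted sum of this row to zero is the same statement as R1 + R2 + R3 = R4 + R5; a linear program contains one such row per internal metabolite.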
FBA Model Execution
- Runs SCIP solver on .lp file
  - Konrad-Zuse-Zentrum für Informationstechnik Berlin
- Interpret SCIP output
  - Determine if SCIP found a solution
  - Map fluxes to PGDB reactions
- Display resulting fluxes on the Cellular Overview
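To make the .lp step concrete, here is a hedged sketch that writes a tiny flux-balance problem as text in the CPLEX LP file format, which SCIP can read. The slides do not show the file that Pathway Tools actually generates, so the layout, the `write_lp` helper, and the variable names (`v_uptake`, `v_ab`, `v_biomass`) are all assumptions for illustration.

```python
def write_lp(objective, rows, bounds):
    """objective: flux variable to maximize.
    rows: {metabolite: {flux_var: coefficient}}, one balance row per metabolite.
    bounds: {flux_var: (lower, upper)}.
    Returns the problem as LP-format text."""
    def linear(expr):
        terms = []
        for var, coeff in expr.items():
            sign = "-" if coeff < 0 else "+"
            terms.append(f"{sign} {abs(coeff)} {var}")
        return " ".join(terms)

    lines = ["Maximize", f" obj: {objective}", "Subject To"]
    for metabolite, expr in rows.items():
        lines.append(f" bal_{metabolite}: {linear(expr)} = 0")
    lines.append("Bounds")
    for var, (lower, upper) in bounds.items():
        lines.append(f" {lower} <= {var} <= {upper}")
    lines.append("End")
    return "\n".join(lines) + "\n"

# Hypothetical chain: uptake produces A, A is converted to B, B is drained into biomass
rows = {
    "A": {"v_uptake": 1, "v_ab": -1},
    "B": {"v_ab": 1, "v_biomass": -1},
}
bounds = {"v_uptake": (0, 10), "v_ab": (0, 1000), "v_biomass": (0, 1000)}
lp_text = write_lp("v_biomass", rows, bounds)
print(lp_text)
```

The resulting text could be saved to a file and handed to the solver; mapping the returned flux values back to PGDB reactions then only requires that the flux-variable names be derived from reaction identifiers.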
Model Debugging via Multiple Gap Filling
- Most FBA models are not initially solvable because of incomplete or incorrect information
- Use meta-optimization to postulate alterations to a model to render it solvable
- Each alteration has an associated cost; minimize cost of alterations
- Formulate as MILP and submit to SCIP
Multiple Gap Filling of FBA Models
Reaction gap filling (Kumar et al, BMC Bioinf :212):
- Reverse directionality of selected reactions
- Add a minimal number of reactions from MetaCyc to the model to enable a solution
- Reaction cost is a function of reaction taxonomic range
Metabolite gap filling:
- Postulate additional nutrients and secretions
Partial solutions:
- Identify maximal subset of biomass components for which model can yield positive production rates
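The idea of adding a minimum-cost set of candidate reactions can be shown with a toy example. This sketch deliberately swaps in two simplifications: brute-force enumeration instead of a MILP solved by SCIP, and simple reachability instead of flux balance. It is exponential in the number of candidates and is only meant to show the cost trade-off; all names and costs are hypothetical.

```python
from itertools import combinations

def producible(nutrients, reactions):
    """Forward propagation: a reaction fires once all its substrates are available."""
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available

def gap_fill(nutrients, biomass, model, candidates):
    """candidates: {name: (cost, substrates, products)}.
    Returns (total_cost, names) for the cheapest subset of candidates that makes
    every biomass component producible, or None if no subset works."""
    best = None
    names = sorted(candidates)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            cost = sum(candidates[n][0] for n in subset)
            if best is not None and cost >= best[0]:
                continue  # cannot improve on the best solution found so far
            added = [candidates[n][1:] for n in subset]
            if set(biomass) <= producible(nutrients, model + added):
                best = (cost, list(subset))
    return best

# Hypothetical model: N is the only nutrient; X and Y are biomass components.
model = [(["N"], ["A"]), (["B"], ["X"]), (["C"], ["Y"])]
candidates = {
    "c1": (1, ["A"], ["B"]),
    "c2": (1, ["B"], ["C"]),
    "c3": (5, ["A"], ["B", "C"]),  # one costly reaction that also closes both gaps
    "c4": (1, ["Z"], ["C"]),       # useless: Z is never available
}
solution = gap_fill(["N"], ["X", "Y"], model, candidates)
print(solution)
```

The search prefers the two cheap reactions `c1` and `c2` (total cost 2) over the single expensive reaction `c3` (cost 5), mirroring how per-reaction costs steer which alterations are proposed.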
MILP Objective Function for Gap Filling
$$\sum_i w_b B_i + \sum_a w_r R_a + \sum_b w_t R_b + \sum_c w_m R_c + \sum_k w_s S_k + \sum_p w_n N_p$$
where $w_b > 0$ and $w_r, w_t, w_m, w_s, w_n < 0$ are weights for biomass, reactions (2), secretions, and nutrients, and $B_i, R_a, R_b, R_c, S_k, N_p$ are binary variables.
Results – FBA Model of Human Metabolism
- 46 biomass compounds
- 13 nutrients
- 2 secretions
- 207 reactions carry non-zero flux
Gap Filler Suggestions
- Addition of 8 new reactions from MetaCyc; 4 supported by literature research
- Reversal of 4 reactions confirmed by literature searches
- Enzyme curated into wrong compartment
- FBA analysis identified an amino-acid biosynthetic pathway that should not have been present in HumanCyc
- Further issues identified by dead-end metabolite analysis and reachability analysis
Comparative Analysis
- Via Cellular Overview
- Comparative genome browser
- Comparative pathway table
- Comparative analysis reports
  - Compare reaction complements
  - Compare pathway complements
  - Compare transporter complements
Advanced Query Form
- Intuitive construction of complex database queries with the power of SQL
Work in Progress
- Computation of reaction atom mappings
- Program to generate metabolic pathways that synthesize a target compound from a feedstock compound
How to Learn More
- BioCyc.org Help menu
- BioCyc Webinars
  - Biocyc.org/webinar.shtml
- Publications page
  - Biocyc.org/publications.shtml
- Tutorials held at SRI
  - Next week: FBA