Overview of the Pathway Tools Software and Pathway/Genome Databases
Peter D. Karp
Bioinformatics Research Group, SRI International
Pathway/Genome Database: Integrating Genomic and Biochemical Data
[Diagram of a cell linking: chromosomes and plasmids; genes; proteins; reactions; pathways; compounds; operons, promoters, and DNA binding sites]
Key Functionality
- Pathway analysis
  - Prediction of pathways from genomes
  - Comparative pathway analysis
- Ongoing curation of PGDBs
- WWW publishing of PGDBs
- Analysis of gene expression data
Tools and Datasets
- PathoLogic: create PGDBs
- Pathway/Genome Navigator: visualize, query, and analyze PGDBs
- Editors: update PGDBs
[Diagram: the three tools arranged around a PGDB of pathways and genes]
PathoLogic Pathway Predictor
[Diagram labels: set of annotated genes; MetaCyc PGDB; pathway prediction; new PGDB; reports]
Prediction of Pathways from Genomes
[Diagram labels: DNA sequence; list of genes/ORFs; list of gene products; annotated genome; PathoLogic; Pathway/Genome Database containing genomic map, genes, proteins, reactions, compounds, pathways, and metabolic network]
MetaCyc Overview
- Meta Metabolic Encyclopedia
- 439 pathways, 1095 enzymes, 4217 reactions
  - 173 E. coli pathways
- Literature-based DB with extensive references and commentary
- Pathways, reactions, enzymes, substrates
- Editor in chief: Dr. Monica Riley
Pathway/Genome Navigator
- Query and visualization tools for PGDBs
  - Metabolic pathways, reactions, compounds
  - Enzymes, transporters, transcription factors
  - Genome maps, genes, operons, promoters, DNA sites
  - Retrieve nucleotide and DNA sequences
  - Perform BLAST searches
- Runs as an application on Solaris, Windows
- Runs as a WWW server on Solaris
- Query and comparative analysis functions
Interactive Editing Tools
- Pathway editor
- Reaction editor
- Gene editor
- Enzyme editor
- Compound editor
- Transcription unit editor
- Facilitate updates to PGDBs
  - Improved computational predictions
  - Literature-based data
- Record citations, comments, evidence, history
Pathway Views of Expression Data
- Import gene expression data
- Compute expression ratios
- Obtain pathway-based visualizations of data
  - Numerical spectrum of expression values mapped to a color spectrum
  - Steps of the overview painted with the color corresponding to the expression level(s) of the genes that encode the enzyme(s) for that step
  - Absolute or relative expression values
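The mapping from expression values to colors can be illustrated with a small sketch. This is not the Pathway Tools implementation: the log2 scale, the clamping range, the blue-to-red spectrum, and the function names are all assumptions chosen for illustration, and the expression values are made up.

```python
import math

def expression_ratio(experiment, control):
    """Relative expression of one gene: experiment value over control value."""
    return experiment / control

def ratio_to_rgb(ratio, max_abs_log2=3.0):
    """Map a ratio onto a blue-to-red spectrum (assumed scale, not the tool's)."""
    level = math.log2(ratio)
    # Clamp so that extreme ratios saturate at the ends of the spectrum.
    level = max(-max_abs_log2, min(max_abs_log2, level))
    t = (level + max_abs_log2) / (2 * max_abs_log2)  # 0.0 = blue, 1.0 = red
    return (round(255 * t), 0, round(255 * (1 - t)))

def step_colors(step_genes, ratios):
    """One color per gene that encodes an enzyme for a pathway step."""
    return [ratio_to_rgb(ratios[g]) for g in step_genes if g in ratios]

# Made-up measurements for two genes of one step.
ratios = {
    "sdhA": expression_ratio(800.0, 100.0),
    "sdhB": expression_ratio(50.0, 400.0),
}
print(step_colors(["sdhA", "sdhB"], ratios))
```

A step with several genes gets several colors here, which mirrors the "expression level(s)" wording of the slide; genes without data are simply skipped.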
Environment for Computational Exploration of Genomes
- Powerful ontology opens many facets of the biology to computational exploration
- Global characterization of metabolic network
- Analysis of interface between transport and metabolism
- Nutrient analysis of metabolic network
PathoLogic Pathway Predictor
PathoLogic Pathway Predictor (PPP)
- Introduction
- Description of PPP execution
- Inputs to PPP
- Using the GUI to create a pathway/genome database
- Output from PPP
- Caveats
PathoLogic Goals
- Create the set of class frames that encode the DB schema
  - Copied from MetaCyc
- Create the appropriate set of instance frames
  - Genes, genetic elements, and proteins are created from the input files
  - Substrates, reactions, and pathways are copied from the reference database
- Interconnect frames in a manner that accurately reflects their semantic relationships
PathoLogic Input/Output
- Inputs:
  - File listing genetic elements
  - Files containing DNA sequence for each genetic element
  - Files containing annotation for each genetic element
  - MetaCyc database
- Output:
  - Pathway/genome database for the subject organism
  - Directory tree for the subject organism
  - Reports that summarize:
    - Evidence contained in the input genome for the presence of reference pathways
    - Reactions missing from inferred pathways
Inputs to PathoLogic Pathway Predictor
- genetic-elements.dat
- Sequence files
- GenBank file format
- PathoLogic format
- Directory structure
genetic-elements.dat
ID TEST-CHROM-1
NAME Chromosome 1
TYPE :CHRSM
CIRCULAR? N
ANNOT-FILE chrom1.pf
SEQ-FILE chrom1.fsa
//
ID TEST-CHROM-2
NAME Chromosome 2
CIRCULAR? N
ANNOT-FILE /mydata/chrom2.gbk
SEQ-FILE /mydata/chrom2.fna
//
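A record layout like the one above is easy to read programmatically. The following is a minimal sketch, not the parser PathoLogic itself uses: the function names are hypothetical, and it assumes each line holds an attribute name followed by whitespace and a value, with `//` closing a record.

```python
def parse_genetic_elements(text):
    """Return one dict per record; records end at a line containing only '//'."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            if current:
                records.append(current)
            current = {}
            continue
        # Attribute name first, then the value (which may contain spaces).
        parts = line.split(None, 1)
        current[parts[0]] = parts[1] if len(parts) > 1 else ""
    if current:
        records.append(current)
    return records

def annotation_format(path):
    """Guess the annotation format from the file suffix (.pf or .gbk)."""
    if path.endswith(".pf"):
        return "PathoLogic"
    if path.endswith(".gbk"):
        return "GenBank"
    return "unknown"

sample = """ID TEST-CHROM-1
NAME Chromosome 1
TYPE :CHRSM
CIRCULAR? N
ANNOT-FILE chrom1.pf
SEQ-FILE chrom1.fsa
//
ID TEST-CHROM-2
NAME Chromosome 2
CIRCULAR? N
ANNOT-FILE /mydata/chrom2.gbk
SEQ-FILE /mydata/chrom2.fna
//
"""

for element in parse_genetic_elements(sample):
    print(element["ID"], annotation_format(element["ANNOT-FILE"]))
```

Keeping each record as a plain dictionary makes it simple to check, before a trial parse, that every genetic element names both an annotation file and a sequence file.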
File Naming Conventions
- One pair of sequence and annotation files for each genetic element
- Sequence files: FASTA format
  - Suffix .fsa or .fna
- Annotation file:
  - GenBank format: suffix .gbk
  - PathoLogic format: suffix .pf
GenBank File Format
- Accepted feature types:
  - CDS, tRNA, rRNA, misc_RNA
- Accepted qualifiers ([req] = required, [recm] = recommended, [opt] = optional):
  - /label: unique ID [recm]
  - /gene: gene name [req]
  - /product [req]
  - /EC_number [recm]
  - /product_comment [opt]
  - /gene_comment [opt]
  - /alt_name: synonyms [opt]
- For multifunctional proteins, put each function in a separate /product line
Typical Problems Using GenBank Files With PathoLogic
- Wrong qualifier names used
- Extraneous information in a given qualifier
- Check results of trial parse carefully
PathoLogic File Format
- Each record starts with a line containing an ID attribute
- Tab delimited
- Each record ends with a line containing //
- One attribute-value pair is allowed per line
  - Use multiple FUNCTION lines for multifunctional proteins
- Lines starting with ‘;’ are comment lines
- Valid attributes are:
  - ID, NAME, SYNONYM
  - STARTBASE, ENDBASE, GENE-COMMENT
  - FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
  - DBLINK
PathoLogic File Format
ID	TP0734
NAME	deoD
STARTBASE
ENDBASE
FUNCTION	purine nucleoside phosphorylase
DBLINK	PID:g
PRODUCT-TYPE	P
GENE-COMMENT	similar to GP: percent identity: 57.51; identified by sequence similarity; putative
//
ID	TP0735
NAME	gltA
STARTBASE
ENDBASE
FUNCTION	glutamate synthase
DBLINK	PID:g
PRODUCT-TYPE	P
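The rules of the PathoLogic file format (tab-delimited pairs, `//` terminators, `;` comments, repeated FUNCTION lines) can be checked with a short reader. This is a sketch under stated assumptions, not the PathoLogic parser: the function name is hypothetical, every attribute is stored as a list so that repeated lines are kept, and the sample record (gene name, coordinates, functions) is invented.

```python
VALID_ATTRIBUTES = {
    "ID", "NAME", "SYNONYM",
    "STARTBASE", "ENDBASE", "GENE-COMMENT",
    "FUNCTION", "PRODUCT-TYPE", "EC", "FUNCTION-COMMENT",
    "DBLINK",
}

def parse_pathologic(text):
    """Parse PathoLogic-format text into a list of {attribute: [values]} dicts."""
    records, current = [], None
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip() or raw.startswith(";"):
            continue  # blank line or comment line
        if raw.strip() == "//":
            if current is not None:
                records.append(current)
            current = None
            continue
        attribute, _, value = raw.partition("\t")
        attribute = attribute.strip()
        if attribute not in VALID_ATTRIBUTES:
            raise ValueError(f"line {number}: unknown attribute {attribute!r}")
        if attribute == "ID":
            if current is not None:
                records.append(current)  # tolerate a missing '//' (a choice made here)
            current = {}
        elif current is None:
            raise ValueError(f"line {number}: a record must start with ID")
        current.setdefault(attribute, []).append(value.strip())
    if current is not None:
        records.append(current)
    return records

# Invented record for a bifunctional protein.
sample = "\n".join([
    "; hypothetical example record",
    "ID\tGENE-1",
    "NAME\texA",
    "STARTBASE\t100",
    "ENDBASE\t1300",
    "FUNCTION\tfirst function",
    "FUNCTION\tsecond function",
    "PRODUCT-TYPE\tP",
    "//",
])

print(parse_pathologic(sample)[0]["FUNCTION"])
```

Rejecting unknown attribute names early mirrors the advice to check trial-parse results carefully, since a misspelled attribute would otherwise be dropped silently.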
Using the PPP GUI to Create a Pathway/Genome Database
- Input project information
  - Organism -> Create New
- Trial parse
  - Build -> Trial Parse
- Build pathway/genome database
  - Build -> Automated Build
- Manual polishing
  - Refine -> Resolve Ambiguous Name Matches
  - Refine -> Assign Modified Proteins
  - Refine -> Create Protein Complexes
  - Refine -> Run Consistency Checker
  - Refine -> Update Overview
PathoLogic Command Menus
- Organism
  - Select
  - Create New
  - Save KB
  - Revert KB
  - Reinitialize KB
  - Exit
- Build
  - Trial Parse
  - Automated Build
- Refine
  - Resolve Ambiguous Name Matches
  - Assign Modified Proteins
  - Create Protein Complexes
  - Re-run Name Matcher
  - Rescore Pathways
  - Run Consistency Checker
  - Update Overview
Input Project Information
PathoLogic PP Parse Output
Enzyme Name to Reaction Mapping
Enzyme Name Matching Tool
- Dictionary of enzyme names assembled from:
  - All metabolic reactions found in MetaCyc
  - Two files that map synonyms not found in MetaCyc to reaction names:
    - System file (pangea-enzyme-mappings.dat)
    - User-supplied file (local-enzyme-mappings.dat)
- Location of sources:
  - $GPROOT/pathologic/$VERSION-NUMBER/data
Enzyme Name Matcher
- Matches on full enzyme name
- Match is case-insensitive and removes the punctuation characters “ -_(){}',:”
- Also matches after removal of prefixes and suffixes such as:
  - “Putative”, “Hypothetical”, etc.
  - alpha|beta|…|catalytic|inducible chain|subunit|component
  - Parenthetical gene name
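The matching rules above can be approximated in a few lines. This is a sketch of the idea, not the PathoLogic name matcher: the function names, the exact order in which prefixes and suffixes are stripped, the limited lists of qualifiers, and the reaction identifiers in the example are all assumptions.

```python
import re

PUNCTUATION = " -_(){}',:"
QUALIFIER_PREFIXES = ("putative", "hypothetical")  # the slide's list continues with "etc."
SUFFIX_RE = re.compile(
    r"[\s,]+(alpha|beta|catalytic|inducible)\s+(chain|subunit|component)$",
    re.IGNORECASE,
)
TRAILING_PAREN_RE = re.compile(r"\s*\([^()]*\)\s*$")

def canonical(name):
    """Lower-case the name and drop the listed punctuation characters."""
    return "".join(ch for ch in name.lower() if ch not in PUNCTUATION)

def candidate_keys(name):
    """Keys to try: the full name first, then the name with qualifiers removed."""
    keys = [canonical(name)]
    stripped = TRAILING_PAREN_RE.sub("", name.strip())  # parenthetical gene name
    for prefix in QUALIFIER_PREFIXES:
        if stripped.lower().startswith(prefix + " "):
            stripped = stripped[len(prefix) + 1:]
            break
    stripped = SUFFIX_RE.sub("", stripped)
    key = canonical(stripped)
    if key not in keys:
        keys.append(key)
    return keys

def match_enzyme(name, dictionary):
    """Return the dictionary entry for the first key that matches, else None."""
    for key in candidate_keys(name):
        if key in dictionary:
            return dictionary[key]
    return None

# Hypothetical dictionary: canonical enzyme name -> made-up reaction id.
dictionary = {
    canonical("purine-nucleoside phosphorylase"): "RXN-EXAMPLE-1",
    canonical("succinate dehydrogenase"): "RXN-EXAMPLE-2",
}
print(match_enzyme("Putative purine nucleoside phosphorylase (deoD)", dictionary))
print(match_enzyme("Succinate dehydrogenase, catalytic subunit", dictionary))
```

Trying the full name before the stripped form keeps an exact dictionary entry from being shadowed by a looser match.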
Enzyme Name Matcher
- For names that do not match, the software identifies probable metabolic enzymes as those
  - Containing “ase”
  - Not containing keywords such as
    - “sensor kinase”
    - “topoisomerase”
    - “protein kinase”
    - “peptidase”
    - Etc.
- Research unknown enzymes
  - MetaCyc, Swiss-Prot, PIR, Medline, EMP
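The "probable metabolic enzyme" filter for unmatched names amounts to a substring test with an exclusion list. A minimal sketch follows; the function name is hypothetical and the exclusion list contains only the four keywords shown on the slide, whereas the real list is longer ("etc.").

```python
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")

def is_probable_metabolic_enzyme(name):
    """True if the name contains 'ase' and none of the excluded keywords."""
    lowered = name.lower()
    if "ase" not in lowered:
        return False
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)

for name in ("glutamate synthase", "DNA topoisomerase I", "hypothetical protein"):
    print(name, is_probable_metabolic_enzyme(name))
```

Names flagged this way are candidates for manual research, not confirmed enzymes; a plain substring test also accepts names such as "permease".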
Assigning Evidence Scores to Predicted Pathways
- X|Y|Z denotes the score for pathway P in organism O, where:
  - X = total number of reactions in P
  - Y = number of reactions in P for which there is evidence in O (an enzyme catalyzing the reaction)
  - Z = number of the Y reactions that are also used in other pathways in O
- It is not clear how to convert these scores into a probability of occurrence
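The X|Y|Z score is a matter of counting set memberships. The sketch below assumes pathways are given as sets of reaction identifiers and that "evidence" is a set of reactions for which the organism has an enzyme; the function names and the toy data are invented for illustration.

```python
def evidence_score(pathway, pathways, reactions_with_evidence):
    """Return (X, Y, Z) for one pathway.

    pathways: dict mapping pathway name -> set of reaction ids
    reactions_with_evidence: set of reaction ids with an enzyme in the organism
    """
    reactions = pathways[pathway]
    present = reactions & reactions_with_evidence
    shared = {
        r for r in present
        if any(r in other for name, other in pathways.items() if name != pathway)
    }
    return len(reactions), len(present), len(shared)

def format_score(score):
    """Render a score tuple in the X|Y|Z notation."""
    return "|".join(str(n) for n in score)

# Toy data: two pathways that share reaction r3.
pathways = {"P1": {"r1", "r2", "r3"}, "P2": {"r3", "r4"}}
evidence = {"r1", "r3", "r4"}

print(format_score(evidence_score("P1", pathways, evidence)))  # 3|2|1
```

A pathway whose Y is close to X but whose Z is close to Y is weakly supported, because most of its evidence is explained by other pathways.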
Algorithm for Automated Pathway Pruning
- A pathway will never be pruned if it contains a unique enzyme (an enzyme not present in any other pathway)
- A pathway will be pruned if one of the following conditions holds:
  - Evidence is better for a different pathway in the same variant set
  - There is evidence for only one reaction in the pathway, or
  - Its set of reactions present is a proper subset of the reactions present in some other pathway, and
    - If the pathway is a biosynthetic pathway, the final reaction(s) are missing
    - If the pathway is a degradation pathway, the initial reaction(s) are missing
    - If the pathway is an energy metabolism pathway, more than half the reactions are missing
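One possible reading of the pruning rules is sketched below. It is an interpretation, not the PathoLogic code. Assumptions: a "unique enzyme" is approximated by a reaction with evidence that occurs in no other pathway; "better evidence" means more reactions present; the three pathway-type tests apply only when the proper-subset condition holds; and all names and data are invented.

```python
from dataclasses import dataclass

@dataclass
class Pathway:
    name: str
    kind: str              # "biosynthesis", "degradation", or "energy"
    reactions: list        # reaction ids ordered from first step to last
    variant_set: str = ""  # pathways sharing a non-empty value are variants

def present(pathway, evidence):
    """Reactions of the pathway for which the organism has evidence."""
    return {r for r in pathway.reactions if r in evidence}

def should_prune(pathway, all_pathways, evidence):
    others = [o for o in all_pathways if o is not pathway]
    here = present(pathway, evidence)

    # Never prune a pathway holding evidence found in no other pathway.
    used_elsewhere = {r for o in others for r in o.reactions}
    if any(r not in used_elsewhere for r in here):
        return False

    # Condition 1: a variant in the same set has better evidence.
    for o in others:
        if pathway.variant_set and o.variant_set == pathway.variant_set:
            if len(present(o, evidence)) > len(here):
                return True

    # Condition 2: evidence for only one reaction.
    if len(here) == 1:
        return True

    # Condition 3: proper subset of another pathway's evidence, plus a type test.
    if any(here < present(o, evidence) for o in others):
        if pathway.kind == "biosynthesis":
            return pathway.reactions[-1] not in evidence
        if pathway.kind == "degradation":
            return pathway.reactions[0] not in evidence
        if pathway.kind == "energy":
            return 2 * len(here) < len(pathway.reactions)
    return False

a = Pathway("A", "biosynthesis", ["r1", "r2", "r3"])
b = Pathway("B", "degradation", ["r1", "r2", "r4", "r5"])
evidence = {"r1", "r2", "r4"}
print([p.name for p in (a, b) if should_prune(p, [a, b], evidence)])
```

In the toy data, pathway A is pruned because its evidence is contained in B's and its final reaction is missing, while B survives because r4 occurs nowhere else.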
Creating Protein Complexes
Complex Subunits Stoichiometries
Proteins as Reaction Substrates
Manual Pruning of Pathways
- Use pathway evidence report
  - Coloring scheme aids in assessing pathway evidence
- Phase I: Prune extra variant pathways
- Rescore pathways, re-generate pathway evidence report
- Phase II: Prune pathways unlikely to be present
  - No/few unique enzymes
  - Most pathway steps present because they are used in another pathway
  - Pathway very unlikely to be present in this organism
Overview Graph
Output from PPP
- Pathway/genome database
- Summary pages
  - Pathway evidence page
    - Click “Summary of Organisms”, then click the organism name, then click “Pathway Evidence”, then click “Save Pathway Report”
  - Missing enzymes report
- Directory tree containing sequence files, reports, etc.
Resulting Directory Structure
ROOT/aic-export/ecocyc/ORGIDcyc/VERSION/
- input
  - organism.dat
  - organism-init.dat
  - genetic-elements.dat
  - annotation files
  - sequence files
- reports
  - name-matching-report.txt
  - trial-parse-report.txt
- kb
  - ORGIDbase.ocelot
- data
  - overview.graph
- released -> VERSION
Caveats
- Cannot predict pathways not present in MetaCyc
- Evidence for short pathways is hard to interpret
- Many reactions occur in several pathways, so many predicted pathways are false positives
The Pathway Tools Schema
Motivations for Understanding Schema
- Pathway Tools visualizations and analyses depend upon the software being able to find precise information in precise places within a Pathway/Genome DB
- When writing complex Lisp queries to PGDBs, those queries must name classes and slots within the schema
- A Pathway/Genome Database is a web of interconnected objects; each object represents a biological entity
Reference
- Pathway Tools User’s Guide, Volume I
  - Appendix A: Guide to the Pathway Tools Schema
Web of Relationships for One Enzyme
[Diagram labels: genes sdhA, sdhB, sdhC, sdhD; polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; Succinate dehydrogenase; Enzymatic-reaction; reaction Succinate + FAD = fumarate + FADH2; TCA Cycle]
Frame Data Model and Schema
- Frame Data Model -- organizational principle for a DB
- Object Displays
- Schema
  - Gene slots
  - Polypeptide slots
  - Protein slots
  - Protein Complex slots
  - Reaction slots
  - Enzymatic Reaction slots
Frame Data Model
- Knowledge base (KB, Database, DB)
- Frames
- Slots
- Facets
- Annotations
Knowledge Base
- Collection of frames and their associated slots, values, facets, and annotations
- Can be stored within
  - An Oracle DB
  - A disk file
  - A Pathway Tools binary program
Frames
- Entities with which facts are associated
- Kinds of frames:
  - Classes: Genes, Pathways, Biosynthetic Pathways
  - Instances (objects): trpA, TCA cycle
- Classes:
  - Superclass(es)
  - Subclass(es)
  - Instance(s)
- A symbolic frame name (id, key) uniquely identifies each frame
Slots
- Encode attributes/properties of a frame
  - Integer, real number, string
- Represent relationships between frames
  - The value of a slot is the identifier of another frame
- Every slot is described by a “slot frame” in a KB that defines meta information about that slot
Slot Links
[Diagram labels: genes sdhA, sdhB, sdhC, sdhD; polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; Succinate dehydrogenase; Enzymatic-reaction; reaction Succinate + FAD = fumarate + FADH2; TCA Cycle. Links are labeled with the slots product, component-of, catalyzes, reaction, in-pathway]
Slots
- Number of values
  - Single valued
  - Multivalued: sets, bags
- Slot values
  - Any LISP object: integer, real, string, symbol (frame name), list
- Slotunits define properties of slots: datatypes, classes, constraints
- Two slots are inverses if they encode opposite relationships
  - Slot Product in class Genes
  - Slot Gene in class Polypeptides
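Inverse slots are easiest to see in a toy frame store. The class below is an illustration written for this overview, not the Pathway Tools or Ocelot API: the class and method names are hypothetical, and it models only multivalued slots whose values are frame names.

```python
class ToyKB:
    """A tiny frame store: frame name -> slot name -> list of values."""

    def __init__(self):
        self.frames = {}
        self.inverses = {}  # slot name -> name of its inverse slot

    def declare_inverse(self, slot, inverse):
        """Record that two slots encode opposite relationships."""
        self.inverses[slot] = inverse
        self.inverses[inverse] = slot

    def add_value(self, frame, slot, value):
        """Add a value and keep the inverse slot on the target frame in step."""
        values = self.frames.setdefault(frame, {}).setdefault(slot, [])
        if value not in values:
            values.append(value)
        inverse = self.inverses.get(slot)
        if inverse is not None:
            back = self.frames.setdefault(value, {}).setdefault(inverse, [])
            if frame not in back:
                back.append(frame)

    def get_values(self, frame, slot):
        return list(self.frames.get(frame, {}).get(slot, []))

kb = ToyKB()
kb.declare_inverse("Product", "Gene")
kb.add_value("sdhA", "Product", "Sdh-flavo")
print(kb.get_values("Sdh-flavo", "Gene"))  # ['sdhA']
```

Maintaining both directions on every update is what lets a query start from either the gene or the polypeptide and reach the other.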
Representation of Function
[Diagram labels: genes sdhA, sdhB, sdhC, sdhD; polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; Succinate dehydrogenase; Enzymatic-reaction; reaction Succinate + FAD = fumarate + FADH2; TCA Cycle. Attributes shown: EC#, Keq, Cofactors, Inhibitors, Molecular wt, pI, Left-end-position]
Monofunctional Monomer
[Diagram labels: Gene, Monomer, Enzymatic-reaction, Reaction, Pathway]
Bifunctional Monomer
[Diagram labels: Gene, Monomer, two Enzymatic-reactions, two Reactions, Pathway]
Monofunctional Multimer
[Diagram labels: Gene, Monomer, Multimer, Enzymatic-reaction, Reaction, Pathway]
Pathway and Substrates
[Diagram labels: Pathway; two Reactions; Reactant-1, Reactant-2, Product-1, Product-2; links labeled in-pathway, left, right]
Transcriptional Regulation site001 pro001 trpE trpD trpC trpB trpA trpL Int003RpoSig70 TrpR*trpInt001 trpLEDCBA trp apoTrpR Int005
Annotations
- Encode information about individual slot values
- Used to attach comments and citations to slot values
- Example:
  - Frame tryptophan-synthetase has a slot called Molecular-Weight with a value of 28
  - Attached to that value is an annotation whose label is Citation and whose value is “[ ]”
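The distinction between a slot value and an annotation on that value can be sketched as a second mapping keyed by frame, slot, and value. This is an illustrative data structure only, with hypothetical names; the citation string is a placeholder, since the slide does not show the actual reference.

```python
class AnnotatedSlots:
    """Slot values plus labeled annotations attached to individual values."""

    def __init__(self):
        self.values = {}       # (frame, slot) -> list of values
        self.annotations = {}  # (frame, slot, value) -> {label: annotation}

    def add_value(self, frame, slot, value):
        self.values.setdefault((frame, slot), []).append(value)

    def annotate(self, frame, slot, value, label, annotation):
        """Attach an annotation to one existing value of a slot."""
        if value not in self.values.get((frame, slot), []):
            raise KeyError(f"{frame}.{slot} has no value {value!r}")
        self.annotations.setdefault((frame, slot, value), {})[label] = annotation

    def get_annotation(self, frame, slot, value, label):
        return self.annotations.get((frame, slot, value), {}).get(label)

store = AnnotatedSlots()
store.add_value("tryptophan-synthetase", "Molecular-Weight", 28)
# "[PLACEHOLDER]" stands in for a real citation id, which the source omits.
store.annotate("tryptophan-synthetase", "Molecular-Weight", 28, "Citation", "[PLACEHOLDER]")
print(store.get_annotation("tryptophan-synthetase", "Molecular-Weight", 28, "Citation"))
```

Because the annotation key includes the value itself, two different molecular weights on the same frame can each carry their own citation.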
Facets
- Encode information about slots
- Allow association between a slot and:
  - comments
  - citations
- Example: Comment attached to Inhibitors of EnzRxn
- Allow access to schema information
Principal Classes
- Class names are capitalized, plural
- Genetic-Elements, with subclasses:
  - Chromosomes
  - Plasmids
- Genes
- Transcription-Units
- RNAs
- Proteins, with subclasses:
  - Polypeptides
  - Protein-Complexes
Principal Classes
- Reactions, with subclasses:
  - Transport-Reactions
- Enzymatic-Reactions
- Pathways
- Compounds-And-Elements
Slots in Multiple Classes
- Common-Name
- Synonyms
- Names (computed as union of Common-Name, Synonyms)
- Comment
- Citations
- DB-Links
Genes Slots
- Chromosome
- Left-End-Position
- Right-End-Position
- Centisome-Position
- Transcription-Direction
- Product
Proteins Slots
- Molecular-Weight-Seq
- Molecular-Weight-Exp
- pI
- Locations
- Modified-Form
- Unmodified-Form
- Component-Of
Polypeptides Slots
- Gene
Protein-Complexes Slots
- Components
Reactions Slots
- EC-Number
- Left, Right
- Substrates (computed as union of Left, Right)
- DeltaG0
- Keq
- Spontaneous?
- Species
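Computed slots such as Substrates (and Names, for frames in general) are unions of stored slots. The sketch below shows the idea on a plain dictionary; the helper name is hypothetical, the order-preserving union is a choice made here, and the synonym in the second example is invented. The reaction is the succinate dehydrogenase reaction used elsewhere in this overview.

```python
def computed_union(frame, *slots):
    """Union of the values of several slots, keeping first-seen order."""
    seen, result = set(), []
    for slot in slots:
        for value in frame.get(slot, []):
            if value not in seen:
                seen.add(value)
                result.append(value)
    return result

reaction = {
    "Left": ["succinate", "FAD"],
    "Right": ["fumarate", "FADH2"],
}
print(computed_union(reaction, "Left", "Right"))

# The same helper covers Names = Common-Name + Synonyms (synonym is made up).
enzyme = {"Common-Name": ["succinate dehydrogenase"], "Synonyms": ["SDH"]}
print(computed_union(enzyme, "Common-Name", "Synonyms"))
```

Computing the union on demand avoids storing the same compound twice and keeps Substrates consistent whenever Left or Right is edited.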
Enzymatic-Reactions Slots
- Enzyme
- Reaction
- Activators
- Inhibitors
- Physiologically-Relevant
- Cofactors
- Prosthetic-Groups
- Alternative-Substrates
- Alternative-Cofactors
Editing Pathway/Genome Databases
Pathway Tools Paradigm
- Separate database from user interface
- Navigator provides one view of the DB
- Editors provide an alternative view of the DB
Invoking the Editors
- Right-click on an object handle
  - Edit
  - Notes
  - Show
- Shift-middle-click on an object handle
Saving Changes
- The user must save changes explicitly with Save KB
- To discard changes made since the last save:
  - Special -> KB -> Revert KB
Administering the Pathway Tools
Information Sources
- Pathway Tools User’s Guide
  - aic-export/ecocyc/genopath/released/doc/userguide1.pdf
    - Appendix A: Guide to the Pathway Tools Schema
  - aic-export/ecocyc/genopath/released/doc/userguide2.pdf
- Pathway Tools Web Site
  - Pathway Tools Tutorial
Reporting Problems to
- Include:
  - Error message
  - Result of :zoom :count :all
  - What version and platform you are running
  - What operation you were performing when the error occurred