PathoLogic Pathway Predictor

Inference of Metabolic Pathways
PathoLogic software: integrates genome and pathway data to identify putative metabolic networks
Input, annotated genomic sequence: genes/ORFs, gene products, DNA sequences
Input, multi-organism pathway database (MetaCyc): pathways, reactions, compounds
Output, Pathway/Genome Database: pathways, reactions, compounds, gene products, genes, genomic map

PathoLogic Functionality
Initialize schema for new PGDB
Transform existing genome to PGDB form
Infer metabolic pathways and store in PGDB
Infer operons and store in PGDB
Assemble Overview diagram
Assist user with manual tasks:
  Assign enzymes to reactions they catalyze
  Identify false-positive pathway predictions
  Build protein complexes from monomers
  Infer transport reactions

PathoLogic Input/Output
Inputs:
  File listing genetic elements (http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat)
  Files containing DNA sequence for each genetic element
  Files containing annotation for each genetic element
  MetaCyc database
Output:
  Pathway/genome database for the subject organism
  Reports that summarize:
    Evidence contained in the input genome for the presence of reference pathways
    Reactions missing from inferred pathways

PathoLogic Analysis Phases
Trial parsing of input data files [few days]
Initialize schema of new PGDB [3 min]
Create DB objects for replicons, genes, proteins [5 min]
Assign enzymes to reactions they catalyze [10 min / 1 week]
(Figure: example enzymes ferrochelatase, glutamate 1-semialdehyde 2,1-aminomutase, and porphobilinogen deaminase assigned to reactions in a pathway diagram)

PathoLogic Analysis Phases
From assigned reactions, infer what pathways are present [5 min / few days]
Define metabolic overview diagram [30 min]
Define protein complexes [few days]

genetic-elements.dat
ID	TEST-CHROM-1
NAME	Chromosome 1
TYPE	:CHRSM
CIRCULAR?	N
ANNOT-FILE	chrom1.pf
SEQ-FILE	chrom1.fsa
//
ID	TEST-CHROM-2
NAME	Chromosome 2
ANNOT-FILE	/mydata/chrom2.gbk
SEQ-FILE	/mydata/chrom2.fna
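The genetic-elements.dat listing can also be generated from a script when a genome has many replicons. The following is a minimal sketch, not part of Pathway Tools: the function name `format_genetic_elements` is hypothetical, and it assumes tab-separated attribute-value lines with every record, including the last, closed by a `//` line.

```python
def format_genetic_elements(elements):
    """Render a list of dicts as genetic-elements.dat text.

    Assumptions (not confirmed by the slides): attribute and value are
    separated by a tab, and every record ends with a line containing //.
    """
    records = []
    for element in elements:
        # One attribute-value pair per line, in the order given by the dict.
        lines = [f"{key}\t{value}" for key, value in element.items()]
        records.append("\n".join(lines))
    return "\n//\n".join(records) + "\n//\n"


# The two genetic elements shown on the slide.
EXAMPLE_ELEMENTS = [
    {
        "ID": "TEST-CHROM-1",
        "NAME": "Chromosome 1",
        "TYPE": ":CHRSM",
        "CIRCULAR?": "N",
        "ANNOT-FILE": "chrom1.pf",
        "SEQ-FILE": "chrom1.fsa",
    },
    {
        "ID": "TEST-CHROM-2",
        "NAME": "Chromosome 2",
        "ANNOT-FILE": "/mydata/chrom2.gbk",
        "SEQ-FILE": "/mydata/chrom2.fna",
    },
]
```

Calling `format_genetic_elements(EXAMPLE_ELEMENTS)` returns text that can be written to the input directory before running a trial parse.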

File Naming Conventions
One pair of sequence and annotation files for each genetic element
Sequence files: FASTA format, suffix .fsa or .fna
Annotation files:
  GenBank format: suffix .gbk
  PathoLogic format: suffix .pf

Typical Problems Using GenBank Files With PathoLogic
Wrong qualifier names used: read PathoLogic documentation!
Extraneous information in a given qualifier
Check results of trial parse carefully

GenBank File Format
Accepted feature types: CDS, tRNA, rRNA, misc_RNA
Accepted qualifiers ([req] = required, [recm] = recommended, [opt] = optional):
  /locus_tag        Unique ID [recm]
  /gene             Gene name [req]
  /product          [req]
  /EC_number        [recm]
  /product_comment  [opt]
  /gene_comment     [opt]
  /alt_name         Synonyms [opt]
  /pseudo           Gene is a pseudogene [opt]
For multifunctional proteins, put each function in a separate /product line

PathoLogic File Format
Each record starts with a line containing an ID attribute
Tab delimited
Each record ends with a line containing //
One attribute-value pair is allowed per line
Use multiple FUNCTION lines for multifunctional proteins
Lines starting with ‘;’ are comment lines
Valid attributes are:
  ID, NAME, SYNONYM
  STARTBASE, ENDBASE, GENE-COMMENT
  FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
  DBLINK
  INTRON

PathoLogic File Format (example)
ID	TP0734
NAME	deoD
STARTBASE	799084
ENDBASE	799785
FUNCTION	purine nucleoside phosphorylase
DBLINK	PID:g3323039
PRODUCT-TYPE	P
GENE-COMMENT	similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
//
ID	TP0735
NAME	gltA
STARTBASE	799867
ENDBASE	801423
FUNCTION	glutamate synthase
DBLINK	PID:g3323040
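Before running a trial parse, it can help to check a .pf file against the rules listed above. The following is a minimal sketch of such a check, not the parser that PathoLogic itself uses; `parse_pf` and `SAMPLE_PF` are hypothetical names, and the sketch only enforces the rules stated on the slides (tab delimiter, `//` record terminator, `;` comments, the listed attribute names).

```python
VALID_ATTRIBUTES = {
    "ID", "NAME", "SYNONYM", "STARTBASE", "ENDBASE", "GENE-COMMENT",
    "FUNCTION", "PRODUCT-TYPE", "EC", "FUNCTION-COMMENT", "DBLINK", "INTRON",
}


def parse_pf(text):
    """Parse PathoLogic-format text into a list of records.

    Each record is a dict mapping an attribute name to a list of values,
    so repeated FUNCTION lines for multifunctional proteins are preserved.
    Raises ValueError on an attribute that is not in VALID_ATTRIBUTES.
    """
    records, current = [], {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith(";"):
            continue  # skip blank lines and comment lines
        if line.strip() == "//":
            if current:
                records.append(current)
            current = {}
            continue
        attribute, _, value = line.partition("\t")
        if attribute not in VALID_ATTRIBUTES:
            raise ValueError(f"line {line_number}: unknown attribute {attribute!r}")
        current.setdefault(attribute, []).append(value.strip())
    if current:
        records.append(current)  # tolerate a final record without //
    return records


SAMPLE_PF = (
    "; comment lines start with a semicolon\n"
    "ID\tTP0734\n"
    "NAME\tdeoD\n"
    "STARTBASE\t799084\n"
    "ENDBASE\t799785\n"
    "FUNCTION\tpurine nucleoside phosphorylase\n"
    "PRODUCT-TYPE\tP\n"
    "//\n"
    "ID\tTP0735\n"
    "NAME\tgltA\n"
    "FUNCTION\tglutamate synthase\n"
    "//\n"
)
```

A misspelled attribute name, one of the typical problems in hand-written input files, is reported with its line number instead of surfacing later in the build.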

Before you start: What to do when an error occurs
Most Navigator errors are automatically trapped; debugging information is saved to the error.tmp file.
All other errors (including most PathoLogic errors) will cause the software to drop into the Lisp debugger.
  Unix: the error message will show up in the original terminal window from which you started Pathway Tools.
  Windows: the error message will show up in the Lisp console. The Lisp console usually starts out iconified; its icon is a blue bust of Franz Liszt.
Two goals when an error occurs:
  Try to continue working
  Obtain enough information for a bug report to send to the Pathway Tools support team

The Lisp Debugger
Sample error (details and number of restart actions differ for each case):
  Error: Received signal number 2 (Keyboard interrupt)
  Restart actions (select using :continue):
   0: continue computation
   1: Return to command level
   2: Pathway Tools version 10.0 top level
   3: Exit Pathway Tools version 10.0
  [1c] EC(2):
To generate debugging information (stack backtrace): :zoom :count :all
To continue from error, find a restart that takes you to the top level (in this case, number 2): :cont 2
To exit Pathway Tools: :exit

How to report an error
Determine if the problem is reproducible, and how to reproduce it (make sure you have all the latest patches installed)
Send email to ptools-support@ai.sri.com containing:
  Pathway Tools version number and platform
  Description of exactly what you were doing (which command you invoked, what you typed, etc.) or instructions for how to reproduce the problem
  error.tmp file, if one was generated
  If the software breaks into the Lisp debugger, the complete error message and stack backtrace (obtained using the command :zoom :count :all, as described on the previous slide)

Using the PPP GUI to Create a Pathway/Genome Database
Input Project Information: Organism -> Create New

Input Project Information

PathoLogic Command Menus
Organism: Select, Create New, Save KB, Revert KB, Reinitialize KB, Specify Reference PGDB(s), Exit
Build: Trial Parse, Automated Build
Refine: Assign Probable Enzymes, Assign Modified Proteins, Create Protein Complexes, Re-run Name Matcher, Rescore Pathways, Predict Transcription Units, Transport Identification Parser, Update Overview, Pathway Hole Filler

Next Steps
Trial Parse: Build -> Trial Parse
Fix any errors in input files
Build pathway/genome database: Build -> Automated Build

PathoLogic Parser Output

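Assign Enzymes to Reactions
(Flowchart) Each gene product name, for example UDP-glucose-4-epimerase, is matched against MetaCyc.
  Match, yes: Assign (here to reaction 5.1.3.2, UDP-D-glucose <-> UDP-galactose)
  Match, no: is it a probable enzyme (name contains -ase)?
    no: Not a metabolic enzyme
    yes: Manually search
      yes: Assign
      no: Can’t Assign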

Enzyme Name Matcher
Matches on full enzyme name
Match is case-insensitive and removes the punctuation characters “ -_(){}',:”
Also matches after removal of prefixes and suffixes such as:
  “Putative”, “Hypothetical”, etc.
  alpha|beta|…|catalytic|inducible chain|subunit|component
  Parenthetical gene name

Enzyme Name Matcher
For names that do not match, the software identifies probable metabolic enzymes as those:
  Containing “ase”
  Not containing keywords such as “sensor kinase”, “topoisomerase”, “protein kinase”, “peptidase”, etc.
Research unknown enzymes in MetaCyc, Swiss-Prot, PubMed
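The matching rules on the two Enzyme Name Matcher slides can be illustrated with a short script. This is a simplified sketch under stated assumptions, not the PathoLogic implementation: the prefix and keyword lists contain only the examples named on the slides (the real lists are longer), and all function names are hypothetical.

```python
import re

PUNCTUATION = " -_(){}',:"          # characters ignored when comparing names
QUALIFIER_PREFIXES = ("putative", "hypothetical")   # slide says "etc."
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")


def normalize(name):
    """Lower-case the name and drop the ignored punctuation characters."""
    return "".join(ch for ch in name.lower() if ch not in PUNCTUATION)


def strip_decorations(name):
    """Remove a trailing parenthetical gene name, qualifier prefixes,
    and a chain/subunit/component suffix."""
    text = name.lower().strip()
    text = re.sub(r"\s*\([^()]*\)\s*$", "", text)
    changed = True
    while changed:
        changed = False
        for prefix in QUALIFIER_PREFIXES:
            if text.startswith(prefix + " "):
                text = text[len(prefix) + 1:]
                changed = True
    suffix = r"[\s,]+(?:(?:alpha|beta|catalytic|inducible)\s+)?(?:chain|subunit|component)$"
    return re.sub(suffix, "", text)


def match_name(name, reference_names):
    """Return the reaction for name, or None if nothing matches.

    reference_names maps a reference enzyme name to a reaction identifier.
    The full name is tried first, then the name with decorations removed.
    """
    index = {normalize(ref): rxn for ref, rxn in reference_names.items()}
    for candidate in (name, strip_decorations(name)):
        key = normalize(candidate)
        if key in index:
            return index[key]
    return None


def is_probable_enzyme(name):
    """Heuristic for unmatched names: contains 'ase' and no excluded keyword."""
    lowered = name.lower()
    return "ase" in lowered and not any(k in lowered for k in EXCLUDED_KEYWORDS)
```

For example, with `{"UDP-glucose-4-epimerase": "5.1.3.2"}` as the reference table, the annotation "putative UDP glucose 4-epimerase (galE)" matches only after its prefix and parenthetical gene name are removed.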

Enzyme Name to Reaction Mapping See also file PTools Tutorial/PathoLogic Reports/name-matching-report.txt

Manual Polishing
Refine -> Assign Probable Enzymes: do this first
Refine -> Rescore Pathways: redo after assigning enzymes
Refine -> Create Protein Complexes: can be done at any time
Refine -> Assign Modified Proteins: can be done at any time
Refine -> Transport Identification Parser: can be done at any time
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Update Overview: do this last, and repeat after any material changes to PGDB

Assign Probable Enzymes

How to find reactions for probable enzymes
First, verify that the enzyme name describes a specific, metabolic function
Search for a fragment of the name in MetaCyc; you may be able to find a match that PathoLogic missed
Look up the protein in SwissProt or other DBs
Search for the gene name in a PGDB for a related organism (bear in mind that gene names are not reliable indicators of function, so check carefully)
Search for the function name in PubMed
Other…

Manual Polishing
Refine -> Assign Probable Enzymes
Refine -> Rescore Pathways
Refine -> Create Protein Complexes
Refine -> Assign Modified Proteins
Refine -> Transport Identification Parser
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Run Consistency Checker
Refine -> Update Overview

Automated Pathway Inference All pathways in MetaCyc for which there is at least one enzyme identified in the target organism are considered for possible inclusion. Algorithm errs on side of inclusivity – easier to manually delete a pathway from an organism than to find a pathway that should have been predicted but wasn’t.

Considerations taken into account when deciding whether or not a pathway should be inferred:
Is there a unique enzyme, that is, an enzyme not involved in any other pathway?
Does the organism fall in the expected taxonomic domain of the pathway?
Is this pathway part of a variant set, and, if so, is there more evidence for some other variant?
If there is no unique enzyme:
  Is there evidence for more than one enzyme?
  If a biosynthetic pathway, is there evidence for final reaction(s)?
  If a degradation pathway, is there evidence for initial reaction(s)?
  If an energy metabolism pathway, is there evidence for more than half the reactions?
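One way to read the considerations above is as a sequence of checks. The sketch below turns them into hard yes/no rules purely for illustration; the slides present them as considerations, so how PathoLogic actually weighs and combines them is an assumption here, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PathwayEvidence:
    kind: str                 # "biosynthesis", "degradation", "energy", or other
    reactions: list           # reaction ids in pathway order, first to last
    present: set              # reactions with an enzyme identified in the organism
    unique: set               # reactions that occur in no other pathway
    in_expected_taxa: bool = True
    better_variant_exists: bool = False


def should_infer(p):
    """Illustrative decision rule built from the listed considerations."""
    if not p.in_expected_taxa or p.better_variant_exists:
        return False
    if p.unique & p.present:
        return True           # evidence for an enzyme unique to this pathway
    if len(p.present) < 2:
        return False          # no unique enzyme and at most one enzyme
    if p.kind == "biosynthesis":
        return p.reactions[-1] in p.present      # final reaction has evidence
    if p.kind == "degradation":
        return p.reactions[0] in p.present       # initial reaction has evidence
    if p.kind == "energy":
        return len(p.present) > len(p.reactions) / 2
    return True               # assumption: other pathway types pass with 2+ enzymes
```

A rule set of this kind errs toward inclusion, which fits the stated preference for deleting a wrong pathway by hand over missing a real one.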

Assigning Evidence Scores to Predicted Pathways
X|Y|Z denotes the score for pathway P in organism O, where:
  X = total number of reactions in P
  Y = number of reactions in P for which there is evidence of a catalyzing enzyme in O
  Z = number of the Y reactions that are used in other pathways in O
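The X|Y|Z score is simple enough to compute by hand, but a small sketch makes the three counts concrete. The function and argument names below are hypothetical, and `pathway_reactions` is assumed to hold only the pathways considered present in organism O.

```python
def evidence_score(pathway, pathway_reactions, reactions_with_enzymes):
    """Return the X|Y|Z evidence score for one pathway as a string.

    pathway_reactions: dict mapping pathway name -> list of reaction ids
    reactions_with_enzymes: set of reaction ids that have an enzyme in O
    """
    reactions = pathway_reactions[pathway]
    # Y: reactions of this pathway with enzyme evidence in the organism.
    supported = [r for r in reactions if r in reactions_with_enzymes]
    # Reactions that appear in any other pathway of the organism.
    elsewhere = set()
    for name, rxns in pathway_reactions.items():
        if name != pathway:
            elsewhere.update(rxns)
    # Z: supported reactions that are shared with other pathways.
    shared = [r for r in supported if r in elsewhere]
    return f"{len(reactions)}|{len(supported)}|{len(shared)}"
```

With pathways `P1 = [r1, r2, r3, r4]` and `P2 = [r3, r5]` and enzymes for `r1` and `r3`, P1 scores 4|2|1: four reactions, two with evidence, one of which (r3) is also used in P2.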

Manual Pruning of Pathways
Use pathway evidence report
Coloring scheme aids in assessing pathway evidence
Phase I: Prune extra variant pathways
  Rescore pathways, re-generate pathway evidence report
Phase II: Prune pathways unlikely to be present
  No/few unique enzymes
  Most pathway steps present because they are used in another pathway
  Pathway very unlikely to be present in this organism
  Nonspecific enzyme name assigned to a pathway step

Caveats
Cannot predict pathways not present in MetaCyc
Evidence for short pathways is hard to interpret
Since many reactions occur in multiple pathways, some predictions are false positives

Output from PPP
Pathway/genome database
Summary pages
Pathway evidence page: click “Summary of Organisms”, then click organism name, then click “Pathway Evidence”, then click “Save Pathway Report”
Missing enzymes report
Directory tree containing sequence files, reports, etc.

Resulting Directory Structure
ROOT/ptools-local/pgdbs/user/ORGIDcyc/
  VERSION/
    input/
      organism.dat
      organism-init.dat
      genetic-elements.dat
      annotation files
      sequence files
    reports/
      name-matching-report.txt
      trial-parse-report.txt
    kb/
      ORGIDbase.ocelot
    data/
      overview.graph
  released -> VERSION
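Scripts that post-process PathoLogic output need to locate these files. The sketch below builds the expected paths from the layout shown above; `pgdb_paths` is a hypothetical helper, and the way ORGID is cased in directory and file names is an assumption (lower case is used in the example).

```python
from pathlib import Path


def pgdb_paths(root, orgid, version):
    """Return the expected locations of key PGDB files.

    Paths follow ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/ as shown
    on the slide; nothing is created or checked on disk.
    """
    base = Path(root) / "ptools-local" / "pgdbs" / "user" / f"{orgid}cyc" / version
    return {
        "input_dir": base / "input",
        "genetic_elements": base / "input" / "genetic-elements.dat",
        "name_matching_report": base / "reports" / "name-matching-report.txt",
        "trial_parse_report": base / "reports" / "trial-parse-report.txt",
        "kb": base / "kb" / f"{orgid}base.ocelot",
        "overview": base / "data" / "overview.graph",
    }
```

For instance, `pgdb_paths("/home/me", "caulo", "1.0")["kb"]` points at `caulobase.ocelot` inside the `kb` directory of version 1.0.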

Creating Protein Complexes

Complex Subunits Stoichiometries

Manual Polishing
Refine -> Assign Probable Enzymes
Refine -> Re-run Name Matcher
Refine -> Create Protein Complexes
Refine -> Assign Modified Proteins
Refine -> Transport Identification Parser
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Run Consistency Checker
Refine -> Update Overview

Proteins as Reaction Substrates

What are pathway holes?
At least one reaction in the pathway has an enzyme assigned. The reactions in the pathway without enzymes assigned are holes.
(Figure: NAD biosynthesis example. L-aspartate -> iminoaspartate (1.4.3.-) -> quinolinate (no EC#) -> nicotinate nucleotide (n.n. pyrophosphorylase; nadC, RV1596) -> deamido-NAD (2.7.7.18) -> NAD (6.3.1.5, 6.3.5.1); reactions with no enzyme assigned are labeled as holes.)
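The definition of a pathway hole translates directly into a filter over a pathway's reactions. The following is a minimal sketch with hypothetical names and toy identifiers; it is not how Pathway Tools stores pathways internally.

```python
def find_pathway_holes(pathway_reactions, enzymes_by_reaction):
    """Return a dict mapping pathway -> list of hole reactions.

    A pathway is reported only if at least one of its reactions has an
    enzyme assigned; its reactions without any enzyme are the holes.
    """
    holes = {}
    for pathway, reactions in pathway_reactions.items():
        assigned = [r for r in reactions if enzymes_by_reaction.get(r)]
        missing = [r for r in reactions if not enzymes_by_reaction.get(r)]
        if assigned and missing:
            holes[pathway] = missing
    return holes
```

A pathway with no assigned enzymes at all is left out, since by the definition above it has no evidence and therefore no holes to fill.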

Algorithm for identifying candidates and consolidating data
Step I: Collect query isozymes of function A (enzyme A from organisms 1 through 8)
Step II: BLAST against target genome
Step III & IV: Consolidate hits and evaluate evidence using a Bayes classifier
Example from the figure:
  3 queries have low-scoring hits to sequence X (gene X): resulting P(has-function) is low
  8 queries have high-scoring hits to sequence Y (gene Y): resulting P(has-function) is high
  5 queries have low-scoring hits to sequence Z (gene Z): resulting P(has-function) is low

Reference for the Pathway Hole Filler… Green, ML and Karp, PD. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 2004, 5:76.

Features used to calculate the probability that a protein has the desired function
Best E-value
Avg. rank
Avg. % aligned
Number of query sequences aligned
Potential operon? Candidate is in a contiguous set of genes transcribed in one direction with another gene in the pathway
Adjacent reactions? Candidate is adjacent to the gene assigned to an adjacent reaction in the pathway
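The consolidation step groups BLAST hits by candidate gene and summarizes them into the sequence-based features listed above. The sketch below covers only those four features and assumes a simple tuple layout for hits; the names and layout are hypothetical, and the operon and adjacency features and the classifier itself are described in the Green and Karp paper rather than reproduced here.

```python
from collections import defaultdict


def consolidate_hits(hits):
    """Summarize BLAST hits per candidate gene.

    hits: iterable of (query_id, candidate_gene, e_value, rank, pct_aligned),
    where rank is the position of the candidate in that query's hit list.
    Returns a dict mapping candidate gene -> feature dict.
    """
    grouped = defaultdict(list)
    for query, gene, e_value, rank, pct_aligned in hits:
        grouped[gene].append((query, e_value, rank, pct_aligned))

    features = {}
    for gene, rows in grouped.items():
        features[gene] = {
            "best_e_value": min(row[1] for row in rows),
            "avg_rank": sum(row[2] for row in rows) / len(rows),
            "avg_pct_aligned": sum(row[3] for row in rows) / len(rows),
            "n_queries": len({row[0] for row in rows}),
        }
    return features
```

A candidate hit by many queries at a low average rank with a small best E-value corresponds to gene Y in the figure, the case where P(has-function) comes out high.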

Navigating to the Pathway Hole Filler

Steps that must be completed before running the Pathway Hole Filler
Install BLAST executable (should already be installed on training room machines)
Prepare BLAST protein db
  Need FASTA format genome nucleotide sequence (see instructor if you have something different, like ESTs, or have no sequence data file)
In general, the more pathways in your PGDB, the more the pathway hole filler will have to search for

Steps for operating the Pathway Hole Filler
Step 1: Prepare training data for Bayes classifier
  Collect feature data for known rxns in PGDB
  Calculate probability distributions for classifier
Step 2: Identify and evaluate candidates
  Collect feature data for each candidate
  Use classifier to determine P(has-function)
Step 3: Choose holes to fill in KB
  Either select all above a cutoff or manually review candidates

Step 1: Prepare Training Data… Calculate training data from your organism or use existing training data… Once Step 1 has been completed, the training data are saved and can be reused (even in another Pathway Tools session). If using existing data from E. coli the training data are based on data from the literature.

Step 2: Identify & Evaluate Candidates…

Step 2: Identify & Evaluate Candidates
Select pathways from a list: a list of all pathways in the PGDB with holes
Select reactions from a list: a list of all pathway holes in the PGDB

Modes of operation…
Fully automatic (CAUTION!!)
  No interaction required from user
  All default values used
  Prepare training data: all known rxns in KB
  Identify and evaluate candidates: all pathways with pathway holes
  Choose holes to fill in KB: all holes with P>0.9 filled
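The cutoff rule used in fully automatic mode can be sketched as a filter over the classifier output. The data layout and function name are hypothetical, and keeping only the single best candidate per hole is an assumption; the slides state only that holes with P>0.9 are filled.

```python
def choose_holes_to_fill(candidates, cutoff=0.9):
    """Pick a gene for each hole whose best candidate exceeds the cutoff.

    candidates: dict mapping hole reaction -> list of (gene, probability)
    Returns a dict mapping hole reaction -> (gene, probability).
    """
    fills = {}
    for reaction, scored in candidates.items():
        # Keep candidates strictly above the cutoff, as in "P>0.9".
        above = [(gene, p) for gene, p in scored if p > cutoff]
        if above:
            fills[reaction] = max(above, key=lambda pair: pair[1])
    return fills
```

Raising the cutoff trades fewer filled holes for fewer wrong assignments, which is one reason the slide flags fully automatic mode with a caution.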

Modes of operation…
Wizard
  Wizard prompts user for training data source and for which holes to make predictions.
  Wizard runs Steps 1 & 2, then prompts user to complete Step 3.
Power-user mode
  User must proceed through each step in order.
  Program still prompts user for required parameters, but each step must be completed before advancing to next step.

Step 3: Choose Holes to Fill in KB

Output from Pathway Hole Filler - from “Prepare Training Data” step
ROOT/aic-export/ecocyc/ORGIDcyc/VERSION/data/ (e.g., ROOT/aic-export/ecocyc/caulocyc/1.0/data/)
  rxn-list = data retrieved from ORGID for calculating training data
  priors/ = directory containing training data that is loaded when using existing data from ORGID
These files contain the training data computed in Step 1. If either file is available, the user may use “existing” training data in Step 1.
* Each file is overwritten each time you run this step.

Output from Pathway Hole Filler - from “Identify and Evaluate Candidates” step
ROOT/aic-export/ecocyc/ORGIDcyc/VERSION/reports/ (e.g., ROOT/aic-export/ecocyc/caulocyc/1.0/reports/)
  ORGIDholesX-Y.html (e.g., CAULOholes0-10.html)
  ORGID_filled-holes.html = the list of holes that user selected to fill in the KB in Step 3.
  blasterrors.log = log of each rxn describing whether or not any candidates were found
  hole-data = file containing data (in a Lisp structure) found for each rxn, used to generate list in “Choose holes to fill in KB” dialogue. If this file is available, Step 3 can be initiated without repeating Step 2.
* Each file is overwritten each time you run this step.

Nomenclature
WO pair = pair of genes within an operon
TUB pair = pair of genes at a transcription unit boundary (delineate operons)

Operation of the operon predictor
For each contiguous gene pair, predict whether the pair is within the same operon or at a transcription unit boundary
Use pairwise predictions to identify potential operons
Example, genes A B C D E:
  AB = TUB pair
  BC = WO pair
  CD = WO pair
  DE = TUB pair
  operon = BCD
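Turning pairwise WO/TUB calls into transcription units is a single pass over the gene order. The sketch below reproduces the A to E example; the function name and the dictionary of pair labels are hypothetical, and the genes are assumed to be listed in chromosomal order within one directon.

```python
def group_transcription_units(genes, pair_labels):
    """Group ordered genes into transcription units.

    genes: list of gene names in chromosomal order
    pair_labels: dict mapping (left_gene, right_gene) -> "WO" or "TUB"
    A "WO" pair keeps both genes in the same unit; a "TUB" pair starts
    a new unit. Units with two or more genes are potential operons.
    """
    if not genes:
        return []
    units = [[genes[0]]]
    for left, right in zip(genes, genes[1:]):
        if pair_labels[(left, right)] == "WO":
            units[-1].append(right)
        else:
            units.append([right])
    return units


EXAMPLE_GENES = ["A", "B", "C", "D", "E"]
EXAMPLE_LABELS = {
    ("A", "B"): "TUB",
    ("B", "C"): "WO",
    ("C", "D"): "WO",
    ("D", "E"): "TUB",
}
```

For the example labels, the grouping yields three units, A alone, BCD, and E alone, so BCD is the one predicted operon.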

Operon predictor
Predicts operon gene pairs based on features typically used for operon prediction:
  intergenic distance between genes
  genes in the same functional class
We use the method from Salgado et al., PNAS (2000) as a starting point.
Uses E. coli experimentally verified data as a training set.
Compute log likelihood of two genes being a WO or TUB pair based on intergenic distance.
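A distance-based log likelihood of this kind can be estimated by binning intergenic distances for known WO and TUB pairs and comparing the two frequencies in each bin. The sketch below is a generic illustration of that idea, not the published method's exact procedure: the bin width, the pseudocount, and the function name are assumptions, and any distances passed in would need to come from a real training set.

```python
import math
from collections import Counter


def train_distance_model(wo_distances, tub_distances, bin_width=10, pseudocount=1.0):
    """Return a function giving log(P(d | WO) / P(d | TUB)) for a distance d.

    Distances (in base pairs, negative for overlapping genes) are grouped
    into bins of bin_width. A pseudocount keeps empty bins from producing
    a division by zero or log of zero.
    """
    wo_bins = Counter(d // bin_width for d in wo_distances)
    tub_bins = Counter(d // bin_width for d in tub_distances)
    n_bins = len(set(wo_bins) | set(tub_bins))
    wo_total = len(wo_distances) + pseudocount * n_bins
    tub_total = len(tub_distances) + pseudocount * n_bins

    def log_likelihood(distance):
        b = distance // bin_width
        p_wo = (wo_bins[b] + pseudocount) / wo_total
        p_tub = (tub_bins[b] + pseudocount) / tub_total
        return math.log(p_wo / p_tub)

    return log_likelihood
```

A positive value favors a WO pair and a negative value favors a TUB pair; short intergenic distances tend to score positive because genes within an operon are usually closely spaced.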

Operon predictor
Additional features easily computed from a PGDB:
  1. both gene products are enzymes in the same metabolic pathway
  2. both gene products are monomers in the same protein complex
  3. one gene product transports a substrate for a metabolic pathway in which the other gene product is involved as an enzyme
  4. a gene upstream or downstream from the gene pair (and within the same directon) is related to either one of the genes in the pair as per features 1, 2 and 3 above