PathoLogic Pathway Predictor



Inference of Metabolic Pathways
[Diagram. Inputs: Annotated Genomic Sequence; Multi-organism Pathway Database (MetaCyc). PathoLogic Software: integrates genome and pathway data to identify putative metabolic networks. Output: Pathway/Genome Database. Component labels shown: Genes/ORFs, Gene Products, DNA Sequences, Genes, Genomic Map, Pathways, Reactions, Compounds.]

PathoLogic Functionality
- Initialize schema for new PGDB
- Transform existing genome to PGDB form
- Infer metabolic pathways and store in PGDB
- Infer operons and store in PGDB
- Assemble Overview diagram
- Assist user with manual tasks
  - Assign enzymes to reactions they catalyze
  - Identify false-positive pathway predictions
  - Build protein complexes from monomers
- Infer transport reactions
- Fill pathway holes
- Note: PathoLogic can be run from the command line

Overview of Metabolic Pathway Inference

PathoLogic Step 3: Metabolic Reconstruction
- Phase I: Qualitative metabolic reconstruction (PathoLogic)
  - Inference of the reactome from the annotated genome
  - Inference of metabolic pathways by selecting from MetaCyc pathways
  - Karp et al., Stand Genomic Sci 2011, 5:424
- Phase II: Quantitative model construction (MetaFlux)
  - Infer biomass metabolites, nutrients
  - Gap fill reaction network
    - Modify reaction complement until biomass metabolites are producible from nutrients
  - Solve model, assess computed fluxes
  - Iterate
  - Karp et al., Briefings in Bioinformatics 2015
- Dale et al., BMC Bioinformatics 2010, 11:15

MetaCyc: Curated Metabolic Database

                MetaCyc v19.0 (2015)    KEGG (2013)     SEED (2015)
  Citations     45,000
  Pathways      2,310                   179 modules     583 subsystems
  Reactions     12,400                  8,692
  Metabolites   12,000
  Mini-reviews  6,300 textbook pages

Source: "A Systematic Comparison of the MetaCyc and KEGG Pathway Databases," BMC Bioinformatics 2013, 14(1):112

Pathway Prediction
- Pathway prediction is useful because
  - Pathways organize the metabolic network into tractable units
  - Pathways guide us to search for missing enzymes
  - Pathways can be used for analysis of high-throughput data
    - Visualization, enrichment analysis
  - Pathway inference fills gaps in metabolic network
    - Reduces computational demands of gap filling
- Pathway prediction is hard because
  - Reactome inference is imperfect
  - Some reactions are present in multiple pathways
  - Pathway variants share many reactions in common
  - MetaCyc keeps increasing in size

Reactome Inference
- For each protein in the organism, infer the reaction(s) it catalyzes
- Build from the existing genome annotation!
- Match protein functions to MetaCyc reactions
  - Enzyme names (uncontrolled vocabulary)
  - EC numbers
  - Gene Ontology terms

PathoLogic Enzyme Name Matcher
- Name matcher generates alternative variants of each name and matches each to MetaCyc
- Strips extraneous information found in enzyme names
  - Putative carbamate kinase, alpha subunit
  - Flavin subunit of carbamate kinase
  - Cytoplasmic carbamate kinase
  - Carbamate kinase (abcD)
  - Carbamate kinase (3.2.1.4)
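The stripping step can be illustrated with a small Python sketch. It is not the PathoLogic name matcher: the regular expressions and the function name `normalize_enzyme_name` are hypothetical, written only to cover the five example names on this slide, and the sketch returns a single cleaned name rather than a set of alternative variants.

```python
import re

# Hypothetical clean-up rules covering only the slide's examples;
# the real name matcher uses its own, much larger rule set.
_PARENS = re.compile(r"\s*\([^)]*\)")                      # "(abcD)", "(3.2.1.4)"
_SUBUNIT_SUFFIX = re.compile(r",\s*\w+\s+subunit$", re.IGNORECASE)      # ", alpha subunit"
_SUBUNIT_PREFIX = re.compile(r"^\w+\s+subunit\s+of\s+", re.IGNORECASE)  # "Flavin subunit of "
_QUALIFIER = re.compile(r"^(putative|probable|cytoplasmic)\s+", re.IGNORECASE)


def normalize_enzyme_name(name: str) -> str:
    """Strip extraneous information and return a lower-case core name."""
    s = name.strip()
    s = _PARENS.sub("", s)
    s = _SUBUNIT_SUFFIX.sub("", s)
    s = _SUBUNIT_PREFIX.sub("", s)
    # Qualifiers can be stacked ("Putative cytoplasmic ..."), so strip repeatedly.
    previous = None
    while previous != s:
        previous = s
        s = _QUALIFIER.sub("", s)
    return s.strip().lower()


if __name__ == "__main__":
    for raw in [
        "Putative carbamate kinase, alpha subunit",
        "Flavin subunit of carbamate kinase",
        "Cytoplasmic carbamate kinase",
        "Carbamate kinase (abcD)",
        "Carbamate kinase (3.2.1.4)",
    ]:
        print(raw, "->", normalize_enzyme_name(raw))
```

Under these assumed rules all five example names reduce to the same core string, which could then be looked up among MetaCyc enzyme names.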

Algorithm for Inference of Metabolic Pathways
- For each pathway in MetaCyc consider
  - For what fraction of its reactions are enzymes present in the organism?
  - Are enzymes present for reactions unique to the pathway?
  - Is a given pathway outside its designated taxonomic range?
    - Calvin cycle: green plants, green algae, etc.
  - Are enzymes present for designated "key reactions" within MetaCyc pathways?
    - Calvin cycle / ribulose bisphosphate carboxylase
- Standards in Genomic Sciences 5:424-429, 2011

New Addition: Pathway Score
- PS: pathway score, in [0,1]
- R: set of reactions within the pathway
  - Ignore spontaneous reactions
- RS: reaction score for a given reaction
- T: boost if the organism is within the taxonomic range of the pathway

Reaction Score
RS = P + U + K
- P = presence score
  - 0.2 if an enzyme catalyzing the reaction is present, else 0
- U = uniqueness score
  - Ranges from 0.6 (reaction present in a single pathway) to 0 (many pathways)
- K = key reaction score
  - 0.5 if the reaction is a key reaction of the pathway
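The sum RS = P + U + K can be sketched in Python as follows. The slide gives only the endpoints of the uniqueness score (0.6 for a single pathway, tending to 0 for many), so the decay `0.6 / n_pathways` used here is an assumption chosen for illustration, as are the function names. The slide also does not say whether U and K count when no enzyme is present; this sketch adds all three terms unconditionally.

```python
def uniqueness_score(n_pathways: int) -> float:
    """Uniqueness term U.

    ASSUMPTION: the slides state only that U is 0.6 for a reaction in a
    single pathway and approaches 0 for many pathways; 0.6 / n is a
    hypothetical decay that satisfies those two endpoints.
    """
    if n_pathways < 1:
        raise ValueError("a reaction must belong to at least one pathway")
    return 0.6 / n_pathways


def reaction_score(enzyme_present: bool, n_pathways: int, is_key_reaction: bool) -> float:
    """RS = P + U + K with the constants given on the slide."""
    p = 0.2 if enzyme_present else 0.0      # presence score
    u = uniqueness_score(n_pathways)        # uniqueness score (assumed form)
    k = 0.5 if is_key_reaction else 0.0     # key reaction score
    return p + u + k


if __name__ == "__main__":
    # A key reaction found in only one pathway, with an enzyme present.
    print(reaction_score(True, 1, True))
    # A reaction shared by four pathways, with no enzyme and not a key reaction.
    print(reaction_score(False, 4, False))
```

How the per-reaction scores are combined with the taxonomic boost T into PS is not recoverable from the transcript, so it is left out of the sketch.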

Pathway Decision Procedure for Pathway P
- REJECT P if P is a transport, signaling, or synthetic (engineered) pathway
- REJECT P if P is an electron transport pathway AND P lacks enzymes for any reaction
- INCLUDE P if P has all reactions present (meaning an enzyme is present for each reaction) AND, if P is outside its taxonomic range, P contains more than 3 reactions
- REJECT P if P is outside its taxonomic range
- REJECT P if P is missing enzymes for all key reactions of P

Decision Procedure (continued)
- REJECT P if the score of P is significantly less than the score of a variant pathway of P
- INCLUDE P if the score of P exceeds the threshold PATHWAY-PREDICTION-SCORE-CUTOFF
  - Defined in ptools-init.dat
- Default decision: REJECT
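Read top to bottom, the rules on these two slides form an ordered rule list in which the first matching rule decides. The Python sketch below encodes that reading. The `PathwayEvidence` fields, the `decide` function, and the `variant_margin` parameter (standing in for "significantly less") are all hypothetical; the slides give no numeric value for the margin or for PATHWAY-PREDICTION-SCORE-CUTOFF, so both are passed in by the caller. "Lacks enzymes for any reaction" is interpreted as "at least one reaction has no enzyme".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathwayEvidence:
    """Hypothetical summary of the evidence for one MetaCyc pathway."""
    kind: str                       # "metabolic", "transport", "signaling", "engineered", "electron-transport"
    n_reactions: int
    n_reactions_with_enzyme: int
    in_taxonomic_range: bool
    n_key_reactions: int
    n_key_reactions_with_enzyme: int
    score: float
    best_variant_score: Optional[float] = None   # score of the best variant pathway, if any


def decide(p: PathwayEvidence, cutoff: float, variant_margin: float) -> str:
    """Apply the slide's rules in order; the first rule that fires wins."""
    if p.kind in {"transport", "signaling", "engineered"}:
        return "REJECT"
    all_present = p.n_reactions_with_enzyme == p.n_reactions
    if p.kind == "electron-transport" and not all_present:
        return "REJECT"
    if all_present and (p.in_taxonomic_range or p.n_reactions > 3):
        return "INCLUDE"
    if not p.in_taxonomic_range:
        return "REJECT"
    if p.n_key_reactions > 0 and p.n_key_reactions_with_enzyme == 0:
        return "REJECT"
    # ASSUMPTION: "significantly less" is modelled as a fixed margin.
    if p.best_variant_score is not None and p.score < p.best_variant_score - variant_margin:
        return "REJECT"
    if p.score > cutoff:
        return "INCLUDE"
    return "REJECT"   # default decision
```

Keeping the rules as a flat sequence of early returns mirrors the slide order, which matters: a complete pathway outside its taxonomic range is included by the third rule only if it has more than 3 reactions, and is otherwise rejected by the fourth.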

PathoLogic Analysis Phases
- Trial parsing of input data files; fix errors
- Initialize schema of new PGDB (automatic)
- Create DB objects for replicons, genes, proteins (automatic)
- Assign enzymes to reactions they catalyze (part automatic, part manual)
- From assigned reactions, infer what pathways are present (automatic, with manual review)

PathoLogic Analysis Phases (continued)
- Define metabolic overview diagram (automatic, redo after changing data)
- Define protein complexes (manual)
- Define transcription units (automatic)
- Infer transport reactions (manual review necessary)
- Fill Pathway Holes (manual review necessary)

PathoLogic Input/Output
- Inputs:
  - List of all genetic elements
    - Enter using GUI or provide a file
  - Files containing annotation for each genetic element
  - Files containing DNA sequence for each genetic element
  - MetaCyc database
- Output:
  - Pathway/genome database for the subject organism
  - Reports that summarize:
    - Evidence in the input genome for the presence of reference pathways
    - Reactions missing from inferred pathways

File Naming Conventions
- One pair of sequence and annotation files for each genetic element
- Sequence files: FASTA format
  - suffix .fsa or .fna
- Annotation file:
  - GenBank format: suffix .gbk
  - PathoLogic format: suffix .pf
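A quick check of input file names against these suffixes might look like the sketch below. The function name `classify_input_file` and the returned labels are hypothetical, and the match is deliberately case-sensitive because the slide does not say how PathoLogic treats upper-case suffixes.

```python
from pathlib import Path

# Suffixes taken from the slide; everything else here is illustrative.
SEQUENCE_SUFFIXES = {".fsa", ".fna"}
ANNOTATION_FORMATS = {".gbk": "GenBank", ".pf": "PathoLogic"}


def classify_input_file(filename: str) -> str:
    """Return a label describing what kind of PathoLogic input a file name suggests."""
    suffix = Path(filename).suffix
    if suffix in SEQUENCE_SUFFIXES:
        return "sequence (FASTA)"
    if suffix in ANNOTATION_FORMATS:
        return f"annotation ({ANNOTATION_FORMATS[suffix]})"
    return "unknown"


if __name__ == "__main__":
    for name in ["chrom1.pf", "chrom1.fsa", "/mydata/chrom2.gbk", "/mydata/chrom2.fna", "notes.txt"]:
        print(name, "->", classify_input_file(name))
```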

Typical Problems Using GenBank Files With PathoLogic
- Wrong qualifier names used: read the PathoLogic documentation!
- Extraneous information in a given qualifier
- Check results of trial parse carefully

GenBank File Format
- Accepted feature types: CDS, tRNA, rRNA, misc_RNA
- Accepted qualifiers ([req] = required, [recm] = recommended, [opt] = optional):
    /locus_tag          Unique ID               [recm]
    /gene               Gene name               [req]
    /product                                    [req]
    /EC_number                                  [recm]
    /product_comment                            [opt]
    /gene_comment                               [opt]
    /alt_name           Synonyms                [opt]
    /pseudo             Gene is a pseudogene    [opt]
    /db_xref            DB:AccessionID          [opt]
    /go_component, /go_function, /go_process    GO terms    [opt]
- For multifunctional proteins, put each function in a separate /product line

PathoLogic File Format
- Each record starts with a line containing an ID attribute
- Tab delimited
- Each record ends with a line containing //
- One attribute-value pair is allowed per line
  - Use multiple FUNCTION lines for multifunctional proteins
- Lines starting with ';' are comment lines
- Valid attributes are:
  - ID, NAME, SYNONYM
  - STARTBASE, ENDBASE, GENE-COMMENT
  - FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
  - DBLINK
  - GO
  - INTRON

PathoLogic File Format (example)
ID	TP0734
NAME	deoD
STARTBASE	799084
ENDBASE	799785
FUNCTION	purine nucleoside phosphorylase
DBLINK	PID:g3323039
PRODUCT-TYPE	P
GENE-COMMENT	similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
//
ID	TP0735
NAME	gltA
STARTBASE	799867
ENDBASE	801423
FUNCTION	glutamate synthase
DBLINK	PID:g3323040
GO	glutamate synthase (NADPH) activity [goid 0004355] [evidence IDA] [pmid 4565085]
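To sanity-check a .pf file before a trial parse, a minimal reader can be written directly from the format rules on the previous slide (tab-delimited attribute-value lines, `//` as record terminator, `;` for comments, repeated FUNCTION lines allowed). This is a sketch, not PathoLogic's own parser; the function name `parse_pf` is hypothetical, and tolerating a missing final `//` is a choice made here.

```python
from typing import Dict, Iterable, List


def parse_pf(lines: Iterable[str]) -> List[Dict[str, List[str]]]:
    """Parse PathoLogic-format lines into records.

    Each record maps an attribute name to the list of its values, so
    repeated attributes such as FUNCTION are preserved in order.
    """
    records: List[Dict[str, List[str]]] = []
    current: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith(";"):
            continue                      # blank line or comment
        if line.strip() == "//":
            if current:
                records.append(current)   # end of record
            current = {}
            continue
        attribute, sep, value = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab-delimited attribute-value pair: {line!r}")
        current.setdefault(attribute.strip(), []).append(value.strip())
    if current:
        records.append(current)           # tolerate a missing final "//"
    return records


if __name__ == "__main__":
    sample = "ID\tTP0734\nNAME\tdeoD\nSTARTBASE\t799084\nENDBASE\t799785\n//\n"
    for record in parse_pf(sample.splitlines()):
        print(record["ID"][0], record.get("NAME"))
```

Counting records and checking that every record has STARTBASE and ENDBASE values gives an independent figure to compare against the numbers reported by the trial parse.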

Before you start: What to do when an error occurs
- Most Navigator errors are automatically trapped; debugging information is saved to the error.tmp file.
- All other errors (including most PathoLogic errors) will cause the software to drop into the Lisp debugger
  - Unix: the error message will show up in the original terminal window from which you started Pathway Tools.
  - Windows: the error message will show up in the Lisp console. The Lisp console usually starts out iconified; its icon is a blue bust of Franz Liszt
- 2 goals when an error occurs:
  - Try to continue working
  - Obtain enough information for a bug report to send to the pathway-tools support team.

The Lisp Debugger
- Sample error (details and number of restart actions differ for each case):
    Error: Received signal number 2 (Keyboard interrupt)
    Restart actions (select using :continue):
     0: continue computation
     1: Return to command level
     2: Pathway Tools version 10.0 top level
     3: Exit Pathway Tools version 10.0
    [1c] EC(2):
- To generate debugging information (stack backtrace):
    :zoom :count :all
- To continue from the error, find a restart that takes you to the top level (in this case, number 2):
    :cont 2
- To exit Pathway Tools:
    :exit

How to report an error
- Determine if the problem is reproducible, and how to reproduce it (make sure you have all the latest patches installed)
- Send email to ptools-support@ai.sri.com containing:
  - Pathway Tools version number and platform
  - Description of exactly what you were doing (which command you invoked, what you typed, etc.) or instructions for how to reproduce the problem
  - error.tmp file, if one was generated
  - If the software breaks into the Lisp debugger, the complete error message and stack backtrace (obtained using the command :zoom :count :all, as described on the previous slide)

PathoLogic Command Menus
Invoking PathoLogic: Tools -> PathoLogic
- Organism
  - Select
  - Create New
  - Save KB
  - Revert KB
  - Reinitialize KB
  - Convert File KB to Oracle KB
  - Convert File KB to MySQL KB
  - Backup KB to File
  - New Version
  - Specify Reference PGDB(s)
  - Exit
- Build
  - Trial Parse
  - Automated Build
  - Update Build for Revised Annotation
- Refine
  - Assign Probable Enzymes
  - Assign Modified Proteins
  - Create Protein Complexes
  - Re-run Name Matcher
  - Rescore Pathways
  - Predict Transcription Units
  - Transport Identification Parser
  - Update Overview
  - Pathway Hole Filler

Using the PPP GUI to Create a Pathway/Genome Database
- Input Project Information
  - Organism -> Create New
  - Creates directory structure for new PGDB
  - Creates and saves empty PGDB, populated only with objects common to all PGDBs (schema classes, elements, etc.) and data you entered in the form.
  - Offers to invoke Replicon Editor

Input Project Information

Enter Replicon Information
- For each replicon
  - Name
  - Type: chromosome, plasmid, etc.
  - Circular?
  - Annotation file
  - Sequence file (optional)
  - Contigs (optional)
  - Links to other DBs (optional)
- GUI-based entry
  - Build -> Specify Replicons
- File-based entry
  - Create genetic-elements.dat file using the template provided

GUI-Based Replicon Entry

Batch Entry of Replicon Info
File /<orgid>cyc/<version>/input/genetic-elements.dat:
ID	TEST-CHROM-1
NAME	Chromosome 1
TYPE	:CHRSM
CIRCULAR?	N
ANNOT-FILE	chrom1.pf
SEQ-FILE	chrom1.fsa
//
ID	TEST-CHROM-2
NAME	Chromosome 2
ANNOT-FILE	/mydata/chrom2.gbk
SEQ-FILE	/mydata/chrom2.fna
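For genomes with many replicons, the file can be generated rather than typed. The sketch below writes records using only the attribute names visible on this slide. Two points are assumptions: that attribute and value are separated by a tab (as in the PathoLogic annotation format), and that every record, including the last, is closed with `//`. The helper name `format_genetic_elements` is hypothetical; the template shipped with Pathway Tools remains the authoritative reference.

```python
from typing import Dict, List

# Attribute names as shown on the slide, in the order they appear there.
FIELD_ORDER = ["ID", "NAME", "TYPE", "CIRCULAR?", "ANNOT-FILE", "SEQ-FILE"]


def format_genetic_elements(replicons: List[Dict[str, str]]) -> str:
    """Render replicon descriptions as genetic-elements.dat text."""
    blocks = []
    for replicon in replicons:
        if "ID" not in replicon:
            raise ValueError("every replicon needs an ID")
        lines = [f"{key}\t{replicon[key]}" for key in FIELD_ORDER if key in replicon]
        lines.append("//")   # ASSUMPTION: each record is terminated by //
        blocks.append("\n".join(lines))
    return "\n".join(blocks) + "\n"


if __name__ == "__main__":
    text = format_genetic_elements([
        {"ID": "TEST-CHROM-1", "NAME": "Chromosome 1", "TYPE": ":CHRSM",
         "CIRCULAR?": "N", "ANNOT-FILE": "chrom1.pf", "SEQ-FILE": "chrom1.fsa"},
        {"ID": "TEST-CHROM-2", "NAME": "Chromosome 2",
         "ANNOT-FILE": "/mydata/chrom2.gbk", "SEQ-FILE": "/mydata/chrom2.fna"},
    ])
    print(text, end="")
```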

Specify Reference PGDB(s)
- This step is optional, and most users will omit it
- MetaCyc is always the primary reference PGDB
- Specify an additional reference PGDB if you have your own curated PGDB which has:
  - Pathways and/or reactions that are not in MetaCyc
  - Manual functional assignments, with names similar to the current genome
- There is no point in specifying any of our PGDBs as references, only your own curated PGDBs.

Building the PGDB
- Trial Parse
  - Build -> Trial Parse
  - Check output to ensure numbers "look right"
    - Same number of gene start positions, end positions, names
    - Did my file contain EC numbers? Were they detected?
    - Did my file contain RNAs? Were they detected?
  - Fix any errors in input files
- Build pathway/genome database
  - Build -> Automated Build

PathoLogic Parser Output

Automated Build
- Parses input files
- Creates objects for every gene and gene product
- Uses EC numbers, GO annotations, and the name matcher to match enzymes to reactions in MetaCyc
- Imports catalyzed reactions and compounds from MetaCyc
- Generates a list of likely enzymes that couldn't be assigned
- Infers pathways likely to be present
- Generates Cellular Overview Diagram (first pass)
- Generates reports

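Assign Enzymes to Reactions
[Flowchart. A gene product (example: UDP-glucose-4-epimerase, EC 5.1.3.2) is matched against MetaCyc. Match = yes: assign it to the reaction (example: UDP-D-glucose <-> UDP-galactose). Match = no: is it a probable enzyme (name contains "-ase")? If no: not a metabolic enzyme. If yes: manually search, ending in either "Assign" or "Can't Assign".]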

Enzyme Name Matcher
- For names that do not match, the software identifies probable metabolic enzymes as those
  - Containing "ase"
  - Not containing keywords such as
    - "sensor kinase"
    - "topoisomerase"
    - "protein kinase"
    - "peptidase"
    - etc.
- User should research unknown enzymes
  - MetaCyc, Swiss-Prot, PubMed
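The heuristic on this slide amounts to a substring test plus an exclusion list, which the following Python sketch reproduces. Only the four keywords named on the slide are included (the slide's "etc." indicates the real list is longer), and the function name `is_probable_metabolic_enzyme` is hypothetical.

```python
# Exclusion keywords named on the slide; the real list is longer ("etc.").
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")


def is_probable_metabolic_enzyme(name: str) -> bool:
    """Flag an unmatched product name as a probable metabolic enzyme."""
    lowered = name.lower()
    if "ase" not in lowered:
        return False
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)


if __name__ == "__main__":
    for name in ["UDP-glucose 4-epimerase", "DNA topoisomerase I",
                 "sensor kinase CitA", "hypothetical protein"]:
        print(name, "->", is_probable_metabolic_enzyme(name))
```

A plain substring test also fires on words such as "phase" or "base", which is one reason the flagged names still need the manual research recommended above.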

Stored in ORGIDcyc/VERSION/reports/name-matching-report.txt

Pathway Evidence Report
- On the Organism Summary Page in the Navigator, button "Generate Pathway Evidence Report"
- Report saved as HTML file, view in browser
- Hierarchical listing of all inferred pathways
- "Pathway Glyph" shows evidence graphically
  - Steps with/without enzymes (green/black)
  - Steps that are unique to the pathway (orange)
  - Steps filled by Pathway Hole Filler (blue)
- Counts reactions in pathway, with evidence, in other pathways
- Lists other pathways that share reactions
- Link to pathway in MetaCyc

Manual Pruning of Pathways
- Use the pathway evidence report
  - Coloring scheme aids in assessing pathway evidence
- Phase I: Prune extra variant pathways
  - Rescore pathways, re-generate pathway evidence report
- Phase II: Prune pathways unlikely to be present
  - No/few unique enzymes
  - Most pathway steps present because they are used in another pathway
  - Pathway very unlikely to be present in this organism
  - Nonspecific enzyme name assigned to a pathway step

Caveats
- Cannot predict pathways not present in MetaCyc
- Evidence for short pathways is hard to interpret
- Since many reactions occur in multiple pathways, some false positives are to be expected

Output from PPP
- Pathway/genome database
- Summary pages
  - Pathway evidence page
    - Click "Summary of Organisms", then click the organism name, then click "Pathway Evidence", then click "Save Pathway Report"
  - Missing enzymes report
- Directory tree containing sequence files, reports, etc.

Resulting Directory Structure
ROOT/ptools-local/pgdbs/user/ORGIDcyc/
  VERSION/
    input/
      organism.dat
      organism-init.dat
      genetic-elements.dat
      annotation files
      sequence files
    reports/
      name-matching-report.txt
      trial-parse-report.txt
    kb/
      ORGIDbase.ocelot
    data/
      overview.graph
  released -> VERSION
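When scripting around a PGDB (for example, to collect reports after a build), the file locations can be derived from the layout on this slide. The sketch below does only that path arithmetic and touches nothing on disk. The function name `pgdb_paths` and the dictionary keys are hypothetical, and the organism ID is substituted verbatim for ORGID without any case conversion, which is an assumption.

```python
from pathlib import Path
from typing import Dict


def pgdb_paths(root: str, orgid: str, version: str) -> Dict[str, Path]:
    """Build the expected locations of key PGDB files from the slide's layout."""
    version_dir = Path(root) / "ptools-local" / "pgdbs" / "user" / f"{orgid}cyc" / version
    return {
        "genetic_elements": version_dir / "input" / "genetic-elements.dat",
        "name_matching_report": version_dir / "reports" / "name-matching-report.txt",
        "trial_parse_report": version_dir / "reports" / "trial-parse-report.txt",
        "kb": version_dir / "kb" / f"{orgid}base.ocelot",
        "overview_graph": version_dir / "data" / "overview.graph",
    }


if __name__ == "__main__":
    for label, path in pgdb_paths("/home/me", "test", "1.0").items():
        print(f"{label}: {path}")
```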

Manual Polishing
- Refine -> Assign Probable Enzymes (do this first)
- Refine -> Rescore Pathways (redo after assigning enzymes)
- Refine -> Create Protein Complexes (can be done at any time)
- Refine -> Assign Modified Proteins (can be done at any time)
- Refine -> Transport Identification Parser (can be done at any time)
- Refine -> Pathway Hole Filler
- Refine -> Predict Transcription Units
- Refine -> Update Overview (do this last, and repeat after any material changes to the PGDB)

Assign Probable Enzymes

How to find reactions for probable enzymes
- First, verify that the enzyme name describes a specific, metabolic function
- Search for a fragment of the name in MetaCyc; you may be able to find a match that PathoLogic missed
- Look up the protein in UniProt or other DBs
- Search for the gene name in a PGDB for a related organism (bear in mind that gene names are not reliable indicators of function, so check carefully)
- Search for the function name in PubMed
- Other…