Semantic Aggregation, Integration, and Inference of Pathway Data

(Pedantic Aggravation, Irritation, and Interference)
Co-Destructors: Joanne Luciano, PhD, and Jeremy Zucker
ISMB 2005 Tutorial, Detroit, Michigan, June 25th, 2005

Overview
Introduction (45 minutes)
Time Out (15 minutes)
Workshop Case Studies & Exercises (2 hrs 15 minutes)
  Subdivide into groups of triads and dyads
  Case Study I (45 minutes)
  Case Study II (45 minutes)
  Case Study III (45 minutes)
Lessons Learned (30 minutes)
Lessons Not Yet Learned (take home)

Introduction (45 minutes)
Semantic Aggregation, Integration and Inference of Pathway Data
Pathway Data (domain)
  What is it?
  What does it look like?
  Why do we care? (motivation)
Definitions & Disclaimers
Strategies

So many pathway databases, so little time.
Pathway Data (domain): What is it?
Pathway Databases
[Figure: pathway databases. Graphic from Mike Cary and Gary Bader.]

Different types of pathways (different strokes for different folks, it’s OK.)
Examples: Glycolysis, Protein-Protein, Apoptosis, Lac Operon (also include gene regulation)
The Main Categories: Molecular Interaction Networks, Gene Regulation, Metabolic Pathways, Signaling Pathways
Notes (lac operon): LacZ, the first gene of the lac operon, encodes the enzyme β-galactosidase, which breaks down lactose to galactose and glucose. Initiation of transcription of the lac operon is controlled by two mechanisms involving the lac repressor protein and CAP: (1) lactose addition increases the concentration of allolactose, which binds to the repressor protein and removes it from the DNA, and (2) glucose addition decreases the concentration of cAMP; because cAMP no longer binds to CAP, this gene activator dissociates from the DNA, turning off the operon.

Different representations of the same pathways
KEGG Reference Pathway: GLYCOLYSIS (starts at α-D-Glucose 1P)
<!ELEMENT reaction (substrate*,product*)>
<!ATTLIST reaction name %keggid.type; #REQUIRED>
<!ATTLIST reaction type %reaction-type.type; #REQUIRED>
<!ELEMENT substrate EMPTY>
<!ATTLIST substrate name %keggid.type; #REQUIRED>
<!ELEMENT product EMPTY>
<!ATTLIST product name %keggid.type; #REQUIRED>

BioCyc Reference Pathway: GLYCOLYSIS (starts at β-D-glucose 6-phosphate)
reactions.dat: This file lists all chemical reactions in the PGDB.
Attributes: UNIQUE-ID, TYPES, COMMON-NAME, ACTIVATORS, BASAL-TRANSCRIPTION-VALUE, DBLINKS, DELTAG0, DEPRESSORS, EC-LIST, EC-NUMBER, ENZYMATIC-REACTION, EQUILIBRIUM-CONSTANT, IN-PATHWAY, INHIBITORS, LEFT, MOVED-IN, MOVED-OUT, OFFICIAL-EC?, REACTANTS, REQUIREMENTS, RIGHT, SIGNAL, SPECIES, SPONTANEOUS?, STIMULATORS, SYNONYMS

Reactome Pathway: GLYCOLYSIS
<reaction name="R_alpha_D_glucose_6_phosphate_D_fructose_6_phosphate" id="R_163457">
  <listOfReactants>
    <speciesReference species="R_30537_alpha_D_Glucose_6_phosphate" />
  </listOfReactants>
  <listOfProducts>
    <speciesReference species="R_29512_D_Fructose_6_phosphate" />
  </listOfProducts>
  <listOfModifiers>
    <modifierSpeciesReference species="R_163455_glucose_6_phosphate_isomerase_dimer_name_copied_from_complex_in_Homo_sapiens_" />
  </listOfModifiers>
</reaction>
Class hierarchy (instance counts):
  DatabaseObject [41245]
    Event [8285]
      Reaction [6598]
        ConcreteReaction [4034]
        GenericReaction [2564]

BioCarta Reference Pathway: GLYCOLYSIS
Pretty, but useless: does not compute.
Reactions clickable but...
Starts at Glucose (but it doesn’t matter)

Pathway Data Why do we care?
Pathway Research has Broad Impact
  Drug Discovery (pathway of target, safety)
  Basic Science (identify pathways)
  Disease Research (cancer pathways)
  Environmental Research (microbial research)
Combine knowledge from multiple sources
  Whole is greater than the sum of its parts
  Biological knowledge is fragmented
  Need database to manage resources

Definitions & Disclaimers
Aggregation: 2 (or more) data sources, different data models, common link between (among) them.
Integration: 2 (or more) data sources, same data model, semantic mapping and instance merging required.
Inference: 1 (or more) data sources, one data model, creating new instances or new relationships. (Evidence code type kind of “inference”)
Disclaimer: “Controlled” Vocabulary scope = this tutorial

Assembling Knowledge Aggregation, Integration, Inference
“When it comes to data cleaning, there’s no such thing as a free lunch.” Tim Berners-Lee Some tasks are specific to a use case, some are common to more than one and there’s no escaping others.

Bridging Chemistry and Molecular Biology
Different views have different semantics: Lenses
When there is a correspondence between objects, a semantic binding is possible.
Example of semantic web technology being used to join (aggregate) two databases, here through Uniprot:P49841.
Apply Correspondence Rule:
  if ?target.xref.lsid == ?bpx:prot.xref.lsid
  then ?target.correspondsTo.?bpx:prot
Source: Eric Neumann, Haystack BioDASH Demo
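The correspondence rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are plain dictionaries, the field name `xref_lsid`, the record names and the LSID strings are all hypothetical, and only the UniProt accession P49841 comes from the slide.

```python
def apply_correspondence(targets, proteins):
    """Link every target to every protein that carries the same xref LSID."""
    # Index the protein records by LSID so each target needs one lookup.
    by_lsid = {}
    for prot in proteins:
        by_lsid.setdefault(prot["xref_lsid"], []).append(prot)

    links = []
    for target in targets:
        for prot in by_lsid.get(target["xref_lsid"], []):
            links.append((target["name"], "correspondsTo", prot["name"]))
    return links


# Hypothetical records; the LSID strings are illustrative, not real identifiers.
targets = [
    {"name": "target-1", "xref_lsid": "lsid:uniprot:P49841"},
    {"name": "target-2", "xref_lsid": "lsid:uniprot:no-match"},
]
proteins = [{"name": "bpx-protein-7", "xref_lsid": "lsid:uniprot:P49841"}]

links = apply_correspondence(targets, proteins)
print(links)
```

The join key is the shared identifier alone, so two records that describe the same protein under different identifiers will not be linked by this rule.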

Seamark Demonstration: Identification of new drug candidates
[Figure: graph of linked data sources (UniProt.rdf, GO.rdf, Keywords.rdf, Taxonomy.rdf, IntAct.rdf, Enzymes.rdf, OMIM.rdf, KEGG.rdf, ProbeSet.rdf, PubMed.xml and the mapping files GO2Keyword.rdf, GO2OMIM.rdf, GO2Enzyme.rdf, GO2UniProt.rdf), joined through node types such as Citation, Organism, MIM Id, Keyword, Protein, Enzyme, Gene, Probe, Pathway and Compound.]
1. Differentiate different forms of disease
2. Identify patient subgroups
3. Identify top biomarkers
4. Identify function
5. Identify biological and chemical properties and disease associations of biomarker
6. Identify documents
7. Identify role in metabolic pathways
8. Identify compounds that interact
9. Identify and compare function in other organisms
10. Identify any prior art
Example of semantic web technology being used to join (aggregate) databases.

SBML integration using BioPAX
Use BioPAX to address SBML’s data integration issues
  Different data types, same representation
  Same data, different representations
  External references…
  Synonyms…
  Provenance…

A problem: same representation different semantics (SBML)
Protein-Protein Interaction
<reaction id="pyruvate_dehydrogenase_cplx">
  <listOfReactants>
    <speciesRef species="PdhA"/>
    <speciesRef species="PdhB"/>
  </listOfReactants>
  <listOfProducts>
    <speciesRef species="Pyruvate_dehydrogenase_E1"/>
  </listOfProducts>
</reaction>
Biochemical Reaction
<reaction id="pyruvate_dehydrogenase_rxn">
  <listOfReactants>
    <speciesRef species="NADP+"/>
    <speciesRef species="CoA"/>
    <speciesRef species="pyruvate"/>
  </listOfReactants>
  <listOfProducts>
    <speciesRef species="NADPH"/>
    <speciesRef species="acetyl-CoA"/>
    <speciesRef species="CO2"/>
  </listOfProducts>
  <listOfModifiers>
    <modifierSpeciesRef species="pyruvate_dehydrogenase_E1"/>
  </listOfModifiers>
</reaction>
Example of semantic mapping: protein-protein interaction and biochemical reaction are two different types of interactions, but they are represented as one type (reaction) in SBML.

SBML annotated with BioPAX
<sbml xmlns:bp="…" xmlns:owl="…" xmlns:rdf="…">
  <listOfSpecies>
    <species id="PdhA" metaid="PdhA">
      <annotation>
        <bp:protein rdf:ID="#PdhA"/>
      </annotation>
    </species>
    <species id="NADP+" metaid="NADP+">
      <annotation>
        <bp:smallMolecule rdf:ID="#NADP+"/>
      </annotation>
    </species>
  </listOfSpecies>
  <listOfReactions>
    <reaction id="pyruvate_dehydrogenase_cplx">
      <annotation>
        <bp:complexAssembly rdf:ID="#pyruvate_dehydrogenase_cplx"/>
      </annotation>
    </reaction>
  </listOfReactions>
</sbml>
Reading the annotations: species is protein, protein is PdhA; species is small molecule, small molecule is NADP+.

BioPAX: External References
<species id="pyruvate" metaid="pyruvate">
  <annotation xmlns:bp="…">
    <bp:smallMolecule rdf:ID="#pyruvate">
      <bp:Xref>
        <bp:unificationXref rdf:ID="#unificationXref119">
          <bp:DB>LIGAND</bp:DB>
          <bp:ID>c00022</bp:ID>
        </bp:unificationXref>
      </bp:Xref>
    </bp:smallMolecule>
  </annotation>
</species>
Example of BioPAX. Example of integration.
We can use the annotation field of SBML, which is typically free-format text, to carry the BioPAX ontology as a controlled vocabulary in order to achieve aggregation. Here we point to the BioPAX namespace as the source of the controlled vocabulary that defines the terms Xref, smallMolecule, unificationXref and ID. The metaid and the rdf:ID must be identical. That is how the two representations are linked and the instance data are found.
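A minimal sketch of how a consumer might read the unification cross-reference back out of such an annotation, using Python's standard `xml.etree.ElementTree`. The BioPAX namespace URI is elided on the slide, so the `BP` value below is a hypothetical placeholder, and the function name `unification_xrefs` is ours.

```python
import xml.etree.ElementTree as ET

BP = "http://example.org/biopax#"  # hypothetical placeholder, not the real BioPAX URI
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

SPECIES_XML = f"""
<species id="pyruvate" metaid="pyruvate" xmlns:bp="{BP}" xmlns:rdf="{RDF}">
  <annotation>
    <bp:smallMolecule rdf:ID="#pyruvate">
      <bp:Xref>
        <bp:unificationXref rdf:ID="#unificationXref119">
          <bp:DB>LIGAND</bp:DB>
          <bp:ID>c00022</bp:ID>
        </bp:unificationXref>
      </bp:Xref>
    </bp:smallMolecule>
  </annotation>
</species>
"""


def unification_xrefs(species_xml):
    """Return the species metaid and its (database, identifier) pairs."""
    root = ET.fromstring(species_xml)
    ns = {"bp": BP}
    refs = []
    for xref in root.iterfind(".//bp:unificationXref", ns):
        db = xref.findtext("bp:DB", namespaces=ns)
        ident = xref.findtext("bp:ID", namespaces=ns)
        refs.append((db, ident))
    return root.get("metaid"), refs


print(unification_xrefs(SPECIES_XML))
```

Two species from different models that resolve to the same (database, identifier) pair can then be treated as the same molecule.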

BioPAX: Synonyms
<species id="pyruvate" metaid="pyruvate">
  <annotation xmlns:bp="…">
    <bp:smallMolecule rdf:ID="#pyruvate">
      <bp:SYNONYMS>2-oxo-propionic acid</bp:SYNONYMS>
      <bp:SYNONYMS>2-oxopropanoate</bp:SYNONYMS>
      <bp:SYNONYMS>BTS</bp:SYNONYMS>
      <bp:SYNONYMS>pyruvic acid</bp:SYNONYMS>
    </bp:smallMolecule>
  </annotation>
</species>
Example of integration. This example shows how multiple data sources can be linked via the synonyms. All known synonyms should be included for maximum coverage.
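As a rough illustration of linking through synonyms, the sketch below treats two records as the same compound when their name sets overlap after simple normalization. The record layout and the second and third records are hypothetical; only the pyruvate synonyms come from the slide.

```python
def normalize(name):
    """Lower-case and trim a name so trivial spelling variants compare equal."""
    return name.strip().lower()


def same_compound(record_a, record_b):
    """True when the two records share at least one normalized name."""
    names_a = {normalize(n) for n in record_a["names"]}
    names_b = {normalize(n) for n in record_b["names"]}
    return not names_a.isdisjoint(names_b)


# Synonyms from the slide.
source_a = {"id": "pyruvate", "names": [
    "pyruvate", "2-oxo-propionic acid", "2-oxopropanoate", "BTS", "pyruvic acid"]}
# Hypothetical records from a second database.
source_b = {"id": "cpd-1", "names": ["Pyruvic acid"]}
source_c = {"id": "cpd-2", "names": ["acetyl-CoA"]}

print(same_compound(source_a, source_b))
print(same_compound(source_a, source_c))
```

Short or ambiguous synonyms such as abbreviations can produce false merges, so a match of this kind is better treated as a candidate link than as proof of identity.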

Strategies
How do we get to a Standard Pathway Representation? (Game plan: take over the world, or have the world take over itself?)
  Develop bridging technologies
  Develop a pathway representation standard within the Life Science community (BioPAX) (Social Engineering!)
  Utilize Semantic Web integration technologies (LSID, RDF/OWL)

Exchange Formats in Pathway Data Space (Scope)
[Figure: coverage of exchange formats in the pathway data space. Database exchange formats: BioPAX and PSI-MI 2; simulation model exchange formats: SBML, CellML. Regions shown: Genetic Interactions; Molecular Interactions (Pro:Pro, All:All); Interaction Networks (Molecular, Non-molecular; Pro:Pro, TF:Gene, Genetic); Regulatory Pathways (Low Detail to High Detail); Metabolic Pathways (Low Detail to High Detail); Biochemical Reactions; Small Molecules; Rate Formulas. Graphic from Mike Cary & Gary Bader.]
To design bridging technologies, we look at the landscape. This slide shows the coverage of exchange formats in the pathway data space, highlighting the focus of their different domains. BioPAX Level 1 is broad and shallow, allowing us to quickly build a format that can represent most of the existing data types in databases today. Details will be added in subsequent levels. This is a practical approach. Pathway databases and tools are not considered here, although each one is important and addresses specific use cases.

BioPAX Objectives
Accommodate existing database representations
Integration and exchange of pathway data
Interchange through a common (standard) representation
Provide a basis for future databases
Enable development of tools for searching and reasoning over the data
Notes: Many other databases and tools have also been published, e.g. PATIKA. We would like to learn from these as well. Importantly, an exchange format needs to be a compatible superset of the databases, so BioPAX will copy certain structures from the databases in order to support them. A good reaction to BioPAX from a group that has developed its own ontology is that BioPAX looks like their specification. Designed by the databases for themselves (DB-DB exchange) and for users. The group emerged from this community, at this very meeting two years ago.

BioPAX Motivation
>180 DBs and tools
[Figure: Application, Database and User, shown before BioPAX and with BioPAX.]
Designed by the databases for themselves (DB-DB exchange) and for users.
A common format will make data more accessible, promoting data sharing and distributed curation efforts.

BioPAX Biological PAthway eXchange
A data exchange ontology and format for biological pathway integration, aggregation and inference Initiative arose from the community BioPAX will provide a consistent format for pathway data so it will be easier for consumers of pathway data (e.g. tool developers, DB curators) to integrate data from multiple sources.

Biological pathways of the Cell
What is a Pathway?
Examples: Glycolysis, Protein-Protein, Apoptosis, Lac Operon (also include gene regulation)
Categories: Molecular Interaction Networks, Metabolic Pathways, Signaling Pathways, Gene Regulation
[Figure labels: BioPAX Level 1, BioPAX Level 2]

Aggregation, Integration, Inference
Multiple kinds of pathway databases
  metabolic
  molecular interactions
  signal transduction
Constructs designed for integration
  DB References
  XRefs (Publication, Unification, Relationship)
  synonyms
  provenance
OWL DL: to enable reasoning

BioPAX Biochemical Reaction
OWL (schema) Instances (Individuals) (data) phosphoglucose isomerase

BioPAX Ontology Conceptual framework based upon existing DB schemas:
aMAZE, BIND, EcoCyc, WIT, KEGG, Reactome, etc. Allows wide range of detail, multiple levels of abstraction BioPAX ontology in OWL (XML) Designed for pathway database integration Database ID Unification X-REF Relationship X-REF Publication X-REF Synonyms Provenance

BioPAX uses other ontologies
Use pointers to existing ontologies to provide supplemental annotation where appropriate
  Cellular location → GO Component
  Cell type → Cell.obo
  Organism → NCBI taxon DB
Incorporate other standards where appropriate
  Chemical structure → SMILES, CML, InChI

BioPAX Ontology: Overview
Level 1 v1.0 (July 7th, 2004)
[Figure: ontology overview. Labels: a set of interactions; how the parts are known to interact; parts.]
This slide puts it all together.

BioPAX Ontology: Top Level
[Figure: Entity with Pathway, Interaction and Physical Entity beneath it. Legend: Subclass (is a), Contains (has a). Graphic from Gary Bader.]
Pathway: a set of interactions. E.g. Glycolysis, MAPK, Apoptosis
Interaction: a set of entities and some relationship between them. E.g. Reaction, Molecular Association, Catalysis
Physical Entity: a building block of simple interactions. E.g. Small molecule, Protein, DNA, RNA
Notes: We may end up calling parts ‘Interactor’ instead of ‘Part’ to be compatible with the PSI format. Pathway is a name for a set of interactions. The cell is really one large network, but we tend to organize the network into pathways for our own understanding of it. For this reason, it is important to be able to take a set of interactions and give it a name (and other description).

BioPAX Ontology: Root Root class: Entity
Any concept referred to as a discrete biological unit when describing pathways. This is the root class for all biological concepts in the ontology, which include pathways, interactions and physical entities

Metabolic Pathways: Interaction sub-classes (participants)
Definition: An entity that defines a single biochemical interaction between two or more entities. An interaction cannot be defined without the entities it relates. In future BioPAX levels, this list may be extended to include other classes, such as genetic interactions.

Metabolic Pathways Interaction sub-classes
Definition: Two terms exist under interaction: control and conversion. In future BioPAX levels, this list may be extended to include other classes, such as genetic interactions.
Examples: Enzyme catalysis controls a biochemical reaction, transport catalysis controls transport, and a small molecule that inhibits a pathway by an unknown mechanism controls the pathway.

BioPAX as a solution to Aggregation, Integration, Inference
Multiple kinds of pathway databases
  metabolic
  molecular interactions
  signal transduction
  gene regulatory
Constructs designed for integration
  DB References
  XRefs (Publication, Unification, Relationship)
  Synonyms
  Provenance (not yet implemented)
OWL DL: to enable reasoning
Notes: Be careful when using the word provenance. It turns out that there is an official way to do provenance in RDF, and we need to think more deeply about how to do this in BioPAX. Larry Hunter talks about this quite knowledgeably.

Time Out (15 minutes)

Workshop Case Studies & Exercises (2 hrs 15 minutes)
Break into groups of triads and dyads
Case Study I (45 minutes)
  Use Case 1: Inference of a Metabolic Flux Model from an Annotated Genome
  Group Exercise 1
Case Study II (45 minutes)
  Use Case 2: Integration of a metabolic flux model from two sources
  Group Exercise 2
Case Study III (45 minutes)
  Use Case 3: Multi-source aggregation, Validation and Testing
  Group Exercise 3

Methodology Define the goal of the integration
How will the integrated data be used?
  This defines the level of integration, from syntactic through semantic
Take stock of current resources
  This defines your starting point
  Database sources, programmers, lab access, collaborators
Scope the work to get from B to A
  Data Profiling
  Resource Profiling

3 Case Studies
Case study I: Semantic Inference of metabolic pathway data from an annotated genome.
Case study II: Semantic Integration of a metabolic flux model from two sources.
Case study III: Semantic Aggregation of pathway data from multiple sources.

Case Study I: Inference of a Metabolic Flux Model from an Annotated Genome
Objective: To apply biological knowledge to constrain the possible behaviors of a metabolic network.
Resources: Annotated genome, Transport DB, pathway databases, experimental community, published literature
Metabolic flux analysis is a rigorous test of the knowledge contained in a pathway database. Failure of a metabolic flux model to accurately predict the growth rate of a single-cell organism under known nutrient conditions indicates that the database may be incomplete or contain incorrect data. Through the semantic aggregation, integration and inference of pathway data, we hope to discover those errors and correct them wherever possible.

Genes make RNA make Protein
[Figure: Gene5 to Gene9 are transcribed to RNA 5 to RNA 9, which are translated to proteins P5 to P9. Legend: Gene, RNA, Protein, Transporter, Enzyme; arrows: Transcription, Translation.]

Proteins catalyze biochemical reactions
[Figure: proteins P1 to P9 catalyzing reactions among metabolites A to F in the periplasm and cytoplasm. Legend: Metabolites A-F, Transporter, Enzyme, Catalyzes, Reaction.]

Biochemical reactions comprise a metabolic network
[Figure: metabolic network of metabolites A to F and reactions R1 to R9. Uptake: R1 and R5; Biomass: R8; Waste: R9. Legend: Exchange, Intracellular, Objective.]
Given: for every metabolite, the sum of the fluxes that produce it equals the sum of the fluxes that consume it.
Given uptake limits on R1 and R5, compute the maximum flux through R8.

Metabolic Inference Subgoals
Infer genes from sequence and homology
Infer enzymatic reactions from Enzyme Commission (EC) numbers
Infer metabolic reaction network from enzymatic reactions and metabolites
Infer pathway holes using network debugging algorithms
Propose candidate enzymes using pathway-hole filling algorithms
Add experimentally verified candidates to the annotated genome
Lather, rinse, repeat
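Two of these subgoals, inferring enzymatic reactions from EC numbers and spotting pathway holes, can be sketched as simple lookups. This is a toy illustration and not the algorithm used by any particular tool: the gene names are hypothetical, and the three EC numbers are consecutive glycolysis steps (5.3.1.9 glucose-6-phosphate isomerase, 2.7.1.11 6-phosphofructokinase, 4.1.2.13 fructose-bisphosphate aldolase).

```python
def infer_catalysts(annotated_genes):
    """Map each EC number found in the annotation to the genes that carry it."""
    catalysts = {}
    for gene, ec_numbers in annotated_genes.items():
        for ec in ec_numbers:
            catalysts.setdefault(ec, []).append(gene)
    return catalysts


def pathway_holes(pathway_ecs, catalysts):
    """Return the pathway steps for which no annotated gene supplies an enzyme."""
    return [ec for ec in pathway_ecs if ec not in catalysts]


# Hypothetical genome annotation: gene name -> EC numbers assigned to it.
annotated_genes = {
    "gene_a": ["5.3.1.9"],
    "gene_b": ["4.1.2.13"],
    "gene_c": [],  # annotated, but with no enzymatic function
}
glycolysis_steps = ["5.3.1.9", "2.7.1.11", "4.1.2.13"]

catalysts = infer_catalysts(annotated_genes)
holes = pathway_holes(glycolysis_steps, catalysts)
print(holes)
```

Each reported hole is a candidate for the hole-filling step, where an enzyme is proposed and then checked experimentally before it is added to the annotation.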

Data Profiling of the Annotated Genome
Orphaned genes
Orphaned enzymes
Misannotated genes
Misannotated enzymes
Sequencing errors
BLAST algorithm errors

Schema Level Errors Biochemical reaction Biochemical reaction
Enzyme (protein) that catalyzes the biochemical reaction Gene that codes for the gene product (protein enzyme)

Semantic bugs revealed by chemical structure
EcoCyc 7.5 Pathway: Riboflavin and FMN and FAD biosynthesis No place to go! 4-(1-D-ribitylamino)-5-amino-2,6-dihydroxypyrimidine:

Semantic bugs revealed by chemical structure
EcoCyc 8.0 Pathway: Riboflavin and FMN and FAD biosynthesis Synonyms 4-(1-D-ribitylamino)-5-amino-2,6-dihydroxypyrimidine:

Data Profiling of Pathway/Genome Database
Unbalanced reactions
Pathway holes
Unproducible metabolites
Generalized metabolites
Unconsumable metabolites (toxins)
Notes: Add more slides on network debugging algorithms that infer pathway holes. Add slides on algorithms that fill pathway holes.

Bugs in Network structure revealed by Forward and Backward chaining
[Figure: forward chaining from the known nutrient set toward biomass. Legend: Known nutrient set, Fired reaction, Unfired reaction, Essential compounds, Missing essential compound, Biomass.]

Bugs in Network structure revealed by Forward and Backward chaining
[Figure: backward chaining from biomass toward its precursors. Legend: Unproduced metabolite, Precursor metabolite, Essential compounds, Missing essential compound, Biomass.]
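Forward chaining can be sketched as follows: start from the nutrient set, fire every reaction whose reactants are all available, add its products, and repeat until nothing changes. Essential compounds that are never produced point to a bug in the network. The sketch assumes the intracellular reactions R2, R3, R4, R6 and R7 of the tutorial's toy network, ignores stoichiometric coefficients, and uses a hypothetical choice of essential compounds; backward chaining is not shown.

```python
# Reaction name -> (reactants, products), taken from the toy network.
REACTIONS = {
    "R2": ({"A"}, {"B"}),
    "R3": ({"A"}, {"C"}),
    "R4": ({"B", "E"}, {"D"}),
    "R6": ({"B"}, {"C", "F"}),
    "R7": ({"C"}, {"D"}),
}


def forward_chain(nutrients, reactions):
    """Fire reactions until no new metabolite becomes available."""
    available = set(nutrients)
    fired = set()
    changed = True
    while changed:
        changed = False
        for name, (reactants, products) in reactions.items():
            if name not in fired and reactants <= available:
                fired.add(name)
                available |= products
                changed = True
    return available, fired


def missing_essentials(nutrients, reactions, essentials):
    """Essential compounds that cannot be produced from the nutrient set."""
    available, _ = forward_chain(nutrients, reactions)
    return set(essentials) - available


nutrients = {"A", "E"}
essentials = {"D", "F"}  # hypothetical essential compounds
print(missing_essentials(nutrients, REACTIONS, essentials))

# Simulate a pathway hole by deleting R2: B, and therefore F, is never produced.
with_hole = {name: rxn for name, rxn in REACTIONS.items() if name != "R2"}
print(missing_essentials(nutrients, with_hole, essentials))
```

The unfired reactions in the second run (R4 and R6) show where the chain stalls, which narrows the search for the missing enzyme.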

Case study II: Integration of a metabolic flux model from two sources
What is metabolic flux analysis? How does one build a metabolic flux model? What can go wrong in building a metabolic flux model?

What is Metabolic Flux Analysis?
Starts with the metabolic network
Assumes steady-state behavior
Constrain with thermodynamics
Add nutrient conditions
Choose an objective: biomass growth
Predicts growth rate for mutant and wild-type organisms under different conditions.
Notes: PathoLogic parses annotated genome files from GenBank, looking for annotation tags such as accession number, EC number, etc. It infers proteins, enzyme complexes, biochemical reactions, pathways and metabolic fluxes. PathoLogic also parses FASTA sequence to infer transcription units (operons). It also infers the missing enzymes that appear as gaps in metabolic pathways using a ‘hole filling’ algorithm, which draws on metabolic reactions from other organisms (multi-tier evidence of which enzymes act within which reactions of which pathways).

Start with the metabolic network
[Figure: metabolic network of metabolites A to F with fluxes v1 to v9. Uptake: v1 and v5; Objective: v8; Waste: v9. Flux legend: Exchange, Intracellular, Objective.]

Stoichiometric Matrix: Representation of the metabolic network
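Reactions:
  R1: → A
  R2: A → B
  R3: A → C
  R4: B + E → 2D
  R5: → E
  R6: 2B → C + F
  R7: C → D
  R8: D →
  R9: F →
Stoichiometric matrix (rows: metabolites; columns: reactions):
     R1  R2  R3  R4  R5  R6  R7  R8  R9
A    +1  -1  -1   0   0   0   0   0   0
B     0  +1   0  -1   0  -2   0   0   0
C     0   0  +1   0   0  +1  -1   0   0
D     0   0   0  +2   0   0  +1  -1   0
E     0   0   0  -1  +1   0   0   0   0
F     0   0   0   0   0  +1   0   0  -1
Given: for every metabolite, the sum of the fluxes that produce it equals the sum of the fluxes that consume it.
Given uptake limits on R1 and R5, compute the maximum flux through R8.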
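The matrix can be built mechanically from the reaction list, and the steady-state condition then says that the matrix times the flux vector is zero for every metabolite. The sketch below uses only the standard library; the candidate flux vectors are hypothetical values chosen for illustration.

```python
METABOLITES = ["A", "B", "C", "D", "E", "F"]

# Reaction -> stoichiometric coefficients (negative = consumed, positive = produced).
REACTIONS = {
    "R1": {"A": +1},
    "R2": {"A": -1, "B": +1},
    "R3": {"A": -1, "C": +1},
    "R4": {"B": -1, "E": -1, "D": +2},
    "R5": {"E": +1},
    "R6": {"B": -2, "C": +1, "F": +1},
    "R7": {"C": -1, "D": +1},
    "R8": {"D": -1},
    "R9": {"F": -1},
}


def stoichiometric_matrix(metabolites, reactions):
    """One row per metabolite, one column per reaction (in dictionary order)."""
    return [[coeffs.get(m, 0) for coeffs in reactions.values()] for m in metabolites]


def net_production(matrix, fluxes):
    """Net rate of change of each metabolite for the given flux vector."""
    return [sum(s * v for s, v in zip(row, fluxes)) for row in matrix]


S = stoichiometric_matrix(METABOLITES, REACTIONS)

# Hypothetical flux vectors (v1..v9), chosen for illustration.
balanced = [10, 10, 0, 10, 10, 0, 0, 20, 0]
unbalanced = [10, 5, 0, 10, 10, 0, 0, 20, 0]

print(net_production(S, balanced))    # all zeros: a steady state
print(net_production(S, unbalanced))  # nonzero entries: A accumulates, B is depleted
```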

What is a metabolic flux?
Source fluxes Metabolite Pool Sink fluxes

For a reaction of stoichiometry R2: A → B, the rate of reaction, or flux, is equal to:
For a reaction of stoichiometry R4: B + E → 2D, the rate of reaction, or flux, is equal to:

At steady-state, nonlinear dynamics simplify to linear fluxes.
[Figure: Aext is taken up to A (rate constant k1, protein P1, flux v1); A is converted to B (k2, P2, v2) and to C (k3, P3, v3).]

At steady-state, the sum of the fluxes that produce a metabolite is equal to the sum of the fluxes that consume it.
[Figure: Aext → A via v1; A → B via v2; A → C via v3.]

Stoichiometric Matrix: more unknowns than equations
[Stoichiometric matrix: rows are the metabolites A to F, columns are the fluxes v1 to v9, giving six balance equations in nine unknown fluxes.]
Given: for every metabolite, the sum of the fluxes that produce it equals the sum of the fluxes that consume it.
Given uptake limits on R1 and R5, compute the maximum flux through R8.

How to determine the metabolic capabilities of a network?
[Figure: metabolic network of metabolites A to F with fluxes v1 to v9. Uptake: v1 and v5; Biomass: v8; Waste: v9. Flux legend: Exchange, Intracellular, Objective.]
Given: for every metabolite, the sum of the fluxes that produce it equals the sum of the fluxes that consume it.
Given uptake limits on R1 and R5, compute the maximum flux through R8.

Using Elementary modes to study the steady state-behavior
[Figure: the network with fluxes v1 to v9, its stoichiometric matrix (rows A to F, columns v1 to v9), and sub-networks highlighting individual elementary modes.]
All metabolic capabilities in steady states are composed of elementary flux modes, which are the minimal sets of enzymes that can each generate valid steady states.

How to make predictions about the behavior of the metabolic network?
[Figure: metabolic network of metabolites A to F with fluxes v1 to v9. Uptake: v1 and v5; Biomass: v8; Waste: v9. Flux legend: Exchange, Intracellular, Objective.]
Given: for every metabolite, the sum of the fluxes that produce it equals the sum of the fluxes that consume it.
Given uptake limits on R1 and R5, compute the maximum flux through R8.

Optimal wild-type flux distribution
[Figure: the network annotated with the optimal growth flux distribution. Flux labels on the slide: 10 (four fluxes) and 20.]
Steady-state assumption. Given uptake limits on R1, compute the maximum flux through R8.
What results is a genome-scale metabolic flux model that is capable of growth rate and internal flux predictions.

Optimal mutant flux distribution
[Figure: the network with R4 blocked (STOP). Flux labels on the slide: 10 (four fluxes).]
By knocking out the gene whose product catalyzes R4, what is the maximum flux we can obtain through R8?
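For a network this small, the wild-type and knockout questions can be answered by brute force. The sketch below is not a linear-programming solver of the kind used for real flux balance analysis: it enumerates integer values of three free fluxes, derives the rest from the steady-state balances, and keeps the distribution with the largest R8. It assumes that all reactions are irreversible and that both uptake limits are 10, a value read off the flux labels on the slides.

```python
def max_biomass_flux(uptake_limit=10, knocked_out=()):
    """Enumerate steady-state flux distributions and return the one maximizing R8."""
    best = None
    for v2 in range(uptake_limit + 1):
        for v3 in range(uptake_limit - v2 + 1):   # keeps R1 = v2 + v3 within its limit
            for v4 in range(v2 + 1):              # keeps R6 non-negative and R5 <= limit
                v6 = (v2 - v4) / 2                # B balance: v2 = v4 + 2 * v6
                v7 = v3 + v6                      # C balance
                fluxes = {
                    "R1": v2 + v3,                # A balance
                    "R2": v2, "R3": v3, "R4": v4,
                    "R5": v4,                     # E balance
                    "R6": v6, "R7": v7,
                    "R8": 2 * v4 + v7,            # D balance
                    "R9": v6,                     # F balance
                }
                if any(fluxes[r] != 0 for r in knocked_out):
                    continue                      # a knocked-out reaction must carry no flux
                if best is None or fluxes["R8"] > best["R8"]:
                    best = fluxes
    return best


wild_type = max_biomass_flux()
mutant = max_biomass_flux(knocked_out=("R4",))
print(wild_type["R8"], mutant["R8"])
```

Under these assumptions the wild type routes all uptake through R2 and R4, while the R4 knockout has to fall back on the R3 and R7 branch and loses half of the biomass flux.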

Suboptimal mutant flux distribution
[Figure: the network with R4 blocked (STOP). Flux labels on the slide: 10 (one flux), 6.7 (three fluxes) and 3.3 (three fluxes).]
Minimization of metabolic adjustment: minimize the difference between the wild-type flux distribution and the mutant flux distribution.
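The idea can be checked numerically on the toy network. The sketch below measures the Euclidean distance between flux vectors (v1 to v9) and searches a one-parameter family of knockout steady states on a grid. It rests on several assumptions: the wild-type vector and the uptake value of 10 are our reading of the slide labels, the uptake flux v1 is held at its limit, and a grid search stands in for the quadratic programming normally used for this minimization.

```python
import math

# Flux vectors are ordered v1..v9. Values are our reading of the slide labels.
WILD_TYPE = [10, 10, 0, 10, 10, 0, 0, 20, 0]
FBA_MUTANT = [10, 0, 10, 0, 0, 0, 10, 10, 0]   # growth-optimal R4 knockout


def mutant_fluxes(v2, uptake=10):
    """Steady state of the R4 knockout for a given v2, with v1 held at the uptake limit."""
    v3 = uptake - v2          # A balance
    v6 = v2 / 2               # B balance with v4 = 0
    v7 = v3 + v6              # C balance
    return [uptake, v2, v3, 0, 0, v6, v7, v7, v6]


def distance(a, b):
    """Euclidean distance between two flux vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Grid search over v2 in steps of 0.01 for the knockout state closest to wild type.
candidates = [mutant_fluxes(i / 100) for i in range(0, 1001)]
moma = min(candidates, key=lambda v: distance(v, WILD_TYPE))

print([round(v, 1) for v in moma])
print(round(distance(moma, WILD_TYPE), 2), round(distance(FBA_MUTANT, WILD_TYPE), 2))
```

The closest knockout state keeps part of the flux on the R2 branch and so grows more slowly than the growth-optimal knockout, which is the suboptimal behavior the slide title refers to.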

Case II: Palsson JR904 good flux balance model implicit schema
Literature-curated biochemical reactions
904 enzymatic reactions
Gene, enzyme-reaction associations

Case II: What sources of data are available to build a Metabolic Flux model?
Annotated Genome Literature Pathway Databases Experimental measurements

Model vs. Exper., Glucose limited
(fluxes in [mmol/gr DM h] normalized to glucose uptake flux)
(Segrè, Vitkup and Church, PNAS 2002)

[Figure: three panels (Low Glucose Limited, High Glucose Limited, Nitrogen Limited), each with v_i (exper) on the horizontal axis. Corr. coeff. = 0.91, 0.97 and 0.78.]

[Figure: Max growth (optimal) vs. Min Adjust. (suboptimal). Scatter plot of v_i (theor) against v_i (exper) for 17 numbered fluxes. Corr. coeff. = 0.564, P-value = 0.007.]

The power of a model lies in its ability to distinguish between competing hypotheses

Case II: EcoCyc good schema Flux balance model doesn’t work

What happens if the steady-state behavior of the model fails to reproduce the steady-state behavior of the organism?
[Figure: pipeline components: Genome, PathoLogic, Pathway/Genome Database, Transporter prediction, BioCyc to SBML, Nutrients & Objective, Model Definition (SBML), FBA & MOMA, Flux prediction, Network Debugging.]
The key insight that drives this project lies not in what these models can predict correctly, but in what they predict incorrectly, for by rigorously examining the assumptions underlying that model, new knowledge can be gained.

Case II: EcoCyc/JR904 Best of both worlds
Biological objective: from nutrients, create all essential compounds required for growth.
True test of metabolic databases: is the data good enough to predict growth rate under different nutrient conditions and the effect of gene knockouts?

Case II: Schema level integration
Translation from BioCyc ontology to BioPAX ontology
Translation of implicit JR904 schema to BioPAX ontology
Integration of JR904 concepts with BioPAX ontology (flux limits)

Case II: Instance level
EcoCyc <-> JR904 Gene names
EcoCyc <-> JR904 Enzyme names
EcoCyc <-> JR904 Reaction names
EcoCyc <-> JR904 Reversibility/flux limits
EcoCyc <-> JR904 Gene->protein associations
EcoCyc <-> JR904 protein->enzyme complex associations
EcoCyc <-> JR904 enzyme->reaction

Data Profiling of Flux Model
Incorrect constraints (reversibility)
Incorrect nutrient conditions
Incorrect biomass composition
Incorrect protein function predictions

Data profiling of Flux Predictions
Incorrect hypothesis (FBA vs MOMA vs ROOM)
Incorrect network architecture (gene knockouts)
Incorrect modeling assumptions (steady-state assumption, gene expression profiles)

Fixing the problems you find
Requires different amounts of time, money, and expertise
  Enzyme Genomics project
  Community annotation projects
  Adopt-a-Genome project
  High-throughput experiments
  Pathway hole filling algorithms

Case III: Semantic Aggregation Case study
Prochlorococcus marinus MED4
  Most abundant species in the ocean
  Responsible for a significant portion of photosynthetic carbon fixation
  Iron hypothesis: possible solution to global warming?
Need to understand details of metabolic network

Case III: Multi-source aggregation
Public
  KEGG (metabolism)
  BioCyc (metabolism)
  WIT (metabolism)
  TransportDB (transport proteins)
Local
  RNA expression (microarrays)
  Protein expression (mass spec)

Case III: Goal
Constrain metabolic flux model with experimental measurements:
  RNA expression
  Protein expression
  Metabolite concentrations
  Flux measurements

Case III: Aggregation Problems
Higher Level: Orphan enzymes
Schema Level: Bridge ontologies
Instance Level: Object identity problem
Simulation Level: Underdetermined system
Solution: BioPAX -> BioWarehouse semantic test suite for data validation

Case III: Multi-source aggregation Validation and Testing
Joint learning from multiple sources
Semantic test suite for data validation
Network debugging algorithms

Time Out (15 minutes)

Lessons Learned (30 minutes)
What did you learn?
Discussion
“A good representation is the key to good problem solving” -- Patrick Winston
“Standard is better than best” -- Gerald J Sussman
“The great thing about standards is that there are so many from which to choose” -- Unknown
“Above all, one must develop a feeling for the organism.” -- Barbara McClintock
“Someone does it once, everybody benefits.” -- Eric Miller, W3C Semantic Web Activity Lead
Remember people, process, technology. However, without people there isn’t any process or technology, so it’s all social engineering.

Lessons Not Yet Learned (Take home exercise)

Feedback
Our goal is to have you walk away with a clear understanding of how to approach any database integration project.
To provide:
  A methodology to scope and plan the project
  An understanding of what to expect
  Some specific examples to illustrate what is common to all integration projects (data cleaning) and what is specific to a particular task (i.e. to provide you with examples to give a sense of it)
  Some first-hand experience of pedantic aggravation, irritation and interference
How did we do? Please let us know how we can improve this tutorial.

Thank You Joanne & Jeremy
