A knowledge based approach for representing, reasoning and hypothesizing about biochemical networks Chitta Baral Arizona State University.

Three parts to the talk • Prediction, Explanation and Planning with respect to biochemical networks • Hypothesis Generation with respect to biochemical networks • Collaborative BioCuration: CBioC

Motivation: purpose of interaction databases? Suppose: We have an almost exhaustive database of the intracellular interactions (protein-protein, metabolic, etc.) of particular cells. What next? How will we use this database? What if our knowledge is incomplete?

Motivation: Uses of networks & pathways Visualize the pathways Analyze the graphs of the networks Compare graphs of the networks Use pathway data in conjunction with microarray data analysis Do system level simulation Is that all?

Motivation: ultimate uses! Prediction/System Simulation (Systems Biology?) • Impact of particular perturbations (say caused by a drug that introduces certain proteins to the cell membrane or into the cell) • Do the perturbations have the desired impact? • Do they mess up something else? (side effects!) But that’s not all!

Motivation: Explaining observations A phenotypical observation (leading to) OR • an observation that a particular protein or chemical has an abnormally high concentration. What is wrong? What is out of the ordinary? The cause/explanation will give us approaches to fix the problem. How deep should the explanations go? How do we compare explanations?

Motivation: Designing drugs & therapies What perturbations (when and where) need to be made so as to make the cell behave in a particular way? In case of cancer: prevent proliferation, induce apoptosis, prevent migration, etc.

What if knowledge is incomplete? What kind of useful reasoning can we do with incomplete knowledge? Drug makers don’t wait till full knowledge is available. Answer: hypothesis formation

Motivation: Use summary The ultimate uses of signaling (metabolic, etc.) interaction databases are to do: • Prediction – therapy verification; determining side effects. • Explanation – diagnosing what is wrong. • Planning – therapy and drug design. Intermediate or immediate use • Generate hypotheses

Initial goal of our research Use knowledge representation and reasoning techniques to: • Represent interactions • Reason about these interactions: prediction, explanation, planning and hypothesis formation.

Some questions Isn’t it a little premature? • We know very little about the networks • New knowledge is being constantly added Why knowledge representation and reasoning? • Why not simulation? • Why not use Petri nets, π-calculus? Why a knowledge-based approach? Why not a database approach? What’s the difference?

Our approach: present and future Yes, prediction is kind of the same as simulation • Incompleteness of information is an issue though! But it is hard to do explanation generation or design of therapies (planning) using simulation – guesses can be verified using simulation though. The core database query languages cannot express explanation or planning queries. Dealing with incompleteness!

Dealing with incompleteness – ongoing and future work Dealing with incompleteness is one of the key criteria behind a ‘good’ knowledge representation language when building AI systems. • The language needs to be non-monotonic. • The language needs to be elaboration tolerant. Proper analysis leads to hypothesizing • If certain observations cannot be satisfactorily explained by the existing knowledge about the network, then use general biological knowledge to hypothesize

Motivation -- summary Goal: To emulate the abstract reasoning done by biologists, medical researchers, and pharmacology researchers. Types of reasoning: prediction, explanation, planning and hypothesis formation. Current systems biology approaches: mostly prediction. Ongoing issues: Dealing with incomplete knowledge and elaboration tolerance.

Related Works Quantitative approaches. (hybrid systems, use of differential equations) Graphical representations. Other qualitative approaches. • Petri Nets • π-calculus • Pathway Logic • Model Checking

Overview of our approach Represent signal network as a knowledge base that describes • actions/events (biological interactions, processes). • effect of these actions/events. • triggering conditions of the actions/events. To query using the knowledge base: • Prediction; explanation; planning; hypothesis generation BioSigNet-RR (Biological Signal Network - Representation and Reasoning) and BioSigNet-RRH systems.

Foundation behind our approach Research on representing and reasoning about dynamic systems (space shuttles, mobile robots, software agents) • causal relations between properties of the world • effects of actions (when can they be executed) • goal specification • action-plans Research on knowledge representation, reasoning and declarative problem solving – the AnsProlog language.

An NFκB signaling pathway

Syntax by example • bind(TNF-α,TNFR1) causes trimerized(TNFR1) • trimerized(TNFR1) triggers bind(TNFR1,TRADD)

General syntax to represent networks • e causes f if f_1; …; f_k • g_1; …; g_k causes g • h_1; …; h_m n_triggers e • k_1; …; k_l triggers e • r_1; …; r_l inhibits e Here e is an event (also referred to as an action) and the rest are fluents (properties of the cell). For metabolic interactions: • e converts g_1; …; g_k to f_1; …; f_k if h_1; …; h_m
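
One way to picture these statement forms is as plain records. The sketch below is a hypothetical Python encoding, not the data model used by BioSigNet-RR; the class names `Causes`, `Triggers` and `Inhibits` and their fields are assumptions introduced only for illustration. It encodes the two statements from the "Syntax by example" slide and prints them back in the slide's textual syntax.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Causes:
    """'e causes f if f_1; ...; f_k' (hypothetical record, for illustration)."""
    action: str
    effect: str
    conditions: Tuple[str, ...] = ()

    def __str__(self) -> str:
        text = f"{self.action} causes {self.effect}"
        if self.conditions:
            text += " if " + "; ".join(self.conditions)
        return text


@dataclass(frozen=True)
class Triggers:
    """'k_1; ...; k_l triggers e'."""
    conditions: Tuple[str, ...]
    action: str

    def __str__(self) -> str:
        return f"{'; '.join(self.conditions)} triggers {self.action}"


@dataclass(frozen=True)
class Inhibits:
    """'r_1; ...; r_l inhibits e'."""
    conditions: Tuple[str, ...]
    action: str

    def __str__(self) -> str:
        return f"{'; '.join(self.conditions)} inhibits {self.action}"


# The two statements from the "Syntax by example" slide.
NETWORK = [
    Causes("bind(TNF-α,TNFR1)", "trimerized(TNFR1)"),
    Triggers(("trimerized(TNFR1)",), "bind(TNFR1,TRADD)"),
]

for statement in NETWORK:
    print(statement)
```

Keeping each statement as an immutable record makes a network a simple list that can be extended one statement at a time, which is in the spirit of elaboration tolerance.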

Semantics: queries and entailment Observation part of queries • f at t • a occurs_at t Given the network N and observation O • Predict if a temporal expression holds. • Explain a set of observations. • Plan to achieve a goal.

Importance of a formal semantics Besides defining prediction, explanation and planning, it is also useful in identifying: • Under what restrictions the answer given by a given (graph based) algorithm will be correct. (soundness!) • Under what restrictions a given (graph based) algorithm will find a correct answer if one exists. (completeness!)

Utility of declarative programming languages (such as AnsProlog) Allows for quick implementation of the semantics • The specification or the definition of what is an explanation, or what is a plan, becomes a program that finds explanations and plans respectively.

Prediction Given some initial conditions and observations, to predict how the world would evolve or predict the outcome of (hypothetical) interventions.

Back to the example Binding of TNF-α with TNFR1 leads to TRADD binding with one or more of TRAF2, FADD, RIP. TRADD binding with TRAF2 leads to over-expression of FLIP provided NIK is phosphorylated on the way. TRADD binding with RIP inhibits phosphorylation of NIK. TRADD binding with FADD in the absence of FLIP leads to cell death.

Prediction 1 Binding of TNF-α with TNFR1 leads to TRADD binding with one or more of TRAF2, FADD, RIP. TRADD binding with TRAF2 leads to over-expression of FLIP provided NIK is phosphorylated on the way. TRADD binding with RIP inhibits phosphorylation of NIK. TRADD binding with FADD in the absence of FLIP leads to cell death. Initial Condition • bind(TNF-α,TNF-R1) occurs at t0 Query • predict eventually apoptosis Answer • Unknown! • Incomplete knowledge about TRADD’s bindings. • Depends on whether bind(TRADD, RIP) happened or not!

Prediction 2 Binding of TNF-α with TNFR1 leads to TRADD binding with one or more of TRAF2, FADD, RIP. TRADD binding with TRAF2 leads to over-expression of FLIP provided NIK is phosphorylated on the way. TRADD binding with RIP inhibits phosphorylation of NIK. TRADD binding with FADD in the absence of FLIP leads to cell death. Initial Condition • bind(TNF-α,TNF-R1) occurs at t0 Observation • TRADD’s binding with TRAF2, FADD, RIP Query • predict eventually apoptosis Answer: Yes!
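
The two prediction queries can be mimicked with a small possible-worlds check. This is a minimal sketch in Python, not the AnsProlog encoding used by BioSigNet-RR, and it flattens time into a single static step. It assumes (this is not stated on the slides) that NIK is phosphorylated whenever TRADD binds TRAF2 and RIP is not bound. All function names are hypothetical.

```python
from itertools import combinations

PARTNERS = ("TRAF2", "FADD", "RIP")  # possible binding partners of TRADD


def apoptosis(bound):
    """Outcome in one world, where `bound` is the set of partners TRADD binds."""
    # Assumption: NIK is phosphorylated on the TRAF2 branch unless RIP inhibits it.
    nik_phosphorylated = "TRAF2" in bound and "RIP" not in bound
    flip_overexpressed = "TRAF2" in bound and nik_phosphorylated
    # TRADD binding with FADD in the absence of FLIP leads to cell death.
    return "FADD" in bound and not flip_overexpressed


def possible_worlds(observed=frozenset()):
    """All non-empty partner sets consistent with the observed bindings."""
    worlds = []
    for size in range(1, len(PARTNERS) + 1):
        for combo in combinations(PARTNERS, size):
            world = frozenset(combo)
            if observed <= world:
                worlds.append(world)
    return worlds


def predict_eventually_apoptosis(observed=frozenset()):
    """'yes' if apoptosis holds in every consistent world, 'no' if in none."""
    outcomes = {apoptosis(world) for world in possible_worlds(observed)}
    if outcomes == {True}:
        return "yes"
    if outcomes == {False}:
        return "no"
    return "unknown"


print(predict_eventually_apoptosis())                     # Prediction 1
print(predict_eventually_apoptosis(frozenset(PARTNERS)))  # Prediction 2
```

Under these assumptions the first query is "unknown" because the consistent worlds disagree, and the second is "yes" because observing all three bindings leaves a single world.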

Explanation Given initial condition and observations, to explain why final outcome does not match expectation.

Explanation 1 Binding of TNF-α with TNFR1 leads to TRADD binding with one or more of TRAF2, FADD, RIP. TRADD binding with TRAF2 leads to over-expression of FLIP provided NIK is phosphorylated on the way. TRADD binding with RIP inhibits phosphorylation of NIK. TRADD binding with FADD in the absence of FLIP leads to cell death. Initial condition: • bound(TNF-α,TNFR1) at t0 Observation: • bound(TRADD, TRAF2) at t1 Query: Explain apoptosis One explanation: • Binding of TRADD with RIP • Binding of TRADD with FADD
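
Explanation can be sketched as a search for the smallest sets of unobserved bindings that make the outcome hold. The Python below is an illustrative approximation, not the system's AnsProlog program. It uses a static abstraction and assumes (not stated on the slides) that NIK is phosphorylated whenever TRADD binds TRAF2 and RIP is not bound. The function names are hypothetical.

```python
from itertools import combinations

PARTNERS = ("TRAF2", "FADD", "RIP")  # possible binding partners of TRADD


def apoptosis(bound):
    """Outcome when TRADD binds exactly the partners in `bound`."""
    # Assumption: NIK is phosphorylated on the TRAF2 branch unless RIP inhibits it.
    nik_phosphorylated = "TRAF2" in bound and "RIP" not in bound
    flip_overexpressed = "TRAF2" in bound and nik_phosphorylated
    return "FADD" in bound and not flip_overexpressed


def explain_apoptosis(observed):
    """Minimal sets of additional bindings that, with `observed`, give apoptosis."""
    unknown = [p for p in PARTNERS if p not in observed]
    explanations = []
    for size in range(len(unknown) + 1):  # smallest candidate sets first
        for extra in combinations(unknown, size):
            extra = frozenset(extra)
            # Skip supersets of an explanation already found (not minimal).
            if any(found <= extra for found in explanations):
                continue
            if apoptosis(frozenset(observed) | extra):
                explanations.append(extra)
    return explanations


for explanation in explain_apoptosis({"TRAF2"}):
    print(sorted(explanation))
```

With bound(TRADD, TRAF2) observed, the only minimal explanation in this toy model is that TRADD also bound both FADD and RIP, which agrees with the slide.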

Planning Given initial conditions, to plan interventions to achieve a goal. Application in drug and therapy design.

Planning requirements In addition to the knowledge about the pathway we need additional information about possible interventions such as: • What proteins can be introduced? • What mutations can be forced?

Planning example Defining possible interventions: • intervention intro(DN-TRAF2) • intro(DN-TRAF2) causes present(DN-TRAF2) • present(DN-TRAF2) inhibits bind(TRAF2,TRADD) • present(DN-TRAF2) inhibits interact(TRAF2,NIK) Initial condition: • bound(NFκB,IκB) at 0 • bind(TNF-α,TNF-R1) at 0 Goal: to keep NFκB inactive. Query: • plan always bound(NFκB,IκB) from 0
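
Planning can be illustrated as a search over sets of interventions, each checked by forward simulation. The Python sketch below is hypothetical and much simpler than the AnsProlog planner. The intervention statements come from the slide, but the downstream chain (the `active(NIK)` fluent, the `dissociate(NFκB,IκB)` action and the three trigger statements) is an assumption added so that the toy pathway has something to block.

```python
from itertools import combinations

GOAL = "bound(NFκB,IκB)"

# action -> (fluents made true, fluents made false)
CAUSES = {
    "bind(TNF-α,TNF-R1)": ({"bound(TNF-α,TNF-R1)"}, set()),
    "intro(DN-TRAF2)": ({"present(DN-TRAF2)"}, set()),
    "bind(TRAF2,TRADD)": ({"bound(TRAF2,TRADD)"}, set()),
    "interact(TRAF2,NIK)": ({"active(NIK)"}, set()),      # assumed effect
    "dissociate(NFκB,IκB)": (set(), {GOAL}),              # assumed action
}
TRIGGERS = [  # assumed, simplified pathway
    ({"bound(TNF-α,TNF-R1)"}, "bind(TRAF2,TRADD)"),
    ({"bound(TRAF2,TRADD)"}, "interact(TRAF2,NIK)"),
    ({"active(NIK)"}, "dissociate(NFκB,IκB)"),
]
INHIBITS = [  # from the slide
    ({"present(DN-TRAF2)"}, "bind(TRAF2,TRADD)"),
    ({"present(DN-TRAF2)"}, "interact(TRAF2,NIK)"),
]
INTERVENTIONS = ["intro(DN-TRAF2)"]


def step(state, exogenous=()):
    """One time step: fire triggered, uninhibited actions plus exogenous ones."""
    triggered = {a for conds, a in TRIGGERS if conds <= state}
    inhibited = {a for conds, a in INHIBITS if conds <= state}
    occurring = (triggered - inhibited) | set(exogenous)
    added = set().union(*(CAUSES[a][0] for a in occurring))
    removed = set().union(*(CAUSES[a][1] for a in occurring))
    return (state | added) - removed


def trajectory(interventions, horizon=6):
    """States at times 0..horizon when the interventions occur at time 0."""
    state = {GOAL}
    states = [state]
    at_time_zero = ["bind(TNF-α,TNF-R1)", *interventions]
    for t in range(horizon):
        state = step(state, at_time_zero if t == 0 else ())
        states.append(state)
    return states


def plan(horizon=6):
    """Smallest intervention set that keeps GOAL true at every time point."""
    for size in range(len(INTERVENTIONS) + 1):
        for combo in combinations(INTERVENTIONS, size):
            if all(GOAL in s for s in trajectory(combo, horizon)):
                return list(combo)
    return None


print(plan())
```

In this toy model the empty plan fails because the assumed cascade eventually dissociates NFκB from IκB, while introducing DN-TRAF2 at time 0 blocks the cascade and satisfies the "always" goal over the chosen horizon.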

Conclusion of part 1 From paper in ISMB 2004: • Our goal in this paper was to make progress towards developing a system (and the necessary representation language and reasoning algorithms) that can be used to represent signal networks and pathways associated with cells and reason with them. A start was made. • Defined a simple language (syntax and semantics) • Defined prediction, planning and explanation • A prototype implementation using AnsProlog • Illustration of its applicability with respect to an NFκB pathway.

Issues with incomplete knowledge Often one may not be able to do much prediction, explanation or planning. What then? Can reasoning help in obtaining new knowledge? Yes, through hypothesis generation! In fact, hypothesis generation needs reasoning!

Part II: Hypothesis Generation

Hypothesis generation Our observations cannot be explained by our existing knowledge, OR the explanations given by our existing knowledge are invalidated by experiments. Conclusion: Our knowledge needs to be augmented or revised. How? Can we use a reasoning system to propose hypotheses that one can verify through experimentation? Automate the reasoning in the mind of a biologist; especially helpful when the background knowledge is humongous.

Hypothesis space [Figure: knowledge base K with the rule “UV leads_to cancer”; inputs High UV and p53; outcomes Cancer and No cancer; entailment check (K,I) ⊨ O]

Issues in this tiny example Hypothesis formation: Theory: UV leads to cancer. Observation: wild-type p53 resists the UV effect. Hypothesis: p53 is a tumor-suppressor. Elaboration tolerance: How do we update/revise “UV leads to cancer”? Default & NM reasoning: Normally UV leads to cancer. UV does not lead to cancer if p53 is present.

Related Works: some prior mention of hypothesis formation HYPGENE (Karp, 1991) TRANSGENE (Darden, 1997) GenePath (Zupan et al., 2003) Robot Scientist (King et al., 2004) Database (Doherty et al., 2004) BIOCHAM (Calzone et al., 2005) PathLogic (Karp et al., 2002) Cytoscape (Shannon et al., 2003) Integrative Scheme (Su et al., 2003) Pathway Analysis (Ingenuity®) … These do not use the latest advances in knowledge representation and reasoning (e.g., they lack ways to express defaults, non-monotonicity, elaboration tolerance, problem solving rules, etc.).

Hypothesis formation Knowledge base: K Set of initial conditions: I Set of (experimental) observations: O (K,I) does not entail O To expand (K,I) to (K’, I’): (K’, I’) entails O How to expand (hypothesis space) • Explanation: expand only I • Diagnosis: normality assumptions about I, minimally abandon the normality assumptions • Hypothesis formation: expand K

Construction of hypothesis space Present: manual construction, using research literature Future: integration of multiple data sources • Protein interactions • Pathway databases • Biological ontologies • … Provide cues, hunches such as • A may interact with B: action interact(A,B) • A-B interaction may have effect C: interact(A,B) causes C

Generation of hypotheses Enumeration of hypotheses Search: computing with Smodels (an implementation of AnsProlog) Heuristics • A trigger statement is selected only if it is the only cause of some action occurrence that is needed to explain the novel observations. • An inhibition statement is selected only if it is the only blocker of some triggered action at some time. Maximizing preferences of selected statements

Generation … (cont’): heuristics Knowledge base K • a causes g • b causes g Initial condition I = { initially f } Observation O = { eventually g } (K,I) does not entail O Hypothesis space: to expand K with rules among • f triggers a • f triggers b Hypotheses: { f triggers a }, or { f triggers b }
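
The small example above can be reproduced by brute-force enumeration. The sketch below is a hypothetical Python stand-in for the Smodels-based search. It checks "eventually g" by forward simulation and keeps only subset-minimal hypotheses, which is a simplification of the selection heuristics on the previous slide. The names `eventually` and `hypotheses` are assumptions.

```python
from itertools import combinations

CAUSES = {"a": "g", "b": "g"}           # K: a causes g, b causes g
INITIAL = {"f"}                         # I: initially f
CANDIDATES = [("f", "a"), ("f", "b")]   # hypothesis space: "f triggers a/b"


def eventually(goal, triggers, horizon=5):
    """Does `goal` become true within `horizon` steps, given trigger rules?"""
    state = set(INITIAL)
    for _ in range(horizon):
        if goal in state:
            return True
        fired = {action for fluent, action in triggers if fluent in state}
        state |= {CAUSES[action] for action in fired}
    return goal in state


def hypotheses(goal):
    """Subset-minimal sets of candidate rules H with (K + H, I) entailing goal."""
    found = []
    for size in range(len(CANDIDATES) + 1):  # smallest sets first
        for combo in combinations(CANDIDATES, size):
            # Skip supersets of a hypothesis that already explains the goal.
            if any(set(h) <= set(combo) for h in found):
                continue
            if eventually(goal, combo):
                found.append(combo)
    return found


print(eventually("g", ()))  # False: (K, I) alone does not entail O
for hypothesis in hypotheses("g"):
    print([f"{fluent} triggers {action}" for fluent, action in hypothesis])
```

The enumeration returns the two single-rule hypotheses from the slide and rejects their union as non-minimal.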

Case study: p53 network

Tumor suppression by p53 p53 has 3 main functional domains • N terminal transactivator domain • Central DNA-binding domain • C terminal domain that recognizes DNA damage Appropriate binding of N terminal activates pathways that lead to protection of cell from cancer. Inappropriate binding (say to Mdm2) inhibits p53 induced tumor suppression.

p53 knowledge base Stress • high(UV) triggers upregulate(mRNA(p53)) Upregulation of p53 • upregulate(mRNA(p53)) causes high(mRNA(p53)) • high(mRNA(p53)) triggers translate(p53) • translate(p53) causes high(p53)

p53 knowledge base (cont.) Tumor suppression by p53 • high(p53) inhibits growth(tumor)

p53 knowledge base (cont’) Interaction between Mdm2 and p53 • high(p53), high(mdm2) triggers bind(p53,mdm2) • bind(p53,mdm2) causes bound(dom(p53,N)) • bind(p53,mdm2) causes high([p53 : mdm2]) • bind(p53,mdm2) causes ¬high(p53), ¬high(mdm2)

Hypothesis formation Experimental observation: • I = { initially high(UV), high(mdm2), high(ARF) } • O = { eventually ¬tumorous } (K,I) does not entail O Need to hypothesize the role of ARF.

Constructing hypothesis space Levels of ARF and p53 correlate • high(ARF) triggers upregulate(mRNA(p53)) • high(p53) triggers upregulate(mRNA(ARF))

Constructing hypothesis space (cont’) Interactions of ARF with the known proteins • bind(p53,ARF) causes bound(dom(p53,N))

Constructing hypothesis space (cont’) Influence of X (=ARF) on other interactions • high(ARF) triggers upreg(mRNA(p53)) • high(ARF) triggers translate(p53) • high(ARF) triggers bind(p53,mdm2)

Twelve generated hypotheses, such as • high(UV) triggers upregulate(mRNA(ARF)) • high(ARF), high(mdm2) triggers bind(ARF,mdm2)

Conclusion of part 2 Goal: Automation of hypothesis formation (with respect to interactions and pathways) Approach: Viewed known qualitative aspects of cell activities as a knowledge base Used knowledge representation language that • Can express defaults • Allows reasoning with incomplete knowledge • Can express reasoning as well as problem solving rules Developed a system BioSigNet-RRH: Formalizing and reasoning about hypotheses Illustration: Hypothesizing the role of ARF protein in the p53 network.

Future Work on Reasoning about Biochemical Networks (Part I and II) Further development of the language Validation with respect to larger networks • Kohn’s map • Networks in Reactome and other repositories Going from prototype to deployable systems Scaling up challenges • Recent advances in automatic planning Integration with BioPAX

Part III: CBioC http://cbioc.org

Do we have enough knowledge in the various databases? Some knowledge has been curated into databases. But there is much more in the literature. So what do we do?

Current status of curation from text About 15 million abstracts in PubMed • 3 million published by US and EU researchers during 1994-2004 (800 articles per day) 300 K articles published so far reporting protein-protein interactions in human, yeast and mouse. • BIND (in 7 yrs) – 23K; DIP – 3K; MINT – 2.4K.

Premise: High cost of human curation Overwhelming cost of large curation efforts may be unsustainable for long periods • BIND: Nov 2005 bad news. Operated for 7 years; listed over 100 curators & programmers; CND $29 million received in 2003, plus other funding • Curation efforts of AFCS have recently stopped. • Lack of funding for some genome annotation projects.

Premise: summary Human curation of text is expensive. Human curation of text is not scalable. Human curation of text is not sustainable.

Why not resort to computers? – do automatic extraction Lessons from DARPA funded MUCs (message understanding conferences) in the 90s, for a decade and at the cost of tens of millions of dollars. • Getting to 60% recall and precision is quick • Then every 5% improvement is about a year’s work. • Even when we get to 90% for an individual entity extraction, for recognizing 4 related entities: (0.9)^4 ≈ 0.66 Lessons from biomedical text extraction • No proper evaluation. • Recognized that recall and precision are not very good even in the “best” systems.
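
The compounding effect in the 4-entity bullet is a one-line calculation. The Python sketch below assumes that the per-entity extraction successes are independent, which the slide implies but does not state. The function name `joint_accuracy` is hypothetical.

```python
def joint_accuracy(per_entity_accuracy: float, entity_count: int) -> float:
    """Chance that all entities in one fact are extracted correctly.

    Assumes each entity is recognized independently with the same accuracy.
    """
    return per_entity_accuracy ** entity_count


# Four related entities at 90% each: roughly 0.66 for the whole fact.
print(round(joint_accuracy(0.9, 4), 4))  # 0.6561
```

Under the same assumption, a fact involving more entities degrades further, which is part of the argument for putting readers in the loop.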

What do we do? How do we curate not only the existing articles, but also the future articles? Too important to give up! Need to think of a new way to do it. Faster computers, better sequencing technology and better algorithms came to the rescue of the Human Genome project. Hmm. What resources are we overlooking?

Key Idea If lots of articles are being written, then a lot of people are writing them and a lot of people are reading them. If only we could make these people (the authors and the readers) contribute to the curation effort … Especially the readers; the ones who need the curated data!

Mass collaboration has worked in • Wikipedia • Project Gutenberg • Netflix ratings • Amazon ratings • etc.

Mass collaborative curation: initial hurdles An average reader • (S)he is not normally interested in filling a blank curation form. • We cannot make an average reader go through curation training. • So it has to be very different from just making the existing curation tools available to the masses and expecting them to contribute.

Mass collaborative curation: key initial ideas Make it very easy: • The user need not remember where (which database, which web page) to put the curated knowledge. • Curation opportunity should present itself seamlessly. Curation should not be a burden to an average user • Make the curated knowledge “thin”. There should be immediate rewards • Do not start with a blank slate.

Realization of the key ideas: a biologist with a gene name Goes to PubMed, types the gene name, clicks on one of the abstracts Curation panel presents itself automatically • Our approach calls for researchers to contribute to the curation of facts as they read and research over the web But not with a blank slate • No one wants to be the first one! • Automatic extraction jump-starts the process, and then researchers improve upon the extracted data, “ironing out” inconsistencies by subsequent edits on a massive scale. Thin schemas • Average users are turned off by traditional wide schemas • Wide schemas need to be broken down.

Summary Information/curation window pops up automatically. Automatic extraction is used as a bootstrap so that no user is working on a blank slate. Users vote on correctness, make corrections, add facts. • Suppose 60% precision and recall of automatic extraction system • A person will have an easier time discarding 40% of wrongly extracted text than identifying 60% of correct entries and entering them!

Very useful byproducts Avoids some problems with existing human curation approach • Curators’ bias • Curators miss things • Curators have disagreements • Slow access to newest findings • Researchers at large have little or no control over what gets curated and when A large curated corpus of text gets created • Very useful to evaluate and improve automated extraction systems.

Current status of CBioC; future plans Basic system, as described, is ready Being populated with • Facts from existing databases (BIND etc.) • Facts extracted using our extraction system Querying mechanism • Answer display Future work • Voter confidence issues • …

Conclusion Collecting what is known Reasoning with what is known Hypothesizing what is unknown (based on observations)

Open Invitation We are building and eager to help other groups build knowledge bases in particular domains to • Predict impact of interventions • Plan (therapy design) to make a pathway behave in a desired way • Explain observations • Hypothesize new knowledge • Further improvements to and adaptation of CBioC

Acknowledgements BioSigNet • Nam Tran, Ph.D. thesis on this, Postdoc @ Yale • Karen Chancellor, Ph.D. student • Michael Berens and his group (Ana Joy, Nhan Tran) • Lokesh Joshi and his group (Vinay Nagraj) CBioC: Graciela Gonzalez, Lian Yu, Luis Tari, Tony Gitter, Amanda Ziegler, Ryan Wendt, Prabhdeep Singh. Other projects: • BioQA • Biogenenet

Thank you!
