Integrative Genomics BME 230

Probabilistic Networks Incorporate uncertainty explicitly Capture sparseness of wiring Incorporate multiple kinds of data

Most models incomplete Molecular systems are complex & rife with uncertainty. Data and models are often incomplete: –Some gene structures are misidentified in some organisms by a gene-structure model –Some true binding sites for a transcription factor can’t be found with a motif model –Not all genes in the same pathway are predicted to be coregulated by a clustering model Even with perfect learning and inference methods, an appreciable amount of data is left unexplained

Why infer system properties from data? Knowledge acquisition bottleneck Knowledge acquisition is an expensive process Often we don’t have an expert Data is cheap Amount of available information growing rapidly Learning allows us to construct models from raw data Discovery Want to identify new relationships in a data-driven way

Graphical models for joint-learning Combine probability theory & graph theory Explicitly link our assumptions about how a system works with its observed behavior Incorporate notion of modularity: complex systems often built from simpler pieces Ensures we find consistent models in the probabilistic sense Flexible and intuitive for modeling Efficient algorithms exist for drawing inferences and learning Many classical formulations are special cases Michael Jordan, 1998

Motif Model Example (Barash ’03) Sites can have arbitrary dependence on each other.

Barash ’03 Results Many TFs have binding sites that exhibit dependency

Barash ’03 Results

Bayesian Networks for joint-learning Provide an intuitive formulation for combining models Encode notion of causality which can guide model formulation Formally expresses decomposition of system state into modular, independent sub-pieces Makes learning in complex domains tractable

Unifying models for molecular biology We need knowledge representation systems that can maintain our current understanding of gene networks E.g. DNA damage response and promotion into S-phase –highly linked sub-system –experts united around sub-system –but probably need combined model to understand either sub-system Graphical models offer one solution to this

“Explaining Away” Causes “compete” to explain observed data. So, if we observe the data and one of the causes, this provides information about the other cause. Intuition into “V-structures” (diagram: sprinkler → wet grass ← rain): observing that the grass is wet and then finding out the sprinklers were on decreases our belief that it rained. So sprinkler and rain are dependent given that their child is observed.
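
To make explaining away concrete, here is a minimal numerical sketch of the sprinkler/rain/wet-grass v-structure; the CPT numbers are illustrative (not from the slides), and the posteriors are computed by brute-force enumeration of the joint.

```python
from itertools import product

# Illustrative CPTs (made-up numbers, not from the slides).
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.3, False: 0.7}
P_WET_GIVEN = {(True, True): 0.99, (True, False): 0.90,   # keyed by (sprinkler, rain)
               (False, True): 0.80, (False, False): 0.05}

def joint(s, r, w):
    """P(sprinkler=s, rain=r, wet=w) under the v-structure sprinkler -> wet <- rain."""
    p_w = P_WET_GIVEN[(s, r)] if w else 1.0 - P_WET_GIVEN[(s, r)]
    return P_SPRINKLER[s] * P_RAIN[r] * p_w

def p_rain_given(**evidence):
    """P(rain=True | evidence) by summing the joint over all consistent worlds."""
    num = den = 0.0
    for s, r, w in product([True, False], repeat=3):
        world = {"sprinkler": s, "rain": r, "wet": w}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(s, r, w)
        den += p
        if r:
            num += p
    return num / den

print(p_rain_given(wet=True))                   # belief in rain after seeing wet grass
print(p_rain_given(wet=True, sprinkler=True))   # drops once the sprinkler explains the wetness
```

With these numbers, P(rain | wet) ≈ 0.41 drops to P(rain | wet, sprinkler) ≈ 0.22: the sprinkler partly explains away the wet grass.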

Conditional independence: “Bayes Ball” analogy (Ross Shachter, 1995) Converging connection: the ball does not pass through when the middle node is unobserved; it passes through when the node is observed. Diverging or serial connection: the ball passes through when the middle node is unobserved; it does not pass through when the node is observed.

Inference Given some set of evidence, what is the most likely cause? BN allows us to ask any question that can be posed about the probabilities of any combination of variables

Inference Since a BN provides the joint distribution, we can answer questions by computing any probability over the set of variables. For example, what’s the probability that p53 is activated given that ATM is off and the cell is arrested before S-phase? We need to marginalize (sum out) the variables we are not interested in, as sketched below:
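
A hedged sketch of the marginalization this query calls for, writing h for all remaining unqueried, unobserved variables; the variable names and states are illustrative, not the slide's original equation:

```latex
P(\text{p53}=\text{on} \mid \text{ATM}=\text{off},\ \text{Arrest}=\text{yes})
  = \frac{\sum_{\mathbf{h}} P(\text{p53}=\text{on},\, \text{ATM}=\text{off},\, \text{Arrest}=\text{yes},\, \mathbf{h})}
         {\sum_{y}\sum_{\mathbf{h}} P(\text{p53}=y,\, \text{ATM}=\text{off},\, \text{Arrest}=\text{yes},\, \mathbf{h})}
```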

Variable Elimination Inference amounts to distributing sums over products Message passing in the BN Generalization of forward-backward algorithm Pearl, 1988

Variable Elimination Procedure The initial potentials are the CPTs in BN. Repeat until only query variable(s) remain: –Choose another variable to eliminate. –Multiply all potentials that contain the variable. –If no evidence for the variable then sum the variable out and replace original potential by the new result. –Else, remove variable based on evidence. Normalize remaining potentials to get the final distribution over the query variable.
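
A minimal Python sketch of this procedure, representing each potential as a (variables, table) pair; the three-node chain A → B → C and its CPTs are illustrative, not an example from the course.

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two factors; a factor is (vars, table)."""
    fv, ft = f
    gv, gt = g
    out_vars = tuple(fv) + tuple(v for v in gv if v not in fv)
    table = {}
    for assign in product([0, 1], repeat=len(out_vars)):
        amap = dict(zip(out_vars, assign))
        table[assign] = ft[tuple(amap[v] for v in fv)] * gt[tuple(amap[v] for v in gv)]
    return (out_vars, table)

def sum_out(f, var):
    """Marginalize var out of factor f."""
    fv, ft = f
    i = fv.index(var)
    out_vars = fv[:i] + fv[i + 1:]
    table = {}
    for assign, val in ft.items():
        key = assign[:i] + assign[i + 1:]
        table[key] = table.get(key, 0.0) + val
    return (out_vars, table)

def restrict(f, var, value):
    """Incorporate evidence var = value and drop var from the factor."""
    fv, ft = f
    i = fv.index(var)
    return (fv[:i] + fv[i + 1:],
            {a[:i] + a[i + 1:]: v for a, v in ft.items() if a[i] == value})

def variable_elimination(factors, evidence, elim_order):
    # Incorporate evidence into every potential that mentions an observed variable.
    for var, val in evidence.items():
        factors = [restrict(f, var, val) if var in f[0] else f for f in factors]
    # Eliminate the non-query variables one at a time.
    for var in elim_order:
        relevant = [f for f in factors if var in f[0]]
        if not relevant:
            continue
        prod = relevant[0]
        for f in relevant[1:]:
            prod = multiply(prod, f)
        factors = [f for f in factors if var not in f[0]] + [sum_out(prod, var)]
    # Multiply the remaining potentials and normalize over the query variable(s).
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    total = sum(result[1].values())
    return result[0], {a: v / total for a, v in result[1].items()}

# Illustrative CPTs for the chain A -> B -> C over binary variables.
fA = (("A",), {(1,): 0.6, (0,): 0.4})
fB = (("A", "B"), {(1, 1): 0.7, (1, 0): 0.3, (0, 1): 0.2, (0, 0): 0.8})
fC = (("B", "C"), {(1, 1): 0.9, (1, 0): 0.1, (0, 1): 0.5, (0, 0): 0.5})

# Query P(A | C = 1): condition on the evidence, then eliminate B and normalize.
print(variable_elimination([fA, fB, fC], evidence={"C": 1}, elim_order=["B"]))
```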

Learning Networks from Data We have a dataset X What is the best model that explains X? We define a score, score(G, X), that evaluates networks by computing how likely X is given the network Then we search for the network that gives us the best score

Bayesian Score As M (the amount of data) grows: –Increasing pressure to fit the dependencies in the empirical distribution –The complexity penalty term avoids fitting noise Asymptotic equivalence to the MDL score The Bayesian score is consistent –Observed data eventually overrides the prior Friedman & Koller ‘03
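
One standard way to write the asymptotic (BIC/MDL) form of this score, with the “fit dependencies” and “complexity penalty” terms labeled; this is the usual textbook decomposition rather than a verbatim copy of the slide's formula:

```latex
\text{score}_{\text{BIC}}(G : D)
  = \underbrace{M \sum_i \hat{I}\!\left(X_i ;\, \text{Pa}^{G}_{X_i}\right)}_{\text{fit dependencies in empirical distribution}}
  \; - \; M \sum_i \hat{H}(X_i)
  \; - \; \underbrace{\frac{\log M}{2}\, \text{Dim}(G)}_{\text{complexity penalty}}
```

Here the hats denote empirical mutual information and entropy and Dim(G) is the number of independent parameters; the first term grows linearly in M while the penalty grows only logarithmically, which is why the data eventually dominate.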

Learning Parameters: Summary Estimation relies on sufficient statistics –For multinomials: the counts N(x_i, pa_i) –Parameter estimation is either MLE, θ_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i), or Bayesian with a Dirichlet prior, θ_{x_i|pa_i} = (N(x_i, pa_i) + α_{x_i|pa_i}) / (N(pa_i) + α_{pa_i}) Both are asymptotically equivalent and consistent Both can be implemented in an on-line manner by accumulating sufficient statistics
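
A small sketch of both estimators for a single multinomial CPT, computed from the counts N(x_i, pa_i); the counts and the Dirichlet pseudo-count alpha are made-up illustrations.

```python
from collections import Counter

def mle(counts, pa):
    """theta_{x|pa} = N(x, pa) / N(pa) for every value x seen with parent assignment pa."""
    n_pa = sum(n for (x, p), n in counts.items() if p == pa)
    return {x: n / n_pa for (x, p), n in counts.items() if p == pa}

def bayes_dirichlet(counts, pa, values, alpha=1.0):
    """Posterior-mean estimate under a symmetric Dirichlet(alpha) prior (pseudo-counts)."""
    n_pa = sum(counts.get((x, pa), 0) for x in values)
    return {x: (counts.get((x, pa), 0) + alpha) / (n_pa + alpha * len(values))
            for x in values}

# Sufficient statistics N(x_i, pa_i) accumulated from (value, parent-value) observations.
data = [("on", "high"), ("on", "high"), ("off", "high"), ("on", "low")]
counts = Counter(data)

print(mle(counts, "high"))                                        # {'on': 2/3, 'off': 1/3}
print(bayes_dirichlet(counts, "high", ["on", "off"], alpha=1.0))  # smoothed toward uniform
```

Both use the same sufficient statistics, so either can be kept up to date on-line by incrementing counts; the pseudo-counts only matter while N(pa_i) is small.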

Incomplete Data Data is often incomplete Some variables of interest are not assigned values This phenomenon happens when we have Missing values: –Some variables unobserved in some instances Hidden variables: –Some variables are never observed –We might not even know they exist Friedman & Koller ‘03

Hidden (Latent) Variables Why should we care about unobserved variables? (Figure: the network X1, X2, X3 → H → Y1, Y2, Y3 with the hidden variable H has 17 parameters; the equivalent network that omits H and connects the X’s directly to the Y’s has 59 parameters.) Friedman & Koller ‘03

Expectation Maximization (EM) A general purpose method for learning from incomplete data Intuition: If we had true counts, we could estimate parameters But with missing values, counts are unknown We “complete” counts using probabilistic inference based on current parameter assignment We use completed counts as if real to re-estimate parameters Friedman & Koller ‘03

Expectation Maximization (EM) (Figure: a worked example. A data table over X, Y, Z with several entries missing (“?”); the current model supplies, e.g., P(Y=H | X=T, θ) = 0.4 and P(Y=H | X=H, Z=T, θ) = 0.3; from these, the expected counts N(X, Y) are computed.) Friedman & Koller ‘03

Expectation Maximization (EM) (Figure: the EM loop. From the training data and an initial network (G, θ0), compute the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H) (E-step); reparameterize to obtain the updated network (G, θ1) (M-step); reiterate.) Friedman & Koller ‘03

Expectation Maximization (EM) Computational bottleneck: Computation of expected counts in E-Step –Need to compute posterior for each unobserved variable in each instance of training set –All posteriors for an instance can be derived from one pass of standard BN inference Friedman & Koller ‘03
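
A toy version of this loop for a two-node network X → Y with some Y values missing; the data and starting parameters are invented for illustration. In this tiny model the E-step posterior for a missing Y is just P(Y | X, θ); in a general BN it would come from one pass of standard inference per instance, as the slide notes.

```python
# Toy EM for the network X -> Y (binary), where some Y observations are missing.
data = [("H", "H"), ("H", None), ("T", "T"), ("T", None), ("H", "H"), ("T", "H")]

# theta[x] = current estimate of P(Y = "H" | X = x)
theta = {"H": 0.5, "T": 0.5}

for iteration in range(20):
    # E-step: expected counts N(X=x, Y=y), completing each missing Y with its posterior
    # under the current parameters (here simply P(Y | X), since Y has no children).
    expected = {(x, y): 0.0 for x in "HT" for y in "HT"}
    for x, y in data:
        if y is not None:
            expected[(x, y)] += 1.0
        else:
            expected[(x, "H")] += theta[x]
            expected[(x, "T")] += 1.0 - theta[x]
    # M-step: re-estimate the parameters from the completed counts as if they were real.
    theta = {x: expected[(x, "H")] / (expected[(x, "H")] + expected[(x, "T")]) for x in "HT"}

print(theta)
```

After a few iterations θ approaches the observed-data maximum-likelihood estimates (here 1.0 for X=H, where no T was ever observed, and 0.5 for X=T).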

Why Struggle for Accurate Structure? (Figure: the true network over Earthquake, Alarm Set, Burglary, and Sound, alongside a variant with an added arc and a variant with a missing arc.) Adding an arc: –Increases the number of parameters to be estimated –Wrong assumptions about domain structure Missing an arc: –Cannot be compensated for by fitting parameters –Wrong assumptions about domain structure

Learning BN structure Treat it like a search problem The Bayesian score lets us measure how well a BN fits the data Search for the BN that fits the data best Start with an initial network B0 (e.g., the empty network) Define successor operations on the current network

Structural EM Recall, with complete data we had –Decomposition ⇒ efficient search Idea: Instead of optimizing the real score… Find a decomposable alternative score Such that maximizing the new score ⇒ improvement in the real score Friedman & Koller ‘03

Structural EM Idea: Use current model to help evaluate new structures Outline: Perform search in (Structure, Parameters) space At each iteration, use current model for finding either: –Better scoring parameters: “parametric” EM step or –Better scoring structure: “structural” EM step

(Figure: the Structural EM loop. From the training data and the current network, compute expected counts, e.g. N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H); score & parameterize candidate structures, which may require different expected counts, e.g. N(X2, X1), N(H, X1, X3), N(Y1, X2), N(Y2, Y1, H); reiterate.)

Structure Search Bottom Line Discrete optimization problem In some cases, optimization problem is easy –Example: learning trees In general, NP-Hard –Need to resort to heuristic search –In practice, search is relatively fast (~100 vars in ~2-5 min): Decomposability Sufficient statistics –Adding randomness to search is critical

Heuristic Search Define a search space: –search states are possible structures –operators make small changes to structure Traverse space looking for high-scoring structures Search techniques: –Greedy hill-climbing –Best first search –Simulated Annealing –... Friedman & Koller ‘03

Local Search Start with a given network –empty network –best tree –a random network At each iteration –Evaluate all possible changes –Apply change based on score Stop when no modification improves score Friedman & Koller ‘03

Heuristic Search Typical operations (figure: a network over S, C, E, D): Add C → D, Delete C → E, Reverse C → E For the add, Δscore = S({C,E} → D) − S({E} → D): to update the score after a local change, only re-score the families that changed Friedman & Koller ‘03
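
A hedged sketch of the greedy local search loop with these operators, using a BIC-style family score and re-scoring only the families touched by each move; the dataset, scoring details, and tie-breaking are illustrative, not the course's implementation.

```python
import math
import random
from itertools import product

def family_score(data, x, parents):
    """BIC-style family score: log-likelihood of x given its parents, minus a penalty."""
    counts = {}
    for row in data:
        key = tuple(row[p] for p in parents)
        counts.setdefault(key, [0, 0])[row[x]] += 1
    ll = 0.0
    for n0, n1 in counts.values():
        n = n0 + n1
        for k in (n0, n1):
            if k:
                ll += k * math.log(k / n)
    n_params = 2 ** len(parents)   # one Bernoulli parameter per parent configuration
    return ll - 0.5 * math.log(len(data)) * n_params

def is_acyclic(parents):
    """Check that the child -> parent-set map describes a DAG."""
    visited, on_path = set(), set()
    def visit(v):
        if v in on_path:
            return False
        if v in visited:
            return True
        on_path.add(v)
        ok = all(visit(p) for p in parents[v])
        on_path.discard(v)
        visited.add(v)
        return ok
    return all(visit(v) for v in parents)

def hill_climb(data, variables):
    parents = {v: set() for v in variables}        # start from the empty network
    while True:
        best_delta, best_move = 0.0, None
        for x, y in product(variables, repeat=2):
            if x == y:
                continue
            for op in ("add", "delete", "reverse"):
                cand = {v: set(ps) for v, ps in parents.items()}
                if op == "add" and x not in cand[y]:
                    cand[y].add(x)
                elif op == "delete" and x in cand[y]:
                    cand[y].discard(x)
                elif op == "reverse" and x in cand[y]:
                    cand[y].discard(x)
                    cand[x].add(y)
                else:
                    continue
                if not is_acyclic(cand):
                    continue
                # Only the families of x and y can change, so only they are re-scored.
                delta = sum(family_score(data, v, sorted(cand[v])) -
                            family_score(data, v, sorted(parents[v])) for v in (x, y))
                if delta > best_delta:
                    best_delta, best_move = delta, cand
        if best_move is None:
            return parents
        parents = best_move

# Tiny synthetic binary dataset: B noisily copies A, C is independent noise.
random.seed(0)
data = []
for _ in range(200):
    a = random.random() < 0.5
    b = a if random.random() < 0.9 else not a
    c = random.random() < 0.5
    data.append({"A": int(a), "B": int(b), "C": int(c)})

print(hill_climb(data, ["A", "B", "C"]))
```

Because the score decomposes over families, each candidate move is evaluated by re-scoring only the one or two families it changes, which is what makes the search fast in practice.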

Naive Approach to Structural EM Perform EM for each candidate graph G1, G2, G3, …, Gn (figure: parametric optimization via EM in parameter space, converging to a local maximum for each candidate) Computationally expensive: –Parameter optimization via EM is non-trivial –Need to perform EM for all candidate structures –Spend time even on poor candidates ⇒ In practice, this considers only a few candidates

Bayesian Approach Posterior distribution over structures Estimate the probability of features –Edge X → Y –Path X → … → Y –… P(f | D) = Σ_G f(G) P(G | D), where f(G) is the indicator function for the feature f (a feature of G, e.g., the edge X → Y) and P(G | D) comes from the Bayesian score for G

Discovering Structure Current practice: model selection –Pick a single high-scoring model –Use that model to infer domain structure (Figure: one network over E, R, B, A, C selected from the posterior P(G|D).)

Discovering Structure Problem –Small sample size ⇒ many high-scoring models –An answer based on one model is often useless –Want features common to many models (Figure: several distinct high-scoring networks over E, R, B, A, C under the posterior P(G|D).)

Application: Gene expression Input: Measurement of gene expression under different conditions –Thousands of genes –Hundreds of experiments Output: Models of gene interaction –Uncover pathways Friedman & Koller ‘03

“Mating response” Substructure Automatically constructed sub-network of high-confidence edges Almost exact reconstruction of the yeast mating pathway (Figure: genes include KAR4, AGA1, PRM1, TEC1, SST2, STE6, KSS1, NDJ1, FUS3, AGA2, YEL059W, TOM6, FIG1, YLR343W, YLR334C, MFA1, FUS1.) N. Friedman et al. 2000

Bayesian Network Limitation: models a pdf over instances Bayesian nets use a propositional representation The real world has objects, related to each other Can we take advantage of general properties of objects of the same class? (Figure: a small BN over Intelligence, Difficulty, Grade.)

Bayesian Networks: Problem Bayesian nets use propositional representation Real world has objects, related to each other (Figure: ground variables Intell_J.Doe, Diffic_CS101, Grade_JDoe_CS101; Intell_FGump, Diffic_Geo101, Grade_FGump_Geo101; Intell_FGump, Diffic_CS101, Grade_FGump_CS101.) These “instances” are not independent!

Relational Schema Specifies the types of objects in the domain, the attributes of each type of object, & the types of links between objects (Figure: classes Student (Intelligence), Registration (Grade, Satisfaction), Course (Difficulty), and Professor (Teaching-Ability), connected by Teach, In, and Take links; the legend marks classes, links, and attributes.)

Modeling real-world relationships Bayesian nets use propositional representation Biological system has objects, related to each other from E. Segal’s PhD Thesis 2004

Probabilistic Relational Models (PRMs) A skeleton σ defining classes & their relations Random variables are the attributes of the objects

Probabilistic Relational Models Dependencies exist at the class level from E. Segal’s PhD Thesis 2004

Converting a PRM into a BN We can “unroll” (or instantiate) a PRM into its underlying BN. E.g.: 2 genes & 3 experiments: from E. Segal’s PhD Thesis 2004

A PRM has important differences from a BN Dependencies are defined at the class level –Thus can be reused for any objects of the class Uses the relational structure to allow attributes of an object to depend on attributes of related objects Parameters are shared across instances of the same class! (Thus, more data for each parameter)
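
A toy sketch of what this class-level sharing buys when a PRM is unrolled into its ground BN, loosely echoing the “2 genes & 3 experiments” example above; the class names, attributes, and CPD values are hypothetical, not Segal et al.'s actual model.

```python
# Class-level CPD: P(Expression.level | Gene.cluster, Experiment.condition).
# This single table is shared by every (gene, experiment) instance.
cpd_expression = {
    ("c1", "stress"): {"up": 0.7, "down": 0.3},
    ("c1", "normal"): {"up": 0.4, "down": 0.6},
    ("c2", "stress"): {"up": 0.2, "down": 0.8},
    ("c2", "normal"): {"up": 0.5, "down": 0.5},
}

# Relational skeleton: the objects and their attribute values.
genes = {"g1": {"cluster": "c1"}, "g2": {"cluster": "c2"}}
experiments = {"e1": {"condition": "stress"},
               "e2": {"condition": "normal"},
               "e3": {"condition": "stress"}}

# Unroll: one ground variable Expression(g, e) per (gene, experiment) pair,
# each with parents Gene(g).cluster and Experiment(e).condition.
ground_bn = {}
for g in genes:
    for e in experiments:
        ground_bn[f"Expression({g},{e})"] = {
            "parents": (f"{g}.cluster", f"{e}.condition"),
            "cpd": cpd_expression,   # the same shared class-level table for every instance
        }

# The parents' attribute values select a row of the shared table for each ground variable.
for var, node in ground_bn.items():
    g, e = var[len("Expression("):-1].split(",")
    row = node["cpd"][(genes[g]["cluster"], experiments[e]["condition"])]
    print(var, "| parents:", node["parents"], "->", row)
```

All six ground Expression(g, e) variables use the one class-level table, so every observation contributes to estimating the same shared parameters.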

Example of joint learning Segal et al. 2003

Joint Learning: Motifs & Modules Segal et al. 2003

Joint Model of Segal et al. 2003

PRM for the joint model Segal et al. 2003

Joint Model Components: a motif model, a regulation model, and an expression model Segal et al. 2003

“Ground BN” Segal et al. 2003

TFs associated with modules Segal et al. 2003

PRM for genome-wide location data Segal et al. 2003