Integrative Genomics BME 230
Probabilistic Networks Incorporate uncertainty explicitly Capture sparseness of wiring Incorporate multiple kinds of data
Most models incomplete Molecular systems are complex & rife with uncertainty Data/Models often incomplete –Some gene structures misidentified in some organisms by gene structure model –Some true binding sites for a transcription factor can’t be found with a motif model –Not all genes in the same pathway predicted to be coregulated by a clustering model Even with perfect learning and inference methods, an appreciable amount of data is left unexplained
Why infer system properties from data? Knowledge acquisition bottleneck Knowledge acquisition is an expensive process Often we don’t have an expert Data is cheap Amount of available information growing rapidly Learning allows us to construct models from raw data Discovery Want to identify new relationships in a data-driven way
Graphical models for joint-learning Combine probability theory & graph theory Explicitly link our assumptions about how a system works with its observed behavior Incorporate notion of modularity: complex systems often built from simpler pieces Ensures we find consistent models in the probabilistic sense Flexible and intuitive for modeling Efficient algorithms exist for drawing inferences and learning Many classical formulations are special cases Michael Jordan, 1998
Motif Model Example (Barash ’03) Sites can have arbitrary dependence on each other.
Barash ’03 Results Many TFs have binding sites that exhibit dependency
Barash ’03 Results
Bayesian Networks for joint-learning Provide an intuitive formulation for combining models Encode notion of causality which can guide model formulation Formally expresses decomposition of system state into modular, independent sub-pieces Makes learning in complex domains tractable
Unifying models for molecular biology We need knowledge representation systems that can maintain our current understanding of gene networks E.g. DNA damage response and promotion into S-phase –highly linked sub-system –experts united around sub-system –but probably need combined model to understand either sub-system Graphical models offer one solution to this
“Explaining Away” Causes “compete” to explain observed data. So, if we observe the data and one of the causes, this provides information about the other cause. Intuition into “V-structures”: sprinkler → wet grass ← rain. Observing that the grass is wet and then finding out the sprinklers were on decreases our belief that it rained. So sprinkler and rain are dependent given that their child is observed.
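The explaining-away effect can be checked numerically. The sketch below enumerates the joint distribution of the sprinkler / rain / wet-grass V-structure. All probabilities (`P_RAIN`, `P_SPRINKLER`, `P_WET`) are made-up values chosen only for illustration; they are not from the slides.

```python
from itertools import product

# Hypothetical CPTs for the V-structure sprinkler -> wet grass <- rain.
P_RAIN = 0.2
P_SPRINKLER = 0.3
# P(wet = True | sprinkler, rain), keyed by (sprinkler, rain)
P_WET = {(False, False): 0.01, (False, True): 0.8,
         (True, False): 0.9, (True, True): 0.99}

def joint(s, r, w):
    """Joint probability P(sprinkler=s, rain=r, wet=w) from the CPTs."""
    ps = P_SPRINKLER if s else 1 - P_SPRINKLER
    pr = P_RAIN if r else 1 - P_RAIN
    pw = P_WET[(s, r)] if w else 1 - P_WET[(s, r)]
    return ps * pr * pw

def prob_rain(**evidence):
    """P(rain=True | evidence) by brute-force enumeration.

    Evidence keys: 's' (sprinkler) and/or 'w' (wet grass).
    """
    num = den = 0.0
    for s, r, w in product([False, True], repeat=3):
        world = {"s": s, "r": r, "w": w}
        if any(world[k] != v for k, v in evidence.items()):
            continue  # world inconsistent with the evidence
        p = joint(s, r, w)
        den += p
        if r:
            num += p
    return num / den

p_prior = prob_rain()                        # no evidence
p_wet = prob_rain(w=True)                    # grass observed wet
p_wet_sprinkler = prob_rain(w=True, s=True)  # wet, and sprinkler known to be on
print(p_prior, p_wet, p_wet_sprinkler)
```

With these numbers, seeing wet grass raises the belief in rain, and then learning the sprinkler was on lowers it again, which is the dependence between the two causes once their child is observed.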
Conditional independence: “Bayes Ball” analogy Ross Shachter, 1995 converging: ball does not pass through when the middle node is unobserved; passes through when it is observed. diverging or serial (chain): ball passes through when the middle node is unobserved; does not pass through when it is observed.
Inference Given some set of evidence, what is the most likely cause? BN allows us to ask any question that can be posed about the probabilities of any combination of variables
Inference Since BN provides joint distribution, can ask questions by computing any probability among the set of variables. For example, what’s the probability p53 is activated given ATM is off and the cell is arrested before S-phase? Need to marginalize (sum out) variables not interested in:
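The marginalization for this query can be written generically as follows. This is a sketch in assumed notation: $h$ stands for all variables other than p53, ATM and the arrest indicator, and $p$ ranges over the states of p53. The slide's own formula is not reproduced here.

```latex
P(\text{p53}=\text{on} \mid \text{ATM}=\text{off},\ \text{arrest})
  = \frac{\sum_{h} P(\text{p53}=\text{on},\ \text{ATM}=\text{off},\ \text{arrest},\ h)}
         {\sum_{p} \sum_{h} P(\text{p53}=p,\ \text{ATM}=\text{off},\ \text{arrest},\ h)}
```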
Variable Elimination Inference amounts to distributing sums over products Message passing in the BN Generalization of forward-backward algorithm Pearl, 1988
Variable Elimination Procedure The initial potentials are the CPTs in BN. Repeat until only query variable(s) remain: –Choose another variable to eliminate. –Multiply all potentials that contain the variable. –If no evidence for the variable then sum the variable out and replace original potential by the new result. –Else, remove variable based on evidence. Normalize remaining potentials to get the final distribution over the query variable.
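The procedure above can be sketched for binary variables as follows. This is a minimal illustration, not a reference implementation: the chain Damage → ATM → p53 → Arrest, its probabilities, and the helper names (`cpt`, `multiply`, `sum_out`, `restrict`) are all assumptions made for the example. Evidence is applied to every potential first, and the remaining non-query variables are then summed out.

```python
from itertools import product

def cpt(parent, child, p_true):
    """Potential for P(child | parent); p_true[v] = P(child=1 | parent=v)."""
    table = {}
    for v in (0, 1):
        table[(v, 1)] = p_true[v]
        table[(v, 0)] = 1.0 - p_true[v]
    return (parent, child), table

def multiply(f, g):
    """Pointwise product of two potentials over binary variables."""
    fv, ft = f
    gv, gt = g
    out_vars = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for vals in product([0, 1], repeat=len(out_vars)):
        a = dict(zip(out_vars, vals))
        table[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return out_vars, table

def sum_out(f, var):
    """Marginalize var out of potential f."""
    fv, ft = f
    i = fv.index(var)
    table = {}
    for vals, p in ft.items():
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return fv[:i] + fv[i + 1:], table

def restrict(f, var, value):
    """Remove var from potential f by fixing it to its observed value."""
    fv, ft = f
    if var not in fv:
        return f
    i = fv.index(var)
    table = {vals[:i] + vals[i + 1:]: p for vals, p in ft.items() if vals[i] == value}
    return fv[:i] + fv[i + 1:], table

def variable_elimination(potentials, query, evidence, order):
    """Return P(query | evidence); order lists the hidden variables to sum out."""
    for var, value in evidence.items():
        potentials = [restrict(f, var, value) for f in potentials]
    for var in order:
        touching = [f for f in potentials if var in f[0]]
        if not touching:
            continue
        rest = [f for f in potentials if var not in f[0]]
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f)
        potentials = rest + [sum_out(prod, var)]
    result = potentials[0]
    for f in potentials[1:]:
        result = multiply(result, f)
    assert result[0] == (query,), "only the query variable should remain"
    z = sum(result[1].values())  # normalize
    return {vals[0]: p / z for vals, p in result[1].items()}

# Toy chain with made-up numbers: Damage -> ATM -> p53 -> Arrest
potentials = [
    (("Damage",), {(0,): 0.7, (1,): 0.3}),
    cpt("Damage", "ATM", {0: 0.1, 1: 0.9}),
    cpt("ATM", "p53", {0: 0.1, 1: 0.8}),
    cpt("p53", "Arrest", {0: 0.2, 1: 0.9}),
]
posterior = variable_elimination(
    potentials, "p53", {"ATM": 0, "Arrest": 1}, order=["Damage"])
print(posterior[1])  # P(p53 = 1 | ATM = 0, Arrest = 1) in the toy chain
```

In this toy chain the answer is 1/3, which can be confirmed by hand: once ATM is observed, Damage only affects the normalizing constant.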
Learning Networks from Data Have a dataset X What is the best model that explains X? We define a score, score(G,X) that scores networks by computing how likely X is given the network Then search for the network that gives us the best score
Bayesian Score As M (amount of data) grows, –Increasing pressure to fit dependencies in distribution –Complexity term avoids fitting noise Asymptotic equivalence to MDL score Bayesian score is consistent –Observed data eventually overrides prior [Equation: asymptotic form of the score, with one term labeled “fit dependencies in empirical distribution” and one labeled “complexity penalty”] Friedman & Koller ‘03
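For reference, the standard large-sample (BIC/MDL) form of the Bayesian score under Dirichlet priors is sketched below. The notation is assumed rather than taken from the slide: $\hat{\theta}_G$ are the maximum-likelihood parameters for structure $G$, $\ell$ is the log-likelihood, and $\mathrm{Dim}(G)$ is the number of independent parameters.

```latex
\log P(D \mid G) \;=\;
  \underbrace{\ell(\hat{\theta}_G : D)}_{\text{fit to the empirical distribution}}
  \;-\;
  \underbrace{\frac{\log M}{2}\,\mathrm{Dim}(G)}_{\text{complexity penalty}}
  \;+\; O(1)
```

The likelihood term grows linearly in $M$ while the penalty grows only as $\log M$, which is why the pressure to fit real dependencies increases with more data.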
Learning Parameters: Summary Estimation relies on sufficient statistics –For multinomials: counts N(x_i, pa_i) –Parameter estimation: MLE or Bayesian (Dirichlet) Both are asymptotically equivalent and consistent Both can be implemented in an on-line manner by accumulating sufficient statistics
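Both estimators can be computed from the same counts N(x_i, pa_i). The sketch below assumes a symmetric Dirichlet prior with pseudo-count `alpha` per child value (an assumption for illustration); `alpha = 0` gives the MLE, and `alpha > 0` gives the posterior-mean Bayesian estimate. The function name `estimate_cpt` and the toy data are hypothetical.

```python
from collections import Counter

def estimate_cpt(data, child_values, alpha=0.0):
    """Estimate P(x | pa) from (x, pa) pairs.

    alpha = 0   -> maximum likelihood: N(x, pa) / N(pa)
    alpha > 0   -> posterior mean under a symmetric Dirichlet(alpha) prior:
                   (N(x, pa) + alpha) / (N(pa) + alpha * |values|)
    """
    counts = Counter(data)  # sufficient statistics N(x, pa)
    parent_configs = sorted({pa for _, pa in counts})
    cpt = {}
    for pa in parent_configs:
        n_pa = sum(counts[(x, pa)] for x in child_values)
        denom = n_pa + alpha * len(child_values)
        cpt[pa] = {x: (counts[(x, pa)] + alpha) / denom for x in child_values}
    return cpt

# Toy data: (child value, parent configuration)
data = [("H", "a")] * 3 + [("T", "a")] + [("T", "b")] * 2
mle = estimate_cpt(data, ["H", "T"])
bayes = estimate_cpt(data, ["H", "T"], alpha=1.0)
print(mle["b"]["H"], bayes["b"]["H"])  # 0.0 vs 0.25: the prior avoids zero estimates
```

Because only counts are needed, new instances can be folded in by incrementing the `Counter`, which is the on-line accumulation mentioned on the slide.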
Incomplete Data Data is often incomplete Some variables of interest are not assigned values This phenomenon happens when we have Missing values: –Some variables unobserved in some instances Hidden variables: –Some variables are never observed –We might not even know they exist Friedman & Koller ‘03
Hidden (Latent) Variables Why should we care about unobserved variables? [Figure: network over X1, X2, X3, Y1, Y2, Y3 with hidden variable H: 17 parameters; network over the same variables without H: 59 parameters] Friedman & Koller ‘03
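The two parameter counts can be reproduced with a short calculation. The sketch assumes all variables are binary and uses one pair of structures that yields exactly these totals: X1, X2, X3 → H → Y1, Y2, Y3 with the hidden variable, and, without it, each Y depending on all X's and on the earlier Y's. The exact edges in the original figure are an assumption.

```python
def num_parameters(parents, cardinality=2):
    """Independent parameters: sum over nodes of (|X| - 1) * |X| ** (#parents)."""
    return sum((cardinality - 1) * cardinality ** len(pa)
               for pa in parents.values())

# With the hidden variable H mediating between the X's and the Y's.
with_hidden = {
    "X1": [], "X2": [], "X3": [],
    "H": ["X1", "X2", "X3"],
    "Y1": ["H"], "Y2": ["H"], "Y3": ["H"],
}

# Without H, the same dependencies must be carried by direct edges.
without_hidden = {
    "X1": [], "X2": [], "X3": [],
    "Y1": ["X1", "X2", "X3"],
    "Y2": ["X1", "X2", "X3", "Y1"],
    "Y3": ["X1", "X2", "X3", "Y1", "Y2"],
}

print(num_parameters(with_hidden))     # 17
print(num_parameters(without_hidden))  # 59
```

The hidden variable acts as a bottleneck: it keeps every family small, so far fewer parameters have to be estimated from the same data.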
Expectation Maximization (EM) A general purpose method for learning from incomplete data Intuition: If we had true counts, we could estimate parameters But with missing values, counts are unknown We “complete” counts using probabilistic inference based on current parameter assignment We use completed counts as if real to re-estimate parameters Friedman & Koller ‘03
Expectation Maximization (EM) Data (X, Y, Z per instance; ? = missing): (H, ?, T), (T, ?, ?), (H, H, ?), (H, T, T), (T, T, H) Current model: P(Y=H | X=H, Z=T, Θ) = 0.3, P(Y=H | X=T, Θ) = 0.4 Expected counts: table of N(X,Y) for (X,Y) = (H,H), (T,H), (H,T), (T,T) Friedman & Koller ‘03
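The expected counts N(X,Y) for this example can be computed directly. The sketch assumes the five data rows read as (X, Y, Z) = (H, ?, T), (T, ?, ?), (H, H, ?), (H, T, T), (T, T, H), and uses only the two posterior values quoted for the current model; the helper `p_y_heads` is hypothetical.

```python
from collections import defaultdict

# Rows are (X, Y, Z); None marks a missing value.
data = [
    ("H", None, "T"),
    ("T", None, None),
    ("H", "H", None),
    ("H", "T", "T"),
    ("T", "T", "H"),
]

def p_y_heads(x, z):
    """P(Y=H | observed values) under the current model.

    Only the two cases needed for this data set are defined.
    """
    if x == "H" and z == "T":
        return 0.3
    if x == "T" and z is None:
        return 0.4
    raise ValueError("posterior not specified for this case")

expected = defaultdict(float)
for x, y, z in data:
    if y is not None:
        expected[(x, y)] += 1.0        # observed: a full count
    else:
        p = p_y_heads(x, z)            # missing: split the count by the posterior
        expected[(x, "H")] += p
        expected[(x, "T")] += 1.0 - p

for key in [("H", "H"), ("T", "H"), ("H", "T"), ("T", "T")]:
    print(key, round(expected[key], 2))
```

The fractional counts sum to the number of instances, and they are then used in the M-step exactly as if they were observed counts.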
Expectation Maximization (EM) Training Data + Initial network (G, Θ0) → Computation (E-Step) → Expected Counts: N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H) → Reparameterize (M-Step) → Updated network (G, Θ1) → Reiterate [Figure: network X1, X2, X3 → H → Y1, Y2, Y3] Friedman & Koller ‘03
Expectation Maximization (EM) Computational bottleneck: Computation of expected counts in E-Step –Need to compute posterior for each unobserved variable in each instance of training set –All posteriors for an instance can be derived from one pass of standard BN inference Friedman & Koller ‘03
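A complete EM loop for a small hidden-variable model might look like the sketch below. It assumes a binary hidden variable H with binary children Y1..Yk (a simpler structure than the network in the slides), an arbitrary asymmetric initialization, and a made-up data set. The E-step computes the posterior over H for each instance, and the M-step re-estimates parameters from the expected counts.

```python
import math

def em_latent_class(data, n_iter=20):
    """EM for H -> Y_1..Y_k with all variables binary and H never observed."""
    k = len(data[0])
    p_h = 0.6  # P(H=1); arbitrary asymmetric starting point
    # p_y[j][h] = P(Y_j = 1 | H = h)
    p_y = [[0.3 + 0.05 * j, 0.7 - 0.05 * j] for j in range(k)]
    history = []  # log-likelihood of the parameters used in each E-step
    for _ in range(n_iter):
        # E-step: accumulate expected counts N(H) and N(Y_j = 1, H).
        n_h = [0.0, 0.0]
        n_yh = [[0.0, 0.0] for _ in range(k)]
        loglik = 0.0
        for row in data:
            joint = []
            for h in (0, 1):
                p = p_h if h else 1.0 - p_h
                for j, y in enumerate(row):
                    p *= p_y[j][h] if y else 1.0 - p_y[j][h]
                joint.append(p)
            z = joint[0] + joint[1]  # P(row) under current parameters
            loglik += math.log(z)
            for h in (0, 1):
                w = joint[h] / z     # posterior P(H = h | row)
                n_h[h] += w
                for j, y in enumerate(row):
                    n_yh[j][h] += w * y
        history.append(loglik)
        # M-step: treat expected counts as if they were observed.
        p_h = n_h[1] / len(data)
        p_y = [[n_yh[j][h] / n_h[h] for h in (0, 1)] for j in range(k)]
    return p_h, p_y, history

# Made-up data: two clusters of instances plus two noisy rows.
data = [(1, 1, 1)] * 4 + [(0, 0, 0)] * 4 + [(1, 1, 0), (0, 0, 1)]
p_h, p_y, history = em_latent_class(data)
print(history[0], history[-1])
```

The recorded log-likelihood never decreases from one iteration to the next, which is a useful sanity check on any EM implementation. The inner loop over instances is the E-step cost that the slide identifies as the bottleneck.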
Why Struggle for Accurate Structure? Adding an arc: –Increases the number of parameters to be estimated –Wrong assumptions about domain structure Missing an arc: –Cannot be compensated for by fitting parameters –Wrong assumptions about domain structure [Figure: networks over Earthquake, Alarm Set, Burglary and Sound, illustrating an added arc and a missing arc]
Learning BN structure Treat like a search problem Bayesian Score allows us to measure how well a BN fits the data Search for a BN that fits the data the best Start with initial network B_0 Define successor operations on current network
Structural EM Recall, in complete data we had –Decomposition ⇒ efficient search Idea: Instead of optimizing the real score… Find decomposable alternative score Such that maximizing new score ⇒ improvement in real score Friedman & Koller ‘03
Structural EM Idea: Use current model to help evaluate new structures Outline: Perform search in (Structure, Parameters) space At each iteration, use current model for finding either: –Better scoring parameters: “parametric” EM step or –Better scoring structure: “structural” EM step
[Figure: Structural EM loop. Training Data + current network → Computation → Expected Counts: N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H) → Score & Parameterize candidate networks, which require additional counts such as N(X2, X1), N(H, X1, X3), N(Y1, X2), N(Y2, Y1, H) → Reiterate]
Structure Search Bottom Line Discrete optimization problem In some cases, optimization problem is easy –Example: learning trees In general, NP-Hard –Need to resort to heuristic search –In practice, search is relatively fast (~100 vars in ~2-5 min), thanks to decomposability and sufficient statistics –Adding randomness to search is critical
Heuristic Search Define a search space: –search states are possible structures –operators make small changes to structure Traverse space looking for high-scoring structures Search techniques: –Greedy hill-climbing –Best first search –Simulated Annealing –... Friedman & Koller ‘03
Local Search Start with a given network –empty network –best tree –a random network At each iteration –Evaluate all possible changes –Apply change based on score Stop when no modification improves score Friedman & Koller ‘03
Heuristic Search Typical operations on a network over S, C, E, D: Add C → D, Delete C → E, Reverse C → E. For adding C → D: Δscore = S({C,E} → D) − S({E} → D) To update score after local change, only re-score families that changed Friedman & Koller ‘03
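A greedy hill-climbing search with add / delete / reverse operators can be sketched as follows. Several choices here are assumptions made for the example: a BIC-style family score stands in for the Bayesian score, all variables are binary, the search starts from the empty network, and the data set is synthetic. Only the families touched by an operator are re-scored, as in the Δscore formula above.

```python
import math
from collections import Counter
from itertools import product

def family_score(data, child, parents):
    """BIC-style score of one family (binary variables): log-lik - (log M / 2) * #params."""
    m = len(data)
    n_joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    n_pa = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(n * math.log(n / n_pa[pa]) for (pa, _), n in n_joint.items())
    return loglik - 0.5 * math.log(m) * (2 ** len(parents))

def is_acyclic(parents):
    """Depth-first check that the parent sets define a DAG."""
    state = {}
    def visit(v):
        if state.get(v) == 1:
            return False  # reached a node still on the stack: cycle
        if state.get(v) == 2:
            return True
        state[v] = 1
        ok = all(visit(p) for p in parents[v])
        state[v] = 2
        return ok
    return all(visit(v) for v in parents)

def neighbors(parents):
    """Yield (graph, changed families) for every legal add, delete or reverse."""
    for x, y in product(list(parents), repeat=2):
        if x == y:
            continue
        new = {v: set(ps) for v, ps in parents.items()}
        if x in parents[y]:
            new[y].discard(x)              # delete x -> y
            yield new, [y]
            rev = {v: set(ps) for v, ps in new.items()}
            rev[x].add(y)                  # reverse to y -> x
            if is_acyclic(rev):
                yield rev, [x, y]
        else:
            new[y].add(x)                  # add x -> y
            if is_acyclic(new):
                yield new, [y]

def hill_climb(data, nodes):
    """Greedy search from the empty graph; stops when no move improves the score."""
    parents = {v: set() for v in nodes}
    fam = {v: family_score(data, v, sorted(parents[v])) for v in nodes}
    while True:
        best_delta, best = 1e-9, None      # small threshold avoids cycling on ties
        for cand, changed in neighbors(parents):
            new_scores = {v: family_score(data, v, sorted(cand[v])) for v in changed}
            delta = sum(new_scores[v] - fam[v] for v in changed)
            if delta > best_delta:
                best_delta, best = delta, (cand, new_scores)
        if best is None:
            return parents, sum(fam.values())
        parents = best[0]
        fam.update(best[1])

# Synthetic data: Y mostly copies X, Z is independent of both.
data = [{"X": i % 2, "Y": (i % 2) if i % 10 else 1 - (i % 2), "Z": (i // 2) % 2}
        for i in range(40)]
learned, score = hill_climb(data, ["X", "Y", "Z"])
edges = {(p, c) for c, ps in learned.items() for p in ps}
print(edges, score)
```

On this data the search recovers a single edge between X and Y and leaves Z disconnected. The direction of the edge is not determined, because both orientations receive the same score.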
Naive Approach to Structural EM Perform EM for each candidate graph G1, G2, G3, G4, …, Gn [Figure: parametric optimization (EM) in parameter space reaches a local maximum for each candidate graph] Computationally expensive: –Parameter optimization via EM is non-trivial –Need to perform EM for all candidate structures –Spend time even on poor candidates In practice, considers only a few candidates
Bayesian Approach Posterior distribution over structures Estimate probability of features –Edge X → Y –Path X → … → Y –… [Equation: posterior probability of a feature, annotated with “feature of G, e.g., X → Y”, “indicator function for feature f”, and “Bayesian score for G”]
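The feature posterior described by these annotations is usually written as below. This is a sketch in assumed notation: $f(G)$ is 1 if structure $G$ has the feature (for example the edge X → Y) and 0 otherwise, and $P(G \mid D)$ is the posterior obtained from the Bayesian score of $G$.

```latex
P(f \mid D) \;=\; \sum_{G} f(G)\, P(G \mid D)
```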
Discovering Structure Current practice: model selection –Pick a single high-scoring model –Use that model to infer domain structure [Figure: one network over E, R, B, A, C and the posterior P(G|D)]
Discovering Structure Problem –Small sample size ⇒ many high scoring models –Answer based on one model often useless –Want features common to many models [Figure: several networks over E, R, B, A, C and the posterior P(G|D)]
Application: Gene expression Input: Measurement of gene expression under different conditions –Thousands of genes –Hundreds of experiments Output: Models of gene interaction –Uncover pathways Friedman & Koller ‘03
“Mating response” Substructure Automatically constructed sub-network of high-confidence edges Almost exact reconstruction of yeast mating pathway [Figure nodes: KAR4, AGA1, PRM1, TEC1, SST2, STE6, KSS1, NDJ1, FUS3, AGA2, YEL059W, TOM6, FIG1, YLR343W, YLR334C, MFA1, FUS1] N. Friedman et al 2000
Bayesian Network Limitation: Model pdf over instances Bayesian nets use propositional representation Real world has objects, related to each other Can we take advantage of general properties of objects of the same class? [Figure: network over Intelligence, Difficulty, Grade]
Bayesian Networks: Problem Bayesian nets use propositional representation Real world has objects, related to each other [Figure: ground variables Intell_J.Doe, Diffic_CS101, Grade_JDoe_CS101; Intell_FGump, Diffic_Geo101, Grade_FGump_Geo101; Intell_FGump, Diffic_CS101, Grade_FGump_CS101] These “instances” are not independent!
Relational Schema Specifies types of objects in domain, attributes of each type of object, & types of links between objects Classes and attributes: Professor (Teaching-Ability), Student (Intelligence), Course (Difficulty), Registration (Grade, Satisfaction) Links: Teach, In, Take
Modeling real-world relationships Bayesian nets use propositional representation Biological system has objects, related to each other from E. Segal’s PhD Thesis 2004
Probabilistic Relational Models (PRMs) A skeleton σ defining classes & their relations Random variables are the attributes of the objects
Probabilistic Relational Models Dependencies exist at the class level from E. Segal’s PhD Thesis 2004
Converting a PRM into a BN We can “unroll” (or instantiate) a PRM into its underlying BN. E.g.: 2 genes & 3 experiments: from E. Segal’s PhD Thesis 2004
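Unrolling can be sketched in a few lines. The class-level dependency used here (an expression level depending on the gene's module and the experiment's condition) and all attribute names are hypothetical stand-ins, since the figure itself is not reproduced in the text.

```python
from itertools import product

def unroll(genes, experiments):
    """Instantiate a class-level dependency into a ground Bayesian network.

    Assumed class-level model: Expression.level depends on
    Gene.module and Experiment.condition.
    """
    nodes, edges = [], []
    for g in genes:
        nodes.append(f"{g}.module")
    for e in experiments:
        nodes.append(f"{e}.condition")
    # One Expression object for every (gene, experiment) pair.
    for g, e in product(genes, experiments):
        child = f"Expr({g},{e}).level"
        nodes.append(child)
        edges.append((f"{g}.module", child))
        edges.append((f"{e}.condition", child))
    return nodes, edges

nodes, edges = unroll(["g1", "g2"], ["e1", "e2", "e3"])
print(len(nodes), len(edges))  # 11 ground variables, 12 edges
```

All six expression nodes would share a single conditional distribution defined at the class level, so the number of parameters does not grow with the number of genes or experiments.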
A PRM has important differences from a BN Dependencies are defined at the class level –Thus can be reused for any objects of the class Uses the relational structure to allow attributes of an object to depend on attributes of related objects Parameters are shared across instances of the same class! (Thus, more data for each parameter)
Example of joint learning Segal et al. 2003
Joint Learning: Motifs & Modules Segal et al. 2003
Joint Model of Segal et al Segal et al. 2003
PRM for the joint model Segal et al. 2003
Joint Model Motif model: Regulation model: Expression Model: Segal et al. 2003
“Ground BN” Segal et al. 2003
TFs associated to modules Segal et al. 2003
PRM for genome-wide location data Segal et al. 2003