PGM: Tirgul 10 - Learning Structure I

Benefits of Learning Structure
- Efficient learning: more accurate models with less data
  - Compare: estimating P(A) and P(B) separately vs. the joint P(A,B) (for binary variables, 2 parameters vs. 3)
- Discover structural properties of the domain
  - Ordering of events
  - Relevance
- Identifying independencies enables faster inference
- Predict the effect of actions
  - Involves learning causal relationships among variables

Why Struggle for Accurate Structure?
- Adding an arc
  - Increases the number of parameters to be fitted
  - Encodes wrong assumptions about causality and domain structure
- Missing an arc
  - Cannot be compensated for by accurate fitting of parameters
  - Also misses causality and domain structure
(Figure: the Burglary/Earthquake/Alarm Set/Sound network, shown with an added arc and with a missing arc.)

Approaches to Learning Structure
- Constraint based
  - Perform tests of conditional independence
  - Search for a network that is consistent with the observed dependencies and independencies
- Pros & Cons
  - Pro: intuitive, follows closely the construction of BNs
  - Pro: separates structure learning from the form of the independence tests
  - Con: sensitive to errors in individual tests

Approaches to Learning Structure
- Score based
  - Define a score that evaluates how well the (in)dependencies in a structure match the observations
  - Search for a structure that maximizes the score
- Pros & Cons
  - Pro: statistically motivated
  - Pro: can make compromises
  - Pro: takes the structure of conditional probabilities into account
  - Con: computationally hard

Likelihood Score for Structures
First-cut approach: use the likelihood function.
- Recall the likelihood score for a network structure G and parameters theta_G (see the sketch below)
- Since we know how to maximize the parameters, from now on we assume the maximum-likelihood parameters theta-hat_G
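As a sketch, in the usual notation (M is the number of training instances and pa_i[m] is the assignment to X_i's parents in the m-th instance):

$$\ell(\theta_G : D) \;=\; \log P(D \mid G, \theta_G) \;=\; \sum_{m=1}^{M} \sum_{i} \log P\big(x_i[m] \mid \mathrm{pa}_i[m], \theta_G\big)$$

and the likelihood score of the structure alone is obtained by plugging in the MLE: $\mathrm{score}_L(G : D) = \ell(\hat\theta_G : D)$.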

Likelihood Score for Structure (cont.)
Rearranging terms gives the decomposition below, where:
- H(X) is the entropy of X
- I(X;Y) is the mutual information between X and Y
  - I(X;Y) measures how much "information" each variable provides about the other
  - I(X;Y) >= 0
  - I(X;Y) = 0 iff X and Y are independent
  - I(X;Y) = H(X) iff X is totally predictable given Y
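A sketch of the rearranged score (the standard decomposition, with mutual information and entropy taken under the empirical distribution defined by D):

$$\ell(\hat\theta_G : D) \;=\; M \sum_{i} I\big(X_i ;\, \mathrm{Pa}^G_{X_i}\big) \;-\; M \sum_{i} H(X_i)$$

The second sum does not depend on G, so maximizing the likelihood score amounts to maximizing the total mutual information between each variable and its parents.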

Likelihood Score for Structure (cont.)
Good news:
- Intuitive explanation of the likelihood score: the larger the dependency of each variable on its parents, the higher the score
- Likelihood as a compromise among dependencies, based on their strength

Likelihood Score for Structure (cont.)
Bad news:
- Adding arcs always helps
  - I(X;Y) <= I(X;{Y,Z})
  - Maximal score is attained by fully connected networks
  - Such networks can overfit the data: the parameters capture the noise in the data

Avoiding Overfitting
A "classic" issue in learning. Approaches:
- Restricting the hypothesis space
  - Limits the overfitting capability of the learner
  - Example: restrict the number of parents or the number of parameters
- Minimum description length (MDL)
  - Description length measures complexity
  - Prefer models that compactly describe the training data
- Bayesian methods
  - Average over all possible parameter values
  - Use prior knowledge

Bayesian Inference
- Bayesian reasoning: compute the expectation over the unknown structure G (see the prediction formula below)
- Assumption: the structures G are mutually exclusive and exhaustive
- We know how to compute P(x[M+1] | G, D): same as prediction with a fixed structure
- How do we compute P(G | D)?
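A sketch of the prediction-by-averaging formula referred to above (the standard Bayesian model-averaging form):

$$P\big(x[M+1] \mid D\big) \;=\; \sum_{G} P(G \mid D)\, P\big(x[M+1] \mid G, D\big)$$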

Using Bayes rule:

$$P(G \mid D) \;=\; \frac{P(D \mid G)\, P(G)}{P(D)}$$

where P(G | D) is the posterior score, P(D | G) is the marginal likelihood, P(G) is the prior over structures, and P(D) is the probability of the data. P(D) is the same for all structures G, so it can be ignored when comparing structures.

Marginal Likelihood
- By introducing the parameter variables and integrating them out, we have

$$P(D \mid G) \;=\; \int P(D \mid \theta_G, G)\, P(\theta_G \mid G)\, d\theta_G$$

where P(D | theta_G, G) is the likelihood and P(theta_G | G) is the prior over parameters.
- This integral measures sensitivity to the choice of parameters

Marginal Likelihood: Binomial Case
Assume we observe a sequence of coin tosses x[1], ..., x[M].
- By the chain rule, the marginal likelihood is a product of predictive probabilities (see below)
- Recall the predictive formula, where N_H^m is the number of heads in the first m examples
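A sketch of the two formulas being invoked (the chain rule over examples, and the standard Beta/Dirichlet predictive probability with hyperparameters alpha_H, alpha_T):

$$P(x[1], \ldots, x[M]) \;=\; \prod_{m=0}^{M-1} P\big(x[m+1] \mid x[1], \ldots, x[m]\big)$$

$$P\big(x[m+1] = H \mid x[1], \ldots, x[m]\big) \;=\; \frac{N^m_H + \alpha_H}{m + \alpha_H + \alpha_T}$$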

Marginal Likelihood: Binomials (cont.)
We simplify this product by using the identity Gamma(x+1) = x * Gamma(x), which yields the closed form below.
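A sketch of the resulting closed form (the standard Beta-Binomial marginal likelihood, with N_H and N_T the total numbers of heads and tails):

$$P(x[1], \ldots, x[M]) \;=\; \frac{\Gamma(\alpha_H + \alpha_T)}{\Gamma(\alpha_H + \alpha_T + M)} \cdot \frac{\Gamma(\alpha_H + N_H)}{\Gamma(\alpha_H)} \cdot \frac{\Gamma(\alpha_T + N_T)}{\Gamma(\alpha_T)}$$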

Binomial Likelihood: Example
(Plot: idealized experiment with a fixed P(H); (log P(D))/M as a function of M for Dirichlet(.5,.5), Dirichlet(1,1), and Dirichlet(5,5) priors.)

Marginal Likelihood: Example (cont.)
(Plot: actual experiment with a fixed P(H); (log P(D))/M as a function of M for Dirichlet(.5,.5), Dirichlet(1,1), and Dirichlet(5,5) priors.)

Marginal Likelihood: Multinomials
The same argument generalizes to multinomials with a Dirichlet prior.
- P(theta) is Dirichlet with hyperparameters alpha_1, ..., alpha_K
- D is a dataset with sufficient statistics N_1, ..., N_K
Then the marginal likelihood has the closed form below.
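A sketch of the general Dirichlet-multinomial marginal likelihood (the standard closed form):

$$P(D) \;=\; \frac{\Gamma\!\big(\sum_k \alpha_k\big)}{\Gamma\!\big(\sum_k \alpha_k + \sum_k N_k\big)} \;\prod_{k=1}^{K} \frac{\Gamma(\alpha_k + N_k)}{\Gamma(\alpha_k)}$$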

Marginal Likelihood: Bayesian Networks
- Network structure determines the form of the marginal likelihood
- Data: X = H T T H T H H, Y = H T H H T T H
- Network 1 (X and Y disconnected): two Dirichlet marginal likelihoods, P(X[1],...,X[7]) and P(Y[1],...,Y[7])

Marginal Likelihood: Bayesian Networks
- Network structure determines the form of the marginal likelihood
- Data: X = H T T H T H H, Y = H T H H T T H
- Network 2 (X -> Y): three Dirichlet marginal likelihoods, P(X[1],...,X[7]), P(Y[1],Y[4],Y[6],Y[7]) (the cases where X = H), and P(Y[2],Y[3],Y[5]) (the cases where X = T)
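To make the comparison concrete, here is a minimal Python sketch that scores both networks on the seven tosses above, assuming a Dirichlet(1,1) prior for every CPD (the hyperparameters and helper names are my own choices; the slides do not fix them):

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_marginal_ll(counts, alphas):
    """log P(data) for one multinomial with a Dirichlet prior:
    log G(sum a) - log G(sum a + sum N) + sum_k [log G(a_k + N_k) - log G(a_k)]."""
    counts = np.asarray(counts, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    return (gammaln(alphas.sum()) - gammaln(alphas.sum() + counts.sum())
            + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

X = list("HTTHTHH")
Y = list("HTHHTTH")
prior = [1.0, 1.0]  # assumed Dirichlet(1,1) prior for every CPD

def ht_counts(seq):
    return [seq.count("H"), seq.count("T")]

# Network 1: X and Y disconnected -> two Dirichlet marginal likelihoods.
net1 = dirichlet_marginal_ll(ht_counts(X), prior) + dirichlet_marginal_ll(ht_counts(Y), prior)

# Network 2: X -> Y -> three Dirichlet marginal likelihoods,
# one for X and one for Y within each value of X.
y_given_xH = [y for x, y in zip(X, Y) if x == "H"]
y_given_xT = [y for x, y in zip(X, Y) if x == "T"]
net2 = (dirichlet_marginal_ll(ht_counts(X), prior)
        + dirichlet_marginal_ll(ht_counts(y_given_xH), prior)
        + dirichlet_marginal_ll(ht_counts(y_given_xT), prior))

print("log P(D | X, Y disconnected) =", net1)
print("log P(D | X -> Y)            =", net2)
```

Each family contributes exactly one Dirichlet marginal-likelihood term, which is how the structure changes the form of the score.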

Idealized Experiment
- P(X = H) = 0.5
- P(Y = H | X = H) and P(Y = H | X = T) are controlled by a dependence parameter p
(Plot: (log P(D))/M as a function of M for the independent case and for p = 0.05, 0.10, 0.15, 0.20.)

Marginal Likelihood for General Networks
The marginal likelihood has the product form below, where:
- N(..) are the counts from the data
- alpha(..) are the hyperparameters for each family, given G
Each family contributes one Dirichlet marginal-likelihood term for the sequence of values of X_i observed when X_i's parents take a particular value.
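A sketch of this product form (the standard BD marginal likelihood; pa_i^G ranges over joint assignments to X_i's parents in G):

$$P(D \mid G) \;=\; \prod_i \prod_{\mathrm{pa}_i^G} \left[ \frac{\Gamma\big(\alpha(\mathrm{pa}_i^G)\big)}{\Gamma\big(\alpha(\mathrm{pa}_i^G) + N(\mathrm{pa}_i^G)\big)} \prod_{x_i} \frac{\Gamma\big(\alpha(x_i, \mathrm{pa}_i^G) + N(x_i, \mathrm{pa}_i^G)\big)}{\Gamma\big(\alpha(x_i, \mathrm{pa}_i^G)\big)} \right]$$

with $\alpha(\mathrm{pa}_i^G) = \sum_{x_i} \alpha(x_i, \mathrm{pa}_i^G)$ and $N(\mathrm{pa}_i^G) = \sum_{x_i} N(x_i, \mathrm{pa}_i^G)$.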

Priors
- We need prior counts alpha(..) for each candidate network structure G
- This can be a formidable task: there are exponentially many structures

BDe Score
Possible solution: the BDe prior
- Represent the prior using two elements, M0 and B0
  - M0: equivalent sample size
  - B0: a network representing the prior probability of events

BDe Score
Intuition: M0 prior examples distributed according to B0
- Set alpha(x_i, pa_i^G) = M0 * P(x_i, pa_i^G | B0)
  - Note that pa_i^G need not be the same as the parents of X_i in B0
  - Compute P(x_i, pa_i^G | B0) using standard inference procedures
- Such priors have desirable theoretical properties
  - Equivalent networks are assigned the same score

Bayesian Score: Asymptotic Behavior
Theorem: If the prior P(theta | G) is "well-behaved", then log P(D | G) has the asymptotic form below.
Proof sketch:
- For the case of Dirichlet priors, use Stirling's approximation to Gamma(.)
- General case: deferred to the incomplete-data section
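A sketch of the asymptotic form being claimed (the standard BIC-type expansion; dim(G) is the number of independent parameters in G):

$$\log P(D \mid G) \;=\; \ell(\hat\theta_G : D) \;-\; \frac{\log M}{2}\, \dim(G) \;+\; O(1)$$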

Asymptotic Behavior: Consequences
- The Bayesian score is consistent
  - As M goes to infinity, the "true" structure G* maximizes the score (almost surely)
  - For sufficiently large M, the maximal-scoring structures are equivalent to G*
- Observed data eventually overrides the prior information
  - Assuming the prior assigns positive probability to all cases

Asymptotic Behavior
- This score can also be justified by the Minimal Description Length (MDL) principle
- The asymptotic form above explicitly shows the tradeoff between:
  - Fit to the data: the likelihood term
  - Penalty for complexity: the regularization term

Scores -- Summary
- Likelihood, MDL, and (log) BDe all have the decomposable form below
- BDe requires assessing a prior network; it can naturally incorporate prior knowledge and previous experience
- BDe is consistent and asymptotically equivalent (up to a constant) to MDL
- All are score-equivalent: if G is equivalent to G', then Score(G) = Score(G')
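A sketch of the shared decomposable form (each family is scored separately, which is what makes local search efficient):

$$\mathrm{Score}(G : D) \;=\; \sum_i \mathrm{Score}\big(X_i \mid \mathrm{Pa}^G_{X_i} : D\big)$$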

Optimization Problem
Input:
- Training data
- Scoring function (including priors, if needed)
- Set of possible structures
  - Including prior knowledge about structure
Output:
- A network (or networks) that maximizes the score
Key property:
- Decomposability: the score of a network is a sum of terms

Learning Trees
- Trees: at most one parent per variable
- Why trees?
  - Elegant math: we can solve the optimization problem efficiently (with a greedy algorithm)
  - Sparse parameterization: avoids overfitting while adapting to the data

Learning Trees (cont.)
- Let p(i) denote the parent of X_i, or 0 if X_i has no parent
- We can write the score as below: a sum of edge scores (the improvement over the "empty" network) plus a constant (the score of the "empty" network)
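A sketch of this rewriting (splitting each family score into the empty-network baseline and the gain from adding the single parent edge):

$$\mathrm{Score}(G : D) \;=\; \underbrace{\sum_{i:\, p(i) > 0} \Big(\mathrm{Score}\big(X_i \mid X_{p(i)}\big) - \mathrm{Score}(X_i)\Big)}_{\text{improvement over the "empty" network}} \;+\; \underbrace{\sum_i \mathrm{Score}(X_i)}_{\text{score of the "empty" network}}$$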

Learning Trees (cont.)
Algorithm:
- Construct a graph with vertices 1, 2, ..., n
- Set the edge weight w(i -> j) = Score(X_j | X_i) - Score(X_j)
- Find a tree (or forest) of maximal weight
  - This can be done with standard algorithms in low-order polynomial time, by building the tree greedily (Kruskal's maximum spanning tree algorithm)
Theorem: this procedure finds the tree with maximal score.
When the score is the likelihood, w(i -> j) is proportional to I(X_i; X_j); this is known as the Chow & Liu method (see the sketch below).
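A minimal Python sketch of the Chow & Liu procedure under the likelihood score: pairwise empirical mutual information as edge weights, then a Kruskal-style maximum-weight spanning tree (the toy data and helper names are my own):

```python
import numpy as np
from itertools import combinations

def mutual_information(x, y):
    """Empirical mutual information I(X;Y) (in nats) between two discrete columns."""
    mi = 0.0
    for vx in np.unique(x):
        for vy in np.unique(y):
            pxy = np.mean((x == vx) & (y == vy))
            px, py = np.mean(x == vx), np.mean(y == vy)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def chow_liu_tree(data):
    """Edges of a maximum-weight spanning tree where edge (i, j) weighs I(X_i; X_j)."""
    n_vars = data.shape[1]
    edges = sorted(
        ((mutual_information(data[:, i], data[:, j]), i, j)
         for i, j in combinations(range(n_vars), 2)),
        reverse=True)                       # heaviest edges first
    parent = list(range(n_vars))            # union-find for Kruskal

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                        # adding (i, j) keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# Toy usage: 500 samples over 3 binary variables where x2 is a noisy copy of x0.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
x1 = rng.integers(0, 2, 500)
x2 = (x0 ^ (rng.random(500) < 0.1)).astype(int)
print(chow_liu_tree(np.column_stack([x0, x1, x2])))
```

On this toy data the highest-weight edge should connect x0 and x2, since x2 carries most of x0's information.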

Learning Trees: Example
(Figure: tree learned from Alarm-network data, with correct arcs and spurious arcs marked.)
- Not every edge in the tree is in the original network
- Tree direction is arbitrary: we can't learn about arc direction

Beyond Trees
- When we consider more complex networks, the problem is not as easy
- Suppose we allow two parents per variable
  - A greedy algorithm is no longer guaranteed to find the optimal network
  - In fact, no efficient algorithm exists
Theorem: finding the maximal-scoring network structure with at most k parents per variable is NP-hard for k > 1.

Heuristic Search
We address the problem by using heuristic search.
- Define a search space:
  - nodes are possible structures
  - edges denote adjacency of structures
- Traverse this space looking for high-scoring structures
Search techniques:
- Greedy hill-climbing
- Best-first search
- Simulated annealing
- ...

Heuristic Search (cont.)
Typical operations on the current structure (illustrated on a network over S, C, E, D):
- Add an arc (e.g., add C -> D)
- Delete an arc (e.g., delete C -> E)
- Reverse an arc (e.g., reverse C -> E)

Exploiting Decomposability in Local Search
- Caching: to update the score after a local change, we only need to re-score the families that were changed in the last move

Greedy Hill-Climbing
- Simplest heuristic local search (see the sketch below)
  - Start with a given network
    - the empty network, the best tree, or a random network
  - At each iteration:
    - Evaluate all possible changes
    - Apply the change that leads to the best improvement in score
    - Reiterate
  - Stop when no modification improves the score
- Each step requires evaluating approximately n new changes
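A minimal Python sketch of greedy hill-climbing with a decomposable (BIC-style) local score and a per-family score cache, so each move only re-scores the changed family. For brevity it uses only add and delete operators (no reversal); the function names and the max_parents limit are my own choices:

```python
import numpy as np
from itertools import product

def family_bic(data, child, parents):
    """BIC-style local score of one family: log-likelihood of `child` given
    `parents` minus (log M / 2) * (number of independent parameters).
    Assumes discrete variables encoded as 0..k-1."""
    M = data.shape[0]
    child_vals = np.unique(data[:, child])
    parent_cards = [len(np.unique(data[:, p])) for p in parents]
    ll = 0.0
    for pa_cfg in product(*(range(c) for c in parent_cards)):
        mask = np.ones(M, dtype=bool)
        for p, v in zip(parents, pa_cfg):
            mask &= data[:, p] == v
        n_pa = mask.sum()
        if n_pa == 0:
            continue
        for v in child_vals:
            n = np.sum(data[mask, child] == v)
            if n > 0:
                ll += n * np.log(n / n_pa)
    n_params = (len(child_vals) - 1) * int(np.prod(parent_cards))
    return ll - 0.5 * np.log(M) * n_params

def would_create_cycle(parents, i, j):
    """True iff adding the arc i -> j creates a cycle, i.e. j is an ancestor of i."""
    stack, seen = [i], set()
    while stack:
        u = stack.pop()
        if u == j:
            return True
        if u not in seen:
            seen.add(u)
            stack.extend(parents[u])
    return False

def greedy_hill_climb(data, max_parents=2):
    n = data.shape[1]
    parents = {i: set() for i in range(n)}
    # Family-score cache: decomposability means a move only touches one family.
    cache = {i: family_bic(data, i, sorted(parents[i])) for i in range(n)}
    while True:
        best_delta, best_move = 0.0, None
        for i, j in product(range(n), repeat=2):
            if i == j:
                continue
            if i in parents[j]:                       # candidate: delete i -> j
                new_parents = parents[j] - {i}
            elif len(parents[j]) < max_parents and not would_create_cycle(parents, i, j):
                new_parents = parents[j] | {i}        # candidate: add i -> j
            else:
                continue
            delta = family_bic(data, j, sorted(new_parents)) - cache[j]
            if delta > best_delta + 1e-9:
                best_delta, best_move = delta, (j, new_parents)
        if best_move is None:                         # no modification improves the score
            return parents
        j, new_parents = best_move
        parents[j] = new_parents
        cache[j] = family_bic(data, j, sorted(new_parents))
```

Because the score decomposes, the delta of a candidate move is just the new family score of the affected child minus its cached value, which is exactly the caching idea from the previous slide.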

Greedy Hill-Climbing: Possible Pitfalls
- Greedy hill-climbing can get stuck in:
  - Local maxima: all one-edge changes reduce the score
  - Plateaus: some one-edge changes leave the score unchanged
    - Happens because equivalent networks receive the same score and are neighbors in the search space
- Both occur during structure search
- Standard heuristics can escape both:
  - Random restarts
  - TABU search

Equivalence Class Search
Idea:
- Search the space of equivalence classes
- Equivalence classes can be represented by PDAGs (partially directed acyclic graphs)
Benefits:
- The space of PDAGs has fewer local maxima and plateaus
- There are fewer PDAGs than DAGs

Equivalence Class Search (cont.)
Drawbacks:
- Evaluating changes is more expensive: to score an operation (e.g., adding an undirected edge Y--Z to a PDAG over X, Y, Z), we go from the original PDAG to the new PDAG, convert it to a consistent DAG, and score that DAG
- These algorithms are more complex to implement

Learning in Practice: Alarm Domain
(Plot: KL divergence as a function of M, comparing learning with the true structure vs. unknown structure, both with a BDe prior, M' = 10.)

Model Selection
So far, we focused on a single model:
- Find the best-scoring model
- Use it to predict the next example
Implicit assumption: the best-scoring model dominates the weighted sum
- Pros:
  - We get a single structure
  - Allows for efficient use in our tasks
- Cons:
  - We are committing to the independencies of a particular structure
  - Other structures might be as probable given the data

Model Averaging
- Recall, the Bayesian analysis started with the prediction formula P(x[M+1] | D) = sum_G P(G | D) P(x[M+1] | G, D)
- This requires us to average over all possible models

Model Averaging (cont.)
- Full averaging
  - Sum over all structures
  - Usually intractable: there are exponentially many structures
- Approximate averaging
  - Find the K largest-scoring structures
  - Approximate the sum by averaging over their predictions
  - The weight of each structure is determined by its Bayes factor, i.e., the actual score we compute

Search: Summary
- Discrete optimization problem
- In general, NP-hard
  - Need to resort to heuristic search
  - In practice, search is relatively fast (~100 variables in ~10 minutes), thanks to:
    - Decomposability
    - Sufficient statistics
- In some cases, we can reduce the search problem to an easy optimization problem
  - Example: learning trees