Bayesian Networks: Motivation

Bayesian Networks: Motivation
Capture independence and conditional independence where they exist. Among variables where dependencies exist, encode the relevant portion of the full joint. Use a graphical representation for which we can more easily investigate the complexity of inference and can search for efficient inference algorithms.
(Slides © Jude Shavlik 2006 and David Page 2007, CS 760 Machine Learning, UW-Madison.)

A Bayesian Network Is a ...
Directed acyclic graph (DAG) in which the nodes denote random variables and each node X has a conditional probability distribution P(X | Parents(X)). The intuitive meaning of an arc from X to Y is that X directly influences Y.

Additional Terminology
If X and its parents are discrete, we can represent the distribution P(X | Parents(X)) by a conditional probability table (CPT) specifying the probability of each value of X given each possible combination of settings for the variables in Parents(X). A conditioning case is a row in this CPT (a setting of values for the parent nodes).

An Example Bayes Net
Structure: A → C ← B, with C → D and C → E.

  P(A) = 0.1 (1,9)
  P(B) = 0.2 (1,4)

  A B | P(C)
  T T | 0.9 (9,1)
  T F | 0.6 (3,2)
  F T | 0.3 (3,7)
  F F | 0.2 (1,4)

  C | P(D)
  T | 0.9 (9,1)
  F | 0.2 (1,4)

  C | P(E)
  T | 0.8 (4,1)
  F | 0.1 (1,9)

The numbers (probabilities) are also called parameters. In parentheses are the hyperparameters. These usually are not shown, but they are important when doing parameter learning.
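To make the representation concrete, here is a minimal Python sketch (my own illustration, not from the slides) of one way to store this example network: a `parents` map for the structure, a `cpt` map giving P(node = 1) for each conditioning case, and the Beta pseudocounts. The names and layout are assumptions, not a prescribed format.

```python
# Minimal sketch (not from the slides): one way to encode the example network
# A -> C <- B, C -> D, C -> E.  `cpt` gives P(node = 1 | parent values);
# `prior_counts` stores Beta pseudocounts as [count for value 0, count for value 1]
# (the slide writes them in the opposite order, as (count for true, count for false)).

parents = {"A": (), "B": (), "C": ("A", "B"), "D": ("C",), "E": ("C",)}

cpt = {
    "A": {(): 0.1},
    "B": {(): 0.2},
    "C": {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.3, (0, 0): 0.2},
    "D": {(1,): 0.9, (0,): 0.2},
    "E": {(1,): 0.8, (0,): 0.1},
}

prior_counts = {
    "A": {(): [9, 1]},
    "B": {(): [4, 1]},
    "C": {(1, 1): [1, 9], (1, 0): [2, 3], (0, 1): [7, 3], (0, 0): [4, 1]},
    "D": {(1,): [1, 9], (0,): [4, 1]},
    "E": {(1,): [1, 4], (0,): [9, 1]},
}
```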

Bayesian Network Semantics
A Bayesian Network completely specifies a full joint distribution over its random variables; this is its meaning:

  P(x1, …, xn) = ∏i P(xi | parents(Xi))

Here P(x1, …, xn) is shorthand notation for P(X1 = x1, …, Xn = xn).
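As a check on this formula, a small helper (again only a sketch, reusing the `parents` and `cpt` dictionaries from the earlier sketch) evaluates the joint probability of a complete assignment:

```python
def joint_probability(assignment, parents, cpt):
    """P(X1 = x1, ..., Xn = xn) as the product over nodes of P(xi | parents(Xi))."""
    p = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[node])
        p1 = cpt[node][parent_values]           # P(node = 1 | parent values)
        p *= p1 if value == 1 else 1.0 - p1     # P(node = value | parent values)
    return p

# e.g. P(A=0, B=0, C=0, D=0, E=1) = 0.9 * 0.8 * 0.8 * 0.8 * 0.1 = 0.04608
print(joint_probability({"A": 0, "B": 0, "C": 0, "D": 0, "E": 1}, parents, cpt))
```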

Conditional Independence Again
A node X is conditionally independent of its predecessors given Parents(X). The Markov blanket of X is the set consisting of the parents of X, the children of X, and the other parents of X's children. X is conditionally independent of all other nodes in the network given its Markov blanket.
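The Markov blanket can be read directly off the structure; the helper below (illustrative, reusing the `parents` dictionary from the earlier sketch) collects it for any node:

```python
def markov_blanket(node, parents):
    """Parents of `node`, children of `node`, and the children's other parents."""
    children = [c for c, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket

print(markov_blanket("C", parents))   # contains A, B, D, and E in the example network
```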

Learning CPTs from Complete Settings
Suppose we are given a set of data, where each data point is a complete setting for all the variables. One assumption we make is that the data set is a random sample from the distribution we are trying to model. For each node in our network, we consider each entry in its CPT (each setting of values for its parents). For each entry in the CPT, we have a prior (possibly uniform) Dirichlet distribution over its values. We simply update this distribution based on the relevant data points: those that agree with the settings of the parents that correspond to this CPT entry.

A second, implicit assumption is that the distributions over different rows of the CPT are independent of one another. Finally, it is worth noting that instead of this last assumption, we might have a stronger bias over the form of the CPT; we might believe it is a linear function or a tree, in which case we could use a learning algorithm we have seen already.
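Under these assumptions, learning from complete data is just counting. The sketch below (my own illustration for binary variables, using the dictionary layout from the earlier sketch) adds one tally per data point to the appropriate pseudocount pair and returns the posterior-mean estimate of every CPT entry:

```python
import copy

def learn_cpts_complete(data, parents, prior_counts):
    """Update the Dirichlet (here Beta) pseudocounts from fully observed binary data
    and return the posterior-mean estimate of every CPT entry.
    `data` is a list of dicts mapping each node to 0 or 1;
    `prior_counts[node][parent_values]` is a pair [count for 0, count for 1]."""
    counts = copy.deepcopy(prior_counts)
    for row in data:
        for node in parents:
            key = tuple(row[p] for p in parents[node])
            counts[node][key][row[node]] += 1        # tally the observed value
    return {node: {key: n1 / (n0 + n1) for key, (n0, n1) in table.items()}
            for node, table in counts.items()}
```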

Simple Example
Suppose we believe the variables PinLength and HeadWeight directly influence whether a thumbtack comes up heads or tails. For simplicity, suppose PinLength can be long or short and HeadWeight can be heavy or light. Suppose we adopt the following prior over the CPT entries for the variable Thumbtack.

Simple Example (Continued)
[The slide shows the prior CPT for the Thumbtack variable, one column per combination of PinLength and HeadWeight; the table itself is not reproduced in the transcript.]

Simple Example (Continued)
Notice that we have equal confidence in our prior (initial) probabilities for the first and last columns of the CPT, less confidence in those of the second column, and more in those of the third column. (Roughly, confidence here corresponds to the total of the pseudocounts in a column: a smaller total means the prior is more easily moved by new data.) A new data point will affect only one of the columns, and a new data point will have more effect on the second column than it would on the others.

More Difficult Case: What if Some Variables Are Missing?
Recall our earlier notion of hidden variables. Sometimes a variable is hidden because it cannot be explicitly measured. For example, we might hypothesize that a chromosomal abnormality is responsible for some patients with a particular cancer not responding well to treatment.

Missing Values (Continued)
We might include a node for this chromosomal abnormality in our network because we strongly believe it exists, other variables can be used to predict it, and it is in turn predictive of still other variables. But in estimating CPTs from data, none of our data points has a value for this variable.

General EM Framework
Given: data with missing values, a space of possible models, and an initial model.
Repeat until no change is greater than a threshold:
  Expectation (E) step: compute the expectation over the missing values, given the current model.
  Maximization (M) step: replace the current model with the model that maximizes the probability of the data.
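A generic skeleton of this loop might look like the following (purely illustrative; `e_step`, `m_step`, and `distance` are caller-supplied functions, and the convergence test is deliberately naive):

```python
def em(model, data, e_step, m_step, distance, tol=1e-6, max_iters=100):
    """Generic EM skeleton.  `e_step`, `m_step`, and `distance` are supplied by
    the caller; iteration stops when the model changes by no more than `tol`."""
    for _ in range(max_iters):
        expectations = e_step(model, data)        # E: expectations over missing values
        new_model = m_step(expectations, data)    # M: maximize fit to the completed data
        if distance(model, new_model) <= tol:
            return new_model
        model = new_model
    return model
```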

(“Soft”) EM vs. “Hard” EM
Standard (soft) EM: the expectation is a probability distribution over the missing values. Hard EM: the expectation is “all or nothing”, i.e., the single most likely value. The advantage of hard EM is computational efficiency when the expectation is over a state consisting of values for multiple variables (the next example illustrates this).

EM for Parameter Learning: E Step
For each data point with missing values, compute the probability of each possible completion of that data point. Replace the original data point with all these completions, weighted by their probabilities. Computing the probability of each completion (the expectation) is just answering a query over the missing variables given the observed ones.
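A sketch of this E step for a single data point, enumerating completions by brute force (fine for small networks; it reuses `joint_probability`, `parents`, and `cpt` from the earlier sketches and marks missing values with `None`):

```python
from itertools import product

def e_step_completions(row, parents, cpt):
    """Replace one data point (missing values marked with None) by all of its
    completions, weighted by P(missing values | observed values) under the
    current CPTs."""
    missing = [n for n, v in row.items() if v is None]
    completions = []
    for values in product([0, 1], repeat=len(missing)):
        filled = dict(row, **dict(zip(missing, values)))
        completions.append((filled, joint_probability(filled, parents, cpt)))
    total = sum(w for _, w in completions)        # = P(observed values)
    return [(filled, w / total) for filled, w in completions]
```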

EM for Parameter Learning: M Step
Use the completed data set to update our Dirichlet distributions as we would use any complete data set, except that our counts (tallies) may now be fractional. Update the CPTs based on the new Dirichlet distributions, as we would with any complete data set.
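A matching M-step sketch: it pours the possibly fractional completion weights into the pseudocounts and re-estimates every CPT entry. The argument names and the `prior_counts` layout are the same assumptions as in the earlier sketches.

```python
import copy

def m_step_update(weighted_completions, parents, prior_counts):
    """Re-estimate CPTs from (completed row, weight) pairs.  Counts always start
    from `prior_counts`, so they do not accumulate across EM iterations (see the
    'Subtlety' slide below)."""
    counts = copy.deepcopy(prior_counts)
    for row, weight in weighted_completions:
        for node in parents:
            key = tuple(row[p] for p in parents[node])
            counts[node][key][row[node]] += weight   # fractional tallies are fine
    new_cpt = {node: {key: n1 / (n0 + n1) for key, (n0, n1) in table.items()}
               for node, table in counts.items()}
    return new_cpt, counts
```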

EM for Parameter Learning
Iterate the E and M steps until no changes occur. We will not necessarily get the global MAP (or ML, given uniform priors) setting of all the CPT entries, but under a natural set of conditions we are guaranteed convergence to a local MAP solution. The EM algorithm is used for a wide variety of tasks outside of Bayes-net learning as well.

Subtlety for Parameter Learning
There is a danger of overcounting based on the number of iterations required to converge to settings for the missing values. To avoid this, after each repetition of the E step, reset all Dirichlet distributions to their priors before repeating the M step.

EM for Parameter Learning
(Same network and CPTs as in the example Bayes net above.)

Data (the value of C is unobserved in every example):

  A B C D E
  0 0 ? 0 0
  0 0 ? 1 0
  1 0 ? 1 1
  0 0 ? 0 1
  0 1 ? 1 0
  1 1 ? 1 1

EM for Parameter Learning
(Same network and CPTs as above.)

E step: for each example, compute the distribution over the missing value of C given the observed values:

  A B C D E      P(C = 0), P(C = 1)
  0 0 ? 0 0      0: 0.99    1: 0.01
  0 0 ? 1 0      0: 0.80    1: 0.20
  1 0 ? 1 1      0: 0.02    1: 0.98
  0 0 ? 0 1      0: 0.80    1: 0.20
  0 1 ? 1 0      0: 0.70    1: 0.30
  1 1 ? 1 1      0: 0.003   1: 0.997

Multiple Missing Values
(Same network and CPTs as in the example Bayes net above.)

Data (both A and C are unobserved):

  A B C D E
  ? 0 ? 0 1

Multiple Missing Values
E step: replace the single data point with its four possible completions, weighted by their probabilities given the observed values:

  A B C D E   weight
  0 0 0 0 1   0.72
  0 0 1 0 1   0.18
  1 0 0 0 1   0.04
  1 0 1 0 1   0.06
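These weights can be reproduced directly from the CPTs. The snippet below is a standalone check I added, with the slide's numbers hard-coded; it enumerates the four completions of (A, C) and normalizes:

```python
# Standalone check of the completion weights for the data point
# A=?, B=0, C=?, D=0, E=1, with the CPT numbers from the slides hard-coded.
weights = {}
for a in (0, 1):
    for c in (0, 1):
        p_a = 0.1 if a else 0.9                   # P(A = a)
        p_b = 0.8                                 # P(B = 0)
        p_c1 = {0: 0.2, 1: 0.6}[a]                # P(C = 1 | A = a, B = 0)
        p_c = p_c1 if c else 1.0 - p_c1           # P(C = c | A = a, B = 0)
        p_d = 0.1 if c else 0.8                   # P(D = 0 | C = c)
        p_e = 0.8 if c else 0.1                   # P(E = 1 | C = c)
        weights[(a, c)] = p_a * p_b * p_c * p_d * p_e
total = sum(weights.values())
print({ac: round(w / total, 2) for ac, w in weights.items()})
# -> {(0, 0): 0.72, (0, 1): 0.18, (1, 0): 0.04, (1, 1): 0.06}
```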

Multiple Missing Values
M step: add the fractional counts from the weighted completions to the hyperparameters, giving updated CPTs:

  P(A) = 0.1 (1.1, 9.9)
  P(B) = 0.17 (1, 5)

  A B | P(C)
  T T | 0.9 (9, 1)
  T F | 0.6 (3.06, 2.04)
  F T | 0.3 (3, 7)
  F F | 0.2 (1.18, 4.72)

  C | P(D)
  T | 0.88 (9, 1.24)
  F | 0.17 (1, 4.76)

  C | P(E)
  T | 0.81 (4.24, 1)
  F | 0.16 (1.76, 9)

Problems with EM
EM finds only a local optimum (there is not much way around that, though). It is also deterministic: if the priors are uniform, it may be impossible to make any progress. The next example illustrates the need for some randomization to move us off an uninformative prior.

What Will EM Do Here?
Network: A → B → C, with uniform CPTs and uniform priors:

  P(A) = 0.5 (1,1)

  A | P(B)
  T | 0.5 (1,1)
  F | 0.5 (1,1)

  B | P(C)
  T | 0.5 (1,1)
  F | 0.5 (1,1)

Data (B is unobserved):

  A B C
  0 ? 0
  1 ? 1
  1 ? 1

EM Dependent on Initial Beliefs
The same structure and data, but the initial beliefs about P(B | A) are made slightly non-uniform:

  P(A) = 0.5 (1,1)

  A | P(B)
  T | 0.6 (6,4)
  F | 0.4 (4,6)

  B | P(C)
  T | 0.5 (1,1)
  F | 0.5 (1,1)

Data (B is unobserved):

  A B C
  0 ? 0
  1 ? 1
  1 ? 1

EM Dependent on Initial Beliefs (Continued)
(Same network and data as on the previous slide.) Now B is more likely T than F when A is T. Filling this in makes C more likely T than F when B is T, which in turn makes B still more likely T than F when A is T, and so on. A small change in the CPT for B (swapping the 0.6 and 0.4) would have the opposite effect.

Learning Structure + Parameters
The number of structures is superexponential, and finding the optimal structure (ML or MAP) is NP-complete. Two common options:
  Severely restrict the possible structures, e.g., tree-augmented naïve Bayes (TAN).
  Heuristic search, e.g., the sparse candidate algorithm.

Recall: Naïve Bayes Net
[Figure: a Class Value node with an arc to each of the features F1, F2, F3, …, FN-2, FN-1, FN.]

Alternative: TAN
[Figure: as in naïve Bayes, the Class Value node has an arc to each of the features F1, …, FN, but there are also tree-structured arcs among the features themselves.]

Tree-Augmented Naïve Bayes
In addition to the naïve Bayes arcs (class to each feature), we are permitted a directed tree among the features. Given this restriction, there exists a polynomial-time algorithm to find the maximum-likelihood structure.

TAN Learning Algorithm (Friedman, Geiger & Goldszmidt, 1997)
For every pair of features, compute the mutual information (the information gain of one for the other) conditional on the class.
Add arcs between all pairs of features, weighted by this value.
Compute the maximum-weight spanning tree, and direct its arcs away from the root.
Compute the parameters as already seen.
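A rough sketch of this procedure (my own illustration, not the authors' code), using empirical counts for the conditional mutual information and networkx for the maximum-weight spanning tree; `tan_structure` returns only the tree arcs among the features, to which the class-to-feature arcs would be added:

```python
import itertools
import math
from collections import Counter

import networkx as nx

def conditional_mutual_info(xs, ys, cs):
    """Empirical I(X; Y | C) from three equal-length sequences of discrete values."""
    n = len(cs)
    xyc = Counter(zip(xs, ys, cs))
    xc, yc, c = Counter(zip(xs, cs)), Counter(zip(ys, cs)), Counter(cs)
    return sum((n_xyc / n) * math.log(n_xyc * c[cv] / (xc[(x, cv)] * yc[(y, cv)]))
               for (x, y, cv), n_xyc in xyc.items())

def tan_structure(features, labels):
    """`features` is a list of columns (one list of values per feature); returns
    the directed tree arcs (parent, child) among the features, arbitrarily
    rooted at feature 0.  Class-to-feature arcs are added separately."""
    g = nx.Graph()
    for i, j in itertools.combinations(range(len(features)), 2):
        g.add_edge(i, j,
                   weight=conditional_mutual_info(features[i], features[j], labels))
    tree = nx.maximum_spanning_tree(g)
    return list(nx.bfs_tree(tree, 0).edges())   # direct every arc away from the root
```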

One Other Key Point
The previous discussion assumes we are going to make a prediction based on the best (e.g., MAP or maximum-likelihood) single hypothesis. Alternatively, we could avoid committing to a single Bayes net: we could instead consider all Bayes nets and maintain a probability for each. For any new query we could calculate the prediction of every network, weight each network's prediction by the probability that it is the correct network (given our previous training data), and go with the highest-scoring prediction. Such a predictor is the Bayes-optimal predictor.

Problem with Bayes Optimal
Because there is a superexponential number of structures, we do not want to average over all of them. Two options are used in practice:
  Selective model averaging: just choose a subset of “best” but “distinct” models (networks) and pretend the subset is exhaustive (a prediction sketch appears below).
  Go back to MAP/ML (model selection).
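For selective model averaging, the prediction step might look like the sketch below (illustrative only; `model.predict(query)` is a hypothetical interface returning a distribution over the query variable, and `posteriors` holds each candidate network's approximate posterior probability):

```python
from collections import defaultdict

def averaged_prediction(models, posteriors, query):
    """Combine the predictions of several candidate networks, weighting each by
    the (approximate) posterior probability that it is the correct network."""
    combined = defaultdict(float)
    for model, p_model in zip(models, posteriors):
        for value, p_value in model.predict(query).items():   # hypothetical interface
            combined[value] += p_model * p_value
    return max(combined, key=combined.get)   # highest-scoring predicted value
```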