Bayesian Networks: Motivation
Capture independence and conditional independence where they exist. Among variables where dependencies exist, encode only the relevant portion of the full joint. Use a graphical representation for which we can more easily investigate the complexity of inference and search for efficient inference algorithms.
© Jude Shavlik 2006, David Page 2007, CS 760 – Machine Learning (UW-Madison)
A Bayesian Network is a Directed Acyclic Graph (DAG) in which the nodes denote random variables and each node X has a conditional probability distribution P(X | Parents(X)). The intuitive meaning of an arc from X to Y is that X directly influences Y.
Additional Terminology
If X and its parents are discrete, we can represent the distribution P(X|Parents(X)) by a conditional probability table (CPT) specifying the probability of each value of X given each possible combination of settings for the variables in Parents(X). A conditioning case is a row in this CPT (a setting of values for the parent nodes).
An Example Bayes Net

Structure (implied by the CPTs): A → C, B → C, C → D, C → E.

P(A) = 0.1  (1,9)
P(B) = 0.2  (1,4)
P(C | A,B):  A=T,B=T: 0.9 (9,1);  A=T,B=F: 0.6 (3,2);  A=F,B=T: 0.3 (3,7);  A=F,B=F: 0.2 (1,4)
P(D | C):    C=T: (9,1);  C=F: (1,4)
P(E | C):    C=T: (4,1);  C=F: (1,9)

The numbers (probabilities) are also called parameters. In parentheses are the hyperparameters. These usually are not shown, but are important when doing parameter learning.
Bayesian Network Semantics
A Bayesian Network completely specifies a full joint distribution over its random variables, as below -- this is its meaning.

P(x1, …, xn) = Π_{i=1..n} P(xi | parents(Xi))

In the above, P(x1,…,xn) is shorthand notation for P(X1=x1,…,Xn=xn).
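To make the factorization concrete, here is a minimal Python sketch (not from the slides) that encodes the example network from the earlier slide as dictionaries and evaluates the joint probability of one complete assignment by multiplying CPT entries; the point probabilities used for P(D|C) and P(E|C) are assumed Beta means, since the slide gave only hyperparameters for those nodes.

```python
# Minimal sketch: the example network A -> C <- B, C -> D, C -> E.
# Each CPT maps a tuple of parent values to P(node = True | parents).
parents = {"A": (), "B": (), "C": ("A", "B"), "D": ("C",), "E": ("C",)}
cpt = {
    "A": {(): 0.1},
    "B": {(): 0.2},
    "C": {(True, True): 0.9, (True, False): 0.6,
          (False, True): 0.3, (False, False): 0.2},
    "D": {(True,): 0.9, (False,): 0.2},   # assumed means of (9,1) and (1,4)
    "E": {(True,): 0.8, (False,): 0.1},   # assumed means of (4,1) and (1,9)
}

def joint_probability(assignment):
    """P(x1,...,xn) = product over nodes of P(x_i | parents(X_i))."""
    prob = 1.0
    for node, value in assignment.items():
        parent_vals = tuple(assignment[p] for p in parents[node])
        p_true = cpt[node][parent_vals]
        prob *= p_true if value else 1.0 - p_true
    return prob

print(joint_probability({"A": True, "B": False, "C": True, "D": True, "E": False}))
# = 0.1 * 0.8 * 0.6 * 0.9 * 0.2 ≈ 0.00864
```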
Conditional Independence Again
A node X is conditionally independent of its predecessors given Parents(X). Markov Blanket of X: the set consisting of the parents of X, the children of X, and the other parents of the children of X. X is conditionally independent of all other nodes in the network given its Markov Blanket.
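As a small illustration (my own representation, not from the slides), the Markov blanket can be read directly off a map from each node to its parents:

```python
def markov_blanket(node, parents):
    """Parents of node, children of node, and the children's other parents."""
    children = {n for n, ps in parents.items() if node in ps}
    blanket = set(parents[node]) | children
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket

# Using the example network: A -> C <- B, C -> D, C -> E.
parents = {"A": (), "B": (), "C": ("A", "B"), "D": ("C",), "E": ("C",)}
print(markov_blanket("C", parents))   # {'A', 'B', 'D', 'E'}
print(markov_blanket("A", parents))   # {'C', 'B'}  (B is C's other parent)
```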
Learning CPTs from Complete Settings
Suppose we are given a set of data, where each data point is a complete setting for all the variables. One assumption we make is that the data set is a random sample from the distribution we’re trying to model. For each node in our network, we consider each entry in its CPT (each setting of values for its parents).
Learning CPTs (Continued)
For each entry in the CPT, we have a prior (possibly uniform) Dirichlet distribution over its values. We simply update this distribution based on the relevant data points (those that agree on the settings for the parents that correspond with this CPT entry). A second, implicit assumption is that the distributions over different rows of the CPT are independent of one another.
Learning CPTs (Continued)
Finally, it is worth noting that instead of this last assumption, we might have a stronger bias over the form of the CPT: we might believe it is a linear function or a tree, in which case we could use a learning algorithm we have seen already.
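As a sketch of this counting view (under my own representational assumptions: Boolean variables, data points as Python dicts), each CPT row keeps a Beta/Dirichlet count pair that starts at the prior and is incremented only by the data points matching that row's parent setting:

```python
from collections import defaultdict

def learn_cpt(node, node_parents, data, prior=(1, 1)):
    """Return {parent_setting: (alpha_true, alpha_false)} for a Boolean node.

    Each row starts at the (possibly uniform) prior and is updated only by
    the data points that agree on that row's parent setting.
    """
    counts = defaultdict(lambda: list(prior))
    for point in data:                      # each point: dict of variable -> bool
        row = tuple(point[p] for p in node_parents)
        counts[row][0 if point[node] else 1] += 1
    return {row: tuple(c) for row, c in counts.items()}

def cpt_probability(alpha_true, alpha_false):
    """Posterior-mean estimate of P(node = True | that parent setting)."""
    return alpha_true / (alpha_true + alpha_false)

# Hypothetical complete data over A, B, C, used to update the rows of P(C | A, B).
data = [
    {"A": True,  "B": True,  "C": True},
    {"A": True,  "B": True,  "C": False},
    {"A": False, "B": False, "C": False},
]
cpt_c = learn_cpt("C", ("A", "B"), data, prior=(1, 1))
print(cpt_c)
print({row: cpt_probability(*ab) for row, ab in cpt_c.items()})
```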
Simple Example

Suppose we believe the variables PinLength and HeadWeight directly influence whether a thumbtack comes up heads or tails. For simplicity, suppose PinLength can be long or short and HeadWeight can be heavy or light. Suppose we adopt the following prior over the CPT entries for the variable Thumbtack.
Simple Example (Continued)
[Figure: the prior over the Thumbtack CPT, shown as a table of Dirichlet hyperparameters with one column per combination of PinLength and HeadWeight.]
Simple Example (Continued)
Notice that we have equal confidence in our prior (initial) probabilities for the first and last columns of the CPT, less confidence in those of the second column, and more in those of the third. A new data point will affect only one of the columns, and it will have more effect on the second column than on any other.
More Difficult Case: What if Some Variables Are Missing?
Recall our earlier notion of hidden variables. Sometimes a variable is hidden because it cannot be explicitly measured. For example, we might hypothesize that a chromosomal abnormality is responsible for some patients with a particular cancer not responding well to treatment.
Missing Values (Continued)
We might include a node for this chromosomal abnormality in our network because we strongly believe it exists, other variables can be used to predict it, and it is in turn predictive of still other variables. But in estimating CPTs from data, none of our data points has a value for this variable.
General EM Framework

Given: data with missing values, a space of possible models, and an initial model.

Repeat until no change greater than threshold:
  Expectation (E) Step: compute the expectation over the missing values, given the current model.
  Maximization (M) Step: replace the current model with the model that maximizes the probability of the data.
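Schematically (the function names below are placeholders of my own, not anything defined in the slides), the loop looks like this:

```python
def em(initial_model, data, e_step, m_step, distance, threshold=1e-4):
    """Generic EM loop: alternate E and M steps until the model stops changing
    by more than the threshold (convergence to a local optimum)."""
    model = initial_model
    while True:
        expectations = e_step(model, data)      # fill in missing values (softly)
        new_model = m_step(expectations, data)  # re-fit model to the completed data
        if distance(new_model, model) <= threshold:
            return new_model
        model = new_model
```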
(“Soft”) EM vs. “Hard” EM
Standard (soft) EM: the expectation is a probability distribution. Hard EM: the expectation is “all or nothing”, i.e., the single most likely/probable value. The advantage of hard EM is computational efficiency when the expectation is over a state consisting of values for multiple variables (the next example illustrates this).
EM for Parameter Learning: E Step
For each data point with missing values, compute the probability of each possible completion of that data point. Replace the original data point with all these completions, weighted by their probabilities. Computing the probability of each completion (the expectation) is just answering a query over the missing variables given the others.
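A sketch of that computation, reusing the joint_probability function and dictionary representation from the earlier example (both my own assumptions): each completion's weight is its joint probability normalized over all completions, which equals P(missing values | observed values).

```python
from itertools import product

def e_step_completions(point, variables, joint_probability):
    """Return a list of (completed_point, weight) pairs for one data point.

    point maps each variable to True/False, or to None when the value is
    missing; a completion's weight is P(missing | observed), obtained by
    normalizing the joint probabilities of all completions.
    """
    missing = [v for v in variables if point[v] is None]
    completions = []
    for values in product([True, False], repeat=len(missing)):
        completed = dict(point)
        completed.update(zip(missing, values))
        completions.append((completed, joint_probability(completed)))
    total = sum(w for _, w in completions)
    return [(c, w / total) for c, w in completions]

# Example usage with joint_probability from the earlier sketch:
# A and B are missing, C, D, E are observed.
weighted = e_step_completions(
    {"A": None, "B": None, "C": True, "D": True, "E": False},
    ["A", "B", "C", "D", "E"], joint_probability)
for completed, weight in weighted:
    print(completed["A"], completed["B"], round(weight, 3))
```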
EM for Parameter Learning: M Step
Use the completed data set to update our Dirichlet distributions as we would with any complete data set, except that our counts (tallies) may now be fractional. Then update the CPTs based on the new Dirichlet distributions, as we would with any complete data set.
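Continuing the same hypothetical representation, a minimal M-step sketch for one Boolean node accumulates the fractional weights into Dirichlet counts that start from the prior (see the overcounting subtlety below) and then re-reads the CPT entries from the posterior means:

```python
from collections import defaultdict

def m_step_cpt(weighted_points, node, node_parents, prior=(1, 1)):
    """Re-estimate one Boolean node's CPT from weighted, completed data points.

    weighted_points is a list of (assignment, weight) pairs.  Counts start
    from the prior each iteration and grow by fractional weights rather than
    whole tallies.
    """
    counts = defaultdict(lambda: list(prior))
    for assignment, weight in weighted_points:
        row = tuple(assignment[p] for p in node_parents)
        counts[row][0 if assignment[node] else 1] += weight
    # CPT entries as posterior means of the updated Beta/Dirichlet counts.
    return {row: a_true / (a_true + a_false)
            for row, (a_true, a_false) in counts.items()}

# e.g. new_cpt_c = m_step_cpt(weighted, "C", ("A", "B"))  # using the E-step output
```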
EM for Parameter Learning
Iterate the E and M steps until no changes occur. We will not necessarily get the global MAP (or ML, given uniform priors) setting of all the CPT entries, but under a natural set of conditions we are guaranteed convergence to a local MAP solution. The EM algorithm is used for a wide variety of tasks outside of BN learning as well.
Subtlety for Parameter Learning
There is a risk of overcounting based on the number of iterations required to converge to settings for the missing values. To avoid this, after each repetition of the E step, reset all Dirichlet distributions (back to their priors) before repeating the M step.
EM for Parameter Learning
[Figure: the example network and CPTs from before, together with a data set over A, B, C, D, E in which some values are missing (shown as “?”).]
EM for Parameter Learning
[Figure: the same network; the E step fills in each missing value with a probability distribution over its possible values, e.g. 0: 0.99 / 1: 0.01, 0: 0.02 / 1: 0.98, 0: 0.80 / 1: 0.20, 0: 0.70 / 1: 0.30, and 0: 0.003 / 1: 0.997.]
Multiple Missing Values
[Figure: the same network and CPTs, now with a data point in which two values are missing (shown as “?”).]
Multiple Missing Values
[Figure: the E step replaces that data point with its four possible completions of the two missing values, weighted 0.72, 0.18, 0.04, and 0.06.]
Multiple Missing Values
[Figure: the M step updates the Dirichlet counts using the weighted completions (counts may now be fractional), e.g. P(A) = 0.1 (1.1, 9.9); P(B) = 0.17 (1, 5); P(C | A=T,B=F) = 0.6 (3.06, 2.04); P(C | A=F,B=F) = 0.2 (1.18, 4.72); P(D | C): C=T (9, 1.24), C=F (1, 4.76); P(E | C): C=T (4.24, 1), C=F (1.76, 9).]
Problems with EM

Only a local optimum (there is not much way around that, though). Deterministic: if the priors are uniform, it may be impossible to make any progress. The next figure illustrates the need for some randomization to move us off an uninformative prior.
What will EM do here?

Network: A → B → C
P(A) = 0.5 (1,1)
P(B | A):  A=T: (1,1);  A=F: (1,1)
P(C | B):  B=T: (1,1);  B=F: (1,1)
Data: a single point with A, B, and C all missing (? ? ?).
EM Dependent on Initial Beliefs
Network: A → B → C
P(A) = 0.5 (1,1)
P(B | A):  A=T: (6,4);  A=F: (4,6)
P(C | B):  B=T: (1,1);  B=F: (1,1)
Data: a single point with A, B, and C all missing (? ? ?).
EM Dependent on Initial Beliefs
(Same network, priors, and data as the previous slide.) B is more likely T than F when A is T. Filling this in makes C more likely T than F when B is T. This in turn makes B still more likely T than F when A is T, and so on. A small change in the CPT for B (swapping 0.6 and 0.4) would have the opposite effect.
Learning Structure + Parameters
The number of structures is superexponential, and finding the optimal structure (ML or MAP) is NP-complete. Two common options: severely restrict the possible structures, e.g. tree-augmented naïve Bayes (TAN); or use heuristic search (e.g., the sparse candidate algorithm).
Recall: Naïve Bayes Net
[Figure: the naïve Bayes structure, with arcs from the Class Value node to each feature F1, F2, F3, …, FN-2, FN-1, FN.]
Alternative: TAN

[Figure: the TAN structure, with arcs from Class Value to each feature F1, …, FN plus a directed tree of arcs among the features themselves.]
Tree-Augmented Naïve Bayes
In addition to the naïve Bayes arcs (from the class to each feature), we are permitted a directed tree among the features. Given this restriction, there exists a polynomial-time algorithm to find the maximum-likelihood structure.
TAN Learning Algorithm (Friedman, Geiger & Goldszmidt ’97)
For every pair of features, compute the mutual information between them (the information gain of one for the other), conditional on the class. Add arcs between all pairs of features, weighted by this value. Compute the maximum-weight spanning tree, and direct its arcs outward from a chosen root. Finally, compute the parameters as already seen.
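A hedged sketch of this procedure, assuming Boolean features, data points stored as Python dicts, conditional mutual information estimated from simple empirical counts (natural log), and a small Prim-style loop for the maximum-weight spanning tree; none of these representational choices come from the slides or from Friedman et al.

```python
from collections import Counter
from itertools import combinations
from math import log

def conditional_mutual_information(data, x, y, cls):
    """I(X; Y | Class), estimated from empirical frequencies (natural log)."""
    n = len(data)
    joint = Counter((d[x], d[y], d[cls]) for d in data)
    x_c = Counter((d[x], d[cls]) for d in data)
    y_c = Counter((d[y], d[cls]) for d in data)
    c = Counter(d[cls] for d in data)
    mi = 0.0
    for (xv, yv, cv), count in joint.items():
        p_xyc = count / n
        mi += p_xyc * log(p_xyc * (c[cv] / n)
                          / ((x_c[(xv, cv)] / n) * (y_c[(yv, cv)] / n)))
    return mi

def tan_feature_tree(data, features, cls):
    """Maximum-weight spanning tree over the features, arcs directed from a root.

    Returns {feature: parent feature or None}; in the full TAN model every
    feature also gets the class as an additional parent.
    """
    weight = {frozenset(pair): conditional_mutual_information(data, pair[0], pair[1], cls)
              for pair in combinations(features, 2)}
    root = features[0]
    parent = {root: None}
    remaining = set(features) - {root}
    while remaining:                               # Prim-style tree growth
        f, p = max(((f, p) for f in remaining for p in parent),
                   key=lambda fp: weight[frozenset(fp)])
        parent[f] = p
        remaining.discard(f)
    return parent

# Tiny hypothetical data set: F1 and F2 copy a shared coin flip, F3 copies the
# class, so I(F1; F2 | Class) is the largest pairwise weight and F1-F2 end up
# adjacent in the learned tree.
data = [{"F1": flip, "F2": flip, "F3": cls_v, "Class": cls_v}
        for cls_v in (True, False) for flip in (True, False)]
print(tan_feature_tree(data, ["F1", "F2", "F3"], "Class"))
```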
One Other Key Point

The previous discussion assumes we are going to make a prediction based on the best (e.g., MAP or maximum-likelihood) single hypothesis. Alternatively, we could avoid committing to a single Bayes net. Instead, we could consider all possible Bayes nets, maintaining a probability for each. For any new query, we could calculate the prediction of every network.
One Other Key Point (Continued)
We could then weight each network’s prediction by the probability that it is the correct network (given our previous training data), and go with the highest-scoring prediction. Such a predictor is the Bayes-optimal predictor.
Problem with Bayes Optimal
Because there is a superexponential number of structures, we don’t want to average over all of them. Two options are used in practice: selective model averaging, in which we choose a subset of “best” but “distinct” models (networks) and pretend it is exhaustive; or going back to MAP/ML (model selection).
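As a minimal sketch of the weighting idea (the names and numbers below are hypothetical, not from the slides), each candidate network contributes its prediction in proportion to its posterior weight; over a chosen subset this is selective model averaging, over all structures it would be Bayes-optimal.

```python
def averaged_prediction(models, query):
    """Combine per-model predictions, weighting each by the model's posterior.

    models: list of (posterior_weight, predict_fn) pairs, where predict_fn(query)
    returns that network's probability for the query.
    """
    total = sum(weight for weight, _ in models)
    return sum(weight * predict(query) for weight, predict in models) / total

# Hypothetical example: three candidate networks with posterior weights 0.6, 0.3, 0.1.
models = [(0.6, lambda q: 0.9), (0.3, lambda q: 0.4), (0.1, lambda q: 0.2)]
print(averaged_prediction(models, query=None))   # 0.6*0.9 + 0.3*0.4 + 0.1*0.2 = 0.68
```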