Uncertainty, Probability and Bayesian Networks


1 Uncertainty, Probability and Bayesian Networks
Computer Science CPSC 502 (Ch. 6)

2 Outline
- Uncertainty and Probability
- Marginal and Conditional Independence
- Bayesian networks

3 Where are we?
[Course map figure: Representation / Reasoning Technique, by environment (deterministic vs. stochastic) and problem type (static vs. sequential)]
- Deterministic, static: Constraint Satisfaction (Vars + Constraints; Arc Consistency, Search) and Logics (Query; Search)
- Deterministic, sequential: Planning (STRIPS; Search)
- Stochastic, static: Belief Nets (Query; Variable Elimination)
- Stochastic, sequential: Decision Nets (Variable Elimination) and Markov Processes (Value Iteration)
Done with Deterministic Environments.

4 Where are we?
[Same course map as above] Second Part of the Course: the stochastic environments.

5 Where Are We?
[Same course map as above] We'll focus on Belief Nets.

6 Two main sources of uncertainty
(From Lecture 2)
Sensing Uncertainty: the agent cannot fully observe a state of interest. For example:
- Right now, how many people are in this building?
- What disease does this patient have?
- Where is the soccer player behind me?
Effect Uncertainty: the agent cannot be certain about the effects of its actions. For example:
- If I work hard, will I get an A?
- Will this drug work for this patient?
- Where will the ball go when I kick it?

7 Motivation for uncertainty
To act in the real world, we almost always have to handle uncertainty (both effect and sensing uncertainty). Deterministic domains are an abstraction; sometimes this abstraction enables more powerful inference. Now we no longer make this abstraction:
- AI's main focus shifted from logic to probability in the 1980s
- The language of probability is very expressive and general
- New representations enable efficient reasoning; we will see some of these, in particular Bayesian networks
Reasoning under uncertainty is part of the 'new' AI. This is not a dichotomy: the framework for probability is logical! New frontier: combine logic and probability.

8 Probability as a measure of uncertainty/ignorance
Probability measures an agent's degree of belief in the truth of propositions about states of the world. Belief in a proposition f can be measured by a number between 0 and 1: this is the probability of f.
E.g., P("roll of fair die came out as a 6") = 1/6 ≈ 16.7% ≈ 0.167
P(f) = 0 means that f is believed to be definitely false.
P(f) = 1 means f is believed to be definitely true.
Using probabilities between 0 and 1 is purely a convention.

9 Probability Theory and Random Variables
Probability theory is a system of logical axioms and formal operations for sound reasoning under uncertainty.
Basic element: random variable X. X is a variable like the ones we have seen in CSP/Planning/Logic, but the agent can be uncertain about the value of X. As usual, the domain of a random variable X, written dom(X), is the set of values X can take.
Types of variables:
- Boolean: e.g., Cancer (does the patient have cancer or not?)
- Categorical: e.g., CancerType could be one of {breastCancer, lungCancer, skinMelanomas}
- Numeric: e.g., Temperature (integer or real)
We will focus on Boolean and categorical variables.

10 Possible Worlds
A possible world specifies an assignment to each random variable. Example: weather in Vancouver, represented by random variables Temperature: {hot, mild, cold} and Weather: {sunny, cloudy}. There are 6 possible worlds:
  w1: Weather = sunny,  Temperature = hot
  w2: Weather = sunny,  Temperature = mild
  w3: Weather = sunny,  Temperature = cold
  w4: Weather = cloudy, Temperature = hot
  w5: Weather = cloudy, Temperature = mild
  w6: Weather = cloudy, Temperature = cold
w ⊨ f means that proposition f is true in world w. A probability measure µ(w) over possible worlds w assigns each world a nonnegative real number such that Σ_w µ(w) = 1, because for sure we are in one of these worlds!

11 Possible Worlds
A possible world specifies an assignment to each random variable. Example: weather in Vancouver, represented by random variables Temperature: {hot, mild, cold} and Weather: {sunny, cloudy}. There are 6 possible worlds:
  w1: Weather = sunny,  Temperature = hot,  µ(w1) = 0.10
  w2: Weather = sunny,  Temperature = mild, µ(w2) = 0.20
  w3: Weather = sunny,  Temperature = cold, µ(w3) = 0.10
  w4: Weather = cloudy, Temperature = hot,  µ(w4) = 0.05
  w5: Weather = cloudy, Temperature = mild, µ(w5) = 0.35
  w6: Weather = cloudy, Temperature = cold, µ(w6) = 0.20
w ⊨ f means that proposition f is true in world w. A probability measure µ(w) over possible worlds w assigns each world a nonnegative real number such that Σ_w µ(w) = 1, because for sure we are in one of these worlds!
The probability of proposition f is defined by P(f) = Σ_{w ⊨ f} µ(w), i.e., the sum of the probabilities of the worlds w in which f is true.

12 Example
What's the probability of it being cloudy or cold?
µ(w3) + µ(w4) + µ(w5) + µ(w6) = 0.10 + 0.05 + 0.35 + 0.20 = 0.70
  w1: sunny,  hot,  µ(w1) = 0.10
  w2: sunny,  mild, µ(w2) = 0.20
  w3: sunny,  cold, µ(w3) = 0.10
  w4: cloudy, hot,  µ(w4) = 0.05
  w5: cloudy, mild, µ(w5) = 0.35
  w6: cloudy, cold, µ(w6) = 0.20
Remember: the probability of proposition f is defined by P(f) = Σ_{w ⊨ f} µ(w), the sum of the probabilities of the worlds w in which f is true.
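As a quick sanity check, here is a minimal Python sketch of this definition (the names and table layout are illustrative, not from the slides): each possible world is a dict plus its measure µ(w), and P(f) sums µ(w) over the worlds where f holds.

    # Possible worlds for the Weather/Temperature example, with measure mu(w)
    worlds = [
        ({"Weather": "sunny",  "Temperature": "hot"},  0.10),
        ({"Weather": "sunny",  "Temperature": "mild"}, 0.20),
        ({"Weather": "sunny",  "Temperature": "cold"}, 0.10),
        ({"Weather": "cloudy", "Temperature": "hot"},  0.05),
        ({"Weather": "cloudy", "Temperature": "mild"}, 0.35),
        ({"Weather": "cloudy", "Temperature": "cold"}, 0.20),
    ]

    def prob(f):
        """P(f): sum of mu(w) over the worlds w in which proposition f is true."""
        return sum(mu for w, mu in worlds if f(w))

    # P(cloudy or cold) = 0.10 + 0.05 + 0.35 + 0.20 -> 0.70
    print(prob(lambda w: w["Weather"] == "cloudy" or w["Temperature"] == "cold"))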

13 Joint Probability Distribution
Definition (probability distribution): a probability distribution P on a random variable X is a function dom(X) → [0,1] such that Σ_{x ∈ dom(X)} P(X=x) = 1.
Joint distribution over random variables X1, …, Xn: a probability distribution over the joint random variable <X1, …, Xn> with domain dom(X1) × … × dom(Xn) (the Cartesian product).
Think of a joint distribution over n variables as the n-dimensional table of the corresponding possible worlds. Each row corresponds to an assignment X1 = x1, …, Xn = xn and its probability P(X1 = x1, …, Xn = xn). E.g., the {Weather, Temperature} example:
  sunny, hot: 0.10; sunny, mild: 0.20; sunny, cold: 0.10; cloudy, hot: 0.05; cloudy, mild: 0.35; cloudy, cold: 0.20

14 Marginalization
Given the joint distribution, we can compute distributions over subsets of the variables through marginalization:
P(X=x) = Σ_{z ∈ dom(Z)} P(X=x, Z=z)    (marginalization over Z)
We also write this as P(X) = Σ_{z ∈ dom(Z)} P(X, Z=z). Simply an application of the definition of a probability measure!
Remember? The probability of proposition f is defined by P(f) = Σ_{w ⊨ f} µ(w), the sum of the probabilities of the worlds w in which f is true.

15 Marginalization
Given the joint distribution, we can compute distributions over subsets of the variables through marginalization:
P(X=x) = Σ_{z ∈ dom(Z)} P(X=x, Z=z)    (marginalization over Z)
We also write this as P(X) = Σ_{z ∈ dom(Z)} P(X, Z=z). This corresponds to summing out a dimension in the table. Does the new table still sum to 1?
  Joint: sunny, hot: 0.10; sunny, mild: 0.20; sunny, cold: 0.10; cloudy, hot: 0.05; cloudy, mild: 0.35; cloudy, cold: 0.20
  Marginal over Temperature: hot: ?, mild: ?, cold: ?

16 Marginalization
Given the joint distribution, we can compute distributions over subsets of the variables through marginalization:
P(X=x) = Σ_{z ∈ dom(Z)} P(X=x, Z=z)    (marginalization over Z)
We also write this as P(X) = Σ_{z ∈ dom(Z)} P(X, Z=z). This corresponds to summing out a dimension in the table. The new table still sums to 1. It must, since it's a probability distribution!
  Joint: sunny, hot: 0.10; sunny, mild: 0.20; sunny, cold: 0.10; cloudy, hot: 0.05; cloudy, mild: 0.35; cloudy, cold: 0.20
  Marginal over Temperature: hot: ?, mild: ?, cold: ?

17 Marginalization
Given the joint distribution, we can compute distributions over subsets of the variables through marginalization:
P(X=x) = Σ_{z ∈ dom(Z)} P(X=x, Z=z)    (marginalization over Z)
We also write this as P(X) = Σ_{z ∈ dom(Z)} P(X, Z=z). This corresponds to summing out a dimension in the table. The new table still sums to 1. It must, since it's a probability distribution!
  Joint: sunny, hot: 0.10; sunny, mild: 0.20; sunny, cold: 0.10; cloudy, hot: 0.05; cloudy, mild: 0.35; cloudy, cold: 0.20
  Marginal over Temperature: hot: 0.15, mild: 0.55, cold: 0.30
P(Temperature=hot) = P(Weather=sunny, Temperature=hot) + P(Weather=cloudy, Temperature=hot) = 0.10 + 0.05 = 0.15
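A small Python sketch of this operation on the same table (an illustrative implementation, not from the slides): sum out the dimension we don't want to keep.

    from collections import defaultdict

    # Joint distribution as a map from (Weather, Temperature) assignments to probabilities
    joint = {
        ("sunny", "hot"): 0.10, ("sunny", "mild"): 0.20, ("sunny", "cold"): 0.10,
        ("cloudy", "hot"): 0.05, ("cloudy", "mild"): 0.35, ("cloudy", "cold"): 0.20,
    }

    def marginalize_out_weather(joint):
        """P(Temperature=t) = sum over w in dom(Weather) of P(Weather=w, Temperature=t)."""
        out = defaultdict(float)
        for (weather, temperature), p in joint.items():
            out[temperature] += p
        return dict(out)

    # hot: 0.15, mild: 0.55, cold: 0.30 (up to float rounding); still sums to 1
    print(marginalize_out_weather(joint))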

18 Marginalization
We can also marginalize over more than one variable at once:
P(X=x) = Σ_{z1 ∈ dom(Z1)} … Σ_{zn ∈ dom(Zn)} P(X=x, Z1=z1, …, Zn=zn)
E.g., marginalization over Temperature and Wind:
  Wind=yes: sunny, hot: 0.04; sunny, mild: 0.09; sunny, cold: 0.07; cloudy, hot: 0.01; cloudy, mild: 0.10; cloudy, cold: 0.12
  Wind=no:  sunny, hot: 0.06; sunny, mild: 0.11; sunny, cold: 0.03; cloudy, hot: 0.04; cloudy, mild: 0.25; cloudy, cold: 0.08
  Result: P(Weather): sunny: 0.40, cloudy: 0.60

19 Marginalization
We can also get marginals for more than one variable:
P(X=x, Y=y) = Σ_{z1 ∈ dom(Z1)} … Σ_{zn ∈ dom(Zn)} P(X=x, Y=y, Z1=z1, …, Zn=zn)
E.g., marginalizing over Wind only (same Wind/Weather/Temperature table as above):
  Result: P(Weather, Temperature): sunny, hot: 0.10; sunny, mild: 0.20; sunny, cold: 0.10; cloudy, hot: 0.05; cloudy, mild: 0.35; cloudy, cold: 0.20

20 Marginalization
We can also get marginals for more than one variable. Still simply an application of the definition of a probability measure!
P(X=x, Y=y) = Σ_{z1 ∈ dom(Z1)} … Σ_{zn ∈ dom(Zn)} P(X=x, Y=y, Z1=z1, …, Zn=zn)
The probability of proposition f is defined by P(f) = Σ_{w ⊨ f} µ(w), the sum of the probabilities of the worlds w in which f is true.

21 Conditioning
Conditioning: revise beliefs based on new observations.
- Build a probabilistic model (the joint probability distribution, JPD), taking into account all background information. This is called the prior probability distribution; denote the prior probability for proposition h as P(h).
- Observe new information about the world; call all information we received subsequently the evidence e.
- Integrate the two sources of information to compute the conditional probability P(h|e). This is also called the posterior probability of h given e.

22 Example for conditioning
You have a prior for the joint distribution of weather and temperature:
  w1: sunny,  hot,  µ(w1) = 0.10
  w2: sunny,  mild, µ(w2) = 0.20
  w3: sunny,  cold, µ(w3) = 0.10
  w4: cloudy, hot,  µ(w4) = 0.05
  w5: cloudy, mild, µ(w5) = 0.35
  w6: cloudy, cold, µ(w6) = 0.20
Now, you look outside and see that it's sunny. You are now certain that you're in one of worlds w1, w2, or w3. To get the conditional probability P(T|W=sunny), renormalize µ(w1), µ(w2), µ(w3) to sum to 1:
µ(w1) + µ(w2) + µ(w3) = 0.10 + 0.20 + 0.10 = 0.40
  P(T=hot|W=sunny)  = 0.10/0.40 = 0.25
  P(T=mild|W=sunny) = ?
  P(T=cold|W=sunny) = ?

23 Example for conditioning
You have a prior for the joint distribution of weather and temperature (same table as above). Now, you look outside and see that it's sunny. You are now certain that you're in one of worlds w1, w2, or w3. To get the conditional probability P(T|W=sunny), renormalize µ(w1), µ(w2), µ(w3) to sum to 1:
µ(w1) + µ(w2) + µ(w3) = 0.10 + 0.20 + 0.10 = 0.40
  P(T=hot|W=sunny)  = 0.10/0.40 = 0.25
  P(T=mild|W=sunny) = 0.20/0.40 = 0.50
  P(T=cold|W=sunny) = 0.10/0.40 = 0.25
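The renormalization step can be written directly (again an illustrative sketch, reusing the `joint` table from the marginalization sketch above): keep the worlds consistent with the observation and divide by their total mass P(e).

    def condition_on_weather(joint, observed):
        """P(Temperature | Weather=observed): filter consistent worlds, renormalize."""
        consistent = {t: p for (w, t), p in joint.items() if w == observed}
        total = sum(consistent.values())            # P(Weather=observed); here 0.40
        return {t: p / total for t, p in consistent.items()}

    # {'hot': 0.25, 'mild': 0.5, 'cold': 0.25}
    print(condition_on_weather(joint, "sunny"))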

24 Conditional Probability
Definition (conditional probability): the conditional probability of proposition h given evidence e is
P(h|e) = P(h ∧ e) / P(e)
P(e): sum of the probabilities of all worlds in which e is true.
P(h ∧ e): sum of the probabilities of all worlds in which both h and e are true.

25 Conditional Probability among Random Variables
P(X | Y) = P(X, Y) / P(Y) expresses the conditional probability of each possible value for X given each possible value for Y. E.g., P(Temperature | Weather) = P(Temperature, Weather) / P(Weather):
               T = hot         T = cold
  W = sunny    P(hot|sunny)    P(cold|sunny)
  W = cloudy   P(hot|cloudy)   P(cold|cloudy)
Which of the following is true?
- The probabilities in each row should sum to 1
- The probabilities in each column should sum to 1
- Both of the above
- None of the above
Crucial that you can answer this question. Think about it at home and let me know if you have questions next time.

26 Inference by Enumeration
Great, we can compute arbitrary probabilities now!
Given:
- a prior joint probability distribution (JPD) on a set of variables X
- specific values e for the evidence variables E (a subset of X)
We want to compute the posterior joint distribution of the query variables Y (a subset of X) given evidence e:
- Step 1: condition to get distribution P(X|e)
- Step 2: marginalize to get distribution P(Y|e)

27 Inference by Enumeration: example
Given P(W,C,T) as the JPD below, and evidence e: "Wind=yes". What is the probability that it is hot? I.e., P(Temperature=hot | Wind=yes).
Step 1: condition to get distribution P(X|e)
  Windy W = yes: sunny, hot: 0.04; sunny, mild: 0.09; sunny, cold: 0.07; cloudy, hot: 0.01; cloudy, mild: 0.10; cloudy, cold: 0.12
  Windy W = no:  sunny, hot: 0.06; sunny, mild: 0.11; sunny, cold: 0.03; cloudy, hot: 0.04; cloudy, mild: 0.25; cloudy, cold: 0.08

28 Inference by Enumeration: example
Given P(X) as the JPD above, and evidence e: "Wind=yes". What is the probability that it is hot? I.e., P(Temperature=hot | Wind=yes).
Step 1: condition to get distribution P(X|e), i.e., fill in the table
  P(C, T | W=yes): sunny, hot: ?; sunny, mild: ?; sunny, cold: ?; cloudy, hot: ?; cloudy, mild: ?; cloudy, cold: ?

29 Inference by Enumeration: example
Given P(X) as the JPD above, and evidence e: "Wind=yes". What is the probability that it is hot? I.e., P(Temperature=hot | Wind=yes).
Step 1: condition to get distribution P(X|e). The entries with Wind=yes sum to 0.43, so:
  P(C, T | W=yes):
    sunny, hot:   0.04/0.43 ≈ 0.10
    sunny, mild:  0.09/0.43 ≈ 0.21
    sunny, cold:  0.07/0.43 ≈ 0.16
    cloudy, hot:  0.01/0.43 ≈ 0.02
    cloudy, mild: 0.10/0.43 ≈ 0.23
    cloudy, cold: 0.12/0.43 ≈ 0.28

30 Inference by Enumeration: example
Given P(X) as the JPD above, and evidence e: "Wind=yes". What is the probability that it is hot? I.e., P(Temperature=hot | Wind=yes).
Step 2: marginalize to get distribution P(Y|e)
  P(C, T | W=yes): sunny, hot: 0.10; sunny, mild: 0.21; sunny, cold: 0.16; cloudy, hot: 0.02; cloudy, mild: 0.23; cloudy, cold: 0.28
  P(T | W=yes):
    hot:  0.10 + 0.02 = 0.12
    mild: 0.21 + 0.23 = 0.44
    cold: 0.16 + 0.28 = 0.44
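Both steps together, on the three-variable JPD from this example (a hedged Python sketch; the variable names and dict encoding are mine, the numbers are from the slides):

    # JPD over (Windy, Cloudy, Temperature)
    jpd = {
        ("yes", "sunny", "hot"): 0.04, ("yes", "sunny", "mild"): 0.09,
        ("yes", "sunny", "cold"): 0.07, ("yes", "cloudy", "hot"): 0.01,
        ("yes", "cloudy", "mild"): 0.10, ("yes", "cloudy", "cold"): 0.12,
        ("no", "sunny", "hot"): 0.06, ("no", "sunny", "mild"): 0.11,
        ("no", "sunny", "cold"): 0.03, ("no", "cloudy", "hot"): 0.04,
        ("no", "cloudy", "mild"): 0.25, ("no", "cloudy", "cold"): 0.08,
    }

    # Step 1: condition on the evidence Wind=yes
    p_e = sum(p for (w, c, t), p in jpd.items() if w == "yes")            # 0.43
    cond = {(c, t): p / p_e for (w, c, t), p in jpd.items() if w == "yes"}

    # Step 2: marginalize out Cloudy to get the query distribution
    p_hot = sum(p for (c, t), p in cond.items() if t == "hot")
    print(round(p_hot, 2))    # 0.12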

31 Problems of Inference by Enumeration
If we have n variables, and d is the size of the largest domain:
- What is the space complexity to store the joint distribution? We need to store the probability for each possible world. There are O(d^n) possible worlds, so the space complexity is O(d^n).
- How do we find the numbers for O(d^n) entries?
- Time complexity is O(d^n): in the worst case, we need to sum over all entries in the JPD.
We will look at an alternative way to perform inference, Bayesian networks: a formalism to exploit (conditional) independence between variables. But first, let's look at a couple more definitions.

32 Product Rule
By definition, we know that P(X|Y) = P(X, Y) / P(Y). We can rewrite this to
P(X, Y) = P(X|Y) P(Y)
In general:
P(X1, …, Xn) = P(Xn | X1, …, Xn-1) P(X1, …, Xn-1)

33 Chain Rule
Applying the product rule repeatedly:
P(X1, …, Xn) = P(Xn | X1, …, Xn-1) P(Xn-1 | X1, …, Xn-2) … P(X2 | X1) P(X1) = ∏_{i=1}^{n} P(Xi | X1, …, Xi-1)

34 Why does the chain rule help us?
We will see how, under specific circumstances (variable independence), this rule helps us gain compactness:
- We can represent the JPD as a product of marginal distributions
- We can simplify some terms when the variables involved are independent or conditionally independent

35 Outline
- Uncertainty and Probability
- Marginal and Conditional Independence
- Bayesian networks

36 Marginal Independence
Intuitively: if X and Y are marginally independent, then learning that Y=y does not change your belief in X, and this is true for all values y that Y could take. Formally: P(X|Y=y) = P(X) for all y ∈ dom(Y), or equivalently P(X, Y) = P(X) P(Y).
For example, the weather is marginally independent of the result of a die throw.

37 Examples for marginal independence
Intuitively (without numbers):
- Boolean random variable "Canucks win the Stanley Cup this season"
- Numerical random variable "Canucks' revenue last season"
Are the two marginally independent?

38 Examples for marginal independence
Intuitively (without numbers):
- Boolean random variable "Canucks win the Stanley Cup this season"
- Numerical random variable "Canucks' revenue last season"
Are the two marginally independent? No! Without revenue they cannot afford to keep their best players.

39 Exploiting marginal independence
If X1, …, Xn are all marginally independent, the JPD factorizes as P(X1, …, Xn) = P(X1) P(X2) … P(Xn). For n Boolean variables this takes only n numbers to specify, exponentially fewer than the 2^n − 1 needed for the full JPD!
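A sketch of the saving (the variable names and probabilities below are made up for illustration): under marginal independence, one stored number per Boolean variable reconstructs any JPD entry by multiplication.

    from itertools import product

    p_true = {"X1": 0.3, "X2": 0.6, "X3": 0.9}   # n = 3 stored numbers (illustrative)

    def joint_entry(assignment):
        """P(X1=x1, ..., Xn=xn) = product of the marginals, by independence."""
        p = 1.0
        for var, value in assignment.items():
            p *= p_true[var] if value else 1.0 - p_true[var]
        return p

    # 3 numbers stored instead of 2**3 - 1 = 7 free JPD entries;
    # the reconstructed JPD is a valid distribution (sums to 1):
    print(sum(joint_entry(dict(zip(p_true, vals)))
              for vals in product([True, False], repeat=3)))   # 1.0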

40 Follow-up Example
We said that "Canucks win the Stanley Cup this season" and "Canucks' revenue last season" are not marginally independent. But they are conditionally independent given the Canucks' line-up: once we know who is playing, learning their revenue last year won't change our belief in their chances.

41 Conditional Independence
Intuitively: if X and Y are conditionally independent given Z, then learning that Y=y does not change your belief in X when we already know Z=z, and this is true for all values y that Y could take and all values z that Z could take. Formally: P(X | Y=y, Z=z) = P(X | Z=z) for all y and z.

42 Example for Conditional Independence
[Figure: circuit with switch s2 (Up s2), wire w0, and light l1 (Lit l1)]
Whether light l1 is lit and the position of switch s2 are not marginally independent: the position of the switch determines whether there is power in the wire w0 connected to the light.

43 Example for Conditional Independence
[Figure: the same circuit, with nodes Up s2, Power w0, Lit l1]
Whether light l1 is lit and the position of switch s2 are not marginally independent: the position of the switch determines whether there is power in the wire w0 connected to the light.
However, whether light l1 is lit is conditionally independent of the position of switch s2 given whether there is power in wire w0. Once we know Power w0, learning values for any other variable will not change our beliefs about light l1; i.e., Lit l1 is independent of any other variable given Power w0.

44 Exploiting Conditional Independence
Recall the chain rule:
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | X1, …, Xi-1)
If each variable Xi is conditionally independent of most of its predecessors given a small subset of them, each factor in the product can be simplified, and far fewer numbers are needed.

45 Belief Networks
Belief networks and their extensions are representation and reasoning (R&R) systems explicitly defined to exploit independence in probabilistic reasoning.

46 Outline
- Uncertainty and Probability
- Marginal and Conditional Independence
- Bayesian networks

47 Bayesian Network Motivation
We want a representation and reasoning system that is based on conditional (and marginal) independence:
- compact yet expressive representation
- efficient reasoning procedures
Bayesian (Belief) Networks are such a representation. Named after Thomas Bayes (ca. 1702–1761); the term was coined in 1985 by Judea Pearl (1936–). Their invention changed the primary focus of AI from logic to probability!
Pearl recently received the very prestigious ACM Turing Award for his contributions to Artificial Intelligence, and is going to give a DLS talk on November 8!

48 Belief (or Bayesian) networks
Def. A belief network consists of:
- a directed acyclic graph (DAG) where each node is associated with a random variable Xi
- a domain for each variable Xi
- a set of conditional probability distributions, one for each node Xi given its parents Pa(Xi) in the graph: P(Xi | Pa(Xi))
The parents Pa(Xi) of a variable Xi are those variables upon which Xi depends directly.
A Bayesian network is a compact representation of the JPD for a set of variables (X1, …, Xn):
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))

49 How to build a Bayesian network
- Define a total order over the random variables: (X1, …, Xn)
- If we apply the chain rule, we have P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | X1, …, Xi-1)
- For each Xi, select as parents in the belief network a minimal set of its predecessors Pa(Xi) (in the total order defined over the variables) such that P(Xi | X1, …, Xi-1) = P(Xi | Pa(Xi)), i.e., Xi is conditionally independent of all its other predecessors given Pa(Xi)
- Putting it all together, in a belief network P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))
This is a compact representation of the JPD: a factorization of the JPD based on existing conditional independencies among the variables.

50 Example for BN construction: Fire Diagnosis
You want to diagnose whether there is a fire in a building:
- You receive a noisy report about whether everyone is leaving the building
- If everyone is leaving, this may have been caused by a fire alarm
- If there is a fire alarm, it may have been caused by a fire or by tampering
- If there is a fire, there may be smoke
Let's construct the Bayesian network for this.

51 Example for BN construction: Fire Diagnosis
First you choose the variables. In this case, all are Boolean:
- Tampering is true when the alarm has been tampered with
- Fire is true when there is a fire
- Alarm is true when there is an alarm
- Smoke is true when there is smoke
- Leaving is true if there are lots of people leaving the building
- Report is true if the sensor reports that lots of people are leaving the building

52 Example for BN construction: Fire Diagnosis
Next, define a total ordering of the variables. Let's say: Fire (F), Tampering (T), Alarm (A), Smoke (S), Leaving (L), Report (R).
The chain rule gives us:
P(F,T,A,S,L,R) = P(F) P(T|F) P(A|F,T) P(S|F,T,A) P(L|F,T,A,S) P(R|F,T,A,S,L)
Now choose the parents for each variable by evaluating conditional independencies.

53 Fire Diagnosis Example
P(F) P(T|F) P(A|F,T) P(S|F,T,A) P(L|F,T,A,S) P(R|F,T,A,S,L)
Fire (F) is the first variable in the ordering, X1. It does not have parents.
Network so far: Fire

54 Example
P(F) P(T|F) P(A|F,T) P(S|F,T,A) P(L|F,T,A,S) P(R|F,T,A,S,L)
Tampering (T) is independent of Fire (learning that one is true would not change your beliefs about the probability of the other), so P(T|F) = P(T).
Network so far: Tampering, Fire

55 Fire Diagnosis Example
P(F) P(T) P(A|F,T) P(S|F,T,A) P(L|F,T,A,S) P(R|F,T,A,S,L)
Alarm (A) depends on both Fire and Tampering: it could be caused by either or both.
Network so far: Tampering → Alarm ← Fire

56 Fire Diagnosis Example
P(F) P(T) P(A|F,T) P(S|F,T,A) P(L|F,T,A,S) P(R|F,T,A,S,L)
Smoke (S) is caused by Fire, and so is independent of Tampering and Alarm given whether there is a Fire: P(S|F,T,A) = P(S|F).
Network so far: Tampering → Alarm ← Fire → Smoke

57 Example
P(F) P(T) P(A|F,T) P(S|F) P(L|F,T,A,S) P(R|F,T,A,S,L)
Leaving (L) is caused by Alarm, and thus is independent of the other variables given Alarm: P(L|F,T,A,S) = P(L|A).
Network so far: Tampering → Alarm ← Fire → Smoke; Alarm → Leaving

58 Fire Diagnosis Example
P(F) P(T) P(A|F,T) P(S|F) P(L|A) P(R|F,T,A,S,L)
Report (R) is caused by Leaving, and thus is independent of the other variables given Leaving: P(R|F,T,A,S,L) = P(R|L).
Network so far: Tampering → Alarm ← Fire → Smoke; Alarm → Leaving → Report

59 Fire Diagnosis Example
The result is the following Bnet, and its corresponding, very compact factorization of the original JPD:
P(F,T,A,S,L,R) = P(F) P(T) P(A|F,T) P(S|F) P(L|A) P(R|L)
Network: Tampering → Alarm ← Fire → Smoke; Alarm → Leaving → Report

60 Example for BN construction: Fire Diagnosis
We are not done yet: we must specify the Conditional Probability Table (CPT) for each variable. All variables are Boolean. How many probabilities do we need to specify for this Bayesian network?

61 Example for BN construction: Fire Diagnosis
We are not done yet: we must specify the Conditional Probability Table (CPT) for each variable. All variables are Boolean. How many probabilities do we need to specify for this Bayesian network?
- P(Tampering): 1 probability; P(Fire): 1
- P(Alarm | Tampering, Fire): 4, i.e., 1 (independent) probability for each of the 4 instantiations of the parents
- P(Smoke | Fire): 2; P(Leaving | Alarm): 2; P(Report | Leaving): 2
In total: 1 + 1 + 4 + 2 + 2 + 2 = 12 (compared to 2^6 − 1 = 63 for the full JPD!)
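The count can be checked mechanically; a tiny Python sketch (the dict encoding is mine), counting one independent number per row of each CPT:

    # Parents of each node in the fire-diagnosis network
    parents = {"Tampering": [], "Fire": [], "Alarm": ["Tampering", "Fire"],
               "Smoke": ["Fire"], "Leaving": ["Alarm"], "Report": ["Leaving"]}

    # A Boolean node with k Boolean parents needs 2**k independent numbers
    print(sum(2 ** len(ps) for ps in parents.values()))   # 1+1+4+2+2+2 = 12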

62 Example for BN construction: Fire Diagnosis

63 Example for BN construction: Fire Diagnosis
P(Tampering=t) = 0.02, P(Tampering=f) = 0.98
We don't need to store P(Tampering=f), since probabilities sum to 1.

64 Example for BN construction: Fire Diagnosis
P(Tampering=t) = 0.02
P(Fire=t) = 0.01

65 Example for BN construction: Fire Diagnosis
P(Tampering=t) = 0.02, P(Fire=t) = 0.01
  Tampering T   Fire F   P(Alarm=t|T,F)   P(Alarm=f|T,F)
  t             t        0.5              0.5
  t             f        0.85             0.15
  f             t        0.99             0.01
  f             f        0.0001           0.9999
We don't need to store P(Alarm=f|T,F), since probabilities sum to 1. Each row of this table is a conditional probability distribution.

66 Example for BN construction: Fire Diagnosis
P(Tampering=t) = 0.02, P(Fire=t) = 0.01
  P(Alarm=t|T,F):  t,t: 0.5;  t,f: 0.85;  f,t: 0.99;  f,f: 0.0001
  P(Smoke=t|F):    t: 0.9;   f: 0.01
  P(Leaving=t|A):  t: 0.88;  f: 0.001
  P(Report=t|L):   t: 0.75;  f: 0.01
In total: 1 + 1 + 4 + 2 + 2 + 2 = 12 (compared to 2^6 − 1 = 63 for the full JPD!)

67 Compactness
A CPT for a Boolean variable Xi with k Boolean parents has …… rows for the combinations of parent values.
If each variable has no more than k parents, the complete network with n nodes requires us to specify …… numbers.
For k << n, this is a substantial improvement: the numbers required grow …… with n, vs. …… for the full joint distribution.
E.g., if we have a Bnet with 30 Boolean variables, each with 5 parents:
- need to specify …… probabilities
- but we need …… for the JPD

68 Compactness
A CPT for a Boolean variable Xi with k Boolean parents has 2^k rows for the combinations of parent values. Each row requires one number p for Xi = true (the number for Xi = false is just 1−p).
If each variable has no more than k parents, the complete network requires us to specify O(n · 2^k) numbers.
For k << n, this is a substantial improvement: the numbers required grow linearly with n, vs. O(2^n) for the full joint distribution.
E.g., if we have a Bnet with 30 Boolean variables, each with 5 parents:
- need to specify 30 · 2^5 = 960 probabilities
- but we need 2^30 − 1 (over a billion) for the JPD
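For the 30-variable example, the arithmetic directly (a two-line check):

    n, k = 30, 5
    print(n * 2**k)     # 960 numbers for the Bnet
    print(2**n - 1)     # 1,073,741,823 independent entries for the full JPD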

69 Realistic BNet: Liver Diagnosis Source: Onisko et al., 1999

70 Compactness
What happens if the network is fully connected, or k ≈ n? Not much saving compared to the numbers needed to specify the full JPD.
Bnets are useful in sparse (or locally structured) domains: domains in which each component interacts with (is related to) a small fraction of the other components.
What if this is not the case in a domain we need to reason about? We may need to make simplifying assumptions to reduce the dependencies in the domain.

71 "Where do the numbers (CPTs) come from?"
From experts:
- tedious, costly, not always reliable
From data => Machine Learning:
- there are algorithms to learn both structures and numbers (later in the course)
- can be hard to get enough data
Still, usually better than specifying the full JPD.

72 Example of probability computation
  P(Tampering=t) = 0.02, P(Fire=t) = 0.01
  P(Alarm=t|T,F):  t,t: 0.5;  t,f: 0.85;  f,t: 0.99;  f,f: 0.0001
  P(Smoke=t|F):    t: 0.9;   f: 0.01
  P(Leaving=t|A):  t: 0.88;  f: 0.001
  P(Report=t|L):   t: 0.75;  f: 0.01
How do we compute, for instance, P(Tampering=t, Fire=f, Alarm=t, Smoke=f, Leaving=t, Report=t)?

73 Example for BN construction: Fire Diagnosis
Using the CPTs above:
P(Tampering=t, Fire=f, Alarm=t, Smoke=f, Leaving=t, Report=t) =
= P(Tampering=t) × P(Fire=f) × P(Alarm=t | Tampering=t, Fire=f) × P(Smoke=f | Fire=f) × P(Leaving=t | Alarm=t) × P(Report=t | Leaving=t) =
= 0.02 × 0.99 × 0.85 × 0.99 × 0.88 × 0.75 ≈ 0.011
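The same computation in Python, with the CPT numbers from the slides (the dict encoding is my own illustrative choice):

    # P(var=t | parent values); the =f entries are 1 minus these
    p_tampering = 0.02
    p_fire = 0.01
    p_alarm = {(True, True): 0.5, (True, False): 0.85,
               (False, True): 0.99, (False, False): 0.0001}   # keyed by (T, F)
    p_smoke = {True: 0.9, False: 0.01}      # keyed by Fire
    p_leaving = {True: 0.88, False: 0.001}  # keyed by Alarm
    p_report = {True: 0.75, False: 0.01}    # keyed by Leaving

    # P(T=t, F=f, A=t, S=f, L=t, R=t) as a product of the local factors
    p = (p_tampering * (1 - p_fire) * p_alarm[(True, False)]
         * (1 - p_smoke[False]) * p_leaving[True] * p_report[True])
    print(round(p, 4))   # ~0.011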

74 What if we use a different ordering?
Say we use the following order: Leaving; Tampering; Report; Smoke; Alarm; Fire.
[Figure: the resulting non-causal network over Leaving, Tampering, Report, Smoke, Alarm, Fire]
We end up with a completely different network structure! Which of the two structures is better (think computationally)?

75 Which Structure is Better?
[Figure: the causal network and the non-causal network from the Leaving-first ordering, side by side]

76 Which Structure is Better?
The non-causal network is less compact: …… numbers needed.
Deciding on conditional independence is hard in non-causal directions (causal models and conditional independence seem hardwired for humans!).
Specifying the conditional probabilities may be harder than in the causal direction. For instance, we have lost the direct dependency between the alarm and one of its causes, which essentially describes the alarm's reliability (info often provided by the maker).

77 Which Structure is Better?
The non-causal network is less compact: 1 + 2 + 2 + 2 + 8 + 8 = 23 numbers needed.
Deciding on conditional independence is hard in non-causal directions (causal models and conditional independence seem hardwired for humans!).
Specifying the conditional probabilities may be harder than in the causal direction. For instance, we have lost the direct dependency between the alarm and one of its causes, which essentially describes the alarm's reliability (info often provided by the maker).

78 Example contd.
Other than that, our two Bnets for the Alarm problem are equivalent as long as they represent the same probability distribution:
P(T,F,A,S,L,R) = P(T) P(F) P(A|T,F) P(S|F) P(L|A) P(R|L) = P(L) P(T|L) P(R|L) P(S|L) P(A|S,L,T) P(F|S,A,T)
i.e., they are equivalent if the corresponding CPTs are specified so that they satisfy the equation above.

79 Are there wrong network structures?
Given an order of variables, a network with arcs in excess of those required by the direct dependencies implied by that order is still OK, just not as efficient. E.g.:
P(L) P(T|L) P(R|L) P(S|L,R,T) P(A|S,L,T) P(F|S,A,T) = P(L) P(T|L) P(R|L) P(S|L) P(A|S,L,T) P(F|S,A,T)
One extreme: the fully connected network, e.g.
P(L) P(T|L) P(R|L,T) P(S|L,T,R) P(A|S,L,T,R) P(F|S,A,T,L,R),
is always correct but rarely the best choice. Essentially we are not leveraging conditional independencies to simplify the factorization of the JPD.

80 Are there wrong network structures?
How can a network structure be wrong? If it misses directed edges that are required. E.g., if an edge is missing below, Fire is not conditionally independent of Alarm | {Tampering, Smoke}, so the structure is wrong.
But remember what we said a few slides back: sometimes we may need to make simplifying assumptions (e.g., assume conditional independence when it does not actually hold) in order to reduce complexity.

81 Deciding on Structure
In general, the direction of a direct dependency can always be changed using Bayes rule.
Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)
Useful for assessing diagnostic probability from causal probability (or vice versa):
P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)

82 Structure (contd.)
So the two simple Bnets below (Burglar → Alarm, and Alarm → Burglar) are equivalent as long as the CPTs are related via Bayes rule:
P(A|B) = P(B|A) P(A) / P(B)
Which structure to choose depends, among other things, on which CPT is easier to specify.

83 Structure (contd.)
CPTs for causal relationships represent knowledge of the mechanisms underlying the process of interest, e.g., how an alarm works, or why a disease generates certain symptoms. CPTs for diagnostic relations can be defined only based on past observations.
E.g., let m be meningitis and s be stiff neck: P(m|s) = P(s|m) P(m) / P(s)
P(s|m) can be defined based on medical knowledge of the workings of meningitis. P(m|s) requires statistics on how often the symptom of stiff neck appears in conjunction with meningitis.
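A hedged numerical illustration of the diagnostic direction (the probabilities below are invented for the example; the slides give none):

    def bayes(p_e_given_h, p_h, p_e):
        """Bayes' rule: P(h|e) = P(e|h) P(h) / P(e)."""
        return p_e_given_h * p_h / p_e

    # Assumed values: P(s|m) = 0.7 (from medical knowledge),
    # P(m) = 1/50000 and P(s) = 0.01 (from population statistics)
    print(bayes(0.7, 1 / 50000, 0.01))   # P(m|s) ≈ 0.0014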

84 In Summary
A belief network is such that the JPD of the variables involved is defined as the product of the local conditional distributions:
P(X1, …, Xn) = ∏_i P(Xi | X1, …, Xi-1) = ∏_i P(Xi | Pa(Xi))
Once we know the JPD, we can answer any query about any subset of the variables. Thus, a belief network allows one to answer any such query.

85 Bayesian Networks: Types of Inference
In the Tampering/Fire/Alarm/Leaving network:
- Diagnostic (from effects to causes): people are leaving (L=t); P(F|L=t) = ?
- Predictive (from causes to effects): fire happens (F=t); P(L|F=t) = ?
- Intercausal (between causes of a common effect): the alarm goes off (P(a) = 1.0) and there is tampering (T=t); P(F|A=t,T=t) = ?
- Mixed: there is no fire (F=f) and people are leaving (L=t); P(A|F=f,L=t) = ?
We will use the same reasoning procedure for all of these types.

86 Dependencies in a Bayesian Network
Grey areas in the (omitted) picture represent evidence. By construction, a node X is conditionally independent of its non-descendant nodes (e.g., Zij in the picture) given its parents: the grey area "blocks" probability propagation.

87 A node X is conditionally independent of all other nodes in the network given its Markov blanket (its parents, children, and children's parents; the grey area in the picture), which "blocks" probability propagation.
Note that node X is conditionally dependent on non-descendant nodes in its Markov blanket (e.g., its children's parents, like Z1j) given their common descendants (e.g., Y1j). This allows, for instance, explaining away one cause (e.g., X) because of evidence on its effect (e.g., Y1) and on another potential cause (e.g., Z1j).

88 Dependencies: another way to see them
Three configurations, with evidence E elsewhere: 1 is a chain X → Z → Y, 2 is a common cause X ← Z → Y, 3 is a common effect X → Z ← Y. In 1, 2 and 3, X and Y are dependent. In 3, X and Y become dependent as soon as there is evidence on Z or on any of its descendants. This is because knowledge of one possible cause, given evidence of the effect, explains away the other cause.

89 Or Conditional Independencies
Or, blocking paths for probability propagation. Three ways in which a path between Y and X (or vice versa) can be blocked, given evidence E:
1: chain X → Z → Y, with evidence on Z
2: common cause X ← Z → Y, with evidence on Z
3: common effect X → Z ← Y, with no evidence on Z or any of its descendants

90 Conditional Independence in a Bnet
[Quiz network figure over nodes A through I, with evidence nodes shaded]
Is H conditionally independent of E given I?

91 Conditional Independence in a Bnet
[Same quiz network]
Is H conditionally independent of E given I? True: since we have no evidence on F, changes in the belief of G (caused by changes in the belief of H) do not affect the belief on E.

92 (In)Dependencies in a Bnet
[Same quiz network, now with evidence on F]
Is A conditionally independent of I given F?

93 (In)Dependencies in a Bnet
Is A conditionally independent of I given F? False: since we have evidence on F, changes to the belief on G due to changes in I affect the belief on E, which in turn propagates to A via C and B. What if node C is not there?

94 (In)Dependencies in a Bnet
[Same quiz network, without node C]
Is A conditionally independent of I given F?

95 (In)Dependencies in a Bnet
Is A conditionally independent of I given F? True: the lack of evidence on D blocks the path that goes through E, D, B.

96 What does this mean in terms of choosing structure?
You need to double-check the appropriateness of the indirect dependencies/independencies generated by your chosen structures.
Example: representing a domain for an intelligent system that acts as a tutor (aka an Intelligent Tutoring System):
- topics are divided into sub-topics
- student knowledge of a topic depends on student knowledge of its sub-topics
- we can never observe student knowledge directly; we can only observe it indirectly via student test answers

97 Two Ways of Representing Knowledge
Structure 1 (top-down): Overall Proficiency → Topic 1 → Sub-topics 1.1 and 1.2 → Answers 1-4 for each sub-topic.
Structure 2 (bottom-up): the same nodes with all arcs reversed, from the Answers up to Overall Proficiency.
Which one should I pick?

98 Two Ways of Representing Knowledge
With the top-down structure, a change in probability for a given node always propagates to its siblings, because we never get direct evidence on knowledge. With the bottom-up structure, a change in probability for a given node does not propagate to its siblings, for the same reason. Which one you want to choose depends on the domain you want to represent.

99 Bayesian Networks - AISpace
Try the queries in the previous slides with the Belief Networks applet in AISpace:
- Load the Fire Diagnosis example (from the textbook)
- Compare the probability of each query node before and after the new evidence is observed; try first to predict how they should change using your understanding of the domain
Make sure you explore and understand the various other example networks in the textbook and AISpace:
- Electrical Circuit example (textbook ex 6.11)
- Patient's wheezing and coughing example (ex. 6.14); not in the applet, but you can easily insert it
- Several other examples in AISpace

100 Test your understanding of dependencies in a Bnet
Use the AISpace applet for Belief and Decision networks:
- Load the "conditional independence quiz" network
- Go into "Solve" mode and select "Independence Quiz"

101 Learning Goals For Today's Class
Given a JPD:
- marginalize over specific variables
- compute distributions over any subset of the variables
- use inference by enumeration to compute joint posterior probability distributions over any subset of variables given evidence
Define and use marginal and conditional independence.
Build a Bayesian network for a given domain.
Compute the representational savings in terms of the number of probabilities required.
Identify dependencies/independencies between nodes in a Bayesian network.

