Bayes Nets and Probabilities


Bayes Nets and Probabilities Oliver Schulte Machine Learning 726

Bayes Nets: General Points Represent domain knowledge. Allow for uncertainty. Provide a complete representation of probabilistic knowledge. Represent causal relations. Give fast answers to certain types of queries. Probabilistic: What is the probability that a patient has strep throat given that they have a fever? Relevance: Is fever relevant to having strep throat?

Bayes Net Links Judea Pearl's Turing Award See UBC’s AISpace

Probability Reasoning (With Bayes Nets)

Random Variables A random variable has a probability associated with each of its values. A basic statement assigns a value to a random variable.
Variable Value Probability
Weather Sunny 0.7
Weather Rainy 0.2
Weather Cloudy 0.08
Weather Snow 0.02
Cavity True 0.2
Cavity False 0.8
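A minimal sketch of how such a table could be represented in code (the variable and value names are just the ones from the table above; this is illustrative, not part of the original slides):

```python
# Each random variable's distribution as a dictionary from value to probability.
weather = {"Sunny": 0.7, "Rainy": 0.2, "Cloudy": 0.08, "Snow": 0.02}
cavity = {True: 0.2, False: 0.8}

# Sanity check: the probabilities over a variable's values must sum to 1.
for name, dist in [("Weather", weather), ("Cavity", cavity)]:
    assert abs(sum(dist.values()) - 1.0) < 1e-9, name
```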

Probability for Sentences A sentence or query is formed by using "and", "or", "not" recursively with basic statements. Sentences also have probabilities assigned to them.
Sentence Probability
P(Cavity = false AND Toothache = false) 0.72
P(Cavity = true OR Toothache = false) 0.92
Exercise: Prove that if A entails B, then P(A) ≤ P(B).

Probability Notation Often probability theorists write A, B instead of A ∧ B (like Prolog). If the intended random variables are known, they are often not mentioned.
Shorthand and full notation:
P(Cavity = false, Toothache = false) stands for P(Cavity = false ∧ Toothache = false).
P(false, false) stands for P(Cavity = false ∧ Toothache = false).

Axioms of probability 0 ≤ P(A) ≤ 1. P(true) = 1 and P(false) = 0. P(A ∨ B) = P(A) + P(B) − P(A ∧ B). P(A) = P(B) if A and B are logically equivalent. Logical equivalence connects probability and logic. "True" is a constant sentence that is true in all possible worlds. Formulas can be considered as sets of complete assignments (possible worlds).
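As a worked instance of the third axiom (inclusion-exclusion), using the Cavity/Toothache numbers that appear later in these slides:

\[
P(\text{Cavity} \lor \text{Toothache}) = P(\text{Cavity}) + P(\text{Toothache}) - P(\text{Cavity} \land \text{Toothache}) = 0.2 + 0.2 - 0.12 = 0.28.
\]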

Rule 1: Logical Equivalence. Spot the pattern:
P(NOT (NOT Cavity)) P(Cavity) 0.2
P(NOT (Cavity AND Toothache)) P(Cavity = F OR Toothache = F) 0.88
P(NOT (Cavity OR Toothache)) P(Cavity = F AND Toothache = F) 0.72

The Logical Equivalence Pattern
P(NOT (NOT Cavity)) = P(Cavity) = 0.2
P(NOT (Cavity AND Toothache)) = P(Cavity = F OR Toothache = F) = 0.88
P(NOT (Cavity OR Toothache)) = P(Cavity = F AND Toothache = F) = 0.72
Rule 1: Logically equivalent expressions have the same probability. This shows how logical reasoning is an important part of probabilistic reasoning: it is often easier to determine a probability after transforming the expression into a logically equivalent form.

Rule 2: Marginalization. Spot the pattern:
P(Cavity, Toothache), P(Cavity, Toothache = F), P(Cavity): 0.12, 0.08, 0.2
P(Cavity = F, Toothache), P(Cavity = F, Toothache = F), P(Cavity = F): 0.08, 0.72, 0.8
P(Cavity, Toothache), P(Cavity = F, Toothache), P(Toothache): 0.12, 0.08, 0.2
Assignment: fill in the marginal over 2 variables.

The Marginalization Pattern
P(Cavity, Toothache) + P(Cavity, Toothache = F) = P(Cavity): 0.12 + 0.08 = 0.2
P(Cavity = F, Toothache) + P(Cavity = F, Toothache = F) = P(Cavity = F): 0.08 + 0.72 = 0.8
P(Cavity, Toothache) + P(Cavity = F, Toothache) = P(Toothache): 0.12 + 0.08 = 0.2
Assignment: fill in the marginal over 2 variables.
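The general rule that the pattern illustrates, written out for a variable B that is "summed out" over all of its values b:

\[
P(A) \;=\; \sum_{b} P(A, B = b).
\]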

Prove the Pattern: Marginalization Theorem. P(A) = P(A, B) + P(A, not B). Proof.
1. A is logically equivalent to (A and B) or (A and not B).
2. P(A) = P((A and B) or (A and not B)) = P(A and B) + P(A and not B) − P((A and B) and (A and not B)), by the disjunction rule.
3. (A and B) and (A and not B) is logically equivalent to false, so P((A and B) and (A and not B)) = 0.
4. So step 2 implies P(A) = P(A and B) + P(A and not B).

Completeness of Bayes Nets A probabilistic query system is complete if it can compute a probability for every sentence. Proposition: A Bayes net is complete. Proof has two steps. Any system that encodes the joint distribution is complete. A Bayes net encodes the joint distribution.

The Joint Distribution

Assigning Probabilities to Sentences A complete assignment is a conjunctive sentence that assigns a value to each random variable. The joint probability distribution specifies a probability for each complete assignment. A joint distribution determines a probability for every sentence. How? Spot the pattern. Examples: P(toothache), P(not cavity), P(toothache or cavity).

Probabilities for Sentences: Spot the Pattern
Sentence Probability
P(Cavity = false AND Toothache = false) 0.72
P(Cavity = true OR Toothache = false) 0.92
P(Toothache = false) 0.8

Inference by enumeration

Inference by enumeration Marginalization: For any sentence A, sum the joint probabilities for the complete assignments where A is true. P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2.
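A small sketch of inference by enumeration in code. The four "toothache" entries of the joint are the ones summed on this slide; the remaining four are the standard textbook values, chosen here so that they agree with the marginals used earlier (e.g. P(Cavity) = 0.2 and P(Cavity = F, Toothache = F) = 0.72):

```python
# Joint distribution over (Cavity, Toothache, Catch), one probability
# per complete assignment.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(sentence):
    """Inference by enumeration: sum the joint probabilities of all
    complete assignments (cavity, toothache, catch) where the sentence
    is true."""
    return sum(p for world, p in joint.items() if sentence(*world))

print(prob(lambda cavity, toothache, catch: toothache))                # ≈ 0.2
print(prob(lambda cavity, toothache, catch: cavity or not toothache))  # ≈ 0.92
```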

Completeness Proof for Joint Distribution Theorem [from propositional logic] Every sentence is logically equivalent to a disjunction of the form A1 or A2 or ... or Ak where the Ai are complete assignments. All of the Ai are mutually exclusive (joint probability 0). Why? So if S is equivalent to A1 or A2 or ... or Ak, then P(S) = Σi P(Ai) where each Ai is given by the joint distribution.

Bayes Nets and The Joint Distribution

Example: Complete Bayesian Network. Example Horn clause: Alarm -> JohnCalls, p: 0.9.

The Story You have a new burglar alarm installed at home. It is reliable at detecting burglary, but it also responds to earthquakes. You have two neighbors who promise to call you at work when they hear the alarm. John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm. Mary listens to loud music and sometimes misses the alarm.

Computing The Joint Distribution A Bayes net provides a compact factored representation of a joint distribution. In words, the joint probability is computed as follows. For each node Xi: Find the assigned value xi. Find the values y1,..,yk assigned to the parents of Xi. Look up the conditional probability P(xi|y1,..,yk) in the Bayes net. Multiply together these conditional probabilities.
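In symbols, the product formula that this procedure computes is

\[
P(x_1, \ldots, x_n) \;=\; \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{parents}(X_i)\bigr),
\]

where \(x_i\) is the value assigned to node \(X_i\) and \(\mathrm{parents}(X_i)\) stands for the values assigned to its parents.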

Product Formula Example: Burglary. Query: What is the joint probability that all variables are true? P(M, J, A, E, B) = P(M|A) P(J|A) P(A|E,B) P(E) P(B) = 0.7 × 0.9 × 0.95 × 0.002 × 0.001.
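The same computation as a few lines of code, hard-coding only the conditional probability table entries needed for this all-true query (the numbers are the ones in the product above):

```python
# Burglary network parameters needed for the query (all variables true).
P_B = 0.001                           # P(Burglary = T)
P_E = 0.002                           # P(Earthquake = T)
P_A_given_BE = {(True, True): 0.95}   # P(Alarm = T | Burglary = T, Earthquake = T)
P_J_given_A = {True: 0.90}            # P(JohnCalls = T | Alarm = T)
P_M_given_A = {True: 0.70}            # P(MaryCalls = T | Alarm = T)

# Product formula: one conditional probability per node, given its parents.
p = (P_M_given_A[True] * P_J_given_A[True] *
     P_A_given_BE[(True, True)] * P_E * P_B)
print(p)  # 0.7 * 0.9 * 0.95 * 0.002 * 0.001 ≈ 1.197e-06
```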

Compactness of Bayesian Networks Consider n binary variables. An unconstrained joint distribution requires O(2^n) probabilities. If we have a Bayesian network with a maximum of k parents for any node, then we need only O(n · 2^k) probabilities. Example: a full unconstrained joint distribution with n = 30 needs 2^30 probabilities; a Bayesian network with n = 30 and k = 4 needs 30 × 2^4 = 480 probabilities. Recall the toothache example, where we directly represented the joint distribution.

Summary: Why are Bayes nets useful?
- Graph structure supports:
  - Modular representation of knowledge
  - Local, distributed algorithms for inference and learning
  - Intuitive (possibly causal) interpretation
- The factored representation may have exponentially fewer parameters than the full joint P(X1, …, Xn), which means:
  - lower sample complexity (less data for learning)
  - lower time complexity (less time for inference)

Is it Magic? How can the Bayes net reduce parameters? By exploiting conditional independencies. Why does the product formula work? The Bayes net topological or graphical semantics. The graph by itself entails conditional independencies. The Chain Rule.

Conditional Probabilities and Independence

Conditional Probabilities: Intro Given (A) that a die comes up with an odd number, what is the probability that (B) the number is a 2? A 3? Answer: the number of cases that satisfy both A and B, out of the number of cases that satisfy A. Examples: #faces with (odd and 2) / #faces with odd = 0/3 = 0. #faces with (odd and 3) / #faces with odd = 1/3.

Conditional Probs ctd. Suppose that 50 students are taking 310 and 30 of them are women. Given (A) that a student is taking 310, what is the probability that (B) they are a woman? Answer: #students who take 310 and are women / #students in 310 = 30/50 = 3/5. Notation: P(B|A), the probability of B given A.

Conditional Ratios: Spot the Pattern
P(die comes up with odd number), P(die comes up with 3), P(3 | odd number): 1/2, 1/6, 1/3
P(Student takes 310), P(Student takes 310 and is woman), P(Student is woman | Student takes 310): 50/15,000, 30/15,000, 3/5

Conditional Probs: The Ratio Pattern. Spot the pattern:
P(die comes up with 3) / P(die comes up with odd number) = P(3 | odd number): (1/6) / (1/2) = 1/3
P(Student takes 310 and is woman) / P(Student takes 310) = P(Student is woman | Student takes 310): (30/15,000) / (50/15,000) = 3/5
P(A|B) = P(A and B) / P(B). Important!
Exercise: prove that conditioning leads to a well-defined normalized probability measure.

Conditional Probabilities: Motivation Much knowledge can be represented as implications B1,..,Bk =>A. Conditional probabilities are a probabilistic version of reasoning about what follows from conditions. Cognitive Science: Our minds store implicational knowledge.

The Product Rule: Spot the Pattern
P(Cavity), P(Toothache|Cavity), P(Cavity, Toothache): 0.2, 0.6, 0.12
P(Toothache), P(Cavity|Toothache), P(Cavity, Toothache): 0.2, 0.6, 0.12
P(Cavity = F), P(Toothache|Cavity = F), P(Toothache, Cavity = F): 0.8, 0.1, 0.08

The Product Rule Pattern
P(Cavity) × P(Toothache|Cavity) = P(Cavity, Toothache): 0.2 × 0.6 = 0.12
P(Toothache) × P(Cavity|Toothache) = P(Cavity, Toothache): 0.2 × 0.6 = 0.12
P(Cavity = F) × P(Toothache|Cavity = F) = P(Toothache, Cavity = F): 0.8 × 0.1 = 0.08
Moral: joint probabilities can be inferred from conditional probabilities.

Independence A and B are independent iff P(A|B) = P(A), or P(B|A) = P(B), or P(A, B) = P(A) P(B). Suppose that Weather is independent of the Cavity scenario. Then the joint distribution decomposes: P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather). Absolute independence is powerful but rare. Dentistry is a large field with hundreds of variables, none of which are independent. What to do?

Exercise Prove that the three definitions of independence are equivalent (assuming all positive probabilities). A and B are independent iff P(A|B) = P(A) or P(B|A) = P(B) or P(A, B) = P(A) P(B)

Conditional independence If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache: (1) P(catch | toothache, cavity) = P(catch | cavity). The same independence holds if I haven't got a cavity: (2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity). Catch is conditionally independent of Toothache given Cavity: P(Catch | Toothache, Cavity) = P(Catch | Cavity). The equivalences for independence also hold for conditional independence, e.g.: P(Toothache | Catch, Cavity) = P(Toothache | Cavity); P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity). Conditional independence is our most basic and robust form of knowledge about uncertain environments.
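A quick numerical check of this conditional independence, reusing the same eight joint entries assumed in the enumeration sketch above:

```python
# Same joint over (Cavity, Toothache, Catch) as in the enumeration sketch.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(sentence):
    return sum(p for world, p in joint.items() if sentence(*world))

def cond(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(lambda *w: event(*w) and given(*w)) / prob(given)

# P(catch | toothache, cavity) and P(catch | cavity) should coincide.
print(cond(lambda cav, tooth, catch: catch, lambda cav, tooth, catch: tooth and cav))  # ≈ 0.9
print(cond(lambda cav, tooth, catch: catch, lambda cav, tooth, catch: cav))            # ≈ 0.9
```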

Bayes Nets Graphical Semantics

Common Causes: Spot the Pattern. Cavity is a common cause of Catch and Toothache: Catch is independent of Toothache given Cavity. In UBC's "Simple Diagnostic Example": Fever is independent of Coughing given Influenza; Coughing is independent of Fever given Bronchitis.

Burglary Example JohnCalls, MaryCalls are conditionally independent given Alarm.

Spot the Pattern: Chain Scenario. MaryCalls is independent of Burglary given Alarm. JohnCalls is independent of Earthquake given Alarm. This is typical for sequential data. In UBC's "Simple Diagnostic Example": Wheezing is independent of Influenza given Bronchitis; Coughing is independent of Smokes given Bronchitis; Influenza is independent of Smokes. [check assignment]

The Markov Condition A Bayes net is constructed so that each variable is conditionally independent of its nondescendants given its parents. The graph alone (without specified probabilities) entails conditional independencies. Causal interpretation: each parent is a direct cause. The Markov condition can always be achieved by letting a node have enough parents; see the text for details on how to construct a Bayesian network. The graph also helps with the problem of retrieving relevant information: from the graph alone we can tell that certain information is ignorable.

Derivation of the Product Formula

The Chain Rule We can always write P(a, b, c, …, z) = P(a | b, c, …, z) P(b, c, …, z) (Product Rule). Repeatedly applying this idea, we obtain P(a, b, c, …, z) = P(a | b, c, …, z) P(b | c, …, z) P(c | …, z) … P(z). Order the variables such that children come before parents. Then, given its parents, each node is independent of the other variables it is conditioned on, by the topological independence. So P(a, b, c, …, z) = Π_x P(x | parents(X)).
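The same derivation written compactly in LaTeX, under an ordering \(x_1, \ldots, x_n\) in which children come before parents (so every variable is conditioned only on nondescendants, which include its parents):

\[
P(x_1, \ldots, x_n) \;=\; \prod_{i=1}^{n} P(x_i \mid x_{i+1}, \ldots, x_n) \;=\; \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{parents}(X_i)\bigr),
\]

where the second equality uses the topological (Markov) independence of each node from its other nondescendants given its parents.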

Example in Burglary Network
P(M, J, A, E, B)
= P(M | J, A, E, B) P(J, A, E, B)
= P(M | A) P(J, A, E, B)
= P(M | A) P(J | A, E, B) P(A, E, B)
= P(M | A) P(J | A) P(A, E, B)
= P(M | A) P(J | A) P(A | E, B) P(E, B)
= P(M | A) P(J | A) P(A | E, B) P(E) P(B)
(On the slide, colours mark the steps that apply the Bayes net topological independence.)

Explaining Away

Common Effects: Spot the Pattern. Influenza and Smokes are independent; given Bronchitis, they become dependent (Influenza → Bronchitis ← Smokes). Battery Age and Charging System OK are independent; given Battery Voltage, they become dependent (Battery Age → Battery Voltage ← Charging System OK). From UBC's "Simple diagnostic problem". Does this make sense?

Conditioning on Children. Independent causes: A and B are independent. "Explaining away" effect: given C, observing A makes B less likely, e.g. Bronchitis in UBC's "Simple Diagnostic Problem". A and B are (marginally) independent but become dependent once C is known (structure A → C ← B). Another example: a Wumpus on one square explains a stench. This explaining-away pattern characterizes Bayes nets.

D-separation A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked if it contains a node such that either
(a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
(b) the arrows meet head-to-head at the node, and neither the node nor any of its descendants is in the set C.
If all paths from A to B are blocked, A is said to be d-separated from B by C. If A is d-separated from B by C, then the joint distribution over all variables in the graph satisfies A ⊥ B | C, i.e. A and B are conditionally independent given C.
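A minimal sketch of the path-blocking test in this definition, assuming the graph is given as a dictionary mapping each node to the set of its parents (the function names and the small example at the end are illustrative only):

```python
def descendants(node, parents):
    """All descendants of `node`, found by following child edges."""
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(child)
    found, frontier = set(), [node]
    while frontier:
        for ch in children.get(frontier.pop(), set()):
            if ch not in found:
                found.add(ch)
                frontier.append(ch)
    return found

def path_blocked(path, parents, C):
    """Is this path (a list of nodes, edges in either direction) blocked by C?"""
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        head_to_head = (prev in parents.get(node, set()) and
                        nxt in parents.get(node, set()))
        if head_to_head:
            # Collider: blocks unless the node or one of its descendants is in C.
            if node not in C and not (descendants(node, parents) & C):
                return True
        elif node in C:
            # Chain (head-to-tail) or fork (tail-to-tail): blocks if the node is in C.
            return True
    return False

# Burglary -> Alarm <- Earthquake, Alarm -> JohnCalls.
parents = {"Alarm": {"Burglary", "Earthquake"}, "JohnCalls": {"Alarm"}}
path = ["Burglary", "Alarm", "Earthquake"]
print(path_blocked(path, parents, C=set()))          # True: collider blocks with nothing observed
print(path_blocked(path, parents, C={"JohnCalls"}))  # False: observing a descendant unblocks it
```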

D-separation: Example

Mathematical Analysis Theorem: If A and B have no common ancestors and neither is a descendant of the other, then they are independent of each other. Proof for our example (A → C ← B):
P(a, b) = Σ_c P(a, b, c) = Σ_c P(a) P(b) P(c | a, b) = P(a) P(b) Σ_c P(c | a, b) = P(a) P(b).
This is not quite a proof, because A and B may have parents, but it illustrates the general idea. The first step follows from marginalization; the second from the product formula; the third because P(a) and P(b) do not depend on c; the last because P(c | a, b) sums to 1 over the possible values of c.

Bayes’ Theorem

Abductive Reasoning Implications are often causal, from cause to effect: Burglary → Alarm, Cavity → Toothache. Many important queries are diagnostic, from effect to cause: from Alarm infer Burglary, from Toothache infer Cavity. (On the slide, black arrows mark the causal direction and red arrows the direction of inference.) Reasoning against the causal direction is not impossible in logic: see nonmonotonic reasoning. Here "the wumpus explains the stench" means that the wumpus and the stench are on adjacent squares.

Bayes’ Theorem: Another Example A doctor knows the following. The disease meningitis causes the patient to have a stiff neck 50% of the time. The prior probability that someone has meningitis is 1/50,000. The prior probability that someone has a stiff neck is 1/20. Question: knowing that a person has a stiff neck, what is the probability that they have meningitis?

Spot the Pattern: Diagnosis
P(Cavity), P(Toothache|Cavity), P(Toothache), P(Cavity|Toothache): 0.2, 0.6, 0.2, 0.6
P(Wumpus), P(Stench|Wumpus), P(Stench), P(Wumpus|Stench): 0.2, 0.6
P(Meningitis), P(Stiff Neck|Meningitis), P(Stiff Neck), P(Meningitis|Stiff Neck): 1/50,000, 1/2, 1/20, 1/5,000
How does the last number in each row depend on the first three?

Spot the Pattern: Diagnosis
P(Cavity) × P(Toothache|Cavity) / P(Toothache) = P(Cavity|Toothache): 0.2 × 0.6 / 0.2 = 0.6
P(Wumpus) × P(Stench|Wumpus) / P(Stench) = P(Wumpus|Stench): 0.2, 0.6
P(Meningitis) × P(Stiff Neck|Meningitis) / P(Stiff Neck) = P(Meningitis|Stiff Neck): (1/50,000 × 1/2) / (1/20) = 1/5,000
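The meningitis row as a couple of lines of code (numbers taken from the doctor example above):

```python
from fractions import Fraction

p_meningitis = Fraction(1, 50000)   # prior P(Meningitis)
p_stiff_given_m = Fraction(1, 2)    # likelihood P(StiffNeck | Meningitis)
p_stiff = Fraction(1, 20)           # evidence P(StiffNeck)

# Bayes' theorem: P(M | S) = P(S | M) * P(M) / P(S)
print(p_stiff_given_m * p_meningitis / p_stiff)  # 1/5000
```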

Explain the Pattern: Bayes’ Theorem Exercise: Prove Bayes’ Theorem P(A | B) = P(B | A) P(A) / P(B).

On Bayes’ Theorem P(a | b) = P(b | a) P(a) / P(b). P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect). The better the cause explains the effect, the more likely it is. The more plausible the cause is, the more likely it is. The more surprising the evidence (the lower its prior probability), the greater its impact. Likelihood: how well does the cause explain the effect? Prior: how plausible is the explanation before any evidence? Evidence Term/Normalization Constant: how surprising is the evidence?