Cooperating Intelligent Systems Bayesian networks Chapter 14, AIMA

Inference. In the statistical setting, inference means computing probabilities for different outcomes given the available information. We need an efficient method for doing this, one that is more powerful than the naïve Bayes model.

Bayesian networks. A Bayesian network is a directed graph in which each node is annotated with quantitative probability information:
1. A set of random variables, {X1, X2, X3, ...}, makes up the nodes of the network.
2. A set of directed links connects pairs of nodes, parent → child.
3. Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)).
4. The graph is a directed acyclic graph (DAG).

The dentist network. [Figure: nodes Weather, Cavity, Toothache, Catch.]

The alarm network. [Figure: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls.] The burglar alarm responds to both earthquakes and burglars. Two neighbors, John and Mary, have promised to call you when the alarm goes off. John always calls when there is an alarm, and sometimes when there is not an alarm. Mary sometimes misses the alarms (she likes loud music).

The cancer network (from Breese and Koller, 1997). [Figure: Age → Toxics; Age, Gender → Smoking; Toxics, Smoking → Cancer; Cancer → Serum Calcium; Cancer, Genetic Damage → Lung Tumour.]

The cancer network (from Breese and Koller, 1997):
P(A,G) = P(A)P(G)
P(C | S,T,A,G) = P(C | S,T)
P(SC,C,LT,GD) = P(SC | C) P(LT | C,GD) P(C) P(GD)
P(A,G,T,S,C,SC,LT,GD) = P(A) P(G) P(T | A) P(S | A,G) P(C | T,S) P(GD) P(SC | C) P(LT | C,GD)

The product (chain) rule (This is for Bayesian networks, the general case comes later in this lecture)
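In symbols, the Bayesian-network form of the rule (the standard factorization) is

$$P(X_1,\ldots,X_n) \;=\; \prod_{i=1}^{n} P\big(X_i \mid \mathrm{Parents}(X_i)\big).$$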

Bayes network node is a function. [Figure: node C with Boolean parents A and B; a table with rows Min and Max gives a distribution over C for each parent combination a∧b, a∧¬b, ¬a∧b, ¬a∧¬b, e.g. P(C | ¬a, b) = U[0.7, 1.9].]

Bayes network node is a function. A BN node is a conditional distribution function: inputs = parent values, output = a distribution over the node's values. Any type of function from values to distributions.

Example: The alarm network. [Figure: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls.]
P(B=b) = 0.001   P(E=e) = 0.002
P(A=a | B,E):  b,e: 0.95;  b,¬e: 0.94;  ¬b,e: 0.29;  ¬b,¬e: 0.001
P(J=j | A):  a: 0.90;  ¬a: 0.05
P(M=m | A):  a: 0.70;  ¬a: 0.01
Note: each number in the tables represents a Boolean distribution, hence there is a distribution output for every input.

Example: The alarm network (same CPTs as above). Probability for "no earthquake, no burglary, but alarm, and both Mary and John make the call":
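A minimal numerical check (Python), using the CPT values above; the 0.001, 0.002 and 0.001 entries are the standard AIMA values:

```python
# Joint probability of (¬b, ¬e, a, j, m) via the Bayesian-network chain rule
p = 0.999 * 0.998      # P(¬b) * P(¬e)
p *= 0.001             # P(a | ¬b, ¬e)
p *= 0.90 * 0.70       # P(j | a) * P(m | a)
print(p)               # ≈ 0.00063
```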

Meaning of Bayesian network. The general chain rule (always true) and the Bayesian network chain rule are given below. The BN is a correct representation of the domain iff each node is conditionally independent of its predecessors, given its parents.
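Spelled out (standard forms):

$$\text{general chain rule:}\quad P(X_1,\ldots,X_n) \;=\; \prod_{i=1}^{n} P\big(X_i \mid X_{i-1},\ldots,X_1\big)$$

$$\text{BN chain rule:}\quad P(X_1,\ldots,X_n) \;=\; \prod_{i=1}^{n} P\big(X_i \mid \mathrm{Parents}(X_i)\big)$$

The two coincide exactly when P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi)) for every i, i.e. when each node is conditionally independent of its other predecessors given its parents.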

The alarm network. The fully correct alarm network might look something like the figure. The Bayesian network (red) assumes that some of the variables are independent (or that the dependencies can be neglected since they are very weak). The correctness of the Bayesian network of course depends on the validity of these assumptions. It is this sparse connection structure that makes the BN approach feasible (~linear growth in complexity rather than exponential).

How to construct a BN? Add nodes in causal order ("causal" determined from expertise). Determine conditional independence using either (or all) of the following semantics:
– Blocking/d-separation rule
– Non-descendant rule
– Markov blanket rule
– Experience/your beliefs

Path blocking & d-separation. Intuitively, knowledge about Serum Calcium influences our belief about Cancer (if we don't know the value of Cancer), which in turn influences our belief about Lung Tumour, etc. However, if we are given the value of Cancer (i.e. C = true or false), then knowledge of Serum Calcium will not tell us anything about Lung Tumour that we don't already know. We say that Cancer d-separates (direction-dependent separation) Serum Calcium and Lung Tumour.

Some definitions of BN (from Wikipedia). In the following, X = {X1, X2, ..., XN} is a set of random variables and G = (V,E) is a directed acyclic graph (DAG) of vertices (V) and edges (E).
1. X is a Bayesian network with respect to G if its joint probability density function can be written as a product of the individual density functions, conditional on their parent variables: P(X1, ..., XN) = ∏i P(Xi | parents(Xi)).

Some definitions of BN (from Wikipedia).
2. X is a Bayesian network with respect to G if it satisfies the local Markov property: each variable is conditionally independent of its non-descendants given its parent variables, i.e. Xi ⊥ non-descendants(Xi) | parents(Xi).

Non-descendants. A node is conditionally independent of its non-descendants, given its parents.

Some definitions of BN (from Wikipedia).
3. X is a Bayesian network with respect to G if every node is conditionally independent of all other nodes in the network, given its Markov blanket. The Markov blanket of a node is its parents, children and children's parents.

Markov blanket. A node is conditionally independent of all other nodes in the network, given its parents, children, and children's parents. These constitute the node's Markov blanket.

Path blocking & d-separation. Two nodes Xi and Xj are conditionally independent given a set Z = {X1, X2, X3, ...} of nodes if for every undirected path in the BN between Xi and Xj there is some node Xk on the path having one of the following three properties:
1. Xk ∈ Z, and both arcs on the path lead out of Xk.
2. Xk ∈ Z, and one arc on the path leads into Xk and one arc leads out.
3. Neither Xk nor any descendant of Xk is in Z, and both arcs on the path lead into Xk.
Such an Xk blocks the path between Xi and Xj. Xi and Xj are d-separated if all paths between them are blocked.
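A minimal sketch (Python) of checking these three conditions along a single path; the dict-of-parents representation and helper names are assumptions for illustration:

```python
def path_is_blocked(path, parents, Z, descendants):
    """path: consecutive nodes X_i ... X_j along one undirected path.
    parents: dict node -> set of its parents.
    Z: set of conditioning (evidence) nodes.
    descendants: dict node -> set of all its descendants (precomputed)."""
    for k in range(1, len(path) - 1):
        prev, node, nxt = path[k - 1], path[k], path[k + 1]
        arc_in_from_prev = prev in parents.get(node, set())   # arc prev -> node
        arc_in_from_next = nxt in parents.get(node, set())    # arc nxt  -> node
        if node in Z:
            # 1. both arcs lead out of node (node is a parent of both neighbours)
            if not arc_in_from_prev and not arc_in_from_next:
                return True
            # 2. one arc leads into node and the other leads out
            if arc_in_from_prev != arc_in_from_next:
                return True
        else:
            # 3. both arcs lead into node, and neither node nor any of its
            #    descendants is in Z
            if arc_in_from_prev and arc_in_from_next:
                if not (({node} | descendants.get(node, set())) & Z):
                    return True
    return False   # no blocking node found: this path is unblocked

# X_i and X_j are d-separated by Z iff path_is_blocked(...) is True for every path.
```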

Some definitions of BN (from Wikipedia).
4. X is a Bayesian network with respect to G if, for any two nodes i, j: Xi ⊥ Xj | d-separating set(i, j). The d-separating set(i, j) is the set of nodes that d-separate nodes i and j. The Markov blanket of node i is the minimal set of nodes that d-separates node i from all other nodes.

Causal networks. Bayesian networks are usually used to represent causal relationships. This is, however, not strictly necessary: a directed edge from node i to node j does not require that Xj be causally dependent on Xi. This is demonstrated by the fact that the Bayesian networks on the two graphs A → B → C and A ← B ← C are equivalent: they impose the same conditional independence requirements. A causal network is a Bayesian network with an explicit requirement that the relationships be causal.

Causal networks. The equivalence of A → B → C and A ← B ← C is proved with Bayes' theorem...
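Written out (a standard manipulation), Bayes' theorem turns one factorization into the other:

$$P(A)\,P(B \mid A)\,P(C \mid B) \;=\; P(B)\,P(A \mid B)\,P(C \mid B) \;=\; P(C)\,P(B \mid C)\,P(A \mid B),$$

which is exactly the factorization corresponding to A ← B ← C.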

Exercise 14.3 (a) in AIMA. Two astronomers in different parts of the world make measurements M1 and M2 of the number of stars N in some small region of the sky, using their telescopes. Normally there is a small possibility e of error of up to one star in each direction. Each telescope can also (with a much smaller probability f) be badly out of focus (events F1 and F2), in which case the scientist will undercount by three or more stars (or, if N is less than 3, fail to detect any stars at all). Consider the three networks over N, M1, M2, F1, F2 shown in the figure. (a) Which of these Bayesian networks are correct (but not necessarily efficient) representations of the preceding information?



Exercise 14.3 (a) in AIMA. (i) must be incorrect – N is d-separated from F1 and F2, i.e. knowing the focus states F does not affect N if we know M. This cannot be correct.


Exercise 14.3 (a) in AIMA. (ii) is correct – it describes the causal relationships. It is a causal network.


Exercise 14.3 (a) in AIMA. (iii) is also ok – a fully connected graph would be correct (but not efficient). (iii) has all connections except Mi–Fj and Fi–Fj. (iii) is not causal and not efficient.

Efficient representation of PDs. How do we represent P(C | a, b) for a node C with parents A and B? The parent and child variables can be of any type:
Boolean → Boolean, Boolean → Discrete, Boolean → Continuous,
Discrete → Boolean, Discrete → Discrete, Discrete → Continuous,
Continuous → Boolean, Continuous → Discrete, Continuous → Continuous.

Efficient representation of PDs.
Boolean → Boolean: Noisy-OR, Noisy-AND
Boolean/Discrete → Discrete: Noisy-MAX
Bool./Discr./Cont. → Continuous: Parametric distribution (e.g. Gaussian)
Continuous → Boolean: Logit/Probit

Noisy-OR example (Boolean → Boolean). [Table: P(E | C1, C2, C3) for all combinations of C1, C2, C3; example from L.E. Sucar.] The effect (E) is off (false) when none of the causes are true. The probability for the effect increases with the number of true causes (for this example).

Noisy-OR general case (Boolean → Boolean). Each cause Ci has an inhibitor probability qi, and the inhibitors act independently, so the effect is off only if every active cause is inhibited: P(E = false | C1, ..., Cn) is the product of the qi over the causes that are true. The example on the previous slide used qi = 0.1 for all i. (Image adapted from Laskey & Mahoney 1999.) Needs only n parameters, not 2^n parameters.

Noisy-OR example (II). Fever is true if and only if Cold, Flu or Malaria is true; each cause has an independent chance of causing the effect.
– all possible causes are listed
– inhibitors are independent
[Figure: Cold, Flu, Malaria → Fever, with inhibitor probabilities q1, q2, q3.]

Noisy-OR example (II). P(Fever | Cold) = 0.4 ⇒ q1 = 0.6; P(Fever | Flu) = 0.8 ⇒ q2 = 0.2; P(Fever | Malaria) = 0.9 ⇒ q3 = 0.1.
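A minimal sketch (Python) of the noisy-OR combination for this example, using the inhibitor probabilities above:

```python
# Noisy-OR: the effect fails only if every active cause is independently
# inhibited, so P(E = false | causes) = product of q_i over the true causes.
q = {'Cold': 0.6, 'Flu': 0.2, 'Malaria': 0.1}

def p_fever(active_causes):
    p_off = 1.0
    for cause in active_causes:
        p_off *= q[cause]
    return 1.0 - p_off

print(p_fever(['Cold']))                    # 0.4, matches P(Fever | Cold)
print(p_fever(['Cold', 'Flu']))             # 1 - 0.6*0.2 = 0.88
print(p_fever(['Cold', 'Flu', 'Malaria']))  # 1 - 0.6*0.2*0.1 = 0.988
print(p_fever([]))                          # 0.0: no true causes, no fever
```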


Parametric probability densities (Boolean/Discr./Continuous → Continuous). Use parametric probability densities, e.g., the normal distribution. Gaussian networks (a = input to the node).
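One common concrete choice, given here as an illustration, is the linear-Gaussian node:

$$P(X = x \mid a) \;=\; \mathcal{N}\big(x;\ \mu(a), \sigma^2\big), \qquad \mu(a) = w^{\top} a + b .$$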

Probit & Logit (Continuous → Boolean). If the input is continuous but the output is boolean, use probit or logit.
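One standard parameterization (the threshold location μ and the width parameters σ and s are illustrative):

$$\text{probit:}\quad P(A = \mathrm{true} \mid x) = \Phi\!\left(\frac{x - \mu}{\sigma}\right) \qquad\qquad \text{logit:}\quad P(A = \mathrm{true} \mid x) = \frac{1}{1 + e^{-(x - \mu)/s}}$$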

The cancer network.
Age: {1-10, 11-20, ...} – discrete
Gender: {M, F} – discrete/boolean
Toxics: {Low, Medium, High} – discrete
Smoking: {No, Light, Heavy} – discrete
Cancer: {No, Benign, Malignant} – discrete
Serum Calcium: level – continuous
Lung Tumour: {Yes, No} – discrete/boolean

Inference in BN Inference means computing P(X|e), where X is a query (variable) and e is a set of evidence variables (for which we know the values). Examples: P(Burglary | john_calls, mary_calls) P(Cancer | age, gender, smoking, serum_calcium) P(Cavity | toothache, catch)

Exact inference in BN ”Doable” for boolean variables: Look up entries in conditional probability tables (CPTs).

Example: The alarm network (CPTs as before). Evidence variables = {J, M}, query variable = B. What is the probability of a burglary if both John and Mary call?


Example: The alarm network. What is the probability of a burglary if both John and Mary call? Answer: P(B = b | j, m) ≈ 28%.
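A minimal sketch (Python) of inference by enumeration for this query, assuming the standard AIMA CPT values:

```python
from itertools import product

P_B = {True: 0.001, False: 0.999}                      # P(B)
P_E = {True: 0.002, False: 0.998}                      # P(E)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}     # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                        # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                        # P(M=true | A)

def joint(b, e, a, j, m):
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# P(B | j, m): sum out the hidden variables E and A, then normalize over B
unnorm = {b: sum(joint(b, e, a, True, True)
                 for e, a in product([True, False], repeat=2))
          for b in (True, False)}
print(unnorm[True] / (unnorm[True] + unnorm[False]))   # ≈ 0.284, i.e. ≈ 28%
```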

Use depth-first search. A lot of unnecessary repeated computation...

Complexity of exact inference By eliminating repeated calculation & uninteresting paths we can speed up the inference a lot. Linear time complexity for singly connected networks (polytrees). Exponential for multiply connected networks. –Clustering can improve this

Approximate inference in BN. Exact inference is intractable in large multiply connected BNs ⇒ use approximate inference: Monte Carlo methods (random sampling).
– Direct sampling
– Rejection sampling
– Likelihood weighting
– Markov chain Monte Carlo

Markov chain Monte Carlo
1. Fix the evidence variables (E1, E2, ...) at their given values.
2. Initialize the network with values for all other variables, including the query variable.
3. Repeat the following many, many, many times:
   a. Pick a non-evidence variable at random (query Xi or hidden Yj).
   b. Select a new value for this variable, conditioned on the current values in the variable's Markov blanket.
Monitor the values of the query variables.
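A minimal Gibbs-sampling sketch (Python); the dict-based network representation and helper names are illustrative assumptions, and the alarm CPTs are the standard AIMA values:

```python
import random

# Network: node -> (list of parents, CPT), where the CPT maps a tuple of
# parent values to P(node = True). All variables are boolean.

def p_given_markov_blanket(bn, node, value, state):
    """Unnormalized P(node = value | Markov blanket):
    P(node = value | parents) * prod over children c of P(c = state[c] | c's parents)."""
    parents, cpt = bn[node]
    p = cpt[tuple(state[par] for par in parents)]
    prob = p if value else 1.0 - p
    local = dict(state)
    local[node] = value
    for child, (c_parents, c_cpt) in bn.items():
        if node in c_parents:
            pc = c_cpt[tuple(local[par] for par in c_parents)]
            prob *= pc if local[child] else 1.0 - pc
    return prob

def gibbs_query(bn, query, evidence, n_samples=100000):
    state = dict(evidence)                               # 1. fix the evidence
    non_evidence = [v for v in bn if v not in evidence]
    for v in non_evidence:                               # 2. initialize the network
        state[v] = random.choice([True, False])
    count_true = 0
    for _ in range(n_samples):                           # 3. repeat many, many times
        v = random.choice(non_evidence)                  # 3a. pick a non-evidence variable
        pt = p_given_markov_blanket(bn, v, True, state)
        pf = p_given_markov_blanket(bn, v, False, state)
        state[v] = random.random() < pt / (pt + pf)      # 3b. resample from the blanket
        count_true += state[query]                       # monitor the query variable
    return count_true / n_samples

alarm = {
    'B': ([], {(): 0.001}),
    'E': ([], {(): 0.002}),
    'A': (['B', 'E'], {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    'J': (['A'], {(True,): 0.90, (False,): 0.05}),
    'M': (['A'], {(True,): 0.70, (False,): 0.01}),
}
print(gibbs_query(alarm, 'B', {'J': True, 'M': True}))   # ≈ 0.28 for large n
```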