Uncertainty Chapter 13. Outline Uncertainty Probability Syntax and Semantics Inference Independence and Bayes' Rule.

Presentation transcript:

Uncertainty Chapter 13

Outline Uncertainty Probability Syntax and Semantics Inference Independence and Bayes' Rule

Uncertainty
Let action A_t = leave for airport t minutes before flight. Will A_t get me there on time?
Problems:
1. partial observability (road state, other drivers' plans, etc.)
2. noisy sensors (traffic reports)
3. uncertainty in action outcomes (flat tire, etc.)
4. immense complexity of modeling and predicting traffic
Hence a purely logical approach either
1. risks falsehood: “A_25 will get me there on time”, or
2. leads to conclusions that are too weak for decision making: “A_25 will get me there on time if there's no accident on the bridge and it doesn't rain and my tires remain intact, etc.”
(A_1440 might reasonably be said to get me there on time, but I'd have to stay overnight in the airport …)

Medical Diagnosis: Why Logic Fails
Laziness: It is too much work to list the complete set of antecedents or consequents needed to ensure an exceptionless rule, and too hard to use the enormous rules that result.
Theoretical ignorance: Medical science has no complete theory for the domain.
Practical ignorance: Even if we know all the rules, we may be uncertain about a particular patient because not all the necessary tests have been or can be run.

Methods for handling uncertainty
Default or nonmonotonic logic:
– Assume my car does not have a flat tire
– Assume A_25 works unless contradicted by evidence
Issues: What assumptions are reasonable? How to handle contradiction?
Rules with fudge factors:
– A_25 |→_{0.3} get there on time
– Sprinkler |→_{0.99} WetGrass
– WetGrass |→_{0.7} Rain
Issues: Problems with combination, e.g., Sprinkler causes Rain??
Probability:
– Model agent's degree of belief
– Given the available evidence, A_25 will get me there on time with probability 0.04

Probability
Probabilistic assertions summarize effects of
– laziness: failure to enumerate exceptions, qualifications, etc.
– ignorance: lack of relevant facts, initial conditions, etc.
Subjective probability: Probabilities relate propositions to the agent's own state of knowledge, e.g., P(A_25 | no reported accidents) = 0.06
These are not assertions about the world.
Probabilities of propositions change with new evidence, e.g., P(A_25 | no reported accidents, 5 a.m.) = 0.15

Making decisions under uncertainty
Suppose I believe the following:
P(A_25 gets me there on time | …) = 0.04
P(A_90 gets me there on time | …) = 0.70
P(A_120 gets me there on time | …) = 0.95
P(A_1440 gets me there on time | …) =
Which action to choose? Depends on my preferences for missing flight vs. time spent waiting, etc.
– Utility theory is used to represent and infer preferences
– Decision theory = probability theory + utility theory

Syntax
Basic element: random variable
Similar to propositional logic: possible worlds are defined by assignment of values to random variables.
Boolean random variables, e.g., Cavity (do I have a cavity?)
Discrete random variables, e.g., Weather is one of ⟨sunny, rainy, cloudy, snow⟩
Domain values must be exhaustive and mutually exclusive.
Elementary proposition constructed by assignment of a value to a random variable, e.g., Weather = sunny, Cavity = false (abbreviated as ¬cavity)
Complex propositions formed from elementary propositions and standard logical connectives, e.g., Weather = sunny ∨ Cavity = false

Syntax
Atomic event: A complete specification of the state of the world about which the agent is uncertain.
E.g., if the world consists of only two Boolean variables Cavity and Toothache, then there are 4 distinct atomic events:
Cavity = false ∧ Toothache = false
Cavity = false ∧ Toothache = true
Cavity = true ∧ Toothache = false
Cavity = true ∧ Toothache = true
Atomic events are mutually exclusive and exhaustive.

Axioms of probability
For any propositions A, B:
– 0 ≤ P(A) ≤ 1
– P(true) = 1 and P(false) = 0
– P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Prior probability
Prior or unconditional probabilities of propositions, e.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72, correspond to belief prior to arrival of any (new) evidence.
A probability distribution gives values for all possible assignments: P(Weather) is a vector with one entry per value (sunny, rainy, cloudy, snow), normalized, i.e., it sums to 1.
A joint probability distribution for a set of random variables gives the probability of every atomic event on those random variables: P(Weather, Cavity) is a 4 × 2 matrix of values, with one column per Weather value (sunny, rainy, cloudy, snow) and one row per Cavity value (true, false).
Every question about a domain can be answered by the joint distribution.

Conditional probability Conditional or posterior probabilities e.g., P(cavity | toothache) = 0.8 i.e., given that toothache is all I know (Notation for conditional distributions: P(Cavity | Toothache) = 2-element vector of 2-element vectors) If we know more, e.g., cavity is also given, then we have P(cavity | toothache,cavity) = 1 New evidence may be irrelevant, allowing simplification, e.g., P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8 This kind of inference, sanctioned by domain knowledge, is crucial

Conditional probability
Definition of conditional probability: P(a | b) = P(a ∧ b) / P(b) if P(b) > 0
Product rule gives an alternative formulation: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
A general version holds for whole distributions, e.g., P(Weather, Cavity) = P(Weather | Cavity) P(Cavity) (view as a set of 4 × 2 equations, not matrix multiplication).
Chain rule is derived by successive application of the product rule:
P(X_1, …, X_n) = P(X_1, …, X_{n-1}) P(X_n | X_1, …, X_{n-1})
= P(X_1, …, X_{n-2}) P(X_{n-1} | X_1, …, X_{n-2}) P(X_n | X_1, …, X_{n-1})
= …
= Π_{i=1}^{n} P(X_i | X_1, …, X_{i-1})

Inference by enumeration
Start with the joint probability distribution.
For any proposition φ, sum the atomic events where it is true: P(φ) = Σ_{ω : ω ⊨ φ} P(ω)

Inference by enumeration
Start with the joint probability distribution.
For any proposition φ, sum the atomic events where it is true: P(φ) = Σ_{ω : ω ⊨ φ} P(ω)
P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
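The summation above can be sketched in a few lines of Python. The joint table itself did not survive in this transcript, so the eight entries below are an assumption: they are the values usually quoted for this textbook dentist example, and they agree with the result P(toothache) = 0.2 on the slide. The names `JOINT` and `prob` are illustrative only.

```python
# Full joint over (toothache, catch, cavity).
# ASSUMPTION: standard textbook values for the dentist example; the
# table is not part of the transcript.
JOINT = {
    (True,  True,  True):  0.108,
    (True,  False, True):  0.012,
    (False, True,  True):  0.072,
    (False, False, True):  0.008,
    (True,  True,  False): 0.016,
    (True,  False, False): 0.064,
    (False, True,  False): 0.144,
    (False, False, False): 0.576,
}


def prob(phi):
    """P(phi): sum the atomic events (worlds) in which phi is true."""
    return sum(p for world, p in JOINT.items() if phi(*world))


# A proposition is any Boolean function of one world.
p_toothache = prob(lambda toothache, catch, cavity: toothache)
p_cavity_or_toothache = prob(lambda toothache, catch, cavity: cavity or toothache)
```

Any proposition built from the three variables can be passed to `prob` the same way, which is the sense in which the joint answers every question about the domain.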

Inference by enumeration
Start with the joint probability distribution.
Can also compute conditional probabilities:
P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache) = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4

Normalization
The denominator can be viewed as a normalization constant α:
P(Cavity | toothache) = α P(Cavity, toothache)
= α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
= α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩]
= α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩
General idea: compute the distribution on the query variable by fixing evidence variables and summing over hidden variables.

Inference by enumeration, contd.
Typically, we are interested in the posterior joint distribution of the query variables Y given specific values e for the evidence variables E.
Let the hidden variables be H = X − Y − E.
Then the required summation of joint entries is done by summing out the hidden variables:
P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)
The terms in the summation are joint entries because Y, E and H together exhaust the set of random variables.
Obvious problems:
1. Worst-case time complexity O(d^n), where d is the largest arity
2. Space complexity O(d^n) to store the joint distribution
3. How to find the numbers for O(d^n) entries?
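A minimal sketch of this query procedure for a single Boolean query variable follows. It fixes the evidence, sums out the hidden variables, and normalizes with α. The joint values are assumed (the usual textbook numbers for the dentist example, not present in this transcript), and `enumerate_query` is a hypothetical helper name.

```python
from itertools import product

VARS = ("toothache", "catch", "cavity")

# ASSUMPTION: standard textbook joint for the dentist example,
# keyed by (toothache, catch, cavity).
JOINT = {
    (True,  True,  True):  0.108,
    (True,  False, True):  0.012,
    (False, True,  True):  0.072,
    (False, False, True):  0.008,
    (True,  True,  False): 0.016,
    (True,  False, False): 0.064,
    (False, True,  False): 0.144,
    (False, False, False): 0.576,
}


def enumerate_query(query_var, evidence):
    """Return P(query_var | evidence) as {value: probability}."""
    hidden = [v for v in VARS if v != query_var and v not in evidence]
    unnormalized = {}
    for query_value in (True, False):
        total = 0.0
        # Sum out the hidden variables H = X - Y - E.
        for hidden_values in product((True, False), repeat=len(hidden)):
            world = dict(evidence)
            world[query_var] = query_value
            world.update(zip(hidden, hidden_values))
            total += JOINT[tuple(world[v] for v in VARS)]
        unnormalized[query_value] = total
    alpha = 1.0 / sum(unnormalized.values())  # normalization constant
    return {value: alpha * p for value, p in unnormalized.items()}


posterior = enumerate_query("cavity", {"toothache": True})
```

With these assumed numbers the posterior over Cavity given toothache comes out at roughly 0.6 for true and 0.4 for false. The nested loop over hidden values is the O(d^n) cost listed under the obvious problems.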

Independence
A and B are independent iff P(A | B) = P(A) or P(B | A) = P(B) or P(A, B) = P(A) P(B)
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
32 entries reduced to 12; for n independent biased coins, O(2^n) → O(n)
Absolute independence is powerful but rare.
Dentistry is a large field with hundreds of variables, none of which are independent. What to do?

Conditional independence
P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries.
If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
(1) P(catch | toothache, cavity) = P(catch | cavity)
The same independence holds if I haven't got a cavity:
(2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
Catch is conditionally independent of Toothache given Cavity:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
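Equations (1) and (2) can be checked numerically against a joint table. The sketch below assumes the usual textbook joint for the dentist example (the table is not in this transcript) and uses illustrative helper names `prob` and `cond`.

```python
# ASSUMPTION: standard textbook joint, keyed by (toothache, catch, cavity).
JOINT = {
    (True,  True,  True):  0.108,
    (True,  False, True):  0.012,
    (False, True,  True):  0.072,
    (False, False, True):  0.008,
    (True,  True,  False): 0.016,
    (True,  False, False): 0.064,
    (False, True,  False): 0.144,
    (False, False, False): 0.576,
}


def prob(phi):
    """P(phi) by summing the worlds where phi holds."""
    return sum(p for world, p in JOINT.items() if phi(*world))


def cond(phi, given):
    """P(phi | given) = P(phi and given) / P(given)."""
    return prob(lambda *w: phi(*w) and given(*w)) / prob(given)


def is_catch(toothache, catch, cavity):
    return catch


# (P(catch | toothache, Cavity=v), P(catch | Cavity=v)) for v in {true, false}
checks = []
for v in (True, False):
    with_toothache = cond(is_catch, lambda t, c, cav, v=v: t and cav == v)
    without = cond(is_catch, lambda t, c, cav, v=v: cav == v)
    checks.append((with_toothache, without))

# Without conditioning on Cavity, Toothache does change the belief in Catch.
p_catch_given_toothache = cond(is_catch, lambda t, c, cav: t)
p_catch = prob(is_catch)
```

Under the assumed table each pair in `checks` agrees, while `p_catch_given_toothache` and `p_catch` differ, so the independence holds only once Cavity is given.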

Conditional independence contd.
Write out the full joint distribution using the chain rule:
P(Toothache, Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
I.e., 2 + 2 + 1 = 5 independent numbers.
In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n.
Conditional independence is our most basic and robust form of knowledge about uncertain environments.

Bayes' Rule
Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
⇒ Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)
or in distribution form: P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)
Useful for assessing diagnostic probability from causal probability:
– P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
– E.g., let M be meningitis, S be stiff neck: P(m | s) = P(s | m) P(m) / P(s) = 0.8 × P(m) / 0.1
– Note: posterior probability of meningitis still very small!

Bayes' Rule and conditional independence
P(Cavity | toothache ∧ catch)
= α P(toothache ∧ catch | Cavity) P(Cavity)
= α P(toothache | Cavity) P(catch | Cavity) P(Cavity)
This is an example of a naïve Bayes model:
P(Cause, Effect_1, …, Effect_n) = P(Cause) Π_i P(Effect_i | Cause)
Total number of parameters is linear in n.
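The naïve Bayes computation can be sketched as a product of a prior and per-effect likelihoods followed by normalization. The three small tables below are assumptions, derived from the usual textbook joint for the dentist example rather than from this transcript, and `naive_bayes_posterior` is a hypothetical helper name.

```python
# ASSUMED values, consistent with the standard textbook dentist joint.
P_CAVITY = {True: 0.2, False: 0.8}            # P(Cavity)
P_TOOTHACHE_GIVEN = {True: 0.6, False: 0.1}   # P(toothache | Cavity)
P_CATCH_GIVEN = {True: 0.9, False: 0.2}       # P(catch | Cavity)


def naive_bayes_posterior(prior, likelihoods):
    """P(Cause | effects) = alpha * P(Cause) * prod_i P(effect_i | Cause)."""
    unnormalized = {}
    for cause, p in prior.items():
        for likelihood in likelihoods:
            p *= likelihood[cause]
        unnormalized[cause] = p
    alpha = 1.0 / sum(unnormalized.values())
    return {cause: alpha * p for cause, p in unnormalized.items()}


# Evidence: toothache = true and catch = true.
posterior = naive_bayes_posterior(P_CAVITY, [P_TOOTHACHE_GIVEN, P_CATCH_GIVEN])
```

Each extra effect adds one small table and one multiplication per cause value, which is why the parameter count grows linearly in n.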

Example
A doctor knows that the disease meningitis causes the patient to have a stiff neck, say, 50% of the time. The doctor also knows some unconditional facts: the prior probability of a patient having meningitis is 1/50,000, and the prior probability of any patient having a stiff neck is 1/20.
Letting S be the proposition that the patient has a stiff neck and M be the proposition that the patient has meningitis, we have
P(S | M) = 0.5
P(M) = 1/50000
P(S) = 1/20
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
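As a quick check of the arithmetic, the same calculation can be done with exact fractions. This is only a sketch; the variable names are illustrative.

```python
from fractions import Fraction

p_s_given_m = Fraction(1, 2)    # P(S | M): stiff neck given meningitis
p_m = Fraction(1, 50000)        # P(M): prior probability of meningitis
p_s = Fraction(1, 20)           # P(S): prior probability of a stiff neck

# Bayes' rule: P(M | S) = P(S | M) P(M) / P(S)
p_m_given_s = p_s_given_m * p_m / p_s
```

The result is 1/5000, i.e. 0.0002: about one stiff-neck patient in 5000 is expected to have meningitis.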

Bayes Normalization Consider again the equation for calculating the probability of meningitis given a stiff neck: P(M|S) = P(S|M)P(M)/P(S) Suppose we are also concerned with the possibility that the patient is suffering from whiplash W given a stiff neck: P(W|S) = P(S|W)P(W)/P(S)

Bayes Normalization
We can replace W by ~M:
P(M | S) = P(S | M) P(M) / P(S)
P(~M | S) = P(S | ~M) P(~M) / P(S)
Adding these two equations, and using the fact that P(M | S) + P(~M | S) = 1:
P(S) = P(S | M) P(M) + P(S | ~M) P(~M)

Bayes Normalization
Substituting into the equation for P(M | S), we have
P(M | S) = P(S | M) P(M) / (P(S | M) P(M) + P(S | ~M) P(~M))
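The normalized form can be tried on the meningitis numbers from the earlier example. The slides do not state P(S | ~M), so this sketch recovers it from the total-probability equation P(S) = P(S | M) P(M) + P(S | ~M) P(~M); that step and the helper name `normalized_posterior` are assumptions made for illustration.

```python
from fractions import Fraction

p_s_given_m = Fraction(1, 2)
p_m = Fraction(1, 50000)
p_s = Fraction(1, 20)

# P(S | ~M) is not given on the slides; solve the total-probability
# equation for it (an assumption made only for this illustration).
p_not_m = 1 - p_m
p_s_given_not_m = (p_s - p_s_given_m * p_m) / p_not_m


def normalized_posterior(lik_m, prior_m, lik_not_m):
    """P(M | S) without using P(S) directly."""
    numerator = lik_m * prior_m
    return numerator / (numerator + lik_not_m * (1 - prior_m))


p_m_given_s = normalized_posterior(p_s_given_m, p_m, p_s_given_not_m)
```

The value matches the direct Bayes' rule result of 1/5000. In practice the appeal of this form is that P(S | ~M) may be easier to assess than the unconditional P(S).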

Summary Probability is a rigorous formalism for uncertain knowledge Joint probability distribution specifies probability of every atomic event Queries can be answered by summing over atomic events For nontrivial domains, we must find a way to reduce the joint size Independence and conditional independence provide the tools

Bayesian Networks Chapter 14

Outline Syntax Semantics

Bayesian Networks: Motivation Capture independence and conditional independence where they exist. Among variables where dependencies exist, encode the relevant portion of the full joint. Use a graphical representation for which we can more easily investigate the complexity of inference and can search for efficient inference algorithms.

Bayesian networks
A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions.
Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ "directly influences")
– a conditional distribution for each node given its parents: P(X_i | Parents(X_i))
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values.

Example Topology of network encodes conditional independence assertions: Weather is independent of the other variables Toothache and Catch are conditionally independent given Cavity

Example I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar? Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls Network topology reflects "causal" knowledge: –A burglar can set the alarm off –An earthquake can set the alarm off –The alarm can cause Mary to call –The alarm can cause John to call

Example contd.

Compactness
A CPT for Boolean X_i with k Boolean parents has 2^k rows for the combinations of parent values.
Each row requires one number p for X_i = true (the number for X_i = false is just 1 − p).
If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers.
I.e., grows linearly with n, vs. O(2^n) for the full joint distribution.
For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31).
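The count for the burglary net can be reproduced from its parent lists. The sketch below encodes the topology described on the earlier Example slide (Burglary and Earthquake influence Alarm, which influences JohnCalls and MaryCalls); the function names are illustrative.

```python
# Parents of each Boolean variable in the burglary network.
PARENTS = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}


def network_size(parents):
    """Independent numbers needed: one per CPT row, 2^k rows per node."""
    return sum(2 ** len(ps) for ps in parents.values())


def full_joint_size(num_vars):
    """Independent numbers in the full joint over Boolean variables."""
    return 2 ** num_vars - 1


net_numbers = network_size(PARENTS)
joint_numbers = full_joint_size(len(PARENTS))
```

Here `net_numbers` is 10 and `joint_numbers` is 31, matching the slide.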

Semantics
The full joint distribution is defined as the product of the local conditional distributions:
P(X_1, …, X_n) = Π_{i=1}^{n} P(X_i | Parents(X_i))
e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
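The product above can be sketched directly in code. The CPT figure for the burglary network did not survive in this transcript, so the numbers below are an assumption: they are the values commonly used for this textbook example. The helper names `bern` and `joint` are illustrative.

```python
from itertools import product

# ASSUMED CPT values (commonly used for this textbook example; the
# CPT figure is not part of the transcript).
P_B = 0.001                      # P(burglary)
P_E = 0.002                      # P(earthquake)
P_A = {                          # P(alarm | Burglary, Earthquake)
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}
P_J = {True: 0.90, False: 0.05}  # P(johnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}  # P(maryCalls | Alarm)


def bern(p_true, value):
    """Probability of a Boolean value given P(value = true)."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """P(b, e, a, j, m) as the product of the local conditionals."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


# P(j AND m AND a AND not b AND not e)
p_example = joint(b=False, e=False, a=True, j=True, m=True)

# The 32 atomic events should sum to 1.
total = sum(joint(*world) for world in product((True, False), repeat=5))
```

With these assumed CPTs `p_example` is about 0.00063.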

REASONING RULES
1. Probability of conjunction: p(X1 ∧ X2 | Cond) = p(X1 | Cond) * p(X2 | X1 ∧ Cond)
2. Probability of a certain event: p(X | Y1 ∧ ... ∧ X ∧ ...) = 1
3. Probability of impossible event: p(X | Y1 ∧ ... ∧ ~X ∧ ...) = 0
4. Probability of negation: p(~X | Cond) = 1 − p(X | Cond)

5. If the condition involves a descendant of X, then use Bayes' theorem: if Cond0 = Y ∧ Cond, where Y is a descendant of X in the belief net, then p(X | Cond0) = p(X | Cond) * p(Y | X ∧ Cond) / p(Y | Cond)
6. Cases when the condition Cond does not involve a descendant of X:
(a) If X has no parents then p(X | Cond) = p(X), p(X) given
(b) If X has parents Parents then

Constructing Bayesian networks
1. Choose an ordering of variables X_1, …, X_n
2. For i = 1 to n:
– add X_i to the network
– select parents from X_1, …, X_{i-1} such that P(X_i | Parents(X_i)) = P(X_i | X_1, …, X_{i-1})
This choice of parents guarantees:
P(X_1, …, X_n) = Π_{i=1}^{n} P(X_i | X_1, …, X_{i-1}) (chain rule)
= Π_{i=1}^{n} P(X_i | Parents(X_i)) (by construction)

Example
Suppose we choose the ordering M, J, A, B, E
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A, J, M) = P(E | A)? No
P(E | B, A, J, M) = P(E | A, B)? Yes

Example contd.
Deciding conditional independence is hard in noncausal directions. (Causal models and conditional independence seem hardwired for humans!)
Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed
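The count of 13 follows from the parent sets implied by the yes/no answers for the ordering M, J, A, B, E. A minimal sketch, with parent lists read off those answers and an illustrative helper name:

```python
# Parent sets implied by the answers for the ordering M, J, A, B, E.
NONCAUSAL_PARENTS = {
    "MaryCalls": [],
    "JohnCalls": ["MaryCalls"],
    "Alarm": ["JohnCalls", "MaryCalls"],
    "Burglary": ["Alarm"],
    "Earthquake": ["Alarm", "Burglary"],
}


def network_size(parents):
    """One number per CPT row; a Boolean node with k parents has 2^k rows."""
    return sum(2 ** len(ps) for ps in parents.values())


noncausal_numbers = network_size(NONCAUSAL_PARENTS)
```

The same function gives 10 for the causal ordering, so the noncausal ordering costs three extra numbers even in this tiny network.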

COMMENTS
In the worst case, the complexity of reasoning in belief networks grows exponentially with the number of nodes.
Large networks therefore require substantial algorithmic improvements to be handled efficiently.

d-SEPARATION
Follows from the basic independence assumption of Bayes networks.
d-separation = direction-dependent separation
Let E = set of “evidence nodes” (subset of variables in the Bayes network)
Let V_i, V_j be two variables in the network

d-SEPARATION
Nodes V_i and V_j are conditionally independent given set E if E d-separates V_i and V_j.
E d-separates V_i, V_j if all (undirected) paths (V_i, V_j) are “blocked” by E.
If E d-separates V_i, V_j, then V_i and V_j are conditionally independent, given E.
We write I(V_i, V_j | E)
This means: p(V_i, V_j | E) = p(V_i | E) * p(V_j | E)

More Difficult Case: What if Some Variables are Missing Recall our earlier notion of hidden variables. Sometimes a variable is hidden because it cannot be explicitly measured. For example, we might hypothesize that a chromosomal abnormality is responsible for some patients with a particular cancer not responding well to treatment.

Missing Values (Continued) We might include a node for this chromosomal abnormality in our network because we strongly believe it exists, other variables can be used to predict it, and it is in turn predictive of still other variables. But in estimating CPTs from data, none of our data points has a value for this variable.

General EM Framework Given: Data with missing values, Space of possible models, Initial model. Repeat until no change greater than threshold: –Expectation (E) Step: Compute expectation over missing values, given model. –Maximization (M) Step: Replace current model with model that maximizes probability of data.

(“Soft”) EM vs. “Hard” EM Standard (soft) EM: expectation is a probability distribution. Hard EM: expectation is “all or nothing”… most likely/probable value. Advantage of hard EM is computational efficiency when expectation is over state consisting of values for multiple variables (next example illustrates).

EM for Parameter Learning: E Step For each data point with missing values, compute the probability of each possible completion of that data point. Replace the original data point with all these completions, weighted by probabilities. Computing the probability of each completion (expectation) is just answering query over missing variables given others.

EM for Parameter Learning: M Step Use the completed data set to update our Dirichlet distributions as we would use any complete data set, except that our counts (tallies) may be fractional now. Update CPTs based on new Dirichlet distributions, as we would with any complete data set.

EM for Parameter Learning Iterate E and M steps until no changes occur. We will not necessarily get the global MAP (or ML given uniform priors) setting of all the CPT entries, but under a natural set of conditions we are guaranteed convergence to a local MAP solution. EM algorithm is used for a wide variety of tasks outside of BN learning as well.

Subtlety for Parameter Learning
Overcounting can occur, growing with the number of iterations required to converge to settings for the missing values.
After each repetition of the E step, reset all Dirichlet distributions before repeating the M step.

EM for Parameter Learning
[Figure: network with arcs A → C, B → C, C → D, C → E. Each CPT entry is followed by its Dirichlet counts in parentheses.]
P(A) = 0.1 (1,9)
P(B) = 0.2 (1,4)
P(C | A, B): T T 0.9 (9,1); T F 0.6 (3,2); F T 0.3 (3,7); F F 0.2 (1,4)
P(D | C): T 0.9 (9,1); F 0.2 (1,4)
P(E | C): T 0.8 (4,1); F 0.1 (1,9)
[Data table with columns A B C D E, in which some entries are missing (?); a second version of the slide shows the rows replaced by their weighted completions.]

Multiple Missing Values
[Figure: network with arcs A → C, B → C, C → D, C → E. Each CPT entry is followed by its Dirichlet counts in parentheses.]
P(A) = 0.1 (1,9)
P(B) = 0.2 (1,4)
P(C | A, B): T T 0.9 (9,1); T F 0.6 (3,2); F T 0.3 (3,7); F F 0.2 (1,4)
P(D | C): T 0.9 (9,1); F 0.2 (1,4)
P(E | C): T 0.8 (4,1); F 0.1 (1,9)
Data (A B C D E): ? 0 ? 0 1

Multiple Missing Values
CPTs after updating on this data point:
P(A) = 0.1 (1.1,9.9)
P(B) = 0.17 (1,5)
P(C | A, B): T T 0.9 (9,1); T F 0.6 (3.06,2.04); F T 0.3 (3,7); F F 0.2 (1.18,4.72)
P(D | C): T 0.88 (9,1.24); F 0.17 (1,4.76)
P(E | C): T 0.81 (4.24,1); F 0.16 (1.76,9)
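The updated counts on this slide can be reproduced with one E step and one M step on the single data point (A = ?, B = 0, C = ?, D = 0, E = 1). The sketch below takes the network structure and the Dirichlet counts from the slides. The data layout and the helper names (`e_step`, `m_step`) are assumptions made for illustration.

```python
from itertools import product

ORDER = ("A", "B", "C", "D", "E")
PARENTS = {"A": (), "B": (), "C": ("A", "B"), "D": ("C",), "E": ("C",)}

# Dirichlet counts [true_count, false_count] per parent configuration.
COUNTS = {
    "A": {(): [1, 9]},
    "B": {(): [1, 4]},
    "C": {(True, True): [9, 1], (True, False): [3, 2],
          (False, True): [3, 7], (False, False): [1, 4]},
    "D": {(True,): [9, 1], (False,): [1, 4]},
    "E": {(True,): [4, 1], (False,): [1, 9]},
}


def joint(world, counts):
    """Probability of a complete assignment under the current CPTs."""
    result = 1.0
    for var in ORDER:
        t, f = counts[var][tuple(world[p] for p in PARENTS[var])]
        p_true = t / (t + f)
        result *= p_true if world[var] else 1.0 - p_true
    return result


def e_step(row, counts):
    """Return every completion of row with its posterior weight."""
    missing = [v for v in ORDER if row[v] is None]
    completions = []
    for values in product((True, False), repeat=len(missing)):
        world = dict(row)
        world.update(zip(missing, values))
        completions.append((world, joint(world, counts)))
    total = sum(weight for _, weight in completions)
    return [(world, weight / total) for world, weight in completions]


def m_step(counts, weighted_rows):
    """Add the (possibly fractional) counts to a copy of the priors."""
    new = {v: {k: list(c) for k, c in t.items()} for v, t in counts.items()}
    for world, weight in weighted_rows:
        for var in ORDER:
            key = tuple(world[p] for p in PARENTS[var])
            new[var][key][0 if world[var] else 1] += weight
    return new


row = {"A": None, "B": False, "C": None, "D": False, "E": True}
completions = e_step(row, COUNTS)
updated = m_step(COUNTS, completions)
```

The four completions of (A, C) receive weights of about 0.06, 0.04, 0.18 and 0.72, and adding them to the prior counts gives the fractional tallies shown on the slide, for example (1.1, 9.9) for A.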

Problems with EM Only local optimum (not much way around that, though). Deterministic … if priors are uniform, may be impossible to make any progress… … next figure illustrates the need for some randomization to move us off an uninformative prior…

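What will EM do here?
[Figure: chain network A → B → C. Each CPT entry is followed by its Dirichlet counts in parentheses.]
Data (A B C): 0 ? 0; 1 ? 1; 0 ? 0; 1 ? 1; 0 ? 0; 1 ? 1
P(A) = 0.5 (1,1)
P(B | A): T 0.5 (1,1); F 0.5 (1,1)
P(C | B): T 0.5 (1,1); F 0.5 (1,1)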

EM Dependent on Initial Beliefs
[Figure: chain network A → B → C. Each CPT entry is followed by its Dirichlet counts in parentheses.]
Data (A B C): 0 ? 0; 1 ? 1; 0 ? 0; 1 ? 1; 0 ? 0; 1 ? 1
P(A) = 0.5 (1,1)
P(B | A): T 0.6 (6,4); F 0.4 (4,6)
P(C | B): T 0.5 (1,1); F 0.5 (1,1)
B is more likely T than F when A is T. Filling this in makes C more likely T than F when B is T. This makes B still more likely T than F when A is T. Etc.
A small change in the CPT for B (swap 0.6 and 0.4) would have the opposite effect.
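This behaviour can be reproduced with a short EM loop for the chain A → B → C in which B is never observed. The sketch resets the counts to the priors before every M step, as the earlier subtlety slide recommends, and estimates each CPT entry as the mean of its Dirichlet counts. A is always observed, so P(A) plays no part in the E step and is omitted. The function and variable names are illustrative assumptions.

```python
# Six data points (A, B, C) with B always missing, as on the slide.
DATA = [(0, None, 0), (1, None, 1)] * 3


def mean(counts):
    """Mean of a Beta/Dirichlet with counts (true_count, false_count)."""
    true_count, false_count = counts
    return true_count / (true_count + false_count)


def run_em(prior_b, prior_c, iterations=20):
    """prior_b[a] and prior_c[b] hold (true, false) counts.

    Returns (p_b, p_c) with p_b[a] = P(B=1 | A=a), p_c[b] = P(C=1 | B=b).
    """
    p_b = {a: mean(prior_b[a]) for a in (0, 1)}
    p_c = {b: mean(prior_c[b]) for b in (0, 1)}
    for _ in range(iterations):
        # Reset to the priors each round to avoid overcounting.
        counts_b = {a: list(prior_b[a]) for a in (0, 1)}
        counts_c = {b: list(prior_c[b]) for b in (0, 1)}
        for a, _missing, c in DATA:
            # E step: P(B=b | a, c) is proportional to P(b | a) P(c | b).
            weight = {}
            for b in (0, 1):
                pb = p_b[a] if b else 1.0 - p_b[a]
                pc = p_c[b] if c else 1.0 - p_c[b]
                weight[b] = pb * pc
            q = weight[1] / (weight[0] + weight[1])  # P(B=1 | a, c)
            # Fractional counts for the two completions of this row.
            counts_b[a][0] += q
            counts_b[a][1] += 1.0 - q
            counts_c[1][0 if c else 1] += q
            counts_c[0][0 if c else 1] += 1.0 - q
        # M step: re-estimate the CPTs from the completed counts.
        p_b = {a: mean(counts_b[a]) for a in (0, 1)}
        p_c = {b: mean(counts_c[b]) for b in (0, 1)}
    return p_b, p_c


uniform = {0: (1, 1), 1: (1, 1)}
stuck_b, stuck_c = run_em(uniform, uniform)

tilted_b, tilted_c = run_em({1: (6, 4), 0: (4, 6)}, uniform)
swapped_b, swapped_c = run_em({1: (4, 6), 0: (6, 4)}, uniform)
```

With uniform priors every estimate stays at 0.5. With the (6,4)/(4,6) priors, P(B=1 | A=1) and P(C=1 | B=1) both drift above their starting values, and swapping the priors sends them the other way.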

Learning Structure + Parameters Number of structures is superexponential Finding optimal structure (ML or MAP) is NP-complete Two common options: –Severely restrict possible structures – e.g., tree-augmented naïve Bayes (TAN) –Heuristic search (e.g., sparse candidate)

Recall: Naïve Bayes Net
[Figure: a Class Value node with features F_1, F_2, F_3, …, F_{N-2}, F_{N-1}, F_N.]

Alternative: TAN
[Figure: a Class Value node with features F_1, F_2, F_3, …, F_{N-2}, F_{N-1}, F_N.]

Tree-Augmented Naïve Bayes In addition to the Naïve Bayes arcs (class to feature), we are permitted a directed tree among the features Given this restriction, there exists a polynomial-time algorithm to find the maximum likelihood structure

TAN Learning Algorithm (Friedman, Geiger & Goldszmidt ’97)
1. For every pair of features, compute the mutual information (information gain of one for the other), conditional on the class.
2. Add arcs between all pairs of features, weighted by this value.
3. Compute the maximum-weight spanning tree, and direct arcs away from the root.
4. Compute parameters as already seen.
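The structure-learning steps can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: probabilities are plain relative frequencies with no smoothing, the spanning tree is grown with Prim's algorithm from a chosen root, and the toy data set and all function names are made up for the example.

```python
import math
from collections import Counter
from itertools import combinations


def conditional_mutual_information(rows, i, j):
    """I(F_i; F_j | Class) from rows of (feature_tuple, class_label)."""
    n = len(rows)
    n_c = Counter(c for _, c in rows)
    n_ic = Counter((f[i], c) for f, c in rows)
    n_jc = Counter((f[j], c) for f, c in rows)
    n_ijc = Counter((f[i], f[j], c) for f, c in rows)
    total = 0.0
    for (x, y, c), count in n_ijc.items():
        # p(x,y,c) * log[ p(x,y|c) / (p(x|c) p(y|c)) ], written in counts.
        ratio = count * n_c[c] / (n_ic[(x, c)] * n_jc[(y, c)])
        total += (count / n) * math.log(ratio)
    return total


def tan_tree(rows, num_features, root=0):
    """Directed feature arcs (parent, child) of a max-weight spanning tree."""
    weight = {}
    for i, j in combinations(range(num_features), 2):
        weight[(i, j)] = weight[(j, i)] = conditional_mutual_information(rows, i, j)
    in_tree = {root}
    arcs = []
    while len(in_tree) < num_features:
        # Prim's algorithm: take the heaviest edge leaving the tree, so
        # every arc points away from the root.
        parent, child = max(
            ((u, v) for u in in_tree for v in range(num_features) if v not in in_tree),
            key=lambda edge: weight[edge],
        )
        arcs.append((parent, child))
        in_tree.add(child)
    return arcs


# Hypothetical toy data: feature 1 copies feature 0, feature 2 is unrelated.
ROWS = [((f0, f0, f2), c) for c in (0, 1) for f0 in (0, 1) for f2 in (0, 1)]
cmi_01 = conditional_mutual_information(ROWS, 0, 1)
cmi_02 = conditional_mutual_information(ROWS, 0, 2)
arcs = tan_tree(ROWS, num_features=3)
```

In the toy data the pair (0, 1) carries all the conditional mutual information, so the first arc added is from feature 0 to feature 1. The class-to-feature arcs of naïve Bayes are then added on top of this tree.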

Problem with Bayes Optimal
Because there is a super-exponential number of structures, we don’t want to average over all of them. Two options are used in practice:
– Selective model averaging: just choose a subset of “best” but “distinct” models (networks) and pretend it’s exhaustive.
– Go back to MAP/ML (model selection).

Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample from the desired distribution. The quality of the sample improves as a function of the number of steps.

Markov Chain Monte Carlo
Usually it is not hard to construct a Markov chain with the desired properties. The more difficult problem is to determine how many steps are needed to converge to the stationary distribution within an acceptable error. A good chain will have rapid mixing: the stationary distribution is reached quickly starting from an arbitrary position.
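Mixing can be illustrated on a chain small enough to track exactly. The sketch below builds the transition matrix of a random-walk Metropolis chain for a made-up five-state target distribution, then pushes a point-mass starting distribution through the chain and records its total variation distance from the target after each step. The target values and all names are assumptions chosen for illustration.

```python
# Hypothetical target distribution over states 0..4.
TARGET = [0.1, 0.2, 0.4, 0.2, 0.1]


def metropolis_matrix(target):
    """Transition matrix of a random-walk Metropolis chain on 0..n-1."""
    n = len(target)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                # Propose each neighbour with probability 1/2 and accept
                # with probability min(1, target[j] / target[i]).
                matrix[i][j] = 0.5 * min(1.0, target[j] / target[i])
        # Rejected and out-of-range proposals leave the state unchanged.
        matrix[i][i] = 1.0 - sum(matrix[i])
    return matrix


def step(dist, matrix):
    """Distribution after one more step of the chain."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


matrix = metropolis_matrix(TARGET)
dist = [1.0, 0.0, 0.0, 0.0, 0.0]   # start from an arbitrary state
distances = []
for _ in range(100):
    dist = step(dist, matrix)
    distances.append(total_variation(dist, TARGET))
```

The distances shrink geometrically for this chain, so a sample taken after enough steps is close to a draw from the target. For realistic state spaces the matrix cannot be written down, which is why deciding how many steps are enough is the hard part.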