Bayesian Networks. Tamara Berg, CS 590-133 Artificial Intelligence. Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Killian Weinberger, Deva Ramanan.

Announcements. Some students in the back are having trouble hearing the lecture due to talking. Please respect your fellow students. If you have a question or comment relevant to the course, please share it with all of us; otherwise, don't talk during lecture. Also, if you are having trouble hearing in the back, there are plenty of seats further forward.

Reminder: HW3 was released 2/27.
–Written questions only (no programming)
–Due Tuesday, 3/18, 11:59pm

From last class

Random Variables. Let X = {X1, ..., Xn} be a set of random variables, and let x be a realization of X. A random variable is some aspect of the world about which we (may) have uncertainty. Random variables can be binary (e.g. {true, false}, {spam, ham}), take on a discrete set of values (e.g. {Spring, Summer, Fall, Winter}), or be continuous (e.g. [0, 1]).

Joint Probability Distribution. For random variables X = {X1, ..., Xn}, let x be a realization of X. The joint probability distribution P(X1 = x1, ..., Xn = xn), also written p(x1, ..., xn), gives a real value for all possible assignments.

Queries. Joint probability distribution: P(X1 = x1, ..., Xn = xn), also written p(x1, ..., xn). Given a joint distribution, we can reason about unobserved variables given observations (evidence): P(stuff you care about | stuff you already know).

Main kinds of models. Undirected (also called Markov random fields): links express constraints between variables. Directed (also called Bayesian networks): have a notion of causality; one can regard an arc from A to B as indicating that A "causes" B.

Syntax
 Directed Acyclic Graph (DAG).
 Nodes: random variables. Can be assigned (observed) or unassigned (unobserved).
 Arcs: interactions. An arrow from one variable to another indicates direct influence.
 Encode conditional independence: Weather is independent of the other variables; Toothache and Catch are conditionally independent given Cavity.
 Must form a directed, acyclic graph.
(Example network: Weather; Cavity with children Toothache and Catch.)

Bayes Nets. A Bayes net is a directed graph G = (X, E), with nodes X and edges E, where each node is associated with a random variable.

Example

Joint Distribution. By the chain rule (using the usual arithmetic ordering): P(x1, ..., xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) ... P(xn | x1, ..., xn-1).

Directed Graphical Models. A directed graph G = (X, E): nodes X, edges E, each node associated with a random variable. Definition of the joint probability in a graphical model: P(x1, ..., xn) = ∏_i P(xi | parents(xi)), where parents(xi) are the parents of xi in the graph.
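To make the factorization concrete, here is a minimal Python sketch (not from the slides; the network structure and CPT numbers are illustrative assumptions): a net maps each node to its parent list and CPT, and the joint probability of a full assignment is the product of the local conditional terms.

```python
# A minimal sketch: a Bayes net as a dict of (parents, CPT), where each
# CPT maps (value, parent_values) -> probability. Numbers are assumed.
nodes = {
    "Rain":      ([], {(True, ()): 0.2, (False, ()): 0.8}),
    "Sprinkler": ([], {(True, ()): 0.1, (False, ()): 0.9}),
    "WetGrass":  (["Rain", "Sprinkler"], {
        (True,  (True, True)):   0.99, (False, (True, True)):   0.01,
        (True,  (True, False)):  0.90, (False, (True, False)):  0.10,
        (True,  (False, True)):  0.90, (False, (False, True)):  0.10,
        (True,  (False, False)): 0.01, (False, (False, False)): 0.99,
    }),
}

def joint(assignment):
    """P(x1, ..., xn) = product over i of P(xi | parents(xi))."""
    p = 1.0
    for name, (parents, cpt) in nodes.items():
        parent_vals = tuple(assignment[q] for q in parents)
        p *= cpt[(assignment[name], parent_vals)]
    return p

print(joint({"Rain": True, "Sprinkler": False, "WetGrass": True}))
# 0.2 * 0.9 * 0.90 = 0.162
```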

Example Joint Probability:

Example

Size of a Bayes' Net. How big is a joint distribution over N Boolean variables? 2^N. How big is an N-node net if nodes have up to k parents? O(N · 2^(k+1)). Both give you the power to calculate any joint entry P(x1, ..., xN). BNs: huge space savings! Also easier to elicit local CPTs; also turns out to be faster to answer queries.

The joint probability distribution  For example, P(j, m, a, ¬b, ¬e) = P(¬b) P(¬e) P(a | ¬b, ¬e) P(j | a) P(m | a)
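As a quick check of that product, here is the same computation in Python. The CPT numbers below are the standard Russell & Norvig alarm-network values, assumed here because the transcript does not reproduce the slide's tables.

```python
# Term-by-term product for the alarm network, using the standard
# Russell & Norvig CPT values (assumed; not shown in this transcript).
P_b, P_e = 0.001, 0.002            # P(Burglary), P(Earthquake)
P_a_nb_ne = 0.001                  # P(Alarm | ¬b, ¬e)
P_j_a, P_m_a = 0.90, 0.70          # P(JohnCalls | a), P(MaryCalls | a)

# P(j, m, a, ¬b, ¬e) = P(¬b) P(¬e) P(a | ¬b, ¬e) P(j | a) P(m | a)
p = (1 - P_b) * (1 - P_e) * P_a_nb_ne * P_j_a * P_m_a
print(p)  # ≈ 0.000628
```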

Independence in a BN. Important question about a BN:
–Are two nodes independent given certain evidence?
–If yes, we can prove it using algebra (tedious in general).
–If no, we can prove it with a counterexample.
Example: a chain X → Y → Z (e.g. low pressure causes rain, which causes traffic). Question: are X and Z necessarily independent? Answer: no. X can influence Z, and Z can influence X (via Y). Addendum: they could still be independent. How?

Causal Chains. This configuration, X → Y → Z, is a "causal chain" (e.g. X: project due, Y: no office hours, Z: students panic). Is Z independent of X given Y? Yes! Evidence along the chain "blocks" the influence.

Common Cause. Another basic configuration: two effects of the same cause, X ← Y → Z (e.g. Y: homework due, X: full attendance, Z: students sleepy).
–Are X and Z independent? No.
–Are X and Z independent given Y? Yes! Observing the cause blocks influence between the effects.

Common Effect. The last configuration: two causes of one effect (a v-structure), X → Y ← Z (e.g. X: raining, Z: ballgame, Y: traffic).
–Are X and Z independent? Yes: the ballgame and the rain cause traffic, but they are not correlated. (We still need to prove they must be independent; try it!)
–Are X and Z independent given Y? No: seeing traffic puts the rain and the ballgame in competition as explanations.
–This is backwards from the other cases: observing an effect activates influence between its possible causes.

The General Case. Any complex example can be analyzed using these three canonical cases: causal chain, common cause, and (unobserved) common effect. General question: in a given BN, are two variables independent (given evidence)? Solution: analyze the graph.

Bayes Ball. Shade all observed nodes. Place balls at the starting node, let them bounce around according to some rules, and ask whether any ball reaches the goal node. We need to know what happens when a ball arrives at a node on its way to the goal node.

Example. (Bayes ball on a network with nodes R, T, B, and T'.) Answer: yes.

Bayesian decision making. Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E. Inference problem: given some evidence E = e, what is P(X | e)? Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1, e1), ..., (xn, en)}.

Inference. Graphs can have observed (shaded) and unobserved nodes; nodes that are always unobserved are called hidden or latent variables. Probabilistic inference is the problem of computing a conditional probability distribution over the values of some of the nodes (the "hidden" or "unobserved" nodes), given the values of other nodes (the "evidence" or "observed" nodes).

Probabilistic inference  A general scenario:  Query variables: X  Evidence (observed) variables: E = e  Unobserved variables: Y  If we know the full joint distribution P(X, E, Y), how can we perform inference about X?

Inference. Inference: calculating some useful quantity from a joint probability distribution. Examples:
–Posterior probability: P(Q | E1 = e1, ..., Ek = ek)
–Most likely explanation: argmax_q P(Q = q | E1 = e1, ..., Ek = ek)
(Example network: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls.)

Inference – computing conditional probabilities. Marginalization: P(x) = Σ_y P(x, y). Conditional probabilities: P(x | e) = P(x, e) / P(e) = P(x, e) / Σ_x P(x, e).

Inference by Enumeration. Given unlimited time, inference in BNs is easy. Recipe:
–State the marginal probabilities you need.
–Figure out ALL the atomic probabilities you need.
–Calculate and combine them.
(Example network: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls.)
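A minimal sketch of this recipe, reusing the nodes/joint() helpers from the factorization example above: enumerate every full assignment, keep the ones consistent with the evidence, and normalize.

```python
# Inference by enumeration over the nodes/joint() sketch above:
# exponential in the number of variables, but exact.
from itertools import product

def enumerate_query(query_var, evidence):
    names = list(nodes)
    totals = {}
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        # keep only assignments consistent with the evidence
        if all(assignment[v] == val for v, val in evidence.items()):
            q = assignment[query_var]
            totals[q] = totals.get(q, 0.0) + joint(assignment)
    z = sum(totals.values())            # normalize by P(evidence)
    return {v: p / z for v, p in totals.items()}

print(enumerate_query("Rain", {"WetGrass": True}))
```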

Example: Enumeration. In this simple method, we only need the BN to synthesize the joint entries.

Probabilistic inference  A general scenario:  Query variables: X  Evidence (observed) variables: E = e  Unobserved variables: Y  If we know the full joint distribution P(X, E, Y), how can we perform inference about X?  Problems  Full joint distributions are too large  Marginalizing out Y may involve too many summation terms

Inference by Enumeration?

Variable Elimination. Why is inference by enumeration on a Bayes net inefficient?
–You join up the whole joint distribution before you sum out the hidden variables.
–You end up repeating a lot of work!
Idea: interleave joining and marginalizing!
–This is called "variable elimination".
–Choosing the elimination order that minimizes work is NP-hard, but *anything* sensible is much faster than inference by enumeration.

General Variable Elimination. Query: P(Q | E1 = e1, ..., Ek = ek). Start with the initial factors:
–the local CPTs (instantiated by the evidence).
While there are still hidden variables (not Q or evidence):
–Pick a hidden variable H.
–Join all factors mentioning H.
–Eliminate (sum out) H.
Join all remaining factors and normalize.
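Here is a compact, illustrative implementation of the join and eliminate steps (an assumed representation, not code from the course): a factor is a list of variable names plus a table from value tuples to numbers. The usage lines at the end reproduce the first step of the worked example on the next slide.

```python
# A minimal variable-elimination sketch over Boolean variables.
from itertools import product

class Factor:
    def __init__(self, vs, table):
        self.vs, self.table = vs, table   # table: value-tuple -> number

def join(f, g):
    vs = f.vs + [v for v in g.vs if v not in f.vs]
    table = {}
    for vals in product([True, False], repeat=len(vs)):
        a = dict(zip(vs, vals))
        table[vals] = (f.table[tuple(a[v] for v in f.vs)] *
                       g.table[tuple(a[v] for v in g.vs)])
    return Factor(vs, table)

def eliminate(f, var):
    i = f.vs.index(var)
    vs = f.vs[:i] + f.vs[i + 1:]
    table = {}
    for vals, p in f.table.items():       # sum out the chosen variable
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return Factor(vs, table)

# First step of the attend/study example: join P(pr|at,st) with P(st),
# then sum out study.
f_pr = Factor(["pr", "at", "st"], {
    (True, True, True): 0.9,  (True, True, False): 0.5,
    (True, False, True): 0.7, (True, False, False): 0.1,
    (False, True, True): 0.1, (False, True, False): 0.5,
    (False, False, True): 0.3, (False, False, False): 0.9,
})
f_st = Factor(["st"], {(True,): 0.6, (False,): 0.4})
f = eliminate(join(f_pr, f_st), "st")
print(f.table[(True, True)])  # P(pr=T | at=T) = 0.9*0.6 + 0.5*0.4 = 0.74
```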

Example: Variable elimination. Query: what is the probability that a student attends class, given that they pass the exam? [based on slides taken from UMBC CMSC 671, 2005]
Network: attend and study are parents of prepared; attend, prepared, and fair are parents of pass.
Priors: P(at) = .8, P(st) = .6, P(fa) = .9.

P(pr | at, st):
  at st | P(pr=T)
  T  T  | 0.9
  T  F  | 0.5
  F  T  | 0.7
  F  F  | 0.1

P(pa | pr, at, fa):
  pr at fa | P(pa=T)
  T  T  T  | 0.9
  T  T  F  | 0.1
  T  F  T  | 0.7
  T  F  F  | 0.1
  F  T  T  | 0.7
  F  T  F  | 0.1
  F  F  T  | 0.2
  F  F  F  | 0.1

Join study factors. Join P(pr | at, st) with P(st) to form the joint factor P(pr, st | at) (values computed from the CPTs above):

  pr st at | P(pr|at,st)  P(st) | joint P(pr,st|at)
  T  T  T  | 0.9  0.6 | 0.54
  T  F  T  | 0.5  0.4 | 0.20
  T  T  F  | 0.7  0.6 | 0.42
  T  F  F  | 0.1  0.4 | 0.04
  F  T  T  | 0.1  0.6 | 0.06
  F  F  T  | 0.5  0.4 | 0.20
  F  T  F  | 0.3  0.6 | 0.18
  F  F  F  | 0.9  0.4 | 0.36

The other factors (P(at) = .8, P(fa) = .9, P(pa | pr, at, fa)) are unchanged.

Marginalize out study. Sum study out of P(pr, st | at) to obtain P(pr | at):

  pr at | P(pr|at)
  T  T  | 0.54 + 0.20 = 0.74
  T  F  | 0.42 + 0.04 = 0.46
  F  T  | 0.06 + 0.20 = 0.26
  F  F  | 0.18 + 0.36 = 0.54

The other factors (P(at) = .8, P(fa) = .9, P(pa | pr, at, fa)) are unchanged.

Remove "study". The network is now attend, prepared, fair, pass, with P(at) = .8, P(fa) = .9, the unchanged P(pa | pr, at, fa), and the new factor:

  pr at | P(pr|at)
  T  T  | 0.74
  T  F  | 0.46
  F  T  | 0.26
  F  F  | 0.54

Join factors "fair". Join P(pa | pr, at, fa) with P(fa) to form P(pa, fa | pr, at) (shown for pa = t; P(fa=T) = .9, P(fa=F) = .1):

  pa pr at fa | P(pa|pr,at,fa)  P(fa) | joint P(pa,fa|pr,at)
  t  T  T  T  | 0.9  0.9 | 0.81
  t  T  T  F  | 0.1  0.1 | 0.01
  t  T  F  T  | 0.7  0.9 | 0.63
  t  T  F  F  | 0.1  0.1 | 0.01
  t  F  T  T  | 0.7  0.9 | 0.63
  t  F  T  F  | 0.1  0.1 | 0.01
  t  F  F  T  | 0.2  0.9 | 0.18
  t  F  F  F  | 0.1  0.1 | 0.01

P(at) = .8 and P(pr | at) are unchanged.

Marginalize out "fair". Sum fair out of P(pa, fa | pr, at) to obtain P(pa | pr, at):

  pa pr at | P(pa|pr,at)
  t  T  T  | 0.81 + 0.01 = 0.82
  t  T  F  | 0.63 + 0.01 = 0.64
  t  F  T  | 0.63 + 0.01 = 0.64
  t  F  F  | 0.18 + 0.01 = 0.19

P(at) = .8 and P(pr | at) are unchanged.

Marginalize out "fair" (result). The network is now attend, prepared, pass, with P(at) = .8, P(pr | at) as before, and:

  pa pr at | P(pa|pr,at)
  t  T  T  | 0.82
  t  T  F  | 0.64
  t  F  T  | 0.64
  t  F  F  | 0.19

Join factors "prepared". Join P(pa | pr, at) with P(pr | at) to form P(pa, pr | at):

  pa pr at | P(pa|pr,at)  P(pr|at) | joint P(pa,pr|at)
  t  T  T  | 0.82  0.74 | 0.6068
  t  T  F  | 0.64  0.46 | 0.2944
  t  F  T  | 0.64  0.26 | 0.1664
  t  F  F  | 0.19  0.54 | 0.1026

Join factors "prepared" (marginalize). Sum prepared out of P(pa, pr | at) to obtain P(pa | at): for pa = t, at = T: 0.6068 + 0.1664 = 0.7732; for pa = t, at = F: 0.2944 + 0.1026 = 0.3970.

Join factors "prepared" (result). The network is now attend, pass, with P(at) = .8 and:

  pa at | P(pa|at)
  t  T  | 0.7732
  t  F  | 0.397

Join factors. Join P(pa | at) with P(at) to form the joint P(pa, at), then normalize to answer the query:

  pa at | P(pa|at)  P(at) | joint P(pa,at) | normalized P(at|pa)
  t  T  | 0.7732  0.8 | 0.6186 | 0.886
  t  F  | 0.3970  0.2 | 0.0794 | 0.114

Join factors (result). Final answer: P(at = T | pa = t) ≈ 0.89, so a student who passed the exam attended class with probability about 0.89.

Bayesian network inference: Big picture Exact inference is intractable –There exist techniques to speed up computations, but worst-case complexity is still exponential except in some classes of networks Approximate inference –Sampling, variational methods, message passing / belief propagation…

Approximate Inference: Sampling (a particle-based method).

Sampling – the basics... Scrooge McDuck gives you an ancient coin. He wants to know P(H). You have no homework, and nothing good is on television, so you toss it 1 million times. You obtain 700,000 heads and 300,000 tails. What is P(H)?

Sampling – the basics... Exactly: P(H) ≈ 700,000 / 1,000,000 = 0.7. Why? Because the relative frequency of heads converges to P(H) as the number of tosses grows.

Monte Carlo Method. Who is more likely to win: Green or Purple? What is the probability that Green wins, P(G)? Two ways to solve this:
1. Compute the exact probability.
2. Play 100,000 games and see how many times Green wins.
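Option 2 is a few lines of simulation. A sketch (the actual game is in the slide's figure and isn't specified here, so a biased coin stands in for it, with Green assumed to win 60% of the time):

```python
# Monte Carlo estimate of a win probability: simulate many games
# and count wins. The "game" below is an assumed stand-in.
import random

def green_wins_one_game():
    return random.random() < 0.6   # assumption: Green wins 60% of games

N = 100_000
wins = sum(green_wins_one_game() for _ in range(N))
print(wins / N)                    # Monte Carlo estimate of P(G)
```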

Approximate Inference. Simulation has a name: sampling. Sampling is a hot topic in machine learning, and it's really simple. Basic idea:
–Draw N samples from a sampling distribution S.
–Compute an approximate posterior probability.
–Show this converges to the true probability P.
Why sample?
–Learning: get samples from a distribution you don't know.
–Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination).

Forward Sampling. Network: Cloudy → Sprinkler, Cloudy → Rain, (Sprinkler, Rain) → WetGrass. CPTs:

  P(C):          +c: 0.5   -c: 0.5
  P(S | +c):     +s: 0.1   -s: 0.9
  P(S | -c):     +s: 0.5   -s: 0.5
  P(R | +c):     +r: 0.8   -r: 0.2
  P(R | -c):     +r: 0.2   -r: 0.8
  P(W | +s, +r): +w: 0.99  -w: 0.01
  P(W | +s, -r): +w: 0.90  -w: 0.10
  P(W | -s, +r): +w: 0.90  -w: 0.10
  P(W | -s, -r): +w: 0.01  -w: 0.99

Samples: +c, -s, +r, +w; -c, +s, -r, +w; ...
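Forward sampling is just this ancestral pass in code: sample each variable in topological order, conditioning on its already-sampled parents. A sketch using the slide's CPT values:

```python
# Forward (prior) sampling on the Cloudy/Sprinkler/Rain/WetGrass net.
import random

def sample_net():
    c = random.random() < 0.5                  # P(+c) = 0.5
    s = random.random() < (0.1 if c else 0.5)  # P(+s | c)
    r = random.random() < (0.8 if c else 0.2)  # P(+r | c)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    w = random.random() < p_w                  # P(+w | s, r)
    return c, s, r, w

samples = [sample_net() for _ in range(10_000)]
print(sum(w for c, s, r, w in samples) / len(samples))  # estimate of P(+w)
```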

Forward Sampling. This process generates samples with probability S_PS(x1, ..., xn) = ∏_i P(xi | parents(Xi)) = P(x1, ..., xn), i.e. the BN's joint probability. Let N_PS(x1, ..., xn) be the number of samples of an event; then lim_{N→∞} N_PS(x1, ..., xn) / N = P(x1, ..., xn). I.e., the sampling procedure is consistent.

Example. We'll get a bunch of samples from the BN:
+c, -s, +r, +w
+c, +s, +r, +w
-c, +s, +r, -w
+c, -s, +r, +w
-c, -s, -r, +w
If we want to know P(W):
–We have counts <+w: 4, -w: 1>.
–Normalize to get P(W) = <+w: 0.8, -w: 0.2>.
–This will get closer to the true distribution with more samples.
–We can estimate anything else too: P(C | +w)? P(C | +r, +w)? P(C | -r, -w)?
–Fast: can use fewer samples if there is less time (what's the drawback?)

Rejection Sampling. Let's say we want P(C):
–No point keeping all samples around; just tally counts of C as we go.
Let's say we want P(C | +s):
–Same thing: tally C outcomes, but ignore (reject) samples that don't have S = +s.
–This is called rejection sampling.
–It is also consistent for conditional probabilities (i.e., correct in the limit).
Samples: +c, -s, +r, +w; +c, +s, +r, +w; -c, +s, +r, -w; +c, -s, +r, +w; -c, -s, -r, +w
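Rejection sampling is then a filter over the forward samples. A sketch, reusing samples from the forward-sampling code above, for P(+c | +s):

```python
# Keep only samples consistent with the evidence S = +s, then tally C.
# Note the weakness: if the evidence is unlikely, almost everything
# is rejected -- the problem likelihood weighting addresses next.
kept = [c for (c, s, r, w) in samples if s]
print(sum(kept) / len(kept))   # estimate of P(+c | +s)
```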

Sampling Example. There are two cups: the first contains 1 penny and 1 quarter; the second contains 2 quarters. Say I pick a cup uniformly at random, then pick a coin randomly from that cup. It's a quarter (yes!). What is the probability that the other coin in that cup is also a quarter?
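A worked answer (added here; the slide leaves it as a question): the other coin is a quarter exactly when we drew from the second cup, so by Bayes' rule,

```latex
P(\text{cup 2} \mid Q)
  = \frac{P(Q \mid \text{cup 2})\, P(\text{cup 2})}{P(Q)}
  = \frac{1 \cdot \tfrac{1}{2}}{\tfrac{1}{2} \cdot \tfrac{1}{2} + \tfrac{1}{2} \cdot 1}
  = \frac{1/2}{3/4} = \frac{2}{3}.
```

Simulating many cup-and-coin draws and rejecting the non-quarter outcomes is a quick way to confirm the 2/3, which is presumably why this example sits in the sampling lecture.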

Likelihood Weighting. Problem with rejection sampling:
–If the evidence is unlikely, you reject a lot of samples.
–You don't exploit your evidence as you sample.
–Consider P(B | +a): most samples come out (-b, -a) and get rejected; only the rare samples with +a, such as (+b, +a) and (-b, +a), survive.
Idea: fix the evidence variables and sample the rest.
Problem: the sample distribution is not consistent!
Solution: weight each sample by the probability of the evidence given its parents.

Likelihood Weighting. Sampling distribution when z is sampled and e is fixed evidence: S_WE(z, e) = ∏_i P(zi | parents(Zi)). Now samples have weights w(z, e) = ∏_j P(ej | parents(Ej)). Together, the weighted sampling distribution is consistent: S_WE(z, e) · w(z, e) = ∏_i P(zi | parents(Zi)) · ∏_j P(ej | parents(Ej)) = P(z, e).

Likelihood Weighting. (Same Cloudy/Sprinkler/Rain/WetGrass network and CPTs as in the forward-sampling example.) Samples: +c, +s, +r, +w; ...
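A sketch of likelihood weighting on this network, for the query P(C | +s, +w): the evidence variables are fixed rather than sampled, and each sample is weighted by the probability of the evidence given its sampled parents (CPT values as on the slide).

```python
# Likelihood weighting for P(C | S=+s, W=+w).
import random

def weighted_sample():
    w = 1.0
    c = random.random() < 0.5                  # sample Cloudy from its prior
    w *= 0.1 if c else 0.5                     # evidence S=+s: weight by P(+s | c)
    r = random.random() < (0.8 if c else 0.2)  # sample Rain given Cloudy
    w *= 0.99 if r else 0.90                   # evidence W=+w: weight by P(+w | +s, r)
    return c, w

pairs = [weighted_sample() for _ in range(10_000)]
total = sum(w for _, w in pairs)
print(sum(w for c, w in pairs if c) / total)   # estimate of P(+c | +s, +w)
```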

Likelihood Weighting Example.
 Inference: sum over the weights that match the query value; divide by the total sample weight.
 What is P(C | +w, +r)?
(Sample table with columns Cloudy, Rainy, Sprinkler, WetGrass, Weight.)

Likelihood Weighting. Likelihood weighting is good:
–We have taken evidence into account as we generate the sample.
–E.g. here, W's value will get picked based on the evidence values of S, R.
–More of our samples will reflect the state of the world suggested by the evidence.
Likelihood weighting doesn't solve all our problems:
–Evidence influences the choice of downstream variables, but not upstream ones (C isn't more likely to get a value matching the evidence).
We would like to consider evidence when we sample every variable.

Markov Chain Monte Carlo*. Idea: instead of sampling from scratch, create samples that are each like the last one. Procedure: resample one variable at a time, conditioned on all the rest, but keep the evidence fixed. E.g., for P(B | +c), resample B and A in turn while C stays fixed: (+b, +a, +c) → (-b, +a, +c) → (-b, -a, +c) → ... Properties: the samples are not independent (in fact they're nearly identical), but the sample averages are still consistent estimators! What's the point: both upstream and downstream variables condition on evidence.

Gibbs Sampling
1. Set all evidence variables E to e.
2. Do forward sampling to obtain an initial state x1, ..., xn.
3. Repeat:
  1. Pick a variable Xi uniformly at random.
  2. Resample xi' from p(Xi | x1, ..., x(i-1), x(i+1), ..., xn).
  3. Set all other xj' = xj.
  4. The new sample is x1', ..., xn'.
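A sketch of this loop on the Cloudy/Sprinkler/Rain/WetGrass network for P(C | +s, +w), using the CPT values given earlier. Each per-variable conditional only involves the variable's Markov blanket (next slide); for brevity this sketch scans C and R in turn rather than picking uniformly at random, and starts from an arbitrary state instead of a forward sample.

```python
# Gibbs sampling for P(C | S=+s, W=+w); systematic scan over C and R.
import random

def gibbs(n_steps=50_000):
    c, r = True, True                      # arbitrary initial state
    count_c = 0
    for _ in range(n_steps):
        # resample C from p(C | +s, r) ∝ P(C) P(+s | C) P(r | C)
        pt = 0.5 * 0.1 * (0.8 if r else 0.2)
        pf = 0.5 * 0.5 * (0.2 if r else 0.8)
        c = random.random() < pt / (pt + pf)
        # resample R from p(R | c, +s, +w) ∝ P(R | c) P(+w | +s, R)
        pt = (0.8 if c else 0.2) * 0.99
        pf = (0.2 if c else 0.8) * 0.90
        r = random.random() < pt / (pt + pf)
        count_c += c
    return count_c / n_steps

print(gibbs())   # estimate of P(+c | +s, +w)
```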

Markov Blanket. The Markov blanket of X consists of:
1. All parents of X.
2. All children of X.
3. All parents of children of X (except X itself).
X is conditionally independent of all other variables in the BN, given the variables in its Markov blanket.

Inference Algorithms Exact algorithms –Elimination algorithm –Sum-product algorithm –Junction tree algorithm Sampling algorithms –Importance sampling –Markov chain Monte Carlo Variational algorithms –Mean field methods –Sum-product algorithm and variations –Semidefinite relaxations

Summary. Sampling can be your salvation: it is the dominant approach to approximate inference in BNs. Approaches:
–Forward (prior) sampling
–Rejection sampling
–Likelihood-weighted sampling
–Gibbs sampling