Still More Uncertainty
Administrivia: Read Ch. 14, 15
Axioms of Probability Theory
Just 3 are enough to build the entire theory!
1. Range of probabilities: 0 ≤ P(A) ≤ 1
2. P(true) = 1 and P(false) = 0
3. Probability of a disjunction of events: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Bayes rules!
Simple proof from the definition of conditional probability.
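The one-line proof, written out. By the definition of conditional probability,

```latex
P(A \mid B) = \frac{P(A \wedge B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \wedge B)}{P(A)},
```

so P(A ∧ B) = P(B|A) P(A), and substituting into the first equation gives Bayes' rule:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.
```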
Independence
A and B are independent iff P(AB) = P(A)P(B).
Conditional Independence
A & B are not independent, since P(A|B) < P(A).
Conditional Independence
But A & B are made conditionally independent by C: P(A|C) = P(A|B,C).
Joint Distribution
All you need to know: it can answer any question, by inference by enumeration.
But... exponential in both time & space.
Solution: exploit conditional independence.
Sample Bayes Net
Structure: Earthquake → Radio; Earthquake → Alarm ← Burglary; Alarm → Nbr1Calls; Alarm → Nbr2Calls.
Pr(B=t) = 0.05, Pr(B=f) = 0.95.
Pr(A=t|E,B): e,b: 0.9 (0.1); e,¬b: 0.2 (0.8); ¬e,b: 0.85 (0.15); ¬e,¬b: 0.01 (0.99). (Values in parentheses are Pr(A=f|E,B).)
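A minimal sketch of inference by enumeration on this network, in Python. Pr(B) and the Pr(A|E,B) rows follow the table above (assuming the four rows are e,b / e,¬b / ¬e,b / ¬e,¬b in that order); the remaining CPTs (Pr(E), Pr(N1|A), Pr(N2|A)) are made-up placeholder values, since the slide does not give them.

```python
from itertools import product

# CPTs for the burglary network. Pr(B) and Pr(A|E,B) follow the slide;
# the others are illustrative placeholders.
P_B = {True: 0.05, False: 0.95}
P_E = {True: 0.03, False: 0.97}                      # placeholder
P_A = {(True, True): 0.9, (True, False): 0.2,        # Pr(A=t | E, B)
       (False, True): 0.85, (False, False): 0.01}
P_N1 = {True: 0.9, False: 0.05}                      # Pr(N1=t | A), placeholder
P_N2 = {True: 0.7, False: 0.01}                      # Pr(N2=t | A), placeholder

def joint(b, e, a, n1, n2):
    """Full joint via the chain rule along the network structure.
    Radio is omitted: it is neither observed nor queried, so it sums out to 1."""
    def cond(p_true, value):
        return p_true if value else 1.0 - p_true
    return (P_B[b] * P_E[e]
            * cond(P_A[(e, b)], a)
            * cond(P_N1[a], n1)
            * cond(P_N2[a], n2))

def query_burglary(n1, n2):
    """P(B | N1=n1, N2=n2) by summing the joint over the hidden variables E and A."""
    scores = {}
    for b in (True, False):
        scores[b] = sum(joint(b, e, a, n1, n2)
                        for e, a in product((True, False), repeat=2))
    z = sum(scores.values())
    return {b: p / z for b, p in scores.items()}

print(query_burglary(n1=True, n2=True))
```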
Given Parents, X is Independent of Non-Descendants
Given Markov Blanket, X is Independent of All Other Nodes
MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))
[Example network: Sprinkler On → Grass Wet ← Rained]
Galaxy Quest
Two astronomers, in Chile & Alaska, measure (M1, M2) the number (N) of stars. Normally there is a chance of error, e, of undercounting or overcounting by 1 star. But sometimes, with probability f, a telescope can be out of focus (events F1 and F2), in which case the affected scientist will undercount by 3 stars.
Structures
[Figure: three candidate network structures over F1, F2, N, M1, M2 for the problem above]
Inference in BNs
We generally want to compute Pr(X), or Pr(X|E) where E is (conjunctive) evidence.
The graphical independence representation yields efficient inference, with computations organized by network topology.
Two simple algorithms: variable elimination (VE) and junction trees.
Variable Elimination Procedure
The initial potentials are the CPTs in the BN.
Repeat until only the query variable remains:
1. Choose another variable to eliminate.
2. Multiply all potentials that contain the variable.
3. If there is no evidence for the variable, sum the variable out and replace the original potentials by the new result; else, remove the variable based on the evidence.
Finally, normalize the remaining potential to get the final distribution over the query variable. (The two factor operations this relies on are sketched below.)
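A minimal sketch of the two factor operations (multiply, sum out) for Boolean variables; a "potential" is represented as a tuple of variable names plus a table mapping value tuples to numbers. This is an illustrative fragment, not a full VE implementation: evidence handling and the choice of elimination ordering are omitted.

```python
from itertools import product

# A factor is (vars, table): vars is a tuple of variable names,
# table maps a tuple of Boolean values (one per variable) to a number.

def multiply(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + tuple(v for v in vars2 if v not in vars1)
    table = {}
    for values in product((True, False), repeat=len(out_vars)):
        assign = dict(zip(out_vars, values))
        table[values] = (t1[tuple(assign[v] for v in vars1)]
                         * t2[tuple(assign[v] for v in vars2)])
    return out_vars, table

def sum_out(factor, var):
    """Eliminate var by summing it out of the factor."""
    vars_, t = factor
    i = vars_.index(var)
    out_vars = vars_[:i] + vars_[i + 1:]
    table = {}
    for values, p in t.items():
        key = values[:i] + values[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return out_vars, table

# Example: eliminate B from P(B) * P(A|B)
f_b = (("B",), {(True,): 0.05, (False,): 0.95})
f_ab = (("A", "B"), {(True, True): 0.9, (True, False): 0.01,
                     (False, True): 0.1, (False, False): 0.99})
print(sum_out(multiply(f_b, f_ab), "B"))   # -> marginal factor over A
```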
Junction Tree Inference Algorithm
Incorporate evidence: for each evidence variable, go to one table that includes that variable, set to 0 all entries that disagree with the evidence, and renormalize this potential.
Upward step: pass messages to parents.
Downward step: pass messages to children.
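A tiny sketch of just the "incorporate evidence" step on one table, reusing the factor representation from the previous block; message passing itself is not shown.

```python
def incorporate_evidence(factor, var, observed_value):
    """Zero out entries that disagree with the evidence, then renormalize the potential."""
    vars_, t = factor
    i = vars_.index(var)
    zeroed = {values: (p if values[i] == observed_value else 0.0)
              for values, p in t.items()}
    z = sum(zeroed.values())
    return vars_, {values: p / z for values, p in zeroed.items()}
```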
Learning
Parameter estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian learning.
Parameters for a Bayesian network.
Learning the structure of Bayesian networks.
Topics
Bayesian networks overview.
Inference: variable elimination, junction trees.
Parameter estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian learning.
Parameters for a Bayesian network.
Learning the structure of Bayesian networks.
Coin Flip: Which coin will I use?
P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9
P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3
Uniform prior: all hypotheses are equally likely before we make any observations.
What is the probability of heads after two experiments (heads, then tails)? Your estimate?
Most likely coin: C2. Best estimate for P(H): P(H|C2) = 0.5.
(P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = P(C2) = P(C3) = 1/3.)
Maximum Likelihood Estimate
The hypothesis that best fits the observed data, assuming a uniform prior.
After 2 experiments (heads, tails): most likely coin is C2, so the best estimate for P(H) is P(H|C2) = 0.5.
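Working the numbers: with data d = (H, T), the likelihood of each coin is

```latex
P(d \mid C_1) = 0.1 \times 0.9 = 0.09, \quad
P(d \mid C_2) = 0.5 \times 0.5 = 0.25, \quad
P(d \mid C_3) = 0.9 \times 0.1 = 0.09,
```

so C2 is the maximum-likelihood hypothesis.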
Using Prior Knowledge
Should we always use a uniform prior?
Background knowledge: heads => we have a take-home midterm. Dan likes take-homes... => Dan is more likely to use a coin biased in his favor.
P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70   (P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9 as before)
Maximum A Posteriori (MAP) Estimate
The hypothesis that best fits the observed data, assuming a non-uniform prior.
Most likely coin: C3 (P(C3) = 0.70), so the best estimate for P(H) is P(H|C3) = 0.9.
Bayesian Estimate
Minimizes prediction error, given the data and (generally) assuming a non-uniform prior.
P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485.
Estimate: P(H|HT) = Σ_i P(H|C_i) P(C_i|HT) = 0.680.
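Working the numbers behind the last two slides: with the non-uniform prior and d = (H, T), the posteriors are proportional to P(d|C_i) P(C_i),

```latex
\begin{aligned}
P(C_1 \mid d) &\propto 0.09 \times 0.05 = 0.0045 &&\Rightarrow\ 0.035\\
P(C_2 \mid d) &\propto 0.25 \times 0.25 = 0.0625 &&\Rightarrow\ 0.481\\
P(C_3 \mid d) &\propto 0.09 \times 0.70 = 0.0630 &&\Rightarrow\ 0.485
\end{aligned}
```

so C3 is the MAP hypothesis, while the Bayesian estimate averages over all three:

```latex
P(H \mid d) = 0.1(0.035) + 0.5(0.481) + 0.9(0.485) \approx 0.680.
```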
Summary For Now
Maximum Likelihood Estimate: uniform prior; uses the most likely hypothesis.
Maximum A Posteriori Estimate: any prior; uses the most likely hypothesis.
Bayesian Estimate: any prior; uses a weighted combination of hypotheses.
Topics
Parameter estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian learning.
Parameters for a Bayesian network.
Learning the structure of Bayesian networks.
Parameter Estimation and Bayesian Networks
[Table of observed true/false values for the network's variables]
We have: Bayes net structure and observations.
We need: Bayes net parameters.
Parameter Estimation and Bayesian Networks
P(B) = ?
Start with a prior, add the data, and then compute either the MAP or the Bayesian estimate.
What Prior to Use?
The following are two common priors:
Binary variable: a Beta prior, conjugate to the binomial likelihood, so the posterior is again a Beta and is easy to compute.
Discrete variable: a Dirichlet prior, conjugate to the multinomial likelihood, so the posterior is again a Dirichlet.
One Prior: Beta Distribution
Beta(a,b): P(p) = Γ(a+b) / (Γ(a) Γ(b)) · p^(a-1) (1-p)^(b-1), for p in [0,1].
For any positive integer y, Γ(y) = (y-1)!
Beta Distribution
Example: flip a coin with a Beta distribution as the prior over p [prob(heads)].
Parameterized by two positive numbers: a, b.
The mean of the distribution, E[p], is a/(a+b).
Specify our prior belief as p = a/(a+b); specify confidence in this belief with high initial values for a and b.
Update the prior based on data by incrementing a for every heads outcome and incrementing b for every tails outcome.
So after h heads out of n flips, our posterior distribution says P(heads) = (a+h)/(a+b+n).
Parameter Estimation and Bayesian Networks
P(B|data) = ?
Prior: Beta(1,4), i.e., prior P(B) = 1/(1+4) = 20%, with equivalent sample size 5.
"+ data" gives the posterior Beta(3,7), i.e., P(B) = 0.3 and P(¬B) = 0.7.
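A minimal sketch of the Beta update from this slide: start from Beta(1,4) and add the observed counts. The counts themselves (2 rows with B, 3 with ¬B) are an assumption, chosen because they are what turns (1,4) into (3,7).

```python
def beta_update(a, b, heads, tails):
    """Posterior Beta parameters and posterior mean after observing counts."""
    a_post, b_post = a + heads, b + tails
    return a_post, b_post, a_post / (a_post + b_post)

# Prior Beta(1,4): P(B) = 1/(1+4) = 0.20, equivalent sample size 5.
# Data: 2 rows with B, 3 rows with ¬B (assumed; the slide only shows the result).
print(beta_update(1, 4, heads=2, tails=3))   # -> (3, 7, 0.3)
```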
Parameter Estimation and Bayesian Networks
Now estimate each row of the Alarm CPT: P(A|E,B), P(A|E,¬B), P(A|¬E,B), P(A|¬E,¬B) = ?
Give each row its own prior, e.g., Beta(2,3); "+ data" updates it, e.g., to Beta(3,4).
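For a CPT row such as P(A|E,¬B), the same Beta update applies, but only to the data rows whose parent values match that row. A small sketch, assuming the data is a list of dicts over the network's variables (the example data values are made up):

```python
def cpt_row_estimate(data, child, parents, a, b):
    """Beta(a,b)-smoothed estimate of P(child=t | parents) from matching rows."""
    rows = [r for r in data if all(r[v] == val for v, val in parents.items())]
    h = sum(1 for r in rows if r[child])
    return (a + h) / (a + b + len(rows))

# e.g. with a Beta(2,3) prior for the P(A | E, ¬B) row:
data = [{"E": True, "B": False, "A": True},
        {"E": True, "B": False, "A": False},
        {"E": False, "B": False, "A": False}]
print(cpt_row_estimate(data, "A", {"E": True, "B": False}, a=2, b=3))
```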
What if we don’t know structure?
Learning The Structure of Bayesian Networks
Search through the space... of possible network structures! (For now, assume we observe all variables.)
For each structure, learn parameters; pick the one that fits the observed data best.
Caveat: won't we end up fully connected? Problem!
Fix: when scoring, add a penalty for model complexity.
Learning The Structure of Bayesian Networks
Search through the space; for each structure, learn parameters; pick the one that fits the observed data best; penalize complex models.
Problem? An exponential number of networks! And we need to learn parameters for each!
Exhaustive search is out of the question. So what now?
Structure Learning as Search: Local Search
Start with some network structure; try to make a change (add, delete, or reverse an edge); see if the new network is any better.
What should the initial state be? A uniform prior over random networks? Based on prior knowledge? The empty network?
How do we evaluate networks? (A hill-climbing skeleton is sketched after the figure below; score functions follow.)
[Figure: example neighborhood in the search space: networks over A, B, C, D, E related by adding, deleting, or reversing a single edge]
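A hill-climbing skeleton for this search. Here `score` is a stand-in for whatever scoring function is chosen (e.g., the BIC score on the next slide), structures are represented as sets of directed edges, and acyclicity checking is omitted for brevity; this is a sketch, not a complete structure learner.

```python
def neighbors(edges, variables):
    """Structures reachable by adding, deleting, or reversing one edge."""
    out = []
    for x in variables:
        for y in variables:
            if x == y:
                continue
            if (x, y) in edges:
                out.append(edges - {(x, y)})                 # delete edge
                out.append(edges - {(x, y)} | {(y, x)})      # reverse edge
            else:
                out.append(edges | {(x, y)})                 # add edge
    return out

def hill_climb(variables, score, start=frozenset()):
    """Greedy local search over structures, by default starting from the empty network."""
    current = frozenset(start)
    while True:
        best = max(neighbors(current, variables), key=score, default=current)
        if score(best) <= score(current):
            return current
        current = frozenset(best)
```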
Score Functions
Bayesian Information Criterion (BIC): log P(D | BN) minus a penalty, where penalty = ½ (# parameters) × log(# data points).
MAP score: P(BN | D) ∝ P(D | BN) P(BN); P(BN) must decay exponentially with the number of parameters for this to work well.
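Written out, the BIC score of a candidate network BN with maximum-likelihood parameters on a dataset D of N points is

```latex
\text{BIC}(BN; D) \;=\; \log P(D \mid BN, \hat{\theta})
\;-\; \tfrac{1}{2}\,(\#\,\text{parameters})\,\log N.
```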
Naïve Bayes
[Structure: Class Value is the parent of every feature F1, F2, ..., FN]
Assume that features are conditionally independent given the class variable.
Works well in practice.
Forces probabilities towards 0 and 1.
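A minimal sketch of prediction under the Naïve Bayes assumption with Boolean features; the class names and all parameter values below are illustrative placeholders, not values from the slides.

```python
import math

# P(class) and P(F_i = t | class) for two classes; illustrative numbers only.
p_class = {"spam": 0.3, "ham": 0.7}
p_feat = {"spam": [0.8, 0.6, 0.1], "ham": [0.1, 0.3, 0.4]}

def predict(features):
    """Return P(class | features) under the conditional-independence assumption."""
    log_scores = {}
    for c in p_class:
        s = math.log(p_class[c])
        for p, f in zip(p_feat[c], features):
            s += math.log(p if f else 1.0 - p)
        log_scores[c] = s
    z = sum(math.exp(s) for s in log_scores.values())
    return {c: math.exp(s) / z for c, s in log_scores.items()}

print(predict([True, True, False]))   # posteriors tend to be pushed towards 0 or 1
```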
Tree Augmented Naïve Bayes (TAN) [Friedman, Geiger & Goldszmidt 1997]
[Structure: Class Value is the parent of every feature, plus a tree of edges among the features F1, ..., FN]
Models a limited set of dependencies between features.
Guaranteed to find the best such structure.
Runs in polynomial time.