
© Daniel S. Weld 1 Statistical Learning CSE 573 Lecture 16 (overlapping slides fix several errors)

© Daniel S. Weld 2 Logistics: Team Meetings; Midterm (open book, notes); Studying: see AIMA exercises

© Daniel S. Weld Topics: Agency, Problem Spaces, Search, Knowledge Representation & Inference, Planning, Supervised Learning (Logic-Based, Probabilistic), Reinforcement Learning

© Daniel S. Weld 4 Topics: Parameter Estimation (Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian, Continuous case); Learning Parameters for a Bayesian Network (Naive Bayes, Maximum Likelihood estimates, Priors); Learning Structure of Bayesian Networks

Coin Flip. Which coin will I use? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3. Prior: probability of a hypothesis before we make any observations.

Coin Flip. Which coin will I use? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3. Uniform prior: all hypotheses are equally likely before we make any observations.

Experiment 1: Heads. Which coin did I use? P(C1|H) = ?, P(C2|H) = ?, P(C3|H) = ? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3.

Experiment 1: Heads. Which coin did I use? P(C1|H) = 0.066, P(C2|H) = 0.333, P(C3|H) = 0.600. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3. Posterior: probability of a hypothesis given data.

Terminology. Prior: probability of a hypothesis before we see any data. Uniform prior: a prior that makes all hypotheses equally likely. Posterior: probability of a hypothesis after we have seen some data. Likelihood: probability of the data given a hypothesis.
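To make the update concrete, here is a minimal Python sketch (not from the slides) that applies Bayes rule to the three-coin example; the function and variable names are my own.

```python
# Posterior over the three coins from the running example, via Bayes rule.
# Coin biases and priors come from the slides; the code itself is illustrative.

def coin_posterior(priors, p_heads, observations):
    """Return P(coin | observations): posterior is proportional to prior * likelihood."""
    post = list(priors)
    for obs in observations:                       # each obs is 'H' or 'T'
        post = [p * (ph if obs == 'H' else 1.0 - ph)
                for p, ph in zip(post, p_heads)]
    z = sum(post)                                  # normalizer = P(observations)
    return [p / z for p in post]

p_heads = [0.1, 0.5, 0.9]        # P(H|C1), P(H|C2), P(H|C3)
uniform = [1/3, 1/3, 1/3]        # uniform prior over the coins

print(coin_posterior(uniform, p_heads, "H"))     # ~ [0.067, 0.333, 0.600]
print(coin_posterior(uniform, p_heads, "HT"))    # ~ [0.209, 0.581, 0.209]
```

The two printed posteriors match the values on the Experiment 1 and Experiment 2 slides.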

Experiment 2: Tails. Which coin did I use? P(C1|HT) = ?, P(C2|HT) = ?, P(C3|HT) = ? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3.

Experiment 2: Tails. Which coin did I use? P(C1|HT) = 0.21, P(C2|HT) = 0.58, P(C3|HT) = 0.21. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3.

Your Estimate? What is the probability of heads after two experiments? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3. Most likely coin: C2. Best estimate for P(H): P(H|C2) = 0.5.

Your Estimate? Most likely coin: C2, P(C2) = 1/3. Best estimate for P(H): P(H|C2) = 0.5. Maximum Likelihood Estimate: the best hypothesis that fits the observed data, assuming a uniform prior.

Using Prior Knowledge. Should we always use a uniform prior? Background knowledge: heads => we have a take-home midterm; Dan doesn’t like take-homes… => Dan is more likely to use a coin biased in his favor. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9.

Using Prior Knowledge. We can encode it in the prior: P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9.

Experiment 1: Heads. Which coin did I use? P(C1|H) = ?, P(C2|H) = ?, P(C3|H) = ? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70.

Experiment 1: Heads. Which coin did I use? P(C1|H) = 0.006, P(C2|H) = 0.165, P(C3|H) = 0.829. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70. Compare with the uniform-prior (ML) posterior after Exp 1: P(C1|H) = 0.066, P(C2|H) = 0.333, P(C3|H) = 0.600.

Experiment 2: Tails. Which coin did I use? P(C1|HT) = ?, P(C2|HT) = ?, P(C3|HT) = ? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70.

Experiment 2: Tails. Which coin did I use? P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70.

Your Estimate? What is the probability of heads after two experiments? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70. Most likely coin: C3. Best estimate for P(H): P(H|C3) = 0.9.

Your Estimate? Most likely coin: C3, P(C3) = 0.70. Best estimate for P(H): P(H|C3) = 0.9. Maximum A Posteriori (MAP) Estimate: the best hypothesis that fits the observed data, assuming a non-uniform prior.
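On this reading, ML and MAP are just argmaxes of the posterior under different priors; a small sketch reusing the hypothetical coin_posterior helper above:

```python
# ML: most likely coin under the uniform prior. MAP: most likely coin under the
# informative prior from the slides. Both then report P(H | most likely coin).
informative = [0.05, 0.25, 0.70]     # P(C1), P(C2), P(C3) encoding background knowledge

def point_estimate(priors, p_heads, observations):
    post = coin_posterior(priors, p_heads, observations)
    best = max(range(len(post)), key=lambda i: post[i])   # index of the most likely coin
    return p_heads[best]

print(point_estimate(uniform, p_heads, "HT"))       # ML estimate: 0.5 (coin C2)
print(point_estimate(informative, p_heads, "HT"))   # MAP estimate: 0.9 (coin C3)
```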

Did We Do The Right Thing? P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9.

Did We Do The Right Thing? P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9. C2 and C3 are almost equally likely.

A Better Estimate. Recall: P(H|HT) = Σi P(H|Ci) P(Ci|HT). P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485; P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9.

Bayesian Estimate. P(H|HT) = Σi P(H|Ci) P(Ci|HT) = 0.1·0.035 + 0.5·0.481 + 0.9·0.485 ≈ 0.68. P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485; P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9. Bayesian Estimate: minimizes prediction error, given data and (generally) assuming a non-uniform prior.

Comparison. After more experiments: HTH^8 (H, T, then eight more heads). ML (Maximum Likelihood): P(H) = 0.5; after 10 experiments: P(H) = 0.9. MAP (Maximum A Posteriori): P(H) = 0.9; after 10 experiments: P(H) = 0.9. Bayesian: P(H) = 0.68; after 10 experiments: P(H) = 0.9.
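The Bayesian estimate is the posterior-weighted average of the coin biases rather than a single argmax. The sketch below reuses the hypothetical helpers above and reproduces the comparison numbers, assuming the 10 experiments are H, T, and then eight more heads:

```python
def bayesian_estimate(priors, p_heads, observations):
    """Posterior predictive: P(H | data) = sum_i P(H|Ci) * P(Ci | data)."""
    post = coin_posterior(priors, p_heads, observations)
    return sum(ph * p for ph, p in zip(p_heads, post))

ten_flips = "HT" + "H" * 8                                 # H, T, then eight heads
print(bayesian_estimate(informative, p_heads, "HT"))       # ~ 0.68
print(point_estimate(uniform, p_heads, ten_flips))         # ML after 10 flips: 0.9
print(point_estimate(informative, p_heads, ten_flips))     # MAP after 10 flips: 0.9
print(bayesian_estimate(informative, p_heads, ten_flips))  # Bayesian after 10 flips: ~ 0.9
```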

Comparison. ML (Maximum Likelihood): easy to compute. MAP (Maximum A Posteriori): still easy to compute; incorporates prior knowledge. Bayesian: minimizes error => great when data is scarce; potentially much harder to compute.

Summary For Now. Maximum Likelihood Estimate: uniform prior, pick the most likely hypothesis. Maximum A Posteriori Estimate: any prior, pick the most likely hypothesis. Bayesian Estimate: any prior, use a weighted combination of hypotheses.

Continuous Case. In the previous example, we chose from a discrete set of three coins. In general, we have to pick from a continuous distribution of biased coins.
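A standard way to handle the continuous case (assumed here, since the slides only show the plots) is a Beta prior over the coin bias θ: the Beta is conjugate to the Bernoulli, so observing h heads and t tails turns Beta(a, b) into Beta(a + h, b + t). The prior parameters below are illustrative.

```python
# Beta-Bernoulli sketch: ML, MAP, and Bayesian (posterior-mean) estimates of theta.
# Beta(1, 1) is the uniform prior; Beta(7, 3) stands in for background knowledge.

def beta_estimates(a, b, heads, tails):
    a_post, b_post = a + heads, b + tails                 # posterior is Beta(a_post, b_post)
    ml = heads / (heads + tails) if heads + tails else None
    map_est = (a_post - 1) / (a_post + b_post - 2) if a_post + b_post > 2 else None
    bayes = a_post / (a_post + b_post)                    # posterior mean
    return ml, map_est, bayes

print(beta_estimates(1, 1, 1, 1))    # uniform prior, data HT: all three give 0.5
print(beta_estimates(7, 3, 1, 1))    # informative prior, data HT: MAP and Bayes pulled toward heads
print(beta_estimates(7, 3, 9, 1))    # after 10 flips the data starts to dominate the prior
```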

Continuous Case

[Figure: prior, posterior after Exp 1 (Heads), and posterior after Exp 2 (Tails), shown for a uniform prior and for a prior with background knowledge]

Continuous Case. [Figure: posterior after 2 experiments, with a uniform prior and with background knowledge; the ML, MAP, and Bayesian estimates are marked]

After 10 Experiments… [Figure: posterior with a uniform prior and with background knowledge; the ML, MAP, and Bayesian estimates are marked]

After 100 Experiments...

© Daniel S. Weld 38 Topics: Parameter Estimation (Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian, Continuous case); Learning Parameters for a Bayesian Network (Naive Bayes, Maximum Likelihood estimates, Priors); Learning Structure of Bayesian Networks

© Daniel S. Weld 39 Review: Conditional Probability. P(A|B) is the probability of A given B; it assumes that B is the only information known. Defined by: P(A|B) = P(A ∧ B) / P(B). [Venn diagram of A, B, and A ∧ B]

© Daniel S. Weld 40 Conditional Independence. [Venn diagram of A, B, and A ∧ B] A and B are not independent, since P(A|B) < P(A).

© Daniel S. Weld 41 Conditional Independence. [Venn diagram of A, B, C, and their intersections] But: A and B are made independent by ¬C, i.e. P(A|¬C) = P(A|B,¬C).

© Daniel S. Weld 42 Bayes Rule. Simple proof from the definition of conditional probability: (1) P(E|H) = P(E ∧ H) / P(H) (def. cond. prob.); (2) P(H|E) = P(E ∧ H) / P(E) (def. cond. prob.); (3) P(E ∧ H) = P(E|H) P(H) (multiply by P(H) in line 1); (4) P(H|E) = P(E|H) P(H) / P(E) (substitute #3 in #2). QED.

© Daniel S. Weld 43 An Example Bayes Net. Nodes: Earthquake, Burglary, Alarm, Radio, Nbr1Calls, Nbr2Calls (Earthquake and Burglary are parents of Alarm; Earthquake is a parent of Radio; Alarm is a parent of Nbr1Calls and Nbr2Calls). CPTs: Pr(B=t), Pr(B=f); Pr(A|E,B): e,b: 0.9 (0.1); e,¬b: 0.2 (0.8); ¬e,b: 0.85 (0.15); ¬e,¬b: 0.01 (0.99).

© Daniel S. Weld 44 Given Parents, X is Independent of Non-Descendants

© Daniel S. Weld 45 Given Markov Blanket, X is Independent of All Other Nodes. MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))

Parameter Estimation and Bayesian Networks. Data (columns E B R A J M): T F T T F T; F F F F F T; F T F T T T; F F F T T T; F T F F F F; … We have: Bayes net structure and observations. We need: Bayes net parameters.

Parameter Estimation and Bayesian Networks. Data (columns E B R A J M): T F T T F T; F F F F F T; F T F T T T; F F F T T T; F T F F F F; … P(B) = ? Prior + data: now compute either the MAP or the Bayesian estimate.

Parameter Estimation and Bayesian Networks. Data (columns E B R A J M): T F T T F T; F F F F F T; F T F T T T; F F F T T T; F T F F F F; … P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ?

Parameter Estimation and Bayesian Networks. Data (columns E B R A J M): T F T T F T; F F F F F T; F T F T T T; F F F T T T; F T F F F F; … P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ? Prior + data: now compute either the MAP or the Bayesian estimate.
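A minimal sketch of estimating these CPT entries by counting, with Beta-style pseudo-counts standing in for the prior; the records are the five rows shown on the slide, and all names are my own.

```python
# Estimate P(B) and P(A | E, B) from complete data by counting.
# Each record assigns True/False to (E, B, R, A, J, M), mirroring T/F on the slide.
records = [
    dict(E=True,  B=False, R=True,  A=True,  J=False, M=True),
    dict(E=False, B=False, R=False, A=False, J=False, M=True),
    dict(E=False, B=True,  R=False, A=True,  J=True,  M=True),
    dict(E=False, B=False, R=False, A=True,  J=True,  M=True),
    dict(E=False, B=True,  R=False, A=False, J=False, M=False),
]

def estimate(records, child, parents, prior_true=1, prior_false=1):
    """Return {parent assignment: P(child=True | parents)}, smoothed by pseudo-counts."""
    counts = {}
    for r in records:
        key = tuple(r[p] for p in parents)
        n_true, n_total = counts.get(key, (0, 0))
        counts[key] = (n_true + r[child], n_total + 1)
    return {key: (n_true + prior_true) / (n_total + prior_true + prior_false)
            for key, (n_true, n_total) in counts.items()}

print(estimate(records, "B", []))           # P(B) from prior + counts
print(estimate(records, "A", ["E", "B"]))   # P(A | E, B) for parent combinations seen in the data
```

With pseudo-counts of 1 this is the Bayesian (posterior-mean) estimate under a Beta(1, 1) prior; setting them to 0 recovers the raw maximum likelihood counts.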

© Daniel S. Weld 50 Topics: Parameter Estimation (Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian, Continuous case); Learning Parameters for a Bayesian Network (Naive Bayes, Maximum Likelihood estimates, Priors); Learning Structure of Bayesian Networks

© Daniel S. Weld 51 Recap. Given a BN structure (with discrete or continuous variables), we can learn the parameters of the conditional probability tables. [Example structures: a naive Bayes spam model (Spam with children Nigeria, Nude, Sex) and the alarm network (Earthqk, Burgl, Alarm, N1, N2)]
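For the naive Bayes spam structure (class Spam with the word features as children), the same counting applies to each CPT; a tiny illustration with made-up e-mails, reusing the hypothetical estimate helper above:

```python
# Hypothetical labeled e-mails: whether each word appears, plus the Spam label.
mails = [
    dict(Spam=True,  Nigeria=True,  Nude=False, Sex=True),
    dict(Spam=True,  Nigeria=True,  Nude=True,  Sex=False),
    dict(Spam=False, Nigeria=False, Nude=False, Sex=False),
    dict(Spam=False, Nigeria=False, Nude=False, Sex=True),
]

print(estimate(mails, "Spam", []))            # P(Spam)
print(estimate(mails, "Nigeria", ["Spam"]))   # P(Nigeria | Spam), smoothed
```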

What if we don’t know structure?

Learning The Structure of Bayesian Networks. Search through the space… of possible network structures! (For now, assume we observe all variables.) For each structure, learn parameters; pick the one that fits the observed data best. Caveat: won’t we end up fully connected? When scoring, add a penalty ∝ model complexity. Problem!?
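One common way to realize "penalty ∝ model complexity" (my choice here, since the slide doesn't name a score) is a BIC-style score: the data log-likelihood minus (log N)/2 times the number of free parameters. A sketch for complete binary data, reusing the hypothetical records and estimate from above:

```python
import math

def bic_score(records, structure, alpha=1):
    """structure maps each variable to its parent list; higher scores are better."""
    n = len(records)
    loglik, n_params = 0.0, 0
    for var, parents in structure.items():
        cpt = estimate(records, var, parents, alpha, alpha)   # smoothed P(var=True | parents)
        n_params += 2 ** len(parents)        # one free parameter per parent assignment
        for r in records:
            p_true = cpt[tuple(r[p] for p in parents)]
            loglik += math.log(p_true if r[var] else 1.0 - p_true)
    return loglik - 0.5 * math.log(n) * n_params

# A sparse structure vs. a denser one, scored on the five records above:
sparse = {"E": [], "B": [], "R": ["E"], "A": ["E", "B"], "J": ["A"], "M": ["A"]}
dense = {**sparse, "M": ["A", "J", "E"]}     # extra parents cost extra parameters
print(bic_score(records, sparse), bic_score(records, dense))
```

The penalty grows with the number of CPT entries, so a fully connected network is only preferred if the extra edges buy enough likelihood.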

Learning The Structure of Bayesian Networks. Search through the space; for each structure, learn parameters; pick the one that fits the observed data best. Problem? Exponential number of networks! And we need to learn parameters for each! Exhaustive search is out of the question! So what now?

Learning The Structure of Bayesian Networks. Local search! Start with some network structure; try to make a change (add, delete, or reverse an edge); see if the new network is any better. What should be the initial state?
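A greedy hill-climbing sketch over single-edge changes, scored with the BIC-style function above; edge reversal is omitted and all names are my own.

```python
from itertools import permutations

def acyclic(structure):
    """Check the parent graph for cycles by depth-first search."""
    seen, on_path = set(), set()
    def visit(v):
        if v in on_path:
            return False                      # back edge: cycle
        if v in seen:
            return True
        on_path.add(v)
        ok = all(visit(p) for p in structure[v])
        on_path.discard(v)
        seen.add(v)
        return ok
    return all(visit(v) for v in structure)

def hill_climb(records, variables):
    structure = {v: [] for v in variables}    # start from the empty network
    best = bic_score(records, structure)
    improved = True
    while improved:
        improved = False
        for child, parent in permutations(variables, 2):
            candidate = {v: list(ps) for v, ps in structure.items()}
            if parent in candidate[child]:
                candidate[child].remove(parent)     # try deleting the edge parent -> child
            else:
                candidate[child].append(parent)     # try adding the edge parent -> child
            if not acyclic(candidate):
                continue
            score = bic_score(records, candidate)
            if score > best:
                structure, best, improved = candidate, score, True
    return structure, best

print(hill_climb(records, ["E", "B", "R", "A", "J", "M"]))
```

Starting from the empty network answers the "initial state" question one way; starting from an expert-provided structure is the other option raised on the next slide.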

Initial Network Structure? Uniform prior over random networks? Network which reflects expert knowledge?

© Daniel S. Weld 57 Learning BN Structure

The Big Picture. We described how to do MAP (and ML) learning of a Bayes net (including structure). How would Bayesian learning (of BNs) differ? Find all possible networks, calculate their posteriors, and when doing inference, return a weighted combination of predictions from all networks!