Learning Bayesian Networks

Learning Bayesian Networks CSE 473 (slides © Daniel S. Weld)

Last Time: basic notions, Bayesian networks
Statistical learning
- Parameter learning (MAP, ML, Bayesian learning)
- Naïve Bayes
- Issues
- Structure learning
- Expectation Maximization (EM)
Dynamic Bayesian networks (DBNs)
Markov decision processes (MDPs)

Simple Spam Model
(Network: a Spam class node with children Nigeria, Sex, Nude.)
Naïve Bayes assumes the attributes are independent given the class (their parent). Correct?

Naïve Bayes: Incorrect, but Easy!
P(spam | X1 … Xn) = Πi P(spam | Xi)
How do we compute P(spam | Xi)? How do we compute P(Xi | spam)?
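A minimal sketch of the maximum-likelihood answer to "how do we compute P(Xi | spam)?": just count over labeled training data. The data representation, attribute names, and example emails below are made up for illustration.

```python
# Estimate P(Xi = 1 | spam) and P(Xi = 1 | not spam) by counting.
from collections import defaultdict

def estimate_conditionals(examples):
    """Maximum-likelihood estimates from (features, is_spam) pairs."""
    counts = {True: defaultdict(int), False: defaultdict(int)}
    totals = {True: 0, False: 0}
    for features, is_spam in examples:
        totals[is_spam] += 1
        for name, value in features.items():
            counts[is_spam][name] += value          # features are 0/1
    return {label: {name: c / totals[label] for name, c in counts[label].items()}
            for label in (True, False) if totals[label] > 0}

# Hypothetical training emails (attribute dict, spam label):
examples = [
    ({"nigeria": 1, "sex": 0, "nude": 1}, True),
    ({"nigeria": 0, "sex": 0, "nude": 0}, False),
    ({"nigeria": 1, "sex": 1, "nude": 0}, True),
]
print(estimate_conditionals(examples))
```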

Smooth with a Prior
P(Xi | S) = (ni + m·p) / (n + m)
where ni is the count of spam examples with attribute Xi, n is the total count of spam examples, p is a prior estimate of P(Xi | S), and m is the weight given to the prior. Note that if m = 10, it is like saying "I have seen 10 samples that make me believe that P(Xi | S) = p". Hence, m is referred to as the equivalent sample size.
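A small sketch of the m-estimate above, using the same ni, n, p, and m as in the formula; the example counts are made up.

```python
def m_estimate(n_i, n, p=0.5, m=10):
    """P(Xi | S) = (n_i + m*p) / (n + m): counts smoothed toward prior p."""
    return (n_i + m * p) / (n + m)

# With m = 10 and p = 0.5, two occurrences in three spam emails are pulled
# strongly toward the prior; with more data the prior matters little.
print(m_estimate(2, 3))        # ~0.538 instead of the raw 2/3
print(m_estimate(200, 300))    # ~0.661, close to the raw 200/300
```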

Probabilities: Important Detail!
P(spam | X1 … Xn) = Πi P(spam | Xi)
Is there any potential problem here? We are multiplying lots of small numbers: danger of underflow! For example, 0.5^57 ≈ 7 × 10^-18.
Solution? Use logs and add: p1 · p2 = e^(log(p1) + log(p2)). Always keep probabilities in log form.
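A quick illustration of the log trick: multiply many small probabilities by summing their logs. The probabilities below are placeholders.

```python
import math

probs = [0.5] * 57
naive_product = 1.0
for p in probs:
    naive_product *= p            # fine here, but underflows for long enough lists

log_sum = sum(math.log(p) for p in probs)
print(naive_product)              # ~6.9e-18
print(math.exp(log_sum))          # same value, computed safely in log space

# In practice, compare classes by their log scores directly and only
# exponentiate if a normalized probability is actually needed.
```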

P(S | X)
Easy to compute from data if X is discrete.
(Table of five training instances with a discrete attribute X and a Spam? label; counting the matching instances gives P(S | X) = 1/4, ignoring smoothing.)

P(S | X)
What if X is real valued?
(The same training instances, but with real-valued X such as 0.01, 0.02, 0.03, 0.05; the slide marks possible splits of these values at a threshold T.) What now?

Anything Else?
What about P(S | X = .04), a value that never appears in the training data?
(Histogram of the training X values between .01 and .05.)

Smooth with a Gaussian, then sum: "Kernel Density Estimation"
(Place a Gaussian kernel on each training value of X and sum them to get a smooth density over the range .01 to .05.)
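A minimal sketch of kernel density estimation for the one-dimensional attribute above. The bandwidth and the split of the training values into spam and non-spam are assumptions, since the original table did not survive extraction.

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def kde(x, samples, sigma=0.005):
    """Smoothed density at x: average of Gaussian kernels on the observed values."""
    return sum(gaussian(x, s, sigma) for s in samples) / len(samples)

spam_x = [0.03, 0.05]        # X values of spam instances (assumed labels)
ham_x = [0.01, 0.01, 0.02]   # X values of non-spam instances (assumed labels)

x = 0.04
p_x_given_spam = kde(x, spam_x)
p_x_given_ham = kde(x, ham_x)
prior_spam = len(spam_x) / (len(spam_x) + len(ham_x))
posterior = (p_x_given_spam * prior_spam) / (
    p_x_given_spam * prior_spam + p_x_given_ham * (1 - prior_spam))
print(posterior)             # P(S | X = 0.04) under these assumptions
```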

Spam?
From the smoothed densities we can read off values such as P(S | X = .023). What's with the shape?

Analysis
(Plot of the resulting estimate as a function of attribute value.)

Recap
Given a BN structure (with discrete or continuous variables), we can learn the parameters of the conditional probability tables.
(Example networks: Burglary/Earthquake → Alarm → N1, N2, and Spam → Nigeria, Sex, Nude.)

What if we don’t know structure?

Outline
Statistical learning
- Parameter learning (MAP, ML, Bayesian learning)
- Naïve Bayes
- Issues
- Structure learning
- Expectation Maximization (EM)
Dynamic Bayesian networks (DBNs)
Markov decision processes (MDPs)

Learning the Structure of Bayesian Networks
Search through the space of possible network structures! (For now, assume we observe all variables.)
- For each structure, learn parameters.
- Pick the one that fits the observed data best.
Caveat: won't we end up fully connected? When scoring, add a penalty proportional to model complexity.

Learning the Structure of Bayesian Networks
Search through the space: for each structure, learn parameters, and pick the one that fits the observed data best.
Problem? An exponential number of networks, and we need to learn parameters for each one! Exhaustive search is out of the question. So what now?

Learning the Structure of Bayesian Networks
Local search!
- Start with some network structure.
- Try to make a change (add, delete, or reverse an edge).
- See if the new network is any better.
What should be the initial state?

Initial Network Structure? Uniform prior over random networks? Network which reflects expert knowledge?

Learning BN Structure
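A hedged sketch of the local search described above, restricted to binary variables. The score here is a simple penalized log-likelihood (fit minus a cost per CPT parameter), not necessarily the scoring function used in the lecture; edge reversal is omitted, and the data, variable names, and penalty value are made up.

```python
import math
from itertools import product

def loglik(data, var, parents):
    """Data log-likelihood of one CPT, with add-one smoothing to avoid log(0)."""
    counts = {}
    for row in data:
        key = (tuple(row[p] for p in parents), row[var])
        counts[key] = counts.get(key, 0) + 1
    total = 0.0
    for pa in product([0, 1], repeat=len(parents)):
        n0, n1 = counts.get((pa, 0), 0), counts.get((pa, 1), 0)
        for n in (n0, n1):
            if n:
                total += n * math.log((n + 1) / (n0 + n1 + 2))
    return total

def score(data, structure, penalty=1.0):
    # structure maps each variable to its set of parents
    return sum(loglik(data, v, sorted(ps)) - penalty * 2 ** len(ps)
               for v, ps in structure.items())

def acyclic(structure):
    def reaches(a, b, seen=()):
        # True if b equals a or is an ancestor of a (following parent links)
        return a == b or any(reaches(p, b, seen + (a,))
                             for p in structure[a] if p not in seen)
    return not any(reaches(p, v) for v, ps in structure.items() for p in ps)

def hill_climb(data, variables, penalty=1.0):
    structure = {v: set() for v in variables}          # start with no edges
    best = score(data, structure, penalty)
    improved = True
    while improved:
        improved = False
        for child in variables:
            for parent in variables:
                if parent == child:
                    continue
                candidate = {v: set(ps) for v, ps in structure.items()}
                if parent in candidate[child]:
                    candidate[child].remove(parent)    # try deleting this edge
                else:
                    candidate[child].add(parent)       # try adding this edge
                if acyclic(candidate):
                    s = score(data, candidate, penalty)
                    if s > best:                       # accept any improving change
                        structure, best, improved = candidate, s, True
    return structure

# Made-up data: "nigeria" always matches "spam", "nude" is its complement.
data = [{"spam": s, "nigeria": s, "nude": 1 - s} for s in (0, 1) for _ in range(20)]
print(hill_climb(data, ["spam", "nigeria", "nude"]))
```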

The Big Picture
We described how to do MAP (and ML) learning of a Bayes net (including structure). How would Bayesian learning (of BNs) differ?
- Find all possible networks.
- Calculate their posteriors.
- When doing inference, return a weighted combination of predictions from all networks!
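A toy sketch of that weighted combination; the candidate posteriors and per-network predictions below are made-up numbers.

```python
# Bayesian model averaging: weight each candidate network's prediction
# by its posterior probability and sum.
candidates = [
    # (posterior P(network | data), prediction P(query | network, data))
    (0.6, 0.90),
    (0.3, 0.70),
    (0.1, 0.20),
]
prediction = sum(post * pred for post, pred in candidates)
print(prediction)   # 0.77: a weighted combination, not the single best model's answer
```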

Outline
Statistical learning
- Parameter learning (MAP, ML, Bayesian learning)
- Naïve Bayes
- Issues
- Structure learning
- Expectation Maximization (EM)
Dynamic Bayesian networks (DBNs)
Markov decision processes (MDPs)

Hidden Variables
But we can't observe the disease variable. Can't we learn without it?

We could...
But we'd get a fully-connected network, with 708 parameters (vs. 78). Much harder to learn!

Chicken & Egg Problem
If we knew which training instances (patients) had the disease, it would be easy to learn P(symptom | disease). But we can't observe disease, so we don't.
If we knew the parameters, e.g. P(symptom | disease), it would be easy to estimate whether a patient had the disease. But we don't know these parameters.

Expectation Maximization (EM): high-level version
- Pretend we do know the parameters: initialize randomly.
- [E step] Compute the probability of each instance having each possible value of the hidden variable.
- [M step] Treating each instance as fractionally having both values, compute the new parameter values.
- Iterate until convergence!

Candy Problem
Given a bowl of mixed candies, can we learn the contents of each bag, and what percentage of each is in the bowl?

Naïve Candy
Flavor: cherry / lime
Wrapper: red / green
Hole: 0 / 1

Simpler Problem
Mixture of two distributions. Known: the form of the distributions, their variance, and the mixing proportion (= .5). We just need the mean of each.
(Plot: the two-component mixture over values from .01 to .09.)

ML Mean of a Single Gaussian
μML = argminμ Σi (xi – μ)², i.e. the sample mean.
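A hedged sketch of this simpler problem: data drawn from a 50/50 mixture of two Gaussians with a known, shared variance, where EM only updates the two means. The sample data, variance, and initial guesses are made up; note that for a single Gaussian the ML mean (the argmin above) is just the sample mean.

```python
import math
import random

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
sigma = 0.01
data = ([random.gauss(0.03, sigma) for _ in range(100)] +
        [random.gauss(0.07, sigma) for _ in range(100)])

print(sum(data) / len(data))      # ML mean if we (wrongly) fit a single Gaussian

mu1, mu2 = 0.01, 0.09             # arbitrary initial guesses
for _ in range(50):
    # E step: responsibility of component 1 for each point (mixing weight 0.5)
    r = [gaussian(x, mu1, sigma) / (gaussian(x, mu1, sigma) + gaussian(x, mu2, sigma))
         for x in data]
    # M step: each mean is a responsibility-weighted average of the data
    mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)

print(mu1, mu2)                   # should end up near 0.03 and 0.07
```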

EM
Randomly initialize θ, θF1, θW1, θH1, θF2, θW2, θH2.
[E step] In the observable case we would estimate θ from the observed counts of candies in bags 1 and 2. But the bag is hidden, so calculate expected amounts instead:
θ(1) = Σj P(bag=1 | flavorj, wrapperj, holesj) / N
and similarly for the other parameters.
[M step] Recompute the parameters.
Iterate.

Generate Data
Assume the (true) model is θ = 0.5, θF1 = θW1 = θH1 = 0.8, θF2 = θW2 = θH2 = 0.3. Sampling 1000 candies gives the counts:

            W=red          W=green
            H=1    H=0     H=1    H=0
F=cherry    273     93     104     90
F=lime       79    100      94    167
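A hedged sketch of EM on the counts above. The hidden variable is which bag each candy came from; the parameter names follow the slide (θ = P(Bag=1), θF1 = P(cherry | Bag=1), and so on). The initial values are arbitrary, and the result is one local maximum of the likelihood (the bag labels may swap).

```python
# counts[(flavor, wrapper, holes)] with 1 = cherry / red / has hole
counts = {
    (1, 1, 1): 273, (1, 1, 0): 93,  (1, 0, 1): 104, (1, 0, 0): 90,
    (0, 1, 1): 79,  (0, 1, 0): 100, (0, 0, 1): 94,  (0, 0, 0): 167,
}
N = sum(counts.values())

def bern(p, x):
    return p if x else 1 - p

# theta = P(Bag=1); f/w/h hold [bag-1, bag-2] attribute parameters.
theta, f, w, h = 0.6, [0.6, 0.4], [0.6, 0.4], [0.6, 0.4]

for _ in range(100):
    # E step: expected number of bag-1 candies of each kind
    exp1 = {}
    for kind, n in counts.items():
        fl, wr, ho = kind
        p1 = theta * bern(f[0], fl) * bern(w[0], wr) * bern(h[0], ho)
        p2 = (1 - theta) * bern(f[1], fl) * bern(w[1], wr) * bern(h[1], ho)
        exp1[kind] = n * p1 / (p1 + p2)
    # M step: re-estimate all parameters from the expected counts
    n1 = sum(exp1.values())
    theta = n1 / N
    f = [sum(e for k, e in exp1.items() if k[0]) / n1,
         sum(counts[k] - e for k, e in exp1.items() if k[0]) / (N - n1)]
    w = [sum(e for k, e in exp1.items() if k[1]) / n1,
         sum(counts[k] - e for k, e in exp1.items() if k[1]) / (N - n1)]
    h = [sum(e for k, e in exp1.items() if k[2]) / n1,
         sum(counts[k] - e for k, e in exp1.items() if k[2]) / (N - n1)]

print(theta, f, w, h)   # one local maximum of the likelihood; compare with the true model
```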