CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer 1
What action next? Percepts → Actions ↔ Environment. Static vs. Dynamic; Fully vs. Partially Observable; Perfect vs. Noisy; Deterministic vs. Stochastic; Instantaneous vs. Durative
What action next? Algorithms: blind search, heuristic search, mini-max & expectimax, MDPs (& POMDPs), reinforcement learning, state estimation
Knowledge Representation HMMs Bayesian networks First-order logic Description logic Constraint networks Markov logic networks …
Learning ?
What is Machine Learning?
Machine Learning: the study of algorithms that improve their performance at some task with experience. (Diagram: Data → Machine Learning → Understanding.)
Exponential Growth in Data. (Diagram: Data → Machine Learning → Understanding.)
Supremacy of Machine Learning. Machine learning is the preferred approach to: speech recognition and natural language processing; web search (result ranking); computer vision; medical outcomes analysis; robot control; computational biology; sensor networks; … This trend is accelerating: improved machine learning algorithms; improved data capture, networking, and faster computers; software too complex to write by hand; new sensors / IO devices; demand for self-customization to user and environment.
Space of ML Problems: what is being learned × type of supervision (e.g., experience, feedback).
- Discrete function: Classification (labeled examples); Clustering (nothing)
- Continuous function: Regression (labeled examples)
- Policy: Apprenticeship learning (labeled examples); Reinforcement learning (reward)
Classification: from data to discrete classes.
Spam filtering: data → prediction.
Text classification: company home page vs. personal home page vs. university home page vs. …
Weather prediction.
Object detection: example training images for each orientation (Prof. H. Schneiderman).
Reading a noun (vs. verb) [Rustandi et al., 2005].
The classification pipeline: training → testing.
Machine Learning: supervised learning (parametric or non-parametric), unsupervised learning, and reinforcement learning. Non-parametric supervised methods: nearest neighbor, kernel density estimation, support vector machines.
Machine Learning — supervised, parametric methods.
Y discrete: probability of class given features — (1) learn P(Y) and P(X|Y), then apply Bayes' rule; (2) learn P(Y|X) with gradient descent. Non-probabilistic linear classifier: learn with gradient descent. Decision trees: greedy search; pruning.
Y continuous: Gaussians, learned in closed form. Linear functions: (1) learned in closed form; (2) using gradient descent.
Regression: predicting a numeric value.
Stock market.
Weather prediction revisited: temperature.
Clustering: discovering structure in data.
Machine Learning — unsupervised learning methods: agglomerative clustering, k-means, expectation maximization (EM), principal component analysis (PCA).
Clustering Data: Group similar things
Clustering images: a set of images [Goldberger et al.].
Clustering web search results.
In Summary: what is being learned × type of supervision (e.g., experience, feedback).
- Discrete function: Classification (labeled examples); Clustering (nothing)
- Continuous function: Regression (labeled examples)
- Policy: Apprenticeship learning (labeled examples); Reinforcement learning (reward)
30 Key Concepts
Classifier. Hypothesis: a function for labeling examples. (Figure: training examples labeled + and −, with unlabeled points, marked ?, to be classified.)
32 Generalization Hypotheses must generalize to correctly classify instances not in the training data. Simply memorizing training examples is a consistent hypothesis that does not generalize.
© Daniel S. Weld 33 A Learning Problem
© Daniel S. Weld 34 Hypothesis Spaces
© Daniel S. Weld 35 Why is Learning Possible? Experience alone never justifies any conclusion about any unseen instance. Learning occurs when PREJUDICE meets DATA! Learning a “Frobnitz”
(Figure: positive and negative examples — a Frobnitz vs. not a Frobnitz.)
Bias. The nice word for prejudice is “bias” (different from “bias” in statistics). What kind of hypotheses will you consider — what is the allowable range of functions used when approximating? And which hypotheses do you prefer?
© Daniel S. Weld 38 Some Typical Biases Occam’s razor “It is needless to do more when less will suffice” – William of Occam, died 1349 of the Black plague MDL – Minimum description length Concepts can be approximated by ... conjunctions of predicates... by linear functions... by short decision trees
ML = Function Approximation. Target concept c(x); there may not be any perfect fit. Classification ≈ learning discrete functions h(x), e.g. h(x) = contains('nigeria', x) ∧ contains('wire-transfer', x).
Learning as Optimization. Preference bias as a loss function: minimize loss over the training data (really, the test data). Loss(h, data) = error(h, data) + complexity(h) — error + regularization. Methods: closed form, greedy search, gradient ascent.
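A minimal sketch of this view (Python; all names and data are hypothetical, not from the slides): gradient descent on squared error plus an L2 complexity penalty, for a one-parameter linear hypothesis h(x) = w·x.

import random

def loss(w, data, lam):
    # Loss(h, data) = error(h, data) + complexity(h)
    error = sum((y - w * x) ** 2 for x, y in data)
    return error + lam * w ** 2          # L2 regularization

def fit(data, lam=0.1, lr=0.01, steps=2000):
    w = 0.0
    for _ in range(steps):
        # gradient of the regularized loss with respect to w
        grad = sum(-2 * x * (y - w * x) for x, y in data) + 2 * lam * w
        w -= lr * grad                   # descend the loss surface
    return w

data = [(0.1 * i, 3 * (0.1 * i) + random.gauss(0, 0.1)) for i in range(10)]
print(fit(data))   # close to 3, pulled slightly toward 0 by the penalty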
Bias / Variance Tradeoff (slide from T. Dietterich).
Variance: E[(h(x*) − E[h(x*)])²] — how much h(x*) varies between training sets. Reducing variance risks underfitting.
Bias: E[h(x*)] − f(x*) — the average error of h(x*). Reducing bias risks overfitting.
Note: inductive bias vs. estimator bias.
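A small simulation of these definitions (a sketch; the predictor and noise model are hypothetical): estimate bias and variance of h(x*) by redrawing the training set many times.

import random

f_star = 1.0                        # true value f(x*)

def train_and_predict():
    # fresh training set: 10 noisy observations of f(x*);
    # the hypothesis h(x*) is simply their mean
    ys = [f_star + random.gauss(0, 1) for _ in range(10)]
    return sum(ys) / len(ys)

preds = [train_and_predict() for _ in range(10000)]
mean_h = sum(preds) / len(preds)
variance = sum((h - mean_h) ** 2 for h in preds) / len(preds)   # E[(h - E[h])^2]
bias = mean_h - f_star                                          # E[h(x*)] - f(x*)
print(bias, variance)   # bias ≈ 0, variance ≈ 0.1 here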
Regularization
Overfitting. Hypothesis H is overfit when there exists an alternative hypothesis H’ such that H has smaller error than H’ on the training examples, but H’ has smaller error than H on the test examples. Causes of overfitting: training set is too small; large number of features. A big problem in machine learning. Solutions: bias, regularization; validation set.
Overfitting. (Figure: accuracy vs. model complexity, e.g. number of nodes in a decision tree — accuracy on the training data keeps rising, while accuracy on the test data peaks and then falls.)
© Daniel S. Weld 51 Learning Bayes Nets Learning Parameters for a Bayesian Network Fully observable Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Hidden variables (EM algorithm) Learning Structure of Bayesian Networks
What’s in a Bayes Net? (Network: Earthquake and Burglary are parents of Alarm; Alarm is the parent of Nbr1Calls and Nbr2Calls; Earthquake is also the parent of Radio.) Each node has a CPT, e.g. Pr(B=t), Pr(B=f); and Pr(A|E,B): e,b: 0.9 (0.1); e,¬b: 0.2 (0.8); ¬e,b: 0.85 (0.15); ¬e,¬b: 0.01 (0.99).
Parameter Estimation and Bayesian Networks. Data (one column per variable):
E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...
We have: the Bayes net structure and observations. We need: the Bayes net parameters.
Parameter Estimation and Bayesian Networks (same table). P(B) = ? B is true in 2 of the 5 rows, so the ML estimate is P(B) = 0.4, and P(¬B) = 1 − P(B) = 0.6.
Parameter Estimation and Bayesian Networks (same table). P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ?
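A minimal sketch of maximum-likelihood estimation by counting, using the five rows above (column order E, B, R, A, J, M). Note what happens for a condition that never occurs in so little data:

data = [  # the five rows of the table above
    (True,  False, True,  True,  False, True),
    (False, False, False, False, False, True),
    (False, True,  False, True,  True,  True),
    (False, False, False, True,  True,  True),
    (False, True,  False, False, False, False),
]

# ML estimate: P(B) = #(B = true) / #rows
p_b = sum(1 for e, b, r, a, j, m in data if b) / len(data)
print(p_b)   # 0.4, so P(¬B) = 0.6

def p_a_given(e_val, b_val):
    # ML estimate of P(A | E=e_val, B=b_val) by counting matching rows
    rows = [a for e, b, r, a, j, m in data if e == e_val and b == b_val]
    return sum(rows) / len(rows) if rows else None

print(p_a_given(True, False))   # 1.0, from a single matching row
print(p_a_given(True, True))    # None: E=t, B=t never occurs -- one reason to use priors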
Parameter Estimation and Bayesian Networks Coin
Coin Flip. Three coins: P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9. Which coin will I use? Prior: the probability of a hypothesis before we make any observations. Here, a uniform prior — all hypotheses are equally likely: P(C1) = P(C2) = P(C3) = 1/3.
Experiment 1: Heads. Which coin did I use? P(C1|H) = 0.066, P(C2|H) = 0.333, P(C3|H) = 0.600. Posterior: the probability of a hypothesis given data.
Terminology. Prior: probability of a hypothesis before we see any data. Uniform prior: a prior that makes all hypotheses equally likely. Posterior: probability of a hypothesis after we see some data. Likelihood: probability of the data given a hypothesis.
Experiment 2: Tails. Now which coin did I use? P(C1|HT) = 0.21, P(C2|HT) = 0.58, P(C3|HT) = 0.21.
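A short sketch of these updates (assuming the three-coin setup above): start from the uniform prior and apply Bayes' rule once per flip.

p_heads = {'C1': 0.1, 'C2': 0.5, 'C3': 0.9}
prior = {'C1': 1 / 3, 'C2': 1 / 3, 'C3': 1 / 3}

def update(belief, outcome):
    # one step of Bayes' rule: posterior ∝ likelihood × prior
    post = {c: (p_heads[c] if outcome == 'H' else 1 - p_heads[c]) * belief[c]
            for c in belief}
    z = sum(post.values())               # normalizer = probability of the outcome
    return {c: post[c] / z for c in post}

after_h = update(prior, 'H')             # C1: 0.066, C2: 0.333, C3: 0.600
after_ht = update(after_h, 'T')          # C1: 0.21,  C2: 0.58,  C3: 0.21
print(after_ht)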
Your Estimate? What is the probability of heads after two experiments? Most likely coin: C2, so the best estimate for P(H) is P(H|C2) = 0.5. Maximum Likelihood Estimate: the most likely hypothesis given the observed data, assuming a uniform prior.
Using Prior Knowledge. Should we always use a uniform prior? Background knowledge: heads means we have to buy Dan chocolate, and Dan likes chocolate — so Dan is more likely to use a coin biased in his favor. We can encode this in the prior: P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70.
Experiment 1: Heads. Which coin did I use? P(C1|H) = 0.006, P(C2|H) = 0.165, P(C3|H) = 0.829. Compare with the posterior under the uniform prior after Experiment 1: P(C1|H) = 0.066, P(C2|H) = 0.333, P(C3|H) = 0.600.
Experiment 2: Tails. Which coin did I use? P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485.
Your Estimate? What is the probability of heads after two experiments? Most likely coin: C3 (prior P(C3) = 0.70), so the best estimate for P(H) is P(H|C3) = 0.9. Maximum A Posteriori (MAP) Estimate: the most likely hypothesis given the observed data, assuming a non-uniform prior.
Did We Do The Right Thing? P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. C2 and C3 are almost equally likely — yet MAP commits to C3 alone.
A Better Estimate. Recall P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. Weight each coin's prediction by its posterior:
P(H|HT) = Σi P(H|Ci) P(Ci|HT) = 0.1·0.035 + 0.5·0.481 + 0.9·0.485 ≈ 0.68
Bayesian Estimate: P(H|HT) = Σi P(H|Ci) P(Ci|HT) ≈ 0.68. The Bayesian estimate minimizes prediction error, given the data, assuming an arbitrary prior.
Comparison. After more experiments: HTHHHHHHHHH.
ML (Maximum Likelihood): P(H) = 0.5; after 10 experiments: P(H) = 0.9.
MAP (Maximum A Posteriori): P(H) = 0.9; after 10 experiments: P(H) = 0.9.
Bayesian: P(H) = 0.68; after 10 experiments: P(H) = 0.9.
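A sketch reproducing this comparison for the first two flips (helper names hypothetical): ML takes the most likely coin under the uniform prior, MAP the most likely coin under the chocolate prior, and the Bayesian estimate weights every coin by its posterior.

p_heads = {'C1': 0.1, 'C2': 0.5, 'C3': 0.9}
chocolate_prior = {'C1': 0.05, 'C2': 0.25, 'C3': 0.70}

def posterior(prior, flips):
    post = dict(prior)
    for f in flips:                      # multiply in each flip's likelihood
        for c in post:
            post[c] *= p_heads[c] if f == 'H' else 1 - p_heads[c]
    z = sum(post.values())
    return {c: post[c] / z for c in post}

uniform = posterior({c: 1 / 3 for c in p_heads}, 'HT')
biased = posterior(chocolate_prior, 'HT')

ml = p_heads[max(uniform, key=uniform.get)]          # 0.5
map_est = p_heads[max(biased, key=biased.get)]       # 0.9
bayes = sum(biased[c] * p_heads[c] for c in biased)  # ≈ 0.68
print(ml, map_est, bayes)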
Summary.
Maximum Likelihood Estimate — prior: uniform; hypothesis: the most likely. Easy to compute.
Maximum A Posteriori Estimate — prior: any; hypothesis: the most likely. Still easy to compute; incorporates prior knowledge.
Bayesian Estimate — prior: any; hypothesis: weighted combination. Minimizes error; great when data is scarce; potentially much harder to compute.
Bayesian Learning. Use Bayes' rule: P(Y|X) = P(X|Y) P(Y) / P(X), i.e. posterior = data likelihood × prior / normalization. Or equivalently: P(Y|X) ∝ P(X|Y) P(Y).
Parameter Estimation and Bayesian Networks (same table). P(B) = ? Prior + data → now compute either the MAP or the Bayesian estimate.
What Prior to Use? Previously you knew it was one of only three coins; now it is more complicated. Two common conjugate priors:
- Binary variable: Beta prior. The likelihood is binomial, and the posterior is again a Beta — easy to compute.
- Discrete variable: Dirichlet prior. The likelihood is multinomial, and the posterior is again a Dirichlet — easy to compute.
Beta Distribution
Example: flip a coin, with a Beta distribution as the prior over p = prob(heads).
1. Parameterized by two positive numbers: a, b.
2. The mean of the distribution, E[p], is a/(a+b).
3. Specify our prior belief for p as a/(a+b).
4. Specify confidence in this belief with high initial values for a and b.
Update the prior based on data by incrementing a for every heads outcome and incrementing b for every tails outcome. So after h heads out of n flips, the posterior says P(heads) = (a+h)/(a+b+n).
One Prior: the Beta distribution. Beta(a, b): P(p) = Γ(a+b) / (Γ(a) Γ(b)) · p^(a−1) (1−p)^(b−1). For any positive integer y, Γ(y) = (y−1)!.
Parameter Estimation and Bayesian Networks (same table). P(B|data) = ? Prior Beta(1,4) “+ data” (B true in 2 rows, false in 3) = posterior Beta(3,7), giving P(B) = 0.3 and P(¬B) = 0.7. The prior says P(B) = 1/(1+4) = 20%, with equivalent sample size 5.
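The conjugate update above in a few lines (counts taken from the table):

a, b = 1, 4                     # prior Beta(1,4): P(B) = 1/(1+4) = 20%, equivalent sample size 5
n_true, n_false = 2, 3          # B is true in 2 rows of the table, false in 3
a_post, b_post = a + n_true, b + n_false    # posterior Beta(3,7)
print(a_post / (a_post + b_post))           # posterior mean P(B) = 3/10 = 0.3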
Parameter Estimation and Bayesian Networks (same table). P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ? With prior Beta(2,3): prior + data = Beta(3,4).
Output of Learning (from the same table). Pr(B=t), Pr(B=f); Pr(A|E,B): e,b: 0.9 (0.1); e,¬b: 0.2 (0.8); ¬e,b: 0.85 (0.15); ¬e,¬b: 0.01 (0.99).
Did Learning Work Well? (Same learned CPTs.) We can easily calculate P(data) for the learned parameters.
Learning with Continuous Variables. Earthquake magnitude: Pr(E=x) is a Gaussian with mean μ = ? and variance σ² = ?
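Assuming the Gaussian model, the ML estimates are just the sample mean and the (1/N) sample variance — a tiny sketch with hypothetical magnitudes:

xs = [2.1, 1.9, 2.4, 1.8, 2.3]    # hypothetical observed earthquake magnitudes
mu = sum(xs) / len(xs)            # ML estimate of the mean
var = sum((x - mu) ** 2 for x in xs) / len(xs)   # ML variance (divide by N, not N-1)
print(mu, var)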
Using Bayes Nets for Classification. One method of classification: use a probabilistic model! The features are observed random variables Fi and Y is the query variable. Use probabilistic inference to compute the most likely Y — you already know how to do this inference.
A Popular Structure: Naïve Bayes. (Network: class variable Y is the parent of features F1, F2, F3, …, FN.) Assume the features are conditionally independent given the class variable. Works surprisingly well for classification (predicting the right class), but forces probabilities towards 0 and 1.
Naïve Bayes. Naïve Bayes assumption — features are independent given the class: P(X1, …, Xn | Y) = Πi P(Xi | Y). How many parameters? Suppose X is composed of n binary features: the full joint P(X|Y) needs O(2ⁿ) parameters, while naïve Bayes needs only O(n).
A Spam Filter. A naïve Bayes spam filter. Data: a collection of emails, labeled spam or ham (note: someone has to hand-label all this data!), split into training, held-out, and test sets. Classifiers: learn on the training set, tune on the held-out set, test on new emails. Example spam: “Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …” “TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99” Example ham: “Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.”
Naïve Bayes for Text. Bag-of-words naïve Bayes: predict the unknown class label (spam vs. ham); assume the evidence features (e.g., the words) are independent. Warning: subtly different assumptions than before! Generative model; tied distributions and bag-of-words. Usually each variable gets its own conditional probability distribution P(F|Y); in a bag-of-words model, each position is identically distributed and all positions share the same conditional probabilities P(W|C). Why make this assumption? Note: W is the word at position i, not the i-th word in the dictionary!
Example: Spam Filtering. Model: what are the parameters? Class prior: ham 0.66, spam 0.33. (Tables: per-class word probabilities — the, to, and, of, you, a, with, from, ….) Where do these come from?
Example: Overfitting. Posteriors determined by relative probabilities (odds ratios): south-west: inf, nation: inf, morally: inf, nicely: inf, extent: inf, seriously: inf, …, and screens: inf, minute: inf, guaranteed: inf, $: inf, delivery: inf, signature: inf, …. What went wrong here?
Generalization and Overfitting. Relative-frequency parameters will overfit the training data! It is unlikely that every occurrence of “money” is 100% spam, or that every occurrence of “office” is 100% ham. And what about all the words that don't occur in the training set at all? In general, we can't go around giving unseen events zero probability. As an extreme case, imagine using the entire email as the only feature: we would get the training data perfect (if the labeling is deterministic) but wouldn't generalize at all. Just making the bag-of-words assumption gives some generalization, but not enough. To generalize better, we need to smooth or regularize the estimates.
Estimation: Smoothing Problems with maximum likelihood estimates: If I flip a coin once, and it’s heads, what’s the estimate for P(heads)? What if I flip 10 times with 8 heads? What if I flip 10M times with 8M heads? Basic idea: We have some prior expectation about parameters (here, the probability of heads) Given little evidence, we should skew towards our prior Given a lot of evidence, we should listen to the data
Estimation: Smoothing. Relative frequencies are the maximum likelihood estimates: θML = argmaxθ P(D|θ), giving PML(x) = count(x) / total count. In Bayesian statistics, we instead think of the parameters as just another random variable, with its own distribution.
Estimation: Laplace Smoothing. Laplace's estimate: pretend you saw every outcome once more than you actually did — PLAP(x) = (c(x) + 1) / (N + |X|). For the data H, H, T: PLAP(H) = (2+1)/(3+2) = 3/5. We can derive this as a MAP estimate with a Dirichlet prior (the Bayesian justification).
Estimation: Laplace Smoothing. Laplace's estimate (extended): pretend you saw every outcome k extra times — PLAP,k(x) = (c(x) + k) / (N + k|X|). Laplace with k = 0 is just the ML estimate; k is the strength of the prior. Laplace for conditionals — smooth each condition independently: PLAP,k(x|y) = (c(x,y) + k) / (c(y) + k|X|).
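The same estimates as one small helper (a sketch; the H, H, T data is the example above):

def laplace(count_x, total, num_outcomes, k=1):
    # pretend every outcome was seen k extra times
    return (count_x + k) / (total + k * num_outcomes)

# H, H, T: 2 heads out of 3 flips, 2 possible outcomes
print(laplace(2, 3, 2, k=0))     # 0.667 -- ML / relative frequency
print(laplace(2, 3, 2, k=1))     # 0.6   -- Laplace's estimate, (2+1)/(3+2)
print(laplace(2, 3, 2, k=100))   # ≈ 0.5 -- a strong prior dominates the data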
Real NB: Smoothing. For real classification problems, smoothing is critical. New odds ratios: helvetica: 11.4, seems: 10.8, group: 10.2, ago: 8.4, areas: …, and verdana: 28.8, Credit: 28.4, ORDER: 27.2, money: …. Do these make more sense?
NB with Bag of Words for text classification.
Learning phase: prior P(Y): count how many documents come from each topic. P(Xi|Y): for each of m topics, count how many times word Xi appears in documents of that topic (+k for the prior) and divide by the total word count for that topic (+k|words|).
Test phase: for each document, use the naïve Bayes decision rule.
Probabilities: Important Detail! P(spam | X1 … Xn) = Πi P(spam | Xi) — any more potential problems here? We are multiplying lots of small numbers: danger of underflow! E.g., 0.5⁵⁷ ≈ 7×10⁻¹⁸. Solution? Use logs and add: p1 · p2 = e^(log p1 + log p2). Always keep probabilities in log form.
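A compact sketch tying the last few slides together (toy data; all names hypothetical): bag-of-words counts, Laplace smoothing with strength k, and the decision rule computed entirely in log space to avoid underflow.

import math
from collections import Counter

docs = [('buy cheap money money', 'spam'),
        ('meeting tomorrow morning', 'ham'),
        ('cheap money wire transfer', 'spam'),
        ('lunch tomorrow ?', 'ham')]

k = 1
vocab = {w for text, _ in docs for w in text.split()}
doc_counts = Counter(y for _, y in docs)            # for the prior P(Y)
word_counts = {y: Counter() for y in doc_counts}
for text, y in docs:
    word_counts[y].update(text.split())

def log_posterior(text, y):
    # log P(y) + sum_i log P(w_i | y), with Laplace smoothing
    total = sum(word_counts[y].values())
    lp = math.log(doc_counts[y] / len(docs))
    for w in text.split():
        lp += math.log((word_counts[y][w] + k) / (total + k * len(vocab)))
    return lp

def classify(text):
    return max(doc_counts, key=lambda y: log_posterior(text, y))

print(classify('cheap money now'))   # spam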
Example Bayes’ Net: Car
What if we don’t know structure?
Learning the Structure of Bayesian Networks. Search through the space of possible network structures! (For now, still assume we can observe all values.) For each structure, learn the parameters, as just shown; then pick the structure that fits the observed data best: calculate P(data).
(Figure: three candidate network structures over variables A, B, C, D, E.) Two problems: the fully connected network will be most probable, and there are exponentially many structures.
Learning the Structure of Bayesian Networks. Search through the space of possible network structures; for each, learn the parameters and pick the structure that fits the observed data best (calculate P(data)). Two problems and their fixes: the fully connected network will be most probable — add a penalty term (regularization) for model complexity; exponentially many structures — use local search.
Structure Learning as Search: local search.
1. Start with some network structure.
2. Try to make a change (add, delete, or reverse an edge).
3. See if the new network is any better.
What should the initial state be? A uniform prior over random networks? Based on prior knowledge? An empty network?
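A sketch of this loop (score and neighbors are hypothetical helpers; score could be the BIC of the next slide):

def local_search(initial_structure, score, neighbors, max_steps=1000):
    # Greedy hill-climbing over network structures.
    # score(s): how well structure s fits the data
    # neighbors(s): structures reachable by adding, deleting, or reversing one edge
    current = initial_structure
    for _ in range(max_steps):
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):
            return current               # local optimum reached
        current = best
    return current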
(Figure: a sequence of candidate structures over A, B, C, D, E explored by local search.)
Score Functions. Bayesian Information Criterion (BIC): log P(D | BN) − penalty, where penalty = ½ (# parameters) × log(# data points). MAP score: P(BN | D) ∝ P(D | BN) P(BN); P(BN) must decay exponentially with the number of parameters for this to work well.
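The BIC score in a few lines (a sketch; the caller supplies the log-likelihood and the counts):

import math

def bic_score(log_likelihood, num_params, num_points):
    # log P(D | BN) minus the complexity penalty above
    penalty = 0.5 * num_params * math.log(num_points)
    return log_likelihood - penalty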
Learning as Optimization. Preference bias as a loss function: minimize loss over the training data (really, the test data). Loss(h, data) = error(h, data) + complexity(h) — error + regularization. Methods: closed form, greedy search, gradient ascent.
© Daniel S. Weld 120 Topics Learning Parameters for a Bayesian Network Fully observable Maximum Likelihood (ML), Maximum A Posteriori (MAP) Bayesian Hidden variables (EM algorithm) Learning Structure of Bayesian Networks