1
CS498-EA Reasoning in AI Lecture #20
Instructor: Eyal Amir
Fall Semester 2009
Who is in this class?
Some slides in this set were adapted from Eran Segal
2
Summary of last time: Inference
We presented the variable elimination (VE) algorithm
Specifically, VE for finding the marginal P(Xi) over one variable Xi from X1,…,Xn
Order the variables so that one variable Xj is eliminated at a time:
(a) Move unneeded terms (those not involving Xj) outside the summation over Xj
(b) Create a new potential function fXj(·) over the other variables appearing in the terms of the summation in (a)
Works for both BNs and MFs (Markov fields); a minimal code sketch of the two steps follows below
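A minimal sketch of the two steps above, assuming a toy chain A → B → C with binary variables; the CPT values and the dictionary-based factor representation are invented for illustration, not part of the lecture.

```python
from itertools import product

# Toy chain A -> B -> C with binary variables; CPT numbers are invented for illustration.
# A factor maps an assignment tuple (ordered by its "vars") to a number.
f_A  = {"vars": ("A",),     "table": {(0,): 0.6, (1,): 0.4}}
f_BA = {"vars": ("A", "B"), "table": {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}}
f_CB = {"vars": ("B", "C"), "table": {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6}}

def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    vars_ = tuple(dict.fromkeys(f["vars"] + g["vars"]))
    table = {}
    for assignment in product((0, 1), repeat=len(vars_)):
        a = dict(zip(vars_, assignment))
        table[assignment] = (f["table"][tuple(a[v] for v in f["vars"])] *
                             g["table"][tuple(a[v] for v in g["vars"])])
    return {"vars": vars_, "table": table}

def sum_out(f, var):
    """Step (b): eliminate `var` by summing it out, creating a new potential over the rest."""
    keep = tuple(v for v in f["vars"] if v != var)
    idx = f["vars"].index(var)
    table = {}
    for assignment, value in f["table"].items():
        key = tuple(x for i, x in enumerate(assignment) if i != idx)
        table[key] = table.get(key, 0.0) + value
    return {"vars": keep, "table": table}

# Marginal P(C): only factors mentioning the eliminated variable enter each product (step (a)).
f_B = sum_out(multiply(f_A, f_BA), "A")   # new potential over B after eliminating A
f_C = sum_out(multiply(f_B, f_CB), "B")   # new potential over C after eliminating B
print(f_C["table"])                       # approximately {(0,): 0.65, (1,): 0.35}; sums to 1
```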
3
Exact Inference
Treewidth methods:
Variable elimination
Clique tree algorithm
Treewidth
4
Today: Learning in Graphical Models
Parameter Estimation
Maximum Likelihood
Complete Observations
Naïve Bayes
Not-so-Naïve Bayes
5
Learning Introduction
So far, we assumed that the networks were given
Where do the networks come from?
Knowledge engineering with the aid of experts
Automated construction of networks: learn from examples or instances
6
Parameter estimation
Maximum likelihood estimation: maximize ∏i P(bi, ci, ei; w0, w1), where the product runs over the data instances i
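A hedged sketch of maximizing such a product numerically. The model below (toy_model) is a hypothetical stand-in, since the slide does not specify the form of P(b, c, e; w0, w1); the grid search is only for illustration, as one would normally maximize the log-likelihood analytically or by gradient methods.

```python
import math

def log_likelihood(data, model, w0, w1):
    """Sum of log P(b_i, c_i, e_i; w0, w1): the log of the product being maximized."""
    return sum(math.log(model(b, c, e, w0, w1)) for (b, c, e) in data)

def mle_grid(data, model, steps=99):
    """Brute-force the (w0, w1) pair maximizing the likelihood over a coarse grid."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(((w0, w1) for w0 in grid for w1 in grid),
               key=lambda w: log_likelihood(data, model, *w))

# Hypothetical model, purely for illustration (NOT the one in the lecture):
# B ~ Bernoulli(w0); given B, both C and E are ~ Bernoulli(w1 if B=1 else 0.1).
def toy_model(b, c, e, w0, w1):
    bern = lambda x, p: p if x else 1 - p
    p_child = w1 if b else 0.1
    return bern(b, w0) * bern(c, p_child) * bern(e, p_child)

data = [(1, 1, 0), (0, 0, 0), (1, 1, 1), (1, 0, 1)]
print(mle_grid(data, toy_model))
```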
7
Learning Introduction
Input: dataset of instances D = {d[1],...,d[m]}
Output: Bayesian network
Measures of success:
How close is the learned network to the original distribution?
Use distance measures between distributions
Often hard because we do not have the true underlying distribution
Instead, evaluate performance by how well the network predicts new, unseen examples ("test data"); see the sketch after this list
Classification accuracy
How close is the structure of the network to the true one?
Use a distance metric between structures
Hard because we do not know the true structure
Instead, ask whether the independencies learned hold in test data
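A minimal sketch of the "predict unseen examples" criterion above: average log-likelihood of held-out instances under the learned network. The `prob` callable is a placeholder for whatever probability the learned model assigns to a complete instance.

```python
import math

def avg_test_log_likelihood(test_data, prob):
    """Average log P(instance) over held-out data; higher means better generalization.
    `prob` maps an instance to its probability under the learned network (placeholder)."""
    return sum(math.log(prob(d)) for d in test_data) / len(test_data)

# Toy illustration with a "learned" distribution over one binary variable (made-up numbers).
learned = {0: 0.3, 1: 0.7}
test = [1, 1, 0, 1]
print(avg_test_log_likelihood(test, lambda d: learned[d]))
```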
8
Prior Knowledge
Prespecified structure and variables: learn only CPDs
Prespecified variables: learn network structure and CPDs
Hidden variables: learn hidden variables, structure, and CPDs
Complete/incomplete data
Missing data
Unobserved variables
9
Learning Bayesian Networks
[Figure: the Inducer takes Data and Prior information and outputs a Bayesian network over X1, X2, Y]
P(Y|X1,X2):
X1  X2   | y0    y1
x10 x20  | 1     0
x10 x21  | 0.2   0.8
x11 x20  | 0.1   0.9
x11 x21  | 0.02  0.98
10
Known Structure, Complete Data
Goal: Parameter estimation
Data does not contain missing values
[Figure: the Inducer takes the initial network structure over X1, X2, Y plus the complete input data, and outputs the CPD P(Y|X1,X2) shown on the previous slide]
Input data (complete):
X1  X2  Y
x10 x21 y0
x11 x20 y1
A counting-based sketch of this estimation follows below
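A minimal sketch of maximum-likelihood parameter estimation in this setting, assuming binary variables and the counting estimate for each CPD entry; the dataset below is made up for illustration.

```python
from collections import Counter

# Complete data: every instance fully specifies (x1, x2, y).  Values are invented.
data = [(0, 1, 0), (1, 0, 1), (0, 1, 1), (0, 0, 1), (1, 1, 1), (0, 1, 1)]

joint  = Counter((x1, x2, y) for x1, x2, y in data)   # counts M[x1, x2, y]
parent = Counter((x1, x2)    for x1, x2, _ in data)   # counts M[x1, x2]

def cpd(x1, x2, y):
    """Maximum-likelihood estimate P(Y=y | X1=x1, X2=x2) = M[x1,x2,y] / M[x1,x2]."""
    if parent[(x1, x2)] == 0:
        return None                       # parent configuration never observed in the data
    return joint[(x1, x2, y)] / parent[(x1, x2)]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"P(Y=1 | X1={x1}, X2={x2}) = {cpd(x1, x2, 1)}")
```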
11
Unknown Structure, Complete Data
Goal: Structure learning & parameter estimation
Data does not contain missing values
[Figure: the Inducer takes the variables and the same complete input data, and outputs both the network structure over X1, X2, Y and the CPD P(Y|X1,X2)]
12
Known Structure, Incomplete Data
Goal: Parameter estimation
Data contains missing values
[Figure: same setup with a known structure, but the input data now has missing entries (shown as "?"); the Inducer still outputs the CPD P(Y|X1,X2)]
13
Unknown Structure, Incomplete Data
Goal: Structure learning & parameter estimation
Data contains missing values
[Figure: same setup, with missing entries ("?") in the input data; the Inducer outputs both the structure and the CPD P(Y|X1,X2)]
14
Parameter Estimation
Input:
Network structure
Choice of parametric family for each CPD P(Xi|Pa(Xi))
Goal: Learn CPD parameters
Two main approaches (a small contrasting sketch follows below):
Maximum likelihood estimation
Bayesian approaches
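A small sketch contrasting the two approaches on a single Bernoulli parameter: the maximum-likelihood estimate is the observed fraction, while a Bayesian estimate with a Beta(alpha, beta) prior returns a smoothed posterior mean. The prior values are illustrative, not from the lecture.

```python
def mle(heads, tails):
    """Maximum-likelihood estimate of P(Head): the observed fraction of heads."""
    return heads / (heads + tails)

def bayes_posterior_mean(heads, tails, alpha=1.0, beta=1.0):
    """Posterior mean of P(Head) under a Beta(alpha, beta) prior; (1, 1) gives Laplace smoothing."""
    return (heads + alpha) / (heads + tails + alpha + beta)

print(mle(3, 2))                    # 0.6
print(bayes_posterior_mean(3, 2))   # ~0.571, pulled toward 0.5 by the uniform prior
```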
15
Biased Coin Toss Example
Coin can land in two positions: Head or Tail
Estimation task: given toss examples x[1],...,x[m], estimate P(H) = θ and P(T) = 1 − θ
Assumption: i.i.d. samples
Tosses are controlled by an (unknown) parameter θ
Tosses are sampled from the same distribution
Tosses are independent of each other
16
Biased Coin Toss Example
Goal: find θ ∈ [0,1] that predicts the data well
"Predicts the data well" = likelihood of the data given θ, L(D:θ)
Example: the probability of the sequence H,T,T,H,H is L(D:θ) = θ·(1−θ)·(1−θ)·θ·θ = θ^3 (1−θ)^2
[Plot: L(D:θ) as a function of θ over [0, 1]]
17
Maximum Likelihood Estimator
The maximum likelihood estimator is the parameter θ that maximizes L(D:θ)
In our example, θ = 0.6 maximizes the likelihood of the sequence H,T,T,H,H
[Plot: L(D:θ) for the sequence, peaking at θ = 0.6]
18
Maximum Likelihood Estimator
General case
Observations: MH heads and MT tails
Find θ maximizing the likelihood L(D:θ) = θ^MH (1−θ)^MT
Equivalent to maximizing the log-likelihood log L(D:θ) = MH log θ + MT log(1−θ)
Differentiating the log-likelihood and solving for θ, we get that the maximum likelihood parameter is:
θ = MH / (MH + MT)
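A minimal sketch of this result on the sequence from the earlier slides: count heads and tails, take MH / (MH + MT), and confirm with a brute-force search over θ.

```python
import math

tosses = ["H", "T", "T", "H", "H"]           # the example sequence from the earlier slides
MH, MT = tosses.count("H"), tosses.count("T")

theta_mle = MH / (MH + MT)                   # closed-form MLE: 3/5 = 0.6

def log_lik(theta):
    """log L(D:theta) = MH*log(theta) + MT*log(1 - theta)."""
    return MH * math.log(theta) + MT * math.log(1 - theta)

# Brute-force check that the closed form is indeed the maximizer.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_lik)

print(theta_mle, theta_grid)                 # 0.6 0.6
```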
19
Sufficient Statistics
For computing the parameter θ of the coin toss example, we only needed MH and MT, since they fully determine the likelihood: MH and MT are sufficient statistics
20
Sufficient Statistics
Definition: a function s(D) from instances to a vector in ℝ^k is a sufficient statistic if, for any two datasets D and D' and any θ, s(D) = s(D') implies L(D:θ) = L(D':θ)
[Diagram: many different datasets map to the same statistics]
21
Sufficient Statistics for Multinomial
A sufficient statistic for a dataset D over a variable Y with k values is the tuple of counts <M1,...,Mk> such that Mi is the number of times that Y = yi in D
To obtain this sufficient statistic, define s(x) as a tuple of dimension k: if x has Y = yi, then s(x) = (0,...,0,1,0,...,0), with the single 1 in position i and zeros in positions 1,...,i−1 and i+1,...,k
Summing s(x[m]) over the instances of D gives <M1,...,Mk>
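A minimal sketch of the count vector and the per-instance indicator encoding described above; the variable values and dataset are made up for illustration.

```python
from collections import Counter

values = ["y1", "y2", "y3"]                       # the k possible values of Y
D = ["y1", "y3", "y1", "y2", "y1", "y3"]          # a made-up dataset over Y

def s_instance(x):
    """Indicator tuple of dimension k: a 1 in the position of the observed value, 0 elsewhere."""
    return tuple(1 if x == v else 0 for v in values)

def s_dataset(D):
    """Sufficient statistics <M1,...,Mk>: the counts of each value (= sum of indicator tuples)."""
    counts = Counter(D)
    return tuple(counts[v] for v in values)

print(s_instance("y2"))   # (0, 1, 0)
print(s_dataset(D))       # (3, 1, 2); the likelihood of D depends only on these counts
```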
22
Bias and Variance of Estimator
23
Next Time