CS498-EA Reasoning in AI, Lecture #10
Instructor: Eyal Amir
Fall Semester 2009
Some slides in this set were adapted from Eran Segal.

Known Structure, Complete Data
Goal: parameter estimation.
The data contains no missing values.
[Figure: the Inducer takes the input data (complete instances over X1, X2, Y) together with the initial network X1 → Y ← X2, and outputs the estimated CPT P(Y | X1, X2).]

Maximum Likelihood Estimator
General case:
– Observations: M_H heads and M_T tails.
– Find θ maximizing the likelihood L(D : θ) = θ^{M_H} (1 − θ)^{M_T}.
– Equivalent to maximizing the log-likelihood M_H log θ + M_T log(1 − θ).
– Differentiating the log-likelihood and solving for θ, we get that the maximum likelihood parameter is θ̂ = M_H / (M_H + M_T).
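As a minimal illustration (not from the original slides), the sketch below simulates coin flips with NumPy and applies the closed-form MLE θ̂ = M_H / (M_H + M_T); the simulated bias 0.7 and all variable names are illustrative.

```python
import numpy as np

# Simulated coin flips (1 = heads, 0 = tails); the true bias is unknown to the learner.
rng = np.random.default_rng(0)
flips = rng.binomial(1, 0.7, size=100)

m_h = flips.sum()          # number of heads, M_H
m_t = len(flips) - m_h     # number of tails, M_T

# Closed-form MLE obtained by setting the derivative of the log-likelihood to zero.
theta_hat = m_h / (m_h + m_t)
print(f"M_H={m_h}, M_T={m_t}, MLE theta = {theta_hat:.3f}")
```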

Sufficient Statistics for Multinomial
A sufficient statistic for a dataset D over a variable Y with k values is the tuple of counts (M_1, ..., M_k) such that M_i is the number of times that Y = y_i in D.
Per-instance sufficient statistic:
– Define s(x[i]) as a tuple of dimension k.
– s(x[i]) = (0, ..., 0, 1, 0, ..., 0), with the 1 in the position of the value taken by x[i] and zeros in positions 1, ..., i−1 and i+1, ..., k.
– The dataset counts are obtained by summing: (M_1, ..., M_k) = Σ_i s(x[i]).

Properties of the ML Estimator
Unbiased:
– M = number of occurrences of H, N = total number of experiments, θ = Pr(H).
– E[M/N] = θ.
Variance:
– Var(M/N) = θ(1 − θ)/N, which goes to zero as N grows.

Sufficient Statistic for Gaussian
Gaussian distribution: P(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)).
Rewriting in exponential-family form, the sufficient statistics for the Gaussian are
s(x[i]) = (1, x[i], x[i]²).

Maximum Likelihood Estimation
MLE Principle: choose θ that maximizes L(D : θ).
Multinomial MLE: θ̂_i = M_i / Σ_j M_j.
Gaussian MLE: μ̂ = (1/M) Σ_m x[m], σ̂² = (1/M) Σ_m (x[m] − μ̂)².
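A short sketch of both closed-form estimators; the counts and the Gaussian data below are made up for illustration, assuming NumPy is available.

```python
import numpy as np

# Multinomial MLE: theta_i = M_i / sum_j M_j (relative frequencies).
counts = np.array([30, 50, 20])              # M_1, ..., M_k for a 3-valued variable
theta_mle = counts / counts.sum()

# Gaussian MLE: mu = sample mean, sigma^2 = sample variance (divide by M, not M-1).
x = np.random.default_rng(1).normal(5.0, 2.0, size=1000)
mu_mle = x.mean()
sigma2_mle = ((x - mu_mle) ** 2).mean()      # equivalently x.var(ddof=0)

print(theta_mle, mu_mle, sigma2_mle)
```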

MLE for Bayesian Networks
Network X → Y with parameters:
– θ_x0, θ_x1
– θ_y0|x0, θ_y1|x0, θ_y0|x1, θ_y1|x1
Data instances are tuples (x[m], y[m]).
Likelihood: L(D : θ) = Π_m P(x[m], y[m] : θ) = Π_m P(x[m] : θ) P(y[m] | x[m] : θ).
⇒ The likelihood decomposes into two separate terms, one for each variable.

MLE for Bayesian Networks
Terms further decompose by CPDs:
Π_m P(y[m] | x[m] : θ) = Π_{m : x[m]=x0} P(y[m] | x0 : θ) · Π_{m : x[m]=x1} P(y[m] | x1 : θ).
By sufficient statistics, the term for x0 equals θ_y0|x0^{M[x0,y0]} · θ_y1|x0^{M[x0,y1]}, where M[x0, y0] is the number of data instances in which X takes the value x0 and Y takes the value y0.
MLE: θ̂_y0|x0 = M[x0, y0] / (M[x0, y0] + M[x0, y1]) = M[x0, y0] / M[x0].

MLE for Bayesian Networks
Likelihood for a general Bayesian network:
L(D : θ) = Π_i Π_m P(x_i[m] | pa_i[m] : θ_{X_i|Pa(X_i)}) = Π_i L_i(D : θ_{X_i|Pa(X_i)}).
⇒ If the parameter sets θ_{X_i|Pa(X_i)} are disjoint, then the MLE can be computed by maximizing each local likelihood separately.

MLE for Table-CPD Bayesian Networks
Multinomial (table) CPD: for each value x ∈ Val(X) we get an independent multinomial estimation problem, where the MLE is
θ̂_{y|x} = M[x, y] / M[x].
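A hedged sketch of table-CPD estimation by counting, for a two-node network X → Y; the data-generating probabilities are invented for the example, and the nested loops mirror the "one multinomial problem per parent value" structure.

```python
import numpy as np

# Synthetic data over binary X (parent) and Y (child) for the network X -> Y.
rng = np.random.default_rng(2)
x = rng.binomial(1, 0.6, size=500)
y = np.where(x == 1, rng.binomial(1, 0.9, size=500), rng.binomial(1, 0.2, size=500))

# MLE for a table CPD: theta_{y|x} = M[x, y] / M[x], solved independently per parent value.
cpd = np.zeros((2, 2))
for x_val in (0, 1):
    for y_val in (0, 1):
        cpd[x_val, y_val] = np.sum((x == x_val) & (y == y_val))
    cpd[x_val] /= cpd[x_val].sum()           # normalize counts for this parent configuration

print(cpd)   # row x_val gives the estimated P(Y | X = x_val)
```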

MLE for Tree CPDs
Assume a tree CPD for X with parents Y and Z, with known tree structure: the root splits on Y; the y0 branch is a leaf holding θ_x0|y0, θ_x1|y0, while the y1 branch splits on Z, with leaves such as θ_x0|y1,z1, θ_x1|y1,z1.
– The likelihood terms for (y0, z0) and (y0, z1) can be combined, since they share the same leaf.
– Optimization can be done leaf by leaf.

MLE for Tree-CPD Bayesian Networks
Tree CPD T with leaves ℓ: for each leaf ℓ ∈ Leaves(T) we get an independent multinomial problem, where the MLE is
θ̂_{x|ℓ} = M[ℓ, x] / M[ℓ],
with M[ℓ, x] counting the instances whose parent assignment reaches leaf ℓ and in which X = x.

Limitations of MLE
Two teams play 10 times, and the first wins 7 of the 10 matches ⇒ MLE: probability of the first team winning = 0.7.
A coin is tossed 10 times and comes up heads on 7 of the 10 tosses ⇒ MLE: probability of heads = 0.7.
Would you place the same bet on the next game as you would on the next coin toss?
We need to incorporate prior knowledge.
– Prior knowledge should only be used as a guide.

Bayesian Inference
Assumptions:
– Given a fixed θ, tosses are independent.
– If θ is unknown, tosses are not marginally independent: each toss tells us something about θ.
The following network captures these assumptions: θ → X[1], X[2], ..., X[M].

Bayesian Inference
Joint probabilistic model: P(x[1], ..., x[M], θ) = P(x[1], ..., x[M] | θ) P(θ).
Posterior probability over θ:
P(θ | x[1], ..., x[M]) = P(x[1], ..., x[M] | θ) P(θ) / P(x[1], ..., x[M])
(likelihood × prior / normalizing factor).
For a uniform prior, the posterior is just the normalized likelihood.
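As an illustration (not part of the lecture), the posterior for a binomial variable under a uniform prior can also be computed numerically on a grid of θ values; the counts below are arbitrary.

```python
import numpy as np

# Discretize theta on a grid and compute the posterior numerically for M_H heads, M_T tails.
theta = np.linspace(0.001, 0.999, 999)
m_h, m_t = 7, 3

prior = np.ones_like(theta)                       # uniform prior P(theta) = 1
likelihood = theta**m_h * (1 - theta)**m_t        # P(D | theta)
unnorm = likelihood * prior
posterior = unnorm / (unnorm.sum() * (theta[1] - theta[0]))   # normalize to integrate to ~1

print("posterior mode ~", theta[np.argmax(posterior)])        # close to the MLE 0.7
```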

Bayesian Prediction
Predict the next data instance from the previous ones:
P(x[M+1] | x[1], ..., x[M]) = ∫ P(x[M+1] | θ) P(θ | x[1], ..., x[M]) dθ.
Solving for a uniform prior P(θ) = 1 and a binomial variable gives
P(X[M+1] = H | D) = (M_H + 1) / (M + 2).

Example: Binomial Data
Data: (M_H, M_T) = (4, 1).
Prior: uniform for θ in [0, 1], i.e., P(θ) = 1.
⇒ P(θ | D) is proportional to the likelihood L(D : θ).
MLE for P(X = H) is 4/5 = 0.8.
Bayesian prediction is (M_H + 1) / (M + 2) = 5/7 ≈ 0.714.
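A few lines reproducing the slide's numbers: the MLE and the uniform-prior (Laplace) prediction computed from (M_H, M_T) = (4, 1).

```python
# Sufficient statistics from the slide's example: 4 heads and 1 tail.
m_h, m_t = 4, 1
m = m_h + m_t

theta_mle = m_h / m                      # MLE: 4/5 = 0.8
theta_bayes = (m_h + 1) / (m + 2)        # uniform (Laplace) prior: (M_H + 1)/(M + 2) = 5/7

print(f"MLE prediction      P(X=H) = {theta_mle:.3f}")
print(f"Bayesian prediction P(X=H) = {theta_bayes:.3f}")
```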

Dirichlet Priors
A Dirichlet prior is specified by a set of (non-negative) hyperparameters α_1, ..., α_k, so that θ ~ Dirichlet(α_1, ..., α_k) if
P(θ) = (1/Z) Π_i θ_i^{α_i − 1}, where Z = Π_i Γ(α_i) / Γ(Σ_i α_i) and Σ_i θ_i = 1.
– Intuitively, the hyperparameters correspond to the number of imaginary examples we saw before starting the experiment.

Dirichlet Priors – Example
[Figure: density plots of the prior over θ for Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5).]

Dirichlet Priors
Dirichlet priors have the property that the posterior is also Dirichlet:
– Data counts are M_1, ..., M_k.
– Prior is Dir(α_1, ..., α_k).
– ⇒ Posterior is Dir(α_1 + M_1, ..., α_k + M_k).
The hyperparameters α_1, ..., α_k can be thought of as "imaginary" counts from our prior experience.
Equivalent sample size = α_1 + ... + α_k.
– The larger the equivalent sample size, the more confident we are in our prior.
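A minimal sketch of the conjugate update: prior hyperparameters plus observed counts give the posterior hyperparameters; the specific α values and counts are illustrative.

```python
import numpy as np

# Dirichlet hyperparameters act as imaginary counts; the posterior adds the observed counts.
alpha_prior = np.array([2.0, 2.0, 2.0])   # Dir(2,2,2): equivalent sample size 6
data_counts = np.array([10, 3, 1])        # M_1, M_2, M_3 observed in D

alpha_post = alpha_prior + data_counts    # conjugacy: posterior is Dir(alpha_i + M_i)

# Predictive (posterior-mean) probabilities vs. the MLE.
pred_bayes = alpha_post / alpha_post.sum()
pred_mle = data_counts / data_counts.sum()
print("posterior hyperparameters:", alpha_post)
print("Bayesian prediction:", pred_bayes, " MLE:", pred_mle)
```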

Effect of Priors
[Figure: prediction of P(X = H) after seeing data with M_H = (1/4) M_T, as a function of the sample size; one set of curves varies the prior strength α_H + α_T at a fixed ratio α_H / α_T, the other fixes the strength α_H + α_T and varies the ratio α_H / α_T.]

Effect of Priors (cont.)
[Figure: P(X = 1 | D) as a function of the number of samples N, comparing the MLE with Bayesian estimates under Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(5, 5), and Dirichlet(10, 10) priors; a second panel shows the individual toss results.]
In real data, Bayesian estimates are less sensitive to noise in the data.

General Formulation
Joint distribution over D, θ: P(D, θ) = P(D | θ) P(θ).
Posterior distribution over parameters: P(θ | D) = P(D | θ) P(θ) / P(D).
P(D) is the marginal likelihood of the data.
As we saw, the likelihood can be described compactly using sufficient statistics.
We want conditions under which the posterior is also compact.

Conjugate Families
A family of priors P(θ : α) is conjugate to a model P(X | θ) if, for any possible dataset D of i.i.d. samples from P(X | θ) and any choice of hyperparameters α for the prior over θ, there are hyperparameters α' that describe the posterior, i.e., P(θ : α') ∝ P(D | θ) P(θ : α).
– The posterior has the same parametric form as the prior.
– The Dirichlet prior is a conjugate family for the multinomial likelihood.
Conjugate families are useful since:
– Many distributions can be represented with hyperparameters.
– They allow sequential updating within the same representation.
– In many cases we have closed-form solutions for prediction.

Bayesian Estimation in Bayesian Networks
Instances are independent given the parameters:
– (x[m'], y[m']) are d-separated from (x[m], y[m]) given θ.
Priors for individual variables are a priori independent:
– Global independence of parameters: P(θ_X, θ_Y|X) = P(θ_X) P(θ_Y|X).
[Figure: the Bayesian network X → Y, and the meta-network for parameter estimation in which θ_X is a parent of X[1], ..., X[M] and θ_Y|X is a parent of Y[1], ..., Y[M], with each X[m] a parent of Y[m].]

Bayesian Estimation in Bayesian Networks
Posteriors of θ are independent given complete data:
– Complete data d-separates the parameters of different CPDs: P(θ_X, θ_Y|X | D) = P(θ_X | D) P(θ_Y|X | D).
– As in MLE, we can solve each estimation problem separately.
[Figure: the same parameter-estimation network as before, with parameter nodes θ_X and θ_Y|X.]

Bayesian Estimation in Bayesian Networks
Posteriors of θ are independent given complete data:
– This also holds for parameters within families.
– Note the context-specific independence between θ_Y|X=0 and θ_Y|X=1 when both X and Y are observed.
[Figure: the parameter-estimation network with separate parameter nodes θ_Y|X=0 and θ_Y|X=1.]

Bayesian Estimation in Bayesian Networks
Posteriors of θ can be computed independently:
– For a multinomial CPD θ_{X_i | pa_i}, the posterior is Dirichlet with parameters
  α_{X_i=1|pa_i} + M[X_i=1, pa_i], ..., α_{X_i=k|pa_i} + M[X_i=k, pa_i].
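A hedged sketch of per-CPD Bayesian estimation for the X → Y network, assuming a symmetric Dirichlet hyperparameter of 1 imaginary count per entry; the data and hyperparameter choice are illustrative, and the posterior-mean estimate is reported.

```python
import numpy as np

# Small X -> Y dataset; each conditional distribution P(Y | X = x) gets its own
# independent Dirichlet prior, as the decomposition result allows.
rng = np.random.default_rng(3)
x = rng.binomial(1, 0.6, size=50)
y = np.where(x == 1, rng.binomial(1, 0.9, size=50), rng.binomial(1, 0.2, size=50))

alpha = 1.0                                # "imaginary count" per (value, parent config)
cpd_bayes = np.zeros((2, 2))
for x_val in (0, 1):
    counts = np.array([np.sum((x == x_val) & (y == y_val)) for y_val in (0, 1)])
    post = counts + alpha                  # Dirichlet posterior hyperparameters for this parent value
    cpd_bayes[x_val] = post / post.sum()   # posterior-mean estimate of P(Y | X = x_val)

print(cpd_bayes)
```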

Assessing Priors for Bayesian Networks
We need the hyperparameters α(x_i, pa_i) for each node X_i.
We can use an initial set of parameters θ_0 as prior information:
– We also need an equivalent sample size parameter M'.
– Then we let α(x_i, pa_i) = M' · P(x_i, pa_i | θ_0).
This allows a network to be updated with new data.
Example network for priors: X → Y with P(X=0) = P(X=1) = 0.5, P(Y=0) = P(Y=1) = 0.5, and M' = 1.
Note: α(x0) = 0.5 and α(x0, y0) = 0.25.
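The slide's example worked out in a few lines: with a uniform prior network over X → Y and M' = 1, the prior counts come out to α(x0) = 0.5 and α(x0, y0) = 0.25; the variable names are illustrative.

```python
import numpy as np

# Prior network for X -> Y as on the slide: X and Y uniform, M' = 1.
p_x = np.array([0.5, 0.5])                # P(X) under theta_0
p_y_given_x = np.array([[0.5, 0.5],
                        [0.5, 0.5]])      # P(Y | X) under theta_0
m_prime = 1.0                             # equivalent sample size

# alpha(x) = M' * P(x | theta_0) and alpha(x, y) = M' * P(x, y | theta_0)
alpha_x = m_prime * p_x                               # -> [0.5, 0.5]
alpha_xy = m_prime * p_x[:, None] * p_y_given_x       # -> all entries 0.25

print(alpha_x, alpha_xy, sep="\n")
```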

Case Study: ICU Alarm Network
The "Alarm" network: 37 variables.
Experiment: sample instances from the network, then learn the parameters with MLE and with the Bayesian approach.
[Figure: the 37-node Alarm network, with nodes such as PCWP, CO, HRBP, SAO2, EXPCO2, BP, and others.]

Case Study: ICU Alarm Network
[Figure: KL divergence between the learned and true distributions as a function of the number of samples M, comparing MLE with Bayesian estimation under a uniform prior with M' = 5, 10, 20, and 50.]
MLE performs worst; the prior with M' = 5 provides the best smoothing.

Parameter Estimation Summary
Estimation relies on sufficient statistics:
– For multinomials these are counts of the form M[x_i, pa_i].
– Parameter estimates:
  MLE: θ̂_{x_i|pa_i} = M[x_i, pa_i] / M[pa_i]
  Bayesian (Dirichlet): θ̂_{x_i|pa_i} = (α(x_i, pa_i) + M[x_i, pa_i]) / (α(pa_i) + M[pa_i])
Bayesian methods also require a choice of priors.
MLE and Bayesian estimation are asymptotically equivalent.
Both can be implemented in an online manner by accumulating sufficient statistics.
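A small sketch of the online view: only the sufficient statistics are stored, and each new instance updates them; the hyperparameter alpha below is a hypothetical choice (setting it to 0 recovers the MLE, any positive value gives the Dirichlet posterior-mean estimate).

```python
import numpy as np

# Online parameter estimation: keep only the sufficient statistics, update them per instance.
counts = np.zeros((2, 2))          # M[x, y] for the X -> Y network
alpha = 1.0                        # Dirichlet hyperparameter (0 would recover the MLE)

def update(x_val, y_val):
    """Fold one new data instance into the sufficient statistics."""
    counts[x_val, y_val] += 1

def estimate():
    """Current Bayesian estimate of P(Y | X) from the accumulated counts."""
    post = counts + alpha
    return post / post.sum(axis=1, keepdims=True)

for x_val, y_val in [(0, 0), (1, 1), (1, 1), (0, 1), (1, 0)]:   # stream of instances
    update(x_val, y_val)

print(estimate())
```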

THE END

Approaches to structure learning
Constraint-based (Pearl, 2000; Spirtes et al., 1993):
– Detect dependencies with statistical tests (e.g., χ²).
– Deduce the structure from the dependencies.
– Attempts to reduce an inductive problem to a deductive problem.
[Figure: a small example network over variables B, E, and C built up from the detected dependencies.]

Approaches to structure learning
Constraint-based (Pearl, 2000; Spirtes et al., 1993):
– Detect dependencies with statistical tests (e.g., χ²) and deduce the structure from them.
Bayesian (Heckerman, 1998; Friedman, 1999):
– Compute the posterior probability of structures given the observed data:
  P(h | data) ∝ P(data | h) P(h)
– Compare candidate structures, e.g., P(h_1 | data) vs. P(h_0 | data).
[Figure: candidate structures over B, E, and C being compared under the posterior.]
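A sketch, not the lecture's own code, of the Bayesian score for comparing h0 (X and Y independent) against h1 (X → Y), using Beta(1, 1) priors on every CPD (a K2-style prior); the data-generating process and all names are invented for the example.

```python
import numpy as np
from math import lgamma

def log_beta(a, b):
    """log of the Beta function, used in the Beta-Bernoulli marginal likelihood."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_bernoulli(heads, tails, a=1.0, b=1.0):
    """log P(D | h) for a binary variable with theta ~ Beta(a, b), integrating theta out."""
    return log_beta(a + heads, b + tails) - log_beta(a, b)

# Toy data in which Y clearly depends on X.
rng = np.random.default_rng(4)
x = rng.binomial(1, 0.5, size=200)
y = np.where(x == 1, rng.binomial(1, 0.9, size=200), rng.binomial(1, 0.1, size=200))

# h0: X and Y independent -> the marginal likelihood factors into two Bernoulli terms.
log_p_h0 = (log_marginal_bernoulli(x.sum(), len(x) - x.sum())
            + log_marginal_bernoulli(y.sum(), len(y) - y.sum()))

# h1: X -> Y -> a separate Bernoulli term for Y within each parent configuration.
log_p_h1 = log_marginal_bernoulli(x.sum(), len(x) - x.sum())
for x_val in (0, 1):
    y_x = y[x == x_val]
    log_p_h1 += log_marginal_bernoulli(y_x.sum(), len(y_x) - y_x.sum())

# With equal structure priors P(h0) = P(h1), the posteriors compare via the marginal likelihoods.
print(f"log P(D|h0) = {log_p_h0:.1f}, log P(D|h1) = {log_p_h1:.1f}")
```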

Bayesian Occam's Razor
For any model h, Σ_d P(d | h) = 1, where the sum is over all possible data sets d.
[Figure: P(d | h) plotted over all possible data sets d for h_0 (no relationship) and h_1 (relationship): the more flexible model must spread its probability mass over more data sets.]

Unknown Structure, Complete Data
Goal: structure learning and parameter estimation.
The data contains no missing values.
[Figure: the Inducer takes the input data (complete instances over X1, X2, Y) and an initial network, and outputs both the structure X1 → Y ← X2 and the estimated CPT P(Y | X1, X2).]

Known Structure, Incomplete Data
Goal: parameter estimation.
The data contains missing values (entries marked '?').
[Figure: the Inducer takes the known structure X1 → Y ← X2 and data with missing entries, and outputs the estimated CPT P(Y | X1, X2).]

Unknown Structure, Incomplete Data
Goal: structure learning and parameter estimation.
The data contains missing values (entries marked '?').
[Figure: the Inducer takes data with missing entries and an initial network, and outputs both the structure X1 → Y ← X2 and the estimated CPT P(Y | X1, X2).]