CS498-EA Reasoning in AI Lecture #20 Instructor: Eyal Amir Fall Semester 2009 Who is in this class? Some slides in this set were adapted from Eran Segal
Summary of last time: Inference We presented the variable elimination (VE) algorithm Specifically, VE for computing the marginal P(Xi) over a single variable Xi from X1,…,Xn Pick an elimination order on the variables; one variable Xj is eliminated at a time: (a) move terms not involving Xj outside the summation over Xj; (b) create a new potential function fXj(.) over the other variables appearing in the terms of the summation in (a) Works for both BNs and MFs (Markov fields)
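A minimal Python sketch of the elimination loop just described, assuming binary variables and a simple (scope, table) factor representation; the helper names multiply, sum_out, and eliminate are illustrative, not code from the lecture.

from itertools import product

def multiply(f1, f2):
    # Pointwise product of two factors given as (scope, table) pairs, where
    # table maps tuples of values (in scope order) to nonnegative numbers.
    scope1, t1 = f1
    scope2, t2 = f2
    scope = list(scope1) + [v for v in scope2 if v not in scope1]
    table = {}
    for assign in product([0, 1], repeat=len(scope)):  # binary domains assumed
        a = dict(zip(scope, assign))
        table[assign] = t1[tuple(a[x] for x in scope1)] * t2[tuple(a[x] for x in scope2)]
    return scope, table

def sum_out(f, var):
    # Step (b): create the new potential f_Xj by summing var out of factor f.
    scope, t = f
    idx = scope.index(var)
    new_scope = [v for v in scope if v != var]
    new_table = {}
    for assign, val in t.items():
        key = assign[:idx] + assign[idx + 1:]
        new_table[key] = new_table.get(key, 0.0) + val
    return new_scope, new_table

def eliminate(factors, order, query):
    # Eliminate the variables in `order` (all variables except `query`),
    # then multiply the remaining factors and normalize to get P(query).
    assert query not in order
    for xj in order:
        used = [f for f in factors if xj in f[0]]      # terms involving Xj
        rest = [f for f in factors if xj not in f[0]]  # step (a): moved outside the sum over Xj
        if not used:
            continue
        prod = used[0]
        for f in used[1:]:
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, xj)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    scope, table = result
    z = sum(table.values())
    return {assign: val / z for assign, val in table.items()}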
Exact Inference Treewidth methods: Variable elimination Clique tree algorithm Treewidth
Today: Learning in Graphical Models Parameter Estimation Maximum Likelihood Complete Observations Naïve Bayes Not so Naïve Bayes
Learning Introduction So far, we assumed that the networks were given Where do the networks come from? Knowledge engineering with the aid of experts Automated construction of networks: learning from examples or instances
Parameter estimation Maximum likelihood estimation: maximize Π_i P(b[i], c[i], e[i]; w0, w1)
Learning Introduction Input: dataset of instances D={d[1],...,d[m]} Output: Bayesian network Measures of success: How close is the learned network to the original distribution? Use distance measures between distributions; often hard because we do not have the true underlying distribution Instead, evaluate performance by how well the network predicts new unseen examples (“test data”), e.g., classification accuracy How close is the structure of the network to the true one? Use a distance metric between structures; hard because we do not know the true structure Instead, ask whether the independencies learned hold in test data
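One hedged sketch of the “predict unseen test data” measure of success: score a learned network by its average log-likelihood on held-out instances. The function model_prob is a stand-in for whatever P(x) the learned network defines, not an API from the lecture.

import math

def heldout_log_likelihood(model_prob, test_data):
    # Average log P(d) over held-out instances d; higher means better prediction.
    return sum(math.log(model_prob(d)) for d in test_data) / len(test_data)

# Toy usage with a learned distribution over a single binary variable:
learned = {0: 0.3, 1: 0.7}
print(heldout_log_likelihood(lambda d: learned[d], [1, 1, 0, 1]))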
Prior Knowledge Prespecified structure and variables: learn only CPDs Prespecified variables: learn network structure and CPDs Hidden variables: learn hidden variables, structure, and CPDs Complete/incomplete data: missing data, unobserved variables
Learning Bayesian Networks
[Figure: data and prior information are fed to an Inducer, which outputs a network X1, X2 -> Y together with the CPD P(Y|X1,X2):]
P(Y|X1,X2)      y0     y1
x10, x20        1      0
x10, x21        0.2    0.8
x11, x20        0.1    0.9
x11, x21        0.02   0.98
Known Structure, Complete Data Goal: parameter estimation; the data does not contain missing values [Figure: an initial network over X1, X2, Y and fully observed input data over (X1, X2, Y) are fed to the Inducer, which outputs the CPD P(Y|X1,X2)]
Unknown Structure, Complete Data Goal: structure learning & parameter estimation; the data does not contain missing values [Figure: fully observed input data over (X1, X2, Y) is fed to the Inducer, which outputs both the network structure and the CPD P(Y|X1,X2)]
Known Structure, Incomplete Data Goal: parameter estimation; the data contains missing values (shown as “?”) [Figure: an initial network over X1, X2, Y and input data with missing entries are fed to the Inducer, which outputs the CPD P(Y|X1,X2)]
Unknown Structure, Incomplete Data Goal: structure learning & parameter estimation; the data contains missing values [Figure: input data with missing entries is fed to the Inducer, which outputs both the network structure and the CPD P(Y|X1,X2)]
Parameter Estimation Input: network structure, choice of parametric family for each CPD P(Xi|Pa(Xi)) Goal: learn CPD parameters Two main approaches: maximum likelihood estimation and Bayesian approaches
Biased Coin Toss Example Coin can land in two positions: Head or Tail Estimation task: given toss examples x[1],...,x[m], estimate P(H)=θ and P(T)=1-θ Assumption: i.i.d. samples Tosses are controlled by an (unknown) parameter θ Tosses are sampled from the same distribution Tosses are independent of each other
Biased Coin Toss Example Goal: find θ in [0,1] that predicts the data well “Predicts the data well” = likelihood of the data given θ Example: the probability of the sequence H,T,T,H,H is L(D:θ) = θ·(1-θ)·(1-θ)·θ·θ = θ^3·(1-θ)^2 [Plot: L(D:θ) as a function of θ over [0,1]]
Maximum Likelihood Estimator The parameter θ that maximizes L(D:θ) In our example, θ=0.6 maximizes the likelihood of the sequence H,T,T,H,H [Plot: L(D:θ) over [0,1], peaking at θ=0.6]
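A quick numerical check of this example (assuming, as on the previous slide, L(D:θ) = θ^3·(1-θ)^2 for H,T,T,H,H): a grid search over θ recovers the maximizer 0.6.

# Grid search over theta in [0,1]; the likelihood of H,T,T,H,H peaks at 0.6.
thetas = [i / 1000 for i in range(1001)]
best = max(thetas, key=lambda t: t**3 * (1 - t)**2)
print(best)  # 0.6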
Maximum Likelihood Estimator General case Observations: MH heads and MT tails Find θ maximizing the likelihood L(D:θ) = θ^MH·(1-θ)^MT Equivalent to maximizing the log-likelihood log L(D:θ) = MH·log θ + MT·log(1-θ) Differentiating the log-likelihood and solving for θ, the maximum likelihood parameter is θ = MH / (MH + MT)
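The closed-form estimator above is easy to state in code; a small sketch, assuming the data is just a list of 'H'/'T' outcomes:

def coin_mle(tosses):
    # Maximum likelihood estimate: theta = M_H / (M_H + M_T).
    m_h = sum(1 for t in tosses if t == 'H')
    m_t = len(tosses) - m_h
    return m_h / (m_h + m_t)

print(coin_mle(['H', 'T', 'T', 'H', 'H']))  # 0.6, matching the earlier example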
Sufficient Statistics For computing the parameter of the coin toss example, we only needed MH and MT, since the likelihood depends on the data only through them: L(D:θ) = θ^MH·(1-θ)^MT MH and MT are sufficient statistics
Sufficient Statistics Definition: a function s(D) from instances to a vector in R^k is a sufficient statistic if for any two datasets D and D’ and any θ we have s(D) = s(D’) ⟹ L(D:θ) = L(D’:θ) [Figure: different datasets mapping to the same statistics]
Sufficient Statistics for Multinomial A sufficient statistic for a dataset D over a variable Y with k values is the tuple of counts <M1,...,Mk> such that Mi is the number of times that Y=yi appears in D Per-instance statistic: define s(x) as a tuple of dimension k; if x=yi then s(x) = (0,...,0,1,0,...,0), with the 1 in position i (zeros in positions 1,...,i-1 and i+1,...,k)
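A small sketch of the multinomial sufficient statistics, assuming instances are plain values of Y; the function names are illustrative only.

def indicator(x, values):
    # Per-instance statistic s(x): 1 in the position of the observed value, 0 elsewhere.
    return [1 if v == x else 0 for v in values]

def counts(data, values):
    # s(D) = sum over instances of s(x) = <M1,...,Mk>, the tuple of counts.
    return [sum(1 for x in data if x == v) for v in values]

print(indicator('b', ['a', 'b', 'c']))                     # [0, 1, 0]
print(counts(['a', 'b', 'a', 'c', 'a'], ['a', 'b', 'c']))  # [3, 1, 1]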
Bias and Variance of Estimator
Next Time