Bayesian Learning (Chapter 20.1-20.4)
Some material adapted from lecture notes by Lise Getoor and Ron Parr

Simple case: Naïve Bayes

- Use Bayesian modeling
- Make the simplest possible independence assumption: each attribute is independent of the values of the other attributes, given the class variable
- In our restaurant domain: Cuisine is independent of Patrons, given a decision to stay (or not)

Bayesian Formulation

- p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn) = α p(C) p(F1, ..., Fn | C)
- Assume each feature Fi is conditionally independent of the others given the class C. Then:
  p(C | F1, ..., Fn) = α p(C) Πi p(Fi | C)
- Estimate each of these conditional probabilities from the observed counts in the training data:
  p(Fi | C) = N(Fi ∧ C) / N(C)
- One subtlety of using the algorithm in practice: when an estimated probability is zero, ugly things happen
- The fix: add one to every count (aka "Laplacian smoothing"; they have a different name for everything!)
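The counting estimates and the add-one fix above can be sketched in a few lines of Python (a minimal sketch; the function and variable names are my own, not from the slides):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, classes, feature_values):
    """Estimate p(C) and p(Fi | C) from (features, class) pairs,
    with add-one (Laplacian) smoothing to avoid zero counts."""
    class_counts = Counter(c for _, c in examples)
    feat_counts = defaultdict(Counter)  # feat_counts[(i, c)][v] = N(Fi=v ∧ C=c)
    for feats, c in examples:
        for i, v in enumerate(feats):
            feat_counts[(i, c)][v] += 1
    n = len(examples)
    prior = {c: class_counts[c] / n for c in classes}
    def p_feat(i, v, c):
        # (N(Fi=v ∧ C=c) + 1) / (N(C=c) + number of values Fi can take)
        return (feat_counts[(i, c)][v] + 1) / (class_counts[c] + len(feature_values[i]))
    return prior, p_feat

def classify(feats, prior, p_feat, classes):
    """Return argmax_c p(c) * prod_i p(Fi | c); the normalizer alpha cancels."""
    def score(c):
        s = prior[c]
        for i, v in enumerate(feats):
            s *= p_feat(i, v, c)
        return s
    return max(classes, key=score)
```

Since α is the same for every class, the argmax can skip normalization entirely, which is why training reduces to pure counting.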

Naive Bayes: Example

- p(Wait | Cuisine, Patrons, Rainy?)
  = α p(Wait) p(Cuisine | Wait) p(Patrons | Wait) p(Rainy? | Wait)
  = p(Wait) p(Cuisine | Wait) p(Patrons | Wait) p(Rainy? | Wait) / [p(Cuisine) p(Patrons) p(Rainy?)]
- We can estimate all of the parameters (the p(Fi | C) and p(C)) just by counting from the training examples
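If actual posterior probabilities are wanted rather than just the argmax, α can be recovered after the fact as one over the sum of the unnormalized class scores. A tiny sketch (the numbers below are hypothetical, not from the slides):

```python
def normalize(scores):
    """Recover alpha: it is just 1 / (sum of the unnormalized class scores)."""
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical unnormalized scores p(Wait) * prod_i p(Fi | Wait) for each class:
posterior = normalize({"wait": 0.075, "leave": 0.0125})
```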

Naive Bayes: Analysis

- Naive Bayes is amazingly easy to implement (once you understand the bit of math behind it)
- Naive Bayes can outperform many much more complex algorithms; it's a baseline that should be tried and/or used for comparison
- Naive Bayes can't capture interdependencies between variables (obviously); for that, we need Bayes nets!

Learning Bayesian networks

- Given a training set D, find the network B that best matches D
- Two subtasks: model selection (structure) and parameter estimation
[Slide diagram: an Inducer takes data D and produces a network over variables A, B, C, E]

Learning Bayesian Networks

- Describe a BN by specifying its (1) structure and (2) conditional probability tables (CPTs)
- Both can be learned from data, but learning structure is much harder than learning parameters
- Learning when some nodes are hidden, or with missing data, is harder still
- Four cases:

  Structure  Observability  Method
  Known      Full           Maximum likelihood estimation
  Known      Partial        EM (or gradient ascent)
  Unknown    Full           Search through model space
  Unknown    Partial        EM + search through model space

Parameter estimation

- Assume known structure
- Goal: estimate the BN parameters θ, the entries in the local probability models P(X | Parents(X))
- A parameterization θ is good if it is likely to generate the observed data (i.i.d. samples):
  L(θ : D) = P(D | θ) = Πm P(x[m] | θ)
- Maximum likelihood estimation (MLE) principle: choose θ* so as to maximize L

Parameter estimation II

- The likelihood decomposes according to the structure of the network, so we get a separate estimation task for each parameter
- The MLE solution: for each value x of a node X and each instantiation u of Parents(X),
  θ*(x | u) = N(x, u) / N(u)
- Just need to collect the counts N(x, u) and N(u) (the sufficient statistics) for every combination of parents and children observed in the data
- MLE is equivalent to an assumption of a uniform prior over parameter values
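The per-parameter counting task above is short enough to sketch directly (a minimal sketch; names and the record format are my own assumptions):

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """MLE for one CPT, P(child | parents): theta*(x | u) = N(x, u) / N(u).
    data: list of dicts mapping variable name -> observed value (full observability)."""
    n_xu = Counter()  # sufficient statistics N(x, u)
    n_u = Counter()   # sufficient statistics N(u)
    for record in data:
        u = tuple(record[p] for p in parents)
        n_xu[(record[child], u)] += 1
        n_u[u] += 1
    return {(x, u): cnt / n_u[u] for (x, u), cnt in n_xu.items()}
```

Because the likelihood decomposes, calling this once per node (with that node's parent set) yields the full MLE parameterization of the network.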

Model selection

- Goal: select the best network structure, given the data
- Input: training data; a scoring function
- Output: a network that maximizes the score

Structure selection: Scoring

- Bayesian: place a prior over parameters and structure; a balance between model complexity and fit to the data comes out as a byproduct
- Score(G : D) = log P(G | D) ∝ log [P(D | G) P(G)]
  where P(D | G) is the marginal likelihood (it comes from our parameter estimates) and P(G) is the prior on structure
- The prior on structure can be any measure we want; typically a function of the network complexity
- Same key property, decomposability: Score(structure) = Σi Score(family of Xi)

Heuristic search

- Search over structures by applying local moves, each evaluated by its score change Δscore:
  - Add an edge (e.g., add E→C, with Δscore(C))
  - Delete an edge (e.g., delete E→A, with Δscore(A))
  - Reverse an edge (e.g., reverse E→A, with Δscore(A, E))

Exploiting decomposability

- To recompute scores after a move, we only need to re-score the families that changed in the last move (e.g., adding E→C changes only Δscore(C); deleting E→A changes only Δscore(A))
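A greedy hill-climbing search that exploits decomposability might look like the sketch below. Here `family_score` is a hypothetical stand-in for any decomposable scoring function (e.g., a per-family log marginal likelihood plus complexity penalty); edge reversal is omitted for brevity:

```python
def hill_climb(variables, family_score, max_iters=100):
    """Greedy structure search over single-edge additions and deletions.
    family_score(x, parent_set) -> float must be decomposable, so after a
    move only the changed family needs re-scoring."""
    parents = {x: frozenset() for x in variables}
    score = {x: family_score(x, parents[x]) for x in variables}

    for _ in range(max_iters):
        best_delta, best_move = 0.0, None
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                if y in parents[x]:          # candidate move: delete edge y -> x
                    new_ps = parents[x] - {y}
                else:                        # candidate move: add edge y -> x
                    if creates_cycle(parents, y, x):
                        continue
                    new_ps = parents[x] | {y}
                # Decomposability: only x's family score changes under this move
                delta = family_score(x, new_ps) - score[x]
                if delta > best_delta:
                    best_delta, best_move = delta, (x, new_ps)
        if best_move is None:
            break  # local optimum reached
        x, new_ps = best_move
        parents[x] = new_ps
        score[x] = family_score(x, new_ps)
    return parents

def creates_cycle(parents, y, x):
    """Adding edge y -> x creates a cycle iff x is already an ancestor of y."""
    stack, seen = [y], set()
    while stack:
        node = stack.pop()
        if node == x:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents[node])
    return False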

Variations on a theme

- Known structure, fully observable: only need to do parameter estimation
- Unknown structure, fully observable: do heuristic search through structure space, then parameter estimation
- Known structure, missing values: use expectation maximization (EM) to estimate parameters
- Known structure, hidden variables: apply adaptive probabilistic network (APN) techniques
- Unknown structure, hidden variables: too hard to solve!
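The slides do not work through an EM example, but the "known structure, missing values" case can be illustrated with a deliberately tiny model of my own choosing: a hidden choice between two biased coins (the missing value) generating an observed head count. The E-step fills in expected counts under the current parameters; the M-step is just the MLE counting from earlier, applied to those expected counts:

```python
def em_two_coins(trials, n_flips, init=(0.6, 0.5), iters=20):
    """EM for a minimal 'known structure, missing values' problem:
    hidden variable = which of two biased coins produced each trial,
    observed = number of heads in n_flips tosses. Returns both bias estimates."""
    t0, t1 = init
    for _ in range(iters):
        # E-step: posterior responsibility of each coin for each trial,
        # accumulated as expected head counts and expected toss counts
        h0 = n0 = h1 = n1 = 0.0
        for h in trials:
            l0 = t0 ** h * (1 - t0) ** (n_flips - h)
            l1 = t1 ** h * (1 - t1) ** (n_flips - h)
            r0 = l0 / (l0 + l1)  # P(coin 0 | this trial, current params)
            h0 += r0 * h
            n0 += r0 * n_flips
            h1 += (1 - r0) * h
            n1 += (1 - r0) * n_flips
        # M-step: MLE of each bias from the expected (not actual) counts
        t0, t1 = h0 / n0, h1 / n1
    return t0, t1
```

With trials that cluster into "mostly heads" and "mostly tails" groups, the two estimates separate toward the two underlying biases after a handful of iterations.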