Bayesian Framework: Finding the Best Model, Minimizing Model Complexity

Bayesian Framework (Lecture 5, CS567)

Finding the best model
- Maximum likelihood
- Maximum a posteriori
- Posterior mean estimator

Minimizing model complexity
- Ockham's razor
- Minimum Description Length

Parametrizing models

Anatomy of a model
- Model = parameter scheme + values for the parameters
- Example: model for a DNA sequence with 4 parameters, one for each character
  - Model M(w1): P(A) = P(T) = P(G) = P(C) = 0.25
  - Model M(w2): P(A) = P(G) = 0.3; P(T) = P(C) = 0.2
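A minimal Python sketch of this idea (the example sequence and the helper name are illustrative, not from the slides): each "model" is just a table of per-character probabilities, and the two parameter sets above can be compared on any sequence.

```python
# Two parameterizations of the same 4-parameter DNA model (from the slide).
M_w1 = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}
M_w2 = {"A": 0.30, "G": 0.30, "T": 0.20, "C": 0.20}

def sequence_probability(seq, model):
    """P(sequence | model), assuming independently drawn characters."""
    p = 1.0
    for ch in seq:
        p *= model[ch]
    return p

seq = "ATGGC"  # illustrative sequence, not from the slides
print(sequence_probability(seq, M_w1))  # 0.25**5 ≈ 0.00098
print(sequence_probability(seq, M_w2))  # 0.3*0.2*0.3*0.3*0.2 ≈ 0.00108
```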

Maximum Likelihood
- Likelihood: given a particular model, how likely is it that this data would have been observed?
  L(M(wi)) = P(D | M(wi))
- Maximum likelihood: given a number of candidate models, which one has the highest likelihood (the maximum value of L(M))?
  w_ML = argmax_w P(D | M(w))

Maximum Likelihood
- Example
  - Data: HHHTTT (treated as an ordered sequence, i.e., one particular permutation)
  - Model: binomial with parameter p(H), and p(T) = 1 - p(H)
  - Parameter set 1: p(H) = 0.5
  - Parameter set 2: p(H) = 0.25
- Likelihoods:
  P(D | M(w1)) = (0.5)^3 (0.5)^3 ≈ 0.0156
  P(D | M(w2)) = (0.25)^3 (0.75)^3 ≈ 0.0066
- Maximum likelihood estimate = M(w1); in fact L(M(w1)) > L(M(wi)) for all i ≠ 1

Maximum Likelihood
- Example
  - Data: HTTT (again as an ordered sequence)
  - Model: binomial with parameter p(H), and p(T) = 1 - p(H)
  - Parameter set 1: p(H) = 0.5
  - Parameter set 2: p(H) = 0.25
- Likelihoods:
  P(D | M(w1)) = (0.5)(0.5)^3 ≈ 0.0625
  P(D | M(w2)) = (0.25)(0.75)^3 ≈ 0.1055
- Maximum likelihood estimate = M(w2)! In fact L(M(w2)) > L(M(wi)) for all i ≠ 2
- So, is something wrong with this coin?
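A short Python sketch reproducing both coin examples above (the helper name is mine): it evaluates the sequence likelihood under each parameter set and shows that which parameterization wins depends on the data.

```python
def coin_likelihood(data, p_heads):
    """P(D | M(w)) for an ordered sequence of 'H'/'T' outcomes."""
    p_tails = 1.0 - p_heads
    like = 1.0
    for outcome in data:
        like *= p_heads if outcome == "H" else p_tails
    return like

for data in ["HHHTTT", "HTTT"]:
    l1 = coin_likelihood(data, 0.5)   # parameter set 1
    l2 = coin_likelihood(data, 0.25)  # parameter set 2
    print(data, round(l1, 4), round(l2, 4))
# HHHTTT 0.0156 0.0066  -> M(w1) wins
# HTTT   0.0625 0.1055  -> M(w2) wins
```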

Maximum Likelihood
- The maximum likelihood estimate is unreliable when the data set is small
- A prior is important for dealing with such errors
- As the data sample grows larger (more representative), the maximum likelihood estimate of the parameters tends to the 'true' value
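A hedged simulation of this point (the "true" bias of 0.6 and the sample sizes are my choices, not from the slides): the maximum likelihood estimate p̂ = #heads / n drifts toward the true value as the sample grows.

```python
import random

random.seed(0)
true_p = 0.6  # assumed 'true' coin bias, for illustration only

for n in [4, 40, 400, 4000]:
    heads = sum(random.random() < true_p for _ in range(n))
    print(n, heads / n)  # ML estimate of p(H); approaches 0.6 as n grows
```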

Maximum a posteriori
- Need to factor the prior into the maximum likelihood estimate
- Posterior ∝ (likelihood) × (prior) = P(D | M(wi)) P(wi | M)
- Maximum a posteriori estimate:
  w_MAP = argmax_w P(D | M(w)) P(w | M)
- From Bayes' theorem: P(w | M, D) = P(D | M(w)) P(w | M) / P(D | M)
- Since P(D | M) does not affect where the left-hand side is maximized, the numerator alone is sufficient to find the MAP estimate
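A sketch of the MAP idea for the coin example, assuming a Beta(a, b) prior over p(H); the prior, its hyperparameters, and the grid search are my illustrative choices, not from the slides. Because only the argmax matters, the denominator P(D | M) is never computed.

```python
def coin_likelihood(data, p_heads):
    # Same sequence likelihood helper as in the earlier sketch.
    like = 1.0
    for outcome in data:
        like *= p_heads if outcome == "H" else 1.0 - p_heads
    return like

def beta_prior(p, a=5.0, b=5.0):
    # Unnormalized Beta(a, b) density; the normalizer cancels in the argmax.
    return p ** (a - 1) * (1.0 - p) ** (b - 1)

data = "HTTT"
grid = [i / 1000 for i in range(1, 1000)]
# MAP: argmax over p of likelihood * prior (denominator P(D|M) omitted).
p_map = max(grid, key=lambda p: coin_likelihood(data, p) * beta_prior(p))
p_ml = max(grid, key=lambda p: coin_likelihood(data, p))
print(p_ml, p_map)  # ML: 0.25; MAP ≈ 0.42, pulled back toward 0.5 by the prior
```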

Posterior Mean Estimator
- Instead of taking the maximum, use the expectation of the model parameters under the posterior:
  w_PME = ∫ w P(w | D, M) dw
  (or the corresponding sum over the n parameter combinations in the discrete case)
- Makes sense when there is no clearly optimal choice (no sharp peak in parameter space)
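A sketch of the posterior mean estimator for the same coin setting, again assuming a Beta(5, 5) prior (my choice): the posterior over a grid of p(H) values is normalized and then averaged rather than maximized.

```python
def coin_likelihood(data, p_heads):
    # Same sequence likelihood helper as in the earlier sketches.
    like = 1.0
    for outcome in data:
        like *= p_heads if outcome == "H" else 1.0 - p_heads
    return like

def beta_prior(p, a=5.0, b=5.0):
    return p ** (a - 1) * (1.0 - p) ** (b - 1)

data = "HTTT"
grid = [i / 1000 for i in range(1, 1000)]
post = [coin_likelihood(data, p) * beta_prior(p) for p in grid]
z = sum(post)                  # normalizer (discrete stand-in for P(D | M))
post = [w / z for w in post]
p_pme = sum(p * w for p, w in zip(grid, post))  # posterior mean of p(H)
print(p_pme)  # ≈ 0.43: the posterior here is Beta(6, 8), mean 6/14
```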

Dealing with Model Complexity
- Ockham's razor: "The car is stopping at the cross-walk to let me cross, not to shoot a bullet at me"
- Go for the simplest explanation that matches the facts (probabilistically, of course)
- Introduce priors that penalize complex models, i.e., give simpler models higher prior probability
- Minimum Description Length (a closely related idea): prefer the most economical specification of the model
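One common way to make this concrete is a BIC-style description-length score, -log2 P(D | M) + (k/2) log2 n bits for a model with k parameters and n observations; this particular scoring rule and the two toy models below are my illustration, not part of the slides.

```python
import math

data = "HHHHHTTTTT"  # illustrative data, not from the slides
n = len(data)

# Model A: one shared p(H), fit by maximum likelihood.
p = data.count("H") / n
ll_a = sum(math.log2(p if c == "H" else 1 - p) for c in data)
k_a = 1

# Model B: a separate probability per position; fits the data perfectly.
ll_b = 0.0          # log2(1) for every position
k_b = n

def description_length(neg_ll, k):
    # Bits to encode the data under the model plus a BIC-style cost per parameter.
    return neg_ll + 0.5 * k * math.log2(n)

print(description_length(-ll_a, k_a))  # ≈ 11.7 bits: the simple model wins
print(description_length(-ll_b, k_b))  # ≈ 16.6 bits: perfect fit, heavy parameter cost
```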

Graphical Models
- Real world = a massive network of dependencies
- Model = a sparsely connected network (a reduction of dimensionality)
- Graph representation: edge = dependency; no edge = independence
- Graphs may be directed, undirected, or mixed (chain independence)
- Goal: factor the graph into clusters of local probabilities
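A tiny sketch of the "factor into local probabilities" idea, using a hypothetical three-variable directed chain A → B → C; the variables and conditional probability tables are invented for illustration.

```python
# Local probabilities for a hypothetical chain A -> B -> C (binary variables).
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
P_C_given_B = {True: {True: 0.6, False: 0.4}, False: {True: 0.25, False: 0.75}}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b | a) * P(c | b) -- the graph factorization."""
    return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]

# A chain of k binary variables needs O(k) local numbers instead of a
# 2^k-entry joint table; here we just check the factorized joint sums to 1.
total = sum(joint(a, b, c) for a in (True, False)
            for b in (True, False) for c in (True, False))
print(total)  # 1.0 (up to floating point)
```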

Graphical Models
- Undirected graphs
  - Markov networks / Markov random fields, Boltzmann machines
  - Symmetric dependencies
  - Statistical mechanics, image processing
- Directed graphs
  - Bayesian / belief / causal / influence networks
  - Temporal causality
  - Expert systems, neural networks, hidden Markov models

Graphical Models
- Neighborhood
  - Of a single variable
  - Of a set of inter-dependent variables (its boundary)
- Hidden variables (use the Expectation-Maximization algorithm)
- Hierarchy: different time scales / length scales
- Hyperparameters (α)
  - P(w) = ∫ P(w | α) P(α) dα, with prior P(α)
  - Computationally easier
- Mixture / hybrid modelling
  - P = Σ_{i=1}^{n} λ_i P_i (a weighted combination of n component models)
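A minimal sketch of the mixture idea (the component densities and weights are illustrative choices): the overall density is a weighted sum of simpler component densities, with weights that sum to one.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# P = sum_i lambda_i * P_i : two illustrative Gaussian components.
weights = [0.7, 0.3]                  # lambda_i, summing to 1
components = [(0.0, 1.0), (5.0, 2.0)]  # (mu, sigma) for each P_i

def mixture_pdf(x):
    return sum(w * gaussian_pdf(x, mu, s)
               for w, (mu, s) in zip(weights, components))

print(mixture_pdf(0.0), mixture_pdf(5.0))  # mass near both component means
```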