Bayesian Learning
Thanks to Nir Friedman, HU

Example
Suppose we are required to build a controller that removes bad oranges from a packaging line. Decisions are made based on a sensor that reports the overall color of each orange.
[Figure: a sensor mounted over the packaging line, flagging bad oranges.]

Classifying oranges
Suppose we know all the aspects of the problem:
- Prior probabilities of good (+1) and bad (-1) oranges:
  P(C = +1) = probability of a good orange
  P(C = -1) = probability of a bad orange
  Note: P(C = +1) + P(C = -1) = 1
- Assumption: oranges are independent; the occurrence of a bad orange does not depend on the previous oranges.

Classifying oranges (cont.)
Sensor performance: let X denote the sensor measurement for each type of orange. The class-conditional densities p(X | C = +1) and p(X | C = -1) describe how the sensor responds to good and bad oranges, respectively.

Bayes Rule
Given this knowledge, we can compute the posterior probabilities using Bayes rule:
P(C | X) = p(X | C) P(C) / p(X), where p(X) = p(X | C = +1) P(C = +1) + p(X | C = -1) P(C = -1).
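A minimal numeric sketch of this computation; the priors and the Gaussian sensor models below are illustrative assumptions, not values from the slides.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Assumed priors and class-conditional sensor models (illustrative only).
prior = {+1: 0.9, -1: 0.1}                               # most oranges are good
likelihood = {+1: lambda x: gaussian_pdf(x, 6.0, 1.0),   # p(X | C = +1)
              -1: lambda x: gaussian_pdf(x, 3.0, 1.5)}   # p(X | C = -1)

def posterior(x):
    """Bayes rule: P(C | X = x) for both classes."""
    joint = {c: likelihood[c](x) * prior[c] for c in (+1, -1)}
    evidence = sum(joint.values())                        # p(X = x)
    return {c: joint[c] / evidence for c in joint}

print(posterior(4.0))   # the two posteriors, summing to 1
```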

Posterior of Oranges
[Figure: the class-conditional likelihoods p(X | C = +1) and p(X | C = -1), combined with the priors P(C = +1) and P(C = -1) and then normalized, yield the posterior curves P(C = +1 | X) and P(C = -1 | X), which sum to 1 for every X.]

Decision making
Intuition: predict "Good" if P(C = +1 | X) > P(C = -1 | X); predict "Bad" otherwise.
[Figure: the two posterior curves; the point where they cross separates the "good" region from the "bad" region.]

Loss function
Assume we have classes +1, -1, and suppose we can make predictions a1, …, ak. A loss function L(ai, cj) describes the loss associated with making prediction ai when the class is cj.

Prediction \ Real label    -1    +1
Bad                         1     5
Good                       10     0

Expected Risk
Given the estimates of P(C | X), we can compute the expected conditional risk of each decision:
R(a | X) = Σ_c L(a, c) P(C = c | X)

The Risk in Oranges
With the loss table above, the conditional risks are
R(Bad | X) = 1·P(C = -1 | X) + 5·P(C = +1 | X)
R(Good | X) = 10·P(C = -1 | X)
[Figure: R(Good | X) falls from 10 to 0 and R(Bad | X) rises from 1 to 5 as P(C = +1 | X) goes from 0 to 1; the optimal decision switches where the curves cross.]
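Continuing the illustrative sketch, a hedged example of turning posteriors into expected risks with the loss table reconstructed above; the posterior value fed in at the end is an assumed example.

```python
# Loss table L[prediction][true class]; values follow the reconstructed table above.
LOSS = {"Bad":  {-1: 1,  +1: 5},
        "Good": {-1: 10, +1: 0}}

def expected_risk(posterior):
    """R(a | X) = sum_c L(a, c) * P(C = c | X) for each possible prediction a."""
    return {a: sum(loss * posterior[c] for c, loss in costs.items())
            for a, costs in LOSS.items()}

def decide(posterior):
    """Bayes-optimal decision: the prediction with minimal expected risk."""
    risks = expected_risk(posterior)
    return min(risks, key=risks.get), risks

# Illustrative posterior: the sensor reading slightly favours "good".
print(decide({+1: 0.7, -1: 0.3}))   # here R(Good) < R(Bad), so "Good" is chosen
```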

Optimal Decisions
Goal: minimize risk.
Optimal decision rule: "Given X = x, predict ai if R(ai | X = x) = min_a R(a | X = x)" (break ties arbitrarily).
Note: randomized decisions do not help.

0-1 Loss
If we don't have prior knowledge, it is common to use the 0-1 loss:
L(a, c) = 0 if a = c
L(a, c) = 1 otherwise
Consequence: R(a | X) = P(a ≠ c | X)
Decision rule: "choose ai if P(C = ai | X) = max_a P(C = a | X)"
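A one-line derivation (standard, though not spelled out on the slide) of why the 0-1 loss reduces the Bayes-optimal decision to picking the most probable class:

```latex
R(a \mid X) = \sum_{c} L(a,c)\, P(C=c \mid X)
            = \sum_{c \neq a} P(C=c \mid X)
            = 1 - P(C=a \mid X),
```

so minimizing R(a | X) over a is the same as maximizing the posterior P(C = a | X).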

Bayesian Decisions: Summary
Decisions are based on two components:
- the conditional distribution P(C | X)
- the loss function L(A, C)
Pros: specifies optimal actions in the presence of noisy signals; can deal with skewed loss functions.
Cons: requires P(C | X).

Simple Statistics: Binomial Experiment
When tossed, a thumbtack (or coin) can land in one of two positions: Head or Tail.
We denote by θ the (unknown) probability P(H).
Estimation task: given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ.

Why Learning is Possible?
Suppose we perform M independent flips of the thumbtack. The number of heads we see, N_H, follows a binomial distribution:
P(N_H = k) = C(M, k) θ^k (1 - θ)^{M-k}, with mean Mθ.
This suggests that we can estimate θ by N_H / M.

Maximum Likelihood Estimation
MLE principle: learn the parameters that maximize the likelihood function L(θ : D) = P(D | θ).
This is one of the most commonly used estimators in statistics: it is intuitively appealing and its properties are well studied.

Computing the Likelihood Function
To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):
L(θ : D) = θ^{N_H} (1 - θ)^{N_T}
Applying the MLE principle we get
θ̂ = N_H / (N_H + N_T)
N_H and N_T are sufficient statistics for the binomial distribution.
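A small sketch of the MLE computation for an observed flip sequence; the sequence itself is an assumed example.

```python
from collections import Counter

flips = list("HTHHTHHH")          # assumed example data
counts = Counter(flips)
n_heads, n_tails = counts["H"], counts["T"]

# MLE for the binomial parameter theta = P(H): the fraction of heads.
theta_mle = n_heads / (n_heads + n_tails)
print(theta_mle)                  # 6/8 = 0.75
```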

Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
Formally, s(D) is a sufficient statistic if, for any two datasets D and D',
s(D) = s(D')  ⇒  L(θ : D) = L(θ : D').
[Figure: many different datasets map to the same sufficient statistics.]

Maximum A Posteriori (MAP)
Suppose we observe the sequence H, H. The MLE estimate is P(H) = 1, P(T) = 0.
Should we really believe that tails are impossible at this stage? Such an estimate can have a disastrous effect: if we assume that P(T) = 0, then we are willing to act as though this outcome is impossible.

Laplace Correction
Suppose we observe n coin flips with k heads.
MLE: θ̂ = k / n
Laplace correction: θ̂ = (k + 1) / (n + 2), as though we observed one additional H and one additional T.
Can we justify this estimate? Uniform prior!
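A small sketch contrasting the two estimates on the troublesome all-heads case from the MAP slide:

```python
def mle(k, n):
    """Maximum-likelihood estimate of P(H) from k heads in n flips."""
    return k / n

def laplace(k, n):
    """Laplace-corrected estimate: pretend we saw one extra head and one extra tail."""
    return (k + 1) / (n + 2)

# The H, H sequence from the MAP slide: n = 2 flips, k = 2 heads.
print(mle(2, 2))      # 1.0  -- claims tails are impossible
print(laplace(2, 2))  # 0.75 -- still leaves probability for tails
```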

Bayesian Reasoning
In Bayesian reasoning we represent our uncertainty about the unknown parameter θ by a probability distribution.
This probability distribution can be viewed as a subjective probability: a personal judgment of uncertainty.

Bayesian Inference
We start with:
- P(θ): the prior distribution over the values of θ
- P(x1, …, xn | θ): the likelihood of the examples given a known value θ
Given examples x1, …, xn, we can compute the posterior distribution on θ:
P(θ | x1, …, xn) = P(x1, …, xn | θ) P(θ) / P(x1, …, xn)
where the marginal likelihood is
P(x1, …, xn) = ∫ P(x1, …, xn | θ) P(θ) dθ

Binomial Distribution: Laplace Est.
In this case the unknown parameter is θ = P(H).
Simplest prior: P(θ) = 1 for 0 < θ < 1 (the uniform prior).
Likelihood: P(x1, …, xn | θ) = θ^k (1 - θ)^{n-k}, where k is the number of heads in the sequence.
Marginal likelihood: P(x1, …, xn) = ∫_0^1 θ^k (1 - θ)^{n-k} dθ

Marginal Likelihood
Using integration by parts we have:
∫_0^1 θ^k (1 - θ)^{n-k} dθ = [(n - k) / (k + 1)] ∫_0^1 θ^{k+1} (1 - θ)^{n-k-1} dθ
Multiplying both sides by C(n, k), we have
C(n, k) ∫_0^1 θ^k (1 - θ)^{n-k} dθ = C(n, k+1) ∫_0^1 θ^{k+1} (1 - θ)^{n-(k+1)} dθ

Marginal Likelihood - Cont.
The recursion terminates when k = n:
C(n, n) ∫_0^1 θ^n dθ = 1 / (n + 1)
Thus
P(x1, …, xn) = ∫_0^1 θ^k (1 - θ)^{n-k} dθ = 1 / [(n + 1) C(n, k)]
We conclude that the posterior is
P(θ | x1, …, xn) = (n + 1) C(n, k) θ^k (1 - θ)^{n-k}

Bayesian Prediction
How do we predict using the posterior? We can think of this as computing the probability of the next element in the sequence:
P(x_{n+1} | x1, …, xn) = ∫ P(x_{n+1} | θ) P(θ | x1, …, xn) dθ
Assumption: if we know θ, the probability of X_{n+1} is independent of X1, …, Xn.

Bayesian Prediction (cont.)
Thus, we conclude that
P(X_{n+1} = H | x1, …, xn) = ∫_0^1 θ (n + 1) C(n, k) θ^k (1 - θ)^{n-k} dθ = (k + 1) / (n + 2)
which is exactly the Laplace correction.
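A quick numerical sanity check of this result (a sketch only; the midpoint-rule integration is an approximation, and k and n are assumed example values):

```python
import math

def predictive_head(k, n, grid=100_000):
    """Approximate P(X_{n+1} = H | k heads in n flips) under a uniform prior
    by integrating theta * posterior(theta) over (0, 1) on a grid."""
    c = math.comb(n, k)
    total = 0.0
    for i in range(grid):
        theta = (i + 0.5) / grid                               # midpoint rule
        posterior = (n + 1) * c * theta**k * (1 - theta)**(n - k)
        total += theta * posterior / grid
    return total

k, n = 7, 10
print(predictive_head(k, n))   # ~0.6667
print((k + 1) / (n + 2))       # 8/12 = 0.666...
```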

Naïve Bayes

Bayesian Classification: Binary Domain
Consider the following situation:
- Two classes: -1, +1
- Each example is described by N attributes; each Xi is a binary variable with values 0, 1.
[Example dataset: a table with columns X1, X2, …, XN and the class label C ∈ {+1, -1}.]

Binary Domain - Priors
How do we estimate P(C)? Simple binomial estimation: count the number of instances with C = -1 and with C = +1.

Binary Domain - Attribute Probability
How do we estimate P(X1, …, XN | C)? Two sub-problems:
- a training set for P(X1, …, XN | C = +1): the instances with C = +1
- a training set for P(X1, …, XN | C = -1): the instances with C = -1

Naïve Bayes
Naïve Bayes assumption:
P(X1, …, XN | C) = ∏_i P(Xi | C)
This is an independence assumption: each attribute Xi is independent of the other attributes once we know the value of C.
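A compact sketch of a Bernoulli naïve Bayes classifier along these lines; the toy dataset, the Laplace smoothing of every count, and the helper names are illustrative choices, not taken from the slides.

```python
import math
from collections import defaultdict

def train(examples):
    """examples: list of (attributes, label) with binary attribute tuples and labels in {-1, +1}.
    Returns class priors and per-attribute P(X_i = 1 | C), both Laplace-smoothed."""
    class_count = defaultdict(int)
    ones_count = defaultdict(lambda: defaultdict(int))   # ones_count[c][i] = #{X_i = 1, C = c}
    for x, c in examples:
        class_count[c] += 1
        for i, v in enumerate(x):
            ones_count[c][i] += v
    n = len(examples)
    n_attrs = len(examples[0][0])
    prior = {c: (class_count[c] + 1) / (n + 2) for c in (-1, +1)}
    theta = {c: [(ones_count[c][i] + 1) / (class_count[c] + 2) for i in range(n_attrs)]
             for c in (-1, +1)}
    return prior, theta

def predict(x, prior, theta):
    """Pick the class maximizing log P(C) + sum_i log P(X_i | C)."""
    def log_joint(c):
        return math.log(prior[c]) + sum(
            math.log(theta[c][i] if v == 1 else 1 - theta[c][i]) for i, v in enumerate(x))
    return max((-1, +1), key=log_joint)

# Toy data: two binary attributes, class +1 tends to have X1 = 1.
data = [((1, 0), +1), ((1, 1), +1), ((0, 1), -1), ((0, 0), -1)]
prior, theta = train(data)
print(predict((1, 0), prior, theta))   # expected +1
```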

Naïve Bayes: Boolean Domain
Parameters: θ_{i|c} = P(Xi = 1 | C = c) for each i and each class c.
How do we estimate θ_{1|+1}? Simple binomial estimation: count the number of 1 and 0 values of X1 among the instances where C = +1.

Interpretation of Naïve Bayes
Under the naïve Bayes assumption, the posterior odds factor into a product of per-attribute terms:
P(C = +1 | X1, …, XN) / P(C = -1 | X1, …, XN) = [P(C = +1) / P(C = -1)] ∏_i [P(Xi | C = +1) / P(Xi | C = -1)]

Interpretation of Naïve Bayes (cont.)
Each Xi "votes" about the prediction:
- If P(Xi | C = -1) = P(Xi | C = +1), then Xi has no say in the classification.
- If P(Xi | C = -1) = 0, then Xi overrides all other votes ("veto").

Interpretation of Naïve Bayes (cont.)
Set wi = log [P(Xi | C = +1) / P(Xi | C = -1)] for each attribute, and w0 = log [P(C = +1) / P(C = -1)] for the prior.
Classification rule: predict +1 if w0 + Σ_i wi > 0, and -1 otherwise; see the derivation below.
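A hedged reconstruction of the log-space rule (the weight notation w_i is ours, not necessarily the slide's):

```latex
\log\frac{P(C=+1 \mid X_1,\dots,X_N)}{P(C=-1 \mid X_1,\dots,X_N)}
  = \underbrace{\log\frac{P(C=+1)}{P(C=-1)}}_{w_0}
  + \sum_{i=1}^{N} \underbrace{\log\frac{P(X_i \mid C=+1)}{P(X_i \mid C=-1)}}_{w_i},
\qquad
\text{predict } +1 \iff w_0 + \sum_i w_i > 0.
```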

Normal Distribution
The Gaussian distribution:
p(x) = (1 / (√(2π) σ)) exp( -(x - μ)² / (2σ²) )
[Figure: the densities of N(0, 1²) and N(4, 2²).]

Maximum Likelihood Estimate
Suppose we observe x1, …, xm. Simple calculations show that the MLE is
μ̂ = (1/m) Σ_i xi
σ̂² = (1/m) Σ_i (xi - μ̂)²
The sufficient statistics are Σ_i xi and Σ_i xi².
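A short sketch of these estimates (the sample below is an assumed example):

```python
xs = [4.1, 3.8, 5.0, 4.4, 3.7]      # assumed observations x1, ..., xm

m = len(xs)
sum_x = sum(xs)                      # sufficient statistic: sum of x_i
sum_x2 = sum(x * x for x in xs)      # sufficient statistic: sum of x_i^2

mu_mle = sum_x / m                   # MLE of the mean
var_mle = sum_x2 / m - mu_mle ** 2   # MLE of the variance (the 1/m, not 1/(m-1), estimator)
print(mu_mle, var_mle)
```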

Naïve Bayes with Gaussian Distributions
Recall P(C = +1 | X1, …, XN) ∝ P(C = +1) ∏_i P(Xi | C = +1).
Assume each P(Xi | C = c) is Gaussian, where the mean of Xi depends on the class (μ_{i,+1} vs. μ_{i,-1}) but the variance σ_i² does not.

Naïve Bayes with Gaussian Distributions (cont.)
With a shared variance, each attribute's vote becomes
log [P(Xi | C = +1) / P(Xi | C = -1)] = [(μ_{i,+1} - μ_{i,-1}) / σ_i²] · (Xi - (μ_{i,+1} + μ_{i,-1}) / 2)
the (scaled) distance between the class means times the distance of Xi to the midway point between them.
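A sketch of the algebra behind this (our notation; the quadratic terms cancel only because the variances are shared across classes):

```latex
\log\frac{N(X_i;\mu_{i,+1},\sigma_i^2)}{N(X_i;\mu_{i,-1},\sigma_i^2)}
 = \frac{(X_i-\mu_{i,-1})^2 - (X_i-\mu_{i,+1})^2}{2\sigma_i^2}
 = \frac{\mu_{i,+1}-\mu_{i,-1}}{\sigma_i^2}
   \left(X_i - \frac{\mu_{i,+1}+\mu_{i,-1}}{2}\right).
```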

Different Variances?
If we allow different variances for the two classes, the classification rule is more complex: the log-ratio term is quadratic in Xi.