Bayesian Rule & Gaussian Mixture Models


Bayesian Rule & Gaussian Mixture Models Jianping Fan Dept of Computer Science UNC-Charlotte

Basic Classification (binary). Input: an e-mail message (e.g. one full of "!!!!$$$!!!!"); output: Spam vs. Not-Spam. Spam filtering is a binary classification problem.

Basic Classification (multi-class). Input: an image of a character; output: one of the 26 letters, e.g. "C" vs. the other 25 characters. Character recognition is a multi-class classification problem.

Structured Classification. Input: a handwritten word or a 3D scene; output: a structured object, e.g. the word "brace", or a graph/tree model of a building in 3D object recognition. Handwriting recognition and 3D object recognition have structured outputs.

Overview of Bayesian Decision. Bayesian classification, one example: how to decide whether a patient is sick or healthy, based on (a) a probabilistic model of the observed data (the class-conditional data distributions) and (b) prior knowledge (class ratios or importance). The majority class has the bigger voice in the prior.

Overview of Bayesian Decision. For a test patient we need: the group assignment for the test patient; prior knowledge about the assigned group (c); and the properties of the assigned group (sick or healthy).

Overview of Bayesian Decision. Observation: the Bayesian decision process is a data-modeling process, e.g. it estimates the data distribution of each group. Compare with K-means clustering: what are the relationships and differences?

Bayes' Rule: the denominator acts as a normalization term. Who is who in Bayes' rule?
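In symbols (a standard statement of the rule, reconstructed here because the slide's own formula appeared only as an image):

P(h \mid d) = \frac{P(d \mid h)\, P(h)}{P(d)}, \qquad P(d) = \sum_{h'} P(d \mid h')\, P(h')

Here P(h|d) is the posterior, P(d|h) the likelihood, P(h) the prior, and P(d) the normalizing evidence.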

Classification problem. Training data: examples of the form (d, h(d)), where d are the data objects to classify (inputs) and h(d) is the correct class label for d, with h(d) ∈ {1, …, K}. Goal: given a new object d_new, predict h(d_new).

Why Bayesian? It provides practical learning algorithms (e.g. Naïve Bayes); prior knowledge and observed data can be combined; and it is a generative (model-based) approach, which offers a useful conceptual framework: any kind of object, including sequences, can be classified once a probabilistic model is specified.

Bayes' Rule, who is who: P(d|h) is the data distribution of group h; P(h) is the importance (prior) of group h; P(d) is the data distribution of the whole data set.

Bayes' Rule: the work needed to support a Bayesian decision is estimating the data distribution of the whole data set, estimating the data distribution of each specific group, and obtaining prior knowledge about each specific group.

Gaussian Mixture Model (GMM) How to estimate the data distribution for a given dataset?

Gaussian Mixture Model (GMM)

Gaussian Mixture Model (GMM). What does a GMM mean? By analogy, the same total can be decomposed in many ways: 100 = 5*10 + 2*20 + 2*5, 100 = 100*1, 100 = 10*10, and so on. Likewise a density can be decomposed into different numbers of components, and a GMM may prefer a larger K with "smaller" Gaussians.

Gaussian Mixture Model (GMM)

Gaussian Mixture Model (GMM) Real Data Distribution

Gaussian Mixture Model (GMM) Real Data Distribution Approximated Gaussian functions

Gaussian Mixture Model (GMM). Parameters to be estimated: the mixing weights, means and covariances of the K components. Training set: the observed samples. We assume K is available (pre-defined); left to themselves, algorithms may always prefer a larger K!
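Written out in the usual notation (the slide's own formula was shown only as an image):

p(x \mid \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}, \quad \sum_{k=1}^{K} \pi_k = 1, \; \pi_k \ge 0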

Matching between a Gaussian function and the observed samples (sampling).

Maximum Likelihood. Given the samples x, the likelihood is a function of μ and σ², and we want to maximize it.

Log-Likelihood Function. Maximize the log-likelihood instead, by setting its derivatives with respect to μ and σ² to zero.

Max. the Log-Likelihood Function

Max. the Log-Likelihood Function
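For a single Gaussian the maximizers have the familiar closed form (standard result, written out here because the slides showed the derivation only as images):

\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2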

Gaussian Mixture Models. Rather than identifying clusters by their "nearest" centroids, fit a set of k Gaussians to the data by maximum likelihood over a mixture model:

p(x) = \pi_0 f_0(x) + \pi_1 f_1(x) + \pi_2 f_2(x) + \ldots + \pi_k f_k(x)

GMM example

Mixture Models. Formally, a mixture model is a weighted sum of a number of pdfs, where the weights are determined by a distribution over the components.

Gaussian Mixture Models. A GMM is a weighted sum of a number of Gaussians, where the weights are again determined by a distribution over the components (the mixing coefficients).

Expectation Maximization. Both the training of GMMs and of graphical models with latent variables can be accomplished using Expectation Maximization (EM). Step 1, Expectation (E-step): evaluate the "responsibilities" of each cluster with the current parameters. Step 2, Maximization (M-step): re-estimate the parameters using the current "responsibilities". This is similar to k-means training.

Latent Variable Representation. We can represent a GMM using a latent variable that selects the mixture component. What does this give us? (Plate notation omitted here.)
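In the usual latent-variable form (standard notation, not reproduced from the slide image), z is a 1-of-K indicator with

p(z_k = 1) = \pi_k, \qquad p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)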

GMM data and Latent variables

One last bit: we have representations of the joint p(x, z) and the marginal p(x). The conditional p(z|x) can be derived using Bayes' rule; it is the responsibility that a mixture component takes for explaining an observation x.

Maximum Likelihood over a GMM. As usual: identify a likelihood function and set its partial derivatives to zero.

Maximum Likelihood of a GMM Optimization of means.

Maximum Likelihood of a GMM Optimization of covariance Note the similarity to the regular MLE without responsibility terms.

Maximum Likelihood of a GMM. Optimization of the mixing coefficients, using a Lagrange multiplier for the constraint that they sum to one:

\frac{\partial}{\partial \pi_k}\left[\ln p(X \mid \pi, \mu, \Sigma) + \lambda\left(\sum_{k=1}^{K} \pi_k - 1\right)\right] = 0

MLE of a GMM

EM for GMMs: initialize the parameters and evaluate the log likelihood; Expectation step: evaluate the responsibilities; Maximization step: re-estimate the parameters; then check for convergence.

EM for GMMs E-step: Evaluate the Responsibilities

EM for GMMs M-Step: Re-estimate Parameters

EM for GMMs Evaluate the log likelihood and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to E-Step.
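A minimal NumPy sketch of this E-step / M-step loop (illustrative only; the function names, the initialization scheme and the log-likelihood convergence test are my own assumptions, not the course's reference code):

import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density N(x | mu, Sigma) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialization: K random data points as means, shared data covariance, uniform weights
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk) = pi_k N(x_n | mu_k, Sigma_k) / sum_j (...)
        dens = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k, mu_k, Sigma_k from the responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        # Evaluate the log likelihood and check for convergence
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma

For an (N, d) data array X, pi, mu, Sigma, gamma = em_gmm(X, K=3) returns the mixing weights, means, covariances and responsibilities of a 3-component fit.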

Visual example of EM

Incorrect Number of Gaussians

Incorrect Number of Gaussians

Too many parameters. For n-dimensional data, each full-covariance Gaussian already requires n mean parameters plus n(n+1)/2 covariance parameters, i.e. at least 3n parameters per component once n ≥ 3, so the number of parameters grows quickly with the dimension.

Relationship to K-means. K-means makes hard decisions: each data point gets assigned to a single cluster. GMM/EM makes soft decisions: each data point yields a posterior p(z|x). Soft K-means is a special case of EM.

Soft K-means as GMM/EM. Assume equal, isotropic covariance matrices for every mixture component, so the likelihood of each component is

p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi\epsilon)^{M/2}}\exp\left\{-\frac{1}{2\epsilon}\lVert x-\mu_k\rVert^2\right\}

As epsilon approaches zero, the responsibility of the nearest mean approaches unity and all other responsibilities approach zero.
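Concretely, the responsibilities in this isotropic model become (standard result written out here; the slide showed only the final expression as an image):

r_{nk} = \frac{\pi_k \exp\{-\lVert x_n - \mu_k \rVert^2 / 2\epsilon\}}{\sum_{j} \pi_j \exp\{-\lVert x_n - \mu_j \rVert^2 / 2\epsilon\}} \;\longrightarrow\; \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_n - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases} \quad \text{as } \epsilon \to 0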

Soft K-Means as GMM/EM. In the limit as epsilon approaches zero, the expected complete-data log likelihood reduces to the (negative) within-cluster variability, i.e. the K-means distortion measure. Note: only the means are re-estimated in soft K-means; the covariance matrices are all tied.

General form of EM. Given a joint distribution p(X, Z|θ) over observed and latent variables, we want to maximize the likelihood p(X|θ). Initialize the parameters; E-step: evaluate p(Z|X, θ_old); M-step: re-estimate the parameters by maximizing the expectation of the complete-data log likelihood; then check for convergence of the parameters or the likelihood.

Gaussian Mixture Model

The conditional probability p(zk = 1|x), denoted γ(zk), is obtained by Bayes' theorem. We view πk as the prior probability of zk = 1 and γ(zk) as the corresponding posterior probability: γ(zk) is the responsibility that component k takes for explaining the observation x.
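Explicitly (standard form of this Bayes-theorem step; the slide's equation was an image):

\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)}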

Maximum Likelihood Estimation. Given a data set X = {x1, …, xN}, the log of the likelihood function is

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln\left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

subject to \sum_{k=1}^{K} \pi_k = 1.

Setting the derivatives of ln p(X|π, μ, Σ) with respect to μk to zero, we obtain μk = (1/Nk) Σn γ(znk) xn, where Nk = Σn γ(znk) is the effective number of points assigned to component k.

Finally, we maximize ln p(X|π, μ, Σ) with respect to the mixing coefficients πk. We use a Lagrange multiplier: for an objective function f(x) and a constraint g(x) = 0, maximizing f(x) subject to g(x) = 0 becomes maximizing f(x) + λ g(x); here the constrained objective is ln p(X|π, μ, Σ) + λ(Σk πk − 1).

Setting the derivative with respect to πk to zero and using the constraint, we find λ = −N and πk = Nk / N.

1. Initialise the means μk, covariances Σk and mixing coefficients πk. 2. E step: evaluate the responsibilities using the current parameter values. 3. M step: re-estimate the parameters using the current responsibilities. 4. Evaluate the log likelihood and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.

Source code for GMM 1. EPFL http://lasa.epfl.ch/sourcecode/ 2. Google Source Code on GMM http://code.google.com/p/gmmreg/ 3. GMM & EM http://crsouza.blogspot.com/2010/10/gaussian-mixture-models-and-expectation.html
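Beyond the links above, a readily available implementation is scikit-learn's GaussianMixture; the short usage sketch below is an addition of this write-up, not part of the original slide:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))   # placeholder data, illustration only
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)                    # EM training
resp = gmm.predict_proba(X)   # responsibilities gamma(z_nk)
labels = gmm.predict(X)       # hard component assignments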

Probabilities (auxiliary slide for memory refreshing). Have two dice h1 and h2. The probability of rolling an i given die h1 is denoted P(i|h1); this is a conditional probability. Pick a die at random with probability P(hj), j = 1 or 2. The probability of picking die hj and rolling an i with it is called the joint probability and is P(i, hj) = P(hj) P(i|hj). For any events X and Y, P(X, Y) = P(X|Y) P(Y). If we know P(X, Y), then the so-called marginal probability P(X) can be computed as P(X) = Σ_Y P(X, Y) = Σ_Y P(X|Y) P(Y).

Does patient have cancer or not? A patient takes a lab test and the result comes back positive. It is known that the test returns a correct positive result in only 98% of the cases and a correct negative result in only 97% of the cases. Furthermore, only 0.008 of the entire population has this disease. 1. What is the probability that this patient has cancer? 2. What is the probability that he does not have cancer? 3. What is the diagnosis?

Choosing Hypotheses. The Maximum Likelihood hypothesis is the h that maximizes P(d|h). Generally we want the most probable hypothesis given the training data; this is the Maximum A Posteriori hypothesis, the h that maximizes P(h|d), or equivalently P(d|h) P(h). Useful observation: the MAP hypothesis does not depend on the denominator P(d).

Now we compute the diagnosis. To find the Maximum Likelihood hypothesis, we evaluate P(d|h) for the data d, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it. To find the Maximum A Posteriori hypothesis, we evaluate P(d|h) P(h) for the same data and choose the hypothesis (diagnosis) that maximises it; this is the same as choosing the hypothesis with the higher posterior probability.
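Using the numbers given two slides earlier (worked out here; the slide itself showed the arithmetic only as an image):

P(+ \mid \text{cancer})\, P(\text{cancer}) = 0.98 \times 0.008 \approx 0.0078, \qquad P(+ \mid \neg\text{cancer})\, P(\neg\text{cancer}) = 0.03 \times 0.992 \approx 0.0298

so the MAP diagnosis is "no cancer", and the normalized posterior is P(cancer | +) ≈ 0.0078 / (0.0078 + 0.0298) ≈ 0.21.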

Naïve Bayes Classifier. What can we do if our data d has several attributes? Naïve Bayes assumption: the attributes that describe data instances are conditionally independent given the classification hypothesis. It is a simplifying assumption that may obviously be violated in reality; in spite of that, it works well in practice. The Bayesian classifier that uses the Naïve Bayes assumption and computes the MAP hypothesis is called the Naïve Bayes classifier. It is one of the most practical learning methods, with successful applications in medical diagnosis and text classification.

Example. ‘Play Tennis’ data  

Naïve Bayes solution. Classify any new datum instance x = (a1, …, aT) as the hypothesis h that maximizes P(h) Πt P(at|h). To do this, we need to estimate the parameters from the training examples: for each target value (hypothesis) h, estimate P(h); and for each attribute value at of each datum instance, estimate P(at|h).
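A small Python sketch of this counting-based estimation and the argmax rule (illustrative only: the class name NaiveBayes and the Laplace-smoothing parameter alpha are my additions, not part of the slides):

from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical Naive Bayes: h_NB = argmax_h P(h) * prod_t P(a_t | h)."""

    def fit(self, X, y, alpha=1.0):
        # X: list of attribute tuples, y: list of class labels (hypotheses)
        self.classes = sorted(set(y))
        self.prior = {h: y.count(h) / len(y) for h in self.classes}    # P(h)
        self.counts = {h: defaultdict(Counter) for h in self.classes}  # counts for P(a_t | h)
        self.values = defaultdict(set)                                 # observed values of attribute t
        for xi, h in zip(X, y):
            for t, a in enumerate(xi):
                self.counts[h][t][a] += 1
                self.values[t].add(a)
        self.alpha = alpha  # Laplace smoothing, an addition not shown on the slide
        return self

    def _cond(self, h, t, a):
        # Estimate P(a_t = a | h) from counts, with smoothing
        n_h = sum(self.counts[h][t].values())
        return (self.counts[h][t][a] + self.alpha) / (n_h + self.alpha * len(self.values[t]))

    def predict(self, x):
        def score(h):
            p = self.prior[h]
            for t, a in enumerate(x):
                p *= self._cond(h, t, a)
            return p
        return max(self.classes, key=score)

With a training table like 'Play Tennis', NaiveBayes().fit(train_X, train_y).predict(('Sunny', 'Cool', 'High', 'Strong')) would return the NB classification of the datum used on the next slide.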

Based on the examples in the table, classify the following datum x: x=(Outl=Sunny, Temp=Cool, Hum=High, Wind=strong) That means: Play tennis or not? Working:

Learning to classify text. Learn from examples which articles are of interest; the attributes are the words. Observe that the Naïve Bayes assumption just means that we have a random sequence model within each class! NB classifiers are among the most effective for this task. Resource for those interested: Tom Mitchell, Machine Learning (book), Chapter 6.

Results on a benchmark text corpus

Remember: Bayes' rule can be turned into a classifier. Maximum A Posteriori (MAP) hypothesis estimation incorporates prior knowledge; Maximum Likelihood estimation does not. The Naive Bayes classifier is a simple but effective Bayesian classifier for vector data (i.e. data with several attributes) that assumes the attributes are independent given the class. Bayesian classification is a generative approach to classification.

Resources. Textbook reading (contains details about using Naïve Bayes for text classification): Tom Mitchell, Machine Learning (book), Chapter 6. Software, NB for classifying text: http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html Useful reading for those interested in learning more about NB classification, beyond the scope of this module: http://www-2.cs.cmu.edu/~tom/NewChapters.html

GMM & Bayesian Decision, open issues: the number of Gaussians K; the initialization of the parameters; and the tendency to pay more attention to the majority class.

GMM & Bayesian Rule. Advantages: intuitive modeling; probabilistic outputs; a natural multi-class solution. Disadvantages: the choice of the number of Gaussians K; the initialization of the parameters; it is not discriminative; it scales poorly with the feature dimension; and majority vs. minority imbalance.

GMM for K-Means Clustering

GMM for K-Means Clustering

KNN (K nearest neighbors) Classifier Test sample

Problems of GMM. The number of Gaussians K: how can K be estimated? Majority vs. minority: how can the right decision be made for rare events?

Problems of GMM. Parameter initialization and high feature dimensions: what are the solutions? The model is not discriminative: what is the solution?