BAYESIAN LEARNING & GAUSSIAN MIXTURE MODELS
Jianping Fan, Dept. of Computer Science, UNC-Charlotte
BASIC CLASSIFICATION
Input -> Output.
Spam filtering: binary classification (spam vs. not-spam).
Character recognition: multi-class classification ('C' vs. the other 25 characters).
STRUCTURED CLASSIFICATION
Input -> structured output.
Handwriting recognition: input image -> word (e.g. "brace").
3D object recognition: input image -> object labels (e.g. building, tree).
OVERVIEW OF BAYESIAN DECISION
Bayesian classification, one example: deciding whether a patient is sick or healthy, based on a probabilistic model of the observed data (the data distributions) and prior knowledge (the ratio or importance of each class).
BAYES' RULE: who is who in Bayes' rule
P(h \mid d) = \frac{P(d \mid h)\, P(h)}{P(d)}
Here P(h \mid d) is the posterior probability of hypothesis h given data d, P(d \mid h) is the likelihood, P(h) is the prior, and P(d) is the evidence (the marginal probability of the data).
CLASSIFICATION PROBLEM
Training data: examples of the form (d, h(d)), where d is a data object to classify (the input) and h(d) \in \{1, \dots, K\} is its correct class.
Goal: given a new object d_{new}, provide h(d_{new}).
WHY BAYESIAN?
Provides practical learning algorithms, e.g. Naïve Bayes.
Prior knowledge and observed data can be combined.
It is a generative (model-based) approach, which offers a useful conceptual framework: any kind of object can be classified, based on a probabilistic model specification; e.g. sequences can also be classified this way.
Gaussian Mixture Model (GMM)
UNIVARIATE NORMAL SAMPLE
Sampling: x_1, \dots, x_n \sim N(\mu, \sigma^2), independently.
MAXIMUM LIKELIHOOD
Given the sample x = (x_1, \dots, x_n), the likelihood
L(\mu, \sigma^2 \mid x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big(-\frac{(x_i-\mu)^2}{2\sigma^2}\Big)
is a function of \mu and \sigma^2. We want to maximize it.
LOG-LIKELIHOOD FUNCTION
Maximize this instead:
\ell(\mu, \sigma^2) = \log L(\mu, \sigma^2 \mid x) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2,
by setting \partial \ell / \partial \mu = 0 and \partial \ell / \partial \sigma^2 = 0.
MAXIMIZING THE LOG-LIKELIHOOD FUNCTION
Solving the two equations gives the closed-form estimates
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2.
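As a quick illustration (not part of the original slides), these closed-form estimates take only a few lines of NumPy; the sample below is synthetic, drawn from an arbitrary normal just for the demonstration.

    import numpy as np

    # Synthetic sample from N(mu=5, sigma^2=4), chosen only for illustration
    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.0, scale=2.0, size=1000)

    mu_hat = x.mean()                         # MLE of mu: the sample mean
    sigma2_hat = ((x - mu_hat) ** 2).mean()   # MLE of sigma^2: divides by n, not n-1
    print(mu_hat, sigma2_hat)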
MISSING DATA
Sampling: x_1, \dots, x_m are observed, while x_{m+1}, \dots, x_n are missing.
E-STEP
Let \theta^{(t)} = (\mu^{(t)}, (\sigma^2)^{(t)}) be the estimated parameters at the start of the t-th iteration. For each missing value x_j, replace the unobserved sufficient statistics by their conditional expectations under \theta^{(t)}:
E[x_j \mid \theta^{(t)}] = \mu^{(t)}, \qquad E[x_j^2 \mid \theta^{(t)}] = (\mu^{(t)})^2 + (\sigma^2)^{(t)}.
M-STEP
With those expectations in place of the missing values, re-estimate the parameters:
\mu^{(t+1)} = \frac{\sum_{i=1}^{m} x_i + (n-m)\,\mu^{(t)}}{n}, \qquad
(\sigma^2)^{(t+1)} = \frac{\sum_{i=1}^{m} x_i^2 + (n-m)\big((\mu^{(t)})^2 + (\sigma^2)^{(t)}\big)}{n} - (\mu^{(t+1)})^2.
EXERCISE
n = 40 (10 data points missing). Estimate \mu and \sigma^2 using different initial conditions.
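One possible way to code this exercise (a minimal sketch, assuming the 10 missing values are missing completely at random; the generating parameters and the initial guess are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    n, n_miss = 40, 10
    x_obs = rng.normal(loc=2.0, scale=1.5, size=n - n_miss)   # the 30 observed values

    mu, sigma2 = 0.0, 1.0    # arbitrary initial condition; try several
    for _ in range(200):
        # E-step: expected sufficient statistics of the missing values
        e_sum = n_miss * mu
        e_sum_sq = n_miss * (mu ** 2 + sigma2)
        # M-step: re-estimate the parameters from observed + expected statistics
        mu_new = (x_obs.sum() + e_sum) / n
        sigma2_new = (np.sum(x_obs ** 2) + e_sum_sq) / n - mu_new ** 2
        if abs(mu_new - mu) < 1e-10 and abs(sigma2_new - sigma2) < 1e-10:
            mu, sigma2 = mu_new, sigma2_new
            break
        mu, sigma2 = mu_new, sigma2_new

    print(mu, sigma2)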
MULTINOMIAL POPULATION
Sampling N samples from a population with K categories: the counts x = (x_1, \dots, x_K), with \sum_k x_k = N, follow a multinomial distribution with parameters p = (p_1, \dots, p_K).
MAXIMUM LIKELIHOOD
Given the sampled counts, the likelihood
L(p \mid x) = \frac{N!}{x_1! \cdots x_K!}\; p_1^{x_1} \cdots p_K^{x_K}
is a function of p. We want to maximize it.
LOG-LIKELIHOOD
\ell(p) = \log L(p \mid x) = \text{const} + \sum_{k=1}^{K} x_k \log p_k, \quad \text{to be maximized subject to } \sum_{k=1}^{K} p_k = 1.
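To make the maximization explicit (a standard derivation, added here for completeness): handling the constraint \sum_k p_k = 1 with a Lagrange multiplier \lambda,
\frac{\partial}{\partial p_k}\Big[\sum_{k=1}^{K} x_k \log p_k + \lambda\Big(\sum_{k=1}^{K} p_k - 1\Big)\Big] = \frac{x_k}{p_k} + \lambda = 0
\;\Rightarrow\; p_k \propto x_k \;\Rightarrow\; \hat{p}_k = \frac{x_k}{N}.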
MIXED ATTRIBUTES
Sampling N samples as before, but the count x_3 is not available.
E-STEP
Sampling N samples; x_3 is not available. Given \theta^{(t)}, what can you say about x_3? Replace it by its conditional expectation E[x_3 \mid \text{observed counts}, \theta^{(t)}].
M-STEP
Re-estimate the parameters by maximum likelihood, using the expected count from the E-step in place of the missing x_3.
EXERCISE
Estimate the parameters using different initial conditions.
BINOMIAL/POISSON MIXTURE
# Children:  0    1    2    3    4    5    6
# Obasongs:  n_0  n_1  n_2  n_3  n_4  n_5  n_6
M: married obasong; X: # children. The population mixes married obasongs (who may have children) and unmarried obasongs (who have no children).
BINOMIAL/POISSON MIXTURE
Same table as above; M: married obasong, X: # children.
Unobserved data: n_A = # married obasongs and n_B = # unmarried obasongs among the zero-children count, so n_0 = n_A + n_B.
BINOMIAL/POISSON MIXTURE
Complete data: the counts (n_A, n_B, n_1, n_2, n_3, n_4, n_5, n_6) with corresponding probabilities (p_A, p_B, p_1, p_2, p_3, p_4, p_5, p_6).
COMPLETE DATA LIKELIHOOD
For the complete data the counts are multinomial, so
L \propto p_A^{n_A}\, p_B^{n_B}\, p_1^{n_1}\, p_2^{n_2}\, p_3^{n_3}\, p_4^{n_4}\, p_5^{n_5}\, p_6^{n_6}.
MAXIMUM LIKELIHOOD
LATENT VARIABLES
Incomplete (observed) data: X. Complete data: (X, Y), where Y is the latent (hidden) variable.
COMPLETE DATA LIKELIHOOD
L(\Theta \mid X, Y) = p(X, Y \mid \Theta).
COMPLETE DATA LIKELIHOOD
L(\Theta \mid X, Y) is a function of the parameter \Theta and of the latent variable Y. If we are given \Theta, it is computable, but the result is in terms of the random variable Y; that is, the complete-data likelihood is itself a function of the random variable Y.
EXPECTATION STEP
Let \Theta^{(i-1)} be the parameter vector obtained at the (i-1)-th step. Define
Q(\Theta, \Theta^{(i-1)}) = E\big[\log L(\Theta \mid X, Y) \,\big|\, X, \Theta^{(i-1)}\big].
MAXIMIZATION STEP
Let \Theta^{(i-1)} be the parameter vector obtained at the (i-1)-th step. Define
\Theta^{(i)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(i-1)}).
MIXTURE MODELS
If there is reason to believe that a data set comprises several distinct populations, a mixture model can be used. It has the following form:
p(x \mid \Theta) = \sum_{l=1}^{M} \alpha_l\, p_l(x \mid \theta_l), \quad \text{with } \alpha_l \ge 0 \text{ and } \sum_{l=1}^{M} \alpha_l = 1.
MIXTURE MODELS
Let y_i \in \{1, \dots, M\} represent the source (component) that generates data point x_i.
MIXTURE MODELS
EXPECTATION
In the expected complete-data log-likelihood, the term for component l is zero when y_i \neq l.
EXPECTATION
MAXIMIZATION
Given the initial guess \Theta^g, we want to find \Theta that maximizes the above expectation; in fact, this is done iteratively.
THE GMM (GAUSSIAN MIXTURE MODEL)
Gaussian model of a d-dimensional source, say source j:
p_j(x \mid \mu_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_j|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(x-\mu_j)^T \Sigma_j^{-1} (x-\mu_j)\Big).
GMM with M sources:
p(x \mid \Theta) = \sum_{j=1}^{M} \alpha_j\, p_j(x \mid \mu_j, \Sigma_j).
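As a small illustration (not from the original slides), the GMM density can be evaluated directly; the weights, means, and covariances below are arbitrary example values.

    import numpy as np

    def gaussian_pdf(x, mu, cov):
        # Density of a d-dimensional Gaussian N(mu, cov) at point x
        d = len(mu)
        diff = x - mu
        norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
        return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

    def gmm_pdf(x, alphas, mus, covs):
        # Mixture density: sum_j alpha_j * N(x; mu_j, Sigma_j)
        return sum(a * gaussian_pdf(x, m, c) for a, m, c in zip(alphas, mus, covs))

    # Example with M = 2 sources in d = 2 dimensions (arbitrary parameter values)
    alphas = [0.3, 0.7]
    mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    covs = [np.eye(2), 2.0 * np.eye(2)]
    print(gmm_pdf(np.array([1.0, 1.0]), alphas, mus, covs))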
GOAL
Mixture model: maximize the log-likelihood
\sum_{i=1}^{N} \log p(x_i \mid \Theta) = \sum_{i=1}^{N} \log \sum_{l=1}^{M} \alpha_l\, p_l(x_i \mid \theta_l),
subject to \sum_{l=1}^{M} \alpha_l = 1.
FINDING \alpha_l
Due to the constraint on the \alpha_l's, we introduce a Lagrange multiplier \lambda and solve the following equation:
\frac{\partial}{\partial \alpha_l}\Big[\sum_{i=1}^{N} \log p(x_i \mid \Theta) + \lambda\Big(\sum_{l=1}^{M} \alpha_l - 1\Big)\Big] = 0.
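For reference (the resulting update equations are not spelled out above), carrying this maximization through with the responsibilities p(l \mid x_i, \Theta^g) gives the standard GMM re-estimation formulas:
\alpha_l^{new} = \frac{1}{N}\sum_{i=1}^{N} p(l \mid x_i, \Theta^g), \qquad
\mu_l^{new} = \frac{\sum_{i=1}^{N} x_i\, p(l \mid x_i, \Theta^g)}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}, \qquad
\Sigma_l^{new} = \frac{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\,(x_i - \mu_l^{new})(x_i - \mu_l^{new})^T}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}.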
SOURCE CODE FOR GMM
1. EPFL
2. Google Source Code on GMM
3. GMM & EM: n-mixture-models-and-expectation.html
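In addition to the external code linked above, here is a minimal NumPy sketch of EM for a full-covariance GMM (no numerical safeguards beyond a small ridge on the covariances); it simply mirrors the update equations and is not any of the referenced implementations.

    import numpy as np

    def em_gmm(X, M, n_iter=100, seed=0):
        # Fit an M-component GMM to data X (N x d) with plain EM
        rng = np.random.default_rng(seed)
        N, d = X.shape
        alphas = np.full(M, 1.0 / M)                     # mixing weights
        mus = X[rng.choice(N, M, replace=False)]         # initialize means at random data points
        covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])
        for _ in range(n_iter):
            # E-step: responsibilities r[i, l] = p(l | x_i, current parameters)
            r = np.zeros((N, M))
            for l in range(M):
                diff = X - mus[l]
                inv = np.linalg.inv(covs[l])
                norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(covs[l]))
                r[:, l] = alphas[l] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
            r /= r.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means, covariances from the responsibilities
            Nl = r.sum(axis=0)
            alphas = Nl / N
            mus = (r.T @ X) / Nl[:, None]
            for l in range(M):
                diff = X - mus[l]
                covs[l] = (r[:, l, None] * diff).T @ diff / Nl[l] + 1e-6 * np.eye(d)
        return alphas, mus, covs

    # Example usage on synthetic two-cluster data
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(5.0, 1.0, (200, 2))])
    print(em_gmm(X, M=2)[1])    # estimated component means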
PROBABILITIES – AUXILIARY SLIDE FOR MEMORY REFRESHING
Have two dice, h_1 and h_2. The probability of rolling an i given die h_1 is denoted P(i \mid h_1); this is a conditional probability.
Pick a die at random with probability P(h_j), j = 1 or 2. The probability of picking die h_j and rolling an i with it is called the joint probability and is P(i, h_j) = P(h_j)\, P(i \mid h_j).
For any events X and Y, P(X, Y) = P(X \mid Y)\, P(Y). If we know P(X, Y), then the so-called marginal probability P(X) can be computed as
P(X) = \sum_{Y} P(X, Y) = \sum_{Y} P(X \mid Y)\, P(Y).
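A few lines of Python mirroring the dice example (the loaded-die probabilities are arbitrary numbers, added only to make the arithmetic concrete):

    # Two dice: h1 is fair, h2 is loaded toward 6 (illustrative numbers)
    P_h = {"h1": 0.5, "h2": 0.5}                  # P(h_j): pick a die at random
    P_i_given_h = {"h1": [1 / 6] * 6,
                   "h2": [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]}

    # Joint probability P(i, h_j) = P(h_j) * P(i | h_j), here for i = 6
    joint = {h: P_h[h] * P_i_given_h[h][5] for h in P_h}

    # Marginal probability P(i = 6) = sum over j of P(i = 6, h_j)
    marginal = sum(joint.values())
    print(joint, marginal)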
DOES PATIENT HAVE CANCER OR NOT?
A patient takes a lab test and the result comes back positive. It is known that the test returns a correct positive result in only 98% of the cases and a correct negative result in only 97% of the cases. Furthermore, only … of the entire population has this disease.
1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the diagnosis?
CHOOSING HYPOTHESES
Maximum Likelihood hypothesis: h_{ML} = \arg\max_{h \in H} P(d \mid h).
Generally we want the most probable hypothesis given the training data. This is the Maximum A Posteriori (MAP) hypothesis:
h_{MAP} = \arg\max_{h \in H} P(h \mid d) = \arg\max_{h \in H} \frac{P(d \mid h)\, P(h)}{P(d)} = \arg\max_{h \in H} P(d \mid h)\, P(h).
Useful observation: it does not depend on the denominator P(d).
NOW WE COMPUTE THE DIAGNOSIS
To find the Maximum Likelihood hypothesis, we evaluate P(d \mid h) for the data d (the positive lab test) and choose the hypothesis (diagnosis) that maximizes it:
h_{ML} = \arg\max_{h} P(d \mid h).
To find the Maximum A Posteriori hypothesis, we evaluate P(d \mid h)\, P(h) for the data d (the positive lab test) and choose the hypothesis (diagnosis) that maximizes it; this is the same as choosing the hypothesis with the higher posterior probability:
h_{MAP} = \arg\max_{h} P(d \mid h)\, P(h).
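For concreteness, here is the arithmetic under an assumed prior (the prevalence figure is missing from the slide above, so 0.8% is used purely for illustration):
P(+ \mid cancer)\, P(cancer) = 0.98 \times 0.008 \approx 0.0078, \qquad
P(+ \mid \neg cancer)\, P(\neg cancer) = 0.03 \times 0.992 \approx 0.0298,
so under that assumed prior h_{MAP} = \neg cancer: even after a positive test, "no cancer" remains the more probable hypothesis, because the disease is rare.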
NAÏVE BAYES CLASSIFIER
What can we do if our data d has several attributes?
Naïve Bayes assumption: the attributes that describe data instances are conditionally independent given the classification hypothesis:
P(d \mid h) = P(a_1, \dots, a_T \mid h) = \prod_{t=1}^{T} P(a_t \mid h).
This is a simplifying assumption; obviously it may be violated in reality, but in spite of that it works well in practice.
The Bayesian classifier that uses the Naïve Bayes assumption and computes the MAP hypothesis is called the Naïve Bayes classifier, one of the most practical learning methods.
Successful applications: medical diagnosis, text classification.
EXAMPLE: 'PLAY TENNIS' DATA
(Table of training examples with attributes Outlook, Temperature, Humidity, Wind and the target PlayTennis.)
NAÏVE BAYES SOLUTION
Classify any new datum instance x = (a_1, \dots, a_T) as:
h_{NB} = \arg\max_{h} P(h) \prod_{t=1}^{T} P(a_t \mid h).
To do this based on training examples, we need to estimate the parameters from the training examples:
- for each target value (hypothesis) h: \hat{P}(h) = the relative frequency of class h;
- for each attribute value a_t of each datum instance: \hat{P}(a_t \mid h) = the relative frequency of a_t among the examples of class h.
Based on the examples in the table, classify the following datum x:
x = (Outl = Sunny, Temp = Cool, Hum = High, Wind = Strong).
That means: play tennis or not? Working:
h_{NB} = \arg\max_{h \in \{yes,\, no\}} P(h)\, P(Outl = Sunny \mid h)\, P(Temp = Cool \mid h)\, P(Hum = High \mid h)\, P(Wind = Strong \mid h).
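Since the training table itself is not reproduced in these notes, here is a sketch of how such a Naïve Bayes computation could be coded on a generic list of examples; the two training rows below are hypothetical placeholders, not the real 'Play Tennis' data.

    from collections import Counter, defaultdict

    # Hypothetical rows: (Outlook, Temp, Humidity, Wind) -> PlayTennis
    train = [(("Sunny", "Hot", "High", "Weak"), "No"),
             (("Overcast", "Cool", "Normal", "Strong"), "Yes")]

    class_counts = Counter(label for _, label in train)
    attr_counts = defaultdict(Counter)      # counts of each attribute value per (position, class)
    for attrs, label in train:
        for t, a in enumerate(attrs):
            attr_counts[(t, label)][a] += 1

    def naive_bayes(x):
        # argmax_h P(h) * prod_t P(a_t | h), with probabilities estimated by relative frequencies
        best, best_score = None, -1.0
        for h, nh in class_counts.items():
            score = nh / len(train)
            for t, a in enumerate(x):
                score *= attr_counts[(t, h)][a] / nh   # zero if this value was never seen with class h
            if score > best_score:
                best, best_score = h, score
        return best

    print(naive_bayes(("Sunny", "Cool", "High", "Strong")))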
LEARNING TO CLASSIFY TEXT
Learn from examples which articles are of interest. The attributes are the words.
Note that the Naïve Bayes assumption just means that we have a random sequence model within each class!
NB classifiers are among the most effective for this task.
Resources for those interested: Tom Mitchell, Machine Learning (book), Chapter 6.
RESULTS ON A BENCHMARK TEXT CORPUS
REMEMBER
Bayes' rule can be turned into a classifier.
Maximum A Posteriori (MAP) hypothesis estimation incorporates prior knowledge; Maximum Likelihood doesn't.
The Naïve Bayes classifier is a simple but effective Bayesian classifier for vector data (i.e. data with several attributes) that assumes the attributes are independent given the class.
Bayesian classification is a generative approach to classification.
RESOURCES
Textbook reading (contains details about using Naïve Bayes for text classification): Tom Mitchell, Machine Learning (book), Chapter 6.
Software: NB for classifying text: bayes.html
Useful reading for those interested to learn more about NB classification, beyond the scope of this module: