1
Bayesian Learning & Gaussian Mixture Models
Jianping Fan, Dept. of Computer Science, UNC-Charlotte
2
Basic Classification
- Spam filtering: input is an email message ("!!!!$$$!!!!"), output is binary (Spam vs. Not-Spam).
- Character recognition: input is a character image (e.g. "C"), output is multi-class ("C" vs. the other 25 letters).
3
Structured Classification
- Handwriting recognition: input is a word image, output is the character sequence (e.g. "brace"), a structured output.
- 3D object recognition: input is an image, output is a set of object labels (e.g. building, tree), also a structured output.
4
Overview of Bayesian Decision
Bayesian classification, one example: how to decide whether a patient is sick or healthy, based on
- a probabilistic model of the observed data (data distributions), and
- prior knowledge (class ratios or importance).
5
Bayes' Rule: who is who in Bayes' rule
$P(h \mid d) = \dfrac{P(d \mid h)\,P(h)}{P(d)}$, where $P(h)$ is the prior, $P(d \mid h)$ the likelihood, $P(h \mid d)$ the posterior, and $P(d)$ the evidence.
6
Classification Problem
Training data: examples of the form $(d, h(d))$, where $d$ is the data object to classify (the input) and $h(d) \in \{1,\ldots,K\}$ is its correct class label.
Goal: given a new object $d_{new}$, provide $h(d_{new})$.
7
Why Bayesian?
- Provides practical learning algorithms, e.g. Naïve Bayes.
- Prior knowledge and observed data can be combined.
- It is a generative (model-based) approach, which offers a useful conceptual framework: any kind of object (e.g. sequences) can be classified, based on a probabilistic model specification.
8
Gaussian Mixture Model (GMM)
11
Univariate Normal Sample
Sampling: $x_1, \ldots, x_N \sim \mathcal{N}(\mu, \sigma^2)$, i.e. $p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$.
12
Maximum Likelihood
Sampling gives the likelihood $L(\mu, \sigma^2) = \prod_{i=1}^{N} p(x_i \mid \mu, \sigma^2)$. Given the sample $\mathbf{x}$, it is a function of $\mu$ and $\sigma^2$; we want to maximize it.
13
Log-Likelihood Function
$\ell(\mu, \sigma^2) = \log L(\mu, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$. Maximize this instead, by setting $\partial \ell / \partial \mu = 0$ and $\partial \ell / \partial \sigma^2 = 0$.
14
Maximizing the Log-Likelihood Function
Setting the derivatives to zero gives the ML estimates $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$ and $\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2$.
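As a quick illustration of these estimates, a minimal sketch in Python (the data here are synthetic, generated only for this example):

```python
import numpy as np

# Minimal sketch: ML estimates for a univariate normal sample.
# The sample is synthetic, drawn here just for illustration.
rng = np.random.default_rng(0)
x = rng.normal(loc=370.0, scale=50.0, size=40)   # assumed "true" mu = 370, sigma = 50

mu_hat = x.mean()                         # ML estimate of the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()   # ML estimate of the variance (divides by N, not N-1)

print(f"mu_hat = {mu_hat:.3f}, sigma2_hat = {sigma2_hat:.3f}")
```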
16
Missing Data
Sampling as before, but only part of the sample is observed; the remaining values are missing.
17
E-Step
Let $\Theta^{(t)} = (\mu^{(t)}, \sigma^{2\,(t)})$ be the estimated parameters at the start of the $t$-th iteration.
18
E-Step (continued)
Given $\Theta^{(t)}$, the conditional expectations of each missing value are $E[x_i \mid \Theta^{(t)}] = \mu^{(t)}$ and $E[x_i^2 \mid \Theta^{(t)}] = \mu^{(t)\,2} + \sigma^{2\,(t)}$.
19
M-Step
Using the expectations from the E-step, update the parameters: $\mu^{(t+1)} = \frac{1}{N}\big(\sum_{i=1}^{m} x_i + (N-m)\,\mu^{(t)}\big)$ and $\sigma^{2\,(t+1)} = \frac{1}{N}\big(\sum_{i=1}^{m} x_i^2 + (N-m)(\mu^{(t)\,2} + \sigma^{2\,(t)})\big) - \mu^{(t+1)\,2}$, where $m$ is the number of observed values.
20
Exercise
Observed data (n = 40, 10 values missing):
375.081556 362.275902 332.612068 351.383048 304.823174 386.438672 430.079689 395.317406 369.029845 365.343938
243.548664 382.789939 374.419161 337.289831 418.928822 364.086502 343.854855 371.279406 439.241736 338.281616
454.981077 479.685107 336.634962 407.030453 297.821512 311.267105 528.267783 419.841982 392.684770 301.910093
Estimate $\mu$ and $\sigma^2$ using different initial conditions.
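A minimal sketch of one way to run this EM iteration, assuming (as on the E-step and M-step slides above) that each missing value is replaced by $\mu^{(t)}$ and each missing squared value by $\mu^{(t)\,2} + \sigma^{2\,(t)}$; the iteration count and initial values below are arbitrary choices, not from the slides:

```python
import numpy as np

# Observed values from the exercise (30 observed, 10 missing, n = 40).
x_obs = np.array([
    375.081556, 362.275902, 332.612068, 351.383048, 304.823174, 386.438672,
    430.079689, 395.317406, 369.029845, 365.343938, 243.548664, 382.789939,
    374.419161, 337.289831, 418.928822, 364.086502, 343.854855, 371.279406,
    439.241736, 338.281616, 454.981077, 479.685107, 336.634962, 407.030453,
    297.821512, 311.267105, 528.267783, 419.841982, 392.684770, 301.910093,
])
n, m = 40, len(x_obs)            # total sample size and number of observed values

def em_missing_normal(mu, sigma2, n_iter=200):
    """EM for a normal sample with (n - m) values missing completely at random."""
    for _ in range(n_iter):
        # E-step: expected sufficient statistics of the missing values.
        s1 = x_obs.sum() + (n - m) * mu                        # E[sum x_i]
        s2 = (x_obs ** 2).sum() + (n - m) * (mu**2 + sigma2)   # E[sum x_i^2]
        # M-step: re-estimate the parameters from the expected statistics.
        mu = s1 / n
        sigma2 = s2 / n - mu**2
    return mu, sigma2

# Try different (arbitrary) initial conditions, as the exercise asks.
for mu0, s0 in [(100.0, 100.0), (400.0, 1000.0), (0.0, 1.0)]:
    print(mu0, s0, "->", em_missing_normal(mu0, s0))
```

In this particular setup the iteration converges to the same fixed point from any starting value, namely the ML estimates based on the observed values alone, which appears to be the point of the exercise.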
21
Multinomial Population
Sampling: $N$ samples from a multinomial population with category probabilities $p_1, \ldots, p_K$; the observed counts are $n_1, \ldots, n_K$ with $\sum_k n_k = N$.
22
Maximum Likelihood
Sampling: $N$ samples; the likelihood of the observed counts is $L(p_1,\ldots,p_K) \propto \prod_{k} p_k^{\,n_k}$.
23
Maximum Likelihood
Given the sample of $N$ observations, the likelihood is a function of the parameters $p_1, \ldots, p_K$; we want to maximize it.
24
Log-Likelihood
$\ell(p_1,\ldots,p_K) = \sum_{k} n_k \log p_k + \text{const}$, to be maximized subject to $\sum_k p_k = 1$.
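For completeness, the standard maximization works out as follows (a sketch; the multinomial coefficient is dropped since it does not depend on the parameters, and the constraint is handled with a Lagrange multiplier $\lambda$):

```latex
\max_{p}\; \sum_{k} n_k \log p_k + \lambda\Big(\sum_{k} p_k - 1\Big)
\;\Longrightarrow\;
\frac{n_k}{p_k} + \lambda = 0
\;\Rightarrow\; p_k = -\frac{n_k}{\lambda}
\;\Rightarrow\; \hat{p}_k = \frac{n_k}{N}
\quad\text{(using } \textstyle\sum_k n_k = N\text{)}.
```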
25
Mixed Attributes
Sampling: $N$ samples; the count $x_3$ is not available.
26
E-Step
Sampling: $N$ samples, with $x_3$ not available. Given $\Theta^{(t)}$, what can you say about $x_3$?
27
M-Step
Maximize the expected complete-data log-likelihood, using the expected value of $x_3$ computed in the E-step.
28
Exercise
Estimate the parameters using different initial conditions.
29
Binomial/Poisson Mixture
# Children:   0    1    2    3    4    5    6
# Obasongs:   n0   n1   n2   n3   n4   n5   n6
The population is a mixture of married obasongs and unmarried obasongs (who have no children). M: married obasong; X: # children.
30
Binomial/Poisson Mixture
# Children:   0    1    2    3    4    5    6
# Obasongs:   n0   n1   n2   n3   n4   n5   n6
M: married obasong; X: # children. Married obasongs and unmarried obasongs (no children) both contribute to n0.
Unobserved data: nA = # married obasongs with no children, nB = # unmarried obasongs, with nA + nB = n0.
31
Binomial/Poisson Mixture
# Children:     0    1    2    3    4    5    6
# Obasongs:     n0   n1   n2   n3   n4   n5   n6
M: married obasong; X: # children.
Complete data:  nA   nB   n1   n2   n3   n4   n5   n6
Probability:    pA   pB   p1   p2   p3   p4   p5   p6
32
Binomial/Poisson Mixture
# Children:     0    1    2    3    4    5    6
# Obasongs:     n0   n1   n2   n3   n4   n5   n6
Complete data:  nA   nB   n1   n2   n3   n4   n5   n6
Probability:    pA   pB   p1   p2   p3   p4   p5   p6
with $p_A = \pi e^{-\lambda}$, $p_B = 1 - \pi$, and $p_k = \pi e^{-\lambda}\lambda^k / k!$ for $k \ge 1$, where $\pi = P(M)$ and $X \mid M \sim \mathrm{Poisson}(\lambda)$.
33
Complete Data Likelihood
Complete data:  nA   nB   n1   n2   n3   n4   n5   n6
Probability:    pA   pB   p1   p2   p3   p4   p5   p6
The complete-data likelihood is multinomial: $L \propto p_A^{\,n_A}\, p_B^{\,n_B} \prod_{k=1}^{6} p_k^{\,n_k}$.
34
Maximum Likelihood
Maximize the expected complete-data log-likelihood with respect to the parameters; since $n_A$ and $n_B$ are unobserved, this is done with the EM iteration.
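A minimal sketch of the resulting EM iteration, under one common reading of this model (an assumption on my part, since the slide equations are not in the extracted text): an obasong is married with probability $\pi$, married obasongs have $\mathrm{Poisson}(\lambda)$ children, unmarried ones have none. The counts below are made-up placeholders:

```python
import numpy as np

# Assumed model: P(married) = pi; children | married ~ Poisson(lam); children | unmarried = 0.
# Observed: counts n[k] of obasongs with k children, k = 0..6 (placeholder values).
n = np.array([3062, 587, 284, 103, 33, 4, 2], dtype=float)
k = np.arange(len(n))
N = n.sum()

pi, lam = 0.5, 1.0                       # arbitrary initial guess
for _ in range(200):
    # E-step: split n[0] into expected married (n_A) and unmarried (n_B) obasongs.
    p_married_given_zero = pi * np.exp(-lam) / (pi * np.exp(-lam) + (1 - pi))
    n_A = n[0] * p_married_given_zero
    n_B = n[0] - n_A
    # M-step: re-estimate pi and lam from the "completed" counts.
    married_total = n_A + n[1:].sum()
    pi = married_total / N
    lam = (k * n).sum() / married_total   # children are contributed only by married obasongs

print(f"pi = {pi:.4f}, lambda = {lam:.4f}")
```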
35
Latent Variables
Incomplete data: the observed data $X$. Complete data: $(X, Y)$, where $Y$ denotes the latent (unobserved) variables.
36
Complete Data Likelihood
Complete data $(X, Y)$: the complete-data likelihood is $L(\Theta; X, Y) = p(X, Y \mid \Theta)$.
37
Complete Data Likelihood
The incomplete-data likelihood $L(\Theta; X)$ is a function of the parameter $\Theta$ alone. The complete-data likelihood $L(\Theta; X, Y)$ is a function of the latent variable $Y$ and the parameter $\Theta$: if we are given $\Theta$, it is computable, but the result is in terms of the random variable $Y$, i.e. it is itself a function of the random variable $Y$.
38
Expectation Step
Let $\Theta^{(i-1)}$ be the parameter vector obtained at the $(i-1)$-th step. Define $Q(\Theta; \Theta^{(i-1)}) = E\!\left[\log L(\Theta; X, Y) \mid X, \Theta^{(i-1)}\right]$.
39
Maximization Step
Let $\Theta^{(i-1)}$ be the parameter vector obtained at the $(i-1)$-th step. Define $\Theta^{(i)} = \arg\max_{\Theta} Q(\Theta; \Theta^{(i-1)})$.
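These two steps fit the generic EM template sketched below; `e_step` and `m_step` are hypothetical placeholders for the model-specific computations, such as the ones on the surrounding slides:

```python
def em(theta, x, e_step, m_step, n_iter=100, tol=1e-8):
    """Generic EM loop: alternate E-step and M-step until the parameters settle.

    e_step(theta, x) should return the expected sufficient statistics needed to
    evaluate Q(theta'; theta), and m_step(stats, x) should return the theta'
    that maximizes Q. Both are hypothetical placeholders for model-specific code.
    """
    for _ in range(n_iter):
        stats = e_step(theta, x)         # E-step: expectations given the current theta
        new_theta = m_step(stats, x)     # M-step: maximize the expected log-likelihood
        if all(abs(a - b) < tol for a, b in zip(new_theta, theta)):
            break                        # parameters have stopped changing
        theta = new_theta
    return theta
```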
40
Mixture Models
If there is reason to believe that a data set is comprised of several distinct populations, a mixture model can be used. It has the form $p(x \mid \Theta) = \sum_{l=1}^{M} \alpha_l\, p_l(x \mid \theta_l)$, with $\sum_{l=1}^{M} \alpha_l = 1$ and $\alpha_l \ge 0$.
41
Mixture Models
Let $y_i \in \{1, \ldots, M\}$ represent the source that generates the data point $x_i$.
42
Mixture Models
With $y_i \in \{1,\ldots,M\}$ denoting the source, $P(y_i = l \mid \Theta) = \alpha_l$ and $p(x_i \mid y_i = l, \Theta) = p_l(x_i \mid \theta_l)$, so $p(x_i, y_i \mid \Theta) = \alpha_{y_i}\, p_{y_i}(x_i \mid \theta_{y_i})$.
43
Mixture Models
The complete-data log-likelihood is therefore $\log L(\Theta; X, Y) = \sum_{i=1}^{N} \log\big(\alpha_{y_i}\, p_{y_i}(x_i \mid \theta_{y_i})\big)$.
45
Expectation
Taking the expectation of the complete-data log-likelihood over $Y$ given $X$ and the current guess $\Theta^g$: each term for source $l$ carries an indicator $\delta_{l, y_i}$ that is zero when $y_i \ne l$, and its expectation is $P(l \mid x_i, \Theta^g)$.
46
Expectation
$Q(\Theta, \Theta^g) = \sum_{l=1}^{M} \sum_{i=1}^{N} \log\!\big(\alpha_l\, p_l(x_i \mid \theta_l)\big)\, P(l \mid x_i, \Theta^g)$.
49
Maximization
Given the initial guess $\Theta^g$, we want to find the $\Theta$ that maximizes the above expectation $Q(\Theta, \Theta^g)$; in fact, this is done iteratively.
50
The GMM (Gaussian Mixture Model)
Gaussian model of a $d$-dimensional source, say $j$: $p_j(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_j|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu_j)^\top \Sigma_j^{-1}(x-\mu_j)\right)$.
GMM with $M$ sources: $p(x) = \sum_{j=1}^{M} \alpha_j\, p_j(x)$.
51
Goal
Mixture model $p(x \mid \Theta) = \sum_{l=1}^{M} \alpha_l\, p_l(x \mid \theta_l)$, subject to $\sum_{l=1}^{M} \alpha_l = 1$. To maximize: the log-likelihood $\log L(\Theta) = \sum_{i=1}^{N} \log \sum_{l=1}^{M} \alpha_l\, p_l(x_i \mid \theta_l)$.
52
Finding α_l
Due to the constraint on the $\alpha_l$'s, we introduce a Lagrange multiplier $\lambda$ and solve the equation $\frac{\partial}{\partial \alpha_l}\!\left[\,Q(\Theta, \Theta^g) + \lambda\Big(\sum_{l=1}^{M} \alpha_l - 1\Big)\right] = 0$.
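Putting the E-step responsibilities and the M-step updates together, a compact EM sketch for a GMM might look like the following; this is a minimal NumPy illustration (not the code from the links on the next slide), and numerical safeguards such as log-domain computation are omitted apart from a small ridge on the covariances:

```python
import numpy as np

def gmm_em(X, M, n_iter=100, seed=0):
    """EM for a Gaussian mixture with M full-covariance components (minimal sketch)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    alpha = np.full(M, 1.0 / M)                        # mixing weights alpha_l
    mu = X[rng.choice(N, M, replace=False)]            # means initialized from data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])

    for _ in range(n_iter):
        # E-step: responsibilities r[i, l] = P(l | x_i, Theta^g).
        r = np.empty((N, M))
        for l in range(M):
            diff = X - mu[l]
            inv = np.linalg.inv(Sigma[l])
            norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma[l]))
            r[:, l] = alpha[l] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, covariances from the responsibilities.
        Nl = r.sum(axis=0)
        alpha = Nl / N                                  # weight update from the Lagrange condition
        mu = (r.T @ X) / Nl[:, None]
        for l in range(M):
            diff = X - mu[l]
            Sigma[l] = (r[:, l, None] * diff).T @ diff / Nl[l] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma

# Toy usage on synthetic 2-D data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
alpha, mu, Sigma = gmm_em(X, M=2)
print(alpha, mu, sep="\n")
```

The weight update $\alpha_l = N_l / N$, with $N_l = \sum_i P(l \mid x_i, \Theta^g)$, is exactly what solving the Lagrange-multiplier equation on this slide yields.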
53
Source Code for GMM
1. EPFL: http://lasa.epfl.ch/sourcecode/
2. Google Source Code on GMM: http://code.google.com/p/gmmreg/
3. GMM & EM: http://crsouza.blogspot.com/2010/10/gaussian-mixture-models-and-expectation.html
54
Probabilities (auxiliary slide for memory refreshing)
- We have two dice, $h_1$ and $h_2$.
- The probability of rolling an $i$ given die $h_1$ is denoted $P(i \mid h_1)$. This is a conditional probability.
- Pick a die at random with probability $P(h_j)$, $j = 1$ or $2$. The probability of picking die $h_j$ and rolling an $i$ with it is called a joint probability, and is $P(i, h_j) = P(h_j)\,P(i \mid h_j)$.
- For any events $X$ and $Y$, $P(X, Y) = P(X \mid Y)\,P(Y)$.
- If we know $P(X, Y)$, then the so-called marginal probability $P(X)$ can be computed as $P(X) = \sum_{Y} P(X \mid Y)\,P(Y)$.
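A quick worked example with made-up numbers (not from the slide): suppose $P(h_1) = P(h_2) = \tfrac12$, die $h_1$ is fair so $P(6 \mid h_1) = \tfrac16$, and die $h_2$ is loaded with $P(6 \mid h_2) = \tfrac12$. Then:

```latex
P(6) = P(6 \mid h_1)\,P(h_1) + P(6 \mid h_2)\,P(h_2)
     = \tfrac16 \cdot \tfrac12 + \tfrac12 \cdot \tfrac12 = \tfrac13,
\qquad
P(h_2 \mid 6) = \frac{P(6 \mid h_2)\,P(h_2)}{P(6)} = \frac{\tfrac14}{\tfrac13} = \tfrac34 .
```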
55
Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. It is known that the test returns a correct positive result in only 98% of the cases and a correct negative result in only 97% of the cases. Furthermore, only 0.008 of the entire population has this disease.
1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the diagnosis?
57
Choosing Hypotheses
Maximum Likelihood hypothesis: $h_{ML} = \arg\max_{h} P(d \mid h)$.
Generally we want the most probable hypothesis given the training data. This is the maximum a posteriori hypothesis: $h_{MAP} = \arg\max_{h} P(h \mid d) = \arg\max_{h} P(d \mid h)\,P(h)$.
Useful observation: it does not depend on the denominator $P(d)$.
58
Now we compute the diagnosis
To find the Maximum Likelihood hypothesis, we evaluate $P(d \mid h)$ for the data $d$, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it.
To find the Maximum A Posteriori hypothesis, we evaluate $P(d \mid h)\,P(h)$ for the data $d$, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it. This is the same as choosing the hypothesis that gives the higher posterior probability.
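Plugging in the numbers from the earlier slide ($P(+\mid cancer) = 0.98$, $P(+\mid \neg cancer) = 1 - 0.97 = 0.03$, $P(cancer) = 0.008$), the working is:

```latex
P(+ \mid cancer)\,P(cancer) = 0.98 \times 0.008 = 0.0078,
\qquad
P(+ \mid \neg cancer)\,P(\neg cancer) = 0.03 \times 0.992 = 0.0298 .
```

So $h_{ML} = cancer$ (since $0.98 > 0.03$), but $h_{MAP} = \neg cancer$; the posterior probability of cancer given the positive test is $0.0078 / (0.0078 + 0.0298) \approx 0.21$, so the diagnosis is "no cancer".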
59
Naïve Bayes Classifier
What can we do if our data $d$ has several attributes?
Naïve Bayes assumption: the attributes that describe data instances are conditionally independent given the classification hypothesis, $P(d \mid h) = P(a_1, \ldots, a_T \mid h) = \prod_{t} P(a_t \mid h)$.
- It is a simplifying assumption; obviously it may be violated in reality.
- In spite of that, it works well in practice.
The Bayesian classifier that uses the Naïve Bayes assumption and computes the MAP hypothesis is called the Naïve Bayes classifier. It is one of the most practical learning methods. Successful applications: medical diagnosis, text classification.
60
Example: 'Play Tennis' data (training examples with attributes Outlook, Temperature, Humidity, Wind and target PlayTennis).
61
Naïve Bayes solution
Classify any new datum instance $x = (a_1, \ldots, a_T)$ as $h_{NB} = \arg\max_{h} P(h) \prod_{t=1}^{T} P(a_t \mid h)$.
To do this based on training examples, we need to estimate the parameters from the training examples:
- for each target value (hypothesis) $h$: estimate $P(h)$ by its relative frequency;
- for each attribute value $a_t$ of each datum instance: estimate $P(a_t \mid h)$ by its relative frequency among the examples with class $h$.
62
Based on the examples in the table, classify the following datum $x$: $x = (\text{Outl}=\text{Sunny}, \text{Temp}=\text{Cool}, \text{Hum}=\text{High}, \text{Wind}=\text{Strong})$. That means: play tennis or not? Working: compute $P(yes)\prod_t P(a_t \mid yes)$ and $P(no)\prod_t P(a_t \mid no)$ and pick the larger, as in the sketch below.
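The table itself is not reproduced in this text, so the sketch below assumes the standard 14-example PlayTennis data set from Mitchell's Machine Learning, Chapter 6; the counts are taken from that assumed table, not from this document:

```python
from functools import reduce

# Assumed class counts and conditional counts (Mitchell's 14-example PlayTennis table).
class_counts = {"yes": 9, "no": 5}
cond_counts = {  # counts of each attribute value within each class
    ("Outl", "Sunny"):  {"yes": 2, "no": 3},
    ("Temp", "Cool"):   {"yes": 3, "no": 1},
    ("Hum", "High"):    {"yes": 3, "no": 4},
    ("Wind", "Strong"): {"yes": 3, "no": 3},
}
x = [("Outl", "Sunny"), ("Temp", "Cool"), ("Hum", "High"), ("Wind", "Strong")]
N = sum(class_counts.values())

scores = {}
for h, nh in class_counts.items():
    # P(h) * prod_t P(a_t | h), estimated by relative frequencies.
    likelihoods = [cond_counts[a][h] / nh for a in x]
    scores[h] = (nh / N) * reduce(lambda p, q: p * q, likelihoods, 1.0)

print(scores)                                      # {'yes': ~0.0053, 'no': ~0.0206}
print("decision:", max(scores, key=scores.get))    # -> 'no'
```

Under that assumed table, the "no" score is roughly four times the "yes" score, so the Naïve Bayes decision is not to play tennis.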
63
Learning to classify text
- Learn from examples which articles are of interest.
- The attributes are the words.
- Observe that the Naïve Bayes assumption just means that we have a random sequence model within each class!
- NB classifiers are one of the most effective methods for this task.
- Resources for those interested: Tom Mitchell, Machine Learning (book), Chapter 6.
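A bag-of-words Naïve Bayes text classifier in the spirit of Mitchell's Chapter 6 might be sketched as follows; this is a toy illustration with made-up documents, and Laplace smoothing is used so unseen words do not zero out a class:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, label). Returns log-priors, word log-probs, vocab."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_cond = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        log_cond[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                       for w in vocab}   # Laplace smoothing
    return log_prior, log_cond, vocab

def classify(words, log_prior, log_cond, vocab):
    scores = {c: lp + sum(log_cond[c][w] for w in words if w in vocab)
              for c, lp in log_prior.items()}
    return max(scores, key=scores.get)

# Toy usage on a made-up mini-corpus.
docs = [("bayes rule prior posterior".split(), "interesting"),
        ("football match score goal".split(), "boring"),
        ("posterior likelihood model".split(), "interesting"),
        ("goal referee penalty".split(), "boring")]
model = train_nb(docs)
print(classify("prior likelihood goal".split(), *model))
```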
64
Results on a benchmark text corpus
65
Remember
- Bayes' rule can be turned into a classifier.
- Maximum A Posteriori (MAP) hypothesis estimation incorporates prior knowledge; Maximum Likelihood doesn't.
- The Naïve Bayes classifier is a simple but effective Bayesian classifier for vector data (i.e. data with several attributes) that assumes the attributes are independent given the class.
- Bayesian classification is a generative approach to classification.
66
Resources
- Textbook reading (contains details about using Naïve Bayes for text classification): Tom Mitchell, Machine Learning (book), Chapter 6.
- Software, NB for classifying text: http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
- Useful reading for those interested to learn more about NB classification, beyond the scope of this module: http://www-2.cs.cmu.edu/~tom/NewChapters.html