BAYESIAN LEARNING & GAUSSIAN MIXTURE MODELS
Jianping Fan, Dept. of Computer Science, UNC-Charlotte
BASIC CLASSIFICATION
Input -> Output.
Spam filtering: binary classification (spam vs. not-spam).
Character recognition: multi-class classification ('C' vs. the other 25 characters).
STRUCTURED CLASSIFICATION
Input -> structured output.
Handwriting recognition: input image -> word (e.g. "brace").
3D object recognition: input image -> object labels (e.g. building, tree).
OVERVIEW OF BAYESIAN DECISION
Bayesian classification, one example: deciding whether a patient is sick or healthy, based on a probabilistic model of the observed data (the data distributions) and prior knowledge (the ratio or importance of each class).
BAYES' RULE: who is who in Bayes' rule
P(h \mid d) = \frac{P(d \mid h)\, P(h)}{P(d)}
Here P(h \mid d) is the posterior probability of hypothesis h given data d, P(d \mid h) is the likelihood, P(h) is the prior, and P(d) is the evidence (the marginal probability of the data).
CLASSIFICATION PROBLEM
Training data: examples of the form (d, h(d)), where d is a data object to classify (the input) and h(d) \in \{1, \dots, K\} is its correct class.
Goal: given a new object d_{new}, provide h(d_{new}).
WHY BAYESIAN?
Provides practical learning algorithms, e.g. Naïve Bayes.
Prior knowledge and observed data can be combined.
It is a generative (model-based) approach, which offers a useful conceptual framework: any kind of object can be classified, based on a probabilistic model specification; e.g. sequences can also be classified this way.
Gaussian Mixture Model (GMM)
UNIVARIATE NORMAL SAMPLE
Sampling: x_1, \dots, x_n \sim N(\mu, \sigma^2), independently.
MAXIMUM LIKELIHOOD
Given the sample x = (x_1, \dots, x_n), the likelihood
L(\mu, \sigma^2 \mid x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big(-\frac{(x_i-\mu)^2}{2\sigma^2}\Big)
is a function of \mu and \sigma^2. We want to maximize it.
LOG-LIKELIHOOD FUNCTION
Maximize this instead:
\ell(\mu, \sigma^2) = \log L(\mu, \sigma^2 \mid x) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2,
by setting \partial \ell / \partial \mu = 0 and \partial \ell / \partial \sigma^2 = 0.
MAXIMIZING THE LOG-LIKELIHOOD FUNCTION
Solving the two equations gives the closed-form estimates
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2.
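As a quick illustration (not part of the original slides), these closed-form estimates take only a few lines of NumPy; the sample below is synthetic, drawn from an arbitrary normal just for the demonstration.

    import numpy as np

    # Synthetic sample from N(mu=5, sigma^2=4), chosen only for illustration
    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.0, scale=2.0, size=1000)

    mu_hat = x.mean()                         # MLE of mu: the sample mean
    sigma2_hat = ((x - mu_hat) ** 2).mean()   # MLE of sigma^2: divides by n, not n-1
    print(mu_hat, sigma2_hat)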
MISSING DATA
Sampling: x_1, \dots, x_m are observed, while x_{m+1}, \dots, x_n are missing.
E-STEP
Let \theta^{(t)} = (\mu^{(t)}, (\sigma^2)^{(t)}) be the estimated parameters at the start of the t-th iteration. For each missing value x_j, replace the unobserved sufficient statistics by their conditional expectations under \theta^{(t)}:
E[x_j \mid \theta^{(t)}] = \mu^{(t)}, \qquad E[x_j^2 \mid \theta^{(t)}] = (\mu^{(t)})^2 + (\sigma^2)^{(t)}.
M-STEP
With those expectations in place of the missing values, re-estimate the parameters:
\mu^{(t+1)} = \frac{\sum_{i=1}^{m} x_i + (n-m)\,\mu^{(t)}}{n}, \qquad
(\sigma^2)^{(t+1)} = \frac{\sum_{i=1}^{m} x_i^2 + (n-m)\big((\mu^{(t)})^2 + (\sigma^2)^{(t)}\big)}{n} - (\mu^{(t+1)})^2.
EXERCISE
n = 40 (10 data points missing). Estimate \mu and \sigma^2 using different initial conditions.
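One possible way to code this exercise (a minimal sketch, assuming the 10 missing values are missing completely at random; the generating parameters and the initial guess are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    n, n_miss = 40, 10
    x_obs = rng.normal(loc=2.0, scale=1.5, size=n - n_miss)   # the 30 observed values

    mu, sigma2 = 0.0, 1.0    # arbitrary initial condition; try several
    for _ in range(200):
        # E-step: expected sufficient statistics of the missing values
        e_sum = n_miss * mu
        e_sum_sq = n_miss * (mu ** 2 + sigma2)
        # M-step: re-estimate the parameters from observed + expected statistics
        mu_new = (x_obs.sum() + e_sum) / n
        sigma2_new = (np.sum(x_obs ** 2) + e_sum_sq) / n - mu_new ** 2
        if abs(mu_new - mu) < 1e-10 and abs(sigma2_new - sigma2) < 1e-10:
            mu, sigma2 = mu_new, sigma2_new
            break
        mu, sigma2 = mu_new, sigma2_new

    print(mu, sigma2)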
MULTINOMIAL POPULATION
Sampling N samples from a population with K categories: the counts x = (x_1, \dots, x_K), with \sum_k x_k = N, follow a multinomial distribution with parameters p = (p_1, \dots, p_K).
MAXIMUM LIKELIHOOD
Given the sampled counts, the likelihood
L(p \mid x) = \frac{N!}{x_1! \cdots x_K!}\; p_1^{x_1} \cdots p_K^{x_K}
is a function of p. We want to maximize it.
LOG-LIKELIHOOD
\ell(p) = \log L(p \mid x) = \text{const} + \sum_{k=1}^{K} x_k \log p_k, \quad \text{to be maximized subject to } \sum_{k=1}^{K} p_k = 1.
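To make the maximization explicit (a standard derivation, added here for completeness): handling the constraint \sum_k p_k = 1 with a Lagrange multiplier \lambda,
\frac{\partial}{\partial p_k}\Big[\sum_{k=1}^{K} x_k \log p_k + \lambda\Big(\sum_{k=1}^{K} p_k - 1\Big)\Big] = \frac{x_k}{p_k} + \lambda = 0
\;\Rightarrow\; p_k \propto x_k \;\Rightarrow\; \hat{p}_k = \frac{x_k}{N}.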
MIXED ATTRIBUTES
Sampling N samples as before, but the count x_3 is not available.
E-STEP
Sampling N samples; x_3 is not available. Given \theta^{(t)}, what can you say about x_3? Replace it by its conditional expectation E[x_3 \mid \text{observed counts}, \theta^{(t)}].
M-STEP
Re-estimate the parameters by maximum likelihood, using the expected count from the E-step in place of the missing x_3.
EXERCISE
Estimate the parameters using different initial conditions.
BINOMIAL/POISSON MIXTURE
# Children:  0    1    2    3    4    5    6
# Obasongs:  n_0  n_1  n_2  n_3  n_4  n_5  n_6
M: married obasong; X: # children. The population mixes married obasongs (who may have children) and unmarried obasongs (who have no children).
BINOMIAL/POISSON MIXTURE
Same table as above; M: married obasong, X: # children.
Unobserved data: n_A = # married obasongs and n_B = # unmarried obasongs among the zero-children count, so n_0 = n_A + n_B.
BINOMIAL/POISSON MIXTURE
Complete data: the counts (n_A, n_B, n_1, n_2, n_3, n_4, n_5, n_6) with corresponding probabilities (p_A, p_B, p_1, p_2, p_3, p_4, p_5, p_6).
COMPLETE DATA LIKELIHOOD
For the complete data the counts are multinomial, so
L \propto p_A^{n_A}\, p_B^{n_B}\, p_1^{n_1}\, p_2^{n_2}\, p_3^{n_3}\, p_4^{n_4}\, p_5^{n_5}\, p_6^{n_6}.
MAXIMUM LIKELIHOOD
LATENT VARIABLES
Incomplete (observed) data: X. Complete data: (X, Y), where Y is the latent (hidden) variable.
COMPLETE DATA LIKELIHOOD
L(\Theta \mid X, Y) = p(X, Y \mid \Theta).
COMPLETE DATA LIKELIHOOD
L(\Theta \mid X, Y) is a function of the parameter \Theta and of the latent variable Y. If we are given \Theta, it is computable, but the result is in terms of the random variable Y; that is, the complete-data likelihood is itself a function of the random variable Y.
EXPECTATION STEP
Let \Theta^{(i-1)} be the parameter vector obtained at the (i-1)-th step. Define
Q(\Theta, \Theta^{(i-1)}) = E\big[\log L(\Theta \mid X, Y) \,\big|\, X, \Theta^{(i-1)}\big].
MAXIMIZATION STEP
Let \Theta^{(i-1)} be the parameter vector obtained at the (i-1)-th step. Define
\Theta^{(i)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(i-1)}).
MIXTURE MODELS
If there is reason to believe that a data set comprises several distinct populations, a mixture model can be used. It has the following form:
p(x \mid \Theta) = \sum_{l=1}^{M} \alpha_l\, p_l(x \mid \theta_l), \quad \text{with } \alpha_l \ge 0 \text{ and } \sum_{l=1}^{M} \alpha_l = 1.
MIXTURE MODELS
Let y_i \in \{1, \dots, M\} represent the source (component) that generates data point x_i.
MIXTURE MODELS
EXPECTATION
In the expected complete-data log-likelihood, the term for component l is zero when y_i \neq l.
EXPECTATION
MAXIMIZATION
Given the initial guess \Theta^g, we want to find \Theta that maximizes the above expectation; in fact, this is done iteratively.
THE GMM (GAUSSIAN MIXTURE MODEL)
Gaussian model of a d-dimensional source, say source j:
p_j(x \mid \mu_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_j|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(x-\mu_j)^T \Sigma_j^{-1} (x-\mu_j)\Big).
GMM with M sources:
p(x \mid \Theta) = \sum_{j=1}^{M} \alpha_j\, p_j(x \mid \mu_j, \Sigma_j).
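As a small illustration (not from the original slides), the GMM density can be evaluated directly; the weights, means, and covariances below are arbitrary example values.

    import numpy as np

    def gaussian_pdf(x, mu, cov):
        # Density of a d-dimensional Gaussian N(mu, cov) at point x
        d = len(mu)
        diff = x - mu
        norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
        return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

    def gmm_pdf(x, alphas, mus, covs):
        # Mixture density: sum_j alpha_j * N(x; mu_j, Sigma_j)
        return sum(a * gaussian_pdf(x, m, c) for a, m, c in zip(alphas, mus, covs))

    # Example with M = 2 sources in d = 2 dimensions (arbitrary parameter values)
    alphas = [0.3, 0.7]
    mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    covs = [np.eye(2), 2.0 * np.eye(2)]
    print(gmm_pdf(np.array([1.0, 1.0]), alphas, mus, covs))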
GOAL
Mixture model: maximize the log-likelihood
\sum_{i=1}^{N} \log p(x_i \mid \Theta) = \sum_{i=1}^{N} \log \sum_{l=1}^{M} \alpha_l\, p_l(x_i \mid \theta_l),
subject to \sum_{l=1}^{M} \alpha_l = 1.
FINDING \alpha_l
Due to the constraint on the \alpha_l's, we introduce a Lagrange multiplier \lambda and solve the following equation:
\frac{\partial}{\partial \alpha_l}\Big[\sum_{i=1}^{N} \log p(x_i \mid \Theta) + \lambda\Big(\sum_{l=1}^{M} \alpha_l - 1\Big)\Big] = 0.
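For reference (the resulting update equations are not spelled out above), carrying this maximization through with the responsibilities p(l \mid x_i, \Theta^g) gives the standard GMM re-estimation formulas:
\alpha_l^{new} = \frac{1}{N}\sum_{i=1}^{N} p(l \mid x_i, \Theta^g), \qquad
\mu_l^{new} = \frac{\sum_{i=1}^{N} x_i\, p(l \mid x_i, \Theta^g)}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}, \qquad
\Sigma_l^{new} = \frac{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\,(x_i - \mu_l^{new})(x_i - \mu_l^{new})^T}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}.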
SOURCE CODE FOR GMM
1. EPFL
2. Google Source Code on GMM
3. GMM & EM: n-mixture-models-and-expectation.html
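In addition to the external code linked above, here is a minimal NumPy sketch of EM for a full-covariance GMM (no numerical safeguards beyond a small ridge on the covariances); it simply mirrors the update equations and is not any of the referenced implementations.

    import numpy as np

    def em_gmm(X, M, n_iter=100, seed=0):
        # Fit an M-component GMM to data X (N x d) with plain EM
        rng = np.random.default_rng(seed)
        N, d = X.shape
        alphas = np.full(M, 1.0 / M)                     # mixing weights
        mus = X[rng.choice(N, M, replace=False)]         # initialize means at random data points
        covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])
        for _ in range(n_iter):
            # E-step: responsibilities r[i, l] = p(l | x_i, current parameters)
            r = np.zeros((N, M))
            for l in range(M):
                diff = X - mus[l]
                inv = np.linalg.inv(covs[l])
                norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(covs[l]))
                r[:, l] = alphas[l] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
            r /= r.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means, covariances from the responsibilities
            Nl = r.sum(axis=0)
            alphas = Nl / N
            mus = (r.T @ X) / Nl[:, None]
            for l in range(M):
                diff = X - mus[l]
                covs[l] = (r[:, l, None] * diff).T @ diff / Nl[l] + 1e-6 * np.eye(d)
        return alphas, mus, covs

    # Example usage on synthetic two-cluster data
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(5.0, 1.0, (200, 2))])
    print(em_gmm(X, M=2)[1])    # estimated component means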
PROBABILITIES – AUXILIARY SLIDE FOR MEMORY REFRESHING
Have two dice, h_1 and h_2. The probability of rolling an i given die h_1 is denoted P(i \mid h_1); this is a conditional probability.
Pick a die at random with probability P(h_j), j = 1 or 2. The probability of picking die h_j and rolling an i with it is called the joint probability and is P(i, h_j) = P(h_j)\, P(i \mid h_j).
For any events X and Y, P(X, Y) = P(X \mid Y)\, P(Y). If we know P(X, Y), then the so-called marginal probability P(X) can be computed as
P(X) = \sum_{Y} P(X, Y) = \sum_{Y} P(X \mid Y)\, P(Y).
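A few lines of Python mirroring the dice example (the loaded-die probabilities are arbitrary numbers, added only to make the arithmetic concrete):

    # Two dice: h1 is fair, h2 is loaded toward 6 (illustrative numbers)
    P_h = {"h1": 0.5, "h2": 0.5}                  # P(h_j): pick a die at random
    P_i_given_h = {"h1": [1 / 6] * 6,
                   "h2": [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]}

    # Joint probability P(i, h_j) = P(h_j) * P(i | h_j), here for i = 6
    joint = {h: P_h[h] * P_i_given_h[h][5] for h in P_h}

    # Marginal probability P(i = 6) = sum over j of P(i = 6, h_j)
    marginal = sum(joint.values())
    print(joint, marginal)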
DOES PATIENT HAVE CANCER OR NOT?
A patient takes a lab test and the result comes back positive. It is known that the test returns a correct positive result in only 98% of the cases and a correct negative result in only 97% of the cases. Furthermore, only … of the entire population has this disease.
1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the diagnosis?
CHOOSING HYPOTHESES
Maximum Likelihood hypothesis: h_{ML} = \arg\max_{h \in H} P(d \mid h).
Generally we want the most probable hypothesis given the training data. This is the Maximum A Posteriori (MAP) hypothesis:
h_{MAP} = \arg\max_{h \in H} P(h \mid d) = \arg\max_{h \in H} \frac{P(d \mid h)\, P(h)}{P(d)} = \arg\max_{h \in H} P(d \mid h)\, P(h).
Useful observation: it does not depend on the denominator P(d).
NOW WE COMPUTE THE DIAGNOSIS
To find the Maximum Likelihood hypothesis, we evaluate P(d \mid h) for the data d (the positive lab test) and choose the hypothesis (diagnosis) that maximizes it:
h_{ML} = \arg\max_{h} P(d \mid h).
To find the Maximum A Posteriori hypothesis, we evaluate P(d \mid h)\, P(h) for the data d (the positive lab test) and choose the hypothesis (diagnosis) that maximizes it; this is the same as choosing the hypothesis with the higher posterior probability:
h_{MAP} = \arg\max_{h} P(d \mid h)\, P(h).
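For concreteness, here is the arithmetic under an assumed prior (the prevalence figure is missing from the slide above, so 0.8% is used purely for illustration):
P(+ \mid cancer)\, P(cancer) = 0.98 \times 0.008 \approx 0.0078, \qquad
P(+ \mid \neg cancer)\, P(\neg cancer) = 0.03 \times 0.992 \approx 0.0298,
so under that assumed prior h_{MAP} = \neg cancer: even after a positive test, "no cancer" remains the more probable hypothesis, because the disease is rare.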
NAÏVE BAYES CLASSIFIER
What can we do if our data d has several attributes?
Naïve Bayes assumption: the attributes that describe data instances are conditionally independent given the classification hypothesis:
P(d \mid h) = P(a_1, \dots, a_T \mid h) = \prod_{t=1}^{T} P(a_t \mid h).
This is a simplifying assumption; obviously it may be violated in reality, but in spite of that it works well in practice.
The Bayesian classifier that uses the Naïve Bayes assumption and computes the MAP hypothesis is called the Naïve Bayes classifier, one of the most practical learning methods.
Successful applications: medical diagnosis, text classification.
EXAMPLE: 'PLAY TENNIS' DATA
(Table of training examples with attributes Outlook, Temperature, Humidity, Wind and the target PlayTennis.)
NAÏVE BAYES SOLUTION
Classify any new datum instance x = (a_1, \dots, a_T) as:
h_{NB} = \arg\max_{h} P(h) \prod_{t=1}^{T} P(a_t \mid h).
To do this based on training examples, we need to estimate the parameters from the training examples:
- for each target value (hypothesis) h: \hat{P}(h) = the relative frequency of class h;
- for each attribute value a_t of each datum instance: \hat{P}(a_t \mid h) = the relative frequency of a_t among the examples of class h.
Based on the examples in the table, classify the following datum x:
x = (Outl = Sunny, Temp = Cool, Hum = High, Wind = Strong).
That means: play tennis or not? Working:
h_{NB} = \arg\max_{h \in \{yes,\, no\}} P(h)\, P(Outl = Sunny \mid h)\, P(Temp = Cool \mid h)\, P(Hum = High \mid h)\, P(Wind = Strong \mid h).
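Since the training table itself is not reproduced in these notes, here is a sketch of how such a Naïve Bayes computation could be coded on a generic list of examples; the two training rows below are hypothetical placeholders, not the real 'Play Tennis' data.

    from collections import Counter, defaultdict

    # Hypothetical rows: (Outlook, Temp, Humidity, Wind) -> PlayTennis
    train = [(("Sunny", "Hot", "High", "Weak"), "No"),
             (("Overcast", "Cool", "Normal", "Strong"), "Yes")]

    class_counts = Counter(label for _, label in train)
    attr_counts = defaultdict(Counter)      # counts of each attribute value per (position, class)
    for attrs, label in train:
        for t, a in enumerate(attrs):
            attr_counts[(t, label)][a] += 1

    def naive_bayes(x):
        # argmax_h P(h) * prod_t P(a_t | h), with probabilities estimated by relative frequencies
        best, best_score = None, -1.0
        for h, nh in class_counts.items():
            score = nh / len(train)
            for t, a in enumerate(x):
                score *= attr_counts[(t, h)][a] / nh   # zero if this value was never seen with class h
            if score > best_score:
                best, best_score = h, score
        return best

    print(naive_bayes(("Sunny", "Cool", "High", "Strong")))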
LEARNING TO CLASSIFY TEXT
Learn from examples which articles are of interest. The attributes are the words.
Note that the Naïve Bayes assumption just means that we have a random sequence model within each class!
NB classifiers are among the most effective for this task.
Resources for those interested: Tom Mitchell, Machine Learning (book), Chapter 6.
RESULTS ON A BENCHMARK TEXT CORPUS
REMEMBER
Bayes' rule can be turned into a classifier.
Maximum A Posteriori (MAP) hypothesis estimation incorporates prior knowledge; Maximum Likelihood doesn't.
The Naïve Bayes classifier is a simple but effective Bayesian classifier for vector data (i.e. data with several attributes) that assumes the attributes are independent given the class.
Bayesian classification is a generative approach to classification.
RESOURCES
Textbook reading (contains details about using Naïve Bayes for text classification): Tom Mitchell, Machine Learning (book), Chapter 6.
Software: NB for classifying text: bayes.html
Useful reading for those interested to learn more about NB classification, beyond the scope of this module: