Semi-Supervised Learning

Machine Learning in Natural Language Semi-Supervised Learning and the EM Algorithm

Semi-Supervised Learning
Consider the problem of Prepositional Phrase Attachment: "buy car with money" vs. "buy car with steering wheel".
There are several ways to generate features. Given the limited representation, we can assume that all possible conjunctions of the (up to 4) attributes are used (15 features in each example).
Assume we will use naïve Bayes for learning to decide between the labels [n, v].
Examples are of the form (x1, x2, …, xn, [n, v]).

Using naïve Bayes
To use naïve Bayes, we need to use the data to estimate:
P(n), P(v)
P(x1|n), P(x1|v)
P(x2|n), P(x2|v)
…
P(xn|n), P(xn|v)
Then, given an example (x1, x2, …, xn, ?), compare:
Pn(x) = P(n) P(x1|n) P(x2|n) … P(xn|n)
and
Pv(x) = P(v) P(x1|v) P(x2|v) … P(xn|v)
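A minimal Python sketch of this estimation and comparison, assuming binary features and plain maximum-likelihood counts with no smoothing; the function and variable names are illustrative, not from the slides:

    from collections import defaultdict

    def train_nb(examples):
        # examples: list of (features, label) pairs, label in {"n", "v"},
        # features a tuple of 0/1 values.
        # Returns ML estimates of P(label) and P(x_j = 1 | label).
        prior = defaultdict(float)
        cond = defaultdict(lambda: defaultdict(float))
        for x, y in examples:
            prior[y] += 1.0
            for j, v in enumerate(x):
                cond[y][j] += v
        for y in prior:
            for j in cond[y]:
                cond[y][j] /= prior[y]        # P(x_j = 1 | y)
            prior[y] /= len(examples)         # P(y)
        return prior, cond

    def score(x, y, prior, cond):
        # P(y) * prod_j P(x_j | y) for a binary feature vector x
        p = prior[y]
        for j, v in enumerate(x):
            p *= cond[y][j] if v == 1 else 1.0 - cond[y][j]
        return p

Comparing score(x, "n", prior, cond) with score(x, "v", prior, cond) plays the role of comparing Pn(x) and Pv(x) above.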

Using naïve Bayes
After seeing 10 examples, we have:
P(n) = 0.5; P(v) = 0.5
P(x1|n) = 0.75; P(x2|n) = 0.5; P(x3|n) = 0.5; P(x4|n) = 0.5
P(x1|v) = 0.25; P(x2|v) = 0.25; P(x3|v) = 0.75; P(x4|v) = 0.5
Then, given the example x = (1,0,0,0), we have:
Pn(x) = 0.5 · 0.75 · 0.5 · 0.5 · 0.5 = 3/64
Pv(x) = 0.5 · 0.25 · 0.75 · 0.25 · 0.5 = 3/256
Now, assume that in addition to the 10 labeled examples, we also have 100 unlabeled examples.
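To check the arithmetic, a small hypothetical snippet that hard-codes the estimated probabilities (the scoring logic mirrors the sketch above):

    p_n, p_v = 0.5, 0.5
    px_n = [0.75, 0.5, 0.5, 0.5]     # P(x_j = 1 | n)
    px_v = [0.25, 0.25, 0.75, 0.5]   # P(x_j = 1 | v)
    x = [1, 0, 0, 0]

    Pn, Pv = p_n, p_v
    for j, v in enumerate(x):
        Pn *= px_n[j] if v else 1 - px_n[j]
        Pv *= px_v[j] if v else 1 - px_v[j]

    print(Pn, 3 / 64)    # 0.046875 0.046875
    print(Pv, 3 / 256)   # 0.01171875 0.01171875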

Using naïve Bayes
For example, what can be done with the unlabeled example (1,0,0,0,?)?
We can guess the label of the unlabeled example.
But can we use it to improve the classifier (that is, the estimation of the probabilities)?
We can treat the example x = (1,0,0,0) as
an n example with probability Pn(x)/(Pn(x) + Pv(x)), and
a v example with probability Pv(x)/(Pn(x) + Pv(x)).
Estimation of probabilities does not require working with integer counts!
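A sketch of one re-estimation round with such fractional labels, reusing the illustrative train_nb/score helpers from the earlier snippet (the names and interfaces are assumptions, not from the slides):

    def retrain_with_fractional_labels(labeled, unlabeled, prior, cond, n_features):
        # Labeled examples contribute a count of 1; each unlabeled example is split
        # between n and v in proportion to Pn(x) and Pv(x) under the current model.
        # No smoothing; assumes Pn(x) + Pv(x) > 0 for every unlabeled x.
        counts = {"n": 0.0, "v": 0.0}
        feat = {"n": [0.0] * n_features, "v": [0.0] * n_features}

        def add(x, y, w):
            counts[y] += w
            for j, v in enumerate(x):
                feat[y][j] += w * v

        for x, y in labeled:
            add(x, y, 1.0)
        for x in unlabeled:
            pn = score(x, "n", prior, cond)
            pv = score(x, "v", prior, cond)
            add(x, "n", pn / (pn + pv))
            add(x, "v", pv / (pn + pv))

        total = counts["n"] + counts["v"]
        new_prior = {y: counts[y] / total for y in counts}
        new_cond = {y: {j: feat[y][j] / counts[y] for j in range(n_features)}
                    for y in counts}
        return new_prior, new_cond

Iterating this re-estimation is exactly the EM-style procedure discussed later.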

Using Unlabeled Data
The discussion suggests several algorithms:
1. Use a threshold: choose the examples labeled with high confidence, label them [n, v], and retrain (see the sketch below).
2. Use fractional examples: label the examples with fractional labels [p of n, (1-p) of v], and retrain.
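A sketch of the threshold (self-training) variant, again assuming the illustrative train_nb/score helpers; the confidence threshold and number of rounds are arbitrary choices:

    def self_train(labeled, unlabeled, threshold=0.95, rounds=5):
        labeled = list(labeled)          # list of (features, label)
        pool = list(unlabeled)           # list of feature vectors
        for _ in range(rounds):
            prior, cond = train_nb(labeled)
            still_unlabeled = []
            for x in pool:
                pn = score(x, "n", prior, cond)
                pv = score(x, "v", prior, cond)
                if pn + pv == 0.0:       # unseen combination; keep unlabeled
                    still_unlabeled.append(x)
                    continue
                confidence = max(pn, pv) / (pn + pv)
                if confidence >= threshold:
                    labeled.append((x, "n" if pn >= pv else "v"))
                else:
                    still_unlabeled.append(x)
            pool = still_unlabeled
        return train_nb(labeled)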

Comments on Unlabeled Data
Both suggested algorithms can be used iteratively.
Both algorithms can be used with classifiers other than naïve Bayes. The only requirement is a robust confidence measure for the classification.
E.g., Brill (ACL'01) uses such algorithms with SNoW for studies of this sort.

Comments on Semi-Supervised Learning (1)
Most approaches to semi-supervised learning are based on bootstrapping ideas, e.g., Yarowsky's bootstrapping.
Co-Training:
The features can be split into two sets; each sub-feature set is (assumed) sufficient to train a good classifier; the two sets are (assumed) conditionally independent given the class.
Two separate classifiers are trained on the labeled data, one on each sub-feature set.
Each classifier then classifies the unlabeled data and 'teaches' the other classifier with the few unlabeled examples (and their predicted labels) on which it is most confident (see the sketch below).
Each classifier is retrained with the additional training examples provided by the other classifier, and the process repeats.
Multi-view learning: a more general paradigm that utilizes the agreement among different learners. Multiple hypotheses (with different biases) are trained from the same labeled data and are required to make similar predictions on any given unlabeled instance.
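A schematic co-training loop, written as a sketch against scikit-learn-style classifiers (fit / predict_proba / classes_); the two feature views, the classifiers, and the parameter values are all placeholders:

    import numpy as np

    def co_train(clf1, clf2, X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
                 k=5, rounds=10):
        # X1_* and X2_* are the two feature "views" of the same examples.
        X1_lab, X2_lab, y_lab = list(X1_lab), list(X2_lab), list(y_lab)
        unlab = list(zip(X1_unlab, X2_unlab))
        for _ in range(rounds):
            if not unlab:
                break
            clf1.fit(np.array(X1_lab), np.array(y_lab))
            clf2.fit(np.array(X2_lab), np.array(y_lab))
            for clf, view in ((clf1, 0), (clf2, 1)):
                if not unlab:
                    break
                X_view = np.array([u[view] for u in unlab])
                proba = clf.predict_proba(X_view)
                confident = np.argsort(-proba.max(axis=1))[:k]   # k most confident
                labels = clf.classes_[proba.argmax(axis=1)]
                for i in confident:
                    X1_lab.append(unlab[i][0])
                    X2_lab.append(unlab[i][1])
                    y_lab.append(labels[i])
                unlab = [u for i, u in enumerate(unlab) if i not in set(confident)]
        return clf1, clf2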

EM
EM is a class of algorithms used to estimate a probability distribution in the presence of missing attributes.
Using it requires an assumption about the underlying probability distribution.
The algorithm can be very sensitive to this assumption and to the starting point (that is, the initial guess of parameters).
In general, it is known to converge to a local maximum of the likelihood function.

Three Coin Example
We observe a series of coin tosses generated in the following way:
A person has three coins.
Coin 0: probability of Head is α
Coin 1: probability of Head is p
Coin 2: probability of Head is q
Consider the following coin-tossing scenario:

Generative Process
Scenario: Toss coin 0 (do not show it to anyone!). If Head, toss coin 1 m times; otherwise, toss coin 2 m times. Only the series of tosses is observed.
Observing the sequences HHHT, HTHT, HHHT, HTTH, what are the most likely values of the parameters α, p, q, and which coin (1 or 2) generated each observed sequence?
There is no known analytical solution to this problem. That is, it is not known how to compute the values of the parameters so as to maximize the likelihood of the data.
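Spelling this out (a reconstruction under the setup above: n observed sequences, where sequence i contains h_i heads out of m tosses), the likelihood being maximized is

    \[
    L(\alpha, p, q) \;=\; \prod_{i=1}^{n}\Big[\alpha\, p^{h_i}(1-p)^{m-h_i} \;+\; (1-\alpha)\, q^{h_i}(1-q)^{m-h_i}\Big]
    \]

The sum inside the product is what prevents a closed-form maximization.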

Key Intuition (1)
If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin 1 and which from Coin 2, there would be no problem.

Key Intuition (2)
If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin 1 and which from Coin 2, there would be no problem.
Instead, use an iterative approach for estimating the parameters:
Guess the probability that a given data point came from Coin 1 or Coin 2, and generate fictional labels, weighted according to this probability.
Now, compute the most likely values of the parameters [recall the naïve Bayes example].
Compute the likelihood of the data given this model.
Re-estimate the initial parameter setting: set the parameters to maximize the likelihood of the data.
(Labels → Model Parameters → Likelihood of the data)
This process can be iterated (see the sketch below) and can be shown to converge to a local maximum of the likelihood function.
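A compact Python sketch of this loop for the three-coin model (a hypothetical implementation of the standard updates, not code from the slides); the starting values are an arbitrary initial guess:

    def em_three_coins(sequences, alpha=0.6, p=0.7, q=0.4, iters=50):
        # sequences: equal-length H/T strings; alpha = P(coin 0 shows Head)
        m = len(sequences[0])
        heads = [s.count("H") for s in sequences]
        for _ in range(iters):
            # E-step: posterior probability that each sequence came from Coin 1
            post = []
            for h in heads:
                a = alpha * p**h * (1 - p)**(m - h)
                b = (1 - alpha) * q**h * (1 - q)**(m - h)
                post.append(a / (a + b))
            # M-step: re-estimate the parameters from the fractional labels
            alpha = sum(post) / len(post)
            p = sum(w * h for w, h in zip(post, heads)) / (m * sum(post))
            q = sum((1 - w) * h for w, h in zip(post, heads)) / (m * sum(1 - w for w in post))
        return alpha, p, q

    # e.g. em_three_coins(["HHHT", "HTHT", "HHHT", "HTTH"])

A fuller implementation would also track the log-likelihood and stop once it stops improving.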

EM Algorithm (Coins) - I
We will assume (for a minute) that we know the parameters α, p, q and use them to estimate which coin was tossed (Problem 1).
Then, we will use that estimate of the tossed coin to estimate the most likely parameters, and so on...
What is the probability that the i-th data point came from Coin 1?
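As a reconstruction of the slide's (missing) formula under the setup above (sequence i with h_i heads out of m tosses), the standard answer is

    \[
    P(\text{Coin 1} \mid D_i) \;=\;
    \frac{\alpha\, p^{h_i}(1-p)^{m-h_i}}
         {\alpha\, p^{h_i}(1-p)^{m-h_i} \;+\; (1-\alpha)\, q^{h_i}(1-q)^{m-h_i}}
    \]

which is exactly the E-step computed in the sketch above.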

EM Algorithm (Coins) - II

EM Algorithm (Coins) - III

EM Algorithm (Coins) - IV Explicitly, we get:
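(The slide's equations are not preserved in this transcript; the following is a standard reconstruction, writing \tilde{P}_1^i for the posterior P(Coin 1 | D_i) from the previous slide.)

    \[
    \alpha' \;=\; \frac{1}{n}\sum_{i=1}^{n}\tilde{P}_1^{\,i}, \qquad
    p' \;=\; \frac{\sum_{i=1}^{n}\tilde{P}_1^{\,i}\, h_i}{m\sum_{i=1}^{n}\tilde{P}_1^{\,i}}, \qquad
    q' \;=\; \frac{\sum_{i=1}^{n}\bigl(1-\tilde{P}_1^{\,i}\bigr) h_i}{m\sum_{i=1}^{n}\bigl(1-\tilde{P}_1^{\,i}\bigr)}
    \]

These are the same M-step updates used in the em_three_coins sketch above.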

EM Algorithm (Coins) - V
When computing the derivatives, notice that the posterior probability that the i-th data point came from Coin 1 is a constant here; it was computed using the current parameters (including α).

Models with Hidden Variables

EM

EM Summary (so far)
EM is a general procedure for learning in the presence of unobserved variables.
We have shown how to use it in order to estimate the most likely density function for a mixture of (Bernoulli) distributions.
EM is an iterative algorithm that can be shown to converge to a local maximum of the likelihood function.
It depends on assuming a family of probability distributions. In this sense, it is a family of algorithms. The update rules you will derive depend on the model assumed.