
1 Statistical NLP Winter 2009 Lecture 5: Unsupervised Learning I Unsupervised Word Segmentation Roger Levy [thanks to Sharon Goldwater for many slides]

2 Supervised training Standard statistical systems use a supervised paradigm. Training: labeled training data → machine learning system → statistics → prediction procedure.

3 The real story Annotating labeled data is labor-intensive!!! Training: human effort → labeled training data → machine learning system → statistics → prediction procedure.

4 The real story (II) This also means that moving to a new language, domain, or even genre can be difficult. But unlabeled data is cheap! It would be nice to use the unlabeled data directly to learn the labelings you want in your model. Today we’ll look at methods for doing exactly this.

5 Today’s plan We’ll illustrate unsupervised learning with the “laboratory” (but cognitively hugely important!) task of word segmentation We’ll use Bayesian methods for unsupervised learning

6 Learning structured models Most of the models we'll look at in this class are structured: tagging, parsing, role labeling, coreference. The structure is latent. With raw data, we have to construct models that will be rewarded for inferring that latent structure.

7 A very simple example Suppose that we observe the following counts: A: 9, B: 9, C: 1, D: 1. Suppose we are told that these counts arose from tossing two coins, each with a different label on each side. For example, coin 1 might be A/C and coin 2 B/D. Suppose further that we are told that the coins are not extremely unfair. There is an intuitive solution; how can we learn it?

8 A very simple example (II) Suppose we fully parameterize the model: the MLE is totally degenerate, since it cannot distinguish which letters should be paired on a coin. Convince yourself of this! We need to specify more constraints on the model. The general idea would be to place priors on the model parameters. An extreme variant: force p_1 = p_2 = 0.5. (Counts: A: 9, B: 9, C: 1, D: 1.)

9 A very simple example (III) An extreme variant: force p_1 = p_2 = 0.5. This forces structure into the model. It also makes it easy to visualize the log-likelihood as a function of the remaining free parameter π. The intuitive solution is found! (Counts: A: 9, B: 9, C: 1, D: 1.)
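
The slide's likelihood plot is not reproduced here, but a minimal sketch can recover the same conclusion. It assumes one particular reading of the example (my assumption, not spelled out above): each toss picks coin 1 with probability π and coin 2 otherwise, each coin is forced to be fair, and we compare the maximized log-likelihood of each possible pairing of labels onto coins.

```python
# Enumerate the three possible pairings of labels onto two fair coins and
# compute the maximized log-likelihood of the observed counts for each.
from math import log

counts = {"A": 9, "B": 9, "C": 1, "D": 1}
total = sum(counts.values())
labels = sorted(counts)

def best_loglik(pair1, pair2):
    """Maximized log-likelihood when coin 1 carries pair1 and coin 2 carries pair2."""
    n1 = sum(counts[x] for x in pair1)   # tosses attributed to coin 1
    n2 = sum(counts[x] for x in pair2)   # tosses attributed to coin 2
    pi = n1 / total                      # MLE of the coin-choice probability
    # Each toss of coin 1 has probability pi * 0.5; coin 2 has (1 - pi) * 0.5.
    return n1 * log(pi * 0.5) + n2 * log((1 - pi) * 0.5), pi

for partner in ["B", "C", "D"]:          # fix A on coin 1 to avoid duplicate pairings
    pair1 = ("A", partner)
    pair2 = tuple(x for x in labels if x not in pair1)
    ll, pi = best_loglik(pair1, pair2)
    print(f"coin 1 = A/{partner}, coin 2 = {pair2[0]}/{pair2[1]}: "
          f"pi = {pi:.2f}, log-likelihood = {ll:.2f}")
# The pairing A/B vs. C/D (pi = 0.9) gets the highest likelihood: the
# "intuitive" solution, once the coins are constrained to be fair.
```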

10 The EM algorithm In the two-coin example, we were able to explore the likelihood surface exhaustively: enumerating all possible model structures, analytically deriving the MLE for each model structure, and picking the model structure with the best MLE. In general, however, latent structure often makes direct analysis of the likelihood surface intractable or impossible. [mixture of Gaussians example?]

11 The EM algorithm In cases of an unanalyzable likelihood function, we want to use hill-climbing techniques to find good points on the likelihood surface. Some of these fall under the category of iterative numerical optimization. In our case, we'll look at a general-purpose tool that is guaranteed "not to do bad things": the Expectation-Maximization (EM) algorithm.

12 EM for unsupervised HMM learning We’ve already seen examples of using dynamic programming via a trellis for inference in HMMs:

13 Category learning: EM for HMMs You want to estimate the parameters θ. There are statistics you'd need to do this supervised: for HMMs, the # of transitions & emissions of each type. Suppose you have a starting estimate of θ. E-step: calculate the expectations over your statistics (expected # of transitions between each state pair, expected # of emissions from each state to each word). M-step: re-estimate θ based on your expected statistics. A sketch of one such iteration follows.
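
As a concrete illustration, here is a minimal sketch of one EM (Baum-Welch) iteration for a discrete HMM on a single observation sequence. The array names (A, B, pi, obs) and the example numbers are mine, not from the slides.

```python
import numpy as np

def em_step(A, B, pi, obs):
    """One E+M step: A = transitions (S x S), B = emissions (S x V),
    pi = initial distribution (S), obs = list of symbol ids."""
    S, T = A.shape[0], len(obs)

    # E-step: forward-backward to get expected counts.
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    Z = alpha[T - 1].sum()                      # likelihood of the sequence

    gamma = alpha * beta / Z                    # P(state at time t | obs)
    exp_trans = np.zeros_like(A)                # expected # transitions i -> j
    for t in range(T - 1):
        exp_trans += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / Z
    exp_emit = np.zeros_like(B)                 # expected # emissions state -> symbol
    for t in range(T):
        exp_emit[:, obs[t]] += gamma[t]

    # M-step: re-estimate theta from the expected counts.
    A_new = exp_trans / exp_trans.sum(axis=1, keepdims=True)
    B_new = exp_emit / exp_emit.sum(axis=1, keepdims=True)
    return A_new, B_new, gamma[0]

# Illustrative two-state, three-symbol example (numbers are made up):
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])
pi = np.array([0.5, 0.5])
A, B, pi = em_step(A, B, pi, obs=[0, 2, 1, 1, 2, 0])
```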

14 EM for HMMs: example (M&S 1999) We have a crazy soft drink machine with two states. We observe an output sequence. Start with initial parameter estimates. Re-estimate!

15 Adding a Bayesian Prior For model (w, t, θ), try to find the optimal value of θ using Bayes' rule: P(θ|w) ∝ P(w|θ) P(θ), i.e. posterior ∝ likelihood × prior. Two standard objective functions are maximum-likelihood estimation (MLE), which maximizes the likelihood P(w|θ), and maximum a posteriori (MAP) estimation, which maximizes the posterior P(θ|w).

16 Dirichlet priors For multinomial distributions, the Dirichlet makes a natural prior. A symmetric Dirichlet(β) prior over θ = (θ_1, θ_2): β > 1: prefer uniform distributions; β = 1: no preference; β < 1: prefer sparse (skewed) distributions.
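
A quick way to see the effect of β is to draw samples from symmetric Dirichlets with different concentrations. This small illustration is mine (not from the slides); the specific values of β and K are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of outcomes in the multinomial

for beta in (10.0, 1.0, 0.1):
    theta = rng.dirichlet([beta] * K, size=3)   # three sample multinomials
    print(f"beta = {beta}:")
    for row in theta:
        print("  ", np.round(row, 3))
# beta >> 1 gives near-uniform theta; beta << 1 gives spiky, sparse theta.
```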

17 MAP estimation with EM We have already seen how to do ML estimation with the Expectation-Maximization algorithm. We can also do MAP estimation with the appropriate type of prior. MAP estimation affects the M-step of EM: for example, with a Dirichlet prior, the MAP estimate can be calculated by treating the prior parameters as "pseudo-counts" (Beal 2003).
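
A hedged sketch of what that modified M-step looks like for a single multinomial under a symmetric Dirichlet(β) prior: add (β − 1) to each expected count and renormalize. Variable names are mine; the clipping convention for β < 1 is a common choice, not necessarily the one used in the original work.

```python
import numpy as np

def map_m_step(expected_counts, beta):
    """MAP re-estimate of a multinomial from expected counts (1-D array)."""
    smoothed = expected_counts + (beta - 1.0)
    # With beta < 1 the numerator can go negative; clip at zero, the usual
    # convention for sparse MAP estimates.
    smoothed = np.clip(smoothed, 0.0, None)
    return smoothed / smoothed.sum()

print(map_m_step(np.array([3.0, 1.0, 0.0]), beta=2.0))   # smoothed toward uniform
print(map_m_step(np.array([3.0, 1.0, 0.5]), beta=0.5))   # pushed toward sparsity
```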

18 More than just the MAP Why do we want to estimate θ? Prediction: estimate P(w_{n+1}|θ). Structure recovery: estimate P(t|θ,w). To the true Bayesian, the model parameters θ should really be marginalized out. Prediction: estimate P(w_{n+1}|w) = ∫ P(w_{n+1}|θ) P(θ|w) dθ. Structure: estimate P(t|w) = ∫ P(t|θ,w) P(θ|w) dθ. We don't want to choose model parameters if we can avoid it.

19 Bayesian integration When we integrate over the parameters θ, we gain Robustness: values of hidden variables will have high probability over a range of θ. Flexibility: allows wider choice of priors, including priors favoring sparse solutions.

20 Integration example Suppose we want to estimate P(t|w) = ∫ P(t|θ,w) P(θ|w) dθ in a case where P(θ|w) is broad but P(t = 1|θ,w) is peaked. Estimating t based on a fixed θ* favors t = 1, but for many probable values of θ, t = 0 is a better choice.

21 Sparse distributions In language learning, sparse distributions are often preferable (e.g., conditional distributions like P(w_i | w_{i-1})). Problem: when β < 1, setting any θ_k = 0 makes P(θ) → ∞ regardless of the other θ_j. Solution: instead of fixing θ, integrate it out.

22 A new problem So far, we have assumed that all models under consideration have the same number of parameters. Topic models: fixed number of topics, fixed number of words. What if we want to compare models of different sizes? Standard method: use ML or MAP estimation to find optimal models using different numbers of parameters, then compare using model selection (e.g., likelihood ratio test). But… not always practical or even possible.

23 Word segmentation Given a corpus of fluent speech or text (no word boundaries), we want to identify the words. An early language acquisition task for human infants. Example: whatsthat / thedoggie / yeah / wheresthedoggie → whats that / the doggie / yeah / wheres the doggie / see the doggie.

24 Word segmentation Given a corpus of fluent speech or text (no word boundaries), we want to identify the words. A preprocessing step for many Asian languages. Example: whatsthat / thedoggie / yeah / wheresthedoggie → whats that / the doggie / yeah / wheres the doggie.

25 Maximum-likelihood word segmentation Consider using a standard n-gram model (Venkataraman, 2001); here we illustrate with a unigram model. Hypothesis space: all possible segmentations of the input corpus, with probabilities for each inferred lexical item. If model size is unconstrained, the ML solution is trivial: each sentence is memorized as a "word", with probability equal to its empirical probability, since adding boundaries shifts probability mass to unseen sentences. Explicitly computing and comparing solutions with all possible sets of words is intractable. Any non-trivial solution is due to implicit constraints of search.

26 A possible solution Brent (1999) defines a Bayesian model for word segmentation. The prior favors solutions with fewer and shorter lexical items. He uses an approximate online algorithm to search for the MAP solution. Problems: the model is difficult to modify or extend in any interesting way, and results are influenced by the special-purpose search algorithm.

27 Can we do better? We would like to develop models that allow comparison between solutions of varying complexity, use standard search algorithms, and can be easily modified and extended to explore the effects of different assumptions: properties of lexical items (what does a typical word look like?) and properties of word behavior (how often is a word likely to appear?). Using Bayesian techniques, we can!

28 A new model Assume words w = {w_1 … w_n} are generated as follows: G ~ DP(α_0, P_0), and each w_i ~ G. G is analogous to θ in the BHMM, but with infinite dimension; as with θ, we integrate it out. DP(α_0, P_0) is a Dirichlet process with concentration parameter α_0 and base distribution P_0.

29 Recap: the Dirichlet distribution It'll be useful to briefly revisit the Dirichlet distribution in order to get to the Dirichlet process. The k-class Dirichlet distribution is a probability distribution over k-class multinomial distributions, with parameters α_i: P(θ) = (1/Z) ∏_i θ_i^(α_i − 1). The normalizing constant is Z = ∏_i Γ(α_i) / Γ(∑_i α_i). Symmetric Dirichlet: all α_i are set to the same value α.

30 The Dirichlet process Our generative model for word segmentation assumes data (words!) arise from clusters. The DP defines: a distribution over the number and size of the clusters, and a distribution P_0 over the parameters describing the distribution of data in each cluster. Clusters = frequencies of different words. Cluster parameters = identities of different words.

31 The Chinese restaurant process In the DP, the number of items in each cluster is defined by the Chinese restaurant process. The restaurant has an infinite number of tables, each with infinite seating capacity. The table chosen by the i-th customer, z_i, depends on the seating arrangement of the previous i − 1 customers: an occupied table k is chosen with probability proportional to its current occupancy n_k, and a new table with probability proportional to α_0. The CRP produces a power-law distribution over cluster sizes.
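
A small simulation makes the rich-get-richer behavior easy to see. This sketch uses the standard CRP seating probabilities described above; the choice of α and number of customers is arbitrary.

```python
import random

def crp(num_customers, alpha, seed=0):
    random.seed(seed)
    tables = []                                # tables[k] = # customers at table k
    for i in range(num_customers):
        weights = tables + [alpha]             # existing tables, then a new table
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)                   # open a new table
        else:
            tables[k] += 1                     # join an existing table
    return tables

print(sorted(crp(1000, alpha=5.0), reverse=True))
# A few large tables and a long tail of small ones: a heavily skewed
# distribution over cluster sizes.
```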

32 The two-stage restaurant 1. Assign data points to clusters (tables). 2. Sample labels for tables using P_0 (in the figure, tables are labeled with words such as see, dog, the, is, cat, on, …).

33 Alternative view Equivalently, words are generated sequentially using a cache model: previously generated words are more likely to be generated again.

34 Unigram model The DP model yields the following distribution over words: P(w_i = w | w_1 … w_{i−1}) = (n_w + α_0 P_0(w)) / (i − 1 + α_0), with P_0 a character-level distribution over possible word forms x_1 … x_m; P_0 favors shorter lexical items. Words are not independent, but they are exchangeable, e.g. P(w_1, w_2, w_3, w_4) = P(w_2, w_4, w_1, w_3): a unigram model. The input corpus contains utterance boundaries; we assume a geometric distribution on utterance lengths.
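
A hedged sketch of this predictive distribution in code. The base distribution P_0 here is a simple placeholder (uniform characters with a geometric length penalty); the exact form and hyperparameter values used in the original model may differ.

```python
from collections import Counter

ALPHABET_SIZE = 26
P_STOP = 0.5            # geometric stopping probability (illustrative value)

def p0(word):
    """Base distribution over strings: geometric length, uniform characters."""
    m = len(word)
    return (P_STOP * (1 - P_STOP) ** (m - 1)) * (1.0 / ALPHABET_SIZE) ** m

def dp_predictive(word, observed, alpha0=20.0):
    """Predictive probability of the next word given previously observed words."""
    counts = Counter(observed)
    n = sum(counts.values())
    return (counts[word] + alpha0 * p0(word)) / (n + alpha0)

corpus = ["the", "dog", "the", "cat", "is", "on", "the", "mat"]
print(dp_predictive("the", corpus))   # frequent word: high probability
print(dp_predictive("zyx", corpus))   # unseen word: falls back to alpha0 * P0
```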

35 Unigram model, in more detail How a corpus comes into being… First, a probability distribution over corpus length N: Next, a probability distribution for the type identity of each new word: Finally, a probability distribution over the phonological form of each word type

36 Advantages of DP language models Solutions are sparse, yet grow as data size grows. Smaller values of α_0 lead to fewer lexical items generated by P_0. Models lexical items separately from frequencies. Different choices for P_0 can infer different kinds of linguistic structure. Amenable to standard search procedures (e.g., Gibbs sampling).

37 Gibbs sampling Compare pairs of hypotheses differing by a single word boundary, e.g. hypothesis 1: whats.that / the.doggie / yeah / wheres.the.doggie / … versus hypothesis 2: whats.that / the.dog.gie / yeah / wheres.the.doggie / …. Calculate the probabilities of the words that differ, given the current analysis of all other words. Sample a hypothesis according to the ratio of probabilities.
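
A simplified sketch of that single-boundary decision under the unigram DP model. It scores the unsplit word against the two split words using the DP predictive probability over all other words; it omits some corrections the full sampler includes (e.g., utterance-boundary bookkeeping), and the function names and the p0 argument are mine.

```python
import random
from collections import Counter

def score_one(word, counts, n, alpha0, p0):
    return (counts[word] + alpha0 * p0(word)) / (n + alpha0)

def sample_boundary(w, w1, w2, other_words, alpha0, p0, rng=random):
    """Return True if a boundary should be placed, splitting w into w1 + w2."""
    counts = Counter(other_words)            # counts over all *other* words
    n = sum(counts.values())
    p_no_split = score_one(w, counts, n, alpha0, p0)
    # Score w2 as if w1 had already been added to the counts.
    p_split = score_one(w1, counts, n, alpha0, p0) * \
              (counts[w2] + (w1 == w2) + alpha0 * p0(w2)) / (n + 1 + alpha0)
    return rng.random() < p_split / (p_split + p_no_split)
```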

38 Experiments Input: the same corpus as Brent (1999) and Venkataraman (2001): 9790 utterances of transcribed child-directed speech. Example input: youwanttoseethebook / looktheresaboywithhishat / andadoggie / youwanttolookatthis / ... Using different values of α_0, evaluate on a single sample after 20k iterations.

39 Example results youwant to see thebook look theres aboy with his hat and adoggie you wantto lookatthis lookatthis havea drink okay now whatsthis whatsthat whatisit look canyou take itout...

40 Quantitative evaluation Proposed boundaries are more accurate than in other models, but fewer proposals are made. Result: lower accuracy on words. (Precision: #correct / #found. Recall: #correct / #true.)

                 Boundaries        Word tokens
                 Prec    Rec       Prec    Rec
Venk. (2001)     80.6    84.8      67.7    70.2
Brent (1999)     80.3    84.3      67.0    69.4
DP model         92.4    62.2      61.9    47.6
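
A small sketch of boundary precision/recall as defined on this slide, representing boundaries as sets of character offsets into the unsegmented utterance. The function name and the toy examples are mine.

```python
def boundary_prf(proposed, gold):
    """Boundary precision, recall, and F-score for one utterance."""
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    prec = correct / len(proposed) if proposed else 0.0
    rec = correct / len(gold) if gold else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

# "whatsthat" segmented as "whats that" vs. gold "whats that":
print(boundary_prf(proposed={5}, gold={5}))          # perfect
# "thedoggie" segmented as "the dog gie" vs. gold "the doggie":
print(boundary_prf(proposed={3, 6}, gold={3}))       # precision 0.5, recall 1.0
```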

41 What happened? The DP model assumes (falsely) that words have the same probability regardless of context: P(that) = .024, but P(that | whats) = .46 and P(that | to) = .0019. Positing collocations allows the model to capture word-to-word dependencies.

42 What about other unigram models? Venkataraman's system is based on MLE, so we know its results are due to constraints imposed by search. Brent's search algorithm also yields a non-optimal solution: our solution has higher probability under his model than his own solution does, and on a randomly permuted corpus our system achieves 96% accuracy while Brent's gets 81%. Any (reasonable) unigram model will undersegment. Bottom line: previous results were accidental properties of search, not systematic properties of the models.

43 Improving the model By incorporating context (using a bigram model), perhaps we can improve segmentation…

44 Hierarchical Dirichlet process 1. Generate G, a distribution over words (e.g. dog, and, girl, the, red, car, $, …), using DP(α_0, P_0).

45 Adding bigram dependencies 2. For each word in the data, generate a distribution over the words that follow it, using DP(α_1, G).
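
A heavily hedged sketch of the resulting bigram predictive probability: the distribution over the word following v backs off, through the shared base distribution G, to the unigram DP, which in turn backs off to P_0. This simplified version ignores the table-count bookkeeping of the full Chinese restaurant franchise, so it is only an approximation of the real HDP predictive distribution; all names here are mine.

```python
from collections import Counter

def hdp_bigram_predictive(w, prev, unigrams, bigrams, alpha0, alpha1, p0):
    """Approximate P(w | prev): bigram counts backed off to a unigram DP."""
    uni = Counter(unigrams)                # unigram word counts
    big = Counter(bigrams)                 # counts of (prev_word, word) pairs
    n_uni = sum(uni.values())
    n_prev = sum(c for (v, _), c in big.items() if v == prev)

    p_unigram = (uni[w] + alpha0 * p0(w)) / (n_uni + alpha0)
    return (big[(prev, w)] + alpha1 * p_unigram) / (n_prev + alpha1)
```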

46 Example results you want to see the book look theres a boy with his hat and a doggie you want to lookat this lookat this have a drink okay now whats this whats that whatis it look canyou take it out...

47 Quantitative evaluation With an appropriate choice of α and β, boundary precision is nearly as good as the unigram model's and recall is much better. F-score (harmonic mean of prec and rec) on all three measures outperforms all previously published models.

                 Boundaries        Word tokens       Lexicon
                 Prec    Rec       Prec    Rec       Prec    Rec
DP (unigram)     92.4    62.2      61.9    47.6      57.0    57.5
Venk. (bigram)   81.7    82.5      68.1    68.6      54.5    57.0
HDP (bigram)     89.9    83.8      75.7    72.1      63.1    50.3

48 Summary: word segmentation Our approach to word segmentation using infinite Bayesian models allowed us to: incorporate sensible priors to avoid trivial solutions (à la MLE); examine the effects of modeling assumptions without limitations from search; demonstrate the importance of context for word segmentation; and achieve the best published results on this corpus.

49 Conclusion Bayesian methods are a promising way to improve unsupervised learning and increase coverage of NLP. Bayesian integration increases robustness to noisy data. Use of priors allows us to prefer sparse solutions (and explore other possible constraints). Infinite (nonparametric) models grow with the size of the data. POS tagging and word segmentation are specific examples, but similar methods can be applied to a variety of structure learning tasks.

50 Integrating out θ in HMMs We want to integrate over the parameters: P(t | w) = ∫ P(t | w, θ) P(θ | w) dθ. Problem: this is intractable. Solution: we can approximate the integral using sampling techniques.

51 Structure of the Bayesian HMM Hyperparameters α, β determine the model parameters τ, ω, and these influence the generation of structure. (Figure: an HMM generating "the boy is on" with tags Det, N, V, Prep; α governs the transition parameters τ and β the emission parameters ω.)

52 The precise problem Unsupervised learning: we know the hyperparameters and the observations, we don't really care about the parameters τ, ω, and we want to infer the conditional distribution on the labels! (Figure: the same HMM with the tags and the parameters τ, ω unknown.)

53 Posterior inference w/ Gibbs Sampling Suppose that we knew all the latent structure but for one tag. We could then calculate the posterior distribution over this tag: P(t_i | t_{-i}, w, τ, ω) ∝ P(w_i | t_i, ω) P(t_i | t_{i−1}, τ) P(t_{i+1} | t_i, τ).

54 Posterior inference w/ Gibbs Sampling Really, even if we knew all but one label, we wouldn't know the parameters τ, ω. That turns out to be OK: we can integrate over them.

55 Posterior inference w/ Gibbs Sampling The theory of Markov chain Monte Carlo sampling says that if we do this type of resampling for a long time, we will converge to the true posterior distribution over labels: initialize the tag sequence however you want, then iterate through the sequence many times, each time resampling each tag from its conditional distribution given all the other tags and the words.
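
A hedged sketch of the collapsed Gibbs conditional for one tag in a Bayesian HMM with symmetric Dirichlet(α) transition and Dirichlet(β) emission priors. With the parameters integrated out, the conditional is proportional to smoothed count ratios; this version omits the small corrections needed when t_{i-1}, t_i, and t_{i+1} coincide, so it is approximate, and the data-structure choices are mine.

```python
import random

def resample_tag(i, tags, words, trans, emit, tag_count, T, W, alpha, beta,
                 rng=random):
    """Resample tags[i] in place. i is an interior position; trans, emit, and
    tag_count are Counters over all positions *except* i (transition pairs,
    (tag, word) emissions, and tag occurrences); T = # tags, W = vocab size."""
    prev_t, next_t, w = tags[i - 1], tags[i + 1], words[i]
    weights = []
    for t in range(T):
        p = (emit[(t, w)] + beta) / (tag_count[t] + W * beta)          # emission
        p *= (trans[(prev_t, t)] + alpha) / (tag_count[prev_t] + T * alpha)
        p *= (trans[(t, next_t)] + alpha) / (tag_count[t] + T * alpha)
        weights.append(p)
    tags[i] = rng.choices(range(T), weights=weights)[0]
```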

56 Experiments of Goldwater & Griffiths 2006 Vary α, β using standard “unsupervised” POS tagging methodology: Tag dictionary lists possible tags for each word (based on ~1m words of Wall Street Journal corpus). Train and test on unlabeled corpus (24,000 words of WSJ). 53.6% of word tokens have multiple possible tags. Average number of tags per token = 2.3. Compare tagging accuracy to other methods. HMM with maximum-likelihood estimation using EM (MLHMM). Conditional Random Field with contrastive estimation (CRF/CE) (Smith & Eisner, 2005).

57 Results The transition hyperparameter α has more effect than the output hyperparameter β. Smaller α enforces a sparse transition matrix and improves scores. (Less effect of β due to more varied output distributions?) Even uniform priors outperform MLHMM (due to integration).

Tagging accuracy (%):
MLHMM                             74.7
BHMM (α = 1, β = 1)               83.9
BHMM (best: α = .003, β = 1)      86.8
CRF/CE (best)                     90.1

58 Hyperparameter inference Selecting hyperparameters based on performance is problematic. Violates unsupervised assumption. Time-consuming. Bayesian framework allows us to infer values automatically. Add uniform priors over the hyperparameters. Resample each hyperparameter after each Gibbs iteration. Results: slightly worse than oracle (84.4% vs. 86.8%), but still well above MLHMM (74.7%).

59 Reducing lexical resources Experiments inspired by Smith & Eisner (2005): collapse the 45 treebank tags onto a smaller set of 17; create several dictionaries of varying quality, where words appearing at least d times in the 24k-word training corpus are listed in the dictionary (d = 1, 2, 3, 5, 10, ∞) and words appearing fewer than d times can belong to any class. Since the standard accuracy measure requires labeled classes, we measure using the best many-to-one matching of classes (see the sketch below).
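
A brief sketch of many-to-one accuracy: each induced cluster is mapped to the gold tag it most frequently co-occurs with, and accuracy is computed under that mapping (several clusters may map to the same gold tag). The function name and toy example are mine.

```python
from collections import Counter, defaultdict

def many_to_one_accuracy(induced, gold):
    """induced, gold: equal-length sequences of cluster ids and gold tags."""
    cooc = defaultdict(Counter)
    for c, g in zip(induced, gold):
        cooc[c][g] += 1
    # Map each induced cluster to its most frequent gold tag.
    mapping = {c: counts.most_common(1)[0][0] for c, counts in cooc.items()}
    correct = sum(1 for c, g in zip(induced, gold) if mapping[c] == g)
    return correct / len(gold)

print(many_to_one_accuracy([0, 0, 1, 1, 2], ["DT", "DT", "NN", "VB", "NN"]))  # 0.8
```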

60 Results BHMM outperforms MLHMM for all dictionary levels, more so with smaller dictionaries (results use inference on hyperparameters):

d          1       2       3       5       10      ∞
MLHMM      90.6    78.2    74.7    70.5    65.4    34.7
BHMM       91.7    83.7    80.0    77.1    72.8    63.3

61 Clustering results MLHMM groups tokens of the same lexical item together. BHMM clusters are more coherent and more variable in size. Errors are often sensible (e.g. separating common nouns from proper nouns, confusing determiners with adjectives, or prepositions with participles). (Figure: example clusterings from BHMM and MLHMM.)

62 Summary Using Bayesian techniques with a standard model dramatically improves unsupervised POS tagging. Integration over parameters adds robustness to estimates of hidden variables. Use of priors allows preference for sparse distributions typical of natural language. Especially helpful when learning is less constrained.

