Lecture 8: Word Clustering
Kai-Wei Chang, University of Virginia. Course webpage: 6501 Natural Language Processing
This lecture: Brown clustering
Brown Clustering

Similar to a language model, but the basic unit is the "word cluster". Intuition: again, similar words appear in similar contexts.

Recap: bigram language models

P(w_0, w_1, w_2, ..., w_n) = P(w_1 | w_0) P(w_2 | w_1) ... P(w_n | w_{n-1}) = \prod_{i=1}^{n} P(w_i | w_{i-1})

where w_0 is a dummy word representing the beginning of a sentence.
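The recap above can be sketched in a few lines of Python. This is a minimal illustration of maximum-likelihood bigram estimation; the helper names `bigram_mle` and `sentence_prob` are hypothetical, not from the lecture.

```python
from collections import defaultdict

def bigram_mle(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency.

    "<s>" plays the role of the dummy start word w_0.
    """
    bigram = defaultdict(int)
    context = defaultdict(int)
    for sent in sentences:
        prev = "<s>"
        for w in sent:
            bigram[(prev, w)] += 1
            context[prev] += 1
            prev = w
    return {pair: c / context[pair[0]] for pair, c in bigram.items()}

def sentence_prob(p, sent):
    """P(w_1..w_n) = prod_i P(w_i | w_{i-1}), with w_0 = "<s>"."""
    prob, prev = 1.0, "<s>"
    for w in sent:
        prob *= p.get((prev, w), 0.0)  # unseen bigrams get probability 0
        prev = w
    return prob
```

For example, training on the two sentences "a dog" and "a cat" gives P("a" | "<s>") = 1 and P("dog" | "a") = 0.5.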
Motivation example

"a dog is chasing a cat"

P(w_0, "a", "dog", ..., "cat") = P("a" | w_0) P("dog" | "a") ... P("cat" | "a")

Assume every word belongs to a cluster:
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...
Motivation example

Assume every word belongs to a cluster.

"a dog is chasing a cat"
 C3  C46  C64  C8       C3  C46
 a   dog  is   chasing  a   cat

Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...
Motivation example

Assume every word belongs to a cluster.

"the boy is following a rabbit"
 C3   C46  C64  C8         C3  C46
 the  boy  is   following  a   rabbit

(same clusters as before)
Motivation example

Assume every word belongs to a cluster.

"a fox was chasing a bird"
 C3  C46  C64  C8       C3  C46
 a   fox  was  chasing  a   bird

(same clusters as before)
Brown Clustering

Let C(w) denote the cluster that w belongs to.

"a dog is chasing a cat"
 C3  C46  C64  C8       C3  C46
 a   dog  is   chasing  a   cat

Two kinds of probabilities appear: cluster-to-cluster transitions such as P(C(dog) | C(a)), and cluster-to-word emissions such as P(cat | C(cat)).
Brown clustering model
P("a dog is chasing a cat")
  = P(C("a") | C_0) P(C("dog") | C("a")) P(C("is") | C("dog")) ...
    × P("a" | C("a")) P("dog" | C("dog")) ...

 C3  C46  C64  C8       C3  C46
 a   dog  is   chasing  a   cat
In general,

P(w_0, w_1, w_2, ..., w_n)
  = P(C(w_1) | C(w_0)) P(C(w_2) | C(w_1)) ... P(C(w_n) | C(w_{n-1}))
    × P(w_1 | C(w_1)) P(w_2 | C(w_2)) ... P(w_n | C(w_n))
  = \prod_{i=1}^{n} P(C(w_i) | C(w_{i-1})) P(w_i | C(w_i))
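This factorization translates directly into code. A minimal sketch with toy hand-set parameters; the function and variable names (`class_bigram_prob`, `p_trans`, `p_emit`) are illustrative, not from the lecture.

```python
def class_bigram_prob(sent, C, p_trans, p_emit, c0="C0"):
    """P(w_1..w_n) = prod_i P(C(w_i) | C(w_{i-1})) * P(w_i | C(w_i)).

    C       : word -> cluster id
    p_trans : (previous cluster, cluster) -> probability
    p_emit  : (word, cluster) -> probability
    c0      : cluster of the dummy start word w_0
    """
    prob, prev_c = 1.0, c0
    for w in sent:
        c = C[w]
        prob *= p_trans.get((prev_c, c), 0.0) * p_emit.get((w, c), 0.0)
        prev_c = c
    return prob
```

With C = {a: C3, dog: C46}, P(C3 | C0) = 1, P(C46 | C3) = 0.5, P(a | C3) = 0.5, and P(dog | C46) = 0.25, the probability of "a dog" is 1 × 0.5 × 0.5 × 0.25 = 0.0625.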
Model parameters

P(w_0, w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(C(w_i) | C(w_{i-1})) P(w_i | C(w_i))

Parameter set 1: P(C(w_i) | C(w_{i-1}))  (cluster transitions)
Parameter set 2: P(w_i | C(w_i))         (word emissions)
Parameter set 3: C(w_i)                  (the cluster assignment itself)
Model parameters

P(w_0, w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(C(w_i) | C(w_{i-1})) P(w_i | C(w_i))

- A vocabulary set V
- A function C : V → {1, 2, ..., k}, i.e., a partition of the vocabulary into k classes
- Conditional probabilities P(c' | c) for c, c' ∈ {1, ..., k}
- Conditional probabilities P(w | c) for c ∈ {1, ..., k}, w ∈ V

θ represents the set of conditional probability parameters; C represents the clustering.
Log likelihood

LL(θ, C) = log P(w_0, w_1, w_2, ..., w_n | θ, C)
         = log \prod_{i=1}^{n} P(C(w_i) | C(w_{i-1})) P(w_i | C(w_i))
         = \sum_{i=1}^{n} [ log P(C(w_i) | C(w_{i-1})) + log P(w_i | C(w_i)) ]

Maximizing LL(θ, C) can be done by alternately updating θ and C:
  1. max_{θ ∈ Θ} LL(θ, C)
  2. max_C LL(θ, C)
max_{θ ∈ Θ} LL(θ, C)

LL(θ, C) = \sum_{i=1}^{n} [ log P(C(w_i) | C(w_{i-1})) + log P(w_i | C(w_i)) ]

With the clustering C fixed, the maximizing θ is given by relative-frequency counts:

P(c' | c) = #(c, c') / #(c)
P(w | c)  = #(w, c) / #(c)
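The count-based θ-step can be sketched as follows. One assumption is made explicit in the comments: the slide writes #(c) for both denominators, and this sketch reads it as the count of c in the relevant role (context for transitions, emitter for emissions). The helper name `estimate_theta` is illustrative.

```python
from collections import Counter

def estimate_theta(words, C):
    """Relative-frequency estimates for a fixed clustering C:
       P(c' | c) = #(c, c') / #(c)   and   P(w | c) = #(w, c) / #(c).

    words: the corpus as a flat token list; "<s>" is prepended as w_0.
    """
    seq = ["<s>"] + list(words)
    C = dict(C, **{"<s>": "C0"})
    cc = Counter((C[a], C[b]) for a, b in zip(seq, seq[1:]))  # #(c, c')
    ctx = Counter(C[a] for a in seq[:-1])                     # #(c) as context
    tok = Counter(C[w] for w in words)                        # #(c) as emitter
    wc = Counter((w, C[w]) for w in words)                    # #(w, c)
    p_trans = {k: v / ctx[k[0]] for k, v in cc.items()}
    p_emit = {k: v / tok[k[1]] for k, v in wc.items()}
    return p_trans, p_emit
```

On the corpus "a dog a cat" with a → C3 and dog, cat → C46, this gives P(C46 | C3) = 1 and P(dog | C46) = 0.5.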
max_C LL(θ, C)

max_C \sum_{i=1}^{n} [ log P(C(w_i) | C(w_{i-1})) + log P(w_i | C(w_i)) ]
  = n \sum_{c=1}^{k} \sum_{c'=1}^{k} p(c, c') log [ p(c, c') / ( p(c) p(c') ) ] + G

where G is a constant and

p(c, c') = #(c, c') / \sum_{c, c'} #(c, c'),    p(c) = #(c) / \sum_{c} #(c)

The term \sum_{c, c'} p(c, c') log [ p(c, c') / ( p(c) p(c') ) ] is the mutual information between adjacent clusters, so maximizing LL over C amounts to maximizing this mutual information.
Algorithm 1

- Start with |V| clusters: each word is in its own cluster.
- The goal is to get k clusters.
- Run |V| − k merge steps: at each step, pick the two clusters whose merge maximizes LL(θ, C), and merge them.

Cost: O(|V| − k) iterations × O(|V|^2) candidate pairs × O(|V|^2) to evaluate LL per pair = O(|V|^5). (This can be improved to O(|V|^3).)
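The merge loop can be sketched as below. This naive version re-scores every candidate merge from scratch using the mutual-information form of the objective derived earlier, so it has the unimproved O(|V|^5) flavor and is only suitable for toy vocabularies; `greedy_merge` is a hypothetical name.

```python
from collections import Counter
import math

def greedy_merge(words, k):
    """Algorithm 1 sketch: one cluster per word type, then repeatedly
    apply the merge that leaves the MI objective highest, until k
    clusters remain."""
    def mi(C):
        pairs = Counter((C[a], C[b]) for a, b in zip(words, words[1:]))
        m, uni = sum(pairs.values()), Counter(C[w] for w in words)
        n = sum(uni.values())
        return sum((v / m) * math.log((v / m) / ((uni[c] / n) * (uni[c2] / n)))
                   for (c, c2), v in pairs.items())

    C = {w: w for w in set(words)}  # each word starts in its own cluster
    while len(set(C.values())) > k:
        clusters = sorted(set(C.values()))
        # score every candidate pair merge and keep the best
        best = max(
            ((mi({w: (c1 if c == c2 else c) for w, c in C.items()}), c1, c2)
             for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]),
            key=lambda t: t[0])
        _, c1, c2 = best
        C = {w: (c1 if c == c2 else c) for w, c in C.items()}
    return C
```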
Algorithm 2

- m: a hyper-parameter. Sort words by frequency.
- Take the top m most frequent words and put each of them in its own cluster c_1, c_2, ..., c_m.
- For i = m + 1, ..., |V|:
  - Create a new cluster c_{m+1} for the i-th most frequent word (we now have m + 1 clusters).
  - Choose two of the m + 1 clusters based on LL(θ, C) and merge them, going back to m clusters.
- Carry out m − 1 final merges to obtain the full hierarchy.

Running time: O(|V| m^2 + n), where n is the number of words in the corpus.
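The windowed control flow can be sketched as follows. Assumptions worth flagging: the objective is scored only over words already assigned to a cluster, and the closing m − 1 merges that build the full hierarchy are left out; `window_cluster` is a hypothetical name and this sketch does not reproduce the O(|V| m^2 + n) bookkeeping of the real algorithm.

```python
from collections import Counter
import math

def window_cluster(words, m):
    """Algorithm 2 sketch: only m clusters stay active. Words enter in
    frequency order; each arrival gets a temporary (m+1)-th cluster,
    then the best-scoring pair of active clusters is merged."""
    def mi(C):
        # objective restricted to words already assigned to a cluster
        pairs = Counter((C[a], C[b]) for a, b in zip(words, words[1:])
                        if a in C and b in C)
        total = sum(pairs.values())
        uni = Counter(C[w] for w in words if w in C)
        n = sum(uni.values())
        if total == 0:
            return 0.0
        return sum((v / total) *
                   math.log((v / total) / ((uni[c] / n) * (uni[c2] / n)))
                   for (c, c2), v in pairs.items())

    by_freq = [w for w, _ in Counter(words).most_common()]
    C = {w: w for w in by_freq[:m]}  # top-m words, one cluster each
    merges = []                      # record of merges (for the hierarchy)
    for w in by_freq[m:]:
        C[w] = w                     # temporary (m+1)-th cluster
        clusters = sorted(set(C.values()))
        best = max(
            ((mi({x: (c1 if c == c2 else c) for x, c in C.items()}), c1, c2)
             for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]),
            key=lambda t: t[0])
        _, c1, c2 = best
        C = {x: (c1 if c == c2 else c) for x, c in C.items()}
        merges.append((c2, c1))
    return C, merges
```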
Example clusters (Brown et al., 1992)
Example hierarchy (Miller et al., 2004)
Quiz 1

- 30 min (9/20 Tue. 12:30pm-1:00pm)
- Fill-in-the-blank, true/false, short answer
- Closed book, closed notes, closed laptop

Sample questions:
- Add-one smoothing vs. add-lambda smoothing
- a = (1, 3, 5), b = (2, 3, 6): what is the cosine similarity between a and b?
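The last sample question can be checked in a few lines (a worked sketch, not part of the slides; `cosine` is an illustrative name):

```python
import math

def cosine(a, b):
    """cos(a, b) = (a . b) / (||a|| ||b||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

For a = (1, 3, 5) and b = (2, 3, 6): the dot product is 2 + 9 + 30 = 41, the norms are sqrt(35) and sqrt(49) = 7, so the cosine similarity is 41 / (7 sqrt(35)) ≈ 0.99.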