Lecture 8: Word Clustering
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
6501 Natural Language Processing
This lecture: Brown Clustering
Brown Clustering

Similar to a language model, but the basic unit is "word clusters."
Intuition: again, similar words appear in similar contexts.

Recap: bigram language models

$$P(w_0, w_1, w_2, \ldots, w_n) = P(w_1 \mid w_0)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

where $w_0$ is a dummy word representing the beginning of a sentence.
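A minimal sketch of this recap in Python; the toy corpus, the function names, and the "<s>" marker for the dummy word are illustrative assumptions, not from the lecture:

```python
# Bigram language model by relative frequency, as a hedged sketch.
from collections import defaultdict

def train_bigram(corpus):
    """Estimate P(w_i | w_{i-1}) by relative frequency.
    "<s>" plays the role of the dummy word w_0."""
    bigram = defaultdict(lambda: defaultdict(int))
    context = defaultdict(int)
    for sent in corpus:
        words = ["<s>"] + sent
        for prev, cur in zip(words, words[1:]):
            bigram[prev][cur] += 1
            context[prev] += 1
    return {p: {w: n / context[p] for w, n in nxt.items()}
            for p, nxt in bigram.items()}

def sentence_prob(model, sent):
    """P(w_1, ..., w_n) = prod_i P(w_i | w_{i-1})."""
    prob = 1.0
    words = ["<s>"] + sent
    for prev, cur in zip(words, words[1:]):
        prob *= model.get(prev, {}).get(cur, 0.0)  # unseen bigram -> 0
    return prob

corpus = [["a", "dog", "is", "chasing", "a", "cat"],
          ["the", "boy", "is", "following", "a", "rabbit"]]
model = train_bigram(corpus)
print(sentence_prob(model, ["a", "dog", "is", "chasing", "a", "cat"]))
```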
Motivation example

"a dog is chasing a cat"

$$P(w_0, \text{"a"}, \text{"dog"}, \ldots, \text{"cat"}) = P(\text{"a"} \mid w_0)\, P(\text{"dog"} \mid \text{"a"}) \cdots P(\text{"cat"} \mid \text{"a"})$$

Assume every word belongs to a cluster:

Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...
With every word assigned to a cluster, the sentence becomes a cluster sequence:

"a  dog is  chasing a  cat"
 C3 C46 C64 C8      C3 C46
"the boy is  following a  rabbit"
 C3  C46 C64 C8        C3 C46
"a  fox was chasing a  bird"
 C3 C46 C64 C8      C3 C46
Brown Clustering

Let C(w) denote the cluster that w belongs to.

"a  dog is  chasing a  cat"
 C3 C46 C64 C8      C3 C46

The model involves two kinds of probabilities: cluster transitions such as P(C("dog") | C("a")), and word emissions such as P("cat" | C("cat")).
Brown clustering model

P("a dog is chasing a cat")
= P(C("a") | C(w_0)) P(C("dog") | C("a")) P(C("is") | C("dog")) ...
  × P("a" | C("a")) P("dog" | C("dog")) ...
In general,

$$P(w_0, w_1, w_2, \ldots, w_n) = P(C(w_1) \mid C(w_0))\, P(C(w_2) \mid C(w_1)) \cdots P(C(w_n) \mid C(w_{n-1})) \cdot P(w_1 \mid C(w_1))\, P(w_2 \mid C(w_2)) \cdots P(w_n \mid C(w_n))$$

$$= \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$$
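The factorization can be made concrete with a short sketch. The cluster IDs follow the running example, but the probability values, the function name, and the "<s>" dummy word are made-up assumptions for illustration:

```python
# Sentence probability under the class-based (Brown) model: a sketch.
def brown_prob(sent, C, p_trans, p_emit):
    """P(w_1..w_n) = prod_i P(C(w_i) | C(w_{i-1})) * P(w_i | C(w_i)).
    C maps word -> cluster id; "<s>" stands in for the dummy word w_0."""
    prob = 1.0
    prev_c = C["<s>"]
    for w in sent:
        c = C[w]
        prob *= p_trans[(prev_c, c)] * p_emit[(w, c)]
        prev_c = c
    return prob

# Toy tables matching the running example (all values are made up):
C = {"<s>": 0, "a": 3, "the": 3, "dog": 46, "cat": 46,
     "is": 64, "was": 64, "chasing": 8, "following": 8}
p_trans = {(0, 3): 0.9, (3, 46): 0.8, (46, 64): 0.7, (64, 8): 0.6, (8, 3): 0.5}
p_emit = {("a", 3): 0.6, ("dog", 46): 0.2, ("is", 64): 0.7,
          ("chasing", 8): 0.4, ("cat", 46): 0.2}
print(brown_prob(["a", "dog", "is", "chasing", "a", "cat"], C, p_trans, p_emit))
```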
Model parameters

$$P(w_0, w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$$

Parameter set 1: the cluster transition probabilities P(C(w_i) | C(w_{i-1}))
Parameter set 2: the word emission probabilities P(w_i | C(w_i))
Parameter set 3: the cluster assignment C(w_i) itself
More precisely, the model consists of:

A vocabulary set W
A function C: W → {1, 2, 3, ..., k}, i.e., a partition of the vocabulary into k classes
A conditional probability P(c' | c) for each c, c' ∈ {1, ..., k}
A conditional probability P(w | c) for each c ∈ {1, ..., k} and each w with C(w) = c

θ represents the set of conditional probability parameters; C represents the clustering.
Log likelihood

$$LL(\theta, C) = \log P(w_0, w_1, w_2, \ldots, w_n \mid \theta, C) = \log \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$$

$$= \sum_{i=1}^{n} \left[ \log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i)) \right]$$

Maximizing LL(θ, C) can be done by alternately updating θ and C:
1. $\max_{\theta \in \Theta} LL(\theta, C)$
2. $\max_{C} LL(\theta, C)$
Step 1: $\max_{\theta \in \Theta} LL(\theta, C)$

With the clustering C fixed, the maximizing parameters are the relative-frequency estimates:

$$P(c' \mid c) = \frac{\#(c', c)}{\#(c)}, \qquad P(w \mid c) = \frac{\#(w, c)}{\#(c)}$$
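A sketch of these count-based updates in Python; the helper name `estimate_theta`, the toy corpus, and the cluster map are illustrative assumptions:

```python
# With C fixed, the maximizing theta is just relative frequencies.
from collections import Counter

def estimate_theta(corpus, C):
    """C maps word -> cluster id; "<s>" is the dummy word w_0."""
    trans, hist, emit, size = Counter(), Counter(), Counter(), Counter()
    for sent in corpus:
        prev_c = C["<s>"]
        for w in sent:
            c = C[w]
            trans[(prev_c, c)] += 1  # #(c', c): c followed by c'
            hist[prev_c] += 1        # #(c) as a conditioning history
            emit[(w, c)] += 1        # #(w, c)
            size[c] += 1             # #(c) as an emitting cluster
            prev_c = c
    p_trans = {(c, cp): n / hist[c] for (c, cp), n in trans.items()}  # P(c'|c)
    p_emit = {(w, c): n / size[c] for (w, c), n in emit.items()}      # P(w|c)
    return p_trans, p_emit

corpus = [["a", "dog", "is", "chasing", "a", "cat"]]
C = {"<s>": 0, "a": 3, "dog": 46, "cat": 46, "is": 64, "chasing": 8}
p_trans, p_emit = estimate_theta(corpus, C)
print(p_emit[("dog", 46)])  # 0.5: "dog" is one of two tokens in cluster 46
```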
Step 2: $\max_{C} LL(\theta, C)$

$$\max_C \sum_{i=1}^{n} \left[ \log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i)) \right] = n \sum_{c=1}^{k} \sum_{c'=1}^{k} p(c, c') \log \frac{p(c, c')}{p(c)\, p(c')} + G$$

where G is a constant. Here,

$$p(c, c') = \frac{\#(c, c')}{\sum_{c, c'} \#(c, c')}, \qquad p(c) = \frac{\#(c)}{\sum_{c} \#(c)}$$

Since

$$\frac{p(c, c')}{p(c)\, p(c')} = \frac{p(c \mid c')}{p(c)}$$

the double sum is the mutual information between the clusters of adjacent words.
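A sketch of this quality function, computing mutual information from cluster-bigram counts; the counts below are illustrative. Note the sketch keeps separate left and right marginals, which on a long corpus both coincide with the slide's single marginal p(c):

```python
# Mutual information of the cluster-bigram distribution: a sketch.
import math
from collections import Counter

def mutual_information(trans):
    """`trans` maps (c, c') -> #(c, c'), e.g. the counter from the
    previous sketch."""
    total = sum(trans.values())
    p_left, p_right = Counter(), Counter()
    for (c, cp), n in trans.items():
        p_left[c] += n / total
        p_right[cp] += n / total
    return sum((n / total) * math.log((n / total) / (p_left[c] * p_right[cp]))
               for (c, cp), n in trans.items())

# Toy cluster-bigram counts in the spirit of the running example:
trans = Counter({(3, 46): 3, (46, 64): 2, (64, 8): 2, (8, 3): 2})
print(mutual_information(trans))
```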
Algorithm 1

Start with |V| clusters: each word is in its own cluster.
The goal is to get down to k clusters, so we run |V| - k merge steps:
  Pick 2 clusters and merge them.
  At each step, pick the merge that maximizes LL(θ, C).

Cost? O(|V| - k) iterations × O(|V|^2) candidate pairs × O(|V|^2) to compute LL = O(|V|^5).
(This can be improved to O(|V|^3).)
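A direct, deliberately naive rendering of Algorithm 1; the function names and toy corpus are assumptions. The objective is recomputed from scratch for every candidate merge, which is exactly where the O(|V|^5) cost comes from:

```python
# Naive greedy agglomerative clustering (Algorithm 1): a sketch.
import math
from collections import Counter
from itertools import combinations

def quality(corpus, C):
    """Mutual information of adjacent clusters, recomputed from scratch:
    this is the expensive inner step."""
    trans = Counter()
    for sent in corpus:
        for prev, cur in zip(sent, sent[1:]):
            trans[(C[prev], C[cur])] += 1
    total = sum(trans.values())
    left, right = Counter(), Counter()
    for (c, cp), n in trans.items():
        left[c] += n / total
        right[cp] += n / total
    return sum((n / total) * math.log((n / total) / (left[c] * right[cp]))
               for (c, cp), n in trans.items())

def naive_brown(corpus, k):
    vocab = {w for sent in corpus for w in sent}
    C = {w: i for i, w in enumerate(sorted(vocab))}  # one cluster per word
    while len(set(C.values())) > k:
        best = None
        for a, b in combinations(sorted(set(C.values())), 2):
            trial = {w: (a if c == b else c) for w, c in C.items()}
            q = quality(corpus, trial)
            if best is None or q > best[0]:
                best = (q, trial)
        C = best[1]  # keep the merge that leaves the objective highest
    return C

corpus = [["a", "dog", "is", "chasing", "a", "cat"],
          ["the", "boy", "was", "following", "a", "bird"]]
print(naive_brown(corpus, 4))
```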
Algorithm 2

m: a hyper-parameter. Sort words by frequency.
Take the top m most frequent words and put each of them in its own cluster c_1, c_2, c_3, ..., c_m.
For i = m+1, ..., |V|:
  Create a new cluster c_{m+1} for the i-th most frequent word (we now have m+1 clusters).
  Choose two of the m+1 clusters based on LL(θ, C) and merge them ⇒ back to m clusters.
Carry out (m - 1) final merges ⇒ full hierarchy.

Running time: O(|V| m^2 + n), where n = #words in the corpus.
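A hedged sketch of Algorithm 2 under the same toy objective; all names are assumptions. Unlike the real algorithm, it recomputes the objective from scratch rather than updating it incrementally, so it does not achieve the O(|V| m^2 + n) bound, and the (m - 1) final merges that produce the hierarchy are omitted:

```python
# Algorithm 2 with at most m active clusters: a sketch.
import math
from collections import Counter
from itertools import combinations

def partial_quality(corpus, C):
    """Adjacent-cluster mutual information, ignoring words that have
    not been assigned a cluster yet."""
    trans = Counter()
    for sent in corpus:
        for prev, cur in zip(sent, sent[1:]):
            if prev in C and cur in C:
                trans[(C[prev], C[cur])] += 1
    total = sum(trans.values())
    if total == 0:
        return 0.0
    left, right = Counter(), Counter()
    for (c, cp), n in trans.items():
        left[c] += n / total
        right[cp] += n / total
    return sum((n / total) * math.log((n / total) / (left[c] * right[cp]))
               for (c, cp), n in trans.items())

def algorithm2(corpus, m):
    freq = Counter(w for sent in corpus for w in sent)
    words = [w for w, _ in freq.most_common()]
    C = {w: i for i, w in enumerate(words[:m])}  # top-m words: own clusters
    for i in range(m, len(words)):
        C[words[i]] = i  # temporary (m+1)-th cluster
        best = None      # merge the pair that keeps the objective highest
        for a, b in combinations(sorted(set(C.values())), 2):
            trial = {w: (a if c == b else c) for w, c in C.items()}
            q = partial_quality(corpus, trial)
            if best is None or q > best[0]:
                best = (q, trial)
        C = best[1]
    return C  # the (m-1) final merges for the hierarchy are omitted

corpus = [["a", "dog", "is", "chasing", "a", "cat"],
          ["the", "boy", "was", "following", "a", "bird"]]
print(algorithm2(corpus, 3))
```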
Example clusters (Brown et al., 1992) [figure: example word clusters]
Example hierarchy (Miller et al., 2004) [figure: example cluster hierarchy]
Quiz 1

30 min (9/20 Tue. 12:30pm-1:00pm)
Fill-in-the-blank, true/false, and short-answer questions
Closed book, closed notes, closed laptop

Sample questions:
  Add-one smoothing vs. add-lambda smoothing
  Given a = (1, 3, 5) and b = (2, 3, 6), what is the cosine similarity between a and b?
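For the second sample question, a quick worked check; the numbers follow directly from the definition cos(a, b) = (a · b) / (|a| |b|):

```python
# Worked cosine-similarity computation for the sample question.
import math

a, b = [1, 3, 5], [2, 3, 6]
dot = sum(x * y for x, y in zip(a, b))  # 1*2 + 3*3 + 5*6 = 41
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(dot / norm)  # 41 / (sqrt(35) * 7) ≈ 0.9900
```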