Lecture 8: Word Clustering

1 Lecture 8: Word Clustering
Kai-Wei Chang, University of Virginia. Course webpage:
CS 6501: Natural Language Processing

2 This lecture
Brown Clustering

3 Brown Clustering
Similar to a language model, but the basic unit is the "word cluster". Intuition: again, similar words appear in similar contexts.
Recap: bigram language models
$P(w_0, w_1, w_2, \dots, w_n) = P(w_1 \mid w_0)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$
where $w_0$ is a dummy word representing the beginning of a sentence.
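To make the recap concrete, here is a minimal sketch (not from the lecture) that estimates a bigram model from raw counts and scores a sentence; the toy corpus and the `<s>` token standing in for $w_0$ are made-up assumptions.

```python
from collections import Counter

# Toy corpus; "<s>" plays the role of the dummy word w_0. (Hypothetical data.)
corpus = [["<s>", "a", "dog", "is", "chasing", "a", "cat"],
          ["<s>", "the", "boy", "is", "following", "a", "rabbit"]]

bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def bigram_prob(sentence):
    """P(w_1, ..., w_n | w_0) = prod_i P(w_i | w_{i-1}), with MLE estimates."""
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigram_counts[(prev, cur)] / context_counts[prev]
    return p

print(bigram_prob(["<s>", "a", "dog", "is", "chasing", "a", "cat"]))  # 1/36
```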

4 Motivation example
"a dog is chasing a cat"
$P(w_0, \text{"a"}, \text{"dog"}, \dots, \text{"cat"}) = P(\text{"a"} \mid w_0)\, P(\text{"dog"} \mid \text{"a"}) \cdots P(\text{"cat"} \mid \text{"a"})$
Assume every word belongs to a cluster:
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

5 Motivation example
Assume every word belongs to a cluster.
"a dog is chasing a cat" ⇒ C3 C46 C64 C8 C3 C46
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

6 Motivation example
Assume every word belongs to a cluster.
C3  C46  C64  C8       C3  C46
a   dog  is   chasing  a   cat
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

7 Motivation example
Assume every word belongs to a cluster.
C3   C46  C64  C8         C3  C46
the  boy  is   following  a   rabbit
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

8 Motivation example
Assume every word belongs to a cluster.
C3  C46  C64  C8       C3  C46
a   fox  was  chasing  a   bird
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

9 Brown Clustering
Let $C(w)$ denote the cluster that $w$ belongs to.
C3  C46  C64  C8       C3  C46
a   dog  is   chasing  a   cat
e.g., the cluster transition $P(C(\text{dog}) \mid C(\text{a}))$ and the word emission $P(\text{cat} \mid C(\text{cat}))$.
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

10 Brown clustering model
$P(\text{"a dog is chasing a cat"}) = P(C(\text{a}) \mid C_0)\, P(C(\text{dog}) \mid C(\text{a}))\, P(C(\text{is}) \mid C(\text{dog})) \cdots \times P(\text{a} \mid C(\text{a}))\, P(\text{dog} \mid C(\text{dog})) \cdots$
C3  C46  C64  C8       C3  C46
a   dog  is   chasing  a   cat
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

11 Brown clustering model
$P(\text{"a dog is chasing a cat"}) = P(C(\text{a}) \mid C_0)\, P(C(\text{dog}) \mid C(\text{a}))\, P(C(\text{is}) \mid C(\text{dog})) \cdots \times P(\text{a} \mid C(\text{a}))\, P(\text{dog} \mid C(\text{dog})) \cdots$
In general:
$P(w_0, w_1, w_2, \dots, w_n) = P(C(w_1) \mid C(w_0))\, P(C(w_2) \mid C(w_1)) \cdots P(C(w_n) \mid C(w_{n-1})) \times P(w_1 \mid C(w_1))\, P(w_2 \mid C(w_2)) \cdots P(w_n \mid C(w_n)) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
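A minimal sketch of this class-based sentence probability; the cluster map and probability tables below are made-up numbers for illustration, not estimates from data.

```python
# Hypothetical cluster assignments; cluster 0 stands in for C_0 = C(w_0).
C = {"<s>": 0, "a": 3, "the": 3, "dog": 46, "cat": 46,
     "is": 64, "was": 64, "chasing": 8, "following": 8}

trans = {(0, 3): 0.5, (3, 46): 0.6, (46, 64): 0.4,      # P(c' | c), made up
         (64, 8): 0.3, (8, 3): 0.7}
emit = {"a": 0.6, "the": 0.4, "dog": 0.2, "cat": 0.2,   # P(w | C(w)), made up
        "is": 0.7, "chasing": 0.3}

def sentence_prob(words, w0="<s>"):
    """prod_i P(C(w_i) | C(w_{i-1})) * P(w_i | C(w_i))"""
    p, prev = 1.0, C[w0]
    for w in words:
        p *= trans[(prev, C[w])] * emit[w]
        prev = C[w]
    return p

print(sentence_prob(["a", "dog", "is", "chasing", "a", "cat"]))
```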

12 Model parameters
$P(w_0, w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
Parameter set 1: cluster transitions $P(C(w_i) \mid C(w_{i-1}))$
Parameter set 2: word emissions $P(w_i \mid C(w_i))$
Parameter set 3: cluster assignments $C(w_i)$
C3  C46  C64  C8       C3  C46
a   dog  is   chasing  a   cat
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

13 Model parameters
$P(w_0, w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
A vocabulary set $W$
A function $C: W \to \{1, 2, 3, \dots, k\}$, i.e., a partition of the vocabulary into $k$ classes
Conditional probabilities $P(c' \mid c)$ for $c, c' \in \{1, \dots, k\}$
Conditional probabilities $P(w \mid c)$ for $c \in \{1, \dots, k\}$, $w \in c$
$\theta$ represents the set of conditional probability parameters; $C$ represents the clustering.
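One hypothetical way to hold these parameters in code (the names are mine, not the lecture's):

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class BrownParams:
    C: Dict[str, int]                    # cluster assignment, C: W -> {1, ..., k}
    trans: Dict[Tuple[int, int], float]  # theta, part 1: P(c' | c)
    emit: Dict[str, float]               # theta, part 2: P(w | C(w))
```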

14 Log likelihood
$LL(\theta, C) = \log P(w_0, w_1, w_2, \dots, w_n \mid \theta, C) = \log \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i)) = \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))]$
Maximizing $LL(\theta, C)$ can be done by alternately updating $\theta$ and $C$:
1. $\max_{\theta \in \Theta} LL(\theta, C)$
2. $\max_{C} LL(\theta, C)$

15 $\max_{\theta \in \Theta} LL(\theta, C)$
$LL(\theta, C) = \log P(w_0, w_1, w_2, \dots, w_n \mid \theta, C) = \log \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i)) = \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))]$
With $C$ fixed, the maximizing parameters are the relative-frequency estimates:
$P(c' \mid c) = \frac{\#(c', c)}{\#(c)}, \qquad P(w \mid c) = \frac{\#(w, c)}{\#(c)}$

16 $\max_{C} LL(\theta, C)$
$\max_{C} \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))] = n \sum_{c=1}^{k} \sum_{c'=1}^{k} p(c, c') \log \frac{p(c, c')}{p(c)\, p(c')} + G$
where $G$ is a constant. Here,
$p(c, c') = \frac{\#(c, c')}{\sum_{c, c'} \#(c, c')}, \qquad p(c) = \frac{\#(c)}{\sum_{c} \#(c)}$
Note that $\frac{p(c, c')}{p(c)\, p(c')} = \frac{p(c \mid c')}{p(c)}$, so the double sum is the mutual information between adjacent clusters.

17 $\max_{C} LL(\theta, C)$
$\max_{C} \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))] = n \sum_{c=1}^{k} \sum_{c'=1}^{k} p(c, c') \log \frac{p(c, c')}{p(c)\, p(c')} + G$

18 Algorithm 1
Start with $|V|$ clusters: each word is in its own cluster. The goal is to get down to $k$ clusters, so we run $|V| - k$ merge steps:
Pick 2 clusters and merge them; at each step, pick the merge that maximizes $LL(\theta, C)$.
Cost? #iterations × #pairs × cost of computing LL = $O(|V| - k) \cdot O(|V|^2) \cdot O(|V|^2) = O(|V|^5)$ (can be improved to $O(|V|^3)$). A naive sketch follows.
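A deliberately naive sketch of Algorithm 1. The `quality` argument stands in for evaluating $LL(\theta, C)$ (e.g., the MI objective above); re-scoring every candidate merge from scratch is what yields the $O(|V|^5)$ cost.

```python
def brown_naive(vocab, k, quality):
    """Agglomerative Brown clustering, naive version.
    quality(clusters): returns LL(theta, C) for a candidate clustering."""
    clusters = [{w} for w in vocab]      # start: one cluster per word
    while len(clusters) > k:             # |V| - k merge steps
        best_score, best_clusters = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):        # O(|V|^2) pairs
                candidate = (clusters[:i] + clusters[i+1:j] + clusters[j+1:]
                             + [clusters[i] | clusters[j]])
                score = quality(candidate)               # O(|V|^2) per call
                if best_score is None or score > best_score:
                    best_score, best_clusters = score, candidate
        clusters = best_clusters
    return clusters
```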

19 Algorithm 2
$m$: a hyper-parameter. Sort words by frequency.
Take the top $m$ most frequent words and put each of them in its own cluster $c_1, c_2, c_3, \dots, c_m$.
For $i = m+1, \dots, |V|$:
Create a new cluster $c_{m+1}$ (we now have $m+1$ clusters).
Choose two of the $m+1$ clusters based on $LL(\theta, C)$ and merge them ⇒ back to $m$ clusters.
Carry out $m-1$ final merges ⇒ full hierarchy.
Running time: $O(|V| m^2 + n)$, where $n$ = #words in the corpus. (See the sketch below.)
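A sketch of Algorithm 2 under the same conventions; `best_merge` is a hypothetical helper that returns the pair of cluster indices $(i, j)$, with $i < j$, whose merge gives the highest $LL(\theta, C)$.

```python
def brown_frequency_based(words_by_freq, m, best_merge):
    """words_by_freq: vocabulary sorted by corpus frequency, descending.
    Returns the merge history, which encodes the binary hierarchy."""
    clusters = [{w} for w in words_by_freq[:m]]   # top m words, one cluster each
    history = []

    def merge_once():
        i, j = best_merge(clusters)               # assumes i < j
        history.append((frozenset(clusters[i]), frozenset(clusters[j])))
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]

    for w in words_by_freq[m:]:
        clusters.append({w})                      # m+1 clusters...
        merge_once()                              # ...merged back down to m
    while len(clusters) > 1:                      # m-1 final merges
        merge_once()
    return history
```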

20 Example clusters (Brown et al., 1992)
[Figure: example word clusters from Brown et al. (1992).]

21 Example hierarchy (Miller et al., 2004)
[Figure: example cluster hierarchy from Miller et al. (2004).]

22 Quiz 1
30 min (9/20 Tue., 12:30pm-1:00pm)
Fill-in-the-blank, true/false, short answer
Closed book, closed notes, closed laptop
Sample questions:
Add-one smoothing vs. add-lambda smoothing
$a = (1, 3, 5)$, $b = (2, 3, 6)$: what is the cosine similarity between $a$ and $b$?
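For the cosine-similarity sample question, a quick self-check (a worked sketch, not an official answer key):

```python
import math

def cosine(a, b):
    """cos(a, b) = (a . b) / (|a| |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# a . b = 1*2 + 3*3 + 5*6 = 41; |a| = sqrt(35); |b| = sqrt(49) = 7
print(cosine([1, 3, 5], [2, 3, 6]))  # 41 / (7 * sqrt(35)) ~= 0.990
```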
