Lecture 8: Word Clustering

1 Lecture 8: Word Clustering
Kai-Wei Chang, University of Virginia. Course webpage:
CS 6501: Natural Language Processing

2 This lecture
Brown Clustering

3 Brown Clustering
Similar to a language model, but the basic unit is the "word cluster". Intuition: again, similar words appear in similar contexts.
Recap: bigram language models
$P(w_0, w_1, w_2, \dots, w_n) = P(w_1 \mid w_0)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$
where $w_0$ is a dummy word representing the beginning of a sentence.
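To make the recap concrete, here is a minimal sketch (not from the lecture) that estimates a bigram model from raw counts and scores a sentence; the toy corpus and the `<s>` token standing in for $w_0$ are made-up assumptions.

```python
from collections import Counter

# Toy corpus; "<s>" plays the role of the dummy word w_0. (Hypothetical data.)
corpus = [["<s>", "a", "dog", "is", "chasing", "a", "cat"],
          ["<s>", "the", "boy", "is", "following", "a", "rabbit"]]

bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def bigram_prob(sentence):
    """P(w_1, ..., w_n | w_0) = prod_i P(w_i | w_{i-1}), with MLE estimates."""
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigram_counts[(prev, cur)] / context_counts[prev]
    return p

print(bigram_prob(["<s>", "a", "dog", "is", "chasing", "a", "cat"]))  # 1/36
```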

4 Motivation example
"a dog is chasing a cat"
$P(w_0, \text{"a"}, \text{"dog"}, \dots, \text{"cat"}) = P(\text{"a"} \mid w_0)\, P(\text{"dog"} \mid \text{"a"}) \cdots P(\text{"cat"} \mid \text{"a"})$
Assume every word belongs to a cluster:
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

5 Motivation example
Assume every word belongs to a cluster.
"a dog is chasing a cat" ⇒ C3 C46 C64 C8 C3 C46
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

6 Motivation example
Assume every word belongs to a cluster.
C3  C46  C64  C8       C3  C46
a   dog  is   chasing  a   cat
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

7 Motivation example
Assume every word belongs to a cluster.
C3   C46  C64  C8         C3  C46
the  boy  is   following  a   rabbit
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

8 Motivation example
Assume every word belongs to a cluster.
C3  C46  C64  C8       C3  C46
a   fox  was  chasing  a   bird
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

9 Brown Clustering
Let $C(w)$ denote the cluster that $w$ belongs to.
C3  C46  C64  C8       C3  C46
a   dog  is   chasing  a   cat
e.g., the cluster transition $P(C(\text{dog}) \mid C(\text{a}))$ and the word emission $P(\text{cat} \mid C(\text{cat}))$.
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

10 Brown clustering model
$P(\text{"a dog is chasing a cat"}) = P(C(\text{a}) \mid C_0)\, P(C(\text{dog}) \mid C(\text{a}))\, P(C(\text{is}) \mid C(\text{dog})) \cdots \times P(\text{a} \mid C(\text{a}))\, P(\text{dog} \mid C(\text{dog})) \cdots$
C3  C46  C64  C8       C3  C46
a   dog  is   chasing  a   cat
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

11 Brown clustering model
$P(\text{"a dog is chasing a cat"}) = P(C(\text{a}) \mid C_0)\, P(C(\text{dog}) \mid C(\text{a}))\, P(C(\text{is}) \mid C(\text{dog})) \cdots \times P(\text{a} \mid C(\text{a}))\, P(\text{dog} \mid C(\text{dog})) \cdots$
In general:
$P(w_0, w_1, w_2, \dots, w_n) = P(C(w_1) \mid C(w_0))\, P(C(w_2) \mid C(w_1)) \cdots P(C(w_n) \mid C(w_{n-1})) \times P(w_1 \mid C(w_1))\, P(w_2 \mid C(w_2)) \cdots P(w_n \mid C(w_n)) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
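A minimal sketch of this class-based sentence probability; the cluster map and probability tables below are made-up numbers for illustration, not estimates from data.

```python
# Hypothetical cluster assignments; cluster 0 stands in for C_0 = C(w_0).
C = {"<s>": 0, "a": 3, "the": 3, "dog": 46, "cat": 46,
     "is": 64, "was": 64, "chasing": 8, "following": 8}

trans = {(0, 3): 0.5, (3, 46): 0.6, (46, 64): 0.4,      # P(c' | c), made up
         (64, 8): 0.3, (8, 3): 0.7}
emit = {"a": 0.6, "the": 0.4, "dog": 0.2, "cat": 0.2,   # P(w | C(w)), made up
        "is": 0.7, "chasing": 0.3}

def sentence_prob(words, w0="<s>"):
    """prod_i P(C(w_i) | C(w_{i-1})) * P(w_i | C(w_i))"""
    p, prev = 1.0, C[w0]
    for w in words:
        p *= trans[(prev, C[w])] * emit[w]
        prev = C[w]
    return p

print(sentence_prob(["a", "dog", "is", "chasing", "a", "cat"]))
```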

12 Model parameters
$P(w_0, w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
Parameter set 1: cluster transitions $P(C(w_i) \mid C(w_{i-1}))$
Parameter set 2: word emissions $P(w_i \mid C(w_i))$
Parameter set 3: cluster assignments $C(w_i)$
C3  C46  C64  C8       C3  C46
a   dog  is   chasing  a   cat
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

13 Model parameters
$P(w_0, w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
A vocabulary set $W$
A function $C: W \to \{1, 2, 3, \dots, k\}$, i.e., a partition of the vocabulary into $k$ classes
Conditional probabilities $P(c' \mid c)$ for $c, c' \in \{1, \dots, k\}$
Conditional probabilities $P(w \mid c)$ for $c \in \{1, \dots, k\}$, $w \in c$
$\theta$ represents the set of conditional probability parameters; $C$ represents the clustering.
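One hypothetical way to hold these parameters in code (the names are mine, not the lecture's):

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class BrownParams:
    C: Dict[str, int]                    # cluster assignment, C: W -> {1, ..., k}
    trans: Dict[Tuple[int, int], float]  # theta, part 1: P(c' | c)
    emit: Dict[str, float]               # theta, part 2: P(w | C(w))
```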

14 Log likelihood
$LL(\theta, C) = \log P(w_0, w_1, w_2, \dots, w_n \mid \theta, C) = \log \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i)) = \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))]$
Maximizing $LL(\theta, C)$ can be done by alternately updating $\theta$ and $C$:
1. $\max_{\theta \in \Theta} LL(\theta, C)$
2. $\max_{C} LL(\theta, C)$

15 $\max_{\theta \in \Theta} LL(\theta, C)$
$LL(\theta, C) = \log P(w_0, w_1, w_2, \dots, w_n \mid \theta, C) = \log \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i)) = \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))]$
With $C$ fixed, the maximizing parameters are the relative-frequency estimates:
$P(c' \mid c) = \frac{\#(c', c)}{\#(c)}, \qquad P(w \mid c) = \frac{\#(w, c)}{\#(c)}$

16 $\max_{C} LL(\theta, C)$
$\max_{C} \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))] = n \sum_{c=1}^{k} \sum_{c'=1}^{k} p(c, c') \log \frac{p(c, c')}{p(c)\, p(c')} + G$
where $G$ is a constant. Here,
$p(c, c') = \frac{\#(c, c')}{\sum_{c, c'} \#(c, c')}, \qquad p(c) = \frac{\#(c)}{\sum_{c} \#(c)}$
Note that $\frac{p(c, c')}{p(c)\, p(c')} = \frac{p(c \mid c')}{p(c)}$, so the double sum is the mutual information between adjacent clusters.

17 $\max_{C} LL(\theta, C)$
$\max_{C} \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))] = n \sum_{c=1}^{k} \sum_{c'=1}^{k} p(c, c') \log \frac{p(c, c')}{p(c)\, p(c')} + G$

18 Algorithm 1
Start with $|V|$ clusters: each word is in its own cluster. The goal is to get down to $k$ clusters, so we run $|V| - k$ merge steps:
Pick 2 clusters and merge them; at each step, pick the merge that maximizes $LL(\theta, C)$.
Cost? #iterations × #pairs × cost of computing LL = $O(|V| - k) \cdot O(|V|^2) \cdot O(|V|^2) = O(|V|^5)$ (can be improved to $O(|V|^3)$). A naive sketch follows.
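A deliberately naive sketch of Algorithm 1. The `quality` argument stands in for evaluating $LL(\theta, C)$ (e.g., the MI objective above); re-scoring every candidate merge from scratch is what yields the $O(|V|^5)$ cost.

```python
def brown_naive(vocab, k, quality):
    """Agglomerative Brown clustering, naive version.
    quality(clusters): returns LL(theta, C) for a candidate clustering."""
    clusters = [{w} for w in vocab]      # start: one cluster per word
    while len(clusters) > k:             # |V| - k merge steps
        best_score, best_clusters = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):        # O(|V|^2) pairs
                candidate = (clusters[:i] + clusters[i+1:j] + clusters[j+1:]
                             + [clusters[i] | clusters[j]])
                score = quality(candidate)               # O(|V|^2) per call
                if best_score is None or score > best_score:
                    best_score, best_clusters = score, candidate
        clusters = best_clusters
    return clusters
```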

19 Algorithm 2
$m$: a hyper-parameter. Sort words by frequency.
Take the top $m$ most frequent words and put each of them in its own cluster $c_1, c_2, c_3, \dots, c_m$.
For $i = m+1, \dots, |V|$:
Create a new cluster $c_{m+1}$ (we now have $m+1$ clusters).
Choose two of the $m+1$ clusters based on $LL(\theta, C)$ and merge them ⇒ back to $m$ clusters.
Carry out $m-1$ final merges ⇒ full hierarchy.
Running time: $O(|V| m^2 + n)$, where $n$ = #words in the corpus. (See the sketch below.)
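A sketch of Algorithm 2 under the same conventions; `best_merge` is a hypothetical helper that returns the pair of cluster indices $(i, j)$, with $i < j$, whose merge gives the highest $LL(\theta, C)$.

```python
def brown_frequency_based(words_by_freq, m, best_merge):
    """words_by_freq: vocabulary sorted by corpus frequency, descending.
    Returns the merge history, which encodes the binary hierarchy."""
    clusters = [{w} for w in words_by_freq[:m]]   # top m words, one cluster each
    history = []

    def merge_once():
        i, j = best_merge(clusters)               # assumes i < j
        history.append((frozenset(clusters[i]), frozenset(clusters[j])))
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]

    for w in words_by_freq[m:]:
        clusters.append({w})                      # m+1 clusters...
        merge_once()                              # ...merged back down to m
    while len(clusters) > 1:                      # m-1 final merges
        merge_once()
    return history
```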

20 Example clusters (Brown et al., 1992)
[Figure: example word clusters from Brown et al. (1992).]

21 Example hierarchy (Miller et al., 2004)
[Figure: example cluster hierarchy from Miller et al. (2004).]

22 Quiz 1
30 min (9/20 Tue., 12:30pm-1:00pm)
Fill-in-the-blank, true/false, short answer
Closed book, closed notes, closed laptop
Sample questions:
Add-one smoothing vs. add-lambda smoothing
$a = (1, 3, 5)$, $b = (2, 3, 6)$: what is the cosine similarity between $a$ and $b$?
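For the cosine-similarity sample question, a quick self-check (a worked sketch, not an official answer key):

```python
import math

def cosine(a, b):
    """cos(a, b) = (a . b) / (|a| |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# a . b = 1*2 + 3*3 + 5*6 = 41; |a| = sqrt(35); |b| = sqrt(49) = 7
print(cosine([1, 3, 5], [2, 3, 6]))  # 41 / (7 * sqrt(35)) ~= 0.990
```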
