Lecture 8: Word Clustering

Lecture 8: Word Clustering
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
6501 Natural Language Processing

This lecture: Brown clustering

Brown Clustering
Similar to a language model, but the basic unit is a "word cluster".
Intuition: again, similar words appear in similar contexts.
Recap: bigram language models
$P(w_0, w_1, w_2, \ldots, w_n) = P(w_1 \mid w_0)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$,
where $w_0$ is a dummy word representing the beginning of a sentence.
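A minimal sketch of this bigram recap, assuming maximum-likelihood estimates from raw counts; the tiny two-sentence corpus and the `<s>` start symbol are illustrative assumptions, not from the lecture:

```python
from collections import Counter

# Toy corpus; <s> plays the role of the dummy start word w_0.
corpus = [
    ["<s>", "a", "dog", "is", "chasing", "a", "cat"],
    ["<s>", "the", "boy", "is", "following", "a", "rabbit"],
]

bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def sentence_prob(sentence):
    """P(w_1..w_n) = prod_i P(w_i | w_{i-1}), with P estimated by relative counts."""
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigram_counts[(prev, cur)] / context_counts[prev]
    return p

print(sentence_prob(["<s>", "a", "dog", "is", "chasing", "a", "cat"]))  # ~0.028
```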

Motivation example
"a dog is chasing a cat"
$P(w_0, \text{"a"}, \text{"dog"}, \ldots, \text{"cat"}) = P(\text{"a"} \mid w_0)\, P(\text{"dog"} \mid \text{"a"}) \cdots P(\text{"cat"} \mid \text{"a"})$
Assume every word belongs to a cluster:
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

Motivation example
Assume every word belongs to a cluster.
"a dog is chasing a cat" → C3 C46 C64 C8 C3 C46
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...
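A tiny sketch of this word-to-cluster mapping; the cluster IDs and word lists are the lecture's toy example, while the dictionary encoding is mine:

```python
# Toy cluster assignment from the running example.
C = {
    "a": 3, "the": 3,
    "dog": 46, "cat": 46, "fox": 46, "rabbit": 46, "bird": 46, "boy": 46,
    "is": 64, "was": 64,
    "chasing": 8, "following": 8, "biting": 8,
}

sentence = "a dog is chasing a cat".split()
print([f"C{C[w]}" for w in sentence])   # ['C3', 'C46', 'C64', 'C8', 'C3', 'C46']
```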

Motivation example
Assume every word belongs to a cluster.
"the boy is following a rabbit" → C3 C46 C64 C8 C3 C46
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

Motivation example
Assume every word belongs to a cluster.
"a fox was chasing a bird" → C3 C46 C64 C8 C3 C46
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

Brown Clustering
Let C(w) denote the cluster that w belongs to.
"a dog is chasing a cat" → C3 C46 C64 C8 C3 C46
with probabilities such as P(C(dog) | C(a)) and P(cat | C(cat))
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

Brown clustering model
P("a dog is chasing a cat")
= P(C("a") | C_0) P(C("dog") | C("a")) P(C("is") | C("dog")) ...
  × P("a" | C("a")) P("dog" | C("dog")) ...
"a dog is chasing a cat" → C3 C46 C64 C8 C3 C46
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

Brown clustering model
P("a dog is chasing a cat")
= P(C("a") | C_0) P(C("dog") | C("a")) P(C("is") | C("dog")) ...
  × P("a" | C("a")) P("dog" | C("dog")) ...
In general:
$P(w_0, w_1, w_2, \ldots, w_n) = P(C(w_1) \mid C(w_0))\, P(C(w_2) \mid C(w_1)) \cdots P(C(w_n) \mid C(w_{n-1}))\; P(w_1 \mid C(w_1))\, P(w_2 \mid C(w_2)) \cdots P(w_n \mid C(w_n)) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
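A minimal sketch of this factorization, assuming the cluster map, transition table P(c'|c), and emission table P(w|c) are already given; all probability values below are made up purely for illustration:

```python
import math

# Toy parameters: cluster assignments, cluster transitions, word emissions.
# The numbers are invented; only the structure matters here.
C = {"<s>": 0, "a": 3, "dog": 46, "is": 64, "chasing": 8, "cat": 46}
trans = {(0, 3): 0.5, (3, 46): 0.4, (46, 64): 0.3, (64, 8): 0.2, (8, 3): 0.5}
emit = {("a", 3): 0.6, ("dog", 46): 0.1, ("is", 64): 0.7,
        ("chasing", 8): 0.2, ("cat", 46): 0.1}

def log_prob(sentence):
    """log P(w_1..w_n) = sum_i [log P(C(w_i) | C(w_{i-1})) + log P(w_i | C(w_i))]."""
    words = ["<s>"] + sentence
    lp = 0.0
    for prev, cur in zip(words, words[1:]):
        lp += math.log(trans[(C[prev], C[cur])]) + math.log(emit[(cur, C[cur])])
    return lp

print(log_prob("a dog is chasing a cat".split()))
```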

Model parameters
$P(w_0, w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
Parameter set 1: $P(C(w_i) \mid C(w_{i-1}))$
Parameter set 2: $P(w_i \mid C(w_i))$
Parameter set 3: $C(w_i)$
"a dog is chasing a cat" → C3 C46 C64 C8 C3 C46
Cluster 3: a, the
Cluster 46: dog, cat, fox, rabbit, bird, boy
Cluster 64: is, was
Cluster 8: chasing, following, biting, ...

Model parameters
$P(w_0, w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
A vocabulary set W
A function $C: W \to \{1, 2, 3, \ldots, k\}$, i.e., a partition of the vocabulary into k classes
Conditional probabilities $P(c' \mid c)$ for $c, c' \in \{1, \ldots, k\}$
Conditional probabilities $P(w \mid c)$ for $c \in \{1, \ldots, k\}$, $w \in c$
θ represents the set of conditional probability parameters; C represents the clustering.
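One way these components could be held together in code (the container and field names are my own, not part of the lecture):

```python
from dataclasses import dataclass

@dataclass
class BrownClusteringModel:
    C: dict      # word -> cluster id in {1, ..., k} (the partition)
    trans: dict  # (c, c') -> P(c' | c)   (theta, part 1)
    emit: dict   # (w, c)  -> P(w | c)    (theta, part 2)
```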

Log likelihood
$LL(\theta, C) = \log P(w_0, w_1, w_2, \ldots, w_n \mid \theta, C) = \log \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i)) = \sum_{i=1}^{n} \left[ \log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i)) \right]$
Maximizing LL(θ, C) can be done by alternately updating θ and C:
$\max_{\theta \in \Theta} LL(\theta, C)$
$\max_{C} LL(\theta, C)$
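A sketch of the alternating scheme; `fit_theta` and `improve_clusters` are hypothetical placeholders for the two maximization steps worked out on the next slides:

```python
def alternating_maximization(corpus, C, n_rounds=10):
    # Repeat: fix C and fit theta by counting, then fix theta and improve C.
    # fit_theta and improve_clusters are placeholders, not defined here.
    for _ in range(n_rounds):
        theta = fit_theta(corpus, C)          # max over theta, C fixed
        C = improve_clusters(corpus, theta)   # max over C, theta fixed
    return theta, C
```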

$\max_{\theta \in \Theta} LL(\theta, C)$
$LL(\theta, C) = \log P(w_0, w_1, w_2, \ldots, w_n \mid \theta, C) = \sum_{i=1}^{n} \left[ \log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i)) \right]$
With C fixed, the maximizing parameters are the relative counts:
$P(c' \mid c) = \frac{\#(c', c)}{\#c}, \qquad P(w \mid c) = \frac{\#(w, c)}{\#c}$
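A sketch of this counting step, assuming a fixed clustering C over a toy corpus (both are illustrative, reusing the running example):

```python
from collections import Counter

C = {"<s>": 0, "a": 3, "the": 3, "dog": 46, "cat": 46, "boy": 46,
     "rabbit": 46, "is": 64, "chasing": 8, "following": 8}
corpus = [
    "<s> a dog is chasing a cat".split(),
    "<s> the boy is following a rabbit".split(),
]

pair_counts = Counter()     # #(c', c): cluster c followed by cluster c'
context_counts = Counter()  # #c as the conditioning (previous) cluster
word_counts = Counter()     # #(w, c): word w emitted from its cluster c
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        pair_counts[(C[cur], C[prev])] += 1
        context_counts[C[prev]] += 1
        word_counts[(cur, C[cur])] += 1

def p_trans(c_next, c):
    """P(c' | c) = #(c', c) / #c"""
    return pair_counts[(c_next, c)] / context_counts[c]

def p_emit(w, c):
    """P(w | c) = #(w, c) / #c, where #c here counts tokens emitted from c."""
    total = sum(n for (_, cl), n in word_counts.items() if cl == c)
    return word_counts[(w, c)] / total

print(p_trans(46, 3))     # P(cluster 46 | cluster 3) = 1.0 on this toy corpus
print(p_emit("dog", 46))  # 0.25
```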

$\max_{C} LL(\theta, C)$
$\max_{C} \sum_{i=1}^{n} \left[ \log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i)) \right] = n \sum_{c=1}^{k} \sum_{c'=1}^{k} p(c, c') \log \frac{p(c, c')}{p(c)\, p(c')} + G$,
where G is a constant. Here,
$p(c, c') = \frac{\#(c, c')}{\sum_{c, c'} \#(c, c')}, \qquad p(c) = \frac{\#c}{\sum_{c} \#c}$
and $\frac{p(c, c')}{p(c)\, p(c')} = \frac{p(c \mid c')}{p(c)}$, so the double sum is the mutual information between adjacent clusters.
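A sketch of evaluating this objective for a given clustering; p(c) is estimated from cluster occurrence counts, and the toy corpus and cluster map are the same illustrative ones used in the earlier sketches:

```python
import math
from collections import Counter

def mutual_information(corpus, C):
    """sum_{c,c'} p(c,c') log [ p(c,c') / (p(c) p(c')) ] over adjacent word pairs."""
    pair_counts = Counter()     # #(c, c')
    cluster_counts = Counter()  # #c
    n_pairs = 0
    for sent in corpus:
        for w in sent:
            cluster_counts[C[w]] += 1
        for prev, cur in zip(sent, sent[1:]):
            pair_counts[(C[prev], C[cur])] += 1
            n_pairs += 1
    n_tokens = sum(cluster_counts.values())
    mi = 0.0
    for (c, c2), n in pair_counts.items():
        p_joint = n / n_pairs
        p_c, p_c2 = cluster_counts[c] / n_tokens, cluster_counts[c2] / n_tokens
        mi += p_joint * math.log(p_joint / (p_c * p_c2))
    return mi

corpus = [
    "<s> a dog is chasing a cat".split(),
    "<s> the boy is following a rabbit".split(),
]
C = {"<s>": 0, "a": 3, "the": 3, "dog": 46, "cat": 46, "boy": 46,
     "rabbit": 46, "is": 64, "chasing": 8, "following": 8}
print(mutual_information(corpus, C))
```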

Algorithm 1
Start with |V| clusters: each word is in its own cluster.
The goal is to get k clusters.
We run |V| − k merge steps: pick 2 clusters and merge them, at each step choosing the merge that maximizes LL(θ, C).
Cost: $O(|V| - k)$ iterations × $O(|V|^2)$ candidate pairs × $O(|V|^2)$ to compute LL = $O(|V|^5)$; this can be improved to $O(|V|^3)$.
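A naive sketch of this greedy agglomerative procedure; `score` stands in for the objective (e.g., the `mutual_information` sketch above), so this follows the naive $O(|V|^5)$-style cost rather than the improved $O(|V|^3)$ version:

```python
import itertools

def brown_cluster_naive(corpus, vocab, k, score):
    # Start with one cluster per word; greedily merge the pair that leaves
    # the objective highest, until only k clusters remain.
    clusters = {w: i for i, w in enumerate(vocab)}
    while len(set(clusters.values())) > k:
        best_score, best_assignment = None, None
        for c1, c2 in itertools.combinations(set(clusters.values()), 2):
            merged = {w: (c1 if c == c2 else c) for w, c in clusters.items()}
            s = score(corpus, merged)
            if best_score is None or s > best_score:
                best_score, best_assignment = s, merged
        clusters = best_assignment
    return clusters

# Example (reusing the toy corpus and mutual_information sketch above):
# vocab = sorted({w for sent in corpus for w in sent})
# print(brown_cluster_naive(corpus, vocab, 4, mutual_information))
```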

Algorithm 2
m: a hyper-parameter; sort words by frequency.
Take the top m most frequent words and put each of them in its own cluster $c_1, c_2, c_3, \ldots, c_m$.
For $i = m+1, \ldots, |V|$:
  create a new cluster $c_{m+1}$ for the i-th most frequent word (we now have m+1 clusters);
  choose two of the m+1 clusters based on LL(θ, C) and merge them ⇒ back to m clusters.
Carry out (m − 1) final merges ⇒ full hierarchy.
Running time: $O(|V| m^2 + n)$, where n is the number of words in the corpus.
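A structural sketch of this procedure; `merge_cost(A, B)` is a hypothetical callback for the loss in LL(θ, C) incurred by merging clusters A and B, and the incremental bookkeeping that actually achieves the $O(|V| m^2 + n)$ running time is not attempted here:

```python
def best_pair(clusters, merge_cost):
    # Pick the pair of clusters whose merge hurts the objective least.
    return min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))

def brown_cluster_windowed(words_by_freq, m, merge_cost):
    # words_by_freq: vocabulary sorted by corpus frequency, most frequent first.
    clusters = [{w} for w in words_by_freq[:m]]   # top-m words, one cluster each
    merges = []                                   # record of merges = the hierarchy
    for w in words_by_freq[m:]:
        clusters.append({w})                      # temporarily m+1 clusters
        i, j = best_pair(clusters, merge_cost)
        merges.append((frozenset(clusters[i]), frozenset(clusters[j])))
        clusters[i] |= clusters.pop(j)            # merge -> back to m clusters
    while len(clusters) > 1:                      # (m-1) final merges
        i, j = best_pair(clusters, merge_cost)
        merges.append((frozenset(clusters[i]), frozenset(clusters[j])))
        clusters[i] |= clusters.pop(j)
    return merges
```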

Example clusters (Brown+1992)

Example hierarchy (Miller+2004)

Quiz 1
30 min (9/20 Tue., 12:30pm-1:00pm)
Fill-in-the-blank, true/false, and short-answer questions
Closed book, closed notes, closed laptop
Sample questions:
  Add-one smoothing vs. add-lambda smoothing
  $a = (1, 3, 5)$, $b = (2, 3, 6)$: what is the cosine similarity between $a$ and $b$?
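A worked version of the cosine-similarity sample question (the arithmetic only, no library assumed):

```python
import math

a = [1, 3, 5]
b = [2, 3, 6]

dot = sum(x * y for x, y in zip(a, b))      # 1*2 + 3*3 + 5*6 = 41
norm_a = math.sqrt(sum(x * x for x in a))   # sqrt(35)
norm_b = math.sqrt(sum(y * y for y in b))   # sqrt(49) = 7
print(dot / (norm_a * norm_b))              # 41 / (7 * sqrt(35)) ≈ 0.99
```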
