Clustering
Shallow Processing Techniques for NLP
Ling570, November 30, 2011
Roadmap
- Clustering: motivation and applications
- Clustering approaches
- Evaluation
Clustering
Task: given a set of objects, create a set of clusters over those objects.
Applications:
- Exploratory data analysis
- Document clustering
- Language modeling: generalization for class-based LMs
- Unsupervised word sense disambiguation
- Automatic thesaurus creation
- Unsupervised part-of-speech tagging
- Speaker clustering, ...
Example: Document Clustering
Input: a set of individual documents
Output: sets of document clusters
Many different types of clustering:
- Category: news, sports, weather, entertainment
- Genre: similar styles, e.g., blogs, tweets, newswire
- Author: documents by the same author
- Language ID: clusters by language
- Topic: documents on the same topic (OWS, the debt supercommittee, the Seattle Marathon, Black Friday, ...)
Example: Word Clustering
Input: words
Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats
Output: word clusters
Example clusters (from the NYT):
- ballot, polls, Gov, seats
- profit, finance, payments
- NFL, Reds, Sox, inning, quarterback, scored, score
- researchers, science
- Scott, Mary, Barbara, Edward
Questions
- What should a cluster represent? Similarity among objects.
- How can we create clusters?
- How can we evaluate clusters?
- How can we improve NLP with clustering?
(Due to F. Xia)
Similarity
- Between two instances
- Between an instance and a cluster
- Between clusters
Similarity Measures
Given x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n):
- Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
- Cosine similarity: $\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
A short code sketch of these measures follows.
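To make the three measures concrete, here is a minimal Python sketch (illustrative code, not part of the original slides):

```python
import math

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # d(x, y) = sum_i |x_i - y_i|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine(x, y):
    # cos(x, y) = (x . y) / (|x| * |y|)
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

print(euclidean((0, 3), (4, 0)))  # 5.0
print(manhattan((0, 3), (4, 0)))  # 7
print(cosine((1, 1), (1, 1)))     # 1.0
```

Note that the first two are distances (smaller = more alike) while cosine is a similarity (larger = more alike).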
Clustering Algorithms
Types of Clustering
Flat vs. hierarchical clustering:
- Flat: partition the data into k clusters
- Hierarchical: nodes form a hierarchy
Hard vs. soft clustering:
- Hard: each object is assigned to exactly one cluster
- Soft: allows degrees of membership, and membership in more than one cluster; often a probability distribution over cluster membership
Hierarchical Clustering
Hierarchical vs. Flat
Hierarchical clustering:
- More informative; good for data exploration
- Many algorithms, none good for all data
- Computationally expensive
Flat clustering:
- Fairly efficient
- Simple baseline algorithm: K-means
- Probabilistic models use the EM algorithm
Clustering Algorithms
Flat clustering:
- K-means clustering
- K-medoids clustering
Hierarchical clustering:
- Greedy, bottom-up clustering
K-Means Clustering
Initialize: randomly select k initial centroids, where a centroid is the center (mean) of a cluster.
Iterate until the clusters stop changing:
- Assign each instance to the nearest cluster (a cluster is nearest if its centroid is nearest)
- Recompute each cluster centroid as the mean of the instances in the cluster
A runnable sketch of this loop appears below.
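Here is a minimal, illustrative Python sketch of the loop just described (the function name and signature are my own, not from the slides):

```python
import random

def kmeans(points, k, max_iters=100):
    """Minimal k-means over points given as tuples of floats."""
    centroids = random.sample(points, k)       # randomly select k initial centroids
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster with the nearest
        # centroid (squared Euclidean; dropping the sqrt preserves the argmin).
        new_assignments = [
            min(range(k),
                key=lambda c: sum((pi - mi) ** 2
                                  for pi, mi in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignments == assignments:     # clusters stopped changing: done
            break
        assignments = new_assignments
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments
```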
K-Means: 1 step (figure)
K-Means
Running time: O(kn) distance computations per iteration, for n instances and k clusters; converges in a finite number of steps.
Issues:
- Need to pick the number of clusters k
- Can find only a local optimum
- Sensitive to outliers
- Requires Euclidean distance: what about enumerable classes (e.g., colors)?
Medoid
Medoid: the element in a cluster with the highest average similarity to the other elements in the cluster.
Finding the medoid: for each element p in cluster c, compute
$f(p) = \frac{1}{|c| - 1} \sum_{q \in c,\, q \neq p} \mathrm{sim}(p, q)$
and select the element with the highest f(p).
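In code this is a one-line argmax; the 1/(|c|-1) factor is the same for every candidate, so the plain sum suffices (illustrative sketch, not from the slides):

```python
def medoid(cluster, sim):
    """Return the element with the highest average similarity to the others.

    sim(a, b) is any similarity function; the normalizing constant is
    dropped because it does not change which element wins the argmax.
    """
    return max(cluster,
               key=lambda p: sum(sim(p, q) for q in cluster if q is not p))
```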
K-Medoids
Initialize: select k instances at random as medoids.
Iterate until there are no changes:
- Assign each instance to the cluster with the nearest medoid
- Recompute the medoid for each cluster
A sketch of the full loop follows.
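A minimal self-contained sketch, assuming a similarity function sim(a, b) where higher means more similar (names and structure are my own):

```python
import random

def k_medoids(items, k, sim, max_iters=100):
    """Minimal k-medoids: items is a list, sim(a, b) a similarity function."""
    medoids = random.sample(items, k)          # k random initial medoids
    clusters = None
    for _ in range(max_iters):
        # Assign each item to the cluster whose medoid is most similar.
        new_clusters = [[] for _ in range(k)]
        for item in items:
            best = max(range(k), key=lambda c: sim(item, medoids[c]))
            new_clusters[best].append(item)
        if new_clusters == clusters:           # no changes: done
            break
        clusters = new_clusters
        # Recompute each cluster's medoid (highest total similarity to members).
        medoids = [max(c, key=lambda p: sum(sim(p, q) for q in c if q is not p))
                   if c else medoids[i]
                   for i, c in enumerate(clusters)]
    return medoids, clusters
```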
Greedy, Bottom-Up Hierarchical Clustering
Initialize: make an individual (singleton) cluster for each instance.
Iterate until all instances are in the same cluster:
- Merge the two most similar clusters
A sketch appears below.
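The slides do not fix a cluster-cluster similarity; the sketch below assumes average-link (average pairwise similarity), one common choice:

```python
def bottom_up_cluster(items, sim):
    """Greedy agglomerative clustering; returns the merge history (a dendrogram).

    Cluster-cluster similarity is average pairwise similarity (average-link);
    single-link (max) or complete-link (min) are one-line changes.
    """
    clusters = [[x] for x in items]            # one singleton cluster per instance
    history = []
    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = (sum(sim(a, b) for a in clusters[i] for b in clusters[j])
                     / (len(clusters[i]) * len(clusters[j])))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        history.append((clusters[i], clusters[j]))   # record the merge
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return history
```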
Evaluation
Evaluation
With respect to a gold standard:
- Accuracy: for each cluster, assign the most common label to all items
- Rand index
- F-measure
Alternatives:
- Extrinsic evaluation
- Human inspection
Configuration
Given:
- A set of objects O = {o_1, o_2, ..., o_n}
- A partition X = {x_1, ..., x_r}
- A partition Y = {y_1, ..., y_s}
Count each pair of objects into one of four cells:

                         In same set in X    In different sets in X
In same set in Y                 a                     d
In different sets in Y           c                     b
Rand Index
A measure of cluster similarity (Rand, 1971):
$RI(X, Y) = \frac{a + b}{a + b + c + d}$
where a and b count the pairs on which the two partitions agree, and c and d the pairs on which they disagree.
No agreement: 0. Full agreement: 1.
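The index can be computed directly from the pair counts; a sketch, assuming each partition is represented as a dict from object to cluster id:

```python
from itertools import combinations

def rand_index(X, Y):
    """Rand index between two partitions over the same objects."""
    agree = total = 0
    for o1, o2 in combinations(X.keys(), 2):
        same_x = X[o1] == X[o2]
        same_y = Y[o1] == Y[o2]
        agree += same_x == same_y    # counts the a and b cells
        total += 1                   # a + b + c + d = n choose 2
    return agree / total

X = {'a': 1, 'b': 1, 'c': 2}
print(rand_index(X, X))  # 1.0: full agreement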
Precision & Recall
Assume X is the gold-standard partition and Y is the system-generated partition.
For each pair of items in the same cluster in Y, the pair is correct if the items also appear together in a cluster in X.
From these pair counts we can compute P, R, and F-measure; in terms of the table above, precision = a / (a + d) and recall = a / (a + c).
HW #10 (Due to F. Xia)
Unsupervised POS tagging: word clustering by neighboring-word cooccurrence.
- Create feature vectors. Features: counts of adjacent word occurrences, e.g., L=he:10 or R=run:3
- Perform clustering: the K-medoids algorithm (with cosine similarity)
- Evaluate clusters: cluster mapping + accuracy
Q1: create_vectors.* training_file word_file feat_file outfile
- training_file: one sentence per line: w1 w2 w3 ... wn
- word_file: list of words to cluster, one "word freq" pair per line
- feat_file: list of words to use as features, one "feat freq" pair per line
- outfile: one line per word in word_file, in the format:
  word L=he 10 L=she 5 ... R=gone 2 R=run 3 ...
Features
Features are of the form (L|R)=xx freq, where:
- xx is a word in feat_file
- L or R is the position (left or right neighbor) where the feature appeared
- freq is the number of times word xx appeared in that position in the training file
For example, suppose 'New York' appears 540 times in the corpus; then the vector for York includes:
York L=New 540 ... R=New 0 ...
A sketch of the counting loop follows.
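One plausible reading of the counting loop, as a sketch (the function name and exact I/O handling are assumptions, not the assignment's spec):

```python
from collections import Counter, defaultdict

def count_neighbor_features(training_file, target_words, feat_words):
    """Count left/right neighbors of each target word (core of Q1).

    target_words and feat_words are sets read from word_file and feat_file;
    returns {word: Counter({'L=he': 10, 'R=run': 3, ...})}.
    """
    vectors = defaultdict(Counter)
    with open(training_file) as f:
        for line in f:
            words = line.split()
            for i, w in enumerate(words):
                if w not in target_words:
                    continue
                if i > 0 and words[i - 1] in feat_words:
                    vectors[w]['L=' + words[i - 1]] += 1   # left-neighbor feature
                if i + 1 < len(words) and words[i + 1] in feat_words:
                    vectors[w]['R=' + words[i + 1]] += 1   # right-neighbor feature
    return vectors
```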
Vector File
- One line per word in word_file
- Lines should be ordered as in word_file
- Features should be sorted alphabetically by feature name, e.g.:
  L=an 3 L=the 10 ... R=aqua 1 R=house 5
- Feature sorting aids the cosine computation: two alphabetically sorted sparse vectors can be intersected in a single merge pass (see the sketch below)
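A sketch of that merge-based cosine over sorted (feature, count) lists (illustrative, not required code):

```python
import math

def sparse_cosine(u, v):
    """Cosine over sparse vectors given as (feature, count) pairs sorted by name.

    Because both lists are sorted, the dot product is a single merge pass,
    which is why the assignment asks for alphabetically sorted features.
    """
    dot = i = j = 0
    while i < len(u) and j < len(v):
        if u[i][0] == v[j][0]:        # same feature: contributes to the dot product
            dot += u[i][1] * v[j][1]
            i += 1
            j += 1
        elif u[i][0] < v[j][0]:       # feature only in u: skip it
            i += 1
        else:                         # feature only in v: skip it
            j += 1
    norm_u = math.sqrt(sum(c * c for _, c in u))
    norm_v = math.sqrt(sum(c * c for _, c in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```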
Q2: k_medoids.* vector_file num_clusters sys_cluster_file
- vector_file: created in Q1
- num_clusters: the number of clusters to create
- sys_cluster_file: output representing the clustering of the vectors, one cluster per line:
  medoid w1 w2 w3 ... wn
  where medoid is the medoid representing the cluster and w1 ... wn are the words in the cluster
Q2: K-Medoids
- Similarity measure: cosine similarity
- Initial medoids: medoid i is placed at a fixed instance index determined by N and C, where N is the number of words to cluster and C is the number of clusters
Mapping Sys to Gold: One-to-One
- Find the highest number in the matrix
- Remove the corresponding row and column
- Repeat until all rows (or all columns) are removed
Example: counts of items from system cluster s_i carrying gold label g_j:

      g1   g2   g3
s1     2   10    9
s2     7    4    2
s3     0    9    6
s4     5    0    3

s1 => g2 (10), s2 => g1 (7), s3 => g3 (6)
acc = (10 + 7 + 6) / sum of all counts
(Due to F. Xia)
Mapping Sys to Gold: Many-to-One
- Find the highest number in the matrix
- Remove the corresponding row (but not the column)
- Repeat until all rows are removed
On the same matrix:
s1 => g2 (10), s2 => g1 (7), s3 => g2 (9), s4 => g1 (5)
acc = (10 + 7 + 9 + 5) / sum of all counts
(Due to F. Xia)
Both strategies appear in the sketch below.
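A sketch of both greedy mapping strategies, checked against the matrix above (the function name is mine):

```python
def greedy_map(matrix, one_to_one):
    """Greedy sys-to-gold mapping over a counts matrix[sys][gold].

    one_to_one=True removes both the row and the column of each maximum;
    False (many-to-one) removes only the row, so gold labels can be reused.
    """
    rows = set(range(len(matrix)))
    cols = set(range(len(matrix[0])))
    mapping, correct = {}, 0
    while rows and cols:
        # Find the largest remaining cell.
        s, g = max(((s, g) for s in rows for g in cols),
                   key=lambda sg: matrix[sg[0]][sg[1]])
        mapping[s] = g
        correct += matrix[s][g]
        rows.remove(s)
        if one_to_one:
            cols.remove(g)
    total = sum(sum(row) for row in matrix)
    return mapping, correct / total

matrix = [[2, 10, 9], [7, 4, 2], [0, 9, 6], [5, 0, 3]]
print(greedy_map(matrix, one_to_one=True))   # accuracy (10+7+6)/57
print(greedy_map(matrix, one_to_one=False))  # accuracy (10+7+9+5)/57
```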
Q3: calculate_accuracy.* sys_clust gold_clust flag map_file acc_file
- sys_clust: the output of Q2: m w1 w2 ...
- gold_clust: the same format, gold standard
- flag: 0 for one-to-one; 1 for many-to-one
- map_file: the mapping of system clusters to gold clusters: sys_clust_num => gold_clust_num count
- acc_file: just the overall accuracy
Experiments
- Compare different numbers of words and different feature representations
- Compare different mapping strategies for accuracy
- Tabulate the results