Corpora and Statistical Methods
Lecture 13
Albert Gatt
In this lecture
- text categorisation
- overview of clustering methods
- machine learning methods for text classification
Text classification
Given:
- a set of documents
- a set of categories
Task: sort documents by category
Examples:
- sort news text by topic (POLITICS, SPORT, etc.)
- sort email into SPAM/NON-SPAM
- classify documents by author
Setup
Typical setup:
- identify relevant features of the documents: individual words, n-grams (e.g. bigrams), …
- learn a model to classify a document: naïve Bayes, maximum entropy, language models, …
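A minimal sketch of this setup, assuming scikit-learn as the toolkit (the lecture names the methods, not the library) and toy documents and labels invented for illustration:

```python
# Toy text classification: word + bigram count features, naive Bayes model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["the striker scored a late goal",
        "parliament debated the new budget",
        "the keeper saved a penalty",
        "the minister announced a tax reform"]
labels = ["SPORT", "POLITICS", "SPORT", "POLITICS"]

vec = CountVectorizer(ngram_range=(1, 2))   # unigram + bigram features
X = vec.fit_transform(docs)

model = MultinomialNB().fit(X, labels)
print(model.predict(vec.transform(["a goal in the final minute"])))
```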
Supervised vs unsupervised
(cf. the supervised/unsupervised distinction for Word Sense Disambiguation; lecture 6)
Supervised learning:
- training data is labeled
- several methods available (naïve Bayes, etc.)
Unsupervised learning:
- training data is unlabeled
- document classes have to be “discovered”
- possible method: clustering
Part 1: Clustering documents
Clustering
Flat/non-hierarchical:
- just sets of related documents; no relationship between clusters
- very efficient algorithms exist, e.g. k-means clustering (see the sketch below)
Hierarchical:
- related documents grouped in a tree (dendrogram); tree branches indicate similarity (or, equivalently, distance)
- less efficient than non-hierarchical clustering: n documents need on the order of n × n similarity computations
- but more informative
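As an illustration of the flat case, a k-means sketch over tf-idf vectors; scikit-learn and the toy documents are assumptions for illustration:

```python
# Flat clustering: k-means yields sets of related documents,
# with no relationship between the clusters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["goal scored in the final",
        "striker signs a new contract",
        "election results announced",
        "parliament passes the tax bill"]
X = TfidfVectorizer().fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # flat assignment: one cluster id per document
```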
Soft vs hard clusters
Hard clustering:
- each document belongs to exactly 1 class
- hierarchical methods are usually hard
Soft clustering:
- allows degrees of membership
- e.g. p(c1|d1) > p(c2|d1), i.e. d1 belongs to c1 to a greater degree than to c2
Similarity & monotonicity
All hierarchical methods require a similarity metric:
- similarity is computed both between individual documents and between clusters
- a vector-space representation for documents with cosine similarity is a common technique (sketched below)
The similarity metric needs to be monotonic:
- i.e. we expect merging not to increase similarity
- otherwise, merging two clusters could make the result more similar to a third cluster than either part was, undermining the greedy merge order
Agglomerative clustering algorithm
Given:
- D = {d1, …, dn} (the documents)
- a similarity metric
1. Initialise clusters C = {c1, …, cn} for {d1, …, dn}
2. j := n + 1
3. do until |C| = 1:
   a. find the most similar pair (c, c′) in C
   b. create a new cluster cj = c ∪ c′
   c. remove c, c′ from C
   d. add cj to C
   e. j := j + 1
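A direct Python sketch of this algorithm; the cluster-similarity function `sim` is assumed to be supplied (e.g. one of the merging strategies on the later slides):

```python
# Agglomerative clustering, following the algorithm above: start with
# one singleton cluster per document and repeatedly merge the most
# similar pair until a single cluster remains.
from itertools import combinations

def agglomerate(docs, sim):
    clusters = [frozenset([d]) for d in docs]
    merges = []                      # records the dendrogram, bottom-up
    while len(clusters) > 1:
        # step (a): find the most similar pair (c, c') in C
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: sim(*pair))
        # steps (b)-(d): replace c and c' with their union
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 | c2)
        merges.append(c1 | c2)
    return merges
```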
Agglomerative clustering: walkthrough
Start with separate clusters for each document. [diagram: singleton clusters D1, D2, D3, D4, D5]
Agglomerative clustering: walkthrough
D1 and D2 are most similar. [diagram: {D1, D2} merged; D3, D4, D5 still separate]
Agglomerative clustering: walkthrough
D4 and D5 are most similar. [diagram: {D1, D2} and {D4, D5} merged; D3 separate]
Agglomerative clustering: walkthrough
D3 and {D4, D5} are most similar. [diagram: {D3, D4, D5} merged]
Agglomerative clustering: walkthrough
Final step: merge the last two clusters. [diagram: all five documents in a single cluster]
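The walkthrough can be reproduced with the `agglomerate` sketch above, using invented pairwise similarities and the single-link strategy (described on the next slide):

```python
# Invented document similarities: D1-D2 and D4-D5 are the closest
# pairs, and D3 is closer to D4/D5 than to D1/D2.
PAIR_SIM = {frozenset(p): s for p, s in [
    (("D1", "D2"), 0.9), (("D4", "D5"), 0.8),
    (("D3", "D4"), 0.7), (("D3", "D5"), 0.6),
    (("D1", "D3"), 0.3), (("D2", "D3"), 0.3),
    (("D1", "D4"), 0.2), (("D1", "D5"), 0.2),
    (("D2", "D4"), 0.2), (("D2", "D5"), 0.2),
]}

def single_link(c1, c2):
    # similarity of the two most similar members
    return max(PAIR_SIM[frozenset((a, b))] for a in c1 for b in c2)

# merge order: {D1,D2}, {D4,D5}, {D3,D4,D5}, then everything
print(agglomerate(["D1", "D2", "D3", "D4", "D5"], single_link))
```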
Merging: single link strategy
Similarity of two clusters = similarity of the two most similar members.
- Pro: good local coherence (high pairwise similarity)
- Con: “elongated” clusters (bad global coherence)
[diagram: two clusters {c1, c2, c3, c4} and {c5, c6, c7, c8}, with sim marking the closest pair across them]
Merging: complete link strategy
Similarity of two clusters = similarity of the two most dissimilar members.
- better global coherence
[diagram: two clusters {c1, c2, c3, c4} and {c5, c6, c7, c8}, with sim marking the most distant pair across them]
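As functions over clusters, the two strategies differ only in taking a max or a min over member pairs; a pairwise document similarity `sim_docs` is assumed to be supplied:

```python
def single_link(c1, c2, sim_docs):
    # similarity of the two most similar members
    return max(sim_docs(a, b) for a in c1 for b in c2)

def complete_link(c1, c2, sim_docs):
    # similarity of the two most dissimilar members
    return min(sim_docs(a, b) for a in c1 for b in c2)
```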
Merging: group average strategy
Similarity of two clusters = average pairwise similarity between their members.
- a compromise between local & global coherence
When using a vector-space representation with cosine similarity, the average similarity of a merged cluster C = C1 ∪ C2 can be computed from information stored with its children C1 & C2 (their sum vectors and sizes):
- much more efficient than computing the average pairwise similarity over all document pairs in C1 × C2!
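A sketch of that efficiency trick: if each document vector is length-normalised and each cluster keeps only its sum vector and size, the average pairwise similarity falls out of the sums directly, because cos(x, x) = 1 for normalised x:

```python
# Group-average similarity from a cluster's aggregates: for a cluster
# of n length-normalised vectors with sum s, the pairwise cosines over
# ordered pairs sum to s.s - n (each self-similarity contributes 1),
# and there are n*(n-1) ordered pairs.
import numpy as np

def avg_pairwise_sim(s, n):
    return (np.dot(s, s) - n) / (n * (n - 1))

# merging two clusters only needs their aggregates, not their members
def merge(s1, n1, s2, n2):
    return s1 + s2, n1 + n2
```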
Divisive clustering
- a kind of top-down hierarchical clustering
- also a greedy algorithm
1. Start with a single cluster representing all documents.
2. Iteratively divide clusters:
   - split the cluster which is least coherent (the cluster whose elements are least similar to each other)
   - to split a cluster C into {C1, C2}, one can run agglomerative clustering over the elements of C (see the sketch below)!
   - therefore, computationally more expensive than the pure agglomerative method
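A sketch of one divisive step under these assumptions: a supplied coherence score picks the cluster to split, and the split itself reuses agglomerative merging, stopped at two clusters:

```python
# One step of divisive clustering: split the least coherent cluster
# into two by agglomerating its members and stopping at two clusters.
from itertools import combinations

def split_in_two(cluster, sim):
    parts = [frozenset([d]) for d in cluster]
    while len(parts) > 2:
        c1, c2 = max(combinations(parts, 2), key=lambda p: sim(*p))
        parts.remove(c1)
        parts.remove(c2)
        parts.append(c1 | c2)
    return parts

def divisive_step(clusters, coherence, sim):
    worst = min(clusters, key=coherence)   # least coherent cluster
    clusters.remove(worst)
    clusters.extend(split_in_two(worst, sim))
    return clusters
```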