Albert Gatt Corpora and Statistical Methods Lecture 13.


1 Albert Gatt Corpora and Statistical Methods Lecture 13

2 In this lecture
Text categorisation:
- overview of clustering methods
- machine learning methods for text classification

3 Text classification
Given:
- a set of documents
- a set of categories
Task: sort documents by category
Examples:
- sort news text by topic (POLITICS, SPORT, etc.)
- sort email into SPAM/NON-SPAM
- classify documents by author

4 Setup
Typical setup:
- identify relevant features of the documents
  - individual words
  - n-grams (e.g. bigrams)
  - …
- learn a model to classify a document
  - Naïve Bayes method
  - maximum entropy
  - language models
  - …
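A minimal sketch of this setup, assuming scikit-learn is available; the documents, labels, and names below are invented toy material, not from the lecture:

```python
# Toy text classification: word + bigram count features, Naive Bayes model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["the match ended in a draw",
              "parliament passed the new budget",
              "the striker scored twice",
              "the minister resigned after the vote"]
train_labels = ["SPORT", "POLITICS", "SPORT", "POLITICS"]

# Features: individual words and bigrams (raw counts).
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_docs)

# Learn a Naive Bayes model over those features.
model = MultinomialNB()
model.fit(X_train, train_labels)

X_test = vectorizer.transform(["the budget vote was close"])
print(model.predict(X_test))   # e.g. ['POLITICS']
```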

5 Supervised vs unsupervised
(cf. the un/supervised distinction for Word Sense Disambiguation; lecture 6)
Supervised learning:
- training data is labeled
- several methods available (naïve Bayes, etc.)
Unsupervised learning:
- training data is unlabeled
- document classes have to be "discovered"
- possible method: clustering

6 Clustering documents Part 1

7 Clustering
Flat/non-hierarchical:
- just sets of related documents; no relationship between clusters
- very efficient algorithms exist, e.g. k-means clustering
Hierarchical:
- related documents are grouped in a tree (dendrogram); tree branches indicate similarity (resp. distance)
- less efficient than non-hierarchical clustering: n documents need n * n similarity computations
- but more informative
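For the flat case, a minimal k-means sketch, again assuming scikit-learn and invented toy documents:

```python
# Flat clustering with k-means: each document gets exactly one cluster id,
# and there is no relationship between the clusters themselves.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the striker scored a late goal",
        "the keeper saved a penalty",
        "parliament debated the new budget",
        "the minister announced tax cuts"]

X = TfidfVectorizer().fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X))   # e.g. [0 0 1 1]
```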

8 Soft vs hard clusters
Hard clustering:
- each document belongs to exactly 1 class
- hierarchical methods are usually hard
Soft clustering:
- allows degrees of membership
- e.g. p(c1|d1) > p(c2|d1), i.e. d1 belongs to c1 to a greater degree than to c2
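As an illustration of degrees of membership, a small sketch using a Gaussian mixture (one possible soft-clustering model; assuming scikit-learn, and toy 2-D points rather than documents):

```python
# Soft clustering: each point gets a probability of membership in every
# cluster, rather than a single hard label. Toy data for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

points = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9], [0.5, 0.5]])

gm = GaussianMixture(n_components=2, random_state=0).fit(points)
memberships = gm.predict_proba(points)   # one row per point, one column per cluster

# For a point between the two groups, p(c1|d) may exceed p(c2|d)
# without either probability being 1.0.
print(memberships.round(2))
```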

9 Similarity & monotonicity All hierarchical methods require a similarity metric: similarity computed between individual documents and between clusters Vector-space representation for documents with cosine similarity is a common technique The similarity metric needs to be monotonic: i.e. we expect merging not to increase similarity otherwise, when we merge 2 clusters, their similarity to a third might change

10 Agglomerative clustering algorithm
Given:
- D = {d_1, …, d_n} (the documents)
- a similarity metric
1. Initialise clusters C = {c_1, …, c_n} for {d_1, …, d_n}
2. j := n+1
3. do until |C| = 1
   a. find the most similar pair (c, c') in C
   b. create a new cluster c_j = c ∪ c'
   c. remove c, c' from C
   d. add c_j to C
   e. j := j+1
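A rough Python rendering of this pseudocode, assuming documents are already represented as vectors and scoring a cluster pair by its average pairwise cosine similarity; the helper names and toy data are illustrative only:

```python
# Greedy agglomerative clustering: repeatedly merge the most similar pair.
import itertools
import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cluster_sim(a, b, docs):
    # Average pairwise similarity between the members of two clusters.
    return np.mean([cos(docs[i], docs[j]) for i in a for j in b])

def agglomerative(docs):
    # 1. one singleton cluster per document; record merges for the dendrogram
    clusters = [frozenset([i]) for i in range(len(docs))]
    merges = []
    # 3. repeat until a single cluster remains
    while len(clusters) > 1:
        # a. find the most similar pair of clusters
        c, c2 = max(itertools.combinations(clusters, 2),
                    key=lambda pair: cluster_sim(pair[0], pair[1], docs))
        # b-d. merge them and put the new cluster back
        clusters.remove(c)
        clusters.remove(c2)
        clusters.append(c | c2)
        merges.append((c, c2))
    return merges

docs = [np.array(v, dtype=float) for v in
        ([1, 1, 0], [1, 2, 0], [0, 1, 1], [0, 0, 2], [0, 1, 3])]
for c, c2 in agglomerative(docs):
    print(sorted(c), "+", sorted(c2))
```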

11 Agglomerative clustering - walkthrough: start with separate clusters for each document (diagram: documents D1-D5)

12 Agglomerative clustering - walkthrough: D1 and D2 are most similar, so they are merged first

13 Agglomerative clustering - walkthrough: D4 and D5 are most similar

14 Agglomerative clustering - walkthrough: D3 and {D4, D5} are most similar

15 Agglomerative clustering - walkthrough: final step is to merge the last two clusters

16 Merging: single link strategy
Similarity of two clusters = similarity of their two most similar members.
- Pro: good local coherence (high pairwise similarity)
- Con: "elongated" clusters (bad global coherence)
(diagram: example dendrograms over clusters c1-c8 against a similarity axis)

17 Merging: complete link strategy
Similarity of two clusters = similarity of their two most dissimilar members.
- Pro: better global coherence
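A small sketch contrasting the single-link strategy (previous slide) with complete link: the same pairwise similarities, reduced with max versus min. The vectors and names are toy illustrations:

```python
# Single link vs complete link cluster similarity.
import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def single_link(a, b):
    """Cluster similarity = similarity of the two MOST similar members."""
    return max(cos(u, v) for u in a for v in b)

def complete_link(a, b):
    """Cluster similarity = similarity of the two MOST dissimilar members."""
    return min(cos(u, v) for u in a for v in b)

c1 = [np.array([1.0, 0.0]), np.array([0.9, 0.5])]
c2 = [np.array([0.8, 0.6]), np.array([0.0, 1.0])]
print(single_link(c1, c2), complete_link(c1, c2))
```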

18 Group average strategy
Similarity of two clusters = average pairwise similarity.
- Compromise between local & global coherence.
- When using a vector-space representation with cosine similarity, the average similarity of a merged cluster C = C1 ∪ C2 can be computed directly from its children C1 & C2, which is much more efficient than computing the average pairwise similarity over all document pairs in C1 × C2!
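A sketch of why this is efficient, assuming length-normalised vectors and cosine similarity: the average pairwise similarity of a cluster can be recovered from the sum of its members' vectors (the standard trick, described e.g. in Manning & Schütze). The vectors below are toy data:

```python
# Group-average similarity from the summed vector, vs the naive pairwise loop.
import itertools
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

cluster = [unit(np.array(v, dtype=float))
           for v in ([1, 1, 0], [2, 1, 0], [1, 0, 1], [0, 1, 1])]
n = len(cluster)

# Naive: average cosine similarity over all distinct pairs.
naive = np.mean([np.dot(u, v) for u, v in itertools.combinations(cluster, 2)])

# Fast: the same quantity computed from the summed vector alone.
s = np.sum(cluster, axis=0)
fast = (np.dot(s, s) - n) / (n * (n - 1))

print(round(naive, 6) == round(fast, 6))   # True
```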

19 Divisive clustering
- a kind of top-down hierarchical clustering
- also a greedy algorithm
1. start with a single cluster representing all documents
2. iteratively divide clusters:
   - split the cluster which is least coherent (the cluster whose elements are least similar to each other)
   - to split a cluster C into {C1, C2}, one can run agglomerative clustering over the elements of C!
   - therefore, computationally more expensive than the pure agglomerative method
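A rough sketch of the top-down loop over toy vectors. Note one deviation from the slide: for brevity the split step uses 2-means (via scikit-learn) rather than a nested agglomerative pass, and "least coherent" is taken to mean lowest average pairwise cosine similarity:

```python
# Divisive (top-down) clustering: repeatedly split the least coherent cluster.
import numpy as np
from sklearn.cluster import KMeans

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def coherence(idxs, docs):
    """Average pairwise similarity; singletons count as fully coherent."""
    if len(idxs) < 2:
        return 1.0
    return np.mean([cos(docs[i], docs[j])
                    for k, i in enumerate(idxs) for j in idxs[k + 1:]])

def divisive(docs, n_clusters):
    clusters = [list(range(len(docs)))]        # start with one big cluster
    while len(clusters) < n_clusters:
        worst = min(clusters, key=lambda c: coherence(c, docs))
        clusters.remove(worst)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
            np.array([docs[i] for i in worst]))
        for lab in (0, 1):                     # split it into two new clusters
            clusters.append([i for i, l in zip(worst, labels) if l == lab])
    return clusters

docs = [np.array(v, dtype=float) for v in
        ([1, 1, 0], [1, 2, 0], [0, 1, 3], [0, 0, 2], [3, 0, 1])]
print(divisive(docs, 3))
```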

