Albert Gatt Corpora and Statistical Methods Lecture 13

In this lecture
- text categorisation
- overview of clustering methods
- machine learning methods for text classification

Text classification
Given:
- a set of documents
- a set of categories
Task: sort documents by category
Examples:
- sort news text by topic (POLITICS, SPORT, etc.)
- sort into SPAM/NON-SPAM
- classify documents by author

Setup
Typical setup:
- identify relevant features of the documents: individual words, n-grams (e.g. bigrams), …
- learn a model to classify a document: naïve Bayes method, maximum entropy, language models, …
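
As a rough sketch of this setup (assuming scikit-learn is available; the toy documents and topic labels below are invented purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: documents labelled by topic
docs = [
    "the minister announced a new budget vote",
    "the striker scored twice in the final",
    "parliament debated the election reform bill",
    "the team won the championship match",
]
labels = ["POLITICS", "SPORT", "POLITICS", "SPORT"]

# Features: individual words (unigrams); bigrams could be added via ngram_range=(1, 2)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Learn a naive Bayes model over the word counts
model = MultinomialNB()
model.fit(X, labels)

# Classify an unseen document
test = vectorizer.transform(["the striker scored in the match"])
print(model.predict(test))  # expected on this toy data: ['SPORT']
```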

Supervised vs unsupervised
(cf. the un/supervised distinction for Word Sense Disambiguation; lecture 6)
Supervised learning:
- training data is labeled
- several methods available (naïve Bayes, etc.)
Unsupervised learning:
- training data is unlabeled
- document classes have to be "discovered"
- possible method: clustering

Clustering documents (Part 1)

Clustering
Flat/non-hierarchical:
- just sets of related documents; no relationship between clusters
- very efficient algorithms exist, e.g. k-means clustering (see the sketch below)
Hierarchical:
- related documents grouped in a tree (dendrogram); tree branches indicate similarity (resp. distance)
- less efficient than non-hierarchical clustering: n documents need n * n similarity computations
- but more informative
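
A minimal sketch of flat clustering with k-means, assuming scikit-learn is available; the documents and the choice of k = 2 are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stocks fell sharply on the exchange",
    "the central bank raised interest rates",
    "the striker scored in the last minute",
    "the club signed a new goalkeeper",
]

# Vector-space representation of the (unlabelled) documents
X = TfidfVectorizer().fit_transform(docs)

# Flat clustering: k-means with k = 2 clusters, no relationship between the clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(X))  # e.g. [0 0 1 1]: each document gets exactly one cluster label
```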

Soft vs hard clusters
Hard clustering:
- each document belongs to exactly 1 class
- hierarchical methods are usually hard
Soft clustering:
- allows degrees of membership
- e.g. p(c1|d1) > p(c2|d1), i.e. d1 belongs to c1 to a greater degree than to c2
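
One crude way to illustrate soft membership is to normalise a document's similarity to each cluster representative; a small sketch with made-up centroids and a made-up document vector (this is only an illustration, not the only way to obtain p(c|d)):

```python
import numpy as np

# Hypothetical cluster centroids and a document vector (tiny 3-dimensional example)
centroids = np.array([[1.0, 0.0, 0.0],   # cluster c1
                      [0.0, 1.0, 1.0]])  # cluster c2
d1 = np.array([0.9, 0.1, 0.2])

# Cosine similarity of d1 to each centroid
sims = centroids @ d1 / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(d1))

# Normalise the similarities so they sum to 1, giving degrees of membership
p = sims / sims.sum()
print(p)  # here p(c1|d1) > p(c2|d1): d1 belongs to c1 to a greater degree
```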

Similarity & monotonicity
All hierarchical methods require a similarity metric:
- similarity is computed between individual documents and between clusters
- a vector-space representation of documents with cosine similarity is a common technique (see the sketch below)
The similarity metric needs to be monotonic:
- i.e. we expect merging not to increase similarity
- otherwise, when we merge 2 clusters, their similarity to a third cluster might change
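
A minimal sketch of cosine similarity between two documents, using hypothetical term-count vectors:

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two document vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical term-count vectors for two documents
d1 = np.array([2.0, 1.0, 0.0, 3.0])
d2 = np.array([1.0, 1.0, 0.0, 2.0])
print(cosine(d1, d2))  # close to 1.0: the documents use similar vocabulary
```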

Agglomerative clustering algorithm
Given: D = {d_1, …, d_n} (the documents) and a similarity metric
1. Initialise clusters C = {c_1, …, c_n} for {d_1, …, d_n}
2. j := n + 1
3. do until |C| = 1:
   a. find the most similar pair (c, c') in C
   b. create a new cluster c_j = c ∪ c'
   c. remove c, c' from C
   d. add c_j to C
   e. j := j + 1
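
A direct (and deliberately naive) Python rendering of this pseudocode; the cosine metric, the single-link choice (`max`) and the toy vectors in the example are illustrative assumptions:

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two document vectors (as in the earlier sketch)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def agglomerative(docs, sim, cluster_sim):
    """Greedy bottom-up clustering following the pseudocode above.

    docs        -- list of document vectors
    sim         -- similarity between two documents
    cluster_sim -- combines the pairwise similarities of two clusters into one
                   score, e.g. max (single link) or min (complete link)
    Returns the merge history as (cluster, cluster, merged cluster) triples.
    """
    # 1. Initialise one singleton cluster per document (clusters are tuples of indices)
    clusters = [(i,) for i in range(len(docs))]
    history = []
    # 3. Repeat until a single cluster remains
    while len(clusters) > 1:
        # a. find the most similar pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_sim([sim(docs[a], docs[b])
                                 for a in clusters[i] for b in clusters[j]])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        # b.-e. merge the pair into a new cluster and update the cluster set
        merged = clusters[i] + clusters[j]
        history.append((clusters[i], clusters[j], merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Example: five hypothetical 2-d document vectors, single-link merging (cluster_sim = max)
vectors = [np.array(v, dtype=float) for v in [[1, 0], [0.9, 0.2], [0.5, 0.5], [0.1, 1], [0, 1]]]
for a, b, merged in agglomerative(vectors, cosine, max):
    print(a, "+", b, "->", merged)
```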

Agglomerative clustering - walkthrough
Start with separate clusters for each of the documents D1, D2, D3, D4, D5

Agglomerative clustering - walkthrough
D1 and D2 are most similar

Agglomerative clustering - walkthrough
D4 and D5 are most similar

Agglomerative clustering - walkthrough
D3 and {D4, D5} are most similar

Agglomerative clustering - walkthrough
Final step: merge the last two clusters
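
The same walkthrough can be reproduced with an off-the-shelf implementation; a sketch using SciPy's hierarchical clustering, where the 2-d vectors for D1–D5 are made up so that the merges happen in the order shown above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical 2-d vectors for D1..D5, chosen so that D1/D2 and D4/D5 pair off first
# and D3 then joins {D4, D5}, mirroring the walkthrough above
docs = np.array([[0.0, 0.0],   # D1
                 [0.1, 0.0],   # D2
                 [1.5, 1.5],   # D3
                 [2.0, 2.0],   # D4
                 [2.1, 2.0]])  # D5

# Each row of Z records one merge: (cluster a, cluster b, distance, size of new cluster)
Z = linkage(docs, method="single")
print(Z)

# With matplotlib installed, dendrogram(Z, labels=...) draws the tree; no_plot=True just builds it
tree = dendrogram(Z, labels=["D1", "D2", "D3", "D4", "D5"], no_plot=True)
print(tree["ivl"])  # leaf order in the dendrogram
```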

Merging: single link strategy
Similarity of two clusters = similarity of their two most similar members.
- Pro: good local coherence (high pairwise similarity)
- Con: "elongated" clusters (bad global coherence)

Merging: complete link strategy
Similarity of two clusters = similarity of their two most dissimilar members.
- better global coherence

Merging: group average strategy
Similarity of two clusters = average pairwise similarity between their members.
- a compromise between local & global coherence
- when using a vector-space representation with cosine similarity, the average similarity of a merged cluster C = C1 ∪ C2 can be computed directly from its children C1 & C2
- much more efficient than computing the average pairwise similarity over all document pairs in C1 × C2!
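
The three merging strategies differ only in how the pairwise similarities between two clusters' members are combined; a tiny sketch with invented similarity values:

```python
# Invented pairwise similarities between the members of two clusters
pairwise = [0.9, 0.7, 0.4, 0.2]

single_link   = max(pairwise)                  # most similar pair: 0.9
complete_link = min(pairwise)                  # most dissimilar pair: 0.2
group_average = sum(pairwise) / len(pairwise)  # compromise: 0.55
print(single_link, complete_link, group_average)
```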

Divisive clustering
- a kind of top-down hierarchical clustering
- also a greedy algorithm
1. start with a single cluster representing all documents
2. iteratively divide clusters:
   - split the cluster which is least coherent (the cluster whose elements are least similar to each other)
   - to split a cluster C into {C1, C2}, one can run agglomerative clustering over the elements of C!
   - therefore, computationally more expensive than the pure agglomerative method
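
A rough sketch of this top-down procedure; here the split uses 2-means from scikit-learn rather than agglomerative clustering, and the coherence measure and toy vectors are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(vectors, min_size=2):
    """Top-down clustering: repeatedly split the least coherent cluster in two."""
    clusters = [list(range(len(vectors)))]   # 1. a single cluster with all documents
    splits = []
    while True:
        # coherence proxy: mean distance of a cluster's members to their centroid
        def incoherence(c):
            pts = vectors[c]
            return float(np.mean(np.linalg.norm(pts - pts.mean(axis=0), axis=1)))
        splittable = [c for c in clusters if len(c) > min_size]
        if not splittable:
            return splits
        worst = max(splittable, key=incoherence)   # 2. pick the least coherent cluster
        # split it into two; 2-means is used here for brevity, but running
        # agglomerative clustering over the cluster's elements also works
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors[worst])
        left = [i for i, lab in zip(worst, labels) if lab == 0]
        right = [i for i, lab in zip(worst, labels) if lab == 1]
        clusters.remove(worst)
        clusters += [left, right]
        splits.append((worst, left, right))

# Example: five hypothetical 2-d document vectors
vecs = np.array([[0.0, 0.0], [0.1, 0.0], [1.5, 1.5], [2.0, 2.0], [2.1, 2.0]])
for parent, a, b in divisive(vecs):
    print(parent, "->", a, b)
```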