Hierarchical Clustering & Topic Models

Presentation transcript:

Hierarchical Clustering & Topic Models
Sampath Jayarathna, Cal Poly Pomona

Hierarchical Clustering Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents. One approach: recursive application of a partitional clustering algorithm, as in the sketch below. (Example taxonomy from the slide: animal splits into vertebrate and invertebrate; vertebrate into fish, reptile, amphibian, mammal; invertebrate into worm, insect, crustacean.)
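One way to realize the recursive-partitioning approach is bisecting k-means: repeatedly split the largest cluster in two with a flat algorithm. A minimal sketch using scikit-learn; the function name, toy data, and parameters are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, num_leaves=4, seed=0):
    """Recursively split the largest cluster with 2-means,
    yielding a flat view of the leaves of a binary taxonomy."""
    clusters = [np.arange(len(X))]                 # start: one cluster of all docs
    while len(clusters) < num_leaves:
        # pick the largest cluster and bisect it
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        members = clusters.pop(i)
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=seed).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.random.RandomState(0).rand(20, 5)           # toy document vectors
for c in bisecting_kmeans(X):
    print(sorted(c.tolist()))
```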

Hierarchical agglomerative clustering (HAC) Start with each document in a separate cluster, then repeatedly merge the two clusters that are most similar, until there is only one cluster. The history of merging is a hierarchy in the form of a binary tree. The standard way of depicting this history is a dendrogram. A naive sketch of the loop follows.
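A naive O(N^3) sketch of this merge loop over a precomputed similarity matrix; the `hac` function and toy matrix are illustrative assumptions (real implementations cache best-merge candidates):

```python
import numpy as np

def hac(sim, linkage=max):
    """Naive hierarchical agglomerative clustering.
    sim: N x N document-similarity matrix.
    linkage=max -> single link, linkage=min -> complete link.
    Returns the merge history as (cluster_a, cluster_b, similarity) triples."""
    clusters = {i: [i] for i in range(len(sim))}   # every doc starts alone
    merges = []
    while len(clusters) > 1:
        ids = list(clusters)
        best, best_sim = None, -np.inf
        for a in range(len(ids)):                  # find the most similar pair
            for b in range(a + 1, len(ids)):
                ci, cj = ids[a], ids[b]
                s = linkage(sim[x][y] for x in clusters[ci] for y in clusters[cj])
                if s > best_sim:
                    best, best_sim = (ci, cj), s
        ci, cj = best
        merges.append((ci, cj, best_sim))
        clusters[ci] = clusters[ci] + clusters.pop(cj)  # merge cj into ci
    return merges

sim = np.array([[1.0, 0.8, 0.2],
                [0.8, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
print(hac(sim, linkage=max))   # single link: merges (0, 1) first, then 2
```

Passing linkage=max gives single link and linkage=min gives complete link, both defined on the following slides.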

A dendrogram The history of mergers can be read off from bottom to top. The height of the horizontal line of each merger tells us how similar the two merged clusters were. We can cut the dendrogram at a particular similarity (e.g., at 0.1 or 0.4) to get a flat clustering.

What to do with the hierarchy? (Sec. 17.1) Use as is. Cut at a predetermined threshold. Cut to get a predetermined number of clusters K. Hierarchical clustering is often used to get K flat clusters; the hierarchy is then ignored. A sketch of both cuts follows.
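In practice the dendrogram is usually built and cut with a library. A minimal SciPy sketch of both kinds of cut; the toy vectors are made up, and note that SciPy works with distances (1 minus similarity) rather than similarities:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(0).rand(10, 5)           # toy document vectors
Z = linkage(X, method='complete', metric='cosine') # full merge history (dendrogram)

labels_k = fcluster(Z, t=3, criterion='maxclust')    # cut into K = 3 flat clusters
labels_t = fcluster(Z, t=0.4, criterion='distance')  # cut at cosine distance 0.4
print(labels_k)
print(labels_t)
```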

Closest pair of clusters (Sec. 17.2) There are many variants of defining the closest pair of clusters: Single-link: similarity of the two most cosine-similar documents, one from each cluster. Complete-link: similarity of the two "furthest" documents, the least cosine-similar pair. Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar. Average-link: average cosine similarity between pairs of elements. The sketch below computes all four for two toy clusters.
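The four variants, computed explicitly for two small clusters of unit-normalized document vectors. A sketch with invented vectors; note this averages cross-cluster pairs only, whereas the group-average variant discussed below also includes within-cluster pairs:

```python
import numpy as np

def normalize(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

rng = np.random.RandomState(1)
A = normalize(rng.rand(3, 4))          # cluster 1: three unit document vectors
B = normalize(rng.rand(4, 4))          # cluster 2: four unit document vectors

pair = A @ B.T                         # cosine similarity of every cross pair
print("single-link  :", pair.max())    # most similar pair
print("complete-link:", pair.min())    # least similar ("furthest") pair
print("average-link :", pair.mean())   # average over cross-cluster pairs

ca, cb = A.mean(axis=0), B.mean(axis=0)
print("centroid     :", ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))
```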

Cluster similarity: Example

Single-link: Maximum similarity

Complete-link: Minimum similarity

Centroid: Average inter-similarity (inter-similarity = similarity of two documents from different clusters)

Group average: Average intra-similarity (intra-similarity = similarity of any pair of documents, including pairs where both documents are in the same cluster)

Cluster similarity: Larger Example

Single-link: Maximum similarity

Complete-link: Minimum similarity

Centroid: Average intersimilarity

Group average: Average intrasimilarity

Single Link Agglomerative Clustering (Sec. 17.2) Use the maximum similarity of pairs: sim(ci, cj) = max over x in ci, y in cj of sim(x, y). Can result in "straggly" (long and thin) clusters due to a chaining effect. After merging ci and cj, the similarity of the resulting cluster to another cluster ck is: sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck)).

Complete Link (Sec. 17.2) Use the minimum similarity of pairs: sim(ci, cj) = min over x in ci, y in cj of sim(x, y). Makes "tighter", spherical clusters that are typically preferable. After merging ci and cj, the similarity of the resulting cluster to another cluster ck is: sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck)).

Exercise: Compute single and complete link clustering

Single Link Example (Sec. 17.2)

Single-link clustering

Complete Link Example (Sec. 17.2)

Complete link clustering

Single-link vs. complete-link clustering

Flat or hierarchical clustering? For high efficiency, use flat clustering (e.g., k-means). When a hierarchical structure is desired, use a hierarchical algorithm. HAC can also be applied when K cannot be determined in advance (it can start without knowing K).

Major issue in clustering: labeling. After a clustering algorithm finds a set of clusters, how can they be made useful to the end user? We need a simple label for each cluster. For example, in search-result clustering for "jaguar", the cluster labels could be "animal" and "car". Topic of this section: how can we automatically find good labels for clusters? This is often done by hand, but common automatic options are: use metadata such as titles; use the medoid document itself; use the top (most frequent) terms; use the most distinguishing terms. A sketch of the last option follows.
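A sketch of the "most distinguishing terms" idea: score each term by its weight in the cluster centroid minus its collection-wide average weight. The toy corpus and the use of scikit-learn here are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["jaguar speed cat habitat rainforest",
        "jaguar cat habitat rainforest prey",
        "jaguar car engine speed dealer",
        "car engine dealer price test drive"]

vec = TfidfVectorizer()
M = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names_out())

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(M)
overall = np.asarray(M.mean(axis=0)).ravel()        # collection-average term weight
for c in range(2):
    score = km.cluster_centers_[c] - overall        # heavy here, light elsewhere
    print(f"cluster {c} label candidates:", terms[np.argsort(score)[::-1][:3]])
```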

Topic Models in Text Processing

Overview Motivation: model the topics/subtopics in text collections. Basic assumptions: there are k topics in the whole collection; each topic is represented by a multinomial distribution over the vocabulary (a language model); each document can cover multiple topics. Applications: summarizing topics; predicting topic coverage for documents; modeling topic correlations; classification and clustering. A generative sketch of these assumptions follows.
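The assumptions above define a generative story. A minimal sketch that samples one document from k = 2 topic-word multinomials and a per-document topic mixture; all numbers here are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["retrieval", "information", "model", "government", "response", "city"]

# k = 2 topics, each a multinomial distribution over the vocabulary
topics = np.array([[0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # an "IR" topic
                   [0.0, 0.0, 0.1, 0.4, 0.3, 0.2]])  # a "politics" topic
theta = np.array([0.7, 0.3])    # this document: 70% topic 1, 30% topic 2

words = []
for _ in range(10):                          # generate a 10-word document
    z = rng.choice(2, p=theta)               # choose a topic for this position
    w = rng.choice(len(vocab), p=topics[z])  # choose a word from that topic
    words.append(vocab[w])
print(" ".join(words))
```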

Basic Topic Models: unigram model; mixture of unigrams; probabilistic LSI (pLSA); Latent Dirichlet Allocation (LDA); correlated topic models.

What is a "topic"? A broad concept/theme, semantically coherent, hidden in documents, e.g., politics, sports, technology, entertainment, education. Representation: a probability distribution over words, e.g.: retrieval 0.2, information 0.15, model 0.08, query 0.07, language 0.06, feedback 0.03, ...

Document as a mixture of topics
Example topic-word distributions:
Topic 1: government 0.3, response 0.2, ...
Topic 2: city 0.2, new 0.1, orleans 0.05, ...
Topic k: donate 0.1, relief 0.05, help 0.02, ...
Background: is 0.05, the 0.04, a 0.03, ...
Example document, with spans attributable to different topics in brackets: "[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. ... 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] ... [Over seventy countries pledged monetary donations or other assistance]."
How can we discover these topic-word distributions? Many applications would be enabled by discovering such topics: summarize themes/aspects; facilitate navigation/browsing; retrieve documents; segment documents; many other text mining tasks. A fitting sketch follows.
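One standard answer to "how can we discover these distributions?" is to fit LDA, introduced on the next slide. A minimal scikit-learn sketch on a toy stand-in corpus, with hyperparameters left at defaults:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["government response hurricane storm response delayed",
        "city new orleans flooding residents evacuated",
        "countries pledged donations relief help assistance",
        "government storm response criticism delayed flooding"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = np.array(vec.get_feature_names_out())
for t, dist in enumerate(lda.components_):    # (unnormalized) topic-word weights
    print(f"topic {t}:", terms[np.argsort(dist)[::-1][:4]])
print(lda.transform(X))                       # per-document topic coverage
```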

Latent Dirichlet Allocation

Topics learned by LDA

Topic assignments in document, based on the topics shown on the previous slide.

Final word In clustering, clusters are inferred from the data without human input (unsupervised learning). However, in practice it is a bit less clear-cut: there are many ways of influencing the outcome of clustering, such as the number of clusters, the similarity measure, and the representation of documents.