Introduction to Information Retrieval Lecture 17: Hierarchical Clustering 1

Introduction to Information Retrieval 2 Hierarchical clustering Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier for the Reuters collection. We want to create this hierarchy automatically. We can do this either top-down or bottom-up. The best known bottom-up method is hierarchical agglomerative clustering. 2

Introduction to Information Retrieval Overview ❶ Recap ❷ Introduction ❸ Single-link/ Complete-link ❹ Centroid/ GAAC ❺ Variants ❻ Labeling clusters 3

Introduction to Information Retrieval Outline ❶ Recap ❷ Introduction ❸ Single-link/ Complete-link ❹ Centroid/ GAAC ❺ Variants ❻ Labeling clusters 4

Introduction to Information Retrieval 5 Top-down vs. Bottom-up  Divisive clustering:  Start with all docs in one big cluster  Then recursively split clusters  Eventually each node forms a cluster on its own.  → Bisecting K-means  Not included in this lecture. 5

Introduction to Information Retrieval 6 Hierarchical agglomerative clustering (HAC)  HAC creates a hierarchy in the form of a binary tree.  Assumes a similarity measure for determining the similarity of two clusters.  Start with each document in a separate cluster.  Then repeatedly merge the two clusters that are most similar until there is only one cluster.  The history of merging is a hierarchy in the form of a binary tree.  The standard way of depicting this history is a dendrogram. 6

Introduction to Information Retrieval 7 A dendrogram  The vertical line of each merger tells us what the similarity of the merger was.  We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering. 7
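A minimal sketch of cutting a dendrogram to get a flat clustering, using SciPy's hierarchy module. Note that SciPy works with distances rather than similarities, so the threshold below is a distance value, not one of the similarity values mentioned above; the toy data and the threshold are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
docs = rng.random((10, 5))                 # 10 toy "documents" with 5 features

Z = linkage(docs, method='complete')       # merge history (the dendrogram)
labels = fcluster(Z, t=0.9, criterion='distance')  # cut the tree at distance 0.9
print(labels)                              # flat cluster id for each document
```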

Introduction to Information Retrieval 8 Naive HAC algorithm 8
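A minimal Python sketch of the naive algorithm described above, with the cluster-similarity measure left as a parameter; the name cluster_sim and the overall structure are illustrative, not the lecture's pseudocode.

```python
import numpy as np

def naive_hac(S, cluster_sim):
    """Naive HAC over an N x N document-similarity matrix S.

    cluster_sim(A, B, S) returns the similarity of two clusters, each given as
    a list of document indices. Returns the merge history as a list of
    (cluster_a, cluster_b, similarity) triples.
    """
    clusters = [[i] for i in range(S.shape[0])]
    merges = []
    while len(clusters) > 1:
        # scan all pairs of current clusters for the most similar pair
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cluster_sim(clusters[a], clusters[b], S)
                if best is None or s > best[2]:
                    best = (a, b, s)
        a, b, s = best
        merges.append((clusters[a], clusters[b], s))
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return merges
```

With cluster_sim = lambda A, B, S: S[np.ix_(A, B)].max() this is single-link; the other criteria discussed below plug in the same way.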

Introduction to Information Retrieval 9 Computational complexity of the naive algorithm  First, we compute the similarity of all N × N pairs of documents.  Then, in each of N iterations:  We scan the O(N × N) similarities to find the maximum similarity.  We merge the two clusters with maximum similarity.  We compute the similarity of the new cluster with all other (surviving) clusters.  There are O(N) iterations, each performing an O(N × N) “scan” operation.  Overall complexity is O(N³).  We’ll look at more efficient algorithms later. 9

Introduction to Information Retrieval 10 Key question: How to define cluster similarity  Single-link: Maximum similarity  Maximum similarity of any two documents  Complete-link: Minimum similarity  Minimum similarity of any two documents  Centroid: Average “intersimilarity”  Average similarity of all document pairs (but excluding pairs of docs in the same cluster)  This is equivalent to the similarity of the centroids.  Group-average: Average “intrasimilarity”  Average similarity of all document pairs, including pairs of docs in the same cluster 10
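A hedged sketch of these four criteria as functions that plug into the naive algorithm above; S is a pairwise document-similarity matrix, X a matrix of document vectors, and A, B are lists of document indices (all names illustrative).

```python
import numpy as np

def single_link(A, B, S):
    # maximum similarity of any document in A to any document in B
    return float(S[np.ix_(A, B)].max())

def complete_link(A, B, S):
    # minimum similarity of any document in A to any document in B
    return float(S[np.ix_(A, B)].min())

def centroid_sim(A, B, X):
    # average intersimilarity; with dot-product similarity this equals
    # the similarity of the two centroids
    return float(X[A].mean(axis=0) @ X[B].mean(axis=0))

def group_average(A, B, S):
    # average similarity over all pairs in A ∪ B, excluding self-similarities
    both = list(A) + list(B)
    sub = S[np.ix_(both, both)]
    n = len(both)
    return float((sub.sum() - np.trace(sub)) / (n * (n - 1)))
```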

Introduction to Information Retrieval 11 Cluster similarity: Example 11

Introduction to Information Retrieval 12 Single-link: Maximum similarity 12

Introduction to Information Retrieval 13 Complete-link: Minimum similarity 13

Introduction to Information Retrieval 14 Centroid: Average intersimilarity intersimilarity = similarity of two documents in different clusters 14

Introduction to Information Retrieval 15 Group average: Average intrasimilarity intrasimilarity = similarity of any pair, including cases where the two documents are in the same cluster 15

Introduction to Information Retrieval 16 Cluster similarity: Larger Example 16

Introduction to Information Retrieval 17 Single-link: Maximum similarity 17

Introduction to Information Retrieval 18 Complete-link: Minimum similarity 18

Introduction to Information Retrieval 19 Centroid: Average intersimilarity 19

Introduction to Information Retrieval 20 Group average: Average intrasimilarity 20

Introduction to Information Retrieval Outline ❶ Recap ❷ Introduction ❸ Single-link/ Complete-link ❹ Centroid/ GAAC ❺ Variants ❻ Labeling clusters 21

Introduction to Information Retrieval 22 Single link HAC  The similarity of two clusters is the maximum intersimilarity – the maximum similarity of a document from the first cluster and a document from the second cluster.  Once we have merged two clusters, we update the similarity matrix with this: SIM (ω i, (ω k1 ∪ ω k2 )) = max( SIM (ω i, ω k1 ), SIM (ω i, ω k2 )) 22

Introduction to Information Retrieval 23 This dendogram was produced by single-link  Notice: many small clusters (1 or 2 members) being added to the main cluster  There is no balanced 2- cluster or 3-cluster clustering that can be derived by cutting the dendrogram. 23

Introduction to Information Retrieval 24 Complete link HAC  The similarity of two clusters is the minimum intersimilarity – the minimum similarity of a document from the first cluster and a document from the second cluster.  Once we have merged two clusters, we update the similarity matrix with this: SIM( ω i, (ω k1 ∪ ω k2 )) = min( SIM (ω i, ω k1 ), SIM (ω i, ω k2 ))  We measure the similarity of two clusters by computing the diameter of the cluster that we would get if we merged them. 24

Introduction to Information Retrieval 25 Complete-link dendrogram  Notice that this dendrogram is much more balanced than the single-link one.  We can create a 2-cluster clustering with two clusters of about the same size. 25

Introduction to Information Retrieval 26 Exercise: Compute single and complete link clustering 26

Introduction to Information Retrieval 27 Single-link vs. Complete link clustering 27

Introduction to Information Retrieval 28 Single-link: Chaining Single-link clustering often produces long, straggly clusters. For most applications, these are undesirable. 28

Introduction to Information Retrieval 29 Complete-link: Sensitivity to outliers  The complete-link clustering of this set splits d₂ from its right neighbors – clearly undesirable.  The reason is the outlier d₁.  This shows that a single outlier can negatively affect the outcome of complete-link clustering.  Single-link clustering does better in this case. 29

Introduction to Information Retrieval Outline ❶ Recap ❷ Introduction ❸ Single-link/ Complete-link ❹ Centroid/ GAAC ❺ Variants ❻ Labeling clusters 30

Introduction to Information Retrieval 31 Centroid HAC  The similarity of two clusters is the average intersimilarity – the average similarity of documents from the first cluster with documents from the second cluster.  A natural implementation of centroid HAC is to compute the similarity of the centroids.  Hence the name: centroid HAC 31
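A small check of this equivalence, assuming dot-product similarity between (for example, length-normalized tf-idf) document vectors; the arrays below are toy data.

```python
import numpy as np

def centroid_similarity(Xa, Xb):
    """Similarity of two clusters as the dot product of their centroids."""
    return float(Xa.mean(axis=0) @ Xb.mean(axis=0))

# Under dot-product similarity this equals the average pairwise intersimilarity.
rng = np.random.default_rng(1)
Xa, Xb = rng.random((3, 4)), rng.random((5, 4))
avg_pairwise = np.mean([a @ b for a in Xa for b in Xb])
assert np.isclose(centroid_similarity(Xa, Xb), avg_pairwise)
```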

Introduction to Information Retrieval 32 Example: Compute centroid clustering 32

Introduction to Information Retrieval 33 Centroid clustering 33

Introduction to Information Retrieval 34 The Inversion in centroid clustering  In an inversion, the similarity increases during a merge sequence. Results in an “inverted” dendrogram.  Below: the similarity of the first merger (d₁ ∪ d₂) is −4.0, while the similarity of the second merger ((d₁ ∪ d₂) ∪ d₃) is higher.
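A tiny runnable illustration with SciPy. SciPy works with distances rather than similarities, so an inversion shows up as a merge distance that decreases; the three points below are an assumed near-equilateral example, not the one from the slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, is_monotonic

points = np.array([[0.0, 0.0], [1.05, 0.0], [0.55, 0.95]])  # near-equilateral triangle

Z = linkage(points, method='centroid')
print(Z[:, 2])          # merge distances: the second merge is closer than the first
print(is_monotonic(Z))  # False -> the dendrogram contains an inversion
```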

Introduction to Information Retrieval 35 Group-average agglomerative clustering (GAAC)  The similarity of two clusters is the average intrasimilarity – the average similarity of all document pairs (including those from the same cluster).  But we exclude self-similarities.  The similarity is defined as: 35
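A formula consistent with this definition, assuming dot-product similarity between document vectors, with N_i and N_j the sizes of the two clusters:

```latex
\mathrm{SIM\text{-}GA}(\omega_i, \omega_j) =
  \frac{1}{(N_i + N_j)(N_i + N_j - 1)}
  \sum_{d_m \in \omega_i \cup \omega_j}
  \sum_{\substack{d_n \in \omega_i \cup \omega_j \\ d_n \neq d_m}}
  \vec{d}_m \cdot \vec{d}_n
```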

Introduction to Information Retrieval 36 Comparison of HAC algorithms 36

method         | combination similarity              | comment
single-link    | max intersimilarity of any 2 docs   | chaining effect
complete-link  | min intersimilarity of any 2 docs   | sensitive to outliers
group-average  | average of all sims                 | best choice for most applications
centroid       | average intersimilarity             | inversions can occur

Introduction to Information Retrieval Outline ❶ Recap ❷ Introduction ❸ Single-link/ Complete-link ❹ Centroid/ GAAC ❺ Variants ❻ Labeling clusters 37

Introduction to Information Retrieval 38 Efficient HAC 38

Introduction to Information Retrieval 39 Time complexity of HAC  The algorithm we looked at earlier is O(N³).  There is no known O(N²) algorithm for complete-link, centroid and GAAC.  Best time complexity for these three is O(N² log N). 39
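Single-link, by contrast, admits an O(N²) implementation by tracking each cluster's next-best merge partner. A hedged sketch of that idea (not the priority-queue algorithm from the slides' figure; the −∞ bookkeeping and names are illustrative):

```python
import numpy as np

def single_link_hac_nbm(S):
    """Single-link HAC in roughly O(N^2) using a next-best-merge (NBM) array.

    S is an N x N document-similarity matrix; returns (i, j, similarity)
    merge triples, where the merged cluster keeps index i.
    """
    n = S.shape[0]
    sim = S.astype(float).copy()
    np.fill_diagonal(sim, -np.inf)
    active = np.ones(n, dtype=bool)
    nbm = sim.argmax(axis=1)                   # best merge partner per cluster
    merges = []
    for _ in range(n - 1):
        # best pair among active clusters: O(N)
        i = max((k for k in range(n) if active[k]), key=lambda k: sim[k, nbm[k]])
        j = int(nbm[i])
        merges.append((i, j, float(sim[i, j])))
        # single-link update: similarity to the merged cluster is the max, O(N)
        sim[i, :] = np.maximum(sim[i, :], sim[j, :])
        sim[:, i] = sim[i, :]
        sim[i, i] = -np.inf
        active[j] = False
        sim[j, :] = -np.inf
        sim[:, j] = -np.inf
        nbm[i] = int(sim[i, :].argmax())
        # repair other clusters' best partners in O(N): for single-link the
        # merged cluster can only become a better (never a worse) partner
        for k in range(n):
            if active[k] and k != i and (nbm[k] == j or sim[k, i] > sim[k, nbm[k]]):
                nbm[k] = i
    return merges
```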

Introduction to Information Retrieval Outline ❶ Recap ❷ Introduction ❸ Single-link/ Complete-link ❹ Centroid/ GAAC ❺ Variants ❻ Labeling clusters 40

Introduction to Information Retrieval 41 Discriminative labeling  To label cluster ω, compare ω with all other clusters.  Find terms or phrases that distinguish ω from the other clusters.  We can use any of the feature selection criteria we introduced in text classification to identify discriminating terms, for example mutual information. 41
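A small sketch of scoring one term for one cluster with mutual information over boolean document vectors, with add-one smoothing to keep the logarithm defined (all names illustrative); scoring every term this way and keeping the top-scoring ones gives the discriminative label.

```python
import numpy as np

def term_cluster_mi(in_cluster, has_term):
    """Mutual information between cluster membership and term occurrence.

    Both arguments are boolean arrays over the documents; the +1/+2/+4 terms
    implement add-one smoothing of the 2x2 contingency table.
    """
    n = len(in_cluster)
    mi = 0.0
    for c in (True, False):
        for t in (True, False):
            p_ct = (np.sum((in_cluster == c) & (has_term == t)) + 1) / (n + 4)
            p_c = (np.sum(in_cluster == c) + 2) / (n + 4)
            p_t = (np.sum(has_term == t) + 2) / (n + 4)
            mi += p_ct * np.log2(p_ct / (p_c * p_t))
    return mi
```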

Introduction to Information Retrieval 42 Non-discriminative labeling  Select terms or phrases based solely on information from the cluster itself.  Terms with high weights in the centroid. (if we are using a vector space model)  Non-discriminative methods sometimes select frequent terms that do not distinguish clusters.  For example, MONDAY, TUESDAY,... in newspaper text 42
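A sketch of the centroid-based variant, assuming a tf-idf document-term matrix for the cluster and a parallel list of vocabulary terms (both names are illustrative).

```python
import numpy as np

def top_centroid_terms(X_cluster, vocab, k=10):
    """Return the k terms with the highest weight in the cluster centroid."""
    centroid = np.asarray(X_cluster.mean(axis=0)).ravel()
    return [vocab[i] for i in np.argsort(-centroid)[:k]]
```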

Introduction to Information Retrieval 43 Using titles for labeling clusters  Terms and phrases are hard to scan and condense into a holistic idea of what the cluster is about.  Alternative: titles  For example, the titles of two or three documents that are closest to the centroid.  Titles are easier to scan than a list of phrases. 43

Introduction to Information Retrieval 44 Cluster labeling: Example 44

#  | docs | centroid terms                                                          | mutual information terms                                                | title
4  | 622  | oil plant mexico production crude power 000 refinery gas bpd           | plant oil production barrels crude bpd mexico dolly capacity petroleum | MEXICO: Hurricane Dolly heads for Mexico coast
9  | 1017 | police security russian people military peace killed told grozny court | police killed military security peace told troops forces rebels people | RUSSIA: Russia’s Lebed meets rebel chief in Chechnya
   |      | tonnes traders futures wheat prices cents september tonne              | delivery traders futures tonne tonnes desk wheat prices                | USA: Export Business - Grain/oilseeds complex

 Apply three labeling methods to a K-means clustering.  Three methods: most prominent terms in centroid, differential labeling using MI, title of doc closest to centroid.  All three methods do a pretty good job.