CSCI 5417 Information Retrieval Systems Jim Martin Lecture 15 10/13/2011

Today (10/13): More Clustering
- Finish flat clustering
- Hierarchical clustering

K-Means
- Assumes documents are real-valued vectors.
- Clusters are based on the centroids (aka the center of gravity, or mean) of the points in a cluster c:

      μ(c) = (1/|c|) Σ_{x ∈ c} x

- Reassignment of instances to clusters is iterative, based on distance to the current cluster centroids. (One can equivalently phrase it in terms of similarities.)

K-Means Algorithm
Select K random docs {s_1, s_2, ..., s_K} as seeds.
Until the stopping criterion is met:
    For each doc d_i:
        Assign d_i to the cluster c_j such that dist(d_i, s_j) is minimal.
    For each cluster c_j:
        s_j = μ(c_j)   (recompute each seed as its cluster's centroid)
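As a concrete illustration, here is a minimal NumPy sketch of this loop. This is not the course's code: the array shapes, Euclidean distance, and random seeding are assumptions made for the example.

```python
import numpy as np

def kmeans(docs, k, max_iters=100, seed=0):
    """Minimal k-means; docs is an (n_docs, n_dims) array of doc vectors."""
    rng = np.random.default_rng(seed)
    # Select K random docs as seeds (initial centroids).
    centroids = docs[rng.choice(len(docs), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each doc d_i to the cluster with the nearest centroid.
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the docs assigned to it.
        new_centroids = np.array([
            docs[assign == j].mean(axis=0) if (assign == j).any() else centroids[j]
            for j in range(k)
        ])
        # Stopping criterion: centroid positions no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign, centroids
```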

K-Means Example (K=2)
[Figure: pick seeds → assign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged.]

Termination conditions
Several possibilities:
- A fixed number of iterations
- The doc partition is unchanged
- Centroid positions don't change

Convergence (Sec. 16.4)
- Why should the K-means algorithm ever reach a fixed point, i.e., a state in which the clusters don't change?
- K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm, and EM is known to converge.
- The number of iterations could be large, but in practice it usually isn't.

Seed Choice (Sec. 16.4)
- Results can vary based on random seed selection; some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
- Remedies:
  - Select good seeds using a heuristic, e.g., pick the doc least similar to any existing seed (see the sketch below)
  - Try out multiple starting points
  - Initialize with the results of another method
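A minimal sketch of that first heuristic ("doc least similar to any existing seed", i.e., a furthest-first traversal); cosine similarity and the choice of starting doc are my assumptions:

```python
import numpy as np

def furthest_first_seeds(docs, k):
    """Greedy seeding: each new seed is the doc with the lowest
    maximum cosine similarity to the seeds chosen so far."""
    unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    seeds = [0]  # start from an arbitrary doc
    while len(seeds) < k:
        sims = unit @ unit[seeds].T       # (n_docs, n_seeds) cosine sims
        nearest = sims.max(axis=1)        # similarity to the closest seed
        seeds.append(int(nearest.argmin()))
    return docs[seeds]
```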

Exercise: do this with K=2. [Figure omitted.]

Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
[Figure: example taxonomy with root "animal", branching into "vertebrate" (fish, reptile, amphibian, mammal) and "invertebrate" (worm, insect, crustacean).]

Dendrogram: Hierarchical Clustering
A traditional flat partition is obtained by cutting the dendrogram at a desired level: each connected component then forms a cluster.

Break: Past HW
- Best score on part 2 is .437
- Best approaches:
  - Multifield indexing of title/keywords/abstract
  - Stemming: Snowball (English), Porter
  - Tuning the stop list
  - Ensembles (voting)
- Mixed results:
  - Boosts
  - Relevance feedback

Descriptions
For the most part, your approaches were pretty weak (or your descriptions of them were):
- Failure to report R-Precision
- No systematic approach: "X didn't work," with no attention to interactions between approaches
- Lack of details: "I used relevance feedback and it gave me Z," "I changed the stop list," "I boosted the title field," etc.

Next HW
- Due 10/25
- I have a new, untainted test set, so don't worry about checking for the test document; it won't be there.

Hierarchical Clustering Algorithms
- Agglomerative (bottom-up): Start with each document as a singleton cluster; eventually all documents belong to the same cluster.
- Divisive (top-down): Start with all documents in one cluster; eventually each document forms a cluster on its own.
- Neither requires the number of clusters k to be known in advance, but both need a cutoff or threshold condition.

Hierarchical -> Partition
- Run the algorithm to completion, then take a slice across the tree at some level; this produces a partition (a sketch follows).
- Alternatively, insert an early stopping condition into either the top-down or bottom-up algorithm.
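For instance, with SciPy (my choice of tool, not the lecture's) you can run HAC to completion and then slice the resulting tree into a flat partition:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy document vectors, one row per doc.
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# Run agglomerative clustering to completion (builds the full dendrogram).
tree = linkage(docs, method="complete", metric="cosine")

# Take a slice across the tree: cut the dendrogram into k=2 clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # e.g., [1 1 2 2]
```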

Hierarchical Agglomerative Clustering (HAC)
- Assumes a similarity function for determining the similarity of two instances and of two clusters.
- Starts with all instances in separate clusters, then repeatedly joins the two clusters that are most similar until there is only one cluster.
- The history of merging forms a binary tree or hierarchy.

Hierarchical Clustering
Key problem: as you build clusters, how do you represent each cluster so that you can tell which pair of clusters is closest?

"Closest pair" in Clustering
There are many variants of defining the closest pair of clusters:
- Single-link: similarity of the most cosine-similar pair (one element from each cluster)
- Complete-link: similarity of the "furthest" points, i.e., the least cosine-similar pair
- "Center of gravity": clusters whose centroids (centers of gravity) are the most cosine-similar
- Average-link: average cosine between all pairs of elements

Single Link Agglomerative Clustering
Use the maximum similarity over pairs:

      sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)

Can result in "straggly" (long and thin) clusters due to the chaining effect. After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:

      sim(c_i ∪ c_j, c_k) = max(sim(c_i, c_k), sim(c_j, c_k))

Single Link Example

Complete Link Agglomerative Clustering
Use the minimum similarity over pairs:

      sim(c_i, c_j) = min_{x ∈ c_i, y ∈ c_j} sim(x, y)

Makes "tighter," more spherical clusters that are typically preferable. After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:

      sim(c_i ∪ c_j, c_k) = min(sim(c_i, c_k), sim(c_j, c_k))
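A from-scratch sketch of HAC with exactly these update rules (the precomputed similarity-matrix representation is my assumption; this naive version is O(n^3)):

```python
import numpy as np

def hac(sim, linkage="single"):
    """Naive HAC over a doc-doc similarity matrix.
    linkage='single' combines with max; linkage='complete' with min."""
    combine = max if linkage == "single" else min
    sim = sim.astype(float).copy()
    np.fill_diagonal(sim, -np.inf)       # a cluster never merges with itself
    active = list(range(len(sim)))
    merges = []
    while len(active) > 1:
        # Find the closest (most similar) pair of active clusters.
        i, j = max(((a, b) for a in active for b in active if a < b),
                   key=lambda ab: sim[ab[0], ab[1]])
        merges.append((i, j))
        # Update rule: sim(c_i ∪ c_j, c_k) = combine(sim(c_i,c_k), sim(c_j,c_k)).
        for k in active:
            if k not in (i, j):
                sim[i, k] = sim[k, i] = combine(sim[i, k], sim[j, k])
        active.remove(j)                 # c_j is absorbed into c_i
    return merges
```

On row-normalized doc vectors, `hac(unit @ unit.T, linkage="complete")` would record the merge history under complete link.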

Complete Link Example

Misc. Clustering Topics
- Clustering terms
- Clustering people
- Feature selection
- Labeling clusters

Term vs. document space
- So far, we have clustered docs based on their similarities in term space.
- For some applications, e.g., topic analysis for inducing navigation structures, you can "dualize" (see the sketch after this list):
  - Use docs as axes
  - Represent (some) terms as vectors
  - Cluster terms, not docs
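Concretely, the dualization is just a transpose of the term-document matrix; a tiny sketch reusing the `kmeans()` function defined earlier (the weights here are made up):

```python
import numpy as np

# term_doc[t, d] = weight of term t in doc d: docs are the axes,
# terms are the vectors to be clustered.
term_doc = np.array([[2.0, 0.0, 1.0],
                     [1.0, 0.0, 2.0],
                     [0.0, 3.0, 0.0]])

# Cluster the rows (terms); term_doc.T would give us back doc clustering.
assign, centroids = kmeans(term_doc, k=2)
print(assign)  # terms grouped by their doc co-occurrence profiles
```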

Clustering people
- Take documents (pages) containing mentions of ambiguous names and partition the documents into bins with identical referents.
- SemEval Web People Search Task: given a name as a query to Google, cluster the top 100 results so that each cluster corresponds to a real individual out in the world.

Labeling clusters
- After a clustering algorithm finds clusters, how can they be made useful to the end user?
- We need a pithy label for each cluster: in search results, say, "Animal" or "Car" in the ambiguous "jaguar" example.

How to Label Clusters
- Show titles of typical documents
  - Titles are easy to scan; authors create them for quick scanning
  - But you can only show a few titles, which may not fully represent the cluster
- Show words/phrases prominent in the cluster
  - More likely to fully represent the cluster
  - Use distinguishing words/phrases (differential labeling)
  - But harder to scan

Labeling
- Common heuristic: list the 5-10 most frequent terms in the centroid vector (drop stop-words; stem). See the sketch below.
- Differential labeling by frequent terms: within a collection about "computers," all clusters have the word "computer" as a frequent term, so use discriminant analysis of the centroids instead.
- Perhaps better: distinctive noun phrases (requires NP chunking).
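A minimal sketch of that first heuristic (the vocabulary, stop list, and weights below are hypothetical):

```python
import numpy as np

def label_cluster(centroid, vocab, stopwords, top_n=5):
    """Label a cluster with its top-n highest-weight non-stopword terms."""
    order = np.argsort(centroid)[::-1]    # term indices by descending weight
    terms = [vocab[i] for i in order if vocab[i] not in stopwords]
    return ", ".join(terms[:top_n])

vocab = ["the", "jaguar", "engine", "speed", "cat"]
centroid = np.array([0.9, 0.7, 0.6, 0.5, 0.1])
print(label_cluster(centroid, vocab, stopwords={"the"}, top_n=3))
# -> jaguar, engine, speed
```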

Summary
- In clustering, clusters are inferred from the data without human input (unsupervised learning).
- In practice it's a bit less clear-cut, since there are many ways of influencing the outcome of clustering: the number of clusters, the similarity measure, the representation of documents, ...