807 - TEXT ANALYTICS
Massimo Poesio, Lecture 2: Text (Document) Clustering


Reminder: Clustering Or, discovering similarities between objects Individuals, Documents, … Applications: Recommender systems Document organization

Recommending: restaurants We have a list of all Wivenhoe restaurants, with positive and negative ratings for some of them, as provided by some Uni Essex students / staff. Which restaurant(s) should I recommend to you?

Input

Algorithm 0 Recommend to you the most popular restaurants, say by # positive votes minus # negative votes. This ignores your culinary preferences, and the judgements of those with similar preferences. How can we exploit the wisdom of “like-minded” people?

Another look at the input - a matrix

Now that we have a matrix View all other entries as zeros for now.

PREFERENCE-DEFINED DATA SPACE

Similarity between two people Similarity between their preference vectors. Inner products are a good start. Dave has similarity 3 with Estie but -2 with Cindy. Perhaps recommend Black Buoy to Dave and Bakehouse to Bob, etc.
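A minimal sketch of this idea in Python; the names and ratings below are invented for illustration and are not the matrix from the slides:

```python
# Hypothetical preference vectors: +1 = liked, -1 = disliked, 0 = no rating.
prefs = {
    "Dave":  [ 1,  1,  0, -1,  1],
    "Estie": [ 1,  1,  1,  0,  0],
    "Cindy": [-1,  0,  0,  1, -1],
}

def similarity(u, v):
    """Similarity of two people = inner product of their preference vectors."""
    return sum(a * b for a, b in zip(u, v))

print(similarity(prefs["Dave"], prefs["Estie"]))   # positive: like-minded
print(similarity(prefs["Dave"], prefs["Cindy"]))   # negative: unlike minds
```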

Algorithm 1.1 You give me your preferences and I need to give you a recommendation. I find the person “most similar” to you in my database and recommend something he likes. Aspects to consider: No attempt to discern cuisines, etc. What if you’ve been to all the restaurants he has? Do you want to rely on one person’s opinions?

Algorithm 1.k You give me your preferences and I need to give you a recommendation. I find the k people “most similar” to you in my database and recommend what’s most popular amongst them. Issues: A priori unclear what k should be Risks being influenced by “unlike minds”

Slightly more sophisticated attempt Group similar users together into clusters You give your preferences and seek a recommendation, then Find the “nearest cluster” (what’s this?) Recommend the restaurants most popular in this cluster Features: avoids data sparsity issues still no attempt to discern why you’re recommended what you’re recommended how do you cluster?

CLUSTERS Can cluster, e.g., Dave and Estie together, Alice and Bob together, … Fred …, Cindy …

How do you cluster? Must keep similar people together in a cluster Separate dissimilar people Factors: Need a notion of similarity/distance Vector space? Normalization? How many clusters? Fixed a priori? Completely data driven? Avoid “trivial” clusters - too large or small

Looking beyond Clustering people for restaurant recommendations Clustering other things (documents, web pages) Other approaches to recommendation Amazon.com General unsupervised machine learning.

Text clustering Search results clustering Document clustering

Navigating search results Given the results of a search (say jaguar), partition into groups of related docs sense disambiguation Approach followed by Uni Essex site Kruschwitz / al Bakouri / Lungley Other examples: IBM InfoSphere Data Explorer (was: vivisimo.com)

Results list clustering example
Cluster 1: Jaguar Motor Cars’ home page; Mike’s XJS resource page; Vermont Jaguar owners’ club
Cluster 2 (Big cats): My summer safari trip; Pictures of jaguars, leopards and lions
Cluster 3: Jacksonville Jaguars’ Home Page; AFC East Football Teams

Search results clustering: example

Why cluster documents? For improving recall in search applications For speeding up vector space retrieval Corpus analysis/navigation Sense disambiguation in search results

Improving search recall Cluster hypothesis - Documents with similar text are related Ergo, to improve search recall: Cluster docs in corpus a priori When a query matches a doc D, also return other docs in the cluster containing D Hope: docs containing automobile returned on a query for car because clustering grouped together docs containing car with those containing automobile. Why might this happen?

Speeding up vector space retrieval In vector space retrieval, must find nearest doc vectors to query vector This would entail finding the similarity of the query to every doc - slow! By clustering docs in corpus a priori find nearest docs in cluster(s) close to query inexact but avoids exhaustive similarity computation Exercise: Make up a simple example with points on a line in 2 clusters where this inexactness shows up.

Corpus analysis/navigation Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics Allows user to browse through corpus to home in on information Crucial need: meaningful labels for topic nodes.

CLUSTERING DOCUMENTS IN A (VERY) LARGE COLLECTION: GOOGLE NEWS

CLUSTERING DOCUMENTS IN A VERY LARGE COLLECTION: JRC’S NEWS EXPLORER

What makes docs “related”? Ideal: semantic similarity. Practical: statistical similarity We will use cosine similarity. Docs as vectors. For many algorithms, easier to think in terms of a distance (rather than similarity) between docs. We will describe algorithms in terms of cosine similarity.

DOCUMENTS AS BAGS OF WORDS
INDEX: broad, may, rally, rallied, signal, stock, stocks, tech, technology, traders, traders, trend
DOCUMENT: broad tech stock rally may signal trend - traders. technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.
Notice that traders occurs twice in the document, so it is listed twice in the index.

Doc as vector Each doc j is a vector of tfidf values, one component for each term. Can normalize to unit length. So we have a vector space terms are axes - aka features n docs live in this space even with stemming, may have 10000+ dimensions do we really want to use all terms?

TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE The weight of term i in document k combines tf_ik, the frequency of term i in document k, with idf_i = log(N / n_i), where n_i is the number of documents containing term i and N is the total number of documents: w_ik = tf_ik * log(N / n_i). Alternatives: t-score, chi-square.
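A minimal sketch of tf.idf weighting in Python over a toy corpus; the documents below are invented purely for illustration:

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "broad tech stock rally may signal trend traders".split(),
    "technology stocks rallied on tuesday traders called it a recovery".split(),
    "pictures of jaguars leopards and lions".split(),
]

N = len(docs)
df = Counter()                       # df[t] = number of documents containing term t
for d in docs:
    df.update(set(d))                # count each term at most once per document

def tfidf(doc):
    """tf.idf weight for each term in one document: tf_ik * log(N / df_i)."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[0]))
```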

Intuition (figure: documents D1 … D4 as points in a space with term axes t1, t2, t3) Postulate: Documents that are “close together” in vector space talk about the same things.

Cosine similarity
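A minimal Python sketch of cosine similarity between two term-weight vectors; for unit-length vectors it reduces to the inner product:

```python
import math

def cosine(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 1.0]))   # ~0.98: nearly the same direction
```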

Two flavors of clustering Given n docs and a positive integer k, partition docs into k (disjoint) subsets. Given docs, partition into an “appropriate” number of subsets. E.g., for query results - ideal value of k not known up front - though UI may impose limits. Can usually take an algorithm for one flavor and convert to the other.

Thought experiment Consider clustering a large set of computer science documents what do you expect to see in the vector space?

Thought experiment Consider clustering a large set of computer science documents what do you expect to see in the vector space? (Figure: blobs of documents corresponding to areas such as NLP, Graphics, AI, Theory and Architecture.)

Decision boundaries Could we use these blobs (NLP, Graphics, AI, Theory, Architecture) to infer the subject of a new document?

Deciding what a new doc is about Check which region the new doc falls into (in the figure, the AI region); can output “softer” decisions as well.

Setup Given “training” docs for each category Theory, AI, NLP, etc. Cast them into a decision space generally a vector space with each doc viewed as a bag of words Build a classifier that will classify new docs Essentially, partition the decision space Given a new doc, figure out which partition it falls into

Supervised vs. unsupervised learning This setup is called supervised learning in the terminology of Machine Learning In the domain of text, various names Text classification, text categorization Document classification/categorization “Automatic” categorization Routing, filtering … In contrast, the earlier setting of clustering is called unsupervised learning Presumes no availability of training samples Clusters output may not be thematically unified.

“Which is better?” Depends on your setting, on your application. Can use them in combination: Analyze a corpus using clustering Hand-tweak the clusters and label them Use clusters as training input for classification Subsequent docs get classified Computationally, the methods are quite different

What more can these methods do? Assigning a category label to a document is one way of adding structure to it. Can add others, e.g., extract from the doc: people, places, dates, organizations, … This process is known as information extraction; it can also be addressed using supervised learning.

Clustering: a bit of terminology. Representative, centroid, outlier (illustrated in the figure).

Key notion: cluster representative In the algorithms to follow, will generally need a notion of a representative point in a cluster. The representative should be some sort of “typical” or central point in the cluster, e.g., the point inducing the smallest radius to docs in the cluster, or the smallest sum of squared distances, etc.; or the point that is the “average” of all docs in the cluster. Need not be a document.

Key notion: cluster centroid Centroid of a cluster = component-wise average of the vectors in the cluster; it is a vector, and need not be a doc. Centroid of (1,2,3); (4,5,6); (7,2,6) is (4,3,5).
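A quick check of this example in Python (assuming numpy is available):

```python
import numpy as np

docs = np.array([[1, 2, 3], [4, 5, 6], [7, 2, 6]])
centroid = docs.mean(axis=0)     # component-wise average
print(centroid)                  # [4. 3. 5.]
```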

(Outliers in centroid computation) Can ignore outliers when computing the centroid. What is an outlier? Lots of statistical definitions, e.g. the moment of a point to the centroid > M times some cluster moment, for some threshold M, say 10.

Clustering algorithms Partitional vs. hierarchical Agglomerative K-means

Partitional Clustering (figure: the original points and one partitional clustering of them).

Hierarchical Clustering (figures: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram).

Agglomerative clustering Given a target number of clusters k. Initially each doc is viewed as a cluster, so we start with n clusters. Repeat: while there are > k clusters, find the “closest pair” of clusters and merge them.

“Closest pair” of clusters Many variants to defining closest pair of clusters Clusters whose centroids are the most cosine-similar … whose “closest” points are the most cosine-similar … whose “furthest” points are the most cosine-similar
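A minimal sketch of these three variants in Python, with clusters represented simply as lists of term-weight vectors (cosine() as in the earlier sketch):

```python
import math

def cosine(u, v):
    """Cosine similarity of two term-weight vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(cluster):
    """Component-wise average of the vectors in a cluster."""
    n = len(cluster)
    return [sum(vec[i] for vec in cluster) / n for i in range(len(cluster[0]))]

def centroid_sim(c1, c2):
    """Clusters whose centroids are the most cosine-similar."""
    return cosine(centroid(c1), centroid(c2))

def single_link_sim(c1, c2):
    """... whose closest points are the most cosine-similar."""
    return max(cosine(x, y) for x in c1 for y in c2)

def complete_link_sim(c1, c2):
    """... whose furthest points are the most cosine-similar."""
    return min(cosine(x, y) for x in c1 for y in c2)
```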

Example: n=6, k=3, closest pair of centroids (figure: the documents, the centroid after the first merge step, and the centroid after the second merge step).

Issues Have to support finding closest pairs continually: compare all pairs? Potentially n³ cosine similarity computations. To avoid this: use approximations (what’s this?). The “points” are changing as centroids change, and the changes at each step are not localized. On a large corpus, memory management is an issue, sometimes addressed by clustering a sample (why?).

Exercise Consider agglomerative clustering on n points on a line. Explain how you could avoid n³ distance computations. How many will your scheme use?

“Using approximations” In the standard algorithm, must find the closest pair of centroids at each step. Approximation: instead, find a nearly closest pair, using some data structure that makes this approximation easier to maintain. Simplistic example: maintain the closest pair based on distances in a projection onto a random line.

A partitional clustering algorithm: K-MEANS Given k - the number of clusters desired. Each cluster associated with a centroid. Each point assigned to the cluster with the closest centroid. Iterate.

Different algorithm: k-means Given k - the number of clusters desired. Iterative algorithm. More locality within each iteration. Hard to get good bounds on the number of iterations.

Basic iteration At the start of the iteration, we have k centroids. Each doc assigned to the nearest centroid. All docs assigned to the same centroid are averaged to compute a new centroid; thus have k new centroids.

Iteration example (figure: the docs and the current centroids).

Iteration example (figure: the docs and the new centroids after re-averaging).

K-means Clustering: the full algorithm
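The full algorithm appears on the original slide as a figure; as a stand-in, here is a minimal Python sketch of the same loop (random initial centroids, squared Euclidean distance), not the exact code from the lecture:

```python
import random

def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    n = len(cluster)
    return [sum(p[i] for p in cluster) / n for i in range(len(cluster[0]))]

def kmeans(points, k, max_iterations=100):
    """Plain k-means: assign every point to its nearest centroid, then re-average."""
    centroids = [list(map(float, p)) for p in random.sample(points, k)]  # k random docs as seeds
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                    # assignment step
            i = min(range(k), key=lambda c: squared_dist(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = [mean(c) if c else centroids[i]     # re-estimation step
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:                      # stop: centroids unchanged
            break
        centroids = new_centroids
    return clusters, centroids

clusters, centroids = kmeans([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]], k=2)
print(centroids)
```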

K-means Clustering – Details Initial centroids are often chosen randomly. Clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for common similarity measures mentioned above. Most of the convergence happens in the first few iterations. Often the stopping condition is changed to ‘Until relatively few points change clusters’ Complexity is O( n * K * I * d ) n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

Effect of initial choice of centroids: Two different K-means Clusterings (figure: the original points, an optimal clustering, and a sub-optimal clustering).

Importance of Choosing Initial Centroids

k-means clustering Begin with k docs as centroids could be any k docs, but k random docs are better. Repeat Basic Iteration until termination condition satisfied.

Termination conditions Several possibilities, e.g., A fixed number of iterations. Doc partition unchanged. Centroid positions don’t change. Does this mean that the docs in a cluster are unchanged?

Convergence Why should the k-means algorithm ever reach a fixed point? A state in which clusters don’t change. k-means is a special case of a general procedure known as the EM algorithm. Under reasonable conditions, known to converge. Number of iterations could be large.

Exercise Consider running 2-means clustering on a corpus, each doc of which is from one of two different languages. What are the two clusters we would expect to see? Is agglomerative clustering likely to produce different results?

k not specified in advance Say, the results of a query. Solve an optimization problem: penalize having lots of clusters. Application dependent, e.g., a compressed summary of a search results list. Tradeoff between having more clusters (better focus within each cluster) and having too many clusters.

k not specified in advance Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid Define the Total Benefit to be the sum of the individual doc Benefits. Why is there always a clustering of Total Benefit n?

Penalize lots of clusters For each cluster, we have a Cost C. Thus for a clustering with k clusters, the Total Cost is kC. Define the Value of a cluster to be = Total Benefit - Total Cost. Find the clustering of highest Value, over all choices of k.
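A minimal sketch of this criterion, reusing the kmeans() and cosine() sketches above; the per-cluster cost C below is a hypothetical value, not one from the lecture:

```python
def total_benefit(clusters, centroids):
    """Sum over all docs of the cosine similarity to their own cluster's centroid."""
    return sum(cosine(doc, centroids[i])
               for i, cluster in enumerate(clusters)
               for doc in cluster)

def best_k(docs, max_k, C=0.5):                     # C: hypothetical per-cluster cost
    """Choose the k whose clustering maximises Value = Total Benefit - k * C."""
    best_value, best_choice = float("-inf"), None
    for k in range(1, max_k + 1):
        clusters, centroids = kmeans(docs, k)
        value = total_benefit(clusters, centroids) - k * C
        if value > best_value:
            best_value, best_choice = value, k
    return best_choice, best_value
```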

Evaluating K-means Clusters Most common measure is Sum of Squared Error (SSE) For each point, the error is the distance to the nearest cluster To get SSE, we square these errors and sum them; x is a data point in cluster C_i and m_i is the representative point for cluster C_i one can show that m_i corresponds to the center (mean) of the cluster Given two clusters, we can choose the one with the smallest error One easy way to reduce SSE is to increase K, the number of clusters A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
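Written out, the formula referred to here is the standard one from Tan, Steinbach and Kumar, with x a data point in cluster C_i and m_i that cluster's representative point:

```latex
\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}^{2}(m_i, x)
```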

Back to agglomerative clustering In a run of agglomerative clustering, we can try all values of k=n,n-1,n-2, … 1. At each, we can measure our Value, then pick the best choice of k.

Exercise Suppose a run of agglomerative clustering finds k=7 to have the highest Value amongst all k. Have we found the highest-Value clustering amongst all clusterings with k=7?

Limitations of K-means K-means has problems when clusters are of differing Sizes Densities Non-globular shapes K-means has problems when the data contains outliers.

Limitations of K-means: Differing Sizes (figure: the original points vs. the 3-cluster K-means result).

Limitations of K-means: Differing Density (figure: the original points vs. the 3-cluster K-means result).

Limitations of K-means: Non-globular Shapes (figure: the original points vs. the 2-cluster K-means result).

Hierarchical clustering As clusters agglomerate, docs likely to fall into a hierarchy of “topics” or concepts. (Figure: a small dendrogram in which d4 and d5 merge, d3 then joins them to form {d3, d4, d5}, and d1 and d2 merge to form {d1, d2}.)

Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits

Strengths of Hierarchical Clustering Do not have to assume any particular number of clusters Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level They may correspond to meaningful taxonomies Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

Hierarchical Clustering Two main types of hierarchical clustering Agglomerative: Start with the points as individual clusters At each step, merge the closest pair of clusters until only one cluster (or k clusters) left Divisive: Start with one, all-inclusive cluster At each step, split a cluster until each cluster contains a point (or there are k clusters) Traditional hierarchical algorithms use a similarity or distance matrix Merge or split one cluster at a time

Agglomerative vs. Divisive Clustering Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters. Divisive (partitional, top-down) methods separate all examples immediately into clusters.

Hierarchical Agglomerative Clustering (HAC) More popular hierarchical clustering technique Assumes a similarity function for determining the similarity of two instances. Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster. The history of merging forms a binary tree or hierarchy.

Agglomerative Clustering Algorithm Basic algorithm is straightforward Compute the proximity matrix Let each data point be a cluster Repeat Merge the two closest clusters Update the proximity matrix Until only a single cluster remains Key operation is the computation of the proximity of two clusters Different approaches to defining the distance between clusters distinguish the different algorithms
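A minimal Python sketch of this basic algorithm, stopping at k clusters rather than one and taking any of the inter-cluster similarity functions sketched earlier (e.g. single_link_sim):

```python
def agglomerative(points, k, cluster_sim):
    """Repeatedly merge the two most similar clusters until only k clusters remain."""
    clusters = [[p] for p in points]               # every data point starts as its own cluster
    while len(clusters) > k:
        # naive step: recompute the similarity of every pair and take the closest pair
        i, j = max(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_sim(clusters[ab[0]], clusters[ab[1]]))
        clusters[i].extend(clusters[j])            # merge the closest pair ...
        del clusters[j]                            # ... and update the set of clusters
    return clusters
```

For example, agglomerative(doc_vectors, 3, single_link_sim) would return three clusters of term-weight vectors; real implementations maintain the proximity matrix incrementally instead of recomputing it.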

Starting Situation Start with clusters of individual points (p1, p2, p3, …) and a proximity matrix between them.

Intermediate Situation After some merging steps, we have some clusters (C1 … C5) and the proximity matrix between them.

Intermediate Situation We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

After Merging The question is “How do we update the proximity matrix?” (In the matrix, the rows and columns for C2 and C5 are replaced by a single row and column for C2 U C5, whose entries are still to be determined.)

How to Define Inter-Cluster Similarity Options: MIN; MAX; Group Average; Distance Between Centroids; other methods driven by an objective function (Ward’s Method uses squared error).

Text clustering issues and applications

List of issues/applications covered Term vs. document space clustering Multi-lingual docs Feature selection Speeding up scoring Building navigation structures “Automatic taxonomy induction” Labeling

Term vs. document space Thus far, we clustered docs based on their similarities in terms space For some applications, e.g., topic analysis for inducing navigation structures, can “dualize”: use docs as axes represent (some) terms as vectors proximity based on co-occurrence of terms in docs now clustering terms, not docs

Term vs. document space If terms carefully chosen (say nouns) fixed number of pairs for distance computation independent of corpus size clusters have clean descriptions in terms of noun phrase co-occurrence - easier labeling? left with problem of binding docs to these clusters

Multi-lingual docs E.g., News Explorer, Canadian government docs. Every doc in English and equivalent French. Must cluster by concepts rather than language Simplest: pad docs in one lang with dictionary equivalents in the other thus each doc has a representation in both languages Axes are terms in both languages

Feature selection Which terms to use as axes for the vector space? Huge body of (ongoing) research. IDF is a form of feature selection; it can exaggerate noise, e.g., mis-spellings. Pseudo-linguistic heuristics, e.g., drop stop-words stemming/lemmatization use only nouns/noun phrases Good clustering should “figure out” some of these

Clustering to speed up scoring From CS276a, recall sampling and pre-grouping Wanted to find, given a query Q, the nearest docs in the corpus Wanted to avoid computing cosine similarity of Q to each of n docs in the corpus.

Sampling and pre-grouping First run a clustering phase pick a representative leader for each cluster. Process a query as follows: Given query Q, find its nearest leader L. Seek nearest docs from L’s followers only avoid cosine similarity to all docs.
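A minimal sketch of this query-processing shortcut; leaders and followers here are hypothetical structures produced by the prior clustering phase, and cosine() is as sketched earlier:

```python
def cluster_pruned_search(query_vec, leaders, followers, top_n=10):
    """leaders[i]: the leader vector of cluster i; followers[i]: the doc vectors it leads."""
    # Step 1: score the query against the cluster leaders only.
    nearest = max(range(len(leaders)), key=lambda i: cosine(query_vec, leaders[i]))
    # Step 2: rank just that leader's followers, skipping the rest of the corpus.
    ranked = sorted(followers[nearest],
                    key=lambda doc: cosine(query_vec, doc),
                    reverse=True)
    return ranked[:top_n]
```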

Navigation structure Given a corpus, agglomerate into a hierarchy Throw away lower layers so you don’t have n leaf topics each having a single doc Many principled methods for this pruning, such as MDL. (Figure: the same small dendrogram as on the earlier hierarchical clustering slide.)

Navigation structure Can also induce hierarchy top-down - e.g., use k-means, then recur on the clusters. Need to figure out what k should be at each point. Topics induced by clustering need human ratification can override mechanical pruning. Need to address issues like partitioning at the top level by language.

Major issue - labeling After clustering algorithm finds clusters - how can they be useful to the end user? Need pithy label for each cluster In search results, say “Football” or “Car” in the jaguar example. In topic trees, need navigational cues. Often done by hand, a posteriori.

How to Label Clusters Show titles of typical documents Titles are easy to scan Authors create them for quick scanning! But you can only show a few titles which may not fully represent cluster Show words/phrases prominent in cluster More likely to fully represent cluster Use distinguishing words/phrases But harder to scan

Labeling Common heuristics - list 5-10 most frequent terms in the centroid vector. Drop stop-words; stem. Differential labeling by frequent terms Within the cluster “Computers”, child clusters all have the word computer as frequent terms. Discriminant analysis of sub-tree centroids.
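A minimal sketch of the frequent-terms heuristic, assuming a centroid stored as a dict from term to weight; the stop-word list is a trimmed, hypothetical one and the example weights are invented:

```python
STOPWORDS = {"the", "of", "and", "a", "to", "in", "on"}   # hypothetical, heavily trimmed stop list

def label_cluster(centroid, n=5):
    """Label a cluster with the n heaviest non-stop-word terms of its centroid vector."""
    ranked = sorted(centroid, key=centroid.get, reverse=True)
    return [term for term in ranked if term not in STOPWORDS][:n]

# e.g. a centroid from the "jaguar" cars cluster (weights invented for illustration)
print(label_cluster({"jaguar": 0.8, "cars": 0.5, "the": 0.4, "xjs": 0.3, "engine": 0.2}))
```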

The biggest issue in clustering? How do you compare two alternatives? Computation (time/space) is only one metric of performance. How do you assess the “goodness” of the clustering produced by a method?

READINGS Ingersoll / Morton / Farris – chapter 6

Resources Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections (1992) Cutting/Karger/Pedersen/Tukey http://citeseer.nj.nec.com/cutting92scattergather.html Data Clustering: A Review (1999) Jain/Murty/Flynn http://citeseer.nj.nec.com/jain99data.html

Acknowledgments Many of these slides were borrowed from Chris Manning’s Stanford module; from Tan, Steinbach and Kumar, Introduction to Data Mining; and from Ray Mooney’s Austin module.