©2012 Paula Matuszek CSC 9010: Text Mining Applications: Document Clustering l Dr. Paula Matuszek

©2012 Paula Matuszek Document Clustering l Clustering documents: group documents into “similar” categories l Categories are not predefined l Examples –Process news articles to see what today’s stories are –Review descriptions of calls to a help line to see what kinds of problems people are having –Get an overview of search results –Infer a set of labels for future training of a classifier l Unsupervised learning: no training examples and no test set to evaluate

Some Examples l search.carrot2.org/stable/search l news.google.com

Clustering Basics l Collect documents l Compute similarity among documents l Group documents together such that –documents within a cluster are similar –documents in different clusters are different l Summarize each cluster l Sometimes -- assign new documents to the most similar cluster

Clustering Example (example figure omitted)

Measures of Similarity l Semantic similarity -- but this is hard. l Statistical similarity: bag of words (BOW) –Number of words in common (a simple count) –Weighted words in common –words get additional weight based on inverse document frequency: two documents are more similar if they share an infrequent word –With the document as a vector: –Euclidean distance –Cosine similarity

Euclidean Distance l Euclidean distance: distance between two instances, summed across each feature –Differences are squared to give more weight to larger differences dist(x_i, x_j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2)
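To make the formula concrete, here is a minimal Python sketch (the vectors are invented for illustration):

import math

def euclidean_distance(x, y):
    # Sum the squared differences across features, then take the square root
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([1, 0, 2], [2, 1, 0]))  # sqrt(1 + 1 + 4) = sqrt(6) ≈ 2.449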

Cosine Similarity Measurement l Cosine similarity measures the similarity of two vectors as the cosine of the angle between them. l The cosine is equal to 1 when the angle is 0, and less than 1 for any other angle. l As the angle between the vectors shrinks, the cosine approaches 1: the two vectors point in more nearly the same direction, so whatever they represent is more similar. Based on home.iitk.ac.in/~mfelixor/Files/non-numeric-Clustering-seminar.pp
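A matching Python sketch of cosine similarity (again with made-up vectors):

import math

def cosine_similarity(x, y):
    # Dot product of the vectors divided by the product of their lengths
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 0, 2], [2, 0, 4]))  # ≈ 1.0: vectors point the same way
print(cosine_similarity([1, 0, 2], [0, 3, 0]))  # 0.0: orthogonal, nothing in common

Note that, unlike Euclidean distance, cosine similarity ignores vector length, which is convenient when documents differ greatly in size.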

Clustering Algorithms l Hierarchical –Bottom-up (agglomerative) –Top-down (not common in text) l Flat –K-means l Probabilistic –Expectation Maximization (EM)

Hierarchical Agglomerative Clustering (HAC) l Starts with each instance in its own cluster and then repeatedly joins the two clusters that are most similar, until there is only one cluster. l The history of merging forms a binary tree or hierarchy.

HAC Algorithm l Start with each instance in its own cluster. l Until there is only one cluster: –Among the current clusters, determine the two clusters, c_i and c_j, that are most similar. –Replace c_i and c_j with a single cluster c_i ∪ c_j.
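A minimal HAC sketch using SciPy; the points, linkage method, and cut level below are illustrative assumptions, not taken from the slides:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# Each row of merge_history records one bottom-up merge of the two most
# similar clusters; method can be 'single', 'complete', or 'average'
merge_history = linkage(points, method='average', metric='euclidean')

# Cut the resulting tree to obtain, e.g., 3 flat clusters
labels = fcluster(merge_history, t=3, criterion='maxclust')
print(labels)  # e.g. [1 1 2 2 3]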

Cluster Similarity l How to compute the similarity of two clusters, each possibly containing multiple instances? –Single Link: similarity of the two most similar members. –Can result in “straggly” (long and thin) clusters due to a chaining effect. –Complete Link: similarity of the two least similar members. –Makes more “tight,” spherical clusters that are typically preferable –More sensitive to outliers –Group Average: average similarity between members.

Single Link Example (example figure omitted)

Complete Link Example (example figure omitted)

Group Average Agglomerative Clustering l Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters. l Compromise between single and complete link. l Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters (to encourage tighter final clusters).

Dendrogram: Hierarchical Clustering l Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Partitioning (Flat) Algorithms l Partitioning method: Construct a partition of n documents into a set of K clusters l Given: a set of documents and the number K l Find: a partition of K clusters that optimizes the chosen partitioning criterion –Globally optimal: exhaustively enumerate all partitions –Effective heuristic methods: K-means algorithm.

Non-Hierarchical Clustering l Typically must provide the number of desired clusters, k. l Randomly choose k instances as seeds, one per cluster. l Form initial clusters based on these seeds. l Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering. l Stop when the clustering converges or after a fixed number of iterations.

K-Means l Assumes instances are real-valued vectors. l Clusters are based on centroids (the center of gravity, or mean, of the points in a cluster c): μ(c) = (1/|c|) Σ_(x ∈ c) x l Reassignment of instances to clusters is based on distance to the current cluster centroids.

K-Means Algorithm l Let d be the distance measure between instances. l Select k random instances {s_1, s_2, ..., s_k} as seeds. l Until the clustering converges or another stopping criterion is met: –For each instance x_i: –Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal. –Update the seeds to the centroid of each cluster: –For each cluster c_j, s_j = μ(c_j)
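A minimal NumPy sketch of this loop (variable names are mine, and it assumes no cluster ever becomes empty):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select k random instances as the initial seeds (centroids)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each instance to the cluster whose centroid is nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each seed to the centroid (mean) of its current cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids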

K-Means Example (K=2), figure omitted: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!

Termination conditions l Several possibilities, e.g., –A fixed number of iterations. –Doc partition unchanged. –Centroid positions don’t change.

Seed Choice l Results can vary based on random seed selection. l Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings. –Select good seeds using a heuristic (e.g., the doc least similar to any existing mean) –Try out multiple starting points –Initialize with the results of another method. l Example showing sensitivity to seeds (figure of points A–F omitted): if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
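As an aside, scikit-learn's KMeans already mitigates bad seeds along these lines, using the k-means++ initializer and multiple random restarts; a small sketch (X stands for the document vectors):

from sklearn.cluster import KMeans

# Runs k-means 10 times from different seeds and keeps the best solution
# (lowest within-cluster sum of squares)
km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
# labels = km.fit_predict(X)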

How Many Clusters? l Number of clusters K is given –Partition n docs into the predetermined number of clusters l K not specified in advance –Finding the “right” number of clusters is part of the problem –Solve an optimization problem: penalize having lots of clusters –Application dependent, e.g., a compressed summary of a search results list –Tradeoff between having more clusters (better focus within each cluster) and having too many clusters
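One common heuristic (my illustration, not from the slides): fit K-means for several values of K and compare silhouette scores, which reward tight, well-separated clusters:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.RandomState(0).rand(100, 5)  # stand-in for document vectors

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # higher is better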

Weaknesses of k-means l The algorithm is only applicable to numeric data l The user needs to specify k. l The algorithm is sensitive to outliers –Outliers are data points that are very far away from other data points. –Outliers could be errors in the data recording or some special data points with very different values.

Strengths of k-means l Strengths: –Simple: easy to understand and to implement –Efficient: time complexity is O(tkn), –where n is the number of data points, –k is the number of clusters, and –t is the number of iterations. –Since both k and t are typically small, k-means is considered a linear algorithm. l K-means is the most popular clustering algorithm. l In practice, it performs well on text.

Buckshot Algorithm l Combines HAC and K-means clustering. l First randomly take a sample of instances of size √n. l Run group-average HAC on this sample, which takes only O(n) time. l Use the results of HAC as the initial seeds for K-means. l The overall algorithm is O(n) and avoids the problems of bad seed selection.
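A rough reconstruction of Buckshot under these assumptions (illustrative only; it assumes the √n sample is larger than k):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def buckshot(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Cluster a random sample of size sqrt(n) with group-average HAC
    sample = X[rng.choice(len(X), size=int(np.sqrt(len(X))), replace=False)]
    hac_labels = fcluster(linkage(sample, method='average'), t=k, criterion='maxclust')
    # Use the HAC cluster means as the K-means seeds for the full data set
    seeds = np.array([sample[hac_labels == j].mean(axis=0) for j in range(1, k + 1)])
    return KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)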

Text Clustering l HAC and K-means have been applied to text in a straightforward way. l Typically use normalized, TF/IDF-weighted vectors and cosine similarity. l Optimize computations for sparse vectors. l Applications: –During retrieval, add other documents in the same cluster as the initially retrieved documents to improve recall. –Clustering of retrieval results to present more organized results to the user. –Automated production of hierarchical taxonomies of documents for browsing purposes.
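A minimal end-to-end text-clustering sketch with scikit-learn; the toy documents are invented, and because TfidfVectorizer L2-normalizes its vectors by default, the Euclidean distance between two documents is a monotone function of their cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["stock markets fell sharply today",
        "the team won the championship game",
        "investors worried about interest rates",
        "the coach praised the players after the match"]

X = TfidfVectorizer(stop_words='english').fit_transform(docs)  # sparse TF-IDF matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 1 0 1]: finance docs vs. sports docs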

Soft Clustering l Clustering typically assumes that each instance is given a “hard” assignment to exactly one cluster. l This does not allow uncertainty in class membership, or for an instance to belong to more than one cluster. l Soft clustering gives probabilities that an instance belongs to each of a set of clusters. l Each instance is assigned a probability distribution across a set of discovered categories (the probabilities of all categories must sum to 1).

Expectation Maximization (EM) l Probabilistic method for soft clustering. l Direct method that assumes k clusters: {c_1, c_2, ..., c_k} l Soft version of k-means. l Assumes a probabilistic model of categories that allows computing P(c_i | E) for each category, c_i, for a given example, E. l For text, typically assume a naïve-Bayes category model.

EM Algorithm l Iterative method for learning a probabilistic categorization model from unsupervised data. l Initially assume a random assignment of examples to categories. l Learn an initial probabilistic model by estimating the model parameters θ from this randomly labeled data. l Iterate the following two steps until convergence: –Expectation (E-step): compute P(c_i | E) for each example given the current model, and probabilistically re-label the examples based on these posterior probability estimates. –Maximization (M-step): re-estimate the model parameters, θ, from the probabilistically re-labeled data.
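As a hedged stand-in for the naïve-Bayes category model mentioned above, scikit-learn's GaussianMixture runs the same E/M loop under a Gaussian model and exposes the resulting soft cluster assignments:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).rand(50, 3)  # stand-in feature vectors

gm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM runs inside fit()
soft_labels = gm.predict_proba(X)  # each row sums to 1 across the 2 clusters
print(soft_labels[:3])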

EM Illustration (figures of positive/negative examples omitted) l Initialize: assign random probabilistic labels to the unlabeled data. l Initialize: give the soft-labeled training data to a probabilistic learner, which produces a probabilistic classifier. l E step: relabel the unlabeled data using the trained classifier. l M step: retrain the classifier on the relabeled data. l Continue EM iterations until the probabilistic labels on the unlabeled data converge.

Issues in Clustering l How to evaluate clustering? –Internal: –Tightness and separation of clusters (e.g., the k-means objective) –Fit of a probabilistic model to the data –External: –Compare to known class labels on benchmark data (purity) l Improving search to converge faster and avoid local minima. l Overlapping clustering.

How to Label Text Clusters l Show titles of typical documents –Titles are easy to scan –Authors create them for quick scanning! –But you can only show a few titles, which may not fully represent the cluster l Show words/phrases prominent in the cluster –More likely to fully represent the cluster –Use distinguishing words/phrases –But harder to scan Based on cs.wellesley.edu/~cs315/315_PPTs/L17-Clustering/lecture_17.ppt
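A common way to get such distinguishing words (a sketch with invented documents, reusing the TF-IDF + K-means setup from the earlier slides): list the highest-weight terms in each cluster centroid.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["stock markets fell sharply today",
        "investors worried about interest rates",
        "the team won the championship game",
        "the coach praised the players after the match"]

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for c, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:5]  # highest-weight terms in this centroid
    print(c, [terms[i] for i in top])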

Summary: Text Clustering l Clustering can be useful in seeing “shape” of a document corpus l Unsupervised learning; training examples not needed –easier and faster –more appropriate when you don’t know the categories l Goal is to create clusters with similar documents –typically using BOW with TF*IDF as representation –typically using cosine-similarity as metric l Difficult to evaluate results –useful? –expert opinion? –In practice, only moderately successful