CSC 594 Topics in AI – Text Mining and Analytics, Fall 2015/16
8. Text Clustering
Text clustering most often separates the entire corpus of documents into mutually exclusive clusters: each document belongs to one and only one cluster (i.e., hard clustering). Topic extraction, by contrast, assigns a document to multiple topics (i.e., soft clustering).
Similarity-based Clustering

A common approach to text clustering is to group documents that are similar. In the vector space model of textual data, there are several popular similarity metrics:
– Correlation-based metrics: often used in document search and retrieval, e.g. the cosine of the angle between document vectors.
– Distance-based metrics: provide a 'magnitude' of similarity, e.g. Euclidean distance: distance(x, y) = {Σ_i (x_i − y_i)²}^(1/2)
– Association-based measures (not always metrics): often used for nominal attributes, e.g. the Jaccard coefficient.

Example term*doc matrix (term frequencies):

                Doc 1  Doc 2  Doc 3  …  Doc N
  apple           1      0      0    …    2
  cat             3      1      1    …    4
  dog             2      2      1    …    3
  farm            1      0      0    …    1
  …               …      …      …    …    …
  White House     0      3      4    …    0
  Senate          0      2      4    …    0
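The three kinds of measures above can be sketched in a few lines of pure Python. This is a minimal illustration only; the vectors d1 and d2 are made-up term-frequency columns, not taken from the matrix above.

```python
import math

def cosine_similarity(x, y):
    # correlation-style metric: cosine of the angle between term vectors
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def euclidean_distance(x, y):
    # distance-based metric: distance(x, y) = {sum_i (x_i - y_i)^2}^(1/2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard_coefficient(x, y):
    # association-based measure on term *presence* (nominal attributes)
    sx = {i for i, v in enumerate(x) if v > 0}
    sy = {i for i, v in enumerate(y) if v > 0}
    return len(sx & sy) / len(sx | sy)

# two short document vectors over the same six-term vocabulary
d1 = [1, 3, 2, 1, 0, 0]
d2 = [0, 1, 2, 0, 3, 2]
print(cosine_similarity(d1, d2), euclidean_distance(d1, d2), jaccard_coefficient(d1, d2))
```

Note that cosine similarity grows with similarity, while Euclidean distance shrinks; the Jaccard coefficient here ignores counts entirely and compares only which terms occur.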
Other Clustering Approaches

Distribution-based clustering
– Assumes a distribution model of the values, and tries to fit the observations to it.
– e.g. Gaussian Mixture Model (fitted with the Expectation-Maximization algorithm)

Density-based clustering
– Clusters are defined as areas of higher density.
– Observations in the sparse areas that separate clusters are considered noise (or border points).

Wikipedia, Cluster Analysis
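As a concrete sketch of distribution-based clustering, here is a didactic one-dimensional, two-component Gaussian Mixture Model fitted by Expectation-Maximization in pure Python. The data points and the min/max initialization are illustrative assumptions, not from the slides, and real implementations work in many dimensions with more careful initialization.

```python
import math

def em_gmm_1d(xs, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by Expectation-Maximization.
    A didactic sketch, not the general multivariate algorithm."""
    mu = [min(xs), max(xs)]          # crude initialization of the two means
    var = [1.0, 1.0]
    pi = [0.5, 0.5]                  # mixture weights
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for x in xs:
            dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate weights, means, variances from responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return mu, var, pi

# two well-separated groups of made-up 1-D observations
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.8, 5.1]
mu, var, pi = em_gmm_1d(data)
print(sorted(mu))  # the two fitted means, near the two group centers
```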
Unigrams vs. Reduced Dimensions for Text Clustering

Just as for text topics, you can apply clustering directly to a doc*term (or term*doc) matrix, or to a matrix obtained after reducing dimensions (e.g. by SVD). When SVD is applied to a term*doc matrix A = U*S*V^T:
– Documents are represented by the column vectors of the matrix V^T (i.e., the rows of V).
– Terms are represented by the row vectors of the product of the U and S matrices.

[Figure: the decomposition of A into U, S, and V, with the retained SVD dimensions highlighted]
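These representations can be checked on a toy term*doc matrix. This sketch assumes NumPy is available; the matrix values are made up (rows roughly matching "animal" terms vs. "politics" terms).

```python
import numpy as np

# toy term*doc matrix A (rows = terms, columns = documents)
A = np.array([[1, 0, 0, 2],
              [3, 1, 1, 4],
              [2, 2, 1, 3],
              [0, 3, 4, 0],
              [0, 2, 4, 0]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt

k = 2                                  # keep the top-2 SVD dimensions
term_coords = U[:, :k] * s[:k]         # terms: rows of U scaled by S
doc_coords = Vt[:k, :].T               # documents: columns of Vt (rows of V)

print(term_coords.shape, doc_coords.shape)   # (5, 2) (4, 2)
```

Clustering can then be run on doc_coords (one k-dimensional row per document) instead of the raw columns of A.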
Clustering Algorithms

Each document is assigned to the one cluster to which its membership (similarity) is the strongest. Broadly speaking, clustering algorithms can be divided into four groups:
1. Hierarchical – top-down (divisive) or bottom-up (agglomerative)
2. Non-hierarchical – partitioning algorithms such as K-means
3. Probabilistic – identifies dense regions of the data space
4. Neural Network – typically the Kohonen Self-Organizing Map (SOM)

Reference on various clustering algorithms (my old CSC 578 lecture note):
– K-means clustering
– Hierarchical clustering
– Expectation-Maximization (EM) clustering – a model-based, generative technique
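A minimal pure-Python sketch of the K-means (Lloyd's) partitioning algorithm listed above; the function name, toy points, and random initialization are my own illustrative choices.

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Plain K-means (Lloyd's algorithm) on lists of document vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialize centroids at k random points
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # update step: move each centroid to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = [sum(col) / len(cl) for col in zip(*cl)]
    return centroids, clusters

# two obvious groups of 2-D points
pts = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
centroids, clusters = kmeans(pts, 2)
print([len(c) for c in clusters])
```

Hierarchical (agglomerative) clustering instead starts with every document in its own cluster and repeatedly merges the closest pair; EM clustering replaces the hard assignment step with soft, probabilistic responsibilities.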
Cluster Assignment
Coursera, Text Mining and Analytics, ChengXiang Zhai
Interpretation of Clusters: Descriptive Terms or Centroids

Descriptive terms in SAS Enterprise Miner: The Text Cluster node uses a descriptive-terms algorithm to describe the contents of both EM clusters and hierarchical clusters. If you specify that m descriptive terms be displayed for each cluster, then the top 2*m most frequently occurring terms in each cluster are used to compute the descriptive terms. For each of the 2*m terms, a binomial probability is computed for each cluster. The probability of assigning a term to cluster j is prob = F(k | N, p), where F is the binomial cumulative distribution function, k is the number of times the term appears in cluster j, N is the number of documents in cluster j, and p = (sum − k) / (total − N), with sum the total number of times the term appears in all the clusters and total the total number of documents. The m descriptive terms are those with the highest binomial probabilities. Descriptive terms must have a keep status of Y and must occur at least twice (by default) in a cluster.
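The scoring just described can be sketched directly from the formula prob = F(k | N, p). This is a pure-Python illustration of the formula as stated above; the function names and example counts are my own, not from SAS.

```python
import math

def binomial_cdf(k, n, p):
    # F(k | N, p): probability of at most k successes in n Bernoulli(p) trials
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def descriptive_score(k, n_cluster_docs, term_total, total_docs):
    # p is estimated from the term's rate *outside* the cluster:
    # p = (sum - k) / (total - N), as in the description above
    p = (term_total - k) / (total_docs - n_cluster_docs)
    return binomial_cdf(k, n_cluster_docs, p)

# hypothetical counts: a term concentrated in one cluster scores higher
# than a term spread evenly over the corpus
concentrated = descriptive_score(k=8, n_cluster_docs=10, term_total=10, total_docs=100)
spread = descriptive_score(k=1, n_cluster_docs=10, term_total=10, total_docs=100)
print(concentrated, spread)
```

Intuitively, a high CDF value means the term occurs in the cluster far more often than its corpus-wide rate would predict, so it is a good descriptor of that cluster.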