1
Topic Significance Ranking for LDA Generative Models
Loulwah AlSumait, Daniel Barbará, James Gentle, Carlotta Domeniconi. ECML PKDD, Bled, Slovenia, September 7-11, 2009
2
Agenda
Introduction
Junk/Insignificant topic definitions
Distance measures
4-phase Weighted Combination Approach
Experimental results
Conclusions and future work
3
Latent Dirichlet Allocation (LDA) Model Blei, Ng, & Jordan (2003)
Probabilistic generative model: hidden variables (topics) are associated with the observed text
Dirichlet priors on document and topic distributions
Exact inference is intractable; approximation approaches are used
Input: K; Output: Φ, θ
[Figure: LDA plate diagram relating the generative process and the inference process, with variables d, z_i, w_i, θ, Φ over N_d words, D documents, and K topics]
The LDA probabilistic topic model (PTM) is a three-level hierarchical Bayesian network that represents the generative probabilistic model of a corpus of documents. It relates words and documents through latent topics: documents are multinomial distributions over topics, and topics are multinomial distributions over a fixed vocabulary of words. Documents are not directly linked to the words; rather, this relationship is governed by additional latent variables, z, introduced to represent the responsibility of a particular topic for using a word in the document. The completeness of LDA's generative process for documents is achieved by placing Dirichlet priors on the document distributions over topics and on the topic distributions over words. Because exact estimation of Φ is intractable, sophisticated approximations are usually used. Griffiths and Steyvers [6] proposed Gibbs sampling as a simple and effective strategy for estimating Φ and θ. This approach has been successfully applied to find useful structures in many kinds of documents, including e-mails [60], scientific literature [12, 23, 44], libraries of digital books, and news archives [4, 61].
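As a rough illustration of the model's inputs and outputs (K in, Φ and θ out), here is a minimal sketch using scikit-learn's variational LDA. This is not the authors' code: the paper's experiments rely on Gibbs sampling, and the tiny `docs` corpus below is only a placeholder.

```python
# Minimal sketch: estimating Phi (topic-word) and theta (document-topic)
# with scikit-learn's variational LDA. Illustrative only; the talk's
# experiments use Gibbs sampling instead of this estimator.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "hockey team wins the game in overtime",
    "graphics card renders the image on screen",
]  # placeholder corpus; replace with the real document collection

K = 20  # number of topics, the input whose setting the talk argues is critical

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                  # D x W term-count matrix

lda = LatentDirichletAllocation(n_components=K, random_state=0)
theta = lda.fit_transform(X)                        # D x K, p(topic | document)
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # K x W, p(word | topic)
```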
4
Topic Significance Ranking
The setting of K has a critical effect on the inferred topics
Most previous work manually examines the topics
Goal: quantify the semantic significance of topics
How different is a topic's distribution from junk/insignificant topic distributions?
The quality of the topic model and the interpretability of the estimated topics are directly affected by the setting of the number of latent variables, K, which is therefore extremely critical. Models with very few topics result in broad topic definitions that could be mixtures of two or more distributions. On the other hand, models with too many topics are expected to have very specific descriptions that are uninterpretable [52]. Since the actual number of underlying topics is unknown and there is no definite and efficient approach to accurately estimate it, the inferred topics of a PTM do not always represent meaningful themes. Although LDA is widely investigated and heavily cited in the literature, no prior research provides an automatic analysis of the discovered topics to validate their importance and genuineness. Almost all previous work manually examines the output to identify genuine topics.
5
Topic Significance Ranking
Example: 20 NewsGroups
6
Junk/Insignificant Topic Definitions
Uniform Distribution Over Words (W-Uniform): p(w_i | W-Uniform) = 1/W
Uniformity of a topic: U(k) = distance between φ(k) and W-Uniform
Vacuous Semantic Distribution (W-Vacuous): p(w_i | W-Vacuous) = Σ_k p(w_i | k) p(k), with p(w_i | k) = φ_ik
Vacuousness of a topic: V(k) = distance between φ(k) and W-Vacuous
Background Distribution (B-Ground): p(d_j | B-Ground) = 1/D
Background of a topic: B(k) = distance between the topic's distribution over documents and B-Ground
In practice, when writing a document, authors tend to use words from a specific pool of terminology that represents the concept(s) the document is intended to focus on. So a genuine topic is expected to be modeled by a distribution that is skewed toward a small set of words out of the total dictionary. This terminology, the set of words that are highly probable under a specific concept, is called the "salient words" of the topic. Conversely, a topic distribution under which a large number of terms are highly probable is more likely to be insignificant, or "junk". To illustrate this, the number of salient terms of topics estimated by LDA on the 20-Newsgroups dataset is computed. Under LDA, these words can be identified as the ones with the highest conditional probability under a topic k. The total salience of the topic is then quantified by summing the conditional probabilities of its salient words, and the number of words for which the total salience equals a specified percentage, X, of the total topic probability is averaged over all topics. It can be seen that most of the topic density corresponds to less than 3% of the total vocabulary.
Under this frame, an extreme version of a junk topic takes the form of a uniform distribution over the dictionary. The degree of uniformity, U, of an estimated topic, φ(k), can be quantified by computing its distance from the W-Uniform junk distribution. The computed distance provides a reasonable figure of the topic's significance: the farther a topic description is from the uniform distribution over the dictionary, the higher its significance, and vice versa.
The empirical distribution is a convex combination of the probability distributions of the underlying themes, and it reveals no significant information if taken as a whole. A distribution of a real topic is expected to have a unique characteristic rather than being a mixture model. Accordingly, this provides another approach to evaluate the importance of the estimated topics: the closer a topic's distribution is to the empirical distribution of the sample, the less significant it is expected to be. The second junk topic, named the vacuous semantic distribution (W-Vacuous), is therefore defined to be the empirical distribution of the sample set; it is equivalent to the marginal distribution of words over the latent variables. To detect such junk topics, the vacuousness, V, of a topic is measured by computing the distance between the estimated distribution and the W-Vacuous.
The previous two definitions of junk topics are characterized by their distributions over words. However, investigating the distribution of topics over documents identifies another class of insignificant topics. In real datasets, well-defined topics are usually covered in a subset (not all) of the documents. If a topic is estimated to be responsible for generating words in a wide range of documents, or in all documents in the extreme case, then it is far from having a definite and authentic identity. Such topics are most likely constructed of background terms that are irrelevant to the domain structure.
To show reasonable significance, a topic is required to be far enough from being a "background topic", which can be defined as a topic that has a nonzero weight in all the documents. In the extreme case, the background topic (B-Ground) is equally probable in all the documents. The distance between a topic and the B-Ground topic determines how much "background" it carries and, ultimately, grades the significance of the topic.
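The three reference distributions follow directly from these definitions. The sketch below is illustrative code (not the authors' implementation), assuming a topic-word matrix `phi` (K x W), a document-topic matrix `theta` (D x K), and per-document token counts `doc_lens` from any LDA estimator; using length-weighted topic proportions as the mixing weights p(k) for W-Vacuous is an assumption. A small helper for the salient-word count mentioned in the notes is included.

```python
import numpy as np

def junk_distributions(phi, theta, doc_lens):
    """Build the three junk/insignificant reference distributions.

    phi      : K x W array, p(word | topic)
    theta    : D x K array, p(topic | document)
    doc_lens : length-D array of document token counts
    """
    K, W = phi.shape
    D = theta.shape[0]

    # W-Uniform: uniform distribution over the W words of the dictionary
    w_uniform = np.full(W, 1.0 / W)

    # W-Vacuous: the empirical (marginal) distribution of words, obtained by
    # mixing the topic-word distributions with overall topic proportions p(k)
    p_k = (theta * doc_lens[:, None]).sum(axis=0)
    p_k = p_k / p_k.sum()
    w_vacuous = p_k @ phi                    # length-W marginal p(word)

    # B-Ground: a "topic" that is equally probable in every document
    b_ground = np.full(D, 1.0 / D)

    return w_uniform, w_vacuous, b_ground

def salient_word_count(phi_k, mass=0.8):
    """Number of top words needed to cover `mass` of a topic's probability."""
    p = np.sort(phi_k)[::-1]
    return int(np.searchsorted(np.cumsum(p), mass) + 1)
```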
7
Distance Measures
Symmetric KL-Divergence: Uniformity, W-Vacuous, Background
Cosine Dissimilarity: Uniformity, W-Vacuous, Background
Correlation Coefficient: Uniformity, W-Vacuous, Background
The first distance measure is the symmetric KL-divergence, which I've shown earlier. The uniformity, vacuousness, and background of a topic are computed under the KL distance and denoted accordingly. A measure of similarity, S_COS, is defined by the cosine of the angle between the two feature vectors. To construct a cosine-based dissimilarity, or distance metric, D_COS, the cosine similarity is subtracted from 1. The dissimilarity takes the value 0 if the two vectors are identical, while unrelated (orthogonal) vectors result in a distance of 1. Under the cosine dissimilarity, the three criteria of topic insignificance are denoted analogously. The correlation coefficient is a numerical descriptive statistic that measures the strength of the linear dependence between two random variables; it is obtained by dividing the covariance of the two variables by the product of their standard deviations. Subtracting it from 1 constructs a correlation-based distance measure that is bounded by the closed interval [0, 2]. Independent and negatively related variables result in distances greater than or equal to one. This fits the definition of our problem, since semantic relatedness between topics is evinced by positive correlations only. Thus, the correlation-based distance measure is used to quantify the insignificance of an inferred topic by computing the correlation between the topic description and the three junk/insignificant topics.
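A sketch of the three measures as defined above (illustrative, not the authors' code); the small-epsilon smoothing added to keep the KL terms finite is an assumption.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    # Symmetric KL divergence: 0.5 * (KL(p || q) + KL(q || p))
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def cosine_dissim(p, q):
    # 1 - cosine similarity: 0 for identical directions, 1 for orthogonal vectors
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

def corr_dist(p, q):
    # 1 - Pearson correlation: bounded by [0, 2]; values >= 1 indicate
    # independent or negatively correlated vectors
    return 1.0 - np.corrcoef(p, q)[0, 1]

# For example, the three insignificance measures of topic k under symmetric KL:
#   U_KL(k) = sym_kl(phi[k], w_uniform)
#   V_KL(k) = sym_kl(phi[k], w_vacuous)
#   B_KL(k) = sym_kl(theta[:, k] / theta[:, k].sum(), b_ground)
```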
8
Topic Significance Ranking
Multi-Criteria Weighted Combination: 4 phases
Standardization procedure: transfer distances into standardized measures (scores and weights)
Given the three J/I topic definitions of topic significance, each quantified by three distance measures, the information from these multi-criteria measures must be combined to form a single index of evaluation. Because of the different scales upon which these criteria are measured, the measures have to be standardized before combination. This is accomplished by a simple form of weighted linear combination in which each score is first standardized and then multiplied by a specified weight before it is combined with the other scores to compute the final score. In this work, a 4-phase weighted linear combination approach is used. In the first phase, two standardization procedures are performed to transfer each distance measure from its true value to a standardized score. The standardized measurements of each topic are then combined into a single figure for each J/I definition during the second phase. In the third phase, two different techniques of weighted linear combination are performed to combine the J/I scores. As a result, two WLC figures are computed for each topic, from which the final topic significance score is constructed.
9
Topic Significance Ranking
4 phases (Continued)
Intra-Criterion Weighted Combination: combine the standardized measures of each J/I definition
Inter-Criteria Weighted Combination: combine the J/I scores and weights
Topic Rank
[Diagram: the Uniformity scores S¹_U(k), S²_U(k), W-Vacuous scores S¹_V(k), S²_V(k), and Background scores S¹_B(k), S²_B(k) of each topic are combined into its final TSR]
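To make the flow of the four phases concrete, here is a schematic sketch. The paper specifies particular standardization procedures and weighting schemes; the min-max standardization, simple averaging, and uniform weights below are placeholder assumptions, not the actual TSR formulas.

```python
import numpy as np

def topic_significance_rank(criterion_scores, criterion_weights=None):
    """Combine multi-criteria distance scores into one TSR value per topic.

    criterion_scores : dict mapping 'U', 'V', 'B' to an (n_measures x K) array
                       of distances of each topic from that J/I distribution
                       under each distance measure (KL, cosine, correlation).
    Returns a length-K array; higher values indicate more significant topics.
    """
    if criterion_weights is None:
        criterion_weights = {c: 1.0 / len(criterion_scores) for c in criterion_scores}

    K = next(iter(criterion_scores.values())).shape[1]
    tsr = np.zeros(K)
    for crit, mat in criterion_scores.items():
        # Phase 1: standardize each measure across topics (min-max, an assumption)
        lo = mat.min(axis=1, keepdims=True)
        hi = mat.max(axis=1, keepdims=True)
        std = (mat - lo) / np.maximum(hi - lo, 1e-12)
        # Phase 2: intra-criterion combination of the standardized measures
        crit_score = std.mean(axis=0)
        # Phases 3-4: inter-criteria weighted combination into the final rank
        tsr += criterion_weights[crit] * crit_score
    return tsr
```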
10
Experimental Results: Simulated Data
11
20 NewsGroups: Top 10 Significant Topics
12
20 NewsGroups: Lowest 10 Significant Topics
13
NIPS Top 10 Significant Topics
14
NIPS Lowest 10 Significant Topics
15
Individual vs. Combined Score
Simulated Data
16
Individual vs. Combined Score
20 NewsGroups
17
Conclusions and Future Work
Unsupervised numerical quantification of the topics' semantic significance
Novel post-analysis of LDA modeling
Three J/I topic distributions
4-phase weighted combination approach
Future directions:
Analysis of TSR sensitivity to the combination approach, the setting of K, and the weights
More J/I definitions
A tool to visualize topic evolution in an online setting