Topic Significance Ranking for LDA Generative Models

Presentation transcript:

Topic Significance Ranking for LDA Generative Models
Loulwah AlSumait, Daniel Barbará, James Gentle, Carlotta Domeniconi
ECML PKDD, Bled, Slovenia, September 7-11, 2009

Agenda
Introduction
Junk/Insignificant topic definitions
Distance measures
4-phase Weighted Combination Approach
Experimental results
Conclusions and future work

Latent Dirichlet Allocation (LDA) Model (Blei, Ng, & Jordan, 2003)
Probabilistic generative model: hidden variables (topics) are associated with the observed text
Dirichlet priors on the document and topic distributions
Exact inference is intractable, so approximation approaches are used
Input: K. Output: Φ, θ
[Plate diagram: α → θ_d → z_i → w_i, with β → φ_k; plates over the D documents, the N_d words of each document, and the K topics. Arrows read top-down for the generative process and bottom-up for inference.]

The LDA probabilistic topic model (PTM) is a three-level hierarchical Bayesian network that represents the generative probabilistic model of a corpus of documents. It relates words and documents through latent topics: documents are multinomial distributions over topics, and topics are multinomial distributions over a fixed vocabulary of words. Documents are not directly linked to the words; rather, this relationship is governed by additional latent variables, z, introduced to represent the responsibility of a particular topic for using a word in a document. The completeness of LDA's generative process for documents is achieved by placing Dirichlet priors on the document distributions over topics and on the topic distributions over words. Because an exact approach to estimating Φ is intractable, sophisticated approximations are usually used; Griffiths and Steyvers [6] proposed Gibbs sampling as a simple and effective strategy for estimating Φ and θ. This approach has been successfully applied to find useful structures in many kinds of documents, including emails [60], scientific literature [12, 23, 44], libraries of digital books, and news archives [4, 61].
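To make the generative process concrete, here is a minimal sketch in Python; the corpus sizes and Dirichlet hyperparameter values are illustrative assumptions, not the paper's settings.

import numpy as np

rng = np.random.default_rng(0)
D, K, W, N_d = 100, 5, 1000, 50                      # documents, topics, vocabulary size, words per document (assumed)
alpha, beta = 0.1, 0.01                              # Dirichlet hyperparameters (assumed)

theta = rng.dirichlet(np.full(K, alpha), size=D)     # theta: per-document distributions over topics
phi = rng.dirichlet(np.full(W, beta), size=K)        # Phi: per-topic distributions over the vocabulary

corpus = []
for d in range(D):
    z = rng.choice(K, size=N_d, p=theta[d])          # draw a latent topic z_i for each word slot
    doc = [int(rng.choice(W, p=phi[k])) for k in z]  # draw word w_i from the chosen topic's distribution
    corpus.append(doc)

Inference reverses this process: given only the corpus, an approximation method such as Gibbs sampling recovers estimates of Φ and θ.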

Topic Significance Ranking
The setting of K has a critical effect on the inferred topics
Most previous work manually examines the topics
Goal: quantify the semantic significance of topics, i.e., how different a topic's distribution is from junk/insignificant topic distributions

The quality of the topic model and the interpretability of the estimated topics are directly affected by the setting of the number of latent variables, K, which is extremely critical. Models with very few topics result in broad topic definitions that may be mixtures of two or more distributions. On the other hand, models with too many topics are expected to have very specific descriptions that are uninterpretable [52]. Since the actual number of underlying topics is unknown and there is no definite and efficient approach to estimate it accurately, the inferred topics of a PTM do not always represent meaningful themes. Although LDA is widely investigated and heavily cited in the literature, no prior research has provided an automatic analysis of the discovered topics to validate their importance and genuineness; almost all previous work manually examines the output to identify genuine topics.

Topic Significance Ranking: Example on the 20 Newsgroups dataset

Junk/Insignificant Topic Definitions
Uniform Distribution Over Words (W-Uniform): p(w_i | Ω_U) = 1/W for all words; the uniformity of a topic is its distance from this distribution, U(k) = D(φ_k, Ω_U)
Vacuous Semantic Distribution (W-Vacuous): the empirical (marginal) word distribution, p(w_i | Ω_V) = Σ_k p(w_i | k) p(k); the vacuousness of a topic is V(k) = D(φ_k, Ω_V)
Background Distribution (B-Ground): equally probable in all documents, p(d_j | Ω_B) = 1/D; the background of a topic is B(k) = D(θ^(k), Ω_B), where θ^(k) is the topic's distribution over documents

In practice, when writing a document, authors tend to use words from a specific pool of terminology that represents the concept(s) the document is intended to focus on. A genuine topic is therefore expected to be modeled by a distribution that is skewed toward a small set of words out of the total dictionary. This terminology, the set of words that are highly probable under a specific concept, is called the "salient words" of the topic. Conversely, a topic distribution under which a large number of terms are highly probable is more likely to be insignificant, or "junk". To illustrate this, the number of salient terms of topics estimated by LDA on the 20-Newsgroups dataset was computed. Under LDA, these words can be identified as the ones that have the highest conditional probability under a topic k. The total salience of the topic is quantified by summing the conditional probabilities of its salient words, and the number of words for which the total salience equals a specified percentage, X, of the total topic probability is averaged over all topics. Most of the topic density corresponds to less than 3% of the total vocabulary. Under this frame, an extreme version of a junk topic takes the form of a uniform distribution over the dictionary. The degree of uniformity, U, of an estimated topic φ_k can be quantified by computing its distance from the W-Uniform junk distribution; the farther a topic description is from the uniform distribution over the dictionary, the higher its significance, and vice versa.

The empirical distribution is a convex combination of the probability distributions of the underlying themes, and it reveals no significant information if taken as a whole. A real topic's distribution is expected to have a unique characteristic rather than being a mixture model. Accordingly, this provides another approach to evaluating the importance of the estimated topics: the closer a topic distribution is to the empirical distribution of the sample, the less significant it is expected to be. So the second junk topic introduced in this work, the vacuous semantic distribution (W-Vacuous), is defined to be the empirical distribution of the sample set, which is equivalent to the marginal distribution of words over the latent variables. To detect junk topics, the vacuous semantic score, V, of a topic is measured by computing the distance between the estimated distribution and the W-Vacuous.

The previous two definitions of junk topics are characterized by their distributions over words. However, investigating the distribution of topics over documents identifies another class of insignificant topics. In real datasets, well-defined topics are usually covered in a subset (not all) of the documents. If a topic is estimated to be responsible for generating words in a wide range of documents, or in all documents in the extreme case, then it is far from having a definite and authentic identity; such topics are most likely constructed of background terms that are irrelevant to the domain structure. To show reasonable significance, a topic is required to be far enough from being a "background topic", defined as a topic that has a nonzero weight in all the documents. In the extreme case, the background topic (B-Ground) is found equally probable in all documents. The distance between a topic and the B-Ground topic determines how much "background" it carries and, ultimately, grades the significance of the topic. All three reference distributions can be computed directly from the model estimates, as sketched below.
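A minimal sketch of how the three reference distributions could be built from an estimated model, assuming Phi is the K x W topic-word matrix and Theta the D x K document-topic matrix; estimating p(k) by averaging the rows of Theta is an illustrative simplification.

import numpy as np

def junk_distributions(Phi, Theta):
    K, W = Phi.shape
    D = Theta.shape[0]
    w_uniform = np.full(W, 1.0 / W)       # W-Uniform: probability 1/W for every word
    p_k = Theta.mean(axis=0)              # empirical topic proportions p(k) (simplified estimate)
    w_vacuous = p_k @ Phi                 # W-Vacuous: marginal p(w) = sum_k p(w|k) p(k)
    b_ground = np.full(D, 1.0 / D)        # B-Ground: equally probable in all documents
    return w_uniform, w_vacuous, b_ground

Uniformity and vacuousness compare a topic's word distribution Phi[k] against w_uniform and w_vacuous; background compares the topic's normalized distribution over documents, Theta[:, k] / Theta[:, k].sum(), against b_ground.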

Distance Measures
Symmetric KL-Divergence: D_KL(p, q) = ½ [ Σ_i p_i log(p_i / q_i) + Σ_i q_i log(q_i / p_i) ], giving U_KL(k), V_KL(k), and B_KL(k)
Cosine Dissimilarity: D_COS(p, q) = 1 - S_COS(p, q), giving U_COS(k), V_COS(k), and B_COS(k)
Correlation Coefficient distance: D_COR(p, q) = 1 - ρ(p, q), giving U_COR(k), V_COR(k), and B_COR(k)

The first distance measure is the symmetric KL-divergence; the uniformity, vacuousness, and background of a topic computed under the KL distance are denoted accordingly. A measure of similarity, S_COS, is defined by the cosine of the angle between the two feature vectors; to construct a cosine-based dissimilarity metric, D_COS, the cosine similarity is subtracted from 1. The dissimilarity takes the value 0 if the two vectors are identical, while unrelated (orthogonal) vectors result in a distance of 1. The correlation coefficient is a numerical descriptive statistic that measures the strength of the linear dependence between two random variables; it is obtained by dividing the covariance of the two variables by the product of their standard deviations. Subtracting it from 1 constructs a correlation-based distance measure bounded by the closed interval [0, 2]; independent and negatively related variables result in distances greater than or equal to one. This fits the definition of the problem, since semantic relatedness between topics is evinced by positive correlations only. Thus, the correlation-based distance is used to quantify the insignificance of an inferred topic by computing its correlation with each of the three junk/insignificant topic distributions. A minimal sketch of the three measures follows.
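The sketch assumes p and q are probability vectors of equal length (for example, a topic φ_k versus one of the junk distributions); the epsilon smoothing is an added assumption to keep the KL terms finite.

import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps               # smoothing avoids log(0) on zero entries
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def cosine_dissimilarity(p, q):
    # 0 for identical directions, 1 for orthogonal (unrelated) vectors
    return 1.0 - (p @ q) / (np.linalg.norm(p) * np.linalg.norm(q))

def correlation_distance(p, q):
    # 1 - Pearson correlation: bounded by [0, 2]; >= 1 for independent
    # or negatively related vectors
    return 1.0 - np.corrcoef(p, q)[0, 1]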

Topic Significance Ranking: Multi-Criteria Weighted Combination
4 phases
Standardization procedure: transfer distances into standardized measures (scores)
Weights

Given the three J/I definitions of topic significance, each quantified by three distance measures, the information from these multi-criteria measures must be combined into a single index of evaluation. Because of the different scales upon which these criteria are measured, the measures must be standardized before combination. This is accomplished by a simple form of weighted linear combination (WLC) in which each score is first standardized and then multiplied by a specified weight before it is combined with the other scores to compute the final score. In this work, a 4-phase weighted linear combination approach is used. In the first phase, two standardization procedures are performed to transfer each distance measure from its true value to a standardized score. The standardized measurements of each topic are then combined into a single figure for each J/I definition during the second phase. In the third phase, two different techniques of weighted linear combination are performed to combine the J/I scores. As a result, two WLC figures for each topic are computed, from which the final score of the topic significance is constructed.

Topic Significance Ranking: 4 Phases (Continued)
Intra-Criterion Weighted Combination: combine the standardized measures of each J/I definition
Inter-Criteria Weighted Combination: combine the J/I scores and weights into the final Topic Rank
[Diagram: the uniformity scores S_U(k), the W-Vacuous scores S_V(k), and the background scores S_B(k) are combined into the final TSR score for each topic; a sketch follows.]
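A compressed sketch of the combination pipeline; the paper uses two standardization procedures and two WLC techniques per phase, whereas this sketch collapses each phase to a single z-score and mean, so the equal criterion weights are an illustrative assumption.

import numpy as np

def topic_significance_rank(dist_U, dist_V, dist_B, weights=(1/3, 1/3, 1/3)):
    # Each dist_* is a (num_topics x num_measures) array of raw distances
    # for one J/I criterion: uniformity, vacuousness, background.
    scores = []
    for X in (dist_U, dist_V, dist_B):
        Z = (X - X.mean(axis=0)) / X.std(axis=0)   # phase 1: standardize each distance measure
        scores.append(Z.mean(axis=1))              # phase 2: intra-criterion combination
    S = np.column_stack(scores)
    return S @ np.asarray(weights)                 # phases 3-4: inter-criteria weighted combination

Topics can then be ranked by sorting the returned TSR scores in descending order.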

Experimental Results: Simulated Data

20 Newsgroups: Top 10 Significant Topics

20 Newsgroups: 10 Least Significant Topics

NIPS: Top 10 Significant Topics

NIPS: 10 Least Significant Topics

Individual vs. Combined Score: Simulated Data

Individual vs. Combined Score: 20 Newsgroups

Conclusions and Future Work
Unsupervised numerical quantification of the topics' semantic significance
Novel post-analysis for LDA modeling
Three J/I topic distributions
4-phase weighted combination approach
Future directions:
Analysis of TSR sensitivity to the approach, to K, and to the weight settings
More J/I definitions
A tool to visualize topic evolution in an online setting