TopicXP: Exploring Topics in Source Code using Latent Dirichlet Allocation
Trevor Savage, Bogdan Dit, Malcom Gethers and Denys Poshyvanyk
26th IEEE International Conference on Software Maintenance
Timişoara, Romania
September 16, 2010
Good evening. My name is Malcom Gethers and I’m a PhD student in the SEMERU Group at the College of William and Mary. Today I will be demonstrating a tool we developed, TopicXP. The tool assists developers with program understanding by using topics, which are obtained by modeling source code with the topic model Latent Dirichlet Allocation (LDA). In addition, the tool leverages structural information to help developers understand how topics in the source code relate to one another. Before I begin the demonstration, let me give you some background on LDA and on Maximal Weighted Entropy (MWE), a cohesion metric that the tool implements and uses.
Latent Dirichlet Allocation (LDA)
LDA is a topic model that represents each document as a probabilistic mixture of topics. The model is emerging as a useful tool for various software maintenance tasks. As input, LDA accepts a collection of documents, each of which is a collection of words, together with a parameter indicating the desired number of topics. From these it infers the topics. Each topic is represented as a probability distribution over the set of terms that appear in the collection. For example, a topic related to a given term assigns that term a higher probability than it assigns to other terms in the corpus. Once the topics are inferred, LDA models each document as a probability distribution over the set of topics. So a document that discusses a particular topic has a high probability for that topic.
Probabilistic topic models (Latent Dirichlet Allocation, LDA [Blei’03])
Models documents as mixtures of topics
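To make the two kinds of distributions concrete, here is a minimal sketch in Python of how topics might be inferred from a small corpus. It uses collapsed Gibbs sampling, which is one common way to fit LDA and is not necessarily the inference method TopicXP uses. The function name lda_gibbs, the hyperparameters alpha and beta, and the toy documents (word lists standing in for the identifiers of four methods) are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling (illustrative sketch only).

    docs is a list of documents, each a list of words.
    Returns (vocab, phi, theta):
      phi[k][v]   estimated probability of term v under topic k
      theta[d][k] estimated probability of topic k in document d
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    num_terms = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]              # words in doc d assigned to topic k
    topic_word = [[0] * num_terms for _ in range(num_topics)]  # times term v is assigned to topic k
    topic_total = [0] * num_topics                             # words assigned to topic k overall
    assignments = []                                           # topic of every word occurrence

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Repeatedly resample each assignment conditioned on all the others.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + num_terms * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates of the topic-term and document-topic distributions.
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + num_terms * beta)
         for v in range(num_terms)]
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return vocab, phi, theta


# Hypothetical corpus: each "document" is the bag of words of one method.
docs = [
    ["file", "open", "read", "file", "close"],
    ["file", "write", "close", "file"],
    ["socket", "connect", "send", "socket"],
    ["socket", "receive", "connect", "close"],
]
vocab, phi, theta = lda_gibbs(docs, num_topics=2)
```

Each row of phi is one topic expressed as a distribution over terms, and each row of theta is one document expressed as a distribution over topics, matching the two outputs described above. Sorting a row of phi by probability gives the top terms that could serve as a label for that topic.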
Maximal Weighted Entropy (MWE)
Maximal Weighted Entropy is a cohesion measure that combines Latent Dirichlet Allocation and information entropy. The metric determines the cohesiveness of a class based on how topics are implemented across the methods of that class. For example, a class in which one topic is consistently discussed in all methods has high cohesion. To measure the cohesion of a class we must analyze the topic distribution of each method within that class. The notions of Occupancy and Distribution capture, respectively, the degree to which a topic is relevant to the class and the entropy of the topic across all methods in the class. So, for each topic, we evaluate the probability of it appearing in each method. With that information, Occupancy and Distribution can be computed for the topic. MWE is then the maximum, over all topics, of the product of Occupancy and Distribution. With this metric we are able to leverage LDA and information entropy to measure the cohesiveness of classes.
Occupancy(tj) captures the average probability of topic tj across the methods of the class
Distribution(tj) captures the distribution of tj across the methods using information entropy
MWE(C) = max over all topics tj of (Occupancy(tj) x Distribution(tj))
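The computation can be sketched in a few lines of Python. The talk does not give the exact formulas, so this sketch makes two assumptions: Occupancy is the plain mean of a topic's probability over the methods of the class, and Distribution is the Shannon entropy of that topic's per-method probabilities (rescaled to sum to 1) divided by the logarithm of the number of methods, so that it lies between 0 and 1. The function names and the example matrices are hypothetical.

```python
import math


def occupancy(method_topic, j):
    """Average probability of topic j over the methods of a class."""
    return sum(row[j] for row in method_topic) / len(method_topic)


def distribution(method_topic, j):
    """Normalized entropy of topic j across the methods (assumed definition).

    Equals 1.0 when the topic is spread evenly over all methods and
    approaches 0.0 when it is concentrated in a single method.
    """
    num_methods = len(method_topic)
    total = sum(row[j] for row in method_topic)
    if num_methods < 2 or total == 0:
        return 0.0
    entropy = 0.0
    for row in method_topic:
        q = row[j] / total  # share of topic j's mass that falls in this method
        if q > 0:
            entropy -= q * math.log(q)
    return entropy / math.log(num_methods)


def mwe(method_topic):
    """Maximum over topics of Occupancy(tj) x Distribution(tj)."""
    num_topics = len(method_topic[0])
    return max(
        occupancy(method_topic, j) * distribution(method_topic, j)
        for j in range(num_topics)
    )


# Rows are methods, columns are topics; each row is a method's topic distribution.
cohesive_class = [
    [0.9, 0.1],
    [0.9, 0.1],
    [0.9, 0.1],
]
scattered_class = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
]

cohesive_score = mwe(cohesive_class)
scattered_score = mwe(scattered_class)
```

Under these assumptions the first class, where every method is dominated by the same topic, scores higher than the second, where each method is dominated by a different topic. Taking the maximum means a class is judged by its single best-supported topic.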
Demonstration
SEMERU @ William and Mary
Thank you. Questions?
http://www.cs.wm.edu/semeru/TopicXP