UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization. Jaegul Choo (1,*), Changhyun Lee (1), Chandan K. Reddy (2), and Haesun Park (1). (1) Georgia Institute of Technology; (2) Wayne State University. (*) E-mail: jaegul.choo@cc.gatech.edu
Intro: Topic Modeling. [Figure: Documents 1 to 4 and the keywords brain, evolve, dna, genetic, gene, nerve, neuron, life, organism.]
Intro: Topic Modeling. Topic: a distribution over keywords. [Figure: Documents 1 to 4; Topics 1 to 3; keywords brain, evolve, dna, genetic, gene, nerve, neuron, life, organism.]
Intro: Topic Modeling. Document: a distribution over topics. Topic: a distribution over keywords. [Figure: keywords brain, evolve, dna, genetic, gene, nerve, neuron, life, organism.]
Latent Dirichlet Allocation (LDA) in Visual Analytics. LDA has been widely used in visual analytics: TIARA [Wei et al., KDD10], iVisClustering [Lee et al., EuroVis12], ParallelTopics [Dou et al., VAST12], TopicViz [Eisenstein et al., CHI-WIP12], … (Images courtesy of the original papers.)
Overview of Our Work. Proposes nonnegative matrix factorization (NMF) for topic modeling. Highlights advantages of NMF over LDA in visual analytics. Presents UTOPIAN, an NMF-based interactive topic modeling system. [Figure panels: topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation.]
What is Nonnegative Matrix Factorization?
Nonnegative Matrix Factorization (NMF). Lower-rank approximation with nonnegativity constraints: $A \approx WH$, obtained from $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$. Each column of $A$ is a document vector. Why nonnegativity? Easy interpretation and semantically meaningful output. Algorithm: alternating nonnegativity-constrained least squares [Kim et al., 2008].
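To make the factorization on this slide concrete, here is a minimal sketch in Python with NumPy. It uses Lee and Seung's multiplicative update rules rather than the alternating nonnegativity-constrained least squares algorithm cited on the slide, because the multiplicative form fits in a few lines. The function name `nmf_multiplicative`, the toy keyword-by-document counts, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, seed=0, eps=1e-10):
    """Approximate a nonnegative matrix A (m x n) as W @ H with W, H >= 0.

    Uses multiplicative updates (a swapped-in technique, not the ANLS
    algorithm cited on the slide).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps  # m x k, one column per topic
    H = rng.random((k, n)) + eps  # k x n, one column per document
    for _ in range(n_iter):
        # Each update keeps the factor nonnegative and does not increase
        # the Frobenius-norm error.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy keyword-by-document counts (made up for illustration only).
keywords = ["brain", "evolve", "dna", "genetic", "gene",
            "nerve", "neuron", "life", "organism"]
A = np.array([
    [3, 0, 0, 1],  # brain
    [0, 2, 0, 1],  # evolve
    [0, 0, 3, 0],  # dna
    [0, 0, 2, 0],  # genetic
    [0, 0, 3, 0],  # gene
    [2, 0, 0, 1],  # nerve
    [3, 0, 0, 0],  # neuron
    [0, 3, 0, 1],  # life
    [0, 2, 0, 1],  # organism
], dtype=float)

W, H = nmf_multiplicative(A, k=3)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
print(f"relative approximation error: {rel_err:.3f}")
```

Different random seeds can lead to different local minima, so the sketch fixes the seed for repeatability.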
NMF as Topic Modeling. $A \approx WH$, where the columns of $A$ are Documents 1 to 4. Each column of $W$ is a topic (a distribution over keywords), and each column of $H$ is a document (a distribution over topics). [Figure: Topics 1 to 3 over the keywords brain, evolve, dna, genetic, gene, nerve, neuron, life, organism.]
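The reading of the two factors can be sketched as follows. The matrices `W` and `H` are hand-written, hypothetical values (not the output of a real run), and the helper names `top_keywords` and `document_topic_distribution` are assumptions for illustration. The sketch also assumes every document has a nonzero weight on at least one topic.

```python
import numpy as np

keywords = ["brain", "evolve", "dna", "genetic", "gene",
            "nerve", "neuron", "life", "organism"]

# Hypothetical factors, written by hand for illustration.
W = np.array([          # keyword-by-topic weights
    [0.9, 0.0, 0.0],    # brain
    [0.0, 0.7, 0.0],    # evolve
    [0.0, 0.0, 0.9],    # dna
    [0.0, 0.0, 0.6],    # genetic
    [0.0, 0.1, 0.8],    # gene
    [0.6, 0.0, 0.0],    # nerve
    [0.8, 0.0, 0.0],    # neuron
    [0.0, 0.9, 0.1],    # life
    [0.0, 0.6, 0.0],    # organism
])
H = np.array([          # topic-by-document weights
    [2.0, 0.0, 0.0, 0.5],
    [0.0, 1.5, 0.0, 1.0],
    [0.0, 0.0, 3.0, 0.5],
])

def top_keywords(W, keywords, n_top=3):
    """Return the n_top highest-weighted keywords of each topic (column of W)."""
    order = np.argsort(-W, axis=0)[:n_top]  # rows: rank, columns: topic
    return [[keywords[i] for i in order[:, t]] for t in range(W.shape[1])]

def document_topic_distribution(H):
    """Normalize each column of H so that it sums to 1."""
    return H / H.sum(axis=0, keepdims=True)

topics = top_keywords(W, keywords)
doc_dist = document_topic_distribution(H)
membership = doc_dist.argmax(axis=0)  # dominant topic of each document

for t, words in enumerate(topics, start=1):
    print(f"Topic {t}: {', '.join(words)}")
print("dominant topic per document:", membership + 1)
```

Taking the largest entry of each column of `H` is one simple way to assign each document to a single topic cluster.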
Why NMF in Visual Analytics?
Advantages of NMF in Visual Analytics: reliable algorithmic behaviors; flexible support for user interactions.
NMF vs. LDA: Consistency from Multiple Runs. [Figure: documents' topical membership changes among 10 runs; panels: InfoVis/VAST paper data set, 20 Newsgroups data set.]
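One way to count topical membership changes between two runs is sketched below. The slide does not spell out the exact metric, so this is an assumption: topic labels are arbitrary across runs, so the sketch first finds the relabeling of the second run that best matches the first, then counts the documents that still disagree. The function name `membership_changes` and the toy runs are hypothetical.

```python
from itertools import permutations

def membership_changes(reference, other, n_topics):
    """Count documents whose topic differs between two runs.

    `other` is relabeled with the topic permutation that best matches
    `reference`, so a pure renaming of topics counts as zero changes.
    Brute force over permutations: only suitable for a small n_topics.
    """
    best = len(reference)
    for perm in permutations(range(n_topics)):
        changed = sum(1 for r, o in zip(reference, other) if r != perm[o])
        best = min(best, changed)
    return best

# Toy memberships of six documents in two runs (made up for illustration).
run_a = [0, 0, 1, 1, 2, 2]
run_b = [2, 2, 0, 0, 1, 0]  # same grouping under new labels, except the last document

print(membership_changes(run_a, run_b, n_topics=3))  # prints 1
```

For many topics, the brute-force search could be replaced by an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment`.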
NMF vs. LDA: Empirical Convergence. [Figure: documents' topical membership changes between iterations, InfoVis/VAST paper data set; NMF: 48 seconds, LDA: 10 minutes.]
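The quantity plotted on this slide can be sketched as a count of documents whose dominant topic changes from one iteration to the next. The snapshots below are hand-written, hypothetical values of a 2-topic by 4-document matrix `H`, and the function name `changes_between_iterations` is an assumption for illustration.

```python
import numpy as np

def changes_between_iterations(H_snapshots):
    """For consecutive snapshots of H (topics x documents), count the
    documents whose dominant topic changed."""
    memberships = [H.argmax(axis=0) for H in H_snapshots]
    return [int((prev != curr).sum())
            for prev, curr in zip(memberships, memberships[1:])]

# Hypothetical snapshots of H after three successive iterations.
snapshots = [
    np.array([[0.6, 0.4, 0.5, 0.1],
              [0.4, 0.6, 0.6, 0.9]]),
    np.array([[0.8, 0.7, 0.3, 0.1],
              [0.2, 0.3, 0.7, 0.9]]),
    np.array([[0.9, 0.8, 0.2, 0.1],
              [0.1, 0.2, 0.8, 0.9]]),
]

print(changes_between_iterations(snapshots))  # prints [1, 0]
```

A count that drops to zero and stays there suggests that the cluster assignments have stabilized, even if the factor values are still changing slightly.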
NMF vs. LDA: Topic Summary (Top Keywords), InfoVis/VAST paper data set. Topics are more consistent in NMF than in LDA. Topic quality is comparable between NMF and LDA. [Table: top keywords of Topics 1 to 7 for Runs #1 and #2 of NMF and of LDA; keywords shown include visualization, design, information, user, analysis, system, graph, layout, visual, analytics, data, sets, color, weaving, documents, similarities, knowledge, edge, query, collaborative, social, tree, measures, multivariate, animation, dimensions, treemap, analysts, scatterplot, spatial, text, multidimensional, high, aggregation.]
Advantages of NMF in Visual Analytics: reliable algorithmic behaviors; flexible support for user interactions.
Weakly Supervised NMF [Choo et al., DMKD, accepted with rev.]. $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$, where $W_r$, $H_r$ are reference matrices for $W$ and $H$, and $M_W$, $M_H$ are diagonal matrices for weighting/masking columns/rows of $W$ and $H$. Provides flexible yet intuitive means for user interaction. Maintains the same computational complexity as original NMF.
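A simplified sketch of how a reference matrix can steer the factorization follows. It is not the algorithm of the cited paper. It keeps only the reference term on W (that is, it sets β = 0 and ignores the H term), and it uses multiplicative updates instead of alternating nonnegative least squares. The function name `weakly_supervised_nmf`, the toy data, and the parameter values are assumptions for illustration.

```python
import numpy as np

def weakly_supervised_nmf(A, W_r, M_W, alpha, n_iter=300, seed=0, eps=1e-10):
    """Sketch: minimize ||A - WH||_F^2 + alpha * ||(W - W_r) M_W||_F^2
    over W, H >= 0. The beta-weighted H term is omitted (beta = 0)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = W_r.shape[1]
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    M2 = M_W @ M_W  # M_W is diagonal, so this squares the per-topic weights

    def objective():
        return (np.linalg.norm(A - W @ H) ** 2
                + alpha * np.linalg.norm((W - W_r) @ M_W) ** 2)

    history = [objective()]
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        # The reference term adds alpha * W_r M_W^2 to the numerator and
        # alpha * W M_W^2 to the denominator of the usual W update.
        W *= (A @ H.T + alpha * W_r @ M2) / (W @ H @ H.T + alpha * W @ M2 + eps)
        history.append(objective())
    return W, H, history

# Toy keyword-by-document counts (made up). Row order: brain, evolve, dna,
# genetic, gene, nerve, neuron, life, organism.
A = np.array([
    [3, 0, 0, 1], [0, 2, 0, 1], [0, 0, 3, 0],
    [0, 0, 2, 0], [0, 0, 3, 0], [2, 0, 0, 1],
    [3, 0, 0, 0], [0, 3, 0, 1], [0, 2, 0, 1],
], dtype=float)

# Hypothetical user input: topic 0 should emphasize brain, nerve, neuron.
W_r = np.zeros((9, 3))
W_r[[0, 5, 6], 0] = 1.0
M_W = np.diag([1.0, 0.0, 0.0])  # constrain topic 0 only; others stay free

W, H, history = weakly_supervised_nmf(A, W_r, M_W, alpha=5.0)
print(f"objective: {history[0]:.2f} -> {history[-1]:.2f}")
```

Setting a diagonal entry of `M_W` to zero leaves that topic unconstrained, which is how masking restricts the user's guidance to selected topics.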
UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF. [Figure panels: topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation.]
UTOPIAN Overview. Supervised t-distributed stochastic neighbor embedding (t-SNE). User interactions supported: keyword refinement, topic merging/splitting, keyword-/document-induced topic creation. Real-time interaction via PIVE (Per-Iteration Visualization Environment). Just like In-Spire, documents are represented as dots, and their colors represent their topic cluster membership.
Supervised t-SNE. Original t-SNE: documents are often too noisy to work with. Supervised t-SNE: $d(x_i, x_j) \leftarrow \alpha \cdot d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic cluster.
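The distance adjustment on this slide can be sketched as below. The sketch assumes Euclidean distances and a factor α between 0 and 1, so that documents in the same topic cluster are pulled closer together. The slide states neither choice, and the function name `supervised_distances` is hypothetical.

```python
import numpy as np

def supervised_distances(X, labels, alpha=0.5):
    """Pairwise Euclidean distances, scaled by alpha for pairs of points
    that share a topic cluster label."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diff = X[:, None, :] - X[None, :, :]        # shape (n, n, d)
    D = np.sqrt((diff ** 2).sum(axis=-1))       # shape (n, n)
    same = labels[:, None] == labels[None, :]   # True where clusters match
    return np.where(same, alpha * D, D)

# Three toy documents in 2-D; the first two share a topic cluster.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
labels = [0, 0, 1]

D = supervised_distances(X, labels, alpha=0.5)
print(D)
```

The adjusted matrix could then be passed to any t-SNE implementation that accepts precomputed distances, for example scikit-learn's `TSNE` with `metric="precomputed"`.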
PIVE (Per-Iteration Visualization Environment) for Real-time Interaction [Choo et al., under revision]. [Figure: standard approach vs. PIVE approach.]
Demo Video http://tinyurl.com/UTOPIAN2013
Usage Scenario: Hyundai Genesis Review Data. [Figure: initial result vs. result after interaction.]
Summary. Presented UTOPIAN, a user-driven topic modeling system based on interactive NMF. Highlighted the advantages of NMF over LDA in visual analytics: (1) reliable algorithmic behaviors (consistency from multiple runs, early empirical convergence); (2) flexible support for user interactions (keyword refinement, topic merging/splitting, keyword-/document-induced topic creation).
More in the Paper & Ongoing Work. In the paper: a general taxonomy of user interactions with computational methods (keyword-based vs. document-based; template-based vs. from-scratch-based); algorithmic details about supported user interactions; implementation details; more usage scenarios. Ongoing work: scaling up the system with parallel distributed NMF.
Thank you! http://tinyurl.com/UTOPIAN2013 Jaegul Choo, jaegul.choo@cc.gatech.edu, http://www.cc.gatech.edu/~joyfull/ For more details, please find me at ‘Meet the Candidate’, A601 + A602, 6 PM today.