Visual Analytics for Interactive Exploration of Large-Scale Documents via Nonnegative Matrix Factorization
Jaegul Choo*, Barry L. Drake†, and Haesun Park*
*Georgia Institute of Technology
†Georgia Tech Research Institute
Big Data Innovators Gathering (BIG) 2014
What is Visual Analytics?
Data Mining                 Visualization
Automated                   Interactive (human in the loop)
Clearly defined tasks       Exploratory analysis
Fast computation            Deeper understanding
>Millions of data items     Thousands of data items
What is Visual Analytics? Leveraging Both Worlds
Visual Analytics = Data Mining + Visualization
Data Mining: automated; clearly defined tasks; fast computation; >millions of data items
Visualization: interactive (human in the loop); exploratory analysis; deeper understanding; thousands of data items
Visual Analytics for Large-Scale Documents
UTOPIAN: User-driven Topic Modeling based on Interactive NMF
  Topic merging
  Topic splitting
  Doc-induced topic creation
  Keyword-induced topic creation
VisIRR: Information Retrieval and Personalized Recommender System
Motivation: Too Many Documents to Read
Product reviews
  Which tablet to buy? iPad (2,000 reviews) vs. Galaxy Tab (1,300 reviews)
Research papers
  Which sub-area in data mining to focus on? >Thousands of new papers every year
Patent search
Many other applications
Topic Modeling: Summarizing Documents
Topic: distribution over keywords
Document: distribution over topics
[Figure: Documents 1-4, ... linked to Topics 1-3, which are linked to the keywords gene, dna, life, evolve, organism, brain, neuron, nerve, ...]
Nonnegative Matrix Factorization (NMF)
Low-rank approximation via matrix factorization: $A \approx WH$
$\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$
Why nonnegativity constraints?
  Better interpretation (vs. better approximation, e.g., SVD)
NMF as Topic Modeling
$A \approx WH$
Topic: distribution over keywords (W)
Document: distribution over topics (H)
[Figure: Documents 1-4, ... linked to Topics 1-3, which are linked to the keywords gene, dna, life, evolve, organism, brain, neuron, nerve, ...]
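The factorization above can be tried on a toy example. The following is a minimal sketch, not the authors' implementation: it assumes A is a term-by-document matrix, uses the classic multiplicative update rules (chosen here only for brevity; the slides do not say which solver is used for plain NMF), and the counts in A are made up for illustration.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate nonnegative A (terms x documents) as W @ H with W, H >= 0.

    Uses multiplicative updates (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + 0.1  # strictly positive start
    H = rng.random((k, n)) + 0.1
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

vocab = ["gene", "dna", "life", "evolve", "organism", "brain", "neuron", "nerve"]

# Hypothetical term counts: rows follow vocab, columns are Documents 1-4.
A = np.array([
    [4, 3, 0, 0],
    [3, 4, 0, 0],
    [1, 2, 1, 0],
    [2, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 4, 3],
    [0, 0, 3, 4],
    [0, 0, 2, 3],
], dtype=float)

W, H = nmf_multiplicative(A, k=2)

# Each column of W is a topic: list its highest-weighted keywords.
for t in range(W.shape[1]):
    top = np.argsort(W[:, t])[::-1][:3]
    print(f"Topic {t + 1}:", [vocab[i] for i in top])

# Each column of H is a document: report its dominant topic.
print("Dominant topic per document:", H.argmax(axis=0) + 1)
```

Reading the columns of W as topics and the columns of H as documents is what makes the nonnegative factors directly interpretable as a topic summary.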
Why NMF (instead of LDA)? Consistency from Multiple Runs
Documents' topical membership changes among 10 runs
[Figure: panels for the InfoVis/VAST paper data set and the 20 newsgroup data set]
Why NMF (instead of LDA)? Empirical Convergence
Documents' topical membership changes between iterations
InfoVis/VAST paper data set
  LDA: 10 minutes
  NMF: 48 seconds
NMF vs. LDA Topic Summary (Top Keywords)
InfoVis/VAST paper data set

NMF       Run #1                    Run #2
Topic 1   visualization, design     visualization, design
Topic 2   information, user         information, user
Topic 3   analysis, system          analysis, system
Topic 4   graph, layout             graph, layout
Topic 5   visual, analytics         visual, analytics
Topic 6   data, sets                data, sets
Topic 7   color, weaving            color, weaving

LDA       Run #1                    Run #2
Topic 1   document, similarities    document, query
Topic 2   knowledge, edge           analysts, scatterplot
Topic 3   query, collaborative      spatial, collaborative
Topic 4   social, tree              text, document
Topic 5   measures, multivariate    multidimensional, high
Topic 6   tree, animation           tree, aggregation
Topic 7   dimension, treemap        dimension, treemap

Topics are more consistent in NMF than in LDA.
Topic quality is comparable between NMF and LDA.
UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF [Choo et al., TVCG'13]
  Topic merging
  Topic splitting
  Doc-induced topic creation
  Keyword-induced topic creation
Visualization Example: Car Reviews
Topic summaries are NOT perfect.
UTOPIAN allows user interactions for improving them.
Weakly Supervised NMF: Supporting User Interactions
Weakly supervised NMF [Choo et al., DMKD, accepted with rev.]
$\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$
$W_r$, $H_r$: reference matrices for $W$ and $H$ (user input)
$M_W$, $M_H$: diagonal matrices for weighting/masking columns and rows of $W$ and $H$
Algorithm: block-coordinate descent framework
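To make the role of the reference and masking matrices concrete, here is a rough sketch of the objective above. It makes several simplifying assumptions that are not from the slides: $D_H$ is taken to be the identity, the diagonal matrices $M_W$ and $M_H$ are passed as vectors `m_w` and `m_h`, and multiplicative updates are swapped in for the block-coordinate descent framework the slide names. The function name and the random data are hypothetical.

```python
import numpy as np

def weakly_supervised_nmf(A, W_r, H_r, m_w, m_h, alpha, beta,
                          n_iter=300, eps=1e-9, seed=0):
    """Minimize ||A - WH||_F^2 + alpha ||(W - W_r) M_W||_F^2
                + beta ||M_H (H - H_r)||_F^2   subject to W, H >= 0.

    Assumptions of this sketch: D_H = I, and multiplicative updates
    replace the block-coordinate descent solver."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = W_r.shape[1]
    W = rng.random((m, k)) + 0.1
    H = rng.random((k, n)) + 0.1
    mw2 = m_w ** 2              # diagonal of M_W^2, weights columns of W
    mh2 = (m_h ** 2)[:, None]   # diagonal of M_H^2, weights rows of H
    for _ in range(n_iter):
        W *= (A @ H.T + alpha * W_r * mw2) / (W @ (H @ H.T) + alpha * W * mw2 + eps)
        H *= (W.T @ A + beta * mh2 * H_r) / ((W.T @ W) @ H + beta * mh2 * H + eps)
    return W, H

# Hypothetical data: 20 terms, 30 documents, 3 topics.
rng = np.random.default_rng(1)
A = rng.random((20, 30))
k = 3

# The user supplies a reference keyword distribution for topic 1 only.
W_r = np.zeros((20, k))
W_r[:, 0] = rng.random(20)
H_r = np.zeros((k, 30))
m_w = np.array([1.0, 0.0, 0.0])  # mask: only column 1 of W is supervised
m_h = np.zeros(k)                # no supervision on H

W_sup, H_sup = weakly_supervised_nmf(A, W_r, H_r, m_w, m_h, alpha=100.0, beta=0.0)
W_free, H_free = weakly_supervised_nmf(A, W_r, H_r, m_w, m_h, alpha=0.0, beta=0.0)

dist_sup = np.linalg.norm(W_sup[:, 0] - W_r[:, 0])
dist_free = np.linalg.norm(W_free[:, 0] - W_r[:, 0])
print("distance to reference with supervision:   ", dist_sup)
print("distance to reference without supervision:", dist_free)
```

With a nonzero mask entry, the supervised topic is pulled toward the user's reference column while the unmasked topics remain free to fit the data, which is how an interaction such as refining topic keywords could be expressed in this formulation.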
Interaction Demo Video
InfoVis-VAST paper data
[Figure: document view before interaction, and after topic splitting (triangle) and topic merging (circle)]
VisIRR: Information Retrieval and Personalized Recommender System
Features: Efficient Large-scale Data Processing
Document corpus: ~400,000 academic papers in CS
Data management
  Structured data: author, year, venue, keywords, citation/reference count
  Unstructured data: bag-of-words vectors of title, abstract, keywords
  Graph data: content, citation, and co-authorship
Efficient data handling
  Dynamic loading from disk to memory via a cache-like strategy
  Scalable data expansion in O(n)
Features: Personalized Recommendation
Works based on user preferences on documents
  Preference scale of 1 (highly dislike) to 5 (highly like)
Various recommendation schemes
  Based on content, citation network, and co-authorship
Algorithm
  Preference propagation on graph using heat kernel
  $r_\alpha = \alpha \sum_k (1 - \alpha)^k f W^k$
  $r_\alpha$ is a recommendation score vector with control parameter $\alpha$, $f$ is the user-assigned rating vector, and $W$ is the input graph
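The propagation formula can be evaluated directly by accumulating the series term by term. The sketch below rests on assumptions not stated on the slide: the sum starts at k = 0 and is truncated after `n_terms` terms, W is a row-normalized adjacency matrix so the series converges, f is a row vector with 0 for unrated documents, and the five-node graph is invented for illustration.

```python
import numpy as np

def propagate_preferences(f, W, alpha, n_terms=200):
    """Truncated series r = alpha * sum_{k=0}^{n_terms-1} (1 - alpha)^k f W^k."""
    r = np.zeros(len(f), dtype=float)
    term = np.asarray(f, dtype=float)      # f W^0
    for k in range(n_terms):
        r += alpha * (1 - alpha) ** k * term
        term = term @ W                    # advance to f W^(k+1)
    return r

# Hypothetical undirected graph over 5 documents (e.g. citation links).
adj = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
W_graph = adj / adj.sum(axis=1, keepdims=True)  # row-normalize (assumption)

# The user rates document 1 as 5 (highly like); the rest are unrated.
f = np.array([5.0, 0.0, 0.0, 0.0, 0.0])
alpha = 0.3

r = propagate_preferences(f, W_graph, alpha)
ranking = np.argsort(r)[::-1]
print("scores:", np.round(r, 3))
print("documents ranked by score:", ranking + 1)
```

Under the row-normalization assumption, the infinite series also has the closed form $\alpha f (I - (1 - \alpha) W)^{-1}$, which can serve as a check on the truncated sum for small graphs.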
VisIRR Demo: Citation-based Recommendation
Item assigned the preference 'highly like': 'Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations'
Most of the recommended items are highly cited.
Computational zoom-in shows sub-areas relevant to the article.
VisIRR Demo: Co-authorship-based Recommendation
Item assigned the preference 'highly like': 'Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations'
The recommendations show other research areas of this paper's authors.
[Figure: retrieved + recommended items, and computational zoom-in on recommended items]
Interested in learning about Micro-Financing Analysis in Kiva.org?
Check out my presentation at Room 104, Wed 4pm
Thank you!
Jaegul Choo (Currently on the Academic Job Market)
Selected Papers
  Choo et al., Document Topic Modeling and Discovery in Visual Analytics via Nonnegative Matrix Factorization, TVCG, 2013
  Choo et al., VisIRR: Interactive Visual Information Retrieval and Recommendation for Large-scale Document Data, Tech Report, Georgia Tech, 2013
UTOPIAN Interactions and Key Techniques
Interactions
  Refining topic keywords
  Merging topics
  Splitting a topic
  Creating new topics from seed documents/keywords
Key techniques
  Visualization: Supervised t-SNE
  Topic modeling: NMF
  Interaction: Weakly-supervised NMF
  Per-iteration Visualization Framework
Supervised t-SNE: Visualizing Documents
Original t-SNE
  Documents do not have clear topic clusters.
Supervised t-SNE
  $d(x_i, x_j) \leftarrow \alpha\, d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic (e.g., $\alpha = 0.3$)
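The distance adjustment above is simple to express in code. This is a small sketch under stated assumptions: d is taken to be the Euclidean distance (the slide does not specify the metric), each document's topic is given as an integer label (for instance the index of its largest entry in H), and the data and function names are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def supervised_distances(X, topics, alpha=0.3):
    """Shrink d(x_i, x_j) by alpha when documents i and j share a topic."""
    D = pairwise_distances(X)
    same_topic = topics[:, None] == topics[None, :]
    return np.where(same_topic, alpha * D, D)

# Hypothetical data: 6 documents in a 4-dimensional feature space.
rng = np.random.default_rng(0)
X = rng.random((6, 4))
topics = np.array([0, 0, 1, 1, 2, 2])  # assumed topic label per document

D_orig = pairwise_distances(X)
D_sup = supervised_distances(X, topics, alpha=0.3)
print(np.round(D_sup, 3))
```

The adjusted matrix could then be handed to a t-SNE implementation that accepts precomputed distances, for example scikit-learn's `TSNE` with `metric="precomputed"` and `init="random"`, so that documents in the same topic are drawn closer together in the layout.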
PIVE (Per-Iteration Visualization Environment)
Integration methodology of iterative methods for real-time interactive visualization [Choo et al., VAST'14, to submit]
[Figure: standard approach vs. PIVE approach]