Published by Harold Warner. Modified over 9 years ago.
1
Visual Analytics for Interactive Exploration of Large-Scale Documents via Nonnegative Matrix Factorization
Jaegul Choo*, Barry L. Drake†, and Haesun Park*
*Georgia Institute of Technology, †Georgia Tech Research Institute
Big Data Innovators Gathering (BIG) 2014
2
What is Visual Analytics?
Data Mining: automated; clearly defined tasks; fast computation; >millions of data items.
Visualization: interactive (human in the loop); exploratory analysis; deeper understanding; thousands of data items.
3
What is Visual Analytics? Leveraging Both Worlds
Visual Analytics: Data Mining + Visualization
Data Mining: automated; clearly defined tasks; fast computation; >millions of data items.
Visualization: interactive (human in the loop); exploratory analysis; deeper understanding; thousands of data items.
4
Visual Analytics for Large-Scale Documents
UTOPIAN: User-driven Topic Modeling based on Interactive NMF (topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation)
VisIRR: Information Retrieval and Personalized Recommender System
5
Motivation: Too Many Documents to Read
Product reviews: Which tablet to buy? iPad (2,000 reviews) vs. Galaxy Tab (1,300 reviews)
Research papers: Which sub-area in data mining to focus on? >Thousands of new papers every year
Patent search
Many other applications
6
Topic Modeling: Summarizing Documents
[Figure: keywords (gene, dna, life, evolve, organism, brain, neuron, nerve, …) and documents (Document 1, Document 2, Document 3, Document 4, …)]
7
Topic Modeling: Summarizing Documents
Topic: distribution over keywords
[Figure: keywords (gene, dna, life, evolve, organism, brain, neuron, nerve, …), topics (Topic 1, Topic 2, Topic 3), and documents (Document 1, Document 2, Document 3, Document 4, …)]
8
Topic Modeling: Summarizing Documents
Topic: distribution over keywords
Document: distribution over topics
[Figure: keywords (gene, dna, life, evolve, organism, brain, neuron, nerve, …), topics (Topic 1, Topic 2, Topic 3), and documents (Document 1, Document 2, Document 3, Document 4, …)]
9
Nonnegative Matrix Factorization (NMF)
Low-rank approximation via matrix factorization: $A \approx WH$, from $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$
Why nonnegativity constraints? Better interpretation (vs. better approximation, e.g., SVD)
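To make the factorization concrete, here is a minimal NumPy sketch. It uses Lee-Seung multiplicative updates, chosen only for brevity; the slides do not say which solver the authors used here, and the function name, iteration count, and toy matrix are all illustrative assumptions.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, seed=0, eps=1e-10):
    """Approximate a nonnegative A (m x n) as W (m x k) @ H (k x n).

    Assumption: multiplicative updates are used for simplicity; they are
    not necessarily the solver used in the talk.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors nonnegative because it only
        # multiplies by ratios of nonnegative quantities.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data: an exactly rank-3 nonnegative matrix (hypothetical input).
rng = np.random.default_rng(1)
A = rng.random((8, 3)) @ rng.random((3, 6))
W, H = nmf_multiplicative(A, k=3)
rel_err = np.linalg.norm(A - W @ H, "fro") / np.linalg.norm(A, "fro")
print(f"relative Frobenius error: {rel_err:.4f}")
```

Unlike an SVD, both factors stay entrywise nonnegative, which is what allows them to be read as additive parts.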
10
NMF as Topic Modeling
$A \approx WH$
Topic: distribution over keywords
Document: distribution over topics
[Figure: keywords (gene, dna, life, evolve, organism, brain, neuron, nerve, …), topics (Topic 1, Topic 2, Topic 3), and documents (Document 1, Document 2, Document 3, Document 4, …)]
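As an illustration of reading topics out of the factors, the sketch below assumes that A is keyword-by-document, so each column of W describes a topic over keywords and each column of H describes a document over topics. The numeric values of W and H are invented for the example, and the helper names are hypothetical.

```python
import numpy as np

vocab = ["gene", "dna", "life", "evolve", "organism", "brain", "neuron", "nerve"]

# Hypothetical factors (keywords x topics, topics x documents); the numbers
# are made up for illustration and are not from the talk.
W = np.array([
    [0.9, 0.0, 0.0],
    [0.8, 0.0, 0.0],
    [0.0, 0.7, 0.0],
    [0.0, 0.9, 0.0],
    [0.0, 0.6, 0.0],
    [0.0, 0.0, 0.9],
    [0.0, 0.0, 0.8],
    [0.0, 0.0, 0.5],
])
H = np.array([
    [0.9, 0.1, 0.0, 0.3],
    [0.1, 0.8, 0.1, 0.3],
    [0.0, 0.1, 0.9, 0.4],
])

def top_keywords(W, vocab, n_top=2):
    """Return the n_top highest-weighted keywords for each topic (column of W)."""
    order = np.argsort(-W, axis=0)[:n_top]
    return [[vocab[i] for i in order[:, t]] for t in range(W.shape[1])]

def topic_mixture(H):
    """Normalize each document (column of H) so its topic weights sum to 1."""
    return H / H.sum(axis=0, keepdims=True)

topics = top_keywords(W, vocab)
mixture = topic_mixture(H)
dominant = mixture.argmax(axis=0)  # index of the strongest topic per document
print(topics)
print(dominant)
```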
11
Why NMF (instead of LDA)? Consistency from Multiple Runs
[Figure: documents’ topical membership changes among 10 runs; panels: InfoVis/VAST paper data set, 20 newsgroup data set]
12
Why NMF (instead of LDA)? Empirical Convergence
[Figure: documents’ topical membership changes between iterations, InfoVis/VAST paper data set; LDA: 10 minutes, NMF: 48 seconds]
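One plausible way to measure "topical membership changes between iterations" is sketched below. It assumes that a document's membership is its highest-weighted topic in a topics-by-documents matrix H; the slides do not define the measure, so this definition and the function name are assumptions.

```python
import numpy as np

def membership_changes(H_prev, H_curr):
    """Count documents whose dominant topic differs between two iterates.

    Assumption: H is topics x documents and membership is the argmax topic.
    """
    return int(np.sum(H_prev.argmax(axis=0) != H_curr.argmax(axis=0)))

# Hypothetical iterates for three documents and two topics.
H_prev = np.array([[0.9, 0.2, 0.6],
                   [0.1, 0.8, 0.4]])
H_curr = np.array([[0.8, 0.3, 0.4],
                   [0.2, 0.7, 0.6]])
n_changed = membership_changes(H_prev, H_curr)
print(n_changed)
```

Comparing separate runs rather than successive iterations would additionally require matching topics across runs, since topic order is arbitrary; that step is not shown here.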
13
NMF vs. LDA Topic Summary (Top Keywords), InfoVis/VAST paper data set
NMF, run #1: Topic 1: visualization, design; Topic 2: information, user; Topic 3: analysis, system; Topic 4: graph, layout; Topic 5: visual, analytics; Topic 6: data, sets; Topic 7: color, weaving
NMF, run #2: Topic 1: visualization, design; Topic 2: information, user; Topic 3: analysis, system; Topic 4: graph, layout; Topic 5: visual, analytics; Topic 6: data, sets; Topic 7: color, weaving
LDA, run #1: Topic 1: document, similarities; Topic 2: knowledge, edge; Topic 3: query, collaborative; Topic 4: social, tree; Topic 5: measures, multivariate; Topic 6: tree, animation; Topic 7: dimension, treemap
LDA, run #2: Topic 1: document, query; Topic 2: analysts, scatterplot; Topic 3: spatial, collaborative; Topic 4: text, document; Topic 5: multidimensional, high; Topic 6: tree, aggregation; Topic 7: dimension, treemap
Topics are more consistent in NMF than in LDA.
Topic quality is comparable between NMF and LDA.
14
UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF [Choo et al., TVCG’13]
Interactions: topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation
15
Visualization Example: Car Reviews
Topic summaries are NOT perfect. UTOPIAN allows user interactions for improving them.
16
Weakly Supervised NMF: Supporting User Interactions
Weakly supervised NMF [Choo et al., DMKD, accepted with rev.]:
$\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$
$W_r$, $H_r$: reference matrices for $W$ and $H$ (user input)
$M_W$, $M_H$: diagonal matrices for weighting/masking columns and rows of $W$ and $H$
Algorithm: block-coordinate descent framework
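The objective can be evaluated directly, which helps show how the masks behave. The sketch below only computes the objective value; it does not implement the block-coordinate descent solver. The slide does not define $D_H$, so treating it as a k x k diagonal scaling matrix is an assumption, as are the function name and the toy data.

```python
import numpy as np

def ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha, beta):
    """Evaluate ||A - WH||_F^2 + alpha ||(W - W_r) M_W||_F^2
    + beta ||M_H (H - D_H H_r)||_F^2.

    Assumption: D_H is a k x k diagonal scaling matrix (not defined on the slide).
    """
    fit = np.linalg.norm(A - W @ H, "fro") ** 2
    w_penalty = np.linalg.norm((W - W_r) @ M_W, "fro") ** 2
    h_penalty = np.linalg.norm(M_H @ (H - D_H @ H_r), "fro") ** 2
    return fit + alpha * w_penalty + beta * h_penalty

# Hypothetical toy problem: 4 keywords, 5 documents, 2 topics.
rng = np.random.default_rng(0)
m, n, k = 4, 5, 2
W_r = rng.random((m, k))
H_r = rng.random((k, n))
A = W_r @ H_r
I_k = np.eye(k)

# At the reference factors, every term vanishes.
obj_at_reference = ws_nmf_objective(A, W_r, H_r, W_r, H_r, I_k, I_k, I_k, 1.0, 1.0)

# Mask out topic 2 in M_W: moving that column of W incurs no W-penalty,
# so only the data-fit term contributes.
M_W = np.diag([1.0, 0.0])
W_moved = W_r.copy()
W_moved[:, 1] += 1.0
obj_masked = ws_nmf_objective(A, W_moved, H_r, W_r, H_r, M_W, I_k, I_k, 1.0, 1.0)
print(obj_at_reference, obj_masked)
```

A zero on the diagonal of a mask leaves the corresponding topic unsupervised, while a nonzero entry pulls it toward the user-supplied reference.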
17
Interaction Demo Video: http://tinyurl.com/UTOPIAN2013
[Figure: InfoVis-VAST paper data; before interaction, and after topic splitting (triangle) and topic merging (circle)]
18
VisIRR: Information Retrieval and Personalized Recommender System
19
Features: Efficient Large-scale Data Processing
Document corpus: ~400,000 academic papers in CS
Data management
Structured data: author, year, venue, keywords, citation/reference count
Unstructured data: bag-of-words vectors of title, abstract, keywords
Graph data: content, citation, and co-authorship
Efficient data handling
Dynamic loading from disk to memory via cache-like strategy
Scalable data expansion in O(n)
20
Features: Personalized Recommendation
Works based on user preference on documents, on a preference scale of 1 (highly dislike) to 5 (highly like)
Various recommendation schemes: based on content, citation network, and co-authorship
Algorithm: preference propagation on graph using heat kernel
$r_\alpha = \alpha \sum_k (1 - \alpha)^k f W^k$
where $r_\alpha$ is a recommendation score vector with a control parameter $\alpha$, $f$ is a user-assigned rating, and $W$ is an input graph
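The propagation formula can be sketched as a truncated sum. Several details below are assumptions not stated on the slide: the sum starts at k = 0 and is cut off after a fixed number of terms, f is a row vector with 0 for unrated documents, and W is a row-normalized adjacency matrix. The function name and the toy graph are hypothetical.

```python
import numpy as np

def recommendation_scores(f, W, alpha, n_terms=50):
    """Approximate r_alpha = alpha * sum_k (1 - alpha)^k f W^k.

    Assumptions: k runs from 0 to n_terms - 1, and f is a row vector of
    ratings (0 = unrated).
    """
    f = np.asarray(f, dtype=float)
    term = f.copy()                 # holds f W^k, starting with k = 0
    total = np.zeros_like(f)
    for k in range(n_terms):
        total += (1 - alpha) ** k * term
        term = term @ W             # advance to f W^(k+1)
    return alpha * total

# Hypothetical 3-document path graph: 0 - 1 - 2.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
W_graph = adjacency / adjacency.sum(axis=1, keepdims=True)  # row-normalize (assumption)

ratings = [5.0, 0.0, 0.0]  # the user rated document 0 as "highly like"
alpha = 0.15
scores = recommendation_scores(ratings, W_graph, alpha)
print(scores)
```

A smaller $\alpha$ puts more weight on long paths, so the rating spreads further from the rated document; a larger $\alpha$ keeps the scores close to f.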
21
VisIRR Demo: Citation-based Recommendation (http://tinyurl.com/VisIRR)
Preference-assigned item as ‘highly like’: ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’
Most of the recommended items are highly cited.
Computational zoom-in shows sub-areas relevant to the article.
22
VisIRR Demo: Co-authorship-based Recommendation (http://tinyurl.com/VisIRR)
Preference-assigned item as ‘highly like’: ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’
It shows other areas of the authors of this paper.
[Figure labels: computational zoom-in on recommended items; retrieved + recommended items]
23
Interested in learning Micro-Financing Analysis in Kiva.org? Check out my presentation at Room 104, Wed 4pm.
24
Thank you!
Jaegul Choo, jaegul.choo@cc.gatech.edu (Currently on the Academic Job Market)
Selected Papers
Choo et al., Document Topic Modeling and Discovery in Visual Analytics via Nonnegative Matrix Factorization, TVCG, 2013
Choo et al., VisIRR: Interactive Visual Information Retrieval and Recommendation for Large-scale Document Data, Tech Report, Georgia Tech, 2013
UTOPIAN: User-driven Topic Modeling based on Interactive NMF (topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation)
VisIRR: Information Retrieval and Personalized Recommender System
Micro-Financing Analysis in Kiva.org: Room 104, Wed 4pm
25
UTOPIAN Interactions and Key Techniques
Interactions: refining topic keywords; merging topics; splitting a topic; creating new topics from seed documents/keywords
Key techniques: visualization (supervised t-SNE); topic modeling (NMF); interaction (weakly supervised NMF); per-iteration visualization framework
26
Supervised t-SNE: Visualizing documents
Original t-SNE: documents do not have clear topic clusters.
Supervised t-SNE: $d(x_i, x_j) \leftarrow \alpha\, d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic (e.g., $\alpha = 0.3$).
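The distance rescaling can be sketched on a precomputed pairwise distance matrix. The function name, the toy distances, and the use of one hard topic label per document are assumptions for illustration; the t-SNE embedding step itself is not shown.

```python
import numpy as np

def supervised_distances(D, topic_labels, alpha=0.3):
    """Shrink d(x_i, x_j) by alpha whenever documents i and j share a topic."""
    labels = np.asarray(topic_labels)
    same_topic = labels[:, None] == labels[None, :]
    return np.where(same_topic, alpha * D, D)

# Hypothetical pairwise distances for three documents.
D = np.array([[0.0, 2.0, 4.0],
              [2.0, 0.0, 6.0],
              [4.0, 6.0, 0.0]])
topic_labels = [0, 0, 1]  # documents 0 and 1 share a topic
D_sup = supervised_distances(D, topic_labels, alpha=0.3)
print(D_sup)
```

The rescaled matrix could then be passed to any t-SNE implementation that accepts precomputed distances, so that same-topic documents are pulled into tighter clusters.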
27
PIVE (Per-iteration Visualization Environment): Integration Methodology of Iterative Methods for Real-Time Interactive Visualization [Choo et al., VAST’14, to submit]
[Figure: standard approach vs. PIVE approach]