Visual Analytics for Interactive Exploration of Large-Scale Documents via Nonnegative Matrix Factorization Jaegul Choo*, Barry L. Drake †, and Haesun Park*

Presentation transcript:

Visual Analytics for Interactive Exploration of Large-Scale Documents via Nonnegative Matrix Factorization
Jaegul Choo*, Barry L. Drake†, and Haesun Park*
*Georgia Institute of Technology
†Georgia Tech Research Institute
Big Data Innovators Gathering (BIG) 2014

What is Visual Analytics? Leveraging Both Worlds
Data Mining: automated; clearly defined tasks; fast computation; >millions of data items
Visualization: interactive (human in the loop); exploratory analysis; deeper understanding; thousands of data items
Visual Analytics: Data Mining + Visualization

Visual Analytics for Large-Scale Documents
UTOPIAN: User-driven Topic Modeling based on Interactive NMF (topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation)
VisIRR: Information Retrieval and Personalized Recommender System

Motivation: Too Many Documents to Read
Product reviews: Which tablet to buy? iPad (2,000 reviews) vs. Galaxy Tab (1,300 reviews)
Research papers: Which sub-area in data mining to focus on? >Thousands of new papers every year
Patent search
Many other applications

Topic Modeling: Summarizing Documents
Topic: distribution over keywords
Document: distribution over topics
[Figure: Documents 1 to 4 and Topics 1 to 3 over the keywords gene, dna, life, evolve, organism, brain, neuron, nerve, ...]

Nonnegative Matrix Factorization (NMF)
Low-rank approximation via matrix factorization: $A \approx WH$
$\min_{W \ge 0,\ H \ge 0} \| A - WH \|_F$
Why nonnegativity constraints? Better interpretation (vs. better approximation, e.g., SVD)
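As an illustration of the factorization above, here is a minimal sketch in Python with NumPy. It uses the Lee-Seung multiplicative update rules, which were chosen here for brevity and are not necessarily the solver the authors use. The function name `nmf_multiplicative`, the toy matrix, and all parameter values are assumptions made for this example.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative matrix A (m x n) as W @ H with W, H >= 0.

    Illustrative solver only: multiplicative updates for the
    Frobenius-norm objective, not the authors' implementation.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps   # m x k, nonnegative start
    H = rng.random((k, n)) + eps   # k x n, nonnegative start
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H stay >= 0.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy keyword-by-document matrix (made-up counts) with two obvious blocks.
A = np.array([[3., 2., 0., 0.],
              [2., 3., 0., 0.],
              [0., 0., 4., 1.],
              [0., 0., 1., 4.]])

W, H = nmf_multiplicative(A, k=2)
residual = np.linalg.norm(A - W @ H)   # Frobenius norm of the approximation error
print(W.shape, H.shape, residual)
```

Because both factors stay nonnegative, each column of `W` can be read as an additive group of keywords, which is the interpretability argument the slide makes against SVD.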

NMF as Topic Modeling
$A \approx WH$
W: topics (each a distribution over keywords)
H: documents (each a distribution over topics)
[Figure: Documents 1 to 4 and Topics 1 to 3 over the keywords gene, dna, life, evolve, organism, brain, neuron, nerve, ...]
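To make the reading of the two factors concrete, the sketch below takes hand-written factors and extracts the top keywords of each topic and the topic distribution of each document. The values in `W` and `H` are made up for illustration, and `top_keywords` is a hypothetical helper; only the keyword list comes from the slide.

```python
import numpy as np

vocab = ["gene", "dna", "life", "evolve", "organism", "brain", "neuron", "nerve"]

# Hypothetical factor W (keywords x topics); values are invented.
W = np.array([
    [0.9, 0.0, 0.0],
    [0.8, 0.1, 0.0],
    [0.1, 0.7, 0.0],
    [0.0, 0.9, 0.0],
    [0.1, 0.6, 0.1],
    [0.0, 0.0, 0.9],
    [0.0, 0.0, 0.8],
    [0.0, 0.1, 0.7],
])

# Hypothetical factor H (topics x documents); values are invented.
H = np.array([
    [2.0, 0.0, 0.5, 0.0],
    [0.5, 1.5, 0.0, 0.0],
    [0.0, 0.5, 1.5, 3.0],
])

def top_keywords(W, vocab, n_top=2):
    """Return the n_top highest-weighted keywords for each topic (column of W)."""
    return [[vocab[i] for i in np.argsort(W[:, t])[::-1][:n_top]]
            for t in range(W.shape[1])]

topic_keywords = top_keywords(W, vocab)

# Normalize each column of H so a document's topic weights sum to 1.
doc_topic = H / H.sum(axis=0, keepdims=True)
dominant_topic = doc_topic.argmax(axis=0)

print(topic_keywords)
print(dominant_topic)
```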

Why NMF (instead of LDA)? Consistency from Multiple Runs
[Figure: documents' topical membership changes among 10 runs; panels: InfoVis/VAST paper data set, 20 newsgroup data set]

Why NMF (instead of LDA)? Empirical Convergence
[Figure: documents' topical membership changes between iterations, InfoVis/VAST paper data set]
LDA: 10 minutes
NMF: 48 seconds

NMF vs. LDA Topic Summary (Top Keywords), InfoVis/VAST paper data set
NMF run #1: Topic 1: visualization, design; Topic 2: information, user; Topic 3: analysis, system; Topic 4: graph, layout; Topic 5: visual, analytics; Topic 6: data, sets; Topic 7: color, weaving
NMF run #2: Topic 1: visualization, design; Topic 2: information, user; Topic 3: analysis, system; Topic 4: graph, layout; Topic 5: visual, analytics; Topic 6: data, sets; Topic 7: color, weaving
LDA run #1: Topic 1: document, similarities; Topic 2: knowledge, edge; Topic 3: query, collaborative; Topic 4: social, tree; Topic 5: measures, multivariate; Topic 6: tree, animation; Topic 7: dimension, treemap
LDA run #2: Topic 1: document, query; Topic 2: analysts, scatterplot; Topic 3: spatial, collaborative; Topic 4: text, document; Topic 5: multidimensional, high; Topic 6: tree, aggregation; Topic 7: dimension, treemap
Topics are more consistent in NMF than in LDA. Topic quality is comparable between NMF and LDA.

UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF [Choo et al., TVCG’13]
Interactions: topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation

Visualization Example: Car Reviews
Topic summaries are NOT perfect. UTOPIAN allows user interactions for improving them.

Weakly Supervised NMF: Supporting User Interactions
Weakly supervised NMF [Choo et al., DMKD, accepted with rev.]
$\min_{W \ge 0,\ H \ge 0} \| A - WH \|_F^2 + \alpha \| (W - W_r) M_W \|_F^2 + \beta \| M_H (H - D_H H_r) \|_F^2$
$W_r$, $H_r$: reference matrices for $W$ and $H$ (user input)
$M_W$, $M_H$: diagonal matrices for weighting/masking columns and rows of $W$ and $H$
Algorithm: block-coordinate descent framework
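The objective can be written down directly in code. The sketch below only evaluates it for given matrices and does not implement the block-coordinate descent solver. The function name and the toy data are assumptions for this example, and because the slide does not define $D_H$, it is treated here simply as a k x k matrix supplied by the caller.

```python
import numpy as np

def weakly_supervised_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha, beta):
    """Evaluate ||A - WH||_F^2 + alpha*||(W - W_r) M_W||_F^2
    + beta*||M_H (H - D_H H_r)||_F^2 for the given matrices."""
    fit = np.linalg.norm(A - W @ H, "fro") ** 2
    # Right-multiplying by the diagonal M_W weights/masks columns (topics) of W.
    w_penalty = alpha * np.linalg.norm((W - W_r) @ M_W, "fro") ** 2
    # Left-multiplying by the diagonal M_H weights/masks rows of H.
    # D_H is assumed to be k x k here; the slide does not define it.
    h_penalty = beta * np.linalg.norm(M_H @ (H - D_H @ H_r), "fro") ** 2
    return fit + w_penalty + h_penalty

# Toy setup (made-up data): a perfect fit, with a user reference W_r that
# differs from W by exactly 1.0 in a single entry of topic 0.
rng = np.random.default_rng(0)
W = rng.random((3, 2))
H = rng.random((2, 4))
A = W @ H
W_r = W.copy()
W_r[0, 0] += 1.0
H_r = H.copy()
I = np.eye(2)

# Supervise topic 0 only: the mismatch is penalized (alpha * 1.0^2).
supervised = weakly_supervised_nmf_objective(
    A, W, H, W_r, H_r, M_W=np.diag([1.0, 0.0]), M_H=I, D_H=I, alpha=2.0, beta=1.0)

# Supervise topic 1 only: the mismatch in topic 0 is masked out.
masked = weakly_supervised_nmf_objective(
    A, W, H, W_r, H_r, M_W=np.diag([0.0, 1.0]), M_H=I, D_H=I, alpha=2.0, beta=1.0)

print(supervised, masked)
```

The masking matrices are what let a user constrain only the topics they touched while the rest of the factorization stays free.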

Interaction Demo Video (InfoVis-VAST paper data)
[Video panels: before interaction; after topic splitting (triangle) and topic merging (circle)]

VisIRR: Information Retrieval and Personalized Recommender System

Features: Efficient Large-scale Data Processing
Document corpus: ~400,000 academic papers in CS
Data management
  Structured data: author, year, venue, keywords, citation/reference count
  Unstructured data: bag-of-words vectors of title, abstract, keywords
  Graph data: content, citation, and co-authorship
Efficient data handling
  Dynamic loading from disk to memory via a cache-like strategy
  Scalable data expansion in O(n)

Features: Personalized Recommendation
Works from the user's preferences on documents, on a scale of 1 (highly dislike) to 5 (highly like)
Various recommendation schemes: based on content, citation network, and co-authorship
Algorithm: preference propagation on a graph using a heat kernel
$r_\alpha = \alpha \sum_k (1 - \alpha)^k f W^k$
where $r_\alpha$ is the recommendation score vector with control parameter $\alpha$, $f$ is the vector of user-assigned ratings, and $W$ is the input graph
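The propagation formula can be sketched as a truncated sum. The example below assumes that the sum starts at k = 0, that it is cut off after a fixed number of terms, and that `W` is a row-normalized adjacency matrix; none of these details are stated on the slide. The function name `propagate_preferences` and the four-node path graph are made up for illustration.

```python
import numpy as np

def propagate_preferences(f, W, alpha=0.3, n_terms=50):
    """Approximate r = alpha * sum_k (1 - alpha)^k f W^k with n_terms terms.

    f is a row vector of user ratings; W is the graph matrix.
    Starting at k = 0 and truncating the sum are assumptions of this sketch.
    """
    r = np.zeros(len(f), dtype=float)
    term = np.asarray(f, dtype=float)   # f W^0
    for k in range(n_terms):
        r += (1 - alpha) ** k * term
        term = term @ W                 # advance to f W^(k+1)
    return alpha * r

# Toy path graph 0 - 1 - 2 - 3, row-normalized (assumed normalization).
adjacency = np.array([[0., 1., 0., 0.],
                      [1., 0., 1., 0.],
                      [0., 1., 0., 1.],
                      [0., 0., 1., 0.]])
W = adjacency / adjacency.sum(axis=1, keepdims=True)

# The user rates only document 0.
f = np.array([1., 0., 0., 0.])
r = propagate_preferences(f, W, alpha=0.3)
print(r)
```

In this toy graph the scores of the unrated documents fall off with their distance from the rated one, which is the behavior a graph-based recommender relies on.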

VisIRR Demo: Citation-based Recommendation
Item rated ‘highly like’: ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’
Most of the recommended items are highly cited.
Computational zoom-in shows sub-areas relevant to the article.

VisIRR Demo: Co-authorship-based Recommendation
Item rated ‘highly like’: ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’
The recommendation shows other areas in which the authors of this paper work.
[Figure panels: retrieved + recommended items; computational zoom-in on recommended items]

Interested in learning about Micro-Financing Analysis in Kiva.org? Check out my presentation at Room 104, Wed 4pm

Thank you!
Jaegul Choo (Currently on the Academic Job Market)
Selected Papers
Choo et al., Document Topic Modeling and Discovery in Visual Analytics via Nonnegative Matrix Factorization, TVCG, 2013
Choo et al., VisIRR: Interactive Visual Information Retrieval and Recommendation for Large-scale Document Data, Tech Report, Georgia Tech, 2013

UTOPIAN Interactions and Key Techniques
Interactions: refining topic keywords; merging topics; splitting a topic; creating new topics from seed documents/keywords
Key techniques
  Visualization: supervised t-SNE
  Topic modeling: NMF
  Interaction: weakly-supervised NMF
  Per-iteration visualization framework

Supervised t-SNE: Visualizing documents
Original t-SNE: documents do not have clear topic clusters.
Supervised t-SNE: $d(x_i, x_j) \leftarrow \alpha \, d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic (e.g., $\alpha = 0.3$)
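The distance adjustment is simple enough to show in a few lines. The sketch below assumes Euclidean distances and a single topic label per document; the function name `supervised_distances` and the toy points are made up for this example.

```python
import numpy as np

def supervised_distances(X, topics, alpha=0.3):
    """Pairwise distances, shrunk by alpha for pairs sharing a topic label.

    Euclidean distance is an assumption of this sketch; the slide only
    specifies the scaling rule d <- alpha * d for same-topic pairs.
    """
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))            # plain pairwise distances
    same_topic = topics[:, None] == topics[None, :]   # boolean mask of same-topic pairs
    return np.where(same_topic, alpha * D, D)

# Toy data: three documents in 2-D; the first two share a topic.
X = np.array([[0., 0.], [3., 4.], [6., 8.]])
topics = np.array([0, 0, 1])

D_sup = supervised_distances(X, topics, alpha=0.3)
print(D_sup)
```

The adjusted matrix could then be passed to any t-SNE implementation that accepts precomputed distances, so that documents in the same topic are pulled closer together in the layout.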

PIVE (Per-iteration Visualization Environment)
Integration methodology of iterative methods for real-time interactive visualization [Choo et al., VAST’14, to submit]
[Figure panels: standard approach; PIVE approach]