Jaegul Choo1*, Changhyun Lee1, Chandan K. Reddy2, and Haesun Park1

Presentation transcript:

UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization. Jaegul Choo (1, *), Changhyun Lee (1), Chandan K. Reddy (2), and Haesun Park (1). (1) Georgia Institute of Technology, (2) Wayne State University. (*) E-mail: jaegul.choo@cc.gatech.edu

Intro: Topic Modeling. [Figure: Documents 1 to 4, with example keywords: brain, evolve, dna, genetic, gene, nerve, neuron, life, organism.]

Intro: Topic Modeling. Topic: a distribution over keywords. [Figure: Documents 1 to 4 and Topics 1 to 3, with example keywords: brain, evolve, dna, genetic, gene, nerve, neuron, life, organism.]

Intro: Topic Modeling. Document: a distribution over topics. Topic: a distribution over keywords. [Figure: example keywords: brain, evolve, dna, genetic, gene, nerve, neuron, life, organism.]

Latent Dirichlet Allocation (LDA) in Visual Analytics. LDA has been widely used in visual analytics: TIARA [Wei et al. KDD10], iVisClustering [Lee et al. EuroVis12], ParallelTopics [Dou et al. VAST12], TopicViz [Eisenstein et al. CHI-WIP12], … *Images courtesy of the original papers.

Overview of Our Work. Proposes nonnegative matrix factorization (NMF) for topic modeling. Highlights advantages of NMF over LDA in visual analytics. Presents UTOPIAN, an NMF-based interactive topic modeling system. [Figure labels: topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation.]

What is Nonnegative Matrix Factorization?

Nonnegative Matrix Factorization (NMF). Lower-rank approximation with nonnegativity constraints: $A \approx WH$, obtained from $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$. Why nonnegativity? Easy interpretation and semantically meaningful output. Algorithm: alternating nonnegativity-constrained least squares [Kim et al., 2008]. (Speaker note: mention document vector.)
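The factorization on this slide can be sketched in a few lines of NumPy. Note that this sketch swaps in the simpler multiplicative-update rule rather than the alternating nonnegativity-constrained least squares algorithm cited on the slide. The function name `nmf_multiplicative`, the toy matrix `A`, and all parameter values are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, seed=0, eps=1e-10):
    """Approximate a nonnegative matrix A (m x n) as W (m x k) @ H (k x n).

    Uses multiplicative updates, NOT the ANLS algorithm cited on the slide.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H
        # stay nonnegative without an explicit projection step.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical 6-term x 4-document matrix (assumption: columns are documents).
A = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 4, 3],
    [0, 0, 3, 4],
    [0, 1, 2, 2],
], dtype=float)

W, H = nmf_multiplicative(A, k=2)
relative_error = np.linalg.norm(A - W @ H, "fro") / np.linalg.norm(A, "fro")
print(f"relative error: {relative_error:.3f}")
```

Multiplicative updates are used here only because they fit in a short example; they are not the method the authors cite.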

NMF as Topic Modeling. $A \approx WH$. Document: a distribution over topics. Topic: a distribution over keywords. [Figure: the factors $W$ and $H$ shown next to Documents 1 to 4 and Topics 1 to 3, with example keywords: brain, evolve, dna, genetic, gene, nerve, neuron, life, organism.]
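As a rough illustration of reading topics out of the two factors, the sketch below normalizes the columns of `W` and `H` so that each reads as a distribution. It assumes that columns of `W` are topics over keywords and columns of `H` are documents over topics, which the slide does not state explicitly. The factor values are invented for the example, and the helper `top_keywords` is hypothetical.

```python
import numpy as np

vocab = ["brain", "evolve", "dna", "genetic", "gene",
         "nerve", "neuron", "life", "organism"]

# Hypothetical factors for 9 keywords, 3 topics, and 4 documents.
# Assumption: columns of W are topics, columns of H are documents.
W = np.array([
    [0.0, 0.9, 0.0],  # brain
    [0.1, 0.0, 0.8],  # evolve
    [0.9, 0.0, 0.0],  # dna
    [0.8, 0.0, 0.1],  # genetic
    [0.7, 0.0, 0.0],  # gene
    [0.0, 0.7, 0.0],  # nerve
    [0.0, 0.8, 0.0],  # neuron
    [0.0, 0.1, 0.7],  # life
    [0.0, 0.0, 0.9],  # organism
])
H = np.array([
    [2.0, 0.1, 0.5, 0.0],
    [0.2, 1.8, 0.0, 0.3],
    [0.0, 0.1, 1.5, 1.7],
])

# Normalize each column to sum to 1 so it can be read as a distribution.
topic_keyword = W / W.sum(axis=0, keepdims=True)
doc_topic = H / H.sum(axis=0, keepdims=True)

def top_keywords(topic_index, n=3):
    """Return the n highest-weighted keywords of one topic."""
    order = np.argsort(topic_keyword[:, topic_index])[::-1][:n]
    return [vocab[i] for i in order]

for t in range(W.shape[1]):
    print(f"Topic {t + 1}: {top_keywords(t)}")

dominant_topic = doc_topic.argmax(axis=0) + 1  # 1-based topic labels
print("Dominant topic per document:", dominant_topic.tolist())
```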

Why NMF in Visual Analytics?

Advantages of NMF in Visual Analytics: reliable algorithmic behaviors; flexible support for user interactions.

NMF vs. LDA: Consistency from Multiple Runs. Documents' topical membership changes among 10 runs. [Figure panels: InfoVis/VAST paper data set; 20 newsgroup data set.]
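One way to quantify membership changes between two runs is sketched below. Topic ids are arbitrary across runs, so the sketch first finds the relabeling of one run that best matches the other. The function `membership_change` and the toy label lists are assumptions made for illustration; the slide does not say how the authors computed this measure.

```python
from itertools import permutations

def membership_change(labels_a, labels_b, k):
    """Fraction of documents whose topic differs between two runs,
    after choosing the relabeling of run B that best matches run A.

    Brute force over all k! relabelings, so only suitable for small k.
    """
    best_matches = 0
    for perm in permutations(range(k)):
        matches = sum(1 for a, b in zip(labels_a, labels_b) if a == perm[b])
        best_matches = max(best_matches, matches)
    return 1.0 - best_matches / len(labels_a)

# Hypothetical topic assignments of six documents in three runs.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]  # same grouping, different topic ids
run_3 = [0, 1, 1, 1, 2, 2]  # one document moved to another topic

print(membership_change(run_1, run_2, k=3))
print(membership_change(run_1, run_3, k=3))
```

For larger numbers of topics, an assignment solver such as `scipy.optimize.linear_sum_assignment` would be a more practical way to find the best relabeling.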

NMF vs. LDA: Empirical Convergence. Documents' topical membership changes between iterations, on the InfoVis/VAST paper data set. [Figure labels, in extraction order: 48 seconds, 10 minutes; NMF, LDA.]
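A minimal sketch of tracking membership changes between iterations follows. It assumes that each document's topic is the row index of the largest entry in its column of `H`. The function name `changes_between_iterations` and the iterates in `H_history` are hypothetical.

```python
import numpy as np

def changes_between_iterations(H_history):
    """For consecutive iterates of H (k x n), return the fraction of
    documents whose dominant topic (argmax over rows) changed."""
    labels = [H.argmax(axis=0) for H in H_history]
    return [float(np.mean(prev != curr))
            for prev, curr in zip(labels, labels[1:])]

# Hypothetical iterates of H for 2 topics and 4 documents.
H_history = [
    np.array([[0.6, 0.4, 0.5, 0.2],
              [0.4, 0.6, 0.4, 0.8]]),
    np.array([[0.7, 0.6, 0.3, 0.1],
              [0.3, 0.4, 0.7, 0.9]]),
    np.array([[0.8, 0.7, 0.2, 0.1],
              [0.2, 0.3, 0.8, 0.9]]),
]

print(changes_between_iterations(H_history))  # [0.5, 0.0]
```

A sequence that drops to zero and stays there suggests that the memberships have stopped changing, even if the factor values themselves are still being refined.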

NMF vs. LDA: Topic Summary (Top Keywords), InfoVis/VAST paper data set. Topics are more consistent in NMF than in LDA. Topic quality is comparable between NMF and LDA. [Table flattened in extraction; cell boundaries are not recoverable. Column headers: Topic 1 to Topic 7. Row labels: NMF, LDA, Run #1, Run #2. Keywords in extraction order: visualization design information user analysis system graph layout visual analytics data sets color weaving documents similarities knowledge edge query collaborative social tree measures multivariate animation dimensions treemap analysts scatterplot spatial text multidimensional, high aggregation.]

Advantages of NMF in Visual Analytics: reliable algorithmic behaviors; flexible support for user interactions.

Weakly Supervised NMF [Choo et al., DMKD, accepted with rev.]. $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$. $W_r$, $H_r$: reference matrices for $W$ and $H$. $M_W$, $M_H$: diagonal matrices for weighting/masking columns/rows of $W$ and $H$. Provides flexible yet intuitive means for user interaction. Maintains the same computational complexity as original NMF.
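To make the three terms of this objective concrete, the sketch below only evaluates it for given matrices; it does not implement the authors' solver. The slide does not define $D_H$, so the sketch assumes it is a $k \times k$ diagonal scaling matrix. The function name `weakly_supervised_objective` and all example values are illustrative.

```python
import numpy as np

def weakly_supervised_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha, beta):
    """Evaluate ||A - WH||_F^2 + alpha ||(W - W_r) M_W||_F^2
    + beta ||M_H (H - D_H H_r)||_F^2 for the given matrices."""
    fit = np.linalg.norm(A - W @ H, "fro") ** 2
    # M_W multiplies on the right, so it weights or masks columns of W.
    w_penalty = alpha * np.linalg.norm((W - W_r) @ M_W, "fro") ** 2
    # M_H multiplies on the left, so it weights or masks rows of H.
    h_penalty = beta * np.linalg.norm(M_H @ (H - D_H @ H_r), "fro") ** 2
    return fit + w_penalty + h_penalty

# Hypothetical example with m = 3 keywords, n = 2 documents, k = 2 topics.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
H = np.array([[1.0, 2.0], [3.0, 4.0]])
A = W @ H                                  # exact fit, so the first term is 0

W_r = np.array([[2.0, 5.0], [0.0, 5.0], [1.0, 5.0]])  # reference for W
M_W = np.diag([1.0, 0.0])                  # supervise topic 1 only
H_r = H.copy()                             # reference equal to H: no H penalty
M_H = np.eye(2)
D_H = np.eye(2)                            # assumed k x k diagonal scaling

value = weakly_supervised_objective(A, W, H, W_r, H_r, M_W, M_H, D_H,
                                    alpha=0.5, beta=1.0)
print(value)  # 0.5: only the masked-in column of W differs from W_r
```

Setting a diagonal entry of `M_W` to zero leaves the corresponding topic unconstrained, which is how masking keeps supervision limited to the topics a user has actually touched.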

UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF. [Figure labels: topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation.]

UTOPIAN Overview. Supervised t-distributed stochastic neighbor embedding (t-SNE). User interactions supported: keyword refinement; topic merging/splitting; keyword-/document-induced topic creation. Real-time interaction via PIVE (Per-Iteration Visualization Environment). Just like In-Spire, documents are represented as dots, and their colors represent their topic cluster membership. [Figure labels: topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation.]

Supervised t-SNE. Original t-SNE: documents are often too noisy to work with. Supervised t-SNE: $d(x_i, x_j) \leftarrow \alpha \cdot d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic cluster.
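The distance adjustment above can be sketched as a preprocessing step on a pairwise distance matrix. The sketch assumes Euclidean distances and a factor $\alpha < 1$ that pulls same-topic documents together; the slide specifies neither. The function name `supervised_distances` and the toy data are hypothetical.

```python
import numpy as np

def supervised_distances(X, topic_labels, alpha=0.5):
    """Pairwise Euclidean distances between rows of X, with distances
    between documents in the same topic cluster multiplied by alpha."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    same_topic = topic_labels[:, None] == topic_labels[None, :]
    return np.where(same_topic, alpha * D, D)

# Hypothetical 2-D document vectors and topic cluster labels.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
labels = np.array([0, 0, 1])

D = supervised_distances(X, labels, alpha=0.5)
print(D)
```

The resulting matrix could then be passed to a t-SNE implementation that accepts precomputed distances.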

PIVE (Per-Iteration Visualization Environment) for Real-time Interaction [Choo et al., under revision]. [Figure panels: standard approach; PIVE approach.]

Demo Video http://tinyurl.com/UTOPIAN2013

Usage Scenario: Hyundai Genesis Review Data. [Figure panels: initial result; after interaction.]

Summary. Presented UTOPIAN, a user-driven topic modeling system based on interactive NMF. Highlighted the advantages of NMF over LDA in visual analytics: reliable algorithmic behaviors (consistency from multiple runs, early empirical convergence) and flexible support for user interactions (keyword refinement, topic merging/splitting, keyword-/document-induced topic creation).

More in the Paper & On-going Work. In the paper: a general taxonomy of user interactions with computational methods (keyword-based vs. document-based; template-based vs. from-scratch-based); algorithmic details about supported user interactions; implementation details; more usage scenarios. On-going work: scaling up the system with parallel distributed NMF.

Thank you! http://tinyurl.com/UTOPIAN2013 Jaegul Choo, jaegul.choo@cc.gatech.edu, http://www.cc.gatech.edu/~joyfull/ For more details, please find me at 'Meet the Candidate', A601 + A602, 6 PM today. [Figure labels: topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation.]