Topic Modeling with Network Regularization. Qiaozhu Mei, Deng Cai, Duo Zhang, ChengXiang Zhai, University of Illinois at Urbana-Champaign.


Outline
- Motivation: topic modeling with network structure
- An optimization framework: a probabilistic topic model with graph regularization
- Instantiation: NetPLSA
- Experiments
- Summary

Text Collections with Network Structure
- Literature + coauthor/citation network
- Email + sender/receiver network
- Blog articles + friend network
- News + geographic network
- Web pages + hyperlink structure
- …

Probabilistic Topic Models for Text Mining
Text collections are fed to probabilistic topic modeling, which produces topics as multinomial word distributions (e.g., web 0.21, search 0.10, link 0.08, graph 0.05, …; term 0.16, relevance 0.08, weight 0.07, feedback 0.04, model 0.03, …). These support subtopic discovery, opinion comparison, summarization, and topical pattern analysis.
Representative topic models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], CPLSA [Mei & Zhai 06], Pachinko allocation [Li & McCallum 06], CTM [Blei et al. 06].
These models usually don't include network structure.

Social Network Analysis
- Generation and evolution, e.g., [Leskovec 05]
- Community extraction, e.g., [Kleinberg 00]
- Diffusion, e.g., [Gruhl 04]; [Backstrom 06]
- Ranking, e.g., [Brin and Page 98]; [Kleinberg 98]
- Structural patterns, e.g., [Yan 02]
(Figure sources: Kleinberg and Backstrom 2006, New York Times; Jeong et al., Nature 411)
These methods usually don't model topics in text.

Importance of Topic Modeling Plus Network Analysis
- Topic modeling to help community extraction: e.g., distinguishing a community described by "physicist, physics, scientist, theory, gravitation, …" from one described by "writer, novel, best-seller, book, language, film, …"
- Network analysis to help topic extraction: from the computer science literature, should the extracted topics be domains such as "Information Retrieval + Data Mining + Machine Learning, …" or generic paper components such as "Review + Algorithm + Evaluation, …"?

Intuitions
- People working on the same topic belong to the same "topical community"
- A good community has a coherent topic and is well connected
- A topic is semantically coherent if people working on it also collaborate a lot
- Example: is an author whose coauthors mostly work on IR more likely to be an IR person or a compiler person?
- Intuition: my topics are similar to my neighbors'

Social Network Context for Topic Modeling (e.g., a coauthor network)
- Context = author; coauthors = similar contexts
- Intuition: I work on topics similar to my neighbors'
- Smoothed topic distributions P(θ_j | author)

Challenging Research Questions
- How to formalize the intuitive assumption?
  – smooth topic distributions over neighbors
  – without hurting topic modeling
- How to map a topic onto a network structure?
- How to interpret the semantics of communities on a network?
  – These vertices form a community, but why?
  – Topical communities

A Unified Optimization Framework
- Probabilistic topic modeling as an optimization problem (e.g., PLSA/LDA: maximum likelihood)
- Regularized objective function with network constraints: topic distributions are smoothed over adjacent vertices (see the sketch below)
- Flexibility in selecting topic models and regularizers
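A minimal sketch of the regularized objective in this framework, assuming the usual convex-combination form (the symbols are introduced here for illustration: C is the text collection, G the network, L(C) the topic model's log-likelihood, R(C, G) a graph regularizer, and λ ∈ [0, 1] the tradeoff between the two):

\[ O(C, G) = (1 - \lambda)\, L(C) - \lambda\, R(C, G) \]

Maximizing O trades off fitting the text (the likelihood term) against keeping topic distributions smooth over the network (the regularization term); λ = 0 recovers the plain topic model.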

Probabilistic Latent Semantic Analysis (Hofmann '99)
A document d is generated from k topics θ_1…θ_k, e.g., "government" (government 0.3, response …), "donation" (donate 0.1, relief 0.05, help …), "New Orleans" (city 0.2, new 0.1, orleans …).
Generation process: for each word, choose a topic θ_i with probability P(θ_i | d), then draw a word from θ_i. Example document fragments: "Criticism of government response to the hurricane primarily consisted of criticism of its response to …", "The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production …", "Over seventy countries pledged monetary donations or other assistance. …"
The topic distributions P(θ_i | d) can be context-sensitive, but because they are unconstrained the model overfits the data (see the formula below).
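For reference, the PLSA log-likelihood that is maximized, in its standard form (c(w, d) is the count of word w in document d, V the vocabulary, and k the number of topics; this is the textbook formulation rather than a transcription of the slide):

\[ L(C) = \sum_{d \in C} \sum_{w \in V} c(w, d) \log \sum_{j=1}^{k} p(\theta_j \mid d)\, p(w \mid \theta_j) \]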

Instantiation: NetPLSA
Basic assumption: neighbors have similar topic distributions.
NetPLSA combines the PLSA likelihood with a graph harmonic regularizer, a generalization of [Zhu '03]. The regularizer weighs the difference between the topic distributions of adjacent documents by the importance (weight) of each edge, and a tradeoff parameter balances it against the likelihood (see the sketch below).
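The harmonic regularizer described above can be written roughly as follows; this reconstruction follows the published NetPLSA formulation as best recalled, so treat the exact constants as an assumption. Here w(u, v) is the weight of edge ⟨u, v⟩ in edge set E, and d_u is the document attached to vertex u:

\[ R(C, G) = \frac{1}{2} \sum_{\langle u, v \rangle \in E} w(u, v) \sum_{j=1}^{k} \big( p(\theta_j \mid d_u) - p(\theta_j \mid d_v) \big)^2 \]

Plugging this R into the regularized objective sketched earlier yields NetPLSA; a large edge weight w(u, v) pushes u and v toward similar topic distributions.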

Parameter Estimation
- PLSA: EM algorithm. E-step: compute expectations; M-step: maximize, with a closed-form solution.
- NetPLSA: generalized EM algorithm. E-step: compute expectations of the regularized complete likelihood; M-step: maximize using Newton-Raphson, or use an efficient algorithm that merely improves (rather than fully maximizes) the regularized complete likelihood in each M-step. The standard PLSA updates are spelled out below.
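The PLSA half of these updates is standard, so it can be stated for reference; in NetPLSA only the p(θ_j | d) update changes, because the regularizer couples documents across the network:

\[ \text{E-step:}\quad p(\theta_j \mid d, w) = \frac{p(\theta_j \mid d)\, p(w \mid \theta_j)}{\sum_{j'} p(\theta_{j'} \mid d)\, p(w \mid \theta_{j'})} \]

\[ \text{M-step:}\quad p(w \mid \theta_j) \propto \sum_{d} c(w, d)\, p(\theta_j \mid d, w), \qquad p(\theta_j \mid d) \propto \sum_{w} c(w, d)\, p(\theta_j \mid d, w) \]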

How It Works in the M-Step
- PLSA: maximize the likelihood (closed-form update).
- NetPLSA: start from the likelihood-maximizing update, then smooth p(θ_j | d) over the network using the regularizer, iterating until the regularized Q function starts to drop (see the sketch below).
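A minimal runnable sketch of this generalized EM loop, assuming one document per network vertex and a simple neighbor-averaging pass in place of the exact Newton-Raphson M-step. All names here (netplsa, counts, W, lam, smooth_iters) are illustrative, not the authors' code:

import numpy as np

def netplsa(counts, W, k, lam=0.5, em_iters=50, smooth_iters=10, seed=0):
    """Illustrative NetPLSA-style generalized EM sketch (not the authors' code).

    counts : (n_docs, n_words) document-word count matrix, one document per vertex.
    W      : (n_docs, n_docs) symmetric nonnegative edge-weight matrix of the network.
    k      : number of topics.
    lam    : tradeoff between likelihood fit and network smoothing.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((k, n_words)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, k));  p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    # Row-normalize W so each vertex averages over its neighbors' topic distributions.
    deg = W.sum(axis=1, keepdims=True)
    W_norm = np.divide(W, deg, out=np.zeros((n_docs, n_docs)), where=deg > 0)

    for _ in range(em_iters):
        # E-step: posterior p(z | d, w) for every document-word pair.
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]                 # shape (d, z, w)
        post = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
        expected = counts[:, None, :] * post                          # expected counts

        # M-step, part 1: closed-form update of p(w | z), exactly as in PLSA.
        p_w_z = expected.sum(axis=0)
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), 1e-12)

        # M-step, part 2: the PLSA update of p(z | d) ...
        p_z_d = expected.sum(axis=2)
        p_z_d /= np.maximum(p_z_d.sum(axis=1, keepdims=True), 1e-12)
        # ... followed by smoothing over the network. A faithful implementation would
        # keep smoothing only while the regularized Q function still improves; here a
        # fixed number of passes stands in for that stopping rule.
        for _ in range(smooth_iters):
            p_z_d = (1 - lam) * p_z_d + lam * (W_norm @ p_z_d)
            p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    return p_w_z, p_z_d

Setting lam = 0 recovers plain PLSA; larger values pull each vertex's topic distribution toward its neighbors, which is the smoothing behavior described above.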

Variations and Utilities
- Use different regularizers for different network structures; P(θ|u) = P(θ|d_u) if there is one document per vertex
- Topic mapping: map a topic onto the network using p(θ|u); p(θ|u) should be smooth on the graph
- Topical community extraction: we may use p(θ|u) to decide which cluster u should be in (see the note below)
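Concretely, the community-assignment criterion in the last bullet amounts to giving each vertex its highest-probability topic (a direct reading of the bullet, stated here for clarity):

\[ \text{community}(u) = \arg\max_{j} \; p(\theta_j \mid u) \]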

Experiments
- Bibliography data and coauthor networks
  – DBLP: text = titles; network = coauthors
  – Four conferences (expect 4 topics): SIGIR, KDD, NIPS, WWW
- Blog articles and a geographic network
  – Blogs from spaces.live.com containing topical words, e.g., "weather"
  – Network: US states (edges between adjacent states)

Topical Communities with PLSA (noisy community assignment; none of the four topics has a clear label)
- Topic 1: term 0.02, question 0.02, protein 0.01, training 0.01, weighting 0.01, multiple 0.01, recognition 0.01, relations 0.01, library 0.01
- Topic 2: peer 0.02, patterns 0.01, mining 0.01, clusters 0.01, stream 0.01, frequent 0.01, e 0.01, page 0.01, gene 0.01
- Topic 3: visual 0.02, analog 0.02, neurons 0.02, vlsi 0.01, motion 0.01, chip 0.01, natural 0.01, cortex 0.01, spike 0.01
- Topic 4: interface 0.02, towards 0.02, browsing 0.02, xml 0.01, generation 0.01, design 0.01, engine 0.01, service 0.01, social

Topical Communities with NetPLSA (coherent community assignment)
- Topic 1 (Information Retrieval): retrieval 0.13, information 0.05, document 0.03, query 0.03, text 0.03, search 0.03, evaluation 0.02, user 0.02, relevance 0.02
- Topic 2 (Data Mining): mining 0.11, data 0.06, discovery 0.03, databases 0.02, rules 0.02, association 0.02, patterns 0.02, frequent 0.01, streams 0.01
- Topic 3 (Machine Learning): neural 0.06, learning 0.02, networks 0.02, recognition 0.02, analog 0.01, vlsi 0.01, neurons 0.01, gaussian 0.01, network 0.01
- Topic 4 (Web): web 0.05, services 0.03, semantic 0.03, services 0.03, peer 0.02, ontologies 0.02, rdf 0.02, management 0.01, ontology 0.01

Coherent Topical Communities
Semantics of the "Data Mining (KDD)" community:
- PLSA: peer 0.02, patterns 0.01, mining 0.01, clusters 0.01, stream 0.01, frequent 0.01, e 0.01, page 0.01, gene 0.01
- NetPLSA: mining 0.11, data 0.06, discovery 0.03, databases 0.02, rules 0.02, association 0.02, patterns 0.02, frequent 0.01, streams 0.01
Semantics of the "Machine Learning (NIPS)" community:
- PLSA: visual 0.02, analog 0.02, neurons 0.02, vlsi 0.01, motion 0.01, chip 0.01, natural 0.01, cortex 0.01, spike 0.01
- NetPLSA: neural 0.06, learning 0.02, networks 0.02, recognition 0.02, analog 0.01, vlsi 0.01, neurons 0.01, gaussian 0.01, network 0.01

Topic Modeling and SNA Improve Each Other
Comparison of PLSA, NetPLSA, and NCut (spectral clustering with normalized cut, J. Shi et al.; a purely network-based community finding method) by cut edge weights, ratio cut / normalized cut (the smaller the better), and the sizes of the four extracted communities:
- Network regularization helps extract coherent communities (the network assures the focus of topics)
- Topic modeling helps balance communities (text implicitly bridges authors)

Smoothed Topic Map
Map a topic onto the network (e.g., using p(θ|a)). The figure contrasts the PLSA and NetPLSA maps of the topic "information retrieval" over the coauthor network, shading authors as core contributors, intermediate, or irrelevant.

Smoothed Topic Map: The Windy States
Blog articles about "weather" mapped onto the US states network for the topic "windy". The figure compares the PLSA map, the NetPLSA map, and a real reference.

Summary
- Combine topic modeling and network analysis in a unified optimization framework
- NetPLSA = PLSA + network regularization
- Topical communities and topic maps
- Future work: using other topic models (e.g., LDA); more network properties (e.g., small world); topic evolution/spreading on networks

Thanks!