Hybrid Models for Text and Graphs 10/23/2012 Analysis of Social Media

Newswire Text vs. Social Media Text
Newswire text is formal, with one primary purpose: inform the "typical reader" about recent events. It has a broad audience: shared context with the reader is established explicitly, and ambiguity is often avoided.
Social media text is informal and serves many purposes: entertain, connect, persuade, … It has a narrow audience of friends and colleagues; shared context is already established, and many statements are ambiguous out of social context.

Newswire Text vs. Social Media Text
Goals of analyzing newswire text: extract information about events from text; "understanding" the text requires understanding the "typical reader" and the conventions for communicating with him/her (prior knowledge, background, …).
Goals of analyzing social media text: very diverse; evaluation is difficult, and requires revisiting often as goals evolve; often "understanding" social text requires understanding a community.

Outline Tools for analysis of text –Probabilistic models for text, communities, and time Mixture models and LDA models for text LDA extensions to model hyperlink structure LDA extensions to model time

Introduction to Topic Models
Mixture model: an unsupervised naïve Bayes model. Plate diagram: a latent class C per document, words W in plates of size N (words per document) and M (documents). Joint probability of words and class: P(C=c, w_1,…,w_N) = P(c) ∏_n P(w_n | c). But the classes Z are not visible, so they must be summed out.
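As a concrete illustration of this joint probability, here is a small Python sketch with toy parameters. The names `log_pi` (class prior) and `log_phi` (class-word distributions) are my own, not from the slides; the sketch computes the joint log-probability and then marginalizes out the unobserved class with log-sum-exp.

```python
import numpy as np

def joint_log_prob(words, c, log_pi, log_phi):
    """log P(C=c, w_1..w_N) = log pi_c + sum_n log phi[c, w_n]."""
    return log_pi[c] + sum(log_phi[c, w] for w in words)

def marginal_log_prob(words, log_pi, log_phi):
    """log P(w_1..w_N): sum over the unobserved class C (log-sum-exp)."""
    per_class = np.array([joint_log_prob(words, c, log_pi, log_phi)
                          for c in range(len(log_pi))])
    m = per_class.max()
    return m + np.log(np.exp(per_class - m).sum())

# Toy model: 2 classes, 3 word types.
log_pi = np.log([0.5, 0.5])
log_phi = np.log([[0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8]])
doc = [0, 0, 2]
print(marginal_log_prob(doc, log_pi, log_phi))  # ≈ -3.3242 (= log 0.036)
```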

Introduction to Topic Models JMLR, 2003

Introduction to Topic Models
Latent Dirichlet Allocation. For each document d = 1,…,M: generate θ_d ~ Dir(·|α); then for each position n = 1,…,N_d: generate z_n ~ Mult(·|θ_d), and generate w_n ~ Mult(·|β_{z_n}).
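The LDA generative story can be run forward directly as Python. This is an illustrative sketch with toy values for α and β; the function and variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_corpus(M, N, alpha, beta):
    """Sample M documents of N words each from the LDA generative process.
    alpha: (K,) Dirichlet prior over topics; beta: (K, V) topic-word distributions."""
    K, V = beta.shape
    docs = []
    for _ in range(M):
        theta = rng.dirichlet(alpha)                           # theta_d ~ Dir(alpha)
        z = rng.choice(K, size=N, p=theta)                     # z_n ~ Mult(theta_d)
        w = np.array([rng.choice(V, p=beta[zn]) for zn in z])  # w_n ~ Mult(beta_{z_n})
        docs.append(w)
    return docs

# Toy run: 2 sharply peaked topics over 4 word types.
beta = np.array([[0.97, 0.01, 0.01, 0.01],
                 [0.01, 0.01, 0.01, 0.97]])
docs = sample_corpus(M=5, N=20, alpha=np.array([0.1, 0.1]), beta=beta)
print(len(docs), docs[0][:5])
```

With a small α, each sampled document concentrates on one of the two topics, which is the clustering behavior the model exploits in inference.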

Introduction to Topic Models
Latent Dirichlet Allocation overcomes some technical issues with PLSA: PLSA only estimates mixing parameters for training docs. Parameter learning is more complicated: Gibbs sampling (easy to program, often slow) or variational EM.
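A minimal collapsed Gibbs sampler for LDA — the "easy to program, often slow" option mentioned above — might look like the following. This is a toy sketch, not anyone's production code; names and hyperparameter defaults are illustrative.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of word-id lists; returns topic assignments and count tables."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # doc-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # topic totals
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            ndk[d, z[d][n]] += 1; nkw[z[d][n], w] += 1; nk[z[d][n]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # P(z_n = k | rest) ∝ (ndk + alpha) * (nkw + beta) / (nk + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, ndk, nkw

# Two "topics": docs 0-1 use words {0,1}, docs 2-3 use words {2,3}.
docs = [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [2, 3, 2, 3, 2], [3, 2, 3, 2, 3]]
z, ndk, nkw = gibbs_lda(docs, K=2, V=4)
print(ndk)
```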

Introduction to Topic Models
Perplexity comparison of various models (unigram, mixture model, PLSA, LDA); lower is better.
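Perplexity, the metric in this comparison, is the exponentiated negative held-out log-likelihood per word. A quick sketch (names are illustrative):

```python
import numpy as np

def perplexity(log_probs, n_words):
    """Perplexity = exp(- total held-out log-likelihood / total word count)."""
    return np.exp(-np.sum(log_probs) / n_words)

# A model assigning each of 100 held-out words probability 1/50
# has perplexity 50: it is as "confused" as a uniform choice over 50 words.
lp = np.full(100, np.log(1 / 50))
print(perplexity(lp, 100))  # ≈ 50
```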

Introduction to Topic Models
Prediction accuracy for classification, using learned topic models as features; higher is better.

Before LDA: LSA and PLSA

Introduction to Topic Models
Probabilistic Latent Semantic Analysis (PLSA) model: select a document d ~ Mult(π); for each position n = 1,…,N_d: generate z_n ~ Mult(·|θ_d), then generate w_n ~ Mult(·|φ_{z_n}), where θ_d is the document's topic distribution. In the PLSA model each word is generated by a single unknown multinomial distribution over words, and each document is mixed by θ_d; we need to estimate θ_d for each d, so overfitting is easy. LDA instead integrates out θ_d and only estimates φ.

Introduction to Topic Models PLSA topics (TDT-1 corpus)

Outline Tools for analysis of text –Probabilistic models for text, communities, and time Mixture models and LDA models for text LDA extensions to model hyperlink structure LDA extensions to model time –Alternative framework based on graph analysis to model time & community Preliminary results & tradeoffs Discussion of results & challenges

Hyperlink modeling using PLSA

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
Select a document d ~ Mult(π). For each position n = 1,…,N_d: generate z_n ~ Mult(·|θ_d), then w_n ~ Mult(·|φ_{z_n}). For each citation j = 1,…,L_d: generate z_j ~ Mult(·|θ_d), then c_j ~ Mult(·|ω_{z_j}).

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
The PLSA likelihood is extended with a citation term in the new likelihood; learning uses EM.

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
Heuristic: a weight α with 0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks, weighting the content term by α and the link term by (1-α).

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
Experiments: text classification. Datasets: WebKB (6000 CS-department web pages with hyperlinks; 6 classes: faculty, course, student, staff, etc.) and Cora (2000 machine-learning abstracts with citations; 7 classes: sub-areas of machine learning). Methodology: learn the model on the complete data and obtain θ_d for each document; classify each test document with the label of its nearest neighbor in the training set, with distance measured as cosine similarity in the θ space; measure performance as a function of α.
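The nearest-neighbor classification step in this methodology is easy to sketch. The code below is illustrative (θ vectors as numpy arrays, cosine similarity, label of the most similar training document); names and toy values are my own.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two topic-mixture vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nn_label(theta_test, thetas_train, labels_train):
    """Label a test doc with the label of its nearest training doc,
    measuring similarity as cosine in the theta (topic-mixture) space."""
    sims = [cosine(theta_test, t) for t in thetas_train]
    return labels_train[int(np.argmax(sims))]

# Toy theta space with two topics and two classes.
train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
labels = ["faculty", "faculty", "course"]
print(nn_label(np.array([0.85, 0.15]), train, labels))  # → faculty
```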

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
Classification performance as the weight α varies from hyperlinks only to content only.

Hyperlink modeling using LDA

Hyperlink modeling using LinkLDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]
For each document d = 1,…,M: generate θ_d ~ Dir(·|α). For each position n = 1,…,N_d: generate z_n ~ Mult(·|θ_d), then w_n ~ Mult(·|β_{z_n}). For each citation j = 1,…,L_d: generate z_j ~ Mult(·|θ_d), then c_j ~ Mult(·|γ_{z_j}). Learning uses variational EM.
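A sketch of this generative process in Python: citations and words draw their topics from the same θ_d. The symbol names here (`beta` for topic-word distributions, `omega` for topic-citation distributions) are my own choices for illustration, not notation from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_doc(alpha, beta, omega, N, L):
    """One LinkLDA-style document: N words and L citations share theta_d.
    beta: (K, V) topic-word dists; omega: (K, C) topic-citation dists."""
    K = len(alpha)
    theta = rng.dirichlet(alpha)                                   # theta_d ~ Dir(alpha)
    w = [rng.choice(beta.shape[1], p=beta[rng.choice(K, p=theta)])
         for _ in range(N)]                                        # words
    c = [rng.choice(omega.shape[1], p=omega[rng.choice(K, p=theta)])
         for _ in range(L)]                                        # citations
    return w, c

# Toy model: 2 topics, 2 word types, 3 possible cited documents.
beta = np.array([[0.9, 0.1], [0.1, 0.9]])
omega = np.array([[0.9, 0.05, 0.05], [0.05, 0.05, 0.9]])
w, c = sample_doc(np.array([0.5, 0.5]), beta, omega, N=10, L=3)
print(w, c)
```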

Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]

Newswire Text vs. Social Media Text
Goals of analyzing newswire text: extract information about events from text; "understanding" the text requires understanding the "typical reader" and the conventions for communicating with him/her (prior knowledge, background, …).
Goals of analyzing social media text: very diverse; evaluation is difficult, and requires revisiting often as goals evolve; often "understanding" social text requires understanding a community.
Science as a testbed for social text: an open community which we understand.

Author-Topic Model for Scientific Literature

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
For each author a = 1,…,A: generate θ_a ~ Dir(·|α). For each topic k = 1,…,K: generate φ_k ~ Dir(·|β). For each document d = 1,…,M and each position n = 1,…,N_d: generate an author x ~ Unif(·|a_d) from the document's author set, then generate z_n ~ Mult(·|θ_x), then w_n ~ Mult(·|φ_{z_n}).

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004] Perplexity results

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004] Topic-author visualization

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004] Application 1: author similarity

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004] Application 2: author entropy

Labeled LDA [Ramage, Hall, Nallapati, Manning, EMNLP 2009]

Labeled LDA Del.icio.us tags as labels for documents

Labeled LDA

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]

Gibbs sampling

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
Datasets: Enron email (23,488 messages between 147 users) and McCallum's personal email (23,488(?) messages with 128 authors).

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005] Topic visualization: Enron set

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005] Topic visualization: McCallum's data

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]

Models of hypertext for blogs [ICWSM 2008] — Ramesh Nallapati, me (William Cohen), Amr Ahmed, Eric Xing

Link-PLSA-LDA
A LinkLDA model for citing documents combined with a variant of the PLSA model for cited documents. Topics are shared between citing and cited documents, and links depend on the topics in the two documents.

Experiments
8.4M blog postings in the Nielsen/BuzzMetrics corpus, collected over three weeks in summer 2005. Selected all postings with >=2 inlinks or >=2 outlinks: 2248 citing documents (2+ outlinks) and 1777 cited documents (2+ inlinks); only 68 appear in both sets, and these are duplicated. The model is fit using variational EM.

Topics in blogs
The model can answer questions like: which blogs are most likely to be cited when discussing topic z?

Topics in blogs
The model can be evaluated by predicting which links an author will include in an article (lower is better): Link-PLSA-LDA vs. Link-LDA.

Another model: Pairwise Link-LDA
LDA for both cited and citing documents; generate an indicator for every pair of docs, versus generating pairs of docs. The link depends on the mixing components (the θ's), as in a stochastic block model.

Pairwise Link-LDA supports new inferences… …but doesn’t perform better on link prediction

Outline Tools for analysis of text –Probabilistic models for text, communities, and time Mixture models and LDA models for text LDA extensions to model hyperlink structure –Observation: these models can be used for many purposes… LDA extensions to model time –Alternative framework based on graph analysis to model time & community Discussion of results & challenges

Authors are using a number of clever tricks for inference….

Predicting Response to Political Blog Posts with Topic Models [NAACL ’09] Tae Yano Noah Smith

Political blogs and comments
Comment style is casual, creative, less carefully edited. Posts are often coupled with comment sections.

Political blogs and comments
Most of the text associated with large "A-list" community blogs is comments: 5-20x as many words in comments as in posts for the 5 sites considered in Yano et al. A large part of socially-created commentary in the blogosphere consists of comments, not blog-to-blog hyperlinks. Comments do not just echo the post.

Modeling political blogs
Our political blog model: CommentLDA. D = # of documents; N = # of words in the post; M = # of words in the comments. z, z' = topics; w = word (in post); w' = word (in comments); u = user.

Modeling political blogs
Our proposed political blog model: CommentLDA. The left-hand side is vanilla LDA. D = # of documents; N = # of words in the post; M = # of words in the comments.

Modeling political blogs
Our proposed political blog model: CommentLDA. The right-hand side captures the generation of reactions separately from the post body, with two separate sets of word distributions; the two chambers share the same topic mixture. D = # of documents; N = # of words in the post; M = # of words in the comments.

Modeling political blogs
Our proposed political blog model: CommentLDA. User IDs of the commenters are treated as part of the comment text when generating the words in the comment section. D = # of documents; N = # of words in the post; M = # of words in the comments.

Modeling political blogs
Another model we tried: take the words out of the comment section, so the model is agnostic to the comment words. This model is structurally equivalent to LinkLDA from (Erosheva et al., 2004). D = # of documents; N = # of words in the post; M = # of words in the comments.

Topic discovery - Matthew Yglesias (MY) site

LinkLDA and CommentLDA consistently outperform the baseline models, but neither consistently outperforms the other. (Figure: comment prediction and user prediction, precision at top 10, for the CB, MY, and RS sites; from left to right, LinkLDA (-v, -r, -c), CommentLDA (-v, -r, -c), and baselines (Freq, NB).)

Document modeling with Latent Dirichlet Allocation (LDA)
For each document d = 1,…,M: generate θ_d ~ Dir(·|α); then for each position n = 1,…,N_d: generate z_n ~ Mult(·|θ_d), and generate w_n ~ Mult(·|β_{z_n}).

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]

“SNA” = Jensen-Shannon divergence between the recipient distributions of messages

Modeling Citation Influences

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
Copycat model of citation influence: c is a cited document; s is a coin toss that mixes γ ("plagiarism": topics inherited from cited documents) with the document's own innovation topic distribution.

s is a coin toss to mix γ and the innovation topic distribution

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] Citation influence graph for LDA paper

Modeling Citation Influences

User study: self-reported citation influence on a Likert scale. LDA-post is Prob(cited doc | paper); LDA-js is the Jensen-Shannon distance in topic space.
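The Jensen-Shannon distance used by LDA-js can be sketched as follows: a symmetric, bounded comparison of two topic distributions (base-2, so values lie in [0, 1] bits). The code is an illustrative sketch with my own function names.

```python
import numpy as np

def kl(p, q):
    """KL divergence (base 2), skipping zero-probability entries of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence between two topic distributions, in bits."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([1.0, 0.0]); q = np.array([0.0, 1.0])
print(js(p, q))  # → 1.0 (disjoint distributions are maximally far apart)
```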