Predictively Modeling Social Text William W. Cohen Machine Learning Dept. and Language Technologies Institute School of Computer Science Carnegie Mellon University Joint work with: Amr Ahmed, Andrew Arnold, Ramnath Balasubramanyan, Frank Lin, Matt Hurst (MSFT), Ramesh Nallapati, Noah Smith, Eric Xing, Tae Yano

Newswire Text vs. Social Media Text

Newswire text: formal.
–Primary purpose: inform the "typical reader" about recent events
–Broad audience: explicitly establish shared context with the reader; ambiguity is often avoided

Social media text: informal.
–Many purposes: entertain, connect, persuade…
–Narrow audience: friends and colleagues; shared context already established; many statements are ambiguous out of social context

Newswire Text vs. Social Media Text

Goals of analysis for newswire text:
–Extract information about events from the text
–"Understanding" the text requires understanding the "typical reader" and the conventions for communicating with him/her (prior knowledge, background, …)

Goals of analysis for social media text:
–Very diverse
–Evaluation is difficult, and requires revisiting often as goals evolve
–Often "understanding" social text requires understanding a community

Outline

Tools for analysis of text:
–Probabilistic models for text, communities, and time (mixture models and LDA models for text; LDA extensions to model hyperlink structure; LDA extensions to model time)
–An alternative framework based on graph analysis to model time & community (preliminary results & tradeoffs)

Discussion of results & challenges

Introduction to Topic Models: Multinomial Naïve Bayes

[Plate diagram: a class variable C generates words W_1, W_2, W_3, …, W_N; the box (plate) around the structure is repeated M times, once per document; parameters are the class prior π and the per-class word distributions β. Example: C = 'football', words = "The Pittsburgh Steelers … won".] The box is shorthand for many repetitions of the structure.

Introduction to Topic Models: Multinomial Naïve Bayes

[Same plate diagram, with a second example: C = 'politics', words = "The Pittsburgh mayor … stated".]

Introduction to Topic Models: Naïve Bayes Model, compact representation

[Diagram: the explicit model with W_1, …, W_N redrawn in compact plate notation: C generates W inside a plate of size N, nested inside a plate of size M, with parameters π and β.]

Introduction to Topic Models: Multinomial Naïve Bayes

Generative story:
For each document d = 1, …, M
  Generate C_d ~ Mult(· | π)
  For each position n = 1, …, N_d
    Generate w_n ~ Mult(· | β, C_d)

Example, for document d = 1:
  Generate C_d ~ Mult(· | π) = 'football'
  For each position n = 1, …, N_d = 67
    Generate w_1 ~ Mult(· | β, C_d) = 'the'
    Generate w_2 = 'Pittsburgh'
    Generate w_3 = 'Steelers'
    …
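As a concrete illustration, here is a minimal sketch of this generative story in Python. The class names, vocabulary, and the values of π and β are all made up for illustration; in practice they would be estimated from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters (hypothetical, for illustration only).
classes = ["football", "politics"]
pi = np.array([0.5, 0.5])                                 # class prior pi
vocab = ["the", "pittsburgh", "steelers", "mayor", "won", "stated"]
beta = np.array([                                         # p(word | class), one row per class
    [0.3, 0.2, 0.25, 0.0, 0.25, 0.0],                     # football
    [0.3, 0.2, 0.0, 0.25, 0.0, 0.25],                     # politics
])

def generate_document(n_words):
    """Draw a class C_d, then draw each word independently given C_d."""
    c = rng.choice(len(classes), p=pi)                    # C_d ~ Mult(. | pi)
    words = rng.choice(vocab, size=n_words, p=beta[c])    # w_n ~ Mult(. | beta, C_d)
    return classes[c], list(words)

print(generate_document(6))
```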

Introduction to Topic Models: Multinomial Naïve Bayes

In the graphs: shaded circles are known (observed) values; the parents of a variable W are the inputs to the function used to generate W. Goal: given the known values, estimate the rest, usually so as to maximize the probability of the observations.

Introduction to Topic Models: Mixture model (unsupervised naïve Bayes)

[Plate diagram: a hidden class Z generates the N words W of each of the M documents, with parameters π and β.]

Joint probability of words and classes:
  P(Z_d = z, w_1, …, w_N) = P(z | π) · Π_n P(w_n | β, z)

But the classes are not visible, so the model of the observed words marginalizes over Z:
  P(w_1, …, w_N) = Σ_z P(z | π) · Π_n P(w_n | β, z)

Introduction to Topic Models

Learning for naïve Bayes:
–Take logs; the objective decomposes and is easy to optimize for any parameter (closed-form estimates)

Learning for the mixture model:
–Many local maxima (at least one for each permutation of the classes)
–Expectation-maximization (EM) is the most common method

Introduction to Topic Models: Mixture model, EM solution

E-step (for each document, the posterior over its hidden class):
  P(Z_d = z | w_d) ∝ π_z · Π_n P(w_n | β, z)

M-step (re-estimate the parameters from these posteriors):
  π_z ∝ Σ_d P(Z_d = z | w_d)
  P(w | β, z) ∝ Σ_d P(Z_d = z | w_d) · (count of w in d)

Introduction to Topic Models: Mixture model, EM solution

E-step: estimate the expected values of the unknown variables (a "soft classification" of each document).
M-step: maximize over the parameter values subject to this guess; usually, this means learning the parameter values given the "soft classifications".
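A compact sketch of these two steps for a mixture of multinomials, operating on a document-by-vocabulary matrix of word counts. The function and variable names are illustrative, not taken from any particular paper or library.

```python
import numpy as np

def em_multinomial_mixture(X, k, n_iter=50, seed=0):
    """X: (n_docs x vocab_size) word-count matrix. Returns priors, word distributions, responsibilities."""
    rng = np.random.default_rng(seed)
    n_docs, vocab_size = X.shape
    pi = np.full(k, 1.0 / k)                           # class priors
    beta = rng.dirichlet(np.ones(vocab_size), size=k)  # per-class word distributions
    for _ in range(n_iter):
        # E-step: soft class assignments ("responsibilities") for each document.
        log_p = np.log(pi) + X @ np.log(beta).T        # (n_docs x k) unnormalized log posteriors
        log_p -= log_p.max(axis=1, keepdims=True)      # stabilize before exponentiating
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments.
        pi = resp.mean(axis=0)
        beta = resp.T @ X + 1e-6                       # tiny smoothing to avoid zero probabilities
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta, resp
```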

Introduction to Topic Models

Probabilistic Latent Semantic Analysis (PLSA)

Generative story:
Select document d ~ Mult(π)
For each position n = 1, …, N_d
  generate z_n ~ Mult(· | θ_d)
  generate w_n ~ Mult(· | β_{z_n})

θ_d is the topic distribution of document d.

Mixture model: each document is generated by a single (unknown) multinomial distribution over words; the corpus is "mixed" by π.
PLSA model: each word is generated by a single unknown multinomial distribution over words, and each document is mixed by θ_d.

Introduction to Topic Models: Latent Dirichlet Allocation (Blei, Ng & Jordan, JMLR 2003)

Introduction to Topic Models: Latent Dirichlet Allocation

For each document d = 1, …, M
  Generate θ_d ~ Dir(· | α)
  For each position n = 1, …, N_d
    generate z_n ~ Mult(· | θ_d)
    generate w_n ~ Mult(· | β_{z_n})
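A small sketch of this generative process in Python. The number of topics, vocabulary size, and the values of α and β are toy choices made up for illustration; in a trained model β would be learned from a corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 8                                 # number of topics, vocabulary size (toy values)
alpha = np.full(K, 0.1)                     # Dirichlet prior over per-document topic mixtures
beta = rng.dirichlet(np.ones(V), size=K)    # topic-word distributions (random here; normally learned)

def generate_lda_document(n_words):
    theta = rng.dirichlet(alpha)                      # theta_d ~ Dir(alpha)
    z = rng.choice(K, size=n_words, p=theta)          # z_n ~ Mult(theta_d)
    w = [int(rng.choice(V, p=beta[zn])) for zn in z]  # w_n ~ Mult(beta_{z_n})
    return theta, z, w

theta, z, w = generate_lda_document(10)
```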

Introduction to Topic Models LDA’s view of a document

Introduction to Topic Models LDA topics

Introduction to Topic Models

Latent Dirichlet Allocation:
–Overcomes some technical issues with PLSA (PLSA only estimates mixing parameters for training docs)
–Parameter learning is more complicated: Gibbs sampling (easy to program, often slow) or variational EM
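For concreteness, here is a bare-bones sketch of the standard collapsed Gibbs sampler for LDA (not the exact code behind any result in these slides). Documents are lists of integer word ids; the hyperparameter names and defaults are illustrative.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, n_iter=200, seed=0):
    """docs: list of lists of word ids in [0, V). Returns doc-topic and topic-word count tables."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial topic assignments
    ndk = np.zeros((len(docs), K))                         # doc-topic counts
    nkw = np.zeros((K, V))                                 # topic-word counts
    nk = np.zeros(K)                                       # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1        # remove the current assignment
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                k = rng.choice(K, p=p / p.sum())                  # resample the topic for this token
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw
```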

Introduction to Topic Models

[Figure: perplexity comparison of various models: unigram, mixture model, PLSA, LDA. Lower is better.]

Introduction to Topic Models

[Figure: prediction accuracy for classification using topic models as features. Higher is better.]

Outline

Tools for analysis of text:
–Probabilistic models for text, communities, and time (mixture models and LDA models for text; LDA extensions to model hyperlink structure; LDA extensions to model time)
–An alternative framework based on graph analysis to model time & community (preliminary results & tradeoffs)

Discussion of results & challenges

Hyperlink modeling using LDA

Hyperlink modeling using LinkLDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]

For each document d = 1, …, M
  Generate θ_d ~ Dir(· | α)
  For each position n = 1, …, N_d
    generate z_n ~ Mult(· | θ_d)
    generate w_n ~ Mult(· | β_{z_n})
  For each citation j = 1, …, L_d
    generate z_j ~ Mult(· | θ_d)
    generate c_j ~ Mult(· | γ_{z_j}), where γ_z is the topic-specific distribution over cited documents

Learning uses variational EM.
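A sketch of the LinkLDA generative story, extending the LDA snippet above with a second outcome type (citations). The symbol γ and all numeric values are illustrative, not the paper's notation or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, C = 3, 8, 5                            # topics, vocabulary size, number of citable docs (toy values)
alpha = np.full(K, 0.1)
beta = rng.dirichlet(np.ones(V), size=K)     # topic -> word distribution
gamma = rng.dirichlet(np.ones(C), size=K)    # topic -> cited-document distribution

def generate_linklda_document(n_words, n_links):
    theta = rng.dirichlet(alpha)                                    # theta_d ~ Dir(alpha)
    words = [int(rng.choice(V, p=beta[rng.choice(K, p=theta)]))     # z_n ~ Mult(theta), w_n ~ Mult(beta_z)
             for _ in range(n_words)]
    links = [int(rng.choice(C, p=gamma[rng.choice(K, p=theta)]))    # z_j ~ Mult(theta), c_j ~ Mult(gamma_z)
             for _ in range(n_links)]
    return theta, words, links
```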

Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]

Newswire Text vs. Social Media Text

Goals of analysis for newswire text:
–Extract information about events from the text
–"Understanding" the text requires understanding the "typical reader" and the conventions for communicating with him/her (prior knowledge, background, …)

Goals of analysis for social media text:
–Very diverse
–Evaluation is difficult, and requires revisiting often as goals evolve
–Often "understanding" social text requires understanding a community

Science as a testbed for social text: an open community which we understand.

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

"Copycat" model of citation influence:
–LDA model for cited papers
–Extended LDA model for citing papers
–For each word, depending on a coin flip c, you might choose to copy a word from a cited paper instead of generating the word

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] Citation influence graph for LDA paper

Models of hypertext for blogs [ICWSM 2008]
Ramesh Nallapati, William Cohen (me), Amr Ahmed, Eric Xing

Link-PLSA-LDA

–LinkLDA model for citing documents
–Variant of the PLSA model for cited documents
–Topics are shared between citing and cited documents
–Links depend on the topics in the two documents

Experiments

–8.4M blog postings in the Nielsen/Buzzmetrics corpus, collected over three weeks in summer 2005
–Selected all postings with >=2 inlinks or >=2 outlinks: 2248 citing documents (2+ outlinks) and 1777 cited documents (2+ inlinks); only 68 are in both sets, and these are duplicated
–Fit the model using variational EM

Topics in blogs

The model can answer questions like: which blogs are most likely to be cited when discussing topic z?

Topics in blogs

The model can be evaluated by predicting which links an author will include in an article.

[Figure: link-prediction results for Link-PLSA-LDA vs. Link-LDA; lower is better.]

Another model: Pairwise Link-LDA

[Plate diagram: LDA-style generation of the words of both the cited and the citing document, plus a link indicator c for every pair of documents.]

–LDA for both cited and citing documents
–Generates an indicator for every pair of docs (vs. generating pairs of docs)
–The link depends on the mixing components (the θ's), as in a stochastic block model

Pairwise Link-LDA supports new inferences… …but doesn’t perform better on link prediction

Outline

Tools for analysis of text:
–Probabilistic models for text, communities, and time (mixture models and LDA models for text; LDA extensions to model hyperlink structure (observation: these models can be used for many purposes…); LDA extensions to model time)
–An alternative framework based on graph analysis to model time & community

Discussion of results & challenges

Predicting Response to Political Blog Posts with Topic Models [NAACL ’09] Tae Yano Noah Smith

Political blogs and comments

–Comment style is casual, creative, and less carefully edited
–Posts are often coupled with comment sections

Political blogs and comments

Most of the text associated with large "A-list" community blogs is comments:
–5-20x as many words in comments as in post text, for the 5 sites considered in Yano et al.

A large part of socially-created commentary in the blogosphere is comments, not blog-to-blog hyperlinks.

Comments do not just echo the post.

Modeling political blogs

Our political blog model: CommentLDA.
Notation: D = # of documents; N = # of words in the post; M = # of words in the comments; z, z' = topic; w = word (in post); w' = word (in comments); u = user.

Modeling political blogs

Our proposed political blog model: CommentLDA. The left-hand side of the plate diagram is vanilla LDA. (D = # of documents; N = # of words in the post; M = # of words in the comments.)

Modeling political blogs

Our proposed political blog model: CommentLDA. The right-hand side captures the generation of the reaction separately from the post body, using two separate sets of word distributions; the two "chambers" share the same topic mixture. (D = # of documents; N = # of words in the post; M = # of words in the comments.)

Modeling political blogs

Our proposed political blog model: CommentLDA. The user IDs of the commenters are treated as part of the comment text, and are generated along with the words in the comment section. (D = # of documents; N = # of words in the post; M = # of words in the comments.)
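A rough generative sketch matching this description: a shared per-post topic mixture, one word distribution for the post body, a second word distribution for comments, and a per-topic distribution over commenters. All symbols (beta_post, beta_comment, psi_user) and numeric values are made up for illustration and are not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, U = 3, 8, 6                                  # topics, vocabulary size, number of users (toy values)
alpha = np.full(K, 0.1)
beta_post = rng.dirichlet(np.ones(V), size=K)      # topic -> post-word distribution
beta_comment = rng.dirichlet(np.ones(V), size=K)   # topic -> comment-word distribution (second "chamber")
psi_user = rng.dirichlet(np.ones(U), size=K)       # topic -> commenter distribution

def generate_post_with_comments(n_post_words, n_comment_words):
    theta = rng.dirichlet(alpha)                   # one topic mixture shared by post and comments
    post = [int(rng.choice(V, p=beta_post[rng.choice(K, p=theta)]))
            for _ in range(n_post_words)]
    comments = []
    for _ in range(n_comment_words):
        z = rng.choice(K, p=theta)
        comments.append((int(rng.choice(U, p=psi_user[z])),       # who comments
                         int(rng.choice(V, p=beta_comment[z]))))  # what they say
    return theta, post, comments
```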

Modeling political blogs

Another model we tried is agnostic to the words in the comment section: the comment words are removed, and only the commenters' user IDs are generated from the shared topic mixture. This model is structurally equivalent to LinkLDA (Erosheva et al., 2004). (D = # of documents; N = # of words in the post; M = # of words in the comments.)

Topic discovery - Matthew Yglesias (MY) site

Topic discovery - Matthew Yglesias (MY) site

Topic discovery - Matthew Yglesias (MY) site

LinkLDA and CommentLDA consistently outperform the baseline models; neither consistently outperforms the other.

[Figure: comment-prediction and user-prediction results (precision at top 10) on the CB, MY, and RS sites. From left to right in each group: Link LDA (-v, -r, -c), Comment LDA (-v, -r, -c), baselines (Freq, NB).]

From Episodes to Sagas: Temporally Clustering News Via Social-Media Commentary [current work]
Noah Smith, Matthew Hurst, Frank Lin, Ramnath Balasubramanyan

Motivation

–The news-related blogosphere is driven by recency
–Some recent news is better understood given the context of a sequence of related stories
–Some readers have this context; some don't
–To reconstruct the context, reconstruct the sequence of related stories (the "saga"); similar to retrospective event detection

First efforts: find related stories, then cluster by time.
Evaluation: agreement with human annotators.

Clustering results on Democratic-primary-related documents

[Figure: clustering results.] SpeCluster + time: a mixture of multinomials, plus a model for "general" text, plus a timestamp drawn from a Gaussian. k-walks: described later.

Clustering results on Democratic-primary-related documents

We also had three human annotators build gold-standard timelines: hierarchical, and annotated with the names of events, times, … A machine-produced timeline can then be evaluated by its tree-distance to a gold-standard one.

Clustering results on Democratic-primary-related documents

Issue: divergence of opinion with the human annotators. Is modeling community interests the problem? How much of what we want is actually in the data? Should this task be supervised or unsupervised?

More sophisticated time models

Hierarchical LDA Over Time (HOTS) model:
–LDA to generate text
–Also generates a timestamp for each document from topic-specific Gaussians
–"Non-parametric" model: the number of clusters is also generated (not specified by the user)
–Allows use of user-provided "prototypes"
–Evaluated on liberal/conservative blogs and ML papers from NIPS conferences

Ramnath Balasubramanyan

Results with HOTS model - unsupervised

Results with HOTS model – human guidance Adding human seeds for some key events improves performance on all events. Allows a user to partially specify a timeline of events and have the system complete it.

Outline

Tools for analysis of text:
–Probabilistic models for text, communities, and time (mixture models and LDA models for text; LDA extensions to model hyperlink structure; LDA extensions to model time)
–An alternative framework based on graph analysis to model time & community (preliminary results & tradeoffs)

Discussion of results & challenges

Spectral Clustering: Graph = Matrix; Vector = Node → Weight

[Figure: a small graph on nodes A-J, its adjacency matrix M, and a vector v assigning a weight to each node (e.g. A=3, B=2, C=3, …).]

Spectral Clustering: Graph = Matrix

M*v1 = v2 "propagates weights from neighbors"

[Figure: multiplying the adjacency matrix M by the weight vector v1 gives a new vector v2 in which each node's entry is the sum of its neighbors' weights, e.g. v2(A) = 2·1 + 3·1, v2(B) = 3·1 + 3·1, v2(C) = 3·1 + 2·1, ….]

Spectral Clustering: Graph = Matrix

M*v1 = v2 "propagates weights from neighbors"

[Figure: the same multiplication with the sums evaluated, e.g. v2(A) = 5, v2(B) = 6, v2(C) = 5, ….]

Spectral Clustering: Graph = Matrix

W*v1 = v2 "propagates weights from neighbors"

[Figure, after Shi & Meila, 2002: plotting the nodes by their values in the second and third eigenvectors (e2, e3) of W separates the graph's subcommunities (the x, y, and z groups) into distinct regions.]

Spectral Clustering

If W is the row-normalized adjacency matrix of a connected graph with k closely-connected subcommunities, then:
–the "top" eigenvector is a constant vector
–the next k eigenvectors are roughly piecewise constant, with "pieces" corresponding to the subcommunities

Spectral clustering:
–Find the top k+1 eigenvectors v1, …, v(k+1)
–Discard the "top" one
–Replace every node a with the k-dimensional vector x_a = <v2(a), …, v(k+1)(a)>
–Cluster with k-means
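A minimal sketch of that recipe, assuming a dense symmetric adjacency matrix A and scikit-learn's KMeans; a real implementation would use sparse matrices and an iterative eigensolver.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(A, k):
    """A: symmetric (n x n) adjacency matrix. Returns a cluster label for each node."""
    W = A / A.sum(axis=1, keepdims=True)      # row-normalized adjacency (random-walk matrix)
    vals, vecs = np.linalg.eig(W)
    order = np.argsort(-vals.real)            # eigenvectors sorted by eigenvalue, largest first
    X = vecs.real[:, order[1:k + 1]]          # drop the constant "top" eigenvector, keep the next k
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)  # embed nodes, then cluster with k-means
```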

Spectral Clustering: Pros and Cons

–Elegant, and well-founded mathematically
–Works quite well when relations are approximately transitive (like similarity, or social connections)
–Expensive for very large datasets: computing the eigenvectors is the bottleneck
–Noisy datasets cause problems: the "informative" eigenvectors need not be in the top few, and performance can drop suddenly from good to terrible

Experimental results: best-case assignment of class labels to clusters Adamic & Glance “Divided They Blog:…” 2004

Spectral Clustering: Graph = Matrix (revisited)

M*v1 = v2 "propagates weights from neighbors"

[Figure: as before, multiplying M by v1 replaces each node's weight with the sum of its neighbors' weights, e.g. v2(A) = 5, v2(B) = 6, v2(C) = 5, ….]

Repeated averaging with neighbors as a clustering method

–Pick a vector v0 (maybe even at random)
–Compute v1 = W·v0, i.e., replace v0[x] with the weighted average of v0[y] over the neighbors y of x
–Plot v1[x] for each x
–Repeat for v2, v3, …

What are the dynamics of this process?

Repeated averaging with neighbors on a sample problem…

[Figure: plots of the node values after a small number of averaging steps and after larger numbers of steps.]

PIC: Power Iteration Clustering (Frank Lin)

Run power iteration (repeated averaging with neighbors) with early stopping.
–Formally, can show this works when spectral techniques work
–Experimentally, linear time
–Easy to implement and efficient
–Very easily parallelized
–Experimentally, often better than traditional spectral methods
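A bare-bones sketch of power iteration clustering under my reading of the published algorithm: the initial vector, normalization, and stopping rule below are plausible choices, not necessarily the authors' exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def power_iteration_clustering(A, k, eps=1e-5, max_iter=1000):
    """A: (n x n) affinity matrix. Cluster nodes via power iteration with early stopping."""
    W = A / A.sum(axis=1, keepdims=True)      # row-normalized affinity (averaging) matrix
    v = A.sum(axis=1) / A.sum()               # initial vector proportional to node degree
    prev_delta = None
    for _ in range(max_iter):
        v_new = W @ v                         # one step of repeated averaging with neighbors
        v_new /= np.abs(v_new).sum()          # keep the vector at a fixed scale
        delta = np.abs(v_new - v).sum()
        # Early stopping: halt once the per-step change stabilizes, i.e. before the
        # iteration collapses all the way to the trivial constant eigenvector.
        if prev_delta is not None and abs(prev_delta - delta) < eps:
            v = v_new
            break
        v, prev_delta = v_new, delta
    # The near-piecewise-constant 1-D embedding is clustered with k-means.
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
```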

Experimental results: best-case assignment of class labels to clusters

Experiments: run time and scalability

[Figure: run time in milliseconds.]

Clustering results on Democratic-primary-related documents: k-walks vs. human annotators

k-walks is an early version of PIC. It clusters a graph with several types of nodes (blog entries, news stories, and dates); the clusters ("communities") of the graph correspond to events in the "saga". Advantage: PIC clusters at interactive speeds.

Outline

Tools for analysis of text:
–Probabilistic models for text, communities, and time (mixture models and LDA models for text; LDA extensions to model hyperlink structure; LDA extensions to model time)
–An alternative framework based on graph analysis to model time & community (preliminary results & tradeoffs)

Discussion of results & challenges

Social Media Text: goals of analysis are very diverse; evaluation is difficult, and requires revisiting often as goals evolve; often "understanding" social text requires understanding a community.

Probabilistic models can model many aspects of social text:
–Community (links, comments)
–Time

Evaluation:
–introspective and qualitative, on communities we understand (e.g. scientific communities)
–quantitative, on predictive tasks (link prediction, user prediction, …)
–against a "gold-standard visualization" (sagas)

Thanks to… NIH/NIGMS, NSF, Microsoft LiveLabs, Microsoft Research, Johnson & Johnson, and the Language Technologies Institute, CMU