1
Predicting Response to Political Blog Posts with Topic Models
NAACL-HLT 2009, Boulder, CO June 03, 2009 Tae Yano, William Cohen, Noah Smith
2
This talk is about how we are designing topic models for online political discussion. This is joint work with William Cohen and Noah Smith.
3
Roadmap
Background: political discussion forums (blogs); topic modeling
Designing political blog models: model design; model specification
Model evaluation: quantitative evaluation (comment prediction); qualitative analysis
Summary and future work
4
Political blogs Why study blogs? An influential social phenomenon
Important venue for civil discourse. Previous work is in link analysis: (Adamic and Glance 2005), (Leskovec et al. 2007). Blog text is relatively understudied, even though the content is really what we want to understand. It is a different and interesting type of text that we don't usually deal with in NLP.
5
Political blogs - Illustration
Why is blog data different and interesting? The text is a heterogeneous mixture of components, reflecting different styles. It really is not one voice; it is made up of a pack of different components. Here are some examples. There are many other interesting aspects and challenges, but these illustrate the point.
6
Political blogs - Illustration
This is one post from a popular political site called Daily Kos. It was initially written by one author at one point in time, but since this is something people actually care about, the text did not quite end there.
7
Political blogs - Illustration
Posts are often coupled with comment sections. The text continued to grow for some time, collecting reactions from the community.
8
Political blogs - Illustration
Each comment is short, but the comment section as a whole is usually much longer than the post itself. Comments are certainly thematically related to the post, but when you look closely, they are quite different: comment style is casual, creative, and less carefully edited.
9
Political blogs - Illustration
Comments often meander across several themes. Ranting? Taxes and prices? Health care? Those are three comments on one post about health care.
10
Political blogs - Illustration
Posts tend to discuss multiple themes. House Republicans? Government neglect? Oil companies? Energy policy? This attention to multiple themes is present in the main post: this post is largely about House Republican politics, but it touches on the other themes as well.
11
Political blogs - Illustration
Comments can be constructive and formal. The comment section is often quite diverse within itself, due to its multiple authorship. The top comment here, if you read it carefully, is quite constructive and formal, while the bottom one is rather subjective and conversational.
12
Political blogs - Illustration
Comments can be very long. A comment can be quite verbose, or extremely terse.
13
Political blogs - Illustration
Blog text is comprised of several distinctive components:
A post and its reactions (comments)
A mixture of different themes within one post
Even more so in the comment section, due to multi-authorship: meandering among different themes, and diverse personal styles and pet issues
Both blog posts and comments can have multiple themes, but the language of the two sections is quite distinctive. Comments tend to be more casual and candid even when discussing the same subject, with copious spelling mistakes, jargon, and ungrammaticality. Comments also tend to be more opinionated: multiple authors bring personal styles and pet issues, and often drift into their own “talking points”.
14
Political blogs - Illustration
How should we approach this sort of data? There is no one answer to this question. Our approach is to treat it as an instance of topic modeling.
15
Topic modeling What is “topic modeling”?
A topic model is a type of probabilistic model, often used to describe the generation of text in a collection. The generative process (the “generative story”) is viewed as a series of stochastic draws from multiple distributions. Here are the steps.
16
Topic modeling A document is a collection of words.
[Figure: an example document shown as a bag of words — global, climate, change, children, mortgage, market, banking, crisis, believe, washington, recess, legislators, president, united states, oil companies, …]
17
Topic modeling A document is a collection of words. Each word is one draw from a topic, a distribution over words. Important point: word-topic assignments do not need to be annotated in the text; the model treats them as latent variables (and deals with them at inference time). [Figure: the example words colored by topic, with word distributions P(word | A), P(word | B), P(word | C).]
18
Topic modeling A document is a collection of words. Each word is one draw from a topic. A topic is one draw from the topic mixture, a distribution over topics which is unique to each document. [Figure: the example document with its topic mixture P(topic | doc 1).]
19
Topic modeling A document is a collection of words. Each word is one draw from a topic. A topic is a draw from the topic mixture. Sometimes the topic mixture itself is drawn from a prior distribution. [Figure: a prior over topic mixtures, shared across the D documents.]
20
Topic modeling The full generative story. A document is a collection of words. Each word is one draw from a topic. A topic is a draw from the topic mixture. A topic mixture is a draw from a prior distribution.
For each of D documents:
Draw a topic mixture from a distribution
For each of N words in the document:
Draw a topic from the topic mixture
Draw a word from the topic
Draw a topic mixture, draw a topic from the mixture, and a word from the topic; repeat the last two steps N times, and you get a document.
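The generative story above can be sketched in a few lines of Python. This is a toy illustration: the vocabulary, number of topics, and hyperparameters are made up, and the topics are random draws rather than fitted distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and topics (assumed for illustration; a real model
# would learn the topic distributions from data).
vocab = ["global", "climate", "change", "children", "mortgage", "market",
         "banking", "crisis", "washington", "recess", "legislators", "oil"]
K, V = 3, len(vocab)

alpha = np.full(K, 0.1)                      # prior over topic mixtures
topics = rng.dirichlet(np.full(V, 0.01), K)  # K word distributions P(word | topic)

def generate_document(n_words):
    theta = rng.dirichlet(alpha)             # draw a topic mixture for this document
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)           # draw a topic from the mixture
        w = rng.choice(V, p=topics[z])       # draw a word from that topic
        words.append(vocab[w])
    return words

print(generate_document(8))
```

Each document gets its own θ, so one document can mix several themes — the property that fits blog text well.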
21
Topic modeling The best-known example: Latent Dirichlet Allocation, or LDA (Blei, Ng, and Jordan 2003)
For each of D documents:
Draw θ from a Dirichlet prior α
For each of N words in the document:
Draw Zi from the multinomial θ
Draw Wi from the multinomial β given Zi
The signature of this model is the Dirichlet prior α over the topic mixture θ. [Figure: the unrolled graphical model with α, θ, Z1 … Zn, W1 … Wn, and β.]
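The talk later mentions Gibbs sampling for inference. As a sketch of what that involves for plain LDA (not the HBC-based implementation the authors used), a minimal collapsed Gibbs sampler looks roughly like this; the update p(z_i = k | rest) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ) is the standard one from Griffiths and Steyvers (2004).

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (minimal sketch).

    docs: list of documents, each a list of word ids in [0, V).
    Returns the final topic assignments and count tables.
    """
    rng = np.random.default_rng(seed)
    z = [[int(rng.integers(K)) for _ in doc] for doc in docs]  # random init
    ndk = np.zeros((len(docs), K))   # topic counts per document
    nkw = np.zeros((K, V))           # word counts per topic
    nk = np.zeros(K)                 # total words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove this token's current assignment from the counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # p(z_i = k | rest) ∝ (n_dk + α)(n_kw + β) / (n_k + Vβ)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, ndk, nkw
```

Each token's topic is resampled conditioned on all the others; the counts double as sufficient statistics for estimating θ and β afterwards.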
22
Topic modeling LDA and its variants are hierarchical models. They are often drawn with plate notation: an outer plate repeated D times (draw a topic mixture θ, once per document) and an inner plate repeated Nd times (draw a topic Zn, then a word Wn). Shaded nodes are observed. Nd = # of words in document d; D = # of documents in the corpus. The unrolled graph and the plate diagram are equivalent. [Figure: plate diagram with α, θ, Zn, Wn, and β.]
23
Topic modeling: Advantages
What does this complex model buy us? First, the model expresses the idea that one document is a mixture of a variety of themes, which fits quite nicely with our text! Second is a nuts-and-bolts engineering issue: this approach is convenient when working with structured corpora under uncertainty. Let me elaborate on the second point a bit more.
24
Topic modeling: Advantages
We have some (general) ideas about the structure and dynamics of our text (comments, multiple themes, the role of personal styles, etc.), but not a lot (no obvious taxonomy, no annotation scheme).
25
Topic modeling: Advantages
We have some (general) ideas, but not a whole lot. Topic modeling is handy here because it is:
Flexible: we can encode hypotheses in the model's structure, model uncertainty with latent variables, and have the model learn from available data
Adaptive: its modular build makes it easy to modify, expand, or remove portions of the model to adapt to new hypotheses or data
Versatile: one joint model can be used to answer many questions
Not in all cases, but in many, generic algorithms (such as Gibbs sampling or variational EM) can be used for inference.
26
Modeling political blogs
All of that is great, but it doesn't quite tell you what to do: the modeling framework is extremely flexible!
27
Modeling political blogs
How, exactly, should we design the model? By carefully reflecting upon the data! We would like to capture its salient traits: thematic richness, and the fact that posts and comments are related but different (stylistic differences, personality in the comments).
28
Modeling political blogs
How, exactly, should we design the model? By carefully reflecting upon the data, and by asking what the interesting questions are: What types of responses are likely for a post? Which users will be interested in a post? Do different posts cause similar reactions? Do different posts attract similar users? Etc.
29
Modeling political blogs
Our proposed political blog model: CommentLDA, a relatively simple extension of plain-vanilla LDA. I will explain it piece by piece. D = # of documents; N = # of words in the post; M = # of words in the comments.
30
Modeling political blogs
Our proposed political blog model: CommentLDA. The left-hand side is plain-vanilla LDA; this part handles the generation of the words in the post. [Figure: the LDA portion, with zi, wi, β, Nd, and D.] D = # of documents; N = # of words in the post; M = # of words in the comments.
31
Modeling political blogs
We added the right-hand side to capture the generation of the reaction separately from the post body. The two chambers share the same topic mixture, which captures the thematic relation between the two, but the observable words are generated from two separate sets of word distributions, β and β′. This captures the difference in language between the two sections. D = # of documents; N = # of words in the post; M = # of words in the comments.
32
Modeling political blogs
We treat user IDs as part of the comments, and generate them alongside the words. Inside the right-hand chamber are two blocks: we include the user ID, u, as part of the comment and model its generation as well as the comment's word content. At each position, we draw a topic from θ, draw a word from β′, and also draw a user ID from another multinomial, γ. D = # of documents; N = # of words in the post; M = # of words in the comments.
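The CommentLDA generative story described so far can be sketched as follows. This is a toy illustration: the dimensions are made up, and β, β′, and γ are random draws here rather than fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration).
K, V, U = 3, 50, 8          # topics, vocabulary size, number of users

alpha = np.full(K, 0.1)
beta_post = rng.dirichlet(np.full(V, 0.01), K)   # β : P(word | topic), post side
beta_cmnt = rng.dirichlet(np.full(V, 0.01), K)   # β′: P(word | topic), comment side
gamma = rng.dirichlet(np.full(U, 0.1), K)        # γ : P(user | topic)

def generate_blog_post(n_post_words, m_comment_words):
    theta = rng.dirichlet(alpha)                 # one topic mixture shared by both sides
    post = [rng.choice(V, p=beta_post[rng.choice(K, p=theta)])
            for _ in range(n_post_words)]
    comments = []
    for _ in range(m_comment_words):             # comment side: a word AND a user ID
        z = rng.choice(K, p=theta)
        w = rng.choice(V, p=beta_cmnt[z])
        u = rng.choice(U, p=gamma[z])
        comments.append((w, u))
    return post, comments

post, comments = generate_blog_post(20, 30)
```

The shared θ ties the two chambers thematically, while the separate β and β′ let the post and comment vocabularies differ.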
33
Modeling political blogs
Another model we tried: the same model, with the words taken out of the comment section! This model is agnostic to the words in the comments. The basic structure is the same as the first model, but we remove the part that models the generation of the comment words. The belief encoded here is that a post's reaction can be sufficiently summarized by who reacted, entirely disregarding the words used in the writing. This may seem like an oversimplification, but there is nothing preventing us from doing it.
34
Modeling political blogs
This model is equivalent to LinkLDA from (Erosheva et al. 2004), also used in (Nallapati and Cohen 2008). They applied the model to citation networks and to link analysis in blogs. D = # of documents; N = # of words in the post; M = # of words in the comments.
35
Modeling political blogs
How often should the commenter's user ID be generated for a given post? Variations on user ID generation (on the comment side):
“Verbosity” (the original model): generate a user ID once per comment word; M = # of words in all comments
“Comment frequency”: once per comment; M = # of comments on the post
“Response”: once per participant; M = # of participants on the post
The original model generates a user ID as many times as there are words, but that does not have to be the only way: we can design the model to generate one user ID per comment, or one per participant. Each variation encodes a slightly different way of thinking about users.
36
Think of this as encoding a hypothesis about which type of user ought to weigh more! Under Verbosity, users who write long comments weigh more. Under Comment frequency, users who engage in the conversation more often are more influential. Under Response, all participants are treated equally, regardless of what they said or how often. [Figure: the same set of comments (“…Liberty… …Democracy… …Fraternity… …Equality… …Whatever… :^)”) counted three ways.]
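The three counting schemes can be illustrated with a small helper. This is hypothetical code, not from the paper; the function name and data layout are ours.

```python
from collections import Counter

def user_id_draws(comments, scheme):
    """How many user-ID draws each commenter contributes under the
    three counting schemes (illustrative helper; names are ours).

    comments: list of (user, text) pairs for one post.
    """
    if scheme == "verbosity":          # one draw per comment word
        c = Counter()
        for user, text in comments:
            c[user] += len(text.split())
        return c
    if scheme == "comment":            # one draw per comment
        return Counter(user for user, _ in comments)
    if scheme == "response":           # one draw per unique participant
        return Counter({user: 1 for user, _ in comments})
    raise ValueError(scheme)

comments = [("alice", "i totally agree with this post"),
            ("bob", "no"),
            ("alice", "also taxes are too high")]
print(user_id_draws(comments, "verbosity"))  # alice: 11, bob: 1
```

Here Verbosity weights alice 11:1 over bob, Comment frequency weights her 2:1, and Response treats them equally.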
37
Model evaluation How good are those models?
Inference: doable with standard algorithms. We use Gibbs sampling implemented in C, using the Hierarchical Bayesian Compiler (Daumé 2007). Two evaluations: “Comment prediction”: guess who is going to comment on new posts (could be useful for personalized blog filtering). Qualitative analysis: examine the most probable words for each topic (could be useful for keyword extraction or trend discovery). We used Gibbs sampling, although inference is also quite doable with variational methods, as in the original LDA paper.
38
Model evaluation: Our data sets
We collected the posts from 40 major political sites in the U.S. Experiments were conducted on data from 5 of the sites. The data set is available on our website. It includes both liberal and conservative blogs, and is also diverse in terms of post frequency, comment volume, and the size of the user pool (see the paper for more statistics). We built the data set ourselves; the sites were chosen from both ends of the political spectrum.
39
Model evaluation: Our data sets
Six models for each site: CommentLDA and LinkLDA, each with three variations on user ID generation — “Verbosity” (-v), “Response” (-r), and “Comment frequency” (-c). Each model was trained with … posts. In the next few slides we show some results from three sites: Matthew Yglesias (MY), The Carpetbagger Report (CB), and Red State (RS).
40
Comment prediction: Procedure
Task: given an unseen blog post, guess who is going to react.
Procedure (for each model):
Fit the model parameters using the training portion of the data
For each post in the test portion, withhold the comments from the data
Infer the topic mixture (θ) for each test post using the fitted model
Compute p(user | topic mixture, model) for each user
Rank users according to their probabilities, using the top N (= 5, 10, 20, and 30) users as the model's “prediction”
Two baselines: a fixed prediction (the most frequent users over ALL posts), and a Naïve Bayes classifier with bag-of-words features.
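The scoring and ranking steps of the procedure can be sketched as follows. Toy numbers throughout; `gamma` stands in for the fitted per-topic user distributions, and the variable names are ours.

```python
import numpy as np

def predict_users(theta, gamma, n=10):
    """Rank users for one test post (minimal sketch).

    theta: inferred topic mixture for the post, shape (K,)
    gamma: per-topic user distributions P(user | topic), shape (K, U)
    Returns the indices of the top-n users by P(user | theta, model).
    """
    p_user = theta @ gamma                     # marginalize over topics
    return list(np.argsort(-p_user)[:n])

def precision_at_n(predicted, actual):
    """Fraction of the predicted users who actually commented."""
    return len(set(predicted) & set(actual)) / len(predicted)

theta = np.array([0.7, 0.2, 0.1])
gamma = np.array([[0.5, 0.3, 0.1, 0.1],       # topic 0's likely commenters
                  [0.1, 0.1, 0.4, 0.4],
                  [0.25, 0.25, 0.25, 0.25]])
top2 = predict_users(theta, gamma, n=2)
print(top2, precision_at_n(top2, actual={0, 2}))
```

With thousands of candidate users per site, filling N spots this way is a hard task, which is why even modest precision can be informative.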
41
Comment prediction: Results
Precision at “10-user prediction”: 20.54% (MY), 16.92% (RS), 32.06% (CB). Modest performance (16% to 32% precision), but it compares favorably with the Naïve Bayes baseline. [Charts: bars, from left to right, are LinkLDA (-v, -r, -c), CommentLDA (-v, -r, -c), and the two baselines (Freq, NB); the best-performing model for each data set is highlighted.] If asked to predict 10 users likely to comment on a given post, we get about 20% right for the MY site with CommentLDA (-r), 17% for RS with LinkLDA (-r), and 32% for CB with LinkLDA (-c). The numbers look modest, but this is a fairly difficult task: for each post, the model must fill 10 spots out of 3,000 to 7,000 possible users. Considering how the Naïve Bayes classifier does, we think topic modeling is actually doing fairly decently.
42
Comment prediction: Results
Precision at 5, 10, 20, and 30 users. CommentLDA performs consistently better for the MY site, while LinkLDA is a much better option for RS. Does our model lack certain flexibility to reflect site differences? [Charts for MY, RS, and CB: each cluster of bars is one model's performance at predicting 5, 10, 20, and 30 users (left to right: LinkLDA -v, -r, -c; CommentLDA -v, -r, -c), with the best performer highlighted; reported precisions include 27.54, 20.54, 14.83, 12.56 for MY (CommentLDA -r/-c), 25.28, 16.92, 12.14, 9.82 for RS (LinkLDA -r/-c), and 37.02, 32.06, 25.19, 21.01 for CB (LinkLDA -r).] Whether a model does well really depends on the site: on MY, the CommentLDA variations performed consistently better at all tasks, while LinkLDA is the better choice for RS and CB. This seems to suggest that our models lack a certain flexibility to reflect some critical differences between the sites.
43
Comment prediction: Results
CommentLDA: Verbosity vs. Response (MY). Variation in user ID generation does make a difference: giving more weight to verbose users does not help for this task. [Charts: Verbosity (blue) vs. Response (orange) at 5- to 30-user prediction, for CommentLDA (MY) and LinkLDA (MY, RS).] Switching from Verbosity to Response buys a good deal of improvement in user prediction. The largest difference is on the 5-user prediction task with the RS data: Verbosity's precision is 14%, but switching to the Response variation gains 10%, reaching up to 25% precision.
44
Model evaluation: Qualitative
2nd task: what words are in the topics? Also: what differences are there between the post and the comments?
Procedure (CommentLDA only):
Fit the model parameters using the training portion
For each topic, examine p(word | topic) in the posts
Rank words according to their probabilities; examine the top 40 words for each topic
Do the same for the comment section
We report here relatively transparent examples from the MY data set.
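The word-ranking step can be sketched as follows. The β matrix and vocabulary here are toy values; in the actual procedure this would be run on the fitted post-side β and comment-side β′, with the top 40 words rather than 3.

```python
import numpy as np

def top_words(beta, vocab, n=5):
    """Most probable words per topic from a fitted word-topic matrix.

    beta: P(word | topic), shape (K, V); vocab: list of V word strings.
    Run once on the post-side matrix and once on the comment-side matrix
    to compare the two vocabularies for the same topic.
    """
    return [[vocab[i] for i in np.argsort(-beta[k])[:n]]
            for k in range(beta.shape[0])]

vocab = ["church", "romney", "huckabee", "faith", "tax", "budget"]
beta = np.array([[0.40, 0.25, 0.20, 0.10, 0.03, 0.02],   # a "religion"-like topic
                 [0.02, 0.03, 0.05, 0.10, 0.40, 0.40]])  # an "economy"-like topic
print(top_words(beta, vocab, n=3))
```

Comparing the two ranked lists side by side is exactly what the next few slides do for the “Religion”, “Primary”, and “Iraq War” topics.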
45
Model evaluation: Qualitative
Post body / Post comments. These are the most likely words for one topic, shown separately for the post body and the comments.
46
Model evaluation: Qualitative
Topic: “Religion” (post body vs. post comments). We think this topic is, rather obviously, religion. Note that these word distributions were discovered quite independently, without supervision.
47
Model evaluation: Qualitative
Topic: “Religion” (post body vs. post comments). We highlighted some words that we think suggest interesting trends unique to this corpus. We know a priori when and where this corpus was written, and we have knowledge of the political affairs of the time, so names like Romney, Huckabee, or Mitt being strongly associated with religiously charged words is not a surprise to us. The point is that our models, without guidance from human knowledge, can discover these latent trends and associations in the text. It is also interesting that the comment section includes more obliquely related words, such as dawkins or wright. Finding related, but not obvious, words unique to this corpus.
48
Model evaluation: Qualitative
Topic: “Primary” (post body vs. post comments). The next topic we labeled “Primary”. If you look at the words closely, you can convince yourself that this is quite obviously about the Democratic primary.
49
Model evaluation: Qualitative
Topic: “Primary” (post body vs. post comments). Here we highlighted some words that bring out a difference in focus between the blogger and his community. Commenters are concerned more with the contest within the party, while the blogger is more prone to use words connected to the upcoming contest against the Republicans. Different focus in the blogger and his community.
50
Model evaluation: Qualitative
Topic: “Iraq War” (post body vs. post comments). This is the topic we labeled “Iraq War”.
51
Model evaluation: Qualitative
Topic: “Iraq War” (post body vs. post comments). Here the blogger uses words suggesting an interest in the strategic aspects of the affair, while the comments tend to care more about tangibles that you can immediately point a finger at, such as bush, weapons, troops, or oil. The blogger discusses strategy; the comments focus on “tangibles”.
52
Summary and future work
Topic modeling is a viable framework for analyzing the text of online political discussions. It is convenient, and it is competitive on tasks that have potential use in real applications: it can help users decide which blogs are personally interesting before actually reading them, or bring out trends in the corpus and textual differences between the blogger and the responses from the community. We also found that not all blog sites are equal. Political blog sites are diverse entities: depending on the site, richer models or simpler models seem to do better, even for the same task, and different ways of approximating user reaction made a big difference.
53
Summary and future work
Topic modeling is a viable framework. We found that not all sites are equal. Future work:
Models that can explain the differences between sites
Asking other questions of the models (besides the two we tried); a joint model such as ours could potentially answer many more
Models using other non-textual data (timestamps, conversational threads) as evidence, which we did not consider in this study
54
References Our published version of this work includes a detailed profile of our data set, as well as more experiments. We presented here only an overview; please refer to the original LDA paper for the complete picture. The Gibbs sampling procedure for LDA is detailed in (Griffiths and Steyvers 2004). The Hierarchical Bayesian Compiler (HBC) is available from:
55
End of presentation. Please let us know what we can do better.