2008 © ChengXiang Zhai
Contextual Text Analysis with Probabilistic Topic Models
ChengXiang Zhai
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
Joint work with Qiaozhu Mei

Motivation
Documents are often associated with context (metadata)
  –Direct context: time, location, source, authors, …
  –Indirect context: events, policies, …
Many applications require “contextual text analysis”:
  –Discovering topics from text in a context-sensitive way
  –Analyzing variations of topics over different contexts
  –Revealing interesting patterns (e.g., topic evolution, topic variations, topic communities)

Example 1: Comparing News Articles
What’s in common? What’s unique?
Contexts to compare:
  –Vietnam War, Afghan War, Iraq War
  –CNN, Fox, Blog
  –Before 9/11, During Iraq war, Current
  –US blog, European blog, Others

  Common Themes      “Vietnam” specific   “Afghan” specific   “Iraq” specific
  United nations     …                    …                   …
  Death of people    …                    …                   …
  …                  …                    …                   …

More Contextual Analysis Questions
What positive/negative things did people say about X (e.g., a person, an event)? Are there trends?
How does an opinion/topic evolve over time?
What are emerging topics? What topics are fading away?
How can we characterize a social network?

Research Questions
Can we model all these problems generally?
Can we solve these problems with a unified approach?
How can we bring humans into the loop?

Contextual Probabilistic Latent Semantic Analysis
Document context: Time = July 2005; Location = Texas; Author = xxx; Occup. = Sociologist; Age Group = 45+; …
Themes:
  –government: government 0.3, response, …
  –donation: donate 0.1, relief 0.05, help, …
  –New Orleans: city 0.2, new 0.1, orleans, …
Views (View1, View2, View3): Texas, July 2005, sociologist
Theme coverages: Texas, July 2005, document, …
Generating each word:
  –Choose a view
  –Choose a coverage
  –Choose a theme
  –Draw a word from θ_i
Example words drawn: government, donate, new, response, aid, help, Orleans
Example text:
  –Criticism of government response to the hurricane primarily consisted of criticism of its response to …
  –The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production …
  –Over seventy countries pledged monetary donations or other assistance. …

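The four generative steps on this slide (choose a view, choose a coverage, choose a theme, draw a word) can be sketched as a small sampler. This is a minimal illustration, not the authors' implementation: the names `THEMES_BY_VIEW`, `COVERAGES`, `draw`, and `generate_word` are hypothetical, and all probabilities are toy values chosen so that each distribution sums to 1, not values taken from the slide.

```python
import random

# Hypothetical toy parameters (assumed for illustration only).
# view -> theme -> word distribution: each view has its own version of a theme.
THEMES_BY_VIEW = {
    "texas": {
        "government": {"government": 0.5, "response": 0.5},
        "donation": {"donate": 0.5, "relief": 0.25, "help": 0.25},
    },
    "july_2005": {
        "government": {"government": 0.75, "response": 0.25},
        "donation": {"donate": 0.25, "relief": 0.5, "help": 0.25},
    },
}

# coverage -> distribution over themes: how much each context talks about each theme.
COVERAGES = {
    "texas": {"government": 0.75, "donation": 0.25},
    "document": {"government": 0.5, "donation": 0.5},
}


def draw(dist, rng):
    """Sample one key from a {key: probability} dictionary."""
    keys = list(dist)
    weights = [dist[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate_word(view_probs, coverage_probs, rng):
    """Generate one word following the four steps on the slide."""
    view = draw(view_probs, rng)                    # 1. choose a view
    coverage = draw(coverage_probs, rng)            # 2. choose a coverage
    theme = draw(COVERAGES[coverage], rng)          # 3. choose a theme
    word = draw(THEMES_BY_VIEW[view][theme], rng)   # 4. draw a word from the theme
    return view, coverage, theme, word


if __name__ == "__main__":
    rng = random.Random(0)
    view_probs = {"texas": 0.5, "july_2005": 0.5}
    coverage_probs = {"texas": 0.5, "document": 0.5}
    for _ in range(5):
        print(generate_word(view_probs, coverage_probs, rng))
```

In this sketch the view decides which version of a theme's word distribution is used, while the coverage decides how likely each theme is. That separation is what lets the same theme be compared across contexts.
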
Comparing News Articles
Iraq War (30 articles) vs. Afghan War (26 articles)
Common Theme
  –Cluster 1: united nations 0.04 …
  –Cluster 2: killed, month, deaths, …
  –Cluster 3: …
Iraq Theme
  –Cluster 1: n 0.03, Weapons, Inspections, …
  –Cluster 2: troops, hoon, sanches, …
  –Cluster 3: …
Afghan Theme
  –Cluster 1: Northern 0.04, alliance 0.04, kabul 0.03, taleban, aid 0.02, …
  –Cluster 2: taleban, rumsfeld 0.02, hotel, front, …
  –Cluster 3: …
The common theme indicates that “United Nations” is involved in both wars
Collection-specific themes indicate different roles of “United Nations” in the two wars

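One way to read the split between a common theme and collection-specific themes is as a two-component mixture: a word in an article from one collection comes either from the shared distribution or from that collection's own distribution. The slide does not give the model's formula, so the sketch below is an assumption, with a hypothetical function `mix_theme`, a hypothetical mixing weight `lam_common`, and toy probabilities.

```python
def mix_theme(common, specific, lam_common):
    """Mix a common and a collection-specific word distribution.

    Returns p(w) = lam_common * common(w) + (1 - lam_common) * specific(w)
    over the union of both vocabularies.
    """
    if not 0.0 <= lam_common <= 1.0:
        raise ValueError("lam_common must be in [0, 1]")
    vocab = set(common) | set(specific)
    return {
        w: lam_common * common.get(w, 0.0)
        + (1.0 - lam_common) * specific.get(w, 0.0)
        for w in vocab
    }


if __name__ == "__main__":
    # Toy distributions (assumed values, not the numbers on the slide).
    common = {"united": 0.5, "nations": 0.5}
    iraq_specific = {"weapons": 0.5, "inspections": 0.5}
    for word, prob in sorted(mix_theme(common, iraq_specific, 0.25).items()):
        print(word, prob)
```

A larger `lam_common` pushes more probability mass into the shared theme, so fewer words are left to be explained by the collection-specific one.
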
Spatiotemporal Patterns in Blog Articles
Query = “Hurricane Katrina”
Topics in the results
Spatiotemporal patterns

Theme Life Cycles (“Hurricane Katrina”)
  –New Orleans: city, orleans, new, louisiana, flood, evacuate, storm, …
  –Oil Price: price, oil, gas, increase, product, fuel, company, …

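A theme life cycle is a curve of theme strength over time. The slide does not say how the curves are computed. One simple possibility, assumed here, is to average each document's coverage of a theme within each time period. The function name `theme_life_cycle` and the input layout (a list of `(period, coverage)` pairs) are hypothetical, and the numbers are toy values.

```python
from collections import defaultdict


def theme_life_cycle(docs, theme):
    """Average coverage of `theme` per time period.

    `docs` is a list of (period, coverage) pairs, where `coverage` maps
    theme names to the fraction of the document devoted to each theme.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for period, coverage in docs:
        totals[period] += coverage.get(theme, 0.0)
        counts[period] += 1
    # Sort periods so the result reads as a time series.
    return {period: totals[period] / counts[period] for period in sorted(totals)}


if __name__ == "__main__":
    # Toy data (assumed): two documents in week 1, one in week 2.
    docs = [
        (1, {"Oil Price": 0.5, "New Orleans": 0.5}),
        (1, {"Oil Price": 0.25, "New Orleans": 0.75}),
        (2, {"New Orleans": 1.0}),
    ]
    print(theme_life_cycle(docs, "Oil Price"))
    print(theme_life_cycle(docs, "New Orleans"))
```
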
Theme Snapshots (“Hurricane Katrina”)
Week1: The theme is the strongest along the Gulf of Mexico
Week2: The discussion moves towards the north and west
Week3: The theme distributes more uniformly over the states
Week4: The theme is again strong along the east coast and the Gulf of Mexico
Week5: The theme fades out in most states

Theme Life Cycles (KDD Papers)
  –gene, expressions, probability, microarray, …
  –marketing, customer, model, business, …
  –rules, association, support, …

Theme Evolution Graph: KDD
Themes along the time axis T (1999, …):
  –SVM, criteria, classification, linear, …
  –decision, tree, classifier, class, Bayes, …
  –classification, text, unlabeled, document, labeled, learning, …
  –information, web, social, retrieval, distance, networks, …
  –web, classification, features 0.006, topic, …
  –mixture, random, cluster, clustering, variables, …
  –topic, mixture, LDA, semantic, …
  –…

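A theme evolution graph connects a theme in one time span to similar themes in a later span. The slide does not state the linking criterion. The sketch below assumes one plausible choice: add an edge when the KL divergence from the later theme to the earlier theme falls below a threshold. The function names, the smoothing constant `eps`, the threshold, and the toy word distributions are all hypothetical.

```python
import math


def smooth(dist, vocab, eps):
    """Add `eps` to every word in `vocab` and renormalize."""
    total = sum(dist.get(w, 0.0) + eps for w in vocab)
    return {w: (dist.get(w, 0.0) + eps) / total for w in vocab}


def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) over the joint vocabulary, with additive smoothing."""
    vocab = sorted(set(p) | set(q))
    ps, qs = smooth(p, vocab, eps), smooth(q, vocab, eps)
    return sum(ps[w] * math.log(ps[w] / qs[w]) for w in vocab)


def evolution_edges(earlier, later, threshold):
    """Link each earlier theme to every later theme that is close to it."""
    edges = []
    for name_a, dist_a in earlier.items():
        for name_b, dist_b in later.items():
            if kl_divergence(dist_b, dist_a) < threshold:
                edges.append((name_a, name_b))
    return edges


if __name__ == "__main__":
    # Toy themes (assumed values) for two adjacent time spans.
    earlier = {
        "svm": {"svm": 0.5, "classification": 0.5},
        "web": {"web": 0.5, "retrieval": 0.5},
    }
    later = {
        "text_classification": {"classification": 0.5, "text": 0.5},
        "web_social": {"web": 0.5, "social": 0.5},
    }
    print(evolution_edges(earlier, later, threshold=10.0))
```

With these toy values, themes that share a word are linked and themes with disjoint vocabularies are not. In practice the threshold would need tuning against real theme distributions.
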
Multi-Faceted Sentiment Summary (query=“Da Vinci Code”)
Facet 1: Movie
  –Neutral:
      ... Ron Howards selection of Tom Hanks to play Robert Langdon.
      Directed by: Ron Howard Writing credits: Akiva Goldsman ...
      After watching the movie I went online and some research on ...
  –Positive:
      Tom Hanks stars in the movie, who can be mad at that?
      Tom Hanks, who is my favorite movie star act the leading role.
      Anybody is interested in it?
  –Negative:
      But the movie might get delayed, and even killed off if he loses.
      protesting ... will lose your faith by ... watching the movie.
      ... so sick of people making such a big deal about a FICTION book and movie.
Facet 2: Book
  –Neutral:
      I remembered when i first read the book, I finished the book in two days.
      I’m reading “Da Vinci Code” now. …
  –Positive:
      Awesome book.
      So still a good book to past time.
  –Negative:
      ... so sick of people making such a big deal about a FICTION book and movie.
      This controversy book cause lots conflict in west society.

Separate Theme Sentiment Dynamics
Themes shown: “book”, “religious beliefs”

Event Impact Analysis: IR Research
Theme: retrieval models (SIGIR papers)
  –term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, …
Events:
  –1992: Start of the TREC conferences
  –1998: Publication of the paper “A language modeling approach to information retrieval”
Theme variations around the events:
  –vector, concept, extend, model, space, boolean, function, feedback, …
  –xml, model, collect, judgment, rank, subtopic, …
  –probabilist, model, logic, ir, boolean, algebra, estimate, weight, …
  –model, language, estimate, parameter, distribution, probable, smooth, markov, likelihood, …

Topic Modeling + Social Networks
Authors writing about the same topic form a community
Compared: Topic Model Only vs. Topic Model + Social Network
Separation of 3 research communities: IR, ML, Web

On-Going Work
Combining contextual text analysis with visualization
More detailed semantic modeling (entities, relations, …)
Integration of search and contextual text analysis to develop an analyst’s workbench:
  –Interactive semantic navigation and probing
  –Synthesis of information/knowledge
  –Personalized/customized service

The End
Thank You!