KDD 2011 Summary of Text Mining Sessions
Hongbo Deng
3 Text Mining Sessions, 9 Papers
– Beyond Keyword Search: Discovering Relevant Scientific Literature – Khalid El-Arini (Carnegie Mellon University), Carlos Guestrin
– Collaborative Topic Modeling for Recommending Scientific Articles – Chong Wang (Princeton University), David M. Blei
– Partially Labeled Topic Models for Interpretable Text Mining – Daniel Ramage (Stanford University), Christopher D. Manning, Susan Dumais
– Refining Causality: Who Copied from Whom? – Tristan Snowsill (University of Bristol), Nick Fyson, Tijl De Bie, Nello Cristianini
– Conditional Topical Coding: An Efficient Topic Model Conditioned on Rich Features – Jun Zhu (Carnegie Mellon University), Ni Lao, Ning Chen, Eric P. Xing
– Tracking Trends: Incorporating Term Volume into Temporal Topic Models – Liangjie Hong (Lehigh University), Dawei Yin, Jian Guo, Brian D. Davison
– Latent Topic Feedback for Information Retrieval – David Andrzejewski (Lawrence Livermore National Laboratory), David Buttler
– Localized Factor Models for Multi-Context Recommendation – Deepak Agarwal (Yahoo! Labs), Bee-Chung Chen, Bo Long
– Latent Aspect Rating Analysis without Aspect Keyword Supervision – Hongning Wang (University of Illinois at Urbana-Champaign), Yue Lu, ChengXiang Zhai
Recurring themes: topic models and recommendation
Topic models are also widely used in other sessions, e.g., user modeling, query log analysis, ad …
Collaborative Topic Modeling for Recommending Scientific Articles
Problem:
– Recommend scientific articles to users of an online community
Input:
– Users' libraries from CiteULike
– The content of the articles
Output:
– Articles relevant to each user's interests
Three traditional ways
– Follow citations in other articles they are interested in
– Keyword search
– Recommendation methods (CiteULike)
Several criteria
– Recommending older articles is important
– Recommending new articles is also important
– Exploratory variables can be valuable in online scientific archives and communities
Approach: Collaborative Filtering + Topic Modeling
Collaborative Topic Modeling for Recommending Scientific Articles
Two types of data
– Other users' libraries [collaborative filtering]
  Like latent factor models, use information from other users' libraries
  For a particular user, articles can be recommended from other users who liked similar articles
  Latent factor models work well for recommending known articles, but cannot generalize to previously unseen articles
– The content of the articles [topic modeling]
  To generalize to unseen articles, the authors use topic modeling
  It can recommend articles whose content is similar to other articles the user likes
Collaborative Topic Modeling for Recommending Scientific Articles
Intuition: combine collaborative filtering and probabilistic topic modeling for recommending scientific articles
The key property of CTR lies in how the item latent vector $v_j$ is generated: $v_j$ is assumed to be close to the topic proportions $\theta_j$, but can diverge from them when it has to (see the sketch below)
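A minimal sketch of this assumption, written as a Gaussian-offset formulation (the symbols $\epsilon_j$, $u_i$, $r_{ij}$, $\lambda_v$, $I_K$ are notation introduced here, not taken from the summary): draw an item offset $\epsilon_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K)$ and set $v_j = \theta_j + \epsilon_j$; a user's preference is then predicted as $r_{ij} \approx u_i^{\top} v_j$. The offset $\epsilon_j$ keeps $v_j$ near the content-based $\theta_j$ unless the usage data pulls it away, which is what lets CTR recommend both unseen articles (content) and known articles (collaborative signal).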
Latent Topic Feedback for Information Retrieval
Problem: a user navigating an unfamiliar corpus of text documents where document metadata is limited or unavailable
Intuition: augment keyword search with user feedback on latent topics
Key point: a new method for obtaining and exploiting user feedback at the latent topic level
Latent Topic Feedback for Information Retrieval
Method:
– Learn latent topics from the corpus and construct meaningful representations of these topics
– At query time, decide which latent topics are potentially relevant and present the appropriate topic representations alongside keyword search results
– When a user selects a latent topic, the vocabulary terms most strongly associated with that topic are used to augment the original query (see the sketch below)
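A minimal sketch of that feedback loop, assuming an LDA model fitted with scikit-learn; the relevance heuristic (aggregate topic mass over the current results) and the parameter values are illustrative choices, not the paper's exact method.

```python
# Sketch of topic-level feedback for query expansion.
# Assumes a scikit-learn LDA model; the scoring heuristic and
# parameters are illustrative, not the paper's exact method.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def candidate_topics(lda, vectorizer, result_docs, top_k=3):
    """Score topics against the current result set and return the strongest ones."""
    doc_topic = lda.transform(vectorizer.transform(result_docs))
    topic_scores = doc_topic.sum(axis=0)          # aggregate topic mass over the results
    return np.argsort(topic_scores)[::-1][:top_k]

def topic_terms(lda, vectorizer, topic_id, n_terms=10):
    """Vocabulary terms most strongly associated with one latent topic."""
    vocab = np.array(vectorizer.get_feature_names_out())
    top = np.argsort(lda.components_[topic_id])[::-1][:n_terms]
    return vocab[top].tolist()

def augment_query(original_query, lda, vectorizer, selected_topic, n_terms=5):
    """After the user selects a topic, append its strongest terms to the query."""
    return original_query + " " + " ".join(topic_terms(lda, vectorizer, selected_topic, n_terms))

if __name__ == "__main__":
    corpus = ["topic models for text mining",
              "keyword search in large corpora",
              "collaborative filtering recommends articles",
              "latent topics can aid document retrieval"]
    vectorizer = CountVectorizer()
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(vectorizer.fit_transform(corpus))
    topics = candidate_topics(lda, vectorizer, corpus[:2])
    print(augment_query("text mining", lda, vectorizer, int(topics[0])))
```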
Beyond Keyword Search: Discovering Relevant Scientific Literature
Problem: as the number of publications grows, it becomes difficult for scientists to find relevant prior work for their particular research
Input: a set of papers as a query
Output: a set of highly relevant articles
Method:
– Model scientific influence between documents by optimizing an objective function
– Select a set of papers A with maximum influence to/from the query set Q (a greedy-selection sketch follows this list)
– Incorporate trust and personalization: as scientists trust some authors more than others, results can be personalized to individual preferences
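A minimal greedy-selection sketch, assuming the influence objective behaves like a monotone submodular set function so that greedy selection carries the usual approximation guarantee; `influence(A, Q)` is a hypothetical stand-in for the paper's influence model, not its actual API.

```python
# Greedy selection of a result set A maximizing influence with respect to
# the query set Q. Generic submodular-maximization sketch; influence(A, Q)
# is a hypothetical stand-in for the paper's influence model.
def greedy_select(candidates, Q, influence, budget=10):
    A = []
    for _ in range(budget):
        best_doc, best_gain = None, 0.0
        current = influence(A, Q)
        for doc in candidates:
            if doc in A:
                continue
            gain = influence(A + [doc], Q) - current   # marginal gain of adding doc
            if gain > best_gain:
                best_doc, best_gain = doc, gain
        if best_doc is None:                           # no positive marginal gain left
            break
        A.append(best_doc)
    return A
```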
Partially Labeled Topic Models for Interpretable Text Mining
Problem: make use of the unsupervised learning of topic modeling, with constraints that align some learned topics with human-provided labels
Input: a collection of documents, partial labels
Graphical model for PLDA
– Observed: each document's words $w$ and labels $\Lambda$
– Latent: per-doc label distribution $\psi$, per-doc-label topic distribution $\theta$, per-topic word distribution $\Phi$
Output: $\theta$, $\Phi$, $\psi$
Extends the generative story of LDA to incorporate labels, and of Labeled LDA to incorporate per-label latent topics (a sketch of the generative story follows)
A topic is a multinomial distribution over the vocabulary $V$: words that tend to co-occur with each other and with some label
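A rough sketch of that generative story, using the symbols above (the Dirichlet priors and per-word variables are notation introduced here, not spelled out in this summary):
– For each document $d$ with observed label set $\Lambda_d$: draw a per-doc label distribution $\psi_d$ over $\Lambda_d$, and for each label $l \in \Lambda_d$ a per-doc-label topic distribution $\theta_{d,l}$ over that label's latent topics
– For each word position in $d$: draw a label $l \sim \psi_d$, a topic $z \sim \theta_{d,l}$, and the word $w \sim \Phi_z$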
Latent Aspect Rating Analysis without Aspect Keyword Supervision
[Figure: the two-step LARA pipeline – reviews with overall ratings are first split into aspect segments (Aspect Segmentation, e.g., location/friendliness/room terms), then Latent Rating Regression infers term weights, aspect ratings, and aspect weights; the gap between the two steps is the issue being addressed]
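A rough sketch of the latent rating regression step (the symbols $\alpha$, $s$, $\beta$, $c$ are notation introduced here for aspect weights, aspect ratings, term sentiment weights, and term counts):
– Overall rating of review $d$: $r_d \approx \sum_i \alpha_{d,i} \, s_{d,i}$, an aspect-weighted sum of latent aspect ratings
– Aspect rating: $s_{d,i} = \sum_{w} \beta_{i,w} \, c_{d,i,w}$, term sentiment weights applied to term counts in aspect segment $i$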
Latent Aspect Rating Analysis without Aspect Keyword Supervision
LARAM: jointly models aspects and aspect ratings/weights
LRR (Wang et al., 2010): relies on aspect segments produced in a previous, separate step
Some Observations
Text mining is very hot
Topic modeling has been widely used in text analysis and many other applications, e.g., query understanding, advertisement …
– Combine topic modeling with other models, e.g., collaborative filtering
– Integrate more information into topic modeling, e.g., labeled and unlabeled information (partially labeled)
– Move from two-step solutions toward unified models
Thanks!