Example: 16,000 documents, 100 topics. Picked the words with large p(w|z).

Given a new document, compute the posterior word-topic allocations, which approximate p(z_n | w). Look at the cases where these values are relatively large. Four topics were found in the new document.
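This inference step can be sketched with scikit-learn's LatentDirichletAllocation. The corpus below is a hypothetical toy collection, not the 16,000-document set from the slides; `transform` returns the inferred per-document topic proportions for an unseen document:

```python
# Minimal sketch: infer topic proportions for an unseen document.
# The four-topic toy corpus is invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "stock market trading price shares",
    "film actor movie screen director",
    "school students education teacher class",
    "court judge law trial lawyer",
] * 25  # repeat so each topic has some mass

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=4, random_state=0)
lda.fit(X)

# Topic proportions of a new document approximate p(z | w);
# large entries indicate the topics the document draws on.
new_doc = vectorizer.transform(["the judge ruled on the market trial"])
theta = lda.transform(new_doc)[0]  # sums to 1 over the 4 topics
print(theta.round(3))
```

Words absent from the training vocabulary are simply ignored by `transform`, which mirrors the bag-of-words setup.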

Unseen document (contd.) Under the bag-of-words assumption, the words of "William Randolph Hearst Foundation" can be assigned to different topics.

Applications and empirical results Document modeling Document classification Collaborative filtering

Document modeling Task: density estimation, i.e. assigning high likelihood to unseen documents. Measure of goodness: perplexity, which decreases monotonically in the held-out likelihood (lower is better).
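Perplexity as used here is the exponentiated negative per-word log-likelihood of the held-out set. A minimal sketch (the numbers are made up, not taken from the paper's experiments):

```python
# Perplexity = exp(- total log-likelihood / total word count).
import math

def perplexity(total_log_likelihood, total_words):
    return math.exp(-total_log_likelihood / total_words)

# Higher held-out likelihood -> lower perplexity (better model).
assert perplexity(-7000.0, 1000) > perplexity(-6500.0, 1000)
print(perplexity(-6907.755, 1000))  # ~1000: one "effective" choice per word out of 1000
```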

The experiment

                       Articles    Terms
Scientific abstracts      5,225   28,414
Newswire articles        16,333   23,075

The experiment (contd.) Preprocessing: removed stop words and words appearing only once. 10% of the data was held out for testing. All models were trained with the same stopping criteria.

Results

Overfitting in mixture of unigrams Peaked posterior on the training set. An unseen document containing an unseen word: that word gets near-zero probability, and so does the whole document. Remedy: smoothing.
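The problem and the remedy can be shown with a toy unigram model; the counts and the vocabulary size below are invented for illustration:

```python
# Why an unseen word is fatal without smoothing, and how add-one
# (Laplace) smoothing fixes it. Toy counts, made up for illustration.
from collections import Counter

train = "the cat sat on the mat".split()
counts = Counter(train)
V = 10  # assumed total vocabulary size, including never-seen words

def p_mle(word):
    # maximum-likelihood estimate: zero for any unseen word
    return counts[word] / sum(counts.values())

def p_laplace(word):
    # add-one smoothing: every word gets a small probability floor
    return (counts[word] + 1) / (sum(counts.values()) + V)

print(p_mle("dog"))      # 0.0 -> any document containing "dog" gets probability 0
print(p_laplace("dog"))  # small but nonzero
```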

Overfitting in pLSI A mixture of topics is allowed; marginalize over d to find p(w). Restricted to the topic proportions seen in the training documents. "Folding in": ignore the p(z|d) parameters and refit p(z|d_new).

LDA Documents can have different proportions of topics. No heuristics needed.

Document classification Generative or discriminative? The choice of features matters in document classification. LDA as a dimensionality-reduction technique: use the per-document topic proportions as features.

The experiment Binary classification. 8,000 documents, 15,818 words. LDA trained without reference to the true class labels; 50 topics. Trained an SVM on the LDA features and compared with an SVM on all word features. LDA reduced the feature space by 99.6%.
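The pipeline can be sketched as follows. The two-class corpus is a trivially separable toy, not the Reuters GRAIN/EARN data from the slides:

```python
# Sketch: reduce word counts to topic proportions with LDA,
# then train an SVM on those low-dimensional features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

# Toy two-class corpus with disjoint vocabularies (invented data).
docs = (["wheat corn grain harvest crop"] * 20
        + ["profit earnings revenue quarter loss"] * 20)
labels = [0] * 20 + [1] * 20

X = CountVectorizer().fit_transform(docs)          # 40 docs x 10 words
theta = LatentDirichletAllocation(
    n_components=2, random_state=0).fit_transform(X)  # 40 docs x 2 topics

clf = SVC().fit(theta, labels)
print(clf.score(theta, labels))  # training accuracy on this easy toy data
```

On real data the interesting comparison is test accuracy of the 50-dimensional LDA features against the full word-count features, as in the slides.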

GRAIN vs NOT GRAIN

EARN vs NOT EARN

LDA in document classification The feature space was reduced while performance improved. Results need further investigation. Could also be used for feature selection.

Collaborative filtering A collection of users and the movies they prefer. Trained on a set of observed users. Task: given an unobserved user and all of their preferred movies but one, predict the held-out movie. Restricted to users who positively rated 100 movies. Trained on 89% of the data.

Some quantities required… Probability of the held-out movie, p(w | w_obs). For mixture of unigrams and pLSI: sum out the topic variable. For LDA: sum out the topic and Dirichlet variables (a quantity that is efficient to compute).
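For LDA, the predictive probability marginalizes both the topic assignment and the topic-proportion variable; written out under the usual LDA notation (with theta the per-user topic proportions):

```latex
p(w \mid \mathbf{w}_{\mathrm{obs}})
  = \int \Big( \sum_{z} p(w \mid z)\, p(z \mid \theta) \Big)\,
    p(\theta \mid \mathbf{w}_{\mathrm{obs}})\, d\theta
```

For the mixture of unigrams and pLSI there is no per-document Dirichlet variable, so only the sum over z remains.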

Results

Further work Other approaches for inference and parameter estimation Embedded in another model Other types of data Partial exchangeability

Example – Visual words Document = image. Words = image features (bars, circles). Topics = face, airplane. Bag of words = no spatial relationships between objects.
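One common way to obtain such "visual words" (used, e.g., in the Sivic et al. work cited below) is to quantize local image descriptors with k-means; each image then becomes a histogram over codewords. A minimal sketch with random stand-in descriptors, not real image features:

```python
# Build a visual-word codebook by k-means quantization of local
# descriptors, then represent an image as a bag of codeword counts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 8))  # stand-in for e.g. SIFT descriptors

codebook = KMeans(n_clusters=5, n_init=10, random_state=0).fit(descriptors)

# An image = the codeword histogram of its descriptors
# (the "bag of visual words"; spatial layout is discarded).
image_desc = rng.normal(size=(30, 8))
words = codebook.predict(image_desc)
hist = np.bincount(words, minlength=5)
print(hist)  # counts over the 5 visual words
```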

Visual words

Identifying the visual words and topics

Conclusion Exchangeability and the De Finetti theorem. The Dirichlet distribution. A generative model with a bag-of-words assumption. The independence assumption in the Dirichlet distribution motivates correlated topic models.

Implementations In C (by one of the authors). In C and Matlab.

References
Latent Dirichlet allocation. D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3, 2003.
Discovering object categories in image collections. J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, W. T. Freeman. MIT AI Lab Memo AIM, February 2005.
Correlated topic models. D. Blei and J. Lafferty. Advances in Neural Information Processing Systems 18, 2005.