
1 Example: 16,000 documents, a 100-topic LDA model. Picked the words with large p(w|z) for each topic.

2

3 Given a new document, compute the approximate posterior; the number of words allocated to each topic approximates p(z_n|w). Look at the cases where these values are relatively large: 4 topics found for the new document.
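
In the paper's variational scheme (notation assumed from earlier slides), the per-word multinomial parameters φ play this role: φ_ni approximates the posterior probability that word n came from topic i, and summing over the document's words gives the expected allocation per topic:

    \phi_{ni} \approx p(z_n = i \mid \mathbf{w}), \qquad
    \mathbb{E}[\text{words allocated to topic } i] \approx \sum_{n=1}^{N} \phi_{ni}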

4 Unseen document (contd.) Because of the bag-of-words assumption, the words of "William Randolph Hearst Foundation" are assigned to different topics.

5 Applications and empirical results: document modeling, document classification, collaborative filtering.

6 Document modeling Task: density estimation, i.e. assigning high likelihood to unseen documents. Measure of goodness: perplexity, which decreases monotonically in the test-set likelihood (lower is better).
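
For reference, the perplexity used in the paper, for a test set of M documents where document d has N_d words:

    \mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right\}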

7 The experiment

    Corpus                 Articles   Terms
    Scientific abstracts    5,225     28,414
    Newswire articles      16,333     23,075

8 The experiment (contd.) Preprocessing: removed stop words and words appearing only once. 10% of the data held out for testing (models trained on the remaining 90%). All models trained with the same stopping criteria.

9 Results

10 Overfitting in the mixture of unigrams The posterior is peaked on the training set; an unseen document containing an unseen word is assigned a very small probability. Remedy: smoothing.
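
For instance, additive (Laplace) smoothing keeps unseen words from receiving zero probability (the paper itself instead smooths by placing a Dirichlet prior on the topic-word parameters):

    \hat{p}(w \mid z) = \frac{n_{w,z} + 1}{n_z + V}

where n_{w,z} is the count of word w under topic z, n_z is the total word count for topic z, and V is the vocabulary size.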

11 Overfitting in pLSI A mixture of topics per document is allowed, but p(w) is found by marginalizing over the training documents d, so a new document is restricted to the topic proportions already seen in training. "Folding in": ignore the p(z|d) parameters and refit p(z|d_new).
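
For reference, the pLSI model ties topic proportions to the training-document index d, which is exactly what creates this restriction:

    p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)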

12 LDA Documents can have their own proportions of topics; no folding-in heuristics are needed.
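
A minimal sketch of this point, using scikit-learn's LatentDirichletAllocation rather than the authors' implementation (the corpus and query below are toy assumptions): topic proportions for an unseen document are inferred directly, with no folding-in step.

    # Sketch: inferring topic proportions for an unseen document with LDA.
    # Assumes scikit-learn is available; corpus and query are toy examples.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    train_docs = [
        "stock market trading prices shares",
        "genome dna sequencing biology genes",
        "election votes campaign president policy",
    ]
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_docs)

    # Fit a small LDA model (3 topics here; the slides use 100).
    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    lda.fit(X_train)

    # An unseen document: transform() infers its own topic proportions,
    # with no restriction to the mixtures seen in training.
    X_new = vectorizer.transform(["dna sequencing of election candidates"])
    print(lda.transform(X_new))  # row sums to 1: per-document topic proportions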

13

14 Document classification Generative or discriminative? The choice of features matters in document classification. LDA as a dimensionality reduction technique: use the per-document posterior Dirichlet parameters γ(w) as LDA features.

15 The experiment Binary classification on 8,000 documents with 15,818 word features. A 50-topic LDA model was estimated without using the true class labels. Trained an SVM on the LDA features and compared it with an SVM on all word features; LDA reduced the feature space by 99.6%.
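
A sketch of this pipeline under stated assumptions: scikit-learn in place of whatever tooling the authors used, a hypothetical load_labeled_docs() standing in for the Reuters corpus, and LinearSVC as the SVM.

    # Sketch of the classification experiment: SVM on LDA features vs. SVM
    # on all word features. Library choices and the data loader are
    # assumptions; the slides used 8,000 documents and 50 topics.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    docs, labels = load_labeled_docs()  # hypothetical loader for the corpus

    X_words = CountVectorizer().fit_transform(docs)

    # Reduce to 50 LDA topic-proportion features (a ~99.6% reduction
    # relative to the 15,818-word feature space on the slide).
    lda = LatentDirichletAllocation(n_components=50, random_state=0)
    X_lda = lda.fit_transform(X_words)  # class labels are never used here

    print("SVM on word features:", cross_val_score(LinearSVC(), X_words, labels).mean())
    print("SVM on LDA features: ", cross_val_score(LinearSVC(), X_lda, labels).mean())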

16 GRAIN vs NOT GRAIN

17 EARN vs NOT EARN

18 LDA in document classification The feature space is reduced while classification performance improves; the results need further investigation. LDA may also be useful for feature selection.

19 Collaborative filtering Data: a collection of users and the movies they prefer. Models are trained on fully observed users. Task: given an unobserved user and all of the movies they preferred except one, predict the held-out movie. Restricted to users who positively rated 100 movies; trained on 89% of the data.

20 Some quantities required… Probability of the held-out movie, p(w|w_obs) –For mixture of unigrams and pLSI: sum out the topic variable –For LDA: sum out the topic and Dirichlet variables (a quantity that is efficient to compute)
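
For LDA this predictive probability is (with the posterior over θ given the observed movies handled approximately, as in the paper's variational approach):

    p(w \mid \mathbf{w}_{\mathrm{obs}}) = \int \sum_{z} p(w \mid z)\, p(z \mid \theta)\, p(\theta \mid \mathbf{w}_{\mathrm{obs}})\, d\theta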

21 Results

22 Further work Other approaches to inference and parameter estimation; embedding LDA in another model; other types of data; partial exchangeability.

23 Example – Visual words Document = image. Words = image features (bars, circles). Topics = object categories (face, airplane). Bag of words = no spatial relationships between objects.
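
A minimal sketch of how such visual words are typically constructed (an assumption: Sivic et al. quantize local image descriptors, which are faked here with random vectors). Cluster the descriptors with k-means, then histogram each image's cluster assignments.

    # Sketch: building a "visual vocabulary" by quantizing local image
    # descriptors with k-means, then representing each image as a bag of
    # visual words. Descriptors are random stand-ins for real features
    # (e.g. the bars/circles mentioned on the slide).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    descriptors_per_image = [rng.normal(size=(200, 128)) for _ in range(10)]

    # Cluster all descriptors: each cluster centre is one "visual word".
    kmeans = KMeans(n_clusters=50, n_init=10, random_state=0)
    kmeans.fit(np.vstack(descriptors_per_image))

    # Bag of visual words: histogram of word assignments per image;
    # all spatial relationships between features are discarded.
    bags = np.stack([
        np.bincount(kmeans.predict(d), minlength=50)
        for d in descriptors_per_image
    ])
    print(bags.shape)  # (10 images, 50 visual words): ready for LDA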

24 Visual words

25 Identifying the visual words and topics

26 Conclusion Exchangeability and the De Finetti theorem; the Dirichlet distribution; a generative bag-of-words model. The independence assumption in the Dirichlet distribution motivates correlated topic models.

27 Implementations
In C (by one of the authors): http://www.cs.princeton.edu/~blei/lda-c/
In C and Matlab: http://chasen.org/~daiti-m/dist/lda/

28 References
Latent Dirichlet allocation. D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993-1022, 2003.
Discovering object categories in image collections. J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. MIT AI Lab Memo AIM-2005-005, February 2005.
Correlated topic models. D. Blei and J. Lafferty. Advances in Neural Information Processing Systems 18, 2005.

