Published by Elfrieda Barnett. Modified over 9 years ago.
1
CSC 594 Topics in AI – Text Mining and Analytics, Fall 2015/16
7. Topic Extraction
2
Text Topics
Word association, represented by concept links, is useful for understanding the relationships between terms (as concepts). The same idea can be applied to understand the association between documents associated with a topic.
3
Problems with “Term as Topic”
Using a single term to define a topic is problematic.
– Lack of expressive power
  Can only represent simple topics
  Cannot represent complicated topics
– Incomplete vocabulary coverage
  Cannot capture variations of vocabulary (e.g. related terms)
– Ambiguous words
  Many words have more than one meaning/sense.
4
Multiple Terms as Topic
A solution is to use multiple terms to define a topic.
– Topic = {word1, word2, ...}
– A weight assigned to each term represents the importance/relevance of the term in the topic.
– Every document in the corpus can be given a score that represents the strength of its association with a topic.
– A document can contain zero, one or many topics.
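The idea of a topic as a set of weighted terms, with each document scored against it, can be sketched in a few lines of Python. This is only an illustration: the topic terms, their weights, and the scoring rule (term weight times relative term frequency) are assumptions chosen for the example, not something the slides specify.

```python
from collections import Counter

# Hypothetical topic: each term carries a weight for its relevance to the topic.
TOPIC = {"ipad": 0.6, "tablet": 0.3, "apple": 0.1}

def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    tokens = (tok.strip(".,!?\"'").lower() for tok in text.split())
    return [tok for tok in tokens if tok]

def topic_score(text, topic):
    """Sum of term weights times the relative frequency of each term (assumed rule)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(weight * counts[term] / len(tokens) for term, weight in topic.items())

if __name__ == "__main__":
    for doc in ["I love iPad.", "Kids love to play soccer."]:
        print(doc, topic_score(doc, TOPIC))
```

A document that mentions none of the topic terms scores zero, which matches the point that a document can contain zero, one or many topics.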
5
Approach (1): Probabilistic Topic Mining
Coursera, Text Mining and Analytics, ChengXiang Zhai
6
Topic as Word Distribution
Coursera, Text Mining and Analytics, ChengXiang Zhai
7
Probabilistic Topic Mining
Coursera, Text Mining and Analytics, ChengXiang Zhai
8
Techniques for Probabilistic Topic Mining
Several techniques have been used in probabilistic topic mining to extract topics.
– Maximum Likelihood
– Bayesian
– Mixture Model (where parameters are typically estimated using the Expectation Maximization (EM) algorithm)
9
Mixture Model for Topic Extraction (1)
Coursera, Text Mining and Analytics, ChengXiang Zhai
10
Mixture Model for Topic Extraction (2)
Coursera, Text Mining and Analytics, ChengXiang Zhai
11
Mixture Model as a Generative Model
Coursera, Text Mining and Analytics, ChengXiang Zhai
12
Mixture of Two Unigram Language Models
Coursera, Text Mining and Analytics, ChengXiang Zhai
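The slide body is not preserved in this transcript, so the following is only the standard textbook form of a mixture of two unigram language models, written out for reference. The notation is an assumption: \theta_d is the topic model, \theta_B the background model, \lambda the probability of choosing the background model, and c(w, d) the count of word w in document d.

```latex
% Probability of generating a single word w from the mixture
p(w) = \lambda \, p(w \mid \theta_B) + (1 - \lambda) \, p(w \mid \theta_d)

% Log-likelihood of a document d over vocabulary V
\log p(d) = \sum_{w \in V} c(w, d) \,
  \log \big[ \lambda \, p(w \mid \theta_B) + (1 - \lambda) \, p(w \mid \theta_d) \big]
```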
16
Expectation-Maximization (EM) Algorithm
Coursera, Text Mining and Analytics, ChengXiang Zhai
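As a rough illustration of how EM can estimate a topic word distribution, the sketch below fits a mixture of one topic model and one background model to a single document. It assumes the background distribution is known and fixed and that the mixing weight `lam` is fixed at 0.5; the function name, parameters, and toy data are hypothetical and are not taken from the slides.

```python
from collections import Counter

def em_two_unigram_mixture(tokens, background, lam=0.5, iterations=50):
    """Estimate p(w | topic) when the background model and lam are assumed fixed.

    tokens: list of words in the document
    background: dict mapping word -> p(w | background)
    lam: probability of choosing the background model for each word
    """
    counts = Counter(tokens)
    vocab = list(counts)
    # Start from a uniform topic distribution over the document vocabulary.
    topic = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(iterations):
        # E-step: posterior probability that each word came from the topic model.
        z = {}
        for w in vocab:
            p_topic = (1.0 - lam) * topic[w]
            p_back = lam * background.get(w, 1e-12)  # tiny floor for unseen words
            z[w] = p_topic / (p_topic + p_back)
        # M-step: re-estimate the topic distribution from the expected counts.
        total = sum(counts[w] * z[w] for w in vocab)
        topic = {w: counts[w] * z[w] / total for w in vocab}
    return topic

if __name__ == "__main__":
    doc = "the the the the text mining text".split()
    bg = {"the": 0.9, "text": 0.05, "mining": 0.05}  # hypothetical background model
    print(em_two_unigram_mixture(doc, bg))
```

In this toy input "the" is the most frequent word, yet the estimated topic distribution gives more probability to "text", because the background model already accounts for most occurrences of "the".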
18
Approach (2): Dimensionality Reduction for Topic Extraction
Reduced dimensions can also be considered topics. Singular Value Decomposition (SVD) derives singular vectors (the SVD dimensions, analogous to principal components), which serve as topics.
D1: “I love iPad.”
D2: “iPad is great for kids.”
D3: “Kids love to play soccer.”
D4: “I play soccer at OSU.”
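To make the SVD idea concrete, the sketch below builds a term-document count matrix from the four example documents and extracts the leading SVD dimension. Several simplifications are assumed: raw counts are used instead of TF*IDF, stop words are kept, and a plain power iteration stands in for a library SVD routine so that the example needs only the standard library. It recovers only the first dimension, not the full decomposition.

```python
import math

DOCS = {
    "D1": "I love iPad.",
    "D2": "iPad is great for kids.",
    "D3": "Kids love to play soccer.",
    "D4": "I play soccer at OSU.",
}

def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    return [t.strip(".,!?").lower() for t in text.split()]

def term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are raw counts."""
    vocab = sorted({t for text in docs.values() for t in tokenize(text)})
    matrix = [[tokenize(text).count(term) for text in docs.values()] for term in vocab]
    return vocab, matrix

def leading_singular_triplet(matrix, iterations=200):
    """Power iteration on A^T A for the top singular value and vectors."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = A v, then v = A^T u, renormalized to unit length.
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

vocab, A = term_document_matrix(DOCS)
sigma, term_weights, doc_weights = leading_singular_triplet(A)
# Terms with the largest coordinates on the first SVD dimension.
top_terms = sorted(zip(vocab, term_weights), key=lambda pair: -pair[1])[:3]

if __name__ == "__main__":
    print("singular value:", sigma)
    print("top terms:", top_terms)
    print("document weights:", dict(zip(DOCS, doc_weights)))
```

Here `term_weights` holds each term's coordinate on the first SVD dimension and `doc_weights` holds each document's coordinate, which is the sense in which a reduced dimension acts as a topic.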
19
Example: Topics extracted by SAS Enterprise Miner for the Yelp data
20
Term topic weight – relevance of the term in the topic
Each term is assigned a weight for each topic. Since each topic is an SVD dimension, the term topic weights for a term are the coordinates of the term in the SVD space. The term cutoff is used to determine whether a term belongs to a topic.
Document topic weight – relevance of the document to the topic
Every document in the corpus is assigned a weight for each topic. The document topic weight of a document towards a topic is the normalized sum of the TF*IDF weights for each term in the document multiplied by their term topic weights. The document cutoff is used to determine whether a document belongs to a topic.
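The document topic weight described above can be sketched as follows. The slide does not say how the sum is normalized, so dividing by the Euclidean norm of the document's TF*IDF vector is an assumption, as are the function names, the example weights, and the cutoff values; this is not presented as SAS Enterprise Miner's actual computation.

```python
import math

def document_topic_weight(tfidf, term_topic_weights):
    """Sum of TF*IDF weight times term topic weight, normalized (assumed L2 norm).

    tfidf: dict mapping term -> TF*IDF weight of the term in one document
    term_topic_weights: dict mapping term -> weight of the term in one topic
    """
    raw = sum(w * term_topic_weights.get(term, 0.0) for term, w in tfidf.items())
    norm = math.sqrt(sum(w * w for w in tfidf.values()))
    return raw / norm if norm else 0.0

def belongs_to_topic(weight, cutoff):
    """Binary membership: the document belongs if its weight reaches the cutoff."""
    return weight >= cutoff

if __name__ == "__main__":
    doc_tfidf = {"ipad": 3.0, "kids": 4.0}        # hypothetical TF*IDF weights
    topic_weights = {"ipad": 0.5, "kids": 0.25}   # hypothetical term topic weights
    weight = document_topic_weight(doc_tfidf, topic_weights)
    print(weight, belongs_to_topic(weight, cutoff=0.4))
```

The same thresholding applies to terms: comparing a term topic weight against the term cutoff decides whether the term belongs to the topic.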
21
Interpretability of Extracted Topics
A topic as a collection of weighted terms provides precise information about the topic. However, some analysts find binary topics, where a term or document either belongs to a topic or does not, easier to understand.