Probabilistic Topic Models
GOGS Study Group. Thaleia Ntiniakou, Intelligence TUC
Introduction Topic models are algorithms that discover the main themes pervading a large, unorganized collection of documents. They can organize the collection according to the discovered themes, without prior knowledge such as keywords or tags. Probabilistic topic models (Blei et al. 2003) have revolutionized text-mining methods. PTMs also have wide usage beyond text mining, e.g. in genetics, video clustering and other areas we will refer to later on.
Prior to LDA, latent variable methods (1)
Tf-idf scheme: the first method proposed for text corpora. Each document in the collection is reduced to a vector of real numbers, each of which represents a ratio of counts. For each document, a count of the number of occurrences of each word is formed (term frequency). This tf is then weighted by an inverse document frequency (idf) term, based on the number of documents in the entire collection that contain the word. Con: reveals little about the inter-document statistical structure.
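To make the weighting concrete, here is a minimal sketch of the tf-idf scheme described above, assuming a toy whitespace-tokenized corpus; the documents, the helper name tfidf and the log-based idf weighting are illustrative choices, not part of the original slides.

```python
# Minimal tf-idf sketch (illustrative; real systems usually use a library
# such as scikit-learn's TfidfVectorizer).
import math
from collections import Counter

docs = [
    "topic models discover themes in documents",
    "documents are mixtures of topics",
    "words are drawn from topics",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)
    # weight = term frequency * log(inverse document frequency)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for doc in tokenized:
    print(tfidf(doc))
```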
Prior to LDA, latent variable methods (2)
LSI (Latent Semantic Indexing), proposed by Deerwester et al., 1990, applies singular value decomposition to the term-document matrix in order to identify a linear subspace in the space of tf-idf features that captures most of the variance in the corpus.
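A minimal sketch of the LSI idea, assuming a random placeholder terms-by-documents matrix X in place of a real tf-idf matrix; the choice of k = 10 latent dimensions is arbitrary.

```python
# Minimal LSI sketch: truncated SVD of a (terms x documents) tf-idf matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 50))        # 1000 terms, 50 documents (placeholder data)
k = 10                            # number of latent dimensions to keep

U, s, Vt = np.linalg.svd(X, full_matrices=False)
# Keep only the k largest singular values/vectors: the rank-k subspace
# that captures most of the variance in the tf-idf features.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
doc_embeddings = (np.diag(s[:k]) @ Vt[:k, :]).T   # k-dimensional document representations
```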
Prior to LDA, latent variable methods (3)
PLSI (Probabilistic Latent Semantic Indexing) models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics". Each document is thus described as a mixture of topics. Cons: there is no probabilistic model at the level of documents; the number of parameters grows as the collection gets bigger; and it is unclear how to handle a document outside of the training set. T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
Intuition Treat data as observations that arise from a generative probabilistic process that includes latent variables; for documents, the hidden variables are the topics. Infer the hidden structure via posterior inference: what are the topics that describe this corpus? Situate new data within the estimated model. What is the generative probabilistic process in our case? Each document is a random mixture of corpus-wide topics, and each word is drawn from one of those topics. Blei DM, Ng AY, Jordan MI: Latent Dirichlet Allocation. J Mach Learn Res 2003, 3. Blei DM: Probabilistic Topic Models. Communications of the ACM 2012, 55(4):77-84.
Latent Dirichlet Allocation
The intuitions behind latent Dirichlet allocation. Assume some number of "topics," which are distributions over words, exist for the whole collection (far left). First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative; they are not fit from real data.
Notation A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, ..., V}. We represent words using unit-basis vectors that have a single component equal to one and all other components equal to zero. Thus, using superscripts to denote components, the v-th word in the vocabulary is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v. A document is a sequence of N words denoted by w = (w_1, w_2, ..., w_N), where w_n is the n-th word in the sequence. A corpus is a collection of M documents denoted by D = {w_1, w_2, ..., w_M}.
Probabilistic Graphical Models
Nodes are random variables. Edges denote possible dependence. Observed variables are shaded. Plates denote replicated structure.
Probabilistic Graphical Models (2)
Structure of the graph defines the pattern of conditional dependence between the ensemble of random variables.
LDA as a probabilistic graphical model
PLSI vs. LDA PLSI graphical model (right). LDA graphical model (left).
Generative process
Draw each topic β_i ~ Dir(η), for i ∈ {1, ..., K}.
For each document d:
Draw topic proportions θ_d ~ Dir(α).
For each word n:
Draw a topic assignment Z_{d,n} ~ Mult(θ_d).
Draw the word W_{d,n} ~ Mult(β_{Z_{d,n}}).
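The generative process above can be written almost line for line in code. The following sketch samples a toy corpus with numpy; the corpus sizes, vocabulary size and hyperparameter values are illustrative assumptions.

```python
# A minimal sketch of LDA's generative process using numpy.
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 5, 1000, 20, 100     # topics, vocabulary size, documents, words per doc
eta, alpha = 0.01, 0.1            # Dirichlet hyperparameters

# Draw each topic beta_k ~ Dir(eta): a distribution over the vocabulary.
beta = rng.dirichlet(np.full(V, eta), size=K)

corpus = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))      # per-document topic proportions
    z = rng.choice(K, size=N, p=theta_d)            # per-word topic assignments
    words = np.array([rng.choice(V, p=beta[k]) for k in z])  # draw each word from its topic
    corpus.append(words)
```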
Generative Process (2) From a collection of documents, infer:
per-word topic assignments z_{d,n};
per-document topic proportions θ_d;
per-corpus topic distributions β_k.
Use posterior expectations to perform the task at hand, e.g. information retrieval, document similarity, etc.
In order to perform inference we need to compute the posterior distribution of the hidden variables given the documents:
p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
Because of the coupling between θ and β, this probability is intractable: the denominator is the probability of seeing the observed corpus under any setting of the topic model. Topic modeling's computational task is to approximate this posterior. How do we approximate it?
Inference There are two categories of algorithms we can use to approximate the posterior. Sampling-based algorithms: algorithms that collect samples from the posterior and approximate it with an empirical distribution (e.g. Gibbs sampling). Variational algorithms: deterministic algorithms that posit a parameterized family of distributions over the hidden structure and then find the member of the family closest to the posterior (e.g. mean-field variational inference, online LDA).
Gibbs Sampling Goal: approximate the posterior with an empirical distribution of samples. Idea: generate posterior samples by sweeping through each variable (or block of variables) and sampling it from its conditional distribution, with the remaining variables fixed to their current values. The algorithm is run until convergence; a sketch is shown below.
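As an illustration, here is a minimal sketch of the collapsed Gibbs sampler commonly used for LDA (one specific variant of the idea on this slide, not the authors' code); `corpus` is assumed to be a list of word-id arrays such as the one sampled in the earlier sketch.

```python
# Collapsed Gibbs sampling for LDA (illustrative only; production samplers
# add many optimizations such as sparse updates).
import numpy as np

def gibbs_lda(corpus, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    # count tables: topic-word, document-topic, and per-topic totals
    n_kw = np.zeros((K, V))
    n_dk = np.zeros((len(corpus), K))
    n_k = np.zeros(K)
    z = []                                   # current topic assignment of every word
    for d, doc in enumerate(corpus):
        zd = rng.integers(K, size=len(doc))  # random initialization
        z.append(zd)
        for w, k in zip(doc, zd):
            n_kw[k, w] += 1; n_dk[d, k] += 1; n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(corpus):
            for n, w in enumerate(doc):
                k = z[d][n]
                # remove this word's current assignment from the counts
                n_kw[k, w] -= 1; n_dk[d, k] -= 1; n_k[k] -= 1
                # full conditional p(z = k | everything else)
                p = (n_kw[:, w] + eta) / (n_k + V * eta) * (n_dk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                n_kw[k, w] += 1; n_dk[d, k] += 1; n_k[k] += 1
    return n_kw, n_dk
```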
Variational Inference
Goal: obtain an adjustable lower bound on the log likelihood. Step 1: define the family of distributions that will approximate the posterior, which yields a family of lower bounds. This is achieved by simplifying the original graphical model of LDA (left) into a simpler graphical model (right) that eliminates the edges between θ, z and w.
Variational Inference (2)
The family of distributions defined by the simpler graph is:
q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1}^{N} q(z_n | φ_n)
which leads to the following optimization problem:
(γ*, φ*) = argmin_{(γ, φ)} KL( q(θ, z | γ, φ) || p(θ, z | w, α, β) )
Variational Inference (3)
In order to find the (γ, φ) that minimize this KL divergence, we run the coordinate-ascent variational inference algorithm; a sketch of the per-document updates follows below.
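A minimal sketch of these per-document updates, following Blei et al. (2003): φ_{n,i} ∝ β_{i,w_n} exp(Ψ(γ_i)) and γ_i = α + Σ_n φ_{n,i}. The function name, and the assumption that `beta` is a fitted K x V topic matrix and `doc` an array of word ids, are mine.

```python
# Per-document variational E-step for LDA (illustrative sketch).
import numpy as np
from scipy.special import digamma

def variational_e_step(doc, beta, alpha, iters=50):
    K = beta.shape[0]
    N = len(doc)
    phi = np.full((N, K), 1.0 / K)          # variational multinomials, one per word
    gamma = np.full(K, alpha + N / K)       # variational Dirichlet parameter
    for _ in range(iters):
        # update phi for every word, then renormalize each row
        phi = beta[:, doc].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        # update gamma from the expected topic counts
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi
```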
Parameter Estimation Goal: find parameters α, β that maximize the marginal log-likelihood of the data, Σ_d log p(w_d | α, β). How do we achieve this goal, given that p(w | α, β) cannot be computed exactly? We perform an alternating variational EM procedure.
Parameter Estimation (2)
1. (E-step) For each document, find the optimizing values of the variational parameters {γ*_d, φ*_d : d ∈ D}. This is done as described in the previous section. 2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β. This corresponds to finding maximum likelihood estimates with expected sufficient statistics for each document under the approximate posterior computed in the E-step.
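A minimal sketch of the resulting variational EM loop, reusing the `variational_e_step` function sketched above. Only the closed-form β update is shown; the Newton update for α described in the paper is omitted for brevity, and the initialization is an illustrative choice.

```python
# Variational EM for LDA (illustrative sketch).
# M-step for beta: beta_{i,j} ∝ sum_d sum_n phi*_{d,n,i} * [w_{d,n} = j].
import numpy as np

def variational_em(corpus, K, V, alpha=0.1, em_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, 1.0), size=K)   # random initial topics
    for _ in range(em_iters):
        new_beta = np.full((K, V), 1e-10)           # small constant avoids zero rows
        for doc in corpus:
            _, phi = variational_e_step(doc, beta, alpha)    # E-step per document
            for k in range(K):
                np.add.at(new_beta[k], doc, phi[:, k])       # accumulate expected counts
        beta = new_beta / new_beta.sum(axis=1, keepdims=True)  # M-step: normalize rows
    return beta
```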
LDA The simple LDA model described above makes the following assumptions: Exchangeability: both documents and words are treated as exchangeable observed variables, meaning the order of words and of documents is neglected, even though our data may have a sequential meaning. The number of topics is known and fixed.
LDA Extensions Over the years a number of extensions of LDA have relaxed the above assumptions:
Hidden Topic Markov Models: words in a sentence are assigned to one topic, and successive sentences are more likely to share the same topic; the topics are the hidden variables, so HMM inference can be applied. Gruber A., Rosen-Zvi M., Weiss Y. Hidden Topic Markov Models. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, PMLR 2.
A composite model that switches between an HMM and LDA. Griffiths T., Steyvers M., Blei D., Tenenbaum J. Integrating topics and syntax. Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, eds. MIT Press, Cambridge, MA, 2005, 537–544.
Dynamic topic model. Blei D., Lafferty J. Dynamic topic models. In International Conference on Machine Learning (2006), ACM, New York, NY, USA, 113–120.
Bayesian nonparametric topic model. Teh Y., Jordan M., Beal M., Blei D. Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101, 476 (2006), 1566–1581.
Online LDA. Hoffman M., Blei D., Bach F. Online learning for latent Dirichlet allocation. In Neural Information Processing Systems (2010).
Uses of PTMs
Uses of PTMs (2) Beyond text mining:
Collaborative filtering and user recommendation. Bioinformatics. Image processing. The challenge in applying LDA to a field other than text mining is to translate the bag-of-words representation to your use case.
User Recommendation Example
Suppose we have a database of users and their ratings of movies. In this case the documents are the users and the movie preferences are the words. Step 1: train the model on users and their preferences. Step 2: after training, query the model: the input is a user with all but one of their movie preferences, and the output is a movie the user is likely to prefer. A sketch is shown below.
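A hedged sketch of this recommendation setup using scikit-learn's LatentDirichletAllocation, with made-up binary like/dislike data standing in for real ratings; the scoring rule at the end is one simple choice, not the only one.

```python
# Users play the role of documents and liked movie ids the role of words.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

n_users, n_movies, n_topics = 100, 500, 10
rng = np.random.default_rng(0)
# user-movie count matrix (e.g. 1 if the user liked the movie, 0 otherwise)
X = rng.integers(0, 2, size=(n_users, n_movies))

lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
theta = lda.fit_transform(X)                 # per-user "taste" proportions
beta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # per-topic movie distributions

# score unseen movies for user 0 by mixing topic-movie probabilities with taste
scores = theta[0] @ beta
scores[X[0] > 0] = -np.inf                   # do not re-recommend movies already rated
print("recommended movie id:", scores.argmax())
```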
MCTM for Mining behavior in video
Video mining: a task in this field is detecting salient behaviour in video. This approach answers questions such as: what are the typical activities and behaviours in this scene? What are the most interesting events in this video? Hospedales et al. introduce a new combination of dynamic Bayesian networks and PTMs. Hospedales T., Gong S., Xiang T. (2012). Video Behaviour Mining Using a Dynamic Topic Model. IJCV.
How is video transformed to a bag of words problem?
A camera view (frame) is divided into C x C pixel cells, and in each cell an optical flow is computed. A threshold on the flow magnitude then determines whether the flow in a cell is reliable. If the flow is reliable, it is quantized into one of four cardinal directions. Eventually, a discrete visual event is defined by the position of the cell and the motion direction, as sketched below.
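A minimal sketch (my own illustration, not the authors' code) of this quantization step: the flow field is split into C x C cells, weak flow is discarded, and the remaining flow is binned into four cardinal directions; random data stands in for a real optical-flow output.

```python
# Turning a dense optical-flow field into discrete visual words.
import numpy as np

def flow_to_visual_words(flow, C=10, threshold=1.0):
    """flow: H x W x 2 array of (dx, dy) optical-flow vectors."""
    H, W, _ = flow.shape
    words = []
    for i in range(0, H - C + 1, C):
        for j in range(0, W - C + 1, C):
            cell = flow[i:i + C, j:j + C].reshape(-1, 2).mean(axis=0)
            if np.hypot(*cell) < threshold:
                continue                       # unreliable / negligible motion
            angle = np.arctan2(cell[1], cell[0])
            direction = int(np.round(angle / (np.pi / 2))) % 4   # 4 cardinal directions
            # a visual word = (cell position, motion direction)
            words.append((i // C, j // C, direction))
    return words

# Example with random flow standing in for a real optical-flow estimate.
rng = np.random.default_rng(0)
print(flow_to_visual_words(rng.normal(size=(60, 80, 2)) * 2))
```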
How is video transformed to a bag of words problem?
Topics: simple actions (co-occurring events). Words: visual events. Documents: complex behaviours (co-occurring actions). In this case, the model manipulates a three-layer latent structure: events, actions and behaviours.
PTMs for Malware Analysis
Malware developed to attack Android devices has increased rapidly, and topic modelling can be part of a framework for analyzing Android malware. An application is represented by its opcodes. Topics: distributions of opcodes. Documents: sequences of opcodes. Words: opcodes. Medvet et al. conclude that topic modelling, and the information it provides about the malware, can help in understanding malware characteristics and similarities.
PTMs in Bioinformatics
Topic modelling can assist in understanding data and in inference. PTMs can perform the following tasks on biological data: clustering analysis, classification, and feature extraction.
Clustering Unlike in traditional clustering, a topic model allows data to come from a mixture of clusters rather than from a single cluster. Microarray expression : data matrix of real numbers Word - document relate to gene-sample analogy Topics : functional groups LPD introduced Gaussian distributions to LDA in place of word multinomial distributions. PLSA was used for extraction of biclusters; this model simultaneously groups genes and samples.
Clustering (2) For protein interaction data , proposed an infinite topic model to find functional gene modules (topics) combined with gene expression data. For gene sequence data, the desirable task is to characterize a set of common genomic features shared by the same species analyzed the genome level composition of DNA sequences by means of LDA. For genome annotation data, studies have implemented LDA to directly identify functional modules of protein families.
Classification Biologically aware latent Dirichlet allocation (BaLDA): performs a classification task by extending the LDA model with document dependencies, building on LPD. BaLDA drops the assumption, present in both PLSA and LDA, that each gene is independently generated given its corresponding latent topic. Genomic sequence classification: documents are genomic sequences, words are k-mers of the DNA string (nucleotide substrings), and topics are the assigned taxonomic labels; a k-mer bag-of-words sketch follows below.
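A minimal sketch of the k-mer bag-of-words construction mentioned above (an illustrative helper, not a specific published pipeline):

```python
# Turn a genomic sequence into a bag of overlapping k-mers, so a topic model
# can treat each sequence as a "document" of k-mer "words".
from collections import Counter

def kmer_bag(sequence, k=4):
    """Count overlapping k-mers of a DNA string."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

doc = kmer_bag("ACGTACGTGGCTAACGT", k=4)
print(doc.most_common(3))   # e.g. [('ACGT', 3), ...]
```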
Feature Extraction Protein sequence data: a hierarchical LDA-random forest pipeline has been proposed to predict human protein-protein interactions. The local sequence feature space is projected onto the topic space by LDA, revealing hidden structure between proteins; a random forest model then measures the probability of interaction between two proteins.
Topic Model of Genetic Mutations in Cancer.
Goal: apply LDA to gene expression data; here topic modelling is used for feature extraction. Bag-of-words translation: a document is a patient and the words are the genomic states of that patient. The distributions obtained from LDA are then used as features, and multi-task logistic regression is performed to predict patient-specific survival from them. The hard part of this problem is capturing the effective information in the gene expression values: in contrast to word-occurrence frequencies, which are integers, gene expression values are real numbers. A simple discretization sketch is shown below.
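A minimal sketch (my own illustration, not the paper's method) of one simple way to bridge that gap: bin z-scored expression values into discrete levels so that each patient becomes a bag of (gene, level) "words" with integer counts.

```python
# Discretize real-valued gene expression into topic-model-ready counts.
import numpy as np

rng = np.random.default_rng(0)
expression = rng.normal(size=(30, 200))          # 30 patients x 200 genes (made-up data)

# z-score each gene, then bin into low / medium / high expression levels
z = (expression - expression.mean(axis=0)) / expression.std(axis=0)
levels = np.digitize(z, bins=[-1.0, 1.0])        # 0 = low, 1 = medium, 2 = high

# vocabulary: one "word" per (gene, level) pair; each patient-document gets a count vector
n_patients, n_genes = expression.shape
X = np.zeros((n_patients, n_genes * 3), dtype=int)
X[np.arange(n_patients)[:, None], np.arange(n_genes) * 3 + levels] = 1
# X can now be fed to a topic model such as sklearn's LatentDirichletAllocation.
```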
Example Workflow.