Probabilistic Topic Models.

Probabilistic Topic Models. GOGS STUDY GROUP. Thaleia Ntiniakou, Intelligence Lab @ TUC

Introduction Topic models are algorithms that discover the main themes pervading a large, unorganized collection of documents. Topic models can organize the collection, without prior knowledge (keywords, tags, etc.), according to the discovered themes. Probabilistic topic models (Blei et al. 2003) have revolutionized text-mining methods. PTMs also have wide use beyond text mining, e.g. in genetics, video clustering and other areas we will refer to later on.

Prior to LDA, latent variable methods (1) Tf-idf scheme: the first method proposed for text corpora. Each document in the collection is reduced to a vector of real numbers, each of which represents a ratio of counts. For each document, the number of occurrences of each word is counted (term frequency, tf). The tf is then compared to an inverse document frequency (idf) term, based on how many documents in the entire collection contain the word. Con: reveals little about the inter-document statistical structure.

Prior to LDA latent variable methods (2) LSI (Latent Semantic Indexing), proposed by Deerwester et al., 1990, applies singular value decomposition to the term-document matrix in order to identify a linear subspace in the space of tf-idf features that captures most of the variance in the corpus.

Prior to LDA latent variable methods (3) PLSI (Probabilistic Latent Semantic Indexing) models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics". Each document can be described as a mixture of topics. Cons: No probabilistic model at the level of documents. The number of parameters grows with the size of the collection. It is unclear how to handle a document outside of the training set. T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.

Intuition Treat data as observations that arise from a generative probabilistic process that includes latent variables. For documents, the hidden variables are the topics. Infer the hidden structure via posterior inference: what are the topics that describe this corpus? Situate new data in the estimated model. What is the generative probabilistic process in our case? Each document is a random mixture of corpus-wide topics, and each word is drawn from one of those topics. Blei DM, Ng AY, Jordan MI: Latent Dirichlet Allocation. J Mach Learn Res 2003, 3:993-1022. Blei DM: Probabilistic Topic Models. Communications of the ACM 2012, 55(4):77-84.

Latent Dirichlet Allocation The intuitions behind latent Dirichlet allocation. Assume some number of "topics," which are distributions over words, exist for the whole collection (far left). First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative; they are not fit from real data.

Notation A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, …, V}. We represent words using unit-basis vectors that have a single component equal to one and all other components equal to zero. Thus, using superscripts to denote components, the v-th word in the vocabulary is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v. A document is a sequence of N words denoted by w = (w_1, w_2, …, w_N), where w_n is the n-th word in the sequence. A corpus is a collection of M documents denoted by D = {w_1, w_2, …, w_M}.
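
As a minimal illustration of this notation, a word can be stored as a one-hot V-vector and a corpus as a list of word-index sequences; the toy vocabulary and documents below are invented for the example.

import numpy as np

vocabulary = ["gene", "dna", "cell", "model", "data"]   # toy vocabulary, V = 5
V = len(vocabulary)

def one_hot(word):
    # Unit-basis V-vector: a single component equal to one, all others zero.
    w = np.zeros(V, dtype=int)
    w[vocabulary.index(word)] = 1
    return w

document = ["gene", "dna", "gene", "cell"]              # a document: a sequence of N words
corpus = [document, ["model", "data", "data"]]          # a corpus: a collection of M documents
print(one_hot("dna"))                                   # [0 1 0 0 0]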

Probabilistic Graphical Models Nodes are random variables. Edges denote possible dependence. Observed variables are shaded. Plates denote replicated structure.

Probabilistic Graphical Models (2) The structure of the graph defines the pattern of conditional dependence between the ensemble of random variables.

LDA as a probabilistic graphical model

PLSI vs. LDA PLSI graphical model (right). LDA graphical model (left).

Generative process Draw each topic β_i ∼ Dir(η), for i ∈ {1, …, K}. For each document d: Draw topic proportions θ_d ∼ Dir(α). For each word n: Draw z_{d,n} ∼ Mult(θ_d). Draw w_{d,n} ∼ Mult(β_{z_{d,n}}).
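
A minimal sketch of this generative process in Python; the corpus size, vocabulary size and hyperparameter values below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 20, 5, 10            # topics, vocabulary size, documents, words per document
alpha, eta = 0.5, 0.1                # Dirichlet hyperparameters (illustrative values)

# Draw each topic beta_i ~ Dir(eta): a distribution over the V vocabulary words.
beta = rng.dirichlet(np.full(V, eta), size=K)

corpus = []
for d in range(M):
    theta_d = rng.dirichlet(np.full(K, alpha))    # per-document topic proportions
    doc = []
    for n in range(N):
        z_dn = rng.choice(K, p=theta_d)           # topic assignment for word n
        w_dn = rng.choice(V, p=beta[z_dn])        # word drawn from the assigned topic
        doc.append(w_dn)
    corpus.append(doc)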

Generative Process (2) From a collection of documents, infer: the per-word topic assignments z_{d,n}, the per-document topic proportions θ_d, and the per-corpus topic distributions β_k. Use posterior expectations to perform the task at hand, e.g. information retrieval, document similarity, etc.

In order to perform inference we need to calculate the following posterior distribution: p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β). Because of the coupling between θ and β, this distribution is intractable to compute exactly. The denominator is the probability of seeing the observed corpus under any topic model. The task of topic modelling is to approximate this posterior. How do we approximate it?

Inference There are two categories of algorithms that we can use to approximate the posterior. Sampling-based algorithms: algorithms that collect samples from the posterior in order to approximate it with an empirical distribution (e.g. Gibbs sampling). Variational algorithms: deterministic algorithms that posit a parameterized family of distributions over the hidden structure and then find the member of that family closest to the posterior (mean-field variational inference, online LDA).

Gibbs Sampling Goal: approximate the posterior with an empirical distribution built from samples. Idea: generate posterior samples by sweeping through each variable (or block of variables) and sampling from its conditional distribution with the remaining variables fixed to their current values. The algorithm is run until convergence.
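
A minimal sketch of a collapsed Gibbs sampler for LDA; this collapsed form (integrating out θ and β) follows Griffiths and Steyvers rather than the slides, and the corpus format and hyperparameters are assumptions for illustration.

import numpy as np

def gibbs_lda(corpus, K, V, alpha=0.5, eta=0.1, iters=200, seed=0):
    # corpus: list of documents, each a list of word indices in {0, ..., V-1}.
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(corpus), K))            # topic counts per document
    n_kv = np.zeros((K, V))                      # word counts per topic
    n_k = np.zeros(K)                            # total words per topic
    z = []                                       # current topic assignment of every word
    for d, doc in enumerate(corpus):             # random initialization
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1; n_kv[k, w] += 1; n_k[k] += 1
    for _ in range(iters):                       # sweep through every word position
        for d, doc in enumerate(corpus):
            for n, w in enumerate(doc):
                k = z[d][n]                      # remove the current assignment from the counts
                n_dk[d, k] -= 1; n_kv[k, w] -= 1; n_k[k] -= 1
                # full conditional of z_{d,n} given all other assignments
                p = (n_kv[:, w] + eta) / (n_k + V * eta) * (n_dk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k                      # record and add back the new assignment
                n_dk[d, k] += 1; n_kv[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kv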

Variational Inference Goal: obtain an adjustable lower bound on the log likelihood. Step 1: choose the family of distributions (lower bounds) with which we will approximate the posterior. This is achieved by simplifying the original graphical model of LDA (left) to a simpler graphical model (right) that eliminates the edges between θ, z and w.

Variational Inference (2) The family of distributions defined by the simpler graph is q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1..N} q(z_n | φ_n), where γ is a Dirichlet parameter and φ_1, …, φ_N are multinomial parameters. This turns inference into the following optimization problem: (γ*, φ*) = argmin_{γ,φ} KL( q(θ, z | γ, φ) || p(θ, z | w, α, β) ).

Variational Inference (3) In order to find the (γ, φ) that minimize this KL divergence, we run the variational inference algorithm, iterating the coordinate updates φ_{n,i} ∝ β_{i,w_n} exp(E_q[log θ_i | γ]) and γ_i = α_i + Σ_{n=1..N} φ_{n,i} until convergence.
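
A minimal sketch of these per-document coordinate-ascent updates; doc is assumed to be a list of word indices, alpha the Dirichlet hyperparameter and beta the current K x V topic matrix.

import numpy as np
from scipy.special import digamma

def variational_inference(doc, alpha, beta, iters=100, tol=1e-4):
    # Coordinate ascent for the variational parameters (gamma, phi) of one document.
    K = beta.shape[0]
    N = len(doc)
    phi = np.full((N, K), 1.0 / K)               # initialize phi uniformly
    gamma = np.ones(K) * alpha + N / K           # initialize gamma
    for _ in range(iters):
        old_gamma = gamma.copy()
        # phi_{n,i} proportional to beta_{i, w_n} * exp(E_q[log theta_i | gamma])
        expE = np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi = beta[:, doc].T * expE
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = np.ones(K) * alpha + phi.sum(axis=0)   # gamma_i = alpha_i + sum_n phi_{n,i}
        if np.abs(gamma - old_gamma).sum() < tol:
            break
    return gamma, phi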

Parameter Estimation Goal: find the parameters α, β that maximize the marginal log-likelihood of the data, ℓ(α, β) = Σ_{d=1..M} log p(w_d | α, β). How do we achieve this goal, given that p(w | α, β) cannot be computed tractably? We perform an alternating variational EM procedure.

Parameter Estimation (2) 1. (E-step) For each document, find the optimizing values of the variational parameters {γ*_d, φ*_d : d ∈ D}. This is done as described in the previous section. 2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β. This corresponds to finding maximum likelihood estimates with expected sufficient statistics for each document under the approximate posterior computed in the E-step.
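
A minimal sketch of the resulting variational EM loop, reusing the variational_inference function from the previous sketch; the closed-form β update is shown, while the α update (a Newton-Raphson step in the original paper) is kept fixed here for brevity.

import numpy as np

def variational_em(corpus, K, V, alpha=0.5, em_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    alpha = np.full(K, alpha)
    beta = rng.dirichlet(np.ones(V), size=K)     # random initial topics
    for _ in range(em_iters):
        suff = np.zeros((K, V))                  # expected sufficient statistics
        for doc in corpus:                       # E-step: fit (gamma_d, phi_d) per document
            _, phi = variational_inference(doc, alpha, beta)
            for n, w in enumerate(doc):
                suff[:, w] += phi[n]
        # M-step: beta_{i,v} proportional to the expected count of word v in topic i
        beta = (suff + 1e-12) / (suff + 1e-12).sum(axis=1, keepdims=True)
    return alpha, beta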

LDA The simplest LDA algorithm that we described makes the following assumptions: Exchangeability of variables: we treat both documents and words as exchangeable observed variables, which means the order of the words and of the documents (as random variables) is ignored, even though our data may have sequential meaning. The number of topics is known and fixed.

LDA Extensions Over the years a number of extensions of LDA have been proposed that relax the above assumptions: Hidden Topic Markov Models: words in a sentence are assigned to one topic, and successive sentences are more likely to have the same topics; thus the hidden variables are the topics and HMM inference can be applied. Gruber A., Rosen-Zvi M., Weiss Y. Hidden Topic Markov Models. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, PMLR 2:163-170, 2007. A composite model that switches between an HMM and LDA. Griffiths, T., Steyvers, M., Blei, D., Tenenbaum, J. Integrating topics and syntax. Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, eds. MIT Press, Cambridge, MA, 2005, 537-544. Dynamic Topic Model. Blei, D., Lafferty, J. Dynamic topic models. In International Conference on Machine Learning (2006), ACM, New York, NY, USA, 113-120. Bayesian nonparametric topic model. Teh, Y., Jordan, M., Beal, M., Blei, D. Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101, 476 (2006), 1566-1581. Online LDA. Hoffman, M., Blei, D., Bach, F. Online learning for latent Dirichlet allocation. In Neural Information Processing Systems (2010).

Uses of PTMs

Uses of PTMs (2) Beyond text mining: collaborative filtering and user recommendation, bioinformatics, image processing. The challenge in applying LDA to a field other than text mining is to translate the bag-of-words representation to your use case.

User Recommendation Example Suppose we have a database of users and their ratings of movies. In this case the documents are the users and the movie preferences are the words. Step 1: train the model on the users and their preferences. Step 2: after training, query the model; the input is a user with all of their movie preferences except one, and the output is a movie the user is likely to prefer.
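
A minimal sketch of this "users as documents, movies as words" translation using scikit-learn's LatentDirichletAllocation; the rating matrix and the way preferences become counts are assumptions for illustration.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Rows are users ("documents"), columns are movies ("words");
# entries are implicit-feedback counts, e.g. how often a user watched or liked a movie.
user_movie_counts = np.random.default_rng(0).integers(0, 5, size=(100, 50))

lda = LatentDirichletAllocation(n_components=10, random_state=0)
user_topics = lda.fit_transform(user_movie_counts)          # per-user "taste" proportions

# Score movies for user 0: expected preference under the user's topic mixture.
topic_movie = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
scores = user_topics[0] @ topic_movie
recommended_movie = int(np.argmax(scores))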

MCTM for Mining Behaviour in Video Video mining: a task in this field is detecting salient behaviour in video. This approach answers questions such as: what are the typical activities and behaviours in this scene? What are the most interesting events in this video? Hospedales et al. introduce a new combination of dynamic Bayesian networks and PTMs, the Markov Clustering Topic Model (MCTM). Hospedales, Timothy & Gong, Shaogang & Xiang, Tao. (2012). Video Behaviour Mining Using a Dynamic Topic Model. IJCV 98, 303-323. 10.1007/s11263-011-0510-7.

How is video transformed into a bag-of-words problem? A camera view (frame) is divided into C×C-pixel cells and in each cell the optical flow is computed. A threshold is then used to check whether the magnitude of the optical flow exceeds it; if the flow is judged reliable, it is quantized into one of four cardinal directions. Eventually, a discrete visual event is defined by the position of the cell and the motion direction.
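
A minimal sketch of this quantization step; the cell size, threshold and flow array are illustrative assumptions, and a real pipeline would first compute the flow with an optical-flow estimator.

import numpy as np

C = 8            # cell size in pixels (assumed)
THRESH = 1.0     # minimum flow magnitude for a reliable event (assumed)

def visual_events(flow):
    # flow: array of shape (H, W, 2) holding (dx, dy) per pixel. Returns visual-word ids.
    H, W, _ = flow.shape
    grid_w = W // C
    events = []
    for cy in range(0, H - C + 1, C):
        for cx in range(0, W - C + 1, C):
            dx, dy = flow[cy:cy+C, cx:cx+C].reshape(-1, 2).mean(axis=0)
            if np.hypot(dx, dy) < THRESH:
                continue                                     # unreliable flow: no event
            direction = int(((np.degrees(np.arctan2(dy, dx)) + 45) % 360) // 90)
            cell_id = (cy // C) * grid_w + (cx // C)         # position of the cell
            events.append(cell_id * 4 + direction)           # word = cell x direction
    return events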

How is video transformed into a bag-of-words problem? (2) Topics: simple actions (co-occurring events). Words: visual events. Documents: complex behaviours (co-occurring actions). In this case the model manipulates a three-layer latent structure: events, actions and behaviours.

PTMs for Malware Analysis Malware developed to attack Android devices has increased rapidly. Topic modelling can be part of a framework for analyzing Android malware. An application is treated as a sequence of opcodes. Topics: distributions over opcodes. Documents: opcode sequences. Words: opcodes. Medvet et al. conclude that topic modelling, and the information it provides about the malware, can help understand malware characteristics and similarities.

PTMs in Bioinformatics Topic modelling can assist in understanding biological data and performing inference over it. PTMs can perform the following tasks on biological data: clustering analysis, classification, feature extraction.

Clustering Unlike in traditional clustering, a topic model allows data to come from a mixture of clusters rather than from a single cluster. Microarray expression data: a data matrix of real numbers; the word-document relation maps to a gene-sample analogy, and topics correspond to functional groups. LPD (Latent Process Decomposition) introduced Gaussian distributions into LDA in place of the word multinomial distributions. PLSA was used for the extraction of biclusters; this model simultaneously groups genes and samples.

Clustering (2) For protein interaction data, an infinite topic model was proposed to find functional gene modules (topics) in combination with gene expression data. For gene sequence data, the desirable task is to characterize the set of common genomic features shared by a species; studies have analyzed the genome-level composition of DNA sequences by means of LDA. For genome annotation data, studies have applied LDA to directly identify functional modules of protein families.

Classification Biologically aware latent Dirichlet allocation (BaLDA): performs a classification task by extending the LDA model with document dependencies, starting from LPD. BaLDA does not contain the assumption, present in both PLSA and LDA, that each gene is independently generated given its corresponding latent topic. Genomic sequence classification: Documents: genomic sequences. Words: k-mers of the DNA string (nucleotide substrings). Topics: assigned taxonomic labels.
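
A minimal sketch of the "genomic sequences as documents, k-mers as words" translation; the sequences and the value of k are invented for illustration.

from collections import Counter

def kmer_counts(sequence, k=4):
    # Bag-of-words representation of a DNA sequence: counts of its overlapping k-mers.
    return Counter(sequence[i:i+k] for i in range(len(sequence) - k + 1))

sequences = ["ACGTACGTGGCA", "TTGACGTACGTT"]     # toy genomic "documents"
corpus = [kmer_counts(s) for s in sequences]
print(corpus[0].most_common(3))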

Feature Extraction Protein sequence data: a hierarchical LDA-random forest model has been proposed to predict human protein-protein interactions. The local sequence feature space is projected onto the topic space by LDA, the hidden structure between proteins is revealed, and a random forest model then measures the probability of interaction between two proteins.

Topic Model of Genetic Mutations in Cancer Goal: apply LDA to gene expression data; here topic modelling is used for feature extraction. BoW translation: Documents: patients. Words: genomic states of a patient. Afterwards, the distributions obtained from LDA are used as features to predict the survival rate of cancer patients: Multi-Task Logistic Regression is performed to predict patient-specific survival from the features extracted by LDA. The hard task in this problem is capturing the effective information in the gene expression values: in contrast to word occurrence frequencies, which are integers, gene expression values are real numbers.

Example Workflow.