Techniques for Dimensionality Reduction: Topic Models / Latent Dirichlet Allocation (LDA)
Outline:
- Recap: PCA
- Directed models for text: Naïve Bayes; unsupervised Naïve Bayes (mixtures of multinomials)
- Probabilistic latent semantic indexing (pLSI) and its connection to PCA, LSI, and MF
- Latent Dirichlet allocation (LDA); inference with Gibbs sampling
- LDA-like models for graphs, etc.
Quick Recap….
PCA as MF: [figure: the 1000 × 10,000 data matrix V, where V[i,j] = pixel j in image i, is factored as V ≈ Z·W, with Z a 1000 × 2 matrix of per-image coefficients and W a 2 × 10,000 matrix whose rows are the prototypes PC1 and PC2]
PCA as MF, continued: [figure: a single image is reconstructed as a weighted sum of the prototypes, e.g. 1.4·PC1 + 0.5·PC2, i.e. one row of V is approximated by its coefficient row times the prototype matrix]
PCA for movie recommendation: [figure: the n × m ratings matrix V, where V[i,j] = user i's rating of movie j, factored into an n × 2 user-factor matrix times a 2 × m movie-factor matrix; one row corresponds to a user (e.g. Bob)]
…vs k-means: [figure: the original data matrix X (n examples) is approximated as Z·M, where Z contains 0/1 indicators for r clusters and M contains the r cluster means] Question: what other generative models look like MF? (A small sketch of this view follows below.)
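To make the "k-means looks like MF" picture concrete, here is a minimal numpy sketch (the data, the number of clusters, and the Lloyd-iteration details below are illustrative assumptions, not from the slides): the data matrix is approximated as Z·M, where Z holds one-hot cluster indicators and M holds the cluster means.

```python
# Sketch: k-means viewed as matrix factorization X ~ Z @ M,
# where Z is a one-hot cluster-indicator matrix and M holds cluster means.
# Data and settings are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # n examples x m features
r = 3                                    # number of clusters

# a few Lloyd iterations of k-means, enough to make the MF point
M = X[rng.choice(len(X), r, replace=False)]            # initial cluster means
for _ in range(10):
    assign = np.argmin(((X[:, None, :] - M[None]) ** 2).sum(-1), axis=1)
    M = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else M[k]
                  for k in range(r)])

Z = np.eye(r)[assign]                    # n x r one-hot indicator matrix
X_hat = Z @ M                            # rank-r reconstruction, just like MF
print("reconstruction error:", np.linalg.norm(X - X_hat))
```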
LDA AND OTHER DIRECTED MODELS FOR MODELING TEXT
Supervised Multinomial Naïve Bayes: the Naïve Bayes model, drawn two ways. [figure: expanded notation: class node C with children W1, W2, W3, …, WN, repeated for M documents] = [figure: compact plate notation: C and W inside plates of size N and M, with parameters π (class prior) and β (K word multinomials)]
Supervised Multinomial Naïve Bayes: the generative story.
For each class k = 1..K: construct a multinomial β_k over words.
For each document d = 1,…,M: generate C_d ~ Mult(· | π).
For each position n = 1,…,N_d: generate w_n ~ Mult(· | β_{C_d}) … or if you prefer, w_n ~ Pr(w | C_d).
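A minimal sketch of this generative story in numpy; the class prior π, the per-class word multinomials β_k, and all the sizes and document-length choices below are illustrative assumptions.

```python
# Sketch of the supervised multinomial Naive Bayes generative process.
# Sizes, priors, and the Poisson document-length choice are assumptions.
import numpy as np

rng = np.random.default_rng(0)
K, V, M = 3, 50, 10                          # classes, vocabulary size, documents
pi = rng.dirichlet(np.ones(K))               # class prior pi
beta = rng.dirichlet(np.ones(V), size=K)     # beta_k: one multinomial over words per class

docs, labels = [], []
for d in range(M):
    Nd = rng.poisson(20) + 1                 # document length (arbitrary choice)
    c = rng.choice(K, p=pi)                  # C_d ~ Mult(. | pi)
    words = rng.choice(V, size=Nd, p=beta[c])   # w_n ~ Mult(. | beta_{C_d})
    labels.append(c)
    docs.append(words)
```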
Unsupervised Naïve Bayes: the same model as a mixture (unsupervised naïve Bayes). Joint probability of the words and class of a document d: Pr(w_1,…,w_{N_d}, Z_d = k) = π_k · ∏_n β_{k, w_n}. But the classes Z_d are not visible, so the model is solved using EM. [plate diagram: latent class Z, words W, plates of size N and M, parameters π and the K multinomials β]
Unsupervised Naïve Bayes: EM solution.
E-step: γ_{dk} = Pr(Z_d = k | w_d) ∝ π_k ∏_n β_{k, w_{dn}}.
M-step: π_k ∝ Σ_d γ_{dk}; β_{k,w} ∝ Σ_d γ_{dk} · count(w, d).
Key capability: estimate the distribution of latent variables given the observed variables. (A small EM sketch follows below.)
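A compact EM sketch for this mixture-of-multinomials model, assuming the corpus is given as a document-term count matrix X (M × V); the smoothing constant and the random initialization are my own choices, not from the slides.

```python
# Sketch of EM for the mixture-of-multinomials (unsupervised NB) model.
# X is an M x V document-term count matrix; eps is a smoothing constant.
import numpy as np

def em_mixture_of_multinomials(X, K, iters=50, eps=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    M, V = X.shape
    pi = np.full(K, 1.0 / K)                       # mixing weights
    beta = rng.dirichlet(np.ones(V), size=K)       # K x V class-word probabilities
    for _ in range(iters):
        # E-step: gamma[d,k] = Pr(Z_d = k | w_d), computed in log space
        log_post = np.log(pi) + X @ np.log(beta).T         # M x K
        log_post -= log_post.max(axis=1, keepdims=True)
        gamma = np.exp(log_post)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi and beta from expected counts
        pi = (gamma.sum(axis=0) + eps) / (M + K * eps)
        beta = gamma.T @ X + eps                            # smoothed expected word counts
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta, gamma
```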
Beyond Naïve Bayes: Probabilistic Latent Semantic Indexing (PLSI). Every document is a mixture of topics.
For i = 1…K: let β_i be a multinomial over words.
For each document d: let θ_d be a distribution over {1,…,K}.
For each word position in d: pick a topic z from θ_d, then pick a word w from β_z.
There is a latent variable for each word, not for each document (compare to unsupervised NB / mixture of multinomials). Can still learn with EM; see the sketch below.
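For comparison with the mixture-of-multinomials EM above, here is an EM sketch for pLSI on the same kind of document-term count matrix. The dense D × V × K responsibility array is a deliberately simple (memory-hungry) choice, and the smoothing and initialization are assumptions, not from the slides.

```python
# Sketch of EM for pLSI: each word token gets its own latent topic,
# so the responsibilities are per (document, word, topic), not per document.
import numpy as np

def plsi_em(X, K, iters=50, seed=0, eps=1e-12):
    rng = np.random.default_rng(seed)
    D, V = X.shape
    theta = rng.dirichlet(np.ones(K), size=D)      # theta[d] = Pr(z | d)
    beta = rng.dirichlet(np.ones(V), size=K)       # beta[k] = Pr(w | z = k)
    for _ in range(iters):
        # E-step: r[d,w,k] = Pr(z = k | d, w)
        r = theta[:, None, :] * beta.T[None, :, :]           # D x V x K
        r /= r.sum(axis=2, keepdims=True) + eps
        # M-step: re-estimate beta and theta from expected counts
        weighted = X[:, :, None] * r                          # D x V x K
        beta = weighted.sum(axis=0).T + eps                   # K x V
        beta /= beta.sum(axis=1, keepdims=True)
        theta = weighted.sum(axis=1) + eps                    # D x K
        theta /= theta.sum(axis=1, keepdims=True)
    return theta, beta
```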
LATENT SEMANTIC INDEXING VS SINGULAR VALUE DECOMPOSITION VS PRINCIPAL COMPONENTS ANALYSIS
PCA as MF, revisited: [figure: the same 1000 × 10,000 matrix V (V[i,j] = pixel j in image i) factored into per-image coefficients times the prototypes PC1, PC2]. Remember C_X, the covariance of the features of the original data matrix? We can also look at C_Z, the covariance of the projected data Z. It turns out C_Z[i,j] = 0 for i ≠ j and C_Z[i,i] = λ_i.
Background: PCA vs LSI — some connections. PCA is closely related to the singular value decomposition (SVD): see the handout. We factored the data matrix X into Z·V, where the rows of V are the eigenvectors of C_X. One more step: factor Z into U·Σ, where Σ is diagonal with Σ(i,i) = sqrt(λ_i) and the variables in U have unit variance. Then X = U·Σ·V (the factor written Vᵀ in the usual SVD convention), and this is called the SVD. When X is a term-document matrix this is called latent semantic indexing (LSI).
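A quick numerical check of this PCA/SVD connection with numpy; note that numpy's convention puts the eigenvectors in the columns of V (and returns Vᵀ from the SVD), and the centering and random data here are just assumptions for the sketch.

```python
# Check: singular values of the centered data relate to the eigenvalues
# of the feature covariance by s_i^2 / (n - 1) = lambda_i.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
X = X - X.mean(axis=0)                      # center the features

# PCA route: eigen-decompose the covariance C_X
C_X = (X.T @ X) / (len(X) - 1)
eigvals, V = np.linalg.eigh(C_X)            # columns of V = principal directions
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
Z = X @ V                                   # projections ("scores")

# SVD route: X = U * Sigma * V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(s**2 / (len(X) - 1), eigvals))   # True
```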
LSI: [figure: the n × m term-document matrix V, where V[i,j] = frequency of word j in document i, factored into an n × r document-factor matrix times an r × m word-factor matrix]
LSI http://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/
Beyond Naïve Bayes: Probabilistic Latent Semantic Indexing (PLSI), recap. Every document is a mixture of topics: for i = 1…K let β_i be a multinomial over words; for each document d let θ_d be a distribution over {1,…,K}; for each word position in d, pick a topic z from θ_d and a word w from β_z. One latent variable per word, not per document (compare to unsupervised NB / mixture of multinomials).
PLSI as matrix factorization: [figure: the same generative story drawn as MF — an n (documents) × K matrix of per-document topic distributions θ_d times a K × m (words) matrix whose rows are the topic-word multinomials β_i]
LSI or pLSI: [figure: the same n × m word-frequency matrix V (V[i,j] = frequency of word j in document i) approximated by a product of an n × r and an r × m matrix]
Background: LSI vs NMF http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf
PLSI and MF and Nonnegative MF
PLSI and MF and non-negative MF: so far we minimized the L2 reconstruction error. Another choice: constrain the factors C, H to be non-negative and minimize the divergence objective J_NMF = Σ_{ij} ( V_{ij} log( V_{ij} / (CH)_{ij} ) − V_{ij} + (CH)_{ij} ). Ding et al.: the objective functions for this form of NMF and for PLSI are equivalent.
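As one possible illustration, here is a sketch of the multiplicative updates from the Lee & Seung paper linked above for the divergence form of NMF (the variant that Ding et al. relate to PLSI); the initialization, iteration count, and epsilon guards below are arbitrary choices.

```python
# Multiplicative updates for divergence-minimizing NMF, V ~ W H,
# with W, H kept non-negative throughout (Lee & Seung style updates).
import numpy as np

def nmf_divergence(V, r, iters=200, seed=0, eps=1e-12):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(iters):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / H.sum(axis=1)[None, :]
    return W, H
```

Up to rescaling, the non-negative factors can be interpreted probabilistically, which is the sense in which Ding et al. connect this objective to PLSI.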
Beyond Naïve Bayes: Probabilistic Latent Semantic Indexing (PLSI), continued. Every document is a mixture of topics (same generative story as before: per-document topic distribution θ_d, per-word topic z, word drawn from β_z). It turns out to be hard to fit: lots of parameters! Also: it only applies to the training data (each θ_d is a parameter tied to one training document, so there is no generative story for new documents).
LATENT DIRICHLET ALLOCATION (LDA)
The LDA Topic Model
LDA Motivation. Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words). For each document d = 1,…,M: generate θ_d ~ D1(…). For each word position n = 1,…,N_d: generate w_n ~ D2(· | θ_{dn}). Now pick your favorite distributions for D1, D2.
Latent Dirichlet Allocation: "mixed membership". For each document d = 1,…,M: generate θ_d ~ Dir(· | α). For each position n = 1,…,N_d: generate z_n ~ Mult(· | θ_d), then generate w_n ~ Mult(· | φ_{z_n}). [plate diagram: α → θ → z → w, with the K topic-word multinomials φ drawn with Dirichlet parameter β]
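A sketch of this generative story in numpy; the Dirichlet hyperparameters α and β, the corpus sizes, and the Poisson document-length choice are all assumptions made for the example.

```python
# Sketch of the LDA generative process:
# theta_d ~ Dir(alpha); z_n ~ Mult(theta_d); w_n ~ Mult(phi_{z_n}),
# with K topic-word multinomials phi_k drawn from Dir(beta).
import numpy as np

rng = np.random.default_rng(0)
K, V, M = 5, 100, 20
alpha, beta = 0.1, 0.1
phi = rng.dirichlet(np.full(V, beta), size=K)        # K topic-word multinomials

docs, topics = [], []
for d in range(M):
    theta_d = rng.dirichlet(np.full(K, alpha))       # theta_d ~ Dir(. | alpha)
    Nd = rng.poisson(30) + 1
    z = rng.choice(K, size=Nd, p=theta_d)            # z_n ~ Mult(. | theta_d)
    w = np.array([rng.choice(V, p=phi[k]) for k in z])   # w_n ~ Mult(. | phi_{z_n})
    topics.append(z)
    docs.append(w)
```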
LDA’s view of a document
LDA topics
LDA used as dimension reduction for classification 50 topics vs all words, SVM
LDA used for collaborative filtering (CF): users rating 100+ movies only
Latent Dirichlet Allocation: parameter learning.
- Variational EM: numerical approximation using lower bounds; results in biased solutions; convergence has numerical guarantees.
- Gibbs sampling: stochastic simulation; unbiased solutions; stochastic convergence.
LDA Gibbs sampling — Gibbs sampling works for any directed model! It is applicable when the joint distribution is hard to evaluate but the conditional distribution of each variable is known. The sequence of samples comprises a Markov chain, and the stationary distribution of the chain is the joint distribution. Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.
Why does Gibbs sampling work? What's the fixed point? The stationary distribution of the chain is the joint distribution. When will it converge (in the limit)? If the graph defined by the chain is connected. How long will it take to converge? That depends on the second eigenvalue of the chain's transition matrix (its spectral gap).
From "Parameter Estimation for Text Analysis" (Gregor Heinrich). This is called "collapsed Gibbs sampling", since you've marginalized away some variables (for LDA, the multinomials θ and φ) and sample only the topic assignments z.
Latent Dirichlet Allocation ("mixed membership") via Gibbs sampling:
Randomly initialize each z_{m,n}.
Repeat for t = 1,…:
  For each doc m, word n: find Pr(z_{mn} = k | the other z's) and sample z_{mn} according to that distribution.
EVEN MORE DETAIL ON LDA…
Way way more detail
More detail
What gets learned…..
In a math-ier notation, the sampler maintains count tables N[d,k] (topic counts per document), N[*,k] (topic counts overall, with N[*,*] = V), and M[w,k] (word-topic counts).
Initialization:
for each document d and word position j in d:
  z[d,j] = k, a random topic
  N[d,k]++
  W[w,k]++  where w = id of the j-th word in d
for each pass t = 1,2,…:
  for each document d and word position j in d:
    z[d,j] = k, a new topic sampled from Pr(z[d,j] = k | the other z's)
    update N, W to reflect the new assignment of z:
      N[d,k]++; N[d,k']--  where k' is the old z[d,j]
      W[w,k]++; W[w,k']--  where w is w[d,j]
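Putting the pseudocode together, here is a compact collapsed Gibbs sampler in the count notation above (N[d,k] and W[w,k]); the Dirichlet hyperparameters α and β, the number of passes, and the assumption that documents are lists of integer word ids are my own choices for the sketch.

```python
# Collapsed Gibbs sampling for LDA.
# N[d,k] = # words in doc d assigned to topic k; W[w,k] = # times word w
# is assigned to topic k; Wsum[k] = N[*,k] = total words assigned to topic k.
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, passes=100, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    N = np.zeros((D, K)); W = np.zeros((V, K)); Wsum = np.zeros(K)
    z = [rng.integers(K, size=len(doc)) for doc in docs]      # random initialization
    for d, doc in enumerate(docs):                            # fill in the counts
        for j, w in enumerate(doc):
            N[d, z[d][j]] += 1; W[w, z[d][j]] += 1; Wsum[z[d][j]] += 1
    for _ in range(passes):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k_old = z[d][j]                               # remove the old assignment
                N[d, k_old] -= 1; W[w, k_old] -= 1; Wsum[k_old] -= 1
                # Pr(z[d,j] = k | all the other z's and the words)
                p = (N[d] + alpha) * (W[w] + beta) / (Wsum + V * beta)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][j] = k_new                               # record the new assignment
                N[d, k_new] += 1; W[w, k_new] += 1; Wsum[k_new] += 1
    return z, N, W
```

After sampling, point estimates of the topic-word distributions can be read off the smoothed counts, e.g. Pr(w | k) ≈ (W[w,k] + β) / (N[*,k] + Vβ).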
Some comments on LDA Very widely used model Also a component of many other models
Techniques for Modeling Graphs Latent Dirichlet Allocation++
Review — LDA Motivation. Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words). For each document d = 1,…,M: generate θ_d ~ D1(…). For each word n = 1,…,N_d: generate w_n ~ D2(· | θ_{dn}). Docs and words are exchangeable.
A Matrix Can Also Represent a Graph: [figure: a 0/1 adjacency matrix over nodes A–J next to the corresponding graph drawing]
Karate club network
Schoolkids: middle school kids, by race and age
College football teams
Citations (CiteSeer)
blogs
blogs
Stochastic Block Models ("stochastic block model", aka "block-stochastic matrix"):
- Draw n_i nodes in block i.
- With probability p_ij, connect pairs (u, v) where u is in block i and v is in block j.
- Special, simple case: p_ii = q_i, and p_ij = s for all i ≠ j.
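A sketch of this generative story (the block sizes, the within-block probabilities q_i, and the cross-block probability s below are made-up example values).

```python
# Generate a graph from the simple stochastic block model:
# p_ii = q_i within block i, p_ij = s between different blocks.
import numpy as np

rng = np.random.default_rng(0)
n = [20, 30, 25]                            # n_i nodes in block i
q, s = [0.4, 0.5, 0.3], 0.05                # within-block q_i, cross-block s

blocks = np.repeat(np.arange(len(n)), n)    # block id of each node
N = blocks.size
P = np.where(blocks[:, None] == blocks[None, :],
             np.array(q)[blocks][:, None],  # same block: probability q_i
             s)                             # different blocks: probability s
A = (rng.random((N, N)) < P).astype(int)    # edge indicators
A = np.triu(A, 1); A = A + A.T              # make it undirected, no self-loops
```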
Stochastic block models assume: 1) nodes within a block z are exchangeable, and 2) edges between blocks z_p, z_q are exchangeable. [plate diagram: block memberships z_p for the N nodes, block-pair edge probabilities, and the N² edge indicators a_pq]
Stochastic block models (same assumptions). Gibbs sampling:
Randomly initialize z_p for each node p.
For t = 1,…:
  For each node p: compute Pr(z_p | the other z's and the graph), and sample z_p from that distribution.
See: Snijders & Nowicki, 1997, "Estimation and Prediction for Stochastic Blockmodels for Graphs with Latent Block Structure".
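A sketch of such a Gibbs sampler for the (single-membership) block model; the Beta prior on the block-edge probabilities B, the uniform prior over blocks, and the resampling schedule are my own modeling assumptions, not taken from Snijders & Nowicki.

```python
# Gibbs sampling for a stochastic block model on adjacency matrix A (N x N):
# alternately resample each node's block z_p and the block-edge probabilities B.
import numpy as np

def sbm_gibbs(A, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    z = rng.integers(K, size=N)                  # random initial block for each node
    B = np.full((K, K), 0.1)                     # block-to-block edge probabilities
    for _ in range(iters):
        for p in range(N):                       # resample z_p given the other z's
            others = np.delete(np.arange(N), p)
            a = A[p, others]
            logp = np.zeros(K)
            for k in range(K):
                probs = B[k, z[others]]
                logp[k] = np.sum(a * np.log(probs) + (1 - a) * np.log(1 - probs))
            logp -= logp.max()
            w = np.exp(logp)
            z[p] = rng.choice(K, p=w / w.sum())
        for i in range(K):                       # resample B from its Beta posterior
            for j in range(K):
                mask = (z[:, None] == i) & (z[None, :] == j) & ~np.eye(N, dtype=bool)
                edges, pairs = A[mask].sum(), mask.sum()
                B[i, j] = rng.beta(1 + edges, 1 + pairs - edges)
    return z, B
```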
Mixed Membership Stochastic Block Models: [plate diagram: each node p has a membership distribution; for each pair (p, q), block indicators z_{p·} and z_{·q} are drawn and the edge a_pq is generated from the corresponding block-pair probability, over N² pairs] (Airoldi et al., JMLR 2008)
Another mixed membership block model
Another mixed membership block model — notation: z = (z_i, z_j) is a pair of block ids; n_z = # pairs with block-pair z; q_{z1,i} = # links to node i from block z1; q_{z1,·} = # outlinks from block z1; δ = indicator for the diagonal; M = # nodes.
Another mixed membership block model
Sparse network model vs spectral clustering
Hybrid models Balasubramanyan & Cohen, BioNLP 2012