Statistical NLP Winter 2009 Lecture 5: Unsupervised text categorization Roger Levy Many thanks to Tom Griffiths for use of his slides

Where we left off
Supervised text categorization through Naïve Bayes.
- Generative model: first generate a document category, then the words in the document (unigram model).
- Inference: obtain the posterior over document categories using Bayes' rule (argmax to choose the category).
[Graphical model: Cat → w1 w2 w3 w4 w5 w6 …]
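To make the recap concrete, here is a minimal sketch of that inference step in Python, with a made-up two-category model over a toy five-word vocabulary (all probabilities hypothetical):

# Naive Bayes inference sketch: posterior over categories via Bayes' rule,
# computed in log space with toy, already-estimated parameters.
import numpy as np

log_p_cat = np.log(np.array([0.5, 0.5]))            # P(Cat), one entry per category
log_p_w_cat = np.log(np.array([                     # P(w | Cat), rows = categories
    [0.2, 0.2, 0.2, 0.2, 0.2],                      # toy 5-word vocabulary
    [0.1, 0.1, 0.1, 0.1, 0.6],
]))

doc = [0, 4, 4, 2]                                  # a document as word ids
log_post = log_p_cat + log_p_w_cat[:, doc].sum(axis=1)   # log P(Cat) + sum_i log P(w_i | Cat)
log_post -= np.logaddexp.reduce(log_post)           # normalize to a posterior over categories
print(np.exp(log_post), log_post.argmax())          # posterior and the argmax category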

What we're doing today
- Supervised categorization requires hand-labeling documents, which can be extremely time-consuming.
- Unlabeled documents are cheap.
- So we'd really like to do unsupervised text categorization.
- Today we'll look at unsupervised learning within the Naïve Bayes model.

Compact graphical model representations
We're going to lean heavily on graphical model representations here. We'll use a more compact notation:
[Graphical model: Cat → w1 w2 w3 w4 … wn, drawn compactly as Cat → w1 inside a "plate" indexed by n, meaning "generate a word from Cat n times".]

Now suppose that Cat isn't observed
We need to learn two distributions: P(Cat) and P(w|Cat). How do we do this?
- We might use maximum likelihood estimation (MLE).
- But it turns out that the likelihood surface is highly non-convex, and lots of information isn't contained in a point estimate.
- Alternative: Bayesian methods.
[Graphical model: Cat (unobserved) → w1, inside a plate of size n.]
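For reference, the maximum-likelihood route is usually attempted with EM on a mixture of unigram models. The sketch below (hypothetical helper em_mixture_unigrams; docs assumed to be a document-by-vocabulary count matrix) shows the idea; as noted above, the surface is non-convex, so the result depends on the random start and is only a point estimate:

# EM sketch for Naive Bayes with Cat unobserved (a mixture of unigram models).
# Toy illustration of MLE, not the Bayesian approach the lecture moves to next.
import numpy as np

def em_mixture_unigrams(docs, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    D, V = docs.shape
    p_cat = np.full(K, 1.0 / K)
    p_w = rng.dirichlet(np.ones(V), size=K)          # K x V, random start
    for _ in range(iters):
        # E step: posterior over categories for each document (log space)
        log_r = np.log(p_cat) + docs @ np.log(p_w).T          # D x K
        log_r -= np.logaddexp.reduce(log_r, axis=1, keepdims=True)
        r = np.exp(log_r)
        # M step: re-estimate P(Cat) and P(w | Cat) from expected counts
        p_cat = r.mean(axis=0)
        counts = r.T @ docs + 1e-6                            # K x V, tiny smoothing
        p_w = counts / counts.sum(axis=1, keepdims=True)
    return p_cat, p_w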

Bayesian document categorization
[Plate diagram: Dirichlet priors on P(Cat) and on P(w|Cat); Cat generates the words w1 … w_{n_d} of a document (plate of size n_d), repeated over D documents (outer plate).]

Latent Dirichlet allocation (Blei, Ng, & Jordan, 2001; 2003)
[Plate diagram: N_d words per document, D documents, T topics.]
- θ(d) ~ Dirichlet(α)      distribution over topics for each document
- z_i ~ Discrete(θ(d))     topic assignment for each word
- φ(j) ~ Dirichlet(β)      distribution over words for each topic
- w_i ~ Discrete(φ(z_i))   word generated from its assigned topic
- α, β: Dirichlet priors
Main difference: one topic per word (rather than one category per document).
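A minimal sketch of this generative process in Python, with toy corpus sizes and hyperparameter values chosen purely for illustration:

# LDA generative process: sample topic and word distributions, then documents.
import numpy as np

rng = np.random.default_rng(0)

V, T, D, N_d = 10, 3, 5, 20           # vocabulary size, topics, documents, words per document
alpha, beta = 0.1, 0.01               # Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=T)     # phi[j]: word distribution for topic j
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))      # theta: topic distribution for document d
    z = rng.choice(T, size=N_d, p=theta)          # topic assignment for each word
    w = np.array([rng.choice(V, p=phi[zi]) for zi in z])   # word drawn from its assigned topic
    docs.append(w)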

A generative model for documents
Topic 1, P(w|Cat = 1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2; SCIENTIFIC 0.0, KNOWLEDGE 0.0, WORK 0.0, RESEARCH 0.0, MATHEMATICS 0.0
Topic 2, P(w|Cat = 2): HEART 0.0, LOVE 0.0, SOUL 0.0, TEARS 0.0, JOY 0.0; SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2

Choose mixture weights for each document, generate a "bag of words"
Example mixture weights {P(z = 1), P(z = 2)}: {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0}
Sample documents, one per weight setting in the same order:
- MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK
- SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART MATHEMATICS
- HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART WORK
- JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL TEARS
- LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY

Dirichlet priors
- Multivariate equivalent of the Beta distribution.
- Hyperparameters α determine the form of the prior.
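For reference, the Dirichlet density over a T-dimensional probability vector θ can be written as (standard form, not shown in the transcript text itself):

p(\theta \mid \alpha_1, \dots, \alpha_T)
  = \frac{\Gamma\!\big(\sum_{j=1}^{T} \alpha_j\big)}{\prod_{j=1}^{T} \Gamma(\alpha_j)}
    \prod_{j=1}^{T} \theta_j^{\alpha_j - 1},
  \qquad \theta_j \ge 0, \; \sum_{j} \theta_j = 1 .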

Matrix factorization interpretation (Hofmann, 1999)
[Figure: the words × documents matrix P(w) factored into a words × topics matrix P(w|z) times a topics × documents matrix P(z).]
Maximum-likelihood estimation is finding the factorization that minimizes KL divergence.
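A tiny numeric illustration of this factorization (toy matrices, sizes chosen arbitrarily): each document's word distribution is the product of a words × topics matrix and that document's topic weights.

# Matrix factorization view: P(w | d) column-by-column as P(w | z) times P(z | d).
import numpy as np

V, T, D = 6, 2, 3
p_w_given_z = np.array([[0.4, 0.0],               # V x T: each column is a topic's word distribution
                        [0.3, 0.0],
                        [0.2, 0.1],
                        [0.1, 0.1],
                        [0.0, 0.3],
                        [0.0, 0.5]])
p_z_given_d = np.array([[1.0, 0.5, 0.0],          # T x D: each column is a document's topic weights
                        [0.0, 0.5, 1.0]])
p_w_given_d = p_w_given_z @ p_z_given_d           # V x D: word distribution of each document
assert np.allclose(p_w_given_d.sum(axis=0), 1.0)  # each column is a proper distribution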

Interpretable topics (each row below is one topic from the original slide's columns, words ordered by P(w|z)):
- STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
- MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
- WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
- DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
- FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
- SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
- BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
- JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE

Handling multiple senses: the same topics as on the previous slide (words ordered by P(w|z)). Note that a word such as FIELD appears in several of the topics (magnetism, science, sports, jobs), with a different sense in each.

Natural statistics
[Figure: the example documents (bags of words from the earlier slide) on one side, the semantic representation on the other: a topic containing HEART LOVE SOUL TEARS JOY … and a topic containing SCIENTIFIC KNOWLEDGE WORK RESEARCH MATHEMATICS …; the question is how to recover the representation from the documents.]

Inverting the generative model
- Maximum likelihood estimation (EM): e.g. Hofmann (1999)
- Deterministic approximate algorithms: variational EM (Blei, Ng & Jordan, 2001; 2003); expectation propagation (Minka & Lafferty, 2002)
- Markov chain Monte Carlo: full Gibbs sampler (Pritchard et al., 2000); collapsed Gibbs sampler (Griffiths & Steyvers, 2004)

The collapsed Gibbs sampler
- Using the conjugacy of the Dirichlet and multinomial distributions, integrate out the continuous parameters θ and φ.
- This defines a distribution on discrete ensembles z.
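Concretely, integrating out θ and φ yields the per-token conditional used for resampling, in the form given by Griffiths & Steyvers (2004); here n denotes counts computed with token i excluded, V is the vocabulary size, and T the number of topics:

P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w})
  \;\propto\;
  \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + V\beta}
  \cdot
  \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}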

The collapsed Gibbs sampler
Sample each z_i conditioned on z_{-i}. This is nicer than your average Gibbs sampler:
- memory: counts can be cached in two sparse matrices
- optimization: no special functions, simple arithmetic
- the distributions on θ and φ are analytic given z and w, and can later be found for each sample
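A minimal Python sketch of such a sampler, using the conditional above; it uses dense count arrays rather than sparse matrices, and the function name and default settings are made up for illustration:

# Collapsed Gibbs sampler for LDA, following Griffiths & Steyvers (2004).
# `docs` is assumed to be a list of arrays of word ids (e.g. the toy corpus above).
import numpy as np

def gibbs_lda(docs, V, T, alpha=0.1, beta=0.01, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    nwt = np.zeros((V, T))            # word-topic counts
    ndt = np.zeros((D, T))            # document-topic counts
    nt = np.zeros(T)                  # total tokens assigned to each topic
    z = []                            # topic assignment of every token
    # random initialization
    for d, doc in enumerate(docs):
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]           # remove the token's current assignment from the counts
                nwt[w, t] -= 1; ndt[d, t] -= 1; nt[t] -= 1
                # conditional P(z_i = j | z_-i, w); the per-document normalizer
                # is constant across topics and is dropped
                p = (nwt[w] + beta) / (nt + V * beta) * (ndt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1
    # analytic point estimates of phi and theta from the final sample
    phi = (nwt + beta) / (nt + V * beta)                          # columns: P(w | topic)
    theta = (ndt + alpha) / (ndt.sum(1, keepdims=True) + T * alpha)   # rows: P(topic | doc)
    return z, phi, theta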

Gibbs sampling in LDA
[Animation over several slides: a table of tokens with their topic assignments, resampled token by token; snapshots shown at iteration 1, iteration 2, …, iteration 1000.]

Effects of hyperparameters
- α and β control the relative sparsity of θ and φ: smaller α means fewer topics per document; smaller β means fewer words per topic.
- Good assignments z are a compromise in sparsity.
[Plot: log Γ(x) as a function of x]
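One quick way to see this effect (a toy illustration, not from the slides): draw Dirichlet samples with progressively smaller hyperparameters and count how much of the mass concentrates on a few components.

# Sparsity of Dirichlet samples as the hyperparameter shrinks.
import numpy as np

rng = np.random.default_rng(1)
for a in (10.0, 1.0, 0.1, 0.01):
    theta = rng.dirichlet(np.full(20, a))
    print(f"alpha = {a:>5}: {np.sum(theta > 0.05)} of 20 components above 0.05")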

Varying α
[Figure: sample topic distributions; decreasing α increases sparsity.]

Varying β
[Figure: sample word distributions. Decreasing β increases sparsity?]

Number of words per topic
[Histogram: topics on the x-axis, number of words assigned to each topic on the y-axis.]

Extension: a model for meetings (Purver, Kording, Griffiths, & Tenenbaum, 2006)
[Plate diagram: N_u words per utterance, U utterances, T topics; each utterance u has a topic-shift indicator s_u and a topic distribution θ(u) linked to θ(u-1).]
- θ(u) | s_u = 0 ~ Delta(θ(u-1))   (no shift: keep the previous utterance's topic distribution)
- θ(u) | s_u = 1 ~ Dirichlet(α)    (shift: draw a new topic distribution)
- z_i ~ Discrete(θ(u))
- φ(j) ~ Dirichlet(β)
- w_i ~ Discrete(φ(z_i))

Sample of ICSI meeting corpus (25 meetings)
no it's o_k. it's it'll work. well i can do that. but then i have to end the presentation in the middle so i can go back to open up javabayes. o_k fine. here let's see if i can. alright. very nice. is that better. yeah. o_k. uh i'll also get rid of this click to add notes. o_k. perfect.
NEW TOPIC (not supplied to algorithm)
so then the features we decided or we decided we were talked about. right. uh the the prosody the discourse verb choice. you know we had a list of things like to go and to visit and what not. the landmark-iness of uh. i knew you'd like that. nice coinage. thank you.

Topic segmentation applied to meetings
[Figure: the inferred segmentation and the inferred topics for a meeting.]

Comparison with human judgments
Topics recovered are much more coherent than those found using random segmentation, no segmentation, or an HMM.

Learning the number of topics
- Can use standard Bayes factor methods to evaluate models of different dimensionality, e.g. importance sampling via MCMC.
- Alternative: nonparametric Bayes
  - fixed number of topics per document, unbounded number of topics per corpus (Blei, Griffiths, Jordan, & Tenenbaum, 2004)
  - unbounded number of topics for both (the hierarchical Dirichlet process; Teh, Jordan, Beal, & Blei, 2004)

The Author-Topic model (Rosen-Zvi, Griffiths, Smyth, & Steyvers, 2004)
[Plate diagram: N_d words per document, D documents, A authors, T topics.]
- x_i ~ Uniform(A(d))       the author of each word is chosen uniformly at random from the document's authors
- θ(a) ~ Dirichlet(α)       each author has a distribution over topics
- z_i ~ Discrete(θ(x_i))    topic assignment drawn from that author's topic distribution
- φ(j) ~ Dirichlet(β)
- w_i ~ Discrete(φ(z_i))
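A minimal sketch of the Author-Topic generative process, reusing the toy vocabulary and topic sizes from the LDA sketch above (all settings hypothetical):

# Author-Topic generative process: pick an author per word, then a topic, then a word.
import numpy as np

rng = np.random.default_rng(2)
V, T, A, alpha, beta = 10, 3, 4, 0.1, 0.01
phi = rng.dirichlet(np.full(V, beta), size=T)      # per-topic word distributions
theta = rng.dirichlet(np.full(T, alpha), size=A)   # per-author topic distributions

def generate_doc(authors, n_words):
    x = rng.choice(authors, size=n_words)                    # author of each word, uniform over the doc's authors
    z = np.array([rng.choice(T, p=theta[a]) for a in x])     # topic from that author's distribution
    w = np.array([rng.choice(V, p=phi[t]) for t in z])       # word from the assigned topic
    return x, z, w

x, z, w = generate_doc(authors=[0, 2], n_words=15)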

Four example topics from NIPS

Who wrote what? (the number after each word indicates which author the model attributes it to)
A method 1 is described which like the kernel 1 trick 1 in support 1 vector 1 machines 1 SVMs 1 lets us generalize distance 1 based 2 algorithms to operate in feature 1 spaces usually nonlinearly related to the input 1 space This is done by identifying a class of kernels 1 which can be represented as norm 1 based 2 distances 1 in Hilbert spaces It turns 1 out that common kernel 1 algorithms such as SVMs 1 and kernel 1 PCA 1 are actually really distance 1 based 2 algorithms and can be run 2 with that class of kernels 1 too As well as providing 1 a useful new insight 1 into how these algorithms work the present 2 work can form the basis 1 for conceiving new algorithms
This paper presents 2 a comprehensive approach for model 2 based 2 diagnosis 2 which includes proposals for characterizing and computing 2 preferred 2 diagnoses 2 assuming that the system 2 description 2 is augmented with a system 2 structure 2 a directed 2 graph 2 explicating the interconnections between system 2 components 2 Specifically we first introduce the notion of a consequence 2 which is a syntactically 2 unconstrained propositional 2 sentence 2 that characterizes all consistency 2 based 2 diagnoses 2 and show 2 that standard 2 characterizations of diagnoses 2 such as minimal conflicts 1 correspond to syntactic 2 variations 1 on a consequence 2 Second we propose a new syntactic 2 variation on the consequence 2 known as negation 2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm 2 for computing consequences in NNF given a structured system 2 description We show that if the system 2 structure 2 does not contain cycles 2 then there is always a linear size 2 consequence 2 in NNF which can be computed in linear time 2 For arbitrary 1 system 2 structures 2 we show a precise connection between the complexity 2 of computing 2 consequences and the topology of the underlying system 2 structure 2 Finally we present 2 an algorithm 2 that enumerates 2 the preferred 2 diagnoses 2 characterized by a consequence 2 The algorithm 2 is shown 1 to take linear time 2 in the size 2 of the consequence 2 if the preference criterion 1 satisfies some general conditions
Written by (1) Scholkopf_B. Written by (2) Darwiche_A.

Analysis of PNAS abstracts (Griffiths & Steyvers, 2004)
- Test topic models with a real database of scientific papers from PNAS.
- All 28,154 abstracts.
- All words occurring in at least five abstracts and not on the "stop" list (20,551 word types).
- Total of 3,026,970 tokens in the corpus.

A selection of topics (each row below is one topic column from the original slide):
- FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
- HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
- MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
- STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
- NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
- TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

Cold topics, hot topics
[Figure, built up over several slides: topics whose prevalence in PNAS increases ("hot") or decreases ("cold") over time.]
Hot topics:
- 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
- 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
- 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION
Cold topics:
- 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
- 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
- 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING