
1 Statistical NLP Winter 2009 Lecture 5: Unsupervised text categorization Roger Levy Many thanks to Tom Griffiths for use of his slides

2 Where we left off
Supervised text categorization through Naïve Bayes.
Generative model: first generate a document category, then the words in the document (unigram model).
Inference: obtain the posterior over document categories using Bayes rule (argmax to choose the category).
[Graphical model: Cat → w1, w2, w3, w4, w5, w6, …]
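As a concrete reminder, here is a minimal sketch of that inference step; the categories, probability tables, and document are toy values invented for illustration:

```python
import math

# Toy parameters (illustrative values, not from the lecture).
p_cat = {"sports": 0.5, "science": 0.5}               # P(Cat)
p_word = {                                            # P(w|Cat), unigram model
    "sports":  {"ball": 0.5, "team": 0.4, "atom": 0.1},
    "science": {"ball": 0.1, "team": 0.1, "atom": 0.8},
}

def posterior(doc):
    """P(Cat|doc) via Bayes rule; the argmax gives the predicted category."""
    log_joint = {c: math.log(p_cat[c]) + sum(math.log(p_word[c][w]) for w in doc)
                 for c in p_cat}
    m = max(log_joint.values())                       # subtract max for stability
    unnorm = {c: math.exp(v - m) for c, v in log_joint.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}

post = posterior(["atom", "team", "atom"])
print(post, "->", max(post, key=post.get))
```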

3 What we're doing today
Supervised categorization requires hand-labeling documents, which can be extremely time-consuming.
Unlabeled documents are cheap, so we'd really like to do unsupervised text categorization.
Today we'll look at unsupervised learning within the Naïve Bayes model.

4 Compact graphical model representations
We're going to lean heavily on graphical model representations here. We'll use a more compact notation:
[Diagram: the expanded model Cat → w1, w2, …, wn is collapsed into a single node w inside a box (a "plate") labeled n, read as "generate a word from Cat n times".]

5 Now suppose that Cat isn't observed
We need to learn two distributions: P(Cat) and P(w|Cat). How do we do this?
We might use the method of maximum likelihood (MLE), but it turns out that the likelihood surface is highly non-convex, and lots of information isn't contained in a point estimate.
Alternative: Bayesian methods.
[Plate diagram: unobserved Cat generating w, repeated n times]

6 Bayesian document categorization
[Plate diagram: Dirichlet priors on both P(Cat) and P(w|Cat); for each of D documents, a category Cat is drawn and then its n_D words w.]

7 Latent Dirichlet allocation (Blei, Ng, & Jordan, 2001; 2003)
θ(d) ~ Dirichlet(α): distribution over topics for each document
z_i ~ Discrete(θ(d)): topic assignment for each word
φ(j) ~ Dirichlet(β), j = 1, …, T: distribution over words for each topic
w_i ~ Discrete(φ(z_i)): word generated from its assigned topic
α and β are Dirichlet priors.
[Plate diagram: D documents with N_d words each; T topic-word distributions]
Main difference from Naïve Bayes: one topic per word rather than one category per document.
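A runnable sketch of this generative process; the sizes, hyperparameter values, and seed are toy assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, N, V = 2, 4, 10, 5        # topics, documents, words per doc, vocab (toy)
alpha, beta = 0.5, 0.5          # symmetric Dirichlet hyperparameters (assumed)

phi = rng.dirichlet([beta] * V, size=T)        # phi(j) ~ Dirichlet(beta)
docs = []
for d in range(D):
    theta = rng.dirichlet([alpha] * T)         # theta(d) ~ Dirichlet(alpha)
    z = rng.choice(T, size=N, p=theta)         # z_i ~ Discrete(theta(d))
    docs.append([rng.choice(V, p=phi[j]) for j in z])  # w_i ~ Discrete(phi(z_i))

print(docs[0])                                 # word ids of document 0
```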

8 A generative model for documents
topic 1, P(w|Cat = 1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2 (all other words 0.0)
topic 2, P(w|Cat = 2): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2 (all other words 0.0)

9 Choose mixture weights for each document, generate "bag of words"
Mixture weights {P(z = 1), P(z = 2)} and the resulting sample documents:
{0, 1}: MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK
{0.25, 0.75}: SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS
{0.5, 0.5}: KNOWLEDGE HEART MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART
{0.75, 0.25}: WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE
{1, 0}: SOUL TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
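A sketch of exactly this step, using the two toy topics from the previous slide; the random seed and the ten-word document length are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]
# P(w|z) for the two topics from the previous slide.
phi = np.array([
    [0.2] * 5 + [0.0] * 5,   # topic 1: love words
    [0.0] * 5 + [0.2] * 5,   # topic 2: science words
])

for theta in [(0, 1), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1, 0)]:
    z = rng.choice(2, size=10, p=theta)                  # topic for each word
    words = [vocab[rng.choice(10, p=phi[zi])] for zi in z]  # word from its topic
    print(theta, " ".join(words))
```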

10 Dirichlet priors
Multivariate equivalent of the Beta distribution.
Hyperparameters α determine the form of the prior.
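For reference, the density behind this slide (a standard fact; the formula did not survive extraction): a Dirichlet over the (T-1)-simplex with hyperparameters α = (α_1, …, α_T) has

$$ p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{j=1}^{T}\alpha_j\right)}{\prod_{j=1}^{T}\Gamma(\alpha_j)} \prod_{j=1}^{T}\theta_j^{\alpha_j-1}, \qquad \theta_j \ge 0,\; \sum_j \theta_j = 1, $$

and reduces to the Beta distribution when T = 2.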

11 Matrix factorization interpretation (Hofmann, 1999)
The words × documents matrix P(w) factorizes into a words × topics matrix P(w|z) times a topics × documents matrix P(z).
Maximum-likelihood estimation is finding the factorization that minimizes KL divergence.
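In symbols, each document's word distribution is a mixture over topics:

$$ P(w \mid d) = \sum_{z=1}^{T} P(w \mid z)\, P(z \mid d), $$

so the words × documents matrix of conditional probabilities is the product of a words × topics factor and a topics × documents factor.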

12 Interpretable topics (each column shows words from a single topic, ordered by P(w|z))
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE

13 Handling multiple senses (same topics as on the previous slide, each column ordered by P(w|z))
The same word can have high probability in several topics: FIELD appears in the magnetism, science, sports, and jobs topics, and WORK in the science and jobs topics.

14 Natural statistics
[Figure: the sample documents from the previous slides on the left; on the right, their semantic representation as two topics (HEART LOVE SOUL TEARS JOY …, SCIENTIFIC KNOWLEDGE WORK RESEARCH MATHEMATICS …); a "?" marks the mapping from documents to representation that has to be inferred.]

15 Inverting the generative model
Maximum likelihood estimation (EM): e.g. Hofmann (1999)
Deterministic approximate algorithms: variational EM (Blei, Ng, & Jordan, 2001; 2003); expectation propagation (Minka & Lafferty, 2002)
Markov chain Monte Carlo: full Gibbs sampler (Pritchard et al., 2000); collapsed Gibbs sampler (Griffiths & Steyvers, 2004)

16 The collapsed Gibbs sampler
Using the conjugacy of the Dirichlet and multinomial distributions, integrate out the continuous parameters θ and φ.
This defines a distribution directly on the discrete ensembles of topic assignments z.
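The integrated likelihood this step relies on (a standard result from Griffiths & Steyvers, 2004; the slide's own formula did not survive extraction) is

$$ P(\mathbf{w} \mid \mathbf{z}) = \left(\frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}}\right)^{T} \prod_{j=1}^{T} \frac{\prod_{w}\Gamma\!\left(n_j^{(w)} + \beta\right)}{\Gamma\!\left(n_j^{(\cdot)} + W\beta\right)}, $$

where $n_j^{(w)}$ counts assignments of word $w$ to topic $j$, $W$ is the vocabulary size, and $T$ the number of topics; an analogous product of Dirichlet-multinomial terms gives $P(\mathbf{z})$.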

17 The collapsed Gibbs sampler
Sample each z_i conditioned on z_-i (the assignments of all other words).
This is nicer than your average Gibbs sampler:
memory: counts can be cached in two sparse matrices
optimization: no special functions, just simple arithmetic
the distributions on θ and φ are analytic given z and w, and can later be recovered for each sample
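Putting the two pieces together gives the conditional each step samples from (Griffiths & Steyvers, 2004):

$$ P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}. $$

A compact, unoptimized sketch of the sampler (dense count matrices rather than the sparse ones the slide recommends; the hyperparameter defaults are assumptions):

```python
import numpy as np

def collapsed_gibbs(docs, T, V, alpha=0.1, beta=0.01, iters=1000, seed=0):
    """docs: list of lists of word ids in [0, V). Returns topic assignments z."""
    rng = np.random.default_rng(seed)
    nwt = np.zeros((V, T))             # n_j^(w): word-topic counts
    ndt = np.zeros((len(docs), T))     # n_j^(d): document-topic counts
    nt = np.zeros(T)                   # n_j^(.): total words per topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]   # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            nwt[w, z[d][i]] += 1; ndt[d, z[d][i]] += 1; nt[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]            # remove z_i from the counts (the "-i")
                nwt[w, j] -= 1; ndt[d, j] -= 1; nt[j] -= 1
                # word term * document term; the document denominator
                # (n^(d) + T*alpha) is constant in j, so it is dropped
                p = (nwt[w] + beta) / (nt + V * beta) * (ndt[d] + alpha)
                j = rng.choice(T, p=p / p.sum())
                z[d][i] = j            # record the new assignment, restore counts
                nwt[w, j] += 1; ndt[d, j] += 1; nt[j] += 1
    return z
```

Only integer counts and simple arithmetic appear in the inner loop, which is the "no special functions" point above.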

18-26 Gibbs sampling in LDA
[Animation across slides 18-26: each word's topic assignment z_i is resampled in turn, stepping through iteration 1, then iteration 2, and so on up to iteration 1000.]

27 Effects of hyperparameters
α and β control the relative sparsity of θ and φ:
smaller α: fewer topics per document
smaller β: fewer words per topic
Good assignments z are a compromise between the two kinds of sparsity.
[Plot: log Γ(x) as a function of x]

28 Varying α
Decreasing α increases sparsity: fewer topics per document.
[Figure: document-topic distributions θ under different values of α]

29 Varying β
Decreasing β increases sparsity: fewer words per topic.
[Figure: topic-word distributions φ under different values of β]

30 Number of words per topic
[Plot: number of words (y-axis) per topic (x-axis)]

31 Extension: a model for meetings (Purver, Kording, Griffiths, & Tenenbaum, 2006)
θ(u) | s_u = 0 ~ Delta(θ(u-1)): no topic shift, so the previous utterance's topic distribution is kept
θ(u) | s_u = 1 ~ Dirichlet(α): topic shift, so a fresh topic distribution is drawn
z_i ~ Discrete(θ(u))
φ(j) ~ Dirichlet(β), j = 1, …, T
w_i ~ Discrete(φ(z_i))
[Plate diagram: U utterances with N_u words each; a switch variable s_u decides whether utterance u keeps or resamples its topic distribution.]
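A sketch of just the utterance-level switching process; the shift probability pi is an assumption, since the slide does not specify the prior on s_u:

```python
import numpy as np

rng = np.random.default_rng(2)
T, U = 5, 20                  # topics, utterances (toy sizes)
alpha, pi = 0.5, 0.2          # Dirichlet prior; P(s_u = 1) = pi is assumed

theta = rng.dirichlet([alpha] * T)        # topic distribution for utterance 1
for u in range(1, U):
    if rng.random() < pi:                 # s_u = 1: topic shift,
        theta = rng.dirichlet([alpha] * T)  # draw a fresh distribution
    # s_u = 0: theta stays equal to the previous utterance's theta (the Delta)
    # ... generate the words of utterance u from theta as in LDA ...
```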

32 Sample of ICSI meeting corpus (25 meetings)
no it's o_k. it's it'll work. well i can do that. but then i have to end the presentation in the middle so i can go back to open up javabayes. o_k fine. here let's see if i can. alright. very nice. is that better. yeah. o_k. uh i'll also get rid of this click to add notes. o_k. perfect
NEW TOPIC (not supplied to algorithm)
so then the features we decided or we decided we were talked about. right. uh the the prosody the discourse verb choice. you know we had a list of things like to go and to visit and what not. the landmark-iness of uh. i knew you'd like that. nice coinage. thank you.

33 Topic segmentation applied to meetings
[Figure: the inferred segmentation alongside the inferred topics]

34 Comparison with human judgments
Topics recovered are much more coherent than those found using random segmentation, no segmentation, or an HMM.

35 Learning the number of topics
Can use standard Bayes factor methods to evaluate models of different dimensionality, e.g. importance sampling via MCMC.
Alternative: nonparametric Bayes:
fixed number of topics per document, unbounded number of topics per corpus (Blei, Griffiths, Jordan, & Tenenbaum, 2004)
unbounded number of topics for both (the hierarchical Dirichlet process; Teh, Jordan, Beal, & Blei, 2004)

36 The Author-Topic model (Rosen-Zvi, Griffiths, Smyth, & Steyvers, 2004)
x_i ~ Uniform(A(d)): the author of each word is chosen uniformly at random from the document's authors
θ(a) ~ Dirichlet(α): each author has a distribution over topics
z_i ~ Discrete(θ(x_i)): the topic is drawn from the chosen author's distribution
φ(j) ~ Dirichlet(β), j = 1, …, T
w_i ~ Discrete(φ(z_i))
[Plate diagram: A authors, T topics, D documents with N_d words each]
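A generative sketch under the same toy assumptions as the earlier LDA sketch; the author set, sizes, and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
A, T, V, N = 3, 2, 5, 10            # authors, topics, vocab, words/doc (toy)
alpha, beta = 0.5, 0.5

theta = rng.dirichlet([alpha] * T, size=A)   # one topic distribution per author
phi = rng.dirichlet([beta] * V, size=T)      # one word distribution per topic

authors_d = [0, 2]                           # authors of this document (assumed)
x = rng.choice(authors_d, size=N)            # author of each word, uniform
z = [rng.choice(T, p=theta[xi]) for xi in x] # topic from that author's theta
w = [rng.choice(V, p=phi[zi]) for zi in z]   # word from the assigned topic
```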

37 Four example topics from NIPS

38 Who wrote what? (the digit after each word marks its inferred author: 1 = Scholkopf_B, 2 = Darwiche_A)
A method 1 is described which like the kernel 1 trick 1 in support 1 vector 1 machines 1 SVMs 1 lets us generalize distance 1 based 2 algorithms to operate in feature 1 spaces usually nonlinearly related to the input 1 space This is done by identifying a class of kernels 1 which can be represented as norm 1 based 2 distances 1 in Hilbert spaces It turns 1 out that common kernel 1 algorithms such as SVMs 1 and kernel 1 PCA 1 are actually really distance 1 based 2 algorithms and can be run 2 with that class of kernels 1 too As well as providing 1 a useful new insight 1 into how these algorithms work the present 2 work can form the basis 1 for conceiving new algorithms
Written by (1) Scholkopf_B
This paper presents 2 a comprehensive approach for model 2 based 2 diagnosis 2 which includes proposals for characterizing and computing 2 preferred 2 diagnoses 2 assuming that the system 2 description 2 is augmented with a system 2 structure 2 a directed 2 graph 2 explicating the interconnections between system 2 components 2 Specifically we first introduce the notion of a consequence 2 which is a syntactically 2 unconstrained propositional 2 sentence 2 that characterizes all consistency 2 based 2 diagnoses 2 and show 2 that standard 2 characterizations of diagnoses 2 such as minimal conflicts 1 correspond to syntactic 2 variations 1 on a consequence 2 Second we propose a new syntactic 2 variation on the consequence 2 known as negation 2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm 2 for computing consequences in NNF given a structured system 2 description We show that if the system 2 structure 2 does not contain cycles 2 then there is always a linear size 2 consequence 2 in NNF which can be computed in linear time 2 For arbitrary 1 system 2 structures 2 we show a precise connection between the complexity 2 of computing 2 consequences and the topology of the underlying system 2 structure 2 Finally we present 2 an algorithm 2 that enumerates 2 the preferred 2 diagnoses 2 characterized by a consequence 2 The algorithm 2 is shown 1 to take linear time 2 in the size 2 of the consequence 2 if the preference criterion 1 satisfies some general conditions
Written by (2) Darwiche_A

39 Analysis of PNAS abstracts (Griffiths & Steyvers, 2004)
Test topic models on a real database of scientific papers from PNAS:
all 28,154 abstracts from 1991-2001
all words occurring in at least five abstracts and not on a "stop" list (vocabulary of 20,551 words)
3,026,970 tokens in the corpus

40 A selection of topics (each column shows the top words from one topic)
FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

41 Cold topics and hot topics
[Figure: topic prevalence in the PNAS abstracts over 1991-2001; some topics rise ("hot") while others fall ("cold").]

42 Cold topics and hot topics
Hot topics (increasing in prevalence):
2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION

43 Cold topics and hot topics
Hot topics (increasing in prevalence):
2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION
Cold topics (decreasing in prevalence):
37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING

