Topics in statistical language modeling Tom Griffiths

Mark Steyvers (UC Irvine), Josh Tenenbaum (MIT), Dave Blei (CMU), Mike Jordan (UC Berkeley)

Latent Dirichlet Allocation (LDA)
– Each document is a mixture of topics
– Each word is chosen from a single topic
– Introduced by Blei, Ng, and Jordan (2001); a reinterpretation of PLSI (Hofmann, 1999)
– The idea of probabilistic topics is widely used (e.g., Bigi et al., 1997; Iyer & Ostendorf, 1996; Ueda & Saito, 2003)

Latent Dirichlet Allocation (LDA): each document is a mixture of topics with mixing proportions θ(d); each word is chosen from a single topic z, from parameters φ(z) (the word distribution of that topic).

Two example topics, φ(1) = P(w | z = 1) and φ(2) = P(w | z = 2):
– topic 1: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2; SCIENTIFIC 0.0, KNOWLEDGE 0.0, WORK 0.0, RESEARCH 0.0, MATHEMATICS 0.0
– topic 2: HEART 0.0, LOVE 0.0, SOUL 0.0, TEARS 0.0, JOY 0.0; SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2

Choose mixture weights θ = {P(z = 1), P(z = 2)} for each document, then generate a "bag of words". Example weights: {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0}; the generated documents shift from all topic-2 words (MATHEMATICS, KNOWLEDGE, RESEARCH, WORK, SCIENTIFIC) to all topic-1 words (HEART, LOVE, SOUL, TEARS, JOY) as the weights change.

Generating a document
1. Choose θ(d) ~ Dirichlet(α)
2. For each word in the document:
– choose z ~ Multinomial(θ(d))
– choose w ~ Multinomial(φ(z))
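A minimal sketch of this generative process in Python (not code from the talk; the function and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(n_words, alpha, phi):
    """LDA generative process: theta_d ~ Dirichlet(alpha), then for each word
    z ~ Multinomial(theta_d) and w ~ Multinomial(phi[z])."""
    T, W = phi.shape                           # T topics over a W-word vocabulary
    theta = rng.dirichlet([alpha] * T)         # mixing proportions for this document
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)             # pick a topic for this token
        words.append(rng.choice(W, p=phi[z]))  # pick a word from that topic
    return words
```

With the two toy five-word topics above as the rows of phi, generated documents mix "love" words and "science" words in roughly the proportions given by θ(d), as in the earlier bag-of-words example.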

Inverting the generative model
– The generative model gives a procedure for obtaining a corpus from topics and mixing proportions
– Inverting the model extracts the topics φ and mixing proportions θ from a corpus
– Goal: describe the content of documents, and be able to identify the content of new documents
– All inference is completely unsupervised, with a fixed number of topics T, words W, and documents D

Inverting the generative model
– Maximum likelihood estimation (EM), e.g., Hofmann (1999): slow, local maxima
– Approximate E-steps: VB (Blei, Ng & Jordan, 2001); EP (Minka & Lafferty, 2002)
– Bayesian inference via Gibbs sampling

Gibbs sampling in LDA
– The full posterior P(z | w) = P(w | z) P(z) / Σ_z P(w | z) P(z) is tractable only up to a constant: the sum in the denominator ranges over T^n assignments
– The numerator rewards sparsity in the assignment of words to topics and of topics to documents
– So we use Markov chain Monte Carlo (MCMC)
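For reference, the collapsed distributions this slide alludes to can be written as follows; this is a reconstruction in the standard form from Griffiths & Steyvers (2004), using the count notation defined on the Gibbs sampling slide below, not text taken from the slides:

```latex
P(\mathbf{w}\mid\mathbf{z}) =
  \left(\frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}}\right)^{T}
  \prod_{j=1}^{T}\frac{\prod_{w}\Gamma\!\left(n_{j}^{(w)}+\beta\right)}
                      {\Gamma\!\left(n_{j}^{(\cdot)}+W\beta\right)},
\qquad
P(\mathbf{z}) =
  \left(\frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}\right)^{D}
  \prod_{d=1}^{D}\frac{\prod_{j}\Gamma\!\left(n_{j}^{(d)}+\alpha\right)}
                      {\Gamma\!\left(n_{\cdot}^{(d)}+T\alpha\right)},
\qquad
P(\mathbf{z}\mid\mathbf{w}) =
  \frac{P(\mathbf{w}\mid\mathbf{z})\,P(\mathbf{z})}
       {\sum_{\mathbf{z}'}P(\mathbf{w}\mid\mathbf{z}')\,P(\mathbf{z}')}.
```

Here n_j^(w) counts assignments of word w to topic j and n_j^(d) counts uses of topic j in document d; the Gamma-function products in the numerators are what reward sparse assignments.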

Markov chain Monte Carlo
– Sample from a Markov chain constructed to converge to the target distribution
– Allows sampling from unnormalized posteriors and other complex distributions
– Can compute approximate statistics from intractable distributions
– Gibbs sampling is one such method: construct the Markov chain from conditional distributions

Gibbs sampling in LDA
We need the full conditional distributions for the variables. Since we only sample z, we need
P(z_i = j | z_-i, w) ∝ [(n_-i,j^(wi) + β) / (n_-i,j^(·) + Wβ)] × [(n_-i,j^(di) + α) / (n_-i,·^(di) + Tα)]
where n_-i,j^(w) is the number of times word w is assigned to topic j, and n_-i,j^(d) is the number of times topic j is used in document d, in both cases excluding the current token i.
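A minimal collapsed Gibbs update implementing this conditional (a sketch, not the talk's code; the array names nwz, ndz, nz and the interface are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_update(i, w, d, z, nwz, ndz, nz, alpha, beta):
    """Resample the topic of token i from P(z_i = j | z_-i, w).
    w[i], d[i]: word id and document id of token i; z[i]: its current topic.
    nwz[v, j]: count of word v assigned to topic j; ndz[doc, j]: count of topic j
    in document doc; nz[j]: total tokens assigned to topic j."""
    W, T = nwz.shape
    old = z[i]
    nwz[w[i], old] -= 1; ndz[d[i], old] -= 1; nz[old] -= 1   # remove token i from the counts
    # word term times document term; the document denominator is constant in j, so omitted
    p = (nwz[w[i], :] + beta) / (nz + W * beta) * (ndz[d[i], :] + alpha)
    p = p / p.sum()
    new = rng.choice(T, p=p)
    nwz[w[i], new] += 1; ndz[d[i], new] += 1; nz[new] += 1   # add it back with the new topic
    z[i] = new
```

One sweep of this update over every token is one Gibbs iteration; the animated slides below show the assignments z evolving over such sweeps.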

Gibbs sampling in LDA: [animation showing the topic assignments z being resampled over iterations 1, 2, …, 1000]

Estimating topic distributions: parameter estimates come from the posterior predictive distributions,
φ̂_j^(w) = (n_j^(w) + β) / (n_j^(·) + Wβ) and θ̂_j^(d) = (n_j^(d) + α) / (n_·^(d) + Tα).
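These estimates are simple ratios of the Gibbs counts; a sketch using the same assumed array names as the sampler above:

```python
import numpy as np

def estimate_phi_theta(nwz, ndz, alpha, beta):
    """Posterior predictive point estimates:
    phi[w, j]   = (n_j^(w) + beta)  / (n_j^(.) + W*beta)   approximates P(w | z = j)
    theta[d, j] = (n_j^(d) + alpha) / (n^(d)   + T*alpha)  approximates P(z = j | d)"""
    W, T = nwz.shape
    phi = (nwz + beta) / (nwz.sum(axis=0, keepdims=True) + W * beta)
    theta = (ndz + alpha) / (ndz.sum(axis=1, keepdims=True) + T * alpha)
    return phi, theta
```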

A visual example: bars. Pixel = word, image = document; sample each pixel from a mixture of topics.

Strategy
– Markov chain Monte Carlo (MCMC) is normally slow, so why consider using it?
– In discrete models, conjugate priors reduce inference to discrete variables
– Several benefits: save memory (need only track sparse counts); save time (cheap updates, even with complex dependencies between variables)

Perplexity vs. time [figure] (not estimating the Dirichlet hyperparameters α, β)

Strategy (continued): these properties let us explore larger, more complex models.

Application to corpus data
– TASA corpus: text from first grade to college; over 20,000 word types in over 30,000 documents, approximately 6 million word tokens
– Run Gibbs sampling for models with T = 300, 500, …, 1700 topics

A selection from 500 topics [P(w | z = j)]:
– THEORY SCIENTISTS EXPERIMENT OBSERVATIONS SCIENTIFIC EXPERIMENTS HYPOTHESIS EXPLAIN SCIENTIST OBSERVED EXPLANATION BASED OBSERVATION IDEA EVIDENCE THEORIES BELIEVED DISCOVERED OBSERVE FACTS
– SPACE EARTH MOON PLANET ROCKET MARS ORBIT ASTRONAUTS FIRST SPACECRAFT JUPITER SATELLITE SATELLITES ATMOSPHERE SPACESHIP SURFACE SCIENTISTS ASTRONAUT SATURN MILES
– ART PAINT ARTIST PAINTING PAINTED ARTISTS MUSEUM WORK PAINTINGS STYLE PICTURES WORKS OWN SCULPTURE PAINTER ARTS BEAUTIFUL DESIGNS PORTRAIT PAINTERS
– STUDENTS TEACHER STUDENT TEACHERS TEACHING CLASS CLASSROOM SCHOOL LEARNING PUPILS CONTENT INSTRUCTION TAUGHT GROUP GRADE SHOULD GRADES CLASSES PUPIL GIVEN
– BRAIN NERVE SENSE SENSES ARE NERVOUS NERVES BODY SMELL TASTE TOUCH MESSAGES IMPULSES CORD ORGANS SPINAL FIBERS SENSORY PAIN IS
– CURRENT ELECTRICITY ELECTRIC CIRCUIT IS ELECTRICAL VOLTAGE FLOW BATTERY WIRE WIRES SWITCH CONNECTED ELECTRONS RESISTANCE POWER CONDUCTORS CIRCUITS TUBE NEGATIVE

A selection from 500 topics [P(w | z = j)]:
– STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
– MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
– FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
– SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
– BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
– JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE

Evaluation: Word association (Nelson, McEvoy & Schreiber, 1998). Cue: PLANET. Associates: EARTH, PLUTO, JUPITER, NEPTUNE, VENUS, URANUS, SATURN, COMET, MARS, ASTEROID.

Evaluation: Word association [figure: matrix of cues by associates]

Evaluation: Word association, comparison with Latent Semantic Analysis (LSA; Landauer & Dumais, 1997)
– Both algorithms applied to the TASA corpus (D > 30,000, W > 20,000, n > 6,000,000)
– Compare the LSA cosine and inner product with the "on-topic" conditional probability
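For concreteness, a sketch of the LSA side of this comparison (rank-k SVD followed by cosine similarity between word vectors); this is illustrative code, not what was used in the talk, and it omits the term weighting usually applied to the co-occurrence matrix:

```python
import numpy as np

def lsa_word_cosines(X, k):
    """X: word-by-document co-occurrence matrix; returns cosine similarities
    between word vectors in the k-dimensional LSA space."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) Vt
    word_vecs = U[:, :k] * s[:k]                      # scaled word coordinates
    norms = np.linalg.norm(word_vecs, axis=1, keepdims=True)
    unit = word_vecs / np.clip(norms, 1e-12, None)
    return unit @ unit.T                              # cosine similarity matrix
```

The inner-product comparison mentioned above would use word_vecs @ word_vecs.T directly, without the normalization.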

Latent Semantic Analysis (Landauer & Dumais, 1997): the word-by-document co-occurrence matrix X is decomposed by SVD, X = U D V^T, placing words in a high-dimensional semantic space.

Latent Semantic Analysis (Landauer & Dumais, 1997): X (words × documents) ≈ U D V^T, with U giving word vectors and V giving document vectors over a reduced number of dimensions. Dimensionality reduction makes storage efficient and extracts correlations.

Properties of word association
– Asymmetry
– Violation of the triangle inequality
– "Small world" graph
[illustrative graph over JOY, LOVE, RESEARCH, MATHEMATICS, MYSTERY]

Small world graph: treat the association matrix (cues by associates) as an adjacency matrix, where edges indicate positive association. [illustrative graph over JOY, LOVE, RESEARCH, MATHEMATICS, MYSTERY]

Small world graph
– Properties: short path lengths, clustering, power-law degree distribution
– Small world graphs arise elsewhere: social relations, biology, the internet

What is a power law distribution?

Exponential distribution: height. Power-law distribution: wealth.

A power law in word association (Steyvers & Tenenbaum). [log-log plot of word association data; k = number of cues. Example cue: PLANET; associates: EARTH, PLUTO, JUPITER, NEPTUNE, …]

The statistics of meaning
– Zipf's law of meaning: number of senses per word
– Roget's Thesaurus: number of classes per word
[log-log plot for Roget's Thesaurus; k = number of classes (Steyvers & Tenenbaum)]

Meanings and associations
– Word association involves only words: a unipartite graph (JOY, LOVE, RESEARCH, MATHEMATICS, MYSTERY)
– Meaning involves words and contexts: a bipartite graph (the same words linked to CONTEXT 1 and CONTEXT 2)

Meanings and associations
– A power law in the bipartite graph implies the same in the unipartite graph
– So the word association power law can be obtained from meanings

Power law in word association: WORDS IN SEMANTIC SPACES (Steyvers & Tenenbaum). [log-log plot of word association data; k = number of associations]

Power law in word association: Latent Semantic Analysis (Steyvers & Tenenbaum). [log-log plot of word association data; k = number of associations]

[figure: probability of containing the first associate vs. rank]

Meanings and associations. [log-log plots for the topic model: P(w | z = j), k = number of topics; P(w2 | w1), k = number of cues]

Problems
– Finding the right number of topics
– No dependencies between topics
– The "bag of words" assumption
– Need for a stop list

Problems and proposed solutions
– Finding the right number of topics; no dependencies between topics → CRP models (Blei, Jordan, Tenenbaum)
– The "bag of words" assumption; need for a stop list → HMM syntax (Steyvers, Blei, Tenenbaum)

Standard LDA: T corpus topics (1 … T); all T topics appear in each document (doc1, doc2, doc3).

In contrast: T corpus topics (1 … T), but only L topics appear in each document (doc1, doc2, doc3), with the topic identities indexed by c.

Richer dependencies
– The nature of the topic dependencies comes from the prior on assignments of topics to documents, p(c)
– Inference with Gibbs sampling is straightforward
– Boring prior: pick L topics from T uniformly
– Some interesting priors on assignments: the Chinese restaurant process (CRP); the nested CRP (for hierarchies)

Chinese restaurant process
– The mth customer at an infinitely large Chinese restaurant chooses an occupied table with probability proportional to the number of customers already seated there, and a new table with probability proportional to the concentration parameter (for occupied table k: n_k / (m - 1 + γ); new table: γ / (m - 1 + γ))
– Also Dirichlet process, infinite models (Beal, Ghahramani, Neal, Rasmussen)
– Prior on assignments: one topic on each table, L visits per document, T is unbounded
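A sketch of one CRP seating step under these probabilities (illustrative names; γ as the concentration parameter is an assumption, since the slide's symbol is not preserved):

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_seat(counts, gamma):
    """Seat the next (m-th) customer. counts[k] = customers already at table k.
    Returns a table index; len(counts) means 'open a new table'."""
    m = sum(counts) + 1
    probs = [n / (m - 1 + gamma) for n in counts]     # occupied tables
    probs.append(gamma / (m - 1 + gamma))             # a new table
    return int(rng.choice(len(probs), p=np.array(probs)))
```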

Generating a document
1. Choose c by sampling L tables from the Chinese restaurant, without replacement
2. Choose θ(d) ~ Dirichlet(α) (over L slots)
3. For each word in the document:
– choose z ~ Multinomial(θ(d))
– choose w ~ Multinomial(φ(c(z)))

Inverting the generative model
– Draw z as before, but conditioned on c
– Draw the entries of c one at a time from their conditional distribution given the other assignments, z, and w
– Need only track the occupied tables
– Recover the topics and the number of occupied tables

Model selection with the CRP. [figure: Chinese restaurant process prior and Bayes factor]

Nested CRP
– Infinitely many infinite-table restaurants
– Every table has a card for another restaurant, forming an infinitely branching tree
– An L-day vacation: visit the root restaurant the first night, go to the restaurant on the chosen table's card the next night, and so on
– Once inside a restaurant, choose the table (and hence the next restaurant) via the standard CRP
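A sketch of drawing one document's path through the nested CRP, reusing the same seating rule at every level (illustrative only; the 'tree' dictionary, node naming, and γ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def ncrp_path(tree, gamma, L):
    """Sample a length-L path from the root. tree maps a node name to a dict
    {child name: number of previous visits}; it is updated in place."""
    path, node = [], "root"
    for _ in range(L):
        children = sorted(tree.get(node, {}).items())
        m = sum(c for _, c in children) + 1
        probs = [c / (m - 1 + gamma) for _, c in children] + [gamma / (m - 1 + gamma)]
        k = int(rng.choice(len(probs), p=np.array(probs)))
        # pick an existing child restaurant, or open a brand-new one
        child = children[k][0] if k < len(children) else f"{node}/child{len(children)}"
        tree.setdefault(node, {})
        tree[node][child] = tree[node].get(child, 0) + 1
        path.append(child)
        node = child
    return path
```

Collecting such paths across documents yields the finite subtree of used topics described on the next slides.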

The nested CRP as a prior
– One topic per restaurant; each document has one topic at each of the L levels of a tree
– Each c is a path through the tree
– Collecting these paths from all documents gives a finite subtree of used topics
– Allows unsupervised learning of topic hierarchies
– Extends Hofmann's (1999) topic hierarchies

Generating a document
1. Choose c by sampling a path from the nested Chinese restaurant process
2. Choose θ(d) ~ Dirichlet(α) (over L slots)
3. For each word in the document:
– choose z ~ Multinomial(θ(d))
– choose w ~ Multinomial(φ(c(z)))

Inverting the generative model
– Draw z as before, but conditioned on c
– Draw c as a block from its conditional distribution given the other paths, z, and w
– Need only track previously taken paths
– Recover the topics and the set of paths (a finite subtree)

Twelve years of NIPS

Summary
– Letting document topics be a subset of the corpus topics allows richer dependencies
– Using Gibbs sampling makes it possible to have an unbounded number of corpus topics
– The flat model and hierarchies are only two options among many: factorial structures, arbitrary graphs, etc.

Problems and proposed solutions
– Finding the right number of topics; no dependencies between topics → CRP models (Blei, Jordan, Tenenbaum)
– The "bag of words" assumption; need for a stop list → HMM syntax (Steyvers, Tenenbaum)

Syntax and semantics from statistics
– Semantics: probabilistic topics, capturing long-range, document-specific dependencies
– Syntax: a probabilistic regular grammar (an HMM over word classes), capturing short-range dependencies that are constant across all documents
– A factorization of language based on these statistical dependency patterns
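A sketch of this composite generator (illustrative names; here class 0 plays the role of the semantic class, whereas the slides index it as x = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_sentence(n, theta, phi, class_dists, trans, start):
    """theta: the document's topic proportions; phi[j]: word distribution of topic j;
    class_dists[x]: word distribution of syntactic class x (unused for x = 0);
    trans[x]: HMM transition distribution out of class x; start: distribution over
    the first class."""
    words = []
    x = rng.choice(len(start), p=start)
    for _ in range(n):
        if x == 0:                                    # semantic class: emit from a topic
            z = rng.choice(len(theta), p=theta)
            words.append(rng.choice(len(phi[z]), p=phi[z]))
        else:                                         # syntactic class: emit from its own distribution
            words.append(rng.choice(len(class_dists[x]), p=class_dists[x]))
        x = rng.choice(len(trans[x]), p=trans[x])     # move to the next class via the HMM
    return words
```

With the example distributions on the next slide, runs through the semantic class produce content words (LOVE, RESEARCH) while the other classes supply function words (THE, OF).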

Example distributions for the composite model
– Topic z = 1: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
– Topic z = 2: SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
– One syntactic class: THE 0.6, A 0.3, MANY 0.1
– Another syntactic class: OF 0.6, FOR 0.3, BETWEEN 0.1
– Class x = 1 is the semantic class, which emits a word from the current topic z; the other classes emit from their own word distributions

Generating a sentence word by word: THE … → THE LOVE … → THE LOVE OF … → THE LOVE OF RESEARCH …, with THE and OF drawn from the syntactic classes, and LOVE and RESEARCH drawn from topics 1 and 2 via the semantic class.

Inverting the generative model
– Sample z conditioned on x and the other z (draw z from its prior when x > 1, since the word is then generated by a syntactic class)
– Sample x conditioned on z and the other x
– Inference allows estimation of "semantic" topics and "syntactic" classes

Semantic topics (selected)
– FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING FRUITS VEGETABLES WEIGHT FATS NEEDS CARBOHYDRATES VITAMINS CALORIES PROTEIN MINERALS
– MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST AUSTRALIA GLOBE POLES HEMISPHERE LATITUDE PLACES LAND WORLD COMPASS CONTINENTS
– DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE NURSING TREATMENT NURSES PHYSICIAN HOSPITALS DR SICK ASSISTANT EMERGENCY PRACTICE
– BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES GUIDE WORDS MATERIAL ARTICLE ARTICLES WORD FACTS AUTHOR REFERENCE NOTE
– GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM ORE ALUMINUM MINERAL MINE STONE MINERALS POT MINING MINERS TIN
– BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS INDIVIDUALS PSYCHOLOGICAL EXPERIENCES ENVIRONMENT HUMAN RESPONSES BEHAVIORS ATTITUDES PSYCHOLOGY PERSON
– CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING FUNGI MOLD MATERIALS NUCLEUS CELLED STRUCTURES MATERIAL STRUCTURE GREEN MOLDS
– PLANTS PLANT LEAVES SEEDS SOIL ROOTS FLOWERS WATER FOOD GREEN SEED STEMS FLOWER STEM LEAF ANIMALS ROOT POLLEN GROWING GROW

Syntactic classes (selected)
– GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE * BIG LONG HIGH DIFFERENT SPECIAL OLD STRONG YOUNG COMMON WHITE SINGLE CERTAIN
– THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A AN THAT NEW THOSE EACH MR ANY MRS ALL
– MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER LONGER FASTER EXACTLY SMALLER SOMETHING BIGGER FEWER LOWER ALMOST
– ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON TOWARD UNDER ALONG NEAR BEHIND OFF ABOVE DOWN BEFORE
– SAID ASKED THOUGHT TOLD SAYS MEANS CALLED CRIED SHOWS ANSWERED TELLS REPLIED SHOUTED EXPLAINED LAUGHED MEANT WROTE SHOWED BELIEVED WHISPERED
– ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY SEVERAL FOUR FIVE BOTH TEN SIX MUCH TWENTY EIGHT
– HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS SOMEONE WHO NOBODY ONE SOMETHING ANYONE EVERYBODY SOME THEN
– BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP KEEP GIVE LOOK COME WORK MOVE LIVE EAT BECOME

[Results: Bayes factors for different models; part-of-speech tagging performance]

NIPS semantics (selected topics)
– EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE
– DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS
– STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL *
– MEMBRANE SYNAPTIC CELL * CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS
– IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS # PIXEL VISUAL
– KERNEL SUPPORT VECTOR SVM KERNELS # SPACE FUNCTION MACHINES SET
– NETWORK NEURAL NETWORKS OUTPUT INPUT TRAINING INPUTS WEIGHTS # OUTPUTS
NIPS syntax (selected classes)
– MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS
– IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS
– SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST
– USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN
– IN WITH FOR ON FROM AT USING INTO OVER WITHIN
– HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY
– # * I X T N - C F P (a class of symbols and single letters)

Function and content words

Highlighting and templating

Open questions
– Are MCMC methods useful elsewhere? ("smoothing with negative weights"; Markov chains on grammars)
– Other nonparametric language models? (infinite HMM, infinite PCFG, clustering)
– Better ways of combining topics and syntax? (richer syntactic models; better combination schemes)