A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation
Frank Wood and Yee Whye Teh, Gatsby Computational Neuroscience Unit

Presentation transcript:

A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation
Frank Wood and Yee Whye Teh
Gatsby Computational Neuroscience Unit, supported by the Gatsby Charitable Foundation

Statistical Natural Language Modelling (SLM)
Learning distributions over sequences of discrete observations (words)
Useful for
–Automatic Speech Recognition (ASR): P(text | speech) ∝ P(speech | text) P(text)
–Machine Translation (MT): P(english | french) ∝ P(french | english) P(english)

Markov n-gram models
A discrete conditional distribution over the word that follows a given context
Tri-gram examples: mice | blind, three; league | premier, Barclays
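As a concrete illustration of what such a conditional distribution is, the sketch below builds plain maximum-likelihood tri-gram estimates from counts (the function names and the toy corpus are hypothetical; this naive estimator is exactly what the smoothing and hierarchical priors discussed later improve upon):

```python
from collections import defaultdict

def train_trigram_counts(tokens):
    """Count (context, word) pairs, where a context is the two preceding words."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(2, len(tokens)):
        context = (tokens[i - 2], tokens[i - 1])
        counts[context][tokens[i]] += 1
    return counts

def mle_prob(counts, context, word):
    """Maximum-likelihood estimate of P(word | context); zero if the context is unseen."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

corpus = "three blind mice see how they run".split()
counts = train_trigram_counts(corpus)
print(mle_prob(counts, ("three", "blind"), "mice"))  # 1.0 on this toy corpus
```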

Domain Adaptation
[Figure: data, estimation, and adaptation for a big domain and a small domain.]

Hierarchical Bayesian Approach
[Figure: corpora → corpus-specific models → corpus-independent model.]

Bayesian Domain Adaptation
[Equations: estimation goal; inference objective.]
Simultaneous estimation = automatic domain adaptation

Model Structure: The Likelihood
One domain-specific n-gram (Markov) model for each domain
[Equation labels: corpus; context; word following context; corpus-specific parameters; distribution over words that follow the given context.]

Model Structure: The Prior
Doubly hierarchical Pitman-Yor process language model (DHPYLM)
[Equations: prior (parameters θ); likelihood.]
A novel generalization of back-off
Every line of the generative specification is conditioned on everything to its right.

Pitman-Yor Process
Distribution over distributions; the base distribution is its mean
Generalization of the Dirichlet process (d = 0)
–Different (power-law) clustering properties
–Better for text [Teh, 2006]
[Equation labels: concentration; discount; base distribution.]
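In standard notation (a sketch; the slide's own equation is not reproduced here), with discount d, concentration θ, and base distribution G_0:

```latex
G \sim \mathrm{PY}(d, \theta, G_0), \qquad 0 \le d < 1,\; \theta > -d,
\qquad \mathrm{PY}(0, \theta, G_0) = \mathrm{DP}(\theta, G_0).
```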

DHPYLM
[Figure labels: observations; distribution over words following the given context; domain-specific distribution over words following a context shorter by one word; non-specific distribution over words following the full context; base distribution. Focus on a leaf and imagine the recursion.]

Tree-Shaped Graphical Model
[Figure: latent LM, small LM, and big LM, with example word sets {Parcel, etc.} and {States, Kingdom, etc.}]

Intuition
Domain = The Times of London; context = "the United"
The distribution over subsequent words will look like: [plot of word probabilities omitted]

Problematic Example
Domain = The Times of London; context = "quarterback Joe"
The distribution over subsequent words should look like: [plot of word probabilities omitted]
But where could this come from?

The Model Estimation Problem
Domain = The Times of London; context = "quarterback Joe"
No counts for American football phrases in UK publications! (hypothetical, but nearly true)
Counting alone is not how we do estimation!

A Solution: In-Domain Back-Off
Regularization / smoothing / backing-off
In-domain back-off
–[Kneser & Ney; Chen & Goodman; MacKay & Peto; Teh], etc.
–Use counts from shorter contexts, roughly (recursively): [equation omitted]
Due to zero counts, Times of London training data alone will still yield a skewed distribution
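A sketch of the back-off recursion in its Pitman-Yor smoothing form (following [Teh, 2006]; this stands in for the slide's own equation), where c_{uw} and t_{uw} are customer and table counts for word w in context u, and π(u) is the context with its earliest word dropped:

```latex
P(w \mid u) \;=\; \frac{c_{uw} - d\, t_{uw}}{\theta + c_{u\cdot}}
\;+\; \frac{\theta + d\, t_{u\cdot}}{\theta + c_{u\cdot}}\; P\!\left(w \mid \pi(u)\right)
```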

Our Approach: Generalized Back-Off
Use inter-domain shared counts too, roughly: [equation omitted; TL = Times of London]
–In-domain back-off: shorter context
–Out-of-domain back-off: same context, no domain specificity
[Equation annotations: probability that "Montana" follows "quarterback Joe"; in-domain counts.]
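A hedged sketch of the generalized back-off just described (not the paper's exact parameterization): the domain-j predictive mixes an out-of-domain back-off P_0 with the same context u and an in-domain back-off with the shorter context π(u), with some mixing weight λ:

```latex
P_{j}(w \mid u) \;\approx\; \frac{c^{j}_{uw} - d\, t^{j}_{uw}}{\theta + c^{j}_{u\cdot}}
\;+\; \frac{\theta + d\, t^{j}_{u\cdot}}{\theta + c^{j}_{u\cdot}}
\left[\, \lambda\, P_{0}(w \mid u) \;+\; (1-\lambda)\, P_{j}(w \mid \pi(u)) \,\right]
```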

DHPYLM Generalized Back-Off
[Equations: desired intuitive form; generic Pitman-Yor single-sample posterior predictive distribution; leaf layer of the DHPYLM.]

DHPYLM – Latent Language Model
[Figure: the part of the model that generates the latent language model; it has no associated observations.]

DHPYLM – Domain-Specific Model
[Figure: the part of the model that generates a domain-specific language model.]

A Generalization: The Graphical Pitman-Yor Process
Inference in the graphical Pitman-Yor process is accomplished via a collapsed Gibbs auxiliary-variable sampler.

Graphical Pitman-Yor Process Inference
Auxiliary-variable collapsed Gibbs sampler
–Chinese restaurant franchise representation
–Must keep track of which parent restaurant each table comes from
–Multi-floor Chinese restaurant franchise
Every table has two labels:
–φ_k: the parameter (here a word type)
–s_k: the parent restaurant from which it came
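The sketch below illustrates, schematically, the bookkeeping this requires: every table stores both its word label φ_k and its parent indicator s_k, and a newly opened table forwards a customer to the chosen parent. The class and helper names are hypothetical, and this is not the paper's exact collapsed Gibbs sampler, only the data structure it operates on:

```python
import random

class UniformBase:
    """Stub parent restaurant: a uniform base distribution over a fixed vocabulary."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
    def prob(self, word):
        return 1.0 / self.vocab_size
    def seat(self, word):
        pass  # a real parent restaurant would seat the forwarded customer

class MultiFloorRestaurant:
    """Schematic bookkeeping for one context node; illustrative sketch only."""
    def __init__(self, discount, concentration, parents, parent_weights):
        self.d = discount
        self.theta = concentration
        self.parents = parents                  # e.g. {"latent": ..., "in_domain": ...}
        self.parent_weights = parent_weights
        self.tables = []                        # each: {"word": phi_k, "parent": s_k, "count": n_k}

    def seat(self, word):
        options, weights = [], []
        # Join an existing table serving `word`: weight (count - discount).
        for table in self.tables:
            if table["word"] == word:
                options.append(("join", table))
                weights.append(max(table["count"] - self.d, 0.0))
        # Open a new table: mass (theta + d * #tables), split across candidate parent
        # restaurants by mixture weight times the parent's predictive probability.
        new_table_mass = self.theta + self.d * len(self.tables)
        for name, parent in self.parents.items():
            options.append(("new", name))
            weights.append(new_table_mass * self.parent_weights[name] * parent.prob(word))
        kind, target = random.choices(options, weights=weights, k=1)[0]
        if kind == "join":
            target["count"] += 1
        else:
            self.tables.append({"word": word, "parent": target, "count": 1})
            self.parents[target].seat(word)     # a new table sends a customer upward

# Toy usage: a leaf context with a latent-LM parent and an in-domain shorter-context parent.
leaf = MultiFloorRestaurant(
    discount=0.5, concentration=1.0,
    parents={"latent": UniformBase(13000), "in_domain": UniformBase(13000)},
    parent_weights={"latent": 0.5, "in_domain": 0.5},
)
for w in ["Montana", "Montana", "States"]:
    leaf.seat(w)
print(leaf.tables)
```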

SOU / Brown
Training corpora
–Small: State of the Union (SOU) corpus; ~370,000 words, 13,000 unique
–Big: Brown corpus, 1967; ~1,000,000 words, 50,000 unique
Test corpus
–Johnson's SOU addresses; ~37,000 words

SLM Domain Adaptation Approaches
–Mixture [Kneser & Steinbiss, 1993]
–Union [Bellegarda, 2004]
–MAP [Bacchiani, 2006]
[Figure: how each approach combines the small and big data sets.]

AMI / Brown
Training corpora
–Small: Augmented Multi-Party Interaction (AMI) corpus, 2007; ~800,000 words, 8,000 unique
–Big: Brown corpus, 1967; ~1,000,000 words, 50,000 unique
Test corpus
–AMI excerpt, 2007; ~60,000 words

Cost / Benefit Analysis
Realistic scenario
–Computational cost ($) of using more big-corpus data (Brown corpus)
–Data-preparation cost ($$$) of gathering more small-corpus data (SOU corpus)
–Perplexity goal
Same test data.

Take Home
Showed how to do SLM domain adaptation through hierarchical Bayesian language modelling
Introduced a new type of model
–The graphical Pitman-Yor process

Thank You

Select References
[1] Bellegarda, J. R. (2004). Statistical language model adaptation: review and perspectives. Speech Communication, 42, 93–108.
[2] Carletta, J. (2007). Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus. Language Resources and Evaluation Journal, 41, 181–190.
[3] Daumé III, H., & Marcu, D. (2006). Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 101–126.
[4] Goldwater, S., Griffiths, T. L., & Johnson, M. (2007). Interpolating between types and tokens by estimating power law generators. NIPS 19 (pp. 459–466).
[5] Iyer, R., Ostendorf, M., & Gish, H. (1997). Using out-of-domain data to improve in-domain language models. IEEE Signal Processing Letters, 4, 221–223.
[6] Kneser, R., & Steinbiss, V. (1993). On the dynamic adaptation of stochastic language models. IEEE Conference on Acoustics, Speech, and Signal Processing (pp. 586–589).
[7] Kucera, H., & Francis, W. N. (1967). Computational Analysis of Present-Day American English. Brown University Press.
[8] Rosenfeld, R. (2000). Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE (pp. 1270–1278).
[9] Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. Proceedings of the 44th Annual Meeting of the ACL (pp. 985–992).
[10] Zhu, X., & Rosenfeld, R. (2001). Improving trigram language modeling with the World Wide Web. IEEE Conference on Acoustics, Speech, and Signal Processing (pp. 533–536).

HHPYLM Inference
Every table at level n corresponds to a customer in a restaurant at level n-1
All customers in the entire hierarchy are unseated and re-seated in each sampler sweep
[Figure: latent and domain-specific restaurant hierarchies with example customers "am" and "was" in context "here I".]

HHPYPLM Intuition
Each G is still a weighted set of (repeated) atoms
[Figure: atoms φ_1, φ_2, … with parent indicators s_1, s_2, … for a given context.]

Sampling the Floor Indicators
Switch variables drawn from a discrete distribution (mixture weights)
Integrate out the λ's
Sample these in the Chinese restaurant representation as well

Why PY vs. DP?
PY power-law characteristics match the statistics of natural language well
–The number of unique words in a set of n words follows a power law [Teh, 2006]
[Plot: number of unique words vs. number of words drawn from a Pitman-Yor process.]

What does a PY draw look like?
A draw from a PY process is a discrete distribution with infinite support [Sethuraman, 1994]
e.g. aa (.0001), aah (.00001), aal ( ), aardvark (.001), …

How does one work with such a model?
Truncation in the stick-breaking representation
Or in a representation where G is analytically marginalized out, roughly: [equation omitted]

Posterior PY
This marginalization is possible and results in the Pitman-Yor Pólya urn representation [Pitman, 2002].
By invoking exchangeability one can easily construct Gibbs samplers using such representations
–A generalization of the hierarchical Dirichlet process sampler of Teh [2006].
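A minimal sketch of that Pólya urn (Chinese restaurant) predictive with G marginalized out: an existing table k is chosen with probability proportional to (n_k − d), and a new table, labelled by a fresh draw from the base distribution, with probability proportional to (θ + d·K). The uniform-over-vocabulary base used here is a hypothetical stand-in:

```python
import random

def py_urn_sample(n_draws, d, theta, base_sample):
    """Draw n_draws values from G ~ PY(d, theta, G0) with G marginalized out."""
    values, counts, draws = [], [], []   # unique table labels, customer counts, output
    for _ in range(n_draws):
        r = random.uniform(0, sum(counts) + theta)   # total unnormalized mass
        acc, chosen = 0.0, None
        for k, n_k in enumerate(counts):
            acc += n_k - d
            if r < acc:
                chosen = k
                break
        if chosen is None:                # remaining mass theta + d*K: open a new table
            values.append(base_sample())
            counts.append(1)
            chosen = len(values) - 1
        else:
            counts[chosen] += 1
        draws.append(values[chosen])
    return draws

vocab = ["aa", "aah", "aal", "aardvark"]
print(py_urn_sample(10, d=0.5, theta=1.0, base_sample=lambda: random.choice(vocab)))
```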

Doubly-Hierarchical Pitman-Yor Process Language Model
Finite, fixed vocabulary
Two domains
–One big (general)
–One small (specific)
Backs off to
–the out-of-domain model with the same context
–or the in-domain model with one fewer word of context

Bayesian SLM Domain Adaptation
[Figure: hyper SLM; domain-specific SLM; general SLM.]

Language Modelling
Maximum likelihood
–Smoothed empirical counts
Hierarchical Bayes
–Finite: MacKay & Peto, Hierarchical Dirichlet Language Model, 1994
–Nonparametric: Teh, A hierarchical Bayesian language model based on Pitman-Yor processes, 2006; comparable to the best smoothing n-gram models

Questions and Contributions
How does one do this?
–Hierarchical Bayesian modelling approach
What does such a model look like?
–Novel model architecture
How does one estimate such a model?
–Novel auxiliary-variable Gibbs sampler
Given such a model, does inference in the model actually confer application benefits?
–Positive results

Review: Hierarchical Pitman-Yor Process Language Model [Teh, 2006]
Finite, fixed vocabulary
Infinite-dimensional parameter space (the G's, d's, and α's)
Forms a suffix tree of depth n+1 (n from the n-gram order) with one distribution at each node
[Figure: example distributions over "America", "of", "States", "the", … at successively longer contexts.]

Review: HPYPLM General Notation
[Figure: the same suffix tree of distributions in general notation, with example probabilities for "America", "of", "States", "the", … at each level.]

Review: Pitman-Yor / Dirichlet Process Stick-Breaking Representation
Each G is a weighted set of atoms
Atoms may repeat
–Discrete base distribution
e.g. aa (.0001), aah (.00001), aal ( ), aardvark (.001), …
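A minimal sketch of a truncated stick-breaking draw (the truncation level and the toy discrete base distribution are arbitrary choices for illustration):

```python
import random

def py_stick_breaking(d, theta, base_sample, truncation=1000):
    """Truncated stick-breaking draw from PY(d, theta, G0): returns (weight, atom) pairs."""
    weights, atoms, remaining = [], [], 1.0
    for k in range(1, truncation + 1):
        v = random.betavariate(1.0 - d, theta + k * d)   # V_k ~ Beta(1-d, theta+k*d)
        weights.append(remaining * v)
        atoms.append(base_sample())                      # atoms may repeat: G0 is discrete
        remaining *= 1.0 - v
    return list(zip(weights, atoms))

vocab = ["aa", "aah", "aal", "aardvark"]
G = py_stick_breaking(d=0.5, theta=1.0, base_sample=lambda: random.choice(vocab))
print(sorted(G, reverse=True)[:5])   # the five heaviest (weight, atom) pairs
```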

Now you know…
Statistical natural language modelling
–Perplexity as a performance measure
Domain adaptation
Hierarchical Bayesian natural language models
–HPYLM
–HDP
Inference

Dirichlet Process (DP) Review
Notation: G and H are measures over a space X
Definition
–For any fixed partition (A_1, A_2, …, A_K) of X
[Ferguson 1973; Blackwell & MacQueen 1973; Aldous 1985; Sethuraman 1994]
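In its standard finite-dimensional form, G ~ DP(α, H) means that for any such partition,

```latex
\big(G(A_1), G(A_2), \ldots, G(A_K)\big) \;\sim\; \mathrm{Dirichlet}\big(\alpha H(A_1), \alpha H(A_2), \ldots, \alpha H(A_K)\big).
```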

Inference
[Equations: quantities of interest.]
Critical identities from multinomial / Dirichlet conjugacy (roughly)
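Roughly, the standard identities in question: with a Dirichlet prior and discrete (multinomial) observations with counts n_1, …, n_K,

```latex
\pi \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K), \quad x_i \mid \pi \sim \mathrm{Discrete}(\pi)
\;\Longrightarrow\;
\pi \mid x_{1:n} \sim \mathrm{Dirichlet}(\alpha_1 + n_1, \ldots, \alpha_K + n_K), \quad
P(x_{n+1} = k \mid x_{1:n}) = \frac{\alpha_k + n_k}{\sum_j \alpha_j + n}.
```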

Posterior Updating in One Step
[Equations: with the following identifications … then ….]

Don't need the G's – integrated out
Many θ's are the same
The Chinese restaurant process is a representation (data structure) of the posterior updating scheme that keeps track of only the unique θ's and the number of times each was drawn.

Review: Hierarchical Dirichlet Process (HDP) Inference
Starting with training data (a corpus) we estimate the posterior distribution through sampling
HPYP inference is roughly the same as HDP inference

HDP Inference: Review
Chinese restaurant franchise [Teh et al., 2006]
–The seating arrangements of the hierarchy of Chinese restaurant processes are the state of the sampler
Data are the observed tuples / n-grams, e.g. "here I am" (6), "here I was" (2)
[Figure: restaurant for context "here I" with customers "am" and "was".]

HDP Inference: Review
Induction over base-distribution draws → Chinese restaurant franchise
Every level is a Chinese restaurant process
Auxiliary variables indicate from which table each n-gram came
–Each table's label is a draw from the base distribution
State of the sampler: {am, 1}, {am, 3}, {was, 2}, … (in context "here I")
[Figure: restaurant for context "here I".]

HDP Inference: Review
Every draw from the base distribution means that the resulting atom must (also) be in the base distribution's set of atoms.
[Figure: parent and child restaurants sharing the atoms "am" and "was".]

HDP Inference: Review
Every table at level n corresponds to a customer in a restaurant at level n-1
–If a customer is seated at a new table, this corresponds to a draw from the base distribution
All customers in the entire hierarchy are unseated and re-seated in each sweep
[Figure: hierarchy of restaurants for contexts "I" and "here I".]

Parameter Infestation
If V is the vocabulary size then a full tri-gram model has (potentially) V^3 parameters.

Big Models
Assume a 50,000-word vocabulary
–1.25×10^14 potential parameters
Not all must be realized or stored in a practical implementation
–Maximum number of tri-grams in a billion-word corpus: ~1×10^9
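The arithmetic behind the first figure:

```latex
V^3 = \left(5 \times 10^{4}\right)^{3} = 1.25 \times 10^{14}.
```

The second figure follows because a billion-word corpus contains at most about 10^9 token positions, hence at most about 10^9 distinct observed tri-grams.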

Language Modelling
Zero counts and smoothing
Kneser-Ney
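A hedged sketch of the interpolated Kneser-Ney / absolute-discounting form (the slide's own equation is not reproduced), with discount d, counts c(u, w), and N_{1+}(u, ·) the number of distinct word types seen after context u; in full Kneser-Ney the lower-order distribution is built from continuation counts rather than raw counts:

```latex
P(w \mid u) \;=\; \frac{\max\big(c(u, w) - d,\, 0\big)}{c(u, \cdot)}
\;+\; \frac{d\, N_{1+}(u, \cdot)}{c(u, \cdot)}\; P\big(w \mid \pi(u)\big)
```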

Multi-Floor Chinese Restaurant Process

Domain LM Adaptation Approaches
–MAP [Bacchiani, 2006]
–Mixture [Kneser & Steinbiss, 1993]
–Union [Bellegarda, 2004]

HHPYLM Inference: Intuition

Estimation
Counting: zero counts are a problem
Smoothing (or regularization) is essential
–Kneser-Ney
–Hierarchical Dirichlet Language Model (HDLM)
–Hierarchical Pitman-Yor Process Language Model (HPYLM)
–etc.

Justification
Domain-specific training data can be costly to collect and process
–e.g. manual transcription of customer-service telephone calls
Different domains are different!

Why Domain Adaptation?
"… a language model trained on Dow-Jones newswire text will see its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period."
–Ronald Rosenfeld, Two Decades of Statistical Language Modeling: Where Do We Go From Here?, 2000

SLM Evaluation
Per-word perplexity is a model-evaluation metric common in the NLP community
–Computed using test data and a trained model
–Related to log likelihood and entropy
–Language-modelling interpretation: the average size of the replacement word set
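A minimal sketch of that computation on held-out text, given any model exposing P(word | context); the `prob` argument and the uniform toy model below are hypothetical placeholders:

```python
import math

def perplexity(test_tokens, prob, order=3):
    """Per-word perplexity: exp of the average negative log-probability of each word."""
    log_prob_sum = 0.0
    for i, word in enumerate(test_tokens):
        context = tuple(test_tokens[max(0, i - order + 1):i])
        log_prob_sum += math.log(prob(word, context))   # model's P(word | context)
    return math.exp(-log_prob_sum / len(test_tokens))

# A uniform model over a 13,000-word vocabulary has perplexity exactly 13,000.
uniform = lambda word, context: 1.0 / 13000
print(perplexity("we will not permit those who fire upon us".split(), uniform))
```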

SOU / Brown
State of the Union Corpus
–~370,000 words, 13,000 unique
"Today, the entire world is looking to America for enlightened leadership to peace and progress." – Truman, 1945
"Today, an estimated 4 out of every 10 students in the 5th grade will not even finish high school - and that is a waste we cannot afford." – Kennedy, 1963
"In 1945, there were about two dozen lonely democracies in the world. Today, there are 122. And we're writing a new chapter in the story of self-government -- with women lining up to vote in Afghanistan, and millions of Iraqis marking their liberty with purple ink, and men and women from Lebanon to Egypt debating the rights of individuals and the necessity of freedom." – Bush, 2006
Brown Corpus, 1967
–~1,000,000 words, 50,000 unique
"During the morning hours, it became clear that the arrest of Spencer was having no sobering effect upon the men of the Somers."
"He certainly didn't want a wife who was fickle as Ann."
"It is being fought, moreover, in fairly close correspondence with the predictions of the soothsayers of the think factories."
Test: Johnson's SOU Addresses
–~37,000 words
"But we will not permit those who fire upon us in Vietnam to win a victory over the desires and the intentions of all the American people."
"This Nation is mighty enough, its society is healthy enough, its people are strong enough, to pursue our goals in the rest of the world while still building a Great Society here at home."

AMI / Brown
AMI, 2007
–Approx. 800,000 tokens, 8,000 types
"Yeah yeah for example the l.c.d. you can take it you can put it put it back in or you can use the other one or the speech recognizer with the microphone yeah yeah you want a microphone to pu in the speech recognizer you dont you pay less for the system you see so"

Bayesian Domain Adaptation
[Figure: corpus → corpus-specific models → corpus-independent model.]

HPYPLM

Bayesian Domain Adaptation
Domain adaptation via hierarchical modelling
[Figure: domain-specific observations → domain-specific models → hyper model.]

Doubly Hierarchical Pitman-Yor Process Language Model
[Figure: uniform distribution over words at the root; hyper language model; domain models.]

Hierarchical Pitman-Yor Process Language Model [Teh, 2006; Goldwater, Griffiths & Johnson, 2007]
[Figure: example distributions over "America", "of", "States", "the", … at successively longer contexts.]

Pitman-Yor Process?
[Figure: observations drawn from a Pitman-Yor process with a base distribution, concentration, and discount.]

Hierarchical Bayesian Modelling
[Figure: corpora → corpus-specific models → corpus-independent model.]

What does G look like?
G is a discrete distribution with infinite support [Sethuraman, 1994]
e.g. aa (.0001), aah (.00001), aal ( ), aardvark (.001), …

What does G look like?
Stick-breaking representation (GEM) [Sethuraman, 1994]
The Dirichlet process arises when d = 0
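In its standard Pitman-Yor form (a sketch, with discount d and concentration θ), the construction is:

```latex
V_k \sim \mathrm{Beta}(1 - d,\; \theta + k d), \qquad
\pi_k = V_k \prod_{j < k} (1 - V_j), \qquad
G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\phi_k}, \quad \phi_k \sim G_0,
```

and setting d = 0 recovers the Dirichlet process stick-breaking construction.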