An Overview of Nonparametric Bayesian Models and Applications to Natural Language Processing Narges Sharif-Razavian and Andreas Zollmann

Motivation
Given: data (x_1, y_1), …, (x_n, y_n) and a model family {p(x, y; θ)} indexed by θ
Find the most appropriate θ
MLE: choose arg max_θ p(x_1, y_1; θ) · … · p(x_n, y_n; θ)
Drawback: only events encountered during training are acknowledged -> overfitting
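To make the drawback concrete (a toy sketch I am adding, not from the slides): a maximum-likelihood unigram model assigns zero probability to any word type it never saw in training.

```python
from collections import Counter

# Toy training corpus (hypothetical example).
train = "the cat sat on the mat".split()

counts = Counter(train)
n = sum(counts.values())

def p_mle(word):
    """Maximum-likelihood unigram probability: relative frequency in training."""
    return counts[word] / n

print(p_mle("the"))  # 2/6
print(p_mle("dog"))  # 0.0 -- an unseen event gets zero probability (overfitting)
```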

Bayesian Methods
Treat θ as random:
–assign a prior distribution p(θ) to θ
–use the posterior distribution p(θ|D) = p(D|θ) p(θ) / p(D) ∝ p(D|θ) p(θ)
–compute predictive probabilities for new data points by marginalizing over θ:
 p(y_{n+1} | y_1, …, y_n) = ∫ p(θ | y_1, …, y_n) p(y_{n+1} | θ) dθ
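As a worked example of the marginalization step (my own illustration, using the conjugate Beta-Bernoulli pair so the integral has a closed form):

```python
# Beta(a, b) prior on a coin bias theta; observe h heads and t tails.
# The posterior is Beta(a + h, b + t), and the predictive probability of the
# next flip being heads is the posterior mean -- the integral over theta in closed form.

def predictive_heads(h, t, a=1.0, b=1.0):
    """p(y_{n+1}=heads | data) = ∫ p(theta|data) p(heads|theta) dtheta = (a+h)/(a+b+h+t)."""
    return (a + h) / (a + b + h + t)

print(predictive_heads(h=7, t=3))  # 8/12 with a uniform Beta(1,1) prior
```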

Dirichlet Priors
Let y_1, …, y_n be drawn from Multinomial(β_1, …, β_K) with K outcomes, p(y = k; β) = β_k
Conjugate prior: Dirichlet(α_1, …, α_K) distribution (where all α_i > 0):
 p(β; α_1, …, α_K) ∝ β_1^(α_1 − 1) · … · β_K^(α_K − 1)
Predictive probabilities: p(y = k | y_1, …, y_n) = (α_k + n_k) / (Σ_j α_j + n), where n_k is the number of observations equal to k
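The Dirichlet-multinomial predictive has this simple closed form in code (a minimal sketch; the counts and hyperparameters are illustrative):

```python
from collections import Counter

def dirichlet_predictive(k, observations, alpha):
    """p(y = k | y_1..y_n) = (alpha_k + n_k) / (sum_j alpha_j + n)
    for a Multinomial likelihood under a Dirichlet(alpha_1..alpha_K) prior."""
    counts = Counter(observations)
    n = len(observations)
    return (alpha[k] + counts[k]) / (sum(alpha.values()) + n)

# Three outcomes a, b, c with a symmetric Dirichlet(1, 1, 1) prior.
alpha = {"a": 1.0, "b": 1.0, "c": 1.0}
data = ["a", "a", "b"]
print(dirichlet_predictive("c", data, alpha))  # (1 + 0) / (3 + 3) = 1/6, never zero
```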

Dirichlet Process Priors
Number of parameters K unknown
–a higher value for K still works
–K could be inherently infinite
Nonparametric extension of the Dirichlet distribution: the Dirichlet Process
Example: word segmentation
–infinite word types

Dirichlet Process
Given a base distribution H over an event space Θ, draw G ~ DP(α, H)
Predictive probability of θ_{n+1} conditioned on the previous draws θ_1, …, θ_n:
 p(θ_{n+1} = θ | θ_1, …, θ_n) = (α H(θ) + Σ_i δ_{θ_i}(θ)) / (α + n)
i.e., reuse a previous value with probability proportional to its count, or draw a new value from H with probability proportional to α
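This predictive rule is the Blackwell-MacQueen urn scheme and can be simulated directly; a minimal sketch (the base distribution H is just a uniform random number generator here, chosen only for illustration):

```python
import random

def dp_draws(n, alpha, base_sample):
    """Sample theta_1..theta_n via the Blackwell-MacQueen urn for DP(alpha, H)."""
    draws = []
    for i in range(n):
        # With prob alpha/(alpha + i) draw a fresh value from H; otherwise pick a
        # previous draw uniformly, which reuses each existing value with
        # probability proportional to its count.
        if random.random() < alpha / (alpha + i):
            draws.append(base_sample())
        else:
            draws.append(random.choice(draws))
    return draws

random.seed(0)
samples = dp_draws(1000, alpha=2.0, base_sample=random.random)
print("unique values among 1000 draws:", len(set(samples)))  # roughly alpha * log(n)
```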

DP Model for Unsupervised Word Segmentation
Word types as mixture components
Unigram model with infinite word types: each word token w_i ~ P_w, where P_w ~ DP(α, P_0) and P_0 is a base distribution over possible word forms (generating a word unit by unit)
Posterior predictive for the next token: P(w_{n+1} = w | w_1, …, w_n) = (n_w + α P_0(w)) / (n + α)
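A sketch of this predictive in code; the base distribution P_0 used here (uniform characters, geometric word length) is my own placeholder for the slides' phoneme-based definition:

```python
from collections import Counter

def p0(word, p_stop=0.5, n_chars=26):
    """Illustrative base distribution over word forms (an assumption, standing in
    for the slides' P_0): characters uniform, word length geometric."""
    L = len(word)
    return (1.0 / n_chars) ** L * p_stop * (1 - p_stop) ** (L - 1)

def dp_word_predictive(word, corpus_tokens, alpha=20.0):
    """P(w_{n+1} = w | w_1..w_n) = (n_w + alpha * P0(w)) / (n + alpha)."""
    counts = Counter(corpus_tokens)
    n = len(corpus_tokens)
    return (counts[word] + alpha * p0(word)) / (n + alpha)

tokens = "the dog saw the cat".split()
print(dp_word_predictive("the", tokens))   # boosted by its two occurrences
print(dp_word_predictive("dogs", tokens))  # unseen type, still nonzero via P0
```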

HDP Model for Unsupervised Word Segmentation
Bigram model: the distribution over the next word depends on the previous word
The HDP prior: w_i | w_{i-1} = w' ~ H_{w'}, with H_{w'} ~ DP(α_1, P_w) for every word w', and a shared base P_w ~ DP(α_0, P_0)
The posterior again has the Chinese-restaurant form: bigram counts for (w', w) back off to the shared unigram-level process, which backs off to P_0
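A heavily simplified sketch of the bigram predictive this prior induces, shown only to illustrate the back-off structure: raw bigram and unigram counts stand in for the table counts that exact HDP inference would track, and the base distribution is a placeholder.

```python
from collections import Counter

def hdp_bigram_predictive(prev, word, tokens, p0, alpha1=10.0, alpha0=20.0):
    """Approximate P(word | prev): bigram counts interpolated with the unigram DP
    predictive, which in turn interpolates unigram counts with the base P0.
    Simplification: raw counts replace the table counts of exact HDP inference."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    n_prev = sum(c for (w1, _), c in bi.items() if w1 == prev)
    p_uni = (uni[word] + alpha0 * p0(word)) / (n + alpha0)
    return (bi[(prev, word)] + alpha1 * p_uni) / (n_prev + alpha1)

tokens = "the dog saw the cat".split()
uniform_p0 = lambda w: 1.0 / 1000.0  # placeholder base distribution over word forms
print(hdp_bigram_predictive("the", "cat", tokens, uniform_p0))
print(hdp_bigram_predictive("the", "dogs", tokens, uniform_p0))  # unseen bigram, still > 0
```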

DP and HDP Model Results
CHILDES corpus
–9790 sentences
Baselines:
–NGS: maximum likelihood
–MBDP: Bayesian unigram model

Pitman-Yor Processes
Add a discount parameter d to DP(α, H) -> more control over the rate at which the number of components grows with the sample size
Probability that the next draw is a new value from H: (α + d·t) / (α + n), where t is the number of draws from H (distinct values) so far
Number of unique values: O(α n^d) for d > 0, O(α log n) for d = 0
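The urn sketch from the Dirichlet process slide extends directly to Pitman-Yor; a minimal illustration of how the discount d changes the growth in the number of unique values (α, d, and the base distribution are arbitrary choices):

```python
import random
from collections import Counter

def pitman_yor_draws(n, alpha, d, base_sample):
    """Generalized urn scheme for PY(d, alpha, H): an existing value with count n_k
    is reused with prob (n_k - d)/(alpha + i); a new draw from H has prob
    (alpha + d*t)/(alpha + i), where t is the number of distinct values so far."""
    draws, counts = [], Counter()
    for i in range(n):
        t = len(counts)
        if random.random() < (alpha + d * t) / (alpha + i):
            theta = base_sample()  # new component from H
        else:
            # reuse an existing value with probability proportional to (count - d)
            values = list(counts)
            weights = [counts[v] - d for v in values]
            theta = random.choices(values, weights=weights)[0]
        counts[theta] += 1
        draws.append(theta)
    return draws

random.seed(0)
for d in (0.0, 0.5, 0.9):
    s = pitman_yor_draws(5000, alpha=1.0, d=d, base_sample=random.random)
    print(d, len(set(s)))  # the number of unique values grows faster for larger d
```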

Language Modeling based on Hierarchical Pitman-Yor Processes
Language modeling: assign a probability distribution over possible utterances in a certain language and domain
Tricky: choice of the n-gram order n -> smoothing (back-off or interpolation)
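For concreteness (my own toy example, not from the slides), the simplest interpolation smoother mixes the maximum-likelihood bigram estimate with the unigram estimate so unseen events keep nonzero probability:

```python
from collections import Counter

def interpolated_bigram(prev, word, tokens, lam=0.7, vocab_size=10000):
    """Interpolation smoothing (illustrative): fixed-weight mixture of ML bigram and
    ML unigram estimates, with a uniform floor for wholly unseen words."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n_prev = sum(c for (w1, _), c in bi.items() if w1 == prev)
    p_bi = bi[(prev, word)] / n_prev if n_prev else 0.0
    p_uni = uni[word] / len(tokens) if uni[word] else 1.0 / vocab_size
    return lam * p_bi + (1 - lam) * p_uni

tokens = "the dog saw the cat".split()
print(interpolated_bigram("the", "cat", tokens))   # seen bigram
print(interpolated_bigram("the", "dogs", tokens))  # unseen word, still nonzero
```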

Teh, Yee Whye (2006): A hierarchical Bayesian language model based on Pitman-Yor processes. COLING/ACL.
Nonparametric Bayesian approach: n-gram models drawn from distributions whose priors are based on (n-1)-gram models

Model Definition
u = u_1 … u_m: LM context (the m previous words)
G_u(w): probability of word w given context u
Prior: G_u ~ PY(d_m, θ_m, G_{π(u)}), where π(u) drops the earliest word of u
Discount and strength parameters are shared across contexts of the same length m, with d_m ~ Uniform(0, 1) and θ_m ~ Gamma(1, 1)
Base case: the empty context, G_∅ ~ PY(d_0, θ_0, G_0) with G_0 uniform over the vocabulary
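A simplified recursive sketch of the predictive distribution this hierarchy induces: discounted counts observed after context u are interpolated with the estimate for the shortened context, bottoming out at a uniform distribution over the vocabulary. As with the HDP sketch above, raw n-gram counts stand in for the CRP table counts used in exact inference, and the discount/strength values are fixed rather than sampled.

```python
from collections import Counter

def hpy_predictive(word, context, counts, discounts, strengths, vocab_size):
    """Approximate G_u(word) for the hierarchical Pitman-Yor LM: discounted counts
    after context u, interpolated with the shortened context u[1:], bottoming out
    at Uniform(vocabulary). Simplification: raw counts replace table counts."""
    m = len(context)
    d, theta = discounts[m], strengths[m]
    ctx_counts = counts.get(context, Counter())
    n_u = sum(ctx_counts.values())  # tokens observed after context u
    t_u = len(ctx_counts)           # distinct word types observed after u
    if m == 0:
        p_backoff = 1.0 / vocab_size
    else:
        p_backoff = hpy_predictive(word, context[1:], counts,
                                   discounts, strengths, vocab_size)
    c_w = ctx_counts[word]
    return (max(c_w - d, 0.0) + (theta + d * t_u) * p_backoff) / (theta + n_u)

# Toy usage with made-up counts and hyperparameters.
tokens = "the dog saw the cat".split()
counts = {}
for n in (1, 2):
    for i in range(len(tokens) - n + 1):
        ctx, w = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        counts.setdefault(ctx, Counter())[w] += 1
discounts = {0: 0.5, 1: 0.75}
strengths = {0: 1.0, 1: 1.0}
print(hpy_predictive("cat", ("the",), counts, discounts, strengths, vocab_size=10000))
```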

Relation to Smoothing Methods
The paper shows that interpolated Kneser-Ney can be interpreted as an approximation of the HPY language model
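For comparison (an illustrative formula of my own, not taken from the paper), interpolated Kneser-Ney has the same "subtract a discount, reassign the mass to a lower-order continuation distribution" shape as the Pitman-Yor predictive:

```python
from collections import Counter

def ikn_bigram(prev, word, tokens, d=0.75):
    """Interpolated Kneser-Ney bigram (illustrative): discounted bigram count plus a
    back-off weight times a continuation-count unigram."""
    bi = Counter(zip(tokens, tokens[1:]))
    n_prev = sum(c for (w1, _), c in bi.items() if w1 == prev)
    types_after_prev = len({w2 for (w1, w2) in bi if w1 == prev})
    # continuation probability: fraction of distinct bigram types ending in `word`
    p_cont = len({w1 for (w1, w2) in bi if w2 == word}) / max(len(bi), 1)
    lam = d * types_after_prev / max(n_prev, 1)
    return max(bi[(prev, word)] - d, 0.0) / max(n_prev, 1) + lam * p_cont

tokens = "the dog saw the cat".split()
print(ikn_bigram("the", "cat", tokens))
```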

Experiments
Text corpus: 16M words (APNews)
HPYCV: strength and discount parameters estimated by cross-validation

Conclusions
Gave an overview of several common nonparametric Bayesian models and reviewed some recent NLP applications of these models
Use of Bayesian priors addresses the trade-off between
–a powerful model that captures detail in the data
–having enough evidence in the data to support the model's inferred parameters
But: computationally expensive and non-exact inference algorithms
None of the applications we reviewed improved significantly over a smoothed non-Bayesian version of the same model
Nonparametric Bayesian methods in ML / NLP are still in their infancy
–need more insight into inference algorithms
–new DP variants and generalizations are being found every year