
1 An Overview of Nonparametric Bayesian Models and Applications to Natural Language Processing Narges Sharif-Razavian and Andreas Zollmann

2 Motivation Given: data points x_1,…,x_n with labels y_1,…,y_n and a model family {p(x,y;θ)}_θ. Find the most appropriate θ. MLE: choose arg max_θ p(x_1,y_1;θ) · … · p(x_n,y_n;θ). Drawback: only events encountered during training are acknowledged -> overfitting
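
As a minimal illustration of the drawback (the toy data and function name below are hypothetical), the sketch fits a multinomial word model by maximum likelihood and shows that any word type unseen in training receives probability zero:

```python
from collections import Counter

def mle_multinomial(tokens):
    """Maximum-likelihood estimate of a multinomial over word types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

train = ["the", "cat", "sat", "on", "the", "mat"]
p = mle_multinomial(train)
print(p.get("the", 0.0))  # 2/6 ≈ 0.333
print(p.get("dog", 0.0))  # 0.0 -- an unseen event gets zero probability (overfitting)
```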

3 Bayesian Methods Treat θ as random -> assign a prior distribution p(θ) to θ -> use the posterior distribution p(θ|D) = p(D|θ)p(θ)/p(D) ∝ p(D|θ)p(θ) -> compute predictive prob's for new data points by marginalizing over θ: p(y_{n+1}|y_1,…,y_n) = ∫ p(y_{n+1}|θ) p(θ|y_1,…,y_n) dθ
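
A minimal numerical sketch of this marginalization (the Beta-Bernoulli setup is a hypothetical example, not from the slides): the predictive probability of the next observation is the likelihood integrated against the posterior, which for a conjugate Beta(a,b) prior matches the closed form (heads + a)/(n + a + b).

```python
import numpy as np

a, b = 1.0, 1.0                      # Beta(a, b) prior pseudo-counts (hypothetical)
data = [1, 0, 1, 1, 0, 1]            # observed coin flips
heads, n = sum(data), len(data)

theta = np.linspace(1e-6, 1 - 1e-6, 10_000)
dtheta = theta[1] - theta[0]
posterior = theta**(heads + a - 1) * (1 - theta)**(n - heads + b - 1)
posterior /= (posterior * dtheta).sum()          # normalize p(theta | D) on the grid

# Predictive prob. of y_{n+1} = 1: integrate p(y=1 | theta) * p(theta | D) over theta
pred = (theta * posterior * dtheta).sum()
closed_form = (heads + a) / (n + a + b)
print(pred, closed_form)             # both ≈ 0.625
```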

4 Dirichlet Priors Let y_1,…,y_n be drawn from Multinomial(β_1,…,β_K) with K outcomes, p(y = k; β) = β_k. Conjugate prior: Dirichlet(α_1,...,α_K) distribution (where all α_i > 0): p(β; α_1,...,α_K) ∝ β_1^(α_1 - 1) … β_K^(α_K - 1). Predictive prob's: p(y = k | y_1,…,y_n) = (n_k + α_k) / (n + Σ_j α_j), where n_k is the number of previous observations equal to k
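
A minimal sketch of this Dirichlet-multinomial predictive probability (the vocabulary and counts are hypothetical):

```python
from collections import Counter

def dirichlet_predictive(draws, alphas):
    """p(y = k | y_1..y_n) = (n_k + alpha_k) / (n + sum_j alpha_j)."""
    counts = Counter(draws)
    n, alpha_total = len(draws), sum(alphas.values())
    return {k: (counts[k] + a) / (n + alpha_total) for k, a in alphas.items()}

alphas = {"a": 1.0, "b": 1.0, "c": 1.0}     # symmetric Dirichlet(1, 1, 1) prior
print(dirichlet_predictive(["a", "a", "b"], alphas))
# {'a': 0.5, 'b': 0.333..., 'c': 0.166...} -- unseen outcome 'c' keeps nonzero mass
```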

5 Dirichlet Process Priors Number of parameters K unknown –Higher value for K still works –K could be inherently infinite Nonparametric extension of Dirichlet distribution: Dirichlet Process Example: Word segmentation –infinite word types

6 Dirichlet Process Given a base distribution H over event space Θ and concentration parameter α, draw G ~ DP(α, H). Predictive probability of θ_{n+1} conditioned on the previous draws θ_1,…,θ_n (with G marginalized out): θ_{n+1} | θ_1,…,θ_n ~ (α H + Σ_{i=1}^{n} δ_{θ_i}) / (α + n)
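
A minimal sketch of this predictive rule as a sampler (the base distribution, a uniform draw over a toy alphabet, is a hypothetical stand-in for H):

```python
import random

def dp_draws(n, alpha, base_sample):
    """Blackwell-MacQueen urn for DP(alpha, H) with G marginalized out:
    the (i+1)-th draw is new from H with prob. alpha/(alpha + i), otherwise
    it copies a previous draw chosen with prob. proportional to its count."""
    draws = []
    for i in range(n):
        if random.random() < alpha / (alpha + i):
            draws.append(base_sample())          # new draw from the base H
        else:
            draws.append(random.choice(draws))   # reuse a previous draw
    return draws

random.seed(0)
sample = dp_draws(20, alpha=1.0, base_sample=lambda: random.choice("abcdefghij"))
print(sample)   # few distinct values, with a rich-get-richer clustering effect
```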

7 DP Model for Unsupervised Word Segmentation Word types as mixture components. Unigram model with infinite word types: P_w ~ DP(α, P_0), where P_0 is a base distribution over possible word forms (e.g., built from character probabilities). Posterior predictive (with P_w integrated out): p(w_i = w | w_{-i}) = (n_w + α P_0(w)) / (n + α), where n_w counts occurrences of w among the other words
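
A minimal sketch of this predictive probability (the character-level base distribution P_0 below is a simplified, hypothetical choice):

```python
from collections import Counter

def p0(word, p_char=1.0 / 27, p_stop=0.5):
    """Hypothetical base distribution over word forms: i.i.d. characters from a
    27-symbol alphabet with a geometric word-length distribution."""
    return ((1 - p_stop) * p_char) ** (len(word) - 1) * p_char * p_stop

def unigram_predictive(word, other_words, alpha=20.0):
    """p(w_i = w | w_{-i}) = (n_w + alpha * P0(w)) / (n + alpha)."""
    counts = Counter(other_words)
    return (counts[word] + alpha * p0(word)) / (len(other_words) + alpha)

corpus = ["the", "dog", "the", "cat"]
print(unigram_predictive("the", corpus))   # boosted by its count in the corpus
print(unigram_predictive("zzz", corpus))   # small but nonzero, via P_0
```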

8 HDP Model for Unsupervised Word Segmentation Bigram Model: each word type w gets its own distribution H_w over the word that follows it. The HDP prior ties these together: H_w ~ DP(α_1, G) for every w, with a shared base distribution G ~ DP(α_0, P_0). The posterior again follows the DP predictive rule, backing off from bigram counts to unigram counts to P_0

9 DP and HDP Model Results CHILDES corpus –9,790 sentences –Baselines: NGS (maximum likelihood) and MBDP (Bayesian unigram)

10 Pitman-Yor Processes Adds a discount parameter d to DP(α, H) -> more control over the growth rate of the number of components as a function of sample size. Prob. of a value coming from H decreases according to (α + d·t)/(α + n), where t is the number of draws from H so far. Number of unique values = O(α n^d) for d > 0, O(α log n) for d = 0
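
A minimal sketch of the Pitman-Yor predictive rule as a sampler (the base draw is a hypothetical stand-in for H); counting the unique values it produces illustrates the power-law growth in the number of components when d > 0:

```python
import random
from collections import Counter

def pitman_yor_draws(n, alpha, d, base_sample):
    """Pitman-Yor urn: a new value from H with prob. (alpha + d*t)/(alpha + i),
    otherwise reuse value k with prob. proportional to (count_k - d);
    t is the number of draws from H so far."""
    counts, draws, t = Counter(), [], 0
    for i in range(n):
        if random.random() < (alpha + d * t) / (alpha + i):
            value, t = base_sample(), t + 1        # new draw from the base H
        else:
            r, acc = random.uniform(0, i - d * t), 0.0
            for value, c in counts.items():        # pick an existing value
                acc += c - d
                if r <= acc:
                    break
        counts[value] += 1
        draws.append(value)
    return draws

random.seed(0)
sample = pitman_yor_draws(1000, alpha=1.0, d=0.5, base_sample=random.random)
print(len(set(sample)))   # grows roughly like alpha * n^d when d > 0
```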

11 Language Modeling based on Hierarchical Pitman-Yor Processes Language modeling: assign a probability distribution over possible utterances in a certain language and domain. Tricky: choice of n -> smoothing (back-off or interpolation)

12 Teh, Yee Whye. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. (COLING/ACL). Nonparametric Bayesian approach: n-gram models drawn from distributions whose priors are based on (n-1)-gram models

13 Model Definition u = u_1 … u_m: LM context; G_u(w): prob. of word w given context u. G_u ~ PY(d_m, θ_m, G_π(u)), where π(u) = u_2 … u_m drops the earliest word of the context. Hyperparameter priors: d_m ~ Uniform(0,1), θ_m ~ Gamma(1,1). Base case: G_∅ ~ PY(d_0, θ_0, G_0) with G_0 uniform over the vocabulary
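
A minimal sketch of the resulting predictive probability (the customer counts c and table counts t below are hypothetical placeholders for the seating arrangement a Chinese-restaurant-franchise sampler would maintain):

```python
def hpy_prob(word, context, c, t, d, theta, vocab_size):
    """Predictive prob. of `word` after `context` in a hierarchical PY LM:
    P_u(w) = (c_uw - d_m*t_uw + (theta_m + d_m*t_u) * P_pi(u)(w)) / (theta_m + c_u),
    recursing on the shortened context pi(u); below the empty context, uniform."""
    if context is None:                        # base distribution G_0
        return 1.0 / vocab_size
    m = len(context)
    cu = sum(c.get(context, {}).values())      # customers in restaurant u
    tu = sum(t.get(context, {}).values())      # tables in restaurant u
    shorter = context[1:] if m > 0 else None   # pi(u): drop the earliest word
    backoff = hpy_prob(word, shorter, c, t, d, theta, vocab_size)
    if cu == 0:
        return backoff                         # unseen context: pure back-off
    cuw = c.get(context, {}).get(word, 0)
    tuw = t.get(context, {}).get(word, 0)
    return (cuw - d[m] * tuw + (theta[m] + d[m] * tu) * backoff) / (theta[m] + cu)

# Hypothetical seating counts for the empty context () and the context ("the",)
c = {(): {"the": 3, "cat": 1}, ("the",): {"cat": 2}}
t = {(): {"the": 1, "cat": 1}, ("the",): {"cat": 1}}
d = {0: 0.5, 1: 0.75}          # per-level discounts d_m
theta = {0: 1.0, 1: 1.0}       # per-level strengths theta_m
print(hpy_prob("cat", ("the",), c, t, d, theta, vocab_size=1000))
```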

14 Relation to smoothing methods The paper shows that interpolated Kneser-Ney can be interpreted as an approximation of the HPY language model
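
For comparison, a minimal sketch of interpolated Kneser-Ney for a single context with absolute discount d (the counts are hypothetical); restricting the HPY seating so that each observed word type occupies exactly one table yields essentially this form:

```python
def kneser_ney_prob(word, context_counts, backoff_prob, d=0.75):
    """Interpolated Kneser-Ney for one context u:
    P(w|u) = max(c(u,w) - d, 0) / c(u) + (d * T(u) / c(u)) * P_backoff(w),
    where T(u) is the number of distinct word types seen after u."""
    total = sum(context_counts.values())
    types = len(context_counts)
    discounted = max(context_counts.get(word, 0) - d, 0.0) / total
    return discounted + (d * types / total) * backoff_prob

# Hypothetical bigram counts after "the", with a precomputed lower-order probability
counts_after_the = {"cat": 2, "dog": 1}
print(kneser_ney_prob("cat", counts_after_the, backoff_prob=0.1))    # ≈ 0.467
print(kneser_ney_prob("fish", counts_after_the, backoff_prob=0.02))  # ≈ 0.01
```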

15 Experiments Text corpus: 16M words (APNews) HPYCV: strength and discount parameters estimated by cross-validation

16 Conclusions
Gave an overview of several common nonparametric Bayesian models and reviewed some recent NLP applications of these models.
The use of Bayesian priors manages the trade-off between
–a model powerful enough to capture detail in the data
–having enough evidence in the data to support the model's inferred parameters
Drawbacks: computationally expensive, non-exact inference algorithms; none of the applications we reviewed improved significantly over a smoothed non-Bayesian version of the same model.
Nonparametric Bayesian methods in ML / NLP are still in their infancy: we need more insight into inference algorithms, and new DP variants and generalizations are being found every year.

