1
Language Modelling. María Fernández Pajares. Verarbeitung gesprochener Sprache (Processing of Spoken Language).
2
Index: 1. Introduction 2. Regular grammars 3. Stochastic languages 4. N-gram models 5. Perplexity
3
Introduction: Language models. What is a language model? It is a method for defining language structure, used to restrict the search to the most probable sequences of linguistic units. Language models tend to be useful for applications with complex syntax and/or semantics. A good LM should accept correct sentences with high probability and reject (or assign a low probability to) wrong word sequences. Classic models: N-grams and stochastic grammars.
4
Introduction: general scheme of a system. Signal → measurement of parameters → comparison against models (acoustic and grammar models) → decision rule → text.
5
Introduction: measuring the difficulty of a task. The difficulty is determined by the real flexibility of the admitted language. Perplexity: the average number of options. There are finer measures that also take into account the difficulty of the words or of the acoustic models. Speech recognizers seek the word sequence W which is most likely to have produced the acoustic evidence A. Speech recognition involves acoustic processing, acoustic modelling, language modelling, and search.
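The slide states this criterion in words; written out (a standard formulation, not reproduced from the slide), the decision rule is:

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid A)
        \;=\; \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
        \;=\; \arg\max_{W} \underbrace{P(A \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}}
```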
6
Language models (LMs) assign a probability estimate P(W) to word sequences W = w1,...,wn, subject to ∑_W P(W) = 1. Language models help guide and constrain the search among alternative word hypotheses during recognition. For huge vocabularies, the acoustic models and the language model are integrated into a hidden Markov macro-model spanning the whole language.
7
Introduction: problems. Dimensions of difficulty: connectivity, speakers, vocabulary and language complexity (plus noise and robustness).
8
Introduction: models based on grammars. They represent language restrictions in a natural way. They allow dependencies to be modelled over as long a span as required. Defining these models is very difficult for tasks involving languages close to natural language (pseudo-natural). Integration with the acoustic model is not very natural.
9
Introduction: kinds of grammars. Consider a grammar G = (N, Σ, P, S). Chomsky hierarchy: Type 0, no restrictions on the rules (too complex to be useful); Type 1, context-sensitive rules (too complex); Type 2, context-free (used in experimental systems); Type 3, regular or finite-state.
10
Grammars and automata. Every kind of grammar corresponds to a kind of automaton that recognizes it: Type 0 (unrestricted): Turing machine; Type 1 (context-sensitive): linear bounded automaton; Type 2 (context-free): pushdown automaton; Type 3 (regular): finite-state automaton.
11
Regular grammars. A regular grammar is any right-linear or left-linear grammar (an example follows below). Regular grammars generate regular languages: the languages generated by regular grammars are exactly the regular languages.
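The examples on the original slide are not preserved in this transcript; as a purely illustrative (hypothetical) right-linear grammar over the alphabet {a, b}:

```latex
% A right-linear grammar G = (N, \Sigma, P, S) with N = \{S\}, \Sigma = \{a, b\}:
S \rightarrow a\,S \qquad S \rightarrow b
% Every right-hand side is a terminal optionally followed by one nonterminal,
% so G is regular; it generates the regular language L(G) = a^{*}b.
```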
12
Search space
13
An example:
14
Grammars and stochastic languages. A probability is added to each of the production rules. A stochastic grammar is a pair (G, p), where G is a grammar and p is a function p: P → [0,1] with the property ∑_{(A→α) ∈ P_A} p(A→α) = 1, where P_A is the set of grammar rules whose antecedent (left-hand side) is A. A stochastic language over an alphabet Σ is a pair (L, p), with L ⊆ Σ* and p: L → [0,1], satisfying p(x) > 0 for every x ∈ L and ∑_{x ∈ L} p(x) = 1.
15
example
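The example on this slide is not preserved; the following is a hypothetical illustration consistent with the definition above (the rule probabilities, in brackets, are invented):

```latex
% Rules with left-hand side S; their probabilities (in brackets) sum to 1:
S \rightarrow a\,S \;[0.7] \qquad S \rightarrow b \;[0.3]
% The derivation of a^{n}b uses the first rule n times and the second once, so
% p(a^{n}b) = 0.7^{n}\cdot 0.3, and \sum_{n\ge 0} 0.7^{n}\cdot 0.3 = \frac{0.3}{1-0.7} = 1,
% hence (L(G), p) is a stochastic language.
```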
16
N-gram models. P(W) can be broken down with the chain rule and approximated by limiting the history to the previous N-1 words: P(W) = ∏_{i=1..n} P(w_i | w_1 … w_{i-1}) ≈ ∏_{i=1..n} P(w_i | w_{i-N+1} … w_{i-1}). When N = 2 the model uses bigrams; when N = 3, trigrams.
17
Let us suppose that the acoustic decoding assigns similar probabilities to the phrases "the pig dog" and "the big dog". If P(pig | the) = P(big | the), then the choice between them depends on the word dog: P(the pig dog) = P(the) · P(pig | the) · P(dog | the pig), and P(the big dog) = P(the) · P(big | the) · P(dog | the big). Since P(dog | the big) > P(dog | the pig), the model helps to decode the sentence correctly. Problem: a very large number of training samples is needed. For a vocabulary of V words there are V possible unigrams, V² bigrams, and V³ trigrams; for example, a 10,000-word vocabulary gives 10¹² possible trigrams.
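A minimal Python sketch of the chain-rule comparison on this slide; all probability values are hypothetical stand-ins:

```python
# Each key is (word, history...); the numbers are invented for illustration.
p = {
    ("the",): 0.06,
    ("pig", "the"): 0.001, ("big", "the"): 0.001,               # P(pig|the) = P(big|the)
    ("dog", "the", "pig"): 0.002, ("dog", "the", "big"): 0.02,  # P(dog|the big) > P(dog|the pig)
}

def sentence_prob(words):
    """P(w1..wn) = prod_i P(wi | w1..wi-1), each factor looked up in p."""
    prob = 1.0
    for i, w in enumerate(words):
        prob *= p[(w,) + tuple(words[:i])]
    return prob

print(sentence_prob(["the", "pig", "dog"]))  # 1.2e-07
print(sentence_prob(["the", "big", "dog"]))  # 1.2e-06 -> "the big dog" wins
```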
18
Advantages:
– Probabilities are based on data
– Parameters determined automatically from corpora
– Incorporate local syntax, semantics, and pragmatics
– Many languages have a strong tendency toward standard word order and are thus substantially local
– Relatively easy to integrate into forward search methods such as Viterbi (bigram) or A*
Disadvantages:
– Unable to incorporate long-distance constraints
– Not well suited for flexible word-order languages
– Cannot easily accommodate new vocabulary items, alternative domains, or dynamic changes (e.g., discourse)
– Not as good as humans at identifying and correcting recognizer errors, or at predicting following words (or letters)
– Do not capture meaning for speech understanding
19
Estimation of the probabilities. Let us suppose that the N-gram model has been modelled with a finite automaton (unigram; bigram w1 w2; trigram w1 w2 w3). Suppose we have a training sample on which an N-gram model, represented as a finite automaton, has been estimated. A state of the automaton is q, and c(q) is the total number of events (N-grams) observed in the sample while the model is in state q.
20
c(w|q) is the number of times that the word w has been observed in the sample while the model was in state q. P(w|q) is the probability of observing the word w conditioned on the state q, estimated as P(w|q) = c(w|q) / c(q). Also used are the set of words observed in the sample when the model is in state q, and the total vocabulary of the language to be modelled. For example, in a bigram model the state is the previous word, so P(w2 | w1) = c(w1 w2) / c(w1). This maximum-likelihood approach assigns probability 0 to events that were never observed, which causes coverage problems; the solution is to smooth the model, e.g. with flat, linear, or non-linear interpolation, back-off, or syntactic back-off.
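A minimal Python sketch of the relative-frequency estimate above and of the simplest ("flat", add-one) smoothing; the toy corpus and vocabulary are invented:

```python
# Maximum-likelihood bigram estimate P(w|q) = c(w|q) / c(q), where the state q
# is the previous word, plus add-one smoothing so unseen bigrams get non-zero mass.
from collections import Counter

corpus = [["the", "big", "dog"], ["the", "big", "cat"], ["the", "pig"]]
vocab = {"the", "big", "pig", "dog", "cat"}

c_q = Counter()    # c(q):   how often state q (previous word) was visited
c_wq = Counter()   # c(w|q): how often word w followed state q
for sentence in corpus:
    for prev, w in zip(sentence, sentence[1:]):
        c_q[prev] += 1
        c_wq[(prev, w)] += 1

def p_ml(w, q):
    """Maximum-likelihood estimate: zero for any bigram not seen in the sample."""
    return c_wq[(q, w)] / c_q[q] if c_q[q] else 0.0

def p_add_one(w, q):
    """Add-one ("flat") smoothed estimate: every bigram keeps some probability."""
    return (c_wq[(q, w)] + 1) / (c_q[q] + len(vocab))

print(p_ml("dog", "big"), p_ml("pig", "big"))            # 0.5   0.0
print(p_add_one("dog", "big"), p_add_one("pig", "big"))  # ~0.286  ~0.143
```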
21
Bigrams are easily incorporated into Viterbi search. Trigrams were used for large-vocabulary recognition in the mid-1970s and remain the dominant language model. IBM trigram example:
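A minimal sketch, with invented scores, of how bigram log-probabilities combine with acoustic scores inside a Viterbi-style dynamic-programming search over word hypotheses (a simplified word-level view, not a full HMM decoder):

```python
import math

def viterbi_decode(lattice, bigram_logp):
    """lattice: one dict {word: acoustic log-score} per position.
    bigram_logp(prev, w): log P(w | prev); '<s>' marks the sentence start.
    Returns (total log score, best word sequence)."""
    # best[w] = (log score of the best partial path ending in w, that path)
    best = {w: (bigram_logp("<s>", w) + a, [w]) for w, a in lattice[0].items()}
    for frame in lattice[1:]:
        new_best = {}
        for w, acoustic in frame.items():
            # pick the predecessor that maximises path score + LM transition
            prev = max(best, key=lambda p: best[p][0] + bigram_logp(p, w))
            score, path = best[prev]
            new_best[w] = (score + bigram_logp(prev, w) + acoustic, path + [w])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])

# Hypothetical numbers: the acoustics cannot distinguish "pig" from "big",
# but the bigram LM prefers "dog" after "big".
lm = {("<s>", "the"): 0.0,
      ("the", "pig"): math.log(0.001), ("the", "big"): math.log(0.001),
      ("pig", "dog"): math.log(0.002), ("big", "dog"): math.log(0.02)}
bigram_logp = lambda prev, w: lm.get((prev, w), math.log(1e-9))
lattice = [{"the": -1.0}, {"pig": -2.0, "big": -2.0}, {"dog": -1.5}]

score, words = viterbi_decode(lattice, bigram_logp)
print(words)  # ['the', 'big', 'dog'] -- the LM breaks the acoustic tie
```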
22
Methods for estimating the probability of unseen N-grams: N-gram performance can be improved by clustering words.
– Hard clustering puts a word into a single cluster
– Soft clustering allows a word to belong to multiple clusters
Clusters can be created manually or automatically.
– Manually created clusters have worked well for small domains
– Automatic clusters have been created bottom-up or top-down
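A minimal sketch of the hard-clustering (class-based) bigram idea in the spirit of Brown et al. 1992; the classes and probability values are invented:

```python
# Class-based bigram: P(w2 | w1) is approximated as
# P(class(w2) | class(w1)) * P(w2 | class(w2)), so unseen word pairs
# still inherit probability mass from their classes.
word_class = {"monday": "DAY", "tuesday": "DAY", "paris": "CITY", "london": "CITY"}
p_class_bigram = {("CITY", "DAY"): 0.10, ("DAY", "CITY"): 0.02}
p_word_given_class = {"monday": 0.2, "tuesday": 0.2, "paris": 0.3, "london": 0.3}

def class_bigram(w1, w2):
    c1, c2 = word_class[w1], word_class[w2]
    return p_class_bigram.get((c1, c2), 1e-6) * p_word_given_class[w2]

print(class_bigram("paris", "monday"))    # 0.02: shared by all CITY -> DAY pairs
print(class_bigram("london", "tuesday"))  # 0.02: even if this pair was never seen
```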
23
Perplexity: the average number of options. Quantifying LM complexity: one LM is better than another if it can predict an n-word test corpus W with a higher probability. For LMs representable by the chain rule, comparisons are usually based on the average per-word logprob, LP = -(1/n) log₂ P(W). A more intuitive representation of LP is the perplexity, PP = 2^LP (a uniform LM will have PP equal to the vocabulary size). PP is often interpreted as an average branching factor.
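A minimal Python sketch of the per-word logprob LP and perplexity PP defined above; the test-word probabilities are invented:

```python
import math

def perplexity(word_probs):
    """word_probs: P(w_i | history) for each of the n test words."""
    n = len(word_probs)
    lp = -sum(math.log2(p) for p in word_probs) / n   # average per-word logprob
    return 2 ** lp                                    # PP = 2^LP

print(perplexity([0.1, 0.25, 0.05, 0.2]))  # ~7.95: an "average branching factor"
print(perplexity([1 / 1000] * 50))         # 1000.0: uniform LM over 1000 words
```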
24
Perplexity Examples
25
Bibliography:
P. Brown et al., "Class-Based n-gram Models of Natural Language", Computational Linguistics, 1992.
R. Lau, Adaptive Statistical Language Modelling, S.M. Thesis, MIT, 1994.
M. McCandless, Automatic Acquisition of Language Models for Speech Recognition, S.M. Thesis, MIT, 1994.
L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.
Google